Changeset 7342:ae413abe4045 in orange


Timestamp: 02/03/11 21:56:44
Author: tomazc <tomaz.curk@…>
Branch: default
Convert: 82c5e11ecf79155dc3551d4e9e08967921914838
Message:

Location: orange/Orange/feature
Files: 2 edited

  • orange/Orange/feature/scoring.py

r7321 r7342

0.189  0.382  crime
0.166  0.345  adoption-of-the-budget-resolution

============
Base Classes
============

There are a number of different measures for assessing the relevance of
attributes with respect to how much information they contain about the
corresponding class. These procedures are also known as attribute scoring.
Orange implements several methods that all stem from
:obj:`Orange.feature.scoring.Measure`. Most of the common ones compute
certain statistics on the conditional distributions of class values given
the attribute values; in Orange, these are derived from
:obj:`Orange.feature.scoring.MeasureAttributeFromProbabilities`.

.. class:: Measure

    This is the base class for a wide range of classes that measure the
    quality of attributes. The class itself is, naturally, abstract. Its
    fields merely describe what kinds of attributes it can handle and what
    kind of data it requires.

    .. attribute:: handlesDiscrete

        Tells whether the measure can handle discrete attributes.

    .. attribute:: handlesContinuous

        Tells whether the measure can handle continuous attributes.

    .. attribute:: computesThresholds

        Tells whether the measure implements the :obj:`thresholdFunction`.

    .. attribute:: needs

        Tells what kind of data the measure needs. This can be
        :obj:`NeedsGenerator`, :obj:`NeedsDomainContingency` or
        :obj:`NeedsContingency_Class`. The first needs an example
        generator (Relief is an example of such a measure), the second
        can compute the quality from
        :obj:`Orange.probability.distributions.DomainContingency`, and
        the last needs only the contingency
        (:obj:`Orange.probability.distributions.ContingencyAttrClass`),
        the attribute distribution and the apriori class distribution.
        Most measures need only the last.

    Several (but not all) measures can treat unknown attribute values in
    different ways, depending on the field :obj:`unknownsTreatment` (this
    field is not defined in :obj:`Measure` but in many derived classes).
    Undefined values can be:

    * ignored (:obj:`MeasureAttribute.IgnoreUnknowns`); this has the same
      effect as if the examples for which the attribute value is unknown
      were removed,

    * punished (:obj:`MeasureAttribute.ReduceByUnknown`); the attribute
      quality is reduced by the proportion of unknown values. In impurity
      measures this can be interpreted as if the impurity is decreased
      only on the examples for which the value is defined and stays the
      same for the others, the attribute quality being the average
      impurity decrease,

    * imputed (:obj:`MeasureAttribute.UnknownsToCommon`); undefined
      values are replaced by the most common attribute value. If you want
      a smarter imputation, do it in advance,

    * treated as a separate value (:obj:`MeasureAttribute.UnknownsAsValue`).

    The default treatment is :obj:`ReduceByUnknown`, which is optimal in
    most cases and makes no additional assumptions (unlike, for instance,
    :obj:`UnknownsToCommon`, which assumes that values are not missing
    for a reason, e.g. because the measurement was skipped in light of
    the other attributes' values). Use other treatments if you know that
    they make better sense for your data.
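
    For instance, switching the treatment before scoring might look like
    this (a minimal sketch; it assumes the lenses dataset used elsewhere
    on this page and the constant names listed above)::

        import orange
        data = orange.ExampleTable("lenses")

        meas = orange.MeasureAttribute_info()
        # the default is ReduceByUnknown; ignore unknown values instead
        meas.unknownsTreatment = orange.MeasureAttribute.IgnoreUnknowns
        print "Ignoring unknowns: %6.4f" % meas("astigmatic", data)

    On data without unknown values, such as lenses, all four treatments
    give the same score.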

    The only method supported by all measures is the call operator, to
    which we pass the data and get back a number representing the quality
    of the attribute. The number has no absolute meaning and can vary
    widely between attribute measures. The only common characteristic is
    that the higher the value, the better the attribute. If the attribute
    is so bad that its quality cannot be measured, the measure returns
    :obj:`MeasureAttribute.Rejected`. None of the measures described here
    do so.

    There are different sets of arguments that the call operator can
    accept. Not all classes accept all kinds of arguments: Relief, for
    instance, cannot be computed from contingencies alone. Besides, the
    attribute and the class need to be of the correct type for a
    particular measure.

    .. method:: __call__(attribute, examples[, apriori class distribution][, weightID])
                __call__(attribute, domain contingency[, apriori class distribution])
                __call__(contingency, class distribution[, apriori class distribution])

        There are three call operators just to make your life simpler and
        faster. When working with the data, your method may have already
        computed, for instance, the contingency matrix; if so, and if the
        quality measure can use it (as most measures can), pass the
        contingency matrix and the measure will compute much faster. If,
        on the other hand, you only have examples and have not computed
        any statistics on them, pass the examples (and, optionally, an id
        of a meta-attribute with weights) and the measure will compute
        the contingency itself, if needed.

        The argument ``attribute`` gives the attribute whose quality is
        to be assessed. It can be a descriptor, an index into the domain
        or a name. If the attribute is given by descriptor, it does not
        need to be in the domain, but it must be computable from the
        attributes that are.

        The data is given either as ``examples`` (and, optionally, an id
        of a meta-attribute with weights), a ``domain contingency`` (a
        list of contingencies) or a ``contingency`` matrix together with
        the ``class distribution``. In the latter form, what to pass as
        the class distribution depends on the treatment of unknown values
        (if there are any): if ``unknownsTreatment`` is
        ``IgnoreUnknowns``, the class distribution should be computed on
        the examples for which the attribute value is defined; otherwise,
        it should be the overall class distribution.

        The optional ``apriori class distribution`` argument is most
        often ignored. It comes in handy if the measure makes probability
        estimates based on apriori class probabilities (such as the
        m-estimate).

    .. method:: thresholdFunction(attribute, examples[, weightID])

        Computes the qualities of different binarizations of the
        continuous ``attribute`` (which, of course, must be continuous).
        The result is a list of tuples in which the first element is a
        threshold (all splits lie in the middle between two adjacent
        attribute values), the second is the measured quality of the
        corresponding binary attribute, and the third is the distribution
        giving the numbers of examples below and above the threshold. The
        last element may be missing: if the measure can obtain the
        distribution without computational burden, it returns it for the
        caller to use; if not, the caller must compute it itself.

The script below shows different ways to assess the quality of astigmatic,
tear rate and the first attribute (whichever it is) in the dataset lenses
(part of MeasureAttribute1.py, uses lenses.tab)::

    import orange, random
    data = orange.ExampleTable("lenses")

    meas = orange.MeasureAttribute_info()

    astigm = data.domain["astigmatic"]
    print "Information gain of 'astigmatic': %6.4f" % meas(astigm, data)

    classdistr = orange.Distribution(data.domain.classVar, data)
    cont = orange.ContingencyAttrClass("tear_rate", data)
    print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr)

    dcont = orange.DomainContingency(data)
    print "Information gain of the first attribute: %6.4f" % meas(0, dcont)
    print

As with many other classes in Orange, you can construct the object and use
it on the fly. For instance, to measure the quality of the attribute
"tear_rate", you could simply write::

    >>> print orange.MeasureAttribute_info("tear_rate", data)
    0.548794984818

You shouldn't use this shortcut with ReliefF, though; see the explanation
in the section on ReliefF.

It is also possible to assess the quality of attributes that do not exist
in the dataset. For instance, you can assess the quality of discretized
attributes without constructing a new domain and dataset that would
include them (part of MeasureAttribute1a.py, uses iris.tab)::

    import orange, random
    data = orange.ExampleTable("iris")

    d1 = orange.EntropyDiscretization("petal length", data)
    print orange.MeasureAttribute_info(d1, data)

The quality of the new attribute ``d1`` is assessed on ``data``, which
does not include the new attribute at all. (Note that ReliefF won't do
that, since it would be too slow: ReliefF requires the attribute to be
present in the dataset.)

Finally, you can compute the quality of meta-attributes. The following
script adds a meta-attribute to an example table, initializes it to random
values and measures its information gain (part of MeasureAttribute1.py,
uses lenses.tab)::

    mid = orange.newmetaid()
    data.domain.addmeta(mid, orange.EnumVariable(values = ["v0", "v1"]))
    data.addMetaAttribute(mid)

    rg = random.Random()
    rg.seed(0)
    for ex in data:
        ex[mid] = orange.Value(rg.randint(0, 1))

    print "Information gain for a random meta attribute: %6.4f" % \
      orange.MeasureAttribute_info(mid, data)

To show the computation of thresholds, we shall use the Iris dataset
(part of MeasureAttribute1a.py, uses iris.tab)::

    import orange
    data = orange.ExampleTable("iris")

    meas = orange.MeasureAttribute_relief()
    for t in meas.thresholdFunction("petal length", data):
        print "%5.3f: %5.3f" % t

If we hadn't constructed the attribute in advance, we could write
``orange.MeasureAttribute_relief().thresholdFunction("petal length", data)``.
This is not recommended for ReliefF, since it may be a lot slower.

The script below finds and prints out the best threshold for binarization
of an attribute, that is, the threshold at which the resulting binary
attribute has the best ReliefF (or any other measure)::

    thresh, score, distr = meas.bestThreshold("petal length", data)
    print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)

.. class:: MeasureAttributeFromProbabilities

    An abstract base class for attribute quality measures that can be
    computed from contingency matrices only. It relieves the derived
    classes of having to compute the contingency matrix by defining the
    first two forms of the call operator. (This is not something you need
    to know about if you only work in Python.) An additional feature of
    this class is that you can set the probability estimators; if none
    are given, class probabilities and conditional class probabilities
    are estimated by relative frequencies.

    .. attribute:: unknownsTreatment

        Defines what to do with unknown values. See the possibilities
        described above.

    .. attribute:: estimatorConstructor
    .. attribute:: conditionalEstimatorConstructor

        The classes used to estimate unconditional and conditional
        probabilities of classes, respectively. You can set them to, for
        instance, :obj:`ProbabilityEstimatorConstructor_m` and
        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` (with
        the estimator constructor again set to
        :obj:`ProbabilityEstimatorConstructor_m`), respectively.
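
    For illustration, plugging in m-estimation might look like this (a
    minimal sketch; it assumes these constructor classes are exposed in
    the ``orange`` module under the names given above and accept their
    parameters as constructor keywords)::

        import orange
        data = orange.ExampleTable("lenses")

        meas = orange.MeasureAttribute_info()
        # estimate class probabilities with an m-estimate ...
        meas.estimatorConstructor = \
            orange.ProbabilityEstimatorConstructor_m(m = 2.0)
        # ... and conditional class probabilities by rows, likewise
        meas.conditionalEstimatorConstructor = \
            orange.ConditionalProbabilityEstimatorConstructor_ByRows(
                estimatorConstructor =
                    orange.ProbabilityEstimatorConstructor_m(m = 2.0))
        print "With m-estimates: %6.4f" % meas("astigmatic", data)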

====================================
Measures for Classification Problems
====================================

This section describes the attribute quality measures suitable for
discrete attributes and outcomes. See MeasureAttribute1.py,
MeasureAttribute1a.py, MeasureAttribute1b.py, MeasureAttribute2.py and
MeasureAttribute3.py for more examples of their use.

Information Gain
----------------

.. index:: attribute scoring; information gain

The most popular measure, information gain
(:obj:`MeasureAttribute_info`), measures the expected decrease of
entropy.

Gain Ratio
----------

.. index:: attribute scoring; gain ratio

Gain ratio (:obj:`MeasureAttribute_gainRatio`) was introduced by Quinlan
to avoid overestimating multi-valued attributes. It is computed as the
information gain divided by the entropy of the attribute's value. (It
has been shown, however, that this measure still overestimates
attributes with multiple values.)

Gini Index
----------

.. index:: attribute scoring; gini index

The Gini index (:obj:`MeasureAttribute_gini`) was first introduced by
Breiman and can be interpreted as the probability that two randomly
chosen examples will have different classes.

Relevance
---------

.. index:: attribute scoring; relevance

Relevance of attributes (:obj:`MeasureAttribute_relevance`) is a measure
that discriminates between attributes on the basis of their potential
value in the formation of decision rules.
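
Since all of these measures share the same interface, they are easy to
compare side by side. A small sketch, reusing the lenses data from the
examples above::

    import orange
    data = orange.ExampleTable("lenses")

    measures = [("info gain", orange.MeasureAttribute_info()),
                ("gain ratio", orange.MeasureAttribute_gainRatio()),
                ("gini", orange.MeasureAttribute_gini()),
                ("relevance", orange.MeasureAttribute_relevance())]
    # score every attribute with every measure
    for attr in data.domain.attributes:
        scores = ["%s: %.3f" % (name, meas(attr, data))
                  for name, meas in measures]
        print "%15s  %s" % (attr.name, "  ".join(scores))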

Costs
-----

:obj:`MeasureAttribute_cost` evaluates attributes based on the "saving"
achieved by knowing the value of the attribute, according to the
specified cost matrix. It has a single attribute:

cost
    The cost matrix (see the page on cost matrices for details).

If the cost of predicting the first class for an example that actually
belongs to the second is 5, and the cost of the opposite error is 1, an
appropriate measure can be constructed and used on attribute 3 as
follows::

    >>> meas = orange.MeasureAttribute_cost()
    >>> meas.cost = ((0, 5), (1, 0))
    >>> meas(3, data)
    0.083333350718021393

This tells us that knowing the value of attribute 3 would decrease the
classification cost by approximately 0.083 per example.

ReliefF
-------

.. index:: attribute scoring; ReliefF

ReliefF (:obj:`MeasureAttribute_relief`) was first developed by Kira and
Rendell and then substantially generalized and improved by Kononenko. It
measures the usefulness of attributes based on their ability to
distinguish between very similar examples belonging to different classes.
Its attributes are:

k
    The number of neighbours for each example. Default is 5.

m
    The number of reference examples. Default is 100. Set it to -1 to
    take all the examples.

checkCachedData
    A flag best left alone unless you know what you are doing (see the
    discussion of caching below).
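
Both parameters can be set as ordinary attributes or, as elsewhere on
this page, as constructor keywords; a short sketch::

    # more neighbours, fewer reference examples than the defaults
    meas = orange.MeasureAttribute_relief(k = 20, m = 50)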

Computation of ReliefF is rather slow, since it needs to find the ``k``
nearest neighbours of each of the ``m`` reference examples (or of all
examples, if ``m`` is set to -1). Since we normally compute ReliefF for
all attributes in the dataset, ``MeasureAttribute_relief`` caches the
results: when called to compute the quality of a certain attribute, it
computes the qualities of all attributes in the dataset. When called
again, it uses the stored results provided that the domain is still the
same and the example table has not changed. This is checked by comparing
the data table version (see the page on ``ExampleTable`` for details) and
then by computing a checksum of the data and comparing it with the
previous one. Computing the checksum can take some time on large tables,
so you may want to disable it by setting ``checkCachedData`` to
``False``. In most cases this does no harm, except when the data is
changed in a way that passes unnoticed by the 'version' control, in
which case the computed ReliefFs can be false. Hence: disable it only if
you know that the data does not change, or if you know what kinds of
changes the version control does detect.

Caching will only have an effect if you use the same instance for all
attributes in the domain. So, don't do this::

    for attr in data.domain.attributes:
        print orange.MeasureAttribute_relief(attr, data)

In this script, the cached data dies together with the instance of
``MeasureAttribute_relief``, which is constructed and destructed for each
attribute separately. It is much faster to do it like this::

    meas = orange.MeasureAttribute_relief()
    for attr in data.domain.attributes:
        print meas(attr, data)

When called for the first time, ``meas`` computes ReliefF for all
attributes; subsequent calls simply return the stored results.

The class ``MeasureAttribute_relief`` works on both discrete and
continuous classes and thus implements the functionality of both ReliefF
and RReliefF.

Note that ReliefF can also compute the threshold function, that is, the
attribute quality at different thresholds for binarization.

Finally, here is an example that shows what can happen if you disable the
computation of checksums::

    data = orange.ExampleTable("iris")
    r1 = orange.MeasureAttribute_relief()
    r2 = orange.MeasureAttribute_relief(checkCachedData = False)

    print "%.3f\t%.3f" % (r1(0, data), r2(0, data))
    for ex in data:
        ex[0] = 0
    print "%.3f\t%.3f" % (r1(0, data), r2(0, data))

The first print outputs the same number, 0.321, twice. Then we annul the
first attribute. ``r1`` notices this and returns -1 as its ReliefF, while
``r2`` does not and returns the same number, 0.321, which is now wrong.

================================
Measures for Regression Problems
================================

Except for ReliefF, the only attribute quality measure available for
regression problems is based on the mean square error.

Mean Square Error
-----------------

.. index:: attribute scoring; mean square error
.. index:: MSE of an attribute

The mean square error measure is implemented in the class
:obj:`MeasureAttribute_MSE`. Its attributes are:

unknownsTreatment
    Tells what to do with unknown attribute values. See the description
    at the top of this page.

m
    Parameter for the m-estimate of the error. Default is 0 (no
    m-estimate).
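
A minimal sketch of its use; the housing dataset and the discretization
are illustrative choices here, and ``EquiDistDiscretization`` is assumed
to be available in the ``orange`` module (any data with a continuous
class and a discrete or discretized attribute will do)::

    import orange
    data = orange.ExampleTable("housing")

    meas = orange.MeasureAttribute_MSE()
    meas.m = 2.0  # use an m-estimate of the error
    # the attributes of housing are continuous, so discretize one first
    disc = orange.EquiDistDiscretization(numberOfIntervals = 4)
    d1 = disc("CRIM", data)
    print "MSE score of discretized 'CRIM': %6.4f" % meas(d1, data)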

==========
  • orange/Orange/feature/selection.py

r7321 r7342

  import Orange.core as orange

+ from Orange.feature.scoring import attMeasure
+
  # from orngFSS
  def bestNAtts(scores, N):

      :param threshold: score threshold for attribute selection. Defaults to 0.
      :type threshold: float
-     :rtype: :obj:`list`
-     Result: the first N features (without measures)
+     :rtype: :obj:`list` first N features (without measures)

      """

          :param data: a data table
          :type data: Orange.data.table
+
          """
          ma = attMeasure(data, self.measure)