Changeset 7342:ae413abe4045 in orange

02/03/11 21:56:44 (3 years ago)
tomazc <tomaz.curk@…>
2 edited


  • orange/Orange/feature/

    r7321 r7342  
    6363    0.189  0.382  crime 
    6464    0.166  0.345  adoption-of-the-budget-resolution 
     67Base Classes 
     70There are a number of different measures for assessing the relevance of  
     71attributes with respect to much information they contain about the  
     72corresponding class. These procedures are also known as attribute scoring.  
     73Orange implements several methods that all stem from 
     74:obj:`Orange.feature.scoring.Measure`. The most of common ones compute 
     75certain statistics on conditional distributions of class values given 
     76the attribute values; in Orange, these are derived from 
     79.. class:: Measure 
     81    This is the base class for a wide range of classes that measure quality of 
     82    attributes. The class itself is, naturally, abstract. Its fields merely 
     83    describe what kinds of attributes it can handle and what kind of data it  
     84    requires. 
     86    .. attribute:: handlesDiscrete 
     87    Tells whether the measure can handle discrete attributes. 
     89    .. attribute:: handlesContinuous 
     90    Tells whether the measure can handle continuous attributes. 
     92    .. attribute:: computesThresholds 
     93    Tells whether the measure implements the :obj:`thresholdFunction`. 
     95    .. attribute:: needs 
     96    Tells what kind of data the measure needs. This can be either  
     97    :obj:`NeedsGenerator`, :obj:`NeedsDomainContingency`,  
     98    :obj:`NeedsContingency_Class`. The first need an instance generator 
     99    (Relief is an example of such measure), the second can compute the quality 
     100    from :obj:`Orange.probability.distributions.DomainContingency` and the 
     101    latter only needs the contingency 
     102    (:obj:`Orange.probability.distributions.ContingencyAttrClass`) the  
     103    attribute distribution and the apriori class distribution. Most measures 
     104    only need the latter. 
     106    Several (but not all) measures can treat unknown attribute values in 
     107    different ways, depending on field :obj:`unknownsTreatment` (this field is 
     108    not defined in :obj:`Measure` but in many derived classes). Undefined values can be 
     109<UL style="margin-top=0cm"> 
     110<LI><B>ignored (<CODE>MeasureAttribute.IgnoreUnknowns</CODE>)</B>; this has the same effect as if the example for which the attribute value is unknown are removed.</LI> 
     112<LI><B>punished (<CODE>MeasureAttribute.ReduceByUnknown</CODE>)</B>; the attribute quality is reduced by the proportion of unknown values. In impurity measures, this can be interpreted as if the impurity is decreased only on examples for which the value is defined and stays the same for the others, and the attribute quality is the average impurity decrease.</B></LI> 
     114<LI><B>imputed (<CODE>MeasureAttribute.UnknownsToCommon</CODE>)</B>; here, undefined values are replaced by the most common attribute value. If you want a more clever imputation, you should do it in advance.</LI> 
     116<LI><B>treated as a separate value (<CODE>MeasureAttribute.UnknownsAsValue</CODE>)</B> 
     119<P>The default treatment is <CODE>ReduceByUnknown</CODE>, which is optimal in most cases and does not make additional presumptions (as, for instance, <CODE>UnknownsToCommon</CODE> which supposes that missing values are not for instance, results of measurements that were not performed due to information extracted from the other attributes). Use other treatments if you know that they make better sense on your data.</P> 
     121<P>The only method supported by all measures is the call operator to which we pass the data and get the number representing the quality of the attribute. The number does not have any absolute meaning and can vary widely for different attribute measures. The only common characteristic is that higher the value, better the attribute. If the attribute is so bad that it's quality cannot be measured, the measure returns <CODE>MeasureAttribute.Rejected</CODE>. None of the measures described here do so.</P> 
     123<P>There are different sets of arguments that the call operator can accept. Not all classes will accept all kinds of arguments. Relief, for instance, cannot be computed from contingencies alone. Besides, the attribute and the class need to be of the correct type for a particular measure.</P> 
     125<P class=section>Methods</P> 
     126<DL class=attributes> 
     127 <DT>__call__(attribute, examples[, apriori class distribution][, weightID])<br> 
     128 __call__(attribute, domain contingency[, apriori class distribution])<br> 
     129 __call__(contingency, class distribution[, apriori class distribution])</DT> 
     131<DD>There are three call operators just to make your life simpler and faster. When working with the data, your method might have already computed, for instance, contingency matrix. If so and if the quality measure you use is OK with that (as most measures are), you can pass the contingency matrix and the measure will compute much faster. If, on the other hand, you only have examples and haven't computed any statistics on them, you can pass examples (and, optionally, an id for meta-attribute with weights) and the measure will compute the contingency itself, if needed.</P> 
     133<P>Argument <CODE>attribute</CODE> gives the attribute whose quality is to be assessed. This can be either a descriptor, an index into domain or a name. In the first form, if the attribute is given by descriptor, it doesn't need to be in the domain. It needs to be computable from the attribute in the domain, though.</DD> 
     135<P>The data is given either as <CODE>examples</CODE> (and, optionally, id for meta-attribute with weight), <CODE>domain contingency</CODE> (a list of contingencies) or <CODE>contingency</CODE> matrix and <CODE>class distribution</CODE>. If you use the latter form, what you should give as the class distribution depends upon what you do with unknown values (if there are any). If <CODE>unknownsTreatment</CODE> is <CODE>IgnoreUnknowns</CODE>, the class distribution should be computed on examples for which the attribute value is defined. Otherwise, class distribution should be the overall class distribution.</P> 
     137<P>The optional argument with <CODE>apriori class distribution</CODE> is most often ignored. It comes handy if the measure makes any probability estimates based on apriori class probabilities (such as m-estimate).</P> 
     140<dt>thresholdFunction(attribute, examples[, weightID])</dt> 
     141<dd>This function computes the qualities for different binarizations of the continuous attribute <code>attribute</code>. The attribute should of course be continuous. The result of a function is a list of tuples, where the first element represents a threshold (all splits in the middle between two existing attribute values), the second is the measured quality for a corresponding binary attribute and the last one is the distribution which gives the number of examples below and above the threshold. The last element, though, may be missing; generally, if the particular measure can get the distribution without any computational burden, it will do so and the caller can use it. If not, the caller needs to compute it itself.</dd> 
     144<P>The script below shows different ways to assess the quality of astigmatic, tear rate and the first attribute (whichever it is) in the dataset lenses.</P> 
     146<p class="header"><a href="">part of</a> 
     147(uses <a href=""></a>)</p> 
     148<XMP class=code>import orange, random 
     149data = orange.ExampleTable("lenses") 
     151meas = orange.MeasureAttribute_info() 
     153astigm = data.domain["astigmatic"] 
     154print "Information gain of 'astigmatic': %6.4f" % meas(astigm, data) 
     156classdistr = orange.Distribution(data.domain.classVar, data) 
     157cont = orange.ContingencyAttrClass("tear_rate", data) 
     158print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr) 
     160dcont = orange.DomainContingency(data) 
     161print "Information gain of the first attribute: %6.4f" % meas(0, dcont) 
     165<P>As for many other classes in Orange, you can construct the object and use it on-the-fly. For instance, to measure the quality of attribute "tear_rate", you could write simply</P> 
     167<XMP class=code>>>> print orange.MeasureAttribute_info("tear_rate", data) 
     171<P>You shouldn't use this shortcut with ReliefF, though; see the explanation in the section on ReliefF.</P> 
     173<P>It is also possible to assess the quality of attributes that do not exist in the dataset. For instance, you can assess the quality of discretized attributes without constructing a new domain and dataset that would include them.</P> 
     175<p class="header"><a href=""></a> 
     176(uses <a href=""></a>)</p> 
     177<XMP class=code>import orange, random 
     178data = orange.ExampleTable("iris") 
     180d1 = orange.EntropyDiscretization("petal length", data) 
     181print orange.MeasureAttribute_info(d1, data) 
     184<P>The quality of the new attribute <CODE>d1</CODE> is assessed on <CODE>data</CODE>, which does not include the new attribute at all. (Note that ReliefF won't do that since it would be too slow. ReliefF requires the attribute to be present in the dataset.)</p> 
     186<P>Finally, you can compute the quality of meta-attributes. The following script adds a meta-attribute to an example table, initializes it to random values and measures its information gain.</P> 
     188<p class="header"><a href="">part of</a> 
     189(uses <a href=""></a>)</p> 
     190<XMP class=code>mid = orange.newmetaid() 
     191data.domain.addmeta(mid, orange.EnumVariable(values = ["v0", "v1"])) 
     194rg = random.Random() 
     196for ex in data: 
     197    ex[mid] = orange.Value(rg.randint(0, 1)) 
     199print "Information gain for a random meta attribute: %6.4f" % \ 
     200  orange.MeasureAttribute_info(mid, data) 
     204<P>To show the computation of thresholds, we shall use the Iris data set.</P> 
     206<p class="header"><a href=""></a> 
     207(uses <a href=""></a>)</p> 
     208<xmp class="code"> 
     209import orange 
     210data = orange.ExampleTable("iris") 
     212meas = orange.MeasureAttribute_relief() 
     213for t in meas.thresholdFunction("petal length", data): 
     214    print "%5.3f: %5.3f" % t 
     217<P>If we hadn't constructed the attribute in advance, we could write <code>orange.MeasureAttribute_relief().thresholdFunction("petal length", data)</code>. This is not recommendable for ReliefF, since it may be a lot slower.</P> 
     219<P>The script below finds and prints out the best threshold for binarization of an attribute, that is, the threshold with which the resulting binary attribute will have the optimal ReliefF (or any other measure).</P> 
     220<xmp class="code">thresh, score, distr = meas.bestThreshold("petal length", data) 
     221print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)</xmp> 
     225<P><CODE><INDEX name="classes/MeasureAttributeFromProbabilities">MeasureAttributeFromProbabilities</CODE> is the abstract base class for attribute quality measures that can be computed from contingency matrices only. It relieves the derived classes from having to compute the contingency matrix by defining the first two forms of call operator. (Well, that's not something you need to know if you only work in Python.) Additional feature of this class is that you can set probability estimators. If none are given, probabilities and conditional probabilities of classes are estimated by relative frequencies.</P> 
     227<P class=section>Attributes</P> 
     228<DL class=attributes> 
     230<DD>Defines what to do with unknown values. See the possibilities described above.</DD> 
     232<DT>estimatorConstructor, conditionalEstimatorConstructor</DT> 
     233<DD>The classes that are used to estimate unconditional and conditional probabilities of classes, respectively. You can set this to, for instance, <CODE>ProbabilityEstimatorConstructor_m</CODE> and <CODE>ConditionalProbabilityEstimatorConstructor_ByRows</CODE> (with estimator constructor again set to <CODE>ProbabilityEstimatorConstructor_m</CODE>), respectively.</DD> 
     239<H2>Measures for Classification Problems</H2> 
     241<P>The following section describes the attribute quality measures suitable for discrete attributes and outcomes. See <A href=""></A>, <A href=""></A>, <A href=""></A>, <A href=""></A> and <A href=""></A> for more examples on their use.</P> 
     243<H3>Information Gain</H3> 
     244<index name="attribute scoring+information gain"> 
     246<P>The most popular measure, information gain (<CODE><INDEX name="classes/MeasureAttribute_info">MeasureAttribute_info</CODE>), measures the expected decrease of the entropy.</P> 
     248<H3>Gain Ratio</H3> 
     249<index name="attribute scoring+gain ratio"> 
     251<P>Gain ratio (<CODE><INDEX name="classes/MeasureAttribute_gainRatio">MeasureAttribute_gainRatio</CODE>) was introduced by Quinlan in order to avoid overestimation of multi-valued attributes. It is computed as information gain divided by the entropy of the attribute's value. (It has been shown, however, that such measure still overstimates the attributes with multiple values.) 
     253<H3>Gini index</H3> 
     254<index name="attribute scoring+gini index"> 
     256<P>Gini index (<CODE>MeasureAttribute_gini</CODE>) was first introduced by Breiman and can be interpreted as the probability that two randomly chosen examples will have different classes.</P> 
     259<index name="attribute scoring+relevance"> 
     261<P>Relevance of attributes (<CODE><INDEX 
     263is a measure that discriminate between attributes on the basis of 
     264their potential value in the formation of decision rules.</P> 
     268<P><CODE><INDEX name="classes/MeasureAttribute_cost">MeasureAttribute_cost</CODE> evaluates attributes based on the "saving" achieved by knowing the value of attribute, according to the specified cost matrix.</P> 
     270<P class=section>Attributes</P> 
     271<DL class=attributes> 
     273<DD>Cost matrix (see the page about <A href="CostMatrix.htm">cost matrices</A> for details)</DD> 
     276<P>If cost of predicting the first class for an example that is actually in the second is 5, and the cost of the opposite error is 1, than an appropriate measure can be constructed and used for attribute 3 as follows.</P> 
     278<XMP class=code>>>> meas = orange.MeasureAttribute_cost() 
     279>>> meas.cost = ((0, 5), (1, 0)) 
     280>>> meas(3, data) 
     284<P>This tells that knowing the value of attribute 3 would decrease the classification cost for appx 0.083 per example.</P> 
     288<index name="attribute scoring+ReliefF"> 
     290<P>ReliefF (<CODE><INDEX name="classes/MeasureAttribute_relief">MeasureAttribute_relief</CODE>) was first developed by Kira and Rendell and then substantially generalized and improved by Kononenko. It measures the usefulness of attributes based on their ability to distinguish between very similar examples belonging to different classes.</P> 
     292<P class=section>Attributes</P> 
     293<DL class=attributes> 
     295<DD>Number of neighbours for each example. Default is 5.</DD> 
     298<DD>Number of reference examples. Default is 100. Set to -1 to take all the examples.</DD> 
     301<DD>A flag best left alone unless you know what you do.</DD> 
     304<P>Computation of ReliefF is rather slow since it needs to find <CODE>k</CODE> nearest neighbours for each of <CODE>m</CODE> reference examples (or all examples, if <code>m</code> is set to -1). Since we normally compute ReliefF for all attributes in the dataset, <CODE>MeasureAttribute_relief</CODE> caches the results. When it is called to compute a quality of certain attribute, it computes qualities for all attributes in the dataset. When called again, it uses the stored results if the data has not changeddomain is still the same and the example table has not changed. Checking is done by comparing the data table version <A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A> for details) and then computing a checksum of the data and comparing it with the previous checksum. The latter can take some time on large tables, so you may want to disable it by setting <code>checkCachedData</code> to <code>False</code>. In most cases it will do no harm, except when the data is changed in such a way that it passed unnoticed by the 'version' control, in which cases the computed ReliefFs can be false. Hence: disable it if you know that the data does not change or if you know what kind of changes are detected by the version control.</P> 
     306<P>Caching will only have an effect if you use the same instance for all attributes in the domain. So, don't do this:</P> 
     308<XMP class=code>for attr in data.domain.attributes: 
     309    print orange.MeasureAttribute_relief(attr, data) 
     312<P>In this script, cached data dies together with the instance of <CODE>MeasureAttribute_relief</CODE>, which is constructed and destructed for each attribute separately. It's way faster to go like this.</P> 
     314<XMP class=code>meas = orange.MeasureAttribute_relief() 
     315for attr in data.domain.attributes: 
     316    print meas(attr, data) 
     319<P>When called for the first time, <CODE>meas</CODE> will compute ReliefF for all attributes and the subsequent calls simply return the stored data.</P> 
     321<P>Class <CODE>MeasureAttribute_relief</CODE> works on discrete and continuous classes and thus implements functionality of algorithms ReliefF and RReliefF.</P> 
     323<P>Note that ReliefF can also compute the threshold function, that is, the attribute quality at different thresholds for binarization.</P> 
     325<p>Finally, here is an example which shows what can happen if you disable the computation of checksums.</p> 
     327<xmp>data = orange.ExampleTable("iris") 
     328r1 = orange.MeasureAttribute_relief() 
     329r2 = orange.MeasureAttribute_relief(checkCachedData = False) 
     331print "%.3f\t%.3f" % (r1(0, data), r2(0, data)) 
     332for ex in data: 
     333    ex[0] = 0 
     334print "%.3f\t%.3f" % (r1(0, data), r2(0, data)) 
     337<p>The first print prints out the same number, 0.321 twice. Then we annulate the first attribute. <code>r1</code> notices it and returns -1 as it's ReliefF, while <code>r2</code> does not and returns the same number, 0.321, which is now wrong.</p> 
     340<H2>Measure for Attributes for Regression Problems</H2> 
     342<P>Except for ReliefF, the only attribute quality measure available for regression problems is based on a mean square error.</P> 
     344<H3>Mean Square Error</H3> 
     345<index name="attribute scoring/mean square error"> 
     346<index name="MSE of an attribute"> 
     348<P>The mean square error measure is implemented in class <CODE><INDEX name="classes/MeasureAttribute_MSE">MeasureAttribute_MSE</CODE>.</P> 
     350<P class=section>Attributes</P> 
     351<DL class=attributes> 
     353<DD>Tells what to do with unknown attribute values. See description on the top of this page.</DD> 
     356<DD>Parameter for m-estimate of error. Default is 0 (no m-estimate). 
  • orange/Orange/feature/

    r7321 r7342  
    163163import Orange.core as orange 
     165from Orange.feature.scoring import attMeasure 
    165167# from orngFSS 
    166168def bestNAtts(scores, N): 
    224226    :param threshold: score threshold for attribute selection. Defaults to 0. 
    225227    :type threshold: float 
    226     :rtype: :obj:`list` 
    227     Result: the first N features (without measures) 
     228    :rtype: :obj:`list` first N features (without measures) 
    229230    """ 
    299300        :param data: an data table 
    300301        :type data: 
    301303        """ 
    302304        ma = attMeasure(data, self.measure) 
Note: See TracChangeset for help on using the changeset viewer.