Changeset 7374:ee2e1d12c5a0 in orange


Timestamp:
02/04/11 00:46:57 (3 years ago)
Author:
tomazc <tomaz.curk@…>
Branch:
default
Convert:
cfb61bbd53a7d849e1d928d96e4aad3c4312a91a
Message:

Documentation and code refactoring at the Bohinj retreat.

Location:
orange
Files:
5 added
2 edited

  • orange/Orange/feature/scoring.py

    r7342 r7374  
    106106    Several (but not all) measures can treat unknown attribute values in 
    107107    different ways, depending on field :obj:`unknownsTreatment` (this field is 
    108     not defined in :obj:`Measure` but in many derived classes). Undefined values can be 
    109 <UL style="margin-top=0cm"> 
    110 <LI><B>ignored (<CODE>MeasureAttribute.IgnoreUnknowns</CODE>)</B>; this has the same effect as if the examples for which the attribute value is unknown were removed.</LI> 
    111  
    112 <LI><B>punished (<CODE>MeasureAttribute.ReduceByUnknown</CODE>)</B>; the attribute quality is reduced by the proportion of unknown values. In impurity measures, this can be interpreted as if the impurity is decreased only on examples for which the value is defined and stays the same for the others, and the attribute quality is the average impurity decrease.</LI> 
    113  
    114 <LI><B>imputed (<CODE>MeasureAttribute.UnknownsToCommon</CODE>)</B>; here, undefined values are replaced by the most common attribute value. If you want a more clever imputation, you should do it in advance.</LI> 
    115  
    116 <LI><B>treated as a separate value (<CODE>MeasureAttribute.UnknownsAsValue</CODE>)</B> 
    117 </UL> 
    118  
    119 <P>The default treatment is <CODE>ReduceByUnknown</CODE>, which is optimal in most cases and does not make additional assumptions (unlike, for instance, <CODE>UnknownsToCommon</CODE>, which assumes that values are missing at random and not, say, measurements that were skipped because the other attributes already provided the needed information). Use other treatments if you know that they make better sense for your data.</P> 
    120  
    121 <P>The only method supported by all measures is the call operator to which we pass the data and get a number representing the quality of the attribute. The number does not have any absolute meaning and can vary widely for different attribute measures. The only common characteristic is that the higher the value, the better the attribute. If the attribute is so bad that its quality cannot be measured, the measure returns <CODE>MeasureAttribute.Rejected</CODE>. None of the measures described here do so.</P> 
    122  
    123 <P>There are different sets of arguments that the call operator can accept. Not all classes will accept all kinds of arguments. Relief, for instance, cannot be computed from contingencies alone. Besides, the attribute and the class need to be of the correct type for a particular measure.</P> 
    124  
    125 <P class=section>Methods</P> 
    126 <DL class=attributes> 
    127  <DT>__call__(attribute, examples[, apriori class distribution][, weightID])<br> 
    128  __call__(attribute, domain contingency[, apriori class distribution])<br> 
    129  __call__(contingency, class distribution[, apriori class distribution])</DT> 
    130  
    131 <DD>There are three call operators just to make your life simpler and faster. When working with the data, your method might have already computed, for instance, contingency matrix. If so and if the quality measure you use is OK with that (as most measures are), you can pass the contingency matrix and the measure will compute much faster. If, on the other hand, you only have examples and haven't computed any statistics on them, you can pass examples (and, optionally, an id for meta-attribute with weights) and the measure will compute the contingency itself, if needed.</P> 
    132  
    133 <P>Argument <CODE>attribute</CODE> gives the attribute whose quality is to be assessed. This can be either a descriptor, an index into domain or a name. In the first form, if the attribute is given by descriptor, it doesn't need to be in the domain. It needs to be computable from the attribute in the domain, though.</DD> 
    134  
    135 <P>The data is given either as <CODE>examples</CODE> (and, optionally, id for meta-attribute with weight), <CODE>domain contingency</CODE> (a list of contingencies) or <CODE>contingency</CODE> matrix and <CODE>class distribution</CODE>. If you use the latter form, what you should give as the class distribution depends upon what you do with unknown values (if there are any). If <CODE>unknownsTreatment</CODE> is <CODE>IgnoreUnknowns</CODE>, the class distribution should be computed on examples for which the attribute value is defined. Otherwise, class distribution should be the overall class distribution.</P> 
    136  
    137 <P>The optional argument with <CODE>apriori class distribution</CODE> is most often ignored. It comes handy if the measure makes any probability estimates based on apriori class probabilities (such as m-estimate).</P> 
    138 </DD> 
    139  
    140 <dt>thresholdFunction(attribute, examples[, weightID])</dt> 
    141 <dd>This function computes the qualities for different binarizations of the continuous attribute <code>attribute</code>. The attribute should of course be continuous. The result of a function is a list of tuples, where the first element represents a threshold (all splits in the middle between two existing attribute values), the second is the measured quality for a corresponding binary attribute and the last one is the distribution which gives the number of examples below and above the threshold. The last element, though, may be missing; generally, if the particular measure can get the distribution without any computational burden, it will do so and the caller can use it. If not, the caller needs to compute it itself.</dd> 
    142 </DL> 
    143  
    144 <P>The script below shows different ways to assess the quality of astigmatic, tear rate and the first attribute (whichever it is) in the dataset lenses.</P> 
    145  
    146 <p class="header"><a href="MeasureAttribute1.py">part of measureattribute1.py</a> 
    147 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    148 <XMP class=code>import orange, random 
    149 data = orange.ExampleTable("lenses") 
    150  
    151 meas = orange.MeasureAttribute_info() 
    152  
    153 astigm = data.domain["astigmatic"] 
    154 print "Information gain of 'astigmatic': %6.4f" % meas(astigm, data) 
    155  
    156 classdistr = orange.Distribution(data.domain.classVar, data) 
    157 cont = orange.ContingencyAttrClass("tear_rate", data) 
    158 print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr) 
    159  
    160 dcont = orange.DomainContingency(data) 
    161 print "Information gain of the first attribute: %6.4f" % meas(0, dcont) 
    162 print 
    163 </XMP> 
    164  
    165 <P>As for many other classes in Orange, you can construct the object and use it on-the-fly. For instance, to measure the quality of attribute "tear_rate", you could write simply</P> 
    166  
    167 <XMP class=code>>>> print orange.MeasureAttribute_info("tear_rate", data) 
    168 0.548794984818 
    169 </XMP> 
    170  
    171 <P>You shouldn't use this shortcut with ReliefF, though; see the explanation in the section on ReliefF.</P> 
    172  
    173 <P>It is also possible to assess the quality of attributes that do not exist in the dataset. For instance, you can assess the quality of discretized attributes without constructing a new domain and dataset that would include them.</P> 
    174  
    175 <p class="header"><a href="MeasureAttribute1.py">measureattribute1a.py</a> 
    176 (uses <a href="iris.tab">iris.tab</a>)</p> 
    177 <XMP class=code>import orange, random 
    178 data = orange.ExampleTable("iris") 
    179  
    180 d1 = orange.EntropyDiscretization("petal length", data) 
    181 print orange.MeasureAttribute_info(d1, data) 
    182 </XMP> 
    183  
    184 <P>The quality of the new attribute <CODE>d1</CODE> is assessed on <CODE>data</CODE>, which does not include the new attribute at all. (Note that ReliefF won't do that since it would be too slow. ReliefF requires the attribute to be present in the dataset.)</p> 
    185  
    186 <P>Finally, you can compute the quality of meta-attributes. The following script adds a meta-attribute to an example table, initializes it to random values and measures its information gain.</P> 
    187  
    188 <p class="header"><a href="MeasureAttribute1.py">part of measureattribute1.py</a> 
    189 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    190 <XMP class=code>mid = orange.newmetaid() 
    191 data.domain.addmeta(mid, orange.EnumVariable(values = ["v0", "v1"])) 
    192 data.addMetaAttribute(mid) 
    193  
    194 rg = random.Random() 
    195 rg.seed(0) 
    196 for ex in data: 
    197     ex[mid] = orange.Value(rg.randint(0, 1)) 
    198  
    199 print "Information gain for a random meta attribute: %6.4f" % \ 
    200   orange.MeasureAttribute_info(mid, data) 
    201 </XMP> 
    202  
    203  
    204 <P>To show the computation of thresholds, we shall use the Iris data set.</P> 
    205  
    206 <p class="header"><a href="MeasureAttribute1a.py">measureattribute1a.py</a> 
    207 (uses <a href="iris.tab">iris.tab</a>)</p> 
    208 <xmp class="code"> 
    209 import orange 
    210 data = orange.ExampleTable("iris") 
    211  
    212 meas = orange.MeasureAttribute_relief() 
    213 for t in meas.thresholdFunction("petal length", data): 
    214     print "%5.3f: %5.3f" % t 
    215 </xmp> 
    216  
    217 <P>If we hadn't constructed the attribute in advance, we could write <code>orange.MeasureAttribute_relief().thresholdFunction("petal length", data)</code>. This is not recommended for ReliefF, since it may be a lot slower.</P> 
    218  
    219 <P>The script below finds and prints out the best threshold for binarization of an attribute, that is, the threshold with which the resulting binary attribute will have the optimal ReliefF (or any other measure).</P> 
    220 <xmp class="code">thresh, score, distr = meas.bestThreshold("petal length", data) 
    221 print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)</xmp> 
    222  
    223 <H3>MeasureAttributeFromProbabilities</H3> 
    224  
    225 <P><CODE><INDEX name="classes/MeasureAttributeFromProbabilities">MeasureAttributeFromProbabilities</CODE> is the abstract base class for attribute quality measures that can be computed from contingency matrices only. It relieves the derived classes from having to compute the contingency matrix by defining the first two forms of call operator. (Well, that's not something you need to know if you only work in Python.) Additional feature of this class is that you can set probability estimators. If none are given, probabilities and conditional probabilities of classes are estimated by relative frequencies.</P> 
    226  
    227 <P class=section>Attributes</P> 
    228 <DL class=attributes> 
    229 <DT>unknownsTreatment</DT> 
    230 <DD>Defines what to do with unknown values. See the possibilities described above.</DD> 
    231  
    232 <DT>estimatorConstructor, conditionalEstimatorConstructor</DT> 
    233 <DD>The classes that are used to estimate unconditional and conditional probabilities of classes, respectively. You can set this to, for instance, <CODE>ProbabilityEstimatorConstructor_m</CODE> and <CODE>ConditionalProbabilityEstimatorConstructor_ByRows</CODE> (with estimator constructor again set to <CODE>ProbabilityEstimatorConstructor_m</CODE>), respectively.</DD> 
    234 </DL> 
    235  
    236  
    237 <HR> 
    238  
    239 <H2>Measures for Classification Problems</H2> 
    240  
    241 <P>The following section describes the attribute quality measures suitable for discrete attributes and outcomes. See <A href="MeasureAttribute1.py">MeasureAttribute1.py</A>, <A href="MeasureAttribute1a.py">MeasureAttribute1a.py</A>, <A href="MeasureAttribute1b.py">MeasureAttribute1b.py</A>, <A href="MeasureAttribute2.py">MeasureAttribute2.py</A> and <A href="MeasureAttribute3.py">MeasureAttribute3.py</A> for more examples on their use.</P> 
    242  
    243 <H3>Information Gain</H3> 
    244 <index name="attribute scoring+information gain"> 
    245  
    246 <P>The most popular measure, information gain (<CODE><INDEX name="classes/MeasureAttribute_info">MeasureAttribute_info</CODE>), measures the expected decrease of the entropy.</P> 
    247  
    248 <H3>Gain Ratio</H3> 
    249 <index name="attribute scoring+gain ratio"> 
    250  
    251 <P>Gain ratio (<CODE><INDEX name="classes/MeasureAttribute_gainRatio">MeasureAttribute_gainRatio</CODE>) was introduced by Quinlan in order to avoid overestimation of multi-valued attributes. It is computed as information gain divided by the entropy of the attribute's value. (It has been shown, however, that such a measure still overestimates attributes with multiple values.)</P> 
    252  
    253 <H3>Gini index</H3> 
    254 <index name="attribute scoring+gini index"> 
    255  
    256 <P>Gini index (<CODE>MeasureAttribute_gini</CODE>) was first introduced by Breiman and can be interpreted as the probability that two randomly chosen examples will have different classes.</P> 
    257  
    258 <H3>Relevance</H3> 
    259 <index name="attribute scoring+relevance"> 
    260  
    261 <P>Relevance of attributes (<CODE><INDEX 
    262 name="classes/MeasureAttribute_relevance">MeasureAttribute_relevance</CODE>) 
    263 is a measure that discriminates between attributes on the basis of 
    264 their potential value in the formation of decision rules.</P> 
    265  
    266 <H3>Costs</H3> 
    267  
    268 <P><CODE><INDEX name="classes/MeasureAttribute_cost">MeasureAttribute_cost</CODE> evaluates attributes based on the "saving" achieved by knowing the value of attribute, according to the specified cost matrix.</P> 
    269  
    270 <P class=section>Attributes</P> 
    271 <DL class=attributes> 
    272 <DT>cost</DT> 
    273 <DD>Cost matrix (see the page about <A href="CostMatrix.htm">cost matrices</A> for details)</DD> 
    274 </DL> 
    275  
    276 <P>If the cost of predicting the first class for an example that actually belongs to the second is 5, and the cost of the opposite error is 1, then an appropriate measure can be constructed and used for attribute 3 as follows.</P> 
    277  
    278 <XMP class=code>>>> meas = orange.MeasureAttribute_cost() 
    279 >>> meas.cost = ((0, 5), (1, 0)) 
    280 >>> meas(3, data) 
    281 0.083333350718021393 
    282 </XMP> 
    283  
    284 <P>This tells us that knowing the value of attribute 3 would decrease the classification cost by approximately 0.083 per example.</P> 
    285  
    286  
    287 <H3>ReliefF</H3> 
    288 <index name="attribute scoring+ReliefF"> 
    289  
    290 <P>ReliefF (<CODE><INDEX name="classes/MeasureAttribute_relief">MeasureAttribute_relief</CODE>) was first developed by Kira and Rendell and then substantially generalized and improved by Kononenko. It measures the usefulness of attributes based on their ability to distinguish between very similar examples belonging to different classes.</P> 
    291  
    292 <P class=section>Attributes</P> 
    293 <DL class=attributes> 
    294 <DT>k</DT> 
    295 <DD>Number of neighbours for each example. Default is 5.</DD> 
    296  
    297 <DT>m</DT> 
    298 <DD>Number of reference examples. Default is 100. Set to -1 to take all the examples.</DD> 
    299  
    300 <DT>checkCachedData</DT> 
    301 <DD>A flag best left alone unless you know what you do.</DD> 
    302 </DL> 
    303  
    304 <P>Computation of ReliefF is rather slow since it needs to find <CODE>k</CODE> nearest neighbours for each of <CODE>m</CODE> reference examples (or all examples, if <code>m</code> is set to -1). Since we normally compute ReliefF for all attributes in the dataset, <CODE>MeasureAttribute_relief</CODE> caches the results. When it is called to compute the quality of a certain attribute, it computes qualities for all attributes in the dataset. When called again, it uses the stored results if the data has not changed: the domain is still the same and the example table has not changed. Checking is done by comparing the data table version (see <A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A> for details) and then computing a checksum of the data and comparing it with the previous checksum. The latter can take some time on large tables, so you may want to disable it by setting <code>checkCachedData</code> to <code>False</code>. In most cases it will do no harm, except when the data is changed in such a way that it passes unnoticed by the 'version' control, in which case the computed ReliefFs can be false. Hence: disable it if you know that the data does not change or if you know what kind of changes are detected by the version control.</P> 
    305  
    306 <P>Caching will only have an effect if you use the same instance for all attributes in the domain. So, don't do this:</P> 
    307  
    308 <XMP class=code>for attr in data.domain.attributes: 
    309     print orange.MeasureAttribute_relief(attr, data) 
    310 </XMP> 
    311  
    312 <P>In this script, the cached data dies together with the instance of <CODE>MeasureAttribute_relief</CODE>, which is constructed and destroyed for each attribute separately. It is much faster to do it like this.</P> 
    313  
    314 <XMP class=code>meas = orange.MeasureAttribute_relief() 
    315 for attr in data.domain.attributes: 
    316     print meas(attr, data) 
    317 </XMP> 
    318  
    319 <P>When called for the first time, <CODE>meas</CODE> will compute ReliefF for all attributes and the subsequent calls simply return the stored data.</P> 
    320  
    321 <P>Class <CODE>MeasureAttribute_relief</CODE> works on discrete and continuous classes and thus implements functionality of algorithms ReliefF and RReliefF.</P> 
    322  
    323 <P>Note that ReliefF can also compute the threshold function, that is, the attribute quality at different thresholds for binarization.</P> 
    324  
    325 <p>Finally, here is an example which shows what can happen if you disable the computation of checksums.</p> 
    326  
    327 <xmp>data = orange.ExampleTable("iris") 
    328 r1 = orange.MeasureAttribute_relief() 
    329 r2 = orange.MeasureAttribute_relief(checkCachedData = False) 
    330  
    331 print "%.3f\t%.3f" % (r1(0, data), r2(0, data)) 
    332 for ex in data: 
    333     ex[0] = 0 
    334 print "%.3f\t%.3f" % (r1(0, data), r2(0, data)) 
    335 </xmp> 
    336  
    337 <p>The first print outputs the same number, 0.321, twice. Then we zero out the first attribute. <code>r1</code> notices this and returns -1 as its ReliefF score, while <code>r2</code> does not and returns the same number, 0.321, which is now wrong.</p> 
    338  
    339  
    340 <H2>Measure for Attributes for Regression Problems</H2> 
    341  
    342 <P>Except for ReliefF, the only attribute quality measure available for regression problems is based on a mean square error.</P> 
    343  
    344 <H3>Mean Square Error</H3> 
    345 <index name="attribute scoring/mean square error"> 
    346 <index name="MSE of an attribute"> 
    347  
    348 <P>The mean square error measure is implemented in class <CODE><INDEX name="classes/MeasureAttribute_MSE">MeasureAttribute_MSE</CODE>.</P> 
    349  
    350 <P class=section>Attributes</P> 
    351 <DL class=attributes> 
    352 <DT>unknownsTreatment</DT> 
    353 <DD>Tells what to do with unknown attribute values. See description on the top of this page.</DD> 
    354  
    355 <DT>m</DT> 
    356 <DD>Parameter for m-estimate of error. Default is 0 (no m-estimate). 
    357  
    358 </DL>  
     108    not defined in :obj:`Measure` but in many derived classes). Undefined  
     109    values can be: 
     110     
     111    * ignored (:obj:`Measure.IgnoreUnknowns`); this has the same effect as if  
     112      the examples for which the attribute value is unknown were removed. 
     113 
     114    * punished (:obj:`Measure.ReduceByUnknown`); the attribute quality is 
     115      reduced by the proportion of unknown values. In impurity measures, this 
     116      can be interpreted as if the impurity is decreased only on examples for 
     117      which the value is defined and stays the same for the others, and the 
     118      attribute quality is the average impurity decrease. 
     119       
     120    * imputed (:obj:`Measure.UnknownsToCommon`); here, undefined values are 
     121      replaced by the most common attribute value. If you want a more clever 
     122      imputation, you should do it in advance. 
     123 
     124    * treated as a separate value (:obj:`Measure.UnknownsAsValue`). 
     125 
     126    The default treatment is :obj:`ReduceByUnknown`, which is optimal in most 
     127    cases and does not make additional assumptions (unlike, for instance, 
     128    :obj:`UnknownsToCommon`, which assumes that values are missing at random 
     129    and not, say, measurements that were skipped because the other attributes 
     130    already provided the needed information). Use other treatments if you know 
     131    that they make better sense for your data. 
     132 
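    As a minimal sketch (using the old-style ``orange`` names and the treatment
    constants listed above), a different treatment can be selected like this::

        import orange
        data = orange.ExampleTable("lenses")

        meas = orange.MeasureAttribute_info()
        # ignore examples with unknown values instead of the default punishing
        meas.unknownsTreatment = orange.MeasureAttribute.IgnoreUnknowns
        print "Info gain of 'astigmatic': %6.4f" % meas("astigmatic", data)
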
     133    The only method supported by all measures is the call operator to which we 
     134    pass the data and get the number representing the quality of the attribute. 
     135    The number does not have any absolute meaning and can vary widely for 
     136    different attribute measures. The only common characteristic is that the 
     137    higher the value, the better the attribute. If the attribute is so bad that 
     138    its quality cannot be measured, the measure returns 
     139    :obj:`Measure.Rejected`. None of the measures described here do so. 
     140 
     141    There are different sets of arguments that the call operator can accept. 
     142    Not all classes will accept all kinds of arguments. Relief, for instance, 
     143    cannot be computed from contingencies alone. Besides, the attribute and 
     144    the class need to be of the correct type for a particular measure. 
     145 
     146    There are three call operators just to make your life simpler and faster. 
     147    When working with the data, your method might already have computed, for 
     148    instance, a contingency matrix. If so, and if the quality measure you use 
     149    accepts it (as most measures do), you can pass the contingency matrix and 
     150    the measure will compute much faster. If, on the other hand, you only have 
     151    examples and haven't computed any statistics on them, you can pass the 
     152    examples (and, optionally, the id of a meta-attribute with weights) and the 
     153    measure will compute the contingency itself, if needed. 
     154 
     155    .. method:: __call__(attribute, examples[, apriori class distribution][, weightID]) 
     156    .. method:: __call__(attribute, domain contingency[, apriori class distribution]) 
     157    .. method:: __call__(contingency, class distribution[, apriori class distribution]) 
     158 
     159        :param attribute: gives the attribute whose quality is to be assessed. 
     160          This can be either a descriptor, an index into domain or a name. In the 
     161          first form, if the attribute is given by descriptor, it doesn't need 
     162          to be in the domain. It needs to be computable from the attribute in 
     163          the domain, though. 
     164           
     165        Data is given either as examples (and, optionally, the id of a 
     166        meta-attribute with weights), a domain contingency 
     167        (:obj:`Orange.probability.distributions.DomainContingency`, a list of 
     168        contingencies), or a contingency matrix together with a class 
     169        distribution (:obj:`Orange.probability.distributions.Distribution`). If 
     170        you use the latter form, what you should give as the class distribution 
     171        depends upon what you do with unknown values (if there are any). 
     172        If :obj:`unknownsTreatment` is :obj:`IgnoreUnknowns`, the class 
     173        distribution should be computed on examples for which the attribute 
     174        value is defined. Otherwise, class distribution should be the overall 
     175        class distribution. 
     176 
     177        The optional argument with apriori class distribution is 
     178        most often ignored. It comes handy if the measure makes any probability 
     179        estimates based on apriori class probabilities (such as m-estimate). 
     180 
     181    .. method:: thresholdFunction(attribute, examples[, weightID]) 

     182        This function computes the qualities for different binarizations of the 
     183        continuous attribute :obj:`attribute`. The attribute should of course be 
     184        continuous. The result is a list of tuples, where the first element 
     185        represents a threshold (all splits in the middle between two existing 
     186        attribute values), the second is the measured quality for the 
     187        corresponding binary attribute and the last one is the distribution which 
     188        gives the number of examples below and above the threshold. The last 
     189        element, though, may be missing; generally, if the particular measure can 
     190        get the distribution without any computational burden, it will do so and 
     191        the caller can use it. If not, the caller needs to compute it itself. 
     192 
     193    The script below shows different ways to assess the quality of astigmatic, 
     194    tear rate and the first attribute (whichever it is) in the dataset lenses. 
     195 
     196    .. literalinclude:: code/scoring-info-lenses.py 
     197        :lines: 7-21 
     198 
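    For reference, the earlier version of this page showed the script inline;
    condensed, and using the old-style ``orange`` names, it reads roughly::

        import orange
        data = orange.ExampleTable("lenses")
        meas = orange.MeasureAttribute_info()

        # 1) attribute (by descriptor) and examples
        astigm = data.domain["astigmatic"]
        print "Information gain of 'astigmatic': %6.4f" % meas(astigm, data)

        # 2) contingency matrix and class distribution
        classdistr = orange.Distribution(data.domain.classVar, data)
        cont = orange.ContingencyAttrClass("tear_rate", data)
        print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr)

        # 3) attribute index and domain contingency
        dcont = orange.DomainContingency(data)
        print "Information gain of the first attribute: %6.4f" % meas(0, dcont)
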
     199    As for many other classes in Orange, you can construct the object and use 
     200    it on-the-fly. For instance, to measure the quality of attribute 
     201    "tear_rate", you could write simply:: 
     202 
     203        >>> print orange.MeasureAttribute_info("tear_rate", data) 
     204        0.548794984818 
     205 
     206    You shouldn't use this shortcut with ReliefF, though; see the explanation 
     207    in the section on ReliefF. 
     208 
     209    It is also possible to assess the quality of features that do not exist 
     210    in the dataset. For instance, you can assess the quality of discretized 
     211    features without constructing a new domain and dataset that would include 
     212    them. 
     213 
     214    `scoring-info-iris.py`_ (uses `iris.tab`_): 
     215 
     216    .. literalinclude:: code/scoring-info-iris.py 
     217        :lines: 7-11 
     218 
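    The earlier version of this page showed the corresponding code inline; in
    old-style ``orange`` terms it reads roughly::

        import orange
        data = orange.ExampleTable("iris")

        # d1 is a discretized copy of "petal length"; it is not added to the domain
        d1 = orange.EntropyDiscretization("petal length", data)
        print orange.MeasureAttribute_info(d1, data)
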
     219    The quality of the new attribute d1 is assessed on data, which does not 
     220    include the new attribute at all. (Note that ReliefF won't do that since 
     221    it would be too slow. ReliefF requires the attribute to be present in the 
     222    dataset.) 
     223 
     224    Finally, you can compute the quality of meta-features. The following 
     225    script adds a meta-attribute to an example table, initializes it to random 
     226    values and measures its information gain. 
     227 
     228    `scoring-info-lenses.py`_ (uses `lenses.tab`_): 
     229 
     230    .. literalinclude:: code/scoring-info-lenses.py 
     231        :lines: 54- 
     232 
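    The earlier version of this page showed the inline equivalent (old-style
    ``orange`` names, on the lenses data)::

        import orange, random
        data = orange.ExampleTable("lenses")

        # register a new meta-attribute and fill it with random values
        mid = orange.newmetaid()
        data.domain.addmeta(mid, orange.EnumVariable(values = ["v0", "v1"]))
        data.addMetaAttribute(mid)

        rg = random.Random()
        rg.seed(0)
        for ex in data:
            ex[mid] = orange.Value(rg.randint(0, 1))

        print "Information gain for a random meta attribute: %6.4f" % \
          orange.MeasureAttribute_info(mid, data)
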
     233    To show the computation of thresholds, we shall use the Iris data set. 
     234 
     235    `scoring-info-iris.py`_ (uses `iris.tab`_): 
     236 
     237    .. literalinclude:: code/scoring-info-iris.py 
     238        :lines: 7-15 
     239 
     240    If we hadn't constructed the attribute in advance, we could write 
     241    ``Orange.feature.scoring.Relief().thresholdFunction("petal length", data)``. 
     242    This is not recommended for ReliefF, since it may be a lot slower. 
     243 
     244    The script below finds and prints out the best threshold for binarization 
     245    of an attribute, that is, the threshold with which the resulting binary 
     246    attribute will have the optimal ReliefF (or any other measure):: 
     247 
     248        thresh, score, distr = meas.bestThreshold("petal length", data) 
     249        print "Best threshold: %5.3f (score %5.3f)" % (thresh, score) 
     250 
     251.. class:: MeasureAttributeFromProbabilities 
     252 
     253    This is the abstract base class for attribute quality measures that can be 
     254    computed from contingency matrices only. It relieves the derived classes 
     255    from having to compute the contingency matrix by defining the first two 
     256    forms of call operator. (Well, that's not something you need to know if 
     257    you only work in Python.) An additional feature of this class is that you can 
     258    set probability estimators. If none are given, probabilities and 
     259    conditional probabilities of classes are estimated by relative frequencies. 
     260 
     261    .. attribute:: unknownsTreatment 

     262        Defines what to do with unknown values. See the possibilities described above. 
     263 
     264    .. attribute:: estimatorConstructor, conditionalEstimatorConstructor 

     265        The classes that are used to estimate unconditional and conditional 
     266        probabilities of classes, respectively. You can set this to, for instance, 
     267        :obj:`ProbabilityEstimatorConstructor_m` and 
     268        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` 
     269        (with estimator constructor again set to 
     270        :obj:`ProbabilityEstimatorConstructor_m`), respectively. 
     271 
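    A hedged sketch of setting the estimators (old-style ``orange`` names;
    assumes, as is usual in Orange, that constructor keyword arguments simply
    set the attributes of the same name)::

        import orange
        data = orange.ExampleTable("lenses")

        meas = orange.MeasureAttribute_info()
        # use m-estimates instead of relative frequencies
        meas.estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m = 2)
        meas.conditionalEstimatorConstructor = \
            orange.ConditionalProbabilityEstimatorConstructor_ByRows(
                estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m = 2))
        print "%6.4f" % meas("astigmatic", data)
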
     272==================================== 
     273Measures for Classification Problems 
     274==================================== 
     275 
     276The following section describes the attribute quality measures suitable for  
     277discrete features and outcomes.  
     278See  `scoring-info-lenses.py`_, `scoring-info-iris.py`_, 
     279`scoring-diff-measures.py`_ and `scoring-regression.py`_ 
     280for more examples on their use. 
     281 
     282---------------- 
     283Information Gain 
     284---------------- 
     285.. index::  
     286   single: feature scoring; information gain 
     287 
     288.. class:: Info 
     289 
     290    The most popular measure, information gain (:obj:`Info`), measures the 
     291    expected decrease of entropy. 
     292 
     293---------- 
     294Gain Ratio 
     295---------- 
     296 
     297.. index::  
     298   single: feature scoring; gain ratio 
     299 
     300.. class:: Gain 
     301 
     302    Gain ratio (:obj:`GainRatio`) was introduced by Quinlan in order to avoid 
     303    overestimation of multi-valued features. It is computed as information 
     304    gain divided by the entropy of the feature's value. (It has been shown, 
     305    however, that such a measure still overestimates features with multiple 
     306    values.) 
     307 
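    A small sketch (old-style ``orange`` names) that contrasts the two measures
    on the same data::

        import orange
        data = orange.ExampleTable("lenses")

        info = orange.MeasureAttribute_info()
        gr = orange.MeasureAttribute_gainRatio()
        for attr in data.domain.attributes:
            # gain ratio divides the information gain by the attribute's entropy,
            # so many-valued attributes score relatively lower
            print "%-15s info: %6.4f   gain ratio: %6.4f" % \
              (attr.name, info(attr, data), gr(attr, data))
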
     308---------- 
     309Gini index 
     310---------- 
     311 
     312.. index::  
     313   single: feature scoring; gini index 
     314 
     315.. class:: Gini 
     316 
     317    Gini index (:obj:`Gini`) was first introduced by Breiman and can be 
     318    interpreted as the probability that two randomly chosen examples will have 
     319    different classes. 
     320 
     321--------- 
     322Relevance 
     323--------- 
     324 
     325.. index::  
     326   single: feature scoring; relevance 
     327 
     328.. class:: Relevance 
     329 
     330    Relevance of features (:obj:`Relevance`) is a measure that discriminates 
     331    between features on the basis of their potential value in the formation of 
     332    decision rules. 
     333 
     334----- 
     335Costs 
     336----- 
     337 
     338.. class:: Cost 
     339 
     340    Evaluates features based on the "saving" achieved by knowing the value of 
     341    the feature, according to the specified cost matrix. 
     342 
     343    .. attribute:: cost 

     344        Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
     345 
     346    If the cost of predicting the first class for an example that actually 
     347    belongs to the second is 5, and the cost of the opposite error is 1, then 
     348    an appropriate measure can be constructed and used for attribute 3 as 
     349    follows:: 

     350        >>> meas = Orange.feature.scoring.Cost() 
     351        >>> meas.cost = ((0, 5), (1, 0)) 
     352        >>> meas(3, data) 
     353        0.083333350718021393 
     354 
     355    This tells us that knowing the value of attribute 3 would decrease the 
     356    classification cost by approximately 0.083 per example. 
     357 
     358------- 
     359ReliefF 
     360------- 
     361 
     362.. index::  
     363   single: feature scoring; ReliefF 
     364 
     365.. class:: Relief 
     366 
     367    ReliefF (:obj:`Relief`) was first developed by Kira and Rendell and then 
     368    substantially generalized and improved by Kononenko. It measures the usefulness 
     369    of attributes based on their ability to distinguish between very similar 
     370    examples belonging to different classes. 
     371 
     372    .. attribute:: k 

     373        Number of neighbours for each example. Default is 5. 
     374 
     375    .. attribute:: m 

     376        Number of reference examples. Default is 100. Set to -1 to take all the 
     377        examples. 
     378 
     379    .. attribute:: checkCachedData 

     380        A flag best left alone unless you know what you are doing. 
     381 
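    A minimal sketch (old-style ``orange`` name; assumes keyword arguments set
    the attributes described above)::

        import orange
        data = orange.ExampleTable("iris")

        # 20 neighbours, 50 reference examples
        meas = orange.MeasureAttribute_relief(k = 20, m = 50)
        for attr in data.domain.attributes:
            print "%-15s %6.4f" % (attr.name, meas(attr, data))
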
     382Computation of ReliefF is rather slow since it needs to find k nearest 
     383neighbours for each of m reference examples (or all examples, if m is set to 
     384-1). Since we normally compute ReliefF for all attributes in the dataset, 
     385:obj:`Relief` caches the results. When it is called to compute the quality of 
     386a certain attribute, it computes qualities for all attributes in the dataset. 
     387When called again, it uses the stored results if the data has not changed: the 
     388domain is still the same and the example table has not changed. Checking is 
     389done by comparing the data table version (see :obj:`Orange.data.Table` for 
     390details) and then computing a checksum of the data and comparing it with the 
     391previous checksum. The latter can take some time on large tables, so you may 
     392want to disable it by setting :obj:`checkCachedData` to :obj:`False`. In most 
     393cases it will do no harm, except when the data is changed in such a way that 
     394it passes unnoticed by the 'version' control, in which case the computed 
     395ReliefFs can be false. Hence: disable it if you know that the data does not 
     396change or if you know what kind of changes are detected by the version control. 
     397 
     398Caching will only have an effect if you use the same instance for all 
     399attributes in the domain. So, don't do this:: 
     400 
     401    for attr in data.domain.attributes: 
     402        print Orange.feature.scoring.Relief(attr, data) 
     403 
     404In this script, the cached data dies together with the instance of :obj:`Relief`, 
     405which is constructed and destroyed for each attribute separately. It is much 
     406faster to do it like this:: 
     407 
     408    meas = Orange.feature.scoring.Relief() 
     409    for attr in data.domain.attributes: 
     410        print meas(attr, data) 
     411 
     412When called for the first time, meas will compute ReliefF for all attributes 
     413and the subsequent calls simply return the stored data. 
     414 
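For instance, a sketch that ranks attributes with a single, cached instance
(names taken from the snippets above)::

    import Orange
    data = Orange.data.Table("voting")

    meas = Orange.feature.scoring.Relief()
    # the first call scores all attributes; later calls reuse the cache
    scores = [(meas(attr, data), attr.name) for attr in data.domain.attributes]
    for score, name in sorted(scores, reverse = True):
        print "%6.4f %s" % (score, name)
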
     415Class :obj:`Relief` works on discrete and continuous classes and thus 
     416implements functionality of algorithms ReliefF and RReliefF. 
     417 
     418.. note:: ReliefF can also compute the threshold function, that is, the  
     419   attribute quality at different thresholds for binarization. 
     420 
     421Finally, here is an example which shows what can happen if you disable the  
     422computation of checksums:: 
     423 
     424    table = Orange.data.Table("iris") 
     425    r1 = Orange.feature.scoring.Relief() 
     426    r2 = Orange.feature.scoring.Relief(checkCachedData = False) 
     427 
     428    print "%.3f\\t%.3f" % (r1(0, table), r2(0, table)) 
     429    for ex in table: 
     430        ex[0] = 0 
     431    print "%.3f\\t%.3f" % (r1(0, table), r2(0, table)) 
     432 
     433The first print outputs the same number, 0.321, twice. Then we zero out the 
     434first attribute. r1 notices this and returns -1 as its ReliefF score, 
     435while r2 does not and returns the same number, 0.321, which is now wrong. 
     436 
     437============================================== 
     438Measures for Regression Problems 
     439============================================== 
     440 
     441Except for ReliefF, the only attribute quality measure available for regression 
     442problems is based on the mean square error. 
     443 
     444----------------- 
     445Mean Square Error 
     446----------------- 
     447 
     448.. index::  
     449   single: feature scoring; mean square error 
     450 
     451.. class:: MSE 
     452 
     453    Implements the mean square error measure. 
     454 
     455    .. attribute:: unknownsTreatment 

     456        Tells what to do with unknown attribute values. See the description at 
     457        the top of this page. 
     458 
     459    .. attribute:: m 

     460        Parameter for the m-estimate of error. Default is 0 (no m-estimate). 
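    A hedged sketch (old-style ``orange`` names; assumes the ``housing``
    regression data set shipped with Orange, and that the discretization
    shortcut also accepts ``numberOfIntervals`` as a keyword)::

        import orange
        data = orange.ExampleTable("housing")

        meas = orange.MeasureAttribute_MSE()
        meas.m = 10   # use an m-estimate of the error

        # score a discretized copy of a continuous attribute, as in the
        # discretization example earlier on this page
        d1 = orange.EquiNDiscretization("RM", data, numberOfIntervals = 4)
        print "%6.4f" % meas(d1, data)
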
    359461 
    360462========== 
     
     364466* Kononenko: Strojno ucenje. Zalozba FE in FRI, Ljubljana, 2005. 
    365467 
     468.. _iris.tab: code/iris.tab 
     469.. _lenses.tab: code/lenses.tab 
    366470.. _scoring-relief-gainRatio.py: code/scoring-relief-gainRatio.py 
    367471.. _voting.tab: code/voting.tab 
     472.. _selection-best3.py: code/selection-best3.py 
     473.. _scoring-info-lenses.py: code/scoring-info-lenses.py 
     474.. _scoring-info-iris.py: code/scoring-info-iris.py 
     475.. _scoring-diff-measures.py: code/scoring-diff-measures.py 
     476 
     477.. _scoring-regression.py: code/scoring-regression.py 
     478.. _scoring-relief-caching: code/scoring-relief-caching 
    368479 
    369480""" 
  • orange/orngDisc.py

    r7199 r7374  
    1 from Orange.discretization import * 
     1from Orange.feature.discretization import * 