Changeset 7387:af9f5b688126 in orange


Timestamp:
02/04/11 09:50:31
Author:
tomazc <tomaz.curk@…>
Branch:
default
Convert:
9ca569c4c1fbdb257d0c8645e71ff5f3c116e1ac
Message:

Documentation and code refactoring at Bohinj retreat.

Location:
orange/Orange/feature
Files:
2 edited

  • orange/Orange/feature/__init__.py

    r7385 → r7387

      .. index:: feature

    - Feature scoring, selection, discretization, continuzation, imputation.
    + This module provides functionality for feature scoring, selection,
    + discretization, continuization, imputation, construction and feature
    + interaction analysis.

      ==================
  • orange/Orange/feature/scoring.py

    r7385 → r7387

      There are a number of different measures for assessing the relevance of
    - attributes with respect to much information they contain about the
    - corresponding class. These procedures are also known as attribute scoring.
    + features with respect to how much information they contain about the
    + corresponding class. These procedures are also known as feature scoring.
      Orange implements several methods that all stem from
      :obj:`Orange.feature.scoring.Measure`. Most of the common ones compute
      certain statistics on conditional distributions of class values given
    - the attribute values; in Orange, these are derived from
    + the feature values; in Orange, these are derived from
      :obj:`Orange.feature.scoring.MeasureAttributeFromProbabilities`.
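The statistics mentioned above, information gain and its gain-ratio normalization, can be sketched in plain Python, independent of the Orange API (a toy illustration, not the library's implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """H(class) minus the expected H(class | feature value)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the feature itself."""
    return info_gain(feature_values, labels) / entropy(feature_values)
```

For a feature that separates the classes perfectly, `info_gain` equals the class entropy; for an irrelevant feature it is zero.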
     

      This is the base class for a wide range of classes that measure the quality of
    - attributes. The class itself is, naturally, abstract. Its fields merely
    - describe what kinds of attributes it can handle and what kind of data it
    + features. The class itself is, naturally, abstract. Its fields merely
    + describe what kinds of features it can handle and what kind of data it
      requires.

      .. attribute:: handlesDiscrete

    - Tells whether the measure can handle discrete attributes.
    + Tells whether the measure can handle discrete features.

      .. attribute:: handlesContinuous

    - Tells whether the measure can handle continuous attributes.
    + Tells whether the measure can handle continuous features.

      .. attribute:: computesThresholds
     
      latter only needs the contingency
      (:obj:`Orange.probability.distributions.ContingencyAttrClass`), the
    - attribute distribution and the apriori class distribution. Most measures
    + feature distribution and the a priori class distribution. Most measures
      only need the latter.

    - Several (but not all) measures can treat unknown attribute values in
    + Several (but not all) measures can treat unknown feature values in
      different ways, depending on the field :obj:`unknownsTreatment` (this field is
      not defined in :obj:`Measure` but in many derived classes). Undefined
     

      * ignored (:obj:`Measure.IgnoreUnknowns`); this has the same effect as if
    -   the example for which the attribute value is unknown are removed.
    -
    - * punished (:obj:`Measure.ReduceByUnknown`); the attribute quality is
    +   the examples for which the feature value is unknown were removed.
    +
    + * punished (:obj:`Measure.ReduceByUnknown`); the feature quality is
        reduced by the proportion of unknown values. In impurity measures, this
        can be interpreted as if the impurity is decreased only on examples for
        which the value is defined and stays the same for the others, and the
    -   attribute quality is the average impurity decrease.
    +   feature quality is the average impurity decrease.

      * imputed (:obj:`Measure.UnknownsToCommon`); here, undefined values are
    -   replaced by the most common attribute value. If you want a more clever
    +   replaced by the most common feature value. If you want a more clever
        imputation, you should do it in advance.

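The three treatments can be illustrated with a toy information-gain scorer (plain Python, hypothetical helper names; `None` stands in for an unknown value):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def score(values, labels, treatment="ignore"):
    """Information gain of a feature whose values may be None (unknown)."""
    if treatment == "to_common":
        # UnknownsToCommon: impute the most frequent defined value
        common = Counter(v for v in values if v is not None).most_common(1)[0][0]
        values = [common if v is None else v for v in values]
    # IgnoreUnknowns: drop examples whose feature value is unknown
    known = [(v, l) for v, l in zip(values, labels) if v is not None]
    n = len(known)
    cond = 0.0
    for v in {fv for fv, _ in known}:
        sub = [l for fv, l in known if fv == v]
        cond += len(sub) / n * entropy(sub)
    gain = entropy([l for _, l in known]) - cond
    if treatment == "reduce":
        # ReduceByUnknown: punish by the proportion of unknown values
        gain *= n / len(labels)
    return gain
```

With one unknown in five examples, "reduce" scales a perfect score of 1.0 down to 0.8, while "ignore" leaves it at 1.0.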
     
      :obj:`UnknownsToCommon`, which supposes that missing values are not, for
      instance, results of measurements that were not performed due to
    - information extracted from the other attributes). Use other treatments if
    + information extracted from the other features). Use other treatments if
      you know that they make better sense on your data.

      The only method supported by all measures is the call operator to which we
    - pass the data and get the number representing the quality of the attribute.
    + pass the data and get the number representing the quality of the feature.
      The number does not have any absolute meaning and can vary widely for
    - different attribute measures. The only common characteristic is that
    - higher the value, better the attribute. If the attribute is so bad that
    + different feature measures. The only common characteristic is that the
    + higher the value, the better the feature. If the feature is so bad that
      its quality cannot be measured, the measure returns
      :obj:`Measure.Rejected`. None of the measures described here do so.
     
      There are different sets of arguments that the call operator can accept.
      Not all classes will accept all kinds of arguments. Relief, for instance,
    - cannot be computed from contingencies alone. Besides, the attribute and
    + cannot be computed from contingencies alone. Besides, the feature and
      the class need to be of the correct type for a particular measure.

     
      and the measure will compute much faster. If, on the other hand, you only
      have examples and haven't computed any statistics on them, you can pass
    - examples (and, optionally, an id for meta-attribute with weights) and the
    + examples (and, optionally, an id for a meta-feature with weights) and the
      measure will compute the contingency itself, if needed.

     
      .. method:: __call__(contingency, class distribution[, apriori class distribution])

    -     :param attribute: gives the attribute whose quality is to be assessed.
    -       This can be either a descriptor, an index into domain or a name. In the
    -       first form, if the attribute is given by descriptor, it doesn't need
    -       to be in the domain. It needs to be computable from the attribute in
    -       the domain, though.
    +     :param attribute: gives the feature whose quality is to be assessed.
    +       This can be either a descriptor, an index into the domain, or a name.
    +       In the first form, if the feature is given by a descriptor, it doesn't
    +       need to be in the domain. It needs to be computable from the
    +       features in the domain, though.

          Data is given either as examples (and, optionally, an id for a
    -     meta-attribute with weight), domain contingency
    +     meta-feature with weight), a domain contingency
          (:obj:`Orange.probability.distributions.DomainContingency`) (a list of
          contingencies) or a distribution (:obj:`Orange.probability.distributions`)
     
          depends upon what you do with unknown values (if there are any).
          If :obj:`unknownsTreatment` is :obj:`IgnoreUnknowns`, the class
    -     distribution should be computed on examples for which the attribute
    +     distribution should be computed on examples for which the feature
          value is defined. Otherwise, class distribution should be the overall
          class distribution.
     

      This function computes the qualities for different binarizations of the
    - continuous attribute :obj:`attribute`. The attribute should of course be
    + continuous feature :obj:`attribute`. The feature should of course be
      continuous. The result of the function is a list of tuples, where the first
      element represents a threshold (all splits in the middle between two
    - existing attribute values), the second is the measured quality for a
    - corresponding binary attribute and the last one is the distribution which
    + existing feature values), the second is the measured quality for a
    + corresponding binary feature and the last one is the distribution which
      gives the number of examples below and above the threshold. The last
      element, though, may be missing; generally, if the particular measure can
     

      The script below shows different ways to assess the quality of astigmatic,
    - tear rate and the first attribute (whichever it is) in the dataset lenses.
    + tear rate and the first feature (whichever it is) in the dataset lenses.

      .. literalinclude:: code/scoring-info-lenses.py
     

      As for many other classes in Orange, you can construct the object and use
    - it on-the-fly. For instance, to measure the quality of attribute
    + it on-the-fly. For instance, to measure the quality of the feature
      "tear_rate", you could write simply::

     
          :lines: 7-11

    - The quality of the new attribute d1 is assessed on data, which does not
    - include the new attribute at all. (Note that ReliefF won't do that since
    - it would be too slow. ReliefF requires the attribute to be present in the
    + The quality of the new feature d1 is assessed on data, which does not
    + include the new feature at all. (Note that ReliefF won't do that since
    + it would be too slow. ReliefF requires the feature to be present in the
      dataset.)

      Finally, you can compute the quality of meta-features. The following
    - script adds a meta-attribute to an example table, initializes it to random
    + script adds a meta-feature to an example table, initializes it to random
      values and measures its information gain.

     
          :lines: 7-15

    - If we hadn't constructed the attribute in advance, we could write
    + If we hadn't constructed the feature in advance, we could write
      `Orange.feature.scoring.Relief().thresholdFunction("petal length", data)`.
      This is not recommended for ReliefF, since it may be a lot slower.

      The script below finds and prints out the best threshold for binarization
    - of an attribute, that is, the threshold with which the resulting binary
    - attribute will have the optimal ReliefF (or any other measure)::
    + of a feature, that is, the threshold with which the resulting binary
    + feature will have the optimal ReliefF (or any other measure)::

          thresh, score, distr = meas.bestThreshold("petal length", data)
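The threshold search itself is easy to sketch in plain Python (a toy stand-in that scores splits with information gain; the real `thresholdFunction`/`bestThreshold` use whichever measure you configured, e.g. ReliefF):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def threshold_function(values, labels):
    """Score every midpoint split of a continuous feature.

    Returns a list of (threshold, quality, (n_below, n_above)) tuples,
    mirroring the tuple layout described in the docs above."""
    base = entropy(labels)
    uniq = sorted(set(values))
    n = len(labels)
    results = []
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2  # split halfway between adjacent existing values
        below = [l for v, l in zip(values, labels) if v <= t]
        above = [l for v, l in zip(values, labels) if v > t]
        gain = base - (len(below) / n * entropy(below)
                       + len(above) / n * entropy(above))
        results.append((t, gain, (len(below), len(above))))
    return results

def best_threshold(values, labels):
    """Pick the split with the highest quality."""
    return max(threshold_function(values, labels), key=lambda r: r[1])
```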
     
      .. class:: MeasureAttributeFromProbabilities

    - This is the abstract base class for attribute quality measures that can be
    + This is the abstract base class for feature quality measures that can be
      computed from contingency matrices only. It relieves the derived classes
      from having to compute the contingency matrix by defining the first two
     
      :obj:`ProbabilityEstimatorConstructor_m`), respectively.

    - ====================================
    - Measures for Classification Problems
    - ====================================
    + ===========================
    + Measures for Classification
    + ===========================

      This script scores features with gain ratio and relief.
     
          0.166  0.345  adoption-of-the-budget-resolution

    - The following section describes the attribute quality measures suitable for
    + The following section describes the feature quality measures suitable for
      discrete features and outcomes.
      See `scoring-info-lenses.py`_, `scoring-info-iris.py`_,
     
      Gain ratio :obj:`GainRatio` was introduced by Quinlan in order to avoid
      overestimation of multi-valued features. It is computed as information
    - gain divided by the entropy of the attribute's value. (It has been shown,
    + gain divided by the entropy of the feature's value. (It has been shown,
      however, that such a measure still overestimates the features with multiple
      values.)
     

      Evaluates features based on the "saving" achieved by knowing the value of
    - attribute, according to the specified cost matrix.
    + a feature, according to the specified cost matrix.

      .. attribute:: cost
     
      If the cost of predicting the first class for an example that is actually in
      the second is 5, and the cost of the opposite error is 1, then an appropriate
    - measure can be constructed and used for attribute 3 as follows::
    + measure can be constructed and used for feature 3 as follows::

          >>> meas = Orange.feature.scoring.Cost()
     
          0.083333350718021393

    - This tells that knowing the value of attribute 3 would decrease the
    + This tells us that knowing the value of feature 3 would decrease the
      classification cost by approximately 0.083 per example.

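The "saving" idea can be sketched in plain Python (a toy stand-in, assuming `cost[predicted][actual]` is the penalty for predicting one class when another is true; the real :obj:`Cost` works on Orange contingencies):

```python
def expected_cost(dist, cost):
    """Minimal expected misclassification cost for a class distribution
    `dist` (list of probabilities), choosing the best single prediction."""
    return min(
        sum(cost[pred][actual] * p for actual, p in enumerate(dist))
        for pred in range(len(dist))
    )

def cost_saving(feature_values, labels, cost):
    """Prior expected cost minus the weighted expected cost within each
    feature-value branch: the saving from knowing the feature."""
    n = len(labels)
    classes = sorted(set(labels))
    def dist(ls):
        return [ls.count(c) / len(ls) for c in classes]
    prior = expected_cost(dist(labels), cost)
    cond = 0.0
    for v in set(feature_values):
        branch = [l for fv, l in zip(feature_values, labels) if fv == v]
        cond += len(branch) / n * expected_cost(dist(branch), cost)
    return prior - cond
```

With the 5-versus-1 cost matrix from the example above and a perfectly separating feature, the saving equals the prior expected cost.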
     
    348348 
    349349    ReliefF :obj:`Relief` was first developed by Kira and Rendell and then 
    350     substantially generalized and improved by Kononenko. It measures the usefulness 
    351     of attributes based on their ability to distinguish between very similar 
    352     examples belonging to different classes. 
     350    substantially generalized and improved by Kononenko. It measures the 
     351    usefulness of features based on their ability to distinguish between 
     352    very similar examples belonging to different classes. 
    353353 
    354354    .. attribute:: k 
     
      Computation of ReliefF is rather slow since it needs to find the k nearest
      neighbours for each of m reference examples (or all examples, if m is set to
    - -1). Since we normally compute ReliefF for all attributes in the dataset,
    + -1). Since we normally compute ReliefF for all features in the dataset,
      :obj:`Relief` caches the results. When it is called to compute the quality of a
    - certain attribute, it computes qualities for all attributes in the dataset.
    + certain feature, it computes qualities for all features in the dataset.
      When called again, it uses the stored results if the data has not changed (the
      domain is still the same and the example table has not changed). Checking is done by
     

      Caching will only have an effect if you use the same instance for all
    - attributes in the domain. So, don't do this::
    + features in the domain. So, don't do this::

          for attr in data.domain.attributes:
     

      In this script, cached data dies together with the instance of :obj:`Relief`,
    - which is constructed and destructed for each attribute separately. It's way
    + which is constructed and destroyed for each feature separately. It's way
      faster to go like this::

     
          print meas(attr, data)

    - When called for the first time, meas will compute ReliefF for all attributes
    + When called for the first time, meas will compute ReliefF for all features
      and the subsequent calls simply return the stored data.

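The caching pattern described here, where one instance scores everything on the first call and serves later calls from its cache, can be sketched generically (a hypothetical `CachingScorer`, not the Orange class; `score_all` stands in for the expensive all-features ReliefF pass):

```python
class CachingScorer:
    """Score all features once, then serve repeated calls from the cache
    until a different dataset is passed in."""

    def __init__(self, score_all):
        self._score_all = score_all  # function: dataset -> {feature: score}
        self._cache = None
        self._data_id = None

    def __call__(self, feature, data):
        key = id(data)  # crude stand-in for Orange's data-version check
        if self._cache is None or self._data_id != key:
            self._cache = self._score_all(data)  # one expensive pass
            self._data_id = key
        return self._cache[feature]
```

Constructing a fresh scorer per feature defeats the cache, exactly as the loop above warns: each new instance throws the cached scores away.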
     

      .. note::
    -    ReliefF can also compute the threshold function, that is, the attribute
    +    ReliefF can also compute the threshold function, that is, the feature
         quality at different thresholds for binarization.

     

      The first print prints out the same number, 0.321, twice. Then we nullify the
    - first attribute. r1 notices it and returns -1 as it's ReliefF,
    + first feature. r1 notices this and returns -1 as its ReliefF score,
      while r2 does not and returns the same number, 0.321, which is now wrong.

    - ==============================================
    - Measure for Attributes for Regression Problems
    - ==============================================
    -
    - Except for ReliefF, the only attribute quality measure available for regression
    + ======================
    + Measure for Regression
    + ======================
    +
    + Except for ReliefF, the only feature quality measure available for regression
      problems is based on a mean square error.

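A mean-square-error score can be read as variance reduction; a plain-Python sketch follows (a toy stand-in for the Orange measure, whose exact normalization may differ):

```python
def mse(ys):
    """Mean squared error of predicting the mean of ys."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def mse_score(feature_values, targets):
    """Overall MSE minus the weighted MSE within each feature-value branch;
    higher means the feature explains more of the target's variance."""
    n = len(targets)
    within = 0.0
    for v in set(feature_values):
        branch = [t for fv, t in zip(feature_values, targets) if fv == v]
        within += len(branch) / n * mse(branch)
    return mse(targets) - within
```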
     
      .. attribute:: unknownsTreatment

    - Tells what to do with unknown attribute values. See description on the top
    + Tells what to do with unknown feature values. See the description at the top
      of this page.