Timestamp:
02/07/12 22:08:39 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Message:

Polished discretization scripts.

File:
1 edited

  • Orange/feature/scoring.py

    r9919 → r9988
"""
#####################
Scoring (``scoring``)
#####################

.. index:: feature scoring

.. index::
   single: feature; feature scoring

A feature score is an assessment of the usefulness of the feature for
prediction of the dependent (class) variable.

To compute the information gain of the feature "tear_rate" in the Lenses
data set (loaded into ``data``), use:

    >>> meas = Orange.feature.scoring.InfoGain()
    >>> print meas("tear_rate", data)
    0.548794925213

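The snippets on this page assume the data has been loaded beforehand,
for example (a minimal sketch, assuming the "lenses" data set ships
with the Orange installation)::

    import Orange

    # Load the bundled lenses data set into ``data``.
    data = Orange.data.Table("lenses")
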
Other scoring methods are listed in :ref:`classification` and
:ref:`regression`. Various ways to call them are described in
:ref:`callingscore`.

Instead of first constructing the scoring object (e.g. ``InfoGain``) and
then using it, it is usually more convenient to do both in a single step::

    >>> print Orange.feature.scoring.InfoGain("tear_rate", data)
    0.548794925213

This shortcut is much slower for :obj:`Relief`, which can efficiently
compute scores for all features in parallel.

It is also possible to score features that do not appear in the data
but can be computed from it. A typical case is that of discretized
features:

.. literalinclude:: code/scoring-info-iris.py
    :lines: 7-11
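
In outline, such a script might look like the following sketch (the
discretizer class and the use of the Iris data set are assumptions
about the referenced example)::

    import Orange

    data = Orange.data.Table("iris")
    # Build a discretized version of a continuous feature; the new
    # feature does not appear in the data but is computed from it.
    disc = Orange.feature.discretization.Entropy()
    d_sepal = disc("sepal length", data)
    print Orange.feature.scoring.InfoGain(d_sepal, data)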

The following example computes feature scores, both with
:obj:`score_all` and by scoring each feature individually, and prints out
the best three features.

.. literalinclude:: code/scoring-all.py
    :lines: 7-

The output::

    Feature scores for best three features (with score_all):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

    Feature scores for best three features (scored individually):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback
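
In outline, the :obj:`score_all` part of such a script might look like
this sketch (assuming :obj:`score_all` returns ``(name, score)`` pairs
sorted by decreasing score, and that the voting data set is used)::

    import Orange

    data = Orange.data.Table("voting")
    scores = Orange.feature.scoring.score_all(data)
    print "Feature scores for best three features (with score_all):"
    for name, score in scores[:3]:
        print "%5.3f %s" % (score, name)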

.. comment
    The next script uses :obj:`GainRatio` and :obj:`Relief`.

    .. literalinclude:: code/scoring-relief-gainRatio.py
        :lines: 7-

    Notice that on this data the ranks of features match::

        Relief GainRt Feature
        0.613  0.752  physician-fee-freeze
        0.255  0.444  el-salvador-aid
        0.228  0.414  synfuels-corporation-cutback
        0.189  0.382  crime
        0.166  0.345  adoption-of-the-budget-resolution


.. _callingscore:

=======================
Calling scoring methods
=======================

To score a feature, use :obj:`Score.__call__`. There are different
function signatures, which enable optimization: for instance, most
scoring methods first compute contingency tables from the data, and
if these are already computed, they can be passed to the scorer
instead of the data.

Not all classes accept all kinds of arguments. :obj:`Relief`,
for instance, only supports the form that takes data instances on
the input.

.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param data: the data table.
    :type data: `Orange.data.Table`
    :param weightID: id of the meta attribute with instance weights.

    All scoring methods support the first signature.

.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param domain_contingency: precomputed contingencies of all
      features in the domain against the class.
    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain`

.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution])

    :param contingency: contingency table of the feature against the class.
    :type contingency: :obj:`Orange.statistics.contingency.VarClass`
    :param class_distribution: distribution of the class
      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`,
      it should be computed on instances where the feature value is
      defined. Otherwise, it should be the overall class distribution.
    :type class_distribution:
      :obj:`Orange.statistics.distribution.Distribution`
    :param apriori_class_distribution: optional and most often
      ignored. Useful if the scoring method makes probability estimates
      based on a priori class probabilities (such as the m-estimate).
    :return: Feature score; the higher the value, the better the feature.
      If the quality cannot be scored, return :obj:`Score.Rejected`.
    :rtype: float or :obj:`Score.Rejected`

The code below scores the same feature with :obj:`GainRatio`
using different calls.

.. literalinclude:: code/scoring-calls.py
    :lines: 7-
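
In outline, the three call forms look like this sketch (the data set
name is an assumption)::

    import Orange

    data = Orange.data.Table("voting")
    meas = Orange.feature.scoring.GainRatio()

    # 1. the feature and the data
    print meas("physician-fee-freeze", data)

    # 2. the feature and a precomputed domain contingency
    domain_cont = Orange.statistics.contingency.Domain(data)
    print meas("physician-fee-freeze", domain_cont)

    # 3. the feature's contingency and the class distribution
    cont = Orange.statistics.contingency.VarClass("physician-fee-freeze", data)
    class_dist = Orange.statistics.distribution.Distribution(
        data.domain.class_var, data)
    print meas(cont, class_dist)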

.. _classification:

==========================================
Feature scoring in classification problems
==========================================

.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain.

.. index::
   single: feature scoring; information gain

.. class:: InfoGain

    Information gain; the expected decrease of entropy. See `Wikipedia
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gain ratio

.. class:: GainRatio

    Information gain ratio; information gain divided by the entropy of
    the feature's values. Introduced in [Quinlan1986]_ in order to avoid
    overestimation of multi-valued features. It has been shown, however,
    that it still overestimates features with multiple values. See
    `Wikipedia <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gini index

.. class:: Gini

    The Gini index is the probability that two randomly chosen instances
    have different classes. See `Gini coefficient on Wikipedia
    <http://en.wikipedia.org/wiki/Gini_coefficient>`_.

.. index::
   single: feature scoring; relevance

.. class:: Relevance

    The potential value of the feature for building decision rules.

.. index::
   single: feature scoring; cost

.. class:: Cost

    Evaluates features based on the cost decrease achieved by knowing the
    value of the feature, according to the specified cost matrix.

    .. attribute:: cost

        Cost matrix; see :obj:`Orange.classification.CostMatrix` for details.

    If the cost of predicting the first class for an instance that is
    actually in the second is 5, and the cost of the opposite error is 1,
    then an appropriate score can be constructed as follows::

        >>> meas = Orange.feature.scoring.Cost()
        >>> meas.cost = ((0, 5), (1, 0))
        >>> meas(3, data)
        0.083333350718021393

    Knowing the value of feature 3 would decrease the classification
    cost by approximately 0.083 per instance.

    .. comment   opposite error - is this term correct? TODO

.. index::
   single: feature scoring; ReliefF

.. class:: Relief

    Assesses features' ability to distinguish between very similar
    instances from different classes. The method was first developed
    by Kira and Rendell and later improved by Kononenko. The class
    :obj:`Relief` works on discrete and continuous classes and thus
    implements both ReliefF and RReliefF.

    ReliefF is slow since it needs to find the k nearest neighbours of
    each of m reference instances. As we normally compute ReliefF for
    all features in the dataset, :obj:`Relief` caches the results for
    all features when called to score any feature. When called again,
    it uses the stored results if the domain and the data table have
    not changed (the data table version and the data checksum are
    compared). Caching only works when the same object is reused.
    Constructing new instances of :obj:`Relief` for each feature,
    like this::

        for attr in data.domain.attributes:
            print Orange.feature.scoring.Relief(attr, data)

    runs much slower than reusing the same instance::

        meas = Orange.feature.scoring.Relief()
        for attr in data.domain.attributes:
            print meas(attr, data)


    .. attribute:: k

        Number of neighbours considered for each instance. Default is 5.

    .. attribute:: m

        Number of reference instances. Default is 100. If set to -1,
        all instances are used as references.

    .. attribute:: check_cached_data

        Check whether the cached data has changed, which may be slow on
        large tables. Defaults to :obj:`True`, but should be disabled
        when it is certain that the data will not change while the
        scorer is in use.

.. autoclass:: Orange.feature.scoring.Distance

.. autoclass:: Orange.feature.scoring.MDL

.. _regression:

======================================
Feature scoring in regression problems
======================================

.. class:: Relief

    Relief is used for regression in the same way as for
    classification (see :class:`Relief` in classification
    problems).

.. index::
   single: feature scoring; mean square error

.. class:: MSE

    Implements the mean square error score.

    .. attribute:: unknowns_treatment

        What to do with unknown values. See :obj:`Score.unknowns_treatment`.

    .. attribute:: m

        Parameter for the m-estimate of error. Default is 0 (no m-estimate).
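
In outline, scoring for regression might look like this sketch (the
"housing" data set and its feature names are assumptions; since
:obj:`MSE` is contingency-based, the continuous feature is
discretized first)::

    import Orange

    data = Orange.data.Table("housing")

    # Relief handles continuous features and a continuous class directly.
    print Orange.feature.scoring.Relief()("CRIM", data)

    # MSE is computed from contingencies, so discretize the feature first.
    crim = Orange.feature.discretization.EqualFreq(n=4)("CRIM", data)
    print Orange.feature.scoring.MSE()(crim, data)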


============
Base Classes
============

Implemented feature scoring methods are subclasses of :obj:`Score`.
Those that compute statistics on conditional distributions of class
values given the feature values are derived from
:obj:`ScoreFromProbabilities`.

.. class:: Score

    Abstract base class for feature scoring. Its attributes describe
    which types of features it can handle and which kind of data it
    requires.

    **Capabilities**

    .. attribute:: handles_discrete

        Indicates whether the scoring method can handle discrete features.

    .. attribute:: handles_continuous

        Indicates whether the scoring method can handle continuous features.

    .. attribute:: computes_thresholds

        Indicates whether the scoring method implements the
        :obj:`threshold_function`.

    **Input specification**

    .. attribute:: needs

        The type of data needed, indicated by one of the constants
        below (a usage sketch follows the list). Classes that use
        :obj:`DomainContingency` will also handle generators, and those
        based on :obj:`Contingency_Class` can take generators and domain
        contingencies as well.

        .. attribute:: Generator

            Constant. Indicates that the scoring method needs an instance
            generator on the input, as, for example, :obj:`Relief` does.

        .. attribute:: DomainContingency

            Constant. Indicates that the scoring method needs
            :obj:`Orange.statistics.contingency.Domain`.

        .. attribute:: Contingency_Class

            Constant. Indicates that the scoring method needs the
            contingency (:obj:`Orange.statistics.contingency.VarClass`),
            the feature distribution, and the a priori class distribution
            (as most scoring methods do).
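
        For instance, one can check what :obj:`Relief` requires (a
        sketch; the exact location of the constants on :obj:`Score` is
        an assumption)::

            meas = Orange.feature.scoring.Relief()
            # Relief needs data instances (a generator) on the input.
            print meas.needs == Orange.feature.scoring.Score.Generator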

    **Treatment of unknown values**

    .. attribute:: unknowns_treatment

        Defined in classes that are able to treat unknown values. It
        should be set to one of the values below (a short example
        follows the list).

        .. attribute:: IgnoreUnknowns

            Constant. Instances for which the feature value is unknown
            are removed.

        .. attribute:: ReduceByUnknown

            Constant. Features with unknown values are penalized: the
            feature quality is reduced by the proportion of unknown
            values. For impurity scores, the impurity decreases only
            where the value is defined and stays the same otherwise.

        .. attribute:: UnknownsToCommon

            Constant. Undefined values are replaced by the most common
            value.

        .. attribute:: UnknownsAsValue

            Constant. Unknown values are treated as a separate value.

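        For example (a sketch; the data set name and the constant's
        location on the class are assumptions)::

            data = Orange.data.Table("voting")  # contains unknown values
            meas = Orange.feature.scoring.InfoGain()
            meas.unknowns_treatment = Orange.feature.scoring.Score.ReduceByUnknown
            print meas("physician-fee-freeze", data)
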
    **Methods**

    .. method:: __call__

        Abstract. See :ref:`callingscore`.

    .. method:: threshold_function(attribute, instances[, weightID])

        Abstract.

        Assesses different binarizations of the continuous feature
        :obj:`attribute` and returns a list of tuples. The first element
        is a threshold (between two existing values), the second is
        the quality of the corresponding binary feature, and the third
        the distribution of instances below and above the threshold.
        Not all scorers return the third element.

        To show the computation of thresholds, we use the Iris
        data set:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 13-16
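
        A sketch of such a call (assuming ``data`` holds the Iris data
        set)::

            meas = Orange.feature.scoring.InfoGain()
            for t in meas.threshold_function("petal length", data):
                # Each tuple starts with (threshold, score, ...).
                print "%.3f: %.3f" % t[:2]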

    .. method:: best_threshold(attribute, instances)

        Returns the best threshold for binarization, that is, the
        threshold at which the resulting binary feature has the
        optimal score.

        The script below prints out the best threshold for the
        binarization of a feature, using ReliefF for scoring:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 18-19
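
        A sketch (assuming the method returns a ``(threshold, score,
        distribution)`` tuple and that ``data`` holds the Iris data
        set)::

            meas = Orange.feature.scoring.Relief()
            threshold, score, distribution = meas.best_threshold(
                "petal width", data)
            print "Best threshold: %.3f (score %.3f)" % (threshold, score)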

.. class:: ScoreFromProbabilities

    Bases: :obj:`Score`

    Abstract base class for feature scoring methods that can be
    computed from contingency matrices.

    .. attribute:: estimator_constructor
    .. attribute:: conditional_estimator_constructor

        The classes that are used to estimate unconditional
        and conditional probabilities of classes, respectively.
        The defaults use relative frequencies; possible alternatives
        are, for instance, :obj:`ProbabilityEstimatorConstructor_m` and
        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows`
        (with the estimator constructor again set to
        :obj:`ProbabilityEstimatorConstructor_m`), respectively.
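
        A sketch of switching both estimators to m-estimates (the
        constructor signatures and their location in the low-level
        ``orange`` module are assumptions)::

            import orange  # low-level core module

            meas = Orange.feature.scoring.GainRatio()
            meas.estimator_constructor = \
                orange.ProbabilityEstimatorConstructor_m(m=2)
            meas.conditional_estimator_constructor = \
                orange.ConditionalProbabilityEstimatorConstructor_ByRows(
                    estimatorConstructor=orange.ProbabilityEstimatorConstructor_m(m=2))
            print meas("tear_rate", data)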

=====
Other
=====

.. autoclass:: Orange.feature.scoring.OrderAttributes
   :members:

.. autofunction:: Orange.feature.scoring.score_all

.. rubric:: Bibliography

.. [Kononenko2007] I. Kononenko, M. Kukar: Machine Learning and Data
   Mining. Woodhead Publishing, 2007.

.. [Quinlan1986] J. R. Quinlan: Induction of Decision Trees. Machine
   Learning, 1986.

.. [Breiman1984] L. Breiman et al.: Classification and Regression Trees.
   Chapman and Hall, 1984.

.. [Kononenko1995] I. Kononenko: On biases in estimating multi-valued
   attributes. International Joint Conference on Artificial
   Intelligence, 1995.

"""

import Orange.core as orange
import Orange.misc

…

from orange import MeasureAttribute_relief as Relief
from orange import MeasureAttribute_MSE as MSE


######