Changeset 8138:befbd9ca5a2d in orange


Timestamp:
08/02/11 15:16:02 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
ebdd8d6c8d31eaf64043118f0b1a665babd6025e
Message:

Work on Orange.feature.scoring.

Location:
orange
Files:
4 edited

  • orange/Orange/feature/scoring.py

    r8119 r8138  
    99   single: feature; feature scoring 
    1010 
    11 Features selection aims to find relevant features for the given 
    12 prediction task. 
     11Feature scoring estimates the relevance of features to the 
     12class variable.  
     13 
     14If `data` contains the "lenses" dataset, you can measure the quality of 
     15feature "tear_rate" with information gain by :: 
     16 
     17    >>> meas = Orange.feature.scoring.InfoGain() 
     18    >>> print meas("tear_rate", data) 
     19    0.548794925213 
     20 
     21Orange also implements other measures; see 
     22:ref:`classification` and :ref:`regression`. For various 
     23ways to call them see :obj:`Measure.__call__`. 
     24 
     25You can also construct the object and use 
     26it on-the-fly:: 
     27 
     28    >>> print Orange.feature.scoring.InfoGain("tear_rate", data) 
     29    0.548794925213 
     30 
     31Do not use this shorthand with :obj:`Relief`; see :obj:`Relief` for the explanation. 
     32 
     33It is also possible to score features that are not  
     34in the domain. For instance, you can score discretized 
     35features on the fly (slow with :obj:`Relief`): 
     36 
     37.. literalinclude:: code/scoring-info-iris.py 
     38    :lines: 7-11 
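For readers without the linked file at hand, a minimal sketch of such
on-the-fly scoring (mirroring scoring-info-iris.py as edited in this
changeset; assumes the iris dataset is available)::

    import Orange

    data = Orange.data.Table("iris")
    # Build a discretized variant of "petal length" without adding it
    # to the domain, then score the new feature directly.
    d1 = Orange.feature.discretization.EntropyDiscretization("petal length", data)
    print Orange.feature.scoring.InfoGain(d1, data)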
    1339 
    1440The following example computes feature scores, both with 
     
    3359    0.255 el-salvador-aid 
    3460    0.228 synfuels-corporation-cutback 
     61 
     62.. comment:: 
     63 
     64    The next script uses :obj:`GainRatio` and :obj:`Relief`. 
     65 
     66    .. literalinclude:: code/scoring-relief-gainRatio.py 
     67        :lines: 7- 
     68 
     69    Notice that on this data the ranks of features match:: 
     70         
     71        Relief GainRt Feature 
     72        0.613  0.752  physician-fee-freeze 
     73        0.255  0.444  el-salvador-aid 
     74        0.228  0.414  synfuels-corporation-cutback 
     75        0.189  0.382  crime 
     76        0.166  0.345  adoption-of-the-budget-resolution 
     77 
     78.. _classification: 
     79 
     80=========================== 
     81Measures for Classification 
     82=========================== 
     83 
     84.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. 
     85 
     86.. index::  
     87   single: feature scoring; information gain 
     88 
     89.. class:: InfoGain 
     90 
     91    Measures the expected decrease of entropy. 
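    In symbols (the standard definition, stated here for clarity; H denotes
    entropy, C the class and A the feature)::

        Gain(A) = H(C) - H(C | A)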
     92 
     93.. index::  
     94   single: feature scoring; gain ratio 
     95 
     96.. class:: GainRatio 
     97 
     98    Information gain divided by the entropy of the feature's 
     99    value distribution. Introduced by Quinlan to avoid the 
     100    overestimation of multi-valued features. It has been shown, 
     101    however, that it still overestimates features with multiple values. 
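    In symbols (standard definition, stated here for clarity)::

        GainRatio(A) = (H(C) - H(C | A)) / H(A)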
     102 
     103.. index::  
     104   single: feature scoring; gini index 
     105 
     106.. class:: Gini 
     107 
     108    The probability that two randomly chosen examples will have different 
     109    classes; first introduced by Breiman. 
     110 
     111.. index::  
     112   single: feature scoring; relevance 
     113 
     114.. class:: Relevance 
     115 
     116    The potential value of the feature for decision rules. 
     117 
     118.. index::  
     119   single: feature scoring; cost 
     120 
     121.. class:: Cost 
     122 
     123    Evaluates features based on the "saving" achieved by knowing the value 
     124    of the feature, according to the specified cost matrix. 
     125 
     126    .. attribute:: cost 
     127      
     128        Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
     129 
     130    If the cost of predicting the first class for an example that actually 
     131    belongs to the second is 5, and the cost of the opposite error is 1, then 
     132    an appropriate measure can be constructed as follows:: 
     133 
     134        >>> meas = Orange.feature.scoring.Cost() 
     135        >>> meas.cost = ((0, 5), (1, 0)) 
     136        >>> meas(3, data) 
     137        0.083333350718021393 
     138 
     139    Knowing the value of feature 3 would decrease the 
     140    classification cost by approximately 0.083 per example. 
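    A self-contained version of the example above (the dataset name is
    illustrative; any data with a binary class will do)::

        import Orange

        data = Orange.data.Table("titanic")  # assumed binary-class dataset
        meas = Orange.feature.scoring.Cost()
        meas.cost = ((0, 5), (1, 0))  # 2x2 cost matrix for the two classes
        # Expected decrease of misclassification cost from knowing feature 0
        print meas(0, data)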
     141 
     142.. index::  
     143   single: feature scoring; ReliefF 
     144 
     145.. class:: Relief 
     146 
     147    Assesses features' ability to distinguish between very similar 
     148    examples from different classes.  First developed by Kira and Rendell 
     149    and then improved by Kononenko. 
     150 
     151    .. attribute:: k 
     152     
     153       Number of neighbours for each example. Default is 5. 
     154 
     155    .. attribute:: m 
     156     
     157        Number of reference examples. Default is 100. Set to -1 to take all the 
     158        examples. 
     159 
     160    .. attribute:: check_cached_data 
     161     
     162        Use a data checksum to check whether the cached data has changed. 
     163        This can be slow on large tables. Defaults to True. Disable it if 
     164        you know that the data will not change. 
     165 
     166    ReliefF is slow since it needs to find k nearest neighbours for each 
     167    of m reference examples.  As we normally compute ReliefF for all 
     168    features in the dataset, :obj:`Relief` caches the results. When called 
     169    to score a certain feature, it computes all feature scores. 
     170    When called again, it uses the stored results if the domain and the 
     171    data table have not changed (data table version and the data checksum 
     172    are compared). Caching will only work if you use the same instance. 
     173    So, don't do this:: 
     174 
     175        for attr in data.domain.attributes: 
     176            print Orange.feature.scoring.Relief(attr, data) 
     177 
     178    But this:: 
     179 
     180        meas = Orange.feature.scoring.Relief() 
     181        for attr in data.domain.attributes: 
     182            print meas(attr, data) 
     183 
     184    Class :obj:`Relief` works with both discrete and continuous classes 
     185    and thus implements the functionality of the ReliefF and RReliefF algorithms. 
     186 
     187    .. note:: 
     188       Relief can also compute the threshold function, that is, the feature 
     189       quality at different thresholds for binarization. 
     190 
     191.. autoclass:: Orange.feature.scoring.Distance 
     192   :members: 
     193    
     194.. autoclass:: Orange.feature.scoring.MDL 
     195   :members: 
     196 
     197.. _regression: 
     198 
     199======================= 
     200Measures for Regression 
     201======================= 
     202 
     203You can also use :obj:`Relief` for regression. 
     204 
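A minimal sketch (the dataset name is an assumption; any data with a
continuous class will do)::

    import Orange

    data = Orange.data.Table("housing")  # assumed continuous-class dataset
    meas = Orange.feature.scoring.Relief(k=5, m=100)
    for attr in data.domain.attributes:
        print "%.3f %s" % (meas(attr, data), attr.name)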
     205.. index::  
     206   single: feature scoring; mean square error 
     207 
     208.. class:: MSE 
     209 
     210    Implements the mean square error measure. 
     211 
     212    .. attribute:: unknowns_treatment 
     213     
     214        What to do with unknown values. See :obj:`Measure.unknowns_treatment`. 
     215 
     216    .. attribute:: m 
     217     
     218        Parameter for m-estimate of error. Default is 0 (no m-estimate). 
     219 
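    A usage sketch (the dataset name is an assumption)::

        import Orange

        data = Orange.data.Table("housing")  # assumed continuous-class dataset
        meas = Orange.feature.scoring.MSE()
        meas.m = 2  # optional m-estimate of error
        print meas(data.domain.attributes[0], data)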
    35220 
    36221 
     
    173358            :lines: 7- 
    174359 
    175     .. method:: threshold_function(attribute, examples[, weightID]) 
     360    .. method:: threshold_function(attribute, instances[, weightID]) 
    176361     
    177362        Abstract.  
     
    184369        last element is optional. 
    185370 
    186     .. method:: best_threshold 
    187  
    188         Return the best threshold for binarization. Parameters? 
    189  
    190  
    191     The script below shows different ways to assess the quality of astigmatic, 
    192     tear rate and the first feature in the dataset lenses. 
    193  
    194     .. literalinclude:: code/scoring-info-lenses.py 
    195         :lines: 7-21 
    196  
    197     As for many other classes in Orange, you can construct the object and use 
    198     it on-the-fly. For instance, to measure the quality of feature 
    199     "tear_rate", you could write simply:: 
    200  
    201         >>> print Orange.feature.scoring.Info("tear_rate", data) 
    202         0.548794984818 
    203  
    204     You shouldn't use this with :obj:`Relief`; see :obj:`Relief` for the explanation. 
    205  
    206     It is also possible to score features that are not  
    207     in the domain. For instance, you can score discretized 
    208     features on the fly: 
    209  
    210     .. literalinclude:: code/scoring-info-iris.py 
    211         :lines: 7-11 
    212  
    213     Note that this is not possible with :obj:`Relief`, as it would be too slow. 
    214  
    215     To show the computation of thresholds, we shall use the Iris data set. 
    216  
    217     `scoring-info-iris.py`_ (uses `iris.tab`_): 
    218  
    219     .. literalinclude:: code/scoring-info-iris.py 
    220         :lines: 7-15 
    221  
    222     If we hadn't constructed the feature in advance, we could write  
    223     `Orange.feature.scoring.Relief().threshold_function("petal length", data)`. 
    224     This is not recommendable for ReliefF, since it may be a lot slower. 
    225  
    226     The script below finds and prints out the best threshold for binarization 
    227     of an feature, that is, the threshold with which the resulting binary 
    228     feature will have the optimal ReliefF (or any other measure):: 
    229  
    230         thresh, score, distr = meas.best_threshold("petal length", data) 
    231         print "Best threshold: %5.3f (score %5.3f)" % (thresh, score) 
     371        To show the computation of thresholds, we shall use the Iris data set 
     372        (part of `scoring-info-iris.py`_, uses `iris.tab`_): 
     373 
     374        .. literalinclude:: code/scoring-info-iris.py 
     375            :lines: 13-15 
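        A sketch of the direct call, following the form shown in the
        previous revision of these docs (assumes `data` is the iris
        table; per the description above, each returned element is a
        tuple whose last element is optional)::

            meas = Orange.feature.scoring.Relief()
            for t in meas.threshold_function("petal length", data):
                print t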
     376 
     377    .. method:: best_threshold(attribute, instances) 
     378 
     379        Return the best threshold for binarization, that is, the threshold 
     380        with which the resulting binary feature will have the optimal 
     381        score. 
     382 
     383        The script below prints out the best threshold for 
     384        binarization of a feature, with ReliefF used for scoring (part of 
     385        `scoring-info-iris.py`_, uses `iris.tab`_): 
     386 
     387        .. literalinclude:: code/scoring-info-iris.py 
     388            :lines: 17-18 
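        The direct call, carried over from the previous revision of these
        docs (assumes `meas` is a :obj:`Relief` instance and `data` is
        the iris table)::

            thresh, score, distr = meas.best_threshold("petal length", data)
            print "Best threshold: %5.3f (score %5.3f)" % (thresh, score)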
    232389 
    233390.. class:: MeasureFromProbabilities 
     
    256413        Both default to relative frequencies. 
    257414 
    258 =========================== 
    259 Measures for Classification 
    260 =========================== 
    261  
    262 This script uses :obj:`GainRatio` and :obj:`Relief`. 
    263  
    264 .. literalinclude:: code/scoring-relief-gainRatio.py 
    265     :lines: 7- 
    266  
    267 Notice that on this data the ranks of features match:: 
    268      
    269     Relief GainRt Feature 
    270     0.613  0.752  physician-fee-freeze 
    271     0.255  0.444  el-salvador-aid 
    272     0.228  0.414  synfuels-corporation-cutback 
    273     0.189  0.382  crime 
    274     0.166  0.345  adoption-of-the-budget-resolution 
    275  
    276 Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. 
    277  
    278 .. index::  
    279    single: feature scoring; information gain 
    280  
    281 .. class:: InfoGain 
    282  
    283     Measures the expected decrease of entropy. 
    284  
    285 .. index::  
    286    single: feature scoring; gain ratio 
    287  
    288 .. class:: GainRatio 
    289  
    290     Information gain divided by the entropy of the feature's 
    291     value. Introduced by Quinlan in order to avoid overestimation of 
    292     multi-valued features. It has been shown, however, that it 
    293     still overestimates features with multiple values. 
    294  
    295 .. index::  
    296    single: feature scoring; gini index 
    297  
    298 .. class:: Gini 
    299  
    300     The probability that two randomly chosen examples will have different 
    301     classes; first introduced by Breiman. 
    302  
    303 .. index::  
    304    single: feature scoring; relevance 
    305  
    306 .. class:: Relevance 
    307  
    308     The potential value for decision rules. 
    309  
    310 .. index::  
    311    single: feature scoring; cost 
    312  
    313 .. class:: Cost 
    314  
    315     Evaluates features based on the "saving" achieved by knowing the value of 
    316     feature, according to the specified cost matrix. 
    317  
    318     .. attribute:: cost 
    319       
    320         Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
    321  
    322     If cost of predicting the first class of an example that is actually in 
    323     the second is 5, and the cost of the opposite error is 1, than an appropriate 
    324     measure can be constructed as follows:: 
    325  
    326         >>> meas = Orange.feature.scoring.Cost() 
    327         >>> meas.cost = ((0, 5), (1, 0)) 
    328         >>> meas(3, data) 
    329         0.083333350718021393 
    330  
    331     Knowing the value of feature 3 would decrease the 
    332     classification cost for approximately 0.083 per example. 
    333  
    334 .. index::  
    335    single: feature scoring; ReliefF 
    336  
    337 .. class:: Relief 
    338  
    339     Assesses features' ability to distinguish between very similar 
    340     examples from different classes.  First developed by Kira and Rendell 
    341     and then improved by Kononenko. 
    342  
    343     .. attribute:: k 
    344      
    345        Number of neighbours for each example. Default is 5. 
    346  
    347     .. attribute:: m 
    348      
    349         Number of reference examples. Default is 100. Set to -1 to take all the 
    350         examples. 
    351  
    352     .. attribute:: check_cached_data 
    353      
    354         Check if the cached data is changed with data checksum. Slow 
    355         on large tables.  Defaults to True. Disable it if you know that 
    356         the data will not change. 
    357  
    358     ReliefF is slow since it needs to find k nearest neighbours for each 
    359     of m reference examples.  As we normally compute ReliefF for all 
    360     features in the dataset, :obj:`Relief` caches the results. When called 
    361     to score a certain feature, it computes all feature scores. 
    362     When called again, it uses the stored results if the domain and the 
    363     data table have not changed (data table version and the data checksum 
    364     are compared). Caching will only work if you use the same instance. 
    365     So, don't do this:: 
    366  
    367         for attr in data.domain.attributes: 
    368             print Orange.feature.scoring.Relief(attr, data) 
    369  
    370     But this:: 
    371  
    372         meas = Orange.feature.scoring.Relief() 
    373         for attr in table.domain.attributes: 
    374             print meas(attr, data) 
    375  
    376     Class :obj:`Relief` works on discrete and continuous classes and thus  
    377     implements functionality of algorithms ReliefF and RReliefF. 
    378  
    379     .. note:: 
    380        Relief can also compute the threshold function, that is, the feature 
    381        quality at different thresholds for binarization. 
    382  
    383  
    384 ======================= 
    385 Measures for Regression 
    386 ======================= 
    387  
    388 :obj:`Relief` can be also used for regression. 
    389  
    390 .. index::  
    391    single: feature scoring; mean square error 
    392  
    393 .. class:: MSE 
    394  
    395     Implements the mean square error measure. 
    396  
    397     .. attribute:: unknowns_treatment 
    398      
    399         What to do with unknown values. See :obj:`Measure.unknowns_treatment`. 
    400  
    401     .. attribute:: m 
    402      
    403         Parameter for m-estimate of error. Default is 0 (no m-estimate). 
    404  
    405415============ 
    406416Other 
     
    410420   :members: 
    411421 
    412 .. autofunction:: Orange.feature.scoring.Distance 
    413  
    414 .. autoclass:: Orange.feature.scoring.DistanceClass 
    415    :members: 
    416     
    417 .. autofunction:: Orange.feature.scoring.MDL 
    418  
    419 .. autoclass:: Orange.feature.scoring.MDLClass 
    420    :members: 
    421  
    422422.. autofunction:: Orange.feature.scoring.merge_values 
    423423 
    424424.. autofunction:: Orange.feature.scoring.score_all 
    425425 
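A usage sketch for score_all (per the docstring edited in this changeset,
the measure defaults to Relief with k=20 and m=50, and the result is a
sorted list of (feature name, score) tuples; the dataset name is an
assumption)::

    import Orange

    data = Orange.data.Table("voting")
    for name, score in Orange.feature.scoring.score_all(data):
        print "%.3f %s" % (score, name)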
    426 ========== 
    427 References 
    428 ========== 
    429  
    430 * Igor Kononeko, Matjaz Kukar: Machine Learning and Data Mining,  
     426.. comment .. rubric:: References 
     427 
     428.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,  
    431429  Woodhead Publishing, 2007. 
    432430 
     
    492490        return [x[0] for x in measured] 
    493491 
    494 def Distance(attr=None, data=None): 
    495     """Instantiate :obj:`DistanceClass` and use it to return 
    496     the score of a given feature on given data. 
    497      
    498     :param attr: feature to score 
    499     :type attr: Orange.data.variable 
    500      
    501     :param data: data table used for feature scoring 
    502     :type data: Orange.data.table  
    503      
    504     """ 
    505     m = DistanceClass() 
    506     if attr != None and data != None: 
    507         return m(attr, data) 
    508     else: 
    509         return m 
    510  
    511 class DistanceClass(Measure): 
    512     """The 1-D feature distance measure described in Kononenko.""" 
     492class Distance(Measure): 
     493    """The 1-D feature distance measure described in [Kononenko2007]_.""" 
     494 
     495    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
     496    def __new__(cls, attr=None, data=None, apriori_dist=None, weightID=None): 
     497        self = Measure.__new__(cls) 
     498        if attr != None and data != None: 
     499            #self.__init__(**argkw) 
     500            return self.__call__(attr, data, apriori_dist, weightID) 
     501        else: 
     502            return self 
    513503 
    514504    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
     
    545535            return 0 
    546536 
    547 def MDL(attr=None, data=None): 
    548     """Instantiate :obj:`MDLClass` and use it n given data to 
    549     return the feature's score.""" 
    550     m = MDLClass() 
    551     if attr != None and data != None: 
    552         return m(attr, data) 
    553     else: 
    554         return m 
    555  
    556 class MDLClass(Measure): 
     537class MDL(Measure): 
    557538    """Score feature based on the minimum description length principle.""" 
     539 
     540    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
     541    def __new__(cls, attr=None, data=None, apriori_dist=None, weightID=None): 
     542        self = Measure.__new__(cls) 
     543        if attr != None and data != None: 
     544            #self.__init__(**argkw) 
     545            return self.__call__(attr, data, apriori_dist, weightID) 
     546        else: 
     547            return self 
    558548 
    559549    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
     
    663653      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50. 
    664654    :type measure: :obj:`Orange.feature.scoring.Measure`  
    665     :rtype: :obj:`list` a sorted list of tuples (feature name, score) 
     655    :rtype: :obj:`list`; a sorted list of tuples (feature name, score) 
    666656 
    667657    """ 
  • orange/doc/Orange/rst/code/scoring-info-iris.py

    r8115 r8138  
    33# Uses:        iris 
    44# Referenced:  Orange.feature.html#scoring 
    5 # Classes:     Orange.feature.scoring.EntropyDiscretization, Orange.feature.scoring.Measure, Orange.feature.scoring.InfoGain 
     5# Classes:     Orange.feature.discretization.EntropyDiscretization, Orange.feature.scoring.Measure, Orange.feature.scoring.InfoGain, Orange.feature.scoring.Relief 
    66 
    77import Orange 
     
    99 
    1010d1 = Orange.feature.discretization.EntropyDiscretization("petal length", table) 
    11 print Orange.feature.scoring.Relief(d1, table) 
     11print Orange.feature.scoring.InfoGain(d1, table) 
    1212 
    1313meas = Orange.feature.scoring.Relief() 
  • orange/fixes/fix_changed_names.py

    r8119 r8138  
    484484           "orngScalePolyvizData.orngScalePolyvizData": "Orange.preprocess.scaling.ScalePolyvizData", 
    485485           "orngScaleScatterPlotData.orngScaleScatterPlotData": "Orange.preprocess.scaling.ScaleScatterPlotData", 
     486 
     487           "orngEvalAttr.mergeAttrValues": "Orange.feature.scoring.merge_values", 
     488           "orngEvalAttr.MeasureAttribute_MDL": "Orange.feature.scoring.MDL", 
     489           "orngEvalAttr.MeasureAttribute_MDLClass": "Orange.feature.scoring.MDL", 
     490           "orngEvalAttr.MeasureAttribute_Distance": "Orange.feature.scoring.Distance", 
     491           "orngEvalAttr.MeasureAttribute_DistanceClass": "Orange.feature.scoring.Distance", 
     492           "orngEvalAttr.OrderAttributesByMeasure": "Orange.feature.scoring.OrderAttributes", 
     493 
    486494           } 
    487495 
  • orange/orngEvalAttr.py

    r8115 r8138  
    44 
    55MeasureAttribute_MDL = MDL 
    6 MeasureAttribute_MDLClass = MDLClass 
     6MeasureAttribute_MDLClass = MDL 
    77 
    88MeasureAttribute_Distance = Distance 
    9 MeasureAttribute_DistanceClass = DistanceClass 
     9MeasureAttribute_DistanceClass = Distance 
    1010 
    1111OrderAttributesByMeasure = OrderAttributes 