Changeset 8140:894cd4420ee5 in orange

Timestamp: 08/02/11 16:57:20
Author: markotoplak
Branch: default
Convert: 7ba870de9ccb6a4db27dc96148f3440210873e91
Message: Slight fixes in presentation of Orange.feature.scoring.
File: 1 edited
  • orange/Orange/feature/scoring.py

    r8138 r8140  
    99   single: feature; feature scoring 
    1010 
    11 Features scoring scores the relevance of features to the 
    12 class variable.  
    13  
    14 If `data` contains the "lenses" dataset, you can measure the quality of 
    15 feature "tear_rate" with information gain by :: 
     11Feature scoring is an assessment of the relevance of features to the 
     12class variable; the higher a feature is scored, the better it 
     13should be for prediction. 
     14 
     15You can score the feature "tear_rate" of the Lenses data set 
     16(loaded into `data`) with:: 
    1617 
    1718    >>> meas = Orange.feature.scoring.InfoGain() 
     
    1920    0.548794925213 
    2021 
    21 Orange also implements other measures; see 
     22Apart from information gain, you could also use other measures; 
    2223:ref:`classification` and :ref:`regression`. For various 
    2324ways to call them see :obj:`Measure.__call__`. 
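
The quantity computed by :obj:`InfoGain` can be sketched in plain Python. This is a conceptual illustration only, independent of Orange's implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Entropy of the class minus the expected entropy that remains
    after splitting the data by the feature's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: a perfectly informative feature scores the full class entropy.
print(info_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
```
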
     
    3132You shouldn't use this with :obj:`Relief`; see :obj:`Relief` for the explanation. 
    3233 
    33 It is also possible to score features that are not  
    34 in the domain. For instance, you can score discretized 
    35 features on the fly (slow with :obj:`Relief`): 
     34It is also possible to score features that are not in the domain. For 
     35instance, discretized features can be scored without producing a 
     36data table in advance (slow with :obj:`Relief`): 
    3637 
    3738.. literalinclude:: code/scoring-info-iris.py 
     
    9798 
    9899    Information gain divided by the entropy of the feature's 
    99     value. Introduced by Quinlan in order to avoid overestimation of 
    100     multi-valued features. It has been shown, however, that it 
    101     still overestimates features with multiple values. 
     100    value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
     101    of multi-valued features. It has been shown, however, that it still 
     102    overestimates features with multiple values. 
    102103 
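
The normalisation described above can be sketched in plain Python (not Orange's code). The second call shows how an id-like, four-valued feature is penalised even though its plain information gain would also be maximal:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of values, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain divided by the entropy of the feature itself,
    which penalises features with many values."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in groups.values())
    split_info = entropy(feature_values)
    return gain / split_info if split_info else 0.0

labels = ["yes", "yes", "no", "no"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # 1.0
print(gain_ratio(["1", "2", "3", "4"], labels))  # 0.5
```
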
    103104.. index::  
     
    106107.. class:: Gini 
    107108 
    108     The probability that two randomly chosen examples will have different 
    109     classes; first introduced by Breiman. 
     109    The probability that two randomly chosen instances will have different 
     110    classes; first introduced by Breiman [Breiman1984]_. 
    110111 
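
The impurity behind :obj:`Gini` can be sketched as follows (conceptual only; the scoring measure itself rewards the decrease of this impurity obtained by splitting on a feature):

```python
from collections import Counter

def gini(labels):
    """Probability that two instances drawn at random (with replacement)
    from `labels` belong to different classes: 1 minus the sum of
    squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # 0.5
print(gini(["yes"] * 4))                 # 0.0
```
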
    111112.. index::  
     
    128129        Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
    129130 
    130     If cost of predicting the first class of an example that is actually in 
     131    If the cost of predicting the first class of an instance that is actually in 
    131132    the second is 5, and the cost of the opposite error is 1, then an appropriate 
    132133    measure can be constructed as follows:: 
     
    138139 
    139140    Knowing the value of feature 3 would decrease the 
    140     classification cost for approximately 0.083 per example. 
     141    classification cost for approximately 0.083 per instance. 
    141142 
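
The idea can be sketched in plain Python. The class distributions below are hypothetical, chosen only to illustrate how knowing a feature's value can lower the minimal expected misclassification cost:

```python
def expected_cost(class_probs, cost):
    """Cost of the cheapest prediction under a class distribution:
    for each candidate prediction, sum cost[predicted][actual] weighted
    by the probability of the actual class, and take the minimum."""
    return min(
        sum(p * cost[pred][actual] for actual, p in enumerate(class_probs))
        for pred in range(len(cost))
    )

# cost[predicted][actual], as in the example above: mistaking the
# second class for the first costs 5, the opposite error costs 1.
cost = [[0, 5],
        [1, 0]]

before = expected_cost([0.5, 0.5], cost)  # without knowing the feature
# Suppose the feature splits the data into two equally likely branches
# with purer (hypothetical) class distributions:
after = 0.5 * expected_cost([0.9, 0.1], cost) + \
        0.5 * expected_cost([0.1, 0.9], cost)
print(before - after)  # expected cost saved per instance
```
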
    142143.. index::  
     
    146147 
    147148    Assesses features' ability to distinguish between very similar 
    148     examples from different classes.  First developed by Kira and Rendell 
    149     and then improved by Kononenko. 
     149    instances from different classes.  First developed by Kira and 
     150    Rendell and then improved by Kononenko. The class :obj:`Relief` 
     151    works on discrete and continuous classes and thus implements ReliefF 
     152    and RReliefF. 
    150153 
    151154    .. attribute:: k 
    152155     
    153        Number of neighbours for each example. Default is 5. 
     156       Number of neighbours for each instance. Default is 5. 
    154157 
    155158    .. attribute:: m 
    156159     
    157         Number of reference examples. Default is 100. Set to -1 to take all the 
    158         examples. 
     160        Number of reference instances. Default is 100. Set to -1 to take all the 
     161        instances. 
    159162 
    160163    .. attribute:: check_cached_data 
     
    164167        the data will not change. 
    165168 
    166     ReliefF is slow since it needs to find k nearest neighbours for each 
    167     of m reference examples.  As we normally compute ReliefF for all 
    168     features in the dataset, :obj:`Relief` caches the results. When called 
    169     to score a certain feature, it computes all feature scores. 
    170     When called again, it uses the stored results if the domain and the 
    171     data table have not changed (data table version and the data checksum 
    172     are compared). Caching will only work if you use the same instance. 
    173     So, don't do this:: 
     169    ReliefF is slow since it needs to find k nearest neighbours for 
     170    each of m reference instances. As we normally compute ReliefF for 
     171    all features in the dataset, :obj:`Relief` caches the results for 
     172    all features when called to score any single feature. When called 
     173    again, it uses the stored results if the domain and the data table 
     174    have not changed (data table version and the data checksum are 
     175    compared). Caching will only work if you use the same object. So, 
     176    don't do this:: 
    174177 
    175178        for attr in data.domain.attributes: 
     
    182185            print meas(attr, data) 
    183186 
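
The core ReliefF update can be sketched for discrete features in plain Python (a conceptual toy, without the weighting, continuous-value handling and caching of Orange's :obj:`Relief`):

```python
def relieff(data, labels, k=1):
    """Minimal ReliefF sketch for discrete features and classes.
    For each reference instance, feature weights are decreased by
    feature differences to the k nearest hits (same class) and
    increased by differences to the k nearest misses (other class)."""
    def dist(a, b):
        # Hamming distance between two instances.
        return sum(x != y for x, y in zip(a, b))

    n_feat = len(data[0])
    weights = [0.0] * n_feat
    m = len(data)
    for i, (x, y) in enumerate(zip(data, labels)):
        hits = sorted((j for j in range(m) if j != i and labels[j] == y),
                      key=lambda j: dist(x, data[j]))[:k]
        misses = sorted((j for j in range(m) if labels[j] != y),
                        key=lambda j: dist(x, data[j]))[:k]
        for f in range(n_feat):
            weights[f] -= sum(x[f] != data[j][f] for j in hits) / (m * k)
            weights[f] += sum(x[f] != data[j][f] for j in misses) / (m * k)
    return weights

# Feature 0 determines the class, feature 1 is noise:
data = [("a", "p"), ("a", "q"), ("b", "p"), ("b", "q")]
labels = ["yes", "yes", "no", "no"]
print(relieff(data, labels))  # [1.0, -1.0]
```
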
    184     Class :obj:`Relief` works on discrete and continuous classes and thus  
    185     implements functionality of algorithms ReliefF and RReliefF. 
    186  
    187187    .. note:: 
    188188       Relief can also compute the threshold function, that is, the feature 
     
    190190 
    191191.. autoclass:: Orange.feature.scoring.Distance 
    192    :members: 
    193192    
    194193.. autoclass:: Orange.feature.scoring.MDL 
    195    :members: 
    196194 
    197195.. _regression: 
     
    257255    .. attribute:: NeedsGenerator 
    258256 
    259         Constant. Indicates that the measure Needs an instance generator on the input (as, for example, the 
     257        Constant. Indicates that the measure needs an instance generator on the input (as, for example, the 
    260258        :obj:`Relief` measure). 
    261259 
     
    282280    .. attribute:: IgnoreUnknowns 
    283281 
    284         Constant. Examples for which the feature value is unknown are removed. 
     282        Constant. Instances for which the feature value is unknown are removed. 
    285283 
    286284    .. attribute:: ReduceByUnknown 
     
    289287        punished. The feature quality is reduced by the proportion of 
    290288        unknown values. For impurity measures the impurity decreases 
    291         only where the value is defined and stays the same otherwise, 
     289        only where the value is defined and stays the same otherwise. 
    292290 
    293291    .. attribute:: UnknownsToCommon 
     
    310308        :param weightID: id for meta-feature with weight. 
    311309 
    312         Abstract. All measures need to support `__call__` with these 
      310        Abstract. All measures need to support these 
     313311        parameters, described below. 
    314312 
     
    366364        element is a threshold (between two existing values), the second 
    367365        is the quality of the corresponding binary feature, and the last 
    368         the distribution of examples below and above the threshold. The 
      366        the distribution of instances below and above the threshold. The 
    369367        last element is optional. 
    370368 
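
The shape of these tuples can be illustrated with a plain-Python sketch that scores each candidate threshold of a continuous feature by the information gain of the corresponding binary split (conceptual only; the actual measure used by :obj:`Relief` differs):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def threshold_function(values, labels):
    """For each threshold between two adjacent feature values, return
    (threshold, information gain of the binary split, counts of
    instances below and above), mirroring the tuples described above."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    out = []
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        below = [y for _, y in pairs[:i]]
        above = [y for _, y in pairs[i:]]
        gain = base - (len(below) / n) * entropy(below) \
                    - (len(above) / n) * entropy(above)
        out.append((thr, gain, (len(below), len(above))))
    return out

# Toy continuous feature whose best cut separates the classes at 2.5:
print(threshold_function([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```
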
     
    428426.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,  
    429427  Woodhead Publishing, 2007. 
     428 
     429.. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986. 
     430 
     431.. [Breiman1984] L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984. 
     432 
    430433 
    431434.. _iris.tab: code/iris.tab 
     
    475478 
    476479        :param data: a data table used to score features 
    477         :type data: Orange.data.table 
     480        :type data: Orange.data.Table 
    478481 
    479482        :param weight: meta attribute that stores weights of instances 
    480         :type weight: Orange.data.variable 
     483        :type weight: Orange.data.variable.Variable 
    481484 
    482485        """ 
     
    504507    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
    505508    def __call__(self, attr, data, apriori_dist=None, weightID=None): 
    506         """Take :obj:`Orange.data.table` data table and score the given  
    507         :obj:`Orange.data.variable`. 
     509        """Score the given feature. 
    508510 
    509511        :param attr: feature to score 
    510         :type attr: Orange.data.variable 
     512        :type attr: Orange.data.variable.Variable 
    511513 
    512514        :param data: a data table used to score features 
     
    517519         
    518520        :param weightID: meta feature used to weight individual data instances 
    519         :type weightID: Orange.data.variable 
     521        :type weightID: Orange.data.variable.Variable 
    520522 
    521523        """ 
     
    549551    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
    550552    def __call__(self, attr, data, apriori_dist=None, weightID=None): 
    551         """Take :obj:`Orange.data.table` data table and score the given  
    552         :obj:`Orange.data.variable`. 
     553        """Score the given feature. 
    553554 
    554555        :param attr: feature to score 
    555         :type attr: Orange.data.variable 
     556        :type attr: Orange.data.variable.Variable 
    556557 
    557558        :param data: a data table used to score the feature 
     
    562563         
    563564        :param weightID: meta feature used to weight individual data instances 
    564         :type weightID: Orange.data.variable 
     565        :type weightID: Orange.data.variable.Variable 
    565566 
    566567        """ 
     
    648649 
    649650    :param data: data table should include a discrete class. 
    650     :type data: :obj:`Orange.data.table` 
     651    :type data: :obj:`Orange.data.Table` 
    651652    :param measure:  feature scoring function. Derived from 
    652       :obj:`Orange.feature.scoring.Measure`. Defaults to  
    653       :obj:`Orange.feature.scoring.Relief` with k=20 and m=50. 
    654     :type measure: :obj:`Orange.feature.scoring.Measure`  
     653      :obj:`~Orange.feature.scoring.Measure`. Defaults to  
     654      :obj:`~Orange.feature.scoring.Relief` with k=20 and m=50. 
     655    :type measure: :obj:`~Orange.feature.scoring.Measure`  
    655656    :rtype: :obj:`list`; a sorted list of tuples (feature name, score) 
    656657 