Changeset 8157:2037bc979a14 in orange


Timestamp:
08/08/11 14:25:58 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
6b432e00e78996e150dfe1460f75d52044da5ba8
Message:

Orange.feature.scoring update. References #882. Renamed Measure into
Score, put the call methods of Score in a separate section, other
small changes.

Location:
orange
Files:
3 edited

  • orange/Orange/feature/scoring.py

    r8143 r8157  
    99   single: feature; feature scoring 
    1010 
    11 Features scoring is an assesment the relevance of features to the 
    12 class variable; the higher a feature is scored, the better it 
    13 should be for prediction. 
    14  
    15 You can score the feature "tear_rate" of the Lenses data set 
    16 (loaded into `data`) with:: 
     11Feature scoring is the assessment of the usefulness of the feature for 
     12prediction of the dependent (class) variable. 
     13 
     14To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into `data`), use: 
    1715 
    1816    >>> meas = Orange.feature.scoring.InfoGain() 
     
    2018    0.548794925213 
    2119 
    22 Apart from information gain you could also use other measures; 
     20Apart from information gain, you could also use other scoring methods; see 
    2321:ref:`classification` and :ref:`regression`. For various 
    24 ways to call them see :obj:`Measure.__call__`. 
    25  
    26 You can also construct the object and use 
     22ways to call them see :ref:`callingscore`. 
     23 
     24It is possible to construct the object and use 
    2725it on-the-fly:: 
    2826 
     
    3028    0.548794925213 
    3129 
    32 You shouldn't use this with :obj:`Relief`; see :obj:`Relief` for the explanation. 
    33  
    34 It is also possible to score features that are not in the domain. For 
     30But constructing new instances for each feature is slow for 
     31scoring methods that use caching, such as :obj:`Relief`. 
     32 
     33Scoring features that are not in the domain is also possible. For 
    3534instance, discretized features can be scored without producing a 
    3635data table in advance (slow with :obj:`Relief`): 
     
    7776        0.166  0.345  adoption-of-the-budget-resolution 
    7877 
     78 
     79.. _callingscore: 
     80 
     81======================= 
     82Calling scoring methods 
     83======================= 
     84 
     85To score a feature, use :obj:`Score.__call__`. There are different 
     86function signatures, which enable optimization. For instance, 
     87if the contingency matrix has already been computed, you can speed 
     88up the computation by passing it to the scoring method (if it supports 
     89that form - most do). Otherwise the scoring method will have to compute the 
     90contingency itself. 
     91 
     92Not all classes will accept all kinds of arguments. :obj:`Relief`, 
     93for instance, only supports the form with instances on the input. 
     94 
     95.. method:: Score.__call__(attribute, instances[, apriori_class_distribution][, weightID]) 
     96 
     97    :param attribute: the chosen feature, either as a descriptor,  
     98      index, or a name. 
     99    :type attribute: :class:`Orange.data.variable.Variable` or int or string 
     100    :param instances: data. 
     101    :type instances: `Orange.data.Table` 
     102    :param weightID: id for meta-feature with weight. 
     103 
     104    All scoring methods need to support these parameters. 
     105 
     106.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution]) 
     107 
     108    :param attribute: the chosen feature, either as a descriptor,  
     109      index, or a name. 
     110    :type attribute: :class:`Orange.data.variable.Variable` or int or string 
     111    :param domain_contingency:  
     112    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain` 
     113 
     114.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution]) 
     115 
     116    :param contingency: 
     117    :type contingency: :obj:`Orange.statistics.contingency.VarClass` 
     118    :param class_distribution: distribution of the class 
     119      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, 
     120      it should be computed on instances where feature value is 
     121      defined. Otherwise, class distribution should be the overall 
     122      class distribution. 
     123    :type class_distribution:  
     124      :obj:`Orange.statistics.distribution.Distribution` 
     125    :param apriori_class_distribution: Optional and most often 
     126      ignored. Useful if the scoring method makes any probability estimates 
     127      based on apriori class probabilities (such as the m-estimate). 
     128    :return: Feature score - the higher the value, the better the feature. 
     129      If the quality cannot be scored, return :obj:`Score.Rejected`. 
     130    :rtype: float or :obj:`Score.Rejected`. 
     131 
     132The code below scores the same feature with :obj:`GainRatio` in 
     133different ways. 
     134 
     135.. literalinclude:: code/scoring-calls.py 
     136    :lines: 7- 
     137 
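The file code/scoring-calls.py referenced above is not included in this changeset. A minimal sketch of the three call forms it presumably demonstrates, assuming the voting data set shipped with Orange and the new-style contingency and distribution constructors (class_var and the exact constructor arguments are assumptions here, not taken from this diff):

    import Orange

    data = Orange.data.Table("voting")
    gain_ratio = Orange.feature.scoring.GainRatio()

    # 1. feature and data table -- the form every scoring method supports
    print gain_ratio("adoption-of-the-budget-resolution", data)

    # 2. feature and a precomputed domain contingency
    domain_cont = Orange.statistics.contingency.Domain(data)
    print gain_ratio("adoption-of-the-budget-resolution", domain_cont)

    # 3. contingency of the feature plus the class distribution
    cont = Orange.statistics.contingency.VarClass("adoption-of-the-budget-resolution", data)
    class_dist = Orange.statistics.distribution.Distribution(data.domain.class_var, data)
    print gain_ratio(cont, class_dist)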
    79138.. _classification: 
    80139 
    81140=========================== 
    82 Measures for Classification 
     141Feature scoring in classification problems 
    83142=========================== 
    84143 
     
    88147   single: feature scoring; information gain 
    89148 
    90 As all measures are subclasses of :class:`Measure`, see 
    91 :obj:`Measure.__call__` for usage. 
    92  
    93149.. class:: InfoGain 
    94150 
    95     Measures the expected decrease of entropy. 
     151    Information gain - the expected decrease of entropy. See `page on wikipedia 
     152    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
    96153 
    97154.. index::  
     
    100157.. class:: GainRatio 
    101158 
    102     Information gain divided by the entropy of the feature's 
     159    Information gain ratio - information gain divided by the entropy of the feature's 
    103160    value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
    104161    of multi-valued features. It has been shown, however, that it still 
    105     overestimates features with multiple values. 
     162    overestimates features with multiple values. See `Wikipedia 
     163    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
    106164 
    107165.. index::  
     
    110168.. class:: Gini 
    111169 
    112     The probability that two randomly chosen instances will have different 
    113     classes; first introduced by Breiman [Breiman1984]_. 
     170    Gini index is the probability that two randomly chosen instances will have different 
     171    classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_. 
    114172 
    115173.. index::  
     
    125183.. class:: Cost 
    126184 
    127     Evaluates features based on the "saving" achieved by knowing the value of 
     185    Evaluates features based on the cost decrease achieved by knowing the value of the 
    128186    feature, according to the specified cost matrix. 
    129187 
     
    134192    If the cost of predicting the first class of an instance that is actually in 
    135193    the second is 5, and the cost of the opposite error is 1, then an appropriate 
    136     measure can be constructed as follows:: 
     194    score can be constructed as follows:: 
     195 
     196    .. comment:: opposite error - is this term correct? TODO 
    137197 
    138198        >>> meas = Orange.feature.scoring.Cost() 
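The continuation of this example is cut off by the diff. A plausible sketch, assuming the Cost scorer exposes a cost attribute settable from a nested sequence (the attribute name, the matrix orientation and the values are illustrative, not taken from this changeset):

        >>> meas.cost = ((0.0, 5.0), (1.0, 0.0))   # assumed: cost of predicting class i for an instance of class j
        >>> meas("tear_rate", data)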
     
    150210 
    151211    Assesses features' ability to distinguish between very similar 
    152     instances from different classes.  First developed by Kira and 
     212    instances from different classes. This scoring method was 
     213    first developed by Kira and 
    153214    Rendell and then improved by Kononenko. The class :obj:`Relief` 
    154215    works on discrete and continuous classes and thus implements ReliefF 
     
    176237    again, it uses the stored results if the domain and the data table 
    177238    have not changed (data table version and the data checksum are 
    178     compared). Caching will only work if you use the same object. So, 
    179     don't do this:: 
     239    compared). Caching will only work if you use the same object.  
     240    Constructing new instances of :obj:`Relief` for each feature, 
     241    like this:: 
    180242 
    181243        for attr in data.domain.attributes: 
    182244            print Orange.feature.scoring.Relief(attr, data) 
    183245 
    184     But this:: 
     246    runs much slower than reusing the same instance:: 
    185247 
    186248        meas = Orange.feature.scoring.Relief() 
     
    199261 
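The faster variant is truncated by the diff above. As a sketch of the intended idiom (assuming `data` is any loaded data table), :obj:`Relief` is constructed once and reused, so its cached statistics survive across features:

    meas = Orange.feature.scoring.Relief()
    for attr in data.domain.attributes:
        print meas(attr, data)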
    200262======================= 
    201 Measures for Regression 
     263Feature scoring in regression problems 
    202264======================= 
    203265 
    204 As all measures are subclasses of :class:`Measure`, see 
    205 :obj:`Measure.__call__` for usage. 
    206  
    207266You can also use :obj:`Relief` for regression. 
    208267 
     
    212271.. class:: MSE 
    213272 
    214     Implements the mean square error measure. 
     273    Implements the mean square error score. 
    215274 
    216275    .. attribute:: unknowns_treatment 
    217276     
    218         What to do with unknown values. See :obj:`Measure.unknowns_treatment`. 
     277        What to do with unknown values. See :obj:`Score.unknowns_treatment`. 
    219278 
    220279    .. attribute:: m 
     
    229288 
    230289Implemented methods for scoring relevances of features to the class 
    231 are subclasses of :obj:`Measure`. Those that compute statistics on 
     290are subclasses of :obj:`Score`. Those that compute statistics on 
    232291conditional distributions of class values given the feature values are 
    233 derived from :obj:`MeasureFromProbabilities`. 
    234  
    235 .. class:: Measure 
     292derived from :obj:`ScoreFromProbabilities`. 
     293 
     294.. class:: Score 
    236295 
    237296    Abstract base class for feature scoring. Its attributes describe which 
     
    242301    .. attribute:: handles_discrete 
    243302     
    244         Indicates whether the measure can handle discrete features. 
     303        Indicates whether the scoring method can handle discrete features. 
    245304 
    246305    .. attribute:: handles_continuous 
    247306     
    248         Indicates whether the measure can handle continuous features. 
     307        Indicates whether the scoring method can handle continuous features. 
    249308 
    250309    .. attribute:: computes_thresholds 
    251310     
    252         Indicates whether the measure implements the :obj:`threshold_function`. 
     311        Indicates whether the scoring method implements the :obj:`threshold_function`. 
    253312 
    254313    **Input specification** 
     
    256315    .. attribute:: needs 
    257316     
    258         The type of data needed: :obj:`NeedsGenerator`, :obj:`NeedsDomainContingency`, 
    259         or :obj:`NeedsContingency_Class`. 
    260  
    261     .. attribute:: NeedsGenerator 
    262  
    263         Constant. Indicates that the measure needs an instance generator on the input (as, for example, the 
    264         :obj:`Relief` measure). 
    265  
    266     .. attribute:: NeedsDomainContingency 
    267  
    268         Constant. Indicates that the measure needs :obj:`Orange.statistics.contingency.Domain`. 
    269  
    270     .. attribute:: NeedsContingency_Class 
    271  
    272         Constant. Indicates, that the measure needs the contingency 
     317        The type of data needed: :obj:`Generator`, :obj:`DomainContingency`, 
     318        or :obj:`Contingency_Class`. 
     319 
     320    .. attribute:: Generator 
     321 
     322        Constant. Indicates that the scoring method needs an instance generator on the input (as, for example, 
     323        :obj:`Relief`). 
     324 
     325    .. attribute:: DomainContingency 
     326 
     327        Constant. Indicates that the scoring method needs :obj:`Orange.statistics.contingency.Domain`. 
     328 
     329    .. attribute:: Contingency_Class 
     330 
     331        Constant. Indicates that the scoring method needs the contingency 
    273332        (:obj:`Orange.statistics.contingency.VarClass`), feature 
    274333        distribution and the apriori class distribution (as most 
    275         measures). 
     334        scoring methods). 
    276335 
    277336    **Treatment of unknown values** 
     
    279338    .. attribute:: unknowns_treatment 
    280339 
    281         Not defined in :obj:`Measure` but defined in 
     340        Not defined in :obj:`Score` but defined in 
    282341        classes that are able to treat unknown values. Either 
    283342        :obj:`IgnoreUnknowns`, :obj:`ReduceByUnknown`. 
     
    292351        Constant. Features with unknown values are  
    293352        punished. The feature quality is reduced by the proportion of 
    294         unknown values. For impurity measures the impurity decreases 
     353        unknown values. For impurity scores the impurity decreases 
    295354        only where the value is defined and stays the same otherwise. 
    296355 
     
    305364    **Methods** 
    306365 
    307     .. method:: __call__(attribute, instances[, apriori_class_distribution][, weightID]) 
    308  
    309         :param attribute: the chosen feature, either as a descriptor,  
    310           index, or a name. 
    311         :type attribute: :class:`Orange.data.variable.Variable` or int or string 
    312         :param instances: data. 
    313         :type instances: `Orange.data.Table` 
    314         :param weightID: id for meta-feature with weight. 
    315  
    316         Abstract. All measures need to support these 
    317         parameters.  Described below. 
    318  
    319     .. method:: __call__(attribute, domain_contingency[, apriori_class_distribution]) 
    320  
    321         :param attribute: the chosen feature, either as a descriptor,  
    322           index, or a name. 
    323         :type attribute: :class:`Orange.data.variable.Variable` or int or string 
    324         :param domain_contingency:  
    325         :type domain_contingency: :obj:`Orange.statistics.contingency.Domain` 
    326  
    327         Abstract. Described below. 
    328          
    329     .. method:: __call__(contingency, class_distribution[, apriori_class_distribution]) 
    330  
    331         :param contingency: 
    332         :type contingency: :obj:`Orange.statistics.contingency.VarClass` 
    333         :param class_distribution: distribution of the class 
    334           variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, 
    335           it should be computed on instances where feature value is 
    336           defined. Otherwise, class distribution should be the overall 
    337           class distribution. 
    338         :type class_distribution:  
    339           :obj:`Orange.statistics.distribution.Distribution` 
    340         :param apriori_class_distribution: Optional and most often 
    341           ignored. Useful if the measure makes any probability estimates 
    342           based on apriori class probabilities (such as the m-estimate). 
    343         :return: Feature score - the higher the value, the better the feature. 
    344           If the quality cannot be measured, return :obj:`Measure.Rejected`. 
    345         :rtype: float or :obj:`Measure.Rejected`. 
    346  
    347         Abstract. 
    348  
    349         Different forms of `__call__` enable optimization.  For instance, 
    350         if contingency matrix has already been computed, you can speed 
    351         up the computation by passing it to the measure (if it supports 
    352         that form - most do). Otherwise the measure will have to compute the 
    353         contingency itself. 
    354  
    355         Not all classes will accept all kinds of arguments. :obj:`Relief`, 
    356         for instance, only supports the form with instances on the input. 
    357  
    358         The code sample below shows the use of :obj:`GainRatio` with 
    359         different call types. 
    360  
    361         .. literalinclude:: code/scoring-calls.py 
    362             :lines: 7- 
     366    .. method:: __call__ 
     367 
     368        Abstract. See :ref:`callingscore`. 
    363369 
    364370    .. method:: threshold_function(attribute, instances[, weightID]) 
     
    392398            :lines: 17-18 
    393399 
    394 .. class:: MeasureFromProbabilities 
    395  
    396     Bases: :obj:`Measure` 
    397  
    398     Abstract base class for feature quality measures that can be 
     400.. class:: ScoreFromProbabilities 
     401 
     402    Bases: :obj:`Score` 
     403 
     404    Abstract base class for feature scoring methods that can be 
    399405    computed from contingency matrices only. It relieves the derived classes 
    400406    from having to compute the contingency matrix by defining the first two 
     
    404410    .. attribute:: unknowns_treatment 
    405411      
    406         See :obj:`Measure.unknowns_treatment`. 
     412        See :obj:`Score.unknowns_treatment`. 
    407413 
    408414    .. attribute:: estimator_constructor 
     
    455461import Orange.misc 
    456462 
    457 from orange import MeasureAttribute as Measure 
    458 from orange import MeasureAttributeFromProbabilities as MeasureFromProbabilities 
     463from orange import MeasureAttribute as Score 
     464from orange import MeasureAttributeFromProbabilities as ScoreFromProbabilities 
    459465from orange import MeasureAttribute_info as InfoGain 
    460466from orange import MeasureAttribute_gainRatio as GainRatio 
     
    468474###### 
    469475# from orngEvalAttr.py 
     476 
    470477class OrderAttributes: 
    471478    """Orders features by their scores. 
    472479     
    473     .. attribute::  measure 
    474      
    475         A measure derived from :obj:`~Orange.feature.scoring.Measure`. 
    476         If None, :obj:`Relief` will be used. 
     480    .. attribute::  score 
     481     
     482        A scoring method derived from :obj:`~Orange.feature.scoring.Score`. 
     483        If None, :obj:`Relief` with m=5 and k=10 will be used. 
    477484     
    478485    """ 
    479     def __init__(self, measure=None): 
    480         self.measure = measure 
     486    def __init__(self, score=None): 
     487        self.score = score 
    481488 
    482489    def __call__(self, data, weight): 
     
    490497 
    491498        """ 
    492         if self.measure: 
    493             measure = self.measure 
     499        if self.score: 
     500            measure = self.score 
    494501        else: 
    495             measure = Relief(m=5,k=10) 
     502            measure = Relief(m=5, k=10) 
    496503 
    497504        measured = [(attr, measure(attr, data, None, weight)) for attr in data.domain.attributes] 
     
    499506        return [x[0] for x in measured] 
    500507 
    501 class Distance(Measure): 
    502     """The 1-D feature distance measure described in [Kononenko2007]_.""" 
     508OrderAttributes = Orange.misc.deprecated_members({ 
     509          "measure": "score", 
     510}, wrap_methods=[])(OrderAttributes) 
     511 
     512class Distance(Score): 
     513    """The 1-D feature distance score described in [Kononenko2007]_. TODO""" 
    503514 
    504515    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
    505516    def __new__(cls, attr=None, data=None, apriori_dist=None, weightID=None): 
    506         self = Measure.__new__(cls) 
     517        self = Score.__new__(cls) 
    507518        if attr != None and data != None: 
    508519            #self.__init__(**argkw) 
     
    543554            return 0 
    544555 
    545 class MDL(Measure): 
    546     """Score feature based on the minimum description length principle.""" 
     556class MDL(Score): 
     557    """Score feature based on the minimum description length principle. TODO.""" 
    547558 
    548559    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) 
    549560    def __new__(cls, attr=None, data=None, apriori_dist=None, weightID=None): 
    550         self = Measure.__new__(cls) 
     561        self = Score.__new__(cls) 
    551562        if attr != None and data != None: 
    552563            #self.__init__(**argkw) 
     
    607618 
    608619 
    609 @Orange.misc.deprecated_keywords({"attrList": "attr_list", "attrMeasure": "attr_measure", "removeUnusedValues": "remove_unused_values"}) 
    610 def merge_values(data, attr_list, attr_measure, remove_unused_values = 1): 
     620@Orange.misc.deprecated_keywords({"attrList": "attr_list", "attrMeasure": "attr_score", "removeUnusedValues": "remove_unused_values"}) 
     621def merge_values(data, attr_list, attr_score, remove_unused_values = 1): 
    611622    import orngCI 
    612623    #data = data.select([data.domain[attr] for attr in attr_list] + [data.domain.classVar]) 
     
    617628    for i in range(len(newAttr.values)): 
    618629        if dist[newAttr.values[i]] > 0: activeValues.append(i) 
    619     currScore = attr_measure(newAttr, newData) 
     630    currScore = attr_score(newAttr, newData) 
    620631    while 1: 
    621632        bestScore, bestMerge = currScore, None 
     
    624635            for ind2 in activeValues[:i1]: 
    625636                newAttr.get_value_from.lookupTable[ind1] = ind2 
    626                 score = attr_measure(newAttr, newData) 
     637                score = attr_score(newAttr, newData) 
    627638                if score >= bestScore: 
    628639                    bestScore, bestMerge = score, (ind1, ind2) 
     
    650661###### 
    651662# from orngFSS 
    652 def score_all(data, measure=Relief(k=20, m=50)): 
     663@Orange.misc.deprecated_keywords({"measure": "score"}) 
     664def score_all(data, score=Relief(k=20, m=50)): 
    653665    """Assess the quality of features using the given measure and return 
    654666    a sorted list of tuples (feature name, measure). 
     
    656668    :param data: data table should include a discrete class. 
    657669    :type data: :obj:`Orange.data.Table` 
    658     :param measure:  feature scoring function. Derived from 
    659       :obj:`~Orange.feature.scoring.Measure`. Defaults to  
     670    :param score:  feature scoring function. Derived from 
     671      :obj:`~Orange.feature.scoring.Score`. Defaults to  
    660672      :obj:`~Orange.feature.scoring.Relief` with k=20 and m=50. 
    661     :type measure: :obj:`~Orange.feature.scoring.Measure`  
     673    :type score: :obj:`~Orange.feature.scoring.Score` 
    662674    :rtype: :obj:`list`; a sorted list of tuples (feature name, score) 
    663675 
     
    665677    measl=[] 
    666678    for i in data.domain.attributes: 
    667         measl.append((i.name, measure(i, data))) 
     679        measl.append((i.name, score(i, data))) 
    668680    measl.sort(lambda x,y:cmp(y[1], x[1])) 
    669     
    670 #  for i in measl: 
    671 #    print "%25s, %6.3f" % (i[0], i[1]) 
    672681    return measl 
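To illustrate the renamed interface in this file, a brief sketch of ranking all features with the updated score_all (assuming the voting data set that ships with Orange; the output formatting is arbitrary):

    import Orange

    data = Orange.data.Table("voting")

    # The scoring method is now passed through the renamed 'score' argument;
    # the old 'measure' keyword keeps working via deprecated_keywords.
    ranked = Orange.feature.scoring.score_all(data, score=Orange.feature.scoring.InfoGain())
    for name, value in ranked[:5]:
        print "%-45s %.3f" % (name, value)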
  • orange/Orange/regression/earth.py

    r8153 r8157  
    765765from Orange.misc import _orange__new__ 
    766766 
    767 class ScoreEarthImportance(scoring.Measure): 
     767class ScoreEarthImportance(scoring.Score): 
    768768    """ Score features based on their importance in the Earth model using 
    769769    ``bagged_evimp``'s function return value. 
     
    775775     
    776776#    _cache = weakref.WeakKeyDictionary() 
    777     __new__ = _orange__new__(scoring.Measure) 
     777    __new__ = _orange__new__(scoring.Score) 
    778778         
    779779    def __init__(self, t=10, score_what="nsubsets", cached=True): 
     
    819819     
    820820     
    821 class ScoreRSS(scoring.Measure): 
    822     __new__ = _orange__new__(scoring.Measure) 
     821class ScoreRSS(scoring.Score): 
     822    __new__ = _orange__new__(scoring.Score) 
    823823    def __init__(self): 
    824824        self._cache_data = None 
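Since ScoreEarthImportance now derives from scoring.Score, it should be callable through the same interface as the other scoring methods. A rough sketch under that assumption (the housing data set, the parameter values and the call form are illustrative, not taken from this changeset):

    import Orange
    from Orange.regression import earth

    data = Orange.data.Table("housing")    # a regression data set shipped with Orange

    # t bagged Earth models are built; features are scored by the chosen statistic.
    score = earth.ScoreEarthImportance(t=5, score_what="nsubsets")
    for attr in data.domain.attributes:
        print attr.name, score(attr, data)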
  • orange/fixes/fix_changed_names.py

    r8138 r8157  
    5555           "orange.DomainContingency": "Orange.statistics.contingency.Domain", 
    5656           
    57            "orange.MeasureAttribute": "Orange.feature.scoring.Measure",  
     57           "orange.MeasureAttribute": "Orange.feature.scoring.Score",  
     58           "orange.MeasureAttributeFromProbabilities": "Orange.feature.scoring.ScoreFromProbabilities",  
    5859           "orange.MeasureAttribute_gainRatio": "Orange.feature.scoring.GainRatio", 
    5960           "orange.MeasureAttribute_relief": "Orange.feature.scoring.Relief", 
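The two entries added above extend the table of old orange names mapped to their new locations, which the automated migration fixer uses when rewriting scripts. Conceptually the lookup behaves like this simplified, hypothetical sketch (not the fixer's actual code):

    # Hypothetical illustration of the renaming table above.
    MAPPING = {
        "orange.MeasureAttribute": "Orange.feature.scoring.Score",
        "orange.MeasureAttributeFromProbabilities": "Orange.feature.scoring.ScoreFromProbabilities",
    }

    def new_name(old_dotted_name):
        # Names without a rule are left unchanged.
        return MAPPING.get(old_dotted_name, old_dotted_name)

    print new_name("orange.MeasureAttribute")    # -> Orange.feature.scoring.Score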