source: orange/orange/Orange/feature/scoring.py @ 9349:fa13a2c52fcd

1"""
2#####################
3Scoring (``scoring``)
4#####################
5
6.. index:: feature scoring
7
8.. index::
9   single: feature; feature scoring
10
11Feature score is an assessment of the usefulness of the feature for
prediction of the dependent (class) variable.

To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into ``data``) use:

    >>> meas = Orange.feature.scoring.InfoGain()
    >>> print meas("tear_rate", data)
    0.548794925213

Other scoring methods are listed in :ref:`classification` and
:ref:`regression`. Various ways to call them are described on
:ref:`callingscore`.

Instead of first constructing the scoring object (e.g. ``InfoGain``) and
then using it, it is usually more convenient to do both in a single step::

    >>> print Orange.feature.scoring.InfoGain("tear_rate", data)
    0.548794925213

This way is much slower for Relief, which can efficiently compute scores
for all features in parallel.

It is also possible to score features that do not appear in the data
but can be computed from it. Typical examples are discretized features:

.. literalinclude:: code/scoring-info-iris.py
    :lines: 7-11

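The included script is not reproduced here; a minimal sketch of the idea,
assuming the Iris data and an entropy-based discretizer exposed as
``Orange.feature.discretization.Entropy`` (the exact class name may differ
between Orange versions), could look like this::

    import Orange
    data = Orange.data.Table("iris")

    # construct a discretized version of a continuous feature; the new
    # feature does not appear in the data, but can be computed from it
    disc = Orange.feature.discretization.Entropy()
    d_petal_length = disc("petal length", data)

    # the discretized feature can be scored directly
    print Orange.feature.scoring.InfoGain(d_petal_length, data)
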
The following example computes feature scores, both with
:obj:`score_all` and by scoring each feature individually, and prints out
the best three features.

.. literalinclude:: code/scoring-all.py
    :lines: 7-

The output::

    Feature scores for best three features (with score_all):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

    Feature scores for best three features (scored individually):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

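The script itself is not shown here; a minimal sketch that produces this
kind of comparison, assuming the voting data set, could be::

    import Orange
    data = Orange.data.Table("voting")
    meas = Orange.feature.scoring.Relief(k=20, m=50)

    print "Feature scores for best three features (with score_all):"
    # score_all returns a list of (name, score) tuples, best first
    for name, score in Orange.feature.scoring.score_all(data, meas)[:3]:
        print "%.3f %s" % (score, name)
    print

    print "Feature scores for best three features (scored individually):"
    scores = sorted([(meas(attr, data), attr.name)
                     for attr in data.domain.attributes], reverse=True)
    for score, name in scores[:3]:
        print "%.3f %s" % (score, name)
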
.. comment
    The next script uses :obj:`GainRatio` and :obj:`Relief`.

    .. literalinclude:: code/scoring-relief-gainRatio.py
        :lines: 7-

    Notice that on this data the ranks of features match::

        Relief GainRt Feature
        0.613  0.752  physician-fee-freeze
        0.255  0.444  el-salvador-aid
        0.228  0.414  synfuels-corporation-cutback
        0.189  0.382  crime
        0.166  0.345  adoption-of-the-budget-resolution


.. _callingscore:

=======================
Calling scoring methods
=======================

To score a feature use :obj:`Score.__call__`. There are different
function signatures, which enable optimization. For instance,
most scoring methods first compute contingency tables from the
data. If these are already computed, they can be passed to the scorer
instead of the data.

Not all classes accept all kinds of arguments. :obj:`Relief`,
for instance, only supports the form with instances on the input.

.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.data.variable.Variable` or int or string
    :param data: data.
    :type data: `Orange.data.Table`
    :param weightID: id for meta-feature with weight.

    All scoring methods support the first signature.

.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.data.variable.Variable` or int or string
    :param domain_contingency:
    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain`

.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution])

    :param contingency:
    :type contingency: :obj:`Orange.statistics.contingency.VarClass`
    :param class_distribution: distribution of the class
      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`,
      it should be computed on instances where the feature value is
      defined. Otherwise, the class distribution should be the overall
      class distribution.
    :type class_distribution:
      :obj:`Orange.statistics.distribution.Distribution`
    :param apriori_class_distribution: Optional and most often
      ignored. Useful if the scoring method makes any probability estimates
      based on apriori class probabilities (such as the m-estimate).
    :return: Feature score; the higher the value, the better the feature.
      If the quality cannot be scored, return :obj:`Score.Rejected`.
    :rtype: float or :obj:`Score.Rejected`.

The code below scores the same feature with :obj:`GainRatio`
using different calls.

.. literalinclude:: code/scoring-calls.py
    :lines: 7-

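The script is not reproduced here; a sketch of the three signatures
described above, assuming the lenses data and the contingency and
distribution classes referenced in the signature descriptions, could
look like this::

    import Orange
    data = Orange.data.Table("lenses")
    meas = Orange.feature.scoring.GainRatio()

    # 1. feature and data
    print meas("tear_rate", data)

    # 2. feature and a domain contingency computed from the data
    dc = Orange.statistics.contingency.Domain(data)
    print meas("tear_rate", dc)

    # 3. contingency of the feature and the class distribution
    cont = Orange.statistics.contingency.VarClass("tear_rate", data)
    class_dist = Orange.statistics.distribution.Distribution(
        data.domain.class_var, data)
    print meas(cont, class_dist)
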
.. _classification:

==========================================
Feature scoring in classification problems
==========================================

.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain.

.. index::
   single: feature scoring; information gain

.. class:: InfoGain

    Information gain; the expected decrease of entropy. See `page on wikipedia
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gain ratio

.. class:: GainRatio

    Information gain ratio; information gain divided by the entropy of the
    feature's values. Introduced in [Quinlan1986]_ in order to avoid overestimation
    of multi-valued features. It has been shown, however, that it still
    overestimates features with multiple values. See `Wikipedia
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gini index

.. class:: Gini

    Gini index is the probability that two randomly chosen instances will have different
    classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_.

.. index::
   single: feature scoring; relevance

.. class:: Relevance

    The potential value for decision rules.

.. index::
   single: feature scoring; cost

.. class:: Cost

    Evaluates features based on the cost decrease achieved by knowing the
    value of the feature, according to the specified cost matrix.

    .. attribute:: cost

        Cost matrix; see :obj:`Orange.classification.CostMatrix` for details.

    If the cost of predicting the first class for an instance that is actually in
    the second is 5, and the cost of the opposite error is 1, then an appropriate
    score can be constructed as follows::

        >>> meas = Orange.feature.scoring.Cost()
        >>> meas.cost = ((0, 5), (1, 0))
        >>> meas(3, data)
        0.083333350718021393

    Knowing the value of feature 3 would decrease the
    classification cost by approximately 0.083 per instance.

    .. comment   opposite error - is this term correct? TODO

.. index::
   single: feature scoring; ReliefF

.. class:: Relief

    Assesses features' ability to distinguish between very similar
    instances from different classes. This scoring method was first
    developed by Kira and Rendell and then improved by Kononenko. The
    class :obj:`Relief` works on discrete and continuous classes and
    thus implements ReliefF and RReliefF.

    ReliefF is slow since it needs to find the k nearest neighbours for
    each of m reference instances. As we normally compute ReliefF for
    all features in the dataset, :obj:`Relief` caches the results for
    all features when it is called to score a certain feature. When called
    again, it uses the stored results if the domain and the data table
    have not changed (the data table version and the data checksum are
    compared). Caching will only work if you use the same object.
    Constructing new instances of :obj:`Relief` for each feature,
    like this::

        for attr in data.domain.attributes:
            print Orange.feature.scoring.Relief(attr, data)

    runs much slower than reusing the same instance::

        meas = Orange.feature.scoring.Relief()
        for attr in data.domain.attributes:
            print meas(attr, data)


    .. attribute:: k

        Number of neighbours for each instance. Default is 5.

    .. attribute:: m

        Number of reference instances. Default is 100. When -1, all
        instances are used as reference.

    .. attribute:: check_cached_data

        Check whether the cached data has changed; this may be slow on large
        tables. Defaults to :obj:`True`, but should be disabled when it
        is certain that the data will not change while the scorer is used.

.. autoclass:: Orange.feature.scoring.Distance

.. autoclass:: Orange.feature.scoring.MDL

.. _regression:

======================================
Feature scoring in regression problems
======================================

.. class:: Relief

    Relief is used for regression in the same way as for
    classification (see :class:`Relief` in classification
    problems).

.. index::
   single: feature scoring; mean square error

.. class:: MSE

    Implements the mean square error score.

    .. attribute:: unknowns_treatment

        What to do with unknown values. See :obj:`Score.unknowns_treatment`.

    .. attribute:: m

        Parameter for m-estimate of error. Default is 0 (no m-estimate).


============
Base Classes
============

The implemented methods for scoring the relevance of features are subclasses
of :obj:`Score`. Those that compute statistics on conditional
distributions of class values given the feature values are derived from
:obj:`ScoreFromProbabilities`.

.. class:: Score

    Abstract base class for feature scoring. Its attributes describe which
    types of features it can handle and which kind of data it requires.

    **Capabilities**

    .. attribute:: handles_discrete

        Indicates whether the scoring method can handle discrete features.

    .. attribute:: handles_continuous

        Indicates whether the scoring method can handle continuous features.

    .. attribute:: computes_thresholds

        Indicates whether the scoring method implements the :obj:`threshold_function`.

    **Input specification**

    .. attribute:: needs

        The type of data needed, indicated by one of the constants
        below. Classes that use :obj:`DomainContingency` will also handle
        generators. Those based on :obj:`Contingency_Class` will be able
        to take generators and domain contingencies.

        .. attribute:: Generator

            Constant. Indicates that the scoring method needs an instance
            generator on the input, as, for example, :obj:`Relief` does.

        .. attribute:: DomainContingency

            Constant. Indicates that the scoring method needs
            :obj:`Orange.statistics.contingency.Domain`.

        .. attribute:: Contingency_Class

            Constant. Indicates that the scoring method needs the contingency
            (:obj:`Orange.statistics.contingency.VarClass`), the feature
            distribution and the apriori class distribution (as most
            scoring methods do).

    **Treatment of unknown values**

    .. attribute:: unknowns_treatment

        Defined in classes that are able to treat unknown values. It
        should be set to one of the values below.

        .. attribute:: IgnoreUnknowns

            Constant. Instances for which the feature value is unknown are removed.

        .. attribute:: ReduceByUnknown

            Constant. Features with unknown values are
            punished. The feature quality is reduced by the proportion of
            unknown values. For impurity scores the impurity decreases
            only where the value is defined and stays the same otherwise.

        .. attribute:: UnknownsToCommon

            Constant. Undefined values are replaced by the most common value.

        .. attribute:: UnknownsAsValue

            Constant. Unknown values are treated as a separate value.

    **Methods**

    .. method:: __call__

        Abstract. See :ref:`callingscore`.

    .. method:: threshold_function(attribute, instances[, weightID])

        Abstract.

        Assess different binarizations of the continuous feature
        :obj:`attribute`. Return a list of tuples. The first element
        is a threshold (between two existing values), the second is
        the quality of the corresponding binary feature, and the third
        the distribution of instances below and above the threshold.
        Not all scorers return the third element.

        To show the computation of thresholds, we shall use the Iris
        data set:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 13-16

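        The included lines are not reproduced here; a minimal sketch of
        iterating over the returned thresholds, assuming :obj:`InfoGain`
        (the tuples are indexed because the third element is not always
        present), could be::

            data = Orange.data.Table("iris")
            meas = Orange.feature.scoring.InfoGain()
            for t in meas.threshold_function("petal length", data):
                # t[0] is the threshold, t[1] the score of the binarized feature
                print "%.3f: %.3f" % (t[0], t[1])
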
    .. method:: best_threshold(attribute, instances)

        Return the best threshold for binarization, that is, the threshold
        with which the resulting binary feature will have the optimal
        score.

        The script below prints out the best threshold for
        binarization of a feature, using ReliefF for scoring:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 18-19

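        A corresponding sketch, assuming the same Iris data as above and
        simply printing whatever the method returns::

            meas = Orange.feature.scoring.Relief()
            print meas.best_threshold("petal length", data)
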
.. class:: ScoreFromProbabilities

    Bases: :obj:`Score`

    Abstract base class for feature scoring methods that can be
    computed from contingency matrices.

    .. attribute:: estimator_constructor
    .. attribute:: conditional_estimator_constructor

        The classes that are used to estimate unconditional
        and conditional probabilities of classes, respectively.
        Defaults use relative frequencies; possible alternatives are,
        for instance, :obj:`ProbabilityEstimatorConstructor_m` and
        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows`
        (with estimator constructor again set to
        :obj:`ProbabilityEstimatorConstructor_m`), respectively.

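    For example, a scorer that uses m-estimates could be configured along
    these lines (a sketch only; the estimator constructors live in the
    :obj:`orange` core module, and the exact spelling of their attributes
    may differ between versions)::

        import orange, Orange
        meas = Orange.feature.scoring.GainRatio()
        meas.estimator_constructor = orange.ProbabilityEstimatorConstructor_m(m=2)
        cond = orange.ConditionalProbabilityEstimatorConstructor_ByRows()
        cond.estimator_constructor = orange.ProbabilityEstimatorConstructor_m(m=2)
        meas.conditional_estimator_constructor = cond
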
============
Other
============

.. autoclass:: Orange.feature.scoring.OrderAttributes
   :members:

.. autofunction:: Orange.feature.scoring.score_all

.. rubric:: Bibliography

.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,
  Woodhead Publishing, 2007.

.. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.

.. [Breiman1984] L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984.

.. [Kononenko1995] I Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995.

"""

import Orange.core as orange
import Orange.misc

from orange import MeasureAttribute as Score
from orange import MeasureAttributeFromProbabilities as ScoreFromProbabilities
from orange import MeasureAttribute_info as InfoGain
from orange import MeasureAttribute_gainRatio as GainRatio
from orange import MeasureAttribute_gini as Gini
from orange import MeasureAttribute_relevance as Relevance
from orange import MeasureAttribute_cost as Cost
from orange import MeasureAttribute_relief as Relief
from orange import MeasureAttribute_MSE as MSE


######
# from orngEvalAttr.py

class OrderAttributes:
    """Orders features by their scores.

    .. attribute:: score

        A scoring method derived from :obj:`~Orange.feature.scoring.Score`.
        If :obj:`None`, :obj:`Relief` with m=5 and k=10 is used.

    """
    def __init__(self, score=None):
        self.score = score

    def __call__(self, data, weight):
        """Score and order all features.

        :param data: a data table used to score features
        :type data: Orange.data.Table

        :param weight: meta attribute that stores weights of instances
        :type weight: Orange.data.variable.Variable

        """
        if self.score:
            measure = self.score
        else:
            measure = Relief(m=5, k=10)

        measured = [(attr, measure(attr, data, None, weight)) for attr in data.domain.attributes]
        measured.sort(lambda x, y: cmp(x[1], y[1]))
        return [x[0] for x in measured]

OrderAttributes = Orange.misc.deprecated_members({
          "measure": "score",
}, wrap_methods=[])(OrderAttributes)

class Distance(Score):
    """The :math:`1-D` distance is defined as information gain divided
    by joint entropy :math:`H_{CA}` (:math:`C` is the class variable
    and :math:`A` the feature):

    .. math::
        1-D(C,A) = \\frac{\\mathrm{Gain}(A)}{H_{CA}}
    """

    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"})
    def __new__(cls, attr=None, data=None, apriori_dist=None, weightID=None):
        self = Score.__new__(cls)
        if attr != None and data != None:
            #self.__init__(**argkw)
            return self.__call__(attr, data, apriori_dist, weightID)
        else:
            return self

    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"})
    def __call__(self, attr, data, apriori_dist=None, weightID=None):
        """Score the given feature.

        :param attr: feature to score
        :type attr: Orange.data.variable.Variable

        :param data: a data table used to score features
        :type data: Orange.data.Table

        :param apriori_dist:
        :type apriori_dist:

        :param weightID: meta feature used to weight individual data instances
        :type weightID: Orange.data.variable.Variable

        """
        import numpy
        from orngContingency import Entropy
        if attr in data.domain:  # if we receive attr as a string, convert it to a variable
            attr = data.domain[attr]
        attrClassCont = orange.ContingencyAttrClass(attr, data)
        dist = []
        for vals in attrClassCont.values():
            dist += list(vals)
        classAttrEntropy = Entropy(numpy.array(dist))
        infoGain = InfoGain(attr, data)
        if classAttrEntropy > 0:
            return float(infoGain) / classAttrEntropy
        else:
            return 0

class MDL(Score):
    """Minimum description length principle [Kononenko1995]_. Let
    :math:`n` be the number of instances, :math:`n_0` the number of
    classes, and :math:`n_{cj}` the number of instances with feature
    value :math:`j` and class value :math:`c`. Then the MDL score for
    feature :math:`A` is

    .. math::
         \\mathrm{MDL}(A) = \\frac{1}{n} \\Bigg[
         \\log\\binom{n}{n_{1.},\\cdots,n_{n_0 .}} - \\sum_j
         \\log \\binom{n_{.j}}{n_{1j},\\cdots,n_{n_0 j}} \\\\
         + \\log \\binom{n+n_0-1}{n_0-1} - \\sum_j \\log
         \\binom{n_{.j}+n_0-1}{n_0-1}
         \\Bigg]
    """

    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"})
    def __new__(cls, attr=None, data=None, apriori_dist=None, weightID=None):
        self = Score.__new__(cls)
        if attr != None and data != None:
            #self.__init__(**argkw)
            return self.__call__(attr, data, apriori_dist, weightID)
        else:
            return self

    @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"})
    def __call__(self, attr, data, apriori_dist=None, weightID=None):
        """Score the given feature.

        :param attr: feature to score
        :type attr: Orange.data.variable.Variable

        :param data: a data table used to score the feature
        :type data: Orange.data.Table

        :param apriori_dist:
        :type apriori_dist:

        :param weightID: meta feature used to weight individual data instances
        :type weightID: Orange.data.variable.Variable

        """
        attrClassCont = orange.ContingencyAttrClass(attr, data)
        classDist = orange.Distribution(data.domain.classVar, data).values()
        nCls = len(classDist)
        nEx = len(data)
        priorMDL = _logMultipleCombs(nEx, classDist) + _logMultipleCombs(nEx+nCls-1, [nEx, nCls-1])
        postPart1 = [_logMultipleCombs(sum(attrClassCont[key]), attrClassCont[key].values()) for key in attrClassCont.keys()]
        postPart2 = [_logMultipleCombs(sum(attrClassCont[key])+nCls-1, [sum(attrClassCont[key]), nCls-1]) for key in attrClassCont.keys()]
        ret = priorMDL
        for val in postPart1 + postPart2:
            ret -= val
        return ret / max(1, nEx)

# Compute the logarithm of the multinomial coefficient n! / (k1! * k2! * ... * kc!),
# where ks = [k1, k2, ...].
def _logMultipleCombs(n, ks):
    import math
    m = max(ks)
    ks.remove(m)
    resArray = []
    for (start, end) in [(m+1, n+1)] + [(1, k+1) for k in ks]:
        ret = 0
        curr = 1
        for val in range(int(start), int(end)):
            curr *= val
            if curr > 1e40:
                ret += math.log(curr)
                curr = 1
        ret += math.log(curr)
        resArray.append(ret)
    ret = resArray[0]
    for val in resArray[1:]:
        ret -= val
    return ret


@Orange.misc.deprecated_keywords({"attrList": "attr_list", "attrMeasure": "attr_score", "removeUnusedValues": "remove_unused_values"})
def merge_values(data, attr_list, attr_score, remove_unused_values=1):
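    """Greedily merge the values of the features in attr_list.

    The features are first combined into a single feature (a Cartesian
    product of their values); pairs of its values are then merged as long
    as this does not decrease the score computed by attr_score. The merged
    feature is returned; unless remove_unused_values is false, values that
    do not appear in the data are dropped from the returned feature.
    """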
    import orngCI
    #data = data.select([data.domain[attr] for attr in attr_list] + [data.domain.classVar])
    newData = data.select(attr_list + [data.domain.class_var])
    newAttr = orngCI.FeatureByCartesianProduct(newData, attr_list)[0]
    dist = orange.Distribution(newAttr, newData)
    activeValues = []
    for i in range(len(newAttr.values)):
        if dist[newAttr.values[i]] > 0:
            activeValues.append(i)
    currScore = attr_score(newAttr, newData)
    while 1:
        bestScore, bestMerge = currScore, None
        for i1, ind1 in enumerate(activeValues):
            oldInd1 = newAttr.get_value_from.lookupTable[ind1]
            for ind2 in activeValues[:i1]:
                newAttr.get_value_from.lookupTable[ind1] = ind2
                score = attr_score(newAttr, newData)
                if score >= bestScore:
                    bestScore, bestMerge = score, (ind1, ind2)
                newAttr.get_value_from.lookupTable[ind1] = oldInd1

        if bestMerge:
            ind1, ind2 = bestMerge
            currScore = bestScore
            for i, l in enumerate(newAttr.get_value_from.lookupTable):
                if not l.isSpecial() and int(l) == ind1:
                    newAttr.get_value_from.lookupTable[i] = ind2
            newAttr.values[ind2] = newAttr.values[ind2] + "+" + newAttr.values[ind1]
            del activeValues[activeValues.index(ind1)]
        else:
            break

    if not remove_unused_values:
        return newAttr

    reducedAttr = orange.EnumVariable(newAttr.name, values=[newAttr.values[i] for i in activeValues])
    reducedAttr.get_value_from = newAttr.get_value_from
    reducedAttr.get_value_from.class_var = reducedAttr
    return reducedAttr

######
# from orngFSS
@Orange.misc.deprecated_keywords({"measure": "score"})
def score_all(data, score=Relief(k=20, m=50)):
    """Assess the quality of features using the given measure and return
    a sorted list of tuples (feature name, score).

    :param data: data table; should include a discrete class.
    :type data: :obj:`Orange.data.Table`
    :param score: feature scoring function. Derived from
      :obj:`~Orange.feature.scoring.Score`. Defaults to
      :obj:`~Orange.feature.scoring.Relief` with k=20 and m=50.
    :type score: :obj:`~Orange.feature.scoring.Score`
    :rtype: :obj:`list`; a sorted list of tuples (feature name, score)

    """
    measl = []
    for i in data.domain.attributes:
        measl.append((i.name, score(i, data)))
    measl.sort(lambda x, y: cmp(y[1], x[1]))
    return measl