source: orange/docs/reference/rst/Orange.feature.scoring.rst @ 10393:4dbd54af3ac8

.. py:currentmodule:: Orange.feature.scoring

#####################
Scoring (``scoring``)
#####################

.. index:: feature scoring

.. index::
   single: feature; feature scoring

Feature score is an assessment of the usefulness of the feature for
prediction of the dependent (class) variable. Orange provides classes
that compute the common feature scores for :ref:`classification
<classification>` and :ref:`regression <regression>`.

The script below computes the information gain of feature "tear_rate"
in the Lenses data set (loaded into ``data``):

    >>> print Orange.feature.scoring.InfoGain("tear_rate", data)
    0.548795044422

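For completeness, ``data`` can be loaded like this (the Lenses data set
ships with Orange):

    >>> import Orange
    >>> data = Orange.data.Table("lenses")
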
Calling the scorer by passing the variable and the data to the
constructor, as above, is convenient. However, when scoring multiple
variables, some methods run much faster if the scorer is constructed,
stored and called for each variable.

    >>> gain = Orange.feature.scoring.InfoGain()
    >>> for feature in data.domain.features:
    ...     print feature.name, gain(feature, data)
    age 0.0393966436386
    prescription 0.0395109653473
    astigmatic 0.377005338669
    tear_rate 0.548795044422

The speed gain is most noticeable in Relief, which computes the scores of
all features in parallel.

The module also provides a convenience function :obj:`score_all` that
computes the scores for all attributes. The following example computes
feature scores, both with :obj:`score_all` and by scoring each feature
individually, and prints out the best three features.

.. literalinclude:: code/scoring-all.py
    :lines: 7-

The output::

    Feature scores for best three features (with score_all):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

    Feature scores for best three features (scored individually):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

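A rough sketch of an equivalent computation (assuming the voting data
set suggested by the output above, and that :obj:`score_all` returns
(name, score) pairs sorted by decreasing score)::

    import Orange

    data = Orange.data.Table("voting")

    # convenience function: scores all features at once
    print "Feature scores for best three features (with score_all):"
    for name, score in Orange.feature.scoring.score_all(data)[:3]:
        print "%5.3f %s" % (score, name)

    # scoring each feature individually with the same measure
    print "Feature scores for best three features (scored individually):"
    meas = Orange.feature.scoring.Relief(k=20, m=50)
    scores = sorted(((meas(f, data), f.name) for f in data.domain.features),
                    reverse=True)
    for score, name in scores[:3]:
        print "%5.3f %s" % (score, name)
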
.. comment
    The next script uses :obj:`GainRatio` and :obj:`Relief`.

    .. literalinclude:: code/scoring-relief-gainRatio.py
        :lines: 7-

    Notice that on this data the ranks of features match::

        Relief GainRt Feature
        0.613  0.752  physician-fee-freeze
        0.255  0.444  el-salvador-aid
        0.228  0.414  synfuels-corporation-cutback
        0.189  0.382  crime
        0.166  0.345  adoption-of-the-budget-resolution

It is also possible to score features that do not appear in the data
but can be computed from it. A typical case is a discretized feature:

.. literalinclude:: code/scoring-info-iris.py
    :lines: 7-11

.. _callingscore:

=======================
Calling scoring methods
=======================

Scorers can be called with different types of arguments. For instance,
when given the data, most scoring methods first compute the
corresponding contingency tables. If these are already known, they can
be given to the scorer instead of the data to save some time.

Not all classes accept all kinds of arguments. :obj:`Relief`,
for instance, only supports the form that takes data instances as input.

.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param data: data.
    :type data: :obj:`Orange.data.Table`
    :param weightID: id for meta-feature with weight.

    All scoring methods support this form.

.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param domain_contingency: a precomputed domain contingency for the data.
    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain`

.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution])

    :param contingency: contingency of the feature and the class variable.
    :type contingency: :obj:`Orange.statistics.contingency.VarClass`
    :param class_distribution: distribution of the class
      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`,
      it should be computed on instances where the feature value is
      defined. Otherwise, the class distribution should be the overall
      class distribution.
    :type class_distribution:
      :obj:`Orange.statistics.distribution.Distribution`
    :param apriori_class_distribution: optional and most often
      ignored. Useful if the scoring method makes probability estimates
      based on a priori class probabilities (such as the m-estimate).
    :return: Feature score; the higher the value, the better the feature.
      If the feature cannot be scored, :obj:`Score.Rejected` is returned.
    :rtype: float or :obj:`Score.Rejected`.

The code below demonstrates the different call signatures by computing
the score of the same feature with :obj:`GainRatio`.

.. literalinclude:: code/scoring-calls.py
    :lines: 7-

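A minimal sketch of the three call forms (assuming the Lenses data set,
and that the contingency and distribution classes accept a feature name
and a data table)::

    import Orange

    data = Orange.data.Table("lenses")
    gain = Orange.feature.scoring.GainRatio()

    # 1. feature and data
    print gain("tear_rate", data)

    # 2. feature and a precomputed domain contingency
    dcont = Orange.statistics.contingency.Domain(data)
    print gain("tear_rate", dcont)

    # 3. contingency of the feature and the overall class distribution
    cont = Orange.statistics.contingency.VarClass("tear_rate", data)
    dist = Orange.statistics.distribution.Distribution(data.domain.class_var, data)
    print gain(cont, dist)
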
.. _classification:

==========================================
Feature scoring in classification problems
==========================================

.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain.

.. index::
   single: feature scoring; information gain

.. class:: InfoGain

    Information gain; the expected decrease of entropy. See the `Wikipedia
    page <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gain ratio

.. class:: GainRatio

    Information gain ratio; information gain divided by the entropy of the
    feature's values. Introduced in [Quinlan1986]_ in order to avoid
    overestimation of multi-valued features. It has been shown, however,
    that it still overestimates features with multiple values. See
    `Wikipedia <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gini index

.. class:: Gini

    Gini index is the probability that two randomly chosen instances will
    have different classes. See `Gini coefficient on Wikipedia
    <http://en.wikipedia.org/wiki/Gini_coefficient>`_.

.. index::
   single: feature scoring; relevance

.. class:: Relevance

    The potential value of the feature for decision rules.

.. index::
   single: feature scoring; cost

.. class:: Cost

    Evaluates features based on the cost decrease achieved by knowing the
    value of the feature, according to the specified cost matrix.

    .. attribute:: cost

        Cost matrix, an instance of :obj:`Orange.misc.CostMatrix`.

    If the cost of predicting the first class for an instance that actually
    belongs to the second is 5, and the cost of the opposite error is 1,
    then an appropriate score can be constructed as follows::

        >>> meas = Orange.feature.scoring.Cost()
        >>> meas.cost = ((0, 5), (1, 0))
        >>> meas(3, data)
        0.083333350718021393

    Knowing the value of feature 3 would decrease the
    classification cost by approximately 0.083 per instance.

    .. comment   opposite error - is this term correct? TODO

.. index::
   single: feature scoring; ReliefF

.. class:: Relief

    Assesses features' ability to distinguish between very similar
    instances from different classes. This scoring method was first
    developed by Kira and Rendell and then improved by Kononenko. The
    class :obj:`Relief` works on discrete and continuous classes and
    thus implements ReliefF and RReliefF.

    ReliefF is slow since it needs to find the k nearest neighbours for
    each of m reference instances. As we normally compute ReliefF for
    all features in the dataset, :obj:`Relief` caches the results for
    all features when called to score a certain feature. When called
    again, it uses the stored results if the domain and the data table
    have not changed (the data table version and the data checksum are
    compared). Caching only works if the same object is reused.
    Constructing new instances of :obj:`Relief` for each feature,
    like this::

        for attr in data.domain.attributes:
            print Orange.feature.scoring.Relief(attr, data)

    runs much slower than reusing the same instance::

        meas = Orange.feature.scoring.Relief()
        for attr in data.domain.attributes:
            print meas(attr, data)

    .. attribute:: k

        Number of neighbours for each instance. Default is 5.

    .. attribute:: m

        Number of reference instances. Default is 100. When set to -1, all
        instances are used as references.

    .. attribute:: check_cached_data

        Check whether the cached data has changed; this may be slow on large
        tables. Defaults to :obj:`True`, but should be disabled when it
        is certain that the data will not change while the scorer is used.

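    For illustration, a scorer with non-default settings could be
    configured like this (a minimal sketch; the parameter values are
    arbitrary)::

        meas = Orange.feature.scoring.Relief(k=20, m=50)
        meas.check_cached_data = False  # data will not change while scoring
        for attr in data.domain.attributes:
            print meas(attr, data)
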
.. autoclass:: Orange.feature.scoring.Distance

.. autoclass:: Orange.feature.scoring.MDL

.. _regression:

======================================
Feature scoring in regression problems
======================================

.. class:: Relief

    Relief is used for regression in the same way as for
    classification (see :class:`Relief` in classification
    problems).

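    A minimal sketch, assuming the Housing regression data set included
    with Orange::

        import Orange

        data = Orange.data.Table("housing")
        meas = Orange.feature.scoring.Relief(k=20, m=50)
        for feature in data.domain.features:
            print "%-10s %.4f" % (feature.name, meas(feature, data))
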
.. index::
   single: feature scoring; mean square error

.. class:: MSE

    Implements the mean square error score.

    .. attribute:: unknowns_treatment

        Decides the treatment of unknown values. See
        :obj:`Score.unknowns_treatment`.

    .. attribute:: m

        Parameter for m-estimate of error. Default is 0 (no m-estimate).

============
Base Classes
============

The implemented scoring methods are subclasses of :obj:`Score`. Those
that compute statistics on conditional distributions of class values
given the feature values are derived from :obj:`ScoreFromProbabilities`.

.. class:: Score

    Abstract base class for feature scoring. Its attributes describe which
    types of features it can handle and which kind of data it requires.

    **Capabilities**

    .. attribute:: handles_discrete

        Indicates whether the scoring method can handle discrete features.

    .. attribute:: handles_continuous

        Indicates whether the scoring method can handle continuous features.

    .. attribute:: computes_thresholds

        Indicates whether the scoring method implements the :obj:`threshold_function`.

    **Input specification**

    .. attribute:: needs

        The type of data needed, indicated by one of the constants
        below. Classes that use :obj:`DomainContingency` will also handle
        generators. Those based on :obj:`Contingency_Class` will be able
        to take generators and domain contingencies.

        .. attribute:: Generator

            Constant. Indicates that the scoring method needs an instance
            generator on the input, as, for example, :obj:`Relief` does.

        .. attribute:: DomainContingency

            Constant. Indicates that the scoring method needs
            :obj:`Orange.statistics.contingency.Domain`.

        .. attribute:: Contingency_Class

            Constant. Indicates that the scoring method needs the contingency
            (:obj:`Orange.statistics.contingency.VarClass`), the feature
            distribution and the a priori class distribution (as most
            scoring methods do).

    **Treatment of unknown values**

    .. attribute:: unknowns_treatment

        Defined in classes that are able to treat unknown values. It
        should be set to one of the values below.

        .. attribute:: IgnoreUnknowns

            Constant. Instances for which the feature value is unknown are removed.

        .. attribute:: ReduceByUnknown

            Constant. Features with unknown values are
            punished. The feature quality is reduced by the proportion of
            unknown values. For impurity scores the impurity decreases
            only where the value is defined and stays the same otherwise.

        .. attribute:: UnknownsToCommon

            Constant. Undefined values are replaced by the most common value.

        .. attribute:: UnknownsAsValue

            Constant. Unknown values are treated as a separate value.

    **Methods**

    .. method:: __call__

        Abstract. See :ref:`callingscore`.

    .. method:: threshold_function(attribute, instances[, weightID])

        Abstract.

        Assess different binarizations of the continuous feature
        :obj:`attribute`. Return a list of tuples. The first element
        is a threshold (between two existing values), the second is
        the quality of the corresponding binary feature, and the third
        the distribution of instances below and above the threshold.
        Not all scorers return the third element.

        To show the computation of thresholds, we shall use the Iris
        data set:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 13-16

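        A rough sketch of such a call, assuming the Iris data is loaded
        into ``data`` and ReliefF as the scorer::

            meas = Orange.feature.scoring.Relief()
            for tup in meas.threshold_function("petal length", data):
                # each tuple starts with (threshold, score); some scorers
                # append the distribution of instances as a third element
                print tup[0], tup[1]
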
    .. method:: best_threshold(attribute, instances)

        Return the best threshold for binarization, that is, the threshold
        with which the resulting binary feature will have the optimal
        score.

        The script below prints out the best threshold for
        binarization of a feature. ReliefF is used for scoring:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 18-19

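        A one-line sketch of such a call, assuming the same setup as
        above (the exact form of the returned value depends on the
        scorer)::

            print meas.best_threshold("petal length", data)
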
.. class:: ScoreFromProbabilities

    Bases: :obj:`Score`

    Abstract base class for feature scoring methods that can be
    computed from contingency matrices.

    .. attribute:: estimator_constructor
    .. attribute:: conditional_estimator_constructor

        The classes that are used to estimate unconditional
        and conditional probabilities of classes, respectively.
        Defaults use relative frequencies; possible alternatives are,
        for instance, :obj:`ProbabilityEstimatorConstructor_m` and
        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows`
        (with estimator constructor again set to
        :obj:`ProbabilityEstimatorConstructor_m`), respectively.

=====
Other
=====

.. autoclass:: Orange.feature.scoring.OrderAttributes
   :members:

.. autofunction:: Orange.feature.scoring.score_all(data, score=Relief(k=20, m=50))

.. rubric:: Bibliography

.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,
  Woodhead Publishing, 2007.

.. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.

.. [Breiman1984] L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984.

.. [Kononenko1995] I Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995.