.. py:currentmodule:: Orange.feature.scoring

#####################
Scoring (``scoring``)
#####################

.. index:: feature scoring

.. index::
   single: feature; feature scoring

A feature score is an assessment of the usefulness of the feature for
predicting the dependent (class) variable.

To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into ``data``), use:

    >>> meas = Orange.feature.scoring.InfoGain()
    >>> print meas("tear_rate", data)
    0.548794925213

Other scoring methods are listed in :ref:`classification` and
:ref:`regression`. Various ways to call them are described in
:ref:`callingscore`.

Instead of first constructing the scoring object (e.g. ``InfoGain``) and
then using it, it is usually more convenient to do both in a single step::

    >>> print Orange.feature.scoring.InfoGain("tear_rate", data)
    0.548794925213

This shortcut is much slower for :obj:`Relief`, which can efficiently
compute scores for all features in parallel.

It is also possible to score features that do not appear in the data
but can be computed from it. Typical examples are discretized features:

.. literalinclude:: code/scoring-info-iris.py
    :lines: 7-11

The following example computes feature scores, both with
:obj:`score_all` and by scoring each feature individually, and prints out
the best three features.

.. literalinclude:: code/scoring-all.py
    :lines: 7-

The output::

    Feature scores for best three features (with score_all):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

    Feature scores for best three features (scored individually):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

.. comment
    The next script uses :obj:`GainRatio` and :obj:`Relief`.

    .. literalinclude:: code/scoring-relief-gainRatio.py
        :lines: 7-

    Notice that on this data the ranks of features match::

        Relief GainRt Feature
        0.613  0.752  physician-fee-freeze
        0.255  0.444  el-salvador-aid
        0.228  0.414  synfuels-corporation-cutback
        0.189  0.382  crime
        0.166  0.345  adoption-of-the-budget-resolution


.. _callingscore:

=======================
Calling scoring methods
=======================

To score a feature, use :obj:`Score.__call__`. There are different
function signatures, which enable optimizations. For instance,
most scoring methods first compute contingency tables from the
data. If these are already computed, they can be passed to the scorer
instead of the data.

Not all classes accept all kinds of arguments. :obj:`Relief`,
for instance, only supports the form with instances on the input.

.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param data: data.
    :type data: :class:`Orange.data.Table`
    :param weightID: id of the meta attribute with instance weights.

    All scoring methods support the first signature.

.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param domain_contingency: a precomputed domain contingency for the data.
    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain`

.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution])

    :param contingency: contingency of the feature and the class variable.
    :type contingency: :obj:`Orange.statistics.contingency.VarClass`
    :param class_distribution: distribution of the class
      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`,
      it should be computed on instances where the feature value is
      defined. Otherwise, the class distribution should be the overall
      class distribution.
    :type class_distribution:
      :obj:`Orange.statistics.distribution.Distribution`
    :param apriori_class_distribution: Optional and most often
      ignored. Useful if the scoring method makes any probability estimates
      based on apriori class probabilities (such as the m-estimate).
    :return: Feature score - the higher the value, the better the feature.
      If the feature cannot be scored, :obj:`Score.Rejected` is returned.
    :rtype: float or :obj:`Score.Rejected`.

The code below scores the same feature with :obj:`GainRatio`
using different calls.

.. literalinclude:: code/scoring-calls.py
    :lines: 7-
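
For orientation, here is a rough sketch of the three call forms (this is
not the content of ``scoring-calls.py``; the lenses data set from the
introduction and the chosen feature are assumptions for illustration)::

    import Orange

    data = Orange.data.Table("lenses")
    meas = Orange.feature.scoring.GainRatio()

    # 1. feature and data
    print meas("tear_rate", data)

    # 2. feature and a precomputed domain contingency
    dom_cont = Orange.statistics.contingency.Domain(data)
    print meas("tear_rate", dom_cont)

    # 3. contingency of the feature and the distribution of the class
    cont = Orange.statistics.contingency.VarClass("tear_rate", data)
    class_dist = Orange.statistics.distribution.Distribution(
        data.domain.class_var, data)
    print meas(cont, class_dist)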

.. _classification:

==========================================
Feature scoring in classification problems
==========================================

.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain.

.. index::
   single: feature scoring; information gain

.. class:: InfoGain

    Information gain; the expected decrease of entropy. See the `page on
    Wikipedia <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gain ratio

.. class:: GainRatio

    Information gain ratio; information gain divided by the entropy of the
    feature's values. Introduced in [Quinlan1986]_ in order to avoid overestimation
    of multi-valued features. It has been shown, however, that it still
    overestimates features with multiple values. See `Wikipedia
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gini index

.. class:: Gini

    Gini index is the probability that two randomly chosen instances will have different
    classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_.

.. index::
   single: feature scoring; relevance

.. class:: Relevance

    The potential value of the feature for constructing decision rules.

.. index::
   single: feature scoring; cost

.. class:: Cost

    Evaluates features based on the cost decrease achieved by knowing the value of
    the feature, according to the specified cost matrix.

    .. attribute:: cost

        Cost matrix, see :obj:`Orange.classification.CostMatrix` for details.

    If the cost of predicting the first class for an instance that is actually in
    the second is 5, and the cost of the opposite error is 1, then an appropriate
    scorer can be constructed as follows::

        >>> meas = Orange.feature.scoring.Cost()
        >>> meas.cost = ((0, 5), (1, 0))
        >>> meas(3, data)
        0.083333350718021393

    Knowing the value of feature 3 would decrease the
    classification cost by approximately 0.083 per instance.

    .. comment   opposite error - is this term correct? TODO

.. index::
   single: feature scoring; ReliefF

.. class:: Relief

    Assesses features' ability to distinguish between very similar
    instances from different classes. This scoring method was first
    developed by Kira and Rendell and then improved by Kononenko. The
    class :obj:`Relief` works on discrete and continuous classes and
    thus implements ReliefF and RReliefF.

    ReliefF is slow since it needs to find the k nearest neighbours for
    each of m reference instances. As we normally compute ReliefF for
    all features in the dataset, :obj:`Relief` caches the results for
    all features when called to score any feature. When called
    again, it uses the stored results if the domain and the data table
    have not changed (the data table version and the data checksum are
    compared). Caching only works if the same object is reused.
    Constructing new instances of :obj:`Relief` for each feature,
    like this::

        for attr in data.domain.attributes:
            print Orange.feature.scoring.Relief(attr, data)

    runs much slower than reusing the same instance::

        meas = Orange.feature.scoring.Relief()
        for attr in data.domain.attributes:
            print meas(attr, data)

    .. attribute:: k

        Number of neighbours for each instance. Default is 5.

    .. attribute:: m

        Number of reference instances. Default is 100. When set to -1, all
        instances are used as references.

    .. attribute:: check_cached_data

        Check whether the cached data has changed; this may be slow on large
        tables. Defaults to :obj:`True`, but should be disabled when it
        is certain that the data will not change while the scorer is used.
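
    For illustration, a sketch of adjusting these attributes (the particular
    values of ``k`` and ``m`` below are arbitrary choices, not
    recommendations)::

        meas = Orange.feature.scoring.Relief(k=20, m=50)
        # skip the checksum test when the data is guaranteed not to change
        meas.check_cached_data = False
        for attr in data.domain.attributes:
            print meas(attr, data), attr.name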

.. autoclass:: Orange.feature.scoring.Distance

.. autoclass:: Orange.feature.scoring.MDL

.. _regression:

======================================
Feature scoring in regression problems
======================================

.. class:: Relief

    Relief is used for regression in the same way as for
    classification (see :class:`Relief` in classification
    problems).

.. index::
   single: feature scoring; mean square error

.. class:: MSE

    Implements the mean square error score.

    .. attribute:: unknowns_treatment

        What to do with unknown values. See :obj:`Score.unknowns_treatment`.

    .. attribute:: m

        Parameter for m-estimate of error. Default is 0 (no m-estimate).
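
    A short sketch of its use (the housing data set and the value of ``m``
    are assumptions for illustration)::

        import Orange

        data = Orange.data.Table("housing")   # regression data set
        meas = Orange.feature.scoring.MSE()
        meas.m = 2   # m-estimate of the error
        for attr in data.domain.attributes:
            print "%.3f %s" % (meas(attr, data), attr.name)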

============
Base Classes
============

The implemented feature scoring methods are subclasses of
:obj:`Score`. Those that compute statistics on conditional
distributions of class values given the feature values are derived from
:obj:`ScoreFromProbabilities`.

.. class:: Score

    Abstract base class for feature scoring. Its attributes describe which
    types of features it can handle and which kind of data it requires.

    **Capabilities**

    .. attribute:: handles_discrete

        Indicates whether the scoring method can handle discrete features.

    .. attribute:: handles_continuous

        Indicates whether the scoring method can handle continuous features.

    .. attribute:: computes_thresholds

        Indicates whether the scoring method implements the :obj:`threshold_function`.
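
        For instance, a call to :obj:`threshold_function` can be guarded by
        this flag (a sketch only; the iris data set and the feature name are
        assumptions)::

            meas = Orange.feature.scoring.InfoGain()
            if meas.computes_thresholds:
                print meas.threshold_function("petal length", data)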

    **Input specification**

    .. attribute:: needs

        The type of data needed, indicated by one of the constants
        below. Classes that use :obj:`DomainContingency` will also handle
        generators. Those based on :obj:`Contingency_Class` will be able
        to take generators and domain contingencies.

        .. attribute:: Generator

            Constant. Indicates that the scoring method needs an instance
            generator on the input, as, for example, :obj:`Relief` does.

        .. attribute:: DomainContingency

            Constant. Indicates that the scoring method needs
            :obj:`Orange.statistics.contingency.Domain`.

        .. attribute:: Contingency_Class

            Constant. Indicates that the scoring method needs the contingency
            (:obj:`Orange.statistics.contingency.VarClass`), the feature
            distribution and the apriori class distribution (as do most
            scoring methods).

    **Treatment of unknown values**

    .. attribute:: unknowns_treatment

        Defined in classes that are able to treat unknown values. It
        should be set to one of the values below.

        .. attribute:: IgnoreUnknowns

            Constant. Instances for which the feature value is unknown are removed.

        .. attribute:: ReduceByUnknown

            Constant. Features with unknown values are
            penalized: the feature quality is reduced by the proportion of
            unknown values. For impurity scores the impurity decreases
            only where the value is defined and stays the same otherwise.

        .. attribute:: UnknownsToCommon

            Constant. Undefined values are replaced by the most common value.

        .. attribute:: UnknownsAsValue

            Constant. Unknown values are treated as a separate value.
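
        A sketch of setting this attribute, assuming the constants are
        exposed on :obj:`Score` as listed above::

            meas = Orange.feature.scoring.InfoGain()
            meas.unknowns_treatment = Orange.feature.scoring.Score.ReduceByUnknown
            print meas("tear_rate", data)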

    **Methods**

    .. method:: __call__

        Abstract. See :ref:`callingscore`.

    .. method:: threshold_function(attribute, instances[, weightID])

        Abstract.

        Assess different binarizations of the continuous feature
        :obj:`attribute`. Return a list of tuples, where the first element
        is a threshold (between two existing values), the second is
        the quality of the corresponding binary feature, and the third
        the distribution of instances below and above the threshold.
        Not all scorers return the third element.

        To show the computation of thresholds, we shall use the Iris
        data set:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 13-16

    .. method:: best_threshold(attribute, instances)

        Return the best threshold for binarization, that is, the threshold
        with which the resulting binary feature will have the optimal
        score.

        The script below prints out the best threshold for
        binarization of a feature, with ReliefF used for scoring:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 18-19

.. class:: ScoreFromProbabilities

    Bases: :obj:`Score`

    Abstract base class for feature scoring methods that can be
    computed from contingency matrices.

    .. attribute:: estimator_constructor
    .. attribute:: conditional_estimator_constructor

        The classes that are used to estimate unconditional
        and conditional probabilities of classes, respectively.
        Defaults use relative frequencies; possible alternatives are,
        for instance, :obj:`ProbabilityEstimatorConstructor_m` and
        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows`
        (with estimator constructor again set to
        :obj:`ProbabilityEstimatorConstructor_m`), respectively.

=====
Other
=====

.. autoclass:: Orange.feature.scoring.OrderAttributes
   :members:

.. autofunction:: Orange.feature.scoring.score_all

.. rubric:: Bibliography

.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,
  Woodhead Publishing, 2007.

.. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.

.. [Breiman1984] L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984.

.. [Kononenko1995] I Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995.