Timestamp:
02/07/12 22:08:39 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Message:

Polished discretization scripts.

File:
1 edited

  • docs/reference/rst/Orange.feature.scoring.rst

r9372 → r9988

Removed: ``.. automodule:: Orange.feature.scoring``

Added:

.. py:currentmodule:: Orange.feature.scoring


#####################
Scoring (``scoring``)
#####################

.. index:: feature scoring

.. index::
   single: feature; feature scoring

A feature score is an assessment of the usefulness of a feature for
predicting the dependent (class) variable.

To compute the information gain of the feature "tear_rate" in the Lenses
data set (loaded into ``data``) use:

    >>> meas = Orange.feature.scoring.InfoGain()
    >>> print meas("tear_rate", data)
    0.548794925213

Other scoring methods are listed in :ref:`classification` and
:ref:`regression`. Various ways to call them are described in
:ref:`callingscore`.

Instead of first constructing the scoring object (e.g. ``InfoGain``) and
then using it, it is usually more convenient to do both in a single step::

    >>> print Orange.feature.scoring.InfoGain("tear_rate", data)
    0.548794925213

This shorthand is much slower for Relief, which can efficiently compute
scores for all features at once.

It is also possible to score features that do not appear in the data
but can be computed from it. A typical case is discretized features:

.. literalinclude:: code/scoring-info-iris.py
    :lines: 7-11
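
A sketch of this idea on the iris data (the discretization call below,
``Orange.feature.discretization.Entropy``, is an assumption and may differ
from the class used in the script above)::

    import Orange

    data = Orange.data.Table("iris")
    meas = Orange.feature.scoring.InfoGain()

    # Build a discretized version of a continuous feature. The resulting
    # descriptor is not in the data, but is computed from it, so it can
    # still be scored. The discretizer class and call are assumptions.
    disc_petal = Orange.feature.discretization.Entropy("petal length", data)
    print meas(disc_petal, data)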

The following example computes feature scores, both with
:obj:`score_all` and by scoring each feature individually, and prints out
the best three features.

.. literalinclude:: code/scoring-all.py
    :lines: 7-

The output::

    Feature scores for best three features (with score_all):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

    Feature scores for best three features (scored individually):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

.. comment
    The next script uses :obj:`GainRatio` and :obj:`Relief`.

    .. literalinclude:: code/scoring-relief-gainRatio.py
        :lines: 7-

    Notice that on this data the ranks of features match::

        Relief GainRt Feature
        0.613  0.752  physician-fee-freeze
        0.255  0.444  el-salvador-aid
        0.228  0.414  synfuels-corporation-cutback
        0.189  0.382  crime
        0.166  0.345  adoption-of-the-budget-resolution


.. _callingscore:

=======================
Calling scoring methods
=======================

To score a feature use :obj:`Score.__call__`. There are different
call signatures, which enable optimizations. For instance,
most scoring methods first compute contingency tables from the
data. If these are already computed, they can be passed to the scorer
instead of the data.

Not all classes accept all kinds of arguments. :obj:`Relief`,
for instance, only supports the form with instances on the input.

.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param data: the data table.
    :type data: :class:`Orange.data.Table`
    :param weightID: id of the meta attribute with instance weights.

    All scoring methods support the first signature.

.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution])

    :param attribute: the chosen feature, either as a descriptor,
      index, or a name.
    :type attribute: :class:`Orange.feature.Descriptor` or int or string
    :param domain_contingency: precomputed contingencies for the features in the domain.
    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain`

.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution])

    :param contingency: contingency table of the feature and the class.
    :type contingency: :obj:`Orange.statistics.contingency.VarClass`
    :param class_distribution: distribution of the class
      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`,
      it should be computed on instances where the feature value is
      defined. Otherwise, the class distribution should be the overall
      class distribution.
    :type class_distribution:
      :obj:`Orange.statistics.distribution.Distribution`
    :param apriori_class_distribution: optional and most often
      ignored. Useful if the scoring method makes probability estimates
      based on a priori class probabilities (such as the m-estimate).
    :return: Feature score - the higher the value, the better the feature.
      If the feature cannot be scored, the method returns :obj:`Score.Rejected`.
    :rtype: float or :obj:`Score.Rejected`.

The code below scores the same feature with :obj:`GainRatio`
using different calls.

.. literalinclude:: code/scoring-calls.py
    :lines: 7-

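The sketch below illustrates the three call forms on the lenses data,
following the signatures documented above; the exact lines may differ from
``scoring-calls.py``, so treat it as an approximation::

    import Orange

    data = Orange.data.Table("lenses")
    meas = Orange.feature.scoring.GainRatio()

    # 1. feature and data table
    print meas("tear_rate", data)

    # 2. feature and a precomputed domain contingency
    domain_cont = Orange.statistics.contingency.Domain(data)
    print meas("tear_rate", domain_cont)

    # 3. feature-class contingency plus the class distribution
    cont = Orange.statistics.contingency.VarClass("tear_rate", data)
    class_dist = Orange.statistics.distribution.Distribution(
        data.domain.class_var, data)
    print meas(cont, class_dist)
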
.. _classification:

==========================================
Feature scoring in classification problems
==========================================

.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain.

.. index::
   single: feature scoring; information gain

.. class:: InfoGain

    Information gain; the expected decrease of entropy. See `page on wikipedia
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gain ratio

.. class:: GainRatio

    Information gain ratio; information gain divided by the entropy of the
    feature's values. Introduced in [Quinlan1986]_ in order to avoid
    overestimation of multi-valued features. It has been shown, however,
    that it still overestimates features with multiple values. See `Wikipedia
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

.. index::
   single: feature scoring; gini index

.. class:: Gini

    Gini index is the probability that two randomly chosen instances will have
    different classes. See `Gini coefficient on Wikipedia
    <http://en.wikipedia.org/wiki/Gini_coefficient>`_.

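For a quick comparison, a minimal sketch (assuming the lenses data is
available) that scores the same feature with the three scorers above::

    import Orange

    data = Orange.data.Table("lenses")

    # score one feature with several measures
    for name, scorer in [("InfoGain", Orange.feature.scoring.InfoGain()),
                         ("GainRatio", Orange.feature.scoring.GainRatio()),
                         ("Gini", Orange.feature.scoring.Gini())]:
        print "%-10s %.3f" % (name, scorer("tear_rate", data))
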
.. index::
   single: feature scoring; relevance

.. class:: Relevance

    Assesses the feature's potential value for constructing decision rules.

.. index::
   single: feature scoring; cost

.. class:: Cost

    Evaluates features based on the cost decrease achieved by knowing the value
    of the feature, according to the specified cost matrix.

    .. attribute:: cost

        Cost matrix; see :obj:`Orange.classification.CostMatrix` for details.

    If the cost of predicting the first class for an instance that is actually in
    the second is 5, and the cost of the opposite error is 1, then an appropriate
    scorer can be constructed as follows::

        >>> meas = Orange.feature.scoring.Cost()
        >>> meas.cost = ((0, 5), (1, 0))
        >>> meas(3, data)
        0.083333350718021393

    Knowing the value of feature 3 would decrease the
    classification cost by approximately 0.083 per instance.

    .. comment   opposite error - is this term correct? TODO

.. index::
   single: feature scoring; ReliefF

.. class:: Relief

    Assesses features' ability to distinguish between very similar
    instances from different classes. This scoring method was first
    developed by Kira and Rendell and later improved by Kononenko. The
    class :obj:`Relief` works on discrete and continuous classes and
    thus implements ReliefF and RReliefF.

    ReliefF is slow since it needs to find the k nearest neighbours for
    each of m reference instances. As ReliefF is normally computed for
    all features in the data set, :obj:`Relief` caches the results for
    all features when called to score a certain feature. When called
    again, it uses the stored results if the domain and the data table
    have not changed (the data table version and the data checksum are
    compared). Caching only works if the same object is reused.
    Constructing a new instance of :obj:`Relief` for each feature,
    like this::

        for attr in data.domain.attributes:
            print Orange.feature.scoring.Relief(attr, data)

    runs much slower than reusing the same instance::

        meas = Orange.feature.scoring.Relief()
        for attr in data.domain.attributes:
            print meas(attr, data)

    .. attribute:: k

        Number of neighbours for each instance. Default is 5.

    .. attribute:: m

        Number of reference instances. Default is 100. When set to -1, all
        instances are used as references.

    .. attribute:: check_cached_data

        Check whether the cached data has changed, which may be slow on large
        tables. Defaults to :obj:`True`, but should be disabled when it
        is certain that the data will not change while the scorer is used.

.. autoclass:: Orange.feature.scoring.Distance

.. autoclass:: Orange.feature.scoring.MDL

.. _regression:

======================================
Feature scoring in regression problems
======================================

.. class:: Relief

    Relief is used for regression in the same way as for
    classification (see :class:`Relief` in classification
    problems).

.. index::
   single: feature scoring; mean square error

.. class:: MSE

    Implements the mean square error score.

    .. attribute:: unknowns_treatment

        What to do with unknown values. See :obj:`Score.unknowns_treatment`.

    .. attribute:: m

        Parameter for the m-estimate of error. Default is 0 (no m-estimate).

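A sketch of ranking features on a regression data set with :obj:`Relief`
(the housing data set and the parameter values are assumptions; :obj:`MSE`
is called in the same way)::

    import Orange

    data = Orange.data.Table("housing")

    # the same Relief object scores features against a continuous class (RReliefF)
    meas = Orange.feature.scoring.Relief(k=5, m=100)

    scores = sorted(((meas(attr, data), attr.name)
                     for attr in data.domain.attributes), reverse=True)
    for score, name in scores[:3]:
        print "%.3f %s" % (score, name)
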
============
Base Classes
============

Implemented feature scoring methods are subclasses
of :obj:`Score`. Those that compute statistics on conditional
distributions of class values given the feature values are derived from
:obj:`ScoreFromProbabilities`.

.. class:: Score

    Abstract base class for feature scoring. Its attributes describe which
    types of features it can handle and which kind of data it requires.

    **Capabilities**

    .. attribute:: handles_discrete

        Indicates whether the scoring method can handle discrete features.

    .. attribute:: handles_continuous

        Indicates whether the scoring method can handle continuous features.

    .. attribute:: computes_thresholds

        Indicates whether the scoring method implements the :obj:`threshold_function`.

    **Input specification**

    .. attribute:: needs

        The type of data needed, indicated by one of the constants
        below. Classes that use :obj:`DomainContingency` also handle
        generators. Those based on :obj:`Contingency_Class` can take
        generators and domain contingencies as well.

        .. attribute:: Generator

            Constant. Indicates that the scoring method needs an instance
            generator as input (as, for example, :obj:`Relief` does).

        .. attribute:: DomainContingency

            Constant. Indicates that the scoring method needs
            :obj:`Orange.statistics.contingency.Domain`.

        .. attribute:: Contingency_Class

            Constant. Indicates that the scoring method needs the contingency
            (:obj:`Orange.statistics.contingency.VarClass`), the feature
            distribution and the apriori class distribution (as most
            scoring methods do).

    **Treatment of unknown values**

    .. attribute:: unknowns_treatment

        Defined in classes that are able to treat unknown values. It
        should be set to one of the values below.

        .. attribute:: IgnoreUnknowns

            Constant. Instances for which the feature value is unknown are removed.

        .. attribute:: ReduceByUnknown

            Constant. Features with unknown values are
            penalized: the feature quality is reduced by the proportion of
            unknown values. For impurity scores the impurity decreases
            only where the value is defined and stays the same otherwise.

        .. attribute:: UnknownsToCommon

            Constant. Undefined values are replaced by the most common value.

        .. attribute:: UnknownsAsValue

            Constant. Unknown values are treated as a separate value.

    **Methods**

    .. method:: __call__

        Abstract. See :ref:`callingscore`.

    .. method:: threshold_function(attribute, instances[, weightID])

        Abstract.

        Assess different binarizations of the continuous feature
        :obj:`attribute`. Return a list of tuples. The first element
        is a threshold (between two existing values), the second is
        the quality of the corresponding binary feature, and the third
        the distribution of instances below and above the threshold.
        Not all scorers return the third element.

        To show the computation of thresholds, we shall use the Iris
        data set:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 13-16

    .. method:: best_threshold(attribute, instances)

        Return the best threshold for binarization, that is, the threshold
        with which the resulting binary feature will have the optimal
        score.

        The script below prints out the best threshold for
        binarization of a feature. ReliefF is used for scoring:

        .. literalinclude:: code/scoring-info-iris.py
            :lines: 18-19

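        A combined sketch of both threshold methods on the iris data
        (a sketch only; the exact layout of the returned values may differ,
        so whole tuples are printed)::

            import Orange

            data = Orange.data.Table("iris")
            meas = Orange.feature.scoring.Relief(k=20, m=50)

            # candidate thresholds for binarizing a continuous feature,
            # returned as (threshold, score, ...) tuples
            for t in list(meas.threshold_function("petal length", data))[:3]:
                print t

            # the best threshold according to the same scorer
            print meas.best_threshold("petal length", data)
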
.. class:: ScoreFromProbabilities

    Bases: :obj:`Score`

    Abstract base class for feature scoring methods that can be
    computed from contingency matrices.

    .. attribute:: estimator_constructor
    .. attribute:: conditional_estimator_constructor

        The classes used to estimate unconditional and conditional
        probabilities of classes, respectively. Defaults use relative
        frequencies; possible alternatives are, for instance,
        :obj:`ProbabilityEstimatorConstructor_m` and
        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows`
        (with its estimator constructor again set to
        :obj:`ProbabilityEstimatorConstructor_m`).

============
Other
============

.. autoclass:: Orange.feature.scoring.OrderAttributes
   :members:

.. autofunction:: Orange.feature.scoring.score_all

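A short usage sketch of :obj:`score_all` on the voting data, assuming it
returns (name, score) pairs sorted by decreasing score, as in the example
output earlier on this page::

    import Orange

    data = Orange.data.Table("voting")

    # score all features at once with the default scoring method
    scores = Orange.feature.scoring.score_all(data)
    for name, score in scores[:3]:
        print "%.3f %s" % (score, name)
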
.. rubric:: Bibliography

.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,
  Woodhead Publishing, 2007.

.. [Quinlan1986] J. R. Quinlan: Induction of Decision Trees, Machine Learning, 1986.

.. [Breiman1984] L. Breiman et al.: Classification and Regression Trees, Chapman and Hall, 1984.

.. [Kononenko1995] I. Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995.