Changeset 8842:a4d68fbfb524 in orange


Timestamp:
08/30/11 10:10:30 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
4402f1df3c07a9a08324f952a4e8408e37941068
Message:

Orange.feature.scoring: corrected as Janez suggested.

Location:
orange
Files:
2 edited

  • orange/Orange/feature/scoring.py

    r8764 r8842  
    99   single: feature; feature scoring 
    1010 
    11 Feature scoring is assessment of the usefulness of the feature for  
     11Feature score is an assessment of the usefulness of the feature for  
    1212prediction of the dependent (class) variable. 
    1313 
     
    1818    0.548794925213 
    1919 
    20 Apart from information gain you could also use other scoring methods; 
    21 see :ref:`classification` and :ref:`regression`. Various 
    22 ways to call them are described on :ref:`callingscore`. 
    23  
    24 It is possible to construct the object and use 
    25 it on-the-fly:: 
     20Other scoring methods are listed in :ref:`classification` and 
     21:ref:`regression`. Various ways to call them are described on 
     22:ref:`callingscore`. 
     23 
     24Instead of first constructing the scoring object (e.g. ``InfoGain``) and 
     25then using it, it is usually more convenient to do both in a single step:: 
    2626 
    2727    >>> print Orange.feature.scoring.InfoGain("tear_rate", data) 
    2828    0.548794925213 
    2929 
    30 But constructing new instances for each feature is slow for 
    31 scoring methods that use caching, such as :obj:`Relief`. 
    32  
    33 Scoring features that are not in the domain is also possible. For 
    34 instance, discretized features can be scored without producing a 
    35 data table in advance (slow with :obj:`Relief`): 
     30This way is much slower for Relief, which can efficiently compute scores 
     31for all features in parallel. 
     32 
     33It is also possible to score features that do not appear in the data 
     34but can be computed from it. Typical cases are discretized features: 
    3635 
    3736.. literalinclude:: code/scoring-info-iris.py 
     
    8483To score a feature use :obj:`Score.__call__`. There are different 
    8584function signatures, which enable optimization. For instance, 
    86 if contingency matrix has already been computed, you can speed 
    87 up the computation by passing it to the scoring method (if it supports 
    88 that form - most do). Otherwise the scoring method will have to compute the 
    89 contingency itself. 
     85most scoring methods first compute contingency tables from the 
     86data. If these are already computed, they can be passed to the scorer 
     87instead of the data. 
    9088 
    9189Not all classes accept all kinds of arguments. :obj:`Relief`, 
    9290for instance, only supports the form with instances on the input. 
    9391 
    94 .. method:: Score.__call__(attribute, instances[, apriori_class_distribution][, weightID]) 
     92.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID]) 
    9593 
    9694    :param attribute: the chosen feature, either as a descriptor,  
    9795      index, or a name. 
    9896    :type attribute: :class:`Orange.data.variable.Variable` or int or string 
    99     :param instances: data. 
    100     :type instances: `Orange.data.Table` 
     97    :param data: data. 
     98    :type data: `Orange.data.Table` 
    10199    :param weightID: id for meta-feature with weight. 
    102100 
    103     All scoring methods need to support these parameters. 
     101    All scoring methods support the first signature. 
    104102 
    105103.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution]) 
     
    129127    :rtype: float or :obj:`Score.Rejected`. 
    130128 
    131 The code below scores the same feature with :obj:`GainRatio` in 
    132 different ways. 
     129The code below scores the same feature with :obj:`GainRatio`  
     130using different calls. 
    133131 
    134132.. literalinclude:: code/scoring-calls.py 
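The optimization described above can be sketched in plain Python. The scorer below (a hypothetical ``InfoGainScorer``, not Orange's class) accepts either raw columns or an already computed contingency, so a caller that has the tables at hand can skip rebuilding them:

```python
import math
from collections import Counter

def entropy(dist):
    """Entropy (in bits) of a class distribution given as a Counter."""
    n = sum(dist.values())
    return -sum(c / n * math.log2(c / n) for c in dist.values() if c)

class InfoGainScorer:
    """Toy scorer with two call forms: raw data, or a precomputed
    contingency table mapping feature values to class counts."""

    def __call__(self, feature_column=None, class_column=None, contingency=None):
        if contingency is None:
            # Build {feature value: Counter of class counts} from the data.
            contingency = {}
            for v, c in zip(feature_column, class_column):
                contingency.setdefault(v, Counter())[c] += 1
        class_dist = Counter()
        for d in contingency.values():
            class_dist.update(d)
        n = sum(class_dist.values())
        conditional = sum(sum(d.values()) / n * entropy(d)
                          for d in contingency.values())
        return entropy(class_dist) - conditional

score = InfoGainScorer()
# Called with raw data, the scorer builds the contingency itself ...
print(score(["a", "a", "b", "b"], ["+", "+", "-", "-"]))
# ... or the caller passes a precomputed contingency and saves that work.
print(score(contingency={"a": Counter({"+": 2}), "b": Counter({"-": 2})}))
```

Both calls return the same score; only the amount of work done inside the scorer differs.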
     
    148146.. class:: InfoGain 
    149147 
    150     Information gain - the expected decrease of entropy. See `page on wikipedia 
     148    Information gain; the expected decrease of entropy. See `page on wikipedia 
    151149    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
    152150 
     
    156154.. class:: GainRatio 
    157155 
    158     Information gain ratio - information gain divided by the entropy of the feature's 
     156    Information gain ratio; information gain divided by the entropy of the feature's 
    159157    value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
    160158    of multi-valued features. It has been shown, however, that it still 
     
    210208 
    211209    Assesses features' ability to distinguish between very similar 
    212     instances from different classes. This scoring method was 
    213     first developed by Kira and 
    214     Rendell and then improved by Kononenko. The class :obj:`Relief` 
    215     works on discrete and continuous classes and thus implements ReliefF 
    216     and RReliefF. 
    217  
    218     .. attribute:: k 
    219      
    220        Number of neighbours for each instance. Default is 5. 
    221  
    222     .. attribute:: m 
    223      
    224         Number of reference instances. Default is 100. Set to -1 to take all the 
    225         instances. 
    226  
    227     .. attribute:: check_cached_data 
    228      
    229         Check if the cached data is changed with data checksum. Slow 
    230         on large tables.  Defaults to :obj:`True`. Disable it if you know that 
    231         the data will not change. 
     210    instances from different classes. This scoring method was first 
     211    developed by Kira and Rendell and then improved by Kononenko. The 
     212    class :obj:`Relief` works on discrete and continuous classes and 
     213    thus implements ReliefF and RReliefF. 
    232214 
    233215    ReliefF is slow since it needs to find k nearest neighbours for 
     
    250232            print meas(attr, data) 
    251233 
    252     .. note:: 
    253        Relief can also compute the threshold function, that is, the feature 
    254        quality at different thresholds for binarization. 
     234 
     235    .. attribute:: k 
     236     
     237       Number of neighbours for each instance. Default is 5. 
     238 
     239    .. attribute:: m 
     240     
     241        Number of reference instances. Default is 100. When -1, all 
     242        instances are used as reference. 
     243 
     244    .. attribute:: check_cached_data 
     245     
     246        Check if the cached data is changed, which may be slow on large 
     247        tables.  Defaults to :obj:`True`, but should be disabled when it 
     248        is certain that the data will not change while the scorer is used. 
    255249 
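The neighbour-based update behind :obj:`Relief` can be sketched as follows. This is a simplified ReliefF for numeric features with Manhattan distance, not Orange's implementation, and the data and parameter values are made up for illustration:

```python
import random

def relieff(data, classes, k=5, m=100, seed=0):
    """Simplified ReliefF sketch: a feature is rewarded when its value
    differs on nearest instances of *other* classes (near misses) and
    penalized when it differs on nearest instances of the *same* class
    (near hits)."""
    n_feat = len(data[0])
    weights = [0.0] * n_feat
    rng = random.Random(seed)
    # m == -1 means: use every instance as a reference, as described above.
    refs = (list(range(len(data))) if m == -1
            else [rng.randrange(len(data)) for _ in range(m)])
    for i in refs:
        # All other instances, nearest first (Manhattan distance).
        ranked = sorted((sum(abs(a - b) for a, b in zip(data[i], data[j])), j)
                        for j in range(len(data)) if j != i)
        hits = [j for _, j in ranked if classes[j] == classes[i]][:k]
        misses = [j for _, j in ranked if classes[j] != classes[i]][:k]
        for f in range(n_feat):
            weights[f] += (sum(abs(data[i][f] - data[j][f]) for j in misses)
                           - sum(abs(data[i][f] - data[j][f]) for j in hits)
                           ) / (len(refs) * k)
    return weights

# Feature 0 separates the classes, feature 1 is noise.
data = [(0.0, 0.3), (0.1, 0.9), (0.2, 0.1), (0.8, 0.8), (0.9, 0.2), (1.0, 0.6)]
classes = ["a", "a", "a", "b", "b", "b"]
print(relieff(data, classes, k=2, m=20))
```

The sketch also shows why caching pays off: all k nearest hits and misses are found once per reference instance and contribute to every feature's weight at the same time, which is what scoring one feature at a time throws away.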
    256250.. autoclass:: Orange.feature.scoring.Distance 
     
    264258====================================== 
    265259 
    266 You can also use :obj:`Relief` for regression. 
     260.. class:: Relief 
     261 
     262    Relief is used for regression in the same way as for 
     263    classification (see :class:`Relief` in classification 
     264    problems). 
    267265 
    268266.. index::  
     
    287285============ 
    288286 
    289 Implemented methods for scoring relevances of features to the class 
    290 are subclasses of :obj:`Score`. Those that compute statistics on 
    291 conditional distributions of class values given the feature values are 
    292 derived from :obj:`ScoreFromProbabilities`. 
     287Implemented methods for scoring relevances of features are subclasses 
     288of :obj:`Score`. Those that compute statistics on conditional 
     289distributions of class values given the feature values are derived from 
     290:obj:`ScoreFromProbabilities`. 
    293291 
    294292.. class:: Score 
    295293 
    296294    Abstract base class for feature scoring. Its attributes describe which 
    297     features it can handle and the required data. 
     295    types of features it can handle and which kind of data it requires. 
    298296 
    299297    **Capabilities** 
     
    315313    .. attribute:: needs 
    316314     
    317         The type of data needed: :obj:`Generator`, :obj:`DomainContingency`, 
    318         or :obj:`Contingency_Class`. 
    319  
    320     .. attribute:: Generator 
    321  
    322         Constant. Indicates that the scoring method needs an instance generator on the input (as, for example, 
    323         :obj:`Relief`). 
    324  
    325     .. attribute:: DomainContingency 
    326  
    327         Constant. Indicates that the scoring method needs :obj:`Orange.statistics.contingency.Domain`. 
    328  
    329     .. attribute:: Contingency_Class 
    330  
    331         Constant. Indicates, that the scoring method needs the contingency 
    332         (:obj:`Orange.statistics.contingency.VarClass`), feature 
    333         distribution and the apriori class distribution (as most 
    334         scoring methods). 
     315        The type of data needed, indicated by one of the constants 
     316        below. Classes that use :obj:`DomainContingency` will also handle 
     317        generators. Those based on :obj:`Contingency_Class` will be able 
     318        to take generators and domain contingencies. 
     319 
     320        .. attribute:: Generator 
     321 
     322            Constant. Indicates that the scoring method needs an instance 
     323            generator on the input as, for example, :obj:`Relief`. 
     324 
     325        .. attribute:: DomainContingency 
     326 
     327            Constant. Indicates that the scoring method needs 
     328            :obj:`Orange.statistics.contingency.Domain`. 
     329 
     330        .. attribute:: Contingency_Class 
     331 
     332            Constant. Indicates that the scoring method needs the contingency 
     333            (:obj:`Orange.statistics.contingency.VarClass`), feature 
     334            distribution and the apriori class distribution (as most 
     335            scoring methods). 
    335336 
    336337    **Treatment of unknown values** 
     
    338339    .. attribute:: unknowns_treatment 
    339340 
    340         Not defined in :obj:`Score` but defined in 
    341         classes that are able to treat unknown values. Either 
    342         :obj:`IgnoreUnknowns`, :obj:`ReduceByUnknown`. 
    343         :obj:`UnknownsToCommon`, or :obj:`UnknownsAsValue`. 
    344  
    345     .. attribute:: IgnoreUnknowns 
    346  
    347         Constant. Instances for which the feature value is unknown are removed. 
    348  
    349     .. attribute:: ReduceByUnknown 
    350  
    351         Constant. Features with unknown values are  
    352         punished. The feature quality is reduced by the proportion of 
    353         unknown values. For impurity scores the impurity decreases 
    354         only where the value is defined and stays the same otherwise. 
    355  
    356     .. attribute:: UnknownsToCommon 
    357  
    358         Constant. Undefined values are replaced by the most common value. 
    359  
    360     .. attribute:: UnknownsAsValue 
    361  
    362         Constant. Unknown values are treated as a separate value. 
     341        Defined in classes that are able to treat unknown values. It 
     342        should be set to one of the values below. 
     343 
     344        .. attribute:: IgnoreUnknowns 
     345 
     346            Constant. Instances for which the feature value is unknown are removed. 
     347 
     348        .. attribute:: ReduceByUnknown 
     349 
     350            Constant. Features with unknown values are  
     351            punished. The feature quality is reduced by the proportion of 
     352            unknown values. For impurity scores the impurity decreases 
     353            only where the value is defined and stays the same otherwise. 
     354 
     355        .. attribute:: UnknownsToCommon 
     356 
     357            Constant. Undefined values are replaced by the most common value. 
     358 
     359        .. attribute:: UnknownsAsValue 
     360 
     361            Constant. Unknown values are treated as a separate value. 
    363362 
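The four treatments can be illustrated on a toy column in plain Python, with ``None`` standing for an unknown value. This mirrors the descriptions above, not Orange's internals:

```python
from collections import Counter

column = ["red", "green", None, "red", None, "red"]
known = [v for v in column if v is not None]

# IgnoreUnknowns: instances with an unknown value are removed.
ignore_unknowns = known

# UnknownsToCommon: unknowns are replaced by the most common value.
most_common = Counter(known).most_common(1)[0][0]
unknowns_to_common = [most_common if v is None else v for v in column]

# UnknownsAsValue: unknown becomes a separate value of its own.
unknowns_as_value = ["?" if v is None else v for v in column]

# ReduceByUnknown: a score computed on the known part is scaled down
# by the proportion of known (defined) values.
known_fraction = len(known) / len(column)

print(ignore_unknowns, unknowns_to_common, unknowns_as_value, known_fraction)
```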
    364363    **Methods** 
     
    373372         
    374373        Assess different binarizations of the continuous feature 
    375         :obj:`attribute`.  Return a list of tuples, where the first 
    376         element is a threshold (between two existing values), the second 
    377         is the quality of the corresponding binary feature, and the last 
    378         the distribution of instancs below and above the threshold. The 
    379         last element is optional. 
    380  
    381         To show the computation of thresholds, we shall use the Iris data set 
    382         (part of `scoring-info-iris.py`_, uses `iris.tab`_): 
     374        :obj:`attribute`.  Return a list of tuples. The first element 
     375        is a threshold (between two existing values), the second is 
     376        the quality of the corresponding binary feature, and the third 
     377        the distribution of instances below and above the threshold. 
     378        Not all scorers return the third element. 
     379 
     380        To show the computation of thresholds, we shall use the Iris 
     381        data set: 
    383382 
    384383        .. literalinclude:: code/scoring-info-iris.py 
    385             :lines: 13-15 
     384            :lines: 13-16 
    386385 
    387386    .. method:: best_threshold(attribute, instances) 
     
    392391 
    393392        The script below prints out the best threshold for 
    394         binarization of an feature. ReliefF is used scoring: (part of 
    395         `scoring-info-iris.py`_, uses `iris.tab`_): 
     393        binarization of a feature. ReliefF is used for scoring: 
    396394 
    397395        .. literalinclude:: code/scoring-info-iris.py 
    398             :lines: 17-18 
     396            :lines: 18-19 
    399397 
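The mechanics of the two methods can be mimicked in plain Python. The hypothetical sketch below scores every midpoint between adjacent distinct values by the information gain of the binarized feature (Orange's Relief would score the split differently, but the threshold enumeration works the same way):

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a Counter."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def threshold_function(values, classes):
    """Return (threshold, score) for each midpoint between adjacent
    distinct values, scored by information gain of the binary split."""
    total = Counter(classes)
    base, n = entropy(total), len(values)
    result = []
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        below = Counter(c for v, c in zip(values, classes) if v <= t)
        above = total - below
        cond = (sum(below.values()) / n * entropy(below)
                + sum(above.values()) / n * entropy(above))
        result.append((t, base - cond))
    return result

def best_threshold(values, classes):
    """Threshold with the highest score, as a (threshold, score) pair."""
    return max(threshold_function(values, classes), key=lambda pair: pair[1])

values = [1.0, 1.2, 1.4, 4.5, 4.7, 5.0]
classes = ["a", "a", "a", "b", "b", "b"]
print(best_threshold(values, classes))  # the split at 2.95 separates the classes
```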
    400398.. class:: ScoreFromProbabilities 
     
    403401 
    404402    Abstract base class for feature scoring method that can be 
    405     computed from contingency matrices only. It relieves the derived classes 
    406     from having to compute the contingency matrix by defining the first two 
    407     forms of call operator. (Well, that's not something you need to know if 
    408     you only work in Python.) 
    409  
    410     .. attribute:: unknowns_treatment 
    411       
    412         See :obj:`Score.unknowns_treatment`. 
     403    computed from contingency matrices. 
    413404 
    414405    .. attribute:: estimator_constructor 
    415406    .. attribute:: conditional_estimator_constructor 
    416407     
    417         The classes that are used to estimate unconditional and 
    418         conditional probabilities of classes, respectively. You can set 
    419         this to, for instance, :obj:`ProbabilityEstimatorConstructor_m` 
    420         and :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` 
     408        The classes that are used to estimate unconditional 
     409        and conditional probabilities of classes, respectively. 
     410        Defaults use relative frequencies; possible alternatives are, 
     411        for instance, :obj:`ProbabilityEstimatorConstructor_m` and 
     412        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` 
    421413        (with estimator constructor again set to 
    422414        :obj:`ProbabilityEstimatorConstructor_m`), respectively. 
    423         Both default to relative frequencies. 
    424415 
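For intuition, the m-estimate behind :obj:`ProbabilityEstimatorConstructor_m` shrinks a relative frequency towards a prior probability. A generic sketch of the formula, not Orange's code:

```python
def m_estimate(successes, total, prior, m=2.0):
    """m-estimate of probability: (s + m * p0) / (N + m).
    With m=0 this reduces to the plain relative frequency."""
    return (successes + m * prior) / (total + m)

# 3 of 4 instances in a node are "yes", overall P(yes) = 0.5:
print(m_estimate(3, 4, 0.5))         # pulled from 0.75 towards 0.5
print(m_estimate(3, 4, 0.5, m=0))    # prints 0.75
```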
    425416============ 
     
    430421   :members: 
    431422 
    432 .. autofunction:: Orange.feature.scoring.merge_values 
    433  
    434423.. autofunction:: Orange.feature.scoring.score_all 
    435424 
    436 ..  .. rubric:: References 
     425.. rubric:: Bibliography 
    437426 
    438427.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,  
     
    482471     
    483472        A scoring method derived from :obj:`~Orange.feature.scoring.Score`. 
    484         If :obj:`None`, :obj:`Relief` with m=5 and k=10 will be used. 
     473        If :obj:`None`, :obj:`Relief` with m=5 and k=10 is used. 
    485474     
    486475    """ 
  • orange/doc/Orange/rst/code/scoring-info-iris.py

    r8138 r8842  
    1111print Orange.feature.scoring.InfoGain(d1, table) 
    1212 
     13table = Orange.data.Table("iris") 
    1314meas = Orange.feature.scoring.Relief() 
    1415for t in meas.threshold_function("petal length", table): 