Changeset 9988:4e1229e347ca in orange


Timestamp:
02/07/12 22:08:39 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Message:

Polished discretization scripts.

Files:
7 edited

  • Orange/associate/__init__.py

    r9919 r9988  
    1 """ 
    2 ============================== 
    3 Induction of association rules 
    4 ============================== 
    5  
    6 Orange provides two algorithms for induction of 
    7 `association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
    8 One is the basic Agrawal's algorithm with dynamic induction of supported 
    9 itemsets and rules that is designed specifically for datasets with a 
    10 large number of different items. This is, however, not really suitable 
    11 for feature-based machine learning problems. 
    12 We have adapted the original algorithm for efficiency 
    13 with the latter type of data, and to induce rules in which 
    14 both sides contain not only features 
    15 (like "bread, butter -> jam") but also their values 
    16 ("bread = wheat, butter = yes -> jam = plum"). 
    17  
    18 It is also possible to extract item sets instead of association rules. These 
    19 are often more interesting than the rules themselves. 
    20  
    21 Besides the association rule inducers, Orange also provides a rather simplified 
    22 method for classification by association rules. 
    23  
    24 =================== 
    25 Agrawal's algorithm 
    26 =================== 
    27  
    28 The class that induces rules by Agrawal's algorithm accepts data examples 
    29 in two forms. The first is the standard form in which each example is 
    30 described by values of a fixed list of features (defined in domain). 
    31 The algorithm, however, disregards the feature values and only checks whether 
    32 the value is defined or not. The rule shown above ("bread, butter -> jam") 
    33 actually means that if "bread" and "butter" are defined, then "jam" is defined 
    34 as well. It is expected that most of the values will be undefined - if this is not 
    35 the case, use :class:`~AssociationRulesInducer` instead. 
    36  
    37 :class:`AssociationRulesSparseInducer` can also use sparse data.  
    38 Sparse examples have no fixed 
    39 features - the domain is empty. All values assigned to an example are given as meta attributes. 
    40 All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
    41 The most suitable format for this kind of data is the basket format. 
    42  
    43 The algorithm first dynamically builds all itemsets (sets of features) that have 
    44 at least the prescribed support. Each of these is then used to derive rules 
    45 with requested confidence. 
    46  
    47 If the examples were given in sparse form, so are the left and right 
    48 sides of the induced rules. If the examples were given in standard form, 
    49 so are the sides of the induced rules. 
    50  
    51 .. class:: AssociationRulesSparseInducer 
    52  
    53     .. attribute:: support 
    54      
    55         Minimal support for the rule. 
    56          
    57     .. attribute:: confidence 
    58      
    59         Minimal confidence for the rule. 
    60          
    61     .. attribute:: store_examples 
    62      
    63         Store the examples covered by each rule and 
    64         those confirming it. 
    65          
    66     .. attribute:: max_item_sets 
    67      
    68         The maximal number of itemsets. The algorithm's 
    69         running time (and its memory consumption) depends on the minimal support; 
    70         the lower the requested support, the more eligible itemsets will be found. 
    71         There is no general rule for setting support - perhaps it 
    72         should be around 0.3, but this depends on the data set. 
    73         If the support is set too low, the algorithm can run out of memory. 
    74         Therefore, Orange limits the number of generated itemsets to 
    75         :obj:`max_item_sets`. If Orange reports that the prescribed 
    76         :obj:`max_item_sets` was exceeded, increase the required support 
    77         or, alternatively, increase :obj:`max_item_sets` to as high as your computer 
    78         can handle. 
    79  
    80     .. method:: __call__(data, weight_id) 
    81  
    82         Induce rules from the data set. 
    83  
    84  
    85     .. method:: get_itemsets(data) 
    86  
    87         Returns a list of pairs. The first element of a pair is a tuple with 
    88         indices of features in the item set (negative for sparse data). 
    89         The second element is a list of indices of the examples that support 
    90         the item set, that is, that contain all the items in the set. If 
    91         :obj:`store_examples` is False, the second element is None. 
    92  
    93 We shall test the rule inducer on a dataset consisting of a brief description 
    94 of the Spanish Inquisition, given by Palin et al.: 
    95  
    96     NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
    97  
    98     NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
    99      
    100 The text needs to be cleaned of punctuation marks and of capital letters at the beginnings of sentences; each sentence needs to be put into its own line, with commas inserted between the words. 
    101  
    102 Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
    103  
    104 .. literalinclude:: code/inquisition.basket 
    105     
    106 Inducing the rules is trivial (uses :download:`inquisition.basket <code/inquisition.basket>`):: 
    107  
    108     import Orange 
    109     data = Orange.data.Table("inquisition") 
    110  
    111     rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
    112  
    113     print "%5s   %5s" % ("supp", "conf") 
    114     for r in rules: 
    115         print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
    116  
    117 The induced rules are surprisingly fear-full: :: 
    118  
    119     0.500   1.000   fear -> surprise 
    120     0.500   1.000   surprise -> fear 
    121     0.500   1.000   fear -> surprise our 
    122     0.500   1.000   fear surprise -> our 
    123     0.500   1.000   fear our -> surprise 
    124     0.500   1.000   surprise -> fear our 
    125     0.500   1.000   surprise our -> fear 
    126     0.500   0.714   our -> fear surprise 
    127     0.500   1.000   fear -> our 
    128     0.500   0.714   our -> fear 
    129     0.500   1.000   surprise -> our 
    130     0.500   0.714   our -> surprise 
    131  
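Since the result behaves like an ordinary Python list of rules, standard
tools can be used to filter and reorder it. A minimal sketch, continuing the
example above, that keeps only the rules with confidence 1.0 and prints them
with the best-supported ones first::

    # keep the rules with perfect confidence, highest support first
    perfect = [r for r in rules if r.confidence >= 1.0]
    for r in sorted(perfect, key=lambda r: -r.support):
        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r)
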
    132 To get only a list of supported item sets, one should call the method 
    133 get_itemsets:: 
    134  
    135     inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True) 
    136     itemsets = inducer.get_itemsets(data) 
    137      
    138 Now itemsets is a list of itemsets along with the examples supporting them 
    139 since we set store_examples to True. :: 
    140  
    141     >>> itemsets[5] 
    142     ((-11, -7), [1, 2, 3, 6, 9]) 
    143     >>> [data.domain[i].name for i in itemsets[5][0]] 
    144     ['surprise', 'our']    
    145      
    146 The sixth itemset contains features with indices -11 and -7, that is, the 
    147 words "surprise" and "our". The examples supporting it are those with 
    148 indices 1, 2, 3, 6 and 9. 
    149  
    150 This way of representing the itemsets is memory efficient and faster than using 
    151 objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`. 
    152  
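The compact representation is easy to expand when needed. A short sketch,
continuing the code above, that decodes every frequent item set back into
words and reports the number of supporting examples::

    for itemset, examples in itemsets:
        words = [data.domain[i].name for i in itemset]
        print "%d examples: %s" % (len(examples), " ".join(words))
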
    153 .. _non-sparse-examples: 
    154  
    155 =================== 
    156 Non-sparse data 
    157 =================== 
    158  
    159 :class:`AssociationRulesInducer` works with non-sparse data. 
    160 Unknown values are ignored, while values of features are not (as opposed to 
    161 the algorithm for sparse rules). In addition, the algorithm 
    162 can be directed to search only for classification rules, in which the only 
    163 feature on the right-hand side is the class variable. 
    164  
    165 .. class:: AssociationRulesInducer 
    166  
    167     All attributes can be set with the constructor.  
    168  
    169     .. attribute:: support 
    170      
    171        Minimal support for the rule. 
    172      
    173     .. attribute:: confidence 
    174      
    175         Minimal confidence for the rule. 
    176      
    177     .. attribute:: classification_rules 
    178      
    179         If True (default is False), the classification rules are constructed instead 
    180         of general association rules. 
    181  
    182     .. attribute:: store_examples 
    183      
    184         Store the examples covered by each rule and those 
    185         confirming it. 
    186          
    187     .. attribute:: max_item_sets 
    188      
    189         The maximal number of itemsets. 
    190  
    191     .. method:: __call__(data, weight_id) 
    192  
    193         Induce rules from the data set. 
    194  
    195     .. method:: get_itemsets(data) 
    196  
    197         Returns a list of pairs. The first element of a pair is a tuple with 
    198         indices of features in the item set (negative for sparse data). 
    199         The second element is a list of indices of the examples that support 
    200         the item set, that is, that contain all the items in the set. If 
    201         :obj:`store_examples` is False, the second element is None. 
    202  
    203 The example:: 
    204  
    205     import Orange 
    206  
    207     data = Orange.data.Table("lenses") 
    208  
    209     print "Association rules" 
    210     rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
    211     for r in rules: 
    212         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    213          
    214 The found rules are: :: 
    215  
    216     0.333  0.533  lenses=none -> prescription=hypermetrope 
    217     0.333  0.667  prescription=hypermetrope -> lenses=none 
    218     0.333  0.533  lenses=none -> astigmatic=yes 
    219     0.333  0.667  astigmatic=yes -> lenses=none 
    220     0.500  0.800  lenses=none -> tear_rate=reduced 
    221     0.500  1.000  tear_rate=reduced -> lenses=none 
    222      
    223 To limit the algorithm to classification rules, set classification_rules to True: :: 
    224  
    225     print "\\nClassification rules" 
    226     rules = Orange.associate.AssociationRulesInducer(data, support = 0.3, classification_rules = True) 
    227     for r in rules: 
    228         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    229  
    230 The found rules are, naturally, a subset of the above rules: :: 
    231  
    232     0.333  0.667  prescription=hypermetrope -> lenses=none 
    233     0.333  0.667  astigmatic=yes -> lenses=none 
    234     0.500  1.000  tear_rate=reduced -> lenses=none 
    235      
    236 Itemsets are induced in a similar fashion as for sparse data, except that the 
    237 first element of the tuple, the item set, is represented not by indices of 
    238 features, as before, but with tuples (feature-index, value-index): :: 
    239  
    240     inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
    241     itemsets = inducer.get_itemsets(data) 
    242     print itemsets[8] 
    243      
    244 This prints out :: 
    245  
    246     (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
    247      
    248 meaning that the ninth itemset contains the second value of the third feature 
    249 (2, 1), and the first value of the fifth (4, 0). 
    250  
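The pairs can be mapped back to names and values. A small sketch, assuming
(as in the lenses example above) that all features are discrete, so that each
descriptor has a ``values`` list::

    itemset, examples = itemsets[8]
    print ["%s=%s" % (data.domain[f].name, data.domain[f].values[v])
           for f, v in itemset]
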
    251 ======================= 
    252 Representation of rules 
    253 ======================= 
    254  
    255 An :class:`AssociationRule` represents a rule. In Orange, methods for  
    256 induction of association rules return the induced rules in 
    257 :class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
    258  
    259 .. class:: AssociationRule 
    260  
    261     .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
    262      
    263         Constructs an association rule and computes all of the measures listed below. 
    264      
    265     .. method:: __init__(left, right, support, confidence) 
    266      
    267         Constructs an association rule and sets its support and confidence. If 
    268         you intend to pass such a rule on, you should set the other attributes 
    269         manually - the constructor cannot compute anything 
    270         from the support and confidence alone. 
    271      
    272     .. method:: __init__(rule) 
    273      
    274         Given an association rule as the argument, the constructor makes a copy 
    275         of the rule. 
    276   
    277     .. attribute:: left, right 
    278      
    279         The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
    280         In rules created by :class:`AssociationRulesSparseInducer` from examples that 
    281         contain all values as meta-values, left and right are examples in the 
    282         same form. Otherwise, values in left that do not appear in the rule 
    283         are "don't care", and values in right are "don't know". Both can, 
    284         however, be tested by :meth:`~Orange.data.Value.is_special`. 
    285      
    286     .. attribute:: n_left, n_right 
    287      
    288         The number of features (i.e. defined values) on the left and on the 
    289         right side of the rule. 
    290      
    291     .. attribute:: n_applies_left, n_applies_right, n_applies_both 
    292      
    293         The number of (learning) examples that conform to the left, the right 
    294         and to both sides of the rule. 
    295      
    296     .. attribute:: n_examples 
    297      
    298         The total number of learning examples. 
    299      
    300     .. attribute:: support 
    301      
    302         n_applies_both/n_examples. 
    303  
    304     .. attribute:: confidence 
    305      
    306         n_applies_both/n_applies_left. 
    307      
    308     .. attribute:: coverage 
    309      
    310         n_applies_left/n_examples. 
    311  
    312     .. attribute:: strength 
    313      
    314         n_applies_right/n_applies_left. 
    315      
    316     .. attribute:: lift 
    317      
    318         n_examples * n_applies_both / (n_applies_left * n_applies_right). 
    319      
    320     .. attribute:: leverage 
    321      
    322         n_applies_both * n_examples - n_applies_left * n_applies_right. 
    323      
    324     .. attribute:: examples, match_left, match_both 
    325      
    326         If store_examples was True during induction, examples contains a copy 
    327         of the example table used to induce the rules. Attributes match_left 
    328         and match_both are lists of integers, representing the indices of 
    329         examples which match the left-hand side of the rule and both sides, 
    330         respectively. 
    331     
    332     .. method:: applies_left(example) 
    333      
    334     .. method:: applies_right(example) 
    335      
    336     .. method:: applies_both(example) 
    337      
    338         Tells whether the example fits into the left, right or both sides of 
    339         the rule, respectively. If the rule is represented by sparse examples, 
    340         the given example must be sparse as well. 
    341      
    342 Association rule inducers do not, by default, store evidence about which 
    343 example supports which rule. Let us write a script that finds the examples that 
    344 confirm the rule (fit both sides of it) and those that contradict it (fit the 
    345 left-hand side but not the right):: 
    346  
    347     import Orange 
    348  
    349     data = Orange.data.Table("lenses") 
    350  
    351     rules = Orange.associate.AssociationRulesInducer(data, support = 0.3) 
    352     rule = rules[0] 
    353  
    354     print 
    355     print "Rule: ", rule 
    356     print 
    357  
    358     print "Supporting examples:" 
    359     for example in data: 
    360         if rule.applies_both(example): 
    361             print example 
    362     print 
    363  
    364     print "Contradicting examples:" 
    365     for example in data: 
    366         if rule.applies_left(example) and not rule.applies_right(example): 
    367             print example 
    368     print 
    369  
    370 The latter printouts get simpler and faster if we instruct the inducer to 
    371 store the examples. We can then do, for instance, this: :: 
    372  
    373     print "Match left: " 
    374     print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
    375     print "\\nMatch both: " 
    376     print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
    377  
    378 The "contradicting" examples are then those whose indices are found in 
    379 match_left but not in match_both. A more memory-friendly and faster way 
    380 to compute this is as follows: :: 
    381  
    382     >>> [x for x in rule.match_left if not x in rule.match_both] 
    383     [0, 2, 8, 10, 16, 17, 18] 
    384     >>> set(rule.match_left) - set(rule.match_both) 
    385     set([0, 2, 8, 10, 16, 17, 18]) 
    386  
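The stored match lists also make it easy to check the reported statistics;
confidence, for instance, is simply the share of left-hand matches that also
match both sides. A sketch, assuming the inducer was constructed with
store_examples set to True::

    recomputed = float(len(rule.match_both)) / len(rule.match_left)
    print rule.confidence, recomputed   # the two numbers should agree
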
    387 =============== 
    388 Utilities 
    389 =============== 
    390  
    391 .. autofunction:: print_rules 
    392  
    393 .. autofunction:: sort 
    394  
    395 """ 
    396  
    397 1 from orange import \ 
    398 2     AssociationRule, \ 
  • Orange/feature/scoring.py

    r9919 r9988  
    1 """ 
    2 ##################### 
    3 Scoring (``scoring``) 
    4 ##################### 
    5  
    6 .. index:: feature scoring 
    7  
    8 .. index::  
    9    single: feature; feature scoring 
    10  
    11 A feature score is an assessment of the usefulness of a feature for 
    12 prediction of the dependent (class) variable. 
    13  
    14 To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into ``data``) use: 
    15  
    16     >>> meas = Orange.feature.scoring.InfoGain() 
    17     >>> print meas("tear_rate", data) 
    18     0.548794925213 
    19  
    20 Other scoring methods are listed in :ref:`classification` and 
    21 :ref:`regression`. Various ways to call them are described on 
    22 :ref:`callingscore`. 
    23  
    24 Instead of first constructing the scoring object (e.g. ``InfoGain``) and 
    25 then using it, it is usually more convenient to do both in a single step:: 
    26  
    27     >>> print Orange.feature.scoring.InfoGain("tear_rate", data) 
    28     0.548794925213 
    29  
    30 This way is much slower for Relief, which can efficiently compute scores 
    31 for all features in parallel. 
    32  
    33 It is also possible to score features that do not appear in the data 
    34 but can be computed from it. A typical case is that of discretized features: 
    35  
    36 .. literalinclude:: code/scoring-info-iris.py 
    37     :lines: 7-11 
    38  
    39 The following example computes feature scores, both with 
    40 :obj:`score_all` and by scoring each feature individually, and prints out  
    41 the best three features.  
    42  
    43 .. literalinclude:: code/scoring-all.py 
    44     :lines: 7- 
    45  
    46 The output:: 
    47  
    48     Feature scores for best three features (with score_all): 
    49     0.613 physician-fee-freeze 
    50     0.255 el-salvador-aid 
    51     0.228 synfuels-corporation-cutback 
    52  
    53     Feature scores for best three features (scored individually): 
    54     0.613 physician-fee-freeze 
    55     0.255 el-salvador-aid 
    56     0.228 synfuels-corporation-cutback 
    57  
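The same kind of ranking can be produced with a few lines of plain Python. A
minimal sketch, assuming the voting data set is available as ``voting.tab``;
the exact figures depend on the scoring method used::

    import Orange

    data = Orange.data.Table("voting")
    gain = Orange.feature.scoring.InfoGain()
    scored = sorted((gain(attr, data), attr.name) for attr in data.domain.attributes)
    for score, name in reversed(scored[-3:]):
        print "%.3f %s" % (score, name)
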
    58 .. comment 
    59     The next script uses :obj:`GainRatio` and :obj:`Relief`. 
    60  
    61     .. literalinclude:: code/scoring-relief-gainRatio.py 
    62         :lines: 7- 
    63  
    64     Notice that on this data the ranks of features match:: 
    65          
    66         Relief GainRt Feature 
    67         0.613  0.752  physician-fee-freeze 
    68         0.255  0.444  el-salvador-aid 
    69         0.228  0.414  synfuels-corporation-cutback 
    70         0.189  0.382  crime 
    71         0.166  0.345  adoption-of-the-budget-resolution 
    72  
    73  
    74 .. _callingscore: 
    75  
    76 ======================= 
    77 Calling scoring methods 
    78 ======================= 
    79  
    80 To score a feature use :obj:`Score.__call__`. There are different 
    81 function signatures, which enable optimization. For instance, 
    82 most scoring methods first compute contingency tables from the 
    83 data. If these are already computed, they can be passed to the scorer 
    84 instead of the data. 
    85  
    86 Not all classes accept all kinds of arguments. :obj:`Relief`, 
    87 for instance, only supports the form with instances on the input. 
    88  
    89 .. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID]) 
    90  
    91     :param attribute: the chosen feature, either as a descriptor,  
    92       index, or a name. 
    93     :type attribute: :class:`Orange.feature.Descriptor` or int or string 
    94     :param data: data. 
    95     :type data: `Orange.data.Table` 
    96     :param weightID: id for meta-feature with weight. 
    97  
    98     All scoring methods support the first signature. 
    99  
    100 .. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution]) 
    101  
    102     :param attribute: the chosen feature, either as a descriptor,  
    103       index, or a name. 
    104     :type attribute: :class:`Orange.feature.Descriptor` or int or string 
    105     :param domain_contingency:  
    106     :type domain_contingency: :obj:`Orange.statistics.contingency.Domain` 
    107  
    108 .. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution]) 
    109  
    110     :param contingency: 
    111     :type contingency: :obj:`Orange.statistics.contingency.VarClass` 
    112     :param class_distribution: distribution of the class 
    113       variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, 
    114       it should be computed on instances where feature value is 
    115       defined. Otherwise, class distribution should be the overall 
    116       class distribution. 
    117     :type class_distribution:  
    118       :obj:`Orange.statistics.distribution.Distribution` 
    119     :param apriori_class_distribution: Optional and most often 
    120       ignored. Useful if the scoring method makes any probability estimates 
    121       based on apriori class probabilities (such as the m-estimate). 
    122     :return: Feature score - the higher the value, the better the feature. 
    123       If the quality cannot be scored, return :obj:`Score.Rejected`. 
    124     :rtype: float or :obj:`Score.Rejected`. 
    125  
    126 The code below scores the same feature with :obj:`GainRatio`  
    127 using different calls. 
    128  
    129 .. literalinclude:: code/scoring-calls.py 
    130     :lines: 7- 
    131  
    132 .. _classification: 
    133  
    134 ========================================== 
    135 Feature scoring in classification problems 
    136 ========================================== 
    137  
    138 .. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. 
    139  
    140 .. index::  
    141    single: feature scoring; information gain 
    142  
    143 .. class:: InfoGain 
    144  
    145     Information gain; the expected decrease of entropy. See `page on wikipedia 
    146     <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
    147  
    148 .. index::  
    149    single: feature scoring; gain ratio 
    150  
    151 .. class:: GainRatio 
    152  
    153     Information gain ratio; information gain divided by the entropy of the feature's 
    154     value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
    155     of multi-valued features. It has been shown, however, that it still 
    156     overestimates features with multiple values. See `Wikipedia 
    157     <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
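
    A small comparison sketch, assuming ``Orange`` is imported and the lenses
    data is loaded into ``data``; the two measures may rank multi-valued
    features differently::

        gain = Orange.feature.scoring.InfoGain()
        ratio = Orange.feature.scoring.GainRatio()
        for attr in data.domain.attributes:
            print "%-12s  gain: %.3f  ratio: %.3f" % (
                attr.name, gain(attr, data), ratio(attr, data))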
    158  
    159 .. index::  
    160    single: feature scoring; gini index 
    161  
    162 .. class:: Gini 
    163  
    164     Gini index is the probability that two randomly chosen instances will have different 
    165     classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_. 
    166  
    167 .. index::  
    168    single: feature scoring; relevance 
    169  
    170 .. class:: Relevance 
    171  
    172     The potential value for decision rules. 
    173  
    174 .. index::  
    175    single: feature scoring; cost 
    176  
    177 .. class:: Cost 
    178  
    179     Evaluates features based on the cost decrease achieved by knowing the value 
    180     of a feature, according to the specified cost matrix. 
    181  
    182     .. attribute:: cost 
    183       
    184         Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
    185  
    186     If the cost of predicting the first class of an instance that is actually in 
    187     the second is 5, and the cost of the opposite error is 1, then an appropriate 
    188     score can be constructed as follows:: 
    189  
    190  
    191         >>> meas = Orange.feature.scoring.Cost() 
    192         >>> meas.cost = ((0, 5), (1, 0)) 
    193         >>> meas(3, data) 
    194         0.083333350718021393 
    195  
    196     Knowing the value of feature 3 would decrease the 
    197     classification cost by approximately 0.083 per instance. 
    198  
    199     .. comment   opposite error - is this term correct? TODO 
    200  
    201 .. index::  
    202    single: feature scoring; ReliefF 
    203  
    204 .. class:: Relief 
    205  
    206     Assesses features' ability to distinguish between very similar 
    207     instances from different classes. This scoring method was first 
    208     developed by Kira and Rendell and then improved by  Kononenko. The 
    209     class :obj:`Relief` works on discrete and continuous classes and 
    210     thus implements ReliefF and RReliefF. 
    211  
    212     ReliefF is slow since it needs to find the k nearest neighbours for 
    213     each of m reference instances. As we normally compute ReliefF for 
    214     all features in the dataset, :obj:`Relief` caches the results for 
    215     all features when called to score a certain feature. When called 
    216     again, it reuses the stored results if the domain and the data table 
    217     have not changed (the data table version and the data checksum are 
    218     compared). Caching only works if you use the same object. 
    219     Constructing a new instance of :obj:`Relief` for each feature, 
    220     like this:: 
    221  
    222         for attr in data.domain.attributes: 
    223             print Orange.feature.scoring.Relief(attr, data) 
    224  
    225     runs much slower than reusing the same instance:: 
    226  
    227         meas = Orange.feature.scoring.Relief() 
    228         for attr in data.domain.attributes: 
    229             print meas(attr, data) 
    230  
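    A rough timing sketch for the caching claim above, assuming ``Orange`` is
    imported and a data table is loaded into ``data``::

        import time

        start = time.time()
        for attr in data.domain.attributes:
            Orange.feature.scoring.Relief(attr, data)   # a new scorer each time
        print "separate scorers: %.2f s" % (time.time() - start)

        meas = Orange.feature.scoring.Relief()
        start = time.time()
        for attr in data.domain.attributes:
            meas(attr, data)                            # cached results reused
        print "shared scorer:    %.2f s" % (time.time() - start)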
    231  
    232     .. attribute:: k 
    233      
    234        Number of neighbours for each instance. Default is 5. 
    235  
    236     .. attribute:: m 
    237      
    238         Number of reference instances. Default is 100. When -1, all 
    239         instances are used as reference. 
    240  
    241     .. attribute:: check_cached_data 
    242      
    243         Check whether the cached data has changed, which may be slow on large 
    244         tables. Defaults to :obj:`True`, but should be disabled when it 
    245         is certain that the data will not change while the scorer is used. 
    246  
    247 .. autoclass:: Orange.feature.scoring.Distance 
    248     
    249 .. autoclass:: Orange.feature.scoring.MDL 
    250  
    251 .. _regression: 
    252  
    253 ====================================== 
    254 Feature scoring in regression problems 
    255 ====================================== 
    256  
    257 .. class:: Relief 
    258  
    259     Relief is used for regression in the same way as for 
    260     classification (see :class:`Relief` in classification 
    261     problems). 
    262  
    263 .. index::  
    264    single: feature scoring; mean square error 
    265  
    266 .. class:: MSE 
    267  
    268     Implements the mean square error score. 
    269  
    270     .. attribute:: unknowns_treatment 
    271      
    272         What to do with unknown values. See :obj:`Score.unknowns_treatment`. 
    273  
    274     .. attribute:: m 
    275      
    276         Parameter for m-estimate of error. Default is 0 (no m-estimate). 
    277  
    278  
    279  
    280 ============ 
    281 Base Classes 
    282 ============ 
    283  
    284 Implemented methods for scoring relevances of features are subclasses 
    285 of :obj:`Score`. Those that compute statistics on conditional 
    286 distributions of class values given the feature values are derived from 
    287 :obj:`ScoreFromProbabilities`. 
    288  
    289 .. class:: Score 
    290  
    291     Abstract base class for feature scoring. Its attributes describe which 
    292     types of features it can handle and which kind of data it requires. 
    293  
    294     **Capabilities** 
    295  
    296     .. attribute:: handles_discrete 
    297      
    298         Indicates whether the scoring method can handle discrete features. 
    299  
    300     .. attribute:: handles_continuous 
    301      
    302         Indicates whether the scoring method can handle continuous features. 
    303  
    304     .. attribute:: computes_thresholds 
    305      
    306         Indicates whether the scoring method implements the :obj:`threshold_function`. 
    307  
    308     **Input specification** 
    309  
    310     .. attribute:: needs 
    311      
    312         The type of data needed, indicated by one of the constants 
    313         below. Classes that use :obj:`DomainContingency` will also handle 
    314         generators. Those based on :obj:`Contingency_Class` will be able 
    315         to take generators and domain contingencies. 
    316  
    317         .. attribute:: Generator 
    318  
    319             Constant. Indicates that the scoring method needs an instance 
    320             generator on the input, as does, for example, :obj:`Relief`. 
    321  
    322         .. attribute:: DomainContingency 
    323  
    324             Constant. Indicates that the scoring method needs 
    325             :obj:`Orange.statistics.contingency.Domain`. 
    326  
    327         .. attribute:: Contingency_Class 
    328  
    329             Constant. Indicates that the scoring method needs the contingency 
    330             (:obj:`Orange.statistics.contingency.VarClass`), feature 
    331             distribution and the apriori class distribution (as most 
    332             scoring methods). 
    333  
    334     **Treatment of unknown values** 
    335  
    336     .. attribute:: unknowns_treatment 
    337  
    338         Defined in classes that are able to treat unknown values. It 
    339         should be set to one of the values below. 
    340  
    341         .. attribute:: IgnoreUnknowns 
    342  
    343             Constant. Instances for which the feature value is unknown are removed. 
    344  
    345         .. attribute:: ReduceByUnknown 
    346  
    347             Constant. Features with unknown values are 
    348             penalized: the feature quality is reduced by the proportion of 
    349             unknown values. For impurity scores the impurity decreases 
    350             only where the value is defined and stays the same otherwise. 
    351  
    352         .. attribute:: UnknownsToCommon 
    353  
    354             Constant. Undefined values are replaced by the most common value. 
    355  
    356         .. attribute:: UnknownsAsValue 
    357  
    358             Constant. Unknown values are treated as a separate value. 
    359  
    360     **Methods** 
    361  
    362     .. method:: __call__ 
    363  
    364         Abstract. See :ref:`callingscore`. 
    365  
    366     .. method:: threshold_function(attribute, instances[, weightID]) 
    367      
    368         Abstract.  
    369          
    370         Assess different binarizations of the continuous feature 
    371         :obj:`attribute`.  Return a list of tuples. The first element 
    372         is a threshold (between two existing values), the second is 
    373         the quality of the corresponding binary feature, and the third 
    374         the distribution of instances below and above the threshold. 
    375         Not all scorers return the third element. 
    376  
    377         To show the computation of thresholds, we shall use the Iris 
    378         data set: 
    379  
    380         .. literalinclude:: code/scoring-info-iris.py 
    381             :lines: 13-16 
    382  
    383     .. method:: best_threshold(attribute, instances) 
    384  
    385         Return the best threshold for binarization, that is, the threshold 
    386         with which the resulting binary feature will have the optimal 
    387         score. 
    388  
    389         The script below prints out the best threshold for 
    390         binarization of a feature. ReliefF is used for scoring: 
    391  
    392         .. literalinclude:: code/scoring-info-iris.py 
    393             :lines: 18-19 
    394  
    395 .. class:: ScoreFromProbabilities 
    396  
    397     Bases: :obj:`Score` 
    398  
    399     Abstract base class for feature scoring methods that can be 
    400     computed from contingency matrices. 
    401  
    402     .. attribute:: estimator_constructor 
    403     .. attribute:: conditional_estimator_constructor 
    404      
    405         The classes that are used to estimate unconditional 
    406         and conditional probabilities of classes, respectively. 
    407         Defaults use relative frequencies; possible alternatives are, 
    408         for instance, :obj:`ProbabilityEstimatorConstructor_m` and 
    409         :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` 
    410         (with estimator constructor again set to 
    411         :obj:`ProbabilityEstimatorConstructor_m`), respectively. 
    412  
    413 ============ 
    414 Other 
    415 ============ 
    416  
    417 .. autoclass:: Orange.feature.scoring.OrderAttributes 
    418    :members: 
    419  
    420 .. autofunction:: Orange.feature.scoring.score_all 
    421  
    422 .. rubric:: Bibliography 
    423  
    424 .. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,  
    425   Woodhead Publishing, 2007. 
    426  
    427 .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986. 
    428  
    429 .. [Breiman1984] L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984. 
    430  
    431 .. [Kononenko1995] I Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995. 
    432  
    433 """ 
    434  
    435 1 import Orange.core as orange 
    436 2 import Orange.misc 
     
    445 11 from orange import MeasureAttribute_relief as Relief 
    446 12 from orange import MeasureAttribute_MSE as MSE 
    447 
    448 13 
    449 14 ###### 
  • docs/reference/rst/Orange.associate.rst

    r9372 r9988  
    3 3 ==================================== 
    4 4 
    5 .. automodule:: Orange.associate 
     5============================== 
     6Induction of association rules 
     7============================== 
     8 
     9Orange provides two algorithms for induction of 
     10`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
     11One is the basic Agrawal's algorithm with dynamic induction of supported 
     12itemsets and rules that is designed specifically for datasets with a 
     13large number of different items. This is, however, not really suitable 
     14for feature-based machine learning problems. 
      15We have adapted the original algorithm for efficiency 
      16with the latter type of data, and to induce rules in which 
      17both sides contain not only features 
      18(like "bread, butter -> jam") but also their values 
      19("bread = wheat, butter = yes -> jam = plum"). 
     20 
     21It is also possible to extract item sets instead of association rules. These 
     22are often more interesting than the rules themselves. 
     23 
      24Besides the association rule inducers, Orange also provides a rather simplified 
      25method for classification by association rules. 
     26 
     27=================== 
     28Agrawal's algorithm 
     29=================== 
     30 
      31The class that induces rules by Agrawal's algorithm accepts data examples 
      32in two forms. The first is the standard form in which each example is 
     33described by values of a fixed list of features (defined in domain). 
     34The algorithm, however, disregards the feature values and only checks whether 
     35the value is defined or not. The rule shown above ("bread, butter -> jam") 
     36actually means that if "bread" and "butter" are defined, then "jam" is defined 
      37as well. It is expected that most of the values will be undefined - if this is not 
      38the case, use :class:`~AssociationRulesInducer` instead. 
     39 
     40:class:`AssociationRulesSparseInducer` can also use sparse data. 
     41Sparse examples have no fixed 
      42features - the domain is empty. All values assigned to an example are given as meta attributes. 
      43All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
      44The most suitable format for this kind of data is the basket format. 
     45 
     46The algorithm first dynamically builds all itemsets (sets of features) that have 
     47at least the prescribed support. Each of these is then used to derive rules 
     48with requested confidence. 
     49 
      50If the examples were given in sparse form, so are the left and right 
      51sides of the induced rules. If the examples were given in standard form, 
      52so are the sides of the induced rules. 
     53 
     54.. class:: AssociationRulesSparseInducer 
     55 
     56    .. attribute:: support 
     57 
     58        Minimal support for the rule. 
     59 
     60    .. attribute:: confidence 
     61 
     62        Minimal confidence for the rule. 
     63 
     64    .. attribute:: store_examples 
     65 
     66        Store the examples covered by each rule and 
     67        those confirming it. 
     68 
     69    .. attribute:: max_item_sets 
     70 
      71        The maximal number of itemsets. The algorithm's 
      72        running time (and its memory consumption) depends on the minimal support; 
      73        the lower the requested support, the more eligible itemsets will be found. 
      74        There is no general rule for setting support - perhaps it 
      75        should be around 0.3, but this depends on the data set. 
      76        If the support is set too low, the algorithm can run out of memory. 
      77        Therefore, Orange limits the number of generated itemsets to 
      78        :obj:`max_item_sets`. If Orange reports that the prescribed 
      79        :obj:`max_item_sets` was exceeded, increase the required support 
      80        or, alternatively, increase :obj:`max_item_sets` to as high as your computer 
      81        can handle. 
     82 
     83    .. method:: __call__(data, weight_id) 
     84 
     85        Induce rules from the data set. 
     86 
     87 
     88    .. method:: get_itemsets(data) 
     89 
      90        Returns a list of pairs. The first element of a pair is a tuple with 
      91        indices of features in the item set (negative for sparse data). 
      92        The second element is a list of indices of the examples that support 
      93        the item set, that is, that contain all the items in the set. If 
      94        :obj:`store_examples` is False, the second element is None. 
     95 
      96We shall test the rule inducer on a dataset consisting of a brief description 
      97of the Spanish Inquisition, given by Palin et al.: 
     98 
     99    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
     100 
     101    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
     102 
      103The text needs to be cleaned of punctuation marks and of capital letters at the beginnings of sentences; each sentence needs to be put into its own line, with commas inserted between the words. 
     104 
     105Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
     106 
     107.. literalinclude:: code/inquisition.basket 
     108 
     109Inducing the rules is trivial (uses :download:`inquisition.basket <code/inquisition.basket>`):: 
     110 
     111    import Orange 
     112    data = Orange.data.Table("inquisition") 
     113 
     114    rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
     115 
     116    print "%5s   %5s" % ("supp", "conf") 
     117    for r in rules: 
     118        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
     119 
     120The induced rules are surprisingly fear-full: :: 
     121 
     122    0.500   1.000   fear -> surprise 
     123    0.500   1.000   surprise -> fear 
     124    0.500   1.000   fear -> surprise our 
     125    0.500   1.000   fear surprise -> our 
     126    0.500   1.000   fear our -> surprise 
     127    0.500   1.000   surprise -> fear our 
     128    0.500   1.000   surprise our -> fear 
     129    0.500   0.714   our -> fear surprise 
     130    0.500   1.000   fear -> our 
     131    0.500   0.714   our -> fear 
     132    0.500   1.000   surprise -> our 
     133    0.500   0.714   our -> surprise 
     134 
     135To get only a list of supported item sets, one should call the method 
     136get_itemsets:: 
     137 
     138    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True) 
     139    itemsets = inducer.get_itemsets(data) 
     140 
     141Now itemsets is a list of itemsets along with the examples supporting them 
     142since we set store_examples to True. :: 
     143 
     144    >>> itemsets[5] 
     145    ((-11, -7), [1, 2, 3, 6, 9]) 
     146    >>> [data.domain[i].name for i in itemsets[5][0]] 
     147    ['surprise', 'our'] 
     148 
     149The sixth itemset contains features with indices -11 and -7, that is, the 
     150words "surprise" and "our". The examples supporting it are those with 
      151indices 1, 2, 3, 6 and 9. 
     152 
     153This way of representing the itemsets is memory efficient and faster than using 
     154objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`. 
     155 
     156.. _non-sparse-examples: 
     157 
     158=================== 
     159Non-sparse data 
     160=================== 
     161 
     162:class:`AssociationRulesInducer` works with non-sparse data. 
      163Unknown values are ignored, while values of features are not (as opposed to 
     164the algorithm for sparse rules). In addition, the algorithm 
     165can be directed to search only for classification rules, in which the only 
     166feature on the right-hand side is the class variable. 
     167 
     168.. class:: AssociationRulesInducer 
     169 
     170    All attributes can be set with the constructor. 
     171 
     172    .. attribute:: support 
     173 
     174       Minimal support for the rule. 
     175 
     176    .. attribute:: confidence 
     177 
     178        Minimal confidence for the rule. 
     179 
     180    .. attribute:: classification_rules 
     181 
     182        If True (default is False), the classification rules are constructed instead 
     183        of general association rules. 
     184 
     185    .. attribute:: store_examples 
     186 
     187        Store the examples covered by each rule and those 
      188        confirming it. 
     189 
     190    .. attribute:: max_item_sets 
     191 
     192        The maximal number of itemsets. 
     193 
     194    .. method:: __call__(data, weight_id) 
     195 
     196        Induce rules from the data set. 
     197 
     198    .. method:: get_itemsets(data) 
     199 
      200        Returns a list of pairs. The first element of a pair is a tuple with 
      201        indices of features in the item set (negative for sparse data). 
      202        The second element is a list of indices of the examples that support 
      203        the item set, that is, that contain all the items in the set. If 
      204        :obj:`store_examples` is False, the second element is None. 
     205 
     206The example:: 
     207 
     208    import Orange 
     209 
     210    data = Orange.data.Table("lenses") 
     211 
     212    print "Association rules" 
     213    rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
     214    for r in rules: 
     215        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     216 
     217The found rules are: :: 
     218 
     219    0.333  0.533  lenses=none -> prescription=hypermetrope 
     220    0.333  0.667  prescription=hypermetrope -> lenses=none 
     221    0.333  0.533  lenses=none -> astigmatic=yes 
     222    0.333  0.667  astigmatic=yes -> lenses=none 
     223    0.500  0.800  lenses=none -> tear_rate=reduced 
     224    0.500  1.000  tear_rate=reduced -> lenses=none 
     225 
      226To limit the algorithm to classification rules, set classification_rules to True: :: 
      227 
      228    print "\\nClassification rules" 
      229    rules = Orange.associate.AssociationRulesInducer(data, support = 0.3, classification_rules = True) 
      230    for r in rules: 
      231        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     232 
     233The found rules are, naturally, a subset of the above rules: :: 
     234 
     235    0.333  0.667  prescription=hypermetrope -> lenses=none 
     236    0.333  0.667  astigmatic=yes -> lenses=none 
     237    0.500  1.000  tear_rate=reduced -> lenses=none 
     238 
     239Itemsets are induced in a similar fashion as for sparse data, except that the 
     240first element of the tuple, the item set, is represented not by indices of 
     241features, as before, but with tuples (feature-index, value-index): :: 
     242 
     243    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
     244    itemsets = inducer.get_itemsets(data) 
     245    print itemsets[8] 
     246 
     247This prints out :: 
     248 
     249    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
     250 
     251meaning that the ninth itemset contains the second value of the third feature 
     252(2, 1), and the first value of the fifth (4, 0). 
     253 
     254======================= 
     255Representation of rules 
     256======================= 
     257 
     258An :class:`AssociationRule` represents a rule. In Orange, methods for 
     259induction of association rules return the induced rules in 
     260:class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
     261 
     262.. class:: AssociationRule 
     263 
     264    .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
     265 
      266        Constructs an association rule and computes all of the measures listed below. 
     267 
     268    .. method:: __init__(left, right, support, confidence) 
     269 
      270        Constructs an association rule and sets its support and confidence. If 
      271        you intend to pass such a rule on, you should set the other attributes 
      272        manually - the constructor cannot compute anything 
      273        from the support and confidence alone. 
     274 
     275    .. method:: __init__(rule) 
     276 
      277        Given an association rule as the argument, the constructor makes a copy 
      278        of the rule. 
     279 
     280    .. attribute:: left, right 
     281 
     282        The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
     283        In rules created by :class:`AssociationRulesSparseInducer` from examples that 
     284        contain all values as meta-values, left and right are examples in the 
     285        same form. Otherwise, values in left that do not appear in the rule 
      286        are "don't care", and values in right are "don't know". Both can, 
     287        however, be tested by :meth:`~Orange.data.Value.is_special`. 
     288 
     289    .. attribute:: n_left, n_right 
     290 
     291        The number of features (i.e. defined values) on the left and on the 
     292        right side of the rule. 
     293 
     294    .. attribute:: n_applies_left, n_applies_right, n_applies_both 
     295 
     296        The number of (learning) examples that conform to the left, the right 
     297        and to both sides of the rule. 
     298 
     299    .. attribute:: n_examples 
     300 
     301        The total number of learning examples. 
     302 
     303    .. attribute:: support 
     304 
      305        n_applies_both/n_examples. 
     306 
     307    .. attribute:: confidence 
     308 
     309        n_applies_both/n_applies_left. 
     310 
     311    .. attribute:: coverage 
     312 
     313        n_applies_left/n_examples. 
     314 
     315    .. attribute:: strength 
     316 
     317        n_applies_right/n_applies_left. 
     318 
     319    .. attribute:: lift 
     320 
     321        n_examples * n_applies_both / (n_applies_left * n_applies_right). 
     322 
     323    .. attribute:: leverage 
     324 
      325        n_applies_both * n_examples - n_applies_left * n_applies_right. 
     326 
     327    .. attribute:: examples, match_left, match_both 
     328 
     329        If store_examples was True during induction, examples contains a copy 
     330        of the example table used to induce the rules. Attributes match_left 
     331        and match_both are lists of integers, representing the indices of 
     332        examples which match the left-hand side of the rule and both sides, 
     333        respectively. 
     334 
     335    .. method:: applies_left(example) 
     336 
     337    .. method:: applies_right(example) 
     338 
     339    .. method:: applies_both(example) 
     340 
     341        Tells whether the example fits into the left, right or both sides of 
     342        the rule, respectively. If the rule is represented by sparse examples, 
     343        the given example must be sparse as well. 
     344 
      345Association rule inducers do not, by default, store evidence about which 
      346example supports which rule. Let us write a script that finds the examples that 
      347confirm the rule (fit both sides of it) and those that contradict it (fit the 
      348left-hand side but not the right):: 
     349 
     350    import Orange 
     351 
     352    data = Orange.data.Table("lenses") 
     353 
      354    rules = Orange.associate.AssociationRulesInducer(data, support = 0.3) 
     355    rule = rules[0] 
     356 
     357    print 
     358    print "Rule: ", rule 
     359    print 
     360 
     361    print "Supporting examples:" 
     362    for example in data: 
      363        if rule.applies_both(example): 
     364            print example 
     365    print 
     366 
     367    print "Contradicting examples:" 
     368    for example in data: 
     369        if rule.applies_left(example) and not rule.applies_right(example): 
     370            print example 
     371    print 
     372 
     373The latter printouts get simpler and faster if we instruct the inducer to 
     374store the examples. We can then do, for instance, this: :: 
     375 
     376    print "Match left: " 
     377    print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
     378    print "\\nMatch both: " 
     379    print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
     380 
     381The "contradicting" examples are then those whose indices are found in 
      382match_left but not in match_both. A more memory-friendly and faster way 
     383to compute this is as follows: :: 
     384 
     385    >>> [x for x in rule.match_left if not x in rule.match_both] 
     386    [0, 2, 8, 10, 16, 17, 18] 
     387    >>> set(rule.match_left) - set(rule.match_both) 
     388    set([0, 2, 8, 10, 16, 17, 18]) 
     389 
     390=============== 
     391Utilities 
     392=============== 
     393 
     394.. autofunction:: print_rules 
     395 
     396.. autofunction:: sort 
  • docs/reference/rst/Orange.feature.scoring.rst

    r9372 r9988  
    1 .. automodule:: Orange.feature.scoring 
     1.. py:currentmodule:: Orange.feature.scoring 
     2 
     3##################### 
     4Scoring (``scoring``) 
     5##################### 
     6 
     7.. index:: feature scoring 
     8 
     9.. index:: 
     10   single: feature; feature scoring 
     11 
      12A feature score is an assessment of the usefulness of a feature for 
      13prediction of the dependent (class) variable. 
     14 
     15To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into ``data``) use: 
     16 
     17    >>> meas = Orange.feature.scoring.InfoGain() 
     18    >>> print meas("tear_rate", data) 
     19    0.548794925213 
     20 
     21Other scoring methods are listed in :ref:`classification` and 
     22:ref:`regression`. Various ways to call them are described on 
     23:ref:`callingscore`. 
     24 
     25Instead of first constructing the scoring object (e.g. ``InfoGain``) and 
     26then using it, it is usually more convenient to do both in a single step:: 
     27 
     28    >>> print Orange.feature.scoring.InfoGain("tear_rate", data) 
     29    0.548794925213 
     30 
      31This way is much slower for Relief, which can efficiently compute scores 
     32for all features in parallel. 
     33 
     34It is also possible to score features that do not appear in the data 
      35but can be computed from it. A typical case is that of discretized features: 
     36 
     37.. literalinclude:: code/scoring-info-iris.py 
     38    :lines: 7-11 
     39 
     40The following example computes feature scores, both with 
     41:obj:`score_all` and by scoring each feature individually, and prints out 
     42the best three features. 
     43 
     44.. literalinclude:: code/scoring-all.py 
     45    :lines: 7- 
     46 
     47The output:: 
     48 
     49    Feature scores for best three features (with score_all): 
     50    0.613 physician-fee-freeze 
     51    0.255 el-salvador-aid 
     52    0.228 synfuels-corporation-cutback 
     53 
     54    Feature scores for best three features (scored individually): 
     55    0.613 physician-fee-freeze 
     56    0.255 el-salvador-aid 
     57    0.228 synfuels-corporation-cutback 
     58 
     59.. comment 
     60    The next script uses :obj:`GainRatio` and :obj:`Relief`. 
     61 
     62    .. literalinclude:: code/scoring-relief-gainRatio.py 
     63        :lines: 7- 
     64 
     65    Notice that on this data the ranks of features match:: 
     66 
     67        Relief GainRt Feature 
     68        0.613  0.752  physician-fee-freeze 
     69        0.255  0.444  el-salvador-aid 
     70        0.228  0.414  synfuels-corporation-cutback 
     71        0.189  0.382  crime 
     72        0.166  0.345  adoption-of-the-budget-resolution 
     73 
     74 
     75.. _callingscore: 
     76 
     77======================= 
     78Calling scoring methods 
     79======================= 
     80 
      81To score a feature use :obj:`Score.__call__`. There are different 
      82function signatures, which enable optimization. For instance, 
     83most scoring methods first compute contingency tables from the 
     84data. If these are already computed, they can be passed to the scorer 
     85instead of the data. 
     86 
     87Not all classes accept all kinds of arguments. :obj:`Relief`, 
     88for instance, only supports the form with instances on the input. 
     89 
     90.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID]) 
     91 
     92    :param attribute: the chosen feature, either as a descriptor, 
     93      index, or a name. 
     94    :type attribute: :class:`Orange.feature.Descriptor` or int or string 
      95    :param data: the data from which the score is computed. 
     96    :type data: `Orange.data.Table` 
     97    :param weightID: id for meta-feature with weight. 
     98 
     99    All scoring methods support the first signature. 
     100 
     101.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution]) 
     102 
     103    :param attribute: the chosen feature, either as a descriptor, 
     104      index, or a name. 
     105    :type attribute: :class:`Orange.feature.Descriptor` or int or string 
      106    :param domain_contingency: a precomputed domain contingency (contingencies of all features against the class). 
     107    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain` 
     108 
     109.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution]) 
     110 
      111    :param contingency: the contingency of the scored feature against the class variable. 
     112    :type contingency: :obj:`Orange.statistics.contingency.VarClass` 
     113    :param class_distribution: distribution of the class 
     114      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, 
     115      it should be computed on instances where feature value is 
     116      defined. Otherwise, class distribution should be the overall 
     117      class distribution. 
     118    :type class_distribution: 
     119      :obj:`Orange.statistics.distribution.Distribution` 
     120    :param apriori_class_distribution: Optional and most often 
     121      ignored. Useful if the scoring method makes any probability estimates 
     122      based on apriori class probabilities (such as the m-estimate). 
     123    :return: Feature score - the higher the value, the better the feature. 
     124      If the quality cannot be scored, return :obj:`Score.Rejected`. 
     125    :rtype: float or :obj:`Score.Rejected`. 
     126 
     127The code below scores the same feature with :obj:`GainRatio` 
     128using different calls. 
     129 
     130.. literalinclude:: code/scoring-calls.py 
     131    :lines: 7- 
     132 
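A sketch of what these three signatures look like in practice (assuming the
voting data set; the file ``scoring-calls.py`` contains the actual example)::

    import Orange

    data = Orange.data.Table("voting")
    gain_ratio = Orange.feature.scoring.GainRatio()

    # 1. feature and data
    print gain_ratio("physician-fee-freeze", data)

    # 2. feature and a precomputed domain contingency
    domain_cont = Orange.statistics.contingency.Domain(data)
    print gain_ratio("physician-fee-freeze", domain_cont)

    # 3. contingency of the feature against the class, plus the class distribution
    cont = Orange.statistics.contingency.VarClass("physician-fee-freeze", data)
    class_dist = Orange.statistics.distribution.Distribution(
        data.domain.class_var, data)
    print gain_ratio(cont, class_dist)
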
     133.. _classification: 
     134 
     135========================================== 
     136Feature scoring in classification problems 
     137========================================== 
     138 
     139.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. 
     140 
     141.. index:: 
     142   single: feature scoring; information gain 
     143 
     144.. class:: InfoGain 
     145 
      146    Information gain; the expected decrease of entropy. See the `page on Wikipedia 
     147    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
     148 
     149.. index:: 
     150   single: feature scoring; gain ratio 
     151 
     152.. class:: GainRatio 
     153 
     154    Information gain ratio; information gain divided by the entropy of the feature's 
     155    value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
     156    of multi-valued features. It has been shown, however, that it still 
     157    overestimates features with multiple values. See `Wikipedia 
     158    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
     159 
     160.. index:: 
     161   single: feature scoring; gini index 
     162 
     163.. class:: Gini 
     164 
     165    Gini index is the probability that two randomly chosen instances will have different 
     166    classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_. 
     167 
     168.. index:: 
     169   single: feature scoring; relevance 
     170 
     171.. class:: Relevance 
     172 
      173    Estimates the potential usefulness of the feature for construction of decision rules. 
     174 
     175.. index:: 
     176   single: feature scoring; cost 
     177 
     178.. class:: Cost 
     179 
      180    Evaluates features based on the decrease of classification cost achieved by 
      181    knowing the value of the feature, according to the specified cost matrix. 
     182 
     183    .. attribute:: cost 
     184 
     185        Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
     186 
      187    If the cost of predicting the first class for an instance that actually belongs to 
      188    the second is 5, and the cost of the opposite error is 1, then an appropriate 
      189    score can be constructed as follows:: 
     190 
     191 
     192        >>> meas = Orange.feature.scoring.Cost() 
     193        >>> meas.cost = ((0, 5), (1, 0)) 
     194        >>> meas(3, data) 
     195        0.083333350718021393 
     196 
      197    Knowing the value of feature 3 would decrease the 
      198    classification cost by approximately 0.083 per instance. 
     199 
     200    .. comment   opposite error - is this term correct? TODO 
     201 
     202.. index:: 
     203   single: feature scoring; ReliefF 
     204 
     205.. class:: Relief 
     206 
     207    Assesses features' ability to distinguish between very similar 
     208    instances from different classes. This scoring method was first 
      209    developed by Kira and Rendell and then improved by Kononenko. The 
     210    class :obj:`Relief` works on discrete and continuous classes and 
     211    thus implements ReliefF and RReliefF. 
     212 
      213    ReliefF is slow since it needs to find the k nearest neighbours for 
      214    each of m reference instances. Since ReliefF is normally computed for 
      215    all features in the dataset, :obj:`Relief` computes and caches the 
      216    scores of all features when asked to score any single feature. Subsequent 
      217    calls reuse the stored results if the domain and the data table 
      218    have not changed (the data table version and the data checksum are 
      219    compared). Caching only works when the same object is reused. 
     220    Constructing new instances of :obj:`Relief` for each feature, 
     221    like this:: 
     222 
     223        for attr in data.domain.attributes: 
     224            print Orange.feature.scoring.Relief(attr, data) 
     225 
     226    runs much slower than reusing the same instance:: 
     227 
     228        meas = Orange.feature.scoring.Relief() 
      229        for attr in data.domain.attributes: 
     230            print meas(attr, data) 
     231 
     232 
     233    .. attribute:: k 
     234 
     235       Number of neighbours for each instance. Default is 5. 
     236 
     237    .. attribute:: m 
     238 
     239        Number of reference instances. Default is 100. When -1, all 
     240        instances are used as reference. 
     241 
     242    .. attribute:: check_cached_data 
     243 
      244        Check whether the cached data has changed, which may be slow on large 
      245        tables. Defaults to :obj:`True`, but should be disabled when it 
      246        is certain that the data will not change while the scorer is used. 
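
    For example, a sketch that adjusts the number of neighbours (k) and
    reference instances (m); the parameter values below are arbitrary::

        relief = Orange.feature.scoring.Relief(k=10, m=50)
        for attr in data.domain.attributes:
            print "%.3f  %s" % (relief(attr, data), attr.name)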
     247 
     248.. autoclass:: Orange.feature.scoring.Distance 
     249 
     250.. autoclass:: Orange.feature.scoring.MDL 
     251 
     252.. _regression: 
     253 
     254====================================== 
     255Feature scoring in regression problems 
     256====================================== 
     257 
     258.. class:: Relief 
     259 
     260    Relief is used for regression in the same way as for 
     261    classification (see :class:`Relief` in classification 
     262    problems). 
     263 
     264.. index:: 
     265   single: feature scoring; mean square error 
     266 
     267.. class:: MSE 
     268 
     269    Implements the mean square error score. 
     270 
     271    .. attribute:: unknowns_treatment 
     272 
     273        What to do with unknown values. See :obj:`Score.unknowns_treatment`. 
     274 
     275    .. attribute:: m 
     276 
     277        Parameter for m-estimate of error. Default is 0 (no m-estimate). 
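
    A minimal usage sketch, assuming the housing regression data set::

        import Orange

        data = Orange.data.Table("housing")
        mse = Orange.feature.scoring.MSE()
        for attr in data.domain.attributes:
            print "%.3f  %s" % (mse(attr, data), attr.name)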
     278 
     279============ 
     280Base Classes 
     281============ 
     282 
      283The implemented feature scoring methods are subclasses 
      284of :obj:`Score`. Those that compute statistics on conditional 
      285distributions of class values given the feature values are derived from 
      286:obj:`ScoreFromProbabilities`. 
     287 
     288.. class:: Score 
     289 
      290    Abstract base class for feature scoring. Its attributes describe which 
      291    types of features it can handle and which kind of data it requires. 
     292 
     293    **Capabilities** 
     294 
     295    .. attribute:: handles_discrete 
     296 
     297        Indicates whether the scoring method can handle discrete features. 
     298 
     299    .. attribute:: handles_continuous 
     300 
     301        Indicates whether the scoring method can handle continuous features. 
     302 
     303    .. attribute:: computes_thresholds 
     304 
     305        Indicates whether the scoring method implements the :obj:`threshold_function`. 
     306 
     307    **Input specification** 
     308 
     309    .. attribute:: needs 
     310 
      311        The type of data needed, indicated by one of the constants 
      312        below. Classes that use :obj:`DomainContingency` will also handle 
      313        generators. Those based on :obj:`Contingency_Class` will be able 
      314        to take generators and domain contingencies. 
     315 
     316        .. attribute:: Generator 
     317 
     318            Constant. Indicates that the scoring method needs an instance 
     319            generator on the input as, for example, :obj:`Relief`. 
     320 
     321        .. attribute:: DomainContingency 
     322 
     323            Constant. Indicates that the scoring method needs 
     324            :obj:`Orange.statistics.contingency.Domain`. 
     325 
     326        .. attribute:: Contingency_Class 
     327 
      328            Constant. Indicates that the scoring method needs the contingency 
     329            (:obj:`Orange.statistics.contingency.VarClass`), feature 
     330            distribution and the apriori class distribution (as most 
     331            scoring methods). 
     332 
     333    **Treatment of unknown values** 
     334 
     335    .. attribute:: unknowns_treatment 
     336 
      337        Defined in classes that are able to treat unknown values. It 
      338        should be set to one of the constants below; see the sketch after the list. 
     339 
     340        .. attribute:: IgnoreUnknowns 
     341 
     342            Constant. Instances for which the feature value is unknown are removed. 
     343 
     344        .. attribute:: ReduceByUnknown 
     345 
     346            Constant. Features with unknown values are 
     347            punished. The feature quality is reduced by the proportion of 
     348            unknown values. For impurity scores the impurity decreases 
     349            only where the value is defined and stays the same otherwise. 
     350 
     351        .. attribute:: UnknownsToCommon 
     352 
     353            Constant. Undefined values are replaced by the most common value. 
     354 
     355        .. attribute:: UnknownsAsValue 
     356 
     357            Constant. Unknown values are treated as a separate value. 
     358 
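    A sketch of setting the treatment on a concrete scorer (assuming, as for
    most probability-based scorers, that it exposes :obj:`unknowns_treatment`
    and that the constants are reachable through :obj:`Score`)::

        gain = Orange.feature.scoring.InfoGain()
        gain.unknowns_treatment = Orange.feature.scoring.Score.ReduceByUnknown
        print gain("tear_rate", data)
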
     359    **Methods** 
     360 
     361    .. method:: __call__ 
     362 
     363        Abstract. See :ref:`callingscore`. 
     364 
     365    .. method:: threshold_function(attribute, instances[, weightID]) 
     366 
     367        Abstract. 
     368 
     369        Assess different binarizations of the continuous feature 
     370        :obj:`attribute`.  Return a list of tuples. The first element 
     371        is a threshold (between two existing values), the second is 
     372        the quality of the corresponding binary feature, and the third 
     373        the distribution of instances below and above the threshold. 
     374        Not all scorers return the third element. 
     375 
     376        To show the computation of thresholds, we shall use the Iris 
     377        data set: 
     378 
     379        .. literalinclude:: code/scoring-info-iris.py 
     380            :lines: 13-16 
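
        A sketch of reading the returned tuples, assuming a scorer with
        :obj:`computes_thresholds` (the first element is the threshold, the
        second the score; the distribution, when returned, is the third)::

            gain = Orange.feature.scoring.InfoGain()
            for t in gain.threshold_function("petal length", data):
                print "%5.2f  %.4f" % (t[0], t[1])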
     381 
     382    .. method:: best_threshold(attribute, instances) 
     383 
     384        Return the best threshold for binarization, that is, the threshold 
     385        with which the resulting binary feature will have the optimal 
     386        score. 
     387 
      388        The script below prints out the best threshold for 
      389        binarization of a feature. ReliefF is used for scoring: 
     390 
     391        .. literalinclude:: code/scoring-info-iris.py 
     392            :lines: 18-19 
     393 
     394.. class:: ScoreFromProbabilities 
     395 
     396    Bases: :obj:`Score` 
     397 
      398    Abstract base class for feature scoring methods that can be 
      399    computed from contingency matrices. 
     400 
     401    .. attribute:: estimator_constructor 
     402    .. attribute:: conditional_estimator_constructor 
     403 
     404        The classes that are used to estimate unconditional 
     405        and conditional probabilities of classes, respectively. 
     406        Defaults use relative frequencies; possible alternatives are, 
     407        for instance, :obj:`ProbabilityEstimatorConstructor_m` and 
     408        :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` 
     409        (with estimator constructor again set to 
     410        :obj:`ProbabilityEstimatorConstructor_m`), respectively. 
     411 
     412============ 
     413Other 
     414============ 
     415 
     416.. autoclass:: Orange.feature.scoring.OrderAttributes 
     417   :members: 
     418 
     419.. autofunction:: Orange.feature.scoring.score_all 
     420 
     421.. rubric:: Bibliography 
     422 
     423.. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining, 
     424  Woodhead Publishing, 2007. 
     425 
      426.. [Quinlan1986] J. R. Quinlan: Induction of Decision Trees, Machine Learning, 1986. 
      427 
      428.. [Breiman1984] L. Breiman et al.: Classification and Regression Trees, Chapman and Hall, 1984. 
      429 
      430.. [Kononenko1995] I. Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995. 
  • docs/reference/rst/code/discretization-table-method.py

    r9973 r9988  
    22iris = Orange.data.Table("iris.tab") 
    33disc = Orange.data.discretization.DiscretizeTable() 
    4 disc.method = Orange.feature.discretization.EquiNDiscretization(numberOfIntervals=2) 
     4disc.method = Orange.feature.discretization.EqualFreq(numberOfIntervals=2) 
    55disc_iris = disc(iris) 
    66 
  • docs/reference/rst/code/discretization-table.py

    r9812 r9988  
    11import Orange 
    22iris = Orange.data.Table("iris.tab") 
    3 disc_iris = Orange.feature.discretization.DiscretizeTable(iris, 
    4     method=Orange.feature.discretization.EquiNDiscretization(n_intervals=3)) 
     3disc_iris = Orange.data.discretization.DiscretizeTable(iris, 
     4    method=Orange.feature.discretization.EqualFreq(n_intervals=3)) 
    55 
    66print "Original data set:" 
  • docs/reference/rst/code/discretization.py

    r9927 r9988  
    2020 
    2121print "\nManual construction of Interval discretizer - single attribute" 
    22 idisc = Orange.feature.discretization.Interval(points = [3.0, 5.0]) 
     22idisc = Orange.feature.discretization.IntervalDiscretizer(points = [3.0, 5.0]) 
    2323sep_l = idisc.construct_variable(data.domain["sepal length"]) 
    2424data2 = data.select([data.domain["sepal length"], sep_l, data.domain.classVar]) 
     
    2828 
    2929print "\nManual construction of Interval discretizer - all attributes" 
    30 idisc = Orange.feature.discretization.Interval(points = [3.0, 5.0]) 
     30idisc = Orange.feature.discretization.IntervalDiscretizer(points = [3.0, 5.0]) 
    3131newattrs = [idisc.construct_variable(attr) for attr in data.domain.attributes] 
    3232data2 = data.select(newattrs + [data.domain.class_var]) 
     
    6767 
    6868print "\nManual construction of EqualWidth - all attributes" 
    69 edisc = Orange.feature.discretization.EqualWidth(first_cut = 2.0, step = 1.0, number_of_intervals = 5) 
     69edisc = Orange.feature.discretization.EqualWidthDiscretizer(first_cut=2.0, step=1.0, n=5) 
    7070newattrs = [edisc.constructVariable(attr) for attr in data.domain.attributes] 
    7171data2 = data.select(newattrs + [data.domain.classVar]) 