Ignore:
Timestamp:
02/07/12 22:08:39 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Message:

Polished discretization scripts.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • Orange/associate/__init__.py

    r9919 r9988  
    1 """ 
    2 ============================== 
    3 Induction of association rules 
    4 ============================== 
    5  
    6 Orange provides two algorithms for induction of 
    7 `association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
    8 One is the basic Agrawal's algorithm with dynamic induction of supported 
    9 itemsets and rules that is designed specifically for datasets with a 
    10 large number of different items. This is, however, not really suitable 
    11 for feature-based machine learning problems. 
    12 We have adapted the original algorithm for efficiency 
    13 with the latter type of data, and to induce the rules where,  
    14 both sides don't only contain features 
    15 (like "bread, butter -> jam") but also their values 
    16 ("bread = wheat, butter = yes -> jam = plum"). 
    17  
    18 It is also possible to extract item sets instead of association rules. These 
    19 are often more interesting than the rules themselves. 
    20  
    21 Besides association rule inducer, Orange also provides a rather simplified 
    22 method for classification by association rules. 
    23  
    24 =================== 
    25 Agrawal's algorithm 
    26 =================== 
    27  
    28 The class that induces rules by Agrawal's algorithm, accepts the data examples 
    29 of two forms. The first is the standard form in which each example is 
    30 described by values of a fixed list of features (defined in domain). 
    31 The algorithm, however, disregards the feature values and only checks whether 
    32 the value is defined or not. The rule shown above ("bread, butter -> jam") 
    33 actually means that if "bread" and "butter" are defined, then "jam" is defined 
    34 as well. It is expected that most of values will be undefined - if this is not 
    35 so, use the :class:`~AssociationRulesInducer`. 
    36  
    37 :class:`AssociationRulesSparseInducer` can also use sparse data.  
    38 Sparse examples have no fixed 
    39 features - the domain is empty. All values assigned to example are given as meta attributes. 
    40 All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
    41 The most suitable format fot this kind of data it is the basket format. 
    42  
    43 The algorithm first dynamically builds all itemsets (sets of features) that have 
    44 at least the prescribed support. Each of these is then used to derive rules 
    45 with requested confidence. 
    46  
    47 If examples were given in the sparse form, so are the left and right side 
    48 of the induced rules. If examples were given in the standard form, so are 
    49 the examples in association rules. 
    50  
    51 .. class:: AssociationRulesSparseInducer 
    52  
    53     .. attribute:: support 
    54      
    55         Minimal support for the rule. 
    56          
    57     .. attribute:: confidence 
    58      
    59         Minimal confidence for the rule. 
    60          
    61     .. attribute:: store_examples 
    62      
    63         Store the examples covered by each rule and 
    64         those confirming it. 
    65          
    66     .. attribute:: max_item_sets 
    67      
    68         The maximal number of itemsets. The algorithm's 
    69         running time (and its memory consumption) depends on the minimal support; 
    70         the lower the requested support, the more eligible itemsets will be found. 
    71         There is no general rule for setting support - perhaps it  
    72         should be around 0.3, but this depends on the data set. 
    73         If the supoort was set too low, the algorithm could run out of memory. 
    74         Therefore, Orange limits the number of generated rules to 
    75         :obj:`max_item_sets`. If Orange reports, that the prescribed 
    76         :obj:`max_item_sets` was exceeded, increase the requered support 
    77         or alternatively, increase :obj:`max_item_sets` to as high as you computer 
    78         can handle. 
    79  
    80     .. method:: __call__(data, weight_id) 
    81  
    82         Induce rules from the data set. 
    83  
    84  
    85     .. method:: get_itemsets(data) 
    86  
    87         Returns a list of pairs. The first element of a pair is a tuple with  
    88         indices of features in the item set (negative for sparse data).  
    89         The second element is a list of indices supporting the item set, that is, 
    90         all the items in the set. If :obj:`store_examples` is False, the second 
    91         element is None. 
    92  
    93 We shall test the rule inducer on a dataset consisting of a brief description 
    94 of Spanish Inquisition, given by Palin et al: 
    95  
    96     NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
    97  
    98     NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
    99      
    100 The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words. 
    101  
    102 Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
    103  
    104 .. literalinclude:: code/inquisition.basket 
    105     
    106 Inducing the rules is trivial (uses :download:`inquisition.basket <code/inquisition.basket>`):: 
    107  
    108     import Orange 
    109     data = Orange.data.Table("inquisition") 
    110  
    111     rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
    112  
    113     print "%5s   %5s" % ("supp", "conf") 
    114     for r in rules: 
    115         print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
    116  
    117 The induced rules are surprisingly fear-full: :: 
    118  
    119     0.500   1.000   fear -> surprise 
    120     0.500   1.000   surprise -> fear 
    121     0.500   1.000   fear -> surprise our 
    122     0.500   1.000   fear surprise -> our 
    123     0.500   1.000   fear our -> surprise 
    124     0.500   1.000   surprise -> fear our 
    125     0.500   1.000   surprise our -> fear 
    126     0.500   0.714   our -> fear surprise 
    127     0.500   1.000   fear -> our 
    128     0.500   0.714   our -> fear 
    129     0.500   1.000   surprise -> our 
    130     0.500   0.714   our -> surprise 
    131  
    132 To get only a list of supported item sets, one should call the method 
    133 get_itemsets:: 
    134  
    135     inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True) 
    136     itemsets = inducer.get_itemsets(data) 
    137      
    138 Now itemsets is a list of itemsets along with the examples supporting them 
    139 since we set store_examples to True. :: 
    140  
    141     >>> itemsets[5] 
    142     ((-11, -7), [1, 2, 3, 6, 9]) 
    143     >>> [data.domain[i].name for i in itemsets[5][0]] 
    144     ['surprise', 'our']    
    145      
    146 The sixth itemset contains features with indices -11 and -7, that is, the 
    147 words "surprise" and "our". The examples supporting it are those with 
    148 indices 1,2, 3, 6 and 9. 
    149  
    150 This way of representing the itemsets is memory efficient and faster than using 
    151 objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`. 
    152  
    153 .. _non-sparse-examples: 
    154  
    155 =================== 
    156 Non-sparse data 
    157 =================== 
    158  
    159 :class:`AssociationRulesInducer` works with non-sparse data. 
    160 Unknown values are ignored, while values of features are not (as opposite to 
    161 the algorithm for sparse rules). In addition, the algorithm 
    162 can be directed to search only for classification rules, in which the only 
    163 feature on the right-hand side is the class variable. 
    164  
    165 .. class:: AssociationRulesInducer 
    166  
    167     All attributes can be set with the constructor.  
    168  
    169     .. attribute:: support 
    170      
    171        Minimal support for the rule. 
    172      
    173     .. attribute:: confidence 
    174      
    175         Minimal confidence for the rule. 
    176      
    177     .. attribute:: classification_rules 
    178      
    179         If True (default is False), the classification rules are constructed instead 
    180         of general association rules. 
    181  
    182     .. attribute:: store_examples 
    183      
    184         Store the examples covered by each rule and those 
    185         confirming it 
    186          
    187     .. attribute:: max_item_sets 
    188      
    189         The maximal number of itemsets. 
    190  
    191     .. method:: __call__(data, weight_id) 
    192  
    193         Induce rules from the data set. 
    194  
    195     .. method:: get_itemsets(data) 
    196  
    197         Returns a list of pairs. The first element of a pair is a tuple with  
    198         indices of features in the item set (negative for sparse data).  
    199         The second element is a list of indices supporting the item set, that is, 
    200         all the items in the set. If :obj:`store_examples` is False, the second 
    201         element is None. 
    202  
    203 The example:: 
    204  
    205     import Orange 
    206  
    207     data = Orange.data.Table("lenses") 
    208  
    209     print "Association rules" 
    210     rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
    211     for r in rules: 
    212         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    213          
    214 The found rules are: :: 
    215  
    216     0.333  0.533  lenses=none -> prescription=hypermetrope 
    217     0.333  0.667  prescription=hypermetrope -> lenses=none 
    218     0.333  0.533  lenses=none -> astigmatic=yes 
    219     0.333  0.667  astigmatic=yes -> lenses=none 
    220     0.500  0.800  lenses=none -> tear_rate=reduced 
    221     0.500  1.000  tear_rate=reduced -> lenses=none 
    222      
    223 To limit the algorithm to classification rules, set classificationRules to 1: :: 
    224  
    225     print "\\nClassification rules" 
    226     rules = orange.AssociationRulesInducer(data, support = 0.3, classificationRules = 1) 
    227     for r in rules: 
    228         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    229  
    230 The found rules are, naturally, a subset of the above rules: :: 
    231  
    232     0.333  0.667  prescription=hypermetrope -> lenses=none 
    233     0.333  0.667  astigmatic=yes -> lenses=none 
    234     0.500  1.000  tear_rate=reduced -> lenses=none 
    235      
    236 Itemsets are induced in a similar fashion as for sparse data, except that the 
    237 first element of the tuple, the item set, is represented not by indices of 
    238 features, as before, but with tuples (feature-index, value-index): :: 
    239  
    240     inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
    241     itemsets = inducer.get_itemsets(data) 
    242     print itemsets[8] 
    243      
    244 This prints out :: 
    245  
    246     (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
    247      
    248 meaning that the ninth itemset contains the second value of the third feature 
    249 (2, 1), and the first value of the fifth (4, 0). 
    250  
    251 ======================= 
    252 Representation of rules 
    253 ======================= 
    254  
    255 An :class:`AssociationRule` represents a rule. In Orange, methods for  
    256 induction of association rules return the induced rules in 
    257 :class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
    258  
    259 .. class:: AssociationRule 
    260  
    261     .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
    262      
    263         Constructs an association rule and computes all measures listed above. 
    264      
    265     .. method:: __init__(left, right, support, confidence) 
    266      
    267         Construct association rule and sets its support and confidence. If 
    268         you intend to pass on such a rule you should set other attributes 
    269         manually - AssociationRules's constructor cannot compute anything 
    270         from arguments support and confidence. 
    271      
    272     .. method:: __init__(rule) 
    273      
    274         Given an association rule as the argument, constructor copies of the 
    275         rule. 
    276   
    277     .. attribute:: left, right 
    278      
    279         The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
    280         In rules created by :class:`AssociationRulesSparseInducer` from examples that 
    281         contain all values as meta-values, left and right are examples in the 
    282         same form. Otherwise, values in left that do not appear in the rule 
    283         are "don't care", and value in right are "don't know". Both can, 
    284         however, be tested by :meth:`~Orange.data.Value.is_special`. 
    285      
    286     .. attribute:: n_left, n_right 
    287      
    288         The number of features (i.e. defined values) on the left and on the 
    289         right side of the rule. 
    290      
    291     .. attribute:: n_applies_left, n_applies_right, n_applies_both 
    292      
    293         The number of (learning) examples that conform to the left, the right 
    294         and to both sides of the rule. 
    295      
    296     .. attribute:: n_examples 
    297      
    298         The total number of learning examples. 
    299      
    300     .. attribute:: support 
    301      
    302         nAppliesBoth/nExamples. 
    303  
    304     .. attribute:: confidence 
    305      
    306         n_applies_both/n_applies_left. 
    307      
    308     .. attribute:: coverage 
    309      
    310         n_applies_left/n_examples. 
    311  
    312     .. attribute:: strength 
    313      
    314         n_applies_right/n_applies_left. 
    315      
    316     .. attribute:: lift 
    317      
    318         n_examples * n_applies_both / (n_applies_left * n_applies_right). 
    319      
    320     .. attribute:: leverage 
    321      
    322         (n_Applies_both * n_examples - n_applies_left * n_applies_right). 
    323      
    324     .. attribute:: examples, match_left, match_both 
    325      
    326         If store_examples was True during induction, examples contains a copy 
    327         of the example table used to induce the rules. Attributes match_left 
    328         and match_both are lists of integers, representing the indices of 
    329         examples which match the left-hand side of the rule and both sides, 
    330         respectively. 
    331     
    332     .. method:: applies_left(example) 
    333      
    334     .. method:: applies_right(example) 
    335      
    336     .. method:: applies_both(example) 
    337      
    338         Tells whether the example fits into the left, right or both sides of 
    339         the rule, respectively. If the rule is represented by sparse examples, 
    340         the given example must be sparse as well. 
    341      
    342 Association rule inducers do not store evidence about which example supports 
    343 which rule. Let us write a function that finds the examples that 
    344 confirm the rule (fit both sides of it) and those that contradict it (fit the 
    345 left-hand side but not the right). The example:: 
    346  
    347     import Orange 
    348  
    349     data = Orange.data.Table("lenses") 
    350  
    351     rules = Orange.associate.AssociationRulesInducer(data, supp = 0.3) 
    352     rule = rules[0] 
    353  
    354     print 
    355     print "Rule: ", rule 
    356     print 
    357  
    358     print "Supporting examples:" 
    359     for example in data: 
    360         if rule.appliesBoth(example): 
    361             print example 
    362     print 
    363  
    364     print "Contradicting examples:" 
    365     for example in data: 
    366         if rule.applies_left(example) and not rule.applies_right(example): 
    367             print example 
    368     print 
    369  
    370 The latter printouts get simpler and faster if we instruct the inducer to 
    371 store the examples. We can then do, for instance, this: :: 
    372  
    373     print "Match left: " 
    374     print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
    375     print "\\nMatch both: " 
    376     print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
    377  
    378 The "contradicting" examples are then those whose indices are found in 
    379 match_left but not in match_both. The memory friendlier and the faster way 
    380 to compute this is as follows: :: 
    381  
    382     >>> [x for x in rule.match_left if not x in rule.match_both] 
    383     [0, 2, 8, 10, 16, 17, 18] 
    384     >>> set(rule.match_left) - set(rule.match_both) 
    385     set([0, 2, 8, 10, 16, 17, 18]) 
    386  
    387 =============== 
    388 Utilities 
    389 =============== 
    390  
    391 .. autofunction:: print_rules 
    392  
    393 .. autofunction:: sort 
    394  
    395 """ 
    396  
    3971from orange import \ 
    3982    AssociationRule, \ 
Note: See TracChangeset for help on using the changeset viewer.