Ignore:
Timestamp:
02/07/12 22:08:39 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Message:

Polished discretization scripts.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/reference/rst/Orange.associate.rst

    r9372 r9988  
    33==================================== 
    44 
    5 .. automodule:: Orange.associate 
     5============================== 
     6Induction of association rules 
     7============================== 
     8 
     9Orange provides two algorithms for induction of 
     10`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
     11One is the basic Agrawal's algorithm with dynamic induction of supported 
     12itemsets and rules that is designed specifically for datasets with a 
     13large number of different items. This is, however, not really suitable 
     14for feature-based machine learning problems. 
     15We have adapted the original algorithm for efficiency 
     16with the latter type of data, and to induce the rules where, 
     17both sides don't only contain features 
     18(like "bread, butter -> jam") but also their values 
     19("bread = wheat, butter = yes -> jam = plum"). 
     20 
     21It is also possible to extract item sets instead of association rules. These 
     22are often more interesting than the rules themselves. 
     23 
     24Besides association rule inducer, Orange also provides a rather simplified 
     25method for classification by association rules. 
     26 
     27=================== 
     28Agrawal's algorithm 
     29=================== 
     30 
     31The class that induces rules by Agrawal's algorithm, accepts the data examples 
     32of two forms. The first is the standard form in which each example is 
     33described by values of a fixed list of features (defined in domain). 
     34The algorithm, however, disregards the feature values and only checks whether 
     35the value is defined or not. The rule shown above ("bread, butter -> jam") 
     36actually means that if "bread" and "butter" are defined, then "jam" is defined 
     37as well. It is expected that most of values will be undefined - if this is not 
     38so, use the :class:`~AssociationRulesInducer`. 
     39 
     40:class:`AssociationRulesSparseInducer` can also use sparse data. 
     41Sparse examples have no fixed 
     42features - the domain is empty. All values assigned to example are given as meta attributes. 
     43All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
     44The most suitable format fot this kind of data it is the basket format. 
     45 
     46The algorithm first dynamically builds all itemsets (sets of features) that have 
     47at least the prescribed support. Each of these is then used to derive rules 
     48with requested confidence. 
     49 
     50If examples were given in the sparse form, so are the left and right side 
     51of the induced rules. If examples were given in the standard form, so are 
     52the examples in association rules. 
     53 
     54.. class:: AssociationRulesSparseInducer 
     55 
     56    .. attribute:: support 
     57 
     58        Minimal support for the rule. 
     59 
     60    .. attribute:: confidence 
     61 
     62        Minimal confidence for the rule. 
     63 
     64    .. attribute:: store_examples 
     65 
     66        Store the examples covered by each rule and 
     67        those confirming it. 
     68 
     69    .. attribute:: max_item_sets 
     70 
     71        The maximal number of itemsets. The algorithm's 
     72        running time (and its memory consumption) depends on the minimal support; 
     73        the lower the requested support, the more eligible itemsets will be found. 
     74        There is no general rule for setting support - perhaps it 
     75        should be around 0.3, but this depends on the data set. 
     76        If the supoort was set too low, the algorithm could run out of memory. 
     77        Therefore, Orange limits the number of generated rules to 
     78        :obj:`max_item_sets`. If Orange reports, that the prescribed 
     79        :obj:`max_item_sets` was exceeded, increase the requered support 
     80        or alternatively, increase :obj:`max_item_sets` to as high as you computer 
     81        can handle. 
     82 
     83    .. method:: __call__(data, weight_id) 
     84 
     85        Induce rules from the data set. 
     86 
     87 
     88    .. method:: get_itemsets(data) 
     89 
     90        Returns a list of pairs. The first element of a pair is a tuple with 
     91        indices of features in the item set (negative for sparse data). 
     92        The second element is a list of indices supporting the item set, that is, 
     93        all the items in the set. If :obj:`store_examples` is False, the second 
     94        element is None. 
     95 
     96We shall test the rule inducer on a dataset consisting of a brief description 
     97of Spanish Inquisition, given by Palin et al: 
     98 
     99    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
     100 
     101    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
     102 
     103The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words. 
     104 
     105Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
     106 
     107.. literalinclude:: code/inquisition.basket 
     108 
     109Inducing the rules is trivial (uses :download:`inquisition.basket <code/inquisition.basket>`):: 
     110 
     111    import Orange 
     112    data = Orange.data.Table("inquisition") 
     113 
     114    rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
     115 
     116    print "%5s   %5s" % ("supp", "conf") 
     117    for r in rules: 
     118        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
     119 
     120The induced rules are surprisingly fear-full: :: 
     121 
     122    0.500   1.000   fear -> surprise 
     123    0.500   1.000   surprise -> fear 
     124    0.500   1.000   fear -> surprise our 
     125    0.500   1.000   fear surprise -> our 
     126    0.500   1.000   fear our -> surprise 
     127    0.500   1.000   surprise -> fear our 
     128    0.500   1.000   surprise our -> fear 
     129    0.500   0.714   our -> fear surprise 
     130    0.500   1.000   fear -> our 
     131    0.500   0.714   our -> fear 
     132    0.500   1.000   surprise -> our 
     133    0.500   0.714   our -> surprise 
     134 
     135To get only a list of supported item sets, one should call the method 
     136get_itemsets:: 
     137 
     138    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True) 
     139    itemsets = inducer.get_itemsets(data) 
     140 
     141Now itemsets is a list of itemsets along with the examples supporting them 
     142since we set store_examples to True. :: 
     143 
     144    >>> itemsets[5] 
     145    ((-11, -7), [1, 2, 3, 6, 9]) 
     146    >>> [data.domain[i].name for i in itemsets[5][0]] 
     147    ['surprise', 'our'] 
     148 
     149The sixth itemset contains features with indices -11 and -7, that is, the 
     150words "surprise" and "our". The examples supporting it are those with 
     151indices 1,2, 3, 6 and 9. 
     152 
     153This way of representing the itemsets is memory efficient and faster than using 
     154objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`. 
     155 
     156.. _non-sparse-examples: 
     157 
     158=================== 
     159Non-sparse data 
     160=================== 
     161 
     162:class:`AssociationRulesInducer` works with non-sparse data. 
     163Unknown values are ignored, while values of features are not (as opposite to 
     164the algorithm for sparse rules). In addition, the algorithm 
     165can be directed to search only for classification rules, in which the only 
     166feature on the right-hand side is the class variable. 
     167 
     168.. class:: AssociationRulesInducer 
     169 
     170    All attributes can be set with the constructor. 
     171 
     172    .. attribute:: support 
     173 
     174       Minimal support for the rule. 
     175 
     176    .. attribute:: confidence 
     177 
     178        Minimal confidence for the rule. 
     179 
     180    .. attribute:: classification_rules 
     181 
     182        If True (default is False), the classification rules are constructed instead 
     183        of general association rules. 
     184 
     185    .. attribute:: store_examples 
     186 
     187        Store the examples covered by each rule and those 
     188        confirming it 
     189 
     190    .. attribute:: max_item_sets 
     191 
     192        The maximal number of itemsets. 
     193 
     194    .. method:: __call__(data, weight_id) 
     195 
     196        Induce rules from the data set. 
     197 
     198    .. method:: get_itemsets(data) 
     199 
     200        Returns a list of pairs. The first element of a pair is a tuple with 
     201        indices of features in the item set (negative for sparse data). 
     202        The second element is a list of indices supporting the item set, that is, 
     203        all the items in the set. If :obj:`store_examples` is False, the second 
     204        element is None. 
     205 
     206The example:: 
     207 
     208    import Orange 
     209 
     210    data = Orange.data.Table("lenses") 
     211 
     212    print "Association rules" 
     213    rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
     214    for r in rules: 
     215        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     216 
     217The found rules are: :: 
     218 
     219    0.333  0.533  lenses=none -> prescription=hypermetrope 
     220    0.333  0.667  prescription=hypermetrope -> lenses=none 
     221    0.333  0.533  lenses=none -> astigmatic=yes 
     222    0.333  0.667  astigmatic=yes -> lenses=none 
     223    0.500  0.800  lenses=none -> tear_rate=reduced 
     224    0.500  1.000  tear_rate=reduced -> lenses=none 
     225 
     226To limit the algorithm to classification rules, set classificationRules to 1: :: 
     227 
     228    print "\\nClassification rules" 
     229    rules = orange.AssociationRulesInducer(data, support = 0.3, classificationRules = 1) 
     230    for r in rules: 
     231        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     232 
     233The found rules are, naturally, a subset of the above rules: :: 
     234 
     235    0.333  0.667  prescription=hypermetrope -> lenses=none 
     236    0.333  0.667  astigmatic=yes -> lenses=none 
     237    0.500  1.000  tear_rate=reduced -> lenses=none 
     238 
     239Itemsets are induced in a similar fashion as for sparse data, except that the 
     240first element of the tuple, the item set, is represented not by indices of 
     241features, as before, but with tuples (feature-index, value-index): :: 
     242 
     243    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
     244    itemsets = inducer.get_itemsets(data) 
     245    print itemsets[8] 
     246 
     247This prints out :: 
     248 
     249    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
     250 
     251meaning that the ninth itemset contains the second value of the third feature 
     252(2, 1), and the first value of the fifth (4, 0). 
     253 
     254======================= 
     255Representation of rules 
     256======================= 
     257 
     258An :class:`AssociationRule` represents a rule. In Orange, methods for 
     259induction of association rules return the induced rules in 
     260:class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
     261 
     262.. class:: AssociationRule 
     263 
     264    .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
     265 
     266        Constructs an association rule and computes all measures listed above. 
     267 
     268    .. method:: __init__(left, right, support, confidence) 
     269 
     270        Construct association rule and sets its support and confidence. If 
     271        you intend to pass on such a rule you should set other attributes 
     272        manually - AssociationRules's constructor cannot compute anything 
     273        from arguments support and confidence. 
     274 
     275    .. method:: __init__(rule) 
     276 
     277        Given an association rule as the argument, constructor copies of the 
     278        rule. 
     279 
     280    .. attribute:: left, right 
     281 
     282        The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
     283        In rules created by :class:`AssociationRulesSparseInducer` from examples that 
     284        contain all values as meta-values, left and right are examples in the 
     285        same form. Otherwise, values in left that do not appear in the rule 
     286        are "don't care", and value in right are "don't know". Both can, 
     287        however, be tested by :meth:`~Orange.data.Value.is_special`. 
     288 
     289    .. attribute:: n_left, n_right 
     290 
     291        The number of features (i.e. defined values) on the left and on the 
     292        right side of the rule. 
     293 
     294    .. attribute:: n_applies_left, n_applies_right, n_applies_both 
     295 
     296        The number of (learning) examples that conform to the left, the right 
     297        and to both sides of the rule. 
     298 
     299    .. attribute:: n_examples 
     300 
     301        The total number of learning examples. 
     302 
     303    .. attribute:: support 
     304 
     305        nAppliesBoth/nExamples. 
     306 
     307    .. attribute:: confidence 
     308 
     309        n_applies_both/n_applies_left. 
     310 
     311    .. attribute:: coverage 
     312 
     313        n_applies_left/n_examples. 
     314 
     315    .. attribute:: strength 
     316 
     317        n_applies_right/n_applies_left. 
     318 
     319    .. attribute:: lift 
     320 
     321        n_examples * n_applies_both / (n_applies_left * n_applies_right). 
     322 
     323    .. attribute:: leverage 
     324 
     325        (n_Applies_both * n_examples - n_applies_left * n_applies_right). 
     326 
     327    .. attribute:: examples, match_left, match_both 
     328 
     329        If store_examples was True during induction, examples contains a copy 
     330        of the example table used to induce the rules. Attributes match_left 
     331        and match_both are lists of integers, representing the indices of 
     332        examples which match the left-hand side of the rule and both sides, 
     333        respectively. 
     334 
     335    .. method:: applies_left(example) 
     336 
     337    .. method:: applies_right(example) 
     338 
     339    .. method:: applies_both(example) 
     340 
     341        Tells whether the example fits into the left, right or both sides of 
     342        the rule, respectively. If the rule is represented by sparse examples, 
     343        the given example must be sparse as well. 
     344 
     345Association rule inducers do not store evidence about which example supports 
     346which rule. Let us write a function that finds the examples that 
     347confirm the rule (fit both sides of it) and those that contradict it (fit the 
     348left-hand side but not the right). The example:: 
     349 
     350    import Orange 
     351 
     352    data = Orange.data.Table("lenses") 
     353 
     354    rules = Orange.associate.AssociationRulesInducer(data, supp = 0.3) 
     355    rule = rules[0] 
     356 
     357    print 
     358    print "Rule: ", rule 
     359    print 
     360 
     361    print "Supporting examples:" 
     362    for example in data: 
     363        if rule.appliesBoth(example): 
     364            print example 
     365    print 
     366 
     367    print "Contradicting examples:" 
     368    for example in data: 
     369        if rule.applies_left(example) and not rule.applies_right(example): 
     370            print example 
     371    print 
     372 
     373The latter printouts get simpler and faster if we instruct the inducer to 
     374store the examples. We can then do, for instance, this: :: 
     375 
     376    print "Match left: " 
     377    print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
     378    print "\\nMatch both: " 
     379    print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
     380 
     381The "contradicting" examples are then those whose indices are found in 
     382match_left but not in match_both. The memory friendlier and the faster way 
     383to compute this is as follows: :: 
     384 
     385    >>> [x for x in rule.match_left if not x in rule.match_both] 
     386    [0, 2, 8, 10, 16, 17, 18] 
     387    >>> set(rule.match_left) - set(rule.match_both) 
     388    set([0, 2, 8, 10, 16, 17, 18]) 
     389 
     390=============== 
     391Utilities 
     392=============== 
     393 
     394.. autofunction:: print_rules 
     395 
     396.. autofunction:: sort 
Note: See TracChangeset for help on using the changeset viewer.