source: orange/Orange/associate/__init__.py @ 9919:8a2a770ef3af

Revision 9919:8a2a770ef3af, 15.4 KB checked in by markotoplak, 2 years ago (diff)

data.variable -> feature

Line 
1"""
2==============================
3Induction of association rules
4==============================
5
6Orange provides two algorithms for induction of
7`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_.
8One is the basic Agrawal's algorithm with dynamic induction of supported
9itemsets and rules that is designed specifically for datasets with a
10large number of different items. This is, however, not really suitable
11for feature-based machine learning problems.
12We have adapted the original algorithm for efficiency
13with the latter type of data, and to induce the rules where,
14both sides don't only contain features
15(like "bread, butter -> jam") but also their values
16("bread = wheat, butter = yes -> jam = plum").
17
18It is also possible to extract item sets instead of association rules. These
19are often more interesting than the rules themselves.
20
21Besides association rule inducer, Orange also provides a rather simplified
22method for classification by association rules.
23
24===================
25Agrawal's algorithm
26===================
27
28The class that induces rules by Agrawal's algorithm, accepts the data examples
29of two forms. The first is the standard form in which each example is
30described by values of a fixed list of features (defined in domain).
31The algorithm, however, disregards the feature values and only checks whether
32the value is defined or not. The rule shown above ("bread, butter -> jam")
33actually means that if "bread" and "butter" are defined, then "jam" is defined
34as well. It is expected that most of values will be undefined - if this is not
35so, use the :class:`~AssociationRulesInducer`.
36
37:class:`AssociationRulesSparseInducer` can also use sparse data.
38Sparse examples have no fixed
39features - the domain is empty. All values assigned to example are given as meta attributes.
40All meta attributes need to be registered with the :obj:`~Orange.data.Domain`.
41The most suitable format fot this kind of data it is the basket format.
42
43The algorithm first dynamically builds all itemsets (sets of features) that have
44at least the prescribed support. Each of these is then used to derive rules
45with requested confidence.
46
47If examples were given in the sparse form, so are the left and right side
48of the induced rules. If examples were given in the standard form, so are
49the examples in association rules.
50
51.. class:: AssociationRulesSparseInducer
52
53    .. attribute:: support
54   
55        Minimal support for the rule.
56       
57    .. attribute:: confidence
58   
59        Minimal confidence for the rule.
60       
61    .. attribute:: store_examples
62   
63        Store the examples covered by each rule and
64        those confirming it.
65       
66    .. attribute:: max_item_sets
67   
68        The maximal number of itemsets. The algorithm's
69        running time (and its memory consumption) depends on the minimal support;
70        the lower the requested support, the more eligible itemsets will be found.
71        There is no general rule for setting support - perhaps it
72        should be around 0.3, but this depends on the data set.
73        If the supoort was set too low, the algorithm could run out of memory.
74        Therefore, Orange limits the number of generated rules to
75        :obj:`max_item_sets`. If Orange reports, that the prescribed
76        :obj:`max_item_sets` was exceeded, increase the requered support
77        or alternatively, increase :obj:`max_item_sets` to as high as you computer
78        can handle.
79
80    .. method:: __call__(data, weight_id)
81
82        Induce rules from the data set.
83
84
85    .. method:: get_itemsets(data)
86
87        Returns a list of pairs. The first element of a pair is a tuple with
88        indices of features in the item set (negative for sparse data).
89        The second element is a list of indices supporting the item set, that is,
90        all the items in the set. If :obj:`store_examples` is False, the second
91        element is None.
92
93We shall test the rule inducer on a dataset consisting of a brief description
94of Spanish Inquisition, given by Palin et al:
95
96    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again.
97
98    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn!
99   
100The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words.
101
102Data example (:download:`inquisition.basket <code/inquisition.basket>`):
103
104.. literalinclude:: code/inquisition.basket
105   
106Inducing the rules is trivial (uses :download:`inquisition.basket <code/inquisition.basket>`)::
107
108    import Orange
109    data = Orange.data.Table("inquisition")
110
111    rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)
112
113    print "%5s   %5s" % ("supp", "conf")
114    for r in rules:
115        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r)
116
117The induced rules are surprisingly fear-full: ::
118
119    0.500   1.000   fear -> surprise
120    0.500   1.000   surprise -> fear
121    0.500   1.000   fear -> surprise our
122    0.500   1.000   fear surprise -> our
123    0.500   1.000   fear our -> surprise
124    0.500   1.000   surprise -> fear our
125    0.500   1.000   surprise our -> fear
126    0.500   0.714   our -> fear surprise
127    0.500   1.000   fear -> our
128    0.500   0.714   our -> fear
129    0.500   1.000   surprise -> our
130    0.500   0.714   our -> surprise
131
132To get only a list of supported item sets, one should call the method
133get_itemsets::
134
135    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True)
136    itemsets = inducer.get_itemsets(data)
137   
138Now itemsets is a list of itemsets along with the examples supporting them
139since we set store_examples to True. ::
140
141    >>> itemsets[5]
142    ((-11, -7), [1, 2, 3, 6, 9])
143    >>> [data.domain[i].name for i in itemsets[5][0]]
144    ['surprise', 'our']   
145   
146The sixth itemset contains features with indices -11 and -7, that is, the
147words "surprise" and "our". The examples supporting it are those with
148indices 1,2, 3, 6 and 9.
149
150This way of representing the itemsets is memory efficient and faster than using
151objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`.
152
153.. _non-sparse-examples:
154
155===================
156Non-sparse data
157===================
158
159:class:`AssociationRulesInducer` works with non-sparse data.
160Unknown values are ignored, while values of features are not (as opposite to
161the algorithm for sparse rules). In addition, the algorithm
162can be directed to search only for classification rules, in which the only
163feature on the right-hand side is the class variable.
164
165.. class:: AssociationRulesInducer
166
167    All attributes can be set with the constructor.
168
169    .. attribute:: support
170   
171       Minimal support for the rule.
172   
173    .. attribute:: confidence
174   
175        Minimal confidence for the rule.
176   
177    .. attribute:: classification_rules
178   
179        If True (default is False), the classification rules are constructed instead
180        of general association rules.
181
182    .. attribute:: store_examples
183   
184        Store the examples covered by each rule and those
185        confirming it
186       
187    .. attribute:: max_item_sets
188   
189        The maximal number of itemsets.
190
191    .. method:: __call__(data, weight_id)
192
193        Induce rules from the data set.
194
195    .. method:: get_itemsets(data)
196
197        Returns a list of pairs. The first element of a pair is a tuple with
198        indices of features in the item set (negative for sparse data).
199        The second element is a list of indices supporting the item set, that is,
200        all the items in the set. If :obj:`store_examples` is False, the second
201        element is None.
202
203The example::
204
205    import Orange
206
207    data = Orange.data.Table("lenses")
208
209    print "Association rules"
210    rules = Orange.associate.AssociationRulesInducer(data, support = 0.5)
211    for r in rules:
212        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)
213       
214The found rules are: ::
215
216    0.333  0.533  lenses=none -> prescription=hypermetrope
217    0.333  0.667  prescription=hypermetrope -> lenses=none
218    0.333  0.533  lenses=none -> astigmatic=yes
219    0.333  0.667  astigmatic=yes -> lenses=none
220    0.500  0.800  lenses=none -> tear_rate=reduced
221    0.500  1.000  tear_rate=reduced -> lenses=none
222   
223To limit the algorithm to classification rules, set classificationRules to 1: ::
224
225    print "\\nClassification rules"
226    rules = orange.AssociationRulesInducer(data, support = 0.3, classificationRules = 1)
227    for r in rules:
228        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)
229
230The found rules are, naturally, a subset of the above rules: ::
231
232    0.333  0.667  prescription=hypermetrope -> lenses=none
233    0.333  0.667  astigmatic=yes -> lenses=none
234    0.500  1.000  tear_rate=reduced -> lenses=none
235   
236Itemsets are induced in a similar fashion as for sparse data, except that the
237first element of the tuple, the item set, is represented not by indices of
238features, as before, but with tuples (feature-index, value-index): ::
239
240    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True)
241    itemsets = inducer.get_itemsets(data)
242    print itemsets[8]
243   
244This prints out ::
245
246    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23])
247   
248meaning that the ninth itemset contains the second value of the third feature
249(2, 1), and the first value of the fifth (4, 0).
250
251=======================
252Representation of rules
253=======================
254
255An :class:`AssociationRule` represents a rule. In Orange, methods for
256induction of association rules return the induced rules in
257:class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances.
258
259.. class:: AssociationRule
260
261    .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples)
262   
263        Constructs an association rule and computes all measures listed above.
264   
265    .. method:: __init__(left, right, support, confidence)
266   
267        Construct association rule and sets its support and confidence. If
268        you intend to pass on such a rule you should set other attributes
269        manually - AssociationRules's constructor cannot compute anything
270        from arguments support and confidence.
271   
272    .. method:: __init__(rule)
273   
274        Given an association rule as the argument, constructor copies of the
275        rule.
276 
277    .. attribute:: left, right
278   
279        The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`.
280        In rules created by :class:`AssociationRulesSparseInducer` from examples that
281        contain all values as meta-values, left and right are examples in the
282        same form. Otherwise, values in left that do not appear in the rule
283        are "don't care", and value in right are "don't know". Both can,
284        however, be tested by :meth:`~Orange.data.Value.is_special`.
285   
286    .. attribute:: n_left, n_right
287   
288        The number of features (i.e. defined values) on the left and on the
289        right side of the rule.
290   
291    .. attribute:: n_applies_left, n_applies_right, n_applies_both
292   
293        The number of (learning) examples that conform to the left, the right
294        and to both sides of the rule.
295   
296    .. attribute:: n_examples
297   
298        The total number of learning examples.
299   
300    .. attribute:: support
301   
302        nAppliesBoth/nExamples.
303
304    .. attribute:: confidence
305   
306        n_applies_both/n_applies_left.
307   
308    .. attribute:: coverage
309   
310        n_applies_left/n_examples.
311
312    .. attribute:: strength
313   
314        n_applies_right/n_applies_left.
315   
316    .. attribute:: lift
317   
318        n_examples * n_applies_both / (n_applies_left * n_applies_right).
319   
320    .. attribute:: leverage
321   
322        (n_Applies_both * n_examples - n_applies_left * n_applies_right).
323   
324    .. attribute:: examples, match_left, match_both
325   
326        If store_examples was True during induction, examples contains a copy
327        of the example table used to induce the rules. Attributes match_left
328        and match_both are lists of integers, representing the indices of
329        examples which match the left-hand side of the rule and both sides,
330        respectively.
331   
332    .. method:: applies_left(example)
333   
334    .. method:: applies_right(example)
335   
336    .. method:: applies_both(example)
337   
338        Tells whether the example fits into the left, right or both sides of
339        the rule, respectively. If the rule is represented by sparse examples,
340        the given example must be sparse as well.
341   
342Association rule inducers do not store evidence about which example supports
343which rule. Let us write a function that finds the examples that
344confirm the rule (fit both sides of it) and those that contradict it (fit the
345left-hand side but not the right). The example::
346
347    import Orange
348
349    data = Orange.data.Table("lenses")
350
351    rules = Orange.associate.AssociationRulesInducer(data, supp = 0.3)
352    rule = rules[0]
353
354    print
355    print "Rule: ", rule
356    print
357
358    print "Supporting examples:"
359    for example in data:
360        if rule.appliesBoth(example):
361            print example
362    print
363
364    print "Contradicting examples:"
365    for example in data:
366        if rule.applies_left(example) and not rule.applies_right(example):
367            print example
368    print
369
370The latter printouts get simpler and faster if we instruct the inducer to
371store the examples. We can then do, for instance, this: ::
372
373    print "Match left: "
374    print "\\n".join(str(rule.examples[i]) for i in rule.match_left)
375    print "\\nMatch both: "
376    print "\\n".join(str(rule.examples[i]) for i in rule.match_both)
377
378The "contradicting" examples are then those whose indices are found in
379match_left but not in match_both. The memory friendlier and the faster way
380to compute this is as follows: ::
381
382    >>> [x for x in rule.match_left if not x in rule.match_both]
383    [0, 2, 8, 10, 16, 17, 18]
384    >>> set(rule.match_left) - set(rule.match_both)
385    set([0, 2, 8, 10, 16, 17, 18])
386
387===============
388Utilities
389===============
390
391.. autofunction:: print_rules
392
393.. autofunction:: sort
394
395"""
396
397from orange import \
398    AssociationRule, \
399    AssociationRules, \
400    AssociationRulesInducer, \
401    AssociationRulesSparseInducer, \
402    ItemsetNodeProxy, \
403    ItemsetsSparseInducer
404
405def print_rules(rules, ms = []):
406    """
407    Print the rules. If ms is left empty, only the rules are printed. If ms
408    contains rules' attributes, e.g. ``["support", "confidence"]``, these are printed out as well.
409    """
410    if ms:
411        print "\t".join([m[:4] for m in ms]) + "\trule"
412        for rule in rules:
413            print "\t".join(["%5.3f" % getattr(rule, m) for m in ms]) + "\t" + str(rule)
414    else:
415        for rule in rules:
416            print rule
417
418class __Cmp:
419    def __init__(self, ms):
420        self.ms = ms
421
422    def __call__(self, r1, r2):       
423        for m in self.ms:
424            c = -cmp(getattr(r1, m), getattr(r2, m))
425            if c:
426                return c
427        return 0
428
429def sort(rules, ms = ["support"]):
430    """
431    Sort the rules according to the given criteria. The default key is "support"; list multiple keys in a list.
432    """
433    rules.sort(__Cmp(ms))
Note: See TracBrowser for help on using the repository browser.