source: orange/docs/reference/rst/Orange.associate.rst @ 9994:1073e0304a87

Revision 9994:1073e0304a87, 14.1 KB checked in by Matija Polajnar <matija.polajnar@…>, 2 years ago (diff)

Remove links from documentation to datasets. Remove datasets reference directory.

Line 
1====================================
2Association rules (``associate``)
3====================================
4
5==============================
6Induction of association rules
7==============================
8
9Orange provides two algorithms for induction of
10`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_.
11One is the basic Agrawal's algorithm with dynamic induction of supported
12itemsets and rules that is designed specifically for datasets with a
13large number of different items. This is, however, not really suitable
14for feature-based machine learning problems.
15We have adapted the original algorithm for efficiency
16with the latter type of data, and to induce the rules where,
17both sides don't only contain features
18(like "bread, butter -> jam") but also their values
19("bread = wheat, butter = yes -> jam = plum").
20
21It is also possible to extract item sets instead of association rules. These
22are often more interesting than the rules themselves.
23
24Besides association rule inducer, Orange also provides a rather simplified
25method for classification by association rules.
26
27===================
28Agrawal's algorithm
29===================
30
31The class that induces rules by Agrawal's algorithm, accepts the data examples
32of two forms. The first is the standard form in which each example is
33described by values of a fixed list of features (defined in domain).
34The algorithm, however, disregards the feature values and only checks whether
35the value is defined or not. The rule shown above ("bread, butter -> jam")
36actually means that if "bread" and "butter" are defined, then "jam" is defined
37as well. It is expected that most of values will be undefined - if this is not
38so, use the :class:`~AssociationRulesInducer`.
39
40:class:`AssociationRulesSparseInducer` can also use sparse data.
41Sparse examples have no fixed
42features - the domain is empty. All values assigned to example are given as meta attributes.
43All meta attributes need to be registered with the :obj:`~Orange.data.Domain`.
44The most suitable format fot this kind of data it is the basket format.
45
46The algorithm first dynamically builds all itemsets (sets of features) that have
47at least the prescribed support. Each of these is then used to derive rules
48with requested confidence.
49
50If examples were given in the sparse form, so are the left and right side
51of the induced rules. If examples were given in the standard form, so are
52the examples in association rules.
53
54.. class:: AssociationRulesSparseInducer
55
56    .. attribute:: support
57
58        Minimal support for the rule.
59
60    .. attribute:: confidence
61
62        Minimal confidence for the rule.
63
64    .. attribute:: store_examples
65
66        Store the examples covered by each rule and
67        those confirming it.
68
69    .. attribute:: max_item_sets
70
71        The maximal number of itemsets. The algorithm's
72        running time (and its memory consumption) depends on the minimal support;
73        the lower the requested support, the more eligible itemsets will be found.
74        There is no general rule for setting support - perhaps it
75        should be around 0.3, but this depends on the data set.
76        If the supoort was set too low, the algorithm could run out of memory.
77        Therefore, Orange limits the number of generated rules to
78        :obj:`max_item_sets`. If Orange reports, that the prescribed
79        :obj:`max_item_sets` was exceeded, increase the requered support
80        or alternatively, increase :obj:`max_item_sets` to as high as you computer
81        can handle.
82
83    .. method:: __call__(data, weight_id)
84
85        Induce rules from the data set.
86
87
88    .. method:: get_itemsets(data)
89
90        Returns a list of pairs. The first element of a pair is a tuple with
91        indices of features in the item set (negative for sparse data).
92        The second element is a list of indices supporting the item set, that is,
93        all the items in the set. If :obj:`store_examples` is False, the second
94        element is None.
95
96We shall test the rule inducer on a dataset consisting of a brief description
97of Spanish Inquisition, given by Palin et al:
98
99    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again.
100
101    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn!
102
103The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words.
104
105Data example (:download:`inquisition.basket <code/inquisition.basket>`):
106
107.. literalinclude:: code/inquisition.basket
108
109Inducing the rules is trivial::
110
111    import Orange
112    data = Orange.data.Table("inquisition")
113
114    rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)
115
116    print "%5s   %5s" % ("supp", "conf")
117    for r in rules:
118        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r)
119
120The induced rules are surprisingly fear-full: ::
121
122    0.500   1.000   fear -> surprise
123    0.500   1.000   surprise -> fear
124    0.500   1.000   fear -> surprise our
125    0.500   1.000   fear surprise -> our
126    0.500   1.000   fear our -> surprise
127    0.500   1.000   surprise -> fear our
128    0.500   1.000   surprise our -> fear
129    0.500   0.714   our -> fear surprise
130    0.500   1.000   fear -> our
131    0.500   0.714   our -> fear
132    0.500   1.000   surprise -> our
133    0.500   0.714   our -> surprise
134
135To get only a list of supported item sets, one should call the method
136get_itemsets::
137
138    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True)
139    itemsets = inducer.get_itemsets(data)
140
141Now itemsets is a list of itemsets along with the examples supporting them
142since we set store_examples to True. ::
143
144    >>> itemsets[5]
145    ((-11, -7), [1, 2, 3, 6, 9])
146    >>> [data.domain[i].name for i in itemsets[5][0]]
147    ['surprise', 'our']
148
149The sixth itemset contains features with indices -11 and -7, that is, the
150words "surprise" and "our". The examples supporting it are those with
151indices 1,2, 3, 6 and 9.
152
153This way of representing the itemsets is memory efficient and faster than using
154objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`.
155
156.. _non-sparse-examples:
157
158===================
159Non-sparse data
160===================
161
162:class:`AssociationRulesInducer` works with non-sparse data.
163Unknown values are ignored, while values of features are not (as opposite to
164the algorithm for sparse rules). In addition, the algorithm
165can be directed to search only for classification rules, in which the only
166feature on the right-hand side is the class variable.
167
168.. class:: AssociationRulesInducer
169
170    All attributes can be set with the constructor.
171
172    .. attribute:: support
173
174       Minimal support for the rule.
175
176    .. attribute:: confidence
177
178        Minimal confidence for the rule.
179
180    .. attribute:: classification_rules
181
182        If True (default is False), the classification rules are constructed instead
183        of general association rules.
184
185    .. attribute:: store_examples
186
187        Store the examples covered by each rule and those
188        confirming it
189
190    .. attribute:: max_item_sets
191
192        The maximal number of itemsets.
193
194    .. method:: __call__(data, weight_id)
195
196        Induce rules from the data set.
197
198    .. method:: get_itemsets(data)
199
200        Returns a list of pairs. The first element of a pair is a tuple with
201        indices of features in the item set (negative for sparse data).
202        The second element is a list of indices supporting the item set, that is,
203        all the items in the set. If :obj:`store_examples` is False, the second
204        element is None.
205
206The example::
207
208    import Orange
209
210    data = Orange.data.Table("lenses")
211
212    print "Association rules"
213    rules = Orange.associate.AssociationRulesInducer(data, support = 0.5)
214    for r in rules:
215        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)
216
217The found rules are: ::
218
219    0.333  0.533  lenses=none -> prescription=hypermetrope
220    0.333  0.667  prescription=hypermetrope -> lenses=none
221    0.333  0.533  lenses=none -> astigmatic=yes
222    0.333  0.667  astigmatic=yes -> lenses=none
223    0.500  0.800  lenses=none -> tear_rate=reduced
224    0.500  1.000  tear_rate=reduced -> lenses=none
225
226To limit the algorithm to classification rules, set classificationRules to 1: ::
227
228    print "\\nClassification rules"
229    rules = orange.AssociationRulesInducer(data, support = 0.3, classificationRules = 1)
230    for r in rules:
231        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)
232
233The found rules are, naturally, a subset of the above rules: ::
234
235    0.333  0.667  prescription=hypermetrope -> lenses=none
236    0.333  0.667  astigmatic=yes -> lenses=none
237    0.500  1.000  tear_rate=reduced -> lenses=none
238
239Itemsets are induced in a similar fashion as for sparse data, except that the
240first element of the tuple, the item set, is represented not by indices of
241features, as before, but with tuples (feature-index, value-index): ::
242
243    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True)
244    itemsets = inducer.get_itemsets(data)
245    print itemsets[8]
246
247This prints out ::
248
249    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23])
250
251meaning that the ninth itemset contains the second value of the third feature
252(2, 1), and the first value of the fifth (4, 0).
253
254=======================
255Representation of rules
256=======================
257
258An :class:`AssociationRule` represents a rule. In Orange, methods for
259induction of association rules return the induced rules in
260:class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances.
261
262.. class:: AssociationRule
263
264    .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples)
265
266        Constructs an association rule and computes all measures listed above.
267
268    .. method:: __init__(left, right, support, confidence)
269
270        Construct association rule and sets its support and confidence. If
271        you intend to pass on such a rule you should set other attributes
272        manually - AssociationRules's constructor cannot compute anything
273        from arguments support and confidence.
274
275    .. method:: __init__(rule)
276
277        Given an association rule as the argument, constructor copies of the
278        rule.
279
280    .. attribute:: left, right
281
282        The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`.
283        In rules created by :class:`AssociationRulesSparseInducer` from examples that
284        contain all values as meta-values, left and right are examples in the
285        same form. Otherwise, values in left that do not appear in the rule
286        are "don't care", and value in right are "don't know". Both can,
287        however, be tested by :meth:`~Orange.data.Value.is_special`.
288
289    .. attribute:: n_left, n_right
290
291        The number of features (i.e. defined values) on the left and on the
292        right side of the rule.
293
294    .. attribute:: n_applies_left, n_applies_right, n_applies_both
295
296        The number of (learning) examples that conform to the left, the right
297        and to both sides of the rule.
298
299    .. attribute:: n_examples
300
301        The total number of learning examples.
302
303    .. attribute:: support
304
305        nAppliesBoth/nExamples.
306
307    .. attribute:: confidence
308
309        n_applies_both/n_applies_left.
310
311    .. attribute:: coverage
312
313        n_applies_left/n_examples.
314
315    .. attribute:: strength
316
317        n_applies_right/n_applies_left.
318
319    .. attribute:: lift
320
321        n_examples * n_applies_both / (n_applies_left * n_applies_right).
322
323    .. attribute:: leverage
324
325        (n_Applies_both * n_examples - n_applies_left * n_applies_right).
326
327    .. attribute:: examples, match_left, match_both
328
329        If store_examples was True during induction, examples contains a copy
330        of the example table used to induce the rules. Attributes match_left
331        and match_both are lists of integers, representing the indices of
332        examples which match the left-hand side of the rule and both sides,
333        respectively.
334
335    .. method:: applies_left(example)
336
337    .. method:: applies_right(example)
338
339    .. method:: applies_both(example)
340
341        Tells whether the example fits into the left, right or both sides of
342        the rule, respectively. If the rule is represented by sparse examples,
343        the given example must be sparse as well.
344
345Association rule inducers do not store evidence about which example supports
346which rule. Let us write a function that finds the examples that
347confirm the rule (fit both sides of it) and those that contradict it (fit the
348left-hand side but not the right). The example::
349
350    import Orange
351
352    data = Orange.data.Table("lenses")
353
354    rules = Orange.associate.AssociationRulesInducer(data, supp = 0.3)
355    rule = rules[0]
356
357    print
358    print "Rule: ", rule
359    print
360
361    print "Supporting examples:"
362    for example in data:
363        if rule.appliesBoth(example):
364            print example
365    print
366
367    print "Contradicting examples:"
368    for example in data:
369        if rule.applies_left(example) and not rule.applies_right(example):
370            print example
371    print
372
373The latter printouts get simpler and faster if we instruct the inducer to
374store the examples. We can then do, for instance, this: ::
375
376    print "Match left: "
377    print "\\n".join(str(rule.examples[i]) for i in rule.match_left)
378    print "\\nMatch both: "
379    print "\\n".join(str(rule.examples[i]) for i in rule.match_both)
380
381The "contradicting" examples are then those whose indices are found in
382match_left but not in match_both. The memory friendlier and the faster way
383to compute this is as follows: ::
384
385    >>> [x for x in rule.match_left if not x in rule.match_both]
386    [0, 2, 8, 10, 16, 17, 18]
387    >>> set(rule.match_left) - set(rule.match_both)
388    set([0, 2, 8, 10, 16, 17, 18])
389
390===============
391Utilities
392===============
393
394.. autofunction:: print_rules
395
396.. autofunction:: sort
Note: See TracBrowser for help on using the repository browser.