source: orange/Orange/doc/reference/associationRules.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 16.5 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

Line 
1<HTML>
2<HEAD>
3<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
4<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
5</HEAD>
6
7<BODY>
8
9<index name="Agrawal's algorithm">
10<index name="association rules"><H1>Association Rules</H1>
11
12<P>Orange provides two algorithms for induction of association
13rules. One is the basic Agrawal's algorithm with dynamic induction of
14supported itemsets and rules that is designed specifically for
15datasets with a large number of different items. This is, however, not
16really suitable for attribute-based machine learning problems, which
17are at the primary focus of Orange. We have thus adapted the original
18algorithm to be more efficient for the latter type of data, and to
19induce the rules in which, for contrast to Agrawal's rules, both sides
20don't only contain attributes (like "bread, butter -> jam") but also
21their values ("bread = wheat, butter = yes -> jam = plum"). As a
22further variation, the algorithm can be limited to search only for
23classification rules in which the sole attribute to appear on the
24right side of the rule is the class attribute.</P>
25
26<p>It is also possible to extract item sets instead of association rules. These are often more interesting than the rules themselves.</p>
27
28<P>Besides association rule inducer, Orange also provides a rather
29simplified method for classification by association rules.</P>
30
31<HR>
32
33<H2>Association Rules Inducer with Agrawal's Algorithm</H2>
34<index name="classes/AssociationRulesSparseInducer">
35
36<P>The class that induces rules by Agrawal's algorithm, <CODE></CODE> accepts the data examples of two forms. The first is the standard form in which each examples is described by values of a fixed list of attributes, defined in domain. The algorithm, however, disregards the attribute values but only checks whether the value is defined or not. The rule shown above, "bread, butter -> jam" actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most of values will be undefined - if this is not so, you need to use the other association rules inducer, described in the next chapter.</P>
37
38<P>Since the usual representation of examples described above is rather unsuitable for sparse examples, <CODE>AssociationRulesSparseInducer</CODE> can also use examples represented a bit differently. <EM>Sparse examples</EM> have no fixed attributes - the examples' domain is empty, there are neither ordinary nor class attributes. All values assigned to example are given as meta-attributes. All meta-attributes need, however, be <A href="Domain.htm#meta-attributes">registered with the domain descriptor</A>. If you have data of this kind, the most suitable format for it is the <A href="fileformats.htm#basket">.basket format</A>.</P>
39
40<P>In both cases, the examples are first translated into an internal <CODE>AssociationRulesSparseInducer</CODE>'s internal format for sparse datasets. The algorithm first dynamically builds all itemsets (sets of attributes) that have at least the prescribed support. Each of these is then used to derive rules with requested confidence.</P>
41
42<P>If examples were given in the sparse form, so are the left and right side of the induced rules. If examples were given in the standard form, so are the examples in association rules.</P>
43
44<P class=section>Attributes</P>
45<DL class=attributes>
46<DT>support</DT>
47<DD>Minimal support for the rule.</DD>
48
49<DT>confidence</DT>
50<DD>Minimal confidence for the rule.</DD>
51
52<DT>storeExamples</DT>
53<DD>Tells the inducer to store the examples covered by each rule and those confirming it</DD>
54
55<DT>maxItemSets</DT>
56<DD>The maximal number of itemsets.</DD>
57</DL>
58
59<P>The last attribute deserves some explanation. The algorithm's running time (and its memory consumption) depends on the minimal support; the lower the requested support, the more eligible itemsets will be found. There is no general rule for knowing the itemset in advance (generally, value should be around 0.3, but this depends upon the number of different items, the diversity of examples...) so it's very easy to set the limit too low. In this case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it will stop inducing itemsets and report an error when the prescribed maximum <CODE>maxItemSets</CODE> is exceeded. In this case, you should increase the required support. On the other hand, you can (reasonably) increase the <CODE>maxItemSets</CODE> to as high as you computer is able to handle.</P>
60
61<P>We shall test the rule inducer on a dataset consisting of a brief description of Spanish Inquisition, given by Palin <EM>et al</EM>:</P>
62
63<BLOCKQUOTE>
64NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again.</BLOCKQUOTE>
65<BLOCKQUOTE>
66NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn!
67</BLOCKQUOTE>
68</P>
69
70<P>The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words. The first three lines thus become:
71<p class="header"><a href="inquisition.basket">part of inquisition.basket</a></p>
72<XMP class=code>nobody, expects, the, Spanish, Inquisition
73our, chief, weapon, is, surprise, surprise, and, fear,fear, and, surprise
74our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
75</XMP>
76
77<P>Inducing the rules is trivial.</P>
78
79<p class="header"><a href="assoc-agrawal.py">assoc-agrawal.py</a>
80(uses <a href="inquisition.basket">inquisition.basket</a>)</p>
81<XMP class="code">import orange
82data = orange.ExampleTable("inquisition")
83
84rules = orange.AssociationRulesSparseInducer(data, support = 0.5)
85
86print "%5s   %5s" % ("supp", "conf")
87for r in rules:
88    print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r)
89</XMP>
90
91<P>The induced rules are surprisingly fear-full.</P>
92
93<XMP class=code>0.500   1.000   fear -> surprise
940.500   1.000   surprise -> fear
950.500   1.000   fear -> surprise our
960.500   1.000   fear surprise -> our
970.500   1.000   fear our -> surprise
980.500   1.000   surprise -> fear our
990.500   1.000   surprise our -> fear
1000.500   0.714   our -> fear surprise
1010.500   1.000   fear -> our
1020.500   0.714   our -> fear
1030.500   1.000   surprise -> our
1040.500   0.714   our -> surprise
105</XMP>
106
107<P>If examples are weighted, weight can be passed as an additional argument to call operator.</P>
108
109<p>To get only a list of supported item sets, one should call the method <code>getItemsets</code>. The result
110is a list whose elements are tuples with two elements. The first is a tuple with indices of attributes in the item set. Sparse examples are usually represented with meta attributes, so this indices will be negative. The second element is a list of indices supporting the item set, that is, containing all the items in the set. If <code>storeExamples</code> is <code>False</code>, the second element is <code>None</code>.</p>
111
112<p class="header"><a href="assoc-agrawal.py">assoc-agrawal.py</a>
113(uses <a href="inquisition.basket">inquisition.basket</a>)</p>
114<XMP class="code">inducer = orange.AssociationRulesSparseInducer(support = 0.5, storeExamples = True)
115itemsets = inducer.getItemsets(data)
116</XMP>
117
118<p>Now <code>itemsets</code> is a list of itemsets along with the examples supporting them since we set <code>storeExamples</code> to <code>True</code>.</p>
119
120<xmp class="code">>>> itemsets[5]
121((-11, -7), [1, 2, 3, 6, 9])
122>>> [data.domain[i].name for i in itemsets[5][0]]
123['surprise', 'our']
124</xmp>
125
126<p>The sixth itemset contains attributes with indices -11 and -7, that is, the words "surprise" and "our". The examples supporting it are those with indices 1,2, 3, 6 and 9.</p>
127
128<p>This way of representing the itemsets is not very programmer-friendly, but it is much more memory efficient than and faster to work with than using objects like Variable and Example.</p>
129
130
131
132<H2>Association Rules for Non-sparse Examples</H2>
133
134<P>The other algorithm for association rules provided by Orange,
135<CODE><index name="classes/AssociationRulesInducer">AssociationRulesInducer</CODE> is optimized for non-sparse
136examples in the usual Orange form. Each example is described by values
137of a fixed set of attributes. Unknown values are ignored, while values
138of attributes are not (as opposite to the above-described algorithm
139for sparse rules). In addition, the algorithm can be directed to
140search only for classification rules, in which the only attribute on
141the right-hand side is the class attribute.</P>
142
143<P class=section>Attributes</P>
144<DL class=attributes>
145<DT>support</DT>
146<DD>Minimal support for the rule.</DD>
147
148<DT>confidence</DT>
149<DD>Minimal confidence for the rule.</DD>
150
151<DT>classificationRules</DT>
152<DD>If 1 (default is 0), the algorithm constructs classification rules instead of general association rules.</DD>
153
154<DT>storeExamples</DT>
155<DD>Tells the inducer to store the examples covered by each rule and those confirming it</DD>
156
157<DT>maxItemSets</DT>
158<DD>The maximal number of itemsets.</DD>
159</DL>
160
161<P>Meaning of all attributes (except the new one,
162<CODE>classificationRules</CODE>) is the same as for
163<CODE>AssociationRulesSparseInducer</CODE>. See the description of
164<CODE>maxItemSet</CODE> there.</P>
165
166<p class="header"><a href="assoc-agrawal.py">assoc.py</a>
167(uses <a href="lenses.tab">lenses.tab</a>)</p>
168<XMP class="code">import orange
169
170data = orange.ExampleTable("lenses")
171
172print "\nAssociation rules"
173rules = orange.AssociationRulesInducer(data, supp = 0.5)
174for r in rules:
175    print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)
176</XMP>
177
178<P>The found rules are</P>
179<XMP class=code>0.333  0.533  lenses=none -> prescription=hypermetrope
1800.333  0.667  prescription=hypermetrope -> lenses=none
1810.333  0.533  lenses=none -> astigmatic=yes
1820.333  0.667  astigmatic=yes -> lenses=none
1830.500  0.800  lenses=none -> tear_rate=reduced
1840.500  1.000  tear_rate=reduced -> lenses=none
185</XMP>
186
187<P>To limit the algorithm to classification rules, set <CODE>classificationRules</CODE> to 1.</P>
188
189<p class="header"><a href="assoc-agrawal.py">part of assoc.py</a>
190(uses <a href="lenses.tab">lenses.tab</a>)</p>
191<XMP class="code">print "\nClassification rules"
192rules = orange.AssociationRulesInducer(data, supp = 0.3, classificationRules = 1)
193for r in rules:
194    print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)
195</XMP>
196
197<P>The found rules are, naturally, a subset of the above rules.</P>
198
199<p class="header"><a href="assoc-agrawal.py">part of assoc.py</a>
200(uses <a href="lenses.tab">lenses.tab</a>)</p>
201<XMP class="code">0.333  0.667  prescription=hypermetrope -> lenses=none
2020.333  0.667  astigmatic=yes -> lenses=none
2030.500  1.000  tear_rate=reduced -> lenses=none
204</XMP>
205
206<P><CODE>AssociationRulesInducer</CODE> can also work with weighted examples; the ID of weight attribute should be passed as an additional argument in a call.</P>
207
208<p>Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the item set, is represented not by indices of attributes, as before, but with tuples (attribute-index, value-index).</p>
209
210<p class="header"><a href="assoc-agrawal.py">part of assoc.py</a>
211(uses <a href="lenses.tab">lenses.tab</a>)</p>
212<xmp class="code">inducer = orange.AssociationRulesInducer(support = 0.3, storeExamples = True)
213itemsets = inducer.getItemsets(data)
214print itemsets[8]</xmp>
215
216<p>This prints out
217<xmp class="code">(((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23])</xmp>
218meaning that the ninth itemset contains the second value of the third attribute, (2, 1), and the first value of the fifth, (4, 0).</p>
219
220<H2>Association Rule</H2>
221<index name="classes/AssociationRules">
222<index name="classes/AssociationRule">
223
224<P>Both classes for induction of association rules return the induced rules in <CODE><index name="classes/AssociationRules">AssociationRules</CODE> which is basically a list of instances of <CODE>AssociationRule</CODE>.</P>
225
226<P class=section>Attributes</P>
227<DL class=attributes>
228<DT>left, right</DT>
229<DD>The left and the right side of the rule. Both are given as <CODE>Example</CODE>. In rules created by <CODE>AssociationRulesSparseInducer</CODE> from examples that contain all values as meta-values, <CODE>left</CODE> and <CODE>right</CODE> are examples in the same form. Otherwise, values in <CODE>left</CODE> that do not appear in the rule are don't care, and value in <CODE>right</CODE> are don't know. Both can, however, be tested by <CODE>isSpecial</CODE> (see documentation on <A href="Value.htm">Value</A>).</DD>
230
231<DT>nLeft, nRight</DT>
232<DD>The number of attributes (ie defined values) on the left and on the right side of the rule.</DD>
233
234<DT>nAppliesLeft, nAppliesRight, nAppliesBoth</DT>
235<DD>The number of (learning) examples that conform to the left, the right and to both sides of the rule.</DD>
236
237<DT>nExamples</DT>
238<DD>The total number of learning examples.</DD>
239
240<DT>support</DT><DD><CODE>nAppliesBoth/nExamples</CODE></DD>
241<DT>confidence</DT><DD><CODE>nAppliesBoth/nAppliesLeft</CODE></DD>
242<DT>coverage</DT><DD><CODE>nAppliesLeft/nExamples</CODE></DD>
243<DT>strength</DT><DD><CODE>nAppliesRight/nAppliesLeft</CODE></DD>
244<DT>lift</DT><DD><CODE>nExamples * nAppliesBoth / (nAppliesLeft * nAppliesRight)</CODE></DD>
245<DT>leverage</DT><DD><CODE>(nAppliesBoth * nExamples - nAppliesLeft * nAppliesRight)</CODE></DD>
246
247<dt>examples, matchLeft, matchBoth</dt>
248<dd>If <code>storeExamples</code> was <code>True</code> during induction, <code>examples</code> contains a copy of the example table used to induce the rules. Attributes <code>matchLeft</code> and <code>matchBoth</code> are lists of integers, representing the indices of examples which match the left-hand side of the rule and both sides, respectively.</dd>
249</DL>
250
251<P class=section>Methods</P>
252<DL class=attributes>
253<DT>AssociationRule(left, right, nAppliesLeft, nAppliesRight, nAppliesBoth, nExamples)</DT>
254<DD>Constructs an association rule and computes all measures listed above.</DD>
255
256<DT>AssociationRule(left, right, support, confidence]])</DT>
257<DD>Construct association rule and sets its support and confidence. If you intend to pass such a rule to someone that expects more things to be set, you should set the manually - <CODE>AssociationRules</CODE>'s constructor cannot compute anything from these two arguments.
258</DD>
259
260<DT>AssociationRule(rule)</DT>
261<DD>Given an association rules as the argument, constructor copies the rule into a new rule.</DD>
262
263<DT>appliesLeft(example), appliesRight(example), appliesBoth(example)</DT>
264<DD>Tells whether the example fits into the left, right and both sides of the rule, respectively. If the rule is represented by sparse examples, the given example must be sparse as well.</DD>
265</DL>
266
267<P>Association rule inducers do not store evidence about which example supports which rule (although this is available during induction, the information is discarded afterwards). Let us write a function that find the examples that confirm the rule (ie fit both sides of it) and those that contradict it (fit the left-hand side but not the right).</P>
268
269<p class="header"><a href="assoc-rule.py">assoc-rule.py</a>
270(uses <a href="lenses.tab">lenses.tab</a>)</p>
271<XMP class=code>import orange
272
273data = orange.ExampleTable("lenses")
274
275rules = orange.AssociationRulesInducer(data, supp = 0.3)
276rule = rules[0]
277
278print
279print "Rule: ", rule
280print
281
282print "Supporting examples:"
283for example in data:
284    if rule.appliesBoth(example):
285        print example
286print
287
288print "Contradicting examples:"
289for example in data:
290    if rule.appliesLeft(example) and not rule.appliesRight(example):
291        print example
292print
293</XMP>
294
295<p>The latter printouts get simpler and (way!) faster if we instruct the inducer to store the examples. We can then do, for instance, this.</p>
296
297
298(uses <a href="lenses.tab">lenses.tab</a>)</p>
299<XMP class=code>print "Match left: "
300print "\n".join(str(rule.examples[i]) for i in rule.matchLeft)
301print "\nMatch both: "
302print "\n".join(str(rule.examples[i]) for i in rule.matchBoth)
303</XMP>
304
305<p>The "contradicting" examples are then those whose indices are find in <code>matchLeft</code> but not in <code>matchBoth</code>. The memory friendlier and the faster ways to compute this are as follows.</p>
306
307<xmp class="code">>>> [x for x in rule.matchLeft if not x in rule.matchBoth]
308[0, 2, 8, 10, 16, 17, 18]
309>>> set(rule.matchLeft) - set(rule.matchBoth)
310set([0, 2, 8, 10, 16, 17, 18])</xmp>
311
312</BODY> 
Note: See TracBrowser for help on using the repository browser.