Changeset 7235:f579c6c0fc54 in orange


Ignore:
Timestamp:
02/02/11 20:36:19 (3 years ago)
Author:
Gregor Rot <gregor.rot@…>
Branch:
default
Convert:
a45b3121fdaf68140490f75cbb0ed51ae53e9553
Message:
 
File:
1 edited

Legend:

Unmodified
Added
Removed
  • orange/Orange/associate/__init__.py

    r6848 r7235  
     1""" 
     2================= 
     3Introduction 
     4================= 
     5 
     6Orange provides two algorithms for induction of association rules. One is the basic Agrawal's algorithm with dynamic induction of supported itemsets and rules that is designed specifically for datasets with a large number of different items. This is, however, not really suitable for attribute-based machine learning problems, which are at the primary focus of Orange. We have thus adapted the original algorithm to be more efficient for the latter type of data, and to induce the rules in which, for contrast to Agrawal's rules, both sides don't only contain attributes (like "bread, butter -> jam") but also their values ("bread = wheat, butter = yes -> jam = plum"). As a further variation, the algorithm can be limited to search only for classification rules in which the sole attribute to appear on the right side of the rule is the class attribute. 
     7 
     8It is also possible to extract item sets instead of association rules. These are often more interesting than the rules themselves. 
     9 
     10Besides association rule inducer, Orange also provides a rather simplified method for classification by association rules. 
     11 
     12================= 
     13Association rules inducer with Agrawal's algorithm 
     14================= 
     15 
     16The class that induces rules by Agrawal's algorithm, accepts the data examples of two forms. The first is the standard form in which each examples is described by values of a fixed list of attributes, defined in domain. The algorithm, however, disregards the attribute values but only checks whether the value is defined or not. The rule shown above, "bread, butter -> jam" actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most of values will be undefined - if this is not so, you need to use the other association rules inducer, described in the next chapter. 
     17 
     18Since the usual representation of examples described above is rather unsuitable for sparse examples, AssociationRulesSparseInducer can also use examples represented a bit differently. Sparse examples have no fixed attributes - the examples' domain is empty, there are neither ordinary nor class attributes. All values assigned to example are given as meta-attributes. All meta-attributes need, however, be `registered with the domain descriptor <http://orange.biolab.si/doc/reference/Domain.htm#meta-attributes>`_. If you have data of this kind, the most suitable format for it is the `basket format <http://orange.biolab.si/doc/reference/fileformats.htm#basket>`_. 
     19 
     20In both cases, the examples are first translated into an internal AssociationRulesSparseInducer's internal format for sparse datasets. The algorithm first dynamically builds all itemsets (sets of attributes) that have at least the prescribed support. Each of these is then used to derive rules with requested confidence. 
     21 
     22If examples were given in the sparse form, so are the left and right side of the induced rules. If examples were given in the standard form, so are the examples in association rules. 
     23 
     24.. class:: Orange.associate.AssociationRulesSparseInducer 
     25 
     26    .. attribute:: support 
     27    Minimal support for the rule. 
     28     
     29    .. attribute:: confidence 
     30    Minimal confidence for the rule. 
     31     
     32    .. attribute:: storeExamples 
     33    Tells the inducer to store the examples covered by each rule and those confirming it. 
     34     
     35    .. attribute:: maxItemSets 
     36    The maximal number of itemsets. 
     37         
     38The last attribute deserves some explanation. The algorithm's running time (and its memory consumption) depends on the minimal support; the lower the requested support, the more eligible itemsets will be found. There is no general rule for knowing the itemset in advance (generally, value should be around 0.3, but this depends upon the number of different items, the diversity of examples...) so it's very easy to set the limit too low. In this case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it will stop inducing itemsets and report an error when the prescribed maximum maxItemSets is exceeded. In this case, you should increase the required support. On the other hand, you can (reasonably) increase the maxItemSets to as high as you computer is able to handle. 
     39 
     40         
     41Examples for AssociationRulesSparseInducer 
     42======== 
     43 
     44We shall test the rule inducer on a dataset consisting of a brief description of Spanish Inquisition, given by Palin et al: 
     45 
     46    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
     47 
     48    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
     49     
     50The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words. 
     51 
     52inquisition.basket :: 
     53 
     54    nobody, expects, the, Spanish, Inquisition 
     55    our, chief, weapon, is, surprise, surprise, and, fear,fear, and, surprise 
     56    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency 
     57    our, three, weapons, are, fear, surprise, and, ruthless, efficiency, and, an, almost, fanatical, devotion, to, the, Pope 
     58    our, four, no 
     59    amongst, our, weapons 
     60    amongst, our, weaponry, are, such, elements, as, fear, surprise 
     61    I'll, come, in, again 
     62    nobody, expects, the, Spanish, Inquisition 
     63    amongst, our, weaponry, are, such, diverse, elements, as, fear, surprise, ruthless, efficiency, an, almost, fanatical, devotion, to, the, Pope, and, nice, red, uniforms, oh damn 
     64     
     65Inducing the rules is trivial. 
     66 
     67assoc-agrawal.py (uses inquisition.basket) :: 
     68 
     69    import Orange 
     70    data = Orange.data.Table("inquisition") 
     71 
     72    rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
     73 
     74    print "%5s   %5s" % ("supp", "conf") 
     75    for r in rules: 
     76        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
     77 
     78The induced rules are surprisingly fear-full. :: 
     79 
     80    0.500   1.000   fear -> surprise 
     81    0.500   1.000   surprise -> fear 
     82    0.500   1.000   fear -> surprise our 
     83    0.500   1.000   fear surprise -> our 
     84    0.500   1.000   fear our -> surprise 
     85    0.500   1.000   surprise -> fear our 
     86    0.500   1.000   surprise our -> fear 
     87    0.500   0.714   our -> fear surprise 
     88    0.500   1.000   fear -> our 
     89    0.500   0.714   our -> fear 
     90    0.500   1.000   surprise -> our 
     91    0.500   0.714   our -> surprise 
     92 
     93If examples are weighted, weight can be passed as an additional argument to call operator. 
     94 
     95To get only a list of supported item sets, one should call the method getItemsets. The result is a list whose elements are tuples with two elements. The first is a tuple with indices of attributes in the item set. Sparse examples are usually represented with meta attributes, so this indices will be negative. The second element is  a list of indices supporting the item set, that is, containing all the items in the set. If storeExamples is False, the second element is None. 
     96 
     97assoc-agrawal.py (uses inquisition.basket) :: 
     98 
     99    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, storeExamples = True) 
     100    itemsets = inducer.getItemsets(data) 
     101     
     102Now itemsets is a list of itemsets along with the examples supporting them since we set storeExamples to True. :: 
     103 
     104    >>> itemsets[5] 
     105    ((-11, -7), [1, 2, 3, 6, 9]) 
     106    >>> [data.domain[i].name for i in itemsets[5][0]] 
     107    ['surprise', 'our']    
     108     
     109The sixth itemset contains attributes with indices -11 and -7, that is, the words "surprise" and "our". The examples supporting it are those with indices 1,2, 3, 6 and 9. 
     110 
     111This way of representing the itemsets is not very programmer-friendly, but it is much more memory efficient than and faster to work with than using objects like Variable and Example. 
     112 
     113================= 
     114Association rules for non-sparse examples 
     115================= 
     116 
     117The other algorithm for association rules provided by Orange, AssociationRulesInducer is optimized for non-sparse examples in the usual Orange form. Each example is described by values of a fixed set of attributes. Unknown values are ignored, while values of attributes are not (as opposite to the above-described algorithm for sparse rules). In addition, the algorithm can be directed to search only for classification rules, in which the only attribute on the right-hand side is the class attribute. 
     118 
     119.. class:: Orange.associate.AssociationRulesInducer 
     120 
     121    .. attribute:: support 
     122    Minimal support for the rule. 
     123     
     124    .. attribute:: confidence 
     125    Minimal confidence for the rule. 
     126     
     127    .. attribute:: classificationRules 
     128    If 1 (default is 0), the algorithm constructs classification rules instead of general association rules. 
     129     
     130    .. attribute:: storeExamples 
     131    Tells the inducer to store the examples covered by each rule and those confirming it 
     132     
     133    .. attribute:: maxItemSets 
     134    The maximal number of itemsets. 
     135 
     136Meaning of all attributes (except the new one, classificationRules) is the same as for AssociationRulesSparseInducer. See the description of maxItemSet there. 
     137 
     138assoc.py (uses lenses.tab) :: 
     139 
     140    import orange 
     141 
     142    data = orange.ExampleTable("lenses") 
     143 
     144    print "Association rules" 
     145    rules = orange.AssociationRulesInducer(data, supp = 0.5) 
     146    for r in rules: 
     147        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     148         
     149The found rules are :: 
     150 
     151    0.333  0.533  lenses=none -> prescription=hypermetrope 
     152    0.333  0.667  prescription=hypermetrope -> lenses=none 
     153    0.333  0.533  lenses=none -> astigmatic=yes 
     154    0.333  0.667  astigmatic=yes -> lenses=none 
     155    0.500  0.800  lenses=none -> tear_rate=reduced 
     156    0.500  1.000  tear_rate=reduced -> lenses=none 
     157     
     158To limit the algorithm to classification rules, set classificationRules to 1. :: 
     159 
     160    import orange 
     161 
     162    data = orange.ExampleTable("inquisition") 
     163    rules = orange.AssociationRulesSparseInducer(data, 
     164                support = 0.5, storeExamples = True) 
     165 
     166    print "%5s   %5s" % ("supp", "conf") 
     167    for r in rules: 
     168        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
     169 
     170The found rules are, naturally, a subset of the above rules. :: 
     171 
     172    0.333  0.667  prescription=hypermetrope -> lenses=none 
     173    0.333  0.667  astigmatic=yes -> lenses=none 
     174    0.500  1.000  tear_rate=reduced -> lenses=none 
     175     
     176AssociationRulesInducer can also work with weighted examples; the ID of weight attribute should be passed as an additional argument in a call. 
     177 
     178Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the item set, is represented not by indices of attributes, as before, but with tuples (attribute-index, value-index). :: 
     179 
     180    inducer = orange.AssociationRulesInducer(support = 0.3, storeExamples = True) 
     181    itemsets = inducer.getItemsets(data) 
     182    print itemsets[8] 
     183     
     184This prints out :: 
     185 
     186    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
     187     
     188meaning that the ninth itemset contains the second value of the third attribute, (2, 1), and the first value of the fifth, (4, 0). 
     189 
     190================= 
     191Association rule 
     192================= 
     193 
     194Both classes for induction of association rules return the induced rules in AssociationRules which is basically a list of instances of AssociationRule. 
     195 
     196.. class:: Orange.associate.AssociationRules 
     197 
     198    .. attribute:: left, right 
     199    The left and the right side of the rule. Both are given as Example. In rules created by AssociationRulesSparseInducer from examples that contain all values as meta-values, left and right are examples in the same form. Otherwise, values in left that do not appear in the rule are don't care, and value in right are don't know. Both can, however, be tested by isSpecial (see documentation on  `Value <http://orange.biolab.si/doc/reference/Value.htm>`_). 
     200     
     201    .. attribute:: nLeft, nRight 
     202    The number of attributes (ie defined values) on the left and on the right side of the rule. 
     203     
     204    .. attribute:: nAppliesLeft, nAppliesRight, nAppliesBoth 
     205    The number of (learning) examples that conform to the left, the right and to both sides of the rule. 
     206     
     207    .. attribute:: nExamples 
     208    The total number of learning examples. 
     209     
     210    .. attribute:: support 
     211    nAppliesBoth/nExamples. 
     212 
     213    .. attribute:: confidence 
     214    nAppliesBoth/nAppliesLeft. 
     215     
     216    .. attribute:: coverage 
     217    nAppliesLeft/nExamples. 
     218 
     219    .. attribute:: strength 
     220    nAppliesRight/nAppliesLeft. 
     221     
     222    .. attribute:: lift 
     223    nExamples * nAppliesBoth / (nAppliesLeft * nAppliesRight). 
     224     
     225    .. attribute:: leverage 
     226    (nAppliesBoth * nExamples - nAppliesLeft * nAppliesRight). 
     227     
     228    .. attribute:: examples, matchLeft, matchBoth 
     229    If storeExamples was True during induction, examples contains a copy of the example table used to induce the rules. Attributes matchLeft and matchBoth are lists of integers, representing the indices of examples which match the left-hand side of the rule and both sides, respectively.     
     230 
     231    .. method:: AssociationRule(left, right, nAppliesLeft, nAppliesRight, nAppliesBoth, nExamples) 
     232    Constructs an association rule and computes all measures listed above. 
     233     
     234    .. method:: AssociationRule(left, right, support, confidence]]) 
     235    Construct association rule and sets its support and confidence. If you intend to pass such a rule to someone that expects more things to be set, you should set the manually - AssociationRules's constructor cannot compute anything from these two arguments. 
     236     
     237    .. method:: AssociationRule(rule) 
     238    Given an association rules as the argument, constructor copies the rule into a new rule. 
     239     
     240    .. method:: appliesLeft(example), appliesRight(example), appliesBoth(example) 
     241    Tells whether the example fits into the left, right and both sides of the rule, respectively. If the rule is represented by sparse examples, the given example must be sparse as well. 
     242     
     243Association rule inducers do not store evidence about which example supports which rule (although this is available during induction, the information is discarded afterwards). Let us write a function that find the examples that confirm the rule (ie fit both sides of it) and those that contradict it (fit the left-hand side but not the right). :: 
     244 
     245    import orange 
     246 
     247    data = orange.ExampleTable("lenses") 
     248 
     249    rules = orange.AssociationRulesInducer(data, supp = 0.3) 
     250    rule = rules[0] 
     251 
     252    print 
     253    print "Rule: ", rule 
     254    print 
     255 
     256    print "Supporting examples:" 
     257    for example in data: 
     258        if rule.appliesBoth(example): 
     259            print example 
     260    print 
     261 
     262    print "Contradicting examples:" 
     263    for example in data: 
     264        if rule.appliesLeft(example) and not rule.appliesRight(example): 
     265            print example 
     266    print 
     267 
     268The latter printouts get simpler and (way!) faster if we instruct the inducer to store the examples. We can then do, for instance, this. :: 
     269 
     270    print "Match left: " 
     271    print "\\n".join(str(rule.examples[i]) for i in rule.matchLeft) 
     272    print "\\nMatch both: " 
     273    print "\\n".join(str(rule.examples[i]) for i in rule.matchBoth) 
     274 
     275The "contradicting" examples are then those whose indices are find in matchLeft but not in matchBoth. The memory friendlier and the faster ways to compute this are as follows. :: 
     276 
     277    >>> [x for x in rule.matchLeft if not x in rule.matchBoth] 
     278    [0, 2, 8, 10, 16, 17, 18] 
     279    >>> set(rule.matchLeft) - set(rule.matchBoth) 
     280    set([0, 2, 8, 10, 16, 17, 18]) 
     281     
     282""" 
     283 
    1284from orange import \ 
    2      AssociationRule, \ 
     285    AssociationRule, \ 
    3286    AssociationRules, \ 
    4287    AssociationRulesInducer, \ 
Note: See TracChangeset for help on using the changeset viewer.