Changeset 7247:31dc39de56af in orange


Timestamp:
02/02/11 21:49:59 (3 years ago)
Author:
Gregor Rot <gregor.rot@…>
Branch:
default
Convert:
3f24869e8dcb7c6bd502dbc62634fd36c4fa3fd6
Message:
 
File:
1 edited

  • orange/Orange/associate/__init__.py

    (diff r7242 → r7247)
=================

Orange provides two algorithms for the induction of association rules. One is the basic Agrawal's algorithm, with dynamic induction of supported itemsets and rules, designed specifically for datasets with a large number of different items. It is, however, not really suitable for feature-based machine learning problems, which are the primary focus of Orange. We have therefore adapted the original algorithm to be more efficient for the latter type of data, and to induce rules in which, in contrast to Agrawal's rules, both sides contain not only features (like "bread, butter -> jam") but also their values ("bread = wheat, butter = yes -> jam = plum"). As a further variation, the algorithm can be limited to search only for classification rules, in which the sole feature allowed on the right-hand side of the rule is the class feature.

It is also possible to extract item sets instead of association rules. These are often more interesting than the rules themselves.
     
=================

The class that induces rules by Agrawal's algorithm accepts data examples in two forms. The first is the standard form, in which each example is described by values of a fixed list of features defined in the domain. The algorithm, however, disregards the feature values and only checks whether a value is defined or not. The rule shown above, "bread, butter -> jam", actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most of the values will be undefined - if this is not so, you should use the other association rules inducer, described in the next chapter.

Since the usual representation of examples described above is rather unsuitable for sparse examples, AssociationRulesSparseInducer can also use examples represented a bit differently. Sparse examples have no fixed features - the examples' domain is empty, and there are neither ordinary nor class features. All values assigned to an example are given as meta-features. All meta-features must, however, be `registered with the domain descriptor <http://orange.biolab.si/doc/reference/Domain.htm#meta-attributes>`_. If you have data of this kind, the most suitable format for it is the `basket format <http://orange.biolab.si/doc/reference/fileformats.htm#basket>`_.

In both cases, the examples are first translated into AssociationRulesSparseInducer's internal format for sparse datasets. The algorithm first dynamically builds all itemsets (sets of features) that have at least the prescribed support. Each of these is then used to derive rules with the requested confidence.
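The itemset-building step can be illustrated in plain Python. The sketch below is a brute-force stand-in for Orange's incremental algorithm, using made-up transaction data; support is simply the fraction of examples that contain the whole itemset::

```python
from itertools import combinations

# Toy sparse dataset: each example is just the set of items that are defined.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "jam"},
]

def supported_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets reaching min_support.

    Support is the fraction of examples containing the whole itemset.
    (Orange grows itemsets incrementally for efficiency; this brute-force
    version only illustrates the definition.)
    """
    items = sorted(set().union(*transactions))
    n = float(len(transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            count = sum(1 for t in transactions if t.issuperset(subset))
            if count / n >= min_support:
                result[subset] = count / n
    return result

itemsets = supported_itemsets(transactions, min_support=0.5)
# "bread" and "butter" co-occur in 2 of the 4 examples.
print(itemsets[("bread", "butter")])   # 0.5
```

Rules are then read off each supported itemset by splitting it into a left-hand and a right-hand side and keeping the splits whose confidence is high enough.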

If the examples were given in the sparse form, so are the left and right sides of the induced rules. If the examples were given in the standard form, so are the examples in the association rules.
     
    The maximal number of itemsets.

The last attribute deserves some explanation. The algorithm's running time (and its memory consumption) depends on the minimal support: the lower the requested support, the more eligible itemsets will be found. There is no general rule for choosing the right support in advance (a value of around 0.3 is usually suitable, but this depends upon the number of different items, the diversity of the examples...), so it is very easy to set the limit too low. In that case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it stops inducing itemsets and reports an error when the prescribed maximum maxItemSets is exceeded. If that happens, you should increase the required support. On the other hand, you can (reasonably) increase maxItemSets as high as your computer is able to handle.
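The way the number of eligible itemsets explodes as the support threshold drops - which is exactly what the maxItemSets safeguard protects against - can be seen on made-up data (this is an illustration, not Orange code; the data and the cap value are invented)::

```python
from itertools import combinations

# Eight items; five overlapping four-item "windows" as transactions.
items = "ABCDEFGH"
transactions = [set(items[i:i + 4]) for i in range(5)]

def count_itemsets(min_support, max_item_sets=10000):
    """Count supported itemsets, aborting once max_item_sets is exceeded,
    mimicking the maxItemSets safeguard described above."""
    n = float(len(transactions))
    found = 0
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            matched = sum(1 for t in transactions if t.issuperset(subset))
            if matched / n >= min_support:
                found += 1
                if found > max_item_sets:
                    raise MemoryError("too many itemsets - increase the support")
    return found

print(count_itemsets(0.6))   # 7 itemsets pass the high threshold
print(count_itemsets(0.2))   # 47 itemsets pass the low one
```

Even on eight items, lowering the threshold from 0.6 to 0.2 multiplies the count several times over; on realistic data the growth is far steeper.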

We shall test the rule inducer on a dataset consisting of a brief description of the Spanish Inquisition, given by Palin et al.:
     
    amongst, our, weaponry, are, such, diverse, elements, as, fear, surprise, ruthless, efficiency, an, almost, fanatical, devotion, to, the, Pope, and, nice, red, uniforms, oh damn

Inducing the rules is trivial (uses inquisition.basket)::

    import Orange
     
        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r)

The induced rules are surprisingly fear-full::

    0.500   1.000   fear -> surprise
     
If the examples are weighted, the weight can be passed as an additional argument to the call operator.

To get only a list of supported item sets, call the method getItemsets. The result is a list whose elements are tuples with two elements. The first is a tuple with the indices of the features in the item set; since sparse examples are usually represented with meta features, these indices will be negative. The second element is a list of indices of the examples that support the item set, that is, that contain all the items in the set. If storeExamples is False, the second element is None. ::

    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, storeExamples = True)
     
    ['surprise', 'our']

The sixth itemset contains the features with indices -11 and -7, that is, the words "surprise" and "our". The examples supporting it are those with indices 1, 2, 3, 6 and 9.
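The shape of one such result element can be mimicked in plain Python; the index-to-word mapping below is made up to mirror the output shown above (in Orange the words would be looked up through the domain's meta features)::

```python
# Made-up mapping from meta-feature indices (negative in Orange) to words.
words = {-11: "surprise", -7: "our"}

# One element of getItemsets()'s result: a tuple of feature indices and
# the indices of the examples supporting the itemset.
itemset, supporting = (-11, -7), [1, 2, 3, 6, 9]

print([words[i] for i in itemset])   # ['surprise', 'our']
print(len(supporting))               # 5 supporting examples
```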

This way of representing the itemsets is not very programmer-friendly, but it is much more memory-efficient and faster to work with than using objects like Variable and Example.
     
=================

The other algorithm for association rules provided by Orange, AssociationRulesInducer, is optimized for non-sparse examples in the usual Orange form. Each example is described by values of a fixed set of features. Unknown values are ignored, while values of features are not (as opposed to the above-described algorithm for sparse rules). In addition, the algorithm can be directed to search only for classification rules, in which the only feature on the right-hand side is the class feature.
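The semantics of such value-based rules can be sketched without Orange: a rule's support is the fraction of examples matching both sides, and its confidence is the number matching both sides divided by the number matching the left side. The toy data and the helper below are made up for illustration::

```python
# Toy attribute-value dataset, in the spirit of
# "bread = wheat, butter = yes -> jam = plum".
examples = [
    {"bread": "wheat", "butter": "yes", "jam": "plum"},
    {"bread": "wheat", "butter": "yes", "jam": "plum"},
    {"bread": "wheat", "butter": "no",  "jam": "none"},
    {"bread": "rye",   "butter": "yes", "jam": "plum"},
]

def rule_stats(examples, left, right):
    """Support and confidence of the rule left -> right, where both sides
    are dicts of feature = value conditions (left is assumed non-empty
    and matched by at least one example)."""
    def matches(ex, cond):
        return all(ex[f] == v for f, v in cond.items())
    n_left = sum(1 for ex in examples if matches(ex, left))
    n_both = sum(1 for ex in examples
                 if matches(ex, left) and matches(ex, right))
    return float(n_both) / len(examples), float(n_both) / n_left

support, confidence = rule_stats(examples,
                                 left={"bread": "wheat", "butter": "yes"},
                                 right={"jam": "plum"})
print(support, confidence)   # 0.5 1.0
```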

.. class:: Orange.associate.AssociationRulesInducer
     
    The maximal number of itemsets.

The meaning of all attributes (except the new one, classificationRules) is the same as for AssociationRulesSparseInducer; see the description of maxItemSets there. The example uses the `lenses data table <http://orange.biolab.si/doc/reference/lenses.tab>`_. ::

    import Orange
     
        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)

The found rules are::

    0.333  0.533  lenses=none -> prescription=hypermetrope
     
To limit the algorithm to classification rules, set classificationRules to 1::

    print "\\nClassification rules"
    rules = Orange.associate.AssociationRulesInducer(data, support = 0.3, classificationRules = 1)
    for r in rules:
        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)

The found rules are, naturally, a subset of the above rules::

    0.333  0.667  prescription=hypermetrope -> lenses=none

    0.500  1.000  tear_rate=reduced -> lenses=none

AssociationRulesInducer can also work with weighted examples; the ID of the weight feature should be passed as an additional argument in the call.

Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the item set, is represented not by indices of features, as before, but by tuples (feature-index, value-index). ::

    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, storeExamples = True)
     
    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23])

meaning that the ninth itemset contains the second value of the third feature, (2, 1), and the first value of the fifth, (4, 0).

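Decoding such (feature-index, value-index) pairs is a matter of two lookups in the domain. The feature and value names below are hypothetical stand-ins for the lenses domain (the real names and value orders come from the dataset's domain in Orange)::

```python
# Hypothetical domain description: pairs of (feature name, value names).
features = [("age", ["young", "pre_presbyopic", "presbyopic"]),
            ("prescription", ["myope", "hypermetrope"]),
            ("astigmatic", ["no", "yes"]),
            ("tear_rate", ["reduced", "normal"]),
            ("lenses", ["none", "soft", "hard"])]

itemset = ((2, 1), (4, 0))   # the ninth itemset from the output above

# Each pair indexes a feature and then one of its values.
decoded = ["%s=%s" % (features[f][0], features[f][1][v]) for f, v in itemset]
print(decoded)   # ['astigmatic=yes', 'lenses=none']
```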
=================
AssociationRules
=================

     

    .. attribute:: nLeft, nRight
    The number of features (i.e., defined values) on the left and on the right side of the rule.

    .. attribute:: nAppliesLeft, nAppliesRight, nAppliesBoth
     

    .. attribute:: examples, matchLeft, matchBoth
    If storeExamples was True during induction, examples contains a copy of the example table used to induce the rules. The attributes matchLeft and matchBoth are lists of integers, representing the indices of the examples which match the left-hand side of the rule and both sides, respectively.

    .. method:: AssociationRule(left, right, nAppliesLeft, nAppliesRight, nAppliesBoth, nExamples)
     
    Tells whether the example fits into the left, right and both sides of the rule, respectively. If the rule is represented by sparse examples, the given example must be sparse as well.

Association rule inducers do not store evidence about which example supports which rule (although this information is available during induction, it is discarded afterwards). Let us write a function that finds the examples that confirm the rule (i.e., fit both sides of it) and those that contradict it (fit the left-hand side but not the right). The example uses the `lenses data table <http://orange.biolab.si/doc/reference/lenses.tab>`_. ::

    import Orange
     
    print

The latter printouts get simpler and (way!) faster if we instruct the inducer to store the examples. We can then do, for instance, this::

    print "Match left: "
     
    >>> set(rule.matchLeft) - set(rule.matchBoth)
    set([0, 2, 8, 10, 16, 17, 18])

"""