Changeset 7323:923010e55996 in orange


Ignore:
Timestamp:
02/03/11 13:32:21 (3 years ago)
Author:
Gregor Rot <gregor.rot@…>
Branch:
default
Convert:
d9674c35e62febf0e030da21c4189bf447c6e039
Message:
 
File:
1 edited

Legend:

Unmodified
Added
Removed
  • orange/Orange/associate/__init__.py

    r7313 r7323  
    11""" 
    2 ================= 
     2============================== 
    33Induction of association rules 
    4 ================= 
    5  
    6 Orange provides two algorithms for induction of association rules. One is the basic Agrawal's algorithm with dynamic induction of supported itemsets and rules that is designed specifically for datasets with a large number of different items. This is, however, not really suitable for feature-based machine learning problems, which are at the primary focus of Orange. We have thus adapted the original algorithm to be more efficient for the latter type of data, and to induce the rules in which, for contrast to Agrawal's rules, both sides don't only contain features (like "bread, butter -> jam") but also their values ("bread = wheat, butter = yes -> jam = plum"). As a further variation, the algorithm can be limited to search only for classification rules in which the sole feature to appear on the right side of the rule is the class Feature. 
     4============================== 
     5 
     6Orange provides two algorithms for induction of `association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. One is the basic Agrawal's algorithm with dynamic induction of supported itemsets and rules that is designed specifically for datasets with a large number of different items. This is, however, not really suitable for feature-based machine learning problems, which are at the primary focus of Orange. We have thus adapted the original algorithm to be more efficient for the latter type of data, and to induce the rules in which, for contrast to Agrawal's rules, both sides don't only contain features (like "bread, butter -> jam") but also their values ("bread = wheat, butter = yes -> jam = plum"). As a further variation, the algorithm can be limited to search only for classification rules in which the sole feature to appear on the right side of the rule is the class feature. 
    77 
    88It is also possible to extract item sets instead of association rules. These are often more interesting than the rules themselves. 
     
    1010Besides association rule inducer, Orange also provides a rather simplified method for classification by association rules. 
    1111 
    12 ================= 
     12=================== 
    1313Agrawal's algorithm 
    14 ================= 
    15  
    16 The class that induces rules by Agrawal's algorithm, accepts the data examples of two forms. The first is the standard form in which each examples is described by values of a fixed list of features, defined in domain. The algorithm, however, disregards the feature values but only checks whether the value is defined or not. The rule shown above, "bread, butter -> jam" actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most of values will be undefined - if this is not so, you need to use the other association rules inducer, described in the next chapter. 
     14=================== 
     15 
     16The class that induces rules by Agrawal's algorithm, accepts the data examples of two forms. The first is the standard form in which each example is described by values of a fixed list of features (defined in domain). The algorithm, however, disregards the feature values and only checks whether the value is defined or not. The rule shown above ("bread, butter -> jam") actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most of values will be undefined - if this is not so, you need to use the other association rules inducer, described in the chapter :ref:`non-sparse-examples`. 
    1717 
    1818Since the usual representation of examples described above is rather unsuitable for sparse examples, AssociationRulesSparseInducer can also use examples represented a bit differently. Sparse examples have no fixed features - the examples' domain is empty, there are neither ordinary nor class features. All values assigned to example are given as meta-attributes. All meta-attributes need, however, be `registered with the domain descriptor <http://orange.biolab.si/doc/reference/Domain.htm#meta-attributes>`_. If you have data of this kind, the most suitable format for it is the `basket format <http://orange.biolab.si/doc/reference/fileformats.htm#basket>`_. 
     
    3535    .. attribute:: maxItemSets 
    3636    The maximal number of itemsets. 
    37          
    38 The last feature deserves some explanation. The algorithm's running time (and its memory consumption) depends on the minimal support; the lower the requested support, the more eligible itemsets will be found. There is no general rule for knowing the itemset in advance (generally, value should be around 0.3, but this depends upon the number of different items, the diversity of examples...) so it's very easy to set the limit too low. In this case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it will stop inducing itemsets and report an error when the prescribed maximum maxItemSets is exceeded. In this case, you should increase the required support. On the other hand, you can (reasonably) increase the maxItemSets to as high as you computer is able to handle. 
     37 
     38.. _maxItemSets: 
     39 
     40The maxItemSets attribute deserves some explanation. The algorithm's running time (and its memory consumption) depends on the minimal support; the lower the requested support, the more eligible itemsets will be found. There is no general rule for knowing the itemset in advance (generally, value should be around 0.3, but this depends upon the number of different items, the diversity of examples...) so it's very easy to set the limit too low. In this case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it will stop inducing itemsets and report an error when the prescribed maximum maxItemSets is exceeded. In this case, you should increase the required support. On the other hand, you can (reasonably) increase the maxItemSets to as high as you computer is able to handle. 
    3941 
    4042We shall test the rule inducer on a dataset consisting of a brief description of Spanish Inquisition, given by Palin et al: 
     
    103105This way of representing the itemsets is not very programmer-friendly, but it is much more memory efficient than and faster to work with than using objects like Variable and Example. 
    104106 
     107.. _non-sparse-examples: 
     108 
    105109================= 
    106110Non-sparse examples 
     
    126130    The maximal number of itemsets. 
    127131 
    128 Meaning of all attributes (except the new one, classificationRules) is the same as for AssociationRulesSparseInducer. See the description of maxItemSet there. The example uses `lenses data table <http://orange.biolab.si/doc/reference/lenses.tab>`_. :: 
     132Meaning of all attributes (except the new one, classificationRules) is the same as for AssociationRulesSparseInducer. See the description of :ref:`maxItemSets <maxItemSets>` there. The example uses `lenses data table <http://orange.biolab.si/doc/reference/lenses.tab>`_. :: 
    129133 
    130134    import Orange 
     
    146150    0.500  1.000  tear_rate=reduced -> lenses=none 
    147151     
    148 To limit the algorithm to classification rules, set classificationRules to 1. :: 
     152To limit the algorithm to classification rules, set classificationRules to 1: :: 
    149153 
    150154    print "\\nClassification rules" 
     
    161165AssociationRulesInducer can also work with weighted examples; the ID of weight feature should be passed as an additional argument in a call. 
    162166 
    163 Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the item set, is represented not by indices of features, as before, but with tuples (feature-index, value-index). :: 
     167Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the item set, is represented not by indices of features, as before, but with tuples (feature-index, value-index): :: 
    164168 
    165169    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, storeExamples = True) 
     
    171175    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
    172176     
    173 meaning that the ninth itemset contains the second value of the third feature, (2, 1), and the first value of the fifth, (4, 0). 
     177meaning that the ninth itemset contains the second value of the third feature (2, 1), and the first value of the fifth (4, 0). 
    174178 
    175179================= 
     
    182186 
    183187    .. attribute:: left, right 
    184     The left and the right side of the rule. Both are given as Example. In rules created by AssociationRulesSparseInducer from examples that contain all values as meta-values, left and right are examples in the same form. Otherwise, values in left that do not appear in the rule are don't care, and value in right are don't know. Both can, however, be tested by isSpecial (see documentation on  `Value <http://orange.biolab.si/doc/reference/Value.htm>`_). 
     188    The left and the right side of the rule. Both are given as Example. In rules created by AssociationRulesSparseInducer from examples that contain all values as meta-values, left and right are examples in the same form. Otherwise, values in left that do not appear in the rule are "don't care", and value in right are "don't know". Both can, however, be tested by isSpecial (see documentation on  `Value <http://orange.biolab.si/doc/reference/Value.htm>`_). 
    185189     
    186190    .. attribute:: nLeft, nRight 
     
    212216     
    213217    .. attribute:: examples, matchLeft, matchBoth 
    214     If storeExamples was True during induction, examples contains a copy of the example table used to induce the rules. features matchLeft and matchBoth are lists of integers, representing the indices of examples which match the left-hand side of the rule and both sides, respectively.     
     218    If storeExamples was True during induction, examples contains a copy of the example table used to induce the rules. Attributes matchLeft and matchBoth are lists of integers, representing the indices of examples which match the left-hand side of the rule and both sides, respectively. 
    215219 
    216220    .. method:: AssociationRule(left, right, nAppliesLeft, nAppliesRight, nAppliesBoth, nExamples) 
     
    218222     
    219223    .. method:: AssociationRule(left, right, support, confidence) 
    220     Construct association rule and sets its support and confidence. If you intend to pass such a rule to someone that expects more things to be set, you should set the manually - AssociationRules's constructor cannot compute anything from these two arguments. 
     224    Construct association rule and sets its support and confidence. If you intend to pass on such a rule you should set other attributes manually - AssociationRules's constructor cannot compute anything from arguments support and confidence. 
    221225     
    222226    .. method:: AssociationRule(rule) 
    223     Given an association rules as the argument, constructor copies the rule into a new rule. 
     227    Given an association rule as the argument, constructor copies of the rule. 
    224228     
    225229    .. method:: appliesLeft(example), appliesRight(example), appliesBoth(example) 
    226     Tells whether the example fits into the left, right and both sides of the rule, respectively. If the rule is represented by sparse examples, the given example must be sparse as well. 
    227      
    228 Association rule inducers do not store evidence about which example supports which rule (although this is available during induction, the information is discarded afterwards). Let us write a function that find the examples that confirm the rule (ie fit both sides of it) and those that contradict it (fit the left-hand side but not the right). The example uses `lenses data table <http://orange.biolab.si/doc/reference/lenses.tab>`_. :: 
     230    Tells whether the example fits into the left, right or both sides of the rule, respectively. If the rule is represented by sparse examples, the given example must be sparse as well. 
     231     
     232Association rule inducers do not store evidence about which example supports which rule (although this information is available during induction its discarded afterwards). Let us write a function that finds the examples that confirm the rule (fit both sides of it) and those that contradict it (fit the left-hand side but not the right). The example uses the `lenses data table <http://orange.biolab.si/doc/reference/lenses.tab>`_. :: 
    229233 
    230234    import Orange 
     
    251255    print 
    252256 
    253 The latter printouts get simpler and (way!) faster if we instruct the inducer to store the examples. We can then do, for instance, this: :: 
     257The latter printouts get simpler and faster if we instruct the inducer to store the examples. We can then do, for instance, this: :: 
    254258 
    255259    print "Match left: " 
     
    258262    print "\\n".join(str(rule.examples[i]) for i in rule.matchBoth) 
    259263 
    260 The "contradicting" examples are then those whose indices are find in matchLeft but not in matchBoth. The memory friendlier and the faster ways to compute this are as follows. :: 
     264The "contradicting" examples are then those whose indices are found in matchLeft but not in matchBoth. The memory friendlier and the faster way to compute this is as follows: :: 
    261265 
    262266    >>> [x for x in rule.matchLeft if not x in rule.matchBoth] 
Note: See TracChangeset for help on using the changeset viewer.