Changeset 7979:4478babca5fd in orange


Ignore:
Timestamp:
06/03/11 13:47:22 (3 years ago)
Author:
ales_erjavec <ales.erjavec@…>
Branch:
default
Convert:
c523d298ad6fb26fdd3cdab2002d910ff1c34bd1
Message:

Formated module docstring (line breaks).

File:
1 edited

Legend:

Unmodified
Added
Removed
  • orange/Orange/associate/__init__.py

    r7556 r7979  
    44============================== 
    55 
    6 Orange provides two algorithms for induction of `association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. One is the basic Agrawal's algorithm with dynamic induction of supported itemsets and rules that is designed specifically for datasets with a large number of different items. This is, however, not really suitable for feature-based machine learning problems, which are at the primary focus of Orange. We have thus adapted the original algorithm to be more efficient for the latter type of data, and to induce the rules in which, for contrast to Agrawal's rules, both sides don't only contain features (like "bread, butter -> jam") but also their values ("bread = wheat, butter = yes -> jam = plum"). As a further variation, the algorithm can be limited to search only for classification rules in which the sole feature to appear on the right side of the rule is the class feature. 
    7  
    8 It is also possible to extract item sets instead of association rules. These are often more interesting than the rules themselves. 
    9  
    10 Besides association rule inducer, Orange also provides a rather simplified method for classification by association rules. 
     6Orange provides two algorithms for induction of 
     7`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
     8One is the basic Agrawal's algorithm with dynamic induction of supported 
     9itemsets and rules that is designed specifically for datasets with a 
     10large number of different items. This is, however, not really suitable 
     11for feature-based machine learning problems, which are at the primary focus 
     12of Orange. We have thus adapted the original algorithm to be more efficient 
     13for the latter type of data, and to induce the rules in which, for contrast 
     14to Agrawal's rules, both sides don't only contain features 
     15(like "bread, butter -> jam") but also their values 
     16("bread = wheat, butter = yes -> jam = plum"). As a further variation, the 
     17algorithm can be limited to search only for classification rules in which 
     18the sole feature to appear on the right side of the rule is the class feature. 
     19 
     20It is also possible to extract item sets instead of association rules. These 
     21are often more interesting than the rules themselves. 
     22 
     23Besides association rule inducer, Orange also provides a rather simplified 
     24method for classification by association rules. 
    1125 
    1226=================== 
     
    1428=================== 
    1529 
    16 The class that induces rules by Agrawal's algorithm, accepts the data examples of two forms. The first is the standard form in which each example is described by values of a fixed list of features (defined in domain). The algorithm, however, disregards the feature values and only checks whether the value is defined or not. The rule shown above ("bread, butter -> jam") actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most of values will be undefined - if this is not so, you need to use the other association rules inducer, described in the chapter :ref:`non-sparse-examples`. 
    17  
    18 Since the usual representation of examples described above is rather unsuitable for sparse examples, AssociationRulesSparseInducer can also use examples represented a bit differently. Sparse examples have no fixed features - the examples' domain is empty, there are neither ordinary nor class features. All values assigned to example are given as meta-attributes. All meta-attributes need, however, be `registered with the domain descriptor <http://orange.biolab.si/doc/reference/Domain.htm#meta-attributes>`_. If you have data of this kind, the most suitable format for it is the `basket format <http://orange.biolab.si/doc/reference/fileformats.htm#basket>`_. 
    19  
    20 In both cases, the examples are first translated into an internal AssociationRulesSparseInducer's internal format for sparse datasets. The algorithm first dynamically builds all itemsets (sets of features) that have at least the prescribed support. Each of these is then used to derive rules with requested confidence. 
    21  
    22 If examples were given in the sparse form, so are the left and right side of the induced rules. If examples were given in the standard form, so are the examples in association rules. 
     30The class that induces rules by Agrawal's algorithm, accepts the data examples 
     31of two forms. The first is the standard form in which each example is 
     32described by values of a fixed list of features (defined in domain). 
     33The algorithm, however, disregards the feature values and only checks whether 
     34the value is defined or not. The rule shown above ("bread, butter -> jam") 
     35actually means that if "bread" and "butter" are defined, then "jam" is defined 
     36as well. It is expected that most of values will be undefined - if this is not 
     37so, you need to use the other association rules inducer, described in the 
     38chapter :ref:`non-sparse-examples`. 
     39 
     40Since the usual representation of examples described above is rather unsuitable 
     41for sparse examples, AssociationRulesSparseInducer can also use examples 
     42represented a bit differently. Sparse examples have no fixed 
     43features - the examples' domain is empty, there are neither ordinary nor class 
     44features. All values assigned to example are given as meta-attributes. 
     45All meta-attributes need, however, be `registered with the domain descriptor 
     46<http://orange.biolab.si/doc/reference/Domain.htm#meta-attributes>`_. 
     47If you have data of this kind, the most suitable format for it is the 
     48`basket format <http://orange.biolab.si/doc/reference/fileformats.htm#basket>`_. 
     49 
     50In both cases, the examples are first translated into an internal 
     51AssociationRulesSparseInducer's internal format for sparse datasets. The 
     52algorithm first dynamically builds all itemsets (sets of features) that have 
     53at least the prescribed support. Each of these is then used to derive rules 
     54with requested confidence. 
     55 
     56If examples were given in the sparse form, so are the left and right side 
     57of the induced rules. If examples were given in the standard form, so are 
     58the examples in association rules. 
    2359 
    2460.. class:: Orange.associate.AssociationRulesSparseInducer 
     
    3470    .. attribute:: storeExamples 
    3571     
    36     Tells the inducer to store the examples covered by each rule and those confirming it. 
     72    Tells the inducer to store the examples covered by each rule and 
     73    those confirming it. 
    3774     
    3875    .. attribute:: maxItemSets 
     
    4279.. _maxItemSets: 
    4380 
    44 The maxItemSets attribute deserves some explanation. The algorithm's running time (and its memory consumption) depends on the minimal support; the lower the requested support, the more eligible itemsets will be found. There is no general rule for knowing the itemset in advance (generally, value should be around 0.3, but this depends upon the number of different items, the diversity of examples...) so it's very easy to set the limit too low. In this case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it will stop inducing itemsets and report an error when the prescribed maximum maxItemSets is exceeded. In this case, you should increase the required support. On the other hand, you can (reasonably) increase the maxItemSets to as high as you computer is able to handle. 
    45  
    46 We shall test the rule inducer on a dataset consisting of a brief description of Spanish Inquisition, given by Palin et al: 
     81The maxItemSets attribute deserves some explanation. The algorithm's 
     82running time (and its memory consumption) depends on the minimal support; 
     83the lower the requested support, the more eligible itemsets will be found. 
     84There is no general rule for knowing the itemset in advance (generally, value 
     85should be around 0.3, but this depends upon the number of different items, the 
     86diversity of examples...) so it's very easy to set the limit too low. In this 
     87case, the algorithm can induce hundreds of thousands of itemsets until it 
     88runs out of memory. To prevent this, it will stop inducing itemsets and 
     89report an error when the prescribed maximum maxItemSets is exceeded. 
     90In this case, you should increase the required support. On the other hand, 
     91you can (reasonably) increase the maxItemSets to as high as you computer is 
     92able to handle. 
     93 
     94We shall test the rule inducer on a dataset consisting of a brief description 
     95of Spanish Inquisition, given by Palin et al: 
    4796 
    4897    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
     
    85134    0.500   0.714   our -> surprise 
    86135 
    87 If examples are weighted, weight can be passed as an additional argument to call operator. 
    88  
    89 To get only a list of supported item sets, one should call the method getItemsets. The result is a list whose elements are tuples with two elements. The first is a tuple with indices of features in the item set. Sparse examples are usually represented with meta-attributes, so this indices will be negative. The second element is  a list of indices supporting the item set, that is, containing all the items in the set. If storeExamples is False, the second element is None. :: 
     136If examples are weighted, weight can be passed as an additional argument to 
     137call operator. 
     138 
     139To get only a list of supported item sets, one should call the method 
     140getItemsets. The result is a list whose elements are tuples with two elements. 
     141The first is a tuple with indices of features in the item set. Sparse examples 
     142are usually represented with meta-attributes, so this indices will be negative. 
     143The second element is  a list of indices supporting the item set, that is, 
     144containing all the items in the set. If storeExamples is False, the second 
     145element is None. :: 
    90146 
    91147    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, storeExamples = True) 
    92148    itemsets = inducer.getItemsets(data) 
    93149     
    94 Now itemsets is a list of itemsets along with the examples supporting them since we set storeExamples to True. :: 
     150Now itemsets is a list of itemsets along with the examples supporting them 
     151since we set storeExamples to True. :: 
    95152 
    96153    >>> itemsets[5] 
     
    99156    ['surprise', 'our']    
    100157     
    101 The sixth itemset contains features with indices -11 and -7, that is, the words "surprise" and "our". The examples supporting it are those with indices 1,2, 3, 6 and 9. 
    102  
    103 This way of representing the itemsets is not very programmer-friendly, but it is much more memory efficient than and faster to work with than using objects like Variable and Example. 
     158The sixth itemset contains features with indices -11 and -7, that is, the 
     159words "surprise" and "our". The examples supporting it are those with 
     160indices 1,2, 3, 6 and 9. 
     161 
     162This way of representing the itemsets is not very programmer-friendly, but 
     163it is much more memory efficient than and faster to work with than using 
     164objects like Variable and Example. 
    104165 
    105166.. _non-sparse-examples: 
     
    109170=================== 
    110171 
    111 The other algorithm for association rules provided by Orange (AssociationRulesInducer) is optimized for non-sparse examples in the usual Orange form. Each example is described by values of a fixed set of features. Unknown values are ignored, while values of features are not (as opposite to the above-described algorithm for sparse rules). In addition, the algorithm can be directed to search only for classification rules, in which the only feature on the right-hand side is the class Feature. 
     172The other algorithm for association rules provided by Orange 
     173(AssociationRulesInducer) is optimized for non-sparse examples in the usual 
     174Orange form. Each example is described by values of a fixed set of features. 
     175Unknown values are ignored, while values of features are not (as opposite to 
     176the above-described algorithm for sparse rules). In addition, the algorithm 
     177can be directed to search only for classification rules, in which the only 
     178feature on the right-hand side is the class Feature. 
    112179 
    113180.. class:: Orange.associate.AssociationRulesInducer(float asupp, float aconf) 
     
    126193    .. attribute:: classificationRules 
    127194     
    128     If 1 (default is 0), the algorithm constructs classification rules instead of general association rules. 
     195    If 1 (default is 0), the algorithm constructs classification rules instead 
     196    of general association rules. 
    129197     
    130198    .. attribute:: storeExamples 
    131199     
    132     Tells the inducer to store the examples covered by each rule and those confirming it 
     200    Tells the inducer to store the examples covered by each rule and those 
     201    confirming it 
    133202     
    134203    .. attribute:: maxItemSets 
     
    136205    The maximal number of itemsets. 
    137206 
    138 Meaning of all attributes (except the new one, classificationRules) is the same as for AssociationRulesSparseInducer. See the description of :ref:`maxItemSets <maxItemSets>` there. The example uses `lenses.tab`_: :: 
     207Meaning of all attributes (except the new one, classificationRules) is the 
     208same as for AssociationRulesSparseInducer. See the description of 
     209:ref:`maxItemSets <maxItemSets>` there. The example uses `lenses.tab`_: :: 
    139210 
    140211    import Orange 
     
    169240    0.500  1.000  tear_rate=reduced -> lenses=none 
    170241     
    171 AssociationRulesInducer can also work with weighted examples; the ID of weight feature should be passed as an additional argument in a call. 
    172  
    173 Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the item set, is represented not by indices of features, as before, but with tuples (feature-index, value-index): :: 
     242AssociationRulesInducer can also work with weighted examples; the ID of weight 
     243feature should be passed as an additional argument in a call. 
     244 
     245Itemsets are induced in a similar fashion as for sparse data, except that the 
     246first element of the tuple, the item set, is represented not by indices of 
     247features, as before, but with tuples (feature-index, value-index): :: 
    174248 
    175249    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, storeExamples = True) 
     
    181255    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
    182256     
    183 meaning that the ninth itemset contains the second value of the third feature (2, 1), and the first value of the fifth (4, 0). 
     257meaning that the ninth itemset contains the second value of the third feature 
     258(2, 1), and the first value of the fifth (4, 0). 
    184259 
    185260================= 
     
    187262================= 
    188263 
    189 Both classes for induction of association rules return the induced rules in AssociationRules which is basically a list of instances of AssociationRule. 
     264Both classes for induction of association rules return the induced rules in 
     265AssociationRules which is basically a list of instances of AssociationRule. 
    190266 
    191267.. class:: Orange.associate.AssociationRules 
     
    193269    .. attribute:: left, right 
    194270     
    195         The left and the right side of the rule. Both are given as Example. In rules created by AssociationRulesSparseInducer from examples that contain all values as meta-values, left and right are examples in the same form. Otherwise, values in left that do not appear in the rule are "don't care", and value in right are "don't know". Both can, however, be tested by isSpecial (see documentation on  `Value <http://orange.biolab.si/doc/reference/Value.htm>`_). 
     271        The left and the right side of the rule. Both are given as Example. 
     272        In rules created by AssociationRulesSparseInducer from examples that 
     273        contain all values as meta-values, left and right are examples in the 
     274        same form. Otherwise, values in left that do not appear in the rule 
     275        are "don't care", and value in right are "don't know". Both can, 
     276        however, be tested by isSpecial (see documentation on 
     277        `Value <http://orange.biolab.si/doc/reference/Value.htm>`_). 
    196278     
    197279    .. attribute:: nLeft, nRight 
    198280     
    199         The number of features (ie defined values) on the left and on the right side of the rule. 
     281        The number of features (i.e. defined values) on the left and on the 
     282        right side of the rule. 
    200283     
    201284    .. attribute:: nAppliesLeft, nAppliesRight, nAppliesBoth 
    202285     
    203         The number of (learning) examples that conform to the left, the right and to both sides of the rule. 
     286        The number of (learning) examples that conform to the left, the right 
     287        and to both sides of the rule. 
    204288     
    205289    .. attribute:: nExamples 
     
    233317    .. attribute:: examples, matchLeft, matchBoth 
    234318     
    235         If storeExamples was True during induction, examples contains a copy of the example table used to induce the rules. Attributes matchLeft and matchBoth are lists of integers, representing the indices of examples which match the left-hand side of the rule and both sides, respectively. 
     319        If storeExamples was True during induction, examples contains a copy 
     320        of the example table used to induce the rules. Attributes matchLeft 
     321        and matchBoth are lists of integers, representing the indices of 
     322        examples which match the left-hand side of the rule and both sides, 
     323        respectively. 
    236324 
    237325    .. method:: AssociationRule(left, right, nAppliesLeft, nAppliesRight, nAppliesBoth, nExamples) 
     
    241329    .. method:: AssociationRule(left, right, support, confidence) 
    242330     
    243         Construct association rule and sets its support and confidence. If you intend to pass on such a rule you should set other attributes manually - AssociationRules's constructor cannot compute anything from arguments support and confidence. 
     331        Construct association rule and sets its support and confidence. If 
     332        you intend to pass on such a rule you should set other attributes 
     333        manually - AssociationRules's constructor cannot compute anything 
     334        from arguments support and confidence. 
    244335     
    245336    .. method:: AssociationRule(rule) 
    246337     
    247         Given an association rule as the argument, constructor copies of the rule. 
     338        Given an association rule as the argument, constructor copies of the 
     339        rule. 
    248340     
    249341    .. method:: appliesLeft(example) 
     
    253345    .. method:: appliesBoth(example) 
    254346     
    255         Tells whether the example fits into the left, right or both sides of the rule, respectively. If the rule is represented by sparse examples, the given example must be sparse as well. 
    256      
    257 Association rule inducers do not store evidence about which example supports which rule (although this information is available during induction its discarded afterwards). Let us write a function that finds the examples that confirm the rule (fit both sides of it) and those that contradict it (fit the left-hand side but not the right). The example uses the `lenses.tab`_: :: 
     347        Tells whether the example fits into the left, right or both sides of 
     348        the rule, respectively. If the rule is represented by sparse examples, 
     349        the given example must be sparse as well. 
     350     
     351Association rule inducers do not store evidence about which example supports 
     352which rule (although this information is available during induction its 
     353discarded afterwards). Let us write a function that finds the examples that 
     354confirm the rule (fit both sides of it) and those that contradict it (fit the 
     355left-hand side but not the right). The example uses the `lenses.tab`_: :: 
    258356 
    259357    import Orange 
     
    280378    print 
    281379 
    282 The latter printouts get simpler and faster if we instruct the inducer to store the examples. We can then do, for instance, this: :: 
     380The latter printouts get simpler and faster if we instruct the inducer to 
     381store the examples. We can then do, for instance, this: :: 
    283382 
    284383    print "Match left: " 
     
    287386    print "\\n".join(str(rule.examples[i]) for i in rule.matchBoth) 
    288387 
    289 The "contradicting" examples are then those whose indices are found in matchLeft but not in matchBoth. The memory friendlier and the faster way to compute this is as follows: :: 
     388The "contradicting" examples are then those whose indices are found in 
     389matchLeft but not in matchBoth. The memory friendlier and the faster way 
     390to compute this is as follows: :: 
    290391 
    291392    >>> [x for x in rule.matchLeft if not x in rule.matchBoth] 
Note: See TracChangeset for help on using the changeset viewer.