Changeset 10086:6dc6d4c97504 in orange


Ignore:
Timestamp:
02/08/12 16:08:50 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
rebase_source:
6c6b5e9e897e6860992f6859f099482b293e567b
Message:

Refurbished and corrected documentation and regression tests for association rules.

Files:
10 added
4 edited

Legend:

Unmodified
Added
Removed
  • Orange/associate/__init__.py

    r9988 r10086  
    99def print_rules(rules, ms = []): 
    1010    """ 
    11     Print the rules. If ms is left empty, only the rules are printed. If ms 
    12     contains rules' attributes, e.g. ``["support", "confidence"]``, these are printed out as well. 
     11    Print the association rules. 
     12 
     13    :param rules: Association rules. 
     14    :type rules: AssociationRules 
     15 
     16    :param ms: Attributes of the rule to be printed with the rule (default: []). To report on confidence and support 
     17     use ``ms=["support", "confidence"]`` 
     18    :type ms: list 
    1319    """ 
    1420    if ms: 
     
    3339def sort(rules, ms = ["support"]): 
    3440    """ 
    35     Sort the rules according to the given criteria. The default key is "support"; list multiple keys in a list. 
     41    Sort the rules according to the given criteria. 
     42 
     43    :param rules: Association rules. 
     44    :type rules: AssociationRules 
     45 
     46    :param ms: Sort keys (list of association rules' attributes, default: `["support"]`. 
     47    :type ms: list 
    3648    """ 
    3749    rules.sort(__Cmp(ms)) 
  • docs/reference/rst/Orange.associate.rst

    r9994 r10086  
    1 ==================================== 
    2 Association rules (``associate``) 
    3 ==================================== 
    4  
    5 ============================== 
    6 Induction of association rules 
    7 ============================== 
     1.. py:currentmodule:: Orange.associate 
     2 
     3======================================================= 
     4Association rules and frequent itemsets (``associate``) 
     5======================================================= 
    86 
    97Orange provides two algorithms for induction of 
    10 `association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
    11 One is the basic Agrawal's algorithm with dynamic induction of supported 
    12 itemsets and rules that is designed specifically for datasets with a 
    13 large number of different items. This is, however, not really suitable 
    14 for feature-based machine learning problems. 
    15 We have adapted the original algorithm for efficiency 
    16 with the latter type of data, and to induce the rules where, 
    17 both sides don't only contain features 
    18 (like "bread, butter -> jam") but also their values 
    19 ("bread = wheat, butter = yes -> jam = plum"). 
    20  
    21 It is also possible to extract item sets instead of association rules. These 
    22 are often more interesting than the rules themselves. 
    23  
    24 Besides association rule inducer, Orange also provides a rather simplified 
    25 method for classification by association rules. 
    26  
    27 =================== 
    28 Agrawal's algorithm 
    29 =================== 
    30  
    31 The class that induces rules by Agrawal's algorithm, accepts the data examples 
    32 of two forms. The first is the standard form in which each example is 
    33 described by values of a fixed list of features (defined in domain). 
    34 The algorithm, however, disregards the feature values and only checks whether 
    35 the value is defined or not. The rule shown above ("bread, butter -> jam") 
    36 actually means that if "bread" and "butter" are defined, then "jam" is defined 
    37 as well. It is expected that most of values will be undefined - if this is not 
    38 so, use the :class:`~AssociationRulesInducer`. 
    39  
    40 :class:`AssociationRulesSparseInducer` can also use sparse data. 
    41 Sparse examples have no fixed 
    42 features - the domain is empty. All values assigned to example are given as meta attributes. 
    43 All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
    44 The most suitable format fot this kind of data it is the basket format. 
    45  
    46 The algorithm first dynamically builds all itemsets (sets of features) that have 
    47 at least the prescribed support. Each of these is then used to derive rules 
    48 with requested confidence. 
    49  
    50 If examples were given in the sparse form, so are the left and right side 
    51 of the induced rules. If examples were given in the standard form, so are 
    52 the examples in association rules. 
     8`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_, 
     9a standard `Apriori algorithm <http://en.wikipedia.org/wiki/Apriori_algorithm>`_ [AgrawalSrikant1994]_ for sparse (basket) data analysis 
     10and a variant of Apriori for attribute-value data sets. Both algorithms also support mining of frequent itemsets. 
     11 
     12For example, consider a simple market basket data:: 
     13 
     14    Bread, Milk 
     15    Bread, Diapers, Beer, Eggs 
     16    Milk, Diapers, Beer, Cola 
     17    Bread, Milk, Diapers, Beer 
     18    Bread, Milk, Diapers, Cola 
     19 
     20The following script induces association rules with items that appear in at least 30% of data instances 
     21(transactions): 
     22 
     23.. literalinclude:: code/associate-market.py 
     24 
     25The code reports on support and confidence first five rules found:: 
     26 
     27    Supp Conf  Rule 
     28     0.4  1.0  Cola -> Diapers 
     29     0.4  0.5  Diapers -> Cola 
     30     0.4  1.0  Cola -> Diapers Milk 
     31     0.4  1.0  Cola Diapers -> Milk 
     32     0.4  1.0  Cola Milk -> Diapers 
     33 
     34In Apriori, association rule induction is two-stage algorithm first finds itemsets that frequently appear in 
     35the data and have sufficient support, and then splits them to rules of sufficient confidence. Function `getItemsets` 
     36reports on itemsets alone and skips rule induction: 
     37 
     38.. literalinclude:: code/associate-frequent-itemsets.py 
     39 
     40The above script lists frequent itemsets and their support:: 
     41 
     42    (0.40) Cola 
     43    (0.40) Cola Diapers 
     44    (0.40) Cola Diapers Milk 
     45    (0.40) Cola Milk 
     46    (0.60) Beer 
     47 
     48====================================== 
     49Association rules induction algorithms 
     50====================================== 
     51 
     52:class:`AssociationRulesSparseInducer` induces frequent itemsets and association rules from sparse data sets. These 
     53can be either provided in the basket format (see :doc:`Orange.data.formats`) or in an attribute-value format where any 
     54entry in the data table is considered as presence of a feature in the transaction (an item), 
     55and any unknown (empty) entry signifies its absence. :class:`AssociationRulesInducer` works feature-value data, 
     56where am item is a combination of feature and its value (e.g., `astigmatic=yes`). 
     57 
     58Sparse (basket) data sets 
     59------------------------- 
    5360 
    5461.. class:: AssociationRulesSparseInducer 
     
    5663    .. attribute:: support 
    5764 
    58         Minimal support for the rule. 
     65        Minimal support for the rule. Depending on the data set it should be set to sufficiently high value 
     66        to avoid running out of working memory (default: 0.3). 
    5967 
    6068    .. attribute:: confidence 
     
    6977    .. attribute:: max_item_sets 
    7078 
    71         The maximal number of itemsets. The algorithm's 
    72         running time (and its memory consumption) depends on the minimal support; 
    73         the lower the requested support, the more eligible itemsets will be found. 
    74         There is no general rule for setting support - perhaps it 
    75         should be around 0.3, but this depends on the data set. 
    76         If the supoort was set too low, the algorithm could run out of memory. 
    77         Therefore, Orange limits the number of generated rules to 
    78         :obj:`max_item_sets`. If Orange reports, that the prescribed 
    79         :obj:`max_item_sets` was exceeded, increase the requered support 
    80         or alternatively, increase :obj:`max_item_sets` to as high as you computer 
    81         can handle. 
     79        The maximal number of itemsets induced. Orange will stop with inference of 
     80        frequent itemsets once this number of itemsets is reached. 
    8281 
    8382    .. method:: __call__(data, weight_id) 
    8483 
    85         Induce rules from the data set. 
    86  
     84        Induce rules from the provided data set. 
    8785 
    8886    .. method:: get_itemsets(data) 
    8987 
    90         Returns a list of pairs. The first element of a pair is a tuple with 
    91         indices of features in the item set (negative for sparse data). 
    92         The second element is a list of indices supporting the item set, that is, 
    93         all the items in the set. If :obj:`store_examples` is False, the second 
     88        For a given data set, return a list of frequent itemsets. List elements are pairs, 
     89        where the first element includes indices of features in the item set (negative for sparse data) and 
     90        the second element a list of indices supporting the itemset. 
     91        If :obj:`store_examples` is False, the second 
    9492        element is None. 
    9593 
    96 We shall test the rule inducer on a dataset consisting of a brief description 
    97 of Spanish Inquisition, given by Palin et al: 
     94To test this rule inducer, we will first create a sparse data sets consisting of list of words in sentences from a brief description 
     95of Spanish Inquisition, given by Palin et al.: 
    9896 
    9997    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
     
    10199    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
    102100 
    103 The text needs to be cleaned of punctuation marks and capital letters at beginnings of the sentences, each sentence needs to be put in a new line and commas need to be inserted between the words. 
    104  
    105 Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
     101After some cleaning (e.g., removal of stopwords and punctuation marks), 
     102our data set looks like (:download:`inquisition.basket <code/inquisition.basket>`): 
    106103 
    107104.. literalinclude:: code/inquisition.basket 
    108105 
    109 Inducing the rules is trivial:: 
    110  
    111     import Orange 
    112     data = Orange.data.Table("inquisition") 
    113  
    114     rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
    115  
    116     print "%5s   %5s" % ("supp", "conf") 
    117     for r in rules: 
    118         print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
    119  
    120 The induced rules are surprisingly fear-full: :: 
     106The following script induces the association rules: 
     107 
     108.. literalinclude:: code/associate-inquistion.py 
     109 
     110The induced rules are surprisingly fear-full:: 
    121111 
    122112    0.500   1.000   fear -> surprise 
     
    156146.. _non-sparse-examples: 
    157147 
    158 =================== 
    159 Non-sparse data 
    160 =================== 
     148==================================== 
     149Feature-value (non-sparse) data sets 
     150==================================== 
    161151 
    162152:class:`AssociationRulesInducer` works with non-sparse data. 
    163 Unknown values are ignored, while values of features are not (as opposite to 
    164 the algorithm for sparse rules). In addition, the algorithm 
    165 can be directed to search only for classification rules, in which the only 
    166 feature on the right-hand side is the class variable. 
     153 
    167154 
    168155.. class:: AssociationRulesInducer 
    169156 
    170     All attributes can be set with the constructor. 
     157    Association rule induction from non-sparse data sets. An item is a feature-value combination. Unknown values in 
     158    the data table are ignored. The algorithm can also be used to search only for classification rules where the 
     159    feature on the right-hand side is the class variable. 
    171160 
    172161    .. attribute:: support 
    173162 
    174        Minimal support for the rule. 
     163       Minimal support of the induced rule (default: 0.3) 
    175164 
    176165    .. attribute:: confidence 
    177166 
    178         Minimal confidence for the rule. 
     167        Minimal confidence of the induced rule. 
    179168 
    180169    .. attribute:: classification_rules 
    181170 
    182         If True (default is False), the classification rules are constructed instead 
    183         of general association rules. 
     171        If True, the classification rules are constructed instead 
     172        of general association rules (default: False). 
    184173 
    185174    .. attribute:: store_examples 
    186175 
    187176        Store the examples covered by each rule and those 
    188         confirming it 
     177        confirming it. 
    189178 
    190179    .. attribute:: max_item_sets 
    191180 
    192         The maximal number of itemsets. 
     181        The maximal number of itemsets induced. After reaching this limit the inference algorithm will stop. 
    193182 
    194183    .. method:: __call__(data, weight_id) 
    195184 
    196         Induce rules from the data set. 
     185        Induce rules from the given data set. 
    197186 
    198187    .. method:: get_itemsets(data) 
    199188 
    200         Returns a list of pairs. The first element of a pair is a tuple with 
    201         indices of features in the item set (negative for sparse data). 
    202         The second element is a list of indices supporting the item set, that is, 
    203         all the items in the set. If :obj:`store_examples` is False, the second 
     189        For a given data set, return a list of frequent itemsets. The list consists of pairs, where 
     190        the first element includes indices of features in the item set (negative for sparse data), and 
     191        the second element a list of indices supporting the item set. If :obj:`store_examples` is False, the second 
    204192        element is None. 
    205193 
    206 The example:: 
    207  
    208     import Orange 
    209  
    210     data = Orange.data.Table("lenses") 
    211  
    212     print "Association rules" 
    213     rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
    214     for r in rules: 
    215         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    216  
    217 The found rules are: :: 
     194Following is an example script that uses :class:`AssociationRulesInducer`: 
     195 
     196.. literalinclude:: code/associate-lenses.py 
     197 
     198Script reports the following rules (first colon is support, second confidence):: 
    218199 
    219200    0.333  0.533  lenses=none -> prescription=hypermetrope 
     
    224205    0.500  1.000  tear_rate=reduced -> lenses=none 
    225206 
    226 To limit the algorithm to classification rules, set classificationRules to 1: :: 
    227  
    228     print "\\nClassification rules" 
    229     rules = orange.AssociationRulesInducer(data, support = 0.3, classificationRules = 1) 
    230     for r in rules: 
    231         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    232  
    233 The found rules are, naturally, a subset of the above rules: :: 
    234  
    235     0.333  0.667  prescription=hypermetrope -> lenses=none 
    236     0.333  0.667  astigmatic=yes -> lenses=none 
    237     0.500  1.000  tear_rate=reduced -> lenses=none 
    238  
    239 Itemsets are induced in a similar fashion as for sparse data, except that the 
     207To infer classification rules we can use a similar script but set `classificationRules` to 1: 
     208 
     209.. literalinclude:: code/association-lenses-classification.py 
     210    :lines: 4-5 
     211 
     212These rules are a subset of association rules that in a consequent include only a class variable:: 
     213 
     2140.333  0.667  prescription=hypermetrope -> lenses=none 
     2150.333  0.667  astigmatic=yes -> lenses=none 
     2160.500  1.000  tear_rate=reduced -> lenses=none 
     217 
     218Frequent itemsets are induced in a similar fashion as for sparse data, except that the 
    240219first element of the tuple, the item set, is represented not by indices of 
    241 features, as before, but with tuples (feature-index, value-index): :: 
     220features, as before, but with tuples (feature-index, value-index): 
     221 
     222.. literalinclude:: code/association-lenses-itemsets.py 
     223    :lines: 4-6 
    242224 
    243225    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
     
    245227    print itemsets[8] 
    246228 
    247 This prints out :: 
     229The script prints out:: 
    248230 
    249231    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
    250232 
    251 meaning that the ninth itemset contains the second value of the third feature 
     233reporting that the ninth itemset contains the second value of the third feature 
    252234(2, 1), and the first value of the fifth (4, 0). 
    253235 
     
    256238======================= 
    257239 
    258 An :class:`AssociationRule` represents a rule. In Orange, methods for 
    259 induction of association rules return the induced rules in 
     240Methods for induction of association rules return the induced rules in 
    260241:class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
    261242 
     
    264245    .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
    265246 
    266         Constructs an association rule and computes all measures listed above. 
     247        Construct an association rule and compute evaluation scores (see below) based on counts given in the 
     248        arguments of the call. 
    267249 
    268250    .. method:: __init__(left, right, support, confidence) 
    269251 
    270         Construct association rule and sets its support and confidence. If 
    271         you intend to pass on such a rule you should set other attributes 
    272         manually - AssociationRules's constructor cannot compute anything 
    273         from arguments support and confidence. 
     252        Construct association rule and compute its support and confidence. For manual construction of such such a rule set other attributes 
     253        manually, as AssociationRules's constructor cannot compute anything only from support and 
     254        confidence. 
    274255 
    275256    .. method:: __init__(rule) 
    276257 
    277         Given an association rule as the argument, constructor copies of the 
    278         rule. 
     258        Given an association rule as the argument, constructor a copy of the rule. 
    279259 
    280260    .. attribute:: left, right 
    281261 
    282262        The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
    283         In rules created by :class:`AssociationRulesSparseInducer` from examples that 
    284         contain all values as meta-values, left and right are examples in the 
     263        In rules created by :class:`AssociationRulesSparseInducer` from data instances that 
     264        contain all values as meta-values, left and right are data instances in the 
    285265        same form. Otherwise, values in left that do not appear in the rule 
    286266        are "don't care", and value in right are "don't know". Both can, 
     
    289269    .. attribute:: n_left, n_right 
    290270 
    291         The number of features (i.e. defined values) on the left and on the 
     271        The number of items on the left and on the 
    292272        right side of the rule. 
    293273 
    294274    .. attribute:: n_applies_left, n_applies_right, n_applies_both 
    295275 
    296         The number of (learning) examples that conform to the left, the right 
    297         and to both sides of the rule. 
     276        The number of data instances matching the left, right and both sides of the rule, correspondingly. 
    298277 
    299278    .. attribute:: n_examples 
    300279 
    301         The total number of learning examples. 
     280        The total number of training instances. 
    302281 
    303282    .. attribute:: support 
     
    327306    .. attribute:: examples, match_left, match_both 
    328307 
    329         If store_examples was True during induction, examples contains a copy 
    330         of the example table used to induce the rules. Attributes match_left 
    331         and match_both are lists of integers, representing the indices of 
    332         examples which match the left-hand side of the rule and both sides, 
    333         respectively. 
    334  
    335     .. method:: applies_left(example) 
    336  
    337     .. method:: applies_right(example) 
    338  
    339     .. method:: applies_both(example) 
    340  
    341         Tells whether the example fits into the left, right or both sides of 
    342         the rule, respectively. If the rule is represented by sparse examples, 
    343         the given example must be sparse as well. 
    344  
    345 Association rule inducers do not store evidence about which example supports 
    346 which rule. Let us write a function that finds the examples that 
    347 confirm the rule (fit both sides of it) and those that contradict it (fit the 
    348 left-hand side but not the right). The example:: 
    349  
    350     import Orange 
    351  
    352     data = Orange.data.Table("lenses") 
    353  
    354     rules = Orange.associate.AssociationRulesInducer(data, supp = 0.3) 
    355     rule = rules[0] 
    356  
    357     print 
    358     print "Rule: ", rule 
    359     print 
    360  
    361     print "Supporting examples:" 
    362     for example in data: 
    363         if rule.appliesBoth(example): 
    364             print example 
    365     print 
    366  
    367     print "Contradicting examples:" 
    368     for example in data: 
    369         if rule.applies_left(example) and not rule.applies_right(example): 
    370             print example 
    371     print 
     308        If store_examples was `True` during induction, examples contain a copy 
     309        of the data table used to induce the rules. Attributes `match_left` 
     310        and `match_both` are lists of indices of data instances that match the left, 
     311        right and both sides of the rule, respectively. 
     312 
     313    .. method:: applies_left(data_instance) 
     314 
     315    .. method:: applies_right(data_instance) 
     316 
     317    .. method:: applies_both(data_instance) 
     318 
     319        Tests if data instance is matched by the left, right or both sides of 
     320        the rule, respectively. The data instance must be in the same representation as data from which the rule 
     321        was inferred. 
     322 
     323Association rule inducers do not store information on supporting data instances from training data set. 
     324Let us write a script that finds the data instances that 
     325match the rule (fit both sides of it) and those that contradict it (fit the 
     326left-hand side but not the right): 
     327 
     328.. literalinclude:: code/associate-traceback.py 
    372329 
    373330The latter printouts get simpler and faster if we instruct the inducer to 
    374 store the examples. We can then do, for instance, this: :: 
     331store the examples:: 
    375332 
    376333    print "Match left: " 
    377     print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
    378     print "\\nMatch both: " 
    379     print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
    380  
    381 The "contradicting" examples are then those whose indices are found in 
     334    print "\n".join(str(rule.examples[i]) for i in rule.match_left) 
     335    print "\nMatch both: " 
     336    print "\n".join(str(rule.examples[i]) for i in rule.match_both) 
     337 
     338The "contradicting" examples are those whose indices are found in 
    382339match_left but not in match_both. The memory friendlier and the faster way 
    383 to compute this is as follows: :: 
     340to compute this is:: 
    384341 
    385342    >>> [x for x in rule.match_left if not x in rule.match_both] 
     
    395352 
    396353.. autofunction:: sort 
     354 
     355References 
     356---------- 
     357 
     358.. [AgrawalSrikant1994] R Agrawal and R Srikant: Fast algorithms for mining association 
     359   rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487-499, 
     360   Santiago, Chile, September 1994. 
     361 
     362.. [TanSteinbachKumar2005] P-N Tan, M Steinbach and V Kumar: Introduction to Data Mining, 
     363   chapter on `Association analysis: basic concepts and algorithms 
     364   <http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf>`_, Addison Wesley, 2005. 
  • docs/reference/rst/Orange.data.domain.rst

    r9958 r10086  
    263263         :param variables: List of variables (instances of :obj:`~Orange.feature.Descriptor`) 
    264264         :type variables: list 
    265          :param class_vars: A list of multiple classes; must be a keword argument 
     265         :param class_vars: A list of multiple classes; must be a keyword argument 
    266266         :type class_vars: list 
    267267 
  • source/orange/assoc.hpp

    r6963 r10086  
    171171public: 
    172172 
    173   TAssociationRulesInducer(float asupp=0.1, float aconf=0.5); 
     173  TAssociationRulesInducer(float asupp=0.3, float aconf=0.5); 
    174174  PAssociationRules operator()(PExampleGenerator, const int &weightID = 0); 
    175175 
     
    275275  bool storeExamples; //P stores examples corresponding to rules 
    276276 
    277   TAssociationRulesSparseInducer(float asupp=0.1, float aconf=0, int awei=0); 
     277  TAssociationRulesSparseInducer(float asupp=0.3, float aconf=0, int awei=0); 
    278278  TSparseItemsetTree *TAssociationRulesSparseInducer::buildTree(PExampleGenerator examples, const int &weightID, long &i, float &fullWeight); 
    279279  PAssociationRules operator()(PExampleGenerator, const int &weightID); 
     
    297297  bool storeExamples; //P stores examples corresponding to itemsets 
    298298 
    299   TItemsetsSparseInducer(float asupp=0.1, int awei=0); 
     299  TItemsetsSparseInducer(float asupp=0.3, int awei=0); 
    300300  PSparseItemsetTree operator()(PExampleGenerator, const int &weightID); 
    301301 
Note: See TracChangeset for help on using the changeset viewer.