source: orange/orange/Orange/classification/lookup.py @ 7761:8ac2e36f9363

Revision 7761:8ac2e36f9363, 22.7 KB checked in by lanz <lan.zagar@…>, 3 years ago (diff)

Documentation and code refactoring for the lookup module.

Line 
1"""
2
3.. index:: classification; lookup
4
5******************
6Lookup classifiers
7******************
8
9Lookup classifiers predict classes by looking into stored lists of
10cases. There are two kinds of such classifiers in Orange. The simpler
11and fastest :obj:`ClassifierByLookupTable` use up to three discrete
12features and have a stored mapping from values of those features to
13class value. The more complex classifiers store a
14:obj:`Orange.data.Table` and predict the class by matching the instance
15to instances in the table.
16
17.. index::
18   single: feature construction; lookup classifiers
19
20A natural habitat for these classifiers is feature construction:
21they usually reside in :obj:`getValueFrom` fields of constructed
22features to facilitate their automatic computation. For instance,
23the following script shows how to translate the `monks-1.tab`_ dataset
24features into a more useful subset that will only include the features
25a, b, e, and features that will tell whether a and b are equal and
26whether e is 1 (don't bother about the details, they follow later;
27`lookup-lookup.py`_, uses: `monks-1.tab`_):
28
29.. literalinclude:: code/lookup-lookup.py
30    :lines: 7-21
31   
32We can check the correctness of the script by printing out several
33random examples from data2.
34
35    >>> for i in range(5):
36    ...     print table2.randomexample()
37    ['3', '2', 'no', '2', 'no', '0']
38    ['2', '2', 'yes', '2', 'no', '1']
39    ['1', '2', 'no', '2', 'no', '0']
40    ['2', '3', 'no', '1', 'yes', '1']
41    ['1', '3', 'no', '1', 'yes', '1']
42
43The first :obj:`ClassifierByLookupTable` takes values of features a
44and b and computes the value of ab according to the rule given in the
45given table. The first three values correspond to a=1 and b=1, 2, 3;
46for the first combination, value of ab should be "yes", for the other
47two a and b are different. The next triplet correspond to a=2;
48here, the middle value is "yes"...
49
50The second lookup is simpler: since it involves only a single feature,
51the list is a simple one-to-one mapping from the four-valued e to the
52two-valued e1. The last value in the list is returned when e is unknown
53and tells that e1 should be unknown then as well.
54
55Note that you don't need :obj:`ClassifierByLookupTable` for this.
56The new feature e1 could be computed with a callback to Python,
57for instance::
58
59    e2.getValueFrom = lambda ex, rw: orange.Value(e2, ex["e"]=="1")
60
61
62Classifiers by lookup table
63===========================
64
65.. index::
66   single: classification; lookup table
67
68Although the above example used :obj:`ClassifierByLookupTable` as if it
69was a concrete class, ClassifierByLookupTable is actually
70abstract. Calling its constructor is a typical Orange trick: what you
71get, is never ClassifierByLookupTable, but either
72:obj:`ClassifierByLookupTable1`, :obj:`ClassifierByLookupTable2` or
73:obj:`ClassifierByLookupTable3`. As their names tell, the first
74classifies using a single feature (so that's what we had for e1),
75the second uses a pair of features (and has been constructed for ab
76above), and the third uses three features. Class predictions for each
77combination of feature values are stored in a (one dimensional) table.
78To classify an instance, the classifier computes an index of the element
79of the table that corresponds to the combination of feature values.
80
81These classifiers are built to be fast, not safe. If you, for instance,
82change the number of values for one of the features, Orange will
83most probably crash. To protect you somewhat, many of these classes'
84features are read-only and can only be set when the object is
85constructed.
86
87
88.. py:class:: ClassifierByLookupTable(class_var, variable1[, variable2[, variable3]] [, lookup_table[, distributions]])
89   
90    A general constructor that, based on the number of feature
91    descriptors, constructs one of the three classes discussed.
92    If lookup_table and distributions are omitted, constructor also
93    initializes lookup_table and distributions to two lists of the
94    right sizes, but their elements are don't knows and empty
95    distributions. If they are given, they must be of correct size.
96   
97    .. attribute:: variable1[, variable2[, variable3]](read only)
98       
99        The feature(s) that the classifier uses for classification.
100        ClassifierByLookupTable1 only has variable1,
101        ClassifierByLookupTable2 also has variable2 and
102        ClassifierByLookupTable3 has all three.
103
104    .. attribute:: variables (read only)
105       
106        The above variables, returned as a tuple.
107
108    .. attribute:: noOfValues1, noOfValues2[, noOfValues3] (read only)
109       
110        The number of values for variable1, variable2 and variable3.
111        This is stored here to make the classifier faster. Those features
112        are defined only for ClassifierByLookupTable2 (the first two) and
113        ClassifierByLookupTable3 (all three).
114
115    .. attribute:: lookup_table (read only)
116       
117        A list of values (ValueList), one for each possible combination of
118        features. For ClassifierByLookupTable1, there is an additional
119        element that is returned when the feature's value is unknown.
120        Values are ordered by values of features, with variable1 being the
121        most important. In case of two three valued features, the list
122        order is therefore 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3,
123        where the first digit corresponds to variable1 and the second to
124        variable2.
125       
126        The list is read-only in the sense that you cannot assign a new
127        list to this field. You can, however, change its elements. Don't
128        change its size, though.
129
130    .. attribute:: distributions (read only)
131       
132        Similar to :obj:`lookup_table`, but is of type DistributionList
133        and stores a distribution for each combination of values.
134
135    .. attribute:: dataDescription
136       
137        An object of type EFMDataDescription, defined only for
138        ClassifierByLookupTable2 and ClassifierByLookupTable3. They use
139        it to make predictions when one or more feature values are unknown.
140        ClassifierByLookupTable1 doesn't need it since this case is covered
141        by an additional element in lookup_table and distributions,
142        as told above.
143       
144    .. method:: getindex(example)
145   
146        Returns an index into lookup_table or distributions. The formula
147        depends upon the type of the classifier. If value\ *i* is
148        int(example[variable\ *i*]), then the corresponding formulae are
149
150        ClassifierByLookupTable1:
151            index = value1, or len(lookup_table)-1 if value is unknown
152        ClassifierByLookupTable2:
153            index = value1*noOfValues1 + value2, or -1 if any value is unknown
154        ClassifierByLookupTable3:
155            index = (value1*noOfValues1 + value2) * noOfValues2 + value3, or -1 if any value is unknown
156
157        Let's see some indices for randomly chosen examples from the original table.
158       
159        part of `lookup-lookup.py`_ (uses: `monks-1.tab`_):
160
161        .. literalinclude:: code/lookup-lookup.py
162            :lines: 26-29
163       
164        Output::
165       
166            ['3', '2', '1', '2', '2', '1', '0']: ab 7, e1 1
167            ['2', '2', '1', '2', '2', '1', '1']: ab 4, e1 1
168            ['1', '2', '1', '2', '2', '2', '0']: ab 1, e1 1
169            ['2', '3', '2', '3', '1', '1', '1']: ab 5, e1 0
170            ['1', '3', '2', '2', '1', '1', '1']: ab 2, e1 0
171
172
173
174.. py:class:: ClassifierByLookupTable1(class_var, variable1 [, lookup_table, distributions])
175   
176    Uses a single feature for lookup. See
177    :obj:`ClassifierByLookupTable` for more details.
178
179.. py:class:: ClassifierByLookupTable2(class_var, variable1, variable2, [, lookup_table[, distributions]])
180   
181    Uses two features for lookup. See
182    :obj:`ClassifierByLookupTable` for more details.
183       
184.. py:class:: ClassifierByLookupTable3(class_var, variable1, variable2, variable3, [, lookup_table[, distributions]])
185   
186    Uses three features for lookup. See
187    :obj:`ClassifierByLookupTable` for more details.
188
189
190Classifier by data table
191========================
192
193.. index::
194   single: classification; data table
195
196:obj:`ClassifierByDataTable` is used in similar contexts as
197:obj:`ClassifierByLookupTable`. If you write, for instance, a
198constructive induction algorithm, it is recommended that the values
199of the new feature are computed either by one of classifiers by lookup
200table or by ClassifierByDataTable, depending on the number of bound
201features.
202
203.. py:class:: ClassifierByDataTable
204
205    :obj:`ClassifierByDataTable` is the alternative to
206    :obj:`ClassifierByLookupTable`. It is to be used when the
207    classification is based on more than three features. Instead of having
208    a lookup table, it stores an :obj:`Orange.data.Table`, which is
209    optimized for a faster access.
210   
211
212    .. attribute:: sorted_examples
213       
214        A :obj:`Orange.data.Table` with sorted data instances for lookup.
215        Instances in the table can be merged; if there were multiple
216        instances with the same feature values (but possibly different
217        classes), they are merged into a single instance. Regardless of
218        merging, class values in this table are distributed: their svalue
219        contains a :obj:`Distribution`.
220
221    .. attribute:: classifierForUnknown
222       
223        This classifier is used to classify instances which were not found
224        in the table. If classifierForUnknown is not set, don't know's are
225        returned.
226
227    .. attribute:: variables (read only)
228       
229        A tuple with features in the domain. This field is here so that
230        :obj:`ClassifierByDataTable` appears more similar to
231        :obj:`ClassifierByLookupTable`. If a constructive induction
232        algorithm returns the result in one of these classifiers, and you
233        would like to check which features are used, you can use variables
234        regardless of the class you actually got.
235
236    There are no specific methods for ClassifierByDataTable.
237    Since this is a classifier, it can be called. When the instance to be
238    classified includes unknown values, :obj:`classifierForUnknown` will be
239    used if it is defined.
240
241
242
243.. py:class:: LookupLearner
244   
245    Although :obj:`ClassifierByDataTable` is not really a classifier in
246    the sense that you will use it to classify instances, but is rather a
247    function for computation of intermediate values, it has an associated
248    learner, :obj:`LookupLearner`. The learner's task is, basically, to
249    construct a Table for :obj:`sorted_examples`. It sorts them, merges them
250    and, of course, regards instance weights in the process as well.
251   
252    If data instances are provided to the constructor, the learning algorithm
253    is called and the resulting classifier is returned instead of the learner.
254
255part of `lookup-table.py`_ (uses: `monks-1.tab`_):
256
257.. literalinclude:: code/lookup-table.py
258    :lines: 7-13
259
260
261In data_s, we have prepared a table in which instances are described
262only by a, b, e and the class. Learner constructs a
263ClassifierByDataTable and stores instances from data_s into its
264sorted_examples. Instances are merged so that there are no duplicates.
265
266    >>> print len(table_s)
267    556
268    >>> print len(abe.sorted_examples)
269    36
270    >>> for i in abe.sorted_examples[:10]:
271    ...     print i
272    ['1', '1', '1', '1']
273    ['1', '1', '2', '1']
274    ['1', '1', '3', '1']
275    ['1', '1', '4', '1']
276    ['1', '2', '1', '1']
277    ['1', '2', '2', '0']
278    ['1', '2', '3', '0']
279    ['1', '2', '4', '0']
280    ['1', '3', '1', '1']
281    ['1', '3', '2', '0']
282
283Well, there's a bit more here than meets the eye: each instance's class
284value also stores the distribution of classes for all instances that
285were merged into it. In our case, the three features suffice to
286unambiguously determine the classes and, since instances covered the
287entire space, all distributions have 12 instances in one of the class
288and none in the other.
289
290    >>> for i in abe.sorted_examples[:10]:
291    ...     print i, i.getclass().svalue
292    ['1', '1', '1', '1'] <0.000, 12.000>
293    ['1', '1', '2', '1'] <0.000, 12.000>
294    ['1', '1', '3', '1'] <0.000, 12.000>
295    ['1', '1', '4', '1'] <0.000, 12.000>
296    ['1', '2', '1', '1'] <0.000, 12.000>
297    ['1', '2', '2', '0'] <12.000, 0.000>
298    ['1', '2', '3', '0'] <12.000, 0.000>
299    ['1', '2', '4', '0'] <12.000, 0.000>
300    ['1', '3', '1', '1'] <0.000, 12.000>
301    ['1', '3', '2', '0'] <12.000, 0.000>
302
303ClassifierByDataTable will usually be used by getValueFrom. So, we
304would probably continue this by constructing a new feature and put the
305classifier into its getValueFrom.
306
307    >>> y2 = Orange.data.variable.Discrete("y2", values = ["0", "1"])
308    >>> y2.getValueFrom = abe
309
310There's something disturbing here. Although abe determines the value of
311y2, abe.class_var is still y. Orange doesn't bother (the whole example
312is artificial - you will seldom pack the entire data set in an
313ClassifierByDataTable...), so shouldn't you. But still, for the sake
314of hygiene, you can conclude by
315
316    >>> abe.class_var = y2
317
318The whole story can be greatly simplified. LookupLearner can also be
319called differently than other learners. Besides instances, you can pass
320the new class variable and the features that should be used for
321classification. This saves us from constructing data_s and reassigning
322the class_var. It doesn't set the getValueFrom, though.
323
324part of `lookup-table.py`_ (uses: `monks-1.tab`_)::
325
326    import Orange
327
328    table = Orange.data.Table("monks-1")
329    a, b, e = table.domain["a"], table.domain["b"], table.domain["e"]
330
331    y2 = Orange.data.variable.Discrete("y2", values = ["0", "1"])
332    abe2 = Orange.classification.lookup.LookupLearner(y2, [a, b, e], table)
333
334Let us, for the end, show another use of LookupLearner. With the
335alternative call arguments, it offers an easy way to observe feature
336interactions. For this purpose, we shall omit e, and construct a
337ClassifierByDataTable from a and b only (part of `lookup-table.py`_; uses: `monks-1.tab`_):
338
339.. literalinclude:: code/lookup-table.py
340    :lines: 32-35
341
342The script's output show how the classes are distributed for different
343values of a and b::
344
345    ['1', '1', '1'] <0.000, 48.000>
346    ['1', '2', '0'] <36.000, 12.000>
347    ['1', '3', '0'] <36.000, 12.000>
348    ['2', '1', '0'] <36.000, 12.000>
349    ['2', '2', '1'] <0.000, 48.000>
350    ['2', '3', '0'] <36.000, 12.000>
351    ['3', '1', '0'] <36.000, 12.000>
352    ['3', '2', '0'] <36.000, 12.000>
353    ['3', '3', '1'] <0.000, 48.000>
354
355For instance, when a is '1' and b is '3', the majority class is '0',
356and the class distribution is 36:12 in favor of '0'.
357
358
359Utility functions
360=================
361
362
363There are several functions for working with classifiers that use a stored
364data table for making predictions. There are four such classifiers; the most
365general stores an :class:`Orange.data.Table` and the other three are
366specialized and optimized for cases where the domain contains only one, two or
367three features (besides the class variable).
368
369.. function:: lookup_from_bound(classVar, bound)
370
371    This function constructs an appropriate lookup classifier for one, two or
372    three features. If there are more, it returns None. The resulting
373    classifier is of type :obj:`ClassifierByLookupTable`,
374    :obj:`ClassifierByLookupTable2` or :obj:`ClassifierByLookupTable3`, with
375    classVar and bound set set as given.
376
377    If, for instance, table contains a data set Monk 1 and you would like to
378    construct a new feature from features a and b, you can call this function
379    as follows.
380   
381        >>> newvar = Orange.data.variable.Discrete()
382        >>> bound = [table.domain[name] for name in ["a", "b"]]
383        >>> lookup = lookup_from_bound(newvar, bound)
384        >>> print lookup.lookup_table
385        <?, ?, ?, ?, ?, ?, ?, ?, ?>
386
387    Function lookup_from_bound does not initialize neither newVar nor
388    the lookup table...
389
390.. function:: lookup_from_function(classVar, bound, function)
391
392    ... and that's exactly where lookup_from_function differs from
393    :obj:`lookup_from_bound`. lookup_from_function first calls
394    lookup_from_bound and then uses the function to initialize the lookup
395    table. The other difference between this and the previous function is that
396    lookup_from_function also accepts bound sets with more than three
397    features. In this case, it construct a :obj:`ClassifierByDataTable`.
398
399    The function gets the values of features as integer indices and should
400    return an integer index of the "class value". The class value must be
401    properly initialized.
402
403    For exercise, let us construct a new feature called a=b whose value will
404    be "yes" when a and b are equal and "no" when they are not. We will then
405    add the feature to the data set.
406   
407        >>> bound = [table.domain[name] for name in ["a", "b"]]
408        >>> newVar = Orange.data.variable.Discrete("a=b", values=["no", "yes"])
409        >>> lookup = lookup_from_function(newVar, bound, lambda x: x[0]==x[1])
410        >>> newVar.getValueFrom = lookup
411        >>> import orngCI
412        >>> table2 = orngCI.addAnAttribute(newVar, table)
413        >>> for i in table2[:30]:
414            ... print i
415        ['1', '1', '1', '1', '1', '1', 'yes', '1']
416        ['1', '1', '1', '1', '1', '2', 'yes', '1']
417        ['1', '1', '1', '1', '2', '1', 'yes', '1']
418        ['1', '1', '1', '1', '2', '2', 'yes', '1']
419        ...
420        ['2', '1', '2', '3', '4', '1', 'no', '0']
421        ['2', '1', '2', '3', '4', '2', 'no', '0']
422        ['2', '2', '1', '1', '1', '1', 'yes', '1']
423        ['2', '2', '1', '1', '1', '2', 'yes', '1']
424        ...
425
426    The feature was inserted with use of orngCI.addAnAttribute. By setting
427    newVar.getValueFrom to lookup we state that when converting domains
428    (either when needed by addAnAttribute or at some other place), lookup
429    should be used to compute newVar's value. (A bit off topic, but
430    important: you should never call getValueFrom directly, but always call
431    it through computeValue.)
432
433.. function:: lookup_from_data(examples [, weight])
434
435    This function takes a set of examples (e.g. :obj:`Orange.data.Table`)
436    and turns it into a classifier. If there are one, two or three features and
437    no ambiguous examples (examples are ambiguous if they have same values of
438    features but with different class values), it will construct an appropriate
439    :obj:`ClassifierByLookupTable`. Otherwise, it will return an
440    :obj:`ClassifierByDataTable`.
441   
442        >>> lookup = lookup_from_data(table)
443        >>> test_instance = Orange.data.Instance(table.domain, ['3', '2', '2', '3', '4', '1', '?'])
444        >>> lookup(test_instance)
445        <orange.Value 'y'='0'>
446   
447.. function:: dump_lookup_function(func)
448
449    dump_lookup_function returns a string with a lookup function in
450    tab-delimited format. Argument func can be any of the above-mentioned
451    classifiers or a feature whose getValueFrom points to one of such
452    classifiers.
453
454    For instance, if lookup is such as constructed in the example for
455    lookup_from_function, you can print it out by::
456   
457        >>> print dump_lookup_function(lookup)
458        a      b      a=b
459        ------ ------ ------
460        1      1      yes
461        1      2      no
462        1      3      no
463        2      1      no
464        2      2      yes
465        2      3      no
466        3      1      no
467        3      2      no
468        3      3      yes
469
470
471.. _lookup-lookup.py: code/lookup-lookup.py
472.. _lookup-table.py: code/lookup-table.py
473.. _monks-1.tab: code/monks-1.tab
474
475"""
476
477import Orange.data
478from Orange.core import \
479        LookupLearner, \
480         ClassifierByLookupTable, \
481              ClassifierByLookupTable1, \
482              ClassifierByLookupTable2, \
483              ClassifierByLookupTable3, \
484              ClassifierByExampleTable as ClassifierByDataTable
485
486
487def lookup_from_bound(attribute, bound):
488    if not len(bound):
489        raise TypeError, "no bound attributes"
490    elif len(bound) <= 3:
491        return [ClassifierByLookupTable, ClassifierByLookupTable2,
492                ClassifierByLookupTable3][len(bound) - 1](attribute, *list(bound))
493    else:
494        return None
495
496   
497def lookup_from_function(attribute, bound, function):
498    """Constructs ClassifierByDataTable or ClassifierByLookupTable
499    mirroring the given function
500   
501    """
502    lookup = lookup_from_bound(attribute, bound)
503    if lookup:
504        lookup.lookup_table = [Orange.data.Value(attribute, function(attributes))
505                              for attributes in Orange.misc.counters.LimitedCounter(
506                                  [len(attr.values) for attr in bound])]
507        return lookup
508    else:
509        examples = Orange.data.Table(Orange.data.Domain(bound, attribute))
510        for attributes in Orange.misc.counters.LimitedCounter([len(attr.values)
511                                                   for attr in dom.attributes]):
512            examples.append(Orange.data.Example(dom, attributes +
513                                                [function(attributes)]))
514        return LookupLearner(examples)
515     
516
517def lookup_from_data(examples, weight = 0, learnerForUnknown = None):
518    if len(examples.domain.attributes) <= 3:
519        lookup = lookup_from_bound(examples.domain.class_var,
520                                 examples.domain.attributes)
521        lookup_table = lookup.lookup_table
522        for example in examples:
523            ind = lookup.getindex(example)
524            if not lookup_table[ind].isSpecial() and (lookup_table[ind] <>
525                                                     example.getclass()):
526                break
527            lookup_table[ind] = example.getclass()
528        else:
529            return lookup
530
531        # there are ambiguities; a backup plan is
532        # ClassifierByDataTable, let it deal with them
533        return LookupLearner(examples, weight,
534                             learnerForUnknown=learnerForUnknown)
535
536    else:
537        return LookupLearner(examples, weight,
538                             learnerForUnknown=learnerForUnknown)
539       
540       
541def dump_lookup_function(func):
542    if isinstance(func, Orange.data.variable.Variable):
543        if not func.getValueFrom:
544            raise TypeError, "attribute '%s' does not have an associated function" % func.name
545        else:
546            func = func.getValueFrom
547
548    outp = ""
549    if isinstance(func, ClassifierByDataTable):
550    # XXX This needs some polishing :-)
551        for i in func.sorted_examples:
552            outp += "%s\n" % i
553    else:
554        boundset = func.boundset()
555        for a in boundset:
556            outp += "%s\t" % a.name
557        outp += "%s\n" % func.class_var.name
558        outp += "------\t" * (len(boundset)+1) + "\n"
559       
560        lc = 0
561        if len(boundset)==1:
562            cnt = Orange.misc.counters.LimitedCounter([len(x.values)+1 for x in boundset])
563        else:
564            cnt = Orange.misc.counters.LimitedCounter([len(x.values) for x in boundset])
565        for ex in cnt:
566            for i in range(len(ex)):
567                if ex[i]<len(boundset[i].values):
568                    outp += "%s\t" % boundset[i].values[ex[i]]
569                else:
570                    outp += "?\t",
571            outp += "%s\n" % func.class_var.values[int(func.lookup_table[lc])]
572            lc += 1
573    return outp
Note: See TracBrowser for help on using the repository browser.