source: orange/Orange/classification/lookup.py @ 10068:011be94478b7

Revision 10068:011be94478b7, 23.2 KB checked in by Lan Zagar <lan.zagar@…>, 2 years ago (diff)

Improvements to the lookup module.

Line 
1"""
2
3.. index:: classification; lookup
4
5*******************************
6Lookup classifiers (``lookup``)
7*******************************
8
9Lookup classifiers predict classes by looking into stored lists of
10cases. There are two kinds of such classifiers in Orange. The simpler
11and faster :obj:`ClassifierByLookupTable` uses up to three discrete
12features and has a stored mapping from values of those features to the
13class value. The more complex classifiers store an
14:obj:`Orange.data.Table` and predict the class by matching the instance
15to instances in the table.
16
17.. index::
18   single: feature construction; lookup classifiers
19
20A natural habitat for these classifiers is feature construction:
21they usually reside in :obj:`~Orange.feature.Descriptor.get_value_from`
22fields of constructed
23features to facilitate their automatic computation. For instance,
24the following script shows how to translate the `monks-1.tab` data set
25features into a more useful subset that will only include the features
26``a``, ``b``, ``e``, and features that will tell whether ``a`` and ``b``
27are equal and whether ``e`` is 1 (part of
28:download:`lookup-lookup.py <code/lookup-lookup.py>`):
29
30.. literalinclude:: code/lookup-lookup.py
31    :lines: 7-21
32   
33We can check the correctness of the script by printing out several
34random examples from ``data2``.
35
36    >>> for i in range(5):
37    ...     print table2.randomexample()
38    ['3', '2', 'no', '2', 'no', '0']
39    ['2', '2', 'yes', '2', 'no', '1']
40    ['1', '2', 'no', '2', 'no', '0']
41    ['2', '3', 'no', '1', 'yes', '1']
42    ['1', '3', 'no', '1', 'yes', '1']
43
44The first :obj:`ClassifierByLookupTable` takes values of features ``a``
45and ``b`` and computes the value of ``ab`` according to the rule given in the
46given table. The first three values correspond to ``a=1`` and ``b=1,2,3``;
47for the first combination, value of ``ab`` should be "yes", for the other
48two ``a`` and ``b`` are different. The next triplet corresponds to ``a=2``;
49here, the middle value is "yes"...
50
51The second lookup is simpler: since it involves only a single feature,
52the list is a simple one-to-one mapping from the four-valued ``e`` to the
53two-valued ``e1``. The last value in the list is returned when ``e`` is unknown
54and tells that ``e1`` should be unknown then as well.
55
56Note that :obj:`ClassifierByLookupTable` is not needed for this.
57The new feature ``e1`` could be computed with a callback to Python,
58for instance::
59
60    e2.get_value_from = lambda ex, rw: orange.Value(e2, ex["e"] == "1")
61
62
63Classifiers by lookup table
64===========================
65
66.. index::
67   single: classification; lookup table
68
69Although the above example used :obj:`ClassifierByLookupTable` as if it
70was a concrete class, :obj:`ClassifierByLookupTable` is actually
71abstract. Calling its constructor is a typical Orange trick: it does not
72return an instance of :obj:`ClassifierByLookupTable`, but either
73:obj:`ClassifierByLookupTable1`, :obj:`ClassifierByLookupTable2` or
74:obj:`ClassifierByLookupTable3`. As their names tell, the first
75classifies using a single feature (so that's what we had for ``e1``),
76the second uses a pair of features (and has been constructed for ``ab``
77above), and the third uses three features. Class predictions for each
78combination of feature values are stored in a (one dimensional) table.
79To classify an instance, the classifier computes an index of the element
80of the table that corresponds to the combination of feature values.
81
82These classifiers are built to be fast, not safe. For instance, if the number
83of values for one of the features is changed, Orange will most probably crash.
84To alleviate this, many of these classes' features are read-only and can only
85be set when the object is constructed.
86
87
88.. py:class:: ClassifierByLookupTable(class_var, variable1[, variable2[, variable3]] [, lookup_table[, distributions]])
89   
90    A general constructor that, based on the number of feature descriptors,
91    constructs one of the three classes discussed. If :obj:`lookup_table`
92    and :obj:`distributions` are omitted, the constructor also initializes
93    them to two lists of the right sizes, but their elements are don't knows
94    and empty distributions. If they are given, they must be of correct size.
95   
96    .. attribute:: variable1[, variable2[, variable3]](read only)
97       
98        The feature(s) that the classifier uses for classification.
99        ClassifierByLookupTable1 only has variable1,
100        ClassifierByLookupTable2 also has variable2 and
101        ClassifierByLookupTable3 has all three.
102
103    .. attribute:: variables (read only)
104       
105        The above variables, returned as a tuple.
106
107    .. attribute:: no_of_values1[, no_of_values2[, no_of_values3]] (read only)
108       
109        The number of values for variable1, variable2 and variable3.
110        This is stored here to make the classifier faster. Those features
111        are defined only for ClassifierByLookupTable2 (the first two) and
112        ClassifierByLookupTable3 (all three).
113
114    .. attribute:: lookup_table (read only)
115       
116        A list of values, one for each possible
117        combination of features. For ClassifierByLookupTable1, there is an
118        additional element that is returned when the feature's value is
119        unknown. Values are ordered by values of features, with variable1
120        being the most important. In case of two three valued features, the
121        list order is therefore 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3,
122        where the first digit corresponds to variable1 and the second to
123        variable2.
124       
125        The attribute is read-only - a new list cannot be assigned to it.
126        Its elements, however, can be changed. Don't change its size.
127
128    .. attribute:: distributions (read only)
129       
130        Similar to :obj:`lookup_table`, but it stores a distribution for
131        each combination of values.
132
133    .. attribute:: data_description
134       
135        An object of type :obj:`EFMDataDescription`, defined only for
136        ClassifierByLookupTable2 and ClassifierByLookupTable3. They use it
137        to make predictions when one or more feature values are unknown.
138        ClassifierByLookupTable1 doesn't need it since this case is covered by
139        an additional element in :obj:`lookup_table` and :obj:`distributions`,
140        as told above.
141       
142    .. method:: get_index(example)
143   
144        Returns an index of ``example`` in :obj:`lookup_table` and
145        :obj:`distributions`. The formula depends upon the type of
146        the classifier. If value\ *i* is int(example[variable\ *i*]),
147        then the corresponding formulae are
148
149        ClassifierByLookupTable1:
150            index = value1, or len(lookup_table) - 1 if value is unknown
151        ClassifierByLookupTable2:
152            index = value1 * no_of_values1 + value2, or -1 if any value is unknown
153        ClassifierByLookupTable3:
154            index = (value1 * no_of_values1 + value2) * no_of_values2 + value3, or -1 if any value is unknown
155
156        Let's see some indices for randomly chosen examples from the original table.
157       
158        part of :download:`lookup-lookup.py <code/lookup-lookup.py>`:
159
160        .. literalinclude:: code/lookup-lookup.py
161            :lines: 26-29
162       
163        Output::
164       
165            ['3', '2', '1', '2', '2', '1', '0']: ab 7, e1 1
166            ['2', '2', '1', '2', '2', '1', '1']: ab 4, e1 1
167            ['1', '2', '1', '2', '2', '2', '0']: ab 1, e1 1
168            ['2', '3', '2', '3', '1', '1', '1']: ab 5, e1 0
169            ['1', '3', '2', '2', '1', '1', '1']: ab 2, e1 0
170
171
172
173.. py:class:: ClassifierByLookupTable1(class_var, variable1 [, lookup_table, distributions])
174   
175    Uses a single feature for lookup. See
176    :obj:`ClassifierByLookupTable` for more details.
177
178.. py:class:: ClassifierByLookupTable2(class_var, variable1, variable2, [, lookup_table[, distributions]])
179   
180    Uses two features for lookup. See
181    :obj:`ClassifierByLookupTable` for more details.
182       
183.. py:class:: ClassifierByLookupTable3(class_var, variable1, variable2, variable3, [, lookup_table[, distributions]])
184   
185    Uses three features for lookup. See
186    :obj:`ClassifierByLookupTable` for more details.
187
188
189Classifier by data table
190========================
191
192.. index::
193   single: classification; data table
194
195:obj:`ClassifierByDataTable` is used in similar contexts as
196:obj:`ClassifierByLookupTable`. If you write, for instance, a
197constructive induction algorithm, it is recommended that the values
198of the new feature are computed either by one of classifiers by lookup
199table or by ClassifierByDataTable, depending on the number of bound
200features.
201
202.. py:class:: ClassifierByDataTable
203
204    :obj:`ClassifierByDataTable` is the alternative to
205    :obj:`ClassifierByLookupTable`. It is to be used when the
206    classification is based on more than three features. Instead of having
207    a lookup table, it stores an :obj:`Orange.data.Table`, which is
208    optimized for a faster access.
209   
210
211    .. attribute:: sorted_examples
212       
213        A :obj:`Orange.data.Table` with sorted data instances for lookup.
214        Instances in the table can be merged; if there were multiple
215        instances with the same feature values (but possibly different
216        classes), they are merged into a single instance. Regardless of
217        merging, class values in this table are distributed: their svalue
218        contains a :obj:`~Orange.statistics.distribution.Distribution`.
219
220    .. attribute:: classifier_for_unknown
221       
222        This classifier is used to classify instances which were not found
223        in the table. If classifier_for_unknown is not set, don't know's are
224        returned.
225
226    .. attribute:: variables (read only)
227       
228        A tuple with features in the domain. This field is here so that
229        :obj:`ClassifierByDataTable` appears more similar to
230        :obj:`ClassifierByLookupTable`. If a constructive induction
231        algorithm returns the result in one of these classifiers, and you
232        would like to check which features are used, you can use variables
233        regardless of the class you actually got.
234
235    There are no specific methods for ClassifierByDataTable.
236    Since this is a classifier, it can be called. When the instance to be
237    classified includes unknown values, :obj:`classifier_for_unknown` will be
238    used if it is defined.
239
240
241
242.. py:class:: LookupLearner
243   
244    Although :obj:`ClassifierByDataTable` is not really a classifier in
245    the sense that you will use it to classify instances, but is rather a
246    function for computation of intermediate values, it has an associated
247    learner, :obj:`LookupLearner`. The learner's task is, basically, to
248    construct a table for :obj:`ClassifierByDataTable.sorted_examples`.
249    It sorts them, merges them
250    and regards instance weights in the process as well.
251   
252    If data instances are provided to the constructor, the learning algorithm
253    is called and the resulting classifier is returned instead of the learner.
254
255part of :download:`lookup-table.py <code/lookup-table.py>`:
256
257.. literalinclude:: code/lookup-table.py
258    :lines: 7-13
259
260
261In `table_s`, we have prepared a table in which instances are described
262only by `a`, `b`, `e` and the class. The learner constructs a
263:obj:`ClassifierByDataTable` and stores instances from `table_s` into its
264:obj:`~ClassifierByDataTable.sorted_examples`. Instances are merged so that
265there are no duplicates.
266
267    >>> print len(table_s)
268    556
269    >>> print len(abe.sorted_examples)
270    36
271    >>> for i in abe.sorted_examples[:10]:
272    ...     print i
273    ['1', '1', '1', '1']
274    ['1', '1', '2', '1']
275    ['1', '1', '3', '1']
276    ['1', '1', '4', '1']
277    ['1', '2', '1', '1']
278    ['1', '2', '2', '0']
279    ['1', '2', '3', '0']
280    ['1', '2', '4', '0']
281    ['1', '3', '1', '1']
282    ['1', '3', '2', '0']
283
284Well, there's a bit more here than meets the eye: each instance's class
285value also stores the distribution of classes for all instances that
286were merged into it. In our case, the three features suffice to
287unambiguously determine the classes and, since instances covered the
288entire space, all distributions have 12 instances in one of the class
289and none in the other.
290
291    >>> for i in abe.sorted_examples[:10]:
292    ...     print i, i.get_class().svalue
293    ['1', '1', '1', '1'] <0.000, 12.000>
294    ['1', '1', '2', '1'] <0.000, 12.000>
295    ['1', '1', '3', '1'] <0.000, 12.000>
296    ['1', '1', '4', '1'] <0.000, 12.000>
297    ['1', '2', '1', '1'] <0.000, 12.000>
298    ['1', '2', '2', '0'] <12.000, 0.000>
299    ['1', '2', '3', '0'] <12.000, 0.000>
300    ['1', '2', '4', '0'] <12.000, 0.000>
301    ['1', '3', '1', '1'] <0.000, 12.000>
302    ['1', '3', '2', '0'] <12.000, 0.000>
303
304:obj:`ClassifierByDataTable` will usually be used by
305:obj:`~Orange.feature.Descriptor.get_value_from`. So, we
306would probably continue this by constructing a new feature and put the
307classifier into its :obj:`~Orange.feature.Descriptor.get_value_from`.
308
309    >>> y2 = Orange.feature.Discrete("y2", values = ["0", "1"])
310    >>> y2.get_value_from = abe
311
312Although `abe` determines the value of `y2`, `abe.class_var` is still `y`.
313Orange doesn't mind (the whole example is artificial - the entire data set
314will seldom be packed in an :obj:`ClassifierByDataTable`), but this can still
315be solved by
316
317    >>> abe.class_var = y2
318
319The whole story can be greatly simplified. :obj:`LookupLearner` can also be
320called differently than other learners. Besides instances, you can pass
321the new class variable and the features that should be used for
322classification. This saves us from constructing table_s and reassigning
323the :obj:`~Orange.data.Domain.class_var`. It doesn't set the
324:obj:`~Orange.feature.Descriptor.get_value_from`, though.
325
326part of :download:`lookup-table.py <code/lookup-table.py>`::
327
328    import Orange
329
330    table = Orange.data.Table("monks-1")
331    a, b, e = table.domain["a"], table.domain["b"], table.domain["e"]
332
333    y2 = Orange.feature.Discrete("y2", values = ["0", "1"])
334    abe2 = Orange.classification.lookup.LookupLearner(y2, [a, b, e], table)
335
336Let us, for the end, show another use of :obj:`LookupLearner`. With the
337alternative call arguments, it offers an easy way to observe feature
338interactions. For this purpose, we shall omit `e`, and construct a
339:obj:`ClassifierByDataTable` from `a` and `b` only (part of
340:download:`lookup-table.py <code/lookup-table.py>`):
341
342.. literalinclude:: code/lookup-table.py
343    :lines: 32-35
344
345The script's output show how the classes are distributed for different
346values of `a` and `b`::
347
348    ['1', '1', '1'] <0.000, 48.000>
349    ['1', '2', '0'] <36.000, 12.000>
350    ['1', '3', '0'] <36.000, 12.000>
351    ['2', '1', '0'] <36.000, 12.000>
352    ['2', '2', '1'] <0.000, 48.000>
353    ['2', '3', '0'] <36.000, 12.000>
354    ['3', '1', '0'] <36.000, 12.000>
355    ['3', '2', '0'] <36.000, 12.000>
356    ['3', '3', '1'] <0.000, 48.000>
357
358For instance, when `a` is '1' and `b` is '3', the majority class is '0',
359and the class distribution is 36:12 in favor of '0'.
360
361
362Utility functions
363=================
364
365
366There are several functions for working with classifiers that use a stored
367data table for making predictions. There are four such classifiers; the most
368general stores a :class:`~Orange.data.Table` and the other three are
369specialized and optimized for cases where the domain contains only one, two or
370three features (besides the class variable).
371
372.. function:: lookup_from_bound(class_var, bound)
373
374    This function constructs an appropriate lookup classifier for one, two or
375    three features. If there are more, it returns None. The resulting
376    classifier is of type :obj:`ClassifierByLookupTable`,
377    :obj:`ClassifierByLookupTable2` or :obj:`ClassifierByLookupTable3`, with
378    `class_var` and bound set set as given.
379
380    For example, using the data set `monks-1.tab`, to construct a new feature
381    from features `a` and `b`, this function can be called as follows.
382   
383        >>> new_var = Orange.feature.Discrete()
384        >>> bound = [table.domain[name] for name in ["a", "b"]]
385        >>> lookup = Orange.classification.lookup.lookup_from_bound(new_var, bound)
386        >>> print lookup.lookup_table
387        <?, ?, ?, ?, ?, ?, ?, ?, ?>
388
389    Function `lookup_from_bound` does not initialize neither `new_var` nor
390    the lookup table...
391
392.. function:: lookup_from_function(class_var, bound, function)
393
394    ... and that's exactly where `lookup_from_function` differs from
395    :obj:`lookup_from_bound`. `lookup_from_function` first calls
396    :obj:`lookup_from_bound` and then uses the function to initialize the
397    lookup table. The other difference between this and the previous function
398    is that `lookup_from_function` also accepts bound sets with more than three
399    features. In this case, it construct a :obj:`ClassifierByDataTable`.
400
401    The function gets the values of features as integer indices and should
402    return an integer index of the "class value". The class value must be
403    properly initialized.
404
405    For exercise, let us construct a new feature called `a=b` whose value will
406    be "yes" when `a` and `b` are equal and "no" when they are not. We will then
407    add the feature to the data set.
408   
409        >>> bound = [table.domain[name] for name in ["a", "b"]]
410        >>> new_var = Orange.feature.Discrete("a=b", values=["no", "yes"])
411        >>> lookup = Orange.classification.lookup.lookup_from_function(new_var, bound, lambda x: x[0] == x[1])
412        >>> new_var.get_value_from = lookup
413        >>> import orngCI
414        >>> table2 = orngCI.addAnAttribute(new_var, table)
415        >>> for i in table2[:30]:
416            ... print i
417        ['1', '1', '1', '1', '3', '1', 'yes', '1']
418        ['1', '1', '1', '1', '3', '2', 'yes', '1']
419        ['1', '1', '1', '3', '2', '1', 'yes', '1']
420        ...
421        ['1', '2', '1', '1', '1', '2', 'no', '1']
422        ['1', '2', '1', '1', '2', '1', 'no', '0']
423        ['1', '2', '1', '1', '3', '1', 'no', '0']
424        ...
425
426    The feature was inserted with use of `orngCI.addAnAttribute`. By setting
427    `new_var.get_value_from` to `lookup` we state that when converting domains
428    (either when needed by `addAnAttribute` or at some other place), `lookup`
429    should be used to compute `new_var`'s value. (A bit off topic, but
430    important: you should never call
431    :obj:`~Orange.feature.Descriptor.get_value_from` directly, but always
432    through :obj:`~Orange.feature.Descriptor.compute_value`.)
433
434.. function:: lookup_from_data(examples [, weight])
435
436    This function takes a set of data instances (e.g. :obj:`Orange.data.Table`)
437    and turns it into a classifier. If there are one, two or three features and
438    no ambiguous examples (examples are ambiguous if they have same values of
439    features but with different class values), it will construct an appropriate
440    :obj:`ClassifierByLookupTable`. Otherwise, it will return an
441    :obj:`ClassifierByDataTable`.
442   
443        >>> lookup = Orange.classification.lookup.lookup_from_data(table)
444        >>> test_instance = Orange.data.Instance(table.domain, ['3', '2', '2', '3', '4', '1', '?'])
445        >>> lookup(test_instance)
446        <orange.Value 'y'='0'>
447   
448.. function:: dump_lookup_function(func)
449
450    `dump_lookup_function` returns a string with a lookup function in
451    tab-delimited format. Argument `func` can be any of the above-mentioned
452    classifiers or a feature whose
453    :obj:`~Orange.feature.Descriptor.get_value_from` points to one of such
454    classifiers.
455
456    For instance, if `lookup` is such as constructed in the example for
457    `lookup_from_function`, it can be printed by::
458   
459        >>> print dump_lookup_function(lookup)
460        a      b      a=b
461        ------ ------ ------
462        1      1      yes
463        1      2      no
464        1      3      no
465        2      1      no
466        2      2      yes
467        2      3      no
468        3      1      no
469        3      2      no
470        3      3      yes
471
472"""
473
474from Orange.misc import deprecated_keywords
475import Orange.data
476from Orange.core import \
477        LookupLearner, \
478         ClassifierByLookupTable, \
479              ClassifierByLookupTable1, \
480              ClassifierByLookupTable2, \
481              ClassifierByLookupTable3, \
482              ClassifierByExampleTable as ClassifierByDataTable
483
484
485@deprecated_keywords({"attribute":"class_var"})
486def lookup_from_bound(class_var, bound):
487    if not len(bound):
488        raise TypeError, "no bound attributes"
489    elif len(bound) <= 3:
490        return [ClassifierByLookupTable, ClassifierByLookupTable2,
491                ClassifierByLookupTable3][len(bound) - 1](class_var, *list(bound))
492    else:
493        return None
494
495   
496@deprecated_keywords({"attribute":"class_var"})
497def lookup_from_function(class_var, bound, function):
498    """
499    Constructs ClassifierByDataTable or ClassifierByLookupTable
500    mirroring the given function.
501   
502    """
503    lookup = lookup_from_bound(class_var, bound)
504    if lookup:
505        for i, attrs in enumerate(Orange.misc.counters.LimitedCounter(
506                    [len(var.values) for var in bound])):
507            lookup.lookup_table[i] = Orange.data.Value(class_var, function(attrs))
508        return lookup
509    else:
510        dom = Orange.data.Domain(bound, class_var)
511        data = Orange.data.Table(dom)
512        for attrs in Orange.misc.counters.LimitedCounter(
513                    [len(var.values) for var in dom.features]):
514            data.append(Orange.data.Example(dom, attrs + [function(attrs)]))
515        return LookupLearner(data)
516     
517
518@deprecated_keywords({"learnerForUnknown":"learner_for_unknown"})
519def lookup_from_data(examples, weight=0, learner_for_unknown=None):
520    if len(examples.domain.features) <= 3:
521        lookup = lookup_from_bound(examples.domain.class_var,
522                                 examples.domain.features)
523        lookup_table = lookup.lookup_table
524        for example in examples:
525            ind = lookup.get_index(example)
526            if not lookup_table[ind].is_special() and (lookup_table[ind] !=
527                                                     example.get_class()):
528                break
529            lookup_table[ind] = example.get_class()
530        else:
531            return lookup
532
533        # there are ambiguities; a backup plan is
534        # ClassifierByDataTable, let it deal with them
535        return LookupLearner(examples, weight,
536                             learner_for_unknown=learner_for_unknown)
537
538    else:
539        return LookupLearner(examples, weight,
540                             learner_for_unknown=learner_for_unknown)
541       
542       
543def dump_lookup_function(func):
544    if isinstance(func, Orange.feature.Descriptor):
545        if not func.get_value_from:
546            raise TypeError, "attribute '%s' does not have an associated function" % func.name
547        else:
548            func = func.get_value_from
549
550    outp = ""
551    if isinstance(func, ClassifierByDataTable):
552    # XXX This needs some polishing :-)
553        for i in func.sorted_examples:
554            outp += "%s\n" % i
555    else:
556        boundset = func.boundset()
557        for a in boundset:
558            outp += "%s\t" % a.name
559        outp += "%s\n" % func.class_var.name
560        outp += "------\t" * (len(boundset)+1) + "\n"
561       
562        lc = 0
563        if len(boundset)==1:
564            cnt = Orange.misc.counters.LimitedCounter([len(x.values)+1 for x in boundset])
565        else:
566            cnt = Orange.misc.counters.LimitedCounter([len(x.values) for x in boundset])
567        for ex in cnt:
568            for i in range(len(ex)):
569                if ex[i] < len(boundset[i].values):
570                    outp += "%s\t" % boundset[i].values[ex[i]]
571                else:
572                    outp += "?\t",
573            outp += "%s\n" % func.class_var.values[int(func.lookup_table[lc])]
574            lc += 1
575    return outp
Note: See TracBrowser for help on using the repository browser.