source: orange/Orange/classification/lookup.py @ 9671:a7b056375472

Revision 9671:a7b056375472, 23.4 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

Line 
1"""
2
3.. index:: classification; lookup
4
5*******************************
6Lookup classifiers (``lookup``)
7*******************************
8
9Lookup classifiers predict classes by looking into stored lists of
10cases. There are two kinds of such classifiers in Orange. The simpler
11and faster :obj:`ClassifierByLookupTable` uses up to three discrete
12features and has a stored mapping from values of those features to the
13class value. The more complex classifiers store an
14:obj:`Orange.data.Table` and predict the class by matching the instance
15to instances in the table.
16
17.. index::
18   single: feature construction; lookup classifiers
19
20A natural habitat for these classifiers is feature construction:
21they usually reside in :obj:`~Orange.data.variable.Variable.get_value_from` fields of constructed
22features to facilitate their automatic computation. For instance,
23the following script shows how to translate the :download:`monks-1.tab <code/monks-1.tab>` data set
24features into a more useful subset that will only include the features
25``a``, ``b``, ``e``, and features that will tell whether ``a`` and ``b`` are equal and
26whether ``e`` is 1 (don't bother about the details, they follow later;
27:download:`lookup-lookup.py <code/lookup-lookup.py>`, uses: :download:`monks-1.tab <code/monks-1.tab>`):
28
29.. literalinclude:: code/lookup-lookup.py
30    :lines: 7-21
31   
32We can check the correctness of the script by printing out several
33random examples from ``data2``.
34
35    >>> for i in range(5):
36    ...     print table2.randomexample()
37    ['3', '2', 'no', '2', 'no', '0']
38    ['2', '2', 'yes', '2', 'no', '1']
39    ['1', '2', 'no', '2', 'no', '0']
40    ['2', '3', 'no', '1', 'yes', '1']
41    ['1', '3', 'no', '1', 'yes', '1']
42
43The first :obj:`ClassifierByLookupTable` takes values of features ``a``
44and ``b`` and computes the value of ``ab`` according to the rule given in the
45given table. The first three values correspond to ``a=1`` and ``b=1,2,3``;
46for the first combination, value of ``ab`` should be "yes", for the other
47two ``a`` and ``b`` are different. The next triplet corresponds to ``a=2``;
48here, the middle value is "yes"...
49
50The second lookup is simpler: since it involves only a single feature,
51the list is a simple one-to-one mapping from the four-valued ``e`` to the
52two-valued ``e1``. The last value in the list is returned when ``e`` is unknown
53and tells that ``e1`` should be unknown then as well.
54
55Note that you don't need :obj:`ClassifierByLookupTable` for this.
56The new feature ``e1`` could be computed with a callback to Python,
57for instance::
58
59    e2.get_value_from = lambda ex, rw: orange.Value(e2, ex["e"] == "1")
60
61
62Classifiers by lookup table
63===========================
64
65.. index::
66   single: classification; lookup table
67
68Although the above example used :obj:`ClassifierByLookupTable` as if it
69was a concrete class, ClassifierByLookupTable is actually
70abstract. Calling its constructor is a typical Orange trick: what you
71get, is never ClassifierByLookupTable, but either
72:obj:`ClassifierByLookupTable1`, :obj:`ClassifierByLookupTable2` or
73:obj:`ClassifierByLookupTable3`. As their names tell, the first
74classifies using a single feature (so that's what we had for ``e1``),
75the second uses a pair of features (and has been constructed for ``ab``
76above), and the third uses three features. Class predictions for each
77combination of feature values are stored in a (one dimensional) table.
78To classify an instance, the classifier computes an index of the element
79of the table that corresponds to the combination of feature values.
80
81These classifiers are built to be fast, not safe. If you, for instance,
82change the number of values for one of the features, Orange will
83most probably crash. To protect you somewhat, many of these classes'
84features are read-only and can only be set when the object is
85constructed.
86
87
88.. py:class:: ClassifierByLookupTable(class_var, variable1[, variable2[, variable3]] [, lookup_table[, distributions]])
89   
90    A general constructor that, based on the number of feature descriptors,
91    constructs one of the three classes discussed. If :obj:`lookup_table`
92    and :obj:`distributions` are omitted, the constructor also initializes
93    them to two lists of the right sizes, but their elements are don't knows
94    and empty distributions. If they are given, they must be of correct size.
95   
96    .. attribute:: variable1[, variable2[, variable3]](read only)
97       
98        The feature(s) that the classifier uses for classification.
99        ClassifierByLookupTable1 only has variable1,
100        ClassifierByLookupTable2 also has variable2 and
101        ClassifierByLookupTable3 has all three.
102
103    .. attribute:: variables (read only)
104       
105        The above variables, returned as a tuple.
106
107    .. attribute:: no_of_values1[, no_of_values2[, no_of_values3]] (read only)
108       
109        The number of values for variable1, variable2 and variable3.
110        This is stored here to make the classifier faster. Those features
111        are defined only for ClassifierByLookupTable2 (the first two) and
112        ClassifierByLookupTable3 (all three).
113
114    .. attribute:: lookup_table (read only)
115       
116        A list of values (:obj:`Orange.core.ValueList`), one for each possible
117        combination of features. For ClassifierByLookupTable1, there is an
118        additional element that is returned when the feature's value is
119        unknown. Values are ordered by values of features, with variable1
120        being the most important. In case of two three valued features, the
121        list order is therefore 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3,
122        where the first digit corresponds to variable1 and the second to
123        variable2.
124       
125        The list is read-only in the sense that you cannot assign a new
126        list to this field. You can, however, change its elements. Don't
127        change its size, though.
128
129    .. attribute:: distributions (read only)
130       
131        Similar to :obj:`lookup_table`, but is of type
132        :obj:`Orange.core.DistributionList` and stores a distribution
133        for each combination of values.
134
135    .. attribute:: data_description
136       
137        An object of type :obj:`EFMDataDescription`, defined only for
138        ClassifierByLookupTable2 and ClassifierByLookupTable3. They use it
139        to make predictions when one or more feature values are unknown.
140        ClassifierByLookupTable1 doesn't need it since this case is covered by
141        an additional element in :obj:`lookup_table` and :obj:`distributions`,
142        as told above.
143       
144    .. method:: get_index(example)
145   
146        Returns an index of ``example`` in :obj:`lookup_table` and
147        :obj:`distributions`. The formula depends upon the type of
148        the classifier. If value\ *i* is int(example[variable\ *i*]),
149        then the corresponding formulae are
150
151        ClassifierByLookupTable1:
152            index = value1, or len(lookup_table) - 1 if value is unknown
153        ClassifierByLookupTable2:
154            index = value1 * no_of_values1 + value2, or -1 if any value is unknown
155        ClassifierByLookupTable3:
156            index = (value1 * no_of_values1 + value2) * no_of_values2 + value3, or -1 if any value is unknown
157
158        Let's see some indices for randomly chosen examples from the original table.
159       
160        part of :download:`lookup-lookup.py <code/lookup-lookup.py>` (uses: :download:`monks-1.tab <code/monks-1.tab>`):
161
162        .. literalinclude:: code/lookup-lookup.py
163            :lines: 26-29
164       
165        Output::
166       
167            ['3', '2', '1', '2', '2', '1', '0']: ab 7, e1 1
168            ['2', '2', '1', '2', '2', '1', '1']: ab 4, e1 1
169            ['1', '2', '1', '2', '2', '2', '0']: ab 1, e1 1
170            ['2', '3', '2', '3', '1', '1', '1']: ab 5, e1 0
171            ['1', '3', '2', '2', '1', '1', '1']: ab 2, e1 0
172
173
174
175.. py:class:: ClassifierByLookupTable1(class_var, variable1 [, lookup_table, distributions])
176   
177    Uses a single feature for lookup. See
178    :obj:`ClassifierByLookupTable` for more details.
179
180.. py:class:: ClassifierByLookupTable2(class_var, variable1, variable2, [, lookup_table[, distributions]])
181   
182    Uses two features for lookup. See
183    :obj:`ClassifierByLookupTable` for more details.
184       
185.. py:class:: ClassifierByLookupTable3(class_var, variable1, variable2, variable3, [, lookup_table[, distributions]])
186   
187    Uses three features for lookup. See
188    :obj:`ClassifierByLookupTable` for more details.
189
190
191Classifier by data table
192========================
193
194.. index::
195   single: classification; data table
196
197:obj:`ClassifierByDataTable` is used in similar contexts as
198:obj:`ClassifierByLookupTable`. If you write, for instance, a
199constructive induction algorithm, it is recommended that the values
200of the new feature are computed either by one of classifiers by lookup
201table or by ClassifierByDataTable, depending on the number of bound
202features.
203
204.. py:class:: ClassifierByDataTable
205
206    :obj:`ClassifierByDataTable` is the alternative to
207    :obj:`ClassifierByLookupTable`. It is to be used when the
208    classification is based on more than three features. Instead of having
209    a lookup table, it stores an :obj:`Orange.data.Table`, which is
210    optimized for a faster access.
211   
212
213    .. attribute:: sorted_examples
214       
215        A :obj:`Orange.data.Table` with sorted data instances for lookup.
216        Instances in the table can be merged; if there were multiple
217        instances with the same feature values (but possibly different
218        classes), they are merged into a single instance. Regardless of
219        merging, class values in this table are distributed: their svalue
220        contains a :obj:`Distribution`.
221
222    .. attribute:: classifierForUnknown
223       
224        This classifier is used to classify instances which were not found
225        in the table. If classifierForUnknown is not set, don't know's are
226        returned.
227
228    .. attribute:: variables (read only)
229       
230        A tuple with features in the domain. This field is here so that
231        :obj:`ClassifierByDataTable` appears more similar to
232        :obj:`ClassifierByLookupTable`. If a constructive induction
233        algorithm returns the result in one of these classifiers, and you
234        would like to check which features are used, you can use variables
235        regardless of the class you actually got.
236
237    There are no specific methods for ClassifierByDataTable.
238    Since this is a classifier, it can be called. When the instance to be
239    classified includes unknown values, :obj:`classifierForUnknown` will be
240    used if it is defined.
241
242
243
244.. py:class:: LookupLearner
245   
246    Although :obj:`ClassifierByDataTable` is not really a classifier in
247    the sense that you will use it to classify instances, but is rather a
248    function for computation of intermediate values, it has an associated
249    learner, :obj:`LookupLearner`. The learner's task is, basically, to
250    construct a Table for :obj:`sorted_examples`. It sorts them, merges them
251    and, of course, regards instance weights in the process as well.
252   
253    If data instances are provided to the constructor, the learning algorithm
254    is called and the resulting classifier is returned instead of the learner.
255
256part of :download:`lookup-table.py <code/lookup-table.py>` (uses: :download:`monks-1.tab <code/monks-1.tab>`):
257
258.. literalinclude:: code/lookup-table.py
259    :lines: 7-13
260
261
262In data_s, we have prepared a table in which instances are described
263only by a, b, e and the class. Learner constructs a
264ClassifierByDataTable and stores instances from data_s into its
265sorted_examples. Instances are merged so that there are no duplicates.
266
267    >>> print len(table_s)
268    556
269    >>> print len(abe.sorted_examples)
270    36
271    >>> for i in abe.sorted_examples[:10]:
272    ...     print i
273    ['1', '1', '1', '1']
274    ['1', '1', '2', '1']
275    ['1', '1', '3', '1']
276    ['1', '1', '4', '1']
277    ['1', '2', '1', '1']
278    ['1', '2', '2', '0']
279    ['1', '2', '3', '0']
280    ['1', '2', '4', '0']
281    ['1', '3', '1', '1']
282    ['1', '3', '2', '0']
283
284Well, there's a bit more here than meets the eye: each instance's class
285value also stores the distribution of classes for all instances that
286were merged into it. In our case, the three features suffice to
287unambiguously determine the classes and, since instances covered the
288entire space, all distributions have 12 instances in one of the class
289and none in the other.
290
291    >>> for i in abe.sorted_examples[:10]:
292    ...     print i, i.getclass().svalue
293    ['1', '1', '1', '1'] <0.000, 12.000>
294    ['1', '1', '2', '1'] <0.000, 12.000>
295    ['1', '1', '3', '1'] <0.000, 12.000>
296    ['1', '1', '4', '1'] <0.000, 12.000>
297    ['1', '2', '1', '1'] <0.000, 12.000>
298    ['1', '2', '2', '0'] <12.000, 0.000>
299    ['1', '2', '3', '0'] <12.000, 0.000>
300    ['1', '2', '4', '0'] <12.000, 0.000>
301    ['1', '3', '1', '1'] <0.000, 12.000>
302    ['1', '3', '2', '0'] <12.000, 0.000>
303
304ClassifierByDataTable will usually be used by :obj:`Orange.data.variable.Variable.get_value_from`. So, we
305would probably continue this by constructing a new feature and put the
306classifier into its :obj:`Orange.data.variable.Variable.get_value_from`.
307
308    >>> y2 = Orange.data.variable.Discrete("y2", values = ["0", "1"])
309    >>> y2.get_value_from = abe
310
311There's something disturbing here. Although abe determines the value of
312y2, abe.class_var is still y. Orange doesn't bother (the whole example
313is artificial - you will seldom pack the entire data set in an
314ClassifierByDataTable...), so shouldn't you. But still, for the sake
315of hygiene, you can conclude by
316
317    >>> abe.class_var = y2
318
319The whole story can be greatly simplified. LookupLearner can also be
320called differently than other learners. Besides instances, you can pass
321the new class variable and the features that should be used for
322classification. This saves us from constructing data_s and reassigning
323the class_var. It doesn't set the :obj:`Orange.data.variable.Variable.get_value_from`, though.
324
325part of :download:`lookup-table.py <code/lookup-table.py>` (uses: :download:`monks-1.tab <code/monks-1.tab>`)::
326
327    import Orange
328
329    table = Orange.data.Table("monks-1")
330    a, b, e = table.domain["a"], table.domain["b"], table.domain["e"]
331
332    y2 = Orange.data.variable.Discrete("y2", values = ["0", "1"])
333    abe2 = Orange.classification.lookup.LookupLearner(y2, [a, b, e], table)
334
335Let us, for the end, show another use of LookupLearner. With the
336alternative call arguments, it offers an easy way to observe feature
337interactions. For this purpose, we shall omit e, and construct a
338ClassifierByDataTable from a and b only (part of :download:`lookup-table.py <code/lookup-table.py>`; uses: :download:`monks-1.tab <code/monks-1.tab>`):
339
340.. literalinclude:: code/lookup-table.py
341    :lines: 32-35
342
343The script's output show how the classes are distributed for different
344values of a and b::
345
346    ['1', '1', '1'] <0.000, 48.000>
347    ['1', '2', '0'] <36.000, 12.000>
348    ['1', '3', '0'] <36.000, 12.000>
349    ['2', '1', '0'] <36.000, 12.000>
350    ['2', '2', '1'] <0.000, 48.000>
351    ['2', '3', '0'] <36.000, 12.000>
352    ['3', '1', '0'] <36.000, 12.000>
353    ['3', '2', '0'] <36.000, 12.000>
354    ['3', '3', '1'] <0.000, 48.000>
355
356For instance, when a is '1' and b is '3', the majority class is '0',
357and the class distribution is 36:12 in favor of '0'.
358
359
360Utility functions
361=================
362
363
364There are several functions for working with classifiers that use a stored
365data table for making predictions. There are four such classifiers; the most
366general stores an :class:`Orange.data.Table` and the other three are
367specialized and optimized for cases where the domain contains only one, two or
368three features (besides the class variable).
369
370.. function:: lookup_from_bound(classVar, bound)
371
372    This function constructs an appropriate lookup classifier for one, two or
373    three features. If there are more, it returns None. The resulting
374    classifier is of type :obj:`ClassifierByLookupTable`,
375    :obj:`ClassifierByLookupTable2` or :obj:`ClassifierByLookupTable3`, with
376    classVar and bound set set as given.
377
378    If, for instance, table contains a data set Monk 1 and you would like to
379    construct a new feature from features a and b, you can call this function
380    as follows.
381   
382        >>> newvar = Orange.data.variable.Discrete()
383        >>> bound = [table.domain[name] for name in ["a", "b"]]
384        >>> lookup = lookup_from_bound(newvar, bound)
385        >>> print lookup.lookup_table
386        <?, ?, ?, ?, ?, ?, ?, ?, ?>
387
388    Function lookup_from_bound does not initialize neither newVar nor
389    the lookup table...
390
391.. function:: lookup_from_function(classVar, bound, function)
392
393    ... and that's exactly where lookup_from_function differs from
394    :obj:`lookup_from_bound`. lookup_from_function first calls
395    lookup_from_bound and then uses the function to initialize the lookup
396    table. The other difference between this and the previous function is that
397    lookup_from_function also accepts bound sets with more than three
398    features. In this case, it construct a :obj:`ClassifierByDataTable`.
399
400    The function gets the values of features as integer indices and should
401    return an integer index of the "class value". The class value must be
402    properly initialized.
403
404    For exercise, let us construct a new feature called a=b whose value will
405    be "yes" when a and b are equal and "no" when they are not. We will then
406    add the feature to the data set.
407   
408        >>> bound = [table.domain[name] for name in ["a", "b"]]
409        >>> newVar = Orange.data.variable.Discrete("a=b", values=["no", "yes"])
410        >>> lookup = lookup_from_function(newVar, bound, lambda x: x[0] == x[1])
411        >>> newVar.get_value_from = lookup
412        >>> import orngCI
413        >>> table2 = orngCI.addAnAttribute(newVar, table)
414        >>> for i in table2[:30]:
415            ... print i
416        ['1', '1', '1', '1', '1', '1', 'yes', '1']
417        ['1', '1', '1', '1', '1', '2', 'yes', '1']
418        ['1', '1', '1', '1', '2', '1', 'yes', '1']
419        ['1', '1', '1', '1', '2', '2', 'yes', '1']
420        ...
421        ['2', '1', '2', '3', '4', '1', 'no', '0']
422        ['2', '1', '2', '3', '4', '2', 'no', '0']
423        ['2', '2', '1', '1', '1', '1', 'yes', '1']
424        ['2', '2', '1', '1', '1', '2', 'yes', '1']
425        ...
426
427    The feature was inserted with use of orngCI.addAnAttribute. By setting
428    newVar.get_value_from to lookup we state that when converting domains
429    (either when needed by addAnAttribute or at some other place), lookup
430    should be used to compute newVar's value. (A bit off topic, but
431    important: you should never call :obj:`Orange.data.variable.Variable.get_value_from` directly, but always call
432    it through computeValue.)
433
434.. function:: lookup_from_data(examples [, weight])
435
436    This function takes a set of examples (e.g. :obj:`Orange.data.Table`)
437    and turns it into a classifier. If there are one, two or three features and
438    no ambiguous examples (examples are ambiguous if they have same values of
439    features but with different class values), it will construct an appropriate
440    :obj:`ClassifierByLookupTable`. Otherwise, it will return an
441    :obj:`ClassifierByDataTable`.
442   
443        >>> lookup = lookup_from_data(table)
444        >>> test_instance = Orange.data.Instance(table.domain, ['3', '2', '2', '3', '4', '1', '?'])
445        >>> lookup(test_instance)
446        <orange.Value 'y'='0'>
447   
448.. function:: dump_lookup_function(func)
449
450    dump_lookup_function returns a string with a lookup function in
451    tab-delimited format. Argument func can be any of the above-mentioned
452    classifiers or a feature whose :obj:`Orange.data.variable.Variable.get_value_from` points to one of such
453    classifiers.
454
455    For instance, if lookup is such as constructed in the example for
456    lookup_from_function, you can print it out by::
457   
458        >>> print dump_lookup_function(lookup)
459        a      b      a=b
460        ------ ------ ------
461        1      1      yes
462        1      2      no
463        1      3      no
464        2      1      no
465        2      2      yes
466        2      3      no
467        3      1      no
468        3      2      no
469        3      3      yes
470
471"""
472
473import Orange.data
474from Orange.core import \
475        LookupLearner, \
476         ClassifierByLookupTable, \
477              ClassifierByLookupTable1, \
478              ClassifierByLookupTable2, \
479              ClassifierByLookupTable3, \
480              ClassifierByExampleTable as ClassifierByDataTable
481
482
483def lookup_from_bound(attribute, bound):
484    if not len(bound):
485        raise TypeError, "no bound attributes"
486    elif len(bound) <= 3:
487        return [ClassifierByLookupTable, ClassifierByLookupTable2,
488                ClassifierByLookupTable3][len(bound) - 1](attribute, *list(bound))
489    else:
490        return None
491
492   
493def lookup_from_function(attribute, bound, function):
494    """Constructs ClassifierByDataTable or ClassifierByLookupTable
495    mirroring the given function
496   
497    """
498    lookup = lookup_from_bound(attribute, bound)
499    if lookup:
500        lookup.lookup_table = [Orange.data.Value(attribute, function(attributes))
501                              for attributes in Orange.misc.counters.LimitedCounter(
502                                  [len(attr.values) for attr in bound])]
503        return lookup
504    else:
505        examples = Orange.data.Table(Orange.data.Domain(bound, attribute))
506        for attributes in Orange.misc.counters.LimitedCounter([len(attr.values)
507                                                   for attr in dom.attributes]):
508            examples.append(Orange.data.Example(dom, attributes +
509                                                [function(attributes)]))
510        return LookupLearner(examples)
511     
512
513def lookup_from_data(examples, weight=0, learnerForUnknown=None):
514    if len(examples.domain.attributes) <= 3:
515        lookup = lookup_from_bound(examples.domain.class_var,
516                                 examples.domain.attributes)
517        lookup_table = lookup.lookup_table
518        for example in examples:
519            ind = lookup.getindex(example)
520            if not lookup_table[ind].isSpecial() and (lookup_table[ind] !=
521                                                     example.getclass()):
522                break
523            lookup_table[ind] = example.getclass()
524        else:
525            return lookup
526
527        # there are ambiguities; a backup plan is
528        # ClassifierByDataTable, let it deal with them
529        return LookupLearner(examples, weight,
530                             learnerForUnknown=learnerForUnknown)
531
532    else:
533        return LookupLearner(examples, weight,
534                             learnerForUnknown=learnerForUnknown)
535       
536       
537def dump_lookup_function(func):
538    if isinstance(func, Orange.data.variable.Variable):
539        if not func.get_value_from:
540            raise TypeError, "attribute '%s' does not have an associated function" % func.name
541        else:
542            func = func.get_value_from
543
544    outp = ""
545    if isinstance(func, ClassifierByDataTable):
546    # XXX This needs some polishing :-)
547        for i in func.sorted_examples:
548            outp += "%s\n" % i
549    else:
550        boundset = func.boundset()
551        for a in boundset:
552            outp += "%s\t" % a.name
553        outp += "%s\n" % func.class_var.name
554        outp += "------\t" * (len(boundset)+1) + "\n"
555       
556        lc = 0
557        if len(boundset)==1:
558            cnt = Orange.misc.counters.LimitedCounter([len(x.values)+1 for x in boundset])
559        else:
560            cnt = Orange.misc.counters.LimitedCounter([len(x.values) for x in boundset])
561        for ex in cnt:
562            for i in range(len(ex)):
563                if ex[i] < len(boundset[i].values):
564                    outp += "%s\t" % boundset[i].values[ex[i]]
565                else:
566                    outp += "?\t",
567            outp += "%s\n" % func.class_var.values[int(func.lookup_table[lc])]
568            lc += 1
569    return outp
Note: See TracBrowser for help on using the repository browser.