source: orange/Orange/feature/discretization.py @ 9671:a7b056375472

1"""
2###################################
3Discretization (``discretization``)
4###################################
5
6.. index:: discretization
7
8.. index::
9   single: feature; discretization
10
11
Example-based automatic discretization is in essence similar to learning:
given a set of examples, a discretization method proposes a list of suitable
intervals to cut the attribute's values into. For this reason, Orange
structures for discretization resemble its structures for learning. Objects
derived from ``orange.Discretization`` play the role of a "learner" that,
upon observing the examples, constructs an ``orange.Discretizer`` whose role
is to convert continuous values into discrete ones according to the rule
found by the ``Discretization``.

Orange supports several methods of discretization; here is a
list of the methods and their corresponding classes.

* Equi-distant discretization (:class:`EquiDistDiscretization`,
  :class:`EquiDistDiscretizer`). The range of the attribute's values is split
  into a prescribed number of equal-sized intervals.
* Quantile-based discretization (:class:`EquiNDiscretization`,
  :class:`IntervalDiscretizer`). The range is split into intervals
  containing equal numbers of examples.
* Entropy-based discretization (:class:`EntropyDiscretization`,
  :class:`IntervalDiscretizer`). Developed by Fayyad and Irani, this method
  balances between the entropy of the intervals and the MDL of the
  discretization.
* Bi-modal discretization (:class:`BiModalDiscretization`,
  :class:`BiModalDiscretizer`/:class:`IntervalDiscretizer`).
  Two cut-off points are set to maximize the difference between the
  distribution in the middle interval and the distributions outside it.
* Fixed discretization (:class:`IntervalDiscretizer`). Discretization with
  user-prescribed cut-off points.

Instances of classes derived from :class:`Discretization` define a single
method: the call operator. The object can also be called through the
constructor.

.. class:: Discretization

    .. method:: __call__(attribute, examples[, weightID])

        Given a continuous ``attribute``, ``examples`` and, optionally, the
        id of the attribute with example weights, this function returns a
        discretized attribute. Argument ``attribute`` can be a descriptor,
        index or name of the attribute.

Here's an example. Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 7-15

The discretized attribute ``sep_w`` is constructed with a call to
:class:`EntropyDiscretization` (instead of constructing it and calling
it afterwards, we passed the arguments for calling to the constructor, as
is often allowed in Orange). We then constructed a new
:class:`Orange.data.Table` with attributes "sepal width" (the original
continuous attribute), ``sep_w`` and the class attribute. Script output is::

    Entropy discretization, first 10 examples
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']
    [3.9, '>3.30', 'Iris-setosa']
    [3.4, '>3.30', 'Iris-setosa']
    [3.4, '>3.30', 'Iris-setosa']
    [2.9, '<=2.90', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']

:class:`EntropyDiscretization` named the new attribute's values after the
interval ranges (it also named the attribute "D_sepal width"). The new
attribute's values get computed automatically when they are needed.

As readers familiar with :class:`Orange.data.variable.Variable` will know,
the answer to "how does this work?" is hidden in the field
:obj:`~Orange.data.variable.Variable.get_value_from`.
The following interactive session reveals the secret.

::

    >>> sep_w
    EnumVariable 'D_sepal width'
    >>> sep_w.get_value_from
    <ClassifierFromVar instance at 0x01BA7DC0>
    >>> sep_w.get_value_from.whichVar
    FloatVariable 'sepal width'
    >>> sep_w.get_value_from.transformer
    <IntervalDiscretizer instance at 0x01BA2100>
    >>> sep_w.get_value_from.transformer.points
    <2.90000009537, 3.29999995232>

So, the ``select`` statement in the above example converted all examples
from ``data`` to the new domain. Since the new domain includes the attribute
``sep_w`` that is not present in the original, ``sep_w``'s values are
computed on the fly. For each example in ``data``, ``sep_w.get_value_from``
is called to compute ``sep_w``'s value (if you ever need this conversion
yourself, you should not call ``get_value_from`` directly but call
``compute_value`` instead). ``sep_w.get_value_from`` looks up the value of
"sepal width" in the original example. The original, continuous sepal width
is passed to the ``transformer``, which determines the interval from its
field ``points``. The transformer returns the discrete value, which is in
turn returned by ``get_value_from`` and stored in the new example.

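
To make this concrete, here is a minimal sketch (assuming ``data`` and
``sep_w`` from the example above) that performs the same conversion by hand;
as noted, ``compute_value`` is the entry point to use rather than calling
``get_value_from`` directly::

    # For each of the first three examples, read the original continuous
    # value and the discrete value produced by the IntervalDiscretizer
    # stored in sep_w.get_value_from.transformer.
    for example in data[:3]:
        print example["sepal width"], sep_w.compute_value(example)

This is exactly what the domain conversion does for every example when the
new table is built.
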
You don't need to understand this mechanism exactly. It's important to know
that there are two classes of objects for discretization. Those derived from
:obj:`Discretizer` (such as the :obj:`IntervalDiscretizer` that we've seen
above) are used as transformers that translate continuous values into
discrete ones. Discretization algorithms are derived from
:obj:`Discretization`. Their job is to construct a :obj:`Discretizer` and
return a new variable with the discretizer stored in
``get_value_from.transformer``.

Discretizers
============

Different discretizers support different methods for conversion of
continuous values into discrete. The most general is
:class:`IntervalDiscretizer`, which is also used by most discretization
methods. Two other discretizers, :class:`EquiDistDiscretizer` and
:class:`ThresholdDiscretizer`, could easily be replaced by
:class:`IntervalDiscretizer` but are used for speed and simplicity.
The fourth discretizer, :class:`BiModalDiscretizer`, is specialized
for discretizations induced by :class:`BiModalDiscretization`.

.. class:: Discretizer

    All discretizers support a handy method for construction of a new
    attribute from an existing one.

    .. method:: construct_variable(attribute)

        Constructs a new attribute descriptor; the new attribute is the
        discretized ``attribute``. The new attribute's name equals
        ``attribute.name`` prefixed by "D\_", and its symbolic values are
        discretizer specific. The above example shows what comes out of
        :class:`IntervalDiscretizer`. Discretization algorithms actually
        first construct a discretizer and then call its
        ``construct_variable`` to construct an attribute descriptor.

.. class:: IntervalDiscretizer

    The most common discretizer.

    .. attribute:: points

        Cut-off points. All values below or equal to the first point belong
        to the first interval, those between the first and the second
        (including those equal to the second) go to the second interval and
        so forth to the last interval, which covers all values greater than
        the last element in ``points``. The number of intervals is thus
        ``len(points)+1``.

Let us manually construct an interval discretizer with cut-off points at 3.0
and 5.0. We shall use the discretizer to construct a discretized sepal length
(part of :download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 22-26

That's all. The first five examples of ``data2`` are now

::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

Can you use the same discretizer for more than one attribute? Yes, as long
as they share the same cut-off points, of course. Simply call
``construct_variable`` for each continuous attribute (part of
:download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each attribute now has its own :obj:`ClassifierFromVar` in its
``get_value_from``, but all of them use the same :class:`IntervalDiscretizer`,
``idisc``. Changing an element of its ``points`` affects all attributes.

Do not change the length of :obj:`~IntervalDiscretizer.points` if the
discretizer is used by any attribute. The length of
:obj:`~IntervalDiscretizer.points` should always match the number of values
of the attribute, which is determined by the length of the attribute's field
``values``. Therefore, if ``attr`` is a discretized
attribute, then ``len(attr.values)`` must equal
``len(attr.get_value_from.transformer.points)+1``. It always
does, unless you deliberately change it. If the sizes don't match,
Orange will probably crash, and it will be entirely your fault.
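
To make the sharing concrete, here is a minimal sketch (assuming ``idisc``
and the attributes constructed with it in the example above; the new cut-off
value is arbitrary)::

    # Moving an existing cut-off point is safe and affects every attribute
    # that shares idisc; changing the number of points would violate the
    # invariant len(attr.values) == len(attr.get_value_from.transformer.points) + 1
    idisc.points[0] = 3.2

After this assignment, all attributes discretized through ``idisc`` use the
new boundary the next time their values are computed.
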

.. class:: EquiDistDiscretizer

    More rigid than :obj:`IntervalDiscretizer`:
    it uses intervals of fixed width.

    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: number_of_intervals

        Number of intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute although it behaves
        as one. Reading it constructs a list of cut-off points and returns it,
        but changing the list doesn't affect the discretizer - it's a separate
        list. This attribute is here only to give the
        :obj:`EquiDistDiscretizer` the same interface as that of
        :obj:`IntervalDiscretizer`.


All values below :obj:`~EquiDistDiscretizer.first_cut` belong to the first
interval (including values smaller than the smallest value encountered in
the data). Otherwise, the interval of a value ``val`` is
``floor((val - first_cut) / step) + 1``. If this turns out to be greater than
or equal to :obj:`~EquiDistDiscretizer.number_of_intervals`, it is decreased
to ``number_of_intervals - 1``.
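
As a small worked illustration of this mapping (purely illustrative; the
real conversion is done by the discretizer itself), using the
"D_sepal length" parameters computed in the equi-distant example further
below::

    from math import floor

    def equidist_interval(val, first_cut=4.9, step=0.6, n=6):
        # interval 0 holds everything below first_cut; the rest follow
        # the formula above, clamped to the last interval
        if val < first_cut:
            return 0
        return min(int(floor((val - first_cut) / step)) + 1, n - 1)

    print equidist_interval(5.0)   # 1, i.e. the interval [4.90, 5.50)
    print equidist_interval(7.9)   # 5, i.e. the interval >7.30
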

This discretizer is returned by :class:`EquiDistDiscretization`; you can
see an example in the corresponding section. You can also construct it
manually and call its ``construct_variable``, just as shown for the
:obj:`IntervalDiscretizer`.

.. class:: ThresholdDiscretizer

    Threshold discretizer converts continuous values into binary ones by
    comparing them with a threshold. This discretizer is actually not used
    by any discretization method, but you can use it for manual
    discretization. Orange needs this discretizer for binarization of
    continuous attributes in decision trees.

    .. attribute:: threshold

        Threshold; values below or equal to the threshold belong to the first
        interval and those that are greater go to the second.

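
Since no discretization method returns it, here is a minimal sketch of
manual use (assuming the iris data; the attribute and the threshold of 3.0
are chosen arbitrarily for illustration)::

    import Orange

    data = Orange.data.Table("iris")
    tdisc = Orange.feature.discretization.ThresholdDiscretizer()
    tdisc.threshold = 3.0
    # construct_variable builds a binary "D_sepal width" attribute whose
    # value tells whether the original sepal width is <= 3.0 or > 3.0
    sep_w_bin = tdisc.construct_variable(data.domain["sepal width"])
    data_bin = data.select([data.domain["sepal width"], sep_w_bin,
                            data.domain.classVar])

The same pattern works for any of the discretizers described in this section.
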
.. class:: BiModalDiscretizer

    This is the only discretizer listed here that could not be replaced by an
    :class:`IntervalDiscretizer`. It has two cut-off points and values are
    discretized according to whether they belong to the middle region
    (which includes the lower but not the upper boundary) or not. The
    discretizer is returned by :class:`BiModalDiscretization` if its
    field :obj:`~BiModalDiscretization.split_in_two` is true (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).

Discretization Algorithms
=========================

.. class:: EquiDistDiscretization

    Discretizes the attribute by cutting it into the prescribed number
    of intervals of equal width. The examples are needed to determine the
    span of attribute values. The interval between the smallest and the
    largest is then cut into equal parts.

    .. attribute:: number_of_intervals

        Number of intervals into which the attribute is to be discretized.
        Default value is 4.

For an example, we shall discretize all attributes of the Iris dataset into 6
intervals. We shall construct an :class:`Orange.data.Table` with discretized
attributes and print the descriptions of the attributes (part
of :download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 38-43

The script's output is

::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

Is there a more decent way for a script to find the interval boundaries than
by parsing the symbolic values? Sure, they are hidden in the discretizer,
which is, as usual, stored in ``attr.get_value_from.transformer``.

Compare the following with the values above.

::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400

Like all discretizers, :class:`EquiDistDiscretizer` also has the method
``construct_variable`` (part of :download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 69-73

.. class:: EquiNDiscretization

    Discretization with intervals containing (approximately) equal numbers
    of examples.

    Discretizes the attribute by cutting it into the prescribed number of
    intervals so that each of them contains an equal number of examples. The
    examples are obviously needed for this discretization, too.

    .. attribute:: number_of_intervals

        Number of intervals into which the attribute is to be discretized.
        Default value is 4.

The use of this discretization is the same as the use of
:class:`EquiDistDiscretization`. The resulting discretizer is an
:class:`IntervalDiscretizer`, hence it has ``points`` instead of
``first_cut``/``step``/``number_of_intervals``.

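
For instance, a minimal sketch assuming the iris data; since the resulting
discretizer is an :class:`IntervalDiscretizer`, the cut-offs are read from
its ``points``::

    import Orange

    data = Orange.data.Table("iris")
    # called through the constructor, just like EntropyDiscretization in the
    # first example above; numberOfIntervals is the spelling used elsewhere
    # in this module
    pet_l = Orange.feature.discretization.EquiNDiscretization(
        "petal length", data, numberOfIntervals=3)
    print pet_l.get_value_from.transformer.points
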
.. class:: EntropyDiscretization

    Entropy-based Discretization (Fayyad-Irani).

    Fayyad-Irani's discretization method works without a predefined number of
    intervals. Instead, it recursively splits intervals at the cut-off point
    that minimizes the entropy, until the entropy decrease is smaller than the
    increase of MDL induced by the new point.

    An interesting thing about this discretization technique is that an
    attribute can be discretized into a single interval, if no suitable
    cut-off points are found. If this is the case, the attribute is rendered
    useless and can be removed. This discretization can therefore also serve
    for feature subset selection.

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when
        its information gain is lower than MDL (default: false).

Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

.. class:: BiModalDiscretization

    Bi-modal discretization.

    Sets two cut-off points so that the class distribution of the examples
    between them is as different from the overall distribution as possible.
    The difference is measured by the chi-square statistic. All possible
    cut-off points are tried, thus the discretization runs in O(n^2).

    This discretization method is especially suitable for attributes in
    which the middle region corresponds to normal and the outer regions to
    abnormal values of the attribute. Depending on the nature of the
    attribute, we can treat the lower and higher values separately, thus
    discretizing the attribute into three intervals, or together, into a
    binary attribute whose values correspond to normal and abnormal.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two
        values. If true (the default), the attribute has three values and the
        discretizer is of type :class:`BiModalDiscretizer`. If false, the
        result is an ordinary :class:`IntervalDiscretizer`.

The Iris dataset has a three-valued class attribute; the classes are setosa,
virginica and versicolor. As the picture below shows, the sepal lengths of
versicolors lie between those of setosas and virginicas (the picture itself
is drawn using LOESS probability estimation).

.. image:: files/bayes-iris.gif

If we merge classes setosa and virginica into one, we can observe whether
the bi-modal discretization would correctly recognize the interval in
which versicolors dominate.

.. literalinclude:: code/discretization.py
    :lines: 84-87

In this script, we have constructed a new class attribute which tells whether
an iris is versicolor or not. We have specified how this attribute's value is
computed from the original class value with a simple lambda function.
Finally, we have constructed a new domain and converted the examples.
Now for discretization.

.. literalinclude:: code/discretization.py
    :lines: 97-100

The script prints out the middle intervals::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points for "sepal length" make sense.

432====================
433
434Some functions and classes that can be used for
435categorization of continuous features. Besides several general classes that
436can help in this task, we also provide a function that may help in
437entropy-based discretization (Fayyad & Irani), and a wrapper around classes for
438categorization that can be used for learning.
439
.. autofunction:: Orange.feature.discretization.entropyDiscretization_wrapper

.. autoclass:: Orange.feature.discretization.EntropyDiscretization_wrapper

.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. rubric:: Example

FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in the
Orange for Beginners tutorial shows the use of ``DiscretizedLearner``. Other
discretization classes from core Orange are listed in the chapter on
`categorization <../ofb/o_categorization.htm>`_ of the same tutorial.

==========
References
==========

* UM Fayyad and KB Irani. Multi-interval discretization of continuous valued
  attributes for classification learning. In Proceedings of the 13th
  International Joint Conference on Artificial Intelligence, pages
  1022--1029, Chambery, France, 1993.

"""

import Orange.core as orange

from Orange.core import \
    Discrete2Continuous, \
    Discretizer, \
        BiModalDiscretizer, \
        EquiDistDiscretizer, \
        IntervalDiscretizer, \
        ThresholdDiscretizer, \
        EntropyDiscretization, \
        EquiDistDiscretization, \
        EquiNDiscretization, \
        BiModalDiscretization, \
        Discretization

######
# from orngDics.py
def entropyDiscretization_wrapper(table):
    """Take the class-labeled table (``table``) and categorize all continuous
    features using the entropy-based discretization
    :obj:`EntropyDiscretization`.

    :param table: data to discretize.
    :type table: Orange.data.Table
    :rtype: :obj:`Orange.data.Table` with all categorical and discretized\
    continuous features from the original data table.

    After categorization, features that were categorized into a single
    interval (i.e. to a constant value) carry no information and are removed
    from the returned table.
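
    Example (a minimal sketch, assuming the iris data set)::

        import Orange
        from Orange.feature.discretization import entropyDiscretization_wrapper

        data = Orange.data.Table("iris")
        disc_data = entropyDiscretization_wrapper(data)
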
495    """
496    orange.setrandseed(0)
497    tablen=orange.Preprocessor_discretize(table, method=EntropyDiscretization())
498   
499    attrlist=[]
500    nrem=0
501    for i in tablen.domain.attributes:
502        if (len(i.values)>1):
503            attrlist.append(i)
504        else:
505            nrem=nrem+1
506    attrlist.append(tablen.domain.classVar)
507    return tablen.select(attrlist)


class EntropyDiscretization_wrapper:
    """This is a simple wrapper class around the function
    :obj:`entropyDiscretization_wrapper`.

    :param data: data to discretize.
    :type data: Orange.data.Table

    Once invoked, it either creates an object that can be passed a data
    set for discretization or, if invoked with the data set, returns a
    discretized data set::

        discretizer = Orange.feature.discretization.EntropyDiscretization()
        disc_data = discretizer(table)
        another_disc_data = Orange.feature.discretization.EntropyDiscretization(table)

    """
    def __call__(self, data):
        return entropyDiscretization_wrapper(data)

def DiscretizedLearner(baseLearner, examples=None, weight=0, **kwds):
    learner = DiscretizedLearner_Class(baseLearner, **kwds)
    if examples:
        return learner(examples, weight)
    else:
        return learner


class DiscretizedLearner_Class:
    """This class allows one to set up a learner object such that the data
    passed to the learner is discretized before learning. In this way we can
    prepare an object that learns without giving it the data and, for
    instance, use it in some standard testing procedure that repeats
    learning/testing on several data samples.

    :param baseLearner: learner to which the discretized data is passed
    :type baseLearner: Orange.classification.Learner

    :param table: data whose continuous features need to be discretized
    :type table: Orange.data.Table

    :param discretizer: a discretizer that converts continuous values into
      discrete. Defaults to
      :obj:`Orange.feature.discretization.EntropyDiscretization`.
    :type discretizer: Orange.feature.discretization.Discretization

    :param name: name to assign to the learner
    :type name: string

    An example of how such a learner is set up and used in ten-fold cross
    validation is given below::

        from Orange.feature import discretization
        bayes = Orange.classification.bayes.NaiveBayesLearner()
        disc = orange.Preprocessor_discretize(method=discretization.EquiNDiscretization(numberOfIntervals=10))
        dBayes = discretization.DiscretizedLearner(bayes, name='disc bayes')
        dbayes2 = discretization.DiscretizedLearner(bayes, name="EquiNBayes", discretizer=disc)
        results = Orange.evaluation.testing.CrossValidation([dBayes], table)
        classifier = discretization.DiscretizedLearner(bayes, examples=table)

    """
    def __init__(self, baseLearner, discretizer=EntropyDiscretization(), **kwds):
        self.baseLearner = baseLearner
        if hasattr(baseLearner, "name"):
            self.name = baseLearner.name
        self.discretizer = discretizer
        self.__dict__.update(kwds)

    def __call__(self, data, weight=None):
        # discretize the data, then hand it to the base learner
        from Orange.preprocess import Preprocessor_discretize
        ddata = Preprocessor_discretize(data, method=self.discretizer)
        if weight is not None:
            model = self.baseLearner(ddata, weight)
        else:
            model = self.baseLearner(ddata)
        # wrap the model so that classification goes through the wrapper
        dcl = DiscretizedClassifier(classifier=model)
        if hasattr(model, "domain"):
            dcl.domain = model.domain
        if hasattr(model, "name"):
            dcl.name = model.name
        return dcl


class DiscretizedClassifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        return self.classifier(example, resultType)