source: orange/orange/Orange/feature/discretization.py @ 7770:e1dafa5e0037

1"""
2
3.. index:: discretization
4
5.. index::
6   single: feature; discretization
7
8
Example-based automatic discretization is in essence similar to learning:
given a set of examples, a discretization method proposes a list of suitable
intervals to cut the attribute's values into. For this reason, Orange's
structures for discretization resemble its structures for learning. Objects
derived from ``orange.Discretization`` play the role of a "learner" that,
upon observing the examples, constructs an ``orange.Discretizer`` whose role
is to convert continuous values into discrete ones according to the rule
found by the ``Discretization``.

Orange supports several methods of discretization; here is a
list of methods with the corresponding classes.

* Equi-distant discretization (:class:`EquiDistDiscretization`,
  :class:`EquiDistDiscretizer`). The range of the attribute's values is split
  into a prescribed number of equal-sized intervals.
* Quantile-based discretization (:class:`EquiNDiscretization`,
  :class:`IntervalDiscretizer`). The range is split into intervals
  containing an equal number of examples.
* Entropy-based discretization (:class:`EntropyDiscretization`,
  :class:`IntervalDiscretizer`). Developed by Fayyad and Irani,
  this method balances the entropy within intervals against the MDL of the
  discretization.
* Bi-modal discretization (:class:`BiModalDiscretization`,
  :class:`BiModalDiscretizer`/:class:`IntervalDiscretizer`).
  Two cut-off points are set so that the class distribution in the middle
  interval differs as much as possible from the distributions outside it.
* Fixed discretization (:class:`IntervalDiscretizer`). Discretization with
  user-prescribed cut-off points.

.. _discretization.py: code/discretization.py

Classes derived from :class:`Discretization` define a single method:
the call operator. The object can also be called through the constructor.

.. class:: Discretization

    .. method:: __call__(attribute, examples[, weightID])

        Given a continuous ``attribute``, ``examples`` and, optionally, the
        id of an attribute with example weights, this function returns a
        discretized attribute. Argument ``attribute`` can be a descriptor,
        index or name of the attribute.

Here's an example. Part of `discretization.py`_:

.. literalinclude:: code/discretization.py
    :lines: 7-15

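If you do not have the accompanying script at hand, the call pattern is
roughly the following (a sketch, not the literal contents of
`discretization.py`_; it assumes the iris dataset)::

    import Orange

    data = Orange.data.Table("iris")
    sep_w = Orange.feature.discretization.EntropyDiscretization("sepal width", data)
    data2 = data.select(["sepal width", sep_w, data.domain.classVar])
    print "Entropy discretization, first 10 examples"
    for ex in data2[:10]:
        print ex
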
The discretized attribute ``sep_w`` is constructed with a call to
:class:`EntropyDiscretization` (instead of constructing it and calling
it afterwards, we passed the calling arguments to the constructor, as
is often allowed in Orange). We then constructed a new
:class:`Orange.data.Table` with attributes "sepal width" (the original
continuous attribute), ``sep_w`` and the class attribute. Script output is::

    Entropy discretization, first 10 examples
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']
    [3.9, '>3.30', 'Iris-setosa']
    [3.4, '>3.30', 'Iris-setosa']
    [3.4, '>3.30', 'Iris-setosa']
    [2.9, '<=2.90', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']

:class:`EntropyDiscretization` named the new attribute's values by the
interval range (it also named the attribute "D_sepal width"). The new
attribute's values get computed automatically when they are needed.

As those who have read about :class:`Orange.data.variable.Variable` know,
the answer to "How does this work?" is hidden in the field
:obj:`~Orange.data.variable.Variable.get_value_from`.
This little dialog reveals the secret.

::

    >>> sep_w
    EnumVariable 'D_sepal width'
    >>> sep_w.get_value_from
    <ClassifierFromVar instance at 0x01BA7DC0>
    >>> sep_w.get_value_from.whichVar
    FloatVariable 'sepal width'
    >>> sep_w.get_value_from.transformer
    <IntervalDiscretizer instance at 0x01BA2100>
    >>> sep_w.get_value_from.transformer.points
    <2.90000009537, 3.29999995232>

So, the ``select`` statement in the above example converted all examples
from ``data`` to the new domain. Since the new domain includes the attribute
``sep_w`` that is not present in the original, ``sep_w``'s values are
computed on the fly. For each example in ``data``, ``sep_w.get_value_from``
is called to compute ``sep_w``'s value (if you ever need to compute such a
value yourself, you should not call ``get_value_from`` directly but call
``compute_value`` instead). ``sep_w.get_value_from`` looks up the value of
"sepal width" in the original example. The original, continuous sepal width
is passed to the ``transformer``, which determines the interval from its
field ``points``. The transformer returns the discrete value, which is in
turn returned by ``get_value_from`` and stored in the new example.

You don't need to understand this mechanism exactly. It is important to know
that there are two kinds of objects for discretization. Those derived from
:obj:`Discretizer` (such as :obj:`IntervalDiscretizer` that we've seen above)
are used as transformers that translate continuous values into discrete ones.
Discretization algorithms are derived from :obj:`Discretization`. Their
job is to construct a :obj:`Discretizer` and return a new variable
with the discretizer stored in ``get_value_from.transformer``.
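
In short, the division of labour can be restated in a few lines (a sketch;
``data`` is the iris dataset loaded above)::

    disc = Orange.feature.discretization.EntropyDiscretization()
    sep_w = disc("sepal width", data)                # Discretization builds the variable ...
    discretizer = sep_w.get_value_from.transformer   # ... and stores a Discretizer here
    print type(discretizer).__name__, discretizer.points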

Discretizers
============

Different discretizers support different methods for conversion of
continuous values into discrete ones. The most general is
:class:`IntervalDiscretizer`, which is also used by most discretization
methods. Two other discretizers, :class:`EquiDistDiscretizer` and
:class:`ThresholdDiscretizer`, could easily be replaced by
:class:`IntervalDiscretizer` but are used for speed and simplicity.
The fourth discretizer, :class:`BiModalDiscretizer`, is specialized
for discretizations induced by :class:`BiModalDiscretization`.

.. class:: Discretizer

    All discretizers support a handy method for constructing a new
    attribute from an existing one.

    .. method:: construct_variable(attribute)

        Constructs a new attribute descriptor; the new attribute is the
        discretized ``attribute``. The new attribute's name equals
        ``attribute.name`` prefixed by "D_", and its symbolic values are
        discretizer specific. The above example shows what comes out of
        :class:`IntervalDiscretizer`. Discretization algorithms actually
        first construct a discretizer and then call its
        ``construct_variable`` to construct an attribute descriptor.

.. class:: IntervalDiscretizer

    The most common discretizer.

    .. attribute:: points

        Cut-off points. All values below or equal to the first point belong
        to the first interval, those between the first and the second
        (including those equal to the second) go to the second interval and
        so forth to the last interval which covers all values greater than
        the last element in ``points``. The number of intervals is thus
        ``len(points)+1``.

Let us manually construct an interval discretizer with cut-off points at 3.0
and 5.0. We shall use the discretizer to construct a discretized sepal length
(part of `discretization.py`_):

.. literalinclude:: code/discretization.py
    :lines: 22-26

That's all. The first five examples of ``data2`` are now

::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

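For reference, the kind of code the included snippet contains can be sketched
as follows (names are illustrative; ``data`` again holds the iris dataset)::

    idisc = Orange.feature.discretization.IntervalDiscretizer(points=[3.0, 5.0])
    sep_l = idisc.construct_variable(data.domain["sepal length"])
    data2 = data.select(["sepal length", sep_l, data.domain.classVar])
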
Can you use the same discretizer for more than one attribute? Yes, as long
as they have the same cut-off points, of course. Simply call
``construct_variable`` for each continuous attribute
(part of `discretization.py`_):

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each attribute now has its own :obj:`ClassifierFromVar` in its
``get_value_from``, but all use the same :class:`IntervalDiscretizer`,
``idisc``. Changing an element of its ``points`` affects all attributes.

Do not change the length of :obj:`~IntervalDiscretizer.points` if the
discretizer is used by any attribute. The length of
:obj:`~IntervalDiscretizer.points` should always match the number of values
of the attribute, which is determined by the length of the attribute's field
``values``. Therefore, if ``attr`` is a discretized
attribute, then ``len(attr.values)`` must equal
``len(attr.get_value_from.transformer.points)+1``. It always
does, unless you deliberately change it. If the sizes don't match,
Orange will probably crash, and it will be entirely your fault.
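
A quick sanity check of this invariant (purely illustrative, using the
``sep_l`` constructed above)::

    transformer = sep_l.get_value_from.transformer
    assert len(sep_l.values) == len(transformer.points) + 1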


.. class:: EquiDistDiscretizer

    More rigid than :obj:`IntervalDiscretizer`:
    it uses intervals of fixed width.

    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: number_of_intervals

        Number of intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute although it behaves
        as one. Reading it constructs a list of cut-off points and returns it,
        but changing the list doesn't affect the discretizer - it's a separate
        list. This attribute is here only to give the
        :obj:`EquiDistDiscretizer` the same interface as that of
        :obj:`IntervalDiscretizer`.

All values below :obj:`~EquiDistDiscretizer.first_cut` belong to the first
interval (including values smaller than ``firstVal``, the lower end of the
attribute's value range, which equals ``first_cut - step``). Otherwise, the
interval of value ``val`` is ``floor((val-firstVal)/step)``. If this turns
out to be greater than or equal to
:obj:`~EquiDistDiscretizer.number_of_intervals`, it is decreased to
``number_of_intervals-1``.

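The mapping from a continuous value to an interval index can be sketched in
plain Python as follows. This is only an illustration, reconstructed from the
interval boundaries shown in the :class:`EquiDistDiscretization` example
below; the actual computation is done in C++ and values lying exactly on a
cut-off point may be treated differently there::

    import math

    def equidist_interval(val, first_cut, step, number_of_intervals):
        # everything below the first cut-off falls into interval 0
        if val < first_cut:
            return 0
        idx = int(math.floor((val - first_cut) / step)) + 1
        # clamp to the last interval
        return min(idx, number_of_intervals - 1)
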
This discretizer is returned by :class:`EquiDistDiscretization`; you can
see an example in the corresponding section. You can also construct it
manually and call its ``construct_variable``, just as shown for the
:obj:`IntervalDiscretizer`.


.. class:: ThresholdDiscretizer

    The threshold discretizer converts continuous values into binary ones by
    comparing them with a threshold. This discretizer is actually not used by
    any discretization method, but you can use it for manual discretization.
    Orange needs this discretizer for binarization of continuous attributes
    in decision trees.

    .. attribute:: threshold

        Threshold; values below or equal to the threshold belong to the first
        interval and those that are greater go to the second.

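Manual binarization with this discretizer can be sketched like this
(illustrative; ``data`` holds the iris dataset)::

    tdisc = Orange.feature.discretization.ThresholdDiscretizer(threshold=3.0)
    sep_w_bin = tdisc.construct_variable(data.domain["sepal width"])
    print sep_w_bin.name, sep_w_bin.values
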
.. class:: BiModalDiscretizer

    This is the first discretizer in this list that could not be replaced by
    an :class:`IntervalDiscretizer`. It has two cut-off points and values are
    discretized according to whether they belong to the middle region
    (which includes the lower but not the upper boundary) or not. The
    discretizer is returned by :class:`BiModalDiscretization` if its
    field :obj:`~BiModalDiscretization.split_in_two` is true (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).


Discretization Algorithms
=========================

.. class:: EquiDistDiscretization

    Discretizes the attribute by cutting it into the prescribed number
    of intervals of equal width. The examples are needed to determine the
    span of attribute values. The interval between the smallest and the
    largest is then cut into equal parts.

    .. attribute:: number_of_intervals

        Number of intervals into which the attribute is to be discretized.
        Default value is 4.

For an example, we shall discretize all attributes of the Iris dataset into 6
intervals. We shall construct an :class:`Orange.data.Table` with discretized
attributes and print descriptions of the attributes (part
of `discretization.py`_):

.. literalinclude:: code/discretization.py
    :lines: 38-43

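The included snippet does roughly the following (a sketch; ``data`` holds the
iris dataset)::

    newattrs = [Orange.feature.discretization.EquiDistDiscretization(attr, data, numberOfIntervals=6)
                for attr in data.domain.attributes]
    data2 = data.select(newattrs + [data.domain.classVar])
    for attr in newattrs:
        print "%s: %s" % (attr.name, attr.values)
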
The script's output is

::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

Is there a better way for a script to find the interval boundaries than
by parsing the symbolic values? Sure, they are hidden in the discretizer,
which is, as usual, stored in ``attr.get_value_from.transformer``.

Compare the following with the values above.

::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400

Like all discretizers, :class:`EquiDistDiscretizer` also has the method
``construct_variable`` (part of `discretization.py`_):

.. literalinclude:: code/discretization.py
    :lines: 69-73


.. class:: EquiNDiscretization

    Discretization into intervals containing (approximately) equal numbers
    of examples.

    Discretizes the attribute by cutting it into the prescribed number of
    intervals so that each of them contains an equal number of examples. The
    examples are obviously needed for this discretization, too.

    .. attribute:: number_of_intervals

        Number of intervals into which the attribute is to be discretized.
        Default value is 4.

The use of this discretization is the same as the use of
:class:`EquiDistDiscretization`. The resulting discretizer is an
:class:`IntervalDiscretizer`, hence it has ``points`` instead of
``first_cut``/``step``/``number_of_intervals``.

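For illustration, usage might look like this (a sketch; ``data`` holds the
iris dataset)::

    sep_w_q = Orange.feature.discretization.EquiNDiscretization("sepal width", data, numberOfIntervals=4)
    print sep_w_q.get_value_from.transformer.points
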
.. class:: EntropyDiscretization

    Entropy-based discretization (Fayyad-Irani).

    Fayyad-Irani's discretization method works without a predefined number of
    intervals. Instead, it recursively splits intervals at the cut-off point
    that minimizes the entropy, until the entropy decrease is smaller than the
    increase of MDL induced by the new point.

    An interesting property of this discretization technique is that an
    attribute can be discretized into a single interval if no suitable
    cut-off points are found. In that case the attribute is rendered
    useless and can be removed. This discretization can therefore also serve
    for feature subset selection.

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when
        its information gain is lower than the MDL cost (default: false).

Part of `discretization.py`_:

.. literalinclude:: code/discretization.py
    :lines: 77-80

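The included snippet essentially loops over the attributes and prints the
cut-off points found for each; a sketch of such a loop (illustrative;
``data`` holds the iris dataset)::

    for attr in data.domain.attributes:
        disc_attr = Orange.feature.discretization.EntropyDiscretization(attr, data)
        print "%s: %s" % (attr.name, disc_attr.get_value_from.transformer.points)
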
The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

.. class:: BiModalDiscretization

    Bi-modal discretization.

    Sets two cut-off points so that the class distribution of the examples in
    between is as different from the overall distribution as possible. The
    difference is measured by the chi-square statistic. All possible cut-off
    points are tried, so the discretization runs in O(n^2).

    This discretization method is especially suitable for attributes in
    which the middle region corresponds to normal and the outer regions to
    abnormal values of the attribute. Depending on the nature of the
    attribute, we can treat the lower and higher values separately, thus
    discretizing the attribute into three intervals, or together, yielding a
    binary attribute whose values correspond to normal and abnormal.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two
        values. If true (the default), there are three intervals and the
        discretizer is of type :class:`BiModalDiscretizer`. If false, the
        result is an ordinary :class:`IntervalDiscretizer`.

The Iris dataset has a three-valued class attribute; the classes are setosa,
virginica and versicolor. As the picture below shows, sepal lengths of
versicolors lie between the lengths of setosas and virginicas (the picture
itself is drawn using LOESS probability estimation).

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica into one, we can observe whether
bi-modal discretization correctly recognizes the interval in which
versicolors dominate.

.. literalinclude:: code/discretization.py
    :lines: 84-87

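The relabelling of the class can be sketched roughly as follows (an
approximation of the included code, not a verbatim copy; names are
illustrative)::

    versicolor = Orange.data.variable.Discrete("is_versicolor", values=["no", "yes"])
    versicolor.get_value_from = lambda example, return_what: versicolor(
        "yes" if example["iris"] == "Iris-versicolor" else "no")
    new_domain = Orange.data.Domain(data.domain.attributes, versicolor)
    data_v = Orange.data.Table(new_domain, data)
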
In this script, we have constructed a new class attribute which tells whether
an iris is versicolor or not. We have specified how this attribute's value is
computed from the original class value with a simple lambda function.
Finally, we have constructed a new domain and converted the examples.
Now for the discretization.

.. literalinclude:: code/discretization.py
    :lines: 97-100

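The loop printing the middle intervals could be sketched as follows
(illustrative; ``data_v`` is the relabelled table from above)::

    for attr in data_v.domain.attributes:
        disc_attr = Orange.feature.discretization.BiModalDiscretization(attr, data_v)
        transformer = disc_attr.get_value_from.transformer
        print "%s: (%.3f, %.3f]" % (attr.name, transformer.low, transformer.high)
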
The script prints out the middle intervals::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points for "sepal length" make sense.

Additional functions
====================

This module also provides some functions and classes that can be used for
categorization of continuous features. Besides several general classes that
can help in this task, we also provide a function that may help in
entropy-based discretization (Fayyad & Irani) and a wrapper around classes
for categorization that can be used for learning.

.. automethod:: Orange.feature.discretization.entropyDiscretization_wrapper

.. autoclass:: Orange.feature.discretization.EntropyDiscretization_wrapper

.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. rubric:: Example

A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in the Orange
for Beginners tutorial shows the use of DiscretizedLearner. Other
discretization classes from core Orange are listed in the chapter on
`categorization <../ofb/o_categorization.htm>`_ of the same tutorial.

==========
References
==========

* U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous
  valued attributes for classification learning. In Proceedings of the 13th
  International Joint Conference on Artificial Intelligence, pages
  1022-1029, Chambery, France, 1993.

"""

import Orange.core as orange

from Orange.core import \
    Discrete2Continuous, \
    Discretizer, \
    BiModalDiscretizer, \
    EquiDistDiscretizer, \
    IntervalDiscretizer, \
    ThresholdDiscretizer, \
    EntropyDiscretization, \
    EquiDistDiscretization, \
    EquiNDiscretization, \
    BiModalDiscretization, \
    Discretization

######
# from orngDisc.py
def entropyDiscretization_wrapper(table):
    """Take the class-labelled data set ``table`` and categorize all
    continuous features using entropy-based discretization
    (:obj:`EntropyDiscretization`).

    :param table: data to discretize.
    :type table: Orange.data.Table
    :rtype: :obj:`Orange.data.Table` containing all categorical features and
        those discretized continuous features from the original data table
        that span more than a single interval.

    Features that end up categorized into a single interval (i.e. reduced to
    a constant) carry no information and are left out of the returned table.
    """
    orange.setrandseed(0)
    tablen = orange.Preprocessor_discretize(table, method=EntropyDiscretization())

    # keep only the features that were discretized into more than one interval
    attrlist = []
    for i in tablen.domain.attributes:
        if len(i.values) > 1:
            attrlist.append(i)
    attrlist.append(tablen.domain.classVar)
    return tablen.select(attrlist)


class EntropyDiscretization_wrapper:
    """A simple wrapper class around the function
    :obj:`entropyDiscretization_wrapper`.

    :param data: data to discretize.
    :type data: Orange.data.Table

    Invoked without data, it creates an object that can later be passed a
    data set for discretization; invoked with a data set, it returns the
    discretized data set::

        discretizer = Orange.feature.discretization.EntropyDiscretization()
        disc_data = discretizer(table)
        another_disc_data = Orange.feature.discretization.EntropyDiscretization(table)

    """
    def __call__(self, data):
        return entropyDiscretization_wrapper(data)

def DiscretizedLearner(baseLearner, examples=None, weight=0, **kwds):
    learner = DiscretizedLearner_Class(baseLearner, **kwds)
    if examples:
        return learner(examples, weight)
    else:
        return learner

class DiscretizedLearner_Class:
    """This class wraps a learner object so that the data passed to the
    learner is discretized before learning. In this way we can prepare an
    object that learns without giving it the data and, for instance, use it
    in a standard testing procedure that repeats learning/testing on several
    data samples.

    :param baseLearner: learner that is given the discretized data
    :type baseLearner: Orange.classification.Learner

    :param table: data whose continuous features need to be discretized
    :type table: Orange.data.Table

    :param discretizer: a discretizer that converts continuous values into
      discrete ones. Defaults to
      :obj:`Orange.feature.discretization.EntropyDiscretization`.
    :type discretizer: Orange.feature.discretization.Discretization

    :param name: name to assign to the learner
    :type name: string

    An example of how such a learner is set up and used in ten-fold
    cross-validation is given below::

        from Orange.feature import discretization
        bayes = Orange.classification.bayes.NaiveBayesLearner()
        disc = orange.Preprocessor_discretize(method=discretization.EquiNDiscretization(numberOfIntervals=10))
        dBayes = discretization.DiscretizedLearner(bayes, name='disc bayes')
        dbayes2 = discretization.DiscretizedLearner(bayes, name="EquiNBayes", discretizer=disc)
        results = Orange.evaluation.testing.CrossValidation([dBayes], table)
        classifier = discretization.DiscretizedLearner(bayes, examples=table)

    """
    def __init__(self, baseLearner, discretizer=EntropyDiscretization(), **kwds):
        self.baseLearner = baseLearner
        if hasattr(baseLearner, "name"):
            self.name = baseLearner.name
        self.discretizer = discretizer
        self.__dict__.update(kwds)

    def __call__(self, data, weight=None):
        # discretize the data, then hand it to the base learner
        from Orange.preprocess import Preprocessor_discretize
        ddata = Preprocessor_discretize(data, method=self.discretizer)
        if weight is not None:
            model = self.baseLearner(ddata, weight)
        else:
            model = self.baseLearner(ddata)
        # wrap the resulting classifier so prediction goes through the same interface
        dcl = DiscretizedClassifier(classifier=model)
        if hasattr(model, "domain"):
            dcl.domain = model.domain
        if hasattr(model, "name"):
            dcl.name = model.name
        return dcl

class DiscretizedClassifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        return self.classifier(example, resultType)