source: orange/docs/reference/rst/Orange.feature.discretization.rst @ 9812:a810dc61169e

.. py:currentmodule:: Orange.feature.discretization

###################################
Discretization (``discretization``)
###################################

.. index:: discretization

.. index::
   single: feature; discretization

Continuous features can be discretized either one feature at a time or, as demonstrated in the following script,
by applying a single discretization method to the entire set of data features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features and computes their values in accordance with the
selected (or default) discretization method::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

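A minimal sketch of such table-wide discretization, assuming :class:`DiscretizeTable` can be called with just the
data table (the default discretization method is then used), could look like this::

    import Orange

    iris = Orange.data.Table("iris")
    # discretize all continuous features with the default method
    # (equal frequency with three intervals); calling DiscretizeTable
    # with only the data table is an assumption of this sketch
    iris_disc = Orange.feature.discretization.DiscretizeTable(iris)

    print "Original data set:"
    for inst in iris[:3]:
        print inst

    print "Discretized data set:"
    for inst in iris_disc[:3]:
        print inst
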
The following discretization methods are supported:

* equal width discretization, where the domain of a continuous feature is split into a fixed number of
  equal-width intervals (:class:`EqualWidth`),
* equal frequency discretization, where each interval contains an equal number of data instances (:class:`EqualFreq`),
* entropy-based discretization, as originally proposed by [FayyadIrani1993]_, which infers the intervals so as to
  minimize the within-interval entropy of class distributions (:class:`Entropy`),
* bi-modal discretization, which uses three intervals to optimize the difference between the class distribution in
  the middle interval and the distribution outside it (:class:`BiModal`),
* fixed discretization, with user-defined cut-off points.

The above script used the default discretization method (equal frequency with three intervals). This can be changed
as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

With the exception of fixed discretization, discretization approaches infer the cut-off points from the
training data set and thus construct a discretizer that converts the continuous values of the feature into categorical
values according to the rule found by the discretization. In this respect, discretization behaves similarly to
:class:`Orange.classification.Learner`.
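
For instance, the cut-offs inferred from one data table yield a discretized feature that can later convert any data
from the original domain, much like a classifier produced by a learner. The sketch below assumes the usual Orange
domain-conversion call ``Orange.data.Table(domain, data)``::

    import Orange

    data = Orange.data.Table("iris")

    # inferring the cut-offs constructs a reusable discretized feature ...
    d_petal = Orange.feature.discretization.EqualFreq(n=3)("petal length", data)

    # ... which can convert any instances from the original domain
    # (for example a separately held-out test set)
    new_domain = Orange.data.Domain([d_petal, data.domain.class_var])
    d_data = Orange.data.Table(new_domain, data)
    print d_data[0]
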

Utility functions
=================

This section describes functions and classes that can be used for categorization of continuous features. Besides
several general classes that help in this task, we also provide a function that may help in entropy-based
discretization (Fayyad & Irani) and a wrapper around categorization classes that can be used for learning.

.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. autoclass:: DiscretizeTable

.. rubric:: Example

FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in the Orange
for Beginners tutorial shows the use of DiscretizedLearner. Other
discretization classes from core Orange are listed in the chapter on
`categorization <../ofb/o_categorization.htm>`_ of the same tutorial.

Discretization Algorithms
=========================

Instances of discretization classes are all derived from :class:`Discretization`.

.. class:: Discretization

    .. method:: __call__(feature, data[, weightID])

        Given a continuous ``feature``, ``data`` and, optionally, the id of an attribute holding example
        weights, this function returns a discretized feature. The argument ``feature`` can be a descriptor,
        an index or the name of the attribute.
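
For example, an :class:`EqualWidth` instance can be called with the feature given in any of these forms; a short
sketch, assuming the ``iris`` data set shipped with Orange::

    import Orange

    data = Orange.data.Table("iris")
    disc = Orange.feature.discretization.EqualWidth(n=4)

    # the feature can be given by name, by index or by descriptor
    d1 = disc("sepal length", data)
    d2 = disc(0, data)
    d3 = disc(data.domain["sepal length"], data)
    print d1.name, d1.values
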


.. class:: EqualWidth

    Discretizes the feature by splitting its domain into a fixed number
    of equal-width intervals. The span of the original domain is computed
    from the training data and is defined by the smallest and the
    largest feature value.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The following example discretizes the features of the Iris data set into six
intervals. The script constructs an :class:`Orange.data.Table` with the discretized
features and outputs their descriptions:

.. literalinclude:: code/discretization.py
    :lines: 38-43

The output of this script is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400

All discretizers have the method ``construct_variable``:

.. literalinclude:: code/discretization.py
    :lines: 69-73


.. class:: EqualFreq

    Infers the cut-off points so that the discretization intervals contain
    approximately equal numbers of training data instances.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points``
that store the inferred cut-offs.
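
The inferred cut-offs can be read from the discretizer; a short sketch, assuming the ``iris`` data set::

    import Orange

    data = Orange.data.Table("iris")
    d_sep = Orange.feature.discretization.EqualFreq(n=4)("sepal length", data)

    # the fitted IntervalDiscretizer with its cut-off points
    print d_sep.get_value_from.transformer.points
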

.. class:: Entropy

    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most
    appropriate number of intervals by recursively splitting the domain of the continuous feature so as to minimize the
    class entropy of the training examples. The splitting is repeated until the entropy decrease is smaller than the
    increase of minimal description length (MDL) induced by the new cut-off point.

    Entropy-based discretization can reduce a continuous feature to
    a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be
    removed. This discretization can
    therefore also serve to identify non-informative features and can thus be used for feature subset selection.

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when
        its information gain is lower than the MDL (default: ``False``).

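A sketch of such use: discretize each feature with :class:`Entropy` and flag those whose discretized version ends up
with fewer than two values (all Iris features are continuous, so no type check is needed)::

    import Orange

    data = Orange.data.Table("iris")
    entropy = Orange.feature.discretization.Entropy()

    for feature in data.domain.attributes:
        d_feature = entropy(feature, data)
        if len(d_feature.values) < 2:
            # a single interval: the feature carries no class information
            print "%s: non-informative, candidate for removal" % feature.name
        else:
            print "%s: %s" % (feature.name, ", ".join(d_feature.values))
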
Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output lists the inferred cut-off points and shows that all attributes were discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

.. class:: BiModal

    Infers two cut-off points to optimize the difference between the class distribution of the data instances in the
    middle interval and the distribution in the other two intervals. The
    difference is scored by the chi-square statistic. All possible cut-off
    points are examined, so the discretization runs in O(n^2). This method is especially suitable
    for attributes in
    which the middle region corresponds to normal and the outer regions to
    abnormal values of the feature.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two values.
        If ``True`` (default), the feature is discretized into three intervals and the discretizer
        is of type :class:`BiModalDiscretizer`. If ``False``, the result is the
        ordinary :class:`IntervalDiscretizer`.

The Iris data set has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows
that the sepal lengths of versicolors lie between those of setosas and virginicas.

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica, we can observe whether
the bi-modal discretization correctly recognizes the interval in
which versicolors dominate. The following script performs the merging and constructs a new data set with a class
that reports whether the iris is a versicolor or not.

.. literalinclude:: code/discretization.py
    :lines: 84-87

The following script implements the discretization:

.. literalinclude:: code/discretization.py
    :lines: 97-100

The middle intervals are printed::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense.

Discretizers
============

Discretizers construct a categorical feature from a continuous feature according to the method they implement and
its parameters. The most general is
:class:`IntervalDiscretizer`, which is also used by most discretization
methods. Two other discretizers, :class:`EquiDistDiscretizer` and
:class:`ThresholdDiscretizer`, could easily be replaced by
:class:`IntervalDiscretizer` but are used for speed and simplicity.
The fourth discretizer, :class:`BiModalDiscretizer`, is specialized
for discretizations induced by :class:`BiModal`.

.. class:: Discretizer

    A superclass implementing the construction of a new
    attribute from an existing one.

    .. method:: construct_variable(feature)

        Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name``
        prefixed by "D\_". Its symbolic values are discretizer specific.

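For instance, a discretizer fitted on one feature can be reused to build a discretized version of another feature; a
hypothetical sketch::

    import Orange

    data = Orange.data.Table("iris")

    # discretize one feature and fetch its discretizer
    d_sep = Orange.feature.discretization.EqualFreq(n=3)("sepal length", data)
    discretizer = d_sep.get_value_from.transformer

    # apply the same cut-offs to another feature
    d_pet = discretizer.construct_variable(data.domain["petal length"])
    print d_pet.name, d_pet.values
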
.. class:: IntervalDiscretizer

    Discretizer defined by a set of cut-off points.

    .. attribute:: points

        The cut-off points; feature values below or equal to the first point are mapped to the first interval,
        those between the first and the second point
        (including those equal to the second) are mapped to the second interval, and
        so forth up to the last interval, which covers all values greater than
        the last value in ``points``. The number of intervals is thus
        ``len(points)+1``.

The script that follows is an example of manual construction of a discretizer with cut-off points
at 3.0 and 5.0:

.. literalinclude:: code/discretization.py
    :lines: 22-26

The first five data instances of ``data2`` are::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

The same discretizer can be used on several features by calling the function ``construct_var``:

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each feature has its own instance of :class:`ClassifierFromVar` stored in
``get_value_from``, but all use the same :class:`IntervalDiscretizer`,
``idisc``. Changing any element of its ``points`` affects all attributes.

.. note::

    The length of :obj:`~IntervalDiscretizer.points` should not be changed if the
    discretizer is used by any attribute. The length of
    :obj:`~IntervalDiscretizer.points` should always correspond to the number of values
    of the feature, which is determined by the length of the attribute's field
    ``values``. If ``attr`` is a discretized attribute, then ``len(attr.values)`` must equal
    ``len(attr.get_value_from.transformer.points)+1``.
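
This invariant can be checked directly; a short sketch::

    import Orange

    data = Orange.data.Table("iris")
    attr = Orange.feature.discretization.EqualFreq(n=3)("sepal width", data)

    points = attr.get_value_from.transformer.points
    assert len(attr.values) == len(points) + 1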


.. class:: EqualWidthDiscretizer

    Discretizes to intervals of fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first
    interval. Any other value ``val`` is mapped to interval ``1 + floor((val-first_cut)/step)``; overflows are mapped to the
    last interval.


    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: n

        Number of intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute although it behaves
        as one. Reading it constructs a list of cut-off points and returns it,
        but changing the list does not affect the discretizer. It is only present to give
        :obj:`EquiDistDiscretizer` the same interface as that of
        :obj:`IntervalDiscretizer`.
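
The mapping rule can be illustrated with a small stand-alone function (plain Python, mirroring the description above;
not part of Orange)::

    import math

    def interval_index(val, first_cut, step, n):
        # values below the first cut-off fall into the first interval
        if val < first_cut:
            return 0
        # other values fall into intervals 1 .. n-1; overflows are clipped
        return min(1 + int(math.floor((val - first_cut) / step)), n - 1)

    # cut-offs from the six-interval example above: first cut 4.90, step 0.60
    for val in (4.5, 5.1, 6.3, 7.5):
        print val, "->", interval_index(val, 4.90, 0.60, 6)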


.. class:: ThresholdDiscretizer

    Threshold discretizer converts continuous values to binary ones by comparing
    them with a fixed threshold. Orange uses this discretizer for
    binarization of continuous attributes in decision trees.

    .. attribute:: threshold

        The threshold value; values below or equal to the threshold belong to the first
        interval and those that are greater go to the second.


.. class:: BiModalDiscretizer

    A bimodal discretizer has two cut-off points and values are
    discretized according to whether or not they belong to the region between these points,
    which includes the lower but not the upper boundary. The
    discretizer is returned by :class:`BiModal` if its
    field :obj:`~BiModal.split_in_two` is ``True`` (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).


Implementation details
======================

Consider the following example (part of :download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 7-15

The discretized attribute ``sep_w`` is constructed with a call to
:class:`Entropy`; instead of constructing it and calling
it afterwards, we passed the call arguments to the constructor. We then constructed a new
:class:`Orange.data.Table` with the attributes "sepal width" (the original
continuous attribute), ``sep_w`` and the class attribute::

    Entropy discretization, first 5 data instances
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']

The name of the new categorical variable is derived from the name of the original continuous variable by adding the prefix
"D\_". The values of the new attribute are computed automatically when they are needed, using a transformation function
:obj:`~Orange.data.variable.Variable.get_value_from` (see :class:`Orange.data.variable.Variable`) which encodes the
discretization::

    >>> sep_w
    EnumVariable 'D_sepal width'
    >>> sep_w.get_value_from
    <ClassifierFromVar instance at 0x01BA7DC0>
    >>> sep_w.get_value_from.whichVar
    FloatVariable 'sepal width'
    >>> sep_w.get_value_from.transformer
    <IntervalDiscretizer instance at 0x01BA2100>
    >>> sep_w.get_value_from.transformer.points
    <2.90000009537, 3.29999995232>

The ``select`` statement in the discretization script converted all data instances
from ``data`` to the new domain. The new domain includes the feature
``sep_w`` whose values are computed on the fly by calling ``sep_w.get_value_from`` for each data instance.
The original, continuous sepal width
is passed to the ``transformer``, which determines the interval from its field
``points``. The transformer returns the discrete value, which is in turn returned
by ``get_value_from`` and stored in the new example.
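
The conversion can also be invoked by hand for a single data instance; a short sketch, reusing ``data`` and ``sep_w``
from the example above::

    # get_value_from computes the discretized value from the instance's
    # original, continuous sepal width
    inst = data[0]
    print inst["sepal width"], "->", sep_w.get_value_from(inst)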

References
==========

.. [FayyadIrani1993] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued
  attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial
  Intelligence, pages 1022--1029, Chambery, France, 1993.