source: orange/docs/reference/rst/Orange.feature.discretization.rst @ 9927:d6ca7b346864

.. py:currentmodule:: Orange.feature.discretization

###################################
Discretization (``discretization``)
###################################

.. index:: discretization

.. index::
   single: feature; discretization

Continuous features can be discretized either one feature at a time or, as demonstrated in the following script,
by applying a single discretization method to the entire set of data features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features and computes their values in accordance with the
selected (or default) discretization method::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

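The included script is not reproduced here; the following is only a minimal sketch of such whole-table
discretization, assuming the iris data set and the default method (equal frequency with three intervals).
The loop over ``data.domain.features`` and the construction of the new domain are illustrative
assumptions, not necessarily the code from the included script::

    import Orange

    data = Orange.data.Table("iris")
    equal_freq = Orange.feature.discretization.EqualFreq(n=3)

    # discretize every continuous feature with the same method
    discretized = [equal_freq(f, data) for f in data.domain.features
                   if isinstance(f, Orange.feature.Continuous)]

    # build a new domain from the discretized features and convert the data to it
    new_domain = Orange.data.Domain(discretized, data.domain.class_var)
    new_data = Orange.data.Table(new_domain, data)
    for inst in new_data[:3]:
        print(inst)
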
The following discretization methods are supported:

* equal width discretization, where the domain of the continuous feature is split into a fixed number
  of intervals of equal width (:class:`EqualWidth`),
* equal frequency discretization, where each interval contains an approximately equal number of data
  instances (:class:`EqualFreq`),
* entropy-based discretization, as originally proposed by [FayyadIrani1993]_, which infers the intervals so as
  to minimize the within-interval entropy of class distributions (:class:`Entropy`),
* bi-modal discretization, which uses three intervals and optimizes the difference between the class
  distribution in the middle interval and the distribution outside it (:class:`BiModal`),
* fixed discretization, with user-defined cut-off points.

The above script used the default discretization method (equal frequency with three intervals). This can be changed
as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

With the exception of fixed discretization, the discretization approaches infer the cut-off points from the
training data set and construct a discretizer that converts continuous values of the feature into categorical
values according to the inferred rule. In this respect, discretization behaves similarly to
:class:`Orange.classification.Learner`.

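A minimal sketch of this learner-like behaviour, assuming the iris data set; the attributes printed at the
end follow the descriptions of discretizers later on this page::

    import Orange

    data = Orange.data.Table("iris")

    # the discretization method plays the role of a learner ...
    method = Orange.feature.discretization.EqualFreq(n=3)

    # ... and calling it on the data yields the "model": a discretized feature
    # whose get_value_from converts continuous values into interval labels
    d_length = method("sepal length", data)
    print(d_length.values)
    print(d_length.get_value_from(data[0]))
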
Discretization Algorithms
=========================

Instances of discretization classes are all derived from :class:`Discretization`.

.. class:: Discretization

    .. method:: __call__(feature, data[, weightID])

        Given a continuous ``feature``, ``data`` and, optionally, the id of the
        attribute holding the example weights, this function returns a discretized
        feature. The argument ``feature`` can be a descriptor, index or
        name of the attribute.

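For instance, a single feature can be discretized by calling the method directly; a short sketch,
assuming the iris data set::

    import Orange

    data = Orange.data.Table("iris")

    # the feature can be given as a descriptor, an index or, as here, a name
    disc = Orange.feature.discretization.EqualFreq(n=4)
    d_sepal = disc("sepal width", data)
    print(d_sepal.name)      # the new feature is named 'D_sepal width'
    print(d_sepal.values)    # the four constructed intervals
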
.. class:: EqualWidth

    Discretizes the feature by splitting its domain into a fixed number
    of equal-width intervals. The span of the original domain is computed
    from the training data and is defined by the smallest and the
    largest feature value.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The following example discretizes the Iris dataset features into six
intervals. The script constructs a :class:`Orange.data.Table` with discretized
features and outputs their descriptions:

.. literalinclude:: code/discretization.py
    :lines: 38-43

The output of this script is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400

All discretizers have the method ``construct_variable``:

.. literalinclude:: code/discretization.py
    :lines: 69-73


.. class:: EqualFreq

    Infers the cut-off points so that the discretization intervals contain
    an approximately equal number of training data instances.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points``
that store the inferred cut-offs.

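A short sketch of inspecting these cut-offs, assuming the iris data set::

    import Orange

    data = Orange.data.Table("iris")
    d_petal = Orange.feature.discretization.EqualFreq(n=4)("petal length", data)

    # the discretizer is an IntervalDiscretizer; its points hold the inferred cut-offs
    transformer = d_petal.get_value_from.transformer
    print(transformer.points)
    print(d_petal.values)
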
.. class:: Entropy

    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most
    appropriate number of intervals by recursively splitting the domain of the continuous feature to minimize the
    class entropy of the training examples. The splitting is repeated until the entropy decrease is smaller than the
    increase of the minimal description length (MDL) induced by the new cut-off point.

    Entropy-based discretization can reduce a continuous feature to
    a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be
    removed. This discretization can therefore also serve to identify non-informative features and can thus be
    used for feature subset selection (a sketch of this use follows the example below).

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when
        its information gain is lower than MDL (default: ``False``).

Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

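As noted above, features for which entropy-based discretization finds no cut-off points are uninformative
with respect to the class. A minimal sketch of using this for feature subset selection, assuming the iris
data set; the single-interval test via ``len(...)`` is an assumption based on the ``points`` attribute of
the discretizer described below::

    import Orange

    data = Orange.data.Table("iris")
    entropy = Orange.feature.discretization.Entropy()

    kept = []
    for feature in data.domain.features:
        d_feature = entropy(feature, data)
        # a feature reduced to a single interval (no cut-off points) carries no information
        if len(d_feature.get_value_from.transformer.points) > 0:
            kept.append(feature)

    print([f.name for f in kept])
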
.. class:: BiModal

    Infers two cut-off points to optimize the difference between the class distribution of the data instances in the
    middle interval and the distribution in the other two intervals. The
    difference is scored by the chi-square statistic. All possible cut-off
    points are examined, so the discretization runs in O(n^2). This discretization method is especially suitable
    for attributes in
    which the middle region corresponds to normal and the outer regions to
    abnormal values of the feature.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two values.
        If ``True`` (default), the feature is discretized into three
        intervals and the discretizer is of type :class:`BiModalDiscretizer`.
        If ``False``, the result is an ordinary :class:`IntervalDiscretizer`.

The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation,
shows that the sepal lengths of versicolors lie between the lengths of setosas and virginicas.

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica, we can observe whether
the bi-modal discretization correctly recognizes the interval in
which versicolors dominate. The following script performs the merging and constructs a new data set with a class
that reports whether an iris is a versicolor or not.

.. literalinclude:: code/discretization.py
    :lines: 84-87

The following script implements the discretization:

.. literalinclude:: code/discretization.py
    :lines: 97-100

The middle intervals are printed::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense.

Discretizers
============

Discretizers construct a categorical feature from a continuous feature according to the method they implement and
its parameters. The most general is
:class:`IntervalDiscretizer`, which is also used by most discretization
methods. Two other discretizers, :class:`EqualWidthDiscretizer` and
:class:`ThresholdDiscretizer`, could easily be replaced by
:class:`IntervalDiscretizer` but are used for speed and simplicity.
The fourth discretizer, :class:`BiModalDiscretizer`, is specialized
for discretizations induced by :class:`BiModal`.

.. class:: Discretizer

    A superclass implementing the construction of a new
    attribute from an existing one.

    .. method:: construct_variable(feature)

        Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name``
        prefixed by "D\_". Its symbolic values are discretizer specific.

.. class:: IntervalDiscretizer

    Discretizer defined with a set of cut-off points.

    .. attribute:: points

        The cut-off points; feature values below or equal to the first point are mapped to the first interval,
        those between the first and the second point
        (including those equal to the second) are mapped to the second interval, and
        so forth, up to the last interval, which covers all values greater than
        the last point in ``points``. The number of intervals is thus
        ``len(points)+1``.

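The mapping rule can be written down directly in plain Python; ``interval_index`` below is a hypothetical
helper used only to illustrate the rule, not part of Orange::

    from bisect import bisect_left

    def interval_index(points, value):
        # values <= points[0] fall into interval 0, values <= points[1] into interval 1, ...
        # values greater than the last point fall into the last interval, len(points)
        return bisect_left(points, value)

    points = [3.0, 5.0]
    print(interval_index(points, 2.5))   # 0  -> '<=3.00'
    print(interval_index(points, 3.0))   # 0  -> '<=3.00'
    print(interval_index(points, 4.9))   # 1  -> '(3.00, 5.00]'
    print(interval_index(points, 5.1))   # 2  -> '>5.00'
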
The script that follows is an example of the manual construction of a discretizer with cut-off points
at 3.0 and 5.0:

.. literalinclude:: code/discretization.py
    :lines: 22-26

The first five data instances of ``data2`` are::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

The same discretizer can be used on several features by calling the function ``construct_var``:

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each feature has its own instance of :class:`ClassifierFromVar` stored in
``get_value_from``, but all use the same :class:`IntervalDiscretizer`,
``idisc``. Changing any element of its ``points`` affects all attributes.

.. note::

    The length of :obj:`~IntervalDiscretizer.points` should not be changed if the
    discretizer is used by any attribute. The length of
    :obj:`~IntervalDiscretizer.points` must be consistent with the number of values
    of the feature, which is determined by the length of the attribute's field
    ``values``. If ``attr`` is a discretized attribute, then ``len(attr.values)`` must equal
    ``len(attr.get_value_from.transformer.points)+1``.

.. class:: EqualWidthDiscretizer

    Discretizes to intervals of a fixed width. All values lower than :obj:`~EqualWidthDiscretizer.first_cut`
    are mapped to the first interval. Otherwise, the interval of a value ``val`` is determined by
    ``floor((val-first_cut)/step)``; possible overflows are mapped to the last interval.


    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: n

        Number of intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute although it behaves
        as one. Reading it constructs a list of cut-off points and returns it,
        but changing the list does not affect the discretizer. It is present only to give
        :obj:`EqualWidthDiscretizer` the same interface as that of
        :obj:`IntervalDiscretizer`.

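The relation between ``first_cut``, ``step``, ``n`` and the read-only ``points`` can be illustrated in plain
Python; ``equal_width_index`` is a hypothetical helper that follows the interval layout printed for
``D_sepal length`` above (``<<4.90, [4.90, 5.50), ...>``, with the leftmost interval as index 0), not Orange code::

    import math

    first_cut, step, n = 4.9, 0.6, 6

    # the read-only 'points' correspond to the n-1 cut-offs derived from first_cut and step
    points = [first_cut + i * step for i in range(n - 1)]
    print(points)   # approximately [4.9, 5.5, 6.1, 6.7, 7.3]

    def equal_width_index(val):
        # values below first_cut go to the first interval; overflows are clamped to the last
        if val < first_cut:
            return 0
        return min(int(math.floor((val - first_cut) / step)) + 1, n - 1)

    print(equal_width_index(4.3))   # 0 -> '<4.90'
    print(equal_width_index(5.1))   # 1 -> '[4.90, 5.50)'
    print(equal_width_index(7.9))   # 5 -> '>7.30'
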
.. class:: ThresholdDiscretizer

    Threshold discretizer converts continuous values into binary values by comparing
    them to a fixed threshold. Orange uses this discretizer for
    binarization of continuous attributes in decision trees.

    .. attribute:: threshold

        The threshold value; values below or equal to the threshold belong to the first
        interval and greater values go to the second.


.. class:: BiModalDiscretizer

    A bimodal discretizer has two cut-off points and values are
    discretized according to whether or not they belong to the region between these points,
    which includes the lower but not the upper boundary. The
    discretizer is returned by :class:`BiModal` if its
    field :obj:`~BiModal.split_in_two` is true (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).

Implementation details
======================

Consider the following example (part of :download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 7-15

The discretized attribute ``sep_w`` is constructed with a call to
:class:`Entropy`; instead of constructing the discretization object and calling
it afterwards, the arguments for the call are passed to the constructor. We then constructed a new
:class:`Orange.data.Table` with the attributes "sepal width" (the original
continuous attribute), ``sep_w`` and the class attribute::

    Entropy discretization, first 5 data instances
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']

The name of the new categorical variable derives from the name of the original
continuous variable by adding the prefix ``D_``. The values of the new attribute
are computed automatically when they are needed, using the transformation
function :obj:`~Orange.feature.Descriptor.get_value_from`
(see :class:`Orange.feature.Descriptor`), which encodes the discretization::

    >>> sep_w
    EnumVariable 'D_sepal width'
    >>> sep_w.get_value_from
    <ClassifierFromVar instance at 0x01BA7DC0>
    >>> sep_w.get_value_from.whichVar
    FloatVariable 'sepal width'
    >>> sep_w.get_value_from.transformer
    <IntervalDiscretizer instance at 0x01BA2100>
    >>> sep_w.get_value_from.transformer.points
    <2.90000009537, 3.29999995232>

The ``select`` statement in the discretization script converted all data instances
from ``data`` to the new domain. This includes the new feature
``sep_w``, whose values are computed on the fly by calling ``sep_w.get_value_from`` for each data instance.
The original, continuous sepal width
is passed to the ``transformer``, which determines the interval from its field
``points``. The transformer returns the discrete value, which is in turn returned
by ``get_value_from`` and stored in the new example.

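The same conversion can be triggered by hand; a short sketch, assuming the iris data set and the
construction of ``sep_w`` as above::

    import Orange

    data = Orange.data.Table("iris")
    sep_w = Orange.feature.discretization.Entropy("sepal width", data)

    instance = data[0]
    # the chain described above, invoked explicitly: the original continuous value ...
    original = instance["sepal width"]
    # ... is converted by get_value_from (a ClassifierFromVar using the transformer)
    discrete = sep_w.get_value_from(instance)
    print("%s -> %s" % (original, discrete))
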
References
==========

.. [FayyadIrani1993] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued
  attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence,
  pages 1022--1029, Chambéry, France, 1993.