# source:orange/docs/reference/rst/Orange.data.discretization.rst@9963:7598327495d7

Revision 9963:7598327495d7, 3.6 KB checked in by blaz <blaz.zupan@…>, 2 years ago (diff)

Data discretization polished.

.. py:currentmodule:: Orange.data.discretization

########################################
Data discretization (``discretization``)
########################################

.. index:: discretization

.. index::
   single: data; discretization

Continuous features in the data can be discretized using a uniform discretization method. Discretization considers
only continuous features and replaces them in the new data set with corresponding categorical features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features with discretized values::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

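The labels in the output above suggest right-closed intervals: a value equal to a cut-off point falls into the
lower interval. The following is a minimal sketch of that mapping in plain Python. It does not use Orange;
``interval_label`` is a hypothetical helper, and the cut-off points 2.85 and 3.15 are simply read off the second
column of the output above:

.. code-block:: python

    def interval_label(value, cuts):
        """Map a number to a label such as '<=2.85', '(2.85, 3.15]' or '>3.15'.

        `cuts` is a sorted, non-empty list of cut-off points (assumed here).
        """
        # Everything up to and including the first cut-off point.
        if value <= cuts[0]:
            return "<=%.2f" % cuts[0]
        # Right-closed inner intervals (lo, hi].
        for lo, hi in zip(cuts, cuts[1:]):
            if value <= hi:
                return "(%.2f, %.2f]" % (lo, hi)
        # Everything above the last cut-off point.
        return ">%.2f" % cuts[-1]


    # Second column of the three original instances shown above.
    sepal_width = [3.5, 3.0, 3.2]
    labels = [interval_label(v, [2.85, 3.15]) for v in sepal_width]

With these cut-off points, ``labels`` matches the second column of the discretized data set shown above.
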
Data discretization uses feature discretization classes from :doc:`Orange.feature.discretization`
and applies them to the entire data set. The supported discretization methods are:

* equal width discretization, where the domain of a continuous feature is split into intervals of the same
  width (uses :class:`Orange.feature.discretization.EqualWidth`),
* equal frequency discretization, where each interval contains an equal number of data instances (uses
  :class:`Orange.feature.discretization.EqualFreq`),
* entropy-based, as originally proposed by [FayyadIrani1993]_, which infers the intervals to minimize
  within-interval entropy of class distributions (uses :class:`Orange.feature.discretization.Entropy`),
* bi-modal, using three intervals to optimize the difference of the class distribution in
  the middle with the distribution outside it (uses :class:`Orange.feature.discretization.BiModal`),
* fixed, with the user-defined cut-off points.

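The difference between the first two methods is easiest to see on skewed data. The sketch below is plain Python
and is not Orange's implementation; the function names and the toy values are assumptions made for illustration.
It computes cut-off points for both methods and counts how many values fall into each interval:

.. code-block:: python

    def equal_width_cuts(values, n):
        """Cut-off points that split [min, max] into n intervals of equal width."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / n
        return [lo + i * width for i in range(1, n)]


    def equal_freq_cuts(values, n):
        """Cut-off points that put roughly len(values) / n values in each interval.

        Each cut is placed midway between two neighbouring sorted values.
        Repeated values can produce duplicate cuts; this sketch ignores that.
        """
        ordered = sorted(values)
        cuts = []
        for i in range(1, n):
            k = i * len(ordered) // n
            cuts.append((ordered[k - 1] + ordered[k]) / 2)
        return cuts


    def interval_index(value, cuts):
        """Index of the right-closed interval that contains value."""
        return sum(value > c for c in cuts)


    def interval_counts(values, cuts):
        """Number of values that fall into each interval."""
        counts = [0] * (len(cuts) + 1)
        for v in values:
            counts[interval_index(v, cuts)] += 1
        return counts


    # Toy data with one outlier (made up for this example).
    values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

    width_cuts = equal_width_cuts(values, 3)
    freq_cuts = equal_freq_cuts(values, 3)
    width_counts = interval_counts(values, width_cuts)
    freq_counts = interval_counts(values, freq_cuts)

On this toy data the outlier stretches the equal-width intervals so that almost all values land in the first one,
while the equal-frequency intervals each receive three values.
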
.. FIXME give a corresponding class for fixed discretization

The default discretization method (equal frequency with three intervals) can be replaced with other
discretization approaches as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

Entropy-based discretization is special as it may infer new features that are constant and have only one value. Such
features are redundant and provide no information about the class. By default,
:class:`DiscretizeTable` removes them, in effect performing feature subset selection. The effect of removing
non-informative features is demonstrated in the following script:

.. literalinclude:: code/discretization-entropy.py
    :lines: 3-

In the sampled data set above, three features were discretized to a constant and thus removed::

    Redundant features (3 of 13):
    cholesterol, rest SBP, age

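The removal step itself is simple to picture. The sketch below is plain Python, not the code used by
:class:`DiscretizeTable`; ``drop_constant_columns``, the column names and the toy rows are all hypothetical. It
drops every column that holds a single distinct value after discretization:

.. code-block:: python

    def drop_constant_columns(rows, names):
        """Return (reduced_rows, kept_names, removed_names).

        A column is treated as redundant when all rows share the same value.
        """
        keep = [j for j in range(len(names))
                if len({row[j] for row in rows}) > 1]
        removed = [names[j] for j in range(len(names)) if j not in keep]
        reduced = [[row[j] for j in keep] for row in rows]
        return reduced, [names[j] for j in keep], removed


    # Made-up discretized data: column "a" was discretized to a constant.
    names = ["a", "b", "class"]
    rows = [
        ["<=50.00", "<=1.50", "yes"],
        ["<=50.00", ">1.50", "no"],
        ["<=50.00", ">1.50", "yes"],
    ]

    reduced, kept, removed = drop_constant_columns(rows, names)

Here ``removed`` contains only ``"a"``, and the remaining two columns are kept unchanged.
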
.. note::
    Entropy-based and bi-modal discretization require class-labeled data sets.

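To see why class labels are needed, and why a feature can end up constant, consider a single step of
entropy-based discretization. The sketch below is plain Python and a simplification: it evaluates only one binary
split and accepts it using the minimum description length criterion from [FayyadIrani1993]_, whereas the full method
applies the step recursively. It is not Orange's implementation, and all function names and data are assumptions:

.. code-block:: python

    import math
    from collections import Counter


    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


    def best_split(values, labels):
        """Return (gain, cut, left_labels, right_labels) of the best binary cut.

        Returns None when all values are equal and no cut is possible.
        """
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        base = entropy(labels)
        best = None
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # cannot cut between two equal values
            left = [label for _, label in pairs[:i]]
            right = [label for _, label in pairs[i:]]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            gain = base - weighted
            if best is None or gain > best[0]:
                cut = (pairs[i - 1][0] + pairs[i][0]) / 2
                best = (gain, cut, left, right)
        return best


    def entropy_cuts(values, labels):
        """Return [cut] if the best split passes the MDL test, else []."""
        split = best_split(values, labels)
        if split is None:
            return []
        gain, cut, left, right = split
        n = len(labels)
        k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
        delta = math.log2(3 ** k - 2) - (
            k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
        # Accept the cut only if the gain outweighs the cost of encoding it.
        if gain > (math.log2(n - 1) + delta) / n:
            return [cut]
        return []


    # Made-up data: the same values with two different class labelings.
    values = [1, 2, 3, 4, 5, 6, 7, 8]
    informative = ["a", "a", "a", "a", "b", "b", "b", "b"]
    noise = ["a", "b", "a", "b", "a", "b", "a", "b"]

    cuts_informative = entropy_cuts(values, informative)
    cuts_noise = entropy_cuts(values, noise)

With the first labeling a single cut separates the classes and is accepted. With the alternating labels no cut
is accepted, so the discretized feature would have one interval only, which is the constant-feature case described
above.
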
Data discretization classes
===========================

.. .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. autoclass:: DiscretizeTable

.. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange
   for Beginners tutorial shows the use of DiscretizedLearner. Other
   discretization classes from core Orange are listed in chapter on
   `categorization <../ofb/o_categorization.htm>`_ of the same tutorial. -> should put in classification/wrappers

.. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous valued
  attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages
  1022--1029, Chambery, France, 1993.