source: orange/docs/reference/rst/Orange.data.discretization.rst @ 9900:795a819ca3bb

Revision 9900:795a819ca3bb, 2.7 KB checked in by blaz <blaz.zupan@…>, 2 years ago (diff)

preliminary discretization structure (draft)

.. py:currentmodule:: Orange.data

###################################
Discretization (``discretization``)
###################################

.. index:: discretization

.. index::
   single: data; discretization

Continuous features in the data can be discretized using a uniform discretization method. The approach considers
only continuous features and replaces them in the data set with the corresponding categorical features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features and computes their values in accordance with
a discretization method::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

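The interval labels in the output above can be reproduced with a short sketch in plain Python.
It assumes a sorted list of cut-off points per feature and intervals that are closed on the right,
as the label ``(2.85, 3.15]`` suggests. The names ``interval_labels`` and ``discretize_value`` are
hypothetical helpers used only for illustration; they are not part of Orange.

.. code-block:: python

    from bisect import bisect_left


    def interval_labels(cuts):
        """Return labels like '<=2.85', '(2.85, 3.15]', '>3.15' for sorted cut-offs."""
        if not cuts:
            raise ValueError("at least one cut-off point is required")
        labels = ["<=%.2f" % cuts[0]]
        # One label per pair of neighbouring cut-offs.
        labels += ["(%.2f, %.2f]" % (lo, hi) for lo, hi in zip(cuts, cuts[1:])]
        labels.append(">%.2f" % cuts[-1])
        return labels


    def discretize_value(value, cuts):
        """Map a continuous value to the label of the interval it falls into."""
        # bisect_left counts the cut-offs strictly smaller than value, so a
        # value equal to a cut-off lands in the interval closed on the right.
        return interval_labels(cuts)[bisect_left(cuts, value)]

With the cut-off points 2.85 and 3.15 seen above, ``discretize_value(3.0, [2.85, 3.15])`` gives
``'(2.85, 3.15]'`` and ``discretize_value(3.5, [2.85, 3.15])`` gives ``'>3.15'``.
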
The procedure uses feature discretization classes as defined in XXX and applies them to entire data sets.
The supported discretization methods are:

* equal width discretization, where the domain of a continuous feature is split into intervals of the same
  width (:class:`EqualWidth`),
* equal frequency discretization, where each interval contains an equal number of data instances (:class:`EqualFreq`),
* entropy-based discretization, as originally proposed by [FayyadIrani1993]_, which infers the intervals to minimize
  within-interval entropy of class distributions (:class:`Entropy`),
* bi-modal discretization, which uses three intervals to optimize the difference between the class distribution in
  the middle interval and the distribution outside it (:class:`BiModal`),
* fixed discretization, with user-defined cut-off points.

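To make the first two methods in the list above concrete, the following plain-Python sketch computes
cut-off points for equal width and equal frequency discretization. The functions ``equal_width_cuts``
and ``equal_freq_cuts`` are hypothetical helpers written for this illustration; the Orange classes may
place cut-off points differently, for instance when the data contain ties.

.. code-block:: python

    def equal_width_cuts(values, n):
        """Cut-offs that split [min, max] into n intervals of the same width."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / float(n)
        return [lo + i * width for i in range(1, n)]


    def equal_freq_cuts(values, n):
        """Cut-offs that put roughly len(values) / n instances into each interval."""
        ordered = sorted(values)
        size = len(ordered)
        if size < n:
            raise ValueError("need at least as many values as intervals")
        cuts = []
        for i in range(1, n):
            pos = i * size // n
            # Place the cut-off halfway between the two neighbouring values.
            cut = (ordered[pos - 1] + ordered[pos]) / 2.0
            # Skip repeated cut-offs, which can appear when many values are tied.
            if not cuts or cut > cuts[-1]:
                cuts.append(cut)
        return cuts

For instance, ``equal_width_cuts([0.0, 3.0, 6.0, 9.0, 12.0], 3)`` returns ``[4.0, 8.0]``, while
``equal_freq_cuts([1, 2, 3, 4, 5, 6], 3)`` returns ``[2.5, 4.5]``, leaving two instances in each interval.
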
The above script used the default discretization method (equal frequency with three intervals). A different
discretization method can be selected, as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

Classes
=======

This section lists functions and classes that can be used for
categorization of continuous features. Besides several general classes that
can help in this task, we also provide a function that may help in
entropy-based discretization (Fayyad & Irani), and a wrapper around classes for
categorization that can be used for learning.

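The core step of entropy-based discretization can be sketched in plain Python as a search for the single
cut-off point that minimizes the weighted class entropy of the two resulting intervals. This is only an
illustration under simplifying assumptions: the method of Fayyad & Irani applies such a split recursively
and decides when to stop with a minimum description length criterion, both of which are omitted here.
The names ``entropy`` and ``best_entropy_cut`` are hypothetical and are not Orange functions.

.. code-block:: python

    from collections import Counter
    from math import log


    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = float(len(labels))
        return -sum(c / total * log(c / total, 2) for c in Counter(labels).values())


    def best_entropy_cut(values, labels):
        """Return the cut-off with the lowest weighted within-interval entropy."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_cut, best_score = None, None
        for i in range(1, n):
            # A cut-off can only be placed between two distinct values.
            if pairs[i - 1][0] == pairs[i][0]:
                continue
            left = [c for _, c in pairs[:i]]
            right = [c for _, c in pairs[i:]]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / float(n)
            if best_score is None or score < best_score:
                best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
                best_score = score
        return best_cut

For values ``[1.0, 2.0, 3.0, 4.0]`` with classes ``['a', 'a', 'b', 'b']``, the sketch returns ``2.5``,
the only cut-off that separates the two classes completely.
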
.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. autoclass:: DiscretizeTable

.. rubric:: Example

FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in the Orange
for Beginners tutorial shows the use of DiscretizedLearner. Other
discretization classes from core Orange are listed in the chapter on
`categorization <../ofb/o_categorization.htm>`_ of the same tutorial.