source: orange/docs/widgets/rst/data/discretize.rst @ 11778:ecd4beec2099

Revision 11778:ecd4beec2099, 7.9 KB checked in by Ales Erjavec <ales.erjavec@…>, 5 months ago (diff)

.. _Discretize:

Discretize
==========

.. image:: ../../../../Orange/OrangeWidgets/Data/icons/Discretize.svg

Discretizes continuous attributes from the input data set.

Signals
-------

Inputs:


   - Examples (ExampleTable)
      Attribute-valued data set.


Outputs:


   - Examples (ExampleTable)
      Attribute-valued data set with continuous attributes replaced by their discretized versions.


Description
-----------

The Discretize widget receives a data set on its input, finds the
attributes that are continuous, and discretizes them using the selected
method. It then outputs the same data set with the continuous attributes
replaced by their discretized versions.

.. image:: images/Discretize.png
   :alt: Discretize

The basic version of the widget is rather simple. It allows choosing between three
discretization methods. :obj:`Entropy-MDL`, invented by
Fayyad and Irani, is a top-down discretization that recursively splits the attribute
at the cut giving the maximal information gain, until the gain falls below the minimal
description length of the cut. This discretization can result in an arbitrary number of
intervals, including a single interval, in which case the attribute is discarded as useless.
:obj:`Equal-frequency` splits the attribute into the given
:obj:`Number of intervals`, so that each contains approximately the same
number of examples. :obj:`Equal-width` evenly splits the range between
the smallest and the largest observed value.
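
The two unsupervised methods can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the widget's code: the function names are hypothetical, and placing equal-frequency cut-offs halfway between neighbouring values is an assumption of the sketch.

```python
def equal_width_cutoffs(values, n_intervals):
    """Cut-offs that split [min, max] into n_intervals bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(1, n_intervals)]


def equal_frequency_cutoffs(values, n_intervals):
    """Cut-offs chosen so that each bin holds roughly the same number of values.

    Assumption of this sketch: a cut-off lies halfway between two
    neighbouring values in sorted order.
    """
    ordered = sorted(values)
    n = len(ordered)
    cutoffs = []
    for i in range(1, n_intervals):
        pos = i * n // n_intervals
        if pos == 0:
            continue  # fewer values than intervals: nothing to cut here
        cut = (ordered[pos - 1] + ordered[pos]) / 2
        if not cutoffs or cut > cutoffs[-1]:  # skip duplicates caused by ties
            cutoffs.append(cut)
    return cutoffs


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up data with one outlier
print(equal_width_cutoffs(values, 3))      # [34.0, 67.0]
print(equal_frequency_cutoffs(values, 3))  # [3.5, 6.5]
```

With an outlier in the data, equal-width puts almost all examples into the first interval, while equal-frequency keeps the intervals balanced.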

The widget can also be set to leave the attributes continuous or to remove them.

The :obj:`Class discretization` box defines what happens with the class
value if it is continuous. We can use equal-frequency or equal-width discretization, or set custom
thresholds. The box also shows the current thresholds. In the snapshot, we have a
continuous class which is split into three intervals of equal width; their boundaries are
at 18545 and 31973.
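
Once the thresholds are known, mapping a continuous value to its interval is a simple lookup. The sketch below uses the two boundaries from the snapshot. The function names, the label format, the convention that a value equal to a threshold falls into the upper interval, and the sample values are all assumptions made for illustration.

```python
from bisect import bisect_right


def interval_labels(cutoffs):
    """Human-readable labels for the intervals defined by sorted cut-offs."""
    labels = ["< %g" % cutoffs[0]]
    for lo, hi in zip(cutoffs, cutoffs[1:]):
        labels.append("[%g, %g)" % (lo, hi))
    labels.append(">= %g" % cutoffs[-1])
    return labels


def interval_index(value, cutoffs):
    """Index of the interval that contains `value` (0 is the lowest)."""
    return bisect_right(cutoffs, value)


cutoffs = [18545, 31973]          # boundaries shown in the snapshot
labels = interval_labels(cutoffs)
for value in (5118, 18545, 25000, 45400):   # made-up class values
    print(value, "->", labels[interval_index(value, cutoffs)])
```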

Since some discretization and other methods supported by the widget need discrete class
values, the class is always discretized. :obj:`Output original class`
decides whether the data produced by the widget will contain the original, continuous dependent
variable or the discretized one.

As usual, the widget can be set to immediately apply any changes or to commit them when
the user presses :obj:`Commit`.

Three discretization methods are supported. Continuous attributes
are discretized either using a set of intervals of the same size
(:obj:`Equal-Width Intervals`), or using a set of intervals whose
borders are chosen so that each interval covers approximately the same
number of data instances (:obj:`Equal-Frequency Intervals`). The number of
intervals is user-defined.

A different technique is :obj:`Entropy-based
discretization`, which works only if the input data set includes a
discrete class. It finds intervals that minimize the entropy
of the class variable (e.g., intervals tend to include instances of
some prevailing class). The algorithm used is that of Fayyad and Irani
(1992). One possible outcome of the algorithm is that no appropriate
cut-off points are found, hence the attribute is reduced to a constant
and can be removed from the data set. Attributes of this kind are
listed under :obj:`Removed Attributes`.
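
The recursive procedure can be sketched as follows. This is a minimal illustration of the Fayyad and Irani stopping criterion as it is commonly stated, not the code used by the widget. The function names are hypothetical, and putting a cut-off halfway between two neighbouring values is an assumption of the sketch.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def mdl_cutoffs(values, labels):
    """Entropy-MDL cut-offs for one attribute; an empty list means
    that no split is justified and the attribute becomes constant."""
    pairs = sorted(zip(values, labels))
    return _split(pairs)


def _split(pairs):
    n = len(pairs)
    labels = [c for _, c in pairs]
    ent = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut must separate two distinct values
        e1, e2 = entropy(labels[:i]), entropy(labels[i:])
        gain = ent - (i * e1 + (n - i) * e2) / n
        if best is None or gain > best[0]:
            best = (gain, i, e1, e2)
    if best is None:
        return []
    gain, i, e1, e2 = best
    # Number of distinct classes in the whole set and in each part.
    k, k1, k2 = (len(set(s)) for s in (labels, labels[:i], labels[i:]))
    delta = log2(3 ** k - 2) - (k * ent - k1 * e1 - k2 * e2)
    if gain <= (log2(n - 1) + delta) / n:
        return []  # the gain does not pay for describing the cut
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return _split(pairs[:i]) + [cut] + _split(pairs[i:])


# Made-up data: the class changes exactly once, between 9 and 10.
print(mdl_cutoffs(list(range(20)), ["a"] * 10 + ["b"] * 10))  # [9.5]
# Alternating classes carry no usable signal, so no cut-off is found.
print(mdl_cutoffs([1, 2, 3, 4, 5, 6], list("ababab")))        # []
```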

Depending on the user's settings, the widget can display either the
discretization intervals or the cut-off points, that is, interval borders.

Up until now, the same settings were used for all attributes. To treat
each attribute differently, click :obj:`Explore and set individual
discretizations`. This opens another part of the widget.

.. image:: images/Discretize-All.png

:obj:`Individual attribute settings` shows the specific
discretization of each attribute and allows for changing it. First, the top left
list shows the cut-off points for each attribute. In the snapshot we used the
entropy-MDL discretization, which determines the optimal number of intervals
automatically: we see it discretized the length and width into two intervals
with cut-offs at 186.70 and 68.40, respectively, while the horsepower got split
into four intervals with cut-offs at 120, 134 and 175. The height, for instance,
was left with a single interval and thus removed.

Left of the list, we can select a specific discretization method for each attribute.
The attribute "Stroke" would be removed by the MDL-based discretization, so to keep it,
we select the attribute and click, for instance, :obj:`Equal-frequency
discretization`. We did the same for "bore", while we decided to keep "engine-size"
continuous.

Besides using the automatic discretization methods, it is possible to manually enter
a set of cut-off points. One can specify up to three different manual discretizations
for each attribute (:obj:`Custom 1`, :obj:`Custom 2` and
:obj:`Custom 3`), for instance to play with different settings
and see their consequences further on in the schema.

A likely scenario is that an automatic discretization finds boundaries which
are (unnecessarily) not round numbers, like 97 instead of 100, or which are close to
some established standard threshold, like 37.3 C instead of 37 C for body temperature.
Clicking the corresponding button pastes the current boundaries into the line, where one can edit
them manually.

The bottom part helps to manually determine a set of suitable cut-off points. The graph
can show two curves, the discretization gain and the target class probability. Both can be switched
off by unchecking :obj:`Show discretization gain` and
:obj:`Show target class probability`, respectively.

Discretization gain is the quality estimate of the attribute if a new cut-off point is added at a specific
attribute value. In the snapshot, if we split the lowest interval at just above 100, the
new, five-interval attribute's information gain would be 0.495 higher than that of the current
four-interval attribute. The widget supports different functions for the
:obj:`Split gain measure`, namely :obj:`Information
gain`, :obj:`Gini index`, :obj:`chi-square`
(the statistic), :obj:`chi-square prob.` (the associated probability),
:obj:`ReliefF` and :obj:`Relevance`.
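
For the information gain measure, one point of such a curve could be computed as in the sketch below. It reads the gain as the difference between the information gain with and without the candidate cut-off, as in the example above. The function names and the data are hypothetical, and the widget's other measures and its scaling are not covered.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, cutoffs):
    """Information gain of the attribute discretized at sorted `cutoffs`."""
    bins = {}
    for v, c in zip(values, labels):
        bins.setdefault(bisect_right(cutoffs, v), []).append(c)
    n = len(labels)
    remainder = sum(len(b) / n * entropy(b) for b in bins.values())
    return entropy(labels) - remainder


def gain_of_new_cutoff(values, labels, cutoffs, candidate):
    """Rise in information gain if `candidate` is added to `cutoffs`."""
    extended = sorted(cutoffs + [candidate])
    return (information_gain(values, labels, extended)
            - information_gain(values, labels, cutoffs))


values = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up attribute values
labels = list("aabbaabb")           # made-up class labels
# The single cut-off at 4.5 is uninformative; adding 2.5 separates
# the first four examples perfectly and raises the gain by 0.5 bits.
print(gain_of_new_cutoff(values, labels, [4.5], 2.5))
```

Evaluating `gain_of_new_cutoff` for every candidate value along the attribute's range gives the whole curve.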

Checking :obj:`Show lookahead gain` adds another curve which shows
what the gain curve would look like after a cut-off at a certain point is added. (To see
what this means, try dragging an existing cut-off. You will see the gain, as it is, and another,
thinner gain line. After releasing the threshold, the thin line becomes the gain line,
except for the scaling.)

The class probability is shown with the grey curve and corresponds to the scale on the
right-hand side of the graph. In case of discrete classes, the target class can be any
of the original classes, while for a discretized continuous class, it is one of the intervals
(*< 18545.33* in our case). :obj:`Show rug` adds small lines at the bottom
and the top of the graph, which represent histograms showing the number of examples in the
target class (top) and the other classes (bottom). In the snapshot, the examples of the
target class (*< 18545.33*) are concentrated between 50 and 120, while the rarer examples
of other classes are spread between 100 and 200, with an outlier at 250. Plotting the rug
can be slow if the number of examples is large.

It is possible to add new cut-offs by clicking on the graph, remove them by right-clicking,
and drag them around. The discretization defined in this way is stored as a custom discretization.
Changes of the thresholds in the graph are copied to the custom line immediately if
:obj:`Apply on the fly` is checked. Otherwise, they are copied only when
the user clicks :obj:`Apply`.

The same holds in the other direction: when an attribute is selected in the list,
its graph, including the thresholds, is shown, and any changes
of the cut-off points are reflected in the graph as well.

Examples
--------

In the schema below we show the Iris data set with continuous
attributes (as in the original data file) and with discretized attributes.

.. image:: images/Discretize-Example-S.gif
   :alt: Schema with Discretize widget