source: orange/Orange/doc/widgets/Data/Discretize.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 9.1 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

<link rel=stylesheet href="../../../style.css" type="text/css" media=screen>
<link rel=stylesheet href="style-print.css" type="text/css" media=print></link>
<img class="screenshot" src="../icons/Discretize.png">
<p>Discretizes continuous attributes from the input data set.</p>
<DL class=attributes>
<DT>Examples (ExampleTable)</DT>
<DD>Attribute-valued data set.</DD>
<DL class=attributes>
<DT>Examples (ExampleTable)</DT>
<DD>Attribute-valued data set in which the continuous attributes of the input data set are replaced by their discretized versions.</DD>
<p>The Discretize widget receives a data set on its input, finds the
attributes that are continuous and discretizes them using the selected
method. It then outputs the same data set with the continuous attributes
replaced by their discretized versions.</p>
<table><tr><td valign="top"><img class="screenshot"
src="Discretize.png" alt="Discretize" border=0></td>
<td valign="top">
<p>The basic version of the widget is rather simple. It allows choosing between three
different discretizations. <span class="option">Entropy-MDL</span>, invented by
Fayyad and Irani, is a top-down discretization which recursively splits the attribute
at the cut giving the maximal information gain, until the gain is below the minimal
description length of the cut. This discretization can result in an arbitrary number of
intervals, including a single interval, in which case the attribute is discarded as useless.
<span class="option">Equal-frequency</span> splits the attribute into the given
<span class="option">Number of intervals</span>, so that each contains approximately the same
number of examples. <span class="option">Equal-width</span> evenly splits the range between
the smallest and the largest observed value into the given number of intervals.</p>
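<p>The two unsupervised methods can be illustrated with a minimal sketch in plain Python.
It does not use the widget's own code; the function names and the choice of placing each
equal-frequency cut-off halfway between two neighbouring values are assumptions made for
illustration, and the input is assumed to contain more values than intervals.</p>

```python
def equal_width_cuts(values, n_intervals):
    """Cut-offs that split [min, max] into n_intervals equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(1, n_intervals)]


def equal_frequency_cuts(values, n_intervals):
    """Cut-offs chosen so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, n_intervals):
        k = i * n // n_intervals            # index of the first value in bin i
        cut = (ordered[k - 1] + ordered[k]) / 2
        if not cuts or cut > cuts[-1]:      # skip duplicates caused by ties
            cuts.append(cut)
    return cuts


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]      # made-up data with one outlier
print(equal_width_cuts(values, 3))          # [34.0, 67.0]
print(equal_frequency_cuts(values, 3))      # [3.5, 6.5]
```

<p>With the outlier at 100, equal-width puts eight of the nine values into the first
interval, while equal-frequency puts three values into each interval.</p>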
<p>The widget can also be set to leave the attributes continuous or to remove them.</p>
<p>The <span class="option">Class discretization</span> box defines what happens with the class
value if it is continuous. We can use either equal frequency or equal width, or set custom
thresholds. The box also shows the current thresholds. In the snapshot, we have a
continuous class which is split into three intervals of equal width; their boundaries are
at 18545 and 31973.</p>
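<p>Since the three intervals are equally wide, the two boundaries are exactly one interval
width apart, which lets us estimate the range of the class from the thresholds alone. The
short sketch below does this arithmetic; the function name is hypothetical, and because the
thresholds shown are rounded, the results are only approximate.</p>

```python
def range_from_equal_width_cuts(cuts):
    """Estimate (lowest value, highest value, interval width) from equal-width cut-offs."""
    width = cuts[1] - cuts[0]       # neighbouring cut-offs are one width apart
    low = cuts[0] - width           # the first interval starts one width below the first cut
    high = cuts[-1] + width         # the last interval ends one width above the last cut
    return low, high, width


print(range_from_equal_width_cuts([18545, 31973]))   # (5117, 45401, 13428)
```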
<p>Since some discretization and other methods supported by the widget need discrete class
values, the class is always discretized. <span class="option">Output original class</span>
decides whether the data produced by the widget will contain the original, continuous dependent
variable or the discretized one.</p>
<p>As usual, the widget can be set to immediately apply any changes or to commit them when
the user presses <span class="option">Commit</span>.</p>
<p>Three discretization methods are supported. Continuous attributes
are discretized either using a set of intervals of the same size
(<span class="option">Equal-Width Intervals</span>) or using a set of intervals whose
borders are defined so that each interval covers an approximately equal
number of data instances (<span class="option">Equal-Frequency Intervals</span>). In both
cases the number of intervals is user-defined.</p>
<p>A different technique is <span class="option">Entropy-based
discretization</span>, which works only if the input data set includes a
discrete class. It finds intervals that minimize the entropy
of the class variable (that is, each interval tends to include instances of
one prevailing class). The algorithm used is that of Fayyad and Irani
(1992). One possible outcome of the algorithm is that no appropriate
cut-off points are found; the attribute is then reduced to a constant
and can be removed from the data set. Attributes of this kind are
listed under <span class="option">Removed Attributes</span>.</p>
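<p>The recursion can be sketched as follows. This is a plain-Python illustration of the
entropy-based splitting with the MDL stopping criterion as it is commonly stated for the
Fayyad and Irani method, not the code used by the widget; the function names are
hypothetical. An empty result corresponds to an attribute with a single interval, that is,
one that would be removed.</p>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def mdl_cuts(pairs):
    """Recursively split sorted (value, class) pairs; return the accepted cut-offs."""
    n = len(pairs)
    labels = [c for _, c in pairs]
    ent = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # cannot cut between equal values
        left, right = labels[:i], labels[i:]
        gain = ent - (i * entropy(left) + (n - i) * entropy(right)) / n
        if best is None or gain > best[0]:
            best = (gain, i, left, right)
    if best is None:
        return []
    gain, i, left, right = best
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * ent - k1 * entropy(left) - k2 * entropy(right))
    if gain <= (log2(n - 1) + delta) / n:     # gain does not pay for the cut: stop
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return mdl_cuts(pairs[:i]) + [cut] + mdl_cuts(pairs[i:])


def entropy_mdl(values, classes):
    """Cut-offs for one attribute; an empty list means a single interval."""
    return mdl_cuts(sorted(zip(values, classes)))


# Made-up data: the class changes cleanly between 10 and 11.
print(entropy_mdl(range(1, 21), ["a"] * 10 + ["b"] * 10))   # [10.5]
# Made-up data: alternating classes carry no usable cut-off.
print(entropy_mdl(range(1, 9), ["a", "b"] * 4))             # []
```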
<p>Depending on the user's settings, the widget can display either the
discretization intervals or the cut-off points, that is, interval borders.</p>
<p>Up until now, the same settings were used for all attributes. To treat
each attribute differently, click <span class="option">Explore and set individual
discretizations</span>. This opens another part of the widget.</p>
<img src="Discretize-All.png">
<p><span class="option">Individual attribute settings</span> shows the specific
discretization of each attribute and allows for changing it. First, the top left
list shows the cut-off points for each attribute. In the snapshot we used the
entropy-MDL discretization which determines the optimal number of intervals
automatically: we see it discretized the length and width into two intervals
with cut-offs at 186.70 and 68.40, respectively, while the horsepower got split
into four intervals with cut-offs at 120, 134 and 175. The height, for instance,
was left with a single interval and thus removed.</p>
<p>Left of the list, we can select a specific discretization method for each attribute.
Attribute "Stroke" would be removed by the MDL-based discretization, so to keep it,
we select the attribute and click, for instance, <span class="option">Equal-frequency
discretization</span>. We did the same for "bore", while we decided to keep the "engine-size"
<p>Besides using the automatic discretization methods, it is possible to manually enter
a set of cut-off points. One can specify up to three different manual discretizations
for each attribute (<span class="option">Custom 1</span>, <span class="option">Custom 2</span>,
<span class="option">Custom 3</span>), for instance to play with different settings
and see their consequences further on in the schema.</p>
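<p>To see what a manually entered set of cut-off points does to the data, here is a small
sketch that maps values to intervals using Python's standard <code>bisect</code> module.
The cut-offs are the horsepower thresholds mentioned above; the sample values, the label
format and the convention that a value equal to a cut-off falls into the upper interval are
assumptions made for illustration.</p>

```python
from bisect import bisect_right


def interval_labels(cuts):
    """Readable names for the len(cuts) + 1 intervals defined by sorted cut-offs."""
    labels = ["<%g" % cuts[0]]
    labels += ["[%g, %g)" % (a, b) for a, b in zip(cuts, cuts[1:])]
    labels.append(">=%g" % cuts[-1])
    return labels


def discretize(values, cuts):
    """Replace each value with the label of the interval it falls into."""
    labels = interval_labels(cuts)
    return [labels[bisect_right(cuts, v)] for v in values]


cuts = [120, 134, 175]                          # cut-offs from the snapshot
print(discretize([100, 120, 150, 200], cuts))   # hypothetical horsepower values
# ['<120', '[120, 134)', '[134, 175)', '>=175']
```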
<p>A likely scenario is that an automatic discretization finds boundaries which
are (unnecessarily) not round numbers, like 97 instead of 100, or which are close to,
but not exactly at, some established standard threshold, like 37.3 C instead of 37 C for
body temperature. Clicking the corresponding button pastes the current boundaries into the
line, where one can edit them manually.</p>
<p>The bottom part helps to manually determine a set of suitable cut-off points. The graph
can show two curves, the discretization gain and the target class probability. Both can be switched
off by unchecking <span class="option">Show discretization gain</span> and
<span class="option">Show target class probability</span>, respectively.</p>
<p>Discretization gain is the increase in the quality estimate of the attribute if a new cut-off point is added at a specific
attribute value. In the snapshot, if we split the lowest interval at just above 100, the
new, five-interval attribute's information gain would be 0.495 higher than that of the current
four-interval attribute. The widget supports different functions for the
<span class="option">Split gain measure</span>, that is, <span class="option">Information
gain</span>, <span class="option">Gini index</span>, <span class="option">chi-square</span>
(the statistic) and <span class="option">chi-square prob.</span> (the associated probability),
<span class="option">ReliefF</span> and <span class="option">Relevance</span>.</p>
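<p>For the information gain measure, the value plotted at a given point can be understood
through the following sketch: compute the information gain of the attribute with the
current cut-offs, compute it again with one extra cut-off, and take the difference. This is
one plausible reading of the description above, written in plain Python with hypothetical
function names and made-up data; it is not the widget's own code.</p>

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(values, classes, cuts):
    """Information gain of the attribute discretized at the given sorted cut-offs."""
    bins = {}
    for v, c in zip(values, classes):
        bins.setdefault(bisect_right(cuts, v), []).append(c)
    n = len(classes)
    return entropy(classes) - sum(len(b) / n * entropy(b) for b in bins.values())


def discretization_gain(values, classes, cuts, new_cut):
    """How much the information gain rises if new_cut is added to cuts."""
    extended = sorted(cuts + [new_cut])
    return info_gain(values, classes, extended) - info_gain(values, classes, cuts)


values = [1, 2, 3, 4, 5, 6]
classes = ["a", "a", "b", "b", "c", "c"]
# One cut-off at 2.5 separates "a"; adding 4.5 also separates "b" from "c".
print(round(discretization_gain(values, classes, [2.5], 4.5), 3))   # 0.667
```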
<p>Checking <span class="option">Show lookahead gain</span> adds another curve which shows
what the gain curve would look like after a cut-off at a certain point is added. (To see
what this means, try dragging an existing cut-off. You will see the gain, as it is, and another,
thinner gain line. After releasing the threshold, the thin line becomes the gain line,
except for the scaling.)</p>
<p>The class probability is shown with the grey curve and corresponds to the scale on the
right-hand side of the graph. In case of discrete classes, the target class can be any
of the original classes, while for discretized classes, it is one of the intervals
(&lt;18545.33 in our case). <span class="option">Show rug</span> adds small lines at the bottom
and the top of the graph, which represent histograms showing the number of examples in the
target class (top) and the other classes (bottom). In the snapshot, the examples of the
target class (&lt;18545.33) are concentrated between 50 and 120, while the rarer examples
of other classes are spread between 100 and 200, with an outlier at 250. Plotting the rug
can be slow if the number of examples is huge.</p>
<p>It is possible to add new cut-offs by clicking on the graph, remove them by right-clicking,
and drag them around. The discretization defined in this way is stored as a custom discretization.
Changes of the thresholds in the graph can be instantaneously copied to the custom line if
<span class="option">Apply on the fly</span> is checked. Otherwise, they are copied only when
the user clicks <span class="option">Apply</span>.</p>
<p>The same holds in the other direction: when an attribute is selected in the list,
the corresponding curves, including the thresholds, are shown in the graph, and any changes
of cut-off points are reflected in the graph as well.</p>
<p>In the schema below we show the Iris data set with continuous
attributes (as in the original data file) and with discretized attributes.</p>
<a href="Discretize-Example.gif">Click to enlarge<br/><img src="Discretize-Example-S.gif"
alt="Schema with Discretize widget" class="screenshot" border=0></a>