source: orange/Orange/doc/ofb/o_categorization.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 15.0 KB checked in by anze <anze.staric@…>, 2 years ago

Moved orange to Orange (part 2)

<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="other.htm">Other Techniques for Orange Scripting</a>,
Next: <a href="o_fss.htm">Feature Subset Selection</a>,
Up: <a href="other.htm">Other Techniques for Orange Scripting</a>
</p>

<H1>Discretization</H1>
<index name="discretization">
<index name="class/EntropyDiscretization">
<index name="class/EquiNDiscretization">

<p>Data discretization is a procedure that takes a data set and
converts all of its continuous attributes to categorical ones. Orange's
core supports three discretization methods: discretization with
equal-width intervals (<code>orange.EquiDistDiscretization</code>),
discretization with equal-frequency intervals
(<code>orange.EquiNDiscretization</code>), and class-aware
discretization as introduced by Fayyad &amp; Irani (AAAI92), which uses
MDL and entropy to find the best cut-off points
(<code>orange.EntropyDiscretization</code>). The discretization methods
are invoked by calling the preprocessor
<code>orange.Preprocessor_discretize</code>, which takes a data set and
a discretization method and returns a data set in which every
continuous attribute has been discretized.</p>

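<p>To make the difference between equal-width and equal-frequency
intervals concrete, here is a minimal sketch in plain Python. It does
not use Orange and is not Orange's implementation; the function names
and the sample values are made up for illustration, and ties between
equal values are not handled specially.</p>

```python
def equal_width_cutoffs(values, n_intervals):
    """Cut-offs that split [min, max] into n_intervals bins of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / float(n_intervals)
    return [lo + i * step for i in range(1, n_intervals)]


def equal_frequency_cutoffs(values, n_intervals):
    """Cut-offs that put roughly the same number of values into each bin.

    Each cut-off is the midpoint between two neighbouring sorted values.
    """
    ordered = sorted(values)
    n = len(ordered)
    cutoffs = []
    for i in range(1, n_intervals):
        k = i * n // n_intervals          # index of the first value in bin i
        cutoffs.append((ordered[k - 1] + ordered[k]) / 2.0)
    return cutoffs


# made-up sample with one outlier (9.0)
sample = [1.0, 1.2, 1.4, 1.5, 3.0, 4.5, 4.7, 5.1, 9.0]
width_cuts = equal_width_cutoffs(sample, 3)
freq_cuts = equal_frequency_cutoffs(sample, 3)
```

<p>With the outlier in the sample, the equal-width cut-offs are pulled
towards the large value, while the equal-frequency cut-offs still
leave three values in each interval.</p>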
<p>In machine learning and data mining, discretization may be used for
different purposes. It may be interesting to find informative cut-off
points in the data (for instance, finding that the cut-off for blood
acidity is 7.3 may mean something to physicians). Discretization may
also enable the use of learning algorithms that cannot handle
continuous attributes (for instance, naive Bayes in Orange does not
handle continuous-valued attributes).</p>

<h2>Discretization of a Complete Data Set</h2>

<p>Here is an Orange script that illustrates the basic use of
Orange's discretization functions:</p>

<p class="header"><a href="disc.py">disc.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange

def show_values(data, heading):
    # print the names of the values of every attribute in the domain
    print heading
    for a in data.domain.attributes:
        print "%s: %s" % (a.name, ", ".join(a.values))

data = orange.ExampleTable("iris")

# class-aware (Fayyad-Irani) discretization of all continuous attributes
data_ent = orange.Preprocessor_discretize(data,
  method=orange.EntropyDiscretization())
show_values(data_ent, "Entropy based discretization")
print

# equal-frequency discretization into three intervals
data_n = orange.Preprocessor_discretize(data,
  method=orange.EquiNDiscretization(numberOfIntervals=3))
show_values(data_n, "Equal-frequency intervals")
</xmp>

<p>The script uses two types of discretization, Fayyad-Irani's method
and equal-frequency interval discretization. Its output is given
below. Note that Orange also renames each discretized attribute by
adding the prefix &ldquo;D_&rdquo; to the original name. Further notice
that with Fayyad-Irani's discretization, all four attributes were
found to have at least two meaningful cut-off points.</p>

<xmp class="code">Entropy based discretization
D_sepal length: <=5.50, (5.50000, 5.70000], >5.70
D_sepal width: <=2.90, (2.90000, 3.00000], (3.00000, 3.30000], >3.30
D_petal length: <=1.90, (1.90000, 4.70000], >4.70
D_petal width: <=0.60, (0.60000, 1.70000], >1.70

Equal-frequency intervals
D_sepal length: <=5.35, (5.35000, 6.25000], >6.25
D_sepal width: <=2.85, (2.85000, 3.15000], >3.15
D_petal length: <=1.80, (1.80000, 4.85000], >4.85
D_petal width: <=0.55, (0.55000, 1.55000], >1.55
</xmp>

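<p>The class-aware method chooses cut-offs by looking at the class
labels. The following plain-Python sketch shows only the core step,
finding the single cut-off that minimizes the weighted class entropy of
the two resulting parts. Fayyad-Irani's method applies such a step
recursively and uses an MDL criterion to decide when to stop, which is
left out here. The function names and the toy data are assumptions made
for illustration, not Orange code.</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(labels).values())


def best_entropy_cutoff(values, labels):
    """Return (cutoff, weighted_entropy) of the best single binary split.

    Candidate cut-offs are midpoints between distinct neighbouring values.
    """
    pairs = sorted(zip(values, labels))
    n = float(len(pairs))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or score < best[1]:
            cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (cutoff, score)
    return best


# toy data: two perfectly separable classes
petal_length = [1.4, 1.5, 1.2, 1.3, 4.7, 3.0, 4.5, 5.1]
species = ["setosa"] * 4 + ["versicolor"] * 4
cutoff, score = best_entropy_cutoff(petal_length, species)
```

<p>For this toy sample the best cut-off lies halfway between 1.5 and
3.0, where both parts contain a single class and the weighted entropy
drops to zero.</p>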
<h2>Attribute-Specific Discretization</h2>

<p>In the example above, all continuous attributes were discretized
using the same method. This may be fine (in fact, this is how
discretization is most often done in machine learning), but it is not
the right approach if you want to tailor discretization to specific
attributes. In that case you may want to apply different kinds of
discretization. The idea is to discretize each attribute
separately, and then use the newly crafted attributes to form the
domain for the new data set. We have not told you anything about
working with example domains, so if you want to learn more about this,
jump to the <a href="domain.htm">Basic Data Manipulation</a> section of
this tutorial, and then come back. Those of you who trust us in what
we are doing can just read on.</p>

<p>In Orange, when converting examples (transforming one data set to
another), an attribute's values can be computed from the values of
other attributes when needed. This is exactly how discretization
works. Let's take the iris data set again. We shall replace
<code>petal length</code> with a quartile-discretized attribute called
<code>pl</code>. For <code>sepal length</code>, we'll keep the
original attribute, but add one attribute discretized using quartiles
(<code>sl</code>) and one using Fayyad-Irani's algorithm
(<code>sl_ent</code>). We shall also keep the original (continuous)
attribute <code>sepal width</code>. Here is the code:</p>

<p class="header"><a href="disc2.py">disc2.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange

def printexamples(data, inxs, msg="First %i examples"):
  # print the examples with the given indices
  print msg % len(inxs)
  for i in inxs:
    print i, data[i]
  print

iris = orange.ExampleTable("iris")

equiN = orange.EquiNDiscretization(numberOfIntervals=4)
entropy = orange.EntropyDiscretization()

# each call returns the descriptor of a new, discretized attribute
pl = equiN("petal length", iris)
sl = equiN("sepal length", iris)
sl_ent = entropy("sepal length", iris)

inxs = [0, 15, 35, 50, 98]
# build the new data set from original and discretized attributes
d_iris = iris.select(["sepal width", pl, "sepal length", sl, sl_ent,
                      iris.domain.classVar])
printexamples(iris, inxs, "%i examples before discretization")
printexamples(d_iris, inxs, "%i examples after discretization")
</xmp>

<p>And here is the output of this script:</p>

<xmp class="code">5 examples before discretization
0 [5.100000, 3.500000, 1.400000, 0.200000, 'Iris-setosa']
15 [5.700000, 4.400000, 1.500000, 0.400000, 'Iris-setosa']
35 [5.000000, 3.200000, 1.200000, 0.200000, 'Iris-setosa']
50 [7.000000, 3.200000, 4.700000, 1.400000, 'Iris-versicolor']
98 [5.100000, 2.500000, 3.000000, 1.100000, 'Iris-versicolor']

5 examples after discretization
0 [3.500000, '<=1.55', 5.100000, '(5.05, 5.75]', '<=5.50', 'Iris-setosa']
15 [4.400000, '<=1.55', 5.700000, '(5.05, 5.75]', '(5.50, 6.10]', 'Iris-setosa']
35 [3.200000, '<=1.55', 5.000000, '<=5.05', '<=5.50', 'Iris-setosa']
50 [3.200000, '(4.45, 5.25]', 7.000000, '>6.35', '>6.10', 'Iris-versicolor']
98 [2.500000, '(1.55, 4.45]', 5.100000, '(5.05, 5.75]', '<=5.50', 'Iris-versicolor']
</xmp>

<p>Again, <code>EquiNDiscretization</code> and
<code>EntropyDiscretization</code> are two of the classes that perform
different kinds of discretization: the first prepares four
equal-frequency intervals (quartiles), and the second performs
Fayyad-Irani's discretization based on entropy and MDL. Both are
derived from a common ancestor <code>Discretization</code>; another
discretization we could use is <code>EquiDistDiscretization</code>,
which discretizes into the given number of intervals of equal
width.</p>

<p>When called with an attribute (name, index or descriptor) and an
example set, a discretization object prepares a descriptor of the
discretized attribute. The constructed attribute is able to compute
its value from the value of the original continuous attribute, and
this is why conversion by <code>select</code> can work.</p>

<p>The names of a discretized attribute's values tell the boundaries of
the intervals. The output is thus informative, but not easily readable.
You can, however, always change the names of the values, as long as
the number of values remains the same. Adding the line</p>

<xmp class="code">pl.values = sl.values = ["very low", "low", "high", "very high"]
</xmp>

<p>to our code after these two attributes are introduced (the new
script is in <a href="disc3.py">disc3.py</a>) changes the second part
of the output to:</p>

<xmp class="code">5 examples after discretization
0 [3.500000, 'very low', 5.100000, 'low', '<=5.50', 'Iris-setosa']
15 [4.400000, 'very low', 5.700000, 'low', '(5.50, 6.10]', 'Iris-setosa']
35 [3.200000, 'very low', 5.000000, 'very low', '<=5.50', 'Iris-setosa']
50 [3.200000, 'high', 7.000000, 'very high', '>6.10', 'Iris-versicolor']
98 [2.500000, 'low', 5.100000, 'low', '<=5.50', 'Iris-versicolor']
</xmp>

<p>Want to know the cut-off points for the discretized attributes?
This requires a little knowledge about the computation mechanics. How
does a discretized attribute know from which attribute it should
compute its values, and how? An attribute descriptor has a property
<code>getValueFrom</code>, which is a kind of classifier (it can indeed
be a classifier!) that is given an original example and returns the
value for the attribute. When converting examples from one domain to
another, <code>getValueFrom</code> is called for all attributes of
the new domain that do not occur in the original one.
<code>getValueFrom</code> takes the value of the original attribute and
calls its property <code>transformer</code> to discretize it.</p>

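<p>The mechanism can be mimicked in a few lines of plain Python. The
sketch below is only a model of the idea and not Orange's
implementation: the classes <code>IntervalTransformer</code> and
<code>DerivedAttribute</code> and the function <code>convert</code> are
hypothetical names, examples are modelled as dictionaries, and the
choice that a value equal to a cut-off falls into the lower interval is
an assumption based on value names such as <code>(5.50, 6.10]</code>.</p>

```python
import bisect


class IntervalTransformer(object):
    """Hypothetical transformer: maps a number to the index of its interval."""

    def __init__(self, points):
        self.points = points

    def __call__(self, value):
        # bisect_left puts a value equal to a cut-off into the lower interval
        return bisect.bisect_left(self.points, value)


class DerivedAttribute(object):
    """Hypothetical descriptor that computes its value from another attribute."""

    def __init__(self, name, source_name, transformer, values):
        self.name = name
        self.source_name = source_name
        self.transformer = transformer
        self.values = values

    def get_value_from(self, example):
        # plays the role described for getValueFrom: original example in,
        # value of the derived attribute out
        index = self.transformer(example[self.source_name])
        return self.values[index]


def convert(example, new_attributes):
    """Copy attributes that exist in the example, compute the derived ones."""
    converted = {}
    for attr in new_attributes:
        if isinstance(attr, DerivedAttribute):
            converted[attr.name] = attr.get_value_from(example)
        else:
            converted[attr] = example[attr]
    return converted


sl_ent = DerivedAttribute("D_sepal length", "sepal length",
                          IntervalTransformer([5.5, 6.1]),
                          ["<=5.50", "(5.50, 6.10]", ">6.10"])
row = {"sepal length": 5.7, "sepal width": 4.4}
new_row = convert(row, ["sepal width", sl_ent])
```

<p>Because the derived value is computed only at conversion time, both
the value names and the cut-off points of such an attribute can still
be changed before a new data set is built.</p>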
<p>Both <code>EquiNDiscretization</code> and
<code>EntropyDiscretization</code> construct transformer objects of
type <code>IntervalDiscretizer</code>. Its cut-off points are stored in
the list <code>points</code>:</p>

<p class="header"><a href="disc4.py">disc4.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange
iris = orange.ExampleTable("iris")

equiN = orange.EquiNDiscretization(numberOfIntervals=4)
entropy = orange.EntropyDiscretization()

pl = equiN("petal length", iris)
sl = equiN("sepal length", iris)
sl_ent = entropy("sepal length", iris)

# the transformer of each discretized attribute stores its cut-off points
for attribute in [pl, sl, sl_ent]:
  print "Cut-off points for", attribute.name, \
    "are", attribute.getValueFrom.transformer.points
</xmp>

<p>Here's the output:</p>

<xmp class="code">Cut-off points for D_petal length are <1.54999995232, 4.44999980927, 5.25>
Cut-off points for D_sepal length are <5.05000019073, 5.75, 6.34999990463>
Cut-off points for D_sepal length are <5.5, 6.09999990463>
</xmp>

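<p>These points are all that is needed to rebuild value names of the
kind seen in the earlier printouts. The helper below is a hypothetical
plain-Python sketch and not the routine Orange uses; the two-decimal
format is an assumption chosen to match names such as
<code>(1.55, 4.45]</code>.</p>

```python
def interval_labels(points, decimals=2):
    """Build names like '<=1.55', '(1.55, 4.45]', '>5.25' from cut-off points."""
    if not points:
        raise ValueError("at least one cut-off point is required")
    fmt = "%%.%df" % decimals
    labels = ["<=" + fmt % points[0]]
    for lo, hi in zip(points, points[1:]):
        labels.append("(%s, %s]" % (fmt % lo, fmt % hi))
    labels.append(">" + fmt % points[-1])
    return labels


# cut-off points printed for D_petal length
labels = interval_labels([1.54999995232, 4.44999980927, 5.25])
```

<p>Three cut-off points always yield four names, which is why the list
assigned to <code>values</code> must keep its length when the values
are renamed.</p>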
<p>Sometimes you may not like the cut-offs suggested by the functions
in Orange. In fact, domain experts tend to prefer cut-offs that are at
least rounded, if not changed to something else entirely. To do
this, simply assign new values to the cut-off points. Remember that
crafting a new attribute (like <code>sl</code>) only specifies the
domain of the attribute and how it is derived. We have not yet created
a data set with this attribute, so this is the right time to change
what the discretization will actually do to the data. In the following
example, we round the cut-off points for the attribute
<code>pl</code>. (A note is in place here: <code>pl</code> is a Python
variable that refers to our attribute. The name of this attribute is
derived from the name of the original attribute
(<code>petal length</code>) by adding the prefix <code>D_</code>. If you
do not like this, you can change the name by assigning something else
to it, for example <code>pl.name="pl"</code>.)</p>

<p class="header"><a href="disc5.py">disc5.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange
iris = orange.ExampleTable("iris")

equiN = orange.EquiNDiscretization(numberOfIntervals=4)
entropy = orange.EntropyDiscretization()

pl = equiN("petal length", iris)
sl = equiN("sepal length", iris)
sl_ent = entropy("sepal length", iris)

# replace the cut-off points of pl with their rounded values
points = pl.getValueFrom.transformer.points
points2 = [round(x) for x in points]
pl.getValueFrom.transformer.points = points2

for attribute in [pl, sl, sl_ent]:
  print "Cut-off points for", attribute.name, \
    "are", attribute.getValueFrom.transformer.points
</xmp>

<p>Don't try this with attributes discretized by
<code>EquiDistDiscretization</code>. Instead of
<code>IntervalDiscretizer</code>, it uses
<code>EquiDistDiscretizer</code>, which has the fields
<code>firstVal</code>, <code>step</code> and
<code>numberOfIntervals</code> rather than a list of points.</p>

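<p>Why is there no list of points to assign to? With equal-width
intervals, a starting value, a step and the number of intervals are
enough to determine every cut-off. The sketch below illustrates this in
plain Python. It is not Orange's <code>EquiDistDiscretizer</code>: the
class name is hypothetical, and treating the first value as the first
cut-off, with inclusive upper bounds, is an assumption made for this
illustration.</p>

```python
import math


class EquiDistSketch(object):
    """Hypothetical equal-width discretizer defined by three numbers."""

    def __init__(self, first_val, step, number_of_intervals):
        self.first_val = first_val                    # assumed: the first cut-off
        self.step = step                              # width of each inner interval
        self.number_of_intervals = number_of_intervals

    def cutoffs(self):
        """Cut-offs implied by the three fields; they are never stored."""
        return [self.first_val + i * self.step
                for i in range(self.number_of_intervals - 1)]

    def __call__(self, value):
        """Index of the interval that contains value (upper bounds inclusive)."""
        if value <= self.first_val:
            return 0
        index = int(math.ceil((value - self.first_val) / float(self.step)))
        return min(index, self.number_of_intervals - 1)


disc = EquiDistSketch(first_val=2.0, step=2.0, number_of_intervals=3)
indices = [disc(v) for v in [1.4, 3.0, 4.0, 4.7, 10.0]]
```

<p>In such a representation, changing the discretization means changing
the three fields, and the cut-offs follow from them.</p>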
<h2>Manual Discretization</h2>

<p>What we have done above is very close to manual discretization,
except that the number of intervals was the one suggested by
<code>EquiNDiscretization</code>. To do everything manually, we need
to construct the same structures as the discretization algorithms
described above. We need to define a descriptor, along with its
<code>name</code>, <code>type</code>, <code>values</code> and
<code>getValueFrom</code>. The <code>getValueFrom</code> should use an
<code>IntervalDiscretizer</code> as its transformer, and with it we
specify the cut-off points.</p>

<p>Let's now discretize the iris attribute <code>petal length</code>
into a new attribute <code>pl</code>, using three intervals with
cut-off points 2.0 and 4.0.</p>

<p class="header"><a href="disc6.py">disc6.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange

def printexamples(data, inxs, msg="First %i examples"):
  # print the examples with the given indices
  print msg % len(inxs)
  for i in inxs:
    print data[i]
  print

iris = orange.ExampleTable("iris")
pl = orange.EnumVariable("pl")

# classifier that reads "petal length" and discretizes it
getValue = orange.ClassifierFromVar()
getValue.whichVar = iris.domain["petal length"]
getValue.classVar = pl
getValue.transformer = orange.IntervalDiscretizer()
getValue.transformer.points = [2.0, 4.0]

pl.getValueFrom = getValue
pl.values = ['low', 'medium', 'high']
d_iris = iris.select(["petal length", pl, iris.domain.classVar])
printexamples(d_iris, [0, 15, 35, 50, 98], "%i examples after discretization")
</xmp>

<p>Notice that we have also named each of the three intervals, and
constructed a data set that shows both the original and the
discretized attribute:</p>

<xmp class="code">5 examples after discretization
[1.400000, 'low', 'Iris-setosa']
[1.500000, 'low', 'Iris-setosa']
[1.200000, 'low', 'Iris-setosa']
[4.700000, 'high', 'Iris-versicolor']
[3.000000, 'medium', 'Iris-versicolor']
</xmp>

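<p>The printout can be cross-checked without Orange. The following
plain-Python sketch applies the same two cut-off points and three names
to the five petal lengths shown above. The function
<code>discretize</code> is a hypothetical helper, and the inclusive
upper bound is an assumption (none of the five values lies exactly on
a cut-off, so it does not affect this check).</p>

```python
import bisect


def discretize(value, points, names):
    """Name of the interval containing value; upper bounds are inclusive."""
    if len(names) != len(points) + 1:
        raise ValueError("need exactly one more name than cut-off points")
    return names[bisect.bisect_left(points, value)]


points = [2.0, 4.0]
names = ["low", "medium", "high"]
petal_lengths = [1.4, 1.5, 1.2, 4.7, 3.0]      # the five values printed above
result = [discretize(v, points, names) for v in petal_lengths]
```

<p>The length check mirrors the rule that the number of value names
must match the number of intervals, that is, one more than the number
of cut-off points.</p>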
<h2>Applying Discretization on the Test Set</h2>

<p>In machine learning, you would often discretize the learning
set. How does one then apply the same discretization to the test set?
For discretized attributes, Orange remembers how they were
converted from their original continuous versions, so you only need to
convert the testing examples to the new (discretized) domain. The
following code shows how:</p>

<p class="header"><a href="disc7.py">disc7.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange
data = orange.ExampleTable("iris")

# split the data into a learning and a test set
ind = orange.MakeRandomIndices2(data, p0=6)
learn = data.select(ind, 0)
test = data.select(ind, 1)

# discretize the learning set, then use its new domain
# to discretize the test set
learnD = orange.Preprocessor_discretize(learn, method=orange.EntropyDiscretization())
testD = orange.ExampleTable(learnD.domain, test)

print "Test set, original:"
for i in range(3):
    print test[i]

print "Test set, discretized:"
for i in range(3):
    print testD[i]
</xmp>

<p>Following is the output of the above script:</p>

<xmp class="code">Test set, original:
[5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
[4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
[4.7, 3.2, 1.3, 0.2, 'Iris-setosa']
Test set, discretized:
['<=5.50', '>3.30', '<=1.90', '<=0.60', 'Iris-setosa']
['<=5.50', '(2.90, 3.30]', '<=1.90', '<=0.60', 'Iris-setosa']
['<=5.50', '(2.90, 3.30]', '<=1.90', '<=0.60', 'Iris-setosa']
</xmp>
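<p>The principle behind this, that cut-offs are derived from the
learning set only and then reused unchanged on the test set, can be
sketched in plain Python. This is an illustration under stated
assumptions and not what Orange does internally: the helper names and
the data are made up, and equal-frequency cut-offs stand in for the
entropy-based ones to keep the sketch short.</p>

```python
import bisect


def fit_equal_frequency(values, n_intervals):
    """Derive cut-off points from the learning values only."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i in range(1, n_intervals):
        k = i * n // n_intervals          # index of the first value in bin i
        points.append((ordered[k - 1] + ordered[k]) / 2.0)
    return points


def apply_points(points, values):
    """Map values to interval indices using already fitted cut-off points."""
    return [bisect.bisect_left(points, v) for v in values]


# made-up learning and test values of one continuous attribute
learn_values = [4.9, 5.0, 5.1, 5.7, 6.3, 7.0]
test_values = [4.7, 5.5, 7.9]

points = fit_equal_frequency(learn_values, 3)   # fitted on the learning set
test_bins = apply_points(points, test_values)   # reused on the test set
```

<p>Test values outside the range seen during learning (here 4.7 and
7.9) simply fall into the first or the last interval, since the
cut-offs are not recomputed.</p>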

</body>
</html>