source: orange/docs/tutorial/rst/discretization.rst @ 9385:fd37d2ce5541

Revision 9385:fd37d2ce5541, 14.0 KB checked in by mitar, 2 years ago (diff)

Cleaned up tutorial.

.. index:: discretization
.. index::
   single: discretization; entropy-based
.. index::
   single: discretization; frequency-based

Discretization
==============

Data discretization is a procedure that takes a data set and converts
all continuous attributes to categorical ones. Orange's core supports
three discretization methods: equal-width intervals
(``orange.EquiDistDiscretization``), equal-frequency intervals
(``orange.EquiNDiscretization``), and class-aware discretization as
introduced by Fayyad & Irani (AAAI92), which uses MDL and entropy to
find the best cut-off points (``orange.EntropyDiscretization``). The
discretization methods are invoked by calling the preprocessor
``orange.Preprocessor_discretize``, which takes a data set and a
discretization method and returns a data set in which every continuous
attribute has been discretized.

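To make the difference between the two unsupervised methods concrete,
here is a minimal sketch in plain Python that does not use Orange. The
helper names ``equal_width_cuts`` and ``equal_frequency_cuts`` are made
up for this illustration, and placing each equal-frequency cut-off
halfway between two neighbouring values is an assumption; Orange's own
implementations may choose slightly different cut-off points.

```python
def equal_width_cuts(values, n):
    """Cut-offs that split the range [min, max] into n intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / float(n)
    return [lo + step * i for i in range(1, n)]


def equal_frequency_cuts(values, n):
    """Cut-offs that put roughly the same number of values into each of n intervals.

    Each cut-off is placed halfway between two neighbouring sorted values
    (an assumption of this sketch, not necessarily what Orange does).
    """
    ordered = sorted(values)
    cuts = []
    for i in range(1, n):
        pos = i * len(ordered) // n
        cuts.append((ordered[pos - 1] + ordered[pos]) / 2.0)
    return cuts


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(equal_width_cuts(values, 2))
print(equal_frequency_cuts(values, 2))
```

With a single outlier (100), the equal-width cut-off lands at 50.5 and
leaves nine of the ten values in the first interval, while the
equal-frequency cut-off of 5.5 splits the values five and five.
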
In machine learning and data mining, discretization may be used for
different purposes. It may be interesting to find informative cut-off
points in the data (for instance, finding that the cut-off for blood
acidity is 7.3 may mean something to physicians). Discretization may
also enable the use of learning algorithms that cannot handle
continuous attributes (for instance, naive Bayes in Orange does not
handle continuous-valued attributes).

Data set discretization
-----------------------

Here is a script that demonstrates the basics of discretization in
Orange (:download:`disc.py <code/disc.py>`, uses :download:`iris.tab <code/iris.tab>`)::

   import orange

   def show_values(data, heading):
       # print the value names of every attribute in the data set
       print heading
       for a in data.domain.attributes:
           print "%s: %s" % (a.name, ", ".join(a.values))

   data = orange.ExampleTable("iris")

   # class-aware (Fayyad-Irani) discretization
   data_ent = orange.Preprocessor_discretize(data,
     method=orange.EntropyDiscretization())
   show_values(data_ent, "Entropy based discretization")
   print

   # equal-frequency discretization with three intervals
   data_n = orange.Preprocessor_discretize(data,
     method=orange.EquiNDiscretization(numberOfIntervals=3))
   show_values(data_n, "Equal-frequency intervals")

The script uses two types of discretization: Fayyad-Irani's method
and equal-frequency interval discretization. Its output is given
below. Note that Orange also renames each discretized attribute by
adding the prefix ``D_`` to the original name. Further notice that
with Fayyad-Irani's discretization, all four attributes were found to
have at least two meaningful cut-off points::

   Entropy based discretization
   D_sepal length: <=5.50, (5.50000, 5.70000], >5.70
   D_sepal width: <=2.90, (2.90000, 3.00000], (3.00000, 3.30000], >3.30
   D_petal length: <=1.90, (1.90000, 4.70000], >4.70
   D_petal width: <=0.60, (0.60000, 1.70000], >1.70

   Equal-frequency intervals
   D_sepal length: <=5.35, (5.35000, 6.25000], >6.25
   D_sepal width: <=2.85, (2.85000, 3.15000], >3.15
   D_petal length: <=1.80, (1.80000, 4.85000], >4.85
   D_petal width: <=0.55, (0.55000, 1.55000], >1.55

Attribute-specific discretization
---------------------------------

In the example above, all continuous attributes were discretized using
the same method. This is often fine (in fact, it is how discretization
is most commonly done in machine learning), but it may not be the
right approach if you want to tailor discretization to specific
attributes. In that case you may want to apply different kinds of
discretization. The idea is to discretize each attribute separately
and then use the newly crafted attributes to form the domain of the
new data set. We have not yet said anything about working with example
domains, but trust us and read on.

In Orange, when examples are converted (when one data set is
transformed into another), an attribute's value can be computed from
the values of other attributes when needed. This is exactly how
discretization works. Let's again take the iris data set. We shall
replace ``petal length`` by a quartile-discretized attribute called
``pl``. For ``sepal length``, we'll keep the original attribute, but
add two discretized versions of it: one using quartiles (``sl``) and
one using Fayyad-Irani's algorithm (``sl_ent``). We shall also keep the
original (continuous) attribute ``sepal width`` (from :download:`disc2.py <code/disc2.py>`, uses :download:`iris.tab <code/iris.tab>`)::

   import orange

   def printexamples(data, inxs, msg="First %i examples"):
     print msg % len(inxs)
     for i in inxs:
       print i, data[i]
     print

   iris = orange.ExampleTable("iris")

   equiN = orange.EquiNDiscretization(numberOfIntervals=4)
   entropy = orange.EntropyDiscretization()

   # each call returns the descriptor of a new, discretized attribute
   pl = equiN("petal length", iris)
   sl = equiN("sepal length", iris)
   sl_ent = entropy("sepal length", iris)

   inxs = [0, 15, 35, 50, 98]
   d_iris = iris.select(["sepal width", pl, "sepal length", sl, sl_ent, iris.domain.classVar])
   printexamples(iris, inxs, "%i examples before discretization")
   printexamples(d_iris, inxs, "%i examples after discretization")

The output of this script is::

   5 examples before discretization
   0 [5.100000, 3.500000, 1.400000, 0.200000, 'Iris-setosa']
   15 [5.700000, 4.400000, 1.500000, 0.400000, 'Iris-setosa']
   35 [5.000000, 3.200000, 1.200000, 0.200000, 'Iris-setosa']
   50 [7.000000, 3.200000, 4.700000, 1.400000, 'Iris-versicolor']
   98 [5.100000, 2.500000, 3.000000, 1.100000, 'Iris-versicolor']

   5 examples after discretization
   0 [3.500000, '<=1.55', 5.100000, '(5.05, 5.75]', '<=5.50', 'Iris-setosa']
   15 [4.400000, '<=1.55', 5.700000, '(5.05, 5.75]', '(5.50, 6.10]', 'Iris-setosa']
   35 [3.200000, '<=1.55', 5.000000, '<=5.05', '<=5.50', 'Iris-setosa']
   50 [3.200000, '(4.45, 5.25]', 7.000000, '>6.35', '>6.10', 'Iris-versicolor']
   98 [2.500000, '(1.55, 4.45]', 5.100000, '(5.05, 5.75]', '<=5.50', 'Iris-versicolor']

Again, ``EquiNDiscretization`` and ``EntropyDiscretization`` are two
of the classes that perform different kinds of discretization: the
first prepares four intervals bounded by quartiles, and the second
performs Fayyad-Irani's discretization based on entropy and MDL. Both
are derived from a common ancestor, ``Discretization``. Another
discretization we could use is ``EquiDistDiscretization``, which
discretizes into a given number of intervals of equal width.

When called with an attribute (name, index or descriptor) and an
example set, a discretization object prepares a descriptor of the
discretized attribute. The constructed attribute can compute its value
from the value of the original continuous attribute, which is why the
conversion by ``select`` works.

The names of a discretized attribute's values give the boundaries of
the intervals. The output is thus informative, but not easily
readable. You can, however, always change the names of the values, as
long as the number of values remains the same. Adding the line::

   pl.values = sl.values = ["very low", "low", "high", "very high"]

to our code after these two attributes are introduced (the new script is in
:download:`disc3.py <code/disc3.py>`) changes the second part of the output to::

   5 examples after discretization
   0 [3.500000, 'very low', 5.100000, 'low', '<=5.50', 'Iris-setosa']
   15 [4.400000, 'very low', 5.700000, 'low', '(5.50, 6.10]', 'Iris-setosa']
   35 [3.200000, 'very low', 5.000000, 'very low', '<=5.50', 'Iris-setosa']
   50 [3.200000, 'high', 7.000000, 'very high', '>6.10', 'Iris-versicolor']
   98 [2.500000, 'low', 5.100000, 'low', '<=5.50', 'Iris-versicolor']

Want to know the cut-off points for the discretized attributes? This
requires a little knowledge about the mechanics of the computation.
How does a discretized attribute know which attribute it should
compute its values from, and how? An attribute descriptor has a
property ``getValueFrom``, which is a kind of classifier (it can
indeed be a classifier!) that is given an original example and returns
the value of the attribute. When examples are converted from one
domain to another, ``getValueFrom`` is called for all attributes of
the new domain that do not occur in the original one. ``getValueFrom``
takes the value of the original attribute and calls its
``transformer`` property to discretize it.

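The mechanism described above can be mimicked in a few lines of plain
Python. This is only a sketch of the idea, not Orange's
implementation: the classes ``IntervalTransformer`` and
``DerivedAttribute`` and the function ``convert`` are hypothetical
names, and examples are represented as plain dictionaries.

```python
from bisect import bisect_left


class IntervalTransformer(object):
    """Hypothetical transformer: maps a number to the index of its interval."""

    def __init__(self, points):
        self.points = list(points)

    def __call__(self, value):
        # Intervals are closed on the right, so a value equal to a
        # cut-off falls into the lower interval.
        return bisect_left(self.points, value)


class DerivedAttribute(object):
    """Hypothetical descriptor that computes its value from another attribute."""

    def __init__(self, name, source, transformer):
        self.name = name
        self.source = source
        self.transformer = transformer

    def get_value_from(self, example):
        return self.transformer(example[self.source])


def convert(example, new_domain):
    """Build an example for new_domain, computing derived attributes on demand."""
    converted = {}
    for attribute in new_domain:
        if isinstance(attribute, DerivedAttribute):
            converted[attribute.name] = attribute.get_value_from(example)
        else:
            converted[attribute] = example[attribute]
    return converted


sl_ent = DerivedAttribute("D_sepal length", "sepal length",
                          IntervalTransformer([5.5, 6.1]))
example = {"sepal length": 5.7, "sepal width": 4.4}
converted = convert(example, ["sepal width", sl_ent])
print(converted)
```

The attribute that already exists in the original example is copied,
while the derived one is computed through its transformer; here 5.7
falls into the middle interval (index 1).
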
Both ``EquiNDiscretization`` and ``EntropyDiscretization`` construct
transformer objects of type ``IntervalDiscretizer``. Its cut-off
points are stored in the list ``points`` (:download:`disc4.py <code/disc4.py>`, uses :download:`iris.tab <code/iris.tab>`)::

   import orange
   iris = orange.ExampleTable("iris")

   equiN = orange.EquiNDiscretization(numberOfIntervals=4)
   entropy = orange.EntropyDiscretization()

   pl = equiN("petal length", iris)
   sl = equiN("sepal length", iris)
   sl_ent = entropy("sepal length", iris)

   for attribute in [pl, sl, sl_ent]:
     print "Cut-off points for", attribute.name, \
       "are", attribute.getValueFrom.transformer.points

Here's the output::

   Cut-off points for D_petal length are <1.54999995232, 4.44999980927, 5.25>
   Cut-off points for D_sepal length are <5.05000019073, 5.75, 6.34999990463>
   Cut-off points for D_sepal length are <5.5, 6.09999990463>

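The value names seen in the earlier outputs (for instance ``<=5.50``,
``(5.50, 6.10]`` and ``>6.10``) follow directly from such a list of
cut-off points. The sketch below, in plain Python without Orange,
shows one way such names could be derived. ``interval_labels`` is a
hypothetical helper, and the two-decimal formatting is an assumption
modelled on the output shown in this tutorial.

```python
def interval_labels(points, fmt="%.2f"):
    """Return one label per interval for a non-empty list of cut-off points.

    Intervals are closed on the right: '<=a', '(a, b]', ..., '>z'.
    """
    cuts = [fmt % p for p in points]
    labels = ["<=" + cuts[0]]
    for low, high in zip(cuts, cuts[1:]):
        labels.append("(%s, %s]" % (low, high))
    labels.append(">" + cuts[-1])
    return labels


sl_ent_labels = interval_labels([5.5, 6.09999990463])
print(sl_ent_labels)
```

A list of *n* cut-off points always yields *n* + 1 labels, which is why
renaming the values only works when the number of names stays the same.
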
Sometimes you may not like the cut-offs suggested by the functions in
Orange. In fact, domain experts usually prefer cut-offs that are at
least rounded, if not changed to something else entirely. To do this,
simply assign new values to the cut-off points. Remember that when a
new attribute (like ``sl``) is crafted, this only specifies the domain
of the attribute and how it is derived. We have not yet created a data
set with this attribute, so this is the right time to change what the
discretization will actually do to the data. In the following example,
we round the cut-off points for the attribute ``pl``
(:download:`disc5.py <code/disc5.py>`, uses :download:`iris.tab <code/iris.tab>`)::

   import orange
   iris = orange.ExampleTable("iris")

   equiN = orange.EquiNDiscretization(numberOfIntervals=4)
   entropy = orange.EntropyDiscretization()

   pl = equiN("petal length", iris)
   sl = equiN("sepal length", iris)
   sl_ent = entropy("sepal length", iris)

   # round the cut-off points of pl before any data is converted
   points = pl.getValueFrom.transformer.points
   points2 = [round(x) for x in points]
   pl.getValueFrom.transformer.points = points2

   for attribute in [pl, sl, sl_ent]:
     print "Cut-off points for", attribute.name, \
       "are", attribute.getValueFrom.transformer.points

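Changing the cut-off points changes the interval some values fall
into. The following plain-Python sketch (no Orange involved) uses the
``petal length`` cut-offs printed earlier; ``interval_index`` is a
hypothetical helper that assumes right-closed intervals, as suggested
by the value names in this tutorial.

```python
from bisect import bisect_left


def interval_index(points, value):
    """Index of the right-closed interval that contains value."""
    return bisect_left(points, value)


points = [1.54999995232, 4.44999980927, 5.25]
rounded = [round(x) for x in points]

value = 1.9
before = interval_index(points, value)
after = interval_index(rounded, value)
print(rounded)
print(before)
print(after)
```

Rounding to whole numbers moves the first cut-off from about 1.55 to
2, so a petal length of 1.9 moves from the second interval to the
first. Rounding is therefore not just cosmetic and is best done
before any data set is converted.
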
.. note::
   ``pl`` is a Python variable that refers to our attribute. The name
   of this attribute is derived from the name of the original
   attribute (``petal length``) by adding the prefix ``D_``. If you do
   not like this, you can change the name by assigning something else
   to it, like ``pl.name = "pl"``.

.. warning::
   Don't try this with attributes discretized by
   ``EquiDistDiscretization``. Instead of ``IntervalDiscretizer``, it
   uses ``EquiDistDiscretizer`` with the fields ``firstVal``, ``step``
   and ``numberOfIntervals``.

Manual discretization
---------------------

What we have done above is very close to manual discretization,
except that the number of intervals was the one used by
``EquiNDiscretization``. To do everything manually, we need to
construct the same structures as the discretization algorithms
described above. We need to define a descriptor, along with its
``name``, ``type``, ``values`` and ``getValueFrom``. The
``getValueFrom`` should be a classifier whose ``transformer`` is an
``IntervalDiscretizer``, with which we specify the cut-off points.

Let's now discretize the iris attribute ``petal length`` into a new
attribute ``pl`` using three intervals with cut-off points 2.0 and 4.0
(:download:`disc6.py <code/disc6.py>`, uses :download:`iris.tab <code/iris.tab>`)::

   import orange

   def printexamples(data, inxs, msg="First %i examples"):
     print msg % len(inxs)
     for i in inxs:
       print data[i]
     print

   iris = orange.ExampleTable("iris")
   pl = orange.EnumVariable("pl")

   # classifier that reads "petal length" and discretizes it for pl
   getValue = orange.ClassifierFromVar()
   getValue.whichVar = iris.domain["petal length"]
   getValue.classVar = pl
   getValue.transformer = orange.IntervalDiscretizer()
   getValue.transformer.points = [2.0, 4.0]

   pl.getValueFrom = getValue
   pl.values = ['low', 'medium', 'high']
   d_iris = iris.select(["petal length", pl, iris.domain.classVar])
   printexamples(d_iris, [0, 15, 35, 50, 98], "%i examples after discretization")

Notice that we have also named each of the three intervals and
constructed a data set that shows both the original and the
discretized attribute::

   5 examples after discretization
   [1.400000, 'low', 'Iris-setosa']
   [1.500000, 'low', 'Iris-setosa']
   [1.200000, 'low', 'Iris-setosa']
   [4.700000, 'high', 'Iris-versicolor']
   [3.000000, 'medium', 'Iris-versicolor']

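The mapping from petal lengths to the three named intervals can be
checked by hand. The snippet below is a plain-Python sketch, not
Orange code; the helper ``discretize`` is hypothetical and assumes
that a value equal to a cut-off belongs to the lower interval.

```python
from bisect import bisect_left

points = [2.0, 4.0]
labels = ["low", "medium", "high"]

# n cut-off points always define n + 1 intervals
assert len(labels) == len(points) + 1


def discretize(value):
    """Return the name of the interval that value falls into."""
    return labels[bisect_left(points, value)]


# petal lengths of examples 0, 15, 35, 50 and 98 from the output above
petal_lengths = [1.4, 1.5, 1.2, 4.7, 3.0]
result = [discretize(v) for v in petal_lengths]
print(result)
```

The result agrees with the second column of the output above.
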
Applying discretization on the test set
---------------------------------------

In machine learning, you would often discretize the learning set. How
does one then apply the same discretization to the test set? For
discretized attributes, Orange remembers how they were converted
from their original continuous versions, so you only need to convert
the test examples to the new (discretized) domain. The following code
shows how (:download:`disc7.py <code/disc7.py>`, uses :download:`iris.tab <code/iris.tab>`)::

   import orange
   data = orange.ExampleTable("iris")

   # split the data into a learning set and a test set
   ind = orange.MakeRandomIndices2(data, p0=0.6)
   learn = data.select(ind, 0)
   test = data.select(ind, 1)

   # discretize the learning set, then use its new domain
   # to discretize the test set
   learnD = orange.Preprocessor_discretize(learn, method=orange.EntropyDiscretization())
   testD = orange.ExampleTable(learnD.domain, test)

   print "Test set, original:"
   for i in range(3):
       print test[i]

   print "Test set, discretized:"
   for i in range(3):
       print testD[i]

The output of the script looks like this (which examples end up in the
test set, and the exact cut-off points, depend on the split)::

   Test set, original:
   [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
   [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
   [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']
   Test set, discretized:
   ['<=5.50', '>3.30', '<=1.90', '<=0.60', 'Iris-setosa']
   ['<=5.50', '(2.90, 3.30]', '<=1.90', '<=0.60', 'Iris-setosa']
   ['<=5.50', '(2.90, 3.30]', '<=1.90', '<=0.60', 'Iris-setosa']

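The important point is that the cut-off points are derived from the
learning set only and are then reused, unchanged, on the test set. A
minimal plain-Python sketch of that idea follows. The class
``FittedDiscretizer`` is hypothetical and uses a simple
equal-frequency rule rather than Orange's entropy-based method.

```python
from bisect import bisect_left


class FittedDiscretizer(object):
    """Hypothetical discretizer whose cut-offs come from the learning set only."""

    def __init__(self, learn_values, n):
        ordered = sorted(learn_values)
        self.points = []
        for i in range(1, n):
            pos = i * len(ordered) // n
            # cut halfway between two neighbouring learning values
            self.points.append((ordered[pos - 1] + ordered[pos]) / 2.0)

    def __call__(self, value):
        # right-closed intervals: a value equal to a cut-off goes lower
        return bisect_left(self.points, value)


learn = [4.5, 5.0, 5.0, 5.5, 6.0, 6.5, 6.5, 7.0]
test = [4.75, 5.75, 7.5]

disc = FittedDiscretizer(learn, 2)   # the test values are never looked at
test_d = [disc(v) for v in test]
print(disc.points)
print(test_d)
```

Computing the cut-offs from the complete data set instead would let
information from the test examples leak into the preprocessing.
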