<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="o_ensemble.htm">Ensemble Techniques</a>,
Next: <a href="uncovered.htm">What we did not cover</a>,
Up: <a href="default.htm">On Tutorial 'Orange for Beginners'</a>
</p>

<H1>Basic Data Manipulation</H1>

<p>A substantial part of Orange's functionality is data manipulation:
constructing data sets, selecting instances and attributes, different
filtering techniques... While all operations on attributes and
instances could be done on your data set in your favorite spreadsheet
program, this may not be very convenient, as you may not want to jump
from one environment to another, and it can even be prohibitive if data
manipulation is part of your learning or testing scheme. In this
section of the tutorial we therefore present some of the very basic data
manipulation techniques incorporated in Orange, which may in turn be
sufficient for those who want to implement their own feature or
instance selection techniques, or even do something like constructive
induction.</p>

<h2>A Warm Up</h2>

<p>We will use the data set about car types and characteristics called <a
href="imports-85.tab">imports-85.tab</a>. Before we do anything, let
us write a script that examines which attributes are included in this
data set.</p>

<p class="header">part of <a href="domain1.py">domain1.py</a>  (uses <a href=
"imports-85.tab">imports-85.tab</a>) </p>
<xmp class=code>import orange

filename = "imports-85.tab"
data = orange.ExampleTable(filename)
print "%s includes %i attributes and a class variable %s" % \
  (filename, len(data.domain.attributes), data.domain.classVar.name)

print "Attribute names and indices:"
for i in range(len(data.domain.attributes)):
  print "(%2d) %-17s" % (i, data.domain.attributes[i].name),
  if i % 3 == 2: print
</xmp>

<p>The script prints out the following report:</p>

<xmp class="code">imports-85.tab includes 25 attributes and a class variable price
Attribute names and indices:
( 0) symboling         ( 1) normalized-losses ( 2) make
( 3) fuel-type         ( 4) aspiration        ( 5) num-of-doors
( 6) body-style        ( 7) drive-wheels      ( 8) engine-location
( 9) wheel-base        (10) length            (11) width
(12) height            (13) curb-weight       (14) engine-type
(15) num-of-cylinders  (16) engine-size       (17) fuel-system
(18) bore              (19) stroke            (20) compression-ratio
(21) horsepower        (22) peak-rpm          (23) city-mpg
(24) highway-mpg
</xmp>

<h2>Attribute Selection and Construction of Class-Based Domains</h2>
<index name="feature subset selection">

<p>Every example set in Orange has its domain. Say a variable
<code>data</code> stores our data set; the domain of this data
set can then be accessed through <code>data.domain</code>. Inclusion and
exclusion of attributes can be managed through domains: we can use one
domain to construct another, and then use Orange's
<code>select</code> function to construct a data set from the
original instances given the new domain. There is also a more
straightforward way to select attributes, by using
<code>orange.select</code> directly.</p>

<p>Here is an example. We again use the <code>imports-85</code> data set,
and construct different data sets that include the first five attributes
(<code>newData1</code>), attributes listed and specified by
their names (<code>newData2</code>), and attributes listed through domains
constructed with <code>orange.Domain</code>, with and without a class
variable (<code>newData3</code> and <code>newData4</code>):</p>

<p class="header"><a href="domain2.py">domain2.py</a>  (uses <a href=
"imports-85.tab">imports-85.tab</a>) </p>
<xmp class=code>import orange

def reportAttributes(dataset, header=None):
  if dataset.domain.classVar:
    print 'Class variable: %s,' % dataset.domain.classVar.name,
  else:
    print 'No Class,',
  if header:
    print '%s:' % header
  for i in range(len(dataset.domain.attributes)):
    print "%s" % dataset.domain.attributes[i].name,
    if i % 6 == 5: print
  print "\n"

filename = "imports-85.tab"
data = orange.ExampleTable(filename)
reportAttributes(data, "Original data set")

newData1 = data.select(range(5))
reportAttributes(newData1, "First five attributes")

newData2 = data.select(['engine-location', 'wheel-base', 'length'])
reportAttributes(newData2, "Attributes selected by name")

domain3 = orange.Domain([data.domain[0], data.domain['curb-weight'], data.domain[2]])
newData3 = data.select(domain3)
reportAttributes(newData3, "Attributes by domain")

domain4 = orange.Domain([data.domain[0], data.domain['curb-weight'], data.domain[2]], 0)
newData4 = data.select(domain4)
reportAttributes(newData4, "Attributes by domain")
</xmp>

<p>The last two examples (construction of <code>newData3</code> and
<code>newData4</code>) show a very important distinction in crafting
domains: in Orange, domains may or may not include a class
variable. For classification and regression tasks you will obviously
need class labels; for something like association rules, you won't. In
the script above, this distinction was made with "0" as the last
argument to <code>orange.Domain</code>: this "0" says no, please do
not construct a class variable; by default,
<code>orange.Domain</code> considers the last variable
in the list of attributes to be the class variable.</p>

<p>The output produced by the above script is therefore:</p>

<xmp class="code">Class variable: price, Original data set:
symboling normalized-losses make fuel-type aspiration num-of-doors
body-style drive-wheels engine-location wheel-base length width
height curb-weight engine-type num-of-cylinders engine-size fuel-system
bore stroke compression-ratio horsepower peak-rpm city-mpg
highway-mpg

No Class, First five attributes:
symboling normalized-losses make fuel-type aspiration

No Class, Attributes selected by name:
engine-location wheel-base length

Class variable: make, Attributes by domain:
symboling curb-weight

No Class, Attributes by domain:
symboling curb-weight make
</xmp>
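
<p>In other words, the extra "0" argument controls whether a class
variable is constructed. A minimal sketch, reusing <code>data</code>
from the script above (the comments state the expected outcome):</p>

<xmp class=code>d1 = orange.Domain([data.domain[0], data.domain[1]])
d2 = orange.Domain([data.domain[0], data.domain[1]], 0)
print d1.classVar   # the last variable in the list becomes the class
print d2.classVar   # None, as no class variable was constructed
</xmp>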

<p><code>orange.Domain</code> is a rather powerful constructor of domains, and its complete description is beyond this tutorial. But to illustrate it further, here is another example: run it yourself to see what happens.</p>

<p class="header"><a href="domain3.py">domain3.py</a>  (uses <a href=
"glass.tab">glass.tab</a>) </p>
<xmp class=code>import orange
domain = orange.ExampleTable("glass").domain

tests = ( '(["Na", "Mg"], domain)',
          '(["Na", "Mg"], 1, domain)',
          '(["Na", "Mg"], 0, domain)',
          '(["Na", "Mg"], domain.variables)',
          '(["Na", "Mg"], 1, domain.variables)',
          '(["Na", "Mg"], 0, domain.variables)',
          '([domain["Na"], "Mg"], 0, domain.variables)',
          '([domain["Na"], "Mg"], 0, source=domain)',
          '([domain["Na"], "Mg"], 0, source=domain.variables)',
          '([domain["Na"], domain["Mg"]], 0)',
          '([domain["Na"], domain["Mg"]], 1)',
          '([domain["Na"], domain["Mg"]], None)',
          '([domain["Na"], domain["Mg"]], orange.EnumVariable("something completely different"))',
          '(domain)',
          '(domain, 0)',
          '(domain, 1)',
          '(domain, "Mg")',
          '(domain, domain[0])',
          '(domain, None)',
          '(domain, orange.FloatVariable("nothing completely different"))')

for args in tests:
  line = "orange.Domain%s" % args
  d = eval(line)
  print line
  print "  classVar: %s" % d.classVar
  print "  attributes: %s" % d.attributes
  print
</xmp>

<p>Remember that all this script does is domain construction. You
would still need to use <code>orange.select</code> to obtain
the example data sets.</p>
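
<p>For instance, here is a minimal sketch that builds a domain from one
of the signatures tested above and then selects the data (the printout
is for inspection only):</p>

<xmp class=code>data = orange.ExampleTable("glass")
d = orange.Domain(["Na", "Mg"], data.domain)  # names resolved through the source domain
newdata = data.select(d)
print newdata.domain.attributes, newdata.domain.classVar
</xmp>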

<h2>Instance Selection</h2>
<index name="sampling">
<index name="filtering examples">

<p>Instance selection may be based on values of attributes, or we can
simply select instances according to their index. There are also
a number of filters that may help with instance selection, of which we
here mention only <code>Filter_sameValues</code>.</p>

<p>First, filtering by index. Again, we will use the <code>select</code>
function, this time giving it a vector of integer values based on
which <code>select</code> decides whether or not to include an
instance. By default, <code>select</code> includes instances with a
corresponding non-zero element in this list, but a specific value that
marks the instances to include may also be
specified. Notice that through this mechanism you may craft your own
selection vector in any way you want, and thus (if needed) implement
some complex instance selection mechanism. Here, though, is a much
simpler example:</p>

<p class="header"><a href="domain7.py">domain7.py</a>  (uses <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>) </p>
<xmp class=code>import orange

def report_prob(header, data):
  print 'Size of %s: %i instances; ' % (header, len(data)),
  n = 0
  for i in data:
    if int(i.getclass())==0:
      n = n + 1
  if len(data):
    print "p(%s)=%5.3f" % (data.domain.classVar.values[0], float(n)/len(data))
  else:
    print

filename = "adult_sample.tab"
data = orange.ExampleTable(filename)
report_prob('data', data)

selection = [1]*10 + [0]*(len(data)-10)
data1 = data.select(selection)
report_prob('data1, first ten instances', data1)

data2 = data.select(selection, negate=1)
report_prob('data2, other than first ten instances', data2)

selection = [1]*12 + [2]*12 + [3]*12 + [0]*(len(data)-12*3)
data3 = data.select(selection, 3)
report_prob('data3, third dozen of instances', data3)
</xmp>

<p>And here is its output:</p>

<xmp class="code">Size of data: 977 instances;  p(>50K)=0.242
Size of data1, first ten instances: 10 instances;  p(>50K)=0.200
Size of data2, other than first ten instances: 967 instances;  p(>50K)=0.242
Size of data3, third dozen of instances: 12 instances;  p(>50K)=0.583
</xmp>
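
<p>To illustrate crafting a selection vector of your own, here is a
minimal sketch (the selection rule itself is arbitrary) that keeps
every third instance, continuing from the script above:</p>

<xmp class=code>selection = [int(i % 3 == 0) for i in range(len(data))]
every_third = data.select(selection)
report_prob('every third instance', every_third)
</xmp>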

<p>The above should not be anything new to the careful reader of this tutorial. Namely, we have already used instance selection in the chapter on <a href="c_performance.htm">performance evaluation of classifiers</a>, where we also learned how to use <code>MakeRandomIndices2</code> and <code>MakeRandomIndicesCV</code> to craft the selection vectors.</p>

<p>Next, and for something new in this tutorial, Orange's <code>select</code> also allows selecting instances based on their attribute values. This may be best illustrated through an example, so here it goes:</p>

<p class="header"><a href="domain4.py">domain4.py</a>  (uses <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>) </p>
<xmp class=code>import orange

def report_prob(header, data):
  print 'Size of %s: %i instances' % (header, len(data))
  n = 0
  for i in data:
    if int(i.getclass())==0:
      n = n + 1
  print "p(%s)=%5.3f" % (data.domain.classVar.values[0], float(n)/len(data))

filename = "adult_sample.tab"
data = orange.ExampleTable(filename)
report_prob('original data set', data)

data1 = data.select(sex='Male')
report_prob('data1', data1)

data2 = data.select(sex='Male', education='Masters')
report_prob('data2', data2)
</xmp>

<p>We have used the adult data set, in which each instance describes an
individual through a set of attributes and states whether his or her
yearly earnings were above $50,000. Notice that we have selected
instances based on gender (<code>data1</code>) and on gender and education
(<code>data2</code>), and, just to show how different the resulting
data sets are, reported the number of instances and the relative
frequency of cases with higher earnings. Notice that when more than one
attribute-value pair is given to <code>select</code>, a conjunction of
conditions is used. The output of the above script is:</p>

<xmp class="code">Size of original data set: 977 instances
p(>50K)=0.242
Size of data1: 624 instances
p(>50K)=0.296
Size of data2: 38 instances
p(>50K)=0.632
</xmp>

<p>Could we request examples for which either of the conditions holds? Or
those for which neither of them does? Or... Yes, but not with
<code>select</code>. For this, we need to use a more versatile filter called
<code>Preprocessor_take</code>. Here's an example of how it's
used.</p>

<p class="header">part of <a href="domain5.py">domain5.py</a>  (uses <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>) </p>
<xmp class=code>filename = "adult_sample.tab"
data = orange.ExampleTable(filename)
report_prob('data', data)

filter = orange.Preprocessor_take()
filter.values = {data.domain["sex"]: "Male", data.domain["education"]: "Masters"}

filter.conjunction = 1
data1 = filter(data)
report_prob('data1 (conjunction)', data1)

filter.conjunction = 0
data1 = filter(data)
report_prob('data1 (disjunction)', data1)

data2 = data.select(sex='Male', education='Masters')
report_prob('data2 (select, conjunction)', data2)
</xmp>

<p>The results are reported as:</p>

<xmp class="code">Size of data: 977 instances;  p(>50K)=0.242
Size of data1 (conjunction): 38 instances;  p(>50K)=0.632
Size of data1 (disjunction): 643 instances;  p(>50K)=0.302
Size of data2 (select, conjunction): 38 instances;  p(>50K)=0.632
</xmp>

<p>Notice that with <code>conjunction=1</code> the resulting data set
is just like the one constructed with <code>select</code>. Not just
that - <CODE>select</CODE> and <CODE>Preprocessor_take</CODE> both
actually work by constructing a
<code>Filter_sameValues</code> object, using it and discarding it
afterwards. What we gained by using <CODE>Preprocessor_take</CODE>
instead of <CODE>select</CODE> is access to the field
<code>conjunction</code>; if set to 0, conditions are treated as a
disjunction (OR) instead of a conjunction (AND). There is also
<CODE>Preprocessor_take.negate</CODE>, which reverses the
selection; a sketch follows below. When constructing the filter, it's
essential to set the <code>domain</code> before specifying the values.</p>
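
<p>As a minimal sketch of <code>negate</code> (assuming it behaves as
just described, flipping the selection of the conjunctive filter from
the script above):</p>

<xmp class=code>filter.conjunction = 1
filter.negate = 1
data3 = filter(data)
report_prob('data3 (negated conjunction)', data3)
</xmp>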

<p>Selection methods can also deal with continuous attributes and
values. Constraints with respect to some attribute's values are
specified as intervals (lower limit, upper limit). Limits are
inclusive: for limits (30, 40) and attribute age, an instance will be
selected if the age is greater than or equal to 30 and less than or
equal to 40. If the limits are reversed, e.g. (40, 30), examples with
values outside the interval are selected, that is, an instance is
selected if the age is less than or equal to 30 or greater than or
equal to 40.</p>

<p class="header">part of <a href="domain6.py">domain6.py</a>  (uses <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>) </p>
<xmp class=code>filename = "adult_sample.tab"
data = orange.ExampleTable(filename)
report_prob('data', data)

data1 = data.select(age=(30,40))
report_prob('data1, age from 30 to 40', data1)

data2 = data.select(age=(40,30))
report_prob('data2, younger than 30 or older than 40', data2)
</xmp>

<p>Running this script shows that it pays to be in your thirties (good
for the authors of this text, at the time of writing):</p>

<xmp class="code">Size of data: 977 instances;  p(>50K)=0.242
Size of data1, age from 30 to 40: 301 instances;  p(>50K)=0.312
Size of data2, younger than 30 or older than 40: 676 instances;  p(>50K)=0.210
</xmp>

<h2>Accessing and Changing Attribute Values</h2>

<p>Early in our tutorial we learned that if <code>data</code> is
a variable that stores our data set, instances can be accessed simply
by indexing the data: <code>data[5]</code> is the sixth
instance (indices start with 0). Attribute values can be accessed through
the attribute's index (<code>data[5][3]</code>; the fourth attribute of the sixth
instance), name (<code>data[5]["age"]</code>; the attribute age of the sixth
instance), or variable object (<code>a=data.domain.attributes[5]; print
data[5][a]</code>; attribute <code>a</code> of the sixth instance).</p>

<p>At this point it should be obvious that attribute values can be
used in any (appropriate) Python expression, and you may also set
the values of attributes, as in <code>data[5]["fuel-type"] =
"gas"</code>. Orange will report an error if the assigned value is
outside the variable's set of allowed values.</p>
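
<p>A minimal sketch that puts these access paths together (using the
imports-85 data set from the beginning of this lesson, where the fourth
attribute is fuel-type):</p>

<xmp class=code>import orange

data = orange.ExampleTable("imports-85")
print data[5]["fuel-type"]        # access by attribute name
print data[5][3]                  # access by index (the fourth attribute)
a = data.domain.attributes[3]
print data[5][a]                  # access through a variable object
data[5]["fuel-type"] = "gas"      # assignment; an invalid value raises an error
</xmp>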

<h2>Adding Noise and Unknown Values</h2>
<index name="noise">
<index name="missing values/adding of">

<p>Who needs these? Isn't real data populated with noise and missing
values anyway? Well, in machine learning you may sometimes want to
add these to see how robust your learning algorithms
are. Particularly if you deal with artificial data sets that
do not include noise and want to make them more "realistic". Like it
or not, here is how this may be done.</p>

<p>First, we will add class noise to the data set and, to make things
interesting, use this data set with some learner and observe if and
how the accuracy of the learner is affected by the noise. To add class
noise, we will use <code>Preprocessor_addClassNoise</code> with an
attribute (<code>proportion</code>) that tells in what proportion of
instances the class is set to an
arbitrary value. Notice that the probabilities of class values used by
<code>Preprocessor_addClassNoise</code> are uniform.</p>

<p class="header"><a href="domain8.py">domain8.py</a>  (uses <a href=
"promoters.tab">promoters.tab</a>) </p>
<xmp class=code>import orange, orngTest, orngStat

filename = "promoters.tab"
data = orange.ExampleTable(filename)
data.name = "unspoiled"
datasets = [data]

add_noise = orange.Preprocessor_addClassNoise()
for noiselevel in (0.2, 0.4, 0.6):
  add_noise.proportion = noiselevel
  d = add_noise(data)
  d.name = "class noise %4.2f" % noiselevel
  datasets.append(d)

learner = orange.BayesLearner()

for d in datasets:
  results = orngTest.crossValidation([learner], d, folds=10)
  print "%20s   %5.3f" % (d.name, orngStat.CA(results)[0])
</xmp>

<p>Obviously, we expect that with added noise the performance of any classifier will degrade. This is indeed so for our example and the naive Bayes learner:</p>

<xmp class="code">           unspoiled   0.896
    class noise 0.20   0.811
    class noise 0.40   0.689
    class noise 0.60   0.632
</xmp>

<p>We can also add noise to attributes. Here we should distinguish
between continuous and discrete attributes; a sketch of one way to do
it follows below.</p>
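
<p>Orange also provides preprocessors for this (see the reference
documentation for the <code>Preprocessor_*</code> family). As a
self-contained sketch that stays with what we have seen in this lesson,
noise and unknown values can also be introduced manually by rewriting
values directly; the assignment of "?" for an unknown value is an
assumption based on Orange's notation for missing values:</p>

<xmp class=code>import orange, random

data = orange.ExampleTable("iris")
rnd = random.Random(42)

# continuous attribute: perturb each value with Gaussian noise
att = data.domain["petal length"]
for ex in data:
  ex[att] = float(ex[att]) + rnd.gauss(0, 0.1)

# unknown values: assign the missing-value marker (assumed "?" notation)
data[0]["sepal width"] = "?"
print data[0]
</xmp>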

<h2>Crafting New Attributes</h2>

<p>In machine learning and data mining, you may often encounter situations where you wish to add an extra attribute which is constructed from some existing subset of attributes. You may do that "manually" (you know exactly from which attributes you will derive the new one, and you know the function as well), or in some automatic way through, say, constructive induction.</p>

<p>To introduce this subject, we will be very unambitious here and
just show how to deal with the first, manual, case. Here are two
examples. In the first, we add two attributes to the well-known iris
data set; the two represent approximations of sepal and petal
area, respectively, and are derived from sepal and petal length and
width. The attributes are declared first: in our case we use
<code>orange.FloatVariable</code>, which returns an object that stores
our variable and its properties. One important property -
<code>getValueFrom</code> - tells how this variable is computed. All
that we need to do next is to construct a new domain that includes the new
variables; from then on, every time the new variables are
accessed, Orange will know how to compute them.</p>

<p class="header"><a href="domain11.py">domain11.py</a>  (uses <a href=
"iris.tab">iris.tab</a>) </p>
<xmp class=code>import orange
data = orange.ExampleTable('iris')

sa = orange.FloatVariable("sepal area")
sa.getValueFrom = lambda e, getWhat: e['sepal length'] * e['sepal width']

pa = orange.FloatVariable("petal area")
pa.getValueFrom = lambda e, getWhat: e['petal length'] * e['petal width']

newdomain = orange.Domain(data.domain.attributes+[sa, pa, data.domain.classVar])
newdata = data.select(newdomain)

print
for a in newdata.domain.attributes:
  print "%13s" % a.name,
print "%16s" % newdata.domain.classVar.name
for i in [10,50,100,130]:
  for a in newdata.domain.attributes:
    print "%8s%5.2f" % (" ", newdata[i][a]),
  print "%16s" % (newdata[i].getclass())
</xmp>

<p>As we took care that four data instances from the new data set are
nicely printed out, here is the output of the script:</p>

<xmp class="code"> sepal length   sepal width  petal length   petal width    sepal area    petal area             iris
         5.40          3.70          1.50          0.20         19.98          0.30      Iris-setosa
         7.00          3.20          4.70          1.40         22.40          6.58  Iris-versicolor
         6.30          3.30          6.00          2.50         20.79         15.00   Iris-virginica
         7.40          2.80          6.10          1.90         20.72         11.59   Iris-virginica
</xmp>

<p>The story is slightly different with nominal attributes, where
apart from the name we need to declare the set of values as
well. Everything else is quite similar. Below is an example that adds
a new attribute to the car data set (see more at the <a
href="http://www.ailab.si/hint/car_dataset.htm">car data set web
page</a>):</p>

<p class="header"><a href="domain12.py">domain12.py</a>  (uses <a href=
"car.tab">car.tab</a>) </p>
<xmp class=code>import orange
data = orange.ExampleTable('car')

priceTable={}
priceTable['v-high:v-high'] = 'v-high'
priceTable['high:v-high'] = 'v-high'
priceTable['med:v-high'] = 'high'
priceTable['low:v-high'] = 'high'
priceTable['v-high:high'] = 'v-high'
priceTable['high:high'] = 'high'
priceTable['med:high'] = 'high'
priceTable['low:high'] = 'med'
priceTable['v-high:med'] = 'high'
priceTable['high:med'] = 'high'
priceTable['med:med'] = 'med'
priceTable['low:med'] = 'low'
priceTable['v-high:low'] = 'high'
priceTable['high:low'] = 'high'
priceTable['med:low'] = 'low'
priceTable['low:low'] = 'low'

def f(price, buying, maint):
  return orange.Value(price, priceTable['%s:%s' % (buying, maint)])

price = orange.EnumVariable("price", values=["v-high", "high", "med", "low"])
price.getValueFrom = lambda e, getWhat: f(price, e['buying'], e['maint'])
newdomain = orange.Domain(data.domain.attributes+[price, data.domain.classVar])
newdata = data.select(newdomain)

print
for a in newdata.domain.attributes:
  print "%10s" % a.name,
print "%10s" % newdata.domain.classVar.name
for i in [1,200,300,1200,1700]:
  for a in newdata.domain.attributes:
    print "%10s" % newdata[i][a],
  print "%10s" % newdata[i].getclass()
</xmp>

<p>The output of this code (we intentionally printed out five selected data instances):</p>

<xmp class="code">    buying      maint      doors    persons    lugboot     safety      price          y
    v-high     v-high          2          2      small        med     v-high      unacc
    v-high       high     5-more          4      small       high     v-high      unacc
    v-high        med     5-more          2        med        low       high      unacc
       med        low          2          4        med        low        low      unacc
       low        low          4       more        big       high        low     v-good
</xmp>

<p>In machine learning, we usually alter the data domain to achieve
better predictive accuracy, or to introduce attributes that are better
understood by domain experts. We tested the first hypothesis on our
data set and constructed classification trees from, respectively, the
original and the new data set. The results of running the following
script are not striking (in terms of accuracy boost), but still give an
example of how to run exactly the same cross-validation on two data
sets with the same number of instances.</p>

<p class="header"><a href="domain13.py">domain13.py</a>  (uses <a href=
"iris.tab">iris.tab</a>) </p>
<xmp class=code>import orange, orngTest, orngStat, orngTree
data = orange.ExampleTable('iris')

sa = orange.FloatVariable("sepal area")
sa.getValueFrom = lambda e, getWhat: e['sepal length'] * e['sepal width']
pa = orange.FloatVariable("petal area")
pa.getValueFrom = lambda e, getWhat: e['petal length'] * e['petal width']

newdomain = orange.Domain(data.domain.attributes+[sa, pa, data.domain.classVar])
newdata = data.select(newdomain)

learners = [orngTree.TreeLearner(mForPruning=2.0)]

indices = orange.MakeRandomIndicesCV(data, 10)
res1 = orngTest.testWithIndices(learners, data, indices)
res2 = orngTest.testWithIndices(learners, newdata, indices)

print "original: %5.3f, new: %5.3f" % (orngStat.CA(res1)[0], orngStat.CA(res2)[0])
</xmp>

<h2>A Wrapper For Feature Subset Selection</h2>
<index name="feature subset selection/wrapper">

<p>Here, we construct a simple feature subset selection algorithm that
uses a wrapper approach (see Kohavi R, John G: The Wrapper Approach,
in Feature Extraction, Construction and Selection: A Data Mining
Perspective, edited by Huan Liu and Hiroshi Motoda) and a
hill-climbing strategy for the selection of features. The wrapper
approach estimates the quality of a given feature set by running the
selected learning algorithm. We start with an empty feature set and
incrementally add features from the original data set. We do this only
if the classification accuracy increases, hence we stop when adding any
single feature no longer results in a gain of performance. For
evaluation, we use cross-validation. [What Kohavi and John describe in
their wrapper approach is a little more complex, uses best-first search
and does some smarter evaluation. The script presented here is not far
from their algorithm, and you are encouraged to build such a wrapper if
you need one, or as an exercise.]</p>

<p class="header"><a href="domain9.py">domain9.py</a>  (uses <a href=
"voting.tab">voting.tab</a>) </p>
<xmp class=code>import orange, orngTest, orngStat, orngTree

def WrapperFSS(data, learner, verbose=0, folds=10):
  classVar = data.domain.classVar
  currentAtt = []
  freeAttributes = list(data.domain.attributes)

  newDomain = orange.Domain(currentAtt + [classVar])
  d = data.select(newDomain)
  results = orngTest.crossValidation([learner], d, folds=folds)
  maxStat = orngStat.CA(results)[0]
  if verbose>=2:
    print "start (%5.3f)" % maxStat

  while 1:
    stat = []
    for a in freeAttributes:
      newDomain = orange.Domain([a] + currentAtt + [classVar])
      d = data.select(newDomain)
      results = orngTest.crossValidation([learner], d, folds=folds)
      stat.append(orngStat.CA(results)[0])
      if verbose>=2:
        print "  %s gained %5.3f" % (a.name, orngStat.CA(results)[0])

    if (max(stat) > maxStat):
      oldMaxStat = maxStat
      maxStat = max(stat)
      bestVarIndx = stat.index(max(stat))
      if verbose:
        print "gain: %5.3f, attribute: %s" % (maxStat-oldMaxStat, freeAttributes[bestVarIndx].name)
      currentAtt = currentAtt + [freeAttributes[bestVarIndx]]
      del freeAttributes[bestVarIndx]
    else:
      if verbose:
        print "stopped (%5.3f)" % (max(stat) - maxStat)
      return orange.Domain(currentAtt + [classVar])

data = orange.ExampleTable("voting")
learner = orngTree.TreeLearner(mForPruning=0.5)

bestDomain = WrapperFSS(data, learner, verbose=1)
print bestDomain
</xmp>

<p>For wrapper feature subset selection we have defined a function,
<code>WrapperFSS</code>, which takes the data and the learner, and can
optionally be requested to report on the progress of the search
(<code>verbose=1</code>). Cross-validation uses ten
folds by default, but you may change this through a parameter of
<code>WrapperFSS</code>. Here is the result of a single run of the
script, where we used a classification tree as the learner:</p>

<xmp class="code">gain: 0.343, attribute: physician-fee-freeze
stopped (0.000)
[physician-fee-freeze, party]
</xmp>

<p>Notice that only a single attribute was selected
(<code>party</code> is the class). You may explore the behavior of the
algorithm in some more detail to see why this happens by calling the
feature subset selection with <code>verbose=2</code>. You may also
replace the tree learner with some other algorithm. We did this and used
naive Bayes (<code>learner = orange.BayesLearner()</code>), and got
the following:</p>

<xmp class="code">gain: 0.343, attribute: physician-fee-freeze
gain: 0.002, attribute: synfuels-corporation-cutback
gain: 0.005, attribute: adoption-of-the-budget-resolution
stopped (0.000)
[physician-fee-freeze, synfuels-corporation-cutback,
adoption-of-the-budget-resolution, party]
</xmp>

<p>The selected set of features includes <code>physician-fee-freeze</code>, but also two other attributes.</p>

<p>One thing about naive Bayes is that it will report a bunch of warnings of the type:</p>

<xmp class="code">  classifiers[i] = learners[i](learnset, weight)
C:\PYTHON22\lib\orngTest.py:256: KernelWarning: 'BayesLearner':
invalid conditional probability or no attributes
(the classifier will use apriori probabilities)
</xmp>

<p>This is because at the start of the feature subset selection, a set
with no attributes other than the class was given to the learner. These
warnings are OK and can come in handy elsewhere; if you really do not
like them here, add the following to the code:</p>

<xmp class="code">import warnings
warnings.filterwarnings("ignore", ".*'BayesLearner': .*",
orange.KernelWarning)
</xmp>

<index name="feature subset selection/wrapped learner">
<p>An issue, of course, is whether this feature subset selection by
wrapping helps us build a better classifier. To test this, we
will construct a <code>WrappedFSSLearner</code>, which will take some
learning algorithm and a data set, do feature subset selection to
determine the appropriate set of features, and craft a classifier from
the data that includes this set of features. As we did before in our
tutorial, we will construct <code>WrappedFSSLearner</code> such that
it can be used by other Orange modules.</p>

<p class="header">part of <a href="domain10.py">domain10.py</a>  (uses <a href=
"voting.tab">voting.tab</a>) </p>
<xmp class=code>def WrappedFSSLearner(learner, examples=None, verbose=0, folds=10, **kwds):
  kwds['verbose'] = verbose
  kwds['folds'] = folds
  learner = apply(WrappedFSSLearner_Class, (learner,), kwds)
  if examples:
    return learner(examples)
  else:
    return learner

class WrappedFSSLearner_Class:
  def __init__(self, learner, verbose=0, folds=10, name='learner w wrapper fss'):
    self.name = name
    self.learner = learner
    self.verbose = verbose
    self.folds = folds

  def __call__(self, data, weight=None):
    domain = WrapperFSS(data, self.learner, self.verbose, self.folds)
    selectedData = data.select(domain)
    if self.verbose:
      print 'features:', selectedData.domain
    model = self.learner(selectedData, weight)
    return Classifier(classifier = model)

class Classifier:
  def __init__(self, **kwds):
    self.__dict__.update(kwds)

  def __call__(self, example, resultType = orange.GetValue):
    return self.classifier(example, resultType)


#base = orngTree.TreeLearner(mForPruning=0.5)
#base.name = 'tree'
base = orange.BayesLearner()
base.name = 'bayes'
import warnings
warnings.filterwarnings("ignore", ".*'BayesLearner': .*", orange.KernelWarning)

fssed = WrappedFSSLearner(base, verbose=1, folds=5)
fssed.name = 'w fss'

# evaluation

learners = [base, fssed]
data = orange.ExampleTable("voting")
results = orngTest.crossValidation(learners, data, folds=10)

print "Learner      CA     IS     Brier    AUC"
for i in range(len(learners)):
  print "%-12s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name, \
    orngStat.CA(results)[i], orngStat.IS(results)[i],
    orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])
</xmp>

<p>The wrapped learner uses <code>WrapperFSS</code>, which is exactly
the same function that we developed in our previous script. The
objects we have introduced in the script above mainly take care of the
attributes; the code that really does something is in the
<code>__call__</code> function of
<code>WrappedFSSLearner_Class</code>. Running this script with a
classification tree, we get the same single-attribute set with
<code>physician-fee-freeze</code> for all of the ten folds, and a
minimal gain in accuracy. Something similar happens for naive Bayes,
except that the attributes included in the data set are
<code>[physician-fee-freeze, synfuels-corporation-cutback,
adoption-of-the-budget-resolution]</code>, and the statistics reported
are considerably higher than for naive Bayes without feature subset
selection:</p>

<xmp class="code">Learner      CA     IS     Brier    AUC
bayes        0.901  0.758  0.177  0.976
w fss        0.961  0.848  0.064  0.991
</xmp>

<p>This concludes the lesson on basic data set manipulation, which
started with descriptions of some really elemental operations and
finished with a not-so-very-basic algorithm. Still, if you are inspired
to do feature subset selection, you may take the code for the wrapper
approach demonstrated at the end of this lesson and extend it in any
way you like: what you may find out is that writing Orange scripts
like this is easy and can be quite a joy.</p>

<hr><br><p class="Path">
Prev: <a href="o_ensemble.htm">Ensemble Techniques</a>,
Next: <a href="uncovered.htm">What we did not cover</a>,
Up: <a href="default.htm">On Tutorial 'Orange for Beginners'</a>
</p>

</body></html>