<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="o_categorization.htm">Categorization</a>,
Next: <a href="o_ensemble.htm">Ensemble Techniques</a>,
Up: <a href="other.htm">Other Techniques for Orange Scripting</a>
</p>

<H1>Feature Subset Selection</H1>
<index name="feature subset selection/on complete data set">

<p>While the Orange core provides mechanisms to estimate the
relevance of attributes that describe classified instances, a module
called <a href="../modules/orngFSS.htm">orngFSS</a> provides functions
and wrappers that simplify feature subset selection. For instance, the
following code loads the data, sets up a filter that uses the Relief
measure to estimate the relevance of attributes and removes any
attribute whose relevance falls below the margin of 0.01, and in this
way constructs a new data set.</p>

<p class="header"><a href="fss6.py">fss6.py</a> (uses <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">import orange, orngFSS
data = orange.ExampleTable("adult_sample")

def report_relevance(data):
  m = orngFSS.attMeasure(data)
  for i in m:
    print "%5.3f %s" % (i[1], i[0])

print "Before feature subset selection (%d attributes):" % \
  len(data.domain.attributes)
report_relevance(data)
data = orange.ExampleTable("adult_sample")

marg = 0.01
filter = orngFSS.FilterRelief(margin=marg)
ndata = filter(data)
print "\nAfter feature subset selection with margin %5.3f (%d attributes):" % \
  (marg, len(ndata.domain.attributes))

report_relevance(ndata)
</xmp>

<p>Notice that we have also defined a function
<code>report_relevance</code> that takes the data, computes the
relevance of attributes (by calling <code>orngFSS.attMeasure</code>)
and reports the computed relevances. Note that (by chance!) both
<code>orngFSS.attMeasure</code> and <code>orngFSS.FilterRelief</code>
use the same measure to estimate attributes, so this code would be
better if one first set up a single object to measure the attributes
and gave it to both <code>orngFSS.FilterRelief</code> and
<code>report_relevance</code>. We leave this for you as an exercise;
a minimal sketch follows the output below. The output of the above
script is:</p>

<xmp class="code">Before feature subset selection (14 attributes):
0.183 relationship
0.154 marital-status
0.063 occupation
0.031 education
0.026 workclass
0.020 age
0.017 sex
0.012 hours-per-week
0.010 capital-loss
0.009 education-num
0.007 capital-gain
0.006 race
-0.002 fnlwgt
-0.008 native-country

After feature subset selection with margin 0.010 (9 attributes):
0.108 marital-status
0.101 relationship
0.066 education
0.021 education-num
0.020 sex
0.017 workclass
0.017 occupation
0.015 age
0.010 hours-per-week
</xmp>

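<p>As a hint for the exercise, here is a minimal sketch of how a
single measure object could be shared between the filter and the
reporting function. It assumes that <code>orngFSS.attMeasure</code>
accepts the measure as its second argument and that
<code>orngFSS.FilterRelief</code> accepts a <code>measure</code>
keyword argument, as described in the <a
href="../modules/orngFSS.htm">orngFSS</a> documentation:</p>

<xmp class="code">import orange, orngFSS

data = orange.ExampleTable("adult_sample")

# a single Relief measure, shared by the filter and the report
relief = orange.MeasureAttribute_relief(k=20, m=50)

def report_relevance(data, measure):
  # attMeasure returns (attribute name, relevance) pairs
  for name, score in orngFSS.attMeasure(data, measure):
    print "%5.3f %s" % (score, name)

filter = orngFSS.FilterRelief(measure=relief, margin=0.01)
ndata = filter(data)
report_relevance(ndata, relief)
</xmp>
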
<p>Out of 14 attributes, five were removed and nine were retained as
sufficiently relevant. We can now check whether this helps a
classifier achieve better performance. We will use 10-fold cross
validation for the comparison. To do things right, we need to perform
feature subset selection every time we see new learning data, so we
need to construct a learner that performs feature subset selection
up-front, i.e., before it actually learns. For the learner, we will
use naive Bayes with discretization (a particular wrapper from
orngDisc). The code is quite short, since we will also use a wrapper
called <code>FilteredLearner</code> from the orngFSS module:</p>
<index name="feature subset selection/wrapper">


<p class="header">an excerpt from <a href="fss7.py">fss7.py</a> (uses <a href=
"../datasets/crx.tab">crx.tab</a>)</p>
<xmp class="code">import orange, orngDisc, orngTest, orngStat, orngFSS

data = orange.ExampleTable("crx")

bayes = orange.BayesLearner()
dBayes = orngDisc.DiscretizedLearner(bayes, name='disc bayes')
fss = orngFSS.FilterAttsAboveThresh(threshold=0.05)
fBayes = orngFSS.FilteredLearner(dBayes, filter=fss, name='bayes & fss')

learners = [dBayes, fBayes]
results = orngTest.crossValidation(learners, data, folds=10, storeClassifiers=1)
</xmp>

<p>Below is the result. In terms of classification accuracy, feature
subset selection did not help. But the rightmost column shows the
number of features used by each classifier (averaged across the ten
folds of cross validation), and it is quite surprising that, on
average, fewer than three features sufficed.</p>


<xmp class="code">Learner         Accuracy  #Atts
disc bayes      0.857     14.00
bayes & fss     0.846      2.60
</xmp>

<p>The code that computes these statistics and determines which
features were used is shown below.</p>

<p class="header">another excerpt from <a href="fss7.py">fss7.py</a> (uses <a href=
"../datasets/crx.tab">crx.tab</a>)</p>
<xmp class="code"># how many attributes did each classifier use?

natt = [0.] * len(learners)
for fold in range(results.numberOfIterations):
  for lrn in range(len(learners)):
    natt[lrn] += len(results.classifiers[fold][lrn].domain.attributes)
for lrn in range(len(learners)):
  natt[lrn] = natt[lrn] / results.numberOfIterations

print "\nLearner         Accuracy  #Atts"
for i in range(len(learners)):
  print "%-15s %5.3f     %5.2f" % (learners[i].name, orngStat.CA(results)[i], natt[i])

# which attributes were used in the filtered case?

print '\nAttribute usage (in how many folds was an attribute used?):'
used = {}
for fold in range(results.numberOfIterations):
  for att in results.classifiers[fold][1].domain.attributes:
    a = att.name
    if a in used: used[a] += 1
    else: used[a] = 1
for a in used:
  print '%2d x %s' % (used[a], a)
</xmp>

<p>The part of the output that shows attribute usage follows.
Interestingly, only four attributes appeared in the constructed
classifiers, and only one of them (A9) was used in all ten classifiers
built during cross validation.</p>

<xmp class="code">Attribute usage (in how many folds was an attribute used?):
10 x A9
 2 x A10
 3 x A7
 6 x A6
</xmp>

<p>More examples of feature subset selection can be found in the
documentation of the <a href="../modules/orngFSS.htm">orngFSS</a>
module.</p>

<hr><br><p class="Path">
Prev: <a href="o_categorization.htm">Categorization</a>,
Next: <a href="o_ensemble.htm">Ensemble Techniques</a>,
Up: <a href="other.htm">Other Techniques for Orange Scripting</a>
</p>

</body></html>
