source: orange/docs/tutorial/rst/feature-subset-selection.rst @ 9386:b95da3693f19

.. index::
   single: feature subset selection

Feature Subset Selection
========================

While the Orange core provides mechanisms to estimate the relevance of
attributes that describe classified instances, the module
:py:mod:`Orange.feature.selection` provides functions and wrappers that
simplify feature subset selection. For instance, the following code loads
the data, sets up a filter that uses the Relief measure to estimate the
relevance of attributes and removes those with relevance lower than 0.01,
and in this way constructs a new data set (:download:`fss6.py <code/fss6.py>`,
uses :download:`adult_sample.tab <code/adult_sample.tab>`)::

   import orange, orngFSS
   data = orange.ExampleTable("adult_sample")

   def report_relevance(data):
     # estimate the relevance of each attribute and print it
     m = orngFSS.attMeasure(data)
     for i in m:
       print "%5.3f %s" % (i[1], i[0])

   print "Before feature subset selection (%d attributes):" % \
     len(data.domain.attributes)
   report_relevance(data)
   data = orange.ExampleTable("adult_sample")

   # filter out attributes whose relevance is below the margin
   marg = 0.01
   filter = orngFSS.FilterRelief(margin=marg)
   ndata = filter(data)
   print "\nAfter feature subset selection with margin %5.3f (%d attributes):" % \
     (marg, len(ndata.domain.attributes))

   report_relevance(ndata)

Notice that we have also defined a function ``report_relevance`` that
takes the data, computes the relevance of the attributes (by calling
``orngFSS.attMeasure``) and then prints the computed scores. Note also
that (by chance!) both ``orngFSS.attMeasure`` and ``orngFSS.FilterRelief``
use the same measure to estimate the attributes, so this code would be
cleaner if one first set up a single object that measures the attributes
and passed it to both ``orngFSS.FilterRelief`` and ``report_relevance``
(we leave this to you as an exercise; one possible sketch is shown after
the output below). The output of the above script is::

   Before feature subset selection (14 attributes):
   0.183 relationship
   0.154 marital-status
   0.063 occupation
   0.031 education
   0.026 workclass
   0.020 age
   0.017 sex
   0.012 hours-per-week
   0.010 capital-loss
   0.009 education-num
   0.007 capital-gain
   0.006 race
   -0.002 fnlwgt
   -0.008 native-country

   After feature subset selection with margin 0.010 (9 attributes):
   0.108 marital-status
   0.101 relationship
   0.066 education
   0.021 education-num
   0.020 sex
   0.017 workclass
   0.017 occupation
   0.015 age
   0.010 hours-per-week

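As noted above, a single measure object could be shared between the
reporting function and the filter. Below is a minimal sketch of that
refactoring, assuming that both ``orngFSS.attMeasure`` and
``orngFSS.FilterRelief`` accept a ``measure`` argument and that
``orange.MeasureAttribute_relief`` serves as the shared estimator::

   import orange, orngFSS

   data = orange.ExampleTable("adult_sample")

   # one Relief estimator, shared by reporting and filtering
   # (k and m are assumed Relief parameters: neighbours and reference examples)
   relief = orange.MeasureAttribute_relief(k=20, m=50)

   def report_relevance(data, measure):
     for name, score in orngFSS.attMeasure(data, measure):
       print "%5.3f %s" % (score, name)

   report_relevance(data, relief)
   ndata = orngFSS.FilterRelief(measure=relief, margin=0.01)(data)
   report_relevance(ndata, relief)
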
.. index::
   single: feature subset selection; wrapper

Out of the 14 attributes, 9 were considered relevant enough to keep. We
can now check whether this would help a classifier achieve better
performance. We will use 10-fold cross-validation for the comparison. To
do things right, feature subset selection has to be repeated every time
new learning data is seen, so we need to construct a learner that performs
feature subset selection up-front, i.e., before it actually learns. For
the learner, we will use naive Bayes with discretization (a particular
wrapper from orngDisc). The code is quite short, since we also use a
wrapper called ``FilteredLearner`` from the orngFSS module (part of
:download:`fss7.py <code/fss7.py>`, uses :download:`adult_sample.tab <code/adult_sample.tab>`)::

   import orange, orngDisc, orngTest, orngStat, orngFSS

   data = orange.ExampleTable("crx")

   # naive Bayes on discretized data
   bayes = orange.BayesLearner()
   dBayes = orngDisc.DiscretizedLearner(bayes, name='disc bayes')
   # the same learner, wrapped so that attributes scored below the
   # threshold are removed before learning
   fss = orngFSS.FilterAttsAboveThresh(threshold=0.05)
   fBayes = orngFSS.FilteredLearner(dBayes, filter=fss, name='bayes & fss')

   learners = [dBayes, fBayes]
   results = orngTest.crossValidation(learners, data, folds=10, storeClassifiers=1)

Below is the result. In terms of classification accuracy, feature subset
selection did not help. But the rightmost column shows the number of
features used by each classifier (averaged across the ten folds of
cross-validation), and it is quite surprising that, on average, only about
two to three features were sufficient::
   Learner         Accuracy  #Atts
   disc bayes      0.857     14.00
   bayes & fss     0.846      2.60

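To see which attributes the filtered learner keeps when it is trained on
the whole data set (rather than inside cross-validation), one can call the
learner directly and inspect the domain of the resulting classifier. A
minimal sketch, reusing the objects defined above::

   # train the filtered learner on all the data and list the retained attributes
   fClassifier = fBayes(data)
   print "Attributes kept by 'bayes & fss':"
   for att in fClassifier.domain.attributes:
     print " ", att.name
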
The code that computes these statistics, and determines which features
were actually used, is shown below (from :download:`fss7.py <code/fss7.py>`)::

   # how many attributes did each classifier use?
   natt = [0.] * len(learners)
   for fold in range(results.numberOfIterations):
     for lrn in range(len(learners)):
       natt[lrn] += len(results.classifiers[fold][lrn].domain.attributes)
   for lrn in range(len(learners)):
     natt[lrn] = natt[lrn] / 10.   # average over the ten folds

   print "\nLearner         Accuracy  #Atts"
   for i in range(len(learners)):
     print "%-15s %5.3f     %5.2f" % (learners[i].name, orngStat.CA(results)[i], natt[i])

   # which attributes were used in the filtered case?

   print '\nAttribute usage (in how many folds attribute was used?):'
   used = {}
   for fold in range(results.numberOfIterations):
     for att in results.classifiers[fold][1].domain.attributes:
       a = att.name
       if a in used: used[a] += 1
       else: used[a] = 1
   for a in used:
     print '%2d x %s' % (used[a], a)

The following part of the output shows the attribute usage. Quite
interestingly, only four attributes were used in the constructed
classifiers, and only one of them (A9) appeared in all ten classifiers
built during cross-validation::
   Attribute usage (in how many folds attribute was used?):
   10 x A9
    2 x A10
    3 x A7
    6 x A6