source: orange/Orange/doc/modules/orngFSS.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 13.2 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>
<h1>orngFSS: Orange Feature Subset Selection Module</h1>

<index name="modules+feature subset selection">

<p>Module orngFSS implements several functions that support or may
help design feature subset selection for classification problems. The
guiding idea is that some machine learning methods may perform better
if they learn only from a selected subset of "best" features. orngFSS
mostly implements filter approaches, i.e., approaches where attribute
scores are estimated prior to modelling, that is, without knowing
which machine learning method will be used to construct a predictive
model.</p>

<h2>Functions</h2>

<dl>
<dt><b>attMeasure</b>(<i>data</i>[<i>, measure</i>])</dt>
<dd class="ddfun">Assesses the quality (score) of attributes using the
given scoring function (<i>measure</i>) on a data set <i>data</i>, which
should contain a discrete class. Returns a list of tuples (attribute
name, score), sorted by decreasing score. <i>measure</i> is an
attribute quality measure, which should be derived from
<code>orange.MeasureAttribute</code> and defaults to
<code>orange.MeasureAttribute_relief(k=20, m=50)</code>.</dd>

<dt><b>bestNAtts</b>(<i>scores</i>, <i>N</i>)</dt>
<dd class="ddfun">Returns the list of names of the <i>N</i>
highest-ranked attributes from the <i>scores</i> list. The list of
attribute scores (<i>scores</i>) is of the type returned by
the function <code>attMeasure</code>.</dd>

<dt><b>attsAboveThreshold</b>(<i>scores</i>[<i>, threshold</i>])</dt>
<dd class="ddfun">Returns the list of names of attributes that appear
in the list <i>scores</i> and have a score above
<i>threshold</i>. The default value for <i>threshold</i> is 0.0.</dd>

<dt><b>selectBestNAtts</b>(<i>data</i>, <i>scores</i>, <i>N</i>)</dt>
<dd class="ddfun">Constructs and returns a new data set that includes
the class and only the <i>N</i> best attributes from the list
<i>scores</i>. <i>data</i> is used to pass the original data set.</dd>

<dt><b>selectAttsAboveThresh</b>(<i>data</i>, <i>scores</i>[<i>, threshold</i>])</dt>
<dd class="ddfun">Constructs and returns a new data set that
includes the class and those attributes from the list returned by
<code>attMeasure</code> whose score is above or equal to the
specified <i>threshold</i>. <i>data</i> is used to pass the original data
set. The parameter <i>threshold</i> is optional and defaults to 0.0.</dd>


<dt><b>filterRelieff</b>(<i>data</i>[<i>, measure</i>[<em>, margin</em>]])</dt>

<dd class="ddfun">Takes the data set <i>data</i> and an attribute
scoring function <i>measure</i>. Repeatedly estimates attribute
scores and removes the worst-scored attribute if its score is lower
than <i>margin</i>; stops when no attribute score is below this
margin. The default for <i>measure</i> is
<code>orange.MeasureAttribute_relief(k=20, m=50)</code>, and
<i>margin</i> defaults to 0.0. Note that this filter procedure was
originally designed for measures such as Relief, which are context
dependent, i.e., removal of attributes may change the scores of the
remaining attributes, hence the need to re-estimate scores every time an
attribute is removed.</dd> </dl>
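<p>The selection helpers above operate on plain lists of (attribute
name, score) tuples, so their behaviour can be sketched in pure
Python, without Orange. The functions below are hypothetical stand-ins
for <code>bestNAtts</code> and <code>attsAboveThreshold</code>,
written only to illustrate the semantics:</p>

```python
# Hypothetical stand-ins for orngFSS.bestNAtts and orngFSS.attsAboveThreshold.
# They work on the same (attribute name, score) tuples that attMeasure returns,
# which are assumed to be sorted by decreasing score.

def best_n_atts(scores, n):
    # take the names of the first n (highest-scored) attributes
    return [name for name, score in scores[:n]]

def atts_above_threshold(scores, threshold=0.0):
    # keep only names whose score exceeds the threshold
    return [name for name, score in scores if score > threshold]

scores = [("physician-fee-freeze", 0.728),
          ("adoption-of-the-budget-resolution", 0.329),
          ("water-project-cost-sharing", -0.020)]

print(best_n_atts(scores, 2))
print(atts_above_threshold(scores))
```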



<h2>Classes</h2>


<dl>
<dt><b><INDEX name="classes/FilterAttsAboveThresh (in orngFSS)">FilterAttsAboveThresh</b>([<em>measure</em>[<em>, threshold</em>]])</dt>
<dd class="ddfun">This is simply a wrapper around the function
<code>selectAttsAboveThresh</code>. It allows one to create an object
that stores the filter's parameters and can later be called with the data
to return a data set that includes only the selected
attributes. <em>measure</em> is a function that returns a list of
pairs (attribute name, score), and it defaults to
<code>orange.MeasureAttribute_relief(k=20, m=50)</code>. The default
threshold is 0.0. Some examples of how to use this class:

<xmp class=code>filter = orngFSS.FilterAttsAboveThresh(threshold=.15)
new_data = filter(data)
new_data = orngFSS.FilterAttsAboveThresh(data)
new_data = orngFSS.FilterAttsAboveThresh(data, threshold=.1)
new_data = orngFSS.FilterAttsAboveThresh(data, threshold=.1,
             measure=orange.MeasureAttribute_gini())
</xmp>
</dd>

<dt><b><INDEX name="classes/FilterBestNAtts (in orngFSS)">FilterBestNAtts</b>([<em>measure</em>[<em>, n</em>]])</dt>
<dd class="ddfun">Similarly to <code>FilterAttsAboveThresh</code>,
this is a wrapper around the function
<code>selectBestNAtts</code>. The measure and the number of attributes to
retain are optional (the latter defaults to 5).</dd>

<dt><b><INDEX name="classes/FilterRelieff (in orngFSS)"><index name="ReliefF">FilterRelieff</b>([<em>measure</em>[<em>, margin</em>]])</dt>
<dd class="ddfun">Similarly to <code>FilterBestNAtts</code>, this is a
wrapper around the function
<code>filterRelieff</code>. <em>measure</em> and <em>margin</em> are
optional parameters, where <em>measure</em> defaults to
<code>orange.MeasureAttribute_relief(k=20, m=50)</code> and
<em>margin</em> to 0.0.</dd>

<dt><b><INDEX name="classes/FilteredLearner (in orngFSS)"><index name="classifiers/with attribute selection">FilteredLearner</b>([<em>baseLearner</em>[<em>,
examples</em>[<em>, filter</em>[<em>, name</em>]]]])</dt> <dd>Wraps a
<em>baseLearner</em> using a data <em>filter</em>, and returns the
corresponding learner. When such a learner is presented with a data set,
the data is first filtered and then passed to
<em>baseLearner</em>. <em>FilteredLearner</em> comes in handy when one
wants to test the feature-subset-selection-and-learning scheme with
some repetitive evaluation method, e.g., cross-validation. The filter
defaults to <code>orngFSS.FilterAttsAboveThresh</code> with default
parameters. Here is an example of how to set up such a learner (building a
wrapper around the naive Bayesian learner) and use it on a data set:

<xmp class=code>nb = orange.BayesLearner()
learner = orngFSS.FilteredLearner(nb, filter=orngFSS.FilterBestNAtts(n=5), name='filtered')
classifier = learner(data)
</xmp>
</dd>

</dl>
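<p>All the filter classes above follow the same pattern: parameters
are stored at construction time and the object is later called with
the data. A rough, Orange-free sketch of that pattern (the class and
the dictionary-based "data" below are hypothetical, used only for
illustration):</p>

```python
# Hypothetical sketch of the parameter-storing, callable filter pattern used by
# FilterAttsAboveThresh and FilterBestNAtts. Real filters take an ExampleTable;
# here a plain dict of attribute name -> score stands in for the data.

class ThresholdFilter:
    def __init__(self, threshold=0.0):
        self.threshold = threshold   # stored now, used at call time

    def __call__(self, scored_data):
        # return the (sorted) names whose score reaches the threshold
        return sorted(a for a, s in scored_data.items() if s >= self.threshold)

f = ThresholdFilter(threshold=0.15)
print(f({"physician-fee-freeze": 0.728, "water-project-cost-sharing": 0.02}))
```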


<h2>Examples</h2>

<h3>Score Estimation</h3>

<p>Let us start with a simple script that reads the data, uses
orngFSS.attMeasure to derive attribute scores, and prints the scores
of the three best-ranked attributes. The same scores are then used to
select and report the names of the three best attributes.</p>

<p class="header"><a href="fss1.py">fss1.py</a> (uses <a href=
"voting.tab">voting.tab</a>)</p>
<xmp class=code>import orange, orngFSS
data = orange.ExampleTable("voting")

print 'Score estimate for first three attributes:'
ma = orngFSS.attMeasure(data)
for m in ma[:3]:
  print "%5.3f %s" % (m[1], m[0])

n = 3
best = orngFSS.bestNAtts(ma, n)
print '\nBest %d attributes:' % n
for s in best:
  print s
</xmp>

<p>The script should output something like:</p>

<xmp class=printout>Score estimate for first three attributes:
0.728 physician-fee-freeze
0.329 adoption-of-the-budget-resolution
0.321 synfuels-corporation-cutback

Best 3 attributes:
physician-fee-freeze
adoption-of-the-budget-resolution
synfuels-corporation-cutback</xmp>
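<p>Conceptually, <code>attMeasure</code> just pairs each attribute
with its score and sorts the pairs by decreasing score. A plain-Python
sketch of that shape (the scoring dictionary below is made up, not
computed by ReliefF):</p>

```python
# Hypothetical sketch of the (name, score) list that attMeasure returns:
# pairs sorted by decreasing score. The scores here are invented.

def att_measure(raw_scores):
    # raw_scores: attribute name -> score; return pairs, best first
    return sorted(raw_scores.items(), key=lambda kv: kv[1], reverse=True)

ma = att_measure({"crime": 0.200,
                  "physician-fee-freeze": 0.728,
                  "education-spending": 0.050})
for name, score in ma:
    print("%5.3f %s" % (score, name))
```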

<h3>Different Score Measures</h3>

<p>The following script reports gain ratio and Relief attribute
scores. Notice that for our data set the two rankings match
rather well!</p>

<p class="header"><a href="fss2.py">fss2.py</a> (uses <a href=
"voting.tab">voting.tab</a>)</p>
<xmp class=code>import orange, orngFSS
data = orange.ExampleTable("voting")

print 'Relief GainRt Attribute'
ma_def = orngFSS.attMeasure(data)
gainRatio = orange.MeasureAttribute_gainRatio()
ma_gr  = orngFSS.attMeasure(data, gainRatio)
for i in range(5):
  print "%5.3f  %5.3f  %s" % (ma_def[i][1], ma_gr[i][1], ma_def[i][0])
</xmp>

<h3>Filter Approach for Machine Learning</h3>

<p>Attribute scoring has at least two potential uses. One is
informative (or descriptive): the data analyst can use attribute
scoring to find "good" attributes and those that are irrelevant for
the given classification task. The other is to improve the
performance of machine learning by learning only from a data set
that includes the most informative features. This so-called filter
approach can boost the performance of a learner in terms of
predictive accuracy, speed of induction, and simplicity of the
resulting models.</p>

<p>Following is a script that defines a new classifier that is based
on naive Bayes and, prior to learning, selects the five best attributes
from the data set. The new classifier is wrapped in a special class (see the
<a href="../ofb/c_pythonlearner.htm">Building your own learner</a>
lesson in <a href="../ofb/default.htm">Orange for Beginners</a>). The
script compares this filtered learner with naive Bayes that uses the
complete set of attributes.</p>

<p class="header"><a href="fss3.py">fss3.py</a> (uses <a href=
"voting.tab">voting.tab</a>)</p>

<xmp class=code>import orange, orngFSS

class BayesFSS(object):
  def __new__(cls, examples=None, **kwds):
    learner = object.__new__(cls)
    if examples:
      learner.__init__(**kwds)
      return learner(examples)
    else:
      return learner

  def __init__(self, name='Naive Bayes with FSS', N=5):
    self.name = name
    self.N = N

  def __call__(self, data, weight=None):
    ma = orngFSS.attMeasure(data)
    filtered = orngFSS.selectBestNAtts(data, ma, self.N)
    model = orange.BayesLearner(filtered)
    return BayesFSS_Classifier(classifier=model, N=self.N, name=self.name)

class BayesFSS_Classifier:
  def __init__(self, **kwds):
    self.__dict__.update(kwds)

  def __call__(self, example, resultType = orange.GetValue):
    return self.classifier(example, resultType)

# test the above wrapper on a data set
import orngStat, orngTest
data = orange.ExampleTable("voting")
learners = (orange.BayesLearner(name='Naive Bayes'), BayesFSS(name="with FSS"))
results = orngTest.crossValidation(learners, data)

# output the results
print "Learner      CA"
for i in range(len(learners)):
  print "%-12s %5.3f" % (learners[i].name, orngStat.CA(results)[i])
</xmp>

<p>Interestingly, and as one would expect, feature subset selection
helps. This is the output that we get:</p>

<xmp class=printout>Learner      CA
Naive Bayes  0.903
with FSS     0.940
</xmp>

<h3>... And a Much Simpler One</h3>

<p>Although perhaps educational, we can do all of the above by
wrapping the learner using <code>FilteredLearner</code>, thus creating
an object that is assembled from a data filter and a base learner. When
given the data, this learner uses the attribute filter to construct a new
data set and the base learner to construct a corresponding
classifier. Attribute filters should be of a type like
<code>orngFSS.FilterAttsAboveThresh</code> or
<code>orngFSS.FilterBestNAtts</code>: initialized with the
arguments and later presented with the data, they return a new, reduced
data set.</p>

<p>The following code fragment essentially replaces the bulk of the code
from the previous example, and compares the naive Bayesian classifier to
the same classifier when only the single most important attribute is
used:</p>

<p class="header">from <a href="fss4.py">fss4.py</a> (uses <a href=
"voting.tab">voting.tab</a>)</p>

<xmp class=code>nb = orange.BayesLearner()
learners = (orange.BayesLearner(name='bayes'),
            FilteredLearner(nb, filter=FilterBestNAtts(n=1), name='filtered'))
results = orngTest.crossValidation(learners, data, storeClassifiers=1)
</xmp>

<p>Now, let's decide to retain three attributes (change the code in <a
href="fss4.py">fss4.py</a> accordingly!) and observe how many times
each attribute was used. Remember, 10-fold cross-validation constructs
ten instances of each classifier, and each time we run
FilteredLearner a different set of attributes may be
selected. The cross-validation call stores the classifiers in
the <code>results</code> variable, and <code>FilteredLearner</code>
returns a classifier that can tell which attributes it used (how
convenient!), so the code to do all this is quite short:</p>

<p class="header">from <a href="fss4.py">fss4.py</a> (uses <a href=
"voting.tab">voting.tab</a>)</p>

<xmp class=code>print "\nNumber of times attributes were used in cross-validation:\n"
attsUsed = {}
for i in range(10):
  for a in results.classifiers[i][1].atts():
    if a.name in attsUsed: attsUsed[a.name] += 1
    else: attsUsed[a.name] = 1
for k in attsUsed.keys():
  print "%2d x %s" % (attsUsed[k], k)
</xmp>
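<p>The dictionary-counting loop above can also be written with
<code>collections.Counter</code> from the Python standard library. In
the sketch below, the per-fold attribute-name lists are hypothetical
stand-ins for what <code>results.classifiers[i][1].atts()</code> would
report:</p>

```python
from collections import Counter

# Hypothetical per-fold attribute-name lists, standing in for the attributes
# reported by results.classifiers[i][1].atts() in the cross-validation run.
folds = [["physician-fee-freeze", "crime", "el-salvador-aid"],
         ["physician-fee-freeze", "crime", "adoption-of-the-budget-resolution"],
         ["physician-fee-freeze", "crime", "el-salvador-aid"]]

atts_used = Counter()
for atts in folds:
    atts_used.update(atts)          # count each attribute once per fold

for name, count in atts_used.most_common():
    print("%2d x %s" % (count, name))
```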

<p>Running <a href="fss4.py">fss4.py</a> with three attributes
selected each time a learner is run gives the following result:</p>

<xmp class=printout>Learner      CA
bayes        0.903
filtered     0.956

Number of times attributes were used in cross-validation:
 3 x el-salvador-aid
 6 x synfuels-corporation-cutback
 7 x adoption-of-the-budget-resolution
10 x physician-fee-freeze
 4 x crime
</xmp>

<p>Experiment yourself: if only one attribute is retained for each
classifier, which attribute is selected most frequently over
the ten cross-validation runs?</p>
324
325<hr>
326
327<h2>References</h2>
328
<p>K. Kira and L. Rendell. A practical approach to feature
selection. In D. Sleeman and P. Edwards, editors, <em>Proc. 9th Int'l
Conf. on Machine Learning</em>, pages 249-256, Aberdeen, 1992. Morgan
Kaufmann Publishers.</p>

<p>I. Kononenko. Estimating attributes: Analysis and extensions of
RELIEF. In F. Bergadano and L. De Raedt, editors, <em>Proc. European
Conf. on Machine Learning (ECML-94)</em>, pages
171-182. Springer-Verlag, 1994.</p>

<p>R. Kohavi and G. John. Wrappers for feature subset selection.
<em>Artificial Intelligence</em>, 97(1-2), pages 273-324, 1997.</p>

</body>
</html>