source: orange/orange/Orange/feature/selection.py @ 9349:fa13a2c52fcd

1"""
2#########################
3Selection (``selection``)
4#########################
5
6.. index:: feature selection
7
8.. index::
9   single: feature; feature selection
10
Some machine learning methods perform better if they learn only from a
selected subset of the "best" features.

The performance of a machine learning method can often be improved by
learning only from a subset of the data that includes the most informative
or "best" features. Such filter approaches can boost the performance of a
learner in terms of predictive accuracy, speed of induction, and simplicity
of the resulting models. Feature scores are estimated prior to modelling,
that is, without knowing which machine learning method will be used to
construct the predictive model.

:download:`selection-best3.py <code/selection-best3.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-best3.py
    :lines: 7-

The script should output this::

    Best 3 features:
    physician-fee-freeze
    el-salvador-aid
    synfuels-corporation-cutback

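A minimal sketch of the idea behind such a script (assuming the bundled
voting.tab data set and that :obj:`Orange.feature.scoring.score_all`
returns (feature, score) pairs ordered by decreasing score)::

    import Orange

    data = Orange.data.Table("voting")
    # score all features, then keep only the three best-scored ones
    scores = Orange.feature.scoring.score_all(data)
    for feature in Orange.feature.selection.bestNAtts(scores, 3):
        print feature
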
.. automethod:: Orange.feature.selection.FilterAttsAboveThresh

.. autoclass:: Orange.feature.selection.FilterAttsAboveThresh_Class
   :members:

.. automethod:: Orange.feature.selection.FilterBestNAtts

.. autoclass:: Orange.feature.selection.FilterBestNAtts_Class
   :members:

.. automethod:: Orange.feature.selection.FilterRelief

.. autoclass:: Orange.feature.selection.FilterRelief_Class
   :members:

.. automethod:: Orange.feature.selection.FilteredLearner

.. autoclass:: Orange.feature.selection.FilteredLearner_Class
   :members:

.. autoclass:: Orange.feature.selection.FilteredClassifier
   :members:

These functions support the design of feature subset selection for
classification problems.

.. automethod:: Orange.feature.selection.bestNAtts

.. automethod:: Orange.feature.selection.attsAboveThreshold

.. automethod:: Orange.feature.selection.selectBestNAtts

.. automethod:: Orange.feature.selection.selectAttsAboveThresh

.. automethod:: Orange.feature.selection.filterRelieff

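As a quick, hedged illustration of how these functions combine (again
assuming the voting.tab data set)::

    import Orange

    data = Orange.data.Table("voting")
    scores = Orange.feature.scoring.score_all(data)

    # keep only the five best-scored features (plus the class)
    best5 = Orange.feature.selection.selectBestNAtts(data, scores, 5)

    # or keep every feature scored above a threshold
    above = Orange.feature.selection.selectAttsAboveThresh(data, scores, 0.05)

    print len(data.domain.attributes), "->", len(best5.domain.attributes)
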
.. rubric:: Examples

The following script defines a new classifier based on naive Bayes that,
prior to learning, selects the five best features from the data set. The
new classifier is wrapped up in a special class (see the lesson
`Building your own learner <../ofb/c_pythonlearner.htm>`_ in
`Orange for Beginners <../ofb/default.htm>`_). The script compares this
filtered learner with naive Bayes that uses the complete set of features.

:download:`selection-bayes.py <code/selection-bayes.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-bayes.py
    :lines: 7-

Interestingly, and perhaps expectedly, feature subset selection
helps. This is the output that we get::

    Learner      CA
    Naive Bayes  0.903
    with FSS     0.940

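A minimal sketch of such a wrapper class (assuming the usual Orange
learner protocol, where a learner called with data returns a
classifier)::

    import Orange

    class BayesFSS(object):
        def __init__(self, name='bayes+FSS', N=5):
            self.name = name
            self.N = N

        def __call__(self, data, weight=0):
            # select the N best features, then learn naive Bayes on them
            scores = Orange.feature.scoring.score_all(data)
            reduced = Orange.feature.selection.selectBestNAtts(data, scores, self.N)
            return Orange.classification.bayes.NaiveBayesLearner(reduced)
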
Now for a much simpler example. Although the above was perhaps
educational, we can achieve the same by wrapping the learner in a
:obj:`FilteredLearner`, thus creating an object that is assembled from a
data filter and a base learner. When given data, this learner uses the
attribute filter to construct a new data set and the base learner to
construct a corresponding classifier. Attribute filters should be of a
type like :obj:`FilterAttsAboveThresh` or :obj:`FilterBestNAtts`: they
are initialized with their arguments and, when later presented with data,
return a new, reduced data set.

The following code fragment essentially replaces the bulk of the code
from the previous example, and compares the naive Bayesian classifier to
the same classifier when only the single most important attribute is
used.

:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 13-16

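The surrounding evaluation could look roughly like this (the ``orngTest``
and ``orngStat`` calls are assumptions about the evaluation modules of
that era, not taken from the script)::

    import Orange
    import orngTest, orngStat

    data = Orange.data.Table("voting")
    nb = Orange.classification.bayes.NaiveBayesLearner()
    nb.name = 'bayes'
    fl = Orange.feature.selection.FilteredLearner(nb,
             filter=Orange.feature.selection.FilterBestNAtts(n=1), name='filtered')

    results = orngTest.crossValidation([nb, fl], data, folds=10)
    for learner, ca in zip([nb, fl], orngStat.CA(results)):
        print "%-12s %5.3f" % (learner.name, ca)
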
Now, let's decide to retain three features (change the code in
:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>`
accordingly!), but observe how many times an attribute was used. Remember,
10-fold cross validation constructs ten instances for each classifier, and
each time we run :obj:`FilteredLearner` a different set of features may be
selected. ``orngEval.CrossValidation`` stores the classifiers in the
``results`` variable, and :obj:`FilteredLearner` returns a classifier
that can tell which features it used (how convenient!), so the code to do
all this is quite short.

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 25-

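A hedged sketch of this counting step (it assumes the evaluation results
keep the trained classifiers per fold, e.g. via a ``storeClassifiers``
flag, and uses the :obj:`FilteredClassifier.atts` method of this
module)::

    results = orngTest.crossValidation([fl], data, folds=10, storeClassifiers=1)
    used = {}
    for fold_classifiers in results.classifiers:
        for classifier in fold_classifiers:
            for feature in classifier.atts():
                used[feature.name] = used.get(feature.name, 0) + 1

    print "Number of times features were used in cross-validation:"
    for name, count in used.items():
        print "%2d x %s" % (count, name)
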
Running :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` with three features selected each
time a learner is run gives the following result::

    Learner      CA
    bayes        0.903
    filtered     0.956

    Number of times features were used in cross-validation:
     3 x el-salvador-aid
     6 x synfuels-corporation-cutback
     7 x adoption-of-the-budget-resolution
    10 x physician-fee-freeze
     4 x crime

Experiment yourself to see which attribute is selected most frequently
across all ten cross-validation folds when only a single attribute is
retained for the classifier!

==========
References
==========

* K. Kira and L. Rendell. A practical approach to feature selection. In
  D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine
  Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.

* I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF.
  In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine
  Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.

* R. Kohavi, G. John. Wrappers for feature subset selection. Artificial
  Intelligence, 97 (1-2), pages 273-324, 1997.

"""

__docformat__ = 'restructuredtext'

import Orange.core as orange

from Orange.feature.scoring import score_all

# from orngFSS
def bestNAtts(scores, N):
    """Return the best N features (without scores) from the list returned
    by :obj:`Orange.feature.scoring.score_all`.

    :param scores: a list such as returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of best features to select.
    :type N: int
    :rtype: :obj:`list`

    """
    return [x[0] for x in scores[:N]]

def attsAboveThreshold(scores, threshold=0.0):
    """Return features (without scores) from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is above a
    specified threshold.

    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :obj:`list`

    """
    return [x[0] for x in scores if x[1] > threshold]

def selectBestNAtts(data, scores, N):
    """Construct and return a new example table that includes the class
    and only the N best features from the list of scores.

    :param data: an example table
    :type data: :obj:`Orange.data.Table`
    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of features to select
    :type N: int
    :rtype: :obj:`Orange.data.Table` holding the N best features

    """
    return data.select(bestNAtts(scores, N) + [data.domain.classVar.name])


def selectAttsAboveThresh(data, scores, threshold=0.0):
    """Construct and return a new example table that includes the class
    and the features from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is above a
    specified threshold.

    :param data: an example table
    :type data: :obj:`Orange.data.Table`
    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :obj:`Orange.data.Table` holding the selected features

    """
    return data.select(attsAboveThreshold(scores, threshold) + [data.domain.classVar.name])

def filterRelieff(data, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0):
    """Take the data set and use an attribute measure to repeatedly
    remove the worst-scored attribute while its score falls below the
    margin, re-estimating the scores after each removal.

    .. note:: This filter procedure was originally designed for measures
       such as Relief, which are context dependent, i.e., removal of
       features may change the scores of the remaining features. Hence
       the need to re-estimate the scores every time an attribute is
       removed.

    :param data: a data table
    :type data: :obj:`Orange.data.Table`
    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.
    :param margin: attributes scored above or equal to the margin are not
      removed. Defaults to 0.
    :type margin: float

    """
    measl = score_all(data, measure)
    while len(data.domain.attributes) > 0 and measl[-1][1] < margin:
        data = selectBestNAtts(data, measl, len(data.domain.attributes) - 1)
        measl = score_all(data, measure)
    return data

##############################################################################
# wrappers

def FilterAttsAboveThresh(data=None, **kwds):
    """Wrap the class :obj:`FilterAttsAboveThresh_Class`; if data is
    given, return the filtered data table, otherwise return the filter
    object itself.
    """
    filter = FilterAttsAboveThresh_Class(**kwds)
    if data:
        return filter(data)
    else:
        return filter

class FilterAttsAboveThresh_Class:
    """Stores the filter's parameters and can later be called with data to
    return a data table with only the selected features.

    This class is used in the function :obj:`selectAttsAboveThresh`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float

    Some examples of how to use this class are::

        filter = Orange.feature.selection.FilterAttsAboveThresh(threshold=.15)
        new_data = filter(data)
        new_data = Orange.feature.selection.FilterAttsAboveThresh(data)
        new_data = Orange.feature.selection.FilterAttsAboveThresh(data, threshold=.1)
        new_data = Orange.feature.selection.FilterAttsAboveThresh(data, threshold=.1,
                   measure=Orange.feature.scoring.Gini())

    """
    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 threshold=0.0):
        self.measure = measure
        self.threshold = threshold

    def __call__(self, data):
        """Take data and return a data table with only the features whose
        scores are above the stored threshold.

        :param data: a data table
        :type data: :obj:`Orange.data.Table`

        """
        ma = score_all(data, self.measure)
        return selectAttsAboveThresh(data, ma, self.threshold)

def FilterBestNAtts(data=None, **kwds):
    """Similarly to :obj:`FilterAttsAboveThresh`, wrap the class
    :obj:`FilterBestNAtts_Class`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.
    :param n: number of best features to return. Defaults to 5.
    :type n: int

    """
    filter = FilterBestNAtts_Class(**kwds)
    if data:
        return filter(data)
    else:
        return filter

class FilterBestNAtts_Class:
    """Store the filter's parameters and, when called with data, return
    a data table with only the n best-scored features.
    """
    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50), n=5):
        self.measure = measure
        self.n = n

    def __call__(self, data):
        ma = score_all(data, self.measure)
        self.n = min(self.n, len(data.domain.attributes))
        return selectBestNAtts(data, ma, self.n)

def FilterRelief(data=None, **kwds):
    """Similarly to :obj:`FilterBestNAtts`, wrap the class
    :obj:`FilterRelief_Class`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.
    :param margin: margin for Relief scoring. Defaults to 0.
    :type margin: float

    """
    filter = FilterRelief_Class(**kwds)
    if data:
        return filter(data)
    else:
        return filter

class FilterRelief_Class:
    """Store the filter's parameters and, when called with data, return
    a data table filtered with :obj:`filterRelieff`.
    """
    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0):
        self.measure = measure
        self.margin = margin

    def __call__(self, data):
        return filterRelieff(data, self.measure, self.margin)

##############################################################################
# wrapped learner

def FilteredLearner(baseLearner, examples=None, weight=None, **kwds):
    """Return a learner that wraps the given base learner together with a
    data selection method.

    When such a learner is presented a data table, the data is first
    filtered and then passed to the base learner. This comes in handy when
    one wants to test a feature-subset-selection-and-learning scheme with
    some repetitive evaluation method, e.g., cross validation.

    :param filter: defaults to
      :obj:`Orange.feature.selection.FilterAttsAboveThresh`
    :type filter: Orange.feature.selection.FilterAttsAboveThresh

    Here is an example of how to build a wrapper around the naive Bayesian
    learner and use it on a data set::

        nb = Orange.classification.bayes.NaiveBayesLearner()
        learner = Orange.feature.selection.FilteredLearner(nb,
                  filter=Orange.feature.selection.FilterBestNAtts(n=5), name='filtered')
        classifier = learner(data)

    """
    learner = FilteredLearner_Class(baseLearner, **kwds)
    if examples:
        return learner(examples, weight)
    else:
        return learner

class FilteredLearner_Class:
    def __init__(self, baseLearner, filter=FilterAttsAboveThresh(), name='filtered'):
        self.baseLearner = baseLearner
        self.filter = filter
        self.name = name

    def __call__(self, data, weight=0):
        # filter the data, then let the base learner fit the reduced data set
        fdata = self.filter(data)
        model = self.baseLearner(fdata, weight)
        return FilteredClassifier(classifier=model, domain=model.domain)

class FilteredClassifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        return self.classifier(example, resultType)

    def atts(self):
        # features actually used by the wrapped classifier
        return self.domain.attributes
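
# Minimal smoke-test sketch, assuming the bundled voting.tab data set is
# available on Orange's data path; orange.ExampleTable and
# orange.BayesLearner come from Orange.core, imported above.
if __name__ == "__main__":
    data = orange.ExampleTable("voting")
    learner = FilteredLearner(orange.BayesLearner(),
                              filter=FilterBestNAtts(n=3), name='filtered')
    classifier = learner(data)
    print "Features used:", [att.name for att in classifier.atts()]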