source: orange/Orange/feature/selection.py @ 9676:895804462430

1"""
2#########################
3Selection (``selection``)
4#########################
5
6.. index:: feature selection
7
8.. index::
9   single: feature; feature selection
10
Some machine learning methods perform better if they learn only from a
selected subset of the most informative or "best" features.

This so-called filter approach can boost the performance of a learner in
terms of predictive accuracy, speed up induction, and yield simpler
models. Feature scores are estimated before modeling, without knowing
which machine learning method will be used to construct the predictive
model.

Example script: :download:`selection-best3.py <code/selection-best3.py>`

.. literalinclude:: code/selection-best3.py
    :lines: 7-

The script should output this::

    Best 3 features:
    physician-fee-freeze
    el-salvador-aid
    synfuels-corporation-cutback

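For orientation, such a script can be as short as the following sketch
(the ``voting`` data set name is an assumption; the bundled script may
differ in detail)::

    import Orange

    data = Orange.data.Table("voting")  # assumed data set name
    # score all features and keep the names of the three best
    scores = Orange.feature.scoring.score_all(data)
    print "Best 3 features:"
    for feature in Orange.feature.selection.best_n(scores, 3):
        print feature
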
.. autoclass:: Orange.feature.selection.FilterAboveThreshold
   :members:

.. autoclass:: Orange.feature.selection.FilterBestN
   :members:

.. autoclass:: Orange.feature.selection.FilterRelief
   :members:

.. autoclass:: Orange.feature.selection.FilteredLearner
   :members:

.. autoclass:: Orange.feature.selection.FilteredClassifier
   :members:

These functions support the design of feature subset selection for
classification problems.

.. automethod:: Orange.feature.selection.best_n

.. automethod:: Orange.feature.selection.above_threshold

.. automethod:: Orange.feature.selection.select_best_n

.. automethod:: Orange.feature.selection.select_above_threshold

.. automethod:: Orange.feature.selection.select_relief

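For instance, scoring and selection can be combined like this (a sketch;
``data`` is any class-labeled data table)::

    scores = Orange.feature.scoring.score_all(data)
    # either keep the three best features ...
    reduced = Orange.feature.selection.select_best_n(data, scores, 3)
    # ... or keep every feature scoring strictly above a threshold
    reduced = Orange.feature.selection.select_above_threshold(data, scores, 0.05)
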
.. rubric:: Examples

The following script defines a new naive Bayes classifier that selects
the five best features from the data set before learning. The new
classifier is wrapped in a special class (see the lesson `Building your
own learner <../ofb/c_pythonlearner.htm>`_ in `Orange for Beginners
<../ofb/default.htm>`_). The script compares this filtered learner with
one that uses the complete set of features.

:download:`selection-bayes.py <code/selection-bayes.py>`

.. literalinclude:: code/selection-bayes.py
    :lines: 7-

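In outline, such a wrapper might look like this (a sketch built on this
module's functions; the class and attribute names are illustrative only,
and the real script may differ)::

    class BayesFSS(object):
        # naive Bayes learner that first keeps only the N best features
        def __init__(self, name='Naive Bayes with FSS', N=5):
            self.name = name
            self.N = N

        def __call__(self, data, weight=None):
            # score the features, keep the N best, learn on the reduced data
            scores = Orange.feature.scoring.score_all(data)
            reduced = Orange.feature.selection.select_best_n(data, scores, self.N)
            return Orange.classification.bayes.NaiveBayesLearner()(reduced)
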
Interestingly, and perhaps expectedly, feature subset selection helps.
This is the output we get::

    Learner      CA
    Naive Bayes  0.903
    with FSS     0.940

We can do all of the above by wrapping the learner in a
:obj:`FilteredLearner`, thus creating an object that is assembled from a
data filter and a base learner. Given a data table, this learner uses the
attribute filter to construct a new data set and the base learner to
construct a corresponding classifier. Attribute filters should be of a
type like :obj:`FilterAboveThreshold` or :obj:`FilterBestN`: initialized
with their arguments up front, they can later be presented with data and
return a new, reduced data set.

The following code fragment replaces the bulk of code from the previous
example, and compares the naive Bayesian classifier to the same
classifier when only the single most important attribute is used.

:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>`

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 13-16

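In outline, the fragment amounts to this (a sketch; the learner names are
illustrative)::

    nb = Orange.classification.bayes.NaiveBayesLearner()
    nb.name = 'bayes'
    fl = Orange.feature.selection.FilteredLearner(nb,
        filter=Orange.feature.selection.FilterBestN(n=1), name='filtered')
    learners = [nb, fl]
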
Now, let's retain three features (change the code in
:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>`
accordingly!) and observe how many times each attribute was used.
Remember, 10-fold cross-validation constructs ten instances of each
classifier, and each time we run :obj:`FilteredLearner` a different set
of features may be selected. ``orngEval.CrossValidation`` stores the
classifiers in the ``results`` variable, and :obj:`FilteredLearner`
returns classifiers that can tell which features they used (how
convenient!), so the code to do all this is quite short.

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 25-

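The gist of the counting is sketched below; it assumes that
``results.classifiers`` holds, for each fold, the list of stored
classifiers (available when classifier storing is enabled during
testing), with the filtered learner at index 1::

    usage = {}
    for fold_classifiers in results.classifiers:
        filtered = fold_classifiers[1]       # the FilteredClassifier
        for feature in filtered.atts():      # features kept by the filter
            usage[feature.name] = usage.get(feature.name, 0) + 1
    print "Number of times features were used in cross-validation:"
    for name, count in usage.items():
        print "%2d x %s" % (count, name)
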
Running :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>`
with three features selected each time a learner is run gives the
following result::

    Learner      CA
    bayes        0.903
    filtered     0.956

    Number of times features were used in cross-validation:
     3 x el-salvador-aid
     6 x synfuels-corporation-cutback
     7 x adoption-of-the-budget-resolution
    10 x physician-fee-freeze
     4 x crime

Experiment for yourself: if only one attribute is retained for the
classifier, which attribute is most frequently selected over the ten
cross-validation runs?

==========
References
==========

* K. Kira and L. Rendell. A practical approach to feature selection. In
  D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine
  Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.

* I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF.
  In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine
  Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.

* R. Kohavi and G. John. Wrappers for feature subset selection. Artificial
  Intelligence, 97(1-2), pages 273-324, 1997.

"""

__docformat__ = 'restructuredtext'

import Orange.core as orange

from Orange.feature.scoring import score_all


def best_n(scores, N):
    """Return the best N features (without scores) from the list returned
    by :obj:`Orange.feature.scoring.score_all`.

    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of best features to select.
    :type N: int
    :rtype: :obj:`list`

    """
    return [x[0] for x in scores[:N]]

bestNAtts = best_n


def above_threshold(scores, threshold=0.0):
    """Return features (without scores) from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is strictly
    above a specified threshold.

    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :obj:`list`

    """
    # keep pairs scoring strictly above the threshold, then drop the scores
    return [x[0] for x in scores if x[1] > threshold]

attsAboveThreshold = above_threshold


def select_best_n(data, scores, N):
    """Construct and return a new data table that includes the class and
    only the N best features from the scores list.

    :param data: an example table
    :type data: :obj:`Orange.data.Table`
    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of features to select
    :type N: int
    :rtype: :class:`Orange.data.Table` holding the N best features

    """
    return data.select(best_n(scores, N) + [data.domain.classVar.name])

selectBestNAtts = select_best_n


def select_above_threshold(data, scores, threshold=0.0):
    """Construct and return a new data table that includes the class and
    the features from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is strictly
    above a specified threshold.

    :param data: an example table
    :type data: :obj:`Orange.data.Table`
    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :class:`Orange.data.Table` holding the selected features

    """
    return data.select(above_threshold(scores, threshold) +
                       [data.domain.classVar.name])

selectAttsAboveThresh = select_above_threshold


def select_relief(data, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0):
    """Take the data set, use an attribute measure to remove the
    worst-scored attribute, and repeat until no attribute scores below
    the margin.

    .. note:: This filter procedure was originally designed for measures
       such as Relief, which are context dependent, i.e., removal of
       features may change the scores of the remaining features. Hence
       the scores are re-estimated each time an attribute is removed.

    :param data: a data table
    :type data: :obj:`Orange.data.Table`
    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param margin: attributes are removed while the worst score is below
      this margin. Defaults to 0.
    :type margin: float

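    A minimal example of use (a sketch; ``data`` is any class-labeled
    data table)::

        reduced = Orange.feature.selection.select_relief(data)
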
254    """
255    measl = score_all(data, measure)
256    while len(data.domain.attributes) > 0 and measl[-1][1] < margin:
257        data = select_best_n(data, measl, len(data.domain.attributes) - 1)
258#        print 'remaining ', len(data.domain.attributes)
259        measl = score_all(data, measure)
260    return data
261
262filterRelieff = select_relief


class FilterAboveThreshold(object):
    """Store filter parameters and, when given data, return a data table
    that keeps only the features whose scores are above the threshold.

    This class uses :obj:`select_above_threshold`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float

    Some examples of how to use this class are::

        filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
        new_data = filter(data)
        new_data = Orange.feature.selection.FilterAboveThreshold(data)
        new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
        new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1,
                   measure=Orange.feature.scoring.Gini())

    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                threshold=0.0):
        # with data, construct the filter and apply it immediately;
        # without data, construct the filter object for later use
        if data is None:
            self = object.__new__(cls)
            return self
        else:
            self = cls(measure=measure, threshold=threshold)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 threshold=0.0):
        self.measure = measure
        self.threshold = threshold

    def __call__(self, data):
        """Return a data table that keeps only the features whose scores
        are above the threshold.

        :param data: a data table
        :type data: :obj:`Orange.data.Table`

        """
        ma = score_all(data, self.measure)
        return select_above_threshold(data, ma, self.threshold)

FilterAttsAboveThresh = FilterAboveThreshold
FilterAttsAboveThresh_Class = FilterAboveThreshold


class FilterBestN(object):
    """Store filter parameters and, when given data, return a data table
    that keeps only the n best features.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param n: number of best features to return. Defaults to 5.
    :type n: int

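    Example (a sketch, mirroring :obj:`FilterAboveThreshold`)::

        filter = Orange.feature.selection.FilterBestN(n=3)
        new_data = filter(data)
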
    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                n=5):
        # with data, construct the filter and apply it immediately
        if data is None:
            self = object.__new__(cls)
            return self
        else:
            self = cls(measure=measure, n=n)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 n=5):
        self.measure = measure
        self.n = n

    def __call__(self, data):
        ma = score_all(data, self.measure)
        # never request more features than the data provides
        self.n = min(self.n, len(data.domain.attributes))
        return select_best_n(data, ma, self.n)

FilterBestNAtts = FilterBestN
FilterBestNAtts_Class = FilterBestN


class FilterRelief(object):
    """Store filter parameters and, when given data, return a data table
    that keeps only the features retained by :obj:`select_relief`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param margin: margin for Relief scoring. Defaults to 0.
    :type margin: float

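    Example (a sketch)::

        filter = Orange.feature.selection.FilterRelief(margin=0.1)
        new_data = filter(data)
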
    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                margin=0):
        # with data, construct the filter and apply it immediately
        if data is None:
            self = object.__new__(cls)
            return self
        else:
            self = cls(measure=measure, margin=margin)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 margin=0):
        self.measure = measure
        self.margin = margin

    def __call__(self, data):
        return select_relief(data, self.measure, self.margin)

FilterRelief_Class = FilterRelief

##############################################################################
# wrapped learner

class FilteredLearner(object):
    """A learner that wraps a base learner together with a data selection
    method.

    When called with a data table, the data is first filtered and then
    passed to the base learner. This comes in handy when one wants to
    evaluate a combined feature-subset-selection-and-learning scheme with
    a repetitive evaluation method, e.g., cross-validation.

    :param baseLearner: the learner to wrap
    :param filter: defaults to
      :obj:`Orange.feature.selection.FilterAboveThreshold`
    :type filter: Orange.feature.selection.FilterAboveThreshold

    Here is an example of how to build a wrapper around the naive Bayesian
    learner and use it on a data set::

        nb = Orange.classification.bayes.NaiveBayesLearner()
        learner = Orange.feature.selection.FilteredLearner(nb,
            filter=Orange.feature.selection.FilterBestN(n=5), name='filtered')
        classifier = learner(data)

    """
    def __new__(cls, baseLearner, data=None, weight=0,
                filter=FilterAboveThreshold(), name='filtered'):
        # with data, construct the learner and train it immediately
        if data is None:
            self = object.__new__(cls)
            return self
        else:
            self = cls(baseLearner, filter=filter, name=name)
            return self(data, weight)

    def __init__(self, baseLearner, filter=FilterAboveThreshold(),
                 name='filtered'):
        self.baseLearner = baseLearner
        self.filter = filter
        self.name = name

    def __call__(self, data, weight=0):
        # filter the data, then learn on the reduced feature set
        fdata = self.filter(data)
        model = self.baseLearner(fdata, weight)
        return FilteredClassifier(classifier=model, domain=model.domain)

FilteredLearner_Class = FilteredLearner


class FilteredClassifier:
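    """A classifier returned by :obj:`FilteredLearner`. It delegates
    classification to the wrapped classifier and can report, through
    :obj:`atts`, which features the filter retained.
    """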
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        # delegate classification to the wrapped classifier
        return self.classifier(example, resultType)

    def atts(self):
        # the features that survived the filtering step
        return self.domain.attributes