source: orange/orange/Orange/feature/selection.py @ 9645:dc21346e00ee

Revision 9645:dc21346e00ee, 14.8 KB checked in by Miha Stajdohar <miha.stajdohar@…>, 2 years ago


To Orange25.

"""
#########################
Selection (``selection``)
#########################

.. index:: feature selection

.. index::
   single: feature; feature selection

Some machine learning methods perform better if they learn only from a
selected subset of the most informative or "best" features.

This so-called filter approach can improve the learner's predictive
accuracy, speed up induction, and yield simpler models. Feature scores
are estimated before modeling, without knowing which machine learning
method will be used to construct the predictive model.
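
As a minimal illustration of the filter idea, independent of Orange's API,
one can score each feature before modeling and keep only the top-ranked
ones. The score used below (absolute difference of per-class means) is a
hypothetical stand-in for measures such as information gain or Relief:

```python
# A sketch of the filter approach: score each feature before modeling,
# then keep only the top-ranked ones.  The score used here (absolute
# difference of per-class means) is a hypothetical stand-in for measures
# such as information gain or Relief.

def mean(values):
    return sum(values) / float(len(values))

def score_features(rows, labels):
    # return (feature_index, score) pairs, best first
    scores = []
    for j in range(len(rows[0])):
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y == 0]
        scores.append((j, abs(mean(pos) - mean(neg))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def best_n(scores, n):
    # keep the indices of the n best-scored features
    return [index for index, _ in scores[:n]]

# feature 0 separates the classes well; feature 1 is pure noise
rows = [(1.0, 5.0), (0.9, 1.0), (0.1, 5.2), (0.2, 0.8)]
labels = [1, 1, 0, 0]
ranking = score_features(rows, labels)
print(best_n(ranking, 1))   # -> [0]
```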

:download:`Example script <code/selection-best3.py>`:

.. literalinclude:: code/selection-best3.py
    :lines: 7-

The script should output this::

    Best 3 features:
    physician-fee-freeze
    el-salvador-aid
    synfuels-corporation-cutback

.. autoclass:: Orange.feature.selection.FilterAboveThreshold
   :members:

.. autoclass:: Orange.feature.selection.FilterBestN
   :members:

.. autoclass:: Orange.feature.selection.FilterRelief
   :members:

.. autofunction:: Orange.feature.selection.FilteredLearner

.. autoclass:: Orange.feature.selection.FilteredLearner_Class
   :members:

.. autoclass:: Orange.feature.selection.FilteredClassifier
   :members:

These functions support the design of feature subset selection for
classification problems.

.. autofunction:: Orange.feature.selection.best_n

.. autofunction:: Orange.feature.selection.above_threshold

.. autofunction:: Orange.feature.selection.select_best_n

.. autofunction:: Orange.feature.selection.select_above_threshold

.. autofunction:: Orange.feature.selection.select_relief

.. rubric:: Examples

The following script defines a new naive Bayes classifier that selects
the five best features from the data set before learning. The new
classifier is wrapped in a special class (see the `Building your own
learner <../ofb/c_pythonlearner.htm>`_ lesson in `Orange for Beginners
<../ofb/default.htm>`_). The script compares this filtered learner with
one that uses the complete set of features.

:download:`selection-bayes.py <code/selection-bayes.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-bayes.py
    :lines: 7-

Interestingly, and somewhat expectedly, feature subset selection helps.
This is the output that we get::

    Learner      CA
    Naive Bayes  0.903
    with FSS     0.940

Now for a much simpler example. We can do all of the above by wrapping
the learner with ``FilteredLearner``, thus creating an object that is
assembled from a data filter and a base learner. When given data, this
learner uses the attribute filter to construct a new data set and the
base learner to construct a corresponding classifier. Attribute filters
should be of a type like ``FilterAttsAboveThresh`` or
``FilterBestNAtts``: initialized with their arguments up front and later
called with data, they return a new, reduced data set.
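
The wrapping idea can be sketched without Orange at all. Below, a
hypothetical ``FilteredLearner``-style class pairs a column filter with a
toy majority-class base learner; all names are illustrative and not part
of Orange's API:

```python
# Hypothetical sketch of the FilteredLearner idea: a learner that first
# reduces the data to a chosen subset of feature columns, then hands the
# reduced data to a base learner.

class MajorityLearner:
    # toy base learner: the model always predicts the most common label
    def __call__(self, rows, labels):
        majority = max(set(labels), key=labels.count)
        return lambda row: majority

class FilteredLearner:
    def __init__(self, base_learner, keep):
        self.base_learner = base_learner
        self.keep = keep  # indices of the feature columns to retain

    def __call__(self, rows, labels):
        # filter the training data, then learn from the reduced columns
        reduced = [[row[j] for j in self.keep] for row in rows]
        model = self.base_learner(reduced, labels)
        # the classifier applies the same column filter to each example
        return lambda row: model([row[j] for j in self.keep])

learner = FilteredLearner(MajorityLearner(), keep=[0])
classifier = learner([[1, 9], [0, 8], [1, 7]], ["y", "n", "y"])
print(classifier([1, 42]))   # -> y
```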

The following code fragment essentially replaces the bulk of the code
from the previous example; it compares a naive Bayes classifier to the
same classifier trained on only the single most important attribute.

:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 13-16

Now, let us decide to retain three features (change the code in
:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>`
accordingly!) and observe how many times each attribute was used.
Remember, 10-fold cross-validation constructs ten instances of each
classifier, and each time FilteredLearner runs, a different set of
features may be selected. ``orngEval.CrossValidation`` stores the
classifiers in the ``results`` variable, and ``FilteredLearner`` returns
a classifier that can tell which features it used (how convenient!), so
the code to do all this is quite short.

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 25-
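
The tallying itself amounts to a simple counter over the per-fold feature
subsets. A sketch, with invented fold results rather than the actual
voting.tab output:

```python
# Sketch of the bookkeeping described above: across cross-validation
# folds, each run of the filtered learner may select a different feature
# subset, and we tally how often each feature was chosen.  The per-fold
# subsets below are invented for illustration.

from collections import Counter

folds = [
    ["physician-fee-freeze", "el-salvador-aid", "crime"],
    ["physician-fee-freeze", "synfuels-corporation-cutback", "crime"],
    ["physician-fee-freeze", "adoption-of-the-budget-resolution", "crime"],
] + [["physician-fee-freeze", "synfuels-corporation-cutback",
      "adoption-of-the-budget-resolution"]] * 7

usage = Counter(feature for subset in folds for feature in subset)
for feature, count in sorted(usage.items(), key=lambda item: item[1]):
    print("%2d x %s" % (count, feature))
```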

Running :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` with three features selected each
time a learner is run gives the following result::

    Learner      CA
    bayes        0.903
    filtered     0.956

    Number of times features were used in cross-validation:
     3 x el-salvador-aid
     6 x synfuels-corporation-cutback
     7 x adoption-of-the-budget-resolution
    10 x physician-fee-freeze
     4 x crime

Experiment yourself: if only one attribute is retained for the
classifier, which attribute is selected most often across the ten
cross-validation runs?

==========
References
==========

* K. Kira and L. Rendell. A practical approach to feature selection. In
  D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine
  Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.

* I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF.
  In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine
  Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.

* R. Kohavi, G. John. Wrappers for feature subset selection. Artificial
  Intelligence, 97 (1-2), pages 273-324, 1997.

"""

__docformat__ = 'restructuredtext'

import Orange.core as orange

from Orange.feature.scoring import score_all


def best_n(scores, N):
    """Return the best N features (without scores) from the list returned
    by :obj:`Orange.feature.scoring.score_all`.

    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of best features to select.
    :type N: int
    :rtype: :obj:`list`

    """
    return [x[0] for x in scores[:N]]

bestNAtts = best_n

def above_threshold(scores, threshold=0.0):
    """Return features (without scores) from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is above a
    specified threshold.

    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :obj:`list`

    """
    return [x[0] for x in scores if x[1] > threshold]

attsAboveThreshold = above_threshold


def select_best_n(data, scores, N):
    """Construct and return a new data table that includes the class and
    only the N best features from the list of scores.

    :param data: an example table
    :type data: Orange.data.table
    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of features to select
    :type N: int
    :rtype: :class:`Orange.data.table` holding the N best features

    """
    return data.select(best_n(scores, N) + [data.domain.classVar.name])

selectBestNAtts = select_best_n


def select_above_threshold(data, scores, threshold=0.0):
    """Construct and return a new data table that includes the class and
    the features from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is above a
    specified threshold.

    :param data: an example table
    :type data: Orange.data.table
    :param scores: a list such as the one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :class:`Orange.data.table` holding the features with scores
      above the threshold

    """
    return data.select(above_threshold(scores, threshold) + [data.domain.classVar.name])

selectAttsAboveThresh = select_above_threshold


def select_relief(data, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0):
    """Take the data set and use an attribute measure to iteratively
    remove the worst-scored attribute (the one scoring below the margin),
    until all remaining attributes score at least the margin.

    .. note:: This filter procedure was originally designed for measures
       such as Relief, which are context dependent: the removal of
       features may change the scores of the remaining features. Hence
       the scores are re-estimated every time an attribute is removed.

    :param data: a data table
    :type data: Orange.data.table
    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param margin: attributes scoring at or above this margin are kept.
      Defaults to 0.
    :type margin: float

    """
    measl = score_all(data, measure)
    while len(data.domain.attributes) > 0 and measl[-1][1] < margin:
        # drop the single worst-scored attribute, then re-estimate scores
        data = select_best_n(data, measl, len(data.domain.attributes) - 1)
        measl = score_all(data, measure)
    return data

filterRelieff = select_relief


class FilterAboveThreshold(object):
    """Store filter parameters and, when later called with data, return
    a data table containing only the selected features.

    This class uses the function :obj:`select_above_threshold`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float

    Some examples of how to use this class::

        filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
        new_data = filter(data)
        new_data = Orange.feature.selection.FilterAboveThreshold(data)
        new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
        new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1,
                   measure=Orange.feature.scoring.Gini())

    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                threshold=0.0):
        if data is None:
            # no data yet: construct the filter and wait for a call
            return object.__new__(cls)
        else:
            # data given: construct a filter and apply it immediately
            self = cls(measure=measure, threshold=threshold)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 threshold=0.0):
        self.measure = measure
        self.threshold = threshold

    def __call__(self, data):
        """Take data and return a data table with only those features
        whose score is above the given threshold.

        :param data: a data table
        :type data: Orange.data.table

        """
        ma = score_all(data, self.measure)
        return select_above_threshold(data, ma, self.threshold)

FilterAttsAboveThresh = FilterAboveThreshold
FilterAttsAboveThresh_Class = FilterAboveThreshold


class FilterBestN(object):
    """Store filter parameters and, when later called with data, return
    a data table containing only the n best-scored features.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param n: number of best features to return. Defaults to 5.
    :type n: int

    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                n=5):
        if data is None:
            return object.__new__(cls)
        else:
            self = cls(measure=measure, n=n)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50), n=5):
        self.measure = measure
        self.n = n

    def __call__(self, data):
        ma = score_all(data, self.measure)
        self.n = min(self.n, len(data.domain.attributes))
        return select_best_n(data, ma, self.n)

FilterBestNAtts = FilterBestN
FilterBestNAtts_Class = FilterBestN

class FilterRelief(object):
    """Store filter parameters and, when later called with data, return
    a data table with the features selected by :obj:`select_relief`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param margin: margin for Relief scoring. Defaults to 0.
    :type margin: float

    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                margin=0):
        if data is None:
            return object.__new__(cls)
        else:
            self = cls(measure=measure, margin=margin)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0):
        self.measure = measure
        self.margin = margin

    def __call__(self, data):
        return select_relief(data, self.measure, self.margin)

FilterRelief_Class = FilterRelief

##############################################################################
# wrapped learner


def FilteredLearner(baseLearner, examples=None, weight=None, **kwds):
    """Return a learner that wraps the given base learner together with
    a data selection method.

    When such a learner is presented with a data table, the data is
    first filtered and then passed to the base learner. This comes in
    handy when one wants to evaluate the whole
    feature-subset-selection-and-learning scheme with a repetitive
    evaluation method, e.g., cross-validation.

    :param filter: defaults to
      :obj:`Orange.feature.selection.FilterAttsAboveThresh`
    :type filter: Orange.feature.selection.FilterAttsAboveThresh

    Here is an example of how to build a wrapper around a naive Bayes
    learner and use it on a data set::

        nb = Orange.classification.bayes.NaiveBayesLearner()
        learner = Orange.feature.selection.FilteredLearner(nb,
                  filter=Orange.feature.selection.FilterBestNAtts(n=5), name='filtered')
        classifier = learner(data)

    """
    learner = FilteredLearner_Class(baseLearner, **kwds)
    if examples:
        return learner(examples, weight)
    else:
        return learner

class FilteredLearner_Class:
    def __init__(self, baseLearner, filter=FilterAttsAboveThresh(), name='filtered'):
        self.baseLearner = baseLearner
        self.filter = filter
        self.name = name

    def __call__(self, data, weight=0):
        # filter the data, then learn from the reduced data set
        fdata = self.filter(data)
        model = self.baseLearner(fdata, weight)
        return FilteredClassifier(classifier=model, domain=model.domain)


class FilteredClassifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        return self.classifier(example, resultType)

    def atts(self):
        # features actually used by the wrapped classifier
        return self.domain.attributes