source: orange/orange/Orange/feature/selection.py @ 9653:652ca8b091ed

1"""
2#########################
3Selection (``selection``)
4#########################
5
6.. index:: feature selection
7
8.. index::
9   single: feature; feature selection
10
Some machine learning methods perform better if they learn only from a
selected subset of the most informative or "best" features.

This so-called filter approach can boost the performance of a learner in
terms of predictive accuracy, speed up induction, and simplify the
resulting models. Feature scores are estimated before modeling, without
knowing which machine learning method will be used to construct the
predictive model.

:download:`Example script <code/selection-best3.py>`:

.. literalinclude:: code/selection-best3.py
    :lines: 7-

The script should output this::

    Best 3 features:
    physician-fee-freeze
    el-salvador-aid
    synfuels-corporation-cutback

.. autoclass:: Orange.feature.selection.FilterAboveThreshold
   :members:

.. autoclass:: Orange.feature.selection.FilterBestN
   :members:

.. autoclass:: Orange.feature.selection.FilterRelief
   :members:

.. automethod:: Orange.feature.selection.FilteredLearner

.. autoclass:: Orange.feature.selection.FilteredLearner_Class
   :members:

.. autoclass:: Orange.feature.selection.FilteredClassifier
   :members:

These functions support the design of feature subset selection for
classification problems; a short sketch of their combined use follows
the list below.

.. automethod:: Orange.feature.selection.best_n

.. automethod:: Orange.feature.selection.above_threshold

.. automethod:: Orange.feature.selection.select_best_n

.. automethod:: Orange.feature.selection.select_above_threshold

.. automethod:: Orange.feature.selection.select_relief

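For instance, a minimal sketch of combining scoring and selection
(assuming the voting data set used throughout this page, and assuming
:obj:`Orange.feature.scoring.score_all` is called with its default
Relief measure)::

    import Orange

    data = Orange.data.Table("voting")
    # rank all features, then build a new table holding the three
    # top-ranked features plus the class
    scores = Orange.feature.scoring.score_all(data)
    reduced = Orange.feature.selection.select_best_n(data, scores, 3)
    print [a.name for a in reduced.domain.attributes]
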
.. rubric:: Examples

The following script defines a new naive Bayes classifier that selects
the five best features from the data set before learning. The new
classifier is wrapped in a special class (see the `Building your own
learner <../ofb/c_pythonlearner.htm>`_ lesson in `Orange for Beginners
<../ofb/default.htm>`_). The script compares this filtered learner with
one that uses the complete set of features.

:download:`selection-bayes.py <code/selection-bayes.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-bayes.py
    :lines: 7-

Interestingly, and somewhat expectedly, feature subset selection
helps. This is the output that we get::

    Learner      CA
    Naive Bayes  0.903
    with FSS     0.940

Now for a much simpler, if perhaps less educational, way to do all of
the above: wrap the learner using ``FilteredLearner``, thus creating an
object that is assembled from a data filter and a base learner. When
given data, this learner uses the attribute filter to construct a new
data set and the base learner to construct a corresponding
classifier. Attribute filters should be of a type like
``orngFSS.FilterAttsAboveThresh`` or ``orngFSS.FilterBestNAtts``:
initialized with arguments and later presented with data, they return a
new, reduced data set.

The following code fragment essentially replaces the bulk of the code
from the previous example, and compares the naive Bayesian classifier to
the same classifier when only the single most important attribute is
used.

:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 13-16

Now, let's decide to retain three features (change the code in
:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>`
accordingly!) and observe how many times each attribute was
used. Remember, 10-fold cross-validation constructs ten instances of
each classifier, and each time we run ``FilteredLearner`` a different
set of features may be selected. ``orngEval.CrossValidation`` stores the
classifiers in the ``results`` variable, and ``FilteredLearner`` returns
a classifier that can tell which features it used (how convenient!), so
the code to do all this is quite short.

.. literalinclude:: code/selection-filtered-learner.py
    :lines: 25-

Running :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` with three features selected each
time a learner is run gives the following result::

    Learner      CA
    bayes        0.903
    filtered     0.956

    Number of times features were used in cross-validation:
     3 x el-salvador-aid
     6 x synfuels-corporation-cutback
     7 x adoption-of-the-budget-resolution
    10 x physician-fee-freeze
     4 x crime

Experiment for yourself: if only one attribute is retained for the
classifier, which attribute is most frequently selected over the ten
cross-validation folds?

==========
References
==========

* K. Kira and L. Rendell. A practical approach to feature selection. In
  D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine
  Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.

* I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF.
  In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine
  Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.

* R. Kohavi and G. John. Wrappers for feature subset selection. Artificial
  Intelligence, 97(1-2), pages 273-324, 1997.

"""

__docformat__ = 'restructuredtext'

import Orange.core as orange

from Orange.feature.scoring import score_all


def best_n(scores, N):
    """Return the best N features (without scores) from the list returned
    by :obj:`Orange.feature.scoring.score_all`.

    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of best features to select
    :type N: int
    :rtype: :obj:`list`

    """
    return [x[0] for x in scores[:N]]

bestNAtts = best_n


def above_threshold(scores, threshold=0.0):
    """Return features (without scores) from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose score is strictly
    above a specified threshold.

    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :obj:`list`

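    For example, a minimal sketch (assuming ``data`` is a loaded data
    table and ``score_all`` is used with its default measure)::

        scores = score_all(data)
        # keep only the entries that score strictly above 0.01
        chosen = above_threshold(scores, threshold=0.01)
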
    """
    return [x[0] for x in scores if x[1] > threshold]

attsAboveThreshold = above_threshold


def select_best_n(data, scores, N):
    """Construct and return a new data table that includes the class
    and only the N best features from the scores list.

    :param data: an example table
    :type data: Orange.data.Table
    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param N: number of features to select
    :type N: int
    :rtype: :class:`Orange.data.Table` holding the N best features

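    For example, a minimal sketch (assuming ``data`` is a loaded data
    table) that scores features with information gain instead of the
    default measure::

        scores = score_all(data, orange.MeasureAttribute_info())
        # new table with the three top-ranked features plus the class
        reduced = select_best_n(data, scores, 3)
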
    """
    return data.select(best_n(scores, N) + [data.domain.classVar.name])

selectBestNAtts = select_best_n


def select_above_threshold(data, scores, threshold=0.0):
    """Construct and return a new data table that includes the class and
    the features from the list returned by
    :obj:`Orange.feature.scoring.score_all` whose scores are strictly
    above a specified threshold.

    :param data: an example table
    :type data: Orange.data.Table
    :param scores: a list such as one returned by
      :obj:`Orange.feature.scoring.score_all`
    :type scores: list
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float
    :rtype: :class:`Orange.data.Table` holding the selected features and
      the class

    """
    return data.select(above_threshold(scores, threshold) +
                       [data.domain.classVar.name])
selectAttsAboveThresh = select_above_threshold


def select_relief(data, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0):
    """Take the data set and use an attribute measure to repeatedly remove
    the worst-scored attribute, as long as its score falls below the
    margin. Scores are re-estimated after each removal.

    .. note:: This filter procedure was originally designed for measures
       such as Relief, which are context dependent, i.e., removal of
       features may change the scores of the remaining features. Hence the
       need to re-estimate the scores every time an attribute is removed.

    :param data: a data table
    :type data: Orange.data.Table
    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param margin: attributes whose score is at least the margin are not
      removed. Defaults to 0.
    :type margin: float

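    For example, a minimal sketch (assuming ``data`` is a loaded data
    table)::

        # iteratively drop attributes until all remaining ones score
        # at least 0.01 with the default Relief measure
        reduced = select_relief(data, margin=0.01)
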
    """
    measl = score_all(data, measure)
    while len(data.domain.attributes) > 0 and measl[-1][1] < margin:
        data = select_best_n(data, measl, len(data.domain.attributes) - 1)
        measl = score_all(data, measure)
    return data

filterRelieff = select_relief


class FilterAboveThreshold(object):
    """Store filter parameters and, when later called with data, return
    a data table containing only the selected features.

    This class uses the function :obj:`select_above_threshold`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param threshold: score threshold for attribute selection. Defaults to 0.
    :type threshold: float

    Some examples of how to use this class::

        filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
        new_data = filter(data)
        new_data = Orange.feature.selection.FilterAboveThreshold(data)
        new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
        new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1,
                   measure=Orange.feature.scoring.Gini())

    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                threshold=0.0):

        if data is None:
            # object.__new__ takes no extra arguments; __init__ receives
            # the keyword arguments when the instance is constructed
            self = object.__new__(cls)
            return self
        else:
            self = cls(measure=measure, threshold=threshold)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 threshold=0.0):

        self.measure = measure
        self.threshold = threshold

    def __call__(self, data):
        """Take data and return the features with scores above the given
        threshold.

        :param data: a data table
        :type data: Orange.data.Table

        """
        ma = score_all(data, self.measure)
        return select_above_threshold(data, ma, self.threshold)

FilterAttsAboveThresh = FilterAboveThreshold
FilterAttsAboveThresh_Class = FilterAboveThreshold


class FilterBestN(object):
    """Store filter parameters and, when later called with data, return
    a data table containing only the n best features.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param n: number of best features to return. Defaults to 5.
    :type n: int

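    For example, mirroring the usage shown for :obj:`FilterAboveThreshold`
    (a minimal sketch, assuming ``data`` is a loaded data table)::

        filter = Orange.feature.selection.FilterBestN(n=3)
        new_data = filter(data)
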
    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                n=5):

        if data is None:
            # object.__new__ takes no extra arguments
            self = object.__new__(cls)
            return self
        else:
            self = cls(measure=measure, n=n)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 n=5):
        self.measure = measure
        self.n = n

    def __call__(self, data):
        ma = score_all(data, self.measure)
        # cap n locally instead of mutating self.n as a side effect
        n = min(self.n, len(data.domain.attributes))
        return select_best_n(data, ma, n)

FilterBestNAtts = FilterBestN
FilterBestNAtts_Class = FilterBestN


class FilterRelief(object):
    """Store filter parameters and, when later called with data, return
    a data table containing only the features kept by :obj:`select_relief`.

    :param measure: an attribute measure (derived from
      :obj:`Orange.feature.scoring.Measure`). Defaults to
      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50.
    :param margin: margin for Relief scoring. Defaults to 0.
    :type margin: float

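    For example (a minimal sketch, assuming ``data`` is a loaded data
    table)::

        filter = Orange.feature.selection.FilterRelief(margin=0.01)
        new_data = filter(data)
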
    """
    def __new__(cls, data=None,
                measure=orange.MeasureAttribute_relief(k=20, m=50),
                margin=0):

        if data is None:
            # object.__new__ takes no extra arguments
            self = object.__new__(cls)
            return self
        else:
            self = cls(measure=measure, margin=margin)
            return self(data)

    def __init__(self, measure=orange.MeasureAttribute_relief(k=20, m=50),
                 margin=0):
        self.measure = measure
        self.margin = margin

    def __call__(self, data):
        return select_relief(data, self.measure, self.margin)

FilterRelief_Class = FilterRelief

##############################################################################
# wrapped learner


class FilteredLearner(object):
    """A learner that wraps a base learner together with a data selection
    method.

    When such a learner is presented with a data table, the data is first
    filtered and then passed to the base learner. This comes in handy when
    one wants to test a feature-subset-selection-and-learning scheme with
    a repetitive evaluation method, e.g., cross-validation.

    :param filter: defaults to
      :obj:`Orange.feature.selection.FilterAboveThreshold`
    :type filter: Orange.feature.selection.FilterAboveThreshold

    Here is an example of how to build a wrapper around the naive Bayesian
    learner and use it on a data set::

        nb = Orange.classification.bayes.NaiveBayesLearner()
        learner = Orange.feature.selection.FilteredLearner(nb,
                  filter=Orange.feature.selection.FilterBestNAtts(n=5), name='filtered')
        classifier = learner(data)

    """
    def __new__(cls, baseLearner, data=None, weight=0,
                filter=FilterAboveThreshold(), name='filtered'):

        if data is None:
            # object.__new__ takes no extra arguments
            self = object.__new__(cls)
            return self
        else:
            self = cls(baseLearner, filter=filter, name=name)
            return self(data, weight)

    def __init__(self, baseLearner, filter=FilterAboveThreshold(),
                 name='filtered'):
        self.baseLearner = baseLearner
        self.filter = filter
        self.name = name

    def __call__(self, data, weight=0):
        # filter the data, then learn from the reduced data set
        fdata = self.filter(data)
        model = self.baseLearner(fdata, weight)
        return FilteredClassifier(classifier=model, domain=model.domain)

FilteredLearner_Class = FilteredLearner


class FilteredClassifier:
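    """A classifier returned by :obj:`FilteredLearner`: it stores the
    classifier induced on the filtered data together with the filtered
    domain, which reveals the features that were actually used."""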
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        return self.classifier(example, resultType)

    def atts(self):
        return self.domain.attributes