Selection (selection)¶
The feature selection module contains several utility functions for selecting features based on their scores, which are normally obtained in classification or regression problems. A typical example is the function top_rated, which returns the highest-scored features:
import Orange
voting = Orange.data.Table("voting")
n = 3
ma = Orange.feature.scoring.score_all(voting)
best = Orange.feature.selection.top_rated(ma, n)
print 'Best %d features:' % n
for s in best:
    print s
The script outputs:
Best 3 features:
physician-fee-freeze
el-salvador-aid
synfuels-corporation-cutback
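The selection step in the script above can be illustrated without Orange. The following is a minimal sketch, assuming that a score list is a list of (name, score) pairs; the helper names top_rated_sketch and above_threshold_sketch are hypothetical stand-ins and not the library's implementation, and the scores are made up.

```python
# Plain-Python sketch of rank-based and threshold-based selection.
# Assumption: scores is a list of (feature_name, score) pairs.

def top_rated_sketch(scores, n, highest_best=True):
    """Return the names of the n best features."""
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=highest_best)
    return [name for name, _ in ranked[:n]]


def above_threshold_sketch(scores, threshold=0.0):
    """Return the names of features whose score is >= threshold."""
    return [name for name, score in scores if score >= threshold]


# Made-up scores, for illustration only.
scores = [("a", 0.10), ("b", 0.42), ("c", 0.05), ("d", 0.30)]
best_two = top_rated_sketch(scores, 2)        # ['b', 'd']
kept = above_threshold_sketch(scores, 0.10)   # ['a', 'b', 'd']
```

Both helpers return names only, mirroring the documented behaviour of returning features without their scores.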
The module also includes a learner that incorporates feature subset selection.
Functions for feature subset selection¶
- static selection.top_rated(scores, n, highest_best=True)¶
Return n top-rated features from the list of scores.
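Parameters: - scores (list) – a list such as the one returned by score_all
- n (int) – number of features to select
- highest_best (bool) – if true, features with higher scores are preferred
Return type: list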
- static selection.above_threshold(scores, threshold=0.0)¶
Return features (without scores) with scores above or equal to a specified threshold.
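Parameters: - scores (list) – a list such as the one returned by score_all
- threshold (float) – threshold for selection
Return type: list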
- static selection.select(data, scores, n)¶
Construct and return a new data table that includes the class and only the n best features from the list scores.
Parameters: - data (Orange.data.Table) – a data table
- scores (list) – a list such as the one returned by score_all
- n (int) – number of features to select
Return type: Orange.data.Table
- static selection.select_above_threshold(data, scores, threshold=0.0)¶
Construct and return a new data table that includes the class and those features from the list returned by score_all whose score is greater than or equal to the given threshold.
Parameters: - data (Orange.data.Table) – a data table
- scores (list) – a list such as the one returned by score_all
- threshold (float) – threshold for selection
Return type: Orange.data.Table
- static selection.select_relief(data, measure=Orange.feature.scoring.Relief(k=20, m=10), margin=0)¶
Iteratively remove the worst scored feature until no feature has a score below the margin. The filter procedure was originally designed for measures such as Relief, which are context dependent, i.e., removal of features may change the scores of other remaining features. The score is thus recomputed in each iteration.
Parameters: - data (Orange.data.Table) – a data table
- measure (Orange.feature.scoring.Score) – a feature scorer
- margin (float) – margin for removal
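The iterative procedure described for select_relief can be sketched in plain Python. This is only an illustration under stated assumptions: iterative_removal and toy_score are hypothetical names, the scorer is a made-up, context-dependent function rather than Relief, and the scores are recomputed on the remaining features in every iteration.

```python
def iterative_removal(features, score_fn, margin=0.0):
    """Drop the worst-scored feature until none scores below margin.

    score_fn(features) returns a dict {feature: score} computed on the
    current subset, so scores may change after each removal.
    """
    remaining = list(features)
    while remaining:
        scores = score_fn(remaining)          # recompute in each iteration
        worst = min(remaining, key=lambda f: scores[f])
        if scores[worst] >= margin:
            break                             # nothing left below the margin
        remaining.remove(worst)
    return remaining


def toy_score(features):
    """Made-up scorer: the presence of "noise" lowers the score of "b"."""
    scores = {"a": 0.3, "b": 0.2, "noise": -0.1}
    if "noise" in features:
        scores["b"] = -0.05
    return {f: scores[f] for f in features}


selected = iterative_removal(["a", "b", "noise"], toy_score, margin=0.0)
# selected == ['a', 'b']
```

With a single scoring pass, "b" would have been discarded together with "noise"; recomputing the scores after the first removal lets it stay.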
Learning with feature subset selection¶
- class Orange.feature.selection.FilteredLearner(base_learner, filter=FilterAboveThreshold(), name=filtered)¶
A feature selection wrapper around a base learner. When provided with data, this learner applies a given feature selection method and then calls the base learner.
Here is an example of how to build a wrapper around a naive Bayesian learner and use it on a data set:
nb = Orange.classification.bayes.NaiveLearner()
learner = Orange.feature.selection.FilteredLearner(nb,
    filter=Orange.feature.selection.FilterBestN(n=5), name='filtered')
classifier = learner(data)
- class Orange.feature.selection.FilteredClassifier(**kwds)¶
A classifier returned by FilteredLearner.
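The filter-then-learn pattern behind these two classes can be sketched in plain Python. This is a simplified illustration, not Orange's code: the class names, the toy filter, the toy base learner and the data layout (a list of dicts plus a list of labels) are all assumptions made for the example.

```python
from collections import Counter


class FilteredLearnerSketch(object):
    """Apply a feature filter, then train the base learner on the reduced data."""

    def __init__(self, base_learner, feature_filter, name="filtered"):
        self.base_learner = base_learner
        self.feature_filter = feature_filter
        self.name = name

    def __call__(self, rows, labels):
        kept = self.feature_filter(rows, labels)       # names of kept features
        reduced = [{f: row[f] for f in kept} for row in rows]
        model = self.base_learner(reduced, labels)
        return FilteredClassifierSketch(model, kept)


class FilteredClassifierSketch(object):
    """Classifier that remembers which features were selected."""

    def __init__(self, model, kept):
        self.model = model
        self.kept = kept

    def __call__(self, row):
        # Project the row onto the selected features before predicting.
        return self.model({f: row[f] for f in self.kept})


def drop_constant_features(rows, labels):
    """Toy filter: keep features that take more than one value."""
    return [f for f in sorted(rows[0]) if len(set(r[f] for r in rows)) > 1]


def majority_learner(rows, labels):
    """Toy base learner: always predict the most common label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority


rows = [{"x": 1, "const": 0}, {"x": 2, "const": 0}, {"x": 1, "const": 0}]
labels = ["yes", "no", "yes"]
learner = FilteredLearnerSketch(majority_learner, drop_constant_features)
classifier = learner(rows, labels)
```

Storing the selected feature names on the classifier is what makes it possible to inspect afterwards which features each trained model used.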
Class wrappers for selection functions¶
- class Orange.feature.selection.FilterAboveThreshold(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), threshold=0.0)¶
A wrapper around select_above_threshold; the constructor stores the parameters of the feature selection procedure, which are then applied when the selection is called with the actual data.
Parameters: - measure (Orange.feature.scoring.Score) – a feature scorer
- threshold (float) – threshold for selection. Defaults to 0.
- __call__(data)¶
Return a data table with the features whose scores are above the given threshold.
Parameters: data (Orange.data.Table) – data table
Below are a few examples of how this class can be used:
>>> filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
>>> new_data = filter(data)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1, \
measure=Orange.feature.scoring.Gini())
- class Orange.feature.selection.FilterBestN(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), n=5)¶
A wrapper around select; the constructor stores the filter parameters that are applied when the function is called.
Parameters: - measure (Orange.feature.scoring.Score) – a feature scorer
- n (int) – number of features to select
- class Orange.feature.selection.FilterRelief(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), margin=0)¶
A class wrapper around select_relief; the constructor stores the filter parameters that are applied when the function is called.
Parameters: - measure (Orange.feature.scoring.Score) – a feature scorer
- margin (float) – margin for Relief scoring
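The two calling styles shown for FilterAboveThreshold (construct first and apply later, or pass the data straight to the constructor) can be reproduced in plain Python by overriding __new__. The sketch below is an assumption about how such a wrapper could be written, not Orange's implementation; ThresholdFilterSketch, spread_score and the column-dictionary data layout are hypothetical.

```python
def spread_score(columns):
    """Hypothetical scorer: the range (max - min) of each column."""
    return {name: max(vals) - min(vals) for name, vals in columns.items()}


class ThresholdFilterSketch(object):
    def __new__(cls, data=None, **params):
        self = object.__new__(cls)
        if data is None:
            return self              # Python then calls __init__(**params)
        self.__init__(**params)      # store the parameters ourselves ...
        return self(data)            # ... and return filtered data directly

    def __init__(self, data=None, score_fn=spread_score, threshold=0.0):
        self.score_fn = score_fn
        self.threshold = threshold

    def __call__(self, columns):
        scores = self.score_fn(columns)
        return {name: vals for name, vals in columns.items()
                if scores[name] >= self.threshold}


columns = {"a": [1, 5, 3], "b": [2, 2, 2], "c": [0, 1, 0]}

flt = ThresholdFilterSketch(threshold=2)    # stores parameters only
deferred = flt(columns)                     # applies them later

immediate = ThresholdFilterSketch(columns, threshold=2)   # one step
```

Because __new__ returns something that is not an instance of the class when data is given, Python skips the automatic __init__ call, which is why the sketch invokes it explicitly.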
Examples
The following script defines a new naive Bayesian learner that selects the five best features from the data set before learning. The new learner is wrapped in a special class (see the Building your own learner lesson in Orange for Beginners). The script compares this filtered learner with one that uses the complete set of features.
import Orange

class BayesFSS(object):
    def __new__(cls, examples=None, **kwds):
        learner = object.__new__(cls)
        if examples:
            # called with data: initialize, train and return a classifier
            learner.__init__(**kwds)
            return learner(examples)
        else:
            return learner

    def __init__(self, name='Naive Bayes with FSS', N=5):
        self.name = name
        self.N = N

    def __call__(self, table, weight=None):
        # score all features, keep the N best and train naive Bayes on them
        ma = Orange.feature.scoring.score_all(table)
        filtered = Orange.feature.selection.select(table, ma, self.N)
        model = Orange.classification.bayes.NaiveLearner(filtered)
        return BayesFSS_Classifier(classifier=model, N=self.N, name=self.name)

class BayesFSS_Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=Orange.classification.Classifier.GetValue):
        return self.classifier(example, resultType)

# test the above wrapper on a data set
voting = Orange.data.Table("voting")
learners = (Orange.classification.bayes.NaiveLearner(name='Naive Bayes'),
            BayesFSS(name="with FSS"))
results = Orange.evaluation.testing.cross_validation(learners, voting)

# output the results
print "Learner CA"
for i in range(len(learners)):
    print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])
Interestingly, though not unexpectedly, feature subset selection helps. This is the output that we get:
Learner CA
Naive Bayes 0.903
with FSS 0.940
We can do all of the above by wrapping the learner using FilteredLearner, thus creating an object that is assembled from a data filter and a base learner. When given a data table, this learner uses the attribute filter to construct a new data set and the base learner to construct a corresponding classifier. Attribute filters should be of a type such as Orange.feature.selection.FilterAboveThreshold or Orange.feature.selection.FilterBestN, which can be initialized with arguments and later presented with data, returning a new, reduced data set.
The following code fragment replaces the bulk of the code from the previous example and compares the naive Bayesian classifier to the same classifier when only the single most important attribute is used.
nb = Orange.classification.bayes.NaiveLearner()
fl = Orange.feature.selection.FilteredLearner(nb,
    filter=Orange.feature.selection.FilterBestN(n=1), name='filtered')
learners = (Orange.classification.bayes.NaiveLearner(name='bayes'), fl)
Now, let’s retain three features (change the code in fss4.py accordingly!) and observe how many times each attribute was used. Remember, 10-fold cross-validation constructs ten instances of each classifier, and each time we run FilteredLearner a different set of features may be selected. Orange.evaluation.testing.cross_validation stores the classifiers in the results variable, and FilteredLearner returns a classifier that can tell which features it used (how convenient!), so the code to do all this is quite short.
print "\nNumber of times attributes were used in cross-validation:"
attsUsed = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        if a.name in attsUsed.keys():
            attsUsed[a.name] += 1
        else:
            attsUsed[a.name] = 1
for k in attsUsed.keys():
    print "%2d x %s" % (attsUsed[k], k)
Running selection-filtered-learner.py with three features selected each time a learner is run gives the following result:
Learner CA
bayes 0.903
filtered 0.956
Number of times attributes were used in cross-validation:
3 x el-salvador-aid
6 x synfuels-corporation-cutback
7 x adoption-of-the-budget-resolution
10 x physician-fee-freeze
4 x crime
Experiment for yourself: if only one attribute is retained for the classifier, which attribute is selected most frequently over the ten cross-validation folds?
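The counting step used above can also be written with collections.Counter. The sketch below runs without Orange; selected_per_fold is a hypothetical, made-up stand-in for the feature lists reported by the classifiers of each fold and does not reproduce the results shown above.

```python
from collections import Counter

# Made-up selections for three folds, for illustration only.
selected_per_fold = [
    ["physician-fee-freeze", "el-salvador-aid"],
    ["physician-fee-freeze", "crime"],
    ["physician-fee-freeze", "el-salvador-aid"],
]

# Count how often each feature name appears across all folds.
usage = Counter(name for fold in selected_per_fold for name in fold)

for name, count in usage.most_common():
    print("%2d x %s" % (count, name))
```

A Counter removes the need for the explicit membership test and also provides most_common() for listing features from the most to the least frequently selected.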
References¶
- K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proc. 9th Int’l Conf. on Machine Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.
- I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.
- R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), pages 273-324, 1997.
