Feature score is an assessment of the usefulness of a feature for predicting the dependent (class) variable. Orange provides classes that compute common feature scores for classification and regression.
The script below computes the information gain of feature “tear_rate” in the Lenses data set (loaded into data):
>>> print Orange.feature.scoring.InfoGain("tear_rate", data)
0.548795044422
Calling the scorer by passing the variable and the data to the constructor, as above, is convenient. However, when scoring multiple variables, some methods run much faster if the scorer is constructed and stored once, and then called for each variable.
>>> gain = Orange.feature.scoring.InfoGain()
>>> for feature in data.domain.features:
...     print feature.name, gain(feature, data)
age 0.0393966436386
prescription 0.0395109653473
astigmatic 0.377005338669
tear_rate 0.548795044422
The speed gain is most noticeable in Relief, which computes the scores of all features in parallel.
The module also provides a convenience function score_all that computes the scores for all attributes. The following example computes feature scores, both with score_all and by scoring each feature individually, and prints out the best three features.
import Orange

voting = Orange.data.Table("voting")

def print_best_3(ma):
    for m in ma[:3]:
        print "%5.3f %s" % (m[1], m[0])

print 'Feature scores for best three features (with score_all):'
ma = Orange.feature.scoring.score_all(voting)
print_best_3(ma)

print
print 'Feature scores for best three features (scored individually):'
meas = Orange.feature.scoring.Relief(k=20, m=50)
mr = [(a.name, meas(a, voting)) for a in voting.domain.attributes]
mr.sort(key=lambda x: -x[1])  # sort decreasingly by the score
print_best_3(mr)
Feature scores for best three features (with score_all):
0.613 physician-fee-freeze
0.255 el-salvador-aid
0.228 synfuels-corporation-cutback

Feature scores for best three features (scored individually):
0.613 physician-fee-freeze
0.255 el-salvador-aid
0.228 synfuels-corporation-cutback
It is also possible to score features that do not appear in the data but can be computed from it. A typical case is a discretized feature:
import Orange

iris = Orange.data.Table("iris")
d1 = Orange.feature.discretization.Entropy("petal length", iris)
print Orange.feature.scoring.InfoGain(d1, iris)
Calling scoring methods
Scorers can be called with different types of arguments. For instance, when given the data, most scoring methods first compute the corresponding contingency tables. If these are already known, they can be passed to the scorer instead of the data to save some time.
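For intuition (this sketch is independent of Orange, and the helper name is our own), a contingency table is just the count of each (feature value, class) combination:

```python
from collections import Counter

def contingency(feature_values, labels):
    """Map each feature value to a Counter of class frequencies."""
    table = {}
    for v, c in zip(feature_values, labels):
        table.setdefault(v, Counter())[c] += 1
    return table

# Example: feature value "a" occurs once with "+" and once with "-"
tab = contingency(["a", "a", "b"], ["+", "-", "-"])
```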
Not all classes accept all kinds of arguments. Relief, for instance, only supports the form with data instances as input.
- Score.__call__(attribute, data[, apriori_class_distribution][, weightID])
- attribute (Orange.feature.Descriptor or int or string) – the chosen feature, either as a descriptor, index, or a name.
- data (Orange.data.Table) – data.
- weightID – id for meta-feature with weight.
All scoring methods support this form.
- Score.__call__(attribute, domain_contingency[, apriori_class_distribution])
- Score.__call__(contingency, class_distribution[, apriori_class_distribution])
- contingency (Orange.statistics.contingency.VarClass) – the contingency table of the feature.
- class_distribution (Orange.statistics.distribution.Distribution) – distribution of the class variable. If unknowns_treatment is IgnoreUnknowns, it should be computed on instances where feature value is defined. Otherwise, class distribution should be the overall class distribution.
- apriori_class_distribution – Optional and most often ignored. Useful if the scoring method makes any probability estimates based on apriori class probabilities (such as the m-estimate).
Returns: the feature score – the higher the value, the better the feature. If the quality cannot be scored, returns Score.Rejected.
Return type: float or Score.Rejected.
The following code demonstrates the different call signatures by computing the score of the same feature with GainRatio.
import Orange

titanic = Orange.data.Table("titanic")
meas = Orange.feature.scoring.GainRatio()

print "Call with variable and data table"
print meas(0, titanic)

print "Call with variable and domain contingency"
domain_cont = Orange.statistics.contingency.Domain(titanic)
print meas(0, domain_cont)

print "Call with contingency and class distribution"
cont = Orange.statistics.contingency.VarClass(0, titanic)
class_dist = Orange.statistics.distribution.Distribution(
    titanic.domain.class_var, titanic)
print meas(cont, class_dist)
Feature scoring in classification problems
- class Orange.feature.scoring.InfoGain
Information gain: the expected decrease of entropy. See the information gain page on Wikipedia.
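For illustration, information gain can be computed from scratch in plain Python (a sketch independent of Orange; the helper names are our own): the class entropy is computed before and after splitting on the feature, and the difference is the gain.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = float(len(labels))
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Expected decrease of class entropy after splitting on the feature."""
    n = float(len(labels))
    groups = {}
    for v, c in zip(feature_values, labels):
        groups.setdefault(v, []).append(c)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that splits the classes perfectly gains the full class entropy; an uninformative feature gains nothing.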
- class Orange.feature.scoring.GainRatio
Information gain ratio: information gain divided by the entropy of the feature's values. Introduced in [Quinlan1986] to avoid overestimation of multi-valued features. It has been shown, however, that it still overestimates features with multiple values. See Wikipedia.
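A plain-Python sketch (independent of Orange; names are our own) shows how the ratio punishes many-valued features: the gain is divided by the split information, i.e. the entropy of the feature's own values.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy (base 2) of any sequence of values."""
    n = float(len(values))
    return -sum((c / n) * log(c / n, 2) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain divided by the entropy of the feature's own values."""
    n = float(len(labels))
    groups = {}
    for v, c in zip(feature_values, labels):
        groups.setdefault(v, []).append(c)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)  # high for many-valued features
    return gain / split_info if split_info else 0.0
```

With an id-like feature (a unique value per instance) the gain is maximal, but the split information grows with the number of values, so the ratio stays low.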
- class Orange.feature.scoring.Gini
Gini index is the probability that two randomly chosen instances will have different classes. See Gini coefficient on Wikipedia.
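The definition translates directly into plain Python (a sketch independent of Orange; names are our own): the Gini index of a label set is one minus the sum of squared class probabilities, and a feature is scored by the decrease of the index after splitting on it.

```python
from collections import Counter

def gini(labels):
    """Probability that two instances drawn with replacement differ in
    class: 1 minus the sum of squared class probabilities."""
    n = float(len(labels))
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(feature_values, labels):
    """Decrease of the Gini index achieved by splitting on the feature."""
    n = float(len(labels))
    groups = {}
    for v, c in zip(feature_values, labels):
        groups.setdefault(v, []).append(c)
    return gini(labels) - sum(len(g) / n * gini(g) for g in groups.values())
```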
- class Orange.feature.scoring.Relevance
Assesses the potential value of the feature for constructing decision rules.
- class Orange.feature.scoring.Cost
Evaluates features based on the cost decrease achieved by knowing the value of the feature, according to the specified cost matrix.
If the cost of predicting the first class for an instance that is actually in the second is 5, and the cost of the opposite error is 1, then an appropriate scorer can be constructed as follows:
>>> meas = Orange.feature.scoring.Cost()
>>> meas.cost = ((0, 5), (1, 0))
>>> meas(3, data)
0.083333350718021393
Knowing the value of feature 3 would decrease the classification cost by approximately 0.083 per instance.
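The intuition can be sketched in plain Python (independent of Orange; helper names are our own): compare the minimal expected misclassification cost computed from the overall class distribution with the cost achievable once the feature value is known.

```python
from collections import Counter

def min_expected_cost(labels, cost):
    """Minimal expected cost per instance when always predicting the
    cheapest class; cost[predicted][actual] is the misclassification cost.
    Labels are class indices into the cost matrix."""
    n = float(len(labels))
    counts = Counter(labels)
    k = len(cost)
    return min(sum(cost[pred][actual] * counts.get(actual, 0) / n
                   for actual in range(k))
               for pred in range(k))

def cost_decrease(feature_values, labels, cost):
    """How much the expected cost drops once the feature value is known."""
    n = float(len(labels))
    groups = {}
    for v, c in zip(feature_values, labels):
        groups.setdefault(v, []).append(c)
    after = sum(len(g) / n * min_expected_cost(g, cost)
                for g in groups.values())
    return min_expected_cost(labels, cost) - after

# The asymmetric cost matrix from the text
cost = ((0, 5), (1, 0))
```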
- class Orange.feature.scoring.Relief
Assesses features’ ability to distinguish between very similar instances from different classes. This scoring method was first developed by Kira and Rendell and then improved by Kononenko. The class Relief works on discrete and continuous classes and thus implements ReliefF and RReliefF.
ReliefF is slow since it needs to find k nearest neighbours for each of m reference instances. As we normally compute ReliefF for all features in the dataset, Relief caches the results for all features, when called to score a certain feature. When called again, it uses the stored results if the domain and the data table have not changed (data table version and the data checksum are compared). Caching will only work if you use the same object. Constructing new instances of Relief for each feature, like this:
for attr in data.domain.attributes:
    print Orange.feature.scoring.Relief(attr, data)
runs much slower than reusing the same instance:
meas = Orange.feature.scoring.Relief()
for attr in data.domain.attributes:
    print meas(attr, data)
k – Number of neighbours for each instance. Default is 5.
m – Number of reference instances. Default is 100. When -1, all instances are used as reference.
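The core idea of Relief can be sketched in plain Python (a much simplified variant with one neighbour and discrete features only; independent of Orange, names are our own): for each reference instance, find the nearest instance of the same class (the hit) and of a different class (the miss), and reward features that differ on the miss but agree on the hit.

```python
def diff(a, b):
    """Per-feature difference for discrete values: 0 if equal, 1 otherwise."""
    return [0.0 if x == y else 1.0 for x, y in zip(a, b)]

def relief(instances, labels):
    """Simplified Relief (one nearest hit and miss per reference instance)."""
    n = len(instances)
    weights = [0.0] * len(instances[0])
    for i in range(n):
        x, c = instances[i], labels[i]
        # (distance, index) pairs for all other instances
        others = [(sum(diff(x, instances[j])), j) for j in range(n) if j != i]
        hits = [o for o in others if labels[o[1]] == c]
        misses = [o for o in others if labels[o[1]] != c]
        if not hits or not misses:
            continue
        hit = instances[min(hits)[1]]
        miss = instances[min(misses)[1]]
        dh, dm = diff(x, hit), diff(x, miss)
        for f in range(len(weights)):
            weights[f] += (dm[f] - dh[f]) / n
    return weights
```

A feature that determines the class scores highly; a feature unrelated to the class is driven toward zero or below.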
- class Orange.feature.scoring.Distance
The distance is defined as information gain divided by the joint entropy H(C, A), where C is the class variable and A the feature:

1 - D(C, A) = Gain(A) / H(C, A)
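A plain-Python sketch of this measure (independent of Orange; names are our own): the joint entropy is simply the entropy of the (feature value, class) pairs.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy (base 2) of any sequence of values."""
    n = float(len(values))
    return -sum((c / n) * log(c / n, 2) for c in Counter(values).values())

def distance_score(feature_values, labels):
    """Information gain of the feature divided by the joint entropy H(C, A)."""
    n = float(len(labels))
    groups = {}
    for v, c in zip(feature_values, labels):
        groups.setdefault(v, []).append(c)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    joint = entropy(list(zip(feature_values, labels)))  # H(C, A)
    return gain / joint if joint else 0.0
```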
Feature scoring in regression problems
- class Orange.feature.scoring.Relief
Relief is used for regression in the same way as for classification (see Relief in classification problems).
Implemented methods for scoring relevances of features are subclasses of Score. Those that compute statistics on conditional distributions of class values given the feature values are derived from ScoreFromProbabilities.
- class Orange.feature.scoring.Score
Abstract base class for feature scoring. Its attributes describe which types of features it can handle and what kind of data it requires.
Indicates whether the scoring method can handle discrete features.
Indicates whether the scoring method can handle continuous features.
The type of data needed, indicated by one of the constants below. Classes that use DomainContingency will also handle generators. Those based on Contingency_Class will be able to take generators and domain contingencies.
Constant. Indicates that the scoring method needs an instance generator as input, as does, for example, Relief.
Constant. Indicates that the scoring method needs Orange.statistics.contingency.Domain.
Treatment of unknown values
Defined in classes that are able to treat unknown values. It should be set to one of the values below.
Constant. Instances for which the feature value is unknown are removed.
Constant. Features with unknown values are punished: the feature quality is reduced by the proportion of unknown values. For impurity scores, the impurity decreases only where the value is defined and stays the same elsewhere.
Constant. Undefined values are replaced by the most common value.
Constant. Unknown values are treated as a separate value.
Abstract. See Calling scoring methods.
- threshold_function(attribute, instances[, weightID])
Assess different binarizations of the continuous feature attribute. Return a list of tuples. The first element is a threshold (between two existing values), the second is the quality of the corresponding binary feature, and the third the distribution of instances below and above the threshold. Not all scorers return the third element.
To show the computation of thresholds, we shall use the Iris data set:
iris = Orange.data.Table("iris")
meas = Orange.feature.scoring.Relief()
for t in meas.threshold_function("petal length", iris):
    print "%5.3f: %5.3f" % t
- best_threshold(attribute, instances)
Return the best threshold for binarization, that is, the threshold with which the resulting binary feature will have the optimal score.
The script below prints out the best threshold for binarization of a feature. ReliefF is used for scoring:
thresh, score, distr = meas.best_threshold("petal length", iris)
print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)
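The threshold search itself is easy to sketch in plain Python (using information gain instead of ReliefF for simplicity; independent of Orange, names are our own): candidate thresholds are midpoints between consecutive distinct values, each scored by the quality of the induced binary split.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy (base 2) of any sequence of values."""
    n = float(len(values))
    return -sum((c / n) * log(c / n, 2) for c in Counter(values).values())

def threshold_scores(values, labels):
    """Score each midpoint between consecutive distinct values by the
    information gain of the induced binary split."""
    n = float(len(labels))
    base = entropy(labels)
    distinct = sorted(set(values))
    scores = []
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        below = [c for v, c in zip(values, labels) if v <= t]
        above = [c for v, c in zip(values, labels) if v > t]
        gain = base - (len(below) / n * entropy(below) +
                       len(above) / n * entropy(above))
        scores.append((t, gain))
    return scores

def best_threshold(values, labels):
    """The threshold whose binary split scores highest."""
    return max(threshold_scores(values, labels), key=lambda ts: ts[1])
```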
- class Orange.feature.scoring.ScoreFromProbabilities
Abstract base class for feature scoring methods that can be computed from contingency matrices.
The classes used to estimate unconditional and conditional class probabilities, respectively. The defaults use relative frequencies; possible alternatives are, for instance, ProbabilityEstimatorConstructor_m and ConditionalProbabilityEstimatorConstructor_ByRows (with the estimator constructor again set to ProbabilityEstimatorConstructor_m).
- class Orange.feature.scoring.OrderAttributes(score=None)
Orders features by their scores.
- Orange.feature.scoring.score_all(data, score=Relief(k=20, m=50))
Assess the quality of features using the given measure and return a sorted list of tuples (feature name, score).

Parameters:
- data (Orange.data.Table) – the data to score
- score – the feature scoring method; defaults to Relief(k=20, m=50)

Return type:
list; a sorted list of tuples (feature name, score)
[Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining, Woodhead Publishing, 2007.
[Quinlan1986] J. R. Quinlan: Induction of Decision Trees, Machine Learning, 1986.
[Breiman1984] L. Breiman et al.: Classification and Regression Trees, Chapman and Hall, 1984.
[Kononenko1995] I. Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995.