Ensemble algorithms (ensemble)

Ensembles use multiple models to improve prediction performance. The module implements a number of popular approaches, including bagging, boosting, stacking and random forests. Most of these are available for both classification and regression, with the exception of stacking, which the present implementation supports for classification only.

Bagging

class Orange.ensemble.bagging.BaggedLearner(learner, t=10, name=Bagging)

Bases: Orange.classification.Learner

BaggedLearner takes a learner and returns a bagged learner, which is essentially a wrapper around the learner passed as an argument. If instances are passed in arguments, BaggedLearner returns a bagged classifier. Both learner and classifier then behave just like any other learner and classifier in Orange.

Bagging, in essence, takes training data and a learner, and builds t classifiers, each time presenting the learner with a bootstrap sample of the training data. When given a test instance, the classifiers vote on the class, and the bagged classifier returns the class with the highest number of votes. As implemented in Orange, when class probabilities are requested, they are proportional to the number of votes for each class.
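The procedure can be sketched in a few lines of plain Python (a minimal illustration, not the Orange implementation; `fit` here stands for any hypothetical learner that maps a training sample to a classifier):

```python
import random
from collections import Counter

def bagged_fit(data, fit, t=10, rng=None):
    """Train t models, each on a bootstrap sample of the data."""
    rng = rng or random.Random(0)
    models = []
    for _ in range(t):
        # bootstrap: draw len(data) instances with replacement
        sample = [rng.choice(data) for _ in range(len(data))]
        models.append(fit(sample))
    return models

def bagged_classify(models, x):
    """Majority vote; probabilities are proportional to vote counts."""
    votes = Counter(m(x) for m in models)
    label = votes.most_common(1)[0][0]
    probs = {c: n / float(len(models)) for c, n in votes.items()}
    return label, probs
```

The probability estimates are simply vote fractions, mirroring the behavior described above.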

Parameters:
  • learner (Orange.core.Learner) – learner to be bagged.
  • t (int) – number of bagged classifiers, that is, classifiers created when instances are passed to bagged learner.
  • name (str) – name of the resulting learner.
Return type:

Orange.ensemble.bagging.BaggedClassifier or Orange.ensemble.bagging.BaggedLearner

__call__(instances, weight=0)

Learn from the given table of data instances.

Parameters:
  • instances (Orange.data.Table) – data instances to learn from.
  • weight (int) – ID of meta feature with weights of instances
Return type:

Orange.ensemble.bagging.BaggedClassifier

class Orange.ensemble.bagging.BaggedClassifier(classifiers, name, class_var, **kwds)

Bases: Orange.classification.Classifier

A classifier that uses a bagging technique. Usually the learner (Orange.ensemble.bagging.BaggedLearner) is used to construct the classifier.

When constructing the classifier manually, the following parameters can be passed:

Parameters:
  • classifiers (list) – a list of bagged classifiers.
  • name (str) – name of the resulting classifier.
  • class_var (Orange.feature.Descriptor) – the class feature.
__call__(instance, result_type=0)
Parameters:
  • instance (Orange.data.Instance) – instance to be classified.
  • result_type – Orange.classification.Classifier.GetValue, Orange.classification.Classifier.GetProbabilities, or Orange.classification.Classifier.GetBoth
Return type:

Orange.data.Value, Orange.statistics.Distribution or a tuple with both

Boosting

class Orange.ensemble.boosting.BoostedLearner(learner, t=10, name=AdaBoost.M1)

Bases: Orange.classification.Learner

Instead of drawing a series of bootstrap samples from the training set, boosting maintains a weight for each instance. When a classifier is trained from the training set, the weights of misclassified instances are increased. Just like in a bagged learner, the class is decided by a vote of the classifiers, but in boosting the votes are weighted by the accuracy obtained on the training set.

BoostedLearner is an implementation of AdaBoost.M1 (Freund and Schapire, 1996). From the user's viewpoint, the use of the BoostedLearner is similar to that of BaggedLearner. The learner passed as an argument must be able to deal with instance weights.
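The reweighting step of AdaBoost.M1 can be sketched as follows (an illustrative helper under the assumptions of the original algorithm, not the Orange code): given the current classifier's weighted error e, correctly classified instances have their weights multiplied by beta = e / (1 - e) and the weights are renormalized, while the classifier receives a voting weight of log(1 / beta).

```python
import math

def adaboost_m1_round(predictions, labels, weights):
    """One AdaBoost.M1 reweighting step (sketch, not the Orange code).

    predictions, labels: per-instance outputs of the current classifier
    weights: current instance weights (assumed to sum to 1)
    Returns (new_weights, vote_weight), or None when the round is
    too weak (error >= 0.5) or perfect, where AdaBoost.M1 stops.
    """
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    if err >= 0.5 or err == 0:
        return None
    beta = err / (1.0 - err)
    # shrink weights of correctly classified instances, then renormalize
    new_w = [w * beta if p == y else w
             for w, p, y in zip(weights, predictions, labels)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]
    vote = math.log(1.0 / beta)  # this classifier's voting weight
    return new_w, vote
```

After reweighting, misclassified instances carry a larger share of the total weight, so the next classifier focuses on them.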

Parameters:
  • learner (Orange.core.Learner) – learner to be boosted.
  • t (int) – number of boosted classifiers created from the instance set.
  • name (str) – name of the resulting learner.
Return type:

Orange.ensemble.boosting.BoostedClassifier or Orange.ensemble.boosting.BoostedLearner

__call__(instances, orig_weight=0)

Learn from the given table of data instances.

Parameters:
  • instances (Orange.data.Table) – data instances to learn from.
  • orig_weight (int) – ID of meta feature with weights of instances
Return type:

Orange.ensemble.boosting.BoostedClassifier

class Orange.ensemble.boosting.BoostedClassifier(classifiers, name, class_var, **kwds)

Bases: Orange.classification.Classifier

A classifier that uses a boosting technique. Usually the learner (Orange.ensemble.boosting.BoostedLearner) is used to construct the classifier.

When constructing the classifier manually, the following parameters can be passed:

Parameters:
  • classifiers (list) – a list of boosted classifiers.
  • name (str) – name of the resulting classifier.
  • class_var (Orange.feature.Descriptor) – the class feature.
__call__(instance, result_type=0)
Parameters:
  • instance (Orange.data.Instance) – instance to be classified.
  • result_type – Orange.classification.Classifier.GetValue, Orange.classification.Classifier.GetProbabilities, or Orange.classification.Classifier.GetBoth
Return type:

Orange.data.Value, Orange.statistics.Distribution or a tuple with both

Example

The following script fits classification models by boosting and bagging on the lymphography data set, using a tree learner with post-pruning as the base learner. Classification accuracy of the methods is estimated by 3-fold cross validation (ensemble.py):

import Orange

tree = Orange.classification.tree.TreeLearner(m_pruning=2, name="tree")
bs = Orange.ensemble.boosting.BoostedLearner(tree, name="boosted tree")
bg = Orange.ensemble.bagging.BaggedLearner(tree, name="bagged tree")

lymphography = Orange.data.Table("lymphography.tab")

learners = [tree, bs, bg]
results = Orange.evaluation.testing.cross_validation(learners, lymphography, folds=3)
print "Classification Accuracy:"
for i in range(len(learners)):
    print "%15s: %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])

Running this script demonstrates some benefit of boosting and bagging over the baseline learner:

Classification Accuracy:
           tree: 0.764
   boosted tree: 0.770
    bagged tree: 0.790

Stacking

class Orange.ensemble.stacking.StackedClassificationLearner(learners, meta_learner=NaiveLearner 'naive', folds=10, name=stacking)

Bases: Orange.classification.Learner

Stacking infers a meta classifier from class probability estimates obtained on cross-validation held-out data by level-0 classifiers trained on the corresponding held-in data.

Parameters:
  • learners (list) – level-0 learners.
  • meta_learner (Learner) – meta learner (default: NaiveLearner).
  • folds – number of iterations (folds) of cross-validation to assemble class probability data for meta learner.
  • name (string) – learner name (default: stacking).
Return type:

StackedClassificationLearner or StackedClassifier
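The assembly of the meta-level training data can be sketched as follows (an illustration of the procedure described above, not the Orange API; `learners` are hypothetical functions mapping a training set to a classifier that returns a class probability distribution):

```python
def stacking_meta_table(data, learners, folds=3):
    """Build the level-1 (meta) training data for stacking (sketch).

    For each fold, every level-0 learner is trained on the held-in part
    and its class probabilities on the held-out part are recorded; the
    concatenated probabilities, paired with the true class, form one
    meta-level instance.
    """
    meta = []
    for k in range(folds):
        held_out = [d for i, d in enumerate(data) if i % folds == k]
        held_in = [d for i, d in enumerate(data) if i % folds != k]
        models = [fit(held_in) for fit in learners]
        for x, y in held_out:
            row = []
            for m in models:
                row.extend(m(x))  # m(x) returns a class distribution
            meta.append((row, y))
    return meta
```

The meta learner (Naive Bayes by default) is then trained on this table, and the level-0 learners are retrained on the full data set.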

class Orange.ensemble.stacking.StackedClassifier(classifiers, meta_classifier, **kwds)

A classifier for stacking. Uses a set of level-0 classifiers to induce class probabilities, which are an input to a meta-classifier to predict class probability for a given data instance.

Parameters:
  • classifiers (list) – a list of level-0 classifiers.
  • meta_classifier (Classifier) – meta-classifier.

Example

Stacking often produces classifiers that are more predictive than individual classifiers in the ensemble. This effect is illustrated by a script that combines four different classification algorithms (ensemble-stacking.py):

data = Orange.data.Table("promoters")

bayes = Orange.classification.bayes.NaiveLearner(name="bayes")
tree = Orange.classification.tree.SimpleTreeLearner(name="tree")
lin = Orange.classification.svm.LinearLearner(name="lr")
knn = Orange.classification.knn.kNNLearner(name="knn")

base_learners = [bayes, tree, lin, knn]
stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners)

learners = [stack, bayes, tree, lin, knn]
res = Orange.evaluation.testing.cross_validation(learners, data, 3)
print "\n".join(["%8s: %5.3f" % (l.name, r) for r, l in zip(Orange.evaluation.scoring.CA(res), learners)])

The benefits of stacking on this particular data set are substantial (numbers show classification accuracy):

stacking: 0.915
   bayes: 0.858
    tree: 0.688
      lr: 0.868
     knn: 0.839

Random Forest

class Orange.ensemble.forest.RandomForestLearner(trees=100, attributes=None, name=Random Forest, rand=None, callback=None, base_learner=None, learner=None)

Bases: Orange.classification.Learner

Trains an ensemble predictor consisting of trees trained on bootstrap samples of training data. To increase randomness, the tree learner considers only a subset of candidate features at each node. The algorithm closely follows the original procedure (Breiman, 2001) both in implementation and parameter defaults.
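The per-node randomization can be sketched as follows (a minimal illustration of the described default, not the Orange implementation; `candidate_features` is a hypothetical helper):

```python
import math
import random

def candidate_features(n_features, attributes=None, rng=None):
    """Features considered at one split.

    With attributes=None the subset size defaults to the square root of
    the number of features, as described above; a fresh random subset
    is drawn for every node.
    """
    rng = rng or random.Random(0)
    k = attributes if attributes is not None else int(math.sqrt(n_features))
    return rng.sample(range(n_features), k)
```

The split is then chosen as the best among only these candidates, which decorrelates the trees in the forest.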

Parameters:
  • trees (int) – number of trees in the forest.
  • attributes (int) – number of randomly drawn features among which to select the best one to split the data sets in tree nodes. The default, None, means the square root of the number of features in the training data. Ignored if learner is specified.
  • base_learner (None or Orange.classification.tree.TreeLearner or Orange.classification.tree.SimpleTreeLearner) – A base tree learner. The base learner will be randomized with Random Forest’s random feature subset selection. If None (default), SimpleTreeLearner is used; it will not split nodes with fewer than 5 data instances.
  • rand – random generator used in bootstrap sampling. If None (default), then random.Random(0) is used.
  • learner (None or Orange.core.Learner) – Tree induction learner. If None (default), the base_learner will be used (and randomized). If learner is specified, it will be used as such with no additional transformations.
  • callback – a function called after each tree is induced. It receives a single float argument (from 0.0 to 1.0) estimating the learning progress.
  • name (string) – learner name.
Return type:

RandomForestClassifier or RandomForestLearner

__call__(instances, weight=0)

Learn from the given table of data instances.

Parameters:
  • instances (Orange.data.Table) – learning data.
  • weight (int) – ID of meta feature with weights of instances
Return type:

Orange.ensemble.forest.RandomForestClassifier

class Orange.ensemble.forest.RandomForestClassifier(classifiers, name, domain, class_var, class_vars, **kwds)

Bases: Orange.classification.Classifier

Uses the trees induced by the RandomForestLearner. An input instance is classified into the class with the most frequent vote. When class probabilities are requested, this implementation returns the average of the probability distributions predicted by the individual trees.

When constructed manually, the following parameters have to be passed:

Parameters:
  • classifiers (list) – a list of tree classifiers.
  • name (str) – name of the resulting classifier.
  • domain (Orange.data.Domain) – the domain of the learning data.
  • class_var (Orange.feature.Descriptor) – the class feature.
  • class_vars (list) – a list of class features (for multi-target classification).
__call__(instance, result_type=0)
Parameters:
  • instance (Orange.data.Instance) – instance to be classified.
  • result_type – Orange.classification.Classifier.GetValue, Orange.classification.Classifier.GetProbabilities, or Orange.classification.Classifier.GetBoth
Return type:

Orange.data.Value, Orange.statistics.Distribution or a tuple with both

Example

The following script assembles a random forest learner and compares it to a tree learner on the liver disorders (bupa) and housing data sets.

ensemble-forest.py

import Orange

forest = Orange.ensemble.forest.RandomForestLearner(trees=50, name="forest")
tree = Orange.classification.tree.TreeLearner(min_instances=2, m_pruning=2, \
                            same_majority_pruning=True, name='tree')
learners = [tree, forest]

print "Classification: bupa.tab"
bupa = Orange.data.Table("bupa.tab")
results = Orange.evaluation.testing.cross_validation(learners, bupa, folds=3)
print "Learner  CA     Brier  AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name, \
        Orange.evaluation.scoring.CA(results)[i],
        Orange.evaluation.scoring.Brier_score(results)[i],
        Orange.evaluation.scoring.AUC(results)[i])

print "Regression: housing.tab"
housing = Orange.data.Table("housing.tab")
results = Orange.evaluation.testing.cross_validation(learners, housing, folds=3)
print "Learner  MSE    RSE    R2"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name, \
        Orange.evaluation.scoring.MSE(results)[i],
        Orange.evaluation.scoring.RSE(results)[i],
        Orange.evaluation.scoring.R2(results)[i],)

Notice that our forest contains 50 trees. Learners are compared through 3-fold cross validation:

Classification: bupa.tab
Learner  CA     Brier  AUC
tree     0.586  0.829  0.575
forest   0.710  0.392  0.752
Regression: housing.tab
Learner  MSE    RSE    R2
tree     23.708  0.281  0.719
forest   11.988  0.142  0.858

The following example shows how to access the individual classifiers once they are assembled into the forest, and how to assemble a tree learner to be used in random forests. The best feature for decision nodes is selected among three randomly chosen features, and min_instances and max_depth are both set to 5.

ensemble-forest2.py

import Orange

bupa = Orange.data.Table('bupa.tab')

tree = Orange.classification.tree.TreeLearner()
tree.min_instances = 5
tree.max_depth = 5

forest_learner = Orange.ensemble.forest.RandomForestLearner(base_learner=tree, trees=50, attributes=3)
forest = forest_learner(bupa)

for c in forest.classifiers:
    print c.countNodes(),
print

Running the above code reports the sizes (number of nodes) of the trees in the constructed random forest.

Feature scoring

L. Breiman (2001) suggested the possibility of using random forests as a non-myopic measure of feature importance.

The assessment of feature relevance with random forests is based on the idea that randomly changing the value of an important feature greatly affects an instance’s classification, while changing the value of an unimportant feature does not affect it much. The implemented algorithm accumulates feature scores over a given number of trees. The importance of a feature for a single tree is computed as the number of correctly classified out-of-bag (OOB) instances minus the number of correctly classified OOB instances when the feature’s values are randomly shuffled. The accumulated feature scores are divided by the number of trees and multiplied by 100 before they are returned.
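The scoring step can be sketched as follows (an illustration of the computation described above, not the Orange implementation; trees and their OOB sets are assumed to be given):

```python
import random

def feature_score(trees, oob_sets, feature, rng=None):
    """OOB permutation score for one feature (sketch).

    trees: classifiers mapping a feature vector to a class
    oob_sets: per-tree out-of-bag instances as (features, label) pairs,
              with features given as tuples
    feature: index of the column to shuffle
    """
    rng = rng or random.Random(0)
    total = 0
    for tree, oob in zip(trees, oob_sets):
        correct = sum(1 for x, y in oob if tree(x) == y)
        # shuffle the feature's values across the OOB instances
        col = [x[feature] for x, _ in oob]
        rng.shuffle(col)
        shuffled = [(x[:feature] + (v,) + x[feature + 1:], y)
                    for (x, y), v in zip(oob, col)]
        correct_shuffled = sum(1 for x, y in shuffled if tree(x) == y)
        total += correct - correct_shuffled
    # average over trees, scaled by 100 as described above
    return 100.0 * total / len(trees)
```

Shuffling a feature the trees never consult leaves the predictions unchanged, so such a feature scores zero.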

class Orange.ensemble.forest.ScoreFeature(trees=100, attributes=None, rand=None, base_learner=None, learner=None)
Parameters:
  • trees (int) – number of trees in the forest.
  • attributes (int) – number of randomly drawn features among which to select the best to split the nodes in tree induction. The default, None, means the square root of the number of features in the training data. Ignored if learner is specified.
  • base_learner (None or Orange.classification.tree.TreeLearner or Orange.classification.tree.SimpleTreeLearner) – A base tree learner. The base learner will be randomized with Random Forest’s random feature subset selection. If None (default), SimpleTreeLearner is used; it will not split nodes with fewer than 5 data instances.
  • rand – random generator used in bootstrap sampling. If None (default), then random.Random(0) is used.
  • learner (None or Orange.core.Learner) – Tree induction learner. If None (default), the base_learner will be used (and randomized). If learner is specified, it will be used as such with no additional transformations.
__call__(feature, instances, aprior_class=None)

Return importance of a given feature. Only the first call on a given data set is computationally expensive.

Parameters:
  • feature (int or Orange.feature.Descriptor) – the feature to evaluate, given as an index or a descriptor.
  • instances (Orange.data.Table) – data set on which the feature’s importance is measured.
  • aprior_class – retained for interface compatibility; not used.
importances(table)

DEPRECATED. Return importance of all features in the dataset as a list.

Parameters:table (Orange.data.Table) – dataset of which the features’ importance needs to be measured.

Computation of feature importance with random forests is rather slow, and importances for all features need to be computed simultaneously. When called to compute the quality of a certain feature, it computes the qualities of all features in the dataset. When called again, it reuses the stored results if the domain is still the same and the data table has not changed. (Change detection checks the data table’s version and is not foolproof: it will not detect changed values of existing instances, but will notice added or removed instances; see the page on Orange.data.Table for details.)
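The caching pattern can be sketched as follows (a hypothetical, simplified illustration of the behavior described above, not the Orange code; it assumes the table exposes a version attribute that changes when instances are added or removed):

```python
class CachedScores:
    """Cache all feature scores, keyed on the table's domain and version."""

    def __init__(self, compute_all):
        self.compute_all = compute_all  # table -> {feature: score}
        self._key = None
        self._scores = None

    def __call__(self, feature, table):
        key = (table.domain, table.version)
        if key != self._key:
            # first call on this data, or the table changed: recompute
            self._scores = self.compute_all(table)
            self._key = key
        return self._scores[feature]
```

Only the first call per (domain, version) pair pays the full cost; subsequent lookups are dictionary accesses.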

ensemble-forest-measure.py

import Orange
import random

files = [ "iris.tab" ]

for fn in files:
    print "\nDATA:" + fn + "\n"
    iris = Orange.data.Table(fn)

    measure = Orange.ensemble.forest.ScoreFeature(trees=100)

    #call by attribute index
    imp0 = measure(0, iris) 
    #call with a Descriptor
    imp1 = measure(iris.domain.attributes[1], iris)
    print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)

    print "different random seed"
    measure = Orange.ensemble.forest.ScoreFeature(trees=100, 
            rand=random.Random(10))

    imp0 = measure(0, iris)
    imp1 = measure(iris.domain.attributes[1], iris)
    print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)

    print "All importances:"
    for at in iris.domain.attributes:
        print "%15s: %6.2f" % (at.name, measure(at, iris))

The output of the above script is:

DATA:iris.tab

first: 3.91, second: 0.38

different random seed
first: 3.39, second: 0.46

All importances:
   sepal length:   3.39
    sepal width:   0.46
   petal length:  30.15
    petal width:  31.98
References

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  • Freund, Y. and Schapire, R.E. (1996). Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning, 148-156.