source: orange/docs/tutorial/rst/ensembles.rst @ 11051:04009d17e84e

.. index:: ensembles

Ensembles
=========

`Learning of ensembles <http://en.wikipedia.org/wiki/Ensemble_learning>`_ combines the predictions of separate models to improve accuracy. The models may be trained on different samples of the data, or may use different learners on the same data set. Learners can also be diversified by changing their parameter sets.

In Orange, ensembles are simply wrappers around learners. They behave just like any other learner. Given the data, they return models that can predict the outcome for any data instance::

   >>> import Orange
   >>> data = Orange.data.Table("housing")
   >>> tree = Orange.classification.tree.TreeLearner()
   >>> btree = Orange.ensemble.bagging.BaggedLearner(tree)
   >>> btree
   BaggedLearner 'Bagging'
   >>> btree(data)
   BaggedClassifier 'Bagging'
   >>> btree(data)(data[0])
   <orange.Value 'MEDV'='24.6'>

The last line builds a predictor (``btree(data)``) and then uses it on the first data instance.

Most ensemble methods can wrap either classification or regression learners. Exceptions are task-specialized techniques such as boosting.

Bagging and Boosting
--------------------

.. index::
   single: ensembles; bagging

`Bootstrap aggregating <http://en.wikipedia.org/wiki/Bootstrap_aggregating>`_, or bagging, samples the training data uniformly and with replacement to train a set of different predictors. A majority vote (classification) or the mean (regression) then combines the independent predictions into a single one.

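For illustration, here is a rough sketch of the idea behind bagging for regression; it reuses the housing data from above, draws ten bootstrap samples (an arbitrary number), and averages the predictions by hand. Orange's ``BaggedLearner`` does all of this for you::

   import random
   import Orange

   data = Orange.data.Table("housing")
   tree = Orange.classification.tree.TreeLearner()

   # train each model on a bootstrap sample drawn with replacement from the data
   models = []
   for i in range(10):
       sample = Orange.data.Table(data.domain,
                                  [random.choice(data) for j in range(len(data))])
       models.append(tree(sample))

   # regression: average the individual predictions for a single instance
   instance = data[0]
   prediction = sum(float(model(instance)) for model in models) / len(models)
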
.. index::
   single: ensembles; boosting

In general, boosting is a technique that combines weak learners into a single strong learner. Orange implements `AdaBoost <http://en.wikipedia.org/wiki/AdaBoost>`_, which assigns weights to data instances according to the performance of the learner. The weights are used to iteratively resample the instances, so that learning focuses on those that are harder to classify. In the aggregation, AdaBoost gives more say to the individual classifiers that performed better on their training sets.

The following script wraps a classification tree in boosted and bagged learners, and tests the three learners through cross-validation:

.. literalinclude:: code/ensemble-bagging.py

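If the script is not at hand, a minimal version along these lines should produce a similar comparison. This is only a sketch: the data set and the number of folds are assumptions, and the actual code/ensemble-bagging.py may differ in detail::

   import Orange

   data = Orange.data.Table("promoters")  # assumed data set; any classification data will do

   tree = Orange.classification.tree.TreeLearner()
   tree.name = "tree"
   boost = Orange.ensemble.boosting.BoostedLearner(tree, name="boost")
   bagg = Orange.ensemble.bagging.BaggedLearner(tree, name="bagg")

   learners = [tree, boost, bagg]
   results = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
   for learner, auc in zip(learners, Orange.evaluation.scoring.AUC(results)):
       print("%5s: %.2f" % (learner.name, auc))
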
The benefit of the two ensembling techniques, assessed in terms of area under the ROC curve, is obvious::

    tree: 0.83
   boost: 0.90
    bagg: 0.91

Stacking
--------

.. index::
   single: ensembles; stacking

Suppose we partition a training set into a held-in and a held-out set, and that our task is to predict y, either the probability of the target class in classification or a real value in regression. We are given a set of learners. We train them on the held-in set and obtain a vector of predictions on the held-out set, where each element of the vector corresponds to the prediction of one of the learners. We can now learn how to combine these predictions into a final one by training a new predictor on the data set of prediction vectors and the true values of y from the held-out set. This technique is called `stacked generalization <http://en.wikipedia.org/wiki/Ensemble_learning#Stacking>`_, or stacking for short. Instead of a single split into held-in and held-out sets, the vectors of predictions are obtained through cross-validation.

Orange provides a wrapper for stacking that takes a set of base learners and a meta learner:

.. literalinclude:: code/ensemble-stacking.py
   :lines: 3-

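For a quick feel of the interface, the stacked learner is used just like any other learner. The following sketch (the data set is an assumption chosen for illustration) builds a stacked model from three base learners and applies it to one instance::

   import Orange

   data = Orange.data.Table("voting")  # assumed data set for illustration

   base_learners = [Orange.classification.bayes.NaiveLearner(),
                    Orange.classification.tree.TreeLearner(),
                    Orange.classification.knn.kNNLearner()]
   stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners)

   model = stack(data)    # internally cross-validates the base learners to train the meta model
   print(model(data[0]))  # prediction of the stacked ensemble for the first instance
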
By default, the meta classifier is a naive Bayesian classifier. Changing it to logistic regression may be a good idea as well::

    stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners,
                meta_learner=Orange.classification.logreg.LogRegLearner)

Stacking is often better than each of the base learners alone, as also demonstrated by running our script::

   stacking: 0.967
      bayes: 0.933
       tree: 0.836
        knn: 0.947

Random Forests
--------------

.. index::
   single: ensembles; random forests

A `random forest <http://en.wikipedia.org/wiki/Random_forest>`_ is an ensemble of tree predictors. The trees are diversified by randomizing the feature selection at each node split: instead of the single best feature, a feature is picked at random from a set of the best candidates. Another source of randomization is the bootstrap sample of the data from which each tree is grown. Predictions of the trees, usually several hundred of them, are aggregated by voting. Constructing so many trees may be computationally demanding, so Orange uses a special tree inducer (Orange.classification.tree.SimpleTreeLearner, used by default) optimized for speed in random forest construction:

.. literalinclude:: code/ensemble-forest.py
   :lines: 3-

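Outside the script, a forest is constructed and used like any other learner. In the sketch below, the data set and the number of trees are assumptions, and the base learner is passed explicitly only to show that SimpleTreeLearner is the default::

   import Orange

   data = Orange.data.Table("voting")  # assumed data set for illustration

   # 50 trees; SimpleTreeLearner is already the default base learner, shown here explicitly
   forest = Orange.ensemble.forest.RandomForestLearner(
       trees=50, base_learner=Orange.classification.tree.SimpleTreeLearner())
   model = forest(data)
   print(model(data[0]))  # prediction of the forest for the first instance
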
Random forests are often superior to other base classification or regression learners::

   forest: 0.976
    bayes: 0.935
      knn: 0.952