##################################
Ensemble algorithms (``ensemble``)
##################################

.. index:: ensemble

`Ensembles <http://en.wikipedia.org/wiki/Ensemble_learning>`_ use
multiple models to improve prediction performance. The module
implements a number of popular approaches, including bagging,
boosting, stacking and random forests. Most of these are available
for both classification and regression, with the exception of
stacking, which in the present implementation supports
classification only.

*******
Bagging
*******

.. index:: bagging
.. index::
   single: ensemble; bagging

.. autoclass:: Orange.ensemble.bagging.BaggedLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.bagging.BaggedClassifier
   :members:
   :show-inheritance:
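
For orientation, here is a minimal usage sketch. It assumes the Orange
2.x API and the bundled lymphography data set; ``t``, the number of
bootstrap replicates, follows the class documentation above::

    import Orange

    data = Orange.data.Table("lymphography")
    tree = Orange.classification.tree.TreeLearner()

    # Wrap the base learner; t sets the number of bootstrap replicates.
    bagger = Orange.ensemble.bagging.BaggedLearner(tree, t=10)
    classifier = bagger(data)
    print(classifier(data[0]))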

********
Boosting
********

.. index:: boosting
.. index::
   single: ensemble; boosting

.. autoclass:: Orange.ensemble.boosting.BoostedLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.boosting.BoostedClassifier
   :members:
   :show-inheritance:
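
Boosted learners are constructed the same way as bagged ones; a
minimal sketch under the same Orange 2.x assumptions, with ``t``
boosting iterations::

    import Orange

    data = Orange.data.Table("lymphography")
    tree = Orange.classification.tree.TreeLearner()

    # AdaBoost-style wrapper; t sets the number of boosting iterations.
    booster = Orange.ensemble.boosting.BoostedLearner(tree, t=10)
    classifier = booster(data)
    print(classifier(data[0]))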

Example
=======

The following script fits classification models by boosting and
bagging on the lymphography data set, using a classification tree
learner with post-pruning as the base learner. The classification
accuracy of the methods is estimated by 10-fold cross-validation
(:download:`ensemble.py <code/ensemble.py>`):

.. literalinclude:: code/ensemble.py
  :lines: 7-

Running this script demonstrates some benefit of boosting and bagging
over the baseline learner::

    Classification Accuracy:
               tree: 0.764
       boosted tree: 0.770
        bagged tree: 0.790

********
Stacking
********

.. index:: stacking
.. index::
   single: ensemble; stacking

.. autoclass:: Orange.ensemble.stacking.StackedClassificationLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.stacking.StackedClassifier
   :members:
   :show-inheritance:
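
A minimal construction sketch, assuming the Orange 2.x API; the list
of base learners is the first argument, and ``folds`` (the internal
cross-validation used to build the meta-level data) is an assumed
keyword::

    import Orange

    data = Orange.data.Table("lymphography")

    # Level-0 learners whose cross-validated predictions become the
    # inputs to the meta learner.
    learners = [Orange.classification.bayes.NaiveLearner(),
                Orange.classification.tree.TreeLearner(),
                Orange.classification.knn.kNNLearner()]

    # folds is an assumed keyword name for the internal cross-validation.
    stacker = Orange.ensemble.stacking.StackedClassificationLearner(
        learners, folds=10)
    classifier = stacker(data)
    print(classifier(data[0]))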

Example
=======

Stacking often produces classifiers that are more predictive than
individual classifiers in the ensemble. This effect is illustrated by
a script that combines four different classification
algorithms (:download:`ensemble-stacking.py <code/ensemble-stacking.py>`):

.. literalinclude:: code/ensemble-stacking.py
  :lines: 3-

The benefits of stacking on this particular data set are
substantial (numbers show classification accuracy)::

   stacking: 0.934
      bayes: 0.858
       tree: 0.688
         lr: 0.764
        knn: 0.830

*************
Random Forest
*************

.. index:: random forest
.. index::
   single: ensemble; random forest

.. autoclass:: Orange.ensemble.forest.RandomForestLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.forest.RandomForestClassifier
   :members:
   :show-inheritance:
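
A minimal sketch, assuming the Orange 2.x API and the bundled bupa
data set; ``trees``, the ensemble size, follows the class
documentation above::

    import Orange

    data = Orange.data.Table("bupa")

    # trees sets the number of trees grown for the forest.
    forest_learner = Orange.ensemble.forest.RandomForestLearner(trees=50)
    forest = forest_learner(data)
    print(forest(data[0]))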

Example
=======

The following script assembles a random forest learner and compares it
to a tree learner on the liver disorders (bupa) and housing data sets
(:download:`ensemble-forest.py <code/ensemble-forest.py>`):

.. literalinclude:: code/ensemble-forest.py
  :lines: 7-

Notice that our forest contains 50 trees. Learners are compared
through 3-fold cross-validation::

    Classification: bupa.tab
    Learner  CA     Brier  AUC
    tree     0.586  0.829  0.575
    forest   0.710  0.392  0.752
    Regression: housing.tab
    Learner  MSE    RSE    R2
    tree     23.708  0.281  0.719
    forest   11.988  0.142  0.858

The following example shows how to access the individual classifiers
once they are assembled into the forest, and how to assemble a custom
tree learner for use in random forests. The best feature for each
decision node is selected among three randomly chosen features, and
``maxDepth`` and ``minExamples`` are both set to 5
(:download:`ensemble-forest2.py <code/ensemble-forest2.py>`):

.. literalinclude:: code/ensemble-forest2.py
  :lines: 7-

Running the above code reports the sizes (number of nodes) of the
trees in the constructed random forest.
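
The trained ensemble members are exposed on the classifier itself; a
short sketch, assuming the Orange 2.x API and that the
``classifiers`` attribute lists the individual trees::

    import Orange

    data = Orange.data.Table("bupa")
    forest = Orange.ensemble.forest.RandomForestLearner(trees=50)(data)

    # classifiers is assumed to hold the individual tree classifiers.
    print(len(forest.classifiers))
    print(forest.classifiers[0](data[0]))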

Feature scoring
===============

L. Breiman (2001) suggested the possibility of using random forests as
a non-myopic measure of feature importance.

The assessment of feature relevance with random forests is based on
the idea that randomly changing the value of an important feature
greatly affects an instance's classification, while changing the value
of an unimportant feature does not affect it much. The implemented
algorithm accumulates feature scores over a given number of trees. The
importance of a feature for a single tree is computed as the number of
correctly classified out-of-bag (OOB) instances minus the number of
correctly classified OOB instances when the feature's values are
randomly shuffled. The accumulated feature scores are divided by the
number of trees used and multiplied by 100 before they are returned.
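
In sketch form, the per-tree score described above looks like this
(plain illustrative Python with hypothetical names, not the module's
internal code; ``oob_X`` is a list of feature-value lists and
``predict`` is the tree's prediction function)::

    import random

    def per_tree_importance(predict, oob_X, oob_y, feature):
        # Correctly classified out-of-bag instances.
        correct = sum(predict(x) == y for x, y in zip(oob_X, oob_y))

        # Shuffle the feature's values across the OOB instances and
        # count correct classifications again.
        column = [x[feature] for x in oob_X]
        random.shuffle(column)
        shuffled = [x[:feature] + [v] + x[feature + 1:]
                    for x, v in zip(oob_X, column)]
        shuffled_correct = sum(predict(x) == y
                               for x, y in zip(shuffled, oob_y))

        # A large drop means the feature mattered for this tree.
        return correct - shuffled_correct

These per-tree scores are then summed over the trees, divided by the
number of trees, and multiplied by 100, as described above.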

.. autoclass:: Orange.ensemble.forest.ScoreFeature
   :members:

Computation of feature importance with random forests is rather slow,
and the importances for all features need to be computed
simultaneously. When asked to compute the quality of a certain
feature, ScoreFeature computes the qualities of all features in the
data set. When called again, it reuses the stored results if the
domain is still the same and the data table has not changed. This is
done by checking the data table's version and is not foolproof: it
will not detect changes to the values of existing instances, but it
will notice the addition and removal of instances (see the page on
:class:`Orange.data.Table` for details).

The following script scores the features of the iris data set
(:download:`ensemble-forest-measure.py <code/ensemble-forest-measure.py>`):

.. literalinclude:: code/ensemble-forest-measure.py
  :lines: 7-

The output of the above script is::

    DATA:iris.tab

    first: 3.91, second: 0.38

    different random seed
    first: 3.39, second: 0.46

    All importances:
       sepal length:   3.39
        sepal width:   0.46
       petal length:  30.15
        petal width:  31.98

References
----------

* L Breiman. Bagging Predictors. `Technical report No. 421
  <http://www.stat.berkeley.edu/tech-reports/421.ps.Z>`_. University
  of California, Berkeley, 1994.
* Y Freund, RE Schapire. `Experiments with a New Boosting Algorithm
  <http://citeseer.ist.psu.edu/freund96experiments.html>`_. Machine
  Learning: Proceedings of the Thirteenth International Conference
  (ICML'96), 1996.
* JR Quinlan. `Boosting, bagging, and C4.5
  <http://www.rulequest.com/Personal/q.aaai96.ps>`_. In Proc. of 13th
  National Conference on Artificial Intelligence
  (AAAI'96), pp. 725-730, 1996.
* L Breiman. `Random Forests
  <http://www.springerlink.com/content/u0p06167n6173512/>`_. Machine
  Learning, 45, 5-32, 2001.
* M Robnik-Sikonja. `Improving Random Forests
  <http://lkm.fri.uni-lj.si/rmarko/papers/robnik04-ecml.pdf>`_. In
  Proc. of European Conference on Machine Learning (ECML 2004),
  pp. 359-370, 2004.

.. automodule:: Orange.ensemble