source: orange/orange/doc/modules/orngEnsemble.htm @ 6538:a5f65d7f0b2c

<html>
<head>
<title>orngEnsemble: Orange Bagging and Boosting Module</title>
<link rel=stylesheet href="../style.css" type="text/css">
<link rel=stylesheet href="style-print.css" type="text/css" media=print>
</head>

<body>
<h1>orngEnsemble: Orange Bagging and Boosting Module</h1>
<index name="modules/ensemble methods">

<p>Module orngEnsemble implements Breiman's bagging and random forests,
and Freund and Schapire's boosting algorithms.</p>

<h2>BaggedLearner</h2>
<index name="ensemble learning">
<index name="modules+bagging">
<index name="classifiers/bagging">

<p><code><index name="classes/BaggedLearner (in orngEnsemble)">BaggedLearner</code> takes a learner and returns a bagged
learner, which is essentially a wrapper around the learner passed as
an argument. If examples are passed as well,
<code>BaggedLearner</code> returns a bagged classifier. Both the learner
and the classifier then behave just like any other learner and classifier
in Orange.</p>

<p class=section>Attributes</p>
<dl class=attributes>
  <dt>learner</dt>
  <dd>A learner to be bagged.</dd>

  <dt>examples</dt>
  <dd>If examples are passed to <code>BaggedLearner</code>, it
  returns a <code><index name="classes/BaggedClassifier (in orngEnsemble)">BaggedClassifier</code>, that is, it creates
  <code>t</code> classifiers, each trained with the learner on a bootstrap
  sample of the examples, as appropriate for bagging (default: None).</dd>

  <dt>t</dt>
  <dd>Number of bagged classifiers, that is, classifiers created when
  examples are passed to the bagged learner (default: 10).</dd>

  <dt>name</dt>
  <dd>The name of the learner (default: Bagging).</dd>
</dl>

<p>Bagging, in essence, takes training data and a learner, and
builds <code>t</code> classifiers, each time presenting the learner with a
bootstrap sample of the training data. When given a test example,
the classifiers vote on its class, and the bagged classifier returns the class
with the highest number of votes. As implemented in Orange, when class
probabilities are requested, these are proportional to the number of
votes for a particular class.</p>

<h3>Example</h3>
<p>See <a href="#ble">BoostedLearner example</a>.</p>
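
<p>In addition, the following minimal sketch (not from the original documentation) shows both ways of using <code>BaggedLearner</code>: as a wrapper around a learner, and with the <code>examples</code> argument to obtain a <code>BaggedClassifier</code> directly. It assumes the voting.tab dataset shipped with Orange; any classification dataset should work the same way.</p>

<xmp class=code>import orange, orngEnsemble, orngTree

data = orange.ExampleTable("voting.tab")
tree = orngTree.TreeLearner(mForPruning=2, name="tree")

# use as a learner: wrap the tree now, train later
bagged = orngEnsemble.BaggedLearner(tree, t=10, name="bagged tree")
classifier = bagged(data)

# or pass examples directly to obtain a BaggedClassifier at once
classifier = orngEnsemble.BaggedLearner(tree, examples=data, t=10)

# class probabilities are proportional to the votes of the t classifiers
print classifier(data[0], orange.GetProbabilities)
</xmp>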


<h2>BoostedLearner</h2>
<index name="modules+boosting">
<index name="classifiers/boosting">

<p>Instead of drawing a series of bootstrap samples from the training
set, boosting maintains a weight for each instance. When a classifier
is trained on the training set, the weights of misclassified
instances are increased. Just like in bagging, the class is
decided by a vote of the classifiers, but in boosting the votes are
weighted by the accuracy obtained on the training set.</p>

<p><code><index name="classes/BoostedLearner (in orngEnsemble)">BoostedLearner</code> is an implementation of AdaBoost.M1 (Freund and
Schapire, 1996). From the user's viewpoint, the use of
<code>BoostedLearner</code> is similar to that of
<code>BaggedLearner</code>. The learner passed as an argument needs to
deal with example weights.</p>
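
<p>To make the weighting scheme concrete, the sketch below (an illustration of AdaBoost.M1, not the module's actual code) shows one boosting round: the weighted error <code>epsilon</code> is computed, the weights of correctly classified examples are multiplied by <code>beta = epsilon / (1 - epsilon)</code>, the weights are renormalised, and the classifier's vote is weighted by <code>log(1 / beta)</code>.</p>

<xmp class=code>import math

def adaboost_round(weights, correct):
    # weights: one weight per training example
    # correct: booleans, True where this round's classifier was right
    epsilon = float(sum(w for w, c in zip(weights, correct) if not c)) / sum(weights)
    if epsilon == 0.0 or epsilon >= 0.5:
        return weights, None       # AdaBoost.M1 stops (or restarts) here
    beta = epsilon / (1.0 - epsilon)
    weights = [w * beta if c else w for w, c in zip(weights, correct)]
    total = sum(weights)
    weights = [w / total * len(weights) for w in weights]  # renormalise
    vote = math.log(1.0 / beta)    # weight of this classifier's vote
    return weights, vote
</xmp>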

<p class=section>Attributes</p>
<dl class=attributes>
  <dt>learner</dt>
  <dd>A learner to be boosted.</dd>

  <dt>examples</dt>
  <dd>If examples are passed to <code>BoostedLearner</code>, it
  returns a <code><index name="classes/BoostedClassifier (in orngEnsemble)">BoostedClassifier</code>, that is, it creates
  <code>t</code> classifiers using the learner and the examples,
  as appropriate for AdaBoost.M1 (default: None).</dd>

  <dt>t</dt>
  <dd>Number of boosted classifiers created from the
  example set (default: 10).</dd>

  <dt>name</dt>
  <dd>The name of the learner (default: AdaBoost.M1).</dd>
</dl>


<a name="ble"></a><h3>Example</h3>

<p>Let us try boosting and bagging on the lymphography data set and use
<code>TreeLearner</code> with post-pruning as a base learner. For
testing, we use 10-fold cross validation and observe classification
accuracy.</p>

<p class="header"><a href="ensemble.py">ensemble.py</a> (uses <a href=
"lymphography.tab">lymphography.tab</a>)</p>
<XMP class=code>import orange, orngEnsemble, orngTree
import orngTest, orngStat

tree = orngTree.TreeLearner(mForPruning=2, name="tree")
bs = orngEnsemble.BoostedLearner(tree, name="boosted tree")
bg = orngEnsemble.BaggedLearner(tree, name="bagged tree")

data = orange.ExampleTable("lymphography.tab")

learners = [tree, bs, bg]
results = orngTest.crossValidation(learners, data)
print "Classification Accuracy:"
for i in range(len(learners)):
    print "%15s: %5.3f" % (learners[i].name, orngStat.CA(results)[i])
</XMP>

<p>Running this script, we may get something like:</p>
<XMP class=code>Classification Accuracy:
           tree: 0.769
   boosted tree: 0.782
    bagged tree: 0.783
</XMP>



<h2>RandomForestLearner</h2>
<index name="modules+random forest">
<index name="classifiers/random forest">

<p>Just like in bagging, classifiers in random forests are trained on
bootstrap samples of the training data. Here, the classifiers are trees, but to
increase randomness they are built so that at each node the best
attribute is chosen from a randomly drawn subset of the attributes in the training
set. We closely follow the original algorithm (Breiman, 2001) both in
implementation and parameter defaults.</p>

<P>The learner is encapsulated in the class <index name="classes/RandomForestLearner (in orngEnsemble)"><CODE>RandomForestLearner</CODE>.</P>

<p class=section>Attributes</p>
<dl class=attributes>
  <dt>examples</dt>
  <dd>If these are passed, the call returns a
  <code><index name="classes/RandomForestClassifier (in orngEnsemble)">RandomForestClassifier</code>, that is, it creates the required
  set of decision trees, which, when presented with an example, vote
  for the predicted class.</dd>

  <dt>trees</dt>
  <dd>Number of trees in the forest (default: 100).</dd>

  <dt>learner</dt>
  <dd>Although not required, one can use this argument to pass one's
  own tree induction algorithm. If none is passed,
  <code>RandomForestLearner</code> will use Orange's tree induction
  algorithm, set up so that nodes with fewer than 5 examples
  are not considered for (further) splitting (default: None).</dd>

  <dt>attributes</dt>
  <dd>Number of attributes in the randomly drawn subset from which the best
  attribute is chosen when splitting a node during tree growing
  (default: None; if left at None, it is set to the square
  root of the number of attributes in the training set when the learner
  is given examples).</dd>

  <dt>rand</dt>
  <dd>Random generator used in bootstrap sampling. If none is passed,
  then Python's Random from the random library is used, with the seed
  initialized to 0.</dd>

  <dt>callback</dt>
  <dd>A function to be called after every iteration of induction of a
  classifier. It is called with a single parameter (from 0.0 to 1.0) that
  estimates learning progress (a usage sketch is given below the attribute list).</dd>

  <dt>name</dt>
  <dd>The name of the learner (default: Random Forest).</dd>
</dl>
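
<p>As an illustration of the <code>callback</code> argument, the following sketch (not part of the original documentation) prints a simple progress report while the forest is grown; the callback receives a single number between 0.0 and 1.0.</p>

<xmp class=code>import orange, orngEnsemble

data = orange.ExampleTable("bupa.tab")

def report_progress(progress):
    # called after each tree is induced; progress runs from 0.0 to 1.0
    print "%3.0f%% of the forest grown" % (100 * progress)

forest = orngEnsemble.RandomForestLearner(trees=10, callback=report_progress)
classifier = forest(data)
</xmp>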

<p>A note on voting. A random forest classifier uses the decision trees
induced from bootstrap samples of the training set to vote on the class of a presented
example, and the most frequent vote is returned. However, in our
implementation, if class probabilities are requested from the classifier,
it will return the probabilities averaged over all the trees.</p>
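
<p>The sketch below (not from the original documentation) illustrates both kinds of answers: a forest is trained on the bupa data used in the examples that follow, and then queried for the majority-vote class and for the averaged class probabilities of a single example.</p>

<xmp class=code>import orange, orngEnsemble

data = orange.ExampleTable("bupa.tab")
forest = orngEnsemble.RandomForestLearner(trees=50, attributes=3)(data)

example = data[0]
print "predicted class:", forest(example)
print "class probabilities:", forest(example, orange.GetProbabilities)
</xmp>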

<h3>Examples</h3>

<p>The following script assembles a random forest learner and compares
it to a tree learner on the liver disorders (bupa) data set.</p>

<p class="header"><a href="ensemble2.py">ensemble2.py</a> (uses <a href=
"bupa.tab">bupa.tab</a>)</p>
<xmp class=code>import orange, orngTree, orngEnsemble

data = orange.ExampleTable('bupa.tab')
forest = orngEnsemble.RandomForestLearner(trees=50, name="forest")
tree = orngTree.TreeLearner(minExamples=2, mForPruning=2,
                            sameMajorityPruning=True, name='tree')
learners = [tree, forest]

import orngTest, orngStat
results = orngTest.crossValidation(learners, data, folds=10)
print "Learner  CA     Brier  AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name,
        orngStat.CA(results)[i],
        orngStat.BrierScore(results)[i],
        orngStat.AUC(results)[i])
</xmp>

<p>Notice that our forest contains 50 trees. Learners are compared
through 10-fold cross validation, and results are reported as
classification accuracy, Brier score and area under the ROC curve:</p>

<xmp class=code>Learner  CA     Brier  AUC
tree     0.664  0.673  0.653
forest   0.710  0.373  0.777
</xmp>

<p>Perhaps the sole purpose of the following example is to show how to
access the individual classifiers once they are assembled into the
forest, and how to assemble a tree learner to be used in
random forests. The tree induction uses an attribute subset split
constructor borrowed from orngEnsemble, which is told to select
the best attribute for decision nodes from three randomly chosen
attributes.</p>

<p class="header"><a href="ensemble3.py">ensemble3.py</a> (uses <a href=
"bupa.tab">bupa.tab</a>)</p>
<xmp class=code>import orange, orngTree, orngEnsemble

data = orange.ExampleTable('bupa.tab')

tree = orngTree.TreeLearner(storeNodeClassifier=0, storeContingencies=0,
                            storeDistributions=1, minExamples=5).instance()
gini = orange.MeasureAttribute_gini()
tree.split.discreteSplitConstructor.measure = \
  tree.split.continuousSplitConstructor.measure = gini
tree.maxDepth = 5
tree.split = orngEnsemble.SplitConstructor_AttributeSubset(tree.split, 3)

forestLearner = orngEnsemble.RandomForestLearner(learner=tree, trees=50)
forest = forestLearner(data)

for c in forest.classifiers:
    print orngTree.countNodes(c),
print
</xmp>

<p>Running the above code reports the sizes (number of nodes) of
the trees in the constructed random forest.</p>



<h2>MeasureAttribute_randomForests</h2>

<p>L. Breiman (2001) suggested the possibility of using random forests
as a non-myopic measure of attribute importance.</p>

<p>Assessing the relevance of attributes with random forests is based
on the idea that randomly changing the value of an important attribute
greatly affects an example's classification, while changing the value
of an unimportant attribute does not affect it much. The implemented algorithm
accumulates attribute scores over a given number of trees.
The importance of an attribute for a single tree is computed as the number of
correctly classified out-of-bag (OOB) examples
minus the number of correctly classified OOB examples when the attribute's values are randomly
shuffled. The accumulated attribute scores are divided by the number
of trees used and multiplied by 100 before they are returned.</p>
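
<p>The per-tree score described above is essentially a permutation test. The sketch below (a rough illustration, not the module's actual code; <code>permutation_score</code> is a hypothetical helper) computes such a score for a single attribute, using an already trained classifier and a held-out example table in place of the tree's own OOB examples.</p>

<xmp class=code>import random
import orange

def permutation_score(classifier, examples, attr, rand=None):
    rand = rand or random.Random(0)

    # number of correctly classified examples on the untouched data
    correct = sum(classifier(ex) == ex.getclass() for ex in examples)

    # copy the examples and shuffle the values of one attribute
    shuffled = orange.ExampleTable(examples.domain, examples)
    values = [ex[attr] for ex in shuffled]
    rand.shuffle(values)
    for ex, v in zip(shuffled, values):
        ex[attr] = v

    correct_shuffled = sum(classifier(ex) == ex.getclass() for ex in shuffled)
    # the per-tree contribution described above
    return correct - correct_shuffled
</xmp>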

<p class=section>Attributes</p>
<dl class=attributes>

  <dt>trees</dt>
  <dd>Number of trees in the forest (default: 100).</dd>

  <dt>learner</dt>
  <dd>Although not required, one can use this argument to pass one's
  own tree induction algorithm. If none is passed,
  <code>MeasureAttribute_randomForests</code> will use Orange's tree induction
  algorithm, set up so that nodes with fewer than 5 examples
  are not considered for (further) splitting (default: None).</dd>


  <dt>attributes</dt>
  <dd>Number of attributes in the randomly drawn subset from which the best
  attribute is chosen when splitting a node during tree growing
  (default: None; if left at None, it is set to the square
  root of the number of attributes in the example set).</dd>

  <dt>rand</dt>
  <dd>Random generator used in bootstrap sampling. If none is passed,
  then Python's Random from the random library is used, with the seed
  initialized to 0.</dd>
</dl>

<p>Computation of attribute importance with random forests is rather slow,
and importances for all attributes need to be computed simultaneously.
Since we normally compute attribute importance with random forests
for all attributes in the dataset, <CODE>MeasureAttribute_randomForests</CODE>
caches the results. When it is called to compute the quality of a certain attribute,
it computes the qualities for all attributes in the dataset.
When called again, it uses the stored results if the domain is still
the same and the example table has not changed (this is done by
checking the example table's version and is not foolproof:
it won't detect changes to the values of existing examples,
but it will notice adding and removing examples; see the page on
<A href="../reference/ExampleTable.htm"><CODE>ExampleTable</CODE></A> for details).</P>

<p>Caching only has an effect if you use the same instance for all attributes in the domain.</p>
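
<p>In practice this means keeping a single measure object around, roughly as sketched below (an illustration, not from the original documentation); constructing a fresh measure for each attribute would regrow the forest on every call.</p>

<xmp class=code>import orange, orngEnsemble

data = orange.ExampleTable("iris.tab")

# one instance: the forest is grown once and the scores are cached
measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)
scores = [measure(a, data) for a in data.domain.attributes]

# a new instance per attribute would rebuild the forest for every call
slow = [orngEnsemble.MeasureAttribute_randomForests(trees=100)(a, data)
        for a in data.domain.attributes]
</xmp>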

<h3>Example</h3>

<p>The following script demonstrates measuring attribute importance with random forests.</p>

<p class="header"><a href="ensemble4.py">ensemble4.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>
<xmp class=code>import orange, orngEnsemble, random

data = orange.ExampleTable("iris.tab")

measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)

# call by attribute index
imp0 = measure(0, data)
# call by orange.Variable
imp1 = measure(data.domain.attributes[1], data)
print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)

print "different random seed"
measure = orngEnsemble.MeasureAttribute_randomForests(trees=100,
                                                      rand=random.Random(10))

imp0 = measure(0, data)
imp1 = measure(data.domain.attributes[1], data)
print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)

print "All importances:"
imps = measure.importances(data)
for i, imp in enumerate(imps):
    print "%15s: %6.2f" % (data.domain.attributes[i].name, imp)
</xmp>

<p>Corresponding output:</p>

<xmp class=code>first: 0.32, second: 0.04

different random seed
first: 0.33, second: 0.14

All importances:
   sepal length:   0.33
    sepal width:   0.14
   petal length:  15.16
    petal width:  48.59
</xmp>

<HR>
<h2>References</h2>

<P>L Breiman. Bagging Predictors. Technical report No. 421, University of
California, Berkeley, 1994. [<a href="http://www.stat.berkeley.edu/tech-reports/421.ps.Z">PS</a>]</P>

<P>Y Freund, RE Schapire. Experiments with a New Boosting
Algorithm. Machine Learning: Proceedings of the Thirteenth
International Conference (ICML'96), 1996. [<a href="http://citeseer.ist.psu.edu/freund96experiments.html">Citeseer</a>]</P>

<P>JR Quinlan. Boosting, bagging, and C4.5. In Proc. of 13th
National Conference on Artificial Intelligence (AAAI'96), pp. 725-730,
1996. [<a href="http://www.rulequest.com/Personal/q.aaai96.ps">PS</a>]</P>

<p>L Breiman. Random Forests. Machine Learning, 45, 5-32, 2001. [<a href="http://www.springerlink.com/content/u0p06167n6173512/">SpringerLink</a>]</p>

<p>M Robnik-Sikonja. Improving Random Forests. In Proc. of European Conference on Machine Learning (ECML 2004), pp. 359-370, 2004. [<a href="http://lkm.fri.uni-lj.si/rmarko/papers/robnik04-ecml.pdf">PDF</a>]</p>

</body>
</html>