source: orange/orange/doc/ofb/c_bagging.htm @ 6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 5.3 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

Made XPM version of the icon 32x32.

<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="c_nb.htm">Naive Bayes in Python</a>,
Next: <a href="regression.htm">Regression</a>,
Up: <a href="c_pythonlearner.htm">Build Your Own Learner</a>
</p>

<H1>Build Your Own Bagger</H1>

<p>Here we show how to use the schema that allows us to build our own
learners/classifiers for bagging. While bagging, boosting, and other
ensemble methods are available in the <a
href="../modules/orngEnsemble.htm">orngEnsemble</a> module, we thought
that showing how to code bagging in Python would make a nice
example. The following pseudo-code (from
Witten &amp; Frank: Data Mining) illustrates the main idea of bagging:</p>

<xmp class="code">MODEL GENERATION
Let n be the number of instances in the training data.
For each of t iterations:
   Sample n instances with replacement from training data.
   Apply the learning algorithm to the sample.
   Store the resulting model.

CLASSIFICATION
For each of the t models:
   Predict class of instance using model.
Return class that has been predicted most often.
</xmp>

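<p>The pseudo-code translates almost line for line into plain Python. The
sketch below is independent of Orange: <code>learn</code> stands for any
function that maps a list of (instance, class) pairs to a model, and a model
is any function that maps an instance to a class (the names are ours, for
illustration only):</p>

<xmp class="code">import random

def bagged_model_generation(data, learn, t=10):
    # MODEL GENERATION: t models, each trained on a bootstrap sample
    n = len(data)
    models = []
    for _ in range(t):
        # sample n instances with replacement from the training data
        sample = [data[random.randrange(n)] for _ in range(n)]
        models.append(learn(sample))
    return models

def bagged_classify(models, instance):
    # CLASSIFICATION: return the class predicted by the most models
    votes = {}
    for model in models:
        c = model(instance)
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)
</xmp>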
<p>Following the above idea, our <code>Learner_Class</code> will
build t classifiers and pass them to
<code>Classifier</code>, which, when given a data instance, will use them for
classification. We will allow the parameter t to be specified by the
user, with 10 as the default.</p>

<p>The code for the <code>Learner_Class</code> is therefore:</p>

<p class="header">class <code>Learner_Class</code> from <a href=
"bagging.py">bagging.py</a></p>
<xmp class="code">import random

class Learner_Class:
    def __init__(self, learner, t=10, name='bagged classifier'):
        self.t = t
        self.name = name
        self.learner = learner

    def __call__(self, examples, weight=None):
        n = len(examples)
        classifiers = []
        for i in range(self.t):
            # draw n instance indices with replacement
            selection = []
            for j in range(n):
                selection.append(random.randrange(n))
            data = examples.getitems(selection)
            classifiers.append(self.learner(data))

        return Classifier(classifiers=classifiers,
            name=self.name, domain=examples.domain)
</xmp>

<p>Upon invocation, <code>__init__</code> stores the base learner (the one that
will be bagged), the value of the parameter t, and the name of the
classifier. Note that while the base learner must be
specified, the parameters t and name are optional.</p>

<p>When the learner is called with examples, a list of t
classifiers is built and stored in the variable <code>classifiers</code>. Notice that
for sampling the data with replacement, a list of data instance indices
is built (<code>selection</code>) and then used to sample the data from the training
examples (<code>examples.getitems</code>). Finally, a <code>Classifier</code> is constructed
with the list of classifiers, the name, and the domain information.</p>
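<p>The sampling step can be tried in isolation on an ordinary Python list;
<code>getitems</code> has no plain-Python equivalent, so we index the list
directly (a sketch, not Orange code):</p>

<xmp class="code">import random

def bootstrap_sample(examples):
    # draw len(examples) indices with replacement, as in Learner_Class
    n = len(examples)
    selection = [random.randrange(n) for _ in range(n)]
    # examples.getitems(selection) in Orange; plain indexing here
    return [examples[i] for i in selection]
</xmp>

<p>Because sampling is done with replacement, each bootstrap sample contains
on average only about 63.2% of the distinct training instances, with the
rest appearing as duplicates.</p>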
<p class="header">class <code>Classifier</code> from <a href=
"bagging.py">bagging.py</a></p>
<xmp class="code">class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        # count the votes of the individual classifiers per class
        freq = [0.] * len(self.domain.classVar.values)
        for c in self.classifiers:
            freq[int(c(example))] += 1
        index = freq.index(max(freq))
        value = orange.Value(self.domain.classVar, index)
        # turn vote counts into proportions
        for i in range(len(freq)):
            freq[i] = freq[i] / len(self.classifiers)
        if resultType == orange.GetValue: return value
        elif resultType == orange.GetProbabilities: return freq
        else: return (value, freq)
</xmp>


<p>For initialization, <code>Classifier</code> stores all the parameters it was
invoked with. When called with a data instance, it initializes a list
<code>freq</code> whose length equals the number of classes and which
records how many models classify the instance into each
class. The class that the majority of models voted for is
returned. While it would be possible to return the class's index, or even
its name, by convention classifiers in Orange return a <code>Value</code> object
instead.</p>
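<p>The voting arithmetic can be checked on its own. Given the predictions of
the individual models as class indices, <code>freq</code> counts the votes per
class, and the winning index is the one with the most votes (a plain-Python
stand-in for the Orange types above):</p>

<xmp class="code">def vote(predictions, nClasses):
    # count votes per class index, as Classifier.__call__ does
    freq = [0.] * nClasses
    for p in predictions:
        freq[int(p)] += 1
    index = freq.index(max(freq))
    # turn counts into proportions
    probs = [f / len(predictions) for f in freq]
    return index, probs
</xmp>

<p>For instance, with five models of which three predict class 1,
<code>vote([0, 1, 1, 0, 1], 2)</code> returns index 1 and the proportions
[0.4, 0.6]; these proportions are the "probabilities" discussed next.</p>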

<p>Notice that although bagging was not originally intended to
compute class probabilities, we compute these as the
proportion of models that voted for each class (this is
probably incorrect, but it suffices for our example and does no harm
if only the class values, and not the probabilities, are used).</p>

<p>Here is the code that tests the bagging we have just
implemented. It compares a decision tree and its bagged variant.
Run it yourself to see which one is better!</p>
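<p>Note that the script calls <code>bagging.Learner</code> rather than
<code>Learner_Class</code>. Following the convention from the <a
href="c_pythonlearner.htm">Build Your Own Learner</a> chapter, bagging.py
would also define a small wrapper that returns the learner itself when called
without data, and a trained classifier when called with data. A sketch of
that convention, with a simplified stand-in learner (the real wrapper would
pass its keyword arguments on to the <code>Learner_Class</code> defined
above):</p>

<xmp class="code">class Learner_Class:
    # simplified stand-in for the class defined above
    def __init__(self, t=10, name='bagged classifier'):
        self.t = t
        self.name = name
    def __call__(self, examples):
        return 'classifier trained on %d examples' % len(examples)

def Learner(examples=None, **kwds):
    learner = Learner_Class(**kwds)
    if examples is None:
        return learner        # no data: hand back the learner itself
    return learner(examples)  # data given: train and return a classifier
</xmp>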

<p class="header"><a href="bagging_test.py">bagging_test.py</a> (uses <a
href="bagging.py">bagging.py</a> and <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">import orange, orngTree, orngEval, bagging
data = orange.ExampleTable("adult_sample")

tree = orngTree.TreeLearner(mForPruning=10, minExamples=30)
tree.name = "tree"
baggedTree = bagging.Learner(learner=tree, t=5)

learners = [tree, baggedTree]

results = orngEval.crossValidation(learners, data, folds=5)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]
</xmp>

<hr><br><p class="Path">
Prev: <a href="c_nb.htm">Naive Bayes in Python</a>,
Next: <a href="regression.htm">Regression</a>,
Up: <a href="c_pythonlearner.htm">Build Your Own Learner</a>
</p>

</body>
</html>