source: orange/orange/doc/ofb/c_pythonlearner.htm @ 6538:a5f65d7f0b2c

<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="c_performance.htm">Testing and Evaluating</a>,
Next: <a href="c_nb_disc.htm">Naive Bayes with Discretization</a>,
Up: <a href="classification.htm">Classification</a>
</p>

<H1>Build Your Own Learner</H1>
<index name="classifiers/in Python">

<p>This part of the tutorial shows how to build your own learners and
classifiers in Python. This is an important topic, especially if you
want to test methods of your own or combine existing Orange
techniques. Developing your own learners in Python makes prototyping
of new methods fast and enjoyable.</p>

<p>There are different ways to build learners/classifiers in
Python. We will take the route that does this properly, in the sense
that you will be able to use your learner just like any learner that
Orange originally provides. What is characteristic of Orange learners
is how they are invoked and what they return. Let us start with an
example. Say that we have Learner(), some learner in Orange. The
learner can be called in two different ways:</p>

<xmp class="code">learner = Learner()
classifier = Learner(data)
</xmp>

<p>In the first line, the learner is invoked without a data set; in
that case it should return an instance of the learner, so that later
you may say <code>classifier = learner(data)</code> or pass the
<code>learner</code> itself to some validation procedure (say
<code>orngEval.CrossValidation([learner], data)</code>). In the second
line, the learner is called with the data and returns a classifier.</p>

<p>Classifiers should be called with a data instance to classify,
and should return either a class value (the default), the class
probabilities, or both:</p>

<xmp class="code">value = classifier(instance)
value = classifier(instance, orange.GetValue)
probabilities = classifier(instance, orange.GetProbabilities)
value, probabilities = classifier(instance, orange.GetBoth)
</xmp>

<p>Here is a short example:</p>

<pre class="code">
> <strong>python</strong>
>>> <strong>import orange</strong>
>>> <strong>data = orange.ExampleTable("voting")</strong>
>>> <strong>learner = orange.BayesLearner()</strong>
>>> <strong>classifier = learner(data)</strong>
>>> <strong>classifier(data[0])</strong>
republican
>>> <strong>classifier(data[0], orange.GetBoth)</strong>
(republican, [0.99999994039535522, 7.9730767765795463e-008])
>>> <strong>classifier(data[0], orange.GetProbabilities)</strong>
[0.99999994039535522, 7.9730767765795463e-008]
>>>
>>> <strong>c = orange.BayesLearner(data)</strong>
>>> <strong>c(data[12])</strong>
democrat
>>>
</pre>

<p>Throughout our examples, we will assume that our learner and the
corresponding classifier are defined in a single file (module) that
contains no other code. This helps with code reuse: if you want to use
your new method anywhere else, you simply import it from that
file. Each such module will contain a class
<code>Learner_Class</code> and a class <code>Classifier</code>. We
will use this schema to define a learner that uses the naive Bayesian
classifier with embedded categorization of the training data. Then we
will show how to write a naive Bayesian classifier in Python (that is,
how to do this from scratch). We conclude with a Python implementation
of bagging.</p>
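
<p>To make this schema concrete before we fill it in, here is a minimal
skeleton of such a module (the names follow the convention just
described; the method bodies are placeholders, not the actual learners
developed below):</p>

<xmp class="code">import orange

class Learner_Class:
    def __init__(self, name='my learner', **kwds):
        self.__dict__.update(kwds)
        self.name = name

    def __call__(self, examples, weight=None):
        # learn something from the examples here,
        # then wrap the result in a Classifier
        return Classifier(name=self.name, domain=examples.domain)

class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        # predict the class of example here
        pass
</xmp>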


<h2>Naive Bayes with Discretization</h2>

<p>Let us build a learner/classifier that extends the built-in naive
Bayes and categorizes (discretizes) the data before learning (see also
the lesson on <a href="o_categorization.htm">Categorization</a>). We
will define a module <a href="nbdisc.py">nbdisc.py</a> that implements
two classes, Learner and Classifier. Following is the Python code for
the Learner class:</p>

<p class="header">class <code>Learner</code> from <a href=
"nbdisc.py">nbdisc.py</a></p>
<xmp class="code">class Learner(object):
    def __new__(cls, examples=None, name='discretized bayes', **kwds):
        learner = object.__new__(cls, **kwds)
        if examples:
            learner.__init__(name) # force init
            return learner(examples)
        else:
            return learner  # invokes the __init__

    def __init__(self, name='discretized bayes'):
        self.name = name

    def __call__(self, data, weight=None):
        disc = orange.Preprocessor_discretize( \
            data, method=orange.EntropyDiscretization())
        model = orange.BayesLearner(disc, weight)
        return Classifier(classifier = model)
</xmp>

<p>The Learner class has three methods. The method <code>__new__</code>
creates the object and returns either a learner or a classifier,
depending on whether examples were passed to the call. If the examples
were passed as an argument, the method calls the freshly created
learner (invoking its <code>__call__</code> method). The method
<code>__init__</code> is invoked when the object is
constructed. Notice that all it does is remember the only argument
that this class can be called with, i.e. the argument
<code>name</code>, which defaults to
&lsquo;discretized bayes&rsquo;. If your learner expects any other
arguments, you should handle them here (store them as instance
attributes through <code>self</code>).</p>

<p>If we have created an instance of the learner (and did not pass the
examples as an argument), the next call of this learner will invoke the
method <code>__call__</code>, where the essence of our learner is
implemented. Notice also that we have included an argument for a vector
of instance weights, which is passed on to the naive Bayesian
learner. In our learner, we first discretize the data using Fayyad
&amp; Irani&rsquo;s entropy-based discretization, then build a naive
Bayesian model and finally pass it to the class
<code>Classifier</code>. You may expect that at its first invocation
the <code>Classifier</code> will just remember the model we have
called it with:</p>

<p class="header">class Classifier from <a href=
"nbdisc.py">nbdisc.py</a></p>
<xmp class="code">class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType = orange.GetValue):
        return self.classifier(example, resultType)
</xmp>


<p>The method <code>__init__</code> in <code>Classifier</code> is
rather general: it makes <code>Classifier</code> remember all keyword
arguments it was called with. They are then accessible as the
<code>Classifier</code>&rsquo;s attributes
(<code>self.argument_name</code>). When the Classifier is called, it
expects an example and an optional argument that specifies the type of
result to be returned.</p>
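
<p>As a small illustration of this idiom: since the learner&rsquo;s
<code>__call__</code> above creates the classifier with
<code>Classifier(classifier = model)</code>, the stored model is later
available as <code>self.classifier</code>; any other keyword argument
(the <code>note</code> below is made up purely for illustration) would
be stored in the same way:</p>

<xmp class="code"># continuing from Learner.__call__, where model is the naive Bayesian model
c = Classifier(classifier=model, note='anything goes')
c.classifier   # the naive Bayesian model passed by the learner
c.note         # 'anything goes'
</xmp>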

<p>This completes our code for the naive Bayesian classifier with
discretization. You can see that the code is fairly short (fewer than
20 lines), and it can easily be extended or changed if we want to do
something else as well (like feature subset selection, or a different
discretization; see the sketch below).</p>
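
<p>For instance, a variant that lets the caller choose the
discretization method requires only a small change to the learner (a
sketch, not part of the original nbdisc.py; by default it behaves
exactly like the learner above, while something like
<code>orange.EquiNDiscretization(numberOfIntervals=5)</code> from the
Categorization lesson could be passed instead):</p>

<xmp class="code">class Learner(object):
    def __new__(cls, examples=None, **kwds):
        learner = object.__new__(cls)
        if examples:
            learner.__init__(**kwds)   # force init, then learn immediately
            return learner(examples)
        else:
            return learner             # __init__ is invoked by Python

    def __init__(self, name='discretized bayes', discretization=None):
        self.name = name
        # entropy-based discretization unless the caller chose otherwise
        self.discretization = discretization or orange.EntropyDiscretization()

    def __call__(self, data, weight=None):
        disc = orange.Preprocessor_discretize(data, method=self.discretization)
        model = orange.BayesLearner(disc, weight)
        return Classifier(classifier = model)
</xmp>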

<p>Here are now a few lines to test our code:</p>

<p class="header">uses <a href="iris.tab">iris.tab</a> and <a href=
"nbdisc.py">nbdisc.py</a></p>
<pre class="code">
> <code>python</code>
>>> <code>import orange, nbdisc</code>
>>> <code>data = orange.ExampleTable("iris")</code>
>>> <code>classifier = nbdisc.Learner(data)</code>
>>> <code>print classifier(data[100])</code>
Iris-virginica
>>> <code>classifier(data[100], orange.GetBoth)</code>
(<orange.Value 'iris'='Iris-virginica'>, <0.000, 0.001, 0.999>)
>>>
</pre>

<p>For a more elaborate test that also shows the use of a learner
(one that is not given the data at its initialization), here is a
script that runs 10-fold cross-validation:</p>

<p class="header">
<a href=
"nbdisc_test.py">nbdisc_test.py</a>
(uses <a href="iris.tab">iris.tab</a> and
<a href="nbdisc.py">nbdisc.py</a>)</p>
<xmp class="code">import orange, orngEval, nbdisc
data = orange.ExampleTable("iris")
results = orngEval.CrossValidation([nbdisc.Learner()], data)
print "Accuracy = %5.3f" % orngEval.CA(results)[0]
</xmp>

<p>The accuracy on this data set is about 92%. You may try to obtain a
better accuracy by using some other type of discretization, or try
some other learner on this data (hint: k-NN should perform
better).</p>
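
<p>For example, the hinted comparison with k-NN takes only a few more
lines (a sketch in the spirit of nbdisc_test.py; the learner names are
set only for the printout):</p>

<xmp class="code">import orange, orngEval, nbdisc
data = orange.ExampleTable("iris")

nb = nbdisc.Learner()
nb.name = "nb+disc"
knn = orange.kNNLearner(k=10)
knn.name = "knn"

learners = [nb, knn]
results = orngEval.CrossValidation(learners, data)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]
</xmp>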



<h2>Python Implementation of Naive Bayesian Classifier</h2>
<index name="naive Bayesian classifier (in Python)">

<p>The naive Bayesian classifier we will implement in this lesson uses
the standard naive Bayesian algorithm, as described in Mitchell:
Machine Learning, 1997 (pages 177-180). Essentially, if an instance is
described with n attributes a<sub>i</sub> (i from 1 to n), then the
class v to which the instance is assigned, out of the set of possible
classes V, is according to the naive Bayesian classifier:</p>

<center><img src="f1.gif" alt="formula for v"></center>

<p>We will also compute a vector of elements</p>

<center><img src="f2.gif" alt="formula for pj"></center>

<p>which, after normalization so that the sum of p<sub>j</sub> equals
1, represent the class probabilities. The class probabilities (priors)
and conditional probabilities in the above formulas are estimated from
the training data: the class probability is equal to the relative
class frequency, while the conditional probability of an attribute
value given the class is computed as the proportion of instances with
the value of the i-th attribute equal to a<sub>i</sub> among the
instances from class v<sub>j</sub>.</p>

<p>To complicate things just a little bit, the m-estimate (see
Mitchell, and Cestnik IJCAI-1990) will be used instead of relative
frequency when computing the conditional probabilities. So (following
the example in Mitchell), when assessing
P=P(Wind=strong|PlayTennis=no), we find that the total number of
training examples with PlayTennis=no is n=5, and of these there are
nc=3 for which Wind=strong; using relative frequency, the
corresponding probability would be</p>

<center><img src="f3.gif" alt="formula for P"></center>

<p>Relative frequency has a problem when the number of instances is
small; to alleviate this, the m-estimate assumes that there are m
imaginary cases (m is also referred to as the equivalent sample size),
distributed according to a prior probability p. Our conditional
probability using the m-estimate is then computed as</p>

<center><img src="f4.gif" alt="formula for Pm"></center>

<p>Often, instead of a uniform class probability p, the relative class
frequency estimated from the training data is used.</p>
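
<p>As a quick numeric check of the two estimates, using Mitchell&rsquo;s
numbers from above and picking m=2 with a uniform p=0.5 purely for
illustration:</p>

<xmp class="code">nc, n = 3., 5.   # Wind=strong among the PlayTennis=no examples
m, p = 2., 0.5   # equivalent sample size and prior (illustrative values)

print nc / n                 # relative frequency: 0.6
print (nc + m*p) / (n + m)   # m-estimate: 4/7, approximately 0.571
</xmp>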

<p>We will develop a module called bayes.py that implements our naive
Bayes learner and classifier. The structure of the module will be the
same as in the <a href="c_nb_disc.htm">previous example</a>. Again, we
will implement two classes, one for learning and the other one for
classification. Here is the <code>Learner_Class</code>:</p>

<p class="header">class Learner_Class from <a href=
"bayes.py">bayes.py</a></p>
<xmp class="code">class Learner_Class:
  def __init__(self, m=0.0, name='std naive bayes', **kwds):
    self.__dict__.update(kwds)
    self.m = m
    self.name = name

  def __call__(self, examples, weight=None, **kwds):
    for k in kwds.keys():
      self.__dict__[k] = kwds[k]
    domain = examples.domain

    # first, compute class probabilities
    n_class = [0.] * len(domain.classVar.values)
    for e in examples:
      n_class[int(e.getclass())] += 1

    p_class = [0.] * len(domain.classVar.values)
    for i in range(len(domain.classVar.values)):
      p_class[i] = n_class[i] / len(examples)

    # count examples with specific attribute and
    # class value, pc[attribute][value][class]

    # initialization of pc
    pc = []
    for i in domain.attributes:
      p = [[0.]*len(domain.classVar.values) for j in range(len(i.values))]
      pc.append(p)

    # count instances, store them in pc
    for e in examples:
      c = int(e.getclass())
      for i in range(len(domain.attributes)):
        if not e[i].isSpecial():
          pc[i][int(e[i])][c] += 1.0

    # compute conditional probabilities
    for i in range(len(domain.attributes)):
      for j in range(len(domain.attributes[i].values)):
        for k in range(len(domain.classVar.values)):
          pc[i][j][k] = (pc[i][j][k] + self.m * p_class[k]) / \
            (n_class[k] + self.m)

    return Classifier(m = self.m, domain=domain, p_class=p_class, \
             p_cond=pc, name=self.name)
</xmp>

<p>Initialization of Learner_Class saves two attributes, m and the
name of the classifier. Notice that both parameters are optional, and
the default value for m is 0, making the naive Bayes m-estimate equal
to the relative frequency unless the user specifies some other value
for m. The function <code>__call__</code> is called with the training
data set; it computes the class and conditional probabilities and
calls <code>Classifier</code>, passing the probabilities along with
some other variables required for classification.</p>

<p class="header">class Classifier from <a href="bayes.py">bayes.py</a></p>
<xmp class="code">class Classifier:
  def __init__(self, **kwds):
    self.__dict__.update(kwds)

  def __call__(self, example, result_type=orange.GetValue):
    # compute the class probabilities
    p = map(None, self.p_class)
    for c in range(len(self.domain.classVar.values)):
      for a in range(len(self.domain.attributes)):
        if not example[a].isSpecial():
          p[c] *= self.p_cond[a][int(example[a])][c]

    # normalize probabilities to sum to 1
    sum = 0.
    for pp in p: sum += pp
    if sum > 0:
      for i in range(len(p)): p[i] = p[i]/sum

    # find the class with highest probability
    v_index = p.index(max(p))
    v = orange.Value(self.domain.classVar, v_index)

    # return the value based on requested return type
    if result_type == orange.GetValue:
      return v
    if result_type == orange.GetProbabilities:
      return p
    return (v, p)

  def show(self):
    print 'm=', self.m
    print 'class prob=', self.p_class
    print 'cond prob=', self.p_cond
</xmp>


<p>Upon first invocation, the classifier stores the values of the
parameters it was called with (<code>__init__</code>). When called
with a data instance, it first computes the class probabilities using
the prior and conditional probabilities sent by the learner. The
probabilities are normalized to sum to 1. The class with the highest
probability is then found, and the classifier predicts this class
accordingly. Notice that we have also added a method called
<code>show</code>, which reports on m, the class probabilities and the
conditional probabilities:</p>

<p class="header">uses <a href="voting.tab">voting.tab</a></p>
<pre class="code">
> <strong>python</strong>
>>> <strong>import orange, bayes</strong>
>>> <strong>data = orange.ExampleTable("voting")</strong>
>>> <strong>classifier = bayes.Learner(data)</strong>
>>> <strong>classifier.show()</strong>
m= 0.0
class prob= [0.38620689655172413, 0.61379310344827587]
cond prob= [[[0.79761904761904767, 0.38202247191011235], ...]]
>>>
</pre>
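
<p>Note that the session above and the test script below call
<code>bayes.Learner</code> rather than <code>bayes.Learner_Class</code>.
The excerpt above does not show it, but bayes.py presumably also
defines a small convenience function along these lines (a sketch,
mirroring the behaviour of nbdisc&rsquo;s <code>__new__</code>; the
actual module may differ):</p>

<xmp class="code">def Learner(examples=None, **kwds):
    learner = Learner_Class(**kwds)
    if examples:
        return learner(examples)   # learn immediately and return a Classifier
    else:
        return learner             # return the learner itself
</xmp>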


<p>The following script tests our naive Bayes and compares it to
10-nearest neighbors. Running the script (do try it yourself) reports
classification accuracies of just about 90% (somehow, on this data
set, kNN does a little better).</p>

<p class="header"><a href="bayes_test.py">bayes_test.py</a> (uses <a href="bayes.py">bayes.py</a> and <a href="voting.tab">voting.tab</a>)</p>
<xmp class="code">import orange, orngEval, bayes
data = orange.ExampleTable("voting")

bayes = bayes.Learner(m=2, name='my bayes')
knn = orange.kNNLearner(k=10)
knn.name = "knn"

learners = [knn,bayes]
results = orngEval.CrossValidation(learners, data)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]
</xmp>



<h2>Bagging</h2>

<p>Here we show how to use the schema that allows us to build our own
learners/classifiers for bagging. While you can find bagging,
boosting, and other ensemble methods in the <a
href="../modules/orngEnsemble.htm">orngEnsemble</a> module, we thought
that explaining how to code bagging in Python would make a nice
example. The following pseudo-code (from
Witten &amp; Frank: Data Mining) illustrates the main idea of bagging:</p>

<xmp class="code">MODEL GENERATION
Let n be the number of instances in the training data.
For each of t iterations:
   Sample n instances with replacement from training data.
   Apply the learning algorithm to the sample.
   Store the resulting model.

CLASSIFICATION
For each of the t models:
   Predict class of instance using model.
Return class that has been predicted most often.
</xmp>


<p>Following the above idea, our <code>Learner_Class</code> will need
to build t classifiers and pass them to <code>Classifier</code>,
which, once given a data instance, will use them for
classification. We will allow the parameter t to be specified by the
user, with 10 being the default.</p>

<p>The code for the <code>Learner_Class</code> is therefore:</p>

<p class="header">class <code>Learner_Class</code> from <a href=
"bagging.py">bagging.py</a></p>
<xmp class="code">class Learner_Class:
    def __init__(self, learner, t=10, name='bagged classifier'):
        self.t = t
        self.name = name
        self.learner = learner

    def __call__(self, examples, weight=None):
        n = len(examples)
        classifiers = []
        for i in range(self.t):
            selection = []
            for j in range(n):
                selection.append(random.randrange(n))
            data = examples.getitems(selection)
            classifiers.append(self.learner(data))

        return Classifier(classifiers = classifiers, \
            name=self.name, domain=examples.domain)
</xmp>

<p>Upon invocation, <code>__init__</code> stores the base learner (the
one that will be bagged), the value of the parameter t, and the name
of the classifier. Note that while the base learner has to be
specified, the parameters t and name are optional.</p>

<p>When the learner is called with examples, a list of t classifiers
is built and stored in the variable <code>classifiers</code>. Notice
that for sampling the data with replacement, a list of data instance
indices is built (<code>selection</code>) and then used to sample the
data from the training examples (<code>examples.getitems</code>).
Finally, a <code>Classifier</code> is called with the list of
classifiers, the name and the domain information.</p>

<p class="header">class <code>Classifier</code> from <a href=
"bagging.py">bagging.py</a></p>
<xmp class="code">class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType = orange.GetValue):
        freq = [0.] * len(self.domain.classVar.values)
        for c in self.classifiers:
            freq[int(c(example))] += 1
        index = freq.index(max(freq))
        value = orange.Value(self.domain.classVar, index)
        for i in range(len(freq)):
            freq[i] = freq[i]/len(self.classifiers)
        if resultType == orange.GetValue: return value
        elif resultType == orange.GetProbabilities: return freq
        else: return (value, freq)
</xmp>


<p>At initialization, <code>Classifier</code> stores all parameters it
was invoked with. When called with a data instance, a list
<code>freq</code> is initialized whose length equals the number of
classes and which records the number of models that classify the
instance to each specific class. The class that the majority of models
voted for is returned. While it would be possible to return the class
index, or even its name, by convention classifiers in Orange return a
<code>Value</code> object instead.</p>

<p>Notice that although bagging was not originally intended to compute
class probabilities, we compute these as the proportion of models that
voted for a certain class (this is probably not well founded
statistically, but it suffices for our example and does no harm if
only class values and not probabilities are used).</p>
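
<p>For example (an illustrative session, assuming the data set used by
bagging_test.py below), the vote proportions can be obtained with
<code>orange.GetProbabilities</code>:</p>

<xmp class="code">import orange, orngTree, bagging
data = orange.ExampleTable("adult_sample")
baggedClassifier = bagging.Learner_Class(learner=orngTree.TreeLearner(), t=5)(data)

# one entry per class value; each is the fraction of the 5 trees voting for it
print baggedClassifier(data[0], orange.GetProbabilities)
</xmp>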

<p>Here is the code that tests the bagging we have just
implemented. It compares a decision tree and its bagged variant. Run
it yourself to see which one does better!</p>

<p class="header"><a href="bagging_test.py">bagging_test.py</a> (uses <a
href="bagging.py">bagging.py</a> and <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">import orange, orngTree, orngEval, bagging
data = orange.ExampleTable("adult_sample")

tree = orngTree.TreeLearner(mForPruning=10, minExamples=30)
tree.name = "tree"
baggedTree = bagging.Learner(learner=tree, t=5)

learners = [tree, baggedTree]

results = orngEval.CrossValidation(learners, data, folds=5)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]
</xmp>




<hr><br><p class="Path">
Prev: <a href="c_performance.htm">Testing and Evaluating</a>,
Next: <a href="regression.htm">Regression</a>,
Up: <a href="classification.htm">Classification</a>
</p>

</body>
</html>