Build your own learner
======================

.. index::
   single: classifiers; in Python

This part of the tutorial shows how to build your own learners and
classifiers in Python. This is an important topic, especially for
those of you who want to test their own methods or combine existing
techniques in Orange. Developing your own learners in Python makes
prototyping of new methods fast and enjoyable.

There are different ways to build learners/classifiers in Python. We
will take the route that shows how to do this correctly, in the sense
that you will be able to use your learner just like any learner that
Orange itself provides. What is distinct about Orange learners is the
way they are invoked and what they return. Let us start with an
example. Say that we have a ``Learner()``, which is some learner in
Orange. The learner can be called in two different ways::

   learner = Learner()
   classifier = Learner(data)

In the first line, the learner is invoked without a data set, and in
that case it should return an instance of the learner, so that later
you may say ``classifier = learner(data)`` or pass the ``learner``
itself to some validation procedure (say
``orngEval.CrossValidation([learner], data)``). In the second line,
the learner is called with the data and returns a classifier.

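For instance, with the naive Bayesian learner that ships with Orange,
both ways of calling look like this (a minimal sketch, assuming the
voting data set that is also used in the example below)::

   import orange, orngEval
   data = orange.ExampleTable("voting")

   learner = orange.BayesLearner()      # no data: we get a learner object
   classifier = learner(data)           # calling the learner with data gives a classifier
   results = orngEval.CrossValidation([learner], data)  # the learner itself can be validated
   print "CA = %5.3f" % orngEval.CA(results)[0]
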
Classifiers should be called with a data instance to classify, and
should return either a class value (by default), the probabilities of
classes, or both::

   value = classifier(instance)
   value = classifier(instance, orange.GetValue)
   probabilities = classifier(instance, orange.GetProbabilities)
   value, probabilities = classifier(instance, orange.GetBoth)

Here is a short example::

   > python
   >>> import orange
   >>> data = orange.ExampleTable("voting")
   >>> learner = orange.BayesLearner()
   >>> classifier = learner(data)
   >>> classifier(data[0])
   republican
   >>> classifier(data[0], orange.GetBoth)
   (republican, [0.99999994039535522, 7.9730767765795463e-008])
   >>> classifier(data[0], orange.GetProbabilities)
   [0.99999994039535522, 7.9730767765795463e-008]
   >>>
   >>> c = orange.BayesLearner(data)
   >>> c(data[12])
   democrat
   >>>

We will here assume that our learner and the corresponding classifier
are defined in a single file (module) that does not contain any other
code. This helps with code reuse: if you want to use your new method
anywhere else, you just import it from that file. Each such module
will contain a learner class (``Learner`` or ``Learner_Class``) and a
class ``Classifier``. We will use this schema to define a learner
that uses a naive Bayesian classifier with embedded discretization of
the training data. Then we will show how to write a naive Bayesian
classifier in Python (that is, how to do this from scratch). We
conclude with a Python implementation of bagging.

.. _naive bayes with discretization:

Naive Bayes with discretization
-------------------------------

Let us build a learner/classifier that is an extension of the
built-in naive Bayes and that discretizes the data before learning.
We will define a module :download:`nbdisc.py <code/nbdisc.py>` that
implements two classes, ``Learner`` and ``Classifier``. The following
is the Python code for the ``Learner`` class (part of
:download:`nbdisc.py <code/nbdisc.py>`)::

   class Learner(object):
       def __new__(cls, examples=None, name='discretized bayes', **kwds):
           learner = object.__new__(cls)
           if examples:
               learner.__init__(name) # force init
               return learner(examples)
           else:
               return learner  # invokes the __init__

       def __init__(self, name='discretized bayes'):
           self.name = name

       def __call__(self, data, weight=None):
           disc = orange.Preprocessor_discretize( \
               data, method=orange.EntropyDiscretization())
           model = orange.BayesLearner(disc, weight)
           return Classifier(classifier = model)

``Learner`` has three methods. Method ``__new__`` creates the object
and returns either a learner or a classifier, depending on whether
examples were passed to the call. If examples were passed as an
argument, the method calls the learner it has just created (invoking
its ``__call__`` method). Method ``__init__`` is invoked whenever an
instance of the class is created. Notice that all it does is remember
the only argument this class can be called with, i.e. the argument
``name``, which defaults to 'discretized bayes'. If you expect any
other arguments for your learner, you should handle them here (store
them as instance attributes through ``self``), as sketched below.

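For example, a variant of this learner that also lets the caller
choose the discretization method could store the extra argument in
``__init__`` and use it in ``__call__``. This is a sketch of our own
(the ``discretization`` argument and the class name are not part of
nbdisc.py)::

   import orange

   class DiscretizingBayesLearner(object):
       def __new__(cls, examples=None, **kwds):
           learner = object.__new__(cls)
           if examples:
               learner.__init__(**kwds)   # force init, then learn at once
               return learner(examples)
           else:
               return learner

       def __init__(self, name='discretized bayes', discretization=None):
           self.name = name
           # store every argument on self so that __call__ can use it later
           self.discretization = discretization or orange.EntropyDiscretization()

       def __call__(self, data, weight=None):
           disc = orange.Preprocessor_discretize(data, method=self.discretization)
           return orange.BayesLearner(disc, weight)

Here the built-in Bayes classifier is returned directly; you could
equally wrap it in a ``Classifier`` as below.
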
If we have created an instance of the learner (and did not pass the
examples as an argument), calling the learner with data will invoke
the method ``__call__``, where the essence of our learner is
implemented. Notice also that we have included an argument for a
vector of instance weights, which is passed on to the naive Bayesian
learner. In our learner, we first discretize the data using Fayyad &
Irani's entropy-based discretization, then build a naive Bayesian
model and finally pass it to the class ``Classifier``. As you may
expect, at its initialization the ``Classifier`` will just remember
the model we have called it with (part of
:download:`nbdisc.py <code/nbdisc.py>`)::

   class Classifier:
       def __init__(self, **kwds):
           self.__dict__.update(kwds)

       def __call__(self, example, resultType = orange.GetValue):
           return self.classifier(example, resultType)

The method ``__init__`` in ``Classifier`` is rather general: it makes
``Classifier`` remember all arguments it was called with. They are
then accessible as the ``Classifier``'s attributes
(``self.argument_name``). When the ``Classifier`` is called, it
expects an example and an optional argument that specifies the type
of result to be returned.

This completes our code for a naive Bayesian classifier with
discretization. You can see that the code is fairly short (fewer than
20 lines), and it can be easily extended or changed if we want to do
something else as well (like feature subset selection, ...).

Here are now a few lines to test our code::

   >>> import orange, nbdisc
   >>> data = orange.ExampleTable("iris")
   >>> classifier = nbdisc.Learner(data)
   >>> print classifier(data[100])
   Iris-virginica
   >>> classifier(data[100], orange.GetBoth)
   (<orange.Value 'iris'='Iris-virginica'>, <0.000, 0.001, 0.999>)
   >>>

For a more elaborate test that also shows the use of a learner (that
is not given the data at its initialization), here is a script that
does 10-fold cross validation (:download:`nbdisc_test.py <code/nbdisc_test.py>`,
uses :download:`nbdisc.py <code/nbdisc.py>`)::

   import orange, orngEval, nbdisc
   data = orange.ExampleTable("iris")
   results = orngEval.CrossValidation([nbdisc.Learner()], data)
   print "Accuracy = %5.3f" % orngEval.CA(results)[0]

The accuracy on this data set is about 92%. You may try to obtain a
better accuracy by using some other type of discretization, or try
some other learner on this data (hint: k-NN should perform better).
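
Such a comparison could, for instance, look like this (a sketch that
reuses nbdisc.py and the built-in k-NN learner; the exact accuracies
depend on the discretization used)::

   import orange, orngEval, nbdisc
   data = orange.ExampleTable("iris")

   bayes = nbdisc.Learner()
   knn = orange.kNNLearner(k=10)
   knn.name = "knn"

   results = orngEval.CrossValidation([bayes, knn], data)
   for learner, ca in zip([bayes, knn], orngEval.CA(results)):
       print learner.name, ca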

Python implementation of naive Bayesian classifier
---------------------------------------------------

.. index::
   single: naive Bayesian classifier; in Python

The naive Bayesian classifier we will implement in this lesson uses
the standard naive Bayesian algorithm, also described in Mitchell:
Machine Learning, 1997 (pages 177-180). Essentially, if a data
instance is described with :math:`n` attribute values :math:`a_i`,
then it is classified to the class :math:`c` from the set of possible
classes :math:`V`, chosen according to the naive Bayes rule:

.. math::
   c=\arg\max_{v_j\in V} P(v_j)\prod_{i=1}^n P(a_i|v_j)

We will also compute a vector of elements:

.. math::
   p_j = P(v_j)\prod_{i=1}^n P(a_i|v_j)

which, after normalization such that :math:`\sum_j p_j` equals 1,
represents the class probabilities. The class probabilities (priors)
and the conditional probabilities in the above formulas are estimated
from training data: the class probability is equal to the relative
class frequency, while the conditional probability of an attribute
value given the class is computed as the proportion of instances with
the value of the :math:`i`-th attribute equal to :math:`a_i` among
the instances from class :math:`v_j`.

To complicate things just a little bit, the :math:`m`-estimate (see
Mitchell, and Cestnik IJCAI-1990) will be used instead of relative
frequency when computing the conditional probabilities. Following the
example in Mitchell, suppose we are assessing
:math:`P=P({\rm Wind}={\rm strong}|{\rm PlayTennis}={\rm no})` and
find that the total number of training examples with PlayTennis=no is
:math:`n=5`, of which :math:`n_c=3` have Wind=strong. Using relative
frequency, the corresponding probability would be:

.. math::
   P={n_c\over n}

Relative frequency has a problem when the number of instances is
small. To alleviate it, the m-estimate assumes that there are m
imaginary cases (m is also referred to as the equivalent sample size)
with prior probability p of the value in question. The conditional
probability computed with the m-estimate is then:

.. math::
   P={n_c+m p\over n+m}

Often, instead of the uniform probability :math:`p`, the relative
class frequency as estimated from the training data is used; this is
also what the learner below does.

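To make this concrete, let us work through the Mitchell example
above, assuming (for illustration only) an equivalent sample size of
:math:`m=2` and a uniform prior :math:`p=1/2` (Wind takes two
values):

.. math::
   P({\rm Wind}={\rm strong}|{\rm PlayTennis}={\rm no})={3+2\cdot 0.5\over 5+2}={4\over 7}\approx 0.57

compared to :math:`n_c/n=3/5=0.6` with plain relative frequency: with
few instances, the m-estimate pulls the estimate towards the prior.
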
We will develop a module called bayes.py that will implement our
naive Bayes learner and classifier. The structure of the module will
be the same as with `naive bayes with discretization`_. Again, we
will implement two classes, one for learning and the other for
classification. Here is the ``Learner_Class`` class (part of
:download:`bayes.py <code/bayes.py>`)::

   class Learner_Class:
     def __init__(self, m=0.0, name='std naive bayes', **kwds):
       self.__dict__.update(kwds)
       self.m = m
       self.name = name

     def __call__(self, examples, weight=None, **kwds):
       for k in kwds.keys():
         self.__dict__[k] = kwds[k]
       domain = examples.domain

       # first, compute class probabilities
       n_class = [0.] * len(domain.classVar.values)
       for e in examples:
         n_class[int(e.getclass())] += 1

       p_class = [0.] * len(domain.classVar.values)
       for i in range(len(domain.classVar.values)):
         p_class[i] = n_class[i] / len(examples)

       # count examples with specific attribute and
       # class value, pc[attribute][value][class]

       # initialization of pc
       pc = []
       for i in domain.attributes:
         p = [[0.] * len(domain.classVar.values) for j in range(len(i.values))]
         pc.append(p)

       # count instances, store them in pc
       for e in examples:
         c = int(e.getclass())
         for i in range(len(domain.attributes)):
           if not e[i].isSpecial():
             pc[i][int(e[i])][c] += 1.0

       # compute conditional probabilities
       for i in range(len(domain.attributes)):
         for j in range(len(domain.attributes[i].values)):
           for k in range(len(domain.classVar.values)):
             pc[i][j][k] = (pc[i][j][k] + self.m * p_class[k]) / \
               (n_class[k] + self.m)

       return Classifier(m=self.m, domain=domain, p_class=p_class, \
                p_cond=pc, name=self.name)

Initialization of ``Learner_Class`` saves the two attributes ``m``
and ``name`` of the classifier. Notice that both parameters are
optional, and that the default value for ``m`` is 0, making the
m-estimate equal to relative frequency unless the user specifies some
other value for ``m``. The method ``__call__`` is called with the
training data set, computes the class and conditional probabilities,
and constructs the classifier, passing it the probabilities along
with some other variables required for classification (part of
:download:`bayes.py <code/bayes.py>`)::

   class Classifier:
     def __init__(self, **kwds):
       self.__dict__.update(kwds)

     def __call__(self, example, result_type=orange.GetValue):
       # compute the class probabilities
       p = map(None, self.p_class)  # start from a copy of the class priors
       for c in range(len(self.domain.classVar.values)):
         for a in range(len(self.domain.attributes)):
           if not example[a].isSpecial():
             p[c] *= self.p_cond[a][int(example[a])][c]

       # normalize probabilities to sum to 1
       sum = 0.
       for pp in p: sum += pp
       if sum > 0:
         for i in range(len(p)): p[i] = p[i] / sum

       # find the class with highest probability
       v_index = p.index(max(p))
       v = orange.Value(self.domain.classVar, v_index)

       # return the value based on requested return type
       if result_type == orange.GetValue:
         return v
       if result_type == orange.GetProbabilities:
         return p
       return (v, p)

     def show(self):
       print 'm=', self.m
       print 'class prob=', self.p_class
       print 'cond prob=', self.p_cond

Upon initialization, the classifier stores the values of the
parameters it was called with (``__init__``). When called with a data
instance, it first computes the class probabilities using the priors
and conditional probabilities sent by the learner. The probabilities
are normalized to sum to 1. The class with the highest probability is
then found, and the classifier predicts it. Notice that we have also
added a method called ``show``, which reports m, the class
probabilities and the conditional probabilities::

   >>> import orange, bayes
   >>> data = orange.ExampleTable("voting")
   >>> classifier = bayes.Learner(data)
   >>> classifier.show()
   m= 0.0
   class prob= [0.38620689655172413, 0.61379310344827587]
   cond prob= [[[0.79761904761904767, 0.38202247191011235], ...]]
   >>>

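Note that ``bayes.Learner(data)`` in the session above returns a
classifier straight away, even though the module defines
``Learner_Class``. A small wrapper function along these lines (a
sketch of what the rest of :download:`bayes.py <code/bayes.py>`
presumably provides) makes that possible::

   def Learner(examples=None, **kwds):
       learner = Learner_Class(**kwds)
       if examples:
           return learner(examples)   # data given: learn and return a classifier
       else:
           return learner             # no data: return the learner itself
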
The following script tests our naive Bayes and compares it to
10-nearest neighbors. Running the script (do try it yourself) reports
classification accuracies of about 90% (:download:`bayes_test.py <code/bayes_test.py>`, uses
:download:`bayes.py <code/bayes.py>` and :download:`voting.tab <code/voting.tab>`)::

   import orange, orngEval, bayes
   data = orange.ExampleTable("voting")

   bayes = bayes.Learner(m=2, name='my bayes')
   knn = orange.kNNLearner(k=10)
   knn.name = "knn"

   learners = [knn, bayes]
   results = orngEval.CrossValidation(learners, data)
   for i in range(len(learners)):
       print learners[i].name, orngEval.CA(results)[i]

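Since ``m`` is just another argument of the learner, you can also
explore how it affects the accuracy (a sketch; the exact numbers will
vary)::

   import orange, orngEval, bayes
   data = orange.ExampleTable("voting")

   learners = [bayes.Learner(m=m, name='bayes m=%d' % m) for m in (0, 2, 10)]
   results = orngEval.CrossValidation(learners, data)
   for i in range(len(learners)):
       print learners[i].name, orngEval.CA(results)[i]
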
Bagging
-------

Here we show how to use the schema that allows us to build our own
learners/classifiers for bagging. While you can find bagging,
boosting, and other ensemble methods in the :py:mod:`Orange.ensemble`
module, we thought that explaining how to code bagging in Python
would make a nice example. The following pseudo-code (from
Witten & Frank: Data Mining) illustrates the main idea of bagging::

   MODEL GENERATION
   Let n be the number of instances in the training data.
   For each of t iterations:
      Sample n instances with replacement from training data.
      Apply the learning algorithm to the sample.
      Store the resulting model.

   CLASSIFICATION
   For each of the t models:
      Predict class of instance using model.
   Return class that has been predicted most often.

Following the above idea, our ``Learner_Class`` will need to build t
classifiers and pass them to ``Classifier``, which, when given a data
instance, will use them for classification. We will allow the
parameter t to be specified by the user, with 10 being the default.

The code for the ``Learner_Class`` is therefore (part of
:download:`bagging.py <code/bagging.py>`)::

   class Learner_Class:
       def __init__(self, learner, t=10, name='bagged classifier'):
           self.t = t
           self.name = name
           self.learner = learner

       def __call__(self, examples, weight=None):
           n = len(examples)
           classifiers = []
           for i in range(self.t):
               selection = []
               for j in range(n):
                   selection.append(random.randrange(n))
               data = examples.getitems(selection)
               classifiers.append(self.learner(data))

           return Classifier(classifiers = classifiers, \
               name=self.name, domain=examples.domain)

Upon invocation, ``__init__`` stores the base learner (the one that
will be bagged), the value of the parameter t, and the name of the
classifier. Note that while the learner requires the base learner to
be specified, the parameters t and name are optional.

When the learner is called with examples, a list of t classifiers is
built and stored in the variable ``classifiers``. Notice that for
sampling the data with replacement, a list of data instance indices
is built (``selection``) and then used to sample the data from the
training examples (``examples.getitems``). Finally, a ``Classifier``
is constructed with the list of classifiers, the name and the domain
information (part of :download:`bagging.py <code/bagging.py>`)::

   class Classifier:
       def __init__(self, **kwds):
           self.__dict__.update(kwds)

       def __call__(self, example, resultType = orange.GetValue):
           freq = [0.] * len(self.domain.classVar.values)
           for c in self.classifiers:
               freq[int(c(example))] += 1
           index = freq.index(max(freq))
           value = orange.Value(self.domain.classVar, index)
           for i in range(len(freq)):
               freq[i] = freq[i] / len(self.classifiers)
           if resultType == orange.GetValue: return value
           elif resultType == orange.GetProbabilities: return freq
           else: return (value, freq)

At initialization, ``Classifier`` stores all parameters it was
invoked with. When called with a data instance, a list ``freq`` is
initialized, with length equal to the number of classes, which
records how many of the models classify the instance to each class.
The class that the majority of models voted for is returned. While it
would be possible to return the class's index, or even its name,
classifiers in Orange by convention return a ``Value`` object
instead.

Notice that while bagging was originally not intended to compute
class probabilities, we compute these as the proportion of models
that voted for a certain class (this is probably not the best
estimate, but it suffices for our example and does no harm if only
class values, and not probabilities, are used).

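If you do care about the probability estimates, one alternative (a
sketch of our own, not part of bagging.py) is to average the
probability vectors returned by the base classifiers instead of
counting their votes; the ``__call__`` method of ``Classifier`` would
then read::

   def __call__(self, example, resultType = orange.GetValue):
       # average the class probability vectors of the base classifiers
       probs = [0.] * len(self.domain.classVar.values)
       for c in self.classifiers:
           p = c(example, orange.GetProbabilities)
           for i in range(len(probs)):
               probs[i] += p[i] / len(self.classifiers)
       value = orange.Value(self.domain.classVar, probs.index(max(probs)))
       if resultType == orange.GetValue: return value
       elif resultType == orange.GetProbabilities: return probs
       else: return (value, probs)
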
Here is the code that tests the bagging we have just implemented. It
compares a decision tree and its bagged variant. Run it yourself to
see which one is better (:download:`bagging_test.py <code/bagging_test.py>`)::

   import orange, orngTree, orngEval, bagging
   data = orange.ExampleTable("adult_sample")

   tree = orngTree.TreeLearner(mForPruning=10, minExamples=30)
   tree.name = "tree"
   baggedTree = bagging.Learner(learner=tree, t=5)

   learners = [tree, baggedTree]

   results = orngEval.CrossValidation(learners, data, folds=5)
   for i in range(len(learners)):
       print learners[i].name, orngEval.CA(results)[i]
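
Since the bagged learner accepts any Orange-style learner as its
base, you could just as well bag the discretized naive Bayes from the
beginning of this chapter (a sketch; accuracies will vary)::

   import orange, orngEval, bagging, nbdisc
   data = orange.ExampleTable("iris")

   nb = nbdisc.Learner()
   baggedNB = bagging.Learner(learner=nb, t=10)

   results = orngEval.CrossValidation([nb, baggedNB], data)
   for learner, ca in zip([nb, baggedNB], orngEval.CA(results)):
       print learner.name, ca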