Testing and evaluating your classifiers
========================================

.. index::
   single: classifiers; accuracy of

In this lesson you will learn how to estimate the accuracy of
classifiers. The simplest way to do this is to use Orange's
:py:mod:`Orange.evaluation.testing` and :py:mod:`Orange.statistics`
modules. This is probably how you will perform evaluation in your
scripts, and thus we start with examples that use these two
modules. You may as well perform testing and scoring on your own, so
we further provide several example scripts to compute classification
accuracy, measure it on a list of classifiers, do cross-validation,
leave-one-out and random sampling. While all of this functionality is
available in the :py:mod:`Orange.evaluation.testing` and
:py:mod:`Orange.statistics` modules, these example scripts may still
be useful for those who want to learn more about Orange's
learner/classifier objects and the way to use them in combination
with data sampling.

.. index:: cross validation

Orange's classes for performance evaluation
--------------------------------------------

Below is a script that takes a list of learners (naive Bayesian
classifier and classification tree) and scores their predictive
performance on a single data set using ten-fold cross validation. The
script reports on four different scores: classification accuracy,
information score, Brier score and area under ROC curve
(:download:`accuracy7.py <code/accuracy7.py>`)::

   import orange, orngTest, orngStat, orngTree

   # set up the learners
   bayes = orange.BayesLearner()
   tree = orngTree.TreeLearner(mForPruning=2)
   bayes.name = "bayes"
   tree.name = "tree"
   learners = [bayes, tree]

   # compute accuracies on data
   data = orange.ExampleTable("voting")
   results = orngTest.crossValidation(learners, data, folds=10)

   # output the results
   print "Learner  CA     IS     Brier    AUC"
   for i in range(len(learners)):
       print "%-8s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name, \
           orngStat.CA(results)[i], orngStat.IS(results)[i],
           orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])

The output of this script is::

   Learner  CA     IS     Brier    AUC
   bayes    0.901  0.758  0.176  0.976
   tree     0.961  0.845  0.075  0.956

The call to ``orngTest.crossValidation`` does the hard work. The
function returns the object stored in ``results``, which essentially
stores the probabilities and class values of the instances that were
used as test cases. Based on ``results``, the classification
accuracy, information score, Brier score and area under ROC curve
(AUC) for each of the learners are computed (functions ``CA``,
``IS``, ``BrierScore`` and ``AUC``).

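To get a feeling for what ``results`` holds, we can inspect it
directly. Below is a minimal sketch; it assumes the attribute names
``results``, ``actualClass``, ``classes`` and ``probabilities`` of
the tested-example objects built by ``orngTest``, which may differ
slightly between Orange versions::

   import orange, orngTest

   data = orange.ExampleTable("voting")
   res = orngTest.crossValidation([orange.BayesLearner()], data, folds=10)

   # each tested example stores its actual class, plus the class and the
   # class probabilities predicted by every learner
   first = res.results[0]
   print "actual class:       ", first.actualClass
   print "predicted class:    ", first.classes[0]
   print "class probabilities:", first.probabilities[0]
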
Apart from the statistics mentioned above, :py:mod:`Orange.statistics`
has built-in functions that can compute other performance metrics, and
:py:mod:`Orange.evaluation.testing` includes other testing schemas. If
you need to test your learners with standard statistics, these are
probably all you need. Compared to the script above, below we show the
use of some other statistics, with perhaps more modular code than
above (part of :download:`accuracy8.py <code/accuracy8.py>`)::

   data = orange.ExampleTable("voting")
   res = orngTest.crossValidation(learners, data, folds=10)
   cm = orngStat.computeConfusionMatrices(res,
           classIndex=data.domain.classVar.values.index('democrat'))

   stat = (('CA', 'CA(res)'),
           ('Sens', 'sens(cm)'),
           ('Spec', 'spec(cm)'),
           ('AUC', 'AUC(res)'),
           ('IS', 'IS(res)'),
           ('Brier', 'BrierScore(res)'),
           ('F1', 'F1(cm)'),
           ('F2', 'Falpha(cm, alpha=2.0)'))

   scores = [eval("orngStat."+s[1]) for s in stat]
   print "Learner  " + "".join(["%-7s" % s[0] for s in stat])
   for (i, l) in enumerate(learners):
       print "%-8s " % l.name + "".join(["%5.3f  " % s[i] for s in scores])

For a number of scoring measures we needed to compute the confusion
matrix, for which we also had to specify the target class (democrat,
in our case). This script produces output similar to that of the
previous one::

   Learner  CA     Sens   Spec   AUC    IS     Brier  F1     F2
   bayes    0.901  0.891  0.917  0.976  0.758  0.176  0.917  0.908
   tree     0.961  0.974  0.940  0.956  0.845  0.075  0.968  0.970

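Besides ``crossValidation``, the :py:mod:`Orange.evaluation.testing`
module (the old ``orngTest``) offers other sampling schemas, such as
repeated train-test splits and leave-one-out. A minimal sketch is
below; we assume the ``proportionTest`` and ``leaveOneOut`` functions
with the signatures shown, which may differ slightly in your version
of Orange::

   import orange, orngTest, orngStat

   data = orange.ExampleTable("voting")
   learners = [orange.BayesLearner()]

   # ten repetitions of a 70:30 train-test split
   res1 = orngTest.proportionTest(learners, data, 0.7, times=10)
   # leave-one-out testing
   res2 = orngTest.leaveOneOut(learners, data)

   print "70:30 sampling CA: %5.3f" % orngStat.CA(res1)[0]
   print "leave-one-out CA:  %5.3f" % orngStat.CA(res2)[0]
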
Do it on your own: a warm-up
----------------------------

Let us continue our exploration of the voting data set: build a naive
Bayesian classifier from it, and compute the classification accuracy
on the same data set (:download:`accuracy.py <code/accuracy.py>`, uses
:download:`voting.tab <code/voting.tab>`)::

   import orange
   data = orange.ExampleTable("voting")
   classifier = orange.BayesLearner(data)

   # compute classification accuracy
   correct = 0.0
   for ex in data:
       if classifier(ex) == ex.getclass():
           correct += 1
   print "Classification accuracy:", correct/len(data)

To compute classification accuracy, the script examines every data
item and checks how many times it has been classified
correctly. Running this script shows that the accuracy is just above
90%.

.. warning::
   Training and testing on the same data set is not something we
   should do, as good performance scores may be simply due to
   overfitting. We use this type of testing here for code
   demonstration purposes only.

Let us extend the code with a function that is given a data set and a
set of classifiers (e.g., ``accuracy(test_data, classifiers)``) and
computes the classification accuracy of each classifier. Using this
function, let us compare naive Bayes and classification trees
(:download:`accuracy2.py <code/accuracy2.py>`)::

   import orange, orngTree

   def accuracy(test_data, classifiers):
       correct = [0.0]*len(classifiers)
       for ex in test_data:
           for i in range(len(classifiers)):
               if classifiers[i](ex) == ex.getclass():
                   correct[i] += 1
       for i in range(len(correct)):
           correct[i] = correct[i] / len(test_data)
       return correct

   # set up the classifiers
   data = orange.ExampleTable("voting")
   bayes = orange.BayesLearner(data)
   bayes.name = "bayes"
   tree = orngTree.TreeLearner(data)
   tree.name = "tree"
   classifiers = [bayes, tree]

   # compute accuracies
   acc = accuracy(data, classifiers)
   print "Classification accuracies:"
   for i in range(len(classifiers)):
       print classifiers[i].name, acc[i]

This is the first time in our tutorial that we define a function. You
may see that this is quite simple in Python; functions are introduced
with the keyword ``def``, followed by the function's name and list of
arguments. Do not forget the colon at the end of the definition
line. Other than that, there is nothing new in this code. A mild
exception is the expression ``classifiers[i](ex)``, but intuition
tells us that the i-th classifier is being called as a function, with
the example to classify as its argument. So, finally, which method
does better? Here is the output::

   Classification accuracies:
   bayes 0.903448275862
   tree 0.997701149425

It looks like the classification tree is much more accurate here.
But beware of overfitting (especially unpruned classification
trees are prone to it) and read on!

Training and test set
---------------------

In machine learning, one should not learn and test classifiers on the
same data set. For this reason, let us split our data in half, and use
the first half of the data for training and the rest for testing. The
script is similar to the one above; the part which differs is shown
below (part of :download:`accuracy3.py <code/accuracy3.py>`)::

   # set up the classifiers
   data = orange.ExampleTable("voting")
   selection = orange.MakeRandomIndices2(data, 0.5)
   train_data = data.select(selection, 0)
   test_data = data.select(selection, 1)

   bayes = orange.BayesLearner(train_data)
   tree = orngTree.TreeLearner(train_data)

Orange's function ``MakeRandomIndices2`` takes the data and generates
a vector of length equal to the number of data instances. Elements of
the vector are either 0 or 1, and the probability of an element being
0 is 0.5 (or whatever we specify in the argument of the function). The
i-th instance of the data then goes either to the training set (if
selection[i]==0) or to the test set (if selection[i]==1). Notice that
``MakeRandomIndices2`` makes sure that this split is stratified, i.e.,
the class distribution in the training and test set is approximately
equal (you may use the attribute ``stratified=0`` if you do not want
stratification).

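The selection vector and the effect of stratification are easy to
inspect. The following is a minimal sketch using the same calls as
above; converting the index vector to a plain Python list is only a
convenience for printing and counting::

   import orange

   data = orange.ExampleTable("voting")
   selection = orange.MakeRandomIndices2(data, 0.5)

   sel = list(selection)
   print sel[:10]                        # e.g. [0, 1, 1, 0, ...]
   print "training instances:", sel.count(0)
   print "test instances:    ", sel.count(1)

   # stratification keeps the class distributions of the two halves similar
   train_data = data.select(selection, 0)
   test_data = data.select(selection, 1)
   print orange.Distribution(data.domain.classVar, train_data)
   print orange.Distribution(data.domain.classVar, test_data)
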
The output of this testing is::

   Classification accuracies:
   bayes 0.93119266055
   tree 0.802752293578

Here, the accuracy of naive Bayes is much higher. But be warned: the
result is inconclusive, since it depends on a single random split of
the data.

70-30 random sampling
---------------------

Above, we have used the function ``accuracy(data, classifiers)`` that
took a data set and a set of classifiers and measured the
classification accuracy of the classifiers on the data. Remember, the
classifiers were models that had already been constructed (they had
already *seen* the learning data), so the data passed to ``accuracy``
in fact served as a test set. Now, let us write another function that
is given a set of learners and a data set, repeatedly splits the data
into, say, 70% and 30%, uses the first part (70%) to learn a model and
obtain a classifier, and then, using the accuracy function developed
above, tests the classifier on the remaining data (30%).

A learner in Orange is an object that encodes a specific machine
learning algorithm, and is ready to accept data and construct and
return a predictive model. We have met quite a number of learners so
far (but we did not call them this way): ``orange.BayesLearner()``,
``orange.kNNLearner()``, and others. If we use Python to simply
construct a learner, say with::

   learner = orange.BayesLearner()

then ``learner`` becomes an instance of ``orange.BayesLearner`` and
is ready to accept data and return a classifier. For instance, in our
lessons so far we have used::

   classifier = orange.BayesLearner(data)

and we could equally use::

   learner = orange.BayesLearner()
   classifier = learner(data)

So why complicate things with learners? Well, in the task we are about
to tackle, we will repeatedly do learning and testing. If we want to
build a reusable function that takes a set of machine learning
algorithms as input and reports on their performance as output, we can
do this only through the use of learners (remember, classifiers have
already seen the data and cannot be re-learned).

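The distinction is easy to see in code. Below is a minimal sketch (the
particular split is just for illustration): the same learner object is
invoked on two different subsets and returns a fresh classifier each
time::

   import orange

   data = orange.ExampleTable("voting")
   learner = orange.BayesLearner()        # no data yet: this is a learner

   selection = orange.MakeRandomIndices2(data, 0.5)
   classifier_a = learner(data.select(selection, 0))   # model from one half
   classifier_b = learner(data.select(selection, 1))   # model from the other

   # both classifiers can now be used independently
   print classifier_a(data[0]), classifier_b(data[0])
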
Our script, shown without the ``accuracy`` function (which is exactly
like the one we defined in :download:`accuracy2.py
<code/accuracy2.py>`), is as follows (part of :download:`accuracy4.py
<code/accuracy4.py>`)::

   def test_rnd_sampling(data, learners, p=0.7, n=10):
       acc = [0.0]*len(learners)
       for i in range(n):
           selection = orange.MakeRandomIndices2(data, p)
           train_data = data.select(selection, 0)
           test_data = data.select(selection, 1)
           classifiers = []
           for l in learners:
               classifiers.append(l(train_data))
           acc1 = accuracy(test_data, classifiers)
           print "%d: %s" % (i+1, acc1)
           for j in range(len(learners)):
               acc[j] += acc1[j]
       for j in range(len(learners)):
           acc[j] = acc[j]/n
       return acc

   # set up the learners
   bayes = orange.BayesLearner()
   tree = orngTree.TreeLearner()
   bayes.name = "bayes"
   tree.name = "tree"
   learners = [bayes, tree]

   # compute accuracies on data
   data = orange.ExampleTable("voting")
   acc = test_rnd_sampling(data, learners)
   print "Classification accuracies:"
   for i in range(len(learners)):
       print learners[i].name, acc[i]

Essential to the above script is the function ``test_rnd_sampling``,
which takes the data and a list of learners, and returns their
accuracy estimated through repetitive sampling. The additional (and
optional) parameter ``p`` tells what proportion of the data is used
for learning, and the parameter ``n`` specifies how many times to
repeat the learn-and-test procedure. Note that in the code above,
``test_rnd_sampling`` was called without these two parameters, so
their default values were used (0.7 and 10, respectively). You may try
to change the code and instead use ``test_rnd_sampling(data, learners,
n=100, p=0.5)``, or experiment in other ways. There is also a print
statement in ``test_rnd_sampling`` that reports on the accuracies of
the individual runs (just to see that the code really works); it
should probably be removed if you do not want a long printout when
testing with a large ``n``. Depending on the random seed set on your
machine, the output of this script should be something like::

   1: [0.9007633587786259, 0.79389312977099236]
   2: [0.9007633587786259, 0.79389312977099236]
   3: [0.95419847328244278, 0.92366412213740456]
   4: [0.87786259541984735, 0.86259541984732824]
   5: [0.86259541984732824, 0.80152671755725191]
   6: [0.87022900763358779, 0.80916030534351147]
   7: [0.87786259541984735, 0.82442748091603058]
   8: [0.92366412213740456, 0.93893129770992367]
   9: [0.89312977099236646, 0.82442748091603058]
   10: [0.92366412213740456, 0.86259541984732824]
   Classification accuracies:
   bayes 0.898473282443
   tree 0.843511450382

Ok, so we were rather lucky before with the tree results, and it looks
like naive Bayes does not do badly at all in comparison. But a warning
is in order: these results are for trees with no pruning. Try to use
something like ``tree = orngTree.TreeLearner(mForPruning=2)`` in your
script instead, and see if the result gets any different (when we
tried this, pruning brought some improvement)!

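For instance, reusing ``test_rnd_sampling`` and the ``data`` table
from the script above, a pruned tree and a different sampling setup
could be tried as follows (a sketch; the output will again depend on
the random seed)::

   bayes = orange.BayesLearner()
   tree = orngTree.TreeLearner(mForPruning=2)   # prune with m=2
   bayes.name = "bayes"
   tree.name = "tree"
   learners = [bayes, tree]

   # 100 repetitions of a 50:50 split
   acc = test_rnd_sampling(data, learners, p=0.5, n=100)
   print "Classification accuracies:"
   for i in range(len(learners)):
       print learners[i].name, acc[i]
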
10-fold cross-validation
------------------------

Evaluation through k-fold cross validation is probably the most common
method in the machine learning community. The data set is split into k
equally sized subsets, and in the i-th iteration (i=1..k) the i-th
subset is used for testing the classifier that has been built on all
the remaining subsets. Notice that with this method each instance is
classified (used for testing) exactly once. The number of subsets k is
usually set to 10. Orange has a built-in procedure that builds an
array of length equal to the number of data instances, with each
element of the array being a number from 0 to k-1. These numbers are
assigned such that each resulting data subset has a class distribution
similar to that of the original data set (stratified k-fold
cross-validation).

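That procedure is ``MakeRandomIndicesCV``, which the script below also
uses. A quick sketch of what it returns::

   import orange

   data = orange.ExampleTable("voting")
   selection = orange.MakeRandomIndicesCV(data, folds=10)

   sel = list(selection)
   print sel[:15]              # fold index (0..9) of the first 15 instances
   for fold in range(10):
       print "fold %d: %d instances" % (fold, sel.count(fold))
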
The script for k-fold cross-validation is similar to the script for
repetitive random sampling above. We define a function called
``cross_validation`` and use it to compute the accuracies (part of
:download:`accuracy5.py <code/accuracy5.py>`)::

   def cross_validation(data, learners, k=10):
       acc = [0.0]*len(learners)
       selection = orange.MakeRandomIndicesCV(data, folds=k)
       for test_fold in range(k):
           train_data = data.select(selection, test_fold, negate=1)
           test_data = data.select(selection, test_fold)
           classifiers = []
           for l in learners:
               classifiers.append(l(train_data))
           acc1 = accuracy(test_data, classifiers)
           print "%d: %s" % (test_fold+1, acc1)
           for j in range(len(learners)):
               acc[j] += acc1[j]
       for j in range(len(learners)):
           acc[j] = acc[j]/k
       return acc

   # ... some code skipped ...

   bayes = orange.BayesLearner()
   tree = orngTree.TreeLearner(mForPruning=2)

   # ... some code skipped ...

   # compute accuracies on data
   data = orange.ExampleTable("voting")
   acc = cross_validation(data, learners, k=10)
   print "Classification accuracies:"
   for i in range(len(learners)):
       print learners[i].name, acc[i]

Notice that to select the instances, we have again used
``data.select``. To obtain the training data, we have instructed
Orange to use all instances whose selection value differs from
``test_fold``, an integer that stores the index of the fold currently
used for testing. Also notice that this time we have included pruning
for the trees.

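The ``negate=1`` argument of ``data.select`` is what gives us the
complementary (training) selection. A minimal sketch of the two calls
side by side::

   import orange

   data = orange.ExampleTable("voting")
   selection = orange.MakeRandomIndicesCV(data, folds=10)

   test_data = data.select(selection, 3)              # instances in fold 3
   train_data = data.select(selection, 3, negate=1)   # everything but fold 3
   print len(test_data), len(train_data), len(data)
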
Running the 10-fold cross validation on our data set results in
numbers similar to those produced by random sampling (when pruning was
used). For those of you curious whether this is really so, run the
script yourself.

Leave-one-out
-------------

This evaluation procedure is often performed when data sets are small
(not really the case for the data we are using in our example). In
each cycle, a single instance is used for testing, while the
classifier is built on all the other instances. The leave-one-out test
can be defined through a single Python function (part of
:download:`accuracy6.py <code/accuracy6.py>`)::

   def leave_one_out(data, learners):
       acc = [0.0]*len(learners)
       selection = [1] * len(data)
       last = 0
       for i in range(len(data)):
           print 'leave-one-out: %d of %d' % (i+1, len(data))
           selection[last] = 1
           selection[i] = 0
           train_data = data.select(selection, 1)
           for j in range(len(learners)):
               classifier = learners[j](train_data)
               if classifier(data[i]) == data[i].getclass():
                   acc[j] += 1
           last = i

       for j in range(len(learners)):
           acc[j] = acc[j]/len(data)
       return acc

What is not shown in the code above but is contained in the script is
that we have introduced some pre-pruning for the trees and used ``tree
= orngTree.TreeLearner(minExamples=10, mForPruning=2)``. This was just
to decrease the time one needs to wait for the results of the testing
(on our moderately fast machines, it takes about half a second per
iteration).

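The rest of the script presumably mirrors the earlier ones; a sketch
of how ``leave_one_out`` could be set up and called, using the
pre-pruned tree learner mentioned above::

   import orange, orngTree

   bayes = orange.BayesLearner()
   tree = orngTree.TreeLearner(minExamples=10, mForPruning=2)
   bayes.name = "bayes"
   tree.name = "tree"
   learners = [bayes, tree]

   data = orange.ExampleTable("voting")
   acc = leave_one_out(data, learners)
   print "Classification accuracies:"
   for i in range(len(learners)):
       print learners[i].name, acc[i]
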
Again, Python's list variable ``selection`` is used to select the data
for learning: this time all its elements but the i-th are equal to 1.
There is no need to separately create a test set, since it contains
only one (the i-th) item, which is referred to directly as
``data[i]``. Everything else (except for the call to
``leave_one_out``, which this time requires no extra parameters) is
the same as in the scripts defined for random sampling and
cross-validation. Interestingly, the accuracies obtained on the voting
data set are similar as well::

   Classification accuracies:
   bayes 0.901149425287
   tree 0.96091954023

Area under ROC
--------------

Going back to the data set we use in this lesson
(:download:`voting.tab <code/voting.tab>`), let us say that at the end
of 1984 we met two members of congress in a corridor. Somebody tells
us that they belong to different parties. We now use the classifier we
have just developed on our data to compute the probability that each
of them is a republican. What is the chance that the one to whom we
have assigned the higher probability is indeed the republican?

This type of statistic is widely used in medicine and is called the
area under the ROC curve (see, for instance, JR Beck & EK Schultz: The
use of ROC curves in test performance evaluation. Archives of
Pathology and Laboratory Medicine 110:13-20, 1986, and Hanley &
McNeil: The meaning and use of the area under a receiver operating
characteristic curve. Radiology, 143:29-36, 1982). It is a
discrimination measure that ranges from 0.5 (random guessing) to 1.0
(a clear margin exists in probability that divides the two
classes). Just to give an example of yet another statistic that can be
assessed in Orange, we here present a simple (but not optimized and
rather inefficient) implementation of this measure.

We will use a script similar to :download:`accuracy5.py
<code/accuracy5.py>` (k-fold cross validation) and will replace the
accuracy() function with a function that computes the area under ROC
for a given data set and set of classifiers. The algorithm will
investigate all pairs of data items. Those pairs where the outcome was
originally different (e.g., one item represented a republican, the
other a democrat) will be termed valid pairs and will be checked.
Given a valid pair, if the higher probability for republican was
indeed assigned to the item that was originally a republican, this
pair will be termed a correct pair. The area under ROC is then the
proportion of correct pairs in the set of valid pairs of instances. In
case of ties (both instances were assigned the same probability of
representing a republican), the pair counts as 0.5 instead of 1. The
function that computes the area under ROC using this method is coded
in Python as (part of :download:`roc.py <code/roc.py>`)::

   def aroc(data, classifiers):
       ar = []
       for c in classifiers:
           p = []
           for d in data:
               p.append(c(d, orange.GetProbabilities)[0])
           correct = 0.0; valid = 0.0
           for i in range(len(data)-1):
               for j in range(i+1, len(data)):
                   if data[i].getclass() != data[j].getclass():
                       valid += 1
                       if p[i] == p[j]:
                           correct += 0.5
                       elif data[i].getclass() == 0:
                           if p[i] > p[j]:
                               correct += 1.0
                       else:
                           if p[j] > p[i]:
                               correct += 1.0
           ar.append(correct / valid)
       return ar

Notice that the array ``p``, of length equal to the number of data
instances, contains the probabilities of each item being classified as
a republican. We have to admit that although on the voting data set
and under 10-fold cross-validation computing the area under ROC this
way is rather fast (below 3 s), there exists a better algorithm with
complexity O(n log n) instead of O(n^2). Anyway, running
:download:`roc.py <code/roc.py>` shows that naive Bayes is better in
terms of discrimination as measured by the area under ROC::

   Area under ROC:
   bayes 0.970308048433
   tree 0.954274027987
   majority 0.5

.. note::
   Just as a check, a majority classifier was also included this
   time. As expected, its area under ROC is minimal and equal to 0.5.

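As a final aside, the O(n log n) approach mentioned above can be based
on ranks (the Mann-Whitney statistic): sort the instances by the
predicted probability once, then compare rank sums instead of checking
all pairs. A minimal sketch follows; the function name ``aroc_ranks``
and the ``target_class`` parameter are ours and not part of the
tutorial scripts, but the result matches the pairwise ``aroc`` above,
including the 0.5 credit for ties::

   def aroc_ranks(data, classifiers, target_class=0):
       ar = []
       actual = [int(d.getclass()) == target_class for d in data]
       n_pos = sum(actual)
       n_neg = len(data) - n_pos
       for c in classifiers:
           p = [c(d, orange.GetProbabilities)[target_class] for d in data]
           # sort instance indices by predicted probability (O(n log n))
           order = sorted(range(len(data)), key=lambda i: p[i])
           # assign 1-based ranks, averaging them over ties
           ranks = [0.0] * len(data)
           i = 0
           while i < len(data):
               j = i
               while j + 1 < len(data) and p[order[j+1]] == p[order[i]]:
                   j += 1
               for k in range(i, j+1):
                   ranks[order[k]] = (i + j) / 2.0 + 1.0
               i = j + 1
           # Mann-Whitney: AUC from the rank sum of the target-class instances
           rank_sum = sum(ranks[i] for i in range(len(data)) if actual[i])
           ar.append((rank_sum - n_pos*(n_pos+1)/2.0) / (n_pos * n_neg))
       return ar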