source: orange/orange/doc/ofb/c_performance.htm @ 6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 23.1 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<p class="Path">
Prev: <a href="c_otherclass.htm">Selected Classification Methods</a>,
Next: <a href="c_pythonlearner.htm">Build your own learner</a>,
Up: <a href="classification.htm">Classification</a>
</p>
<H1>Testing and Evaluating Your Classifiers</H1>
<index name="classifiers/accuracy of">
<p>In this lesson you will learn how to evaluate classification
methods in terms of their power to accurately predict the class of
test examples.</p>

<p>The simplest way to estimate accuracy and report score
metrics is to use Orange's <a
href="../modules/orngTest.htm">orngTest</a> and <a
href="../modules/orngStat.htm">orngStat</a> modules. This is probably
how you will perform evaluation in your scripts, so we start
with examples that use these two modules.</p>
<p>You may also perform testing and scoring on your own, so we
further provide several example scripts that compute classification
accuracy, measure it on a list of classifiers, and do cross-validation,
leave-one-out and random sampling. While all of this functionality is
available in the <a href="../modules/orngTest.htm">orngTest</a> and <a
href="../modules/orngStat.htm">orngStat</a> modules, these example
scripts may still be useful for those who want to learn more about
Orange's learner/classifier objects and how to use them in
combination with data sampling.</p>
<h2>Testing the Easy Way: orngTest and orngStat Modules</h2>
<index name="crossValidation">

<p>Below is a script that takes a list of learners (a naive Bayesian
classifier and a classification tree) and scores their predictive
performance on a single data set using ten-fold cross-validation. The
script reports four different scores: classification accuracy,
information score, Brier score and area under ROC curve.</p>
<p class="header"><a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">import orange, orngTest, orngStat, orngTree

# set up the learners
bayes = orange.BayesLearner()
tree = orngTree.TreeLearner(mForPruning=2)
bayes.name = "bayes"
tree.name = "tree"
learners = [bayes, tree]

# compute accuracies on data
data = orange.ExampleTable("voting")
results = orngTest.crossValidation(learners, data, folds=10)

# output the results
print "Learner  CA     IS     Brier    AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name,
        orngStat.CA(results)[i], orngStat.IS(results)[i],
        orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])
</xmp>
<p>The output of this script is:</p>
<xmp class="code">Learner  CA     IS     Brier    AUC
bayes    0.901  0.758  0.176  0.976
tree     0.961  0.845  0.075  0.956
</xmp>
<p>The call to <code>orngTest.crossValidation</code> does the hard
work. It returns the object stored
in <code>results</code>, which essentially holds the probabilities
and class values of the instances that were used as test cases. Based
on <code>results</code>, the classification accuracy, information
score, Brier score and area under ROC curve (AUC) for each of the
learners are computed (functions <code>CA</code>,
<code>IS</code>, <code>BrierScore</code> and <code>AUC</code>).</p>
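<p>To make it clearer what such scores measure, here is a minimal sketch in plain Python 3 (no Orange required) that computes classification accuracy and a Brier score from a list of predicted class probabilities. The function names and the toy numbers are hypothetical, and the multi-class form of the Brier score used here (squared error summed over classes, averaged over instances) is an assumption; the exact normalization in orngStat may differ.</p>

```python
def classification_accuracy(probs, actual):
    """Fraction of instances whose most probable class equals the actual class."""
    correct = 0
    for p, a in zip(probs, actual):
        predicted = max(range(len(p)), key=lambda c: p[c])
        if predicted == a:
            correct += 1
    return correct / len(actual)


def brier_score(probs, actual):
    """Squared error between predicted probabilities and the 0/1 class
    indicator, summed over classes and averaged over instances."""
    total = 0.0
    for p, a in zip(probs, actual):
        for c, pc in enumerate(p):
            target = 1.0 if c == a else 0.0
            total += (pc - target) ** 2
    return total / len(actual)


# hypothetical predictions for four test instances and two classes
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
actual = [0, 1, 1, 1]

ca = classification_accuracy(probs, actual)
bs = brier_score(probs, actual)
print("CA: %5.3f  Brier: %5.3f" % (ca, bs))
```

<p>Higher accuracy is better, while a lower Brier score is better, since it penalizes confident but wrong probability estimates.</p>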
<p>Apart from the statistics mentioned above, <a
href="../modules/orngStat.htm">orngStat</a> has built-in functions
that compute other performance metrics, and <a
href="../modules/orngTest.htm">orngTest</a> includes other testing
schemes. If you need to test your learners with standard statistics,
these are probably all you need. The script below shows the use of
some other statistics, with perhaps more modular
code than above.</p>
<p class="header">part of <a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">data = orange.ExampleTable("voting")
res = orngTest.crossValidation(learners, data, folds=10)
cm = orngStat.computeConfusionMatrices(res,
        classIndex=data.domain.classVar.values.index('democrat'))

# each entry pairs a column title with the orngStat expression to evaluate
stat = (('CA', 'CA(res)'),
        ('Sens', 'sens(cm)'),
        ('Spec', 'spec(cm)'),
        ('AUC', 'AUC(res)'),
        ('IS', 'IS(res)'),
        ('Brier', 'BrierScore(res)'),
        ('F1', 'F1(cm)'),
        ('F2', 'Falpha(cm, alpha=2.0)'))

scores = [eval("orngStat."+s[1]) for s in stat]
print "Learner  " + "".join(["%-7s" % s[0] for s in stat])
for (i, l) in enumerate(learners):
    print "%-8s " % l.name + "".join(["%5.3f  " % s[i] for s in scores])
</xmp>
<p>Notice that for a number of scoring measures we needed to compute
the confusion matrix, for which we also needed to specify the target
class (democrats, in our case). This script has a similar output to
the previous one:</p>

<xmp class="code">Learner  CA     Sens   Spec   AUC    IS     Brier  F1     F2
bayes    0.901  0.891  0.917  0.976  0.758  0.176  0.917  0.908
tree     0.961  0.974  0.940  0.956  0.845  0.075  0.968  0.970
</xmp>
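<p>The measures derived from a confusion matrix can be sketched in a few lines of plain Python 3. The counts below are hypothetical, the function names are ours, and the F-measure uses the textbook F-beta weighting; orngStat's <code>Falpha</code> may parameterize the weight differently, so treat this as an illustration rather than a reimplementation. The sketch also assumes that no denominator is zero.</p>

```python
def sensitivity(tp, fn):
    """Proportion of target-class instances that were recognized as such."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of non-target instances that were recognized as such."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Proportion of instances predicted as target that really are target."""
    return tp / (tp + fp)


def f_beta(tp, fp, fn, beta=1.0):
    """Textbook F-measure: weighted harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = sensitivity(tp, fn)
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# hypothetical confusion matrix for the target class
tp, fp, fn, tn = 40, 10, 5, 45

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
f1 = f_beta(tp, fp, fn)
f2 = f_beta(tp, fp, fn, beta=2.0)
print("Sens %5.3f  Spec %5.3f  F1 %5.3f  F2 %5.3f" % (sens, spec, f1, f2))
```

<p>All four numbers depend on which class is treated as the target, which is why the script above had to pass a class index when computing the confusion matrices.</p>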
<h2>Do It On Your Own: A Warm-Up</h2>

<p>Let us continue our exploration of the voting data set:
build a naive Bayesian classifier from it and compute the
classification accuracy on the same data set (not something we
should normally do, since it hides overfitting, but it serves our demonstration
purposes).</p>
<p class="header"><a href=""></a>
(uses <a href=""></a>)</p>
<xmp class="code">import orange
data = orange.ExampleTable("voting")
classifier = orange.BayesLearner(data)

# compute classification accuracy
correct = 0.0
for ex in data:
    if classifier(ex) == ex.getclass():
        correct += 1
print "Classification accuracy:", correct/len(data)
</xmp>
<p>To compute classification accuracy, the script examines every
data item and counts how many times it has been classified
correctly. Running this script on the voting data set shows that the accuracy is just above
0.9.</p>

<p>Now, let us extend the code with a function that is given a data
set and a set of classifiers (e.g., <code>accuracy(test_data,
classifiers)</code>) and computes the classification accuracy for each
of the classifiers. We will use it to compare naive Bayes and
classification trees.</p>
<p class="header"><a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">import orange, orngTree

def accuracy(test_data, classifiers):
    # count correct predictions separately for each classifier
    correct = [0.0]*len(classifiers)
    for ex in test_data:
        for i in range(len(classifiers)):
            if classifiers[i](ex) == ex.getclass():
                correct[i] += 1
    for i in range(len(correct)):
        correct[i] = correct[i] / len(test_data)
    return correct

# set up the classifiers
data = orange.ExampleTable("voting")
bayes = orange.BayesLearner(data)
bayes.name = "bayes"
tree = orngTree.TreeLearner(data)
tree.name = "tree"
classifiers = [bayes, tree]

# compute accuracies
acc = accuracy(data, classifiers)
print "Classification accuracies:"
for i in range(len(classifiers)):
    print classifiers[i].name, acc[i]
</xmp>
<p>This is the first time in our tutorial that we define a function.
You can see that this is quite simple in Python: a function is
introduced with the keyword <code>def</code>, followed by the function&rsquo;s name
and a list of arguments. Do not forget the colon at the end of the
definition line. Other than that, there is nothing new in this
code. A mild exception is the expression <code>classifiers[i](ex)</code>,
but intuition tells us that here the i-th classifier is called as
a function, with the example to classify as its argument. So, finally,
which method does better? Here is the output:</p>
<xmp class="code">Classification accuracies:
bayes 0.903448275862
tree 0.997701149425
</xmp>

<p>It looks like the classification tree is much more accurate here.
But beware of overfitting (unpruned classification
trees are especially prone to it) and read on!</p>
<h2>Training and Test Set</h2>

<p>In machine learning, one should not learn and test classifiers
on the same data set. For this reason, let us split our data in
half, and use one half of the data for training and the other for
testing. The script is similar to the one above; the part that
differs is shown below:</p>
<p class="header">part of <a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code"># set up the classifiers
data = orange.ExampleTable("voting")
selection = orange.MakeRandomIndices2(data, 0.5)
train_data = data.select(selection, 0)
test_data = data.select(selection, 1)

bayes = orange.BayesLearner(train_data)
tree = orngTree.TreeLearner(train_data)
</xmp>
<p>Orange's function <code>MakeRandomIndices2</code> takes the data
and generates a vector whose length equals the number of data
instances. Elements of the vector are either 0 or 1, and the probability
of an element being 0 is 0.5 (or whatever we specify in the argument
of the function). The i-th instance of the data then goes either
to the training set (if <code>selection[i]==0</code>) or to the test set (if
<code>selection[i]==1</code>). Notice that <code>MakeRandomIndices2</code> makes
sure that the split is stratified, i.e., the class distribution in the
training and test sets is approximately equal (you may use the
attribute <code>stratified=0</code> if you do not like
stratification).</p>
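<p>The idea of a stratified 0/1 index vector can be illustrated without Orange. The following plain Python 3 sketch is only one possible way to build such a vector; the function name <code>stratified_indices2</code>, its parameters and the toy class list are hypothetical, and it is not a description of how <code>MakeRandomIndices2</code> is implemented internally.</p>

```python
import random
from collections import defaultdict


def stratified_indices2(classes, p0=0.5, seed=0):
    """Return a list of 0/1 marks, one per instance, such that within
    each class a proportion p0 of the instances is marked with 0."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(classes):
        by_class[c].append(i)
    selection = [1] * len(classes)
    for members in by_class.values():
        rng.shuffle(members)
        n0 = int(round(p0 * len(members)))
        for i in members[:n0]:
            selection[i] = 0
    return selection


# hypothetical class labels: 60 democrats and 40 republicans
classes = ["democrat"] * 60 + ["republican"] * 40
selection = stratified_indices2(classes, 0.5)

train = [i for i, s in enumerate(selection) if s == 0]
test = [i for i, s in enumerate(selection) if s == 1]
train_democrats = sum(1 for i in train if classes[i] == "democrat")
print(len(train), len(test), train_democrats)
```

<p>Because the shuffling is done separately within each class, both halves keep the original 60:40 class ratio, which is what stratification means here.</p>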
<p>The output of this testing is:</p>

<xmp class="code">Classification accuracies:
bayes 0.93119266055
tree 0.802752293578
</xmp>

<p>Here, the accuracy of naive Bayes is much higher. But a warning: the
result is inconclusive, since it depends on only one random split of
the data.</p>
<h2>70-30 Random Sampling</h2>

<p>Above, we used the function <code>accuracy(data, classifiers)</code>
that took a data set and a set of classifiers and measured the
classification accuracy of the classifiers on the data. Remember,
classifiers are models that have already been constructed (they
have &ldquo;seen&rdquo; the learning data already), so in fact the
data passed to <code>accuracy</code> served as a test data set. Now, let us write
another function that is given a set of learners and a data
set and repeatedly splits the data set into, say, 70% and 30%. It uses
the first part of the data (70%) to learn the model and obtain a
classifier, which is then tested on the remaining data (30%) using
the <code>accuracy</code> function developed above.</p>
<p>A learner in Orange is an object that encodes a specific machine
learning algorithm and is ready to accept data to construct
and return a predictive model. We have met quite a number of
learners so far (but we did not call them that):
<code>orange.BayesLearner()</code>, <code>orange.kNNLearner()</code>, and others. If we use
Python to simply call a learner, say with</p>

<p><code>learner = orange.BayesLearner()</code></p>

<p>then <code>learner</code> becomes an instance of <code>orange.BayesLearner</code> and is
ready to get some data and return a classifier. For instance, in our
lessons so far we have used</p>

<p><code>classifier = orange.BayesLearner(data)</code></p>

<p>and we could equally use</p>

<p><code>learner = orange.BayesLearner()</code><br>
<code>classifier = learner(data)</code></p>
<p>So why complicate things with learners? Well, in the task we are
about to tackle, we will repeatedly do learning and testing. If we want
to build a reusable function that takes a set of machine
learning algorithms as input and reports their performance as output,
we can do this only through the use of learners (remember,
classifiers have already seen the data and cannot be
retrained).</p>
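<p>The distinction between a learner and a classifier can be mimicked in a few lines of plain Python 3. The classes below are hypothetical stand-ins, not Orange's own; unlike Orange's learners they do not accept data in the constructor, and the "data" is just a list of class labels. The sketch only shows why a learner can be reused on different training sets while a classifier is bound to the data it was built from.</p>

```python
from collections import Counter


class MajorityClassifier:
    """A fitted model: always predicts the label it was built with."""

    def __init__(self, label):
        self.label = label

    def __call__(self, example):
        return self.label


class MajorityLearner:
    """An algorithm waiting for data: calling it returns a classifier."""

    name = "majority"

    def __call__(self, labels):
        most_common_label = Counter(labels).most_common(1)[0][0]
        return MajorityClassifier(most_common_label)


learner = MajorityLearner()

# the same learner produces a different classifier for each training set
clf_a = learner(["yes", "yes", "no"])
clf_b = learner(["no", "no", "yes"])
print(clf_a("any example"), clf_b("any example"))
```

<p>A testing function that receives <code>learner</code> can call it once per training sample; one that receives only <code>clf_a</code> has nothing left to train.</p>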
<p>Our script (without the accuracy function, which is exactly like the
one we have defined in <a href=
""></a>) is</p>

<p class="header">part of <a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">def test_rnd_sampling(data, learners, p=0.7, n=10):
    acc = [0.0]*len(learners)
    for i in range(n):
        # a fresh random split in every repetition
        selection = orange.MakeRandomIndices2(data, p)
        train_data = data.select(selection, 0)
        test_data = data.select(selection, 1)
        classifiers = []
        for l in learners:
            classifiers.append(l(train_data))
        acc1 = accuracy(test_data, classifiers)
        print "%d: %s" % (i+1, acc1)
        for j in range(len(learners)):
            acc[j] += acc1[j]
    # average the accuracies over all repetitions
    for j in range(len(learners)):
        acc[j] = acc[j]/n
    return acc

# set up the learners
bayes = orange.BayesLearner()
tree = orngTree.TreeLearner()
bayes.name = "bayes"
tree.name = "tree"
learners = [bayes, tree]

# compute accuracies on data
data = orange.ExampleTable("voting")
acc = test_rnd_sampling(data, learners)
print "Classification accuracies:"
for i in range(len(learners)):
    print learners[i].name, acc[i]
</xmp>
<p>Essential to the above script is the function <code>test_rnd_sampling</code>,
which takes the data and a list of learners, and returns their
accuracy estimated through repeated sampling. The additional (and
optional) parameter <code>p</code> tells what proportion of the data is used for
learning. Another parameter, <code>n</code>, specifies how many
times to repeat the learn-and-test procedure. Note that in the
code, when <code>test_rnd_sampling</code> was called, these two parameters were
not specified, so their default values were used (70% and 10,
respectively). You may try to change the code and instead use
<code>test_rnd_sampling(data, learners, n=100, p=0.5)</code>, or experiment in
other ways. There is also a print statement in
<code>test_rnd_sampling</code> that reports the accuracies of the
individual runs (just to see that the code really works), which
you should probably remove if you do not want a long
printout when testing with large <code>n</code>. Depending on the random seed
setup on your machine, the output of this script should be
something like:</p>
<xmp class="code">1: [0.9007633587786259, 0.79389312977099236]
2: [0.9007633587786259, 0.79389312977099236]
3: [0.95419847328244278, 0.92366412213740456]
4: [0.87786259541984735, 0.86259541984732824]
5: [0.86259541984732824, 0.80152671755725191]
6: [0.87022900763358779, 0.80916030534351147]
7: [0.87786259541984735, 0.82442748091603058]
8: [0.92366412213740456, 0.93893129770992367]
9: [0.89312977099236646, 0.82442748091603058]
10: [0.92366412213740456, 0.86259541984732824]
Classification accuracies:
bayes 0.898473282443
tree 0.843511450382
</xmp>
<p>OK, so we were rather lucky before with the tree results, and it looks like naive Bayes does not do badly at all in comparison. But a warning is in order: these results are for trees with no pruning. Try using something like <code>tree = orngTree.TreeLearner(mForPruning=2)</code> in your script instead, and see if the result gets any different (when we tried this, we got some improvement with pruning)!</p>
<h2>10-Fold Cross-Validation</h2>

<p>Evaluation through k-fold cross-validation is
probably the most common method in the machine learning community. The data
set is split into k equally sized subsets, and in the i-th
iteration (i=1..k) the i-th subset is used for testing the classifier
that has been built on all the remaining subsets. Notice that in
this method each instance is classified (for testing) exactly
once. The number of subsets k is usually set to 10. Orange has a
built-in procedure that builds an array of length equal to
the number of data instances, with each element of the array being
a number from 0 to k-1. These numbers are assigned such that each
resulting data subset has a class distribution similar to that of the
original data set (stratified k-fold cross-validation).</p>
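<p>As an illustration of what such a fold-index array looks like, here is a plain Python 3 sketch that assigns fold numbers 0 to k-1 so that each fold receives roughly the same share of every class. The function name <code>stratified_folds</code> and the toy class list are hypothetical, and the round-robin assignment is just one simple way to achieve stratification; it is not claimed to be the procedure Orange uses.</p>

```python
import random
from collections import defaultdict


def stratified_folds(classes, k=10, seed=0):
    """Return a list of fold numbers (0..k-1), one per instance,
    dealing the instances of each class out to the folds in turn."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(classes):
        by_class[c].append(i)
    folds = [0] * len(classes)
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[i] = pos % k
    return folds


# hypothetical class labels: 60 democrats and 40 republicans
classes = ["democrat"] * 60 + ["republican"] * 40
folds = stratified_folds(classes, k=10)

times_tested = [0] * len(classes)
for test_fold in range(10):
    test = [i for i, f in enumerate(folds) if f == test_fold]
    train = [i for i, f in enumerate(folds) if f != test_fold]
    for i in test:
        times_tested[i] += 1

fold_sizes = [folds.count(f) for f in range(10)]
print(fold_sizes)
```

<p>The counter <code>times_tested</code> confirms the property mentioned above: over the k iterations every instance lands in the test set exactly once.</p>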
<p>The script for k-fold cross-validation is similar to the script
for repeated random sampling above. We define a function called
<code>cross_validation</code> and use it to compute the accuracies:</p>

<p class="header">part of <a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">def cross_validation(data, learners, k=10):
    acc = [0.0]*len(learners)
    selection = orange.MakeRandomIndicesCV(data, folds=k)
    for test_fold in range(k):
        # train on everything except the current fold, test on the fold
        train_data = data.select(selection, test_fold, negate=1)
        test_data = data.select(selection, test_fold)
        classifiers = []
        for l in learners:
            classifiers.append(l(train_data))
        acc1 = accuracy(test_data, classifiers)
        print "%d: %s" % (test_fold+1, acc1)
        for j in range(len(learners)):
            acc[j] += acc1[j]
    for j in range(len(learners)):
        acc[j] = acc[j]/k
    return acc

# ... some code skipped ...

bayes = orange.BayesLearner()
tree = orngTree.TreeLearner(mForPruning=2)

# ... some code skipped ...

# compute accuracies on data
data = orange.ExampleTable("voting")
acc = cross_validation(data, learners, k=10)
print "Classification accuracies:"
for i in range(len(learners)):
    print learners[i].name, acc[i]
</xmp>
<p>Notice that to select the instances, we have again used
<code>data.select</code>. To obtain the training data, we have instructed Orange to use all instances that have a value different from <code>test_fold</code>, an integer that stores the index of the fold currently used for testing. Also notice that this time we have included pruning for trees.</p>

<p>Running the 10-fold cross-validation on our data set gives
numbers similar to those produced by random sampling (when pruning was used). If you are curious whether this is really so, run the script yourself.</p>
<h2>Leave-One-Out (Jack Knife)</h2>

<p>This evaluation procedure is often performed when data sets are
small (not really the case for the data we are using in our
example). In each cycle, a single instance is used for testing,
while the classifier is built on all other instances. One can
define the leave-one-out test through a single Python function:</p>
<p class="header">part of <a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">def leave_one_out(data, learners):
    acc = [0.0]*len(learners)
    selection = [1] * len(data)
    last = 0
    for i in range(len(data)):
        print 'leave-one-out: %d of %d' % (i, len(data))
        # put the previous test instance back and take out the i-th one
        selection[last] = 1
        selection[i] = 0
        train_data = data.select(selection, 1)
        for j in range(len(learners)):
            classifier = learners[j](train_data)
            if classifier(data[i]) == data[i].getclass():
                acc[j] += 1
        last = i

    for j in range(len(learners)):
        acc[j] = acc[j]/len(data)
    return acc
</xmp>
<p>What is not shown in the code above but is contained in the script is that we have introduced some pre-pruning with trees and used
<code>tree = orngTree.TreeLearner(minExamples=10, mForPruning=2)</code>. This was just to decrease the time one needs to wait for the results of the testing (on our moderately fast machines, it takes about half a second for each iteration).</p>

<p>Again, the Python list <code>selection</code> is used to filter
out the data for learning: this time all its elements but the i-th are
equal to 1. There is no need to create a separate test set, since
it contains only one (i-th) item, which is referred to directly as
<code>data[i]</code>. Everything else (except for the call to <code>leave_one_out</code>,
which this time requires no extra parameters) is the same as in the
scripts defined for random sampling and cross-validation.
Interestingly, the accuracies obtained on the voting data set are
similar as well:</p>
<xmp class="code">Classification accuracies:
bayes 0.901149425287
tree 0.96091954023
</xmp>

<h2>Area Under ROC</h2>
<p>Going back to the data set we use in this lesson (<a href=
""></a>), let us say that at the end of
1984 we meet two members of congress in a corridor. Somebody tells
us that they belong to different parties. We now use the classifier
we have just developed on our data to compute the probability that
each of them is a republican. What is the chance that the one we have
assigned the higher probability is indeed the republican?</p>
<p>This type of statistic is widely used in medicine and is called
area under ROC curve (see, for instance, JR Beck &amp; EK Schultz:
The use of ROC curves in test performance evaluation. Archives of
Pathology and Laboratory Medicine 110:13-20, 1986 and Hanley &amp;
McNeil: The meaning and use of the area under receiver operating
characteristic curve. Radiology, 143:29-36, 1982). It is a
discrimination measure that ranges from 0.5 (random guessing) to
1.0 (a clear margin exists in probability that divides the two
classes). Just to give an example of yet another statistic
that can be assessed in Orange, we here present a simple (but not
optimized and rather inefficient) implementation of this
statistic.</p>
<p>We will use a script similar to <a href=
""></a> (k-fold cross-validation) and
replace the <code>accuracy()</code> function with a function that computes
area under ROC for a given data set and set of classifiers. The
algorithm investigates all pairs of data items. Those pairs
where the outcome was originally different (e.g., one item
represented a republican, the other a democrat) are termed
valid pairs and are checked. Given a valid pair, if the higher
probability for republican was indeed assigned to the item that was
originally a republican, the pair is termed a correct
pair. Area under ROC is then the proportion of correct pairs in the
set of valid pairs of instances. In case of a tie (both instances
were assigned the same probability of representing a republican),
the pair is counted as 0.5 instead of 1. The function
that computes the area under ROC using this method is written in
Python as:</p>
<p class="header">part of <a href=""></a> (uses <a href=""></a>)</p>
<xmp class="code">def aroc(data, classifiers):
    ar = []
    for c in classifiers:
        # probability of the first class (index 0) for every instance
        p = []
        for d in data:
            p.append(c(d, orange.GetProbabilities)[0])
        correct = 0.0; valid = 0.0
        # examine all pairs of instances that belong to different classes
        for i in range(len(data)-1):
            for j in range(i+1,len(data)):
                if data[i].getclass() != data[j].getclass():
                    valid += 1
                    if p[i] == p[j]:
                        correct += 0.5
                    elif data[i].getclass() == 0:
                        if p[i] > p[j]:
                            correct += 1.0
                    else:
                        if p[j] > p[i]:
                            correct += 1.0
        ar.append(correct / valid)
    return ar
</xmp>
<p>Notice that the array <code>p</code>, of length equal to the data set, contains
the probabilities of the item being classified as republican. We
have to admit that although on the voting data set and under
10-fold cross-validation computing area under ROC is rather fast
(below 3s), there exists a better algorithm with complexity O(n log
n) instead of O(n^2). Anyway, running <a href=
""></a> shows that naive Bayes is better in terms
of discrimination using area under ROC:</p>
<xmp class="code">Area under ROC:
bayes 0.970308048433
tree 0.954274027987
majority 0.5
</xmp>

<p>Notice that, just as a check, a majority classifier was also
included in the test this time. As expected, its area under
ROC is minimal and equal to 0.5.</p>
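<p>For readers curious about the faster O(n log n) approach mentioned above, here is a sketch in plain Python 3 (no Orange required). It sorts the instances by predicted probability, gives tied instances their average rank, and derives the area under ROC from the sum of ranks of the positive instances. The function names and the toy data are hypothetical; the pairwise version is included only to check that the two computations agree, and both assume that each class occurs at least once.</p>

```python
def auc_pairwise(scores, labels):
    """O(n^2): proportion of correct pairs among valid pairs, ties count 0.5."""
    correct = 0.0
    valid = 0
    n = len(scores)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                valid += 1
                if scores[i] == scores[j]:
                    correct += 0.5
                elif (scores[i] > scores[j]) == (labels[i] == 1):
                    correct += 1.0
    return correct / valid


def auc_ranks(scores, labels):
    """O(n log n): sort once, then use the rank sum of the positives."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # find the run of instances tied with the one at position start
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# hypothetical probabilities of the positive class and the true labels
scores = [0.9, 0.8, 0.8, 0.3, 0.6, 0.1]
labels = [1, 1, 0, 0, 1, 0]

slow = auc_pairwise(scores, labels)
fast = auc_ranks(scores, labels)
print("pairwise: %5.3f  rank-based: %5.3f" % (slow, fast))
```

<p>Both functions count the same thing; the rank-based one avoids looking at every pair explicitly, so only the sorting step dominates its running time.</p>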