
Testing and evaluating your classifiers
========================================

.. index::
   single: classifiers; accuracy of

In this lesson you will learn how to estimate the accuracy of
classifiers. The simplest way to do this is to use Orange's
:py:mod:`Orange.evaluation.testing` and :py:mod:`Orange.statistics` modules. This is probably how you
will perform evaluation in your scripts, and thus we start with
examples that use these two modules. You may also perform testing
and scoring on your own, so we further provide several example scripts
that compute classification accuracy, measure it on a list of
classifiers, and do cross-validation, leave-one-out and random
sampling. While all of this functionality is available in the
:py:mod:`Orange.evaluation.testing` and :py:mod:`Orange.statistics` modules, these example scripts may
still be useful for those who want to learn more about Orange's
learner/classifier objects and the way to use them in combination with
data sampling.

.. index:: cross validation

Orange's classes for performance evaluation
--------------------------------------------

Below is a script that takes a list of learners (naive Bayesian
classifier and classification tree) and scores their predictive
performance on a single data set using ten-fold cross validation. The
script reports on four different scores: classification accuracy,
information score, Brier score and area under ROC curve
(:download:`accuracy7.py <code/accuracy7.py>`, uses :download:`voting.tab <code/voting.tab>`)::

    import orange, orngTest, orngStat, orngTree

    # set up the learners
    bayes = orange.BayesLearner()
    tree = orngTree.TreeLearner(mForPruning=2)
    bayes.name = "bayes"
    tree.name = "tree"
    learners = [bayes, tree]

    # compute accuracies on data
    data = orange.ExampleTable("voting")
    results = orngTest.crossValidation(learners, data, folds=10)

    # output the results
    print "Learner  CA     IS     Brier  AUC"
    for i in range(len(learners)):
        print "%-8s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name, \
            orngStat.CA(results)[i], orngStat.IS(results)[i],
            orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])

The output of this script is::

    Learner  CA     IS     Brier  AUC
    bayes    0.901  0.758  0.176  0.976
    tree     0.961  0.845  0.075  0.956

The call to ``orngTest.crossValidation`` does the hard work. Function
``crossValidation`` returns the object stored in ``results``, which
essentially stores the probabilities and class values of the instances
that were used as test cases. Based on ``results``, the classification
accuracy, information score, Brier score and area under ROC curve
(AUC) for each of the learners are computed (functions ``CA``, ``IS``,
``BrierScore`` and ``AUC``).
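
The scoring functions used above each return a list with one score per
learner, in the same order as the ``learners`` list, so names and scores
can be paired directly; a minimal sketch::

    # CA and AUC each return one value per learner, in the order of `learners`
    for learner, ca, auc in zip(learners, orngStat.CA(results), orngStat.AUC(results)):
        print "%s: CA=%.3f AUC=%.3f" % (learner.name, ca, auc)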

Apart from the statistics mentioned above, :py:mod:`Orange.statistics`
has built-in functions that can compute other performance metrics, and
:py:mod:`Orange.evaluation.testing` includes other testing schemes. If you need to test
your learners with standard statistics, these are probably all you
need. Compared to the script above, below we show the use of some
other statistics, with perhaps more modular code than above (part of
:download:`accuracy8.py <code/accuracy8.py>`)::

    data = orange.ExampleTable("voting")
    res = orngTest.crossValidation(learners, data, folds=10)
    cm = orngStat.computeConfusionMatrices(res,
            classIndex=data.domain.classVar.values.index('democrat'))

    stat = (('CA', 'CA(res)'),
            ('Sens', 'sens(cm)'),
            ('Spec', 'spec(cm)'),
            ('AUC', 'AUC(res)'),
            ('IS', 'IS(res)'),
            ('Brier', 'BrierScore(res)'),
            ('F1', 'F1(cm)'),
            ('F2', 'Falpha(cm, alpha=2.0)'))

    scores = [eval("orngStat." + s[1]) for s in stat]
    print "Learner  " + "".join(["%-7s" % s[0] for s in stat])
    for (i, l) in enumerate(learners):
        print "%-8s " % l.name + "".join(["%5.3f  " % s[i] for s in scores])

For a number of scoring measures we needed to compute the confusion
matrix, for which we also needed to specify the target class
(democrats, in our case). This script has a similar output to the
previous one::

    Learner  CA     Sens   Spec   AUC    IS     Brier  F1     F2
    bayes    0.901  0.891  0.917  0.976  0.758  0.176  0.917  0.908
    tree     0.961  0.974  0.940  0.956  0.845  0.075  0.968  0.970

Do it on your own: a warm-up
------------------------------

Let us continue exploring the voting data set: we build a naive
Bayesian classifier from it, and compute the classification accuracy
on the same data set (:download:`accuracy.py <code/accuracy.py>`, uses
:download:`voting.tab <code/voting.tab>`)::

    import orange
    data = orange.ExampleTable("voting")
    classifier = orange.BayesLearner(data)

    # compute classification accuracy
    correct = 0.0
    for ex in data:
        if classifier(ex) == ex.getclass():
            correct += 1
    print "Classification accuracy:", correct/len(data)

To compute the classification accuracy, the script examines every
data item and checks how many times it has been classified
correctly. Running this script shows that the accuracy is just above
90%.

.. warning::
   Training and testing on the same data set is not something we
   should do, as good performance scores may be simply due to
   overfitting. We use this type of testing here for code
   demonstration purposes only.

Let us extend the code with a function that is given a data set and a
set of classifiers (e.g., ``accuracy(test_data, classifiers)``) and
computes the classification accuracy of each classifier. Using this
function, let us compare naive Bayes and classification trees
(:download:`accuracy2.py <code/accuracy2.py>`, uses :download:`voting.tab <code/voting.tab>`)::

    import orange, orngTree

    def accuracy(test_data, classifiers):
        correct = [0.0]*len(classifiers)
        for ex in test_data:
            for i in range(len(classifiers)):
                if classifiers[i](ex) == ex.getclass():
                    correct[i] += 1
        for i in range(len(correct)):
            correct[i] = correct[i] / len(test_data)
        return correct

    # set up the classifiers
    data = orange.ExampleTable("voting")
    bayes = orange.BayesLearner(data)
    bayes.name = "bayes"
    tree = orngTree.TreeLearner(data)
    tree.name = "tree"
    classifiers = [bayes, tree]

    # compute accuracies
    acc = accuracy(data, classifiers)
    print "Classification accuracies:"
    for i in range(len(classifiers)):
        print classifiers[i].name, acc[i]

This is the first time in our tutorial that we define a function. You
may see that this is quite simple in Python; functions are introduced
with the keyword ``def``, followed by the function's name and its list
of arguments. Do not forget the colon at the end of the definition
line. Other than that, there is nothing new in this code. A mild
exception is the expression ``classifiers[i](ex)``, but intuition
tells us that here the i-th classifier is called as a function, with
the example to classify as its argument. So, finally, which method
does better? Here is the output::

    Classification accuracies:
    bayes 0.903448275862
    tree 0.997701149425

It looks like the classification tree is much more accurate here.
But beware of overfitting (unpruned classification trees are
especially prone to it) and read on!

Training and test set
----------------------

In machine learning, one should not learn and test classifiers on the
same data set. For this reason, let us split our data in half, and use
the first half of the data for training and the rest for testing. The
script is similar to the one above; only the part that differs is
shown below (part of :download:`accuracy3.py <code/accuracy3.py>`, uses :download:`voting.tab <code/voting.tab>`)::

    # set up the classifiers
    data = orange.ExampleTable("voting")
    selection = orange.MakeRandomIndices2(data, 0.5)
    train_data = data.select(selection, 0)
    test_data = data.select(selection, 1)

    bayes = orange.BayesLearner(train_data)
    tree = orngTree.TreeLearner(train_data)

Orange's function ``MakeRandomIndices2`` takes the data and generates
a vector of length equal to the number of data instances. Elements
of the vector are either 0 or 1, and the probability of an element
being 0 is 0.5 (or whatever we specify in the argument of the
function). Then the i-th instance of the data goes either to the
training set (if ``selection[i]==0``) or to the test set (if
``selection[i]==1``). Notice that ``MakeRandomIndices2`` makes sure that
this split is stratified, i.e., the class distributions in the training
and test sets are approximately equal (you may use the attribute
``stratified=0`` if you do not like stratification).
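
To get a feel for what this index vector looks like, you can print its
first few elements (a minimal sketch; the exact 0/1 pattern depends on
the random seed)::

    selection = orange.MakeRandomIndices2(data, 0.5)
    print list(selection)[:10]   # e.g. [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    print "training instances:", len([s for s in selection if s == 0])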

The output of this testing is::

    Classification accuracies:
    bayes 0.93119266055
    tree 0.802752293578

Here, the accuracy of naive Bayes is much higher. But beware: the result
is inconclusive, since it depends on only one random split of the
data.

70-30 random sampling
----------------------

Above, we have used the function ``accuracy(data, classifiers)`` that
took a data set and a set of classifiers and measured the
classification accuracy of the classifiers on the data. Remember, the
classifiers were models that had already been constructed (they had
*seen* the learning data already), so the data passed to ``accuracy``
in fact served as a test data set. Now, let us write another function
that will be given a set of learners and a data set, will repeatedly
split the data set into, say, 70% and 30%, use the first part of the
data (70%) to learn the model and obtain a classifier, which, using the
``accuracy`` function developed above, will be tested on the remaining
data (30%).

A learner in Orange is an object that encodes a specific machine
learning algorithm, and is ready to accept data to construct and
return a predictive model. We have met quite a number of learners so
far (but we did not call them that): ``orange.BayesLearner()``,
``orange.kNNLearner()``, and others. If we use Python simply to call a
learner, say with::

    learner = orange.BayesLearner()

then ``learner`` becomes an instance of ``orange.BayesLearner`` and
is ready to receive some data and return a classifier. For instance, in
our lessons so far we have used::

    classifier = orange.BayesLearner(data)

and we could equally use::

    learner = orange.BayesLearner()
    classifier = learner(data)

So why complicate things with learners? Well, in the task we are just
foreseeing, we will repeatedly do learning and testing. If we want to
build a reusable function that takes a set of machine learning
algorithms as input and reports on their performance as output, we
can do this only through the use of learners (remember, classifiers
have already seen the data and cannot be re-learned).
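
The difference is easiest to see side by side; a minimal sketch, where
``train_a`` and ``train_b`` stand for two hypothetical training samples::

    learner = orange.BayesLearner()    # encodes the algorithm, has seen no data yet
    classifier_a = learner(train_a)    # a model built on one training sample
    classifier_b = learner(train_b)    # the same learner re-applied to another sample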

Our script, without the ``accuracy`` function (which is exactly like the
one we defined in :download:`accuracy2.py <code/accuracy2.py>`), is (part of :download:`accuracy4.py <code/accuracy4.py>`)::

    def test_rnd_sampling(data, learners, p=0.7, n=10):
        acc = [0.0]*len(learners)
        for i in range(n):
            selection = orange.MakeRandomIndices2(data, p)
            train_data = data.select(selection, 0)
            test_data = data.select(selection, 1)
            classifiers = []
            for l in learners:
                classifiers.append(l(train_data))
            acc1 = accuracy(test_data, classifiers)
            print "%d: %s" % (i+1, acc1)
            for j in range(len(learners)):
                acc[j] += acc1[j]
        for j in range(len(learners)):
            acc[j] = acc[j]/n
        return acc

    # set up the learners
    bayes = orange.BayesLearner()
    tree = orngTree.TreeLearner()
    bayes.name = "bayes"
    tree.name = "tree"
    learners = [bayes, tree]

    # compute accuracies on data
    data = orange.ExampleTable("voting")
    acc = test_rnd_sampling(data, learners)
    print "Classification accuracies:"
    for i in range(len(learners)):
        print learners[i].name, acc[i]

Essential to the above script is the function ``test_rnd_sampling``, which
takes the data and a list of learners, and returns their accuracy
estimated through repetitive sampling. The additional (and optional)
parameter ``p`` tells what proportion of the data is used for
learning. Another parameter, ``n``, specifies how many times
to repeat the learn-and-test procedure. Note that in the code, when
``test_rnd_sampling`` was called, these two parameters were not specified,
so their default values were used (70% and 10, respectively). You
may try to change the code, and instead use ``test_rnd_sampling(data,
learners, n=100, p=0.5)``, or experiment in other ways. There is also a
print statement in ``test_rnd_sampling`` that reports on the
accuracies of the individual runs (just to see that the code really
works), which you will probably want to remove if you do not like
long printouts when testing with large ``n``. Depending on the random
seed setup on your machine, the output of this script should be
something like::

    1: [0.9007633587786259, 0.79389312977099236]
    2: [0.9007633587786259, 0.79389312977099236]
    3: [0.95419847328244278, 0.92366412213740456]
    4: [0.87786259541984735, 0.86259541984732824]
    5: [0.86259541984732824, 0.80152671755725191]
    6: [0.87022900763358779, 0.80916030534351147]
    7: [0.87786259541984735, 0.82442748091603058]
    8: [0.92366412213740456, 0.93893129770992367]
    9: [0.89312977099236646, 0.82442748091603058]
    10: [0.92366412213740456, 0.86259541984732824]
    Classification accuracies:
    bayes 0.898473282443
    tree 0.843511450382

Ok, so we were rather lucky before with the tree results, and it looks
like naive Bayes does not do badly at all in comparison. But a warning
is in order: these results are for trees with no pruning. Try to use
something like ``tree = orngTree.TreeLearner(mForPruning=2)`` when
setting up the learner in your script instead, and see if the result
changes (when we tried this, we got some improvement with pruning)!

10-fold cross-validation
-------------------------

Evaluation through the k-fold cross-validation method is probably the
most common in the machine learning community. The data set is split
into k equally sized subsets, and then in the i-th iteration (i=1..k) the
i-th subset is used for testing the classifier that has been built on all
the other remaining subsets. Notice that in this method each instance is
classified (for testing) exactly once. The number of subsets k is
usually set to 10. Orange has a built-in procedure that constructs
an array of length equal to the number of data instances, with each
element of the array being a number from 0 to k-1. These numbers are
assigned such that each resulting data subset has a class distribution
that is similar to the original data set (stratified k-fold
cross-validation).
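
The procedure used below is ``orange.MakeRandomIndicesCV``; a minimal
sketch of what it returns (the exact fold assignment varies between
runs)::

    data = orange.ExampleTable("voting")
    selection = orange.MakeRandomIndicesCV(data, folds=10)
    print list(selection)[:15]   # fold index (0 to 9) for the first 15 instances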

The script for k-fold cross-validation is similar to the script for
repetitive random sampling above. We define a function called
``cross_validation`` and use it to compute the accuracies (part of
:download:`accuracy5.py <code/accuracy5.py>`)::

    def cross_validation(data, learners, k=10):
        acc = [0.0]*len(learners)
        selection = orange.MakeRandomIndicesCV(data, folds=k)
        for test_fold in range(k):
            train_data = data.select(selection, test_fold, negate=1)
            test_data = data.select(selection, test_fold)
            classifiers = []
            for l in learners:
                classifiers.append(l(train_data))
            acc1 = accuracy(test_data, classifiers)
            print "%d: %s" % (test_fold+1, acc1)
            for j in range(len(learners)):
                acc[j] += acc1[j]
        for j in range(len(learners)):
            acc[j] = acc[j]/k
        return acc

    # ... some code skipped ...

    bayes = orange.BayesLearner()
    tree = orngTree.TreeLearner(mForPruning=2)

    # ... some code skipped ...

    # compute accuracies on data
    data = orange.ExampleTable("voting")
    acc = cross_validation(data, learners, k=10)
    print "Classification accuracies:"
    for i in range(len(learners)):
        print learners[i].name, acc[i]

Notice that to select the instances, we have again used
``data.select``. To obtain the training data, we have instructed Orange
to use all instances that have a value different from ``test_fold``, an
integer that stores the current index of the fold to be used for
testing. Also notice that this time we have included pruning for
trees.
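
For instance, to see how ``negate`` splits out a single fold (a minimal
sketch; fold 0 is used only as an example)::

    selection = orange.MakeRandomIndicesCV(data, folds=10)
    test_data = data.select(selection, 0)              # instances assigned to fold 0
    train_data = data.select(selection, 0, negate=1)   # all the remaining instances
    print len(train_data), len(test_data), len(data)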

Running the 10-fold cross-validation on our data set results in
numbers similar to those produced by random sampling (when pruning was
used). If you are curious whether this is really so, run the script
yourself.

Leave-one-out
--------------

This evaluation procedure is often performed when data sets are small
(not really the case for the data we are using in our example). In each
cycle, a single instance is used for testing, while the classifier is
built on all other instances. One can define the leave-one-out test
through a single Python function (part of :download:`accuracy6.py <code/accuracy6.py>`)::

    def leave_one_out(data, learners):
        acc = [0.0]*len(learners)
        selection = [1] * len(data)
        last = 0
        for i in range(len(data)):
            print 'leave-one-out: %d of %d' % (i, len(data))
            selection[last] = 1
            selection[i] = 0
            train_data = data.select(selection, 1)
            for j in range(len(learners)):
                classifier = learners[j](train_data)
                if classifier(data[i]) == data[i].getclass():
                    acc[j] += 1
            last = i

        for j in range(len(learners)):
            acc[j] = acc[j]/len(data)
        return acc

What is not shown in the code above, but is contained in the script, is
that we introduced some pre-pruning for the trees and used ``tree =
orngTree.TreeLearner(minExamples=10, mForPruning=2)``. This was just
to decrease the time one needs to wait for the results of the testing (on
our moderately fast machines, it takes about half a second for each
iteration).

Again, the Python list ``selection`` is used to filter out the data
for learning: this time all its elements but the i-th are equal
to 1. There is no need to separately create a test set, since it
contains only one (the i-th) item, which is referred to directly as
``data[i]``. Everything else (except for the call to ``leave_one_out``, which
this time requires no extra parameters) is the same as in the scripts
defined for random sampling and cross-validation. Interestingly, the
accuracies obtained on the voting data set are similar as well::

    Classification accuracies:
    bayes 0.901149425287
    tree 0.96091954023

Area under ROC
---------------

Going back to the data set we use in this lesson (:download:`voting.tab <code/voting.tab>`), let
us say that at the end of 1984 we met two members of Congress in a
corridor. Somebody tells us that they are from different parties. We
now use the classifier we have just developed on our data to compute
the probability that each of them is a republican. What is the chance
that the one to whom we have assigned the higher probability is indeed
the republican?
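
In code, this question boils down to comparing two predicted
probabilities; a minimal sketch, where the first two instances of the
data set stand in for the two members of Congress::

    classifier = orange.BayesLearner(data)
    p1 = classifier(data[0], orange.GetProbabilities)[0]   # P(republican) for one member
    p2 = classifier(data[1], orange.GetProbabilities)[0]   # P(republican) for the other
    print p1, p2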

This type of statistic is widely used in medicine and is called the area
under the ROC curve (see, for instance, JR Beck & EK Schultz: The use
of ROC curves in test performance evaluation. Archives of Pathology
and Laboratory Medicine 110:13-20, 1986, and Hanley & McNeil: The
meaning and use of the area under a receiver operating characteristic
curve. Radiology, 143:29-36, 1982). It is a discrimination measure
that ranges from 0.5 (random guessing) to 1.0 (a clear margin exists
in probability that divides the two classes). To give an example of
yet another statistic that can be assessed in Orange, we here present
a simple (but not optimized and rather inefficient) implementation of
this measure.

We will use a script similar to :download:`accuracy5.py <code/accuracy5.py>` (k-fold cross
validation) and will replace the ``accuracy()`` function with a function
that computes the area under ROC for a given data set and set of
classifiers. The algorithm will investigate all pairs of data
items. Those pairs where the outcome was originally different (e.g.,
one item represented a republican, the other a democrat) will be
termed valid pairs and will be checked. Given a valid pair, if the
higher probability of being a republican was indeed assigned to the item
that was originally a republican, the pair will be termed a correct
pair. The area under ROC is then the proportion of correct pairs in the
set of valid pairs of instances. In case of ties (both instances were
assigned the same probability of representing a republican), the pair
counts as 0.5 instead of 1. The function that computes the area under
ROC using this method is coded in Python as (part of :download:`roc.py <code/roc.py>`)::

    def aroc(data, classifiers):
        ar = []
        for c in classifiers:
            p = []
            for d in data:
                p.append(c(d, orange.GetProbabilities)[0])
            correct = 0.0; valid = 0.0
            for i in range(len(data)-1):
                for j in range(i+1, len(data)):
                    if data[i].getclass() != data[j].getclass():
                        valid += 1
                        if p[i] == p[j]:
                            correct += 0.5
                        elif data[i].getclass() == 0:
                            if p[i] > p[j]:
                                correct += 1.0
                        else:
                            if p[j] > p[i]:
                                correct += 1.0
            ar.append(correct / valid)
        return ar

Notice that the list ``p``, whose length equals the number of data
instances, contains the probabilities of an item being classified as a
republican. We have to admit that although on the voting data set and
under 10-fold cross-validation computing the area under ROC this way is
rather fast (below 3 s), there exists a better algorithm with complexity
O(n log n) instead of O(n^2). Anyway, running :download:`roc.py <code/roc.py>` shows that naive
Bayes is better in terms of discrimination using the area under ROC::

    Area under ROC:
    bayes 0.970308048433
    tree 0.954274027987
    majority 0.5

.. note::
   Just for a check, a majority classifier was also included in the
   test case this time. As expected, its area under ROC is minimal and
   equal to 0.5.
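
For reference, such a baseline could have been added to the list of
learners along the lines of the following sketch (assuming
``orange.MajorityLearner`` as the majority-class learner of the old
``orange`` module)::

    majority = orange.MajorityLearner()
    majority.name = "majority"
    learners = [bayes, tree, majority]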
