
.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index: scoring

This module contains various measures of quality for classification and
regression. Most functions require an argument named :obj:`res`, an instance
of :class:`Orange.evaluation.testing.ExperimentResults` as computed by
functions from :mod:`Orange.evaluation.testing`, which contains
predictions obtained through cross-validation, leave-one-out, testing on
training data, or on a separate test set.

==============
Classification
==============

To prepare some data for the examples on this page, we shall load the voting
data set (the problem of predicting a congressman's party, republican or
democrat, from a selection of votes) and evaluate a naive Bayesian learner,
classification trees and a majority classifier using cross-validation.
For examples requiring a multivalued class problem, we shall do the same
with the vehicle data set (telling whether a vehicle described by features
extracted from a picture is a van, a bus, or an Opel or Saab car).

A basic cross-validation example is shown in the following part of
(:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`):

.. literalinclude:: code/statExample0.py

If instances are weighted, the weights are taken into account. This can be
disabled by passing :obj:`unweighted=1` as a keyword argument. Another way of
disabling weights is to clear the weights flag of the
:class:`Orange.evaluation.testing.ExperimentResults` object.
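
For instance, a minimal sketch, assuming :obj:`res` holds results computed
on weighted data as above::

    # classification accuracy, taking instance weights into account (default)
    CAs = Orange.evaluation.scoring.CA(res)

    # the same score with weights ignored
    CAs = Orange.evaluation.scoring.CA(res, unweighted=1)

    # alternatively, clear the weights flag on the results object
    res.weights = False
    CAs = Orange.evaluation.scoring.CA(res)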

General Measures of Quality
===========================

.. autofunction:: CA

.. autofunction:: AP

.. autofunction:: Brier_score

.. autofunction:: IS

So, let's compute all this in part of
(:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) and print it out:

.. literalinclude:: code/statExample1.py
   :lines: 13-

The output should look like this::

    method  CA      AP      Brier   IS
    bayes   0.903   0.902   0.175   0.759
    tree    0.846   0.845   0.286   0.641
    majorty 0.614   0.526   0.474  -0.000

Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out
the standard errors.

Confusion Matrix
================

.. autofunction:: confusion_matrices

**A positive-negative confusion matrix** is computed (a) if the class is
binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is
multivalued and :obj:`classIndex` is non-negative. The argument
:obj:`classIndex` then tells which class is positive. In case (a),
:obj:`classIndex` may be omitted; the first class
is then negative and the second is positive, unless the :obj:`baseClass`
attribute in the object with results has a non-negative value. In that case,
:obj:`baseClass` is an index of the target class. The :obj:`baseClass`
attribute of the results object should be set manually. The result of the
function is a list of instances of class :class:`ConfusionMatrix`,
containing the (weighted) number of true positives (TP), false
negatives (FN), false positives (FP) and true negatives (TN).

We can also add the keyword argument :obj:`cutoff`
(e.g. ``confusion_matrices(results, cutoff=0.3)``); if we do, :obj:`confusion_matrices`
will disregard the classifiers' class predictions and observe the predicted
probabilities, considering the prediction "positive" if the predicted
probability of the positive class is higher than the :obj:`cutoff`.

The example (part of :download:`statExamples.py <code/statExamples.py>`) below shows how setting the
cutoff threshold from the default 0.5 to 0.2 affects the confusion matrices
for the naive Bayesian classifier::

    cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

    cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

The output::

    Confusion matrix for naive Bayes:
    TP: 238, FP: 13, FN: 29.0, TN: 155
    Confusion matrix for naive Bayes:
    TP: 239, FP: 18, FN: 28.0, TN: 150

shows that the number of true positives increases (and hence the number of
false negatives decreases) by only a single instance, while five instances
that were originally true negatives become false positives due to the
lower threshold.

To observe how good the classifiers are at detecting vans in the vehicle
data set, we would compute the matrix like this::

    cm = Orange.evaluation.scoring.confusion_matrices(resVeh, vehicle.domain.classVar.values.index("van"))

and get results like these::

    TP: 189, FP: 241, FN: 10.0, TN: 406

while the same for class "opel" would give::

    TP: 86, FP: 112, FN: 126.0, TN: 522

The main difference is that there are only a few false negatives for the
van, meaning that the classifier seldom misses it (if it says it's not a
van, it's almost certainly not a van). Not so for the Opel car, where the
classifier missed 126 of them and correctly detected only 86.

**A general confusion matrix** is computed (a) in case of a binary class,
when :obj:`classIndex` is set to -2, or (b) when we have a multivalued class
and the caller doesn't specify the :obj:`classIndex` of the positive class.
When called in this manner, the function cannot use the argument
:obj:`cutoff`.

The function then returns a three-dimensional matrix, where the element
A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`]
gives the number of instances belonging to 'actual_class' for which the
'learner' predicted 'predictedClass'. We shall compute and print out
the matrix for the naive Bayesian classifier.

Here we see another example from :download:`statExamples.py <code/statExamples.py>`::

    cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0]
    classes = vehicle.domain.classVar.values
    print "\t" + "\t".join(classes)
    for className, classConfusions in zip(classes, cm):
        print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

So, here's what this nice piece of code gives::

           bus   van  saab  opel
    bus     56    95    21    46
    van      6   189     4     0
    saab     3    75    73    66
    opel     4    71    51    86

Vans are clearly simple: 189 vans were classified as vans (we know this
already, we've printed it out above), and the 10 misclassified pictures
were classified as buses (6) and Saab cars (4). In all other classes,
there were more instances misclassified as vans than correctly classified
instances. The classifier is obviously quite biased towards vans.

.. method:: sens(confm)
.. method:: spec(confm)
.. method:: PPV(confm)
.. method:: NPV(confm)
.. method:: precision(confm)
.. method:: recall(confm)
.. method:: F1(confm)
.. method:: Falpha(confm, alpha=2.0)
.. method:: MCC(confm)

With the confusion matrix defined in terms of positive and negative
classes, you can also compute the
`sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_
[TP/(TP+FN)], `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_
[TN/(TN+FP)], `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_
[TP/(TP+FP)] and `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)].
In information retrieval, positive predictive value is called precision
(the ratio of the number of relevant records retrieved to the total number
of irrelevant and relevant records retrieved), and sensitivity is called
`recall <http://en.wikipedia.org/wiki/Information_retrieval>`_
(the ratio of the number of relevant records retrieved to the total number
of relevant records in the database). The
`harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision
and recall is called the
`F-measure <http://en.wikipedia.org/wiki/F-measure>`_, which, depending
on the relative weighting of precision and recall, is implemented
as F1 [2*precision*recall/(precision+recall)] or, for the general case,
Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)].
The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
is in essence a correlation coefficient between
the observed and predicted binary classifications; it returns a value
between -1 and +1. A coefficient of +1 represents a perfect prediction,
0 an average random prediction and -1 an inverse prediction.

If the argument :obj:`confm` is a single confusion matrix, a single
result (a number) is returned. If :obj:`confm` is a list of confusion
matrices, a list of scores is returned, one for each confusion matrix.

Note that weights are taken into account when computing the matrix, so
these functions don't check the 'weighted' keyword argument.

Let us print out sensitivities and specificities of our classifiers in
part of :download:`statExamples.py <code/statExamples.py>`::

    cm = Orange.evaluation.scoring.confusion_matrices(res)

    print "method\tsens\tspec"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l]))
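
The other measures listed above can be computed the same way from the same
list of matrices; a minimal sketch, applying the per-matrix functions
documented above to each learner's matrix::

    scoring = Orange.evaluation.scoring
    print "method\tprec\trecall\tF1\tMCC"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f\t%5.3f\t%5.3f" % (learners[l].name,
            scoring.precision(cm[l]), scoring.recall(cm[l]),
            scoring.F1(cm[l]), scoring.MCC(cm[l]))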

ROC Analysis
============

`Receiver Operating Characteristic \
<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
(ROC) analysis was initially developed for
binary (two-class) problems, and there is no consensus on how to apply it to
multi-class problems, nor do we know for sure how to do ROC analysis after
cross-validation and similar multiple sampling techniques. If you are
interested in the area under the curve, the function AUC will deal with those
problems as specifically described below.

.. autofunction:: AUC

.. attribute:: AUC.ByWeightedPairs (or 0)

   Computes AUC for each pair of classes (ignoring instances of all other
   classes) and averages the results, weighting them by the number of
   pairs of instances from these two classes (e.g. by the product of
   probabilities of the two classes). AUC computed in this way still
   behaves as a concordance index, i.e., gives the probability that two
   randomly chosen instances from different classes will be correctly
   recognized (this is of course true only if the classifier knows
   from which two classes the instances came).

.. attribute:: AUC.ByPairs (or 1)

   Similar to the above, except that the average over class pairs is not
   weighted. This AUC is, like the binary one, independent of class
   distributions, but it is no longer related to the concordance index.

.. attribute:: AUC.WeightedOneAgainstAll (or 2)

   For each class, it computes AUC for this class against all others (that
   is, treating the other classes as one class). The AUCs are then averaged
   by the class probabilities. This is related to the concordance index in
   which we test the classifier's (average) capability of distinguishing
   the instances from a specified class from those that come from other
   classes. Unlike the binary AUC, the measure is not independent of class
   distributions.

.. attribute:: AUC.OneAgainstAll (or 3)

   As above, except that the average is not weighted.

In case of multiple folds (for instance if the data comes from cross
validation), the computation goes like this. When computing the partial
AUCs for individual pairs of classes or singled-out classes, AUC is
computed for each fold separately and then averaged (ignoring the number
of instances in each fold; it's just a simple average). However, if a
certain fold doesn't contain any instances of a certain class (from the
pair), the partial AUC is computed treating the results as if they came
from a single fold. This is not really correct, since the class
probabilities from different folds are not necessarily comparable, but as
this will most often occur in leave-one-out experiments,
comparability shouldn't be a problem.

Computing and printing out the AUCs looks just like printing out
classification accuracies (except that we call AUC instead of
CA, of course)::

    AUCs = Orange.evaluation.scoring.AUC(res)
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, AUCs[l])

For vehicle, you can run exactly this same code; it will compute AUCs
for all pairs of classes and return the average weighted by probabilities
of pairs. Or, you can specify the averaging method yourself, like this::

    AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll)

The following snippet tries out all four. (We don't claim that this is
how the function needs to be used; it's better to stay with the default.)::

    methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
    print " " * 25 + " \tbayes\ttree\tmajority"
    for i in range(4):
        AUCs = Orange.evaluation.scoring.AUC(resVeh, i)
        print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

As you can see from the output::

                                bayes   tree    majority
       by pairs, weighted:      0.789   0.871   0.500
                 by pairs:      0.791   0.872   0.500
    one vs. all, weighted:      0.783   0.800   0.500
              one vs. all:      0.783   0.800   0.500

.. autofunction:: AUC_single

.. autofunction:: AUC_pair

.. autofunction:: AUC_matrix

The remaining functions, which plot the curves and statistically compare
them, require that the results come from a test with a single iteration,
and they always compare one chosen class against all others. If you have
cross-validation results, you can either use :obj:`split_by_iterations` to
split the results by folds, call the function for each fold separately and
then sum the results up however you see fit, or you can set the
:obj:`ExperimentResults`' attribute :obj:`number_of_iterations` to 1, to
cheat the function, at your own responsibility for the statistical
correctness. Regarding multi-class problems, if you don't choose a specific
class, :obj:`Orange.evaluation.scoring` will use the class attribute's
:obj:`baseValue` at the time when results were computed. If :obj:`baseValue`
was not given at that time, 1 (that is, the second class) is used as the
default.
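
For instance, a minimal sketch of the fold-by-fold approach, using :obj:`CA`
as a stand-in for any function that requires single-iteration results
(assuming :obj:`res` holds the cross-validation results prepared at the top
of this page)::

    folds = Orange.evaluation.scoring.split_by_iterations(res)
    for fold_res in folds:
        # each element is a single-iteration ExperimentResults, so the
        # plotting and comparison functions below can be applied to it
        print Orange.evaluation.scoring.CA(fold_res)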

We shall use the following code to prepare suitable experimental results::

    ri2 = Orange.core.MakeRandomIndices2(voting, 0.6)
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test)

.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC
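
A minimal sketch of inspecting the computed curves. This assumes that
:obj:`compute_ROC` returns one curve per learner, each given as a list of
(false positive rate, true positive rate) points; check the function
documentation above for the exact form::

    curves = Orange.evaluation.scoring.compute_ROC(res1)
    for l in range(len(learners)):
        print "%s: %i ROC points" % (learners[l].name, len(curves[l]))
        # print the first few points of the curve
        for fpr, tpr in curves[l][:5]:
            print "    FPR=%5.3f TPR=%5.3f" % (fpr, tpr)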

Comparison of Algorithms
------------------------

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two
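
A minimal sketch, assuming :obj:`McNemar_of_two` takes the results and the
indices of the two classifiers to compare, and returns the McNemar test
statistic::

    chi2 = Orange.evaluation.scoring.McNemar_of_two(res1, 0, 1)
    print "McNemar statistic for %s vs. %s: %5.3f" % (
        learners[0].name, learners[1].name, chi2)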

339 | |

340 | ========== |

341 | Regression |

342 | ========== |

343 | |

344 | General Measure of Quality |

345 | ========================== |

346 | |

347 | Several alternative measures, as given below, can be used to evaluate |

348 | the sucess of numeric prediction: |

349 | |

.. image:: files/statRegression.png
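
For reference, the standard definitions of these scores, which the figure
above illustrates, are given below (:math:`y_i` are the true values,
:math:`\hat{y}_i` the predictions and :math:`\bar{y}` the mean of the true
values):

.. math::

   \mathrm{MSE} = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2, \quad
   \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad
   \mathrm{MAE} = \frac{1}{n} \sum_i |y_i - \hat{y}_i|

.. math::

   \mathrm{RSE} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \quad
   \mathrm{RRSE} = \sqrt{\mathrm{RSE}}, \quad
   \mathrm{RAE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i - \bar{y}|}, \quad
   R^2 = 1 - \mathrm{RSE}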

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2

The following code (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to
score several regression methods:

.. literalinclude:: code/statExamplesRegression.py

The code above produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj      84.585   9.197   6.653   1.002   1.001   1.001  -0.002
    rt       40.015   6.326   4.592   0.474   0.688   0.691   0.526
    knn      21.248   4.610   2.870   0.252   0.502   0.432   0.748
    lr       24.092   4.908   3.425   0.285   0.534   0.515   0.715

==================
Plotting Functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman
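
A minimal sketch of computing a critical difference for average ranks like
those used in the script above, assuming :obj:`compute_CD` takes the average
ranks and the number of data sets they were averaged over::

    avranks = [1.9, 3.2, 2.8, 3.3]   # made-up average ranks of four methods
    # critical difference for the Nemenyi test over 30 data sets
    cd = Orange.evaluation.scoring.compute_CD(avranks, 30)
    print cd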

=================
Utility Functions
=================

.. autofunction:: split_by_iterations

=====================================
Scoring for multilabel classification
=====================================

Multi-label classification requires different metrics than those used in
traditional single-label classification. This module presents the various
metrics that have been proposed in the literature. Let :math:`D` be a
multi-label evaluation data set, consisting of :math:`|D|` multi-label
examples :math:`(x_i, Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let
:math:`H` be a multi-label classifier and :math:`Z_i = H(x_i)` be the set
of labels predicted by :math:`H` for example :math:`x_i`.
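
In this notation, the standard definitions of the measures below, as given
in the references at the end of this page, read as follows (with
:math:`\triangle` the symmetric difference of two sets; the implementations
are assumed to follow these definitions):

.. math::

   \mathrm{HammingLoss}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
      \frac{|Y_i \triangle Z_i|}{|L|}

.. math::

   \mathrm{Accuracy}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
      \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}, \quad
   \mathrm{Precision}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
      \frac{|Y_i \cap Z_i|}{|Z_i|}, \quad
   \mathrm{Recall}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
      \frac{|Y_i \cap Z_i|}{|Y_i|}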

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall

So, let's compute all this and print it out (part of
:download:`mlc-evaluate.py <code/mlc-evaluate.py>`, uses
:download:`emotions.tab <code/emotions.tab>`):

.. literalinclude:: code/mlc-evaluate.py
   :lines: 1-15

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]

References
==========

Boutell, M. R., Luo, J., Shen, X. & Brown, C. M. (2004), 'Learning multi-label scene classification',
Pattern Recognition, vol. 37, no. 9, pp. 1757-71.

Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', paper
presented to Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2004).

Schapire, R. E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization',
Machine Learning, vol. 39, no. 2/3, pp. 135-68.
