.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index:: scoring

Scoring plays an integral role in the evaluation of any prediction model.
Orange implements various scores for the evaluation of classification,
regression and multi-label models. Most of the methods need to be called
with an instance of :obj:`ExperimentResults`.

To prepare some data for the examples on this page, we load the voting data
set (the problem of predicting a congressman's party, republican or
democrat, based on a selection of votes) and evaluate the naive Bayesian
learner, classification trees and the majority classifier using
cross-validation. For examples requiring a multivalued class problem, we do
the same with the vehicle data set (telling whether a vehicle described by
features extracted from a picture is a van, a bus, or an Opel or Saab car).
The basic cross-validation setup is shown in the following part of
:download:`statExamples.py <code/statExamples.py>`:

.. literalinclude:: code/statExample0.py

If instances are weighted, weights are taken into account. This can be
disabled by giving :obj:`unweighted=1` as a keyword argument. Another way of
disabling weights is to clear the
:class:`Orange.evaluation.testing.ExperimentResults`' flag :obj:`weights`.

==============
Classification
==============

Many scores for the evaluation of classification models can be computed
solely from the confusion matrix, constructed manually with the
:obj:`confusion_matrices` function. If the class variable has more than two
values, the index of the value to compute the confusion matrix for should
be passed as well.
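
For instance, classification accuracy can be recomputed by hand from the
counts in the matrix. A minimal sketch, assuming ``res`` holds the
cross-validation results prepared above::

    cm = Orange.evaluation.scoring.confusion_matrices(res)[0]  # first learner
    n = cm.TP + cm.FP + cm.FN + cm.TN  # (weighted) number of instances
    print "accuracy: %5.3f" % ((cm.TP + cm.TN) / float(n))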

Calibration scores
==================

.. autofunction:: CA
.. autofunction:: sens
.. autofunction:: spec
.. autofunction:: PPV
.. autofunction:: NPV
.. autofunction:: precision
.. autofunction:: recall
.. autofunction:: F1
.. autofunction:: Falpha
.. autofunction:: MCC
.. autofunction:: AP
.. autofunction:: IS
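
Scores such as precision, recall and F1 accept a confusion matrix (see
:obj:`confusion_matrices` below). A minimal sketch, assuming ``res`` holds
the cross-validation results prepared above::

    cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
    print "precision: %5.3f" % Orange.evaluation.scoring.precision(cm)
    print "recall:    %5.3f" % Orange.evaluation.scoring.recall(cm)
    print "F1:        %5.3f" % Orange.evaluation.scoring.F1(cm)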

Discriminatory scores
=====================

.. autofunction:: Brier_score
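
A minimal sketch, assuming that :obj:`Brier_score`, like :obj:`CA`, takes
an :obj:`ExperimentResults` instance and returns one score per learner::

    briers = Orange.evaluation.scoring.Brier_score(res)
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, briers[l])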

General Measures of Quality
===========================

So, let's compute CA, AP, the Brier score and IS in part of
:download:`statExamples.py <code/statExamples.py>` and print them out:

.. literalinclude:: code/statExample1.py
   :lines: 13-

The output should look like this::

    method  CA      AP      Brier    IS
    bayes   0.903   0.902   0.175    0.759
    tree    0.846   0.845   0.286    0.641
    majority  0.614   0.526   0.474   -0.000

Script :download:`statExamples.py <code/statExamples.py>` contains another
example that also prints out the standard errors.

Confusion Matrix
================

.. autoclass:: ConfusionMatrix

.. autofunction:: confusion_matrices

   **A positive-negative confusion matrix** is computed (a) if the class is
   binary, unless the :obj:`classIndex` argument is -2, or (b) if the class
   is multivalued and :obj:`classIndex` is non-negative. The argument
   :obj:`classIndex` then tells which class is positive. In case (a),
   :obj:`classIndex` may be omitted; the first class is then negative and
   the second positive, unless the :obj:`baseClass` attribute of the results
   object has a non-negative value. In that case, :obj:`baseClass` is the
   index of the target class; this attribute has to be set manually. The
   function returns a list of instances of class :class:`ConfusionMatrix`,
   one per learner, containing the (weighted) number of true positives (TP),
   false negatives (FN), false positives (FP) and true negatives (TN).

   We can also add the keyword argument :obj:`cutoff` (e.g.
   ``confusion_matrices(results, cutoff=0.3)``); if we do,
   :obj:`confusion_matrices` will disregard the classifiers' class
   predictions and instead observe the predicted probabilities, considering
   a prediction "positive" if the predicted probability of the positive
   class is higher than the :obj:`cutoff`.

   The example (part of :download:`statExamples.py <code/statExamples.py>`)
   below shows how lowering the cutoff threshold from the default 0.5 to 0.2
   affects the confusion matrices for the naive Bayesian classifier::

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

       cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

   The output::

       Confusion matrix for naive Bayes:
       TP: 238, FP: 13, FN: 29.0, TN: 155
       Confusion matrix for naive Bayes:
       TP: 239, FP: 18, FN: 28.0, TN: 150

   shows that the number of true positives increases (and hence the number
   of false negatives decreases) by only a single instance, while five
   instances that were originally true negatives become false positives due
   to the lower threshold.

   To observe how good the classifiers are at detecting vans in the vehicle
   data set, we would compute the matrix like this::

      cm = Orange.evaluation.scoring.confusion_matrices(resVeh, vehicle.domain.classVar.values.index("van"))

   and get results like these::

       TP: 189, FP: 241, FN: 10.0, TN: 406

   while the same for class "opel" would give::

       TP: 86, FP: 112, FN: 126.0, TN: 522

   The main difference is that there are only a few false negatives for the
   van, meaning that the classifier seldom misses it (if it says it's not a
   van, it's almost certainly not a van). Not so for the Opel car, where the
   classifier missed 126 of them and correctly detected only 86.

   A **general confusion matrix** is computed (a) in case of a binary class,
   when :obj:`classIndex` is set to -2, or (b) when the class is multivalued
   and the caller doesn't specify the :obj:`classIndex` of the positive
   class. When called in this manner, the function cannot use the argument
   :obj:`cutoff`.

   The function then returns a three-dimensional matrix, where the element
   A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`]
   gives the number of instances belonging to 'actual_class' for which the
   'learner' predicted 'predictedClass'. We shall compute and print out
   the matrix for the naive Bayesian classifier.

   Here we see another example from :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0]
       classes = vehicle.domain.classVar.values
       print "\t"+"\t".join(classes)
       for className, classConfusions in zip(classes, cm):
           print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

   So, here's what this piece of code gives::

              bus   van  saab opel
       bus     56   95   21   46
       van     6    189  4    0
       saab    3    75   73   66
       opel    4    71   51   86

   Vans are clearly simple: 189 vans were classified as vans (we know this
   already, we've printed it out above), and the 10 misclassified pictures
   were classified as buses (6) and Saab cars (4). In all other classes,
   there were more instances misclassified as vans than correctly classified
   instances. The classifier is obviously quite biased towards vans.

   With the confusion matrix defined in terms of positive and negative
   classes, you can also compute the
   `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_
   [TP/(TP+FN)], `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_
   [TN/(TN+FP)], `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_
   [TP/(TP+FP)] and `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_
   [TN/(TN+FN)]. In information retrieval, positive predictive value is
   called precision (the ratio of the number of relevant records retrieved
   to the total number of irrelevant and relevant records retrieved), and
   sensitivity is called
   `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_
   (the ratio of the number of relevant records retrieved to the total
   number of relevant records in the database). The
   `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision
   and recall is called an
   `F-measure <http://en.wikipedia.org/wiki/F-measure>`_; with precision and
   recall weighted equally, it is implemented as F1
   [2*precision*recall/(precision+recall)], and for the general case as
   Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)].
   The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
   is in essence a correlation coefficient between the observed and
   predicted binary classifications; it returns a value between -1 and +1.
   A coefficient of +1 represents a perfect prediction, 0 an average random
   prediction and -1 an inverse prediction.
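
   These definitions translate directly into code. A minimal sketch,
   computing a few of the scores by hand from the counts in a
   :class:`ConfusionMatrix` (``cm`` as obtained above)::

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       sens = cm.TP / float(cm.TP + cm.FN)   # sensitivity (= recall)
       spec = cm.TN / float(cm.TN + cm.FP)   # specificity
       prec = cm.TP / float(cm.TP + cm.FP)   # positive predictive value
       f1 = 2 * prec * sens / (prec + sens)  # harmonic mean of precision and recall
       print "sens: %5.3f, spec: %5.3f, precision: %5.3f, F1: %5.3f" % (sens, spec, prec, f1)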

   If the argument :obj:`confm` is a single confusion matrix, a single
   result (a number) is returned. If :obj:`confm` is a list of confusion
   matrices, a list of scores is returned, one for each confusion matrix.

   Note that weights are taken into account when computing the matrix, so
   these functions don't check the 'weighted' keyword argument.

   Let us print out sensitivities and specificities of our classifiers in
   part of :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(res)
       print
       print "method\tsens\tspec"
       for l in range(len(learners)):
           print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l]))

ROC Analysis
============

`Receiver Operating Characteristic \
<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
(ROC) analysis was initially developed for binary classification problems;
there is no consensus on how to apply it to multi-class problems, nor do we
know for sure how to do ROC analysis after cross-validation and similar
multiple sampling techniques. If you are interested in the area under the
curve, the function AUC deals with those problems as described below.

.. autofunction:: AUC

   .. attribute:: AUC.ByWeightedPairs (or 0)

      Computes AUC for each pair of classes (ignoring instances of all other
      classes) and averages the results, weighting them by the number of
      pairs of instances from these two classes (e.g. by the product of
      probabilities of the two classes). AUC computed in this way still
      behaves as a concordance index, i.e., it gives the probability that
      two randomly chosen instances from different classes will be correctly
      recognized (this is of course true only if the classifier knows
      from which two classes the instances came).

   .. attribute:: AUC.ByPairs (or 1)

      Similar to the above, except that the average over class pairs is not
      weighted. This AUC is, like the binary one, independent of class
      distributions, but it is no longer related to the concordance index.

   .. attribute:: AUC.WeightedOneAgainstAll (or 2)

      For each class, it computes AUC for this class against all others
      (that is, treating the other classes as one class). The AUCs are then
      averaged by the class probabilities. This is related to the
      concordance index in which we test the classifier's (average)
      capability of distinguishing the instances from a specified class from
      those that come from other classes. Unlike the binary AUC, the measure
      is not independent of class distributions.

   .. attribute:: AUC.OneAgainstAll (or 3)

      As above, except that the average is not weighted.

   In case of multiple folds (for instance, if the data comes from cross
   validation), the computation goes like this. When computing the partial
   AUCs for individual pairs of classes or singled-out classes, AUC is
   computed for each fold separately and then averaged (ignoring the number
   of instances in each fold; it's just a simple average). However, if a
   certain fold doesn't contain any instances of a certain class (from the
   pair), the partial AUC is computed treating the results as if they came
   from a single fold. This is not really correct, since the class
   probabilities from different folds are not necessarily comparable; yet,
   since this will most often occur in leave-one-out experiments,
   comparability shouldn't be a problem.

   Computing and printing out the AUCs looks just like printing out
   classification accuracies (except that we call AUC instead of
   CA, of course)::

       AUCs = Orange.evaluation.scoring.AUC(res)
       for l in range(len(learners)):
           print "%10s: %5.3f" % (learners[l].name, AUCs[l])

   For vehicle, you can run exactly the same code; it will compute AUCs
   for all pairs of classes and return the average weighted by probabilities
   of pairs. Or, you can specify the averaging method yourself, like this::

       AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll)

   The following snippet tries out all four. (We don't claim that this is
   how the function needs to be used; it's better to stay with the default.)::

       methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
       print " " *25 + "  \tbayes\ttree\tmajority"
       for i in range(4):
           AUCs = Orange.evaluation.scoring.AUC(resVeh, i)
           print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

   The output::

                                   bayes   tree    majority
              by pairs, weighted:  0.789   0.871   0.500
                        by pairs:  0.791   0.872   0.500
           one vs. all, weighted:  0.783   0.800   0.500
                     one vs. all:  0.783   0.800   0.500

.. autofunction:: AUC_single

.. autofunction:: AUC_pair

.. autofunction:: AUC_matrix

The remaining functions, which plot the curves and statistically compare
them, require that the results come from a test with a single iteration,
and they always compare one chosen class against all others. If you have
cross-validation results, you can either use :obj:`split_by_iterations` to
split the results by folds, call the function for each fold separately and
then combine the results however you see fit, or you can set the
:obj:`ExperimentResults`' attribute :obj:`number_of_iterations` to 1, to
cheat the function - at your own responsibility for the statistical
correctness. Regarding multi-class problems, if you don't choose a specific
class, :obj:`Orange.evaluation.scoring` will use the class attribute's
:obj:`baseValue` at the time when the results were computed. If
:obj:`baseValue` was not given at that time, 1 (that is, the second class)
is used as the default.

We shall use the following code to prepare suitable experimental results::

    ri2 = Orange.data.sample.SubsetIndices2(voting, 0.6)  # 60:40 split indices
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test)

.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC

Comparison of Algorithms
------------------------

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two
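
McNemar's test compares two classifiers through the numbers of instances
that exactly one of them misclassifies. As a hand-worked illustration of
the statistic itself (the counts below are hypothetical, not produced by
the functions above)::

    # n01: misclassified by the first classifier only,
    # n10: misclassified by the second classifier only
    n01, n10 = 29, 13
    # chi-square statistic with a continuity correction, 1 degree of freedom
    chi2 = (abs(n01 - n10) - 1) ** 2 / float(n01 + n10)
    print "McNemar's chi-square: %5.3f" % chi2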

==========
Regression
==========

General Measures of Quality
===========================

Several alternative measures, as given below, can be used to evaluate
the success of numeric prediction:

.. image:: files/statRegression.png

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2
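
These measures follow from simple definitions. A short, self-contained
sketch (``actual`` and ``predicted`` are hypothetical example vectors; the
functions above compute the same quantities from :obj:`ExperimentResults`)::

    import math

    actual = [3.1, 2.7, 5.6, 4.0]
    predicted = [2.9, 3.1, 5.1, 4.4]
    n = len(actual)
    mean = sum(actual) / n

    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    # relative squared error: squared errors relative to a mean predictor
    rse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / \
          sum((a - mean) ** 2 for a in actual)
    print "MSE: %.3f, RMSE: %.3f, MAE: %.3f, R2: %.3f" % \
        (mse, math.sqrt(mse), mae, 1 - rse)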

The following code (:download:`statExamplesRegression.py <code/statExamplesRegression.py>`)
uses most of the above measures to score several regression methods:

.. literalinclude:: code/statExamplesRegression.py

The code above produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715

==================
Plotting functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman

=================
Utility Functions
=================

.. autofunction:: split_by_iterations
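
For instance, scores can be computed per fold by splitting the results
first. A minimal sketch, assuming ``res`` holds cross-validation results as
above::

    for fold_res in Orange.evaluation.scoring.split_by_iterations(res):
        print Orange.evaluation.scoring.CA(fold_res)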

=====================================
Scoring for multilabel classification
=====================================

Multi-label classification requires different metrics than those used in
traditional single-label classification. This module presents the various
metrics that have been proposed in the literature. Let :math:`D` be a
multi-label evaluation data set, consisting of :math:`|D|` multi-label
examples :math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let
:math:`H` be a multi-label classifier and :math:`Z_i=H(x_i)` be the set of
labels predicted by :math:`H` for example :math:`x_i`.

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall
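
These definitions are easy to compute by hand with Python sets. A minimal
sketch of Hamming loss and accuracy (the label sets below are hypothetical)::

    # true and predicted label sets for two examples, over labels {a, b, c}
    Y = [set(["a", "b"]), set(["c"])]
    Z = [set(["a"]), set(["b", "c"])]
    n_labels = 3

    # Hamming loss: average fraction of labels on which Y_i and Z_i disagree
    hloss = sum(len(y ^ z) for y, z in zip(Y, Z)) / float(n_labels * len(Y))
    # accuracy: average Jaccard overlap between true and predicted label sets
    acc = sum(len(y & z) / float(len(y | z)) for y, z in zip(Y, Z)) / len(Y)
    print "hamming loss: %.3f, accuracy: %.3f" % (hloss, acc)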

So, let's compute all this and print it out (part of
:download:`mlc-evaluate.py <code/mlc-evaluate.py>`):

.. literalinclude:: code/mlc-evaluate.py
   :lines: 1-15

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]

References
==========

Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label
scene classification', Pattern Recognition, vol. 37, no. 9, pp. 1757-1771.

Godbole, S. & Sarawagi, S. (2004), 'Discriminative methods for multi-labeled
classification', in Proceedings of the 8th Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD 2004).

Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for
text categorization', Machine Learning, vol. 39, no. 2/3, pp. 135-168.