.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index:: scoring

Scoring plays an integral role in the evaluation of any prediction model.
Orange implements various scores for the evaluation of classification,
regression and multi-label models. Most of the methods need to be called
with an instance of :obj:`ExperimentResults`.

.. literalinclude:: code/statExample0.py

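Once such results are available, most scores are one-line calls. A minimal
sketch, assuming the snippet above defines a list of learners ``learners``
and cross-validation results ``res``, as the rest of this page does::

    # Classification accuracy: one score per learner in the results.
    CAs = Orange.evaluation.scoring.CA(res)
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, CAs[l])
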
==============
Classification
==============

Many scores for the evaluation of classification models can be computed
solely from the confusion matrix, constructed manually with the
:obj:`confusion_matrices` function. If the class variable has more than two
values, the index of the value for which the matrix is computed should be
passed as well.

Calibration scores
==================

.. autoclass:: CAClass
.. autofunction:: sens
.. autofunction:: spec
.. autofunction:: PPV
.. autofunction:: NPV
.. autofunction:: precision
.. autofunction:: recall
.. autofunction:: F1
.. autofunction:: Falpha
.. autofunction:: MCC
.. autofunction:: AP
.. autofunction:: IS

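Some of these scores (e.g. :obj:`F1`, :obj:`MCC`) are computed from confusion
matrices (see the Confusion Matrix section below), others (e.g. :obj:`AP`,
:obj:`IS`) directly from the results. A minimal sketch, assuming ``res`` and
``learners`` as prepared in :download:`statExamples.py <code/statExamples.py>`::

    # Threshold-based scores (F1, MCC) are derived from confusion matrices,
    # probability-based scores (AP, IS) directly from the results.
    cms = Orange.evaluation.scoring.confusion_matrices(res)
    APs = Orange.evaluation.scoring.AP(res)
    ISs = Orange.evaluation.scoring.IS(res)
    print "method\tF1\tMCC\tAP\tIS"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f\t%5.3f\t%5.3f" % (learners[l].name,
            Orange.evaluation.scoring.F1(cms[l]),
            Orange.evaluation.scoring.MCC(cms[l]),
            APs[l], ISs[l])
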
Discriminatory scores
=====================

.. autofunction:: Brier_score

AUC and the related functions (:obj:`AUC_single`, :obj:`AUC_pair`,
:obj:`AUC_matrix`, :obj:`AUCWilcoxon` and :obj:`compute_ROC`) are
documented in the ROC Analysis section below.

To prepare some data for the examples on this page, we load the voting data
set (the problem of predicting a congressman's party, Republican or Democrat,
from a selection of votes) and evaluate the naive Bayesian learner,
classification trees and the majority classifier using cross-validation.
For examples requiring a multivalued class problem, we do the same with the
vehicle data set (predicting whether a vehicle, described by features
extracted from a picture, is a van, a bus, or an Opel or Saab car).

The basic cross-validation setup is part of
:download:`statExamples.py <code/statExamples.py>`.

If instances are weighted, the weights are taken into account. This can be
disabled by passing :obj:`unweighted=1` as a keyword argument. Another way of
disabling weights is to clear the
:class:`Orange.evaluation.testing.ExperimentResults`' flag :obj:`weights`.

General Measures of Quality
===========================

Let us compute the scores described above and print them out (part of
:download:`statExamples.py <code/statExamples.py>`):

.. literalinclude:: code/statExample1.py
   :lines: 13-

The output should look like this::

    method  CA      AP      Brier    IS
    bayes   0.903   0.902   0.175    0.759
    tree    0.846   0.845   0.286    0.641
    majority  0.614   0.526   0.474   -0.000

Script :download:`statExamples.py <code/statExamples.py>` contains another
example that also prints out the standard errors.

Confusion Matrix
================

.. autoclass:: ConfusionMatrix

.. autofunction:: confusion_matrices

   **A positive-negative confusion matrix** is computed (a) if the class is
   binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is
   multivalued and :obj:`classIndex` is non-negative. The argument
   :obj:`classIndex` then tells which class is positive. In case (a),
   :obj:`classIndex` may be omitted; the first class
   is then negative and the second is positive, unless the :obj:`baseClass`
   attribute of the object with results has a non-negative value. In that
   case, :obj:`baseClass` is an index of the target class. The :obj:`baseClass`
   attribute of the results object should be set manually. The result of the
   function is a list of instances of class :class:`ConfusionMatrix`,
   containing the (weighted) number of true positives (TP), false
   negatives (FN), false positives (FP) and true negatives (TN).

   We can also add the keyword argument :obj:`cutoff`
   (e.g. ``confusion_matrices(results, cutoff=0.3)``); if we do,
   :obj:`confusion_matrices`
   will disregard the classifiers' class predictions and observe the predicted
   probabilities, considering the prediction "positive" if the predicted
   probability of the positive class is higher than the :obj:`cutoff`.

   The example (part of :download:`statExamples.py <code/statExamples.py>`)
   below shows how lowering the cutoff threshold from the default 0.5 to 0.2
   affects the confusion matrices for the naive Bayesian classifier::

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

       cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

   The output::

       Confusion matrix for naive Bayes:
       TP: 238, FP: 13, FN: 29.0, TN: 155
       Confusion matrix for naive Bayes:
       TP: 239, FP: 18, FN: 28.0, TN: 150

   shows that the number of true positives increases (and hence the number of
   false negatives decreases) by only a single instance, while five instances
   that were originally true negatives become false positives due to the
   lower threshold.

   To observe how good the classifiers are at detecting vans in the vehicle
   data set, we can compute the matrix like this::

      cm = Orange.evaluation.scoring.confusion_matrices(resVeh, vehicle.domain.classVar.values.index("van"))

   and get results like these::

       TP: 189, FP: 241, FN: 10.0, TN: 406

   while the same for class "opel" would give::

       TP: 86, FP: 112, FN: 126.0, TN: 522

   The main difference is that there are only a few false negatives for the
   van, meaning that the classifier seldom misses it (if it says it's not a
   van, it's almost certainly not a van). Not so for the Opel car, where the
   classifier missed 126 of them and correctly detected only 86.

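   To get this overview for every class at once, one can loop over the class
   values. A minimal sketch, assuming ``resVeh`` and ``vehicle`` as prepared
   in :download:`statExamples.py <code/statExamples.py>`::

       # Compute a positive-negative confusion matrix for each class value
       # (here only for the first learner in the results).
       classes = vehicle.domain.classVar.values
       for index, value in enumerate(classes):
           cm = Orange.evaluation.scoring.confusion_matrices(resVeh, index)[0]
           print "%s: TP=%i, FP=%i, FN=%i, TN=%i" % (value, cm.TP, cm.FP, cm.FN, cm.TN)
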
   **A general confusion matrix** is computed (a) in case of a binary class,
   when :obj:`classIndex` is set to -2, or (b) when the class is multivalued
   and the caller does not specify the :obj:`classIndex` of the positive
   class. When called in this manner, the function cannot use the argument
   :obj:`cutoff`.

   The function then returns a three-dimensional matrix, where the element
   A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`]
   gives the number of instances belonging to 'actual_class' for which the
   'learner' predicted 'predictedClass'. We shall compute and print out
   the matrix for the naive Bayesian classifier.

   Here we see another example from :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0]
       classes = vehicle.domain.classVar.values
       print "\t"+"\t".join(classes)
       for className, classConfusions in zip(classes, cm):
           print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

   So, here's what this nice piece of code gives::

              bus   van  saab opel
       bus     56   95   21   46
       van     6    189  4    0
       saab    3    75   73   66
       opel    4    71   51   86

   Vans are clearly simple: 189 vans were classified as vans (we know this
   already, we've printed it out above), and the 10 misclassified pictures
   were classified as buses (6) and Saab cars (4). In all other classes,
   there were more instances misclassified as vans than correctly classified
   instances. The classifier is obviously quite biased towards vans.

   With the confusion matrix defined in terms of positive and negative
   classes, you can also compute the
   `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_
   [TP/(TP+FN)], `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_
   [TN/(TN+FP)], `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_
   [TP/(TP+FP)] and `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)].
   In information retrieval, positive predictive value is called precision
   (the ratio of the number of relevant records retrieved to the total number
   of irrelevant and relevant records retrieved), and sensitivity is called
   `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_
   (the ratio of the number of relevant records retrieved to the total number
   of relevant records in the database). The
   `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision
   and recall is called the
   `F-measure <http://en.wikipedia.org/wiki/F-measure>`_, which, depending
   on the weighting of precision versus recall, is implemented
   as F1 [2*precision*recall/(precision+recall)] or, for the general case,
   Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)].
   The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
   is in essence a correlation coefficient between
   the observed and predicted binary classifications; it returns a value
   between -1 and +1. A coefficient of +1 represents a perfect prediction,
   0 an average random prediction and -1 an inverse prediction.

   If the argument :obj:`confm` is a single confusion matrix, a single
   result (a number) is returned. If :obj:`confm` is a list of confusion
   matrices, a list of scores is returned, one for each confusion matrix.

   Note that weights are taken into account when computing the matrix, so
   these functions don't check the 'weighted' keyword argument.

   Let us print out sensitivities and specificities of our classifiers in
   part of :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(res)
       print
       print "method\tsens\tspec"
       for l in range(len(learners)):
           print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l]))

ROC Analysis
============

`Receiver Operating Characteristic \
<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
(ROC) analysis was initially developed for
binary classification problems; there is no consensus on how to apply it to
multi-class problems, nor is it entirely clear how to do ROC analysis after
cross-validation and similar multiple-sampling techniques. If you are
interested in the area under the curve, the function AUC deals with those
problems as described below.

.. autofunction:: AUC

   .. attribute:: AUC.ByWeightedPairs (or 0)

      Computes AUC for each pair of classes (ignoring instances of all other
      classes) and averages the results, weighting them by the number of
      pairs of instances from these two classes (e.g. by the product of
      probabilities of the two classes). AUC computed in this way still
      behaves as a concordance index, i.e., it gives the probability that two
      randomly chosen instances from different classes will be correctly
      recognized (this is of course true only if the classifier knows
      from which two classes the instances came).

   .. attribute:: AUC.ByPairs (or 1)

      Similar to the above, except that the average over class pairs is not
      weighted. This AUC is, like the binary AUC, independent of class
      distributions, but it is no longer related to the concordance index.

   .. attribute:: AUC.WeightedOneAgainstAll (or 2)

      For each class, it computes AUC for this class against all others (that
      is, treating the other classes as one class). The AUCs are then averaged
      by the class probabilities. This is related to the concordance index in
      which we test the classifier's (average) capability of distinguishing
      instances of a specified class from those that come from other classes.
      Unlike the binary AUC, the measure is not independent of class
      distributions.

   .. attribute:: AUC.OneAgainstAll (or 3)

      As above, except that the average is not weighted.

   In case of multiple folds (for instance if the data comes from cross
   validation), the computation goes like this. When computing the partial
   AUCs for individual pairs of classes or singled-out classes, AUC is
   computed for each fold separately and then averaged (ignoring the number
   of instances in each fold; it is a simple average). However, if a
   certain fold doesn't contain any instances of a certain class (from the
   pair), the partial AUC is computed by treating the results as if they came
   from a single fold. This is not entirely correct, since the class
   probabilities from different folds are not necessarily comparable;
   however, as this will most often occur in leave-one-out experiments,
   comparability shouldn't be a problem.

   Computing and printing out the AUCs looks just like printing out
   classification accuracies (except that we call AUC instead of
   CA, of course)::

       AUCs = Orange.evaluation.scoring.AUC(res)
       for l in range(len(learners)):
           print "%10s: %5.3f" % (learners[l].name, AUCs[l])

   For the vehicle data set, you can run exactly the same code; it will
   compute AUCs for all pairs of classes and return the average weighted by
   probabilities of pairs. Or, you can specify the averaging method yourself,
   like this::

       AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll)

   The following snippet tries out all four. (We don't claim that this is
   how the function needs to be used; it's better to stay with the default.)::

       methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
       print " " *25 + "  \tbayes\ttree\tmajority"
       for i in range(4):
           AUCs = Orange.evaluation.scoring.AUC(resVeh, i)
           print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

   As you can see from the output::

                                   bayes   tree    majority
              by pairs, weighted:  0.789   0.871   0.500
                        by pairs:  0.791   0.872   0.500
           one vs. all, weighted:  0.783   0.800   0.500
                     one vs. all:  0.783   0.800   0.500

.. autofunction:: AUC_single

.. autofunction:: AUC_pair

.. autofunction:: AUC_matrix

The remaining functions, which plot the curves and statistically compare
them, require that the results come from a test with a single iteration,
and they always compare one chosen class against all others. If you have
cross-validation results, you can either use :obj:`split_by_iterations` to
split the results by folds, call the function for each fold separately and
then combine the results however you see fit (a sketch of this is shown
below), or you can set the :obj:`ExperimentResults`' attribute
:obj:`number_of_iterations` to 1 to cheat the function - at your own
responsibility for the statistical correctness. Regarding multi-class
problems, if you don't choose a specific class,
:obj:`Orange.evaluation.scoring` will use the class attribute's
:obj:`baseValue` at the time when results were computed. If :obj:`baseValue`
was not given at that time, 1 (that is, the second class) is used as the
default.

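A minimal sketch of the per-fold approach, assuming cross-validation results
``res`` and the list ``learners`` from :download:`statExamples.py <code/statExamples.py>`,
and assuming that :obj:`AUCWilcoxon` returns one (AUC, standard error) pair
per learner when given single-iteration results::

    # Split the cross-validation results into one ExperimentResults per fold,
    # score each fold separately and average the per-fold AUCs.
    folds = Orange.evaluation.scoring.split_by_iterations(res)
    sums = [0.0] * len(learners)
    for fold_res in folds:
        scores = Orange.evaluation.scoring.AUCWilcoxon(fold_res)
        for l in range(len(learners)):
            sums[l] += scores[l][0]  # assumption: first element is the AUC
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, sums[l] / len(folds))
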
We shall use the following code to prepare suitable experimental results::

    ri2 = Orange.data.sample.SubsetIndices2(voting, 0.6)
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test)

.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC

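A minimal sketch of using :obj:`compute_ROC` on the single-split results
``res1`` prepared above; it assumes the function returns one curve per
learner, given as a list of (false positive rate, true positive rate)
points::

    # Assumption: one list of (FP rate, TP rate) points per learner.
    curves = Orange.evaluation.scoring.compute_ROC(res1)
    for l in range(len(learners)):
        curve = curves[l]
        print "%10s: %i ROC points, from %s to %s" % (
            learners[l].name, len(curve), curve[0], curve[-1])
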
Comparison of Algorithms
------------------------

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two

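A minimal sketch of comparing two classifiers on the single-split results
``res1``; it assumes that :obj:`McNemar_of_two` takes the indices of the two
learners in the results and returns the value of the McNemar test statistic::

    # Assumption: learners are identified by their index in res1 and the
    # returned value is the (chi-square distributed) test statistic.
    stat = Orange.evaluation.scoring.McNemar_of_two(res1, 0, 1)
    print "McNemar statistic for %s vs. %s: %5.3f" % (
        learners[0].name, learners[1].name, stat)
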
==========
Regression
==========

General Measure of Quality
==========================

Several alternative measures, as given below, can be used to evaluate
the success of numeric prediction:

.. image:: files/statRegression.png

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2

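For reference, these scores follow the usual textbook definitions (up to
details of averaging); with :math:`y_i` the true values, :math:`\hat{y}_i`
the predictions and :math:`\bar{y}` the mean of the true values over the
:math:`n` test instances:

.. math::

   \mathrm{MSE} = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2, \qquad
   \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
   \mathrm{MAE} = \frac{1}{n} \sum_i |y_i - \hat{y}_i|,

   \mathrm{RSE} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
   \qquad
   \mathrm{RRSE} = \sqrt{\mathrm{RSE}}, \qquad
   \mathrm{RAE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i - \bar{y}|},
   \qquad
   \mathrm{R2} = 1 - \mathrm{RSE}.

The relationship :math:`\mathrm{R2} = 1 - \mathrm{RSE}` can also be seen in
the output below.
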
The following code (:download:`statExamplesRegression.py <code/statExamplesRegression.py>`)
uses most of the above measures to score several regression methods.

.. literalinclude:: code/statExamplesRegression.py

The code above produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715

==================
Plotting functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`)
shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman

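A minimal sketch of computing the critical difference for a set of average
ranks; the rank values below are made up for illustration, and ``avranks``
is assumed to hold the average rank of each method over ``N`` data sets::

    # Hypothetical average ranks of four methods over N = 30 data sets.
    avranks = [1.9, 3.2, 2.8, 3.3]
    cd = Orange.evaluation.scoring.compute_CD(avranks, 30)
    print "Critical difference:", cd
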
=================
Utility Functions
=================

.. autofunction:: split_by_iterations

=====================================
Scoring for multilabel classification
=====================================

Multi-label classification requires different metrics than those used in
traditional single-label classification. This module presents the various
metrics that have been proposed in the literature. Let :math:`D` be a
multi-label evaluation data set, consisting of :math:`|D|` multi-label
examples :math:`(x_i, Y_i)`, :math:`i = 1..|D|`, :math:`Y_i \subseteq L`.
Let :math:`H` be a multi-label classifier and :math:`Z_i = H(x_i)` be the set
of labels predicted by :math:`H` for example :math:`x_i`.

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall

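Under the above notation, these scores are commonly defined in the literature
as follows (see the references below; the exact normalization used by the
implementations may differ), with :math:`\triangle` denoting the symmetric
difference of two sets:

.. math::

   \mathrm{HammingLoss}(H, D) =
       \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \triangle Z_i|}{|L|}

   \mathrm{Accuracy}(H, D) =
       \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}

   \mathrm{Precision}(H, D) =
       \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Z_i|}

   \mathrm{Recall}(H, D) =
       \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i|}
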
Let us compute these scores and print them out (part of
:download:`mlc-evaluate.py <code/mlc-evaluate.py>`):

.. literalinclude:: code/mlc-evaluate.py
   :lines: 1-15

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]

References
==========

Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label
scene classification', Pattern Recognition, vol. 37, no. 9, pp. 1757-1771.

Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled
Classification', in Proceedings of the 8th Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD 2004).

Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for
text categorization', Machine Learning, vol. 39, no. 2/3, pp. 135-168.