.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index:: scoring

This module contains various measures of quality for classification and
regression. Most functions require an argument named :obj:`res`, an instance of
:class:`Orange.evaluation.testing.ExperimentResults` as computed by
functions from :mod:`Orange.evaluation.testing`; such an object contains
predictions obtained through cross-validation, leave-one-out, testing on
training data, or a separate test set.

==============
Classification
==============

To prepare some data for examples on this page, we shall load the voting data
set (the problem of predicting a congressman's party, republican or democrat,
from a selection of votes) and evaluate a naive Bayesian learner,
classification trees and a majority classifier using cross-validation.
For examples requiring a multivalued class problem, we shall do the same
with the vehicle data set (telling whether a vehicle described by features
extracted from a picture is a van, a bus, or an Opel or Saab car).

A basic cross-validation example is shown in the following part of
:download:`statExamples.py <code/statExamples.py>`:

.. literalinclude:: code/statExample0.py

If instances are weighted, weights are taken into account. This can be
disabled by passing :obj:`unweighted=1` as a keyword argument. Another way of
disabling weights is to clear the weights flag of the
:class:`Orange.evaluation.testing.ExperimentResults` object.
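
A minimal sketch of the first option, assuming :obj:`res` holds the
cross-validation results computed above (the keyword is simply passed to the
scoring function)::

    # compute classification accuracy while ignoring instance weights
    CAs = Orange.evaluation.scoring.CA(res, unweighted=1)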

General Measures of Quality
===========================

.. autofunction:: CA

.. autofunction:: AP

.. autofunction:: Brier_score

.. autofunction:: IS

So, let's compute all of these scores in a part of
:download:`statExamples.py <code/statExamples.py>` and print them out:

.. literalinclude:: code/statExample1.py
   :lines: 13-

The output should look like this::

    method  CA      AP      Brier    IS
    bayes   0.903   0.902   0.175    0.759
    tree    0.846   0.845   0.286    0.641
    majorty  0.614   0.526   0.474   -0.000

Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out
the standard errors.
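
As an illustration only (not the code from statExamples.py), a standard error
for classification accuracy can also be estimated by hand with the usual
binomial approximation, assuming ``res``, ``learners`` and ``voting`` are
defined as in the example above::

    import math

    CAs = Orange.evaluation.scoring.CA(res)
    n = len(voting)  # number of instances tested across all folds
    for learner, ca in zip(learners, CAs):
        se = math.sqrt(ca * (1 - ca) / n)  # binomial standard error
        print "%-8s %5.3f +- %5.3f" % (learner.name, ca, se)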

Confusion Matrix
================

.. autofunction:: confusion_matrices

   **A positive-negative confusion matrix** is computed (a) if the class is
   binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is
   multivalued and :obj:`classIndex` is non-negative. The argument
   :obj:`classIndex` then tells which class is positive. In case (a),
   :obj:`classIndex` may be omitted; the first class
   is then negative and the second is positive, unless the :obj:`baseClass`
   attribute of the object with results has a non-negative value. In that case,
   :obj:`baseClass` is the index of the target class. The :obj:`baseClass`
   attribute of the results object should be set manually. The result of the
   function is a list of instances of class :class:`ConfusionMatrix`,
   containing the (weighted) number of true positives (TP), false
   negatives (FN), false positives (FP) and true negatives (TN).

   We can also add the keyword argument :obj:`cutoff`
   (e.g. ``confusion_matrices(results, cutoff=0.3)``); if we do, :obj:`confusion_matrices`
   will disregard the classifiers' class predictions and observe the predicted
   probabilities, and consider the prediction "positive" if the predicted
   probability of the positive class is higher than the :obj:`cutoff`.

   The example (part of :download:`statExamples.py <code/statExamples.py>`) below shows how lowering the
   cutoff threshold from the default 0.5 to 0.2 affects the confusion matrices
   for the naive Bayesian classifier::

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

       cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

   The output::

       Confusion matrix for naive Bayes:
       TP: 238, FP: 13, FN: 29.0, TN: 155
       Confusion matrix for naive Bayes:
       TP: 239, FP: 18, FN: 28.0, TN: 150

   shows that the number of true positives increases (and hence the number of
   false negatives decreases) by only a single instance, while five instances
   that were originally true negatives become false positives due to the
   lower threshold.

   To observe how good the classifiers are at detecting vans in the vehicle
   data set, we would compute the matrix like this::

      cm = Orange.evaluation.scoring.confusion_matrices(resVeh, vehicle.domain.classVar.values.index("van"))

   and get results like these::

       TP: 189, FP: 241, FN: 10.0, TN: 406

   while the same for class "opel" would give::

       TP: 86, FP: 112, FN: 126.0, TN: 522

   The main difference is that there are only a few false negatives for the
   van, meaning that the classifier seldom misses it (if it says it's not a
   van, it's almost certainly not a van). Not so for the Opel car, where the
   classifier missed 126 of them and correctly detected only 86.

   **A general confusion matrix** is computed (a) in the case of a binary class,
   when :obj:`classIndex` is set to -2, or (b) when we have a multivalued class
   and the caller doesn't specify the :obj:`classIndex` of the positive class.
   When called in this manner, the function cannot use the argument
   :obj:`cutoff`.

   The function then returns a three-dimensional matrix, where the element
   A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`]
   gives the number of instances belonging to 'actual_class' for which the
   'learner' predicted 'predictedClass'. We shall compute and print out
   the matrix for the naive Bayesian classifier.

   Here we see another example from :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0]
       classes = vehicle.domain.classVar.values
       print "\t"+"\t".join(classes)
       for className, classConfusions in zip(classes, cm):
           print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

   So, here's what this nice piece of code gives::

              bus   van  saab opel
       bus     56   95   21   46
       van     6    189  4    0
       saab    3    75   73   66
       opel    4    71   51   86

   Vans are clearly simple: 189 vans were classified as vans (we know this
   already, we've printed it out above), and the 10 misclassified pictures
   were classified as buses (6) and Saab cars (4). In all other classes,
   there were more instances misclassified as vans than correctly classified
   instances. The classifier is obviously quite biased towards vans.

   .. method:: sens(confm)
   .. method:: spec(confm)
   .. method:: PPV(confm)
   .. method:: NPV(confm)
   .. method:: precision(confm)
   .. method:: recall(confm)
   .. method:: F2(confm)
   .. method:: Falpha(confm, alpha=2.0)
   .. method:: MCC(confm)

   With the confusion matrix defined in terms of positive and negative
   classes, you can also compute the
   `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_
   [TP/(TP+FN)], `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_
   [TN/(TN+FP)], `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_
   [TP/(TP+FP)] and `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)].
   In information retrieval, positive predictive value is called precision
   (the ratio of the number of relevant records retrieved to the total number
   of irrelevant and relevant records retrieved), and sensitivity is called
   `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_
   (the ratio of the number of relevant records retrieved to the total number
   of relevant records in the database). The
   `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision
   and recall is called an
   `F-measure <http://en.wikipedia.org/wiki/F-measure>`_, which, depending
   on the relative weighting of precision and recall, is implemented
   as F1 [2*precision*recall/(precision+recall)] or, for the general case,
   Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)].
   The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
   is in essence a correlation coefficient between
   the observed and predicted binary classifications; it returns a value
   between -1 and +1. A coefficient of +1 represents a perfect prediction,
   0 an average random prediction and -1 an inverse prediction.
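
   The same quantities can of course be obtained with the methods above; the
   following sketch merely spells the formulas out by hand for a single
   confusion matrix (an illustration, not the module's implementation)::

       import math

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       tp, fp, fn, tn = map(float, (cm.TP, cm.FP, cm.FN, cm.TN))
       precision = tp / (tp + fp)
       recall = tp / (tp + fn)                      # same as sensitivity
       f1 = 2 * precision * recall / (precision + recall)
       mcc = (tp * tn - fp * fn) / math.sqrt(
           (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
       print "precision: %5.3f, recall: %5.3f" % (precision, recall)
       print "F1: %5.3f, MCC: %5.3f" % (f1, mcc)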

   If the argument :obj:`confm` is a single confusion matrix, a single
   result (a number) is returned. If :obj:`confm` is a list of confusion matrices,
   a list of scores is returned, one for each confusion matrix.

   Note that weights are taken into account when computing the matrix, so
   these functions don't check the 'weighted' keyword argument.

   Let us print out sensitivities and specificities of our classifiers in
   part of :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(res)
       print
       print "method\tsens\tspec"
       for l in range(len(learners)):
           print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l]))

ROC Analysis
============

`Receiver Operating Characteristic \
<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
(ROC) analysis was initially developed for
binary (two-class) problems, and there is no consensus on how to apply it to
multi-class problems, nor is it entirely clear how to do ROC analysis after
cross-validation and similar multiple-sampling techniques. If you are
interested in the area under the curve, the function AUC deals with these
problems as described below.

.. autofunction:: AUC

   .. attribute:: AUC.ByWeightedPairs (or 0)

      Computes AUC for each pair of classes (ignoring instances of all other
      classes) and averages the results, weighting them by the number of
      pairs of instances from these two classes (e.g. by the product of
      probabilities of the two classes). AUC computed in this way still
      behaves as a concordance index, i.e., it gives the probability that two
      randomly chosen instances from different classes will be correctly
      recognized (this is of course true only if the classifier knows
      from which two classes the instances came).

   .. attribute:: AUC.ByPairs (or 1)

      Similar to the above, except that the average over class pairs is not
      weighted. This AUC is, like the binary AUC, independent of class
      distributions, but it is no longer related to the concordance index.

   .. attribute:: AUC.WeightedOneAgainstAll (or 2)

      For each class, it computes the AUC for this class against all others (that
      is, treating the other classes as one class). The AUCs are then averaged by
      the class probabilities. This is related to a concordance index in which
      we test the classifier's (average) capability of distinguishing the
      instances from a specified class from those that come from other classes.
      Unlike the binary AUC, the measure is not independent of class
      distributions.

   .. attribute:: AUC.OneAgainstAll (or 3)

      As above, except that the average is not weighted.

   In case of multiple folds (for instance if the data comes from cross
   validation), the computation goes like this. When computing the partial
   AUCs for individual pairs of classes or singled-out classes, AUC is
   computed for each fold separately and then averaged (ignoring the number
   of instances in each fold, it's just a simple average). However, if a
   certain fold doesn't contain any instances of a certain class (from the
   pair), the partial AUC is computed treating the results as if they came
   from a single fold. This is not really correct, since the class
   probabilities from different folds are not necessarily comparable, but as
   this will most often occur in leave-one-out experiments, comparability
   shouldn't be a problem.

   Computing and printing out the AUCs looks just like printing out
   classification accuracies (except that we call AUC instead of
   CA, of course)::

       AUCs = Orange.evaluation.scoring.AUC(res)
       for l in range(len(learners)):
           print "%10s: %5.3f" % (learners[l].name, AUCs[l])

   For vehicle, you can run exactly the same code; it will compute AUCs
   for all pairs of classes and return the average weighted by probabilities
   of pairs. Or, you can specify the averaging method yourself, like this::

       AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll)

   The following snippet tries out all four. (We don't claim that this is
   how the function needs to be used; it's better to stay with the default.)::

       methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
       print " " *25 + "  \tbayes\ttree\tmajority"
       for i in range(4):
           AUCs = Orange.evaluation.scoring.AUC(resVeh, i)
           print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

   As you can see from the output::

                                bayes   tree    majority
           by pairs, weighted:  0.789   0.871   0.500
                     by pairs:  0.791   0.872   0.500
        one vs. all, weighted:  0.783   0.800   0.500
                  one vs. all:  0.783   0.800   0.500

.. autofunction:: AUC_single

.. autofunction:: AUC_pair

.. autofunction:: AUC_matrix

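For instance, assuming the second argument of :obj:`AUC_single` is the index
of the class to be singled out (see the signatures rendered above), the AUC
for recognizing vans could be estimated with::

    vanIndex = vehicle.domain.classVar.values.index("van")
    print Orange.evaluation.scoring.AUC_single(resVeh, vanIndex)
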
The remaining functions, which plot the curves and statistically compare
them, require that the results come from a test with a single iteration,
and they always compare one chosen class against all others. If you have
cross-validation results, you can either use :obj:`split_by_iterations` to
split the results by folds, call the function for each fold separately and
then aggregate the results however you see fit (see the sketch below), or
you can set the ExperimentResults' attribute number_of_iterations to 1 to
cheat the function - at your own responsibility for the statistical
correctness. Regarding multi-class problems, if you don't choose a specific
class, Orange.evaluation.scoring will use the class attribute's baseValue at
the time when the results were computed. If baseValue was not given at that
time, 1 (that is, the second class) is used as the default.
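
A rough sketch of the first approach (return formats are assumed:
:obj:`split_by_iterations` is taken to return one set of results per fold and
:obj:`AUCWilcoxon` one score per learner)::

    folds = Orange.evaluation.scoring.split_by_iterations(res)
    # compute the Wilcoxon-based AUC estimate on each fold separately
    per_fold = [Orange.evaluation.scoring.AUCWilcoxon(fold) for fold in folds]
    for l in range(len(learners)):
        # average the per-fold scores for each learner
        scores = [f[l] for f in per_fold]
        print "%10s: %5.3f" % (learners[l].name, sum(scores) / len(scores))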

We shall use the following code to prepare suitable experimental results::

    ri2 = Orange.data.sample.SubsetIndices2(voting, 0.6)
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test)

.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC
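
Continuing with the single-split results prepared above, the ROC points could
be inspected like this (a sketch; it assumes :obj:`compute_ROC` returns one
list of (x, y) points per classifier)::

    print "AUC (Wilcoxon):", Orange.evaluation.scoring.AUCWilcoxon(res1)

    curves = Orange.evaluation.scoring.compute_ROC(res1)
    for x, y in curves[0]:   # ROC curve of the first learner
        print "%5.3f %5.3f" % (x, y)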

Comparison of Algorithms
------------------------

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two
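
A short sketch of how these could be called on the single-split results
prepared above (it assumes :obj:`McNemar_of_two` takes the results followed by
the indices of the two classifiers to compare)::

    # matrix of pairwise McNemar statistics for all classifiers
    print Orange.evaluation.scoring.McNemar(res1)

    # compare the first two classifiers directly
    print Orange.evaluation.scoring.McNemar_of_two(res1, 0, 1)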

==========
Regression
==========

General Measure of Quality
==========================

Several alternative measures, as given below, can be used to evaluate
the success of numeric prediction:

.. image:: files/statRegression.png

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2

The following code (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to
score several regression methods.

.. literalinclude:: code/statExamplesRegression.py

The code above produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715
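
Note how the measures relate to each other: RMSE is the square root of MSE,
RRSE is the square root of RSE, and R2 equals 1 - RSE (the output above obeys
these identities). A quick sketch that checks them on the same results, where
``res`` stands for the experiment results computed in the script above::

    import math

    MSEs = Orange.evaluation.scoring.MSE(res)
    RMSEs = Orange.evaluation.scoring.RMSE(res)
    RSEs = Orange.evaluation.scoring.RSE(res)
    RRSEs = Orange.evaluation.scoring.RRSE(res)
    R2s = Orange.evaluation.scoring.R2(res)
    for l in range(len(MSEs)):
        print abs(RMSEs[l] - math.sqrt(MSEs[l])) < 1e-6, \
            abs(RRSEs[l] - math.sqrt(RSEs[l])) < 1e-6, \
            abs(R2s[l] - (1 - RSEs[l])) < 1e-6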

==================
Plotting functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman
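
For example, a critical difference for a set of average ranks might be
obtained like this (a sketch with made-up numbers; the exact return value of
:obj:`compute_friedman` may differ)::

    avranks = [1.9, 3.2, 2.8, 3.3]   # average ranks of four methods
    cd = Orange.evaluation.scoring.compute_CD(avranks, 30)  # ranks over 30 data sets
    print "Critical difference:", cd
    print "Friedman test:", Orange.evaluation.scoring.compute_friedman(avranks, 30)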

=================
Utility Functions
=================

.. autofunction:: split_by_iterations

=====================================
Scoring for multilabel classification
=====================================

Multi-label classification requires different metrics than those used in traditional single-label
classification. This module presents the various metrics that have been proposed in the literature.
Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples
:math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \\subseteq L`. Let :math:`H` be a multi-label classifier
and :math:`Z_i=H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`.

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall

So, let's compute all this and print it out (part of
:download:`mlc-evaluate.py <code/mlc-evaluate.py>`):

.. literalinclude:: code/mlc-evaluate.py
   :lines: 1-15

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]
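
For intuition, the Hamming loss can be written directly from its definition:
the size of the symmetric difference between :math:`Y_i` and :math:`Z_i`,
normalized by the number of labels and averaged over the examples. A small
standalone sketch of that definition (not the module's implementation)::

    def hamming_loss(truths, predictions, n_labels):
        # |Y_i symmetric-difference Z_i| / |L|, averaged over all examples
        total = 0.0
        for y, z in zip(truths, predictions):
            total += len(set(y) ^ set(z)) / float(n_labels)
        return total / len(truths)

    print hamming_loss([set(["sports"]), set(["news", "sports"])],
                       [set(["sports"]), set(["news"])], 4)   # prints 0.125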

References
==========

Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification',
Pattern Recognition, vol. 37, no. 9, pp. 1757-1771.

Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', in
Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2004).

Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization',
Machine Learning, vol. 39, no. 2/3, pp. 135-168.