.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index:: scoring

This module contains various measures of quality for classification and
regression. Most functions require an argument named :obj:`res`, an instance of
:class:`Orange.evaluation.testing.ExperimentResults` as computed by functions
from :mod:`Orange.evaluation.testing`, which contains predictions obtained
through cross-validation, leave-one-out, testing on training data, or testing
on a separate test set.

==============
Classification
==============

To prepare some data for the examples on this page, we load the voting data
set (the problem of predicting a congressman's party, republican or democrat,
based on a selection of votes) and evaluate the naive Bayesian learner,
classification trees and the majority classifier using cross-validation.
For examples requiring a multivalued class problem, we do the same with the
vehicle data set (telling whether a vehicle described by features extracted
from a picture is a van, a bus, or an Opel or Saab car).

A basic cross-validation example is shown in the following part of
:download:`statExamples.py <code/statExamples.py>` (uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`):

.. literalinclude:: code/statExample0.py

If instances are weighted, the weights are taken into account. This can be
disabled by passing :obj:`unweighted=1` as a keyword argument. Another way of
disabling weights is to clear the ``weights`` flag of the
:class:`Orange.evaluation.testing.ExperimentResults` object.

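
For instance, classification accuracy can be computed with and without
weights like this (a minimal sketch, assuming ``res`` holds results computed
from weighted instances; ``unweighted`` is the keyword argument described
above)::

    CAs = Orange.evaluation.scoring.CA(res)
    CAs_unweighted = Orange.evaluation.scoring.CA(res, unweighted=1)
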

General Measures of Quality
===========================

.. autofunction:: CA

.. autofunction:: AP

.. autofunction:: Brier_score

.. autofunction:: IS

Let us compute all of these scores in part of
:download:`statExamples.py <code/statExamples.py>` (uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) and print them out:

.. literalinclude:: code/statExample1.py
   :lines: 13-

The output should look like this::

    method  CA      AP      Brier    IS
    bayes   0.903   0.902   0.175    0.759
    tree    0.846   0.845   0.286    0.641
    majorty  0.614   0.526   0.474   -0.000

Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out
the standard errors.

Confusion Matrix
================

.. autofunction:: confusion_matrices

   **A positive-negative confusion matrix** is computed (a) if the class is
   binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is
   multivalued and :obj:`classIndex` is non-negative. The argument
   :obj:`classIndex` then tells which class is positive. In case (a),
   :obj:`classIndex` may be omitted; the first class is then negative and the
   second is positive, unless the :obj:`baseClass` attribute of the object
   with results has a non-negative value. In that case, :obj:`baseClass` is
   the index of the target class; this attribute of the results object has to
   be set manually. The result of the function is a list of instances of
   class :class:`ConfusionMatrix`, one for each learner, each containing the
   (weighted) number of true positives (TP), false negatives (FN), false
   positives (FP) and true negatives (TN).

   We can also add the keyword argument :obj:`cutoff`
   (e.g. ``confusion_matrices(results, cutoff=0.3)``); if we do,
   :obj:`confusion_matrices` will disregard the classifiers' class predictions
   and instead observe the predicted probabilities, considering the prediction
   "positive" if the predicted probability of the positive class is higher
   than the :obj:`cutoff`.

   The example below (part of :download:`statExamples.py <code/statExamples.py>`) shows how changing the
   cut-off threshold from the default 0.5 to 0.2 affects the confusion matrix
   for the naive Bayesian classifier::

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

       cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0]
       print "Confusion matrix for naive Bayes:"
       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

   The output::

       Confusion matrix for naive Bayes:
       TP: 238, FP: 13, FN: 29.0, TN: 155
       Confusion matrix for naive Bayes:
       TP: 239, FP: 18, FN: 28.0, TN: 150

   shows that the number of true positives increases (and hence the number of
   false negatives decreases) by only a single instance, while five instances
   that were originally true negatives become false positives due to the
   lower threshold.

   To observe how good the classifiers are at detecting vans in the vehicle
   data set, we would compute the matrix like this::

      cm = Orange.evaluation.scoring.confusion_matrices(resVeh,
          vehicle.domain.classVar.values.index("van"))

   and get results like these::

       TP: 189, FP: 241, FN: 10.0, TN: 406

   while the same for class "opel" would give::

       TP: 86, FP: 112, FN: 126.0, TN: 522

   The main difference is that there are only a few false negatives for the
   van, meaning that the classifier seldom misses a van (if it says a vehicle
   is not a van, it almost certainly isn't). Not so for the Opel car, where
   the classifier missed 126 of them and correctly detected only 86.

   **A general confusion matrix** is computed (a) in case of a binary class,
   when :obj:`classIndex` is set to -2, or (b) when we have a multivalued
   class and the caller doesn't specify the :obj:`classIndex` of the positive
   class. When called in this manner, the function cannot use the argument
   :obj:`cutoff`.

   The function then returns a three-dimensional matrix, where the element
   A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`]
   gives the number of instances belonging to 'actual_class' for which the
   'learner' predicted 'predictedClass'. We shall compute and print out
   the matrix for the naive Bayesian classifier.

   Here is another example from :download:`statExamples.py <code/statExamples.py>`::

       cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0]
       classes = vehicle.domain.classVar.values
       print "\t" + "\t".join(classes)
       for className, classConfusions in zip(classes, cm):
           print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

   This is what the code above gives::

              bus   van  saab opel
       bus     56   95   21   46
       van     6    189  4    0
       saab    3    75   73   66
       opel    4    71   51   86

   Vans are clearly simple: 189 vans were classified as vans (we know this
   already, we've printed it out above), and the 10 misclassified pictures
   were classified as buses (6) and Saab cars (4). In all other classes,
   there were more instances misclassified as vans than correctly classified
   instances. The classifier is obviously quite biased towards vans.

   .. method:: sens(confm)
   .. method:: spec(confm)
   .. method:: PPV(confm)
   .. method:: NPV(confm)
   .. method:: precision(confm)
   .. method:: recall(confm)
   .. method:: F1(confm)
   .. method:: Falpha(confm, alpha=2.0)
   .. method:: MCC(confm)

   With the confusion matrix defined in terms of positive and negative
   classes, you can also compute the
   `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_
   [TP/(TP+FN)],
   `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_
   [TN/(TN+FP)],
   `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_
   [TP/(TP+FP)] and
   `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_
   [TN/(TN+FN)].
   In information retrieval, positive predictive value is called precision
   (the ratio of the number of relevant records retrieved to the total number
   of irrelevant and relevant records retrieved), and sensitivity is called
   `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_
   (the ratio of the number of relevant records retrieved to the total number
   of relevant records in the database). The
   `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision
   and recall is called an
   `F-measure <http://en.wikipedia.org/wiki/F-measure>`_, which, depending on
   the weight given to precision relative to recall, is implemented as F1
   [2*precision*recall/(precision+recall)] or, for the general case, Falpha
   [(1+alpha)*precision*recall / (alpha*precision + recall)].
   The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
   is in essence a correlation coefficient between the observed and predicted
   binary classifications; it returns a value between -1 and +1. A coefficient
   of +1 represents a perfect prediction, 0 an average random prediction and
   -1 an inverse prediction.
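
   As an illustration of the formulas above, here is a minimal sketch that
   computes a few of these scores by plain arithmetic from the counts of the
   first confusion matrix printed earlier (not the module's own
   implementation, just the formulas spelled out)::

       import math

       TP, FP, FN, TN = 238.0, 13.0, 29.0, 155.0
       sens = TP / (TP + FN)        # sensitivity (= recall)
       spec = TN / (TN + FP)        # specificity
       precision = TP / (TP + FP)   # positive predictive value
       F1 = 2 * precision * sens / (precision + sens)
       MCC = (TP * TN - FP * FN) / math.sqrt(
           (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
       print "sens=%.3f spec=%.3f F1=%.3f MCC=%.3f" % (sens, spec, F1, MCC)
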

   If the argument :obj:`confm` is a single confusion matrix, a single
   result (a number) is returned. If :obj:`confm` is a list of confusion
   matrices, a list of scores is returned, one for each confusion matrix.

   Note that weights are taken into account when computing the matrix, so
   these functions don't check the 'weighted' keyword argument.

   Let us print out the sensitivities and specificities of our classifiers
   (part of :download:`statExamples.py <code/statExamples.py>`)::

       cm = Orange.evaluation.scoring.confusion_matrices(res)
       print
       print "method\tsens\tspec"
       for l in range(len(learners)):
           print "%s\t%5.3f\t%5.3f" % (learners[l].name,
               Orange.evaluation.scoring.sens(cm[l]),
               Orange.evaluation.scoring.spec(cm[l]))

ROC Analysis
============

`Receiver Operating Characteristic <http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
(ROC) analysis was initially developed for binary (two-class) problems, and
there is no consensus on how to apply it to multi-class problems, nor is it
entirely clear how to do ROC analysis after cross-validation and similar
multiple-sampling techniques. If you are interested in the area under the
curve, the function AUC deals with these problems as described below.

.. autofunction:: AUC

   .. attribute:: AUC.ByWeightedPairs (or 0)

      Computes AUC for each pair of classes (ignoring instances of all other
      classes) and averages the results, weighting them by the number of
      pairs of instances from these two classes (e.g. by the product of
      probabilities of the two classes). AUC computed in this way still
      behaves as a concordance index, i.e., it gives the probability that two
      randomly chosen instances from different classes will be correctly
      recognized (this is of course true only if the classifier knows
      from which two classes the instances came).

   .. attribute:: AUC.ByPairs (or 1)

      Similar to the above, except that the average over class pairs is not
      weighted. This AUC is, like the binary AUC, independent of class
      distributions, but it is no longer related to the concordance index.

   .. attribute:: AUC.WeightedOneAgainstAll (or 2)

      For each class, it computes AUC for this class against all others (that
      is, treating the other classes as a single class). The AUCs are then
      averaged, weighted by the class probabilities. This is related to the
      concordance index in which we test the classifier's (average) capability
      of distinguishing instances from a specified class from those that come
      from other classes. Unlike the binary AUC, the measure is not
      independent of class distributions.

   .. attribute:: AUC.OneAgainstAll (or 3)

      As above, except that the average is not weighted.

   In case of multiple folds (for instance if the data comes from cross
   validation), the computation goes like this. When computing the partial
   AUCs for individual pairs of classes or singled-out classes, AUC is
   computed for each fold separately and then averaged (ignoring the number
   of instances in each fold; it is a simple, unweighted average). However,
   if a certain fold doesn't contain any instances of a certain class (from
   the pair), the partial AUC is computed treating the results as if they
   came from a single fold. This is not entirely correct, since the class
   probabilities from different folds are not necessarily comparable, but as
   this will most often occur in leave-one-out experiments, comparability
   shouldn't be a problem.

   Computing and printing out the AUCs looks just like printing out
   classification accuracies (except that we call AUC instead of
   CA, of course)::

       AUCs = Orange.evaluation.scoring.AUC(res)
       for l in range(len(learners)):
           print "%10s: %5.3f" % (learners[l].name, AUCs[l])

   For the vehicle data set, you can run exactly the same code; it will
   compute AUCs for all pairs of classes and return the average weighted by
   the probabilities of pairs. Or, you can specify the averaging method
   yourself, like this::

       AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll)

   The following snippet tries out all four. (We don't claim that this is
   how the function needs to be used; it's better to stay with the default.)::

       methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
       print " " * 25 + "  \tbayes\ttree\tmajority"
       for i in range(4):
           AUCs = Orange.evaluation.scoring.AUC(resVeh, i)
           print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

   As you can see from the output::

                                  bayes   tree    majority
              by pairs, weighted:  0.789   0.871   0.500
                        by pairs:  0.791   0.872   0.500
           one vs. all, weighted:  0.783   0.800   0.500
                     one vs. all:  0.783   0.800   0.500

.. autofunction:: AUC_single

.. autofunction:: AUC_pair

.. autofunction:: AUC_matrix

The remaining functions, which plot the curves and statistically compare
them, require that the results come from a test with a single iteration,
and they always compare one chosen class against all others. If you have
cross-validation results, you can either use split_by_iterations to split the
results by folds, call the function for each fold separately and then
aggregate the results however you see fit, or you can set the
ExperimentResults' attribute number_of_iterations to 1 to cheat the
function - at your own responsibility for the statistical correctness.
Regarding multi-class problems, if you don't choose a specific class,
Orange.evaluation.scoring will use the class attribute's baseValue at the
time when the results were computed. If baseValue was not given at that time,
1 (that is, the second class) is used as the default.
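
The first option could look like this (a minimal sketch, assuming ``res``
holds the cross-validation results computed at the beginning of this page;
how you aggregate the per-fold values is up to you)::

    folds = Orange.evaluation.scoring.split_by_iterations(res)
    for fold in folds:
        print Orange.evaluation.scoring.AUCWilcoxon(fold)
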

We shall use the following code to prepare suitable experimental results::

    ri2 = Orange.core.MakeRandomIndices2(voting, 0.6)
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test)


.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC

Comparison of Algorithms
------------------------

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two

==========
Regression
==========

General Measure of Quality
==========================

Several alternative measures, as given below, can be used to evaluate
the success of numeric prediction:

.. image:: files/statRegression.png
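
For reference, these measures are conventionally defined as follows (a
summary of the standard definitions; see the figure above and the individual
functions below for the exact formulas used by the module), with :math:`y_i`
the true value, :math:`\hat{y}_i` the prediction and :math:`\bar{y}` the mean
of the true values over the :math:`n` test instances:

.. math::

   \mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2, \quad
   \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad
   \mathrm{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|

.. math::

   \mathrm{RSE} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
   \quad \mathrm{RRSE} = \sqrt{\mathrm{RSE}}, \quad
   \mathrm{RAE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i - \bar{y}|},
   \quad R^2 = 1 - \mathrm{RSE}
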

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2

The following code (:download:`statExamplesRegression.py <code/statExamplesRegression.py>`) uses most of the above measures to
score several regression methods.

.. literalinclude:: code/statExamplesRegression.py

The code above produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715

==================
Plotting functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman

=================
Utility Functions
=================

.. autofunction:: split_by_iterations

=====================================
Scoring for multilabel classification
=====================================

Multi-label classification requires different metrics than those used in traditional single-label
classification. This module presents the various metrics that have been proposed in the literature.
Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples
:math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let :math:`H` be a multi-label classifier
and :math:`Z_i=H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`.
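
In the notation above, the usual definitions of these scores in the
literature (a summary for orientation; see the individual functions below for
the exact formulas used by the module, and the references at the end of this
page) are:

.. math::

   \mathrm{HammingLoss}(H,D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
   \frac{|Y_i \,\triangle\, Z_i|}{|L|}, \qquad
   \mathrm{Accuracy}(H,D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
   \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}

.. math::

   \mathrm{Precision}(H,D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
   \frac{|Y_i \cap Z_i|}{|Z_i|}, \qquad
   \mathrm{Recall}(H,D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
   \frac{|Y_i \cap Z_i|}{|Y_i|}

where :math:`\triangle` denotes the symmetric difference of two sets.
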

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall

Let us compute all of these scores and print them out (part of
:download:`mlc-evaluate.py <code/mlc-evaluate.py>`, uses
:download:`emotions.tab <code/emotions.tab>`):

.. literalinclude:: code/mlc-evaluate.py
   :lines: 1-15

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]

References
==========

Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification',
Pattern Recognition, vol. 37, no. 9, pp. 1757-1771.

Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification',
in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2004).

Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization',
Machine Learning, vol. 39, no. 2/3, pp. 135-168.