Timestamp:
02/07/12 11:04:25 (2 years ago)
Author:
anze <anze.staric@…>
Branch:
default
rebase_source:
fd358bf360bc24d5c7ae3104b540b946f9cf6f41
Message:

Underscored function parameters and moved documentation to rst dir.

File:
1 edited

  • docs/reference/rst/Orange.evaluation.scoring.rst

    r9372 r9892  
    11.. automodule:: Orange.evaluation.scoring 
     2 
     3############################ 
     4Method scoring (``scoring``) 
     5############################ 
     6 
     7.. index: scoring 
     8 
     9This module contains various measures of quality for classification and 
     10regression. Most functions require an argument named :obj:`res`, an instance of 
     11:class:`Orange.evaluation.testing.ExperimentResults` computed by the 
     12functions from :mod:`Orange.evaluation.testing`, which contains 
     13predictions obtained through cross-validation, 
     14leave-one-out, testing on training data, or testing on a separate test set. 
     15 
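Before the worked examples below, here is a minimal sketch of the usual pattern:
obtain an :obj:`ExperimentResults` object from :mod:`Orange.evaluation.testing`
and pass it to a scoring function. The loader and learner construction here are
assumptions made only for illustration; the testing and scoring calls are the
ones used later on this page::

    import Orange

    voting = Orange.data.Table("voting")  # assumption: data set loaded by name
    learners = [Orange.classification.bayes.NaiveLearner()]  # assumption

    # the "testing on training data" variant mentioned above
    res = Orange.evaluation.testing.learnAndTestOnTestData(learners, voting, voting)
    print "CA: %5.3f" % Orange.evaluation.scoring.CA(res)[0]
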
     16============== 
     17Classification 
     18============== 
     19 
     20To prepare some data for the examples on this page, we shall load the voting data 
     21set (the problem of predicting a congressman's party, republican or democrat, 
     22based on a selection of votes) and evaluate a naive Bayesian learner, a 
     23classification tree and a majority classifier using cross-validation. 
     24For examples requiring a multivalued class problem, we shall do the same 
     25with the vehicle data set (telling whether a vehicle described by features 
     26extracted from a picture is a van, a bus, or an Opel or Saab car). 
     27 
     28A basic cross-validation example is shown in the following part of 
     29:download:`statExamples.py <code/statExamples.py>` (uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`): 
     30 
     31.. literalinclude:: code/statExample0.py 
     32 
     33If instances are weighted, the weights are taken into account. This can be 
     34disabled by passing :obj:`unweighted=1` as a keyword argument. Another way of 
     35disabling weights is to clear the weights flag of the 
     36:class:`Orange.evaluation.testing.ExperimentResults` object. 
     37 
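A short sketch of both options, assuming :obj:`res` holds the results computed
above (the keyword spelling and the name of the flag are taken from the
preceding paragraph; :obj:`CA` is documented below)::

    # keyword argument: ignore instance weights for this call only
    ca_unweighted = Orange.evaluation.scoring.CA(res, unweighted=1)

    # flag on the results object: disables weights for subsequent scores
    res.weights = False
    ca_no_weights = Orange.evaluation.scoring.CA(res)
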
     38General Measures of Quality 
     39=========================== 
     40 
     41.. autofunction:: CA 
     42 
     43.. autofunction:: AP 
     44 
     45.. autofunction:: Brier_score 
     46 
     47.. autofunction:: IS 
     48 
     49So, let's compute all of these measures in part of 
     50:download:`statExamples.py <code/statExamples.py>` (uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) and print them out: 
     51 
     52.. literalinclude:: code/statExample1.py 
     53   :lines: 13- 
     54 
     55The output should look like this:: 
     56 
     57    method  CA      AP      Brier    IS 
     58    bayes   0.903   0.902   0.175    0.759 
     59    tree    0.846   0.845   0.286    0.641 
     60    majorty  0.614   0.526   0.474   -0.000 
     61 
     62Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out 
     63the standard errors. 
     64 
     65Confusion Matrix 
     66================ 
     67 
     68.. autofunction:: confusion_matrices 
     69 
     70   **A positive-negative confusion matrix** is computed (a) if the class is 
     71   binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is 
     72   multivalued and :obj:`classIndex` is non-negative. The argument 
     73   :obj:`classIndex` then tells which class is positive. In case (a), 
     74   :obj:`classIndex` may be omitted; the first class 
     75   is then negative and the second is positive, unless the :obj:`baseClass` 
     76   attribute of the results object has a non-negative value. In that case, 
     77   :obj:`baseClass` is the index of the target class; this 
     78   attribute has to be set manually. The result of the 
     79   function is a list of instances of class :class:`ConfusionMatrix`, 
     80   containing the (weighted) number of true positives (TP), false 
     81   negatives (FN), false positives (FP) and true negatives (TN). 
     82 
     83   We can also add the keyword argument :obj:`cutoff` 
     84   (e.g. confusion_matrices(results, cutoff=0.3)); if we do, :obj:`confusion_matrices` 
     85   will disregard the classifiers' class predictions and instead observe the predicted 
     86   probabilities, considering a prediction "positive" if the predicted 
     87   probability of the positive class is higher than the :obj:`cutoff`. 
     88 
     89   The example below (part of :download:`statExamples.py <code/statExamples.py>`) shows how lowering the 
     90   cutoff threshold from the default 0.5 to 0.2 affects the confusion matrices 
     91   for the naive Bayesian classifier:: 
     92 
     93       cm = Orange.evaluation.scoring.confusion_matrices(res)[0] 
     94       print "Confusion matrix for naive Bayes:" 
     95       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
     96 
     97       cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0] 
     98       print "Confusion matrix for naive Bayes:" 
     99       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
     100 
     101   The output:: 
     102 
     103       Confusion matrix for naive Bayes: 
     104       TP: 238, FP: 13, FN: 29.0, TN: 155 
     105       Confusion matrix for naive Bayes: 
     106       TP: 239, FP: 18, FN: 28.0, TN: 150 
     107 
     108   shows that the number of true positives increases (and hence the number of 
     109   false negatives decreases) by only a single instance, while five instances 
     110   that were originally true negatives become false positives due to the 
     111   lower threshold. 
     112 
     113   To observe how good the classifiers are at detecting vans in the vehicle 
     114   data set, we would compute the matrix like this:: 
     115 
     116      cm = Orange.evaluation.scoring.confusion_matrices(resVeh, 
     117          vehicle.domain.classVar.values.index("van")) 
     118 
     119   and get results like these:: 
     120 
     121       TP: 189, FP: 241, FN: 10.0, TN: 406 
     122 
     123   while the same for class "opel" would give:: 
     124 
     125       TP: 86, FP: 112, FN: 126.0, TN: 522 
     126 
     127   The main difference is that there are only a few false negatives for the 
     128   van, meaning that the classifier seldom misses it (if it says it's not a 
     129   van, it's almost certainly not a van). Not so for the Opel car, where the 
     130   classifier missed 126 of them and correctly detected only 86. 
     131 
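   To make this concrete, here is a tiny plain-Python check (not part of the
   module) that computes the sensitivity TP/(TP+FN) from the numbers printed
   above; the formula is the one given for :obj:`sens` below::

       for name, TP, FN in [("van", 189, 10.0), ("opel", 86, 126.0)]:
           print "%s: sensitivity %.2f" % (name, TP / (TP + FN))

   This gives roughly 0.95 for vans and 0.41 for Opel cars.
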
     132   A **general confusion matrix** is computed (a) in case of a binary class, 
     133   when :obj:`classIndex` is set to -2, or (b) when the class is multivalued and 
     134   the caller doesn't specify the :obj:`classIndex` of the positive class. 
     135   When called in this manner, the function cannot use the argument 
     136   :obj:`cutoff`. 
     137 
     138   The function then returns a three-dimensional matrix, where the element 
     139   A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`] 
     140   gives the number of instances belonging to 'actual_class' for which the 
     141   'learner' predicted 'predictedClass'. We shall compute and print out 
     142   the matrix for the naive Bayesian classifier. 
     143 
     144   Here we see another example from :download:`statExamples.py <code/statExamples.py>`:: 
     145 
     146       cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0] 
     147       classes = vehicle.domain.classVar.values 
     148       print "\t"+"\t".join(classes) 
     149       for className, classConfusions in zip(classes, cm): 
     150           print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions)) 
     151 
     152   So, here's what this nice piece of code gives:: 
     153 
     154              bus   van  saab opel 
     155       bus     56   95   21   46 
     156       van     6    189  4    0 
     157       saab    3    75   73   66 
     158       opel    4    71   51   86 
     159 
     160   Vans are clearly simple: 189 vans were classified as vans (we know this 
     161   already, we've printed it out above), and the 10 misclassified pictures 
     162   were classified as buses (6) and Saab cars (4). For buses and Saab cars, 
     163   more instances were misclassified as vans than correctly classified, and 
     164   the Opel cars fare only slightly better. The classifier is obviously quite biased towards vans. 
     165 
     166   .. method:: sens(confm) 
     167   .. method:: spec(confm) 
     168   .. method:: PPV(confm) 
     169   .. method:: NPV(confm) 
     170   .. method:: precision(confm) 
     171   .. method:: recall(confm) 
     172   .. method:: F2(confm) 
     173   .. method:: Falpha(confm, alpha=2.0) 
     174   .. method:: MCC(conf) 
     175 
     176   With the confusion matrix defined in terms of positive and negative 
     177   classes, you can also compute the 
     178   `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_ 
     179   [TP/(TP+FN)], `specificity \ 
     180<http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_ 
     181   [TN/(TN+FP)], `positive predictive value \ 
     182<http://en.wikipedia.org/wiki/Positive_predictive_value>`_ 
     183   [TP/(TP+FP)] and `negative predictive value \ 
     184<http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)]. 
     185   In information retrieval, positive predictive value is called precision 
     186   (the ratio of the number of relevant records retrieved to the total number 
     187   of irrelevant and relevant records retrieved), and sensitivity is called 
     188   `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_ 
     189   (the ratio of the number of relevant records retrieved to the total number 
     190   of relevant records in the database). The 
     191   `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision 
     192   and recall is called the 
     193   `F-measure <http://en.wikipedia.org/wiki/F-measure>`_, which, depending 
     194   on the relative weight given to precision and recall, is implemented 
     195   as F1 [2*precision*recall/(precision+recall)] or, in the general case, as 
     196   Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. 
     197   The `Matthews correlation coefficient \ 
     198<http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_ 
     199   is in essence a correlation coefficient between 
     200   the observed and predicted binary classifications; it returns a value 
     201   between -1 and +1. A coefficient of +1 represents a perfect prediction, 
     202   0 an average random prediction and -1 an inverse prediction. 
     203 
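   As an illustration, the following plain-Python sketch (not the module's
   implementation, just the arithmetic) computes these measures from the
   TP/FP/FN/TN counts of a single confusion matrix, using the formulas quoted
   above::

       import math

       def measures(TP, FP, FN, TN, alpha=1.0):
           sens = TP / (TP + FN)        # sensitivity, a.k.a. recall
           spec = TN / (TN + FP)        # specificity
           PPV = TP / (TP + FP)         # positive predictive value, a.k.a. precision
           NPV = TN / (TN + FN)         # negative predictive value
           Falpha = (1 + alpha) * PPV * sens / (alpha * PPV + sens)
           MCC = (TP * TN - FP * FN) / math.sqrt(
               (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
           return sens, spec, PPV, NPV, Falpha, MCC

       # the naive Bayesian confusion matrix printed earlier on this page
       print measures(238.0, 13.0, 29.0, 155.0)
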
     204   If the argument :obj:`confm` is a single confusion matrix, a single 
     205   result (a number) is returned. If confm is a list of confusion matrices, 
     206   a list of scores is returned, one for each confusion matrix. 
     207 
     208   Note that weights are taken into account when computing the matrix, so 
     209   these functions don't check the 'weighted' keyword argument. 
     210 
     211   Let us print out sensitivities and specificities of our classifiers in 
     212   part of :download:`statExamples.py <code/statExamples.py>`:: 
     213 
     214       cm = Orange.evaluation.scoring.confusion_matrices(res) 
     215       print 
     216       print "method\tsens\tspec" 
     217       for l in range(len(learners)): 
     218           print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l])) 
     219 
     220ROC Analysis 
     221============ 
     222 
     223`Receiver Operating Characteristic \ 
     224<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_ 
     225(ROC) analysis was initially developed for 
     226binary-class problems, and there is no consensus on how to apply it to 
     227multi-class problems, nor do we know for sure how to do ROC analysis after 
     228cross-validation and similar multiple-sampling techniques. If you are 
     229interested in the area under the curve, the function AUC deals with those 
     230problems as described below. 
     231 
     232.. autofunction:: AUC 
     233 
     234   .. attribute:: AUC.ByWeightedPairs (or 0) 
     235 
     236      Computes AUC for each pair of classes (ignoring instances of all other 
     237      classes) and averages the results, weighting them by the number of 
     238      pairs of instances from these two classes (i.e. by the product of 
     239      probabilities of the two classes). AUC computed in this way still 
     240      behaves as a concordance index, i.e. it gives the probability that two 
     241      randomly chosen instances from different classes will be correctly 
     242      recognized (this is of course true only if the classifier knows 
     243      from which two classes the instances came). 
     244 
     245   .. attribute:: AUC.ByPairs (or 1) 
     246 
     247      Similar to the above, except that the average over class pairs is not 
     248      weighted. This AUC is, like the binary AUC, independent of class 
     249      distributions, but it is no longer related to the concordance index. 
     250 
     251   .. attribute:: AUC.WeightedOneAgainstAll (or 2) 
     252 
     253      For each class, it computes AUC for this class against all others (that 
     254      is, treating other classes as one class). The AUCs are then averaged by 
     255      the class probabilities. This is related to a concordance index in which 
     256      we test the classifier's (average) capability of distinguishing the 
     257      instances from a specified class from those that come from other classes. 
     258      Unlike the binary AUC, the measure is not independent of class 
     259      distributions. 
     260 
     261   .. attribute:: AUC.OneAgainstAll (or 3) 
     262 
     263      As above, except that the average is not weighted. 
     264 
     265   In case of multiple folds (for instance if the data comes from cross 
     266   validation), the computation goes like this. When computing the partial 
     267   AUCs for individual pairs of classes or singled-out classes, AUC is 
     268   computed for each fold separately and then averaged (ignoring the number 
     269   of instances in each fold; it's just a simple average). However, if a 
     270   certain fold doesn't contain any instances of a certain class (from the 
     271   pair), the partial AUC is computed treating the results as if they came 
     272   from a single fold. This is not really correct, since the class 
     273   probabilities from different folds are not necessarily comparable, but 
     274   as this will most often occur in leave-one-out experiments, 
     275   comparability shouldn't be a problem. 
     276 
     277   Computing and printing out the AUCs looks just like printing out 
     278   classification accuracies (except that we call AUC instead of 
     279   CA, of course):: 
     280 
     281       AUCs = Orange.evaluation.scoring.AUC(res) 
     282       for l in range(len(learners)): 
     283           print "%10s: %5.3f" % (learners[l].name, AUCs[l]) 
     284 
     285   For vehicle, you can run exactly this same code; it will compute AUCs 
     286   for all pairs of classes and return the average weighted by probabilities 
     287   of pairs. Or, you can specify the averaging method yourself, like this:: 
     288 
     289       AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll) 
     290 
     291   The following snippet tries out all four. (We don't claim that this is 
     292   how the function needs to be used; it's better to stay with the default.):: 
     293 
     294       methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"] 
     295       print " " *25 + "  \tbayes\ttree\tmajority" 
     296       for i in range(4): 
     297           AUCs = Orange.evaluation.scoring.AUC(resVeh, i) 
     298           print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs)) 
     299 
     300   This is the output:: 
     301 
     302                                   bayes   tree    majority 
     303              by pairs, weighted:  0.789   0.871   0.500 
     304                        by pairs:  0.791   0.872   0.500 
     305           one vs. all, weighted:  0.783   0.800   0.500 
     306                     one vs. all:  0.783   0.800   0.500 
     307 
     308.. autofunction:: AUC_single 
     309 
     310.. autofunction:: AUC_pair 
     311 
     312.. autofunction:: AUC_matrix 
     313 
     314The remaining functions, which plot the curves and statistically compare 
     315them, require that the results come from a test with a single iteration, 
     316and they always compare one chosen class against all others. If you have 
     317cross-validation results, you can either use split_by_iterations to split the 
     318results by folds, call the function for each fold separately and then combine 
     319the results however you see fit, or you can set the ExperimentResults' 
     320attribute number_of_iterations to 1 to cheat the function - at your own 
     321risk as far as statistical correctness is concerned. As for multi-class 
     322problems, if you don't choose a specific class, Orange.evaluation.scoring will use the class 
     323attribute's baseValue at the time when the results were computed. If baseValue 
     324was not given at that time, 1 (that is, the second class) is used as the default. 
     325 
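A minimal sketch of the first option, assuming :obj:`res` holds cross-validation
results (split_by_iterations is documented under Utility Functions below; the
choice of AUCWilcoxon and of calling it with default arguments is just an
example)::

    per_fold = []
    for fold_res in Orange.evaluation.scoring.split_by_iterations(res):
        # score each fold separately and collect the values,
        # to be combined however you see fit
        per_fold.append(Orange.evaluation.scoring.AUCWilcoxon(fold_res))
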
     326We shall use the following code to prepare suitable experimental results:: 
     327 
     328    ri2 = Orange.core.MakeRandomIndices2(voting, 0.6) 
     329    train = voting.selectref(ri2, 0) 
     330    test = voting.selectref(ri2, 1) 
     331    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test) 
     332 
     333 
     334.. autofunction:: AUCWilcoxon 
     335 
     336.. autofunction:: compute_ROC 
     337 
     338Comparison of Algorithms 
     339------------------------ 
     340 
     341.. autofunction:: McNemar 
     342 
     343.. autofunction:: McNemar_of_two 
     344 
     345========== 
     346Regression 
     347========== 
     348 
     349General Measures of Quality 
     350=========================== 
     351 
     352Several alternative measures, as given below, can be used to evaluate 
     353the success of numeric prediction: 
     354 
     355.. image:: files/statRegression.png 
     356 
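The measures in the image above follow the standard definitions. The following
plain-Python sketch (an illustration, not the module's implementation) spells
them out, with ``a`` the list of actual values and ``p`` the list of
predictions::

    import math

    def regression_scores(a, p):
        n = float(len(a))
        mean_a = sum(a) / n
        sse = sum((pi - ai) ** 2 for ai, pi in zip(a, p))  # sum of squared errors
        sae = sum(abs(pi - ai) for ai, pi in zip(a, p))    # sum of absolute errors
        var = sum((ai - mean_a) ** 2 for ai in a)          # total squared deviation
        dev = sum(abs(ai - mean_a) for ai in a)            # total absolute deviation
        MSE = sse / n
        RMSE = math.sqrt(MSE)
        MAE = sae / n
        RSE = sse / var        # relative squared error
        RRSE = math.sqrt(RSE)  # root relative squared error
        RAE = sae / dev        # relative absolute error
        R2 = 1 - RSE           # coefficient of determination
        return MSE, RMSE, MAE, RSE, RRSE, RAE, R2
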
     357.. autofunction:: MSE 
     358 
     359.. autofunction:: RMSE 
     360 
     361.. autofunction:: MAE 
     362 
     363.. autofunction:: RSE 
     364 
     365.. autofunction:: RRSE 
     366 
     367.. autofunction:: RAE 
     368 
     369.. autofunction:: R2 
     370 
     371The following code (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to 
     372score several regression methods. 
     373 
     374.. literalinclude:: code/statExamplesRegression.py 
     375 
     376The code above produces the following output:: 
     377 
     378    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2 
     379    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002 
     380    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526 
     381    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748 
     382    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715 
     383 
     384================== 
     385Plotting functions 
     386================== 
     387 
     388.. autofunction:: graph_ranks 
     389 
     390The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph: 
     391 
     392.. literalinclude:: code/statExamplesGraphRanks.py 
     393 
     394The code produces the following graph: 
     395 
     396.. image:: files/statExamplesGraphRanks1.png 
     397 
     398.. autofunction:: compute_CD 
     399 
     400.. autofunction:: compute_friedman 
     401 
     402================= 
     403Utility Functions 
     404================= 
     405 
     406.. autofunction:: split_by_iterations 
     407 
     408===================================== 
     409Scoring for multilabel classification 
     410===================================== 
     411 
     412Multi-label classification requires different metrics than those used in traditional single-label 
     413classification. This module presents the various metrics that have been proposed in the literature. 
     414Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples 
     415:math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let :math:`H` be a multi-label classifier 
     416and :math:`Z_i=H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`. 
     417 
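The example-based definitions of these metrics (see the references below) can be
written down directly. The following plain-Python sketch (an illustration, not
the module's implementation) takes the true label sets Y and the predicted label
sets Z as lists of Python sets, with L the set of all labels::

    def mlc_scores(Y, Z, L):
        n = float(len(Y))
        hamming = sum(len(y ^ z) for y, z in zip(Y, Z)) / (n * len(L))
        accuracy = sum(len(y & z) / float(len(y | z)) for y, z in zip(Y, Z)) / n
        precision = sum(len(y & z) / float(len(z)) for y, z in zip(Y, Z)) / n
        recall = sum(len(y & z) / float(len(y)) for y, z in zip(Y, Z)) / n
        return hamming, accuracy, precision, recall
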
     418.. autofunction:: mlc_hamming_loss 
     419.. autofunction:: mlc_accuracy 
     420.. autofunction:: mlc_precision 
     421.. autofunction:: mlc_recall 
     422 
     423So, let's compute all this and print it out (part of 
     424:download:`mlc-evaluate.py <code/mlc-evaluate.py>`, uses 
     425:download:`emotions.tab <code/emotions.tab>`): 
     426 
     427.. literalinclude:: code/mlc-evaluate.py 
     428   :lines: 1-15 
     429 
     430The output should look like this:: 
     431 
     432    loss= [0.9375] 
     433    accuracy= [0.875] 
     434    precision= [1.0] 
     435    recall= [0.875] 
     436 
     437References 
     438========== 
     439 
     440Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification', 
     441Pattern Recognition, vol. 37, no. 9, pp. 1757-1771. 
     442 
     443Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', in 
     444Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining 
     445(PAKDD 2004). 
     446 
     447Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization', 
     448Machine Learning, vol. 39, no. 2/3, pp. 135-168. 