Changeset 10062:f05f42dace4f in orange


Timestamp: 02/08/12 09:40:27
Author: anze <anze.staric@…>
Branch: default
rebase_source: 3bdc54e2f76329e4cb5942d5d2abe4685d57193a
Message: Removed duplicated part of documentation.
File: 1 edited

  • docs/reference/rst/Orange.evaluation.scoring.rst

    r10037 r10062  
    158158 
    159159 
    160 To prepare some data for the examples on this page, we shall load the voting 
    161 data set (the problem of predicting a congressman's party, Republican or 
    162 Democrat, from a selection of votes) and evaluate a naive Bayesian learner, 
    163 a classification tree and a majority classifier using cross-validation. 
    164 For examples requiring a multivalued class problem, we shall do the same 
    165 with the vehicle data set (telling whether a vehicle described by features 
    166 extracted from a picture is a van, a bus, or an Opel or Saab car). 
    167  
    168 A basic cross-validation example is shown in 
    169 :download:`statExamples.py <code/statExamples.py>`. 
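
A minimal sketch of such a setup (not the exact contents of statExamples.py;
it assumes the usual Orange 2.x names for the data tables, the learners and
the cross-validation helper) could look like::

    import Orange

    voting = Orange.data.Table("voting")
    vehicle = Orange.data.Table("vehicle")

    # the name attribute is what the printouts on this page refer to
    bayes = Orange.classification.bayes.NaiveLearner(name="bayes")
    tree = Orange.classification.tree.TreeLearner(name="tree")
    majority = Orange.classification.majority.MajorityLearner(name="majority")
    learners = [bayes, tree, majority]

    # 10-fold cross-validation on both data sets
    res = Orange.evaluation.testing.cross_validation(learners, voting)
    resVeh = Orange.evaluation.testing.cross_validation(learners, vehicle)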
    170  
    171 If instances are weighted, the weights are taken into account. This can be 
    172 disabled by giving :obj:`unweighted=1` as a keyword argument. Another way of 
    173 disabling weights is to clear the weights flag of the 
    174 :class:`Orange.evaluation.testing.ExperimentResults` object. 
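
For example (a sketch only; it assumes, as stated above, that the scoring
functions accept the :obj:`unweighted` keyword and that the flag is an
ordinary attribute of the results object named weights)::

    # ignore the instance weights for this score only
    CAs = Orange.evaluation.scoring.CA(res, unweighted=1)

    # or clear the flag, so that all subsequent scores ignore the weights
    res.weights = False
    CAs = Orange.evaluation.scoring.CA(res)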
    175  
    176 General Measures of Quality 
    177 =========================== 
    178  
    179  
    180  
    181  
    182  
    183 So, let's compute these scores in part of 
    184 :download:`statExamples.py <code/statExamples.py>` and print them out: 
    185  
    186 .. literalinclude:: code/statExample1.py 
    187    :lines: 13- 
    188  
    189 The output should look like this:: 
    190  
    191     method  CA      AP      Brier    IS 
    192     bayes   0.903   0.902   0.175    0.759 
    193     tree    0.846   0.845   0.286    0.641 
    194     majority  0.614   0.526   0.474   -0.000 
    195  
    196 Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out 
    197 the standard errors. 
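
One way to obtain such standard errors yourself is to score each fold
separately and compute the standard error of the per-fold scores. A hedged
sketch (this is not the code from statExamples.py; it assumes that
split_by_iterations lives in Orange.evaluation.scoring and returns one
results object per fold)::

    import math

    fold_results = Orange.evaluation.scoring.split_by_iterations(res)
    fold_CAs = [Orange.evaluation.scoring.CA(fold) for fold in fold_results]
    for l in range(len(learners)):
        scores = [ca[l] for ca in fold_CAs]
        mean = sum(scores) / len(scores)
        variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
        se = math.sqrt(variance / len(scores))
        print "%10s: %5.3f +- %5.3f" % (learners[l].name, mean, se)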
    198  
    199 Confusion Matrix 
    200 ================ 
    201  
    202 .. autofunction:: confusion_matrices 
    203  
    204    **A positive-negative confusion matrix** is computed (a) if the class is 
    205    binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is 
    206    multivalued and :obj:`classIndex` is non-negative. Argument 
    207    :obj:`classIndex` then tells which class is positive. In case (a), 
    208    :obj:`classIndex` may be omitted; the first class 
    209    is then negative and the second is positive, unless the :obj:`baseClass` 
    210    attribute in the object with results has non-negative value. In that case, 
    211    :obj:`baseClass` is an index of the target class. :obj:`baseClass` 
    212    attribute of the results object should be set manually. The result of the 
    213    function is a list of instances of class :class:`ConfusionMatrix`, 
    214    containing the (weighted) number of true positives (TP), false 
    215    negatives (FN), false positives (FP) and true negatives (TN). 
    216  
    217    We can also add the keyword argument :obj:`cutoff` 
    218    (e.g. confusion_matrices(results, cutoff=0.3)); if we do, :obj:`confusion_matrices` 
    219    will disregard the classifiers' class predictions and observe the predicted 
    220    probabilities instead, considering the prediction "positive" if the predicted 
    221    probability of the positive class is higher than the :obj:`cutoff`. 
    222  
    223    The example (part of :download:`statExamples.py <code/statExamples.py>`) below shows how lowering the 
    224    cutoff threshold from the default 0.5 to 0.2 affects the confusion matrices 
    225    for the naive Bayesian classifier:: 
    226  
    227        cm = Orange.evaluation.scoring.confusion_matrices(res)[0] 
    228        print "Confusion matrix for naive Bayes:" 
    229        print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
    230  
    231        cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0] 
    232        print "Confusion matrix for naive Bayes:" 
    233        print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
    234  
    235    The output:: 
    236  
    237        Confusion matrix for naive Bayes: 
    238        TP: 238, FP: 13, FN: 29.0, TN: 155 
    239        Confusion matrix for naive Bayes: 
    240        TP: 239, FP: 18, FN: 28.0, TN: 150 
    241  
    242    shows that the number of true positives increases (and hence the number of 
    243    false negatives decreases) by only a single instance, while five instances 
    244    that were originally true negatives become false positives due to the 
    245    lower threshold. 
    246  
    247    To observe how good the classifiers are at detecting vans in the vehicle 
    248    data set, we would compute the matrix like this:: 
    249  
    250       cm = Orange.evaluation.scoring.confusion_matrices(resVeh, vehicle.domain.classVar.values.index("van")) 
    251  
    252    and get results like these:: 
    253  
    254        TP: 189, FP: 241, FN: 10.0, TN: 406 
    255  
    256    while the same for class "opel" would give:: 
    257  
    258        TP: 86, FP: 112, FN: 126.0, TN: 522 
    259  
    260    The main difference is that there are only a few false negatives for the 
    261    van, meaning that the classifier seldom misses it (if it says it's not a 
    262    van, it's almost certainly not a van). Not so for the Opel car, where the 
    263    classifier missed 126 of them and correctly detected only 86. 
    264  
    265    A **general confusion matrix** is computed (a) in the case of a binary class, 
    266    when :obj:`classIndex` is set to -2, or (b) when the class is multivalued and 
    267    the caller doesn't specify the :obj:`classIndex` of the positive class. 
    268    When called in this manner, the function cannot use the argument 
    269    :obj:`cutoff`. 
    270  
    271    The function then returns a three-dimensional matrix, where the element 
    272    A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`] 
    273    gives the number of instances belonging to 'actual_class' for which the 
    274    'learner' predicted 'predictedClass'. We shall compute and print out 
    275    the matrix for naive Bayesian classifier. 
    276  
    277    Here we see another example from :download:`statExamples.py <code/statExamples.py>`:: 
    278  
    279        cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0] 
    280        classes = vehicle.domain.classVar.values 
    281        print "\t"+"\t".join(classes) 
    282        for className, classConfusions in zip(classes, cm): 
    283            print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions)) 
    284  
    285    So, here's what this nice piece of code gives:: 
    286  
    287               bus   van  saab opel 
    288        bus     56   95   21   46 
    289        van     6    189  4    0 
    290        saab    3    75   73   66 
    291        opel    4    71   51   86 
    292  
    293    Vans are clearly simple: 189 vans were classified as vans (we know this 
    294    already, we've printed it out above), and the 10 misclassified pictures 
    295    were classified as buses (6) and Saab cars (4). In all other classes, 
    296    there were more instances misclassified as vans than correctly classified 
    297    instances. The classifier is obviously quite biased towards vans. 
    298  
    299  
    300  
    301    With the confusion matrix defined in terms of positive and negative 
    302    classes, you can also compute the 
    303    `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_ 
    304    [TP/(TP+FN)], `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_ 
    305    [TN/(TN+FP)], `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_ 
    306    [TP/(TP+FP)] and `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)]. 
    307    In information retrieval, positive predictive value is called precision 
    308    (the ratio of the number of relevant records retrieved to the total number 
    309    of irrelevant and relevant records retrieved), and sensitivity is called 
    310    `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_ 
    311    (the ratio of the number of relevant records retrieved to the total number 
    312    of relevant records in the database). The 
    313    `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision 
    314    and recall is called the 
    315    `F-measure <http://en.wikipedia.org/wiki/F-measure>`_; depending on the 
    316    relative weight given to precision and recall, it is implemented 
    317    as F1 [2*precision*recall/(precision+recall)] or, in the general case, 
    318    as Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. 
    319    The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_ 
    320    is in essence a correlation coefficient between 
    321    the observed and predicted binary classifications; it returns a value 
    322    between -1 and +1. A coefficient of +1 represents a perfect prediction, 
    323    0 an average random prediction and -1 an inverse prediction. 
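
   Precision, recall, F1 and the Matthews correlation coefficient can all be
   computed directly from the TP, FP, FN and TN counts of a single
   :class:`ConfusionMatrix`. A sketch using plain arithmetic (only the four
   fields described above are assumed)::

       import math

       cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
       # the counts may be weighted (i.e. floats), so force float arithmetic
       tp, fp, fn, tn = float(cm.TP), float(cm.FP), float(cm.FN), float(cm.TN)

       precision = tp / (tp + fp)
       recall = tp / (tp + fn)   # the same as sensitivity
       f1 = 2 * precision * recall / (precision + recall)
       mcc = (tp * tn - fp * fn) / math.sqrt(
           (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
       print "precision: %5.3f, recall: %5.3f, F1: %5.3f, MCC: %5.3f" % (
           precision, recall, f1, mcc)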
    324  
    325    If the argument :obj:`confm` is a single confusion matrix, a single 
    326    result (a number) is returned. If confm is a list of confusion matrices, 
    327    a list of scores is returned, one for each confusion matrix. 
    328  
    329    Note that weights are taken into account when computing the matrix, so 
    330    these functions don't check the 'weighted' keyword argument. 
    331  
    332    Let us print out sensitivities and specificities of our classifiers in 
    333    part of :download:`statExamples.py <code/statExamples.py>`:: 
    334  
    335        cm = Orange.evaluation.scoring.confusion_matrices(res) 
    336        print 
    337        print "method\tsens\tspec" 
    338        for l in range(len(learners)): 
    339            print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l])) 
    340  
    341 ROC Analysis 
    342 ============ 
    343  
    344 `Receiver Operating Characteristic \ 
    345 <http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_ 
    346 (ROC) analysis was initially developed for 
    347 binary classification problems, and there is no consensus on how to apply it to 
    348 multi-class problems, nor on how to do ROC analysis after 
    349 cross-validation and similar multiple sampling techniques. If you are 
    350 interested in the area under the curve, the function AUC deals with these 
    351 problems as described below. 
    352  
    353 .. autofunction:: AUC 
    354  
    355    .. attribute:: AUC.ByWeightedPairs (or 0) 
    356  
    357       Computes AUC for each pair of classes (ignoring instances of all other 
    358       classes) and averages the results, weighting them by the number of 
    359       pairs of instances from these two classes (i.e. by the product of 
    360       the probabilities of the two classes). AUC computed in this way still 
    361       behaves as a concordance index, i.e., it gives the probability that two 
    362       randomly chosen instances from different classes will be correctly 
    363       recognized (this is of course true only if the classifier knows 
    364       from which two classes the instances came). 
    365  
    366    .. attribute:: AUC.ByPairs (or 1) 
    367  
    368       Similar to the above, except that the average over class pairs is not 
    369       weighted. This AUC is, like the binary AUC, independent of class 
    370       distributions, but it is no longer related to the concordance index. 
    371  
    372    .. attribute:: AUC.WeightedOneAgainstAll (or 2) 
    373  
    374       For each class, it computes AUC for this class against all others (that 
    375       is, treating other classes as one class). The AUCs are then averaged by 
    376       the class probabilities. This is related to a concordance index in which 
    377       we test the classifier's (average) capability of distinguishing 
    378       instances of a specified class from those of the other classes. 
    379       Unlike the binary AUC, the measure is not independent of class 
    380       distributions. 
    381  
    382    .. attribute:: AUC.OneAgainstAll (or 3) 
    383  
    384       As above, except that the average is not weighted. 
    385  
    386    In case of multiple folds (for instance if the data comes from cross 
    387    validation), the computation goes like this. When computing the partial 
    388    AUCs for individual pairs of classes or singled-out classes, AUC is 
    389    computed for each fold separately and then averaged (ignoring the number 
    390    of instances in each fold, it's just a simple average). However, if a 
    391    certain fold doesn't contain any instances of a certain class (from the 
    392    pair), the partial AUC is computed treating the results as if they came 
    393    from a single fold. This is not entirely correct, since the class 
    394    probabilities from different folds are not necessarily comparable; 
    395    however, as this will most often occur in leave-one-out experiments, 
    396    comparability shouldn't be a problem. 
    397  
    398    Computing and printing out the AUCs looks just like printing out 
    399    classification accuracies (except that we call AUC instead of 
    400    CA, of course):: 
    401  
    402        AUCs = Orange.evaluation.scoring.AUC(res) 
    403        for l in range(len(learners)): 
    404            print "%10s: %5.3f" % (learners[l].name, AUCs[l]) 
    405  
    406    For the vehicle data set, you can run exactly the same code; it will compute AUCs 
    407    for all pairs of classes and return the average weighted by probabilities 
    408    of pairs. Or, you can specify the averaging method yourself, like this:: 
    409  
    410        AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll) 
    411  
    412    The following snippet tries out all four. (We don't claim that this is 
    413    how the function needs to be used; it's better to stay with the default.):: 
    414  
    415        methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"] 
    416        print " " *25 + "  \tbayes\ttree\tmajority" 
    417        for i in range(4): 
    418            AUCs = Orange.evaluation.scoring.AUC(resVeh, i) 
    419            print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs)) 
    420  
    421    The four methods give the following output:: 
    422  
    423                                    bayes   tree    majority 
    424               by pairs, weighted:  0.789   0.871   0.500 
    425                         by pairs:  0.791   0.872   0.500 
    426            one vs. all, weighted:  0.783   0.800   0.500 
    427                      one vs. all:  0.783   0.800   0.500 
    428  
    429 .. autofunction:: AUC_single 
    430  
    431 .. autofunction:: AUC_pair 
    432  
    433 .. autofunction:: AUC_matrix 
    434  
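For instance, to single out the "van" class from the vehicle results, one
might call AUC_single with the index of that class. This is only a sketch: we
assume, as an illustration, that the results object and the class index are
its first two arguments (see the function's documentation above for the
actual signature)::

    van_index = vehicle.domain.classVar.values.index("van")
    print Orange.evaluation.scoring.AUC_single(resVeh, van_index)
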
    435 The remaining functions, which plot the curves and statistically compare 
    436 them, require that the results come from a test with a single iteration, 
    437 and they always compare one chosen class against all others. If you have 
    438 cross validation results, you can either use split_by_iterations to split the 
    439 results by folds, call the function for each fold separately and then sum 
    440 the results up however you see fit, or you can set the ExperimentResults' 
    441 attribute number_of_iterations to 1 to cheat the function, at your own 
    442 responsibility for the statistical correctness. Regarding multi-class 
    443 problems, if you don't choose a specific class, Orange.evaluation.scoring will use the class 
    444 attribute's baseValue at the time when the results were computed. If baseValue 
    445 was not given at that time, 1 (that is, the second class) is used as the default. 
    446  
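A hedged sketch of the two options just described (assuming that
split_by_iterations is available in Orange.evaluation.scoring and that
number_of_iterations is an ordinary attribute of the results object)::

    # option 1: handle each fold of the cross-validation separately
    for fold_res in Orange.evaluation.scoring.split_by_iterations(res):
        pass  # call the chosen plotting/comparison function on fold_res here

    # option 2: pretend the results came from a single iteration
    # (the statistical correctness is then your responsibility)
    res.number_of_iterations = 1
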
    447 We shall use the following code to prepare suitable experimental results:: 
    448  
    449     ri2 = Orange.data.sample.SubsetIndices2(voting, 0.6) 
    450     train = voting.selectref(ri2, 0) 
    451     test = voting.selectref(ri2, 1) 
    452     res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test) 
    453  
    454  
    455 .. autofunction:: AUCWilcoxon 
    456  
    457 .. autofunction:: compute_ROC 
    458  
    459 160 Comparison of Algorithms 
    460 161 ------------------------ 