.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index:: scoring

This module contains various measures of quality for classification and
regression. Most functions require an argument named :obj:`res`, an instance of
:class:`Orange.evaluation.testing.ExperimentResults` as computed by
functions from :mod:`Orange.evaluation.testing`, which contains
predictions obtained through cross-validation,
leave-one-out, testing on training data or testing on test set instances.

==============
Classification
==============

To prepare some data for the examples on this page, we shall load the voting data
set (the problem of predicting a congressman's party (republican, democrat)
based on a selection of votes) and evaluate a naive Bayesian learner,
classification trees and a majority classifier using cross-validation.
For examples requiring a multivalued class problem, we shall do the same
with the vehicle data set (telling whether a vehicle described by the features
extracted from a picture is a van, a bus, or an Opel or Saab car).

A basic cross-validation example is shown in the following part of
:download:`statExamples.py <code/statExamples.py>` (uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`):

.. literalinclude:: code/statExample0.py

If instances are weighted, weights are taken into account. This can be
disabled by giving :obj:`unweighted=1` as a keyword argument. Another way of
disabling weights is to clear the :obj:`weights` flag of the
:class:`Orange.evaluation.testing.ExperimentResults` object.

General Measures of Quality
===========================

.. autofunction:: CA

.. autofunction:: AP

.. autofunction:: Brier_score

.. autofunction:: IS

The following part of
:download:`statExamples.py <code/statExamples.py>` (uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) computes all of these scores and prints them out:

.. literalinclude:: code/statExample1.py
   :lines: 13-

The output should look like this::

    method  CA      AP      Brier    IS
    bayes   0.903   0.902   0.175    0.759
    tree    0.846   0.845   0.286    0.641
    majorty 0.614   0.526   0.474   -0.000

The script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out
the standard errors.

Confusion Matrix
================

.. autofunction:: confusion_matrices

**A positive-negative confusion matrix** is computed (a) if the class is
binary, unless the :obj:`classIndex` argument is -2, or (b) if the class is
multivalued and :obj:`classIndex` is non-negative. The argument
:obj:`classIndex` then tells which class is positive. In case (a),
:obj:`classIndex` may be omitted; the first class
is then negative and the second is positive, unless the :obj:`baseClass`
attribute of the results object has a non-negative value. In that case,
:obj:`baseClass` is the index of the target class. The :obj:`baseClass`
attribute of the results object should be set manually. The result of the
function is a list of instances of class :class:`ConfusionMatrix`,
containing the (weighted) number of true positives (TP), false
negatives (FN), false positives (FP) and true negatives (TN).

We can also add the keyword argument :obj:`cutoff`
(e.g. ``confusion_matrices(results, cutoff=0.3)``); if we do, :obj:`confusion_matrices`
will disregard the classifiers' class predictions and observe the predicted
probabilities, and consider the prediction "positive" if the predicted
probability of the positive class is higher than the :obj:`cutoff`.

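The effect of a cutoff can be sketched without Orange. The snippet below is a
minimal illustration, not the library's implementation: the function name
``confusion_counts`` and the data are made up, labels are assumed to be 1 for
the positive class and 0 for the negative one, and instance weights are
ignored::

    def confusion_counts(actual, prob_positive, cutoff=0.5):
        """Count TP, FP, FN and TN for binary labels (1 = positive)."""
        tp = fp = fn = tn = 0
        for label, p in zip(actual, prob_positive):
            # "positive" only if the probability is strictly above the cutoff
            predicted_positive = p > cutoff
            if predicted_positive and label == 1:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif label == 1:
                fn += 1
            else:
                tn += 1
        return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

    # Hypothetical labels and predicted probabilities of the positive class.
    actual = [1, 1, 1, 0, 0, 0]
    probs = [0.9, 0.6, 0.3, 0.4, 0.25, 0.1]

    default_counts = confusion_counts(actual, probs)
    lowered_counts = confusion_counts(actual, probs, cutoff=0.2)
    print(default_counts)
    print(lowered_counts)

Lowering the cutoff can only move instances from the "predicted negative" side
to the "predicted positive" side, so TP and FP never decrease while FN and TN
never increase.
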
The example below (part of :download:`statExamples.py <code/statExamples.py>`) shows how lowering the
cutoff threshold from the default 0.5 to 0.2 affects the confusion matrix
for the naive Bayesian classifier::

    # Confusion matrix of the first learner (naive Bayes), default cutoff 0.5
    cm = Orange.evaluation.scoring.confusion_matrices(res)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

    # The same, with the cutoff lowered to 0.2
    cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

The output::

    Confusion matrix for naive Bayes:
    TP: 238, FP: 13, FN: 29.0, TN: 155
    Confusion matrix for naive Bayes:
    TP: 239, FP: 18, FN: 28.0, TN: 150

shows that the number of true positives increases (and hence the number of
false negatives decreases) by only a single instance, while five instances
that were originally true negatives become false positives due to the
lower threshold.

To observe how good the classifiers are at detecting vans in the vehicle
data set, we would compute the matrix like this::

    # The second argument is the index of the class treated as positive
    cm = Orange.evaluation.scoring.confusion_matrices(resVeh,
        vehicle.domain.classVar.values.index("van"))

and get results like these::

    TP: 189, FP: 241, FN: 10.0, TN: 406

while the same for class "opel" would give::

    TP: 86, FP: 112, FN: 126.0, TN: 522

The main difference is that there are only a few false negatives for the
van, meaning that the classifier seldom misses it (if it says it's not a
van, it's almost certainly not a van). Not so for the Opel car, where the
classifier missed 126 of them and correctly detected only 86.

**A general confusion matrix** is computed (a) in case of a binary class,
when :obj:`classIndex` is set to -2, or (b) when we have a multivalued class and
the caller doesn't specify the :obj:`classIndex` of the positive class.
When called in this manner, the function cannot use the argument
:obj:`cutoff`.

The function then returns a three-dimensional matrix, where the element
A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`]
gives the number of instances belonging to 'actual_class' for which the
'learner' predicted 'predictedClass'. We shall compute and print out
the matrix for the naive Bayesian classifier.

Here is another example from :download:`statExamples.py <code/statExamples.py>`::

    # General confusion matrix of the first learner (naive Bayes)
    cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0]
    classes = vehicle.domain.classVar.values
    # Header row with predicted classes, then one row per actual class
    print "\t"+"\t".join(classes)
    for className, classConfusions in zip(classes, cm):
        print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

This code prints::

            bus     van     saab    opel
    bus     56      95      21      46
    van     6       189     4       0
    saab    3       75      73      66
    opel    4       71      51      86

Vans are clearly simple: 189 vans were classified as vans (we know this
already, we've printed it out above), and the 10 misclassified pictures
were classified as buses (6) and Saab cars (4). In all other classes,
there were more instances misclassified as vans than correctly classified
instances. The classifier is obviously quite biased towards vans.

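The general matrix and the positive-negative counts are two views of the same
data. As a minimal sketch (plain Python, independent of Orange; the helper
name ``one_vs_rest`` is hypothetical), the counts reported earlier for "van"
and "opel" can be recovered from the matrix printed above by treating one
class as positive and all others as negative::

    classes = ["bus", "van", "saab", "opel"]
    # Rows are actual classes, columns are predicted classes.
    matrix = [
        [56, 95, 21, 46],
        [6, 189, 4, 0],
        [3, 75, 73, 66],
        [4, 71, 51, 86],
    ]

    def one_vs_rest(matrix, positive):
        """Collapse a general confusion matrix into (TP, FP, FN, TN)."""
        tp = matrix[positive][positive]
        # Actual positives that were predicted as something else
        fn = sum(matrix[positive]) - tp
        # Other classes that were predicted as the positive class
        fp = sum(row[positive] for row in matrix) - tp
        tn = sum(sum(row) for row in matrix) - tp - fn - fp
        return tp, fp, fn, tn

    van_counts = one_vs_rest(matrix, classes.index("van"))
    opel_counts = one_vs_rest(matrix, classes.index("opel"))
    print("van:  TP=%i FP=%i FN=%i TN=%i" % van_counts)
    print("opel: TP=%i FP=%i FN=%i TN=%i" % opel_counts)
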
.. method:: sens(confm)
.. method:: spec(confm)
.. method:: PPV(confm)
.. method:: NPV(confm)
.. method:: precision(confm)
.. method:: recall(confm)
.. method:: F2(confm)
.. method:: Falpha(confm, alpha=2.0)
.. method:: MCC(conf)

With the confusion matrix defined in terms of positive and negative
classes, you can also compute the
`sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_
[TP/(TP+FN)], `specificity \
<http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_
[TN/(TN+FP)], `positive predictive value \
<http://en.wikipedia.org/wiki/Positive_predictive_value>`_
[TP/(TP+FP)] and `negative predictive value \
<http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)].
In information retrieval, positive predictive value is called precision
(the ratio of the number of relevant records retrieved to the total number
of irrelevant and relevant records retrieved), and sensitivity is called
`recall <http://en.wikipedia.org/wiki/Information_retrieval>`_
(the ratio of the number of relevant records retrieved to the total number
of relevant records in the database). The
`harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision
and recall is called an
`F-measure <http://en.wikipedia.org/wiki/F-measure>`_; depending
on the relative weight given to precision and recall, it is implemented
as F1 [2*precision*recall/(precision+recall)] or, for the general case,
Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)].
The `Matthews correlation coefficient \
<http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
is in essence a correlation coefficient between
the observed and predicted binary classifications; it returns a value
between -1 and +1. A coefficient of +1 represents a perfect prediction,
0 an average random prediction and -1 an inverse prediction.

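The bracketed formulas above translate directly into code. The following is a
minimal sketch in plain Python, not the module's implementation: the function
name ``binary_scores`` and its dictionary keys are hypothetical, the Matthews
coefficient uses its standard definition, and the input is the set of counts
reported earlier for the class "van"::

    from math import sqrt

    def binary_scores(tp, fp, fn, tn, alpha=1.0):
        """Scores derived from a positive-negative confusion matrix."""
        tp, fp, fn, tn = float(tp), float(fp), float(fn), float(tn)
        sens = tp / (tp + fn)   # sensitivity, also called recall
        spec = tn / (tn + fp)   # specificity
        ppv = tp / (tp + fp)    # positive predictive value, i.e. precision
        npv = tn / (tn + fn)    # negative predictive value
        f1 = 2 * ppv * sens / (ppv + sens)
        falpha = (1 + alpha) * ppv * sens / (alpha * ppv + sens)
        mcc = (tp * tn - fp * fn) / sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return {"sens": sens, "spec": spec, "PPV": ppv, "NPV": npv,
                "F1": f1, "Falpha": falpha, "MCC": mcc}

    scores = binary_scores(189, 241, 10, 406)
    for name in ["sens", "spec", "PPV", "NPV", "F1", "Falpha", "MCC"]:
        print("%-7s %5.3f" % (name, scores[name]))

With ``alpha=1.0`` the Falpha formula reduces to F1, which is a quick sanity
check on both expressions.
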
If the argument :obj:`confm` is a single confusion matrix, a single
result (a number) is returned. If :obj:`confm` is a list of confusion matrices,
a list of scores is returned, one for each confusion matrix.

Note that weights are taken into account when computing the matrix, so
these functions don't check the :obj:`unweighted` keyword argument.

Let us print out sensitivities and specificities of our classifiers in
part of :download:`statExamples.py <code/statExamples.py>`::

    # One confusion matrix per learner
    cm = Orange.evaluation.scoring.confusion_matrices(res)
    print
    print "method\tsens\tspec"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l]))

ROC Analysis
============

`Receiver Operating Characteristic \
<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
(ROC) analysis was initially developed for
binary problems, and there is no consensus on how to apply it to
multi-class problems, nor do we know for sure how to do ROC analysis after
cross-validation and similar multiple sampling techniques. If you are
interested in the area under the curve, the function :obj:`AUC` will deal with those
problems as specifically described below.

.. autofunction:: AUC

.. attribute:: AUC.ByWeightedPairs (or 0)

   Computes AUC for each pair of classes (ignoring instances of all other
   classes) and averages the results, weighting them by the number of
   pairs of instances from these two classes (i.e. by the product of
   probabilities of the two classes). AUC computed in this way still
   behaves as a concordance index, i.e., it gives the probability that two
   randomly chosen instances from different classes will be correctly
   recognized (this is of course true only if the classifier knows
   from which two classes the instances came).

.. attribute:: AUC.ByPairs (or 1)

   Similar to the above, except that the average over class pairs is not
   weighted. This AUC is, like the binary one, independent of class
   distributions, but it is no longer related to the concordance index.

.. attribute:: AUC.WeightedOneAgainstAll (or 2)

   For each class, it computes AUC for this class against all others (that
   is, treating other classes as one class). The AUCs are then averaged,
   weighted by the class probabilities. This is related to the concordance index in which
   we test the classifier's (average) capability for distinguishing the
   instances from a specified class from those that come from other classes.
   Unlike the binary AUC, the measure is not independent of class
   distributions.

.. attribute:: AUC.OneAgainstAll (or 3)

   As above, except that the average is not weighted.

In case of multiple folds (for instance, if the data comes from
cross-validation), the computation goes like this. When computing the partial
AUCs for individual pairs of classes or singled-out classes, AUC is
computed for each fold separately and then averaged (ignoring the number
of instances in each fold; it's just a simple average). However, if a
certain fold doesn't contain any instances of a certain class (from the
pair), the partial AUC is computed treating the results as if they came
from a single fold. This is not really correct, since the class
probabilities from different folds are not necessarily comparable;
yet, since this will most often occur in leave-one-out experiments,
comparability shouldn't be a problem.

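The difference between the weighted and unweighted pairwise averages can be
sketched in plain Python. This is an illustration under stated assumptions,
not the algorithm used by :obj:`AUC`: the function names, the scores, the
per-pair AUC values and the class counts are all made up, ties are assumed to
count as one half, and folds and instance weights are ignored::

    def pair_auc(scores_pos, scores_neg):
        """Concordance: share of (positive, negative) pairs ranked correctly."""
        wins = 0.0
        for p in scores_pos:
            for n in scores_neg:
                if p > n:
                    wins += 1.0
                elif p == n:
                    wins += 0.5   # assumption: a tie counts as half
        return wins / (len(scores_pos) * len(scores_neg))

    def average_pair_aucs(pair_aucs, class_counts, weighted=True):
        """Average per-pair AUCs, optionally weighted by the product of
        the probabilities of the two classes."""
        total = float(sum(class_counts.values()))
        numerator = denominator = 0.0
        for (a, b), auc in pair_aucs.items():
            if weighted:
                w = (class_counts[a] / total) * (class_counts[b] / total)
            else:
                w = 1.0
            numerator += w * auc
            denominator += w
        return numerator / denominator

    binary_auc = pair_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])

    # Hypothetical per-pair AUCs and class sizes.
    pair_aucs = {("bus", "van"): 0.9, ("bus", "opel"): 0.8, ("van", "opel"): 0.7}
    class_counts = {"bus": 50, "van": 30, "opel": 20}
    by_weighted_pairs = average_pair_aucs(pair_aucs, class_counts)
    by_pairs = average_pair_aucs(pair_aucs, class_counts, weighted=False)
    print("%5.3f %5.3f %5.3f" % (binary_auc, by_weighted_pairs, by_pairs))

In this made-up example the weighted average is pulled towards the pairs that
involve the larger classes, while the unweighted one treats all pairs equally.
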
Computing and printing out the AUCs looks just like printing out
classification accuracies (except that we call AUC instead of
CA, of course)::

    # One AUC per learner
    AUCs = Orange.evaluation.scoring.AUC(res)
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, AUCs[l])

For vehicle, you can run exactly the same code; it will compute AUCs
for all pairs of classes and return the average weighted by probabilities
of pairs. Or, you can specify the averaging method yourself, like this::

    AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll)

The following snippet tries out all four. (We don't claim that this is
how the function needs to be used; it's better to stay with the default.)::

    methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
    print " " *25 + "  \tbayes\ttree\tmajority"
    # The averaging methods are numbered 0 to 3, in the order listed above
    for i in range(4):
        AUCs = Orange.evaluation.scoring.AUC(resVeh, i)
        print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

This prints the following output::

                                bayes   tree    majority
           by pairs, weighted:  0.789   0.871   0.500
                     by pairs:  0.791   0.872   0.500
        one vs. all, weighted:  0.783   0.800   0.500
                  one vs. all:  0.783   0.800   0.500

.. autofunction:: AUC_single

.. autofunction:: AUC_pair

.. autofunction:: AUC_matrix

The remaining functions, which plot the curves and statistically compare
them, require that the results come from a test with a single iteration,
and they always compare one chosen class against all others. If you have
cross-validation results, you can either use :obj:`split_by_iterations` to split the
results by folds, call the function for each fold separately and then sum
the results up however you see fit, or you can set the ExperimentResults'
attribute number_of_iterations to 1 to cheat the function, at your own
responsibility for the statistical correctness. Regarding multi-class
problems, if you don't choose a specific class, Orange.evaluation.scoring will use the class
attribute's baseValue at the time when the results were computed. If baseValue
was not given at that time, 1 (that is, the second class) is used as the default.

We shall use the following code to prepare suitable experimental results::

    # Split the voting data 60:40 into a training and a test set
    ri2 = Orange.core.MakeRandomIndices2(voting, 0.6)
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test)


.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC

Comparison of Algorithms
------------------------

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two

==========
Regression
==========

General Measure of Quality
==========================

Several alternative measures, as given below, can be used to evaluate
the success of numeric prediction:

.. image:: files/statRegression.png

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2

The following code (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to
score several regression methods.

.. literalinclude:: code/statExamplesRegression.py

The code above produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715

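For readers without access to the formula image, the measures can be sketched
in plain Python using their usual textbook definitions. This is an
assumption-laden illustration rather than the module's code: the function name
``regression_scores`` and the data are made up, and instance weights are
ignored. The relations RRSE = sqrt(RSE) and R2 = 1 - RSE agree with the table
above::

    from math import sqrt

    def regression_scores(actual, predicted):
        """Common regression error measures for two equally long lists."""
        n = float(len(actual))
        mean = sum(actual) / n
        sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
        abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
        # Errors of the baseline that always predicts the mean
        sq_dev = sum((a - mean) ** 2 for a in actual)
        abs_dev = sum(abs(a - mean) for a in actual)
        mse = sq_err / n
        rse = sq_err / sq_dev
        return {"MSE": mse, "RMSE": sqrt(mse), "MAE": abs_err / n,
                "RSE": rse, "RRSE": sqrt(rse), "RAE": abs_err / abs_dev,
                "R2": 1.0 - rse}

    # Hypothetical true values and predictions.
    actual = [3.0, 5.0, 7.0, 9.0]
    predicted = [2.0, 5.0, 8.0, 11.0]
    reg = regression_scores(actual, predicted)
    for name in ["MSE", "RMSE", "MAE", "RSE", "RRSE", "RAE", "R2"]:
        print("%-5s %6.3f" % (name, reg[name]))

The relative measures (RSE, RRSE, RAE) compare the model with the mean
predictor, which is why the majority learner in the table scores close to 1 on
all three.
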
==================
Plotting functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman

=================
Utility Functions
=================

.. autofunction:: split_by_iterations

=====================================
Scoring for multilabel classification
=====================================

Multi-label classification requires different metrics than those used in traditional single-label
classification. This module presents the various metrics that have been proposed in the literature.
Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples
:math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let :math:`H` be a multi-label classifier
and :math:`Z_i=H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`.

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall

The following part of
:download:`mlc-evaluate.py <code/mlc-evaluate.py>` (uses
:download:`emotions.tab <code/emotions.tab>`) computes these scores and prints them out:

.. literalinclude:: code/mlc-evaluate.py
   :lines: 1-15

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]

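In terms of the notation above, the four scores can be sketched with Python
sets. The definitions used here are the ones commonly given in the multi-label
literature (symmetric difference for Hamming loss, intersection over union for
accuracy) and are an assumption about what the ``mlc_*`` functions compute;
the function name ``mlc_scores`` and the data are made up, and every
:math:`Y_i` and :math:`Z_i` is assumed to be non-empty::

    def mlc_scores(true_sets, predicted_sets, labels):
        """Average example-based multi-label scores over a data set."""
        n = float(len(true_sets))
        hamming = accuracy = precision = recall = 0.0
        for y, z in zip(true_sets, predicted_sets):
            common = float(len(y & z))
            hamming += len(y ^ z) / float(len(labels))  # symmetric difference
            accuracy += common / len(y | z)
            precision += common / len(z)
            recall += common / len(y)
        return {"loss": hamming / n, "accuracy": accuracy / n,
                "precision": precision / n, "recall": recall / n}

    # Hypothetical label set L, true label sets Y_i and predictions Z_i.
    labels = set(["happy", "sad", "calm", "angry"])
    true_sets = [set(["happy", "calm"]), set(["sad"])]
    predicted_sets = [set(["happy"]), set(["sad", "angry"])]

    mlc = mlc_scores(true_sets, predicted_sets, labels)
    for name in ["loss", "accuracy", "precision", "recall"]:
        print("%-10s %5.3f" % (name, mlc[name]))
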
References
==========

Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification',
Pattern Recognition, vol.37, no.9, pp:1757-71

Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', paper
presented to Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2004)

Schapire, R.E. & Singer, Y. (2000), 'Boostexter: a boosting-based system for text categorization',
Machine Learning, vol.39, no.2/3, pp:135-68.