orngStat: Orange Statistics for Predictors

This module contains various measures of quality for classification and regression. Most functions require an argument named res, an instance of ExperimentResults as computed by functions from orngTest, which contains predictions obtained through cross-validation, leave-one-out, testing on training data or testing on a separate test set.


To prepare some data for the examples on this page, we shall load the voting data set (the problem of predicting a congressman's party (republican, democrat) based on a selection of votes) and evaluate a naive Bayesian learner, classification trees and a majority classifier using cross-validation. For examples requiring a multivalued class problem, we shall do the same with the vehicle data set (telling whether a vehicle described by the features extracted from a picture is a van, a bus, or an Opel or Saab car).

    import orange, orngTest, orngTree

    learners = [orange.BayesLearner(name="bayes"),
                orngTree.TreeLearner(name="tree"),
                orange.MajorityLearner(name="majrty")]

    voting = orange.ExampleTable("voting")
    res = orngTest.crossValidation(learners, voting)

    vehicle = orange.ExampleTable("vehicle")
    resVeh = orngTest.crossValidation(learners, vehicle)

If examples are weighted, weights are taken into account. This can be disabled by giving unweighted=1 as a keyword argument. Another way of disabling weights is to clear the ExperimentResults' flag weights.

General Measures of Quality

CA(res, reportSE=False)

Computes classification accuracy, i.e. the proportion of examples for which the predicted class matches the actual class. The function returns a list of classification accuracies of all classifiers tested. If reportSE is set to true, the list will contain tuples with accuracies and standard errors.

If results are from multiple repetitions of experiments (like those returned by orngTest.crossValidation or orngTest.proportionTest), the standard error (SE) is estimated from the deviation of classification accuracy across folds (SD), as SE = SD/sqrt(N), where N is the number of repetitions (e.g. the number of folds).

If results are from a single repetition, we assume independence of examples and treat the classification accuracy as distributed according to the binomial distribution. This can be approximated by the normal distribution, so we report the SE of sqrt(CA*(1-CA)/N), where CA is the classification accuracy and N is the number of test examples.

Instead of ExperimentResults, this function can be given a list of confusion matrices (see below). Standard errors are in this case estimated using the latter method.
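
The two standard error estimates described above can be sketched in plain Python, independently of Orange. The function names and the per-fold accuracies are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the text does not say which variant orngStat uses.

```python
import math

def se_from_folds(fold_accuracies):
    """SE = SD / sqrt(N), where N is the number of folds (repetitions)."""
    n = len(fold_accuracies)
    mean = sum(fold_accuracies) / float(n)
    # Assumption: sample standard deviation, i.e. n - 1 in the denominator.
    variance = sum((a - mean) ** 2 for a in fold_accuracies) / float(n - 1)
    return math.sqrt(variance) / math.sqrt(n)

def se_binomial(ca, n_examples):
    """SE = sqrt(CA * (1 - CA) / N), where N is the number of test examples."""
    return math.sqrt(ca * (1.0 - ca) / n_examples)

# Hypothetical per-fold accuracies from a five-fold cross-validation.
fold_cas = [0.90, 0.92, 0.88, 0.91, 0.89]
se_folds = se_from_folds(fold_cas)

# Single repetition: accuracy 0.9 measured on 100 test examples.
se_single = se_binomial(0.9, 100)
```

The first estimate depends only on how much the accuracy varies between folds; the second depends only on the accuracy itself and on the number of test examples.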

AP(res, reportSE=False)
Computes the average probability assigned to the correct class.

BrierScore(res, reportSE=False)
Computes the Brier score, defined as the average (over test examples) of $\sum_x (t(x) - p(x))^2$, where x is a class, t(x) is 1 for the correct class and 0 for the others, and p(x) is the probability that the classifier assigned to the class x.
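
As a rough illustration of the two definitions above, the following plain-Python sketch computes the average probability of the correct class and the Brier score from a list of predicted class distributions. The function names and the toy data are made up for this example, and example weights are ignored.

```python
def average_probability(probabilities, actual):
    """Mean probability that the classifier assigned to the correct class."""
    return sum(p[a] for p, a in zip(probabilities, actual)) / float(len(actual))

def brier_score(probabilities, actual):
    """Average over examples of sum_x (t(x) - p(x))^2."""
    total = 0.0
    for p, a in zip(probabilities, actual):
        # t(x) is 1 for the correct class and 0 for the others.
        total += sum(((1.0 if x == a else 0.0) - px) ** 2
                     for x, px in enumerate(p))
    return total / len(actual)

# Three test examples of a binary problem: predicted distributions
# and indices of the actual classes (toy data).
probs = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
actual = [0, 1, 1]

ap = average_probability(probs, actual)
brier = brier_score(probs, actual)
```

A classifier that always assigns probability 1 to the correct class gets a Brier score of 0; lower is better.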

IS(res, apriori=None, reportSE=False)
Computes the information score as defined by Kononenko and Bratko (1991). Argument 'apriori' gives the apriori class distribution; if it is omitted, the class distribution is computed from the actual classes of examples in res.
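
For a single example, the information score of Kononenko and Bratko compares the prior probability of the correct class with the probability the classifier assigned to it. The sketch below follows the usual statement of that definition; the function names are hypothetical, and averaging the per-example scores without weights is an assumption about how the scores are combined.

```python
import math

def information_score(prior, posterior):
    """Information score (in bits) for one example.

    prior     -- a priori probability of the correct class
    posterior -- probability the classifier assigned to the correct class
    """
    if posterior >= prior:
        # The classifier moved in the right direction: information gained.
        return -math.log(prior, 2) + math.log(posterior, 2)
    # The classifier moved in the wrong direction: negative score.
    return -(-math.log(1.0 - prior, 2) + math.log(1.0 - posterior, 2))

def mean_information_score(priors, probabilities, actual):
    """Unweighted average of per-example scores (an assumption of this sketch)."""
    scores = [information_score(priors[a], p[a])
              for p, a in zip(probabilities, actual)]
    return sum(scores) / len(scores)

priors = [0.5, 0.5]                    # a priori class distribution (toy data)
probs = [[0.8, 0.2], [0.75, 0.25]]     # predicted distributions
actual = [0, 1]                        # indices of the correct classes

mean_is = mean_information_score(priors, probs, actual)
```

A classifier that merely reproduces the prior distribution scores 0, which is what the majority classifier achieves in the output shown in this section.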

So, let's compute all this and print it out.

    import orngStat

    CAs = orngStat.CA(res)
    APs = orngStat.AP(res)
    Briers = orngStat.BrierScore(res)
    ISs = orngStat.IS(res)

    print
    print "method\tCA\tAP\tBrier\tIS"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f\t%5.3f\t%6.3f" % (learners[l].name, CAs[l], APs[l], Briers[l], ISs[l])

The output should look like this.

    method  CA      AP      Brier   IS
    bayes   0.903   0.902   0.175    0.759
    tree    0.846   0.845   0.286    0.641
    majrty  0.614   0.526   0.474   -0.000

Script contains another example that also prints out the standard errors.

Confusion Matrix

confusionMatrices(res, classIndex=-1, {cutoff})

This function can compute two different forms of confusion matrix: one in which a certain class is marked as positive and the other(s) negative, and another in which no class is singled out. The way to specify what we want is somewhat confusing due to backward compatibility issues.

A positive-negative confusion matrix is computed (a) if the class is binary, unless the classIndex argument is -2, or (b) if the class is multivalued and classIndex is non-negative. The argument classIndex then tells which class is positive. In case (a), classIndex may be omitted; the first class is then negative and the second is positive, unless the baseClass attribute in the object with results has a non-negative value. In that case, baseClass is the index of the target class. The baseClass attribute of the results object should be set manually. The result of the function is a list of instances of class ConfusionMatrix, containing the (weighted) number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN).

We can also add the keyword argument cutoff (e.g. confusionMatrices(results, cutoff=0.3)); if we do, confusionMatrices will disregard the classifiers' class predictions and observe the predicted probabilities, and consider the prediction "positive" if the predicted probability of the positive class is higher than the cutoff.
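
The effect of the cutoff can be illustrated without Orange. The sketch below counts TP, FP, FN and TN from the predicted probabilities of the positive class; the function name and the data are hypothetical, and example weights are ignored.

```python
def confusion_with_cutoff(positive_probs, actual_is_positive, cutoff=0.5):
    """Return (TP, FP, FN, TN), predicting "positive" when p > cutoff."""
    tp = fp = fn = tn = 0
    for p, is_pos in zip(positive_probs, actual_is_positive):
        predicted_pos = p > cutoff
        if predicted_pos and is_pos:
            tp += 1
        elif predicted_pos and not is_pos:
            fp += 1
        elif is_pos:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Predicted probabilities of the positive class and the true labels (toy data).
probs = [0.9, 0.6, 0.4, 0.25, 0.1]
truth = [True, True, False, True, False]

default_counts = confusion_with_cutoff(probs, truth)             # cutoff 0.5
lenient_counts = confusion_with_cutoff(probs, truth, cutoff=0.2)
```

Lowering the cutoff can only move examples from the "predicted negative" to the "predicted positive" column, so TP and FP never decrease while FN and TN never increase.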

The example below shows how changing the cutoff threshold from the default 0.5 to 0.2 affects the confusion matrices for the naive Bayesian classifier.

    cm = orngStat.confusionMatrices(res)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

    cm = orngStat.confusionMatrices(res, cutoff=0.2)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

The output,

    Confusion matrix for naive Bayes:
    TP: 238, FP: 13, FN: 29.0, TN: 155
    Confusion matrix for naive Bayes:
    TP: 239, FP: 18, FN: 28.0, TN: 150

shows that the number of true positives increases (and hence the number of false negatives decreases) by only a single example, while five examples that were originally true negatives become false positives due to the lower threshold.

To observe how good the classifiers are at detecting vans in the vehicle data set, we would compute the matrix like this:

    cm = orngStat.confusionMatrices(resVeh, vehicle.domain.classVar.values.index("van"))

and get results like these

    TP: 189, FP: 241, FN: 10.0, TN: 406

while the same for class "opel" would give

    TP: 86, FP: 112, FN: 126.0, TN: 522

The main difference is that there are only a few false negatives for the van, meaning that the classifier seldom misses it (if it says it's not a van, it's almost certainly not a van). Not so for the Opel car, where the classifier missed 126 of them and correctly detected only 86.

A general confusion matrix is computed (a) in case of a binary class, when classIndex is set to -2, or (b) when we have a multivalued class and the caller doesn't specify the classIndex of the positive class. When called in this manner, the function cannot use the argument cutoff.

The function then returns a three-dimensional matrix, where the element A[learner][actualClass][predictedClass] gives the number of examples belonging to 'actualClass' for which the 'learner' predicted 'predictedClass'. We shall compute and print out the matrix for naive Bayesian classifier.

    cm = orngStat.confusionMatrices(resVeh)[0]
    classes = vehicle.domain.classVar.values
    print "\t"+"\t".join(classes)
    for className, classConfusions in zip(classes, cm):
        print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

Sorry for the language, but it's time you learn to talk dirty in Python, too. "\t".join(classes) will join the strings from the list classes by putting tabulators between them. zip merges two lists, element by element, hence it will create a list of tuples containing a class name from classes and a list telling how many examples from this class were classified into each possible class. Finally, the format string consists of a %s for the class name and one tabulator and %i for each class. The data we provide for this format string is (className, ) (a tuple containing the class name), plus the misclassification list converted to a tuple.

So, here's what this nice piece of code gives:

            bus     van     saab    opel
    bus     56      95      21      46
    van     6       189     4       0
    saab    3       75      73      66
    opel    4       71      51      86

Vans are clearly simple: 189 vans were classified as vans (we know this already, we've printed it out above), and the 10 misclassified pictures were classified as buses (6) and Saab cars (4). In all other classes, there were more examples misclassified as vans than correctly classified examples. The classifier is obviously quite biased towards vans.

sens(confm), spec(confm), PPV(confm), NPV(confm), precision(confm), recall(confm), F1(confm), Falpha(confm, alpha=2.0), MCC(confm)

With the confusion matrix defined in terms of positive and negative classes, you can also compute the sensitivity [TP/(TP+FN)], specificity [TN/(TN+FP)], positive predictive value [TP/(TP+FP)] and negative predictive value [TN/(TN+FN)]. In information retrieval, positive predictive value is called precision (the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved), and sensitivity is called recall (the ratio of the number of relevant records retrieved to the total number of relevant records in the database). The harmonic mean of precision and recall is called the F-measure; depending on the relative weight given to precision and recall, it is implemented as F1 [2*precision*recall/(precision+recall)] or, for the general case, Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. The Matthews correlation coefficient is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.

If the argument confm is a single confusion matrix, a single result (a number) is returned. If confm is a list of confusion matrices, a list of scores is returned, one for each confusion matrix.
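
The formulas above are easy to check by hand. The sketch below implements them in plain Python on raw counts instead of ConfusionMatrix objects (the function signatures are therefore hypothetical) and applies them to the naive Bayes matrix printed earlier (TP: 238, FP: 13, FN: 29, TN: 155). The MCC formula is the standard one; it is not quoted from the text above.

```python
import math

def sens(tp, fp, fn, tn):
    return tp / float(tp + fn)

def spec(tp, fp, fn, tn):
    return tn / float(tn + fp)

def ppv(tp, fp, fn, tn):
    return tp / float(tp + fp)

def npv(tp, fp, fn, tn):
    return tn / float(tn + fn)

def f1(tp, fp, fn, tn):
    p, r = ppv(tp, fp, fn, tn), sens(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

def falpha(tp, fp, fn, tn, alpha=2.0):
    p, r = ppv(tp, fp, fn, tn), sens(tp, fp, fn, tn)
    return (1 + alpha) * p * r / (alpha * p + r)

def mcc(tp, fp, fn, tn):
    # Standard definition of the Matthews correlation coefficient.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator

counts = (238, 13, 29, 155)   # TP, FP, FN, TN for naive Bayes on voting
summary = {
    "sens": sens(*counts),
    "spec": spec(*counts),
    "PPV": ppv(*counts),
    "NPV": npv(*counts),
    "F1": f1(*counts),
    "Falpha": falpha(*counts),
    "MCC": mcc(*counts),
}
```

With alpha set to 1, Falpha reduces to F1, which is a quick sanity check on both formulas.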

Note that weights are taken into account when computing the matrix, so these functions don't check the 'unweighted' keyword argument.

Let us print out sensitivities and specificities of our classifiers.

    cm = orngStat.confusionMatrices(res)
    print
    print "method\tsens\tspec"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f" % (learners[l].name, orngStat.sens(cm[l]), orngStat.spec(cm[l]))

ROC Analysis

Receiver Operating Characteristic (ROC) analysis was initially developed for binary-like problems and there is no consensus on how to apply it to multi-class problems, nor do we know for sure how to do ROC analysis after cross validation and similar multiple sampling techniques. If you are interested in the area under the curve, function AUC will deal with those problems as specifically described below.

AUC(res, method = AUC.ByWeightedPairs)
Returns the area under ROC curve (AUC) given a set of experimental results. For multivalued class problems, it will compute some sort of average, as specified by the argument method:
AUC.ByWeightedPairs (or 0)
Computes AUC for each pair of classes (ignoring examples of all other classes) and averages the results, weighting them by the number of pairs of examples from these two classes (i.e. by the product of probabilities of the two classes). AUC computed in this way still behaves as a concordance index, i.e. it gives the probability that two randomly chosen examples from different classes will be correctly recognized (this is of course true only if the classifier knows from which two classes the examples came).
AUC.ByPairs (or 1)
Similar to the above, except that the average over class pairs is not weighted. This AUC is, like the binary one, independent of class distributions, but it is no longer related to the concordance index.
AUC.WeightedOneAgainstAll (or 2)
For each class, it computes AUC for this class against all others (that is, treating other classes as one class). The AUCs are then averaged by the class probabilities. This is related to concordance index in which we test the classifier's (average) capability for distinguishing the examples from a specified class from those that come from other classes. Unlike the binary AUC, the measure is not independent of class distributions.
AUC.OneAgainstAll (or 3)
As above, except that the average is not weighted.
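
To make the averaging concrete, here is a plain-Python sketch of the two one-against-all variants. Binary AUC is computed as a concordance index (the share of positive-negative pairs ranked correctly, with ties counted as one half). The function names and the data are hypothetical; the sketch ignores example weights and folds, and it does not cover the pairwise methods.

```python
def binary_auc(scores, is_positive):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def one_against_all_auc(probabilities, actual, n_classes, weighted=True):
    """Average of per-class AUCs, optionally weighted by class probabilities."""
    n = float(len(actual))
    total = weight_sum = 0.0
    for c in range(n_classes):
        scores = [p[c] for p in probabilities]   # probability of class c
        labels = [a == c for a in actual]        # class c against all others
        weight = labels.count(True) / n if weighted else 1.0
        total += weight * binary_auc(scores, labels)
        weight_sum += weight
    return total / weight_sum

# Predicted class distributions and actual class indices (toy data).
probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.5, 0.1],
         [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
actual = [0, 0, 1, 2, 1]

auc_weighted = one_against_all_auc(probs, actual, 3, weighted=True)
auc_plain = one_against_all_auc(probs, actual, 3, weighted=False)
```

The sketch assumes every class has at least one example; an empty class would make the per-class AUC undefined.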

In case of multiple folds (for instance if the data comes from cross validation), the computation goes like this. When computing the partial AUCs for individual pairs of classes or singled-out classes, AUC is computed for each fold separately and then averaged (ignoring the number of examples in each fold, it's just a simple average). However, if a certain fold doesn't contain any examples of a certain class (from the pair), the partial AUC is computed treating the results as if they came from a single fold. This is not really correct, since the class probabilities from different folds are not necessarily comparable; yet since this will most often occur in leave-one-out experiments, comparability shouldn't be a problem.

Computing and printing out the AUCs looks just like printing out classification accuracies (except that we call AUC instead of CA, of course):

    AUCs = orngStat.AUC(res)
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, AUCs[l])

For vehicle, you can run exactly this same code; it will compute AUCs for all pairs of classes and return the average weighted by probabilities of pairs. Or, you can specify the averaging method yourself, like this

    AUCs = orngStat.AUC(resVeh, orngStat.AUC.WeightedOneAgainstAll)

The following snippet tries out all four. (We don't claim that this is how the function needs to be used; it's better to stay with the default.)

    methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
    print " " *25 + " \tbayes\ttree\tmajority"
    for i in range(4):
        AUCs = orngStat.AUC(resVeh, i)
        print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

This prints the following output:

                                bayes   tree    majority
           by pairs, weighted:  0.789   0.871   0.500
                     by pairs:  0.791   0.872   0.500
        one vs. all, weighted:  0.783   0.800   0.500
                  one vs. all:  0.783   0.800   0.500

AUC_single(res, classIndex)
Computes AUC where the class given by classIndex is singled out, and all other classes are treated as a single class. To find how good our classifiers are at distinguishing between vans and other vehicles, call the function like this:

    orngStat.AUC_single(resVeh, classIndex=vehicle.domain.classVar.values.index("van"))

AUC_pair(res, classIndex1, classIndex2)
Computes AUC between a pair of classes, ignoring examples from all other classes.
AUC_matrix(res)
Computes a (lower diagonal) matrix with AUCs for all pairs of classes. If there are empty classes, the corresponding elements in the matrix are -1. Remember the beautiful(?) code for printing out the confusion matrix? Here it strikes again:

    classes = vehicle.domain.classVar.values
    AUCmatrix = orngStat.AUC_matrix(resVeh)[0]
    print "\t"+"\t".join(classes[:-1])
    for className, AUCrow in zip(classes[1:], AUCmatrix[1:]):
        print ("%s" + ("\t%5.3f" * len(AUCrow))) % ((className, ) + tuple(AUCrow))

The remaining functions, which plot the curves and statistically compare them, require that the results come from a test with a single iteration, and they always compare one chosen class against all others. If you have cross validation results, you can either use splitByIterations to split the results by folds, call the function for each fold separately and then sum the results up however you see fit, or you can set the ExperimentResults' attribute numberOfIterations to 1, to cheat the function, at your own responsibility for the statistical correctness. Regarding multi-class problems, if you don't choose a specific class, orngStat will use the class attribute's baseValue at the time when the results were computed. If baseValue was not given at that time, 1 (that is, the second class) is used as the default.

We shall use the following code to prepare suitable experimental results

    ri2 = orange.MakeRandomIndices2(voting, 0.6)
    train = voting.selectref(ri2, 0)
    test = voting.selectref(ri2, 1)
    res1 = orngTest.learnAndTestOnTestData(learners, train, test)

AUCWilcoxon(res, classIndex=1)

Computes the area under ROC (AUC) and its standard error using Wilcoxon's approach proposed by Hanley and McNeil (1982). If classIndex is not specified, the class with index 1 (the second class) is used as "the positive" and others are negative. The result is a list of tuples (aROC, standard error).
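
For orientation, the closed-form standard error from Hanley and McNeil's paper can be written in a few lines. This is a sketch of the commonly quoted approximation, in which the two auxiliary quantities Q1 and Q2 are derived from the AUC itself; whether orngStat uses this approximation or estimates them from the data is not stated here. The function name is hypothetical.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)               # P(two positives both outrank a negative)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)    # P(a positive outranks two negatives)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

# A classifier no better than chance, tested on 50 positive
# and 50 negative examples (made-up numbers).
se_chance = hanley_mcneil_se(0.5, 50, 50)

# A perfect ranking has no sampling variability under this approximation.
se_perfect = hanley_mcneil_se(1.0, 10, 10)
```

The standard error shrinks as either class grows, so a small minority class limits how precisely the AUC can be estimated.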

To compute the AUCs with the corresponding confidence intervals for our experimental results, simply call

    orngStat.AUCWilcoxon(res1)


compare2AUCs(res, learner1, learner2, classIndex=1)

Compares ROC curves of learning algorithms with indices learner1 and learner2. The function returns three tuples: the first two contain the areas under ROC and the standard errors for the two learners, and the third contains the difference of the areas and its standard error: ((AUC1, SE1), (AUC2, SE2), (AUC1-AUC2, SE(AUC1)+SE(AUC2)-2*COVAR)).


This function is broken at the moment: it returns some numbers, but they're wrong.

computeROC(res, classIndex=1)

Computes a ROC curve as a list of (x, y) tuples, where x is 1-specificity and y is sensitivity.
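
The following plain-Python sketch shows one common way to build such a list of points: sort the examples by decreasing score and, after each distinct score, record the false positive rate (1-specificity) and the true positive rate (sensitivity). The function name and data are hypothetical, and this is not necessarily how computeROC proceeds internally.

```python
def roc_points(scores, is_positive):
    """Return ROC points as (1 - specificity, sensitivity) tuples."""
    n_pos = sum(1 for y in is_positive if y)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / float(n_neg), tp / float(n_pos)))
    return points

# Scores for the positive class and the true labels (toy data).
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
labels = [True, True, False, True, False]
curve = roc_points(scores, labels)
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.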

computeCDT(res, classIndex=1), ROCsFromCDT(cdt, {print})

These two functions are obsolete and shouldn't be called. Use AUC instead.

AROC(res, classIndex=1), AROCFromCDT(res, {print}), compare2AROCs(res, learner1, learner2, classIndex=1)
These are all deprecated, too. Instead, use AUCWilcoxon (for AROC), AUC (for AROCFromCDT), and compare2AUCs (for compare2AROCs).

Comparison of Algorithms

McNemar(res)
Computes a triangular matrix with McNemar statistics for each pair of classifiers. The statistic follows a chi-square distribution with one degree of freedom; the critical value for 5% significance is around 3.84.
McNemarOfTwo(res, learner1, learner2)
McNemarOfTwo computes the McNemar statistic for a pair of classifiers, specified by indices learner1 and learner2.
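
The McNemar statistic only looks at the examples on which the two classifiers disagree. The sketch below uses the common form with continuity correction; whether orngStat applies the correction is an assumption, and the function name and data are hypothetical.

```python
def mcnemar_statistic(correct1, correct2):
    """McNemar statistic from two lists of per-example correctness flags."""
    # n10: first classifier right, second wrong; n01: the other way round.
    n10 = sum(1 for a, b in zip(correct1, correct2) if a and not b)
    n01 = sum(1 for a, b in zip(correct1, correct2) if b and not a)
    if n10 + n01 == 0:
        return 0.0   # the classifiers never disagree
    # Assumption: continuity-corrected form, (|n10 - n01| - 1)^2 / (n10 + n01).
    return (abs(n10 - n01) - 1) ** 2 / float(n10 + n01)

# 50 test examples: the first classifier alone is right on 15,
# the second alone on 5, and both are right on the remaining 30.
correct1 = [True] * 15 + [False] * 5 + [True] * 30
correct2 = [False] * 15 + [True] * 5 + [True] * 30

statistic = mcnemar_statistic(correct1, correct2)
significant = statistic > 3.84   # 5% critical value, one degree of freedom
```

Examples that both classifiers get right (or both get wrong) do not influence the statistic at all.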


Several alternative measures, as given below, can be used to evaluate the success of numeric prediction:

MSE(res)
Computes mean-squared error.
RMSE(res)
Computes root mean-squared error.
MAE(res)
Computes mean absolute error.
RSE(res)
Computes relative squared error.
RRSE(res)
Computes root relative squared error.
RAE(res)
Computes relative absolute error.
R2(res)
Computes the coefficient of determination, R-squared.
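
The relationships between these measures are easiest to see when they are computed side by side. The sketch below uses the usual textbook definitions, in which the relative measures compare the predictor's error with that of always predicting the mean of the actual values; it is assumed, not confirmed by the text, that orngStat follows the same definitions. The function name and data are hypothetical.

```python
import math

def regression_scores(actual, predicted):
    """Return a dictionary with MSE, RMSE, MAE, RSE, RRSE, RAE and R2."""
    n = float(len(actual))
    mean = sum(actual) / n
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    # Errors of the baseline that always predicts the mean.
    sq_base = sum((a - mean) ** 2 for a in actual)
    abs_base = sum(abs(a - mean) for a in actual)
    mse = sq_err / n
    rse = sq_err / sq_base
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": abs_err / n,
        "RSE": rse,
        "RRSE": math.sqrt(rse),
        "RAE": abs_err / abs_base,
        "R2": 1.0 - rse,
    }

actual = [1.0, 2.0, 3.0, 4.0]       # toy data
predicted = [1.5, 2.0, 2.5, 4.0]
scores = regression_scores(actual, predicted)
```

Under these definitions RMSE is the square root of MSE, RRSE is the square root of RSE, and R2 equals 1 - RSE.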

The following code uses most of the above measures to score several regression methods.

    import orange
    import orngRegression as r
    import orngTree
    import orngStat, orngTest

    data = orange.ExampleTable("housing")

    # definition of regressors
    lr = r.LinearRegressionLearner(name="lr")
    rt = orngTree.TreeLearner(measure="retis", mForPruning=2,
                              minExamples=20, name="rt")
    maj = orange.MajorityLearner(name="maj")
    knn = orange.kNNLearner(k=10, name="knn")

    learners = [maj, rt, knn, lr]

    # cross validation, selection of scores, report of results
    results = orngTest.crossValidation(learners, data, folds=3)
    scores = [("MSE", orngStat.MSE), ("RMSE", orngStat.RMSE),
              ("MAE", orngStat.MAE), ("RSE", orngStat.RSE),
              ("RRSE", orngStat.RRSE), ("RAE", orngStat.RAE),
              ("R2", orngStat.R2)]

    print "Learner " + "".join(["%-8s" % s[0] for s in scores])
    for i in range(len(learners)):
        print "%-8s " % learners[i].name + \
              "".join(["%7.3f " % s[1](results)[i] for s in scores])

The code above produces the following output:

    Learner MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585   9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015   6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248   4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092   4.908   3.425   0.285   0.534   0.515   0.715

Plotting Functions

graph_ranks(filename, avranks, names, cd=None, lowv=None, highv=None, width=6, textspace=1, reverse=False, cdmethod=None)
Draws a CD graph, which is used to display the differences in methods' performance. See Janez Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7(Jan):1--30, 2006.

Needs matplotlib to work.

filename
Output file name (with extension). Formats supported by matplotlib can be used.
avranks
List of the methods' average ranks.
names
List of the methods' names.
cd
Critical difference. Used for marking methods whose difference is not statistically significant.
lowv
The lowest shown rank; if None, 1 is used.
highv
The highest shown rank; if None, len(avranks) is used.
width
Width of the drawn figure in inches; default 6 inches.
textspace
Space on the figure sides left for the description of methods; default 1 inch.
reverse
If True, the lowest rank is on the right. Default: False.
cdmethod
None by default. It can be an index of an element in avranks or names, which specifies the method that should be marked with an interval. If specified, the interval is marked only around that method. This option is meant to be used with the Bonferroni-Dunn test.

    import orange, orngStat

    names = ["first", "third", "second", "fourth"]
    avranks = [1.9, 3.2, 2.8, 3.3]
    cd = orngStat.compute_CD(avranks, 30)  # tested on 30 datasets
    orngStat.graph_ranks("statExamples-graph_ranks1.png", avranks, names, \
        cd=cd, width=6, textspace=1.5)

The code above produces the following graph:

compute_CD(avranks, N, alpha="0.05", type="nemenyi")
Returns the critical difference for the Nemenyi or Bonferroni-Dunn test according to the given alpha (either alpha="0.05" or alpha="0.1") for average ranks and the number of tested data sets N. Type can be either "nemenyi" for the two-tailed Nemenyi test or "bonferroni-dunn" for the Bonferroni-Dunn test.
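
The critical difference itself follows from a short formula given in Demsar's paper: CD = q_alpha * sqrt(k(k+1)/(6N)), where k is the number of methods and N the number of data sets. The sketch below takes the critical value q_alpha as a parameter instead of looking it up; the value 2.569 used in the example is the Nemenyi critical value for four methods at alpha = 0.05 as tabulated in that paper. The function name is hypothetical.

```python
import math

def critical_difference(k, n_datasets, q_alpha):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Four methods compared on 30 data sets, Nemenyi test at alpha = 0.05.
# q_alpha = 2.569 is taken from the table of critical values in Demsar (2006).
cd = critical_difference(4, 30, 2.569)
```

Two methods whose average ranks differ by less than cd cannot be told apart at the chosen significance level; more data sets give a smaller critical difference.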
compute_friedman(avranks, N)
Returns a tuple composed of (Friedman statistic, degrees of freedom) and (Iman statistic - F-distribution, degrees of freedom) given average ranks and the number of tested data sets N.
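
Both statistics can be computed directly from the average ranks, using the formulas in Demsar's paper. The sketch below mirrors the return structure described above, which is an assumption about the exact layout; the function name and the ranks are made up. Note that valid average ranks of k methods must sum to k(k+1)/2.

```python
def friedman_statistics(avranks, n_datasets):
    """Return ((chi2, df), (F, (df1, df2))) for the given average ranks."""
    k = len(avranks)
    n = float(n_datasets)
    # Friedman statistic, chi-square distributed with k - 1 degrees of freedom.
    chi2 = (12.0 * n / (k * (k + 1))) * (
        sum(r ** 2 for r in avranks) - k * (k + 1) ** 2 / 4.0)
    # Iman-Davenport correction, F-distributed with
    # (k - 1) and (k - 1) * (N - 1) degrees of freedom.
    f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return (chi2, k - 1), (f, (k - 1, (k - 1) * (n_datasets - 1)))

# Four methods on 30 data sets; the ranks sum to 4 * 5 / 2 = 10 (toy data).
ranks = [1.5, 2.5, 2.75, 3.25]
(chi2, chi2_df), (f_stat, f_df) = friedman_statistics(ranks, 30)
```

If all methods had the same average rank, both statistics would be zero; the further the ranks spread, the larger they become.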

Utility Functions

splitByIterations(res)
Splits ExperimentResults of a multiple-iteration test into a list of ExperimentResults, one for each iteration.