orngStat: Orange Statistics for Predictors
This module contains various measures of quality for classification and regression. Most functions require an argument named res, an instance of ExperimentResults as computed by functions from orngTest, which contains predictions obtained through cross-validation, leave-one-out, testing on training data, or a separate test set.
Classification
To prepare some data for the examples on this page, we shall load the voting data set (the problem of predicting a congressman's party, republican or democrat, from a selection of votes) and evaluate a naive Bayesian learner, a classification tree and a majority classifier using cross-validation. For examples requiring a multivalued class problem, we shall do the same with the vehicle data set (telling whether a vehicle, described by features extracted from a picture, is a van, a bus, or an Opel or Saab car).
part of statExamples.py
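A minimal sketch of what that preparation might look like (the actual statExamples.py script may differ; orange.ExampleTable, orngTree.TreeLearner and orngTest.crossValidation are the standard Orange calls, but treat details such as the learner names as assumptions):

    import orange, orngTest, orngTree, orngStat

    # load the two data sets used throughout this page
    voting = orange.ExampleTable("voting")
    vehicle = orange.ExampleTable("vehicle")

    # the three learners we compare; names are used when printing results
    learners = [orange.BayesLearner(name="bayes"),
                orngTree.TreeLearner(name="tree"),
                orange.MajorityLearner(name="majority")]

    # cross-validation results used by the examples below
    res = orngTest.crossValidation(learners, voting)
    resVeh = orngTest.crossValidation(learners, vehicle)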
If examples are weighted, weights are taken into account. This can be disabled by giving unweighted=1 as a keyword argument. Another way of disabling weights is to clear the ExperimentResults' flag weights.
General Measures of Quality
 CA(res, reportSE=False)
 Computes classification accuracy, i.e. the percentage of matches between predicted and actual classes. The function returns a list of classification accuracies of all classifiers tested. If reportSE is set to true, the list will contain tuples with accuracies and standard errors.
 If results are from multiple repetitions of experiments (like those returned by orngTest.crossValidation or orngTest.proportionTest), the standard error (SE) is estimated from the deviation of classification accuracy across folds (SD), as SE = SD/sqrt(N), where N is the number of repetitions (e.g. the number of folds).
 If results are from a single repetition, we assume independence of examples and treat the classification accuracy as distributed according to the binomial distribution. This can be approximated by the normal distribution, so we report SE = sqrt(CA*(1-CA)/N), where CA is the classification accuracy and N is the number of test examples.
 Instead of ExperimentResults, this function can be given a list of confusion matrices (see below). Standard errors are in this case estimated using the latter method.
 AP(res, reportSE=False)
 Computes the average probability assigned to the correct class.
 BrierScore(res, reportSE=False)
 Computes the Brier score, defined as the average (over test examples) of sum_x (t(x) - p(x))^2, where x is a class, t(x) is 1 for the correct class and 0 for the others, and p(x) is the probability the classifier assigned to class x. (A short illustration of this definition follows the list below.)
 IS(res, apriori=None, reportSE=False)
 Computes the information score as defined by Kononenko and Bratko (1991). The argument apriori gives the a priori class distribution; if it is omitted, the class distribution is computed from the actual classes of the examples in res.
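Since the Brier score is just a formula over predicted probabilities, here is a pure-Python illustration of the definition above (an illustration only, not the orngStat implementation):

    # Brier score contribution of a single test example: t is the 0/1
    # indicator vector of the correct class, p the predicted distribution
    def brier_one_example(t, p):
        return sum((ti - pi) ** 2 for ti, pi in zip(t, p))

    # correct class is the second of three:
    print brier_one_example([0, 1, 0], [0.2, 0.5, 0.3])   # 0.04 + 0.25 + 0.09 = 0.38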
So, let's compute all this and print it out.
part of statExamples.py
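A sketch of such a computation, reusing res and learners from the preparation code above (the actual script may format the output differently):

    CAs = orngStat.CA(res)
    APs = orngStat.AP(res)
    Briers = orngStat.BrierScore(res)
    ISs = orngStat.IS(res)

    print "method\tCA\tAP\tBrier\tIS"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f\t%5.3f\t%6.3f" % \
            (learners[l].name, CAs[l], APs[l], Briers[l], ISs[l])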
The output should look like this.
Script statExamples.py contains another example that also prints out the standard errors.
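A minimal sketch of how that could look (not the verbatim script), using the reportSE argument described above:

    CAs = orngStat.CA(res, reportSE=True)   # list of (accuracy, SE) tuples
    for l in range(len(learners)):
        print "%10s: %5.3f +- %5.3f" % ((learners[l].name, ) + CAs[l])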
Confusion Matrix
 confusionMatrices(res, classIndex=1, {cutoff})
This function can compute two different forms of confusion matrix: one in which a certain class is marked as positive and the other(s) negative, and another in which no class is singled out. The way to specify what we want is somewhat confusing due to backward compatibility issues.
A positive-negative confusion matrix is computed (a) if the class is binary, unless the classIndex argument is -2, or (b) if the class is multivalued and classIndex is non-negative. The argument classIndex then tells which class is positive. In case (a), classIndex may be omitted; the first class is then negative and the second is positive, unless the baseClass attribute in the object with results has a non-negative value. In that case, baseClass is the index of the target class. The baseClass attribute of the results object should be set manually. The result of the function is a list of instances of class ConfusionMatrix, containing the (weighted) number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN).
We can also add the keyword argument cutoff (e.g. confusionMatrices(results, cutoff=0.3)); if we do, confusionMatrices will disregard the classifiers' class predictions and observe the predicted probabilities instead, considering the prediction "positive" if the predicted probability of the positive class is higher than the cutoff.
The example below shows how setting the cutoff threshold from the default 0.5 to 0.2 affects the confusion matrix for the naive Bayesian classifier.
part of statExamples.py
    cm = orngStat.confusionMatrices(res)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

    cm = orngStat.confusionMatrices(res, cutoff=0.2)[0]
    print "Confusion matrix for naive Bayes:"
    print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN)

The output,

    Confusion matrix for naive Bayes:
    TP: 238, FP: 13, FN: 29.0, TN: 155
    Confusion matrix for naive Bayes:
    TP: 239, FP: 18, FN: 28.0, TN: 150

shows that the number of true positives increases (and hence the number of false negatives decreases) by only a single example, while five examples that were originally true negatives become false positives due to the lower threshold.
To observe how good the classifiers are at detecting vans in the vehicle data set, we would compute the matrix like this:
    cm = orngStat.confusionMatrices(resVeh, vehicle.domain.classVar.values.index("van"))

and get results like these:

    TP: 189, FP: 241, FN: 10.0, TN: 406

while the same for class "opel" would give

    TP: 86, FP: 112, FN: 126.0, TN: 522

The main difference is that there are only a few false negatives for the van, meaning that the classifier seldom misses it (if it says it's not a van, it's almost certainly not a van). Not so for the Opel car, where the classifier missed 126 of them and correctly detected only 86.
A general confusion matrix is computed (a) in the case of a binary class, when classIndex is set to -2, or (b) when we have a multivalued class and the caller doesn't specify the classIndex of the positive class. When called in this manner, the function cannot use the argument cutoff. The function then returns a three-dimensional matrix, where the element A[learner][actualClass][predictedClass] gives the number of examples belonging to 'actualClass' for which the 'learner' predicted 'predictedClass'. We shall compute and print out the matrix for the naive Bayesian classifier.
part of statExamples.py
    cm = orngStat.confusionMatrices(resVeh)[0]
    classes = vehicle.domain.classVar.values
    print "\t" + "\t".join(classes)
    for className, classConfusions in zip(classes, cm):
        print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))

Sorry for the language, but it's time you learn to talk dirty in Python, too.
"\t".join(classes)
will join the strings from listclasses
by putting tabulators between them.zip
merges to lists, element by element, hence it will create a list of tuples containing a class name fromclasses
and a list telling how many examples from this class were classified into each possible class. Finally, the format string consists of a%s
for the class name and one tabulator and%i
for each class. The data we provide for this format string is(className, )
(a tuple containing the class name), plus the misclassification list converted to a tuple.
So, here's what this nice piece of code gives:
            bus   van   saab  opel
    bus      56    95    21    46
    van       6   189     4     0
    saab      3    75    73    66
    opel      4    71    51    86

Vans are clearly simple: 189 vans were classified as vans (we know this already, we've printed it out above), and the 10 misclassified pictures were classified as buses (6) and Saab cars (4). In all other classes, there were more examples misclassified as vans than correctly classified examples. The classifier is obviously quite biased to vans.
 sens(confm), spec(confm), PPV(confm), NPV(confm), precision(confm), recall(confm), F1(confm), Falpha(confm, alpha=2.0), MCC(confm)
With the confusion matrix defined in terms of positive and negative classes, you can also compute the sensitivity [TP/(TP+FN)], specificity [TN/(TN+FP)], positive predictive value [TP/(TP+FP)] and negative predictive value [TN/(TN+FN)]. In information retrieval, positive predictive value is called precision (the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved), and sensitivity is called recall (the ratio of the number of relevant records retrieved to the total number of relevant records in the database). The harmonic mean of precision and recall is called the F-measure; depending on the ratio of the weights between precision and recall, it is implemented as F1 [2*precision*recall / (precision+recall)] or, for the general case, as Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. The Matthews correlation coefficient (http://en.wikipedia.org/wiki/Matthews_correlation_coefficient) is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
If the argument confm is a single confusion matrix, a single result (a number) is returned. If confm is a list of confusion matrices, a list of scores is returned, one for each confusion matrix.
Note that weights are taken into account when computing the matrix, so these functions don't check the 'unweighted' keyword argument.
Let us print out sensitivities and specificities of our classifiers.
part of statExamples.py
    cm = orngStat.confusionMatrices(res)
    print
    print "method\tsens\tspec"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f" % (learners[l].name, orngStat.sens(cm[l]), orngStat.spec(cm[l]))
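The same confusion matrices can be fed to the other measures from this group. A hedged extension of the snippet above (not part of statExamples.py; it assumes F1 and MCC are available as listed):

    print "method\tprec\trecall\tF1\tMCC"
    for l in range(len(learners)):
        print "%s\t%5.3f\t%5.3f\t%5.3f\t%5.3f" % \
            (learners[l].name, orngStat.precision(cm[l]), orngStat.recall(cm[l]),
             orngStat.F1(cm[l]), orngStat.MCC(cm[l]))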
ROC Analysis
Receiver Operating Characteristic (ROC) analysis was initially developed for binary-class problems, and there is no consensus on how to apply it to multiclass problems, nor do we know for sure how to do ROC analysis after cross-validation and similar multiple sampling techniques. If you are interested in the area under the curve, the function AUC will deal with those problems as specifically described below.
 AUC(res, method=AUC.ByWeightedPairs)
 Returns the area under the ROC curve (AUC) given a set of experimental results. For multivalued class problems, it will compute some sort of average, as specified by the argument method:
 AUC.ByWeightedPairs (or 0)
 Computes AUC for each pair of classes (ignoring examples of all other classes) and averages the results, weighting them by the number of pairs of examples from these two classes (e.g. by the product of the probabilities of the two classes). AUC computed in this way still behaves as a concordance index, e.g., gives the probability that two randomly chosen examples from different classes will be correctly recognized (this is of course true only if the classifier knows from which two classes the examples came).
 AUC.ByPairs (or 1)
 Similar to the above, except that the average over class pairs is not weighted. This AUC is, like the binary AUC, independent of class distributions, but it is no longer related to the concordance index.
 AUC.WeightedOneAgainstAll (or 2)
 For each class, it computes the AUC for this class against all others (that is, treating the other classes as one class). The AUCs are then averaged by the class probabilities. This is related to the concordance index in which we test the classifier's (average) capability of distinguishing the examples from a specified class from those that come from other classes. Unlike the binary AUC, this measure is not independent of class distributions.
 AUC.OneAgainstAll (or 3)
 As above, except that the average is not weighted.
In the case of multiple folds (for instance if the data comes from cross-validation), the computation goes like this. When computing the partial AUCs for individual pairs of classes or singled-out classes, the AUC is computed for each fold separately and then averaged (ignoring the number of examples in each fold; it's just a simple average). However, if a certain fold doesn't contain any examples of a certain class (from the pair), the partial AUC is computed treating the results as if they came from a single fold. This is not really correct, since the class probabilities from different folds are not necessarily comparable; yet, as this will most often occur in leave-one-out experiments, comparability shouldn't be a problem.
Computing and printing out the AUCs looks just like printing out classification accuracies (except that we call AUC instead of CA, of course):
part of statExamples.py
    AUCs = orngStat.AUC(res)
    for l in range(len(learners)):
        print "%10s: %5.3f" % (learners[l].name, AUCs[l])

For vehicle, you can run exactly this same code; it will compute AUCs for all pairs of classes and return the average weighted by probabilities of pairs. Or, you can specify the averaging method yourself, like this:

    AUCs = orngStat.AUC(resVeh, orngStat.AUC.WeightedOneAgainstAll)

The following snippet tries out all four. (We don't claim that this is how the function needs to be used; it's better to stay with the default.)

part of statExamples.py

    methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"]
    print " " * 25 + " \tbayes\ttree\tmajority"
    for i in range(4):
        AUCs = orngStat.AUC(resVeh, i)
        print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs))

As you can see from the output:

                               bayes   tree    majority
        by pairs, weighted:    0.789   0.871   0.500
                  by pairs:    0.791   0.872   0.500
     one vs. all, weighted:    0.783   0.800   0.500
               one vs. all:    0.783   0.800   0.500

 AUC_single(res, classIndex)
 Computes AUC where the class with the given classIndex is singled out and all other classes are treated as a single class. To find how good our classifiers are at distinguishing between vans and other vehicles, call the function like this:

    orngStat.AUC_single(resVeh, classIndex = vehicle.domain.classVar.values.index("van"))

 AUC_pair(res, classIndex1, classIndex2)
 Computes the AUC between a pair of classes, ignoring examples from all other classes.
 AUC_matrix(res)
 Computes a (lower diagonal) matrix with AUCs for all pairs of classes. If there are empty classes, the corresponding elements in the matrix are -1. Remember the beautiful(?) code for printing out the confusion matrix? Here it strikes again:
part of statExamples.py
    classes = vehicle.domain.classVar.values
    AUCmatrix = orngStat.AUC_matrix(resVeh)[0]
    print "\t" + "\t".join(classes[:-1])
    for className, AUCrow in zip(classes[1:], AUCmatrix[1:]):
        print ("%s" + ("\t%5.3f" * len(AUCrow))) % ((className, ) + tuple(AUCrow))
The remaining functions, which plot the curves and statistically compare them, require that the results come from a test with a single iteration, and they always compare one chosen class against all others. If you have cross-validation results, you can either use splitByIterations to split the results by folds, call the function for each fold separately and then sum up the results however you see fit, or you can set the ExperimentResults' attribute numberOfIterations to 1, to cheat the function, at your own responsibility for the statistical correctness. Regarding multiclass problems, if you don't choose a specific class, orngStat will use the class attribute's baseValue from the time when the results were computed. If baseValue was not given at that time, 1 (that is, the second class) is used as the default.
We shall use the following code to prepare suitable experimental results
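A minimal sketch of such preparation (an assumption, not the original listing): a single train-and-test split via orngTest.proportionTest with times=1 yields results with a single iteration, which is what these functions require.

    # 70:30 split, one repetition; learners and voting as defined earlier
    res1 = orngTest.proportionTest(learners, voting, 0.7, times=1)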
 AUCWilcoxon(res, classIndex=1)
 Computes the area under the ROC curve (AUC) and its standard error using Wilcoxon's approach, as proposed by Hanley and McNeil (1982). If classIndex is not specified, the first class is used as "the positive" and the others as negative. The result is a list of tuples (AUC, standard error). To compute the AUCs with the corresponding confidence intervals for our experimental results, simply call:

    orngStat.AUCWilcoxon(res1)

 compare2AUCs(res, learner1, learner2, classIndex=1)
 Compares the ROC curves of the learning algorithms with indices learner1 and learner2. The function returns three tuples: the first two contain the areas under the ROCs and the standard errors for both learners, and the third is the difference of the areas and its standard error: ((AUC1, SE1), (AUC2, SE2), (AUC1-AUC2, SE(AUC1)+SE(AUC2)-2*COVAR)). This function is broken at the moment: it returns some numbers, but they're wrong.
 computeROC(res, classIndex=1)
 Computes a ROC curve as a list of (x, y) tuples, where x is 1-specificity and y is sensitivity.
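 For illustration, a hedged sketch that walks over the curve of the first classifier (assuming, as with the other functions, that one curve is returned per classifier; res1 is the single-iteration result prepared above):

    roc = orngStat.computeROC(res1)[0]   # curve for the first learner
    for x, y in roc:
        print "1-specificity: %5.3f, sensitivity: %5.3f" % (x, y)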
 computeCDT(res, classIndex=1), ROCsFromCDT(cdt, {print})
 These two functions are obsolete and shouldn't be called. Use AUC instead.
 AROC(res, classIndex=1), AROCFromCDT(res, {print}), compare2AROCs(res, learner1, learner2, classIndex=1)
 These are all deprecated, too. Instead, use AUCWilcoxon (for AROC), AUC (for AROCFromCDT), and compare2AUCs (for compare2AROCs).
Comparison of Algorithms
 McNemar(res)
 Computes a triangular matrix with McNemar statistics for each pair of classifiers. The statistic is distributed according to the chi-square distribution with one degree of freedom; the critical value for 5% significance is around 3.84.
 McNemarOfTwo(res, learner1, learner2)
 McNemarOfTwo computes the McNemar statistic for a pair of classifiers, specified by the indices learner1 and learner2.
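 A small sketch of use (not from statExamples.py), comparing the first two classifiers from the voting experiment:

    chi2 = orngStat.McNemarOfTwo(res, 0, 1)   # naive Bayes vs. tree
    # the difference is significant at the 5% level if this exceeds ~3.84
    print "McNemar statistic: %5.3f" % chi2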
Regression
Several alternative measures, as given below, can be used to evaluate the success of numeric prediction:
 MSE(res)
 Computes mean-squared error.
 RMSE(res)
 Computes root mean-squared error.
 MAE(res)
 Computes mean absolute error.
 RSE(res)
 Computes relative squared error.
 RRSE(res)
 Computes root relative squared error.
 RAE(res)
 Computes relative absolute error.
 R2(res)
 Computes the coefficient of determination, R-squared.
The following code uses most of the above measures to score several regression methods.
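A hedged reconstruction of such code (the housing data set ships with Orange; the choice of learners, and MajorityLearner predicting the average value of a continuous class, are assumptions):

    import orange, orngTest, orngTree, orngStat

    housing = orange.ExampleTable("housing")
    learners = [orange.MajorityLearner(name="mean"),   # assumed: predicts the average value
                orngTree.TreeLearner(name="tree")]     # regression tree

    res = orngTest.crossValidation(learners, housing)

    scores = [("MSE", orngStat.MSE), ("RMSE", orngStat.RMSE),
              ("MAE", orngStat.MAE), ("RSE", orngStat.RSE),
              ("RRSE", orngStat.RRSE), ("RAE", orngStat.RAE),
              ("R2", orngStat.R2)]
    print "learner\t" + "\t".join([name for name, f in scores])
    for l in range(len(learners)):
        print learners[l].name + "\t" + \
            "\t".join(["%5.3f" % f(res)[l] for name, f in scores])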
The code above produces the following output:
Plotting Functions
 graph_ranks(filename, avranks, names, cd=None, lowv=None, highv=None, width=6, textspace=1, reverse=False, cdmethod=None)
 Draws a CD graph, which is used to display the differences in methods' performance. See Janez Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7(Jan):1-30, 2006. Needs matplotlib to work.
 filename
 Output file name (with extension). Formats supported by matplotlib can be used.
 avranks
 List of the methods' average ranks.
 names
 List of methods' names.
 cd
 Critical difference. Used for marking methods whose difference is not statistically significant.
 lowv
 The lowest shown rank; if None, 1 is used.
 highv
 The highest shown rank; if None, len(avranks) is used.
 width
 Width of the drawn figure in inches, default 6 inches.
 textspace
 Space on figure sides left for the description of methods, default 1 inch.
 reverse
 If True, the lowest rank is on the right. Default: False.
 cdmethod
 None by default. It can be an index of an element in avranks or names, specifying the method to be marked with an interval. If specified, the interval is marked only around that method. This option is meant to be used with the Bonferroni-Dunn test.
    import orange, orngStat

    names = ["first", "third", "second", "fourth" ]
    avranks = [1.9, 3.2, 2.8, 3.3 ]
    cd = orngStat.compute_CD(avranks, 30)  # tested on 30 datasets
    orngStat.graph_ranks("statExamples-graph_ranks1.png", avranks, names, \
        cd=cd, width=6, textspace=1.5)

The code above produces the following graph:
 compute_CD(avranks, N, alpha="0.05", type="nemenyi")
 Returns the critical difference for the Nemenyi or Bonferroni-Dunn test according to the given alpha (either alpha="0.05" or alpha="0.1"), for the given average ranks and number of tested data sets N. Type can be either "nemenyi" for the two-tailed Nemenyi test or "bonferroni-dunn" for the Bonferroni-Dunn test.
 compute_friedman(avranks, N)
 Returns a tuple composed of (Friedman statistic, degrees of freedom) and (Iman statistic, which follows the F-distribution, and its degrees of freedom), given the average ranks and the number of tested data sets N.
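 A small sketch, reusing the average ranks from the graph_ranks example above (assuming the two (statistic, degrees of freedom) pairs unpack as described):

    avranks = [1.9, 3.2, 2.8, 3.3]
    friedman, iman = orngStat.compute_friedman(avranks, 30)
    print "Friedman:", friedman          # (statistic, degrees of freedom)
    print "Iman-Davenport:", iman        # (statistic, degrees of freedom)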
Utility Functions
 splitByIterations(res)
 Splits the ExperimentResults of a multiple-iteration test into a list of ExperimentResults objects, one for each iteration.
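 For example, a sketch that scores each cross-validation fold separately (reusing res from the classification examples above):

    for i, foldRes in enumerate(orngStat.splitByIterations(res)):
        print "fold %i:" % i, ["%5.3f" % ca for ca in orngStat.CA(foldRes)]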