Testing and Evaluating Your Classifiers
In this lesson you will learn how to evaluate classification methods in terms of their power to accurately predict the class of test examples.
The simplest way to estimate accuracy and report score metrics is to use Orange's orngTest and orngStat modules. This is probably how you will perform evaluation in your scripts, so we start with examples that use these two modules.
You may as well perform testing and scoring on your own, so we further provide several example scripts that compute classification accuracy, measure it on a list of classifiers, and do cross-validation, leave-one-out and random sampling. While all of this functionality is available in the orngTest and orngStat modules, these example scripts may still be useful for those who want to learn more about Orange's learner/classifier objects and the way to use them in combination with data sampling.
Testing the Easy Way: orngTest and orngStat Modules
Below is a script that takes a list of learners (a naive Bayesian classifier and a classification tree) and scores their predictive performance on a single data set using ten-fold cross validation. The script reports four different scores: classification accuracy, information score, Brier score and area under the ROC curve.
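A minimal sketch of such a script (assuming the voting data set is stored in voting.tab in the working directory; the learner names, pruning parameter and report formatting are our own choices, and the orngStat function names follow the Orange 2.x API as we recall it):

import orange, orngTest, orngStat, orngTree

# set up the learners and give them names for the report
bayes = orange.BayesLearner(name="bayes")
tree = orngTree.TreeLearner(mForPruning=2, name="tree")
learners = [bayes, tree]

# ten-fold cross validation on the voting data set
data = orange.ExampleTable("voting")
results = orngTest.crossValidation(learners, data, folds=10)

# report classification accuracy, information score, Brier score and AUC
print "Learner  CA     IS     Brier  AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name,
        orngStat.CA(results)[i], orngStat.IS(results)[i],
        orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])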
The output of this script is:
The call to orngTest.crossValidation does the hard work. It returns the object stored in results, which essentially stores the predicted probabilities and class values of the instances that were used as test cases. Based on results, the classification accuracy, information score, Brier score and area under the ROC curve (AUC) for each of the learners are computed (functions CA, IS, BrierScore and AUC from orngStat).
Apart from the statistics mentioned above, orngStat has built-in functions that compute other performance metrics, and orngTest includes other testing schemes. If you need to test your learners with standard statistics, these are probably all you need. Below we show the use of some other statistics, with perhaps more modular code than above.
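As an illustration, here is a sketch that adds sensitivity and specificity to the report (the choice of these two statistics is ours; orngStat.confusionMatrices, sens and spec are the Orange 2.x names as we recall them, so treat the exact spellings as assumptions):

import orange, orngTest, orngStat, orngTree

learners = [orange.BayesLearner(name="bayes"),
            orngTree.TreeLearner(name="tree")]
data = orange.ExampleTable("voting")
results = orngTest.crossValidation(learners, data, folds=10)

# sensitivity/specificity need a confusion matrix, which in turn
# needs a target class; we take "democrat"
target = list(data.domain.classVar.values).index("democrat")
cm = orngStat.confusionMatrices(results, classIndex=target)

print "Learner  CA     Sens   Spec"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name,
        orngStat.CA(results)[i], orngStat.sens(cm[i]), orngStat.spec(cm[i]))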
Notice that for a number of scoring measures we needed to compute the confusion matrix, for which we also had to specify the target class (democrat, in our case). This script produces output similar to the previous one:
Do It On Your Own: A Warm-Up
Let us continue our exploration of the voting data set: build a naive Bayesian classifier from it, and compute the classification accuracy on the same data set (not something we should normally do, because of overfitting, but it serves our demonstration purpose).
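A minimal sketch of such a script (again assuming the voting data is in voting.tab):

import orange

data = orange.ExampleTable("voting")
classifier = orange.BayesLearner(data)

# check each instance: does the predicted class match the true class?
correct = 0.0
for ex in data:
    if classifier(ex) == ex.getclass():
        correct += 1
print "Classification accuracy:", correct / len(data)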
To compute the classification accuracy, the script examines every data item and counts how many times it was classified correctly. Running this script shows that the accuracy is just above 90%.
Now, let us extend the code with a function that is given a data set and a set of classifiers (e.g., accuracy(data, classifiers)) and computes the classification accuracy of each classifier. By this means, let us compare naive Bayes and classification trees.
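A sketch of such a function and its use, in the spirit of the tutorial's accuracy2.py (our reconstruction may differ from the original in detail):

import orange, orngTree

def accuracy(test_data, classifiers):
    # proportion of correctly classified instances, for each classifier
    correct = [0.0] * len(classifiers)
    for ex in test_data:
        for i in range(len(classifiers)):
            if classifiers[i](ex) == ex.getclass():
                correct[i] += 1
    return [c / len(test_data) for c in correct]

data = orange.ExampleTable("voting")
bayes = orange.BayesLearner(data)
tree = orngTree.TreeLearner(data)
acc = accuracy(data, [bayes, tree])
print "Classification accuracies: bayes %5.3f, tree %5.3f" % tuple(acc)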
This is the first time in our tutorial that we define a function. You may see that this is quite simple in Python: functions are introduced with the keyword def, followed by the function's name and a list of arguments. Do not forget the colon at the end of the definition line. Other than that, there is nothing new in this code. A mild exception is the expression classifiers[i](ex): intuition tells us that here the i-th classifier is called like a function, with the example to classify as its argument. So, finally, which method does better? Here is the output:
It looks like the classification tree is much more accurate here. But beware of overfitting (unpruned classification trees are especially prone to it) and read on!
Training and Test Set
In machine learning, one should not learn and test classifiers on the same data set. For this reason, let us split our data in half, use the first half of the data for training and the rest for testing. The script is similar to the one above; the part that differs is shown below:
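A sketch of the differing part; we write it with orange.MakeRandomIndices2, which in Orange 2.x plays the role of the RandomIndicesS2Gen generator described below (the accuracy function is the one defined earlier):

import orange, orngTree

data = orange.ExampleTable("voting")

# split the data in half: 0 marks a training instance, 1 a test instance
selection = orange.MakeRandomIndices2(data, 0.5)
train_data = data.select(selection, 0)
test_data = data.select(selection, 1)
print "%d items in training set, %d in test set" % (len(train_data), len(test_data))

bayes = orange.BayesLearner(train_data)
tree = orngTree.TreeLearner(train_data)
acc = accuracy(test_data, [bayes, tree])  # accuracy() as defined above
print "Classification accuracies: bayes %5.3f, tree %5.3f" % tuple(acc)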
RandomIndicesS2Gen takes the data and generates a vector of length equal to the number of data instances. Elements of the vector are either 0 or 1, and the probability of an element being 0 is 0.5 (or whatever we specify in the argument of the function). The i-th instance of the data then goes either to the training set (if selection[i]==0) or to the test set (if selection[i]==1). Notice that the generator makes sure that the split is stratified, i.e., the class distributions in the training and test sets are approximately equal (use stratified=0 if you do not want this). The output of this testing is:
Here, the accuracy of naive Bayes is much higher. But beware: the result is inconclusive, since it depends on a single random split of the data.
70-30 Random Sampling
Above, we used the function accuracy, which took a data set and a set of classifiers and measured the classification accuracy of the classifiers on the data. Remember, the classifiers were models that had already been constructed (they had "seen" the learning data already), so the data passed to accuracy in fact served as a test set. Now, let us write another function that is given a set of learners and a data set, and repeatedly splits the data set into, say, 70% and 30%, uses the first part of the data (70%) to learn a model and obtain a classifier, which, using the accuracy function developed above, is then tested on the remaining data (30%).
A learner in Orange is an object that encodes a specific machine learning algorithm and is ready to accept data to construct and return a predictive model. We have met quite a number of learners so far (though we did not call them that): orange.BayesLearner(), orange.knnLearner(), and others. If we use Python to simply call a learner, say with

learner = orange.BayesLearner()

then learner becomes an instance of orange.BayesLearner and is ready to accept some data and return a classifier. For instance, in our lessons so far we have used

classifier = orange.BayesLearner(data)

and we could equally use

learner = orange.BayesLearner()
classifier = learner(data)
So why complicate things with learners? Well, in the task we are foreseeing, we will repeatedly do learning and testing. If we want to build a reusable function that takes as input a set of machine learning algorithms and reports on their performance, we can do this only through the use of learners (remember, classifiers have already seen the data and cannot be retrained).
Our script (without the accuracy function, which is exactly like the one we defined in accuracy2.py) is:
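A sketch of such a function (the defaults p=0.7 and n=10 and the final printout are our own choices; accuracy() is assumed to be in scope, as defined above):

import orange, orngTree

def test_rnd_sampling(data, learners, p=0.7, n=10):
    # repeat n times: split into p (train) / 1-p (test), learn, score
    acc = [0.0] * len(learners)
    for i in range(n):
        selection = orange.MakeRandomIndices2(data, p)
        train_data = data.select(selection, 0)
        test_data = data.select(selection, 1)
        classifiers = [l(train_data) for l in learners]
        acc_one = accuracy(test_data, classifiers)  # accuracy() from accuracy2.py
        acc = [a + b for a, b in zip(acc, acc_one)]
    return [a / n for a in acc]

data = orange.ExampleTable("voting")
learners = [orange.BayesLearner(name="bayes"), orngTree.TreeLearner(name="tree")]
for l, a in zip(learners, test_rnd_sampling(data, learners)):
    print "%s: %5.3f" % (l.name, a)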