orngTest: Orange Module for Sampling and Testing

orngTest is an Orange module for testing learning algorithms. It includes functions for data sampling and splitting, and for testing learners. It implements cross-validation, leave-one-out, random sampling and learning curves. All functions return results in the same form: an instance of ExperimentResults, described at the end of this page, or, in the case of learning curves, a list of ExperimentResults. These objects can be passed to the statistical functions for model evaluation (classification accuracy, Brier score, ROC analysis...) available in module orngStat.

Your scripts will thus typically conduct experiments using functions from orngTest, covered on this page, and then evaluate the results with functions from orngStat. For those interested in writing their own statistical measures of model quality, descriptions of TestedExample and ExperimentResults are given at the end of this page.

An important change over previous versions of Orange: Orange has been "de-randomized". Running the same script twice will generally give the same results, unless special care is taken to randomize it. This is the opposite of previous versions, where special care was needed to make experiments repeatable. See the arguments randseed and randomGenerator for an explanation.

Example scripts in this section suppose that the data is loaded and a list of learning algorithms is prepared.


import orange, orngTest, orngStat

data = orange.ExampleTable("voting")

bayes = orange.BayesLearner(name = "bayes")
tree = orange.TreeLearner(name = "tree")
majority = orange.MajorityLearner(name = "default")
learners = [bayes, tree, majority]
names = [x.name for x in learners]

After testing is done, classification accuracies can be computed and printed with the following function (it uses the list names constructed above).

def printResults(res):
    CAs = orngStat.CA(res, reportSE=1)
    for i in range(len(names)):
        print "%s: %5.3f+-%4.3f" % (names[i], CAs[i][0], 1.96*CAs[i][1]),
    print

Common Arguments

Many functions in this module use a set of common arguments, which we define here.

learners
A list of learning algorithms. These can be either pure Orange objects (such as orange.BayesLearner) or classes and functions written in pure Python (anything that can be called with the same arguments as Orange's learners, returns comparable results and performs a similar function).
examples, learnset, testset
Examples, given as an ExampleTable (some functions need an undivided set of examples while others need examples that are already split into two sets). If the examples are weighted, pass them as a tuple (examples, weightID). Weights are respected by learning and testing, but not by sampling: selecting 10% of examples means 10% by number, not by weight. There is also no guarantee that the sums of example weights will be even roughly equal across folds in cross validation.
strat
Tells whether to stratify the random selections. Its default value is orange.StratifiedIfPossible, which stratifies selections if the class attribute is discrete and has no unknown values.
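The idea of stratification can be sketched in plain Python (an illustration of the concept, not Orange's implementation; the function name is made up): group the examples by class and deal each class's examples across the folds in turn, so every fold keeps roughly the original class distribution.

```python
import random
from collections import defaultdict

def stratified_fold_indices(classes, folds, seed=0):
    """Assign a fold index to each example so that every fold keeps
    roughly the original class distribution (conceptual sketch)."""
    rnd = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(classes):
        by_class[c].append(i)
    indices = [None] * len(classes)
    for members in by_class.values():
        rnd.shuffle(members)
        for j, i in enumerate(members):
            indices[i] = j % folds   # deal the class's examples across folds
    return indices

labels = ["a"] * 6 + ["b"] * 4
ind = stratified_fold_indices(labels, 2)
```

With six examples of class "a" and four of class "b" split into two folds, each fold receives three "a" and two "b" examples, whatever the shuffle does.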
randseed (obsolete: indicesrandseed), randomGenerator
Random seed (randseed) or random generator (randomGenerator) for random selection of examples. If omitted, a random seed of 0 is used and the same test will always select the same examples from the example set. There are various slightly different ways to randomize it.

  • Set randomGenerator to orange.globalRandom. The function's selection will then depend upon Orange's global random generator, which is reset (with random seed 0) when Orange is imported. The script's output will therefore depend upon what you did after Orange was first imported in the current Python session.

        res = orngTest.proportionTest(learners, data, 0.7,
                                      randomGenerator = orange.globalRandom)
  • Construct a new orange.RandomGenerator and use it in various places and times. The code below, for instance, will produce different results in each iteration, but the same overall results each time it is run.

        myRandom = orange.RandomGenerator()
        for i in range(3):
            res = orngTest.proportionTest(learners, data, 0.7,
                                          randomGenerator = myRandom)
            printResults(res)
  • Set the random seed (argument randseed) to a random number provided by Python. Python has a global random generator that is reset when Python is loaded, using the current system time for a seed. With this, results will in general be different each time the script is run.

        import random
        for i in range(3):
            res = orngTest.proportionTest(learners, data, 0.7,
                                          randseed = random.randint(0, 100))
            printResults(res)

    The random module also provides random generators as objects, so you can have independent local random generators if you need them.
pps
A list of preprocessors. It consists of tuples (c, preprocessor), where c determines whether the preprocessor will be applied to the learning set ("L"), to the test set ("T") or to both ("B"). The latter is applied first, while the example set is still undivided; the "L" and "T" preprocessors are applied to the separated subsets. Preprocessing test examples is allowed only in experimental procedures that do not report the TestedExamples in the same order as the examples in the original set. The second item of the tuple, preprocessor, can be either a pure Orange or a pure Python preprocessor, that is, any function or callable class that accepts a table of examples and a weight, and returns a preprocessed table and weight.

This example demonstrates the devastating effect of 100% class noise on learning.

classnoise = orange.Preprocessor_addClassNoise(proportion=1.0)
res = orngTest.proportionTest(learners, data, 0.7, 100,
                              pps = [("L", classnoise)])
proportions
Gives the proportions of learning examples at which the tests are to be made, where applicable. The default is [0.1, 0.2, ..., 1.0].
storeClassifiers (keyword argument)
If this flag is set, the testing procedure will store the constructed classifiers. For each iteration of the test (e.g. for each fold in cross validation, for each left-out example in leave-one-out...), the list of classifiers is appended to the ExperimentResults field classifiers.

The script below makes 100 repetitions of the 70:30 test and stores the classifiers it induces.

res = orngTest.proportionTest(learners, data, 0.7, 100, storeClassifiers = 1)

After this, res.classifiers is a list of 100 items, and each item is a list of three classifiers.

verbose (keyword argument)
Several functions can report their progress if you add a keyword argument verbose=1.

Sampling and Testing Functions

proportionTest(learners, data, learnProp, times = 10, strat = ..., pps = [])
Splits the data so that learnProp of the examples are in the learning set and the rest in the testing set. The test is repeated the given number of times (default 10). Division is stratified by default. The function also accepts keyword arguments for randomization and storing classifiers.

100 repetitions of the so-called 70:30 test, in which 70% of examples are used for training and 30% for testing, are performed by

res = orngTest.proportionTest(learners, data, 0.7, 100)

Note that Python allows naming the arguments: instead of "100" you can write "times = 100" for clarity. This is not optional for keyword arguments such as storeClassifiers, randseed or verbose, which must always be given with a name, as shown in the examples above.
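Conceptually, a single proportional split just shuffles the example indices and cuts the shuffled list at the requested proportion. A minimal stdlib sketch (the function name and details are illustrative, not taken from orngTest):

```python
import random

def proportion_split(n_examples, learn_prop, seed=0):
    """Shuffle example indices and cut them at learn_prop: the first
    part indexes the learning set, the rest the test set (sketch)."""
    rnd = random.Random(seed)
    indices = list(range(n_examples))
    rnd.shuffle(indices)
    cut = int(round(learn_prop * n_examples))
    return indices[:cut], indices[cut:]

# With a fixed seed the same split comes back every time,
# mirroring orngTest's default random seed of 0.
learn, test = proportion_split(10, 0.7)
```

Repeating the test `times` times then simply means repeating this split with a running random state and pooling the results.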

leaveOneOut(learners, examples, pps = [])
Performs a leave-one-out experiment with the given list of learners and examples. This is equivalent to performing len(examples)-fold cross validation. Function accepts additional keyword arguments for preprocessing, storing classifiers and verbose output.
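The equivalence with len(examples)-fold cross validation is easy to picture: each example forms its own fold and is held out exactly once. A small illustrative sketch (not the orngTest code):

```python
def leave_one_out(n):
    """Yield (learnset, testset) index pairs: each example in turn is
    held out for testing while the rest form the learning set."""
    for i in range(n):
        learnset = [j for j in range(n) if j != i]
        yield learnset, [i]

splits = list(leave_one_out(3))
```

For three examples this yields three splits; the first is ([1, 2], [0]).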
crossValidation(learners, examples, folds = 10, strat = ..., pps = [])
Performs a cross validation with the given number of folds.
testWithIndices(learners, examples, weight, indices, indicesrandseed="*", pps=None)
Performs a cross-validation-like test. The difference is that the caller provides the indices (each index tells which fold an example belongs to), which need not divide the examples into folds of (approximately) equal sizes. In fact, function crossValidation is actually written as a single call to testWithIndices.

testWithIndices takes care that the TestedExamples are in the same order as the corresponding examples in the original set. Preprocessing of test examples is thus not allowed. The computed results can be saved to files, or loaded from them, if you add a keyword argument cache = 1. In this case, you also have to specify the random seed which was used to compute the indices (argument indicesrandseed); if you don't, there will be no caching.

You can request progress reports with a keyword argument verbose = 1.

learningCurveN(learners, examples, folds = 10, strat = ..., proportions = ..., pps=[])
A simpler interface to function learningCurve (see below). Instead of methods for preparing indices, it simply takes the number of folds and a flag telling whether we want stratified cross validation or not. This function does not return a single ExperimentResults but a list of them, one for each proportion.

prop = [0.2, 0.4, 0.6, 0.8, 1.0]
res = orngTest.learningCurveN(learners, data, folds = 5, proportions = prop)
for i, p in enumerate(prop):
    print "%5.3f:" % p,
    printResults(res[i])

The function basically prepares a random generator and example selectors (cv and pick, see below) and calls learningCurve.

learningCurve(learners, examples, cv = None, pick = None, proportions = ..., pps=[])
Computes learning curves using a procedure recommended by Salzberg (1997). It first prepares data subsets (folds). For each proportion, it performs the cross-validation, but taking only a proportion of examples for learning.

Arguments cv and pick give the methods for preparing indices for cross validation and for random selection of learning examples. If they are not given, orange.MakeRandomIndicesCV and orange.MakeRandomIndices2 are used; both will be stratified and the cross validation will be 10-fold. Argument proportions is a list of proportions of learning examples.

The function can save time by loading existing experimental data for any tests that were already conducted and saved; the computed results are likewise stored for later use. You can enable this by adding a keyword argument cache=1. Another keyword argument deals with progress reports: if you add verbose=1, the function will print the proportion and the fold number.
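The overall procedure can be sketched with the standard library alone. This is a conceptual illustration under simplifying assumptions (a learner is a plain function, scoring is 0/1 accuracy, and folds are assigned round-robin rather than through orange.MakeRandomIndicesCV):

```python
import random

def learning_curve_points(fit, score, data, proportions, folds=5, seed=0):
    """For each proportion p, run cross validation, but train on only a
    random fraction p of each fold's learning examples; return one
    accuracy per proportion (conceptual sketch)."""
    rnd = random.Random(seed)
    fold_of = [i % folds for i in range(len(data))]   # round-robin folds
    points = []
    for p in proportions:
        correct, total = 0, 0
        for fold in range(folds):
            learnset = [ex for ex, f in zip(data, fold_of) if f != fold]
            rnd.shuffle(learnset)
            learnset = learnset[:max(1, int(p * len(learnset)))]
            model = fit(learnset)
            for ex, f in zip(data, fold_of):
                if f == fold:
                    correct += 1 if score(model, ex) else 0
                    total += 1
        points.append(correct / float(total))
    return points

def majority_fit(learnset):
    """Toy learner: always predict the majority class."""
    ys = [y for _, y in learnset]
    guess = max(set(ys), key=ys.count)
    return lambda x: guess

data = [(i, "a") for i in range(10)]   # single-class toy data
points = learning_curve_points(majority_fit,
                               lambda m, ex: m(ex) == ex[1],
                               data, [0.2, 1.0])
```

Each returned value corresponds to one proportion, which mirrors learningCurve returning one ExperimentResults per proportion.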

learningCurveWithTestData(learners, learnset, testset, times = 10, proportions = ..., strat = ..., pps=[])
This function is suitable for computing a learning curve on datasets where the learning and testing examples are split in advance. For each proportion of learning examples, it randomly selects the requested number of learning examples, builds the models and tests them on the entire testset. The whole test is repeated the given number of times for each proportion. The result is a list of ExperimentResults, one for each proportion.

In the following script, the examples are divided into a training and a testing set in advance. Learning curves are computed in which 20, 40, 60, 80 and 100 percent of the examples in the former set are used for learning, and the latter set is used for testing. Random selection of the given proportion of the learning set is repeated five times.

indices = orange.MakeRandomIndices2(data, p0 = 0.7)
train = data.select(indices, 0)
test = data.select(indices, 1)
res = orngTest.learningCurveWithTestData(
    learners, train, test, times = 5, proportions = prop)
for i, p in enumerate(prop):
    print "%5.3f:" % p,
    printResults(res[i])

learnAndTestOnTestData(learners, learnset, testset, testResults=None, iterationNumber=0, pps=[])
This function performs no sampling on its own: two separate datasets need to be passed, one for training and the other for testing. The function preprocesses the data, induces the model and tests it. The order of preprocessing is peculiar, but it makes sense when compared to other methods that support preprocessing of test examples. The function first applies the preprocessors marked "B" (both sets), and only then the preprocessors that need to process only one of the sets.

You can pass an already initialized ExperimentResults (argument testResults) and an iteration number (iterationNumber); results of the test will then be appended with the given iteration number. This is used when learnAndTestOnTestData is called by other functions, like proportionTest and learningCurveWithTestData. If you omit these parameters, a new ExperimentResults will be created.

learnAndTestOnLearnData(learners, learnset, testResults=None, iterationNumber=0, pps=[])
This function is similar to the above, except that it learns and tests on the same data. It first preprocesses the data with the "B" preprocessors on the whole data, and afterwards with any "L" or "T" preprocessors on the separate datasets. Then it induces the model from the learning set and tests it on the testing set.

As with learnAndTestOnTestData, you can pass an already initialized ExperimentResults (argument testResults) and an iteration number to the function. In this case, the results of the test will be appended with the given iteration number.

testOnData(classifiers, testset, testResults=None, iterationNumber=0)
This function gets a list of classifiers, not learners like the other functions in this module. It classifies each test example with each classifier. You can pass an existing ExperimentResults and an iteration number, as with learnAndTestOnTestData (which actually calls testOnData). If you don't, a new ExperimentResults will be created.


Knowing the classes TestedExample, which stores the results of testing for a single test example, and ExperimentResults, which stores a list of TestedExamples along with some other data on the experimental procedures and classifiers used, is important if you would like to write your own measures of model quality compatible with the sampling infrastructure provided by Orange. If not, you can skip the remainder of this page.


TestedExample stores predictions of different classifiers for a single testing example.


classes
A list of predictions of type Value, one for each classifier.
probabilities
A list of probabilities of classes, one for each classifier.
iterationNumber
Iteration number (e.g. fold) in which the TestedExample was created/tested.
actualClass
The correct class of the example.
weight
Example's weight. Even if the example set was not weighted, this attribute is present and equals 1.0.


__init__(iterationNumber = None, actualClass = None, n = 0)
Constructs and initializes a new TestedExample.
addResult(aclass, aprob)
Appends a new result (class and probability prediction by a single classifier) to the classes and probabilities fields.
setResult(i, aclass, aprob)
Sets the result of the i-th classifier to the given values.
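Putting the attributes and methods above together, a minimal Python sketch of such a record might look as follows (this is an illustration, not the actual orngTest class; the extra weight argument is an assumption added for compactness):

```python
class TestedExample:
    """Stores the predictions of several classifiers for one test
    example (illustrative sketch, not the actual orngTest class)."""
    def __init__(self, iterationNumber=None, actualClass=None, n=0, weight=1.0):
        self.classes = [None] * n        # one predicted value per classifier
        self.probabilities = [None] * n  # one class distribution per classifier
        self.iterationNumber = iterationNumber
        self.actualClass = actualClass
        self.weight = weight             # 1.0 even for unweighted data

    def addResult(self, aclass, aprob):
        """Append one classifier's prediction."""
        self.classes.append(aclass)
        self.probabilities.append(aprob)

    def setResult(self, i, aclass, aprob):
        """Overwrite the i-th classifier's prediction."""
        self.classes[i] = aclass
        self.probabilities[i] = aprob

te = TestedExample(iterationNumber=0, actualClass="a")
te.addResult("a", [0.9, 0.1])
```

A custom quality measure would iterate over such records, comparing classes[i] (or probabilities[i]) with actualClass for each learner i.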


ExperimentResults stores results of one or more repetitions of some test (cross validation, repeated sampling...) under the same circumstances.


results
A list of instances of TestedExample, one for each example in the dataset.
classifiers
A list of classifiers, one element for each repetition (e.g. fold). Each element is a list of classifiers, one for each learner. This field is used only if storing is enabled with storeClassifiers = 1.
numberOfIterations
Number of iterations. This can be the number of folds (in cross validation) or the number of repetitions of some test. The TestedExample attribute iterationNumber should be in range [0, numberOfIterations-1].
numberOfLearners
Number of learners. The lengths of the lists classes and probabilities in each TestedExample should equal numberOfLearners.
loaded
If the experimental method supports caching and there are no obstacles to caching (such as unknown random seeds), this is a list of boolean values. Each element corresponds to a classifier and tells whether the experimental results for that classifier were computed or loaded from the cache.
weights
A flag telling whether the results are weighted. If false, weights are still present in the TestedExamples, but they all equal 1.0. Clear this flag if your experimental procedure ran on weighted test examples but you would like to ignore the weights in statistics.


__init__(iterations, learners, weights)
Initializes the object and sets the number of iterations, learners and the flag telling whether TestedExamples will be weighted.
saveToFiles(lrn, filename), loadFromFiles(lrn, filename)
Saves and loads testing results. lrn is a list of learners and filename is a template for the file name. The attribute loaded is initialized so that it contains 1's for the learners whose data was loaded and 0's for the learners which need to be tested. The function returns 1 if all the files were found and loaded, and 0 otherwise.

The data is saved in a separate file for each classifier. The file is a binary pickle file containing a list of tuples
((x.actualClass, x.iterationNumber), (x.classes[i], x.probabilities[i]))
where x is a TestedExample and i is the index of a learner.

The file resides in the directory ./cache. Its name is built from the template given by the caller. The template should contain a %s, which is replaced by the name, shortDescription, description, func_doc or func_name attribute of the learner, whichever is found first (this is extracted by orngMisc.getobjectname). If a learner has none of these attributes, its class name is used.

Filename should include enough data to make sure that it indeed contains the right experimental results. The function learningCurve, for example, forms the name of the file from a string "{learningCurve}", the proportion of learning examples, random seeds for cross-validation and learning set selection, a list of preprocessors' names and a checksum for examples. Of course you can outsmart this, but it should suffice in most cases.
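Since the cache is an ordinary pickle of the tuple list quoted above, it can be written and read back with the standard pickle module. A hedged sketch (the file name and the concrete values are made up for illustration):

```python
import os
import pickle
import tempfile

# One (key, result) tuple per tested example, in the format quoted above:
# ((actualClass, iterationNumber), (predictedClass, probabilities))
entries = [(("a", 0), ("a", [0.9, 0.1])),
           (("b", 0), ("a", [0.6, 0.4]))]

# Illustrative path; the real files live under ./cache with a %s template.
path = os.path.join(tempfile.gettempdir(), "cache-bayes.pickle")
with open(path, "wb") as f:
    pickle.dump(entries, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```

The (actualClass, iterationNumber) key is what lets loadFromFiles match loaded results back to the right TestedExamples.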

remove(i)
Removes the results for the i-th learner.
add(results, index, replace=-1)
Appends the results of the index-th learner, or uses it to replace the results of the learner with the index replace if replace is a valid index. Assumes that results came from evaluation on the same data set using the same testing technique (same number of iterations).


Salzberg, S. L. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1, pages 317-328.