Random sampling data (sample)¶
Random sampling is done by constructing a vector of subset indices (e.g. a table of 0’s and 1’s), one corresponding to each instance, and then passing the vector to the table’s Orange.data.Table.select method.
Orange provides several classes for constructing such indices: SubsetIndices2 for splitting into two sets (or extracting a random subset), SubsetIndicesN for splitting into multiple sets, and SubsetIndicesCV for cross validation. All classes are derived from the abstract class SubsetIndices.
The typical usage pattern is as follows.
import Orange

lenses = Orange.data.Table("lenses")
indices2 = Orange.data.sample.SubsetIndices2(p0=0.25)
ind = indices2(lenses)
lenses0 = lenses.select(ind, 0)
lenses1 = lenses.select(ind, 1)
Subset indices are deterministic in the sense that unless the caller explicitly modifies random seeds, the same setup will always return the same indices. Details are shown in the section about SubsetIndices2.
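The idea of an index vector paired with a selection step can be sketched in plain Python. This is only an illustration and does not use Orange: the `select` function and the `rows` list are hypothetical stand-ins for Orange.data.Table.select and the lenses table.

```python
import random


def select(rows, indices, value):
    """Return the rows whose subset index equals `value` (hypothetical helper)."""
    return [row for row, idx in zip(rows, indices) if idx == value]


# Stand-in data: 24 rows, the same size as the lenses dataset.
rows = ["instance %d" % i for i in range(24)]

# One index per row: six 0's and eighteen 1's in random order.
rng = random.Random(0)
indices = [0] * 6 + [1] * 18
rng.shuffle(indices)

subset0 = select(rows, indices, 0)
subset1 = select(rows, indices, 1)
print(len(subset0), len(subset1))  # 6 18
```

Because the generator is seeded explicitly, rerunning the sketch yields the same index vector, which mirrors the deterministic behaviour described above.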
- class Orange.data.sample.SubsetIndices¶
- stratified¶
Defines whether the samples should be stratified, that is, whether all subsets should have approximately equal class distributions. Possible values are
- Stratified¶
Division is stratified; an exception is raised if this is not possible, for instance if the class is numeric.
- NotStratified¶
Division is not stratified.
- StratifiedIfPossible¶
Division is stratified if possible and unstratified otherwise (default).
- randseed¶
- random_generator¶
If random_generator (of type Orange.misc.Random) is set, it is used for generation of random numbers. In this case, SubsetIndices will return a different set of indices each time it is called.
The same generator can be shared between different objects; this can be useful when constructing an experiment that depends on a single random seed.
If random_generator is not given, but randseed is set (that is, positive), the value is used to initialize a new, temporary local random generator. This way, the indices generator will always give the same indices for the same data.
If neither of the two is defined, a new random generator is constructed and initialized with a seed of 0 each time the object is called. Note that this differs from some other classes, such as Descriptor, Distribution and Table, which store such generators for future use: the generator constructed by SubsetIndices is discarded after use.
Examples are shown in documentation for SubsetIndices2.
- __call__(data)¶
Return a list of indices for the given data table. If data has a discrete class, sampling can be stratified.
- __call__(n)
Return a list of n indices. Sampling cannot be stratified.
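The precedence between random_generator and randseed described above can be sketched as follows. This is a minimal illustration using Python's standard random module, not Orange's implementation: the class name `IndexSampler` and its `zeros` argument are hypothetical, and random.Random stands in for Orange.misc.Random.

```python
import random


class IndexSampler:
    """Hypothetical stand-in for an index generator; not Orange's code."""

    def __init__(self, zeros, randseed=0, random_generator=None):
        self.zeros = zeros
        self.randseed = randseed
        self.random_generator = random_generator

    def __call__(self, n):
        if self.random_generator is not None:
            # A stored generator keeps its state between calls.
            rng = self.random_generator
        else:
            # A temporary generator is built for this call only,
            # seeded with randseed (0 when randseed is left unset).
            rng = random.Random(self.randseed)
        indices = [0] * self.zeros + [1] * (n - self.zeros)
        rng.shuffle(indices)
        return indices


sampler = IndexSampler(zeros=6)
default_a, default_b = sampler(24), sampler(24)  # seed 0 on every call

sampler.randseed = 42
seeded_a, seeded_b = sampler(24), sampler(24)    # seed 42 on every call

sampler.random_generator = random.Random(42)     # takes priority over randseed
shared_a = sampler(24)                           # first draw from the stored generator
shared_b = sampler(24)                           # drawn from the advanced state

print(default_a == default_b, seeded_a == seeded_b, shared_a == seeded_a)
```

In this sketch the first call with a stored generator seeded with 42 matches the randseed=42 result, since both start from the same state; later calls continue from wherever the stored generator left off.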
- class Orange.data.sample.SubsetIndices2¶
Prepares a list of 0’s and 1’s in the given proportions.
- p0¶
The proportion or a number of 0’s. If p0 is less than 1, the number gives a proportion; for instance, if p0 is 0.2, 20% of indices will be 0’s and 80% will be 1’s. If p0 is 1 or more, it gives the number of 0’s; with p0=10, the list will have 10 0’s and the rest of the list will be 1’s.
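The two readings of p0 can be made concrete with a small helper. The function name `zeros_from_p0` is hypothetical, and rounding a proportion to the nearest integer is an assumption made for this sketch; the text above does not say how Orange rounds.

```python
def zeros_from_p0(p0, n):
    """Number of 0's implied by p0 for n instances (illustrative only)."""
    if p0 < 1:
        # Proportion of the data; nearest-integer rounding is assumed here.
        return int(round(p0 * n))
    # A value of 1 or more is read as an absolute count.
    return int(p0)


print(zeros_from_p0(0.25, 24))  # 6
print(zeros_from_p0(6, 24))     # 6
print(zeros_from_p0(10, 24))    # 10
```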
The following example splits the lenses data into two datasets, the first containing only 6 data instances and the other containing the rest (from randomindices2.py):
indices2 = Orange.data.sample.SubsetIndices2(p0=6)
ind = indices2(lenses)
print ind
lenses0 = lenses.select(ind, 0)
lenses1 = lenses.select(ind, 1)
print len(lenses0), len(lenses1)
Output:
<0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
6 18
Repeating this gives the same set of indices.
print "\nIndices without playing with random generator"
for i in range(5):
    print indices2(lenses)
Output:
Indices without playing with random generator
<0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
<0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
<0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
<0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
<0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
With a random generator, it gives different indices every time.
print "\nIndices with random generator"
indices2.random_generator = Orange.misc.Random(42)
for i in range(5):
    print indices2(lenses)
Output:
Indices with random generator
<1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
<1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1>
<1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1>
<1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0>
<1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1>
Running the same script again, however, gives the same sequence of indices, since the same random generator is constructed and used.
The next example sets the random seed and removes the random generator (otherwise the seed would have no effect as the generator has the priority). At each call, it constructs a private random generator and initializes it with the given seed, and therefore always returns the same indices.
print "\nIndices with randseed"
indices2.random_generator = None
indices2.randseed = 42
for i in range(5):
    print indices2(lenses)
Output:
Indices with randseed <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
There are 24 instances in the dataset. Setting SubsetIndices2.p0 to 0.25 instead of 6 gives the same result.
print "\nIndices with p0 set as probability (not 'a number of')"
indices2.p0 = 0.25
print indices2(lenses)
Output:
Indices with p0 set as probability (not 'a number of')
<1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
The class can also be called with the number of data instances instead of the data. In this case, stratification is not possible. The following call passes the data table itself, with stratified set to StratifiedIfPossible.
print "\n... stratified 'if possible'"
indices2.stratified = indices2.StratifiedIfPossible
print indices2(lenses)
Output:
... stratified 'if possible'
<1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
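What stratification means for a two-way split can be illustrated without Orange. The sketch below is one simple way to keep class proportions roughly equal in both subsets and is not claimed to be the algorithm Orange uses; the function `stratified_indices2`, the labels "a" and "b", and the nearest-integer rounding are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_indices2(classes, p0, seed=0):
    """Give about p0 of each class the index 0 and the rest the index 1."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for position, label in enumerate(classes):
        by_class[label].append(position)

    indices = [1] * len(classes)
    for positions in by_class.values():
        rng.shuffle(positions)
        n_zeros = int(round(p0 * len(positions)))  # rounding is an assumption
        for position in positions[:n_zeros]:
            indices[position] = 0
    return indices


# Hypothetical class column: 16 rows of class "a" and 8 rows of class "b".
classes = ["a"] * 16 + ["b"] * 8
indices = stratified_indices2(classes, p0=0.25)

zeros_a = sum(1 for label, idx in zip(classes, indices) if idx == 0 and label == "a")
zeros_b = sum(1 for label, idx in zip(classes, indices) if idx == 0 and label == "b")
print(zeros_a, zeros_b)  # 4 2
```

Each class contributes a quarter of its rows to subset 0, so both subsets keep the 2:1 ratio between the two classes.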
- class Orange.data.sample.SubsetIndicesN¶
A generalization of SubsetIndices2 to multiple subsets.
- p¶
A list of proportions of data that go to each fold. If p has a length of 3, the returned list will contain four different indices: the first three will have the probabilities defined in p, while the last will have a probability of (1 - sum of elements of p).
SubsetIndicesN does not support stratification; setting stratified to Stratified will yield an error.
The following constructs a division in which one half of the data is in the first set and one quarter in each of the second and third sets (randomindicesn.py).
lenses = Orange.data.Table("lenses")
indicesn = Orange.data.sample.SubsetIndicesN(p=[0.5, 0.25])
ind = indicesn(lenses)
print ind
Output:
<1, 0, 0, 2, 0, 1, 1, 0, 2, 0, 2, 2, 1, 0, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0>
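How a list p with an implicit last proportion turns into subset sizes can be sketched in plain Python. The helper `indices_n` is hypothetical and the rounding of each share to the nearest integer is an assumption; only the rule that the last subset receives the remainder comes from the description of p above.

```python
import random


def indices_n(p, n, seed=0):
    """Return n indices split according to p, plus one extra subset for the remainder."""
    sizes = [int(round(share * n)) for share in p]  # rounding is an assumption
    sizes.append(n - sum(sizes))                    # remainder goes to the last subset
    indices = [fold for fold, size in enumerate(sizes) for _ in range(size)]
    random.Random(seed).shuffle(indices)
    return indices


ind = indices_n([0.5, 0.25], 24)
print([ind.count(fold) for fold in range(3)])  # [12, 6, 6]
```

The counts agree with the output shown above, which contains twelve 0's, six 1's and six 2's.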
- class Orange.data.sample.SubsetIndicesCV¶
Computes indices for cross-validation by constructing a list of indices between 0 and folds-1 (inclusive), with an equal number of each (if the number of instances is not divisible by folds, the last folds will have one element less).
- folds¶
Number of folds. Default is 10.
The following prepares indices for ten-fold cross validation on the lenses data, and indices for 5-fold cross validation on 10 data instances; in the latter case, no actual data is given (randomindicescv.py).
import Orange

lenses = Orange.data.Table("lenses")
print "Indices for ordinary 10-fold CV"
print Orange.data.sample.SubsetIndicesCV(lenses)
print "Indices for 5 folds on 10 instances"
print Orange.data.sample.SubsetIndicesCV(10, folds=5)
Output:
Indices for ordinary 10-fold CV
<1, 1, 3, 8, 8, 3, 2, 7, 5, 0, 1, 5, 2, 9, 4, 7, 4, 9, 3, 6, 0, 2, 0, 6>
Indices for 5 folds on 10 instances
<3, 0, 1, 0, 3, 2, 4, 4, 1, 2>
Since the 24 instances do not divide evenly into ten folds, the first four folds have one element more: there are three 0’s, 1’s, 2’s and 3’s, but only two each of 4’s through 9’s.
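The fold sizes follow from simple arithmetic, which the sketch below reproduces in plain Python. It is an illustration of the counting rule only, not Orange's implementation; the function name `cv_indices` and the use of the standard random module are assumptions.

```python
import random


def cv_indices(n, folds=10, seed=0):
    """Return n fold indices in 0..folds-1 with sizes differing by at most one."""
    # Cycling through the fold numbers gives the earlier folds the extra elements.
    indices = [i % folds for i in range(n)]
    random.Random(seed).shuffle(indices)
    return indices


ind = cv_indices(24, folds=10)
print([ind.count(fold) for fold in range(10)])  # [3, 3, 3, 3, 2, 2, 2, 2, 2, 2]

ind5 = cv_indices(10, folds=5)
print([ind5.count(fold) for fold in range(5)])  # [2, 2, 2, 2, 2]
```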