source: orange/Orange/data/sample.py @ 9697:b203c6d09c9b

Revision 9697:b203c6d09c9b, 12.4 KB checked in by Jure Zbontar <jure.zbontar@…>, 2 years ago (diff)

Trivial merge

"""
=================================
Sampling of examples (``sample``)
=================================

Example sampling is one of the basic procedures in machine learning. If
for nothing else, everybody needs to split a dataset into training and
testing examples.

It is easy to select a subset of examples in Orange. The key idea is the
use of indices: first construct a list of indices, one corresponding
to each example. Then you can select examples by indices, say take
all examples with index 3, or all examples with an index other than 3.
This is useful for many typical setups, such as 70-30 splits or
cross-validation.

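The index-based selection described above can be illustrated without
Orange. The following is a minimal pure-Python sketch, not Orange's
implementation; ``select_by_index`` is a hypothetical helper that only
mimics the idea behind :obj:`Orange.data.Table.select`.

```python
def select_by_index(examples, indices, value, negate=False):
    # Keep the examples whose index equals `value`, or differs from it
    # when `negate` is True. Hypothetical helper, for illustration only.
    if len(examples) != len(indices):
        raise ValueError('need exactly one index per example')
    return [ex for ex, idx in zip(examples, indices)
            if (idx == value) != negate]


examples = ['a', 'b', 'c', 'd', 'e', 'f']
indices = [0, 3, 1, 3, 2, 0]

with_3 = select_by_index(examples, indices, 3)
without_3 = select_by_index(examples, indices, 3, negate=True)
print(with_3)      # ['b', 'd']
print(without_3)   # ['a', 'c', 'e', 'f']
```

The same list of indices serves both selections, which is what makes a
single index list sufficient for a train/test split.
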
Orange provides methods for making such selections, such as
:obj:`Orange.data.Table.select`. It also provides methods
for constructing indices for different kinds of splits. For instance,
for the most commonly used sampling method, cross-validation, Orange's
class :obj:`SubsetIndicesCV` prepares a list of indices that assign a
fold to each example.

Classes that construct such indices are derived from a basic
abstract :obj:`SubsetIndices`. There are three different classes
provided. :obj:`SubsetIndices2` constructs a list of 0's and 1's in
a prescribed proportion; it can be used, for instance, for 70-30
divisions into training and testing examples. A more general
:obj:`SubsetIndicesN` constructs a list of indices from 0 to N-1 in
given proportions. Finally, the most often used :obj:`SubsetIndicesCV`
prepares indices for cross-validation.

Subset indices are more deterministic than in versions of Orange prior to
September 2003. See examples in the section about :obj:`SubsetIndices2`
for details.

.. class:: SubsetIndices

    .. data:: Stratified

    .. data:: NotStratified

    .. data:: StratifiedIfPossible

        Constants for setting :obj:`stratified`. If
        :obj:`StratifiedIfPossible`, Orange will try to construct
        stratified indices, but fall back to non-stratified if anything
        goes wrong. For stratified indices, it needs to see the example
        table (see the calling operator below), and the class should be
        discrete and have no unknown values.


    .. attribute:: stratified

        Defines whether the division should be stratified, that is,
        whether all subsets should have approximately equal class
        distributions. Possible values are :obj:`Stratified`,
        :obj:`NotStratified` and :obj:`StratifiedIfPossible` (default).

    .. attribute:: randseed

    .. attribute:: random_generator

        These two fields deal with the way :obj:`SubsetIndices` generates
        random numbers.

        If :obj:`random_generator` (of type :obj:`Orange.misc.Random`)
        is set, it is used. The same random generator can be shared
        between different objects; this can be useful when constructing an
        experiment that depends on a single random seed. If you use this,
        :obj:`SubsetIndices` will return a different set of indices each
        time it's called, even with the same arguments.

        If :obj:`random_generator` is not given, but :attr:`randseed` is
        (positive values denote a defined :obj:`randseed`), the value is
        used to initialize a new, temporary local random generator. This
        way, the indices generator will always give the same indices for
        the same data.

        If neither of the two is defined, a new random generator
        is constructed each time the object is called (note that
        this is unlike some other classes, such as :obj:`Variable`,
        :obj:`Distribution` and :obj:`Orange.data.Table`, that store
        such generators for future use; the generator constructed by
        :obj:`SubsetIndices` is disposed of after use) and initialized
        with random seed 0. This thus has the same effect as setting
        :obj:`randseed` to 0.

        The example for :obj:`SubsetIndices2` shows the difference
        between those options.

    .. method:: __call__(examples)

        :obj:`SubsetIndices` can be called to return a list of
        indices. The argument can be either the desired length of the list
        (presumably corresponding to the length of some list of examples)
        or a set of examples, given as :obj:`Orange.data.Table` or a plain
        Python list. In the former case, indices cannot correspond to a
        stratified division; if :obj:`stratified` is set to
        :obj:`Stratified`, an exception is raised.

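The precedence between :obj:`random_generator` and :obj:`randseed`
described above can be sketched in plain Python. ``IndexMaker`` is a
hypothetical stand-in built on the standard ``random`` module; it only
mimics the stated rules and says nothing about how Orange implements them.

```python
import random


class IndexMaker(object):
    # Hypothetical stand-in that mimics only the random-number rules
    # described above; it is not Orange's SubsetIndices.

    def __init__(self, random_generator=None, randseed=0):
        self.random_generator = random_generator
        self.randseed = randseed

    def _rng(self):
        if self.random_generator is not None:
            # A shared generator keeps its state between calls.
            return self.random_generator
        # Otherwise build a throwaway generator from the seed
        # (0 when no seed was given) and discard it after the call.
        return random.Random(self.randseed)

    def __call__(self, n, zeros):
        indices = [0] * zeros + [1] * (n - zeros)
        self._rng().shuffle(indices)
        return indices


seeded = IndexMaker(randseed=42)
print(seeded(24, 6) == seeded(24, 6))                   # True

default = IndexMaker()
print(default(24, 6) == IndexMaker(randseed=0)(24, 6))  # True

shared = IndexMaker(random_generator=random.Random(42))
first = shared(24, 6)    # drawn from the generator's initial state
second = shared(24, 6)   # drawn after the state has advanced
print(first == seeded(24, 6))                           # True
```

With a shared generator, repeated calls consume the same stream, so
later calls normally differ from the first one.
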
.. class:: SubsetIndices2

    This object prepares a list of 0's and 1's.

    .. attribute:: p0

        The proportion or the number of 0's. If :obj:`p0` is less than
        1, it's a proportion. For instance, if :obj:`p0` is 0.2, 20%
        of indices will be 0's and 80% will be 1's. If :obj:`p0`
        is 1 or more, it gives the exact number of 0's. For instance,
        with :obj:`p0` of 10, you will get a list with 10 0's and
        the rest of the list will be 1's.

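The two readings of :obj:`p0` can be summarized in a few lines. This is
a sketch under stated assumptions: ``zeros_from_p0`` is a hypothetical
helper, and rounding a fractional result to the nearest integer is an
assumption, since the text above does not say how Orange rounds.

```python
def zeros_from_p0(p0, n):
    # Interpret p0 as a proportion when below 1, else as a count.
    # Rounding to the nearest integer is an assumption of this sketch.
    if p0 < 1:
        return int(round(p0 * n))
    return int(p0)


print(zeros_from_p0(0.25, 24))  # 6
print(zeros_from_p0(6, 24))     # 6
print(zeros_from_p0(0.2, 10))   # 2
```
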
Say that you have loaded the lenses dataset into ``data``. We'll split
it into two datasets, the first containing only 6 examples and the other
containing the rest (from :download:`randomindices2.py <code/randomindices2.py>`):

.. literalinclude:: code/randomindices2.py
    :lines: 11-17

Output::

    <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
    6 18

No surprises here. Let's now see how the random seeds and generators
behave. First, we shall simply construct and print five lists of random
indices.

.. literalinclude:: code/randomindices2.py
    :lines: 19-21

Output::

    Indices without playing with random generator
    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>
    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1>


We ran it five times and got the same result each time.

.. literalinclude:: code/randomindices2.py
    :lines: 23-26

Output::

    Indices with random generator
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
    <1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1>
    <1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1>
    <1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0>
    <1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1>

We have constructed a private random generator for random indices and
got five different lists. If you run the whole script again, however,
you'll get the same five lists, since the generator will be constructed
again and start generating numbers from the beginning. Again, you should
get these same indices on any operating system.

.. literalinclude:: code/randomindices2.py
    :lines: 28-32

Output::

    Indices with randseed
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>


Here we have set the random seed and removed the random generator
(otherwise the seed would have no effect, as the generator has
priority). Each time we run the indices generator, it constructs a
private random generator and initializes it with the given seed, and
consequently always returns the same indices.

Let's play with :obj:`SubsetIndices2.p0`. There are 24 examples in the
dataset. Setting :obj:`SubsetIndices2.p0` to 0.25 instead of 6 shouldn't
alter the indices. Let's check it.

.. literalinclude:: code/randomindices2.py
    :lines: 35-37

Output::

    Indices with p0 set as probability (not 'a number of')
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>

Finally, let's observe the effects of :obj:`~SubsetIndices.stratified`. By
default, indices are stratified if it's possible and, in our case,
it is and they are.

.. literalinclude:: code/randomindices2.py
    :lines: 39-49

Output::

    ... with stratification
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>
    <0.625, 0.167, 0.208>
    <0.611, 0.167, 0.222>

We explicitly requested stratification and got the same indices as
before. That's OK. We also printed out the distribution for the whole
dataset and for the selected dataset (as we gave no second parameter,
the examples with non-zero indices got selected). They are not the same,
but they are pretty close. :obj:`SubsetIndices2` did what it could. Now
let's try without stratification. The script is much the same except for
changing :obj:`~SubsetIndices.stratified` to
:obj:`~SubsetIndices.NotStratified`.

.. literalinclude:: code/randomindices2.py
    :lines: 51-62

Output::

    ... and without stratification
    <0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1>
    <0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1>
    <0.625, 0.167, 0.208>
    <0.611, 0.167, 0.222>


Different indices and ... just look at the distribution. It could be
worse but, well, :obj:`~SubsetIndices.NotStratified` doesn't mean that
Orange will make an effort to get uneven distributions. It just won't
mind them.

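To make the notion of stratification concrete, here is a pure-Python
sketch of a stratified 0/1 split. It is an illustration only:
``stratified_zero_one`` and ``distribution`` are hypothetical helpers,
the largest-remainder rounding is an assumption rather than Orange's
documented algorithm, and the class labels are placeholders chosen so
that the class sizes (15, 4 and 5 out of 24) reproduce the whole-dataset
distribution printed above.

```python
import random
from collections import Counter


def stratified_zero_one(classes, n_zeros, rng):
    # Give each class a share of the 0's proportional to its size,
    # using largest-remainder rounding (an assumption of this sketch).
    n = len(classes)
    sizes = Counter(classes)
    exact = dict((c, n_zeros * size / float(n)) for c, size in sizes.items())
    quota = dict((c, int(exact[c])) for c in exact)
    leftover = n_zeros - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c],
                          reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    indices = [1] * n
    for c in sizes:
        members = [i for i, cls in enumerate(classes) if cls == c]
        for i in rng.sample(members, quota[c]):
            indices[i] = 0
    return indices


def distribution(classes, labels):
    # Relative class frequencies, rounded to three decimals.
    counts = Counter(classes)
    total = float(len(classes))
    return [round(counts[c] / total, 3) for c in labels]


# Placeholder class column with class sizes 15, 4 and 5.
classes = ['none'] * 15 + ['hard'] * 4 + ['soft'] * 5
labels = ['none', 'hard', 'soft']
indices = stratified_zero_one(classes, 6, random.Random(0))
selected = [c for c, i in zip(classes, indices) if i != 0]

print(distribution(classes, labels))   # [0.625, 0.167, 0.208]
print(distribution(selected, labels))  # [0.611, 0.167, 0.222]
```

Six 0's cannot be spread over classes of size 15, 4 and 5 in exact
proportion, which is why the selected subset is close to, but not
identical with, the distribution of the whole dataset.
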
For a final test, you can set the class of one of the examples to unknown
and rerun the last script, setting :obj:`~SubsetIndices.stratified`
once to :obj:`~SubsetIndices.Stratified` and once to
:obj:`~SubsetIndices.StratifiedIfPossible`. In the first case you'll
get an error and in the second you'll get non-stratified indices.

.. literalinclude:: code/randomindices2.py
    :lines: 64-70

Output::

    ... stratified 'if possible'
    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1>

    ... stratified 'if possible', after removing the first example's class
    <0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1>

.. class:: SubsetIndicesN

    A straight generalization of :obj:`SubsetIndices2`, so there's not
    much to be told about it.

    .. attribute:: p

        A list of proportions of examples that go to each fold. If
        :obj:`p` has a length of 3, the returned list will have four
        different indices; the first three will have probabilities as
        defined in :obj:`p`, while the last will have a probability of
        (1 - sum of elements of :obj:`p`).

:obj:`SubsetIndicesN` does not support stratification; setting
:obj:`stratified` to :obj:`Stratified` will yield an error.

Let us construct a list of indices that would assign half of the examples
to the first set and a quarter each to the second and third (part of
:download:`randomindicesn.py <code/randomindicesn.py>`, uses :download:`lenses.tab <code/lenses.tab>`):

.. literalinclude:: code/randomindicesn.py
    :lines: 9-14

Output::

    <1, 0, 0, 2, 0, 1, 1, 0, 2, 0, 2, 2, 1, 0, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0>

Count them and you'll see there are 12 zeros, 6 ones and 6 twos out of 24.

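The way a list of proportions maps onto index counts can be sketched as
follows. ``indices_n`` is a hypothetical helper, not Orange's
implementation, and truncating each share to a whole number of examples
is an assumption of the sketch.

```python
import random
from collections import Counter


def indices_n(p, n, rng):
    # p lists the proportions of all folds but the last; the last fold
    # receives whatever remains. Truncating each share to an integer
    # is an assumption of this sketch.
    counts = [int(share * n) for share in p]
    counts.append(n - sum(counts))
    indices = [fold for fold, count in enumerate(counts)
               for _ in range(count)]
    rng.shuffle(indices)
    return indices


indices = indices_n([0.5, 0.25], 24, random.Random(0))
print(sorted(Counter(indices).items()))  # [(0, 12), (1, 6), (2, 6)]
```

Two proportions are enough to describe three sets, because the last set
takes the remaining quarter.
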
.. class:: SubsetIndicesCV

    :obj:`SubsetIndicesCV` computes indices for cross-validation.

    It constructs a list of indices between 0 and :obj:`folds` - 1
    (inclusive), with an equal number of each (if the number of examples
    is not divisible by :obj:`folds`, the last folds will have one
    example less).

    .. attribute:: folds

        Number of folds. Default is 10.

We shall prepare indices for an ordinary ten-fold cross-validation and
indices for 10 examples for 5-fold cross-validation. For the latter,
we shall only pass the number of examples, which, of course, prevents
stratification (part of :download:`randomindicescv.py <code/randomindicescv.py>`, uses :download:`lenses.tab <code/lenses.tab>`):

.. literalinclude:: code/randomindicescv.py
    :lines: 7-12

Output::

    Indices for ordinary 10-fold CV
    <1, 1, 3, 8, 8, 3, 2, 7, 5, 0, 1, 5, 2, 9, 4, 7, 4, 9, 3, 6, 0, 2, 0, 6>
    Indices for 5 folds on 10 examples
    <3, 0, 1, 0, 3, 2, 4, 4, 1, 2>


Since the 24 examples don't divide evenly into ten folds, the first four
folds have one example more: there are three 0's, 1's, 2's and 3's, but
only two each of 4's, 5's and so on.

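The fold sizes above follow from dealing the fold numbers out evenly.
The sketch below shows one simple, non-stratified way to build such
indices and to use them; ``cv_indices`` is a hypothetical helper and
does not claim to reproduce Orange's algorithm or its exact output.

```python
import random
from collections import Counter


def cv_indices(n, folds, rng):
    # Deal fold numbers out round-robin, then shuffle, so fold sizes
    # differ by at most one and the earlier folds get the extras.
    indices = [i % folds for i in range(n)]
    rng.shuffle(indices)
    return indices


indices = cv_indices(24, 10, random.Random(0))
sizes = sorted(Counter(indices).items())
print(sizes)
# [(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2)]

# Using the folds: fold k is the test set, everything else trains.
splits = []
for k in range(10):
    test = [i for i, fold in enumerate(indices) if fold == k]
    train = [i for i, fold in enumerate(indices) if fold != k]
    splits.append((train, test))
```

Every example lands in exactly one test set, so the ten test sets
together cover the whole dataset once.
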
"""

pass

from orange import \
     MakeRandomIndices as SubsetIndices, \
     MakeRandomIndicesN as SubsetIndicesN, \
     MakeRandomIndicesCV as SubsetIndicesCV, \
     MakeRandomIndicesMultiple as SubsetIndicesMultiple, \
     MakeRandomIndices2 as SubsetIndices2