<html>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>
<h1>Classes for Random Sampling</h1>
<index name="preprocessing+sampling">
<index name="splitting data sets">
<index name="cross validation">

<P>Sampling examples is one of the basic procedures in machine learning. If nothing else, everybody needs to split a dataset into training and testing examples.</p>

<p>It is easy to select a subset of examples in Orange. The key idea is the use of indices: first construct a list of indices, one corresponding to each example. Then you can select examples by index: say, take all examples with index 3, or all examples with an index other than 3. This is useful for many typical setups, such as 70-30 splits or cross-validation.</p>
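<p>The idea of selecting by indices can be sketched in plain Python. This is only an illustration that does not use Orange; the helper name <code>select</code> and its <code>negate</code> argument are chosen for this sketch and are not a description of Orange's implementation.</p>

```python
# Plain-Python sketch of index-based selection; it does not use Orange.
def select(examples, indices, value=1, negate=False):
    """Return examples whose index equals `value` (or differs from it, if `negate`)."""
    return [ex for ex, ind in zip(examples, indices) if (ind == value) != negate]

examples = ["a", "b", "c", "d", "e", "f"]   # stand-ins for real examples
indices = [0, 3, 1, 3, 2, 3]                # one index per example

with_3 = select(examples, indices, 3)                  # examples with index 3
without_3 = select(examples, indices, 3, negate=True)  # all the other examples
```

<p>Here <code>with_3</code> holds the second, fourth and sixth example, and <code>without_3</code> holds the remaining three.</p>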

<p>Orange provides methods for making such selections, such as <a href="ExampleTable.htm#select"><code>ExampleTable</code>'s <code>select</code></A> method. And, of course, it provides methods for constructing indices for different kinds of splits. For instance, for the most commonly used sampling method, cross-validation, Orange's class <code>MakeRandomIndicesCV</code> prepares a list of indices that assigns a fold to each example.</p>

<p>Classes that construct such indices are derived from the abstract base class <code>MakeRandomIndices</code>. Three classes are provided. <code>MakeRandomIndices2</code> constructs a list of 0's and 1's in a prescribed proportion; it can be used, for instance, for 70-30 divisions into training and testing examples. The more general <code>MakeRandomIndicesN</code> constructs a list of indices from 0 to N-1 in given proportions. Finally, the most often used class, <code>MakeRandomIndicesCV</code>, prepares indices for cross-validation.</p>

<P><B>Important change:</B> random indices are more deterministic than in versions of Orange prior to September 2003. See examples in the section about <CODE>MakeRandomIndices2</CODE> for details.</P>

<hr>

<h2>MakeRandomIndices</h2>
<index name="classes/MakeRandomIndices">

<p class=section>Attributes</p>
<DL class=attributes>
<dt>stratified</dt>
<dd>Defines whether the division should be stratified, that is, whether all subsets should have approximately equal class distributions. Possible values are <code>MakeRandomIndices.Stratified</code>, <code>MakeRandomIndices.NotStratified</code> and <code>MakeRandomIndices.StratifiedIfPossible</code>. In the last case, which is also the default, Orange will try to construct stratified indices, but will fall back to non-stratified ones if anything goes wrong. For stratified indices, it needs to see the example table (see the calling operator below), and the class should be discrete and have no unknown values.</dd>

<dt>randseed, randomGenerator</dt>
<dd>These two fields deal with the way <code>MakeRandomIndices</code> generates random numbers.
<ul>
<li>If <code>randomGenerator</code> (of type <code>orange.RandomGenerator</code>) is set, it is used. The same random generator can be shared between different objects; this can be useful when constructing an experiment that depends on a single random seed. If you use this, <CODE>MakeRandomIndices</CODE> will return a different set of indices each time it's called, even if it is called with the same arguments.</li>

<li>If <code>randomGenerator</code> is not given, but <code>randseed</code> is (positive values denote a defined <code>randseed</code>), the value is used to initialize a new, temporary local random generator. This way, the indices generator will always give the same indices for the same data.</li>

<li>If neither of the two is defined, a new random generator is constructed each time the object is called and initialized with random seed 0. This has the same effect as setting <CODE>randseed</CODE> to 0. Note that this is unlike some other classes, such as <a href="Variable.htm"><CODE>Variable</CODE></A>, <A href="distributions.htm"><CODE>Distribution</CODE></A> and <A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A>, which store such generators for future use; the generator constructed by <CODE>MakeRandomIndices</CODE> is disposed of after use.</li>
</ul>
The example for <code>MakeRandomIndices2</code> shows the difference between those options.
</dd>
</dl>
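<p>The three ways of handling randomness described above can be mimicked with Python's standard <code>random</code> module. The class <code>IndexMaker</code> and its argument names are hypothetical stand-ins invented for this sketch; it shuffles with Python's own generator, so it will not reproduce the indices that Orange prints.</p>

```python
import random

class IndexMaker(object):
    """Hypothetical stand-in mimicking the randomness modes described above."""

    def __init__(self, p0, randseed=0, random_generator=None):
        self.p0 = p0                              # exact number of 0's
        self.randseed = randseed                  # used only without a generator
        self.random_generator = random_generator  # shared generator, if any

    def __call__(self, n):
        if self.random_generator is not None:
            rng = self.random_generator           # state persists between calls
        else:
            rng = random.Random(self.randseed)    # fresh generator on every call
        indices = [0] * self.p0 + [1] * (n - self.p0)
        rng.shuffle(indices)
        return indices

seeded = IndexMaker(p0=6, randseed=42)
same_a, same_b = seeded(24), seeded(24)           # identical lists

shared = IndexMaker(p0=6, random_generator=random.Random(42))
first, second = shared(24), shared(24)            # generator state carries over
```

<p>In this sketch <code>same_a</code> and <code>same_b</code> are equal because a fresh generator is seeded on every call, while <code>first</code> equals <code>same_a</code> (same seed, fresh state) and <code>second</code> is drawn from the already advanced generator, so it will typically differ.</p>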

<p><code>MakeRandomIndices</code> can be called to return a list of indices. The argument can be either the desired length of the list (presumably corresponding to the length of some list of examples) or a set of examples, given as an <CODE>ExampleTable</CODE> or a plain Python list. In the former case, indices cannot correspond to a stratified division, since there are no class values to look at; if <code>stratified</code> is set to <code>MakeRandomIndices.Stratified</code>, an exception is raised.</p>

<h2>MakeRandomIndices2</h2>
<index name="classes/MakeRandomIndices2">

<p>This object prepares a list of 0's and 1's.</p>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>p0</DT>
<DD>The proportion or the number of 0's. If <code>p0</code> is less than 1, it is a proportion. For instance, if <code>p0</code> is 0.2, 20% of indices will be 0's and 80% will be 1's. If <code>p0</code> is 1 or more, it gives the exact number of 0's. For instance, with a <code>p0</code> of 10, you will get a list with 10 0's and the rest of the list will be 1's.</DD>
</DL>
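<p>The two readings of <code>p0</code> can be written down as a small helper. The function name <code>zeros_count</code> is made up for this sketch, and rounding to the nearest integer is an assumption; the text above does not say how Orange rounds fractional counts.</p>

```python
def zeros_count(p0, n_examples):
    """Number of 0's for `n_examples`: a proportion if p0 < 1, else an exact count."""
    if p0 < 1:
        return int(round(p0 * n_examples))   # rounding rule is an assumption
    return int(p0)

as_proportion = zeros_count(0.25, 24)   # a quarter of 24 examples
as_count = zeros_count(6, 24)           # exactly 6 examples
```

<p>For a dataset with 24 examples, both calls give 6, so a <code>p0</code> of 0.25 and a <code>p0</code> of 6 describe the same split.</p>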

<p>Say that you have loaded the lenses dataset into <code>data</code>. We'll split it into two datasets, the first containing only 6 examples and the other containing the rest.</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> import orange
>>> data = orange.ExampleTable("lenses")
>>> indices2 = orange.MakeRandomIndices2(p0=6)
>>> ind = indices2(data)
>>> print ind
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
>>> data0 = data.select(ind, 0)
>>> data1 = data.select(ind, 1)
>>> print len(data0), len(data1)
6 18
</xmp>

<p>No surprises here. Let's now see what happens with those random seeds and generators. First, we simply construct and print five lists of random indices.</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indices2 = orange.MakeRandomIndices2(p0=6)
>>> for i in range(5):
...     print indices2(data)
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
</xmp>

<p>We ran it five times and got the same result each time (this would not be so in older versions of Orange!). Unless there's something wrong with your port of Orange, you should get the same indices as above.</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indices2.randomGenerator = orange.RandomGenerator(42)
>>> for i in range(5):
...     print indices2(data)
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0>
<0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1>
<1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1>
<1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0>
</xmp>

<p>This time we constructed a private random generator for the indices and got five different lists. If you run the whole script again, however, you'll get the same five lists, since the generator will be constructed again and will start generating numbers from the beginning. Again, you should get these same indices on any operating system.</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indices2.randseed = 42
>>> indices2.randomGenerator = None
>>> for i in range(5):
...     print indices2(data)
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
</xmp>

<p>Here we set the random seed and removed the random generator (otherwise the seed would have no effect, as the generator has priority). Each time we run the indices generator, it constructs a private random generator and initializes it with the given seed, and consequently always returns the same indices. As you have probably noticed, these indices are the same as the first list generated in the previous example, because the random seed is the same.</p>

<p>Let's play with <code>p0</code>. There are 24 examples in the dataset. Setting <code>p0</code> to 0.25 instead of 6 shouldn't alter the indices. Let's check it.</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indices2.p0 = 0.25
>>> print indices2(data)
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
</xmp>

<p>Finally, let's observe the effects of <code>stratified</code>. By default, indices are stratified whenever possible; in our case it is possible, so they are.</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indices2.stratified = indices2.Stratified
>>> ind = indices2(data)
>>> print ind
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
>>> data2 = data.select(ind)
>>> od = orange.getClassDistribution(data)
>>> sd = orange.getClassDistribution(data2)
>>> od.normalize()
>>> sd.normalize()
>>> print od
<0.625, 0.208, 0.167>
>>> print sd
<0.611, 0.222, 0.167>
</xmp>

<p>We explicitly requested stratification and got the same indices as before, as expected. We also printed out the class distribution for the whole dataset and for the selected dataset (as we gave no second parameter, the examples with non-zero indices got selected). They are not the same, but they are pretty close; <code>MakeRandomIndices2</code> did what it could. Now let's try without stratification. The script is nearly the same except for changing <code>Stratified</code> to <code>NotStratified</code>:</p>

<p class="header">part of <a href="randomindices2.py">randomindices2.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indices2.stratified = indices2.NotStratified
>>> ind = indices2(data)
>>> print ind
<0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1>
>>> data2 = data.select(ind)
>>> sd = orange.getClassDistribution(data2)
>>> sd.normalize()
>>> print od
<0.625, 0.208, 0.167>
>>> print sd
<0.556, 0.278, 0.167>
</xmp>

<p>We got different indices and a class distribution that is further from the original one. It could be worse, though: <code>NotStratified</code> doesn't mean that Orange will make an effort to get uneven distributions. It just doesn't care about them.</p>

<p>For a final test, you can set the class of one of the examples to unknown and rerun the last script, setting <code>stratified</code> once to <code>Stratified</code> and once to <code>StratifiedIfPossible</code>. In the first case you'll get an error, and in the second you'll get non-stratified indices.</p>
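<p>To see why the stratified split above lands so close to the original distribution, here is a rough plain-Python sketch of stratification for 0/1 indices. It is not Orange's algorithm: the function name <code>stratified_indices2</code>, the per-class rounding and the class labels <code>"a"</code>, <code>"b"</code>, <code>"c"</code> are all assumptions made for illustration. Only the class sizes (15, 5 and 4 out of 24) are taken from the distribution printed above.</p>

```python
import random
from collections import defaultdict

def stratified_indices2(classes, p0, seed=0):
    """Give index 0 to roughly a proportion `p0` of the examples of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for position, cls in enumerate(classes):
        by_class[cls].append(position)
    indices = [1] * len(classes)
    for cls in sorted(by_class):
        positions = by_class[cls]
        rng.shuffle(positions)
        n_zeros = int(round(p0 * len(positions)))  # rounding rule is an assumption
        for position in positions[:n_zeros]:
            indices[position] = 0
    return indices

# Class sizes 15, 5 and 4 correspond to the distribution <0.625, 0.208, 0.167>.
classes = ["a"] * 15 + ["b"] * 5 + ["c"] * 4
ind = stratified_indices2(classes, 0.25)
selected = [cls for cls, i in zip(classes, ind) if i == 1]
selected_counts = [selected.count(cls) for cls in ("a", "b", "c")]
```

<p>With these class sizes the sketch puts 4, 1 and 1 examples into subset 0, which leaves 11, 4 and 3 of the 18 selected examples in the three classes, that is, proportions of about 0.611, 0.222 and 0.167, the same as the stratified distribution shown earlier.</p>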

<h2>MakeRandomIndicesN</h2>
<index name="classes/MakeRandomIndicesN">

<p><code>MakeRandomIndicesN</code> is a straightforward generalization of <code>MakeRandomIndices2</code>, so there's not much to be told about it.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>p</DT>
<DD>A list of proportions of examples that go to each fold. If <code>p</code> has a length of 3, the returned list will have four different indices; the first three will have probabilities as defined in <code>p</code>, while the last will have a probability of (1 - sum of elements of <code>p</code>).</DD>
</DL>

<p><code>MakeRandomIndicesN</code> does not support stratification (yet); setting <code>stratified</code> to <code>Stratified</code> will yield an error.</p>

<p>Let us construct a list of indices that assigns half of the examples to the first set and a quarter each to the second and third.</p>
<p class="header">part of <a href="randomindicesn.py">randomindicesn.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> indicesn = orange.MakeRandomIndicesN(p=[0.5, 0.25])
>>> ind = indicesn(data)
>>> print ind
<1, 2, 1, 0, 2, 2, 1, 0, 2, 0, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0>
</xmp>

<p>Count them and you'll see there are 12 zeros, 6 ones and 6 twos out of 24.</p>
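<p>The counting can also be left to Python. The following sketch does not call Orange; it copies the list of indices printed above, works out the implied proportion for the last index, and compares it with the observed frequencies.</p>

```python
from collections import Counter

p = [0.5, 0.25]
proportions = p + [1 - sum(p)]   # the last index gets the remainder, 0.25

# The indices printed by the example above, copied by hand.
ind = [1, 2, 1, 0, 2, 2, 1, 0, 2, 0, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0]

counts = Counter(ind)
observed = [counts[i] / float(len(ind)) for i in range(len(proportions))]
```

<p>Both <code>proportions</code> and <code>observed</code> come out as 0.5, 0.25 and 0.25.</p>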

<h2>MakeRandomIndicesCV</h2>
<index name="classes/MakeRandomIndicesCV">

<p><code>MakeRandomIndicesCV</code> computes indices for cross-validation.</p>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>folds</DT>
<DD>Number of folds. Default is 10.</DD>
</DL>

<P>The object constructs a list of indices between 0 and <CODE>folds-1</CODE> (inclusive), with an equal number of each (if the number of examples is not divisible by <CODE>folds</CODE>, the last folds will have one example less).</p>

<p>As an exercise, we shall first prepare indices for an ordinary ten-fold cross-validation.</p>
<p class="header">part of <a href="randomindicescv.py">randomindicescv.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> print orange.MakeRandomIndicesCV(data)
<0, 8, 1, 3, 6, 9, 4, 2, 4, 6, 7, 1, 9, 7, 2, 3, 0, 5, 8, 0, 1, 5, 3, 2>
</xmp>

<p>Since 24 examples don't divide evenly into ten folds, the first four folds have one example more: there are three 0's, 1's, 2's and 3's, but only two each of 4's through 9's.</p>
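<p>The fold sizes follow from simple integer division. The helper <code>fold_sizes</code> below is a hypothetical name used only in this sketch; it reproduces the sizes described above and checks them against the list of indices printed in the example.</p>

```python
from collections import Counter

def fold_sizes(n_examples, folds=10):
    """Sizes of the folds when the first `n_examples % folds` folds get one extra."""
    base, extra = divmod(n_examples, folds)
    return [base + 1 if fold < extra else base for fold in range(folds)]

sizes = fold_sizes(24)   # 24 examples in ten folds

# The indices printed by the example above, copied by hand.
ind = [0, 8, 1, 3, 6, 9, 4, 2, 4, 6, 7, 1, 9, 7, 2, 3, 0, 5, 8, 0, 1, 5, 3, 2]
counts = Counter(ind)
printed_sizes = [counts[fold] for fold in range(10)]
```

<p>Both <code>sizes</code> and <code>printed_sizes</code> are four 3's followed by six 2's.</p>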

<p>For a more even division, we let Orange prepare indices for 10 examples and 5-fold cross-validation. Instead of passing the examples, as usual, we pass only their number. This, of course, rules out stratification.</p>

<p class="header">part of <a href="randomindicescv.py">randomindicescv.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class="code">>>> print orange.MakeRandomIndicesCV(10, folds=5)
<2, 1, 3, 3, 0, 2, 0, 4, 1, 4>
</xmp>
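<p>Such a list of fold indices is typically used in a loop that, for each fold, tests on the examples carrying that index and trains on all the others. A minimal plain-Python sketch of that loop, using the indices printed above and integers as stand-ins for real examples, might look like this (it does not use Orange):</p>

```python
# Fold indices printed by the example above, copied by hand.
indices = [2, 1, 3, 3, 0, 2, 0, 4, 1, 4]
examples = list(range(10))   # stand-ins for ten real examples

splits = []
for fold in range(5):
    # Examples with this fold's index are held out; the rest are for training.
    test = [ex for ex, ind in zip(examples, indices) if ind == fold]
    train = [ex for ex, ind in zip(examples, indices) if ind != fold]
    splits.append((train, test))
```

<p>Every example ends up in exactly one test set, and each of the five splits has 8 training and 2 testing examples.</p>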