Changeset 7895:8f4078eca8d0 in orange


Timestamp:
05/06/11 15:46:32 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
4094340a833d3e2a98bd753fd5ae75bf02e1fbce
Message:

Orange 2.5: reference documentation - Sampling. Fixes #756.

Location:
orange
Files:
4 added
3 edited

  • orange/Orange/data/sample.py

    r7816 r7895  
     1""" 
     2Example sampling is one of the basic procedures in machine learning. If 
     3for nothing else, everybody needs to split a dataset into training and 
     4testing examples.  
     5  
     6It is easy to select a subset of examples in Orange. The key idea is the 
     7use of indices: first construct a list of indices, one corresponding 
     8to each example. Then you can select examples by indices, say take 
     9all examples with index 3. Or with index other than 3. It is obvious 
     10that this is useful for many typical setups, such as 70-30 splits or 
     11cross-validation.  
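
The index idea described above can be illustrated without Orange at all. The following is a minimal plain-Python sketch, not Orange's implementation; the helper name ``select`` and the sample data are hypothetical and only mirror the "index 3" versus "index other than 3" wording.

```python
# Plain-Python illustration of selecting by indices; this is not Orange code.
def select(examples, indices, value, negate=False):
    """Return examples whose index equals `value`, or differs from it if `negate`."""
    return [ex for ex, idx in zip(examples, indices) if (idx == value) != negate]

examples = ["a", "b", "c", "d", "e", "f"]   # stand-ins for data instances
indices = [0, 1, 2, 3, 3, 1]                # one index per example

with_3 = select(examples, indices, 3)                  # examples with index 3
without_3 = select(examples, indices, 3, negate=True)  # all other examples
```

A 70-30 split is the same operation with only two index values, and cross-validation repeats it once per fold.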
     12  
     13Orange provides methods for making such selections, such as 
     14:obj:`Orange.data.Table.select`.  And, of course, it provides methods 
     15for constructing indices for different kinds of splits. For instance, 
     16for the most commonly used sampling method, cross-validation, Orange's 
     17class :obj:`SubsetIndicesCV` prepares a list of indices that assign a 
     18fold to each example. 
     19 
     20Classes that construct such indices are derived from a basic 
     21abstract :obj:`SubsetIndices`. There are three different classes 
     22provided. :obj:`SubsetIndices2` constructs a list of 0's and 1's in 
     23prescribed proportion; it can be used, for instance, for 70-30 divisions 
     24into training and testing examples. A more general :obj:`SubsetIndicesN` 
     25constructs a list of indices from 0 to N-1 in given proportions. Finally, 
     26the most often used :obj:`SubsetIndicesCV` prepares indices for 
     27cross-validation. 
     28 
     29Subset indices are more deterministic than in versions of Orange prior to 
     30September 2003. See examples in the section about :obj:`SubsetIndices2` 
     31for details. 
     32  
     33.. class:: SubsetIndices 
     34 
     35    .. data:: Stratified 
     36 
     37    .. data:: NotStratified 
     38 
     39    .. data:: StratifiedIfPossible 
     40         
     41        Constants for setting :obj:`stratified`. If 
     42        :obj:`StratifiedIfPossible`, Orange will try to construct 
     43        stratified indices, but fall back to non-stratified if anything 
     44        goes wrong. For stratified indices, it needs to see the example 
     45        table (see the calling operator below), and the class should be 
     46        discrete and have no unknown values. 
     47 
     48 
     49    .. attribute:: stratified 
     50 
     51        Defines whether the division should be stratified, that is, 
     52        whether all subsets should have approximately equal class 
     53        distributions. Possible values are :obj:`Stratified`, 
     54        :obj:`NotStratified` and :obj:`StratifiedIfPossible` (default). 
     55 
     56    .. attribute:: randseed 
     57     
     58    .. attribute:: random_generator 
     59 
     60        These two fields deal with the way :obj:`SubsetIndices` generates 
     61        random numbers. 
     62 
     63        If :obj:`random_generator` (of type :obj:`orange.RandomGenerator`) 
     64        is set, it is used. The same random generator can be shared 
     65        between different objects; this can be useful when constructing an 
     66        experiment that depends on a single random seed. If you use this, 
     67        :obj:`SubsetIndices` will return a different set of indices each 
     68        time it's called, even with the same arguments. 
     69 
     70        If :obj:`random_generator` is not given, but :attr:`randseed` is 
     71        (positive values denote a defined :obj:`randseed`), the value is 
     72        used to initiate a new, temporary local random generator. This 
     73        way, the indices generator will always give the same indices for 
     74        the same data. 
     75 
     76        If neither of the two is defined, a new random generator 
     77        is constructed each time the object is called (note that 
     78        this is unlike some other classes, such as :obj:`Variable`, 
     79        :obj:`Distribution` and :obj:`Orange.data.Table`, that store 
     80        such generators for future use; the generator constructed by 
     81        :obj:`SubsetIndices` is disposed after use) and initialized 
     82        with random seed 0. This thus has the same effect as setting 
     83        :obj:`randseed` to 0. 
     84 
     85        The example for :obj:`SubsetIndices2` shows the difference 
     86        between those options. 
     87 
     88    .. method:: __call__(examples) 
     89 
     90        :obj:`SubsetIndices` can be called to return a list of 
     91        indices. The argument can be either the desired length of the list 
     92        (presumably corresponding to a length of some list of examples) 
     93        or a set of examples, given as :obj:`Orange.data.Table` or plain 
     94        Python list. It is obvious that in the former case, indices 
     95        cannot correspond to a stratified division; if :obj:`stratified` 
     96        is set to :obj:`Stratified`, an exception is raised. 
     97 
     98.. class:: SubsetIndices2 
     99 
     100    This object prepares a list of 0's and 1's. 
     101  
     102    .. attribute:: p0 
     103 
     104        The proportion or a number of 0's. If :obj:`p0` is less than 
     105        1, it's a proportion. For instance, if :obj:`p0` is 0.2, 20% 
     106        of indices will be 0's and 80% will be 1's. If :obj:`p0` 
     107        is 1 or more, it gives the exact number of 0's. For instance, 
     108        with :obj:`p0` of 10, you will get a list with 10 0's and 
     109        the rest of the list will be 1's. 
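
The two readings of ``p0`` can be summarized in a few lines. This is only a sketch of the rule as stated above, in plain Python; the helper ``zeros_count`` is hypothetical, and the rounding of a fractional count is an assumption, since the text does not say how Orange rounds.

```python
def zeros_count(p0, n):
    """Number of 0's among n indices under the p0 rule (hypothetical helper)."""
    if p0 < 1:
        # p0 is a proportion; rounding to the nearest integer is an assumption
        return int(round(p0 * n))
    # p0 is 1 or more: it is taken as the exact number of 0's
    return int(p0)

as_proportion = zeros_count(0.25, 24)  # a quarter of 24 examples
as_count = zeros_count(6, 24)          # exactly six 0's
```

Under this reading, a proportion of 0.25 and a count of 6 describe the same split of 24 examples.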
     110  
     111Say that you have loaded the lenses domain into ``data``. We'll split 
     112it into two datasets, the first containing only 6 examples and the other 
     113containing the rest (from `randomindices2.py`_): 
     114  
     115.. _randomindices2.py: code/randomindices2.py 
     116.. _lenses.tab: code/lenses.tab 
     117 
     118.. literalinclude:: code/randomindices2.py 
     119    :lines: 11-17 
     120 
     121Output:: 
     122 
     123    <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1> 
     124    6 18 
     125  
     126No surprises here. Let's now see what's with those random seeds and generators. First, we shall simply construct and print five lists of random indices.  
     127  
     128.. literalinclude:: code/randomindices2.py 
     129    :lines: 19-21 
     130 
     131Output:: 
     132 
     133    Indices without playing with random generator 
     134    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1> 
     135    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1> 
     136    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1> 
     137    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1> 
     138    <0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1> 
     139 
     140 
     141We ran it five times and got the same result each time. 
     142 
     143.. literalinclude:: code/randomindices2.py 
     144    :lines: 23-26 
     145 
     146Output:: 
     147 
     148    Indices with random generator 
     149    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     150    <1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1> 
     151    <1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1> 
     152    <1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0> 
     153    <1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1> 
     154 
     155We have constructed a private random generator for random indices and 
     156got five different lists. If you run the whole script again, you'll 
     157get the same five lists, since the generator will be constructed again 
     158and start generating numbers from the beginning. Again, you should get 
     159these same indices on any operating system. 
     160 
     161.. literalinclude:: code/randomindices2.py 
     162    :lines: 28-32 
     163 
     164Output:: 
     165 
     166    Indices with randseed 
     167    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     168    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     169    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     170    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     171    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     172 
     173 
     174Here we have set the random seed and removed the random generator 
     175(otherwise the seed would have no effect as the generator has the 
     176priority). Each time we run the indices generator, it constructs a 
     177private random generator and initializes it with the given seed, and 
     178consequently always returns the same indices. 
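
The difference between a shared generator and a fixed seed can be reproduced with Python's standard ``random`` module. This is an analogy only, not Orange's random generator: ``make_indices`` is a hypothetical helper, and the seed 42 is arbitrary.

```python
import random

def make_indices(n, zeros, rng):
    """Shuffle a list with `zeros` 0's and n - zeros 1's using the given generator."""
    indices = [0] * zeros + [1] * (n - zeros)
    rng.shuffle(indices)
    return indices

# Shared generator: its state advances, so successive calls continue the stream.
shared = random.Random(42)
first = make_indices(24, 6, shared)
second = make_indices(24, 6, shared)

# Fixed seed: a fresh generator is built for every call, so results repeat.
seeded_a = make_indices(24, 6, random.Random(42))
seeded_b = make_indices(24, 6, random.Random(42))
```

Here ``seeded_a`` and ``seeded_b`` are identical, and both equal ``first``, because each starts from the same seed; ``second`` is drawn from the advanced state of the shared generator.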
     179 
     180Let's play with :obj:`SubsetIndices2.p0`. There are 24 examples in the 
     181dataset. Setting :obj:`SubsetIndices2.p0` to 0.25 instead of 6 shouldn't 
     182alter the indices. Let's check it. 
     183 
     184.. literalinclude:: code/randomindices2.py 
     185    :lines: 35-37 
     186 
     187Output:: 
     188 
     189    Indices with p0 set as probability (not 'a number of') 
     190    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     191 
     192Finally, let's observe the effects of :obj:`~SubsetIndices.stratified`. By 
     193default, indices are stratified if it's possible and, in our case, 
     194it is and they are. 
     195 
     196.. literalinclude:: code/randomindices2.py 
     197    :lines: 39-49 
     198 
     199Output:: 
     200 
     201    ... with stratification 
     202    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     203    <0.625, 0.167, 0.208> 
     204    <0.611, 0.167, 0.222> 
     205 
     206We explicitly requested stratification and got the same indices as 
     207before. That's OK. We also printed out the distribution for the whole 
     208dataset and for the selected dataset (as we gave no second parameter, 
     209the examples with non-zero indices got selected). They are not the same, but 
     210they are pretty close. :obj:`SubsetIndices2` did what it could. Now let's 
     211try without stratification. The script is much the same except for changing 
     212:obj:`~SubsetIndices.stratified` to :obj:`~SubsetIndices.NotStratified`. 
     213 
     214.. literalinclude:: code/randomindices2.py 
     215    :lines: 51-62 
     216 
     217Output:: 
     218     
     219    ... and without stratification 
     220    <0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1> 
     221    <0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1> 
     222    <0.625, 0.167, 0.208> 
     223    <0.611, 0.167, 0.222> 
     224 
     225 
     226Different indices and ... just look at the distribution. Could be worse 
     227but, well, :obj:`~SubsetIndices.NotStratified` doesn't mean that Orange 
     228will make an effort to get uneven distributions. It just won't mind 
     229about them. 
     230 
     231For a final test, you can set the class of one of the examples to unknown 
     232and rerun the last script with setting :obj:`~SubsetIndices.stratified` 
     233once to :obj:`~SubsetIndices.Stratified` and once to 
     234:obj:`~SubsetIndices.StratifiedIfPossible`. In the first case you'll 
     235get an error and in the second you'll get non-stratified indices. 
     236 
     237.. literalinclude:: code/randomindices2.py 
     238    :lines: 64-70 
     239 
     240Output:: 
     241 
     242    ... stratified 'if possible' 
     243    <1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1> 
     244 
     245    ... stratified 'if possible', after removing the first example's class 
     246    <0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1> 
     247  
     248.. class:: SubsetIndicesN 
     249 
     250    A straight generalization of :obj:`SubsetIndices2`, so there's not 
     251    much to be told about it. 
     252 
     253    .. attribute:: p 
     254 
     255        A list of proportions of examples that go to each fold. If 
     256        :obj:`p` has a length of 3, the returned list will have four 
     257        different indices, the first three will have probabilities as 
     258        defined in :obj:`p` while the last will have a probability of 
     259        (1 - sum of elements of :obj:`p`). 
     260 
     261:obj:`SubsetIndicesN` does not support stratification; setting 
     262:obj:`stratified` to :obj:`Stratified` will yield an error. 
     263 
     264.. _randomindicesn.py: code/randomindicesn.py 
     265 
     266Let us construct a list of indices that would assign half of examples 
     267to the first set and a quarter to the second and third (part of 
     268`randomindicesn.py`_, uses `lenses.tab`_): 
     269 
     270.. literalinclude:: code/randomindicesn.py 
     271    :lines: 9-14 
     272 
     273Output:: 
     274 
     275    <1, 0, 0, 2, 0, 1, 1, 0, 2, 0, 2, 2, 1, 0, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0> 
     276 
     277Count them and you'll see there are 12 zeros, 6 ones and 6 twos out of 24. 
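
The expected group sizes follow directly from the proportions. The sketch below is plain Python, not Orange code; it assumes the proportions were given as ``[0.5, 0.25]`` (the script itself is not shown here) and that fractional counts are rounded, with the last group taking whatever remains. ``group_sizes`` is a hypothetical helper.

```python
def group_sizes(p, n):
    """Sizes of len(p) + 1 groups for n examples; the last group takes the remainder."""
    sizes = [int(round(q * n)) for q in p]  # rounding is an assumption
    sizes.append(n - sum(sizes))            # remainder: 1 - sum(p) of the examples
    return sizes

sizes = group_sizes([0.5, 0.25], 24)  # half, a quarter, and the remaining quarter
```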
     278  
     279.. class:: SubsetIndicesCV 
     280  
     281    :obj:`SubsetIndicesCV` computes indices for cross-validation. 
     282 
     283    It constructs a list of indices between 0 and :obj:`folds` -1 
     284    (inclusive), with an equal number of each (if the number of examples 
     285    is not divisible by :obj:`folds`, the last folds will have one 
     286    example less). 
     287 
     288    .. attribute:: folds 
     289 
     290        Number of folds. Default is 10. 
     291  
     292.. _randomindicescv.py: code/randomindicescv.py 
     293  
     294We shall prepare indices for an ordinary ten-fold cross validation and 
     295indices for 10 examples for 5-fold cross validation. For the latter, 
     296we shall only pass the number of examples, which, of course, prevents 
     297the stratification (part of `randomindicescv.py`_, uses `lenses.tab`_): 
     298 
     299.. literalinclude:: code/randomindicescv.py 
     300    :lines: 7-12 
     301 
     302Output:: 
     303 
     304    Indices for ordinary 10-fold CV 
     305    <1, 1, 3, 8, 8, 3, 2, 7, 5, 0, 1, 5, 2, 9, 4, 7, 4, 9, 3, 6, 0, 2, 0, 6> 
     306    Indices for 5 folds on 10 examples 
     307    <3, 0, 1, 0, 3, 2, 4, 4, 1, 2> 
     308 
     309 
     310Since 24 examples don't divide evenly into ten folds, the first four folds 
     311have one example more: there are three 0's, 1's, 2's and 3's, but only 
     312two 4's, 5's, and so on. 
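
The fold sizes can be worked out with integer division. This is a plain-Python sketch of the arithmetic only, not of how :obj:`SubsetIndicesCV` assigns examples to folds; ``fold_sizes`` is a hypothetical helper.

```python
def fold_sizes(n, folds=10):
    """Sizes of the folds when n examples are spread as evenly as possible."""
    base, extra = divmod(n, folds)
    # the first `extra` folds get one example more than the rest
    return [base + 1 if i < extra else base for i in range(folds)]

ten_fold = fold_sizes(24)        # 24 examples, 10 folds
five_fold = fold_sizes(10, 5)    # 10 examples, 5 folds
```

For 24 examples this gives four folds of three examples and six folds of two, which matches the counts in the output above.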
     313 
     314""" 
     315 
     316pass 
     317 
    1318from orange import \ 
    2319     MakeRandomIndices as SubsetIndices, \ 
  • orange/doc/Orange/rst/Orange.data.rst

    r7801 r7895  
    99    Orange.data.value 
    1010    Orange.data.instance 
    11     Orange.data.table.rst 
     11    Orange.data.table 
     12    Orange.data.sample 
  • orange/fixes/fix_changed_names.py

    r7842 r7895  
    451451           "orange.DomainContinuizer": "Orange.feature.continuization.DomainContinuizer", 
    452452            
    453            "orange.MakeRandomIndices": "Orange.data.sample.MakeRandomIndices", 
    454            "orange.MakeRandomIndicesN": "Orange.data.sample.MakeRandomIndicesN", 
    455            "orange.MakeRandomIndicesCV": "Orange.data.sample.MakeRandomIndicesCV", 
    456            "orange.MakeRandomIndicesMultiple": "Orange.data.sample.MakeRandomIndicesMultiple", 
    457            "orange.MakeRandomIndices2": "Orange.data.sample.MakeRandomIndices2", 
     453           "orange.MakeRandomIndices": "Orange.data.sample.SubsetIndices", 
     454           "orange.MakeRandomIndicesN": "Orange.data.sample.SubsetIndicesN", 
     455           "orange.MakeRandomIndicesCV": "Orange.data.sample.SubsetIndicesCV", 
     456           "orange.MakeRandomIndicesMultiple": "Orange.data.sample.SubsetIndicesMultiple", 
     457           "orange.MakeRandomIndices2": "Orange.data.sample.SubsetIndices2", 
    458458 
    459459           } 