Changeset 9727:b43c4d11d333 in orange for Orange/data/variable.py


Timestamp:
02/06/12 15:18:09 (2 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Message:

Polished documentation about Orange.data.variable

File:
1 edited

  • Orange/data/variable.py

    r9671 r9727  
    1 """ 
    2 ======================== 
    3 Variables (``variable``) 
    4 ======================== 
    5  
    6 Data instances in Orange can contain several types of variables: 
    7 :ref:`discrete <discrete>`, :ref:`continuous <continuous>`, 
    8 :ref:`string <string>`, and :ref:`Python <Python>` variables, as well as types derived from the latter.
    9 Python variables and their derived types represent arbitrary Python objects.
    10 The names, types, values (where applicable), functions for computing the 
    11 variable value from values of other variables, and other properties of the 
    12 variables are stored in descriptor classes defined in this module. 
    13  
    14 Variable descriptors 
    15 -------------------- 
    16  
    17 Variable descriptors can be constructed either directly, using  
    18 constructors and passing attributes as parameters, or by a  
    19 factory function :func:`Orange.data.variable.make`, which either  
    20 retrieves an existing descriptor or constructs a new one. 
    21  
    22 .. class:: Variable 
    23  
    24     An abstract base class for variable descriptors. 
    25  
    26     .. attribute:: name 
    27  
    28         The name of the variable. Variable names do not need to be unique since two 
    29         variables are considered the same only if they have the same descriptor 
    30         (e.g. even multiple variables in the same table can have the same name). 
    31         This should, however, be avoided since it may result in unpredictable 
    32         behavior. 
    33      
    34     .. attribute:: var_type 
    35         
    36         Variable type; it can be Orange.data.Type.Discrete, 
    37         Orange.data.Type.Continuous, Orange.data.Type.String or 
    38         Orange.data.Type.Other.   
    39  
    40     .. attribute:: get_value_from 
    41  
    42         A function (an instance of :obj:`Orange.classification.Classifier`) which computes 
    43         a value of the variable from values of one or more other variables. This 
    44         is used, for instance, in discretization where the variables describing 
    45         the discretized variable are computed from the original variable.  
    46  
    47     .. attribute:: ordered 
    48      
    49         A flag telling whether the values of a discrete variable are ordered. At 
    50         the moment, no built-in method treats ordinal variables differently than 
    51         nominal ones. 
    52      
    53     .. attribute:: distributed 
    54      
    55         A flag telling whether the values of the variables are distributions. 
    56         As for the flag ordered, no methods treat such variables in any special 
    57         manner. 
    58      
    59     .. attribute:: random_generator 
    60      
    61         A local random number generator used by method 
    62         :obj:`Variable.randomvalue`.
    63      
    64     .. attribute:: default_meta_id 
    65      
    66         A proposed (but not guaranteed) meta id to be used for that variable. 
    67         This is used, for instance, by the data loader for tab-delimited file 
    68         format instead of assigning an arbitrary new value, or by 
    69         :obj:`Orange.data.new_meta_id` if the variable is passed as an argument.  
    70          
    71     .. attribute:: attributes 
    72          
    73         A dictionary which allows the user to store additional information 
    74         about the variable. All values should be strings. See the section  
    75         about :ref:`storing additional information <attributes>`. 
    76  
    77     .. method:: __call__(obj) 
    78      
    79            Convert a string, number, or other suitable object into a variable 
    80            value. 
    81             
    82            :param obj: An object to be converted into a variable value 
    83            :type obj: any suitable
    84            :rtype: :class:`Orange.data.Value` 
    85         
    86     .. method:: randomvalue() 
    87  
    88            Return a random value for the variable. 
    89         
    90            :rtype: :class:`Orange.data.Value` 
    91         
    92     .. method:: compute_value(inst) 
    93  
    94            Compute the value of the variable given the instance by calling 
    95            :obj:`~Variable.get_value_from` through a mechanism that prevents deadlocks by
    96            circular calls. 
    97  
    98            :rtype: :class:`Orange.data.Value` 
    99  
    100 .. _discrete: 
    101 .. class:: Discrete 
    102  
    103     Bases: :class:`Variable` 
    104     
    105     Descriptor for discrete variables. 
    106      
    107     .. attribute:: values 
    108      
    109         A list with symbolic names for the variable's values. Values are stored as
    110         indices referring to this list. Therefore, modifying this list
    111         instantly changes the (symbolic) names of values as they are printed out or
    112         referred to by the user.
    113      
    114         .. note:: 
    115          
    116             The size of the list is also used to indicate the number of 
    117             possible values for this variable. Changing the size - especially 
    118             shrinking the list - can have disastrous effects and is therefore not 
    119             really recommended. Also, do not add values to the list by 
    120             calling its append or extend method: call the :obj:`add_value` 
    121             method instead. 
    122  
    123             It is also assumed that this attribute is always defined (but can 
    124             be empty), so never set it to None. 
    125      
    126     .. attribute:: base_value 
    127  
    128             Stores the base value for the variable as an index in `values`. 
    129             This can be, for instance, a "normal" value, such as "no 
    130             complications" as opposed to abnormal "low blood pressure". The 
    131             base value is used by certain statistics, continuization and,
    132             potentially, learning algorithms. The default is -1, which means that
    133             there is no base value. 
    134      
    135     .. method:: add_value 
    136      
    137             Add a value to values. Always call this function instead of 
    138             appending to values. 
    139  
    140 .. _continuous: 
    141 .. class:: Continuous 
    142  
    143     Bases: :class:`Variable` 
    144  
    145     Descriptor for continuous variables. 
    146      
    147     .. attribute:: number_of_decimals 
    148      
    149         The number of decimals used when the value is printed out, converted to 
    150         a string or saved to a file. 
    151      
    152     .. attribute:: scientific_format 
    153      
    154         If ``True``, the value is printed in scientific format whenever it 
    155         would have more than 5 digits. In this case, :obj:`number_of_decimals` is 
    156         ignored. 
    157  
    158     .. attribute:: adjust_decimals 
    159      
    160         Tells Orange to monitor the number of decimals when the value is 
    161         converted from a string (when the values are read from a file or 
    162         converted by, e.g. ``inst[0]="3.14"``):  
    163         0: the number of decimals is not adjusted automatically; 
    164         1: the number of decimals is adjusted automatically and has already been adjusted;
    165         2: automatic adjustment is enabled, but no values have been converted yet. 
    166  
    167         By default, adjustment of the number of decimals goes as follows: 
    168      
    169         If the variable was constructed when data was read from a file, it will  
    170         be printed with the same number of decimals as the largest number of  
    171         decimals encountered in the file. If scientific notation occurs in the  
    172         file, :obj:`scientific_format` will be set to ``True`` and scientific format  
    173         will be used for values too large or too small.  
    174      
    175         If the variable is created in a script, it will have, by default, three 
    176         decimal places. This can be changed either by setting the value 
    177         from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by 
    178         manually setting the :obj:`number_of_decimals`. 
    179  
    180     .. attribute:: start_value, end_value, step_value 
    181      
    182         The range used for :obj:`randomvalue`. 
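As a rough illustration of the default adjustment described for ``adjust_decimals``, the sketch below derives the number of decimals from the strings read for one column. It is plain Python and does not call Orange; the helper names ``count_decimals`` and ``decimals_for_column`` are hypothetical, and scientific notation is deliberately not handled.

```python
def count_decimals(text):
    """Count digits after the decimal point in a plain number string.

    Hypothetical helper; strings in scientific notation are not supported.
    """
    _, dot, fraction = text.strip().partition(".")
    return len(fraction) if dot else 0


def decimals_for_column(strings):
    """Return the largest number of decimals seen among the given strings."""
    return max([count_decimals(s) for s in strings] or [0])


column = ["3.14", "2.5", "10"]
decimals = decimals_for_column(column)

# Every value is then formatted with the same number of decimals.
formatted = ["%.*f" % (decimals, float(s)) for s in column]
```

With the column above, ``decimals`` is 2, so all three values would be printed with two decimal places.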
    183  
    184 .. _String: 
    185 .. class:: String 
    186  
    187     Bases: :class:`Variable` 
    188  
    189     Descriptor for variables that contain strings. No method can use them for  
    190     learning; some will complain and others will silently ignore them when they  
    191     encounter them. They can be, however, useful for meta-attributes; if  
    192     instances in a dataset have unique IDs, the most efficient way to store them  
    193     is to read them as meta-attributes. In general, never use discrete  
    194     attributes with many (say, more than 50) values. Such attributes are  
    195     probably not of any use for learning and should be stored as string 
    196     attributes. 
    197  
    198     When converting strings into values and back, empty strings are treated  
    199     differently than usual. For other types, an empty string can be used to 
    200     denote undefined values, while :obj:`String` will take empty strings 
    201     as empty strings -- except when loading or saving into file. 
    202     Empty strings in files are interpreted as undefined; to specify an empty 
    203     string, enclose the string in double quotes; these are removed when the 
    204     string is loaded. 
    205  
    206 .. _Python: 
    207 .. class:: Python 
    208  
    209     Bases: :class:`Variable` 
    210  
    211     Base class for descriptors defined in Python. It is fully functional 
    212     and can be used as a descriptor for attributes that contain arbitrary Python 
    213     values. Since this is an advanced topic, PythonVariables are described on a  
    214     separate page. !!TODO!! 
    215      
    216      
    217 Variables computed from other variables 
    218 --------------------------------------- 
    219  
    220 Values of variables are often computed from other variables, such as in 
    221 discretization. The mechanism described below usually functions behind the scenes, 
    222 so understanding it is required only for implementing specific transformations. 
    223  
    224 Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``. 
    225 It can help the learning algorithm if the four-valued attribute ``e`` is 
    226 replaced with a binary attribute having values `"1"` and `"not 1"`. The 
    227 new variable will be computed from the old one on the fly.  
    228  
    229 .. literalinclude:: code/variable-get_value_from.py 
    230     :lines: 7-17 
    231      
    232 The new variable is named ``e2``; we define it with a descriptor of type  
    233 :obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``1`` (we  
    234 chose this order so that the ``not 1``'s index is ``0``, which can be, if  
    235 needed, interpreted as ``False``). Finally, we tell ``e2`` to use
    236 ``checkE`` to compute its value when needed, by assigning ``checkE`` to  
    237 ``e2.get_value_from``.  
    238  
    239 ``checkE`` is a function that is passed an instance and another argument we  
    240 do not care about here. If the instance's ``e`` equals ``1``, the function  
    241 returns value ``1``, otherwise it returns ``not 1``. Both are returned as  
    242 values, not plain strings. 
    243  
    244 In most circumstances the value of ``e2`` can be computed on the fly - we can  
    245 pretend that the variable exists in the data, although it does not (but  
    246 can be computed from it). For instance, we can compute the information gain of 
    247 variable ``e2`` or its distribution without actually constructing data containing 
    248 the new variable. 
    249  
    250 .. literalinclude:: code/variable-get_value_from.py 
    251     :lines: 19-22 
    252  
    253 There are methods which cannot compute values on the fly because it would be 
    254 too complex or time consuming. In such cases, the data need to be converted 
    255 to a new :obj:`Orange.data.Table`:: 
    256  
    257     new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var]) 
    258     new_data = Orange.data.Table(new_domain, data)  
    259  
    260 Automatic computation is useful when the data is split into training and  
    261 testing examples. Training instances can be modified by adding, removing  
    262 and transforming variables (in a typical setup, continuous variables  
    263 are discretized prior to learning, therefore the original variables are  
    264 replaced by new ones). Test instances, on the other hand, are left as they  
    265 are. When they are classified, the classifier automatically converts the  
    266 testing instances into the new domain, which includes recomputation of  
    267 transformed variables.  
    268  
    269 .. literalinclude:: code/variable-get_value_from.py 
    270     :lines: 24- 
    271  
    272 .. _attributes: 
    273  
    274 Storing additional information
    275 ------------------------------
    276  
    277 All variables have a field :obj:`~Variable.attributes`, a dictionary
    278 which can contain strings. Although the current implementation allows values
    279 of any type, we strongly advise using only strings. An example:
    280  
    281 .. literalinclude:: code/attributes.py 
    282  
    283 These attributes can only be saved to a .tab file. They are listed in the 
    284 third line in <name>=<value> format, after other attribute specifications 
    285 (such as "meta" or "class"), and are separated by spaces.  
    286  
    287 .. _variable_descriptor_reuse: 
    288  
    289 Reuse of descriptors 
    290 -------------------- 
    291  
    292 There are situations when variable descriptors need to be reused. Typically, the  
    293 user loads some training examples, trains a classifier, and then loads a separate 
    294 test set. For the classifier to recognize the variables in the second data set, 
    295 the descriptors, not just the names, need to be the same.  
    296  
    297 When constructing new descriptors for data read from a file or during unpickling, 
    298 Orange checks whether an appropriate descriptor (with the same name and, in case 
    299 of discrete variables, also values) already exists and reuses it. When new 
    300 descriptors are constructed by explicitly calling the above constructors, this 
    301 always creates new descriptors and thus new variables, although a variable with 
    302 the same name may already exist. 
    303  
    304 The search for an existing variable is based on four attributes: the variable's name, 
    305 type, ordered values, and unordered values. As for the latter two, the values can  
    306 be explicitly ordered by the user, e.g. in the second line of the tab-delimited  
    307 file. For instance, sizes can be ordered as small, medium, or big. 
    308  
    309 The search for existing variables can end with one of the following statuses. 
    310  
    311 .. data:: Orange.data.variable.MakeStatus.NotFound (4) 
    312  
    313     The variable with that name and type does not exist.  
    314  
    315 .. data:: Orange.data.variable.MakeStatus.Incompatible (3) 
    316  
    317     There are variables with matching name and type, but their 
    318     values are incompatible with the prescribed ordered values. For example, 
    319     if the existing variable already has values ["a", "b"] and the new one 
    320     wants ["b", "a"], the old variable cannot be reused. The existing list can, 
    321     however, be extended with the new values, so searching for ["a", "b", "c"] would
    322     succeed. Likewise, a search for ["a"] would be successful, since the extra existing value
    323     does not matter. The formal rule is thus that the values are compatible iff ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``. 
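The formal rule can be tried out without Orange. The following is a minimal sketch in plain Python; the function name ``values_compatible`` is hypothetical and only restates the expression given above.

```python
def values_compatible(existing_values, ordered_values):
    """Check the prefix rule: the shorter list must be a prefix of the longer one."""
    n_existing = len(existing_values)
    n_ordered = len(ordered_values)
    return existing_values[:n_ordered] == ordered_values[:n_existing]


existing = ["a", "b"]

reversed_order = values_compatible(existing, ["b", "a"])   # False: the order differs
longer_list = values_compatible(existing, ["a", "b", "c"]) # True: "c" can be added
shorter_list = values_compatible(existing, ["a"])          # True: the extra "b" is harmless
```

The three calls correspond to the three cases discussed in the paragraph above.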
    324  
    325 .. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2) 
    326  
    327     There is a matching variable, yet it has none of the values that the new 
    328     variable will have (this is obviously possible only if the new variable has 
    329     no prescribed ordered values). For instance, we search for a variable 
    330     "sex" with values "male" and "female", while there is a variable of the same  
    331     name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this  
    332     variable is possible, though this should probably be a new variable since it  
    333     obviously comes from a different data set. If we do decide to reuse the variable, the  
    334     old variable will get some unneeded new values and the new one will inherit  
    335     some from the old. 
    336  
    337 .. data:: Orange.data.variable.MakeStatus.MissingValues (1) 
    338  
    339     There is a matching variable with some of the values that the new one  
    340     requires, but some values are missing. This situation is neither uncommon  
    341     nor suspicious: in case of separate training and testing data sets there may 
    342     be values which occur in one set but not in the other. 
    343  
    344 .. data:: Orange.data.variable.MakeStatus.OK (0) 
    345  
    346     There is a perfect match which contains all the prescribed values in the 
    347     correct order. The existing variable may have some extra values, though. 
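The five statuses can be summarized with a small model in plain Python. This is only a sketch of the rules as described in this section, not Orange's actual implementation: the function ``match_status`` and the module-level constants are hypothetical, and only the numeric codes are taken from the text.

```python
# Numeric codes as listed above for Orange.data.variable.MakeStatus.
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


def match_status(existing_values, ordered_values=None, unordered_values=None):
    """Model the status of one candidate discrete variable.

    existing_values is None when no variable with that name and type exists
    (an assumption of this sketch).
    """
    if existing_values is None:
        return NOT_FOUND
    ordered = list(ordered_values or [])
    unordered = list(unordered_values or [])
    # Prefix rule for the prescribed ordered values.
    if existing_values[:len(ordered)] != ordered[:len(existing_values)]:
        return INCOMPATIBLE
    wanted = ordered + unordered
    recognized = [v for v in wanted if v in existing_values]
    if wanted and not recognized:
        return NO_RECOGNIZED_VALUES
    if len(recognized) < len(wanted):
        return MISSING_VALUES
    return OK


sex_status = match_status(["M", "F"], None, ["male", "female"])  # 2
missing_status = match_status(["a", "b"], ["a"], ["c"])          # 1
order_status = match_status(["a", "b", "c", "d"], ["b"])         # 3
```

The first call reproduces the "sex" example: the names match, but none of the requested values is recognized.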
    348  
    349 Continuous variables can obviously have only two statuses,  
    350 :obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`. 
    351  
    352 When loading the data using :obj:`Orange.data.Table`, Orange takes the safest  
    353 approach and, by default, reuses everything that is compatible up to  
    354 and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. Unintended reuse would be obvious from the 
    355 variable having too many values, which the user can notice and fix. More on that  
    356 in the page on `loading data`. !!TODO!! 
    357  
    358 There are two functions for reusing the variables instead of creating new ones. 
    359  
    360 .. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on]) 
    361  
    362     Find and return an existing variable or create a new one if none of the existing 
    363     variables matches the given name, type and values. 
    364      
    365     The optional `create_new_on` specifies the status at which a new variable is 
    366     created. The status must be at most :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or 
    367     non-existing) variables cannot be reused. If it is set lower, for instance  
    368     to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is created even if there exists
    369     a variable which is only missing some of the values. If set to :obj:`~Orange.data.variable.MakeStatus.OK`, the function
    370     always creates a new variable. 
    371      
    372     The function returns a tuple containing a variable descriptor and the 
    373     status of the best matching variable. So, if ``create_new_on`` is set to 
    374     :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a variable whose status is, say, 
    375     :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a variable would be created, while the second  
    376     element of the tuple would contain :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other 
    377     hand, there exists a variable which is perfectly OK, its descriptor is  
    378     returned and the returned status is :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no  
    379     indicator whether the returned variable is reused or not. This can be, 
    380     however, read from the status code: if it is smaller than the specified 
    381     ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed. 
    382  
    383     The exception to the rule is when ``create_new_on`` is OK. In this case, the  
    384     function does not search through the existing variables and cannot know the  
    385     status, so the returned status in this case is always :obj:`~Orange.data.variable.MakeStatus.OK`. 
    386  
    387     :param name: Variable name 
    388     :param type: Variable type 
    389     :type type: Orange.data.variable.Type 
    390     :param ordered_values: a list of ordered values 
    391     :param unordered_values: a list of values, for which the order does not 
    392         matter 
    393     :param create_new_on: gives the condition for constructing a new variable instead 
    394         of reusing an existing one
    395      
    396     :return_type: a tuple (:class:`Orange.data.variable.Variable`, int) 
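Since the function does not say whether the descriptor was reused, the caller has to compare the returned status with ``create_new_on``. The sketch below expresses that comparison in plain Python. The helper ``is_reused`` is hypothetical, and its default of ``Incompatible`` is an assumption based on the loading behaviour described earlier, not a documented default of :func:`Orange.data.variable.make`.

```python
# Numeric codes as listed for Orange.data.variable.MakeStatus.
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


def is_reused(status, create_new_on=INCOMPATIBLE):
    """Return True if a status returned by make() implies an existing descriptor was reused."""
    return status < create_new_on


reused_missing = is_reused(MISSING_VALUES)                                  # True
blocked_reuse = is_reused(NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES)       # False
never_reuse = is_reused(OK, OK)                                             # False
```

The last call covers the exception mentioned above: with ``create_new_on`` set to OK the returned status is always OK, which is not smaller than the threshold, so the comparison still reports a new descriptor.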
    397      
    398 .. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on])
    399  
    400     Find and return an existing variable, or :obj:`None` if no match is found. 
    401      
    402     :param name: variable name. 
    403     :param type: variable type. 
    404     :type type: Orange.data.variable.Type 
    405     :param ordered_values: a list of ordered values 
    406     :param unordered_values: a list of values, for which the order does not 
    407         matter 
    408     :param create_new_on: gives the condition for constructing a new variable instead 
    409         of reusing an existing one
    410  
    411     :return_type: :class:`Orange.data.variable.Variable` 
    412      
    413 The following examples (from :download:`variable-reuse.py <code/variable-reuse.py>`) give the shown results if
    414 executed only once (in a Python session) and in this order. 
    415  
    416 :func:`Orange.data.variable.make` can be used for the construction of new variables. :: 
    417      
    418     >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"]) 
    419     >>> print s, v1.values 
    420     4 <a, b> 
    421  
    422 No surprises here: a new variable is created and the status is :obj:`~Orange.data.variable.MakeStatus.NotFound`. :: 
    423  
    424     >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"]) 
    425     >>> print s, v2 is v1, v1.values 
    426     1 True <a, b, c> 
    427  
    428 The status is 1 (:obj:`~Orange.data.variable.MakeStatus.MissingValues`), yet the variable is reused (``v2 is v1``). 
    429 ``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does 
    430 not matter that the new variable does not need the value ``b``. :: 
    431  
    432     >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"]) 
    433     >>> print s, v3 is v1, v1.values 
    434     1 True <a, b, c, d> 
    435  
    436 This is like before, except that the new value, ``d``, is given among the
    437 ordered values. ::
    438  
    439     >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"]) 
    440     >>> print s, v4 is v1, v1.values, v4.values 
    441     3 False <a, b, c, d> <b>
    442  
    443 The new variable needs to have ``b`` as the first value, so it is incompatible  
    444 with the existing variables. The status is thus 3 (:obj:`~Orange.data.variable.MakeStatus.Incompatible`), the two  
    445 variables are not equal and have different lists of values. :: 
    446  
    447     >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"]) 
    448     >>> print s, v5 is v1, v1.values, v5.values 
    449     0 True <a, b, c, d> <a, b, c, d> 
    450  
    451 The new variable has values ``c`` and ``a``, but the order is not important,  
    452 so the existing attribute is :obj:`~Orange.data.variable.MakeStatus.OK`. :: 
    453  
    454     >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"])
    455     >>> print s, v6 is v1, v1.values, v6.values 
    456     2 True <a, b, c, d, e> <a, b, c, d, e> 
    457  
    458 The new variable has different values than the existing variable (status is 2, 
    459 :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`), but the existing one is nonetheless reused. Note that we 
    460 gave ``e`` in the list of unordered values. If it was among the ordered, the 
    461 reuse would fail. :: 
    462  
    463     >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, 
    464             ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues)
    465     >>> print s, v7 is v1, v1.values, v7.values 
    466     2 False <a, b, c, d, e> <f> 
    467  
    468 This is the same as before, except that we prohibited reuse when there are no 
    469 recognized values. Hence a new variable is created, though the returned status is  
    470 the same as before:: 
    471  
    472     >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, 
    473             ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK) 
    474     >>> print s, v8 is v1, v1.values, v8.values 
    475     0 False <a, b, c, d, e> <a, b, c, d, e> 
    476  
    477 Finally, this is a perfect match, but any reuse is prohibited, so a new  
    478 variable is created. 
    479  
    480 """ 
    481 1 from orange import Variable
    482 2 from orange import EnumVariable as Discrete