Changeset 9727:b43c4d11d333 in orange


Timestamp:
02/06/12 15:18:09 (2 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Message:

Polished documentation about Orange.data.variable

Files:
2 edited

  • Orange/data/variable.py

    r9671 r9727  
    1 """ 
    2 ======================== 
    3 Variables (``variable``) 
    4 ======================== 
    5  
    6 Data instances in Orange can contain several types of variables: 
    7 :ref:`discrete <discrete>`, :ref:`continuous <continuous>`, 
    8 :ref:`strings <string>`, and :ref:`Python <Python>` and types derived from it. 
    9 The latter represent arbitrary Python objects. 
    10 The names, types, values (where applicable), functions for computing the 
    11 variable value from values of other variables, and other properties of the 
    12 variables are stored in descriptor classes defined in this module. 
    13  
    14 Variable descriptors 
    15 -------------------- 
    16  
    17 Variable descriptors can be constructed either directly, using  
    18 constructors and passing attributes as parameters, or by a  
    19 factory function :func:`Orange.data.variable.make`, which either  
    20 retrieves an existing descriptor or constructs a new one. 
    21  
    22 .. class:: Variable 
    23  
    24     An abstract base class for variable descriptors. 
    25  
    26     .. attribute:: name 
    27  
    28         The name of the variable. Variable names do not need to be unique since two 
    29         variables are considered the same only if they have the same descriptor 
    30         (e.g. even multiple variables in the same table can have the same name). 
    31         This should, however, be avoided since it may result in unpredictable 
    32         behavior. 
    33      
    34     .. attribute:: var_type 
    35         
    36         Variable type; it can be Orange.data.Type.Discrete, 
    37         Orange.data.Type.Continuous, Orange.data.Type.String or 
    38         Orange.data.Type.Other.   
    39  
    40     .. attribute:: get_value_from 
    41  
    42         A function (an instance of :obj:`Orange.classification.Classifier`) which computes 
    43         a value of the variable from values of one or more other variables. This 
    44         is used, for instance, in discretization where the variables describing 
    45         the discretized variable are computed from the original variable.  
    46  
    47     .. attribute:: ordered 
    48      
    49         A flag telling whether the values of a discrete variable are ordered. At 
    50         the moment, no built-in method treats ordinal variables differently than 
    51         nominal ones. 
    52      
    53     .. attribute:: distributed 
    54      
    55         A flag telling whether the values of the variables are distributions. 
    56         As for the flag ordered, no methods treat such variables in any special 
    57         manner. 
    58      
    59     .. attribute:: random_generator 
    60      
    61         A local random number generator used by method
    62         :obj:`Variable.randomvalue`.
    63      
    64     .. attribute:: default_meta_id 
    65      
    66         A proposed (but not guaranteed) meta id to be used for that variable. 
    67         This is used, for instance, by the data loader for tab-delimited file 
    68         format instead of assigning an arbitrary new value, or by 
    69         :obj:`Orange.data.new_meta_id` if the variable is passed as an argument.  
    70          
    71     .. attribute:: attributes 
    72          
    73         A dictionary which allows the user to store additional information 
    74         about the variable. All values should be strings. See the section  
    75         about :ref:`storing additional information <attributes>`. 
    76  
    77     .. method:: __call__(obj) 
    78      
    79            Convert a string, number, or other suitable object into a variable 
    80            value. 
    81             
    82            :param obj: An object to be converted into a variable value 
    83            :type obj: any suitable
    84            :rtype: :class:`Orange.data.Value` 
    85         
    86     .. method:: randomvalue() 
    87  
    88            Return a random value for the variable. 
    89         
    90            :rtype: :class:`Orange.data.Value` 
    91         
    92     .. method:: compute_value(inst) 
    93  
    94            Compute the value of the variable given the instance by calling
    95            :obj:`~Variable.get_value_from` through a mechanism that prevents deadlocks
    96            caused by circular calls.
    97  
    98            :rtype: :class:`Orange.data.Value` 
    99  
    100 .. _discrete: 
    101 .. class:: Discrete 
    102  
    103     Bases: :class:`Variable` 
    104     
    105     Descriptor for discrete variables. 
    106      
    107     .. attribute:: values 
    108      
    109         A list with symbolic names for variables' values. Values are stored as 
    110         indices referring to this list. Therefore, modifying this list  
    111         instantly changes the (symbolic) names of values as they are printed out or 
    112         referred to by user. 
    113      
    114         .. note:: 
    115          
    116             The size of the list is also used to indicate the number of 
    117             possible values for this variable. Changing the size - especially 
    118             shrinking the list - can have disastrous effects and is therefore not 
    119             really recommended. Also, do not add values to the list by 
    120             calling its append or extend method: call the :obj:`add_value` 
    121             method instead. 
    122  
    123             It is also assumed that this attribute is always defined (but can 
    124             be empty), so never set it to None. 
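
The index-based storage described in the note can be illustrated with a small plain-Python model. ``ToyDiscrete`` and ``ToyValue`` are hypothetical stand-ins invented for this sketch, not Orange classes; the point is only that a value keeps an index, so renaming a list entry changes how every stored value prints.

```python
class ToyDiscrete(object):
    """Hypothetical stand-in for a discrete descriptor (not Orange's class)."""

    def __init__(self, name, values=None):
        self.name = name
        self.values = list(values or [])  # always a list, never None

    def add_value(self, s):
        """Append a symbolic name and return its index."""
        self.values.append(s)
        return len(self.values) - 1


class ToyValue(object):
    """A value stores only an index; its name is looked up on demand."""

    def __init__(self, variable, index):
        self.variable = variable
        self.index = index

    def __str__(self):
        return self.variable.values[self.index]


size = ToyDiscrete("size", ["small", "medium"])
v = ToyValue(size, 1)
before = str(v)          # name looked up through index 1
size.values[1] = "mid"   # renaming the list entry...
after = str(v)           # ...changes how the stored value prints
```

Shrinking ``values`` in this model would leave ``v`` pointing past the end of the list, which mirrors why the note warns against changing the size of the list.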
    125      
    126     .. attribute:: base_value 
    127  
    128             Stores the base value for the variable as an index in `values`. 
    129             This can be, for instance, a "normal" value, such as "no 
    130             complications" as opposed to abnormal "low blood pressure". The 
    131             base value is used by certain statistics, continuization and,
    132             potentially, learning algorithms. The default is -1, which means that
    133             there is no base value. 
    134      
    135     .. method:: add_value 
    136      
    137             Add a value to values. Always call this function instead of 
    138             appending to values. 
    139  
    140 .. _continuous: 
    141 .. class:: Continuous 
    142  
    143     Bases: :class:`Variable` 
    144  
    145     Descriptor for continuous variables. 
    146      
    147     .. attribute:: number_of_decimals 
    148      
    149         The number of decimals used when the value is printed out, converted to 
    150         a string or saved to a file. 
    151      
    152     .. attribute:: scientific_format 
    153      
    154         If ``True``, the value is printed in scientific format whenever it 
    155         would have more than 5 digits. In this case, :obj:`number_of_decimals` is 
    156         ignored. 
    157  
    158     .. attribute:: adjust_decimals 
    159      
    160         Tells Orange to monitor the number of decimals when the value is 
    161         converted from a string (when the values are read from a file or 
    162         converted by, e.g. ``inst[0]="3.14"``):  
    163         0: the number of decimals is not adjusted automatically; 
    164         1: the number of decimals is being (and has already been) adjusted;
    165         2: automatic adjustment is enabled, but no values have been converted yet. 
    166  
    167         By default, adjustment of the number of decimals goes as follows: 
    168      
    169         If the variable was constructed when data was read from a file, it will  
    170         be printed with the same number of decimals as the largest number of  
    171         decimals encountered in the file. If scientific notation occurs in the  
    172         file, :obj:`scientific_format` will be set to ``True`` and scientific format  
    173         will be used for values too large or too small.  
    174      
    175         If the variable is created in a script, it will have, by default, three 
    176         decimal places. This can be changed either by setting the value 
    177         from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by 
    178         manually setting the :obj:`number_of_decimals`. 
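
As a rough illustration of the adjustment described above, the sketch below tracks the largest number of decimals seen among string inputs and formats a number accordingly. The helper names are hypothetical and the code is not Orange's implementation; scientific notation is deliberately not handled.

```python
def count_decimals(text):
    """Count the digits after the decimal point in a plain decimal string."""
    text = text.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def decimals_for_column(strings):
    """Largest number of decimals among the given strings (0 if none)."""
    return max([count_decimals(s) for s in strings] or [0])


def format_value(x, number_of_decimals):
    """Print a float with a fixed number of decimals."""
    return "%.*f" % (number_of_decimals, x)


n = decimals_for_column(["3.14", "2.5", "10"])
shown = format_value(2.5, n)
```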
    179  
    180     .. attribute:: start_value, end_value, step_value 
    181      
    182         The range used for :obj:`randomvalue`. 
    183  
    184 .. _String: 
    185 .. class:: String 
    186  
    187     Bases: :class:`Variable` 
    188  
    189     Descriptor for variables that contain strings. No method can use them for  
    190     learning; some will complain and others will silently ignore them when they  
    191     encounter them. They can be, however, useful for meta-attributes; if  
    192     instances in a dataset have unique IDs, the most efficient way to store them  
    193     is to read them as meta-attributes. In general, never use discrete  
    194     attributes with many (say, more than 50) values. Such attributes are  
    195     probably not of any use for learning and should be stored as string 
    196     attributes. 
    197  
    198     When converting strings into values and back, empty strings are treated  
    199     differently than usual. For other types, an empty string can be used to 
    200     denote undefined values, while :obj:`String` will take empty strings 
    201     as empty strings -- except when loading or saving into file. 
    202     Empty strings in files are interpreted as undefined; to specify an empty 
    203     string, enclose the string in double quotes; these are removed when the 
    204     string is loaded. 
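
The file convention for empty strings can be sketched as a pair of plain-Python helpers. The names ``read_string_cell`` and ``write_string_cell`` are hypothetical, ``None`` stands in for an undefined value, and the assumption that only the empty string is quoted on saving is ours.

```python
UNDEFINED = None  # stand-in for an undefined value in this sketch


def read_string_cell(cell):
    """Interpret one cell of a string column read from a file."""
    if cell == "":
        return UNDEFINED                  # empty cell means undefined
    if len(cell) >= 2 and cell[0] == '"' and cell[-1] == '"':
        return cell[1:-1]                 # quotes are removed on loading
    return cell


def write_string_cell(value):
    """Inverse of read_string_cell for undefined and empty values."""
    if value is UNDEFINED:
        return ""
    if value == "":
        return '""'                       # quoted so it is not read as undefined
    return value
```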
    205  
    206 .. _Python: 
    207 .. class:: Python 
    208  
    209     Bases: :class:`Variable` 
    210  
    211     Base class for descriptors defined in Python. It is fully functional 
    212     and can be used as a descriptor for attributes that contain arbitrary Python 
    213     values. Since this is an advanced topic, PythonVariables are described on a  
    214     separate page. !!TODO!! 
    215      
    216      
    217 Variables computed from other variables 
    218 --------------------------------------- 
    219  
    220 Values of variables are often computed from other variables, such as in 
    221 discretization. The mechanism described below usually functions behind the scenes, 
    222 so understanding it is required only for implementing specific transformations. 
    223  
    224 Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``. 
    225 It can help the learning algorithm if the four-valued attribute ``e`` is 
    226 replaced with a binary attribute having values `"1"` and `"not 1"`. The 
    227 new variable will be computed from the old one on the fly.  
    228  
    229 .. literalinclude:: code/variable-get_value_from.py 
    230     :lines: 7-17 
    231      
    232 The new variable is named ``e2``; we define it with a descriptor of type  
    233 :obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``1`` (we  
    234 chose this order so that the ``not 1``'s index is ``0``, which can be, if  
    235 needed, interpreted as ``False``). Finally, we tell e2 to use  
    236 ``checkE`` to compute its value when needed, by assigning ``checkE`` to  
    237 ``e2.get_value_from``.  
    238  
    239 ``checkE`` is a function that is passed an instance and another argument we  
    240 do not care about here. If the instance's ``e`` equals ``1``, the function  
    241 returns value ``1``, otherwise it returns ``not 1``. Both are returned as  
    242 values, not plain strings. 
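
The logic of ``checkE`` can be mimicked without Orange. The sketch below is a plain-Python stand-in, not the contents of the included script: instances are modelled as dictionaries, and the hypothetical ``check_e`` returns an index into the value list rather than an Orange value.

```python
E2_VALUES = ["not 1", "1"]  # index 0 is "not 1", so it can be read as False


def check_e(instance, return_what=None):
    """Return the index of the e2 value computed from the instance's ``e``.

    ``return_what`` plays the role of the second argument that the text
    says we do not care about; it is ignored here.
    """
    return 1 if instance["e"] == "1" else 0


inst = {"a": "1", "b": "2", "e": "3"}
e2_name = E2_VALUES[check_e(inst)]
```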
    243  
    244 In most circumstances the value of ``e2`` can be computed on the fly - we can  
    245 pretend that the variable exists in the data, although it does not (but  
    246 can be computed from it). For instance, we can compute the information gain of 
    247 variable ``e2`` or its distribution without actually constructing data containing 
    248 the new variable. 
    249  
    250 .. literalinclude:: code/variable-get_value_from.py 
    251     :lines: 19-22 
    252  
    253 There are methods which cannot compute values on the fly because it would be 
    254 too complex or time consuming. In such cases, the data need to be converted 
    255 to a new :obj:`Orange.data.Table`:: 
    256  
    257     new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var]) 
    258     new_data = Orange.data.Table(new_domain, data)  
    259  
    260 Automatic computation is useful when the data is split into training and  
    261 testing examples. Training instances can be modified by adding, removing  
    262 and transforming variables (in a typical setup, continuous variables  
    263 are discretized prior to learning, therefore the original variables are  
    264 replaced by new ones). Test instances, on the other hand, are left as they  
    265 are. When they are classified, the classifier automatically converts the  
    266 testing instances into the new domain, which includes recomputation of  
    267 transformed variables.  
    268  
    269 .. literalinclude:: code/variable-get_value_from.py 
    270     :lines: 24- 
    271  
    272 .. _attributes: 
    273  
    274 Storing additional variables 
    275 ----------------------------- 
    276  
    277 All variables have a field :obj:`~Variable.attributes`, a dictionary
    278 which can contain strings. Although the current implementation allows values
    279 of any type, we strongly advise using only strings. An example:
    280  
    281 .. literalinclude:: code/attributes.py 
    282  
    283 These attributes can only be saved to a .tab file. They are listed in the 
    284 third line in <name>=<value> format, after other attribute specifications 
    285 (such as "meta" or "class"), and are separated by spaces.  
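
To make the ``<name>=<value>`` convention concrete, here is a small sketch of how one cell from the third line of a .tab file might be split into flags and attributes. ``parse_flags`` is a hypothetical helper, not Orange's parser, and it assumes that values contain no spaces.

```python
def parse_flags(cell):
    """Split a third-line cell into plain flags and name=value attributes."""
    flags, attributes = [], {}
    for token in cell.split():
        if "=" in token:
            name, value = token.split("=", 1)
            attributes[name] = value      # kept as a string
        else:
            flags.append(token)           # e.g. "meta" or "class"
    return flags, attributes


flags, attributes = parse_flags("meta unit=cm source=lab")
```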
    286  
    287 .. _variable_descriptor_reuse: 
    288  
    289 Reuse of descriptors 
    290 -------------------- 
    291  
    292 There are situations when variable descriptors need to be reused. Typically, the  
    293 user loads some training examples, trains a classifier, and then loads a separate 
    294 test set. For the classifier to recognize the variables in the second data set, 
    295 the descriptors, not just the names, need to be the same.  
    296  
    297 When constructing new descriptors for data read from a file or during unpickling, 
    298 Orange checks whether an appropriate descriptor (with the same name and, in case 
    299 of discrete variables, also values) already exists and reuses it. When new 
    300 descriptors are constructed by explicitly calling the above constructors, this 
    301 always creates new descriptors and thus new variables, although a variable with 
    302 the same name may already exist. 
    303  
    304 The search for an existing variable is based on four attributes: the variable's name, 
    305 type, ordered values, and unordered values. As for the latter two, the values can  
    306 be explicitly ordered by the user, e.g. in the second line of the tab-delimited  
    307 file. For instance, sizes can be ordered as small, medium, or big. 
    308  
    309 The search for existing variables can end with one of the following statuses. 
    310  
    311 .. data:: Orange.data.variable.MakeStatus.NotFound (4) 
    312  
    313     The variable with that name and type does not exist.  
    314  
    315 .. data:: Orange.data.variable.MakeStatus.Incompatible (3) 
    316  
    317     There are variables with matching name and type, but their 
    318     values are incompatible with the prescribed ordered values. For example, 
    319     if the existing variable already has values ["a", "b"] and the new one 
    320     wants ["b", "a"], the old variable cannot be reused. The existing list can,
    321     however, be extended with new values, so searching for ["a", "b", "c"] would
    322     succeed. Likewise a search for ["a"] would be successful, since the extra existing value 
    323     does not matter. The formal rule is thus that the values are compatible iff ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``. 
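
The formal rule can be checked directly in plain Python. ``values_compatible`` is a hypothetical helper name; its body is the expression quoted above.

```python
def values_compatible(existing_values, ordered_values):
    """True if the two lists agree on their common prefix."""
    n_existing, n_ordered = len(existing_values), len(ordered_values)
    return existing_values[:n_ordered] == ordered_values[:n_existing]


existing = ["a", "b"]
swapped = values_compatible(existing, ["b", "a"])        # order conflicts
extended = values_compatible(existing, ["a", "b", "c"])  # list can grow
shorter = values_compatible(existing, ["a"])             # extra value ignored
```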
    324  
    325 .. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2) 
    326  
    327     There is a matching variable, yet it has none of the values that the new 
    328     variable will have (this is obviously possible only if the new variable has 
    329     no prescribed ordered values). For instance, we search for a variable 
    330     "sex" with values "male" and "female", while there is a variable of the same  
    331     name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this  
    332     variable is possible, though this should probably be a new variable since it  
    333     obviously comes from a different data set. If we do decide to reuse the variable, the  
    334     old variable will get some unneeded new values and the new one will inherit  
    335     some from the old. 
    336  
    337 .. data:: Orange.data.variable.MakeStatus.MissingValues (1) 
    338  
    339     There is a matching variable with some of the values that the new one  
    340     requires, but some values are missing. This situation is neither uncommon  
    341     nor suspicious: in case of separate training and testing data sets there may 
    342     be values which occur in one set but not in the other. 
    343  
    344 .. data:: Orange.data.variable.MakeStatus.OK (0) 
    345  
    346     There is a perfect match which contains all the prescribed values in the 
    347     correct order. The existing variable may have some extra values, though. 
    348  
    349 Continuous variables can obviously have only two statuses,  
    350 :obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`. 
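
The statuses above can be summarized with a simplified model. ``match_status`` is a hypothetical function written for illustration; it assumes a variable with a matching name and type has already been found (otherwise the status would be ``NotFound``) and is not Orange's actual search code.

```python
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


def match_status(existing_values, ordered_values=None, unordered_values=None):
    """Classify how well an existing value list matches the requested values."""
    existing = list(existing_values)
    ordered = list(ordered_values or [])
    unordered = list(unordered_values or [])

    # Ordered values must agree with the existing list on the common prefix.
    if existing[:len(ordered)] != ordered[:len(existing)]:
        return INCOMPATIBLE

    requested = ordered + unordered
    missing = [v for v in requested if v not in existing]
    if not missing:
        return OK
    if len(missing) == len(requested):
        return NO_RECOGNIZED_VALUES
    return MISSING_VALUES
```

For example, ``match_status(["a", "b"], ["a"], ["c"])`` gives ``MISSING_VALUES`` because ``"c"`` is not yet known, while ``match_status(["a", "b", "c", "d"], ["b"])`` gives ``INCOMPATIBLE``.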
    351  
    352 When loading the data using :obj:`Orange.data.Table`, Orange takes the safest  
    353 approach and, by default, reuses everything that is compatible up to  
    354 and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. Unintended reuse would be obvious from the 
    355 variable having too many values, which the user can notice and fix. More on that  
    356 in the page on `loading data`. !!TODO!! 
    357  
    358 There are two functions for reusing the variables instead of creating new ones. 
    359  
    360 .. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on]) 
    361  
    362     Find and return an existing variable or create a new one if none of the existing 
    363     variables matches the given name, type and values. 
    364      
    365     The optional `create_new_on` specifies the status at which a new variable is 
    366     created. The status must be at most :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or 
    367     non-existing) variables cannot be reused. If it is set lower, for instance  
    368     to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is created even if there exists
    369     a variable which is only missing some of the values. If set to :obj:`~Orange.data.variable.MakeStatus.OK`, the function
    370     always creates a new variable. 
    371      
    372     The function returns a tuple containing a variable descriptor and the 
    373     status of the best matching variable. So, if ``create_new_on`` is set to 
    374     :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a variable whose status is, say, 
    375     :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a variable would be created, while the second  
    376     element of the tuple would contain :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other 
    377     hand, there exists a variable which is perfectly OK, its descriptor is  
    378     returned and the returned status is :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no  
    379     indicator whether the returned variable is reused or not. This can be, 
    380     however, read from the status code: if it is smaller than the specified 
    381     ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed. 
    382  
    383     The exception to the rule is when ``create_new_on`` is OK. In this case, the  
    384     function does not search through the existing variables and cannot know the  
    385     status, so the returned status in this case is always :obj:`~Orange.data.variable.MakeStatus.OK`. 
    386  
    387     :param name: Variable name 
    388     :param type: Variable type 
    389     :type type: Orange.data.variable.Type 
    390     :param ordered_values: a list of ordered values 
    391     :param unordered_values: a list of values, for which the order does not 
    392         matter 
    393     :param create_new_on: gives the condition for constructing a new variable instead
    394         of reusing an existing one
    395      
    396     :return_type: a tuple (:class:`Orange.data.variable.Variable`, int) 
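
The reuse rule for ``create_new_on`` can be written down as a small decision function. ``make_outcome`` is a hypothetical name, and the default of ``Incompatible`` is an assumption inferred from the examples in this section, not a documented signature.

```python
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


def make_outcome(best_status, create_new_on=INCOMPATIBLE):
    """Return (reused, returned_status) for the best matching variable."""
    if create_new_on == OK:
        # No search is performed, so the reported status is always OK.
        return False, OK
    reused = best_status < create_new_on
    return reused, best_status


reuse_default = make_outcome(NO_RECOGNIZED_VALUES)
reuse_strict = make_outcome(NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES)
```

With the assumed default, a variable with no recognized values is still reused; passing ``NO_RECOGNIZED_VALUES`` as the threshold forces a new descriptor while reporting the same status.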
    397      
    398 .. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on])
    399  
    400     Find and return an existing variable, or :obj:`None` if no match is found. 
    401      
    402     :param name: variable name. 
    403     :param type: variable type. 
    404     :type type: Orange.data.variable.Type 
    405     :param ordered_values: a list of ordered values 
    406     :param unordered_values: a list of values, for which the order does not 
    407         matter 
    408     :param create_new_on: gives the condition for constructing a new variable instead
    409         of reusing an existing one
    410  
    411     :return_type: :class:`Orange.data.variable.Variable` 
    412      
    413 The following examples (from :download:`variable-reuse.py <code/variable-reuse.py>`) give the shown results if
    414 executed only once (in a Python session) and in this order. 
    415  
    416 :func:`Orange.data.variable.make` can be used for the construction of new variables. :: 
    417      
    418     >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"]) 
    419     >>> print s, v1.values 
    420     4 <a, b> 
    421  
    422 No surprises here: a new variable is created and the status is :obj:`~Orange.data.variable.MakeStatus.NotFound`. :: 
    423  
    424     >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"]) 
    425     >>> print s, v2 is v1, v1.values 
    426     1 True <a, b, c> 
    427  
    428 The status is 1 (:obj:`~Orange.data.variable.MakeStatus.MissingValues`), yet the variable is reused (``v2 is v1``). 
    429 ``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does 
    430 not matter that the new variable does not need the value ``b``. :: 
    431  
    432     >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"]) 
    433     >>> print s, v3 is v1, v1.values 
    434     1 True <a, b, c, d> 
    435  
    436 This is like before, except that the new value, ``d``, is given among the
    437 ordered values. ::
    438  
    439     >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"]) 
    440     >>> print s, v4 is v1, v1.values, v4.values 
    441     3 False <a, b, c, d> <b>
    442  
    443 The new variable needs to have ``b`` as the first value, so it is incompatible  
    444 with the existing variables. The status is thus 3 (:obj:`~Orange.data.variable.MakeStatus.Incompatible`), the two  
    445 variables are not equal and have different lists of values. :: 
    446  
    447     >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"]) 
    448     >>> print s, v5 is v1, v1.values, v5.values 
    449     0 True <a, b, c, d> <a, b, c, d> 
    450  
    451 The new variable has values ``c`` and ``a``, but the order is not important,  
    452 so the existing attribute is :obj:`~Orange.data.variable.MakeStatus.OK`. :: 
    453  
    454     >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"])
    455     >>> print s, v6 is v1, v1.values, v6.values 
    456     2 True <a, b, c, d, e> <a, b, c, d, e> 
    457  
    458 The new variable has different values than the existing variable (status is 2, 
    459 :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`), but the existing one is nonetheless reused. Note that we 
    460 gave ``e`` in the list of unordered values. If it was among the ordered, the 
    461 reuse would fail. :: 
    462  
    463     >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, 
    464             ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues)
    465     >>> print s, v7 is v1, v1.values, v7.values 
    466     2 False <a, b, c, d, e> <f> 
    467  
    468 This is the same as before, except that we prohibited reuse when there are no 
    469 recognized values. Hence a new variable is created, though the returned status is  
    470 the same as before:: 
    471  
    472     >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, 
    473             ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK) 
    474     >>> print s, v8 is v1, v1.values, v8.values 
    475     0 False <a, b, c, d, e> <a, b, c, d, e> 
    476  
    477 Finally, this is a perfect match, but any reuse is prohibited, so a new  
    478 variable is created. 
    479  
    480 """ 
    481 1 from orange import Variable
    482 2 from orange import EnumVariable as Discrete
  • docs/reference/rst/Orange.data.variable.rst

    r9372 r9727  
    1 1 .. automodule:: Orange.data.variable
     2 
     3======================== 
     4Variables (``variable``) 
     5======================== 
     6 
     7Data instances in Orange can contain several types of variables: 
     8:ref:`discrete <discrete>`, :ref:`continuous <continuous>`, 
     9:ref:`strings <string>`, and :ref:`Python <Python>` and types derived from it. 
     10The latter represent arbitrary Python objects. 
     11The names, types, values (where applicable), functions for computing the 
     12variable value from values of other variables, and other properties of the 
     13variables are stored in descriptor classes derived from
     14:obj:`Orange.data.variable.Variable`.
     15 
     16Orange considers two variables (e.g. in two different data tables) the 
     17same if they have the same descriptor. It is allowed - but not 
     18recommended - to have different variables with the same name. 
     19 
     20Variable descriptors 
     21-------------------- 
     22 
     23Variable descriptors can be constructed either by calling the 
     24corresponding constructors or by a factory function
     25:func:`Orange.data.variable.make`, which either retrieves an existing
     26descriptor or constructs a new one.
     27 
     28.. class:: Variable 
     29 
     30    An abstract base class for variable descriptors. 
     31 
     32    .. attribute:: name 
     33 
     34        The name of the variable. 
     35 
     36    .. attribute:: var_type 
     37 
     38        Variable type; it can be :obj:`~Orange.data.Type.Discrete`, 
     39        :obj:`~Orange.data.Type.Continuous`, 
     40        :obj:`~Orange.data.Type.String` or :obj:`~Orange.data.Type.Other`. 
     41 
     42    .. attribute:: get_value_from 
     43 
     44        A function (an instance of :obj:`~Orange.classification.Classifier`) 
     45        that computes a value of the variable from values of one or more 
     46        other variables. This is used, for instance, in discretization, 
     47        which computes the value of a discretized variable from the 
     48        original continuous variable. 
     49 
     50    .. attribute:: ordered 
     51 
     52        A flag telling whether the values of a discrete variable are ordered. At 
     53        the moment, no built-in method treats ordinal variables differently than 
     54        nominal ones. 
     55 
     56    .. attribute:: random_generator 
     57 
     58        A local random number generator used by method 
     59        :obj:`~Variable.randomvalue()`. 
     60 
     61    .. attribute:: default_meta_id 
     62 
     63        A proposed (but not guaranteed) meta id to be used for that variable. 
     64        For instance, when a tab-delimited file contains meta attributes and
     65        the existing variables are reused, they will have this id 
     66        (instead of a new one assigned by :obj:`Orange.data.new_meta_id()`). 
     67 
     68    .. attribute:: attributes 
     69 
     70        A dictionary which allows the user to store additional information 
     71        about the variable. All values should be strings. See the section 
     72        about :ref:`storing additional information <attributes>`. 
     73 
     74    .. method:: __call__(obj) 
     75 
     76           Convert a string, number, or other suitable object into a variable 
     77           value. 
     78 
     79           :param obj: An object to be converted into a variable value 
     80           :type obj: any suitable
     81           :rtype: :class:`Orange.data.Value` 
     82 
     83    .. method:: randomvalue() 
     84 
     85           Return a random value for the variable. 
     86 
     87           :rtype: :class:`Orange.data.Value` 
     88 
     89    .. method:: compute_value(inst) 
     90 
     91           Compute the value of the variable given the instance by calling 
     92           :obj:`~Variable.get_value_from` through a mechanism that
     93           prevents infinite recursive calls. 
     94 
     95           :rtype: :class:`Orange.data.Value` 
     96 
     97.. _discrete: 
     98.. class:: Discrete 
     99 
     100    Bases: :class:`Variable` 
     101 
     102    Descriptor for discrete variables. 
     103 
     104    .. attribute:: values 
     105 
     106        A list with symbolic names for variables' values. Values are stored as 
     107        indices referring to this list and modifying it instantly 
     108        changes the (symbolic) names of values as they are printed out or 
     109        referred to by user. 
     110 
     111        .. note:: 
     112 
     113            The size of the list is also used to indicate the number of 
     114            possible values for this variable. Changing the size - especially 
     115            shrinking the list - can crash Python. Also, do not add values 
     116            to the list by calling its append or extend method: use the
     117            :obj:`add_value` method instead.
     118 
     119            It is also assumed that this attribute is always defined (but can 
     120            be empty), so never set it to ``None``. 
     121 
     122    .. attribute:: base_value 
     123 
     124            Stores the base value for the variable as an index in `values`. 
     125            This can be, for instance, a "normal" value, such as "no 
     126            complications" as opposed to abnormal "low blood pressure". The 
     127            base value is used by certain statistics, continuization and, 
     128            potentially, learning algorithms. The default is -1, which means that 
     129            there is no base value. 
     130 
     131    .. method:: add_value(s) 
     132 
     133            Add a value with symbolic name ``s`` to values. Always call 
     134            this function instead of appending to ``values``. 
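Why renaming an entry in ``values`` changes how stored data is printed can be illustrated with a simplified model. The class ``DiscreteSketch`` and its ``name_of`` method are hypothetical and only mimic the index-based storage described above; in Orange itself, use :obj:`add_value` on a real descriptor.

```python
class DiscreteSketch(object):
    """Hypothetical model of a discrete descriptor; not Orange's class."""

    def __init__(self, values):
        self.values = list(values)

    def add_value(self, s):
        # Single entry point for growing the list of values.
        self.values.append(s)
        return len(self.values) - 1  # index of the new value

    def name_of(self, index):
        # Data holds indices; names are looked up only when needed.
        return self.values[index]


size = DiscreteSketch(["small", "big"])
stored = [0, 1, 1]            # three data values, stored as indices

size.values[1] = "large"      # rename the second symbolic value
names = [size.name_of(i) for i in stored]

new_index = size.add_value("huge")
```

Because the data keeps only indices, the rename is visible at once in ``names``, and the stored data itself is never touched.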
     135 
     136.. _continuous: 
     137.. class:: Continuous 
     138 
     139    Bases: :class:`Variable` 
     140 
     141    Descriptor for continuous variables. 
     142 
     143    .. attribute:: number_of_decimals 
     144 
     145        The number of decimals used when the value is printed out, converted to 
     146        a string or saved to a file. 
     147 
     148    .. attribute:: scientific_format 
     149 
     150        If ``True``, the value is printed in scientific format whenever it 
     151        would have more than 5 digits. In this case, :obj:`number_of_decimals` is 
     152        ignored. 
     153 
     154    .. attribute:: adjust_decimals 
     155 
     156        Tells Orange to monitor the number of decimals when the value is 
     157        converted from a string (when the values are read from a file or 
     158        converted by, e.g. ``inst[0]="3.14"``): 
     159 
     160        * 0: the number of decimals is not adjusted automatically; 
     161        * 1: the number of decimals is (and has already been) adjusted; 
     162        * 2: automatic adjustment is enabled, but no values have been 
     163          converted yet. 
     164 
     165        By default, adjustment of the number of decimals goes as follows: 
     166 
     167        * If the variable was constructed when data was read from a file, 
     168          it will be printed with the same number of decimals as the 
     169          largest number of decimals encountered in the file. If 
     170          scientific notation occurs in the file, 
     171          :obj:`scientific_format` will be set to ``True`` and scientific 
     172          format will be used for values too large or too small. 
     173 
     174        * If the variable is created in a script, it will have, 
     175          by default, three decimal places. This can be changed either by 
     176          setting the value from a string (e.g. ``inst[0]="3.14"``, 
     177          but not ``inst[0]=3.14``) or by manually setting the 
     178          :obj:`number_of_decimals`. 
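The adjustment described for values read from a file can be approximated in plain Python. This is a minimal sketch under stated assumptions: ``count_decimals`` and ``adjust`` are hypothetical helpers, and the way Orange detects scientific notation may differ from the simple test used here.

```python
def count_decimals(text):
    """Number of digits after the decimal point, or None for scientific notation."""
    text = text.strip()
    if "e" in text.lower():
        return None  # e.g. "1e-5": treated as scientific notation
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def adjust(strings):
    """Return (number_of_decimals, scientific_format) for a column of strings."""
    decimals = 0
    scientific = False
    for text in strings:
        n = count_decimals(text)
        if n is None:
            scientific = True
        else:
            decimals = max(decimals, n)  # keep the largest count seen
    return decimals, scientific
```

For instance, ``adjust(["3.14", "2.5", "10"])`` gives two decimals and no scientific format.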
     179 
     180    .. attribute:: start_value, end_value, step_value 
     181 
     182        The range used for :obj:`randomvalue`. 
     183 
     184.. _String: 
     185.. class:: String 
     186 
     187    Bases: :class:`Variable` 
     188 
     189    Descriptor for variables that contain strings. No method can use them for 
     190    learning; some will raise errors or warnings, and others will 
     191    silently ignore them. They can be, however, used as meta-attributes; if 
     192    instances in a dataset have unique IDs, the most efficient way to store them 
     193    is to read them as meta-attributes. In general, never use discrete 
     194    attributes with many (say, more than 50) values. Such attributes are 
     195    probably not of any use for learning and should be stored as string 
     196    attributes. 
     197 
     198    When converting strings into values and back, empty strings are treated 
     199    differently than usual. For other types, an empty string denotes 
     200    undefined values, while :obj:`String` will take empty strings 
     201    as empty strings -- except when loading or saving into file. 
     202    Empty strings in files are interpreted as undefined; to specify an empty 
     203    string, enclose the string in double quotes; these are removed when the 
     204    string is loaded. 
     205 
     206.. _Python: 
     207.. class:: Python 
     208 
     209    Bases: :class:`Variable` 
     210 
     211    Base class for descriptors defined in Python. It is fully functional 
     212    and can be used as a descriptor for attributes that contain arbitrary Python 
     213    values. Since this is an advanced topic, PythonVariables are described on a 
     214    separate page. !!TODO!! 
     215 
     216 
     217.. _attributes: 
     218 
     219Storing additional attributes 
     220----------------------------- 
     221 
     222All variables have a field :obj:`~Variable.attributes`, a dictionary 
     223that can store additional string data. 
     224 
     225.. literalinclude:: code/attributes.py 
     226 
     227These attributes can only be saved to a .tab file. They are listed in the 
     228third line in <name>=<value> format, after other attribute specifications 
     229(such as "meta" or "class"), and are separated by spaces. 
     230 
     231.. _variable_descriptor_reuse: 
     232 
     233Reuse of descriptors 
     234-------------------- 
     235 
     236There are situations when variable descriptors need to be reused. Typically, the 
     237user loads some training examples, trains a classifier, and then loads a separate 
     238test set. For the classifier to recognize the variables in the second data set, 
     239the descriptors, not just the names, need to be the same. 
     240 
     241When constructing new descriptors for data read from a file or during unpickling, 
     242Orange checks whether an appropriate descriptor (with the same name and, in case 
     243of discrete variables, also values) already exists and reuses it. When new 
     244descriptors are constructed by explicitly calling the above constructors, this 
     245always creates new descriptors and thus new variables, although a variable with 
     246the same name may already exist. 
     247 
     248The search for an existing variable is based on four attributes: the variable's name, 
     249type, ordered values, and unordered values. As for the latter two, the values can 
     250be explicitly ordered by the user, e.g. in the second line of the tab-delimited 
     251file. For instance, sizes can be ordered as small, medium, or big. 
     252 
     253The search for existing variables can end with one of the following statuses. 
     254 
     255.. data:: Orange.data.variable.MakeStatus.NotFound (4) 
     256 
     257    The variable with that name and type does not exist. 
     258 
     259.. data:: Orange.data.variable.MakeStatus.Incompatible (3) 
     260 
     261    There are variables with matching name and type, but their 
     262    values are incompatible with the prescribed ordered values. For example, 
     263    if the existing variable already has values ["a", "b"] and the new one 
     264    wants ["b", "a"], the old variable cannot be reused. The existing list can, 
     265    however, be extended with the new values, so searching for ["a", "b", "c"] would 
     266    succeed. Likewise, a search for ["a"] would be successful, since the extra existing value 
     267    does not matter. The formal rule is thus that the values are compatible iff ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``. 
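The formal rule can be tried out directly on plain lists. The function name ``compatible`` is chosen here for illustration only and is not part of Orange.

```python
def compatible(existing_values, ordered_values):
    """Prefix rule: the shorter list must be a prefix of the longer one."""
    return (existing_values[:len(ordered_values)]
            == ordered_values[:len(existing_values)])


swapped = compatible(["a", "b"], ["b", "a"])        # order differs
extended = compatible(["a", "b"], ["a", "b", "c"])  # new value appended
shorter = compatible(["a", "b"], ["a"])             # extra existing value
```

The three calls correspond to the three cases discussed above: only the first one is incompatible.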
     268 
     269.. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2) 
     270 
     271    There is a matching variable, yet it has none of the values that the new 
     272    variable will have (this is obviously possible only if the new variable has 
     273    no prescribed ordered values). For instance, we search for a variable 
     274    "sex" with values "male" and "female", while there is a variable of the same 
     275    name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this 
     276    variable is possible, though this should probably be a new variable since it 
     277    obviously comes from a different data set. If we do decide to reuse the variable, the 
     278    old variable will get some unneeded new values and the new one will inherit 
     279    some from the old. 
     280 
     281.. data:: Orange.data.variable.MakeStatus.MissingValues (1) 
     282 
     283    There is a matching variable with some of the values that the new one 
     284    requires, but some values are missing. This situation is neither uncommon 
     285    nor suspicious: in case of separate training and testing data sets there may 
     286    be values which occur in one set but not in the other. 
     287 
     288.. data:: Orange.data.variable.MakeStatus.OK (0) 
     289 
     290    There is a perfect match which contains all the prescribed values in the 
     291    correct order. The existing variable may have some extra values, though. 
     292 
     293Continuous variables can obviously have only two statuses, 
     294:obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`. 
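The five statuses can be summarized with a simplified model of how one existing discrete variable is compared with a request. This is a sketch, not Orange's implementation: ``match_status`` is a hypothetical function, the constants merely reuse the numeric codes listed above, and ``None`` stands for "no variable with that name and type".

```python
# Numeric codes as listed above: OK (0) ... NotFound (4).
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def match_status(existing_values, ordered_values=None, unordered_values=None):
    """Classify one existing variable against the requested values."""
    ordered = list(ordered_values or [])
    unordered = list(unordered_values or [])
    if existing_values is None:
        return NOT_FOUND
    # Prefix rule for the prescribed order of values.
    if existing_values[:len(ordered)] != ordered[:len(existing_values)]:
        return INCOMPATIBLE
    requested = ordered + unordered
    recognized = [v for v in requested if v in existing_values]
    if requested and not recognized:
        return NO_RECOGNIZED_VALUES
    if len(recognized) < len(requested):
        return MISSING_VALUES
    return OK
```

With an existing variable whose values are ``["a", "b"]``, a request for ordered ``["a"]`` and unordered ``["c"]`` is classified as ``MISSING_VALUES``, whereas ordered ``["b"]`` is ``INCOMPATIBLE``.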
     295 
     296When loading the data using :obj:`Orange.data.Table`, Orange takes the safest 
     297approach and, by default, reuses everything that is compatible up to 
     298and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. Unintended reuse would be obvious from the 
     299variable having too many values, which the user can notice and fix. More on that 
     300in the page on :doc:`Orange.data.formats`. 
     301 
     302There are two functions for reusing the variables instead of creating new ones. 
     303 
     304.. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on]) 
     305 
     306    Find and return an existing variable or create a new one if none of the existing 
     307    variables matches the given name, type and values. 
     308 
     309    The optional `create_new_on` specifies the status at which a new variable is 
     310    created. The status must be at most :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or 
     311    non-existing) variables cannot be reused. If it is set lower, for instance 
     312    to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is created even if there exists 
     313    a variable which is only missing some values. If set to :obj:`~Orange.data.variable.MakeStatus.OK`, the function 
     314    always creates a new variable. 
     315 
     316    The function returns a tuple containing a variable descriptor and the 
     317    status of the best matching variable. So, if ``create_new_on`` is set to 
     318    :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a variable whose status is, say, 
     319    :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a variable would be created, while the second 
     320    element of the tuple would contain :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other 
     321    hand, there exists a variable which is perfectly OK, its descriptor is 
     322    returned and the returned status is :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no 
     323    indicator whether the returned variable is reused or not. This can be, 
     324    however, read from the status code: if it is smaller than the specified 
     325    ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed. 
     326 
     327    The exception to the rule is when ``create_new_on`` is OK. In this case, the 
     328    function does not search through the existing variables and cannot know the 
     329    status, so the returned status in this case is always :obj:`~Orange.data.variable.MakeStatus.OK`. 
     330 
     331    :param name: Variable name 
     332    :param type: Variable type 
     333    :type type: Orange.data.variable.Type 
     334    :param ordered_values: a list of ordered values 
     335    :param unordered_values: a list of values, for which the order does not 
     336        matter 
     337    :param create_new_on: gives the condition for constructing a new variable instead 
     338        of reusing an existing one 
     339 
     340    :rtype: a tuple (:class:`~Orange.data.variable.Variable`, int) 
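Since ``make`` does not say whether the returned descriptor was reused, the rule described above can be wrapped in a small helper. This is a sketch only: ``is_reused`` is a hypothetical name, the constants reuse the numeric status codes, and the default of ``INCOMPATIBLE`` for ``create_new_on`` is an assumption inferred from the examples in this section.

```python
# Numeric codes as listed for MakeStatus: OK (0) ... NotFound (4).
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def is_reused(returned_status, create_new_on=INCOMPATIBLE):
    """Tell whether a make-like call reused an existing descriptor."""
    if create_new_on == OK:
        # No search is performed; a new variable is always created.
        return False
    # Reuse happens only for statuses strictly below the threshold.
    return returned_status < create_new_on
```

For example, a returned status of ``NO_RECOGNIZED_VALUES`` means reuse under the assumed default threshold, but a new variable when ``create_new_on`` is ``NO_RECOGNIZED_VALUES``.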
     341 
     342.. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on]) 
     343 
     344    Find and return an existing variable, or :obj:`None` if no match is found. 
     345 
     346    :param name: variable name. 
     347    :param type: variable type. 
     348    :type type: Orange.data.variable.Type 
     349    :param ordered_values: a list of ordered values 
     350    :param unordered_values: a list of values, for which the order does not 
     351        matter 
     352    :param create_new_on: gives the status at which an existing variable is no 
     353        longer treated as a match 
     354 
     355    :rtype: :class:`~Orange.data.variable.Variable` 
     356 
     357The following examples give the shown results if 
     358executed only once (in a Python session) and in this order. 
     359 
     360:func:`Orange.data.variable.make` can be used for the construction of new variables. :: 
     361 
     362    >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"]) 
     363    >>> print s, v1.values 
     364    NotFound <a, b> 
     365 
     366A new variable was created and the status is 
     367:obj:`~Orange.data.variable.MakeStatus.NotFound`. :: 
     368 
     369    >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"]) 
     370    >>> print s, v2 is v1, v1.values 
     371    MissingValues True <a, b, c> 
     372 
     373The status is :obj:`~Orange.data.variable.MakeStatus.MissingValues`, 
     374yet the variable is reused (``v2 is v1``). ``v1`` gets a new value, 
     375``"c"``, which was given as an unordered value. It does 
     376not matter that the new variable does not need the value ``b``. :: 
     377 
     378    >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"]) 
     379    >>> print s, v3 is v1, v1.values 
     380    MissingValues True <a, b, c, d> 
     381 
     382This is like before, except that the new value, ``d``, is given among the 
     383ordered values. :: 
     384 
     385    >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"]) 
     386    >>> print s, v4 is v1, v1.values, v4.values 
     387    Incompatible False <a, b, c, d> <b> 
     388 
     389The new variable needs to have ``b`` as the first value, so it is incompatible 
     390with the existing variables. The status is 
     391:obj:`~Orange.data.variable.MakeStatus.Incompatible` and 
     392a new variable is created; the two variables are not equal and have 
     393different lists of values. :: 
     394 
     395    >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"]) 
     396    >>> print s, v5 is v1, v1.values, v5.values 
     397    OK True <a, b, c, d> <a, b, c, d> 
     398 
     399The new variable has values ``c`` and ``a``, but the order is not important, 
     400so the existing attribute is :obj:`~Orange.data.variable.MakeStatus.OK`. :: 
     401 
     402    >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"]) 
     403    >>> print s, v6 is v1, v1.values, v6.values 
     404    NoRecognizedValues True <a, b, c, d, e> <a, b, c, d, e> 
     405 
     406The new variable has different values than the existing variable (status 
     407is :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`), 
     408but the existing one is nonetheless reused. Note that we 
     409gave ``e`` in the list of unordered values. If it was among the ordered, the 
     410reuse would fail. :: 
     411 
     412    >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, 
     413            ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues) 
     414    >>> print s, v7 is v1, v1.values, v7.values 
     415    NoRecognizedValues False <a, b, c, d, e> <f> 
     416 
     417This is the same as before, except that we prohibited reuse when there are no 
     418recognized values. Hence a new variable is created, though the returned status is 
     419the same as before:: 
     420 
     421    >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, 
     422            ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK) 
     423    >>> print s, v8 is v1, v1.values, v8.values 
     424    OK False <a, b, c, d, e> <a, b, c, d, e> 
     425 
     426Finally, this is a perfect match, but any reuse is prohibited, so a new 
     427variable is created. 
     428 
     429 
     430 
     431Variables computed from other variables 
     432--------------------------------------- 
     433 
     434Values of variables are often computed from other variables, such as in 
     435discretization. The mechanism described below usually functions behind the scenes, 
     436so understanding it is required only for implementing specific transformations. 
     437 
     438Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``. 
     439It can help the learning algorithm if the four-valued attribute ``e`` is 
     440replaced with a binary attribute having values `"1"` and `"not 1"`. The 
     441new variable will be computed from the old one on the fly. 
     442 
     443.. literalinclude:: code/variable-get_value_from.py 
     444    :lines: 7-17 
     445 
     446The new variable is named ``e2``; we define it with a descriptor of type 
     447:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``"1"`` (we 
     448chose this order so that the ``not 1``'s index is ``0``, which can be, if 
     449needed, interpreted as ``False``). Finally, we tell ``e2`` to use 
     450``checkE`` to compute its value when needed, by assigning ``checkE`` to 
     451``e2.get_value_from``. 
     452 
     453``checkE`` is a function that is passed an instance and another argument we 
     454do not care about here. If the instance's ``e`` equals ``1``, the function 
     455returns value ``1``, otherwise it returns ``not 1``. Both are returned as 
     456values, not plain strings. 
     457 
     458In most circumstances the value of ``e2`` can be computed on the fly - we can 
     459pretend that the variable exists in the data, although it does not (its values 
     460can be computed from the data). For instance, we can compute the information gain of 
     461variable ``e2`` or its distribution without actually constructing data containing 
     462the new variable. 
     463 
     464.. literalinclude:: code/variable-get_value_from.py 
     465    :lines: 19-22 
     466 
     467There are methods which cannot compute values on the fly because it would be 
     468too complex or time consuming. In such cases, the data need to be converted 
     469to a new :obj:`Orange.data.Table`:: 
     470 
     471    new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var]) 
     472    new_data = Orange.data.Table(new_domain, data) 
     473 
     474Automatic computation is useful when the data is split into training and 
     475testing examples. Training instances can be modified by adding, removing 
     476and transforming variables (in a typical setup, continuous variables 
     477are discretized prior to learning, therefore the original variables are 
     478replaced by new ones). Test instances, on the other hand, are left as they 
     479are. When they are classified, the classifier automatically converts the 
     480testing instances into the new domain, which includes recomputation of 
     481transformed variables. 
     482 
     483.. literalinclude:: code/variable-get_value_from.py 
     484    :lines: 24- 