02/06/12 15:18:09 (2 years ago)
janezd <janez.demsar@…>

Polished documentation about Orange.data.variable

1 edited


  • docs/reference/rst/Orange.data.variable.rst

    r9372 r9727  
    11.. automodule:: Orange.data.variable 
     4Variables (``variable``) 
     7Data instances in Orange can contain several types of variables: 
     8:ref:`discrete <discrete>`, :ref:`continuous <continuous>`, 
     9:ref:`strings <string>`, and :ref:`Python <Python>`, together with types derived from it. 
     10The latter represent arbitrary Python objects. 
     11The names, types, values (where applicable), functions for computing the 
     12variable value from values of other variables, and other properties of the 
     13variables are stored in descriptor classes derived from :obj:`Orange.data 
     16Orange considers two variables (e.g. in two different data tables) the 
     17same if they have the same descriptor. It is allowed - but not 
     18recommended - to have different variables with the same name. 
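The identity rule above can be illustrated with a minimal sketch in plain Python. It does not use Orange; ``Descriptor`` and ``same_variable`` are hypothetical names introduced only for this illustration.

```python
class Descriptor(object):
    """Hypothetical stand-in for a variable descriptor (not Orange's class)."""

    def __init__(self, name):
        self.name = name


def same_variable(var1, var2):
    # Two variables are the same only if they share one descriptor object;
    # having equal names is not enough.
    return var1 is var2


age_train = Descriptor("age")
age_other = Descriptor("age")   # same name, but a separate descriptor
age_alias = age_train           # one descriptor shared by two tables

print(same_variable(age_train, age_alias))  # True
print(same_variable(age_train, age_other))  # False
```

The second comparison is the allowed but discouraged case: two different variables that happen to share a name.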
     20Variable descriptors 
     23Variable descriptors can be constructed either by calling the 
     24corresponding constructors or by a factory function :func:`Orange.data 
     25.variable.make`, which either retrieves an existing descriptor or 
     26constructs a new one. 
     28.. class:: Variable 
     30    An abstract base class for variable descriptors. 
     32    .. attribute:: name 
     34        The name of the variable. 
     36    .. attribute:: var_type 
     38        Variable type; it can be :obj:`~Orange.data.Type.Discrete`, 
     39        :obj:`~Orange.data.Type.Continuous`, 
     40        :obj:`~Orange.data.Type.String` or :obj:`~Orange.data.Type.Other`. 
     42    .. attribute:: get_value_from 
     44        A function (an instance of :obj:`~Orange.classification.Classifier`) 
     45        that computes a value of the variable from values of one or more 
     46        other variables. This is used, for instance, in discretization, 
     47        which computes the value of a discretized variable from the 
     48        original continuous variable. 
     50    .. attribute:: ordered 
     52        A flag telling whether the values of a discrete variable are ordered. At 
     53        the moment, no built-in method treats ordinal variables differently than 
     54        nominal ones. 
     56    .. attribute:: random_generator 
     58        A local random number generator used by method 
     59        :obj:`~Variable.randomvalue()`. 
     61    .. attribute:: default_meta_id 
     63        A proposed (but not guaranteed) meta id to be used for that variable. 
     64        For instance, when a tab-delimited file contains meta attributes and 
     65        the existing variables are reused, they will have this id 
     66        (instead of a new one assigned by :obj:`Orange.data.new_meta_id()`). 
     68    .. attribute:: attributes 
     70        A dictionary which allows the user to store additional information 
     71        about the variable. All values should be strings. See the section 
     72        about :ref:`storing additional information <attributes>`. 
     74    .. method:: __call__(obj) 
     76           Convert a string, number, or other suitable object into a variable 
     77           value. 
     79           :param obj: An object to be converted into a variable value 
     80           :type obj: any suitable object 
     81           :rtype: :class:`Orange.data.Value` 
     83    .. method:: randomvalue() 
     85           Return a random value for the variable. 
     87           :rtype: :class:`Orange.data.Value` 
     89    .. method:: compute_value(inst) 
     91           Compute the value of the variable given the instance by calling 
     92           :obj:`~Variable.get_value_from` through a mechanism that 
     93           prevents infinite recursive calls. 
     95           :rtype: :class:`Orange.data.Value` 
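The text above does not say how ``compute_value`` prevents infinite recursion. One common way to build such a guard is a flag that is set while a computation is in progress. The sketch below shows that idea in plain Python; ``ToyVariable`` and ``_computing`` are hypothetical, and this is not necessarily how Orange implements it.

```python
class ToyVariable(object):
    """Hypothetical descriptor that only models the recursion guard."""

    def __init__(self, name, get_value_from=None):
        self.name = name
        self.get_value_from = get_value_from
        self._computing = False  # True while a computation is in progress

    def compute_value(self, inst):
        if self.get_value_from is None:
            raise ValueError("%s has no get_value_from" % self.name)
        if self._computing:
            # We were called again while already computing this variable.
            raise RuntimeError("circular computation of %s" % self.name)
        self._computing = True
        try:
            return self.get_value_from(inst)
        finally:
            self._computing = False


double = ToyVariable("double", lambda inst: 2 * inst["x"])
print(double.compute_value({"x": 21}))  # 42

# A variable whose computation refers back to itself is detected.
loop = ToyVariable("loop")
loop.get_value_from = lambda inst: loop.compute_value(inst)
try:
    loop.compute_value({})
except RuntimeError as err:
    print(err)  # circular computation of loop
```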
     97.. _discrete: 
     98.. class:: Discrete 
     100    Bases: :class:`Variable` 
     102    Descriptor for discrete variables. 
     104    .. attribute:: values 
     106        A list with symbolic names for the variable's values. Values are stored as 
     107        indices referring to this list, and modifying it instantly 
     108        changes the (symbolic) names of values as they are printed out or 
     109        referred to by the user. 
     111        .. note:: 
     113            The size of the list is also used to indicate the number of 
     114            possible values for this variable. Changing the size - especially 
     115            shrinking the list - can crash Python. Also, do not add values 
     116            to the list by calling its append or extend method: use the 
     117            :obj:`add_value` method instead. 
     119            It is also assumed that this attribute is always defined (but can 
     120            be empty), so never set it to ``None``. 
     122    .. attribute:: base_value 
     124            Stores the base value for the variable as an index in `values`. 
     125            This can be, for instance, a "normal" value, such as "no 
     126            complications" as opposed to abnormal "low blood pressure". The 
     127            base value is used by certain statistics, continuization and, 
     128            potentially, learning algorithms. The default is -1, which means that 
     129            there is no base value. 
     131    .. method:: add_value(s) 
     133            Add a value with symbolic name ``s`` to values. Always call 
     134            this function instead of appending to ``values``. 
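The index-based storage described for ``values`` explains why renaming an entry changes how existing data is printed. A minimal sketch in plain Python, with ordinary lists standing in for the descriptor and the data (all names here are illustrative assumptions):

```python
values = ["low", "high"]   # symbolic names, playing the role of Discrete.values
stored = [0, 1, 1, 0]      # the data holds indices into ``values``


def names(stored_indices, value_names):
    """Translate stored indices back into symbolic names."""
    return [value_names[i] for i in stored_indices]


before = names(stored, values)
values[1] = "elevated"     # rename a value in place; the stored data is untouched
after = names(stored, values)

print(before)  # ['low', 'high', 'high', 'low']
print(after)   # ['low', 'elevated', 'elevated', 'low']
```

The same picture shows why shrinking the list is dangerous: a stored index could then point past its end.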
     136.. _continuous: 
     137.. class:: Continuous 
     139    Bases: :class:`Variable` 
     141    Descriptor for continuous variables. 
     143    .. attribute:: number_of_decimals 
     145        The number of decimals used when the value is printed out, converted to 
     146        a string or saved to a file. 
     148    .. attribute:: scientific_format 
     150        If ``True``, the value is printed in scientific format whenever it 
     151        would have more than 5 digits. In this case, :obj:`number_of_decimals` is 
     152        ignored. 
     154    .. attribute:: adjust_decimals 
     156        Tells Orange to monitor the number of decimals when the value is 
     157        converted from a string (when the values are read from a file or 
     158        converted by, e.g. ``inst[0]="3.14"``): 
     160        * 0: the number of decimals is not adjusted automatically; 
     161        * 1: automatic adjustment is enabled and the number of decimals has already been adjusted; 
     162        * 2: automatic adjustment is enabled, but no values have been 
     163          converted yet. 
     165        By default, adjustment of the number of decimals goes as follows: 
     167        * If the variable was constructed when data was read from a file, 
     168          it will be printed with the same number of decimals as the 
     169          largest number of decimals encountered in the file. If 
     170          scientific notation occurs in the file, 
     171          :obj:`scientific_format` will be set to ``True`` and scientific 
     172          format will be used for values too large or too small. 
     174        * If the variable is created in a script, it will have, 
     175          by default, three decimal places. This can be changed either by 
     176          setting the value from a string (e.g. ``inst[0]="3.14"``, 
     177          but not ``inst[0]=3.14``) or by manually setting the 
     178          :obj:`number_of_decimals`. 
     180    .. attribute:: start_value, end_value, step_value 
     182        The range used for :obj:`randomvalue`. 
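The three states of ``adjust_decimals`` can be read as a small state machine. The sketch below is one possible reading in plain Python, not Orange's code: ``DecimalTracker`` and ``decimals_in`` are hypothetical, scientific notation is not handled, and the rule that the first converted string sets the count and later strings can only raise it is an assumption.

```python
def decimals_in(text):
    """Number of digits after the decimal point in a plain decimal string."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)


class DecimalTracker(object):
    """Hypothetical model of number_of_decimals and adjust_decimals."""

    def __init__(self):
        self.number_of_decimals = 3  # default for variables created in a script
        self.adjust_decimals = 2     # enabled, but nothing converted yet

    def convert(self, text):
        value = float(text)
        if self.adjust_decimals == 2:
            # First string seen: take its number of decimals (assumption).
            self.number_of_decimals = decimals_in(text)
            self.adjust_decimals = 1
        elif self.adjust_decimals == 1:
            # Later strings: keep the largest number of decimals seen so far.
            self.number_of_decimals = max(self.number_of_decimals,
                                          decimals_in(text))
        # adjust_decimals == 0: never adjust automatically.
        return value

    def format(self, value):
        return "%.*f" % (self.number_of_decimals, value)


tracker = DecimalTracker()
print(tracker.format(3.14159))  # 3.142
tracker.convert("3.14")
tracker.convert("1.2345")
print(tracker.format(2.5))      # 2.5000
```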
     184.. _String: 
     185.. class:: String 
     187    Bases: :class:`Variable` 
     189    Descriptor for variables that contain strings. No method can use them for 
     190    learning; some will raise errors or warnings, and others will 
     191    silently ignore them. They can be, however, used as meta-attributes; if 
     192    instances in a dataset have unique IDs, the most efficient way to store them 
     193    is to read them as meta-attributes. In general, never use discrete 
     194    attributes with many (say, more than 50) values. Such attributes are 
     195    probably not of any use for learning and should be stored as string 
     196    attributes. 
     198    When converting strings into values and back, empty strings are treated 
     199    differently than usual. For other types, an empty string denotes 
     200    undefined values, while :obj:`String` will take empty strings 
     201    as empty strings -- except when loading or saving into file. 
     202    Empty strings in files are interpreted as undefined; to specify an empty 
     203    string, enclose the string in double quotes; these are removed when the 
     204    string is loaded. 
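The file convention for empty strings can be sketched as a pair of helper functions. This is plain Python with hypothetical names; treating ``None`` as the undefined value, stripping quotes from any quoted cell, and the saving rule are assumptions made for the illustration.

```python
UNDEFINED = None  # stand-in for an undefined value


def load_string_cell(cell):
    """Interpret one cell of a file as the value of a string variable."""
    if cell == "":
        return UNDEFINED          # an empty cell means "undefined"
    if len(cell) >= 2 and cell.startswith('"') and cell.endswith('"'):
        return cell[1:-1]         # the enclosing quotes are removed on load
    return cell


def save_string_cell(value):
    """Inverse of load_string_cell for the cases discussed in the text."""
    if value is UNDEFINED:
        return ""
    if value == "":
        return '""'               # quote the empty string so it survives
    return value


print(repr(load_string_cell("")))      # None
print(repr(load_string_cell('""')))    # ''
print(repr(load_string_cell("abc")))   # 'abc'
```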
     206.. _Python: 
     207.. class:: Python 
     209    Bases: :class:`Variable` 
     211    Base class for descriptors defined in Python. It is fully functional 
     212    and can be used as a descriptor for attributes that contain arbitrary Python 
     213    values. Since this is an advanced topic, PythonVariables are described on a 
     214    separate page. !!TODO!! 
     217.. _attributes: 
     219Storing additional attributes 
     222All variables have a field :obj:`~Variable.attributes`, a dictionary 
     223that can store additional string data. 
     225.. literalinclude:: code/attributes.py 
     227These attributes can only be saved to a .tab file. They are listed in the 
     228third line in <name>=<value> format, after other attribute specifications 
     229(such as "meta" or "class"), and are separated by spaces. 
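As an illustration of the layout described above, the following plain-Python sketch assembles a third-line cell from the usual specifications and a dictionary of additional attributes. ``third_line_cell`` is a hypothetical helper, and sorting the attributes by name is a choice made here so the output is reproducible; the text above does not prescribe an order.

```python
def third_line_cell(flags, attributes):
    """Join specifications such as "meta" with <name>=<value> pairs."""
    parts = list(flags)
    parts += ["%s=%s" % (name, attributes[name]) for name in sorted(attributes)]
    return " ".join(parts)


print(third_line_cell(["meta"], {"origin": "survey", "unit": "cm"}))
# meta origin=survey unit=cm
```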
     231.. _variable_descriptor_reuse: 
     233Reuse of descriptors 
     236There are situations when variable descriptors need to be reused. Typically, the 
     237user loads some training examples, trains a classifier, and then loads a separate 
     238test set. For the classifier to recognize the variables in the second data set, 
     239the descriptors, not just the names, need to be the same. 
     241When constructing new descriptors for data read from a file or during unpickling, 
     242Orange checks whether an appropriate descriptor (with the same name and, in case 
     243of discrete variables, also values) already exists and reuses it. When new 
     244descriptors are constructed by explicitly calling the above constructors, this 
     245always creates new descriptors and thus new variables, although a variable with 
     246the same name may already exist. 
     248The search for an existing variable is based on four attributes: the variable's name, 
     249type, ordered values, and unordered values. As for the latter two, the values can 
     250be explicitly ordered by the user, e.g. in the second line of the tab-delimited 
     251file. For instance, sizes can be ordered as small, medium, or big. 
     253The search for existing variables can end with one of the following statuses. 
     255.. data:: Orange.data.variable.MakeStatus.NotFound (4) 
     257    No variable with that name and type exists. 
     259.. data:: Orange.data.variable.MakeStatus.Incompatible (3) 
     261    There are variables with matching name and type, but their 
     262    values are incompatible with the prescribed ordered values. For example, 
     263    if the existing variable already has values ["a", "b"] and the new one 
     264    wants ["b", "a"], the old variable cannot be reused. The existing list can, 
     265    however, be extended with new values, so searching for ["a", "b", "c"] would 
     266    succeed. Likewise a search for ["a"] would be successful, since the extra existing value 
     267    does not matter. The formal rule is thus that the values are compatible iff ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``. 
     269.. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2) 
     271    There is a matching variable, yet it has none of the values that the new 
     272    variable will have (this is obviously possible only if the new variable has 
     273    no prescribed ordered values). For instance, we search for a variable 
     274    "sex" with values "male" and "female", while there is a variable of the same 
     275    name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this 
     276    variable is possible, though this should probably be a new variable since it 
     277    obviously comes from a different data set. If we do decide to reuse the variable, the 
     278    old variable will get some unneeded new values and the new one will inherit 
     279    some from the old. 
     281.. data:: Orange.data.variable.MakeStatus.MissingValues (1) 
     283    There is a matching variable with some of the values that the new one 
     284    requires, but some values are missing. This situation is neither uncommon 
     285    nor suspicious: in case of separate training and testing data sets there may 
     286    be values which occur in one set but not in the other. 
     288.. data:: Orange.data.variable.MakeStatus.OK (0) 
     290    There is a perfect match which contains all the prescribed values in the 
     291    correct order. The existing variable may have some extra values, though. 
     293Continuous variables can obviously have only two statuses, 
     294:obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`. 
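The statuses can be summarized as a classification of one existing discrete variable against the requested values. The sketch below is one reading of the descriptions above in plain Python, not Orange's implementation; ``match_status`` and the constant names are hypothetical, and ``NOT_FOUND`` is never returned because the function looks at a single existing variable only.

```python
# Status codes in the order given above.
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def match_status(existing_values, ordered_values=None, unordered_values=None):
    """Classify how well an existing list of values fits a request."""
    ordered = list(ordered_values or [])
    unordered = list(unordered_values or [])
    # The prescribed order must agree with the existing order.
    if existing_values[:len(ordered)] != ordered[:len(existing_values)]:
        return INCOMPATIBLE
    requested = ordered + unordered
    recognized = [v for v in requested if v in existing_values]
    if len(recognized) == len(requested):
        return OK
    if not recognized:
        return NO_RECOGNIZED_VALUES
    return MISSING_VALUES


print(match_status(["a", "b"], ["a"], ["c"]))                # 1 (MissingValues)
print(match_status(["a", "b", "c", "d"], ["b"]))             # 3 (Incompatible)
print(match_status(["a", "b", "c", "d"], None, ["c", "a"]))  # 0 (OK)
print(match_status(["a", "b", "c", "d"], None, ["e"]))       # 2 (NoRecognizedValues)
```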
     296When loading the data using :obj:`Orange.data.Table`, Orange takes the safest 
     297approach and, by default, reuses everything that is compatible up to 
     298and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. Unintended reuse would be obvious from the 
     299variable having too many values, which the user can notice and fix. More on that 
     300in the page on :doc:`Orange.data.formats`. 
     302There are two functions for reusing the variables instead of creating new ones. 
     304.. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on]) 
     306    Find and return an existing variable or create a new one if none of the existing 
     307    variables matches the given name, type and values. 
     309    The optional `create_new_on` specifies the status at which a new variable is 
     310    created. The status must be at most :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or 
     311    non-existing) variables cannot be reused. If it is set lower, for instance 
     312    to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is created even if there exists 
     313    a variable that is only missing some of the values. If set to :obj:`~Orange.data.variable.MakeStatus.OK`, the function 
     314    always creates a new variable. 
     316    The function returns a tuple containing a variable descriptor and the 
     317    status of the best matching variable. So, if ``create_new_on`` is set to 
     318    :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a variable whose status is, say, 
     319    :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a variable would be created, while the second 
     320    element of the tuple would contain :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other 
     321    hand, there exists a variable which is perfectly OK, its descriptor is 
     322    returned and the returned status is :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no 
     323    indicator whether the returned variable is reused or not. This can be, 
     324    however, read from the status code: if it is smaller than the specified 
     325    ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed. 
     327    The exception to the rule is when ``create_new_on`` is OK. In this case, the 
     328    function does not search through the existing variables and cannot know the 
     329    status, so the returned status in this case is always :obj:`~Orange.data.variable.MakeStatus.OK`. 
     331    :param name: Variable name 
     332    :param type: Variable type 
     333    :type type: Orange.data.variable.Type 
     334    :param ordered_values: a list of ordered values 
     335    :param unordered_values: a list of values, for which the order does not 
     336        matter 
     337    :param create_new_on: gives the condition for constructing a new variable instead 
     338        of using the existing one 
     340    :rtype: a tuple (:class:`~Orange.data.variable.Variable`, int) 
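Since ``make`` does not say whether the returned descriptor was reused, the caller has to derive this from the status. A minimal sketch of that rule in plain Python follows; ``is_reused`` and the constant names are hypothetical, and the default of ``INCOMPATIBLE`` is an assumption based on the examples on this page, which show reuse up to and including NoRecognizedValues when the argument is omitted.

```python
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def is_reused(status, create_new_on=INCOMPATIBLE):
    """Tell whether make() reused a descriptor, given the returned status."""
    if create_new_on == OK:
        # Exception to the rule: nothing is searched, a new variable is
        # always created, and the returned status is always OK.
        return False
    return status < create_new_on


print(is_reused(NO_RECOGNIZED_VALUES))                        # True
print(is_reused(NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES))  # False
print(is_reused(OK, OK))                                      # False
```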
     342.. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on]) 
     344    Find and return an existing variable, or :obj:`None` if no match is found. 
     346    :param name: variable name. 
     347    :param type: variable type. 
     348    :type type: Orange.data.variable.Type 
     349    :param ordered_values: a list of ordered values 
     350    :param unordered_values: a list of values, for which the order does not 
     351        matter 
     352    :param create_new_on: gives the condition for constructing a new variable instead 
     353        of using the existing one 
     355    :rtype: :class:`~Orange.data.variable.Variable` 
     357The following examples give the shown results if 
     358executed only once (in a Python session) and in this order. 
     360:func:`Orange.data.variable.make` can be used for the construction of new variables. :: 
     362    >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"]) 
     363    >>> print s, v1.values 
     364    NotFound <a, b> 
     366A new variable was created and the status is :obj:`~Orange.data.variable 
     367.MakeStatus.NotFound`. :: 
     369    >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"]) 
     370    >>> print s, v2 is v1, v1.values 
     371    MissingValues True <a, b, c> 
     373The status is :obj:`~Orange.data.variable.MakeStatus.MissingValues`, 
     374yet the variable is reused (``v2 is v1``). ``v1`` gets a new value, 
     375``"c"``, which was given as an unordered value. It does 
     376not matter that the new variable does not need the value ``b``. :: 
     378    >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"]) 
     379    >>> print s, v3 is v1, v1.values 
     380    MissingValues True <a, b, c, d> 
     382This is like before, except that the new value, ``d``, is given among the 
     383ordered values. :: 
     385    >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"]) 
     386    >>> print s, v4 is v1, v1.values, v4.values 
     387    Incompatible False <a, b, c, d> <b> 
     389The new variable needs to have ``b`` as the first value, so it is incompatible 
     390with the existing variables. The status is 
     391:obj:`~Orange.data.variable.MakeStatus.Incompatible` and 
     392a new variable is created; the two variables are not equal and have 
     393different lists of values. :: 
     395    >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"]) 
     396    >>> print s, v5 is v1, v1.values, v5.values 
     397    OK True <a, b, c, d> <a, b, c, d> 
     399The new variable has values ``c`` and ``a``, but the order is not important, 
     400so the existing attribute is :obj:`~Orange.data.variable.MakeStatus.OK`. :: 
     402    >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"]) 
     403    >>> print s, v6 is v1, v1.values, v6.values 
     404    NoRecognizedValues True <a, b, c, d, e> <a, b, c, d, e> 
     406The new variable has different values than the existing variable (status 
     407is :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`), 
     408but the existing one is nonetheless reused. Note that we 
     409gave ``e`` in the list of unordered values. If it were among the ordered values, 
     410reuse would fail. :: 
     412    >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, 
     413            ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues) 
     414    >>> print s, v7 is v1, v1.values, v7.values 
     415    Incompatible False <a, b, c, d, e> <f> 
     417This is the same as before, except that we prohibited reuse when there are no 
     418recognized values. Hence a new variable is created, though the returned status is 
     419the same as before:: 
     421    >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, 
     422            ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK) 
     423    >>> print s, v8 is v1, v1.values, v8.values 
     424    OK False <a, b, c, d, e> <a, b, c, d, e> 
     426Finally, this is a perfect match, but any reuse is prohibited, so a new 
     427variable is created. 
     431Variables computed from other variables 
     434Values of variables are often computed from other variables, such as in 
     435discretization. The mechanism described below usually functions behind the scenes, 
     436so understanding it is required only for implementing specific transformations. 
     438Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``. 
     439It can help the learning algorithm if the four-valued attribute ``e`` is 
     440replaced with a binary attribute having values `"1"` and `"not 1"`. The 
     441new variable will be computed from the old one on the fly. 
     443.. literalinclude:: code/variable-get_value_from.py 
     444    :lines: 7-17 
     446The new variable is named ``e2``; we define it with a descriptor of type 
     447:obj:`Discrete`, with an appropriate name and values ``"not 1"`` and ``"1"`` (we 
     448chose this order so that the index of ``"not 1"`` is ``0``, which can be, if 
     449needed, interpreted as ``False``). Finally, we tell ``e2`` to use 
     450``checkE`` to compute its value when needed, by assigning ``checkE`` to 
     453``checkE`` is a function that is passed an instance and another argument we 
     454do not care about here. If the instance's ``e`` equals ``1``, the function 
     455returns value ``1``, otherwise it returns ``not 1``. Both are returned as 
     456values, not plain strings. 
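The included script is not reproduced on this page, so the following is only a sketch of the same idea in plain Python, without Orange. An instance is modelled as a dictionary and a value as an index into the list of symbolic names; ``check_e`` and ``E2_VALUES`` are hypothetical stand-ins for ``checkE`` and the values of ``e2``.

```python
E2_VALUES = ["not 1", "1"]   # index 0 can be read as False, index 1 as True


def check_e(inst, return_what=None):
    """Compute the value of e2 from the instance's e.

    The second argument is accepted but ignored, as in the description above.
    The result is an index into E2_VALUES rather than a plain string.
    """
    name = "1" if inst["e"] == "1" else "not 1"
    return E2_VALUES.index(name)


print(E2_VALUES[check_e({"a": "1", "b": "2", "e": "1"})])  # 1
print(E2_VALUES[check_e({"a": "1", "b": "2", "e": "3"})])  # not 1
```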
     458In most circumstances the value of ``e2`` can be computed on the fly - we can 
     459pretend that the variable exists in the data, although it does not (but 
     460can be computed from it). For instance, we can compute the information gain of 
     461variable ``e2`` or its distribution without actually constructing data containing 
     462the new variable. 
     464.. literalinclude:: code/variable-get_value_from.py 
     465    :lines: 19-22 
     467There are methods which cannot compute values on the fly because it would be 
     468too complex or time consuming. In such cases, the data need to be converted 
     469to a new :obj:`Orange.data.Table`:: 
     471    new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var]) 
     472    new_data = Orange.data.Table(new_domain, data) 
     474Automatic computation is useful when the data is split into training and 
     475testing examples. Training instances can be modified by adding, removing 
     476and transforming variables (in a typical setup, continuous variables 
     477are discretized prior to learning, therefore the original variables are 
     478replaced by new ones). Test instances, on the other hand, are left as they 
     479are. When they are classified, the classifier automatically converts the 
     480testing instances into the new domain, which includes recomputation of 
     481transformed variables. 
     483.. literalinclude:: code/variable-get_value_from.py 
     484    :lines: 24- 
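The conversion of a test instance into the new domain can be pictured with a small plain-Python model. It does not use Orange: instances are dictionaries, a domain is a list of ``(name, compute)`` pairs, and ``convert`` and ``e2_from_e`` are hypothetical names introduced for this sketch.

```python
def e2_from_e(inst):
    """Derived variable: "1" if e equals "1", otherwise "not 1"."""
    return "1" if inst["e"] == "1" else "not 1"


# Variables copied as they are have no compute function.
new_domain = [("a", None), ("b", None), ("e2", e2_from_e), ("y", None)]


def convert(inst, domain):
    """Build an instance in the new domain, computing derived values."""
    converted = {}
    for name, compute in domain:
        converted[name] = inst[name] if compute is None else compute(inst)
    return converted


test_instance = {"a": "1", "b": "2", "c": "1", "e": "3", "y": "0"}
print(convert(test_instance, new_domain))
# {'a': '1', 'b': '2', 'e2': 'not 1', 'y': '0'}
```

The test instance itself is left unchanged; only the converted copy contains ``e2``.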