source: orange/orange/Orange/data/variable.py @ 9349:fa13a2c52fcd

Revision 9349:fa13a2c52fcd, 21.8 KB checked in by mitar, 2 years ago (diff)

Changed way of linking to code in documentation.

"""
========================
Variables (``variable``)
========================

Data instances in Orange can contain several types of variables:
:ref:`discrete <discrete>`, :ref:`continuous <continuous>`,
:ref:`strings <string>`, and :ref:`Python <Python>` and types derived from it.
The latter represent arbitrary Python objects.
The names, types, values (where applicable), functions for computing the
variable value from values of other variables, and other properties of the
variables are stored in descriptor classes defined in this module.

Variable descriptors
--------------------

Variable descriptors can be constructed either directly, using
constructors and passing attributes as parameters, or by a
factory function :func:`Orange.data.variable.make`, which either
retrieves an existing descriptor or constructs a new one.

.. class:: Variable

    An abstract base class for variable descriptors.

    .. attribute:: name

        The name of the variable. Variable names do not need to be unique since two
        variables are considered the same only if they have the same descriptor
        (e.g. even multiple variables in the same table can have the same name).
        This should, however, be avoided since it may result in unpredictable
        behavior.

    .. attribute:: var_type

        Variable type; it can be ``Orange.data.Type.Discrete``,
        ``Orange.data.Type.Continuous``, ``Orange.data.Type.String`` or
        ``Orange.data.Type.Other``.

    .. attribute:: get_value_from

        A function (an instance of :obj:`Orange.classification.Classifier`) which computes
        a value of the variable from values of one or more other variables. This
        is used, for instance, in discretization where the variables describing
        the discretized variable are computed from the original variable.

    .. attribute:: ordered

        A flag telling whether the values of a discrete variable are ordered. At
        the moment, no built-in method treats ordinal variables differently than
        nominal ones.

    .. attribute:: distributed

        A flag telling whether the values of the variables are distributions.
        As with the ``ordered`` flag, no methods treat such variables in any
        special manner.

    .. attribute:: random_generator

        A local random number generator used by method
        :obj:`Variable.randomvalue`.

    .. attribute:: default_meta_id

        A proposed (but not guaranteed) meta id to be used for that variable.
        This is used, for instance, by the data loader for tab-delimited file
        format instead of assigning an arbitrary new value, or by
        :obj:`Orange.data.new_meta_id` if the variable is passed as an argument.

    .. attribute:: attributes

        A dictionary which allows the user to store additional information
        about the variable. All values should be strings. See the section
        about :ref:`storing additional information <attributes>`.

    .. method:: __call__(obj)

           Convert a string, number, or other suitable object into a variable
           value.

           :param obj: An object to be converted into a variable value
           :type obj: any suitable
           :rtype: :class:`Orange.data.Value`

    .. method:: randomvalue()

           Return a random value for the variable.

           :rtype: :class:`Orange.data.Value`

    .. method:: compute_value(inst)

           Compute the value of the variable given the instance by calling
           :obj:`~Variable.get_value_from` through a mechanism that prevents deadlocks by
           circular calls.

           :rtype: :class:`Orange.data.Value`

.. _discrete:
.. class:: Discrete

    Bases: :class:`Variable`

    Descriptor for discrete variables.

    .. attribute:: values

        A list with symbolic names for the variable's values. Values are stored as
        indices referring to this list. Therefore, modifying this list
        instantly changes the (symbolic) names of values as they are printed out or
        referred to by the user.

        .. note::

            The size of the list is also used to indicate the number of
            possible values for this variable. Changing the size, especially
            shrinking the list, can have disastrous effects and is therefore not
            recommended. Also, do not add values to the list by
            calling its append or extend method: call the :obj:`add_value`
            method instead.

            It is also assumed that this attribute is always defined (but can
            be empty), so never set it to None.

    .. attribute:: base_value

            Stores the base value for the variable as an index in `values`.
            This can be, for instance, a "normal" value, such as "no
            complications" as opposed to abnormal "low blood pressure". The
            base value is used by certain statistics, continuization and,
            potentially, learning algorithms. The default is -1 which means that
            there is no base value.

    .. method:: add_value

            Add a value to values. Always call this function instead of
            appending to values.

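The index-based storage described for ``values`` can be illustrated with a
small plain-Python model. This is only a sketch and does not use Orange:
``ToyDiscrete``, ``index_of`` and ``symbol_of`` are hypothetical names
invented for the illustration.

```python
# Toy model (not Orange code): a value is stored as an index into a list
# of symbolic names, so renaming a list entry renames the value everywhere.
class ToyDiscrete(object):
    def __init__(self, name, values):
        self.name = name
        self.values = list(values)   # always defined, possibly empty
        self.base_value = -1         # -1 means "no base value"

    def add_value(self, symbol):
        # Append a new symbolic name and return its index.
        self.values.append(symbol)
        return len(self.values) - 1

    def index_of(self, symbol):
        # What is actually stored for a value: its position in ``values``.
        return self.values.index(symbol)

    def symbol_of(self, index):
        # What is printed for a stored index.
        return self.values[index]


size = ToyDiscrete("size", ["small", "medium"])
stored = size.index_of("medium")     # the data keeps only this index
size.values[1] = "middle"            # renaming an entry in place
renamed = size.symbol_of(stored)     # the same stored index now prints differently
big_index = size.add_value("big")    # growing the list goes through add_value
```

Shrinking ``values`` in this model would leave ``stored`` pointing past the
end of the list, which is the kind of damage the note above warns about.
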
.. _continuous:
.. class:: Continuous

    Bases: :class:`Variable`

    Descriptor for continuous variables.

    .. attribute:: number_of_decimals

        The number of decimals used when the value is printed out, converted to
        a string or saved to a file.

    .. attribute:: scientific_format

        If ``True``, the value is printed in scientific format whenever it
        would have more than 5 digits. In this case, :obj:`number_of_decimals` is
        ignored.

    .. attribute:: adjust_decimals

        Tells Orange to monitor the number of decimals when the value is
        converted from a string (when the values are read from a file or
        converted by, e.g. ``inst[0]="3.14"``):

        * 0: the number of decimals is not adjusted automatically;
        * 1: the number of decimals is adjusted automatically and has already
          been adjusted;
        * 2: automatic adjustment is enabled, but no values have been converted yet.

        By default, adjustment of the number of decimals goes as follows:

        If the variable was constructed when data was read from a file, it will
        be printed with the same number of decimals as the largest number of
        decimals encountered in the file. If scientific notation occurs in the
        file, :obj:`scientific_format` will be set to ``True`` and scientific format
        will be used for values too large or too small.

        If the variable is created in a script, it will have, by default, three
        decimal places. This can be changed either by setting the value
        from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by
        manually setting the :obj:`number_of_decimals`.

    .. attribute:: start_value, end_value, step_value

        The range used for :obj:`randomvalue`.

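The three states of ``adjust_decimals`` can be pictured with a plain-Python
model. This is a sketch under stated assumptions, not Orange's
implementation: ``ToyContinuous``, ``count_decimals``, ``from_string`` and
``format`` are hypothetical names, and the rule that the first converted
string replaces the default while later strings can only raise the number of
decimals is one reading of the description above.

```python
# Toy model (not Orange code) of tracking the number of decimals.
def count_decimals(text):
    # Digits after the decimal point; any exponent part is ignored.
    mantissa = text.strip().lower().split("e")[0]
    if "." not in mantissa:
        return 0
    return len(mantissa.split(".")[1])


class ToyContinuous(object):
    def __init__(self):
        self.number_of_decimals = 3   # default for a variable made in a script
        self.adjust_decimals = 2      # enabled, nothing converted yet

    def from_string(self, text):
        decimals = count_decimals(text)
        if self.adjust_decimals == 2:
            # Assumption: the first converted string replaces the default.
            self.number_of_decimals = decimals
            self.adjust_decimals = 1
        elif self.adjust_decimals == 1:
            # Assumption: later strings can only raise the number of decimals.
            self.number_of_decimals = max(self.number_of_decimals, decimals)
        return float(text)

    def format(self, value):
        # Fixed-point formatting with the current number of decimals.
        return "%.*f" % (self.number_of_decimals, value)
```

With ``adjust_decimals`` set to 0, ``from_string`` in this model leaves
``number_of_decimals`` alone, which corresponds to setting the number of
decimals manually.
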
.. _String:
.. class:: String

    Bases: :class:`Variable`

    Descriptor for variables that contain strings. No method can use them for
    learning; some will complain and others will silently ignore them when they
    encounter them. They can be, however, useful for meta-attributes; if
    instances in a dataset have unique IDs, the most efficient way to store them
    is to read them as meta-attributes. In general, never use discrete
    attributes with many (say, more than 50) values. Such attributes are
    probably not of any use for learning and should be stored as string
    attributes.

    When converting strings into values and back, empty strings are treated
    differently than usual. For other types, an empty string can be used to
    denote undefined values, while :obj:`String` will take empty strings
    as empty strings -- except when loading from or saving to a file.
    Empty strings in files are interpreted as undefined; to specify an empty
    string, enclose the string in double quotes; these are removed when the
    string is loaded.

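The file convention for empty strings can be sketched in plain Python. The
functions ``load_cell`` and ``save_cell`` and the ``UNDEFINED`` marker are
hypothetical names used only for this illustration; stripping the quotes from
any quoted cell, not just an empty one, is an assumption.

```python
# Toy model (not Orange code) of String cells in a data file.
UNDEFINED = None   # stand-in for an undefined value


def load_cell(cell):
    if cell == "":
        return UNDEFINED          # an empty cell means "undefined"
    if len(cell) >= 2 and cell[0] == '"' and cell[-1] == '"':
        return cell[1:-1]         # surrounding quotes are removed on loading
    return cell


def save_cell(value):
    if value is UNDEFINED:
        return ""                 # undefined is written as an empty cell
    if value == "":
        return '""'               # quoted, so it is not read back as undefined
    return value
```

In this model an empty string survives a round trip through ``save_cell`` and
``load_cell``, while an undefined value stays undefined.
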
.. _Python:
.. class:: Python

    Bases: :class:`Variable`

    Base class for descriptors defined in Python. It is fully functional
    and can be used as a descriptor for attributes that contain arbitrary Python
    values. Since this is an advanced topic, PythonVariables are described on a
    separate page. !!TODO!!


Variables computed from other variables
---------------------------------------

Values of variables are often computed from other variables, such as in
discretization. The mechanism described below usually functions behind the scenes,
so understanding it is required only for implementing specific transformations.

Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``.
It can help the learning algorithm if the four-valued attribute ``e`` is
replaced with a binary attribute having values ``"1"`` and ``"not 1"``. The
new variable will be computed from the old one on the fly.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 7-17

The new variable is named ``e2``; we define it with a descriptor of type
:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``"1"`` (we
chose this order so that the index of ``"not 1"`` is ``0``, which can be, if
needed, interpreted as ``False``). Finally, we tell ``e2`` to use
``checkE`` to compute its value when needed, by assigning ``checkE`` to
``e2.get_value_from``.

``checkE`` is a function that is passed an instance and another argument we
do not care about here. If the instance's ``e`` equals ``1``, the function
returns value ``1``, otherwise it returns ``not 1``. Both are returned as
values, not plain strings.

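The included script is not reproduced here, so the following plain-Python
model only mirrors the logic described for ``checkE``. It does not use
Orange: instances are plain dictionaries, a value is modelled as an index
into ``E2_VALUES``, and ``check_e`` and ``rows`` are hypothetical names.

```python
# Toy model (not Orange code): instances are dicts, values are indices.
E2_VALUES = ["not 1", "1"]        # index 0 ("not 1") can be read as False


def check_e(inst, return_what=None):
    # The second argument is accepted but ignored, as in the description.
    symbol = "1" if inst["e"] == "1" else "not 1"
    return E2_VALUES.index(symbol)   # a value (an index), not a plain string


rows = [{"a": "1", "b": "1", "e": "1"}, {"a": "2", "b": "3", "e": "4"}]
computed = [check_e(row) for row in rows]   # computed on demand, never stored
```

The rows themselves never gain an ``e2`` entry; the value is derived from
``e`` each time it is asked for.
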
In most circumstances the value of ``e2`` can be computed on the fly - we can
pretend that the variable exists in the data, although it does not (but
can be computed from it). For instance, we can compute the information gain of
variable ``e2`` or its distribution without actually constructing data containing
the new variable.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 19-22

There are methods which cannot compute values on the fly because it would be
too complex or time consuming. In such cases, the data need to be converted
to a new :obj:`Orange.data.Table`::

    new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var])
    new_data = Orange.data.Table(new_domain, data)

Automatic computation is useful when the data is split into training and
testing examples. Training instances can be modified by adding, removing
and transforming variables (in a typical setup, continuous variables
are discretized prior to learning, therefore the original variables are
replaced by new ones). Test instances, on the other hand, are left as they
are. When they are classified, the classifier automatically converts the
testing instances into the new domain, which includes recomputation of
transformed variables.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 24-

.. _attributes:

Storing additional information
------------------------------

All variables have a field :obj:`~Variable.attributes`, a dictionary
which can contain strings. Although the current implementation allows
values of any type, we strongly advise using only strings. An example:

.. literalinclude:: code/attributes.py

These attributes can only be saved to a .tab file. They are listed in the
third line in ``<name>=<value>`` format, after other attribute specifications
(such as "meta" or "class"), and are separated by spaces.

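The layout of a third-line cell can be sketched as a small parser. This is a
plain-Python illustration, not Orange's loader: ``parse_third_line_cell`` is
a hypothetical name, the sample cell is made up, and values that themselves
contain spaces are not handled.

```python
# Toy parser (not Orange code) for one cell of the third line of a .tab file.
def parse_third_line_cell(cell):
    flags = []        # plain specifications such as "meta" or "class"
    attributes = {}   # <name>=<value> pairs, kept as strings
    for token in cell.split():
        if "=" in token:
            name, value = token.split("=", 1)
            attributes[name] = value
        else:
            flags.append(token)
    return flags, attributes


flags, attributes = parse_third_line_cell("meta origin=survey unit=cm")
```
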
Reuse of descriptors
--------------------

There are situations when variable descriptors need to be reused. Typically, the
user loads some training examples, trains a classifier, and then loads a separate
test set. For the classifier to recognize the variables in the second data set,
the descriptors, not just the names, need to be the same.

When constructing new descriptors for data read from a file or during unpickling,
Orange checks whether an appropriate descriptor (with the same name and, in case
of discrete variables, also values) already exists and reuses it. When new
descriptors are constructed by explicitly calling the above constructors, this
always creates new descriptors and thus new variables, although a variable with
the same name may already exist.

The search for an existing variable is based on four attributes: the variable's name,
type, ordered values, and unordered values. As for the latter two, the values can
be explicitly ordered by the user, e.g. in the second line of the tab-delimited
file. For instance, sizes can be ordered as small, medium, and big.

The search for existing variables can end with one of the following statuses.

.. data:: Orange.data.variable.MakeStatus.NotFound (4)

    No variable with that name and type exists.

.. data:: Orange.data.variable.MakeStatus.Incompatible (3)

    There are variables with matching name and type, but their
    values are incompatible with the prescribed ordered values. For example,
    if the existing variable already has values ["a", "b"] and the new one
    wants ["b", "a"], the old variable cannot be reused. The existing list can,
    however, be extended with the new values, so searching for ["a", "b", "c"] would
    succeed. Likewise, a search for ["a"] would be successful, since the extra existing value
    does not matter. The formal rule is thus that the values are compatible iff
    ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``.

.. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2)

    There is a matching variable, yet it has none of the values that the new
    variable will have (this is obviously possible only if the new variable has
    no prescribed ordered values). For instance, we search for a variable
    "sex" with values "male" and "female", while there is a variable of the same
    name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this
    variable is possible, though this should probably be a new variable since it
    obviously comes from a different data set. If we do decide to reuse the variable, the
    old variable will get some unneeded new values and the new one will inherit
    some from the old.

.. data:: Orange.data.variable.MakeStatus.MissingValues (1)

    There is a matching variable with some of the values that the new one
    requires, but some values are missing. This situation is neither uncommon
    nor suspicious: in case of separate training and testing data sets there may
    be values which occur in one set but not in the other.

.. data:: Orange.data.variable.MakeStatus.OK (0)

    There is a perfect match which contains all the prescribed values in the
    correct order. The existing variable may have some extra values, though.

Continuous variables can obviously have only two statuses,
:obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`.

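The statuses above can be summarized in a plain-Python model of the search
for a single candidate variable. This is a sketch, not Orange's
implementation: ``match_status`` and ``ordered_values_compatible`` are
hypothetical names, the numeric constants are the codes listed above, and an
existing variable is represented only by its list of values (``None`` when no
variable has the requested name and type).

```python
# Toy model (not Orange code) of the status of a search for one candidate.
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def ordered_values_compatible(existing_values, ordered_values):
    # The formal rule quoted in the description of Incompatible.
    return (existing_values[:len(ordered_values)]
            == ordered_values[:len(existing_values)])


def match_status(existing_values, ordered_values=None, unordered_values=None):
    if existing_values is None:
        return NOT_FOUND
    existing = list(existing_values)
    ordered = list(ordered_values or [])
    unordered = list(unordered_values or [])
    if not ordered_values_compatible(existing, ordered):
        return INCOMPATIBLE
    requested = ordered + unordered
    recognized = [value for value in requested if value in existing]
    if requested and not recognized:
        return NO_RECOGNIZED_VALUES
    if len(recognized) < len(requested):
        return MISSING_VALUES
    return OK
```

For a continuous variable both value lists are empty, so this model can only
return ``NOT_FOUND`` or ``OK``, in line with the remark above.
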
When loading the data using :obj:`Orange.data.Table`, Orange takes the safest
approach and, by default, reuses everything that is compatible up to
and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. Unintended reuse would be obvious from the
variable having too many values, which the user can notice and fix. More on that
in the page on `loading data`. !!TODO!!

There are two functions for reusing the variables instead of creating new ones.

.. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable or create a new one if none of the existing
    variables matches the given name, type and values.

    The optional `create_new_on` specifies the status at which a new variable is
    created. The status must be at most :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or
    non-existing) variables cannot be reused. If it is set lower, for instance
    to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is created even if there exists
    a variable which is only missing some of the values. If set to :obj:`~Orange.data.variable.MakeStatus.OK`, the function
    always creates a new variable.

    The function returns a tuple containing a variable descriptor and the
    status of the best matching variable. So, if ``create_new_on`` is set to
    :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a variable whose status is, say,
    :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a variable would be created, while the second
    element of the tuple would contain :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other
    hand, there exists a variable which is perfectly OK, its descriptor is
    returned and the returned status is :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no
    indicator whether the returned variable is reused or not. This can be,
    however, read from the status code: if it is smaller than the specified
    ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed.

    The exception to the rule is when ``create_new_on`` is OK. In this case, the
    function does not search through the existing variables and cannot know the
    status, so the returned status in this case is always :obj:`~Orange.data.variable.MakeStatus.OK`.

    :param name: Variable name
    :param type: Variable type
    :type type: Orange.data.variable.Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of using the existing one

    :rtype: a tuple (:class:`Orange.data.variable.Variable`, int)

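Since ``make`` returns no explicit reuse flag, the rule for reading it off
the status can be written down as a small helper. This is a plain-Python
sketch, not part of Orange: ``is_reused`` is a hypothetical name, the
constants are the status codes listed earlier, and the default of
``INCOMPATIBLE`` for ``create_new_on`` is an assumption based on the
statement that reuse is allowed up to and including ``NoRecognizedValues``.

```python
# Toy helper (not Orange code): was the returned descriptor reused?
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def is_reused(status, create_new_on=INCOMPATIBLE):
    if create_new_on == OK:
        # No search is performed, so a new descriptor is always constructed
        # and the returned status (always OK) carries no information.
        return False
    # Otherwise the variable is reused iff the status is below the threshold.
    return status < create_new_on
```
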
.. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable, or :obj:`None` if no match is found.

    :param name: variable name.
    :param type: variable type.
    :type type: Orange.data.variable.Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of using the existing one

    :rtype: :class:`Orange.data.variable.Variable`

The following examples (from :download:`variable-reuse.py <code/variable-reuse.py>`) give the results shown if
executed only once (in a Python session) and in this order.

:func:`Orange.data.variable.make` can be used for the construction of new variables. ::

    >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"])
    >>> print s, v1.values
    4 <a, b>

No surprises here: a new variable is created and the status is :obj:`~Orange.data.variable.MakeStatus.NotFound`. ::

    >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"])
    >>> print s, v2 is v1, v1.values
    1 True <a, b, c>

The status is 1 (:obj:`~Orange.data.variable.MakeStatus.MissingValues`), yet the variable is reused (``v2 is v1``).
``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does
not matter that the new variable does not need the value ``b``. ::

    >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"])
    >>> print s, v3 is v1, v1.values
    1 True <a, b, c, d>

This is like before, except that the new value, ``d``, is given among the
ordered values. ::

    >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"])
    >>> print s, v4 is v1, v1.values, v4.values
    3 False <a, b, c, d> <b>

The new variable needs to have ``b`` as the first value, so it is incompatible
with the existing variables. The status is thus 3 (:obj:`~Orange.data.variable.MakeStatus.Incompatible`), the two
variables are not equal and have different lists of values. ::

    >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"])
    >>> print s, v5 is v1, v1.values, v5.values
    0 True <a, b, c, d> <a, b, c, d>

The new variable has values ``c`` and ``a``, but the order is not important,
so the status of the existing variable is :obj:`~Orange.data.variable.MakeStatus.OK`. ::

    >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"])
    >>> print s, v6 is v1, v1.values, v6.values
    2 True <a, b, c, d, e> <a, b, c, d, e>

The new variable has different values than the existing variable (status is 2,
:obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`), but the existing one is nonetheless reused. Note that we
gave ``e`` in the list of unordered values. If it were among the ordered values, the
reuse would fail. ::

    >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None,
    ...     ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues)
    >>> print s, v7 is v1, v1.values, v7.values
    2 False <a, b, c, d, e> <f>

This is the same as before, except that we prohibited reuse when there are no
recognized values. Hence a new variable is created, though the returned status is
the same as before::

    >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete,
    ...     ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK)
    >>> print s, v8 is v1, v1.values, v8.values
    0 False <a, b, c, d, e> <a, b, c, d, e>

Finally, this is a perfect match, but any reuse is prohibited, so a new
variable is created.

"""
# The descriptor classes documented above are aliases of classes from the
# ``orange`` module.
from orange import Variable
from orange import EnumVariable as Discrete
from orange import FloatVariable as Continuous
from orange import PythonVariable as Python
from orange import StringVariable as String

from orange import VarList as Variables

# Module-level aliases for the helper functions and the status codes.
import orange
new_meta_id = orange.newmetaid
make = orange.Variable.make
retrieve = orange.Variable.get_existing
MakeStatus = orange.Variable.MakeStatus
del orange