source: orange/orange/Orange/data/variable.py @ 9636:e1a9df059e31

Revision 9636:e1a9df059e31, 21.8 KB checked in by markotoplak, 2 years ago

Orange.data.table documentation fixes.

"""
========================
Variables (``variable``)
========================

Data instances in Orange can contain several types of variables:
:ref:`discrete <discrete>`, :ref:`continuous <continuous>`,
:ref:`string <string>`, and :ref:`Python <Python>` variables, along with
types derived from the latter. Python variables represent arbitrary Python
objects.
The names, types, values (where applicable), functions for computing the
variable value from values of other variables, and other properties of the
variables are stored in descriptor classes defined in this module.

Variable descriptors
--------------------

Variable descriptors can be constructed either directly, using
constructors and passing attributes as parameters, or with the
factory function :func:`Orange.data.variable.make`, which either
retrieves an existing descriptor or constructs a new one.

.. class:: Variable

    An abstract base class for variable descriptors.

    .. attribute:: name

        The name of the variable. Variable names do not need to be unique since two
        variables are considered the same only if they have the same descriptor
        (e.g. even multiple variables in the same table can have the same name).
        This should, however, be avoided since it may result in unpredictable
        behavior.

    .. attribute:: var_type

        Variable type; it can be Orange.data.Type.Discrete,
        Orange.data.Type.Continuous, Orange.data.Type.String or
        Orange.data.Type.Other.

    .. attribute:: get_value_from

        A function (an instance of :obj:`Orange.classification.Classifier`) which computes
        a value of the variable from values of one or more other variables. This
        is used, for instance, in discretization, where the new, discretized
        variable is computed from the original variable.

    .. attribute:: ordered

        A flag telling whether the values of a discrete variable are ordered. At
        the moment, no built-in method treats ordinal variables differently than
        nominal ones.

    .. attribute:: distributed

        A flag telling whether the values of the variables are distributions.
        As with the flag ordered, no methods treat such variables in any special
        manner.

    .. attribute:: random_generator

        A local random number generator used by method
        :obj:`Variable.random_value`.

    .. attribute:: default_meta_id

        A proposed (but not guaranteed) meta id to be used for that variable.
        This is used, for instance, by the data loader for tab-delimited file
        format instead of assigning an arbitrary new value, or by
        :obj:`Orange.data.new_meta_id` if the variable is passed as an argument.

    .. attribute:: attributes

        A dictionary which allows the user to store additional information
        about the variable. All values should be strings. See the section
        about :ref:`storing additional information <attributes>`.

    .. method:: __call__(obj)

           Convert a string, number, or other suitable object into a variable
           value.

           :param obj: An object to be converted into a variable value
           :type obj: any suitable object
           :rtype: :class:`Orange.data.Value`

    .. method:: randomvalue()

           Return a random value for the variable.

           :rtype: :class:`Orange.data.Value`

    .. method:: compute_value(inst)

           Compute the value of the variable given the instance by calling
           :obj:`~Variable.get_value_from` through a mechanism that prevents
           deadlocks caused by circular calls.

           :rtype: :class:`Orange.data.Value`

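The guard that ``compute_value`` places around ``get_value_from`` can be
pictured with a small pure-Python sketch. ``DerivedSketch`` and its
``_computing`` flag are hypothetical names, and raising an exception on
re-entry is an assumption; how Orange itself reacts to a circular definition
is not stated here. ::

    class DerivedSketch(object):
        # Hypothetical descriptor; not Orange's Variable class.
        def __init__(self, name, get_value_from=None):
            self.name = name
            self.get_value_from = get_value_from
            self._computing = False   # True while a computation is running

        def compute_value(self, inst):
            if self.get_value_from is None:
                raise ValueError("%s has no get_value_from" % self.name)
            if self._computing:
                # Re-entered before finishing: the definition is circular.
                raise RuntimeError("circular computation of %s" % self.name)
            self._computing = True
            try:
                return self.get_value_from(inst)
            finally:
                self._computing = False


    # A derived variable computed from the value of "e" (instance is a dict here).
    e2 = DerivedSketch("e2", lambda inst: "1" if inst["e"] == "1" else "not 1")
    value = e2.compute_value({"e": "3"})       # "not 1"

    # A variable defined in terms of itself fails fast instead of looping.
    loop = DerivedSketch("loop")
    loop.get_value_from = loop.compute_value

Calling ``loop.compute_value({})`` in this sketch raises ``RuntimeError``
on the second entry rather than recursing without end.
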
.. _discrete:
.. class:: Discrete

    Bases: :class:`Variable`

    Descriptor for discrete variables.

    .. attribute:: values

        A list with symbolic names for the variable's values. Values are stored as
        indices referring to this list. Therefore, modifying this list
        instantly changes the (symbolic) names of values as they are printed out or
        referred to by the user.

        .. note::

            The size of the list is also used to indicate the number of
            possible values for this variable. Changing the size, especially
            shrinking the list, can have disastrous effects and is therefore
            not recommended. Also, do not add values to the list by
            calling its append or extend method: call the :obj:`add_value`
            method instead.

            It is also assumed that this attribute is always defined (but can
            be empty), so never set it to None.

    .. attribute:: base_value

        Stores the base value for the variable as an index in `values`.
        This can be, for instance, a "normal" value, such as "no
        complications" as opposed to abnormal "low blood pressure". The
        base value is used by certain statistics, continuization and,
        potentially, learning algorithms. The default is -1, which means that
        there is no base value.

    .. method:: add_value

        Add a value to values. Always call this function instead of
        appending to values.

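Why editing ``values`` renames values everywhere can be seen in a minimal
pure-Python sketch of index-based storage. ``DiscreteSketch`` and its helper
methods are hypothetical stand-ins, not the Orange class. ::

    class DiscreteSketch(object):
        # Hypothetical stand-in for a discrete descriptor.
        def __init__(self, name, values):
            self.name = name
            self.values = list(values)

        def add_value(self, value):
            # Register a new symbolic name and return its index.
            self.values.append(value)
            return len(self.values) - 1

        def to_index(self, symbol):
            return self.values.index(symbol)

        def to_symbol(self, index):
            return self.values[index]


    size = DiscreteSketch("size", ["small", "medium", "big"])
    stored = [size.to_index(s) for s in ["big", "small", "big"]]   # [2, 0, 2]

    # Renaming an entry changes how every stored index is displayed.
    size.values[2] = "large"
    shown = [size.to_symbol(i) for i in stored]   # ['large', 'small', 'large']

Because the data holds only indices, shrinking the list would leave stored
indices pointing past its end, which is the hazard the note above warns about.
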
.. _continuous:
.. class:: Continuous

    Bases: :class:`Variable`

    Descriptor for continuous variables.

    .. attribute:: number_of_decimals

        The number of decimals used when the value is printed out, converted to
        a string or saved to a file.

    .. attribute:: scientific_format

        If ``True``, the value is printed in scientific format whenever it
        would have more than 5 digits. In this case, :obj:`number_of_decimals` is
        ignored.

    .. attribute:: adjust_decimals

        Tells Orange to monitor the number of decimals when the value is
        converted from a string (when the values are read from a file or
        converted by, e.g. ``inst[0]="3.14"``):

        * 0: the number of decimals is not adjusted automatically;
        * 1: the number of decimals is (and has already been) adjusted;
        * 2: automatic adjustment is enabled, but no values have been converted yet.

        By default, adjustment of the number of decimals goes as follows:

        If the variable was constructed when data was read from a file, it will
        be printed with the same number of decimals as the largest number of
        decimals encountered in the file. If scientific notation occurs in the
        file, :obj:`scientific_format` will be set to ``True`` and scientific format
        will be used for values too large or too small.

        If the variable is created in a script, it will have, by default, three
        decimal places. This can be changed either by setting the value
        from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by
        manually setting the :obj:`number_of_decimals`.

    .. attribute:: start_value, end_value, step_value

        The range used for :obj:`randomvalue`.

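The three ``adjust_decimals`` states can be illustrated with a small
pure-Python sketch. ``DecimalsTracker`` and ``count_decimals`` are
hypothetical helpers; starting a script-created variable in state 2 and
ignoring scientific notation are assumptions made to keep the example
short. ::

    def count_decimals(text):
        # Digits after the decimal point in a plain decimal string.
        text = text.strip()
        if "." not in text:
            return 0
        return len(text.split(".", 1)[1])


    class DecimalsTracker(object):
        # Hypothetical helper mirroring the three adjust_decimals states.
        def __init__(self):
            self.number_of_decimals = 3   # default for script-created variables
            self.adjust_decimals = 2      # enabled, nothing converted yet

        def observe(self, text):
            # Called whenever a value is converted from a string.
            if self.adjust_decimals == 0:
                return
            decimals = count_decimals(text)
            if self.adjust_decimals == 2:
                # The first converted value replaces the default.
                self.number_of_decimals = decimals
                self.adjust_decimals = 1
            else:
                # Later values can only increase the number of decimals.
                self.number_of_decimals = max(self.number_of_decimals, decimals)

        def format(self, value):
            return "%.*f" % (self.number_of_decimals, value)


    tracker = DecimalsTracker()
    before = tracker.format(2.5)      # "2.500"
    tracker.observe("3.14")
    after = tracker.format(2.5)       # "2.50"

Setting a value from the float ``3.14`` would not go through ``observe`` in
this sketch, which matches the distinction drawn above between
``inst[0]="3.14"`` and ``inst[0]=3.14``.
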
.. _String:
.. class:: String

    Bases: :class:`Variable`

    Descriptor for variables that contain strings. No method can use them for
    learning; some will complain and others will silently ignore them when they
    encounter them. They can be, however, useful for meta-attributes; if
    instances in a dataset have unique IDs, the most efficient way to store them
    is to read them as meta-attributes. In general, never use discrete
    attributes with many (say, more than 50) values. Such attributes are
    probably not of any use for learning and should be stored as string
    attributes.

    When converting strings into values and back, empty strings are treated
    differently than usual. For other types, an empty string can be used to
    denote undefined values, while :obj:`String` will take empty strings
    as empty strings -- except when loading or saving into file.
    Empty strings in files are interpreted as undefined; to specify an empty
    string, enclose the string in double quotes; these are removed when the
    string is loaded.

.. _Python:
.. class:: Python

    Bases: :class:`Variable`

    Base class for descriptors defined in Python. It is fully functional
    and can be used as a descriptor for attributes that contain arbitrary Python
    values. Since this is an advanced topic, PythonVariables are described on a
    separate page. !!TODO!!


Variables computed from other variables
---------------------------------------

Values of variables are often computed from other variables, such as in
discretization. The mechanism described below usually functions behind the scenes,
so understanding it is required only for implementing specific transformations.

Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``.
It can help the learning algorithm if the four-valued attribute ``e`` is
replaced with a binary attribute having values ``"1"`` and ``"not 1"``. The
new variable will be computed from the old one on the fly.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 7-17

The new variable is named ``e2``; we define it with a descriptor of type
:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``"1"`` (we
chose this order so that the index of ``not 1`` is ``0``, which can be, if
needed, interpreted as ``False``). Finally, we tell ``e2`` to use
``checkE`` to compute its value when needed, by assigning ``checkE`` to
``e2.get_value_from``.

``checkE`` is a function that is passed an instance and another argument we
do not care about here. If the instance's ``e`` equals ``1``, the function
returns value ``1``, otherwise it returns ``not 1``. Both are returned as
values, not plain strings.

In most circumstances the value of ``e2`` can be computed on the fly: we can
pretend that the variable exists in the data, although it does not (but
can be computed from it). For instance, we can compute the information gain of
variable ``e2`` or its distribution without actually constructing data containing
the new variable.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 19-22

Some methods cannot compute values on the fly because it would be
too complex or time consuming. In such cases, the data needs to be converted
to a new :obj:`Orange.data.Table`::

    new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var])
    new_data = Orange.data.Table(new_domain, data)

Automatic computation is useful when the data is split into training and
testing examples. Training instances can be modified by adding, removing
and transforming variables (in a typical setup, continuous variables
are discretized prior to learning, therefore the original variables are
replaced by new ones). Test instances, on the other hand, are left as they
are. When they are classified, the classifier automatically converts the
testing instances into the new domain, which includes recomputation of
transformed variables.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 24-

.. _attributes:

Storing additional information
------------------------------

All variables have a field :obj:`~Variable.attributes`, a dictionary
which can contain strings. Although the current implementation allows
values of any type, we strongly advise using only strings. An example:

.. literalinclude:: code/attributes.py

These attributes can only be saved to a .tab file. They are listed in the
third line in <name>=<value> format, after other attribute specifications
(such as "meta" or "class"), and are separated by spaces.

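The ``<name>=<value>`` layout described above can be sketched in plain
Python. Both helpers are hypothetical, the alphabetical ordering is an
arbitrary choice for reproducible output, and values containing spaces are
not handled. ::

    def format_attributes(attributes):
        # Join the entries as <name>=<value>, separated by single spaces.
        return " ".join("%s=%s" % (name, attributes[name])
                        for name in sorted(attributes))


    def parse_attributes(text):
        # Keep only the <name>=<value> items; flags such as "meta" are skipped.
        pairs = (item.split("=", 1) for item in text.split() if "=" in item)
        return dict(pairs)


    line = format_attributes({"unit": "cm", "source": "survey"})
    # line == "source=survey unit=cm"
    parsed = parse_attributes("meta " + line)
    # parsed == {"source": "survey", "unit": "cm"}
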
.. _variable_descriptor_reuse:

Reuse of descriptors
--------------------

There are situations when variable descriptors need to be reused. Typically, the
user loads some training examples, trains a classifier, and then loads a separate
test set. For the classifier to recognize the variables in the second data set,
the descriptors, not just the names, need to be the same.

When constructing new descriptors for data read from a file or during unpickling,
Orange checks whether an appropriate descriptor (with the same name and, in case
of discrete variables, also values) already exists and reuses it. When new
descriptors are constructed by explicitly calling the above constructors, this
always creates new descriptors and thus new variables, although a variable with
the same name may already exist.

The search for an existing variable is based on four attributes: the variable's name,
type, ordered values, and unordered values. As for the latter two, the values can
be explicitly ordered by the user, e.g. in the second line of the tab-delimited
file. For instance, sizes can be ordered as small, medium, or big.

The search for existing variables can end with one of the following statuses.

.. data:: Orange.data.variable.MakeStatus.NotFound (4)

    The variable with that name and type does not exist.

.. data:: Orange.data.variable.MakeStatus.Incompatible (3)

    There are variables with matching name and type, but their
    values are incompatible with the prescribed ordered values. For example,
    if the existing variable already has values ["a", "b"] and the new one
    wants ["b", "a"], the old variable cannot be reused. The existing list can,
    however, be extended with the new values, so searching for ["a", "b", "c"] would
    succeed. Likewise, a search for ["a"] would be successful, since the extra
    existing value does not matter. The formal rule is thus that the values are
    compatible iff
    ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``.

.. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2)

    There is a matching variable, yet it has none of the values that the new
    variable will have (this is obviously possible only if the new variable has
    no prescribed ordered values). For instance, we search for a variable
    "sex" with values "male" and "female", while there is a variable of the same
    name with values "M" and "F" (or even "no" and "yes"). Reuse of this
    variable is possible, though this should probably be a new variable since it
    obviously comes from a different data set. If we do decide to reuse the variable, the
    old variable will get some unneeded new values and the new one will inherit
    some from the old.

.. data:: Orange.data.variable.MakeStatus.MissingValues (1)

    There is a matching variable with some of the values that the new one
    requires, but some values are missing. This situation is neither uncommon
    nor suspicious: in case of separate training and testing data sets there may
    be values which occur in one set but not in the other.

.. data:: Orange.data.variable.MakeStatus.OK (0)

    There is a perfect match which contains all the prescribed values in the
    correct order. The existing variable may have some extra values, though.

Continuous variables can obviously have only two statuses,
:obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`.

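The compatibility rule and the status descriptions above can be sketched in
plain Python for a single existing discrete variable. The function names and
the order in which the checks are applied are assumptions made for
illustration; only the slicing rule and the numeric codes come from the
text. ::

    NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


    def values_compatible(existing_values, ordered_values):
        # One list must be a prefix of the other.
        return (existing_values[:len(ordered_values)]
                == ordered_values[:len(existing_values)])


    def match_status(existing_values, ordered_values=None, unordered_values=None):
        # Status of one existing variable against the requested values.
        ordered = list(ordered_values or [])
        unordered = list(unordered_values or [])
        if not values_compatible(existing_values, ordered):
            return INCOMPATIBLE
        wanted = ordered + unordered
        missing = [v for v in wanted if v not in existing_values]
        if not missing:
            return OK
        if not ordered and len(missing) == len(wanted):
            # Nothing requested is known, and no order was prescribed.
            return NO_RECOGNIZED_VALUES
        return MISSING_VALUES


    match_status(["a", "b"], ["b", "a"])                  # 3, Incompatible
    match_status(["a", "b"], ["a", "b", "c"])             # 1, MissingValues
    match_status(["a", "b"], ["a"])                       # 0, OK
    match_status(["M", "F"], None, ["male", "female"])    # 2, NoRecognizedValues

``NotFound`` does not appear in ``match_status`` because it describes the
case where no variable with the given name and type exists at all.
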
When loading the data using :obj:`Orange.data.Table`, Orange takes the safest
approach and, by default, reuses everything that is compatible up to
and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`.
Unintended reuse would be obvious from the
variable having too many values, which the user can notice and fix. More on that
in the page on `loading data`. !!TODO!!

There are two functions for reusing the variables instead of creating new ones.

.. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable or create a new one if none of the existing
    variables matches the given name, type and values.

    The optional `create_new_on` specifies the status at which a new variable is
    created. The status must be at most :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or
    non-existing) variables cannot be reused. If it is set lower, for instance
    to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is created even if there exists
    a variable that lacks only some of the required values. If set to
    :obj:`~Orange.data.variable.MakeStatus.OK`, the function
    always creates a new variable.

    The function returns a tuple containing a variable descriptor and the
    status of the best matching variable. So, if ``create_new_on`` is set to
    :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a variable whose status is, say,
    :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a variable would be created, while the second
    element of the tuple would contain :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other
    hand, there exists a variable which is perfectly OK, its descriptor is
    returned and the returned status is :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no
    indicator of whether the returned variable is reused or not. This can,
    however, be read from the status code: if it is smaller than the specified
    ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed.

    The exception to the rule is when ``create_new_on`` is OK. In this case, the
    function does not search through the existing variables and cannot know the
    status, so the returned status in this case is always :obj:`~Orange.data.variable.MakeStatus.OK`.

    :param name: Variable name
    :param type: Variable type
    :type type: Orange.data.variable.Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of reusing an existing one

    :rtype: a tuple (:class:`Orange.data.variable.Variable`, int)

.. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable, or :obj:`None` if no match is found.

    :param name: variable name.
    :param type: variable type.
    :type type: Orange.data.variable.Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of reusing an existing one

    :rtype: :class:`Orange.data.variable.Variable`

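The way ``create_new_on`` and the returned status interact in
:func:`Orange.data.variable.make` can be sketched as a small decision
function. ``reuse_decision`` is a hypothetical helper, and the default of
``Incompatible`` is an assumption rather than something stated in the
description above. ::

    NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


    def reuse_decision(best_status, create_new_on=INCOMPATIBLE):
        # Return (reused, returned_status) for the best existing match.
        if create_new_on == OK:
            # No search is performed, so the status is reported as OK.
            return False, OK
        return best_status < create_new_on, best_status


    reuse_decision(NO_RECOGNIZED_VALUES)                  # (True, 2)
    reuse_decision(NO_RECOGNIZED_VALUES, MISSING_VALUES)  # (False, 2)
    reuse_decision(NOT_FOUND)                             # (False, 4)
    reuse_decision(OK, OK)                                # (False, 0)

The second call mirrors the case described for ``make``: a new variable is
constructed, yet the status of the best match is still reported.
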
The following examples (from :download:`variable-reuse.py <code/variable-reuse.py>`) give the shown results if
executed only once (in a Python session) and in this order.

:func:`Orange.data.variable.make` can be used for the construction of new variables. ::

    >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"])
    >>> print s, v1.values
    4 <a, b>

No surprises here: a new variable is created and the status is :obj:`~Orange.data.variable.MakeStatus.NotFound`. ::

    >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"])
    >>> print s, v2 is v1, v1.values
    1 True <a, b, c>

The status is 1 (:obj:`~Orange.data.variable.MakeStatus.MissingValues`), yet the variable is reused (``v2 is v1``).
``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does
not matter that the new variable does not need the value ``b``. ::

    >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"])
    >>> print s, v3 is v1, v1.values
    1 True <a, b, c, d>

This is like before, except that the new value, ``d``, is given among the
ordered values. ::

    >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"])
    >>> print s, v4 is v1, v1.values, v4.values
    3 False <a, b, c, d> <b>

The new variable needs to have ``b`` as the first value, so it is incompatible
with the existing variables. The status is thus 3 (:obj:`~Orange.data.variable.MakeStatus.Incompatible`), the two
variables are not equal and have different lists of values. ::

    >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"])
    >>> print s, v5 is v1, v1.values, v5.values
    0 True <a, b, c, d> <a, b, c, d>

The new variable has values ``c`` and ``a``, but the order is not important,
so the existing variable is a match with status :obj:`~Orange.data.variable.MakeStatus.OK`. ::

    >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"])
    >>> print s, v6 is v1, v1.values, v6.values
    2 True <a, b, c, d, e> <a, b, c, d, e>

The new variable has different values than the existing variable (status is 2,
:obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`), but the existing one is nonetheless reused. Note that we
gave ``e`` in the list of unordered values. If it were among the ordered values, the
reuse would fail. ::

    >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None,
    ...     ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues)
    >>> print s, v7 is v1, v1.values, v7.values
    2 False <a, b, c, d, e> <f>

This is the same as before, except that we prohibited reuse when there are no
recognized values. Hence a new variable is created, though the returned status is
the same as before::

    >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete,
    ...     ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK)
    >>> print s, v8 is v1, v1.values, v8.values
    0 False <a, b, c, d, e> <a, b, c, d, e>

Finally, this is a perfect match, but any reuse is prohibited, so a new
variable is created.

"""
# Descriptor classes, re-exported from the core module under their new names.
from orange import Variable
from orange import EnumVariable as Discrete
from orange import FloatVariable as Continuous
from orange import PythonVariable as Python
from orange import StringVariable as String

from orange import VarList as Variables

# Module-level aliases for the core helpers documented above.
import orange
new_meta_id = orange.newmetaid
make = orange.Variable.make
retrieve = orange.Variable.get_existing
MakeStatus = orange.Variable.MakeStatus
del orange  # keep the module namespace free of the core module name