.. Source: orange/docs/reference/rst/Orange.data.variable.rst @ 9848:00f5832a4c71
   (revision checked in by Matija Polajnar).

.. automodule:: Orange.data.variable

========================
Variables (``variable``)
========================

Data instances in Orange can contain several types of variables:
:ref:`discrete <discrete>`, :ref:`continuous <continuous>`,
:ref:`string <string>`, and :ref:`Python <Python>` variables, along with types
derived from the Python type. Python variables and their derived types
represent arbitrary Python objects.
The names, types, values (where applicable), functions for computing the
variable value from values of other variables, and other properties of the
variables are stored in descriptor classes derived from
:obj:`Orange.data.variable.Variable`.

Orange considers two variables (e.g. in two different data tables) the
same if they have the same descriptor. It is allowed, but not
recommended, to have different variables with the same name.

Variable descriptors
--------------------

Variable descriptors can be constructed either by calling the
corresponding constructors or by the factory function
:func:`Orange.data.variable.make`, which either retrieves an existing
descriptor or constructs a new one.

.. class:: Variable

    An abstract base class for variable descriptors.

    .. attribute:: name

        The name of the variable.

    .. attribute:: var_type

        Variable type; it can be :obj:`~Orange.data.Type.Discrete`,
        :obj:`~Orange.data.Type.Continuous`,
        :obj:`~Orange.data.Type.String` or :obj:`~Orange.data.Type.Other`.

    .. attribute:: get_value_from

        A function (an instance of :obj:`~Orange.classification.Classifier`)
        that computes a value of the variable from values of one or more
        other variables. This is used, for instance, in discretization,
        which computes the value of a discretized variable from the
        original continuous variable.

    .. attribute:: ordered

        A flag telling whether the values of a discrete variable are ordered. At
        the moment, no built-in method treats ordinal variables differently from
        nominal ones.

    .. attribute:: random_generator

        A local random number generator used by the method
        :obj:`~Variable.randomvalue()`.

    .. attribute:: default_meta_id

        A proposed (but not guaranteed) meta id to be used for that variable.
        For instance, when a tab-delimited file contains meta attributes and
        the existing variables are reused, they will have this id
        (instead of a new one assigned by :obj:`Orange.data.new_meta_id()`).

    .. attribute:: attributes

        A dictionary which allows the user to store additional information
        about the variable. All values should be strings. See the section
        about :ref:`storing additional information <attributes>`.

    .. method:: __call__(obj)

        Convert a string, number, or other suitable object into a variable
        value.

        :param obj: An object to be converted into a variable value
        :type obj: any suitable
        :rtype: :class:`Orange.data.Value`

    .. method:: randomvalue()

        Return a random value for the variable.

        :rtype: :class:`Orange.data.Value`

    .. method:: compute_value(inst)

        Compute the value of the variable given the instance by calling
        :obj:`~Variable.get_value_from` through a mechanism that
        prevents infinite recursive calls.

        :rtype: :class:`Orange.data.Value`

.. _discrete:
.. class:: Discrete

    Bases: :class:`Variable`

    Descriptor for discrete variables.

    .. attribute:: values

        A list with symbolic names for the variable's values. Values are stored
        as indices referring to this list, so modifying it instantly
        changes the (symbolic) names of values as they are printed out or
        referred to by the user.

        .. note::

            The size of the list is also used to indicate the number of
            possible values for this variable. Changing the size, especially
            shrinking the list, can crash Python. Also, do not add values
            to the list by calling its ``append`` or ``extend`` method;
            use the :obj:`add_value` method instead.

            It is also assumed that this attribute is always defined (but can
            be empty), so never set it to ``None``.

    .. attribute:: base_value

        Stores the base value for the variable as an index in ``values``.
        This can be, for instance, a "normal" value, such as "no
        complications" as opposed to abnormal "low blood pressure". The
        base value is used by certain statistics, continuization and,
        potentially, learning algorithms. The default is -1, which means that
        there is no base value.

    .. method:: add_value(s)

        Add a value with symbolic name ``s`` to ``values``. Always call
        this function instead of appending to ``values``.

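The relationship between stored indices and the ``values`` list can be
illustrated with a small stand-alone model. The sketch below does not use
Orange: ``ToyDiscrete`` and its ``to_index`` and ``to_name`` helpers are
hypothetical names, and skipping names that are already present in
``add_value`` is an assumption made for the illustration.

```python
class ToyDiscrete(object):
    """Hypothetical stand-in for a discrete descriptor (not Orange's class)."""

    def __init__(self, name, values=()):
        self.name = name
        self.values = list(values)

    def add_value(self, s):
        # Append only when the name is new, so existing indices stay valid.
        if s not in self.values:
            self.values.append(s)
        return self.values.index(s)

    def to_index(self, s):
        # Data would store this index rather than the symbolic name.
        return self.values.index(s)

    def to_name(self, index):
        return self.values[index]


size = ToyDiscrete("size", ["small", "medium"])
stored = size.to_index("medium")   # the data keeps only the index
size.add_value("big")              # growing the list leaves old indices valid
size.values[stored] = "mid"        # renaming changes how the stored index prints
```

Because only the index is stored, the renamed entry is what ``to_name(stored)``
returns afterwards, which mirrors the behaviour described for ``values``.
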
.. _continuous:
.. class:: Continuous

    Bases: :class:`Variable`

    Descriptor for continuous variables.

    .. attribute:: number_of_decimals

        The number of decimals used when the value is printed out, converted to
        a string or saved to a file.

    .. attribute:: scientific_format

        If ``True``, the value is printed in scientific format whenever it
        would have more than 5 digits. In this case, :obj:`number_of_decimals` is
        ignored.

    .. attribute:: adjust_decimals

        Tells Orange to monitor the number of decimals when the value is
        converted from a string (when the values are read from a file or
        converted by, e.g. ``inst[0]="3.14"``):

        * 0: the number of decimals is not adjusted automatically;
        * 1: the number of decimals is being adjusted and has already been
          adjusted at least once;
        * 2: automatic adjustment is enabled, but no values have been
          converted yet.

        By default, adjustment of the number of decimals goes as follows:

        * If the variable was constructed when data was read from a file,
          it will be printed with the same number of decimals as the
          largest number of decimals encountered in the file. If
          scientific notation occurs in the file,
          :obj:`scientific_format` will be set to ``True`` and scientific
          format will be used for values too large or too small.

        * If the variable is created in a script, it will have,
          by default, three decimal places. This can be changed either by
          setting the value from a string (e.g. ``inst[0]="3.14"``,
          but not ``inst[0]=3.14``) or by manually setting the
          :obj:`number_of_decimals`.

    .. attribute:: start_value, end_value, step_value

        The range used for :obj:`randomvalue`.

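The three states of ``adjust_decimals`` can be read as a small state machine.
The following sketch models that reading in plain Python; it is not Orange's
implementation. ``ToyContinuous``, ``from_string`` and ``to_string`` are
hypothetical names, scientific notation is ignored, and the assumption that
the first converted string replaces the default of three decimals is inferred
from the description above.

```python
class ToyContinuous(object):
    """Hypothetical model of the decimal-tracking rules (not Orange's code)."""

    def __init__(self, name):
        self.name = name
        self.number_of_decimals = 3   # default for variables created in a script
        self.adjust_decimals = 2      # enabled, nothing converted yet

    def from_string(self, text):
        """Convert text to a float, updating the number of decimals."""
        value = float(text)
        decimals = len(text.partition(".")[2])
        if self.adjust_decimals == 2:
            # Assumed: the first converted value replaces the default outright.
            self.number_of_decimals = decimals
            self.adjust_decimals = 1
        elif self.adjust_decimals == 1:
            # Later values can only increase the number of decimals.
            self.number_of_decimals = max(self.number_of_decimals, decimals)
        return value

    def to_string(self, value):
        return "%.*f" % (self.number_of_decimals, value)


x = ToyContinuous("x")
before = x.to_string(3.14159)   # formatted with the default three decimals
x.from_string("3.14")           # switches to two decimals, state becomes 1
after = x.to_string(2.5)
```

Setting ``adjust_decimals`` to 0 in this model leaves ``number_of_decimals``
untouched by ``from_string``, matching the first state in the list.
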
.. _String:
.. class:: String

    Bases: :class:`Variable`

    Descriptor for variables that contain strings. No method can use them for
    learning; some will raise errors or warnings, and others will
    silently ignore them. They can, however, be used as meta-attributes; if
    instances in a dataset have unique IDs, the most efficient way to store them
    is to read them as meta-attributes. In general, never use discrete
    attributes with many (say, more than 50) values. Such attributes are
    probably not of any use for learning and should be stored as string
    attributes.

    When converting strings into values and back, empty strings are treated
    differently than usual. For other types, an empty string denotes
    an undefined value, while :obj:`String` will take empty strings
    as empty strings, except when loading from or saving to a file.
    Empty strings in files are interpreted as undefined; to specify an empty
    string, enclose the string in double quotes; these are removed when the
    string is loaded.

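The file-loading rule for string fields can be sketched as a small helper.
This is an illustration only, not Orange's parser: ``parse_string_field`` and
the ``UNDEFINED`` marker are hypothetical, and stripping quotes from any
quoted field (not just the empty one) is an assumption.

```python
UNDEFINED = None  # hypothetical marker for an undefined value


def parse_string_field(field):
    """Interpret one string-valued field read from a file."""
    if field == "":
        return UNDEFINED          # empty field: undefined value
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]        # enclosing quotes are dropped on loading
    return field


empty = parse_string_field('""')      # explicitly quoted empty string
missing = parse_string_field("")      # nothing in the file: undefined
```
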
.. _Python:
.. class:: Python

    Bases: :class:`Variable`

    Base class for descriptors defined in Python. It is fully functional
    and can be used as a descriptor for attributes that contain arbitrary Python
    values. Since this is an advanced topic, PythonVariables are described on a
    separate page. !!TODO!!


.. _attributes:

Storing additional attributes
-----------------------------

All variables have a field :obj:`~Variable.attributes`, a dictionary
that can store additional string data.

.. literalinclude:: code/attributes.py

These attributes can only be saved to a .tab file. They are listed in the
third line in ``<name>=<value>`` format, after other attribute specifications
(such as "meta" or "class"), and are separated by spaces.

.. _variable_descriptor_reuse:

Reuse of descriptors
--------------------

There are situations when variable descriptors need to be reused. Typically, the
user loads some training examples, trains a classifier, and then loads a separate
test set. For the classifier to recognize the variables in the second data set,
the descriptors, not just the names, need to be the same.

When constructing new descriptors for data read from a file or during unpickling,
Orange checks whether an appropriate descriptor (with the same name and, in the
case of discrete variables, also values) already exists and reuses it. Explicitly
calling the above constructors, in contrast, always creates new descriptors and
thus new variables, even if a variable with the same name already exists.

The search for an existing variable is based on four attributes: the variable's name,
type, ordered values, and unordered values. As for the latter two, the values can
be explicitly ordered by the user, e.g. in the second line of the tab-delimited
file. For instance, sizes can be ordered as small, medium, and big.

The search for existing variables can end with one of the following statuses.

.. data:: Orange.data.variable.MakeStatus.NotFound (4)

    A variable with that name and type does not exist.

.. data:: Orange.data.variable.MakeStatus.Incompatible (3)

    There are variables with matching name and type, but their
    values are incompatible with the prescribed ordered values. For example,
    if the existing variable already has values ["a", "b"] and the new one
    wants ["b", "a"], the old variable cannot be reused. The existing list can,
    however, be extended with new values, so searching for ["a", "b", "c"] would
    succeed. Likewise, a search for ["a"] would be successful, since the extra
    existing value does not matter. The formal rule is thus that the values are
    compatible iff
    ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``.

.. data:: Orange.data.variable.MakeStatus.NoRecognizedValues (2)

    There is a matching variable, yet it has none of the values that the new
    variable will have (this is possible only if the new variable has
    no prescribed ordered values). For instance, we search for a variable
    "sex" with values "male" and "female", while there is a variable of the same
    name with values "M" and "F" (or even "no" and "yes"). Reuse of this
    variable is possible, though this should probably be a new variable since it
    evidently comes from a different data set. If we do decide to reuse the
    variable, the old variable will get some unneeded new values and the new one
    will inherit some from the old.

.. data:: Orange.data.variable.MakeStatus.MissingValues (1)

    There is a matching variable with some of the values that the new one
    requires, but some values are missing. This situation is neither uncommon
    nor suspicious: in case of separate training and testing data sets there may
    be values which occur in one set but not in the other.

.. data:: Orange.data.variable.MakeStatus.OK (0)

    There is a perfect match which contains all the prescribed values in the
    correct order. The existing variable may have some extra values, though.

Continuous variables can have only two statuses,
:obj:`~Orange.data.variable.MakeStatus.NotFound` or :obj:`~Orange.data.variable.MakeStatus.OK`.

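The status descriptions above can be condensed into a short classification
routine. The sketch below is a plain-Python reading of those rules, not
Orange's implementation. The function name ``match_status`` and the constant
names are hypothetical, and the handling of edge cases (for example, a request
with no values at all being reported as OK) is an assumption. The lookup by
name and type, which yields ``NotFound``, is left out.

```python
# Status codes with the numeric values listed above.
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def match_status(existing_values, ordered_values=None, unordered_values=None):
    """Classify how well an existing discrete variable fits a request."""
    existing = list(existing_values)
    ordered = list(ordered_values or [])
    unordered = list(unordered_values or [])

    # Formal compatibility rule from the Incompatible description.
    if existing[:len(ordered)] != ordered[:len(existing)]:
        return INCOMPATIBLE

    requested = ordered + unordered
    recognized = [v for v in requested if v in existing]
    if len(recognized) == len(requested):
        return OK                      # assumed to include an empty request
    if not recognized:
        return NO_RECOGNIZED_VALUES
    return MISSING_VALUES


status = match_status(["a", "b"], ["a"], ["c"])   # "c" is missing from the match
```
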
When loading the data using :obj:`Orange.data.Table`, Orange takes the safest
approach and, by default, reuses everything that is compatible up to
and including :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`.
Unintended reuse would be obvious from the variable having too many values,
which the user can notice and fix. More on that in the page on
:doc:`Orange.data.formats`.

There are two functions for reusing the variables instead of creating new ones.

.. function:: Orange.data.variable.make(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable or create a new one if none of the existing
    variables matches the given name, type and values.

    The optional ``create_new_on`` specifies the status at which a new variable is
    created. The status must be at most
    :obj:`~Orange.data.variable.MakeStatus.Incompatible` since incompatible (or
    non-existing) variables cannot be reused. If it is set lower, for instance
    to :obj:`~Orange.data.variable.MakeStatus.MissingValues`, a new variable is
    created even if there exists a variable which is only missing some of the
    values. If set to :obj:`~Orange.data.variable.MakeStatus.OK`, the function
    always creates a new variable.

    The function returns a tuple containing a variable descriptor and the
    status of the best matching variable. So, if ``create_new_on`` is set to
    :obj:`~Orange.data.variable.MakeStatus.MissingValues`, and there exists a
    variable whose status is, say,
    :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`, a new variable
    would be created, while the second element of the tuple would contain
    :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`. If, on the other
    hand, there exists a variable which is perfectly OK, its descriptor is
    returned and the returned status is
    :obj:`~Orange.data.variable.MakeStatus.OK`. The function returns no
    explicit indicator of whether the returned variable is reused or not. This
    can, however, be read from the status code: if it is smaller than the
    specified ``create_new_on``, the variable is reused, otherwise a new
    descriptor has been constructed.

    The exception to the rule is when ``create_new_on`` is
    :obj:`~Orange.data.variable.MakeStatus.OK`. In this case, the
    function does not search through the existing variables and cannot know the
    status, so the returned status in this case is always
    :obj:`~Orange.data.variable.MakeStatus.OK`.

    :param name: Variable name
    :param type: Variable type
    :type type: Orange.data.variable.Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of reusing an existing one

    :rtype: a tuple (:class:`~Orange.data.variable.Variable`, int)

.. function:: Orange.data.variable.retrieve(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable, or :obj:`None` if no match is found.

    :param name: variable name.
    :param type: variable type.
    :type type: Orange.data.variable.Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition under which an existing variable
        is not reused

    :rtype: :class:`~Orange.data.variable.Variable`

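The reuse rule described for ``create_new_on`` amounts to a single comparison
of status codes. The helper below is a hypothetical sketch of that rule, not
part of Orange. The name ``decide`` is invented here, and using
``Incompatible`` as the default threshold is an assumption based on the
statement that loading reuses everything up to and including
``NoRecognizedValues``.

```python
# Status codes with the numeric values listed above.
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def decide(best_status, create_new_on=INCOMPATIBLE):
    """Return (reuse, returned_status) for the best existing match."""
    if create_new_on == OK:
        # No search is performed in this case, so OK is always reported.
        return False, OK
    # Reuse only when the match is strictly better than the threshold.
    return best_status < create_new_on, best_status


reuse, status = decide(NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES)
```

With the threshold lowered to ``NoRecognizedValues``, a match with that status
is no longer reused, although the status itself is still reported.
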
The following examples give the shown results if
executed only once (in a Python session) and in this order.

:func:`Orange.data.variable.make` can be used for the construction of new variables. ::

    >>> v1, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b"])
    >>> print s, v1.values
    NotFound <a, b>

A new variable was created and the status is
:obj:`~Orange.data.variable.MakeStatus.NotFound`. ::

    >>> v2, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a"], ["c"])
    >>> print s, v2 is v1, v1.values
    MissingValues True <a, b, c>

The status is :obj:`~Orange.data.variable.MakeStatus.MissingValues`,
yet the variable is reused (``v2 is v1``). ``v1`` gets a new value,
``"c"``, which was given as an unordered value. It does
not matter that the new variable does not need the value ``b``. ::

    >>> v3, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"])
    >>> print s, v3 is v1, v1.values
    MissingValues True <a, b, c, d>

This is like before, except that the new value, ``d``, is given among the
ordered values. ::

    >>> v4, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, ["b"])
    >>> print s, v4 is v1, v1.values, v4.values
    Incompatible False <a, b, c, d> <b>

The new variable needs to have ``b`` as the first value, so it is incompatible
with the existing variables. The status is
:obj:`~Orange.data.variable.MakeStatus.Incompatible` and
a new variable is created; the two variables are not equal and have
different lists of values. ::

    >>> v5, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["c", "a"])
    >>> print s, v5 is v1, v1.values, v5.values
    OK True <a, b, c, d> <a, b, c, d>

The new variable has values ``c`` and ``a``, but the order is not important,
so the existing variable is :obj:`~Orange.data.variable.MakeStatus.OK`. ::

    >>> v6, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None, ["e"])
    >>> print s, v6 is v1, v1.values, v6.values
    NoRecognizedValues True <a, b, c, d, e> <a, b, c, d, e>

The new variable has different values than the existing variable (status
is :obj:`~Orange.data.variable.MakeStatus.NoRecognizedValues`),
but the existing one is nonetheless reused. Note that we
gave ``e`` in the list of unordered values. If it were among the ordered
values, the reuse would fail. ::

    >>> v7, s = Orange.data.variable.make("a", Orange.data.Type.Discrete, None,
    ...     ["f"], Orange.data.variable.MakeStatus.NoRecognizedValues)
    >>> print s, v7 is v1, v1.values, v7.values
    NoRecognizedValues False <a, b, c, d, e> <f>

This is the same as before, except that we prohibited reuse when there are no
recognized values. Hence a new variable is created, though the returned status is
the same as before. ::

    >>> v8, s = Orange.data.variable.make("a", Orange.data.Type.Discrete,
    ...     ["a", "b", "c", "d", "e"], None, Orange.data.variable.MakeStatus.OK)
    >>> print s, v8 is v1, v1.values, v8.values
    OK False <a, b, c, d, e> <a, b, c, d, e>

Finally, this is a perfect match, but any reuse is prohibited, so a new
variable is created.



Variables computed from other variables
---------------------------------------

Values of variables are often computed from other variables, such as in
discretization. The mechanism described below usually functions behind the scenes,
so understanding it is required only for implementing specific transformations.

Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``.
It can help the learning algorithm if the four-valued attribute ``e`` is
replaced with a binary attribute having values ``"1"`` and ``"not 1"``. The
new variable will be computed from the old one on the fly.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 7-17

The new variable is named ``e2``; we define it with a descriptor of type
:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``"1"`` (we
chose this order so that the index of ``"not 1"`` is ``0``, which can be, if
needed, interpreted as ``False``). Finally, we tell ``e2`` to use
``checkE`` to compute its value when needed, by assigning ``checkE`` to
``e2.get_value_from``.

``checkE`` is a function that is passed an instance and another argument we
do not care about here. If the instance's ``e`` equals ``1``, the function
returns value ``1``, otherwise it returns ``not 1``. Both are returned as
values, not plain strings.

In most circumstances the value of ``e2`` can be computed on the fly: we can
pretend that the variable exists in the data, although it does not (but
can be computed from it). For instance, we can compute the information gain of
variable ``e2`` or its distribution without actually constructing data containing
the new variable.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 19-22

There are methods which cannot compute values on the fly because it would be
too complex or time-consuming. In such cases, the data need to be converted
to a new :obj:`Orange.data.Table`::

    new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var])
    new_data = Orange.data.Table(new_domain, data)

Automatic computation is useful when the data is split into training and
testing examples. Training instances can be modified by adding, removing
and transforming variables (in a typical setup, continuous variables
are discretized prior to learning, therefore the original variables are
replaced by new ones). Test instances, on the other hand, are left as they
are. When they are classified, the classifier automatically converts the
testing instances into the new domain, which includes recomputation of
transformed variables.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 24-
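
The conversion of an instance into a domain that contains computed variables
can also be pictured without Orange. The following sketch is a simplified
model under stated assumptions: instances are plain dictionaries, a domain is
a list, and ``DerivedVariable``, ``convert`` and ``check_e`` are hypothetical
names. In particular, ``check_e`` takes a single argument and returns plain
strings, unlike the ``checkE`` function discussed above.

```python
class DerivedVariable(object):
    """Hypothetical descriptor whose value comes from other variables."""

    def __init__(self, name, get_value_from):
        self.name = name
        self.get_value_from = get_value_from


def convert(instance, domain):
    """Build a new instance (a dict) containing only the variables in domain.

    Entries of domain are either plain names, which are copied as they are,
    or DerivedVariable objects, whose values are computed from the instance.
    """
    converted = {}
    for var in domain:
        if isinstance(var, DerivedVariable):
            converted[var.name] = var.get_value_from(instance)
        else:
            converted[var] = instance[var]
    return converted


def check_e(instance):
    # Binary version of the four-valued attribute e.
    return "1" if instance["e"] == "1" else "not 1"


e2 = DerivedVariable("e2", check_e)
new_domain = ["a", "b", e2, "y"]

# A test instance in the original domain; it does not contain e2.
test_instance = {"a": "1", "b": "2", "c": "1", "d": "3",
                 "e": "1", "f": "2", "y": "1"}
converted = convert(test_instance, new_domain)
```

The test instance is left unchanged; the value of ``e2`` appears only in the
converted copy, which is the behaviour described for classifiers working in a
transformed domain.
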