source: orange/Orange/data/ @ 9671:a7b056375472

Revision 9671:a7b056375472, 21.8 KB checked in by anze <anze.staric@…>, 2 years ago

Moved orange to Orange (part 2)

========================
Variables (``variable``)
========================

Data instances in Orange can contain several types of variables:
:ref:`discrete <discrete>`, :ref:`continuous <continuous>`,
:ref:`strings <string>`, and :ref:`Python <Python>` and types derived from it.
The latter represent arbitrary Python objects.
The names, types, values (where applicable), functions for computing the
variable value from values of other variables, and other properties of the
variables are stored in descriptor classes defined in this module.

Variable descriptors
--------------------

Variable descriptors can be constructed either directly, using
constructors and passing attributes as parameters, or by a
factory function :func:``, which either
retrieves an existing descriptor or constructs a new one.

.. class:: Variable

    An abstract base class for variable descriptors.

    .. attribute:: name

        The name of the variable. Variable names do not need to be unique, since
        two variables are considered the same only if they have the same
        descriptor (even multiple variables in the same table can have the same
        name). This should, however, be avoided since it may result in
        unpredictable behavior.

    .. attribute:: var_type

        Variable type; it can be,
        , or

    .. attribute:: get_value_from

        A function (an instance of :obj:`Orange.classification.Classifier`) which
        computes a value of the variable from values of one or more other
        variables. This is used, for instance, in discretization, where the
        variables describing the discretized variable are computed from the
        original variable.

    .. attribute:: ordered

        A flag telling whether the values of a discrete variable are ordered. At
        the moment, no built-in method treats ordinal variables differently than
        nominal ones.

    .. attribute:: distributed

        A flag telling whether the values of the variables are distributions.
        As with the :obj:`ordered` flag, no methods treat such variables in any
        special manner.

    .. attribute:: random_generator

        A local random number generator used by method
        :obj:`Variable.random_value`.

    .. attribute:: default_meta_id

        A proposed (but not guaranteed) meta id to be used for that variable.
        This is used, for instance, by the data loader for the tab-delimited file
        format instead of assigning an arbitrary new value, or by
        :obj:`` if the variable is passed as an argument.

    .. attribute:: attributes

        A dictionary which allows the user to store additional information
        about the variable. All values should be strings. See the section
        about :ref:`storing additional information <attributes>`.

    .. method:: __call__(obj)

        Convert a string, number, or other suitable object into a variable
        value.

        :param obj: An object to be converted into a variable value
        :type obj: any suitable
        :rtype: :class:``

    .. method:: randomvalue()

        Return a random value for the variable.

        :rtype: :class:``

    .. method:: compute_value(inst)

        Compute the value of the variable given the instance by calling
        :obj:`~Variable.get_value_from` through a mechanism that prevents
        deadlocks caused by circular calls.

        :rtype: :class:``
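
The text does not say how ``compute_value`` guards against circular calls. The following is a minimal sketch of one possible guard, a re-entrancy flag, written in plain Python. ``ToyVariable`` and its ``_computing`` flag are hypothetical stand-ins for illustration and are not Orange's classes or its actual mechanism.

```python
class ToyVariable:
    """Hypothetical stand-in for a descriptor; not Orange's Variable class."""

    def __init__(self, name, get_value_from=None):
        self.name = name
        self.get_value_from = get_value_from
        self._computing = False  # True while a computation is in progress

    def compute_value(self, instance):
        if self.get_value_from is None:
            raise ValueError("%s has no get_value_from" % self.name)
        if self._computing:
            # We re-entered our own computation: a circular dependency.
            raise RuntimeError("circular computation of %s" % self.name)
        self._computing = True
        try:
            return self.get_value_from(instance)
        finally:
            # Always clear the flag, even if the computation failed.
            self._computing = False


# An ordinary computed variable.
double = ToyVariable("double", lambda inst: 2 * inst["x"])
doubled = double.compute_value({"x": 21})

# Two variables that are (wrongly) defined in terms of each other.
a = ToyVariable("a")
b = ToyVariable("b", lambda inst: a.compute_value(inst))
a.get_value_from = lambda inst: b.compute_value(inst)

try:
    a.compute_value({})
    cycle_detected = False
except RuntimeError:
    cycle_detected = True
```

With such a guard, the circular definition fails immediately instead of recursing without end, and the flags are reset so the descriptors remain usable afterwards.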

.. _discrete:
.. class:: Discrete

    Bases: :class:`Variable`

    Descriptor for discrete variables.

    .. attribute:: values

        A list with symbolic names for the variable's values. Values are stored
        as indices referring to this list. Therefore, modifying this list
        instantly changes the (symbolic) names of values as they are printed out
        or referred to by the user.

        .. note::

            The size of the list is also used to indicate the number of
            possible values for this variable. Changing the size, especially
            shrinking the list, can have disastrous effects and is therefore not
            recommended. Also, do not add values to the list by
            calling its append or extend method: call the :obj:`add_value`
            method instead.

            It is also assumed that this attribute is always defined (but can
            be empty), so never set it to None.

    .. attribute:: base_value

        Stores the base value for the variable as an index in `values`.
        This can be, for instance, a "normal" value, such as "no
        complications" as opposed to abnormal "low blood pressure". The
        base value is used by certain statistics, continuization and,
        potentially, learning algorithms. The default is -1, which means that
        there is no base value.

    .. method:: add_value

        Add a value to values. Always call this function instead of
        appending to values.
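
The index-based storage described for :obj:`Discrete` can be illustrated with a small plain-Python model. ``ToyDiscrete`` is a hypothetical class written for this sketch, not Orange's descriptor, and the behavior of its ``add_value`` (returning the index and skipping duplicates) is an assumption.

```python
class ToyDiscrete:
    """Hypothetical model of a discrete descriptor; not Orange's class."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)
        self.base_value = -1  # -1 means "no base value", as in the text

    def add_value(self, symbol):
        # Assumed behavior: append only unseen symbols, return the index.
        if symbol not in self.values:
            self.values.append(symbol)
        return self.values.index(symbol)

    def to_string(self, index):
        # Stored values are indices; the symbolic name is looked up on demand.
        return self.values[index]


size = ToyDiscrete("size", ["small", "medium"])
stored = [0, 1, 0]  # what the data would hold: indices, not strings

before = [size.to_string(i) for i in stored]
size.values[0] = "tiny"  # renaming a value changes how stored data prints
after = [size.to_string(i) for i in stored]

big_index = size.add_value("big")
```

Because the data hold only indices, renaming an entry of ``values`` changes every printed occurrence at once, which is also why shrinking the list can leave stored indices pointing nowhere.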

.. _continuous:
.. class:: Continuous

    Bases: :class:`Variable`

    Descriptor for continuous variables.

    .. attribute:: number_of_decimals

        The number of decimals used when the value is printed out, converted to
        a string or saved to a file.

    .. attribute:: scientific_format

        If ``True``, the value is printed in scientific format whenever it
        would have more than 5 digits. In this case, :obj:`number_of_decimals` is
        ignored.

    .. attribute:: adjust_decimals

        Tells Orange to monitor the number of decimals when the value is
        converted from a string (when the values are read from a file or
        converted by, e.g., ``inst[0]="3.14"``):

        * 0: the number of decimals is not adjusted automatically;
        * 1: the number of decimals is adjusted automatically and has already
          been adjusted;
        * 2: automatic adjustment is enabled, but no values have been converted yet.

        By default, adjustment of the number of decimals goes as follows:

        If the variable was constructed when data was read from a file, it will
        be printed with the same number of decimals as the largest number of
        decimals encountered in the file. If scientific notation occurs in the
        file, :obj:`scientific_format` will be set to ``True`` and scientific format
        will be used for values too large or too small.

        If the variable is created in a script, it will have, by default, three
        decimal places. This can be changed either by setting the value
        from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by
        manually setting the :obj:`number_of_decimals`.

    .. attribute:: start_value, end_value, step_value

        The range used for :obj:`randomvalue`.
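
The three states of ``adjust_decimals`` can be read as a small state machine. The sketch below models that reading in plain Python; ``ToyContinuous`` is hypothetical, the initial state for a script-created variable is assumed to be 2, and scientific notation is deliberately ignored.

```python
class ToyContinuous:
    """Hypothetical model of decimal tracking; not Orange's Continuous."""

    def __init__(self):
        self.number_of_decimals = 3  # default for script-created variables
        self.adjust_decimals = 2     # enabled, nothing converted yet

    def from_string(self, text):
        decimals = len(text.partition(".")[2])
        if self.adjust_decimals == 2:
            # First converted value: adopt its number of decimals.
            self.number_of_decimals = decimals
            self.adjust_decimals = 1
        elif self.adjust_decimals == 1:
            # Later values: keep the largest number of decimals seen.
            self.number_of_decimals = max(self.number_of_decimals, decimals)
        return float(text)

    def to_string(self, value):
        return "%.*f" % (self.number_of_decimals, value)


var = ToyContinuous()
default_text = var.to_string(3.14159)  # three decimals by default

x = var.from_string("3.14")            # a string carries its precision
adjusted_text = var.to_string(x)

var.from_string("2.5")                 # fewer decimals: nothing changes
var.from_string("1.2345")              # more decimals: precision grows
final_decimals = var.number_of_decimals
```

Assigning a float (``inst[0]=3.14`` in the text) carries no textual precision, which is why only conversion from a string can trigger the adjustment.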

.. _String:
.. class:: String

    Bases: :class:`Variable`

    Descriptor for variables that contain strings. No method can use them for
    learning; some will complain and others will silently ignore them when they
    encounter them. They can, however, be useful for meta-attributes; if
    instances in a dataset have unique IDs, the most efficient way to store them
    is to read them as meta-attributes. In general, never use discrete
    attributes with many (say, more than 50) values. Such attributes are
    probably not of any use for learning and should be stored as string
    attributes.

    When converting strings into values and back, empty strings are treated
    differently than usual. For other types, an empty string can be used to
    denote undefined values, while :obj:`String` will take empty strings
    as empty strings, except when loading from or saving to a file.
    Empty strings in files are interpreted as undefined; to specify an empty
    string, enclose the string in double quotes; these are removed when the
    string is loaded.
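
The file-loading rule for string values can be sketched as a small function. ``parse_string_cell`` is a hypothetical helper, and ``None`` merely stands in for Orange's undefined value; the real loader may handle quoting differently.

```python
def parse_string_cell(cell):
    """Interpret one string cell read from a file (illustrative only)."""
    if cell == "":
        return None  # an empty cell means "undefined"
    if len(cell) >= 2 and cell[0] == '"' and cell[-1] == '"':
        return cell[1:-1]  # surrounding double quotes are removed
    return cell


undefined = parse_string_cell("")
empty = parse_string_cell('""')
plain = parse_string_cell("patient-17")
```

Here ``undefined`` is ``None``, while ``empty`` is a genuine empty string, matching the distinction drawn above.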

.. _Python:
.. class:: Python

    Bases: :class:`Variable`

    Base class for descriptors defined in Python. It is fully functional
    and can be used as a descriptor for attributes that contain arbitrary Python
    values. Since this is an advanced topic, PythonVariables are described on a
    separate page. !!TODO!!

Variables computed from other variables
---------------------------------------

Values of variables are often computed from other variables, such as in
discretization. The mechanism described below usually works behind the scenes,
so understanding it is required only for implementing specific transformations.

Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``.
It can help the learning algorithm if the four-valued attribute ``e`` is
replaced with a binary attribute having values ``"1"`` and ``"not 1"``. The
new variable will be computed from the old one on the fly.

.. literalinclude:: code/
    :lines: 7-17

The new variable is named ``e2``; we define it with a descriptor of type
:obj:`Discrete`, with an appropriate name and values ``"not 1"`` and ``"1"`` (we
chose this order so that the index of ``not 1`` is ``0``, which can be, if
needed, interpreted as ``False``). Finally, we tell ``e2`` to use
``checkE`` to compute its value when needed, by assigning ``checkE`` to

``checkE`` is a function that is passed an instance and another argument we
do not care about here. If the instance's ``e`` equals ``1``, the function
returns value ``1``, otherwise it returns ``not 1``. Both are returned as
values, not plain strings.

In most circumstances the value of ``e2`` can be computed on the fly: we can
pretend that the variable exists in the data, although it does not (but
can be computed from it). For instance, we can compute the information gain of
variable ``e2`` or its distribution without actually constructing data containing
the new variable.

.. literalinclude:: code/
    :lines: 19-22

Some methods cannot compute values on the fly because it would be
too complex or time consuming. In such cases, the data need to be converted
to a new :obj:``::

    new_domain =[data.domain["a"], data.domain["b"], e2, data.domain.class_var])
    new_data =, data)

Automatic computation is useful when the data is split into training and
testing examples. Training instances can be modified by adding, removing
and transforming variables (in a typical setup, continuous variables
are discretized prior to learning, so the original variables are
replaced by new ones). Test instances, on the other hand, are left as they
are. When they are classified, the classifier automatically converts the
testing instances into the new domain, which includes recomputation of
transformed variables.

.. literalinclude:: code/
    :lines: 24-
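
The train/test workflow just described can be sketched in plain Python, under the assumption that a domain is simply a mapping from variable names to functions that compute them. ``discretize_age``, ``convert`` and ``ToyClassifier`` are hypothetical names, and the cut-off of 65 is arbitrary.

```python
def discretize_age(instance):
    # Hypothetical cut-off chosen for illustration only.
    return "old" if instance["age"] >= 65 else "young"


# The "new domain": each variable knows how to compute itself from raw data.
new_domain = {"age_group": discretize_age}


def convert(instance, domain):
    """Build an instance in the new domain by recomputing every variable."""
    return {name: compute(instance) for name, compute in domain.items()}


class ToyClassifier:
    """Remembers the domain it was trained on and converts inputs itself."""

    def __init__(self, domain, table):
        self.domain = domain
        self.table = table

    def __call__(self, raw_instance):
        converted = convert(raw_instance, self.domain)
        return self.table[converted["age_group"]]


# Training instances are transformed before "learning" a lookup table.
training = [({"age": 70}, "high"), ({"age": 30}, "low")]
table = {convert(inst, new_domain)["age_group"]: label
         for inst, label in training}
classifier = ToyClassifier(new_domain, table)

# The test instance is passed untouched; the classifier converts it.
prediction = classifier({"age": 81})
```

The caller never discretizes the test instance by hand; the conversion travels with the classifier.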

.. _attributes:

Storing additional variables
----------------------------

All variables have a field :obj:`~Variable.attributes`, a dictionary
which can contain strings. Although the current implementation allows
values of any type, we strongly advise using only strings. An example:

.. literalinclude:: code/

These attributes can only be saved to a .tab file. They are listed in the
third line in ``<name>=<value>`` format, after other attribute specifications
(such as "meta" or "class"), and are separated by spaces.
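
As a rough illustration of that layout, the following sketch splits one third-line cell into plain specifications and ``<name>=<value>`` pairs. ``parse_third_line_cell`` is a hypothetical helper, not Orange's loader, and it ignores any quoting or escaping the real format may support.

```python
def parse_third_line_cell(cell):
    """Split a third-line cell into flags and name=value attributes."""
    flags, attributes = [], {}
    for token in cell.split():
        name, separator, value = token.partition("=")
        if separator:
            attributes[name] = value  # values are kept as strings
        else:
            flags.append(token)       # e.g. "meta" or "class"
    return flags, attributes


flags, attributes = parse_third_line_cell("meta origin=survey unit=cm")
```

For the sample cell, ``flags`` holds ``"meta"`` and ``attributes`` maps ``origin`` and ``unit`` to their string values.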

.. _variable_descriptor_reuse:

Reuse of descriptors
--------------------

There are situations when variable descriptors need to be reused. Typically, the
user loads some training examples, trains a classifier, and then loads a separate
test set. For the classifier to recognize the variables in the second data set,
the descriptors, not just the names, need to be the same.

When constructing new descriptors for data read from a file or during unpickling,
Orange checks whether an appropriate descriptor (with the same name and, in case
of discrete variables, also values) already exists and reuses it. Explicitly
calling the above constructors, by contrast, always creates new descriptors and
thus new variables, even if a variable with the same name already exists.

The search for an existing variable is based on four attributes: the variable's name,
type, ordered values, and unordered values. As for the latter two, the values can
be explicitly ordered by the user, e.g. in the second line of the tab-delimited
file. For instance, sizes can be ordered as small, medium, and big.

The search for existing variables can end with one of the following statuses.

.. data:: (4)

    The variable with that name and type does not exist.

.. data:: (3)

    There are variables with matching name and type, but their
    values are incompatible with the prescribed ordered values. For example,
    if the existing variable already has values ["a", "b"] and the new one
    wants ["b", "a"], the old variable cannot be reused. The existing list can,
    however, be extended with the new values, so searching for ["a", "b", "c"] would
    succeed. Likewise, a search for ["a"] would be successful, since the extra
    existing value does not matter. The formal rule is thus that the values are
    compatible iff
    ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``.
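
The formal rule can be checked directly against the three cases mentioned in the text. The function name ``ordered_values_compatible`` is made up for this sketch; only the slicing expression comes from the rule itself.

```python
def ordered_values_compatible(existing_values, ordered_values):
    """Apply the compatibility rule quoted above to two lists of values."""
    return (existing_values[:len(ordered_values)]
            == ordered_values[:len(existing_values)])


existing = ["a", "b"]
swapped = ordered_values_compatible(existing, ["b", "a"])        # conflicting order
extended = ordered_values_compatible(existing, ["a", "b", "c"])  # appends "c"
shorter = ordered_values_compatible(existing, ["a"])             # extra "b" is harmless
```

Both slices are cut to the length of the shorter list, so the rule amounts to requiring that one list be a prefix of the other.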

.. data:: (2)

    There is a matching variable, yet it has none of the values that the new
    variable will have (this is obviously possible only if the new variable has
    no prescribed ordered values). For instance, we search for a variable
    "sex" with values "male" and "female", while there is a variable of the same
    name with values "M" and "F" (or even "no" and "yes"). Reuse of this
    variable is possible, though this should probably be a new variable since it
    obviously comes from a different data set. If we do decide to reuse the variable, the
    old variable will get some unneeded new values and the new one will inherit
    some from the old.

.. data:: (1)

    There is a matching variable with some of the values that the new one
    requires, but some values are missing. This situation is neither uncommon
    nor suspicious: in case of separate training and testing data sets there may
    be values which occur in one set but not in the other.

.. data:: (0)

    There is a perfect match which contains all the prescribed values in the
    correct order. The existing variable may have some extra values, though.
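
The five statuses can be summarized as a classification function. This is a plain-Python reading of the descriptions above for a single candidate whose name and type already match; the constant names are hypothetical (only the numeric codes come from the text), and Orange's actual search may differ in edge cases.

```python
# Hypothetical names for the numeric status codes listed above.
PERFECT_MATCH = 0
SOME_VALUES_MISSING = 1
NO_SHARED_VALUES = 2
INCOMPATIBLE = 3
NOT_FOUND = 4


def match_status(existing_values, ordered_values=None, unordered_values=None):
    """Classify one candidate; pass None when no candidate exists."""
    if existing_values is None:
        return NOT_FOUND
    ordered = list(ordered_values or [])
    wanted = ordered + list(unordered_values or [])
    # Prescribed ordered values must agree with the existing order.
    if existing_values[:len(ordered)] != ordered[:len(existing_values)]:
        return INCOMPATIBLE
    if all(value in existing_values for value in wanted):
        return PERFECT_MATCH
    if not any(value in existing_values for value in wanted):
        return NO_SHARED_VALUES
    return SOME_VALUES_MISSING


abcd = ["a", "b", "c", "d"]
statuses = [
    match_status(None, ["a", "b"]),            # no variable yet
    match_status(["a", "b"], ["a"], ["c"]),    # "c" is missing
    match_status(abcd, ["b"]),                 # "b" cannot come first
    match_status(abcd, None, ["c", "a"]),      # everything is present
    match_status(abcd, None, ["e"]),           # nothing in common
]
```

The sample calls use the same value lists as the worked examples later in this section and yield the same status codes.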

Continuous variables can obviously have only two statuses,
:obj:`` or :obj:``.

When loading the data using :obj:``, Orange takes the safest
approach and, by default, reuses everything that is compatible up to
and including :obj:``. Unintended reuse would be obvious from the
variable having too many values, which the user can notice and fix. More on that
in the page on `loading data`. !!TODO!!

There are two functions for reusing the variables instead of creating new ones.

.. function::, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable or create a new one if none of the existing
    variables matches the given name, type and values.

    The optional `create_new_on` specifies the status at which a new variable is
    created. The status must be at most :obj:`` since incompatible (or
    non-existing) variables cannot be reused. If it is set lower, for instance
    to :obj:``, a new variable is created even if there exists
    a variable which is only missing some values. If set to :obj:``, the function
    always creates a new variable.

    The function returns a tuple containing a variable descriptor and the
    status of the best matching variable. So, if ``create_new_on`` is set to
    :obj:``, and there exists a variable whose status is, say,
    :obj:``, a variable would be created, while the second
    element of the tuple would contain :obj:``. If, on the other
    hand, there exists a variable which is perfectly OK, its descriptor is
    returned and the returned status is :obj:``. The function returns no
    indicator of whether the returned variable is reused or not. This can,
    however, be read from the status code: if it is smaller than the specified
    ``create_new_on``, the variable is reused, otherwise a new descriptor has been constructed.

    The exception to the rule is when ``create_new_on`` is OK. In this case, the
    function does not search through the existing variables and cannot know the
    status, so the returned status in this case is always :obj:``.

    :param name: Variable name
    :param type: Variable type
    :type type:
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of reusing an existing one

    :return_type: a tuple (:class:``, int)
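
The decision rule (reuse when the best status is smaller than ``create_new_on``) can be written down in a few lines. ``toy_make`` and ``build_new`` are hypothetical, descriptors are modelled as dictionaries, and the threshold of 3 used for the first call is an assumption inferred from the examples in this section, where status 2 is reused by default and status 3 is not. The special case described for ``create_new_on`` being OK is not modelled.

```python
def toy_make(best_match, best_status, create_new_on, build_new):
    """Return (descriptor, status) following the reuse rule described above."""
    if best_match is not None and best_status < create_new_on:
        return best_match, best_status  # reuse the existing descriptor
    return build_new(), best_status     # status still describes the best match


existing = {"name": "a", "values": ["a", "b", "c", "d"]}


def build_new():
    # A fresh, independent descriptor.
    return {"name": "a", "values": []}


# Best match has status 2; the assumed default threshold 3 allows reuse.
reused, status = toy_make(existing, 2, 3, build_new)

# Lowering the threshold to 2 forbids reuse, yet the status is unchanged.
fresh, fresh_status = toy_make(existing, 2, 2, build_new)
```

As in the description, the returned status alone does not say whether the descriptor was reused; comparing it with ``create_new_on`` does.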

.. function::, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable, or :obj:`None` if no match is found.

    :param name: variable name.
    :param type: variable type.
    :type type:
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable instead
        of reusing an existing one

    :return_type: :class:``

The following examples (from :download:` <code/>`) give the shown results if
executed only once (in a Python session) and in this order.

:func:`` can be used for the construction of new variables. ::

    >>> v1, s ="a",, ["a", "b"])
    >>> print s, v1.values
    4 <a, b>

No surprises here: a new variable is created and the status is :obj:``. ::

    >>> v2, s ="a",, ["a"], ["c"])
    >>> print s, v2 is v1, v1.values
    1 True <a, b, c>

The status is 1 (:obj:``), yet the variable is reused (``v2 is v1``).
``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does
not matter that the new variable does not need the value ``b``. ::

    >>> v3, s ="a",, ["a", "b", "c", "d"])
    >>> print s, v3 is v1, v1.values
    1 True <a, b, c, d>

This is like before, except that the new value, ``d``, is given among the
ordered values rather than the unordered ones. ::

    >>> v4, s ="a",, ["b"])
    >>> print s, v4 is v1, v1.values, v4.values
    3 False <a, b, c, d> <b>

The new variable needs to have ``b`` as the first value, so it is incompatible
with the existing variables. The status is thus 3 (:obj:``), the two
variables are not equal and have different lists of values. ::

    >>> v5, s ="a",, None, ["c", "a"])
    >>> print s, v5 is v1, v1.values, v5.values
    0 True <a, b, c, d> <a, b, c, d>

The new variable has values ``c`` and ``a``, but the order is not important,
so the existing attribute is :obj:``. ::

    >>> v6, s ="a",, None, ["e"]) "a"])
    >>> print s, v6 is v1, v1.values, v6.values
    2 True <a, b, c, d, e> <a, b, c, d, e>

The new variable has different values than the existing variable (status is 2,
:obj:``), but the existing one is nonetheless reused. Note that we
gave ``e`` in the list of unordered values. If it were among the ordered values,
the reuse would fail. ::

    >>> v7, s ="a",, None,
            ["f"],
    >>> print s, v7 is v1, v1.values, v7.values
    2 False <a, b, c, d, e> <f>

This is the same as before, except that we prohibited reuse when there are no
recognized values. Hence a new variable is created, though the returned status is
the same as before::

    >>> v8, s ="a",,
            ["a", "b", "c", "d", "e"], None,
    >>> print s, v8 is v1, v1.values, v8.values
    0 False <a, b, c, d, e> <a, b, c, d, e>

Finally, this is a perfect match, but any reuse is prohibited, so a new
variable is created.

from orange import Variable
from orange import EnumVariable as Discrete
from orange import FloatVariable as Continuous
from orange import PythonVariable as Python
from orange import StringVariable as String

from orange import VarList as Variables

import orange
new_meta_id = orange.newmetaid
make = orange.Variable.make
retrieve = orange.Variable.get_existing
MakeStatus = orange.Variable.MakeStatus
del orange