.. Source: orange/docs/reference/rst/Orange.feature.descriptor.rst @ 9927:d6ca7b346864
   Revision 9927:d6ca7b346864, checked in by markotoplak, 2 years ago.
   Commit message: data.variable -> feature.
.. py:currentmodule:: Orange.feature

===========================
Descriptor (``Descriptor``)
===========================

Data instances in Orange can contain several types of variables:
:ref:`discrete <discrete>`, :ref:`continuous <continuous>`,
:ref:`string <string>`, and :ref:`Python <Python>` (including types derived
from it). The latter represent arbitrary Python objects.
The names, types, values (where applicable), functions for computing the
variable value from values of other variables, and other properties of the
variables are stored in descriptor classes derived from :obj:`Descriptor`.

Orange considers two variables (e.g. in two different data tables) the
same if they have the same descriptor. It is allowed - but not
recommended - to have different descriptors with the same name.

Descriptors can be constructed either by calling the corresponding
constructors or by the factory function :func:`~Descriptor.make`, which either
retrieves an existing descriptor or constructs a new one.

.. class:: Descriptor

    An abstract base class for variable descriptors.

    .. attribute:: name

        The name of the variable.

    .. attribute:: var_type

        Variable type; it can be :obj:`~Orange.feature.Type.Discrete`,
        :obj:`~Orange.feature.Type.Continuous`,
        :obj:`~Orange.feature.Type.String` or :obj:`~Orange.feature.Type.Other`.

    .. attribute:: get_value_from

        A function (an instance of :obj:`~Orange.classification.Classifier`)
        that computes a value of the variable from values of one or more
        other variables. This is used, for instance, in discretization,
        which computes the value of a discretized variable from the
        original continuous variable.

    .. attribute:: ordered

        A flag telling whether the values of a discrete variable are ordered. At
        the moment, no built-in method treats ordinal variables differently than
        nominal ones.

    .. attribute:: random_generator

        A local random number generator used by the method
        :obj:`~Descriptor.randomvalue()`.

    .. attribute:: default_meta_id

        A proposed (but not guaranteed) meta id to be used for that variable.
        For instance, when a tab-delimited file contains meta attributes and
        the existing variables are reused, they will have this id
        (instead of a new one assigned by :obj:`Orange.data.new_meta_id()`).

    .. attribute:: attributes

        A dictionary which allows the user to store additional information
        about the variable. All values should be strings. See the section
        about :ref:`storing additional information <attributes>`.

    .. method:: __call__(obj)

        Convert a string, number, or other suitable object into a variable
        value.

        :param obj: An object to be converted into a variable value
        :type obj: any suitable
        :rtype: :class:`Orange.data.Value`

    .. method:: randomvalue()

        Return a random value for the variable.

        :rtype: :class:`Orange.data.Value`

    .. method:: compute_value(inst)

        Compute the value of the variable given the instance by calling
        :obj:`~Descriptor.get_value_from` through a mechanism that
        prevents infinite recursive calls.

        :rtype: :class:`Orange.data.Value`
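
The recursion guard mentioned for :obj:`~Descriptor.compute_value` can be
pictured with a small plain-Python sketch. It does not use Orange:
``ToyDescriptor`` and its ``_computing`` flag are hypothetical names, and
raising ``RuntimeError`` when a cycle is detected is an assumption, since the
reaction of the real method is not described here.

.. code-block:: python

    class ToyDescriptor(object):
        """Hypothetical stand-in for a descriptor; not Orange's implementation."""

        def __init__(self, name, get_value_from=None):
            self.name = name
            self.get_value_from = get_value_from
            self._computing = False   # True while compute_value is running

        def compute_value(self, inst):
            if self.get_value_from is None:
                raise ValueError("%s has no get_value_from" % self.name)
            if self._computing:
                # get_value_from asked (directly or indirectly) for this very
                # variable again: stop instead of recursing forever.
                raise RuntimeError("circular computation of %s" % self.name)
            self._computing = True
            try:
                return self.get_value_from(inst)
            finally:
                # Always clear the flag so later calls work normally.
                self._computing = False


    # A variable computed from another field of a dict-like instance.
    double = ToyDescriptor("double", lambda inst: 2 * inst["x"])

    # A variable whose computation refers back to itself.
    loop = ToyDescriptor("loop")
    loop.get_value_from = lambda inst: loop.compute_value(inst)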

``Discrete``
------------

.. _discrete:
.. class:: Discrete

    Bases: :class:`Descriptor`

    Descriptor for discrete variables.

    .. attribute:: values

        A list with symbolic names for the variable's values. Values are stored
        as indices referring to this list, so modifying it instantly
        changes the (symbolic) names of values as they are printed out or
        referred to by the user.

        .. note::

            The size of the list is also used to indicate the number of
            possible values for this variable. Changing the size - especially
            shrinking the list - can crash Python. Also, do not add values
            to the list by calling its append or extend method:
            use the :obj:`add_value` method instead.

            It is also assumed that this attribute is always defined (but can
            be empty), so never set it to ``None``.

    .. attribute:: base_value

        Stores the base value for the variable as an index in :obj:`values`.
        This can be, for instance, a "normal" value, such as "no
        complications" as opposed to abnormal "low blood pressure". The
        base value is used by certain statistics, continuization and,
        potentially, learning algorithms. The default is -1, which means that
        there is no base value.

    .. method:: add_value(s)

        Add a value with symbolic name ``s`` to values. Always call
        this function instead of appending to ``values``.
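
The statement that values are stored as indices into :obj:`~Discrete.values`
can be illustrated without Orange. ``ToyDiscrete`` below is a hypothetical
stand-in: its ``add_value`` simply appends to a Python list, which is safe for
this toy but is exactly what the note above warns against doing on a real
descriptor.

.. code-block:: python

    class ToyDiscrete(object):
        """Hypothetical model of a discrete descriptor (not Orange code)."""

        def __init__(self, values):
            self.values = list(values)

        def add_value(self, s):
            # Register a new symbolic name and return its index.
            self.values.append(s)
            return len(self.values) - 1

        def to_str(self, index):
            # A stored value is only an index; the name is looked up on demand.
            return self.values[index]


    size = ToyDiscrete(["small", "medium"])
    stored = [0, 1, 0]                  # data holds indices, not strings
    stored.append(size.add_value("big"))

    size.values[0] = "S"                # renaming affects every stored value
    printed = [size.to_str(i) for i in stored]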

``Continuous``
--------------

.. _continuous:
.. class:: Continuous

    Bases: :class:`Descriptor`

    Descriptor for continuous variables.

    .. attribute:: number_of_decimals

        The number of decimals used when the value is printed out, converted to
        a string or saved to a file.

    .. attribute:: scientific_format

        If ``True``, the value is printed in scientific format whenever it
        would have more than 5 digits. In this case, :obj:`number_of_decimals` is
        ignored.

    .. attribute:: adjust_decimals

        Tells Orange to monitor the number of decimals when the value is
        converted from a string (when the values are read from a file or
        converted by, e.g. ``inst[0]="3.14"``):

        * 0: the number of decimals is not adjusted automatically;
        * 1: the number of decimals is adjusted automatically and has
          already been adjusted;
        * 2: automatic adjustment is enabled, but no values have been
          converted yet.

        By default, adjustment of the number of decimals goes as follows:

        * If the variable was constructed when data was read from a file,
          it will be printed with the same number of decimals as the
          largest number of decimals encountered in the file. If
          scientific notation occurs in the file,
          :obj:`scientific_format` will be set to ``True`` and scientific
          format will be used for values too large or too small.

        * If the variable is created in a script, it will have,
          by default, three decimal places. This can be changed either by
          setting the value from a string (e.g. ``inst[0]="3.14"``,
          but not ``inst[0]=3.14``) or by manually setting the
          :obj:`number_of_decimals`.

    .. attribute:: start_value, end_value, step_value

        The range used for :obj:`~Descriptor.randomvalue`.
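
The bookkeeping behind :obj:`~Continuous.adjust_decimals` might look roughly
like the following plain-Python sketch. ``DecimalTracker`` and
``count_decimals`` are hypothetical names. That the first converted string
replaces the default of three decimals is an assumption drawn from the
``inst[0]="3.14"`` remark, and scientific notation is not handled.

.. code-block:: python

    def count_decimals(text):
        """Return the number of digits after the decimal point in ``text``."""
        _, _, fraction = text.strip().partition(".")
        return len(fraction)


    class DecimalTracker(object):
        """Toy model of the three adjust_decimals states (not Orange code)."""

        def __init__(self):
            self.number_of_decimals = 3   # default for script-created variables
            self.adjust_decimals = 2      # enabled, nothing converted yet

        def convert(self, value):
            # Only string input carries information about the written precision.
            if isinstance(value, str) and self.adjust_decimals:
                seen = count_decimals(value)
                if self.adjust_decimals == 2:
                    # Assumed: the first string replaces the default.
                    self.number_of_decimals = seen
                    self.adjust_decimals = 1
                else:
                    # Later strings can only increase the precision.
                    self.number_of_decimals = max(self.number_of_decimals, seen)
            return float(value)

        def format(self, number):
            return "%.*f" % (self.number_of_decimals, number)


    tracker = DecimalTracker()
    before = tracker.format(3.14159)    # uses the default of three decimals
    tracker.convert(3.14)               # a float does not change anything
    tracker.convert("3.14")             # a string sets two decimals
    tracker.convert("1.2345")           # a longer string raises it to four
    after = tracker.format(1.0)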

``String``
----------

.. _String:

.. class:: String

    Bases: :class:`Descriptor`

    Descriptor for variables that contain strings. No method can use them for
    learning; some will raise errors or warnings, and others will
    silently ignore them. They can, however, be used as meta-attributes; if
    instances in a dataset have unique IDs, the most efficient way to store them
    is to read them as meta-attributes. In general, never use discrete
    attributes with many (say, more than 50) values. Such attributes are
    probably not of any use for learning and should be stored as string
    attributes.

    When converting strings into values and back, empty strings are treated
    differently than usual. For other types, an empty string denotes
    an undefined value, while :obj:`String` will take empty strings
    as empty strings -- except when loading from or saving to a file.
    Empty strings in files are interpreted as undefined; to specify an empty
    string, enclose the string in double quotes; these are removed when the
    string is loaded.

``Python``
----------

.. _Python:
.. class:: Python

    Bases: :class:`Descriptor`

    Base class for descriptors defined in Python. It is fully functional
    and can be used as a descriptor for attributes that contain arbitrary Python
    values. Since this is an advanced topic, PythonVariables are described on a
    separate page. !!TODO!!


.. _attributes:

Storing additional attributes
-----------------------------

All variables have a field :obj:`~Descriptor.attributes`, a dictionary
that can store additional string data.

.. literalinclude:: code/attributes.py

These attributes can only be saved to a .tab file. They are listed in the
third line in ``<name>=<value>`` format, after other attribute specifications
(such as "meta" or "class"), and are separated by spaces.
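
As a rough illustration of that layout, the following plain-Python sketch
builds the third-line entry of a single column. ``format_third_line`` is a
hypothetical helper, not part of Orange. Sorting the keys is a choice made here
for reproducible output, and quoting of values that contain spaces is not
covered.

.. code-block:: python

    def format_third_line(flags, attributes):
        """Join column flags and ``name=value`` pairs with single spaces."""
        parts = list(flags)
        for name in sorted(attributes):
            parts.append("%s=%s" % (name, attributes[name]))
        return " ".join(parts)


    entry = format_third_line(["meta"], {"unit": "cm", "source": "survey"})
    print(entry)   # meta source=survey unit=cm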

.. _variable_descriptor_reuse:

Reuse of descriptors
--------------------

There are situations when variable descriptors need to be
reused. Typically, the user loads some training examples, trains a
classifier, and then loads a separate test set. For the classifier to
recognize the variables in the second data set, the descriptors, not
just the names, need to be the same.

When constructing new descriptors for data read from a file or during
unpickling, Orange checks whether an appropriate descriptor (with the same
name and, in case of discrete variables, also values) already exists and
reuses it. When new descriptors are constructed by explicitly calling
the above constructors, this always creates new descriptors and thus
new variables, although a variable with the same name may already exist.

The search for an existing variable is based on four attributes: the
variable's name, type, ordered values, and unordered values. As for the
latter two, the values can be explicitly ordered by the user, e.g. in
the second line of the tab-delimited file. For instance, sizes can be
ordered as small, medium, or big.

The search for existing variables can end with one of the following
statuses.

.. data:: Descriptor.MakeStatus.NotFound (4)

    The variable with that name and type does not exist.

.. data:: Descriptor.MakeStatus.Incompatible (3)

    There are variables with matching name and type, but their
    values are incompatible with the prescribed ordered values. For example,
    if the existing variable already has values ["a", "b"] and the new one
    wants ["b", "a"], the old variable cannot be reused. The existing list can,
    however, be extended with new values, so searching for ["a", "b", "c"]
    would succeed. Likewise, a search for ["a"] would be successful, since the
    extra existing value does not matter. The formal rule is thus that the
    values are compatible iff
    ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``.

.. data:: Descriptor.MakeStatus.NoRecognizedValues (2)

    There is a matching variable, yet it has none of the values that the new
    variable will have (this is obviously possible only if the new variable has
    no prescribed ordered values). For instance, we search for a variable
    "sex" with values "male" and "female", while there is a variable of the same
    name with values "M" and "F" (or, well, "no" and "yes"). Reuse of this
    variable is possible, though this should probably be a new variable since it
    obviously comes from a different data set. If we do decide to reuse the
    variable, the old variable will get some unneeded new values and the new one
    will inherit some from the old.

.. data:: Descriptor.MakeStatus.MissingValues (1)

    There is a matching variable with some of the values that the new one
    requires, but some values are missing. This situation is neither uncommon
    nor suspicious: in case of separate training and testing data sets there may
    be values which occur in one set but not in the other.

.. data:: Descriptor.MakeStatus.OK (0)

    There is a perfect match which contains all the prescribed values in the
    correct order. The existing variable may have some extra values, though.

Continuous variables can obviously have only two statuses,
:obj:`~Descriptor.MakeStatus.NotFound` or :obj:`~Descriptor.MakeStatus.OK`.
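
The status descriptions above can be condensed into a short plain-Python
sketch. It is only a reading of this text, not Orange's implementation: the
function names and the use of ``None`` for "no such variable" are assumptions,
and the constants merely copy the numeric codes listed above.

.. code-block:: python

    NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


    def compatible(existing_values, ordered_values):
        """The formal rule: the shorter list must be a prefix of the longer."""
        n_old, n_new = len(existing_values), len(ordered_values)
        return existing_values[:n_new] == ordered_values[:n_old]


    def match_status(existing_values, ordered_values=None, unordered_values=None):
        """Classify one existing discrete variable against the requested values.

        ``existing_values`` is None when no variable of that name and type exists.
        """
        if existing_values is None:
            return NOT_FOUND
        ordered = list(ordered_values or [])
        unordered = list(unordered_values or [])
        if not compatible(existing_values, ordered):
            return INCOMPATIBLE
        wanted = ordered + unordered
        known = [v for v in wanted if v in existing_values]
        if len(known) == len(wanted):
            return OK                       # extra existing values do not matter
        if not known:
            return NO_RECOGNIZED_VALUES
        return MISSING_VALUES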

When loading the data using :obj:`Orange.data.Table`, Orange takes the safest
approach and, by default, reuses everything that is compatible up to
and including :obj:`~Descriptor.MakeStatus.NoRecognizedValues`. Unintended
reuse would be obvious from the variable having too many values, which the user
can notice and fix. More on that in the page on :doc:`Orange.data.formats`.

There are two functions for reusing the variables instead of creating new ones.

.. function:: Descriptor.make(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable or create a new one if none of the
    existing variables matches the given name, type and values.

    The optional ``create_new_on`` specifies the status at which a new variable
    is created. The status must be at most
    :obj:`~Descriptor.MakeStatus.Incompatible`, since incompatible (or
    non-existing) variables cannot be reused. If it is set lower, for instance
    to :obj:`~Descriptor.MakeStatus.MissingValues`, a new variable is created
    even if there exists a variable that is only missing some of the values. If
    set to :obj:`~Descriptor.MakeStatus.OK`, the function always creates a new
    variable.

    The function returns a tuple containing a variable descriptor and the
    status of the best matching variable. So, if ``create_new_on`` is set to
    :obj:`~Descriptor.MakeStatus.MissingValues`, and there exists a variable
    whose status is, say, :obj:`~Descriptor.MakeStatus.NoRecognizedValues`, a
    variable would be created, while the second element of the tuple would
    contain :obj:`~Descriptor.MakeStatus.NoRecognizedValues`. If, on the other
    hand, there exists a variable which is perfectly OK, its descriptor is
    returned and the returned status is :obj:`~Descriptor.MakeStatus.OK`. The
    function returns no explicit indicator of whether the returned variable is
    reused or not. This can, however, be read from the status code: if it is
    smaller than the specified ``create_new_on``, the variable is reused,
    otherwise a new descriptor has been constructed.

    The exception to the rule is when ``create_new_on`` is
    :obj:`~Descriptor.MakeStatus.OK`. In this case, the function does not search
    through the existing variables and cannot know the status, so the returned
    status in this case is always :obj:`~Descriptor.MakeStatus.OK`.

    :param name: Descriptor name
    :param type: Descriptor type
    :type type: Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: gives the condition for constructing a new variable
        instead of reusing an existing one

    :rtype: a tuple (:class:`~Descriptor`, int)

.. function:: Descriptor.retrieve(name, type, ordered_values, unordered_values[, create_new_on])

    Find and return an existing variable, or :obj:`None` if no match is found.

    :param name: variable name.
    :param type: variable type.
    :type type: Type
    :param ordered_values: a list of ordered values
    :param unordered_values: a list of values, for which the order does not
        matter
    :param create_new_on: the status at which an existing variable is no longer
        accepted as a match (see :func:`~Descriptor.make`)

    :rtype: :class:`~Descriptor`

The following examples give the shown results if
executed only once (in a Python session) and in this order.

:func:`~Descriptor.make` can be used for the construction of new variables. ::

    >>> import Orange
    >>> v1, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, ["a", "b"])
    >>> print s, v1.values
    NotFound <a, b>

A new variable was created and the status is :obj:`~Descriptor.MakeStatus.NotFound`. ::

    >>> v2, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, ["a"], ["c"])
    >>> print s, v2 is v1, v1.values
    MissingValues True <a, b, c>

The status is :obj:`~Descriptor.MakeStatus.MissingValues`,
yet the variable is reused (``v2 is v1``). ``v1`` gets a new value,
``"c"``, which was given as an unordered value. It does
not matter that the new variable does not need the value ``b``. ::

    >>> v3, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, ["a", "b", "c", "d"])
    >>> print s, v3 is v1, v1.values
    MissingValues True <a, b, c, d>

This is like before, except that the new value, ``d``, is given among the
ordered values. ::

    >>> v4, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, ["b"])
    >>> print s, v4 is v1, v1.values, v4.values
    Incompatible False <a, b, c, d> <b>

The new variable needs to have ``b`` as the first value, so it is incompatible
with the existing variables. The status is
:obj:`~Descriptor.MakeStatus.Incompatible` and
a new variable is created; the two variables are not equal and have
different lists of values. ::

    >>> v5, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, None, ["c", "a"])
    >>> print s, v5 is v1, v1.values, v5.values
    OK True <a, b, c, d> <a, b, c, d>

The new variable has values ``c`` and ``a``, but the order is not important,
so the existing variable matches and the status is
:obj:`~Descriptor.MakeStatus.OK`. ::

    >>> v6, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, None, ["e"])
    >>> print s, v6 is v1, v1.values, v6.values
    NoRecognizedValues True <a, b, c, d, e> <a, b, c, d, e>

The new variable has different values than the existing variable (status
is :obj:`~Descriptor.MakeStatus.NoRecognizedValues`),
but the existing one is nonetheless reused. Note that we
gave ``e`` in the list of unordered values. If it were among the ordered
values, the reuse would fail. ::

    >>> v7, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete, None,
    ...     ["f"], Orange.feature.Descriptor.MakeStatus.NoRecognizedValues)
    >>> print s, v7 is v1, v1.values, v7.values
    NoRecognizedValues False <a, b, c, d, e> <f>

This is the same as before, except that we prohibited reuse when there are no
recognized values. Hence a new variable is created, though the returned status is
the same as before::

    >>> v8, s = Orange.feature.Descriptor.make("a", Orange.feature.Type.Discrete,
    ...     ["a", "b", "c", "d", "e"], None, Orange.feature.Descriptor.MakeStatus.OK)
    >>> print s, v8 is v1, v1.values, v8.values
    OK False <a, b, c, d, e> <a, b, c, d, e>

Finally, this is a perfect match, but any reuse is prohibited, so a new
variable is created.
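
The reuse decisions in this sequence follow from the rule that a variable is
reused when the returned status is smaller than ``create_new_on``. The
plain-Python sketch below replays the eight calls without Orange.
``is_reused`` is a hypothetical helper, and treating
:obj:`~Descriptor.MakeStatus.Incompatible` as the default threshold is an
assumption inferred from the examples.

.. code-block:: python

    NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


    def is_reused(status, create_new_on=INCOMPATIBLE):
        """A descriptor is reused only if its status is below the threshold."""
        return status < create_new_on


    # (name, status returned by make, create_new_on used in the call)
    calls = [
        ("v1", NOT_FOUND, INCOMPATIBLE),
        ("v2", MISSING_VALUES, INCOMPATIBLE),
        ("v3", MISSING_VALUES, INCOMPATIBLE),
        ("v4", INCOMPATIBLE, INCOMPATIBLE),
        ("v5", OK, INCOMPATIBLE),
        ("v6", NO_RECOGNIZED_VALUES, INCOMPATIBLE),
        ("v7", NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES),
        ("v8", OK, OK),
    ]

    reused = dict((name, is_reused(status, limit)) for name, status, limit in calls)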


Variables computed from other variables
---------------------------------------

Values of variables are often computed from other variables, such as in
discretization. The mechanism described below usually functions behind the scenes,
so understanding it is required only for implementing specific transformations.

Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``.
It can help the learning algorithm if the four-valued attribute ``e`` is
replaced with a binary attribute having values ``"1"`` and ``"not 1"``. The
new variable will be computed from the old one on the fly.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 7-17

The new variable is named ``e2``; we define it with a descriptor of type
:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``"1"`` (we
chose this order so that the index of ``not 1`` is ``0``, which can be, if
needed, interpreted as ``False``). Finally, we tell ``e2`` to use
``checkE`` to compute its value when needed, by assigning ``checkE`` to
``e2.get_value_from``.

``checkE`` is a function that is passed an instance and another argument we
do not care about here. If the instance's ``e`` equals ``1``, the function
returns value ``1``, otherwise it returns ``not 1``. Both are returned as
values, not plain strings.
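
For readers without Orange at hand, the same idea can be mimicked in plain
Python. This is only an analogue: ``check_e`` and ``E2_VALUES`` are hypothetical
names, instances are modelled as dictionaries, and the function takes a single
argument and returns an index rather than an Orange value.

.. code-block:: python

    E2_VALUES = ["not 1", "1"]   # index 0 doubles as False, index 1 as True


    def check_e(inst):
        """Return the index into E2_VALUES for a dict-like instance."""
        return E2_VALUES.index("1" if inst["e"] == "1" else "not 1")


    rows = [
        {"a": "1", "b": "2", "e": "1"},
        {"a": "3", "b": "3", "e": "4"},
    ]

    # The derived column is computed on demand; the rows are left unchanged.
    computed = [E2_VALUES[check_e(row)] for row in rows]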

In most circumstances the value of ``e2`` can be computed on the fly - we can
pretend that the variable exists in the data, although it does not (but
can be computed from it). For instance, we can compute the information gain of
variable ``e2`` or its distribution without actually constructing data containing
the new variable.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 19-22

There are methods which cannot compute values on the fly because it would be
too complex or time consuming. In such cases, the data need to be converted
to a new :obj:`Orange.data.Table`::

    new_domain = Orange.data.Domain([data.domain["a"], data.domain["b"], e2, data.domain.class_var])
    new_data = Orange.data.Table(new_domain, data)

Automatic computation is useful when the data is split into training and
testing examples. Training instances can be modified by adding, removing
and transforming variables (in a typical setup, continuous variables
are discretized prior to learning, therefore the original variables are
replaced by new ones). Test instances, on the other hand, are left as they
are. When they are classified, the classifier automatically converts the
testing instances into the new domain, which includes recomputation of
transformed variables.

.. literalinclude:: code/variable-get_value_from.py
    :lines: 24-