Changeset 7338:488bf5765c7a in orange

02/03/11 19:54:31 (3 years ago)
janezd <janez.demsar@…>
  • finished documentation on; still needs to be checked
2 added
2 edited


  • orange/Orange/data/

    r7260 r7338  
    2 Data instances in Orange can contain four types of features: `discrete`_, 
    3 continuous_, strings_ and Python_; the latter represent arbitrary Python objects. 
     2Data instances in Orange can contain several types of features: 
     3:ref:`discrete <discrete>`, :ref:`continuous <continuous>`, 
     4:ref:`strings <string>`, and :ref:`Python <Python>` and types derived from it. 
     5The latter represent arbitrary Python objects. 
    46The names, types, values (where applicable), functions for computing the 
    57feature value from other features, and other properties of the 
    3032    .. attribute:: getValueFrom 
    31     A function (an instance of `Orange.core.Clasifier`) which computes a 
     34    A function (an instance of :obj:`Orange.core.Classifier`) which computes a 
    3235    value of the feature from values of one or more other features. This is 
    3336    used, for instance, in discretization where the features describing the  
    3639    .. attribute:: ordered 
    3741    A flag telling whether the values of a discrete feature are ordered. At the  
    3842    moment, no builtin method treats ordinal features differently than nominal. 
    4044    .. attribute:: distributed 
    4146    A flag telling whether the values of this feature are distributions. 
    4247    As for flag ordered, no methods treat such features in any special manner. 
    4449    .. attribute:: randomGenerator 
    4551    A local random number generator used by method :obj:`Feature.randomvalue`. 
    4753    .. attribute:: defaultMetaId 
    4855    A proposed (but not guaranteed) meta id to be used for that feature. This is  
    4956    used, for instance, by the data loader for tab-delimited file format instead  
    50     of assigning an arbitrary new value, or by `Orange.core.newmetaid` if the 
     57    of assigning an arbitrary new value, or by :obj:`Orange.core.newmetaid` if the 
    5158    feature is passed as an argument.  
    5460    .. method:: __call__(obj) 
    5562       Convert a string, number or other suitable object into a feature value. 
    5664       :param obj: An object to be converted into a feature value 
    5765       :type obj: any suitable 
    6068    .. method:: randomvalue() 
    6170       Return a random value of the feature 
    6272       :rtype: :class:`` 
    64     .. method:: computeValue() 
    65        Calls getValueFrom through a mechanism that prevents deadlocks by circular calls. 
     74    .. method:: computeValue(inst) 
     76       Compute the value of the feature given the instance by calling 
     77       `getValueFrom` through a mechanism that prevents deadlocks by 
     78       circular calls. 
     80       :rtype: :class:`` 
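
How Orange guards against circular calls is not described here. One common approach is a re-entrancy flag, sketched below in plain Python; ``ToyFeature``, ``get_value_from`` and ``compute_value`` are hypothetical stand-ins and not part of the Orange API.

```python
class ToyFeature(object):
    """Hypothetical stand-in for a descriptor with a getValueFrom-like hook."""

    def __init__(self, name, get_value_from=None):
        self.name = name
        self.get_value_from = get_value_from
        self._computing = False  # set while a computation is in progress

    def compute_value(self, instance):
        if self.get_value_from is None:
            return None
        if self._computing:
            # We re-entered while still computing: the definitions are circular.
            raise RuntimeError("circular computation of %r" % self.name)
        self._computing = True
        try:
            return self.get_value_from(instance)
        finally:
            self._computing = False


# A well-behaved derived feature.
double_x = ToyFeature("double_x", lambda inst: inst["x"] * 2)

# A feature whose value is (wrongly) defined in terms of itself.
loop = ToyFeature("loop")
loop.get_value_from = lambda inst: loop.compute_value(inst)
```

Calling ``loop.compute_value({})`` raises ``RuntimeError`` instead of recursing forever, while ``double_x.compute_value({"x": 3})`` works as usual.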
     82.. _discrete: 
     83.. class:: Discrete 
     85    Descriptor for discrete features. 
     87    .. attribute:: values 
     89    A list with symbolic names for feature's values. Values are stored as 
     90    indices referring to this list. Therefore, modifying this list instantly 
     91    changes (symbolic) names of values as they are printed out or referred to 
     92    by user. 
     94    .. note:: 
     96        The size of the list is also used to indicate the number of possible values  
     97        for this feature. Changing the size, especially shrinking the list, can have  
     98        disastrous effects and is therefore not recommended. Also, do not  
     99        add values to the list by calling its append or extend method: call  
     100        the :obj:`addValue` method described below instead. 
     102        It is also assumed that this attribute is always defined (but can be empty),  
     103        so never set it to None. 
     105    .. attribute:: baseValue 
     107    Stores the base value for the feature as an index into `values`. This can 
     108    be, for instance, a "normal" value, such as "no complications" as opposed to  
     109    abnormal "low blood pressure". The base value is used by certain statistics,  
     110    continuization and, potentially, learning algorithms. Default is -1 and  
     111    means that there is no base value. 
     113    .. method:: addValue 
     115    Adds a value to values. Always call this function instead of appending to 
     116    values. 
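
The index-based storage described above can be illustrated with a toy model in plain Python. ``ToyDiscrete`` and its methods are hypothetical and only mimic the documented behaviour; the real descriptor may keep additional bookkeeping, which is presumably why the text asks for :obj:`addValue` rather than ``append``.

```python
class ToyDiscrete(object):
    """Hypothetical model of a discrete descriptor; not the Orange class."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)  # symbolic names, addressed by index
        self.base_value = -1        # -1 means "no base value"

    def add_value(self, symbol):
        # Single entry point for growing the list of values.
        self.values.append(symbol)
        return len(self.values) - 1

    def to_string(self, index):
        return self.values[index]


size = ToyDiscrete("size", ["small", "medium"])
stored = [0, 1, 0]  # data hold indices, not strings

before = [size.to_string(i) for i in stored]
size.values[0] = "tiny"  # renaming a symbol changes how stored data print
after = [size.to_string(i) for i in stored]
big_index = size.add_value("big")
```

The stored indices never change; only the symbolic names they resolve to do.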
     118.. _continuous: 
     119.. class:: Continuous 
     121    Descriptor for continuous features. 
     123    .. attribute:: numberOfDecimals 
     125    The number of decimals used when the value is printed out, converted to a 
     126    string or saved to a file. 
     128    .. attribute:: scientificFormat 
     130    If ``True``, the value is printed in scientific format whenever it would 
     131    have more than 5 digits. In this case, `numberOfDecimals` is ignored. 
     133    .. attribute:: adjustDecimals 
     135    Tells Orange to monitor the number of decimals when the value is converted 
     136    from a string (when the values are read from a file or converted by, e.g. 
     137    ``inst[0]="3.14"``). The value of ``0`` means that the number of decimals  
     138    should not be adjusted, while 1 and 2 mean that adjustments are on, with 2  
     139    denoting that no values have been converted yet. 
     141    By default, adjustment of number of decimals goes as follows. 
     143    If the feature was constructed when data was read from a file, it will be 
     144    printed with the same number of decimals as the largest number of decimals 
     145    encountered in the file. If scientific notation occurs in the file, 
     146    `scientificFormat` will be set to ``True`` and scientific format will be used 
     147    for values too large or too small. 
     149    If the feature is created in a script, it will have, by default, three 
     150    decimal places. This can be changed either by setting the value 
     151    from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by 
     152    manually setting the `numberOfDecimals`. 
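
As a rough illustration of the adjustment described above, the following plain-Python sketch derives the number of decimals from the string forms of the values. The helper names are hypothetical, and the actual logic inside Orange may differ.

```python
def decimals_in(text):
    """Count digits after the decimal point in a plain decimal string.

    Returns None for scientific notation, which this sketch does not count.
    """
    text = text.strip()
    if "e" in text.lower():
        return None
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def decimals_for_column(strings):
    """Largest number of decimals seen among the given strings."""
    counts = [decimals_in(s) for s in strings]
    counts = [c for c in counts if c is not None]
    return max(counts) if counts else 0


column = ["3.14", "2.5", "7"]
number_of_decimals = decimals_for_column(column)
```

Only string input carries this information: the float ``3.14`` has no recorded number of decimals, which matches the remark that ``inst[0]=3.14`` does not trigger the adjustment.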
     154    .. attribute:: startValue, endValue, stepValue 
     156    The range used for :obj:`randomvalue`. 
     158.. _String: 
     159.. class:: String 
     161    Descriptor for features that contain strings. No method can use them for  
     162    learning; some will complain and others will silently ignore them when they  
     163    encounter them. They can, however, be useful for meta-attributes; if  
     164    instances in a dataset have unique ids, the most efficient way to store them  
     165    is to read them as meta-attributes. In general, never use discrete  
     166    attributes with many (say, more than 50) values. Such attributes are  
     167    probably not of any use for learning and should be stored as string attributes. 
     169    When converting strings into values and back, empty strings are treated  
     170    differently than usual. For other types, an empty string can be used to 
     171    denote undefined values, while :obj:`StringVariable` will take empty string as 
     172    an empty string -- that is, except when loading or saving into file. Empty 
     173    strings in files are interpreted as undefined; to specify an empty string, 
     174    enclose the string in double quotes; these get removed when the string is 
     175    loaded. 
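
The file-reading rule above can be sketched in plain Python as follows. ``parse_string_cell`` and ``UNDEFINED`` are hypothetical names; how quoted non-empty strings are handled is not stated here, so the sketch leaves them untouched.

```python
UNDEFINED = None  # stand-in for an undefined (missing) value


def parse_string_cell(cell):
    """Interpret one cell of a string column as it is read from a file."""
    if cell == "":
        return UNDEFINED  # an empty cell means "undefined"
    if cell == '""':
        return ""         # a quoted empty string is a real empty string
    return cell           # assumption: everything else is kept verbatim


cells = ["", '""', "abc"]
parsed = [parse_string_cell(c) for c in cells]
```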
     177.. _Python: 
     178.. class:: Python 
     180    Base class for descriptors defined in Python. It is fully functional, 
     181    and can be used as a descriptor for attributes that contain arbitrary Python 
     182    values. Since this is an advanced topic, PythonVariables are described on a  
     183    separate page. !!TODO!! 
     186Features computed from other features 
     189Values of features are often computed from other features, such as in 
     190discretization. The mechanism described below usually occurs behind the scenes, 
     191so understanding it is required only for implementing specific transformations. 
     193Monk 1 is a well-known dataset with target concept ``y := a==b or e==1``. 
     194It can help the learning algorithm if the four-valued  
     195attribute ``e`` is replaced with a binary attribute having values `"1"` and `"not 1"`. The 
     196new feature will be computed from the old one on the fly.  
     198.. literalinclude:: code/ 
     199    :lines: 7-17 
     201The new feature is named ``e2``; we define it with a descriptor of type  
     202:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``1`` (we  
     203chose this order so that the ``not 1``'s index is ``0``, which can be, if  
     204needed, interpreted as ``False``). Finally, we tell e2 to use  
     205``checkE`` to compute its value when needed, by assigning ``checkE`` to  
     208``checkE`` is a function that is passed an instance and another argument we  
     209don't care about here. If the instance's ``e`` equals ``1``, the function  
     210returns value ``1``, otherwise it returns ``not 1``. Both are returned as  
     211values, not plain strings.  
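
The idea can be mimicked without Orange in a few lines of plain Python. In this sketch instances are ordinary dictionaries, ``check_e`` takes only the instance (the second argument mentioned above is omitted), and values are represented by their indices into ``E2_VALUES``; all of these names are illustrative assumptions.

```python
# Values of the derived feature; "not 1" comes first so that its index is 0.
E2_VALUES = ["not 1", "1"]


def check_e(instance):
    """Return the symbolic value of e2 for one instance."""
    return "1" if instance["e"] == "1" else "not 1"


def e2_index(instance):
    """Return the value of e2 as an index into E2_VALUES."""
    return E2_VALUES.index(check_e(instance))


rows = [
    {"a": "1", "b": "1", "e": "3"},
    {"a": "1", "b": "2", "e": "1"},
]
# The data never store e2; it is derived whenever it is asked for.
e2_indices = [e2_index(row) for row in rows]
```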
     213In most circumstances, the value of ``e2`` can be computed on the fly: we can  
     214pretend that the feature exists in the data, although it doesn't (but  
     215can be computed from it). For instance, we can compute the information gain of 
     216feature ``e2`` or its distribution without actually constructing data containing 
     217the new feature. 
     219.. literalinclude:: code/ 
     220    :lines: 19-22 
     222There are methods which cannot compute values on the fly because it would be 
     223too complex or time consuming. In such cases, the data need to be converted 
     224to a new :obj:``:: 
     226    newDomain = orange.Domain([data.domain["a"], data.domain["b"], e2, data.domain.classVar]) 
     227    newData = orange.ExampleTable(newDomain, data)  
     229Automatic computation is useful when the data is split into training and  
     230testing examples. Training instances can be modified by adding, removing  
     231and transforming features (in a typical setup, continuous features  
     232are discretized prior to learning, therefore the original features are  
     233replaced by new ones), while test instances are left as they  
     234are. When they are classified, the classifier automatically converts the  
     235testing instances into the new domain, which includes recomputation of  
     236transformed features.  
     238.. literalinclude:: code/ 
     239    :lines: 24- 
     241Reuse of Descriptors 
     244There are situations when feature descriptors need to be reused. Typically, the  
     245user loads some training examples, trains a classifier and then loads a separate 
     246test set. For the classifier to recognize the features in the second data set, 
     247the descriptors, not just the names, need to be the same.  
     249When constructing new descriptors for data read from a file or at unpickling, 
     250Orange checks whether an appropriate descriptor (with the same name and, in case 
     251of discrete features, also values) already exists and reuses it. When new 
     252descriptors are constructed by explicitly calling the above descriptors, this 
     253always creates new descriptors and thus new features, although the feature with 
     254the same name may already exist. 
     256The search for existing feature is based on four attributes: the feature's name, 
     257type, ordered values and unordered values. As for the latter two, the values can  
     258be explicitly ordered by the user, e.g. in the second line of the tab-delimited  
     259file, for instance to order sizes as small-medium-big. 
     261The search for existing variables can end with one of the following statuses. 
     262 (4) 
     264    The feature with that name and type does not exist. 
     265 (3) 
     267    There is (or are) features with matching name and type, but their 
     268    values are incompatible with the prescribed ordered values. For example, 
     269    if the existing feature already has values ["a", "b"] and the new one 
     270    wants ["b", "a"], the old feature cannot be reused. The new values can, 
     271    however, be appended to the existing list, so searching for ["a", "b", "c"] would 
     272    succeed. The search for ["a"] will succeed as well, since the extra existing value 
     273    does not matter. The formal rule is thus that the values are compatible if ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``. 
     274 (2) 
     276    There is a matching feature, yet it has none of the values that the new 
     277    feature will have (this is obviously possible only if the new attribute has 
     278    no prescribed ordered values). For instance, we search for a feature 
     279    "sex" with values "male" and "female", while there is a feature of the same  
     280    name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this  
     281    feature is possible, though this should probably be a new feature since it  
     282    obviously comes from a different data set. If we do decide for reuse, the  
     283    old feature will get some unneeded new values and the new one will inherit  
     284    some from the old. 
     285 (1) 
     287    There is a matching feature with some of the values that the new one  
     288    requires, but some values are missing. This situation is neither uncommon  
     289    nor suspicious: in case of separate training and testing data sets there may 
     290    be values which occur in one set but not in the other. 
     291 (0) 
     293    There is a perfect match which contains all the prescribed values in the 
     294    correct order. The existing attribute may have some extra values, though. 
     296Continuous attributes can obviously have only two statuses, ``NotFound`` or ``OK``. 
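
The statuses for a single existing discrete feature can be approximated by the plain-Python sketch below. It uses the formal compatibility rule quoted above and the status names that appear later in this text; ``match_status`` is a hypothetical helper and not the search routine used by Orange, and the handling of edge cases (for example an existing feature with no values) is an assumption.

```python
# Status codes, with the names used elsewhere in this text.
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def ordered_compatible(existing_values, ordered_values):
    """Formal rule from the text: the shorter list is a prefix of the longer."""
    return (existing_values[:len(ordered_values)]
            == ordered_values[:len(existing_values)])


def match_status(existing_values, ordered_values=(), unordered_values=()):
    """Classify one existing feature (same name and type) against a request."""
    existing = list(existing_values)
    ordered = list(ordered_values)
    if not ordered_compatible(existing, ordered):
        return INCOMPATIBLE
    wanted = ordered + [v for v in unordered_values if v not in ordered]
    recognized = [v for v in wanted if v in existing]
    if wanted and not recognized:
        return NO_RECOGNIZED_VALUES
    if len(recognized) < len(wanted):
        return MISSING_VALUES
    return OK


status_swapped = match_status(["a", "b"], ordered_values=["b", "a"])
status_longer = match_status(["a", "b"], ordered_values=["a", "b", "c"])
status_shorter = match_status(["a", "b"], ordered_values=["a"])
status_foreign = match_status(["M", "F"], unordered_values=["male", "female"])
```

``NOT_FOUND`` is never returned by this helper because it describes the case where there is no candidate feature to compare with at all.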
     298When loading the data using :obj:````, Orange takes the safest  
     299approach and, by default, reuses everything that is compatible, that is, up to  
     300and including ``NoRecognizedValues``. Unintended reuse would be obvious from the 
     301feature having too many values, which the user can notice and fix. More on that  
     302in the page on `loading data`. 
     304There are two functions for reusing the attributes instead of creating new ones. 
     306.. function::, type, ordered_values, unordered_values[, createNewOn]) 
     308    Find and return an existing feature or create a new one if no existing 
     309    feature matches the given name, type and values. 
     311    The optional `createNewOn` specifies the status at which a new feature is 
     312    created. The status must be at most ``Incompatible`` since incompatible (or non-existing) features cannot be reused. If it is set lower, for instance  
     313    to ``MissingValues``, a new feature is created even if there exists 
     314    a feature which only misses some values. If set to ``OK``, the function 
     315    always creates a new feature. 
     317    The function returns a tuple containing a feature descriptor and the 
     318    status of the best matching feature. So, if ``createNewOn`` is set to ``MissingValues``, and there exists a feature whose status is, say, 
     319    ``NoRecognizedValues``, a feature would be created, while the second  
     320    element of the tuple would contain ``NoRecognizedValues``. If, on the other 
     321    hand, there exists a feature which is perfectly OK, its descriptor is  
     322    returned and the returned status is ``OK``. The function returns no  
     323    indicator whether the returned feature is reused or not. This can be, 
     324    however, read from the status code: if it is smaller than the specified 
     325    ``createNewOn``, the feature is reused, otherwise we got a new descriptor. 
     327    The exception to the rule is when ``createNewOn`` is OK. In this case, the  
     328    function does not search through the existing attributes and cannot know the  
     329    status, so the returned status in this case is always ``OK``. 
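
The rule for reading reuse off the returned status can be written down as a small plain-Python helper. ``was_reused`` is a hypothetical name, and the numeric codes are the ones listed earlier in this text.

```python
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def was_reused(status, create_new_on):
    """Tell whether the returned descriptor is an existing one.

    `status` is the second element of the returned tuple and
    `create_new_on` the threshold that was passed to the call.
    """
    if create_new_on == OK:
        # Exception described above: no search is made, a new feature is
        # always created and the reported status is always OK.
        return False
    return status < create_new_on


reused_missing = was_reused(MISSING_VALUES, INCOMPATIBLE)
reused_at_threshold = was_reused(NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES)
reused_when_forbidden = was_reused(OK, OK)
```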
     331    :param name: Feature name 
     332    :param type: Feature type 
     333    :type type: 
     334    :param ordered_values: a list of ordered values 
     335    :param unordered_values: a list of values, for which the order does not matter 
     336    :param createNewOn: gives condition for constructing a new feature instead 
     337    of using the existing one 
     339.. function::, type, ordered_values, unordered_values[, createNewOn]) 
     341    Find and return an existing feature, or ``None`` if no match is found. 
     343    :param name: Feature name 
     344    :param type: Feature type 
     345    :type type: 
     346    :param ordered_values: a list of ordered values 
     347    :param unordered_values: a list of values, for which the order does not matter 
     348    :param createNewOn: gives condition for constructing a new feature instead 
     349    of using the existing one 
     351.. _``: code/ 
     353The following examples (from ``_) give the shown results if executed only once (in a Python session) and in this order. 
     355:py:func:`make` can be used for construction of new features.:: 
     357    >>> v1, s ="a",, ["a", "b"]) 
     358    >>> print s, v1.values 
     359    4 <a, b> 
     361No surprises here: new feature is created and the status is ``NotFound``.:: 
     363    >>> v2, s ="a",, ["a"], ["c"]) 
     364    >>> print s, v2 is v1, v1.values 
     365    1 True <a, b, c> 
     367The status is 1 (``MissingValues``), yet the feature is reused (``v2 is v1``). 
     368``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does 
     369not matter that the new variable does not need value ``b``.:: 
     371    >>> v3, s ="a",, ["a", "b", "c", "d"]) 
     372    >>> print s, v3 is v1, v1.values 
     373    1 True <a, b, c, d> 
     375This is similar to the previous case, except that the new value, ``d``, is not among the ordered values.:: 
     377    >>> v4, s ="a",, ["b"]) 
     378    >>> print s, v4 is v1, v1.values, v4.values 
     379    3 False <a, b, c, d> <b> 
     381The new feature needs to have ``b`` as the first value, so it is incompatible  
     382with the existing features. The status is thus 3 (``Incompatible``), the two  
     383features are not equal and have different lists of values.:: 
     385    >>> v5, s ="a",, None, ["c", "a"]) 
     386    >>> print s, v5 is v1, v1.values, v5.values 
     387    0 True <a, b, c, d> <a, b, c, d> 
     389The new feature has values ``c`` and ``a``, but does not 
     390mind about the order, so the existing attribute is ``OK``.:: 
     392    >>> v6, s ="a",, None, ["e"]) 
     393    >>> print s, v6 is v1, v1.values, v6.values 
     394    2 True <a, b, c, d, e> <a, b, c, d, e> 
     396The new feature has different values than the existing (status is 2, ``NoRecognizedValues``), but the existing is reused nevertheless. Note that we 
     397gave ``e`` in the list of unordered values. If it was among the ordered, the 
     398reuse would fail.:: 
     400    >>> v7, s ="a",, None, 
     401            ["f"], 
     402    >>> print s, v7 is v1, v1.values, v7.values 
     403    2 False <a, b, c, d, e> <f> 
     405This is the same as before, except that we prohibited reuse when there are no 
     406recognized values. Hence a new feature is created, though the returned status is  
     407the same as before:: 
     409    >>> v8, s ="a",, 
     410            ["a", "b", "c", "d", "e"], None, 
     411    >>> print s, v8 is v1, v1.values, v8.values 
     412    0 False <a, b, c, d, e> <a, b, c, d, e> 
     414Finally, this is a perfect match, but any reuse is prohibited, so a new  
     415feature is created. 
    68418from orange import Variable as Feature 
  • orange/doc/Orange/rst/index.rst

    r7334 r7338  
    1212   :maxdepth: 2 
    1415   Orange.associate    
    1516   Orange.cluster 