Changeset 7338:488bf5765c7a in orange


Timestamp:
02/03/11 19:54:31 (3 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Convert:
7f34eb56cd49257c4a20e5602a4e30d44d892bba
Message:
  • finished documentation on Orange.data.feature; still needs to be checked
Location:
orange
Files:
2 added
2 edited

  • orange/Orange/data/feature.py

    r7260 r7338  
    11""" 
    2 Data instances in Orange can contain four types of features: `discrete`_, 
    3 continuous_, strings_ and Python_; the latter represent arbitrary Python objects. 
     2Data instances in Orange can contain several types of features: 
     3:ref:`discrete <discrete>`, :ref:`continuous <continuous>`, 
     4:ref:`strings <string>`, and :ref:`Python <Python>` and types derived from it. 
     5The latter represent arbitrary Python objects. 
    46The names, types, values (where applicable), functions for computing the 
    57feature value from other features, and other properties of the 
     
    2931 
    3032    .. attribute:: getValueFrom 
    31     A function (an instance of `Orange.core.Clasifier`) which computes a 
     33 
     34    A function (an instance of :obj:`Orange.core.Classifier`) which computes a 
    3235    value of the feature from values of one or more other features. This is 
    3336    used, for instance, in discretization where the features describing the  
     
    3538 
    3639    .. attribute:: ordered 
     40     
    3741    A flag telling whether the values of a discrete feature are ordered. At the  
    3842    moment, no builtin method treats ordinal features differently than nominal. 
    3943     
    4044    .. attribute:: distributed 
     45     
    4146    A flag telling whether the values of this feature are distributions. 
    4247    As for flag ordered, no methods treat such features in any special manner. 
    4348     
    4449    .. attribute:: randomGenerator 
     50     
    4551    A local random number generator used by method :obj:`Feature.randomvalue`. 
    4652     
    4753    .. attribute:: defaultMetaId 
     54     
    4855    A proposed (but not guaranteed) meta id to be used for that feature. This is  
    4956    used, for instance, by the data loader for tab-delimited file format instead  
    50     of assigning an arbitrary new value, or by `Orange.core.newmetaid` if the 
     57    of assigning an arbitrary new value, or by :obj:`Orange.core.newmetaid` if the 
    5158    feature is passed as an argument.  
    5259         
    53  
    5460    .. method:: __call__(obj) 
     61     
    5562       Convert a string, number or other suitable object into a feature value. 
     63        
    5664       :param obj: An object to be converted into a feature value 
    5765       :type obj: any suitable 
     
    5967        
    6068    .. method:: randomvalue() 
     69 
    6170       Return a random value of the feature 
     71        
    6272       :rtype: :class:`Orange.data.Value` 
    6373        
    64     .. method:: computeValue() 
    65        Calls getValueFrom through a mechanism that prevents deadlocks by circular calls. 
    66         
     74    .. method:: computeValue(inst) 
     75 
     76       Compute the value of the feature given the instance by calling 
     77       `getValueFrom` through a mechanism that prevents deadlocks by 
     78       circular calls. 
     79 
     80       :rtype: :class:`Orange.data.Value` 
     81 
     82.. _discrete: 
     83.. class:: Discrete 
     84    
     85    Descriptor for discrete features. 
     86     
     87    .. attribute:: values 
     88     
     89    A list with symbolic names for feature's values. Values are stored as 
     90    indices referring to this list. Therefore, modifying this list instantly 
     91    changes (symbolic) names of values as they are printed out or referred to 
     92    by user. 
     93     
     94    .. note:: 
     95     
     96        The size of the list is also used to indicate the number of possible values 
     97        for this feature. Changing the size, especially shrinking the list, can have 
     98        disastrous effects and is therefore not recommended. Also, do not 
     99        add values to the list by calling its append or extend method: call 
     100        the :obj:`addValue` method described below instead. 
     101 
     102        It is also assumed that this attribute is always defined (but can be empty),  
     103        so never set it to None. 
     104     
     105    .. attribute:: baseValue 
     106 
     107    Stores the base value for the feature as an index into `values`. This can 
     108    be, for instance, a "normal" value, such as "no complications" as opposed to 
     109    abnormal "low blood pressure". The base value is used by certain statistics, 
     110    continuization and, potentially, learning algorithms. The default is -1, 
     111    which means that there is no base value. 
     112     
     113    .. method:: addValue 
     114     
     115    Adds a value to values. Always call this function instead of appending to 
     116    values. 
     117 
     118.. _continuous: 
     119.. class:: Continuous 
     120 
     121    Descriptor for continuous features. 
     122     
     123    .. attribute:: numberOfDecimals 
     124     
     125    The number of decimals used when the value is printed out, converted to a 
     126    string or saved to a file. 
     127     
     128    .. attribute:: scientificFormat 
     129     
     130    If ``True``, the value is printed in scientific format whenever it would 
     131    have more than 5 digits. In this case, `numberOfDecimals` is ignored. 
     132 
     133    .. attribute:: adjustDecimals 
     134     
     135    Tells Orange to monitor the number of decimals when the value is converted 
     136    from a string (when the values are read from a file or converted by, e.g. 
     137    ``inst[0]="3.14"``). The value of ``0`` means that the number of decimals  
     138    should not be adjusted, while 1 and 2 mean that adjustments are on, with 2  
     139    denoting that no values have been converted yet. 
     140 
     141    By default, adjustment of number of decimals goes as follows. 
     142     
     143    If the feature was constructed when data was read from a file, it will be 
     144    printed with the same number of decimals as the largest number of decimals 
     145    encountered in the file. If scientific notation occurs in the file, 
     146    `scientificFormat` will be set to ``True`` and scientific format will be used 
     147    for values too large or too small. 
     148     
     149    If the feature is created in a script, it will have, by default, three 
     150    decimal places. This can be changed either by setting the value 
     151    from a string (e.g. ``inst[0]="3.14"``, but not ``inst[0]=3.14``) or by 
     152    manually setting the `numberOfDecimals`. 
     153 
     154    .. attribute:: startValue, endValue, stepValue 
     155     
     156    The range used for :obj:`randomvalue`. 
     157 
     158.. _String: 
     159.. class:: String 
     160 
     161    Descriptor for features that contain strings. No method can use them for 
     162    learning; some will complain and others will silently ignore them when they 
     163    encounter them. They can, however, be useful for meta-attributes; if 
     164    instances in a dataset have unique ids, the most efficient way to store them 
     165    is to read them as meta-attributes. In general, never use discrete 
     166    attributes with many (say, more than 50) values. Such attributes are 
     167    probably not of any use for learning and should be stored as string attributes. 
     168 
     169    When converting strings into values and back, empty strings are treated 
     170    differently than usual. For other types, an empty string can be used to 
     171    denote undefined values, while :obj:`String` will take an empty string as 
     172    an empty string -- that is, except when loading from or saving to a file. Empty 
     173    strings in files are interpreted as undefined; to specify an empty string, 
     174    enclose it in double quotes; these get removed when the string is 
     175    loaded. 
     176 
     177.. _Python: 
     178.. class:: Python 
     179 
     180    Base class for descriptors defined in Python. It is fully functional, 
     181    and can be used as a descriptor for attributes that contain arbitrary Python 
     182    values. Since this is an advanced topic, PythonVariables are described on a  
     183    separate page. !!TODO!! 
     184     
     185     
     186Features computed from other features 
     187------------------------------------- 
     188 
     189Values of features are often computed from other features, such as in 
     190discretization. The mechanism described below usually occurs behind the scenes, 
     191so understanding it is required only for implementing specific transformations. 
     192 
     193Monk 1 is a well-known dataset with target concept ``y := a==b`` or ``e==1``. 
     194It can help the learning algorithm if the four-valued 
     195attribute ``e`` is replaced with a binary attribute having values `"1"` and `"not 1"`. The 
     196new feature will be computed from the old one on the fly.  
     197 
     198.. literalinclude:: code/feature-getValueFrom.py 
     199    :lines: 7-17 
     200     
     201The new feature is named ``e2``; we define it by descriptor of type  
     202:obj:`Discrete`, with appropriate name and values ``"not 1"`` and ``1`` (we  
     203chose this order so that the ``not 1``'s index is ``0``, which can be, if  
     204needed, interpreted as ``False``). Finally, we tell e2 to use  
     205``checkE`` to compute its value when needed, by assigning ``checkE`` to  
     206``e2.getValueFrom``.  
     207 
     208``checkE`` is a function that is passed an instance and another argument we  
     209don't care about here. If the instance's ``e`` equals ``1``, the function  
     210returns value ``1``, otherwise it returns ``not 1``. Both are returned as  
     211values, not plain strings. 
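
The idea behind ``checkE`` can be sketched without Orange. In the stand-in below, an instance is a plain dictionary and a value is an index into a list of names. The names ``VALUES_E2`` and ``check_e`` and the dictionary layout are hypothetical: they only mirror the description above and are not the contents of the included script.

```python
# Hypothetical stand-in, not the Orange API: values of e2 are stored as
# indices into this list, so "not 1" has index 0 and "1" has index 1.
VALUES_E2 = ["not 1", "1"]


def check_e(inst, return_what=None):
    """Return the index of e2's value, computed from the instance's ``e``."""
    # The second argument is accepted but ignored, as described above.
    name = "1" if inst["e"] == "1" else "not 1"
    return VALUES_E2.index(name)


inst = {"a": "1", "b": "2", "e": "1"}
print(VALUES_E2[check_e(inst)])
```

Returning an index rather than a string reflects how discrete values are stored, as indices into the descriptor's list of values.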
     212 
     213In most circumstances, the value of ``e2`` can be computed on the fly - we can 
     214pretend that the feature exists in the data, although it doesn't (but  
     215can be computed from it). For instance, we can compute the information gain of 
     216feature ``e2`` or its distribution without actually constructing data containing 
     217the new feature. 
     218 
     219.. literalinclude:: code/feature-getValueFrom.py 
     220    :lines: 19-22 
     221 
     222There are methods which cannot compute values on the fly because it would be 
     223too complex or time consuming. In such cases, the data need to be converted 
     224to a new :obj:`Orange.data.Table`:: 
     225 
     226    newDomain = orange.Domain([data.domain["a"], data.domain["b"], e2, data.domain.classVar]) 
     227    newData = orange.ExampleTable(newDomain, data)  
     228 
     229Automatic computation is useful when the data is split into training and 
     230testing examples. Training instances can be modified by adding, removing 
     231and transforming features (in a typical setup, continuous features  
     232are discretized prior to learning, therefore the original features are  
     233replaced by new ones), while test instances are left as they  
     234are. When they are classified, the classifier automatically converts the  
     235testing instances into the new domain, which includes recomputation of  
     236transformed features.  
     237 
     238.. literalinclude:: code/feature-getValueFrom.py 
     239    :lines: 24- 
     240 
     241Reuse of Descriptors 
     242-------------------- 
     243 
     244There are situations when feature descriptors need to be reused. Typically, the  
     245user loads some training examples, trains a classifier and then loads a separate 
     246test set. For the classifier to recognize the features in the second data set, 
     247the descriptors, not just the names, need to be the same.  
     248 
     249When constructing new descriptors for data read from a file or at unpickling, 
     250Orange checks whether an appropriate descriptor (with the same name and, in case 
     251of discrete features, also values) already exists and reuses it. When new 
     252descriptors are constructed by explicitly calling the above descriptors, this 
     253always creates new descriptors and thus new features, although the feature with 
     254the same name may already exist. 
     255 
     256The search for an existing feature is based on four attributes: the feature's name, 
     257type, ordered values and unordered values. As for the latter two, the values can  
     258be explicitly ordered by the user, e.g. in the second line of the tab-delimited  
     259file, for instance to order sizes as small-medium-big. 
     260 
     261The search for existing variables can end with one of the following statuses. 
     262 
     263Orange.data.feature.Feature.MakeStatus.NotFound (4) 
     264    The feature with that name and type does not exist. 
     265 
     266Orange.data.feature.Feature.MakeStatus.Incompatible (3) 
     267    There is at least one feature with matching name and type, but its 
     268    values are incompatible with the prescribed ordered values. For example, 
     269    if the existing feature already has values ["a", "b"] and the new one 
     270    wants ["b", "a"], the old feature cannot be reused. The existing list can, 
     271    however, be extended with the new values, so searching for ["a", "b", "c"] would 
     272    succeed. So would the search for ["a"], since the extra existing value 
     273    does not matter. The formal rule is thus that the values are compatible if ``existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)]``. 
     274 
     275Orange.data.feature.MakeStatus.NoRecognizedValues (2) 
     276    There is a matching feature, yet it has none of the values that the new 
     277    feature will have (this is obviously possible only if the new attribute has 
     278    no prescribed ordered values). For instance, we search for a feature 
     279    "sex" with values "male" and "female", while there is a feature of the same  
     280    name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this  
     281    feature is possible, though this should probably be a new feature since it  
     282    obviously comes from a different data set. If we do decide for reuse, the  
     283    old feature will get some unneeded new values and the new one will inherit  
     284    some from the old. 
     285 
     286Orange.data.feature.MakeStatus.MissingValues (1) 
     287    There is a matching feature with some of the values that the new one 
     288    requires, but some values are missing. This situation is neither uncommon  
     289    nor suspicious: in case of separate training and testing data sets there may 
     290    be values which occur in one set but not in the other. 
     291 
     292Orange.data.feature.MakeStatus.OK (0) 
     293    There is a perfect match which contains all the prescribed values in the 
     294    correct order. The existing attribute may have some extra values, though. 
     295 
     296Continuous attributes can obviously have only two statuses, ``NotFound`` or ``OK``. 
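
The compatibility rule for ordered values given under ``Incompatible`` can be checked in plain Python. This is a minimal sketch: the helper name ``values_compatible`` is hypothetical and not part of Orange; it only evaluates the expression quoted above on the three cases from the text.

```python
def values_compatible(existing_values, ordered_values):
    """Hypothetical helper: True if the shorter list is a prefix of the longer."""
    return (existing_values[:len(ordered_values)]
            == ordered_values[:len(existing_values)])


existing = ["a", "b"]
print(values_compatible(existing, ["b", "a"]))       # False: the order differs
print(values_compatible(existing, ["a", "b", "c"]))  # True: "c" can be appended
print(values_compatible(existing, ["a"]))            # True: the extra "b" is harmless
```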
     297 
     298When loading the data using :obj:`Orange.data.Table`, Orange takes the safest 
     299approach and, by default, reuses everything that is compatible, that is, up to  
     300and including ``NoRecognizedValues``. Unintended reuse would be obvious from the 
     301feature having too many values, which the user can notice and fix. More on that  
     302in the page on `loading data`. 
     303 
     304There are two functions for reusing the attributes instead of creating new ones. 
     305 
     306.. function:: Orange.data.feature.make(name, type, ordered_values, unordered_values[, createNewOn]) 
     307 
     308    Find and return an existing feature or create a new one if no existing 
     309    feature matches the given name, type and values. 
     310     
     311    The optional `createNewOn` specifies the status at which a new feature is 
     312    created. The status must be at most ``Incompatible`` since incompatible (or non-existing) features cannot be reused. If it is set lower, for instance 
     313    to ``MissingValues``, a new feature is created even if there exists 
     314    a feature which only misses some values. If set to ``OK``, the function 
     315    always creates a new feature. 
     316     
     317    The function returns a tuple containing a feature descriptor and the 
     318    status of the best matching feature. So, if ``createNewOn`` is set to ``MissingValues``, and there exists a feature whose status is, say, 
     319    ``NoRecognizedValues``, a feature would be created, while the second 
     320    element of the tuple would contain ``NoRecognizedValues``. If, on the other 
     321    hand, there exists a feature which is perfectly OK, its descriptor is 
     322    returned and the returned status is ``OK``. The function returns no 
     323    indicator whether the returned feature is reused or not. This can be, 
     324    however, read from the status code: if it is smaller than the specified 
     325    ``createNewOn``, the feature is reused, otherwise we got a new descriptor. 
     326 
     327    The exception to the rule is when ``createNewOn`` is OK. In this case, the  
     328    function does not search through the existing attributes and cannot know the  
     329    status, so the returned status in this case is always ``OK``. 
     330 
     331    :param name: Feature name 
     332    :param type: Feature type 
     333    :type type: Orange.data.feature.Type 
     334    :param ordered_values: a list of ordered values 
     335    :param unordered_values: a list of values, for which the order does not matter 
     336    :param createNewOn: the status at which a new feature is constructed instead 
     337    of reusing an existing one 
     338     
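
The reuse rule described for ``make`` boils down to a single comparison of status codes. The sketch below is not Orange code: the constant names and ``is_reused`` are hypothetical, and the default of ``INCOMPATIBLE`` is an assumption inferred from the statement that features are reused up to and including ``NoRecognizedValues``.

```python
# Status codes as listed above; a smaller number means a better match.
OK, MISSING_VALUES, NO_RECOGNIZED_VALUES, INCOMPATIBLE, NOT_FOUND = range(5)


def is_reused(status, create_new_on=INCOMPATIBLE):
    """Hypothetical helper: the existing feature is reused only if the status
    of the best match is strictly smaller than create_new_on."""
    return status < create_new_on


print(is_reused(MISSING_VALUES))                              # reused by default
print(is_reused(NO_RECOGNIZED_VALUES, NO_RECOGNIZED_VALUES))  # a new feature is made
print(is_reused(OK, OK))                                      # never reused
```

The sketch models only the decision, not the returned status: as noted above, with ``createNewOn`` set to ``OK`` the function does not search at all and always reports ``OK``.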
     339.. function:: Orange.data.feature.retrieve(name, type, ordered_values, unordered_values[, createNewOn]) 
     340 
     341    Find and return an existing feature, or ``None`` if no match is found. 
     342     
     343    :param name: Feature name 
     344    :param type: Feature type 
     345    :type type: Orange.data.feature.Type 
     346    :param ordered_values: a list of ordered values 
     347    :param unordered_values: a list of values, for which the order does not matter 
     348    :param createNewOn: gives condition for constructing a new feature instead 
     349    of using the new one 
     350     
     351.. _`feature-reuse.py`: code/feature-reuse.py 
     352 
     353The following examples (from `feature-reuse.py`_) give the shown results if executed only once (in a Python session) and in this order. 
     354 
     355:py:func:`make` can be used for construction of new features. :: 
     356     
     357    >>> v1, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, ["a", "b"]) 
     358    >>> print s, v1.values 
     359    4 <a, b> 
     360 
     361No surprises here: a new feature is created and the status is ``NotFound``. :: 
     362 
     363    >>> v2, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, ["a"], ["c"]) 
     364    >>> print s, v2 is v1, v1.values 
     365    1 True <a, b, c> 
     366 
     367The status is 1 (``MissingValues``), yet the feature is reused (``v2 is v1``). 
     368``v1`` gets a new value, ``"c"``, which was given as an unordered value. It does 
     369not matter that the new variable does not need value ``b``. :: 
     370 
     371    >>> v3, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, ["a", "b", "c", "d"]) 
     372    >>> print s, v3 is v1, v1.values 
     373    1 True <a, b, c, d> 
     374 
     375This is similar to the previous case, except that the new value, ``d``, is given among the ordered values. :: 
     376 
     377    >>> v4, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, ["b"]) 
     378    >>> print s, v4 is v1, v1.values, v4.values 
     379    3 False <a, b, c, d> <b> 
     380 
     381The new feature needs to have ``b`` as the first value, so it is incompatible 
     382with the existing feature. The status is thus 3 (``Incompatible``), the two 
     383features are not equal and have different lists of values. :: 
     384 
     385    >>> v5, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, None, ["c", "a"]) 
     386    >>> print s, v5 is v1, v1.values, v5.values 
     387    0 True <a, b, c, d> <a, b, c, d> 
     388 
     389The new feature has values ``c`` and ``a``, but does not 
     390care about the order, so the existing attribute is ``OK``. :: 
     391 
     392    >>> v6, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, None, ["e"]) 
     393    >>> print s, v6 is v1, v1.values, v6.values 
     394    2 True <a, b, c, d, e> <a, b, c, d, e> 
     395 
     396The new feature has different values than the existing one (status is 2, ``NoRecognizedValues``), but the existing one is reused nevertheless. Note that we 
     397gave ``e`` in the list of unordered values. If it was among the ordered, the 
     398reuse would fail. :: 
     399 
     400    >>> v7, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, None, 
     401    ...     ["f"], Orange.data.feature.MakeStatus.NoRecognizedValues) 
     402    >>> print s, v7 is v1, v1.values, v7.values 
     403    2 False <a, b, c, d, e> <f> 
     404 
     405This is the same as before, except that we prohibited reuse when there are no 
     406recognized values. Hence a new feature is created, though the returned status is 
     407the same as before:: 
     408 
     409    >>> v8, s = Orange.data.feature.make("a", Orange.data.Type.Discrete, 
     410    ...     ["a", "b", "c", "d", "e"], None, Orange.data.feature.MakeStatus.OK) 
     411    >>> print s, v8 is v1, v1.values, v8.values 
     412    0 False <a, b, c, d, e> <a, b, c, d, e> 
     413 
     414Finally, this is a perfect match, but any reuse is prohibited, so a new  
     415feature is created. 
     416 
    67417""" 
    68418from orange import Variable as Feature 
  • orange/doc/Orange/rst/index.rst

    r7334 r7338  
    1212   :maxdepth: 2 
    1313 
     14   orange.data.feature 
    1415   Orange.associate    
    1516   Orange.cluster 