02/05/12 23:04:46 (2 years ago)
anze <anze.staric@…>

Moved instance distance documentation to rst file.

1 edited


  • docs/reference/rst/Orange.distance.instances.rst

    r9372 r9639  
#########################
Instances (``instances``)
#########################

.. automodule:: Orange.distance.instances

Distances between Instances
===========================

This page describes classes that implement different metrics for measuring
distances (dissimilarities) between instances.

Most (although not all) measures of distance between instances require
some "learning", that is, adjusting the measure to the data. For instance, when
the dataset contains continuous features, the distances between continuous
values should be normalized, e.g. by dividing the distance by the range
of possible values or by some interquartile distance, to ensure that all
features have, in principle, similar impacts.

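The range-based normalization mentioned above can be illustrated with a minimal sketch in plain Python. It does not use Orange; the helper names ``range_normalizers`` and ``normalized_differences`` and the sample data are hypothetical and serve only to show why scaling gives features comparable impact.

```python
def range_normalizers(rows):
    """Return 1 / (max - min) per column, or 0.0 for a constant column."""
    factors = []
    for column in zip(*rows):
        spread = max(column) - min(column)
        factors.append(1.0 / spread if spread > 0 else 0.0)
    return factors


def normalized_differences(a, b, factors):
    """Scale per-feature absolute differences into the range [0, 1]."""
    return [abs(x - y) * f for x, y, f in zip(a, b, factors)]


# Hypothetical data: height in cm and monthly income, on very different scales.
data = [(150.0, 1000.0), (200.0, 5000.0), (175.0, 2000.0)]
factors = range_normalizers(data)
diffs = normalized_differences(data[0], data[2], factors)
print(diffs)  # both differences now lie between 0 and 1
```

Without the scaling, the income difference of 1000 would dominate the height difference of 25 in any sum of per-feature distances.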
Different measures of distance thus appear in pairs: a class that measures
the distance and a class that constructs it based on the data. The abstract
classes representing such a pair are `ExamplesDistance` and
`ExamplesDistanceConstructor`.

Since most measures work on normalized distances between corresponding
features, there is an abstract intermediate class
`ExamplesDistance_Normalized` that takes care of normalizing.
The remaining classes correspond to different ways of defining the distances,
such as Manhattan or Euclidean distance.

Unknown values are treated correctly only by Euclidean and Relief distance.
For other measures of distance, the distance between an unknown and a known
value, or between two unknown values, is always 0.5.

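The constructor/measure pairing can be sketched as follows. This is an illustration in plain Python, not Orange's implementation: the class names ``RangeScaledManhattan`` and ``RangeScaledManhattanConstructor`` are hypothetical, instances are assumed to be tuples of numbers, and ``None`` is assumed to stand for an unknown value.

```python
class RangeScaledManhattan:
    """Hypothetical measure: holds learned factors and computes distances."""

    def __init__(self, factors):
        self.factors = factors

    def __call__(self, instance1, instance2):
        total = 0.0
        for x, y, factor in zip(instance1, instance2, self.factors):
            if x is None or y is None:
                total += 0.5  # unknown value: fixed per-feature distance
            else:
                total += abs(x - y) * factor
        return total


class RangeScaledManhattanConstructor:
    """Hypothetical constructor: fits factors to data, returns a measure."""

    def __call__(self, instances):
        factors = []
        for column in zip(*instances):
            known = [v for v in column if v is not None]
            spread = max(known) - min(known) if known else 0.0
            factors.append(1.0 / spread if spread > 0 else 0.0)
        return RangeScaledManhattan(factors)


training = [(0.0, 10.0), (4.0, 30.0), (2.0, None)]
measure = RangeScaledManhattanConstructor()(training)  # the "learning" step
d_known = measure((0.0, 10.0), (4.0, 30.0))
d_unknown = measure((0.0, 10.0), (2.0, None))
```

The constructor is called once on the data; the returned measure can then be applied to any number of instance pairs without recomputing the factors.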
.. class:: ExamplesDistance

    .. method:: __call__(instance1, instance2)

        Returns the distance between the given instances as a floating point number.

.. class:: ExamplesDistanceConstructor

    .. method:: __call__([instances, weightID][, distributions][, basic_var_stat])

        Constructs an instance of ExamplesDistance.
        Not all the data needs to be given. Most measures can be constructed
        from basic_var_stat; if it is not given, they can be constructed
        from either instances or distributions.
        Some (e.g. ExamplesDistance_Hamming) do not need any arguments at all.

.. class:: ExamplesDistance_Normalized

    This abstract class provides a function which is given two instances
    and returns a list of normalized distances between values of their
    features. Many distance measuring classes need such a function and are
    therefore derived from this class.

    .. attribute:: normalizers

        A precomputed list of normalizing factors for feature values.

        - If a factor is positive, differences in the feature's values
          are multiplied by it; for continuous features the factor
          is 1/(max_value-min_value) and for ordinal features
          the factor is 1/number_of_values. If either (or both) of
          the values are unknown, the distance is 0.5.
        - If a factor is -1, the feature is nominal; the distance
          between two values is 0 if they are the same (or at least
          one is unknown) and 1 if they are different.
        - If a factor is 0, the feature is ignored.

    .. attribute:: bases, averages, variances

        The minimal values, averages and variances
        (continuous features only).

    .. attribute:: domainVersion

        Stores the domain version for which the normalizers were computed.
        The domain version is increased each time a domain description is
        changed (i.e. features are added or removed); this is used for a quick
        check that the user is not attempting to measure distances between
        instances that do not correspond to the normalizers.
        Since domains are practically immutable (especially from Python),
        you don't need to care about this anyway.

    .. method:: attributeDistances(instance1, instance2)

        Returns a list of floats representing distances between pairs of
        feature values of the two instances.

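The conventions for ``normalizers`` can be made concrete with a small sketch of a per-feature distance function. This is plain Python written for illustration, not the Orange source: the function name ``attribute_distances`` is hypothetical, ``None`` is assumed to represent an unknown value, and an ignored feature is assumed to contribute a distance of 0.

```python
def attribute_distances(instance1, instance2, normalizers):
    """Return per-feature distances following the normalizer conventions."""
    distances = []
    for x, y, factor in zip(instance1, instance2, normalizers):
        unknown = x is None or y is None
        if factor == 0:
            # Factor 0: the feature is ignored (assumed to contribute 0).
            distances.append(0.0)
        elif factor == -1:
            # Factor -1: nominal feature; 0 if same or unknown, else 1.
            distances.append(0.0 if unknown or x == y else 1.0)
        else:
            # Positive factor: scaled difference, or 0.5 for unknown values.
            distances.append(0.5 if unknown else abs(x - y) * factor)
    return distances


normalizers = [0.1, -1, 0, 0.25]
first = (2.0, "red", 99, None)
second = (7.0, "blue", 1, 3.0)
dist = attribute_distances(first, second, normalizers)
```

The resulting list plays the role of ``dist`` in the descriptions of the derived measures.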
.. class:: Hamming, HammingConstructor

    Hamming distance between two instances is defined as the number of
    features in which the two instances differ. Note that this measure
    is not really appropriate for instances that contain continuous features.

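A minimal sketch of the Hamming definition in plain Python, assuming instances are equal-length tuples of feature values (the function name ``hamming_distance`` is hypothetical), also shows why continuous features are a poor fit:

```python
def hamming_distance(instance1, instance2):
    """Count the features whose values differ between the two instances."""
    return sum(1 for x, y in zip(instance1, instance2) if x != y)


d_nominal = hamming_distance(("red", "small", "round"),
                             ("red", "large", "square"))

# Continuous values that are nearly identical still count as a full mismatch.
d_continuous = hamming_distance((1.70,), (1.71,))
```

Here two of the three nominal features differ, while the tiny difference in the continuous value is counted the same as any other mismatch.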
.. class:: Maximal, MaximalConstructor

    The maximal distance between two instances is defined as the maximal
    distance between two feature values. If dist is the result of
    ExamplesDistance_Normalized.attributeDistances,
    then Maximal returns ``max(dist)``.

.. class:: Manhattan, ManhattanConstructor

    Manhattan distance between two instances is the sum of absolute values
    of distances between pairs of features, i.e. ``sum(abs(x) for x in dist)``,
    where dist is the result of ExamplesDistance_Normalized.attributeDistances.

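Given a list of per-feature distances, the Maximal and Manhattan definitions reduce to one-line aggregations. The sketch below is plain Python with hypothetical function names and a made-up ``dist`` list; it is not taken from the Orange source.

```python
def maximal_distance(dist):
    """Largest per-feature distance."""
    return max(dist)


def manhattan_distance(dist):
    """Sum of absolute per-feature distances."""
    return sum(abs(x) for x in dist)


# Hypothetical per-feature distances for one pair of instances.
dist = [0.5, 1.0, 0.0, 0.25]
largest = maximal_distance(dist)
total = manhattan_distance(dist)
```

Maximal is determined entirely by the single most different feature, whereas Manhattan accumulates the contribution of every feature.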
.. class:: Euclidean, EuclideanConstructor

    Euclidean distance is the square root of the sum of squared per-feature
    distances, i.e. ``sqrt(sum(x*x for x in dist))``, where dist is the result of
    ExamplesDistance_Normalized.attributeDistances.

    .. attribute:: distributions

        An object of type
        :obj:`Orange.statistics.distribution.Distribution` that holds
        the distributions for all discrete features used for
        computation of distances between known and unknown values.

    .. attribute:: bothSpecialDist

        A list containing the distance between two unknown values for each
        discrete feature.

    This measure of distance deals with unknown values by computing the
    expected square of distance based on the distribution obtained from the
    "training" data. Squared distance between

        - a known and an unknown continuous attribute equals the squared
          distance between the known value and the average, plus the variance;
        - two unknown continuous attributes equals double the variance;
        - a known and an unknown discrete attribute equals the probability
          that the unknown attribute has a different value than the known one
          (i.e., 1 - probability of the known value);
        - two unknown discrete attributes equals the probability that two
          randomly chosen values are different, which can be computed as
          1 - sum of squares of probabilities.

    Continuous cases can be handled by averages and variances inherited from
    ExamplesDistance_Normalized. The data for discrete cases are stored in
    distributions (used for unknown vs. known value) and in bothSpecialDist
    (the precomputed distance between two unknown values).

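The four unknown-value rules can be written out as a short sketch. It is plain Python for illustration only: the function names are hypothetical, ``None`` is assumed to mark an unknown value, the discrete distribution is assumed to be a dictionary of value probabilities, and the normalization of continuous differences is left out for brevity.

```python
import math


def squared_continuous(x, y, average, variance):
    """Expected squared distance for one continuous feature."""
    if x is None and y is None:
        return 2.0 * variance
    if x is None or y is None:
        known = y if x is None else x
        return (known - average) ** 2 + variance
    return (x - y) ** 2


def squared_discrete(x, y, probabilities):
    """Expected squared distance for one discrete feature."""
    if x is None and y is None:
        # Probability that two randomly chosen values differ.
        return 1.0 - sum(p * p for p in probabilities.values())
    if x is None or y is None:
        known = y if x is None else x
        return 1.0 - probabilities[known]
    return 0.0 if x == y else 1.0


# Hypothetical statistics obtained from "training" data.
average, variance = 5.0, 4.0
probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}

one_unknown = squared_continuous(7.0, None, average, variance)
both_unknown = squared_continuous(None, None, average, variance)
known_vs_unknown = squared_discrete("a", None, probabilities)
two_unknown = squared_discrete(None, None, probabilities)

# Euclidean distance over one continuous and one discrete feature.
distance = math.sqrt(one_unknown + known_vs_unknown)
```

The value for two unknown discrete values depends only on the distribution, which is why it can be precomputed once per feature.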
.. class:: Relief, ReliefConstructor

    Relief is similar to Manhattan distance, but incorporates a more
    correct treatment of undefined values, which is used by the ReliefF measure.

    This class is derived directly from ExamplesDistance, not from
    ExamplesDistance_Normalized.

.. autoclass:: PearsonR
    :members:

.. autoclass:: SpearmanR
    :members:

.. autoclass:: PearsonRConstructor
    :members:

.. autoclass:: SpearmanRConstructor
    :members: