source: orange/docs/reference/rst/Orange.distance.rst @ 9663:74b63c8ea80c

Revision 9663:74b63c8ea80c, 6.7 KB checked in by markotoplak, 2 years ago

Orange.distance.instances -> Orange.distance

.. automodule:: Orange.distance

##########################################
Distance (``distance``)
##########################################

This page describes classes that implement different metrics for measuring
distances (dissimilarities) between instances.

Most (although not all) measures of distance between instances require
some "learning", that is, adjusting the measure to the data. For instance, when
the dataset contains continuous features, the distances between continuous
values should be normalized, e.g. by dividing the distance by the range
of possible values or by some interquartile distance, to ensure that all
features have, in principle, similar impacts.

Different measures of distance thus appear in pairs: a class that measures
the distance and a class that constructs it based on the data. The abstract
classes representing such a pair are `ExamplesDistance` and
`ExamplesDistanceConstructor`.

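The constructor/measure pairing can be illustrated without Orange itself. The
following is a minimal sketch in plain Python, not Orange's implementation:
``RangeManhattan`` and ``RangeManhattanConstructor`` are hypothetical names, and
instances are assumed to be tuples of known continuous values. The constructor
"learns" the range of each feature, and the measure it returns divides
per-feature differences by those ranges.

.. code-block:: python

    class RangeManhattan(object):
        """Hypothetical measure: Manhattan distance over range-normalized values."""

        def __init__(self, ranges):
            self.ranges = ranges

        def __call__(self, instance1, instance2):
            total = 0.0
            for a, b, span in zip(instance1, instance2, self.ranges):
                if span > 0:  # a constant feature carries no information
                    total += abs(a - b) / span
            return total


    class RangeManhattanConstructor(object):
        """Hypothetical constructor: learns per-feature ranges from the data."""

        def __call__(self, instances):
            columns = list(zip(*instances))
            ranges = [max(column) - min(column) for column in columns]
            return RangeManhattan(ranges)


    data = [(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)]
    measure = RangeManhattanConstructor()(data)

    # Both features differ by their full range, so each contributes 1.0.
    d_far = measure(data[0], data[2])   # 2.0
    # Both features differ by half their range, so each contributes 0.5.
    d_near = measure(data[0], data[1])  # 1.0

Without the learned ranges, the second feature (spanning 400 units) would
dominate the first (spanning 2 units); after normalization both have equal weight.
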
Since most measures work on normalized distances between corresponding
features, there is an abstract intermediate class
`ExamplesDistance_Normalized` that takes care of normalizing.
The remaining classes correspond to different ways of defining the distances,
such as Manhattan or Euclidean distance.

Unknown values are treated correctly only by Euclidean and Relief distance.
For other measures of distance, the distance between an unknown and a known
value, or between two unknown values, is always 0.5.

.. class:: ExamplesDistance

    .. method:: __call__(instance1, instance2)

        Returns the distance between the given instances as a floating point number.

.. class:: ExamplesDistanceConstructor

    .. method:: __call__([instances, weightID][, distributions][, basic_var_stat])

        Constructs an instance of ExamplesDistance.
        Not all the data needs to be given. Most measures can be constructed
        from basic_var_stat; if it is not given, they can be constructed
        from either instances or distributions instead.
        Some (e.g. ExamplesDistance_Hamming) do not even need any arguments.

.. class:: ExamplesDistance_Normalized

    This abstract class provides a function that is given two instances
    and returns a list of normalized distances between values of their
    features. Many distance measuring classes need such a function and are
    therefore derived from this class.

    .. attribute:: normalizers

        A precomputed list of normalizing factors for feature values.

        - If a factor is positive, differences in the feature's values
          are multiplied by it; for continuous features the factor
          would be 1/(max_value-min_value) and for ordinal features
          the factor is 1/number_of_values. If either (or both) of
          the values is unknown, the distance is 0.5.
        - If a factor is -1, the feature is nominal; the distance
          between two values is 0 if they are the same (or at least
          one is unknown) and 1 if they are different.
        - If a factor is 0, the feature is ignored.

    .. attribute:: bases, averages, variances

        The minimal values, averages and variances
        (continuous features only).

    .. attribute:: domainVersion

        Stores the domain version for which the normalizers were computed.
        The domain version is increased each time a domain description is
        changed (i.e. features are added or removed); this is used for a quick
        check that the user is not attempting to measure distances between
        instances that do not correspond to the normalizers.
        Since domains are practically immutable (especially from Python),
        you do not need to worry about this.

    .. method:: attributeDistances(instance1, instance2)

        Returns a list of floats representing distances between pairs of
        feature values of the two instances.


.. class:: HammingConstructor
.. class:: Hamming

    Hamming distance between two instances is defined as the number of
    features in which the two instances differ. Note that this measure
    is not really appropriate for instances that contain continuous features.


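    As a rough illustration in plain Python (``hamming`` is a hypothetical
    helper operating on tuples of values, not the Orange class), the count
    of differing features also shows why continuous features fit poorly:
    values that are nearly equal still count as a full difference.

    .. code-block:: python

        def hamming(values1, values2):
            """Count the positions at which two equally long sequences differ."""
            return sum(1 for a, b in zip(values1, values2) if a != b)

        d_nominal = hamming(("sunny", "hot", "high"),
                            ("sunny", "mild", "normal"))  # 2
        # Continuous values that are almost equal still count as different.
        d_continuous = hamming((36.6, 120.0), (36.7, 120.0))  # 1
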
.. class:: MaximalConstructor
.. class:: Maximal

    The maximal distance between two instances is defined as the largest
    distance between corresponding feature values. If dist is the result of
    ExamplesDistance_Normalized.attributeDistances,
    then Maximal returns max(dist).


.. class:: ManhattanConstructor
.. class:: Manhattan

    Manhattan distance between two instances is the sum of absolute values
    of distances between pairs of features, i.e. ``sum(abs(x) for x in dist)``,
    where dist is the result of ExamplesDistance_Normalized.attributeDistances.

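    Manhattan and Maximal differ only in how they aggregate the same list of
    per-feature distances. A small sketch in plain Python, where ``dist`` is
    a made-up list standing in for the result of ``attributeDistances`` and
    the two functions are hypothetical helpers:

    .. code-block:: python

        def manhattan(dist):
            """Sum of absolute per-feature distances."""
            return sum(abs(x) for x in dist)

        def maximal(dist):
            """Largest per-feature distance."""
            return max(dist)

        dist = [0.25, 0.5, 0.0, 1.0]  # assumed per-feature distances
        d_manhattan = manhattan(dist)  # 1.75
        d_maximal = maximal(dist)      # 1.0
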
.. class:: EuclideanConstructor
.. class:: Euclidean

    Euclidean distance is the square root of the sum of squared per-feature
    distances, i.e. ``sqrt(sum(x*x for x in dist))``, where dist is the result of
    ExamplesDistance_Normalized.attributeDistances.

    .. attribute:: distributions

        An object of type
        :obj:`~Orange.statistics.distribution.Distribution` that holds
        the distributions for all discrete features used for
        computation of distances between known and unknown values.

    .. attribute:: bothSpecialDist

        A list containing the distance between two unknown values for each
        discrete feature.

    This measure of distance deals with unknown values by computing the
    expected square of distance based on the distribution obtained from the
    "training" data. The squared distance between

        - a known and an unknown continuous attribute equals the squared distance
          between the known value and the average, plus the variance;
        - two unknown continuous attributes equals double the variance;
        - a known and an unknown discrete attribute equals the probability
          that the unknown attribute has a different value than the known one
          (i.e., 1 - probability of the known value);
        - two unknown discrete attributes equals the probability that two
          randomly chosen values are different, which can be computed as
          1 - sum of squares of probabilities.

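    The four cases above can be sketched in plain Python. This is a hedged
    illustration, not Orange's implementation: the function names are
    hypothetical, unknown values are represented by ``None``, continuous
    values are assumed to be on the same scale as ``average`` and
    ``variance``, and ``probabilities`` is assumed to map each discrete
    value to its relative frequency.

    .. code-block:: python

        def sq_dist_continuous(a, b, average, variance):
            """Expected squared distance between two continuous values."""
            if a is None and b is None:
                return 2.0 * variance
            if a is None or b is None:
                known = b if a is None else a
                return (known - average) ** 2 + variance
            return (a - b) ** 2

        def sq_dist_discrete(a, b, probabilities):
            """Expected squared distance between two discrete values."""
            if a is None and b is None:
                return 1.0 - sum(p * p for p in probabilities.values())
            if a is None or b is None:
                known = b if a is None else a
                return 1.0 - probabilities[known]
            return 0.0 if a == b else 1.0

        probs = {"a": 0.5, "b": 0.25, "c": 0.25}
        c_one = sq_dist_continuous(3.0, None, 1.0, 0.5)    # 4.5
        c_both = sq_dist_continuous(None, None, 1.0, 0.5)  # 1.0
        d_one = sq_dist_discrete("a", None, probs)         # 0.5
        d_both = sq_dist_discrete(None, None, probs)       # 0.625
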
    Continuous cases can be handled by the averages and variances inherited from
    ExamplesDistance_Normalized. The data for discrete cases are stored in
    distributions (used for unknown vs. known value) and in bothSpecialDist
    (the precomputed distance between two unknown values).

.. class:: ReliefConstructor
.. class:: Relief

    Relief is similar to Manhattan distance, but incorporates a more
    correct treatment of undefined values, which is used by the ReliefF measure.

    This class is derived directly from ExamplesDistance, not from
    ExamplesDistance_Normalized.


.. autoclass:: PearsonR
    :members:

.. autoclass:: SpearmanR
    :members:

.. autoclass:: PearsonRConstructor
    :members:

.. autoclass:: SpearmanRConstructor
    :members: