# source:orange/docs/reference/rst/Orange.distance.rst@9821:6cc715432fa7

Revision 9821:6cc715432fa7, 5.8 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

.. py:currentmodule:: Orange.distance

##########################################
Distance (``distance``)
##########################################

The following example demonstrates how to compute distances between two instances:

.. literalinclude:: code/distance-simple.py
    :lines: 1-7

A matrix with all pairwise distances can be computed with :obj:`distance_matrix`:

.. literalinclude:: code/distance-simple.py
    :lines: 9-11

Only the Euclidean and Relief distances treat unknown values
properly.  For other measures, the distance between an unknown and a known
value, or between two unknown values, is always 0.5.

===================
Computing distances
===================

Distance measures typically have to be adjusted to the data. For instance,
when the data set contains continuous features, the distances between
continuous values should be normalized so that all features have
similar impact, e.g. by dividing the distance by the range.
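
The effect of dividing by the range can be sketched in plain Python. This is an illustration only, not Orange's implementation; the helper ``range_normalized_difference`` and the sample data are hypothetical:

.. code-block:: python

    def range_normalized_difference(a, b, low, high):
        """Absolute difference between a and b, scaled by the feature range."""
        span = high - low
        if span == 0:
            # A constant feature cannot separate instances.
            return 0.0
        return abs(a - b) / span

    # Hypothetical data: (height in cm, weight in kg).
    rows = [(150.0, 50.0), (200.0, 60.0), (175.0, 100.0)]

    # Per-feature (min, max) pairs computed from the data.
    ranges = [(min(col), max(col)) for col in zip(*rows)]

    first, second = rows[0], rows[1]
    per_feature = [range_normalized_difference(a, b, low, high)
                   for a, b, (low, high) in zip(first, second, ranges)]

After scaling, every per-feature difference lies between 0 and 1, so a feature measured on a wide scale no longer dominates one measured on a narrow scale.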

Distance measures thus appear in pairs:

- a class that constructs the distance measure based on the
  data (a subclass of :obj:`DistanceConstructor`, for
  example :obj:`Euclidean`), and
- a class that measures the distance between two instances
  (a subclass of :obj:`Distance`, for example :obj:`EuclideanDistance`);
  the constructor returns an instance of this class.

.. class:: DistanceConstructor

    .. method:: __call__([instances, weightID][, distributions][, basic_var_stat])

        Constructs a :obj:`Distance`. Not all arguments are required.
        Most measures can be constructed from ``basic_var_stat``; if it is
        not given, ``instances`` or ``distributions`` can be used.

.. class:: Distance

    .. method:: __call__(instance1, instance2)

        Return the distance between the given instances (as a floating point number).
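
The constructor/measure pairing can be mimicked in plain Python. The sketch below is not part of Orange: the classes ``RangeManhattan`` and ``RangeManhattanDistance`` are hypothetical, instances are assumed to be tuples of numbers, and no unknown values are handled:

.. code-block:: python

    class RangeManhattanDistance(object):
        """Hypothetical measure: stores per-feature ranges fitted from data."""

        def __init__(self, ranges):
            self.ranges = ranges

        def __call__(self, instance1, instance2):
            total = 0.0
            for a, b, (low, high) in zip(instance1, instance2, self.ranges):
                span = high - low
                if span:
                    # Scale each difference by the range seen in the data.
                    total += abs(a - b) / span
            return total


    class RangeManhattan(object):
        """Hypothetical constructor: inspects the data, returns a measure."""

        def __call__(self, instances):
            columns = list(zip(*instances))
            ranges = [(min(col), max(col)) for col in columns]
            return RangeManhattanDistance(ranges)


    data = [(0.0, 10.0), (2.0, 30.0), (4.0, 20.0)]
    measure = RangeManhattan()(data)   # fit the measure to the data once
    d = measure(data[0], data[1])      # then reuse it for any pair

Separating the two steps means the statistics are computed once, while the resulting measure can be called repeatedly on pairs of instances.
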

Pairwise distances
==================

.. autofunction:: distance_matrix
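
Conceptually, a pairwise distance matrix is filled by calling a measure on every pair of instances. The following plain-Python sketch is an illustration, not the implementation of :obj:`distance_matrix`; the helpers ``pairwise_distances`` and ``manhattan`` are hypothetical, and the measure is assumed to be symmetric with zero distance from an instance to itself:

.. code-block:: python

    def pairwise_distances(instances, distance):
        """Return a symmetric list-of-lists matrix of distances."""
        n = len(instances)
        matrix = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i):
                # Compute each pair once and mirror it across the diagonal.
                d = distance(instances[i], instances[j])
                matrix[i][j] = d
                matrix[j][i] = d
        return matrix


    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))


    points = [(0, 0), (3, 4), (6, 8)]
    matrix = pairwise_distances(points, manhattan)

Only the lower triangle is computed explicitly, which halves the number of calls to the measure.
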

=========
Measures
=========

Distance measures are defined with two classes: a subclass of :obj:`DistanceConstructor`
and a subclass of :obj:`Distance`.

.. class:: Hamming
.. class:: HammingDistance

    The number of features in which the two instances differ. This measure
    is not appropriate for instances that contain continuous features.
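
    As an illustration only (this is not Orange's implementation), counting
    differing features can be sketched in plain Python. The function
    ``hamming`` and the tuple representation of instances are assumptions:

    .. code-block:: python

        def hamming(instance1, instance2):
            """Count the positions at which the two instances differ."""
            return sum(1 for a, b in zip(instance1, instance2) if a != b)

        d = hamming(("red", "small", "round"), ("red", "large", "square"))

    Here the two instances agree on the first feature and differ on the other
    two, so ``d`` is 2.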

.. class:: Maximal
.. class:: MaximalDistance

    The maximal distance between two feature values. If ``dist`` is the
    result of :obj:`~DistanceNormalized.feature_distances`,
    then :class:`Maximal` returns ``max(dist)``.

.. class:: Manhattan
.. class:: ManhattanDistance

    The sum of absolute values
    of distances between pairs of features, i.e. ``sum(abs(x) for x in dist)``,
    where ``dist`` is the result of :obj:`~DistanceNormalized.feature_distances`.

.. class:: Euclidean
.. class:: EuclideanDistance

    The square root of the sum of squared per-feature distances,
    i.e. ``sqrt(sum(x*x for x in dist))``, where ``dist`` is the result of
    :obj:`~DistanceNormalized.feature_distances`.
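
    The maximal, Manhattan and Euclidean measures differ only in how they
    aggregate the per-feature distances. A short sketch, in which the list
    ``dist`` is made-up sample data standing in for the result of
    :obj:`~DistanceNormalized.feature_distances`:

    .. code-block:: python

        from math import sqrt

        # Hypothetical per-feature distances for one pair of instances.
        dist = [0.5, 0.0, 1.0, 0.25]

        maximal = max(dist)                          # largest single difference
        manhattan = sum(abs(x) for x in dist)        # sum of absolute differences
        euclidean = sqrt(sum(x * x for x in dist))   # root of summed squares

    For this sample, ``maximal`` is 1.0, ``manhattan`` is 1.75 and
    ``euclidean`` is the square root of 1.3125.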

    .. method:: distributions

        A :obj:`~Orange.statistics.distribution.Distribution` containing
        the distributions for all discrete features used for
        computation of distances between known and unknown values.

    .. method:: both_special_dist

        A list containing the distance between two unknown values for each
        discrete feature.

    Unknown values are handled by computing the
    expected squared distance based on the distribution from the
    "training" data. The squared distance between

        - a known and an unknown continuous value equals the squared distance
          between the known value and the average, plus the variance;
        - two unknown continuous values equals twice the variance;
        - a known and an unknown discrete value equals the probability
          that the unknown value differs from the known one
          (i.e., 1 - probability of the known value);
        - two unknown discrete values equals the probability that two
          randomly chosen values are different, which can be computed as
          1 - sum of squares of probabilities.

    Continuous cases are handled as inherited from
    :class:`DistanceNormalized`. The data for discrete cases are
    stored in :obj:`distributions` (used for unknown vs. known value) and
    in :obj:`both_special_dist` (the precomputed distance between two
    unknown values).
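
    The four rules for unknown values can be written out as small
    plain-Python functions. This is a sketch of the formulas only, not
    Orange's code: the function names are hypothetical, continuous features
    are described by an average and a variance, and discrete features by a
    dictionary that maps each value to its probability:

    .. code-block:: python

        def sq_known_unknown_continuous(value, average, variance):
            """Expected squared distance: known vs. unknown continuous value."""
            return (value - average) ** 2 + variance

        def sq_both_unknown_continuous(variance):
            """Expected squared distance: two unknown continuous values."""
            return 2.0 * variance

        def sq_known_unknown_discrete(value, probabilities):
            """Probability that an unknown value differs from the known one."""
            return 1.0 - probabilities[value]

        def sq_both_unknown_discrete(probabilities):
            """Probability that two randomly chosen values differ."""
            return 1.0 - sum(p * p for p in probabilities.values())

        # Hypothetical distribution of a three-valued discrete feature.
        probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}

    With this distribution, a known ``"a"`` against an unknown value gives
    0.5, and two unknown values give 1 - (0.25 + 0.0625 + 0.0625) = 0.625.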

.. class:: Relief
.. class:: ReliefDistance

    Relief is similar to the Manhattan distance, but incorporates the
    treatment of undefined values used by the ReliefF measure.

    This class is derived directly from :obj:`Distance`.

.. autoclass:: PearsonR
    :members:

.. autoclass:: PearsonRDistance
    :members:

.. autoclass:: SpearmanR
    :members:

.. autoclass:: SpearmanRDistance
    :members:

.. autoclass:: Mahalanobis
    :members:

.. autoclass:: MahalanobisDistance
    :members:

=========
Utilities
=========

.. class:: DistanceNormalized

    An abstract class that provides normalization.

    .. attribute:: normalizers

        A precomputed list of normalizing factors for feature values. They are:

        - ``1/(max_value - min_value)`` for continuous and
          ``1/number_of_values`` for ordinal features. These factors
          multiply the difference between the two feature values.
          If either value is unknown, the distance is 0.5.
        - ``-1`` for nominal features; the distance
          between two values is 0 if they are the same (or at least one is
          unknown) and 1 if they are different.
        - ``0`` for ignored features.

    .. attribute:: bases, averages, variances

        The minimal values, averages and variances
        (continuous features only).

    .. attribute:: domain_version

        The domain version changes each time a domain description is
        changed (i.e. features are added or removed).

    .. method:: feature_distances(instance1, instance2)

        Return a list of floats representing normalized distances between
        pairs of feature values of the two instances.
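
        How the normalizing factors described under :obj:`normalizers`
        could translate into per-feature distances is sketched below in
        plain Python. This is an approximation for illustration, not the
        actual method: the function ``sketch_feature_distances`` is
        hypothetical, instances are assumed to be tuples, and ``None``
        stands in for an unknown value:

        .. code-block:: python

            def sketch_feature_distances(instance1, instance2, normalizers):
                distances = []
                for a, b, factor in zip(instance1, instance2, normalizers):
                    unknown = a is None or b is None
                    if factor == 0:
                        # Ignored feature: contributes nothing.
                        distances.append(0.0)
                    elif factor == -1:
                        # Nominal feature: 0 if same or unknown, 1 otherwise.
                        distances.append(0.0 if unknown or a == b else 1.0)
                    elif unknown:
                        # Continuous or ordinal feature with an unknown value.
                        distances.append(0.5)
                    else:
                        # Scale the difference by the normalizing factor.
                        distances.append(abs(a - b) * factor)
                return distances

            # One continuous, one nominal, one ignored, one continuous feature.
            normalizers = [1.0 / (10.0 - 0.0), -1, 0, 0.25]
            dist = sketch_feature_distances((2.0, "x", 7, None),
                                            (7.0, "y", 9, 3.0),
                                            normalizers)

        The four features in this example exercise the four branches in
        turn: a scaled difference, a nominal mismatch, an ignored feature
        and an unknown continuous value.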