.. Source: orange/docs/reference/rst/Orange.distance.instances.rst@9641:87e1bab24ac0
   Revision 9641:87e1bab24ac0, 6.7 KB, checked in by anze <anze.staric@…>, 2 years ago.
   Commit message: Updated measure formulas to "recent" python pseudo code.

.. automodule:: Orange.distance.instances

#########################
Instances (``instances``)
#########################

###########################
Distances between Instances
###########################

This page describes classes that implement different metrics for measuring
distances (dissimilarities) between instances.

Many (although not all) measures of distance between instances require
some "learning", that is, adjusting the measure to the data. For instance, when
the dataset contains continuous features, the distances between continuous
values should be normalized, e.g. by dividing the distance by the range
of possible values or by some interquartile distance, to ensure that all
features have, in principle, similar impacts.

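As a rough illustration of such "learning", the sketch below scales differences
in one continuous feature by the range observed in the data. It is plain Python
and does not use Orange; the name ``fit_range_normalizer`` is hypothetical.

```python
def fit_range_normalizer(values):
    """Return a function that scales differences by the observed range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant feature carries no information; give it zero weight.
    factor = 1.0 / span if span > 0 else 0.0

    def normalized_difference(a, b):
        return abs(a - b) * factor

    return normalized_difference


ages = [20.0, 35.0, 60.0]
age_diff = fit_range_normalizer(ages)
print(age_diff(20.0, 60.0))  # the full observed range maps to (about) 1.0
```

With this scaling, a difference in any continuous feature falls between 0 and 1
for values inside the observed range, so no single feature dominates the sum.
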
Different measures of distance thus appear in pairs: a class that measures
the distance and a class that constructs it based on the data. The abstract
classes representing such a pair are `ExamplesDistance` and
`ExamplesDistanceConstructor`.

Since most measures work on normalized distances between corresponding
features, there is an abstract intermediate class
`ExamplesDistance_Normalized` that takes care of normalizing.
The remaining classes correspond to different ways of defining the distances,
such as Manhattan or Euclidean distance.

Unknown values are treated correctly only by Euclidean and Relief distance.
For other measures of distance, the distance between an unknown and a known
value, or between two unknown values, is always 0.5.

.. class:: ExamplesDistance

    .. method:: __call__(instance1, instance2)

        Returns the distance between the given instances as a floating
        point number.

.. class:: ExamplesDistanceConstructor

    .. method:: __call__([instances, weightID][, distributions][, basic_var_stat])

        Constructs an instance of ExamplesDistance.
        Not all the data needs to be given. Most measures can be constructed
        from basic_var_stat; if it is not given, they can derive what they
        need from either instances or distributions.
        Some (e.g. ExamplesDistance_Hamming) do not need any arguments at all.

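The constructor/measure pairing can be sketched in plain Python as follows.
This is only an illustration of the pattern, not Orange's implementation; the
classes ``RangeManhattan`` and ``RangeManhattanConstructor`` are hypothetical,
and instances are represented as tuples of floats.

```python
class RangeManhattan(object):
    """Distance measure that holds precomputed per-feature factors."""

    def __init__(self, factors):
        self.factors = factors

    def __call__(self, instance1, instance2):
        return sum(abs(a - b) * f
                   for a, b, f in zip(instance1, instance2, self.factors))


class RangeManhattanConstructor(object):
    """Builds a RangeManhattan measure from a list of instances."""

    def __call__(self, instances):
        factors = []
        for column in zip(*instances):
            span = max(column) - min(column)
            # Constant columns get weight 0 so they do not affect distances.
            factors.append(1.0 / span if span > 0 else 0.0)
        return RangeManhattan(factors)


data = [(0.0, 10.0), (2.0, 30.0), (4.0, 20.0)]
distance = RangeManhattanConstructor()(data)   # "learning" step
print(distance(data[0], data[1]))              # measuring step
```

The constructor is called once on the data; the resulting measure can then be
called on any pair of instances without looking at the data again.
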
.. class:: ExamplesDistance_Normalized

    This abstract class provides a function which is given two instances
    and returns a list of normalized distances between values of their
    features. Many distance measuring classes need such a function and are
    therefore derived from this class.

    .. attribute:: normalizers

        A precomputed list of normalizing factors for feature values.

        - If a factor is positive, differences in the feature's values
          are multiplied by it; for continuous features the factor
          is 1/(max_value-min_value) and for ordinal features
          the factor is 1/number_of_values. If either (or both) of
          the values are unknown, the distance is 0.5.
        - If a factor is -1, the feature is nominal; the distance
          between two values is 0 if they are the same (or at least
          one is unknown) and 1 if they are different.
        - If a factor is 0, the feature is ignored.

    .. attribute:: bases, averages, variances

        The minimal values, averages and variances
        (continuous features only).

    .. attribute:: domainVersion

        Stores the domain version for which the normalizers were computed.
        The domain version is increased each time a domain description is
        changed (i.e. features are added or removed); this is used for a quick
        check that the user is not attempting to measure distances between
        instances that do not correspond to the normalizers.
        Since domains are practically immutable (especially from Python),

    .. method:: attributeDistances(instance1, instance2)

        Returns a list of floats representing distances between pairs of
        feature values of the two instances.


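The rules for ``normalizers`` can be summarized in a small plain-Python sketch
of what a function like ``attributeDistances`` might compute. The function
``attribute_distances`` is hypothetical, ``None`` stands in for an unknown
value, and reporting an ignored feature as 0.0 is an assumption of this sketch.

```python
def attribute_distances(instance1, instance2, normalizers):
    """Per-feature distances following the normalizer rules described above."""
    distances = []
    for a, b, factor in zip(instance1, instance2, normalizers):
        if factor == 0:
            # Ignored feature; this sketch reports it as 0.0 (assumption).
            distances.append(0.0)
        elif factor == -1:
            # Nominal feature: 0 if equal or if either value is unknown.
            if a is None or b is None or a == b:
                distances.append(0.0)
            else:
                distances.append(1.0)
        elif a is None or b is None:
            # Continuous or ordinal feature with an unknown value.
            distances.append(0.5)
        else:
            distances.append(abs(a - b) * factor)
    return distances


normalizers = [0.25, -1, 0, 0.5]
x = (2.0, "red", 7.0, None)
y = (4.0, "blue", 9.0, 1.0)
print(attribute_distances(x, y, normalizers))  # [0.5, 1.0, 0.0, 0.5]
```
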
.. class:: HammingConstructor
.. class:: Hamming

    The Hamming distance between two instances is defined as the number of
    features in which the two instances differ. Note that this measure
    is not really appropriate for instances that contain continuous features.


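A minimal plain-Python sketch of this definition, with instances represented
as tuples of discrete values (the function name ``hamming`` is hypothetical
and unrelated to the Orange class):

```python
def hamming(instance1, instance2):
    """Count the features in which the two instances differ."""
    return sum(1 for a, b in zip(instance1, instance2) if a != b)


print(hamming(("red", "small", "round"), ("red", "large", "square")))  # 2
```

Two continuous values are almost never exactly equal, so nearly every
continuous feature would count as a difference, which is why the measure
suits discrete features better.
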
.. class:: MaximalConstructor
.. class:: Maximal

    The maximal distance between two instances is defined as the largest
    distance between corresponding feature values. If ``dist`` is the result of
    ExamplesDistance_Normalized.attributeDistances,
    then Maximal returns ``max(dist)``.


.. class:: ManhattanConstructor
.. class:: Manhattan

    Manhattan distance between two instances is the sum of absolute values
    of distances between pairs of features, i.e. ``sum(abs(x) for x in dist)``,
    where ``dist`` is the result of ExamplesDistance_Normalized.attributeDistances.

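To make the Maximal and Manhattan aggregations concrete, the following sketch
applies both to the same hand-written list of per-feature distances. The list
``dist`` only stands in for the result of attributeDistances; its values are
made up for illustration.

```python
# Hypothetical per-feature distances, as attributeDistances might return them.
dist = [0.25, 1.0, 0.0, 0.5]

maximal = max(dist)                     # largest single-feature distance
manhattan = sum(abs(x) for x in dist)   # sum of all per-feature distances

print(maximal)    # 1.0
print(manhattan)  # 1.75
```
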
.. class:: EuclideanConstructor
.. class:: Euclidean

    Euclidean distance is the square root of the sum of squared per-feature
    distances, i.e. ``sqrt(sum(x*x for x in dist))``, where ``dist`` is the
    result of ExamplesDistance_Normalized.attributeDistances.

    .. method:: distributions

        An object of type
        :obj:`~Orange.statistics.distribution.Distribution` that holds
        the distributions for all discrete features used for
        computation of distances between known and unknown values.

    .. method:: bothSpecialDist

        A list containing the distance between two unknown values for each
        discrete feature.

    This measure of distance deals with unknown values by computing the
    expected square of distance based on the distribution obtained from the
    "training" data. The squared distance between

        - a known and an unknown continuous attribute equals the squared
          distance between the known value and the average, plus the variance;
        - two unknown continuous attributes equals double the variance;
        - a known and an unknown discrete attribute equals the probability
          that the unknown attribute has a different value than the known
          (i.e., 1 - probability of the known value);
        - two unknown discrete attributes equals the probability that two
          randomly chosen values are different, which can be computed as
          1 - sum of squares of probabilities.

    Continuous cases can be handled by averages and variances inherited from
    ExamplesDistance_Normalized. The data for discrete cases are stored in
    distributions (used for unknown vs. known value) and in bothSpecialDist
    (the precomputed distance between two unknown values).

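The four rules for unknown values in the Euclidean distance can be sketched in
plain Python as below. This is not Orange's code: the helper names are
hypothetical, ``None`` marks an unknown value, continuous values are assumed to
be already normalized, and the distribution of a discrete feature is given as a
dictionary of relative frequencies.

```python
import math


def squared_continuous(a, b, average, variance):
    """Expected squared difference for one continuous feature."""
    if a is None and b is None:
        return 2.0 * variance
    if a is None or b is None:
        known = b if a is None else a
        return (known - average) ** 2 + variance
    return (a - b) ** 2


def squared_discrete(a, b, probabilities):
    """Expected squared distance for one discrete feature."""
    if a is None and b is None:
        # Probability that two randomly chosen values differ.
        return 1.0 - sum(p * p for p in probabilities.values())
    if a is None or b is None:
        known = b if a is None else a
        return 1.0 - probabilities[known]
    return 0.0 if a == b else 1.0


colour = {"red": 0.5, "blue": 0.25, "green": 0.25}
terms = [
    squared_continuous(3.0, None, average=1.0, variance=2.0),  # (3-1)^2 + 2 = 6
    squared_discrete(None, None, colour),                      # 1 - 0.375 = 0.625
]
print(math.sqrt(sum(terms)))
```

The value returned by ``squared_discrete`` for two unknowns depends only on the
distribution, which is why it can be precomputed once per discrete feature.
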
.. class:: ReliefConstructor
.. class:: Relief

    Relief is similar to Manhattan distance, but incorporates a more
    correct treatment of undefined values, which is used by the ReliefF measure.

    This class is derived directly from ExamplesDistance, not from
    ExamplesDistance_Normalized.


.. autoclass:: PearsonR
    :members:

.. autoclass:: SpearmanR
    :members:

.. autoclass:: PearsonRConstructor
    :members:

.. autoclass:: SpearmanRConstructor
    :members: