#
source:
orange/docs/reference/rst/Orange.distance.rst
@
9805:6b1b33eddc91

Revision 9805:6b1b33eddc91, 5.8 KB checked in by markotoplak, 2 years ago (diff) |
---|

Line | |
---|---|

1 | .. py:currentmodule:: Orange.distance |

2 | |

3 | ########################################## |

4 | Distance (``distance``) |

5 | ########################################## |

6 | |

7 | The following example demonstrates how to compute distances between two instances: |

8 | |

9 | .. literalinclude:: code/distance-simple.py |

10 | :lines: 1-7 |

11 | |

12 | A matrix with all pairwise distances can be computed with :obj:`distance_matrix`: |

13 | |

14 | .. literalinclude:: code/distance-simple.py |

15 | :lines: 9-11 |

16 | |

17 | Unknown values are treated correctly only by Euclidean and Relief |

18 | distance. For other measures, a distance between unknown and known or |

19 | between two unknown values is always 0.5. |

20 | |

21 | =================== |

22 | Computing distances |

23 | =================== |

24 | |

25 | Distance measures typically have to be adjusted to the data. For instance, |

26 | when the data set contains continuous features, the distances between |

27 | continuous values should be normalized to ensure that all features have |

28 | similar impacts, e.g. by dividing the distance by the range. |

29 | |

30 | Distance measures thus appear in pairs: |

31 | |

32 | - a class that constructs the distance measure based on the |

33 | data (subclass of :obj:`DistanceConstructor`, for |

34 | example :obj:`Euclidean`), and returns it as |

35 | - a class that measures the distance between two instances |

36 | (subclass of :obj:`Distance`, for example :obj:`EuclideanDistance`). |

37 | |

38 | .. class:: DistanceConstructor |

39 | |

40 | .. method:: __call__([instances, weightID][, distributions][, basic_var_stat]) |

41 | |

42 | Constructs a :obj:`Distance`. Not all arguments are required. |

43 | Most measures can be constructed from basic_var_stat; if it is |

44 | not given, instances or distributions can be used. |

45 | |

46 | .. class:: Distance |

47 | |

48 | .. method:: __call__(instance1, instance2) |

49 | |

50 | Return a distance between the given instances (as a floating point number). |
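The constructor/measure pairing described above can be illustrated with a minimal, self-contained sketch. It does not use Orange; the class names ``RangeNormalizedManhattan`` and ``RangeNormalizedManhattanConstructor`` are hypothetical and only mimic the pattern: the constructor inspects the data once, and the returned object measures distances between two instances.

```python
class RangeNormalizedManhattan:
    """Hypothetical measure: stores per-feature ranges learned from data."""

    def __init__(self, ranges):
        self.ranges = ranges

    def __call__(self, instance1, instance2):
        # Sum of absolute differences, each divided by the feature's range.
        total = 0.0
        for a, b, feature_range in zip(instance1, instance2, self.ranges):
            if feature_range > 0:  # constant features contribute nothing
                total += abs(a - b) / feature_range
        return total


class RangeNormalizedManhattanConstructor:
    """Hypothetical constructor: fits the measure to the given instances."""

    def __call__(self, instances):
        columns = list(zip(*instances))
        ranges = [max(column) - min(column) for column in columns]
        return RangeNormalizedManhattan(ranges)


data = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
measure = RangeNormalizedManhattanConstructor()(data)  # adjust to the data
d = measure(data[0], data[1])                           # then measure
print(d)
```

Separating the two roles means the normalization statistics are computed once, while the measure itself can be called many times.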

51 | |

52 | Pairwise distances |

53 | ================== |

54 | |

55 | .. autofunction:: distance_matrix |

56 | |

57 | ========= |

58 | Measures |

59 | ========= |

60 | |

61 | Distance measures are defined with two classes: a subclass of :obj:`DistanceConstructor` |

62 | and a subclass of :obj:`Distance`. |

63 | |

64 | .. class:: Hamming |

65 | .. class:: HammingDistance |

66 | |

67 | The number of features in which the two instances differ. This measure |

68 | is not appropriate for instances that contain continuous features. |

69 | |

70 | .. class:: Maximal |

71 | .. class:: MaximalDistance |

72 | |

73 | The maximal distance |

74 | between two feature values. If dist is the result of |

75 | :obj:`~DistanceNormalized.feature_distances`, |

76 | then :class:`Maximal` returns ``max(dist)``. |

77 | |

78 | .. class:: Manhattan |

79 | .. class:: ManhattanDistance |

80 | |

81 | The sum of absolute values |

82 | of distances between pairs of features, e.g. ``sum(abs(x) for x in dist)`` |

83 | where dist is the result of :obj:`~DistanceNormalized.feature_distances`. |
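As a plain-Python illustration of how these measures aggregate per-feature distances, the sketch below starts from a made-up list ``dist`` standing in for the result of ``feature_distances``. The values are invented for the example; the Euclidean line uses the formula given for that measure in this section.

```python
import math

# Made-up normalized per-feature distances (stand-in for feature_distances).
dist = [0.25, 0.0, 1.0, 0.5]

maximal = max(dist)                               # Maximal
manhattan = sum(abs(x) for x in dist)             # Manhattan
euclidean = math.sqrt(sum(x * x for x in dist))   # Euclidean

print(maximal, manhattan, euclidean)
```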

84 | |

85 | .. class:: Euclidean |

86 | .. class:: EuclideanDistance |

87 | |

88 | The square root of the sum of squared per-feature distances, |

89 | i.e. ``sqrt(sum(x*x for x in dist))``, where dist is the result of |

90 | :obj:`~DistanceNormalized.feature_distances`. |

91 | |

92 | .. method:: distributions |

93 | |

94 | A :obj:`~Orange.statistics.distribution.Distribution` containing |

95 | the distributions for all discrete features used for |

96 | computation of distances between known and unknown values. |

97 | |

98 | .. method:: both_special_dist |

99 | |

100 | A list containing the distance between two unknown values for each |

101 | discrete feature. |

102 | |

103 | Unknown values are handled by computing the |

104 | expected square of distance based on the distribution from the |

105 | "training" data. Squared distance between |

106 | |

107 | - A known and an unknown continuous feature equals the squared distance |

108 | between the known value and the average, plus the variance. |

109 | - Two unknown continuous features equals twice the variance. |

110 | - A known and an unknown discrete feature equals the probability |

111 | that the unknown feature has a different value than the known one |

112 | (i.e., 1 - probability of the known value). |

113 | - Two unknown discrete features equals the probability that two |

114 | randomly chosen values are different, which can be computed as |

115 | 1 - sum of squares of probabilities. |

116 | |

117 | Continuous cases are handled as inherited from |

118 | :class:`DistanceNormalized`. The data for discrete cases are |

119 | stored in distributions (used for unknown vs. known value) and |

120 | in :obj:`both_special_dist` (the precomputed distance between two |

121 | unknown values). |
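The four rules for unknown values can be written out as a short sketch. The function names and the dictionary of value probabilities are hypothetical and are not part of the Orange API; the arithmetic follows the rules listed above, with the last case being the probability that two independently drawn values differ.

```python
def sq_known_unknown_continuous(value, average, variance):
    # Expected squared distance from a known value to an unknown one.
    return (value - average) ** 2 + variance


def sq_both_unknown_continuous(variance):
    # Expected squared distance between two unknown continuous values.
    return 2.0 * variance


def sq_known_unknown_discrete(value, probabilities):
    # Probability that the unknown value differs from the known one.
    return 1.0 - probabilities[value]


def sq_both_unknown_discrete(probabilities):
    # Probability that two independently drawn values differ.
    return 1.0 - sum(p * p for p in probabilities.values())


# Made-up distribution of a discrete feature in the "training" data.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}

print(sq_known_unknown_continuous(3.0, average=1.0, variance=2.0))
print(sq_both_unknown_continuous(2.0))
print(sq_known_unknown_discrete("a", probabilities))
print(sq_both_unknown_discrete(probabilities))
```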

122 | |

123 | .. class:: Relief |

124 | .. class:: ReliefDistance |

125 | |

126 | Relief is similar to Manhattan distance, but incorporates the |

127 | treatment of undefined values, which is used by the ReliefF measure. |

128 | |

129 | This class is derived directly from :obj:`Distance`. |

130 | |

131 | .. autoclass:: PearsonR |

132 | :members: |

133 | |

134 | .. autoclass:: PearsonRDistance |

135 | :members: |

136 | |

137 | .. autoclass:: SpearmanR |

138 | :members: |

139 | |

140 | .. autoclass:: SpearmanRDistance |

141 | :members: |

142 | |

143 | .. autoclass:: Mahalanobis |

144 | :members: |

145 | |

146 | .. autoclass:: MahalanobisDistance |

147 | :members: |

148 | |

149 | ========= |

150 | Utilities |

151 | ========= |

152 | |

153 | .. class:: DistanceNormalized |

154 | |

155 | An abstract class that provides normalization. |

156 | |

157 | .. attribute:: normalizers |

158 | |

159 | A precomputed list of normalizing factors for feature values. They are: |

160 | |

161 | - 1/(max_value-min_value) for continuous and 1/number_of_values |

162 | for ordinal features. These factors multiply the differences |

163 | between feature values. If either value is unknown, the |

164 | distance is 0.5. |

165 | - ``-1`` for nominal features; the distance |

166 | between two values is 0 if they are the same (or at least one is |

167 | unknown) and 1 if they are different. |

168 | - ``0`` for ignored features. |

169 | |

170 | .. attribute:: bases, averages, variances |

171 | |

172 | The minimal values, averages and variances |

173 | (continuous features only). |

174 | |

175 | .. attribute:: domain_version |

176 | |

177 | The domain version changes each time a domain description is |

178 | changed (i.e. features are added or removed). |

179 | |

180 | .. method:: feature_distances(instance1, instance2) |

181 | |

182 | Return a list of floats representing normalized distances between |

183 | pairs of feature values of the two instances. |
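How the ``normalizers`` encoding might drive per-feature distances can be sketched in plain Python. The function ``sketch_feature_distances`` is hypothetical and is not the Orange implementation; it assumes unknown values are represented by ``None`` and that ignored features contribute 0, and otherwise follows the rules listed for ``normalizers``.

```python
def sketch_feature_distances(instance1, instance2, normalizers):
    """Hypothetical re-creation of the documented normalization rules."""
    result = []
    for a, b, factor in zip(instance1, instance2, normalizers):
        unknown = a is None or b is None
        if factor == 0:
            # Ignored feature (assumed to contribute nothing).
            result.append(0.0)
        elif factor < 0:
            # Nominal feature: 0 if same or any value unknown, else 1.
            result.append(0.0 if unknown or a == b else 1.0)
        elif unknown:
            # Continuous or ordinal feature with an unknown value.
            result.append(0.5)
        else:
            # Continuous or ordinal feature: scaled absolute difference.
            result.append(abs(a - b) * factor)
    return result


instance1 = [2.0, "red", 7.0, None]
instance2 = [6.0, "blue", 1.0, 3.0]
normalizers = [0.125, -1, 0, 0.5]  # made-up factors for four features

dist = sketch_feature_distances(instance1, instance2, normalizers)
print(dist)
```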

184 | |

185 |
