#
source:
orange/docs/reference/rst/Orange.distance.instances.rst
@
9641:87e1bab24ac0

Revision 9641:87e1bab24ac0, 6.7 KB checked in by anze <anze.staric@…>, 2 years ago (diff) |
---|

Line | |
---|---|

1 | .. automodule:: Orange.distance.instances |

2 | |

3 | ######################### |

4 | Instances (``instances``) |

5 | ######################### |

6 | |

7 | ########################### |

8 | Distances between Instances |

9 | ########################### |

10 | |

11 | This page describes classes that implement different metrics for measuring |

12 | distances (dissimilarities) between instances. |

13 | |

14 | Most (although not all) measures of distance between instances require |

15 | some "learning" - adjusting the measure to the data. For instance, when |

16 | the dataset contains continuous features, the distances between continuous |

17 | values should be normalized, e.g. by dividing the distance by the range |

18 | of possible values or by some interquartile distance to ensure that all |

19 | features have, in principle, similar impacts. |
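As an illustration of the range normalization mentioned above, here is a minimal pure-Python sketch. The helper name ``range_normalizers`` and the tuple-based data layout are assumptions made for this example only; they are not part of Orange.

```python
def range_normalizers(rows):
    """Return one factor per column: 1 / (max - min), or 0 for a constant column."""
    factors = []
    for column in zip(*rows):
        span = max(column) - min(column)
        factors.append(1.0 / span if span > 0 else 0.0)
    return factors


# Two continuous features on very different scales.
rows = [(1.0, 0.0), (2.0, 256.0), (3.0, 512.0)]
factors = range_normalizers(rows)

# The raw differences between the first two rows are 1 and 256.
# After scaling, both features contribute about the same amount.
scaled = [abs(a - b) * f for a, b, f in zip(rows[0], rows[1], factors)]
print(scaled)
```

Without the scaling step, the second feature would dominate any sum of per-feature differences simply because of its units.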

20 | |

21 | Different measures of distance thus appear in pairs - a class that measures |

22 | the distance and a class that constructs it based on the data. The abstract |

23 | classes representing such a pair are `ExamplesDistance` and |

24 | `ExamplesDistanceConstructor`. |

25 | |

26 | Since most measures work on normalized distances between corresponding |

27 | features, there is an abstract intermediate class |

28 | `ExamplesDistance_Normalized` that takes care of normalizing. |

29 | The remaining classes correspond to different ways of defining the distances, |

30 | such as Manhattan or Euclidean distance. |

31 | |

32 | Unknown values are treated correctly only by Euclidean and Relief distance. |

33 | For other measures of distance, the distance between an unknown and a known |

34 | value, or between two unknown values, is always 0.5. |

35 | |

36 | .. class:: ExamplesDistance |

37 | |

38 | .. method:: __call__(instance1, instance2) |

39 | |

40 | Returns the distance between the given instances as a floating-point number. |

41 | |

42 | .. class:: ExamplesDistanceConstructor |

43 | |

44 | .. method:: __call__([instances, weightID][, distributions][, basic_var_stat]) |

45 | |

46 | Constructs an instance of ExamplesDistance. |

47 | Not all the data needs to be given. Most measures can be constructed |

48 | from basic_var_stat; if it is not given, they can compute what they need |

49 | from either instances or distributions. |

50 | Some (e.g. ExamplesDistance_Hamming) do not even need any arguments. |
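The constructor/measure pairing can be sketched in plain Python as follows. This is not Orange's implementation: the class names ``ScaledManhattan`` and ``ScaledManhattanConstructor`` and the ``basic_stats`` argument (a list of ``(min, max)`` pairs standing in for ``basic_var_stat``) are hypothetical and only illustrate how a constructor might fall back from precomputed statistics to raw instances.

```python
class ScaledManhattan:
    """The measure: stores factors learned from data and is callable on two instances."""

    def __init__(self, factors):
        self.factors = factors

    def __call__(self, instance1, instance2):
        return sum(abs(a - b) * f
                   for a, b, f in zip(instance1, instance2, self.factors))


class ScaledManhattanConstructor:
    """The constructor: callable on data or statistics, returns a fitted measure."""

    def __call__(self, instances=None, basic_stats=None):
        # Prefer precomputed statistics; otherwise derive them from the instances.
        if basic_stats is None:
            if instances is None:
                raise ValueError("either instances or basic_stats is required")
            basic_stats = [(min(col), max(col)) for col in zip(*instances)]
        factors = [1.0 / (hi - lo) if hi > lo else 0.0 for lo, hi in basic_stats]
        return ScaledManhattan(factors)


data = [(0.0, 10.0), (4.0, 30.0), (2.0, 50.0)]
from_instances = ScaledManhattanConstructor()(instances=data)
from_stats = ScaledManhattanConstructor()(basic_stats=[(0.0, 4.0), (10.0, 50.0)])

# Both measures were fitted to the same ranges, so they agree.
print(from_instances(data[0], data[1]), from_stats(data[0], data[1]))
```

Separating the two roles means the statistics are computed once, while the resulting measure can be called cheaply on many pairs of instances.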

51 | |

52 | .. class:: ExamplesDistance_Normalized |

53 | |

54 | This abstract class provides a function which is given two instances |

55 | and returns a list of normalized distances between values of their |

56 | features. Many distance measuring classes need such a function and are |

57 | therefore derived from this class. |

58 | |

59 | .. attribute:: normalizers |

60 | |

61 | A precomputed list of normalizing factors for feature values. |

62 | |

63 | - If a factor is positive, differences in the feature's values |

64 | are multiplied by it; for continuous features the factor |

65 | would be 1/(max_value-min_value) and for ordinal features |

66 | the factor is 1/number_of_values. If either (or both) of |

67 | the values are unknown, the distance is 0.5. |

68 | - If a factor is -1, the feature is nominal; the distance |

69 | between two values is 0 if they are same (or at least |

70 | one is unknown) and 1 if they are different. |

71 | - If a factor is 0, the feature is ignored. |
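The three cases for ``normalizers`` can be read as the following pure-Python sketch. It is an illustration, not the library's code: unknown values are represented here by ``None``, and an ignored feature is assumed to contribute ``0.0`` to the returned list (the original text only says it is ignored).

```python
def attribute_distances(instance1, instance2, normalizers):
    """Per-feature distances following the normalizer conventions listed above."""
    distances = []
    for v1, v2, factor in zip(instance1, instance2, normalizers):
        unknown = v1 is None or v2 is None
        if factor == 0:
            # Ignored feature (assumed here to contribute nothing).
            distances.append(0.0)
        elif factor < 0:
            # Nominal feature: 0 if equal or at least one value unknown, else 1.
            distances.append(0.0 if unknown or v1 == v2 else 1.0)
        elif unknown:
            # Continuous or ordinal feature with a missing value.
            distances.append(0.5)
        else:
            distances.append(abs(v1 - v2) * factor)
    return distances


normalizers = [0.125, -1, 0, 0.25, -1]
a = (2.0, "red", 7, None, None)
b = (6.0, "blue", 9, 1.0, "x")
dist = attribute_distances(a, b, normalizers)
print(dist)
```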

72 | |

73 | .. attribute:: bases, averages, variances |

74 | |

75 | The minimal values, averages and variances |

76 | (continuous features only) |

77 | |

78 | .. attribute:: domainVersion |

79 | |

80 | Stores a domain version for which the normalizers were computed. |

81 | The domain version is increased each time a domain description is |

82 | changed (i.e. features are added or removed); this is used for a quick |

83 | check that the user is not attempting to measure distances between |

84 | instances that do not correspond to normalizers. |

85 | Since domains are practically immutable (especially from Python), |

86 | you don't need to care about this anyway. |

87 | |

88 | .. method:: attributeDistances(instance1, instance2) |

89 | |

90 | Returns a list of floats representing distances between pairs of |

91 | feature values of the two instances. |

92 | |

93 | |

94 | .. class:: HammingConstructor |

95 | .. class:: Hamming |

96 | |

97 | Hamming distance between two instances is defined as the number of |

98 | features in which the two instances differ. Note that this measure |

99 | is not really appropriate for instances that contain continuous features. |
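A minimal sketch of the definition, using plain tuples as stand-ins for instances (the function name ``hamming`` is chosen for this example and is not the Orange class):

```python
def hamming(instance1, instance2):
    """Count the features in which the two instances differ."""
    return sum(1 for a, b in zip(instance1, instance2) if a != b)


discrete = hamming(("sunny", "hot", "high"), ("sunny", "mild", "normal"))

# With continuous features, any difference counts as a full mismatch,
# no matter how small, which is why the measure suits them poorly.
continuous = hamming((1.0, 2.0), (1.0001, 2.0))
print(discrete, continuous)
```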

100 | |

101 | |

102 | .. class:: MaximalConstructor |

103 | .. class:: Maximal |

104 | |

105 | The maximal distance between two instances is defined as the largest distance |

106 | between two corresponding feature values. If dist is the result of |

107 | ExamplesDistance_Normalized.attributeDistances, |

108 | then Maximal returns max(dist). |

109 | |

110 | |

111 | .. class:: ManhattanConstructor |

112 | .. class:: Manhattan |

113 | |

114 | Manhattan distance between two instances is the sum of absolute values |

115 | of distances between pairs of features, i.e. ``sum(abs(x) for x in dist)`` |

116 | where dist is the result of ExamplesDistance_Normalized.attributeDistances. |

117 | |

118 | .. class:: EuclideanConstructor |

119 | .. class:: Euclidean |

120 | |

121 | Euclidean distance is the square root of the sum of squared per-feature distances, |

122 | i.e. ``sqrt(sum(x*x for x in dist))``, where dist is the result of |

123 | ExamplesDistance_Normalized.attributeDistances. |
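To compare the Maximal, Manhattan and Euclidean aggregations side by side, the sketch below applies each formula to the same list of per-feature distances. The list ``dist`` is made-up example data standing in for the result of ``attributeDistances``.

```python
from math import sqrt

# Hypothetical normalized per-feature distances between two instances.
dist = [0.6, 0.8, 0.0]

maximal = max(dist)                        # largest single per-feature distance
manhattan = sum(abs(x) for x in dist)      # sum of absolute per-feature distances
euclidean = sqrt(sum(x * x for x in dist)) # root of the sum of squares

print(maximal, manhattan, euclidean)
```

For the same input, the maximal distance never exceeds the Euclidean distance, which in turn never exceeds the Manhattan distance.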

124 | |

125 | .. attribute:: distributions |

126 | |

127 | An object of type |

128 | :obj:`~Orange.statistics.distribution.Distribution` that holds |

129 | the distributions for all discrete features used for |

130 | computation of distances between known and unknown values. |

131 | |

132 | .. attribute:: bothSpecialDist |

133 | |

134 | A list containing the distance between two unknown values for each |

135 | discrete feature. |

136 | |

137 | This measure of distance deals with unknown values by computing the |

138 | expected square of distance based on the distribution obtained from the |

139 | "training" data. Squared distance between |

140 | |

141 | - A known and an unknown continuous attribute equals the squared distance |

142 | between the known value and the average, plus the variance |

143 | - Two unknown continuous attributes equals twice the variance |

144 | - A known and an unknown discrete attribute equals the probability |

145 | that the unknown attribute has a different value than the known one |

146 | (i.e., 1 - probability of the known value) |

147 | - Two unknown discrete attributes equals the probability that two |

148 | randomly chosen values are different, which can be computed as |

149 | 1 - sum of squares of probabilities. |
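The four rules for unknown values can be written out as a small pure-Python sketch. It is an illustration under stated assumptions, not Orange's code: unknown values are represented by ``None``, normalization is left out, and the function names and the ``probabilities`` dictionary are hypothetical.

```python
def sq_dist_continuous(v1, v2, average, variance):
    """Expected squared distance for one continuous feature."""
    if v1 is None and v2 is None:
        return 2.0 * variance
    if v1 is None or v2 is None:
        known = v2 if v1 is None else v1
        return (known - average) ** 2 + variance
    return (v1 - v2) ** 2


def sq_dist_discrete(v1, v2, probabilities):
    """Expected squared distance for one discrete feature."""
    if v1 is None and v2 is None:
        # Probability that two independently drawn values differ.
        return 1.0 - sum(p * p for p in probabilities.values())
    if v1 is None or v2 is None:
        known = v2 if v1 is None else v1
        return 1.0 - probabilities[known]
    return 0.0 if v1 == v2 else 1.0


probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}
print(sq_dist_continuous(3.0, None, average=2.0, variance=0.5))   # (3 - 2)^2 + 0.5
print(sq_dist_continuous(None, None, average=2.0, variance=0.5))  # 2 * 0.5
print(sq_dist_discrete("a", None, probabilities))                 # 1 - 0.5
print(sq_dist_discrete(None, None, probabilities))                # 1 - (0.25 + 0.0625 + 0.0625)
```

The continuous rules follow from treating the unknown value as a random draw with the given average and variance; the discrete rules treat it as a draw from the stored distribution.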

150 | |

151 | Continuous cases can be handled by averages and variances inherited from |

152 | ExamplesDistance_Normalized. The data for discrete cases are stored in |

153 | distributions (used for unknown vs. known value) and in bothSpecialDist |

154 | (the precomputed distance between two unknown values). |

155 | |

156 | .. class:: ReliefConstructor |

157 | .. class:: Relief |

158 | |

159 | Relief is similar to Manhattan distance, but incorporates a more |

160 | correct treatment of undefined values, which is used by the ReliefF measure. |

161 | |

162 | This class is derived directly from ExamplesDistance, not from ExamplesDistance_Normalized. |

163 | |

164 | |

165 | .. autoclass:: PearsonR |

166 | :members: |

167 | |

168 | .. autoclass:: SpearmanR |

169 | :members: |

170 | |

171 | .. autoclass:: PearsonRConstructor |

172 | :members: |

173 | |

174 | .. autoclass:: SpearmanRConstructor |

175 | :members: |

**Note:**See TracBrowser for help on using the repository browser.