Changeset 7599:5c0ed7601f79 in orange
 Timestamp:
 02/05/11 00:42:08
 Branch:
 default
 Convert:
 9f4e59a95a5a9d8ddb14fe65406498d3b033e732
 File:

 1 edited

orange/Orange/distances/__init__.py
r6848 r7599
+"""
+
+###########################
+Distances between Instances
+###########################
+
+This page describes classes that implement different metrics for measuring
+distances (dissimilarities) between instances.
+
+Typical (although not all) measures of distance between instances require
+some "learning": adjusting the measure to the data. For instance, when
+the dataset contains continuous features, the distances between continuous
+values should be normalized, e.g. by dividing the distance by the range
+of possible values or by some interquartile distance, to ensure that all
+features have, in principle, similar impact.
+
+Different measures of distance thus appear in pairs: a class that measures
+the distance and a class that constructs it based on the data. The abstract
+classes representing such a pair are `ExamplesDistance` and
+`ExamplesDistanceConstructor`.
+
+Since most measures work on normalized distances between corresponding
+features, there is an abstract intermediate class
+`ExamplesDistance_Normalized` that takes care of normalizing.
+The remaining classes correspond to different ways of defining the distances,
+such as Manhattan or Euclidean distance.
+
+Unknown values are treated correctly only by Euclidean and Relief distance.
+For other measures of distance, the distance between an unknown and a known
+value, or between two unknown values, is always 0.5.
+
+.. class:: ExamplesDistance
+
+    .. method:: __call__(instance1, instance2)
+
+        Returns the distance between the given instances as a floating point number.
+
+.. class:: ExamplesDistanceConstructor
+
+    .. method:: __call__([instances, weightID][, DomainDistributions][, DomainBasicAttrStat])
+
+        Constructs an instance of ExamplesDistance.
+        Not all the data needs to be given. Most measures can be constructed
+        from DomainBasicAttrStat; if it is not given, they can fall back on
+        either instances or DomainDistributions.
+        Some (e.g. ExamplesDistance_Hamming) do not need any arguments at all.
+
+.. class:: ExamplesDistance_Normalized
+
+    This abstract class provides a function which is given two instances
+    and returns a list of normalized distances between values of their
+    features. Many distance measuring classes need such a function and are
+    therefore derived from this class.
+
+    .. attribute:: normalizers
+
+        A precomputed list of normalizing factors for feature values.
+
+        - If a factor is positive, differences in the feature's values
+          are multiplied by it; for continuous features the factor
+          would be 1/(max_value-min_value) and for ordinal features
+          the factor is 1/number_of_values. If either (or both) of
+          the values are unknown, the distance is 0.5.
+        - If a factor is -1, the feature is nominal; the distance
+          between two values is 0 if they are the same (or at least
+          one is unknown) and 1 if they are different.
+        - If a factor is 0, the feature is ignored.
+
+    .. attribute:: bases, averages, variances
+
+        The minimal values, averages and variances
+        (continuous features only).
+
+    .. attribute:: domainVersion
+
+        Stores the domain version for which the normalizers were computed.
+        The domain version is increased each time the domain description is
+        changed (i.e. features are added or removed); this is used for a quick
+        check that the user is not attempting to measure distances between
+        instances that do not correspond to the normalizers.
+        Since domains are practically immutable (especially from Python),
+        you don't need to care about this anyway.
+
+    .. method:: attributeDistances(instance1, instance2)
+
+        Returns a list of floats representing distances between pairs of
+        feature values of the two instances.
+
+
+.. class:: Hamming, HammingConstructor
+
+    Hamming distance between two instances is defined as the number of
+    features in which the two instances differ.
+    Note that this measure
+    is not really appropriate for instances that contain continuous features.
+
+
+.. class:: Maximal, MaximalConstructor
+
+    The maximal distance between two instances is defined as the maximal distance
+    between two feature values. If dist is the result of
+    ExamplesDistance_Normalized.attributeDistances,
+    then Maximal returns max(dist).
+
+
+.. class:: Manhattan, ManhattanConstructor
+
+    Manhattan distance between two instances is the sum of absolute values
+    of distances between pairs of features, i.e. ``sum([abs(x) for x in dist])``,
+    where dist is the result of ExamplesDistance_Normalized.attributeDistances.
+
+.. class:: Euclidean, EuclideanConstructor
+
+
+    Euclidean distance is the square root of the sum of squared per-feature distances,
+    i.e. ``sqrt(sum([x*x for x in dist]))``, where dist is the result of
+    ExamplesDistance_Normalized.attributeDistances.
+
+    .. attribute:: distributions
+
+        An object of type DomainDistributions that holds the distributions
+        for all discrete features. This is needed to compute distances between
+        known and unknown values.
+
+    .. attribute:: bothSpecialDist
+
+        A list containing the distance between two unknown values for each
+        discrete feature.
+
+    This measure of distance deals with unknown values by computing the
+    expected square of distance based on the distribution obtained from the
+    "training" data.
+    Squared distance between
+
+    - a known and an unknown continuous attribute equals the squared distance
+      between the known value and the average, plus the variance;
+    - two unknown continuous attributes equals double the variance;
+    - a known and an unknown discrete attribute equals the probability
+      that the unknown attribute has a different value than the known one
+      (i.e., 1 - probability of the known value);
+    - two unknown discrete attributes equals the probability that two
+      randomly chosen values differ, which can be computed as
+      1 - sum of squares of probabilities.
+
+    Continuous cases can be handled by averages and variances inherited from
+    ExamplesDistance_Normalized. The data for discrete cases are stored in
+    distributions (used for unknown vs. known value) and in bothSpecialDist
+    (the precomputed distance between two unknown values).
+
+.. class:: Relief, ReliefConstructor
+
+    Relief is similar to Manhattan distance, but incorporates a more
+    correct treatment of undefined values, which is used by the ReliefF measure.
+
+    This class is derived directly from ExamplesDistance, not from ExamplesDistance_Normalized.
+
+
+.. autoclass:: PearsonR
+    :members:
+
+.. autoclass:: SpearmanR
+    :members:
+
+.. autoclass:: PearsonRConstructor
+    :members:
+
+.. autoclass:: SpearmanRConstructor
+    :members:
+
+
+"""
+
+import Orange
+
 from orange import \
-    AlignmentList, \
-    DistanceMap, \
-    DistanceMapConstructor, \
-    ExampleDistConstructor, \
-    ExampleDistBySorting, \
-    ExampleDistVector, \
-    ExamplesDistance, \
-    ExamplesDistance_Hamming, \
-    ExamplesDistance_Normalized, \
-    ExamplesDistance_DTW, \
-    ExamplesDistance_Euclidean, \
-    ExamplesDistance_Manhattan, \
-    ExamplesDistance_Maximal, \
-    ExamplesDistance_Relief, \
-    ExamplesDistanceConstructor, \
-    ExamplesDistanceConstructor_DTW, \
-    ExamplesDistanceConstructor_Euclidean, \
-    ExamplesDistanceConstructor_Hamming, \
-    ExamplesDistanceConstructor_Manhattan, \
-    ExamplesDistanceConstructor_Maximal, \
-    ExamplesDistanceConstructor_Relief
+    AlignmentList, \
+    DistanceMap, \
+    DistanceMapConstructor, \
+    ExampleDistConstructor, \
+    ExampleDistBySorting, \
+    ExampleDistVector, \
+    ExamplesDistance, \
+    ExamplesDistance_Normalized, \
+    ExamplesDistanceConstructor
+
+from orange import ExamplesDistance_Hamming as Hamming
+from orange import ExamplesDistance_DTW as DTW
+from orange import ExamplesDistance_Euclidean as Euclidean
+from orange import ExamplesDistance_Manhattan as Manhattan
+from orange import ExamplesDistance_Maximal as Maximal
+from orange import ExamplesDistance_Relief as Relief
+
+from orange import ExamplesDistanceConstructor_DTW as DTWConstructor
+from orange import ExamplesDistanceConstructor_Euclidean as EuclideanConstructor
+from orange import ExamplesDistanceConstructor_Hamming as HammingConstructor
+from orange import ExamplesDistanceConstructor_Manhattan as ManhattanConstructor
+from orange import ExamplesDistanceConstructor_Maximal as MaximalConstructor
+from orange import ExamplesDistanceConstructor_Relief as ReliefConstructor
+
+import statc
+
+class PearsonRConstructor(ExamplesDistanceConstructor):
+    """Constructs an instance of PearsonR.
+    Not all the data needs to be given."""
+
+    def __new__(cls, data=None, **argkw):
+        self = ExamplesDistanceConstructor.__new__(cls, **argkw)
+        self.__dict__.update(argkw)
+        # When data is given, return the constructed measure directly.
+        if data:
+            return self.__call__(data)
+        else:
+            return self
+
+    def __call__(self, table):
+        # Only continuous features take part in the correlation.
+        indxs = [i for i, a in enumerate(table.domain.attributes) \
+                 if a.varType == Orange.data.Type.Continuous]
+        return PearsonR(domain=table.domain, indxs=indxs)
+
+class PearsonR(ExamplesDistance):
+    """
+    `Pearson correlation coefficient
+    <http://en.wikipedia.org/wiki/Pearson_product-moment\
+_correlation_coefficient>`_
+    """
+
+    def __init__(self, **argkw):
+        self.__dict__.update(argkw)
+
+    def __call__(self, e1, e2):
+        """
+        :param e1: a data instance.
+        :param e2: a data instance.
+
+        Returns Pearson's dissimilarity between e1 and e2,
+        i.e. (1-r)/2 where r is Pearson's correlation coefficient.
+        """
+        X1 = []
+        X2 = []
+        # Skip features whose value is unknown in either instance.
+        for i in self.indxs:
+            if not(e1[i].isSpecial() or e2[i].isSpecial()):
+                X1.append(float(e1[i]))
+                X2.append(float(e2[i]))
+        if not X1:
+            return 1.0
+        try:
+            return (1.0 - statc.pearsonr(X1, X2)[0]) / 2.
+        except:
+            # The correlation could not be computed; return the maximal value.
+            return 1.0
+
+class SpearmanRConstructor(ExamplesDistanceConstructor):
+    """Constructs an instance of SpearmanR.
+    Not all the data needs to be given."""
+
+    def __new__(cls, data=None, **argkw):
+        self = ExamplesDistanceConstructor.__new__(cls, **argkw)
+        self.__dict__.update(argkw)
+        # When data is given, return the constructed measure directly.
+        if data:
+            return self.__call__(data)
+        else:
+            return self
+
+    def __call__(self, table):
+        # Only continuous features take part in the correlation.
+        indxs = [i for i, a in enumerate(table.domain.attributes) \
+                 if a.varType == Orange.data.Type.Continuous]
+        return SpearmanR(domain=table.domain, indxs=indxs)
+
+class SpearmanR(ExamplesDistance):
+
+    """`Spearman's rank correlation coefficient
+    <http://en.wikipedia.org/wiki/Spearman%27s_rank_\
+correlation_coefficient>`_"""
+
+    def __init__(self, **argkw):
+        self.__dict__.update(argkw)
+
+    def __call__(self, e1, e2):
+        """
+        :param e1: a data instance.
+        :param e2: a data instance.
+
+        Returns Spearman's dissimilarity between e1 and e2,
+        i.e. (1-r)/2 where r is Spearman's rank coefficient.
+        """
+        X1 = []
+        X2 = []
+        # Skip features whose value is unknown in either instance.
+        for i in self.indxs:
+            if not(e1[i].isSpecial() or e2[i].isSpecial()):
+                X1.append(float(e1[i]))
+                X2.append(float(e2[i]))
+        if not X1:
+            return 1.0
+        try:
+            return (1.0 - statc.spearmanr(X1, X2)[0]) / 2.
+        except:
+            # The correlation could not be computed; return the maximal value.
+            return 1.0
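
The (1-r)/2 dissimilarity returned by PearsonR can be illustrated without Orange or statc. The block below is a minimal sketch under stated assumptions, not the module's implementation: `pearson_dissimilarity` is a hypothetical helper, `None` stands in for an unknown value (the case that `isSpecial()` detects in the changeset), and the correlation is computed by hand instead of by `statc.pearsonr`.

```python
import math


def pearson_dissimilarity(x, y):
    """Return (1 - r) / 2 for two equal-length sequences of values.

    Assumption of this sketch: None marks an unknown value. Positions
    where either value is unknown are skipped before correlating.
    """
    pairs = [(float(a), float(b)) for a, b in zip(x, y)
             if a is not None and b is not None]
    if not pairs:
        # Nothing to compare: report the maximal dissimilarity.
        return 1.0

    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)

    if var_x == 0 or var_y == 0:
        # r is undefined for constant input; this sketch returns 1.0,
        # the same fallback value the changeset uses on failure.
        return 1.0

    r = cov / math.sqrt(var_x * var_y)
    # r = 1 maps to 0 (identical trend), r = -1 maps to 1 (opposite trend).
    return (1.0 - r) / 2.0


if __name__ == "__main__":
    print(pearson_dissimilarity([1, 2, 3], [2, 4, 6]))
    print(pearson_dissimilarity([1, 2, 3], [3, 2, 1]))
    print(pearson_dissimilarity([1, None, 3], [2, 5, 6]))
```

Mapping r from [-1, 1] onto [0, 1] this way makes perfectly correlated instances the closest and perfectly anti-correlated instances the farthest. A Spearman variant would apply the same formula to the ranks of the values rather than to the values themselves.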