Changeset 7503:166e99ee3f9d in orange
 Timestamp:
 02/04/11 18:37:32 (3 years ago)
 Branch:
 default
 Convert:
 626035b7b1533e3c3b7d5f224b29d6bed94c8aff
 Location:
 orange/Orange/clustering
 Files:

 1 added
 2 edited

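The changeset below reorganizes Orange's clustering module around a k-means implementation: assign each instance to its nearest centroid, recompute each centroid as the center of its cluster, and repeat until assignments settle. As a minimal standalone sketch of that loop, in plain Python with hypothetical names and none of Orange's API:

```python
import random

def sq_dist(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, iterations=100, seed=0):
    """Plain k-means sketch: assign points to the nearest centroid,
    recompute centroids as cluster means, stop when assignments settle."""
    rnd = random.Random(seed)
    centroids = rnd.sample(data, k)  # random seeding (Forgy-style)
    clusters = [0] * len(data)
    for _ in range(iterations):
        # assignment step: index of the closest centroid for each point
        new = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
               for p in data]
        if new == clusters:  # converged: no point changed cluster
            break
        clusters = new
        # update step: mean of each cluster's members
        for j in range(k):
            members = [p for p, c in zip(data, clusters) if c == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return clusters, centroids
```

This mirrors the structure of the `KMeans.run`/`runone` methods in the diff (assignment and update steps, convergence on unchanged membership), without Orange's distance constructors or callbacks.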
orange/Orange/clustering/__init__.py
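Among the functions diffed in this file is a silhouette score, where each instance compares its mean distance a to its own cluster against the mean distance b to the nearest other cluster, giving (b - a) / max(a, b). A minimal standalone sketch (plain Python; the helper names are illustrative, not Orange's):

```python
import math

def euclidean(p, q):
    # plain Euclidean distance between two points given as tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette(point_idx, data, clusters, dist=euclidean):
    """Silhouette of one point: (b - a) / max(a, b), where a is the mean
    distance to its own cluster and b the mean distance to the nearest
    other cluster. Values near 1 mean the point sits well in its cluster."""
    own = clusters[point_idx]

    def mean_dist(c):
        # mean distance from point_idx to the other members of cluster c
        ds = [dist(data[point_idx], p) for i, p in enumerate(data)
              if clusters[i] == c and i != point_idx]
        return sum(ds) / len(ds) if ds else 0.0

    a = mean_dist(own)
    b = min(mean_dist(c) for c in set(clusters) if c != own)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0
```

Averaging this over all instances gives the per-k scores tabulated in the documentation below (where k=2 maximizes the coefficient on Iris).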
r7429 → r7503

Everything about clustering, including agglomerative and hierarchical clustering.

==================
kMeans clustering
==================

.. index:: agglomerative clustering

.. autoclass:: Orange.cluster.KMeans
   :members:

Examples
========

The following code runs k-means clustering and prints out the cluster indexes for the last 10 data instances (`kmeansrun.py`_, uses `iris.tab`_):

.. literalinclude:: code/kmeansrun.py

The output of this code is::

    [1, 1, 2, 1, 1, 1, 2, 1, 1, 2]

Invoking a callback function may be useful when tracing the progress of the clustering. Below is code that uses an :obj:`inner_callback` to report on the number of instances that have changed cluster and on the clustering score. For the score to be computed at each iteration we have to set :obj:`minscorechange`, but we can leave it at 0 or even set it to a negative value, which allows the score to deteriorate by some amount (`kmeansruncallback.py`_, uses `iris.tab`_):

.. literalinclude:: code/kmeansruncallback.py

The convergence on the Iris data set is fast::

    Iteration: 1, changes: 150, score: 10.9555
    Iteration: 2, changes: 12, score: 10.3867
    Iteration: 3, changes: 2, score: 10.2034
    Iteration: 4, changes: 2, score: 10.0699
    Iteration: 5, changes: 2, score: 9.9542
    Iteration: 6, changes: 1, score: 9.9168
    Iteration: 7, changes: 2, score: 9.8624
    Iteration: 8, changes: 0, score: 9.8624

The callback above is used to report progress, but it may as well call a function that plots a selected data projection with the corresponding centroids at a given step of the clustering. This is exactly what we did with the following script (`kmeanstrace.py`_, uses `iris.tab`_):

.. literalinclude:: code/kmeanstrace.py

Only the first four scatterplots are shown below. Colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular two-dimensional projection is not necessarily also the centroid of the cluster that the data point belongs to.

.. image:: files/kmeansscatter001.png

.. image:: files/kmeansscatter002.png

.. image:: files/kmeansscatter003.png

.. image:: files/kmeansscatter004.png


kMeans Utility Functions
========================

.. automethod:: Orange.cluster.kmeans_init_random

.. automethod:: Orange.cluster.kmeans_init_diversity

.. automethod:: Orange.cluster.KMeans_init_hierarchicalClustering

.. automethod:: Orange.cluster.data_center

.. automethod:: Orange.cluster.plot_silhouette

.. automethod:: Orange.cluster.score_distance_to_centroids

.. automethod:: Orange.cluster.score_silhouette

.. automethod:: Orange.cluster.score_fastsilhouette

Typically, the choice of seeds has a large impact on k-means clustering, with better initialization methods yielding clusterings that converge faster and find more optimal centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge (`kmeanscmpinit.py`_, uses `iris.tab`_, `housing.tab`_, `vehicle.tab`_):

.. literalinclude:: code/kmeanscmpinit.py

As expected, k-means converges faster with diversity- and clustering-based initialization than with random seed selection::

             Rnd  Div  HC
    iris      12    3   4
    housing   14    6   4
    vehicle   11    4   3

The following code computes the silhouette score for k=2..7 and plots a silhouette plot for k=3 (`kmeanssilhouette.py`_, uses `iris.tab`_):

.. literalinclude:: code/kmeanssilhouette.py

The analysis suggests that k=2 is preferred as it yields the maximal silhouette coefficient::

    2 0.629467553352
    3 0.504318855054
    4 0.407259377854
    5 0.358628975081
    6 0.353228492088
    7 0.366357876944

.. figure:: files/kmeanssilhouette.png

   Silhouette plot for k=3.

.. _iris.tab: code/iris.tab
.. _housing.tab: code/housing.tab
.. _vehicle.tab: code/vehicle.tab
.. _kmeansrun.py: code/kmeansrun.py
.. _kmeansruncallback.py: code/kmeansruncallback.py
.. _kmeanstrace.py: code/kmeanstrace.py
.. _kmeanscmpinit.py: code/kmeanscmpinit.py
.. _kmeanssilhouette.py: code/kmeanssilhouette.py

================
…
================

-.. class:: Orange.cluster.ExampleCluster
+.. class:: Orange.clustering.ExampleCluster

    This is my additional description of the ExampleCluster

…

-.. autoclass:: Orange.cluster.HierarchicalCluster
+.. autoclass:: Orange.clustering.HierarchicalCluster
    :members:
    :show-inheritance:
    :undoc-members:

-.. autoclass:: Orange.cluster.HierarchicalClusterList
+.. autoclass:: Orange.clustering.HierarchicalClusterList
    :members:
    :show-inheritance:
    :undoc-members:

-.. autoclass:: Orange.cluster.HierarchicalClustering
+.. autoclass:: Orange.clustering.HierarchicalClustering
    :members:
    :show-inheritance:

…

__docformat__ = 'restructuredtext'

# miscellaneous functions (used across this module)

class ExampleCluster2(orange.ExampleCluster):
    """New example cluster"""
    pass

def _modus(dist):
    # Check bool(dist) - False means no known cases
    # Check dist.cases > 0 - We can't return some value from the domain
    # without knowing if it is even present in the data.
    # TODO: What does this mean for k-means convergence?
    if bool(dist) and dist.cases > 0:
        return dist.modus()
    else:
        return None

def data_center(data):
    """
    Returns a center of the instances in the data set (average across data
    instances for continuous attributes, most frequent value for discrete
    attributes).
    """
    atts = data.domain.attributes
    astats = orange.DomainBasicAttrStat(data)
    center = [astats[a].avg if a.varType == orange.VarTypes.Continuous \
#        else max(enumerate(orange.Distribution(a, data)), key=lambda x: x[1])[0] if a.varType == orange.VarTypes.Discrete
        else _modus(orange.Distribution(a, data)) if a.varType == orange.VarTypes.Discrete
        else None
        for a in atts]
    if data.domain.classVar:
        center.append(0)
    return orange.Example(data.domain, center)

def minindex(x):
    """Return the index of the minimum element"""
    return x.index(min(x))

def avg(x):
    """Return the average (mean) of a given list"""
    return (float(sum(x)) / len(x)) if x else 0

#
# data distances
#

class ExamplesDistanceConstructor_PearsonR(orange.ExamplesDistanceConstructor):
    def __new__(cls, data=None, **argkw):
        self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw)
        self.__dict__.update(argkw)
        if data:
            return self.__call__(data)
        else:
            return self

    def __call__(self, data):
        indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType == orange.VarTypes.Continuous]
        return ExamplesDistance_PearsonR(domain=data.domain, indxs=indxs)

class ExamplesDistance_PearsonR(orange.ExamplesDistance):
    """Pearson distance: (1 - R)/2, where R is the Pearson correlation
    between examples. In [0..1]."""

    def __init__(self, **argkw):
        self.__dict__.update(argkw)

    def __call__(self, e1, e2):
        X1 = []; X2 = []
        for i in self.indxs:
            if not (e1[i].isSpecial() or e2[i].isSpecial()):
                X1.append(float(e1[i]))
                X2.append(float(e2[i]))
        if not X1:
            return 1.0
        try:
            return (1.0 - statc.pearsonr(X1, X2)[0]) / 2.
        except:
            return 1.0

class ExamplesDistanceConstructor_SpearmanR(orange.ExamplesDistanceConstructor):
    def __new__(cls, data=None, **argkw):
        self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw)
        self.__dict__.update(argkw)
        if data:
            return self.__call__(data)
        else:
            return self

    def __call__(self, data):
        indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType == orange.VarTypes.Continuous]
        return ExamplesDistance_SpearmanR(domain=data.domain, indxs=indxs)

class ExamplesDistance_SpearmanR(orange.ExamplesDistance):
    def __init__(self, **argkw):
        self.__dict__.update(argkw)

    def __call__(self, e1, e2):
        X1 = []; X2 = []
        for i in self.indxs:
            if not (e1[i].isSpecial() or e2[i].isSpecial()):
                X1.append(float(e1[i]))
                X2.append(float(e2[i]))
        if not X1:
            return 1.0
        try:
            return (1.0 - statc.spearmanr(X1, X2)[0]) / 2.
        except:
            return 1.0

# k-means clustering

# clustering scoring functions

def score_distance_to_centroids(km):
    """Returns an average distance of data instances to their associated centroids.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """
    return sum(km.distance(km.centroids[km.clusters[i]], d) for i, d in enumerate(km.data))

score_distance_to_centroids.minimize = True

def score_conditionalEntropy(km):
    """UNIMPLEMENTED cluster quality measured by conditional entropy"""
    pass

def score_withinClusterDistance(km):
    """UNIMPLEMENTED weighted average within-cluster pairwise distance"""
    pass

score_withinClusterDistance.minimize = True

def score_betweenClusterDistance(km):
    """Sum of distances from elements to 'nearest miss' centroids"""
    return sum(min(km.distance(c, d) for j, c in enumerate(km.centroids) if j != km.clusters[i]) for i, d in enumerate(km.data))

def score_silhouette(km, index=None):
    """Returns an average silhouette score of data instances.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`

    :param index: if given, the function returns just the silhouette score of that particular data instance.
    :type index: integer
    """
    if index == None:
        return avg([score_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    a = avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
             km.clusters[i] == cind and i != index])
    b = min([avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
                  km.clusters[i] == c])
             for c in range(len(km.centroids)) if c != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def score_fastsilhouette(km, index=None):
    """Same as score_silhouette, but computes an approximation and is faster.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """
    if index == None:
        return avg([score_fastsilhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    a = km.distance(km.data[index], km.centroids[km.clusters[index]])
    b = min([km.distance(km.data[index], c) for i, c in enumerate(km.centroids) if i != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def compute_bic(km):
    """Compute the Bayesian information criterion score for the given clustering. NEEDS REWRITE!!!"""
    data = km.data
    medoids = km.centroids
    K = km.k                 # fixed: K was undefined in the original
    mapping = km.clusters    # fixed: mapping was undefined in the original

    M = len(data.domain.attributes)
    R = float(len(data))
    Ri = [km.clusters.count(i) for i in range(km.k)]
    numFreePar = (len(km.data.domain.attributes) + 1.) * km.k * math.log(R, 2.) / 2.
    # sigma**2
    s2 = 0.
    cidx = [i for i, attr in enumerate(data.domain.attributes) if attr.varType in [orange.VarTypes.Continuous, orange.VarTypes.Discrete]]
    for x, midx in izip(data, mapping):
        medoid = medoids[midx]
        s2 += sum([(float(x[i]) - float(medoid[i])) ** 2 for i in cidx])
    s2 /= (R - K)
    if s2 < 1e-20:
        return None, [None] * K
    # log-likelihood of clusters: l(Dn)
    # log-likelihood of clustering: l(D)
    ld = 0
    bicc = []
    for k in range(K):  # fixed: original iterated range(1, 1 + K), overrunning Ri
        ldn = 1. * Ri[k] * ((math.log(2. * math.pi, 2) / 2.) - (M * math.log(s2, 2) / 2.) + (K / 2.) + math.log(Ri[k], 2) - math.log(R, 2))
        ld += ldn
        bicc.append(ldn - numFreePar)
    return ld - numFreePar, bicc

+#class ExampleCluster2(orange.ExampleCluster):
+#    """New example cluster"""
+#    pass

#
# silhouette plot
#

def plot_silhouette(km, filename='tmp.png', fast=False):
    """Saves a silhouette plot to filename, showing the distributions of
    silhouette scores in clusters. km is a k-means clustering object.
    If fast is True, score_fastsilhouette is used to compute the scores
    instead of score_silhouette.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    :param filename: name of the output plot.
    :type filename: string
    :param fast: if True, use :func:`score_fastsilhouette` to compute scores instead of :func:`score_silhouette`
    :type fast: boolean
    """
    import matplotlib.pyplot as plt
    plt.figure()
    scoring = score_fastsilhouette if fast else score_silhouette
    scores = [[] for i in range(km.k)]
    for i, c in enumerate(km.clusters):
        scores[c].append(scoring(km, i))
    csizes = map(len, scores)
    cpositions = [sum(csizes[:i]) + (i + 1) * 3 + csizes[i] / 2 for i in range(km.k)]
    scores = reduce(lambda x, y: x + [0] * 3 + sorted(y), scores, [])
    plt.barh(range(len(scores)), scores, linewidth=0, color='c')
    plt.yticks(cpositions, map(str, range(km.k)))
    #plt.title('Silhouette plot')
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette value')
    plt.savefig(filename)

# clustering initialization (seeds)
# initialization functions should be of the type f(data, k, distfun)

def kmeans_init_random(data, k, _):
    """A function that can be used for initialization of k-means clustering;
    returns k data instances from the data. This type of initialization is
    also known as Forgy's initialization (Forgy, 1965; He et al., 2004).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    :param distfun: a distance function.
    :type distfun: :class:`orange.ExamplesDistance`
    """
    return data.getitems(random.sample(range(len(data)), k))

def kmeans_init_diversity(data, k, distfun):
    """A function that can be used for initialization of k-means clustering.
    Returns a set of centroids where the first one is the data point farthest
    away from the center of the data, and each subsequent centroid is the
    data point whose minimal distance to the previous set of centroids is
    maximal. Differs from the initialization proposed by Katsavounidis et
    al. (1994) only in the selection of the first centroid (where they use
    the data instance with the highest norm).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    :param distfun: a distance function.
    :type distfun: :class:`orange.ExamplesDistance`
    """
    center = data_center(data)
    # the first seed should be the farthest point from the center
    seeds = [max([(distfun(d, center), d) for d in data])[1]]
    # other seeds are added iteratively, and are the data points farthest from the current set of seeds
    for i in range(1, k):
        seeds.append(max([(min([distfun(d, s) for s in seeds]), d) for d in data if d not in seeds])[1])
    return seeds

class KMeans_init_hierarchicalClustering():
    """A class that returns a clustering initialization function that
    performs hierarchical clustering, uses it to infer k clusters, and
    computes a list of cluster-based data centers."""

    def __init__(self, n=100):
        """
        :param n: number of data instances to sample.
        :type n: integer
        """
        self.n = n

    def __call__(self, data, k, distfun):
        """
        :param data: data instances.
        :type data: :class:`orange.ExampleTable`
        :param k: the number of clusters.
        :type k: integer
        :param distfun: a distance function.
        :type distfun: :class:`orange.ExamplesDistance`
        """
        sample = orange.ExampleTable(random.sample(data, min(self.n, len(data))))
        root = hierarchicalClustering(sample)
        cmap = hierarchicalClustering_topClusters(root, k)
        return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap]

#
# k-means clustering, main implementation
#

class KMeans:
    """
    Implements a k-means clustering algorithm:

    #. Choose the number of clusters, k.
    #. Choose a set of k initial centroids.
    #. Assign each instance in the data set to the closest centroid.
    #. For each cluster, compute a new centroid as a center of clustered
       data instances.
    #. Repeat the previous two steps, until some convergence criterion is
       met (e.g., the cluster assignment has not changed).

    The main advantages of this algorithm are simplicity and low memory
    requirements. The principal disadvantage is the dependence of results
    on the selection of the initial set of centroids.

    .. attribute:: k

        Number of clusters.

    .. attribute:: data

        Instances to cluster.

    .. attribute:: centroids

        Current set of centroids.

    .. attribute:: scoring

        Current clustering score.

    .. attribute:: iteration

        Current clustering iteration.

    .. attribute:: clusters

        A list of cluster indexes. The i-th element provides the index of
        the centroid associated with the i-th data instance from the input
        data set.
    """

    def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None,
                 stopchanges=0, nstart=1, initialization=kmeans_init_random,
                 distance=orange.ExamplesDistanceConstructor_Euclidean,
                 scoring=score_distance_to_centroids, inner_callback=None,
                 outer_callback=None):
        """
        :param data: Data instances to be clustered. If not None, clustering will be executed immediately after initialization unless initialize_only=True.
        :type data: :class:`orange.ExampleTable` or None
        :param centroids: either specify the number of clusters or provide a list of examples that will serve as clustering centroids.
        :type centroids: integer or a list of :class:`orange.Example`
        :param nstart: if greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        :type nstart: integer
        :param distance: an example distance constructor, which measures the distance between two instances.
        :type distance: :class:`orange.ExamplesDistanceConstructor`
        :param initialization: a function to select centroids given data instances, k, and an example distance function. This module implements different approaches (:func:`kmeans_init_random`, :func:`kmeans_init_diversity`, :class:`KMeans_init_hierarchicalClustering`).
        :param scoring: a function that takes a clustering object and returns the clustering score. It could be used, for instance, in a procedure that repeats the clustering nstart times, returning the clustering with the lowest score.
        :param inner_callback: invoked after every clustering iteration.
        :param outer_callback: invoked after every clustering restart (if nstart is greater than 1).

        Stopping criteria:

        :param maxiters: maximum number of clustering iterations
        :type maxiters: integer
        :param minscorechange: minimal improvement of the score from the previous generation (if lower, the clustering will stop). If None, the score will not be computed between iterations.
        :type minscorechange: float or None
        :param stopchanges: if the number of instances changing the cluster is lower than or equal to stopchanges, stop the clustering.
        :type stopchanges: integer
        """
        self.data = data
        self.k = centroids if type(centroids) == int else len(centroids)
        self.centroids = centroids if type(centroids) == orange.ExampleTable else None
        self.maxiters = maxiters
        self.minscorechange = minscorechange
        self.stopchanges = stopchanges
        self.nstart = nstart
        self.initialization = initialization
        self.distance_constructor = distance
        self.distance = self.distance_constructor(self.data) if self.data else None
        self.scoring = scoring
        self.minimize_score = True if hasattr(scoring, 'minimize') else False
        self.inner_callback = inner_callback
        self.outer_callback = outer_callback
        if self.data:
            self.run()

    def __call__(self, data=None):
        """Runs the k-means clustering algorithm, with optional new data."""
        if data:
            self.data = data
            self.distance = self.distance_constructor(self.data)
        self.run()

    def init_centroids(self):
        """Initialize cluster centroids"""
        if self.centroids and not self.nstart > 1:  # centroids were specified
            return
        self.centroids = self.initialization(self.data, self.k, self.distance)

    def compute_centeroid(self, data):
        """Return a centroid of the data set."""
        return data_center(data)

    def compute_cluster(self):
        """Calculate membership in clusters"""
        return [minindex([self.distance(s, d) for s in self.centroids]) for d in self.data]

    def runone(self):
        """Runs a single clustering iteration, starting with a recomputation
        of centroids, followed by computation of data membership (associating
        data instances to their nearest centroid)."""
        self.centroids = [self.compute_centeroid(self.data.getitems(
            [i for i, c in enumerate(self.clusters) if c == cl])) for cl in
range(self.k)]564 self.clusters = self.compute_cluster()565 566 def run(self):567 """568 Runs clustering until the convergence conditions are met. If nstart is greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.569 """570 self.winner = None571 for startindx in range(self.nstart):572 self.init_centroids()573 self.clusters = old_cluster = self.compute_cluster()574 if self.minscorechange != None:575 self.score = old_score = self.scoring(self)576 self.nchanges = len(self.data)577 self.iteration = 0578 stopcondition = False579 if self.inner_callback:580 self.inner_callback(self)581 while not stopcondition:582 self.iteration += 1583 self.runone()584 self.nchanges = sum(map(lambda x,y: x!=y, old_cluster, self.clusters))585 old_cluster = self.clusters586 if self.minscorechange != None:587 self.score = self.scoring(self)588 scorechange = (self.score  old_score) / old_score if old_score > 0 else self.minscorechange589 if self.minimize_score:590 scorechange = scorechange591 old_score = self.score592 stopcondition = (self.nchanges <= self.stopchanges or593 self.iteration == self.maxiters or594 (self.minscorechange != None and595 scorechange <= self.minscorechange))596 if self.inner_callback:597 self.inner_callback(self)598 if self.scoring and self.minscorechange == None:599 self.score = self.scoring(self)600 if self.nstart > 1:601 if not self.winner or (self.score < self.winner[0] if602 self.minimize_score else self.score > self.winner[0]):603 self.winner = (self.score, self.clusters, self.centroids)604 if self.outer_callback:605 self.outer_callback(self)606 607 if self.nstart > 1:608 self.score, self.clusters, self.centroids = self.winner609 610 # hierarhical clustering611 612 def hierarchicalClustering(data,613 distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean,614 linkage=orange.HierarchicalClustering.Average,615 order=False,616 progressCallback=None):617 """Return a hierarhical 
clustering of the data set."""618 distance = distanceConstructor(data)619 matrix = orange.SymMatrix(len(data))620 for i in range(len(data)):621 for j in range(i+1):622 matrix[i, j] = distance(data[i], data[j])623 root = orange.HierarchicalClustering(matrix, linkage=linkage, progressCallback=(lambda value, obj=None: progressCallback(value*100.0/(2 if order else 1))) if progressCallback else None)624 if order:625 orderLeaves(root, matrix, progressCallback=(lambda value: progressCallback(50.0 + value/2)) if progressCallback else None)626 return root627 628 def hierarchicalClustering_attributes(data, distance=None, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None):629 """Return hierarhical clustering of attributes in the data set."""630 matrix = orange.SymMatrix(len(data.domain.attributes))631 for a1 in range(len(data.domain.attributes)):632 for a2 in range(a1):633 matrix[a1, a2] = (1.0  orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0634 root = orange.HierarchicalClustering(matrix, linkage=linkage, progressCallback=(lambda value, obj=None: progressCallback(value*100.0/(2 if order else 1))) if progressCallback else None)635 if order:636 orderLeaves(root, matrix, progressCallback=(lambda value: progressCallback(50.0 + value/2)) if progressCallback else None)637 return root638 639 def hierarchicalClustering_clusterList(node, prune=None):640 """Return a list of clusters down from the node of hierarchical clustering."""641 if prune:642 if len(node) <= prune:643 return []644 if node.branches:645 return [node] + hierarchicalClustering_clusterList(node.left, prune) + hierarchicalClustering_clusterList(node.right, prune)646 return [node]647 648 def hierarchicalClustering_topClusters(root, k):649 """Return k topmost clusters from hierarchical clustering."""650 candidates = set([root])651 while len(candidates) < k:652 repl = max([(max(c.left.height, c.right.height), c) for c in candidates if c.branches])[1]653 candidates.discard(repl)654 
candidates.add(repl.left)655 candidates.add(repl.right)656 return candidates657 658 def hierarhicalClustering_topClustersMembership(root, k):659 """Return data instances' cluster membership (list of indices) to k topmost clusters."""660 clist = hierarchicalClustering_topClusters(root, k)661 cmap = [None] * len(root)662 for i, c in enumerate(clist):663 for e in c:664 cmap[e] = i665 return cmap666 667 def orderLeaves(tree, matrix, progressCallback=None):668 """Order the leaves in the clustering tree.669 670 (based on Ziv BarJoseph et al. (Fast optimal leaf ordering for hierarchical clustering')671 Arguments:672 tree binary hierarchical clustering tree of type orange.HierarchicalCluster673 matrix orange.SymMatrix that was used to compute the clustering674 progressCallback function used to report progress675 """676 objects = getattr(tree.mapping, "objects", None)677 tree.mapping.setattr("objects", range(len(tree)))678 M = {}679 ordering = {}680 visitedClusters = set()681 def _optOrdering(tree):682 if len(tree)==1:683 for leaf in tree:684 M[tree, leaf, leaf] = 0685 ## print "adding:", tree, leaf, leaf686 else:687 _optOrdering(tree.left)688 _optOrdering(tree.right)689 ## print "ordering", [i for i in tree]690 Vl = set(tree.left)691 Vr = set(tree.right)692 Vlr = set(tree.left.right or tree.left)693 Vll = set(tree.left.left or tree.left)694 Vrr = set(tree.right.right or tree.right)695 Vrl = set(tree.right.left or tree.right)696 other = lambda e, V1, V2: V2 if e in V1 else V1697 tree_left, tree_right = tree.left, tree.right698 for u in Vl:699 for w in Vr:700 if True: #Improved search701 C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)])702 orderedMs = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m])703 orderedKs = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k])704 k0 = orderedKs[0]705 curMin = 1e30000706 curMK = ()707 for m in orderedMs:708 if M[tree_left, u, m] + M[tree_right, w, k0] + C >= curMin:709 break710 for k in 
orderedKs:711 if M[tree_left, u, m] + M[tree_right, w, k] + C >= curMin:712 break713 if curMin > M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]:714 curMin = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]715 curMK = (m, k)716 M[tree, u, w] = M[tree, w, u] = curMin717 ordering[tree, u, w] = (tree_left, u, curMK[0], tree_right, w, curMK[1])718 ordering[tree, w, u] = (tree_right, w, curMK[1], tree_left, u, curMK[0])719 else:720 def MFunc((m, k)):721 return M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]722 m, k = min([(m, k) for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)], key=MFunc)723 M[tree, u, w] = M[tree, w, u] = MFunc((m, k))724 ordering[tree, u, w] = (tree_left, u, m, tree_right, w, k)725 ordering[tree, w, u] = (tree_right, w, k, tree_left, u, m)726 727 if progressCallback:728 progressCallback(100.0 * len(visitedClusters) / len(tree.mapping))729 visitedClusters.add(tree)730 731 _optOrdering(tree)732 733 def _order(tree, u, w):734 if len(tree)==1:735 return736 left, u, m, right, w, k = ordering[tree, u, w]737 if len(left)>1 and m not in left.right:738 left.swap()739 _order(left, u, m)740 ## if u!=left[0] or m!=left[1]:741 ## print "error 4:", u, m, list(left)742 if len(right)>1 and k not in right.left:743 right.swap()744 _order(right, k, w)745 ## if k!=right[0] or w!=right[1]:746 ## print "error 5:", k, w, list(right)747 748 u, w = min([(u, w) for u in tree.left for w in tree.right], key=lambda (u, w): M[tree, u, w])749 750 ## print "M(v) =", M[tree, u, w]751 752 _order(tree, u, w)753 754 def _check(tree, u, w):755 if len(tree)==1:756 return757 left, u, m, right, w, k = ordering[tree, u, w]758 if tree[0] == u and tree[1] == w:759 _check(left, u, m)760 _check(right, k, w)761 else:762 print "Error:", u, w, tree[0], tree[1]763 764 _check(tree, u ,w)765 766 767 if objects:768 tree.mapping.setattr("objects", objects)769 770 try:771 import numpy772 except ImportError:773 numpy = None774 775 try:776 import matplotlib777 from 
    from matplotlib.figure import Figure
    from matplotlib.table import Table, Cell
    from matplotlib.text import Text
    from matplotlib.artist import Artist
##    import matplotlib.pyplot as plt
except (ImportError, IOError), ex:
    matplotlib = None
    Text, Artist, Table, Cell = object, object, object, object

class TableCell(Cell):
    PAD = 0.05
    def __init__(self, *args, **kwargs):
        Cell.__init__(self, *args, **kwargs)
        self._text.set_clip_on(True)

class TablePlot(Table):
    max_fontsize = 12
    def __init__(self, xy, axes=None, bbox=None):
        # import pyplot here so that importing this module does not require a display
        import matplotlib.pyplot as plt
        Table.__init__(self, axes or plt.gca(), bbox=bbox)
        self.xy = xy
        self.set_transform(self._axes.transData)
        self._fixed_widhts = None
        self.max_fontsize = plt.rcParams.get("font.size", 12)

    def add_cell(self, row, col, *args, **kwargs):
        xy = (0, 0)

        cell = TableCell(xy, *args, **kwargs)
        cell.set_figure(self.figure)
        cell.set_transform(self.get_transform())

        cell.set_clip_on(True)
        cell.set_clip_box(self._axes.bbox)
        cell._text.set_clip_box(self._axes.bbox)
        self._cells[(row, col)] = cell

    def draw(self, renderer):
        if not self.get_visible(): return
        self._update_positions(renderer)

        keys = self._cells.keys()
        keys.sort()
        for key in keys:
            self._cells[key].draw(renderer)

    def _update_positions(self, renderer):
        keys = numpy.array(self._cells.keys())
        cells = numpy.array([[self._cells.get((row, col), None) for col in range(max(keys[:, 1] + 1))] \
                             for row in range(max(keys[:, 0] + 1))])

        widths = self._get_column_widths(renderer)
        x = self.xy[0] + numpy.array([numpy.sum(widths[:i]) for i in range(len(widths))])
        y = self.xy[1] - numpy.arange(cells.shape[0]) - 0.5

        for i in range(cells.shape[0]):
            for j in range(cells.shape[1]):
                cells[i, j].set_xy((x[j], y[i]))
                cells[i, j].set_width(widths[j])
                cells[i, j].set_height(1.0)

        self._width = numpy.sum(widths)
        self._height = cells.shape[0]

        self.pchanged()

    def _get_column_widths(self, renderer):
        keys = numpy.array(self._cells.keys())
        widths = numpy.zeros(len(keys)).reshape((numpy.max(keys[:,0]+1), numpy.max(keys[:,1]+1)))
        fontSize = self._calc_fontsize(renderer)
        for (row, col), cell in self._cells.items():
            cell.set_fontsize(fontSize)
            l, b, w, h = cell._text.get_window_extent(renderer).bounds
            transform = self._axes.transData.inverted()
            x1, _ = transform.transform_point((0, 0))
            x2, _ = transform.transform_point((w + w*TableCell.PAD + 10, 0))
            w = abs(x1 - x2)
            widths[row, col] = w
        return numpy.max(widths, 0)

    def _calc_fontsize(self, renderer):
        transform = self._axes.transData
        _, y1 = transform.transform_point((0, 0))
        _, y2 = transform.transform_point((0, 1))
        return min(max(int(abs(y1 - y2)*0.85), 4), self.max_fontsize)

    def get_children(self):
        return self._cells.values()

    def get_bbox(self):
        return matplotlib.transforms.Bbox([self.xy[0], self.xy[1], self.xy[0] + 10, self.xy[1] + 180])

class DendrogramPlotPylab(object):
    def __init__(self, root, data=None, labels=None, dendrogram_width=None, heatmap_width=None, label_width=None, space_width=None, border_width=0.05, plot_attr_names=False, cmap=None, params={}):
        if not matplotlib:
            raise ImportError("Could not import matplotlib module. Please make sure matplotlib is installed on your system.")
        import matplotlib.pyplot as plt
        self.plt = plt
        self.root = root
        self.data = data
        self.labels = labels if labels else [str(i) for i in range(len(root))]
        self.dendrogram_width = dendrogram_width
        self.heatmap_width = heatmap_width
        self.label_width = label_width
        self.space_width = space_width
        self.border_width = border_width
        self.params = params
        self.plot_attr_names = plot_attr_names

    def plotDendrogram(self):
        self.text_items = []
        def draw_tree(tree):
            if tree.branches:
                points = []
                for branch in tree.branches:
                    center = draw_tree(branch)
                    self.plt.plot([center[0], tree.height], [center[1], center[1]], color="black")
                    points.append(center)
                self.plt.plot([tree.height, tree.height], [points[0][1], points[1][1]], color="black")
                return (tree.height, (points[0][1] + points[1][1])/2.0)
            else:
                return (0.0, tree.first)
        draw_tree(self.root)

    def plotHeatMap(self):
        import numpy.ma as ma
        import numpy
        dx, dy = self.root.height, 0
        fx, fy = self.root.height/len(self.data.domain.attributes), 1.0
        data, c, w = self.data.toNumpyMA()
        data = (data - ma.min(data))/(ma.max(data) - ma.min(data))
        x = numpy.arange(data.shape[1] + 1)/float(numpy.max(data.shape))
        y = numpy.arange(data.shape[0] + 1)/float(numpy.max(data.shape))*len(self.root)
        self.heatmap_width = numpy.max(x)

        X, Y = numpy.meshgrid(x, y - 0.5)

        self.meshXOffset = numpy.max(X)

        self.plt.jet()
        mesh = self.plt.pcolormesh(X, Y, data[self.root.mapping], edgecolor="b", linewidth=2)

        if self.plot_attr_names:
            names = [attr.name for attr in self.data.domain.attributes]
            self.plt.xticks(numpy.arange(data.shape[1] + 1)/float(numpy.max(data.shape)), names)
        self.plt.gca().xaxis.tick_top()
        for label in self.plt.gca().xaxis.get_ticklabels():
            label.set_rotation(45)

        for tick in self.plt.gca().xaxis.get_major_ticks():
            tick.tick1On = False
            tick.tick2On = False

    def plotLabels_(self):
        import numpy
##        self.plt.yticks(numpy.arange(len(self.labels) - 1, -1, -1), self.labels)
##        for tick in self.plt.gca().yaxis.get_major_ticks():
##            tick.tick1On = False
##            tick.label1On = False
##            tick.label2On = True
##        text = TableTextLayout(xy=(self.meshXOffset+1, len(self.root)), tableText=[[label] for label in self.labels])
        text = TableTextLayout(xy=(self.meshXOffset*1.005, len(self.root) - 1), tableText=[[label] for label in self.labels])
        text.set_figure(self.plt.gcf())
        self.plt.gca().add_artist(text)
        self.plt.gca()._set_artist_props(text)

    def plotLabels(self):
##        table = TablePlot(xy=(self.meshXOffset*1.005, len(self.root) - 1), axes=self.plt.gca())
        table = TablePlot(xy=(0, len(self.root) - 1), axes=self.plt.gca())
        table.set_figure(self.plt.gcf())
        for i, label in enumerate(self.labels):
            table.add_cell(i, 0, width=1, height=1, text=label, loc="left", edgecolor="w")
        table.set_zorder(0)
        self.plt.gca().add_artist(table)
        self.plt.gca()._set_artist_props(table)

    def plot(self, filename=None, show=False):
        self.plt.rcParams.update(self.params)
        labelLen = max(len(label) for label in self.labels)
        w, h = 800, 600
        space = 0.01 if self.space_width == None else self.space_width
        border = self.border_width
        width = 1.0 - 2*border
        height = 1.0 - 2*border
        textLineHeight = min(max(h/len(self.labels), 4), self.plt.rcParams.get("font.size", 12))
        maxTextLineWidthEstimate = textLineHeight*labelLen
##        print maxTextLineWidthEstimate
        textAxisWidthRatio = 2.0*maxTextLineWidthEstimate/w
##        print textAxisWidthRatio
        labelsAreaRatio = min(textAxisWidthRatio, 0.4) if self.label_width == None else self.label_width
        x, y = len(self.data.domain.attributes), len(self.data)

        heatmapAreaRatio = min(1.0*y/h*x/w, 0.3) if self.heatmap_width == None else self.heatmap_width
        dendrogramAreaRatio = 1.0 - labelsAreaRatio - heatmapAreaRatio - 2*space if self.dendrogram_width == None else self.dendrogram_width

        self.fig = self.plt.figure()
        self.labels_offset = self.root.height/20.0
        dendrogramAxes = self.plt.axes([border, border, width*dendrogramAreaRatio, height])
        dendrogramAxes.xaxis.grid(True)
        import matplotlib.ticker as ticker

        dendrogramAxes.yaxis.set_major_locator(ticker.NullLocator())
        dendrogramAxes.yaxis.set_minor_locator(ticker.NullLocator())
        dendrogramAxes.invert_xaxis()
        self.plotDendrogram()
        heatmapAxes = self.plt.axes([border + width*dendrogramAreaRatio + space, border, width*heatmapAreaRatio, height], sharey=dendrogramAxes)

        heatmapAxes.xaxis.set_major_locator(ticker.NullLocator())
        heatmapAxes.xaxis.set_minor_locator(ticker.NullLocator())
        heatmapAxes.yaxis.set_major_locator(ticker.NullLocator())
        heatmapAxes.yaxis.set_minor_locator(ticker.NullLocator())

        self.plotHeatMap()
        labelsAxes = self.plt.axes([border + width*(dendrogramAreaRatio + heatmapAreaRatio + 2*space), border, width*labelsAreaRatio, height], sharey=dendrogramAxes)
        self.plotLabels()
        labelsAxes.set_axis_off()
        labelsAxes.xaxis.set_major_locator(ticker.NullLocator())
        labelsAxes.xaxis.set_minor_locator(ticker.NullLocator())
        labelsAxes.yaxis.set_major_locator(ticker.NullLocator())
        labelsAxes.yaxis.set_minor_locator(ticker.NullLocator())
        if filename:
            canvas = matplotlib.backends.backend_agg.FigureCanvasAgg(self.fig)
            canvas.print_figure(filename)
        if show:
            self.plt.show()


from orngMisc import ColorPalette, EPSRenderer
class DendrogramPlot(object):
    """ A class for drawing dendrograms.
    Example:
    >>> a = DendrogramPlot(tree)
    """
    def __init__(self, tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, heatmap_width=None, text_width=None,
                 spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, gamma=None, renderer=EPSRenderer):
        self.tree = tree
        self.attr_tree = attr_tree
        self.labels = [str(ex.getclass()) for ex in data] if not labels and data and data.domain.classVar else (labels or [])
#        self.attr_labels = [str(attr.name) for attr in data.domain.attributes] if not attr_labels and data else attr_labels or []
        self.data = data
        self.width, self.height = float(width) if width else None, float(height) if height else None
        self.tree_height = tree_height
        self.heatmap_width = heatmap_width
        self.text_width = text_width
        self.font_size = 10.0
        self.linespacing = 0.0
        self.cluster_colors = cluster_colors
        self.horizontal_margin = 10.0
        self.vertical_margin = 10.0
        self.spacing = float(spacing) if spacing else None
        self.color_palette = color_palette
        self.minv = minv
        self.maxv = maxv
        self.gamma = gamma
        self.set_matrix_color_schema(color_palette, minv, maxv, gamma)
        self.renderer = renderer

    def set_matrix_color_schema(self, color_palette, minv, maxv, gamma=None):
        """ Set the matrix color scheme.
        """
        if isinstance(color_palette, ColorPalette):
            self.color_palette = color_palette
        else:
            self.color_palette = ColorPalette(color_palette)
        self.minv = minv
        self.maxv = maxv
        self.gamma = gamma

    def color_shema(self):
        vals = [float(val) for ex in self.data for val in ex if not val.isSpecial() and val.variable.varType==orange.VarTypes.Continuous] or [0]
        avg = sum(vals)/len(vals)

        maxVal = self.maxv if self.maxv else max(vals)
        minVal = self.minv if self.minv else min(vals)

        def _colorSchema(val):
            if val.isSpecial():
                return self.color_palette(None)
            elif val.variable.varType==orange.VarTypes.Continuous:
                r, g, b = self.color_palette((float(val) - minVal) / abs(maxVal - minVal), gamma=self.gamma)
            elif val.variable.varType==orange.VarTypes.Discrete:
                r = g = b = int(255.0*float(val)/len(val.variable.values))
            return (r, g, b)
        return _colorSchema

    def layout(self):
        height_final = False
        width_final = False
        tree_height = self.tree_height or 100
        if self.height:
            height, height_final = self.height, True
            heatmap_height = height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin
            font_size = heatmap_height / len(self.labels)
        else:
            font_size = self.font_size
            heatmap_height = font_size * len(self.labels)
            height = heatmap_height + (tree_height + self.spacing if self.attr_tree else 0) + 2 * self.horizontal_margin

        text_width = self.text_width or max([len(label) for label in self.labels] + [0]) * font_size

        if self.width:
            width = self.width
            heatmap_width = width - 2 * self.vertical_margin - tree_height - (2 if self.data else 1) * self.spacing - text_width if self.data else 0
        else:
            heatmap_width = len(self.data.domain.attributes) * heatmap_height / len(self.data) if self.data else 0
            width = 2 * self.vertical_margin + tree_height + (heatmap_width + self.spacing if self.data else 0) + self.spacing + text_width

        return width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size

    def plot(self, filename="graph.eps"):
        width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size = self.layout()
        heatmap_cell_height = heatmap_height / len(self.labels)
        heatmap_cell_width = heatmap_width / len(self.data.domain.attributes)

        self.renderer = self.renderer(width, height)

        def draw_tree(cluster, root, treeheight, treewidth, color):
            height = treeheight * cluster.height / root.height
            if cluster.branches:
                centers = []
                for branch in cluster.branches:
                    center = draw_tree(branch, root, treeheight, treewidth, self.cluster_colors.get(branch, color))
                    centers.append(center)
                    self.renderer.draw_line(center[0], center[1], center[0], height, stroke_color=self.cluster_colors.get(branch, color))

                self.renderer.draw_line(centers[0][0], height, centers[1][0], height, stroke_color=self.cluster_colors.get(cluster, color))
                return (centers[0][0] + centers[1][0]) / 2.0, height
            else:
                return float(treewidth) * cluster.first / len(root), 0.0
        self.renderer.save_render_state()
        self.renderer.translate(self.vertical_margin + tree_height, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0) + heatmap_cell_height / 2.0)
        self.renderer.rotate(90)
#        print self.renderer.transform()
        draw_tree(self.tree, self.tree, tree_height, heatmap_height, self.cluster_colors.get(self.tree, (0,0,0)))
        self.renderer.restore_render_state()
        if self.attr_tree:
            self.renderer.save_render_state()
            self.renderer.translate(self.vertical_margin + tree_height + self.spacing + heatmap_cell_width / 2.0, self.horizontal_margin + tree_height)
            self.renderer.scale(1.0, -1.0)
#            print self.renderer.transform()
            draw_tree(self.attr_tree, self.attr_tree, tree_height, heatmap_width, self.cluster_colors.get(self.attr_tree, (0,0,0)))
            self.renderer.restore_render_state()

        self.renderer.save_render_state()
        self.renderer.translate(self.vertical_margin + tree_height + self.spacing, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0))
#        print self.renderer.transform()
        if self.data:
            colorSchema = self.color_shema()
            for i, ii in enumerate(self.tree):
                ex = self.data[ii]
                for j, jj in enumerate((self.attr_tree if self.attr_tree else range(len(self.data.domain.attributes)))):
                    r, g, b = colorSchema(ex[jj])
                    self.renderer.draw_rect(j * heatmap_cell_width, i * heatmap_cell_height, heatmap_cell_width, heatmap_cell_height, fill_color=(r, g, b), stroke_color=(255, 255, 255))

        self.renderer.translate(heatmap_width + self.spacing, heatmap_cell_height)
#        print self.renderer.transform()
        self.renderer.set_font("Times-Roman", font_size)
        for index in self.tree: #label in self.labels:
            self.renderer.draw_text(0.0, 0.0, self.labels[index])
            self.renderer.translate(0.0, heatmap_cell_height)
        self.renderer.restore_render_state()
        self.renderer.save(filename)

def dendrogram_draw(filename, *args, **kwargs):
    import os
    from orngMisc import PILRenderer, EPSRenderer, SVGRenderer
    name, ext = os.path.splitext(filename)
    kwargs["renderer"] = {".eps":EPSRenderer, ".svg":SVGRenderer, ".png":PILRenderer}.get(ext.lower(), PILRenderer)
#    print kwargs["renderer"], ext
    d = DendrogramPlot(*args, **kwargs)
    d.plot(filename)

if __name__=="__main__":
    data = orange.ExampleTable("doc//datasets//brown-selected.tab")
#    data = orange.ExampleTable("doc//datasets//iris.tab")
    root = hierarchicalClustering(data, order=True) #, linkage=orange.HierarchicalClustering.Single)
    attr_root = hierarchicalClustering_attributes(data, order=True)
#    print root
#    d = DendrogramPlotPylab(root, data=data, labels=[str(ex.getclass()) for ex in data], dendrogram_width=0.4, heatmap_width=0.3, params={}, cmap=None)
#    d.plot(show=True, filename="graph.png")

    dendrogram_draw("graph.eps", root, attr_tree=attr_root, data=data, labels=[str(e.getclass()) for e in data], tree_height=50, #width=500, height=500,
                    cluster_colors={root.right:(255,0,0), root.right.right:(0,255,0)},
                    color_palette=ColorPalette([(255, 0, 0), (0,0,0), (0, 255,0)], gamma=0.5,
                                               overflow=(255, 255, 255), underflow=(255, 255, 255))) #, minv=-0.5, maxv=0.5)
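The recursive layout used by plotDendrogram and the draw_tree helpers above can be sketched standalone: each call returns a subtree's (x, y) anchor, and the parent emits the horizontal segments out to the merge height plus the vertical segment joining its two children. The Node class below is a hypothetical stand-in for Orange's cluster objects (with height, branches, and first attributes), not the actual API:

```python
# Minimal sketch of the recursive dendrogram layout: leaves sit at x=0 in
# leaf order; an internal node sits at x=tree.height, vertically centred
# between its two children. Node is a hypothetical stand-in, not Orange's API.

class Node(object):
    def __init__(self, height=0.0, branches=None, first=0):
        self.height = height      # merge distance of this cluster
        self.branches = branches  # (left, right) subtrees, or None for a leaf
        self.first = first        # leaf index, used as the y position

def layout(tree, segments):
    """Return the (x, y) anchor of `tree`, appending line segments drawn."""
    if tree.branches:
        points = [layout(b, segments) for b in tree.branches]
        for px, py in points:
            # horizontal segment from the child out to this merge height
            segments.append(((px, py), (tree.height, py)))
        # vertical segment joining the two children
        segments.append(((tree.height, points[0][1]),
                         (tree.height, points[1][1])))
        return tree.height, (points[0][1] + points[1][1]) / 2.0
    else:
        return 0.0, float(tree.first)

# two leaves merged at height 2, then joined with a third leaf at height 5
root = Node(5.0, (Node(2.0, (Node(first=0), Node(first=1))), Node(first=2)))
segs = []
x, y = layout(root, segs)   # root anchor: x = 5.0, y midway between children
```

The same return-the-center recursion is what lets both plotters above draw a whole tree in a single pass without a separate coordinate-assignment phase.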
orange/Orange/clustering/kmeans.py
r7462 → r7503:

 .. index::
    single: clustering, kmeans
-.. index:: aglomerative clustering
+.. index:: agglomerative clustering
 
 
-.. autoclass:: Orange.cluster.KMeans
+.. autoclass:: Orange.clustering.kmeans.Clustering
    :members:
 
…
 =========================
 
-.. automethod:: Orange.cluster.kmeans_init_random
-
-.. automethod:: Orange.cluster.kmeans_init_diversity
-
-.. automethod:: Orange.cluster.KMeans_init_hierarchicalClustering
-
-.. automethod:: Orange.cluster.data_center
-
-.. automethod:: Orange.cluster.data_center
-
-.. automethod:: Orange.cluster.plot_silhouette
-
-.. automethod:: Orange.cluster.score_distance_to_centroids
-
-.. automethod:: Orange.cluster.score_silhouette
-
-.. automethod:: Orange.cluster.score_fastsilhouette
-
-Typically, the choice of seeds has a large impact on the k-means clustering, with better initialization methods yielding a clustering that converges faster and finds more optimal centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge (`kmeans-cmp-init.py`_, uses `iris.tab`_, `housing.tab`_, `vehicle.tab`_):
+.. automethod:: Orange.clustering.kmeans.init_random
+
+.. automethod:: Orange.clustering.kmeans.init_diversity
+
+.. automethod:: Orange.clustering.kmeans.init_hclustering
+
+.. automethod:: Orange.clustering.kmeans.data_center
+
+.. automethod:: Orange.clustering.kmeans.data_center
+
+.. automethod:: Orange.clustering.kmeans.plot_silhouette
+
+.. automethod:: Orange.clustering.kmeans.score_distance_to_centroids
+
+.. automethod:: Orange.clustering.kmeans.score_silhouette
+
+.. automethod:: Orange.clustering.kmeans.score_fast_silhouette
+
+Typically, the choice of seeds has a large impact on the k-means clustering,
+with better initialization methods yielding a clustering that converges faster
+and finds more optimal centroids. The following code compares three different
+initialization methods (random, diversity-based and hierarchical clustering-based)
+in terms of how fast they converge (`kmeans-cmp-init.py`_, uses `iris.tab`_,
+`housing.tab`_, `vehicle.tab`_):
 
 .. literalinclude:: code/kmeans-cmp-init.py
 
-As expected, k-means converges faster with diversity- and clustering-based initialization than with random seed selection::
+As expected, k-means converges faster with diversity- and clustering-based
+initialization than with random seed selection::
 
              Rnd  Div  HC
…
     vehicle   11    4   3
 
-The following code computes the silhouette score for k=2..7 and plots a silhouette plot for k=3 (`kmeans-silhouette.py`_, uses `iris.tab`_):
+The following code computes the silhouette score for k=2..7 and plots a
+silhouette plot for k=3 (`kmeans-silhouette.py`_, uses `iris.tab`_):
 
 .. literalinclude:: code/kmeans-silhouette.py
 
-The analysis suggests that k=2 is preferred as it yields the maximal silhouette coefficient::
+The analysis suggests that k=2 is preferred as it yields
+the maximal silhouette coefficient::
 
     2 0.629467553352
…
 import random
 import statc
+
+import Orange.clustering.hierarchical
 
 # miscellaneous functions
…
 #
 
-class ExamplesDistanceConstructor_PearsonR(orange.ExamplesDistanceConstructor):
-    def __new__(cls, data=None, **argkw):
-        self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw)
-        self.__dict__.update(argkw)
-        if data:
-            return self.__call__(data)
-        else:
-            return self
-
-    def __call__(self, data):
-        indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType==orange.VarTypes.Continuous]
-        return ExamplesDistance_PearsonR(domain=data.domain, indxs=indxs)
-
-class ExamplesDistance_PearsonR(orange.ExamplesDistance):
-    """ Pearson distance: (1 - R)/2, where R is the Pearson correlation between examples. In [0..1]. """
-    def __init__(self, **argkw):
-        self.__dict__.update(argkw)
-    def __call__(self, e1, e2):
-        X1 = []; X2 = []
-        for i in self.indxs:
-            if not(e1[i].isSpecial() or e2[i].isSpecial()):
-                X1.append(float(e1[i]))
-                X2.append(float(e2[i]))
-        if not X1:
-            return 1.0
-        try:
-            return (1.0 - statc.pearsonr(X1, X2)[0]) / 2.
-        except:
-            return 1.0
-
-class ExamplesDistanceConstructor_SpearmanR(orange.ExamplesDistanceConstructor):
-    def __new__(cls, data=None, **argkw):
-        self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw)
-        self.__dict__.update(argkw)
-        if data:
-            return self.__call__(data)
-        else:
-            return self
-
-    def __call__(self, data):
-        indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType==orange.VarTypes.Continuous]
-        return ExamplesDistance_SpearmanR(domain=data.domain, indxs=indxs)
-
-class ExamplesDistance_SpearmanR(orange.ExamplesDistance):
-    def __init__(self, **argkw):
-        self.__dict__.update(argkw)
-    def __call__(self, e1, e2):
-        X1 = []; X2 = []
-        for i in self.indxs:
-            if not(e1[i].isSpecial() or e2[i].isSpecial()):
-                X1.append(float(e1[i]))
-                X2.append(float(e2[i]))
-        if not X1:
-            return 1.0
-        try:
-            return (1.0 - statc.spearmanr(X1, X2)[0]) / 2.
-        except:
-            return 1.0
-
 # k-means clustering
…
     return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0
 
-def score_fastsilhouette(km, index=None):
+def score_fast_silhouette(km, index=None):
     """Same as score_silhouette, but computes an approximation and is faster.
…
 # initialization functions should be of the type f(data, k, distfun)
 
-def kmeans_init_random(data, k, _):
+def init_random(data, k, _):
     """A function that can be used for initialization of k-means clustering; returns k data instances from the data. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).
…
     return data.getitems(random.sample(range(len(data)), k))
 
-def kmeans_init_diversity(data, k, distfun):
+def init_diversity(data, k, distfun):
     """A function that can be used for initialization of k-means clustering. Returns a set of centroids where the first one is the data point farthest away from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previous set of centroids is maximal. Differs from the initialization proposed by Katsavounidis et al. (1994) only in the selection of the first centroid (where they use a data instance with the highest norm).
 
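The diversity-based seeding described in the docstring above can be sketched in plain Python (a simplified stand-in for the renamed init_diversity, not Orange's implementation): take the point farthest from the data centre first, then repeatedly add the point whose minimal distance to the seeds chosen so far is largest.

```python
# Standalone sketch of diversity-based k-means seeding (not Orange's code).
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_diversity(points, k, distance=euclidean):
    # centre of the data, coordinate-wise mean
    center = tuple(sum(c) / float(len(points)) for c in zip(*points))
    # first seed: the instance farthest from the centre
    seeds = [max(points, key=lambda p: distance(p, center))]
    while len(seeds) < k:
        # next seed: maximal minimal distance to the seeds chosen so far
        seeds.append(max((p for p in points if p not in seeds),
                         key=lambda p: min(distance(p, s) for s in seeds)))
    return seeds

data = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 9.0)]
seeds = init_diversity(data, 3)   # first seed is the outlier (5.0, 9.0)
```

Because near-duplicates like (0.1, 0.0) have a tiny minimal distance to an already chosen seed, this scheme spreads the initial centroids across the data, which is why the kmeans-cmp-init comparison above converges in fewer iterations than random seeding.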
…
     return seeds
 
-class KMeans_init_hierarchicalClustering():
-    """A class that returns a clustering initialization function that performs hierarchical clustering, uses it to infer k clusters, and computes a list of cluster-based data centers."""
+class init_hclustering():
+    """
+    A class that returns a clustering initialization function that performs
+    hierarchical clustering, uses it to infer k clusters, and computes a
+    list of cluster-based data centers.
+    """
 
     def __init__(self, n=100):
…
         """
         sample = orange.ExampleTable(random.sample(data, min(self.n, len(data))))
-        root = hierarchicalClustering(sample)
-        cmap = hierarchicalClustering_topClusters(root, k)
+        root = Orange.clustering.hierarchical.clustering(sample)
+        cmap = Orange.clustering.hierarchical.top_clusters(root, k)
         return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap]
…
 #
 
-class KMeans:
-    """
-
-    Implements a k-means clustering algorithm:
+class Clustering:
+    """Implements a k-means clustering algorithm:
 
     #. Choose the number of clusters, k.
…
 
     def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None,
-                 stopchanges=0, nstart=1, initialization=kmeans_init_random,
+                 stopchanges=0, nstart=1, initialization=init_random,
                  distance=orange.ExamplesDistanceConstructor_Euclidean,
                  scoring=score_distance_to_centroids, inner_callback = None,
…
         :param distance: an example distance constructor, which measures the distance between two instances.
         :type distance: :class:`orange.ExamplesDistanceConstructor`
-        :param initialization: a function to select centroids given data instances, k and an example distance function. This module implements different approaches (:func:`kmeans_init_random`, :func:`kmeans_init_diversity`, :class:`KMeans_init_hierarchicalClustering`).
+        :param initialization: a function to select centroids given data instances, k and an example distance function. This module implements different approaches (:func:`init_random`, :func:`init_diversity`, :class:`init_hclustering`).
         :param scoring: a function that takes a clustering object and returns the clustering score. It can be used, for instance, in a procedure that repeats the clustering nstart times, returning the clustering with the lowest score.
         :param inner_callback: invoked after every clustering iteration.
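The algorithm steps listed in the Clustering docstring (pick k centroids, assign each instance to the nearest one, recompute centroids as cluster means, repeat) can be sketched as a minimal Lloyd iteration in plain Python, independent of Orange; stopping when no centroid moves corresponds to the stopchanges=0 default above.

```python
# Minimal Lloyd-iteration sketch of k-means (not Orange's implementation).
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iters=100):
    """Assign points to nearest centroid, recompute means, repeat until stable."""
    clusters = []
    for _ in range(max_iters):
        # assignment step: index of the closest centroid for each point
        clusters = [min(range(len(centroids)),
                        key=lambda c: euclidean(p, centroids[c]))
                    for p in points]
        # update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, cl in zip(points, clusters) if cl == c]
            if members:
                new_centroids.append(tuple(sum(xs) / float(len(xs))
                                           for xs in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centroids, clusters = kmeans(data, [(0.0, 0.0), (10.0, 0.0)])
```

The real Clustering class layers the pieces documented above on top of this loop: a pluggable initialization function supplies the starting centroids, a distance constructor replaces euclidean, and scoring/callbacks observe each iteration.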