Changeset 7503:166e99ee3f9d in orange


Timestamp:
02/04/11 18:37:32
Author:
blaz <blaz.zupan@…>
Branch:
default
Convert:
626035b7b1533e3c3b7d5f224b29d6bed94c8aff
Message:

new clustering modules

Location:
orange/Orange/clustering
Files:
1 added
2 edited

  • orange/Orange/clustering/__init__.py

    r7429 r7503  
    55Everything about clustering, including k-means and hierarchical clustering. 
    66 
    7 ================== 
    8 k-Means clustering 
    9 ================== 
    10  
    11 .. index:: k-means clustering 
    12  
    13 .. autoclass:: Orange.cluster.KMeans 
    14    :members: 
    15  
    16 Examples 
    17 ======== 
    18  
    19 The following code runs k-means clustering and prints out the cluster indices for the last 10 data instances (`kmeans-run.py`_, uses `iris.tab`_): 
    20  
    21 .. literalinclude:: code/kmeans-run.py 
    22  
    23 The output of this code is:: 
    24  
    25     [1, 1, 2, 1, 1, 1, 2, 1, 1, 2] 
    26  
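For orientation, the essence of such a script is only a few lines. The sketch below assumes that the :class:`KMeans` class described above is importable from this module::

    import orange
    from Orange.clustering import KMeans

    data = orange.ExampleTable("iris")   # load the data set
    km = KMeans(data, centroids=3)       # cluster into k=3 clusters
    print km.clusters[-10:]              # cluster indices of the last 10 instances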
    27 Invoking a callback function may be useful for tracing the progress of the clustering. Below is code that uses an :obj:`inner_callback` to report the number of instances that have changed cluster and the current clustering score. For the score to be computed at each iteration we have to set :obj:`minscorechange`, but we can leave it at 0, or even set it to a negative value, which allows the score to deteriorate by some amount (`kmeans-run-callback.py`_, uses `iris.tab`_): 
    28  
    29 .. literalinclude:: code/kmeans-run-callback.py 
    30  
    31 Convergence on the Iris data set is fast:: 
    32  
    33     Iteration: 1, changes: 150, score: 10.9555 
    34     Iteration: 2, changes: 12, score: 10.3867 
    35     Iteration: 3, changes: 2, score: 10.2034 
    36     Iteration: 4, changes: 2, score: 10.0699 
    37     Iteration: 5, changes: 2, score: 9.9542 
    38     Iteration: 6, changes: 1, score: 9.9168 
    39     Iteration: 7, changes: 2, score: 9.8624 
    40     Iteration: 8, changes: 0, score: 9.8624 
    41  
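The callback receives the clustering object itself, so a reporting function of the kind used above can be as simple as the following sketch (attribute names as documented in :class:`KMeans` below)::

    def report(km):
        print "Iteration: %d, changes: %d, score: %.4f" % \
            (km.iteration, km.nchanges, km.score)

    km = KMeans(data, 3, minscorechange=0, inner_callback=report)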
    42 The callback above is used to report progress, but it could just as well call a function that plots a selected data projection with the corresponding centroids at a given step of the clustering. This is what the following script does (`kmeans-trace.py`_, uses `iris.tab`_): 
    43  
    44 .. literalinclude:: code/kmeans-trace.py 
    45  
    46 Only the first four scatterplots are shown below. Colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular 2-dimensional projection is not necessarily the centroid of the cluster that the data point belongs to. 
    47  
    48  
    49 .. image:: files/kmeans-scatter-001.png 
    50  
    51 .. image:: files/kmeans-scatter-002.png 
    52  
    53 .. image:: files/kmeans-scatter-003.png 
    54  
    55 .. image:: files/kmeans-scatter-004.png 
    56  
    57  
    58 k-Means Utility Functions 
    59 ========================= 
    60  
    61 .. automethod:: Orange.cluster.kmeans_init_random 
    62  
    63 .. automethod:: Orange.cluster.kmeans_init_diversity 
    64  
    65 .. autoclass:: Orange.cluster.KMeans_init_hierarchicalClustering 
    66  
    67 .. automethod:: Orange.cluster.data_center 
    68  
    71 .. automethod:: Orange.cluster.plot_silhouette 
    72  
    73 .. automethod:: Orange.cluster.score_distance_to_centroids 
    74  
    75 .. automethod:: Orange.cluster.score_silhouette 
    76  
    77 .. automethod:: Orange.cluster.score_fastsilhouette 
    78  
    79 Typically, the choice of seeds has a large impact on k-means clustering, with better initialization methods yielding clusterings that converge faster and find better centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge (`kmeans-cmp-init.py`_, uses `iris.tab`_, `housing.tab`_, `vehicle.tab`_): 
    80  
    81 .. literalinclude:: code/kmeans-cmp-init.py 
    82  
    83 As expected, k-means converges faster with diversity- and clustering-based initialization than with random seed selection:: 
    84  
    85                Rnd Div  HC 
    86           iris  12   3   4 
    87        housing  14   6   4 
    88        vehicle  11   4   3 
    89  
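Such a comparison boils down to constructing a :class:`KMeans` object with each initialization method and reading off its :obj:`iteration` attribute; a sketch, assuming the utility functions documented above::

    for init in [kmeans_init_random, kmeans_init_diversity,
                 KMeans_init_hierarchicalClustering()]:
        km = KMeans(data, 3, initialization=init)
        print km.iteration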
    90 The following code computes the silhouette score for k=2..7 and plots a silhouette plot for k=3 (`kmeans-silhouette.py`_, uses `iris.tab`_): 
    91  
    92 .. literalinclude:: code/kmeans-silhouette.py 
    93  
    94 The analysis suggests that k=2 is preferred as it yields the maximal silhouette coefficient:: 
    95  
    96     2 0.629467553352 
    97     3 0.504318855054 
    98     4 0.407259377854 
    99     5 0.358628975081 
    100     6 0.353228492088 
    101     7 0.366357876944 
    102  
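The scores above are obtained by clustering at each k and applying :func:`score_silhouette` to the result; a sketch::

    for k in range(2, 8):
        km = KMeans(data, k)
        print k, score_silhouette(km)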
    103 .. figure:: files/kmeans-silhouette.png 
    104  
    105    Silhouette plot for k=3. 
    106  
    107 .. _iris.tab: code/iris.tab 
    108 .. _housing.tab: code/housing.tab 
    109 .. _vehicle.tab: code/vehicle.tab 
    110 .. _kmeans-run.py: code/kmeans-run.py 
    111 .. _kmeans-run-callback.py: code/kmeans-run-callback.py 
    112 .. _kmeans-trace.py: code/kmeans-trace.py 
    113 .. _kmeans-cmp-init.py: code/kmeans-cmp-init.py 
    114 .. _kmeans-silhouette.py: code/kmeans-silhouette.py 
    1157 
    1168=======================
    1179Hierarchical Clustering
    11810=======================
    11911 
    120 .. class:: Orange.cluster.ExampleCluster 
     12.. class:: Orange.clustering.ExampleCluster 
    12113 
    12214   This is an additional description of ExampleCluster. 
     
    13022    
    13123 
    132 .. autoclass:: Orange.cluster.HierarchicalCluster 
     24.. autoclass:: Orange.clustering.HierarchicalCluster 
    13325   :members: 
    13426   :show-inheritance: 
    13527   :undoc-members: 
    13628 
    137 .. autoclass:: Orange.cluster.HierarchicalClusterList 
     29.. autoclass:: Orange.clustering.HierarchicalClusterList 
    13830   :members: 
    13931   :show-inheritance: 
    14032   :undoc-members: 
    14133 
    142 .. autoclass:: Orange.cluster.HierarchicalClustering 
     34.. autoclass:: Orange.clustering.HierarchicalClustering 
    14335   :members: 
    14436   :show-inheritance: 
     
    16254__docformat__ = 'restructuredtext' 
    16355 
    164 # miscellaneous functions (used across this module) 
    165  
    166 class ExampleCluster2(orange.ExampleCluster): 
    167     """New example cluster""" 
    168     pass 
    169  
    170 def _modus(dist): 
    171     #Check bool(dist) - False means no known cases 
    171     #Check dist.cases > 0 - We can't return some value from the domain without knowing if it is even present 
    172     #in the data. TODO: What does this mean for k-means convergence? 
    174     if bool(dist) and dist.cases > 0: 
    175         return dist.modus() 
    176     else: 
    177         return None 
    178      
    179 def data_center(data): 
    180     """ 
    181     Returns a center of the instances in the data set (average across data instances for continuous attributes, most frequent value for discrete attributes). 
    182     """ 
    183     atts = data.domain.attributes 
    184     astats = orange.DomainBasicAttrStat(data) 
    185     center = [astats[a].avg if a.varType == orange.VarTypes.Continuous \ 
    186 #              else max(enumerate(orange.Distribution(a, data)), key=lambda x:x[1])[0] if a.varType == orange.VarTypes.Discrete 
    187               else _modus(orange.Distribution(a, data)) if a.varType == orange.VarTypes.Discrete 
    188               else None 
    189               for a in atts] 
    190     if data.domain.classVar: 
    191         center.append(0) 
    192     return orange.Example(data.domain, center) 
    193  
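# A usage sketch for data_center (hypothetical file name; assumes an ExampleTable is loaded):
#   data = orange.ExampleTable("iris")
#   center = data_center(data)   # an orange.Example of attribute averages/modes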
    194 def minindex(x): 
    195     """Return the index of the minimum element""" 
    196     return x.index(min(x)) 
    197  
    198 def avg(x): 
    199     """Return the average (mean) of a given list""" 
    200     return (float(sum(x)) / len(x)) if x else 0 
    201  
    202 # 
    203 # data distances 
    204 # 
    205  
    206 class ExamplesDistanceConstructor_PearsonR(orange.ExamplesDistanceConstructor): 
    207     def __new__(cls, data=None, **argkw): 
    208         self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw) 
    209         self.__dict__.update(argkw) 
    210         if data: 
    211             return self.__call__(data) 
    212         else: 
    213             return self 
    214  
    215     def __call__(self, data): 
    216         indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType==orange.VarTypes.Continuous] 
    217         return ExamplesDistance_PearsonR(domain=data.domain, indxs=indxs) 
    218  
    219 class ExamplesDistance_PearsonR(orange.ExamplesDistance): 
    220     """ Pearson distance. (1 - R)/2, where R is the Pearson correlation between examples. In [0..1]. """ 
    221  
    222     def __init__(self, **argkw): 
    223         self.__dict__.update(argkw) 
    224     def __call__(self, e1, e2): 
    225         X1 = []; X2 = [] 
    226         for i in self.indxs: 
    227             if not(e1[i].isSpecial() or e2[i].isSpecial()): 
    228                 X1.append(float(e1[i])) 
    229                 X2.append(float(e2[i])) 
    230         if not X1: 
    231             return 1.0 
    232         try: 
    233             return (1.0 - statc.pearsonr(X1, X2)[0]) / 2. 
    234         except: 
    235             return 1.0 
    236  
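# Usage sketch for the Pearson distance (construct on data, then call on instance pairs):
#   dist = ExamplesDistanceConstructor_PearsonR(data)
#   print dist(data[0], data[1])   # in [0..1]; 0 for perfectly correlated instances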
    237 class ExamplesDistanceConstructor_SpearmanR(orange.ExamplesDistanceConstructor): 
    238     def __new__(cls, data=None, **argkw): 
    239         self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw) 
    240         self.__dict__.update(argkw) 
    241         if data: 
    242             return self.__call__(data) 
    243         else: 
    244             return self 
    245  
    246     def __call__(self, data): 
    247         indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType==orange.VarTypes.Continuous] 
    248         return ExamplesDistance_SpearmanR(domain=data.domain, indxs=indxs) 
    249  
    250 class ExamplesDistance_SpearmanR(orange.ExamplesDistance): 
    251     def __init__(self, **argkw): 
    252         self.__dict__.update(argkw) 
    253     def __call__(self, e1, e2): 
    254         X1 = []; X2 = [] 
    255         for i in self.indxs: 
    256             if not(e1[i].isSpecial() or e2[i].isSpecial()): 
    257                 X1.append(float(e1[i])) 
    258                 X2.append(float(e2[i])) 
    259         if not X1: 
    260             return 1.0 
    261         try: 
    262             return (1.0 - statc.spearmanr(X1, X2)[0]) / 2. 
    263         except: 
    264             return 1.0 
    265  
    266 # k-means clustering 
    267  
    268 # clustering scoring functions  
    269  
    270 def score_distance_to_centroids(km): 
    271     """Returns the sum of distances of data instances to their associated centroids. 
    272  
    273     :param km: a k-means clustering object. 
    274     :type km: :class:`KMeans` 
    275     """ 
    276     return sum(km.distance(km.centroids[km.clusters[i]], d) for i,d in enumerate(km.data)) 
    277  
    278 score_distance_to_centroids.minimize = True 
    279  
    280 def score_conditionalEntropy(km): 
    281     """UNIMPLEMENTED cluster quality measured by conditional entropy""" 
    282     pass 
    283  
    284 def score_withinClusterDistance(km): 
    285     """UNIMPLEMENTED weighted average within-cluster pairwise distance""" 
    286     pass 
    287  
    288 score_withinClusterDistance.minimize = True 
    289  
    290 def score_betweenClusterDistance(km): 
    291     """Sum of distances from elements to 'nearest miss' centroids""" 
    292     return sum(min(km.distance(c, d) for j,c in enumerate(km.centroids) if j!=km.clusters[i]) for i,d in enumerate(km.data)) 
    293  
    294 def score_silhouette(km, index=None): 
    295     """Returns an average silhouette score of data instances. 
    296      
    297     :param km: a k-means clustering object. 
    298     :type km: :class:`KMeans` 
    299  
    300     :param index: if given, the function returns just the silhouette score of that particular data instance. 
    301     :type index: integer 
    302     """ 
    303      
    304     if index == None: 
    305         return avg([score_silhouette(km, i) for i in range(len(km.data))]) 
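    # a: average distance from this instance to the other members of its own cluster
    # b: lowest average distance from this instance to the members of any other cluster
    # the silhouette of the instance is then (b - a) / max(a, b), in [-1, 1]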
    306     cind = km.clusters[index] 
    307     a = avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if 
    308              km.clusters[i] == cind and i != index]) 
    309     b = min([avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if 
    310                  km.clusters[i] == c]) 
    311             for c in range(len(km.centroids)) if c != cind]) 
    312     return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0 
    313  
    314 def score_fastsilhouette(km, index=None): 
    315     """Same as score_silhouette, but computes an approximation and is faster. 
    316      
    317     :param km: a k-means clustering object. 
    318     :type km: :class:`KMeans` 
    319     """ 
    320  
    321     if index == None: 
    322         return avg([score_fastsilhouette(km, i) for i in range(len(km.data))]) 
    323     cind = km.clusters[index] 
    324     a = km.distance(km.data[index], km.centroids[km.clusters[index]]) 
    325     b = min([km.distance(km.data[index], c) for i,c in enumerate(km.centroids) if i != cind]) 
    326     return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0 
    327  
    328 def compute_bic(km):
    329     """Compute the Bayesian information criterion score for the given clustering. NEEDS REWRITE!!!"""
    330     data = km.data
    331     medoids = km.centroids
    332  
    333     M = len(data.domain.attributes)
    334     R = float(len(data))
    335     Ri = [km.clusters.count(i) for i in range(km.k)]
    336     numFreePar = (M + 1.) * km.k * math.log(R, 2.) / 2.
    337     # sigma**2
    338     s2 = 0.
    339     cidx = [i for i, attr in enumerate(data.domain.attributes) if attr.varType in [orange.VarTypes.Continuous, orange.VarTypes.Discrete]]
    340     for x, midx in izip(data, km.clusters):
    341         medoid = medoids[midx] # centroids are indexed directly by cluster index
    342         s2 += sum( [(float(x[i]) - float(medoid[i]))**2 for i in cidx] )
    343     s2 /= (R - km.k)
    344     if s2 < 1e-20:
    345         return None, [None]*km.k
    346     # log-likelihood of clusters: l(Dn)
    347     # log-likelihood of clustering: l(D)
    348     ld = 0
    349     bicc = []
    350     for k in range(km.k):
    351         ldn = -1. * Ri[k] * ((math.log(2. * math.pi, 2) / -2.) - (M * math.log(s2, 2) / 2.) + (km.k / 2.) + math.log(Ri[k], 2) - math.log(R, 2))
    352         ld += ldn
    353         bicc.append(ldn - numFreePar)
    354     return ld - numFreePar, bicc
     56#class ExampleCluster2(orange.ExampleCluster): 
     57#    """New example cluster""" 
     58#    pass 
    35559 
    35660 
    357 # 
    358 # silhouette plot 
    359 # 
    360  
    361 def plot_silhouette(km, filename='tmp.png', fast=False): 
    362     """ Saves a silhouette plot to filename, showing the distribution of silhouette scores within each cluster. 
    363  
    364     :param km: a k-means clustering object. 
    365     :type km: :class:`KMeans` 
    366     :param filename: name of output plot. 
    367     :type filename: string 
    368     :param fast: if True use :func:`score_fastsilhouette` to compute scores instead of :func:`score_silhouette` 
    369     :type fast: boolean. 
    370  
    371     """ 
    372     import matplotlib.pyplot as plt 
    373     plt.figure() 
    374     scoring = score_fastsilhouette if fast else score_silhouette 
    375     scores = [[] for i in range(km.k)] 
    376     for i, c in enumerate(km.clusters): 
    377         scores[c].append(scoring(km, i)) 
    378     csizes = map(len, scores) 
    379     cpositions = [sum(csizes[:i]) + (i+1)*3 + csizes[i]/2 for i in range(km.k)] 
    380     scores = reduce(lambda x,y: x + [0]*3 + sorted(y), scores, []) 
    381     plt.barh(range(len(scores)), scores, linewidth=0, color='c') 
    382     plt.yticks(cpositions, map(str, range(km.k))) 
    383     #plt.title('Silhouette plot') 
    384     plt.ylabel('Cluster') 
    385     plt.xlabel('Silhouette value') 
    386     plt.savefig(filename) 
    387  
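# Usage sketch for plot_silhouette:
#   km = KMeans(data, 3)
#   plot_silhouette(km, "silhouette.png", fast=True)   # fast=True uses the approximate score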
    388 # clustering initialization (seeds) 
    389 # initialization functions should be of the type f(data, k, distfun) 
    390  
    391 def kmeans_init_random(data, k, _): 
    392     """A function that can be used for initialization of k-means clustering; returns k randomly chosen data instances. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004). 
    393      
    394     :param data: data instances. 
    395     :type data: :class:`orange.ExampleTable` 
    396     :param k: the number of clusters. 
    397     :type k: integer 
    398     :param distfun: a distance function. 
    399     :type distfun: :class:`orange.ExamplesDistance` 
    400      """ 
    401     return data.getitems(random.sample(range(len(data)), k)) 
    402  
    403 def kmeans_init_diversity(data, k, distfun): 
    404     """A function that can be used for initialization of k-means clustering. Returns a set of centroids where the first is the data point farthest from the center of the data, and each subsequent centroid is the data point whose minimal distance to the already selected centroids is maximal. Differs from the initialization proposed by Katsavounidis et al. (1994) only in the selection of the first centroid (they use the data instance with the highest norm). 
    405  
    406     :param data: data instances. 
    407     :type data: :class:`orange.ExampleTable` 
    408     :param k: the number of clusters. 
    409     :type k: integer 
    410     :param distfun: a distance function. 
    411     :type distfun: :class:`orange.ExamplesDistance` 
    412     """ 
    413     center = data_center(data) 
    414     # the first seed should be the farthest point from the center 
    415     seeds = [max([(distfun(d, center), d) for d in data])[1]] 
    416     # other seeds are added iteratively, and are data points that are farthest from the current set of seeds 
    417     for i in range(1,k): 
    418         seeds.append(max([(min([distfun(d, s) for s in seeds]), d) for d in data if d not in seeds])[1]) 
    419     return seeds 
    420  
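# Initialization functions are rarely called directly; they are passed to KMeans, e.g.:
#   km = KMeans(data, 3, initialization=kmeans_init_diversity)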
    421 class KMeans_init_hierarchicalClustering(): 
    422     """A class that returns a clustering initialization function: it performs hierarchical clustering on a sample of the data, uses it to infer k clusters, and computes a list of cluster-based data centers.""" 
    423  
    424     def __init__(self, n=100): 
    425         """ 
    426         :param n: number of data instances to sample. 
    427         :type n: integer 
    428         """ 
    429         self.n = n 
    430  
    431     def __call__(self, data, k, distfun): 
    432         """ 
    433         :param data: data instances. 
    434         :type data: :class:`orange.ExampleTable` 
    435         :param k: the number of clusters. 
    436         :type k: integer 
    437         :param distfun: a distance function. 
    438         :type distfun: :class:`orange.ExamplesDistance` 
    439         """ 
    440         sample = orange.ExampleTable(random.sample(data, min(self.n, len(data)))) 
    441         root = hierarchicalClustering(sample) 
    442         cmap = hierarchicalClustering_topClusters(root, k) 
    443         return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap] 
    444  
    445 #     
    446 # k-means clustering, main implementation 
    447 # 
    448  
    449 class KMeans: 
    450     """ 
    451  
    452     Implements a k-means clustering algorithm: 
    453  
    454     #. Choose the number of clusters, k. 
    455     #. Choose a set of k initial centroids. 
    456     #. Assign each instance in the data set to the closest centroid. 
    457     #. For each cluster, compute a new centroid as a center of clustered  
    458        data instances. 
    459     #. Repeat the previous two steps, until some convergence criterion is  
    460        met (e.g., the cluster assignment has not changed). 
    461  
    462     The main advantages of this algorithm are simplicity and low memory   
    463     requirements. The principal disadvantage is the dependence of results  
    464     on the selection of the initial set of centroids. 
    465  
    466     .. attribute:: k 
    467  
    468         Number of clusters. 
    469  
    470     .. attribute:: data 
    471  
    472         Instances to cluster. 
    473  
    474     .. attribute:: centroids 
    475  
    476         Current set of centroids.  
    477  
    478     .. attribute:: score 
    479  
    480         Current clustering score. 
    481  
    482     .. attribute:: iteration 
    483  
    484         Current clustering iteration. 
    485  
    486     .. attribute:: clusters 
    487  
    488         A list of cluster indices: the i-th element is the index of the 
    489         centroid associated with the i-th data instance from the input 
    490         data set.  
    491     """ 
    492  
    493     def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None, 
    494                  stopchanges=0, nstart=1, initialization=kmeans_init_random, 
    495                  distance=orange.ExamplesDistanceConstructor_Euclidean, 
    496                  scoring=score_distance_to_centroids, inner_callback = None, 
    497                  outer_callback = None): 
    498         """ 
    499         :param data: Data instances to be clustered. If not None, clustering will be executed immediately after initialization. 
    500         :type data: :class:`orange.ExampleTable` or None 
    501         :param centroids: either specify a number of clusters or provide a list of examples that will serve as clustering centroids. 
    502         :type centroids: integer or a list of :class:`orange.Example` 
    503         :param nstart: If greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score. 
    504         :type nstart: integer 
    505         :param distance: an example distance constructor, which measures the distance between two instances. 
    506         :type distance: :class:`orange.ExamplesDistanceConstructor` 
    507         :param initialization: a function to select centroids given data instances, k, and an example distance function. This module implements different approaches (:func:`kmeans_init_random`, :func:`kmeans_init_diversity`, :class:`KMeans_init_hierarchicalClustering`).  
    508         :param scoring: a function that takes a clustering object and returns the clustering score. It is used, for instance, in the procedure that repeats the clustering nstart times, returning the clustering with the lowest score. 
    509         :param inner_callback: invoked after every clustering iteration. 
    510         :param outer_callback: invoked after every clustering restart (if nstart is greater than 1). 
    511  
    512         Stopping criteria: 
    513  
    514         :param maxiters: maximum number of clustering iterations 
    515         :type maxiters: integer 
    516         :param minscorechange: minimal improvement of the score from the previous iteration (if lower, the clustering will stop). If None, the score is not computed between iterations 
    517         :type minscorechange: float or None 
    518         :param stopchanges: if the number of instances that changed cluster is lower than or equal to stopchanges, stop the clustering. 
    519         :type stopchanges: integer 
    520         """ 
    521  
    522         self.data = data 
    523         self.k = centroids if type(centroids)==int else len(centroids) 
    524         self.centroids = centroids if type(centroids) == orange.ExampleTable else None 
    525         self.maxiters = maxiters 
    526         self.minscorechange = minscorechange 
    527         self.stopchanges = stopchanges 
    528         self.nstart = nstart 
    529         self.initialization = initialization 
    530         self.distance_constructor = distance 
    531         self.distance = self.distance_constructor(self.data) if self.data else None 
    532         self.scoring = scoring 
    533         self.minimize_score = hasattr(scoring, 'minimize') 
    534         self.inner_callback = inner_callback 
    535         self.outer_callback = outer_callback 
    536         if self.data: 
    537             self.run() 
    538          
    539     def __call__(self, data = None): 
    540         """Runs the k-means clustering algorithm, with optional new data.""" 
    541         if data: 
    542             self.data = data 
    543             self.distance = self.distance_constructor(self.data) 
    544         self.run() 
    545      
    546     def init_centroids(self): 
    547         """Initialize cluster centroids""" 
    548         if self.centroids and not self.nstart > 1: # centroids were specified 
    549             return 
    550         self.centroids = self.initialization(self.data, self.k, self.distance) 
    551          
    552     def compute_centeroid(self, data): 
    553         """Return a centroid of the data set.""" 
    554         return data_center(data) 
    555      
    556     def compute_cluster(self): 
    557         """Calculate cluster membership of all data instances.""" 
    558         return [minindex([self.distance(s, d) for s in self.centroids]) for d in self.data] 
    559      
    560     def runone(self): 
    561         """Runs a single clustering iteration, starting with re-computation of centroids, followed by computation of data membership (associating data instances to their nearest centroid)."""  
    562         self.centroids = [self.compute_centeroid(self.data.getitems( 
    563             [i for i, c in enumerate(self.clusters) if c == cl])) for cl in range(self.k)] 
    564         self.clusters = self.compute_cluster() 
    565          
    566     def run(self): 
    567         """ 
    568         Runs clustering until the convergence conditions are met. If nstart is greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score. 
    569         """ 
    570         self.winner = None 
    571         for startindx in range(self.nstart): 
    572             self.init_centroids() 
    573             self.clusters = old_cluster = self.compute_cluster() 
    574             if self.minscorechange != None: 
    575                 self.score = old_score = self.scoring(self) 
    576             self.nchanges = len(self.data) 
    577             self.iteration = 0 
    578             stopcondition = False 
    579             if self.inner_callback: 
    580                 self.inner_callback(self) 
    581             while not stopcondition: 
    582                 self.iteration += 1 
    583                 self.runone() 
    584                 self.nchanges = sum(map(lambda x,y: x!=y, old_cluster, self.clusters)) 
    585                 old_cluster = self.clusters 
    586                 if self.minscorechange != None: 
    587                     self.score = self.scoring(self) 
    588                     scorechange = (self.score - old_score) / old_score if old_score > 0 else self.minscorechange 
    589                     if self.minimize_score: 
    590                         scorechange = -scorechange 
    591                     old_score = self.score 
    592                 stopcondition = (self.nchanges <= self.stopchanges or 
    593                                  self.iteration == self.maxiters or 
    594                                  (self.minscorechange != None and 
    595                                   scorechange <= self.minscorechange)) 
    596                 if self.inner_callback: 
    597                     self.inner_callback(self) 
    598             if self.scoring and self.minscorechange == None: 
    599                 self.score = self.scoring(self) 
    600             if self.nstart > 1: 
    601                 if not self.winner or (self.score < self.winner[0] if 
    602                         self.minimize_score else self.score > self.winner[0]): 
    603                     self.winner = (self.score, self.clusters, self.centroids) 
    604                 if self.outer_callback: 
    605                     self.outer_callback(self) 
    606  
    607         if self.nstart > 1: 
    608             self.score, self.clusters, self.centroids = self.winner 
    609  
    610 # hierarchical clustering 
    611  
    612 def hierarchicalClustering(data, 
    613                            distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean, 
    614                            linkage=orange.HierarchicalClustering.Average, 
    615                            order=False, 
    616                            progressCallback=None): 
    617     """Return a hierarchical clustering of the data set.""" 
    618     distance = distanceConstructor(data) 
    619     matrix = orange.SymMatrix(len(data)) 
    620     for i in range(len(data)): 
    621         for j in range(i+1): 
    622             matrix[i, j] = distance(data[i], data[j]) 
    623     root = orange.HierarchicalClustering(matrix, linkage=linkage, progressCallback=(lambda value, obj=None: progressCallback(value*100.0/(2 if order else 1))) if progressCallback else None) 
    624     if order: 
    625         orderLeaves(root, matrix, progressCallback=(lambda value: progressCallback(50.0 + value/2)) if progressCallback else None) 
    626     return root 
    627  
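# Usage sketch for hierarchicalClustering:
#   data = orange.ExampleTable("iris")
#   root = hierarchicalClustering(data, order=True)
#   top = hierarchicalClustering_topClusters(root, 3)   # the 3 topmost clusters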
    628 def hierarchicalClustering_attributes(data, distance=None, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None): 
    629     """Return a hierarchical clustering of the attributes in the data set.""" 
    630     matrix = orange.SymMatrix(len(data.domain.attributes)) 
    631     for a1 in range(len(data.domain.attributes)): 
    632         for a2 in range(a1): 
    633             matrix[a1, a2] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0 
    634     root = orange.HierarchicalClustering(matrix, linkage=linkage, progressCallback=(lambda value, obj=None: progressCallback(value*100.0/(2 if order else 1))) if progressCallback else None) 
    635     if order: 
    636         orderLeaves(root, matrix, progressCallback=(lambda value: progressCallback(50.0 + value/2)) if progressCallback else None) 
    637     return root 
    638  
    639 def hierarchicalClustering_clusterList(node, prune=None): 
    640     """Return a list of clusters down from the node of hierarchical clustering.""" 
    641     if prune: 
    642         if len(node) <= prune: 
    643             return []  
    644     if node.branches: 
    645         return [node] + hierarchicalClustering_clusterList(node.left, prune) + hierarchicalClustering_clusterList(node.right, prune) 
    646     return [node] 
    647  
    648 def hierarchicalClustering_topClusters(root, k): 
    649     """Return k topmost clusters from hierarchical clustering.""" 
    650     candidates = set([root]) 
    651     while len(candidates) < k: 
    652         repl = max([(max(c.left.height, c.right.height), c) for c in candidates if c.branches])[1] 
    653         candidates.discard(repl) 
    654         candidates.add(repl.left) 
    655         candidates.add(repl.right) 
    656     return candidates 
    657  
    658 def hierarchicalClustering_topClustersMembership(root, k): 
    659     """Return data instances' cluster membership (list of indices) to k topmost clusters.""" 
    660     clist = hierarchicalClustering_topClusters(root, k) 
    661     cmap = [None] * len(root) 
    662     for i, c in enumerate(clist): 
    663         for e in c: 
    664             cmap[e] = i 
    665     return cmap 
    666  
    667 def orderLeaves(tree, matrix, progressCallback=None): 
    668     """Order the leaves in the clustering tree. 
    669  
    670     (based on Ziv Bar-Joseph et al., 'Fast optimal leaf ordering for hierarchical clustering') 
    671     Arguments: 
    672         tree   -- binary hierarchical clustering tree of type orange.HierarchicalCluster 
    673         matrix -- orange.SymMatrix that was used to compute the clustering 
    674         progressCallback -- function used to report progress 
    675     """ 
    676     objects = getattr(tree.mapping, "objects", None) 
    677     tree.mapping.setattr("objects", range(len(tree))) 
    678     M = {} 
    679     ordering = {} 
    680     visitedClusters = set() 
    681     def _optOrdering(tree): 
    682         if len(tree)==1: 
    683             for leaf in tree: 
    684                 M[tree, leaf, leaf] = 0 
    685 ##                print "adding:", tree, leaf, leaf 
    686         else: 
    687             _optOrdering(tree.left) 
    688             _optOrdering(tree.right) 
    689 ##            print "ordering", [i for i in tree] 
    690             Vl = set(tree.left) 
    691             Vr = set(tree.right) 
    692             Vlr = set(tree.left.right or tree.left) 
    693             Vll = set(tree.left.left or tree.left) 
    694             Vrr = set(tree.right.right or tree.right) 
    695             Vrl = set(tree.right.left or tree.right) 
    696             other = lambda e, V1, V2: V2 if e in V1 else V1 
    697             tree_left, tree_right = tree.left, tree.right 
    698             for u in Vl: 
    699                 for w in Vr: 
    700                     if True: #Improved search 
    701                         C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)]) 
    702                         orderedMs = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m]) 
    703                         orderedKs = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k]) 
    704                         k0 = orderedKs[0] 
    705                         curMin = 1e30000  
    706                         curMK = () 
    707                         for m in orderedMs: 
    708                             if M[tree_left, u, m] + M[tree_right, w, k0] + C >= curMin: 
    709                                 break 
    710                             for k in  orderedKs: 
    711                                 if M[tree_left, u, m] + M[tree_right, w, k] + C >= curMin: 
    712                                     break 
    713                                 if curMin > M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]: 
    714                                     curMin = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k] 
    715                                     curMK = (m, k) 
    716                         M[tree, u, w] = M[tree, w, u] = curMin 
    717                         ordering[tree, u, w] = (tree_left, u, curMK[0], tree_right, w, curMK[1]) 
    718                         ordering[tree, w, u] = (tree_right, w, curMK[1], tree_left, u, curMK[0]) 
    719                     else: 
    720                         def MFunc((m, k)): 
    721                             return M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k] 
    722                         m, k = min([(m, k) for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)], key=MFunc) 
    723                         M[tree, u, w] = M[tree, w, u] = MFunc((m, k)) 
    724                         ordering[tree, u, w] = (tree_left, u, m, tree_right, w, k) 
    725                         ordering[tree, w, u] = (tree_right, w, k, tree_left, u, m) 
    726  
    727             if progressCallback: 
    728                 progressCallback(100.0 * len(visitedClusters) / len(tree.mapping)) 
    729                 visitedClusters.add(tree) 
    730          
    731     _optOrdering(tree) 
    732  
    733     def _order(tree, u, w): 
    734         if len(tree)==1: 
    735             return 
    736         left, u, m, right, w, k = ordering[tree, u, w] 
    737         if len(left)>1 and m not in left.right: 
    738             left.swap() 
    739         _order(left, u, m) 
    740 ##        if u!=left[0] or m!=left[-1]: 
    741 ##            print "error 4:", u, m, list(left) 
    742         if len(right)>1 and k not in right.left: 
    743             right.swap() 
    744         _order(right, k, w) 
    745 ##        if k!=right[0] or w!=right[-1]: 
    746 ##            print "error 5:", k, w, list(right) 
    747      
    748     u, w = min([(u, w) for u in tree.left for w in tree.right], key=lambda (u, w): M[tree, u, w]) 
    749      
    750 ##    print "M(v) =", M[tree, u, w] 
    751      
    752     _order(tree, u, w) 
    753  
    754     def _check(tree, u, w): 
    755         if len(tree)==1: 
    756             return 
    757         left, u, m, right, w, k = ordering[tree, u, w] 
    758         if tree[0] == u and tree[-1] == w: 
    759             _check(left, u, m) 
    760             _check(right, k, w) 
    761         else: 
    762             print "Error:", u, w, tree[0], tree[-1] 
    763  
    764     _check(tree, u, w) 
    765      
    766  
    767     if objects: 
    768         tree.mapping.setattr("objects", objects) 
    769  
    770 try: 
    771     import numpy 
    772 except ImportError: 
    773     numpy = None 
    774  
    775 try: 
    776     import matplotlib 
    777     from matplotlib.figure import Figure 
    778     from matplotlib.table import Table, Cell 
    779     from matplotlib.text import Text 
    780     from matplotlib.artist import Artist 
    781 ##    import  matplotlib.pyplot as plt 
    782 except (ImportError, IOError), ex: 
    783     matplotlib = None 
    784     Text, Artist, Table, Cell = object, object, object, object 
    785  
    786 class TableCell(Cell): 
    787     PAD = 0.05 
    788     def __init__(self, *args, **kwargs): 
    789         Cell.__init__(self, *args, **kwargs) 
    790         self._text.set_clip_on(True) 
    791  
    792 class TablePlot(Table): 
    793     max_fontsize = 12 
    794     def __init__(self, xy, axes=None, bbox=None): 
    795         import matplotlib.pyplot as plt 
    796         Table.__init__(self, axes or plt.gca(), bbox=bbox) 
    797         self.xy = xy 
    798         self.set_transform(self._axes.transData) 
    799         self._fixed_widths = None 
    800         self.max_fontsize = plt.rcParams.get("font.size", 12) 
    801  
    802     def add_cell(self, row, col, *args, **kwargs): 
    803         xy = (0,0) 
    804  
    805         cell = TableCell(xy, *args, **kwargs) 
    806         cell.set_figure(self.figure) 
    807         cell.set_transform(self.get_transform()) 
    808  
    809         cell.set_clip_on(True) 
    810         cell.set_clip_box(self._axes.bbox) 
    811         cell._text.set_clip_box(self._axes.bbox) 
    812         self._cells[(row, col)] = cell 
    813  
    814     def draw(self, renderer): 
    815         if not self.get_visible(): return 
    816         self._update_positions(renderer) 
    817  
    818         keys = self._cells.keys() 
    819         keys.sort() 
    820         for key in keys: 
    821             self._cells[key].draw(renderer) 
    822  
    823     def _update_positions(self, renderer): 
    824         keys = numpy.array(self._cells.keys()) 
    825         cells = numpy.array([[self._cells.get((row, col), None) for col in range(max(keys[:, 1] + 1))] \ 
    826                              for row in range(max(keys[:, 0] + 1))]) 
    827          
    828         widths = self._get_column_widths(renderer) 
    829         x = self.xy[0] + numpy.array([numpy.sum(widths[:i]) for i in range(len(widths))]) 
    830         y = self.xy[1] - numpy.arange(cells.shape[0]) - 0.5 
    831          
    832         for i in range(cells.shape[0]): 
    833             for j in range(cells.shape[1]): 
    834                 cells[i, j].set_xy((x[j], y[i])) 
    835                 cells[i, j].set_width(widths[j]) 
    836                 cells[i, j].set_height(1.0) 
    837  
    838         self._width = numpy.sum(widths) 
    839         self._height = cells.shape[0] 
    840  
    841         self.pchanged() 
    842  
    843     def _get_column_widths(self, renderer): 
    844         keys = numpy.array(self._cells.keys()) 
    845         widths = numpy.zeros(len(keys)).reshape((numpy.max(keys[:,0]+1), numpy.max(keys[:,1]+1))) 
    846         fontSize = self._calc_fontsize(renderer) 
    847         for (row, col), cell in self._cells.items(): 
    848             cell.set_fontsize(fontSize) 
    849             l, b, w, h = cell._text.get_window_extent(renderer).bounds 
    850             transform = self._axes.transData.inverted() 
    851             x1, _ = transform.transform_point((0, 0)) 
    852             x2, _ = transform.transform_point((w + w*TableCell.PAD + 10, 0)) 
    853             w = abs(x1 - x2) 
    854             widths[row, col] = w 
    855         return numpy.max(widths, 0) 
    856  
    857     def _calc_fontsize(self, renderer): 
    858         transform = self._axes.transData 
    859         _, y1 = transform.transform_point((0, 0)) 
    860         _, y2 = transform.transform_point((0, 1)) 
    861         return min(max(int(abs(y1 - y2)*0.85) ,4), self.max_fontsize) 
    862  
    863     def get_children(self): 
    864         return self._cells.values() 
    865  
    866     def get_bbox(self): 
    867         return matplotlib.transforms.Bbox.from_extents(self.xy[0], self.xy[1], self.xy[0] + 10, self.xy[1] + 180) 
    868  
    869 class DendrogramPlotPylab(object): 
    870     def __init__(self, root, data=None, labels=None, dendrogram_width=None, heatmap_width=None, label_width=None, space_width=None, border_width=0.05, plot_attr_names=False, cmap=None, params={}): 
    871         if not matplotlib: 
    872             raise ImportError("Could not import matplotlib module. Please make sure matplotlib is installed on your system.") 
    873         import matplotlib.pyplot as plt 
    874         self.plt = plt 
    875         self.root = root 
    876         self.data = data 
    877         self.labels = labels if labels else [str(i) for i in range(len(root))] 
    878         self.dendrogram_width = dendrogram_width 
    879         self.heatmap_width = heatmap_width 
    880         self.label_width = label_width 
    881         self.space_width = space_width 
    882         self.border_width = border_width 
    883         self.params = params 
    884         self.plot_attr_names = plot_attr_names 
    885  
    886     def plotDendrogram(self): 
    887         self.text_items = [] 
    888         def draw_tree(tree): 
    889             if tree.branches: 
    890                 points = [] 
    891                 for branch in tree.branches: 
    892                     center = draw_tree(branch) 
    893                     self.plt.plot([center[0], tree.height], [center[1], center[1]], color="black") 
    894                     points.append(center) 
    895                 self.plt.plot([tree.height, tree.height], [points[0][1], points[-1][1]], color="black") 
    896                 return (tree.height, (points[0][1] + points[-1][1])/2.0) 
    897             else: 
    898                 return (0.0, tree.first) 
    899         draw_tree(self.root) 
    900          
    901     def plotHeatMap(self): 
    902         import numpy.ma as ma 
    903         import numpy 
    904         dx, dy = self.root.height, 0 
    905         fx, fy = self.root.height/len(self.data.domain.attributes), 1.0 
    906         data, c, w = self.data.toNumpyMA() 
    907         data = (data - ma.min(data))/(ma.max(data) - ma.min(data)) 
    908         x = numpy.arange(data.shape[1] + 1)/float(numpy.max(data.shape)) 
    909         y = numpy.arange(data.shape[0] + 1)/float(numpy.max(data.shape))*len(self.root) 
    910         self.heatmap_width = numpy.max(x) 
    911  
    912         X, Y = numpy.meshgrid(x, y - 0.5) 
    913  
    914         self.meshXOffset = numpy.max(X) 
    915  
    916         self.plt.jet() 
    917         mesh = self.plt.pcolormesh(X, Y, data[self.root.mapping], edgecolor="b", linewidth=2) 
    918  
    919         if self.plot_attr_names: 
    920             names = [attr.name for attr in self.data.domain.attributes] 
    921             self.plt.xticks(numpy.arange(data.shape[1] + 1)/float(numpy.max(data.shape)), names) 
    922         self.plt.gca().xaxis.tick_top() 
    923         for label in self.plt.gca().xaxis.get_ticklabels(): 
    924             label.set_rotation(45) 
    925  
    926         for tick in self.plt.gca().xaxis.get_major_ticks(): 
    927             tick.tick1On = False 
    928             tick.tick2On = False 
    929  
    930     def plotLabels_(self): 
    931         import numpy 
    932 ##        self.plt.yticks(numpy.arange(len(self.labels) - 1, 0, -1), self.labels) 
    933 ##        for tick in self.plt.gca().yaxis.get_major_ticks(): 
    934 ##            tick.tick1On = False 
    935 ##            tick.label1On = False 
    936 ##            tick.label2On = True 
    937 ##        text = TableTextLayout(xy=(self.meshXOffset+1, len(self.root)), tableText=[[label] for label in self.labels]) 
    938         text = TableTextLayout(xy=(self.meshXOffset*1.005, len(self.root) - 1), tableText=[[label] for label in self.labels]) 
    939         text.set_figure(self.plt.gcf()) 
    940         self.plt.gca().add_artist(text) 
    941         self.plt.gca()._set_artist_props(text) 
    942  
    943     def plotLabels(self): 
    944 ##        table = TablePlot(xy=(self.meshXOffset*1.005, len(self.root) -1), axes=self.plt.gca()) 
    945         table = TablePlot(xy=(0, len(self.root) -1), axes=self.plt.gca()) 
    946         table.set_figure(self.plt.gcf()) 
    947         for i,label in enumerate(self.labels): 
    948             table.add_cell(i, 0, width=1, height=1, text=label, loc="left", edgecolor="w") 
    949         table.set_zorder(0) 
    950         self.plt.gca().add_artist(table) 
    951         self.plt.gca()._set_artist_props(table) 
    952      
    953     def plot(self, filename=None, show=False): 
    954         self.plt.rcParams.update(self.params) 
    955         labelLen = max(len(label) for label in self.labels) 
    956         w, h = 800, 600 
    957         space = 0.01 if self.space_width == None else self.space_width 
    958         border = self.border_width 
    959         width = 1.0 - 2*border 
    960         height = 1.0 - 2*border 
    961         textLineHeight = min(max(h/len(self.labels), 4), self.plt.rcParams.get("font.size", 12)) 
    962         maxTextLineWidthEstimate = textLineHeight*labelLen 
    963 ##        print maxTextLineWidthEstimate 
    964         textAxisWidthRatio = 2.0*maxTextLineWidthEstimate/w 
    965 ##        print textAxisWidthRatio 
    966         labelsAreaRatio = min(textAxisWidthRatio, 0.4) if self.label_width == None else self.label_width 
    967         x, y = len(self.data.domain.attributes), len(self.data) 
    968  
    969         heatmapAreaRatio = min(1.0*y/h*x/w, 0.3) if self.heatmap_width == None else self.heatmap_width 
    970         dendrogramAreaRatio = 1.0 - labelsAreaRatio - heatmapAreaRatio - 2*space if self.dendrogram_width == None else self.dendrogram_width 
    971  
    972         self.fig = self.plt.figure() 
    973         self.labels_offset = self.root.height/20.0 
    974         dendrogramAxes = self.plt.axes([border, border, width*dendrogramAreaRatio, height]) 
    975         dendrogramAxes.xaxis.grid(True) 
    976         import matplotlib.ticker as ticker 
    977  
    978         dendrogramAxes.yaxis.set_major_locator(ticker.NullLocator()) 
    979         dendrogramAxes.yaxis.set_minor_locator(ticker.NullLocator()) 
    980         dendrogramAxes.invert_xaxis() 
    981         self.plotDendrogram() 
    982         heatmapAxes = self.plt.axes([border + width*dendrogramAreaRatio + space, border, width*heatmapAreaRatio, height], sharey=dendrogramAxes) 
    983  
    984         heatmapAxes.xaxis.set_major_locator(ticker.NullLocator()) 
    985         heatmapAxes.xaxis.set_minor_locator(ticker.NullLocator()) 
    986         heatmapAxes.yaxis.set_major_locator(ticker.NullLocator()) 
    987         heatmapAxes.yaxis.set_minor_locator(ticker.NullLocator()) 
    988          
    989         self.plotHeatMap() 
    990         labelsAxes = self.plt.axes([border + width*(dendrogramAreaRatio + heatmapAreaRatio + 2*space), border, width*labelsAreaRatio, height], sharey=dendrogramAxes) 
    991         self.plotLabels() 
    992         labelsAxes.set_axis_off() 
    993         labelsAxes.xaxis.set_major_locator(ticker.NullLocator()) 
    994         labelsAxes.xaxis.set_minor_locator(ticker.NullLocator()) 
    995         labelsAxes.yaxis.set_major_locator(ticker.NullLocator()) 
    996         labelsAxes.yaxis.set_minor_locator(ticker.NullLocator()) 
    997         if filename: 
    998             canvas = matplotlib.backends.backend_agg.FigureCanvasAgg(self.fig) 
    999             canvas.print_figure(filename) 
    1000         if show: 
    1001             self.plt.show() 
    1002          
    1003          
    1004 from orngMisc import ColorPalette, EPSRenderer 
    1005 class DendrogramPlot(object): 
    1006     """ A class for drawing dendrograms 
    1007     Example: 
    1008     >>> a = DendrogramPlot(tree) 
    1009     """ 
    1010     def __init__(self, tree, attr_tree = None, labels=None, data=None, width=None, height=None, tree_height=None, heatmap_width=None, text_width=None,  
    1011                  spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, gamma=None, renderer=EPSRenderer): 
    1012         self.tree = tree 
    1013         self.attr_tree = attr_tree 
    1014         self.labels = [str(ex.getclass()) for ex in data] if not labels and data and data.domain.classVar else (labels or []) 
    1015 #        self.attr_labels = [str(attr.name) for attr in data.domain.attributes] if not attr_labels and data else attr_labels or [] 
    1016         self.data = data 
    1017         self.width, self.height = float(width) if width else None, float(height) if height else None 
    1018         self.tree_height = tree_height 
    1019         self.heatmap_width = heatmap_width 
    1020         self.text_width = text_width 
    1021         self.font_size = 10.0 
    1022         self.linespacing = 0.0 
    1023         self.cluster_colors = cluster_colors 
    1024         self.horizontal_margin = 10.0 
    1025         self.vertical_margin = 10.0 
    1026         self.spacing = float(spacing) if spacing else None 
    1027         self.color_palette = color_palette 
    1028         self.minv = minv 
    1029         self.maxv = maxv 
    1030         self.gamma = gamma 
    1031         self.set_matrix_color_schema(color_palette, minv, maxv, gamma) 
    1032         self.renderer = renderer 
    1033          
    1034     def set_matrix_color_schema(self, color_palette, minv, maxv, gamma=None): 
    1035         """ Set the matrix color scheme. 
    1036         """ 
    1037         if isinstance(color_palette, ColorPalette): 
    1038             self.color_palette = color_palette 
    1039         else: 
    1040             self.color_palette = ColorPalette(color_palette) 
    1041         self.minv = minv 
    1042         self.maxv = maxv 
    1043         self.gamma = gamma 
    1044          
    1045     def color_shema(self): 
    1046         vals = [float(val) for ex in self.data for val in ex if not val.isSpecial() and val.variable.varType==orange.VarTypes.Continuous] or [0] 
    1047         avg = sum(vals)/len(vals) 
    1048          
    1049         maxVal = self.maxv if self.maxv else max(vals) 
    1050         minVal = self.minv if self.minv else min(vals) 
    1051          
    1052         def _colorSchema(val): 
    1053             if val.isSpecial(): 
    1054                 return self.color_palette(None) 
    1055             elif val.variable.varType==orange.VarTypes.Continuous: 
    1056                 r, g, b = self.color_palette((float(val) - minVal) / abs(maxVal - minVal), gamma=self.gamma) 
    1057             elif val.variable.varType==orange.VarTypes.Discrete: 
    1058                 r = g = b = int(255.0*float(val)/len(val.variable.values)) 
    1059             return (r, g, b) 
    1060         return _colorSchema 
    1061      
    1062     def layout(self): 
    1063         height_final = False 
    1064         width_final = False 
    1065         tree_height = self.tree_height or 100 
    1066         if self.height: 
    1067             height, height_final = self.height, True 
    1068             heatmap_height = height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin 
    1069             font_size =  heatmap_height / len(self.labels) #self.font_size or (height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin) / len(self.labels) 
    1070         else: 
    1071             font_size = self.font_size 
    1072             heatmap_height = font_size * len(self.labels) 
    1073             height = heatmap_height + (tree_height + self.spacing if self.attr_tree else 0) + 2 * self.horizontal_margin 
    1074               
    1075         text_width = self.text_width or max([len(label) for label in self.labels] + [0]) * font_size #max([self.renderer.string_size_hint(label) for label in self.labels]) 
    1076          
    1077         if self.width: 
    1078             width = self.width 
    1079             heatmap_width = width - 2 * self.vertical_margin - tree_height - (2 if self.data else 1) * self.spacing - text_width if self.data else 0 
    1080         else: 
    1081             heatmap_width = len(self.data.domain.attributes) * heatmap_height / len(self.data) if self.data else 0 
    1082             width = 2 * self.vertical_margin + tree_height + (heatmap_width + self.spacing if self.data else 0) + self.spacing + text_width 
    1083              
    1084         return width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size 
    1085      
    1086     def plot(self, filename="graph.eps"): 
    1087         width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size = self.layout() 
    1088         heatmap_cell_height = heatmap_height / len(self.labels) 
    1089         heatmap_cell_width = heatmap_width / len(self.data.domain.attributes) 
    1090          
    1091         self.renderer = self.renderer(width, height) 
    1092          
    1093         def draw_tree(cluster, root, treeheight, treewidth, color): 
    1094             height = treeheight * cluster.height / root.height 
    1095             if cluster.branches: 
    1096                 centers = [] 
    1097                 for branch in cluster.branches: 
    1098                     center = draw_tree(branch, root, treeheight, treewidth, self.cluster_colors.get(branch, color)) 
    1099                     centers.append(center) 
    1100                     self.renderer.draw_line(center[0], center[1], center[0], height, stroke_color = self.cluster_colors.get(branch, color)) 
    1101                      
    1102                 self.renderer.draw_line(centers[0][0], height, centers[-1][0], height, stroke_color = self.cluster_colors.get(cluster, color)) 
    1103                 return (centers[0][0] + centers[-1][0]) / 2.0, height 
    1104             else: 
    1105                 return float(treewidth) * cluster.first / len(root), 0.0 
    1106         self.renderer.save_render_state() 
    1107         self.renderer.translate(self.vertical_margin + tree_height, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0) + heatmap_cell_height / 2.0) 
    1108         self.renderer.rotate(90) 
    1109 #        print self.renderer.transform() 
    1110         draw_tree(self.tree, self.tree, tree_height, heatmap_height, self.cluster_colors.get(self.tree, (0,0,0))) 
    1111         self.renderer.restore_render_state() 
    1112         if self.attr_tree: 
    1113             self.renderer.save_render_state() 
    1114             self.renderer.translate(self.vertical_margin + tree_height + self.spacing + heatmap_cell_width / 2.0, self.horizontal_margin + tree_height) 
    1115             self.renderer.scale(1.0, -1.0) 
    1116 #            print self.renderer.transform() 
    1117             draw_tree(self.attr_tree, self.attr_tree, tree_height, heatmap_width, self.cluster_colors.get(self.attr_tree, (0,0,0))) 
    1118             self.renderer.restore_render_state() 
    1119          
    1120         self.renderer.save_render_state() 
    1121         self.renderer.translate(self.vertical_margin + tree_height + self.spacing, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0)) 
    1122 #        print self.renderer.transform() 
    1123         if self.data: 
    1124             colorSchema = self.color_shema() 
    1125             for i, ii in enumerate(self.tree): 
    1126                 ex = self.data[ii] 
    1127                 for j, jj in enumerate((self.attr_tree if self.attr_tree else range(len(self.data.domain.attributes)))): 
    1128                     r, g, b = colorSchema(ex[jj]) 
    1129                     self.renderer.draw_rect(j * heatmap_cell_width, i * heatmap_cell_height, heatmap_cell_width, heatmap_cell_height, fill_color=(r, g, b), stroke_color=(255, 255, 255)) 
    1130          
    1131         self.renderer.translate(heatmap_width + self.spacing, heatmap_cell_height) 
    1132 #        print self.renderer.transform() 
    1133         self.renderer.set_font("Times-Roman", font_size) 
    1134         for index in self.tree: #label in self.labels: 
    1135             self.renderer.draw_text(0.0, 0.0, self.labels[index]) 
    1136             self.renderer.translate(0.0, heatmap_cell_height) 
    1137         self.renderer.restore_render_state() 
    1138         self.renderer.save(filename) 
    1139          
    1140 def dendrogram_draw(filename, *args, **kwargs): 
    1141     import os 
    1142     from orngMisc import PILRenderer, EPSRenderer, SVGRenderer 
    1143     name, ext = os.path.splitext(filename) 
    1144     kwargs["renderer"] = {".eps":EPSRenderer, ".svg":SVGRenderer, ".png":PILRenderer}.get(ext.lower(), PILRenderer) 
    1145 #    print kwargs["renderer"], ext 
    1146     d = DendrogramPlot(*args, **kwargs) 
    1147     d.plot(filename) 
    1148      
    1149 if __name__=="__main__": 
    1150     data = orange.ExampleTable("doc//datasets//brown-selected.tab") 
    1151 #    data = orange.ExampleTable("doc//datasets//iris.tab") 
    1152     root = hierarchicalClustering(data, order=True) #, linkage=orange.HierarchicalClustering.Single) 
    1153     attr_root = hierarchicalClustering_attributes(data, order=True) 
    1154 #    print root 
    1155 #    d = DendrogramPlotPylab(root, data=data, labels=[str(ex.getclass()) for ex in data], dendrogram_width=0.4, heatmap_width=0.3,  params={}, cmap=None) 
    1156 #    d.plot(show=True, filename="graph.png") 
    1157  
    1158     dendrogram_draw("graph.eps", root, attr_tree=attr_root, data=data, labels=[str(e.getclass()) for e in data], tree_height=50, #width=500, height=500, 
    1159                           cluster_colors={root.right:(255,0,0), root.right.right:(0,255,0)},  
    1160                           color_palette=ColorPalette([(255, 0, 0), (0,0,0), (0, 255,0)], gamma=0.5,  
    1161                                                      overflow=(255, 255, 255), underflow=(255, 255, 255))) #, minv=-0.5, maxv=0.5) 
    1162  
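As the ``__main__`` block above illustrates, ``dendrogram_draw`` picks the renderer from the output file's extension (``.eps``, ``.svg`` or ``.png``, falling back to the PIL renderer). A minimal usage sketch, with a hypothetical output file name::

    import orange
    data = orange.ExampleTable("iris")
    root = hierarchicalClustering(data, order=True)
    # ".png" selects PILRenderer; ".eps" and ".svg" select EPSRenderer
    # and SVGRenderer respectively
    dendrogram_draw("dendrogram.png", root, data=data,
                    labels=[str(ex.getclass()) for ex in data])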
  • orange/Orange/clustering/kmeans.py

    r7462 r7503  
    66.. index:: 
    77   single: clustering, kmeans 
    8 .. index:: aglomerative clustering 
    9  
    10  
    11 .. autoclass:: Orange.cluster.KMeans 
     8.. index:: agglomerative clustering 
     9 
     10 
     11.. autoclass:: Orange.clustering.kmeans.Clustering 
    1212   :members: 
    1313 
     
    6363========================= 
    6464 
    65 .. automethod:: Orange.cluster.kmeans_init_random 
    66  
    67 .. automethod:: Orange.cluster.kmeans_init_diversity 
    68  
    69 .. automethod:: Orange.cluster.KMeans_init_hierarchicalClustering 
    70  
    71 .. automethod:: Orange.cluster.data_center 
    72  
    73 .. automethod:: Orange.cluster.data_center 
    74  
    75 .. automethod:: Orange.cluster.plot_silhouette 
    76  
    77 .. automethod:: Orange.cluster.score_distance_to_centroids 
    78  
    79 .. automethod:: Orange.cluster.score_silhouette 
    80  
    81 .. automethod:: Orange.cluster.score_fastsilhouette 
    82  
    83 Typically, the choice of seeds has a large impact on the k-means clustering, with better initialization methods yielding a clustering that converges faster and finds more optimal centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge (`kmeans-cmp-init.py`_, uses `iris.tab`_, `housing.tab`_, `vehicle.tab`_): 
     65.. automethod:: Orange.clustering.kmeans.init_random 
     66 
     67.. automethod:: Orange.clustering.kmeans.init_diversity 
     68 
     69.. automethod:: Orange.clustering.kmeans.init_hclustering 
     70 
     71.. automethod:: Orange.clustering.kmeans.data_center 
     72 
     73.. automethod:: Orange.clustering.kmeans.data_center 
     74 
     75.. automethod:: Orange.clustering.kmeans.plot_silhouette 
     76 
     77.. automethod:: Orange.clustering.kmeans.score_distance_to_centroids 
     78 
     79.. automethod:: Orange.clustering.kmeans.score_silhouette 
     80 
     81.. automethod:: Orange.clustering.kmeans.score_fast_silhouette 
     82 
     83Typically, the choice of seeds has a large impact on the k-means clustering,  
     84with better initialization methods yielding a clustering that converges faster 
     85and finds centroids closer to the optimum. The following code compares three different 
     86initialization methods (random, diversity-based and hierarchical clustering-based)  
     87in terms of how fast they converge (`kmeans-cmp-init.py`_, uses `iris.tab`_,  
     88`housing.tab`_, `vehicle.tab`_): 
    8489 
    8590.. literalinclude:: code/kmeans-cmp-init.py 
    8691 
    87 As expected, k-means converges faster with diversity and clustering-based initialization that with random seed selection:: 
     92As expected, k-means converges faster with diversity-based and clustering-based 
     93initialization than with random seed selection:: 
    8894 
    8995               Rnd Div  HC 
     
    9298       vehicle  11   4   3 
    9399 
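Such a comparison can be scripted in a few lines. A minimal sketch, assuming the fitted object reports the number of iterations in an ``iteration`` attribute (an assumption here; `kmeans-cmp-init.py`_ is the authoritative version)::

    import orange
    import Orange.clustering.kmeans as kmeans

    for name in ["iris", "housing", "vehicle"]:
        data = orange.ExampleTable(name)
        for init in [kmeans.init_random, kmeans.init_diversity,
                     kmeans.init_hclustering(n=100)]:
            km = kmeans.Clustering(data, centroids=3, initialization=init)
            # the iteration count is assumed to be kept on the fitted object
            print name, km.iteration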
    94 The following code computes the silhouette score for k=2..7 and plots a silhuette plot for k=3 (`kmeans-silhouette.py`_, uses `iris.tab`_): 
     100The following code computes the silhouette score for k=2..7 and plots a  
     101silhouette plot for k=3 (`kmeans-silhouette.py`_, uses `iris.tab`_): 
    95102 
    96103.. literalinclude:: code/kmeans-silhouette.py 
    97104 
    98 The analysis suggests that k=2 is preferred as it yields the maximal silhouette coefficient:: 
     105The analysis suggests that k=2 is preferred as it yields 
     106the maximal silhouette coefficient:: 
    99107 
    100108    2 0.629467553352 
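The k scan itself takes only a few lines. A minimal sketch, assuming the fitted object stores the value returned by the scoring function in a ``score`` attribute (an assumption here; see `kmeans-silhouette.py`_ for the complete script, including the plot)::

    import orange
    import Orange.clustering.kmeans as kmeans

    data = orange.ExampleTable("iris")
    for k in range(2, 8):
        km = kmeans.Clustering(data, centroids=k,
                               scoring=kmeans.score_silhouette)
        # the clustering score is assumed to be kept in km.score
        print k, km.score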
     
    124132import random 
    125133import statc 
     134 
     135import Orange.clustering.hierarchical 
    126136 
    127137# miscellaneous functions  
     
    163173# 
    164174 
    165 class ExamplesDistanceConstructor_PearsonR(orange.ExamplesDistanceConstructor): 
    166     def __new__(cls, data=None, **argkw): 
    167         self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw) 
    168         self.__dict__.update(argkw) 
    169         if data: 
    170             return self.__call__(data) 
    171         else: 
    172             return self 
    173  
    174     def __call__(self, data): 
    175         indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType==orange.VarTypes.Continuous] 
    176         return ExamplesDistance_PearsonR(domain=data.domain, indxs=indxs) 
    177  
    178 class ExamplesDistance_PearsonR(orange.ExamplesDistance): 
    179     """ Pearson distance. (1 - R)/2, where R is the Pearson correlation between examples. In [0..1]. """ 
    180  
    181     def __init__(self, **argkw): 
    182         self.__dict__.update(argkw) 
    183     def __call__(self, e1, e2): 
    184         X1 = []; X2 = [] 
    185         for i in self.indxs: 
    186             if not(e1[i].isSpecial() or e2[i].isSpecial()): 
    187                 X1.append(float(e1[i])) 
    188                 X2.append(float(e2[i])) 
    189         if not X1: 
    190             return 1.0 
    191         try: 
    192             return (1.0 - statc.pearsonr(X1, X2)[0]) / 2. 
    193         except: 
    194             return 1.0 
    195  
    196 class ExamplesDistanceConstructor_SpearmanR(orange.ExamplesDistanceConstructor): 
    197     def __new__(cls, data=None, **argkw): 
    198         self = orange.ExamplesDistanceConstructor.__new__(cls, **argkw) 
    199         self.__dict__.update(argkw) 
    200         if data: 
    201             return self.__call__(data) 
    202         else: 
    203             return self 
    204  
    205     def __call__(self, data): 
    206         indxs = [i for i, a in enumerate(data.domain.attributes) if a.varType==orange.VarTypes.Continuous] 
    207         return ExamplesDistance_SpearmanR(domain=data.domain, indxs=indxs) 
    208  
    209 class ExamplesDistance_SpearmanR(orange.ExamplesDistance): 
    210     def __init__(self, **argkw): 
    211         self.__dict__.update(argkw) 
    212     def __call__(self, e1, e2): 
    213         X1 = []; X2 = [] 
    214         for i in self.indxs: 
    215             if not(e1[i].isSpecial() or e2[i].isSpecial()): 
    216                 X1.append(float(e1[i])) 
    217                 X2.append(float(e2[i])) 
    218         if not X1: 
    219             return 1.0 
    220         try: 
    221             return (1.0 - statc.spearmanr(X1, X2)[0]) / 2. 
    222         except: 
    223             return 1.0 
    224  
    225175# k-means clustering 
    226176 
     
    271221    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0 
    272222 
    273 def score_fastsilhouette(km, index=None): 
     223def score_fast_silhouette(km, index=None): 
    274224    """Same as score_silhouette, but computes an approximation and is faster. 
    275225     
     
    348298# initialization functions should be of the type f(data, k, distfun) 
    349299 
    350 def kmeans_init_random(data, k, _): 
     300def init_random(data, k, _): 
    351301    """A function that can be used for initialization of k-means clustering; returns k randomly chosen data instances from the data. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004). 
    352302     
     
    360310    return data.getitems(random.sample(range(len(data)), k)) 
    361311 
    362 def kmeans_init_diversity(data, k, distfun): 
     312def init_diversity(data, k, distfun): 
    363313    """A function that can be used for initialization of k-means clustering. Returns a set of centroids where the first is the data point farthest from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previously selected centroids is maximal. Differs from the initialization proposed by Katsavounidis et al. (1994) only in the selection of the first centroid (they use the data instance with the highest norm). 
    364314 
     
    378328    return seeds 
    379329 
    380 class KMeans_init_hierarchicalClustering(): 
    381     """A class that returns an clustering initialization function that performs hierarhical clustering, uses it to infer k clusters, and computes a list of cluster-based data centers.""" 
     330class init_hclustering(): 
     331    """ 
     332    A class that returns a clustering initialization function which performs 
     333    hierarchical clustering, uses it to infer k clusters, and computes a 
     334    list of cluster-based data centers. 
     335    """ 
    382336 
    383337    def __init__(self, n=100): 
     
    398352        """ 
    399353        sample = orange.ExampleTable(random.sample(data, min(self.n, len(data)))) 
    400         root = hierarchicalClustering(sample) 
    401         cmap = hierarchicalClustering_topClusters(root, k) 
     354        root = Orange.clustering.hierarchical.clustering(sample) 
     355        cmap = Orange.clustering.hierarchical.top_clusters(root, k) 
    402356        return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap] 
    403357 
     
    406360# 
    407361 
    408 class KMeans: 
    409     """ 
    410  
    411     Implements a k-means clustering algorithm: 
     362class Clustering: 
     363    """Implements a k-means clustering algorithm: 
    412364 
    413365    #. Choose the number of clusters, k. 
     
    451403 
    452404    def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None, 
    453                  stopchanges=0, nstart=1, initialization=kmeans_init_random, 
     405                 stopchanges=0, nstart=1, initialization=init_random, 
    454406                 distance=orange.ExamplesDistanceConstructor_Euclidean, 
    455407                 scoring=score_distance_to_centroids, inner_callback = None, 
     
    464416        :param distance: an example distance constructor, which measures the distance between two instances. 
    465417        :type distance: :class:`orange.ExamplesDistanceConstructor` 
    466         :param initialization: a function to select centroids given data instances, k and a example distance function. This module implements different approaches (:func:`kmeans_init_random`, :func:`kmeans_init_diversity`, :class:`KMeans_init_hierarchicalClustering`).  
     418        :param initialization: a function to select centroids given data instances, k, and an example distance function. This module implements different approaches (:func:`init_random`, :func:`init_diversity`, :class:`init_hclustering`).  
    467419        :param scoring: a function that takes a clustering object and returns the clustering score. It could be used, for instance, in a procedure that repeats the clustering nstart times, returning the clustering with the lowest score. 
    468420        :param inner_callback: invoked after every clustering iteration. 