source: orange/Orange/clustering/kmeans.py @ 9994:1073e0304a87

1"""
2*******************************
3K-means clustering (``kmeans``)
4*******************************
5
6.. index::
7   single: clustering, kmeans
8.. index:: agglomerative clustering
9
10
11.. autoclass:: Orange.clustering.kmeans.Clustering
12   :members:
13
14Examples
15========
16
17The following code runs k-means clustering and prints out the cluster indexes
18for the last 10 data instances (:download:`kmeans-run.py <code/kmeans-run.py>`):
19
20.. literalinclude:: code/kmeans-run.py
21
22The output of this code is::
23
24    [1, 1, 2, 1, 1, 1, 2, 1, 1, 2]
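
For orientation, a minimal script along these lines might look as follows (a
sketch only, assuming the Iris data set is available as iris.tab; the actual
kmeans-run.py may differ)::

    import Orange
    from Orange.clustering import kmeans

    data = Orange.data.Table("iris")
    km = kmeans.Clustering(data, centroids=3)
    print km.clusters[-10:]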

Invoking a callback function may be useful for tracing the progress of the clustering.
Below is code that uses an :obj:`inner_callback` to report the number of instances
that have changed cluster and the clustering score. For the score
to be computed at each iteration we have to set :obj:`minscorechange`, but we can
leave it at 0 or even set it to a negative value, which allows the score to deteriorate
by some amount (:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>`):

.. literalinclude:: code/kmeans-run-callback.py

The convergence on the Iris data set is fast::

    Iteration: 1, changes: 150, score: 10.9555
    Iteration: 2, changes: 12, score: 10.3867
    Iteration: 3, changes: 2, score: 10.2034
    Iteration: 4, changes: 2, score: 10.0699
    Iteration: 5, changes: 2, score: 9.9542
    Iteration: 6, changes: 1, score: 9.9168
    Iteration: 7, changes: 2, score: 9.8624
    Iteration: 8, changes: 0, score: 9.8624
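
An :obj:`inner_callback` is simply a function that receives the clustering
object after each iteration. A callback consistent with the output above could
look roughly like this (a sketch; the actual kmeans-run-callback.py may differ)::

    import Orange
    from Orange.clustering import kmeans

    def callback(km):
        print "Iteration: %d, changes: %d, score: %.4f" % (km.iteration, km.nchanges, km.score)

    data = Orange.data.Table("iris")
    km = kmeans.Clustering(data, 3, minscorechange=0, inner_callback=callback)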

The callback above is used to report progress, but it might just as well call
a function that plots a selected data projection with the corresponding
centroids at a given step of the clustering. This is exactly what we did with
the following script (:download:`kmeans-trace.py <code/kmeans-trace.py>`):

.. literalinclude:: code/kmeans-trace.py

Only the first four scatterplots are shown below. Colors of the data instances
indicate cluster membership. Notice that since the Iris data set includes four
attributes, the closest centroid in a particular 2-dimensional projection is
not necessarily also the centroid of the cluster that the data point belongs
to.


.. image:: files/kmeans-scatter-001.png

.. image:: files/kmeans-scatter-002.png

.. image:: files/kmeans-scatter-003.png

.. image:: files/kmeans-scatter-004.png
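
A plotting callback could, for instance, use matplotlib to save a scatterplot
of two of the attributes at every iteration (a sketch only, not the actual
kmeans-trace.py; the plotted attribute pair and the colors are arbitrary)::

    import matplotlib.pyplot as plt
    import Orange
    from Orange.clustering import kmeans

    def plot_scatter(km):
        plt.figure()
        colors = "bgrcmyk"
        for i, d in enumerate(km.data):
            plt.plot(float(d[0]), float(d[1]), "o",
                     color=colors[km.clusters[i] % len(colors)])
        for c in km.centroids:
            plt.plot(float(c[0]), float(c[1]), "k+", markersize=12)
        plt.savefig("kmeans-scatter-%03d.png" % km.iteration)

    data = Orange.data.Table("iris")
    km = kmeans.Clustering(data, 3, inner_callback=plot_scatter)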


k-Means Utility Functions
=========================

.. automethod:: Orange.clustering.kmeans.init_random

.. automethod:: Orange.clustering.kmeans.init_diversity

.. autoclass:: Orange.clustering.kmeans.init_hclustering
   :members:

.. automethod:: Orange.clustering.kmeans.plot_silhouette

.. automethod:: Orange.clustering.kmeans.score_distance_to_centroids

.. automethod:: Orange.clustering.kmeans.score_silhouette

.. automethod:: Orange.clustering.kmeans.score_fast_silhouette

Typically, the choice of seeds has a large impact on k-means clustering,
with better initialization methods yielding clusterings that converge faster
and find better centroids. The following code compares three different
initialization methods (random, diversity-based and hierarchical clustering-based)
in terms of how fast they converge (:download:`kmeans-cmp-init.py <code/kmeans-cmp-init.py>`):

.. literalinclude:: code/kmeans-cmp-init.py

As expected, k-means converges faster with diversity-based and clustering-based
initialization than with random seed selection::

               Rnd Div  HC
          iris  12   3   4
       housing  14   6   4
       vehicle  11   4   3
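
A comparison of this kind can also be scripted directly; a sketch (assuming
the listed data sets are available as .tab files; the actual
kmeans-cmp-init.py may differ)::

    import Orange
    from Orange.clustering import kmeans

    for name in ["iris", "housing", "vehicle"]:
        data = Orange.data.Table(name)
        print name,
        for init in [kmeans.init_random, kmeans.init_diversity,
                     kmeans.init_hclustering(n=100)]:
            km = kmeans.Clustering(data, centroids=3, initialization=init)
            print km.iteration,
        print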

The following code computes the silhouette score for k=2..7 and plots a
silhouette plot for k=3 (:download:`kmeans-silhouette.py <code/kmeans-silhouette.py>`):

.. literalinclude:: code/kmeans-silhouette.py

The analysis suggests that k=2 is preferred as it yields
the maximal silhouette coefficient::

    2 0.629467553352
    3 0.504318855054
    4 0.407259377854
    5 0.358628975081
    6 0.353228492088
    7 0.366357876944
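
The selection loop itself is short; a sketch (assuming iris.tab is available;
the actual kmeans-silhouette.py may differ)::

    import Orange
    from Orange.clustering import kmeans

    data = Orange.data.Table("iris")
    for k in range(2, 8):
        km = kmeans.Clustering(data, centroids=k)
        print k, kmeans.score_silhouette(km)

    km = kmeans.Clustering(data, centroids=3)
    kmeans.plot_silhouette(km, "kmeans-silhouette.png")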

.. figure:: files/kmeans-silhouette.png

   Silhouette plot for k=3.

"""

import math
import sys
import random
from itertools import izip

import orange
from Orange import statc

import Orange.clustering.hierarchical
import Orange.distance

# miscellaneous functions

def _modus(dist):
    # Check bool(dist) - False means no known cases.
    # Check dist.cases > 0 - we can't return a value from the domain without
    # knowing whether it is even present in the data.
    # TODO: What does this mean for k-means convergence?
    if bool(dist) and dist.cases > 0:
        return dist.modus()
    else:
        return None

def data_center(data):
    """
    Returns a center of the instances in the data set (the average across data
    instances for continuous attributes, the most frequent value for discrete
    attributes).
    """
    atts = data.domain.attributes
    astats = orange.DomainBasicAttrStat(data)
    center = [astats[a].avg if a.varType == orange.VarTypes.Continuous
              else _modus(orange.Distribution(a, data)) if a.varType == orange.VarTypes.Discrete
              else None
              for a in atts]
    if data.domain.classVar:
        center.append(0)
    return orange.Example(data.domain, center)

def minindex(x):
    """Return the index of the minimum element."""
    return x.index(min(x))

def avg(x):
    """Return the average (mean) of a given list."""
    return (float(sum(x)) / len(x)) if x else 0

#
# data distances
#

# k-means clustering

# clustering scoring functions

def score_distance_to_centroids(km):
    """Returns the sum of distances between the data instances and their associated centroids.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """
    return sum(km.distance(km.centroids[km.clusters[i]], d) for i,d in enumerate(km.data))

score_distance_to_centroids.minimize = True

def score_conditional_entropy(km):
    """UNIMPLEMENTED cluster quality measured by conditional entropy"""
    raise NotImplementedError

def score_within_cluster_distance(km):
    """UNIMPLEMENTED weighted average within-cluster pairwise distance"""
    raise NotImplementedError

score_within_cluster_distance.minimize = True

def score_between_cluster_distance(km):
    """Sum of distances from elements to 'nearest miss' centroids."""
    return sum(min(km.distance(c, d) for j,c in enumerate(km.centroids) if j!=km.clusters[i]) for i,d in enumerate(km.data))

from Orange.misc import deprecated_function_name
score_betweenClusterDistance = deprecated_function_name(score_between_cluster_distance)

def score_silhouette(km, index=None):
    """Returns the average silhouette score of the data instances.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`

    :param index: if given, the function returns just the silhouette score of that particular data instance.
    :type index: integer
    """

    if index is None:
        return avg([score_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # a: the average distance to the other instances in the same cluster
    a = avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
             km.clusters[i] == cind and i != index])
    # b: the lowest average distance to the instances of any other cluster
    b = min([avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
                 km.clusters[i] == c])
            for c in range(len(km.centroids)) if c != cind])
    # silhouette of the instance: s = (b - a) / max(a, b)
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def score_fast_silhouette(km, index=None):
    """Same as :func:`score_silhouette`, but computes an approximation and is faster.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """

    if index is None:
        return avg([score_fast_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # approximate a and b with distances to centroids instead of averages over instances
    a = km.distance(km.data[index], km.centroids[km.clusters[index]])
    b = min([km.distance(km.data[index], c) for i,c in enumerate(km.centroids) if i != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def compute_bic(km):
    """Compute the Bayesian information criterion (BIC) score for the given clustering.
    (This implementation is rough and may still need a rewrite.)"""
    data = km.data
    medoids = km.centroids

    M = len(data.domain.attributes)
    R = float(len(data))
    K = km.k
    Ri = [km.clusters.count(i) for i in range(K)]
    numFreePar = (M + 1.) * K * math.log(R, 2.) / 2.
    # estimate sigma**2, the pooled within-cluster variance
    s2 = 0.
    cidx = [i for i, attr in enumerate(data.domain.attributes) if attr.varType in [orange.VarTypes.Continuous, orange.VarTypes.Discrete]]
    for x, midx in izip(data, km.clusters):
        medoid = medoids[midx]
        s2 += sum([(float(x[i]) - float(medoid[i]))**2 for i in cidx])
    s2 /= (R - K)
    if s2 < 1e-20:
        return None, [None] * K
    # ldn: log-likelihood of a single cluster, l(Dn)
    # ld: log-likelihood of the clustering, l(D)
    ld = 0
    bicc = []
    for k in range(K):
        ldn = -1. * Ri[k] * ((math.log(2. * math.pi, 2) / -2.) - (M * math.log(s2, 2) / 2.) + (K / 2.) + math.log(Ri[k], 2) - math.log(R, 2))
        ld += ldn
        bicc.append(ldn - numFreePar)
    return ld - numFreePar, bicc


#
# silhouette plot
#

def plot_silhouette(km, filename='tmp.png', fast=False):
    """Saves a silhouette plot to filename, showing the distributions of silhouette scores within clusters.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    :param filename: name of the output plot file.
    :type filename: string
    :param fast: if True, use :func:`score_fast_silhouette` to compute the scores instead of :func:`score_silhouette`.
    :type fast: boolean

    """
    import matplotlib.pyplot as plt
    plt.figure()
    scoring = score_fast_silhouette if fast else score_silhouette
    scores = [[] for i in range(km.k)]
    for i, c in enumerate(km.clusters):
        scores[c].append(scoring(km, i))
    csizes = map(len, scores)
    # vertical positions of the cluster labels; clusters are separated by a gap of three empty bars
    cpositions = [sum(csizes[:i]) + (i+1)*3 + csizes[i]/2 for i in range(km.k)]
    scores = reduce(lambda x,y: x + [0]*3 + sorted(y), scores, [])
    plt.barh(range(len(scores)), scores, linewidth=0, color='c')
    plt.yticks(cpositions, map(str, range(km.k)))
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette value')
    plt.savefig(filename)

# clustering initialization (seeds)
# initialization functions should be of the type f(data, k, distfun)

def init_random(data, k, _):
    """A function that can be used for initialization of k-means clustering; returns k data instances from the data set chosen at random. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    """
    return data.getitems(random.sample(range(len(data)), k))

def init_diversity(data, k, distfun):
    """A function that can be used for initialization of k-means clustering. Returns a set of centroids where the first one is the data point farthest from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previously selected centroids is maximal. Differs from the initialization proposed by Katsavounidis et al. (1994) only in the selection of the first centroid (where they use the data instance with the highest norm).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    :param distfun: a distance function.
    :type distfun: :class:`Orange.distance.Distance`
    """
    center = data_center(data)
    # the first seed is the point farthest from the center of the data
    seeds = [max([(distfun(d, center), d) for d in data])[1]]
    # other seeds are added iteratively; each is the data point farthest from the current set of seeds
    for i in range(1,k):
        seeds.append(max([(min([distfun(d, s) for s in seeds]), d) for d in data if d not in seeds])[1])
    return seeds

class init_hclustering():
    """
    A class whose instances can be used as initialization functions for
    k-means clustering: an instance performs hierarchical clustering on a
    sample of the data, infers k clusters from it, and returns a list of
    cluster-based data centers.
    """

    def __init__(self, n=100):
        """
        :param n: the number of data instances to sample.
        :type n: integer
        """
        self.n = n

    def __call__(self, data, k, distfun):
        """
        :param data: data instances.
        :type data: :class:`orange.ExampleTable`
        :param k: the number of clusters.
        :type k: integer
        :param distfun: a distance function.
        :type distfun: :class:`Orange.distance.Distance`
        """
        sample = orange.ExampleTable(random.sample(data, min(self.n, len(data))))
        root = Orange.clustering.hierarchical.clustering(sample)
        cmap = Orange.clustering.hierarchical.top_clusters(root, k)
        return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap]
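
# Example (a sketch, assuming an "iris" data table is available):
#
#   data = Orange.data.Table("iris")
#   km = Clustering(data, centroids=3, initialization=init_hclustering(n=100))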

#
# k-means clustering, main implementation
#

class Clustering:
    """Implements the k-means clustering algorithm:

    #. Choose the number of clusters, k.
    #. Choose a set of k initial centroids.
    #. Assign each instance in the data set to the closest centroid.
    #. For each cluster, compute a new centroid as the center of the clustered
       data instances.
    #. Repeat the previous two steps until some convergence criterion is
       met (e.g., the cluster assignment has not changed).

    The main advantages of this algorithm are simplicity and low memory
    requirements. The principal disadvantage is the dependence of the results
    on the selection of the initial set of centroids.
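
    A typical use constructs the object and runs the clustering in a single
    step, for instance (a sketch, assuming ``data`` holds an example table)::

        km = Orange.clustering.kmeans.Clustering(data, centroids=3)
        print km.clusters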

    .. attribute:: k

        Number of clusters.

    .. attribute:: data

        Instances to cluster.

    .. attribute:: centroids

        Current set of centroids.

    .. attribute:: score

        Current clustering score.

    .. attribute:: iteration

        Current clustering iteration.

    .. attribute:: clusters

        A list of cluster indexes. The i-th element provides the
        index of the centroid associated with the i-th data instance from the
        input data set.
    """

    def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None,
                 stopchanges=0, nstart=1, initialization=init_random,
                 distance=Orange.distance.Euclidean,
                 scoring=score_distance_to_centroids, inner_callback=None,
                 outer_callback=None):
        """
        :param data: data instances to be clustered. If not None, clustering will be executed immediately after initialization.
        :type data: :class:`orange.ExampleTable` or None
        :param centroids: either the number of clusters or a list of examples that will serve as clustering centroids.
        :type centroids: integer or a list of :class:`orange.Example`
        :param nstart: if greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        :type nstart: integer
        :param distance: an example distance constructor, which measures the distance between two instances.
        :type distance: :class:`Orange.distance.DistanceConstructor`
        :param initialization: a function that selects centroids given data instances, k, and an example distance function. This module implements different approaches (:func:`init_random`, :func:`init_diversity`, :class:`init_hclustering`).
        :param scoring: a function that takes a clustering object and returns the clustering score. It could be used, for instance, in a procedure that repeats the clustering nstart times, returning the clustering with the lowest score.
        :param inner_callback: invoked after every clustering iteration.
        :param outer_callback: invoked after every clustering restart (if nstart is greater than 1).

        Stopping criteria:

        :param maxiters: maximum number of clustering iterations.
        :type maxiters: integer
        :param minscorechange: minimal improvement of the score from the previous iteration (if lower, the clustering will stop). If None, the score is not computed between iterations.
        :type minscorechange: float or None
        :param stopchanges: if the number of instances changing cluster is lower than or equal to stopchanges, stop the clustering.
        :type stopchanges: integer
        """

        self.data = data
        self.k = centroids if type(centroids) == int else len(centroids)
        self.centroids = centroids if type(centroids) == orange.ExampleTable else None
        self.maxiters = maxiters
        self.minscorechange = minscorechange
        self.stopchanges = stopchanges
        self.nstart = nstart
        self.initialization = initialization
        self.distance_constructor = distance
        self.distance = self.distance_constructor(self.data) if self.data else None
        self.scoring = scoring
        # scoring functions flag whether lower values are better via a 'minimize' attribute
        self.minimize_score = hasattr(scoring, 'minimize')
        self.inner_callback = inner_callback
        self.outer_callback = outer_callback
        if self.data:
            self.run()

    def __call__(self, data=None):
        """Runs the k-means clustering algorithm, with optional new data."""
        if data:
            self.data = data
            self.distance = self.distance_constructor(self.data)
        self.run()

    def init_centroids(self):
        """Initialize the cluster centroids."""
        if self.centroids and not self.nstart > 1: # centroids were specified explicitly
            return
        self.centroids = self.initialization(self.data, self.k, self.distance)

    def compute_centeroid(self, data):
        """Return a centroid of the given data set."""
        return data_center(data)

    def compute_cluster(self):
        """Calculate cluster membership, assigning each data instance to its nearest centroid."""
        return [minindex([self.distance(s, d) for s in self.centroids]) for d in self.data]

    def runone(self):
        """Runs a single clustering iteration, starting with re-computation of centroids, followed by computation of data membership (associating data instances to their nearest centroid)."""
        self.centroids = [self.compute_centeroid(self.data.getitems(
            [i for i, c in enumerate(self.clusters) if c == cl])) for cl in range(self.k)]
        self.clusters = self.compute_cluster()

    def run(self):
        """
        Runs clustering until the convergence conditions are met. If nstart is greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        """
        self.winner = None
        for startindx in range(self.nstart):
            self.init_centroids()
            self.clusters = old_cluster = self.compute_cluster()
            if self.minscorechange is not None:
                self.score = old_score = self.scoring(self)
            self.nchanges = len(self.data)
            self.iteration = 0
            stopcondition = False
            if self.inner_callback:
                self.inner_callback(self)
            while not stopcondition:
                self.iteration += 1
                self.runone()
                # count the instances that changed cluster in this iteration
                self.nchanges = sum(map(lambda x, y: x != y, old_cluster, self.clusters))
                old_cluster = self.clusters
                if self.minscorechange is not None:
                    self.score = self.scoring(self)
                    # relative score change; the sign is flipped for scores that are minimized
                    scorechange = (self.score - old_score) / old_score if old_score > 0 else self.minscorechange
                    if self.minimize_score:
                        scorechange = -scorechange
                    old_score = self.score
                stopcondition = (self.nchanges <= self.stopchanges or
                                 self.iteration == self.maxiters or
                                 (self.minscorechange is not None and
                                  scorechange <= self.minscorechange))
                if self.inner_callback:
                    self.inner_callback(self)
            if self.scoring and self.minscorechange is None:
                self.score = self.scoring(self)
            if self.nstart > 1:
                if not self.winner or (self.score < self.winner[0] if
                        self.minimize_score else self.score > self.winner[0]):
                    self.winner = (self.score, self.clusters, self.centroids)
                if self.outer_callback:
                    self.outer_callback(self)

        if self.nstart > 1:
            self.score, self.clusters, self.centroids = self.winner