source: orange/Orange/clustering/kmeans.py @ 9977:d87236e3731e

Revision 9977:d87236e3731e, 20.6 KB checked in by ales_erjavec, 2 years ago

Fixed an error.

1"""
2*******************************
3K-means clustering (``kmeans``)
4*******************************
5
6.. index::
7   single: clustering, kmeans
.. index:: k-means clustering


.. autoclass:: Orange.clustering.kmeans.Clustering
   :members:

Examples
========

The following code runs k-means clustering and prints out the cluster indexes
for the last 10 data instances (:download:`kmeans-run.py <code/kmeans-run.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-run.py

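A minimal sketch of what such a script does (the exact contents of
kmeans-run.py may differ)::

    import orange
    from Orange.clustering import kmeans

    data = orange.ExampleTable("iris")
    km = kmeans.Clustering(data, 3)  # k=3 clusters
    print km.clusters[-10:]          # cluster indexes of the last 10 instances
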
The output of this code is::

    [1, 1, 2, 1, 1, 1, 2, 1, 1, 2]

Invoking a callback function may be useful for tracing the progress of the clustering.
Below is code that uses an :obj:`inner_callback` to report on the number of instances
that have changed cluster membership and on the clustering score. For the score
to be computed at each iteration we have to set :obj:`minscorechange`, but we can
leave it at 0 or even set it to a negative value, which allows the score to deteriorate
by some amount (:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-run-callback.py

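A sketch of such a callback and its use (details in the bundled script may
differ)::

    import orange
    from Orange.clustering import kmeans

    def callback(km):
        if km.iteration > 0:
            print "Iteration: %d, changes: %d, score: %.4f" % \
                (km.iteration, km.nchanges, km.score)

    data = orange.ExampleTable("iris")
    km = kmeans.Clustering(data, 3, minscorechange=0, inner_callback=callback)
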
The convergence on the Iris data set is fast::

    Iteration: 1, changes: 150, score: 10.9555
    Iteration: 2, changes: 12, score: 10.3867
    Iteration: 3, changes: 2, score: 10.2034
    Iteration: 4, changes: 2, score: 10.0699
    Iteration: 5, changes: 2, score: 9.9542
    Iteration: 6, changes: 1, score: 9.9168
    Iteration: 7, changes: 2, score: 9.8624
    Iteration: 8, changes: 0, score: 9.8624

The callback above is used to report progress, but it might as well call a function that plots a selected data projection with the corresponding centroids at a given step of the clustering. This is exactly what we did with the following script (:download:`kmeans-trace.py <code/kmeans-trace.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-trace.py

Only the first four scatterplots are shown below. Colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular 2-dimensional projection is not necessarily the centroid of the cluster that the data point belongs to.


.. image:: files/kmeans-scatter-001.png

.. image:: files/kmeans-scatter-002.png

.. image:: files/kmeans-scatter-003.png

.. image:: files/kmeans-scatter-004.png


k-Means Utility Functions
=========================

.. automethod:: Orange.clustering.kmeans.init_random

.. automethod:: Orange.clustering.kmeans.init_diversity

.. autoclass:: Orange.clustering.kmeans.init_hclustering
   :members:

.. automethod:: Orange.clustering.kmeans.plot_silhouette

.. automethod:: Orange.clustering.kmeans.score_distance_to_centroids

.. automethod:: Orange.clustering.kmeans.score_silhouette

.. automethod:: Orange.clustering.kmeans.score_fast_silhouette

Typically, the choice of seeds has a large impact on k-means clustering,
with better initialization methods yielding clusterings that converge faster
and find centroids closer to the optimum. The following code compares three different
initialization methods (random, diversity-based and hierarchical clustering-based)
in terms of how fast they converge (:download:`kmeans-cmp-init.py <code/kmeans-cmp-init.py>`, uses :download:`iris.tab <code/iris.tab>`,
:download:`housing.tab <code/housing.tab>`, :download:`vehicle.tab <code/vehicle.tab>`):

.. literalinclude:: code/kmeans-cmp-init.py

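A sketch of such a comparison on a single data set (the bundled script runs
it over all three data sets and tabulates the iteration counts)::

    import orange
    from Orange.clustering import kmeans

    data = orange.ExampleTable("iris")
    for init in (kmeans.init_random, kmeans.init_diversity,
                 kmeans.init_hclustering(n=100)):
        km = kmeans.Clustering(data, 3, minscorechange=0, initialization=init)
        print km.iteration
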
As expected, k-means converges faster with diversity-based and clustering-based
initialization than with random seed selection::

               Rnd Div  HC
          iris  12   3   4
       housing  14   6   4
       vehicle  11   4   3

The following code computes the silhouette score for k=2..7 and plots a
silhouette plot for k=3 (:download:`kmeans-silhouette.py <code/kmeans-silhouette.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-silhouette.py

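The scoring loop might look like this (plotting omitted; the bundled script
also calls :func:`plot_silhouette` for k=3)::

    import orange
    from Orange.clustering import kmeans

    data = orange.ExampleTable("iris")
    for k in range(2, 8):
        km = kmeans.Clustering(data, k, initialization=kmeans.init_diversity)
        print k, kmeans.score_silhouette(km)
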
The analysis suggests that k=2 is preferred as it yields
the maximal silhouette coefficient::

    2 0.629467553352
    3 0.504318855054
    4 0.407259377854
    5 0.358628975081
    6 0.353228492088
    7 0.366357876944

.. figure:: files/kmeans-silhouette.png

   Silhouette plot for k=3.

"""

import math
import sys
import orange
import random
from Orange import statc

import Orange.clustering.hierarchical
import Orange.distance

# miscellaneous functions

def _modus(dist):
    # Check bool(dist) - False means no known cases.
    # Check dist.cases > 0 - we can't return a value from the domain without
    # knowing whether it is even present in the data.
    # TODO: What does this mean for k-means convergence?
    if bool(dist) and dist.cases > 0:
        return dist.modus()
    else:
        return None

def data_center(data):
    """
    Returns a center of the instances in the data set (average across data
    instances for continuous attributes, most frequent value for discrete
    attributes).
    """
    atts = data.domain.attributes
    astats = orange.DomainBasicAttrStat(data)
    center = [astats[a].avg if a.varType == orange.VarTypes.Continuous
              else _modus(orange.Distribution(a, data)) if a.varType == orange.VarTypes.Discrete
              else None
              for a in atts]
    if data.domain.classVar:
        center.append(0)
    return orange.Example(data.domain, center)

def minindex(x):
    """Return the index of the minimum element"""
    return x.index(min(x))

def avg(x):
    """Return the average (mean) of a given list"""
    return (float(sum(x)) / len(x)) if x else 0

#
# data distances
#

# k-means clustering

# clustering scoring functions

def score_distance_to_centroids(km):
    """Returns the sum of distances of data instances to their associated centroids.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """
    return sum(km.distance(km.centroids[km.clusters[i]], d) for i, d in enumerate(km.data))

score_distance_to_centroids.minimize = True

def score_conditional_entropy(km):
    """UNIMPLEMENTED cluster quality measured by conditional entropy"""
    raise NotImplementedError

def score_within_cluster_distance(km):
    """UNIMPLEMENTED weighted average within-cluster pairwise distance"""
    raise NotImplementedError

score_within_cluster_distance.minimize = True

def score_between_cluster_distance(km):
    """Sum of distances from elements to 'nearest miss' centroids"""
    return sum(min(km.distance(c, d) for j, c in enumerate(km.centroids) if j != km.clusters[i])
               for i, d in enumerate(km.data))

from Orange.misc import deprecated_function_name
score_betweenClusterDistance = deprecated_function_name(score_between_cluster_distance)

def score_silhouette(km, index=None):
    """Returns an average silhouette score of data instances.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`

    :param index: if given, the function returns just the silhouette score of that particular data instance.
    :type index: integer
    """

    if index is None:
        return avg([score_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # a: average distance to the other instances in the same cluster
    a = avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
             km.clusters[i] == cind and i != index])
    # b: minimal average distance to the instances of any other cluster
    b = min([avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
                  km.clusters[i] == c])
             for c in range(len(km.centroids)) if c != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def score_fast_silhouette(km, index=None):
    """Same as score_silhouette, but computes an approximation and is faster.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """

    if index is None:
        return avg([score_fast_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # approximate a and b by distances to centroids instead of averaging
    # over all member instances
    a = km.distance(km.data[index], km.centroids[km.clusters[index]])
    b = min([km.distance(km.data[index], c) for i, c in enumerate(km.centroids) if i != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def compute_bic(km):
    """Compute the Bayesian information criterion (BIC) score for the given clustering."""
    data = km.data
    K = km.k
    centroids = km.centroids

    M = len(data.domain.attributes)
    R = float(len(data))
    Ri = [km.clusters.count(i) for i in range(K)]
    numFreePar = (M + 1.) * K * math.log(R, 2.) / 2.
    # estimate the variance (sigma**2) over continuous and discrete attributes
    s2 = 0.
    cidx = [i for i, attr in enumerate(data.domain.attributes)
            if attr.varType in [orange.VarTypes.Continuous, orange.VarTypes.Discrete]]
    for x, cind in zip(data, km.clusters):
        centroid = centroids[cind]
        s2 += sum([(float(x[i]) - float(centroid[i]))**2 for i in cidx])
    s2 /= (R - K)
    if s2 < 1e-20:
        return None, [None] * K
    # log-likelihood of each cluster: l(Dn)
    # log-likelihood of the clustering: l(D)
    ld = 0
    bicc = []
    for k in range(K):
        ldn = -1. * Ri[k] * ((math.log(2. * math.pi, 2) / -2.) - (M * math.log(s2, 2) / 2.) + (K / 2.) + math.log(Ri[k], 2) - math.log(R, 2))
        ld += ldn
        bicc.append(ldn - numFreePar)
    return ld - numFreePar, bicc


#
# silhouette plot
#

def plot_silhouette(km, filename='tmp.png', fast=False):
    """Saves a silhouette plot to filename, showing the distributions of silhouette scores in clusters.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    :param filename: name of the output plot.
    :type filename: string
    :param fast: if True, use :func:`score_fast_silhouette` to compute scores instead of :func:`score_silhouette`.
    :type fast: boolean

    """
    import matplotlib.pyplot as plt
    plt.figure()
    scoring = score_fast_silhouette if fast else score_silhouette
    scores = [[] for i in range(km.k)]
    for i, c in enumerate(km.clusters):
        scores[c].append(scoring(km, i))
    csizes = map(len, scores)
    # vertical positions of cluster labels, with a 3-bar gap between clusters
    cpositions = [sum(csizes[:i]) + (i+1)*3 + csizes[i]/2 for i in range(km.k)]
    scores = reduce(lambda x, y: x + [0]*3 + sorted(y), scores, [])
    plt.barh(range(len(scores)), scores, linewidth=0, color='c')
    plt.yticks(cpositions, map(str, range(km.k)))
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette value')
    plt.savefig(filename)

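# A usage sketch for plot_silhouette (assumes matplotlib is installed):
#
#     import orange
#     from Orange.clustering import kmeans
#     data = orange.ExampleTable("iris")
#     km = kmeans.Clustering(data, 3)
#     kmeans.plot_silhouette(km, "silhouette.png")
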
# clustering initialization (seeds)
# initialization functions should be of the type f(data, k, distfun)

def init_random(data, k, _):
    """A function that can be used for initialization of k-means clustering. Returns k data instances randomly selected from the data. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    """
    return data.getitems(random.sample(range(len(data)), k))

def init_diversity(data, k, distfun):
    """A function that can be used for initialization of k-means clustering. Returns a set of centroids where the first is the data point farthest from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previously selected centroids is maximal. Differs from the initialization proposed by Katsavounidis et al. (1994) only in the selection of the first centroid (where they use the data instance with the highest norm).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    :param distfun: a distance function.
    :type distfun: :class:`Orange.distance.Distance`
    """
    center = data_center(data)
    # the first seed is the point farthest from the center of the data
    seeds = [max([(distfun(d, center), d) for d in data])[1]]
    # other seeds are added iteratively; each is the data point farthest
    # from the current set of seeds
    for i in range(1, k):
        seeds.append(max([(min([distfun(d, s) for s in seeds]), d) for d in data if d not in seeds])[1])
    return seeds

class init_hclustering():
    """
    A class that returns a clustering initialization function which performs
    hierarchical clustering on a sample of the data, uses it to infer k
    clusters, and computes a list of cluster-based data centers
    """

    def __init__(self, n=100):
        """
        :param n: number of data instances to sample.
        :type n: integer
        """
        self.n = n

    def __call__(self, data, k, distfun):
        """
        :param data: data instances.
        :type data: :class:`orange.ExampleTable`
        :param k: the number of clusters.
        :type k: integer
        :param distfun: a distance function.
        :type distfun: :class:`Orange.distance.Distance`
        """
        sample = orange.ExampleTable(random.sample(data, min(self.n, len(data))))
        root = Orange.clustering.hierarchical.clustering(sample)
        cmap = Orange.clustering.hierarchical.top_clusters(root, k)
        return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap]

#
# k-means clustering, main implementation
#

class Clustering:
    """Implements a k-means clustering algorithm:

    #. Choose the number of clusters, k.
    #. Choose a set of k initial centroids.
    #. Assign each instance in the data set to the closest centroid.
    #. For each cluster, compute a new centroid as a center of clustered
       data instances.
    #. Repeat the previous two steps, until some convergence criterion is
       met (e.g., the cluster assignment has not changed).

    The main advantages of this algorithm are simplicity and low memory
    requirements. The principal disadvantage is the dependence of results
    on the selection of the initial set of centroids.

    .. attribute:: k

        Number of clusters.

    .. attribute:: data

        Instances to cluster.

    .. attribute:: centroids

        Current set of centroids.

    .. attribute:: score

        Current clustering score.

    .. attribute:: iteration

        Current clustering iteration.

    .. attribute:: clusters

        A list of cluster indexes. The i-th element provides the
        index of the centroid associated with the i-th data instance from
        the input data set.
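
    A quick usage sketch (the downloadable scripts in the module
    documentation give fuller examples)::

        import orange
        from Orange.clustering import kmeans
        data = orange.ExampleTable("iris")
        km = kmeans.Clustering(data, 3)
        print km.clusters[:10]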
    """

    def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None,
                 stopchanges=0, nstart=1, initialization=init_random,
                 distance=Orange.distance.Euclidean,
                 scoring=score_distance_to_centroids, inner_callback=None,
                 outer_callback=None):
        """
        :param data: Data instances to be clustered. If not None, clustering will be executed immediately after initialization.
        :type data: :class:`orange.ExampleTable` or None
        :param centroids: either specify a number of clusters or provide a list of examples that will serve as clustering centroids.
        :type centroids: integer or a list of :class:`orange.Example`
        :param nstart: If greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        :type nstart: integer
        :param distance: an example distance constructor, which measures the distance between two instances.
        :type distance: :class:`Orange.distance.DistanceConstructor`
        :param initialization: a function that selects centroids given data instances, k, and an example distance function. This module implements different approaches (:func:`init_random`, :func:`init_diversity`, :class:`init_hclustering`).
        :param scoring: a function that takes a clustering object and returns the clustering score. It could be used, for instance, in a procedure that repeats the clustering nstart times, returning the clustering with the lowest score.
        :param inner_callback: invoked after every clustering iteration.
        :param outer_callback: invoked after every clustering restart (if nstart is greater than 1).

        Stopping criteria:

        :param maxiters: maximum number of clustering iterations
        :type maxiters: integer
        :param minscorechange: minimal improvement of the score from the previous iteration (if lower, the clustering will stop). If None, the score will not be computed between iterations
        :type minscorechange: float or None
        :param stopchanges: if the number of instances changing the cluster is lower than or equal to stopchanges, stop the clustering.
        :type stopchanges: integer
        """

        self.data = data
        self.k = centroids if isinstance(centroids, int) else len(centroids)
        self.centroids = centroids if isinstance(centroids, orange.ExampleTable) else None
        self.maxiters = maxiters
        self.minscorechange = minscorechange
        self.stopchanges = stopchanges
        self.nstart = nstart
        self.initialization = initialization
        self.distance_constructor = distance
        self.distance = self.distance_constructor(self.data) if self.data else None
        self.scoring = scoring
        self.minimize_score = hasattr(scoring, 'minimize')
        self.inner_callback = inner_callback
        self.outer_callback = outer_callback
        if self.data:
            self.run()

    def __call__(self, data=None):
        """Runs the k-means clustering algorithm, with optional new data."""
        if data:
            self.data = data
            self.distance = self.distance_constructor(self.data)
        self.run()

    def init_centroids(self):
        """Initialize cluster centroids"""
        if self.centroids and not self.nstart > 1:  # centroids were specified
            return
        self.centroids = self.initialization(self.data, self.k, self.distance)

    def compute_centeroid(self, data):
        """Return a centroid of the data set."""
        return data_center(data)

    def compute_cluster(self):
        """Calculate membership in clusters."""
        return [minindex([self.distance(s, d) for s in self.centroids]) for d in self.data]

    def runone(self):
        """Runs a single clustering iteration, starting with re-computation of centroids, followed by computation of data membership (associating data instances to their nearest centroid)."""
        self.centroids = [self.compute_centeroid(self.data.getitems(
            [i for i, c in enumerate(self.clusters) if c == cl])) for cl in range(self.k)]
        self.clusters = self.compute_cluster()

    def run(self):
        """
        Runs clustering until the convergence conditions are met. If nstart is greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        """
        self.winner = None
        for startindx in range(self.nstart):
            self.init_centroids()
            self.clusters = old_cluster = self.compute_cluster()
            if self.minscorechange is not None:
                self.score = old_score = self.scoring(self)
            self.nchanges = len(self.data)
            self.iteration = 0
            stopcondition = False
            if self.inner_callback:
                self.inner_callback(self)
            while not stopcondition:
                self.iteration += 1
                self.runone()
                self.nchanges = sum(map(lambda x, y: x != y, old_cluster, self.clusters))
                old_cluster = self.clusters
                if self.minscorechange is not None:
                    self.score = self.scoring(self)
                    scorechange = (self.score - old_score) / old_score if old_score > 0 else self.minscorechange
                    if self.minimize_score:
                        scorechange = -scorechange
                    old_score = self.score
                stopcondition = (self.nchanges <= self.stopchanges or
                                 self.iteration == self.maxiters or
                                 (self.minscorechange is not None and
                                  scorechange <= self.minscorechange))
                if self.inner_callback:
                    self.inner_callback(self)
            if self.scoring and self.minscorechange is None:
                self.score = self.scoring(self)
            if self.nstart > 1:
                if not self.winner or (self.score < self.winner[0] if
                        self.minimize_score else self.score > self.winner[0]):
                    self.winner = (self.score, self.clusters, self.centroids)
                if self.outer_callback:
                    self.outer_callback(self)

        if self.nstart > 1:
            self.score, self.clusters, self.centroids = self.winner