source: orange/Orange/clustering/kmeans.py @ 10393:4dbd54af3ac8

1"""
2*******************************
3K-means clustering (``kmeans``)
4*******************************
5
6.. index::
7   single: clustering, kmeans
.. index:: k-means clustering


.. autoclass:: Orange.clustering.kmeans.Clustering(data=None, centroids=3, maxiters=None, minscorechange=None, stopchanges=0, nstart=1, initialization=init_random, distance=Orange.distance.Euclidean, scoring=score_distance_to_centroids, inner_callback=None, outer_callback=None)
    :members:
    :exclude-members: __init__

    .. automethod:: __init__(data=None, centroids=3, maxiters=None, minscorechange=None, stopchanges=0, nstart=1, initialization=init_random, distance=Orange.distance.Euclidean, scoring=score_distance_to_centroids, inner_callback=None, outer_callback=None)


Examples
========

The following code runs k-means clustering and prints out the cluster indexes
for the last 10 data instances (:download:`kmeans-run.py <code/kmeans-run.py>`):

.. literalinclude:: code/kmeans-run.py

The output of this code is::

    [1, 1, 2, 1, 1, 1, 2, 1, 1, 2]
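
For reference, a minimal script equivalent in spirit to the one above could be
as short as this (a sketch; the downloadable :download:`kmeans-run.py
<code/kmeans-run.py>` may differ in detail)::

    import Orange

    data = Orange.data.Table("iris")
    km = Orange.clustering.kmeans.Clustering(data, centroids=3)
    print km.clusters[-10:]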

Invoking a callback function may be useful for tracing the progress of the clustering.
Below is code that uses an :obj:`inner_callback` to report the number of instances
that have changed cluster and to report the clustering score. For the score
to be computed at each iteration we have to set :obj:`minscorechange`, but we can
leave it at 0 or even set it to a negative value, which allows the score to deteriorate
by some amount (:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>`):

.. literalinclude:: code/kmeans-run-callback.py

The convergence on the Iris data set is fast::

    Iteration: 1, changes: 150, score: 10.9555
    Iteration: 2, changes: 12, score: 10.3867
    Iteration: 3, changes: 2, score: 10.2034
    Iteration: 4, changes: 2, score: 10.0699
    Iteration: 5, changes: 2, score: 9.9542
    Iteration: 6, changes: 1, score: 9.9168
    Iteration: 7, changes: 2, score: 9.8624
    Iteration: 8, changes: 0, score: 9.8624
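
For illustration, an :obj:`inner_callback` that produces a report like the one
above could look roughly as follows (a sketch; the actual
:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>` may differ in
detail)::

    import Orange

    data = Orange.data.Table("iris")

    def callback(km):
        print "Iteration: %d, changes: %d, score: %.4f" % \
            (km.iteration, km.nchanges, km.score)

    km = Orange.clustering.kmeans.Clustering(data, 3, minscorechange=0,
                                             inner_callback=callback)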

The callback above is used to report progress, but it could just as well call a
function that plots a selected data projection with the corresponding centroids
at a given step of the clustering. This is exactly what we did with the
following script (:download:`kmeans-trace.py <code/kmeans-trace.py>`):

.. literalinclude:: code/kmeans-trace.py

Only the first four scatterplots are shown below. Colors of the data instances
indicate cluster membership. Notice that since the Iris data set includes four
attributes, the closest centroid in a particular 2-dimensional projection is not
necessarily also the centroid of the cluster that the data instance belongs to.


.. image:: files/kmeans-scatter-001.png

.. image:: files/kmeans-scatter-002.png

.. image:: files/kmeans-scatter-003.png

.. image:: files/kmeans-scatter-004.png


k-Means Utility Functions
=========================

.. automethod:: Orange.clustering.kmeans.init_random

.. automethod:: Orange.clustering.kmeans.init_diversity

.. autoclass:: Orange.clustering.kmeans.init_hclustering
   :members:

.. automethod:: Orange.clustering.kmeans.plot_silhouette

.. automethod:: Orange.clustering.kmeans.score_distance_to_centroids

.. automethod:: Orange.clustering.kmeans.score_silhouette

.. automethod:: Orange.clustering.kmeans.score_fast_silhouette

Typically, the choice of seeds has a large impact on k-means clustering,
with better initialization methods yielding clusterings that converge faster
and find better centroids. The following code compares three different
initialization methods (random, diversity-based and hierarchical clustering-based)
in terms of how fast they converge (:download:`kmeans-cmp-init.py <code/kmeans-cmp-init.py>`):

.. literalinclude:: code/kmeans-cmp-init.py

As expected, k-means converges faster with diversity-based and clustering-based
initialization than with random seed selection::

               Rnd Div  HC
          iris  12   3   4
       housing  14   6   4
       vehicle  11   4   3
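
The gist of such a comparison can be sketched as follows (a simplified,
single-data-set version; the actual :download:`kmeans-cmp-init.py
<code/kmeans-cmp-init.py>` covers all three data sets from the table above)::

    import Orange
    from Orange.clustering import kmeans

    data = Orange.data.Table("iris")
    for name, init in [("Rnd", kmeans.init_random),
                       ("Div", kmeans.init_diversity),
                       ("HC", kmeans.init_hclustering(n=100))]:
        km = kmeans.Clustering(data, centroids=3, initialization=init)
        print name, km.iteration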

The following code computes the silhouette score for k=2..7 and plots a
silhouette plot for k=3 (:download:`kmeans-silhouette.py <code/kmeans-silhouette.py>`):

.. literalinclude:: code/kmeans-silhouette.py

The analysis suggests that k=2 is preferred as it yields
the maximal silhouette coefficient::

    2 0.629467553352
    3 0.504318855054
    4 0.407259377854
    5 0.358628975081
    6 0.353228492088
    7 0.366357876944
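
In outline, the script does something like the following (a sketch, not the
exact downloadable file; the choice of :obj:`init_diversity` here is only an
assumption)::

    import Orange
    from Orange.clustering import kmeans

    data = Orange.data.Table("iris")
    for k in range(2, 8):
        km = kmeans.Clustering(data, k, initialization=kmeans.init_diversity)
        print k, kmeans.score_silhouette(km)

    km = kmeans.Clustering(data, 3, initialization=kmeans.init_diversity)
    kmeans.plot_silhouette(km, "kmeans-silhouette.png")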

.. figure:: files/kmeans-silhouette.png

   Silhouette plot for k=3.

"""

import math
import sys
import orange
import random
from itertools import izip  # needed by compute_bic below
from Orange import statc

import Orange.clustering.hierarchical

# miscellaneous functions

def _modus(dist):
    # Check bool(dist) - False means no known cases.
    # Check dist.cases > 0 - we can't return some value from the domain without
    # knowing if it is even present in the data.
    # TODO: What does this mean for k-means convergence?
    if bool(dist) and dist.cases > 0:
        return dist.modus()
    else:
        return None

def data_center(data):
    """
    Returns a center of the instances in the data set (average across data
    instances for continuous attributes, most frequent value for discrete
    attributes).
    """
    atts = data.domain.attributes
    astats = orange.DomainBasicAttrStat(data)
    center = [astats[a].avg if a.varType == orange.VarTypes.Continuous
              else _modus(orange.Distribution(a, data)) if a.varType == orange.VarTypes.Discrete
              else None
              for a in atts]
    if data.domain.classVar:
        center.append(0)
    return orange.Example(data.domain, center)

def minindex(x):
    """Return the index of the minimum element."""
    return x.index(min(x))

def avg(x):
    """Return the average (mean) of a given list."""
    return (float(sum(x)) / len(x)) if x else 0

#
# data distances
#

# k-means clustering

# clustering scoring functions

def score_distance_to_centroids(km):
    """Returns the total distance of data instances to their associated centroids.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """
    return sum(km.distance(km.centroids[km.clusters[i]], d) for i, d in enumerate(km.data))

score_distance_to_centroids.minimize = True

def score_conditional_entropy(km):
    """UNIMPLEMENTED cluster quality measured by conditional entropy"""
    raise NotImplementedError

def score_within_cluster_distance(km):
    """UNIMPLEMENTED weighted average within-cluster pairwise distance"""
    raise NotImplementedError

score_within_cluster_distance.minimize = True

def score_between_cluster_distance(km):
    """Sum of distances from elements to 'nearest miss' centroids."""
    return sum(min(km.distance(c, d) for j, c in enumerate(km.centroids) if j != km.clusters[i])
               for i, d in enumerate(km.data))

from Orange.misc import deprecated_function_name
score_betweenClusterDistance = deprecated_function_name(score_between_cluster_distance)

def score_silhouette(km, index=None):
    """Returns an average silhouette score of data instances.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`

    :param index: if given, the function returns just the silhouette score of that particular data instance.
    :type index: integer
    """

    if index is None:
        return avg([score_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # a: average distance to the instances within the same cluster
    a = avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
             km.clusters[i] == cind and i != index])
    # b: average distance to the instances of the nearest other cluster
    b = min([avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
                  km.clusters[i] == c])
             for c in range(len(km.centroids)) if c != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def score_fast_silhouette(km, index=None):
    """Same as score_silhouette, but computes an approximation and is faster.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    """

    if index is None:
        return avg([score_fast_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # approximate a and b by distances to centroids instead of averaging over instances
    a = km.distance(km.data[index], km.centroids[km.clusters[index]])
    b = min([km.distance(km.data[index], c) for i, c in enumerate(km.centroids) if i != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def compute_bic(km):
    """Compute the Bayesian information criterion (BIC) score of the given clustering."""
    data = km.data
    medoids = km.centroids

    M = len(data.domain.attributes)
    R = float(len(data))
    K = km.k
    Ri = [km.clusters.count(i) for i in range(K)]
    numFreePar = (M + 1.) * K * math.log(R, 2.) / 2.
    # pooled within-cluster variance estimate (sigma**2)
    s2 = 0.
    cidx = [i for i, attr in enumerate(data.domain.attributes)
            if attr.varType in [orange.VarTypes.Continuous, orange.VarTypes.Discrete]]
    for x, midx in izip(data, km.clusters):
        medoid = medoids[midx]
        s2 += sum([(float(x[i]) - float(medoid[i]))**2 for i in cidx])
    s2 /= (R - K)
    if s2 < 1e-20:
        return None, [None] * K
    # ldn: log-likelihood of cluster n; ld: log-likelihood of the clustering
    ld = 0
    bicc = []
    for k in range(K):
        ldn = -1. * Ri[k] * ((math.log(2. * math.pi, 2) / -2.) - (M * math.log(s2, 2) / 2.) +
                             (K / 2.) + math.log(Ri[k], 2) - math.log(R, 2))
        ld += ldn
        bicc.append(ldn - numFreePar)
    return ld - numFreePar, bicc


#
# silhouette plot
#

def plot_silhouette(km, filename='tmp.png', fast=False):
    """Saves a silhouette plot to filename, showing the distributions of
    silhouette scores in clusters.

    :param km: a k-means clustering object.
    :type km: :class:`KMeans`
    :param filename: name of the output plot.
    :type filename: string
    :param fast: if True, use :func:`score_fast_silhouette` to compute scores instead of :func:`score_silhouette`.
    :type fast: boolean

    """
    import matplotlib.pyplot as plt
    plt.figure()
    scoring = score_fast_silhouette if fast else score_silhouette
    # gather silhouette scores of instances, grouped by cluster
    scores = [[] for i in range(km.k)]
    for i, c in enumerate(km.clusters):
        scores[c].append(scoring(km, i))
    csizes = map(len, scores)
    # leave a three-bar gap between clusters and label each cluster at its middle
    cpositions = [sum(csizes[:i]) + (i+1)*3 + csizes[i]/2 for i in range(km.k)]
    scores = reduce(lambda x, y: x + [0]*3 + sorted(y), scores, [])
    plt.barh(range(len(scores)), scores, linewidth=0, color='c')
    plt.yticks(cpositions, map(str, range(km.k)))
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette value')
    plt.savefig(filename)

# clustering initialization (seeds)
# initialization functions should be of the type f(data, k, distfun)

def init_random(data, k, _):
    """A function that can be used for initialization of k-means clustering;
    returns k data instances chosen from the data at random. This type of
    initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    """
    return data.getitems(random.sample(range(len(data)), k))

def init_diversity(data, k, distfun):
    """A function that can be used for initialization of k-means clustering.
    Returns a set of centroids where the first one is the data point farthest
    from the center of the data, and each subsequent centroid is the data point
    whose minimal distance to the previously chosen set of centroids is maximal.
    Differs from the initialization proposed by Katsavounidis et al. (1994) only
    in the selection of the first centroid (they use a data instance with the
    highest norm).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    :param distfun: a distance function.
    :type distfun: :class:`Orange.distance.Distance`
    """
    center = data_center(data)
    # the first seed is the point farthest from the center of the data
    seeds = [max([(distfun(d, center), d) for d in data])[1]]
    # remaining seeds are added iteratively; each is the data point farthest
    # from the current set of seeds
    for i in range(1, k):
        seeds.append(max([(min([distfun(d, s) for s in seeds]), d) for d in data if d not in seeds])[1])
    return seeds

class init_hclustering():
    """
    A class that returns a clustering initialization function which performs
    hierarchical clustering on a sample of the data, uses it to infer k
    clusters, and computes a list of cluster-based data centers.
    """

    def __init__(self, n=100):
        """
        :param n: number of data instances to sample.
        :type n: integer
        """
        self.n = n

    def __call__(self, data, k, distfun):
        """
        :param data: data instances.
        :type data: :class:`orange.ExampleTable`
        :param k: the number of clusters.
        :type k: integer
        :param distfun: a distance function.
        :type distfun: :class:`Orange.distance.Distance`
        """
        sample = orange.ExampleTable(random.sample(data, min(self.n, len(data))))
        root = Orange.clustering.hierarchical.clustering(sample)
        cmap = Orange.clustering.hierarchical.top_clusters(root, k)
        return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap]

#
# k-means clustering, main implementation
#

class Clustering:
    """Implements a k-means clustering algorithm:

    #. Choose the number of clusters, k.
    #. Choose a set of k initial centroids.
    #. Assign each instance in the data set to the closest centroid.
    #. For each cluster, compute a new centroid as a center of clustered
       data instances.
    #. Repeat the previous two steps, until some convergence criterion is
       met (e.g., the cluster assignment has not changed).

    The main advantages of this algorithm are simplicity and low memory
    requirements. The principal disadvantage is the dependence of results
    on the selection of the initial set of centroids.

    .. attribute:: k

        Number of clusters.

    .. attribute:: data

        Instances to cluster.

    .. attribute:: centroids

        Current set of centroids.

    .. attribute:: score

        Current clustering score.

    .. attribute:: iteration

        Current clustering iteration.

    .. attribute:: clusters

        A list of cluster indexes. The i-th element is the index of the
        centroid associated with the i-th data instance from the input
        data set.
    """

    def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None,
                 stopchanges=0, nstart=1, initialization=init_random,
                 distance=Orange.distance.Euclidean,
                 scoring=score_distance_to_centroids, inner_callback=None,
                 outer_callback=None):
402        """
403        :param data: Data instances to be clustered. If not None, clustering will be executed immediately after initialization unless ``initialize_only=True``.
404        :type data: :class:`~Orange.data.Table` or None
405        :param centroids: either specify a number of clusters or provide a list of examples that will serve as clustering centroids.
406        :type centroids: :obj:`int` or :obj:`list` of :class:`~Orange.data.Instance`
407        :param nstart: If greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
408        :type nstart: int
409        :param distance: an example distance constructor, which measures the distance between two instances.
410        :type distance: :class:`~Orange.distance.DistanceConstructor`
411        :param initialization: a function to select centroids given data instances, k and a example distance function. This module implements different approaches (:obj:`init_random`, :obj:`init_diversity`, :obj:`init_hclustering`).
412        :param scoring: a function that takes clustering object and returns the clustering score. It could be used, for instance, in procedure that repeats the clustering nstart times, returning the clustering with the lowest score.
413        :param inner_callback: invoked after every clustering iteration.
414        :param outer_callback: invoked after every clustering restart (if nstart is greater than 1).
415
416        Stopping criteria:
417
418        :param maxiters: maximum number of clustering iterations
419        :type maxiters: integer
420        :param minscorechange: minimal improvement of the score from previous generation (if lower, the clustering will stop). If None, the score will not be computed between iterations
421        :type minscorechange: float or None
422        :param stopchanges: if the number of instances changing the cluster is lower or equal to stopchanges, stop the clustering.
423        :type stopchanges: integer
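
        For illustration, a typical invocation might look like this (a sketch,
        assuming the iris data set is available)::

            import Orange
            data = Orange.data.Table("iris")
            km = Orange.clustering.kmeans.Clustering(data, centroids=3,
                                                     nstart=5, minscorechange=0)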
424        """
425
426        self.data = data
427        self.k = centroids if type(centroids)==int else len(centroids)
428        self.centroids = centroids if type(centroids) == orange.ExampleTable else None
429        self.maxiters = maxiters
430        self.minscorechange = minscorechange
431        self.stopchanges = stopchanges
432        self.nstart = nstart
433        self.initialization = initialization
434        self.distance_constructor = distance
435        self.distance = self.distance_constructor(self.data) if self.data else None
436        self.scoring = scoring
437        self.minimize_score = True if hasattr(scoring, 'minimize') else False
438        self.inner_callback = inner_callback
439        self.outer_callback = outer_callback
440        if self.data:
441            self.run()
442       
    def __call__(self, data=None):
        """Runs the k-means clustering algorithm, with optional new data."""
        if data:
            self.data = data
            self.distance = self.distance_constructor(self.data)
        self.run()

    def init_centroids(self):
        """Initialize cluster centroids."""
        if self.centroids and not self.nstart > 1:  # centroids were specified
            return
        self.centroids = self.initialization(self.data, self.k, self.distance)

    def compute_centeroid(self, data):
        """Return a centroid of the data set."""
        return data_center(data)

    def compute_cluster(self):
        """Calculate membership in clusters by assigning each instance to its nearest centroid."""
        return [minindex([self.distance(s, d) for s in self.centroids]) for d in self.data]

    def runone(self):
        """Runs a single clustering iteration, starting with re-computation of
        centroids, followed by computation of data membership (associating data
        instances to their nearest centroid)."""
        self.centroids = [self.compute_centeroid(self.data.getitems(
            [i for i, c in enumerate(self.clusters) if c == cl])) for cl in range(self.k)]
        self.clusters = self.compute_cluster()

    def run(self):
        """
        Runs clustering until the convergence conditions are met. If nstart is
        greater than one, nstart runs of the clustering algorithm will be
        executed, returning the clustering with the best (lowest) score.
        """
        self.winner = None
        for startindx in range(self.nstart):
            self.init_centroids()
            self.clusters = old_cluster = self.compute_cluster()
            if self.minscorechange is not None:
                self.score = old_score = self.scoring(self)
            self.nchanges = len(self.data)
            self.iteration = 0
            stopcondition = False
            if self.inner_callback:
                self.inner_callback(self)
            while not stopcondition:
                self.iteration += 1
                self.runone()
                # count the instances that changed their cluster assignment
                self.nchanges = sum(map(lambda x, y: x != y, old_cluster, self.clusters))
                old_cluster = self.clusters
                if self.minscorechange is not None:
                    self.score = self.scoring(self)
                    scorechange = (self.score - old_score) / old_score if old_score > 0 else self.minscorechange
                    if self.minimize_score:
                        scorechange = -scorechange
                    old_score = self.score
                stopcondition = (self.nchanges <= self.stopchanges or
                                 self.iteration == self.maxiters or
                                 (self.minscorechange is not None and
                                  scorechange <= self.minscorechange))
                if self.inner_callback:
                    self.inner_callback(self)
            if self.scoring and self.minscorechange is None:
                self.score = self.scoring(self)
            if self.nstart > 1:
                # keep the best clustering across restarts
                if not self.winner or (self.score < self.winner[0] if
                        self.minimize_score else self.score > self.winner[0]):
                    self.winner = (self.score, self.clusters, self.centroids)
                if self.outer_callback:
                    self.outer_callback(self)

        if self.nstart > 1:
            self.score, self.clusters, self.centroids = self.winner