source: orange/Orange/clustering/kmeans.py @ 9724:318e91106d47

1"""
2*******************************
3K-means clustering (``kmeans``)
4*******************************
5
6.. index::
7   single: clustering, kmeans
8.. index:: agglomerative clustering
9
10
11.. autoclass:: Orange.clustering.kmeans.Clustering
12   :members:
13
14Examples
15========
16
17The following code runs k-means clustering and prints out the cluster indexes
18for the last 10 data instances (:download:`kmeans-run.py <code/kmeans-run.py>`, uses :download:`iris.tab <code/iris.tab>`):
19
20.. literalinclude:: code/kmeans-run.py
21
22The output of this code is::
23
24    [1, 1, 2, 1, 1, 1, 2, 1, 1, 2]
25
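A minimal sketch of what such a script might contain (an assumption based on
the API of this module; the downloadable kmeans-run.py is authoritative)::

    import orange
    from Orange.clustering import kmeans

    data = orange.ExampleTable("iris")
    km = kmeans.Clustering(data, centroids=3)
    print km.clusters[-10:]
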
Invoking a callback function may be useful for tracing the progress of the clustering.
Below is code that uses an :obj:`inner_callback` to report the number of instances
that have changed clusters and the clustering score. For the score
to be computed at each iteration we have to set :obj:`minscorechange`, but we can
leave it at 0 or even set it to a negative value, which allows the score to deteriorate
by some amount (:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-run-callback.py

The convergence on the Iris data set is fast::

    Iteration: 1, changes: 150, score: 10.9555
    Iteration: 2, changes: 12, score: 10.3867
    Iteration: 3, changes: 2, score: 10.2034
    Iteration: 4, changes: 2, score: 10.0699
    Iteration: 5, changes: 2, score: 9.9542
    Iteration: 6, changes: 1, score: 9.9168
    Iteration: 7, changes: 2, score: 9.8624
    Iteration: 8, changes: 0, score: 9.8624

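The callback itself might look like the following sketch (``iteration``,
``nchanges`` and ``score`` are attributes that :class:`Clustering` maintains
during the run; the actual kmeans-run-callback.py may differ in details)::

    import orange
    from Orange.clustering import kmeans

    def callback(km):
        # km.iteration is 0 at the initial callback, hence the +1
        print "Iteration: %d, changes: %d, score: %.4f" % (km.iteration + 1, km.nchanges, km.score)

    data = orange.ExampleTable("iris")
    km = kmeans.Clustering(data, 3, minscorechange=0, inner_callback=callback)
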
The callback above is used to report progress, but it might just as well call a function that plots a selected data projection with the corresponding centroids at a given step of the clustering. This is exactly what we did with the following script (:download:`kmeans-trace.py <code/kmeans-trace.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-trace.py

Only the first four scatterplots are shown below. Colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular 2-dimensional projection is not necessarily also the centroid of the cluster that the data point belongs to.


.. image:: files/kmeans-scatter-001.png

.. image:: files/kmeans-scatter-002.png

.. image:: files/kmeans-scatter-003.png

.. image:: files/kmeans-scatter-004.png


k-Means Utility Functions
=========================

.. automethod:: Orange.clustering.kmeans.init_random

.. automethod:: Orange.clustering.kmeans.init_diversity

.. autoclass:: Orange.clustering.kmeans.init_hclustering
   :members:

.. automethod:: Orange.clustering.kmeans.plot_silhouette

.. automethod:: Orange.clustering.kmeans.score_distance_to_centroids

.. automethod:: Orange.clustering.kmeans.score_silhouette

.. automethod:: Orange.clustering.kmeans.score_fast_silhouette

Typically, the choice of seeds has a large impact on k-means clustering,
with better initialization methods yielding clusterings that converge faster
and find better centroids. The following code compares three different
initialization methods (random, diversity-based and hierarchical clustering-based)
in terms of how fast they converge (:download:`kmeans-cmp-init.py <code/kmeans-cmp-init.py>`, uses :download:`iris.tab <code/iris.tab>`,
:download:`housing.tab <code/housing.tab>`, :download:`vehicle.tab <code/vehicle.tab>`):

.. literalinclude:: code/kmeans-cmp-init.py

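Such a comparison might be sketched as follows (an assumption; the actual
kmeans-cmp-init.py, which produced the table below, may differ)::

    import orange
    from Orange.clustering import kmeans

    inits = [("Rnd", kmeans.init_random),
             ("Div", kmeans.init_diversity),
             ("HC", kmeans.init_hclustering(n=100))]
    for name in ["iris", "housing", "vehicle"]:
        data = orange.ExampleTable(name)
        print name, [kmeans.Clustering(data, 3, initialization=init).iteration
                     for _, init in inits]
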
As expected, k-means converges faster with diversity-based and clustering-based
initialization than with random seed selection::

               Rnd Div  HC
          iris  12   3   4
       housing  14   6   4
       vehicle  11   4   3

The following code computes the silhouette score for k=2..7 and plots a
silhouette plot for k=3 (:download:`kmeans-silhouette.py <code/kmeans-silhouette.py>`, uses :download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/kmeans-silhouette.py

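The scoring loop might look roughly like this (a sketch; the actual
kmeans-silhouette.py also draws the plot, e.g. with :func:`plot_silhouette`)::

    import orange
    from Orange.clustering import kmeans

    data = orange.ExampleTable("iris")
    for k in range(2, 8):
        km = kmeans.Clustering(data, k)
        print k, kmeans.score_silhouette(km)
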
The analysis suggests that k=2 is preferred as it yields
the maximal silhouette coefficient::

    2 0.629467553352
    3 0.504318855054
    4 0.407259377854
    5 0.358628975081
    6 0.353228492088
    7 0.366357876944

.. figure:: files/kmeans-silhouette.png

   Silhouette plot for k=3.

"""

import math
import sys
import orange
import random
import statc

import Orange.clustering.hierarchical

# miscellaneous functions

def _modus(dist):
    # Check bool(dist) - False means no known cases.
    # Check dist.cases > 0 - we can't return a value from the domain without
    # knowing whether it is even present in the data.
    # TODO: What does this mean for k-means convergence?
    if bool(dist) and dist.cases > 0:
        return dist.modus()
    else:
        return None

def data_center(data):
    """
    Returns a center of the instances in the data set (average across data instances for continuous attributes, most frequent value for discrete attributes).
    """
    atts = data.domain.attributes
    astats = orange.DomainBasicAttrStat(data)
    center = [astats[a].avg if a.varType == orange.VarTypes.Continuous
              else _modus(orange.Distribution(a, data)) if a.varType == orange.VarTypes.Discrete
              else None
              for a in atts]
    if data.domain.classVar:
        # append a dummy class value so the example has a value for every variable
        center.append(0)
    return orange.Example(data.domain, center)

def minindex(x):
    """Return the index of the minimum element"""
    return x.index(min(x))

def avg(x):
    """Return the average (mean) of a given list"""
    return (float(sum(x)) / len(x)) if x else 0

#
# data distances
#

# k-means clustering

# clustering scoring functions

def score_distance_to_centroids(km):
    """Returns the sum of distances of data instances to their associated centroids.

    :param km: a k-means clustering object.
    :type km: :class:`Clustering`
    """
    return sum(km.distance(km.centroids[km.clusters[i]], d) for i, d in enumerate(km.data))

score_distance_to_centroids.minimize = True

def score_conditionalEntropy(km):
    """UNIMPLEMENTED cluster quality measured by conditional entropy"""
    pass

def score_withinClusterDistance(km):
    """UNIMPLEMENTED weighted average within-cluster pairwise distance"""
    pass

score_withinClusterDistance.minimize = True

def score_betweenClusterDistance(km):
    """Sum of distances from elements to 'nearest miss' centroids"""
    return sum(min(km.distance(c, d) for j, c in enumerate(km.centroids) if j != km.clusters[i]) for i, d in enumerate(km.data))

def score_silhouette(km, index=None):
    """Returns an average silhouette score of data instances.

    :param km: a k-means clustering object.
    :type km: :class:`Clustering`

    :param index: if given, the function returns just the silhouette score of that particular data instance.
    :type index: integer
    """

    if index is None:
        return avg([score_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    # a: average distance to the other instances in the same cluster
    a = avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
             km.clusters[i] == cind and i != index])
    # b: average distance to the instances of the nearest other cluster
    b = min([avg([km.distance(km.data[index], ex) for i, ex in enumerate(km.data) if
                  km.clusters[i] == c])
             for c in range(len(km.centroids)) if c != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

def score_fast_silhouette(km, index=None):
    """Same as score_silhouette, but computes an approximation (distances to centroids rather than to all instances) and is faster.

    :param km: a k-means clustering object.
    :type km: :class:`Clustering`
    """

    if index is None:
        return avg([score_fast_silhouette(km, i) for i in range(len(km.data))])
    cind = km.clusters[index]
    a = km.distance(km.data[index], km.centroids[km.clusters[index]])
    b = min([km.distance(km.data[index], c) for i, c in enumerate(km.centroids) if i != cind])
    return float(b - a) / max(a, b) if max(a, b) > 0 else 0.0

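# Example (sketch): a scoring function can be passed to Clustering to pick the
# best of several restarts; score_silhouette has no 'minimize' attribute, so
# it is maximized across restarts:
#
#   km = Clustering(data, 3, nstart=10, scoring=score_silhouette)
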
def compute_bic(km):
    """Compute the Bayesian information criterion (BIC) score for the given clustering."""
    data = km.data
    K = km.k
    M = len(data.domain.attributes)
    R = float(len(data))
    Ri = [km.clusters.count(i) for i in range(K)]
    numFreePar = (M + 1.) * K * math.log(R, 2.) / 2.
    # pooled variance estimate, sigma**2
    s2 = 0.
    cidx = [i for i, attr in enumerate(data.domain.attributes)
            if attr.varType in [orange.VarTypes.Continuous, orange.VarTypes.Discrete]]
    for x, cind in zip(data, km.clusters):
        centroid = km.centroids[cind]
        s2 += sum((float(x[i]) - float(centroid[i])) ** 2 for i in cidx)
    s2 /= (R - K)
    if s2 < 1e-20:
        return None, [None] * K
    # ldn is the log-likelihood of cluster n; ld is the log-likelihood of the clustering
    ld = 0
    bicc = []
    for k in range(K):
        ldn = -1. * Ri[k] * ((math.log(2. * math.pi, 2) / -2.) - (M * math.log(s2, 2) / 2.) + (K / 2.) + math.log(Ri[k], 2) - math.log(R, 2))
        ld += ldn
        bicc.append(ldn - numFreePar)
    return ld - numFreePar, bicc

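# Example (sketch): BIC can be used to compare clusterings with different k.
# The returned value is the clustering log-likelihood minus a penalty for free
# parameters, so under this formulation larger values should be preferred:
#
#   for k in range(2, 8):
#       km = Clustering(orange.ExampleTable("iris"), k)
#       print k, compute_bic(km)[0]
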

#
# silhouette plot
#

def plot_silhouette(km, filename='tmp.png', fast=False):
    """Saves a silhouette plot to filename, showing the distributions of silhouette scores in clusters.

    :param km: a k-means clustering object.
    :type km: :class:`Clustering`
    :param filename: name of output plot.
    :type filename: string
    :param fast: if True, use :func:`score_fast_silhouette` to compute scores instead of :func:`score_silhouette`.
    :type fast: boolean

    """
    import matplotlib.pyplot as plt
    plt.figure()
    scoring = score_fast_silhouette if fast else score_silhouette
    scores = [[] for i in range(km.k)]
    for i, c in enumerate(km.clusters):
        scores[c].append(scoring(km, i))
    csizes = map(len, scores)
    cpositions = [sum(csizes[:i]) + (i+1)*3 + csizes[i]/2 for i in range(km.k)]
    scores = reduce(lambda x, y: x + [0]*3 + sorted(y), scores, [])
    plt.barh(range(len(scores)), scores, linewidth=0, color='c')
    plt.yticks(cpositions, map(str, range(km.k)))
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette value')
    plt.savefig(filename)

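# Example (sketch): save a silhouette plot of a 3-cluster clustering of iris
# (requires matplotlib):
#
#   km = Clustering(orange.ExampleTable("iris"), 3)
#   plot_silhouette(km, "silhouette-iris.png")
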
# clustering initialization (seeds)
# initialization functions should be of the type f(data, k, distfun)

def init_random(data, k, _):
    """A function that can be used for initialization of k-means clustering; returns k data instances chosen at random from the data. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    """
    return data.getitems(random.sample(range(len(data)), k))

def init_diversity(data, k, distfun):
    """A function that can be used for initialization of k-means clustering. Returns a set of centroids where the first one is the data point farthest away from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previous set of centroids is maximal. Differs from the initialization proposed by Katsavounidis et al. (1994) only in the selection of the first centroid (where they use a data instance with the highest norm).

    :param data: data instances.
    :type data: :class:`orange.ExampleTable`
    :param k: the number of clusters.
    :type k: integer
    :param distfun: a distance function.
    :type distfun: :class:`Orange.distance.Distance`
    """
    center = data_center(data)
    # the first seed should be the farthest point from the center
    seeds = [max([(distfun(d, center), d) for d in data])[1]]
    # other seeds are added iteratively, as the data points farthest from the current set of seeds
    for i in range(1, k):
        seeds.append(max([(min([distfun(d, s) for s in seeds]), d) for d in data if d not in seeds])[1])
    return seeds

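# Example (sketch): initialization functions are passed to Clustering via the
# 'initialization' argument, assuming 'data' is an orange.ExampleTable:
#
#   km = Clustering(data, 3, initialization=init_diversity)
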
class init_hclustering():
    """
    A class that returns a clustering initialization function which performs
    hierarchical clustering on a sample of the data, uses it to infer k clusters,
    and computes a list of cluster-based data centers.
    """

    def __init__(self, n=100):
        """
        :param n: number of data instances to sample.
        :type n: integer
        """
        self.n = n

    def __call__(self, data, k, distfun):
        """
        :param data: data instances.
        :type data: :class:`orange.ExampleTable`
        :param k: the number of clusters.
        :type k: integer
        :param distfun: a distance function.
        :type distfun: :class:`Orange.distance.Distance`
        """
        sample = orange.ExampleTable(random.sample(data, min(self.n, len(data))))
        root = Orange.clustering.hierarchical.clustering(sample)
        cmap = Orange.clustering.hierarchical.top_clusters(root, k)
        return [data_center(orange.ExampleTable([sample[e] for e in cl])) for cl in cmap]

#
# k-means clustering, main implementation
#

class Clustering:
    """Implements the k-means clustering algorithm:

    #. Choose the number of clusters, k.
    #. Choose a set of k initial centroids.
    #. Assign each instance in the data set to the closest centroid.
    #. For each cluster, compute a new centroid as a center of clustered
       data instances.
    #. Repeat the previous two steps until some convergence criterion is
       met (e.g., the cluster assignment has not changed).

    The main advantages of this algorithm are simplicity and low memory
    requirements. The principal disadvantage is the dependence of results
    on the selection of the initial set of centroids.

    .. attribute:: k

        Number of clusters.

    .. attribute:: data

        Instances to cluster.

    .. attribute:: centroids

        Current set of centroids.

    .. attribute:: score

        Current clustering score.

    .. attribute:: iteration

        Current clustering iteration.

    .. attribute:: clusters

        A list of cluster indexes. The i-th element provides an
        index to a centroid associated with the i-th data instance from the input
        data set.
    """

    def __init__(self, data=None, centroids=3, maxiters=None, minscorechange=None,
                 stopchanges=0, nstart=1, initialization=init_random,
                 distance=Orange.distance.Euclidean,
                 scoring=score_distance_to_centroids, inner_callback=None,
                 outer_callback=None):
        """
        :param data: Data instances to be clustered. If not None, clustering will be executed immediately after initialization.
        :type data: :class:`orange.ExampleTable` or None
        :param centroids: either specify the number of clusters or provide a list of examples that will serve as clustering centroids.
        :type centroids: integer or a list of :class:`orange.Example`
        :param nstart: If greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        :type nstart: integer
        :param distance: an example distance constructor, which measures the distance between two instances.
        :type distance: :class:`Orange.distance.DistanceConstructor`
        :param initialization: a function to select centroids given data instances, k, and an example distance function. This module implements different approaches (:func:`init_random`, :func:`init_diversity`, :class:`init_hclustering`).
        :param scoring: a function that takes a clustering object and returns the clustering score. It could be used, for instance, in a procedure that repeats the clustering nstart times, returning the clustering with the lowest score.
        :param inner_callback: invoked after every clustering iteration.
        :param outer_callback: invoked after every clustering restart (if nstart is greater than 1).

        Stopping criteria:

        :param maxiters: maximum number of clustering iterations.
        :type maxiters: integer
        :param minscorechange: minimal improvement of the score from the previous iteration (if lower, the clustering will stop). If None, the score will not be computed between iterations.
        :type minscorechange: float or None
        :param stopchanges: if the number of instances changing cluster is lower than or equal to stopchanges, stop the clustering.
        :type stopchanges: integer
        """

        self.data = data
        self.k = centroids if type(centroids) == int else len(centroids)
        self.centroids = centroids if type(centroids) == orange.ExampleTable else None
        self.maxiters = maxiters
        self.minscorechange = minscorechange
        self.stopchanges = stopchanges
        self.nstart = nstart
        self.initialization = initialization
        self.distance_constructor = distance
        self.distance = self.distance_constructor(self.data) if self.data else None
        self.scoring = scoring
        self.minimize_score = hasattr(scoring, 'minimize')
        self.inner_callback = inner_callback
        self.outer_callback = outer_callback
        if self.data:
            self.run()

    def __call__(self, data=None):
        """Runs the k-means clustering algorithm, optionally with new data."""
        if data:
            self.data = data
            self.distance = self.distance_constructor(self.data)
        self.run()

    def init_centroids(self):
        """Initialize cluster centroids"""
        if self.centroids and not self.nstart > 1: # centroids were specified
            return
        self.centroids = self.initialization(self.data, self.k, self.distance)

    def compute_centeroid(self, data):
        """Return a centroid of the data set."""
        return data_center(data)

    def compute_cluster(self):
        """Calculate membership in clusters: assign each instance to its nearest centroid."""
        return [minindex([self.distance(s, d) for s in self.centroids]) for d in self.data]

    def runone(self):
        """Runs a single clustering iteration, starting with re-computation of centroids, followed by computation of data membership (associating data instances to their nearest centroid)."""
        self.centroids = [self.compute_centeroid(self.data.getitems(
            [i for i, c in enumerate(self.clusters) if c == cl])) for cl in range(self.k)]
        self.clusters = self.compute_cluster()

    def run(self):
        """
        Runs clustering until the convergence conditions are met. If nstart is greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
        """
        self.winner = None
        for startindx in range(self.nstart):
            self.init_centroids()
            self.clusters = old_cluster = self.compute_cluster()
            if self.minscorechange is not None:
                self.score = old_score = self.scoring(self)
            self.nchanges = len(self.data)
            self.iteration = 0
            stopcondition = False
            if self.inner_callback:
                self.inner_callback(self)
            while not stopcondition:
                self.iteration += 1
                self.runone()
                self.nchanges = sum(map(lambda x, y: x != y, old_cluster, self.clusters))
                old_cluster = self.clusters
                if self.minscorechange is not None:
                    self.score = self.scoring(self)
                    # relative score change; the sign is flipped for minimized
                    # scores so that a positive value always means improvement
                    scorechange = (self.score - old_score) / old_score if old_score > 0 else self.minscorechange
                    if self.minimize_score:
                        scorechange = -scorechange
                    old_score = self.score
                stopcondition = (self.nchanges <= self.stopchanges or
                                 self.iteration == self.maxiters or
                                 (self.minscorechange is not None and
                                  scorechange <= self.minscorechange))
                if self.inner_callback:
                    self.inner_callback(self)
            if self.scoring and self.minscorechange is None:
                self.score = self.scoring(self)
            if self.nstart > 1:
                if not self.winner or (self.score < self.winner[0] if
                        self.minimize_score else self.score > self.winner[0]):
                    self.winner = (self.score, self.clusters, self.centroids)
                if self.outer_callback:
                    self.outer_callback(self)

        if self.nstart > 1:
            self.score, self.clusters, self.centroids = self.winner

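# Example (sketch): run ten random restarts and keep the best clustering by
# the default (minimized) distance-to-centroids score:
#
#   data = orange.ExampleTable("iris")
#   best = Clustering(data, 3, nstart=10)
#   print best.score, best.clusters[:10]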