<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>
<h1>orngClustering: Partitional and Agglomerative Clustering</h1>

<index name="modules+clustering">

<p>The module implements k-means partitional clustering and provides a wrapper around Orange's implementation of agglomerative hierarchical clustering. The module also implements a number of useful functions associated with these two clustering methods, such as leaf ordering of the dendrogram and dendrogram plotting.</p>

<h2>KMeans</h2>

<p>Class <code>KMeans</code> provides an implementation of the standard k-means clustering algorithm:</p>
<ol>
<li>Choose the number of clusters, k.</li>
<li>Choose a set of k initial centroids.</li>
<li>Assign each instance in the data set to the closest centroid.</li>
<li>For each cluster, compute a new centroid as the center of the clustered data instances.</li>
<li>Repeat the previous two steps until some convergence criterion is met (e.g., the cluster assignment has not changed).</li>
</ol>

<p>The main advantages of the algorithm are its simplicity and low memory requirements. The principal disadvantage is that the results depend on the selection of the initial set of centroids (a minimal sketch of the core loop is given below).</p>
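
<p>For illustration only, the loop described in the steps above can be sketched in a few lines of plain Python. This is not the module's implementation (which works on Orange example tables and pluggable distance functions); it is a self-contained sketch for lists of numeric vectors with Euclidean distance:</p>

<xmp class=code>import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans_sketch(points, k, maxiters=100):
    # step 2: pick k data points as the initial centroids (random initialization)
    centroids = [list(p) for p in random.sample(points, k)]
    clusters = None
    for iteration in range(maxiters):
        # step 3: assign each point to its closest centroid
        assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        # step 5: stop when the assignment no longer changes
        if assignment == clusters:
            break
        clusters = assignment
        # step 4: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, cl in zip(points, clusters) if cl == c]
            if members:
                centroids[c] = [sum(vals) / float(len(members)) for vals in zip(*members)]
    return clusters, centroids

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
print kmeans_sketch(points, 2)
</xmp>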

<P class=section>Methods</P>
<DL class=attributes>
<DT>__init__(data=None, centroids=3, maxiters=None, minscorechange=None, stopchanges=0, nstart=1, initialization=kmeans_init_random, distance=orange.ExamplesDistanceConstructor_Euclidean, scoring=score_distance_to_centroids, inner_callback=None, outer_callback=None, initialize_only=False)</DT>
<DD><code>data</code> is an Orange ExampleTable that stores the data instances to be clustered. If <code>data</code> is not <code>None</code>, the clustering is executed immediately after the initialization of the clustering parameters, unless <code>initialize_only</code> is set to <code>True</code>. <code>centroids</code> either specifies the number of clusters or provides a list of examples that will serve as clustering centroids. The clustering stops when one of the following conditions is met: the number of clustering iterations exceeds <code>maxiters</code>, the number of instances changing the cluster equals <code>stopchanges</code>, or the score of the current clustering improves by less than <code>minscorechange</code> of the score from the previous iteration. If <code>minscorechange</code> is not set, the score is not computed between iterations. The user can also provide an example distance constructor which, given the data set, returns a function that measures the distance between two example instances (see <a href="../reference/ExamplesDistance.htm">Distances between Examples</a>). <code>initialization</code> is a function that selects the centroids given the table of data instances, k and an example distance function; the module includes implementations of several different approaches. <code>scoring</code> is a function that takes a clustering object as an argument and returns a score for the clustering. It can be used, for instance, in a procedure that repeats the clustering <code>nstart</code> times and returns the clustering with the lowest score (see the example after this method list). The two callbacks are invoked either after every clustering iteration (<code>inner_callback</code>) or after every clustering restart (<code>outer_callback</code>, used when <code>nstart</code> is greater than 1).</DD>
<DT>runone()</DT>
<DD>Runs one clustering iteration, starting with the re-computation of centroids, followed by the computation of data membership (associating data instances with their nearest centroid).</DD>
<DT>run()</DT>
<DD>Runs the clustering until the convergence conditions are met. If <code>nstart</code> is greater than one, <code>nstart</code> runs of the clustering algorithm are executed and the clustering with the best (lowest) score is returned.</DD>
</DL>
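
<p>The following is a minimal sketch (not taken from the module's examples) of using <code>nstart</code> to run several random restarts and keep the clustering with the lowest score under the default <code>score_distance_to_centroids</code> scoring:</p>

<xmp class=code>import orange
import orngClustering

data = orange.ExampleTable("iris")
# ten random restarts; the run with the lowest score is kept
km = orngClustering.KMeans(data, centroids=3, nstart=10)
print "best score: %.4f" % km.score
print "cluster sizes:", [km.clusters.count(c) for c in range(km.k)]
</xmp>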

<p class=section>Attributes</P>
<DL class=attributes>
<DT>k</DT>
<DD>Number of clusters.</DD>
<DT>centroids</DT>
<DD>Current set of centroids.</DD>
<DT>score</DT>
<DD>Current clustering score.</DD>
<DT>iteration</DT>
<DD>Current clustering iteration.</DD>
<DT>clusters</DT>
<DD>A list of cluster indexes. The i-th element is the index of the centroid associated with the i-th data instance from the input data set.</DD>
</DL>

<p>The following code runs k-means clustering on the Iris data set and prints the cluster indexes for the last 10 data instances:</p>

<p class="header">part of <a href="kmeans-run.py">kmeans-run.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, 3)
print km.clusters[-10:]
</xmp>

<p>The output of this code is:</p>
<xmp class=code>[1, 1, 2, 1, 1, 1, 2, 1, 1, 2]
</xmp>

<p>Invoking a call-back function may be useful when tracing the progress of the clustering. Below is code that uses an <code>inner_callback</code> to report the number of instances that have changed the cluster and the clustering score. For the score to be computed at each iteration we have to set <code>minscorechange</code>, but we can leave it at 0 (or even set it to a negative value, which allows the score to deteriorate by some amount).</p>

<p class="header">part of <a href="kmeans-run-callback.py">kmeans-run-callback.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>def callback(km):
    print "Iteration: %d, changes: %d, score: %.4f" % (km.iteration,
        km.nchanges, km.score)

data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, 3, minscorechange=0, inner_callback=callback)
</xmp>

<p>The convergence on the Iris data set is fast:</p>
78<xmp class=code>Iteration: 1, changes: 150, score: 10.9555
79Iteration: 2, changes: 12, score: 10.3867
80Iteration: 3, changes: 2, score: 10.2034
81Iteration: 4, changes: 2, score: 10.0699
82Iteration: 5, changes: 2, score: 9.9542
83Iteration: 6, changes: 1, score: 9.9168
84Iteration: 7, changes: 2, score: 9.8624
85Iteration: 8, changes: 0, score: 9.8624
86</xmp>


<p>The call-back above is used to report progress, but it may as well call a function that plots a selected data projection with the corresponding centroids at a given step of the clustering. This is exactly what is done in the following script:</p>

<p class="header">part of <a href="kmeans-trace.py">kmeans-trace.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>def plot_scatter(data, km, attx, atty, filename="kmeans-scatter", title=None):
    """plot a data scatter plot with the position of centroids"""
    pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4,3]})
    x = [float(d[attx]) for d in data]
    y = [float(d[atty]) for d in data]
    colors = ["c", "w", "b"]
    cs = "".join([colors[c] for c in km.clusters])
    pylab.scatter(x, y, c=cs, s=10)

    xc = [float(d[attx]) for d in km.centroids]
    yc = [float(d[atty]) for d in km.centroids]
    pylab.scatter(xc, yc, marker="x", c="k", s=200)

    pylab.xlabel(attx)
    pylab.ylabel(atty)
    if title:
        pylab.title(title)
    pylab.savefig("%s-%03d.png" % (filename, km.iteration))
    pylab.close()

def in_callback(km):
    print "Iteration: %d, changes: %d, score: %8.6f" % (km.iteration, km.nchanges, km.score)
    plot_scatter(data, km, "petal width", "petal length", title="Iteration %d" % km.iteration)

data = orange.ExampleTable("iris")
random.seed(42)
km = orngClustering.KMeans(data, 3, minscorechange=0, maxiters=10, inner_callback=in_callback)</xmp>

<p>Only the first four scatterplots are shown below. The colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular two-dimensional projection is not necessarily the centroid of the cluster that the data point belongs to.</p>

<table>
<tr>
<td><img src="kmeans-scatter-001.png"></td>
<td><img src="kmeans-scatter-002.png"></td>
</tr>
<tr>
<td><img src="kmeans-scatter-003.png"></td>
<td><img src="kmeans-scatter-004.png"></td>
</tr>
</table>


<h2>k-Means Utility Functions</h2>

<dl class="attributes">
<dt>kmeans_init_random(data, k, _)</dt>
<dd class="ddfun">A function that can be used to initialize k-means clustering; it returns <code>k</code> randomly chosen data instances from <code>data</code>. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).</dd>
<dt>kmeans_init_diversity(data, k, distfun)</dt>
<dd class="ddfun">Another function that can be used for the initialization of k-means clustering. Given the data set, the number of clusters and a distance function, it returns a set of centroids where the first one is the data point farthest away from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previously selected centroids is maximal. This type of initialization is almost identical to the initialization proposed by Katsavounidis et al. (1994), with the only difference being in the selection of the first centroid (they use the data instance with the biggest norm).</dd>
<dt>KMeans_init_hierarchicalClustering(n=100)</dt>
<dd class="ddfun">A class that returns a clustering initialization function which, given the data, k, and a distance function, samples <code>n</code> data instances, performs hierarchical clustering, uses it to infer <code>k</code> clusters, and computes a list of cluster-based data centers.</dd>
<dt>data_center(data)</dt>
<dd class="ddfun">Returns the center of the instances in the data set (the average across data instances for continuous attributes, the most frequent value for discrete attributes); see the short example after this list.</dd>
<dt>score_distance_to_centroids(kmeans)</dt>
<dd class="ddfun">Returns the average distance of data instances to their associated centroids. <code>kmeans</code> is a k-means clustering object.</dd>
<dt>score_silhouette(kmeans, index=None)</dt>
<dd class="ddfun">Returns the average silhouette score of the data instances. If <code>index</code> is specified, it instead returns the silhouette score of that particular data instance. <code>kmeans</code> is a k-means clustering object.</dd>
<dt>score_fastsilhouette(kmeans, index=None)</dt>
<dd class="ddfun">Same as <code>score_silhouette</code>, but computes an approximation and is faster.</dd>
<dt>plot_silhouette(kmeans, filename='tmp.png', fast=False)</dt>
<dd class="ddfun">Saves a silhouette plot to <code>filename</code>, showing the distributions of silhouette scores in clusters. <code>kmeans</code> is a k-means clustering object. If <code>fast</code> is True, <code>score_fastsilhouette</code> is used to compute the scores instead of <code>score_silhouette</code>.</dd>
</dl>
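
<p>For reference, the silhouette of an instance i is the standard quantity s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other instances in its own cluster and b(i) is the lowest average distance from i to the instances of another cluster. The following minimal sketch (not taken from the module's examples) exercises two of the functions above on the Iris data set:</p>

<xmp class=code>import orange
import orngClustering

data = orange.ExampleTable("iris")
# the "average" instance of the data set
print orngClustering.data_center(data)

# an approximate (faster) silhouette score of a k-means clustering
km = orngClustering.KMeans(data, 3)
print orngClustering.score_fastsilhouette(km)
</xmp>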

<p>Typically, the choice of seeds has a large impact on k-means clustering: better initialization methods yield a clustering that converges faster and finds better centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge:</p>

<p class="header">part of <a href="kmeans-cmp-init.py">kmeans-cmp-init.py</a> (uses <a href="iris.tab">iris.tab</a>, <a href="housing.tab">housing.tab</a>, <a href="vehicle.tab">vehicle.tab</a>)</p>
<xmp class=code>import orange
import orngClustering
import random

data_names = ["iris", "housing", "vehicle"]
data_sets = [orange.ExampleTable(name) for name in data_names]

print "%10s %3s %3s %3s" % ("", "Rnd", "Div", "HC")
for data, name in zip(data_sets, data_names):
    random.seed(42)
    km_random = orngClustering.KMeans(data, centroids = 3)
    km_diversity = orngClustering.KMeans(data, centroids = 3, \
        initialization=orngClustering.kmeans_init_diversity)
    km_hc = orngClustering.KMeans(data, centroids = 3, \
        initialization=orngClustering.KMeans_init_hierarchicalClustering(n=100))
    print "%10s %3d %3d %3d" % (name, km_random.iteration, km_diversity.iteration, km_hc.iteration)
</xmp>

<p>The results show that diversity-based and clustering-based initialization make k-means converge faster than a random selection of seeds (as expected):</p>

<xmp class=code>           Rnd Div  HC
      iris  12   3   4
   housing  14   6   4
   vehicle  11   4   3
</xmp>

<p>The following code computes the silhouette score for six different clusterings (k=2 to 7), and at the end plots a silhouette plot for k=3.</p>

<p class="header"><a href="kmeans-silhouette.py">kmeans-silhouette.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>import orange
import orngClustering

data = orange.ExampleTable("iris")
for k in range(2,8):
    km = orngClustering.KMeans(data, k, initialization=orngClustering.kmeans_init_diversity)
    score = orngClustering.score_silhouette(km)
    print k, score

km = orngClustering.KMeans(data, 3, initialization=orngClustering.kmeans_init_diversity)
orngClustering.plot_silhouette(km, "kmeans-silhouette.png")
</xmp>

<p>The analysis suggests that the clustering with k=2 is preferred, as it yields the maximal silhouette coefficient:</p>

<xmp class=code>2 0.629467553352
3 0.504318855054
4 0.407259377854
5 0.358628975081
6 0.353228492088
7 0.366357876944
</xmp>

<p>The silhouette plot for k=3 is given below:</p>

<img src="kmeans-silhouette.png">

<h2>Hierarchical Clustering</h2>

<dl class="attributes">
<dt>hierarchicalClustering(data, distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None)</dt>
<dd class="ddfun">Returns an object describing the hierarchical clustering of a given data set. This is a wrapper function around the <a href="../reference/clustering.htm">HierarchicalClustering</a> class implemented in Orange. <code>hierarchicalClustering</code> uses the <code>distanceConstructor</code> function (see <a href="../reference/ExamplesDistance.htm">Distances between Examples</a>) to construct a distance matrix, which is then passed to Orange's hierarchical clustering algorithm, along with the chosen linkage method. Ordering of leaves can be requested (<code>order=True</code>) and if so, the leaves will be ordered using the <code>orderLeaves</code> function (see below; a minimal usage sketch is given after this list).</dd>

<dt>orderLeaves(root, distanceMatrix)</dt>
<dd class="ddfun">Given an object with hierarchical clustering (the root node of the tree) and a distance matrix, the function uses the fast optimal leaf ordering of Bar-Joseph et al. to impose an order on the branches of the dendrogram so that the distance between neighboring leaves is minimized.</dd>


<dt>hierarchicalClustering_topClusters(root, k)</dt>
<dd class="ddfun">Returns the k topmost clusters (the top k nodes of the clustering tree) from the hierarchical clustering.</dd>

<dt>hierarhicalClustering_topClustersMembership(root, k)</dt>
<dd class="ddfun">Returns a list of cluster indexes, one for each data instance, indicating which of the top k clusters the instance belongs to.</dd>
</dl>
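
<p>As a minimal sketch of these wrappers (not part of the original examples), the following requests leaf ordering explicitly and extracts the top three clusters and the corresponding membership list:</p>

<xmp class=code>import orange
import orngClustering

data = orange.ExampleTable("iris")
# cluster the data and order the leaves of the dendrogram
root = orngClustering.hierarchicalClustering(data, order=True)
# the three topmost clusters and the cluster membership of each instance
top = orngClustering.hierarchicalClustering_topClusters(root, 3)
membership = orngClustering.hierarhicalClustering_topClustersMembership(root, 3)
print len(top), membership[:10]
</xmp>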


<p>Using <code>hierarchicalClustering</code>, scripts need a single line of code to invoke the clustering and get an object with the result. This is demonstrated in the following script, which considers the Iris data set, performs hierarchical clustering, and then plots the data in a two-attribute projection, coloring the points that represent data instances according to cluster membership.</p>

<p class="header">part of <a href="hclust-iris.py">hclust-iris.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>def plot_scatter(data, cls, attx, atty, filename="hclust-scatter", title=None):
    """plot a data scatter plot, coloring the points by cluster membership"""
    pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4,3]})
    x = [float(d[attx]) for d in data]
    y = [float(d[atty]) for d in data]
    colors = ["c", "w", "b"]
    cs = "".join([colors[c] for c in cls])
    pylab.scatter(x, y, c=cs, s=10)

    pylab.xlabel(attx)
    pylab.ylabel(atty)
    if title:
        pylab.title(title)
    pylab.savefig("%s.png" % filename)
    pylab.close()

data = orange.ExampleTable("iris")
root = orngClustering.hierarchicalClustering(data)
n = 3
cls = orngClustering.hierarhicalClustering_topClustersMembership(root, n)
plot_scatter(data, cls, "sepal width", "sepal length", title="Hierarchical clustering (%d clusters)" % n)
</xmp>

<p>The output of the script is the following plot:</p>

<img src="hclust-scatter.png">


<h2>DendrogramPlot</h2>
<p>Class <code>DendrogramPlot</code> implements visualization of the clustering tree (the dendrogram) and a corresponding heat map visualization of attribute values.</p>

<p class=section>Methods</p>
<dl class=attributes>
<dt>__init__(tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, text_width=None, heatmap_width=None, spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, renderer=EPSRenderer)</dt>
<dd><code>tree</code> is an Orange hierarchical clustering tree object (root node), and <code>attr_tree</code> an optional attribute clustering root node. If <code>data</code> (an Orange ExampleTable) is given, the dendrogram will include a heat map with a color-based presentation of attribute values. The length of the data set should match the number of leaves in the hierarchical clustering tree. The next arguments define the height and width of the plot areas. Branches are plotted in black, but may be colored to visually expose the clusters. The coloring is specified with <code>cluster_colors</code>, a dictionary with cluster instances as keys and (r, g, b) tuples as values. <code>color_palette</code> specifies the palette to use for the heat map, and <code>minv</code> and <code>maxv</code> the minimum and maximum data values to plot (see the example after this list).</dd>
<dt>set_matrix_color_schema(color_palette, minv, maxv)</dt>
<dd>Sets the heat map color schema. <code>color_palette</code> can be an instance of the ColorPalette class or a list of (r, g, b) tuples. <code>minv</code> and <code>maxv</code> specify the cutoff values for the heat map (values below and above the interval will be painted with <code>color_palette.underflow</code> and <code>color_palette.overflow</code>, respectively).</dd>
<dt>plot(filename="graph.eps")</dt>
<dd>Plots the dendrogram and saves it to the output file.</dd>
</dl>
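
<p>The examples below use the <code>dendrogram_draw</code> convenience function described next. The class can also be used directly; the following is a minimal sketch based only on the signatures documented above (the output file name and the heat map cutoffs are arbitrary choices for this illustration):</p>

<xmp class=code>import orange
import orngClustering

data = orange.ExampleTable("iris")
sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0)
root = orngClustering.hierarchicalClustering(sample)

# build the plot object, then render it (EPS is the default renderer)
dp = orngClustering.DendrogramPlot(root, data=sample,
    labels=[str(d.getclass()) for d in sample])
dp.set_matrix_color_schema([(0, 255, 0), (0, 0, 0), (255, 0, 0)], 0.0, 8.0)
dp.plot("hclust-dendrogram-class.eps")
</xmp>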

<p>Additionally, a module-level convenience function <code>dendrogram_draw</code> is provided to streamline the drawing process.</p>
<dl class="attributes">
<dt>dendrogram_draw(filename, tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, text_width=None, heatmap_width=None, spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None)</dt>
<dd>Draws the dendrogram and saves it to <code>filename</code> (supported formats: PNG, EPS, SVG).</dd>
</dl>

<p>To illustrate dendrogram plotting, the following script uses it on a subset of 20 instances from the Iris data set. The value of the class variable is used for labeling the leaves (it is, of course, not used for the clustering; only the non-class attributes are used to compute the instance distance matrix).</p>

<p class="header">part of <a href="hclust-dendrogram.py">hclust-dendrogram.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>data = orange.ExampleTable("iris")
sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0)
root = orngClustering.hierarchicalClustering(sample)
orngClustering.dendrogram_draw("hclust-dendrogram.png", root, data=sample, labels=[str(d.getclass()) for d in sample])
</xmp>

<p>The resulting dendrogram is shown below.</p>

<img src="hclust-dendrogram.png">

<p>The following script is similar to the one above, but this time we have 1) distinctly colored the three topmost dendrogram branches, 2) used a custom color schema for the representation of attribute values (spanning red - black - green, with custom <code>gamma</code>, <code>minv</code> and <code>maxv</code>), and 3) included only two attributes in the heat map presentation (note: the clustering is still done on all of the data set's attributes).</p>

<p class="header">part of <a href="hclust-colored-dendrogram.py">hclust-colored-dendrogram.py</a> (uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class=code>data = orange.ExampleTable("iris")
sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0)
root = orngClustering.hierarchicalClustering(sample)
reduced = orange.ExampleTable(orange.Domain(sample.domain[:2], False), sample)

my_colors = [(255,0,0), (0,255,0), (0,0,255)]
cls = orngClustering.hierarchicalClustering_topClusters(root, 3)
colors = dict([(cl, col) for cl, col in zip(cls, my_colors)])

orngClustering.dendrogram_draw("hclust-colored-dendrogram.png", root, data = reduced, labels=[str(d.getclass()) for d in sample],
    cluster_colors=colors, color_palette=[(0, 255, 0), (0, 0, 0), (255, 0, 0)], gamma=0.5, minv=2.0, maxv=7.0)
</xmp>

<p>Our "colored" dendrogram is now saved as shown in the figure below:</p>

<img src="hclust-colored-dendrogram.png">


<h2>References</h2>

<p>Forgy E (1965) Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics 21(3): 768-769.</p>

<p>He J, Lan M, Tan C-L, Sung S-Y, Low H-B (2004) <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.7091">Initialization of cluster refinement algorithms: A review and comparative study</a>. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary.</p>

<p>Katsavounidis I, Kuo C-C J, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10): 144-146.</p>

<p>Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) <a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S22">Fast optimal leaf ordering for hierarchical clustering</a>. Bioinformatics 17(Suppl. 1): S22-S29.</p>
</body>
</html>