<html>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>

<index name="clustering">
<H1>Clustering</H1>

<P>Core Orange only provides <INDEX>hierarchical clustering</INDEX>, limited to single, complete, average and Ward linkage. Other clustering methods and associated utility functions are provided in the <a href="../modules/orngClustering.htm">orngClustering</a> module. See also <A href="http://www.ailab.si/aleks/orng/">Orange extensions</A> by Aleks Jakulin for k-medoid and fuzzy clustering.</P>

<h2>Hierarchical Clustering</h2>

<H3>Classes</H3>

<P>The method for hierarchical clustering, encapsulated in the class <CODE><INDEX name="classes/HierarchicalClustering">HierarchicalClustering</INDEX></CODE>, works on a distance matrix stored as <a href="SymMatrix.htm"><CODE>SymMatrix</CODE></a>. The method runs in approximately O(n<SUP>2</SUP>) time (with a worst case of O(n<SUP>3</SUP>)). For orientation, clustering ten thousand elements should take roughly 15 seconds on a 2 GHz computer. The algorithm can either make a copy of the distance matrix and work on it, or work on the original matrix, destroying it in the process. The latter is useful for clustering a larger number of objects. Since the distance matrix stores (n+1)(n+2)/2 floats (about 2 MB for 1000 objects and 200 MB for 10000, assuming a float takes 4 bytes), copying it would quickly exhaust physical memory. Using virtual memory is not an option since the matrix is accessed in a random manner.</P>
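
<P>As a quick back-of-the-envelope check of these figures, the memory taken by the matrix can be estimated as follows (the helper function is ours, for illustration only; it is not part of Orange).</P>
<XMP class=code># (n+1)(n+2)/2 floats at 4 bytes each, expressed in millions of bytes
def matrixMemoryMB(n):
    return (n+1) * (n+2) / 2.0 * 4 / 1e6

print "%.1f MB" % matrixMemoryMB(1000)    # about 2 MB
print "%.1f MB" % matrixMemoryMB(10000)   # about 200 MB
</XMP>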

<P>The distance matrix should contain no negative elements. This limitation is due to implementation details of the algorithm (it is not absolutely necessary and can be lifted in future versions if often requested; it only helps the algorithm run a bit faster). The elements on the diagonal (representing an element's distance from itself) are ignored.</P>

<P>The distance matrix can have the attribute <CODE>objects</CODE> describing the objects we are clustering (this is available only in Python). This can be any sequence of the same length as the matrix - an <CODE>ExampleTable</CODE>, a list of examples, a list of attributes (if you're clustering attributes), or even a string of the correct length. This attribute is not used in clustering but is only passed to the clusters' attribute <CODE>mapping</CODE> (see below), which will hold a reference to it (if you modify the list, the changes will affect the clusters as well).</P>

<P class=section>Attributes of <CODE>HierarchicalClustering</CODE></P>
<DL class=attributes>
<DT>linkage</DT>
<DD>Specifies the linkage method, which can be one of
<ol>
  <li><CODE>HierarchicalClustering.Single</CODE> (default), where the distance between groups is defined as the distance between the closest pair of objects, one from each group,</li>
  <li><CODE>HierarchicalClustering.Average</CODE>, where the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group,</li>
  <li><CODE>HierarchicalClustering.Complete</CODE>, where the distance between groups is defined as the distance between the most distant pair of objects, one from each group. Complete linkage is also called farthest neighbor.</li>
  <li><CODE>HierarchicalClustering.Ward</CODE>, which uses Ward's distance.</li>
</ol>
</DD>

<DT>overwriteMatrix</DT>
<DD>If true (default is false), the algorithm will work on the original distance matrix, destroying it in the process. The benefit is that it will need much less memory (not much more than what is needed to store the tree of clusters).</DD>

<DT>progressCallback</DT>
<DD>A callback function (<CODE>None</CODE> by default). It can be any function or callable class in Python that accepts a single float as an argument. The function is only called if the number of objects being clustered is at least 1000. It will be called 101 times, and the argument will give the proportion of the work that has been done. The time intervals between the calls won't be equal (sorry about that...) since the clustering proceeds faster as the number of clusters decreases. A minimal sketch of a callback is shown below this list.</DD>
</DL>
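
<P>A minimal sketch of such a callback (our own example; only the <CODE>progressCallback</CODE> attribute itself is part of Orange):</P>
<XMP class=code>import orange

# Any callable accepting a single float between 0 and 1 will do
def report(progress):
    print "clustering: %3.0f%% done" % (100 * progress)

clustering = orange.HierarchicalClustering()
clustering.progressCallback = report
</XMP>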

<P>The <CODE>HierarchicalClustering</CODE> is called with a distance matrix as an argument. It returns an instance of <CODE><index name="classes/HierarchicalCluster">HierarchicalCluster</index></CODE> representing the root of the hierarchy.</P>


<P class=section>Attributes of <CODE>HierarchicalCluster</CODE></P>
<DL class=attributes>
<DT>branches</DT>
<DD>A list of subclusters; <CODE>None</CODE> if the node is a leaf (a single element). The list can contain more than two subclusters. (<CODE>HierarchicalClustering</CODE> never produces such clusters; this is here for any future extensions.)</DD>

<DT>left, right</DT>
<DD>The left and right subclusters; defined only if there are at most two branches - that is, always, if <CODE>HierarchicalClustering</CODE> was used to construct the cluster.</DD>

<DT>height</DT>
<DD>The distance between the subclusters.</DD>

<DT>mapping, first, last</DT>
<DD><CODE>mapping</CODE> is a list of indices into the distance matrix. It is the same for all clusters in the hierarchy - it simply represents the indices ordered according to the clustering. <CODE>first</CODE> and <CODE>last</CODE> are indices into the elements of <CODE>mapping</CODE> that belong to that cluster. (Seems weird, but is trivial - wait for the examples. On the other hand, you probably won't need to understand this anyway.)

<P>If the distance matrix had an attribute <CODE>objects</CODE> defined, it is copied to <CODE>mapping</CODE>.</P>
</DD>
</DL>


<P class=section>Methods</P>
<DL class=attributes>
<DT>__len__</DT>
<DD>Asking for the length of the cluster gives the number of objects belonging to it; this equals <CODE>last-first</CODE>.</DD>

<DT>&lt;indexing&gt;</DT>
<DD>By indexing the cluster we address its elements; these are either indices or objects (you'll understand this after seeing the examples). For instance, <CODE>cluster[2]</CODE> gives the third element of the cluster, and <CODE>list(cluster)</CODE> will return the cluster elements as a list. The cluster elements are read-only. To actually modify them, you'll have to go through <CODE>mapping</CODE>, as described below. This is intentionally complicated to discourage naive users from doing what they do not understand.</DD>

<DT>swap()</DT>
<DD>Swaps the left and the right subcluster; obviously, this will report an error when the cluster has more than two subclusters. This function changes the <CODE>mapping</CODE> and the <CODE>first</CODE> and <CODE>last</CODE> of all clusters below this one and thus needs O(len(cluster)) time.</DD>

<DT>permute(permutation)</DT>
<DD>Permutes the subclusters. The permutation gives the order in which the subclusters will be arranged. As with <CODE>swap</CODE>, this function changes the <CODE>mapping</CODE> and the <CODE>first</CODE> and <CODE>last</CODE> of all clusters below this one.</DD>
</DL>

<H3>Example 1: Toy matrix</H3>

<P>Let us construct a simple distance matrix and run clustering on it.</P>

<P class="header">part of <A href="hclust_art.py">hclust_art.py</A></P>
<XMP class=code>import orange

m = [[],
     [ 3],
     [ 2,  4],
     [17,  5,  4],
     [ 2,  8,  3,  8],
     [ 7,  5, 10, 11, 2],
     [ 8,  4,  1,  5, 11, 13],
     [ 4,  7, 12,  8, 10,  1,  5],
     [13,  9, 14, 15,  7,  8,  4,  6],
     [12, 10, 11, 15,  2,  5,  7,  3,  1]]

matrix = orange.SymMatrix(m)
root = orange.HierarchicalClustering(matrix, linkage=orange.HierarchicalClustering.Average)
</XMP>

<P><CODE>root</CODE> is the root of the cluster hierarchy. We can print it out using a simple recursive function.</P>

<P class="header">part of <A href="hclust_art.py">hclust_art.py</A></P>
<XMP class=code>def printClustering(cluster):
    if cluster.branches:
        return "(%s%s)" % (printClustering(cluster.left), printClustering(cluster.right))
    else:
        return repr(cluster[0])
</XMP>

<P>The output is not exactly nice, but it will have to do. Our clustering, printed by calling <CODE>printClustering(root)</CODE>, looks like this: <CODE>(((04)((57)(89)))((1(26))3))</CODE>. The elements are separated into two groups, the first containing elements 0, 4, 5, 7, 8, 9, and the second 1, 2, 6, 3. The distance between them equals <CODE>root.height</CODE>, 9.0 in our case. The first cluster is further divided into 0 and 4 in one subcluster, and 5, 7, 8, 9 in the other...</P>

<P>It is easy to print out the cluster's objects. Here's what's in the left subcluster of <CODE>root</CODE>.</P>
<XMP class=code>>>> for el in root.left:
...    print el,
0 4 5 7 8 9
</XMP>


<P>Anything that can be iterated over can also be cast into a list or tuple. Let us demonstrate this by writing a better function for printing out the clustering (which will also come in handy for something else in a while). The one above assumed that each leaf contains a single object. This is not necessarily so; instead of printing out the first (and supposedly the only) element of the cluster, <CODE>cluster[0]</CODE>, we shall print it out as a tuple.</P>

<P class="header">part of <A href="hclust_art.py">hclust_art.py</A></P>
<XMP class=code>def printClustering2(cluster):
    if cluster.branches:
        return "(%s%s)" % (printClustering2(cluster.left), printClustering2(cluster.right))
    else:
        return str(tuple(cluster))
</XMP>

<P>The distance matrix could have been given a list of objects. We could, for instance, put</P>
<XMP CLASS=CODE>matrix.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"]
</XMP>
<P>above the call to <CODE>HierarchicalClustering</CODE>. (This code will actually trigger a warning; to avoid it, use <CODE>matrix.setattr("objects", ["Ann", "Bob"...</CODE>. Why this is needed is explained on the page about <A href="peculiarities.htm">Orange peculiarities</A>.) If we've forgotten to store the <CODE>objects</CODE> into <CODE>matrix</CODE> prior to clustering, nothing is lost. We can add them to the clustering later, by</P>
<XMP CLASS=CODE>root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"]
</XMP>

<P>So, what do these "objects" do? Call <CODE>printClustering(root)</CODE> again and you'll see. Or, let us print out the elements of the first left cluster, as we did before.</P>
<XMP class=code>>>> for el in root.left:
...    print el,
Ann Eve Fred Hue Ivy Jon
</XMP>

<P>If objects are given, the cluster's elements, as obtained by indexing (e.g. <CODE>root.left[2]</CODE>) or by iteration, as in the above case, won't be indices but the elements we clustered. If we put an <CODE>ExampleTable</CODE> into <CODE>objects</CODE>, <CODE>root.left[-1]</CODE> will be the last example of the first left cluster.</P>
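
<P>Continuing the example, indexing should now give names rather than indices.</P>
<XMP class=code>>>> root.left[2]
'Fred'
>>> list(root.left)
['Ann', 'Eve', 'Fred', 'Hue', 'Ivy', 'Jon']
</XMP>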


<P>Now for swapping and permutations.</P>
<XMP class=code>>>> printClustering(root)
((('Ann''Eve')(('Fred''Hue')('Ivy''Jon')))(('Bob'('Curt''Greg'))'Danny'))
>>> root.left.swap()
>>> printClustering(root)
(((('Fred''Hue')('Ivy''Jon'))('Ann''Eve'))(('Bob'('Curt''Greg'))'Danny'))
>>> root.permute([1, 0])
>>> printClustering(root)
((('Bob'('Curt''Greg'))'Danny')((('Fred''Hue')('Ivy''Jon'))('Ann''Eve')))
</XMP>

<P>Calling <CODE>root.left.swap()</CODE> reversed the order of subclusters of <CODE>root.left</CODE>, and <CODE>root.permute([1, 0])</CODE> (which is equivalent to <CODE>root.swap()</CODE> - there aren't many possible permutations of two elements) reversed the order of <CODE>root.left</CODE> and <CODE>root.right</CODE>.</P>

<P>Let us write a function for cluster pruning.</P>

<P class="header">part of <A href="hclust_art.py">hclust_art.py</A></P>
<XMP class=code>def prune(cluster, togo):
    # "togo" is the remaining height budget; branches are cut off once
    # the heights summed along the path from the root exceed it
    if cluster.branches:
        if togo<0:
            cluster.branches = None
        else:
            for branch in cluster.branches:
                prune(branch, togo-cluster.height)

</XMP>

<P>We shall use <CODE>printClustering2</CODE> here, since we can have multiple elements in a leaf of the clustering hierarchy.</P>
<XMP class=code>>>> prune(root, 9)
>>> print printClustering2(root)
((('Bob', 'Curt', 'Greg')('Danny',))(('Fred', 'Hue', 'Ivy', 'Jon')('Ann', 'Eve')))
</XMP>

<P>We've ended up with four clusters. Need a list of clusters? Here's the function.</P>

<P class="header">part of <A href="hclust_art.py">hclust_art.py</A></P>
<XMP class=code>def listOfClusters0(cluster, alist):
    if not cluster.branches:
        alist.append(list(cluster))
    else:
        for branch in cluster.branches:
            listOfClusters0(branch, alist)

def listOfClusters(root):
    l = []
    listOfClusters0(root, l)
    return l

</XMP>

<P>The function returns a list of lists, in our case <CODE>[['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']]</CODE>. If there were no <CODE>objects</CODE>, the list would contain indices instead of names.</P>


<H3>Example 2: Clustering of examples</H3>

<P>The most common things to cluster are certainly examples. To show how this is done, we shall now load the Iris data set, initialize a distance matrix with the distances measured by <A href="ExamplesDistance.htm"><CODE>ExamplesDistance_Euclidean</CODE></A> and cluster it with average linkage. Since we don't need the matrix afterwards, we shall let the clustering overwrite it (not that this is needed for a data set as small as Iris).</P>

<P class="header">part of <A href="hclust.py">hclust.py</A> (uses <a href="iris.tab">iris.tab</a>)</P>
<XMP class=code>import orange

data = orange.ExampleTable("iris")

matrix = orange.SymMatrix(len(data))
matrix.setattr("objects", data)
distance = orange.ExamplesDistanceConstructor_Euclidean(data)
for i1, ex1 in enumerate(data):
    for i2 in range(i1+1, len(data)):
        matrix[i1, i2] = distance(ex1, data[i2])

clustering = orange.HierarchicalClustering()
clustering.linkage = clustering.Average
clustering.overwriteMatrix = 1
root = clustering(matrix)
</XMP>

<P>Note that we haven't forgotten to set <CODE>matrix.objects</CODE>. We did it through <CODE>matrix.setattr</CODE> to avoid the warning. Let us now prune the clustering using the function we've written above, and print out the clusters.</P>
<P class="header">part of <A href="hclust.py">hclust.py</A> (uses <a href="iris.tab">iris.tab</a>)</P>
<XMP class=code>prune(root, 1.4)

for n, cluster in enumerate(listOfClusters(root)):
    print "\n\n*** Cluster %i ***\n" % n
    for ex in cluster:
        print ex
</XMP>

<P>Since the printout is pretty long, it might be more informative to just print out the class distributions for each cluster.</P>

<P class="header">part of <A href="hclust.py">hclust.py</A> (uses <a href="iris.tab">iris.tab</a>)</P>
<XMP class=code>for cluster in listOfClusters(root):
    dist = orange.getClassDistribution(cluster)
    for e, d in enumerate(dist):
        print "%s: %3.0f   " % (data.domain.classVar.values[e], d),
    print
</XMP>

<P>Here's what it shows.</P>
<XMP class=code>Iris-setosa:  49    Iris-versicolor:   0    Iris-virginica:   0
Iris-setosa:   1    Iris-versicolor:   0    Iris-virginica:   0
Iris-setosa:   0    Iris-versicolor:  50    Iris-virginica:  17
Iris-setosa:   0    Iris-versicolor:   0    Iris-virginica:  33
</XMP>

<P>Note something else: <CODE>listOfClusters</CODE> does not return a list of <CODE>ExampleTable</CODE>s, but a list of lists of examples. Therefore, in the above script, <CODE>cluster</CODE> is a list of examples, not an <CODE>ExampleTable</CODE>; it gets converted automatically when <CODE>getClassDistribution</CODE> is called. Most Orange functions will do this for you. You can, for instance, call a learning algorithm, passing a cluster as an argument; it won't mind. If you, however, want to have a list of tables, you can easily convert the list by</P>
<XMP class=code>tables = [orange.ExampleTable(cluster) for cluster in listOfClusters(root)]
</XMP>

<P>Finally, if you are dealing with examples, you may want to take the function <CODE>listOfClusters</CODE> and replace <CODE>alist.append(list(cluster))</CODE> with <CODE>alist.append(orange.ExampleTable(cluster))</CODE>. This function is less general; it will fail if <CODE>objects</CODE> are not of type <CODE>Example</CODE>. However, instead of a list of lists, it will return a list of example tables, as sketched below.</P>
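
<P>A sketch of such a variant (the function names below are ours, chosen for illustration):</P>
<XMP class=code>def listOfTables0(cluster, alist):
    # same traversal as listOfClusters0, but each leaf becomes an ExampleTable
    if not cluster.branches:
        alist.append(orange.ExampleTable(cluster))
    else:
        for branch in cluster.branches:
            listOfTables0(branch, alist)

def listOfTables(root):
    l = []
    listOfTables0(root, l)
    return l
</XMP>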


<H3>How is the data in HierarchicalCluster really stored?</H3>

<P>To demonstrate how the data in clusters is stored, we shall continue with the clustering we got in the first example.</P>

<XMP class=code>>>> del root.mapping.objects
>>> print printClustering(root)
(((1(26))3)(((57)(89))(04)))
>>> print root.mapping
<1, 2, 6, 3, 5, 7, 8, 9, 0, 4>
>>> print root.left.first
0
>>> print root.left.last
4
>>> print root.left.mapping[root.left.first:root.left.last]
<1, 2, 6, 3>
>>> print root.left.left.first
0
>>> print root.left.left.last
3
</XMP>

<P>We removed <CODE>objects</CODE> just to see more clearly what is going on. <CODE>mapping</CODE> is an ordered list of indices into the rows/columns of the distance matrix (and, at the same time, indices into <CODE>objects</CODE>, if they exist). Each cluster's fields <CODE>first</CODE> and <CODE>last</CODE> are indices into <CODE>mapping</CODE>, so the cluster's elements are actually <CODE>cluster.mapping[cluster.first:cluster.last]</CODE>. <CODE>cluster[i]</CODE> therefore returns <CODE>cluster.mapping[cluster.first+i]</CODE> or, if <CODE>objects</CODE> are specified, <CODE>cluster.objects[cluster.mapping[cluster.first+i]]</CODE>. Space consumption is minimal since all clusters share the same <CODE>mapping</CODE> and <CODE>objects</CODE>.</P>

<P>Subclusters are ordered so that <CODE>cluster.left.last</CODE> always equals <CODE>cluster.right.first</CODE> or, in general, <CODE>cluster.branches[i].last</CODE> equals <CODE>cluster.branches[i+1].first</CODE>.</P>
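
<P>Continuing the session above, we can verify these relations directly.</P>
<XMP class=code>>>> root.left.last == root.right.first
True
>>> len(root.left) == root.left.last - root.left.first
True
>>> root.left[1] == root.left.mapping[root.left.first + 1]
True
</XMP>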

<P>Swapping and permutation do three things: change the order of elements in <CODE>branches</CODE>, permute the corresponding regions in <CODE>mapping</CODE>, and adjust the <CODE>first</CODE> and <CODE>last</CODE> for all the clusters below. For the latter, when subclusters of <CODE>cluster</CODE> are permuted, the entire subtree starting at <CODE>cluster.branches[i]</CODE> is moved by the same offset.</P>
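
<P>For instance, swapping the two subclusters of <CODE>root.left</CODE> should exchange the corresponding regions of <CODE>mapping</CODE> (a continuation of the session above).</P>
<XMP class=code>>>> root.left.swap()
>>> print root.mapping
<3, 1, 2, 6, 5, 7, 8, 9, 0, 4>
>>> print printClustering(root)
((3(1(26)))(((57)(89))(04)))
</XMP>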

<P>The hierarchy of objects that represents a clustering is open; everything is accessible from Python. You can write your own clustering algorithms that build this same structure, or you can use Orange's clustering and then do with the structure anything you want. For instance, prune it, as we have shown earlier. However, it is easy to do things wrong: shuffle the <CODE>mapping</CODE>, for instance, and forget to adjust the <CODE>first</CODE> and <CODE>last</CODE> pointers. Orange does some internal consistency checking, but you are surely smarter and can find a way to crash it. For instance, just create a cycle in the structure, call <CODE>swap</CODE> for some cluster above the cycle and you're there. But don't blame it on me then.</P>

</BODY>
</html>