source: orange/Orange/clustering/hierarchical.py @ 9752:cbd6f6f10f06

Revision 9752:cbd6f6f10f06, 64.4 KB, checked in by markotoplak, 2 years ago

Moved instance_distance_matrix to Orange.distance.distance_matrix

r"""
******************************************
Hierarchical clustering (``hierarchical``)
******************************************

.. index::
   single: clustering, hierarchical, dendrogram
.. index:: agglomerative clustering

The method for hierarchical clustering, encapsulated in class
:class:`HierarchicalClustering`, works on a distance matrix stored as
:class:`SymMatrix`. The method runs in approximately O(n^2) time (with
the worst case O(n^3)). For orientation, clustering ten thousand
elements should take roughly 15 seconds on a 2 GHz computer.
The algorithm can either make a copy of the distance matrix and work on
it, or work on the original distance matrix, destroying it in the process.
The latter is useful for clustering a larger number of objects. Since the
distance matrix stores (n+1)(n+2)/2 floats (ca. 2 MB for 1000 objects and
200 MB for 10000, assuming a float takes 4 bytes), copying it could
quickly exhaust physical memory. Using virtual memory is not
an option since the matrix is accessed in a random manner.

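The memory estimate above is simple arithmetic; here is a quick standalone check (the 4-byte float size is the assumption stated above):

```python
def sym_matrix_bytes(n, float_size=4):
    # Floats stored for a symmetric matrix over n objects, as given
    # in the text: (n+1)(n+2)/2, times the assumed float size.
    return (n + 1) * (n + 2) // 2 * float_size

print(sym_matrix_bytes(1000))   # roughly 2 MB
print(sym_matrix_bytes(10000))  # roughly 200 MB
```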
The distances should contain no negative elements. This limitation is
due to implementation details of the algorithm (it is not absolutely
necessary and can be lifted in future versions if often requested; it
only helps the algorithm run a bit faster). The elements on the diagonal
(representing an element's distance from itself) are ignored.

The distance matrix can have an attribute ``objects`` describing the objects
we are clustering (this is available only in Python). This can be any
sequence of the same length as the matrix - an ExampleTable, a list of
examples, a list of attributes (if you're clustering attributes), or even a
string of the correct length. This attribute is not used in clustering but
is only passed to the clusters' attribute ``mapping`` (see below), which
will hold a reference to it (if you modify the list, the changes will
affect the clusters as well).

.. class:: HierarchicalClustering

    .. attribute:: linkage

        Specifies the linkage method, which can be one of:

            1. ``HierarchicalClustering.Single`` (default), where the
               distance between groups is defined as the distance between
               the closest pair of objects, one from each group,
            2. ``HierarchicalClustering.Average``, where the distance between
               two clusters is defined as the average of distances between
               all pairs of objects, where each pair is made up of one object
               from each group,
            3. ``HierarchicalClustering.Complete``, where the distance
               between groups is defined as the distance between the most
               distant pair of objects, one from each group. Complete
               linkage is also called farthest neighbor, or
            4. ``HierarchicalClustering.Ward``, which uses Ward's distance.

    .. attribute:: overwrite_matrix

        If true (default is false), the algorithm will work on the original
        distance matrix, destroying it in the process. The benefit is that it
        will need much less memory (not much more than what is needed to store
        the tree of clusters).

    .. attribute:: progress_callback

        A callback function (None by default). It can be any function or
        callable class in Python that accepts a single float as an
        argument. The function is only called if the number of objects
        being clustered is at least 1000. It will be called 101 times,
        and the argument will give the proportion of the work done.
        The time intervals between the calls won't be equal, since the
        clustering proceeds faster as the number of clusters decreases.

    .. method:: __call__(matrix)

        ``HierarchicalClustering`` is called with a distance matrix as an
        argument. It returns the root of the hierarchy, an instance of
        :class:`HierarchicalCluster`. See the examples section for details.

        :param matrix: A distance matrix to perform the clustering on.
        :type matrix: :class:`Orange.core.SymMatrix`

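The first three linkage criteria above can be sketched in plain Python. This is only an illustration of the definitions, not Orange's implementation:

```python
def linkage_distance(group1, group2, dist, mode):
    # Inter-group distance under a given linkage criterion.
    pairs = [dist(a, b) for a in group1 for b in group2]
    if mode == "single":      # closest pair of objects
        return min(pairs)
    if mode == "complete":    # most distant pair of objects
        return max(pairs)
    if mode == "average":     # average over all pairs
        return float(sum(pairs)) / len(pairs)
    raise ValueError(mode)

dist = lambda a, b: abs(a - b)
print(linkage_distance([1, 2], [5, 9], dist, "single"))    # 3
print(linkage_distance([1, 2], [5, 9], dist, "complete"))  # 8
print(linkage_distance([1, 2], [5, 9], dist, "average"))   # 5.5
```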
.. class:: HierarchicalCluster

    Represents a node in the clustering tree, as returned by
    :obj:`HierarchicalClustering`.

    .. attribute:: branches

        A list of sub-clusters (:class:`HierarchicalCluster` instances). If
        this is a leaf node, this attribute is `None`.

    .. attribute:: left

        The left sub-cluster (defined only if there are exactly two branches).

        .. note:: Same as ``branches[0]``.

    .. attribute:: right

        The right sub-cluster (defined only if there are exactly two branches).

        .. note:: Same as ``branches[1]``.

    .. attribute:: height

        Height of the cluster (the distance between its sub-clusters).

    .. attribute:: mapping

        A list of indices into the original distance matrix. It is the same
        for all clusters in the hierarchy - it simply represents the indices
        ordered according to the clustering.

    .. attribute:: first
    .. attribute:: last

        ``first`` and ``last`` are indices into ``mapping``, delimiting the
        elements that belong to this cluster. (This may seem odd, but it is
        quite simple - see the examples below.)

    .. method:: __len__()

        The length of the cluster is the number of objects belonging to it.
        This equals ``last - first``.

    .. method:: __getitem__(index)

        By indexing the cluster we access its elements; these are either
        indices or objects (this will become clear in the examples).
        For instance, ``cluster[2]`` gives the third element of the cluster,
        and ``list(cluster)`` returns the cluster's elements as a list. The
        cluster elements are read-only. To actually modify them, you have to
        go through ``mapping``, as described below. This is intentionally
        complicated, to discourage naive modifications that are easy to get
        wrong.

    .. method:: swap()

        Swaps the ``left`` and the ``right`` subcluster; this reports an
        error when the cluster has more than two subclusters. The function
        changes the ``mapping``, and the ``first`` and ``last`` of all
        clusters below this one, and thus needs O(len(cluster)) time.

    .. method:: permute(permutation)

        Permutes the subclusters. ``permutation`` gives the order in which
        the subclusters will be arranged. As with ``swap``, this function
        changes the ``mapping``, and the ``first`` and ``last`` of all
        clusters below this one.

Example 1 - Toy matrix
----------------------

Let us construct a simple distance matrix and run clustering on it.
::

    import Orange
    from Orange.clustering import hierarchical
    m = [[],
         [ 3],
         [ 2, 4],
         [17, 5, 4],
         [ 2, 8, 3, 8],
         [ 7, 5, 10, 11, 2],
         [ 8, 4, 1, 5, 11, 13],
         [ 4, 7, 12, 8, 10, 1, 5],
         [13, 9, 14, 15, 7, 8, 4, 6],
         [12, 10, 11, 15, 2, 5, 7, 3, 1]]
    matrix = Orange.core.SymMatrix(m)
    root = hierarchical.HierarchicalClustering(matrix,
            linkage=hierarchical.HierarchicalClustering.Average)

``root`` is the root of the cluster hierarchy. We can print the clustering
using a simple recursive function.
::

    def printClustering(cluster):
        if cluster.branches:
            return "(%s%s)" % (printClustering(cluster.left), printClustering(cluster.right))
        else:
            return str(cluster[0])

The output is not exactly pretty, but it will have to do. Our clustering,
printed by calling ``printClustering(root)``, looks like this
::

    (((04)((57)(89)))((1(26))3))

The elements are separated into two groups, the first containing elements
0, 4, 5, 7, 8, 9, and the second 1, 2, 6, 3. The difference between them
equals ``root.height``, 9.0 in our case. The first cluster is further
divided into 0 and 4 in one subcluster, and 5, 7, 8, 9 in the other...

It is easy to print out the cluster's objects. Here's what's in the
left subcluster of root.
::

    >>> for el in root.left:
    ...     print el,
    0 4 5 7 8 9

Everything that can be iterated over can also be cast into a list or
tuple. Let us demonstrate this by writing a better function for printing
out the clustering (which will also come in handy for something else in a
while). The one above assumed that each leaf contains a single object.
This is not necessarily so; instead of printing out the first (and
supposedly the only) element of the cluster, ``cluster[0]``, we shall print
it out as a tuple.
::

    def printClustering2(cluster):
        if cluster.branches:
            return "(%s%s)" % (printClustering2(cluster.left), printClustering2(cluster.right))
        else:
            return str(tuple(cluster))

The distance matrix could have been given a list of objects. We could,
for instance, put
::

    matrix.objects = ["Ann", "Bob", "Curt", "Danny", "Eve",
                      "Fred", "Greg", "Hue", "Ivy", "Jon"]

before calling HierarchicalClustering.

.. note:: This code will actually trigger a warning;
    to avoid it, use matrix.setattr("objects", ["Ann", "Bob"....
..    Why this is needed is explained in the page on `Orange peculiarities`_. TODO: Create page on Orange Peculiarities.

If we've forgotten to store the objects into the matrix prior to clustering,
nothing is lost. We can add them to the clustering later, by
::

    root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"]

So, what do these "objects" do? Call ``printClustering(root)`` again and
you'll see. Or, let us print out the elements of the first left cluster,
as we did before.
::

    >>> for el in root.left:
    ...     print el,
    Ann Eve Fred Hue Ivy Jon

If objects are given, the cluster's elements, as obtained by indexing
(e.g. ``root.left[2]``) or by iteration, as in the above case, won't be
indices but the elements we clustered. If we put an ExampleTable
into ``objects``, ``root.left[-1]`` will be the last example of the first
left cluster.

Now for swapping and permutations. ::

    >>> printClustering(root)
    ((('Ann''Eve')(('Fred''Hue')('Ivy''Jon')))(('Bob'('Curt''Greg'))'Danny'))
    >>> root.left.swap()
    >>> printClustering(root)
    (((('Fred''Hue')('Ivy''Jon'))('Ann''Eve'))(('Bob'('Curt''Greg'))'Danny'))
    >>> root.permute([1, 0])
    >>> printClustering(root)
    ((('Bob'('Curt''Greg'))'Danny')((('Fred''Hue')('Ivy''Jon'))('Ann''Eve')))

Calling ``root.left.swap`` reversed the order of subclusters of ``root.left``,
and ``root.permute([1, 0])`` (which is equivalent to ``root.swap`` - there
aren't many possible permutations of two elements) reversed the order
of ``root.left`` and ``root.right``.

Let us write a function for cluster pruning. ::

    def prune(cluster, togo):
        if cluster.branches:
            if togo < 0:
                cluster.branches = None
            else:
                for branch in cluster.branches:
                    prune(branch, togo - cluster.height)

We shall use ``printClustering2`` here, since a leaf of the clustering
hierarchy can now hold multiple elements. ::

    >>> prune(root, 9)
    >>> print printClustering2(root)
    ((('Bob', 'Curt', 'Greg')('Danny',))(('Fred', 'Hue', 'Ivy', 'Jon')('Ann', 'Eve')))

We've ended up with four clusters. Need a list of clusters?
Here's the function. ::

    def listOfClusters0(cluster, alist):
        if not cluster.branches:
            alist.append(list(cluster))
        else:
            for branch in cluster.branches:
                listOfClusters0(branch, alist)

    def listOfClusters(root):
        l = []
        listOfClusters0(root, l)
        return l

The function returns a list of lists, in our case
``[['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']]``.
If there were no ``objects``, the list would contain indices instead of names.

Example 2 - Clustering of examples
----------------------------------

The most common things to cluster are certainly examples. To show how
this is done, we shall now load the Iris data set, initialize a distance
matrix with the distances measured by :class:`Euclidean`,
and cluster the data with average linkage. Since we don't need the matrix,
we shall let the clustering overwrite it (not that this is needed for
such a small data set as Iris). ::

    import Orange
    from Orange.clustering import hierarchical

    data = Orange.data.Table("iris")
    matrix = Orange.core.SymMatrix(len(data))
    matrix.setattr("objects", data)
    distance = Orange.distance.Euclidean(data)
    for i1, instance1 in enumerate(data):
        for i2 in range(i1+1, len(data)):
            matrix[i1, i2] = distance(instance1, data[i2])

    clustering = hierarchical.HierarchicalClustering()
    clustering.linkage = clustering.Average
    clustering.overwrite_matrix = 1
    root = clustering(matrix)

Note that we haven't forgotten to set ``matrix.objects``. We did it
through ``matrix.setattr`` to avoid the warning. Let us now prune the
clustering using the function we've written above, and print out the
clusters. ::

    prune(root, 1.4)
    for n, cluster in enumerate(listOfClusters(root)):
        print "\n\n Cluster %i \n" % n
        for instance in cluster:
            print instance

Since the printout is pretty long, it might be more informative to just
print out the class distributions for each cluster. ::

    for cluster in listOfClusters(root):
        dist = Orange.core.get_class_distribution(cluster)
        for e, d in enumerate(dist):
            print "%s: %3.0f " % (data.domain.class_var.values[e], d),
        print

Here's what it shows. ::

    Iris-setosa:  49    Iris-versicolor:   0    Iris-virginica:   0
    Iris-setosa:   1    Iris-versicolor:   0    Iris-virginica:   0
    Iris-setosa:   0    Iris-versicolor:  50    Iris-virginica:  17
    Iris-setosa:   0    Iris-versicolor:   0    Iris-virginica:  33

Note something else: ``listOfClusters`` does not return a list of
:class:`Orange.data.Table`, but a list of lists of instances. Therefore,
in the above script, ``cluster`` is a list of examples, not a ``Table``;
it gets converted automatically when a function expecting a table is
called. Most Orange functions will do this for you. You can, for
instance, call a learning algorithm, passing a cluster as an argument.
It won't mind. If you, however, want to have a list of tables, you can
easily convert the list by ::

    tables = [Orange.data.Table(cluster) for cluster in listOfClusters(root)]

Finally, if you are dealing with examples, you may want to take the function
``listOfClusters`` and replace ``alist.append(list(cluster))`` with
``alist.append(Orange.data.Table(cluster))``. This function is less general;
it will fail if the objects are not of type :class:`Orange.data.Instance`.
However, instead of a list of lists, it will return a list of tables.

How is the data in ``HierarchicalCluster`` really stored?
---------------------------------------------------------

To demonstrate how the data in clusters is stored, we shall continue with
the clustering we got in the first example. ::

    >>> del root.mapping.objects
    >>> print printClustering(root)
    (((1(26))3)(((57)(89))(04)))
    >>> print root.mapping
    <1, 2, 6, 3, 5, 7, 8, 9, 0, 4>
    >>> print root.left.first
    0
    >>> print root.left.last
    4
    >>> print root.left.mapping[root.left.first:root.left.last]
    <1, 2, 6, 3>
    >>> print root.left.left.first
    0
    >>> print root.left.left.last
    3

We removed the objects just to see more clearly what is going on.
``mapping`` is an ordered list of indices into the rows/columns of the
distance matrix (and, at the same time, into ``objects``, if they exist).
Each cluster's fields ``first`` and ``last`` are indices into ``mapping``,
so the cluster's elements are actually
``cluster.mapping[cluster.first:cluster.last]``. ``cluster[i]`` therefore
returns ``cluster.mapping[cluster.first+i]`` or, if objects are specified,
``cluster.objects[cluster.mapping[cluster.first+i]]``. Space consumption
is minimal since all clusters share the same ``mapping`` and ``objects``.


Subclusters are ordered so that ``cluster.left.last`` always equals
``cluster.right.first`` or, in general, ``cluster.branches[i].last``
equals ``cluster.branches[i+1].first``.

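Both invariants - element access through ``mapping`` delimited by ``first`` and ``last``, and the contiguity of sibling ranges - can be mimicked with a small stand-in class (a sketch for illustration, not Orange's actual implementation):

```python
class Cluster:
    # Minimal stand-in: all clusters share one mapping list;
    # first/last delimit this cluster's slice of it.
    def __init__(self, mapping, first, last):
        self.mapping, self.first, self.last = mapping, first, last

    def __len__(self):
        return self.last - self.first

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.mapping[self.first + i]

mapping = [1, 2, 6, 3, 5, 7, 8, 9, 0, 4]   # from the example above
root = Cluster(mapping, 0, 10)
left = Cluster(mapping, 0, 4)               # elements 1, 2, 6, 3
right = Cluster(mapping, 4, 10)

print(list(left))                # [1, 2, 6, 3]
print(left.last == right.first)  # True: sibling ranges abut
```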
Swapping and permutation do three things: change the order of elements in
``branches``, permute the corresponding regions in ``mapping``, and adjust
``first`` and ``last`` for all the clusters below. For the latter, when
the subclusters of a cluster are permuted, the entire subtree starting at
``cluster.branches[i]`` is moved by the same offset.


The hierarchy of objects that represents a clustering is open; everything
is accessible from Python. You can write your own clustering algorithms
that build this same structure, or you can use Orange's clustering and then
do anything you want to the structure. For instance, prune it, as we have
shown earlier. However, it is easy to do things wrong: shuffle the mapping,
for instance, and forget to adjust the ``first`` and ``last`` pointers.
Orange does some checking for internal consistency, but you are surely
smarter and can find a way to crash it. For instance, just create a cycle
in the structure, call ``swap`` for some cluster above the cycle and you're
there. But don't blame it on me then.

Utility Functions
-----------------

.. autofunction:: clustering
.. autofunction:: clustering_features
.. autofunction:: cluster_to_list
.. autofunction:: top_clusters
.. autofunction:: top_cluster_membership
.. autofunction:: order_leaves

.. autofunction:: postorder
.. autofunction:: preorder
.. autofunction:: dendrogram_layout
.. autofunction:: dendrogram_draw
.. autofunction:: clone
.. autofunction:: prune
.. autofunction:: pruned
.. autofunction:: cluster_depths
.. autofunction:: feature_distance_matrix
.. autofunction:: joining_cluster
.. autofunction:: cophenetic_distances
.. autofunction:: cophenetic_correlation

"""

import orange
import Orange
from Orange.core import HierarchicalClustering, \
                        HierarchicalCluster, \
                        HierarchicalClusterList

from Orange.misc import progress_bar_milestones, deprecated_keywords

import sys

SINGLE = HierarchicalClustering.Single
AVERAGE = HierarchicalClustering.Average
COMPLETE = HierarchicalClustering.Complete
WARD = HierarchicalClustering.Ward

def clustering(data,
               distance_constructor=Orange.distance.Euclidean,
               linkage=AVERAGE,
               order=False,
               progress_callback=None):
    """ Return a hierarchical clustering of the instances in a data set.

    :param data: Input data table for clustering.
    :type data: :class:`Orange.data.Table`
    :param distance_constructor: Instance distance constructor
    :type distance_constructor: :class:`Orange.distance.DistanceConstructor`
    :param linkage: Linkage flag. Must be one of the global module level flags:

        - SINGLE
        - AVERAGE
        - COMPLETE
        - WARD

    :type linkage: int
    :param order: If `True` run `order_leaves` on the resulting clustering.
    :type order: bool
    :param progress_callback: A function (taking one argument) to use for
        reporting on the progress.
    :type progress_callback: function

    :rtype: :class:`HierarchicalCluster`

    """
    distance = distance_constructor(data)
    matrix = orange.SymMatrix(len(data))
    for i in range(len(data)):
        for j in range(i+1):
            matrix[i, j] = distance(data[i], data[j])
    root = HierarchicalClustering(matrix, linkage=linkage, progress_callback=(lambda value, obj=None: progress_callback(value*100.0/(2 if order else 1))) if progress_callback else None)
    if order:
        order_leaves(root, matrix, progress_callback=(lambda value: progress_callback(50.0 + value/2)) if progress_callback else None)
    return root

clustering = \
    deprecated_keywords({"distanceConstructor": "distance_constructor",
                         "progressCallback": "progress_callback"})(clustering)


def clustering_features(data, distance=None, linkage=orange.HierarchicalClustering.Average, order=False, progress_callback=None):
    """ Return a hierarchical clustering of the attributes in a data set.

    :param data: Input data table for clustering.
    :type data: :class:`Orange.data.Table`
    :param distance: Attribute distance constructor
        .. note:: currently not used.
    :param linkage: Linkage flag. Must be one of the global module level flags:

        - SINGLE
        - AVERAGE
        - COMPLETE
        - WARD

    :type linkage: int
    :param order: If `True` run `order_leaves` on the resulting clustering.
    :type order: bool
    :param progress_callback: A function (taking one argument) to use for
        reporting on the progress.
    :type progress_callback: function

    :rtype: :class:`HierarchicalCluster`

    """
    matrix = orange.SymMatrix(len(data.domain.attributes))
    for a1 in range(len(data.domain.attributes)):
        for a2 in range(a1):
            matrix[a1, a2] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0
    root = orange.HierarchicalClustering(matrix, linkage=linkage, progress_callback=(lambda value, obj=None: progress_callback(value*100.0/(2 if order else 1))) if progress_callback else None)
    if order:
        order_leaves(root, matrix, progress_callback=(lambda value: progress_callback(50.0 + value/2)) if progress_callback else None)
    return root

clustering_features = \
    deprecated_keywords({"progressCallback": "progress_callback"})(clustering_features)


def cluster_to_list(node, prune=None):
    """ Return a list of clusters down from the node of hierarchical clustering.

    :param node: Cluster node.
    :type node: :class:`HierarchicalCluster`
    :param prune: If not `None` it must be a positive integer. Any cluster
        with `prune` or fewer items will be left out of the list.
    :type prune: int or `NoneType`

    :rtype: list of :class:`HierarchicalCluster` instances

    """
    if prune:
        if len(node) <= prune:
            return []
    if node.branches:
        return [node] + cluster_to_list(node.left, prune) + cluster_to_list(node.right, prune)
    return [node]

def top_clusters(root, k):
    """ Return the k topmost clusters from hierarchical clustering.

    :param root: Root cluster.
    :type root: :class:`HierarchicalCluster`
    :param k: Number of top clusters.
    :type k: int

    :rtype: set of :class:`HierarchicalCluster` instances

    """
    candidates = set([root])
    while len(candidates) < k:
        repl = max([(c.height, c) for c in candidates if c.branches])[1]
        candidates.discard(repl)
        candidates.add(repl.left)
        candidates.add(repl.right)
    return candidates

def top_cluster_membership(root, k):
    """ Return data instances' cluster membership (list of indices) to the k topmost clusters.

    :param root: Root cluster.
    :type root: :class:`HierarchicalCluster`
    :param k: Number of top clusters.
    :type k: int

    :rtype: list-of-int

    """
    clist = top_clusters(root, k)
    cmap = [None] * len(root)
    for i, c in enumerate(clist):
        for e in c:
            cmap[e] = i
    return cmap

def order_leaves_py(tree, matrix, progress_callback=None):
    """Order the leaves in the clustering tree.

    (based on Ziv Bar-Joseph et al. (Fast optimal leaf ordering for hierarchical clustering))

    :param tree: Binary hierarchical clustering tree.
    :type tree: :class:`HierarchicalCluster`
    :param matrix: SymMatrix that was used to compute the clustering.
    :type matrix: :class:`Orange.core.SymMatrix`
    :param progress_callback: Function used to report on progress.
    :type progress_callback: function

    .. note:: The ordering is done inplace.

    """

    objects = getattr(tree.mapping, "objects", None)
    tree.mapping.setattr("objects", range(len(tree)))
    M = {}
    ordering = {}
    visited_clusters = set()

    def opt_ordering_iterative(tree):
        if len(tree) == 1:
            for leaf in tree:
                M[tree, leaf, leaf] = 0
        else:
            Vl = set(tree.left)
            Vr = set(tree.right)
            Vlr = set(tree.left.right or tree.left)
            Vll = set(tree.left.left or tree.left)
            Vrr = set(tree.right.right or tree.right)
            Vrl = set(tree.right.left or tree.right)
            other = lambda e, V1, V2: V2 if e in V1 else V1
            tree_left, tree_right = tree.left, tree.right
            for u in Vl:
                for w in Vr:
                    C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)])
                    ordered_m = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m])
                    ordered_k = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k])
                    k0 = ordered_k[0]
                    cur_min = 1e30000
                    cur_m = cur_k = None
                    for m in ordered_m:
                        if M[tree_left, u, m] + M[tree_right, w, k0] + C >= cur_min:
                            break
                        for k in ordered_k:
                            if M[tree_left, u, m] + M[tree_right, w, k] + C >= cur_min:
                                break
                            test_min = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
                            if cur_min > test_min:
                                cur_min = test_min
                                cur_m = m
                                cur_k = k
                    M[tree, u, w] = M[tree, w, u] = cur_min
                    ordering[tree, u, w] = (tree_left, u, cur_m, tree_right, w, cur_k)
                    ordering[tree, w, u] = (tree_right, w, cur_k, tree_left, u, cur_m)

    subtrees = postorder(tree)
    milestones = progress_bar_milestones(len(subtrees), 1000)

    for i, subtree in enumerate(subtrees):
        opt_ordering_iterative(subtree)
        if progress_callback and i in milestones:
            progress_callback(100.0 * i / len(subtrees))

    def order_iterative(tree, u, w):
        """ Order the tree based on the computed optimal ordering.
        """
        opt_uw = {tree: (u, w)}
        for subtree in preorder(tree):
            if subtree.branches:
                u, w = opt_uw[subtree]
                left, u, m, right, w, k = ordering[subtree, u, w]
                opt_uw[left] = (u, m)
                opt_uw[right] = (k, w)

                if left.branches and m not in left.right:
                    left.swap()

                if right.branches and k not in right.left:
                    right.swap()

    u, w = min([(u, w) for u in tree.left for w in tree.right], key=lambda (u, w): M[tree, u, w])

    order_iterative(tree, u, w)

    if objects:
        tree.mapping.setattr("objects", objects)

def order_leaves_cpp(tree, matrix, progress_callback=None):
    """ Order the leaves in the clustering tree (C++ implementation).

    (based on Ziv Bar-Joseph et al. (Fast optimal leaf ordering for hierarchical clustering))

    :param tree: Binary hierarchical clustering tree.
    :type tree: :class:`HierarchicalCluster`
    :param matrix: SymMatrix that was used to compute the clustering.
    :type matrix: :class:`Orange.core.SymMatrix`
    :param progress_callback: Function used to report on progress.
    :type progress_callback: function

    .. note:: The ordering is done in place.

    """
    node_count = iter(range(len(tree)))

    if progress_callback is not None:
        def p(*args):
            progress_callback(100.0 * node_count.next() / len(tree))
    else:
        p = None

    Orange.core.HierarchicalClusterOrdering(tree, matrix, progress_callback=p)

order_leaves_cpp = deprecated_keywords({"progressCallback": "progress_callback"})(order_leaves_cpp)
order_leaves_py = deprecated_keywords({"progressCallback": "progress_callback"})(order_leaves_py)

order_leaves = order_leaves_cpp

"""
Matplotlib dendrogram plotting. This is mostly untested;
use the dendrogram_draw function instead of this.

"""
try:
    import numpy
except ImportError:
    numpy = None

try:
    import matplotlib
    from matplotlib.figure import Figure
    from matplotlib.table import Table, Cell
    from matplotlib.text import Text
    from matplotlib.artist import Artist
except (ImportError, IOError, RuntimeError), ex:
    matplotlib = None
    Text, Artist, Table, Cell = object, object, object, object

class TableCell(Cell):
    PAD = 0.05
    def __init__(self, *args, **kwargs):
        Cell.__init__(self, *args, **kwargs)
        self._text.set_clip_on(True)

class TablePlot(Table):
    max_fontsize = 12
    def __init__(self, xy, axes=None, bbox=None):
        import matplotlib.pyplot as plt
        Table.__init__(self, axes or plt.gca(), bbox=bbox)
        self.xy = xy
        self.set_transform(self._axes.transData)
        self._fixed_widths = None
        self.max_fontsize = plt.rcParams.get("font.size", 12)

    def add_cell(self, row, col, *args, **kwargs):
        xy = (0, 0)

        cell = TableCell(xy, *args, **kwargs)
        cell.set_figure(self.figure)
        cell.set_transform(self.get_transform())

        cell.set_clip_on(True)
        cell.set_clip_box(self._axes.bbox)
        cell._text.set_clip_box(self._axes.bbox)
        self._cells[(row, col)] = cell

    def draw(self, renderer):
        if not self.get_visible():
            return
        self._update_positions(renderer)

        keys = self._cells.keys()
        keys.sort()
        for key in keys:
            self._cells[key].draw(renderer)

    def _update_positions(self, renderer):
        keys = numpy.array(self._cells.keys())
        cells = numpy.array([[self._cells.get((row, col), None) for col in range(max(keys[:, 1] + 1))] \
                             for row in range(max(keys[:, 0] + 1))])

        widths = self._get_column_widths(renderer)
        x = self.xy[0] + numpy.array([numpy.sum(widths[:i]) for i in range(len(widths))])
        y = self.xy[1] - numpy.arange(cells.shape[0]) - 0.5

        for i in range(cells.shape[0]):
            for j in range(cells.shape[1]):
                cells[i, j].set_xy((x[j], y[i]))
                cells[i, j].set_width(widths[j])
                cells[i, j].set_height(1.0)

        self._width = numpy.sum(widths)
        self._height = cells.shape[0]

        self.pchanged()

    def _get_column_widths(self, renderer):
        keys = numpy.array(self._cells.keys())
        widths = numpy.zeros(len(keys)).reshape((numpy.max(keys[:, 0] + 1), numpy.max(keys[:, 1] + 1)))
        font_size = self._calc_fontsize(renderer)
        for (row, col), cell in self._cells.items():
            cell.set_fontsize(font_size)
            l, b, w, h = cell._text.get_window_extent(renderer).bounds
            transform = self._axes.transData.inverted()
            x1, _ = transform.transform_point((0, 0))
            x2, _ = transform.transform_point((w + w * TableCell.PAD + 10, 0))
            w = abs(x1 - x2)
            widths[row, col] = w
        return numpy.max(widths, 0)

    def _calc_fontsize(self, renderer):
        transform = self._axes.transData
        _, y1 = transform.transform_point((0, 0))
        _, y2 = transform.transform_point((0, 1))
        return min(max(int(abs(y1 - y2) * 0.85), 4), self.max_fontsize)

    def get_children(self):
        return self._cells.values()

    def get_bbox(self):
        return matplotlib.transforms.Bbox([[self.xy[0], self.xy[1]], [self.xy[0] + 10, self.xy[1] + 180]])

class DendrogramPlotPylab(object):
    def __init__(self, root, data=None, labels=None, dendrogram_width=None, heatmap_width=None, label_width=None, space_width=None, border_width=0.05, plot_attr_names=False, cmap=None, params={}):
        if not matplotlib:
            raise ImportError("Could not import matplotlib module. Please make sure matplotlib is installed on your system.")
        import matplotlib.pyplot as plt
        self.plt = plt
        self.root = root
        self.data = data
        self.labels = labels if labels else [str(i) for i in range(len(root))]
        self.dendrogram_width = dendrogram_width
        self.heatmap_width = heatmap_width
        self.label_width = label_width
        self.space_width = space_width
        self.border_width = border_width
        self.params = params
        self.plot_attr_names = plot_attr_names

    def plotDendrogram(self):
        self.text_items = []
        def draw_tree(tree):
            if tree.branches:
                points = []
                for branch in tree.branches:
                    center = draw_tree(branch)
                    self.plt.plot([center[0], tree.height], [center[1], center[1]], color="black")
                    points.append(center)
                self.plt.plot([tree.height, tree.height], [points[0][1], points[-1][1]], color="black")
                return (tree.height, (points[0][1] + points[-1][1]) / 2.0)
            else:
                return (0.0, tree.first)
        draw_tree(self.root)

    def plotHeatMap(self):
        import numpy.ma as ma
        import numpy
        dx, dy = self.root.height, 0
        fx, fy = self.root.height / len(self.data.domain.attributes), 1.0
        data, c, w = self.data.toNumpyMA()
        data = (data - ma.min(data)) / (ma.max(data) - ma.min(data))
        x = numpy.arange(data.shape[1] + 1) / float(numpy.max(data.shape))
        y = numpy.arange(data.shape[0] + 1) / float(numpy.max(data.shape)) * len(self.root)
        self.heatmap_width = numpy.max(x)

        X, Y = numpy.meshgrid(x, y - 0.5)

        self.meshXOffset = numpy.max(X)

        self.plt.jet()
        mesh = self.plt.pcolormesh(X, Y, data[self.root.mapping], edgecolor="b", linewidth=2)

        if self.plot_attr_names:
            names = [attr.name for attr in self.data.domain.attributes]
            self.plt.xticks(numpy.arange(data.shape[1] + 1) / float(numpy.max(data.shape)), names)
        self.plt.gca().xaxis.tick_top()
        for label in self.plt.gca().xaxis.get_ticklabels():
            label.set_rotation(45)

        for tick in self.plt.gca().xaxis.get_major_ticks():
            tick.tick1On = False
            tick.tick2On = False

    def plotLabels_(self):
        import numpy
        text = TableTextLayout(xy=(self.meshXOffset * 1.005, len(self.root) - 1), tableText=[[label] for label in self.labels])
        text.set_figure(self.plt.gcf())
        self.plt.gca().add_artist(text)
        self.plt.gca()._set_artist_props(text)

    def plotLabels(self):
        table = TablePlot(xy=(0, len(self.root) - 1), axes=self.plt.gca())
        table.set_figure(self.plt.gcf())
        for i, label in enumerate(self.labels):
            table.add_cell(i, 0, width=1, height=1, text=label, loc="left", edgecolor="w")
        table.set_zorder(0)
        self.plt.gca().add_artist(table)
        self.plt.gca()._set_artist_props(table)

    def plot(self, filename=None, show=False):
        self.plt.rcParams.update(self.params)
        labelLen = max(len(label) for label in self.labels)
        w, h = 800, 600
        space = 0.01 if self.space_width is None else self.space_width
        border = self.border_width
        width = 1.0 - 2 * border
        height = 1.0 - 2 * border
        textLineHeight = min(max(h / len(self.labels), 4), self.plt.rcParams.get("font.size", 12))
        maxTextLineWidthEstimate = textLineHeight * labelLen
        textAxisWidthRatio = 2.0 * maxTextLineWidthEstimate / w
        labelsAreaRatio = min(textAxisWidthRatio, 0.4) if self.label_width is None else self.label_width
        x, y = len(self.data.domain.attributes), len(self.data)

        heatmapAreaRatio = min(1.0 * y / h * x / w, 0.3) if self.heatmap_width is None else self.heatmap_width
        dendrogramAreaRatio = 1.0 - labelsAreaRatio - heatmapAreaRatio - 2 * space if self.dendrogram_width is None else self.dendrogram_width

        self.fig = self.plt.figure()
        self.labels_offset = self.root.height / 20.0
        dendrogramAxes = self.plt.axes([border, border, width * dendrogramAreaRatio, height])
        dendrogramAxes.xaxis.grid(True)
        import matplotlib.ticker as ticker

        dendrogramAxes.yaxis.set_major_locator(ticker.NullLocator())
        dendrogramAxes.yaxis.set_minor_locator(ticker.NullLocator())
        dendrogramAxes.invert_xaxis()
        self.plotDendrogram()
        heatmapAxes = self.plt.axes([border + width * dendrogramAreaRatio + space, border, width * heatmapAreaRatio, height], sharey=dendrogramAxes)

        heatmapAxes.xaxis.set_major_locator(ticker.NullLocator())
        heatmapAxes.xaxis.set_minor_locator(ticker.NullLocator())
        heatmapAxes.yaxis.set_major_locator(ticker.NullLocator())
        heatmapAxes.yaxis.set_minor_locator(ticker.NullLocator())

        self.plotHeatMap()
        labelsAxes = self.plt.axes([border + width * (dendrogramAreaRatio + heatmapAreaRatio + 2 * space), border, width * labelsAreaRatio, height], sharey=dendrogramAxes)
        self.plotLabels()
        labelsAxes.set_axis_off()
        labelsAxes.xaxis.set_major_locator(ticker.NullLocator())
        labelsAxes.xaxis.set_minor_locator(ticker.NullLocator())
        labelsAxes.yaxis.set_major_locator(ticker.NullLocator())
        labelsAxes.yaxis.set_minor_locator(ticker.NullLocator())
        if filename:
            import matplotlib.backends.backend_agg
            canvas = matplotlib.backends.backend_agg.FigureCanvasAgg(self.fig)
            canvas.print_figure(filename)
        if show:
            self.plt.show()

"""
Dendrogram plotting using Orange.misc.render
"""

from Orange.misc.render import EPSRenderer, ColorPalette

class DendrogramPlot(object):
    """ A class for drawing dendrograms.

    .. note:: The ``dendrogram_draw`` function is a more convenient
        interface to the functionality provided by this class.

    Example::

        a = DendrogramPlot(tree)
        a.plot("tree.png", format="png")

    """
    def __init__(self, tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, heatmap_width=None, text_width=None,
                 spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, gamma=None, renderer=EPSRenderer, **kwargs):
        self.tree = tree
        self.attr_tree = attr_tree
        if not labels:
            if data and data.domain.class_var:
                labels = [str(ex.getclass()) for ex in data]
            elif hasattr(tree.mapping, "objects"):
                labels = [str(obj) for obj in tree.mapping.objects]
            else:
                labels = [""] * len(tree)
        self.labels = labels

        self.data = data
        self.width, self.height = float(width) if width else None, float(height) if height else None
        self.tree_height = tree_height
        self.heatmap_width = heatmap_width
        self.text_width = text_width
        self.font_size = 10.0
        self.linespacing = 0.0
        self.cluster_colors = cluster_colors
        self.horizontal_margin = 10.0
        self.vertical_margin = 10.0
        self.spacing = float(spacing) if spacing else None
        self.color_palette = color_palette
        self.minv = minv
        self.maxv = maxv
        self.gamma = gamma
        self.set_matrix_color_schema(color_palette, minv, maxv, gamma)
        self.renderer = renderer

    def set_matrix_color_schema(self, color_palette, minv, maxv, gamma=None):
        """ Set the matrix color scheme.
        """
        if isinstance(color_palette, ColorPalette):
            self.color_palette = color_palette
        else:
            self.color_palette = ColorPalette(color_palette)
        self.minv = minv
        self.maxv = maxv
        self.gamma = gamma

    def color_shema(self):
        vals = [float(val) for ex in self.data for val in ex if not val.isSpecial() and val.variable.varType == orange.VarTypes.Continuous] or [0]
        avg = sum(vals) / len(vals)

        maxVal = self.maxv if self.maxv else max(vals)
        minVal = self.minv if self.minv else min(vals)

        def _colorSchema(val):
            if val.isSpecial():
                return self.color_palette(None)
            elif val.variable.varType == orange.VarTypes.Continuous:
                r, g, b = self.color_palette((float(val) - minVal) / abs(maxVal - minVal), gamma=self.gamma)
            elif val.variable.varType == orange.VarTypes.Discrete:
                r = g = b = int(255.0 * float(val) / len(val.variable.values))
            return (r, g, b)
        return _colorSchema

    def layout(self):
        height_final = False
        width_final = False
        tree_height = self.tree_height or 100
        if self.height:
            height, height_final = self.height, True
            heatmap_height = height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin
            font_size = heatmap_height / len(self.labels)
        else:
            font_size = self.font_size
            heatmap_height = font_size * len(self.labels)
            height = heatmap_height + (tree_height + self.spacing if self.attr_tree else 0) + 2 * self.horizontal_margin

        text_width = self.text_width or max([len(label) for label in self.labels] + [0]) * font_size

        if self.width:
            width = self.width
            heatmap_width = width - 2 * self.vertical_margin - tree_height - (2 if self.data else 1) * self.spacing - text_width if self.data else 0
        else:
            heatmap_width = len(self.data.domain.attributes) * heatmap_height / len(self.data) if self.data else 0
            width = 2 * self.vertical_margin + tree_height + (heatmap_width + self.spacing if self.data else 0) + self.spacing + text_width

        return width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size

    def plot(self, filename="graph.eps", **kwargs):
        width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size = self.layout()
        heatmap_cell_height = heatmap_height / len(self.labels)

        heatmap_cell_width = 0.0 if not self.data else heatmap_width / len(self.data.domain.attributes)

        self.renderer = self.renderer(width, height)

        def draw_tree(cluster, root, treeheight, treewidth, color):
            height = treeheight * cluster.height / root.height
            if cluster.branches:
                centers = []
                for branch in cluster.branches:
                    center = draw_tree(branch, root, treeheight, treewidth, self.cluster_colors.get(branch, color))
                    centers.append(center)
                    self.renderer.draw_line(center[0], center[1], center[0], height, stroke_color=self.cluster_colors.get(branch, color))

                self.renderer.draw_line(centers[0][0], height, centers[-1][0], height, stroke_color=self.cluster_colors.get(cluster, color))
                return (centers[0][0] + centers[-1][0]) / 2.0, height
            else:
                return float(treewidth) * cluster.first / len(root), 0.0
        self.renderer.save_render_state()
        self.renderer.translate(self.vertical_margin + tree_height, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0) + heatmap_cell_height / 2.0)
        self.renderer.rotate(90)
        draw_tree(self.tree, self.tree, tree_height, heatmap_height, self.cluster_colors.get(self.tree, (0, 0, 0)))
        self.renderer.restore_render_state()
        if self.attr_tree:
            self.renderer.save_render_state()
            self.renderer.translate(self.vertical_margin + tree_height + self.spacing + heatmap_cell_width / 2.0, self.horizontal_margin + tree_height)
            self.renderer.scale(1.0, -1.0)
            draw_tree(self.attr_tree, self.attr_tree, tree_height, heatmap_width, self.cluster_colors.get(self.attr_tree, (0, 0, 0)))
            self.renderer.restore_render_state()

        self.renderer.save_render_state()
        self.renderer.translate(self.vertical_margin + tree_height + self.spacing, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0))
        if self.data:
            colorSchema = self.color_shema()
            for i, ii in enumerate(self.tree):
                ex = self.data[ii]
                for j, jj in enumerate((self.attr_tree if self.attr_tree else range(len(self.data.domain.attributes)))):
                    r, g, b = colorSchema(ex[jj])
                    self.renderer.draw_rect(j * heatmap_cell_width, i * heatmap_cell_height, heatmap_cell_width, heatmap_cell_height, fill_color=(r, g, b), stroke_color=(255, 255, 255))

        self.renderer.translate(heatmap_width + self.spacing, heatmap_cell_height)
        self.renderer.set_font("Times-Roman", font_size)
        for index in self.tree.mapping:
            self.renderer.draw_text(0.0, 0.0, self.labels[index])
            self.renderer.translate(0.0, heatmap_cell_height)
        self.renderer.restore_render_state()
        self.renderer.save(filename, **kwargs)

def dendrogram_draw(file, cluster, attr_cluster=None, labels=None, data=None,
                    width=None, height=None, tree_height=None,
                    heatmap_width=None, text_width=None, spacing=2,
                    cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]),
                    maxv=None, minv=None, gamma=None,
                    format=None):
    """ Plot the dendrogram to ``file``.

    :param file: An open file or a filename to store the image to.
    :type file: str or a file-like object suitable for writing.

    :param cluster: An instance of :class:`HierarchicalCluster`
    :type cluster: :class:`HierarchicalCluster`

    :param attr_cluster: An instance of :class:`HierarchicalCluster` for the attributes
        in ``data`` (unused if ``data`` is not supplied)
    :type attr_cluster: :class:`HierarchicalCluster`

    :param labels: Labels for the ``cluster`` leaves.
    :type labels: list-of-strings

    :param data: A data table for the (optional) heatmap.
    :type data: :class:`Orange.data.Table`

    :param width: Image width.
    :type width: int

    :param height: Image height.
    :type height: int

    :param tree_height: Dendrogram tree height in the image.
    :type tree_height: int

    :param heatmap_width: Heatmap width.
    :type heatmap_width: int

    :param text_width: Text labels area width.
    :type text_width: int

    :param spacing: Spacing between consecutive leaves.
    :type spacing: int

    :param cluster_colors: A dictionary mapping :class:`HierarchicalCluster`
        instances in ``cluster`` to a RGB color 3-tuple.
    :type cluster_colors: dict

    :param format: Output image format. Currently supported formats are
        "png" (default), "eps" and "svg". You only need this argument if the
        format cannot be deduced from the ``file`` argument.

    :type format: str

    """
    import os
    from Orange.misc.render import PILRenderer, EPSRenderer, SVGRenderer
    if isinstance(file, basestring):
        name, ext = os.path.splitext(file)
        format = ext.lower().lstrip(".") or format

    if format is None:
        format = "png"

    renderer = {"eps": EPSRenderer, "svg": SVGRenderer, "png": PILRenderer}.get(format, PILRenderer)

    d = DendrogramPlot(cluster, attr_cluster, labels, data, width, height,
                       tree_height, heatmap_width, text_width, spacing,
                       cluster_colors, color_palette, maxv, minv, gamma,
                       renderer=renderer)
    if renderer is PILRenderer:
        d.plot(file, format=format)
    else:
        d.plot(file)

def postorder(cluster):
    """ Return a post order list of clusters.

    :param cluster: Cluster
    :type cluster: :class:`HierarchicalCluster`

    :rtype: list of :class:`HierarchicalCluster` instances

    """
    order = []
    visited = set()
    stack = [cluster]
    while stack:
        cluster = stack.pop(0)

        if cluster.branches:
            if cluster in visited:
                order.append(cluster)
            else:
                stack = cluster.branches + [cluster] + stack
                visited.add(cluster)
        else:
            order.append(cluster)
            visited.add(cluster)
    return order

def preorder(cluster):
    """ Return a pre order list of clusters.

    :param cluster: Cluster
    :type cluster: :class:`HierarchicalCluster`

    :rtype: list of :class:`HierarchicalCluster` instances

    """
    order = []
    stack = [cluster]
    while stack:
        cluster = stack.pop(0)
        order.append(cluster)
        if cluster.branches:
            stack = cluster.branches + stack
    return order

def dendrogram_layout(cluster, expand_leaves=False):
    """ Return a layout of the cluster dendrogram on a 2D plane. The return
    value is a list of (subcluster, (start, center, end)) tuples where
    subcluster is an instance of :class:`HierarchicalCluster` and start,
    center and end are the horizontal extents of the cluster branch. The
    list is sorted in post-order.

    :param cluster: Cluster to layout.
    :type cluster: :class:`HierarchicalCluster`

    :param expand_leaves: If True, leaves will span proportionally to the
        number of items they contain, else all leaves will be the same width.

    :rtype: list of (:class:`HierarchicalCluster`, (start, center, end)) tuples

    """
    result = []
    cluster_geometry = {}
    leaf_idx = 0
    for cluster in postorder(cluster):
        if not cluster.branches:
            if expand_leaves:
                start, end = float(cluster.first), float(cluster.last)
            else:
                start = end = leaf_idx
                leaf_idx += 1
            center = (start + end) / 2.0
            cluster_geometry[cluster] = (start, center, end)
            result.append((cluster, (start, center, end)))
        else:
            left = cluster.branches[0]
            right = cluster.branches[1]
            left_center = cluster_geometry[left][1]
            right_center = cluster_geometry[right][1]
            start, end = left_center, right_center
            center = (start + end) / 2.0
            cluster_geometry[cluster] = (start, center, end)
            result.append((cluster, (start, center, end)))

    return result

def clone(cluster):
    """ Clone a cluster, including its subclusters.

    :param cluster: Cluster to clone
    :type cluster: :class:`HierarchicalCluster`

    :rtype: :class:`HierarchicalCluster`

    """
    import copy
    clones = {}
    mapping = copy.copy(cluster.mapping)
    for node in postorder(cluster):
        node_clone = copy.copy(node)
        if node.branches:
            node_clone.branches = [clones[b] for b in node.branches]
        node_clone.mapping = mapping
        clones[node] = node_clone

    return clones[cluster]

def pruned(cluster, level=None, height=None, condition=None):
    """ Return a new pruned clustering instance.

    .. note:: This uses :obj:`clone` to create a copy of the `cluster`
        instance.

    :param cluster: Cluster to prune.
    :type cluster: :class:`HierarchicalCluster`

    :param level: If not None prune all clusters deeper than `level`.
    :type level: int

    :param height: If not None prune all clusters with height lower than
        `height`.
    :type height: float

    :param condition: If not None condition must be a function taking a
        single :class:`HierarchicalCluster` instance and returning a True
        or False value indicating if the cluster should be pruned.
    :type condition: function

    :rtype: :class:`HierarchicalCluster`

    """
    cluster = clone(cluster)
    prune(cluster, level, height, condition)
    return cluster

1443def prune(cluster, level=None, height=None, condition=None):
1444    """ Prune the clustering instance ``cluster`` in place.
1445   
1446    :param cluster: Cluster to prune.
1447    :type cluster: :class:`HierarchicalCluster`
1448   
1449    :param level: If not None prune all clusters deeper then `level`.
1450    :type level: int
1451   
1452    :param height: If not None prune all clusters with height lower then
1453        `height`.
1454    :type height: float
1455   
1456    :param condition: If not None condition must be a function taking a
1457        single :class:`HierarchicalCluster` instance and returning a True
1458        or False value indicating if the cluster should be pruned.
1459    :type condition: function
1460   
1461    """
1462    if not any(arg is not None for arg in [level, height, condition]):
1463        raise ValueError("At least one pruning argument must be supplied")
1464   
1465    level_check = height_check = condition_check = lambda cl: False
1466    cluster_depth = cluster_depths(cluster)
1467   
1468    if level is not None:
1469        level_check = lambda cl: cluster_depth[cl] >= level
1470       
1471    if height is not None:
1472        height_check = lambda cl: cl.height <= height
1473
1474    if condition is not None:
1475        condition_check = condition
1476       
1477    pruned_set = set()
1478   
1479    def check_all(cl):
1480        return any([check(cl) for check in [level_check, height_check,
1481                                            condition_check]])
1482       
1483    for cluster in preorder(cluster):
1484        if cluster not in pruned_set:
1485            if check_all(cluster):
1486                cluster.branches = None
1487                pruned_set.update(set(preorder(cluster)))
1488            else:
1489                pass
1490   
1491   
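The level-based rule in ``prune`` can be sketched standalone. The ``ToyNode``
class and ``toy_prune_level`` function below are hypothetical stand-ins for
:class:`HierarchicalCluster` and the depth check above, for illustration only:

```python
# Toy sketch of level pruning: cut the branches of every cluster whose
# depth reaches `level`.  ToyNode is an illustrative stand-in, not part
# of this module.
class ToyNode:
    def __init__(self, name, branches=None):
        self.name = name
        self.branches = branches

def toy_prune_level(node, level, depth=0):
    if depth >= level:
        node.branches = None        # same effect as cluster.branches = None
    elif node.branches:
        for b in node.branches:
            toy_prune_level(b, level, depth + 1)

root = ToyNode("root", [ToyNode("ab", [ToyNode("a"), ToyNode("b")]),
                        ToyNode("c")])
toy_prune_level(root, 1)
print([b.name for b in root.branches])  # ['ab', 'c']
print(root.branches[0].branches)        # None
```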
def cluster_depths(cluster):
    """ Return a dictionary mapping :class:`HierarchicalCluster` instances to
    their depths in the `cluster` hierarchy.

    :param cluster: Root cluster
    :type cluster: :class:`HierarchicalCluster`

    :rtype: :class:`dict`

    """
    depths = {}
    depths[cluster] = 0
    for node in preorder(cluster):
        node_depth = depths[node]
        if node.branches:
            depths.update(dict.fromkeys(node.branches, node_depth + 1))
    return depths

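The depth bookkeeping above amounts to recording each child one level deeper
than its parent during a preorder walk; a minimal standalone sketch with a
hypothetical ``ToyNode`` in place of :class:`HierarchicalCluster`:

```python
# Illustrative-only sketch of cluster_depths: walk the tree in preorder
# and assign depth(parent) + 1 to every child.
class ToyNode:
    def __init__(self, name, branches=None):
        self.name = name
        self.branches = branches

def toy_preorder(node):
    yield node
    for b in (node.branches or []):
        for sub in toy_preorder(b):
            yield sub

def toy_cluster_depths(root):
    depths = {root: 0}
    for node in toy_preorder(root):
        if node.branches:
            for b in node.branches:
                depths[b] = depths[node] + 1
    return depths

leaves = [ToyNode("a"), ToyNode("b"), ToyNode("c")]
inner = ToyNode("ab", leaves[:2])
root = ToyNode("root", [inner, leaves[2]])
depths = toy_cluster_depths(root)
print([depths[n] for n in (root, inner, leaves[0])])  # [0, 1, 2]
```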
instance_distance_matrix = Orange.distance.distance_matrix

def feature_distance_matrix(data, distance=None, progress_callback=None):
    """ A helper function that computes an :class:`Orange.core.SymMatrix` of
    all pairwise distances between features in `data`.

    :param data: A data table
    :type data: :class:`Orange.data.Table`

    :param distance: a function taking two lists and returning the distance
        (currently unused; a Pearson correlation based distance is always
        applied).
    :type distance: function

    :param progress_callback: A function (taking one argument) used to
        report progress.
    :type progress_callback: function

    :rtype: :class:`Orange.core.SymMatrix`

    """
    attributes = data.domain.attributes
    matrix = orange.SymMatrix(len(attributes))
    iter_count = matrix.dim * (matrix.dim - 1) / 2
    milestones = progress_bar_milestones(iter_count, 100)

    for count, ((i, a1), (j, a2)) in enumerate(_pairs(enumerate(attributes))):
        # Distance based on the Pearson correlation: 0 for r = 1, 1 for r = -1.
        matrix[i, j] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0
        if progress_callback and count in milestones:
            progress_callback(100.0 * count / iter_count)

    return matrix

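The correlation-based distance used above maps r = 1 to 0 and r = -1 to 1. A
dependency-free sketch of that mapping (``pearson_distance`` is a hypothetical
pure-Python stand-in for the ``orange.PearsonCorrelation`` call, shown only to
make the formula concrete):

```python
# Sketch of the Pearson-based feature distance d = (1 - r) / 2:
# perfectly correlated columns get distance 0, anti-correlated get 1.
import math

def pearson_distance(xs, ys):
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = cov / (sx * sy)
    return (1.0 - r) / 2.0

print(round(pearson_distance([1, 2, 3], [2, 4, 6]), 6))  # 0.0 (r = 1)
print(round(pearson_distance([1, 2, 3], [3, 2, 1]), 6))  # 1.0 (r = -1)
```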
def _pairs(seq, same=False):
    """ Return all pairs from elements of `seq`. If `same` is True, also
    include the pair of each element with itself.
    """
    seq = list(seq)
    start = 0 if same else 1
    for i in range(len(seq)):
        for j in range(i + start, len(seq)):
            yield seq[i], seq[j]


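A quick standalone check of what the pair generator yields (``pairs`` below
is a copy of the helper, repeated only so the snippet runs on its own):

```python
# All unordered pairs of a sequence; `same=True` adds the diagonal pairs.
def pairs(seq, same=False):
    seq = list(seq)
    start = 0 if same else 1
    for i in range(len(seq)):
        for j in range(i + start, len(seq)):
            yield seq[i], seq[j]

print(list(pairs([1, 2, 3])))          # [(1, 2), (1, 3), (2, 3)]
print(list(pairs([1, 2], same=True)))  # [(1, 1), (1, 2), (2, 2)]
```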
def joining_cluster(cluster, item1, item2):
    """ Return the cluster where `item1` and `item2` are first joined.

    :param cluster: Clustering.
    :type cluster: :class:`HierarchicalCluster`
    :param item1: An element of `cluster.mapping` or `cluster.mapping.objects`
    :param item2: An element of `cluster.mapping` or `cluster.mapping.objects`

    :rtype: :class:`HierarchicalCluster`

    """
    while True:
        if cluster.branches:
            for branch in cluster.branches:
                if item1 in branch and item2 in branch:
                    cluster = branch
                    break
            else:
                return cluster
        else:
            return cluster


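``joining_cluster`` descends while some branch still contains both items, i.e.
it finds their lowest common cluster. A toy sketch of the same descent, with a
hypothetical ``ToyCluster`` standing in for :class:`HierarchicalCluster`:

```python
# Illustrative-only sketch: the joining cluster is the lowest cluster
# containing both items.
class ToyCluster:
    def __init__(self, items, branches=None):
        self.items = set(items)
        self.branches = branches
    def __contains__(self, item):
        return item in self.items

leaf_a = ToyCluster(["a"])
leaf_b = ToyCluster(["b"])
leaf_c = ToyCluster(["c"])
ab = ToyCluster(["a", "b"], [leaf_a, leaf_b])
root = ToyCluster(["a", "b", "c"], [ab, leaf_c])

def toy_joining_cluster(cluster, item1, item2):
    # Descend while some branch still contains both items.
    while cluster.branches:
        for branch in cluster.branches:
            if item1 in branch and item2 in branch:
                cluster = branch
                break
        else:
            break
    return cluster

print(sorted(toy_joining_cluster(root, "a", "b").items))  # ['a', 'b']
print(sorted(toy_joining_cluster(root, "a", "c").items))  # ['a', 'b', 'c']
```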
def cophenetic_distances(cluster):
    """ Return the cophenetic distance matrix between items in the clustering.
    The cophenetic distance is defined as the height of the cluster where the
    two items are first joined.

    :param cluster: Clustering.
    :type cluster: :class:`HierarchicalCluster`

    :rtype: :class:`Orange.core.SymMatrix`

    """
    mapping = cluster.mapping
    matrix = Orange.core.SymMatrix(len(mapping))
    for node in postorder(cluster):
        if node.branches:
            for branch1, branch2 in _pairs(node.branches):
                for idx1 in mapping[branch1.first: branch1.last]:
                    for idx2 in mapping[branch2.first: branch2.last]:
                        matrix[idx1, idx2] = node.height
        else:
            for ind1, ind2 in _pairs(mapping[node.first: node.last]):
                matrix[ind1, ind2] = node.height

    return matrix


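The idea can be sketched without :class:`SymMatrix`: for every internal
cluster, each pair of leaves split between its branches gets that cluster's
height as their distance. The nested-tuple tree format below is a hypothetical
stand-in used only for this illustration:

```python
# Toy cophenetic distances: a leaf is ("name",), an internal node is
# (height, left, right).  Each cross-branch pair of leaves receives the
# joining cluster's height.
def cophenetic(tree):
    dists = {}
    def leaves(t):
        return [t[0]] if len(t) == 1 else leaves(t[1]) + leaves(t[2])
    def walk(t):
        if len(t) == 1:
            return
        h, left, right = t
        walk(left)
        walk(right)
        for a in leaves(left):
            for b in leaves(right):
                dists[frozenset((a, b))] = h
    walk(tree)
    return dists

tree = (2.0, (1.0, ("a",), ("b",)), ("c",))
d = cophenetic(tree)
print(d[frozenset(("a", "b"))])  # 1.0
print(d[frozenset(("a", "c"))])  # 2.0
```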
def cophenetic_correlation(cluster, matrix):
    """ Return the `cophenetic correlation coefficient
    <http://en.wikipedia.org/wiki/Cophenetic_correlation>`_ of the given
    clustering.

    :param cluster: Clustering.
    :type cluster: :class:`HierarchicalCluster`

    :param matrix: The distance matrix from which the `cluster` was
        derived.

    :rtype: :class:`float`

    """
    import numpy
    cophenetic = cophenetic_distances(cluster)
    cophenetic = numpy.ravel([list(row) for row in cophenetic])
    original = numpy.ravel([list(row) for row in matrix])
    return numpy.corrcoef(cophenetic, original)[0, 1]


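After flattening, the computation reduces to a single Pearson correlation via
``numpy.corrcoef``. A toy demonstration of that final step (the arrays here
are made-up data, not real distance matrices):

```python
# Cophenetic correlation is the Pearson correlation between the original
# and cophenetic distances; perfectly linearly related toy data gives 1.
import numpy

original = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
cophenetic = 2.0 * original + 1.0
r = float(numpy.corrcoef(cophenetic, original)[0, 1])
print(round(r, 6))  # 1.0
```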
if __name__ == "__main__":
    data = orange.ExampleTable("doc/datasets/brown-selected.tab")
#    data = orange.ExampleTable("doc/datasets/iris.tab")
    root = hierarchicalClustering(data, order=True) #, linkage=orange.HierarchicalClustering.Single)
    attr_root = hierarchicalClustering_attributes(data, order=True)
#    print root
#    d = DendrogramPlotPylab(root, data=data, labels=[str(ex.getclass()) for ex in data], dendrogram_width=0.4, heatmap_width=0.3, params={}, cmap=None)
#    d.plot(show=True, filename="graph.png")

    dendrogram_draw("graph.eps", root, attr_tree=attr_root, data=data,
                    labels=[str(e.getclass()) for e in data], tree_height=50, #width=500, height=500,
                    cluster_colors={root.right: (255, 0, 0), root.right.right: (0, 255, 0)},
                    color_palette=ColorPalette([(255, 0, 0), (0, 0, 0), (0, 255, 0)], gamma=0.5,
                                               overflow=(255, 255, 255), underflow=(255, 255, 255))) #, minv=-0.5, maxv=0.5)