source: orange/Orange/clustering/hierarchical.py @ 9723:d0c9015bc73a

Revision 9723:d0c9015bc73a, 65.6 KB checked in by markotoplak, 2 years ago (diff)

Orange.distance renames.

Line 
1r"""
2******************************************
3Hierarchical clustering (``hierarchical``)
4******************************************
5
6.. index::
7   single: clustering, hierarchical, dendrogram
8.. index:: aglomerative clustering
9
10The method for hierarchical clustering, encapsulated in class
11:class:`HierarchicalClustering` works on a distance matrix stored as
12:class:`SymMatrix`. The method works in approximately O(n2) time (with
13the worst case O(n3)). For orientation, clustering ten thousand of
14elements should take roughly 15 seconds on a 2 GHz computer.
15The algorithm can either make a copy of the distances matrix and work on
16it, or work on the original distance matrix, destroying it in the process.
17The latter is useful for clustering larger number of objects. Since the
18distance matrix stores (n+1)(n+2)/2 floats (cca 2 MB for 1000 objects and
19200 MB for 10000, assuming the a float takes 4 bytes), by copying it we
20would quickly run out of physical memory. Using virtual memory is not
21an option since the matrix is accessed in a random manner.
22
23The distance should contain no negative elements. This limitation is
24due to implementation details of the algorithm (it is not absolutely
25necessary and can be lifted in future versions if often requested; it
26only helps the algorithm run a bit faster). The elements on the diagonal
27(representing the element's distance from itself) are ignored.
28
29Distance matrix can have the attribute objects describing the objects we
30are clustering (this is available only in Python). This can be any sequence
31of the same length as the matrix - an ExampleTable, a list of examples, a
32list of attributes (if you're clustering attributes), or even a string of
33the correct length. This attribute is not used in clustering but is only
34passed to the clusters' attribute ``mapping`` (see below), which will hold a
35reference to it (if you modify the list, the changes will affect the
36clusters as well).
37
38.. class:: HierarchicalClustering
39   
40    .. attribute:: linkage
41       
42        Specifies the linkage method, which can be either :
43       
44            1. ``HierarchicalClustering.Single`` (default), where distance
45                between groups is defined as the distance between the closest
46                pair of objects, one from each group,
47            2. ``HierarchicalClustering.Average`` , where the distance between
48                two clusters is defined as the average of distances between
49                all pairs of objects, where each pair is made up of one object
50                from each group, or
51            3. ``HierarchicalClustering.Complete``, where the distance between
52                groups is defined as the distance between the most distant
53                pair of objects, one from each group. Complete linkage is
54                also called farthest neighbor.
55            4. ``HierarchicalClustering.Ward`` uses Ward's distance.
56           
57    .. attribute:: overwrite_matrix
58
59        If true (default is false), the algorithm will work on the original
60        distance matrix, destroying it in the process. The benefit is that it
61        will need much less memory (not much more than what is needed to store
62        the tree of clusters).
63       
64    .. attribute:: progress_callback
65       
66        A callback function (None by default). It can be any function or
67        callable class in Python, which accepts a single float as an
68        argument. The function only gets called if the number of objects
69        being clustered is at least 1000. It will be called for 101 times,
70        and the argument will give the proportion of the work been done.
71        The time intervals between the function calls won't be equal (sorry
72        about that...) since the clustering proceeds faster as the number
73        of clusters decreases.
74       
75    .. method:: __call__(matrix)
76         
77        The ``HierarchicalClustering`` is called with a distance matrix as an
78        argument. It returns an instance of HierarchicalCluster representing
79        the root of the hierarchy (instance of :class:`HierarchicalCluster`).
80        See examples section for details.
81       
82        :param matrix: A distance matrix to perform the clustering on.
83        :type matrix: :class:`Orange.core.SymMatrix`
84
85
86.. class:: HierarchicalCluster
87
88    Represents a node in the clustering tree, as returned by
89    :obj:`HierarchicalClustering`
90
91    .. attribute:: branches
92   
93        A list of sub-clusters (:class:`HierarchicalCluster` instances). If this
94        is a leaf node this attribute is `None`
95       
96    .. attribute:: left
97   
98        The left sub-cluster (defined only if there are only two branches).
99       
100        .. note:: Same as ``branches[0]``
101       
102    .. attribute:: right
103   
104        The right sub-cluster (defined only if there are only two branches).
105       
106        .. note:: Same as ``branches[1]``
107       
108    .. attribute:: height
109   
110        Height of the cluster (distance between the sub-clusters).
111       
112    .. attribute:: mapping
113   
114        A list of indices to the original distance matrix. It is the same
115        for all clusters in the hierarchy - it simply represents the indices
116        ordered according to the clustering.
117       
118    .. attribute:: first
119    .. attribute:: last
120   
121        ``first`` and ``last`` are indices into the elements of ``mapping`` that
122        belong to that cluster. (Seems weird, but is trivial - wait for the
123        examples. On the other hand, you probably won't need to understand this
124        anyway).
125
126    .. method:: __len__()
127   
128        Asking for the length of the cluster gives the number of the objects
129        belonging to it. This equals ``last - first``.
130   
131    .. method:: __getitem__(index)
132   
133        By indexing the cluster we address its elements; these are either
134        indices or objects (you'll understand this after seeing the examples).
135        For instance cluster[2] gives the third element of the cluster, and
136        list(cluster) will return the cluster elements as a list. The cluster
137        elements are read-only. To actually modify them, you'll have to go
138        through mapping, as described below. This is intentionally complicated
139        to discourage a naive user from doing what he does not understand.
140   
141    .. method:: swap()
142   
143        Swaps the ``left`` and the ``right`` subcluster; obviously this will
144        report an error when the cluster has more than two subclusters. This
145        function changes the mapping and first and last of all clusters below
146        this one and thus needs O(len(cluster)) time.
147       
148    .. method:: permute(permutation)
149   
150        Permutes the subclusters. Permutation gives the order in which the
151        subclusters will be arranged. As for swap, this function changes the
152        mapping and first and last of all clusters below this one.
153   
154   
155Example 1 - Toy matrix
156----------------------
157
158Let us construct a simple distance matrix and run clustering on it.
159::
160
161    import Orange
162    from Orange.clustering import hierarchical
163    m = [[],
164         [ 3],
165         [ 2, 4],
166         [17, 5, 4],
167         [ 2, 8, 3, 8],
168         [ 7, 5, 10, 11, 2],
169         [ 8, 4, 1, 5, 11, 13],
170         [ 4, 7, 12, 8, 10, 1, 5],
171         [13, 9, 14, 15, 7, 8, 4, 6],
172         [12, 10, 11, 15, 2, 5, 7, 3, 1]]
173    matrix = Orange.core.SymMatrix(m)
174    root = hierarchical.HierarchicalClustering(matrix,
175            linkage=hierarchical.HierarchicalClustering.Average)
176   
177Root is a root of the cluster hierarchy. We can print using a
178simple recursive function.
179::
180
181    def printClustering(cluster):
182        if cluster.branches:
183            return "(%s%s)" % (printClustering(cluster.left), printClustering(cluster.right))
184        else:
185            return str(cluster[0])
186           
187The output is not exactly nice, but it will have to do. Our clustering,
188printed by calling printClustering(root) looks like this
189::
190   
191    (((04)((57)(89)))((1(26))3))
192   
193The elements are separated into two groups, the first containing elements
1940, 4, 5, 7, 8, 9, and the second 1, 2, 6, 3. The difference between them
195equals ``root.height``, 9.0 in our case. The first cluster is further
196divided onto 0 and 4 in one, and 5, 7, 8, 9 in the other subcluster...
197
198It is easy to print out the cluster's objects. Here's what's in the
199left subcluster of root.
200::
201
202    >>> for el in root.left:
203        ... print el,
204    0 4 5 7 8 9
205   
206Everything that can be iterated over, can as well be cast into a list or
207tuple. Let us demonstrate this by writing a better function for printing
208out the clustering (which will also come handy for something else in a
209while). The one above supposed that each leaf contains a single object.
210This is not necessarily so; instead of printing out the first (and
211supposedly the only) element of cluster, cluster[0], we shall print
212it out as a tuple.
213::
214
215    def printClustering2(cluster):
216        if cluster.branches:
217            return "(%s%s)" % (printClustering2(cluster.left), printClustering2(cluster.right))
218        else:
219            return str(tuple(cluster))
220           
221The distance matrix could have been given a list of objects. We could,
222for instance, put
223::
224   
225    matrix.objects = ["Ann", "Bob", "Curt", "Danny", "Eve",
226                      "Fred", "Greg", "Hue", "Ivy", "Jon"]
227
228above calling the HierarchicalClustering.
229
230.. note:: This code will actually trigger a warning;
231    to avoid it, use matrix.setattr("objects", ["Ann", "Bob"....
232..    Why this is needed is explained in the page on `Orange peculiarities`_. TODO: Create page on Orange Peculiarities.
233
234If we've forgotten to store the objects into matrix prior to clustering,
235nothing is lost. We can add it into clustering later, by
236::
237
238    root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"]
239   
240So, what do these "objects" do? Call printClustering(root) again and you'll
241see. Or, let us print out the elements of the first left cluster, as we did
242before.
243::
244
245    >>> for el in root.left:
246        ... print el,
247    Ann Eve Fred Hue Ivy Jon
248   
249If objects are given, the cluster's elements, as got by indexing
250(eg root.left[2]) or by iteration, as in the above case, won't be
251indices but the elements we clustered. If we put an ExampleTable
252into objects, root.left[-1] will be the last example of the first
253left cluster.
254
255Now for swapping and permutations. ::
256
257    >>> printClustering(root)
258    ((('Ann''Eve')(('Fred''Hue')('Ivy''Jon')))(('Bob'('Curt''Greg'))'Danny'))
259    >>> root.left.swap()
260    >>> printClustering(root)
261    (((('Fred''Hue')('Ivy''Jon'))('Ann''Eve'))(('Bob'('Curt''Greg'))'Danny'))
262    >>> root.permute([1, 0])
263    >>> printClustering(root)
264    ((('Bob'('Curt''Greg'))'Danny')((('Fred''Hue')('Ivy''Jon'))('Ann''Eve')))
265   
266Calling ``root.left.swap`` reversed the order of subclusters of ``root.left``
267and ``root.permute([1, 0])`` (which is equivalent to ``root.swap`` - there
268aren't many possible permutations of two elements) reverses the order
269of ``root.left`` and ``root.right``.
270
271Let us write function for cluster pruning. ::
272
273    def prune(cluster, togo):
274        if cluster.branches:
275            if togo<0:
276                cluster.branches = None
277            else:
278                for branch in cluster.branches:
279                    prune(branch, togo-cluster.height)
280
281We shall use ``printClustering2`` here, since we can have multiple elements
282in a leaf of the clustering hierarchy. ::
283   
284    >>> prune(root, 9)
285    >>> print printClustering2(root)
286    ((('Bob', 'Curt', 'Greg')('Danny',))(('Fred', 'Hue', 'Ivy', 'Jon')('Ann', 'Eve')))
287   
288We've ended up with four clusters. Need a list of clusters?
289Here's the function. ::
290   
291    def listOfClusters0(cluster, alist):
292        if not cluster.branches:
293            alist.append(list(cluster))
294        else:
295            for branch in cluster.branches:
296                listOfClusters0(branch, alist)
297               
298    def listOfClusters(root):
299        l = []
300        listOfClusters0(root, l)
301        return l
302       
303The function returns a list of lists, in our case
304``[['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']]``   
305If there were no ``objects`` the list would contains indices instead of names.
306
307Example 2 - Clustering of examples
308----------------------------------
309
310The most common things to cluster are certainly examples. To show how to
311this is done, we shall now load the Iris data set, initialize a distance
312matrix with the distances measure by :class:`ExamplesDistance_Euclidean`
313and cluster it with average linkage. Since we don't need the matrix,
314we shall let the clustering overwrite it (not that it's needed for
315such a small data set as Iris). ::
316
317    import Orange
318    from Orange.clustering import hierarchical
319
320    data = Orange.data.Table("iris")
321    matrix = Orange.core.SymMatrix(len(data))
322    matrix.setattr("objects", data)
323    distance = Orange.distance.EuclideanConstructor(data)
324    for i1, instance1 in enumerate(data):
325        for i2 in range(i1+1, len(data)):
326            matrix[i1, i2] = distance(instance1, data[i2])
327           
328    clustering = hierarchical.HierarchicalClustering()
329    clustering.linkage = clustering.Average
330    clustering.overwrite_matrix = 1
331    root = clustering(matrix)
332
333Note that we haven't forgotten to set the ``matrix.objects``. We did it
334through ``matrix.setattr`` to avoid the warning. Let us now prune the
335clustering using the function we've written above, and print out the
336clusters. ::
337   
338    prune(root, 1.4)
339    for n, cluster in enumerate(listOfClusters(root)):
340        print "\n\n Cluster %i \n" % n
341        for instance in cluster:
342            print instance
343           
344Since the printout is pretty long, it might be more informative to just
345print out the class distributions for each cluster. ::
346   
347    for cluster in listOfClusters(root):
348        dist = Orange.core.get_class_distribution(cluster)
349        for e, d in enumerate(dist):
350            print "%s: %3.0f " % (data.domain.class_var.values[e], d),
351        print
352       
353Here's what it shows. ::
354
355    Iris-setosa:  49    Iris-versicolor:   0    Iris-virginica:   0
356    Iris-setosa:   1    Iris-versicolor:   0    Iris-virginica:   0
357    Iris-setosa:   0    Iris-versicolor:  50    Iris-virginica:  17
358    Iris-setosa:   0    Iris-versicolor:   0    Iris-virginica:  33
359   
360Note something else: ``listOfClusters`` does not return a list of
361:class:`Orange.data.Table`, but a list of lists of instances. Therefore,
362in the above script, cluster is a list of examples, not an ``Table``, but
363it gets converted to it automatically when the function is called.
364Most Orange functions will do this for you automatically. You can, for
365instance, call a learning algorithms, passing a cluster as an argument.
366It won't mind. If you, however, want to have a list of table, you can
367easily convert the list by ::
368
369    tables = [Orange.data.Table(cluster) for cluster in listOfClusters(root)]
370   
371Finally, if you are dealing with examples, you may want to take the function
372``listOfClusters`` and replace ``alist.append(list(cluster))`` by
373``alist.append(Orange.data.Table(cluster))``. This function is less general,
374it will fail if objects are not of type :class:`Orange.data.Instance`.
375However, instead of list of lists, it will return a list of tables.
376
377How the data in ``HierarchicalCluster`` is really stored?
378---------------------------------------------------------
379
380To demonstrate how the data in clusters is stored, we shall continue with
381the clustering we got in the first example. ::
382   
383    >>> del root.mapping.objects
384    >>> print printClustering(root)
385    (((1(26))3)(((57)(89))(04)))
386    >>> print root.mapping
387    <1, 2, 6, 3, 5, 7, 8, 9, 0, 4>
388    >>> print root.left.first
389    0
390    >>> print root.left.last
391    4
392    >>> print root.left.mapping[root.left.first:root.left.last]
393    <1, 2, 6, 3>
394    >>> print root.left.left.first
395    0
396    >>> print root.left.left.last
397    3
398   
399We removed objects to just to see more clearly what is going on.
400``mapping`` is an ordered list of indices to the rows/columns of distance
401matrix (and, at the same time, indices into objects, if they exist). Each
402cluster's fields ``first`` and ``last`` are indices into mapping, so the
403clusters elements are actually
404``cluster.mapping[cluster.first:cluster.last]``. ``cluster[i]`` therefore
405returns ``cluster.mapping[cluster.first+i]`` or, if objects are specified,
406``cluster.objects[cluster.mapping[cluster.first+i]]``. Space consumption
407is minimal since all clusters share the same objects ``mapping`` and
408``objects``.
409
410
411Subclusters are ordered so that ``cluster.left.last`` always equals
412``cluster.right.first`` or, in general, ``cluster.branches[i].last``
413equals ``cluster.branches[i+1].first``.
414
415
416Swapping and permutation do three things: change the order of elements in
417``branches``, permute the corresponding regions in ``mapping`` and adjust
418the ``first`` and ``last`` for all the clusters below. For the latter, when
419subclusters of cluster are permuted, the entire subtree starting at
420``cluster.branches[i]`` is moved by the same offset.
421
422
423The hierarchy of objects that represent a clustering is open, everything is
424accessible from Python. You can write your own clustering algorithms that
425build this same structure, or you can use Orange's clustering and then do to
426the structure anything you want. For instance prune it, as we have shown
427earlier. However, it is easy to do things wrong: shuffle the mapping, for
428instance, and forget to adjust the ``first`` and ``last`` pointers. Orange
429does some checking for the internal consistency, but you are surely smarter
430and can find a way to crash it. For instance, just create a cycle in the
431structure, call ``swap`` for some cluster above the cycle and you're there.
432But don't blame it on me then.
433
434
435Utility Functions
436-----------------
437
438.. autofunction:: clustering
439.. autofunction:: clustering_features
440.. autofunction:: cluster_to_list
441.. autofunction:: top_clusters
442.. autofunction:: top_cluster_membership
443.. autofunction:: order_leaves
444
445.. autofunction:: postorder
446.. autofunction:: preorder
447.. autofunction:: dendrogram_layout
448.. autofunction:: dendrogram_draw
449.. autofunction:: clone
450.. autofunction:: prune
451.. autofunction:: pruned
452.. autofunction:: cluster_depths
453.. autofunction:: instance_distance_matrix
454.. autofunction:: feature_distance_matrix
455.. autofunction:: joining_cluster
456.. autofunction:: cophenetic_distances
457.. autofunction:: cophenetic_correlation
458
459"""
460
461import orange
462import Orange
463from Orange.core import HierarchicalClustering, \
464                        HierarchicalCluster, \
465                        HierarchicalClusterList
466
467from Orange.misc import progress_bar_milestones, deprecated_keywords
468                       
469import sys
470
471SINGLE = HierarchicalClustering.Single
472AVERAGE = HierarchicalClustering.Average
473COMPLETE = HierarchicalClustering.Complete
474WARD = HierarchicalClustering.Ward
475
476def clustering(data,
477               distance_constructor=orange.ExamplesDistanceConstructor_Euclidean,
478               linkage=AVERAGE,
479               order=False,
480               progress_callback=None):
481    """ Return a hierarchical clustering of the instances in a data set.
482   
483    :param data: Input data table for clustering.
484    :type data: :class:`Orange.data.Table`
485    :param distance_constructor: Instance distance constructor
486    :type distance_constructor: :class:`Orange.distance.ExamplesDistanceConstructor`
487    :param linkage: Linkage flag. Must be one of global module level flags:
488   
489        - SINGLE
490        - AVERAGE
491        - COMPLETE
492        - WARD
493       
494    :type linkage: int
495    :param order: If `True` run `order_leaves` on the resulting clustering.
496    :type order: bool
497    :param progress_callback: A function (taking one argument) to use for
498        reporting the on the progress.
499    :type progress_callback: function
500   
501    :rtype: :class:`HierarchicalCluster`
502   
503    """
504    distance = distance_constructor(data)
505    matrix = orange.SymMatrix(len(data))
506    for i in range(len(data)):
507        for j in range(i+1):
508            matrix[i, j] = distance(data[i], data[j])
509    root = HierarchicalClustering(matrix, linkage=linkage, progress_callback=(lambda value, obj=None: progress_callback(value*100.0/(2 if order else 1))) if progress_callback else None)
510    if order:
511        order_leaves(root, matrix, progress_callback=(lambda value: progress_callback(50.0 + value/2)) if progress_callback else None)
512    return root
513
514clustering = \
515    deprecated_keywords({"distanceConstructor": "distance_constructor",
516                         "progressCallback": "progress_callback"})(clustering)
517
518
519def clustering_features(data, distance=None, linkage=orange.HierarchicalClustering.Average, order=False, progress_callback=None):
520    """ Return hierarchical clustering of attributes in a data set.
521   
522    :param data: Input data table for clustering.
523    :type data: :class:`Orange.data.Table`
524    :param distance: Attribute distance constructor
525        .. note:: currently not used.
526    :param linkage: Linkage flag. Must be one of global module level flags:
527   
528        - SINGLE
529        - AVERAGE
530        - COMPLETE
531        - WARD
532       
533    :type linkage: int
534    :param order: If `True` run `order_leaves` on the resulting clustering.
535    :type order: bool
536    :param progress_callback: A function (taking one argument) to use for
537        reporting the on the progress.
538    :type progress_callback: function
539   
540    :rtype: :class:`HierarchicalCluster`
541   
542    """
543    matrix = orange.SymMatrix(len(data.domain.attributes))
544    for a1 in range(len(data.domain.attributes)):
545        for a2 in range(a1):
546            matrix[a1, a2] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0
547    root = orange.HierarchicalClustering(matrix, linkage=linkage, progress_callback=(lambda value, obj=None: progress_callback(value*100.0/(2 if order else 1))) if progress_callback else None)
548    if order:
549        order_leaves(root, matrix, progressCallback=(lambda value: progress_callback(50.0 + value/2)) if progress_callback else None)
550    return root
551
552clustering_features = \
553    deprecated_keywords({"progressCallback":"progress_callback"})(clustering_features)
554   
555   
556def cluster_to_list(node, prune=None):
557    """ Return a list of clusters down from the node of hierarchical clustering.
558   
559    :param node: Cluster node.
560    :type node: :class:`HierarchicalCluster`
561    :param prune: If not `None` it must be a positive integer. Any cluster
562        with less then `prune` items will be left out of the list.
563    :type node: int or `NoneType`
564   
565    :rtype: list of :class:`HierarchicalCluster` instances
566   
567    """
568    if prune:
569        if len(node) <= prune:
570            return [] 
571    if node.branches:
572        return [node] + cluster_to_list(node.left, prune) + cluster_to_list(node.right, prune)
573    return [node]
574
575
576def top_clusters(root, k):
577    """ Return k topmost clusters from hierarchical clustering.
578   
579    :param root: Root cluster.
580    :type root: :class:`HierarchicalCluster`
581    :param k: Number of top clusters.
582    :type k: int
583   
584    :rtype: list of :class:`HierarchicalCluster` instances
585
586    """
587    candidates = set([root])
588    while len(candidates) < k:
589        repl = max([(c.height, c) for c in candidates if c.branches])[1]
590        candidates.discard(repl)
591        candidates.add(repl.left)
592        candidates.add(repl.right)
593    return candidates
594
595def top_cluster_membership(root, k):
596    """ Return data instances' cluster membership (list of indices) to k topmost clusters.
597   
598    :param root: Root cluster.
599    :type root: :class:`HierarchicalCluster`
600    :param k: Number of top clusters.
601    :type k: int
602   
603    :rtype: list-of-int
604   
605    """
606    clist = top_clusters(root, k)
607    cmap = [None] * len(root)
608    for i, c in enumerate(clist):
609        for e in c:
610            cmap[e] = i
611    return cmap
612
613def order_leaves_py(tree, matrix, progress_callback=None):
614    """Order the leaves in the clustering tree.
615   
616    (based on Ziv Bar-Joseph et al. (Fast optimal leaf ordering for hierarchical clustering))
617   
618    :param tree: Binary hierarchical clustering tree.
619    :type tree: :class:`HierarchicalCluster`
620    :param matrix: SymMatrix that was used to compute the clustering.
621    :type matrix: :class:`Orange.core.SymMatrix`
622    :param progress_callback: Function used to report on progress.
623    :type progress_callback: function
624   
625    .. note:: The ordering is done inplace.
626   
627    """
628   
629    objects = getattr(tree.mapping, "objects", None)
630    tree.mapping.setattr("objects", range(len(tree)))
631    M = {}
632    ordering = {}
633    visited_clusters = set()
634   
635#    def opt_ordering_recursive(tree):
636#        if len(tree)==1:
637#            for leaf in tree:
638#                M[tree, leaf, leaf] = 0
639#        else:
640#            opt_ordering_recursive(tree.left)
641#            opt_ordering_recursive(tree.right)
642#           
643#            Vl = set(tree.left)
644#            Vr = set(tree.right)
645#            Vlr = set(tree.left.right or tree.left)
646#            Vll = set(tree.left.left or tree.left)
647#            Vrr = set(tree.right.right or tree.right)
648#            Vrl = set(tree.right.left or tree.right)
649#            other = lambda e, V1, V2: V2 if e in V1 else V1
650#            tree_left, tree_right = tree.left, tree.right
651#            for u in Vl:
652#                for w in Vr:
653##                    if True: #Improved search
654#                    C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)])
655#                    ordered_m = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m])
656#                    ordered_k = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k])
657#                    k0 = ordered_k[0]
658#                    cur_min = 1e30000
659#                    cur_m = cur_k = None
660#                    for m in ordered_m:
661#                        if M[tree_left, u, m] + M[tree_right, w, k0] + C >= cur_min:
662#                            break
663#                        for k in  ordered_k:
664#                            if M[tree_left, u, m] + M[tree_right, w, k] + C >= cur_min:
665#                                break
666#                            test_min = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
667#                            if cur_min > test_min:
668#                                cur_min = test_min
669#                                cur_m = m
670#                                cur_k = k
671#                    M[tree, u, w] = M[tree, w, u] = cur_min
672#                    ordering[tree, u, w] = (tree_left, u, cur_m, tree_right, w, cur_k)
673#                    ordering[tree, w, u] = (tree_right, w, cur_k, tree_left, u, cur_m)
674##                    else:
675##                    def MFunc((m, k)):
676##                        return M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
677##                    m, k = min([(m, k) for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)], key=MFunc)
678##                    M[tree, u, w] = M[tree, w, u] = MFunc((m, k))
679##                    ordering[tree, u, w] = (tree_left, u, m, tree_right, w, k)
680##                    ordering[tree, w, u] = (tree_right, w, k, tree_left, u, m)
681#
682#            if progressCallback:
683#                progressCallback(100.0 * len(visited_clusters) / len(tree.mapping))
684#                visited_clusters.add(tree)
685#   
686#    with recursion_limit(sys.getrecursionlimit() + len(tree)):
687#        opt_ordering_recursive(tree)
688       
689    def opt_ordering_iterative(tree):
690        if len(tree)==1:
691            for leaf in tree:
692                M[tree, leaf, leaf] = 0
693        else:
694#            _optOrdering(tree.left)
695#            _optOrdering(tree.right)
696           
697            Vl = set(tree.left)
698            Vr = set(tree.right)
699            Vlr = set(tree.left.right or tree.left)
700            Vll = set(tree.left.left or tree.left)
701            Vrr = set(tree.right.right or tree.right)
702            Vrl = set(tree.right.left or tree.right)
703            other = lambda e, V1, V2: V2 if e in V1 else V1
704            tree_left, tree_right = tree.left, tree.right
705            for u in Vl:
706                for w in Vr:
707#                    if True: #Improved search
708                        C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)])
709                        ordered_m = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m])
710                        ordered_k = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k])
711                        k0 = ordered_k[0]
712                        cur_min = 1e30000 
713                        cur_m = cur_k = None
714                        for m in ordered_m:
715                            if M[tree_left, u, m] + M[tree_right, w, k0] + C >= cur_min:
716                                break
717                            for k in  ordered_k:
718                                if M[tree_left, u, m] + M[tree_right, w, k] + C >= cur_min:
719                                    break
720                                test_min = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
721                                if cur_min > test_min:
722                                    cur_min = test_min
723                                    cur_m = m
724                                    cur_k = k
725                        M[tree, u, w] = M[tree, w, u] = cur_min
726                        ordering[tree, u, w] = (tree_left, u, cur_m, tree_right, w, cur_k)
727                        ordering[tree, w, u] = (tree_right, w, cur_k, tree_left, u, cur_m)
728#                    else:
729#                        def MFunc((m, k)):
730#                            return M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
731#                        m, k = min([(m, k) for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)], key=MFunc)
732#                        M[tree, u, w] = M[tree, w, u] = MFunc((m, k))
733#                        ordering[tree, u, w] = (tree_left, u, m, tree_right, w, k)
734#                        ordering[tree, w, u] = (tree_right, w, k, tree_left, u, m)
735
736#            if progressCallback:
737#                progressCallback(100.0 * len(visited_clusters) / len(tree.mapping))
738#                visited_clusters.add(tree)
739   
740    subtrees = postorder(tree)
741    milestones = progress_bar_milestones(len(subtrees), 1000)
742   
743    for i, subtree in enumerate(subtrees):
744        opt_ordering_iterative(subtree)
745        if progress_callback and i in milestones:
746            progress_callback(100.0 * i / len(subtrees))
747
748#    def order_recursive(tree, u, w):
749#        """ Order the tree based on the computed optimal ordering.
750#        """
751#        if len(tree)==1:
752#            return
753#        left, u, m, right, w, k = ordering[tree, u, w]
754#        if len(left)>1 and m not in left.right:
755#            left.swap()
756#        order_recursive(left, u, m)
757#       
758#        if len(right)>1 and k not in right.left:
759#            right.swap()
760#        order_recursive(right, k, w)
761       
762    def order_iterative(tree, u, w):
763        """ Order the tree based on the computed optimal ordering.
764        """
765        opt_uw = {tree: (u, w)}
766        for subtree in preorder(tree):
767            if subtree.branches:
768                u, w = opt_uw[subtree]
769                left, u, m, right, w, k = ordering[subtree, u, w]
770                opt_uw[left] = (u, m)
771                opt_uw[right] = (k, w)
772               
773                if left.branches and m not in left.right:
774                    left.swap()
775               
776                if right.branches and k not in right.left:
777                    right.swap()
778   
779    u, w = min([(u, w) for u in tree.left for w in tree.right], key=lambda (u, w): M[tree, u, w])
780   
781##    print "M(v) =", M[tree, u, w]
782   
783#    with recursion_limit(sys.getrecursionlimit() + len(tree)):
784#        order_recursive(tree, u, w)
785           
786    order_iterative(tree, u, w)
787           
788
789#    def _check(tree, u, w):
790#        if len(tree)==1:
791#            return
792#        left, u, m, right, w, k = ordering[tree, u, w]
793#        if tree[0] == u and tree[-1] == w:
794#            _check(left, u, m)
795#            _check(right, k, w)
796#        else:
797#            print "Error:", u, w, tree[0], tree[-1]
798#
799#    with recursion_limit(sys.getrecursionlimit() + len(tree)):
800#        _check(tree, u ,w)
801
802    if objects:
803        tree.mapping.setattr("objects", objects)
804
805
806def order_leaves_cpp(tree, matrix, progress_callback=None):
807    """ Order the leaves in the clustering tree (C++ implementation).
808   
809    (based on Ziv Bar-Joseph et al. (Fast optimal leaf ordering for hierarchical clustering))
810   
811    :param tree: Binary hierarchical clustering tree.
812    :type tree: :class:`HierarchicalCluster`
813    :param matrix: SymMatrix that was used to compute the clustering.
814    :type matrix: :class:`Orange.core.SymMatrix`
815    :param progress_callback: Function used to report on progress.
816    :type progress_callback: function
817   
818    .. note:: The ordering is done inplace.
819   
820    """
821    node_count = iter(range(len(tree)))
822   
823    if progress_callback is not None:
824        def p(*args):
825            progress_callback(100.0 * node_count.next() / len(tree))
826    else:
827        p = None
828   
829    Orange.core.HierarchicalClusterOrdering(tree, matrix, progress_callback=p)
830
831order_leaves_cpp = deprecated_keywords({"progressCallback":"progress_callback"})(order_leaves_cpp)
832order_leaves_py = deprecated_keywords({"progressCallback":"progress_callback"})(order_leaves_py)
833
834order_leaves = order_leaves_cpp
835   
836"""
837Matplotlib dendrogram ploting. This is mostly untested,
838use dendrogram_draw funciton instead of this.
839
840"""
841try:
842    import numpy
843except ImportError:
844    numpy = None
845
846try:
847    import matplotlib
848    from matplotlib.figure import Figure
849    from matplotlib.table import Table, Cell
850    from matplotlib.text import Text
851    from matplotlib.artist import Artist
852##    import  matplotlib.pyplot as plt
853except (ImportError, IOError, RuntimeError), ex:
854    matplotlib = None
855    Text , Artist, Table, Cell = object, object, object, object
856
857class TableCell(Cell):
858    PAD = 0.05
859    def __init__(self, *args, **kwargs):
860        Cell.__init__(self, *args, **kwargs)
861        self._text.set_clip_on(True)
862
863class TablePlot(Table):
864    max_fontsize = 12
865    def __init__(self, xy, axes=None, bbox=None):
866        Table.__init__(self, axes or plt.gca(), bbox=bbox)
867        self.xy = xy
868        self.set_transform(self._axes.transData)
869        self._fixed_widhts = None
870        import matplotlib.pyplot as plt
871        self.max_fontsize = plt.rcParams.get("font.size", 12)
872
873    def add_cell(self, row, col, *args, **kwargs):
874        xy = (0,0)
875
876        cell = TableCell(xy, *args, **kwargs)
877        cell.set_figure(self.figure)
878        cell.set_transform(self.get_transform())
879
880        cell.set_clip_on(True)
881        cell.set_clip_box(self._axes.bbox)
882        cell._text.set_clip_box(self._axes.bbox)
883        self._cells[(row, col)] = cell
884
885    def draw(self, renderer):
886        if not self.get_visible(): return
887        self._update_positions(renderer)
888
889        keys = self._cells.keys()
890        keys.sort()
891        for key in keys:
892            self._cells[key].draw(renderer)
893
894    def _update_positions(self, renderer):
895        keys = numpy.array(self._cells.keys())
896        cells = numpy.array([[self._cells.get((row, col), None) for col in range(max(keys[:, 1] + 1))] \
897                             for row in range(max(keys[:, 0] + 1))])
898       
899        widths = self._get_column_widths(renderer)
900        x = self.xy[0] + numpy.array([numpy.sum(widths[:i]) for i in range(len(widths))])
901        y = self.xy[1] - numpy.arange(cells.shape[0]) - 0.5
902       
903        for i in range(cells.shape[0]):
904            for j in range(cells.shape[1]):
905                cells[i, j].set_xy((x[j], y[i]))
906                cells[i, j].set_width(widths[j])
907                cells[i, j].set_height(1.0)
908
909        self._width = numpy.sum(widths)
910        self._height = cells.shape[0]
911
912        self.pchanged()
913
914    def _get_column_widths(self, renderer):
915        keys = numpy.array(self._cells.keys())
916        widths = numpy.zeros(len(keys)).reshape((numpy.max(keys[:,0]+1), numpy.max(keys[:,1]+1)))
917        fontSize = self._calc_fontsize(renderer)
918        for (row, col), cell in self._cells.items():
919            cell.set_fontsize(fontSize)
920            l, b, w, h = cell._text.get_window_extent(renderer).bounds
921            transform = self._axes.transData.inverted()
922            x1, _ = transform.transform_point((0, 0))
923            x2, _ = transform.transform_point((w + w*TableCell.PAD + 10, 0))
924            w = abs(x1 - x2)
925            widths[row, col] = w
926        return numpy.max(widths, 0)
927
928    def _calc_fontsize(self, renderer):
929        transform = self._axes.transData
930        _, y1 = transform.transform_point((0, 0))
931        _, y2 = transform.transform_point((0, 1))
932        return min(max(int(abs(y1 - y2)*0.85) ,4), self.max_fontsize)
933
934    def get_children(self):
935        return self._cells.values()
936
937    def get_bbox(self):
938        return matplotlib.transform.Bbox([self.xy[0], self.xy[1], self.xy[0] + 10, self.xy[1] + 180])
939
940class DendrogramPlotPylab(object):
941    def __init__(self, root, data=None, labels=None, dendrogram_width=None, heatmap_width=None, label_width=None, space_width=None, border_width=0.05, plot_attr_names=False, cmap=None, params={}):
942        if not matplotlib:
943            raise ImportError("Could not import matplotlib module. Please make sure matplotlib is installed on your system.")
944        import matplotlib.pyplot as plt
945        self.plt = plt
946        self.root = root
947        self.data = data
948        self.labels = labels if labels else [str(i) for i in range(len(root))]
949        self.dendrogram_width = dendrogram_width
950        self.heatmap_width = heatmap_width
951        self.label_width = label_width
952        self.space_width = space_width
953        self.border_width = border_width
954        self.params = params
955        self.plot_attr_names = plot_attr_names
956
957    def plotDendrogram(self):
958        self.text_items = []
959        def draw_tree(tree):
960            if tree.branches:
961                points = []
962                for branch in tree.branches:
963                    center = draw_tree(branch)
964                    self.plt.plot([center[0], tree.height], [center[1], center[1]], color="black")
965                    points.append(center)
966                self.plt.plot([tree.height, tree.height], [points[0][1], points[-1][1]], color="black")
967                return (tree.height, (points[0][1] + points[-1][1])/2.0)
968            else:
969                return (0.0, tree.first)
970        draw_tree(self.root)
971       
972    def plotHeatMap(self):
973        import numpy.ma as ma
974        import numpy
975        dx, dy = self.root.height, 0
976        fx, fy = self.root.height/len(self.data.domain.attributes), 1.0
977        data, c, w = self.data.toNumpyMA()
978        data = (data - ma.min(data))/(ma.max(data) - ma.min(data))
979        x = numpy.arange(data.shape[1] + 1)/float(numpy.max(data.shape))
980        y = numpy.arange(data.shape[0] + 1)/float(numpy.max(data.shape))*len(self.root)
981        self.heatmap_width = numpy.max(x)
982
983        X, Y = numpy.meshgrid(x, y - 0.5)
984
985        self.meshXOffset = numpy.max(X)
986
987        self.plt.jet()
988        mesh = self.plt.pcolormesh(X, Y, data[self.root.mapping], edgecolor="b", linewidth=2)
989
990        if self.plot_attr_names:
991            names = [attr.name for attr in self.data.domain.attributes]
992            self.plt.xticks(numpy.arange(data.shape[1] + 1)/float(numpy.max(data.shape)), names)
993        self.plt.gca().xaxis.tick_top()
994        for label in self.plt.gca().xaxis.get_ticklabels():
995            label.set_rotation(45)
996
997        for tick in self.plt.gca().xaxis.get_major_ticks():
998            tick.tick1On = False
999            tick.tick2On = False
1000
1001    def plotLabels_(self):
1002        import numpy
1003##        self.plt.yticks(numpy.arange(len(self.labels) - 1, 0, -1), self.labels)
1004##        for tick in self.plt.gca().yaxis.get_major_ticks():
1005##            tick.tick1On = False
1006##            tick.label1On = False
1007##            tick.label2On = True
1008##        text = TableTextLayout(xy=(self.meshXOffset+1, len(self.root)), tableText=[[label] for label in self.labels])
1009        text = TableTextLayout(xy=(self.meshXOffset*1.005, len(self.root) - 1), tableText=[[label] for label in self.labels])
1010        text.set_figure(self.plt.gcf())
1011        self.plt.gca().add_artist(text)
1012        self.plt.gca()._set_artist_props(text)
1013
1014    def plotLabels(self):
1015##        table = TablePlot(xy=(self.meshXOffset*1.005, len(self.root) -1), axes=self.plt.gca())
1016        table = TablePlot(xy=(0, len(self.root) -1), axes=self.plt.gca())
1017        table.set_figure(self.plt.gcf())
1018        for i,label in enumerate(self.labels):
1019            table.add_cell(i, 0, width=1, height=1, text=label, loc="left", edgecolor="w")
1020        table.set_zorder(0)
1021        self.plt.gca().add_artist(table)
1022        self.plt.gca()._set_artist_props(table)
1023   
1024    def plot(self, filename=None, show=False):
1025        self.plt.rcParams.update(self.params)
1026        labelLen = max(len(label) for label in self.labels)
1027        w, h = 800, 600
1028        space = 0.01 if self.space_width == None else self.space_width
1029        border = self.border_width
1030        width = 1.0 - 2*border
1031        height = 1.0 - 2*border
1032        textLineHeight = min(max(h/len(self.labels), 4), self.plt.rcParams.get("font.size", 12))
1033        maxTextLineWidthEstimate = textLineHeight*labelLen
1034##        print maxTextLineWidthEstimate
1035        textAxisWidthRatio = 2.0*maxTextLineWidthEstimate/w
1036##        print textAxisWidthRatio
1037        labelsAreaRatio = min(textAxisWidthRatio, 0.4) if self.label_width == None else self.label_width
1038        x, y = len(self.data.domain.attributes), len(self.data)
1039
1040        heatmapAreaRatio = min(1.0*y/h*x/w, 0.3) if self.heatmap_width == None else self.heatmap_width
1041        dendrogramAreaRatio = 1.0 - labelsAreaRatio - heatmapAreaRatio - 2*space if self.dendrogram_width == None else self.dendrogram_width
1042
1043        self.fig = self.plt.figure()
1044        self.labels_offset = self.root.height/20.0
1045        dendrogramAxes = self.plt.axes([border, border, width*dendrogramAreaRatio, height])
1046        dendrogramAxes.xaxis.grid(True)
1047        import matplotlib.ticker as ticker
1048
1049        dendrogramAxes.yaxis.set_major_locator(ticker.NullLocator())
1050        dendrogramAxes.yaxis.set_minor_locator(ticker.NullLocator())
1051        dendrogramAxes.invert_xaxis()
1052        self.plotDendrogram()
1053        heatmapAxes = self.plt.axes([border + width*dendrogramAreaRatio + space, border, width*heatmapAreaRatio, height], sharey=dendrogramAxes)
1054
1055        heatmapAxes.xaxis.set_major_locator(ticker.NullLocator())
1056        heatmapAxes.xaxis.set_minor_locator(ticker.NullLocator())
1057        heatmapAxes.yaxis.set_major_locator(ticker.NullLocator())
1058        heatmapAxes.yaxis.set_minor_locator(ticker.NullLocator())
1059       
1060        self.plotHeatMap()
1061        labelsAxes = self.plt.axes([border + width*(dendrogramAreaRatio + heatmapAreaRatio + 2*space), border, width*labelsAreaRatio, height], sharey=dendrogramAxes)
1062        self.plotLabels()
1063        labelsAxes.set_axis_off()
1064        labelsAxes.xaxis.set_major_locator(ticker.NullLocator())
1065        labelsAxes.xaxis.set_minor_locator(ticker.NullLocator())
1066        labelsAxes.yaxis.set_major_locator(ticker.NullLocator())
1067        labelsAxes.yaxis.set_minor_locator(ticker.NullLocator())
1068        if filename:
1069            import matplotlib.backends.backend_agg
1070            canvas = matplotlib.backends.backend_agg.FigureCanvasAgg(self.fig)
1071            canvas.print_figure(filename)
1072        if show:
1073            self.plt.show()
1074       
1075       
1076"""
1077Dendrogram ploting using Orange.misc.render
1078"""
1079
1080from Orange.misc.render import EPSRenderer, ColorPalette
1081
1082class DendrogramPlot(object):
1083    """ A class for drawing dendrograms.
1084   
1085    .. note:: ``dendrogram_draw`` function is a more convinient interface
1086        to the functionality provided by this class and.   
1087       
1088    Example::
1089   
1090        a = DendrogramPlot(tree)
1091        a.plot("tree.png", format="png")
1092       
1093    """
1094    def __init__(self, tree, attr_tree = None, labels=None, data=None, width=None, height=None, tree_height=None, heatmap_width=None, text_width=None, 
1095                 spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, gamma=None, renderer=EPSRenderer, **kwargs):
1096        self.tree = tree
1097        self.attr_tree = attr_tree
1098        if not labels:
1099            if data and data.domain.class_var:
1100                labels = [str(ex.getclass()) for ex in data]
1101            elif hasattr(tree.mapping, "objects"):
1102                labels = [str(obj) for obj in tree.mapping.objects]
1103            else:
1104                labels = [""] * len(tree)
1105        self.labels = labels
1106       
1107#        self.attr_labels = [str(attr.name) for attr in data.domain.attributes] if not attr_labels and data else attr_labels or []
1108        self.data = data
1109        self.width, self.height = float(width) if width else None, float(height) if height else None
1110        self.tree_height = tree_height
1111        self.heatmap_width = heatmap_width
1112        self.text_width = text_width
1113        self.font_size = 10.0
1114        self.linespacing = 0.0
1115        self.cluster_colors = cluster_colors
1116        self.horizontal_margin = 10.0
1117        self.vertical_margin = 10.0
1118        self.spacing = float(spacing) if spacing else None
1119        self.color_palette = color_palette
1120        self.minv = minv
1121        self.maxv = maxv
1122        self.gamma = gamma
1123        self.set_matrix_color_schema(color_palette, minv, maxv, gamma)
1124        self.renderer = renderer
1125       
1126    def set_matrix_color_schema(self, color_palette, minv, maxv, gamma=None):
1127        """ Set the matrix color scheme.
1128        """
1129        if isinstance(color_palette, ColorPalette):
1130            self.color_palette = color_palette
1131        else:
1132            self.color_palette = ColorPalette(color_palette)
1133        self.minv = minv
1134        self.maxv = maxv
1135        self.gamma = gamma
1136       
1137    def color_shema(self):
1138        vals = [float(val) for ex in self.data for val in ex if not val.isSpecial() and val.variable.varType==orange.VarTypes.Continuous] or [0]
1139        avg = sum(vals)/len(vals)
1140       
1141        maxVal = self.maxv if self.maxv else max(vals)
1142        minVal = self.minv if self.minv else min(vals)
1143       
1144        def _colorSchema(val):
1145            if val.isSpecial():
1146                return self.color_palette(None)
1147            elif val.variable.varType==orange.VarTypes.Continuous:
1148                r, g, b = self.color_palette((float(val) - minVal) / abs(maxVal - minVal), gamma=self.gamma)
1149            elif val.variable.varType==orange.VarTypes.Discrete:
1150                r = g = b = int(255.0*float(val)/len(val.variable.values))
1151            return (r, g, b)
1152        return _colorSchema
1153   
1154    def layout(self):
1155        height_final = False
1156        width_final = False
1157        tree_height = self.tree_height or 100
1158        if self.height:
1159            height, height_final = self.height, True
1160            heatmap_height = height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin
1161            font_size =  heatmap_height / len(self.labels) #self.font_size or (height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin) / len(self.labels)
1162        else:
1163            font_size = self.font_size
1164            heatmap_height = font_size * len(self.labels)
1165            height = heatmap_height + (tree_height + self.spacing if self.attr_tree else 0) + 2 * self.horizontal_margin
1166             
1167        text_width = self.text_width or max([len(label) for label in self.labels] + [0]) * font_size #max([self.renderer.string_size_hint(label) for label in self.labels])
1168       
1169        if self.width:
1170            width = self.width
1171            heatmap_width = width - 2 * self.vertical_margin - tree_height - (2 if self.data else 1) * self.spacing - text_width if self.data else 0
1172        else:
1173            heatmap_width = len(self.data.domain.attributes) * heatmap_height / len(self.data) if self.data else 0
1174            width = 2 * self.vertical_margin + tree_height + (heatmap_width + self.spacing if self.data else 0) + self.spacing + text_width
1175           
1176        return width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size
1177   
1178    def plot(self, filename="graph.eps", **kwargs):
1179        width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size = self.layout()
1180        heatmap_cell_height = heatmap_height / len(self.labels)
1181
1182        heatmap_cell_width = 0.0 if not self.data else heatmap_width / len(self.data.domain.attributes)
1183       
1184        self.renderer = self.renderer(width, height)
1185       
1186        def draw_tree(cluster, root, treeheight, treewidth, color):
1187            height = treeheight * cluster.height / root.height
1188            if cluster.branches:
1189                centers = []
1190                for branch in cluster.branches:
1191                    center = draw_tree(branch, root, treeheight, treewidth, self.cluster_colors.get(branch, color))
1192                    centers.append(center)
1193                    self.renderer.draw_line(center[0], center[1], center[0], height, stroke_color = self.cluster_colors.get(branch, color))
1194                   
1195                self.renderer.draw_line(centers[0][0], height, centers[-1][0], height, stroke_color = self.cluster_colors.get(cluster, color))
1196                return (centers[0][0] + centers[-1][0]) / 2.0, height
1197            else:
1198                return float(treewidth) * cluster.first / len(root), 0.0
1199        self.renderer.save_render_state()
1200        self.renderer.translate(self.vertical_margin + tree_height, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0) + heatmap_cell_height / 2.0)
1201        self.renderer.rotate(90)
1202#        print self.renderer.transform()
1203        draw_tree(self.tree, self.tree, tree_height, heatmap_height, self.cluster_colors.get(self.tree, (0,0,0)))
1204        self.renderer.restore_render_state()
1205        if self.attr_tree:
1206            self.renderer.save_render_state()
1207            self.renderer.translate(self.vertical_margin + tree_height + self.spacing + heatmap_cell_width / 2.0, self.horizontal_margin + tree_height)
1208            self.renderer.scale(1.0, -1.0)
1209#            print self.renderer.transform()
1210            draw_tree(self.attr_tree, self.attr_tree, tree_height, heatmap_width, self.cluster_colors.get(self.attr_tree, (0,0,0)))
1211            self.renderer.restore_render_state()
1212       
1213        self.renderer.save_render_state()
1214        self.renderer.translate(self.vertical_margin + tree_height + self.spacing, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0))
1215#        print self.renderer.transform()
1216        if self.data:
1217            colorSchema = self.color_shema()
1218            for i, ii in enumerate(self.tree):
1219                ex = self.data[ii]
1220                for j, jj in enumerate((self.attr_tree if self.attr_tree else range(len(self.data.domain.attributes)))):
1221                    r, g, b = colorSchema(ex[jj])
1222                    self.renderer.draw_rect(j * heatmap_cell_width, i * heatmap_cell_height, heatmap_cell_width, heatmap_cell_height, fill_color=(r, g, b), stroke_color=(255, 255, 255))
1223       
1224        self.renderer.translate(heatmap_width + self.spacing, heatmap_cell_height)
1225#        print self.renderer.transform()
1226        self.renderer.set_font("Times-Roman", font_size)
1227        for index in self.tree.mapping: #label in self.labels:
1228            self.renderer.draw_text(0.0, 0.0, self.labels[index])
1229            self.renderer.translate(0.0, heatmap_cell_height)
1230        self.renderer.restore_render_state()
1231        self.renderer.save(filename, **kwargs)
1232       
1233       
1234def dendrogram_draw(file, cluster, attr_cluster = None, labels=None, data=None,
1235                    width=None, height=None, tree_height=None,
1236                    heatmap_width=None, text_width=None,  spacing=2,
1237                    cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]),
1238                    maxv=None, minv=None, gamma=None,
1239                    format=None):
1240    """ Plot the dendrogram to ``file``.
1241   
1242    :param file: An  open file or a filename to store the image to.
1243    :type file: str or an file-like object suitable for writing.
1244   
1245    :param cluster: An instance of :class:`HierarcicalCluster`
1246    :type cluster: :class:`HierarcicalCluster`
1247   
1248    :param attr_cluster: An instance of :class:`HierarcicalCluster` for the attributes
1249        in ``data`` (unused if ``data`` is not supplied)
1250    :type attr_cluster: :class:`HierarcicalCluster`
1251   
1252    :param labels: Labels for the ``cluster`` leaves.
1253    :type labels: list-of-strings
1254   
1255    :param data: A data table for the (optional) heatmap.
1256    :type data: :class:`Orange.data.Table`
1257   
1258    :param width: Image width.
1259    :type width: int
1260   
1261    :param height: Image height.
1262    :type height: int
1263   
1264    :param tree_height: Dendrogram tree height in the image.
1265    :type tree_height: int
1266   
1267    :param heatmap_width: Heatmap width.
1268    :type heatmap_width: int
1269   
1270    :param text_width: Text labels area width.
1271    :type text_width: int
1272   
1273    :param spacing: Spacing between consecutive leaves.
1274    :type spacing: int
1275   
1276    :param cluster_colors: A dictionary mapping :class:`HierarcicalCluster`
1277        instances in ``cluster`` to a RGB color 3-tuple.
1278    :type cluster_colors: dict
1279   
1280    :param format: Output image format Currently supported formats are
1281        "png" (default), "eps" and "svg". You only need this arguement if the
1282        format cannot be deduced from the ``file`` argument.
1283       
1284    :type format: str
1285   
1286    """
1287    import os
1288    from Orange.misc.render import PILRenderer, EPSRenderer, SVGRenderer
1289    if isinstance(file, basestring):
1290        name, ext = os.path.splitext(file)
1291        format = ext.lower().lstrip(".") or format
1292       
1293    if format is None:
1294        format = "png"
1295       
1296    renderer = {"eps":EPSRenderer, "svg":SVGRenderer, "png":PILRenderer}.get(format, "png")
1297   
1298    d = DendrogramPlot(cluster, attr_cluster, labels, data, width, height,
1299                       tree_height, heatmap_width, text_width, spacing,
1300                       cluster_colors, color_palette, maxv, minv, gamma,
1301                       renderer=renderer)
1302    if renderer is PILRenderer:
1303        d.plot(file, format=format)
1304    else:
1305        d.plot(file)
1306   
1307def postorder(cluster):
1308    """ Return a post order list of clusters.
1309   
1310    :param cluster: Cluster
1311    :type cluster: :class:`HierarchicalCluster`
1312   
1313    :rtype: list of :class:`HierarchicalCluster` instances
1314   
1315    """
1316    order = []
1317    visited = set()
1318    stack = [cluster]
1319    while stack:
1320        cluster = stack.pop(0)
1321       
1322        if cluster.branches:
1323            if cluster in visited:
1324                order.append(cluster)
1325            else:
1326                stack = cluster.branches + [cluster] + stack
1327                visited.add(cluster)
1328        else:
1329            order.append(cluster)
1330            visited.add(cluster)
1331    return order
1332   
1333   
1334def preorder(cluster):
1335    """ Return a pre order list of clusters.
1336   
1337    :param cluster: Cluster
1338    :type cluster: :class:`HierarchicalCluster`
1339   
1340    :rtype: list of :class:`HierarchicalCluster` instances
1341   
1342    """
1343    order = []
1344    stack = [cluster]
1345    while stack:
1346        cluster = stack.pop(0)
1347        order.append(cluster)
1348        if cluster.branches:
1349            stack = cluster.branches + stack
1350    return order
1351   
1352   
1353def dendrogram_layout(cluster, expand_leaves=False):
1354    """ Return a layout of the cluster dendrogram on a 2D plane. The return
1355    value if a list of (subcluster, (start, center, end)) tuples where
1356    subcluster is an instance of :class:`HierarchicalCluster` and start,
1357    end are the two end points for the cluster branch. The list is sorted
1358    in post-order.
1359   
1360    :param cluster: Cluster to layout.
1361    :type cluster: :class:`HierarchicalCluster`
1362   
1363    :param expand_leaves: If True leaves will span proportional to the number
1364        of items they contain, else all leaves will be the same width.
1365   
1366    :rtype: list of (:class:`HierarchicalCluster`, (start, center, end)) tuples
1367   
1368    """
1369    result = []
1370    cluster_geometry = {}
1371    leaf_idx = 0
1372    for cluster in postorder(cluster):
1373        if not cluster.branches:
1374            if expand_leaves:
1375                start, end = float(cluster.first), float(cluster.last)
1376            else:
1377                start = end = leaf_idx
1378                leaf_idx += 1
1379            center = (start + end) / 2.0
1380            cluster_geometry[cluster] = (start, center, end)
1381            result.append((cluster, (start, center, end)))
1382        else:
1383            left = cluster.branches[0]
1384            right = cluster.branches[1]
1385            left_center = cluster_geometry[left][1]
1386            right_center = cluster_geometry[right][1]
1387            start, end = left_center, right_center
1388            center = (start + end) / 2.0
1389            cluster_geometry[cluster] = (start, center, end)
1390            result.append((cluster, (start, center, end)))
1391           
1392    return result
1393   
1394def clone(cluster):
1395    """ Clone a cluster, including it's subclusters.
1396   
1397    :param cluster: Cluster to clone
1398    :type cluster: :class:`HierarchicalCluster`
1399   
1400    :rtype: :class:`HierarchicalCluster`
1401   
1402    """
1403    import copy
1404    clones = {}
1405    mapping = copy.copy(cluster.mapping)
1406    for node in postorder(cluster):
1407        node_clone = copy.copy(node)
1408        if node.branches:
1409            node_clone.branches = [clones[b] for b in node.branches]
1410        node_clone.mapping = mapping
1411        clones[node] = node_clone
1412       
1413    return clones[cluster]
1414   
1415def pruned(cluster, level=None, height=None, condition=None):
1416    """ Return a new pruned clustering instance.
1417   
1418    .. note:: This uses :obj:`clone` to create a copy of the `cluster`
1419        instance.
1420   
1421    :param cluster: Cluster to prune.
1422    :type cluster: :class:`HierarchicalCluster`
1423   
1424    :param level: If not None prune all clusters deeper then `level`.
1425    :type level: int
1426   
1427    :param height: If not None prune all clusters with height lower then
1428        `height`.
1429    :type height: float
1430   
1431    :param condition: If not None condition must be a function taking a
1432        single :class:`HierarchicalCluster` instance and returning a True
1433        or False value indicating if the cluster should be pruned.
1434    :type condition: function
1435   
1436    :rtype: :class:`HierarchicalCluster`
1437   
1438    """
1439    cluster = clone(cluster)
1440    prune(cluster, level, height, condition)
1441    return cluster
1442   
1443   
1444def prune(cluster, level=None, height=None, condition=None):
1445    """ Prune the clustering instance ``cluster`` in place.
1446   
1447    :param cluster: Cluster to prune.
1448    :type cluster: :class:`HierarchicalCluster`
1449   
1450    :param level: If not None prune all clusters deeper then `level`.
1451    :type level: int
1452   
1453    :param height: If not None prune all clusters with height lower then
1454        `height`.
1455    :type height: float
1456   
1457    :param condition: If not None condition must be a function taking a
1458        single :class:`HierarchicalCluster` instance and returning a True
1459        or False value indicating if the cluster should be pruned.
1460    :type condition: function
1461   
1462    """
1463    if not any(arg is not None for arg in [level, height, condition]):
1464        raise ValueError("At least one pruning argument must be supplied")
1465   
1466    level_check = height_check = condition_check = lambda cl: False
1467    cluster_depth = cluster_depths(cluster)
1468   
1469    if level is not None:
1470        level_check = lambda cl: cluster_depth[cl] >= level
1471       
1472    if height is not None:
1473        height_check = lambda cl: cl.height <= height
1474
1475    if condition is not None:
1476        condition_check = condition
1477       
1478    pruned_set = set()
1479   
1480    def check_all(cl):
1481        return any([check(cl) for check in [level_check, height_check,
1482                                            condition_check]])
1483       
1484    for cluster in preorder(cluster):
1485        if cluster not in pruned_set:
1486            if check_all(cluster):
1487                cluster.branches = None
1488                pruned_set.update(set(preorder(cluster)))
1489            else:
1490                pass
1491   
1492   
1493def cluster_depths(cluster):
1494    """ Return a dictionary mapping :class:`HierarchicalCluster` instances to
1495    their depths in the `cluster` hierarchy.
1496   
1497    :param cluster: Root cluster
1498    :type cluster: :class:`HierarchicalCluster`
1499   
1500    :rtype: class:`dict`
1501   
1502    """
1503    depths = {}
1504    depths[cluster] = 0
1505    for cluster in preorder(cluster):
1506        cl_depth = depths[cluster]
1507        if cluster.branches:
1508            depths.update(dict.fromkeys(cluster.branches, cl_depth + 1))
1509    return depths
1510
1511
1512def instance_distance_matrix(data,
1513            distance_constructor=orange.ExamplesDistanceConstructor_Euclidean,
1514            progress_callback=None):
1515    """ A helper function that computes an :class:`Orange.core.SymMatrix` of all
1516    pairwise distances between instances in `data`.
1517   
1518    :param data: A data table
1519    :type data: :class:`Orange.data.Table`
1520   
1521    :param distance_constructor: An ExamplesDistance_Constructor instance.
1522    :type distance_constructor: :class:`Orange.distance.DistanceConstructor`
1523   
1524    :param progress_callback: A function (taking one argument) to use for
1525        reporting the on the progress.
1526    :type progress_callback: function
1527   
1528    :rtype: :class:`Orange.core.SymMatrix`
1529   
1530    """
1531    matrix = orange.SymMatrix(len(data))
1532    dist = distance_constructor(data)
1533   
1534    iter_count = matrix.dim * (matrix.dim - 1) / 2
1535    milestones = progress_bar_milestones(iter_count, 100)
1536   
1537    for count, ((i, ex1), (j, ex2)) in enumerate(_pairs(enumerate(data))):
1538        matrix[i, j] = dist(ex1, ex2)
1539        if progress_callback and count in milestones:
1540            progress_callback(100.0 * count / iter_count)
1541           
1542    return matrix
1543
1544
1545def feature_distance_matrix(data, distance=None, progress_callback=None):
1546    """ A helper function that computes an :class:`Orange.core.SymMatrix` of
1547    all pairwise distances between features in `data`.
1548   
1549    :param data: A data table
1550    :type data: :class:`Orange.data.Table`
1551   
1552    :param distance: a function taking two lists and returning the distance.
1553    :type distance: function
1554   
1555    :param progress_callback: A function (taking one argument) to use for
1556        reporting the on the progress.
1557    :type progress_callback: function
1558   
1559    :rtype: :class:`Orange.core.SymMatrix`
1560   
1561    """
1562    attributes = data.domain.attributes
1563    matrix = orange.SymMatrix(len(attributes))
1564    iter_count = matrix.dim * (matrix.dim - 1) / 2
1565    milestones = progress_bar_milestones(iter_count, 100)
1566   
1567    for count, ((i, a1), (j, a2)) in enumerate(_pairs(enumerate(attributes))):
1568        matrix[i, j] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0
1569        if progress_callback and count in milestones:
1570            progress_callback(100.0 * count / iter_count)
1571           
1572    return matrix
1573
1574
1575def _pairs(seq, same = False):
1576    """ Return all pairs from elements of `seq`.
1577    """
1578    seq = list(seq)
1579    same = 0 if same else 1
1580    for i in range(len(seq)):
1581        for j in range(i + same, len(seq)):
1582            yield seq[i], seq[j]
1583   
1584   
1585def joining_cluster(cluster, item1, item2):
1586    """ Return the cluster where `item1` and `item2` are first joined
1587   
1588    :param cluster: Clustering.
1589    :type cluster: :class:`HierarchicalCluster`
1590    :param item1: An element of `cluster.mapping` or `cluster.mapping.objects`
1591    :param item2: An element of `cluster.mapping` or `cluster.mapping.objects`
1592   
1593    :rtype: :class:`HierarchicalCluster`
1594   
1595    """
1596    while True:
1597        if cluster.branches:
1598            for branch in cluster.branches:
1599                if item1 in branch and item2 in branch:
1600                    cluster = branch
1601                    break
1602            else:
1603                return cluster
1604        else:
1605            return cluster
1606       
1607
1608def cophenetic_distances(cluster):
1609    """ Return the cophenetic distance matrix between items in clustering.
1610    Cophenetic distance is defined as the height of the cluster where the
1611    two items are first joined.
1612   
1613    :param cluster: Clustering.
1614    :type cluster: :class:`HierarchicalCluster`
1615   
1616    :rtype: :class:`Orange.core.SymMatrix`
1617   
1618    """
1619
1620    mapping = cluster.mapping 
1621    matrix = Orange.core.SymMatrix(len(mapping))
1622    for cluster in postorder(cluster):
1623        if cluster.branches:
1624            for branch1, branch2 in _pairs(cluster.branches):
1625                for idx1 in mapping[branch1.first: branch1.last]:
1626                    for idx2 in mapping[branch2.first: branch2.last]:
1627                        matrix[idx1, idx2] = cluster.height
1628               
1629        else:
1630            for ind1, ind2 in _pairs(mapping[cluster.first: cluster.last]):
1631                matrix[ind1, ind2] = cluster.height
1632   
1633    return matrix
1634
1635   
1636def cophenetic_correlation(cluster, matrix):
1637    """ Return the `cophenetic correlation coefficient
1638    <http://en.wikipedia.org/wiki/Cophenetic_correlation>`_ of the given
1639    clustering.
1640   
1641    :param cluster: Clustering.
1642    :type cluster: :class:`HierarchicalCluster`
1643   
1644    :param matrix: The distance matrix from which the `cluster` was
1645        derived.
1646   
1647    :rtype: :class:`float`
1648   
1649    """
1650    import numpy
1651    cophenetic = cophenetic_distances(cluster)
1652    cophenetic = [list(row) for row in cophenetic]
1653    original = [list(row) for row in matrix]
1654    cophenetic = numpy.ravel(cophenetic)
1655    original = numpy.ravel(original)
1656    return numpy.corrcoef(cophenetic, original)[0, 1]
1657   
1658   
1659if __name__=="__main__":
1660    data = orange.ExampleTable("doc//datasets//brown-selected.tab")
1661#    data = orange.ExampleTable("doc//datasets//iris.tab")
1662    root = hierarchicalClustering(data, order=True) #, linkage=orange.HierarchicalClustering.Single)
1663    attr_root = hierarchicalClustering_attributes(data, order=True)
1664#    print root
1665#    d = DendrogramPlotPylab(root, data=data, labels=[str(ex.getclass()) for ex in data], dendrogram_width=0.4, heatmap_width=0.3,  params={}, cmap=None)
1666#    d.plot(show=True, filename="graph.png")
1667
1668    dendrogram_draw("graph.eps", root, attr_tree=attr_root, data=data, labels=[str(e.getclass()) for e in data], tree_height=50, #width=500, height=500,
1669                          cluster_colors={root.right:(255,0,0), root.right.right:(0,255,0)}, 
1670                          color_palette=ColorPalette([(255, 0, 0), (0,0,0), (0, 255,0)], gamma=0.5, 
1671                                                     overflow=(255, 255, 255), underflow=(255, 255, 255))) #, minv=-0.5, maxv=0.5)
Note: See TracBrowser for help on using the repository browser.