source: orange/Orange/clustering/hierarchical.py @ 10633:fb05a6f3a235

Revision 10633:fb05a6f3a235, 56.3 KB checked in by mstajdohar, 2 years ago (diff)

Changed obsolete names.

Line 
1"""
2******************************************
3Hierarchical clustering (``hierarchical``)
4******************************************
5
6.. index::
7    single: clustering, hierarchical, dendrogram
8
9.. index:: aglomerative clustering
10
11The following example show clustering of the Iris data, with distance matrix
12computed with the :class:`Orange.distance.Euclidean` distance measure
13and cluster it with average linkage.
14
15.. literalinclude:: code/hierarchical-example-2.py
16    :lines: 1-12
17
18Data instances belonging to the top-most four clusters
19(obtained with :obj:`top_clusters`) could be printed out
20with:
21
22.. literalinclude:: code/hierarchical-example-2.py
23    :lines: 14-19
24           
25It could be more informative to
26print out the class distributions for each cluster.
27   
28.. literalinclude:: code/hierarchical-example-2.py
29    :lines: 21-26
30       
31Here is the output.
32::
33
34    Iris-setosa:   0  Iris-versicolor:  50  Iris-virginica:  17
35    Iris-setosa:  49  Iris-versicolor:   0  Iris-virginica:   0
36    Iris-setosa:   0  Iris-versicolor:   0  Iris-virginica:  33
37    Iris-setosa:   1  Iris-versicolor:   0  Iris-virginica:   0
38
39The same results could also be obtained with:
40
41.. literalinclude:: code/hierarchical-example-3.py
42    :lines: 1-7
43 
44Basic functionality
45-------------------
46
47.. autofunction:: clustering
48
49.. class:: HierarchicalClustering
50   
51    .. attribute:: linkage
52       
53        Specifies the linkage method, which can be either. Default is
54        :obj:`SINGLE`.
55
56    .. attribute:: overwrite_matrix
57
58        If True (default is False), the algorithm will save memory
59        by working on the original distance matrix, destroying it in
60        the process.
61
62    .. attribute:: progress_callback
63       
64        A callback function (None by default), which will be called 101 times.
65        The function only gets called if the number of objects is at least 1000.
66       
67    .. method:: __call__(matrix)
68         
69        Return an instance of HierarchicalCluster representing
70        the root of the hierarchy (instance of :class:`HierarchicalCluster`).
71
72        The distance matrix has to contain no negative elements, as
73        this helps the algorithm to run faster. The elements on the
74        diagonal are ignored. The method works in approximately O(n2)
75        time (with the worst case O(n3)).
76
77        :param matrix: A distance matrix to perform the clustering on.
78        :type matrix: :class:`Orange.misc.SymMatrix`
79
80.. rubric:: Linkage methods
81
82.. data:: SINGLE
83
84    Distance between groups is defined as the distance between the closest
85                pair of objects, one from each group.
86
87.. data:: AVERAGE
88
89    Distance between two clusters is defined as the average of distances
90    between all pairs of objects, where each pair is made up of one
91    object from each group.
92
93.. data:: COMPLETE
94
95    Distance between groups is defined as the distance between the most
96    distant pair of objects, one from each group. Complete linkage is
97    also called farthest neighbor.
98
99.. data:: WARD
100
101    Ward's distance.
102 
103Drawing
104--------------
105
106.. autofunction:: dendrogram_draw(file, cluster, attr_cluster=None, labels=None, data=None, width=None, height=None, tree_height=None, heatmap_width=None, text_width=None,  spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, gamma=None, format=None)
107
108.. rubric:: Example
109
110The following scripts clusters a subset of 20 instances from the Iris data set.
111The leaves are labelled with the class value.
112
113.. literalinclude:: code/hierarchical-draw.py
114    :lines: 1-8
115
116The resulting dendrogram is shown below.
117
118.. image:: files/hclust-dendrogram.png
119
120The following code, that produces the dendrogram below, also colors the
121three topmost branches and represents attribute values with a custom color
122schema, (spanning red - black - green with custom gamma minv and maxv).
123
124.. literalinclude:: code/hierarchical-draw.py
125    :lines: 10-16
126
127.. image:: files/hclust-colored-dendrogram.png
128   
129Cluster analysis
130-----------------
131
132.. autofunction:: cluster_to_list
133.. autofunction:: top_clusters
134.. autofunction:: top_cluster_membership
135.. autofunction:: order_leaves
136.. autofunction:: postorder
137.. autofunction:: preorder
138.. autofunction:: prune
139.. autofunction:: pruned
140.. autofunction:: clone
141.. autofunction:: cluster_depths
142.. autofunction:: cophenetic_distances
143.. autofunction:: cophenetic_correlation
144.. autofunction:: joining_cluster
145
146
147HierarchicalCluster hierarchy
148-----------------------------
149
150Results of clustering are stored in a hierarchy of
151:obj:`HierarchicalCluster` objects.
152
153.. class:: HierarchicalCluster
154
155    A node in the clustering tree, as returned by
156    :obj:`HierarchicalClustering`.
157
158    .. attribute:: branches
159   
160        A list of sub-clusters (:class:`HierarchicalCluster` instances). If this
161        is a leaf node this attribute is `None`
162       
163    .. attribute:: left
164   
165        The left sub-cluster (defined only if there are only two branches).
166        Same as ``branches[0]``.
167       
168    .. attribute:: right
169   
170        The right sub-cluster (defined only if there are only two branches).
171        Same as ``branches[1]``.
172       
173    .. attribute:: height
174   
175        Height of the cluster (distance between the sub-clusters).
176       
177    .. attribute:: mapping
178   
179        A list of indices to the original distance matrix. It is the same
180        for all clusters in the hierarchy - it simply represents the indices
181        ordered according to the clustering.
182   
183    .. attribute:: mapping.objects
184
185        A sequence describing objects - an :obj:`Orange.data.Table`, a
186        list of instance, a list of features (when clustering features),
187        or even a string of the same length as the number of elements.
188        If objects are given, the cluster's elements, as got by indexing
189        or interacion, are not indices but corresponding objects.  It we
190        put an :obj:`Orange.data.Table` into objects, ``root.left[-1]``
191        is the last instance of the first left cluster.
192
193    .. attribute:: first
194    .. attribute:: last
195   
196        ``first`` and ``last`` are indices into the elements of ``mapping`` that
197        belong to that cluster.
198
199    .. method:: __len__()
200   
201        Asking for the length of the cluster gives the number of the objects
202        belonging to it. This equals ``last - first``.
203   
204    .. method:: __getitem__(index)
205   
206        By indexing the cluster we address its elements; these are either
207        indices or objects.
208        For instance cluster[2] gives the third element of the cluster, and
209        list(cluster) will return the cluster elements as a list. The cluster
210        elements are read-only.
211   
212    .. method:: swap()
213   
214        Swaps the ``left`` and the ``right`` subcluster; it will
215        report an error when the cluster has more than two subclusters. This
216        function changes the mapping and first and last of all clusters below
217        this one and thus needs O(len(cluster)) time.
218       
219    .. method:: permute(permutation)
220   
221        Permutes the subclusters. Permutation gives the order in which the
222        subclusters will be arranged. As for swap, this function changes the
223        mapping and first and last of all clusters below this one.
224
225Subclusters are ordered so that ``cluster.left.last`` always equals
226``cluster.right.first`` or, in general, ``cluster.branches[i].last``
227equals ``cluster.branches[i+1].first``.
228
229Swapping and permutation change the order of
230elements in ``branches``, permute the corresponding regions in
231:obj:`~HierarchicalCluster.mapping` and adjust the ``first`` and ``last``
232for all the clusters below.
233
234.. rubric:: An example
235
236The following example constructs a simple distance matrix and runs clustering
237on it.
238
239    >>> import Orange
240    >>> m = [[],
241    ...      [ 3],
242    ...      [ 2, 4],
243    ...      [17, 5, 4],
244    ...      [ 2, 8, 3, 8],
245    ...      [ 7, 5, 10, 11, 2],
246    ...      [ 8, 4, 1, 5, 11, 13],
247    ...      [ 4, 7, 12, 8, 10, 1, 5],
248    ...      [13, 9, 14, 15, 7, 8, 4, 6],
249    ...      [12, 10, 11, 15, 2, 5, 7, 3, 1]]
250    >>> matrix = Orange.misc.SymMatrix(m)
251    >>> root = Orange.clustering.hierarchical.HierarchicalClustering(matrix,
252    ...     linkage=Orange.clustering.hierarchical.AVERAGE)
253       
254``root`` is the root of the cluster hierarchy. We can print it with a
255simple recursive function.
256
257    >>> def print_clustering(cluster):
258    ...     if cluster.branches:
259    ...         return "(%s %s)" % (print_clustering(cluster.left), print_clustering(cluster.right))
260    ...     else:
261    ...         return str(cluster[0])
262           
263The clustering looks like
264
265    >>> print_clustering(root)
266    '(((0 4) ((5 7) (8 9))) ((1 (2 6)) 3))'
267   
268The elements form two groups, the first with elements 0, 4, 5, 7, 8, 9,
269and the second with 1, 2, 6, 3. The difference between them equals to
270
271    >>> print root.height
272    9.0
273
274The first cluster is further divided onto 0 and 4 in one, and 5, 7, 8,
2759 in the other subcluster.
276
277The following code prints the left subcluster of root.
278
279    >>> for el in root.left:
280    ...     print el,
281    0 4 5 7 8 9
282   
283Object descriptions can be added with
284
285    >>> root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve",
286    ...    "Fred", "Greg", "Hue", "Ivy", "Jon"]
287   
288As before, let us print out the elements of the first left cluster
289
290    >>> for el in root.left:
291    ...     print el,
292    Ann Eve Fred Hue Ivy Jon
293
294Calling ``root.left.swap`` reverses the order of subclusters of
295``root.left``
296
297    >>> print_clustering(root)
298    '(((Ann Eve) ((Fred Hue) (Ivy Jon))) ((Bob (Curt Greg)) Danny))'
299    >>> root.left.swap()
300    >>> print_clustering(root)
301    '((((Fred Hue) (Ivy Jon)) (Ann Eve)) ((Bob (Curt Greg)) Danny))'
302   
303Let us write function for cluster pruning.
304
305    >>> def prune(cluster, h):
306    ...     if cluster.branches:
307    ...         if cluster.height < h:
308    ...             cluster.branches = None
309    ...         else:
310    ...             for branch in cluster.branches:
311    ...                 prune(branch, h)
312
313Here we need a function that can plot leafs with multiple elements.
314
315    >>> def print_clustering2(cluster):
316    ...     if cluster.branches:
317    ...         return "(%s %s)" % (print_clustering2(cluster.left), print_clustering2(cluster.right))
318    ...     else:
319    ...         return str(tuple(cluster))
320
321Four clusters remain.
322
323    >>> prune(root, 5)
324    >>> print print_clustering2(root)
325    (((('Fred', 'Hue') ('Ivy', 'Jon')) ('Ann', 'Eve')) ('Bob', 'Curt', 'Greg', 'Danny'))
326   
327The following function returns a list of lists.
328
329    >>> def list_of_clusters0(cluster, alist):
330    ...     if not cluster.branches:
331    ...         alist.append(list(cluster))
332    ...     else:
333    ...         for branch in cluster.branches:
334    ...             list_of_clusters0(branch, alist)
335    ...
336    >>> def list_of_clusters(root):
337    ...     l = []
338    ...     list_of_clusters0(root, l)
339    ...     return l
340       
341The function returns a list of lists, in our case
342
343    >>> list_of_clusters(root)
344    [['Fred', 'Hue'], ['Ivy', 'Jon'], ['Ann', 'Eve'], ['Bob', 'Curt', 'Greg', 'Danny']]
345
346If :obj:`~HierarchicalCluster.mapping.objects` were not defined the list
347would contains indices instead of names.
348
349    >>> root.mapping.objects = None
350    >>> print list_of_clusters(root)
351    [[5, 7], [8, 9], [0, 4], [1, 2, 6, 3]]
352
353Utility Functions
354-----------------
355
356.. autofunction:: clustering_features
357.. autofunction:: feature_distance_matrix
358.. autofunction:: dendrogram_layout
359
360"""
361
362
363import orange
364import Orange
365from Orange.core import HierarchicalClustering, \
366                        HierarchicalCluster, \
367                        HierarchicalClusterList
368
369from Orange.utils import progress_bar_milestones, deprecated_keywords
370
371import sys
372
373SINGLE = HierarchicalClustering.Single
374AVERAGE = HierarchicalClustering.Average
375COMPLETE = HierarchicalClustering.Complete
376WARD = HierarchicalClustering.Ward
377
378def clustering(data,
379               distance_constructor=Orange.distance.Euclidean,
380               linkage=AVERAGE,
381               order=False,
382               progress_callback=None):
383    """ Return a hierarchical clustering of the instances in a data set.
384    The method works in approximately O(n2) time (with the worst case O(n3)).
385   
386    :param data: Input data table for clustering.
387    :type data: :class:`Orange.data.Table`
388    :param distance_constructor: Instance distance constructor
389    :type distance_constructor: :class:`Orange.distance.DistanceConstructor`
390    :param linkage: Linkage flag. Must be one of module level flags:
391   
392        - :obj:`SINGLE`
393        - :obj:`AVERAGE`
394        - :obj:`COMPLETE`
395        - :obj:`WARD`
396       
397    :type linkage: int
398    :param order: If `True` run `order_leaves` on the resulting clustering.
399    :type order: bool
400    :param progress_callback: A function (taking one argument) to use for
401        reporting the on the progress.
402    :type progress_callback: function
403   
404    :rtype: :class:`HierarchicalCluster`
405   
406    """
407    distance = distance_constructor(data)
408    matrix = Orange.misc.SymMatrix(len(data))
409    for i in range(len(data)):
410        for j in range(i + 1):
411            matrix[i, j] = distance(data[i], data[j])
412    root = HierarchicalClustering(matrix, linkage=linkage, progress_callback=(lambda value, obj=None: progress_callback(value * 100.0 / (2 if order else 1))) if progress_callback else None)
413    if order:
414        order_leaves(root, matrix, progress_callback=(lambda value: progress_callback(50.0 + value / 2)) if progress_callback else None)
415    return root
416
417clustering = \
418    deprecated_keywords({"distanceConstructor": "distance_constructor",
419                         "progressCallback": "progress_callback"})(clustering)
420
421
422def clustering_features(data, distance=None, linkage=AVERAGE, order=False, progress_callback=None):
423    """ Return hierarchical clustering of attributes in a data set.
424   
425    :param data: Input data table for clustering.
426    :type data: :class:`Orange.data.Table`
427    :param distance: Attribute distance constructor  (currently not used).
428    :param linkage: Linkage flag; one of global module level flags:
429   
430        - SINGLE
431        - AVERAGE
432        - COMPLETE
433        - WARD
434       
435    :type linkage: int
436    :param order: If `True` run `order_leaves` on the resulting clustering.
437    :type order: bool
438    :param progress_callback: A function (taking one argument) to use for
439        reporting the on the progress.
440    :type progress_callback: function
441
442    :rtype: :class:`HierarchicalCluster`
443   
444    """
445    matrix = Orange.misc.SymMatrix(len(data.domain.attributes))
446    for a1 in range(len(data.domain.attributes)):
447        for a2 in range(a1):
448            matrix[a1, a2] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0
449    root = orange.HierarchicalClustering(matrix, linkage=linkage, progress_callback=(lambda value, obj=None: progress_callback(value * 100.0 / (2 if order else 1))) if progress_callback else None)
450    if order:
451        order_leaves(root, matrix, progressCallback=(lambda value: progress_callback(50.0 + value / 2)) if progress_callback else None)
452    return root
453
454clustering_features = \
455    deprecated_keywords({"progressCallback":"progress_callback"})(clustering_features)
456
457
458def cluster_to_list(node, prune=None):
459    """ Return a list of clusters down from the node of hierarchical clustering.
460   
461    :param node: Cluster node.
462    :type node: :class:`HierarchicalCluster`
463    :param prune: If not `None` it must be a positive integer. Any cluster
464        with less then `prune` items will be left out of the list.
465    :type node: int or `NoneType`
466   
467    :rtype: list of :class:`HierarchicalCluster` instances
468   
469    """
470    if prune:
471        if len(node) <= prune:
472            return []
473    if node.branches:
474        return [node] + cluster_to_list(node.left, prune) + cluster_to_list(node.right, prune)
475    return [node]
476
477
478def top_clusters(root, k):
479    """ Return k topmost clusters from hierarchical clustering.
480   
481    :param root: Root cluster.
482    :type root: :class:`HierarchicalCluster`
483    :param k: Number of top clusters.
484    :type k: int
485   
486    :rtype: list of :class:`HierarchicalCluster` instances
487
488    """
489    candidates = set([root])
490    while len(candidates) < k:
491        repl = max([(c.height, c) for c in candidates if c.branches])[1]
492        candidates.discard(repl)
493        candidates.add(repl.left)
494        candidates.add(repl.right)
495    return candidates
496
497def top_cluster_membership(root, k):
498    """ Return data instances' cluster membership (list of indices) to k topmost clusters.
499   
500    :param root: Root cluster.
501    :type root: :class:`HierarchicalCluster`
502    :param k: Number of top clusters.
503    :type k: int
504   
505    :rtype: list-of-int
506   
507    """
508    clist = top_clusters(root, k)
509    cmap = [None] * len(root)
510    for i, c in enumerate(clist):
511        for e in c:
512            cmap[e] = i
513    return cmap
514
515def order_leaves_py(tree, matrix, progress_callback=None):
516    """Order the leaves in the clustering tree.
517   
518    (based on Ziv Bar-Joseph et al. (Fast optimal leaf ordering for hierarchical clustering))
519   
520    :param tree: Binary hierarchical clustering tree.
521    :type tree: :class:`HierarchicalCluster`
522    :param matrix: SymMatrix that was used to compute the clustering.
523    :type matrix: :class:`Orange.misc.SymMatrix`
524    :param progress_callback: Function used to report on progress.
525    :type progress_callback: function
526   
527    The ordering is done inplace.
528   
529    """
530
531    objects = getattr(tree.mapping, "objects", None)
532    tree.mapping.setattr("objects", range(len(tree)))
533    M = {}
534    ordering = {}
535    visited_clusters = set()
536
537#    def opt_ordering_recursive(tree):
538#        if len(tree)==1:
539#            for leaf in tree:
540#                M[tree, leaf, leaf] = 0
541#        else:
542#            opt_ordering_recursive(tree.left)
543#            opt_ordering_recursive(tree.right)
544#           
545#            Vl = set(tree.left)
546#            Vr = set(tree.right)
547#            Vlr = set(tree.left.right or tree.left)
548#            Vll = set(tree.left.left or tree.left)
549#            Vrr = set(tree.right.right or tree.right)
550#            Vrl = set(tree.right.left or tree.right)
551#            other = lambda e, V1, V2: V2 if e in V1 else V1
552#            tree_left, tree_right = tree.left, tree.right
553#            for u in Vl:
554#                for w in Vr:
555##                    if True: #Improved search
556#                    C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)])
557#                    ordered_m = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m])
558#                    ordered_k = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k])
559#                    k0 = ordered_k[0]
560#                    cur_min = 1e30000
561#                    cur_m = cur_k = None
562#                    for m in ordered_m:
563#                        if M[tree_left, u, m] + M[tree_right, w, k0] + C >= cur_min:
564#                            break
565#                        for k in  ordered_k:
566#                            if M[tree_left, u, m] + M[tree_right, w, k] + C >= cur_min:
567#                                break
568#                            test_min = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
569#                            if cur_min > test_min:
570#                                cur_min = test_min
571#                                cur_m = m
572#                                cur_k = k
573#                    M[tree, u, w] = M[tree, w, u] = cur_min
574#                    ordering[tree, u, w] = (tree_left, u, cur_m, tree_right, w, cur_k)
575#                    ordering[tree, w, u] = (tree_right, w, cur_k, tree_left, u, cur_m)
576##                    else:
577##                    def MFunc((m, k)):
578##                        return M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
579##                    m, k = min([(m, k) for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)], key=MFunc)
580##                    M[tree, u, w] = M[tree, w, u] = MFunc((m, k))
581##                    ordering[tree, u, w] = (tree_left, u, m, tree_right, w, k)
582##                    ordering[tree, w, u] = (tree_right, w, k, tree_left, u, m)
583#
584#            if progressCallback:
585#                progressCallback(100.0 * len(visited_clusters) / len(tree.mapping))
586#                visited_clusters.add(tree)
587#   
588#    with recursion_limit(sys.getrecursionlimit() + len(tree)):
589#        opt_ordering_recursive(tree)
590
591    def opt_ordering_iterative(tree):
592        if len(tree) == 1:
593            for leaf in tree:
594                M[tree, leaf, leaf] = 0
595        else:
596#            _optOrdering(tree.left)
597#            _optOrdering(tree.right)
598
599            Vl = set(tree.left)
600            Vr = set(tree.right)
601            Vlr = set(tree.left.right or tree.left)
602            Vll = set(tree.left.left or tree.left)
603            Vrr = set(tree.right.right or tree.right)
604            Vrl = set(tree.right.left or tree.right)
605            other = lambda e, V1, V2: V2 if e in V1 else V1
606            tree_left, tree_right = tree.left, tree.right
607            for u in Vl:
608                for w in Vr:
609#                    if True: #Improved search
610                        C = min([matrix[m, k] for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)])
611                        ordered_m = sorted(other(u, Vll, Vlr), key=lambda m: M[tree_left, u, m])
612                        ordered_k = sorted(other(w, Vrl, Vrr), key=lambda k: M[tree_right, w, k])
613                        k0 = ordered_k[0]
614                        cur_min = 1e30000
615                        cur_m = cur_k = None
616                        for m in ordered_m:
617                            if M[tree_left, u, m] + M[tree_right, w, k0] + C >= cur_min:
618                                break
619                            for k in  ordered_k:
620                                if M[tree_left, u, m] + M[tree_right, w, k] + C >= cur_min:
621                                    break
622                                test_min = M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
623                                if cur_min > test_min:
624                                    cur_min = test_min
625                                    cur_m = m
626                                    cur_k = k
627                        M[tree, u, w] = M[tree, w, u] = cur_min
628                        ordering[tree, u, w] = (tree_left, u, cur_m, tree_right, w, cur_k)
629                        ordering[tree, w, u] = (tree_right, w, cur_k, tree_left, u, cur_m)
630#                    else:
631#                        def MFunc((m, k)):
632#                            return M[tree_left, u, m] + M[tree_right, w, k] + matrix[m, k]
633#                        m, k = min([(m, k) for m in other(u, Vll, Vlr) for k in other(w, Vrl, Vrr)], key=MFunc)
634#                        M[tree, u, w] = M[tree, w, u] = MFunc((m, k))
635#                        ordering[tree, u, w] = (tree_left, u, m, tree_right, w, k)
636#                        ordering[tree, w, u] = (tree_right, w, k, tree_left, u, m)
637
638#            if progressCallback:
639#                progressCallback(100.0 * len(visited_clusters) / len(tree.mapping))
640#                visited_clusters.add(tree)
641
642    subtrees = postorder(tree)
643    milestones = progress_bar_milestones(len(subtrees), 1000)
644
645    for i, subtree in enumerate(subtrees):
646        opt_ordering_iterative(subtree)
647        if progress_callback and i in milestones:
648            progress_callback(100.0 * i / len(subtrees))
649
650#    def order_recursive(tree, u, w):
651#        """ Order the tree based on the computed optimal ordering.
652#        """
653#        if len(tree)==1:
654#            return
655#        left, u, m, right, w, k = ordering[tree, u, w]
656#        if len(left)>1 and m not in left.right:
657#            left.swap()
658#        order_recursive(left, u, m)
659#       
660#        if len(right)>1 and k not in right.left:
661#            right.swap()
662#        order_recursive(right, k, w)
663
664    def order_iterative(tree, u, w):
665        """ Order the tree based on the computed optimal ordering.
666        """
667        opt_uw = {tree: (u, w)}
668        for subtree in preorder(tree):
669            if subtree.branches:
670                u, w = opt_uw[subtree]
671                left, u, m, right, w, k = ordering[subtree, u, w]
672                opt_uw[left] = (u, m)
673                opt_uw[right] = (k, w)
674
675                if left.branches and m not in left.right:
676                    left.swap()
677
678                if right.branches and k not in right.left:
679                    right.swap()
680
681    u, w = min([(u, w) for u in tree.left for w in tree.right], key=lambda (u, w): M[tree, u, w])
682
683##    print "M(v) =", M[tree, u, w]
684
685#    with recursion_limit(sys.getrecursionlimit() + len(tree)):
686#        order_recursive(tree, u, w)
687
688    order_iterative(tree, u, w)
689
690
691#    def _check(tree, u, w):
692#        if len(tree)==1:
693#            return
694#        left, u, m, right, w, k = ordering[tree, u, w]
695#        if tree[0] == u and tree[-1] == w:
696#            _check(left, u, m)
697#            _check(right, k, w)
698#        else:
699#            print "Error:", u, w, tree[0], tree[-1]
700#
701#    with recursion_limit(sys.getrecursionlimit() + len(tree)):
702#        _check(tree, u ,w)
703
704    if objects:
705        tree.mapping.setattr("objects", objects)
706
707
708def order_leaves_cpp(tree, matrix, progress_callback=None):
709    """ Order the leaves in the clustering tree (C++ implementation).
710   
711    (based on Ziv Bar-Joseph et al. (Fast optimal leaf ordering for hierarchical clustering))
712   
713    :param tree: Binary hierarchical clustering tree.
714    :type tree: :class:`HierarchicalCluster`
715    :param matrix: SymMatrix that was used to compute the clustering.
716    :type matrix: :class:`Orange.misc.SymMatrix`
717    :param progress_callback: Function used to report on progress.
718    :type progress_callback: function
719   
720    The ordering is done inplace.
721   
722    """
723    node_count = iter(range(len(tree)))
724
725    if progress_callback is not None:
726        def p(*args):
727            progress_callback(100.0 * node_count.next() / len(tree))
728    else:
729        p = None
730
731    Orange.core.HierarchicalClusterOrdering(tree, matrix, progress_callback=p)
732
733order_leaves_cpp = deprecated_keywords({"progressCallback":"progress_callback"})(order_leaves_cpp)
734order_leaves_py = deprecated_keywords({"progressCallback":"progress_callback"})(order_leaves_py)
735
736order_leaves = order_leaves_cpp
737
738"""
739Matplotlib dendrogram ploting. This is mostly untested,
740use dendrogram_draw funciton instead of this.
741
742"""
743try:
744    import numpy
745except ImportError:
746    numpy = None
747
748try:
749    import matplotlib
750    from matplotlib.figure import Figure
751    from matplotlib.table import Table, Cell
752    from matplotlib.text import Text
753    from matplotlib.artist import Artist
754##    import  matplotlib.pyplot as plt
755except (ImportError, IOError, RuntimeError), ex:
756    matplotlib = None
757    Text , Artist, Table, Cell = object, object, object, object
758
759class TableCell(Cell):
760    PAD = 0.05
761    def __init__(self, *args, **kwargs):
762        Cell.__init__(self, *args, **kwargs)
763        self._text.set_clip_on(True)
764
765class TablePlot(Table):
766    max_fontsize = 12
767    def __init__(self, xy, axes=None, bbox=None):
768        Table.__init__(self, axes or plt.gca(), bbox=bbox)
769        self.xy = xy
770        self.set_transform(self._axes.transData)
771        self._fixed_widhts = None
772        import matplotlib.pyplot as plt
773        self.max_fontsize = plt.rcParams.get("font.size", 12)
774
775    def add_cell(self, row, col, *args, **kwargs):
776        xy = (0, 0)
777
778        cell = TableCell(xy, *args, **kwargs)
779        cell.set_figure(self.figure)
780        cell.set_transform(self.get_transform())
781
782        cell.set_clip_on(True)
783        cell.set_clip_box(self._axes.bbox)
784        cell._text.set_clip_box(self._axes.bbox)
785        self._cells[(row, col)] = cell
786
787    def draw(self, renderer):
788        if not self.get_visible(): return
789        self._update_positions(renderer)
790
791        keys = self._cells.keys()
792        keys.sort()
793        for key in keys:
794            self._cells[key].draw(renderer)
795
796    def _update_positions(self, renderer):
797        keys = numpy.array(self._cells.keys())
798        cells = numpy.array([[self._cells.get((row, col), None) for col in range(max(keys[:, 1] + 1))] \
799                             for row in range(max(keys[:, 0] + 1))])
800
801        widths = self._get_column_widths(renderer)
802        x = self.xy[0] + numpy.array([numpy.sum(widths[:i]) for i in range(len(widths))])
803        y = self.xy[1] - numpy.arange(cells.shape[0]) - 0.5
804
805        for i in range(cells.shape[0]):
806            for j in range(cells.shape[1]):
807                cells[i, j].set_xy((x[j], y[i]))
808                cells[i, j].set_width(widths[j])
809                cells[i, j].set_height(1.0)
810
811        self._width = numpy.sum(widths)
812        self._height = cells.shape[0]
813
814        self.pchanged()
815
816    def _get_column_widths(self, renderer):
817        keys = numpy.array(self._cells.keys())
818        widths = numpy.zeros(len(keys)).reshape((numpy.max(keys[:, 0] + 1), numpy.max(keys[:, 1] + 1)))
819        fontSize = self._calc_fontsize(renderer)
820        for (row, col), cell in self._cells.items():
821            cell.set_fontsize(fontSize)
822            l, b, w, h = cell._text.get_window_extent(renderer).bounds
823            transform = self._axes.transData.inverted()
824            x1, _ = transform.transform_point((0, 0))
825            x2, _ = transform.transform_point((w + w * TableCell.PAD + 10, 0))
826            w = abs(x1 - x2)
827            widths[row, col] = w
828        return numpy.max(widths, 0)
829
830    def _calc_fontsize(self, renderer):
831        transform = self._axes.transData
832        _, y1 = transform.transform_point((0, 0))
833        _, y2 = transform.transform_point((0, 1))
834        return min(max(int(abs(y1 - y2) * 0.85) , 4), self.max_fontsize)
835
836    def get_children(self):
837        return self._cells.values()
838
839    def get_bbox(self):
840        return matplotlib.transform.Bbox([self.xy[0], self.xy[1], self.xy[0] + 10, self.xy[1] + 180])
841
842class DendrogramPlotPylab(object):
843    def __init__(self, root, data=None, labels=None, dendrogram_width=None, heatmap_width=None, label_width=None, space_width=None, border_width=0.05, plot_attr_names=False, cmap=None, params={}):
844        if not matplotlib:
845            raise ImportError("Could not import matplotlib module. Please make sure matplotlib is installed on your system.")
846        import matplotlib.pyplot as plt
847        self.plt = plt
848        self.root = root
849        self.data = data
850        self.labels = labels if labels else [str(i) for i in range(len(root))]
851        self.dendrogram_width = dendrogram_width
852        self.heatmap_width = heatmap_width
853        self.label_width = label_width
854        self.space_width = space_width
855        self.border_width = border_width
856        self.params = params
857        self.plot_attr_names = plot_attr_names
858
859    def plotDendrogram(self):
860        self.text_items = []
861        def draw_tree(tree):
862            if tree.branches:
863                points = []
864                for branch in tree.branches:
865                    center = draw_tree(branch)
866                    self.plt.plot([center[0], tree.height], [center[1], center[1]], color="black")
867                    points.append(center)
868                self.plt.plot([tree.height, tree.height], [points[0][1], points[-1][1]], color="black")
869                return (tree.height, (points[0][1] + points[-1][1]) / 2.0)
870            else:
871                return (0.0, tree.first)
872        draw_tree(self.root)
873
874    def plotHeatMap(self):
875        import numpy.ma as ma
876        import numpy
877        dx, dy = self.root.height, 0
878        fx, fy = self.root.height / len(self.data.domain.attributes), 1.0
879        data, c, w = self.data.toNumpyMA()
880        data = (data - ma.min(data)) / (ma.max(data) - ma.min(data))
881        x = numpy.arange(data.shape[1] + 1) / float(numpy.max(data.shape))
882        y = numpy.arange(data.shape[0] + 1) / float(numpy.max(data.shape)) * len(self.root)
883        self.heatmap_width = numpy.max(x)
884
885        X, Y = numpy.meshgrid(x, y - 0.5)
886
887        self.meshXOffset = numpy.max(X)
888
889        self.plt.jet()
890        mesh = self.plt.pcolormesh(X, Y, data[self.root.mapping], edgecolor="b", linewidth=2)
891
892        if self.plot_attr_names:
893            names = [attr.name for attr in self.data.domain.attributes]
894            self.plt.xticks(numpy.arange(data.shape[1] + 1) / float(numpy.max(data.shape)), names)
895        self.plt.gca().xaxis.tick_top()
896        for label in self.plt.gca().xaxis.get_ticklabels():
897            label.set_rotation(45)
898
899        for tick in self.plt.gca().xaxis.get_major_ticks():
900            tick.tick1On = False
901            tick.tick2On = False
902
903    def plotLabels_(self):
904        import numpy
905##        self.plt.yticks(numpy.arange(len(self.labels) - 1, 0, -1), self.labels)
906##        for tick in self.plt.gca().yaxis.get_major_ticks():
907##            tick.tick1On = False
908##            tick.label1On = False
909##            tick.label2On = True
910##        text = TableTextLayout(xy=(self.meshXOffset+1, len(self.root)), tableText=[[label] for label in self.labels])
911        text = TableTextLayout(xy=(self.meshXOffset * 1.005, len(self.root) - 1), tableText=[[label] for label in self.labels])
912        text.set_figure(self.plt.gcf())
913        self.plt.gca().add_artist(text)
914        self.plt.gca()._set_artist_props(text)
915
916    def plotLabels(self):
917##        table = TablePlot(xy=(self.meshXOffset*1.005, len(self.root) -1), axes=self.plt.gca())
918        table = TablePlot(xy=(0, len(self.root) - 1), axes=self.plt.gca())
919        table.set_figure(self.plt.gcf())
920        for i, label in enumerate(self.labels):
921            table.add_cell(i, 0, width=1, height=1, text=label, loc="left", edgecolor="w")
922        table.set_zorder(0)
923        self.plt.gca().add_artist(table)
924        self.plt.gca()._set_artist_props(table)
925
926    def plot(self, filename=None, show=False):
927        self.plt.rcParams.update(self.params)
928        labelLen = max(len(label) for label in self.labels)
929        w, h = 800, 600
930        space = 0.01 if self.space_width == None else self.space_width
931        border = self.border_width
932        width = 1.0 - 2 * border
933        height = 1.0 - 2 * border
934        textLineHeight = min(max(h / len(self.labels), 4), self.plt.rcParams.get("font.size", 12))
935        maxTextLineWidthEstimate = textLineHeight * labelLen
936##        print maxTextLineWidthEstimate
937        textAxisWidthRatio = 2.0 * maxTextLineWidthEstimate / w
938##        print textAxisWidthRatio
939        labelsAreaRatio = min(textAxisWidthRatio, 0.4) if self.label_width == None else self.label_width
940        x, y = len(self.data.domain.attributes), len(self.data)
941
942        heatmapAreaRatio = min(1.0 * y / h * x / w, 0.3) if self.heatmap_width == None else self.heatmap_width
943        dendrogramAreaRatio = 1.0 - labelsAreaRatio - heatmapAreaRatio - 2 * space if self.dendrogram_width == None else self.dendrogram_width
944
945        self.fig = self.plt.figure()
946        self.labels_offset = self.root.height / 20.0
947        dendrogramAxes = self.plt.axes([border, border, width * dendrogramAreaRatio, height])
948        dendrogramAxes.xaxis.grid(True)
949        import matplotlib.ticker as ticker
950
951        dendrogramAxes.yaxis.set_major_locator(ticker.NullLocator())
952        dendrogramAxes.yaxis.set_minor_locator(ticker.NullLocator())
953        dendrogramAxes.invert_xaxis()
954        self.plotDendrogram()
955        heatmapAxes = self.plt.axes([border + width * dendrogramAreaRatio + space, border, width * heatmapAreaRatio, height], sharey=dendrogramAxes)
956
957        heatmapAxes.xaxis.set_major_locator(ticker.NullLocator())
958        heatmapAxes.xaxis.set_minor_locator(ticker.NullLocator())
959        heatmapAxes.yaxis.set_major_locator(ticker.NullLocator())
960        heatmapAxes.yaxis.set_minor_locator(ticker.NullLocator())
961
962        self.plotHeatMap()
963        labelsAxes = self.plt.axes([border + width * (dendrogramAreaRatio + heatmapAreaRatio + 2 * space), border, width * labelsAreaRatio, height], sharey=dendrogramAxes)
964        self.plotLabels()
965        labelsAxes.set_axis_off()
966        labelsAxes.xaxis.set_major_locator(ticker.NullLocator())
967        labelsAxes.xaxis.set_minor_locator(ticker.NullLocator())
968        labelsAxes.yaxis.set_major_locator(ticker.NullLocator())
969        labelsAxes.yaxis.set_minor_locator(ticker.NullLocator())
970        if filename:
971            import matplotlib.backends.backend_agg
972            canvas = matplotlib.backends.backend_agg.FigureCanvasAgg(self.fig)
973            canvas.print_figure(filename)
974        if show:
975            self.plt.show()
976
977
978"""
979Dendrogram ploting using Orange.utils.render
980"""
981
982from Orange.utils.render import EPSRenderer, ColorPalette
983
984class DendrogramPlot(object):
985    """ A class for drawing dendrograms.
986   
987    :obj:`dendrogram_draw` function is a more convenient interface
988    to the functionality provided by this class.
989       
990    Example::
991   
992        a = DendrogramPlot(tree)
993        a.plot("tree.png", format="png")
994       
995    """
996    def __init__(self, tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, heatmap_width=None, text_width=None,
997                 spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, gamma=None, renderer=EPSRenderer, **kwargs):
998        self.tree = tree
999        self.attr_tree = attr_tree
1000        if not labels:
1001            if data and data.domain.class_var:
1002                labels = [str(ex.getclass()) for ex in data]
1003            elif hasattr(tree.mapping, "objects"):
1004                labels = [str(obj) for obj in tree.mapping.objects]
1005            else:
1006                labels = [""] * len(tree)
1007        self.labels = labels
1008
1009#        self.attr_labels = [str(attr.name) for attr in data.domain.attributes] if not attr_labels and data else attr_labels or []
1010        self.data = data
1011        self.width, self.height = float(width) if width else None, float(height) if height else None
1012        self.tree_height = tree_height
1013        self.heatmap_width = heatmap_width
1014        self.text_width = text_width
1015        self.font_size = 10.0
1016        self.linespacing = 0.0
1017        self.cluster_colors = cluster_colors
1018        self.horizontal_margin = 10.0
1019        self.vertical_margin = 10.0
1020        self.spacing = float(spacing) if spacing else None
1021        self.color_palette = color_palette
1022        self.minv = minv
1023        self.maxv = maxv
1024        self.gamma = gamma
1025        self.set_matrix_color_schema(color_palette, minv, maxv, gamma)
1026        self.renderer = renderer
1027
1028    def set_matrix_color_schema(self, color_palette, minv, maxv, gamma=None):
1029        """ Set the matrix color scheme.
1030        """
1031        if isinstance(color_palette, ColorPalette):
1032            self.color_palette = color_palette
1033        else:
1034            self.color_palette = ColorPalette(color_palette)
1035        self.minv = minv
1036        self.maxv = maxv
1037        self.gamma = gamma
1038
1039    def color_shema(self):
1040        vals = [float(val) for ex in self.data for val in ex if not val.isSpecial() and val.variable.varType == orange.VarTypes.Continuous] or [0]
1041        avg = sum(vals) / len(vals)
1042
1043        maxVal = self.maxv if self.maxv else max(vals)
1044        minVal = self.minv if self.minv else min(vals)
1045
1046        def _colorSchema(val):
1047            if val.isSpecial():
1048                return self.color_palette(None)
1049            elif val.variable.varType == orange.VarTypes.Continuous:
1050                r, g, b = self.color_palette((float(val) - minVal) / abs(maxVal - minVal), gamma=self.gamma)
1051            elif val.variable.varType == orange.VarTypes.Discrete:
1052                r = g = b = int(255.0 * float(val) / len(val.variable.values))
1053            return (r, g, b)
1054        return _colorSchema
1055
1056    def layout(self):
1057        height_final = False
1058        width_final = False
1059        tree_height = self.tree_height or 100
1060        if self.height:
1061            height, height_final = self.height, True
1062            heatmap_height = height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin
1063            font_size = heatmap_height / len(self.labels) #self.font_size or (height - (tree_height + self.spacing if self.attr_tree else 0) - 2 * self.horizontal_margin) / len(self.labels)
1064        else:
1065            font_size = self.font_size
1066            heatmap_height = font_size * len(self.labels)
1067            height = heatmap_height + (tree_height + self.spacing if self.attr_tree else 0) + 2 * self.horizontal_margin
1068
1069        text_width = self.text_width or max([len(label) for label in self.labels] + [0]) * font_size #max([self.renderer.string_size_hint(label) for label in self.labels])
1070
1071        if self.width:
1072            width = self.width
1073            heatmap_width = width - 2 * self.vertical_margin - tree_height - (2 if self.data else 1) * self.spacing - text_width if self.data else 0
1074        else:
1075            heatmap_width = len(self.data.domain.attributes) * heatmap_height / len(self.data) if self.data else 0
1076            width = 2 * self.vertical_margin + tree_height + (heatmap_width + self.spacing if self.data else 0) + self.spacing + text_width
1077
1078        return width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size
1079
1080    def plot(self, filename="graph.eps", **kwargs):
1081        width, height, tree_height, heatmap_width, heatmap_height, text_width, font_size = self.layout()
1082        heatmap_cell_height = heatmap_height / len(self.labels)
1083
1084        heatmap_cell_width = 0.0 if not self.data else heatmap_width / len(self.data.domain.attributes)
1085
1086        self.renderer = self.renderer(width, height)
1087
1088        def draw_tree(cluster, root, treeheight, treewidth, color):
1089            height = treeheight * cluster.height / root.height
1090            if cluster.branches:
1091                centers = []
1092                for branch in cluster.branches:
1093                    center = draw_tree(branch, root, treeheight, treewidth, self.cluster_colors.get(branch, color))
1094                    centers.append(center)
1095                    self.renderer.draw_line(center[0], center[1], center[0], height, stroke_color=self.cluster_colors.get(branch, color))
1096
1097                self.renderer.draw_line(centers[0][0], height, centers[-1][0], height, stroke_color=self.cluster_colors.get(cluster, color))
1098                return (centers[0][0] + centers[-1][0]) / 2.0, height
1099            else:
1100                return float(treewidth) * cluster.first / len(root), 0.0
1101        self.renderer.save_render_state()
1102        self.renderer.translate(self.vertical_margin + tree_height, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0) + heatmap_cell_height / 2.0)
1103        self.renderer.rotate(90)
1104#        print self.renderer.transform()
1105        draw_tree(self.tree, self.tree, tree_height, heatmap_height, self.cluster_colors.get(self.tree, (0, 0, 0)))
1106        self.renderer.restore_render_state()
1107        if self.attr_tree:
1108            self.renderer.save_render_state()
1109            self.renderer.translate(self.vertical_margin + tree_height + self.spacing + heatmap_cell_width / 2.0, self.horizontal_margin + tree_height)
1110            self.renderer.scale(1.0, -1.0)
1111#            print self.renderer.transform()
1112            draw_tree(self.attr_tree, self.attr_tree, tree_height, heatmap_width, self.cluster_colors.get(self.attr_tree, (0, 0, 0)))
1113            self.renderer.restore_render_state()
1114
1115        self.renderer.save_render_state()
1116        self.renderer.translate(self.vertical_margin + tree_height + self.spacing, self.horizontal_margin + (tree_height + self.spacing if self.attr_tree else 0))
1117#        print self.renderer.transform()
1118        if self.data:
1119            colorSchema = self.color_shema()
1120            for i, ii in enumerate(self.tree):
1121                ex = self.data[ii]
1122                for j, jj in enumerate((self.attr_tree if self.attr_tree else range(len(self.data.domain.attributes)))):
1123                    r, g, b = colorSchema(ex[jj])
1124                    self.renderer.draw_rect(j * heatmap_cell_width, i * heatmap_cell_height, heatmap_cell_width, heatmap_cell_height, fill_color=(r, g, b), stroke_color=(255, 255, 255))
1125
1126        self.renderer.translate(heatmap_width + self.spacing, heatmap_cell_height)
1127#        print self.renderer.transform()
1128        self.renderer.set_font("Times-Roman", font_size)
1129        for index in self.tree.mapping: #label in self.labels:
1130            self.renderer.draw_text(0.0, 0.0, self.labels[index])
1131            self.renderer.translate(0.0, heatmap_cell_height)
1132        self.renderer.restore_render_state()
1133        self.renderer.save(filename, **kwargs)
1134
1135
1136def dendrogram_draw(file, cluster, attr_cluster=None, labels=None, data=None,
1137                    width=None, height=None, tree_height=None,
1138                    heatmap_width=None, text_width=None, spacing=2,
1139                    cluster_colors={},
1140                    color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]),
1141                    maxv=None, minv=None, gamma=None, format=None):
1142    """ Plot the dendrogram to ``file``.
1143   
1144    :param file: An  open file or a filename to store the image to. The output format
1145        is chosen according to the extension. Supported formats: PNG, EPS, SVG.
1146    :type file: str or an file-like object suitable for writing.
1147   
1148    :param cluster: An instance of :class:`HierarcicalCluster`
1149    :type cluster: :class:`HierarcicalCluster`
1150   
1151    :param attr_cluster: An instance of :class:`HierarcicalCluster` for the attributes
1152        in ``data`` (unused if ``data`` is not supplied)
1153    :type attr_cluster: :class:`HierarcicalCluster`
1154   
1155    :param labels: Labels for the ``cluster`` leaves.
1156    :type labels: list-of-strings
1157   
1158    :param data: A data table for the (optional) heatmap.
1159    :type data: :class:`Orange.data.Table`
1160   
1161    :param width: Image width.
1162    :type width: int
1163   
1164    :param height: Image height.
1165    :type height: int
1166   
1167    :param tree_height: Dendrogram tree height in the image.
1168    :type tree_height: int
1169   
1170    :param heatmap_width: Heatmap width.
1171    :type heatmap_width: int
1172   
1173    :param text_width: Text labels area width.
1174    :type text_width: int
1175   
1176    :param spacing: Spacing between consecutive leaves.
1177    :type spacing: int
1178   
1179    :param cluster_colors: A dictionary mapping :class:`HierarcicalCluster`
1180        instances in ``cluster`` to a RGB color 3-tuple.
1181    :type cluster_colors: dict
1182   
1183    :param format: Output image format Currently supported formats are
1184        "png" (default), "eps" and "svg". You only need this arguement if the
1185        format cannot be deduced from the ``file`` argument.
1186       
1187    :type format: str
1188   
1189    """
1190    import os
1191    from Orange.utils.render import PILRenderer, EPSRenderer, SVGRenderer
1192    if isinstance(file, basestring):
1193        name, ext = os.path.splitext(file)
1194        format = ext.lower().lstrip(".") or format
1195
1196    if format is None:
1197        format = "png"
1198
1199    renderer = {"eps":EPSRenderer, "svg":SVGRenderer, "png":PILRenderer}.get(format, "png")
1200
1201    d = DendrogramPlot(cluster, attr_cluster, labels, data, width, height,
1202                       tree_height, heatmap_width, text_width, spacing,
1203                       cluster_colors, color_palette, maxv, minv, gamma,
1204                       renderer=renderer)
1205    if renderer is PILRenderer:
1206        d.plot(file, format=format)
1207    else:
1208        d.plot(file)
1209
1210def postorder(cluster):
1211    """ Return a post order list of clusters.
1212   
1213    :param cluster: Cluster
1214    :type cluster: :class:`HierarchicalCluster`
1215   
1216    :rtype: list of :class:`HierarchicalCluster` instances
1217   
1218    """
1219    order = []
1220    visited = set()
1221    stack = [cluster]
1222    while stack:
1223        cluster = stack.pop(0)
1224
1225        if cluster.branches:
1226            if cluster in visited:
1227                order.append(cluster)
1228            else:
1229                stack = cluster.branches + [cluster] + stack
1230                visited.add(cluster)
1231        else:
1232            order.append(cluster)
1233            visited.add(cluster)
1234    return order
1235
1236
1237def preorder(cluster):
1238    """ Return a pre order list of clusters.
1239   
1240    :param cluster: Cluster
1241    :type cluster: :class:`HierarchicalCluster`
1242   
1243    :rtype: list of :class:`HierarchicalCluster` instances
1244   
1245    """
1246    order = []
1247    stack = [cluster]
1248    while stack:
1249        cluster = stack.pop(0)
1250        order.append(cluster)
1251        if cluster.branches:
1252            stack = cluster.branches + stack
1253    return order
1254
1255
1256def dendrogram_layout(cluster, expand_leaves=False):
1257    """ Return a layout of the cluster dendrogram on a 2D plane. The return
1258    value if a list of (subcluster, (start, center, end)) tuples where
1259    subcluster is an instance of :class:`HierarchicalCluster` and start,
1260    end are the two end points for the cluster branch. The list is sorted
1261    in post-order.
1262   
1263    :param cluster: Cluster to layout.
1264    :type cluster: :class:`HierarchicalCluster`
1265   
1266    :param expand_leaves: If True leaves will span proportional to the number
1267        of items they contain, else all leaves will be the same width.
1268   
1269    :rtype: list of (:class:`HierarchicalCluster`, (start, center, end)) tuples
1270   
1271    """
1272    result = []
1273    cluster_geometry = {}
1274    leaf_idx = 0
1275    for cluster in postorder(cluster):
1276        if not cluster.branches:
1277            if expand_leaves:
1278                start, end = float(cluster.first), float(cluster.last)
1279            else:
1280                start = end = leaf_idx
1281                leaf_idx += 1
1282            center = (start + end) / 2.0
1283            cluster_geometry[cluster] = (start, center, end)
1284            result.append((cluster, (start, center, end)))
1285        else:
1286            left = cluster.branches[0]
1287            right = cluster.branches[1]
1288            left_center = cluster_geometry[left][1]
1289            right_center = cluster_geometry[right][1]
1290            start, end = left_center, right_center
1291            center = (start + end) / 2.0
1292            cluster_geometry[cluster] = (start, center, end)
1293            result.append((cluster, (start, center, end)))
1294
1295    return result
1296
1297def clone(cluster):
1298    """ Clone a cluster, including it's subclusters.
1299   
1300    :param cluster: Cluster to clone
1301    :type cluster: :class:`HierarchicalCluster`
1302   
1303    :rtype: :class:`HierarchicalCluster`
1304   
1305    """
1306    import copy
1307    clones = {}
1308    mapping = copy.copy(cluster.mapping)
1309    for node in postorder(cluster):
1310        node_clone = copy.copy(node)
1311        if node.branches:
1312            node_clone.branches = [clones[b] for b in node.branches]
1313        node_clone.mapping = mapping
1314        clones[node] = node_clone
1315
1316    return clones[cluster]
1317
1318def pruned(cluster, level=None, height=None, condition=None):
1319    """ Return a new pruned clustering instance.   
1320    It uses :obj:`clone` to create a copy of the `cluster`
1321    instance.
1322   
1323    :param cluster: Cluster to prune.
1324    :type cluster: :class:`HierarchicalCluster`
1325   
1326    :param level: If not None prune all clusters deeper then `level`.
1327    :type level: int
1328   
1329    :param height: If not None prune all clusters with height lower then
1330        `height`.
1331    :type height: float
1332   
1333    :param condition: If not None condition must be a function taking a
1334        single :class:`HierarchicalCluster` instance and returning a True
1335        or False value indicating if the cluster should be pruned.
1336    :type condition: function
1337   
1338    :rtype: :class:`HierarchicalCluster`
1339   
1340    """
1341    cluster = clone(cluster)
1342    prune(cluster, level, height, condition)
1343    return cluster
1344
1345
1346def prune(cluster, level=None, height=None, condition=None):
1347    """ Prune the clustering instance ``cluster`` in place.
1348   
1349    :param cluster: Cluster to prune.
1350    :type cluster: :class:`HierarchicalCluster`
1351   
1352    :param level: If not None prune all clusters deeper then `level`.
1353    :type level: int
1354   
1355    :param height: If not None prune all clusters with height lower then
1356        `height`.
1357    :type height: float
1358   
1359    :param condition: If not None condition must be a function taking a
1360        single :class:`HierarchicalCluster` instance and returning a True
1361        or False value indicating if the cluster should be pruned.
1362    :type condition: function
1363   
1364    """
1365    if not any(arg is not None for arg in [level, height, condition]):
1366        raise ValueError("At least one pruning argument must be supplied")
1367
1368    level_check = height_check = condition_check = lambda cl: False
1369    cluster_depth = cluster_depths(cluster)
1370
1371    if level is not None:
1372        level_check = lambda cl: cluster_depth[cl] >= level
1373
1374    if height is not None:
1375        height_check = lambda cl: cl.height <= height
1376
1377    if condition is not None:
1378        condition_check = condition
1379
1380    pruned_set = set()
1381
1382    def check_all(cl):
1383        return any([check(cl) for check in [level_check, height_check,
1384                                            condition_check]])
1385
1386    for cluster in preorder(cluster):
1387        if cluster not in pruned_set:
1388            if check_all(cluster):
1389                cluster.branches = None
1390                pruned_set.update(set(preorder(cluster)))
1391            else:
1392                pass
1393
1394
1395def cluster_depths(cluster):
1396    """ Return a dictionary mapping :class:`HierarchicalCluster` instances to
1397    their depths in the `cluster` hierarchy.
1398   
1399    :param cluster: Root cluster
1400    :type cluster: :class:`HierarchicalCluster`
1401   
1402    :rtype: class:`dict`
1403   
1404    """
1405    depths = {}
1406    depths[cluster] = 0
1407    for cluster in preorder(cluster):
1408        cl_depth = depths[cluster]
1409        if cluster.branches:
1410            depths.update(dict.fromkeys(cluster.branches, cl_depth + 1))
1411    return depths
1412
1413instance_distance_matrix = Orange.distance.distance_matrix
1414
1415def feature_distance_matrix(data, distance=None, progress_callback=None):
1416    """ A helper function that computes an :class:`Orange.misc.SymMatrix` of
1417    all pairwise distances between features in `data`.
1418   
1419    :param data: A data table
1420    :type data: :class:`Orange.data.Table`
1421   
1422    :param distance: a function taking two lists and returning the distance.
1423    :type distance: function
1424   
1425    :param progress_callback: A function (taking one argument) to use for
1426        reporting the on the progress.
1427    :type progress_callback: function
1428   
1429    :rtype: :class:`Orange.misc.SymMatrix`
1430   
1431    """
1432    attributes = data.domain.attributes
1433    matrix = Orange.misc.SymMatrix(len(attributes))
1434    iter_count = matrix.dim * (matrix.dim - 1) / 2
1435    milestones = progress_bar_milestones(iter_count, 100)
1436
1437    for count, ((i, a1), (j, a2)) in enumerate(_pairs(enumerate(attributes))):
1438        matrix[i, j] = (1.0 - orange.PearsonCorrelation(a1, a2, data, 0).r) / 2.0
1439        if progress_callback and count in milestones:
1440            progress_callback(100.0 * count / iter_count)
1441
1442    return matrix
1443
1444
1445def _pairs(seq, same=False):
1446    """ Return all pairs from elements of `seq`.
1447    """
1448    seq = list(seq)
1449    same = 0 if same else 1
1450    for i in range(len(seq)):
1451        for j in range(i + same, len(seq)):
1452            yield seq[i], seq[j]
1453
1454
1455def joining_cluster(cluster, item1, item2):
1456    """ Return the cluster where `item1` and `item2` are first joined
1457   
1458    :param cluster: Clustering.
1459    :type cluster: :class:`HierarchicalCluster`
1460    :param item1: An element of `cluster.mapping` or `cluster.mapping.objects`
1461    :param item2: An element of `cluster.mapping` or `cluster.mapping.objects`
1462   
1463    :rtype: :class:`HierarchicalCluster`
1464   
1465    """
1466    while True:
1467        if cluster.branches:
1468            for branch in cluster.branches:
1469                if item1 in branch and item2 in branch:
1470                    cluster = branch
1471                    break
1472            else:
1473                return cluster
1474        else:
1475            return cluster
1476
1477
1478def cophenetic_distances(cluster):
1479    """ Return the cophenetic distance matrix between items in clustering.
1480    Cophenetic distance is defined as the height of the cluster where the
1481    two items are first joined.
1482   
1483    :param cluster: Clustering.
1484    :type cluster: :class:`HierarchicalCluster`
1485   
1486    :rtype: :class:`Orange.misc.SymMatrix`
1487   
1488    """
1489
1490    mapping = cluster.mapping
1491    matrix = Orange.misc.SymMatrix(len(mapping))
1492    for cluster in postorder(cluster):
1493        if cluster.branches:
1494            for branch1, branch2 in _pairs(cluster.branches):
1495                for idx1 in mapping[branch1.first: branch1.last]:
1496                    for idx2 in mapping[branch2.first: branch2.last]:
1497                        matrix[idx1, idx2] = cluster.height
1498
1499        else:
1500            for ind1, ind2 in _pairs(mapping[cluster.first: cluster.last]):
1501                matrix[ind1, ind2] = cluster.height
1502
1503    return matrix
1504
1505
1506def cophenetic_correlation(cluster, matrix):
1507    """ Return the `cophenetic correlation coefficient
1508    <http://en.wikipedia.org/wiki/Cophenetic_correlation>`_ of the given
1509    clustering.
1510   
1511    :param cluster: Clustering.
1512    :type cluster: :class:`HierarchicalCluster`
1513   
1514    :param matrix: The distance matrix from which the `cluster` was
1515        derived.
1516   
1517    :rtype: :class:`float`
1518   
1519    """
1520    import numpy
1521    cophenetic = cophenetic_distances(cluster)
1522    cophenetic = [list(row) for row in cophenetic]
1523    original = [list(row) for row in matrix]
1524    cophenetic = numpy.ravel(cophenetic)
1525    original = numpy.ravel(original)
1526    return numpy.corrcoef(cophenetic, original)[0, 1]
1527
1528
Note: See TracBrowser for help on using the repository browser.