Changeset 10106:643f0754d673 in orange


Ignore:
Timestamp:
02/08/12 18:40:14 (2 years ago)
Author:
markotoplak
Branch:
default
rebase_source:
a51d842a2c0ecf4c862a05d451fffa52d98e44bf
Message:

Hierarchical clustering documentation updates. New documentation scripts. Fixed a showcase "pruned".

Files:
1 added
3 edited

  • Orange/clustering/hierarchical.py

    r10066 r10106  
    1414The distance matrix has to contain no negative elements, as this helps 
    1515the algorithm to run faster. The elements on the diagonal are ignored. 
     16 
     17.. rubric:: Example 
     18 
     19The following example shows clustering of the Iris data. The distance 
     20matrix is computed with the :class:`Orange.distance.Euclidean` distance 
     21measure and the data is clustered with average linkage. 
     22 
     23.. literalinclude:: code/hierarchical-example-2.py 
     24    :lines: 1-12 
     25 
     26Data instances belonging to the top-most four clusters could be printed out 
     27with: 
     28 
     29.. literalinclude:: code/hierarchical-example-2.py 
     30    :lines: 14-19 
     31             
     32It could be more informative to 
     33print out the class distributions for each cluster. 
     34     
     35.. literalinclude:: code/hierarchical-example-2.py 
     36    :lines: 21-26 
     37         
     38Here is the output.  
     39:: 
     40 
     41    Iris-setosa:   0  Iris-versicolor:  50  Iris-virginica:  17  
     42    Iris-setosa:  49  Iris-versicolor:   0  Iris-virginica:   0  
     43    Iris-setosa:   0  Iris-versicolor:   0  Iris-virginica:  33  
     44    Iris-setosa:   1  Iris-versicolor:   0  Iris-virginica:   0 
     45 
     46The same results could also be obtained with: 
     47 
     48.. literalinclude:: code/hierarchical-example-3.py 
     49    :lines: 1-7 
     50  
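Outside the changeset, the core idea of average linkage can be sketched in a few lines of plain Python. This is not Orange's implementation (which builds a :obj:`HierarchicalCluster` tree); it is a naive illustration of merging the pair of clusters with the smallest average pairwise distance. The ``dist`` dictionary and the ``average_linkage`` name are made up for the sketch.

```python
from itertools import combinations

def average_linkage(dist, n):
    """Naive average-linkage agglomeration over items 0..n-1.

    dist maps frozenset({i, j}) to the distance between items i and j.
    Returns the root cluster (a frozenset of all items) and a dict
    mapping each merged cluster to the pair of clusters it joined.
    """
    clusters = [frozenset([i]) for i in range(n)]

    def d(a, b):
        # average pairwise distance between two disjoint clusters
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    children = {}
    while len(clusters) > 1:
        # merge the closest pair under average linkage
        a, b = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        merged = a | b
        children[merged] = (a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0], children
```

With two tight pairs far apart, the sketch first joins each pair and only then joins the pairs together, which is the shape the Iris example above relies on.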
    1651 
    1752Basic functionality 
     
    6196 
    6297     
    63 Example 1 - Toy matrix 
    64 ---------------------- 
    65  
    66 Let us construct a simple distance matrix and run clustering on it. 
    67  
    68 .. literalinclude:: code/hierarchical-example.py 
    69     :lines: 1-14 
    70      
    71 Root is a root of the cluster hierarchy. We can print using a 
    72 simple recursive function. 
    73  
    74 .. literalinclude:: code/hierarchical-example.py 
    75     :lines: 16-20 
    76              
    77 Our clustering, 
    78 printed by calling printClustering(root) looks like  
    79 :: 
    80      
    81     (((04)((57)(89)))((1(26))3)) 
    82      
    83 The elements are separated into two groups, the first containing elements 
    84 0, 4, 5, 7, 8, 9, and the second 1, 2, 6, 3. The difference between them 
    85 equals ``root.height``, 9.0 in our case. The first cluster is further 
    86 divided onto 0 and 4 in one, and 5, 7, 8, 9 in the other subcluster... 
    87  
    88 It is easy to print out the cluster's objects. Here's what's in the 
    89 left subcluster of root. 
    90 :: 
    91  
    92     >>> for el in root.left: 
    93         ... print el, 
    94     0 4 5 7 8 9  
    95      
    96 Let us write a better function for printing 
    97 out the clustering: instead of printing out the first (and 
    98 supposedly the only) element of cluster, cluster[0], we shall print 
    99 it out as a tuple.  
    100  
    101 .. literalinclude:: code/hierarchical-example.py 
    102     :lines: 22-26 
    103              
    104 We can add object description into clustering by 
    105  
    106 .. literalinclude:: code/hierarchical-example.py 
    107     :lines: 31 
    108      
    109 As before, let us print out the elements of the first left cluster:: 
    110  
    111     >>> for el in root.left: 
    112         ... print el, 
    113     Ann Eve Fred Hue Ivy Jon 
    114      
    115 If objects are given, the cluster's elements, as got by indexing 
    116 (eg root.left[2]) or by iteration, as in the above case, won't be 
    117 indices but the elements we clustered. If we put an ExampleTable 
    118 into objects, root.left[-1] will be the last example of the first 
    119 left cluster. 
    120  
    121 Now for swapping and permutations. :: 
    122  
    123     >>> printClustering(root) 
    124     ((('Ann''Eve')(('Fred''Hue')('Ivy''Jon')))(('Bob'('Curt''Greg'))'Danny')) 
    125     >>> root.left.swap() 
    126     >>> printClustering(root) 
    127     (((('Fred''Hue')('Ivy''Jon'))('Ann''Eve'))(('Bob'('Curt''Greg'))'Danny')) 
    128     >>> root.permute([1, 0]) 
    129     >>> printClustering(root) 
    130     ((('Bob'('Curt''Greg'))'Danny')((('Fred''Hue')('Ivy''Jon'))('Ann''Eve'))) 
    131      
    132 Calling ``root.left.swap`` reversed the order of subclusters of ``root.left`` 
    133 and ``root.permute([1, 0])`` (which is equivalent to ``root.swap`` - there 
    134 aren't many possible permutations of two elements) reverses the order 
    135 of ``root.left`` and ``root.right``. 
    136  
    137 Let us write function for cluster pruning. 
    138  
    139 .. literalinclude:: code/hierarchical-example.py 
    140     :lines: 33-39 
    141  
    142 We shall use ``printClustering2`` here, since we can have multiple elements 
    143 in a leaf of the clustering hierarchy. :: 
    144      
    145     >>> prune(root, 9) 
    146     >>> print printClustering2(root) 
    147     ((('Bob', 'Curt', 'Greg')('Danny',))(('Fred', 'Hue', 'Ivy', 'Jon')('Ann', 'Eve'))) 
    148      
    149 We've ended up with four clusters. Need a list of clusters? 
    150 Here's the function. 
    151  
    152 .. literalinclude:: code/hierarchical-example.py 
    153     :lines: 41-51 
    154          
    155 The function returns a list of lists, in our case 
    156 ``[['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']]``     
    157 If there were no ``objects`` the list would contains indices instead of names. 
    158  
    159 Example 2 - Clustering of examples 
    160 ---------------------------------- 
    161  
    162 Clustering of examples is shown on the Iris data set, initialize a distance 
    163 matrix with the :class:`Euclidean` distance measure 
    164 and cluster it with average linkage. Since we don't need the matrix, 
    165 we shall let the clustering overwrite it. 
    166  
    167 .. literalinclude:: code/hierarchical-example-2.py 
    168     :lines: 1-15 
    169  
    170 Let us now prune the clustering using the function we've written above, 
    171 and print out the clusters. 
    172  
    173 .. literalinclude:: code/hierarchical-example-2.py 
    174     :lines: 16-20 
    175              
    176 Since the printout is pretty long, it might be more informative to just 
    177 print out the class distributions for each cluster. 
    178      
    179 .. literalinclude:: code/hierarchical-example-2.py 
    180     :lines: 22-26 
    181          
    182 Here's what it shows. :: 
    183  
    184     Iris-setosa:  49    Iris-versicolor:   0    Iris-virginica:   0 
    185     Iris-setosa:   1    Iris-versicolor:   0    Iris-virginica:   0 
    186     Iris-setosa:   0    Iris-versicolor:  50    Iris-virginica:  17 
    187     Iris-setosa:   0    Iris-versicolor:   0    Iris-virginica:  33 
    188      
    189 Note something else: ``listOfClusters`` does not return a list of 
    190 :class:`Orange.data.Table`, but a list of lists of instances. Therefore, 
    191 in the above script, cluster is a list of examples, not an ``Table``, but 
    192 it gets converted to it automatically when the function is called. 
    193 Most Orange functions will do this for you automatically. You can, for 
    194 instance, call a learning algorithms, passing a cluster as an argument. 
    195 It won't mind. If you, however, want to have a list of table, you can 
    196 easily convert the list by 
    197  
    198 .. literalinclude:: code/hierarchical-example-2.py 
    199     :lines: 28 
    200      
    201 Finally, if you are dealing with examples, you may want to take the function 
    202 ``listOfClusters`` and replace ``alist.append(list(cluster))`` by 
    203 ``alist.append(Orange.data.Table(cluster))``. This function is less general, 
    204 it will fail if objects are not of type :class:`Orange.data.Instance`. 
    205 However, instead of list of lists, it will return a list of tables. 
    206  
     98 
     99    
    207100Cluster analysis 
    208101----------------- 
     
    264157        list of instances, a list of features (when clustering features), 
    265158        or even a string of the same length as the number of elements. 
     159        If objects are given, the cluster's elements, as obtained by 
     160        indexing or iteration, are not indices but the corresponding 
     161        objects. If we put an :obj:`Orange.data.Table` into objects, 
     162        ``root.left[-1]`` is the last instance of the first left cluster. 
    266163 
    267164    .. attribute:: first 
     
    296193        subclusters will be arranged. As for swap, this function changes the 
    297194        mapping and first and last of all clusters below this one.  
    298      
    299 To demonstrate how the data in clusters is stored, we shall continue with 
    300 the clustering we got in the first example. :: 
    301      
    302     >>> del root.mapping.objects 
    303     >>> print printClustering(root) 
    304     (((1(26))3)(((57)(89))(04))) 
    305     >>> print root.mapping 
    306     <1, 2, 6, 3, 5, 7, 8, 9, 0, 4> 
    307     >>> print root.left.first 
    308     0 
    309     >>> print root.left.last 
    310     4 
    311     >>> print root.left.mapping[root.left.first:root.left.last] 
    312     <1, 2, 6, 3> 
    313     >>> print root.left.left.first 
    314     0 
    315     >>> print root.left.left.last 
    316     3 
    317      
    318 We removed objects to just to see more clearly what is going on. 
    319 ``mapping`` is an ordered list of indices to the rows/columns of distance 
    320 matrix (and, at the same time, indices into objects, if they exist). Each 
    321 cluster's fields ``first`` and ``last`` are indices into mapping, so the 
    322 clusters elements are actually 
    323 ``cluster.mapping[cluster.first:cluster.last]``. ``cluster[i]`` therefore 
    324 returns ``cluster.mapping[cluster.first+i]`` or, if objects are specified, 
    325 ``cluster.objects[cluster.mapping[cluster.first+i]]``. 
    326195 
    327196Subclusters are ordered so that ``cluster.left.last`` always equals 
     
    329198equals ``cluster.branches[i+1].first``. 
    330199 
    331 Swapping and permutation do three things: change the order of 
     200Swapping and permutation change the order of 
    332201elements in ``branches``, permute the corresponding regions in 
    333202:obj:`~HierarchicalCluster.mapping` and adjust the ``first`` and ``last`` 
    334 for all the clusters below. For the latter, when subclusters of cluster 
    335 are permuted, the entire subtree starting at ``cluster.branches[i]`` 
    336 is moved by the same offset. 
    337  
    338 When building the structure of :obj:`HierarchicalCluster`\s yourself. 
    339 However, it is easy to do things wrong: shuffle the mapping, for 
    340 instance, and forget to adjust the ``first`` and ``last`` pointers. 
     203for all the clusters below. 
     204 
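The bookkeeping described above can be sketched in plain Python. The ``Cluster`` class and ``swap`` function below are illustrative stand-ins, not Orange's classes: all clusters share one ``mapping`` list, each cluster is the slice ``mapping[first:last]``, and swapping two subclusters permutes that region and shifts ``first``/``last`` of every cluster underneath by the same offset.

```python
class Cluster:
    # illustrative stand-in for HierarchicalCluster: a view on a shared
    # `mapping` list, covering the slice mapping[first:last]
    def __init__(self, mapping, first, last, branches=()):
        self.mapping = mapping
        self.first, self.last = first, last
        self.branches = list(branches)

    def elements(self):
        return self.mapping[self.first:self.last]

def shift(cluster, offset):
    # move a whole subtree's first/last indices by the same offset
    cluster.first += offset
    cluster.last += offset
    for b in cluster.branches:
        shift(b, offset)

def swap(cluster):
    # reverse the order of two subclusters: permute the region in
    # `mapping`, then adjust first/last of all clusters below
    left, right = cluster.branches
    left_size = left.last - left.first
    right_size = right.last - right.first
    cluster.mapping[cluster.first:cluster.last] = (
        cluster.mapping[right.first:right.last]
        + cluster.mapping[left.first:left.last])
    shift(left, right_size)    # left moves right by |right|
    shift(right, -left_size)   # right moves to the front by |left|
    cluster.branches = [right, left]
```

After a swap the leaf views still point at the same items, which is exactly why ``first`` and ``last`` of every subtree must move together with the permuted region.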
     205.. rubric:: An example 
     206 
     207The following example constructs a simple distance matrix and runs clustering 
     208on it. 
     209 
     210.. literalinclude:: code/hierarchical-example.py 
     211    :lines: 1-16 
     212     
     213``root`` is the root of the cluster hierarchy. We can print it with a 
     214simple recursive function. 
     215 
     216 
     217.. literalinclude:: code/hierarchical-example.py 
     218    :lines: 18-22 
     219             
     220The clustering looks like 
     221:: 
     222 
     223    >>> print print_clustering(root) 
     224    (((0 4) ((5 7) (8 9))) ((1 (2 6)) 3)) 
     225     
     226The elements form two groups, the first with elements 0, 4, 5, 7, 8, 9, 
     227and the second with 1, 2, 6, 3. The difference between them equals:: 
     228 
     229    >>> print root.height 
     230    9.0 
     231 
     232The first cluster is further divided into 0 and 4 in one, and 5, 7, 8, 
     2339 in the other subcluster. 
     234 
     235The following code prints the left subcluster of root. 
     236:: 
     237 
     238    >>> for el in root.left: 
     239        ... print el, 
     240    0 4 5 7 8 9  
     241     
     242Instead of printing out the first (and supposedly the only) element 
     243of the cluster, ``cluster[0]``, we shall print it out as a tuple. 
     244 
     245.. literalinclude:: code/hierarchical-example.py 
     246    :lines: 24-28 
     247             
     248Object descriptions can be added with 
     249 
     250.. literalinclude:: code/hierarchical-example.py 
     251    :lines: 28-29 
     252     
     253As before, let us print out the elements of the first left cluster:: 
     254 
     255    >>> for el in root.left: 
     256        ... print el, 
     257    Ann Eve Fred Hue Ivy Jon 
     258 
     259Calling ``root.left.swap()`` reverses the order of subclusters of 
     260``root.left``:: 
     261 
     262    >>> print_clustering(root) 
     263    (((Ann Eve) ((Fred Hue) (Ivy Jon))) ((Bob (Curt Greg)) Danny)) 
     264    >>> root.left.swap() 
     265    >>> print_clustering(root) 
     266    ((((Fred Hue) (Ivy Jon)) (Ann Eve)) ((Bob (Curt Greg)) Danny)) 
     267     
     268Let us write a function for cluster pruning. 
     269 
     270.. literalinclude:: code/hierarchical-example.py 
     271    :lines: 42-48 
     272 
     273Here we need a function that can print leaves with multiple elements. 
     274 
     275.. literalinclude:: code/hierarchical-example.py 
     276    :lines: 50-54 
     277 
     278Four clusters remain. :: 
     279 
     280    >>> prune(root, 5) 
     281    >>> print print_clustering2(root) 
     282    (('Bob', 'Curt', 'Greg', 'Danny') ((('Fred', 'Hue') ('Ivy', 'Jon')) ('Ann', 'Eve'))) 
     283     
     284The following function collects the bottom clusters into a list. 
     285 
     286.. literalinclude:: code/hierarchical-example.py 
     287    :lines: 59-69 
     288         
     289The function returns a list of lists, in our case 
     290:: 
     291 
     292    >>> list_of_clusters(root) 
     293    [['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']] 
     294 
     295If :obj:`~HierarchicalCluster.mapping.objects` were not defined the list 
     296would contain indices instead of names. 
     297:: 
     298 
     299    >>> root.mapping.objects = None 
     300    >>> print list_of_clusters(root) 
     301    [[1, 2, 6], [3], [5, 7, 8, 9], [0, 4]] 
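For readers without the literalinclude files at hand, the shape of ``list_of_clusters`` can be sketched over a minimal tree. The ``Node`` class here is a made-up stand-in, not Orange's: each node simply carries all the elements beneath it, and a pruned node has ``branches`` set to ``None``.

```python
class Node:
    # minimal stand-in for a (possibly pruned) cluster tree node
    def __init__(self, elements, height=0.0, branches=None):
        self.elements = list(elements)   # all items under this node
        self.height = height
        self.branches = branches

def list_of_clusters(root):
    # collect the elements of every bottom-most cluster, left to right
    if not root.branches:
        return [root.elements]
    out = []
    for branch in root.branches:
        out.extend(list_of_clusters(branch))
    return out
```

As in the documentation above, pruning turns internal nodes into leaves, so the same traversal returns coarser clusters after a prune.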
    341302 
    342303Output 
     
    345306.. autofunction:: dendrogram_layout 
    346307.. autofunction:: dendrogram_draw 
    347  
    348308 
    349309Utility Functions 
     
    518478    :type progress_callback: function 
    519479     
    520     .. note:: The ordering is done inplace.  
     480    The ordering is done inplace.  
    521481     
    522482    """ 
     
    711671    :type progress_callback: function 
    712672     
    713     .. note:: The ordering is done inplace. 
     673    The ordering is done inplace. 
    714674     
    715675    """ 
     
    978938    """ A class for drawing dendrograms. 
    979939     
    980     .. note:: ``dendrogram_draw`` function is a more convinient interface 
     940    The ``dendrogram_draw`` function is a more convenient interface 
    981941        to the functionality provided by this class. 
    982942         
     
    13091269     
    13101270def pruned(cluster, level=None, height=None, condition=None): 
    1311     """ Return a new pruned clustering instance. 
    1312      
    1313     .. note:: This uses :obj:`clone` to create a copy of the `cluster` 
    1314         instance. 
     1271    """ Return a new pruned clustering instance.     
     1272    It uses :obj:`clone` to create a copy of the `cluster` 
     1273    instance. 
    13151274     
    13161275    :param cluster: Cluster to prune. 
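The behaviour described in this docstring can be sketched as a deep copy followed by an in-place cut. This is an assumption about the mechanism, handling only the ``height`` criterion (``level`` and ``condition`` are omitted), and the ``Node`` class is a made-up stand-in for :obj:`HierarchicalCluster`.

```python
import copy

class Node:
    # made-up stand-in for HierarchicalCluster
    def __init__(self, height=0.0, branches=None):
        self.height = height
        self.branches = branches

def pruned(cluster, height):
    # clone the whole tree, then drop branches of every subcluster
    # joined below `height`; the original tree is left untouched
    root = copy.deepcopy(cluster)
    def cut(c):
        if c.branches:
            if c.height < height:
                c.branches = None
            else:
                for b in c.branches:
                    cut(b)
    cut(root)
    return root
```

The copy-then-cut split keeps the destructive prune reusable while giving callers a non-destructive interface.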
  • docs/reference/rst/code/hierarchical-example-2.py

    r10066 r10106  
    22 
    33iris = Orange.data.Table("iris") 
     4 
    45matrix = Orange.misc.SymMatrix(len(iris)) 
    5 distance = Orange.distance.Euclidean(iris) 
    6 for i1, instance1 in enumerate(iris): 
    7     for i2 in range(i1 + 1, len(iris)): 
    8         matrix[i1, i2] = distance(instance1, iris[i2]) 
     6matrix = Orange.distance.distance_matrix(iris, Orange.distance.Euclidean) 
    97 
    108clustering = Orange.clustering.hierarchical.HierarchicalClustering() 
    119clustering.linkage = clustering.Average 
    12 clustering.overwrite_matrix = 1 
    1310root = clustering(matrix) 
     11 
    1412root.mapping.objects = iris 
    1513 
    16 def prune(cluster, togo): 
    17     if cluster.branches: 
    18         if togo < 0: 
    19             cluster.branches = None 
    20         else: 
    21             for branch in cluster.branches: 
    22                 prune(branch, togo - cluster.height) 
     14topmost = Orange.clustering.hierarchical.top_clusters(root, 4) 
    2315 
    24 def listOfClusters0(cluster, alist): 
    25     if not cluster.branches: 
    26         alist.append(list(cluster)) 
    27     else: 
    28         for branch in cluster.branches: 
    29             listOfClusters0(branch, alist) 
    30  
    31 def listOfClusters(root): 
    32     l = [] 
    33     listOfClusters0(root, l) 
    34     return l 
    35 tables = [Orange.data.Table(cluster) for cluster in listOfClusters(root)] 
    36  
    37 prune(root, 1.4) 
    38 for n, cluster in enumerate(listOfClusters(root)): 
     16for n, cluster in enumerate(topmost): 
    3917    print "\n\n Cluster %i \n" % n 
    4018    for instance in cluster: 
    4119        print instance 
    4220 
    43 for cluster in listOfClusters(root): 
    44     dist = Orange.statistics.distribution.Distribution(iris.domain.class_var, cluster) 
     21for cluster in topmost: 
     22    dist = Orange.statistics.distribution.Distribution(iris.domain.class_var, \ 
     23        [ ex for ex in cluster ]) 
    4524    for e, d in enumerate(dist): 
    4625        print "%s: %3.0f " % (iris.domain.class_var.values[e], d), 
    4726    print 
     27 
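The ``top_clusters(root, 4)`` call introduced above replaces the hand-written pruning. Its effect can be sketched in plain Python: repeatedly split the cluster with the greatest join height until the desired number of clusters remains. This is an illustration of the idea, not Orange's implementation, and the ``Cluster`` class here is a made-up stand-in.

```python
import heapq

class Cluster:
    # made-up stand-in carrying only what the sketch needs
    def __init__(self, height, branches=()):
        self.height = height
        self.branches = list(branches)

def top_clusters(root, k):
    # keep a max-heap of clusters keyed by join height and split the
    # highest one until k clusters remain (assumes k <= number of leaves)
    heap = [(-root.height, 0, root)]
    tiebreak = 1   # makes heap entries comparable when heights tie
    while len(heap) < k:
        _, _, cluster = heapq.heappop(heap)
        for branch in cluster.branches:
            heapq.heappush(heap, (-branch.height, tiebreak, branch))
            tiebreak += 1
    return [cluster for _, _, cluster in heap]
```

Splitting highest joins first is equivalent to cutting the dendrogram at the level that leaves exactly ``k`` subtrees.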
  • docs/reference/rst/code/hierarchical-example.py

    r9906 r10106  
    11import Orange 
     2 
    23m = [[], 
    34     [ 3], 
     
    1011     [13, 9, 14, 15, 7, 8, 4, 6], 
    1112     [12, 10, 11, 15, 2, 5, 7, 3, 1]] 
     13 
    1214matrix = Orange.misc.SymMatrix(m) 
    1315root = Orange.clustering.hierarchical.HierarchicalClustering(matrix, 
    1416        linkage=Orange.clustering.hierarchical.HierarchicalClustering.Average) 
    1517 
    16 def printClustering(cluster): 
     18def print_clustering(cluster): 
    1719    if cluster.branches: 
    18         return "(%s%s)" % (printClustering(cluster.left), printClustering(cluster.right)) 
     20        return "(%s %s)" % (print_clustering(cluster.left), print_clustering(cluster.right)) 
    1921    else: 
    2022        return str(cluster[0]) 
    2123 
    22 def printClustering2(cluster): 
     24print print_clustering(root) 
     25 
     26print root.height 
     27 
     28root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve",  
     29    "Fred", "Greg", "Hue", "Ivy", "Jon"] 
     30 
     31print print_clustering(root) 
     32 
     33for el in root.left: 
     34    print el 
     35 
     36print print_clustering(root) 
     37root.left.swap() 
     38print print_clustering(root) 
     39root.permute([1, 0]) 
     40print print_clustering(root) 
     41 
     42def prune(cluster, h): 
    2343    if cluster.branches: 
    24         return "(%s%s)" % (printClustering2(cluster.left), printClustering2(cluster.right)) 
     44        if cluster.height < h: 
     45            cluster.branches = None 
     46        else: 
     47            for branch in cluster.branches: 
     48                prune(branch, h) 
     49 
     50def print_clustering2(cluster): 
     51    if cluster.branches: 
     52        return "(%s %s)" % (print_clustering2(cluster.left), print_clustering2(cluster.right)) 
    2553    else: 
    2654        return str(tuple(cluster)) 
    2755 
    28 matrix.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", 
    29                   "Fred", "Greg", "Hue", "Ivy", "Jon"] 
     56prune(root, 5) 
     57print print_clustering2(root) 
    3058 
    31 root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"] 
    32      
    33 def prune(cluster, togo): 
    34     if cluster.branches: 
    35         if togo<0: 
    36             cluster.branches = None 
    37         else: 
    38             for branch in cluster.branches: 
    39                 prune(branch, togo-cluster.height) 
    40  
    41 def listOfClusters0(cluster, alist): 
     59def list_of_clusters0(cluster, alist): 
    4260    if not cluster.branches: 
    4361        alist.append(list(cluster)) 
    4462    else: 
    4563        for branch in cluster.branches: 
    46             listOfClusters0(branch, alist) 
    47              
    48 def listOfClusters(root): 
     64            list_of_clusters0(branch, alist) 
     65 
     66def list_of_clusters(root): 
    4967    l = [] 
    50     listOfClusters0(root, l) 
    51     return l        
     68    list_of_clusters0(root, l) 
     69    return l 
     70 
     71print list_of_clusters(root) 
     72 
     73root.mapping.objects = None 
     74print list_of_clusters(root) 
     75 
     76print root.left.last 
     77print root.right.first 