Changeset 5797:2e5c44717e17 in orange


Timestamp:
03/02/09 14:40:15 (4 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Convert:
eb722f0fe0031cb6455cf3e30e64069b85fccb2c
Message:

added documentation on hierarchical clustering

File:
1 edited

  • orange/doc/modules/orngClustering.htm

    r5788 r5797  
    146146<dt>score_distanceToCentroids(kmeans)</dt> 
    147147<dd class="ddfun">Returns the average distance of data instances to their associated centroids. <code>kmeans</code> is a k-means clustering object.</dd> 
     148</dl> 
    148149 
    149150<p>Typically, the choice of seeds has a large impact on the k-means clustering, with better initialization methods yielding a clustering that converges faster and finds more optimal centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge:</p> 
     
    152153"iris.tab">iris.tab</a>, <a href= 
    153154"housing.tab">housing.tab</a>, <a href= 
    154 "vehicle.tab">vehicle.tab</a>,)</p> 
     155"vehicle.tab">vehicle.tab</a>)</p> 
    155156<xmp class=code>import orange 
    156157import orngClustering 
     
    179180</xmp> 
    180181 
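<p>To make the effect of seeding concrete, here is a minimal, self-contained sketch that does not use Orange. It compares random seeding with a simple diversity-based (farthest-point) seeding on synthetic two-dimensional data and reports how many assignment passes plain k-means needs. The helper names (<code>random_seeds</code>, <code>diversity_seeds</code>, <code>kmeans</code>) and the data are assumptions made for illustration; they are not the <code>orngClustering</code> implementation, and hierarchical clustering-based seeding is left out.</p>

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_seeds(points, k, rng):
    """Hypothetical helper: pick k distinct data points at random."""
    return rng.sample(points, k)


def diversity_seeds(points, k, rng):
    """Hypothetical helper: farthest-point seeding.

    Start from a random point, then repeatedly add the point whose
    distance to its nearest already chosen seed is largest.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, seeds, max_iter=100):
    """Plain k-means; returns (centroids, assignment, number of assignment passes)."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for iteration in range(1, max_iter + 1):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            return centroids, assignment, iteration  # nothing changed: converged
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment, max_iter


# Synthetic data (an assumption): three well separated groups of 30 points each.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
          for cx, cy in centers for _ in range(30)]

for name, init in [("random", random_seeds), ("diversity", diversity_seeds)]:
    _, _, passes = kmeans(points, init(points, 3, rng))
    print(name, "seeding converged after", passes, "assignment passes")
```

<p>On data this well separated, farthest-point seeding places one seed in each group, so the first assignment pass is already final. Random seeding may put two seeds in the same group and then needs additional passes, or settles on a worse partition.</p>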
     182<h2>Hierarchical Clustering</h2> 
     183 
     184<dl class="attributes"> 
     185<dt>hierarchicalClustering(data, distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None)</dt> 
     186<dd class="ddfun">Returns an object describing the hierarchical clustering of the given data set. This is a wrapper around the <a href="../reference/Clustering.htm">HierarchicalClustering</a> class implemented in Orange. <code>hierarchicalClustering</code> uses the <code>distanceConstructor</code> function (see <a href="../reference/ExamplesDistance.htm">Distances between examples</a>) to construct a distance matrix, which is then passed to Orange's hierarchical clustering algorithm along with the chosen linkage method. If leaf ordering is requested (<code>order=True</code>), the leaves are ordered using the <code>orderLeaves</code> function (see below).</dd> 
     187 
     188<dt>orderLeaves(root, distanceMatrix)</dt> 
     189<dd class="ddfun">Given the root node of a hierarchical clustering tree and a distance matrix, the function uses the fast optimal leaf ordering algorithm by Bar-Joseph et al. to order the branches of the dendrogram so that the distance between neighboring leaves is minimized.</dd> 
     190 
     194<dt>hierarchicalClustering_topClusters(root, k)</dt> 
     195<dd class="ddfun">Returns the k topmost clusters (the top k nodes of the clustering tree) of a hierarchical clustering.</dd> 
     196 
     197<dt>hierarhicalClustering_topClustersMembership(root, k)</dt> 
     198<dd class="ddfun">Returns a list of cluster indexes, one per data instance used to build the clustering, indicating which of the top k clusters each instance belongs to.</dd> 
     199</dl> 
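<p>The goal of leaf ordering can be illustrated with a small brute-force sketch that does not use Orange. It assumes a tree written as nested pairs with integer leaves that index into a distance matrix, tries every order reachable by flipping branches, and keeps the one with the smallest sum of distances between neighboring leaves. The names <code>orderings</code>, <code>path_length</code> and <code>best_leaf_order</code> are hypothetical. This is not the algorithm of Bar-Joseph et al., which avoids the exhaustive enumeration used here.</p>

```python
def orderings(tree):
    """Yield every leaf order obtainable by flipping branches of the tree.

    A tree is either an int (a leaf) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, int):
        yield [tree]
        return
    left, right = tree
    for l in orderings(left):
        for r in orderings(right):
            yield l + r  # keep the branches as they are
            yield r + l  # flip the two branches


def path_length(order, dist):
    """Sum of distances between neighboring leaves in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def best_leaf_order(tree, dist):
    """Brute force: the number of candidate orders grows exponentially."""
    return min(orderings(tree), key=lambda order: path_length(order, dist))


# Four instances on a line (an assumption made for illustration).
positions = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in positions] for p in positions]

tree = ((1, 0), (3, 2))  # same clusters as the data, branches in arbitrary order
order = best_leaf_order(tree, dist)
print(order, path_length(order, dist))
```

<p>Flipping branches never changes the clustering itself, only the left-to-right arrangement of the leaves in the dendrogram.</p>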
     200 
     201 
     202<p>With <code>hierarchicalClustering</code>, a script needs a single line of code to run the clustering and obtain the resulting object. The following script demonstrates this: it loads the Iris data set, performs hierarchical clustering, and plots the data in a two-attribute projection, coloring each data instance according to its cluster membership.</p> 
     203 
     204<p class="header">part of <a href="hclust-iris.py">hclust-iris.py</a> (uses <a href= 
     205"iris.tab">iris.tab</a>)</p> 
     206<xmp class=code>def plot_scatter(data, cls, attx, atty, filename="hclust-scatter", title=None): 
     207    """Plot a scatter plot of the data, colored by cluster membership.""" 
     208    pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4,3]}) 
     209    x = [float(d[attx]) for d in data] 
     210    y = [float(d[atty]) for d in data] 
     211    colors = ["c", "w", "b"] 
     212    cs = "".join([colors[c] for c in cls]) 
     213    pylab.scatter(x, y, c=cs, s=10) 
     214     
     215    pylab.xlabel(attx) 
     216    pylab.ylabel(atty) 
     217    if title: 
     218        pylab.title(title) 
     219    pylab.savefig("%s.png" % filename) 
     220    pylab.close() 
     221 
     222data = orange.ExampleTable("iris") 
     223root = orngClustering.hierarchicalClustering(data) 
     224n = 3 
     225cls = orngClustering.hierarhicalClustering_topClustersMembership(root, n) 
     226plot_scatter(data, cls, "sepal width", "sepal length", title="Hierarchical clustering (%d clusters)" % n) 
     227</xmp> 
     228 
     229<p>The script produces the following plot:</p> 
     230 
     231<img src="hclust-scatter.png"> 
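<p>What the membership list in the script above contains can be sketched without Orange. The code below builds an average-linkage tree from a distance matrix, cuts it into the top k clusters by repeatedly splitting the cluster that was merged at the greatest height, and returns one cluster index per instance. The class <code>Node</code> and the functions <code>average_linkage</code>, <code>top_clusters</code> and <code>top_clusters_membership</code> are hypothetical stand-ins; whether Orange selects the top clusters in exactly this way is an assumption.</p>

```python
class Node:
    """Hypothetical tree node: a leaf, or a merge of two subtrees at some height."""

    def __init__(self, leaves, height=0.0, left=None, right=None):
        self.leaves = leaves    # indexes of the instances under this node
        self.height = height    # average distance at which the branches merged
        self.left = left
        self.right = right


def average_linkage(dist):
    """Naive agglomerative clustering with average linkage; returns the root node."""
    clusters = [Node([i]) for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                d = sum(dist[x][y] for x in a.leaves for y in b.leaves)
                d /= float(len(a.leaves) * len(b.leaves))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = Node(clusters[i].leaves + clusters[j].leaves, d, clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0]


def top_clusters(root, k):
    """Split the highest internal node until k clusters remain (or only leaves)."""
    clusters = [root]
    while len(clusters) < k:
        internal = [c for c in clusters if c.left is not None]
        if not internal:
            break
        highest = max(internal, key=lambda c: c.height)
        clusters.remove(highest)
        clusters += [highest.left, highest.right]
    return clusters


def top_clusters_membership(root, k):
    """One cluster index per instance, in the order of the original data."""
    membership = [None] * len(root.leaves)
    for index, cluster in enumerate(top_clusters(root, k)):
        for leaf in cluster.leaves:
            membership[leaf] = index
    return membership


# Six one-dimensional instances forming three visible groups (illustrative data).
values = [0.0, 0.2, 0.1, 5.0, 5.1, 9.0]
dist = [[abs(p - q) for q in values] for p in values]
root = average_linkage(dist)
membership = top_clusters_membership(root, 3)
print(membership)
```

<p>Because the list follows the order of the data, it can be used directly to pick a color for each instance, as <code>plot_scatter</code> does.</p>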
     232 
     233 
     234<h2>DendrogramPlot</h2> 
     235<p>Class <code>DendrogramPlot</code> visualizes the clustering tree (a dendrogram), optionally together with a heat map of attribute values.</p> 
     236 
     237<p class=section>Methods</p> 
     238<dl class=attributes> 
     239<dt>__init__(tree, data=None, labels=None, width=None, height=None, treeAreaWidth=None, textAreaWidth=None, matrixAreaWidth=None, fontSize=None, lineWidth=2, clusterColors={})</dt> 
     240<dd><code>tree</code> is the root node of an Orange hierarchical clustering tree. If <code>data</code> (an Orange ExampleTable) is given, then the dendrogram will include a heat map with a color-based presentation of attribute values. The length of the data set should match the number of leaves in the hierarchical clustering tree. The remaining arguments define the height and width of the plot areas, the font size for the labels, and the width of the lines that indicate branches of the tree. Branches are plotted in black, but may be colored to visually expose various clusters. The coloring is specified with <code>clusterColors</code>, a dictionary with cluster instances as keys and (r, g, b) tuples as values.</dd> 
     241 
     242<dt>plot(filename="graph.png")</dt> 
     243<dd>Plots the dendrogram and saves it to the output file.</dd> 
     244</dl> 
     245 
     246 
     247<p>To illustrate the use of the dendrogram plotting class, the following script uses it on a subset of 20 instances from the Iris data set. The values of the class variable are used to label the leaves (and, of course, are not used for the clustering: only the non-class attributes are used to compute the instance distance matrix).</p> 
     248 
     249<p class="header">part of <a href="hclust-dendrogram.py">hclust-dendrogram.py</a> (uses <a href= 
     250"iris.tab">iris.tab</a>)</p> 
     251<xmp class=code>data = orange.ExampleTable("iris") 
     252sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0) 
     253root = orngClustering.hierarchicalClustering(sample) 
     254dendrogram = orngClustering.DendrogramPlot(root, sample, labels=[str(d.getclass()) for d in sample]) 
     255dendrogram.plot("hclust-dendrogram.png") 
     256</xmp> 
     257 
     258<p>The resulting dendrogram is shown below.</p> 
     259 
     260<img src="hclust-dendrogram.png"> 
     261 
    181262<h2>References</h2> 
    182263 
    183 <p>E. Forgy. Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics, 21(3):768-769, 1965.</p> 
    184  
    185 <p>J. He, M. Lan, C.-L. Tan, S.-Y. Sung, and H.-B. Low. Initialization of cluster refinement algorithms: A review and comparative study. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary, July 2004.</p> 
    186  
    187 <p>I. Katsavounidis, C. Jay, and Zhen Zhang. A new initialization technique for generalized Lloyd iteration. Signal Processing Letters, IEEE, 1(10):144-146, 1994.</p> 
    188  
     264<p>Forgy E (1965) Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics 21(3): 768-769.</p> 
     265 
     266<p>He J, Lan M, Tan C-L, Sung S-Y, Low H-B (2004) <a href="http://www.comp.nus.edu.sg/~tancl/Papers/IJCNN04/he04ijcnn.pdf">Initialization of cluster refinement algorithms: A review and comparative study</a>. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary.</p> 
     267 
     268<p>Katsavounidis I, Jay C, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10): 144-146.</p> 
     269 
     270<p>Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) <a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S22">Fast optimal leaf ordering for hierarchical clustering</a>. Bioinformatics 17(Suppl. 1): S22-S29.</p> 
    189271</body> 
    190272</html> 