Changeset 5797:2e5c44717e17 in orange
- Timestamp: 03/02/09 14:40:15 (4 years ago)
- Branch: default
- Convert: eb722f0fe0031cb6455cf3e30e64069b85fccb2c
- File: 1 edited
orange/doc/modules/orngClustering.htm (modified) (3 diffs)
orange/doc/modules/orngClustering.htm
r5788 r5797
146 146 <dt>score_distanceToCentroids(kmeans)</dt>
147 147 <dd class="ddfun">Returns an average distance of data instances to their associated centroids. <code>kmeans</code> is a k-means clustering object.</dd>
    148 </dl>
148 149
149 150 <p>Typically, the choice of seeds has a large impact on the k-means clustering, with better initialization methods yielding a clustering that converges faster and finds more optimal centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge:</p>
…   …
152 153 "iris.tab">iris.tab</a>, <a href=
153 154 "housing.tab">housing.tab</a>, <a href=
154     "vehicle.tab">vehicle.tab</a> ,)</p>
    155 "vehicle.tab">vehicle.tab</a>)</p>
155 156 <xmp class=code>import orange
156 157 import orngClustering
…   …
179 180 </xmp>
180 181
    182 <h2>Hierarchical Clustering</h2>
    183
    184 <dl class="attributes">
    185 <dt>hierarchicalClustering(data, distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None)</dt>
    186 <dd class="ddfun">Returns an object with information of a hierarchical clustering of a given data set. This is a wrapper function around <a href="../reference/Clustering.htm">HierarchicalClustering</a> class implemented in Orange. <code>hierarchicalClustering</code> uses <code>distanceConstructor</code> function (see <a href="../reference/ExamplesDistance.htm">Distances between example</a>) to construct a distance matrix, which is then passed to Orange's hierarchical clustering algorithm, along with a particular linkage method. Ordering of leaves can be requested (<code>order=True</code>) and if so, the leaves will be ordered using <code>orderLeaves</code> function (see below).</dd>
    187
    188 <dt>orderLeaves(root, distanceMatrix)</dt>
    189 <dd class="ddfun">Given the object with hierarchical clustering (a root node of the tree) and a distance matrix, function uses a fast optimal leaf ordering by Bar-Joseph et al. to impose the order of the branches in the dendrogram so that the distance between the neighboring leaves is minimized.</dd>
    190
    191 <dt>orderLeaves(root, distanceMatrix)</dt>
    192 <dd class="ddfun">Given the object with hierarchical clustering (a root node of the tree) and a distance matrix, function uses a fast optimal leaf ordering by Bar-Joseph et al. to impose the order of the branches in the dendrogram so that the distance between the neighboring leaves is minimized.</dd>
    193
    194 <dt>hierarchicalClustering_topClusters(root, k)</dt>
    195 <dd class="ddfun">Returns k topmost clusters (top k nodes of the clustering tree) from hierarchical clustering.</dd>
    196
    197 <dt>hierarhicalClustering_topClustersMembership(root, k)</dt>
    198 <dd class="ddfun">Returns a list with indexes which indicate the membership of data instances used to create the clustering to top k clusters.</dd>
    199 </dl>
    200
    201
    202 <p>Using <code>hierarchicalClustering</code>, scripts need a single line of code to invoke the clustering and get the object with a result. This is demonstrated in the following script, that considers the Iris data set, performs hierarchical clustering, and then plots the data in two-attribute projection, coloring the points representing data instances according to cluster membership.</p>
    203
    204 <p class="header">part of <a href="hclust-iris.py">hclust-iris.py</a> (uses <a href=
    205 "iris.tab">iris.tab</a></p>
    206 <xmp class=code>def plot_scatter(data, cls, attx, atty, filename="hclust-scatter", title=None):
    207     """plot a data scatter plot with the position of centeroids"""
    208     pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4,3]})
    209     x = [float(d[attx]) for d in data]
    210     y = [float(d[atty]) for d in data]
    211     colors = ["c", "w", "b"]
    212     cs = "".join([colors[c] for c in cls])
    213     pylab.scatter(x, y, c=cs, s=10)
    214
    215     pylab.xlabel(attx)
    216     pylab.ylabel(atty)
    217     if title:
    218         pylab.title(title)
    219     pylab.savefig("%s.png" % filename)
    220     pylab.close()
    221
    222 data = orange.ExampleTable("iris")
    223 root = orngClustering.hierarchicalClustering(data)
    224 n = 3
    225 cls = orngClustering.hierarhicalClustering_topClustersMembership(root, n)
    226 plot_scatter(data, cls, "sepal width", "sepal length", title="Hiearchical clustering (%d clusters)" % n)
    227 </xmp>
    228
    229 <p>The output of the script is a following plot:</p>
    230
    231 <img src="hclust-scatter.png">
    232
    233
    234 <h2>DendrogramPlot</h2>
    235 <p>Class <code>DendrogramPlot</code> implements visualization of the clustering tree (called dendrogram) and corresponding visualization of attribute heatmap.
    236
    237 <p class=section>Methods</p>
    238 <dl class=attributes>
    239 <dt>__init__(tree, data=None, labels=None, width=None, height=None, treeAreaWidth=None, textAreaWidth=None, matrixAreaWidth=None, fontSize=None, lineWidth=2, clusterColors={})</DT>
    240 <dd><code>tree</code> is an Orange's hierarhical clustering tree object (root node). If <code>data</code> (Orange's ExampleTable) is given than the dendrogram will include a heat map with color-based presentation of attribute's values. The length of the data set should match the number of leaves in the hierarchical clustering tree. Following are arguments that define the height and width of the plot areas, font size for the labels, and width of lines that indicate branches of the tree. Branches are plotted in black, but may be color to visually expose various clusters. The coloring is specified with <code>clusterColors</code>, a dictionary with cluster instances as keys and (r, g, b) tuples as items.</dd>
    241
    242 <dt>plot(filename="graph.png")</dt>
    243 <dd>Plots the dendrogram and save is to the output file.</dd>
    244 </dl>
    245
    246
    247 <p>To illustrate the use of the dendrogram plotting class, the following scripts uses it on a subset of 20 instances from the Iris data set. Values of the class variables is used for labeling the leaves (and, of course, it is not used for the clustering - only the non-class attributes are used to compute instance distance matrix).</p>
    248
    249 <p class="header">part of <a href="hclust-dendrogram.py">hclust-dendrogram.py</a> (uses <a href=
    250 "iris.tab">iris.tab</a></p>
    251 <xmp class=code>data = orange.ExampleTable("iris")
    252 sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0)
    253 root = orngClustering.hierarchicalClustering(sample)
    254 dendrogram = orngClustering.DendrogramPlot(root, sample, labels=[str(d.getclass()) for d in sample])
    255 dendrogram.plot("hclust-dendrogram.png")
    256 </xmp>
    257
    258 <p>The resulting dendrogram is shown below.</p>
    259
    260 <img src="hclust-dendrogram.png">
    261
181 262 <h2>References</h2>
182 263
183     <p>E. Forgy. Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics, 21(3):768-769, 1965.</p>
184
185     <p>J. He, M. Lan, C.-L. Tan, S.-Y. Sung, and H.-B. Low. Initialization of cluster refinement algorithms: A review and comparative study. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary, July 2004.</p>
186
187     <p>I. Katsavounidis, C. Jay, and Zhen Zhang. A new initialization technique for generalized Lloyd iteration. Signal Processing Letters, IEEE, 1(10):144-146, 1994.</p>
188
    264 <p>Forgy E (1965) Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics 21(3): 768-769.</p>
    265
    266 <p>He J, Lan M, Tan C-L , Sung S-Y, Low H-B (2004) <a href="http://www.comp.nus.edu.sg/~tancl/Papers/IJCNN04/he04ijcnn.pdf">Initialization of cluster refinement algorithms: A review and comparative study</a>. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary.</p>
    267
    268 <p>Katsavounidis I, Jay C, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10): 144-146.</p>
    269
    270 <p>Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) <a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S22">Fast optimal leaf ordering for herarchical clustering</a>. Bioinformatics 17(Suppl. 1): S22-S29.
189 271 </body>
190 272 </html>
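
The documentation added in this changeset describes `hierarchicalClustering_topClusters(root, k)` as returning the k topmost nodes of the clustering tree, and `hierarhicalClustering_topClustersMembership(root, k)` as returning a cluster index for every data instance. The diff does not show how either is implemented. The following is a minimal sketch of one way such a selection could work, assuming a binary tree whose inner nodes record a merge height. The `Node` class, the function names `top_clusters` and `top_clusters_membership`, and the split-the-highest-node rule are illustrative assumptions, not Orange's code.

```python
class Node(object):
    """Hypothetical stand-in for a clustering tree node (not Orange's class)."""

    def __init__(self, height=0.0, left=None, right=None, items=()):
        self.height = height  # distance at which the two children were merged
        self.left = left
        self.right = right
        # A leaf carries its own item indexes; an inner node collects its children's.
        self.items = list(items) if left is None else left.items + right.items

    def is_leaf(self):
        return self.left is None


def top_clusters(root, k):
    """Split the highest remaining node until k clusters exist or only leaves remain."""
    clusters = [root]
    while len(clusters) < k:
        splittable = [c for c in clusters if not c.is_leaf()]
        if not splittable:
            break
        highest = max(splittable, key=lambda c: c.height)
        clusters.remove(highest)
        clusters.extend([highest.left, highest.right])
    return clusters


def top_clusters_membership(root, k, n_items):
    """Return a list that maps each item index to the index of its cluster."""
    membership = [None] * n_items
    for cluster_index, cluster in enumerate(top_clusters(root, k)):
        for item in cluster.items:
            membership[item] = cluster_index
    return membership


# Toy tree over five items: {0, 1} merge at height 1, {3, 4} at height 2,
# item 2 joins {3, 4} at height 3, and everything merges at height 5.
leaves = [Node(items=[i]) for i in range(5)]
a = Node(1.0, leaves[0], leaves[1])
b = Node(2.0, leaves[3], leaves[4])
c = Node(3.0, leaves[2], b)
root = Node(5.0, a, c)

membership = top_clusters_membership(root, 3, 5)
print(membership)
```

A membership list of this shape is what the `plot_scatter` helper in the diff indexes into its `colors` list, one entry per data instance.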
