source: orange/orange/doc/widgets/Unsupervised/HierarchicalClustering.htm @ 9399:6bbe263e8bcf

<html>
<head>
<title>Hierarchical Clustering</title>
<link rel="stylesheet" href="../../../style.css" type="text/css" media="screen">
<link rel="stylesheet" href="../../../style-print.css" type="text/css" media="print">
</head>

<body>

<h1>Hierarchical Clustering</h1>

<img class="screenshot" src="../icons/HierarchicalClustering.png">
<p>Groups items using a hierarchical clustering algorithm.</p>

<h2>Channels</h2>

<h3>Inputs</h3>

<dl class="attributes">
<dt>Distance Matrix</dt>
<dd>A matrix of distances between items being clustered</dd>
</dl>

<h3>Outputs</h3>

<dl class="attributes">
<dt>Selected Examples</dt>
<dd>A list of selected examples; applicable only when the input matrix refers to distances between examples</dd>

<dt>Structured Data Files</dt>
<dd>???</dd>
</dl>

<h2>Description</h2>

<p>The widget computes hierarchical clustering of arbitrary types of objects from the matrix of distances between them and shows the corresponding dendrogram. If the distances apply to examples, the widget offers some special functionality (adding cluster indices, outputting the examples...).</p>

<img class="screenshot" src="HierarchicalClustering.png" border=0 />

<p>The widget supports three kinds of linkage. In <span class="option">Single linkage</span> clustering, the distance between two clusters is defined as the distance between their closest elements. <span class="option">Average linkage</span> clustering computes the average distance between the elements of the two clusters, and <span class="option">Complete linkage</span> defines the distance between two clusters as the distance between their most distant elements.</p>

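<p>Outside the widget, the three linkages can be illustrated with SciPy. The following is a minimal sketch, not the widget's internal code, assuming SciPy and NumPy are available:</p>

<pre>
# Minimal sketch of the three linkage criteria on random data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.random.rand(10, 4)   # ten items described by four features
d = pdist(X)                # condensed matrix of pairwise distances

Z_single   = linkage(d, method="single")    # distance between closest elements
Z_average  = linkage(d, method="average")   # average pairwise distance
Z_complete = linkage(d, method="complete")  # distance between most distant elements
</pre>
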
<p>Nodes of the dendrogram can be labeled. What the labels are depends on the items being clustered. For instance, when clustering attributes, the labels are obviously the attribute names. When clustering examples, we can use the values of one of the attributes, typically one that gives the name or id of an instance, as labels. The label can be chosen in the box <span class="option">Annotate</span>, which also allows setting the font size and line spacing.</p>

<p>Huge dendrograms can be pruned by checking <span class="option">Limit print depth</span> and selecting the appropriate depth. This only affects the displayed dendrogram and not the actual clustering.</p>

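<p>Both display options have close analogues in SciPy's dendrogram plotting. A minimal sketch, assuming matplotlib and the <code>Z_single</code> linkage matrix from the previous snippet:</p>

<pre>
# Sketch: labeled and depth-limited dendrogram display (display only;
# the underlying clustering is unchanged by the truncation).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

names = ["item %d" % i for i in range(10)]            # hypothetical item labels

dendrogram(Z_single, labels=names, leaf_font_size=8)  # like Annotate
plt.show()

dendrogram(Z_single, truncate_mode="level", p=2)      # like Limit print depth
plt.show()
</pre>
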
<p>Clicking inside the dendrogram can have two effects. If the cutoff line is not shown (<span class="option">Show cutoff line</span> is unchecked), clicking inside the dendrogram selects a cluster. Multiple clusters can be selected by holding Ctrl. Each selected cluster is shown in a different color and is treated as a separate cluster on the output.</p>

<p>If <span class="option">Show cutoff line</span> is checked, clicking in the dendrogram places a cutoff line. All items in the clustering are selected and they are divided into groups according to the position of the line.</p>

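<p>Cutting the dendrogram at a given height corresponds to SciPy's <code>fcluster</code> with the distance criterion. A short sketch, reusing <code>Z_single</code> from above with a hypothetical cutoff height:</p>

<pre>
# Sketch: divide all items into groups by cutting the dendrogram at a height.
from scipy.cluster.hierarchy import fcluster

cutoff = 0.7                                       # hypothetical cutoff height
groups = fcluster(Z_single, t=cutoff, criterion="distance")
print(groups)                                      # one cluster index per item
</pre>
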
<p>If the items being clustered are examples, a cluster index can be appended to them (<span class="option">Append cluster indices</span>). The index can appear as a <span class="option">Class attribute</span>, an ordinary <span class="option">Attribute</span> or a <span class="option">Meta attribute</span>. In the first case, if the data already has a class attribute, the original class is placed among the meta attributes.</p>

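<p>In a script, the meta-attribute variant can be reproduced roughly as follows. This is only a sketch assuming the Orange 2.x scripting API (<code>orange.EnumVariable</code>, <code>orange.newmetaid</code>, <code>Domain.addmeta</code>); the widget may do this differently internally:</p>

<pre>
# Sketch: appending cluster indices to examples as a meta attribute,
# assuming the Orange 2.x scripting API.
import orange

data = orange.ExampleTable("zoo")
groups = [i % 3 for i in range(len(data))]   # placeholder indices; in practice
                                             # these come from the clustering
values = [str(g) for g in sorted(set(groups))]
cluster_var = orange.EnumVariable("Cluster", values=values)

mid = orange.newmetaid()                     # register the new meta attribute
data.domain.addmeta(mid, cluster_var)
for example, g in zip(data, groups):
    example[cluster_var] = str(g)
</pre>
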
<p>The data can be output on any change (<span class="option">Commit on change</span>) or, if this is disabled, by pushing <span class="option">Commit</span>.</p>

<p>Clustering has two parameters that can be set by the user: the number of clusters and the distance metric, <span class="option">Euclidean distance</span> or <span class="option">Manhattan</span>. Any changes must be confirmed by pushing <span class="option">Apply</span>.</p>

<p>The table on the right-hand side shows the results of clustering. For each cluster it gives the number of examples, its fitness and BIC.</p>

<p>Fitness measures how well the cluster is defined. Let <em>d<sub>i,C</sub></em> be the average distance between point <em>i</em> and the points in cluster <em>C</em>. Now, let <em>a<sub>i</sub></em> equal <em>d<sub>i,C'</sub></em>, where <em>C'</em> is the cluster <em>i</em> belongs to, and let <em>b<sub>i</sub></em> = min <em>d<sub>i,C</sub></em> over all other clusters <em>C</em>. Fitness is then defined as the average silhouette of the cluster <em>C</em>, that is, avg( (<em>b<sub>i</sub></em> - <em>a<sub>i</sub></em>) / max(<em>b<sub>i</sub></em>, <em>a<sub>i</sub></em>) ).</p>

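<p>The definition translates directly into code. Below is a minimal NumPy sketch, assuming the points are the rows of <code>X</code>, <code>labels</code> holds one cluster index per point, and every cluster has at least two members:</p>

<pre>
# Sketch: fitness (average silhouette) of one cluster, per the definition above.
import numpy as np

def cluster_fitness(X, labels, cluster):
    # full matrix of pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    members = np.flatnonzero(labels == cluster)
    others = [c for c in np.unique(labels) if c != cluster]
    silhouettes = []
    for i in members:
        own = members[members != i]
        a_i = dist[i, own].mean()          # average distance within own cluster
        b_i = min(dist[i, labels == c].mean() for c in others)  # nearest other
        silhouettes.append((b_i - a_i) / max(a_i, b_i))
    return np.mean(silhouettes)
</pre>
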
<p>Put simply, a fitness close to 1 signifies a well-defined cluster.</p>

<p>BIC is short for Bayesian Information Criterion and is computed as ln <em>L</em> - <em>k</em>(<em>d</em>+1)/2 ln <em>n</em>, where <em>k</em> is the number of clusters, <em>d</em> is the dimension of the data (the number of attributes) and <em>n</em> is the number of examples (data instances). <em>L</em> is the likelihood of the model, assuming spherical Gaussian distributions around the centroids of the clusters.</p>

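<p>One plausible reading of this formula is sketched below, assuming spherical Gaussians with a shared variance and mixing weights proportional to cluster sizes; the widget's exact likelihood may differ in details:</p>

<pre>
# Sketch: BIC = ln L - k(d+1)/2 ln n under spherical Gaussian clusters.
import numpy as np

def bic(X, labels):
    n, d = X.shape
    clusters = np.unique(labels)
    k = len(clusters)
    centroids = {c: X[labels == c].mean(axis=0) for c in clusters}
    # shared spherical variance estimate
    sq = sum(((X[labels == c] - centroids[c]) ** 2).sum() for c in clusters)
    sigma2 = sq / (n * d)
    log_l = 0.0
    for c in clusters:
        n_c = (labels == c).sum()
        diff = X[labels == c] - centroids[c]
        log_l += n_c * np.log(n_c / float(n))                 # mixing weights
        log_l += -0.5 * n_c * d * np.log(2 * np.pi * sigma2)  # normalization
        log_l += -(diff ** 2).sum() / (2 * sigma2)            # exponent
    return log_l - k * (d + 1) / 2.0 * np.log(n)
</pre>
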
<h2>Examples</h2>

<p>The schema below computes clustering of attributes and of examples.</p>

<img class="screenshot" src="HierarchicalClustering-Schema.png"/>

<p>We loaded the Zoo data set. The clustering of attributes is already shown above. Below is the clustering of examples, that is, of animals, with the nodes annotated by the animals' names. We connected the <a href="../Visualize/LinearProjection.htm">Linear Projection widget</a>, showing the FreeViz-optimized projection of the data, so that it displays all examples read from the file, while the signal from Hierarchical Clustering is used as a subset. Linear Projection thus marks the examples selected in Hierarchical Clustering, so we can observe the position of the selected cluster(s) in the projection.</p>

<img class="screenshot" src="HierarchicalClustering-Example.png"/>

<p>To (visually) test how well the clustering corresponds to the actual classes in the data, we can tell the widget to show the class ("type") of each animal instead of its name (<span class="option">Annotate</span>). The correspondence looks good.</p>

<img class="screenshot" src="HierarchicalClustering-Example2.png"/>

<p>A fancy way to verify the correspondence between the clustering and the actual classes would be to compute the chi-square test between them. As Orange does not have a dedicated widget for that, we can compute the chi-square in <a href="AttributeDistance.htm">Attribute Distance</a> and observe it in <a href="DistanceMap.htm">Distance Map</a>. The only caveat is that Attribute Distance computes distances between attributes, not between the class and an attribute, so we have to use <a href="../Data/SelectAttributes.htm">Select Attributes</a> to put the class among the ordinary attributes and replace it with another attribute, say "tail". (This is needed since Attribute Distance requires data with a class attribute, for technical reasons; the class attribute itself does not affect the computed chi-square.)</p>

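<p>In a script, the same check can be done directly. A minimal sketch, assuming SciPy and parallel sequences of class labels and cluster indices (the toy values below are placeholders):</p>

<pre>
# Sketch: chi-square test between cluster indices and class labels.
import numpy as np
from scipy.stats import chi2_contingency

classes = np.array(["mammal", "bird", "mammal", "fish", "bird", "fish"])
clusters = np.array([1, 2, 1, 3, 2, 3])      # hypothetical cluster indices

# Contingency table: rows are classes, columns are clusters.
class_values = np.unique(classes)
cluster_values = np.unique(clusters)
table = np.array([[np.sum((classes == cv) & (clusters == gv))
                   for gv in cluster_values] for cv in class_values])

chi2, p, dof, expected = chi2_contingency(table)
print("chi-square = %.2f, p = %.4f" % (chi2, p))
</pre>
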
<p>A more direct approach is to leave the class attribute (the animal type) as it is, simply add the cluster index and observe its information gain in the <a href="../Data/Rank.htm">Rank widget</a>.</p>

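<p>Information gain of the cluster index with respect to the class is simply their mutual information, so the score the Rank widget reports can be approximated outside Orange in a few lines. A sketch assuming scikit-learn, with placeholder labels:</p>

<pre>
# Sketch: information gain of the cluster index with respect to the class,
# i.e. their mutual information, assuming scikit-learn is available.
from sklearn.metrics import mutual_info_score

classes = ["mammal", "bird", "mammal", "fish", "bird", "fish"]
clusters = [1, 2, 1, 3, 2, 3]                # hypothetical cluster indices

print("information gain: %.3f nats" % mutual_info_score(classes, clusters))
</pre>
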
<p>More tricks with a similar purpose are described in the documentation for <a href="K-MeansClustering.htm">K-Means Clustering</a>.</p>

<p>The schema that does both and the corresponding settings of the Hierarchical Clustering widget are shown below.</p>

<p><img class="screenshot" src="HierarchicalClustering-Schema2.png" /></p>

<p><img class="screenshot" src="HierarchicalClustering-Example3.png" /></p>

</body>
</html>