source: orange/docs/widgets/rst/unsupervized/hierarchicalclustering.rst @ 11359:8d54e79aa135

.. _Hierarchical Clustering:

Hierarchical Clustering
=======================

.. image:: ../icons/HierarchicalClustering.png

Groups items using a hierarchical clustering algorithm.

Signals
-------

Inputs:
   - Distance Matrix
      A matrix of distances between items being clustered


Outputs:
   - Selected Examples
      A list of selected examples; applicable only when the input matrix
      refers to distances between examples
   - Remaining Examples
      A list of unselected examples
   - Centroids
      A list of cluster centroids

Description
-----------

The widget computes hierarchical clustering of arbitrary types of objects from
the matrix of distances between them and shows the corresponding dendrogram. If
the distances apply to examples, the widget offers some special functionality
(adding cluster indices, outputting examples...).

.. image:: images/HierarchicalClustering.png
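
The same kind of computation can be sketched outside the widget. The snippet
below is only an illustration with SciPy, not the widget's internal code; the
distance matrix and the item labels are made up for the example:

.. code-block:: python

   import numpy as np
   from scipy.cluster.hierarchy import linkage, dendrogram
   from scipy.spatial.distance import squareform
   import matplotlib.pyplot as plt

   # a symmetric matrix of pairwise distances between four items
   dist_matrix = np.array([[0.0, 0.5, 2.0, 2.2],
                           [0.5, 0.0, 1.8, 2.1],
                           [2.0, 1.8, 0.0, 0.4],
                           [2.2, 2.1, 0.4, 0.0]])

   # hierarchical clustering from the (condensed) distance matrix
   Z = linkage(squareform(dist_matrix), method="average")

   # draw the corresponding dendrogram, labeling the leaves
   dendrogram(Z, labels=["a", "b", "c", "d"])
   plt.show()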

The widget supports three kinds of linkage. In :obj:`Single linkage`
clustering, the distance between two clusters is defined as the distance
between the closest elements of the two clusters. :obj:`Average linkage`
clustering computes the average distance between elements of the two clusters,
and :obj:`Complete linkage` defines the distance between two clusters as the
distance between their most distant elements.
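
For concreteness, the three definitions can be written down directly; this is
a small standalone sketch (the distance lookup ``dist`` and the clusters ``A``
and ``B`` are illustrative, not part of the widget):

.. code-block:: python

   def single_linkage(dist, A, B):
       # distance between the closest elements of the two clusters
       return min(dist[a][b] for a in A for b in B)

   def average_linkage(dist, A, B):
       # average distance between elements of the two clusters
       return sum(dist[a][b] for a in A for b in B) / (len(A) * len(B))

   def complete_linkage(dist, A, B):
       # distance between the most distant elements of the two clusters
       return max(dist[a][b] for a in A for b in B)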

Nodes of the dendrogram can be labeled. What the labels are depends upon the
items being clustered. For instance, when clustering attributes, the labels
are obviously the attribute names. When clustering examples, we can use the
values of one of the attributes, typically one that gives the name or id of an
instance, as labels. The label can be chosen in the box :obj:`Annotate`, which
also allows setting the font size and line spacing.

Huge dendrograms can be pruned by checking :obj:`Limit print depth` and
selecting the appropriate depth. This only affects the displayed dendrogram
and not the actual clustering.
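
The effect is analogous to truncating a SciPy dendrogram plot: only the drawing
is cut off at a chosen depth, while the linkage matrix (the clustering itself)
is left untouched. A hedged sketch, where the random data is there only to have
something to cluster:

.. code-block:: python

   import numpy as np
   from scipy.cluster.hierarchy import linkage, dendrogram
   import matplotlib.pyplot as plt

   rng = np.random.default_rng(0)
   Z = linkage(rng.random((30, 4)), method="average")  # the clustering itself

   dendrogram(Z, truncate_mode="level", p=3)           # draw only 3 merge levels
   plt.show()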

Clicking inside the dendrogram can have two effects. If the cutoff line is
not shown (:obj:`Show cutoff line` is unchecked), clicking inside the
dendrogram will select a cluster. Multiple clusters can be selected by holding
Ctrl. Each selected cluster is shown in a different color and is treated as a
separate cluster on the output.

If :obj:`Show cutoff line` is checked, clicking in the dendrogram places a
cutoff line. All items in the clustering are selected and are divided
into groups according to the position of the line.
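
Conceptually, the cutoff corresponds to cutting the tree at a given height, so
that every item ends up in exactly one group. A minimal SciPy sketch of that
idea (the toy data and the cutoff value are made up):

.. code-block:: python

   import numpy as np
   from scipy.cluster.hierarchy import linkage, fcluster
   from scipy.spatial.distance import pdist

   X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
   Z = linkage(pdist(X), method="single")

   cutoff = 1.0                                    # height of the cutoff line
   groups = fcluster(Z, t=cutoff, criterion="distance")
   print(groups)                                   # e.g. [1 1 1 2 2 3]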

If the items being clustered are examples, a cluster index can be appended to
them (:obj:`Append cluster indices`). The index can appear as a
:obj:`Class attribute`, an ordinary :obj:`Attribute` or a :obj:`Meta attribute`.
In the first case, if the data already has a class attribute, the original
class is placed among meta attributes.
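
Outside the widget, appending the index amounts to adding one more column to
the data; a rough sketch (the rows, the cluster assignment and the column
handling are illustrative only):

.. code-block:: python

   rows = [("aardvark", "mammal"), ("bass", "fish"), ("boar", "mammal")]
   groups = [1, 2, 1]        # cluster index per example, e.g. from fcluster above

   # append the index as an extra "Cluster" value; if it were used as the class
   # instead, the original class ("mammal", "fish", ...) would be kept aside
   annotated = [row + ("Cluster %d" % g,) for row, g in zip(rows, groups)]
   for row in annotated:
       print(row)            # ('aardvark', 'mammal', 'Cluster 1') ...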

The data can be output on any change (:obj:`Commit on change`) or, if this
is disabled, by pushing :obj:`Commit`.


.. This is from Alex Jakulin's old widget doc. Left in case BIC is
   reimplemented.

   Clustering has two parameters that can be set by the user, the number of
   clusters and the type of distance metric, :obj:`Euclidean distance` or
   :obj:`Manhattan`. Any changes must be confirmed by pushing :obj:`Apply`.

   The table on the right hand side shows the results of clustering. For each
   cluster it gives the number of examples, its fitness and BIC.

   Fitness measures how well the cluster is defined. Let d(i, C) be the
   average distance between point i and the points in cluster C. Now, let
   a(i) equal d(i, C'), where C' is the cluster i belongs to, and let
   b(i) = min d(i, C) over all other clusters C. Fitness is then defined as
   the average silhouette of the cluster C, that is,
   avg((b(i) - a(i)) / max(b(i), a(i))).

   To make it simple, fitness close to 1 signifies a well-defined cluster.

   BIC is short for the Bayesian Information Criterion and is computed as
   ln L - k(d+1)/2 ln n, where k is the number of clusters, d is the dimension
   of the data (the number of attributes) and n is the number of examples
   (data instances). L is the likelihood of the model, assuming spherical
   Gaussian distributions around the centroid(s) of the cluster(s).


Examples
--------

The schema below computes clustering of attributes and of examples.

.. image:: images/HierarchicalClustering-Schema.png

We loaded the Zoo data set. The clustering of attributes is already shown
above. Below is the clustering of examples, that is, of animals, with the
nodes annotated by the animals' names. We connected :ref:`Linear projection`
so that it shows the FreeViz-optimized projection of all examples read from
the file, while the signal from Hierarchical Clustering is used as a subset.
Linear projection thus marks the examples selected in Hierarchical Clustering;
this way, we can observe the position of the selected cluster(s) in the
projection.

.. image:: images/HierarchicalClustering-Example.png

To (visually) test how well the clustering corresponds to the actual classes
in the data, we can tell the widget to show the class ("type") of the animal
instead of its name (:obj:`Annotate`). The correspondence looks good.

.. image:: images/HierarchicalClustering-Example2.png

A fancy way to verify the correspondence between the clustering and the actual
classes would be to compute the chi-square test between them. As Orange does
not have a dedicated widget for that, we can compute the chi-square in
:ref:`Attribute Distance` and observe it in :ref:`Distance Map`. The only
caveat is that Attribute Distance computes distances between attributes and
not between the class and an attribute, so we have to use
:ref:`Select attributes` to put the class among the ordinary attributes and
replace it with another attribute, say "tail" (this is needed since Attribute
Distance requires data with a class attribute, for technical reasons; the
class attribute itself does not affect the computed chi-square).
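
For a quick numeric check without building that schema, the chi-square between
the cluster index and the class can also be computed in a short script; here is
a hedged sketch with SciPy (the class and cluster labels are made up):

.. code-block:: python

   import numpy as np
   from scipy.stats import chi2_contingency

   classes = ["mammal", "mammal", "fish", "fish", "bird", "bird"]
   clusters = [1, 1, 2, 2, 3, 3]          # appended cluster index per example

   # contingency table: rows are classes, columns are cluster indices
   class_labels = sorted(set(classes))
   cluster_labels = sorted(set(clusters))
   table = np.array([[sum(1 for c, k in zip(classes, clusters)
                          if c == ci and k == kj)
                      for kj in cluster_labels]
                     for ci in class_labels])

   chi2, p, dof, expected = chi2_contingency(table)
   print(chi2, p)       # large chi-square / small p = good correspondence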

A more direct approach is to leave the class attribute (the animal type) as it
is, simply add the cluster index, and observe its information gain in the
:ref:`Rank` widget.
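
The quantity in question can also be computed by hand. Below is a small
illustrative sketch of the information gain of the cluster index with respect
to the class (not Rank's internal code; the toy labels are made up):

.. code-block:: python

   from collections import Counter
   from math import log2

   def entropy(labels):
       n = len(labels)
       return -sum(c / n * log2(c / n) for c in Counter(labels).values())

   def information_gain(classes, clusters):
       # class entropy minus the entropy remaining within each cluster
       n = len(classes)
       remainder = sum(
           (sum(1 for k in clusters if k == kj) / n)
           * entropy([c for c, k in zip(classes, clusters) if k == kj])
           for kj in set(clusters))
       return entropy(classes) - remainder

   classes = ["mammal", "mammal", "fish", "bird"]
   clusters = [1, 1, 2, 2]
   print(information_gain(classes, clusters))   # higher = better correspondence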

More tricks with a similar purpose are described in the documentation for
:ref:`K-Means Clustering`.

The schema that does both and the corresponding settings of the hierarchical
clustering widget are shown below.

.. image:: images/HierarchicalClustering-Schema2.png

.. image:: images/HierarchicalClustering-Example3.png