source: orange/docs/widgets/rst/unsupervized/kmeansclustering.rst @ 11359:8d54e79aa135

.. _k-Means Clustering:

K-Means Clustering
==================

.. image:: ../icons/K-MeansClustering.png

Groups the examples using the K-Means clustering algorithm.

Signals
-------

Inputs:
   - Examples
      A list of examples

Outputs:
   - Examples
      A list of examples with the cluster index as the class attribute


Description
-----------

The widget applies the K-means clustering algorithm to the data from the input
and outputs a new data set in which the cluster index is used for the class
attribute. The original class attribute, if it existed, is moved to meta
attributes. The basic information on the clustering results is also shown in
the widget.
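
The widget's internals aside, the algorithm itself is simple enough to sketch.
The following is a minimal NumPy implementation of Lloyd's k-means iteration,
returning a cluster index for each example; it is an illustration under our
own naming and initialization choices, not the widget's actual code. ::

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Start from k distinct examples chosen as initial centroids.
        rng = np.random.RandomState(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            # Assign every example to its nearest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of the examples assigned to it
            # (empty clusters are not handled in this sketch).
            new_centroids = np.array([X[labels == j].mean(axis=0)
                                      for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids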


.. Clustering has two parameters that can be set by the user: the number of
   clusters and the distance metric, :obj:`Euclidean distance` or
   :obj:`Manhattan`. Any changes must be confirmed by pushing :obj:`Apply`.

   The table on the right-hand side shows the results of clustering. For each
   cluster it gives the number of examples, its fitness and BIC.

   Fitness measures how well the cluster is defined. Let d(i, C) be the
   average distance between point i and the points in cluster C. Now, let
   a(i) equal d(i, C'), where C' is the cluster i belongs to, and let
   b(i) = min d(i, C) over all other clusters C. Fitness is then defined as
   the average silhouette of the cluster C, that is,
   avg((b(i) - a(i)) / max(b(i), a(i))).

   To make it simple, fitness close to 1 signifies a well-defined cluster.
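
   In script form, this fitness (average silhouette) could be computed as in
   the sketch below, where ``X`` is the data matrix and ``labels`` the cluster
   indices; the names are our own. ::

      import numpy as np

      def cluster_fitness(X, labels, c):
          # Average distance from point i to a set of member points.
          def avg_dist(i, members):
              return np.linalg.norm(X[members] - X[i], axis=1).mean()

          scores = []
          for i in np.flatnonzero(labels == c):
              # a(i): average distance to the other points of i's own cluster.
              own = np.flatnonzero(labels == c)
              a = avg_dist(i, own[own != i])
              # b(i): smallest average distance to any other cluster.
              b = min(avg_dist(i, np.flatnonzero(labels == other))
                      for other in np.unique(labels) if other != c)
              scores.append((b - a) / max(a, b))
          return np.mean(scores)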

   BIC is short for the Bayesian Information Criterion and is computed as
   ln L - k(d+1)/2 ln n, where k is the number of clusters, d is the dimension
   of the data (the number of attributes) and n is the number of examples
   (data instances). L is the likelihood of the model, assuming spherical
   Gaussian distributions around the centroid(s) of the cluster(s).
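
   A matching sketch for this BIC score, under the same naming and with a
   single spherical variance pooled across clusters (a simplification that
   ignores the mixing weights)::

      import numpy as np

      def bic(X, labels, centroids):
          n, d = X.shape
          k = len(centroids)
          # Squared distance of each example to its own centroid.
          sq = ((X - centroids[labels]) ** 2).sum(axis=1)
          # Spherical-Gaussian variance estimate per dimension.
          sigma2 = sq.mean() / d
          # ln L under spherical Gaussians around the centroids.
          log_l = -0.5 * n * d * np.log(2 * np.pi * sigma2) - sq.sum() / (2 * sigma2)
          return log_l - k * (d + 1) / 2 * np.log(n)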


Examples
--------

We are going to explore the widget with the following schema.

.. image:: images/K-MeansClustering-Schema.png

The beginning is nothing special: we load the iris data, divide it into
three clusters and show it in a table, where we can observe which example
went into which cluster. The interesting parts are the Scatter Plot and
Select Data widgets.

Since K-means added the cluster index as the class attribute, the scatter
plot will color the points according to the clusters they are in. Indeed,
what we get looks like this.

.. image:: images/K-MeansClustering-Scatterplot.png

The thing we might really be interested in is how well the clusters induced
by the (unsupervised) clustering algorithm match the actual classes appearing
in the data. We thus take the Select Data widget, in which we can select
individual classes and have the corresponding points marked in the scatter
plot. The match is perfect for setosa and pretty good for the other two
classes.

.. image:: images/K-MeansClustering-Example.png

You may have noticed that we left the :obj:`Remove unused values/attributes`
and :obj:`Remove unused classes` options in Select Data unchecked. This is
important: if the widget modifies the attributes, it outputs a list of
modified examples and the scatter plot cannot compare them to the original
examples.

Another, perhaps simpler way to test the match between the clusters and the
original classes is to use the :ref:`Distributions` widget. The only (minor)
problem is that this widget visualizes only the normal attributes and not
the meta attributes. We solve this by using :ref:`Select Attributes` to
move the original class among the normal attributes.

.. image:: images/K-MeansClustering-Schema.png

The match is perfect for setosa: all instances of setosa are in the first
cluster (blue). 47 versicolors are in the third cluster (green), while three
ended up in the second. For virginicae, 49 are in the second cluster and one
in the third.
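
If you would rather verify such counts in a script than read them off the
widget, a simple cross-tabulation of the cluster index against the original
class gives the same table. A sketch with our own names, assuming ``labels``
and ``classes`` are equal-length arrays::

    import numpy as np

    def crosstab(labels, classes):
        # Rows are cluster indices, columns are the original classes.
        rows, cols = np.unique(labels), np.unique(classes)
        return np.array([[np.sum((labels == r) & (classes == c))
                          for c in cols] for r in rows])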

To observe the possibly more interesting reverse relation, we need to
rearrange the attributes in Select Attributes: we reinstate the original
class Iris as the class and put the cluster index among the attributes.

.. image:: images/K-MeansClustering-Example2a.png

The first cluster is exclusively setosae, the second has mostly virginicae
and the third has mostly versicolors.