source: orange/docs/widgets/rst/unsupervized/kmeansclustering.rst @ 11050:e3c4699ca155

Revision 11050:e3c4699ca155, 4.3 KB checked in by Miha Stajdohar <miha.stajdohar@…>, 16 months ago (diff)

Widget docs From HTML to Sphinx.

Line 
1.. _k-Means Clustering:
2
3K-Means Clustering
4==================
5
6.. image:: ../icons/K-MeansClustering.png
7
8Groups the examples using the K-Means clustering algorithm.
9
10Signals
11-------
12
13Inputs:
14   - Examples
15      A list of examples
16
17Outputs:
18   - Examples
19      A list of examples with the cluster index as the class attribute
20
21
22Description
23-----------
24
25The widget applies the K-means clustering algorithm to the data from the input and outputs a new data set in which the cluster index is used for the class attribute. The original class attribute, if it existed, is moved to meta attributes. The basic information on the clustering results is also shown in the widget.
26
27<img class="leftscreenshot" src="K-MeansClustering.png" border=0>
28
29Clustering has two parameters that can be set by the user, the number of clusters and the type of distance metrics, :obj:`Euclidean distance` or :obj:`Manhattan`. Any changes must be confirmed by pushing :obj:`Apply`.
30
31The table on the right hand side shows the results of clustering. For each cluster it gives the number of examples, its fitness and BIC.
32
33Fitness measures how well the cluster is defined. Let d<sub>i,C</sub> be the average distance between point i and the points in cluster C. Now, let a<sub>i</sub> equal d<sub>i,C'</sub>, where C' is the cluster i belongs to, and let b<sub>i</sub>=min d<sub>i,C</sub> over all other clusters C. Fitness is then defined as the average silhouette of the cluster C, that is avg( (b<sub>i</sub>-a<sub>i</sub>)/max(b<sub>i</sub>, a<sub>i</sub>) ).
34
35To make it simple, fitness close to 1 signifies a well-defined cluster.
36
37BIC is short for Bayesian Information Criteria and is computed as ln L-k(d+1)/2 ln n, where k is the number of clusters, d is dimension of data (the number of attributes) and n is the number of examples (data instances). L is the likelihood of the model, assuming the spherical Gaussian distributions around the centroid(s) of the cluster(s).
38
39
40Examples
41--------
42
43We are going to explore the widget with the following schema.
44
45.. image:: images/K-MeansClustering-Schema.png
46
47The beginning is nothing special: we load the iris data, divide it into three clusters, show it in a table, where we can observe which example went into which cluster. The interesting part are the Scatter plot and Select data.
48
49Since K-means added the cluster index as the class attribute, the scatter plot will color the points according to the clusters they are in. Indeed, what we get looks like this.
50.. image:: images/K-MeansClustering-Scatterplot.png
51
52The thing we might be really interested in is how well the clusters induced by the (unsupervised) clustering algorithm match the actual classes appearing in the data. We thus take the Select data widget in which we can select individual classes and get the corresponding points in the scatter plot marked. The match is perfect setosa, and pretty good for the other two classes.
53
54.. image:: images/K-MeansClustering-Example.png
55
56You may have noticed that we left the :obj:`Remove unused values/attributes` and :obj:`Remove unused classes` in Select Data unchecked. This is important: if the widget modifies the attributes, it outputs a list of modified examples and the scatter plot cannot compare them to the original examples.
57
58Another, perhaps simpler way to test the match between clusters and the original classes is to use the widget `Distributions <../Visualize/Distributions.htm>`_. The only (minor) problem here is that this widget only visualizes the normal attributes and not the meta attributes. We solve this by using `Select Attributes <../Data/SelectAttributes.htm>`_ with which we move the original class to normal attributes.
59
60.. image:: images/K-MeansClustering-Schema.png
61
62The match is perfect for setosa: all instances of setosa are in the first cluster (blue). 47 versicolors are in the third cluster (green), while three ended up in the second. For virginicae, 49 are in the second cluster and one in the third.
63
64To observe the possibly more interesting reverse relation, we need to rearrange the attributes in the Select Attributes: we reinstate the original class Iris as the class and put the cluster index among the attributes.
65
66.. image:: images/K-MeansClustering-Example2a.png
67
68The first cluster is exclusively setosae, the second has mostly virginicae and the third has mostly versicolors.
Note: See TracBrowser for help on using the repository browser.