
.. _k-Means Clustering:

K-Means Clustering
==================

.. image:: ../icons/K-MeansClustering.png

Groups the examples using the K-Means clustering algorithm.

Signals
-------

Inputs:
   - Examples
        A list of examples

Outputs:
   - Examples
        A list of examples with the cluster index as the class attribute

Description
-----------

The widget applies the K-means clustering algorithm to the data from the
input and outputs a new data set in which the cluster index is used as the
class attribute. The original class attribute, if it existed, is moved to
the meta attributes. Basic information on the clustering results is also
shown in the widget.
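
What the widget does can be sketched in a few lines of code. The snippet
below is only an illustration, not the widget's own implementation: it uses
scikit-learn's ``KMeans`` (our substitute; the widget has its own clustering
code), so the exact assignments may differ.

.. code-block:: python

    # A sketch of the widget's behavior: cluster the iris data and use the
    # cluster index as the class, keeping the original class for comparison.
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    iris = load_iris()
    clusters = KMeans(n_clusters=3, n_init=10,
                      random_state=0).fit_predict(iris.data)

    # The cluster index now plays the role of the class attribute; the
    # original class (the species) is kept aside, like a meta attribute.
    print(clusters[:10])
    print(iris.target[:10])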

.. Clustering has two parameters that can be set by the user: the number of
   clusters and the type of distance metric, :obj:`Euclidean distance` or
   :obj:`Manhattan`. Any changes must be confirmed by pushing :obj:`Apply`.

The table on the right-hand side shows the results of clustering. For each
cluster it gives the number of examples, its fitness and BIC.

Fitness measures how well the cluster is defined. Let :math:`d_{i,C}` be the
average distance between point :math:`i` and the points in cluster :math:`C`.
Now, let :math:`a_i` equal :math:`d_{i,C'}`, where :math:`C'` is the cluster
:math:`i` belongs to, and let :math:`b_i = \min_{C \ne C'} d_{i,C}` over all
other clusters :math:`C`. Fitness is then defined as the average silhouette
of the cluster :math:`C`, that is
:math:`\mathrm{avg}\bigl((b_i - a_i) / \max(b_i, a_i)\bigr)`.

To make it simple, fitness close to 1 signifies a well-defined cluster.
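
To make the definition concrete, here is a small sketch (an illustration
using scikit-learn's ``silhouette_samples``, not the widget's own code) that
computes the per-cluster average silhouette on the iris data:

.. code-block:: python

    # Per-cluster fitness: average silhouette of the examples in a cluster.
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    X = load_iris().data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # silhouette_samples returns (b_i - a_i) / max(a_i, b_i) per example.
    s = silhouette_samples(X, labels)
    for c in range(3):
        print("cluster %d: fitness %.3f" % (c, s[labels == c].mean()))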

BIC is short for the Bayesian Information Criterion and is computed as
:math:`\ln L - \frac{k(d+1)}{2} \ln n`, where :math:`k` is the number of
clusters, :math:`d` is the dimension of the data (the number of attributes)
and :math:`n` is the number of examples (data instances). :math:`L` is the
likelihood of the model, assuming spherical Gaussian distributions around
the centroids of the clusters.
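
The formula can be spelled out in code. The sketch below makes explicit
assumptions the text leaves open (one spherical Gaussian per cluster, with a
per-cluster variance estimated from the data), so the widget's exact numbers
may differ:

.. code-block:: python

    # A sketch of BIC = ln L - k (d + 1) / 2 * ln n under the spherical
    # Gaussian assumption; the variance estimate is our own choice.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    X = load_iris().data
    n, d = X.shape
    k = 3
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    log_l = 0.0
    for c in range(k):
        diffs = X[km.labels_ == c] - km.cluster_centers_[c]
        sq = (diffs ** 2).sum(axis=1)      # squared distances to centroid
        sigma2 = sq.mean() / d             # per-dimension variance
        log_l += (-0.5 * len(sq) * d * np.log(2 * np.pi * sigma2)
                  - sq.sum() / (2 * sigma2))

    bic = log_l - k * (d + 1) / 2.0 * np.log(n)
    print("BIC: %.1f" % bic)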

Examples
--------

We are going to explore the widget with the following schema.

.. image:: images/K-MeansClustering-Schema.png

The beginning is nothing special: we load the iris data, divide it into
three clusters and show it in a table, where we can observe which example
went into which cluster. The interesting parts are the Scatter Plot and
Select Data widgets.

Since K-means added the cluster index as the class attribute, the Scatter
Plot will color the points according to the clusters they are in. Indeed,
what we get looks like this.

.. image:: images/K-MeansClustering-Scatterplot.png
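
A similar picture can be drawn outside the GUI; the following is a rough
approximation with matplotlib (the choice of axes is ours, not the widget's):

.. code-block:: python

    # Approximate the widget's view: iris points colored by cluster index.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    X = load_iris().data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    plt.scatter(X[:, 2], X[:, 3], c=labels)   # petal length vs. petal width
    plt.xlabel("petal length")
    plt.ylabel("petal width")
    plt.show()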

71 | |

72 | The thing we might be really interested in is how well the clusters induced by |

73 | the (unsupervised) clustering algorithm match the actual classes appearing in |

74 | the data. We thus take the Select data widget in which we can select individual |

75 | classes and get the corresponding points in the scatter plot marked. The match |

76 | is perfect setosa, and pretty good for the other two classes. |

77 | |

78 | .. image:: images/K-MeansClustering-Example.png |

You may have noticed that we left :obj:`Remove unused values/attributes`
and :obj:`Remove unused classes` in Select Data unchecked. This is
important: if the widget modifies the attributes, it outputs a list of
modified examples and the scatter plot cannot compare them to the original
examples.

Another, perhaps simpler way to test the match between the clusters and the
original classes is to use the :ref:`Distributions` widget. The only
(minor) problem here is that this widget only visualizes normal attributes
and not meta attributes. We solve this by using :ref:`Select Attributes`,
with which we move the original class to the normal attributes.

.. image:: images/K-MeansClustering-Schema.png

The match is perfect for setosa: all instances of setosa are in the first
cluster (blue). 47 versicolors are in the third cluster (green), while
three ended up in the second. For virginicae, 49 are in the second cluster
and one in the third.
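
Counts like these amount to a cross-tabulation of cluster indices against
the original classes. A short sketch (pandas is used for illustration, and
the exact counts depend on the particular k-means run):

.. code-block:: python

    # Cross-tabulate cluster index against the original iris class.
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    iris = load_iris()
    labels = KMeans(n_clusters=3, n_init=10,
                    random_state=0).fit_predict(iris.data)

    species = pd.Series(iris.target, name="class").map(
        dict(enumerate(iris.target_names)))
    print(pd.crosstab(pd.Series(labels, name="cluster"), species))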

To observe the possibly more interesting reverse relation, we need to
rearrange the attributes in Select Attributes: we reinstate the original
class Iris as the class and put the cluster index among the attributes.

.. image:: images/K-MeansClustering-Example2a.png

The first cluster is exclusively setosae, the second has mostly virginicae
and the third has mostly versicolors.
