orngClustering: Partitional and Agglomerative Clustering
The module implements a k-means partitional clustering, and provides a wrapper around Orange's implementation of agglomerative hierarchical clustering. The module also implements a number of useful functions associated with these two clustering methods, like leaf-ordering of the dendrogram and dendrogram plot.
KMeans
Class KMeans provides for an implementation of standard k-means clustering algorithm:
- Choose the number of clusters, k.
- Choose a set of k initial centroids.
- Assign each instances in the data set to the closest centroid.
- For each cluster, compute a new centroid as a center of clustered data instances.
- Repeat the previous two steps, until some convergence criterion is met (e.g., the cluster assignment has not changed).
The main advantage of the algorithm is simplicity and low memory space requirements. The principal disadvantage is the dependence of results on the selection of initial set of centroids.
Methods
- __init__(data=None, centroids=3, maxiters=None, minscorechange=None, stopchanges=0, nstart=1, initialization=kmeans_init_random, distance=orange.ExamplesDistanceConstructor_Euclidean, scoring=score_distance_to_centroids, inner_callback = None, outer_callback = None, initialize_only = False)
datais an Orange's ExampleTable object that stores the data instances to be clustered. Ifdatais notNone, clustering will immediately executed after the initialization of clustering parameters unlessinitialize_onlyis set toTrue.centroidseither specify a number of clusters or provide a list of examples that will serve as clustering centroids. The clustering will stop if one of the following conditions is met: the number of clustering iterations exceedsmaxiters, the number of instances changing the cluster is equal tostopchanges, or the score associated with current clustering improved for less thanminscorechangeof the score from previous iteration. Ifminscorechangeis not set, the score will not be computed between iterations. User can also provide a example distance constructor, which, given the data set, will provide a function that measures the distance between two example instances (see Distances between Examples). A function to select centroids given the table of data instances, k and a example distance function is provided byinitialization, the module includes implementations of several different approaches.scoringis a function that takes clustering object as an argument and returns a score for the clustering. It could be used, for instance, in procedure that repeats the clusteringnstarttimes, returning the clustering with the lowest score. The two callbacks are invoked either after every clustering iteration (inner_callback) or after every clustering restart (in case whennstartis greater than 1,outer_callback).- runone()
- Runs one clustering iteration, starting with re-computation of centroids, followed by computation of data membership (associating data instances to their nearest centroid).
- run()
- Runs clustering until the convergence conditions are met. If
nstartis greater than one,nstartruns of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
Attributes
- k
- Number of clusters.
- centroids
- Current set of centroids.
- scoring
- Current clustering score.
- iteration
- Current clustering iteration.
- clusters
- A list of cluster indexes. An i-th element provides an index to a centroid associated with i-th data instance from the input data set.
The following code runs k-means clustering on Iris data set and prints out the cluster indexes for the last 10 data instances:
part of kmeans-run.py (uses iris.tab)
The output of this code is:
Invoking a call-back function may be useful when tracing the progress of the clustering. Below is a code that uses an inner_callback to report on the number of instances that have changed the cluster and to report on the clustering score. For the score to be computed at each iteration we have to set minscorechange, but we can leave it at 0 (or even set it to a negative value, which allows the score to deteriorate by some amount).
part of kmeans-run-callback.py (uses iris.tab)
The convergence on Iris data set is fast:
Call-back above is used for reporting of the progress, but may as well call a function that plots a selection data projection with corresponding centroid at a given step of the clustering. This is exactly what we did with the following script:
part of kmeans-trace.py (uses iris.tab)
Only the first four scatterplots are shown below. Colors of the data instances indicate the cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular 2-dimensional projection is not necessary also the centroid of the cluster that the data point belongs to.
![]() |
![]() |
![]() |
![]() |
k-Means Utility Functions
- kmeans_init_random(data, k, _)
- A function that can be used for initialization of k-means clustering returns
kdata instances from thedata. This type of initialization is also known as Fory's initialization (Forgy, 1965; He et al., 2004). - kmeans_init_diversity(data, k, distfun)
- Another function that can be used for intializationof k-means clustering. Given the data set, number of clusters and adistance function returns a set of centroids where the first one is a data point being the farthest away from the center of the data, and consequent centroids data points of which the minimal distance to the previous set of centroids is maximal. This type of initialization is almost identical to initialization proposed by Katsavounidis et al. (1994), with the only difference being in the selection of the first centroid (where they use a data instance with the biggest norm).
- KMeans_init_hierarchicalClustering(n=100)
- Is actually a class that returns an clustering initialization function that, given the data, k, and a distance function samples
ndata instances, performs hierarhical clustering, uses it to inferkclusters, and computes a list of cluster-based data centers. - data_center(data)
- Returns a center of the instances in the data set (average across data instances for continuous attributes, most frequent value for discrete attributes).
- score_distance_to_centroids(kmeans)
- Returns an average distance of data instances to their associated centroids.
kmeansis a k-means clustering object. - score_silhouette(kmeans, index=None)
- Returns an average silhouette score of data instances. If
indexis specified it instead returns just the silhouette score of that particular data instance.kmeansis a k-means clustering object. - score_fastsilhouette(kmeans, index=None)
- Same as score_silhouette, but computes an approximation and is faster.
- plot_silhouette(kmeans, filename='tmp.png', fast=False)
- Saves a silhuette plot to
filename, showing the distributions of silhouette scores in clusters.kmeansis a k-means clustering object. Iffastis True usescore_fastsilhouetteto compute scores instead ofscore_silhouette.
Typically, the choice of seeds has a large impact on the k-means clustering, with better initialization methods yielding a clustering that converges faster and finds more optimal centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge:
part of kmeans-cmp-init.py (uses iris.tab, housing.tab, vehicle.tab)
The results show that diversity and clustering-based initialization make k-means converge faster that random selection of seeds (as expected):
The following code computes the silhouette score for three different clusterings (k=2..7), and at the end plots a silhuette plot for k=3.
kmeans-silhouette.py (uses iris.tab
The analysis sugests that clustering with k=2 is preferred as it yields the maximal silhouette coefficien:
Silhouette plot for k=3 is given below:
Hierarchical Clustering
- hierarchicalClustering(data, distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None)
- Returns an object with information of a hierarchical clustering of a given data set. This is a wrapper function around HierarchicalClustering class implemented in Orange.
hierarchicalClusteringusesdistanceConstructorfunction (see Distances between example) to construct a distance matrix, which is then passed to Orange's hierarchical clustering algorithm, along with a particular linkage method. Ordering of leaves can be requested (order=True) and if so, the leaves will be ordered usingorderLeavesfunction (see below). - orderLeaves(root, distanceMatrix)
- Given the object with hierarchical clustering (a root node of the tree) and a distance matrix, function uses a fast optimal leaf ordering by Bar-Joseph et al. to impose the order of the branches in the dendrogram so that the distance between the neighboring leaves is minimized.
- orderLeaves(root, distanceMatrix)
- Given the object with hierarchical clustering (a root node of the tree) and a distance matrix, function uses a fast optimal leaf ordering by Bar-Joseph et al. to impose the order of the branches in the dendrogram so that the distance between the neighboring leaves is minimized.
- hierarchicalClustering_topClusters(root, k)
- Returns k topmost clusters (top k nodes of the clustering tree) from hierarchical clustering.
- hierarhicalClustering_topClustersMembership(root, k)
- Returns a list with indexes which indicate the membership of data instances that are included in top k clusters.
Using hierarchicalClustering, scripts need a single line of code to invoke the clustering and get the object with a result. This is demonstrated in the following script, that considers the Iris data set, performs hierarchical clustering, and then plots the data in two-attribute projection, coloring the points representing data instances according to cluster membership.
part of hclust-iris.py (uses iris.tab)
The output of the script is a following plot:
DendrogramPlot
Class DendrogramPlot implements visualization of the clustering tree (called dendrogram) and corresponding visualization of attribute heatmap.
Methods
- __init__(tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, text_width=None, heatmap_width=None, spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, renderer=EPSRenderer)
treeis an Orange's hierarhical clustering tree object (root node),attr_treean optional attribute clustering root node. Ifdata(Orange's ExampleTable) is given than the dendrogram will include a heat map with color-based presentation of attribute's values. The length of the data set should match the number of leaves in the hierarchical clustering tree. Following are arguments that define the height and width of the plot areas. Branches are plotted in black, but may be colored to visually expose various clusters. The coloring is specified withcluster_colors, a dictionary with cluster instances as keys and (r, g, b) tuples as items.color_palettespecifies the palette to use for the heatmap andminvandmaxvmaximum and minimum data value to plot.- set_matrix_color_schema(color_palette, minv, maxv)
- Set the heatmap color schema.
color_palettecan be an instance of ColorPalette class or an list of (r, g, b) tuples.minvandmaxvspecify the cutoff values for the heatmap (values below and above the interval will be painted withcolor_palette.underflowandcolor_palette.overflowrespectively)- plot(filename="graph.eps")
- Plots the dendrogram and save is to the output file.
Additionaly a module level convenience function dendrogram_draw is provided to streamline the drawing process.
- dendrogram_draw(filename, tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, text_width=None, heatmap_width=None, spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None)
- Draw the dendrogram to filename (supported formats: PNG, EPS, SVG)
To illustrate the use of the dendrogram plotting class, the following scripts uses it on a subset of 20 instances from the Iris data set. Values of the class variables is used for labeling the leaves (and, of course, it is not used for the clustering - only the non-class attributes are used to compute instance distance matrix).
part of hclust-dendrogram.py (uses iris.tab)
The resulting dendrogram is shown below.
Following is a similar script to above one, but this time we have 1) distinctively colored the three topmost dendrogram branches, 2) used a custom color schema for representation of attribute values (spanning red - black - green with custom gamma minv and maxv set ), and 3) included only two attributes in the heat map presentation (note: clustering is still done on all of the data set's attributes).
part of hclust-colored-dendrogram.py (uses iris.tab)
Our "colored" dendrogram is now saved as shown in the figure below:
References
Forgy E (1965) Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics 21(3): 768-769.
He J, Lan M, Tan C-L , Sung S-Y, Low H-B (2004) Initialization of cluster refinement algorithms: A review and comparative study. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary.
Katsavounidis I, Jay C, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10): 144-146.
Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) Fast optimal leaf ordering for herarchical clustering. Bioinformatics 17(Suppl. 1): S22-S29.




