Orange Forum • Persistent k-means clustering objects

Persistent k-means clustering objects

A place to ask questions about methods in Orange and how they are used and other general support.

Persistent k-means clustering objects

Postby algo » Fri Aug 08, 2014 21:52

I see that you can save a classifier to a file and load a classifier from a file in order to use previously trained classifiers in later Orange sessions.

I actually want to be able to save a k-means clustering object in the same way. (I do want to use the k-means clustering object as a classifier after it is trained.)

Is there any way to save a k-means clustering object to a file, and then later load it from a file, in the same way that you can currently save and load classifiers?

Thanks in advance for any help with this question.

Re: Persistent k-means clustering objects

Postby algo » Fri Aug 15, 2014 19:09

Just in case anyone else wonders about this in the future...

As of today I don't see how to do this through the GUI (Orange Canvas).

In Python code, though, you could write at least the set of trained centroids to a file. In later Orange sessions you could then initialize a new k-means clustering object from those previously found centroids (and the other necessary parameters) to replicate your previously trained k-means clustering object.

You could certainly code a more elaborate and general solution to the problem, as well. But the above paragraph sketches out at least one bare-bones approach to providing persistent k-means clustering objects in Orange.

(This modest form of serialization is proposed under the assumption that "pickling" of Orange objects is still not possible, as stated in this old thread:

viewtopic.php?f=4&t=62 )
Last edited by algo on Tue Aug 19, 2014 20:43, edited 2 times in total.

Re: Persistent k-means clustering objects

Postby Ales » Tue Aug 19, 2014 9:47

The k-means widget does not send the underlying clustering object on its output.
However, you can combine the k-means widget with the k-nearest neighbors widget to create a 'classifier': connect the kNN widget to the 'Centroids' output of k-means, set the number of neighbors to k=1, and make sure both widgets use the same distance metric.
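Why this works can be sketched outside the widgets, in plain numpy (not Orange code): a 1-NN classifier whose "training set" is the centroids, labeled 0..k-1 and using the same unnormalized Euclidean metric, is by construction the nearest-centroid assignment rule of k-means.

```python
import numpy as np

def nearest_centroid(X, centroids):
    # k-means assignment step: index of the closest centroid per row of X.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def knn1_on_centroids(X, centroids):
    # 1-NN whose training examples are the centroids, labeled 0..k-1.
    # With k=1 and the same metric this reduces to the rule above.
    labels = np.arange(len(centroids))
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]
```

The equivalence holds only as long as both sides really use the same metric, which is where the normalization issue discussed later in this thread comes in.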

Re: Persistent k-means clustering objects

Postby algo » Tue Aug 19, 2014 19:23

Thanks for the tip Ales!!

Your idea of building a persistent k-means classifier from a kNN classifier that uses the Centroids of the k-means clustering object as its candidate neighbors, returns the k = 1 nearest neighbor, and uses the same distance metric as the k-means object seems like it should do the trick.

That's clever. :D

Thanks again.

Re: Persistent k-means clustering objects

Postby algo » Sat Aug 23, 2014 8:53

Ales,

Something odd is going on with my version of Orange. I'm running Orange 2.7.1 (a snapshot from August 27, 2013).

I used Orange Canvas to set things up as you suggested in your previous post.

I ran the K-means clustering object on some data that I have. The data set has 385 examples. I chose to fix the number of clusters at 3 clusters.

When I connect a "Test Learners" object to the kNN classifier in the canvas to test its classification results, the report shows a classification accuracy for the kNN classifier of 1.0000 for each of the three clusters/classes generated by the K-means clustering object.

However, the "Predictions" object connected to the same kNN classifier only achieves an overall classification accuracy of 0.9299 for the three clusters/classes generated by the K-means clustering object. (The CA scores for each of the clusters/classes are: C1 = 0.9, C2 = 0.9167, C3 = 0.9454.)

I certainly don't understand how the "Test Learners" object in the canvas reports 100% accuracy over all 3 clusters/classes for the kNN classifier, while the "Predictions" object has mismatches between predicted clusters/classes and actual clusters/classes that result in an overall classification accuracy of only 92.99%.

I don't know why the kNN classifier would misclassify any of the examples to begin with, but the mismatch between the "Test Learners" and the "Predictions" objects seems incomprehensible. (Is this a bug?)
Last edited by algo on Mon Aug 25, 2014 3:28, edited 3 times in total.

Re: Persistent k-means clustering objects

Postby algo » Sat Aug 23, 2014 9:22

Just to make the problem easier to reproduce, I tried the setup with the iris.tab data set.

The problem/error shows up there, too.

To reproduce the problem do the following: Set up a canvas as you advised above using the iris.tab dataset as input. Set the K-means clustering object to 3 fixed clusters. Use a Euclidean distance measure, and use Diversity as the initialization method. Append cluster indices as a class attribute.

When you run the "Test Learners" object on all the data from the K-means clustering object and the learner from the kNN classifier, it reports a classification accuracy for the kNN classifier of 1.0000 for all three clusters as target-classes. (I've got the "Test Learners" object set to "Test on train data", and the entire data set is being selected and sent to the "Test Learners" object as the training set.)

However, when you inspect the results of the kNN classifier over the same data in the "Predictions" object, you will see that examples 115 and 135 are misclassified. (i.e. The kNN classifier labels them with incorrect cluster labels.)
Last edited by algo on Wed Aug 27, 2014 18:19, edited 2 times in total.

Re: Persistent k-means clustering objects

Postby algo » Tue Aug 26, 2014 17:07

I updated my version of Orange to the latest snapshot (version 2.7.6 from August 26, 2014), and the above problem persists even when using the most current version of Orange (Canvas).

(I chose to install with new settings for canvas and widgets, but this did not fix the problem.)

Re: Persistent k-means clustering objects

Postby Ales » Mon Sep 01, 2014 14:09

algo wrote:When you run the "Test Learners" object on all the data from the K-means clustering object and the learner from the kNN classifier, it reports a classification accuracy for the kNN classifier of 1.0000 for all three clusters as target-classes. (I've got the "Test Learners" object set to "Test on train data", and the entire data set is being selected and sent to the "Test Learners" object as the training set.)

However, when you inspect the results of the kNN classifier over the same data in the "Predictions" object, you will see that examples 115 and 135 are misclassified. (i.e. The kNN classifier labels them with incorrect cluster labels.)

In order to observe the same results in Test Learners and Predictions, make sure you are using the same train/test data sets. In particular, if you use the Centroids as the input to the kNN widget, then also use them as the input to Test Learners, supply the full data as the 'Separate Test Set' input of the Test Learners widget, and select the 'Test on test data' option in the 'Sampling' options group.

algo wrote:I don't know why the kNN classifier would misclassify any of the examples...

I appear to have been wrong in my initial response.
The problem arises from the scaling (normalization) of the distance metric. The k-Means widget uses normalized Euclidean distance and induces the scaling factors from its input data set (this cannot be overridden in the GUI). The k-NN widget also uses a normalized distance induced from its training data set, but the factors induced from the Centroids produced by k-Means are in general different.

This means that, in general, you cannot recreate the distance metric used in k-Means using only the kNN widget.
One workaround is to take care of normalization before k-Means (using the Continuize widget with 'Normalize by span') and to disable 'Normalize continuous attributes' in the k-NN widget.
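The root cause can be illustrated in plain numpy (function name span_normalizer is illustrative, not widget API): 'Normalize by span' rescales each attribute by its min/max, and the min/max induced from a handful of centroids is generally not the min/max of the full data, so the two resulting metrics disagree.

```python
import numpy as np

def span_normalizer(data):
    # Per-attribute min and span, as 'Normalize by span' would induce them
    # from whatever data set it is given.
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return lambda X: (np.asarray(X, dtype=float) - lo) / span
```

Normalizing once, from the full data, and feeding the already-normalized data to both widgets avoids the mismatch.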

Another option is using the 'Python Script' widget to create the kNN classifier with a distance metric induced on the full data.
Code:
# `in_data` is the full data set, used to construct the distance metric;
# `in_object` is the centroids table (from the k-Means 'Centroids' output).
import Orange

centroids = in_object
learner = Orange.classification.knn.kNNLearner(
    k=1,
    # Ignore the training data the learner passes in (the centroids) and
    # induce the Euclidean distance on the full data set instead.
    distance_constructor=lambda data, w, *rest: Orange.distance.Euclidean(in_data, w)
)
out_classifier = learner(centroids)

This script assumes the full data is passed as the 'in_data' and the centroids as the 'in_object' signal.

Sorry about my mistake.

Re: Persistent k-means clustering objects

Postby algo » Mon Sep 08, 2014 17:47

Ales,

Thanks for the insight into what causes this problem and for posting possible solutions.

For what it's worth, I was actually going to call the k-means clustering object from a program written in C# in order to classify data points according to their nearest centroid/cluster. At first I thought I might serialize a k-means clustering object from Orange and call this object from C# code in order to do this. Then, I thought I might call an equivalent kNN classifier object (saved from Orange) in my C# code in order to achieve the same end.

Instead, I've just written a simple compute_cluster() method in C# code that uses the centroids found by a k-means clustering object to assign clusters to given data patterns/points.

Unfortunately, I still find that I cannot use Orange for this, because Orange's k-means clustering objects report centroid coordinates (for my data) to a precision of only two decimal places. This causes classification/cluster-assignment errors when I calculate the Euclidean distance (or squared Euclidean distance) between data points and centroids.
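The failure mode described above is easy to demonstrate with synthetic numbers (these are illustrative, not algo's actual data): when two centroids are close together, rounding their coordinates to two decimal places can flip which one is nearest to a given point.

```python
import numpy as np

def assign(point, centroids):
    # Nearest-centroid rule under plain Euclidean distance.
    d = np.linalg.norm(np.asarray(centroids) - np.asarray(point), axis=1)
    return int(d.argmin())

exact = np.array([[1.0040, 0.0], [1.0100, 0.0]])  # two nearby centroids
rounded = exact.round(2)                          # as reported to 2 decimals
point = [1.0055, 0.0]
```

With the exact centroids the point is closest to the first centroid (distance 0.0015 vs 0.0045); with the rounded centroids it flips to the second (0.0055 vs 0.0045), which is exactly the kind of cluster-assignment error described above.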

Another machine learning package, scikit-learn, outputs centroid coordinates at much greater precision, and I have no problem assigning correct cluster values to data points using the centroids generated by its k-means clustering objects.

I'm telling you this to give you feedback on some shortcomings of Orange's k-means clustering objects that have caused me to move on to another machine learning package in order to get my work done.

If I've run into this problem, it may be(come) a problem for other Orange users, as well.

