
Replicating NaiveBayes results from other tools (NLTK)



Post by garyspatterson » Fri Feb 03, 2012 19:37

I am trying to replicate the very simple classification example from the NLTK (Natural Language Toolkit) book, where a corpus of people's first names is tagged for gender and the NB learner uses a single feature, the last letter of the name, to predict the label male or female. (http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)

Of the total corpus of ~7,900 names, some 7,400 are used for training and 500 are held out for testing. Using the nltk NaiveBayes classifier module in Python trained on the 7,400 names, I get an accuracy of some 70.2% on the test set. (Since the learner only uses a single feature, the low accuracy isn't too surprising.)
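
For reference, the NLTK side is essentially the book's example, i.e. something like this (the exact split depends on the shuffle, of course):

import random
import nltk
from nltk.corpus import names

# Labelled corpus of (name, gender) pairs, as in the NLTK book
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# The single feature: the last letter of the name
def gender_features(word):
    return {'last_letter': word[-1]}

featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))   # ~0.702 for me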

Now, when I try to do the same thing in the Orange canvas with the very same training and test data, I wind up with an accuracy on the test set of some 73.2%. Looking at the numbers in more detail: of the 500 test items, 342 were classified correctly by both models; 9 were classified correctly by nltk NB but incorrectly by Orange NB; 24 were classified correctly by Orange NB but incorrectly by nltk NB; and 125 were misclassified by both.
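
(For what it's worth, that breakdown comes from lining up the two tools' predicted labels on the same 500 test items and counting agreement, roughly like this; the three lists are just the gold labels and each tool's predictions, exported in the same order:)

def agreement_counts(gold, nltk_pred, orange_pred):
    # Tally how the two classifiers agree/disagree with the gold label
    both_right = only_nltk = only_orange = both_wrong = 0
    for g, n, o in zip(gold, nltk_pred, orange_pred):
        if n == g and o == g:
            both_right += 1
        elif n == g:
            only_nltk += 1
        elif o == g:
            only_orange += 1
        else:
            both_wrong += 1
    return both_right, only_nltk, only_orange, both_wrong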

The Orange learner allows you to set some parameters (e.g. the size of the LOESS window and the number of LOESS sample points). I'm not exactly sure what these do, but tinkering with them doesn't seem to affect the results at all. The data is perfectly clean, too, so it can't be down to how the two systems handle missing data. For instance, there are no feature values in the test set that were not seen in the training set, although admittedly some of the training frequencies are very small: the last letter 'z' appears 13 times out of 7,444 instances in training and 2 times out of 500 in the test set.
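
(The "no unseen feature values" check and the 'z' counts above come from a quick tally over the exported name lists, something along these lines; the file names are just placeholders for my train/test exports:)

from collections import Counter

def last_letter_counts(path):
    # One name per line; count how often each final letter occurs
    with open(path) as f:
        return Counter(line.strip()[-1] for line in f if line.strip())

train_counts = last_letter_counts('train_names.txt')
test_counts = last_letter_counts('test_names.txt')

print(set(test_counts) - set(train_counts))   # empty for my split
print(train_counts['z'], test_counts['z'])    # 13 and 2 in my split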

So my question is: should I expect a Naive Bayes model to give the same results regardless of the tool used to run the algorithm? Or is some divergence to be expected because of how each tool smooths very low probabilities?
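
Just to make the low-probability point concrete, this is the kind of difference I have in mind, with made-up counts for a rare last letter (whether either tool actually uses an add-0.5 style correction is exactly what I'm unsure about):

# Hypothetical counts: occurrences of last letter 'z' within each class
count_z_male, n_male = 10, 4000
count_z_female, n_female = 3, 3444

def relative_freq(count, total):
    # Unsmoothed maximum-likelihood estimate
    return count / float(total)

def add_half(count, total, bins=26):
    # Expected-likelihood style estimate: add 0.5 to every letter's count
    return (count + 0.5) / (total + 0.5 * bins)

print(relative_freq(count_z_male, n_male), add_half(count_z_male, n_male))
print(relative_freq(count_z_female, n_female), add_half(count_z_female, n_female))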

Thanks again in advance.
