Tree post-pruning

Postby John » Wed Apr 13, 2005 22:21

Hello,

I'd like to post-prune a decision tree used for classification. Does the parameter "mForPruning", which is passed into TreeLearner(), control the amount of post-pruning? I can't find any documentation describing this parameter.

Thanks,

John

Postby Blaz » Thu Apr 14, 2005 15:49

Parameter m (called mForPruning in Orange) controls the post-pruning. m is the parameter in the m-estimate formula for computing probabilities. Having, for instance, Nc examples from class c in a group of N examples, the m-estimated probability is (m*Pc + Nc)/(N + m), where Pc is the unconditional (prior) probability of class c. Notice that if relative frequency were used, this estimate would be Nc/N, and with the Laplace estimate it would be (Nc + 1)/(N + k), where k is the number of distinct classes.
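
For illustration, here is a small sketch of the three estimates (plain Python, not Orange-specific):

Code: Select all
def m_estimate(Nc, N, Pc, m):
    # (m*Pc + Nc) / (N + m); m = 0 reduces to the relative frequency Nc/N
    return (m * Pc + Nc) / float(N + m)

def laplace(Nc, N, k):
    # (Nc + 1) / (N + k), where k is the number of distinct classes
    return (Nc + 1) / float(N + k)

# example: a leaf with 3 of 4 examples from class c, prior Pc = 0.5
print m_estimate(3, 4, 0.5, 0)   # 0.75, the relative frequency
print m_estimate(3, 4, 0.5, 2)   # 0.666..., pulled toward the prior
print laplace(3, 4, 2)           # 0.666...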

The m-estimate and its utility within machine learning are described in Cestnik B. Estimating probabilities: A crucial task in machine learning. Proceedings of the Ninth European Conference on Artificial Intelligence (pp. 147-149). London: Pitman, 1990. I also have some slides from my lectures on this; see the Minimal Error Pruning section in http://www.ailab.si/blaz/predavanja/uisp/slides/uisp05-PostPruning.ppt.

Hope this helps. I have also made a note to update the description of the orngTree module accordingly.

Postby John » Fri Apr 15, 2005 0:23

Blaz,

Thanks for the quick reply. I took a look at your Minimal Error Pruning slides, and have a couple of follow-up questions. (With so many parameters to choose from, my goal here is to figure out a reasonable set of parameters and parameter values to try when learning a tree.)

(1) You set the list of m's to try out to be [0, 0.2, 0.5, 1, 2, 5, 10, 100]. Any reason for this particular choice?

(2) In the tree learner, you set sameMajorityPruning=1. Have you found that turning this on typically gives better performance?

Thanks again,

John

Postby Blaz » Fri Apr 15, 2005 7:03

On 1): The default for m in much of the related work is usually 2. There is no particular reason for that specific list, but we have noticed the "logarithmic" effect m has on pruning (I would include 0.1 and 20 for completeness, though).

Notice that to do the fitting right, you need to embed your learning within a wrapper. If you want to, say, use 10-fold cross-validation for testing the classifier, the "right" m has to be learned on the training set (90% of the data). The simplest approach is internal cross-validation: use 5- or 10-fold CV on each training set and go through the list of candidate m's. Choose the best one (e.g., the one that gives the best predictive accuracy) and use it when learning from the complete training set.
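
As a sketch of what such a wrapper does internally (assuming the standard orngTest and orngStat modules; orngWrap, described below, automates this for you):

Code: Select all
import orange, orngTree, orngTest, orngStat

train = orange.ExampleTable('voting')   # stands in for one training fold
candidates = [0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 100]

# internal 10-fold CV on the training data only, one learner per candidate m
learners = [orngTree.TreeLearner(mForPruning=m) for m in candidates]
results = orngTest.crossValidation(learners, train, folds=10)
accuracies = orngStat.CA(results)
bestM = candidates[accuracies.index(max(accuracies))]

# relearn from the complete training set with the chosen m
classifier = orngTree.TreeLearner(train, mForPruning=bestM)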

There is an (undocumented) module, orngWrap, included in the distribution; some documentation is included in the Python code. A call starts, for instance, with

orngWrap.Tune1Parameter(object=bayes, parameter='m', values=[0,10,20,30])

which is how to tune m for a naive Bayesian classifier (provided that it uses the m-estimate for computing probabilities). Here is the info included on this class:

# The class needs to be given the following attributes:
#   object     - the learning algorithm to be fitted
#   parameter  - a string or a list of strings with the parameter(s) to fit
#   values     - possible values of the parameter
#                (e.g. <object>.<parameter> = <value>[i])
#   evaluate   - statistics to evaluate (default: orngStat.CA)
#   compare    - function to compare (default: cmp - the bigger the better)
#   returnWhat - tells whether to return values of parameters, a fitted
#                learner, the best classifier, or None. "object" is left
#                with optimal parameters in any case.

On 2): sameMajorityPruning=1 helps. The Orange tree learner (intentionally) does not stop splitting when it would create leaves with the same majority class, but it often helps to merge these in post-pruning.
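
For example, a typical call combining both kinds of pruning (using the parameters discussed above):

Code: Select all
import orange, orngTree

data = orange.ExampleTable('voting')
# post-prune with the m-estimate and merge sibling leaves
# that share the same majority class
tree = orngTree.TreeLearner(data, mForPruning=2, sameMajorityPruning=1)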

Postby John » Fri Apr 15, 2005 19:55

Blaz,

I'm having trouble using the orngWrap module, most likely because I'm using it incorrectly.

When I run the code

bayes = orange.BayesLearner(data)
orngWrap.Tune1Parameter(object=bayes, parameter='m', values=[0,10,20,30])

none of the internal logic of Tune1Parameter appears to be executing (I don't think its __call__ method is being called). Should I be embedding orngWrap.Tune1Parameter() inside a call to crossValidation?

Thanks,

John

Postby Blaz » Tue Apr 19, 2005 14:45

Notice that Tune1Parameter should receive an object (a learner) that has not yet seen the data, so that it can still learn. You passed a classifier that was already constructed, so Tune1Parameter couldn't do anything with it...
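
In other words:

Code: Select all
learner = orange.BayesLearner()         # a learner; this is what Tune1Parameter needs
classifier = orange.BayesLearner(data)  # a shortcut that learns at once and returns a classifier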

To make this clearer, consider the following code for tree learning:
Code: Select all
import orange, orngWrap, orngTree

data = orange.ExampleTable('voting')
tree = orngTree.TreeLearner()
tunedTree = orngWrap.Tune1Parameter(object=tree, parameter='mForPruning', \
    values=[0,1,2,5,10], verbose=2)

classifier = tunedTree(data)


The verbose level can be adjusted; we used level 2, which prints the following:

orngWrap:
0 [0.86206896551724144]
orngWrap:
1 [0.94942528735632181]
orngWrap:
2 [0.95402298850574707]
orngWrap:
5 [0.9517241379310345]
orngWrap:
10 [0.94942528735632181]
*** Optimal parameter: mForPruning = 2

To fit m values for naive Bayes, you have to construct a naive Bayesian learner that uses m-estimation of probabilities instead of relative frequencies. See the documentation for Naive Bayesian for an example of how to do that.
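
For reference, a sketch along the lines of that documentation (the estimator constructor names below are taken from the Orange naive Bayes docs; treat this as an illustration rather than a recipe):

Code: Select all
import orange

# naive Bayesian learner with m-estimation of probabilities
bayes = orange.BayesLearner()
bayes.estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m=2)
bayes.conditionalEstimatorConstructor = \
    orange.ConditionalProbabilityEstimatorConstructor_ByRows(
        estimatorConstructor=orange.ProbabilityEstimatorConstructor_m(m=2))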

Postby John » Mon Apr 25, 2005 22:27

Blaz,

I think there's a bug in the code you posted.

The last line of your posted code:
classifier = tunedTree(data)
sets the variable "classifier" to "None". I believe the problem is that tunedTree isn't a tree; rather, it's of type orngWrap.Tune1Parameter.

One way to fix it would be to replace the last line with:
classifier = tree(data)
or
classifier = tunedTree.object(data)

Is this right?

John

Postby Blaz » Tue Apr 26, 2005 21:02

You're partially right :) . Tune1Parameter by default does not return anything and needs to be told otherwise (by setting returnWhat). But Janez and I thought, while answering your post, that the default action should be to return a classifier, so we have now changed orngWrap accordingly. In any case, with both the current version (I have just put it on CVS) and the previous one, the following works:

Code: Select all
import orange, orngWrap, orngTree

data = orange.ExampleTable('voting')
tree = orngTree.TreeLearner()
tunedTree = orngWrap.Tune1Parameter(object=tree, parameter='mForPruning', \
    values=[0,1,2,5,10], verbose=2, \
    returnWhat=orngWrap.TuneParameters.returnClassifier)

classifier = tunedTree(data)
for i in range(10):
    print classifier(data[i], orange.GetProbabilities)


Notice that the only change is in the call to orngWrap.Tune1Parameter. Alternatively, you can also say something like:

Code: Select all
tunedTree.returnWhat = tunedTree.returnClassifier


just prior to calling tunedTree(data).

Hope this helps!

