Orange Forum • View topic - treeLearner vs. rndLearner


treeLearner vs. rndLearner

Post by Yakov » Mon Mar 09, 2009 21:42

As I was running the treeLearner vs. rndLearner example (p. 8 of the "From Experimental Machine Learning to Interactive Data Mining" white paper), I discovered, somewhat to my surprise, that rndLearner sometimes produces smaller trees with fewer errors. For example, it achieved 6 errors (on the original data set) with a tree size of 141, vs. a tree size of 229 and 10 errors for treeLearner. I'd be interested in an explanation of this. I'm attaching my code. Thanks.

Code:
import orange, random

def randomChoice(instances, *args):
    # Split constructor that ignores the data and branches on a
    # randomly chosen attribute.
    attr = random.choice(instances.domain.attributes)
    cl = orange.ClassifierFromVar(whichVar=attr, classVar=attr)
    return cl, attr.values, None, 1

def total_errors(classifier, data):
    # Number of misclassified examples on the given data set.
    return sum(classifier(datum) != datum.getclass() for datum in data)

data = orange.ExampleTable('')

treeLearner = orange.TreeLearner()
tree = treeLearner(data)

rndLearner = orange.TreeLearner()
rndLearner.split = randomChoice

for i in range(10):
    rndtree = rndLearner(data)
    print tree.treesize(), 'vs.', rndtree.treesize()
    if tree.treesize() > rndtree.treesize():
        learners = {'Tree': tree, 'Random tree': rndtree}
        for label in learners:
            print label, 'errors:', total_errors(learners[label], data)

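One thing worth keeping in mind about the loop above: reporting error counts only when a random tree happens to beat the deterministic one effectively selects the best of ten random draws. A minimal sketch in plain Python (no Orange; the example counts and error rates are made-up numbers, not taken from the thread) of how the minimum over several random restarts can look better than a single fixed classifier, even when each individual random tree is worse on average:

```python
import random

random.seed(0)

N = 150            # number of examples (iris-sized data set)
P_FIXED = 10 / N   # assumed error rate of the deterministic tree
P_RANDOM = 12 / N  # assumed (worse) average error rate of one random tree

def errors(p, n=N):
    # Simulate the number of misclassified examples at error rate p.
    return sum(random.random() < p for _ in range(n))

TRIALS = 1000
wins = 0
for _ in range(TRIALS):
    fixed = errors(P_FIXED)
    # Ten random restarts, keep the best one -- as the loop in the
    # posted code implicitly does.
    best_random = min(errors(P_RANDOM) for _ in range(10))
    if best_random < fixed:
        wins += 1

print("best-of-10 random beats fixed in %d of %d trials" % (wins, TRIALS))
```

Under these made-up rates the selected best-of-ten wins most of the time, so a "random tree with fewer errors" need not mean random splitting is better per se.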
Post by Janez » Wed Mar 11, 2009 12:30

Before I go into exploring this: how old is your Orange? From September till a few weeks ago there was a nasty bug in classification trees which made nonsense trees in some cases. Are you working with the latest snapshot?

Post by Yakov » Wed Mar 11, 2009 19:40

Well, before this I was using the stable Windows distribution, orange-win-1.0-py2.5.exe. I have downloaded and installed the latest snapshot Windows distribution (orange-win-snapshot-2009-03-11-py2.5.exe), and the problem is still there.

What I've noticed is that '' is still dated 9/25/2008. So perhaps your new code did not get propagated to the latest snapshot distribution? (I also see files dated 2/11/2009, so some new code did propagate.)

Should I "upgrade" from a code repository to get the latest and greatest?

Many thanks.

Post by Blaz » Mon Mar 16, 2009 16:09

Try running the tree example script from orange/doc/modules. If the output is something like:
Code:
petal width<=0.800: Iris-setosa 100% (100.00%)
petal width>0.800
|    petal width<=1.750
|    |    petal length<=5.350: Iris-versicolor 88% (108.57%)
|    |    petal length>5.350: Iris-virginica 100% (122.73%)
|    petal width>1.750
|    |    petal length<=4.850: Iris-virginica 33% (34.85%)
|    |    petal length>4.850: Iris-virginica 100% (104.55%)

then you're using the code without the bug that Janez mentioned.

Also, the dates on your orngTree look OK (though on SVN the date of the last change is Aug 4, 2008): orngTree is simply a wrapper around Orange's core tree inducer (written in C), so it does not need to be updated with every update of Orange's core.

Post by Yakov » Mon Mar 16, 2009 18:38

Thanks for the tip. I've checked my output, and it looks identical to what you have.

It may well be the case that random trees are sometimes better (smaller and more accurate) than trees obtained by a deterministic algorithm, especially on a noisy data set, and especially if the deterministic tree is relatively large to start with. What confused me initially was the bold claim that the deterministic tree will (always) be better than a random tree, and that does not seem to be the case in this specific setting.
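For what it's worth, greedy split selection is only locally optimal, and it is easy to construct data where it is misled. A minimal sketch in plain Python (no Orange; the data set and gain function are my own toy construction, not from the white paper): on XOR-like data, both truly relevant attributes have zero single-attribute information gain, while a weakly correlated noise attribute has positive gain, so a greedy learner splits on the wrong attribute first, whereas a random choice picks a relevant attribute two times out of three.

```python
import math

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    out = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        out -= p * math.log(p, 2)
    return out

def gain(rows, labels, attr):
    # Information gain of splitting on attribute index `attr`.
    n = len(labels)
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        g -= len(sub) / n * entropy(sub)
    return g

# Attributes: a, b (class = a XOR b) and c, a weakly informative one.
rows = [
    (0, 0, 0), (0, 0, 0),
    (0, 1, 1), (0, 1, 0),
    (1, 0, 1), (1, 0, 0),
    (1, 1, 0), (1, 1, 1),
]
labels = [a ^ b for a, b, c in rows]

for i, name in enumerate('abc'):
    print(name, round(gain(rows, labels, i), 3))
```

Here gain is 0 for both a and b but positive for the noise attribute c, so a purely greedy first split goes to c.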

I'm not sure you'd want to breed better trees by random variations (in the GA vein), though :-)

Thanks to both of you for your help. Take care.
