Orange Forum • View topic - Random Forests

## Random Forests

A place to ask questions about methods in Orange and how they are used and other general support.

### Random Forests

Has anyone had success modelling Random Forests using Orange? So far I have been using the RPy module to try out the randomForest R package from Python.

It seems like Orange is the perfect package to construct such a model, but I'm currently inexperienced. The problem I'm trying to solve is outlined here (the first project). Does this sound like the right approach?

Here's a quick summary from my statistician friend justifying the approach and summarizing Random Forests:

The idea is that you can't make a good statistical inference from a single decision tree. Random Forests works by growing many decision trees on random samples of the data and combining their predictions by majority vote. It does that lots of times, thousands if you want, and the results get more stable the more trees it grows. Then it tells you which independent variables are the most valuable predictors of your dependent variable. In Knoware's case, you don't know what the problem is or how to fix it, so the problem is the dependent variable. It is also a means to evaluate groups, so you can see whether a certain kind of problem clumps together with other factors.
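To make that summary concrete, here is a stdlib-only Python sketch of the two core ingredients, bootstrap sampling and majority voting. `bootstrap_sample` and `majority_vote` are hypothetical helper names for illustration, not part of Orange:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # draw len(data) items with replacement; roughly a third of the
    # original items are left out of any one sample
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    # the class predicted by the most trees wins
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(len(sample))                     # same size as the original data
print(majority_vote(['a', 'b', 'a']))  # 'a'
```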

Thanks.

Here is the code for Random Forests that I am about to integrate into the orngEnsemble module (once I finish the documentation :-); thanks also to Janez and Minca, from whom I took the code they used for RF-based feature subset selection):

Code: Select all

```python
import random
from math import sqrt, floor
import orange, orngTree

class SplitConstructor_AttributeSubset(orange.TreeSplitConstructor):
    def __init__(self, scons, attributes, rand = random.Random()):
        self.scons = scons           # split constructor of original tree
        self.attributes = attributes # number of attributes to consider
        self.rand = rand             # a random generator

    def __call__(self, gen, weightID, contingencies, apriori, candidates, clsfr):
        cand = [1]*self.attributes + [0]*(len(candidates) - self.attributes)
        self.rand.shuffle(cand)
        # instead of with all attributes, invoke the split constructor
        # only for the random subset of attributes
        t = self.scons(gen, weightID, contingencies, apriori, cand, clsfr)
        return t

class RandomForestLearner(orange.Learner):
    def __new__(cls, examples=None, weight = 0, **kwds):
        self = orange.Learner.__new__(cls, **kwds)
        if examples:
            self.__init__(**kwds)
            return self.__call__(examples, weight)
        else:
            return self

    def __init__(self, learner=None, trees=100, attributes=None, name='Random Forest', rand = None):
        self.trees = trees
        self.name = name
        self.learner = learner
        self.attributes = attributes
        if rand:
            self.rand = rand
        else:
            self.rand = random.Random()
            self.rand.seed(0)
        if not learner:
            # tree learner assembled as suggested by Breiman (2001)
            smallTreeLearner = orngTree.TreeLearner(storeNodeClassifier = 0, storeContingencies=0, storeDistributions=1, minExamples=5).instance()
            smallTreeLearner.split.discreteSplitConstructor.measure = \
                smallTreeLearner.split.continuousSplitConstructor.measure = orange.MeasureAttribute_gini()
            smallTreeLearner.split = SplitConstructor_AttributeSubset(smallTreeLearner.split, attributes, self.rand)
            self.learner = smallTreeLearner

    def __call__(self, examples, weight=0):
        # if the number of attributes for the subset is not set, use the square root
        if hasattr(self.learner.split, 'attributes') and not self.learner.split.attributes:
            self.learner.split.attributes = int(sqrt(len(examples.domain.attributes)))
        n = len(examples)
        # build the forest
        classifiers = []
        for i in range(self.trees):
            # draw a bootstrap sample
            selection = []
            for j in range(n):
                selection.append(self.rand.randrange(n))
            data = examples.getitems(selection)
            # build the model from the bootstrap sample
            classifiers.append(self.learner(data))
        return RandomForestClassifier(classifiers = classifiers, name=self.name, domain=examples.domain)

class RandomForestClassifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType = orange.GetValue):
        from operator import add
        # voting for class probabilities
        if resultType == orange.GetProbabilities or resultType == orange.GetBoth:
            cprob = [0.] * len(self.domain.classVar.values)
            for c in self.classifiers:
                a = [x for x in c(example, orange.GetProbabilities)]
                cprob = map(add, cprob, a)
            norm = sum(cprob)
            for i in range(len(cprob)):
                cprob[i] = cprob[i]/norm
        # voting for crisp class membership; note that this may not be
        # the same class as the one obtaining the highest probability
        # through probability voting
        if resultType == orange.GetValue or resultType == orange.GetBoth:
            cfreq = [0] * len(self.domain.classVar.values)
            for c in self.classifiers:
                cfreq[int(c(example))] += 1
            index = cfreq.index(max(cfreq))
            cvalue = orange.Value(self.domain.classVar, index)
        if resultType == orange.GetValue: return cvalue
        elif resultType == orange.GetProbabilities: return cprob
        else: return (cvalue, cprob)
```

This is a Random Forest algorithm that closely follows Breiman's proposal (Machine Learning, 2001). It constructs a number of classification trees in which the Gini index is used for attribute scoring, and where the attributes considered at each node are drawn from a random attribute subset. Each tree is grown from a bootstrap sample of the training data.

The defaults are 100 trees, where at each node an attribute is chosen from a randomly picked subset whose size equals the square root of the number of attributes in the training data. The code above is general enough that you can change all of this, including the tree learner used to grow the forest.
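The square-root default corresponds to this small helper; `default_subset_size` is a hypothetical name for illustration (the code above computes the same value inline):

```python
from math import sqrt

def default_subset_size(n_attributes):
    # Breiman's default used above: the (floored) square root of the
    # number of attributes in the training data
    return int(sqrt(n_attributes))

print(default_subset_size(25))   # 5
print(default_subset_size(100))  # 10
```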

The principal trick in the code is SplitConstructor_AttributeSubset, which replaces the splitting function used in Orange's tree induction. It behaves exactly like the original function implemented in C, except that it uses a random subset instead of the complete set of attributes. This is also a nice example of how a component-based architecture can help in prototyping new methods (prototype a component in Python and then use it with an algorithm that was developed in C).

When presented with an example, the forest uses simple voting (all classifiers being equal). When predicting class probabilities, the code above averages the probabilities predicted by the set of classifiers (see also the note in RandomForestClassifier).
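A tiny made-up numeric example shows why the crisp vote can disagree with the averaged probabilities (the three per-tree distributions below are invented for illustration):

```python
# three hypothetical trees predicting probabilities for classes (A, B)
tree_probs = [(0.6, 0.4), (0.6, 0.4), (0.1, 0.9)]

# crisp vote: each tree votes for its most probable class
votes = ['A' if a > b else 'B' for a, b in tree_probs]
vote_winner = max(set(votes), key=votes.count)

# probability vote: average the per-tree distributions
avg_a = sum(a for a, _ in tree_probs) / len(tree_probs)
avg_b = sum(b for _, b in tree_probs) / len(tree_probs)
prob_winner = 'A' if avg_a > avg_b else 'B'

print(vote_winner)  # 'A' (two of three trees favour A)
print(prob_winner)  # 'B' (the averaged probability favours B)
```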

Here is the code that tests the above and compares RF to a single tree on a single data set:

Code: Select all

```python
data = orange.ExampleTable('bupa.tab')
forest = RandomForestLearner(trees=50, name="forest")
tree = orngTree.TreeLearner(minExamples=2, mForPruning=2, sameMajorityPruning=True, name='tree')
learners = [forest, tree]

import orngTest, orngStat
results = orngTest.crossValidation(learners, data, folds=10)
print "Learner  CA     Brier    AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name, \
        orngStat.CA(results)[i],
        orngStat.BrierScore(results)[i],
        orngStat.AUC(results)[i])
```