Random Forests


Postby exogen » Wed Aug 03, 2005 15:26

Has anyone had success modelling Random Forests using Orange? So far I have been using the RPy module to try out the randomForest R package from Python.

It seems like Orange is the perfect package to construct such a model, but I'm currently inexperienced. The problem I'm trying to solve is outlined here (the first project). Does this sound like the right approach?

Here's a quick summary from my statistician friend justifying the approach and summarizing Random Forests:

The idea is that you can't make any good statistical inference from one decision tree. Random Forests works by building lots of decision trees on resampled versions of the data and combining them by majority vote. It does that lots of times, thousands if you want, and it gets more precise the more trees it grows. Then it tells you which independent variables are the most valuable predictors of your dependent variable. In Knoware's case, you don't know what the problem is or how to fix it, so the problem is the dependent variable. It is also a means to evaluate groups, so you can see whether a certain kind of problem clumps with other factors.
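
For concreteness, here is the gist of that description as a toy Python sketch (this is not Orange's API; train_tree is a hypothetical stand-in for any decision tree learner):

Code:
import random

def random_forest_predict(data, example, train_tree, n_trees=1000):
    # train_tree: any function that learns a decision tree from data
    # and returns a classifier; a hypothetical stand-in, not Orange's API
    votes = {}
    for i in range(n_trees):
        # bootstrap sample: len(data) examples drawn with replacement
        sample = [random.choice(data) for j in range(len(data))]
        label = train_tree(sample)(example)
        votes[label] = votes.get(label, 0) + 1
    # return the class with the most votes (majority voting)
    best = None
    for label in votes:
        if best is None or votes[label] > votes[best]:
            best = label
    return best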


Thanks.

Postby Blaz » Fri Aug 12, 2005 15:08

Here is the code for Random Forests that I am about to integrate into the orngEnsemble module (once I finish the documentation :-); thanks also to Janez and Minca, from whom I took the code they used for RF-based feature subset selection):

Code:
import random
from math import sqrt
import orange, orngTree

class SplitConstructor_AttributeSubset(orange.TreeSplitConstructor):
    def __init__(self, scons, attributes, rand = random.Random()):
        self.scons = scons           # split constructor of original tree
        self.attributes = attributes # number of attributes to consider
        self.rand = rand             # a random generator

    def __call__(self, gen, weightID, contingencies, apriori, candidates, clsfr):
        # use at most as many attributes as there are candidates
        n = min(self.attributes, len(candidates))
        cand = [1]*n + [0]*(len(candidates) - n)
        self.rand.shuffle(cand)
        # instead of considering all attributes, invoke the split
        # constructor on a random subset of the attributes only
        t = self.scons(gen, weightID, contingencies, apriori, cand, clsfr)
        return t

class RandomForestLearner(orange.Learner):
    def __new__(cls, examples=None, weight = 0, **kwds):
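        # Orange learner convention: if examples are given at construction,
        # train immediately and return a classifier instead of the learner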
        self = orange.Learner.__new__(cls, **kwds)
        if examples:
            self.__init__(**kwds)
            return self.__call__(examples, weight)
        else:
            return self

    def __init__(self, learner=None, trees=100, attributes=None, name='Random Forest', rand = None):
        self.trees = trees
        self.name = name
        self.learner = learner
        self.attributes = attributes
        if rand:
            self.rand = rand
        else:
            self.rand = random.Random()
            self.rand.seed(0)

        if not learner:
            # tree learner assembled as suggested by Breiman (2001)
            smallTreeLearner = orngTree.TreeLearner(storeNodeClassifier = 0, storeContingencies=0, storeDistributions=1, minExamples=5).instance()
            smallTreeLearner.split.discreteSplitConstructor.measure = smallTreeLearner.split.continuousSplitConstructor.measure = orange.MeasureAttribute_gini()
            smallTreeLearner.split = SplitConstructor_AttributeSubset(smallTreeLearner.split, attributes, self.rand)
            self.learner = smallTreeLearner

    def __call__(self, examples, weight=0):
        # if number of attributes for subset is not set, use square root
        if hasattr(self.learner.split, 'attributes') and not self.learner.split.attributes:
            self.learner.split.attributes = int(sqrt(len(examples.domain.attributes)))

        n = len(examples)
        # build the forest
        classifiers = []
        for i in range(self.trees):
            # draw bootstrap sample
            selection = []
            for j in range(n):
                selection.append(self.rand.randrange(n))
            data = examples.getitems(selection)
            # build the model from the bootstrap sample
            classifiers.append(self.learner(data))

        return RandomForestClassifier(classifiers = classifiers, name=self.name, domain=examples.domain)
       
class RandomForestClassifier:
    def __init__(self, **kwds):
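        # stores the keyword arguments (classifiers, name, domain)
        # passed in by RandomForestLearner above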
        self.__dict__.update(kwds)

    def __call__(self, example, resultType = orange.GetValue):
        from operator import add

        # voting for class probabilities
        if resultType == orange.GetProbabilities or resultType == orange.GetBoth:
            cprob = [0.] * len(self.domain.classVar.values)
            for c in self.classifiers:
                a = [x for x in c(example, orange.GetProbabilities)]
                cprob = map(add, cprob, a)
            norm = sum(cprob)
            for i in range(len(cprob)):
                cprob[i] = cprob[i]/norm

        # voting for crisp class membership, notice that
        # this may not be the same class as one obtaining the
        # highest probability through probability voting
        if resultType == orange.GetValue or resultType == orange.GetBoth:
            cfreq = [0] * len(self.domain.classVar.values)
            for c in self.classifiers:
                cfreq[int(c(example))] += 1
            index = cfreq.index(max(cfreq))
            cvalue = orange.Value(self.domain.classVar, index)

        if resultType == orange.GetValue: return cvalue
        elif resultType == orange.GetProbabilities: return cprob
        else: return (cvalue, cprob)


This is a Random Forest algorithm that closely follows Breiman's proposal (Machine Learning, 2001). It constructs a number of classification trees in which the Gini index is used for attribute scoring and the attributes considered at each node are drawn from a random attribute subset. Each tree is grown from a bootstrap sample of the training data.

The defaults are 100 trees, where the attribute for each node is chosen from a randomly picked subset whose size equals the square root of the number of attributes in the training data. The code above is general enough that you can change all of this, as well as the tree learner used for growing the forest.
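
For instance, with the constructor above, changing these defaults is just a matter of arguments (a sketch using the class posted above):

Code:
# 500 trees; consider 3 randomly picked attributes at each node
forest = RandomForestLearner(trees=500, attributes=3)

# a smaller forest with a custom name, e.g. for benchmarking scripts
smallForest = RandomForestLearner(trees=25, name='small forest')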

The principal trick in the code is SplitConstructor_AttributeSubset, which replaces the splitting function used in Orange's tree induction. It behaves exactly like the original function implemented in C, except that it uses a random subset instead of the complete set of attributes. This is also a nice example of how a component-based architecture can help in prototyping new methods: prototype a component in Python, then use it with an algorithm that was developed in C.
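
To illustrate, the same Python component can be plugged into a standalone tree learner outside the forest (a sketch built from the classes and the orngTree pattern above):

Code:
import orange, orngTree    # as in the code above

# grow a single tree where every node is split on a random
# subset of two attributes
treeLearner = orngTree.TreeLearner(minExamples=5).instance()
treeLearner.split = SplitConstructor_AttributeSubset(treeLearner.split, 2)
data = orange.ExampleTable('bupa.tab')
randomizedTree = treeLearner(data)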

When presented with an example, the forest's trees use simple voting (all classifiers count equally). When predicting class probabilities, the code above uses the average of the probabilities predicted by the set of classifiers (see also the note in RandomForestClassifier: the majority-vote class need not be the class with the highest averaged probability).
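
A toy numeric illustration of that note (plain Python, made-up probabilities):

Code:
# three trees, two classes: crisp votes vs. averaged probabilities
probs = [[0.6, 0.4], [0.6, 0.4], [0.0, 1.0]]
votes = [0, 0]
for p in probs:
    votes[p.index(max(p))] += 1          # each tree casts one crisp vote
avg = [sum(col) / len(probs) for col in zip(*probs)]
print votes   # [2, 1]: class 0 wins the majority vote
print avg     # [0.4, 0.6]: class 1 has the higher average probability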

Here is the code that tests the above and compares RF to a single tree on a single data set:

Code:
# continues the script above (uses RandomForestLearner defined there)
data = orange.ExampleTable('bupa.tab')
forest = RandomForestLearner(trees=50, name="forest")

tree = orngTree.TreeLearner(minExamples=2, mForPruning=2, sameMajorityPruning=True, name='tree')
learners = [forest, tree]

import orngTest, orngStat
results = orngTest.crossValidation(learners, data, folds=10)
print "Learner  CA     Brier    AUC"
for i in range(len(learners)):
    print "%-8s %5.3f  %5.3f  %5.3f" % (learners[i].name, \
        orngStat.CA(results)[i],
        orngStat.BrierScore(results)[i],
        orngStat.AUC(results)[i])

Postby exogen » Fri Aug 12, 2005 16:02

Thanks, Blaz! I was beginning to lose hope -- your post certainly made my day. The code looks clean and easy to follow. I have to ask: was this already available somewhere, or is it a recent addition?

Thanks again! I'll let you know if it works out for my project.

Postby Blaz » Tue Aug 16, 2005 15:05

No, that's new (actually, it was partly prompted by your question). We have also just updated the orngEnsemble module (get it from the recent snapshot) and its documentation: see http://www.ailab.si/orange/doc/modules/orngEnsemble.htm.
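
Assuming the updated module exports the learner under the same name as the code posted above (check the linked documentation for the exact interface), usage would presumably look like:

Code:
import orange, orngEnsemble

data = orange.ExampleTable('bupa.tab')
forest = orngEnsemble.RandomForestLearner(trees=50)   # name assumed, see docs
classifier = forest(data)
print classifier(data[0])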

