## Random Forests

4 posts • Page **1** of **1**

### Random Forests

Has anyone had success modelling Random Forests using Orange? So far I have been using the RPy module to try out the randomForest R package from Python.

It seems like Orange is the perfect package to construct such a model, but I'm currently inexperienced. The problem I'm trying to solve is outlined here (the first project). Does this sound like the right approach?

Here's a quick summary from my statistician friend justifying the approach and summarizing Random Forests:

The idea is that you can't make any good statistical inference from one decision tree. Random Forests works by building many decision trees, each splitting the data, and combining them by majority vote. It does that lots of times, thousands if you want, and it gets more precise the more trees it grows. Then it tells you which independent variables are the most valuable predictors of your dependent variable. In Knoware's case, you don't know what the problem is or how to fix it, so the problem is the dependent variable. It is also a means to evaluate groups, so you can see whether a certain kind of problem clumps with other factors.

Thanks.
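The bootstrap-and-vote idea in the summary above can be sketched in plain Python, without Orange. This is only a toy illustration (the function names and the constant "classifiers" are made up here), not the real algorithm:

```python
import random

def bootstrap_sample(data, rand):
    # draw len(data) examples with replacement
    return [rand.choice(data) for _ in data]

def forest_predict(trees, example):
    # every tree votes; the class with the most votes wins
    votes = [tree(example) for tree in trees]
    return max(set(votes), key=votes.count)

# toy demo: three "trees" that are just constant classifiers
trees = [lambda e: "faulty", lambda e: "faulty", lambda e: "ok"]
print(forest_predict(trees, None))  # faulty
```

In the real thing each tree is trained on its own bootstrap sample, which is what makes the trees differ and the vote informative.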

```python
import random
from math import sqrt, floor
import orange, orngTree

class SplitConstructor_AttributeSubset(orange.TreeSplitConstructor):
    def __init__(self, scons, attributes, rand=random.Random()):
        self.scons = scons            # split constructor of the original tree
        self.attributes = attributes  # number of attributes to consider
        self.rand = rand              # random generator

    def __call__(self, gen, weightID, contingencies, apriori, candidates, clsfr):
        # instead of all attributes, invoke the split constructor
        # on a random subset of them
        cand = [1] * self.attributes + [0] * (len(candidates) - self.attributes)
        self.rand.shuffle(cand)
        return self.scons(gen, weightID, contingencies, apriori, cand, clsfr)

class RandomForestLearner(orange.Learner):
    def __new__(cls, examples=None, weight=0, **kwds):
        self = orange.Learner.__new__(cls, **kwds)
        if examples:
            self.__init__(**kwds)
            return self.__call__(examples, weight)
        else:
            return self

    def __init__(self, learner=None, trees=100, attributes=None,
                 name='Random Forest', rand=None):
        self.trees = trees
        self.name = name
        self.learner = learner
        self.attributes = attributes
        if rand:
            self.rand = rand
        else:
            self.rand = random.Random()
            self.rand.seed(0)
        if not learner:
            # tree learner assembled as suggested by Breiman (2001)
            smallTreeLearner = orngTree.TreeLearner(
                storeNodeClassifier=0, storeContingencies=0,
                storeDistributions=1, minExamples=5).instance()
            smallTreeLearner.split.discreteSplitConstructor.measure = \
                smallTreeLearner.split.continuousSplitConstructor.measure = \
                orange.MeasureAttribute_gini()
            smallTreeLearner.split = SplitConstructor_AttributeSubset(
                smallTreeLearner.split, attributes, self.rand)
            self.learner = smallTreeLearner

    def __call__(self, examples, weight=0):
        # if the number of attributes for the subset is not set, use the square root
        if hasattr(self.learner.split, 'attributes') and \
           not self.learner.split.attributes:
            self.learner.split.attributes = int(sqrt(len(examples.domain.attributes)))

        n = len(examples)
        # build the forest
        classifiers = []
        for i in range(self.trees):
            # draw a bootstrap sample
            selection = []
            for j in range(n):
                selection.append(self.rand.randrange(n))
            data = examples.getitems(selection)
            # build a tree from the bootstrap sample
            classifiers.append(self.learner(data))
        return RandomForestClassifier(classifiers=classifiers, name=self.name,
                                      domain=examples.domain)

class RandomForestClassifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        from operator import add

        # voting for class probabilities
        if resultType == orange.GetProbabilities or resultType == orange.GetBoth:
            cprob = [0.] * len(self.domain.classVar.values)
            for c in self.classifiers:
                a = [x for x in c(example, orange.GetProbabilities)]
                cprob = map(add, cprob, a)
            norm = sum(cprob)
            for i in range(len(cprob)):
                cprob[i] = cprob[i] / norm

        # voting for crisp class membership; note that this may not be
        # the same class as the one obtaining the highest probability
        # through probability voting
        if resultType == orange.GetValue or resultType == orange.GetBoth:
            cfreq = [0] * len(self.domain.classVar.values)
            for c in self.classifiers:
                cfreq[int(c(example))] += 1
            index = cfreq.index(max(cfreq))
            cvalue = orange.Value(self.domain.classVar, index)

        if resultType == orange.GetValue: return cvalue
        elif resultType == orange.GetProbabilities: return cprob
        else: return (cvalue, cprob)
```

This is a Random Forest algorithm that closely follows Breiman's proposal (Machine Learning, 2001). It constructs a number of classification trees, using the Gini index for attribute scoring and choosing the attribute at each node from a random attribute subset. Each tree is grown from a bootstrap sample of the training data.

The defaults are: 100 trees, where the attribute for each node is chosen from a randomly picked subset whose size equals the square root of the number of attributes in the training data. The code above is general enough that you can change all of this, including the tree learner used to grow the forest.
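The square-root default is easy to compute on its own; a one-function sketch (the function name is mine, matching the `int(sqrt(...))` line in the learner above):

```python
from math import sqrt

def subset_size(n_attributes):
    # Breiman's default: floor of the square root of the attribute count
    return int(sqrt(n_attributes))

print(subset_size(9))   # 3
print(subset_size(35))  # 5
```

So a data set with 35 attributes would have each split pick from a random subset of 5.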

The principal trick in the code is SplitConstructor_AttributeSubset, which replaces the splitting function used in Orange's tree induction. This function behaves the same as the original (implemented in C), except that it uses a subset instead of the complete set of attributes. It is also a nice example of how a component-based architecture helps in prototyping new methods (prototype a component in Python, then use it with an algorithm developed in C).
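The masking idea itself can be shown without Orange: the split constructor receives a 0/1 candidate list, and shuffling a list that contains exactly `attributes` ones restricts the choice to a random subset. A stand-alone sketch (function name is mine):

```python
import random

def candidate_mask(n_candidates, n_subset, rand):
    # 1 marks an attribute the split constructor may consider, 0 masks it out
    mask = [1] * n_subset + [0] * (n_candidates - n_subset)
    rand.shuffle(mask)
    return mask

rand = random.Random(0)
print(candidate_mask(10, 3, rand))  # exactly three 1s, in random positions
```

Each call shuffles a fresh mask, so every node in every tree gets its own random subset.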

When presented with an example, the forest's trees use simple voting (all classifiers being equal). When predicting class probabilities, the code above averages the probabilities predicted by the set of classifiers (see also the comment in RandomForestClassifier).
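To see why crisp voting and probability averaging can disagree, here is a small stand-alone sketch with made-up numbers (not output of the forest above):

```python
def average_probs(prob_lists):
    # element-wise sum of the classifiers' probability vectors, then normalize
    sums = [sum(p) for p in zip(*prob_lists)]
    total = sum(sums)
    return [s / total for s in sums]

# two classifiers barely prefer class 0, one strongly prefers class 1
probs = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
votes = [p.index(max(p)) for p in probs]     # crisp votes: [0, 0, 1]
majority = max(set(votes), key=votes.count)  # class 0 wins the vote
avg = average_probs(probs)                   # class 1 wins on averaged probability
```

Here the majority vote picks class 0, while the averaged probabilities favor class 1, which is exactly the situation the comment in RandomForestClassifier warns about.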

Here is the code that tests the above and compares RF to a single tree on a single data set:

```python
data = orange.ExampleTable('bupa.tab')
forest = RandomForestLearner(trees=50, name="forest")
tree = orngTree.TreeLearner(minExamples=2, mForPruning=2,
                            sameMajorityPruning=True, name='tree')
learners = [forest, tree]

import orngTest, orngStat
results = orngTest.crossValidation(learners, data, folds=10)
print "Learner  CA    Brier AUC"
for i in range(len(learners)):
    print "%-8s %5.3f %5.3f %5.3f" % (learners[i].name,
        orngStat.CA(results)[i],
        orngStat.BrierScore(results)[i],
        orngStat.AUC(results)[i])
```
