Changeset 3573:cce66748b0b9 in orange


Timestamp: 04/23/07 13:25:30
Author: markotoplak
Branch: default
Convert: 7063f522de9786e445af235ed69a1b4dc435e2df
Message: Added documentation and example for MeasureAttribute_randomForests.

Location: orange/doc/modules
Files: 1 added, 1 edited

Legend: lines prefixed with "+" were added, lines prefixed with "-" were removed; other lines are unchanged context, with "…" marking elided lines.
  • orange/doc/modules/orngEnsemble.htm

--- orange/doc/modules/orngEnsemble.htm (r2828)
+++ orange/doc/modules/orngEnsemble.htm (r3573)
…
 votes for a particular class.<p>
 
+<H3>Example</H3>
+<p>See <a href="#ble">BoostedLearner example</a>.</p>
+
 
 <h2>BoostedLearner</h2>
…
   <dd>The name of the learner (default: AdaBoost.M1).</dd>
 </dl>
+
+
+<a name="ble"><H3>Example</H3>
+
+<P>Let us try boosting and bagging on the lymphography data set and use
+<code>TreeLearner</code> with post-pruning as a base learner. For
+testing, we use 10-fold cross validation and observe classification
+accuracy.</p>
+
+<p class="header"><a href="ensemble.py">ensemble.py</a> (uses <a href=
+"lymphography.tab">lymphography.tab</a>)</p>
+<XMP class=code>import orange, orngEnsemble, orngTree
+import orngTest, orngStat
+
+tree = orngTree.TreeLearner(mForPruning=2, name="tree")
+bs = orngEnsemble.BoostedLearner(tree, name="boosted tree")
+bg = orngEnsemble.BaggedLearner(tree, name="bagged tree")
+
+data = orange.ExampleTable("lymphography.tab")
+
+learners = [tree, bs, bg]
+results = orngTest.crossValidation(learners, data)
+print "Classification Accuracy:"
+for i in range(len(learners)):
+    print "%15s: %5.3f" % (learners[i].name, orngStat.CA(results)[i])
+</XMP>
+
+<p>Running this script, we may get something like:</p>
+<XMP class=code>Classification Accuracy:
+           tree: 0.769
+   boosted tree: 0.782
+    bagged tree: 0.783
+</XMP>
+
 
 
…
 this will return the averaged probabilities from each of the trees.</p>
 
-
-<hr>
-
-<H2>Examples</H2>
-
-<P>Let us try boosting and bagging on Iris data set and use
-<code>TreeLearner</code> with post-pruning as a base learner. For
-testing, we use 10-fold cross validation and observe classification
-accuracy.</p>
-
-<p class="header"><a href="ensemble.py">ensemble.py</a> (uses <a href=
-"iris.tab">iris.tab</a>)</p>
-<XMP class=code>import orange, orngEnsemble, orngTree
-import orngTest, orngStat
-
-tree = orngTree.TreeLearner(mForPruning=2, name="tree")
-bs = orngEnsemble.BoostedLearner(tree, name="boosted tree")
-bg = orngEnsemble.BaggedLearner(tree, name="bagged tree")
-
-data = orange.ExampleTable("lymphography.tab")
-
-learners = [tree, bs, bg]
-results = orngTest.crossValidation(learners, data)
-print "Classification Accuracy:"
-for i in range(len(learners)):
-    print ("%15s: %5.3f") % (learners[i].name, orngStat.CA(results)[i])
-</XMP>
-
-<p>Running this script, we may get something like:</p>
-<XMP class=code>Classification Accuracy:
-Classification Accuracy:
-           tree: 0.769
-   boosted tree: 0.782
-    bagged tree: 0.783
-</XMP>
-
+<h3>Examples</h3>
 
 <p>The following script assembles a random forest learner and compares
…
 the tree in a constructed random forest.</p>
 
+
+
+<h2>MeasureAttribute_randomForests</h2>
+
+<p>L. Breiman (2001) suggested the possibility of using random forests
+as a non-myopic measure of attribute importance.</p>
+
+<p>Assessing the relevance of attributes with random forests is based
+on the idea that randomly changing the value of an important attribute
+greatly affects an example's classification, while changing the value
+of an unimportant attribute doesn't affect it much. The implemented
+algorithm accumulates attribute scores over a given number of trees.
+For a single tree, the importance of each attribute is computed as the
+number of correctly classified out-of-bag (OOB) examples minus the
+number of correctly classified OOB examples when the attribute's
+values are randomly shuffled. The accumulated attribute scores are
+divided by the number of trees used and multiplied by 100 before they
+are returned.</p>
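
<p>To make the per-tree computation concrete, here is a minimal sketch
in plain Python (an illustration, not the module's code: the
<code>predict</code> function and the representation of OOB examples as
(features, label) pairs are assumptions made for the sketch):</p>

<xmp class=code>import random

def tree_importance(predict, oob, attr):
    # 'predict' maps a list of attribute values to a class;
    # 'oob' is a list of (features, label) out-of-bag examples.
    correct = sum(1 for features, label in oob
                  if predict(features) == label)
    # randomly shuffle the chosen attribute's values across OOB examples
    column = [features[attr] for features, label in oob]
    random.shuffle(column)
    shuffled_correct = 0
    for (features, label), value in zip(oob, column):
        permuted = list(features)
        permuted[attr] = value
        if predict(permuted) == label:
            shuffled_correct += 1
    # per-tree score: drop in correctly classified OOB examples
    return correct - shuffled_correct

def forest_importance(per_tree_scores, trees):
    # accumulated scores are averaged over the trees and scaled by 100
    return 100.0 * sum(per_tree_scores) / trees
</xmp>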
+
+<p class=section>Attributes</p>
+<dl class=attributes>
+
+  <dt>trees</dt>
+  <dd>Number of trees in the forest (default: 100).</dd>
+
+  <dt>learner</dt>
+  <dd>Although not required, one can use this argument to pass one's
+  own tree induction algorithm. If none is passed,
+  <code>MeasureAttribute_randomForests</code> will use Orange's tree
+  induction algorithm, set up so that nodes with fewer than 5 examples
+  are not considered for (further) splitting. (default: None)</dd>
+
+  <dt>attributes</dt>
+  <dd>Number of attributes in the randomly drawn subset considered
+  when searching for the best attribute to split a node during tree
+  growing (default: None, in which case the square root of the number
+  of attributes in the example set is used).</dd>
+
+  <dt>rand</dt>
+  <dd>Random generator used in bootstrap sampling. If none is passed,
+  Python's Random from the random library is used, with the seed
+  initialized to 0. All four arguments are illustrated in the sketch
+  below.</dd>
+
+</dl>
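
<p>For illustration, a measure configured with all of the arguments
listed above might be constructed as follows (a sketch that assumes the
constructor accepts the keywords exactly as documented):</p>

<xmp class=code>import random
import orange, orngEnsemble, orngTree

data = orange.ExampleTable("iris.tab")

measure = orngEnsemble.MeasureAttribute_randomForests(
    trees=50,                        # a smaller forest, scored faster
    learner=orngTree.TreeLearner(),  # our own base tree inducer
    attributes=3,                    # subset size drawn at each split
    rand=random.Random(42))          # reproducible bootstrap sampling

print measure(0, data)
</xmp>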
+
+<p>Computation of attribute importance with random forests is rather
+slow. Moreover, importances for all attributes need to be computed
+simultaneously. Since we normally compute attribute importance with
+random forests for all attributes in the dataset,
+<CODE>MeasureAttribute_randomForests</CODE> caches the results. When
+it is called to compute the quality of a certain attribute, it
+computes the qualities of all attributes in the dataset. When called
+again, it uses the stored results if the domain is still the same and
+the example table has not changed (this is done by checking the
+example table's version and is not foolproof; it won't detect changes
+to the values of existing examples, but will notice adding and
+removing examples; see the page on
+<A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A> for details).</P>
+
+<p>Caching will only have an effect if you use the same measure
+instance for all attributes in the domain.</p>
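
<p>A minimal sketch of how the caching behaves: the first call builds
the forest and scores every attribute, while a second call on the same
instance, domain, and data should be answered from the cache (the
timing is only illustrative):</p>

<xmp class=code>import time
import orange, orngEnsemble

data = orange.ExampleTable("iris.tab")
measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)

start = time.time()
measure(0, data)   # builds the forest, caches all importances
slow = time.time() - start

start = time.time()
measure(1, data)   # same domain and data: served from the cache
fast = time.time() - start

print "first call: %.2fs, second call: %.2fs" % (slow, fast)
</xmp>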
+
+<h3>Example</h3>
+
+<p>The following script demonstrates measuring attribute importance
+with random forests.</p>
+
+<p class="header"><a href="ensemble4.py">ensemble4.py</a> (uses <a href=
+"iris.tab">iris.tab</a>)</p>
+<xmp class=code>import orange, orngEnsemble, random
+
+data = orange.ExampleTable("iris.tab")
+
+measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)
+
+# call by attribute index
+imp0 = measure(0, data)
+# call by orange.Variable
+imp1 = measure(data.domain.attributes[1], data)
+print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)
+
+print "different random seed"
+measure = orngEnsemble.MeasureAttribute_randomForests(trees=100, rand=random.Random(10))
+
+imp0 = measure(0, data)
+imp1 = measure(data.domain.attributes[1], data)
+print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)
+
+print "All importances:"
+imps = measure.importances(data)
+for i, imp in enumerate(imps):
+    print "%15s: %6.2f" % (data.domain.attributes[i].name, imp)
+</xmp>
+
+<p>Corresponding output:</p>
+
+<xmp class=code>first: 0.32, second: 0.04
+
+different random seed
+first: 0.33, second: 0.14
+
+All importances:
+   sepal length:   0.33
+    sepal width:   0.14
+   petal length:  15.16
+    petal width:  48.59
+</xmp>
+
 <HR>
 <H2>References</H2>
…
 1996. [<a href="http://www.rulequest.com/Personal/q.aaai96.ps">PS</a>]</P>
 
-<p>L Brieman. Random Forests. Machine Learning, 45, 5-32, 2001. [<a href="http://www.springerlink.com/app/home/contribution.asp?wasp=bd3f27906cfa4bfdaa16527c7c167456&referrer=parent&backto=issue,1,5;journal,32,140;linkingpublicationresults,1:100309,1">SpringerLink</a>]</p>
+<p>L. Breiman. Random Forests. Machine Learning, 45, 5-32, 2001. [<a href="http://www.springerlink.com/content/u0p06167n6173512/">SpringerLink</a>]</p>
 
 <p>M. Robnik-Sikonja. Improving Random Forests. In Proc. of the European Conference on Machine Learning (ECML 2004), pp. 359-370, 2004. [<a href="http://lkm.fri.uni-lj.si/rmarko/papers/robnik04-ecml.pdf">PDF</a>]</p>
…