Changeset 7334:7424acc0c705 in orange


Timestamp: 02/03/11 19:40:22 (3 years ago)
Author: crt <crtomir.gorup@…>
Branch: default
Convert: 5a8ab4d0c54d25318f3c559da2c5eca58d046074
Message: Added Orange.ensemble.forest.MeasureAttribute, changed orange.cluster to Orange.cluster, added Orange.misc
Location: orange
Files: 2 added, 1 deleted, 4 edited

Legend: unchanged lines are shown as-is; added lines are prefixed with +, removed lines with -, and elided runs of lines with ….
  • orange/Orange/ensemble/__init__.py

    (diff from r7285 to r7334)

      Boosting
      ==================
    - .. index ensemble boosting
    + .. index:: ensemble boosting
      .. autoclass:: Orange.ensemble.bagging.BaggedLearner
         :members:
    …
      Bagging
      ==================
    - .. index ensemble bagging
    + .. index:: ensemble bagging
      .. autoclass:: Orange.ensemble.boosting.BoostedLearner
        :members:
    +
    + Example
    + ========
    + Let us try boosting and bagging on the Lymphography data set and use a TreeLearner
    + with post-pruning as the base learner. For testing, we use 10-fold cross
    + validation and observe classification accuracy.
    +
    + `ensemble.py`_ (uses `lymphography.tab`_)
    +
    + .. literalinclude:: code/ensemble.py
    +
    + .. _lymphography.tab: code/lymphography.tab
    + .. _ensemble.py: code/ensemble.py
    +
    + Running this script, we may get something like::
    +
    +   TODO, we have to wait for an executable TreeLearner with m-pruning
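
The sample output above is still pending, so here is a minimal sketch of what a script along the lines of `ensemble.py` could look like, assuming the Orange 2.x convenience modules (orngTree, orngEnsemble, orngTest, orngStat); the mForPruning value is illustrative, not taken from this changeset::

    import orange, orngTree, orngEnsemble, orngTest, orngStat

    data = orange.ExampleTable("lymphography")

    # base learner: a tree post-pruned with m-pruning (value is illustrative)
    tree = orngTree.TreeLearner(mForPruning=2, name="tree")
    boost = orngEnsemble.BoostedLearner(tree, name="boosted tree")
    bagg = orngEnsemble.BaggedLearner(tree, name="bagged tree")

    # 10-fold cross validation, reporting classification accuracy
    learners = [tree, boost, bagg]
    results = orngTest.crossValidation(learners, data, folds=10)
    for learner, ca in zip(learners, orngStat.CA(results)):
        print "%s: %.3f" % (learner.name, ca)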

      ==================
      Forest
      ==================
    - .. index ensemble forest
    +
    + .. index:: randomforest
      .. autoclass:: Orange.ensemble.forest.RandomForestLearner
        :members:

    +
    + Example
    + ========
    +
    + The following script assembles a random forest learner and compares it
    + to a tree learner on the liver disorders (bupa) data set.
    +
    + `ensemble-forest.py`_ (uses `bupa.tab`_)
    +
    + .. literalinclude:: code/ensemble-forest.py
    +
    + .. _bupa.tab: code/bupa.tab
    + .. _ensemble-forest.py: code/ensemble-forest.py
    +
    + Notice that our forest contains 50 trees. Learners are compared through
    + 10-fold cross validation, and results are reported as classification accuracy,
    + Brier score and area under the ROC curve::
    +
    +     WAIT FOR WORKING TREE
    +
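
Pending working trees again, a sketch of the comparison this example describes, under the same Orange 2.x module assumptions as above::

    import orange, orngTree, orngEnsemble, orngTest, orngStat

    data = orange.ExampleTable("bupa")

    tree = orngTree.TreeLearner(name="tree")
    forest = orngEnsemble.RandomForestLearner(trees=50, name="forest")

    # compare the learners with 10-fold cross validation
    learners = [tree, forest]
    results = orngTest.crossValidation(learners, data, folds=10)
    print "Learner  CA     Brier  AUC"
    for i, learner in enumerate(learners):
        print "%-8s %5.3f  %5.3f  %5.3f" % (learner.name,
            orngStat.CA(results)[i],
            orngStat.BrierScore(results)[i],
            orngStat.AUC(results)[i])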
    + Perhaps the sole purpose of the following example is to show how to access
    + the individual classifiers once they are assembled into the forest, and to
    + show how we can assemble a tree learner to be used in random forests. The
    + tree induction uses a feature subset split constructor, which we have
    + borrowed from :class:`Orange.ensemble` and from which we have requested the
    + best feature for decision nodes to be selected from three randomly
    + chosen features.
    +
    + `ensemble-forest2.py`_ (uses `bupa.tab`_)
    +
    + .. literalinclude:: code/ensemble-forest2.py
    +
    + .. _ensemble-forest2.py: code/ensemble-forest2.py
    +
    + Running the above code reports the sizes (number of nodes) of the trees
    + in a constructed random forest.
    +
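
A sketch of the two pieces this example needs: a forest built with three-feature subsets, and access to the trained forest's member trees. The `classifiers` attribute is an assumption about this module's RandomForestClassifier; the node walk relies only on the `tree`/`branches` structure of Orange tree classifiers::

    import orange, orngEnsemble

    data = orange.ExampleTable("bupa")

    # best split chosen among 3 randomly drawn candidate features per node
    forest = orngEnsemble.RandomForestLearner(trees=50, attributes=3)(data)

    def tree_size(node):
        # count the nodes of a tree by walking its branches; leaves
        # have branches == None, and a branch itself may be None
        if not node:
            return 0
        return 1 + sum(tree_size(b) for b in node.branches or [])

    for i, tree in enumerate(forest.classifiers):
        print "tree %d: %d nodes" % (i, tree_size(tree.tree))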
    + ================
    + MeasureAttribute
    + ================
    +
    + L. Breiman (2001) suggested the possibility of using random forests as a
    + non-myopic measure of attribute importance.
    +
    + Assessing the relevance of features with random forests is based on the
    + idea that randomly changing the value of an important feature greatly
    + affects an example's classification, while changing the value of an
    + unimportant feature doesn't affect it much. The implemented algorithm
    + accumulates feature scores over a given number of trees. Importances of
    + all features for a single tree are computed as: correctly classified OOB
    + examples minus correctly classified OOB examples when a feature is
    + randomly shuffled. The accumulated feature scores are divided by the
    + number of used trees and multiplied by 100 before they are returned.
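
In code, the scoring rule just described amounts to the following library-independent sketch, where predict and shuffle_feature are hypothetical helpers standing in for the real forest machinery::

    def feature_importance(trees_with_oob, feature, predict, shuffle_feature):
        """trees_with_oob: (tree, out-of-bag example list) pairs; each
        hypothetical example carries its true class in ex.target."""
        score = 0.0
        for tree, oob in trees_with_oob:
            right = sum(1 for ex in oob if predict(tree, ex) == ex.target)
            mixed = shuffle_feature(oob, feature)  # permute one feature's values
            right_mix = sum(1 for ex in mixed if predict(tree, ex) == ex.target)
            score += float(right - right_mix) / len(oob)
        # average over the trees and scale by 100, as the module does
        return score * 100 / len(trees_with_oob)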
    +
    + .. autoclass:: Orange.ensemble.forest.MeasureAttribute
    +   :members:
    +
    + Computation of feature importance with random forests is rather slow. Also,
    + importances for all features need to be considered simultaneously. Since we
    + normally compute feature importance with random forests for all features in
    + the dataset, :class:`Orange.ensemble.forest.MeasureAttribute` caches the
    + results. When it is called to compute the quality of a certain feature, it
    + computes the qualities of all features in the dataset. When called again,
    + it uses the stored results if the domain is still the same and the example
    + table has not changed (this is done by checking the example table's version
    + and is not foolproof; it won't detect if you change values of existing
    + examples, but will notice adding and removing examples; see the page on
    + :class:`Orange.data.Table` for details).
    +
    + Caching will only have an effect if you use the same instance for all
    + features in the domain.
    +
    + `ensemble-forest-measure.py`_ (uses `iris.tab`_)
    +
    + .. literalinclude:: code/ensemble-forest-measure.py
    +
    + .. _ensemble-forest-measure.py: code/ensemble-forest-measure.py
    + .. _iris.tab: code/iris.tab
    +
    + Corresponding output::
    +
    +     WAITING FOR WORKING TREES
    +
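
The usage pattern the caching note implies is to keep a single measure instance for the whole domain; a sketch, assuming the module paths from this changeset::

    import orange, Orange

    data = orange.ExampleTable("iris")

    # one instance for all features, so the forest is grown only once
    measure = Orange.ensemble.forest.MeasureAttribute(trees=100)

    for attr, imp in zip(data.domain.attributes, measure.importances(data)):
        print "%-15s %.2f" % (attr.name, imp)

    # later per-feature calls on the same table reuse the cached scores
    print measure("petal width", data)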
      References
    - ----------
    + ============
      * L. Breiman. Bagging Predictors. `Technical report No. 421 \
          <http://www.stat.berkeley.edu/tech-reports/421.ps.Z>`_. University of \
    …
          pp. 359-370, 2004. [PDF]

    - Examples
    - ========
    -
    - .. literalinclude:: code/ensemble.py
    -
    -
      """

      __all__ = ["bagging", "boosting", "forest"]
      __docformat__ = 'restructuredtext'
    + import Orange.core as orange
    +
    + class SplitConstructor_AttributeSubset(orange.TreeSplitConstructor):
    +     def __init__(self, scons, attributes, rand = None):
    +         import random
    +         self.scons = scons           # split constructor of the original tree
    +         self.attributes = attributes # number of attributes to consider
    +         if rand:
    +             self.rand = rand         # a random generator
    +         else:
    +             self.rand = random.Random()
    +             self.rand.seed(0)
    +
    +     def __call__(self, gen, weightID, contingencies, apriori, candidates, clsfr):
    +         # invoke the original split constructor on a random subset of
    +         # self.attributes candidate attributes instead of on all of them
    +         cand = [1]*self.attributes + [0]*(len(candidates) - self.attributes)
    +         self.rand.shuffle(cand)
    +         t = self.scons(gen, weightID, contingencies, apriori, cand, clsfr)
    +         return t
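
A minimal sketch of attaching this wrapper to a core tree learner; using TreeSplitConstructor_Attribute as the wrapped constructor is an assumption about the Orange 2.x core API::

    import random
    import Orange.core as orange

    tree = orange.TreeLearner()
    # consider only 3 randomly drawn candidate attributes at each node
    tree.split = SplitConstructor_AttributeSubset(
        orange.TreeSplitConstructor_Attribute(), 3, random.Random(42))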
  • orange/Orange/ensemble/forest.py

    (diff from r7278 to r7334)

      import Orange

    - class SplitConstructor_AttributeSubset(orange.TreeSplitConstructor):
    -     def __init__(self, scons, attributes, rand = None):
    -         self.scons = scons           # split constructor of original tree
    -         self.attributes = attributes # number of attributes to consider
    -         if rand:
    -             self.rand = rand             # a random generator
    -         else:
    -             self.rand = random.Random()
    -             self.rand.seed(0)
    -
    -     def __call__(self, gen, weightID, contingencies, apriori, candidates, clsfr):
    -         cand = [1]*self.attributes + [0]*(len(candidates) - self.attributes)
    -         self.rand.shuffle(cand)
    -         # instead with all attributes, we will invoke split constructor
    -         # only for the subset of a attributes
    -         t = self.scons(gen, weightID, contingencies, apriori, cand, clsfr)
    -         return t
    -
      class RandomForestLearner(orange.Learner):
          """
    -     Just like bagging, classifiers in random forests are trained from bootstrap\
    -     samples of training data. Here, classifiers are trees, but to increase\
    -     randomness build in the way that at each node the best attribute is chosen\
    -     from a subset of attributes in the training set. We closely follows the\
    -     original algorithm (Brieman, 2001) both in implementation and parameter\
    +     Just like bagging, classifiers in random forests are trained on bootstrap
    +     samples of the training data. Here, the classifiers are trees, but to increase
    +     randomness they are built so that at each node the best attribute is chosen
    +     from a subset of the attributes in the training set. We closely follow the
    +     original algorithm (Breiman, 2001) both in implementation and parameter
          defaults.

          .. note::
    -         Random forest classifier uses decision trees induced from bootstrapped\
    -         training set to vote on class of presented example. Most frequent vote\
    -         is returned. However, in our implementation, if class probability is\
    -         requested from a classifier, this will return the averaged probabilities\
    +         The random forest classifier uses decision trees induced from a bootstrapped
    +         training set to vote on the class of a presented example. The most frequent vote
    +         is returned. However, in our implementation, if the class probability is
    +         requested from a classifier, this will return the averaged probabilities
              from each of the trees.
          """
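
The two prediction modes mentioned in the note could be exercised as in this sketch, assuming orange.GetProbabilities, the usual Orange 2.x return-type flag::

    import orange, orngEnsemble

    data = orange.ExampleTable("bupa")
    forest = orngEnsemble.RandomForestLearner(trees=50)(data)

    example = data[0]
    print forest(example)                           # majority vote of the trees
    print forest(example, orange.GetProbabilities)  # averaged tree probabilities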
     
    …
                  decision trees, which, when presented with an example, vote
                  for the predicted class.
    +         :type examples: :class:`Orange.data.Table`
              :param trees: Number of trees in the forest.
              :type trees: int
    …
              :param name: The name of the learner.
              :type name: string"""
    +         import random
              self.trees = trees
              self.name = name
    …
      ### MeasureAttribute_randomForests

    - class MeasureAttribute_randomForests(orange.MeasureAttribute):
    -   def __init__(self, learner=None, trees = 100, attributes=None, rand=None):
    -     self.trees = trees
    -     self.learner = learner
    -     self.bufexamples = None
    -     self.attributes = attributes
    -
    -     if self.learner == None:
    -       temp = RandomForestLearner(attributes=self.attributes)
    -       self.learner = temp.learner
    -
    -     if hasattr(self.learner.split, 'attributes'):
    -       self.origattr = self.learner.split.attributes
    -
    -     if rand:
    -       self.rand = rand  # a random generator
    -     else:
    -       self.rand = random.Random()
    -       self.rand.seed(0)
    -
    -   def __call__(self, a1, a2, a3=None):
    -     """Return importance of a given attribute. Can be given by index,
    -     name or as a Orange.data.feature.Feature."""
    -     attrNo = None
    -     examples = None
    -
    -     if type(a1) == int: #by attr. index
    -       attrNo, examples, apriorClass = a1, a2, a3
    -     elif type(a1) == type("a"): #by attr. name
    -       attrName, examples, apriorClass = a1, a2, a3
    -       attrNo = examples.domain.index(attrName)
    -     elif isinstance(a1, Orange.data.feature.Feature):
    -       a1, examples, apriorClass = a1, a2, a3
    -       atrs = [a for a in examples.domain.attributes]
    -       attrNo = atrs.index(a1)
    -     else:
    -       contingency, classDistribution, apriorClass = a1, a2, a3
    -       raise Exception("MeasureAttribute_rf can not be called with (\
    -             contingency,classDistribution, apriorClass) as fuction arguments.")
    -
    -     self.buffer(examples)
    -
    -     return self.avimp[attrNo]*100/self.trees
    -
    -   def importances(self, examples):
    -     """Returns importances of all attributes in dataset in a list.
    -     Buffered."""
    -     self.buffer(examples)
    -
    -     return [a*100/self.trees for a in self.avimp]
    -
    -   def buffer(self, examples):
    -     """Recalcule importances if needed (new examples)."""
    -     recalculate = False
    -
    -     if examples != self.bufexamples:
    -       recalculate = True
    -     elif examples.version != self.bufexamples.version:
    -       recalculate = True
    + class MeasureAttribute(orange.MeasureAttribute):
    +
    +     def __init__(self, learner=None, trees = 100, attributes=None, rand=None):
    +         """:param trees: Number of trees in the forest.
    +         :type trees: int
    +         :param learner: Although not required, one can use this argument to pass
    +             one's own tree induction algorithm. If None is
    +             passed, :class:`Orange.ensemble.forest.MeasureAttribute` will
    +             use Orange's tree induction algorithm such that during
    +             induction, nodes with fewer than 5 examples will not be
    +             considered for (further) splitting.
    +         :type learner: None or :class:`Orange.core.Learner`
    +         :param attributes: Number of attributes used in a randomly drawn
    +             subset when searching for the best attribute to split the node in tree
    +             growing (default: None, and if kept this way, this is turned into the
    +             square root of the number of attributes in the example set).
    +         :type attributes: int
    +         :param rand: Random generator used in bootstrap sampling. If None is
    +             passed, then Python's Random from the random library is used, with the
    +             seed initialized to 0."""
    +         self.trees = trees
    +         self.learner = learner
    +         self.bufexamples = None
    +         self.attributes = attributes
    +
    +         if self.learner == None:
    +           temp = RandomForestLearner(attributes=self.attributes)
    +           self.learner = temp.learner
    +
    +         if hasattr(self.learner.split, 'attributes'):
    +           self.origattr = self.learner.split.attributes
    +
    +         if rand:
    +           self.rand = rand  # a random generator
    +         else:
    +           self.rand = random.Random()
    +           self.rand.seed(0)
    +
    +     def __call__(self, a1, a2, a3=None):
    +         """Return the importance of a given attribute. The attribute can be
    +         given by index, name or as an Orange.data.feature.Feature."""
    +         attrNo = None
    +         examples = None
    +
    +         if type(a1) == int: # by attribute index
    +           attrNo, examples, apriorClass = a1, a2, a3
    +         elif type(a1) == type("a"): # by attribute name
    +           attrName, examples, apriorClass = a1, a2, a3
    +           attrNo = examples.domain.index(attrName)
    +         elif isinstance(a1, Orange.data.feature.Feature):
    +           a1, examples, apriorClass = a1, a2, a3
    +           atrs = [a for a in examples.domain.attributes]
    +           attrNo = atrs.index(a1)
    +         else:
    +           contingency, classDistribution, apriorClass = a1, a2, a3
    +           raise Exception("MeasureAttribute_rf cannot be called with (\
    +                 contingency, classDistribution, apriorClass) as function arguments.")
    +
    +         self.buffer(examples)
    +
    +         return self.avimp[attrNo]*100/self.trees
    +
    +     def importances(self, examples):
    +         """Return importances of all attributes in the dataset as a list.
    +         Buffered."""
    +         self.buffer(examples)
    +
    +         return [a*100/self.trees for a in self.avimp]
    +
    +     def buffer(self, examples):
    +         """Recalculate importances if needed (new examples)."""
    +         recalculate = False
    +
    +         if examples != self.bufexamples:
    +           recalculate = True
    +         elif examples.version != self.bufexamples.version:
    +           recalculate = True

    -     if (recalculate):
    -       self.bufexamples = examples
    -       self.avimp = [0.0]*len(self.bufexamples.domain.attributes)
    -       self.acu = 0
    -
    -       if hasattr(self.learner.split, 'attributes'):
    -           self.learner.split.attributes = self.origattr
    -
    -       # if number of attributes for subset is not set, use square root
    -       if hasattr(self.learner.split, 'attributes') and not\
    -                 self.learner.split.attributes:
    -           self.learner.split.attributes = int(sqrt(\
    -                         len(examples.domain.attributes)))
    -
    -       self.importanceAcu(self.bufexamples, self.trees, self.avimp)
    -
    -   def getOOB(self, examples, selection, nexamples):
    +         if (recalculate):
    +           self.bufexamples = examples
    +           self.avimp = [0.0]*len(self.bufexamples.domain.attributes)
    +           self.acu = 0
    +
    +           if hasattr(self.learner.split, 'attributes'):
    +               self.learner.split.attributes = self.origattr
    +
    +           # if number of attributes for subset is not set, use square root
    +           if hasattr(self.learner.split, 'attributes') and not\
    +                     self.learner.split.attributes:
    +               self.learner.split.attributes = int(sqrt(\
    +                             len(examples.domain.attributes)))
    +
    +           self.importanceAcu(self.bufexamples, self.trees, self.avimp)
    +
    +     def getOOB(self, examples, selection, nexamples):
              ooblist = filter(lambda x: x not in selection, range(nexamples))
              return examples.getitems(ooblist)

    -   def numRight(self, oob, classifier):
    -         """Returns a number of examples which are classified correcty."""
    +     def numRight(self, oob, classifier):
    +         """Return the number of examples which are classified correctly."""
              right = 0
              for el in oob:
    …
              return right

    -   def numRightMix(self, oob, classifier, attr):
    +     def numRightMix(self, oob, classifier, attr):
              """Return the number of examples which are classified
              correctly even if an attribute is shuffled."""
    …
              return right

    -   def importanceAcu(self, examples, trees, avimp):
    +     def importanceAcu(self, examples, trees, avimp):
              """Accumulate avimp by importances for a given number of trees."""
              n = len(examples)
    …
              classifiers = []
              for i in range(trees):
    -
                  # draw bootstrap sample
                  selection = []
    …
                  rightimp = self.numRightMix(oob, cla, attr)
                  avimp[attr] += (float(right-rightimp))/len(oob)
    -
              self.acu += trees

    -   def presentInTree(self, node, attrnum):
    +     def presentInTree(self, node, attrnum):
              """Return attributes present in tree (attributes that split)."""
              if not node:
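
The version check in buffer() above can be seen at work in a short sketch (module paths as in this changeset); per the caching note, adding an example changes the table's version and forces a recomputation::

    import orange, Orange

    data = orange.ExampleTable("iris")
    measure = Orange.ensemble.forest.MeasureAttribute(trees=50)

    print measure(0, data)   # grows the forest and caches all importances
    print measure(1, data)   # cache hit: same table, same version
    data.append(data[0])     # appending an example bumps the version
    print measure(1, data)   # cache miss: importances are recomputed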
  • orange/doc/Orange/rst/index.rst

    (diff from r7303 to r7334)

       Orange.associate
    -  orange.cluster
    +  Orange.cluster
       Orange.data
       orange.classification.bayes
  • orange/orngEnsemble.py

    (diff from r7267 to r7334)

      from Orange.ensemble.bagging import *
      from Orange.ensemble.boosting import *
    - from Orange.ensemble.forest import *
    + from Orange.ensemble.forest import RandomForestLearner
    + from Orange.ensemble.forest import RandomForestClassifier
    + from Orange.ensemble.forest import MeasureAttribute as MeasureAttribute_randomForests
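
With these aliases, legacy scripts that import from orngEnsemble keep working under the old names; a sketch::

    import orange, orngEnsemble

    data = orange.ExampleTable("iris")
    # the old name now resolves to Orange.ensemble.forest.MeasureAttribute
    measure = orngEnsemble.MeasureAttribute_randomForests(trees=50)
    print measure(0, data)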