Changeset 7326:be649f7c89a2 in orange


Timestamp:
02/03/11 18:33:16 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
6ef68b0a71606872a5b28ce08c5d37d00856b932
Message:

Joined c45 reference + module to the new tree.py.

Location:
orange
Files:
2 added
2 edited

  • orange/Orange/classification/tree.py

    r7283 r7326  
    11""" 
    22 
    3  
    4  
    5 This page describes the Orange trees. It first describes the basic components and procedures: it starts with <A href="#structure">the structure</A> that represents the tree, then it defines <A href="#classification">how the tree is used for classification</A>, then <A href="#learning">how it is built</A> and <a href="#pruning">pruned</A>. The order might seem strange, but the things are rather complex and this order is perhaps a bit easier to follow. After you have some idea about what the principal components do, we described the <a href="#classes">concrete classes</A> that you can use as components for a tree learner. 
    6  
    7 Classification trees are represented as a tree-like hierarchy of :obj:`TreeNode` classes. 
    8  
     3This page describes the Orange trees. It first describes the basic 
     4components and procedures: it starts with the 
     5structure that represents the tree, then it defines 
     6how the tree is used for classification XXXXXXXXXX, 
     7then how it is built XXXXXXXX and 
     8pruned XXXXXXXXXX. The order might seem strange, 
     9but these things are rather complex and this order is perhaps a 
     10bit easier to follow. After you have some idea about what the 
     11principal components do, we describe the 
     12concrete classes XXXXXXXXXX that you can use as 
     13components for a tree learner. 
     14 
     15Classification trees are represented as a tree-like hierarchy of 
     16:obj:`TreeNode` classes. 
    917 
    1018.. class:: TreeNode 
     
    5664        never disable this if you intend to prune the tree later. 
    5765 
    58     If the node is a leaf, the remaining fields are <code>None</code>. If it's an internal node, there are several additional fields. 
     66    If the node is a leaf, the remaining fields are :obj:`None`.  
     67    If it's an internal node, there are several additional fields. 
    5968 
    6069    .. attribute:: branches 
     
    92101        (when learning) or :obj:`TreeDescender` (when classifying). 
    93102 
    94     The lists :obj:`branches`, :obj:`branchDescriptions` and :obj:`branchSizes` are of the same length; all of them are defined if the node is internal and none if it is a leaf. 
     103    The lists :obj:`branches`, :obj:`branchDescriptions` and 
     104    :obj:`branchSizes` are of the same length; all of them are 
     105    defined if the node is internal and none if it is a leaf. 
    95106 
    96107    .. method:: treeSize() 
    97108         
    98         Return the number of nodes in the subtrees (including the node, excluding null-nodes). 
     109        Return the number of nodes in the subtrees (including the 
     110        node, excluding null-nodes). 
    99111 
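    A minimal Python sketch of an equivalent count, using only the
    attributes documented above (it assumes :obj:`branches` is None
    for leaves, as described earlier)::

        def tree_size(node):
            # count this node plus all non-null descendants,
            # mirroring what treeSize() returns
            if node is None:
                return 0
            if not node.branches:   # a leaf
                return 1
            return 1 + sum(tree_size(branch) for branch in node.branches)
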
    100112============== 
     
    160172 
    161173If you'd like to understand how the classification works in C++,  
    162 start reading at <code>TTreeClassifier::vote</code>. It gets a  
     174start reading at :obj:`TTreeClassifier::vote`. It gets a  
    163175:obj:`TreeNode`, an :obj:`orange.Example` and a distribution of  
    164176vote weights. For each node, it calls the  
    165 <code>TTreeClassifier::classDistribution</code> and then multiplies  
    166 and sums the distribution. <code>vote</code> returns a normalized  
     177:obj:`TTreeClassifier::classDistribution` and then multiplies  
     178and sums the distribution. :obj:`vote` returns a normalized  
    167179distribution of predictions. 
    168180 
    169 A new overload of <code>TTreeClassifier::classDistribution</code> gets 
     181A new overload of :obj:`TTreeClassifier::classDistribution` gets 
    170182an additional parameter, a :obj:`TreeNode`. This is done  
    171183for the sake of recursion. The normal version of  
    172 <code>classDistribution</code> simply calls the overloaded with a  
    173 tree root as an additional parameter. <code>classDistribution</code>  
    174 uses <code>descender</code>. If descender reaches a leaf, it calls  
    175 :obj:`nodeClassifier`, otherwise it calls <CODE>vote</CODE>. 
    176  
    177 Thus, the :obj:`TreeClassifier`'s <code>vote</code> and  
    178 <code>classDistribution</code> are written in a form of double  
     184:obj:`classDistribution` simply calls the overloaded version with the  
     185tree root as an additional parameter. :obj:`classDistribution`  
     186uses :obj:`descender`. If descender reaches a leaf, it calls  
     187:obj:`nodeClassifier`, otherwise it calls :obj:`vote`. 
     188 
     189Thus, the :obj:`TreeClassifier`'s :obj:`vote` and  
     190:obj:`classDistribution` are written in a form of double  
    179191recursion. The recursive calls do not happen at each node of the  
    180192tree but only at nodes where a vote is needed (that is, at nodes  
    181193where the descender halts). 
    182194 
    183 For predicting a class, <code>operator()</code>, calls the 
     195For predicting a class, :obj:`operator()` calls the 
    184196descender. If it reaches a leaf, the class is predicted by the  
    185197leaf's :obj:`nodeClassifier`. Otherwise, it calls  
    186 <code>vote</code>. From now on, <code>vote</code> and  
     198:obj:`vote`. From now on, :obj:`vote` and  
    187199:obj:`classDistribution` interweave down the tree and return  
    188 a distribution of predictions. <code>operator()</code> then simply  
     200a distribution of predictions. :obj:`operator()` then simply  
    189201chooses the most probable class. 
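
The control flow is easier to see in a Python sketch. This is not the
actual C++ implementation: the ``descend`` helper is a stand-in for
:obj:`descender` (assumed to return the reached node plus a list of
branch vote weights, or None when a single leaf was reached), and
distributions are represented as plain dictionaries::

    from collections import defaultdict

    def class_distribution(node, example, descend):
        # descend as far as the example's attribute values allow
        reached, branch_weights = descend(node, example)
        if branch_weights is None:                # a single leaf was reached
            return reached.nodeClassifier(example)   # assumed: dict class -> p
        return vote(reached, example, branch_weights, descend)

    def vote(node, example, branch_weights, descend):
        # weighted, normalized sum of the predictions of all applicable
        # subtrees; vote and class_distribution recurse into each other
        total = defaultdict(float)
        for branch, weight in zip(node.branches, branch_weights):
            if branch is None:
                continue
            for value, p in class_distribution(branch, example, descend).items():
                total[value] += weight * p
        norm = sum(total.values()) or 1.0
        return dict((value, p / norm) for value, p in total.items())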
    190  
    191202 
    192203======== 
     
    198209functions. For easier use, defaults are provided. 
    199210 
    200 <p>Components that govern the structure of the tree are <code>split</code>  
    201 (of type :obj:`TreeSplitConstructor`), <code>stop</code> (of  
    202 type :obj:`TreeStopCriteria` and <code>exampleSplitter</code>  
    203 (of type :obj:`TreeExampleSplitter`).</p> 
     211Components that govern the structure of the tree are :obj:`split` 
     212(of type :obj:`TreeSplitConstructor`), :obj:`stop` (of  
     213type :obj:`TreeStopCriteria`) and :obj:`exampleSplitter` 
     214(of type :obj:`TreeExampleSplitter`). 
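
For illustration, all three can be set explicitly before the learner
is called. The sketch below merely combines classes documented on this
page; :samp:`data` is assumed to be an :obj:`orange.ExampleTable` with
discrete attributes::

    learner = orange.TreeLearner()
    learner.split = orange.TreeSplitConstructor_Attribute()
    learner.stop = orange.TreeStopCriteria_common()
    learner.exampleSplitter = orange.TreeExampleSplitter_UnknownsToCommon()
    tree = learner(data)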
    204215 
    205216.. class:: TreeSplitConstructor 
     
    296307    majority classes. 
    297308 
     309    As opposed to :obj:`TreeSplitConstructor` and similar basic classes, 
     310    :obj:`TreeStopCriteria` is not an abstract but a fully functional 
     311    class that provides the basic stopping criteria. That is, the tree 
     312    induction stops when there is at most one example left; in this case, 
     313    it is not the weighted but the actual number of examples that counts. 
     314    Besides that, the induction stops when all examples are in the same 
     315    class (for discrete problems) or have the same value of the outcome 
     316    (for regression problems). 
     317 
     318    .. method:: __call__(examples[, weightID, domain contingencies]) 
     319 
     320        Decides whether to stop (true) or continue (false) the induction. 
     321        If contingencies are given, they are used for checking whether 
     322        the examples are in the same class (but not for counting the 
     323        examples). Derived classes should use the contingencies whenever 
     324        possible. If contingencies are not given, :obj:`TreeStopCriteria` 
     325        will work without them. Derived classes should also use them if 
     326        they are available, but otherwise compute them only when they 
     327        really need them. 
     328 
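    Since stopping criteria can also be written in Python (see
    :obj:`TreeStopCriteria_Python` below and the page on callbacks),
    a custom criterion can be sketched as a plain function with the
    same signature. That a bare function can be assigned to the
    learner's :obj:`stop` is an assumption here::

        import random

        def randomStop(examples, weightID, contingencies):
            # stop early at random, on top of the built-in criteria;
            # purely illustrative - it yields shallow, unstable trees
            return len(examples) < 2 or random.random() < 0.2

        learner.stop = randomStop
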
    298329 
    299330.. class:: TreeExampleSplitter 
    300331 
    301     Analoguous to the :obj:`TreeDescender` described about a while ago.  
    302332    Just like the :obj:`TreeDescender` decides the branch for an 
    303333    example during classification, the :obj:`TreeExampleSplitter` 
     
    333363    there was no need for splitting - no weight ID's are returned. 
    334364 
     365    An abstract base class for objects that split sets of examples into  
     366    subsets. The derived classes differ in treatment of examples which 
     367    cannot be unambiguously placed into a single branch (usually due 
     368    to unknown value of the crucial attribute). 
     369 
     370    .. method:: __call__(node, examples[, weightID]) 
     371         
     372        Use the information in :obj:`node` (particularly the  
     373        :obj:`branchSelector`) to split the given set of examples into subsets.  
     374        Return a tuple with a list of example generators and a list of weights.  
     375        The list of weights is either an ordinary python list of integers or  
     376        a None when no splitting of examples occurs and thus no weights are  
     377        needed. 
     378 
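    A learner's induction loop would consume this result roughly as
    in the following sketch; :samp:`build_node` is a hypothetical
    recursive call and the handling of weight IDs is simplified::

        subsets, weights = exampleSplitter(node, examples, weightID)
        for i, subset in enumerate(subsets):
            if subset is None or not len(subset):
                continue
            branchWeightID = weights[i] if weights else weightID
            node.branches[i] = build_node(subset, branchWeightID)
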
    335379.. class:: TreeLearner 
    336380 
     
    430474    of Boolean values). 
    431475 
    432     <code>Contingency matrix</code> is computed next. This happens 
     476    The contingency matrix is computed next. This happens 
    433477    even if the flag :obj:`storeContingencies` is ``false``. 
    434478    If the :obj:`contingencyComputer` is given, we use it, 
     
    476520======= 
    477521 
    478 Tree pruners derived from :obj:`TreePrune` can be given either a 
     522Tree pruners derived from :obj:`TreePruner` can be given either a 
    479523:obj:`TreeNode` (presumably, but not necessarily a root) or a 
    480524:obj:`TreeClassifier`. The result is a new, pruned :obj:`TreeNode` 
     
    518562separate page, dedicated to callbacks to Python XXXXXXXXXX. 
    519563 
    520 TreeSplitConstructor and Derived Classes 
    521 ======================================== 
     564TreeSplitConstructors 
     565===================== 
    522566 
    523567Split construction is almost as exciting as waiting for a delayed flight. 
     
    540584class (and derived classes). 
    541585 
    542 TreeSplitConstructor 
    543 ==================== 
    544  
    545586.. class:: TreeSplitConstructor_Measure 
     587 
     588    Bases: :class:`TreeSplitConstructor` 
    546589 
    547590    An abstract base class for split constructors that employ  
     
    567610.. class:: TreeSplitConstructor_Attribute 
    568611 
     612    Bases: :class:`TreeSplitConstructor_Measure` 
     613 
    569614    Attempts to use a discrete attribute as a split; each value of the  
    570615    attribute corresponds to a branch in the tree. Attributes are 
     
    581626 
    582627.. class:: TreeSplitConstructor_ExhaustiveBinary 
     628 
     629    Bases: :class:`TreeSplitConstructor_Measure` 
    583630 
    584631    Works on discrete attributes. For each attribute, it determines 
     
    603650.. class:: TreeSplitConstructor_Threshold 
    604651 
     652    Bases: :class:`TreeSplitConstructor_Measure` 
     653 
    605654    This is currently the only constructor for splits with continuous  
    606655    attributes. It divides the range of attributes values with a threshold  
     
    615664    "<threshold" and ">=threshold". The attribute is not spent. 
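
All the constructors derived from :obj:`TreeSplitConstructor_Measure`
accept an attribute quality measure. As a sketch (assuming the measure
classes from Orange's attribute scoring), binary splits scored by gain
ratio would be set up as::

    learner.split = orange.TreeSplitConstructor_ExhaustiveBinary(
        measure=orange.MeasureAttribute_gainRatio())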
    616665 
     666.. class:: TreeSplitConstructor_OneAgainstOthers 
     667     
     668    Bases: :class:`TreeSplitConstructor_Measure` 
     669 
     670    Undocumented. 
     671 
    617672.. class:: TreeSplitConstructor_Combined 
     673 
     674    Bases: :class:`TreeSplitConstructor` 
    618675 
    619676    This constructor delegates the task of finding the optimal split  
     
    658715        constructor you programmed in Python. 
    659716 
     717 
    660718TreeStopCriteria and TreeStopCriteria_common 
    661719============================================ 
    662720 
    663721:obj:`TreeStopCriteria` determines when to stop the induction of subtrees, as described in detail in the description of the learning process. XXXXXXXXXX 
    664  
    665 .. class:: TreeStopCriteria 
    666  
    667     As opposed to :obj:`TreeSplitConstructor` and similar basic classes, 
    668     :obj:`TreeStopCriteria` is not an abstract but a fully functional 
    669     class that provides the basic stopping criteria. That is, the tree 
    670     induction stops when there is at most one example left; in this case, 
    671     it is not the weighted but the actual number of examples that counts. 
    672     Besides that, the induction stops when all examples are in the same 
    673     class (for discrete problems) or have the same value of the outcome 
    674     (for regression problems). 
    675  
    676     .. method:: __call__(examples[, weightID, domain contingencies]) 
    677  
    678         Decides whether to stop (true) or continue (false) the induction. 
    679         If contingencies are given, they are used for checking whether 
    680         the examples are in the same class (but not for counting the 
    681         examples). Derived classes should use the contingencies whenever 
    682         possible. If contingencies are not given, :obj:`TreeStopCriteria` 
    683         will work without them. Derived classes should also use them if 
    684         they are available, but otherwise compute them only when they 
    685         really need them. 
    686722 
    687723.. class:: TreeStopCriteria_common 
     
    702738        Example count is weighed. 
    703739 
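    As a sketch (keyword arguments at construction are the usual
    Orange convention, an assumption here), a more conservative
    criterion would be::

        learner.stop = orange.TreeStopCriteria_common(minExamples=5.0,
                                                      maxMajority=0.8)
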
    704  
    705 TreeExampleSplitter and derived classes 
    706 ======================================= 
     740.. class:: TreeStopCriteria_Python 
     741 
     742    Undocumented. 
     743 
     744Classes derived from TreeExampleSplitter 
     745======================================== 
    707746 
    708747:obj:`TreeExampleSplitter` is the third crucial component of 
    709748:obj:`TreeLearner`. Its function is described in  
    710 description of the learning process. XXXXXXXXX 
    711  
    712 .. class:: TreeExampleSplitter 
    713  
    714     An abstract base class for objects that split sets of examples into  
    715     subsets. The derived classes differ in treatment of examples which 
    716     cannot be unambiguously placed into a single branch (usually due 
    717     to unknown value of the crucial attribute). 
    718  
    719     .. method:: __call__(node, examples[, weightID]) 
    720          
    721         Use the information in :obj:`node` (particularly the  
    722         :obj:`branchSelector`) to split the given set of examples into subsets.  
    723         Return a tuple with a list of example generators and a list of weights.  
    724         The list of weights is either an ordinary python list of integers or  
    725         a None when no splitting of examples occurs and thus no weights are  
    726         needed. 
     749the description of the learning process. XXXXXXXXXX 
    727750 
    728751.. class:: TreeExampleSplitter_IgnoreUnknowns 
    729752 
     753    Bases: :class:`TreeExampleSplitter` 
     754 
    730755    Simply ignores the examples for which no single branch can be determined. 
    731756 
    732757.. class:: TreeExampleSplitter_UnknownsToCommon 
     758 
     759    Bases: :class:`TreeExampleSplitter` 
    733760 
    734761    Places all such examples to a branch with the highest number of 
     
    738765.. class:: TreeExampleSplitter_UnknownsToAll 
    739766 
     767    Bases: :class:`TreeExampleSplitter` 
     768 
    740769    Places examples with unknown value of the attribute into all branches. 
    741770 
    742771.. class:: TreeExampleSplitter_UnknownsToRandom 
    743772 
     773    Bases: :class:`TreeExampleSplitter` 
     774 
    744775    Selects a random branch for such examples. 
    745776 
    746777.. class:: TreeExampleSplitter_UnknownsToBranch 
     778 
     779    Bases: :class:`TreeExampleSplitter` 
    747780 
    748781    Constructs an additional branch to contain all such examples.  
     
    751784.. class:: TreeExampleSplitter_UnknownsAsBranchSizes 
    752785 
     786    Bases: :class:`TreeExampleSplitter` 
     787 
    753788    Splits examples with unknown value of the attribute according to  
    754789    proportions of examples in each branch. 
    755790 
    756791.. class:: TreeExampleSplitter_UnknownsAsSelector 
     792 
     793    Bases: :class:`TreeExampleSplitter` 
    757794 
    758795    Splits examples with unknown value of the attribute according to  
     
    789826.. class:: TreeDescender_UnknownsToNode 
    790827 
     828    Bases: :class:`TreeDescender` 
     829 
    791830    When an example cannot be classified into a single branch, the 
    792831    current node is returned. Thus, the node's :obj:`NodeClassifier` 
     
    796835    them after the induction, that's all) 
    797836 
    798 .. class:: TreeDescender_UnknownsToCommon 
     837.. class:: TreeDescender_UnknownsToBranch 
     838 
     839    Bases: :class:`TreeDescender` 
    799840 
    800841    Classifies examples with unknown value to a special branch. This 
     
    804845.. class:: TreeDescender_UnknownsToCommonBranch 
    805846 
     847    Bases: :class:`TreeDescender` 
     848 
    806849    Classifies examples with unknown values to the branch with the 
    807850    highest number of examples. If there is more than one such branch, 
     
    810853.. class:: TreeDescender_UnknownsToCommonSelector 
    811854 
     855    Bases: :class:`TreeDescender` 
     856 
    812857    Classifies examples with unknown values to the branch which received  
    813858    the highest recommendation by the selector. 
     
    815860.. class:: TreeDescender_MergeAsBranchSizes 
    816861 
     862    Bases: :class:`TreeDescender` 
     863 
    817864    Makes the subtrees vote for the example's class; the vote is 
    818865    weighted according to the sizes of the branches. 
    819866 
    820867.. class:: TreeDescender_MergeAsSelector 
     868 
     869    Bases: :class:`TreeDescender` 
    821870 
    822871    Makes the subtrees vote for the example's class; the vote is  
     
    846895 
    847896.. class:: TreePruner_SameMajority 
     897 
     898    Bases: :class:`TreePruner` 
    848899 
    849900    In Orange, a tree can have non-trivial subtrees (i.e. subtrees 
     
    865916 
    866917.. class:: TreePruner_m 
     918 
     919    Bases: :class:`TreePruner` 
    867920 
    868921    Prunes a tree by comparing m-estimates of static and dynamic  
     
    9471000We could write something like... 
    9481001 
     1002:: 
     1003 
    9491004    def printTree(x): 
    9501005        printTree0(x.tree, 0) 
     
    10131068======== 
    10141069 
    1015 You've already seen a simple example of using a :obj:`TreeLearner`. You can just call it and let it fill the empty slots with the default components. This section will teach you three things: what are the missing components (and how to set the same components yourself), how to use alternative components to get a different tree and, finally, how to write a skeleton for tree induction in Python.</p> 
    1016  
    1017 <H4>Default components for TreeLearner</H4> 
    1018  
    1019 <p>Let us construct a :obj:`TreeLearner` to play with.</p> 
    1020  
    1021 <p class="header"><a href="treelearner.py">treelearner.py</a> 
    1022 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    1023 <xmp class="code">>>> learner = orange.TreeLearner() 
    1024 </xmp> 
    1025  
    1026 <p>There are three crucial components in learning: the split and stop criteria, and the <code>exampleSplitter</code> (there are some others, which become important during classification; we'll talk about them later). They are not defined; if you use the learner, the slots are filled temporarily but later cleared again.</code> 
    1027  
    1028 <xmp class="code">>>> print learner.split 
    1029 None 
    1030 >>> learner(data) 
    1031 <TreeClassifier instance at 0x01F08760> 
    1032 >>> print learner.split 
    1033 None 
    1034 </xmp> 
    1035  
    1036 <H4>Stopping criteria</H4> 
    1037 <p>The stop is trivial. The default is set by</p> 
    1038  
    1039 <xmp class="code">>>> learner.stop = orange.TreeStopCriteria_common() 
    1040 </xmp> 
    1041  
    1042 <p>Well, this is actually done in C++ and it uses a global component that is constructed once for all, but apart from that we did effectively the same thing.</p> 
    1043  
    1044 <p>We can now examine the default stopping parameters.</p> 
    1045  
    1046 <xmp class="code">>>> print learner.stop.maxMajority, learner.stop.minExamples 
    1047 1.0 0.0 
    1048 </xmp> 
    1049  
    1050 <p>Not very restrictive. This keeps splitting the examples until there's nothing left to split or all the examples are in the same class. Let us set the minimal subset that we allow to be split to five examples and see what comes out.</p> 
    1051  
    1052 <p class="header">part of <a href="treelearner.py">treelearner.py</a> 
    1053 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    1054 <xmp class="code">>>> learner.stop.minExamples = 5.0 
    1055 >>> tree = learner(data) 
    1056 >>> printTree(tree) 
    1057 tear_rate (<15.000, 5.000, 4.000>) 
    1058 : reduced --> none (<12.000, 0.000, 0.000>) 
    1059 : normal 
    1060    astigmatic (<3.000, 5.000, 4.000>) 
    1061    : no 
    1062       age (<1.000, 5.000, 0.000>) 
    1063       : young --> soft (<0.000, 2.000, 0.000>) 
    1064       : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
    1065       : presbyopic --> soft (<1.000, 1.000, 0.000>) 
    1066    : yes 
    1067       prescription (<2.000, 0.000, 4.000>) 
    1068       : myope --> hard (<0.000, 0.000, 3.000>) 
    1069       : hypermetrope --> none (<2.000, 0.000, 1.000>) 
    1070 </xmp> 
    1071  
    1072 <p>OK, that's better. If we want an even smaller tree, we can also limit the maximal proportion of majority class.</p> 
    1073  
    1074 <p class="header">part of <a href="treelearner.py">treelearner.py</a> 
    1075 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    1076 <xmp class="code">>>> learner.stop.maxMajority = 0.5 
    1077 >>> tree = learner(tree) 
    1078 >>> printTree(tree) 
    1079 --> none (<15.000, 5.000, 4.000>) 
    1080 </xmp> 
    1081  
    1082 <p>Well, this might have been an overkill...</p> 
    1083  
    1084  
    1085 <H4>Splitting criteria</H4> 
    1086  
    1087 <H4>Example splitter</H4> 
    1088  
    1089 <H4>Flags and similar</H4> 
    1090  
    1091 ... also mention nodeLearner and descender 
    1092  
    1093 <H4>Programming your own tree learner skeleton</H4> 
    1094  
    1095 <H3>Classification</H3> 
    1096  
    1097 <H4>Descender</H4> 
    1098  
    1099 <H4>Node classifier</H4> 
    1100  
    1101 <H3>Pruning</H3> 
    1102  
    1103 <H4>Same majority pruning</H4> 
    1104  
    1105 <H4>Post pruning</H4> 
    1106  
    1107 ... show a series of trees 
    1108  
    1109 <H3>Defining your own components</H3> 
    1110  
    1111 <H3>Various tricks</H3> 
    1112  
    1113 <H4>Storing testing examples</H4> 
    1114  
    1115 ... show how to remember misclassifications etc.<P> 
    1116  
    1117 <H4>Replacing node classifiers</H4> 
    1118 ... replacing with something else, but based on learning examples<P> 
    1119  
    1120 <hr> 
    1121  
    1122 <H3><U>References</U></H3> 
    1123  
    1124 Bratko, I. (2002). <EM>Prolog Programming for Artificial Intelligence</EM>, Addison Wesley, 2002.<P> 
     1070You've already seen a simple example of using a :obj:`TreeLearner`. 
     1071You can just call it and let it fill the empty slots with the default 
     1072components. This section will teach you three things: what the 
     1073missing components are (and how to set them yourself), 
     1074how to use alternative components to get a different tree and, 
     1075finally, how to write a skeleton for tree induction in Python. 
     1076 
     1077Default components for TreeLearner 
     1078================================== 
     1079 
     1080Let us construct a :obj:`TreeLearner` to play with. 
     1081 
     1082.. _treelearner.py: code/treelearner.py 
     1083 
     1084`treelearner.py`_, uses `lenses.tab`_: 
     1085 
     1086.. literalinclude:: code/treelearner.py 
     1087   :lines: 7-10 
     1088 
     1089There are three crucial components in learning: the split and stop 
     1090criteria, and the :obj:`exampleSplitter` (there are some others, 
     1091which become important during classification; we'll talk about them 
     1092later). They are not defined; if you use the learner, the slots are 
     1093filled temporarily but later cleared again. 
     1094 
     1095:: 
     1096 
     1097    >>> print learner.split 
     1098    None 
     1099    >>> learner(data) 
     1100    <TreeClassifier instance at 0x01F08760> 
     1101    >>> print learner.split 
     1102    None 
     1103 
     1104Stopping criteria 
     1105================= 
     1106 
     1107The stop is trivial. The default is set by 
     1108:: 

     1109    >>> learner.stop = orange.TreeStopCriteria_common() 
     1110 
     1111Well, this is actually done in C++ and it uses a global component 
     1112that is constructed once for all, but apart from that we did 
     1113effectively the same thing. 
     1114 
     1115We can now examine the default stopping parameters:: 
     1116 
     1117    >>> print learner.stop.maxMajority, learner.stop.minExamples 
     1118    1.0 0.0 
     1119 
     1120Not very restrictive. This keeps splitting the examples until 
     1121there's nothing left to split or all the examples are in the same 
     1122class. Let us set the minimal subset that we allow to be split to 
     1123five examples and see what comes out:: 
     1124 
     1125    >>> learner.stop.minExamples = 5.0 
     1126    >>> tree = learner(data) 
     1127    >>> printTree(tree) 
     1128    tear_rate (<15.000, 5.000, 4.000>) 
     1129    : reduced --> none (<12.000, 0.000, 0.000>) 
     1130    : normal 
     1131       astigmatic (<3.000, 5.000, 4.000>) 
     1132       : no 
     1133          age (<1.000, 5.000, 0.000>) 
     1134          : young --> soft (<0.000, 2.000, 0.000>) 
     1135          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
     1136          : presbyopic --> soft (<1.000, 1.000, 0.000>) 
     1137       : yes 
     1138          prescription (<2.000, 0.000, 4.000>) 
     1139          : myope --> hard (<0.000, 0.000, 3.000>) 
     1140          : hypermetrope --> none (<2.000, 0.000, 1.000>) 
     1141 
     1142OK, that's better. If we want an even smaller tree, we can also limit 
     1143the maximal proportion of the majority class:: 
     1144 
     1145    >>> learner.stop.maxMajority = 0.5 
     1146    >>> tree = learner(data) 
     1147    >>> printTree(tree) 
     1148    --> none (<15.000, 5.000, 4.000>) 
     1149 
     1150References 
     1151========== 
     1152 
     1153Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`, Addison 
     1154Wesley. 
     1155 
     1156.. class:: TreeNodeList 
     1157 
     1158    Undocumented. 
     1159 
     1163 
     1164.. class:: C45TreeNodeList 
     1165 
     1166    Undocumented. 
     1167 
     1168=========================== 
     1169C4.5 Classifier and Learner 
     1170=========================== 
     1171 
     1172As C4.5 is a standard benchmark in machine learning,  
     1173it is incorporated in Orange, although Orange has its own 
     1174implementation of decision trees. 
     1175 
     1176The implementation uses the original Quinlan's code for learning so the 
     1177tree you get is exactly like the one that would be built by standalone 
     1178C4.5. Upon return, however, the original tree is copied to Orange 
     1179components that contain exactly the same information plus what is needed 
     1180to make them visible from Python. To be sure that the algorithm behaves 
     1181just as the original, we use a dedicated class :class:`C45TreeNode` 
     1182instead of reusing the components used by Orange's tree inducer 
     1183(i.e., :class:`TreeNode`). This, however, could be done and probably 
     1184will be done in the future; we shall still retain :class:`C45TreeNode`  
     1185but offer transformation to :class:`TreeNode` so that routines 
     1186that work on Orange trees will also be usable for C45 trees. 
     1187 
     1188:class:`C45Learner` and :class:`C45Classifier` behave 
     1189like any other Orange learner and classifier. Unlike most Orange 
     1190learning algorithms, C4.5 does not accept weighted examples. 
     1191 
     1192Building the C4.5 plug-in 
     1193========================= 
     1194 
     1195We haven't been able to obtain the legal rights to distribute 
     1196C4.5 and therefore couldn't statically link it into Orange. Instead, 
     1197it's incorporated as a plug-in which you'll need to build yourself. 
     1198The procedure is trivial, but you'll need a C compiler. On Windows, 
     1199the scripts we provide work with MS Visual C and the files CL.EXE 
     1200and LINK.EXE must be on the PATH. On Linux you're equipped with 
     1201gcc. Mac OS X comes without gcc, but you can download it for 
     1202free from Apple. 
     1203 
     1204Orange must be installed prior to building C4.5. (This is because 
     1205the build script will copy the created file next to Orange, 
     1206which it obviously can't if Orange isn't there yet.) 
     1207 
     1208#. Download the  
     1209   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_ 
     1210   from the `RuleQuest site <http://www.rulequest.com/>`_ and extract 
     1211   them into a temporary directory. The files will be modified during 
     1212   the build, so don't use a copy of Quinlan's sources that you still 
     1213   need for another purpose. 
     1214#. Download  
     1215   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_  
     1216   and unzip its contents into the directory R8/Src of Quinlan's 
     1217   sources (the directory that contains, for instance, the file 
     1218   average.c). 
     1219#. Run buildC45.py, which will build the plug-in and put it next to  
     1220   orange.pyd (or orange.so on Linux/Mac). 
     1221#. Run python, import orange and create :samp:`orange.C45Learner()`. 
     1222   If this fails, something went wrong; see the diagnostic messages from 
     1223   buildC45.py and read the paragraph below. 
     1224#. Finally, you can remove Quinlan's sources, along with everything 
     1225   created by buildC45.py. 
     1226 
     1227If the process fails, here's what buildC45.py really does: it creates 
     1228.h files that wrap Quinlan's .i files and ensure that they are not 
     1229included twice. It modifies C4.5 sources to include .h's instead of 
     1230.i's. This step can hardly fail. Then follows the platform-dependent 
     1231step, which compiles ensemble.c (which includes all the Quinlan .c 
     1232files it needs) into c45.dll or c45.so and puts it next to Orange. 
     1233If this fails but you do have a C compiler and linker, and you know 
     1234how to use them, you can compile ensemble.c into a dynamic 
     1235library yourself. See the compile and link steps in buildC45.py 
     1236if it helps. Anyway, after doing this, check that the built C4.5 
     1237gives the same results as the original. 
     1238 
     1239.. class:: C45Learner 
     1240 
     1241    :class:`C45Learner`'s attributes have double names - those that 
     1242    you know from C4.5 command lines and the corresponding names of C4.5's 
     1243    internal variables. All defaults are set as in C4.5; if you change 
     1244    nothing, you are running C4.5. 
     1245 
     1246    .. attribute:: gainRatio (g) 
     1247         
     1248        Determines whether to use information gain (false, default) 
     1249        or gain ratio for selection of attributes (true). 
     1250 
     1251    .. attribute:: batch (b) 
     1252 
     1253        Turn on batch mode (no windows, no iterations); this option is 
     1254        not documented in C4.5 manuals. It conflicts with "window", 
     1255        "increment" and "trials". 
     1256 
     1257    .. attribute:: subset (s) 
     1258         
     1259        Enables subsetting (default: false, no subsetting). 
     1260  
     1261    .. attribute:: probThresh (p) 
     1262 
     1263        Probabilistic threshold for continuous attributes (default: false). 
     1264 
     1265    .. attribute:: minObjs (m) 
     1266         
     1267        Minimal number of objects (examples) in leaves (default: 2). 
     1268 
     1269    .. attribute:: window (w) 
     1270 
     1271        Initial window size (default: maximum of 20% and twice the 
     1272        square root of the number of data objects). 
     1273 
     1274    .. attribute:: increment (i) 
     1275 
     1276        The maximum number of objects that can be added to the window 
     1277        at each iteration (default: 20% of the initial window size). 
     1278 
     1279    .. attribute:: cf (c) 
     1280 
     1281        Pruning confidence level (default: 25%). 
     1282 
     1283    .. attribute:: trials (t) 
     1284 
     1285        Set the number of trials in iterative (i.e. non-batch) mode (default: 10). 
     1286 
     1287    .. attribute:: prune 
     1288         
     1289        Returns the pruned tree (not an original C4.5 option; default: true). 
     1290 
     1291 
     1292:class:`C45Learner` also offers another way of setting 
     1293the arguments: it provides a method :obj:`C45Learner.commandLine` 
     1294which is given a string and parses it the same way as C4.5 would 
     1295parse its command line. XXXXXXXXXXX 
     1296 
     1297.. class:: C45Classifier 
     1298 
     1299    A faithful reimplementation of Quinlan's function from C4.5. The only 
     1300    difference (and the only reason it's been rewritten) is that it uses 
     1301    a tree composed of :class:`C45TreeNode` instead of C4.5's 
     1302    original tree structure. 
     1303 
     1304    .. attribute:: tree 
     1305 
     1306        C4.5 tree stored as a tree of :obj:`C45TreeNode`. 
     1307 
     1308 
     1309.. class:: C45TreeNode 
     1310 
     1311    This class is a reimplementation of the corresponding *struct* from 
     1312    Quinlan's C4.5 code. 
     1313 
     1314    .. attribute:: nodeType 
     1315 
     1316        Type of the node:  :obj:`C45TreeNode.Leaf` (0),  
     1317        :obj:`C45TreeNode.Branch` (1), :obj:`C45TreeNode.Cut` (2), 
     1318        :obj:`C45TreeNode.Subset` (3). "Leaves" are leaves, "branches" 
     1319        split examples based on values of a discrete attribute, 
     1320        "cuts" cut them according to a threshold value of a continuous 
     1321        attributes and "subsets" use discrete attributes but with subsetting 
     1322        so that several values can go into the same branch. 
     1323 
     1324    .. attribute:: leaf 
     1325 
     1326        Value returned by that leaf. The field is defined for internal  
     1327        nodes as well. 
     1328 
     1329    .. attribute:: items 
     1330 
     1331        Number of (learning) examples in the node. 
     1332 
     1333    .. attribute:: classDist 
     1334 
     1335        Class distribution for the node (of type  
     1336        :obj:`orange.DiscDistribution`). 
     1337 
     1338    .. attribute:: tested 
     1339         
     1340        The attribute used in the node's test. If the node is a leaf, 
     1341        :obj:`tested` is None; if the node is of type :obj:`Branch` or 
     1342        :obj:`Subset`, :obj:`tested` is a discrete attribute, and if the 
     1343        node is of type :obj:`Cut`, :obj:`tested` is a continuous attribute. 
     1344 
     1345    .. attribute:: cut 
     1346 
     1347        A threshold for continuous attributes, if node is of type :obj:`Cut`. 
     1348        Undefined otherwise. 
     1349 
     1350    .. attribute:: mapping 
     1351 
     1352        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]` 
     1353        gives the index of the branch for examples whose value of :obj:`tested` is *i*. 
     1354        Here, *i* denotes an index of value, not a :class:`orange.Value`. 
     1355 
     1356    .. attribute:: branch 
     1357         
     1358        A list of branches stemming from this node. 
     1359 
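Because the structure is recursive, small utilities are easy to write.
This hypothetical helper counts the leaves under a node, using only the
attributes documented above::

    def countLeaves(node):
        # leaves have nodeType 0 (C45TreeNode.Leaf); internal nodes
        # carry a list of subtrees in node.branch
        if node.nodeType == 0:
            return 1
        return sum(countLeaves(branch) for branch in node.branch)

    # e.g. countLeaves(c45classifier.tree)
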
     1360Examples 
     1361======== 
     1362 
     1363.. _tree_c45.py: code/tree_c45.py 
     1364.. _iris.tab: code/iris.tab 
     1365 
     1366The simplest way to use :class:`C45Learner` is to call it. This 
     1367script constructs the same learner as you would get by calling 
     1368the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_): 
     1369 
     1370.. literalinclude:: code/tree_c45.py 
     1371   :lines: 7-14 
     1372 
     1373Arguments can be set by the usual mechanism (the two lines below do the 
     1374same thing, except that one uses command-line symbols and the other the 
     1375internal variable names): 
     1376 
     1377:: 
     1378 
     1379    tree = orange.C45Learner(data, m=100) 
     1380    tree = orange.C45Learner(data, minObjs=100) 
     1381 
     1382Veteran C4.5 users might prefer to set the arguments through the 
     1383method :obj:`C45Learner.commandline`. 
     1384 
     1385:: 
     1386 
     1387    lrn = orange.C45Learner() 
     1388    lrn.commandline("-m 1 -s") 
     1389    tree = lrn(data) 
     1390 
     1391There's nothing special about using :obj:`C45Classifier` - it's 
     1392just like any other classifier. To demonstrate what the structure of 
     1393:class:`C45TreeNode` looks like, we will show a script that prints 
     1394it out in the same format as C4.5 does. 
     1395 
     1396.. literalinclude:: code/tree_c45_printtree.py 
     1397 
     1398Leaves are the simplest. We just print out the value contained 
     1399in :samp:`node.leaf`. Since this is not a qualified value (i.e.,  
     1400:obj:`C45TreeNode` does not know to which attribute it belongs), we need to 
     1401convert it to a string through :obj:`classVar`, which is passed as an 
     1402extra argument to the recursive part of printTree. 
     1403 
     1404For discrete splits without subsetting, we print out all attribute values 
     1405and recursively call the function for all branches. Continuous splits are 
     1406equally easy to handle. 
     1407 
     1408For discrete splits with subsetting, we iterate through branches, retrieve 
     1409the corresponding values that go into each branch into :samp:`inset`, turn 
     1410the values into strings and print them out, separately treating the 
     1411case when only a single value goes into the branch. 
     1412 
     1413Printing out C45 Tree 
     1414===================== 
     1415 
     1416.. autofunction:: c45_printTree 
     1417 
    11251418""" 
    11261419 
     
    11631456 
    11641457 
     1458#from orngC45 
     1459def c45_showBranch(node, classvar, lev, i): 
     1460    var = node.tested 
     1461    if node.nodeType == 1: 
     1462        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]), 
     1463        c45_printTree0(node.branch[i], classvar, lev+1) 
     1464    elif node.nodeType == 2: 
     1465        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut), 
     1466        c45_printTree0(node.branch[i], classvar, lev+1) 
     1467    else: 
     1468        inset = filter(lambda a:a[1]==i, enumerate(node.mapping)) 
     1469        inset = [var.values[j[0]] for j in inset] 
     1470        if len(inset)==1: 
     1471            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]), 
     1472        else: 
     1473            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)), 
     1474        c45_printTree0(node.branch[i], classvar, lev+1) 
     1475         
     1476         
     1477def c45_printTree0(node, classvar, lev): 
     1478    var = node.tested 
     1479    if node.nodeType == 0: 
     1480        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items), 
     1481    else: 
     1482        for i, branch in enumerate(node.branch): 
     1483            if not branch.nodeType: 
     1484                c45_showBranch(node, classvar, lev, i) 
     1485        for i, branch in enumerate(node.branch): 
     1486            if branch.nodeType: 
     1487                c45_showBranch(node, classvar, lev, i) 
     1488 
     1489def c45_printTree(tree): 
     1490    """ 
     1491    Prints the tree given as an argument in the same form as Ross Quinlan's  
     1492    C4.5 program. 
     1493 
     1494    :: 
     1495 
     1496        import orange, orngC45 
     1497 
     1498        data = orange.ExampleTable("voting") 
     1499        c45 = orange.C45Learner(data) 
     1500        orngC45.printTree(c45) 
     1501 
     1502    will print out 
     1503 
     1504    :: 
     1505 
     1506        physician-fee-freeze = y: 
     1507        |   synfuels-corporation-cutback = n: republican (145.7) 
     1508        |   synfuels-corporation-cutback = y: 
     1509        |   |   mx-missile = y: democrat (6.0) 
     1510        |   |   mx-missile = n: 
     1511        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6) 
     1512        |   |   |   adoption-of-the-budget-resolution = y: 
     1513        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0) 
     1514        |   |   |   |   anti-satellite-test-ban = y: republican (2.2) 
     1515 
     1516 
     1517    If you run the original C4.5 (that is, the standalone C4.5 program; Orange uses the original C4.5 code for learning) on the same data, it will print out 
     1518 
     1519    :: 
     1520 
     1521        physician-fee-freeze = n: democrat (253.4/5.9) 
     1522        physician-fee-freeze = y: 
     1523        |   synfuels-corporation-cutback = n: republican (145.7/6.2) 
     1524        |   synfuels-corporation-cutback = y: 
     1525        |   |   mx-missile = y: democrat (6.0/2.4) 
     1526        |   |   mx-missile = n: 
     1527        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2) 
     1528        |   |   |   adoption-of-the-budget-resolution = y: 
     1529        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2) 
     1530        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0) 
     1531 
     1532    which is reassuringly similar, except that C4.5 tested the tree on  
     1533    the learning data and has also printed out the number of errors  
     1534    in each node - something which :obj:`c45_printTree` obviously can't do 
     1535    (nor is there any need for it to). 
     1536 
     1537""" 
     1538    c45_printTree0(tree.tree, tree.classVar, 0) 
     1539    print 
     1540#end orngC45 
     1541 
     1542 
     1543 
  • orange/doc/reference/c45.py

    r6538 r7326  
    77import orange 
    88 
    9 #data = orange.ExampleTable("lenses.tab") 
    109data = orange.ExampleTable("iris") 
    1110tree = orange.C45Learner(data) 