Changeset 7283:486d833a62a8 in orange


Ignore:
Timestamp:
02/03/11 00:12:31 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
de739fc68037d8eac19aaca73c5c7a3e4fbc8d11
Message:

Tree.py slowly progressing.

Location:
orange
Files:
2 added
1 edited

  • orange/Orange/classification/tree.py

    r7212 r7283  
    9494    The lists :obj:`branches`, :obj:`branchDescriptions` and :obj:`branchSizes` are of the same length; all of them are defined if the node is internal and none if it is a leaf. 
    9595 
    96     .. method:: treeSize(): 
     96    .. method:: treeSize() 
    9797         
    9898        Return the number of nodes in the subtrees (including the node, excluding null-nodes). 
    9999 
    100 <A name="classification"></A> 
    101 <H3>Classification</H3> 
     100============== 
     101Classification 
     102============== 
    102103 
    103104.. class:: TreeClassifier 
     
    122123    In general there are three possible outcomes of a descent. 
    123124 
    124     # Descender reaches a leaf. This happens when nothing went wrong (there are no unknown or out-of-range values in the example) or when things went wrong, but the descender smoothed them by selecting a single branch and continued the descent. In this case, the descender returns the reached :obj:`TreeNode`. 
    125  
    126     # :obj:`branchSelector` returned a distribution and the :obj:`TreeDescender` decided to stop the descent at this (internal) node. Again, the descender returns the current <code>TreeNode</code> and nothing else.</LI> 
    127  
    128     # :obj:`branchSelector` returned a distribution and the <code>TreeNode</code> wants to split the example (i.e., to decide the class by voting). It returns a <code>TreeNode</code> and the vote-weights for the branches. The weights can correspond to the distribution returned by 
    129 <code>branchSelector</code>, to the number of learning examples that were assigned to each branch, or to something else.</LI> 
    130 </UL> 
    131  
    132 <p><code>TreeClassifier</code> uses the descender to descend from the root. If it returns only a <code>TreeNode</code> and no distribution, the descent should stop; it does not matter whether it's a leaf (the first case above) or an internal node (the second case). The node's <code>nodeClassifier</code> is used to decide the class. If the descender returns a <code>TreeNode</code> and a distribution, the <code>TreeClassifier</code> recursively calls itself for each of the subtrees and the predictions are weighted as requested by the descender.</p> 
    133  
    134 <p>When voting, subtrees do not predict the class but probabilities of classes. The predictions are multiplied by weights, summed and the most probable class is returned.</p> 
    135  
    136 <p><B>The rest of this section is only for those interested in the C++ code.</B></p> 
    137  
    138 <p>If you'd like to understand how the classification works in C++, start reading at <code>TTreeClassifier::vote</code>. It gets a <code>TreeNode</code>, an <code>Example</code> and a distribution of vote weights. For each node, it calls the <code>TTreeClassifier::classDistribution</code> and then multiplies and sums the distribution. <code>vote</code> returns a normalized distribution of predictions.</p> 
    139  
    140 <p>A new overload of <code>TTreeClassifier::classDistribution</code> gets an additional parameter, a <code>TreeNode</code>. This is done for the sake of recursion. The normal version of <code>classDistribution</code> simply calls the overloaded version with a tree root as an additional parameter. <code>classDistribution</code> uses <code>descender</code>. If descender reaches a leaf, it calls <code>nodeClassifier</code>, otherwise it calls <CODE>vote</CODE>.</P> 
    141  
    142 <p>Thus, the <code>TreeClassifier</code>'s <code>vote</code> and <code>classDistribution</code> are written in a form of double recursion. The recursive calls do not happen at each node of the tree but only at nodes where a vote is needed (that is, at nodes where the descender halts).</p> 
    143  
    144 <p>For predicting a class, <code>operator()</code> calls the descender. If it reaches a leaf, the class is predicted by the leaf's <code>nodeClassifier</code>. Otherwise, it calls <code>vote</code>. From now on, <code>vote</code> and <code>classDistribution</code> interweave down the tree and return a distribution of predictions. <code>operator()</code> then simply chooses the most probable class.</p> 
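<p>To make the double recursion easier to follow, here is a rough Python sketch of the same logic. This is not the C++ code; <code>class_distribution</code>, <code>vote</code> and <code>predict</code> are hypothetical stand-ins, and the sketch only assumes the documented <code>descender(node, example)</code> call, the node's <code>nodeClassifier</code> and the usual <code>orange.GetProbabilities</code> calling convention.</p>

<xmp class="code">import orange

# Rough sketch of TTreeClassifier::classDistribution / vote, not the C++ code.
def class_distribution(tree_classifier, node, example):
    node, vote_weights = tree_classifier.descender(node, example)
    if vote_weights is None:
        # The descender stopped at a leaf or an internal node: ask its nodeClassifier.
        return list(node.nodeClassifier(example, orange.GetProbabilities))
    return vote(tree_classifier, node, example, vote_weights)

def vote(tree_classifier, node, example, vote_weights):
    # Weighted sum of the subtrees' class distributions, normalized at the end.
    summed = None
    for branch, weight in zip(node.branches, vote_weights):
        if not branch:                      # null nodes do not vote
            continue
        dist = class_distribution(tree_classifier, branch, example)
        if summed is None:
            summed = [0.0] * len(dist)
        for i, p in enumerate(dist):
            summed[i] += weight * p
    total = sum(summed)
    return [p / total for p in summed]

def predict(tree_classifier, example):
    # Mirrors operator(): descend, then pick the most probable class.
    probs = class_distribution(tree_classifier, tree_classifier.tree, example)
    return probs.index(max(probs))          # index of the most probable class value
</xmp>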
    145  
    146  
    147 <A name="learning"></A> 
    148 <H3>Learning</H3> 
    149  
    150 <p>The main learning object is <CODE><INDEX name="classes/TreeLearner">TreeLearner</CODE>. It is basically a skeleton into which the user must plug the components for particular functions. For easier use, defaults are provided.</p> 
    151  
    152 <p>Components that govern the structure of the tree are <code>split</code> (of type <code>TreeSplitConstructor</code>), <code>stop</code> (of type <code>TreeStopCriteria</code>) and <code>exampleSplitter</code> (of type <code>TreeExampleSplitter</code>).</p> 
    153  
    154 <H4>TreeSplitConstructor</H4> 
    155  
    156 <p>The job of <code><INDEX name="classes/TreeSplitConstructor">TreeSplitConstructor</code> is to find a suitable criterion for dividing the learning (and later testing) examples coming to the node. The data it gets is a set of examples (and, optionally, an ID of weight meta-attribute), a domain contingency computed from examples, apriori class probabilities, a list of candidate attributes it should consider and a node classifier (if it was constructed, that is, if <CODE>storeNodeClassifier</CODE> is left <CODE>true</CODE>).</p> 
    157  
    158 <p>The <code>TreeSplitConstructor</code> should use the domain contingency when possible. The reasons are two-fold; one is that it's faster and the other is that the contingency matrices are not necessarily constructed by simply counting the examples. Why and how is explained later. There are, however, cases when the domain contingency does not suffice, for example, when ReliefF is used as a measure of attribute quality. In this case, there's no other way but to use the examples and ignore the precomputed contingencies.</p> 
    159  
    160 <p>The split constructor should consider only the attributes in the candidate list (the list is a vector of booleans, one for each attribute).</p> 
    161  
    162 <p><code>TreeSplitConstructor</code> returns most of the data we talked about when describing the <code>TreeNode</code>. It returns a classifier to be used as <code>TreeNode</code>'s <code>branchSelector</code>, a list of branch descriptions and a list with the number of examples that go into each branch. Just what we need for the <code>TreeNode</code>. It can return an empty list for the number of examples in branches; in this case, the <code>TreeLearner</code> will find the number itself after splitting the example set into subsets. However, if a split constructor can provide the numbers at no extra computational cost, it should do so.</P> 
    163  
    164 <p>In addition, it returns a quality of the split; a number without any fixed meaning except that higher numbers mean better splits.</p> 
    165  
    166 <p>If the constructed splitting criterion uses an attribute in such a way that the attribute is 'completely spent' and should not be considered as a split criterion in any of the subtrees (the typical cases of this are discrete attributes that are used as-they-are, that is, without any binarization or subsetting), then it should report the index of this attribute. Some splits do not spend any attribute; this is indicated by returning a negative index.</p> 
    167  
    168 <p>A <code>TreeSplitConstructor</code> can veto the further tree induction by returning no classifier. This can happen for many reasons. A general one is related to number of examples in the branches. <code>TreeSplitConstructor</code> has a field <code>minSubset</code>, which sets the minimal number of examples in a branch; null nodes, however, are allowed. If there is no split where this condition is met, <code>TreeSplitConstructor</code> stops the induction.</p> 
    169  
    170  
    171 <H4>TreeStopCriteria</H4> 
    172  
    173 <p><code><INDEX name="classes/TreeStopCriteria">TreeStopCriteria</code> is a much simpler component that, given a set of examples, weight ID and contingency matrices, decides whether to continue the induction or not. The basic criterion checks whether there are any examples and whether they belong to at least two different classes (if the class is discrete). Derived components check things like the number of examples and the proportion of majority classes.</p> 
    174  
    175  
    176 <H4>TreeExampleSplitter</H4> 
    177  
    178 <p><code><INDEX name="classes/TreeExampleSplitter">TreeExampleSplitter</code> is analogous to the <code>TreeDescender</code> described a while ago. Just like the <code>TreeDescender</code> decides the branch for an example during classification, the <code>TreeExampleSplitter</code> sorts the learning examples into branches.</p> 
    179  
    180 <p><code>TreeExampleSplitter</code> is given a <code>TreeNode</code> (from which it can use different stuff, but most splitters only use the <code>branchSelector</code>), a set of examples to be divided, and the weight ID. The result is a list of subsets of examples and, optionally, a list of new weight ID's.</p> 
    181  
    182 <p>Subsets are usually stored as <code>ExamplePointerTable</code>'s. Most of <code>TreeExampleSplitters</code> simply call the node's <code>branchSelector</code> and assign examples to corresponding branches. When the value is unknown they choose a particular branch or simply skip the example.</p> 
    183  
    184 <p>Some enhanced splitters can split examples. An example (actually, a pointer to it) is copied to more than one subset. To facilitate real splitting, weights are needed. Each branch is assigned a weight ID (each would usually have its own ID) and all examples that are in that branch (either completely or partially) should have this meta attribute. If an example hasn't been split, it has only one additional attribute - with weight ID corresponding to the subset to which it went. Example that is split between, say, three subsets, has three new meta attributes, one for each subset. ID's of weight meta attributes are returned by the <code>TreeExampleSplitter</code> to be used at induction of the corresponding subtrees.</p> 
    185  
    186 <p>Note that weights are used only when needed. When no splitting occurred - because the splitter is not able to do it or because there was no need for splitting - no weight ID's are returned.</p> 
    187  
    188 <H4>TreeLearner</H4> 
    189  
    190 <p>TreeLearner has a number of components.</p> 
    191  
    192 <P class=section> 
    193 <DL class=attributes> 
    194 <DT>split</DT> 
    195 <DD>Object of type <code>TreeSplitConstructor</code>. Default value, provided by <code>TreeLearner</code>, is <code>SplitConstructor_Combined</code> with separate constructors for discrete and continuous attributes. Discrete attributes are used as they are, while continuous attributes are binarized. Gain ratio is used to select attributes. A minimum of two examples in a leaf is required for discrete and five examples in a leaf for continuous attributes.</DD> 
    196  
    197 <DT>stop</DT> 
    198 <DD>Object of type <code>TreeStopCriteria</code>. The default stopping criterion stops induction when all examples in a node belong to the same class.</DD> 
    199  
    200 <DT>splitter</DT> 
    201 <DD>Object of type <code>TreeExampleSplitter</code>. The default splitter is <code>TreeExampleSplitter_UnknownsAsSelector</code> that splits the learning examples according to distributions given by the selector.</DD> 
    202  
    203 <DT>contingencyComputer</DT> 
    204 <DD>By default, this slot is left empty and ordinary contingency matrices are computed for examples at each node. If need arises, one can change the way the matrices are computed. This can be used to change the way that unknown values are treated when assessing qualities of attributes. As mentioned earlier, the computed matrices can be used by split constructor and by stopping criteria. On the other hand, they can be (and are) ignored by some splitting constructors.</DD> 
    205  
    206 <DT>nodeLearner</DT> 
    207 <DD>Induces a classifier from examples belonging to a node. The same learner is used for internal nodes and for leaves. The default <code>nodeLearner</code> is <code>MajorityLearner</code>.</DD> 
    208  
    209 <DT>descender</DT> 
    210 <DD>Descending component that the induced <code>TreeClassifier</code> will use. Default descender is <code>TreeDescender_UnknownMergeAsSelector</code> which votes using the <code>branchSelector</code>'s distribution for vote weights.</DD> 
    211  
    212 <DT>maxDepth</DT> 
    213 <DD>Gives maximal tree depth; 0 means that only the root is generated. The default is 100 to prevent any infinite tree induction due to wrongly set stopping criteria. If you are sure you need larger trees, increase it. If you, on the other hand, want to lower this hard limit, you can do so as well.</DD> 
    214  
    215 <DT>storeDistributions, storeContingencies, storeExamples, storeNodeClassifier</DT> 
    216 <DD>Decides whether to store class distributions, contingencies and examples in <code>TreeNodes</code>, and whether the <code>nodeClassifier</code> should be built for internal nodes. By default, distributions and node classifiers are stored, while contingencies and examples are not. You won't save any memory by not storing distributions but storing contingencies, since the distribution actually points to the same distribution that is stored in <code>contingency.classes</code>.</DD> 
    217 </DL> 
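<p>To make the components above concrete, here is a small sketch of plugging non-default components into a <code>TreeLearner</code>. It assumes the lenses data used in the examples below and the usual Orange convention that component attributes can also be passed as constructor keywords; treat it as an illustration rather than a recipe.</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

learner = orange.TreeLearner()
learner.split = orange.TreeSplitConstructor_Attribute(
    measure=orange.MeasureAttribute_gainRatio())
learner.stop = orange.TreeStopCriteria_common(minExamples=5)
learner.splitter = orange.TreeExampleSplitter_UnknownsToAll()
learner.maxDepth = 5                      # hard limit on the tree depth
learner.storeExamples = 1                 # keep the training examples in the nodes

treeClassifier = learner(data)
print treeClassifier(data[0])             # classify the first example
</xmp>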
    218  
    219 <p>The <code>TreeLearner</code> first sets the defaults for missing components. Although stored in the actual <code>TreeLearner</code>'s fields, they are removed when the induction is finished.</p> 
    220  
    221 <p>Then it ensures that examples are stored in a table. This is needed because the algorithm juggles with pointers to examples. If examples are in a file or are fed through a filter, they are copied to a table. Even if they are already in a table, they are copied if <code>storeExamples</code> is set. This is to assure that pointers remain pointing to examples even if the user later changes the example table. If they are in the table and the <code>storeExamples</code> flag is clear, we just use them as they are. This will obviously crash in a multi-threaded system if one changes the table during the tree induction. Well... don't do it.</p> 
    222  
    223 <p>Apriori class probabilities are computed. At this point we check the sum of example weights; if it's zero, there are no examples and we cannot proceed. A list of candidate attributes is set; in the beginning, all attributes are candidates for the split criterion.</p> 
    224  
    225 <p>Now comes the recursive part of the <code>TreeLearner</code>. Its arguments are a set of examples, a weight meta-attribute ID (a tricky thing, it can always be the same as the original or can change to accommodate splitting of examples among branches), apriori class distribution and a list of candidates (represented as a vector of Boolean values).</p> 
    226  
    227 <p>The contingency matrix is computed next. This happens even if the flag <code>storeContingencies</code> is <CODE>false</CODE>. If the <code>contingencyComputer</code> is given we use it, otherwise we construct just an ordinary contingency matrix.</p> 
    228  
    229 <p>A <code>stop</code> is called to see whether it is worth continuing. If not, a <code>nodeClassifier</code> is built and the <code>TreeNode</code> is returned. Otherwise, a <code>nodeClassifier</code> is only built if the <code>forceNodeClassifier</code> flag is set.</p> 
    230  
    231 <p>To get a <code>TreeNode</code>'s <code>nodeClassifier</code>, the <code>nodeLearner</code>'s <code>smartLearn</code> function is called with the given examples, weight ID and the just computed matrix. If the learner can use the matrix (and the default, <code>MajorityLearner</code>, can), it won't touch the examples. Thus, a choice of <code>contingencyComputer</code> will, in many cases, affect the <code>nodeClassifier</code>. The <code>nodeLearner</code> can return no classifier; if so and if the classifier would be needed for classification, the <code>TreeClassifier</code>'s function returns DK or an empty distribution. If you're writing your own tree classifier - pay attention.</p> 
    232  
    233 <p>If the induction is to continue, a <code>split</code> component is called. If it fails to return a branch selector, induction stops and the <code>TreeNode</code> is returned.</p> 
    234  
    235 <p><code>TreeLearner</code> then uses <code>ExampleSplitter</code> to divide the examples as described above.</p> 
    236  
    237 <p>The contingency gets removed at this point if it is not to be stored. Thus, the <code>split</code>, <code>stop</code> and <code>exampleSplitter</code> can use the contingency matrices if they wish.</p> 
    238  
    239 <p>The <code>TreeLearner</code> then recursively calls itself for each of the non-empty subsets. If the splitter returns a list of weights, a corresponding weight is used for each branch. In addition, the attribute spent by the splitter (if any) is removed from the list of candidates for the subtree.</p> 
    240  
    241 <p>A subset of examples is stored in its corresponding tree node, if so requested. If not, the new weight attributes are removed (if any were created).</p> 
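<p>The walkthrough above can be summarized as one recursive induction step. The following is a schematic Python rendition, not the actual C++ implementation: <code>Node</code> is a plain stand-in for <code>TreeNode</code>, the use of <code>orange.DomainContingency</code> and of the documented Python <code>__call__</code> signatures is an assumption, and the C++ split constructor additionally receives the contingency and node classifier, which are omitted here.</p>

<xmp class="code">import orange

class Node:                                   # plain stand-in for orange.TreeNode
    pass

def build_tree(learner, examples, weightID, apriori, candidates, depth=0):
    # The C++ code would use learner.contingencyComputer if one is given.
    contingency = orange.DomainContingency(examples, weightID)
    node = Node()
    if learner.storeDistributions:
        node.distribution = contingency.classes

    stopped = depth >= learner.maxDepth or learner.stop(examples, weightID, contingency)
    if stopped or learner.storeNodeClassifier:
        node.nodeClassifier = learner.nodeLearner(examples, weightID)
    if stopped:
        return node

    # The split constructor may veto further induction by returning no selector.
    selector, descriptions, sizes, quality, spent = \
        learner.split(examples, weightID, apriori, candidates)
    if not selector:
        return node

    node.branchSelector = selector
    node.branchDescriptions = descriptions
    subsets, weightIDs = learner.splitter(node, examples, weightID)

    if spent >= 0:                            # a spent attribute leaves the candidate list
        candidates = list(candidates)
        candidates[spent] = 0

    node.branches = []
    for i, subset in enumerate(subsets):
        if subset:
            wID = weightIDs[i] if weightIDs else weightID
            node.branches.append(
                build_tree(learner, subset, wID, apriori, candidates, depth + 1))
        else:
            node.branches.append(None)        # null node
    return node
</xmp>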
    242  
    243 <A name="pruning"></A> 
    244 <H3>Pruning</H3> 
    245  
    246 <p>Tree pruners derived from <code><INDEX name="classes/TreePrune">TreePrune</code> can be given either a <code>TreeNode</code> (presumably, but not necessarily a root) or a <code>TreeClassifier</code>. The result is a new, pruned <code>TreeNode</code> or a new <code>TreeClassifier</code> with a pruned tree. The original tree remains intact.</p> 
    247  
    248 <p>Note however that pruners construct only a shallow copy of a tree. The pruned tree's <code>TreeNode</code>s contain references to the same contingency matrices, node classifiers, branch selectors, ... as the original tree. Thus, you may modify a pruned tree structure (manually cut it, add new nodes, replace components) but modifying, for instance, some node's <code>nodeClassifier</code> (a <code>nodeClassifier</code> itself, not a reference to it!) would modify the node's <code>nodeClassifier</code> in the corresponding node of the original tree.</p> 
    249  
    250 <p>Talking about node classifiers - pruners cannot construct a <code>nodeClassifier</code> nor merge <code>nodeClassifiers</code> of the pruned subtrees into classifiers for new leaves. Thus, if you want to build a prunable tree, internal nodes must have their <code>nodeClassifiers</code> defined. Fortunately, all you need to do is nothing; if you leave the <code>TreeLearner</code>'s flags as they are by default, the <code>nodeClassifiers</code> are created.</p> 
    251  
    252 <hr> 
    253  
    254 <A name="classes"></A> 
    255 <H2>Classes</H2> 
    256  
    257 <p>Several classes described above are already functional and can (and mostly will) be used as they are. Those classes are <code>TreeNode</code>, <code>TreeLearner</code> and <code>TreeClassifier</code>. This section describes the other classes.</p> 
    258  
    259 <p>Classes <code>TreeSplitConstructor</code>, <code>TreeStopCriteria</code>, <code>TreeExampleSplitter</code>, <code>TreeDescender</code>, <code>Learner</code> and <code>Classifier</code> are among the Orange classes that can be subtyped in Python and have the call operator overloaded in such a way that it can be called back from C++ code. You can thus program your own components for <code>TreeLearners</code> and <code>TreeClassifiers</code>. The detailed information on how this is done and what can go wrong is given in a separate page, dedicated to <A href="callbacks.htm">callbacks to Python</A>.</p> 
    260  
    261  
    262 <H3>TreeSplitConstructor and Derived Classes</H3> 
    263  
    264 <p>Split construction is almost as exciting as waiting for a delayed flight. Boring, that is. Split constructors are programmed as spaghetti code that juggles with contingency matrices, with separate cases for discrete and continuous classes... Most split constructors work either for discrete or for continuous attributes. The suggested practice is to use a <code>TreeSplitConstructor_Combined</code> that can handle both by simply delegating attributes to specialized split constructors.</p> 
    265  
    266 <p><B>Note:</B> split constructors that cannot handle attributes of particular type (discrete, continuous) do not report an error or a warning but simply skip the attribute. It is your responsibility to use a correct split constructor for your dataset. (May we again suggest using <code>TreeSplitConstructor_Combined</code>?)</p> 
    267  
    268 <p>The same components can be used for inducing both classification and regression trees. The only component that needs to be chosen accordingly is the 'measure' attribute for the <code>TreeSplitConstructor_Measure</code> class (and derived classes).</p> 
    269  
    270 <H4>TreeSplitConstructor</H4> 
    271  
    272 <p>The <code>TreeSplitConstructor</code>'s function has been described in detail in the <a href="#learning">description of the learning process</A>.</p> 
    273  
    274 <p class=section>Attributes</P> 
    275 <DL class=attributes> 
    276 <DT>minSubset</DT> 
    277 <DD>Sets the minimal number of examples in non-null leaves. As always in Orange (where not specified otherwise), "number of examples" refers to the weighted number of examples.</DD> 
    278 </DL> 
    279  
    280 <p class=section>Methods</P> 
    281 <DL class=attributes> 
    282 <DT>__call__(examples[, weightID, apriori_distribution, candidates])</DT> 
    283 <DD>Constructs a split. Examples can be given in any acceptable form (an <code>ExampleGenerator</code>, such as <code>ExampleTable</code>, or a list of examples). <code>WeightID</code> is optional; the default of 0 means that all examples have a weight of 1.0. Apriori-distribution should be of type <code>orange.Distribution</code> and candidates should be a Python list of objects which are interpreted as booleans. 
    284 The function returns a tuple (<code>branchSelector</code>, <code>branchDescriptions</code>, <code>subsetSizes</code>, <code>quality</code>, <code>spentAttribute</code>). <code>SpentAttribute</code> is -1 if no attribute is completely spent by the split criterion. If no split is constructed, the <code>selector</code>, <code>branchDescriptions</code> and <code>subsetSizes</code> are <code>None</code>, while <CODE>quality</CODE> is 0.0 and <code>spentAttribute</code> is -1.</DD> 
    285 </DL> 
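<p>A split constructor can also be called directly, outside of tree induction. The following is a hedged sketch on the lenses data; it assumes that <code>orange.Distribution(classVar, examples)</code> computes the apriori class distribution and that keyword construction of the components works as usual.</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")
apriori = orange.Distribution(data.domain.classVar, data)
candidates = [1] * len(data.domain.attributes)     # consider every attribute

split = orange.TreeSplitConstructor_Attribute(
    measure=orange.MeasureAttribute_gainRatio())
selector, descriptions, sizes, quality, spent = split(data, 0, apriori, candidates)

print selector.classVar.name        # the attribute chosen for the split
print descriptions                  # one description per branch
print quality, spent                # split quality; spent is >= 0 for a spent attribute
</xmp>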
    286  
    287  
    288 <H4>TreeSplitConstructor_Measure</H4> 
    289 <index name="classes/TreeSplitConstructor_Measure"> 
    290  
    291 <p>An abstract base class for split constructors that employ a <code>MeasureAttribute</code> to assess a quality of a split. At present, all split constructors except for <code>TreeSplitConstructor_Combined</code> are derived from this class.</p> 
    292  
    293 <p class=section>Attributes</p> 
    294 <DL class=attributes> 
    295 <DT>measure</DT> 
    296 <DD>A component of type <code>MeasureAttribute</code> used for evaluation of a split. Note that you must select a subclass of <code>MeasureAttribute</code> capable of handling your class type - you cannot use <code>MeasureAttribute_gainRatio</code> for building regression trees or <code>MeasureAttribute_MSE</code> for classification trees.</DD> 
    297  
    298 <DT>worstAcceptable</DT> 
    299 <DD>The lowest required split quality for a split to be acceptable. Note that this value makes sense only in connection with a <code>measure</code> component. Default is 0.0.</DD> 
    300 </DL> 
    301  
    302 <H4>TreeSplitConstructor_Attribute</H4> 
    303 <index name="classes/TreeSplitConstructor_Attribute"> 
    304  
    305 <p><code>TreeSplitConstructor_Attribute</code> attempts to use a discrete attribute as a split; each value of the attribute corresponds to a branch in the tree. Attributes are evaluated with the <code>measure</code> and the one with the highest score is used for a split. If there is more than one attribute with the highest score, one of them is selected at random.</P> 
    306  
    307 <p>The constructed <code>branchSelector</code> is an instance of <code>ClassifierFromVarFD</code> that returns a value of the selected attribute. If the attribute is <code>EnumVariable</code>, <code>branchDescription</code>'s are the attribute's values. The attribute is marked as spent, so that it cannot reappear in the node's subtrees.</p> 
    308  
    309 <H4>TreeSplitConstructor_ExhaustiveBinary</H4> 
    310 <index name="classes/TreeSplitConstructor_ExhaustiveBinary"> 
    311 <index name="binarization (in trees)" 
    312  
    313 <p><code>TreeSplitConstructor_ExhaustiveBinary</code> works on discrete attributes. For each attribute, it determines which binarization of the attribute gives the split with the highest score. If more than one split has the highest score, one of them is selected at random. After trying all the attributes, it returns one of those with the highest score.</p> 
    314  
    315 <p>The constructed <CODE>branchSelector</CODE> is again an instance of <code>ClassifierFromVarFD</code> that returns a value of the selected attribute. This time, however, its <code>transformer</code> contains an instance of <code>MapIntValue</code> that maps the values of the attribute into a binary attribute. Branch descriptions are of the form "[&lt;val1&gt;, &lt;val2&gt;, ...&lt;valn&gt;]" for branches corresponding to more than one value of the attribute. Branches that correspond to a single value of the attribute are described with this value. If the attribute was originally binary, it is spent and cannot be used in the node's subtrees. Otherwise, it can reappear in the subtrees.</p> 
    316  
    317 <H4>TreeSplitConstructor_Threshold</H4> 
    318 <index name="classes/TreeSplitConstructor_Threshold"> 
    319  
    320 <p>This is currently the only constructor for splits with continuous attributes. It divides the range of attribute values with a threshold that maximizes the split's quality. As always, if there is more than one split with the highest score, a random threshold is selected. The attribute that yields the best binary split is returned.</p> 
    321  
    322 <p>The constructed <code>branchSelector</code> is again an instance of <code>ClassifierFromVarFD</code> with an attached <code>transformer</code>. This time, <code>transformer</code> is of type <code>ThresholdDiscretizer</code>. The branch descriptions are "&lt;threshold" and "&gt;=threshold". The attribute is not spent.</p> 
    323  
    324  
    325 <H4>TreeSplitConstructor_Combined</H4> 
    326 <index name="classes/TreeSplitConstructor_Combined"> 
    327  
    328 <p>This constructor delegates the task of finding the optimal split to separate split constructors for discrete and for continuous attributes. Each split constructor is called, given only attributes of appropriate types as candidates. Both construct a candidate for a split; the better of them is selected.</p> 
    329  
    330 <p>(Note that there is a problem when several candidates have the same score. Suppose there are nine discrete attributes with the highest score; the split constructor for discrete attributes will select one of them. Now, let us suppose that there is a single continuous attribute with the same score. <code>TreeSplitConstructor_Combined</code> would randomly select between the proposed discrete attribute and the continuous attribute, not aware of the fact that the discrete has already competed with eight other discrete attributes. So, the probability for selecting (each) discrete attribute would be 1/18 instead of 1/10. Although not really correct, we doubt that this would affect the tree's performance; many other machine learning systems simply choose the first attribute with the highest score anyway.)</p> 
    331  
    332 <p>The <code>branchSelector</code>, <code>branchDescriptions</code> and whether the attribute is spent is decided by the winning split constructor.</p> 
    333  
    334 <p class=section>Attributes</p> 
    335 <DL class=attributes> 
    336 <DT>discreteSplitConstructor</DT> 
    337 <DD>Split constructor for discrete attributes; can be, for instance, <code>TreeSplitConstructor_Attribute</code> or <code>TreeSplitConstructor_ExhaustiveBinary</code></DD> 
    338  
    339 <DT>continuousSplitConstructor</DT> 
    340 <DD>Split constructor for continuous attributes; at the moment, it can be either <code>TreeSplitConstructor_Threshold</code> or a split constructor you programmed in Python.</DD> 
    341 </DL> 
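<p>A typical setup, sketched below under the same assumptions as the earlier snippets: the combined constructor gets one sub-constructor for discrete and one for continuous attributes, both scored with the same measure.</p>

<xmp class="code">import orange

split = orange.TreeSplitConstructor_Combined()
split.discreteSplitConstructor = orange.TreeSplitConstructor_ExhaustiveBinary(
    measure=orange.MeasureAttribute_gainRatio())
split.continuousSplitConstructor = orange.TreeSplitConstructor_Threshold(
    measure=orange.MeasureAttribute_gainRatio())

learner = orange.TreeLearner()
learner.split = split
# With the all-discrete lenses data only the discrete sub-constructor is exercised.
treeClassifier = learner(orange.ExampleTable("lenses"))
</xmp>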
    342  
    343  
    344 <H3>TreeStopCriteria and TreeStopCriteria_common</H3> 
    345  
    346 <p><code>TreeStopCriteria</code> determines when to stop the induction of subtrees, as described in detail in <a href="#learning">description of the learning process</A>.</p> 
    347  
    348 <H4>TreeStopCriteria</H4> 
    349  
    350 <p>As opposed to <code>TreeSplitConstructor</code> and similar basic classes, <code>TreeStopCriteria</code> is not an abstract but a fully functional class that provides the basic stopping criteria. That is, the tree induction stops when there is at most one example left; in this case, it is not the weighted but the actual number of examples that counts. Besides that, the induction stops when all examples are in the same class (for discrete problems) or have the same value of the outcome (for regression problems).</p> 
    351  
    352 <p class=section>Methods</p> 
    353 <DL class=attributes> 
    354 <DT>__call__(examples[, weightID, domain contingencies])</DT> 
    355 <DD> 
    356 Decides whether to stop (<CODE>true</CODE>) or continue (<CODE>false</CODE>) the induction. If contingencies are given, they are used for checking whether the examples are in the same class (but not for counting the examples); if they are not given, <code>TreeStopCriteria</code> works without them. Derived classes should use the contingencies whenever they are available and compute them themselves only when they really need them. 
    357 </DD> 
    358 </DL> 
    359  
    360 <H4>TreeStopCriteria_common</H4> 
    361 <index name="classes/TreeStopCriteria_common"> 
    362  
    363 <p><code>TreeStopCriteria_common</code> contains additional criteria for pre-pruning: it checks the proportion of majority class and the number of weighted examples.</p> 
    364  
    365 <p class=section>Attributes</p> 
    366 <DL class=attributes> 
    367 <DT>maxMajor</DT> 
    368 <DD>Maximal proportion of majority class. When this is exceeded, induction stops.</DD> 
    369  
    370 <DT>minExamples</DT> 
    371 <DD>Minimal number of examples in internal nodes. Subsets with fewer than <code>minExamples</code> examples are not split any further. The example count is weighted.</DD> 
    372 </DL> 
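<p>For illustration, a stopping criterion can be configured and also called directly; the sketch below sticks to <code>minExamples</code> and assumes the lenses data is available, as in the examples further down.</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

stop = orange.TreeStopCriteria_common()
stop.minExamples = 5                 # weighted example count, as described above

print stop(data)                     # 1 (stop) or 0 (continue), depending on the data

learner = orange.TreeLearner()
learner.stop = stop
treeClassifier = learner(data)
</xmp>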
    373  
    374 <H3>TreeExampleSplitter and derived classes</H3> 
    375  
    376 <p><code>TreeExampleSplitter</code> is the third crucial component of <code>TreeLearner</code>. Its function is described in <a href="#learning">description of the learning process</A>.</p> 
    377  
    378 <H4>TreeExampleSplitter</H4> 
    379  
    380 <p>An abstract base class for objects that split sets of examples into subsets. The derived classes differ in treatment of examples which cannot be unambiguously placed into a single branch (usually due to unknown value of the crucial attribute).</p> 
    381  
    382 <P class=section>Methods</p> 
    383 <DL class=attributes> 
    384 <DT><B>__call__(node, examples[, weightID])</B></DT> 
    385 <DD> 
    386 Uses the information in <code>node</code> (particularly the <code>branchSelector</code>) to split the given set of examples into subsets. The function returns a tuple with a list of example generators and a list of weights. The list of weights is either an ordinary Python list of integers or None when no splitting of examples occurs and thus no weights are needed. 
    387 </DD> 
    388 </DL> 
    389  
    390 <p class=section>Derived classes</p> 
    391 <DL class=attributes> 
    392 <DT><INDEX name="classes/TreeExampleSplitter_IgnoreUnknowns">TreeExampleSplitter_IgnoreUnknowns</DT> 
    393 <DD>Simply ignores the examples for which no single branch can be determined.</DD> 
    394  
    395 <DT><INDEX name="classes/TreeExampleSplitter_UnknownsToCommon">TreeExampleSplitter_UnknownsToCommon</DT> 
    396 <DD>Places all such examples into the branch with the highest number of examples. If there is more than one such branch, one is selected at random and then used for all examples.</DD> 
    397  
    398 <DT><INDEX name="classes/TreeExampleSplitter_UnknownsToAll">TreeExampleSplitter_UnknownsToAll</DT> 
    399 <DD>Places examples with unknown value of the attribute into all branches.</DD> 
    400  
    401 <DT><INDEX name="classes/TreeExampleSplitter_UnknownsToRandom">TreeExampleSplitter_UnknownsToRandom</DT> 
    402 <DD>Selects a random branch for such examples.</DD> 
    403  
    404 <DT><INDEX name="classes/TreeExampleSplitter_UnknownsToBranch">TreeExampleSplitter_UnknownsToBranch</DT> 
    405 <DD>Constructs an additional branch to contain all such examples. The branch's description is "unknown".</DD> 
    406  
    407 <DT><INDEX name="classes/TreeExampleSplitter_UnknownsAsBranchSizes">TreeExampleSplitter_UnknownsAsBranchSizes</DT> 
    408 <DD>Splits examples with unknown value of the attribute according to proportions of examples in each branch.</DD> 
    409  
    410 <DT><INDEX name="classes/TreeExampleSplitter_UnknownsAsSelector">TreeExampleSplitter_UnknownsAsSelector</DT> 
    411 <DD>Splits examples with unknown value of the attribute according to distribution proposed by selector (which is in most cases the same as proportions of examples in branches).</DD> 
    412 </DL> 
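<p>Choosing a splitter is a matter of setting the learner's <code>splitter</code> component before induction; a short, hedged sketch under the same assumptions as above:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

learner = orange.TreeLearner()
# Send examples with an unknown value of the split attribute down every branch,
# instead of following the selector's distribution (the default described above).
learner.splitter = orange.TreeExampleSplitter_UnknownsToAll()
treeClassifier = learner(data)
</xmp>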
    413  
    414  
    415  
    416 <H3>TreeDescender and derived classes</H3> 
    417  
    418 <p>This is a classifier's counterpart for <code>TreeExampleSplitter</code>. It decides the destiny of examples that need to be classified and cannot be unambiguously put in a branch. The detailed function of this class is given in <a href="#classification"> description of classification with trees</A>.</p> 
    419  
    420 <H4>TreeDescender</H4> 
    421  
    422 <p>An abstract base object for tree descenders.</p> 
    423  
    424 <p class=section>Methods</p> 
    425 <DL class=attributes> 
    426 <DT>__call__(node, example)</DT> 
    427 <DD>Descends down the tree until it reaches a leaf or a node in which a vote of subtrees is required. In both cases, a tuple of two elements is returned; in the former, the tuple contains the reached node and <CODE>None</CODE>, in the latter it contains a node and weights of votes for subtrees (a list of floats). 
    428  
    429 <code>TreeDescender</code>'s that never split examples always descend to a leaf, but they differ in the treatment of examples with unknown values (or, in general, examples for which a branch cannot be determined at some node(s) of the tree). <code>TreeDescender</code>'s that do split examples differ in the returned vote weights.</DD> 
    430 </DL> 
    431  
    432 <p class=section>Derived classes</p> 
    433 <DL class=attributes> 
    434 <DT><INDEX name="classes/TreeDescender_UnknownsToNode">TreeDescender_UnknownsToNode</DT> 
    435 <DD>When an example cannot be classified into a single branch, the current node is returned. Thus, the node's <CODE>nodeClassifier</CODE> will be used to make a decision. It is your responsibility to see that even the internal nodes have their <CODE>nodeClassifiers</CODE> (i.e., don't disable creating node classifiers or manually remove them after the induction, that's all).</DD> 
    436  
    437 <DT><INDEX name="classes/TreeDescender_UnknownsToCommon">TreeDescender_UnknownsToCommon</DT> 
    438 <DD>Classifies examples with unknown value to a special branch. This makes sense only if the tree itself was constructed with <CODE>TreeExampleSplitter_UnknownsToBranch</CODE>.</DD> 
    439  
    440 <DT><INDEX name="classes/TreeDescender_UnknownsToCommonBranch">TreeDescender_UnknownsToCommonBranch</DT> 
    441 <DD>Classifies examples with unknown values to the branch with the highest number of examples. If there is more than one such branch, random branch is chosen for each example that is to be classified.</DD> 
    442  
    443 <DT><INDEX name="classes/TreeDescender_UnknownsToCommonSelector">TreeDescender_UnknownsToCommonSelector</DT> 
    444 <DD>Classifies examples with unknown values to the branch which received the highest recommendation by the selector.</DD> 
    445  
    446  
    447 <DT><INDEX name="classes/TreeDescender_MergeAsBranchSizes">TreeDescender_MergeAsBranchSizes</DT> 
    448 <DD>Makes the subtrees vote for the example's class; the vote is weighted according to the sizes of the branches.</DD> 
    449  
    450 <DT><INDEX name="classes/TreeDescender_MergeAsSelector">TreeDescender_MergeAsSelector</DT> 
    451 <DD>Makes the subtrees vote for the example's class; the vote is weighted according to the selector's proposal.</DD> 
    452 </DL> 
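<p>The descender can also be exercised directly. The sketch below assumes that the induced classifier exposes its <code>descender</code> and <code>tree</code> attributes, as they are used elsewhere in this documentation, and that the default node classifiers are <code>DefaultClassifier</code>s with a <code>defaultValue</code>.</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

# Follow an example down the tree with the classifier's own descender.
node, vote_weights = treeClassifier.descender(treeClassifier.tree, data[0])
if vote_weights is None:
    print "stopped at a node predicting", node.nodeClassifier.defaultValue
else:
    print "vote weights for the subtrees:", list(vote_weights)
</xmp>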
    453  
    454  
    455 <H3>TreePruner and derived classes</H3> 
    456 <index name="classification trees/pruning"> 
    457 <index name="pruning classification trees"> 
    458  
    459 <p>Classes derived from TreePruner prune the trees as described in the section <A href="#pruning">pruning</A> - make sure you read it to understand what the pruners will do to your trees.</p> 
    460  
    461 <H4>TreePruner</H4> 
    462  
    463 <p>This is an abstract base class which defines nothing useful, only a pure virtual call operator.</p> 
    464  
    465 <P class=section>Methods</P> 
    466 <DL class=attributes> 
    467 <DT>__call__(tree)</DT> 
    468 <DD>Prunes a tree. The argument can be either a tree classifier or a tree node; the result is of the same type as the argument.</DD> 
    469 </DL> 
    470  
    471  
    472 <H4>TreePruner_SameMajority</H4> 
    473 <index name="classes/TreePruner_SameMajority"> 
    474  
    475 <p>In Orange, a tree can have non-trivial subtrees (i.e. subtrees with more than one leaf) in which all the leaves have the same majority class. (The reason why this is allowed is that those leaves can still have different distributions of classes and thus predict different probabilities.) However, this can be undesirable when we're only interested in the class prediction or a simple tree interpretation. The <code>TreePruner_SameMajority</code> prunes the tree so that there is no subtree in which all the nodes would have the same majority class.</p> 
    476  
    477 <p>This pruner will only prune the nodes in which the node classifier is of class <code>DefaultClassifier</code> (or from a derived class).</p> 
    478  
    479 <p>Note that the leaves with more than one majority class require some special handling. The pruning goes backwards, from leaves to the root. When siblings are compared, the algorithm checks whether they have (at least one) common majority class. If so, they can be pruned.</p> 
    480  
    481 <H4>TreePruner_m</H4> 
    482 <index name="classes/TreePruner_m"> 
    483  
    484 <p>Prunes a tree by comparing m-estimates of static and dynamic error as defined in (Bratko, 2002).</p> 
    485  
    486 <p class=section>Attributes</p> 
    487 <DL class=attributes> 
    488 <DT>m</DT> 
    489 <DD>Parameter m for m-estimation.</DD> 
    490 </DL> 
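<p>Using a pruner then looks like the sketch below (under the same assumptions as the earlier snippets); both pruners accept a tree classifier and return a new one, leaving the original intact.</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

pruner = orange.TreePruner_m()
pruner.m = 2.0
prunedClassifier = pruner(treeClassifier)      # a TreeClassifier in, a TreeClassifier out

samePruned = orange.TreePruner_SameMajority()(treeClassifier)

print treeClassifier(data[0]), prunedClassifier(data[0])
</xmp>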
    491  
    492 <hr> 
    493  
    494 <A name="examples"></A> 
    495 <H2>Examples</H2> 
    496  
    497 <p>This page does not provide examples for programming your own components, such as, for instance, a <code>TreeSplitConstructor</code>. Those examples can be found on a page dedicated to <A href="callbacks.htm">callbacks to Python</A>.</p> 
    498  
    499 <H3>Tree Structure</H3> 
    500  
    501 <p>To have something to work on, we'll take the data from the lenses dataset and build a tree using the default components.</p> 
    502  
    503 <p class="header">part of <a href="treestructure.py">treestructure.py</a> 
    504 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    505 <xmp class="code">>>> data = orange.ExampleTable("lenses") 
    506 >>> treeClassifier = orange.TreeLearner(data) 
    507 </xmp> 
    508  
    509 <p>How big is our tree?</p> 
    510  
    511 <p class="header">part of <a href="treestructure.py">treestructure.py</a> 
    512 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    513 <xmp class="code">def treeSize(node): 
    514     if not node: 
    515         return 0 
    516  
    517     size = 1 
    518     if node.branchSelector: 
    519         for branch in node.branches: 
    520             size += treeSize(branch) 
    521  
    522     return size 
    523 </xmp> 
    524  
    525 <p>If node is <CODE>None</CODE>, we have a null-node; null nodes don't count, so we return 0. Otherwise, the size is 1 (this node) plus the sizes of all subtrees. The node is an internal node if it has a <CODE>node.branchSelector</CODE>; if there's no selector, it's a leaf. Don't attempt to skip the <CODE>if</CODE> statement: leaves don't have an empty list of branches, they don't have a list of branches at all.</p> 
    526  
    527 <xmp>>>> treeSize(treeClassifier.tree) 
    528 10 
    529 </xmp> 
    530  
    531 <p>Don't forget that this was only an exercise - <code>TreeNode</code> has a built-in method <code>treesize()</code> that does exactly the same.</p> 
    532  
    533 <p>Let us now write a simple script that prints out a tree. The recursive part of the function will get a node and its level.</p> 
    534  
    535 <p class="header">part of <a href="treestructure.py">treestructure.py</a> 
    536 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    537 <xmp class="code">def printTree0(node, level): 
    538     if not node: 
    539         print " "*level + "<null node>" 
    540         return 
    541  
    542     if node.branchSelector: 
    543         nodeDesc = node.branchSelector.classVar.name 
    544         nodeCont = node.distribution 
    545         print "\n" + "   "*level + "%s (%s)" % (nodeDesc, nodeCont), 
    546         for i in range(len(node.branches)): 
    547             print "\n" + "   "*level + ": %s" % node.branchDescriptions[i], 
    548             printTree0(node.branches[i], level+1) 
    549     else: 
    550         nodeCont = node.distribution 
    551         majorClass = node.nodeClassifier.defaultValue 
    552         print "--> %s (%s) " % (majorClass, nodeCont), 
    553 </xmp> 
    554  
    555 <p>Don't waste time on studying formatting tricks (\n's etc.), this is just for nicer output. What matters is everything but the print statements. First, we check whether the node is a null-node (a node to which no learning examples were classified). If so, we just print out "&lt;null node&gt;" and return.</p> 
    556  
    557 <p>After handling null nodes, remaining nodes are internal nodes and leaves. For internal nodes, we print a node description consisting of the attribute's name and distribution of classes. <code>TreeNode</code>'s branch selector is, for all currently defined splits, an instance of a class derived from <code>Classifier</code> (in fact, it is a <code>ClassifierFromVarFD</code>, but a <code>Classifier</code> would suffice), and its <code>classVar</code> points to the attribute we seek. So we print its name. We will also assume that storing class distributions has not been disabled and print them as well. A more capable function for printing trees (such as the one defined in <code>orngTree</code>) has an alternative means to get the distribution when this fails. Then we iterate through branches; for each we print a branch description and iteratively call the <code>printTree0</code> with a level increased by 1 (to increase the indent).</p> 
    558  
    559 <p>Finally, if the node is a leaf, we print out the distribution of learning examples in the node and the class to which the examples in the node would be classified. We again assume that the <code>nodeClassifier</code> is the default one - a <code>DefaultClassifier</code>. A better print function should be aware of possible alternatives.</p> 
    560  
    561 <p>Now, we just need to write a simple function to call our printTree0. We could write something like...</p> 
    562 <xmp class="code">def printTree(x): 
    563     printTree0(x.tree, 0) 
    564 </xmp> 
    565 <p>... but we won't. Let us learn how to handle arguments of different types. Let's write a function that will accept either a <code>TreeClassifier</code> or a <code>TreeNode</code> (just like <code>TreePruners</code>, remember?)</p> 
    566  
    567 <p class="header">part of <a href="treestructure.py">treestructure.py</a> 
    568 (uses <a href="lenses.tab">lenses.tab</a>)</p> 
    569 <xmp class="code">def printTree(x): 
    570     if isinstance(x, orange.TreeClassifier): 
     125    #. Descender reaches a leaf. This happens when nothing went wrong  
     126       (there are no unknown or out-of-range values in the example) or  
     127       when things went wrong, but the descender smoothed them by  
     128       selecting a single branch and continued the descent. In this 
     129       case, the descender returns the reached :obj:`TreeNode`. 
     130    #. :obj:`branchSelector` returned a distribution and the  
     131       :obj:`TreeDescender` decided to stop the descent at this  
     132       (internal) node.  Again, descender returns the current  
     133       :obj:`TreeNode` and nothing else. 
     134    #. :obj:`branchSelector` returned a distribution and the  
     135       :obj:`TreeNode` wants to split the example (i.e., to decide the  
     136       class by voting).  
     137 
     138    It returns a :obj:`TreeNode` and the vote-weights for the branches.  
     139    The weights can correspond to the distribution returned by 
     140    :obj:`branchSelector`, to the number of learning examples that 
     141    were assigned to each branch, or to something else. 
     142 
     143    :obj:`TreeClassifier` uses the descender to descend from the root.  
     144    If it returns only a :obj:`TreeNode` and no distribution, the  
     145    descent should stop; it does not matter whether it's a leaf (the 
     146    first case above) or an internal node (the second case). The node's 
     147    :obj:`nodeClassifier` is used to decide the class. If the descender 
     148    returns a :obj:`TreeNode` and a distribution, the :obj:`TreeClassifier` 
     149    recursively calls itself for each of the subtrees and the  
     150    predictions are weighted as requested by the descender. 
     151 
     152    When voting, subtrees do not predict the class but probabilities  
     153    of classes. The predictions are multiplied by weights, summed and  
     154    the most probable class is returned. 
     155 
     156 
     157 
     158The rest of this section is only for those interested in the C++ code. 
     159====================================================================== 
     160 
     161If you'd like to understand how the classification works in C++,  
     162start reading at <code>TTreeClassifier::vote</code>. It gets a  
     163 :obj:`TreeNode`, an :obj:`orange.Example` and a distribution of  
     164vote weights. For each node, it calls the  
     165<code>TTreeClassifier::classDistribution</code> and then multiplies  
     166and sums the distribution. <code>vote</code> returns a normalized  
     167distribution of predictions. 
     168 
     169A new overload of <code>TTreeClassifier::classDistribution</code> gets 
     170an additional parameter, a :obj:`TreeNode`. This is done  
     171for the sake of recursion. The normal version of  
     172 <code>classDistribution</code> simply calls the overloaded version with a  
     173tree root as an additional parameter. <code>classDistribution</code>  
     174uses <code>descender</code>. If descender reaches a leaf, it calls  
     175:obj:`nodeClassifier`, otherwise it calls <CODE>vote</CODE>. 
     176 
     177Thus, the :obj:`TreeClassifier`'s <code>vote</code> and  
     178<code>classDistribution</code> are written in a form of double  
     179recursion. The recursive calls do not happen at each node of the  
     180tree but only at nodes where a vote is needed (that is, at nodes  
     181where the descender halts). 
     182 
     183 For predicting a class, <code>operator()</code> calls the 
     184descender. If it reaches a leaf, the class is predicted by the  
     185leaf's :obj:`nodeClassifier`. Otherwise, it calls  
     186<code>vote</code>. From now on, <code>vote</code> and  
     187<code>classDistribution</code> interweave down the tree and return  
     188a distribution of predictions. <code>operator()</code> then simply  
     189chooses the most probable class. 
     190 
     191 
     192======== 
     193Learning 
     194======== 
     195 
     196The main learning object is :obj:`TreeLearner`. It is basically  
     197a skeleton into which the user must plug the components for particular  
     198functions. For easier use, defaults are provided. 
     199 
     200<p>Components that govern the structure of the tree are <code>split</code>  
     201(of type :obj:`TreeSplitConstructor`), <code>stop</code> (of  
     202 type :obj:`TreeStopCriteria`) and <code>exampleSplitter</code>  
     203(of type :obj:`TreeExampleSplitter`).</p> 
     204 
     205.. class:: TreeSplitConstructor 
     206 
     207    Finds a suitable criterion for dividing the learning (and later testing) 
     208    examples coming to the node. The data it gets is a set of examples 
     209    (and, optionally, an ID of weight meta-attribute), a domain 
     210    contingency computed from examples, apriori class probabilities, 
     211    a list of candidate attributes it should consider and a node 
     212    classifier (if it was constructed, that is, if  
     213    :obj:`storeNodeClassifier` is left true). 
     214 
     215    The :obj:`TreeSplitConstructor` should use the domain contingency 
     216    when possible. The reasons are two-fold; one is that it's faster 
     217    and the other is that the contingency matrices are not necessarily 
     218    constructed by simply counting the examples. Why and how is 
     219    explained later. There are, however, cases when the domain contingency 
     220    does not suffice, for example, when ReliefF is used as a measure 
     221    of quality of attributes. In this case, there's no other way but 
     222    to use the examples and ignore the precomputed contingencies. 
     223 
     224    The split constructor should consider only the attributes in the 
     225    candidate list (the list is a vector of booleans, one for each 
     226    attribute). 
     227 
     228    :obj:`TreeSplitConstructor` returns most of the data we talked 
     229    about when describing the :obj:`TreeNode`. It returns a classifier 
     230    to be used as :obj:`TreeNode`'s :obj:`branchSelector`, a list of 
     231    branch descriptions and a list with the number of examples that 
     232    go into each branch. Just what we need for the :obj:`TreeNode`. 
     233    It can return an empty list for the number of examples in branches; 
     234    in this case, the :obj:`TreeLearner` will find the number itself 
     235    after splitting the example set into subsets. However, if a split 
     236    constructor can provide the numbers at no extra computational 
     237    cost, it should do so. 
     238 
     239    In addition, it returns a quality of the split; a number without 
     240    any fixed meaning except that higher numbers mean better splits. 
     241 
     242    If the constructed splitting criterion uses an attribute in such 
     243    a way that the attribute is 'completely spent' and should not be 
     244    considered as a split criterion in any of the subtrees (the 
     245    typical cases of this are discrete attributes that are used 
     246    as-they-are, that is, without any binarization or subsetting), 
     247    then it should report the index of this attribute. Some splits 
     248    do not spend any attribute; this is indicated by returning a 
     249    negative index. 
     250 
     251    A :obj:`TreeSplitConstructor` can veto the further tree induction 
     252    by returning no classifier. This can happen for many reasons. 
     253    A general one is related to number of examples in the branches. 
     254    :obj:`TreeSplitConstructor` has a field <code>minSubset</code>, 
     255    which sets the minimal number of examples in a branch; null nodes, 
     256    however, are allowed. If there is no split where this condition 
     257    is met, :obj:`TreeSplitConstructor` stops the induction. 
     258 
     259    .. attribute:: minSubset 
     260 
     261        Sets the minimal number of examples in non-null leaves. As 
     262        always in Orange (where not specified otherwise), "number of  
     263        examples" refers to the weighted number of examples. 
     264     
     265    .. method:: test() 
     266 
     267        Bla bla. 
     268 
     269    .. method:: __call__(examples[, weightID=0, apriori_distribution, candidates])  
     270 
     271        Construct a split. Returns a tuple (:obj:`branchSelector`, 
     272        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`, 
     273        :obj:`spentAttribute`). :obj:`SpentAttribute` is -1 if no 
     274        attribute is completely spent by the split criterion. If no 
     275        split is constructed, the :obj:`selector`, :obj:`branchDescriptions` 
     276        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and 
     277        :obj:`spentAttribute` is -1. 
     278 
     279        :param examples:  Examples can be given in any acceptable form 
     280            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a 
     281            list of examples). 
     282        :param weightID: Optional; the default of 0 means that all 
     283            examples have a weight of 1.0.  
     284        :param apriori_distribution: Should be of type  
     285            :obj:`orange.Distribution`. 
     286        :param candidates: A Python list of objects which are interpreted as booleans. 
     287 
     288 
     289.. class:: TreeStopCriteria 
     290 
     291    Given a set of examples, weight ID and contingency matrices, decide 
     292    whether to continue the induction or not. The basic criterion checks 
     293    whether there are any examples and whether they belong to at least 
     294    two different classes (if the class is discrete). Derived components 
     295    check things like the number of examples and the proportion of 
     296    majority classes. 
     297 
     298 
     299.. class:: TreeExampleSplitter 
     300 
     301    Analogous to the :obj:`TreeDescender` described a while ago.  
     302    Just like the :obj:`TreeDescender` decides the branch for an 
     303    example during classification, the :obj:`TreeExampleSplitter` 
     304    sorts the learning examples into branches. 
     305 
     306    :obj:`TreeExampleSplitter` is given a :obj:`TreeNode` (from which  
     307    it can use different stuff, but most splitters only use the  
     308    :obj:`branchSelector`), a set of examples to be divided, and  
     309    the weight ID. The result is a list of subsets of examples 
     310    and, optionally, a list of new weight ID's. 
     311 
     312    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most  
     313    of :obj:`TreeExampleSplitters` simply call the node's  
     314    :obj:`branchSelector` and assign examples to corresponding  
     315    branches. When the value is unknown they choose a particular  
     316    branch or simply skip the example. 
     317 
     318    Some enhanced splitters can split examples. An example (actually,  
     319    a pointer to it) is copied to more than one subset. To facilitate 
     320    real splitting, weights are needed. Each branch is assigned a 
     321    weight ID (each would usually have its own ID) and all examples 
     322    that are in that branch (either completely or partially) should 
     323    have this meta attribute. If an example hasn't been split, it 
     324    has only one additional attribute - with weight ID corresponding 
      325    to the subset to which it went. An example that is split among, 
      326    say, three subsets has three new meta attributes, one for each 
     327    subset. ID's of weight meta attributes are returned by the 
     328    :obj:`TreeExampleSplitter` to be used at induction of the 
     329    corresponding subtrees. 
     330 
     331    Note that weights are used only when needed. When no splitting 
      332    occurred - because the splitter is not able to do it or because 
     333    there was no need for splitting - no weight ID's are returned. 
     334 
     335.. class:: TreeLearner 
     336 
     337    TreeLearner has a number of components. 
     338 
     339    .. attribute:: split 
     340 
     341        Object of type :obj:`TreeSplitConstructor`. Default value,  
     342        provided by :obj:`TreeLearner`, is :obj:`SplitConstructor_Combined` 
     343        with separate constructors for discrete and continuous attributes.  
      344        Discrete attributes are used as they are, while continuous attributes 
      345        are binarized. Gain ratio is used to select attributes. 
      346        A minimum of two examples in a leaf is required for discrete 
      347        and five examples in a leaf for continuous attributes. 
     348     
     349    .. attribute:: stop 
     350 
     351        Object of type :obj:`TreeStopCriteria`. The default stopping 
     352        criterion stops induction when all examples in a node belong  
     353        to the same class. 
     354 
     355    .. attribute:: splitter 
     356 
     357        Object of type :obj:`TreeExampleSplitter`. The default splitter 
     358        is :obj:`TreeExampleSplitter_UnknownsAsSelector` that splits 
     359        the learning examples according to distributions given by the 
     360        selector. 
     361 
     362    .. attribute:: contingencyComputer 
     363     
     364        By default, this slot is left empty and ordinary contingency 
     365        matrices are computed for examples at each node. If need arises, 
     366        one can change the way the matrices are computed. This can be 
     367        used to change the way that unknown values are treated when 
     368        assessing qualities of attributes. As mentioned earlier, 
     369        the computed matrices can be used by split constructor and by 
     370        stopping criteria. On the other hand, they can be (and are) 
     371        ignored by some splitting constructors. 
     372 
     373    .. attribute:: nodeLearner 
     374 
     375        Induces a classifier from examples belonging to a node. The 
     376        same learner is used for internal nodes and for leaves. The 
     377        default :obj:`nodeLearner` is :obj:`MajorityLearner`. 
     378 
     379    .. attribute:: descender 
     380 
      381        Descending component that the induced :obj:`TreeClassifier` 
     382        will use. Default descender is  
     383        :obj:`TreeDescender_UnknownMergeAsSelector` which votes using  
     384        the :obj:`branchSelector`'s distribution for vote weights. 
     385 
     386    .. attribute:: maxDepth 
     387 
      388        Gives the maximal tree depth; 0 means that only the root is generated.  
      389        The default is 100, to prevent infinite tree induction due to 
      390        a misconfigured stopping criterion. If you are sure you need 
      391        larger trees, increase it; you can likewise lower this hard 
      392        limit if you wish. 
     393 
     394    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier 
     395 
      396        Decides whether to store class distributions, contingencies  
      397        and examples in :obj:`TreeNode`, and whether the  
      398        :obj:`nodeClassifier` should be built for internal nodes.  
      399        By default, distributions and node classifiers are stored,  
      400        while contingencies and examples are not. You won't save any  
      401        memory by storing contingencies but not distributions, 
      402        since the distribution actually points to the same object 
      403        that is stored in :obj:`contingency.classes`. 
     404 
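    For illustration, here is how these components might be set explicitly.
    This is only a sketch - the component classes are those documented above,
    and every one of these assignments is optional::

        import orange

        learner = orange.TreeLearner()
        learner.split = orange.TreeSplitConstructor_Combined()
        learner.split.discreteSplitConstructor = orange.TreeSplitConstructor_Attribute()
        learner.split.continuousSplitConstructor = orange.TreeSplitConstructor_Threshold()
        learner.stop = orange.TreeStopCriteria_common()
        learner.splitter = orange.TreeExampleSplitter_UnknownsAsSelector()
        learner.nodeLearner = orange.MajorityLearner()
        learner.maxDepth = 5

        data = orange.ExampleTable("lenses")
        treeClassifier = learner(data)
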
     405    The :obj:`TreeLearner` first sets the defaults for missing 
     406    components. Although stored in the actual :obj:`TreeLearner`'s 
     407    fields, they are removed when the induction is finished. 
     408 
     409    Then it ensures that examples are stored in a table. This is needed 
     410    because the algorithm juggles with pointers to examples. If 
     411    examples are in a file or are fed through a filter, they are copied 
     412    to a table. Even if they are already in a table, they are copied 
      413    if :obj:`storeExamples` is set. This is to ensure that the pointers 
      414    keep pointing to the examples even if the user later changes the 
     415    example table. If they are in the table and the :obj:`storeExamples` 
     416    flag is clear, we just use them as they are. This will obviously 
     417    crash in a multi-threaded system if one changes the table during 
     418    the tree induction. Well... don't do it. 
     419 
     420    Apriori class probabilities are computed. At this point we check 
     421    the sum of example weights; if it's zero, there are no examples and  
     422    we cannot proceed. A list of candidate attributes is set; in the 
     423    beginning, all attributes are candidates for the split criterion. 
     424 
     425    Now comes the recursive part of the :obj:`TreeLearner`. Its arguments  
     426    are a set of examples, a weight meta-attribute ID (a tricky thing, 
     427    it can be always the same as the original or can change to  
      428    accommodate splitting of examples among branches), apriori class 
     429    distribution and a list of candidates (represented as a vector 
     430    of Boolean values). 
     431 
      432    The contingency matrix is computed next. This happens 
      433    even if the flag :obj:`storeContingencies` is ``false``. 
      434    If the :obj:`contingencyComputer` is given we use it, 
      435    otherwise we construct just an ordinary contingency matrix. 
     436 
      437    The :obj:`stop` criterion is called to see whether it is worth  
      438    continuing. If not, a :obj:`nodeClassifier` is built and the :obj:`TreeNode` is  
      439    returned. Otherwise, a :obj:`nodeClassifier` is only built if the  
      440    :obj:`forceNodeClassifier` flag is set. 
     441 
     442    To get a :obj:`TreeNode`'s :obj:`nodeClassifier`, the  
     443    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with  
     444    the given examples, weight ID and the just computed matrix. If  
     445    the learner can use the matrix (and the default,  
     446    :obj:`MajorityLearner`, can), it won't touch the examples. Thus, 
     447    a choice of :obj:`contingencyComputer` will, in many cases,  
     448    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can 
     449    return no classifier; if so and if the classifier would be  
     450    needed for classification, the :obj:`TreeClassifier`'s function 
     451    returns DK or an empty distribution. If you're writing your own 
     452    tree classifier - pay attention. 
     453 
     454    If the induction is to continue, a :obj:`split` component is called.  
     455    If it fails to return a branch selector, induction stops and the  
     456    :obj:`TreeNode` is returned. 
     457 
      458    :obj:`TreeLearner` then uses :obj:`ExampleSplitter` to divide  
     459    the examples as described above. 
     460 
     461    The contingency gets removed at this point if it is not to be  
     462    stored. Thus, the :obj:`split`, :obj:`stop` and  
      463    :obj:`exampleSplitter` can use the contingency matrices if they wish. 
     464 
     465    The :obj:`TreeLearner` then recursively calls itself for each of  
      466    the non-empty subsets. If the splitter returns a list of weights,  
     467    a corresponding weight is used for each branch. Besides, the  
     468    attribute spent by the splitter (if any) is removed from the  
     469    list of candidates for the subtree. 
     470 
     471    A subset of examples is stored in its corresponding tree node,  
     472    if so requested. If not, the new weight attributes are removed (if  
     473    any were created). 
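
    The recursion described above can be summarised in a rough Python sketch.
    The function name ``learnNode`` is hypothetical, the learner's components
    are assumed to be already set, and many details (contingency computation,
    per-branch weight IDs, removal of temporary attributes) are left out; the
    sketch only mirrors the steps listed in this section, using the component
    signatures documented on this page::

        import orange

        def learnNode(learner, examples, weightID, apriori, candidates):
            node = orange.TreeNode()
            node.distribution = orange.Distribution(examples.domain.classVar, examples, weightID)

            # should we stop?  if so, build a leaf with a node classifier
            if learner.stop(examples, weightID):
                node.nodeClassifier = learner.nodeLearner(examples, weightID)
                return node

            # try to construct a split; give up if no acceptable split is found
            node.branchSelector, node.branchDescriptions, node.branchSizes, quality, spent = \
                learner.split(examples, weightID, apriori, candidates)
            if not node.branchSelector:
                node.nodeClassifier = learner.nodeLearner(examples, weightID)
                return node

            # divide the examples and recurse into the non-empty subsets;
            # the weight IDs returned by the splitter are ignored in this sketch
            subsets, weights = learner.splitter(node, examples, weightID)
            if spent >= 0:
                candidates = list(candidates)
                candidates[spent] = 0
            node.branches = []
            for subset in subsets:
                if subset:
                    node.branches.append(learnNode(learner, subset, weightID, apriori, candidates))
                else:
                    node.branches.append(None)
            return node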
     474 
     475Pruning 
     476======= 
     477 
      478Tree pruners derived from :obj:`TreePruner` can be given either a 
     479:obj:`TreeNode` (presumably, but not necessarily a root) or a 
     480:obj:`TreeClassifier`. The result is a new, pruned :obj:`TreeNode` 
     481or a new :obj:`TreeClassifier` with a pruned tree. The original 
     482tree remains intact. 
     483 
     484Note however that pruners construct only a shallow copy of a tree. 
      485The pruned tree's :obj:`TreeNode` objects contain references to the same 
     486contingency matrices, node classifiers, branch selectors, ... 
     487as the original tree. Thus, you may modify a pruned tree structure 
     488(manually cut it, add new nodes, replace components) but modifying, 
     489for instance, some node's :obj:`nodeClassifier` (a 
     490:obj:`nodeClassifier` itself, not a reference to it!) would modify 
     491the node's :obj:`nodeClassifier` in the corresponding node of 
     492the original tree. 
     493 
     494Talking about node classifiers - pruners cannot construct a 
     495:obj:`nodeClassifier` nor merge :obj:`nodeClassifier` of the pruned 
     496subtrees into classifiers for new leaves. Thus, if you want to build 
     497a prunable tree, internal nodes must have their :obj:`nodeClassifier` 
     498defined. Fortunately, all you need to do is nothing; if you leave 
     499the :obj:`TreeLearner`'s flags as they are by default, the 
      500:obj:`nodeClassifier`'s are created. 
     501 
     502======= 
     503Classes 
     504======= 
     505 
     506Several classes described above are already functional and can 
     507(and mostly will) be used as they are. Those classes are :obj:`TreeNode`, 
      508:obj:`TreeLearner` and :obj:`TreeClassifier`. This section describes  
     509the other classes. 
     510 
     511Classes :obj:`TreeSplitConstructor`, :obj:`TreeStopCriteria`,  
     512:obj:`TreeExampleSplitter`, :obj:`TreeDescender`, :obj:`orange.Learner` 
     513and :obj:`Classifier` are among the Orange classes that can be subtyped  
      514in Python and have the call operator overloaded in such a way that it 
      515is called back from C++ code. You can thus program your own components 
     516for :obj:`orange.TreeLearner` and :obj:`TreeClassifier`. The detailed  
      517information on how this is done and what can go wrong is given in a  
     518separate page, dedicated to callbacks to Python XXXXXXXXXX. 
     519 
     520TreeSplitConstructor and Derived Classes 
     521======================================== 
     522 
     523Split construction is almost as exciting as waiting for a delayed flight. 
     524Boring, that is. Split constructors are programmed as spaghetti code 
     525that juggles with contingency matrices, with separate cases for discrete 
     526and continuous classes... Most split constructors work either for 
     527discrete or for continuous attributes. The suggested practice is 
     528to use a :obj:`TreeSplitConstructor_Combined` that can handle 
     529both by simply delegating attributes to specialized split constructors. 
     530 
     531Note: split constructors that cannot handle attributes of particular 
     532type (discrete, continuous) do not report an error or a warning but 
     533simply skip the attribute. It is your responsibility to use a correct 
     534split constructor for your dataset. (May we again suggest 
     535using :obj:`TreeSplitConstructor_Combined`?) 
     536 
      537The same components can be used for inducing both classification and 
      538regression trees. The only component that needs to be chosen accordingly 
     539is the 'measure' attribute for the :obj:`TreeSplitConstructor_Measure` 
     540class (and derived classes). 
     541 
     542TreeSplitConstructor 
     543==================== 
     544 
     545.. class:: TreeSplitConstructor_Measure 
     546 
      547    An abstract base class for split constructors that employ  
      548    an :obj:`orange.MeasureAttribute` to assess the quality of a split. At present, 
     549    all split constructors except for :obj:`TreeSplitConstructor_Combined` 
     550    are derived from this class. 
     551 
     552    .. attribute:: measure 
     553 
     554        A component of type :obj:`orange.MeasureAttribute` used for 
      555        evaluation of a split. Note that you must select a subclass  
      556        of :obj:`MeasureAttribute` capable of handling your class type  
     557        - you cannot use :obj:`orange.MeasureAttribute_gainRatio` 
     558        for building regression trees or :obj:`orange.MeasureAttribute_MSE` 
     559        for classification trees. 
     560 
     561    .. attribute:: worstAcceptable 
     562 
     563        The lowest required split quality for a split to be acceptable. 
      564        Note that this value makes sense only in connection with a 
     565        :obj:`measure` component. Default is 0.0. 
     566 
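    For instance, the measure and the quality threshold might be set on one of
    the concrete subclasses described below (a sketch, assuming the class and
    attribute names documented on this page)::

        import orange

        split = orange.TreeSplitConstructor_ExhaustiveBinary()
        split.measure = orange.MeasureAttribute_gainRatio()
        split.worstAcceptable = 0.01   # reject splits scoring below 0.01
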
     567.. class:: TreeSplitConstructor_Attribute 
     568 
     569    Attempts to use a discrete attribute as a split; each value of the  
     570    attribute corresponds to a branch in the tree. Attributes are 
     571    evaluated with the :obj:`measure` and the one with the 
     572    highest score is used for a split. If there is more than one 
      573    attribute with the highest score, one of them is selected at random. 
     574 
     575    The constructed :obj:`branchSelector` is an instance of  
     576    :obj:`orange.ClassifierFromVarFD` that returns a value of the  
      577    selected attribute. If the attribute is an :obj:`orange.EnumVariable`, 
      578    the :obj:`branchDescriptions` are the attribute's values. The  
     579    attribute is marked as spent, so that it cannot reappear in the  
     580    node's subtrees. 
     581 
     582.. class:: TreeSplitConstructor_ExhaustiveBinary 
     583 
     584    Works on discrete attributes. For each attribute, it determines 
     585    which binarization of the attribute gives the split with the 
     586    highest score. If more than one split has the highest score, 
      587    one of them is selected at random. After trying all the attributes, 
     588    it returns one of those with the highest score. 
     589 
      590    The constructed :obj:`branchSelector` is again an instance of 
      591    :obj:`orange.ClassifierFromVarFD` that returns a value of the 
     592    selected attribute. This time, however, its :obj:`transformer` 
     593    contains an instance of :obj:`MapIntValue` that maps the values 
      594    of the attribute into a binary attribute. Branch descriptions are 
      595    of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding to 
     596    more than one value of the attribute. Branches that correspond to 
     597    a single value of the attribute are described with this value. If  
     598    the attribute was originally binary, it is spent and cannot be  
     599    used in the node's subtrees. Otherwise, it can reappear in the  
     600    subtrees. 
     601 
     602 
     603.. class:: TreeSplitConstructor_Threshold 
     604 
      605    This is currently the only constructor for splits with continuous  
      606    attributes. It divides the range of attribute values with a threshold  
      607    that maximizes the split's quality. As always, if there is more than  
      608    one split with the highest score, a random threshold is selected.  
      609    The attribute that yields the best scoring binary split is returned. 
     610 
     611    The constructed :obj:`branchSelector` is again an instance of  
     612    :obj:`orange.ClassifierFromVarFD` with an attached  
     613    :obj:`transformer`. This time, :obj:`transformer` is of type  
     614    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are  
     615    "<threshold" and ">=threshold". The attribute is not spent. 
     616 
     617.. class:: TreeSplitConstructor_Combined 
     618 
     619    This constructor delegates the task of finding the optimal split  
     620    to separate split constructors for discrete and for continuous 
     621    attributes. Each split constructor is called, given only attributes 
     622    of appropriate types as candidates. Both construct a candidate for 
     623    a split; the better of them is selected. 
     624 
      625    (Note that there is a problem when several candidates have the same 
      626    score. Suppose there are nine discrete attributes with the highest 
      627    score; the split constructor for discrete attributes will select 
      628    one of them. Now, let us suppose that there is a single continuous 
      629    attribute with the same score. :obj:`TreeSplitConstructor_Combined` 
      630    would randomly select between the proposed discrete attribute and  
      631    the continuous attribute, not aware of the fact that the discrete 
      632    attribute has already competed with eight other discrete attributes. So,  
      633    the probability of selecting (each) discrete attribute would be 1/18 
     634    instead of 1/10. Although not really correct, we doubt that this 
     635    would affect the tree's performance; many other machine learning 
     636    systems simply choose the first attribute with the highest score  
     637    anyway.) 
     638 
     639    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether  
     640    the attribute is spent is decided by the winning split constructor. 
     641 
      642    .. attribute:: discreteSplitConstructor 
     643 
     644        Split constructor for discrete attributes; can be, for instance, 
     645        :obj:`TreeSplitConstructor_Attribute` or  
     646        :obj:`TreeSplitConstructor_ExhaustiveBinary`. 
     647 
      648    .. attribute:: continuousSplitConstructor 
     649 
     650        Split constructor for continuous attributes; at the moment, it  
     651        can be either :obj:`TreeSplitConstructor_Threshold` or a  
     652        split constructor you programmed in Python. 
     653 
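    A sketch of putting such a combined constructor together, using the class
    names documented on this page (whether it is then assigned to a
    :obj:`TreeLearner`, as here, or called directly is up to you)::

        import orange

        split = orange.TreeSplitConstructor_Combined()
        split.discreteSplitConstructor = orange.TreeSplitConstructor_ExhaustiveBinary()
        split.continuousSplitConstructor = orange.TreeSplitConstructor_Threshold()
        split.discreteSplitConstructor.measure = orange.MeasureAttribute_gainRatio()
        split.continuousSplitConstructor.measure = orange.MeasureAttribute_gainRatio()

        learner = orange.TreeLearner()
        learner.split = split
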
     660TreeStopCriteria and TreeStopCriteria_common 
     661============================================ 
     662 
      663:obj:`TreeStopCriteria` determines when to stop the induction of subtrees, as described in detail in the description of the learning process. XXXXXXXXXX 
     664 
     665.. class:: TreeStopCriteria 
     666 
     667    As opposed to :obj:`TreeSplitConstructor` and similar basic classes, 
     668    :obj:`TreeStopCriteria` is not an abstract but a fully functional 
     669    class that provides the basic stopping criteria. That is, the tree 
     670    induction stops when there is at most one example left; in this case, 
     671    it is not the weighted but the actual number of examples that counts. 
     672    Besides that, the induction stops when all examples are in the same 
     673    class (for discrete problems) or have the same value of the outcome 
     674    (for regression problems). 
     675 
     676    .. method:: __call__(examples[, weightID, domain contingencies]) 
     677 
      678        Decides whether to stop (true) or continue (false) the induction. 
      679        If contingencies are given, they are used for checking whether 
      680        the examples are in the same class (but not for counting the 
      681        examples); if they are not given, :obj:`TreeStopCriteria` works 
      682        without them. Derived classes should use the contingencies 
      683        whenever they are available, and otherwise compute them only 
      684        when they really need them. 
     686 
     687.. class:: TreeStopCriteria_common 
     688 
      689    :obj:`TreeStopCriteria_common` contains additional criteria for pre-pruning: 
     690    it checks the proportion of majority class and the number of weighted 
     691    examples. 
     692 
     693    .. attribute:: maxMajor 
     694 
     695        Maximal proportion of majority class. When this is exceeded,  
     696        induction stops. 
     697 
     698    .. attribute:: minExamples 
     699 
      700        Minimal number of examples in internal nodes. Subsets with fewer 
      701        than :obj:`minExamples` examples are not split any further. 
      702        The example count is weighted. 
     703 
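    A sketch of plugging this into a learner, using the attribute names
    documented above::

        import orange

        learner = orange.TreeLearner()
        learner.stop = orange.TreeStopCriteria_common()
        learner.stop.maxMajor = 0.9     # stop when 90% of examples share the majority class
        learner.stop.minExamples = 5.0  # do not split subsets with fewer than 5 (weighted) examples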
     704 
     705TreeExampleSplitter and derived classes 
     706======================================= 
     707 
     708:obj:`TreeExampleSplitter` is the third crucial component of 
      709:obj:`TreeLearner`. Its function is described in the  
      710description of the learning process. XXXXXXXXX 
     711 
     712.. class:: TreeExampleSplitter 
     713 
     714    An abstract base class for objects that split sets of examples into  
     715    subsets. The derived classes differ in treatment of examples which 
     716    cannot be unambiguously placed into a single branch (usually due 
     717    to unknown value of the crucial attribute). 
     718 
     719    .. method:: __call__(node, examples[, weightID]) 
     720         
     721        Use the information in :obj:`node` (particularly the  
     722        :obj:`branchSelector`) to split the given set of examples into subsets.  
     723        Return a tuple with a list of example generators and a list of weights.  
      724        The list of weights is either an ordinary Python list of integers or  
      725        None when no splitting of examples occurs and thus no weights are  
     726        needed. 
     727 
     728.. class:: TreeExampleSplitter_IgnoreUnknowns 
     729 
     730    Simply ignores the examples for which no single branch can be determined. 
     731 
     732.. class:: TreeExampleSplitter_UnknownsToCommon 
     733 
      734    Places all such examples into the branch with the highest number of 
     735    examples. If there is more than one such branch, one is selected at 
     736    random and then used for all examples. 
     737 
     738.. class:: TreeExampleSplitter_UnknownsToAll 
     739 
     740    Places examples with unknown value of the attribute into all branches. 
     741 
     742.. class:: TreeExampleSplitter_UnknownsToRandom 
     743 
     744    Selects a random branch for such examples. 
     745 
     746.. class:: TreeExampleSplitter_UnknownsToBranch 
     747 
     748    Constructs an additional branch to contain all such examples.  
     749    The branch's description is "unknown". 
     750 
     751.. class:: TreeExampleSplitter_UnknownsAsBranchSizes 
     752 
     753    Splits examples with unknown value of the attribute according to  
     754    proportions of examples in each branch. 
     755 
     756.. class:: TreeExampleSplitter_UnknownsAsSelector 
     757 
     758    Splits examples with unknown value of the attribute according to  
     759    distribution proposed by selector (which is in most cases the same  
     760    as proportions of examples in branches). 
     761 
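A splitter is used by assigning it to the learner's :obj:`splitter` field
(the default, as noted above, is :obj:`TreeExampleSplitter_UnknownsAsSelector`).
A sketch::

    import orange

    learner = orange.TreeLearner()
    learner.splitter = orange.TreeExampleSplitter_UnknownsToBranch()
    treeClassifier = learner(orange.ExampleTable("lenses"))
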
     762TreeDescender and derived classes 
     763================================= 
     764 
     765This is a classifier's counterpart for :obj:`TreeExampleSplitter`. It  
     766decides the destiny of examples that need to be classified and cannot 
     767be unambiguously put in a branch. The detailed function of this class 
      768is given in the description of classification with trees. XXXXXX 
     769 
     770.. class:: TreeDescender 
     771 
     772    An abstract base object for tree descenders. 
     773 
     774    .. method:: __call__(node, example) 
     775 
     776        Descends down the tree until it reaches a leaf or a node in  
     777        which a vote of subtrees is required. In both cases, a tuple  
     778        of two elements is returned; in the former, the tuple contains  
      779        the reached node and ``None``, in the latter it  
      780        contains a node and weights of votes for subtrees (a list of floats). 
     781 
     782        :obj:`TreeDescender`'s that never split examples always descend 
     783        to a leaf, but they differ in the treatment of examples with 
     784        unknown values (or, in general, examples for which a branch 
      785        cannot be determined at some node(s) of the tree). 
     786        :obj:`TreeDescender`'s that do split examples differ in returned 
     787        vote weights. 
     788 
     789.. class:: TreeDescender_UnknownsToNode 
     790 
      791    When an example cannot be classified into a single branch, the 
      792    current node is returned. Thus, the node's :obj:`nodeClassifier` 
      793    will be used to make a decision. It is your responsibility to see 
      794    that even the internal nodes have their :obj:`nodeClassifier` 
      795    (i.e., don't disable creating node classifiers or manually remove 
      796    them after the induction, that's all). 
     797 
      798.. class:: TreeDescender_UnknownsToBranch 
     799 
     800    Classifies examples with unknown value to a special branch. This 
     801    makes sense only if the tree itself was constructed with 
     802    :obj:`TreeExampleSplitter_UnknownsToBranch`. 
     803 
     804.. class:: TreeDescender_UnknownsToCommonBranch 
     805 
     806    Classifies examples with unknown values to the branch with the 
     807    highest number of examples. If there is more than one such branch, 
      808    a random branch is chosen for each example that is to be classified. 
     809 
     810.. class:: TreeDescender_UnknownsToCommonSelector 
     811 
     812    Classifies examples with unknown values to the branch which received  
     813    the highest recommendation by the selector. 
     814 
     815.. class:: TreeDescender_MergeAsBranchSizes 
     816 
     817    Makes the subtrees vote for the example's class; the vote is 
     818    weighted according to the sizes of the branches. 
     819 
     820.. class:: TreeDescender_MergeAsSelector 
     821 
     822    Makes the subtrees vote for the example's class; the vote is  
      823    weighted according to the selector's proposal. 
     824 
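The descender that the induced classifier will use is set through the
learner's :obj:`descender` attribute described earlier. A sketch, using the
class names documented in this section::

    import orange

    learner = orange.TreeLearner()
    learner.descender = orange.TreeDescender_UnknownsToNode()
    treeClassifier = learner(orange.ExampleTable("lenses"))
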
     825TreePruner and derived classes 
     826============================== 
     827 
     828 
     829.. index:: 
     830    pair: classification trees; pruning 
     831.. index:: pruning classification trees 
     832 
      833Classes derived from :obj:`TreePruner` prune the trees as 
      834described in the section on pruning XXXXXXXX - make sure you read it  
      835to understand what the pruners will do to your trees. 
     836 
     837.. class:: TreePruner 
     838 
     839    This is an abstract base class which defines nothing useful, only  
     840    a pure virtual call operator. 
     841 
     842    .. method:: __call__(tree) 
     843 
     844        Prunes a tree. The argument can be either a tree classifier or  
     845        a tree node; the result is of the same type as the argument. 
     846 
     847.. class:: TreePruner_SameMajority 
     848 
      849    In Orange, a tree can have non-trivial subtrees (i.e. subtrees  
     850    with more than one leaf) in which all the leaves have the same majority  
     851    class. (This is allowed because those leaves can still have different 
     852    distributions of classes and thus predict different probabilities.)  
     853    However, this can be undesired when we're only interested in the  
     854    class prediction or a simple tree interpretation. The  
     855    :obj:`TreePruner_SameMajority` prunes the tree so that there is no 
     856    subtree in which all the nodes would have the same majority class. 
     857 
     858    This pruner will only prune the nodes in which the node classifier  
     859    is of class :obj:`orange.DefaultClassifier` (or from a derived class). 
     860 
     861    Note that the leaves with more than one majority class require some  
     862    special handling. The pruning goes backwards, from leaves to the root.  
     863    When siblings are compared, the algorithm checks whether they  
     864    have (at least one) common majority class. If so, they can be pruned. 
     865 
     866.. class:: TreePruner_m 
     867 
     868    Prunes a tree by comparing m-estimates of static and dynamic  
     869    error as defined in (Bratko, 2002). 
     870 
      871    .. attribute:: m 
     872 
     873        Parameter m for m-estimation. 
     874 
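A short usage sketch, with the class names as documented in this section
(the original tree is left intact, as explained in the pruning overview)::

    import orange

    data = orange.ExampleTable("lenses")
    treeClassifier = orange.TreeLearner(data)

    pruner = orange.TreePruner_m()
    pruner.m = 2.0
    prunedClassifier = pruner(treeClassifier)

    # remove subtrees whose leaves all share the same majority class
    samePruned = orange.TreePruner_SameMajority()(prunedClassifier)
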
     875======== 
     876Examples 
     877======== 
     878 
     879This page does not provide examples for programming your own components,  
     880such as, for instance, a :obj:`TreeSplitConstructor`. Those examples 
     881can be found on a page dedicated to callbacks to Python XXXXXXXX. 
     882 
     883Tree Structure 
     884============== 
     885 
      886To have something to work on, we'll take the data from the lenses dataset  
     887and build a tree using the default components (part of `treestructure.py`_, uses `lenses.tab`_): 
     888 
     889.. literalinclude:: code/treestructure.py 
     890   :lines: 7-10 
     891 
     892How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)? 
     893 
     894.. _lenses.tab: code/lenses.tab 
     895.. _treestructure.py: code/treestructure.py 
     896 
     897.. literalinclude:: code/treestructure.py 
     898   :lines: 12-21 
     899 
     900If node is None, we have a null-node; null nodes don't count,  
     901so we return 0. Otherwise, the size is 1 (this node) plus the 
     902sizes of all subtrees. The node is an internal node if it has a  
      903:obj:`branchSelector`; if there's no selector, it's a leaf. Don't 
     904attempt to skip the if statement: leaves don't have an empty list  
     905of branches, they don't have a list of branches at all. 
     906 
     907    >>> treeSize(treeClassifier.tree) 
     908    10 
     909 
      910Don't forget that this was only an exercise - :obj:`TreeNode` has a  
     911built-in method :obj:`TreeNode.treeSize` that does exactly the same. 
     912 
     913Let us now write a simple script that prints out a tree. The recursive 
     914part of the function will get a node and its level (part of `treestructure.py`_, uses `lenses.tab`_). 
     915 
     916.. literalinclude:: code/treestructure.py 
     917   :lines: 26-41 
     918 
     919Don't waste time on studying formatting tricks (\n's etc.), this is just 
     920for nicer output. What matters is everything but the print statements. 
      921First, we check whether the node is a null-node (a node to which no 
     922learning examples were classified). If this is so, we just print out 
     923"<null node>" and return. 
     924 
     925After handling null nodes, remaining nodes are internal nodes and leaves. 
     926For internal nodes, we print a node description consisting of the 
      927attribute's name and distribution of classes. :obj:`TreeNode`'s 
      928:obj:`branchSelector` is, for all currently defined splits, an instance of a 
      929class derived from :obj:`orange.Classifier` (in fact, it is an 
      930:obj:`orange.ClassifierFromVarFD`, but a :obj:`orange.Classifier` would  
      931suffice), and its :obj:`classVar` XXXXX points to the attribute we seek.  
      932So we print its name. We will also assume that storing class distributions  
      933has not been disabled and print them as well. A more capable function for  
      934printing trees (such as the one defined in orngTree XXXXXXXXXX) has an alternative  
      935means of getting the distribution when this fails. Then we iterate  
      936through the branches; for each we print a branch description and recursively  
      937call :obj:`printTree0` with the level increased by 1 (to increase  
     938the indent). 
     939 
     940Finally, if the node is a leaf, we print out the distribution of  
     941learning examples in the node and the class to which the examples in  
     942the node would be classified. We again assume that the :obj:`nodeClassifier`  
     943is the default one - a :obj:`DefaultClassifier`. A better print  
     944function should be aware of possible alternatives. 
     945 
     946Now, we just need to write a simple function to call our printTree0.  
      947We could write something like this:: 
     948 
     949    def printTree(x): 
    571950        printTree0(x.tree, 0) 
     951 
     952... but we won't. Let us learn how to handle arguments of different 
     953types. Let's write a function that will accept either a :obj:`TreeClassifier` 
     954or a :obj:`TreeNode`; just like TreePruners XXXXXX, remember? Part of `treestructure.py`_, uses `lenses.tab`_. 
     955 
     956.. literalinclude:: code/treestructure.py 
     957   :lines: 43-49 
     958 
     959It's fairly straightforward: if :obj:`x` is of type derived from  
     960:obj:`orange.TreeClassifier`, we print :obj:`x.tree`; if it's  
     961:obj:`TreeNode` we just call :obj:`printTree0` with :obj:`x`. If it's  
     962of some other type, we don't know how to handle it and thus raise  
      963an exception. (Note that we could also use:: 
      964 
      965    if type(x) == orange.TreeClassifier: 
      966 
      967but this would only work if :obj:`x` were of type  
      968:obj:`orange.TreeClassifier` and not of any derived type. The latter,  
     969however, do not exist yet...) 
     970 
     971    >>> printTree(treeClassifier) 
     972    tear_rate (<15.000, 5.000, 4.000>) 
     973    : reduced --> none (<12.000, 0.000, 0.000>) 
     974    : normal 
     975       astigmatic (<3.000, 5.000, 4.000>) 
     976       : no 
     977          age (<1.000, 5.000, 0.000>) 
     978          : young --> soft (<0.000, 2.000, 0.000>) 
     979          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
     980          : presbyopic --> none (<1.000, 1.000, 0.000>) 
     981       : yes 
     982          prescription (<2.000, 0.000, 4.000>) 
     983          : myope --> hard (<0.000, 0.000, 3.000>) 
     984          : hypermetrope --> none (<2.000, 0.000, 1.000>) 
     985 
     986For a final exercise, let us write a simple pruning procedure. It will  
     987be written entirely in Python, unrelated to any :obj:`TreePruner`. Our 
     988procedure will limit the tree depth - the maximal depth (here defined 
     989as the number of internal nodes on any path down the tree) shall be 
     990given as an argument. For example, to get a two-level tree, we would 
     991call cutTree(root, 2). The function will be recursive, with the second  
     992argument (level) decreasing at each call; when zero, the current node  
     993will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_): 
     994 
     995.. literalinclude:: code/treestructure.py 
     996   :lines: 54-62 
     997 
     998There's nothing to prune at null-nodes or leaves, so we act only when  
     999:obj:`node` and :obj:`node.branchSelector` are defined. If level is  
     1000not zero, we call the function for each branch. Otherwise, we clear  
     1001the selector, branches and branch descriptions. 
     1002 
     1003    >>> cutTree(tree.tree, 2) 
     1004    >>> printTree(tree) 
     1005    tear_rate (<15.000, 5.000, 4.000>) 
     1006    : reduced --> none (<12.000, 0.000, 0.000>) 
     1007    : normal 
     1008       astigmatic (<3.000, 5.000, 4.000>) 
     1009       : no --> soft (<1.000, 5.000, 0.000>) 
     1010       : yes --> hard (<2.000, 0.000, 4.000>) 
     1011 
     1012Learning 
     1013======== 
     1014 
      1015You've already seen a simple example of using a :obj:`TreeLearner`. You can just call it and let it fill the empty slots with the default components. This section will teach you three things: what the missing components are (and how to set them yourself), how to use alternative components to get a different tree and, finally, how to write a skeleton for tree induction in Python. 
    6331016 
     6341017Default components for TreeLearner
----------------------------------
    6351018 
      1019Let us construct a :obj:`TreeLearner` to play with. 
    6371020 
    6381021<p class="header"><a href="treelearner.py">treelearner.py</a> 