Changeset 7196:ff7963acf2b0 in orange


Timestamp:
02/02/11 16:28:13 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
f3c8efcb918663245c98301ae11d2fce6bf840bb
Message:

Modifying tree.py.

File:
1 edited

  • orange/Orange/classify/tree.py

    r7187 r7196  
    1 from orange import \ 
     1""" 
     2 
     3 
     4 
     5This page describes the Orange trees. It first describes the basic components and procedures: it starts with <A href="#structure">the structure</A> that represents the tree, then it defines <A href="#classification">how the tree is used for classification</A>, then <A href="#learning">how it is built</A> and <a href="#pruning">pruned</A>. The order might seem strange, but since the components are rather complex, this order should be easier to follow. After you have some idea of what the principal components do, we describe the <a href="#classes">concrete classes</A> that you can use as components for a tree learner. 
     6 
     7Classification trees are represented as a tree-like hierarchy of TreeNode objects. 
     8 
     9 
     10.. class:: TreeNode 
     11 
     12    TreeNode stores information about the learning examples belonging  
     13    to the node, a branch selector, a list of branches (if the node is  
     14    not a leaf) with their descriptions and strengths, and a classifier. 
     15 
     16    .. attribute:: distribution 
     17     
     18        Stores a distribution for learning examples belonging to the node. 
     19        Storing distributions can be disabled by setting the TreeLearner's 
     20        storeDistributions flag to false. 
     21 
     22    .. attribute:: contingency 
     23 
     24        Stores complete contingency matrices for the learning examples  
     25        belonging to the node. Storing contingencies can be enabled by  
     26        setting <code>TreeLearner</code>'s <code>storeContingencies</code>  
     27        flag to <CODE>true</CODE>. Note that even when the flag is not  
     28        set, the contingencies get computed and stored to  
     29        <code>TreeNode</code>, but are removed shortly afterwards.  
     30        The details are given in the  
     31        description of the <code>TreeLearner</code> object. 
     32 
     33    .. attribute:: examples, weightID 
     34 
     35        Store a set of learning examples for the node and the 
     36        corresponding ID of weight meta attribute. The root of the 
     37        tree stores a "master table" of examples, while other nodes' 
     38        <CODE>ExampleTable</CODE>s contain references to examples in 
     39        the root's <CODE>ExampleTable</CODE>. Examples are only stored 
     40        if a corresponding flag (<code>storeExamples</code>) has been 
     41        set while building the tree; to conserve space, storing 
     42        is disabled by default. 
     43 
     44    .. attribute:: nodeClassifier 
     45 
     46        A classifier (usually, but not necessarily, a 
     47        <code>DefaultClassifier</code>) that can be used to classify 
     48        examples coming to the node. If the node is a leaf, this is 
     49        used to decide the final class (or class distribution) of an 
     50        example. If it's an internal node, it is stored if 
     51        <code>TreeNode</code>'s flag <code>storeNodeClassifier</code> 
     52        is set. Since the <code>nodeClassifier</code> is needed by 
     53        some <code>TreeDescenders</code> and for pruning (see far below), 
     54        this is the default behaviour; space consumption of the default 
     55        <code>DefaultClassifier</code> is rather small. You should 
     56        never disable this if you intend to prune the tree later. 
     57 
     58    If the node is a leaf, the remaining fields are <code>None</code>. If it's an internal node, there are several additional fields. 
     59 
     60    .. attribute:: branches 
     61 
     62        Stores a list of subtrees, given as <code>TreeNode</code>s.  
     63        An element can be <code>None</code>; in this case the node is empty. 
     64 
     65    .. attribute:: branchDescriptions 
     66 
     67        A list with string descriptions for branches, constructed by 
     68        <code>TreeSplitConstructor</code>. It can contain different kinds 
     69        of descriptions, but basically, expect things like 'red' or '>12.3'. 
     70 
     71    .. attribute:: branchSizes 
     72 
     73        Gives a (weighted) number of training examples that went into 
     74        each branch. This can be used later, for instance, for 
     75        modeling probabilities when classifying examples with 
     76        unknown values. 
     77 
     78    .. attribute:: branchSelector 
     79 
     80        Gives a branch for each example. The same object is used during 
     81        learning and classifying. The <code>branchSelector</code> is of type <code>Classifier</code>, since its job is similar to that of a classifier: it gets an example and returns a discrete <code>Value</code> in the range [0, <CODE>len(branches)-1</CODE>]. When an example cannot be classified to any branch, the selector can return a <CODE>Value</CODE> containing a special value (<code>sVal</code>) which should be a discrete distribution (<code>DiscDistribution</code>). This should represent the <code>branchSelector</code>'s opinion on how to divide the example among the branches. Whether the proposal will be used or not depends upon the chosen <code>TreeExampleSplitter</code> (when learning) or <code>TreeDescender</code> (when classifying). 
     82 
     83 
     84<p>The three lists (<code>branches</code>, <code>branchDescriptions</code> and <code>branchSizes</code>) are of the same length; all of them are defined if the node is internal and none if it is a leaf.</p> 
     85 
     86<p><code>TreeNode</code> has a method <code>treesize()</code> that returns the number of nodes in the subtrees (including the node, excluding null-nodes).</p> 
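<p>A minimal sketch of inspecting this structure from Python; it assumes the lenses dataset used in the examples further below and a tree built with the default components:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

root = treeClassifier.tree                  # the root TreeNode
print root.distribution                     # class distribution at the root
print root.branchDescriptions               # descriptions of the root's branches
print root.branchSelector.classVar.name     # attribute used for the first split
print root.treesize()                       # number of nodes in the tree
</xmp>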
     87 
     88<A name="classification"></A> 
     89<H3>Classification</H3> 
     90 
     91<p>A <code><INDEX name="classes/TreeClassifier">TreeClassifier</code> is an object that classifies examples according to a tree stored in a field <code>tree</code>.</p> 
     92 
     93<p>Classification would be straightforward if there were no unknown values or, in general, examples that cannot be placed into a single branch. The response in such cases is determined by a component <code>descender</code>.</p> 
     94 
     95<p><code><INDEX name="classes/TreeDescender">TreeDescender</code> is an abstract object which is given an example and whose basic job is to descend as far down the tree as possible, according to the values of the example's attributes. The <code>TreeDescender</code> calls the node's <code>branchSelector</code> to get the branch index. If it's a simple index, the corresponding branch is followed. If not, it's up to the descender to decide what to do, and that's where descenders differ. A <code>descender</code> can choose a single branch (for instance, the one that is most recommended by the <code>branchSelector</code>) or it can let the branches vote.</p> 
     96 
     97<p>In general there are three possible outcomes of a descent.</p> 
     98<UL> 
     99<LI>Descender reaches a leaf. This happens when nothing went wrong (there are no unknown or out-of-range values in the example) or when things went wrong, but the descender smoothed them by selecting a single branch and continued the descent. In this case, the descender returns the reached <code>TreeNode</code>.</li> 
     100 
     101<LI><code>branchSelector</code> returned a distribution and the <code>TreeDescender</code> decided to stop the descent at this (internal) node. Again, the descender returns the current <code>TreeNode</code> and nothing else.</LI> 
     102 
     103<LI><code>branchSelector</code> returned a distribution and the <code>TreeDescender</code> wants to split the example (i.e., to decide the class by voting). It returns a <code>TreeNode</code> and the vote-weights for the branches. The weights can correspond to the distribution returned by 
     104<code>branchSelector</code>, to the number of learning examples that were assigned to each branch, or to something else.</LI> 
     105</UL> 
     106 
     107<p><code>TreeClassifier</code> uses the descender to descend from the root. If it returns only a <code>TreeNode</code> and no distribution, the descent stops; it does not matter whether it's a leaf (the first case above) or an internal node (the second case). The node's <code>nodeClassifier</code> is used to decide the class. If the descender returns a <code>TreeNode</code> and a distribution, the <code>TreeClassifier</code> recursively calls itself for each of the subtrees and the predictions are weighted as requested by the descender.</p> 
     108 
     109<p>When voting, subtrees do not predict the class but probabilities of classes. The predictions are multiplied by weights, summed and the most probable class is returned.</p> 
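<p>In practice, all of this happens behind a single call. A short sketch (assuming the lenses data and that <code>orange.GetProbabilities</code> behaves as for other Orange classifiers) asks the tree for a class and for a class distribution:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

example = data[0]
print treeClassifier(example)                           # the predicted class
print treeClassifier(example, orange.GetProbabilities)  # the class distribution
</xmp>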
     110 
     111<p><B>The rest of this section is only for those interested in the C++ code.</B></p> 
     112 
     113<p>If you'd like to understand how the classification works in C++, start reading at <code>TTreeClassifier::vote</code>. It gets a <code>TreeNode</code>, an <code>Example</code> and a distribution of vote weights. For each node, it calls the <code>TTreeClassifier::classDistribution</code> and then multiplies and sums the distribution. <code>vote</code> returns a normalized distribution of predictions.</p> 
     114 
     115<p>A new overload of <code>TTreeClassifier::classDistribution</code> gets an additional parameter, a <code>TreeNode</code>. This is done for the sake of recursion. The normal version of <code>classDistribution</code> simply calls the overloaded version with the tree root as the additional parameter. <code>classDistribution</code> uses <code>descender</code>. If the descender reaches a leaf, it calls <code>nodeClassifier</code>, otherwise it calls <CODE>vote</CODE>.</P> 
     116 
     117<p>Thus, the <code>TreeClassifier</code>'s <code>vote</code> and <code>classDistribution</code> are written in a form of double recursion. The recursive calls do not happen at each node of the tree but only at nodes where a vote is needed (that is, at nodes where the descender halts).</p> 
     118 
     119<p>For predicting a class, <code>operator()</code> calls the descender. If it reaches a leaf, the class is predicted by the leaf's <code>nodeClassifier</code>. Otherwise, it calls <code>vote</code>. From now on, <code>vote</code> and <code>classDistribution</code> interweave down the tree and return a distribution of predictions. <code>operator()</code> then simply chooses the most probable class.</p> 
     120 
     121 
     122<A name="learning"></A> 
     123<H3>Learning</H3> 
     124 
     125<p>The main learning object is <CODE><INDEX name="classes/TreeLearner">TreeLearner</CODE>. It is basically a skeleton into which the user must plug the components for particular functions. For easier use, defaults are provided.</p> 
     126 
     127<p>Components that govern the structure of the tree are <code>split</code> (of type <code>TreeSplitConstructor</code>), <code>stop</code> (of type <code>TreeStopCriteria</code>) and <code>exampleSplitter</code> (of type <code>TreeExampleSplitter</code>).</p> 
     128 
     129<H4>TreeSplitConstructor</H4> 
     130 
     131<p>The job of <code><INDEX name="classes/TreeSplitConstructor">TreeSplitConstructor</code> is to find a suitable criterion for dividing the learning (and later testing) examples coming to the node. The data it gets is a set of examples (and, optionally, an ID of weight meta-attribute), a domain contingency computed from examples, apriori class probabilities, a list of candidate attributes it should consider and a node classifier (if it was constructed, that is, if <CODE>storeNodeClassifier</CODE> is left <CODE>true</CODE>).</p> 
     132 
     133<p>The <code>TreeSplitConstructor</code> should use the domain contingency when possible. The reasons are twofold: it is faster, and the contingency matrices are not necessarily constructed by simply counting the examples; why and how is explained later. There are, however, cases when the domain contingency does not suffice, for example, when ReliefF is used as a measure of quality of attributes. In this case, there's no other way but to use the examples and ignore the precomputed contingencies.</p> 
     134 
     135<p>The split constructor should consider only the attributes in the candidate list (the list is a vector of booleans, one for each attribute).</p> 
     136 
     137<p><code>TreeSplitConstructor</code> returns most of the data we talked about when describing the <code>TreeNode</code>. It returns a classifier to be used as <code>TreeNode</code>'s <code>branchSelector</code>, a list of branch descriptions and a list with the number of examples that go into each branch. Just what we need for the <code>TreeNode</code>. It can return an empty list for the number of examples in branches; in this case, the <code>TreeLearner</code> will find the number itself after splitting the example set into subsets. However, if a split constructor can provide the numbers at no extra computational cost, it should do so.</P> 
     138 
     139<p>In addition, it returns the quality of the split: a number without any fixed meaning, except that higher numbers mean better splits.</p> 
     140 
     141<p>If the constructed splitting criterion uses an attribute in such a way that the attribute is 'completely spent' and should not be considered as a split criterion in any of the subtrees (the typical case being discrete attributes that are used as they are, without any binarization or subsetting), then it should report the index of this attribute. Some splits do not spend any attribute; this is indicated by returning a negative index.</p> 
     142 
     143<p>A <code>TreeSplitConstructor</code> can veto the further tree induction by returning no classifier. This can happen for many reasons. A general one is related to the number of examples in the branches. <code>TreeSplitConstructor</code> has a field <code>minSubset</code>, which sets the minimal number of examples in a branch; null nodes, however, are allowed. If there is no split where this condition is met, <code>TreeSplitConstructor</code> stops the induction.</p> 
     144 
     145 
     146<H4>TreeStopCriteria</H4> 
     147 
     148<p><code><INDEX name="classes/TreeStopCriteria">TreeStopCriteria</code> is a much simpler component that, given a set of examples, weight ID and contingency matrices, decides whether to continue the induction or not. The basic criterion checks whether there are any examples and whether they belong to at least two different classes (if the class is discrete). Derived components check things like the number of examples and the proportion of majority classes.</p> 
     149 
     150 
     151<H4>TreeExampleSplitter</H4> 
     152 
     153<p><code><INDEX name="classes/TreeExampleSplitter">TreeExampleSplitter</code> is analogous to the <code>TreeDescender</code> described a while ago. Just like the <code>TreeDescender</code> decides the branch for an example during classification, the <code>TreeExampleSplitter</code> sorts the learning examples into branches.</p> 
     154 
     155<p><code>TreeExampleSplitter</code> is given a <code>TreeNode</code> (from which it can use various fields, though most splitters only use the <code>branchSelector</code>), a set of examples to be divided, and the weight ID. The result is a list of subsets of examples and, optionally, a list of new weight IDs.</p> 
     156 
     157<p>Subsets are usually stored as <code>ExamplePointerTable</code>'s. Most <code>TreeExampleSplitters</code> simply call the node's <code>branchSelector</code> and assign examples to the corresponding branches. When the value is unknown, they choose a particular branch or simply skip the example.</p> 
     158 
     159<p>Some enhanced splitters can split examples. An example (actually, a pointer to it) is copied to more than one subset. To facilitate real splitting, weights are needed. Each branch is assigned a weight ID (each would usually have its own ID) and all examples that are in that branch (either completely or partially) should have this meta attribute. If an example hasn't been split, it has only one additional attribute - with the weight ID corresponding to the subset to which it went. An example that is split among, say, three subsets has three new meta attributes, one for each subset. The IDs of weight meta attributes are returned by the <code>TreeExampleSplitter</code> to be used at induction of the corresponding subtrees.</p> 
     160 
     161<p>Note that weights are used only when needed. When no splitting occurred - because the splitter is not able to do it or because there was no need for splitting - no weight IDs are returned.</p> 
     162 
     163<H4>TreeLearner</H4> 
     164 
     165<p>TreeLearner has a number of components.</p> 
     166 
     167<P class=section> 
     168<DL class=attributes> 
     169<DT>split</DT> 
     170<DD>Object of type <code>TreeSplitConstructor</code>. Default value, provided by <code>TreeLearner</code>, is <code>SplitConstructor_Combined</code> with separate constructors for discrete and continuous attributes. Discrete attributes are used as they are, while continuous attributes are binarized. Gain ratio is used to select attributes. A minimum of two examples in a leaf is required for discrete and five examples in a leaf for continuous attributes.</DD> 
     171 
     172<DT>stop</DT> 
     173<DD>Object of type <code>TreeStopCriteria</code>. The default stopping criterion stops induction when all examples in a node belong to the same class.</DD> 
     174 
     175<DT>exampleSplitter</DT> 
     176<DD>Object of type <code>TreeExampleSplitter</code>. The default splitter is <code>TreeExampleSplitter_UnknownsAsSelector</code> that splits the learning examples according to distributions given by the selector.</DD> 
     177 
     178<DT>contingencyComputer</DT> 
     179<DD>By default, this slot is left empty and ordinary contingency matrices are computed for examples at each node. If need arises, one can change the way the matrices are computed. This can be used to change the way that unknown values are treated when assessing qualities of attributes. As mentioned earlier, the computed matrices can be used by split constructor and by stopping criteria. On the other hand, they can be (and are) ignored by some splitting constructors.</DD> 
     180 
     181<DT>nodeLearner</DT> 
     182<DD>Induces a classifier from examples belonging to a node. The same learner is used for internal nodes and for leaves. The default <code>nodeLearner</code> is <code>MajorityLearner</code>.</DD> 
     183 
     184<DT>descender</DT> 
     185<DD>Descending component that the induced <code>TreeClassifier</code> will use. The default descender is <code>TreeDescender_UnknownMergeAsSelector</code> which votes using the <code>branchSelector</code>'s distribution for vote weights.</DD> 
     186 
     187<DT>maxDepth</DT> 
     188<DD>Gives the maximal tree depth; 0 means that only the root is generated. The default is 100 to prevent any infinite tree induction due to errors in the stop criteria. If you are sure you need larger trees, increase it. If you, on the other hand, want to lower this hard limit, you can do so as well.</DD> 
     189 
     190<DT>storeDistributions, storeContingencies, storeExamples, storeNodeClassifier</DT> 
     191<DD>Decide whether to store class distributions, contingencies and examples in <code>TreeNodes</code>, and whether the <code>nodeClassifier</code> should be built for internal nodes. By default, distributions and node classifiers are stored, while contingencies and examples are not. You won't save any memory by not storing distributions but storing contingencies, since the distribution actually points to the same object that is stored in <code>contingency.classes</code>.</DD> 
     192</DL> 
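<p>A sketch of plugging some of the components listed above into a learner; the class and attribute names follow this text and are assumed to be settable as ordinary Python attributes:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

learner = orange.TreeLearner()
learner.split = orange.TreeSplitConstructor_Attribute()
learner.split.measure = orange.MeasureAttribute_gainRatio()
learner.stop = orange.TreeStopCriteria_common()
learner.stop.minExamples = 5.0
learner.nodeLearner = orange.MajorityLearner()
learner.maxDepth = 3
learner.storeExamples = 1

treeClassifier = learner(data)
print treeClassifier.tree.treesize()
</xmp>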
     193 
     194<p>The <code>TreeLearner</code> first sets the defaults for missing components. Although stored in the actual <code>TreeLearner</code>'s fields, they are removed when the induction is finished.</p> 
     195 
     196<p>Then it ensures that examples are stored in a table. This is needed because the algorithm juggles with pointers to examples. If examples are in a file or are fed through a filter, they are copied to a table. Even if they are already in a table, they are copied if <code>storeExamples</code> is set. This is to ensure that the pointers keep pointing to valid examples even if the user later changes the example table. If they are in a table and the <code>storeExamples</code> flag is clear, we just use them as they are. This will obviously crash in a multi-threaded system if one changes the table during the tree induction. Well... don't do it.</p> 
     197 
     198<p>Apriori class probabilities are computed. At this point we check the sum of example weights; if it's zero, there are no examples and we cannot proceed. A list of candidate attributes is set; in the beginning, all attributes are candidates for the split criterion.</p> 
     199 
     200<p>Now comes the recursive part of the <code>TreeLearner</code>. Its arguments are a set of examples, a weight meta-attribute ID (a tricky thing, it can always be the same as the original or can change to accommodate splitting of examples among branches), apriori class distribution and a list of candidates (represented as a vector of Boolean values).</p> 
     201 
     202<p>The contingency matrix is computed next. This happens even if the flag <code>storeContingencies</code> is <CODE>false</CODE>. If the <code>contingencyComputer</code> is given, we use it; otherwise, we construct an ordinary contingency matrix.</p> 
     203 
     204<p>The <code>stop</code> criterion is called to see whether it is worth continuing. If not, a <code>nodeClassifier</code> is built and the <code>TreeNode</code> is returned. Otherwise, a <code>nodeClassifier</code> is only built if the <code>forceNodeClassifier</code> flag is set.</p> 
     205 
     206<p>To get a <code>TreeNode</code>'s <code>nodeClassifier</code>, the <code>nodeLearner</code>'s <code>smartLearn</code> function is called with the given examples, weight ID and the just computed matrix. If the learner can use the matrix (and the default, <code>MajorityLearner</code>, can), it won't touch the examples. Thus, a choice of <code>contingencyComputer</code> will, in many cases, affect the <code>nodeClassifier</code>. The <code>nodeLearner</code> can return no classifier; if so and if the classifier would be needed for classification, the <code>TreeClassifier</code>'s function returns DK or an empty distribution. If you're writing your own tree classifier - pay attention.</p> 
     207 
     208<p>If the induction is to continue, a <code>split</code> component is called. If it fails to return a branch selector, induction stops and the <code>TreeNode</code> is returned.</p> 
     209 
     210<p><code>TreeLearner</code> then uses its <code>exampleSplitter</code> to divide the examples as described above.</p> 
     211 
     212<p>The contingency gets removed at this point if it is not to be stored. Thus, the <code>split</code>, <code>stop</code> and <code>exampleSplitter</code> can use the contingency matrices if they wish.</p> 
     213 
     214<p>The <code>TreeLearner</code> then recursively calls itself for each of the non-empty subsets. If the splitter returns a list of weights, the corresponding weight is used for each branch. Besides that, the attribute spent by the split constructor (if any) is removed from the list of candidates for the subtree.</p> 
     215 
     216<p>A subset of examples is stored in its corresponding tree node, if so requested. If not, the new weight attributes are removed (if any were created).</p> 
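<p>The recursion described above can be sketched in pure Python. This is only an illustration of how the data flows between the components, not the actual C++ implementation; it assumes the call signatures given in this text, that <code>TreeNode</code> can be instantiated from Python, and it ignores the weight IDs returned by the splitter, the depth limit and the storing flags:</p>

<xmp class="code">import orange

def induceTree(examples, weightID, candidates, split, stop, exampleSplitter, nodeLearner):
    # contingency and apriori class distribution for this node
    contingency = orange.DomainContingency(examples)
    apriori = orange.Distribution(examples.domain.classVar, examples, weightID)

    node = orange.TreeNode()        # assumes TreeNode is constructible from Python
    node.distribution = apriori

    # if the stopping criteria are met, build a leaf
    if stop(examples, weightID, contingency):
        node.nodeClassifier = nodeLearner(examples, weightID)
        return node

    # ask the split constructor for a branch selector
    selector, descriptions, sizes, quality, spent = \
        split(examples, weightID, apriori, candidates)
    if not selector:
        node.nodeClassifier = nodeLearner(examples, weightID)
        return node

    node.branchSelector = selector
    node.branchDescriptions = descriptions

    # a completely spent attribute is removed from the candidates
    if spent >= 0:
        candidates = list(candidates)
        candidates[spent] = 0

    # divide the examples and induce the subtrees
    subsets, weights = exampleSplitter(node, examples, weightID)
    node.branches = []
    for sub in subsets:
        if sub:
            node.branches.append(induceTree(sub, weightID, candidates, split,
                                            stop, exampleSplitter, nodeLearner))
        else:
            node.branches.append(None)
    return node

# a possible invocation with default-like components (illustration only)
data = orange.ExampleTable("lenses")
split = orange.TreeSplitConstructor_Attribute()
split.measure = orange.MeasureAttribute_gainRatio()
root = induceTree(data, 0, [1] * len(data.domain.attributes), split,
                  orange.TreeStopCriteria_common(),
                  orange.TreeExampleSplitter_IgnoreUnknowns(),
                  orange.MajorityLearner())
print root.treesize()
</xmp>

<p>A real implementation would also have to handle the weight IDs returned by the splitter, the <code>maxDepth</code> limit and the storing flags described above.</p>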
     217 
     218<A name="pruning"></A> 
     219<H3>Pruning</H3> 
     220 
     221<p>Tree pruners derived from <code><INDEX name="classes/TreePrune">TreePrune</code> can be given either a <code>TreeNode</code> (presumably, but not necessarily a root) or a <code>TreeClassifier</code>. The result is a new, pruned <code>TreeNode</code> or a new <code>TreeClassifier</code> with a pruned tree. The original tree remains intact.</p> 
     222 
     223<p>Note however that pruners construct only a shallow copy of a tree. The pruned tree's <code>TreeNode</code>s contain references to the same contingency matrices, node classifiers, branch selectors, ... as the original tree. Thus, you may modify a pruned tree structure (manually cut it, add new nodes, replace components) but modifying, for instance, some node's <code>nodeClassifier</code> (a <code>nodeClassifier</code> itself, not a reference to it!) would modify the node's <code>nodeClassifier</code> in the corresponding node of the original tree.</p> 
     224 
     225<p>Talking about node classifiers - pruners cannot construct a <code>nodeClassifier</code> nor merge <code>nodeClassifiers</code> of the pruned subtrees into classifiers for new leaves. Thus, if you want to build a prunable tree, internal nodes must have their <code>nodeClassifiers</code> defined. Fortunately, all you need to do is nothing; if you leave the <code>TreeLearner</code>'s flags as they are by default, the <code>nodeClassifiers</code> are created.</p> 
     226 
     227<hr> 
     228 
     229<A name="classes"></A> 
     230<H2>Classes</H2> 
     231 
     232<p>Several classes described above are already functional and can (and mostly will) be used as they are. Those classes are <code>TreeNode</code>, <code>TreeLearner</code> and <code>TreeClassifier</code>. This section describes the other classes.</p> 
     233 
     234<p>Classes <code>TreeSplitConstructor</code>, <code>TreeStopCriteria</code>, <code>TreeExampleSplitter</code>, <code>TreeDescender</code>, <code>Learner</code> and <code>Classifier</code> are among the Orange classes that can be subtyped in Python and have the call operator overloaded in such a way that it can be called back from the C++ code. You can thus program your own components for <code>TreeLearners</code> and <code>TreeClassifiers</code>. The detailed information on how this is done and what can go wrong is given in a separate page, dedicated to <A href="callbacks.htm">callbacks to Python</A>.</p> 
     235 
     236 
     237<H3>TreeSplitConstructor and Derived Classes</H3> 
     238 
     239<p>Split construction is almost as exciting as waiting for a delayed flight. Boring, that is. Split constructors are programmed as spaghetti code that juggles with contingency matrices, with separate cases for discrete and continuous classes... Most split constructors work either for discrete or for continuous attributes. The suggested practice is to use a <code>TreeSplitConstructor_Combined</code> that can handle both by simply delegating attributes to specialized split constructors.</p> 
     240 
     241<p><B>Note:</B> split constructors that cannot handle attributes of particular type (discrete, continuous) do not report an error or a warning but simply skip the attribute. It is your responsibility to use a correct split constructor for your dataset. (May we again suggest using <code>TreeSplitConstructor_Combined</code>?)</p> 
     242 
     243<p>The same components can be used for inducing both classification and regression trees. The only component that needs to be chosen accordingly is the 'measure' attribute for the <code>TreeSplitConstructor_Measure</code> class (and derived classes).</p> 
     244 
     245<H4>TreeSplitConstructor</H4> 
     246 
     247<p>The <code>TreeSplitConstructor</code>'s function has been described in detail in the <a href="#learning">description of the learning process</A>.</p> 
     248 
     249<p class=section>Attributes</P> 
     250<DL class=attributes> 
     251<DT>minSubset</DT> 
     252<DD>Sets the minimal number of examples in non-null leaves. As always in Orange (where not specified otherwise), "number of examples" refers to the weighted number of examples.</DD> 
     253</DL> 
     254 
     255<p class=section>Methods</P> 
     256<DL class=attributes> 
     257<DT>__call__(examples[, weightID, apriori_distribution, candidates])</DT> 
     258<DD>Constructs a split. Examples can be given in any acceptable form (an <code>ExampleGenerator</code>, such as <code>ExampleTable</code>, or a list of examples). <code>WeightID</code> is optional; the default of 0 means that all examples have a weight of 1.0. Apriori-distribution should be of type <code>orange.Distribution</code> and candidates should be a Python list of objects which are interpreted as booleans. 
     259The function returns a tuple (<code>branchSelector</code>, <code>branchDescriptions</code>, <code>subsetSizes</code>, <code>quality</code>, <code>spentAttribute</code>). <code>SpentAttribute</code> is -1 if no attribute is completely spent by the split criterion. If no split is constructed, the <code>selector</code>, <code>branchDescriptions</code> and <code>subsetSizes</code> are <code>None</code>, while <CODE>quality</CODE> is 0.0 and <code>spentAttribute</code> is -1.</DD> 
     260</DL> 
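<p>Given this signature, a split constructor can also be tried out by hand, outside of tree induction. A small sketch, assuming the lenses data, a concrete constructor introduced below and that the optional arguments may be omitted as stated above:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

sc = orange.TreeSplitConstructor_Attribute()
sc.measure = orange.MeasureAttribute_gainRatio()

selector, descriptions, sizes, quality, spent = sc(data)
print descriptions       # one description per branch
print quality, spent     # split quality and index of the spent attribute
</xmp>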
     261 
     262 
     263<H4>TreeSplitConstructor_Measure</H4> 
     264<index name="classes/TreeSplitConstructor_Measure"> 
     265 
     266<p>An abstract base class for split constructors that employ a <code>MeasureAttribute</code> to assess the quality of a split. At present, all split constructors except for <code>TreeSplitConstructor_Combined</code> are derived from this class.</p> 
     267 
     268<p class=section>Attributes</p> 
     269<DL class=attributes> 
     270<DT>measure</DT> 
     271<DD>A component of type <code>MeasureAttribute</code> used for evaluation of a split. Note that you must select a subclass of <code>MeasureAttribute</code> capable of handling your class type - you cannot use <code>MeasureAttribute_gainRatio</code> for building regression trees or <code>MeasureAttribute_MSE</code> for classification trees.</DD> 
     272 
     273<DT>worstAcceptable</DT> 
     274<DD>The lowest required split quality for a split to be acceptable. Note that this value makes sense only in connection with a <code>measure</code> component. Default is 0.0.</DD> 
     275</DL> 
     276 
     277<H4>TreeSplitConstructor_Attribute</H4> 
     278<index name="classes/TreeSplitConstructor_Attribute"> 
     279 
     280<p><code>TreeSplitConstructor_Attribute</code> attempts to use a discrete attribute as a split; each value of the attribute corresponds to a branch in the tree. Attributes are evaluated with the <code>measure</code> and the one with the highest score is used for a split. If there is more than one attribute with the highest score, one of them is selected at random.</P> 
     281 
     282<p>The constructed <code>branchSelector</code> is an instance of <code>ClassifierFromVarFD</code> that returns a value of the selected attribute. If the attribute is an <code>EnumVariable</code>, the <code>branchDescriptions</code> are the attribute's values. The attribute is marked as spent, so that it cannot reappear in the node's subtrees.</p> 
     283 
     284<H4>TreeSplitConstructor_ExhaustiveBinary</H4> 
     285<index name="classes/TreeSplitConstructor_ExhaustiveBinary"> 
     286<index name="binarization (in trees)"> 
     287 
     288<p><code>TreeSplitConstructor_ExhaustiveBinary</code> works on discrete attributes. For each attribute, it determines which binarization of the attribute gives the split with the highest score. If more than one split has the highest score, one of them is selected at random. After trying all the attributes, it returns one of those with the highest score.</p> 
     289 
     290<p>The constructed <CODE>branchSelector</CODE> is again an instance of <code>ClassifierFromVarFD</code> that returns a value of the selected attribute. This time, however, its <code>transformer</code> contains an instance of <code>MapIntValue</code> that maps the values of the attribute into a binary attribute. Branch descriptions are of the form "[&lt;val1&gt;, &lt;val2&gt;, ...&lt;valn&gt;]" for branches corresponding to more than one value of the attribute. Branches that correspond to a single value of the attribute are described with this value. If the attribute was originally binary, it is spent and cannot be used in the node's subtrees. Otherwise, it can reappear in the subtrees.</p> 
     291 
     292<H4>TreeSplitConstructor_Threshold</H4> 
     293<index name="classes/TreeSplitConstructor_Threshold"> 
     294 
     295<p>This is currently the only constructor for splits with continuous attributes. It divides the range of attribute values with a threshold that maximizes the split's quality. As always, if there is more than one split with the highest score, a random threshold is selected. The attribute whose threshold split yields the highest score is returned.</p> 
     296 
     297<p>The constructed <code>branchSelector</code> is again an instance of <code>ClassifierFromVarFD</code> with an attached <code>transformer</code>. This time, <code>transformer</code> is of type <code>ThresholdDiscretizer</code>. The branch descriptions are "&lt;threshold" and "&gt;=threshold". The attribute is not spent.</p> 
     298 
     299 
     300<H4>TreeSplitConstructor_Combined</H4> 
     301<index name="classes/TreeSplitConstructor_Combined"> 
     302 
     303<p>This constructor delegates the task of finding the optimal split to separate split constructors for discrete and for continuous attributes. Each split constructor is called, given only attributes of appropriate types as candidates. Both construct a candidate for a split; the better of them is selected.</p> 
     304 
     305<p>(Note that there is a problem when several candidates have the same score. Suppose there are nine discrete attributes with the highest score; the split constructor for discrete attributes will select one of them. Now, let us suppose that there is a single continuous attribute with the same score. <code>TreeSplitConstructor_Combined</code> would randomly select between the proposed discrete attribute and the continuous attribute, not aware of the fact that the discrete one has already competed with eight other discrete attributes. So, the probability for selecting (each) discrete attribute would be 1/18 instead of 1/10. Although not really correct, we doubt that this would affect the tree's performance; many other machine learning systems simply choose the first attribute with the highest score anyway.)</p> 
     306 
     307<p>The <code>branchSelector</code>, <code>branchDescriptions</code> and whether the attribute is spent is decided by the winning split constructor.</p> 
     308 
     309<p class=section>Attributes</p> 
     310<DL class=attributes> 
     311<DT>discreteSplitConstructor</DT> 
     312<DD>Split constructor for discrete attributes; can be, for instance, <code>TreeSplitConstructor_Attribute</code> or <code>TreeSplitConstructor_ExhaustiveBinary</code></DD> 
     313 
     314<DT>continuousSplitConstructor</DT> 
     315<DD>Split constructor for continuous attributes; at the moment, it can be either <code>TreeSplitConstructor_Threshold</code> or a split constructor you programmed in Python.</DD> 
     316</DL> 
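<p>A sketch of assembling a combined split constructor from the components described above (class names as given in this text) and plugging it into a learner:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

split = orange.TreeSplitConstructor_Combined()
split.discreteSplitConstructor = orange.TreeSplitConstructor_ExhaustiveBinary()
split.discreteSplitConstructor.measure = orange.MeasureAttribute_gainRatio()
split.continuousSplitConstructor = orange.TreeSplitConstructor_Threshold()
split.continuousSplitConstructor.measure = orange.MeasureAttribute_gainRatio()

learner = orange.TreeLearner()
learner.split = split
treeClassifier = learner(data)
print treeClassifier.tree.treesize()
</xmp>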
     317 
     318 
     319<H3>TreeStopCriteria and TreeStopCriteria_common</H3> 
     320 
     321<p><code>TreeStopCriteria</code> determines when to stop the induction of subtrees, as described in detail in the <a href="#learning">description of the learning process</A>.</p> 
     322 
     323<H4>TreeStopCriteria</H4> 
     324 
     325<p>As opposed to <code>TreeSplitConstructor</code> and similar basic classes, <code>TreeStopCriteria</code> is not an abstract but a fully functional class that provides the basic stopping criteria. That is, the tree induction stops when there is at most one example left; in this case, it is not the weighted but the actual number of examples that counts. Besides that, the induction stops when all examples are in the same class (for discrete problems) or have the same value of the outcome (for regression problems).</p> 
     326 
     327<p class=section>Methods</p> 
     328<DL class=attributes> 
     329<DT>__call__(examples[, weightID, domain contingencies])</DT> 
     330<DD> 
     331Decides whether to stop (<CODE>true</CODE>) or continue (<CODE>false</CODE>) the induction. If contingencies are given, they are used for checking whether the examples are in the same class (but not for counting the examples). If they are not given, <code>TreeStopCriteria</code> will work without them. Derived classes should use the contingencies whenever they are available, and compute them themselves only when they really need them. 
     332</DD> 
     333</DL> 
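<p>Since the stopping criteria are callable objects, they can also be tried out by hand; a small sketch, assuming the lenses data and that an empty <code>ExampleTable</code> can be constructed from a domain:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

stop = orange.TreeStopCriteria()
print stop(data)                              # 0 (false): more than one class is present
print stop(orange.ExampleTable(data.domain))  # presumably 1 (true): no examples left
</xmp>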
     334 
     335<H4>TreeStopCriteria_common</H4> 
     336<index name="classes/TreeStopCriteria_common"> 
     337 
     338<p><code>TreeStopCriteria_common</code> contains additional criteria for pre-pruning: it checks the proportion of majority class and the number of weighted examples.</p> 
     339 
     340<p class=section>Attributes</p> 
     341<DL class=attributes> 
     342<DT>maxMajor</DT> 
     343<DD>Maximal proportion of majority class. When this is exceeded, induction stops.</DD> 
     344 
     345<DT>minExamples</DT> 
     346<DD>Minimal number of examples for a node to be split further. Subsets with fewer than <code>minExamples</code> examples are not split any further. The example count is weighted.</DD> 
     347</DL> 
     348 
     349<H3>TreeExampleSplitter and derived classes</H3> 
     350 
     351<p><code>TreeExampleSplitter</code> is the third crucial component of <code>TreeLearner</code>. Its function is described in the <a href="#learning">description of the learning process</A>.</p> 
     352 
     353<H4>TreeExampleSplitter</H4> 
     354 
     355<p>An abstract base class for objects that split sets of examples into subsets. The derived classes differ in treatment of examples which cannot be unambiguously placed into a single branch (usually due to unknown value of the crucial attribute).</p> 
     356 
     357<P class=section>Methods</p> 
     358<DL class=attributes> 
     359<DT><B>__call__(node, examples[, weightID])</B></DT> 
     360<DD> 
     361Uses the information in <code>node</code> (particularly the <code>branchSelector</code>) to split the given set of examples into subsets. The function returns a tuple with a list of example generators and a list of weight IDs. The latter is either an ordinary Python list of integers or <code>None</code> when no splitting of examples occurs and thus no weights are needed. 
     362</DD> 
     363</DL> 
     364 
     365<p class=section>Derived classes</p> 
     366<DL class=attributes> 
     367<DT><INDEX name="classes/TreeExampleSplitter_IgnoreUnknowns">TreeExampleSplitter_IgnoreUnknowns</DT> 
     368<DD>Simply ignores the examples for which no single branch can be determined.</DD> 
     369 
     370<DT><INDEX name="classes/TreeExampleSplitter_UnknownsToCommon">TreeExampleSplitter_UnknownsToCommon</DT> 
     371<DD>Places all such examples into the branch with the highest number of examples. If there is more than one such branch, one is selected at random and then used for all examples.</DD> 
     372 
     373<DT><INDEX name="classes/TreeExampleSplitter_UnknownsToAll">TreeExampleSplitter_UnknownsToAll</DT> 
     374<DD>Places examples with unknown value of the attribute into all branches.</DD> 
     375 
     376<DT><INDEX name="classes/TreeExampleSplitter_UnknownsToRandom">TreeExampleSplitter_UnknownsToRandom</DT> 
     377<DD>Selects a random branch for such examples.</DD> 
     378 
     379<DT><INDEX name="classes/TreeExampleSplitter_UnknownsToBranch">TreeExampleSplitter_UnknownsToBranch</DT> 
     380<DD>Constructs an additional branch to contain all such examples. The branch's description is "unknown".</DD> 
     381 
     382<DT><INDEX name="classes/TreeExampleSplitter_UnknownsAsBranchSizes">TreeExampleSplitter_UnknownsAsBranchSizes</DT> 
     383<DD>Splits examples with unknown value of the attribute according to proportions of examples in each branch.</DD> 
     384 
     385<DT><INDEX name="classes/TreeExampleSplitter_UnknownsAsSelector">TreeExampleSplitter_UnknownsAsSelector</DT> 
     386<DD>Splits examples with unknown value of the attribute according to the distribution proposed by the selector (which is in most cases the same as the proportions of examples in branches).</DD> 
     387</DL> 
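<p>A sketch of choosing one of these splitters for a learner; the attribute name <code>exampleSplitter</code> follows its use elsewhere in this text:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")

learner = orange.TreeLearner()
# lenses has no unknown values, so this only demonstrates how the component is set
learner.exampleSplitter = orange.TreeExampleSplitter_UnknownsToBranch()
treeClassifier = learner(data)
print treeClassifier.tree.treesize()
</xmp>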
     388 
     389 
     390 
     391<H3>TreeDescender and derived classes</H3> 
     392 
     393<p>This is a classifier's counterpart for <code>TreeExampleSplitter</code>. It decides the destiny of examples that need to be classified and cannot be unambiguously put in a branch. The detailed function of this class is given in the <a href="#classification">description of classification with trees</A>.</p> 
     394 
     395<H4>TreeDescender</H4> 
     396 
     397<p>An abstract base object for tree descenders.</p> 
     398 
     399<p class=section>Methods</p> 
     400<DL class=attributes> 
     401<DT>__call__(node, example)</DT> 
     402<DD>Descends down the tree until it reaches a leaf or a node in which a vote of subtrees is required. In both cases, a tuple of two elements is returned; in the former, the tuple contains the reached node and <CODE>None</CODE>, in the latter it contains a node and the weights of votes for subtrees (a list of floats). 
     403 
     404<code>TreeDescender</code>s that never split examples always descend to a leaf, but they differ in the treatment of examples with unknown values (or, in general, examples for which a branch cannot be determined at some node(s) of the tree). <code>TreeDescender</code>s that do split examples differ in the returned vote weights.</DD> 
     405</DL> 
     406 
     407<p class=section>Derived classes</p> 
     408<DL class=attributes> 
     409<DT><INDEX name="classes/TreeDescender_UnknownsToNode">TreeDescender_UnknownsToNode</DT> 
     410<DD>When an example cannot be classified into a single branch, the current node is returned. Thus, the node's <CODE>nodeClassifier</CODE> will be used to make a decision. It is your responsibility to see that even the internal nodes have their <CODE>nodeClassifiers</CODE> (i.e., don't disable creating node classifiers or manually remove them after the induction, that's all).</DD> 
     411 
     412<DT><INDEX name="classes/TreeDescender_UnknownsToBranch">TreeDescender_UnknownsToBranch</DT> 
     413<DD>Classifies examples with unknown value to a special branch. This makes sense only if the tree itself was constructed with <CODE>TreeExampleSplitter_UnknownsToBranch</CODE>.</DD> 
     414 
     415<DT><INDEX name="classes/TreeDescender_UnknownsToCommonBranch">TreeDescender_UnknownsToCommonBranch</DT> 
     416<DD>Classifies examples with unknown values to the branch with the highest number of examples. If there is more than one such branch, a random branch is chosen for each example that is to be classified.</DD> 
     417 
     418<DT><INDEX name="classes/TreeDescender_UnknownsToCommonSelector">TreeDescender_UnknownsToCommonSelector</DT> 
     419<DD>Classifies examples with unknown values to the branch which received the highest recommendation by the selector.</DD> 
     420 
     421 
     422<DT><INDEX name="classes/TreeDescender_MergeAsBranchSizes">TreeDescender_MergeAsBranchSizes</DT> 
     423<DD>Makes the subtrees vote for the example's class; the vote is weighted according to the sizes of the branches.</DD> 
     424 
     425<DT><INDEX name="classes/TreeDescender_MergeAsSelector">TreeDescender_MergeAsSelector</DT> 
     426<DD>Makes the subtrees vote for the example's class; the vote is weighted according to the selector's proposal.</DD> 
     427</DL> 
     428 
     429 
     430<H3>TreePruner and derived classes</H3> 
     431<index name="classification trees/pruning"> 
     432<index name="pruning classification trees"> 
     433 
     434<p>Classes derived from TreePruner prune the trees as described in the section <A href="#pruning">pruning</A> - make sure you read it to understand what the pruners will do to your trees.</p> 
     435 
     436<H4>TreePruner</H4> 
     437 
     438<p>This is an abstract base class which defines nothing useful, only a pure virtual call operator.</p> 
     439 
     440<P class=section>Methods</P> 
     441<DL class=attributes> 
     442<DT>__call__(tree)</DT> 
     443<DD>Prunes a tree. The argument can be either a tree classifier or a tree node; the result is of the same type as the argument.</DD> 
     444</DL> 
     445 
     446 
     447<H4>TreePruner_SameMajority</H4> 
     448<index name="classes/TreePruner_SameMajority"> 
     449 
     450<p>In Orange, a tree can have non-trivial subtrees (i.e. subtrees with more than one leaf) in which all the leaves have the same majority class. (The reason why this is allowed is that those leaves can still have different distributions of classes and thus predict different probabilities.) However, this can be undesired when we're only interested in the class prediction or a simple tree interpretation. The <code>TreePruner_SameMajority</code> prunes the tree so that there is no subtree in which all the nodes would have the same majority class.</p> 
     451 
     452<p>This pruner will only prune the nodes in which the node classifier is of class <code>DefaultClassifier</code> (or from a derived class).</p> 
     453 
     454<p>Note that the leaves with more than one majority class require some special handling. The pruning goes backwards, from leaves to the root. When siblings are compared, the algorithm checks whether they have (at least one) common majority class. If so, they can be pruned.</p> 
     455 
     456<H4>TreePruner_m</H4> 
     457<index name="classes/TreePruner_m"> 
     458 
     459<p>Prunes a tree by comparing m-estimates of static and dynamic error as defined in (Bratko, 2002).</p> 
     460 
     461<p class=section>Attributes</p> 
     462<DL class=attributes> 
     463<DT>m</DT> 
     464<DD>Parameter m for m-estimation.</DD> 
     465</DL> 
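<p>A minimal sketch of using this pruner on a tree induced with the default components (lenses data); as noted in the pruning section, the original classifier's tree is left intact:</p>

<xmp class="code">import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

pruner = orange.TreePruner_m()
pruner.m = 2.0
prunedClassifier = pruner(treeClassifier)

print treeClassifier.tree.treesize(), prunedClassifier.tree.treesize()
</xmp>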
     466 
     467<hr> 
     468 
     469<A name="examples"></A> 
     470<H2>Examples</H2> 
     471 
     472<p>This page does not provide examples for programming your own components, such as, for instance, a <code>TreeSplitConstructor</code>. Those examples can be found on a page dedicated to <A href="callbacks.htm">callbacks to Python</A>.</p> 
     473 
     474<H3>Tree Structure</H3> 
     475 
     476<p>To have something to work on, we'll take the data from the lenses dataset and build a tree using the default components.</p> 
     477 
     478<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     479(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     480<xmp class="code">>>> data = orange.ExampleTable("lenses") 
     481>>> treeClassifier = orange.TreeLearner(data) 
     482</xmp> 
     483 
     484<p>How big is our tree?</p> 
     485 
     486<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     487(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     488<xmp class="code">def treeSize(node): 
     489    if not node: 
     490        return 0 
     491 
     492    size = 1 
     493    if node.branchSelector: 
     494        for branch in node.branches: 
     495            size += treeSize(branch) 
     496 
     497    return size 
     498</xmp> 
     499 
     500<p>If node is <CODE>None</CODE>, we have a null-node; null nodes don't count, so we return 0. Otherwise, the size is 1 (this node) plus the sizes of all subtrees. The node is an internal node if it has a <CODE>node.branchSelector</CODE>; if there's no selector, it's a leaf. Don't attempt to skip the <CODE>if</CODE> statement: leaves don't have an empty list of branches, they don't have a list of branches at all.</p> 
     501 
     502<xmp>>>> treeSize(treeClassifier.tree) 
     50310 
     504</xmp> 
     505 
     506<p>Don't forget that this was only an exercise - <code>TreeNode</code> has a built-in method <code>treesize()</code> that does exactly the same.</p> 
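<p>Continuing the session above, the built-in method gives the same answer:</p>

<xmp class="code">>>> treeClassifier.tree.treesize()
10
</xmp>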
     507 
     508<p>Let us now write a simple script that prints out a tree. The recursive part of the function will get a node and its level.</p> 
     509 
     510<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     511(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     512<xmp class="code">def printTree0(node, level): 
     513    if not node: 
     514        print " "*level + "<null node>" 
     515        return 
     516 
     517    if node.branchSelector: 
     518        nodeDesc = node.branchSelector.classVar.name 
     519        nodeCont = node.distribution 
     520        print "\n" + "   "*level + "%s (%s)" % (nodeDesc, nodeCont), 
     521        for i in range(len(node.branches)): 
     522            print "\n" + "   "*level + ": %s" % node.branchDescriptions[i], 
     523            printTree0(node.branches[i], level+1) 
     524    else: 
     525        nodeCont = node.distribution 
     526        majorClass = node.nodeClassifier.defaultValue 
     527        print "--> %s (%s) " % (majorClass, nodeCont), 
     528</xmp> 
     529 
     530<p>Don't waste time on studying formatting tricks (\n's etc.), this is just for nicer output. What matters is everything but the print statements. First, we check whether the node is a null-node (a node to which no learning examples were classified). If this is so, we just print out "&lt;null node&gt;" and return.</p> 
     531 
     532<p>After handling null nodes, the remaining nodes are either internal nodes or leaves. For internal nodes, we print a node description consisting of the attribute's name and the distribution of classes. <code>TreeNode</code>'s branch selector is, for all currently defined splits, an instance of a class derived from <code>Classifier</code> (in fact, it is a <code>ClassifierFromVarFD</code>, but a <code>Classifier</code> would suffice), and its <code>classVar</code> points to the attribute we seek. So we print its name. We will also assume that storing class distributions has not been disabled and print them as well. A more capable function for printing trees (such as the one defined in <code>orngTree</code>) has an alternative means to get the distribution when this fails. Then we iterate through the branches; for each, we print a branch description and recursively call <code>printTree0</code> with the level increased by 1 (to increase the indent).</p> 
     533 
     534<p>Finally, if the node is a leaf, we print out the distribution of learning examples in the node and the class to which the examples in the node would be classified. We again assume that the <code>nodeClassifier</code> is the default one - a <code>DefaultClassifier</code>. A better print function should be aware of possible alternatives.</p> 
     535 
     536<p>Now, we just need to write a simple function to call our printTree0. We could write something like...</p> 
     537<xmp class="code">def printTree(x): 
     538    printTree0(x.tree, 0) 
     539</xmp> 
     540<p>... but we won't. Let us learn how to handle arguments of different types. Let's write a function that will accept either a <code>TreeClassifier</code> or a <code>TreeNode</code> (just like <code>TreePruners</code>, remember?)</p> 
     541 
     542<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     543(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     544<xmp class="code">def printTree(x): 
     545    if isinstance(x, orange.TreeClassifier): 
     546        printTree0(x.tree, 0) 
     547    elif isinstance(x, orange.TreeNode): 
     548        printTree0(x, 0) 
     549    else: 
     550        raise TypeError, "TreeClassifier or TreeNode expected" 
     551</xmp> 
     552 
     553<p>It's fairly straightforward: if <code>x</code> is of a type derived from <code>orange.TreeClassifier</code>, we print <code>x.tree</code>; if it's a <code>TreeNode</code> we just call <code>printTree0</code> with <code>x</code>. If it's of some other type, we don't know how to handle it and thus raise an exception. (Note that we could also use <CODE>if type(x) == orange.TreeClassifier:</CODE>, but this would only work if <CODE>x</CODE> were of type <CODE>orange.TreeClassifier</CODE> and not of any derived type. The latter, however, do not exist yet...)</p> 
     554 
     555<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     556(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     557<xmp class="code">>>> printTree(treeClassifier) 
     558 
     559tear_rate (<15.000, 5.000, 4.000>) 
     560: reduced --> none (<12.000, 0.000, 0.000>) 
     561: normal 
     562   astigmatic (<3.000, 5.000, 4.000>) 
     563   : no 
     564      age (<1.000, 5.000, 0.000>) 
     565      : young --> soft (<0.000, 2.000, 0.000>) 
     566      : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
     567      : presbyopic --> none (<1.000, 1.000, 0.000>) 
     568   : yes 
     569      prescription (<2.000, 0.000, 4.000>) 
     570      : myope --> hard (<0.000, 0.000, 3.000>) 
     571      : hypermetrope --> none (<2.000, 0.000, 1.000>) 
     572</xmp> 
     573 
     574<p>For a final exercise, let us write a simple pruning procedure. It will be written entirely in Python, unrelated to any <code>TreePruner</code>s. Our procedure will limit the tree depth - the maximal depth (here defined as the number of internal nodes on any path down the tree) shall be given as an argument. For example, to get a two-level tree, we would call "<code>cutTree(root, 2)</code>". The function will be recursive, with the second argument (level) decreasing at each call; when zero, the current node will be made a leaf.</p> 
     575 
     576<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     577(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     578<xmp class="code">def cutTree(node, level): 
     579    if node and node.branchSelector: 
     580        if level: 
     581            for branch in node.branches: 
     582                cutTree(branch, level-1) 
     583        else: 
     584            node.branchSelector = None 
     585            node.branches = None 
     586            node.branchDescriptions = None 
     587</xmp> 
     588 
     589<p>There's nothing to prune at null-nodes or leaves, so we act only when <CODE>node</CODE> and <CODE>node.branchSelector</CODE> are defined. If level is not zero, we call the function for each branch. Otherwise, we clear the selector, branches and branch descriptions.</p> 
     590 
     591<p class="header">part of <a href="treestructure.py">treestructure.py</a> 
     592(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     593<xmp class="code">>>> cutTree(treeClassifier.tree, 2) 
     594>>> printTree(treeClassifier) 
     595 
     596tear_rate (<15.000, 5.000, 4.000>) 
     597: reduced --> none (<12.000, 0.000, 0.000>) 
     598: normal 
     599   astigmatic (<3.000, 5.000, 4.000>) 
     600   : no --> soft (<1.000, 5.000, 0.000>) 
     601   : yes --> hard (<2.000, 0.000, 4.000>) 
     602</xmp> 
     603 
     604 
     605<H3>Learning</H3> 
     606 
     607<p>You've already seen a simple example of using a <code>TreeLearner</code>. You can just call it and let it fill the empty slots with the default components. This section will teach you three things: what the missing components are (and how to set them yourself), how to use alternative components to get a different tree and, finally, how to write a skeleton for tree induction in Python.</p> 
     608 
     609<H4>Default components for TreeLearner</H4> 
     610 
     611<p>Let us construct a <code>TreeLearner</code> to play with.</p> 
     612 
     613<p class="header"><a href="treelearner.py">treelearner.py</a> 
     614(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     615<xmp class="code">>>> learner = orange.TreeLearner() 
     616</xmp> 
     617 
     618<p>There are three crucial components in learning: the split and stop criteria, and the <code>exampleSplitter</code> (there are some others, which become important during classification; we'll talk about them later). They are not defined; if you use the learner, the slots are filled temporarily but later cleared again.</p> 
     619 
     620<xmp class="code">>>> print learner.split 
     621None 
     622>>> learner(data) 
     623<TreeClassifier instance at 0x01F08760> 
     624>>> print learner.split 
     625None 
     626</xmp> 
     627 
     628<H4>Stopping criteria</H4> 
     629<p>The stop is trivial. The default is set by</p> 
     630 
     631<xmp class="code">>>> learner.stop = orange.TreeStopCriteria_common() 
     632</xmp> 
     633 
     634<p>Well, this is actually done in C++ and it uses a global component that is constructed once and for all, but apart from that we did effectively the same thing.</p> 
     635 
     636<p>We can now examine the default stopping parameters.</p> 
     637 
     638<xmp class="code">>>> print learner.stop.maxMajority, learner.stop.minExamples 
     6391.0 0.0 
     640</xmp> 
     641 
     642<p>Not very restrictive. This keeps splitting the examples until there's nothing left to split or all the examples are in the same class. Let us set the minimal subset that we allow to be split to five examples and see what comes out.</p> 
     643 
     644<p class="header">part of <a href="treelearner.py">treelearner.py</a> 
     645(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     646<xmp class="code">>>> learner.stop.minExamples = 5.0 
     647>>> tree = learner(data) 
     648>>> printTree(tree) 
     649tear_rate (<15.000, 5.000, 4.000>) 
     650: reduced --> none (<12.000, 0.000, 0.000>) 
     651: normal 
     652   astigmatic (<3.000, 5.000, 4.000>) 
     653   : no 
     654      age (<1.000, 5.000, 0.000>) 
     655      : young --> soft (<0.000, 2.000, 0.000>) 
     656      : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
     657      : presbyopic --> soft (<1.000, 1.000, 0.000>) 
     658   : yes 
     659      prescription (<2.000, 0.000, 4.000>) 
     660      : myope --> hard (<0.000, 0.000, 3.000>) 
     661      : hypermetrope --> none (<2.000, 0.000, 1.000>) 
     662</xmp> 
     663 
     664<p>OK, that's better. If we want an even smaller tree, we can also limit the maximal proportion of majority class.</p> 
     665 
     666<p class="header">part of <a href="treelearner.py">treelearner.py</a> 
     667(uses <a href="lenses.tab">lenses.tab</a>)</p> 
     668<xmp class="code">>>> learner.stop.maxMajority = 0.5 
     669>>> tree = learner(data) 
     670>>> printTree(tree) 
     671--> none (<15.000, 5.000, 4.000>) 
     672</xmp> 
     673 
     674<p>Well, this might have been overkill...</p> 
     675 
     676 
     677<H4>Splitting criteria</H4> 
     678 
     679<H4>Example splitter</H4> 
     680 
     681<H4>Flags and similar</H4> 
     682 
     683... also mention nodeLearner and descender 
     684 
     685<H4>Programming your own tree learner skeleton</H4> 
     686 
     687<H3>Classification</H3> 
     688 
     689<H4>Descender</H4> 
     690 
     691<H4>Node classifier</H4> 
     692 
     693<H3>Pruning</H3> 
     694 
     695<H4>Same majority pruning</H4> 
     696 
     697<H4>Post pruning</H4> 
     698 
     699... show a series of trees 
     700 
     701<H3>Defining your own components</H3> 
     702 
     703<H3>Various tricks</H3> 
     704 
     705<H4>Storing testing examples</H4> 
     706 
     707... show how to remember misclassifications etc.<P> 
     708 
     709<H4>Replacing node classifiers</H4> 
     710... replacing with something else, but based on learning examples<P> 
     711 
     712<hr> 
     713 
     714<H3><U>References</U></H3> 
     715 
     716Bratko, I. (2002). <EM>Prolog Programming for Artificial Intelligence</EM>, Addison Wesley, 2002.<P> 
     717""" 
     718 
     719from Orange.core import \ 
    2720     TreeLearner, \ 
    3721         TreeClassifier, \ 
     
    35753              TreeStopCriteria_Python, \ 
    36754              TreeStopCriteria_common 
     755 
     756 