Changeset 7713:d27afc892c9c in orange


Timestamp:
02/28/11 13:12:47 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
f7c8320a54b9bfb6a744524bbea26b913f1df33c
Message:

Tree documentation update.

Location:
orange
Files:
1 added
6 edited

  • orange/Orange/classification/tree.py

    r7712 r7713  
    1010******************** 
    1111 
    12 This page describes the Orange trees. It first describes the basic 
    13 components and procedures: it starts with the structure that represents  
    14 the tree, then it defines how the tree is used for classification, 
    15 how it is built and pruned. The order might seem strange, 
    16 but the concepts are rather complex and this order is perhaps a 
    17 bit easier to follow. After you have some idea of what the 
    18 principal components do, we describe the 
    19 concrete classes that you can use as 
    20 components for a tree learner. 
    21  
    22 Classification trees are represented as a tree-like hierarchy of 
    23 :obj:`Node` classes. 
    24  
    25 .. class:: Node 
    26  
    27     Node stores information about the learning examples belonging  
    28     to the node, a branch selector, a list of branches (if the node is  
    29     not a leaf) with their descriptions and strengths, and a classifier. 
    30  
    31     .. attribute:: distribution 
    32      
    33         Stores a distribution for learning examples belonging to the node. 
    34         Storing distributions can be disabled by setting the  
    35         :obj:`TreeLearnerBase`'s :obj:`storeDistributions` flag to false. 
    36  
    37     .. attribute:: contingency 
    38  
    39         Stores complete contingency matrices for the learning examples  
    40         belonging to the node. Storing contingencies can be enabled by  
    41         setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`  
    42         flag to true. Note that even when the flag is not  
    43         set, the contingencies get computed and stored to  
    44         :obj:`Node`, but are removed shortly afterwards.  
    45         The details are given in the  
    46         description of the :obj:`TreeLearnerBase` object. 
    47  
    48     .. attribute:: examples, weightID 
    49  
    50         Store a set of learning examples for the node and the 
    51         corresponding ID of the weight meta attribute. The root of the 
    52         tree stores a "master table" of examples, while other nodes' 
    53         :obj:`Orange.data.Table` contain references to examples in 
    54         the root's :obj:`Orange.data.Table`. Examples are only stored 
    55         if a corresponding flag (:obj:`storeExamples`) has been 
    56         set while building the tree; to conserve space, storing 
    57         is disabled by default. 
    58  
    59     .. attribute:: nodeClassifier 
    60  
    61         A classifier (usually, but not necessarily, a 
    62         :obj:`DefaultClassifier`) that can be used to classify 
    63         examples coming to the node. If the node is a leaf, this is 
    64         used to decide the final class (or class distribution) of an 
    65         example. If it's an internal node, it is stored only if the 
    66         :obj:`TreeLearnerBase`'s flag :obj:`storeNodeClassifier` 
    67         is set. Since the :obj:`nodeClassifier` is needed by 
    68         :obj:`Descender` and for pruning (see far below), 
    69         this is the default behaviour; space consumption of the default 
    70         :obj:`DefaultClassifier` is rather small. You should 
    71         never disable this if you intend to prune the tree later. 
    72  
    73     If the node is a leaf, the remaining fields are None.  
    74     If it's an internal node, there are several additional fields. 
    75  
    76     .. attribute:: branches 
    77  
    78         Stores a list of subtrees, given as :obj:`Node`. 
    79         An element can be None; in this case the node is empty. 
    80  
    81     .. attribute:: branchDescriptions 
    82  
    83         A list with string descriptions for branches, constructed by 
    84         :obj:`SplitConstructor`. It can contain different kinds 
    85         of descriptions, but basically, expect things like 'red' or '>12.3'. 
    86  
    87     .. attribute:: branchSizes 
    88  
    89         Gives a (weighted) number of training examples that went into 
    90         each branch. This can be used later, for instance, for 
    91         modeling probabilities when classifying examples with 
    92         unknown values. 
    93  
    94     .. attribute:: branchSelector 
    95  
    96         Gives a branch for each example. The same object is used during 
    97         learning and classifying. The :obj:`branchSelector` is of 
    98         type :obj:`orange.Classifier`, since its job is similar to that 
    99         of a classifier: it gets an example and returns a discrete 
    100         :obj:`Orange.data.Value` in the range :samp:`[0, len(branches)-1]`. 
    101         When an example cannot be classified into any branch, the selector 
    102         can return a :obj:`Orange.data.Value` containing a special value 
    103         (sVal) which should be a discrete distribution 
    104         (DiscDistribution). This should represent the 
    105         :obj:`branchSelector`'s opinion of how to divide the 
    106         example between the branches. Whether the proposition will be 
    107         used or not depends upon the chosen :obj:`ExampleSplitter` 
    108         (when learning) or :obj:`Descender` (when classifying). 
    109  
    110     The lists :obj:`branches`, :obj:`branchDescriptions` and 
    111     :obj:`branchSizes` are of the same length; all of them are 
    112     defined if the node is internal and none if it is a leaf. 
    113  
    114     .. method:: treeSize() 
    115          
    116         Return the number of nodes in the subtrees (including the 
    117         node, excluding null-nodes). 
    118  
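As a quick illustration of the attributes described above, one can build a
tree with the default components and inspect its root node. This is only a
sketch (not one of the documented example scripts); it assumes the lenses
data is available and uses the class names described on this page::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearnerBase()
    tree = learner(data)

    root = tree.tree                 # the root Node
    print root.distribution          # class distribution in the root
    print root.branchDescriptions    # descriptions of the root's branches
    print root.treeSize()            # number of nodes in the whole tree
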
    119 ============== 
    120 Classification 
    121 ============== 
    122  
    123 .. class:: _TreeClassifier 
    124  
    125     Classifies examples according to a tree stored in :obj:`tree`. 
    126  
    127     .. attribute:: tree 
    128  
    129         The root of the tree, represented as a :class:`Node`. 
    130      
    131     Classification would be straightforward if there were no unknown  
    132     values or, in general, examples that cannot be placed into a  
    133     single branch. The response in such cases is determined by a 
    134     component :obj:`descender`. 
    135  
    136     :obj:`Descender` is an abstract object which is given an example 
    137     and whose basic job is to descend as far down the tree as possible, 
    138     according to the values of the example's attributes. The 
    139     :obj:`Descender` calls the node's :obj:`branchSelector` to get  
    140     the branch index. If it's a simple index, the corresponding branch  
    141     is followed. If not, it's up to the descender to decide what to do, and 
    142     that's where descenders differ. A :obj:`descender` can choose  
    143     a single branch (for instance, the one that is the most recommended  
    144     by the :obj:`branchSelector`) or it can let the branches vote. 
    145  
    146     In general there are three possible outcomes of a descent. 
    147  
    148     #. Descender reaches a leaf. This happens when nothing went wrong  
    149        (there are no unknown or out-of-range values in the example) or  
    150        when things went wrong, but the descender smoothed them by  
    151        selecting a single branch and continued the descent. In this 
    152        case, the descender returns the reached :obj:`Node`. 
    153     #. :obj:`branchSelector` returned a distribution and the  
    154        :obj:`Descender` decided to stop the descent at this  
    155        (internal) node.  Again, the descender returns the current  
    156        :obj:`Node` and nothing else. 
    157     #. :obj:`branchSelector` returned a distribution and the  
    158        :obj:`Descender` wants to split the example (i.e., to decide the  
    159        class by voting).  
    160  
    161     In this case, it returns a :obj:`Node` and the vote-weights for the branches.  
    162     The weights can correspond to the distribution returned by 
    163     :obj:`branchSelector`, to the number of learning examples that 
    164     were assigned to each branch, or to something else. 
    165  
    166     :obj:`TreeClassifier` uses the descender to descend from the root.  
    167     If it returns only a :obj:`Node` and no distribution, the  
    168     descent should stop; it does not matter whether it's a leaf (the 
    169     first case above) or an internal node (the second case). The node's 
    170     :obj:`nodeClassifier` is used to decide the class. If the descender 
    171     returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier` 
    172     recursively calls itself for each of the subtrees and the  
    173     predictions are weighted as requested by the descender. 
    174  
    175     When voting, subtrees do not predict the class but probabilities  
    176     of classes. The predictions are multiplied by weights, summed and  
    177     the most probable class is returned. 
    178  
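    The sketch below illustrates the arithmetic of such a vote in plain
    Python (an illustration only, not the actual implementation; all names
    in it are hypothetical)::

        # class-probability distributions predicted by three subtrees
        subtree_predictions = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
        # vote weights proposed by the descender for the three branches
        weights = [0.5, 0.3, 0.2]

        n_classes = len(subtree_predictions[0])
        summed = [0.0] * n_classes
        for dist, w in zip(subtree_predictions, weights):
            for i, p in enumerate(dist):
                summed[i] += w * p
        # the most probable class wins
        predicted = summed.index(max(summed))
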
    179     .. method:: vote() 
    180  
    181     .. method:: classDistribution() 
    182  
    183  
    184 The rest of this section is only for those interested in the C++ code. 
    185 ====================================================================== 
    186  
    187 If you'd like to understand how the classification works in C++,  
    188 start reading at :obj:`TTreeClassifier::vote`. It gets a  
    189 :obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of  
    190 vote weights. For each node, it calls the  
    191 :obj:`TTreeClassifier::classDistribution` and then multiplies  
    192 and sums the distribution. :obj:`vote` returns a normalized  
    193 distribution of predictions. 
    194  
    195 A new overload of :obj:`TTreeClassifier::classDistribution` gets 
    196 an additional parameter, a :obj:`Node`. This is done  
    197 for the sake of recursion. The normal version of  
    198 :obj:`classDistribution` simply calls the overloaded with a  
    199 tree root as an additional parameter. :obj:`classDistribution`  
    200 uses :obj:`descender`. If descender reaches a leaf, it calls  
    201 :obj:`nodeClassifier`, otherwise it calls :obj:`vote`. 
    202  
    203 Thus, the :obj:`TreeClassifier`'s :obj:`vote` and  
    204 :obj:`classDistribution` are written in a form of double  
    205 recursion. The recursive calls do not happen at each node of the  
    206 tree but only at nodes where a vote is needed (that is, at nodes  
    207 where the descender halts). 
    208  
    209 For predicting a class, :obj:`operator()` calls the 
    210 descender. If it reaches a leaf, the class is predicted by the  
    211 leaf's :obj:`nodeClassifier`. Otherwise, it calls  
    212 :obj:`vote`. From now on, :obj:`vote` and  
    213 :obj:`classDistribution` interweave down the tree and return  
    214 a distribution of predictions. :obj:`operator()` then simply  
    215 chooses the most probable class. 
    216  
    217 ======== 
    218 Learning 
    219 ======== 
    220  
    221 The main learning object is :obj:`TreeLearnerBase`. It is basically  
    222 a skeleton into which the user must plug the components for particular  
    223 functions. For easier use, defaults are provided. 
    224  
    225 Components that govern the structure of the tree are :obj:`split` 
    226 (of type :obj:`SplitConstructor`), :obj:`stop` (of  
    227 type :obj:`StopCriteria`) and :obj:`exampleSplitter` 
    228 (of type :obj:`ExampleSplitter`). 
    229  
    230 .. class:: SplitConstructor 
    231  
    232     Finds a suitable criterion for dividing the learning (and later testing) 
    233     examples coming to the node. The data it gets is a set of examples 
    234     (and, optionally, an ID of weight meta-attribute), a domain 
    235     contingency computed from examples, apriori class probabilities, 
    236     a list of candidate attributes it should consider and a node 
    237     classifier (if it was constructed, that is, if  
    238     :obj:`storeNodeClassifier` is left true). 
    239  
    240     The :obj:`SplitConstructor` should use the domain contingency 
    241     when possible. The reasons are two-fold; one is that it's faster 
    242     and the other is that the contingency matrices are not necessarily 
    243     constructed by simply counting the examples. Why and how is 
    244     explained later. There are, however, cases when the domain contingency 
    245     does not suffice; for example, when ReliefF is used as a measure 
    246     of quality of attributes. In this case, there's no other way but 
    247     to use the examples and ignore the precomputed contingencies. 
    248  
    249     The split constructor should consider only the attributes in the 
    250     candidate list (the list is a vector of booleans, one for each 
    251     attribute). 
    252  
    253     :obj:`SplitConstructor` returns most of the data we talked 
    254     about when describing the :obj:`Node`. It returns a classifier 
    255     to be used as :obj:`Node`'s :obj:`branchSelector`, a list of 
    256     branch descriptions and a list with the number of examples that 
    257     go into each branch. Just what we need for the :obj:`Node`. 
    258     It can return an empty list for the number of examples in branches; 
    259     in this case, the :obj:`TreeLearnerBase` will find the number itself 
    260     after splitting the example set into subsets. However, if a split 
    261     constructor can provide the numbers at no extra computational 
    262     cost, it should do so. 
    263  
    264     In addition, it returns the quality of the split: a number without 
    265     any fixed meaning except that higher numbers mean better splits. 
    266  
    267     If the constructed splitting criterion uses an attribute in such 
    268     a way that the attribute is 'completely spent' and should not be 
    269     considered as a split criterion in any of the subtrees (the 
    270     typical case of this are discrete attributes that are used 
    271     as-they-are, that is, without any binarization or subsetting), 
    272     then it should report the index of this attribute. Some splits 
    273     do not spend any attribute; this is indicated by returning a 
    274     negative index. 
    275  
    276     A :obj:`SplitConstructor` can veto the further tree induction 
    277     by returning no classifier. This can happen for many reasons. 
    278     A general one is related to the number of examples in the branches. 
    279     :obj:`SplitConstructor` has a field :obj:`minSubset`, 
    280     which sets the minimal number of examples in a branch; null nodes, 
    281     however, are allowed. If there is no split where this condition 
    282     is met, :obj:`SplitConstructor` stops the induction. 
    283  
    284     .. attribute:: minSubset 
    285  
    286         Sets the minimal number of examples in non-null leaves. As 
    287         always in Orange (where not specified otherwise), "number of  
    288         examples" refers to the weighted number of examples. 
    289      
    290     .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])  
    291  
    292         Construct a split. Returns a tuple (:obj:`branchSelector`, 
    293         :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`, 
    294         :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no 
    295         attribute is completely spent by the split criterion. If no 
    296         split is constructed, the :obj:`branchSelector`, :obj:`branchDescriptions` 
    297         and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and 
    298         :obj:`spentAttribute` is -1. 
    299  
    300         :param examples:  Examples can be given in any acceptable form 
    301             (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a 
    302             list of examples). 
    303         :param weightID: Optional; the default of 0 means that all 
    304             examples have a weight of 1.0.  
    305         :param apriori_distribution: Should be of type  
    306             :obj:`orange.Distribution`. 
    307         :param candidates: A Python list of objects interpreted as booleans. 
    308  
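For illustration, a split constructor can also be called directly and the
returned tuple unpacked as described above. This is only a sketch: it assumes
the class names documented on this page, that the lenses data is available,
and that the optional arguments can indeed be omitted as the signature above
suggests::

    import Orange

    data = Orange.data.Table("lenses")
    split = Orange.classification.tree.SplitConstructor_Attribute()
    split.measure = Orange.feature.scoring.GainRatio()
    selector, descriptions, sizes, quality, spent = split(data)
    print descriptions, quality, spent
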
    309  
    310 .. class:: StopCriteria 
    311  
    312     Given a set of examples, weight ID and contingency matrices, decide 
    313     whether to continue the induction or not. The basic criterion checks 
    314     whether there are any examples and whether they belong to at least 
    315     two different classes (if the class is discrete). Derived components 
    316     check things like the number of examples and the proportion of 
    317     majority classes. 
    318  
    319     As opposed to :obj:`SplitConstructor` and similar basic classes, 
    320     :obj:`StopCriteria` is not abstract but a fully functional 
    321     class that provides the basic stopping criteria. That is, the tree 
    322     induction stops when there is at most one example left; in this case, 
    323     it is not the weighted but the actual number of examples that counts. 
    324     Besides that, the induction stops when all examples are in the same 
    325     class (for discrete problems) or have the same value of the outcome 
    326     (for regression problems). 
    327  
    328     .. method:: __call__(examples[, weightID, domain contingencies]) 
    329  
    330         Decides whether to stop (true) or continue (false) the induction. 
    331         If contingencies are given, they are used for checking whether 
    332         the examples are in the same class (but not for counting the 
    333         examples). If contingencies are not given, :obj:`StopCriteria` 
    334         will work without them. Derived classes should use the contingencies 
    335         whenever they are available, but otherwise compute them only 
    336         when they really need them. 
    338  
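The basic criterion can also be called directly; a sketch (hypothetical usage
following the signature above, assuming the lenses data is available)::

    import Orange

    data = Orange.data.Table("lenses")
    stop = Orange.classification.tree.StopCriteria()
    # the lenses examples still span several classes, so the basic
    # criterion should not stop the induction here
    print stop(data)
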
    339  
    340 .. class:: ExampleSplitter 
    341  
    342     Just like the :obj:`Descender` decides the branch for an 
    343     example during classification, the :obj:`ExampleSplitter` 
    344     sorts the learning examples into branches. 
    345  
    346     :obj:`ExampleSplitter` is given a :obj:`Node` (from which  
    347     it can use various information, though most splitters only use the  
    348     :obj:`branchSelector`), a set of examples to be divided, and  
    349     the weight ID. The result is a list of subsets of examples 
    350     and, optionally, a list of new weight IDs. 
    351  
    352     Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most  
    353     of :obj:`ExampleSplitters` simply call the node's  
    354     :obj:`branchSelector` and assign examples to corresponding  
    355     branches. When the value is unknown they choose a particular  
    356     branch or simply skip the example. 
    357  
    358     Some enhanced splitters can split examples. An example (actually,  
    359     a pointer to it) is copied to more than one subset. To facilitate 
    360     real splitting, weights are needed. Each branch is assigned a 
    361     weight ID (each would usually have its own ID) and all examples 
    362     that are in that branch (either completely or partially) should 
    363     have this meta attribute. If an example hasn't been split, it 
    364     has only one additional attribute - with the weight ID corresponding 
    365     to the subset to which it went. An example that is split between, 
    366     say, three subsets has three new meta attributes, one for each 
    367     subset. IDs of weight meta attributes are returned by the 
    368     :obj:`ExampleSplitter` to be used at induction of the 
    369     corresponding subtrees. 
    370  
    371     Note that weights are used only when needed. When no splitting 
    372     occurred - because the splitter is not able to do it or because 
    373     there was no need for splitting - no weight IDs are returned. 
    374  
    375     :obj:`ExampleSplitter` is an abstract base class. The derived 
    376     classes differ in the treatment of examples which 
    377     cannot be unambiguously placed into a single branch (usually due 
    378     to an unknown value of the crucial attribute). 
    379  
    380     .. method:: __call__(node, examples[, weightID]) 
    381          
    382         Use the information in :obj:`node` (particularly the  
    383         :obj:`branchSelector`) to split the given set of examples into subsets.  
    384         Return a tuple with a list of example generators and a list of weights.  
    385         The list of weights is either an ordinary Python list of integers or  
    386         None when no splitting of examples occurs and thus no weights are  
    387         needed. 
    388  
    389 .. class:: TreeLearnerBase 
    390  
    391     TreeLearnerBase has a number of components. 
    392  
    393     .. attribute:: split 
    394  
    395         Object of type :obj:`SplitConstructor`. Default value,  
    396         provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined` 
    397         with separate constructors for discrete and continuous attributes.  
    398         Discrete attributes are used as they are, while continuous attributes 
    399         are binarized. Gain ratio is used to select attributes. 
    400         A minimum of two examples in a leaf is required for discrete 
    401         and five examples in a leaf for continuous attributes. 
    402      
    403     .. attribute:: stop 
    404  
    405         Object of type :obj:`StopCriteria`. The default stopping 
    406         criterion stops induction when all examples in a node belong  
    407         to the same class. 
    408  
    409     .. attribute:: splitter 
    410  
    411         Object of type :obj:`ExampleSplitter`. The default splitter 
    412         is :obj:`ExampleSplitter_UnknownsAsSelector` that splits 
    413         the learning examples according to distributions given by the 
    414         selector. 
    415  
    416     .. attribute:: contingencyComputer 
    417      
    418         By default, this slot is left empty and ordinary contingency 
    419         matrices are computed for examples at each node. If the need arises, 
    420         one can change the way the matrices are computed. This can be 
    421         used to change the way that unknown values are treated when 
    422         assessing qualities of attributes. As mentioned earlier, 
    423         the computed matrices can be used by the split constructor and by 
    424         the stopping criteria. On the other hand, they can be (and are) 
    425         ignored by some split constructors. 
    426  
    427     .. attribute:: nodeLearner 
    428  
    429         Induces a classifier from examples belonging to a node. The 
    430         same learner is used for internal nodes and for leaves. The 
    431         default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`. 
    432  
    433     .. attribute:: descender 
    434  
    435         Descending component that the induced :obj:`TreeClassifier` 
    436         will use. Default descender is  
    437         :obj:`Descender_UnknownMergeAsSelector` which votes using  
    438         the :obj:`branchSelector`'s distribution for vote weights. 
    439  
    440     .. attribute:: maxDepth 
    441  
    442         Gives the maximal tree depth; 0 means that only the root is generated.  
    443         The default is 100 to prevent any infinite tree induction due 
    444         to misconfigured stop criteria. If you are sure you need 
    445         larger trees, increase it. If you, on the other hand, want 
    446         to lower this hard limit, you can do so as well. 
    447  
    448     .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier 
    449  
    450         Decides whether to store class distributions, contingencies  
    451         and examples in :obj:`Node`, and whether the  
    452         :obj:`nodeClassifier` should be built for internal nodes.  
    453         By default, distributions and node classifiers are stored,  
    454         while contingencies and examples are not. You won't save any  
    455         memory by not storing distributions but storing contingencies, 
    456         since the distribution actually points to the same object 
    457         that is stored in :obj:`contingency.classes`. 
    458  
    459     The :obj:`TreeLearnerBase` first sets the defaults for missing 
    460     components. Although stored in the actual :obj:`TreeLearnerBase`'s 
    461     fields, they are removed when the induction is finished. 
    462  
    463     Then it ensures that examples are stored in a table. This is needed 
    464     because the algorithm juggles with pointers to examples. If 
    465     examples are in a file or are fed through a filter, they are copied 
    466     to a table. Even if they are already in a table, they are copied 
    467     if :obj:`storeExamples` is set. This is to ensure that the pointers 
    468     keep pointing to valid examples even if the user later changes the 
    469     example table. If they are in the table and the :obj:`storeExamples` 
    470     flag is clear, we just use them as they are. This will obviously 
    471     crash in a multi-threaded system if one changes the table during 
    472     the tree induction. Well... don't do it. 
    473  
    474     Apriori class probabilities are computed. At this point we check 
    475     the sum of example weights; if it's zero, there are no examples and  
    476     we cannot proceed. A list of candidate attributes is set; in the 
    477     beginning, all attributes are candidates for the split criterion. 
    478  
    479     Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments  
    480     are a set of examples, a weight meta-attribute ID (a tricky thing, 
    481     it can always be the same as the original or can change to  
    482     accommodate splitting of examples among branches), apriori class 
    483     distribution and a list of candidates (represented as a vector 
    484     of Boolean values). 
    485  
    486     The contingency matrix is computed next. This happens 
    487     even if the flag :obj:`storeContingencies` is false. 
    488     If the :obj:`contingencyComputer` is given we use it, 
    489     otherwise we construct just an ordinary contingency matrix. 
    490  
    491     A :obj:`stop` is called to see whether it is worth continuing. If  
    492     not, a :obj:`nodeClassifier` is built and the :obj:`Node` is  
    493     returned. Otherwise, a :obj:`nodeClassifier` is only built if  
    494     the :obj:`forceNodeClassifier` flag is set. 
    495  
    496     To get a :obj:`Node`'s :obj:`nodeClassifier`, the  
    497     :obj:`nodeLearner`'s :obj:`smartLearn` function is called with  
    498     the given examples, weight ID and the just computed matrix. If  
    499     the learner can use the matrix (and the default,  
    500     :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus, 
    501     a choice of :obj:`contingencyComputer` will, in many cases,  
    502     affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can 
    503     return no classifier; if so and if the classifier would be  
    504     needed for classification, the :obj:`TreeClassifier`'s function 
    505     returns DK (don't know) or an empty distribution. If you're writing your own 
    506     tree classifier - pay attention. 
    507  
    508     If the induction is to continue, a :obj:`split` component is called.  
    509     If it fails to return a branch selector, induction stops and the  
    510     :obj:`Node` is returned. 
    511  
    512     :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide  
    513     the examples as described above. 
    514  
    515     The contingency gets removed at this point if it is not to be  
    516     stored. Thus, the :obj:`split`, :obj:`stop` and  
    517     :obj:`exampleSplitter` can use the contingency matrices if they need to. 
    518  
    519     The :obj:`TreeLearnerBase` then recursively calls itself for each of  
    520     the non-empty subsets. If the splitter returns a list of weights,  
    521     a corresponding weight is used for each branch. Besides, the  
    522     attribute spent by the split constructor (if any) is removed from the  
    523     list of candidates for the subtree. 
    524  
    525     A subset of examples is stored in its corresponding tree node,  
    526     if so requested. If not, the new weight attributes are removed (if  
    527     any were created). 
    528  
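The induction described above is driven entirely by the plugged-in
components. The following sketch assembles a learner with its main components
set explicitly instead of relying on the defaults. It is only a sketch: it
assumes the class names documented on this page and that the lenses data is
available::

    import Orange

    learner = Orange.classification.tree.TreeLearnerBase()

    split = Orange.classification.tree.SplitConstructor_Combined()
    split.discreteSplitConstructor = \
        Orange.classification.tree.SplitConstructor_Attribute()
    split.discreteSplitConstructor.measure = Orange.feature.scoring.GainRatio()
    split.continuousSplitConstructor = \
        Orange.classification.tree.SplitConstructor_Threshold()
    split.continuousSplitConstructor.measure = Orange.feature.scoring.GainRatio()
    learner.split = split

    stop = Orange.classification.tree.StopCriteria_common()
    stop.minExamples = 5
    stop.maxMajority = 0.8
    learner.stop = stop

    learner.maxDepth = 5                     # hard depth limit

    data = Orange.data.Table("lenses")
    tree = learner(data)
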
    529 Pruning 
    530 ======= 
    531  
    532 Tree pruners derived from :obj:`Pruner` can be given either a 
    533 :obj:`Node` (presumably, but not necessarily a root) or a 
    534 :obj:`TreeClassifier`. The result is a new, pruned :obj:`Node` 
    535 or a new :obj:`TreeClassifier` with a pruned tree. The original 
    536 tree remains intact. 
    537  
    538 Note however that pruners construct only a shallow copy of a tree. 
    539 The pruned tree's :obj:`Node` objects contain references to the same 
    540 contingency matrices, node classifiers, branch selectors, ... 
    541 as the original tree. Thus, you may modify a pruned tree's structure 
    542 (manually cut it, add new nodes, replace components), but modifying, 
    543 for instance, some node's :obj:`nodeClassifier` (the 
    544 :obj:`nodeClassifier` itself, not a reference to it!) would also modify 
    545 the :obj:`nodeClassifier` of the corresponding node of 
    546 the original tree. 
    547  
    548 Talking about node classifiers - pruners cannot construct a 
    549 :obj:`nodeClassifier` nor merge :obj:`nodeClassifier` of the pruned 
    550 subtrees into classifiers for new leaves. Thus, if you want to build 
    551 a prunable tree, internal nodes must have their :obj:`nodeClassifier` 
    552 defined. Fortunately, all you need to do is nothing; if you leave 
    553 the :obj:`TreeLearnerBase`'s flags as they are by default, the 
    554 node classifiers are created. 
    555  
    556 ======= 
    557 Classes 
    558 ======= 
    559  
    560 Several classes described above are already functional and can 
    561 (and mostly will) be used as they are. Those classes are :obj:`Node`, 
    562 :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes  
    563 the other classes. 
    564  
    565 Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,  
    566 :obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner` 
    567 and :obj:`Classifier` are among the Orange classes that can be subtyped  
    568 in Python and have the call operator overloaded in such a way that it 
    569 is called back from C++ code. You can thus program your own components 
    570 for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed  
    571 information on how this is done and what can go wrong, is given in a  
    572 separate page, dedicated to callbacks to Python XXXXXXXXXX. 
    573  
    574 SplitConstructors 
    575 ===================== 
    576  
    577 Split construction is almost as exciting as waiting for a delayed flight. 
    578 Boring, that is. Split constructors are programmed as spaghetti code 
    579 that juggles with contingency matrices, with separate cases for discrete 
    580 and continuous classes... Most split constructors work either for 
    581 discrete or for continuous attributes. The suggested practice is 
    582 to use a :obj:`SplitConstructor_Combined` that can handle 
    583 both by simply delegating attributes to specialized split constructors. 
    584  
    585 Note: split constructors that cannot handle attributes of a particular 
    586 type (discrete, continuous) do not report an error or a warning but 
    587 simply skip the attribute. It is your responsibility to use a correct 
    588 split constructor for your dataset. (May we again suggest 
    589 using :obj:`SplitConstructor_Combined`?) 
    590  
    591 The same components can be used for inducing either classification or 
    592 regression trees. The only component that needs to be chosen accordingly 
    593 is the 'measure' attribute for the :obj:`SplitConstructor_Measure` 
    594 class (and derived classes). 
    595  
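For instance, a classification tree could score candidate splits with gain
ratio, while a regression tree would use an MSE-based measure (a sketch, using
the scoring classes referenced below)::

    import Orange

    # classification: score candidate splits with gain ratio
    classification_split = Orange.classification.tree.SplitConstructor_Attribute()
    classification_split.measure = Orange.feature.scoring.GainRatio()

    # regression: the same kind of constructor, but with an MSE-based measure
    regression_split = Orange.classification.tree.SplitConstructor_Attribute()
    regression_split.measure = Orange.feature.scoring.MSE()
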
    596 .. class:: SplitConstructor_Measure 
    597  
    598     Bases: :class:`SplitConstructor` 
    599  
    600     An abstract base class for split constructors that employ  
    601     a :obj:`Orange.feature.scoring.Measure` to assess the quality of a split.  
    602     At present, 
    603     all split constructors except for :obj:`SplitConstructor_Combined` 
    604     are derived from this class. 
    605  
    606     .. attribute:: measure 
    607  
    608         A component of type :obj:`Orange.feature.scoring.Measure` used for 
    609         evaluation of a split. Note that you must select a subclass of  
    610         :obj:`Orange.feature.scoring.Measure` capable of handling your  
    611         class type: 
    612         you cannot use :obj:`Orange.feature.scoring.GainRatio` 
    613         for building regression trees or :obj:`Orange.feature.scoring.MSE` 
    614         for classification trees. 
    615  
    616     .. attribute:: worstAcceptable 
    617  
    618         The lowest required split quality for a split to be acceptable. 
    619         Note that this value makes sense only in connection with a 
    620         :obj:`measure` component. Default is 0.0. 
    621  
    622 .. class:: SplitConstructor_Attribute 
    623  
    624     Bases: :class:`SplitConstructor_Measure` 
    625  
    626     Attempts to use a discrete attribute as a split; each value of the  
    627     attribute corresponds to a branch in the tree. Attributes are 
    628     evaluated with the :obj:`measure` and the one with the 
    629     highest score is used for a split. If there is more than one 
    630         attribute with the highest score, one of them is selected at random. 
    631  
    632     The constructed :obj:`branchSelector` is an instance of  
    633     :obj:`orange.ClassifierFromVarFD` that returns a value of the  
    634     selected attribute. If the attribute is  
    635     :obj:`Orange.data.variable.Discrete`, 
    636     the :obj:`branchDescriptions` are the attribute's values. The  
    637     attribute is marked as spent, so that it cannot reappear in the  
    638     node's subtrees. 
    639  
    640 .. class:: SplitConstructor_ExhaustiveBinary 
    641  
    642     Bases: :class:`SplitConstructor_Measure` 
    643  
    644     Works on discrete attributes. For each attribute, it determines 
    645     which binarization of the attribute gives the split with the 
    646     highest score. If more than one split has the highest score, 
    647     one of them is selected at random. After trying all the attributes, 
    648     it returns one of those with the highest score. 
    649  
    650     The constructed :obj:`branchSelector` is again an instance of 
    651     :obj:`orange.ClassifierFromVarFD` that returns a value of the 
    652     selected attribute. This time, however, its :obj:`transformer` 
    653     contains an instance of :obj:`MapIntValue` that maps the values 
    654     of the attribute into a binary attribute. Branch descriptions are 
    655     of form "[<val1>, <val2>, ...<valn>]" for branches corresponding to 
    656     more than one value of the attribute. Branches that correspond to 
    657     a single value of the attribute are described with this value. If  
    658     the attribute was originally binary, it is spent and cannot be  
    659     used in the node's subtrees. Otherwise, it can reappear in the  
    660     subtrees. 
    661  
    662  
    663 .. class:: SplitConstructor_Threshold 
    664  
    665     Bases: :class:`SplitConstructor_Measure` 
    666  
    667     This is currently the only constructor for splits with continuous  
    668     attributes. It divides the range of attribute values with a threshold  
    669     that maximizes the split's quality. As always, if there is more than  
    670     one split with the highest score, a random threshold is selected.  
    671     The attribute that yields the highest binary split is returned. 
    672  
    673     The constructed :obj:`branchSelector` is again an instance of  
    674     :obj:`orange.ClassifierFromVarFD` with an attached  
    675     :obj:`transformer`. This time, :obj:`transformer` is of type  
    676     :obj:`orange.ThresholdDiscretizer`. The branch descriptions are  
    677     "<threshold" and ">=threshold". The attribute is not spent. 
    678  
    679 .. class:: SplitConstructor_OneAgainstOthers 
    680      
    681     Bases: :class:`SplitConstructor_Measure` 
    682  
    683     Undocumented. 
    684  
    685 .. class:: SplitConstructor_Combined 
    686  
    687     Bases: :class:`SplitConstructor` 
    688  
    689     This constructor delegates the task of finding the optimal split  
    690     to separate split constructors for discrete and for continuous 
    691     attributes. Each split constructor is called, given only attributes 
    692     of appropriate types as candidates. Both construct a candidate for 
    693     a split; the better of them is selected. 
    694  
    695     (Note that there is a problem when multiple candidates have the same 
    696     score. Suppose there are nine discrete attributes with the highest 
    697     score; the split constructor for discrete attributes will select 
    698     one of them. Now, let us suppose that there is a single continuous 
    699     attribute with the same score. :obj:`SplitConstructor_Combined` 
    700     would randomly select between the proposed discrete attribute and  
    701     the continuous attribute, not aware of the fact that the discrete 
    702     has already competed with eight other discrete attributes. So,  
    703     the probability for selecting (each) discrete attribute would be 1/18 
    704     instead of 1/10. Although not really correct, we doubt that this 
    705     would affect the tree's performance; many other machine learning 
    706     systems simply choose the first attribute with the highest score  
    707     anyway.) 
    708  
    709     The :obj:`branchSelector`, :obj:`branchDescriptions` and whether  
    710     the attribute is spent are decided by the winning split constructor. 
    711  
    712     .. attribute:: discreteSplitConstructor 
    713  
    714         Split constructor for discrete attributes; can be, for instance, 
    715         :obj:`SplitConstructor_Attribute` or  
    716         :obj:`SplitConstructor_ExhaustiveBinary`. 
    717  
    718     .. attribute:: continuousSplitConstructor 
    719  
    720         Split constructor for continuous attributes; at the moment, it  
    721         can be either :obj:`SplitConstructor_Threshold` or a  
    722         split constructor you programmed in Python. 
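
    A typical combination (a sketch using the class names documented on this
    page) pairs an exhaustive binary constructor for discrete attributes with
    a threshold constructor for continuous ones::

        import Orange

        combined = Orange.classification.tree.SplitConstructor_Combined()
        combined.discreteSplitConstructor = \
            Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
        combined.discreteSplitConstructor.measure = \
            Orange.feature.scoring.GainRatio()
        combined.continuousSplitConstructor = \
            Orange.classification.tree.SplitConstructor_Threshold()
        combined.continuousSplitConstructor.measure = \
            Orange.feature.scoring.GainRatio()

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.split = combined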
    729  
    730  
    731 StopCriteria and StopCriteria_common 
    732 ============================================ 
    733  
    734 :obj:`StopCriteria` determines when to stop the induction of subtrees, as  
    735 described in detail in the description of the learning process. XXXXXXXXXX 
    736  
    737 .. class:: StopCriteria_common 
    738  
    739     :obj:`StopCriteria_common` contains additional criteria for pre-pruning: 
    740     it checks the proportion of majority class and the number of weighted 
    741     examples. 
    742  
    743     .. attribute:: maxMajority 
    744  
    745         Maximal proportion of majority class. When this is exceeded,  
    746         induction stops. 
    747  
    748     .. attribute:: minExamples 
    749  
    750         Minimal number of examples in internal leaves. Subsets with fewer 
    751         than :obj:`minExamples` examples are not split any further. 
    752         The example count is weighted. 
    753  
    754 .. class:: StopCriteria_Python 
    755  
    756     Undocumented. 
    757  
    758 Classes derived from ExampleSplitter 
    759 ======================================== 
    760  
    761 :obj:`ExampleSplitter` is the third crucial component of 
    762 :obj:`TreeLearnerBase`. Its function is described in  
    763 the description of the learning process. XXXXXXXXXX 
    764  
    765 .. class:: ExampleSplitter_IgnoreUnknowns 
    766  
    767     Bases: :class:`ExampleSplitter` 
    768  
    769     Simply ignores the examples for which no single branch can be determined. 
    770  
    771 .. class:: ExampleSplitter_UnknownsToCommon 
    772  
    773     Bases: :class:`ExampleSplitter` 
    774  
    775     Places all such examples into the branch with the highest number of 
    776     examples. If there is more than one such branch, one is selected at 
    777     random and then used for all examples. 
    778  
    779 .. class:: ExampleSplitter_UnknownsToAll 
    780  
    781     Bases: :class:`ExampleSplitter` 
    782  
    783     Places examples with unknown value of the attribute into all branches. 
    784  
    785 .. class:: ExampleSplitter_UnknownsToRandom 
    786  
    787     Bases: :class:`ExampleSplitter` 
    788  
    789     Selects a random branch for such examples. 
    790  
    791 .. class:: ExampleSplitter_UnknownsToBranch 
    792  
    793     Bases: :class:`ExampleSplitter` 
    794  
    795     Constructs an additional branch to contain all such examples.  
    796     The branch's description is "unknown". 
    797  
    798 .. class:: ExampleSplitter_UnknownsAsBranchSizes 
    799  
    800     Bases: :class:`ExampleSplitter` 
    801  
    802     Splits examples with unknown value of the attribute according to  
    803     proportions of examples in each branch. 
    804  
    805 .. class:: ExampleSplitter_UnknownsAsSelector 
    806  
    807     Bases: :class:`ExampleSplitter` 
    808  
    809     Splits examples with unknown value of the attribute according to  
    810     distribution proposed by selector (which is in most cases the same  
    811     as proportions of examples in branches). 
    812  
    813 Descender and derived classes 
    814 ================================= 
    815  
    816 This is the classification-time counterpart of :obj:`ExampleSplitter`. It  
    817 decides the destiny of examples that need to be classified and cannot 
    818 be unambiguously put in a branch. The detailed function of this class 
    819 is given in the description of classification with trees. XXXXXX 
    820  
    821 .. class:: Descender 
    822  
    823     An abstract base object for tree descenders. 
    824  
    825     .. method:: __call__(node, example) 
    826  
    827         Descends down the tree until it reaches a leaf or a node in  
    828         which a vote of subtrees is required. In both cases, a tuple  
    829         of two elements is returned; in the former, the tuple contains  
    830         the reached node and None, in the latter it  
    831         contains a node and the weights of votes for subtrees (a list of floats). 
    832  
    833         :obj:`Descender`'s that never split examples always descend 
    834         to a leaf, but they differ in the treatment of examples with 
    835         unknown values (or, in general, examples for which a branch 
    836         cannot be determined at some node(s) of the tree). 
    837         :obj:`Descender`'s that do split examples differ in returned 
    838         vote weights. 
    839  
    840 .. class:: Descender_UnknownsToNode 
    841  
    842     Bases: :obj:`Descender` 
    843  
    844     When an example cannot be classified into a single branch, the 
    845     current node is returned. Thus, the node's :obj:`nodeClassifier` 
    846     will be used to make a decision. It is your responsibility to see 
    847     that even the internal nodes have their :obj:`nodeClassifier` 
    848     (i.e., don't disable creating node classifiers or manually remove 
    849     them after the induction). 
    850  
    851 .. class:: Descender_UnknownsToBranch 
    852  
    853     Bases: :obj:`Descender` 
    854  
    855     Classifies examples with unknown value to a special branch. This 
    856     makes sense only if the tree itself was constructed with 
    857     :obj:`ExampleSplitter_UnknownsToBranch`. 
    858  
    859 .. class:: Descender_UnknownsToCommonBranch 
    860  
    861     Bases: :obj:`Descender` 
    862  
    863     Classifies examples with unknown values to the branch with the 
    864     highest number of examples. If there is more than one such branch, 
    865     a random branch is chosen for each example that is to be classified. 
    866  
    867 .. class:: Descender_UnknownsToCommonSelector 
    868  
    869     Bases: :obj:`Descender` 
    870  
    871     Classifies examples with unknown values to the branch which received  
    872     the highest recommendation by the selector. 
    873  
    874 .. class:: Descender_MergeAsBranchSizes 
    875  
    876     Bases: :obj:`Descender` 
    877  
    878     Makes the subtrees vote for the example's class; the vote is 
    879     weighted according to the sizes of the branches. 
    880  
    881 .. class:: Descender_MergeAsSelector 
    882  
    883     Bases: :obj:`Descender` 
    884  
    885     Makes the subtrees vote for the example's class; the vote is  
    886     weighted according to the selector's proposal. 
    887  
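Which descender is used at classification time is set through the learner's
:obj:`descender` component. For instance (a sketch using the names documented
on this page), a tree whose unknown values were routed to a dedicated branch
during learning should also route them there during classification::

    import Orange

    learner = Orange.classification.tree.TreeLearnerBase()
    # send examples with unknown values to a dedicated branch while learning...
    learner.splitter = Orange.classification.tree.ExampleSplitter_UnknownsToBranch()
    # ...and route such examples to that branch when classifying
    learner.descender = Orange.classification.tree.Descender_UnknownsToBranch()
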
    888 Pruner and derived classes 
    889 ============================== 
    890  
    891 .. index:: 
    892     pair: classification trees; pruning 
    893  
    894 Classes derived from :obj:`Pruner` prune the trees as 
    895 described in the section on pruning XXXXXXXX - make sure you read it  
    896 to understand what the pruners will do to your trees. 
    897  
    898 .. class:: Pruner 
    899  
    900     This is an abstract base class which defines nothing useful, only  
    901     a pure virtual call operator. 
    902  
    903     .. method:: __call__(tree) 
    904  
    905         Prunes a tree. The argument can be either a tree classifier or  
    906         a tree node; the result is of the same type as the argument. 
    907  
    908 .. class:: Pruner_SameMajority 
    909  
    910     Bases: :class:`Pruner` 
    911  
    912     In Orange, a tree can have non-trivial subtrees (i.e. subtrees  
    913     with more than one leaf) in which all the leaves have the same majority  
    914     class. (This is allowed because those leaves can still have different 
    915     distributions of classes and thus predict different probabilities.)  
    916     However, this can be undesired when we're only interested in the  
    917     class prediction or a simple tree interpretation. The  
    918     :obj:`Pruner_SameMajority` prunes the tree so that there is no 
    919     subtree in which all the nodes would have the same majority class. 
    920  
    921     This pruner will only prune the nodes in which the node classifier  
    922     is of class :obj:`orange.DefaultClassifier` (or a derived class). 
    923  
    924     Note that the leaves with more than one majority class require some  
    925     special handling. The pruning goes backwards, from leaves to the root.  
    926     When siblings are compared, the algorithm checks whether they  
    927     have (at least one) common majority class. If so, they can be pruned. 
    928  
    929 .. class:: Pruner_m 
    930  
    931     Bases: :class:`Pruner` 
    932  
    933     Prunes a tree by comparing m-estimates of static and dynamic  
    934     error as defined in (Bratko, 2002). 
    935  
    936     .. attribute:: m 
    937  
    938         Parameter m for m-estimation. 
    939  
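Both pruners are used in the same way; the original tree remains intact, as
described in the section on pruning. A sketch (assuming the names documented
on this page and that the lenses data is available)::

    import Orange

    data = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearnerBase()(data)

    # prune subtrees whose leaves all share a majority class
    pruned = Orange.classification.tree.Pruner_SameMajority()(tree)

    # m-estimate based pruning
    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    pruned_m = pruner(tree)
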
    940 ======== 
    941 Examples 
    942 ======== 
    943  
    944 This page does not provide examples for programming your own components,  
    945 such as, for instance, a :obj:`SplitConstructor`. Those examples 
    946 can be found on a page dedicated to callbacks to Python XXXXXXXX. 
    947  
    948 Tree Structure 
    949 ============== 
    950  
    951 To have something to work on, we'll take the data from the lenses dataset  
    952 and build a tree using the default components (part of `treestructure.py`_, uses `lenses.tab`_): 
    953  
    954 .. literalinclude:: code/treestructure.py 
    955    :lines: 7-10 
    956  
    957 How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)? 
    958  
    959 .. _lenses.tab: code/lenses.tab 
    960 .. _treestructure.py: code/treestructure.py 
    961  
    962 .. literalinclude:: code/treestructure.py 
    963    :lines: 12-21 
    964  
    965 If node is None, we have a null-node; null nodes don't count,  
    966 so we return 0. Otherwise, the size is 1 (this node) plus the 
    967 sizes of all subtrees. The node is an internal node if it has a  
    968 :obj:`branchSelector`; if there's no selector, it's a leaf. Don't 
    969 attempt to skip the if statement: leaves don't have an empty list  
    970 of branches, they don't have a list of branches at all. 
    971  
    972     >>> treeSize(treeClassifier.tree) 
    973     10 
    974  
    975 Don't forget that this was only an exercise - :obj:`Node` has a  
    976 built-in method :obj:`Node.treeSize` that does exactly the same. 
    977  
    978 Let us now write a simple script that prints out a tree. The recursive 
    979 part of the function will get a node and its level (part of `treestructure.py`_, uses `lenses.tab`_). 
    980  
    981 .. literalinclude:: code/treestructure.py 
    982    :lines: 26-41 
    983  
    984 Don't waste time studying the formatting tricks (\n's etc.); this is just 
    985 for nicer output. What matters is everything but the print statements. 
    986 First, we check whether the node is a null-node (a node to which no 
    987 learning examples were classified). If this is so, we just print out 
    988 "<null node>" and return. 
    989  
    990 After handling null nodes, the remaining nodes are internal nodes and leaves. 
    991 For internal nodes, we print a node description consisting of the 
    992 attribute's name and distribution of classes. The :obj:`Node`'s branch 
    993 selector is, for all currently defined splits, an instance of a 
    994 class derived from :obj:`orange.Classifier` (in fact, it is a 
    995 :obj:`orange.ClassifierFromVarFD`, but a :obj:`orange.Classifier` would  
    996 suffice), and its :obj:`classVar` XXXXX points to the attribute we seek.  
    997 So we print its name. We will also assume that storing class distributions  
    998 has not been disabled and print them as well. A more capable function for  
    999 printing trees (such as the one defined in XXXXXXXXXX) has an alternative  
    1000 means to get the distribution when this fails. Then we iterate  
    1001 through the branches; for each we print a branch description and recursively  
    1002 call :obj:`printTree0` with the level increased by 1 (to increase  
    1003 the indent). 
    1004  
    1005 Finally, if the node is a leaf, we print out the distribution of  
    1006 learning examples in the node and the class to which the examples in  
    1007 the node would be classified. We again assume that the :obj:`nodeClassifier`  
    1008 is the default one - a :obj:`DefaultClassifier`. A better print  
    1009 function should be aware of possible alternatives. 
    1010  
    1011 Now, we just need to write a simple function to call our printTree0.  
    1012 We could write something like... 
    1013  
    1014 :: 
    1015  
    1016     def printTree(x): 
    1017         printTree0(x.tree, 0) 
    1018  
    1019 ... but we won't. Let us learn how to handle arguments of different 
    1020 types. Let's write a function that will accept either a :obj:`TreeClassifier` 
    1021 or a :obj:`Node`; just like Pruners XXXXXX, remember? Part of `treestructure.py`_, uses `lenses.tab`_. 
    1022  
    1023 .. literalinclude:: code/treestructure.py 
    1024    :lines: 43-49 
    1025  
    1026 It's fairly straightforward: if :obj:`x` is of a type derived from  
    1027 :obj:`TreeClassifier`, we print :obj:`x.tree`; if it's  
    1028 a :obj:`Node` we just call :obj:`printTree0` with :obj:`x`. If it's  
    1029 of some other type, we don't know how to handle it and thus raise  
    1030 an exception. (Note that we could also use  
    1031  
    1032 :: 
    1033  
    1034     if isinstance(x, Orange.classification.tree.TreeClassifier) 
    1035  
    1036 but this would only work if :obj:`x` were of type  
    1037 :obj:`TreeClassifier` and not of any derived type. The latter,  
    1038 however, do not exist yet...) 
    1039  
    1040     >>> printTree(treeClassifier) 
    1041     tear_rate (<15.000, 5.000, 4.000>) 
    1042     : reduced --> none (<12.000, 0.000, 0.000>) 
    1043     : normal 
    1044        astigmatic (<3.000, 5.000, 4.000>) 
    1045        : no 
    1046           age (<1.000, 5.000, 0.000>) 
    1047           : young --> soft (<0.000, 2.000, 0.000>) 
    1048           : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
    1049           : presbyopic --> none (<1.000, 1.000, 0.000>) 
    1050        : yes 
    1051           prescription (<2.000, 0.000, 4.000>) 
    1052           : myope --> hard (<0.000, 0.000, 3.000>) 
    1053           : hypermetrope --> none (<2.000, 0.000, 1.000>) 
    1054  
    1055 For a final exercise, let us write a simple pruning procedure. It will  
    1056 be written entirely in Python, unrelated to any :obj:`Pruner`. Our 
    1057 procedure will limit the tree depth - the maximal depth (here defined 
    1058 as the number of internal nodes on any path down the tree) shall be 
    1059 given as an argument. For example, to get a two-level tree, we would 
    1060 call cutTree(root, 2). The function will be recursive, with the second  
    1061 argument (level) decreasing at each call; when zero, the current node  
    1062 will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_): 
    1063  
    1064 .. literalinclude:: code/treestructure.py 
    1065    :lines: 54-62 
    1066  
    1067 There's nothing to prune at null-nodes or leaves, so we act only when  
    1068 :obj:`node` and :obj:`node.branchSelector` are defined. If level is  
    1069 not zero, we call the function for each branch. Otherwise, we clear  
    1070 the selector, branches and branch descriptions. 
    1071  
    1072     >>> cutTree(tree.tree, 2) 
    1073     >>> printTree(tree) 
    1074     tear_rate (<15.000, 5.000, 4.000>) 
    1075     : reduced --> none (<12.000, 0.000, 0.000>) 
    1076     : normal 
    1077        astigmatic (<3.000, 5.000, 4.000>) 
    1078        : no --> soft (<1.000, 5.000, 0.000>) 
    1079        : yes --> hard (<2.000, 0.000, 4.000>) 
    1080  
    1081 Learning 
    1082 ======== 
    1083  
    1084 You've already seen a simple example of using a :obj:`TreeLearnerBase`. 
    1085 You can just call it and let it fill the empty slots with the default 
    1086 components. This section will teach you three things: what the 
    1087 missing components are (and how to set them yourself), 
    1088 how to use alternative components to get a different tree and, 
    1089 finally, how to write a skeleton for tree induction in Python. 
    1090  
    1091 Default components for TreeLearnerBase 
    1092 ====================================== 
    1093  
    1094 Let us construct a :obj:`TreeLearnerBase` to play with. 
    1095  
    1096 .. _treelearner.py: code/treelearner.py 
    1097  
    1098 `treelearner.py`_, uses `lenses.tab`_: 
    1099  
    1100 .. literalinclude:: code/treelearner.py 
    1101    :lines: 7-10 
    1102  
    1103 There are three crucial components in learning: the split and stop 
    1104 criteria, and the :obj:`exampleSplitter` (there are some others, 
    1105 which become important during classification; we'll talk about them 
    1106 later). They are not defined; if you use the learner, the slots are 
    1107 filled temporarily but later cleared again. 
    1108  
    1109 :: 
    1110  
    1111     >>> print learner.split 
    1112     None 
    1113     >>> learner(data) 
    1114     <TreeClassifier instance at 0x01F08760> 
    1115     >>> print learner.split 
    1116     None 
    1117  
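If you would rather fix a component yourself than rely on the temporary
defaults, simply assign it before calling the learner; a component set this
way stays in its slot. For instance (anticipating the next subsection)::

    learner.stop = Orange.classification.tree.StopCriteria_common()
    learner.stop.minExamples = 5.0
    tree = learner(data)

The remaining slots (the split constructor, the example splitter and so on)
can be filled in the same way with the classes described in the second half
of this page.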
    1118 Stopping criteria 
    1119 ================= 
    1120  
    1121 The stop is trivial. The default is set by 
    1122  
    1123 :: 
    1124     >>> learner.stop = Orange.classification.tree.StopCriteria_common() 
    1125  
    1126 Well, this is actually done in C++ and it uses a global component 
    1127 that is constructed once and for all, but apart from that we did 
    1128 effectively the same thing. 
    1129  
    1130 We can now examine the default stopping parameters. 
    1131  
    1132     >>> print learner.stop.maxMajority, learner.stop.minExamples 
    1133     1.0 0.0 
    1134  
    1135 Not very restrictive. This keeps splitting the examples until 
    1136 there's nothing left to split or all the examples are in the same 
    1137 class. Let us set the minimal subset that we allow to be split to 
    1138 five examples and see what comes out. 
    1139  
    1140     >>> learner.stop.minExamples = 5.0 
    1141     >>> tree = learner(data) 
    1142     >>> printTree(tree) 
    1143     tear_rate (<15.000, 5.000, 4.000>) 
    1144     : reduced --> none (<12.000, 0.000, 0.000>) 
    1145     : normal 
    1146        astigmatic (<3.000, 5.000, 4.000>) 
    1147        : no 
    1148           age (<1.000, 5.000, 0.000>) 
    1149           : young --> soft (<0.000, 2.000, 0.000>) 
    1150           : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
    1151           : presbyopic --> soft (<1.000, 1.000, 0.000>) 
    1152        : yes 
    1153           prescription (<2.000, 0.000, 4.000>) 
    1154           : myope --> hard (<0.000, 0.000, 3.000>) 
    1155           : hypermetrope --> none (<2.000, 0.000, 1.000>) 
    1156  
    1157 OK, that's better. If we want an even smaller tree, we can also limit 
    1158 the maximal proportion of majority class. 
    1159  
    1160     >>> learner.stop.maxMajority = 0.5 
    1161     >>> tree = learner(data) 
    1162     >>> printTree(tree) 
    1163     --> none (<15.000, 5.000, 4.000>) 
    1164  
    1165  
    1166 Undocumented 
    1167 ============ 
    1168  
    1169 .. class:: NodeList 
    1170  
    1171     Undocumented. 
    1172  
    1173 .. class:: C45NodeList 
    1174  
    1175     Undocumented. 
    1176  
    1177 =========================== 
    1178 C4.5 Classifier and Learner 
    1179 =========================== 
    1180  
    1181 As C4.5 is a standard benchmark in machine learning,  
    1182 it is incorporated in Orange, although Orange has its own 
    1183 implementation of decision trees. 
    1184  
    1185 The implementation uses Quinlan's original code for learning, so the 
    1186 tree you get is exactly like the one that would be built by standalone 
    1187 C4.5. Upon return, however, the original tree is copied to Orange 
    1188 components that contain exactly the same information plus what is needed 
    1189 to make them visible from Python. To be sure that the algorithm behaves 
    1190 just as the original, we use a dedicated class :class:`C45Node` 
    1191 instead of reusing the components used by Orange's tree inducer 
    1192 (i.e., :class:`Node`). This, however, could be done and probably 
    1193 will be done in the future; we shall still retain :class:`C45Node`  
    1194 but offer transformation to :class:`Node` so that routines 
    1195 that work on Orange trees will also be usable for C45 trees. 
    1196  
    1197 :class:`C45Learner` and :class:`C45Classifier` behave 
    1198 like any other Orange learner and classifier. Unlike most Orange 
    1199 learning algorithms, C4.5 does not accept weighted examples. 
    1200  
    1201 Building the C4.5 plug-in 
    1202 ========================= 
    1203  
    1204 We haven't been able to obtain the legal rights to distribute 
    1205 C4.5 and therefore couldn't statically link it into Orange. Instead, 
    1206 it's incorporated as a plug-in which you'll need to build yourself. 
    1207 The procedure is trivial, but you'll need a C compiler. On Windows, 
    1208 the scripts we provide work with MS Visual C and the files CL.EXE 
    1209 and LINK.EXE must be on the PATH. On Linux you're equipped with 
    1210 gcc. Mac OS X comes without gcc, but you can download it for 
    1211 free from Apple. 
    1212  
    1213 Orange must be installed prior to building C4.5. (This is because 
    1214 the build script will copy the created file next to Orange, 
    1215 which it obviously can't if Orange isn't there yet.) 
    1216  
    1217 #. Download the  
    1218    `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_ 
    1219    from the `RuleQuest site <http://www.rulequest.com/>`_ and extract 
    1220    them into some temporary directory. The files will be modified during 
    1221    the build, so don't use a copy of Quinlan's sources that you still 
    1222    need for another purpose. 
    1223 #. Download  
    1224    `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_  
    1225    and unzip its contents into the R8/Src directory of Quinlan's 
    1226    sources (it's the directory that contains, for instance, the file 
    1227    average.c). 
    1228 #. Run buildC45.py, which will build the plug-in and put it next to  
    1229    orange.pyd (or orange.so on Linux/Mac). 
    1230 #. Run python, type :samp:`import Orange` and  
    1231    create :samp:`Orange.classification.tree.C45Learner()`. 
    1232    If this fails, something went wrong; see the diagnostic messages from 
    1233    buildC45.py and read the paragraph below. 
    1234 #. Finally, you can remove Quinlan's sources, along with everything 
    1235    created by buildC45.py. 
    1236  
    1237 If the process fails, here's what buildC45.py really does: it creates 
    1238 .h files that wrap Quinlan's .i files and ensure that they are not 
    1239 included twice. It modifies C4.5 sources to include .h's instead of 
    1240 .i's. This step can hardly fail. Then follows the platform-dependent 
    1241 step which compiles ensemble.c (which includes all of Quinlan's .c 
    1242 files it needs) into c45.dll or c45.so and puts it next to Orange. 
    1243 If this fails, but you do have a C compiler and linker and you know 
    1244 how to use them, you can compile ensemble.c into a dynamic 
    1245 library yourself; the compile and link steps in buildC45.py may serve 
    1246 as a reference. After doing this, check that the built C4.5 
    1247 gives the same results as the original. 
    1248  
    1249 .. class:: C45Learner 
    1250  
    1251     :class:`C45Learner`'s attributes have double names - those that 
    1252     you know from C4.5 command lines and the corresponding names of C4.5's 
    1253     internal variables. All defaults are set as in C4.5; if you change 
    1254     nothing, you are running C4.5. 
    1255  
    1256     .. attribute:: gainRatio (g) 
    1257          
    1258         Determines whether to use information gain (false, default) 
    1259         or gain ratio for selection of attributes (true). 
    1260  
    1261     .. attribute:: batch (b) 
    1262  
    1263         Turn on batch mode (no windows, no iterations); this option is 
    1264         not documented in C4.5 manuals. It conflicts with "window", 
    1265         "increment" and "trials". 
    1266  
    1267     .. attribute:: subset (s) 
    1268          
    1269         Enables subsetting (default: false, no subsetting). 
    1270   
    1271     .. attribute:: probThresh (p) 
    1272  
    1273         Probabilistic threshold for continuous attributes (default: false). 
    1274  
    1275     .. attribute:: minObjs (m) 
    1276          
    1277         Minimal number of objects (examples) in leaves (default: 2). 
    1278  
    1279     .. attribute:: window (w) 
    1280  
    1281         Initial window size (default: maximum of 20% and twice the 
    1282         square root of the number of data objects). 
    1283  
    1284     .. attribute:: increment (i) 
    1285  
    1286         The maximum number of objects that can be added to the window 
    1287         at each iteration (default: 20% of the initial window size). 
    1288  
    1289     .. attribute:: cf (c) 
    1290  
    1291         Pruning confidence level (default: 25%). 
    1292  
    1293     .. attribute:: trials (t) 
    1294  
    1295         Set the number of trials in iterative (i.e. non-batch) mode (default: 10). 
    1296  
    1297     .. attribute:: prune 
    1298          
    1299         Return pruned tree (not an original C4.5 option) (default: true) 
    1300  
    1301  
    1302 :class:`C45Learner` also offers another way of setting 
    1303 the arguments: its method :obj:`C45Learner.commandLine` 
    1304 is given a string and parses it the same way as C4.5 would 
    1305 parse its command line. XXXXXXXXXXX 
    1306  
    1307 .. class:: C45Classifier 
    1308  
    1309     A faithful reimplementation of Quinlan's function from C4.5. The only 
    1310     difference (and the only reason it's been rewritten) is that it uses 
    1311     a tree composed of :class:`C45Node` instead of C4.5's 
    1312     original tree structure. 
    1313  
    1314     .. attribute:: tree 
    1315  
    1316         C4.5 tree stored as a tree of :obj:`C45Node`. 
    1317  
    1318  
    1319 .. class:: C45Node 
    1320  
    1321     This class is a reimplementation of the corresponding *struct* from 
    1322     Quinlan's C4.5 code. 
    1323  
    1324     .. attribute:: nodeType 
    1325  
    1326         Type of the node:  :obj:`C45Node.Leaf` (0),  
    1327         :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2), 
    1328         :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches" 
    1329         split examples based on values of a discrete attribute, 
    1330         "cuts" cut them according to a threshold value of a continuous 
      1331        attribute and "subsets" use discrete attributes but with subsetting 
    1332         so that several values can go into the same branch. 
    1333  
    1334     .. attribute:: leaf 
    1335  
    1336         Value returned by that leaf. The field is defined for internal  
    1337         nodes as well. 
    1338  
    1339     .. attribute:: items 
    1340  
    1341         Number of (learning) examples in the node. 
    1342  
    1343     .. attribute:: classDist 
    1344  
    1345         Class distribution for the node (of type  
    1346         :obj:`orange.DiscDistribution`). 
    1347  
    1348     .. attribute:: tested 
    1349          
    1350         The attribute used in the node's test. If node is a leaf, 
      1351         :obj:`tested` is None; if node is of type :obj:`Branch` or :obj:`Subset`, 
      1352         :obj:`tested` is a discrete attribute, and if node is of type 
      1353         :obj:`Cut`, then :obj:`tested` is a continuous attribute. 
    1354  
    1355     .. attribute:: cut 
    1356  
    1357         A threshold for continuous attributes, if node is of type :obj:`Cut`. 
    1358         Undefined otherwise. 
    1359  
    1360     .. attribute:: mapping 
    1361  
    1362         Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]` 
      1363        gives the branch index for an example whose value of :obj:`tested` is *i*. 
    1364         Here, *i* denotes an index of value, not a :class:`Orange.data.Value`. 
    1365  
    1366     .. attribute:: branch 
    1367          
    1368         A list of branches stemming from this node. 
    1369  
    1370 Examples 
    1371 ======== 
    1372  
    1373 .. _tree_c45.py: code/tree_c45.py 
    1374 .. _iris.tab: code/iris.tab 
    1375  
    1376 The simplest way to use :class:`C45Learner` is to call it. This 
    1377 script constructs the same learner as you would get by calling 
    1378 the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_): 
    1379  
    1380 .. literalinclude:: code/tree_c45.py 
    1381    :lines: 7-14 
    1382  
      1383 Arguments can be set by the usual mechanism (the two lines below do the 
      1384 same, except that one uses command-line symbols and the other the internal 
      1385 variable names) 
    1386  
    1387 :: 
    1388  
    1389     tree = Orange.classification.tree.C45Learner(data, m=100) 
    1390     tree = Orange.classification.tree.C45Learner(data, minObjs=100) 
    1391  
      1392 Veteran C4.5 users might prefer to set the arguments through the 
      1393 method :obj:`C45Learner.commandline`. 
    1394  
    1395 :: 
    1396  
      1397     lrn = Orange.classification.tree.C45Learner() 
    1398     lrn.commandline("-m 1 -s") 
    1399     tree = lrn(data) 
    1400  
      1401 There's nothing special about using :obj:`C45Classifier` - it's 
      1402 just like any other classifier. To demonstrate what the structure of 
      1403 :class:`C45Node` looks like, we will show a script that prints 
      1404 it out in the same format as C4.5 does. 
    1405  
    1406 .. literalinclude:: code/tree_c45_printtree.py 
    1407  
    1408 Leaves are the simplest. We just print out the value contained 
      1409 in :samp:`node.leaf`. Since this is not a qualified value (i.e., 
    1410 :obj:`C45Node` does not know to which attribute it belongs), we need to 
    1411 convert it to a string through :obj:`classVar`, which is passed as an 
    1412 extra argument to the recursive part of printTree. 
    1413  
    1414 For discrete splits without subsetting, we print out all attribute values 
    1415 and recursively call the function for all branches. Continuous splits are 
    1416 equally easy to handle. 
    1417  
      1418 For discrete splits with subsetting, we iterate through the branches, retrieve 
      1419 the corresponding values that go into each branch, turn 
      1420 the values into strings and print them out, separately treating the 
      1421 case where only a single value goes into the branch. 
    1422  
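A condensed sketch of that recursive function for the three simpler node
types (the full version, including the subsetting case, is in the script
included above; attribute access follows the :class:`C45Node` description
above)::

    def printTree0(node, classvar, lev):
        var = node.tested
        if node.nodeType == 0:        # a leaf
            print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
        elif node.nodeType == 1:      # a branch for each value of a discrete attribute
            for i, branch in enumerate(node.branch):
                print "\n" + "|   " * lev + "%s = %s:" % (var.name, var.values[i]),
                printTree0(branch, classvar, lev + 1)
        elif node.nodeType == 2:      # a cut on a continuous attribute
            print "\n" + "|   " * lev + "%s <= %.1f:" % (var.name, node.cut),
            printTree0(node.branch[0], classvar, lev + 1)
            print "\n" + "|   " * lev + "%s > %.1f:" % (var.name, node.cut),
            printTree0(node.branch[1], classvar, lev + 1)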
    1423 Printing out C45 Tree 
    1424 ===================== 
    1425  
    1426 .. autofunction:: printTreeC45 
    1427  
    1428 ======================= 
    1429 orngTree module XXXXXXX 
    1430 ======================= 
     12To build a small tree (:obj:`TreeClassifier`) from the iris data set  
     13(with the depth limited to three levels), use  
     14(part of `orngTree1.py`_, uses `iris.tab`_): 
     15 
     16.. literalinclude:: code/orngTree1.py 
     17   :lines: 1-4 
     18 
     19.. _orngTree1.py: code/orngTree1.py 
     20 
     21 
     22This page first describes the learner and the classifier, and then 
     23defines the base classes (individual components) of the trees 
     24and the tree-building process. 
    143125 
    143226.. autoclass:: TreeLearner 
     
    143529.. autoclass:: TreeClassifier 
    143630    :members: 
     31 
    143732 
    143833For a bit more complex example, here's how to write your own stop 
     
    144439considers only the random function as the stopping criterion. Note that in 
    144540the second case the lambda function still has three parameters, since this is 
    1446 a necessary number of parameters for the stop function XXXXX link ( 
    1447 part of `tree3.py`_ (uses  `iris.tab`_): 
     41a necessary number of parameters for the stop function (:obj:`StopCriteria`). 
     42Part of `tree3.py`_ (uses  `iris.tab`_): 
    144843 
    144944.. _tree3.py: code/tree3.py 
     
    145550big. 
    145651 
     52================= 
    145753Printing the Tree 
    145854================= 
     
    1568164 
    1569165.. literalinclude:: code/orngTree1.py 
    1570    :lines: 0-4 
     166   :lines: 1-4 
    1571167 
    1572168.. _orngTree1.py: code/orngTree1.py 
     
    1575171argument:: 
    1576172 
    1577     >>> Orange.classification.tree.printTree(tree) 
     173    >>> print tree.dump() 
    1578174    petal width<0.800: Iris-setosa (100.00%) 
    1579175    petal width>=0.800 
     
    1589185in the node:: 
    1590186 
    1591     >>> Orange.classification.tree.printTree(tree, leafStr="%V (%M out of %N)") 
     187    >>> print tree.dump(leafStr="%V (%M out of %N)") 
    1592188    petal width<0.800: Iris-setosa (50.000 out of 50.000) 
    1593189    petal width>=0.800 
     
    1603199it with this:: 
    1604200 
    1605     >>> orng.printTree("%V (%^MbA%, %^MbP%)") 
     201    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)") 
    1606202    petal width<0.800: Iris-setosa (100%, 100%) 
    1607203    petal width>=0.800 
     
    1624220And now for the output: all examples of setosa fall into the first node. 
    1625221For versicolor, we have 98% in one node; the rest is certainly 
    1626 not in the neighbouring node (petal length&gt;=5.350) since all 
     222not in the neighbouring node (petal length>=5.350) since all 
    1627223versicolors from the node petal width<1.750 went to petal length<5.350 
    1628224(we know this from the 100% in that line). Virginica is the  
     
    1684280:: 
    1685281 
    1686     Orange.classification.tree.printTree(tree, leafStr="%V", nodeStr=".") 
     282    tree.dump(leafStr="%V", nodeStr=".") 
    1687283     
    1688284says that the nodeStr should be the same as leafStr (not very useful  
     
    1708304of virginicas decreases down the tree:: 
    1709305 
    1710     Orange.classification.tree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.') 
     306    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.') 
    1711307 
    1712308Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is  
     
    1730326    |    |    |    petal length>=4.850: 86.0% (95.6%) 
    1731327 
    1732 See what's in the parentheses in the root node? If :func:`printTree` 
     328See what's in the parentheses in the root node? If :meth:`TreeClassifier.dump` 
    1733329cannot compute something (in this case it's because the root has no parent), 
    1734330it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it 
     
    1743339:: 
    1744340 
    1745     >>>Orange.classification.tree.printTree(tree, leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"') 
     341    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"') 
    1746342    root: [50.000, 50.000, 50.000] . . 
    1747343    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]: 
     
    1761357To explore the possibilities when printing regression trees, we are going  
    1762358to induce a tree from the housing data set. Called with the tree as the 
    1763 only argument, :func:`printTree` prints the tree like this:: 
     359only argument, :meth:`TreeClassifier.dump` prints the tree like this:: 
    1764360 
    1765361    RM<6.941 
     
    178137790% confidence intervals in the leaves:: 
    1782378 
    1783     >>> Orange.classification.tree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]") 
     379    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]") 
    1784380    root: [SE: 0.409] 
    1785381    |    RM<6.941: [SE: 0.306] 
     
    1821417this number with values in the parent nodes:: 
    1822418 
    1823     >>> Orange.classification.tree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".") 
     419    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".") 
    1824420    root: 277.000 (.) 
    1825421    |    RM<6.941: 273.000 (1.160) 
     
    1848444:: 
    1849445 
    1850     >>> Orange.classification.tree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".") 
     446    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".") 
    1851447 
    1852448OK, let's observe the format string for one last time. :samp:`%c![20, 22]` 
     
    1877473==================================== 
    1878474 
    1879 :func:`dumpTree`'s argument :obj:`userFormats` can be used to print out 
     475:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to print out 
    1880476some other information in the leaves or nodes. If provided, 
    1881477:obj:`userFormats` should contain a list of tuples with a regular expression 
     
    1975571=========================== 
    1976572 
    1977 .. autofunction:: printDot 
    1978  
    1979 .. autofunction:: dotTree 
    1980  
    1981573Suppose you saved the tree in a file "tree5.dot". You can then 
    1982574print it out as a gif if you execute the following command line 
     
    1988580GraphViz's dot has quite a few other output formats, check  
    1989581its documentation to learn which. 
     582 
     583 
     584 
     585Classification trees are represented as a tree-like hierarchy of 
     586:obj:`Node` classes. 
     587 
     588.. class:: Node 
     589 
     590    Node stores information about the learning examples belonging  
     591    to the node, a branch selector, a list of branches (if the node is  
     592    not a leaf) with their descriptions and strengths, and a classifier. 
     593 
     594    .. attribute:: distribution 
     595     
     596        Stores a distribution for learning examples belonging to the node. 
     597        Storing distributions can be disabled by setting the  
     598        :obj:`TreeLearnerBase`'s storeDistributions flag to false. 
     599 
     600    .. attribute:: contingency 
     601 
     602        Stores complete contingency matrices for the learning examples  
     603        belonging to the node. Storing contingencies can be enabled by  
     604        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`  
     605        flag to true. Note that even when the flag is not  
     606        set, the contingencies get computed and stored to  
     607        :obj:`Node`, but are removed shortly afterwards.  
     608        The details are given in the  
     609        description of the :obj:`TreeLearnerBase` object. 
     610 
     611    .. attribute:: examples, weightID 
     612 
     613        Store a set of learning examples for the node and the 
      614        corresponding ID of the weight meta attribute. The root of the 
      615        tree stores a "master table" of examples, while other nodes' 
      616        :obj:`Orange.data.Table` contain references to examples in 
     617        the root's :obj:`Orange.data.Table`. Examples are only stored 
     618        if a corresponding flag (:obj:`storeExamples`) has been 
     619        set while building the tree; to conserve the space, storing 
     620        is disabled by default. 
     621 
     622    .. attribute:: nodeClassifier 
     623 
     624        A classifier (usually, but not necessarily, a 
     625        :obj:`DefaultClassifier`) that can be used to classify 
     626        examples coming to the node. If the node is a leaf, this is 
     627        used to decide the final class (or class distribution) of an 
     628        example. If it's an internal node, it is stored if 
     629        :obj:`Node`'s flag :obj:`storeNodeClassifier` 
     630        is set. Since the :obj:`nodeClassifier` is needed by 
     631        :obj:`Descender` and for pruning (see far below), 
     632        this is the default behaviour; space consumption of the default 
     633        :obj:`DefaultClassifier` is rather small. You should 
     634        never disable this if you intend to prune the tree later. 
     635 
     636    If the node is a leaf, the remaining fields are None.  
     637    If it's an internal node, there are several additional fields. 
     638 
     639    .. attribute:: branches 
     640 
     641        Stores a list of subtrees, given as :obj:`Node`. 
     642        An element can be None; in this case the node is empty. 
     643 
     644    .. attribute:: branchDescriptions 
     645 
     646        A list with string descriptions for branches, constructed by 
     647        :obj:`SplitConstructor`. It can contain different kinds 
     648        of descriptions, but basically, expect things like 'red' or '>12.3'. 
     649 
     650    .. attribute:: branchSizes 
     651 
     652        Gives a (weighted) number of training examples that went into 
     653        each branch. This can be used later, for instance, for 
     654        modeling probabilities when classifying examples with 
     655        unknown values. 
     656 
     657    .. attribute:: branchSelector 
     658 
     659        Gives a branch for each example. The same object is used during 
     660        learning and classifying. The :obj:`branchSelector` is of 
     661        type :obj:`orange.Classifier`, since its job is similar to that 
     662        of a classifier: it gets an example and returns discrete 
     663        :obj:`Orange.data.Value` in range :samp:`[0, len(branches)-1]`. 
     664        When an example cannot be classified to any branch, the selector 
     665        can return a :obj:`Orange.data.Value` containing a special value 
     666        (sVal) which should be a discrete distribution 
     667        (DiscDistribution). This should represent a 
     668        :obj:`branchSelector`'s opinion of how to divide the 
     669        example between the branches. Whether the proposition will be 
     670        used or not depends upon the chosen :obj:`ExampleSplitter` 
     671        (when learning) or :obj:`Descender` (when classifying). 
     672 
     673    The lists :obj:`branches`, :obj:`branchDescriptions` and 
     674    :obj:`branchSizes` are of the same length; all of them are 
     675    defined if the node is internal and none if it is a leaf. 
     676 
     677    .. method:: treeSize() 
     678         
     679        Return the number of nodes in the subtrees (including the 
     680        node, excluding null-nodes). 
     681 
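All of these fields are accessible from Python, so simple tree statistics can
be computed by walking the structure yourself. For instance, counting the
leaves (a small sketch using only the attributes above; null branches count
as zero)::

    def countLeaves(node):
        if not node:
            return 0          # a null node
        if not node.branches:
            return 1          # a leaf
        return sum(countLeaves(branch) for branch in node.branches)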
     682============== 
     683Classification 
     684============== 
     685 
     686.. class:: _TreeClassifier 
     687 
     688    Classifies examples according to a tree stored in :obj:`tree`. 
     689 
     690    .. attribute:: tree 
     691 
     692        The root of the tree, represented as a :class:`Node`. 
     693     
     694    Classification would be straightforward if there were no unknown  
     695    values or, in general, examples that cannot be placed into a  
     696    single branch. The response in such cases is determined by a 
     697    component :obj:`descender`. 
     698 
     699    :obj:`Descender` is an abstract object which is given an example 
     700    and whose basic job is to descend as far down the tree as possible, 
     701    according to the values of example's attributes. The 
      702    :obj:`Descender` calls the node's :obj:`branchSelector` to get 
     703    the branch index. If it's a simple index, the corresponding branch  
     704    is followed. If not, it's up to descender to decide what to do, and 
     705    that's where descenders differ. A :obj:`descender` can choose  
     706    a single branch (for instance, the one that is the most recommended  
     707    by the :obj:`branchSelector`) or it can let the branches vote. 
     708 
     709    In general there are three possible outcomes of a descent. 
     710 
     711    #. Descender reaches a leaf. This happens when nothing went wrong  
     712       (there are no unknown or out-of-range values in the example) or  
     713       when things went wrong, but the descender smoothed them by  
      714       selecting a single branch and continued the descent. In this 
     715       case, the descender returns the reached :obj:`Node`. 
     716    #. :obj:`branchSelector` returned a distribution and the  
      717       :obj:`Descender` decided to stop the descent at this 
     718       (internal) node.  Again, descender returns the current  
     719       :obj:`Node` and nothing else. 
     720    #. :obj:`branchSelector` returned a distribution and the  
      721       :obj:`Descender` wants to split the example (i.e., to decide the 
     722       class by voting).  
     723 
     724    It returns a :obj:`Node` and the vote-weights for the branches.  
     725    The weights can correspond to the distribution returned by 
     726    :obj:`branchSelector`, to the number of learning examples that 
     727    were assigned to each branch, or to something else. 
     728 
     729    :obj:`TreeClassifier` uses the descender to descend from the root.  
     730    If it returns only a :obj:`Node` and no distribution, the  
      731    descent should stop; it does not matter whether it's a leaf (the 
     732    first case above) or an internal node (the second case). The node's 
     733    :obj:`nodeClassifier` is used to decide the class. If the descender 
     734    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier` 
     735    recursively calls itself for each of the subtrees and the  
     736    predictions are weighted as requested by the descender. 
     737 
     738    When voting, subtrees do not predict the class but probabilities  
     739    of classes. The predictions are multiplied by weights, summed and  
     740    the most probable class is returned. 
     741 
     742    .. method:: vote() 
     743 
     744    .. method:: classDistribution() 
     745 
     746 
     747The rest of this section is only for those interested in the C++ code. 
     748====================================================================== 
     749 
     750If you'd like to understand how the classification works in C++,  
     751start reading at :obj:`TTreeClassifier::vote`. It gets a  
      752:obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of 
     753vote weights. For each node, it calls the  
     754:obj:`TTreeClassifier::classDistribution` and then multiplies  
     755and sums the distribution. :obj:`vote` returns a normalized  
     756distribution of predictions. 
     757 
     758A new overload of :obj:`TTreeClassifier::classDistribution` gets 
     759an additional parameter, a :obj:`Node`. This is done  
     760for the sake of recursion. The normal version of  
     761:obj:`classDistribution` simply calls the overloaded with a  
     762tree root as an additional parameter. :obj:`classDistribution`  
     763uses :obj:`descender`. If descender reaches a leaf, it calls  
     764:obj:`nodeClassifier`, otherwise it calls :obj:`vote`. 
     765 
     766Thus, the :obj:`TreeClassifier`'s :obj:`vote` and  
     767:obj:`classDistribution` are written in a form of double  
     768recursion. The recursive calls do not happen at each node of the  
     769tree but only at nodes where a vote is needed (that is, at nodes  
     770where the descender halts). 
     771 
      772For predicting a class, :obj:`operator()` calls the 
     773descender. If it reaches a leaf, the class is predicted by the  
     774leaf's :obj:`nodeClassifier`. Otherwise, it calls  
      775:obj:`vote`. From now on, :obj:`vote` and 
     776:obj:`classDistribution` interweave down the tree and return  
     777a distribution of predictions. :obj:`operator()` then simply  
     778chooses the most probable class. 
     779 
     780======== 
     781Learning 
     782======== 
     783 
     784The main learning object is :obj:`TreeLearnerBase`. It is basically  
     785a skeleton into which the user must plug the components for particular  
     786functions. For easier use, defaults are provided. 
     787 
     788Components that govern the structure of the tree are :obj:`split` 
      789(of type :obj:`SplitConstructor`), :obj:`stop` (of 
      790type :obj:`StopCriteria`) and :obj:`exampleSplitter` 
     791(of type :obj:`ExampleSplitter`). 
     792 
     793 
     794.. class:: TreeLearnerBase 
     795 
     796    TreeLearnerBase has a number of components. 
     797 
     798    .. attribute:: split 
     799 
     800        Object of type :obj:`SplitConstructor`. Default value,  
     801        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined` 
     802        with separate constructors for discrete and continuous attributes.  
      803        Discrete attributes are used as they are, while continuous 
      804        attributes are binarized. Gain ratio is used to select attributes. 
      805        A minimum of two examples in a leaf is required for discrete 
      806        and five examples in a leaf for continuous attributes. 
     807     
     808    .. attribute:: stop 
     809 
     810        Object of type :obj:`StopCriteria`. The default stopping 
     811        criterion stops induction when all examples in a node belong  
     812        to the same class. 
     813 
     814    .. attribute:: splitter 
     815 
     816        Object of type :obj:`ExampleSplitter`. The default splitter 
     817        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits 
     818        the learning examples according to distributions given by the 
     819        selector. 
     820 
     821    .. attribute:: contingencyComputer 
     822     
     823        By default, this slot is left empty and ordinary contingency 
     824        matrices are computed for examples at each node. If need arises, 
     825        one can change the way the matrices are computed. This can be 
     826        used to change the way that unknown values are treated when 
     827        assessing qualities of attributes. As mentioned earlier, 
     828        the computed matrices can be used by split constructor and by 
     829        stopping criteria. On the other hand, they can be (and are) 
     830        ignored by some splitting constructors. 
     831 
     832    .. attribute:: nodeLearner 
     833 
     834        Induces a classifier from examples belonging to a node. The 
     835        same learner is used for internal nodes and for leaves. The 
     836        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`. 
     837 
     838    .. attribute:: descender 
     839 
      840        Descending component that the induced :obj:`TreeClassifier` 
     841        will use. Default descender is  
     842        :obj:`Descender_UnknownMergeAsSelector` which votes using  
     843        the :obj:`branchSelector`'s distribution for vote weights. 
     844 
     845    .. attribute:: maxDepth 
     846 
     847        Gives maximal tree depth; 0 means that only root is generated.  
     848        The default is 100 to prevent any infinite tree induction due 
      849        to misconfigured stopping criteria. If you are sure you need 
     850        larger trees, increase it. If you, on the other hand, want 
     851        to lower this hard limit, you can do so as well. 
     852 
     853    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier 
     854 
     855        Decides whether to store class distributions, contingencies  
     856        and examples in :obj:`Node`, and whether the  
      857        :obj:`nodeClassifier` should be built for internal nodes. 
     858        By default, distributions and node classifiers are stored,  
     859        while contingencies and examples are not. You won't save any  
     860        memory by not storing distributions but storing contingencies, 
      861        since the distribution actually points to the same distribution 
     862        that is stored in :obj:`contingency.classes`. 
     863 
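    As an illustration, a learner that keeps the examples in the nodes and
    limits the tree depth could be set up like this (a sketch)::

        import Orange
        data = Orange.data.Table("lenses")
        learner = Orange.classification.tree.TreeLearnerBase()
        learner.storeExamples = True
        learner.maxDepth = 3
        tree = learner(data)
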
     864    The :obj:`TreeLearnerBase` first sets the defaults for missing 
     865    components. Although stored in the actual :obj:`TreeLearnerBase`'s 
     866    fields, they are removed when the induction is finished. 
     867 
     868    Then it ensures that examples are stored in a table. This is needed 
     869    because the algorithm juggles with pointers to examples. If 
     870    examples are in a file or are fed through a filter, they are copied 
     871    to a table. Even if they are already in a table, they are copied 
      872    if :obj:`storeExamples` is set. This is to ensure that pointers 
     873    remain pointing to examples even if the user later changes the 
     874    example table. If they are in the table and the :obj:`storeExamples` 
     875    flag is clear, we just use them as they are. This will obviously 
     876    crash in a multi-threaded system if one changes the table during 
     877    the tree induction. Well... don't do it. 
     878 
     879    Apriori class probabilities are computed. At this point we check 
     880    the sum of example weights; if it's zero, there are no examples and  
     881    we cannot proceed. A list of candidate attributes is set; in the 
     882    beginning, all attributes are candidates for the split criterion. 
     883 
     884    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments  
     885    are a set of examples, a weight meta-attribute ID (a tricky thing, 
     886    it can be always the same as the original or can change to  
      887    accommodate splitting of examples among branches), apriori class 
     888    distribution and a list of candidates (represented as a vector 
     889    of Boolean values). 
     890 
     891    The contingency matrix is computed next. This happens 
     892    even if the flag :obj:`storeContingencies` is false. 
     893    If the :obj:`contingencyComputer` is given we use it, 
     894    otherwise we construct just an ordinary contingency matrix. 
     895 
      896    A :obj:`stop` is called to see whether it is worth continuing. If 
     897    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is  
     898    returned. Otherwise, a :obj:`nodeClassifier` is only built if  
      899    the :obj:`forceNodeClassifier` flag is set. 
     900 
     901    To get a :obj:`Node`'s :obj:`nodeClassifier`, the  
     902    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with  
     903    the given examples, weight ID and the just computed matrix. If  
     904    the learner can use the matrix (and the default,  
     905    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus, 
     906    a choice of :obj:`contingencyComputer` will, in many cases,  
     907    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can 
     908    return no classifier; if so and if the classifier would be  
     909    needed for classification, the :obj:`TreeClassifier`'s function 
     910    returns DK or an empty distribution. If you're writing your own 
     911    tree classifier - pay attention. 
     912 
     913    If the induction is to continue, a :obj:`split` component is called.  
     914    If it fails to return a branch selector, induction stops and the  
     915    :obj:`Node` is returned. 
     916 
      917    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide 
     918    the examples as described above. 
     919 
     920    The contingency gets removed at this point if it is not to be  
     921    stored. Thus, the :obj:`split`, :obj:`stop` and  
      922    :obj:`exampleSplitter` can use the contingency matrices if they wish. 
     923 
     924    The :obj:`TreeLearnerBase` then recursively calls itself for each of  
      925    the non-empty subsets. If the splitter returns a list of weights, 
     926    a corresponding weight is used for each branch. Besides, the  
     927    attribute spent by the splitter (if any) is removed from the  
     928    list of candidates for the subtree. 
     929 
     930    A subset of examples is stored in its corresponding tree node,  
     931    if so requested. If not, the new weight attributes are removed (if  
     932    any were created). 
     933 
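The recursion just described can be mimicked in pure Python. The sketch below
is heavily simplified: it ignores contingencies, candidate lists, the weights
possibly returned by the splitter and the store* flags, it assumes that
:obj:`Node` can be constructed directly from Python, and the component call
signatures are the ones given in the class descriptions later on this page.
Treat it as an illustration, not as a replacement for :obj:`TreeLearnerBase`::

    def induceTree(examples, weightID, split, stop, splitter, nodeLearner):
        node = Orange.classification.tree.Node()
        # build a node classifier right away (cheap for MajorityLearner)
        node.nodeClassifier = nodeLearner(examples, weightID)
        if stop(examples, weightID):
            return node                  # this node becomes a leaf
        selector, descriptions, sizes, quality, spent = split(examples, weightID)
        if not selector:
            return node                  # the split constructor vetoed further induction
        node.branchSelector = selector
        node.branchDescriptions = descriptions
        node.branchSizes = sizes
        # sort the examples into branches; new weight IDs, if any, are ignored
        subsets = splitter(node, examples, weightID)[0]
        node.branches = [induceTree(subset, weightID, split, stop,
                                    splitter, nodeLearner) if subset else None
                         for subset in subsets]
        return node
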
     934Pruning 
     935======= 
     936 
     937Tree pruners derived from :obj:`Pruner` can be given either a 
     938:obj:`Node` (presumably, but not necessarily a root) or a 
     939:obj:`TreeClassifier`. The result is a new, pruned :obj:`Node` 
     940or a new :obj:`TreeClassifier` with a pruned tree. The original 
     941tree remains intact. 
     942 
     943Note however that pruners construct only a shallow copy of a tree. 
      944The pruned tree's :obj:`Node` objects contain references to the same 
     945contingency matrices, node classifiers, branch selectors, ... 
     946as the original tree. Thus, you may modify a pruned tree structure 
     947(manually cut it, add new nodes, replace components) but modifying, 
     948for instance, some node's :obj:`nodeClassifier` (a 
     949:obj:`nodeClassifier` itself, not a reference to it!) would modify 
     950the node's :obj:`nodeClassifier` in the corresponding node of 
     951the original tree. 
     952 
     953Talking about node classifiers - pruners cannot construct a 
      954:obj:`nodeClassifier` nor merge the node classifiers of the pruned 
     955subtrees into classifiers for new leaves. Thus, if you want to build 
     956a prunable tree, internal nodes must have their :obj:`nodeClassifier` 
     957defined. Fortunately, all you need to do is nothing; if you leave 
     958the :obj:`TreeLearnerBase`'s flags as they are by default, the 
      959node classifiers are created. 
     960 
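For example, pruning with the minimal-error pruner might look like this (a
sketch; that the m-estimate pruner is exposed as :obj:`Pruner_m` with an
:obj:`m` attribute, as in the old orngTree interface, is an assumption)::

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    prunedTree = pruner(tree)   # tree is a TreeClassifier; a new, pruned one is returned
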
     961======= 
     962Classes 
     963======= 
     964 
     965Several classes described above are already functional and can 
     966(and mostly will) be used as they are. Those classes are :obj:`Node`, 
      967:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes 
     968the other classes. 
     969 
     970Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,  
     971:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner` 
     972and :obj:`Classifier` are among the Orange classes that can be subtyped  
      973in Python and have the call operator overloaded in such a way that it 
      974can be called back from C++ code. You can thus program your own components 
      975for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed 
      976information on how this is done and what can go wrong is given in a 
     977separate page, dedicated to callbacks to Python XXXXXXXXXX. 
     978 
     979SplitConstructors 
     980===================== 
     981 
     982Split construction is almost as exciting as waiting for a delayed flight. 
     983Boring, that is. Split constructors are programmed as spaghetti code 
     984that juggles with contingency matrices, with separate cases for discrete 
     985and continuous classes... Most split constructors work either for 
     986discrete or for continuous attributes. The suggested practice is 
     987to use a :obj:`SplitConstructor_Combined` that can handle 
     988both by simply delegating attributes to specialized split constructors. 
     989 
     990Note: split constructors that cannot handle attributes of particular 
     991type (discrete, continuous) do not report an error or a warning but 
     992simply skip the attribute. It is your responsibility to use a correct 
     993split constructor for your dataset. (May we again suggest 
     994using :obj:`SplitConstructor_Combined`?) 
     995 
      996The same components can be used for inducing both classification and 
      997regression trees. The only component that needs to be chosen accordingly 
     998is the 'measure' attribute for the :obj:`SplitConstructor_Measure` 
     999class (and derived classes). 
     1000 
     1001.. class:: SplitConstructor 
     1002 
      1003    Finds a suitable criterion for dividing the learning (and later testing) 
     1004    examples coming to the node. The data it gets is a set of examples 
     1005    (and, optionally, an ID of weight meta-attribute), a domain 
     1006    contingency computed from examples, apriori class probabilities, 
     1007    a list of candidate attributes it should consider and a node 
     1008    classifier (if it was constructed, that is, if  
     1009    :obj:`storeNodeClassifier` is left true). 
     1010 
     1011    The :obj:`SplitConstructor` should use the domain contingency 
     1012    when possible. The reasons are two-fold; one is that it's faster 
     1013    and the other is that the contingency matrices are not necessarily 
     1014    constructed by simply counting the examples. Why and how is 
     1015    explained later. There are, however, cases, when domain contingency 
      1016    does not suffice, for example, when ReliefF is used as a measure 
     1017    of quality of attributes. In this case, there's no other way but 
     1018    to use the examples and ignore the precomputed contingencies. 
     1019 
     1020    The split constructor should consider only the attributes in the 
     1021    candidate list (the list is a vector of booleans, one for each 
     1022    attribute). 
     1023 
     1024    :obj:`SplitConstructor` returns most of the data we talked 
     1025    about when describing the :obj:`Node`. It returns a classifier 
     1026    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of 
     1027    branch descriptions and a list with the number of examples that 
     1028    go into each branch. Just what we need for the :obj:`Node`. 
     1029    It can return an empty list for the number of examples in branches; 
     1030    in this case, the :obj:`TreeLearnerBase` will find the number itself 
     1031    after splitting the example set into subsets. However, if a split 
      1032    constructor can provide the numbers at no extra computational 
     1033    cost, it should do so. 
     1034 
     1035    In addition, it returns a quality of the split; a number without 
     1036    any fixed meaning except that higher numbers mean better splits. 
     1037 
     1038    If the constructed splitting criterion uses an attribute in such 
     1039    a way that the attribute is 'completely spent' and should not be 
     1040    considered as a split criterion in any of the subtrees (the 
     1041    typical case of this are discrete attributes that are used 
     1042    as-they-are, that is, without any binarization or subsetting), 
     1043    then it should report the index of this attribute. Some splits 
     1044    do not spend any attribute; this is indicated by returning a 
     1045    negative index. 
     1046 
     1047    A :obj:`SplitConstructor` can veto the further tree induction 
     1048    by returning no classifier. This can happen for many reasons. 
      1049    A general one is related to the number of examples in the branches. 
     1050    :obj:`SplitConstructor` has a field :obj:`minSubset`, 
     1051    which sets the minimal number of examples in a branch; null nodes, 
     1052    however, are allowed. If there is no split where this condition 
     1053    is met, :obj:`SplitConstructor` stops the induction. 
     1054 
     1055    .. attribute:: minSubset 
     1056 
     1057        Sets the minimal number of examples in non-null leaves. As 
     1058        always in Orange (where not specified otherwise), "number of  
     1059        examples" refers to the weighted number of examples. 
     1060     
     1061    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])  
     1062 
     1063        Construct a split. Returns a tuple (:obj:`branchSelector`, 
     1064        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`, 
     1065        :obj:`spentAttribute`). :obj:`SpentAttribute` is -1 if no 
     1066        attribute is completely spent by the split criterion. If no 
     1067        split is constructed, the :obj:`selector`, :obj:`branchDescriptions` 
     1068        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and 
     1069        :obj:`spentAttribute` is -1. 
     1070 
     1071        :param examples:  Examples can be given in any acceptable form 
     1072            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a 
     1073            list of examples). 
     1074        :param weightID: Optional; the default of 0 means that all 
     1075            examples have a weight of 1.0.  
      1076        :param apriori_distribution: Should be of type 
     1077            :obj:`orange.Distribution` and candidates should be a Python  
     1078            list of objects which are interpreted as booleans. 
     1079 
     1080 
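A split constructor can, in principle, also be called directly from Python,
which is handy when debugging your own components. A sketch, using the
threshold constructor and measure described below (that all optional
arguments can really be omitted is an assumption)::

    data = Orange.data.Table("iris")
    split = Orange.classification.tree.SplitConstructor_Threshold()
    split.measure = Orange.feature.scoring.GainRatio()
    selector, descriptions, sizes, quality, spent = split(data)
    print descriptions, quality, spent
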
     1081 
     1082.. class:: SplitConstructor_Measure 
     1083 
     1084    Bases: :class:`SplitConstructor` 
     1085 
     1086    An abstract base class for split constructors that employ  
     1087    a :obj:`Orange.feature.scoring.Measure` to assess a quality of a split.  
     1088    At present, 
     1089    all split constructors except for :obj:`SplitConstructor_Combined` 
     1090    are derived from this class. 
     1091 
     1092    .. attribute:: measure 
     1093 
     1094        A component of type :obj:`Orange.feature.scoring.Measure` used for 
      1095        evaluation of a split. Note that you must select a subclass 
      1096        of :obj:`Orange.feature.scoring.Measure` capable of handling your 
     1097        class type  
     1098        - you cannot use :obj:`Orange.feature.scoring.GainRatio` 
     1099        for building regression trees or :obj:`Orange.feature.scoring.MSE` 
     1100        for classification trees. 
     1101 
     1102    .. attribute:: worstAcceptable 
     1103 
     1104        The lowest required split quality for a split to be acceptable. 
      1105        Note that this value makes sense only in connection with a 
     1106        :obj:`measure` component. Default is 0.0. 
     1107 
     1108.. class:: SplitConstructor_Attribute 
     1109 
     1110    Bases: :class:`SplitConstructor_Measure` 
     1111 
     1112    Attempts to use a discrete attribute as a split; each value of the  
     1113    attribute corresponds to a branch in the tree. Attributes are 
     1114    evaluated with the :obj:`measure` and the one with the 
     1115    highest score is used for a split. If there is more than one 
      1116    attribute with the highest score, one of them is selected at random. 
     1117 
     1118    The constructed :obj:`branchSelector` is an instance of  
     1119    :obj:`orange.ClassifierFromVarFD` that returns a value of the  
     1120    selected attribute. If the attribute is  
     1121    :obj:`Orange.data.variable.Discrete`, 
     1122    :obj:`branchDescription`'s are the attribute's values. The  
     1123    attribute is marked as spent, so that it cannot reappear in the  
     1124    node's subtrees. 
     1125 
     1126.. class:: SplitConstructor_ExhaustiveBinary 
     1127 
     1128    Bases: :class:`SplitConstructor_Measure` 
     1129 
     1130    Works on discrete attributes. For each attribute, it determines 
     1131    which binarization of the attribute gives the split with the 
     1132    highest score. If more than one split has the highest score, 
      1133    one of them is selected at random. After trying all the attributes, 
     1134    it returns one of those with the highest score. 
     1135 
      1136    The constructed :obj:`branchSelector` is again an instance of 
     1137    :obj:`orange.ClassifierFromVarFD` that returns a value of the 
     1138    selected attribute. This time, however, its :obj:`transformer` 
     1139    contains an instance of :obj:`MapIntValue` that maps the values 
     1140    of the attribute into a binary attribute. Branch descriptions are 
     1141    of form "[<val1>, <val2>, ...<valn>]" for branches corresponding to 
     1142    more than one value of the attribute. Branches that correspond to 
     1143    a single value of the attribute are described with this value. If  
     1144    the attribute was originally binary, it is spent and cannot be  
     1145    used in the node's subtrees. Otherwise, it can reappear in the  
     1146    subtrees. 
     1147 
     1148 
     1149.. class:: SplitConstructor_Threshold 
     1150 
     1151    Bases: :class:`SplitConstructor_Measure` 
     1152 
     1153    This is currently the only constructor for splits with continuous  
      1154    attributes. It divides the range of attribute values with a threshold 
     1155    that maximizes the split's quality. As always, if there is more than  
     1156    one split with the highest score, a random threshold is selected.  
     1157    The attribute that yields the highest binary split is returned. 
     1158 
     1159    The constructed :obj:`branchSelector` is again an instance of  
     1160    :obj:`orange.ClassifierFromVarFD` with an attached  
     1161    :obj:`transformer`. This time, :obj:`transformer` is of type  
     1162    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are  
     1163    "<threshold" and ">=threshold". The attribute is not spent. 
     1164 
     1165.. class:: SplitConstructor_OneAgainstOthers 
     1166     
     1167    Bases: :class:`SplitConstructor_Measure` 
     1168 
     1169    Undocumented. 
     1170 
     1171.. class:: SplitConstructor_Combined 
     1172 
     1173    Bases: :class:`SplitConstructor` 
     1174 
     1175    This constructor delegates the task of finding the optimal split  
     1176    to separate split constructors for discrete and for continuous 
     1177    attributes. Each split constructor is called, given only attributes 
     1178    of appropriate types as candidates. Both construct a candidate for 
     1179    a split; the better of them is selected. 
     1180 
      1181    (Note that there is a problem when several candidates have the same 
      1182    score. Suppose there are nine discrete attributes with the highest 
     1183    score; the split constructor for discrete attributes will select 
     1184    one of them. Now, let us suppose that there is a single continuous 
     1185    attribute with the same score. :obj:`SplitConstructor_Combined` 
     1186    would randomly select between the proposed discrete attribute and  
     1187    the continuous attribute, not aware of the fact that the discrete 
     1188    has already competed with eight other discrete attributes. So,  
      1189    the probability of selecting (each) discrete attribute would be 1/18 
     1190    instead of 1/10. Although not really correct, we doubt that this 
     1191    would affect the tree's performance; many other machine learning 
     1192    systems simply choose the first attribute with the highest score  
     1193    anyway.) 
     1194 
     1195    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether  
     1196    the attribute is spent is decided by the winning split constructor. 
     1197 
      1198    .. attribute:: discreteSplitConstructor 
     1199 
     1200        Split constructor for discrete attributes; can be, for instance, 
     1201        :obj:`SplitConstructor_Attribute` or  
     1202        :obj:`SplitConstructor_ExhaustiveBinary`. 
     1203 
      1204    .. attribute:: continuousSplitConstructor 
     1205 
     1206        Split constructor for continuous attributes; at the moment, it  
     1207        can be either :obj:`SplitConstructor_Threshold` or a  
     1208        split constructor you programmed in Python. 
     1209 
     1215 
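A minimal sketch of assembling a combined split constructor from the
components described above; passing it to the learner through a
:obj:`split` argument is an assumption modelled on how the :obj:`stop`
argument is used elsewhere in this documentation::

    import Orange

    split = Orange.classification.tree.SplitConstructor_Combined()
    split.discreteSplitConstructor = \
        Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    split.continuousSplitConstructor = \
        Orange.classification.tree.SplitConstructor_Threshold()

    data = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(data, split=split)
    print tree.dump()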
     1216 
     1217StopCriteria and StopCriteria_common 
     1218============================================ 
     1219 
:obj:`StopCriteria` determines when to stop the induction of subtrees. 
     1221 
     1222.. class:: StopCriteria 
     1223 
     1224    Given a set of examples, weight ID and contingency matrices, decide 
     1225    whether to continue the induction or not. The basic criterion checks 
     1226    whether there are any examples and whether they belong to at least 
     1227    two different classes (if the class is discrete). Derived components 
     1228    check things like the number of examples and the proportion of 
     1229    majority classes. 
     1230 
     1231    As opposed to :obj:`SplitConstructor` and similar basic classes, 
     1232    :obj:`StopCriteria` is not an abstract but a fully functional 
     1233    class that provides the basic stopping criteria. That is, the tree 
     1234    induction stops when there is at most one example left; in this case, 
     1235    it is not the weighted but the actual number of examples that counts. 
     1236    Besides that, the induction stops when all examples are in the same 
     1237    class (for discrete problems) or have the same value of the outcome 
     1238    (for regression problems). 
     1239 
     1240    .. method:: __call__(examples[, weightID, domain contingencies]) 
     1241 
     1242        Decides whether to stop (true) or continue (false) the induction. 
     1243        If contingencies are given, they are used for checking whether 
     1244        the examples are in the same class (but not for counting the 
     1245        examples). Derived classes should use the contingencies whenever 
     1246        possible. If contingencies are not given, :obj:`StopCriteria` 
     1247        will work without them. Derived classes should also use them if 
     1248        they are available, but otherwise compute them only when they 
     1249        really need them. 
     1250 
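Because the call signature is this simple, a stopping criterion can also
be written as a plain Python function and passed to the learner's
:obj:`stop` slot. The sketch below follows the accompanying tree3.py
script; the function name is ours::

    from random import randint
    import Orange

    data = Orange.data.Table("lenses")
    defStop = Orange.classification.tree.StopCriteria()

    def randomStop(examples, weightID, contingency):
        # stop when the default criterion says so, or randomly in 20% of calls
        return defStop(examples, weightID, contingency) or randint(1, 5) == 1

    tree = Orange.classification.tree.TreeLearner(data, stop=randomStop)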
     1251 
     1252 
     1253.. class:: StopCriteria_common 
     1254 
    :obj:`StopCriteria_common` contains additional criteria for pre-pruning: 
    it checks the proportion of majority class and the number of weighted 
    examples. 
     1258 
    .. attribute:: maxMajority 
     1260 
     1261        Maximal proportion of majority class. When this is exceeded,  
     1262        induction stops. 
     1263 
     1264    .. attribute:: minExamples 
     1265 
        Minimal number of examples in internal nodes. Subsets with fewer 
        than :obj:`minExamples` examples are not split any further. 
        The example count is weighted. 
     1269 
     1270.. class:: StopCriteria_Python 
     1271 
     1272    Undocumented. 
     1273 
Classes derived from ExampleSplitter 
     1275======================================== 
     1276 
     1277The :obj:`ExampleSplitter` sorts the learning examples into branches. 
     1278 
     1279.. class:: ExampleSplitter 
     1280 
     1281    Just like the :obj:`Descender` decides the branch for an 
     1282    example during classification, the :obj:`ExampleSplitter` 
     1283    sorts the learning examples into branches. 
     1284 
    :obj:`ExampleSplitter` is given a :obj:`Node` (from which 
    it can use various data, but most splitters only use the 
    :obj:`branchSelector`), a set of examples to be divided, and 
    the weight ID. The result is a list of subsets of examples 
    and, optionally, a list of new weight ID's. 

    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most 
    example splitters simply call the node's 
    :obj:`branchSelector` and assign examples to the corresponding 
    branches. When the value is unknown they choose a particular 
    branch or simply skip the example. 
     1296 
    Some enhanced splitters can split examples. An example (actually, 
    a pointer to it) is copied to more than one subset. To facilitate 
    real splitting, weights are needed. Each branch is assigned a 
    weight ID (each would usually have its own ID) and all examples 
    that are in that branch (either completely or partially) should 
    have this meta attribute. If an example hasn't been split, it 
    has only one additional attribute - with the weight ID corresponding 
    to the subset to which it went. An example that is split between, 
    say, three subsets has three new meta attributes, one for each 
    subset. The ID's of the weight meta attributes are returned by the 
    :obj:`ExampleSplitter` to be used at induction of the 
    corresponding subtrees. 

    Note that weights are used only when needed. When no splitting 
    occurred - because the splitter is not able to do it or because 
    there was no need for splitting - no weight ID's are returned. 
     1313 
     1314    An abstract base class for objects that split sets of examples into  
     1315    subsets. The derived classes differ in treatment of examples which 
     1316    cannot be unambiguously placed into a single branch (usually due 
     1317    to unknown value of the crucial attribute). 
     1318 
     1319    .. method:: __call__(node, examples[, weightID]) 
     1320         
     1321        Use the information in :obj:`node` (particularly the  
     1322        :obj:`branchSelector`) to split the given set of examples into subsets.  
     1323        Return a tuple with a list of example generators and a list of weights.  
        The list of weights is either an ordinary Python list of integers or 
        None when no splitting of examples occurs and thus no weights are 
        needed. 
     1327 
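The built-in splitters listed below are implemented in C++, but the
interface can be illustrated with a rough Python sketch that mimics
:obj:`ExampleSplitter_IgnoreUnknowns`; the helper below is hypothetical
and only meant to show the calling convention::

    import Orange

    def ignore_unknowns_splitter(node, examples, weightID=0):
        # one subset per branch, as described by the node's selector
        subsets = [[] for _ in node.branchDescriptions]
        for ex in examples:
            branch = node.branchSelector(ex)
            if not branch.isSpecial():       # skip examples with unknown value
                subsets[int(branch)].append(ex)
        tables = [Orange.data.Table(examples.domain, s) for s in subsets]
        return tables, None                  # no example was split, no new weights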
     1328 
     1329.. class:: ExampleSplitter_IgnoreUnknowns 
     1330 
     1331    Bases: :class:`ExampleSplitter` 
     1332 
     1333    Simply ignores the examples for which no single branch can be determined. 
     1334 
     1335.. class:: ExampleSplitter_UnknownsToCommon 
     1336 
     1337    Bases: :class:`ExampleSplitter` 
     1338 
    Places all such examples into the branch with the highest number of 
    examples. If there is more than one such branch, one is selected at 
    random and then used for all examples. 
     1342 
     1343.. class:: ExampleSplitter_UnknownsToAll 
     1344 
     1345    Bases: :class:`ExampleSplitter` 
     1346 
     1347    Places examples with unknown value of the attribute into all branches. 
     1348 
     1349.. class:: ExampleSplitter_UnknownsToRandom 
     1350 
     1351    Bases: :class:`ExampleSplitter` 
     1352 
     1353    Selects a random branch for such examples. 
     1354 
     1355.. class:: ExampleSplitter_UnknownsToBranch 
     1356 
     1357    Bases: :class:`ExampleSplitter` 
     1358 
     1359    Constructs an additional branch to contain all such examples.  
     1360    The branch's description is "unknown". 
     1361 
     1362.. class:: ExampleSplitter_UnknownsAsBranchSizes 
     1363 
     1364    Bases: :class:`ExampleSplitter` 
     1365 
     1366    Splits examples with unknown value of the attribute according to  
     1367    proportions of examples in each branch. 
     1368 
     1369.. class:: ExampleSplitter_UnknownsAsSelector 
     1370 
     1371    Bases: :class:`ExampleSplitter` 
     1372 
     1373    Splits examples with unknown value of the attribute according to  
     1374    distribution proposed by selector (which is in most cases the same  
     1375    as proportions of examples in branches). 
     1376 
     1377Descender and derived classes 
     1378================================= 
     1379 
This is a classifier's counterpart for :obj:`ExampleSplitter`. It 
decides the destiny of examples that need to be classified and cannot 
be unambiguously put in a branch. Its detailed behaviour is described 
in the section on classification with trees above. 
     1384 
     1385.. class:: Descender 
     1386 
     1387    An abstract base object for tree descenders. 
     1388 
     1389    .. method:: __call__(node, example) 
     1390 
     1391        Descends down the tree until it reaches a leaf or a node in  
     1392        which a vote of subtrees is required. In both cases, a tuple  
     1393        of two elements is returned; in the former, the tuple contains  
        the reached node and None, in the latter it 
     1395        contains a node and weights of votes for subtrees (a list of floats). 
     1396 
     1397        :obj:`Descender`'s that never split examples always descend 
     1398        to a leaf, but they differ in the treatment of examples with 
        unknown values (or, in general, examples for which a branch 
        cannot be determined at some node(s) of the tree). 
     1401        :obj:`Descender`'s that do split examples differ in returned 
     1402        vote weights. 
     1403 
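Descenders are normally invoked internally during classification, but
they can also be called directly, following the signature above. A
hedged sketch; the choice of descender and data is arbitrary::

    import Orange

    data = Orange.data.Table("lenses")
    treeClassifier = Orange.classification.tree.TreeLearner(data)

    descender = Orange.classification.tree.Descender_UnknownsToCommonBranch()
    node, votes = descender(treeClassifier.tree, data[0])
    if votes is None:
        # a single node was reached; its classifier gives the prediction
        print node.nodeClassifier(data[0])
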
     1404.. class:: Descender_UnknownsToNode 
     1405 
     1406    Bases: :obj:`Descender` 
     1407 
    When an example cannot be classified into a single branch, the 
    current node is returned. Thus, the node's :obj:`nodeClassifier` 
    will be used to make a decision. It is your responsibility to see 
    that even the internal nodes have their :obj:`nodeClassifier` 
    (i.e., don't disable creating node classifiers or manually remove 
    them after the induction). 
     1414 
     1415.. class:: Descender_UnknownsToBranch 
     1416 
     1417    Bases: :obj:`Descender` 
     1418 
     1419    Classifies examples with unknown value to a special branch. This 
     1420    makes sense only if the tree itself was constructed with 
     1421    :obj:`ExampleSplitter_UnknownsToBranch`. 
     1422 
     1423.. class:: Descender_UnknownsToCommonBranch 
     1424 
     1425    Bases: :obj:`Descender` 
     1426 
     1427    Classifies examples with unknown values to the branch with the 
     1428    highest number of examples. If there is more than one such branch, 
    a random branch is chosen for each example that is to be classified. 
     1430 
     1431.. class:: Descender_UnknownsToCommonSelector 
     1432 
     1433    Bases: :obj:`Descender` 
     1434 
     1435    Classifies examples with unknown values to the branch which received  
     1436    the highest recommendation by the selector. 
     1437 
     1438.. class:: Descender_MergeAsBranchSizes 
     1439 
     1440    Bases: :obj:`Descender` 
     1441 
     1442    Makes the subtrees vote for the example's class; the vote is 
     1443    weighted according to the sizes of the branches. 
     1444 
     1445.. class:: Descender_MergeAsSelector 
     1446 
     1447    Bases: :obj:`Descender` 
     1448 
     1449    Makes the subtrees vote for the example's class; the vote is  
    weighted according to the selector's proposal. 
     1451 
     1452Pruner and derived classes 
     1453============================== 
     1454 
     1455.. index:: 
     1456    pair: classification trees; pruning 
     1457 
Classes derived from :obj:`Pruner` prune the trees as 
described in the section on pruning above - make sure you read it 
to understand what the pruners will do to your trees. 
     1461 
     1462.. class:: Pruner 
     1463 
     1464    This is an abstract base class which defines nothing useful, only  
     1465    a pure virtual call operator. 
     1466 
     1467    .. method:: __call__(tree) 
     1468 
     1469        Prunes a tree. The argument can be either a tree classifier or  
     1470        a tree node; the result is of the same type as the argument. 
     1471 
     1472.. class:: Pruner_SameMajority 
     1473 
     1474    Bases: :class:`Pruner` 
     1475 
    In Orange, a tree can have non-trivial subtrees (i.e. subtrees 
     1477    with more than one leaf) in which all the leaves have the same majority  
     1478    class. (This is allowed because those leaves can still have different 
     1479    distributions of classes and thus predict different probabilities.)  
     1480    However, this can be undesired when we're only interested in the  
     1481    class prediction or a simple tree interpretation. The  
     1482    :obj:`Pruner_SameMajority` prunes the tree so that there is no 
     1483    subtree in which all the nodes would have the same majority class. 
     1484 
     1485    This pruner will only prune the nodes in which the node classifier  
     1486    is of class :obj:`orange.DefaultClassifier` (or from a derived class). 
     1487 
     1488    Note that the leaves with more than one majority class require some  
     1489    special handling. The pruning goes backwards, from leaves to the root.  
     1490    When siblings are compared, the algorithm checks whether they  
     1491    have (at least one) common majority class. If so, they can be pruned. 
     1492 
     1493.. class:: Pruner_m 
     1494 
     1495    Bases: :class:`Pruner` 
     1496 
     1497    Prunes a tree by comparing m-estimates of static and dynamic  
     1498    error as defined in (Bratko, 2002). 
     1499 
     1500    .. attribute:: m 
     1501 
     1502        Parameter m for m-estimation. 
     1503 
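A short usage sketch, assuming a tree has already been induced; the
value of :obj:`m` is arbitrary, and it is assumed that the pruned
result, like the input, supports :obj:`dump`::

    import Orange

    data = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(data)

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    pruned = pruner(tree)    # accepts a tree classifier or a tree node
    print pruned.dump()
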
     1504======== 
     1505Examples 
     1506======== 
     1507 
This page does not provide examples for programming your own components, 
such as, for instance, a :obj:`SplitConstructor`. Those examples 
can be found on a separate page dedicated to callbacks to Python. 
     1511 
     1512Tree Structure 
     1513============== 
     1514 
To have something to work on, we'll take the data from the lenses dataset 
and build a tree using the default components (part of `treestructure.py`_, uses `lenses.tab`_): 
     1517 
     1518.. literalinclude:: code/treestructure.py 
     1519   :lines: 7-10 
     1520 
     1521How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)? 
     1522 
     1523.. _lenses.tab: code/lenses.tab 
     1524.. _treestructure.py: code/treestructure.py 
     1525 
     1526.. literalinclude:: code/treestructure.py 
     1527   :lines: 12-21 
     1528 
If the node is None, we have a null-node; null nodes don't count, 
so we return 0. Otherwise, the size is 1 (this node) plus the 
sizes of all subtrees. The node is an internal node if it has a 
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't 
attempt to skip the if statement: leaves don't have an empty list 
of branches, they don't have a list of branches at all. 
     1535 
     1536    >>> treeSize(treeClassifier.tree) 
     1537    10 
     1538 
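The counting function included above might look roughly like this (a
sketch reconstructed from the description, not a verbatim copy of
`treestructure.py`_)::

    def treeSize(node):
        if not node:
            return 0                        # null nodes don't count
        size = 1                            # this node
        if node.branchSelector:             # internal node: add the subtrees
            for branch in node.branches:
                size += treeSize(branch)
        return size
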
Don't forget that this was only an exercise - :obj:`Node` has a 
built-in method :obj:`Node.treeSize` that does exactly the same. 
     1541 
     1542Let us now write a simple script that prints out a tree. The recursive 
     1543part of the function will get a node and its level (part of `treestructure.py`_, uses `lenses.tab`_). 
     1544 
     1545.. literalinclude:: code/treestructure.py 
     1546   :lines: 26-41 
     1547 
Don't waste time studying the formatting tricks (\n's etc.), they are just 
for nicer output. What matters is everything but the print statements. 
First, we check whether the node is a null-node (a node to which no 
learning examples were classified). If it is, we just print out 
"<null node>" and return. 
     1553 
After handling null nodes, the remaining nodes are internal nodes or leaves. 
For internal nodes, we print a node description consisting of the 
attribute's name and the distribution of classes. The :obj:`Node`'s branch 
selector is, for all currently defined splits, an instance of a 
class derived from :obj:`orange.Classifier` (in fact, it is an 
:obj:`orange.ClassifierFromVarFD`, but an :obj:`orange.Classifier` would 
suffice), and its :obj:`classVar` points to the attribute we seek. 
So we print its name. We will also assume that storing class distributions 
has not been disabled and print them as well. A more able function for 
printing trees (:meth:`TreeClassifier.dump`) has an alternative 
means to get the distribution when this fails. Then we iterate 
through the branches; for each we print a branch description and recursively 
call :obj:`printTree0` with the level increased by 1 (to increase 
the indent). 
     1568 
     1569Finally, if the node is a leaf, we print out the distribution of  
     1570learning examples in the node and the class to which the examples in  
     1571the node would be classified. We again assume that the :obj:`nodeClassifier`  
     1572is the default one - a :obj:`DefaultClassifier`. A better print  
     1573function should be aware of possible alternatives. 
     1574 
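A condensed sketch of the logic just described (the actual script in
`treestructure.py`_ takes more care with formatting)::

    def printTree0(node, level):
        if not node:
            print " " * level + "<null node>"
            return
        if node.branchSelector:
            # internal node: attribute name and class distribution
            print " " * level + "%s (%s)" % \
                (node.branchSelector.classVar.name, node.distribution)
            for i, branch in enumerate(node.branches):
                print " " * level + ": " + node.branchDescriptions[i]
                printTree0(branch, level + 1)
        else:
            # leaf: predicted class and class distribution
            print " " * level + "--> %s (%s)" % \
                (node.nodeClassifier.defaultValue, node.distribution)
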
     1575Now, we just need to write a simple function to call our printTree0.  
     1576We could write something like... 
     1577 
     1578:: 
     1579 
     1580    def printTree(x): 
     1581        printTree0(x.tree, 0) 
     1582 
     1583... but we won't. Let us learn how to handle arguments of different 
     1584types. Let's write a function that will accept either a :obj:`TreeClassifier` 
     1585or a :obj:`Node`; just like :obj:`Pruner`, remember? Part of `treestructure.py`_, uses `lenses.tab`_. 
     1586 
     1587.. literalinclude:: code/treestructure.py 
     1588   :lines: 43-49 
     1589 
     1590It's fairly straightforward: if :obj:`x` is of type derived from  
     1591:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's  
     1592:obj:`Node` we just call :obj:`printTree0` with :obj:`x`. If it's  
     1593of some other type, we don't know how to handle it and thus raise  
     1594an exception. (Note that we could also use  
     1595 
     1596:: 
     1597 
    if type(x) == Orange.classification.tree.TreeClassifier: 
     1599 
but this would only work if :obj:`x` were of type 
:obj:`TreeClassifier` and not of any derived type. The latter, 
     1602however, do not exist yet...) 
     1603 
     1604    >>> printTree(treeClassifier) 
     1605    tear_rate (<15.000, 5.000, 4.000>) 
     1606    : reduced --> none (<12.000, 0.000, 0.000>) 
     1607    : normal 
     1608       astigmatic (<3.000, 5.000, 4.000>) 
     1609       : no 
     1610          age (<1.000, 5.000, 0.000>) 
     1611          : young --> soft (<0.000, 2.000, 0.000>) 
     1612          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>) 
     1613          : presbyopic --> none (<1.000, 1.000, 0.000>) 
     1614       : yes 
     1615          prescription (<2.000, 0.000, 4.000>) 
     1616          : myope --> hard (<0.000, 0.000, 3.000>) 
     1617          : hypermetrope --> none (<2.000, 0.000, 1.000>) 
     1618 
     1619For a final exercise, let us write a simple pruning procedure. It will  
     1620be written entirely in Python, unrelated to any :obj:`Pruner`. Our 
     1621procedure will limit the tree depth - the maximal depth (here defined 
     1622as the number of internal nodes on any path down the tree) shall be 
     1623given as an argument. For example, to get a two-level tree, we would 
     1624call cutTree(root, 2). The function will be recursive, with the second  
     1625argument (level) decreasing at each call; when zero, the current node  
     1626will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_): 
     1627 
     1628.. literalinclude:: code/treestructure.py 
     1629   :lines: 54-62 
     1630 
     1631There's nothing to prune at null-nodes or leaves, so we act only when  
     1632:obj:`node` and :obj:`node.branchSelector` are defined. If level is  
     1633not zero, we call the function for each branch. Otherwise, we clear  
     1634the selector, branches and branch descriptions. 
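
The pruning function might look roughly like this (again a sketch
reconstructed from the description; the real code is in
`treestructure.py`_)::

    def cutTree(node, level):
        if node and node.branchSelector:
            if level:
                for branch in node.branches:
                    cutTree(branch, level - 1)
            else:
                # turn the node into a leaf
                node.branchSelector = None
                node.branches = None
                node.branchDescriptions = None

Applying it to our tree and printing the result: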
     1635 
    >>> cutTree(treeClassifier.tree, 2) 
    >>> printTree(treeClassifier) 
     1638    tear_rate (<15.000, 5.000, 4.000>) 
     1639    : reduced --> none (<12.000, 0.000, 0.000>) 
     1640    : normal 
     1641       astigmatic (<3.000, 5.000, 4.000>) 
     1642       : no --> soft (<1.000, 5.000, 0.000>) 
     1643       : yes --> hard (<2.000, 0.000, 4.000>) 
     1644 
     1645Learning 
     1646======== 
     1647 
     1648You've already seen a simple example of using a :obj:`TreeLearnerBase`. 
     1649You can just call it and let it fill the empty slots with the default 
components. This section will teach you three things: what the 
missing components are (and how to set the same components yourself), 
     1652how to use alternative components to get a different tree and, 
     1653finally, how to write a skeleton for tree induction in Python. 
     1654 
     1655Default components for TreeLearnerBase 
     1656====================================== 
     1657 
     1658Let us construct a :obj:`TreeLearnerBase` to play with. 
     1659 
     1660.. _treelearner.py: code/treelearner.py 
     1661 
     1662`treelearner.py`_, uses `lenses.tab`_: 
     1663 
     1664.. literalinclude:: code/treelearner.py 
     1665   :lines: 7-10 
     1666 
     1667There are three crucial components in learning: the split and stop 
     1668criteria, and the :obj:`exampleSplitter` (there are some others, 
     1669which become important during classification; we'll talk about them 
     1670later). They are not defined; if you use the learner, the slots are 
     1671filled temporarily but later cleared again. 
     1672 
     1673:: 
     1674 
     1675    >>> print learner.split 
     1676    None 
     1677    >>> learner(data) 
     1678    <TreeClassifier instance at 0x01F08760> 
     1679    >>> print learner.split 
     1680    None 
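
If you want the components to stay around, you can fill the slots
yourself before calling the learner. A sketch using components
documented above; any split constructor, stopping criterion and
example splitter can be combined this way::

    learner.split = Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    learner.stop = Orange.classification.tree.StopCriteria_common()
    learner.exampleSplitter = \
        Orange.classification.tree.ExampleSplitter_UnknownsToCommon()
    tree = learner(data)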
     1681 
     1682Stopping criteria 
     1683================= 
     1684 
     1685The stop is trivial. The default is set by 
     1686 
::

    >>> learner.stop = Orange.classification.tree.StopCriteria_common() 
     1689 
     1690Well, this is actually done in C++ and it uses a global component 
     1691that is constructed once for all, but apart from that we did 
     1692effectively the same thing. 
     1693 
     1694We can now examine the default stopping parameters. 
     1695 
     1696    >>> print learner.stop.maxMajority, learner.stop.minExamples 
     1697    1.0 0.0 
     1698 
     1699Not very restrictive. This keeps splitting the examples until 
     1700there's nothing left to split or all the examples are in the same 
     1701class. Let us set the minimal subset that we allow to be split to 
     1702five examples and see what comes out. 
     1703 
     1704    >>> learner.stop.minExamples = 5.0 
     1705    >>> tree = learner(data) 
     1706    >>> print tree.dump() 
     1707    tear_rate=reduced: none (100.00%) 
     1708    tear_rate=normal 
     1709    |    astigmatic=no 
     1710    |    |    age=pre-presbyopic: soft (100.00%) 
     1711    |    |    age=presbyopic: none (50.00%) 
     1712    |    |    age=young: soft (100.00%) 
     1713    |    astigmatic=yes 
     1714    |    |    prescription=hypermetrope: none (66.67%) 
     1715    |    |    prescription=myope: hard (100.00%) 
     1716 
     1717OK, that's better. If we want an even smaller tree, we can also limit 
     1718the maximal proportion of majority class. 
     1719 
     1720    >>> learner.stop.maxMajority = 0.5 
     1721    >>> tree = learner(data) 
     1722    >>> print tree.dump() 
     1723    none (62.50%) 
     1724 
     1725Undocumented 
     1726============ 
     1727 
     1728.. class:: NodeList 
     1729 
     1730    Undocumented. 
     1731 
     1732.. class:: C45NodeList 
     1733 
     1734    Undocumented. 
     1735 
     1736=========================== 
     1737C4.5 Classifier and Learner 
     1738=========================== 
     1739 
     1740As C4.5 is a standard benchmark in machine learning,  
     1741it is incorporated in Orange, although Orange has its own 
     1742implementation of decision trees. 
     1743 
     1744The implementation uses the original Quinlan's code for learning so the 
tree you get is exactly like the one that would be built by standalone 
     1746C4.5. Upon return, however, the original tree is copied to Orange 
     1747components that contain exactly the same information plus what is needed 
     1748to make them visible from Python. To be sure that the algorithm behaves 
     1749just as the original, we use a dedicated class :class:`C45Node` 
     1750instead of reusing the components used by Orange's tree inducer 
(i.e., :class:`Node`). This, however, could be done and probably 
     1752will be done in the future; we shall still retain :class:`C45Node`  
     1753but offer transformation to :class:`Node` so that routines 
     1754that work on Orange trees will also be usable for C45 trees. 
     1755 
     1756:class:`C45Learner` and :class:`C45Classifier` behave 
like any other Orange learner and classifier. Unlike most Orange 
learning algorithms, C4.5 does not accept weighted examples. 
     1759 
     1760Building the C4.5 plug-in 
     1761========================= 
     1762 
     1763We haven't been able to obtain the legal rights to distribute 
     1764C4.5 and therefore couldn't statically link it into Orange. Instead, 
     1765it's incorporated as a plug-in which you'll need to build yourself. 
     1766The procedure is trivial, but you'll need a C compiler. On Windows, 
     1767the scripts we provide work with MS Visual C and the files CL.EXE 
     1768and LINK.EXE must be on the PATH. On Linux you're equipped with 
     1769gcc. Mac OS X comes without gcc, but you can download it for 
     1770free from Apple. 
     1771 
     1772Orange must be installed prior to building C4.5. (This is because 
     1773the build script will copy the created file next to Orange, 
     1774which it obviously can't if Orange isn't there yet.) 
     1775 
     1776#. Download the  
     1777   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_ 
     1778   from the `Rule Quest's site <http://www.rulequest.com/>`_ and extract 
     1779   them into some temporary directory. The files will be modified in the 
     1780   further process, so don't use your copy of Quinlan's sources that you 
     1781   need for another purpose. 
     1782#. Download  
     1783   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_  
   and unzip its contents into the directory R8/Src of Quinlan's 
   sources (it's the directory that contains, for instance, the file 
     1786   average.c). 
     1787#. Run buildC45.py, which will build the plug-in and put it next to  
     1788   orange.pyd (or orange.so on Linux/Mac). 
     1789#. Run python, type :samp:`import Orange` and  
   create :samp:`Orange.classification.tree.C45Learner()`. 
   If this fails, something went wrong; see the diagnostic messages from 
   buildC45.py and read the paragraph below. 
#. Finally, you can remove Quinlan's sources, along with everything 
   created by buildC45.py. 
     1795 
     1796If the process fails, here's what buildC45.py really does: it creates 
     1797.h files that wrap Quinlan's .i files and ensure that they are not 
     1798included twice. It modifies C4.5 sources to include .h's instead of 
.i's. This step can hardly fail. Then follows the platform-dependent 
step which compiles ensemble.c (which includes all of Quinlan's .c 
files it needs) into c45.dll or c45.so and puts it next to Orange. 
If this fails, but you do have a C compiler and linker, and you know 
how to use them, you can compile ensemble.c into a dynamic 
library yourself. See the compile and link steps in buildC45.py, 
if it helps. Anyway, after doing this, check that the built C4.5 
gives the same results as the original. 
     1807 
     1808.. class:: C45Learner 
     1809 
     1810    :class:`C45Learner`'s attributes have double names - those that 
     1811    you know from C4.5 command lines and the corresponding names of C4.5's 
     1812    internal variables. All defaults are set as in C4.5; if you change 
     1813    nothing, you are running C4.5. 
     1814 
     1815    .. attribute:: gainRatio (g) 
     1816         
        Determines whether to use information gain (false, default) 
     1818        or gain ratio for selection of attributes (true). 
     1819 
     1820    .. attribute:: batch (b) 
     1821 
     1822        Turn on batch mode (no windows, no iterations); this option is 
     1823        not documented in C4.5 manuals. It conflicts with "window", 
     1824        "increment" and "trials". 
     1825 
     1826    .. attribute:: subset (s) 
     1827         
        Enables subsetting (default: false, no subsetting). 
     1829  
     1830    .. attribute:: probThresh (p) 
     1831 
     1832        Probabilistic threshold for continuous attributes (default: false). 
     1833 
     1834    .. attribute:: minObjs (m) 
     1835         
     1836        Minimal number of objects (examples) in leaves (default: 2). 
     1837 
     1838    .. attribute:: window (w) 
     1839 
        Initial window size (default: maximum of 20% and twice the 
     1841        square root of the number of data objects). 
     1842 
     1843    .. attribute:: increment (i) 
     1844 
     1845        The maximum number of objects that can be added to the window 
     1846        at each iteration (default: 20% of the initial window size). 
     1847 
     1848    .. attribute:: cf (c) 
     1849 
        Pruning confidence level (default: 25%). 
     1851 
     1852    .. attribute:: trials (t) 
     1853 
     1854        Set the number of trials in iterative (i.e. non-batch) mode (default: 10). 
     1855 
     1856    .. attribute:: prune 
     1857         
     1858        Return pruned tree (not an original C4.5 option) (default: true) 
     1859 
     1860 
:class:`C45Learner` also offers another way of setting 
the arguments: it provides the method :obj:`C45Learner.commandline`, 
which is given a string and parses it the same way as C4.5 would 
parse its command line. 
     1865 
     1866.. class:: C45Classifier 
     1867 
     1868    A faithful reimplementation of Quinlan's function from C4.5. The only 
     1869    difference (and the only reason it's been rewritten) is that it uses 
     1870    a tree composed of :class:`C45Node` instead of C4.5's 
     1871    original tree structure. 
     1872 
     1873    .. attribute:: tree 
     1874 
     1875        C4.5 tree stored as a tree of :obj:`C45Node`. 
     1876 
     1877 
     1878.. class:: C45Node 
     1879 
     1880    This class is a reimplementation of the corresponding *struct* from 
     1881    Quinlan's C4.5 code. 
     1882 
     1883    .. attribute:: nodeType 
     1884 
     1885        Type of the node:  :obj:`C45Node.Leaf` (0),  
     1886        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2), 
     1887        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches" 
     1888        split examples based on values of a discrete attribute, 
     1889        "cuts" cut them according to a threshold value of a continuous 
        attribute and "subsets" use discrete attributes but with subsetting 
     1891        so that several values can go into the same branch. 
     1892 
     1893    .. attribute:: leaf 
     1894 
     1895        Value returned by that leaf. The field is defined for internal  
     1896        nodes as well. 
     1897 
     1898    .. attribute:: items 
     1899 
     1900        Number of (learning) examples in the node. 
     1901 
     1902    .. attribute:: classDist 
     1903 
     1904        Class distribution for the node (of type  
     1905        :obj:`orange.DiscDistribution`). 
     1906 
     1907    .. attribute:: tested 
     1908         
        The attribute used in the node's test. If node is a leaf, 
        :obj:`tested` is None; if node is of type :obj:`Branch` or :obj:`Subset`, 
        :obj:`tested` is a discrete attribute; and if node is of type 
        :obj:`Cut`, :obj:`tested` is a continuous attribute. 
     1913 
     1914    .. attribute:: cut 
     1915 
     1916        A threshold for continuous attributes, if node is of type :obj:`Cut`. 
     1917        Undefined otherwise. 
     1918 
     1919    .. attribute:: mapping 
     1920 
        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]` 
        gives the index of the branch for examples whose value of :obj:`tested` 
        has index *i*. Here, *i* denotes an index of value, not a :class:`Orange.data.Value`. 
     1924 
     1925    .. attribute:: branch 
     1926         
     1927        A list of branches stemming from this node. 
     1928 
     1929Examples 
     1930======== 
     1931 
     1932.. _tree_c45.py: code/tree_c45.py 
     1933.. _iris.tab: code/iris.tab 
     1934 
     1935The simplest way to use :class:`C45Learner` is to call it. This 
     1936script constructs the same learner as you would get by calling 
     1937the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_): 
     1938 
     1939.. literalinclude:: code/tree_c45.py 
     1940   :lines: 7-14 
     1941 
Arguments can be set by the usual mechanism (the two lines below do the 
same, except that one uses command-line symbols and the other the internal 
variable names) 
     1945 
     1946:: 
     1947 
     1948    tree = Orange.classification.tree.C45Learner(data, m=100) 
     1949    tree = Orange.classification.tree.C45Learner(data, minObjs=100) 
     1950 
The way that could be preferred by a veteran C4.5 user might be through the 
method :obj:`C45Learner.commandline`. 
     1953 
     1954:: 
     1955 
    lrn = Orange.classification.tree.C45Learner() 
     1957    lrn.commandline("-m 1 -s") 
     1958    tree = lrn(data) 
     1959 
There's nothing special about using :obj:`C45Classifier` - it's 
just like any other classifier. To demonstrate what the structure of 
:class:`C45Node`'s looks like, we will show a script that prints 
it out in the same format as C4.5 does. 
     1964 
     1965.. literalinclude:: code/tree_c45_printtree.py 
     1966 
     1967Leaves are the simplest. We just print out the value contained 
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,  
     1969:obj:`C45Node` does not know to which attribute it belongs), we need to 
     1970convert it to a string through :obj:`classVar`, which is passed as an 
     1971extra argument to the recursive part of printTree. 
     1972 
     1973For discrete splits without subsetting, we print out all attribute values 
     1974and recursively call the function for all branches. Continuous splits are 
     1975equally easy to handle. 
     1976 
     1977For discrete splits with subsetting, we iterate through branches, retrieve 
the corresponding values that go into each branch, turn 
     1979the values into strings and print them out, separately treating the 
     1980case when only a single value goes into the branch. 
     1981 
     1982Printing out C45 Tree 
     1983===================== 
     1984 
     1985.. autofunction:: printTreeC45 
    19901986 
    19911987References 
     
    21732169 
    21742170        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used. 
    2175         If 2, use class:`SplitConstructor_OneAgainstOthers`. If 
    2176         0, do not use binarization (use class:`SplitConstructor_Attribute`). 
     2171        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If 
     2172        0, do not use binarization (use :class:`SplitConstructor_Attribute`). 
    21772173        Default: 0. 
    21782174 
     
    21812177        Measure for scoring of the attributes when deciding which of the 
    21822178        attributes will be used for splitting of the example set in the node. 
    2183         Can be either a measure XXXXX or one of 
     2179        Can be either a :class:`Orange.feature.scoring.Measure` or one of 
    21842180        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),  
    21852181        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),  
    21862182        "gini" (:class:`Orange.feature.scoring.Gini`), 
    21872183        "relief" (:class:`Orange.feature.scoring.Relief`), 
    2188         "retis" (:class: `Orange.feature.scoring.MSE`). Default: "gainRatio". 
     2184        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio". 
    21892185 
    21902186    .. attribute:: reliefM, reliefK 
     
    22222218 
    22232219        Induction stops when the proportion of majority class in the 
    2224         node exceeds the value set by this parameter(default: 1.0). E.g. 
    2225         to stop the induction as soon as the majority class reaches 70%, 
    2226         you should say  
    2227         :samp:`tree2 = Orange.classification.tree.TreeLearner(data, maxMajority=0.7)` 
    2228  
    2229         This is an example of the tree on iris data set, built with 
    2230         XXXXXXXXX what above arguments XXXXXXXXX 
    2231         the above arguments - the numbers show the majority class  
    2232         proportion at each node. You can find more details in the  
    2233         script tree2.py, which induces and prints this tree. 
     2220        node exceeds the value set by this parameter(default: 1.0).  
     2221        To stop the induction as soon as the majority class reaches 70%, 
      2222        you should use :samp:`maxMajority = 0.7`. 
     2223 
     2224        This is an example of the tree on iris data set, with  
     2225        :samp:`maxMajority = 0.7`. The numbers show the majority class  
     2226        proportion at each node. The script `tree2.py`_ induces and  
     2227        prints this tree. 
     2228 
     2229        .. _tree2.py: code/tree2.py 
    22342230 
    22352231        :: 
    22362232 
    22372233            root: 0.333 
    2238             |    petal width<0.800: 1.000 
    2239             |    petal width>=0.800: 0.500 
    2240             |    |    petal width<1.750: 0.907 
    2241             |    |    petal width>=1.750: 0.978 
    2242      
     2234            |    petal width<=0.800: 1.000 
     2235            |    petal width>0.800: 0.500 
     2236            |    |    petal width<=1.750: 0.907 
     2237            |    |    petal width>1.750: 0.978 
     2238    
    22432239    .. attribute:: stop 
    22442240 
     
    22482244        :class:`StopCriteria` for more info on this function.  
    22492245        When used, parameters  :obj:`maxMajority` and :obj:`minExamples`  
    2250         will not be  considered (default: None). XXXXX To je pisalo spodaj. 
    2251         The default stopping criterion stops induction when all examples in a node belong to the same class. 
     2246        will not be  considered (default: None).  
     2247        The default stopping criterion stops induction when all examples  
     2248        in a node belong to the same class. 
    22522249 
    22532250    .. attribute:: mForPruning 
     
    29852982    def count_nodes(self): 
    29862983        """ 
    2987         Return the number of nodes of tree. 
     2984        Return the number of nodes. 
    29882985        """ 
    29892986        return _countNodes(self.tree) 
     
    29912988    def count_leaves(self): 
    29922989        """ 
    2993         Return the number of leaves in the tree. 
     2990        Return the number of leaves. 
    29942991        """ 
    29952992        return _countLeaves(self.tree) 
  • orange/doc/Orange/rst/code/orngTree1.py

    r7522 r7713  
    1010for format in formats: 
    1111    print '\n\n*** FORMAT: "%s"\n' % format 
    12     Orange.classification.tree.printTree(tree, leafStr=format) 
     12    print tree.dump(leafStr=format) 
    1313 
    1414formats2 = [("%V", "."), ('%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', '.'), ("%V   %D %.2DbP %.2dbP", "%D %.2DbP %.2dbP")] 
    1515for fl, fn in formats2: 
    16     Orange.classification.tree.printTree(tree, leafStr=fl, nodeStr=fn) 
     16    print tree.dump(leafStr=fl, nodeStr=fn) 
    1717 
    1818 
     
    2222for format in formats: 
    2323    print '\n\n*** FORMAT: "%s"\n' % format 
    24     Orange.classification.tree.printTree(tree, leafStr=format) 
     24    print tree.dump(leafStr=format) 
    2525 
    2626formats2 = [("[SE: %E]\t %V %I(90)", "[SE: %E]"), ("%C<22 (%cbP<22)", "."), ("%C![20,22] (%^cbP![20,22]%)", ".")] 
    2727for fl, fn in formats2: 
    28     Orange.classification.tree.printTree(tree, leafStr=fl, nodeStr=fn) 
     28    print tree.dump(leafStr=fl, nodeStr=fn) 
  • orange/doc/Orange/rst/code/orngTree2.py

    r7474 r7713  
    3131    +"B"+Orange.classification.tree.by), replaceB)] 
    3232             
    33 Orange.classification.tree.printTree(tree, leafStr="%V %^B% (%^3.2BbP%)", userFormats=myFormat) 
     33print tree.dump(leafStr="%V %^B% (%^3.2BbP%)", userFormats=myFormat) 
  • orange/doc/Orange/rst/code/tree3.py

    r7474 r7713  
    1616f = lambda examples, weightID, contingency: defStop(examples, weightID, contingency) or randint(1, 5) == 1 
    1717l = Orange.classification.tree.TreeLearner(data, stop=f) 
    18 Orange.classification.tree.printTxt(l) 
     18print l.dump() 
    1919 
    2020print "\nRANDOM STOPING:" 
    2121f = lambda x,y,z: randint(1, 5)==1 
    2222l = Orange.classification.tree.TreeLearner(data, stop=f) 
    23 Orange.classification.tree.printTxt(l) 
     23print l.dump() 
  • orange/doc/Orange/rst/code/treelearner.py

    r7474 r7713  
    88 
    99data = Orange.data.Table("lenses") 
    10 learner = Orange.classification.tree.TreeLearnerBase() 
     10learner = Orange.classification.tree.TreeLearner() 
    1111 
    1212def printTree0(node, level): 
     
    3535        raise TypeError, "invalid parameter" 
    3636 
    37 print learner.split 
    38 learner(data) 
    39 print learner.split 
    40  
    4137learner.stop = Orange.classification.tree.StopCriteria_common() 
    4238print learner.stop.maxMajority, learner.stop.minExamples 
     
    4541learner.stop.minExamples = 5.0 
    4642tree = learner(data) 
    47 Orange.classification.tree.printTree(tree) 
     43print tree.dump() 
    4844 
    4945print "\n\nTree with maxMajority = 0.5" 
    5046learner.stop.maxMajority = 0.5 
    5147tree = learner(data) 
    52 Orange.classification.tree.printTree(tree) 
     48print tree.dump() 
  • orange/doc/Orange/rst/code/treestructure.py

    r7474 r7713  
    88 
    99data = Orange.data.Table("lenses") 
    10 treeClassifier = Orange.classification.tree.TreeLearnerBase(data) 
     10treeClassifier = Orange.classification.tree.TreeLearner(data) 
    1111 
    1212def treeSize(node): 
     
    5050 
    5151print "\n\nUnpruned tree" 
    52 printTree(treeClassifier) 
     52print treeClassifier.dump() 
    5353 
    5454def cutTree(node, level): 
     
    6464print "\n\nPruned tree" 
    6565cutTree(treeClassifier.tree, 2) 
    66 printTree(treeClassifier) 
     66print treeClassifier.dump() 
    6767 