Timestamp:
02/25/12 22:41:40 (2 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Message:

Changed the hierarchy of sections in documentation about trees.

File:
1 edited

  • docs/reference/rst/Orange.classification.tree.rst

    r9372 r10371  
    1 .. automodule:: Orange.classification.tree 
     1.. py:currentmodule:: Orange.classification.tree 
     2 
     3.. index:: classification tree 
     4 
     5.. index:: 
     6   single: classification; tree 
     7 
     8******************************* 
     9Classification trees (``tree``) 
     10******************************* 
     11 
     12Orange includes multiple implementations of classification tree learners: 
     13a very flexible :class:`TreeLearner`, a fast :class:`SimpleTreeLearner`, 
     14and a :class:`C45Learner`, which uses the C4.5 tree induction 
     15algorithm. 
     16 
     17The following code builds a :obj:`TreeClassifier` on the Iris data set 
     18(with the depth limited to three levels): 
     19 
     20.. literalinclude:: code/orngTree1.py 
     21   :lines: 1-4 
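
The included lines are essentially equivalent to the following sketch (a
minimal illustration; the exact file contents may differ)::

    import Orange

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)
    print tree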
     22 
     23See `Decision tree learning 
     24<http://en.wikipedia.org/wiki/Decision_tree_learning>`_ on Wikipedia 
      25for an introduction to classification trees.
     26 
     27============================== 
     28Component-based Tree Inducer 
     29============================== 
     30 
     31.. autoclass:: TreeLearner 
     32    :members: 
     33 
     34.. autoclass:: TreeClassifier 
     35    :members: 
     36 
     37.. class:: Node 
     38 
      39    A classification tree is a hierarchy of :obj:`Node` objects.
     40 
     41    Node stores the instances belonging to the node, a branch selector, 
     42    a list of branches (if the node is not a leaf) with their descriptions 
     43    and strengths, and a classifier. 
     44 
     45    .. attribute:: distribution 
     46     
     47        A distribution of learning instances. 
     48 
     49    .. attribute:: contingency 
     50 
     51        Complete contingency matrices for the learning instances. 
     52 
     53    .. attribute:: instances, weightID 
     54 
     55        Learning instances and the ID of weight meta attribute. The root 
     56        of the tree actually stores all instances, while other nodes 
      57        store only references to instances in the root node.
     58 
     59    .. attribute:: node_classifier 
     60 
     61        A classifier for instances coming to the node. If the node is a 
     62        leaf, it chooses the class (or class distribution) of an instance. 
     63 
     64    Internal nodes have additional attributes. The lists :obj:`branches`, 
     65    :obj:`branch_descriptions` and :obj:`branch_sizes` are of the 
     66    same length. 
     67 
     68    .. attribute:: branches 
     69 
     70        A list of subtrees. Each element is a :obj:`Node` or None. 
     71        If None, the node is empty. 
     72 
     73    .. attribute:: branch_descriptions 
     74 
     75        A list with strings describing branches. They are constructed 
     76        by :obj:`SplitConstructor`. A string can contain anything, 
     77        for example 'red' or '>12.3'. 
     78 
     79    .. attribute:: branch_sizes 
     80 
      81        The (weighted) number of training instances for
     82        each branch. It can be used, for instance, for modeling 
     83        probabilities when classifying instances with unknown values. 
     84 
     85    .. attribute:: branch_selector 
     86 
     87        A :obj:`~Orange.classification.Classifier` that returns a branch 
     88        for each instance (as 
     89        :obj:`Orange.data.Value` in ``[0, len(branches)-1]``).  When an 
     90        instance cannot be classified unambiguously, the selector can 
     91        return a discrete distribution, which proposes how to divide 
     92        the instance between the branches. Whether the proposition will 
     93        be used depends upon the :obj:`Splitter` (for learning) 
     94        or :obj:`Descender` (for classification). 
     95 
     96    .. method:: tree_size() 
     97         
     98        Return the number of nodes in the subtrees (including the node, 
     99        excluding null-nodes). 
     100 
     101-------- 
     102Examples 
     103-------- 
     104 
     105Tree Structure 
     106============== 
     107 
     108This example works with the lenses data set: 
     109 
     110    >>> import Orange 
     111    >>> lenses = Orange.data.Table("lenses") 
     112    >>> tree_classifier = Orange.classification.tree.TreeLearner(lenses) 
     113 
     114The following function counts the number of nodes in a tree: 
     115 
     116    >>> def tree_size(node): 
     117    ...    if not node: 
     118    ...        return 0 
     119    ...    size = 1 
     120    ...    if node.branch_selector: 
     121    ...        for branch in node.branches: 
     122    ...            size += tree_size(branch) 
     123    ...    return size 
     124 
      125If node is None, the function above returns 0. Otherwise, the size is 1
      126(this node) plus the sizes of all subtrees. The algorithm needs to check
      127if a node is internal (it has a :obj:`~Node.branch_selector`), as leaves
      128don't have the :obj:`~Node.branches` attribute.
     129 
     130    >>> tree_size(tree_classifier.tree) 
     131    15 
     132 
     133Note that a :obj:`Node` already has a built-in method 
     134:func:`~Node.tree_size`. 
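
Assuming the classifier built above, the built-in method should give the
same count as the hand-written function::

    >>> tree_classifier.tree.tree_size()
    15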
     135 
     136Trees can be printed with a simple recursive function: 
     137 
     138    >>> def print_tree0(node, level): 
     139    ...     if not node: 
     140    ...         print " "*level + "<null node>" 
     141    ...         return 
     142    ...     if node.branch_selector: 
     143    ...         node_desc = node.branch_selector.class_var.name 
     144    ...         node_cont = node.distribution 
     145    ...         print "\\n" + "   "*level + "%s (%s)" % (node_desc, node_cont), 
     146    ...         for i in range(len(node.branches)): 
     147    ...             print "\\n" + "   "*level + ": %s" % node.branch_descriptions[i], 
     148    ...             print_tree0(node.branches[i], level+1) 
     149    ...     else: 
     150    ...         node_cont = node.distribution 
     151    ...         major_class = node.node_classifier.default_value 
     152    ...         print "--> %s (%s) " % (major_class, node_cont), 
     153 
     154The crux of the example is not in the formatting (\\n's etc.); 
     155what matters is everything but the print statements. The code 
     156separately handles three node types: 
     157 
     158* For null nodes (a node to which no learning instances were classified), 
     159  it just prints "<null node>". 
     160* For internal nodes, it prints a node description: 
      161  the feature's name and the distribution of classes. :obj:`Node`'s
      162  branch selector is a :obj:`~Orange.classification.Classifier`,
      163  and its ``class_var`` is the feature whose name is printed.  Class
      164  distributions are printed as well (they are assumed to be stored).
      165  :obj:`print_tree0` is recursively called for each branch, with the
      166  level increased by 1 to deepen the indent.
     167* If the node is a leaf, it prints the distribution of learning instances 
     168  in the node and the class to which the instances in the node would 
     169  be classified. We assume that the :obj:`~Node.node_classifier` is a 
     170  :obj:`DefaultClassifier`. A better print function should be aware of 
     171  possible alternatives. 
     172 
     173The wrapper function that accepts either a 
     174:obj:`TreeClassifier` or a :obj:`Node` can be written as follows: 
     175 
     176    >>> def print_tree(x): 
     177    ...     if isinstance(x, Orange.classification.tree.TreeClassifier): 
     178    ...         print_tree0(x.tree, 0) 
     179    ...     elif isinstance(x, Orange.classification.tree.Node): 
     180    ...         print_tree0(x, 0) 
     181    ...     else: 
     182    ...         raise TypeError, "invalid parameter" 
     183 
      184It's straightforward: if ``x`` is a
      185:obj:`TreeClassifier`, it prints ``x.tree``; if it's a :obj:`Node`, it
      186prints ``x``. If it's of some other type,
      187an exception is raised. The output:
     188 
     189    >>> print_tree(tree_classifier) 
     190    <BLANKLINE> 
     191    tear_rate (<15.000, 4.000, 5.000>)  
     192    : normal  
     193       astigmatic (<3.000, 4.000, 5.000>)  
     194       : no  
     195          age (<1.000, 0.000, 5.000>)  
     196          : pre-presbyopic --> soft (<0.000, 0.000, 2.000>)   
     197          : presbyopic  
     198             prescription (<1.000, 0.000, 1.000>)  
     199             : hypermetrope --> soft (<0.000, 0.000, 1.000>)   
     200             : myope --> none (<1.000, 0.000, 0.000>)   
     201          : young --> soft (<0.000, 0.000, 2.000>)   
     202       : yes  
     203          prescription (<2.000, 4.000, 0.000>)  
     204          : hypermetrope  
     205             age (<2.000, 1.000, 0.000>)  
     206             : pre-presbyopic --> none (<1.000, 0.000, 0.000>)   
     207             : presbyopic --> none (<1.000, 0.000, 0.000>)   
     208             : young --> hard (<0.000, 1.000, 0.000>)   
     209          : myope --> hard (<0.000, 3.000, 0.000>)   
     210    : reduced --> none (<12.000, 0.000, 0.000>)  
     211 
     212The tree structure examples conclude with a simple pruning function, 
     213written entirely in Python and unrelated to any :class:`Pruner`. It limits 
     214the tree depth (the number of internal nodes on any path down the tree). 
      215For example, to get a two-level tree, call ``cut_tree(root, 2)``. The function
     216is recursive, with the second argument (level) decreasing at each call; 
     217when zero, the current node will be made a leaf: 
     218 
     219    >>> def cut_tree(node, level): 
     220    ...     if node and node.branch_selector: 
     221    ...         if level: 
     222    ...             for branch in node.branches: 
     223    ...                 cut_tree(branch, level-1) 
     224    ...         else: 
     225    ...             node.branch_selector = None 
     226    ...             node.branches = None 
     227    ...             node.branch_descriptions = None 
     228 
     229The function acts only when :obj:`node` and :obj:`node.branch_selector` 
      230are defined. If the level is not zero, it recursively calls the function
     231for each branch. Otherwise, it clears the selector, branches and branch 
     232descriptions. 
     233 
     234    >>> cut_tree(tree_classifier.tree, 2) 
     235    >>> print_tree(tree_classifier) 
     236    <BLANKLINE> 
     237    tear_rate (<15.000, 4.000, 5.000>)  
     238    : normal  
     239       astigmatic (<3.000, 4.000, 5.000>)  
     240       : no --> soft (<1.000, 0.000, 5.000>)   
     241       : yes --> hard (<2.000, 4.000, 0.000>)   
     242    : reduced --> none (<12.000, 0.000, 0.000>)  
     243 
     244Setting learning parameters 
     245=========================== 
     246 
     247Let us construct a :obj:`TreeLearner` to play with: 
     248 
     249    >>> import Orange 
     250    >>> lenses = Orange.data.Table("lenses") 
     251    >>> learner = Orange.classification.tree.TreeLearner() 
     252 
     253There are three crucial components in learning: the 
     254:obj:`~TreeLearner.split` and :obj:`~TreeLearner.stop` criteria, and the 
      255example splitter (:obj:`~TreeLearner.splitter`). The default ``stop`` is set with:
     256 
     257    >>> learner.stop = Orange.classification.tree.StopCriteria_common() 
     258 
     259The default stopping parameters are: 
     260 
     261    >>> print learner.stop.max_majority, learner.stop.min_examples 
     262    1.0 0.0 
     263 
     264The defaults only stop splitting when no instances are left or all of 
     265them are in the same class. 
     266 
      267If the minimal subset that is allowed to be split further is set to five
      268instances, the resulting tree is smaller:
     269 
     270    >>> learner.stop.min_examples = 5.0 
     271    >>> tree = learner(lenses) 
     272    >>> print tree 
     273    tear_rate=reduced: none (100.00%) 
     274    tear_rate=normal 
     275    |    astigmatic=no 
     276    |    |    age=pre-presbyopic: soft (100.00%) 
     277    |    |    age=presbyopic: none (50.00%) 
     278    |    |    age=young: soft (100.00%) 
     279    |    astigmatic=yes 
     280    |    |    prescription=hypermetrope: none (66.67%) 
     281    |    |    prescription=myope: hard (100.00%) 
     282    <BLANKLINE> 
     283 
      284We can also limit the maximal proportion of the majority class:
     285 
     286    >>> learner.stop.max_majority = 0.5 
     287    >>> tree = learner(lenses) 
     288    >>> print tree 
     289    none (62.50%) 
     290 
     291Redefining tree induction components 
     292==================================== 
     293 
      294This example shows how to use a custom stop function.  First, the
      295``def_stop`` function defines the default stop function. The first tree
      296has some added randomness: the induction also stops in 20% of the
      297cases when ``def_stop`` returns False. The stopping criterion for the
      298second tree is completely random: it stops induction in 20% of cases.
      299Note that in the second case the lambda function still takes three
      300parameters, even though it does not need any, since that is the
      301signature required for :obj:`~TreeLearner.stop`.
     302 
     303.. literalinclude:: code/tree3.py 
     304   :lines: 8-23 
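
The gist of the included code is roughly as follows (a sketch only; names
and constants are illustrative, see ``code/tree3.py`` for the actual file)::

    import random
    import Orange

    lenses = Orange.data.Table("lenses")

    # the default stopping criterion, reused as a building block
    def_stop = Orange.classification.tree.StopCriteria()

    # stop when the default criterion says so, or at random in 20% of cases
    f = lambda instances, weight_id, contingency: \
        def_stop(instances, weight_id, contingency) or random.random() < 0.2
    tree = Orange.classification.tree.TreeLearner(lenses, stop=f)

    # a completely random criterion; the lambda still takes three
    # parameters, as required by the stop signature
    f_random = lambda x, y, z: random.random() < 0.2
    tree_random = Orange.classification.tree.TreeLearner(lenses, stop=f_random)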
     305 
     306--------------------------------- 
     307Learner and Classifier Components 
     308--------------------------------- 
     309 
     310Split constructors 
     311===================== 
     312 
     313.. class:: SplitConstructor 
     314 
      315    Decide how to divide learning instances, i.e. define the branching criterion.
     316     
      317    The :obj:`SplitConstructor` should use the domain
      318    contingency when possible, both for speed and adaptability.
      319    Sometimes the domain contingency does
      320    not suffice, for example when the ReliefF score is used.
     321 
     322    A :obj:`SplitConstructor` can veto further tree induction by returning 
     323    no classifier. This is generally related to the number of learning 
     324    instances that would go in each branch. If there are no splits with 
     325    more than :obj:`SplitConstructor.min_subset` instances in the branches 
     326    (null nodes are allowed), the induction is stopped. 
     327 
      328    Split constructors that cannot handle a particular feature
      329    type (discrete, continuous) quietly skip such features. When in doubt, use
     330    :obj:`SplitConstructor_Combined`, which delegates features to 
     331    specialized split constructors. 
     332 
     333    The same split constructors can be used both for classification and 
     334    regression, if the chosen score (for :obj:`SplitConstructor_Score` 
     335    and derived classes) supports both. 
     336 
     337    .. attribute:: min_subset 
     338 
      339        The minimal (weighted) number of instances in non-null leaves.
     340 
     341    .. method:: __call__(data, [ weightID, contingency, apriori_distribution, candidates, clsfr])  
     342 
      343        :param data: learning instances in any acceptable form.
     344        :param weightID: Optional; the default of 0 means that all 
     345            instances have a weight of 1.0.  
     346        :param contingency: a domain contingency 
     347        :param apriori_distribution: apriori class probabilities. 
     348        :type apriori_distribution: :obj:`Orange.statistics.distribution.Distribution` 
     349        :param candidates: only consider these  
     350            features (one boolean for each feature). 
     351        :param clsfr: a node classifier (if it was constructed, that is,  
     352            if :obj:`store_node_classifier` is True)  
     353 
     354        Construct a split. Return a tuple (:obj:`branch_selector`, 
     355        :obj:`branch_descriptions` (a list), :obj:`subset_sizes` 
     356        (the number of instances for each branch, may also be 
     357        empty), :obj:`quality` (higher numbers mean better splits), 
     358        :obj:`spent_feature`). If no split is constructed, 
      359        the :obj:`branch_selector`, :obj:`branch_descriptions` and
     360        :obj:`subset_sizes` are None, while :obj:`quality` is 0.0 and 
     361        :obj:`spent_feature` is -1. 
     362 
     363        If the chosen feature will be useless in the future and 
     364        should not be considered for splitting in any of the subtrees 
     365        (typically, when discrete features are used as-they-are, without 
     366        any binarization or subsetting), then it should return the index 
     367        of this feature as :obj:`spent_feature`. If no features are spent, 
     368        :obj:`spent_feature` is -1. 
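
        A split constructor can also be invoked directly. A minimal
        sketch (assuming a scoring measure is set first; untested)::

            import Orange

            lenses = Orange.data.Table("lenses")
            sc = Orange.classification.tree.SplitConstructor_Feature()
            sc.measure = Orange.feature.scoring.InfoGain()
            selector, descriptions, sizes, quality, spent = sc(lenses)
            print descriptions, quality, spent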
     369 
     370.. class:: SplitConstructor_Score 
     371 
     372    Bases: :class:`SplitConstructor` 
     373 
      374    An abstract base class that compares splits
     375    with a :class:`Orange.feature.scoring.Score`.  All split 
     376    constructors except for :obj:`SplitConstructor_Combined` are derived 
     377    from this class. 
     378 
     379    .. attribute:: measure 
     380 
     381        A :class:`Orange.feature.scoring.Score` for split evaluation. It 
     382        has to handle the class type - for example, you cannot use 
     383        :class:`~Orange.feature.scoring.GainRatio` for regression or 
     384        :class:`~Orange.feature.scoring.MSE` for classification. 
     385 
     386    .. attribute:: worst_acceptable 
     387 
     388        The lowest allowed split quality.  The value strongly depends 
     389        on chosen :obj:`measure` component. Default is 0.0. 
     390 
     391.. class:: SplitConstructor_Feature 
     392 
     393    Bases: :class:`SplitConstructor_Score` 
     394 
     395    Each value of a discrete feature corresponds to a branch.  The feature 
      396    with the highest score (:obj:`measure`) is selected. When
      397    tied, a random feature is selected.
     398 
     399    The constructed :obj:`branch_selector` is an instance of 
      400    :obj:`orange.ClassifierFromVarFD`, which returns the value of the selected
      401    feature. :obj:`branch_descriptions` contains the feature's
     402    values. The feature is marked as spent (it cannot reappear in the 
     403    node's subtrees). 
     404 
     405.. class:: SplitConstructor_ExhaustiveBinary 
     406 
     407    Bases: :class:`SplitConstructor_Score` 
     408 
     409    Finds the binarization with the highest score among all features. In 
     410    case of ties, a random feature is selected. 
     411 
      412    The constructed :obj:`branch_selector` is an instance of
      413    :obj:`orange.ClassifierFromVarFD`, which returns the value of the
     414    selected feature. Its :obj:`transformer` contains a ``MapIntValue`` 
     415    that maps values of the feature into a binary feature. Branches 
     416    with a single feature value are described with that value and 
     417    branches with more than one are described with ``[<val1>, <val2>, 
     418    ..., <valn>]``. Only binary features are marked as spent. 
     419 
     420.. class:: SplitConstructor_Threshold 
     421 
     422    Bases: :class:`SplitConstructor_Score` 
     423 
     424    The only split constructor for continuous features. It divides the 
     425    range of feature values with a threshold that maximizes the split's 
      426    quality. The feature that yields the best binary split is returned;
      427    in case of ties, a random feature is selected.
     428 
     429    The constructed :obj:`branch_selector` is an instance of 
     430    :obj:`orange.ClassifierFromVarFD` with an attached :obj:`transformer`, 
     431    of type :obj:`Orange.feature.discretization.ThresholdDiscretizer`. The 
     432    branch descriptions are "<threshold" and ">=threshold". The feature 
     433    is not spent. 
     434 
     435.. class:: SplitConstructor_OneAgainstOthers 
     436     
     437    Bases: :class:`SplitConstructor_Score` 
     438 
     439    Undocumented. 
     440 
     441.. class:: SplitConstructor_Combined 
     442 
     443    Bases: :class:`SplitConstructor` 
     444 
     445    Uses different split constructors for discrete and continuous 
     446    features. Each split constructor is called with appropriate 
     447    features. Both construct a candidate for a split; the better of them 
     448    is used. 
     449 
      450    The choice of the split is not probabilistically fair when
      451    multiple candidates have the same score. For example, if there
      452    are nine discrete features with the highest score, the split
      453    constructor for discrete features will select one of them. Now,
     454    if there is also a single continuous feature with the same score, 
     455    :obj:`SplitConstructor_Combined` would randomly select between the 
     456    proposed discrete feature and continuous feature, unaware that the 
     457    discrete feature  has already competed with eight others.  So, 
     458    the probability for selecting (each) discrete feature would be 
     459    1/18 instead of 1/10. Although incorrect, this should not affect 
     460    the performance. 
     461 
      462    .. attribute:: discrete_split_constructor
     463 
     464        Split constructor for discrete features;  
     465        for instance, :obj:`SplitConstructor_Feature` or 
     466        :obj:`SplitConstructor_ExhaustiveBinary`. 
     467 
      468    .. attribute:: continuous_split_constructor
     469 
      470        Split constructor for continuous features; it
      471        can be either :obj:`SplitConstructor_Threshold` or
      472        a custom split constructor.
     473 
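A typical combined constructor pairs a split constructor for discrete
features with the threshold constructor for continuous ones. A sketch
(the chosen components are just an example)::

    import Orange

    sc = Orange.classification.tree.SplitConstructor_Combined()
    sc.discrete_split_constructor = \
        Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    sc.continuous_split_constructor = \
        Orange.classification.tree.SplitConstructor_Threshold()
    learner = Orange.classification.tree.TreeLearner(split=sc)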
     474 
     475StopCriteria and StopCriteria_common 
     476============================================ 
     477 
     478:obj:`StopCriteria` determines when to stop the induction of subtrees.  
     479 
     480.. class:: StopCriteria 
     481 
     482    Provides the basic stopping criteria: the tree induction stops 
     483    when there is at most one instance left (the actual, not weighted, 
     484    number). The induction also stops when all instances are in the 
     485    same class (for discrete problems) or have the same outcome value 
     486    (for regression problems). 
     487 
     488    .. method:: __call__(instances[, weightID, domain contingencies]) 
     489 
      490        Return True (stop) or False (continue the induction).
     491        Contingencies are not used for counting. Derived classes should 
     492        use the contingencies whenever possible. 
     493 
     494.. class:: StopCriteria_common 
     495 
     496    Pre-pruning with additional criteria. 
     497 
     498    .. attribute:: max_majority 
     499 
     500        Maximum proportion of majority class. When exceeded, 
     501        induction stops. 
     502 
     503    .. attribute:: min_instances 
     504 
      505        Minimum number of instances for splitting. Subsets with fewer
      506        than :obj:`min_instances` instances are not split further.
      507        The instance count is weighted.
     508 
     509 
     510Splitters 
     511================= 
     512 
     513Splitters sort learning instances into branches (the branches are selected 
     514with a :obj:`SplitConstructor`, while a :obj:`Descender` decides the 
     515branch for an instance during classification). 
     516 
     517Most splitters call :obj:`Node.branch_selector` and assign 
     518instances correspondingly. When the value is unknown they choose a 
     519particular branch or skip the instance. 
     520 
      521Some splitters can also split instances: a weighted instance is
      522used in more than one subset. Each branch has a weight ID (usually
      523its own) and all instances in that branch carry this meta attribute.
     524 
      525An instance that
      526hasn't been split has only one additional attribute (the weight
      527ID corresponding to the subset to which it went). An instance that is split
      528among, say, three subsets has three new meta attributes, one for each
      529subset. The weights are used only when needed; when there is no
      530splitting, no weight IDs are returned.
     531 
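For instance, to put ambiguous instances into all branches during
learning, set the learner's splitter (a sketch; any splitter from this
section can be plugged in the same way)::

    import Orange

    learner = Orange.classification.tree.TreeLearner()
    learner.splitter = Orange.classification.tree.Splitter_UnknownsToAll()
    tree = learner(Orange.data.Table("lenses"))
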
     532.. class:: Splitter 
     533 
     534    An abstract base class that splits instances 
     535    into subsets. 
     536 
     537    .. method:: __call__(node, instances[, weightID]) 
     538 
     539        :param node: a node. 
     540        :type node: :obj:`Node` 
     541        :param instances: a set of instances 
     542        :param weightID: weight ID.  
     543         
     544        Use the information in :obj:`Node` (particularly the 
     545        :obj:`~Node.branch_selector`) to split the given set of instances into 
     546        subsets.  Return a tuple with a list of instance subsets and 
     547        a list of weights.  The list of weights is either a 
     548        list of integers or None when no weights are added. 
     549 
     550.. class:: Splitter_IgnoreUnknowns 
     551 
     552    Bases: :class:`Splitter` 
     553 
     554    Ignores the instances for which no single branch can be determined. 
     555 
     556.. class:: Splitter_UnknownsToCommon 
     557 
     558    Bases: :class:`Splitter` 
     559 
      560    Places all ambiguous instances into the branch with the highest number of
     561    instances. If there is more than one such branch, one is selected at 
     562    random and then used for all instances. 
     563 
     564.. class:: Splitter_UnknownsToAll 
     565 
     566    Bases: :class:`Splitter` 
     567 
     568    Splits instances with an unknown value of the feature into all branches. 
     569 
     570.. class:: Splitter_UnknownsToRandom 
     571 
     572    Bases: :class:`Splitter` 
     573 
     574    Selects a random branch for ambiguous instances. 
     575 
     576.. class:: Splitter_UnknownsToBranch 
     577 
     578    Bases: :class:`Splitter` 
     579 
     580    Constructs an additional branch for ambiguous instances.  
     581    The branch's description is "unknown". 
     582 
     583.. class:: Splitter_UnknownsAsBranchSizes 
     584 
     585    Bases: :class:`Splitter` 
     586 
      587    Splits instances with an unknown value of the feature according to
      588    the proportions of instances in each branch.
     589 
     590.. class:: Splitter_UnknownsAsSelector 
     591 
     592    Bases: :class:`Splitter` 
     593 
      594    Splits instances with an unknown value of the feature according to
      595    the distribution proposed by the selector (usually the same as the
      596    proportions of instances in branches).
     597 
     598Descenders 
     599============================= 
     600 
      601Descenders decide where instances that cannot be unambiguously put
      602into a single branch should go during classification (the branches are
      603selected with a :obj:`SplitConstructor`, while a :obj:`Splitter` sorts
      604instances during learning).
     605 
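For example, to let subtrees vote on instances with unknown values during
classification, set the learner's descender (a sketch; any descender
listed below can be used)::

    import Orange

    learner = Orange.classification.tree.TreeLearner()
    learner.descender = \
        Orange.classification.tree.Descender_UnknownMergeAsBranchSizes()
    tree = learner(Orange.data.Table("lenses"))
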
     606.. class:: Descender 
     607 
      608    An abstract base tree descender. It descends
      609    an instance as deep as possible, according to the values
      610    of the instance's features. The :obj:`Descender` calls the node's
      611    :obj:`~Node.branch_selector` to get the branch index. If it's a
     612    simple index, the corresponding branch is followed. If not, the 
     613    descender decides what to do. A descender can choose a single 
     614    branch (for instance, the one that is the most recommended by the 
     615    :obj:`~Node.branch_selector`) or it can let the branches vote. 
     616 
      617    There are three possible outcomes of a descent:
     618 
     619    #. The descender reaches a leaf. This happens when 
     620       there were no unknown or out-of-range values, or when the 
      621       descender selected a single branch and continued the descent
     622       despite them. The descender returns the :obj:`Node` it has reached. 
     623    #. Node's :obj:`~Node.branch_selector` returned a distribution and 
      624       :obj:`Descender` decided to stop the descent at this (internal)
     625       node. It returns the current :obj:`Node`. 
     626    #. Node's :obj:`~Node.branch_selector` returned a distribution and the 
     627       :obj:`Node` wants to split the instance (i.e., to decide the class 
     628       by voting). It returns a :obj:`Node` and the vote-weights for 
     629       the branches.  The weights can correspond, for example,  to the 
     630       distribution returned by node's :obj:`~Node.branch_selector`, or to 
     631       the number of learning instances that were assigned to each branch. 
     632 
     633    .. method:: __call__(node, instance) 
     634 
     635        Descends until it reaches a leaf or a node in 
     636        which a vote of subtrees is required. A tuple 
     637        of two elements is returned. If it reached a leaf, the tuple contains 
     638        the leaf node and None. If not, it contains a node and 
     639        a list of floats (weights of votes). 
     640 
     641.. class:: Descender_UnknownToNode 
     642 
     643    Bases: :obj:`Descender` 
     644 
      645    When an instance cannot be classified into a single branch, the current
     646    node is returned. Thus, the node's :obj:`~Node.node_classifier` 
     647    will be used to make a decision. Therefore, internal nodes 
     648    need to have :obj:`Node.node_classifier` defined. 
     649 
     650.. class:: Descender_UnknownToBranch 
     651 
     652    Bases: :obj:`Descender` 
     653 
     654    Classifies instances with unknown value to a special branch. This 
     655    makes sense only if the tree itself was constructed with 
     656    :obj:`Splitter_UnknownsToBranch`. 
     657 
     658.. class:: Descender_UnknownToCommonBranch 
     659 
     660    Bases: :obj:`Descender` 
     661 
     662    Classifies instances with unknown values to the branch with the 
     663    highest number of instances. If there is more than one such branch, 
      664    a random branch is chosen for each instance.
     665 
     666.. class:: Descender_UnknownToCommonSelector 
     667 
     668    Bases: :obj:`Descender` 
     669 
     670    Classifies instances with unknown values to the branch which received 
     671    the highest recommendation by the selector. 
     672 
     673.. class:: Descender_UnknownMergeAsBranchSizes 
     674 
     675    Bases: :obj:`Descender` 
     676 
     677    The subtrees vote for the instance's class; the vote is weighted 
     678    according to the sizes of the branches. 
     679 
     680.. class:: Descender_UnknownMergeAsSelector 
     681 
     682    Bases: :obj:`Descender` 
     683 
     684    The subtrees vote for the instance's class; the vote is weighted 
      685    according to the selector's proposal.
     686 
     687Pruning 
     688======= 
     689 
     690.. index:: 
     691    pair: classification trees; pruning 
     692 
      693The pruners construct a shallow copy of a tree. The pruned tree's
      694nodes contain references to the same contingency matrices,
      695node classifiers, branch selectors, etc. as the original tree.
     696 
     697Pruners cannot construct a new :obj:`~Node.node_classifier`.  Thus, for 
     698pruning, internal nodes must have :obj:`~Node.node_classifier` defined 
     699(the default). 
     700 
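A pruner is applied to an induced tree; the original is left intact. A
minimal sketch (the value of ``m`` is arbitrary)::

    import Orange

    lenses = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(lenses)

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    pruned = pruner(tree.tree)   # prunes the root Node, returns a pruned copy
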
     701.. class:: Pruner 
     702 
     703    An abstract base tree pruner. 
     704 
     705    .. method:: __call__(tree) 
     706 
      707        :param tree: either
      708            a :obj:`Node` or the C++ version of the classifier,
      709            saved in :obj:`TreeClassifier.base_classifier`.
     710 
     711        The resulting pruned tree is of the same type as the argument. 
     712        The original tree remains intact. 
     713 
     714.. class:: Pruner_SameMajority 
     715 
     716    Bases: :class:`Pruner` 
     717 
      718    A tree can have subtrees in which all the leaves have
     719    the same majority class. This is allowed because leaves can still 
     720    have different class distributions and thus predict different 
     721    probabilities.  The :obj:`Pruner_SameMajority` prunes the tree so 
     722    that there is no subtree in which all the nodes would have the same 
     723    majority class. 
     724 
     725    This pruner will only prune the nodes in which the node classifier 
     726    is a :obj:`~Orange.classification.ConstantClassifier` 
     727    (or a derived class). 
     728 
      729    The pruning works from the leaves to the root.
      730    If siblings have (at least one) common majority class, they can be pruned.
     731 
     732.. class:: Pruner_m 
     733 
     734    Bases: :class:`Pruner` 
     735 
     736    Prunes a tree by comparing m-estimates of static and dynamic  
     737    error as defined in (Bratko, 2002). 
     738 
     739    .. attribute:: m 
     740 
     741        Parameter m for m-estimation. 
     742 
     743Printing the tree 
     744================= 
     745 
     746The tree printing functions are very flexible. They can print, for 
     747example, numbers of instances, proportions of majority class in nodes 
     748and similar, or more complex statistics like the proportion of instances 
     749in a particular class divided by the proportion of instances of this 
     750class in a parent node. Users may also pass their own functions to print 
     751certain elements. 
     752 
      753The easiest way to print the tree is to print the :obj:`TreeClassifier`::
     754 
     755    >>> print tree 
     756    petal width<0.800: Iris-setosa (100.00%) 
     757    petal width>=0.800 
     758    |    petal width<1.750 
     759    |    |    petal length<5.350: Iris-versicolor (94.23%) 
     760    |    |    petal length>=5.350: Iris-virginica (100.00%) 
     761    |    petal width>=1.750 
     762    |    |    petal length<4.850: Iris-virginica (66.67%) 
     763    |    |    petal length>=4.850: Iris-virginica (100.00%) 
     764 
     765 
     766Format string 
     767------------- 
     768 
      769Format strings are printed at every leaf or internal node, with the
      770format specifiers replaced by data from the tree node. Specifiers are
      771generally of the form **%[^]<precision><quantity><divisor>**.
     772 
      773**^** at the start indicates that the number should be multiplied by 100,
      774which is useful for printing proportions as percentages.
     775 
     776**<precision>** is in the same format as in Python (or C) string 
     777formatting. For instance, ``%N`` denotes the number of instances in 
     778the node, hence ``%6.2N`` would mean output to two decimal digits 
     779and six places altogether. If left out, a default format ``5.3`` is 
     780used, unless the numbers are multiplied by 100, in which case the default 
     781is ``.0`` (no decimals, the number is rounded to the nearest integer). 
     782 
     783**<divisor>** tells what to divide the quantity in that node with. 
     784``bP`` means division by the same quantity in the parent node; for instance, 
     785``%NbP`` gives the number of instances in the node divided by the 
     786number of instances in parent node. Precision formatting can be added, 
     787e.g. ``%6.2NbP``. ``bA`` denotes division by the same quantity over the entire 
     788data set, so ``%NbA`` will tell you the proportion of instances (out 
     789of the entire training data set) that fell into that node. If division is 
     790impossible since the parent node does not exist or some data is missing, 
     791a dot is printed out instead. 
     792 
     793**<quantity>** defines what to print and is the only required element.  
     794It can be: 
     795 
     796``V`` 
     797    The predicted value at that node. Precision  
      798    or divisor cannot be defined here.
     799 
     800``N`` 
     801    The number of instances in the node. 
     802 
     803``M`` 
     804    The number of instances in the majority class (that is, the class  
     805    predicted by the node). 
     806 
     807``m`` 
     808    The proportion of instances in the majority class. 
     809 
     810``A`` 
      811    The average class for instances in the node; this is available only for
     812    regression trees. 
     813 
     814``E`` 
      815    Standard error of the class of instances in the node; available only for
     816    regression trees. 
     817 
     818``I`` 
     819    Print out the confidence interval. The modifier is used as  
      820    ``%I(95)`` or (more complicated) ``%5.3I(95)bP``.
     821 
     822``C`` 
     823    The number of instances in the given class.  For a classification 
     824    example, ``%5.3C="Iris-virginica"bP`` denotes the number of instances 
      825    of Iris-virginica divided by the number of instances of this class
      826    in the parent node (instances that are *not* Iris-virginica could be
      827    described with ``%5.3CbP!="Iris-virginica"``).
     828 
     829    For regression trees, use operators =, !=, <, <=, >, and >=, as in 
     830    ``%C<22``, with optional precision and divisor. Intervals are also 
     831    possible: ``%C[20, 22]`` gives the number of instances between 
     832    20 and 22 (inclusive) and ``%C(20, 22)`` gives the number of such 
     833    instances excluding the boundaries. Mixing of parentheses is allowed, 
     834    e.g. ``%C(20, 22]``.  Add ``!`` for instances outside the interval, 
     835    like ``%C!(20, 22]``. 
     836 
     837``c`` 
     838    Same as ``C``, except that it computes the proportion of the class 
     839    instead of the number of instances. 
     840 
     841``D`` 
     842    The number of instances in each class. Precision and the divisor 
      843    are applied to each number in the distribution.  This quantity
      844    cannot be computed for regression trees.
     845 
     846``d`` 
     847    Same as ``D``, except that it shows proportions of instances. 
     848 
     849<user defined formats> 
      850    Instructions and examples of user-defined formats are at the end of this
     851    section. 
     852 
     853.. rubric:: Examples on classification trees 
     854 
     855A tree on the iris data set with the depth limited to three 
     856levels is built as follows: 
     857     
     858.. literalinclude:: code/orngTree1.py 
     859   :lines: 1-4 
     860 
      861Printing the predicted class at each node, together with the number
      862of instances in the majority class and the total number of instances in
      863the node, requires a custom format string::
     864 
     865    >>> print tree.to_string(leaf_str="%V (%M out of %N)") 
     866    petal width<0.800: Iris-setosa (50.000 out of 50.000) 
     867    petal width>=0.800 
     868    |    petal width<1.750 
     869    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000) 
     870    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000) 
     871    |    petal width>=1.750 
     872    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000) 
     873    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000) 
     874 
     875The number of instances as 
     876compared to the entire data set and to the parent node:: 
     877 
     878    >>> print tree.to_string(leaf_str="%V (%^MbA%, %^MbP%)") 
     879    petal width<0.800: Iris-setosa (100%, 100%) 
     880    petal width>=0.800 
     881    |    petal width<1.750 
     882    |    |    petal length<5.350: Iris-versicolor (98%, 100%) 
     883    |    |    petal length>=5.350: Iris-virginica (4%, 40%) 
     884    |    petal width>=1.750 
     885    |    |    petal length<4.850: Iris-virginica (4%, 4%) 
     886    |    |    petal length>=4.850: Iris-virginica (86%, 96%) 
     887 
      888``%M`` is the number of instances in the majority class. Dividing by
      889the number of all instances from this class in the entire data set
      890is described with ``%MbA``. Add ``^`` in front for multiplication by
      891100. The percent sign *after* that is printed out literally, just as the
      892comma and parentheses. For the proportion of this class in the parent,
      893``bA`` is replaced with ``bP``.
     894 
     895To print the number of versicolors in each node, together with the 
     896proportion of versicolors among the instances in this particular node 
     897and among all versicolors, use the following:: 
     898 
      899    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'
     900 
     901It gives:: 
     902 
     903    petal width<0.800: 0.000 (0% of node, 0% of versicolors) 
     904    petal width>=0.800 
     905    |    petal width<1.750 
     906    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors) 
     907    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors) 
     908    |    petal width>=1.750 
     909    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors) 
     910    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors) 
     911 
      912Finally, to print the distributions using the format string ``%D``::
     913 
     914    petal width<0.800: [50.000, 0.000, 0.000] 
     915    petal width>=0.800 
     916    |    petal width<1.750 
     917    |    |    petal length<5.350: [0.000, 49.000, 3.000] 
     918    |    |    petal length>=5.350: [0.000, 0.000, 2.000] 
     919    |    petal width>=1.750 
     920    |    |    petal length<4.850: [0.000, 1.000, 2.000] 
     921    |    |    petal length>=4.850: [0.000, 0.000, 43.000] 
     922 
     923As the order of classes is the same as in ``data.domain.class_var.values`` 
     924(setosa, versicolor, virginica), there are 49 versicolors and 3 virginicae 
     925in the node at ``petal length<5.350``. To print the proportions within 
     926nodes rounded to two decimals use ``%.2d``:: 
     927 
     928    petal width<0.800: [1.00, 0.00, 0.00] 
     929    petal width>=0.800 
     930    |    petal width<1.750 
     931    |    |    petal length<5.350: [0.00, 0.94, 0.06] 
     932    |    |    petal length>=5.350: [0.00, 0.00, 1.00] 
     933    |    petal width>=1.750 
     934    |    |    petal length<4.850: [0.00, 0.33, 0.67] 
     935    |    |    petal length>=4.850: [0.00, 0.00, 1.00] 
     936 
      937The most trivial format string for internal nodes prints
      938node predictions. The ``.`` in the following example specifies
      939that ``node_str`` should be the same as ``leaf_str``.
     940 
     941:: 
     942 
     943    tree.to_string(leaf_str="%V", node_str=".") 
     944  
     945The output:: 
     946 
     947    root: Iris-setosa 
     948    |    petal width<0.800: Iris-setosa 
     949    |    petal width>=0.800: Iris-versicolor 
     950    |    |    petal width<1.750: Iris-versicolor 
     951    |    |    |    petal length<5.350: Iris-versicolor 
     952    |    |    |    petal length>=5.350: Iris-virginica 
     953    |    |    petal width>=1.750: Iris-virginica 
     954    |    |    |    petal length<4.850: Iris-virginica 
     955    |    |    |    petal length>=4.850: Iris-virginica 
     956 
     957A node *root* has appeared and the tree looks one level 
      958deeper. This is needed to also print the data for the tree root.
     959 
     960To observe how the number 
     961of virginicas decreases down the tree try:: 
     962 
     963    print tree.to_string(leaf_str='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', node_str='.') 
     964 
      965Interpretation: ``CbA="Iris-virginica"`` is
      966the number of virginica instances in the node, divided by the total
      967number of instances of this class. Add ``^.1`` and the result is
      968multiplied by 100 and printed with one decimal. The trailing ``%`` is
      969printed out literally. In parentheses, the same quantity is divided by
      970the count in the parent node. Single quotes delimit the format string, so
      971that double quotes inside it can specify the class.
     972 
     973:: 
     974 
     975    root: 100.0% (.%) 
     976    |    petal width<0.800: 0.0% (0.0%) 
     977    |    petal width>=0.800: 100.0% (100.0%) 
     978    |    |    petal width<1.750: 10.0% (10.0%) 
     979    |    |    |    petal length<5.350: 6.0% (60.0%) 
     980    |    |    |    petal length>=5.350: 4.0% (40.0%) 
     981    |    |    petal width>=1.750: 90.0% (90.0%) 
     982    |    |    |    petal length<4.850: 4.0% (4.4%) 
     983    |    |    |    petal length>=4.850: 86.0% (95.6%) 
     984 
     985If :meth:`~TreeClassifier.to_string` cannot compute something, in this case 
     986because the root has no parent, it prints out a dot. 
     987 
     988The final example with classification trees prints the distributions in 
     989nodes, the distribution compared to the parent, the proportions compared 
     990to the parent and the predicted class in the leaves:: 
     991 
     992    >>> print tree.to_string(leaf_str='"%V   %D %.2DbP %.2dbP"', node_str='"%D %.2DbP %.2dbP"') 
     993    root: [50.000, 50.000, 50.000] . . 
     994    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]: 
     995    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00] 
     996    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50] 
     997    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19] 
     998    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]: 
     999    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62] 
     1000    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]: 
     1001    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80] 
     1002    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96] 
     1003    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]: 
     1004    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68] 
     1005    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]: 
     1006    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02] 
     1007 
     1008 
     1009.. rubric:: Examples on regression trees 
     1010 
     1011The regression trees examples use a tree induced from the housing data 
      1012set. Without other arguments, :meth:`TreeClassifier.to_string` prints the
     1013following:: 
     1014 
     1015    RM<6.941 
     1016    |    LSTAT<14.400 
     1017    |    |    DIS<1.385: 45.6 
     1018    |    |    DIS>=1.385: 22.9 
     1019    |    LSTAT>=14.400 
     1020    |    |    CRIM<6.992: 17.1 
     1021    |    |    CRIM>=6.992: 12.0 
     1022    RM>=6.941 
     1023    |    RM<7.437 
     1024    |    |    CRIM<7.393: 33.3 
     1025    |    |    CRIM>=7.393: 14.4 
     1026    |    RM>=7.437 
     1027    |    |    TAX<534.500: 45.9 
     1028    |    |    TAX>=534.500: 21.9 
     1029 
     1030To add the standard error in both internal nodes and leaves, and 
     1031the 90% confidence intervals in the leaves, use:: 
     1032 
     1033    >>> print tree.to_string(leaf_str="[SE: %E]\t %V %I(90)", node_str="[SE: %E]") 
     1034    root: [SE: 0.409] 
     1035    |    RM<6.941: [SE: 0.306] 
     1036    |    |    LSTAT<14.400: [SE: 0.320] 
     1037    |    |    |    DIS<1.385: [SE: 4.420]: 
     1038    |    |    |        [SE: 4.420]   45.6 [38.331-52.829] 
     1039    |    |    |    DIS>=1.385: [SE: 0.244]: 
     1040    |    |    |        [SE: 0.244]   22.9 [22.504-23.306] 
     1041    |    |    LSTAT>=14.400: [SE: 0.333] 
     1042    |    |    |    CRIM<6.992: [SE: 0.338]: 
     1043    |    |    |        [SE: 0.338]   17.1 [16.584-17.691] 
     1044    |    |    |    CRIM>=6.992: [SE: 0.448]: 
     1045    |    |    |        [SE: 0.448]   12.0 [11.243-12.714] 
     1046    |    RM>=6.941: [SE: 1.031] 
     1047    |    |    RM<7.437: [SE: 0.958] 
     1048    |    |    |    CRIM<7.393: [SE: 0.692]: 
     1049    |    |    |        [SE: 0.692]   33.3 [32.214-34.484] 
     1050    |    |    |    CRIM>=7.393: [SE: 2.157]: 
     1051    |    |    |        [SE: 2.157]   14.4 [10.862-17.938] 
     1052    |    |    RM>=7.437: [SE: 1.124] 
     1053    |    |    |    TAX<534.500: [SE: 0.817]: 
     1054    |    |    |        [SE: 0.817]   45.9 [44.556-47.237] 
     1055    |    |    |    TAX>=534.500: [SE: 0.000]: 
     1056    |    |    |        [SE: 0.000]   21.9 [21.900-21.900] 
     1057 
     1058The predicted value (``%V``) and the average (``%A``) may differ because 
     1059a regression tree does not always predict the leaf average, but whatever 
      1060the :obj:`~Node.node_classifier` in a leaf returns.  As ``%V`` uses
      1061:obj:`Orange.feature.Continuous`'s function for printing the
     1062value, the number has the same number of decimals as in the data file. 
     1063 
     1064Regression trees cannot print the distributions in the same way 
     1065as classification trees. They instead offer a set of operators for 
     1066observing the number of instances within a certain range. For instance, 
      1067to print the number of instances with class values below 22 and compare
      1068it with the parent node, use::
     1069 
     1070    >>> print tree.to_string(leaf_str="%C<22 (%cbP<22)", node_str=".") 
     1071    root: 277.000 (.) 
     1072    |    RM<6.941: 273.000 (1.160) 
     1073    |    |    LSTAT<14.400: 107.000 (0.661) 
     1074    |    |    |    DIS<1.385: 0.000 (0.000) 
     1075    |    |    |    DIS>=1.385: 107.000 (1.020) 
     1076    |    |    LSTAT>=14.400: 166.000 (1.494) 
     1077    |    |    |    CRIM<6.992: 93.000 (0.971) 
     1078    |    |    |    CRIM>=6.992: 73.000 (1.040) 
     1079    |    RM>=6.941: 4.000 (0.096) 
     1080    |    |    RM<7.437: 3.000 (1.239) 
     1081    |    |    |    CRIM<7.393: 0.000 (0.000) 
     1082    |    |    |    CRIM>=7.393: 3.000 (15.333) 
     1083    |    |    RM>=7.437: 1.000 (0.633) 
     1084    |    |    |    TAX<534.500: 0.000 (0.000) 
      1085    |    |    |    TAX>=534.500: 1.000 (30.000)
     1086 
      1087The last line, for instance, says that among the instances with tax
      1088above 534, the number of instances with the class below 22 is 30 times
      1089higher than the number of such instances in its parent node.
     1090 
     1091To count the same for all instances *outside* 
      1092the interval [20, 22] and print out the proportions as percentages, use::
     1093 
     1094    >>> print tree.to_string(leaf_str="%C![20,22] (%^cbP![20,22]%)", node_str=".") 
     1095 
      1096The format string ``%c![20, 22]`` denotes the proportion of instances
      1097(within the node) whose values are below 20 or above 22. ``%cbP![20,
      109822]`` divides that proportion by the same statistic computed on the
      1099parent. A ``^`` is added to print percentages.
     1100 
     1101:: 
     1102 
     1103    root: 439.000 (.%) 
     1104    |    RM<6.941: 364.000 (98%) 
     1105    |    |    LSTAT<14.400: 200.000 (93%) 
     1106    |    |    |    DIS<1.385: 5.000 (127%) 
     1107    |    |    |    DIS>=1.385: 195.000 (99%) 
     1108    |    |    LSTAT>=14.400: 164.000 (111%) 
     1109    |    |    |    CRIM<6.992: 91.000 (96%) 
     1110    |    |    |    CRIM>=6.992: 73.000 (105%) 
     1111    |    RM>=6.941: 75.000 (114%) 
     1112    |    |    RM<7.437: 46.000 (101%) 
     1113    |    |    |    CRIM<7.393: 43.000 (100%) 
     1114    |    |    |    CRIM>=7.393: 3.000 (100%) 
     1115    |    |    RM>=7.437: 29.000 (98%) 
     1116    |    |    |    TAX<534.500: 29.000 (103%) 
     1117    |    |    |    TAX>=534.500: 0.000 (0%) 
     1118 
     1119 
     1120Defining custom printouts 
     1121------------------------- 
     1122 
     1123:meth:`TreeClassifier.to_string`'s argument :obj:`user_formats` can be used to 
      1124print other information.  :obj:`user_formats` should be
      1125a list of tuples with a regular expression and a function to be
     1126called when that expression is found in the format string. Expressions 
     1127from :obj:`user_formats` are checked before the built-in expressions 
     1128discussed above. 
     1129 
      1130The regular expression should describe a string like those used above,
     1131for instance ``%.2DbP``. When a leaf or internal node 
     1132is printed, the format string (:obj:`leaf_str` or :obj:`node_str`) 
     1133is checked for these regular expressions and when the match is found, 
     1134the corresponding callback function is called. 
     1135 
     1136The passed function will get five arguments: the format string  
     1137(:obj:`leaf_str` or :obj:`node_str`), the match object, the node which is 
     1138being printed, its parent (can be None) and the tree classifier. 
     1139The function should return the format string in which the part described 
     1140by the match object (that is, the part that is matched by the regular 
     1141expression) is replaced by whatever information your callback function 
     1142is supposed to give. 
     1143 
      1144The function can use several utility functions provided in the module.
     1145 
     1146.. autofunction:: insert_str 
     1147 
     1148.. autofunction:: insert_dot 
     1149 
     1150.. autofunction:: insert_num 
     1151 
     1152.. autofunction:: by_whom 
     1153 
     1154The module also includes reusable regular expressions:  
     1155 
     1156.. autodata:: fs 
     1157 
     1158.. autodata:: by 
     1159 
     1160For a trivial example, ``%V`` is implemented with the 
     1161following tuple:: 
     1162 
     1163    (re.compile("%V"), replaceV) 
     1164 
     1165And ``replaceV`` is defined by:: 
     1166 
     1167    def replaceV(strg, mo, node, parent, tree): 
     1168        return insert_str(strg, mo, str(node.node_classifier.default_value)) 
     1169 
     1170``replaceV`` takes the value predicted at the node 
     1171(``node.node_classifier.default_value`` ), converts it to a string 
     1172and passes it to :func:`insert_str`. 
     1173 
     1174A more complex regular expression is the one for the proportion of 
     1175majority class, defined as ``"%"+fs+"M"+by``. It uses the two partial 
     1176expressions defined above (:obj:`fs` and :obj:`by`). 
     1177 
     1178The following code prints the classification margin for each node, 
     1179that is, the difference between the proportion of the largest and the 
     1180second largest class in the node: 
     1181 
     1182.. literalinclude:: code/orngTree2.py 
     1183   :lines: 7-31 
     1184 
     1185``get_margin`` computes the margin from the distribution. The replacing 
      1186function, ``replaceB``, computes the margin for the node.  If the :data:`by`
      1187group is present, we call :func:`by_whom` to get the node by whose
      1188margin this node's margin is to be divided. If that node (usually the
      1189parent) does not exist or if its margin is zero, :func:`insert_dot`
      1190inserts a dot; otherwise :func:`insert_num` is called, which inserts the
     1191number in the user-specified format.  ``my_format`` contains the regular 
     1192expression and the callback function. 
     1193 
     1194Printing the tree with 
     1195 
     1196.. literalinclude:: code/orngTree2.py 
     1197    :lines: 33 
     1198 
     1199yields:: 
     1200 
     1201    petal width<0.800: Iris-setosa 100% (100.00%) 
     1202    petal width>=0.800 
     1203    |    petal width<1.750 
     1204    |    |    petal length<5.350: Iris-versicolor 88% (108.57%) 
     1205    |    |    petal length>=5.350: Iris-virginica 100% (122.73%) 
     1206    |    petal width>=1.750 
     1207    |    |    petal length<4.850: Iris-virginica 33% (34.85%) 
     1208    |    |    petal length>=4.850: Iris-virginica 100% (104.55%) 
     1209 
     1210Plotting with Dot 
     1211--------------------------- 
     1212 
To produce images of trees, first create a .dot file with
:meth:`TreeClassifier.dot`. If it was saved to "tree5.dot", a GIF
can be produced with the following command::
     1216 
     1217    dot -Tgif tree5.dot -otree5.gif 
     1218 
See Graphviz's dot documentation for more options and
output formats.
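
For completeness, here is a hedged end-to-end sketch; it assumes
Graphviz's ``dot`` is installed and on the PATH, reuses the iris tree
from the earlier examples, and writes the format strings with the
specifiers described above::

    import subprocess
    import Orange

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)
    tree.dot("tree5.dot", leaf_str="%V (%M out of %N)")

    # Render the .dot file into a GIF image with Graphviz.
    subprocess.check_call(["dot", "-Tgif", "tree5.dot", "-otree5.gif"])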
     1221 
     1222 
     1223=========================== 
     1224C4.5 Tree Inducer 
     1225=========================== 
     1226 
C4.5 is a standard benchmark in machine learning and is therefore
incorporated in Orange. The implementation uses the original C4.5 code,
so the resulting tree is exactly like the one that would be built by the
standalone C4.5. Orange merely makes the induced tree accessible from
Python.
     1231 
:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most Orange
learning algorithms, C4.5 does not accept weighted instances.
     1235 
     1236------------------------- 
     1237Building the C4.5 plug-in 
     1238------------------------- 
     1239 
Due to copyright restrictions, C4.5 is not distributed with Orange,
but it can be added as a plug-in. A C compiler is needed for the
procedure: on Windows, MS Visual C (CL.EXE and LINK.EXE must be on the
PATH); on Linux and OS X, gcc (OS X users can download it from Apple).
     1244 
     1245Orange must be installed prior to building C4.5. 
     1246 
#. Download the
   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from `RuleQuest's site <http://www.rulequest.com/>`_ and extract
   them. The files will be modified during the build.
     1252#. Download 
     1253   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_ 
     1254   and unzip its contents into the directory R8/Src of the C4.5 sources 
     1255   (this directory contains, for instance, the file average.c). 
     1256#. Run buildC45.py, which will build the plug-in and put it next to  
     1257   orange.pyd (or orange.so on Linux/Mac). 
#. Run python, type ``import Orange`` and
   create ``Orange.classification.tree.C45Learner()``. This should
   succeed; see the sanity-check sketch after this list.
     1261#. Finally, you can remove C4.5 sources. 
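
A minimal sanity check for the fourth step could look like this; if the
plug-in was not built or cannot be found, constructing the learner fails
with an error::

    import Orange

    # Succeeds only if the C4.5 plug-in sits next to orange.pyd/orange.so.
    learner = Orange.classification.tree.C45Learner()
    print learner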
     1262 
The buildC45.py script creates .h files that wrap Quinlan's .i files and
ensure that they are not included twice. It modifies the C4.5 sources to
include the .h files instead of the .i files (this step can hardly fail),
then compiles ensemble.c into c45.dll or c45.so and puts it next to
Orange. Finally, it checks whether the built C4.5 gives the same results
as the original.
     1268 
     1269.. autoclass:: C45Learner 
     1270    :members: 
     1271 
     1272.. autoclass:: C45Classifier 
     1273    :members: 
     1274 
     1275.. class:: C45Node 
     1276 
     1277    This class is a reimplementation of the corresponding *struct* from 
     1278    Quinlan's C4.5 code. 
     1279 
     1280    .. attribute:: node_type 
     1281 
        Type of the node: :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split instances based on the values of a discrete attribute,
        "cuts" split them according to a threshold value of a continuous
        attribute, and "subsets" use discrete attributes but with
        subsetting, so that several values can go into the same branch.
     1289 
     1290    .. attribute:: leaf 
     1291 
     1292        Value returned by that leaf. The field is defined for internal  
     1293        nodes as well. 
     1294 
     1295    .. attribute:: items 
     1296 
     1297        Number of (learning) instances in the node. 
     1298 
     1299    .. attribute:: class_dist 
     1300 
     1301        Class distribution for the node (of type  
     1302        :obj:`Orange.statistics.distribution.Discrete`). 
     1303 
     1304    .. attribute:: tested 
     1305         
        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous
        attribute.
     1310 
     1311    .. attribute:: cut 
     1312 
     1313        A threshold for continuous attributes, if node is of type :obj:`Cut`. 
     1314        Undefined otherwise. 
     1315 
     1316    .. attribute:: mapping 
     1317 
        Mapping for nodes of type :obj:`Subset`. Element ``mapping[i]``
        gives the branch index for an instance whose value of
        :obj:`tested` is *i*. Here, *i* denotes the index of the value,
        not a :class:`Orange.data.Value`.
     1321 
     1322    .. attribute:: branch 
     1323         
     1324        A list of branches stemming from this node. 
     1325 
     1326-------- 
     1327Examples 
     1328-------- 
     1329 
This script constructs the same learner as you would get by running
standard C4.5:
     1333 
     1334.. literalinclude:: code/tree_c45.py 
     1335   :lines: 7-14 
     1336 
     1337Both C4.5 command-line symbols and variable names can be used. The  
     1338following lines produce the same result:: 
     1339 
     1340    tree = Orange.classification.tree.C45Learner(data, m=100) 
     1341    tree = Orange.classification.tree.C45Learner(data, min_objs=100) 
     1342 
A veteran C4.5 user might prefer :func:`C45Learner.commandline`::
     1344 
     1345    lrn = Orange.classification.tree.C45Learner() 
     1346    lrn.commandline("-m 1 -s") 
     1347    tree = lrn(data) 
     1348 
The following script prints out the tree in the same format as C4.5 does.
     1350 
     1351.. literalinclude:: code/tree_c45_printtree.py 
     1352 
For leaves, just the value in ``node.leaf`` is printed. Since
:obj:`C45Node` does not know to which attribute the value belongs, it
has to be converted to a string through ``classvar``, which is passed
as an extra argument to the recursive part of ``print_tree``.
     1357 
     1358For discrete splits without subsetting, we print out all attribute values 
     1359and recursively call the function for all branches. Continuous splits 
     1360are equally easy to handle. 
     1361 
For discrete splits with subsetting, we iterate through the branches,
retrieve the corresponding values that go into each branch, turn the
values into strings and print them out, separately treating the case
when only a single value goes into the branch.
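
To make the traversal concrete, here is a hedged sketch of such a
recursive printer built only from the :obj:`C45Node` attributes
documented above; treating ``node.leaf`` as an index into
``classvar.values`` and the lower/higher ordering of a cut's branches
are assumptions::

    def walk(node, classvar, level=0):
        indent = "|    " * level
        if node.node_type == 0:                  # Leaf
            # Assumption: node.leaf indexes into classvar.values.
            print indent + classvar.values[int(node.leaf)]
        elif node.node_type == 1:                # Branch: discrete split
            for value, branch in zip(node.tested.values, node.branch):
                print indent + "%s = %s" % (node.tested.name, value)
                walk(branch, classvar, level + 1)
        elif node.node_type == 2:                # Cut: continuous threshold
            print indent + "%s <= %.3f" % (node.tested.name, node.cut)
            walk(node.branch[0], classvar, level + 1)
            print indent + "%s > %.3f" % (node.tested.name, node.cut)
            walk(node.branch[1], classvar, level + 1)
        else:                                    # Subset: mapping[i] -> branch
            for b, branch in enumerate(node.branch):
                values = [v for i, v in enumerate(node.tested.values)
                          if int(node.mapping[i]) == b]
                print indent + "%s in {%s}" % (node.tested.name,
                                               ", ".join(values))
                walk(branch, classvar, level + 1)

For a classifier built with :obj:`C45Learner`, this could be called as
``walk(tree.tree, iris.domain.class_var)``; the name of the root-node
attribute (``tree``) is likewise an assumption here.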
     1366 
     1367=================== 
     1368Simple Tree Inducer 
     1369=================== 
     1370 
     1371.. include:: /SimpleTreeLearner.txt 
     1372 
     1373--------         
     1374Examples 
     1375-------- 
     1376 
     1377:obj:`SimpleTreeLearner` is used in much the same way as :obj:`TreeLearner`. 
     1378A typical example of using :obj:`SimpleTreeLearner` would be to build a random 
     1379forest: 
     1380 
     1381.. literalinclude:: code/simple_tree_random_forest.py 
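
Since :obj:`SimpleTreeLearner` follows the standard learner protocol,
direct use is just as simple; a minimal sketch with all parameters left
at their defaults::

    import Orange

    iris = Orange.data.Table("iris")
    classifier = Orange.classification.tree.SimpleTreeLearner()(iris)
    print [str(classifier(inst)) for inst in iris[:5]]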
     1382 
     1383========== 
     1384References 
     1385========== 
     1386 
Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`.
Addison Wesley.

Koutsofios, E., North, S. C. (1993). Drawing Graphs with dot. AT&T Bell
Laboratories, Murray Hill, NJ, USA.

`Graphviz - open source graph drawing software
<http://www.research.att.com/sw/tools/graphviz/>`_: the home page of
AT&T's dot and related graph-drawing tools.
     1395 
     1396""" 
     1397 
     1398""" 
     1399TODO C++ aliases 
     1400 
     1401SplitConstructor.discrete/continuous_split_constructor -> SplitConstructor.discrete  
     1402Node.examples -> Node.instances 