.. py:currentmodule:: Orange.classification.tree

.. index:: classification tree

.. index::
   single: classification; tree

*******************************
Classification trees (``tree``)
*******************************

Orange includes multiple implementations of classification tree learners:
a very flexible :class:`TreeLearner`, a fast :class:`SimpleTreeLearner`,
and a :class:`C45Learner`, which uses the C4.5 tree induction
algorithm.

The following code builds a :obj:`TreeClassifier` on the Iris data set
(with the depth limited to three levels):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

See `Decision tree learning
<http://en.wikipedia.org/wiki/Decision_tree_learning>`_ on Wikipedia
for an introduction to classification trees.
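
In rendered form, the included snippet amounts to something like the
following sketch (the ``max_depth`` argument is an assumption about how
the depth limit is set)::

    import Orange

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)
    print tree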

==============================
Component-based Tree Inducer
==============================

.. autoclass:: TreeLearner
    :members:

.. autoclass:: TreeClassifier
    :members:

.. class:: Node

    Classification trees are a hierarchy of :obj:`Node` classes.

    A node stores the instances belonging to the node, a branch selector,
    a list of branches (if the node is not a leaf) with their descriptions
    and strengths, and a classifier.

    .. attribute:: distribution

        A distribution of learning instances.

    .. attribute:: contingency

        Complete contingency matrices for the learning instances.

    .. attribute:: instances, weightID

        Learning instances and the ID of the weight meta attribute. The
        root of the tree actually stores all instances, while other nodes
        store only references to the instances in the root node.

    .. attribute:: node_classifier

        A classifier for instances coming to the node. If the node is a
        leaf, it chooses the class (or class distribution) of an instance.

    Internal nodes have additional attributes. The lists :obj:`branches`,
    :obj:`branch_descriptions` and :obj:`branch_sizes` are of the
    same length.

    .. attribute:: branches

        A list of subtrees. Each element is a :obj:`Node` or None;
        if None, the corresponding branch is empty.

    .. attribute:: branch_descriptions

        A list of strings describing the branches. They are constructed
        by a :obj:`SplitConstructor`. A string can contain anything,
        for example 'red' or '>12.3'.

    .. attribute:: branch_sizes

        The (weighted) number of training instances for
        each branch. It can be used, for instance, for modeling
        probabilities when classifying instances with unknown values.

    .. attribute:: branch_selector

        A :obj:`~Orange.classification.Classifier` that returns a branch
        for each instance (as an
        :obj:`Orange.data.Value` in ``[0, len(branches)-1]``).  When an
        instance cannot be classified unambiguously, the selector can
        return a discrete distribution, which proposes how to divide
        the instance between the branches. Whether the proposition will
        be used depends upon the :obj:`Splitter` (for learning)
        or :obj:`Descender` (for classification).

    .. method:: tree_size()

        Return the number of nodes in the subtrees (including the node,
        excluding null-nodes).

--------
Examples
--------

Tree Structure
==============

This example works with the lenses data set:

    >>> import Orange
    >>> lenses = Orange.data.Table("lenses")
    >>> tree_classifier = Orange.classification.tree.TreeLearner(lenses)

The following function counts the number of nodes in a tree:

    >>> def tree_size(node):
    ...    if not node:
    ...        return 0
    ...    size = 1
    ...    if node.branch_selector:
    ...        for branch in node.branches:
    ...            size += tree_size(branch)
    ...    return size

If the node is None, the function returns 0. Otherwise, the size is 1
(for the node itself) plus the sizes of all subtrees. The function needs
to check whether a node is internal (whether it has a
:obj:`~Node.branch_selector`), as leaves do not have the
:obj:`~Node.branches` attribute.

    >>> tree_size(tree_classifier.tree)
    15

Note that a :obj:`Node` already has a built-in method
:func:`~Node.tree_size`.
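
It can be used directly (a quick check; the method, as documented above,
also excludes null nodes):

    >>> tree_classifier.tree.tree_size()
    15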

Trees can be printed with a simple recursive function:

    >>> def print_tree0(node, level):
    ...     if not node:
    ...         print " "*level + "<null node>"
    ...         return
    ...     if node.branch_selector:
    ...         node_desc = node.branch_selector.class_var.name
    ...         node_cont = node.distribution
    ...         print "\n" + "   "*level + "%s (%s)" % (node_desc, node_cont),
    ...         for i in range(len(node.branches)):
    ...             print "\n" + "   "*level + ": %s" % node.branch_descriptions[i],
    ...             print_tree0(node.branches[i], level+1)
    ...     else:
    ...         node_cont = node.distribution
    ...         major_class = node.node_classifier.default_value
    ...         print "--> %s (%s) " % (major_class, node_cont),

The crux of the example is not in the formatting (the ``\n``'s and so
on); what matters is everything but the print statements. The code
separately handles three node types:

* For null nodes (nodes to which no learning instances were sorted),
  it just prints "<null node>".
* For internal nodes, it prints the node description:
  the feature's name and the distribution of classes. The :obj:`Node`'s
  branch selector is a :obj:`~Orange.classification.Classifier`,
  and its ``class_var`` is the feature whose name is printed.  Class
  distributions are printed as well (they are assumed to be stored).
  :obj:`print_tree0` is then called recursively for each branch, with
  the level increased by 1 to increase the indent.
* If the node is a leaf, it prints the distribution of learning instances
  in the node and the class to which the instances in the node would
  be classified. We assume that the :obj:`~Node.node_classifier` is a
  :obj:`DefaultClassifier`. A better print function should be aware of
  possible alternatives.

The wrapper function that accepts either a
:obj:`TreeClassifier` or a :obj:`Node` can be written as follows:

    >>> def print_tree(x):
    ...     if isinstance(x, Orange.classification.tree.TreeClassifier):
    ...         print_tree0(x.tree, 0)
    ...     elif isinstance(x, Orange.classification.tree.Node):
    ...         print_tree0(x, 0)
    ...     else:
    ...         raise TypeError, "invalid parameter"

It's straightforward: if ``x`` is a
:obj:`TreeClassifier`, it prints ``x.tree``; if it's a :obj:`Node`, it
prints ``x``. If it's of some other type,
an exception is raised. The output:

    >>> print_tree(tree_classifier)
    <BLANKLINE>
    tear_rate (<15.000, 4.000, 5.000>)
    : normal
       astigmatic (<3.000, 4.000, 5.000>)
       : no
          age (<1.000, 0.000, 5.000>)
          : pre-presbyopic --> soft (<0.000, 0.000, 2.000>) 
          : presbyopic
             prescription (<1.000, 0.000, 1.000>)
             : hypermetrope --> soft (<0.000, 0.000, 1.000>) 
             : myope --> none (<1.000, 0.000, 0.000>) 
          : young --> soft (<0.000, 0.000, 2.000>) 
       : yes
          prescription (<2.000, 4.000, 0.000>)
          : hypermetrope
             age (<2.000, 1.000, 0.000>)
             : pre-presbyopic --> none (<1.000, 0.000, 0.000>) 
             : presbyopic --> none (<1.000, 0.000, 0.000>) 
             : young --> hard (<0.000, 1.000, 0.000>) 
          : myope --> hard (<0.000, 3.000, 0.000>) 
    : reduced --> none (<12.000, 0.000, 0.000>)

The tree structure examples conclude with a simple pruning function,
written entirely in Python and unrelated to any :class:`Pruner`. It limits
the tree depth (the number of internal nodes on any path down the tree).
For example, to get a two-level tree, call ``cut_tree(root, 2)``. The
function is recursive, with the second argument (level) decreasing at each
call; when it reaches zero, the current node is made a leaf:

    >>> def cut_tree(node, level):
    ...     if node and node.branch_selector:
    ...         if level:
    ...             for branch in node.branches:
    ...                 cut_tree(branch, level-1)
    ...         else:
    ...             node.branch_selector = None
    ...             node.branches = None
    ...             node.branch_descriptions = None

The function acts only when :obj:`node` and :obj:`node.branch_selector`
are defined. If the level is not zero, it recursively calls the function
for each branch. Otherwise, it clears the selector, branches and branch
descriptions.

    >>> cut_tree(tree_classifier.tree, 2)
    >>> print_tree(tree_classifier)
    <BLANKLINE>
    tear_rate (<15.000, 4.000, 5.000>)
    : normal
       astigmatic (<3.000, 4.000, 5.000>)
       : no --> soft (<1.000, 0.000, 5.000>) 
       : yes --> hard (<2.000, 4.000, 0.000>) 
    : reduced --> none (<12.000, 0.000, 0.000>)

Setting learning parameters
===========================

Let us construct a :obj:`TreeLearner` to play with:

    >>> import Orange
    >>> lenses = Orange.data.Table("lenses")
    >>> learner = Orange.classification.tree.TreeLearner()

There are three crucial components in learning: the
:obj:`~TreeLearner.split` and :obj:`~TreeLearner.stop` criteria, and the
instance :obj:`~TreeLearner.splitter`. The default ``stop`` is set with:

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

The default stopping parameters are:

    >>> print learner.stop.max_majority, learner.stop.min_examples
    1.0 0.0

The defaults only stop splitting when no instances are left or all of
them are in the same class.

If the minimal subset that is allowed to be split further is set to five
instances, the resulting tree is smaller:

    >>> learner.stop.min_examples = 5.0
    >>> tree = learner(lenses)
    >>> print tree
    tear_rate=reduced: none (100.00%)
    tear_rate=normal
    |    astigmatic=no
    |    |    age=pre-presbyopic: soft (100.00%)
    |    |    age=presbyopic: none (50.00%)
    |    |    age=young: soft (100.00%)
    |    astigmatic=yes
    |    |    prescription=hypermetrope: none (66.67%)
    |    |    prescription=myope: hard (100.00%)
    <BLANKLINE>

We can also limit the maximal proportion of the majority class:

    >>> learner.stop.max_majority = 0.5
    >>> tree = learner(lenses)
    >>> print tree
    none (62.50%)

Redefining tree induction components
====================================

This example shows how to use a custom stop function.  First, the
``def_stop`` function defines the default stop function. The first tree
has some added randomness: the induction also stops in 20% of the
cases when ``def_stop`` returns False. The stopping criterion for the
second tree is completely random: it stops induction in 20% of cases.
Note that in the second case the lambda function still has three
parameters, even though it does not need any, since that is the
signature required of :obj:`~TreeLearner.stop`.

.. literalinclude:: code/tree3.py
   :lines: 8-23
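
The gist of the included script is sketched below (the exact file may
differ in details; the 20% chance is modeled here with
``random.random() < 0.2``)::

    import random
    import Orange

    lenses = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()

    # the default stopping criterion
    def_stop = Orange.classification.tree.StopCriteria()

    # stop when the default criterion fires, plus in 20% of other cases
    learner.stop = lambda instances, weight, contingency: \
        def_stop(instances, weight, contingency) or random.random() < 0.2
    print learner(lenses)

    # stop at random in 20% of cases; the lambda must still accept
    # all three arguments
    learner.stop = lambda instances, weight, contingency: \
        random.random() < 0.2
    print learner(lenses)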

---------------------------------
Learner and Classifier Components
---------------------------------

Split constructors
=====================

.. class:: SplitConstructor

    Decides how to divide learning instances, i.e. defines the branching
    criteria.

    The :obj:`SplitConstructor` should use the domain
    contingency when possible, both for speed and adaptability.
    Sometimes the domain contingency does
    not suffice, for example when the ReliefF score is used.

    A :obj:`SplitConstructor` can veto further tree induction by returning
    no classifier. This is generally related to the number of learning
    instances that would go into each branch. If there is no split with
    more than :obj:`SplitConstructor.min_subset` instances in the branches
    (null nodes are allowed), the induction is stopped.

    Split constructors that cannot handle a particular feature
    type (discrete, continuous) quietly skip such features. When in doubt,
    use :obj:`SplitConstructor_Combined`, which delegates features to
    specialized split constructors.

    The same split constructors can be used both for classification and
    regression, if the chosen score (for :obj:`SplitConstructor_Score`
    and derived classes) supports both.

    .. attribute:: min_subset

        The minimal (weighted) number of instances in non-null leaves.

    .. method:: __call__(data, [weightID, contingency, apriori_distribution, candidates, clsfr])

        :param data: data in any acceptable form.
        :param weightID: Optional; the default of 0 means that all
            instances have a weight of 1.0.
        :param contingency: a domain contingency.
        :param apriori_distribution: apriori class probabilities.
        :type apriori_distribution: :obj:`Orange.statistics.distribution.Distribution`
        :param candidates: only consider these
            features (one boolean for each feature).
        :param clsfr: a node classifier (if it was constructed, that is,
            if :obj:`store_node_classifier` is True).

        Construct a split. Return a tuple (:obj:`branch_selector`,
        :obj:`branch_descriptions` (a list), :obj:`subset_sizes`
        (the number of instances for each branch, may also be
        empty), :obj:`quality` (higher numbers mean better splits),
        :obj:`spent_feature`). If no split is constructed,
        the :obj:`selector`, :obj:`branch_descriptions` and
        :obj:`subset_sizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spent_feature` is -1.

        If the chosen feature will be useless in the future and
        should not be considered for splitting in any of the subtrees
        (typically, when discrete features are used as-they-are, without
        any binarization or subsetting), then it should return the index
        of this feature as :obj:`spent_feature`. If no feature is spent,
        :obj:`spent_feature` is -1.

.. class:: SplitConstructor_Score

    Bases: :class:`SplitConstructor`

    An abstract base class that compares splits
    with a :class:`Orange.feature.scoring.Score`.  All split
    constructors except for :obj:`SplitConstructor_Combined` are derived
    from this class.

    .. attribute:: measure

        A :class:`Orange.feature.scoring.Score` for split evaluation. It
        has to handle the class type - for example, you cannot use
        :class:`~Orange.feature.scoring.GainRatio` for regression or
        :class:`~Orange.feature.scoring.MSE` for classification.

    .. attribute:: worst_acceptable

        The lowest allowed split quality.  The value strongly depends
        on the chosen :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Feature

    Bases: :class:`SplitConstructor_Score`

    Each value of a discrete feature corresponds to a branch.  The feature
    with the highest score (:obj:`~SplitConstructor_Score.measure`) is
    selected. When tied, a random feature is selected.

    The constructed :obj:`branch_selector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the selected
    feature. :obj:`branch_descriptions` contains the feature's
    values. The feature is marked as spent (it cannot reappear in the
    node's subtrees).

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Score`

    Finds the binarization with the highest score among all features. In
    case of ties, a random feature is selected.

    The constructed :obj:`branch_selector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected feature. Its :obj:`transformer` contains a ``MapIntValue``
    that maps the values of the feature into a binary feature. Branches
    with a single feature value are described with that value and
    branches with more than one are described with ``[<val1>, <val2>,
    ..., <valn>]``. Only binary features are marked as spent.

.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Score`

    The only split constructor for continuous features. It divides the
    range of feature values with a threshold that maximizes the split's
    quality. In case of ties, a random feature is selected.  The feature
    that yields the best binary split is returned.

    The constructed :obj:`branch_selector` is an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached :obj:`transformer`
    of type :obj:`Orange.feature.discretization.ThresholdDiscretizer`. The
    branch descriptions are "<threshold" and ">=threshold". The feature
    is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Score`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    Uses different split constructors for discrete and continuous
    features. Each split constructor is called with the appropriate
    features. Both construct a candidate for a split; the better of the
    two is used.

    The choice of the split is not probabilistically fair when
    multiple candidates have the same score. For example, if there
    are nine discrete features with the highest score, the split
    constructor for discrete features will select one of them. Now,
    if there is also a single continuous feature with the same score,
    :obj:`SplitConstructor_Combined` would select at random between the
    proposed discrete feature and the continuous feature, unaware that
    the discrete feature has already competed with eight others.  So,
    the probability of selecting (each) discrete feature would be
    1/18 instead of 1/10. Although incorrect, this should not affect
    the performance.

    .. attribute:: discrete_split_constructor

        Split constructor for discrete features;
        for instance, :obj:`SplitConstructor_Feature` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuous_split_constructor

        Split constructor for continuous features; it
        can be either :obj:`SplitConstructor_Threshold` or
        a custom split constructor.


StopCriteria and StopCriteria_common
====================================

:obj:`StopCriteria` determines when to stop the induction of subtrees.

.. class:: StopCriteria

    Provides the basic stopping criteria: the tree induction stops
    when there is at most one instance left (the actual, not weighted,
    number). The induction also stops when all instances are in the
    same class (for discrete problems) or have the same outcome value
    (for regression problems).

    .. method:: __call__(instances[, weightID, domain contingencies])

        Return True (stop) or False (continue the induction).
        Contingencies are not used for counting. Derived classes should
        use the contingencies whenever possible.

.. class:: StopCriteria_common

    Pre-pruning with additional criteria.

    .. attribute:: max_majority

        Maximum proportion of the majority class. When exceeded,
        induction stops.

    .. attribute:: min_instances

        Minimum number of instances for splitting. Subsets with fewer
        than :obj:`min_instances` instances are not split further.
        The instance count is weighted.


Splitters
=================

Splitters sort learning instances into branches (the branches are selected
with a :obj:`SplitConstructor`, while a :obj:`Descender` decides the
branch for an instance during classification).

Most splitters call :obj:`Node.branch_selector` and assign
instances correspondingly. When the value is unknown they choose a
particular branch or skip the instance.

Some splitters can also split instances: a weighted instance is then
used in more than one subset. Each branch has a weight ID (usually
each branch has its own ID) and all instances in that branch carry this
meta attribute.

An instance that
has not been split has only one additional attribute (the weight
ID corresponding to the subset to which it went). An instance that is
split between, say, three subsets, has three new meta attributes, one
for each subset. The weights are used only when needed; when there is
no splitting, no weight IDs are returned.

.. class:: Splitter

    An abstract base class that splits instances
    into subsets.

    .. method:: __call__(node, instances[, weightID])

        :param node: a node.
        :type node: :obj:`Node`
        :param instances: a set of instances.
        :param weightID: weight ID.

        Use the information in :obj:`Node` (particularly the
        :obj:`~Node.branch_selector`) to split the given set of instances into
        subsets.  Return a tuple with a list of instance subsets and
        a list of weights.  The list of weights is either a
        list of integers or None when no weights are added.

.. class:: Splitter_IgnoreUnknowns

    Bases: :class:`Splitter`

    Ignores the instances for which no single branch can be determined.

.. class:: Splitter_UnknownsToCommon

    Bases: :class:`Splitter`

    Places all ambiguous instances in the branch with the highest number
    of instances. If there is more than one such branch, one is selected
    at random and then used for all instances.

.. class:: Splitter_UnknownsToAll

    Bases: :class:`Splitter`

    Splits instances with an unknown value of the feature into all
    branches.

.. class:: Splitter_UnknownsToRandom

    Bases: :class:`Splitter`

    Selects a random branch for ambiguous instances.

.. class:: Splitter_UnknownsToBranch

    Bases: :class:`Splitter`

    Constructs an additional branch for ambiguous instances.
    The branch's description is "unknown".

.. class:: Splitter_UnknownsAsBranchSizes

    Bases: :class:`Splitter`

    Splits instances with an unknown value of the feature according to
    the proportions of instances in each branch.

.. class:: Splitter_UnknownsAsSelector

    Bases: :class:`Splitter`

    Splits instances with an unknown value of the feature according to
    the distribution proposed by the selector (usually the same as the
    proportions of instances in branches).

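The splitter is a pluggable component of the learner. A minimal usage
sketch (assuming the :obj:`TreeLearner` exposes it through the
``splitter`` attribute)::

    import Orange

    lenses = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    # send instances with unknown values into a separate "unknown" branch
    learner.splitter = Orange.classification.tree.Splitter_UnknownsToBranch()
    tree = learner(lenses)
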
Descenders
=============================

During classification, descenders decide where the instances that cannot
be unambiguously put into a single branch should go (the branches are
selected with a :obj:`SplitConstructor`, while a :obj:`Splitter` sorts
instances during learning).

.. class:: Descender

    An abstract base tree descender. It descends
    an instance as deep as possible, according to the values
    of the instance's features. The :obj:`Descender` calls the node's
    :obj:`~Node.branch_selector` to get the branch index. If it's a
    simple index, the corresponding branch is followed. If not, the
    descender decides what to do. A descender can choose a single
    branch (for instance, the one that is most recommended by the
    :obj:`~Node.branch_selector`) or it can let the branches vote.

    There are three possible outcomes of a descent:

    #. The descender reaches a leaf. This happens when
       there were no unknown or out-of-range values, or when the
       descender selected a single branch and continued the descent
       despite them. The descender returns the :obj:`Node` it has reached.
    #. The node's :obj:`~Node.branch_selector` returned a distribution and
       the :obj:`Descender` decided to stop the descent at this (internal)
       node. It returns the current :obj:`Node`.
    #. The node's :obj:`~Node.branch_selector` returned a distribution and
       the :obj:`Node` wants to split the instance (i.e., to decide the
       class by voting). It returns a :obj:`Node` and the vote-weights for
       the branches.  The weights can correspond, for example, to the
       distribution returned by the node's :obj:`~Node.branch_selector`,
       or to the number of learning instances that were assigned to each
       branch.

    .. method:: __call__(node, instance)

        Descends until it reaches a leaf or a node in
        which a vote of subtrees is required. A tuple
        of two elements is returned. If it reached a leaf, the tuple contains
        the leaf node and None. If not, it contains a node and
        a list of floats (weights of votes).

.. class:: Descender_UnknownToNode

    Bases: :obj:`Descender`

    When an instance cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`~Node.node_classifier`
    will be used to make a decision. Therefore, internal nodes
    need to have :obj:`Node.node_classifier` defined.

.. class:: Descender_UnknownToBranch

    Bases: :obj:`Descender`

    Classifies instances with an unknown value into a special branch. This
    makes sense only if the tree itself was constructed with
    :obj:`Splitter_UnknownsToBranch`.

.. class:: Descender_UnknownToCommonBranch

    Bases: :obj:`Descender`

    Classifies instances with unknown values into the branch with the
    highest number of instances. If there is more than one such branch,
    a random branch is chosen for each instance.

.. class:: Descender_UnknownToCommonSelector

    Bases: :obj:`Descender`

    Classifies instances with unknown values into the branch that received
    the highest recommendation from the selector.

.. class:: Descender_UnknownMergeAsBranchSizes

    Bases: :obj:`Descender`

    The subtrees vote for the instance's class; the vote is weighted
    according to the sizes of the branches.

.. class:: Descender_UnknownMergeAsSelector

    Bases: :obj:`Descender`

    The subtrees vote for the instance's class; the vote is weighted
    according to the selector's proposal.

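As with splitters, the descender is a pluggable component, here used by
the resulting classifier. A minimal sketch (assuming the
:obj:`TreeLearner` exposes it through the ``descender`` attribute)::

    import Orange

    lenses = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    # on ambiguous descents, let subtrees vote, weighted by branch sizes
    learner.descender = \
        Orange.classification.tree.Descender_UnknownMergeAsBranchSizes()
    tree = learner(lenses)
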
Pruning
=======

.. index::
    pair: classification trees; pruning

The pruners construct a shallow copy of a tree. The pruned tree's
:obj:`Node` objects contain references to the same contingency matrices,
node classifiers, branch selectors, ... as the original tree.

Pruners cannot construct a new :obj:`~Node.node_classifier`.  Thus, for
pruning, internal nodes must have :obj:`~Node.node_classifier` defined
(the default).

.. class:: Pruner

    An abstract base tree pruner.

    .. method:: __call__(tree)

        :param tree: either
            a :obj:`Node` or the C++ version of the classifier,
            saved in :obj:`TreeClassifier.base_classifier`.

        The resulting pruned tree is of the same type as the argument.
        The original tree remains intact.

.. class:: Pruner_SameMajority

    Bases: :class:`Pruner`

    A tree can have subtrees in which all leaves have
    the same majority class. This is allowed because leaves can still
    have different class distributions and thus predict different
    probabilities.  The :obj:`Pruner_SameMajority` prunes the tree so
    that there is no subtree in which all the nodes would have the same
    majority class.

    This pruner will only prune the nodes in which the node classifier
    is a :obj:`~Orange.classification.ConstantClassifier`
    (or a derived class).

    The pruning works from the leaves to the root.
    If siblings have (at least one) common majority class, they can be
    pruned.

.. class:: Pruner_m

    Bases: :class:`Pruner`

    Prunes a tree by comparing m-estimates of static and dynamic
    error, as defined in (Bratko, 2002).

    .. attribute:: m

        Parameter m for m-estimation.

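A minimal usage sketch (assuming, as described above, that the pruner is
applied to the root :obj:`Node` of an induced tree)::

    import Orange

    lenses = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(lenses)

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    # prunes a shallow copy; the original tree remains intact
    pruned_root = pruner(tree.tree)
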
Printing the tree
=================

The tree printing functions are very flexible. They can print, for
example, the number of instances, the proportion of the majority class
in nodes, or more complex statistics like the proportion of instances
in a particular class divided by the proportion of instances of this
class in the parent node. Users may also pass their own functions to
print certain elements.

The easiest way to print the tree is to print the :obj:`TreeClassifier`::

    >>> print tree
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)


Format string
-------------

Format strings are printed at every leaf or internal node, with certain
format specifiers replaced by data from the tree node. Specifiers are
generally of the form **%[^]<precision><quantity><divisor>**.

**^** at the start indicates that the number should be multiplied by 100,
which is useful for proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, ``%N`` denotes the number of instances in
the node, hence ``%6.2N`` would mean output to two decimal digits
and six places altogether. If left out, a default format ``5.3`` is
used, unless the numbers are multiplied by 100, in which case the default
is ``.0`` (no decimals, the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node by.
``bP`` means division by the same quantity in the parent node; for
instance, ``%NbP`` gives the number of instances in the node divided by
the number of instances in the parent node. Precision formatting can be
added, e.g. ``%6.2NbP``. ``bA`` denotes division by the same quantity
over the entire data set, so ``%NbA`` will tell you the proportion of
instances (out of the entire training data set) that fell into that
node. If division is impossible since the parent node does not exist or
some data is missing, a dot is printed out instead.

**<quantity>** defines what to print and is the only required element.
It can be:

``V``
    The predicted value at that node. Precision
    or divisor cannot be defined here.

``N``
    The number of instances in the node.

``M``
    The number of instances in the majority class (that is, the class
    predicted by the node).

``m``
    The proportion of instances in the majority class.

``A``
    The average class for instances in the node; this is available only
    for regression trees.

``E``
    Standard error for the class of instances in the node; available only
    for regression trees.

``I``
    Print out the confidence interval. The modifier is used as
    ``%I(95)`` or (more complicated) ``%5.3I(95)bP``.

``C``
    The number of instances in the given class.  For a classification
    example, ``%5.3C="Iris-virginica"bP`` denotes the number of instances
    of Iris-virginica divided by the number of instances of this class in
    the parent node (instances that are *not* Iris-virginica could be
    described with ``%5.3CbP!="Iris-virginica"``).

    For regression trees, use operators =, !=, <, <=, >, and >=, as in
    ``%C<22``, with optional precision and divisor. Intervals are also
    possible: ``%C[20, 22]`` gives the number of instances between
    20 and 22 (inclusive) and ``%C(20, 22)`` gives the number of such
    instances excluding the boundaries. Mixing of parentheses is allowed,
    e.g. ``%C(20, 22]``.  Add ``!`` for instances outside the interval,
    like ``%C!(20, 22]``.

``c``
    Same as ``C``, except that it computes the proportion of the class
    instead of the number of instances.

``D``
    The number of instances in each class. Precision and the divisor
    are applied to each number in the distribution.  This quantity cannot
    be computed for regression trees.

``d``
    Same as ``D``, except that it shows proportions of instances.

<user defined formats>
    Instructions and examples for adding formats are at the end of this
    section.

.. rubric:: Examples on classification trees

A tree on the iris data set with the depth limited to three
levels is built as follows:

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

Printing the predicted class at each node, along with the number of
instances in the majority class and the total number of instances in
the node, requires a custom format string::

    >>> print tree.to_string(leaf_str="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

The number of instances compared to the entire data set and to the
parent node::

    >>> print tree.to_string(leaf_str="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)

``%M`` is the number of instances in the majority class. Division by
the number of all instances of this class in the entire data set is
described with ``%MbA``. Add ``^`` in front for multiplication by
100. The percent sign *after* that is printed out literally, just as the
comma and parentheses. For the proportion of this class in the parent,
``bA`` is replaced with ``bP``.

To print the number of versicolors in each node, together with the
proportion of versicolors among the instances in this particular node
and among all versicolors, use the following::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

It gives::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, to print the distributions, use the format string ``%D``::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

As the order of classes is the same as in ``data.domain.class_var.values``
(setosa, versicolor, virginica), there are 49 versicolors and 3 virginicae
in the node at ``petal length<5.350``. To print the proportions within
nodes rounded to two decimals, use ``%.2d``::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

The most trivial format string for internal nodes is one that prints
node predictions. ``.`` in the following example specifies
that the ``node_str`` should be the same as the ``leaf_str``.

::

    tree.to_string(leaf_str="%V", node_str=".")

The output::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

A node *root* appears and the tree looks one level
deeper; this is needed in order to print the data for the tree root
as well.

To observe how the number
of virginicas decreases down the tree, try::

    print tree.to_string(leaf_str='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', node_str='.')

Interpretation: ``CbA="Iris-virginica"`` is
the number of virginica instances divided by the total number
of instances in this class. Add ``^.1`` and the result will be
multiplied by 100 and printed with one decimal. The trailing ``%`` is
printed out literally. In parentheses the same thing is divided by
the instances in the parent node. Single quotes enclose the string, so
that the double quotes inside it can specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

If :meth:`~TreeClassifier.to_string` cannot compute something, in this case
because the root has no parent, it prints out a dot.

The final example with classification trees prints the distributions in
the nodes, the distribution compared to the parent, the proportions
compared to the parent, and the predicted class in the leaves::

    >>> print tree.to_string(leaf_str='"%V   %D %.2DbP %.2dbP"', node_str='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]


.. rubric:: Examples on regression trees

The regression tree examples use a tree induced from the housing data
set. With no other arguments, :meth:`TreeClassifier.to_string` prints the
following::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9

To add the standard error in both internal nodes and leaves, and
the 90% confidence intervals in the leaves, use::

    >>> print tree.to_string(leaf_str="[SE: %E]\t %V %I(90)", node_str="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

The predicted value (``%V``) and the average (``%A``) may differ because
a regression tree does not always predict the leaf average, but whatever
the :obj:`~Node.node_classifier` in a leaf returns.  As ``%V`` uses
:obj:`Orange.feature.Continuous`'s method for printing out the
value, the number has the same number of decimals as in the data file.

Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of instances within a certain range. For instance,
to print the number of instances with values below 22 and compare
it with the values in the parent nodes, use::

    >>> print tree.to_string(leaf_str="%C<22 (%cbP<22)", node_str=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that the number of instances with the
class below 22 among those with tax above 534 is 30 times higher than
the number of such instances in its parent node.

To count the same for all instances *outside* the
interval [20, 22] and print out the proportions as percentages, use::

    >>> print tree.to_string(leaf_str="%C![20,22] (%^cbP![20,22]%)", node_str=".")

The format string ``%c![20, 22]`` denotes the proportion of instances
(within the node) whose values are below 20 or above 22. ``%cbP![20,
22]`` derives the same statistics computed on the parent. A ``^`` is
added for percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)

Defining custom printouts
-------------------------

:meth:`TreeClassifier.to_string`'s argument :obj:`user_formats` can be
used to print other information.  :obj:`user_formats` should
contain a list of tuples with a regular expression and a function to be
called when that expression is found in the format string. Expressions
from :obj:`user_formats` are checked before the built-in expressions
discussed above.

The regular expression should describe a string like those used above,
for instance ``%.2DbP``. When a leaf or internal node
is printed, the format string (:obj:`leaf_str` or :obj:`node_str`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The passed function will get five arguments: the format string
(:obj:`leaf_str` or :obj:`node_str`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

The function can use several utility functions provided in the module.

.. autofunction:: insert_str

.. autofunction:: insert_dot

.. autofunction:: insert_num

.. autofunction:: by_whom

The module also includes reusable regular expressions:

.. autodata:: fs

.. autodata:: by

For a trivial example, ``%V`` is implemented with the
following tuple::

    (re.compile("%V"), replaceV)

And ``replaceV`` is defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insert_str(strg, mo, str(node.node_classifier.default_value))

``replaceV`` takes the value predicted at the node
(``node.node_classifier.default_value``), converts it to a string
and passes it to :func:`insert_str`.

A more complex regular expression is the one for the proportion of the
majority class, defined as ``"%"+fs+"M"+by``. It uses the two partial
expressions defined above (:obj:`fs` and :obj:`by`).

The following code prints the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node:

.. literalinclude:: code/orngTree2.py
   :lines: 7-31

``get_margin`` computes the margin from the distribution. The replacing
function, ``replaceB``, computes the margin for the node.  If the
:data:`by` group is present, we call :func:`by_whom` to get the node by
whose margin this node's margin is to be divided. If that node (usually
the parent) does not exist or if its margin is zero, :func:`insert_dot`
inserts a dot; otherwise :func:`insert_num` is called, which inserts the
number in the user-specified format.  ``my_format`` contains the regular
expression and the callback function.
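
The included code is along these lines (a condensed sketch; the exact
signatures of :func:`by_whom` and the other helpers are assumptions
based on the descriptions above)::

    import re
    from Orange.classification.tree import \
        insert_dot, insert_num, by_whom, fs, by

    def get_margin(dist):
        # margin = largest minus second largest class probability
        if dist.abs < 1e-30:
            return 0
        l = list(dist)
        l.sort()
        return (l[-1] - l[-2]) / dist.abs

    def replaceB(strg, mo, node, parent, tree):
        margin = get_margin(node.distribution)
        if mo.group("by"):
            # divide by the margin of the node named by the bP/bA group
            whom = by_whom(mo.group("by"), parent, tree)
            if not (whom and whom.distribution):
                return insert_dot(strg, mo)
            div_margin = get_margin(whom.distribution)
            if div_margin < 1e-30:
                return insert_dot(strg, mo)
            margin /= div_margin
        return insert_num(strg, mo, margin)

    my_format = [(re.compile("%" + fs + "B" + by), replaceB)]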

Printing the tree with

.. literalinclude:: code/orngTree2.py
    :lines: 33

yields::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)

Plotting with Dot
-----------------

To produce images of trees, first create a .dot file with
:meth:`TreeClassifier.dot`. If it was saved to "tree5.dot", plot a gif
with the following command::

    dot -Tgif tree5.dot -otree5.gif

Check GraphViz's dot documentation for more options and
output formats.
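
A minimal sketch for creating the .dot file (assuming
:meth:`~TreeClassifier.dot` accepts a file name and the same
``leaf_str``/``node_str`` format strings as
:meth:`~TreeClassifier.to_string`)::

    import Orange

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris)
    # write the tree in dot format; render with: dot -Tgif tree5.dot -otree5.gif
    tree.dot("tree5.dot", leaf_str="%V (%M out of %N)", node_str=".")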


===========================
C4.5 Tree Inducer
===========================

As a standard benchmark in machine learning, C4.5 is incorporated in
Orange. The implementation uses the original C4.5 code, so the resulting
tree is exactly like the one that would be built by the standalone
C4.5. The only difference is that the induced tree is made accessible
in Python.

:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most of Orange's
learning algorithms, C4.5 does not accept weighted instances.

-------------------------
Building the C4.5 plug-in
-------------------------

Due to copyright restrictions, C4.5 is not distributed with Orange,
but it can be added as a plug-in. A C compiler is needed for the
procedure: on Windows MS Visual C (CL.EXE and LINK.EXE must be on the
PATH), on Linux and OS X gcc (OS X users can download it from Apple).

Orange must be installed prior to building C4.5.

#. Download
   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `RuleQuest site <http://www.rulequest.com/>`_ and extract
   them. The files will be modified in the
   further process.
#. Download
   `ensemble.c <http://orange.biolab.si/trac/browser/orange/Orange/orng/ensemble.c?format=txt>`_
   and `buildC45.py <http://orange.biolab.si/trac/browser/orange/Orange/orng/buildC45.py?format=txt>`_
   into the directory R8/Src of the C4.5 sources
   (this directory contains, for instance, the file average.c).
#. Run buildC45.py, which will build the plug-in and put it next to
   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type ``import Orange`` and
   create ``Orange.classification.tree.C45Learner()``. This should
   succeed.
#. Finally, you can remove the C4.5 sources.

The buildC45.py script creates .h files that wrap Quinlan's .i files and
ensures that they are not included twice. It modifies the C4.5 sources
to include the .h's instead of the .i's (this step can hardly fail). Then
it compiles ensemble.c into c45.dll or c45.so and puts it next to Orange.

.. autoclass:: C45Learner
    :members:

.. autoclass:: C45Classifier
    :members:

.. class:: C45Node

    This class is a reimplementation of the corresponding *struct* from
    Quinlan's C4.5 code.

    .. attribute:: node_type

        Type of the node:  :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split instances based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute, and "subsets" use discrete attributes but with
        subsetting, so that several values can go into the same branch.

    .. attribute:: leaf

        Value returned by the leaf. The field is defined for internal
        nodes as well.

    .. attribute:: items

        Number of (learning) instances in the node.

    .. attribute:: class_dist

        Class distribution for the node (of type
        :obj:`Orange.statistics.distribution.Discrete`).

    .. attribute:: tested

        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous
        attribute.

    .. attribute:: cut

        A threshold for continuous attributes, if the node is of type
        :obj:`Cut`. Undefined otherwise.

    .. attribute:: mapping

        Mapping for nodes of type :obj:`Subset`. Element ``mapping[i]``
        gives the branch index for an instance whose value of
        :obj:`tested` is *i*. Here, *i* denotes an index of the value,
        not a :class:`Orange.data.Value`.

    .. attribute:: branch

        A list of branches stemming from this node.

--------
Examples
--------

This
script constructs the same learner as you would get by calling
the usual C4.5:

.. literalinclude:: code/tree_c45.py
   :lines: 7-14

Both C4.5 command-line symbols and variable names can be used. The
following lines produce the same result::

    tree = Orange.classification.tree.C45Learner(data, m=100)
    tree = Orange.classification.tree.C45Learner(data, min_objs=100)

A veteran C4.5 user might prefer :func:`C45Learner.commandline`::

    lrn = Orange.classification.tree.C45Learner()
    lrn.commandline("-m 1 -s")
    tree = lrn(data)

The following script prints out the tree in the same format as C4.5 does.

.. literalinclude:: code/tree_c45_printtree.py

For the leaves, just the value in ``node.leaf`` is printed. Since
:obj:`C45Node` does not know to which attribute it belongs, we need to
convert it to a string through ``classvar``, which is passed as an extra
argument to the recursive part of print_tree.

For discrete splits without subsetting, we print out all attribute values
and recursively call the function for all branches. Continuous splits
are equally easy to handle.

For discrete splits with subsetting, we iterate through the branches,
retrieve the corresponding values that go into each branch into
``inset``, turn the values into strings and print them out, separately
treating the case when only a single value goes into the branch.
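
A condensed sketch of such a printing function (node type codes as
documented above; it assumes the classifier exposes the root as ``tree``
and the class variable as ``class_var``, and the formatting details of
the included script may differ)::

    def print_tree0(node, class_var, lev):
        var = node.tested
        if node.node_type == 0:    # Leaf
            print "%s (%.1f)" % (class_var.values[int(node.leaf)],
                                 node.items),
        elif node.node_type == 1:  # Branch: one branch per discrete value
            for i, branch in enumerate(node.branch):
                print "\n" + "|   " * lev + "%s = %s:" % (var.name,
                                                          var.values[i]),
                print_tree0(branch, class_var, lev + 1)
        elif node.node_type == 2:  # Cut: threshold on a continuous attribute
            for i, branch in enumerate(node.branch):
                print "\n" + "|   " * lev + "%s %s %.1f:" % (
                    var.name, ("<=", ">")[i], node.cut),
                print_tree0(branch, class_var, lev + 1)
        else:                      # Subset: several values can share a branch
            for i, branch in enumerate(node.branch):
                inset = [v for j, v in enumerate(var.values)
                         if node.mapping[j] == i]
                print "\n" + "|   " * lev + "%s in {%s}:" % (
                    var.name, ", ".join(inset)),
                print_tree0(branch, class_var, lev + 1)

    def print_tree(tree):
        print_tree0(tree.tree, tree.class_var, 0)
        print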

===================
Simple Tree Inducer
===================

.. include:: SimpleTreeLearner.txt

--------
Examples
--------

:obj:`SimpleTreeLearner` is used in much the same way as :obj:`TreeLearner`.
A typical example of using :obj:`SimpleTreeLearner` would be to build a random
forest:

.. literalinclude:: code/simple_tree_random_forest.py

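In outline, the included script does something like the following (a
sketch; the parameter names of
:obj:`~Orange.ensemble.forest.RandomForestLearner` and
:obj:`SimpleTreeLearner` are assumptions)::

    import Orange

    iris = Orange.data.Table("iris")

    # a fast, lightweight base learner for the forest
    simple = Orange.classification.tree.SimpleTreeLearner(min_instances=2)
    forest_learner = Orange.ensemble.forest.RandomForestLearner(
        base_learner=simple, trees=50)
    forest = forest_learner(iris)
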
==========
References
==========

Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`.
Addison Wesley.

Koutsofios, E., North, S. C. (1993). Drawing Graphs with dot. AT&T Bell
Laboratories, Murray Hill, NJ, USA.

`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_:
the home page of AT&T's dot and similar software packages.
1396"""
1397
1398"""
1399TODO C++ aliases
1400
1401SplitConstructor.discrete/continuous_split_constructor -> SplitConstructor.discrete
1402Node.examples -> Node.instances
Note: See TracBrowser for help on using the repository browser.