source: orange/Orange/classification/tree.py @ 10015:80f04d2d6ce7

"""

.. index:: classification tree

.. index::
   single: classification; tree

*******************************
Classification trees (``tree``)
*******************************

Orange includes multiple implementations of classification tree learners:
a very flexible :class:`TreeLearner`, a fast :class:`SimpleTreeLearner`,
and a :class:`C45Learner`, which uses the C4.5 tree induction
algorithm.

The following code builds a :obj:`TreeClassifier` on the Iris data set
(with the depth limited to three levels):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

See `Decision tree learning
<http://en.wikipedia.org/wiki/Decision_tree_learning>`_ on Wikipedia
for an introduction to classification trees.

======================
Learner and Classifier
======================

.. autoclass:: TreeLearner
    :members:

.. autoclass:: TreeClassifier
    :members:

.. class:: Node

    Classification trees are a hierarchy of :obj:`Node` classes.

    Node stores the instances belonging to the node, a branch selector,
    a list of branches (if the node is not a leaf) with their descriptions
    and strengths, and a classifier.

    .. attribute:: distribution

        A distribution of learning instances.

    .. attribute:: contingency

        Complete contingency matrices for the learning instances.

    .. attribute:: instances, weightID

        Learning instances and the ID of the weight meta attribute. The root
        of the tree actually stores all instances, while other nodes
        store only references to instances in the root node.

    .. attribute:: node_classifier

        A classifier for instances coming to the node. If the node is a
        leaf, it chooses the class (or class distribution) of an instance.

    Internal nodes have additional attributes. The lists :obj:`branches`,
    :obj:`branch_descriptions` and :obj:`branch_sizes` are of the
    same length.

    .. attribute:: branches

        A list of subtrees. Each element is a :obj:`Node` or None.
        If None, the corresponding branch is empty.

    .. attribute:: branch_descriptions

        A list of strings describing the branches. They are constructed
        by :obj:`SplitConstructor`. A string can contain anything,
        for example 'red' or '>12.3'.

    .. attribute:: branch_sizes

        The (weighted) number of training instances for
        each branch. It can be used, for instance, for modeling
        probabilities when classifying instances with unknown values.

    .. attribute:: branch_selector

        A :obj:`~Orange.classification.Classifier` that returns a branch
        for each instance (as
        :obj:`Orange.data.Value` in ``[0, len(branches)-1]``).  When an
        instance cannot be classified unambiguously, the selector can
        return a discrete distribution, which proposes how to divide
        the instance between the branches. Whether the proposition will
        be used depends upon the :obj:`Splitter` (for learning)
        or :obj:`Descender` (for classification).

    .. method:: tree_size()

        Return the number of nodes in the subtrees (including the node,
        excluding null-nodes).

========
Examples
========

Tree Structure
==============

This example works with the lenses data set:

..
    .. literalinclude:: code/treestructure.py
       :lines: 7-10

>>> import Orange
>>> lenses = Orange.data.Table("lenses")
>>> tree_classifier = Orange.classification.tree.TreeLearner(lenses)

The following function counts the number of nodes in a tree:

..
    .. literalinclude:: code/treestructure.py
       :lines: 12-21

>>> def tree_size(node):
...    if not node:
...        return 0
...
...    size = 1
...    if node.branch_selector:
...        for branch in node.branches:
...            size += tree_size(branch)
...
...    return size

If ``node`` is None, the function above returns 0. Otherwise, the size is 1
(this node) plus the sizes of all subtrees. The algorithm needs to check
whether a node is internal (whether it has a :obj:`~Node.branch_selector`),
as leaves don't have the :obj:`~Node.branches` attribute.

    >>> tree_size(tree_classifier.tree)
    15

Note that a :obj:`Node` already has a built-in method
:func:`~Node.tree_size`.
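
For example, the built-in method gives the same count (continuing the
session above):

    >>> tree_classifier.tree.tree_size()
    15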

Trees can be printed with a simple recursive function:

.. literalinclude:: code/treestructure.py
   :lines: 26-41

The crux of the example is not in the formatting (\\n's etc.);
what matters is everything but the print statements. The code
separately handles three node types:

* For null nodes (nodes to which no learning instances were classified),
  it just prints "<null node>".
* For internal nodes, it prints the node description:
  the feature's name and the distribution of classes. :obj:`Node`'s
  branch selector is an :obj:`~Orange.classification.Classifier`,
  and its ``class_var`` is the feature whose name is printed.  Class
  distributions are printed as well (they are assumed to be stored).
  :obj:`printTree0` is then called recursively for each branch, with the
  level increased by 1 to increase the indent.
* If the node is a leaf, it prints the distribution of learning instances
  in the node and the class to which the instances in the node would
  be classified. We assume that the :obj:`~Node.node_classifier` is a
  :obj:`DefaultClassifier`. A better print function should be aware of
  possible alternatives.

The wrapper function that accepts either a
:obj:`TreeClassifier` or a :obj:`Node` can be written as follows:

    .. literalinclude:: code/treestructure.py
       :lines: 43-49

It's straightforward: if ``x`` is a
:obj:`TreeClassifier`, it prints ``x.tree``; if it's a :obj:`Node`, it
prints ``x``. If it's of some other type,
an exception is raised. The output::

    >>> print_tree(tree_classifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> none (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

The tree structure examples conclude with a simple pruning function,
written entirely in Python and unrelated to any :class:`Pruner`. It limits
the tree depth (the number of internal nodes on any path down the tree).
For example, to get a two-level tree, call ``cut_tree(root, 2)``. The function
is recursive, with the second argument (level) decreasing at each call;
when it reaches zero, the current node is made a leaf:

.. literalinclude:: code/treestructure.py
   :lines: 54-62

The function acts only when :obj:`node` and :obj:`node.branch_selector`
are defined. If the level is not zero, it recursively calls itself
for each branch. Otherwise, it clears the selector, branches and branch
descriptions.

    >>> cut_tree(tree_classifier.tree, 2)
    >>> print_tree(tree_classifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no --> soft (<1.000, 5.000, 0.000>)
       : yes --> hard (<2.000, 0.000, 4.000>)

Setting learning parameters
===========================

Let us construct a :obj:`TreeLearner` to play with:

.. literalinclude:: code/treelearner.py
   :lines: 7-10

There are three crucial components in learning: the
:obj:`~TreeLearner.split` and :obj:`~TreeLearner.stop` criteria, and the
example :obj:`~TreeLearner.splitter`. The default ``stop`` is set with:

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

The default stopping parameters are:

    >>> print learner.stop.max_majority, learner.stop.min_instances
    1.0 0.0

The defaults only stop splitting when no instances are left or all of
them are in the same class.

If the minimal subset that is allowed to be split further is set to five
instances, the resulting tree is smaller:

    >>> learner.stop.min_instances = 5.0
    >>> tree = learner(data)
    >>> print tree
    tear_rate=reduced: none (100.00%)
    tear_rate=normal
    |    astigmatic=no
    |    |    age=pre-presbyopic: soft (100.00%)
    |    |    age=presbyopic: none (50.00%)
    |    |    age=young: soft (100.00%)
    |    astigmatic=yes
    |    |    prescription=hypermetrope: none (66.67%)
    |    |    prescription=myope: hard (100.00%)

We can also limit the maximal proportion of the majority class:

    >>> learner.stop.max_majority = 0.5
    >>> tree = learner(data)
    >>> print tree
    none (62.50%)

Redefining tree induction components
====================================

This example shows how to use a custom stop function.  First, the
``def_stop`` function defines the default stop function. The first tree
has some added randomness: the induction also stops in 20% of the
cases when ``def_stop`` returns False. The stopping criterion for the
second tree is completely random: it stops induction in 20% of cases.
Note that the lambda function in the second case still takes three
parameters, even though it does not need any, since that is the
signature required by :obj:`~TreeLearner.stop`.

.. literalinclude:: code/tree3.py
   :lines: 8-23
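
The same idea in a self-contained sketch (the data set and the direct
assignment of a plain Python function to the learner's ``stop`` component
are assumptions; the exported :obj:`StopCriteria_Python` suggests such
functions are supported)::

    import random
    import Orange

    random.seed(0)
    lenses = Orange.data.Table("lenses")
    def_stop = Orange.classification.tree.StopCriteria()

    # stop when the default criterion fires, or at random in 20% of cases
    learner = Orange.classification.tree.TreeLearner()
    learner.stop = lambda instances, weight_id, contingency: \
        def_stop(instances, weight_id, contingency) or random.random() < 0.2
    tree = learner(lenses)

    # a completely random criterion; the lambda still takes three
    # parameters, although it needs none of them
    learner.stop = lambda instances, weight_id, contingency: \
        random.random() < 0.2
    tree = learner(lenses)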

=================================
Learner and Classifier Components
=================================

Split constructors
=====================

.. class:: SplitConstructor

    Decides how to divide learning instances, i.e. defines the branching criteria.

    The :obj:`SplitConstructor` should use the domain
    contingency when possible, both for speed and adaptability.
    Sometimes the domain contingency does
    not suffice, for example if the ReliefF score is used.

    A :obj:`SplitConstructor` can veto further tree induction by returning
    no classifier. This is generally related to the number of learning
    instances that would go into each branch. If there are no splits with
    more than :obj:`SplitConstructor.min_subset` instances in the branches
    (null nodes are allowed), the induction is stopped.

    Split constructors that cannot handle a particular feature
    type (discrete, continuous) quietly skip it. When in doubt, use
    :obj:`SplitConstructor_Combined`, which delegates features to
    specialized split constructors.

    The same split constructors can be used both for classification and
    regression, if the chosen score (for :obj:`SplitConstructor_Score`
    and derived classes) supports both.

    .. attribute:: min_subset

        The minimal (weighted) number of instances in non-null leaves.

    .. method:: __call__(data, [weightID, contingency, apriori_distribution, candidates, clsfr])

        :param data: in any acceptable form.
        :param weightID: optional; the default of 0 means that all
            instances have a weight of 1.0.
        :param contingency: a domain contingency.
        :param apriori_distribution: a priori class probabilities.
        :type apriori_distribution: :obj:`Orange.statistics.distribution.Distribution`
        :param candidates: only consider these
            features (one boolean for each feature).
        :param clsfr: a node classifier (if it was constructed, that is,
            if :obj:`store_node_classifier` is True).

        Construct a split. Return a tuple (:obj:`branch_selector`,
        :obj:`branch_descriptions` (a list), :obj:`subset_sizes`
        (the number of instances for each branch; may also be
        empty), :obj:`quality` (higher numbers mean better splits),
        :obj:`spent_feature`). If no split is constructed,
        the :obj:`selector`, :obj:`branch_descriptions` and
        :obj:`subset_sizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spent_feature` is -1.

        If the chosen feature will be useless in the future and
        should not be considered for splitting in any of the subtrees
        (typically, when discrete features are used as-they-are, without
        any binarization or subsetting), then it should return the index
        of this feature as :obj:`spent_feature`. If no feature is spent,
        :obj:`spent_feature` is -1.

.. class:: SplitConstructor_Score

    Bases: :class:`SplitConstructor`

    An abstract base class that compares splits
    with a :class:`Orange.feature.scoring.Score`.  All split
    constructors except for :obj:`SplitConstructor_Combined` are derived
    from this class.

    .. attribute:: measure

        A :class:`Orange.feature.scoring.Score` for split evaluation. It
        has to handle the class type - for example, you cannot use
        :class:`~Orange.feature.scoring.GainRatio` for regression or
        :class:`~Orange.feature.scoring.MSE` for classification.

    .. attribute:: worst_acceptable

        The lowest allowed split quality.  The value strongly depends
        on the chosen :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Feature

    Bases: :class:`SplitConstructor_Score`

    Each value of a discrete feature corresponds to a branch.  The feature
    with the highest score (:obj:`measure`) is selected. When
    tied, a random feature is selected.

    The constructed :obj:`branch_selector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the selected
    feature. :obj:`branch_descriptions` contains the feature's
    values. The feature is marked as spent (it cannot reappear in the
    node's subtrees).

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Score`

    Finds the binarization with the highest score among all features. In
    case of ties, a random feature is selected.

    The constructed :obj:`branch_selector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected feature. Its :obj:`transformer` contains a ``MapIntValue``
    that maps values of the feature into a binary feature. Branches
    with a single feature value are described with that value and
    branches with more than one are described with ``[<val1>, <val2>,
    ..., <valn>]``. Only binary features are marked as spent.

.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Score`

    The only split constructor for continuous features. It divides the
    range of feature values with a threshold that maximizes the split's
    quality; the feature that yields the best binary split is returned.
    In case of ties, a random feature is selected.

    The constructed :obj:`branch_selector` is an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached :obj:`transformer`
    of type :obj:`Orange.feature.discretization.ThresholdDiscretizer`. The
    branch descriptions are "<threshold" and ">=threshold". The feature
    is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Score`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    Uses different split constructors for discrete and continuous
    features. Each split constructor is called with the appropriate
    features. Both construct a candidate for a split; the better of them
    is used.

    The choice of the split is not probabilistically fair when
    multiple candidates have the same score. For example, if there
    are nine discrete features with the highest score, the split
    constructor for discrete features will select one of them. Now,
    if there is also a single continuous feature with the same score,
    :obj:`SplitConstructor_Combined` would randomly select between the
    proposed discrete feature and the continuous feature, unaware that the
    discrete feature has already competed with eight others.  So
    the probability for selecting (each) discrete feature would be
    1/18 instead of 1/10. Although incorrect, this should not affect
    the performance.

    .. attribute:: discrete_split_constructor

        Split constructor for discrete features;
        for instance, :obj:`SplitConstructor_Feature` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuous_split_constructor

        Split constructor for continuous features; it
        can be either :obj:`SplitConstructor_Threshold` or
        a custom split constructor.
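
A combined constructor can, for instance, be plugged into a
:obj:`TreeLearner` like this (a sketch based on the :obj:`~TreeLearner.split`
component mentioned above; attribute-by-attribute construction is used to
avoid assuming keyword arguments)::

    import Orange

    data = Orange.data.Table("iris")
    combined = Orange.classification.tree.SplitConstructor_Combined()
    # binarize discrete features, use thresholds for continuous ones
    combined.discrete_split_constructor = \
        Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    combined.continuous_split_constructor = \
        Orange.classification.tree.SplitConstructor_Threshold()
    learner = Orange.classification.tree.TreeLearner()
    learner.split = combined
    tree = learner(data)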


StopCriteria and StopCriteria_common
============================================

:obj:`StopCriteria` determines when to stop the induction of subtrees.

.. class:: StopCriteria

    Provides the basic stopping criteria: the tree induction stops
    when there is at most one instance left (the actual, not weighted,
    number). The induction also stops when all instances are in the
    same class (for discrete problems) or have the same outcome value
    (for regression problems).

    .. method:: __call__(instances[, weightID, contingencies])

        Return True (stop) or False (continue the induction).
        Contingencies are not used for counting. Derived classes should
        use the contingencies whenever possible.

.. class:: StopCriteria_common

    Pre-pruning with additional criteria.

    .. attribute:: max_majority

        Maximum proportion of the majority class. When exceeded,
        induction stops.

    .. attribute:: min_instances

        Minimum number of instances for splitting. Subsets with fewer
        than :obj:`min_instances` instances are not split further.
        The instance count is weighted.


Splitters
=================

Splitters sort learning instances into branches (the branches are selected
with a :obj:`SplitConstructor`, while a :obj:`Descender` decides the
branch for an instance during classification).

Most splitters call :obj:`Node.branch_selector` and assign
instances correspondingly. When the value is unknown they choose a
particular branch or skip the instance.

Some splitters can also split instances: a weighted instance is then
used in more than one subset. Each branch has a weight ID (usually
each branch has its own ID) and all instances in that branch carry this
meta attribute.

An instance that
hasn't been split has only one additional attribute (the weight
ID corresponding to the subset to which it went). An instance that is split
between, say, three subsets, has three new meta attributes, one for each
subset. The weights are used only when needed; when there is no
splitting, no weight IDs are returned.
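
For example, to build a tree in which instances with unknown values are
sent down all branches (weighted), one could set the learner's splitter
(a sketch; the :obj:`~TreeLearner.splitter` component is mentioned in the
learning-parameters section above)::

    import Orange

    lenses = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    learner.splitter = Orange.classification.tree.Splitter_UnknownsToAll()
    tree = learner(lenses)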

.. class:: Splitter

    An abstract base class that splits instances
    into subsets.

    .. method:: __call__(node, instances[, weightID])

        :param node: a node.
        :type node: :obj:`Node`
        :param instances: a set of instances.
        :param weightID: weight ID.

        Use the information in :obj:`Node` (particularly the
        :obj:`~Node.branch_selector`) to split the given set of instances into
        subsets.  Return a tuple with a list of instance subsets and
        a list of weights.  The list of weights is either a
        list of integers or None when no weights are added.

.. class:: Splitter_IgnoreUnknowns

    Bases: :class:`Splitter`

    Ignores the instances for which no single branch can be determined.

.. class:: Splitter_UnknownsToCommon

    Bases: :class:`Splitter`

    Places all ambiguous instances into the branch with the highest number of
    instances. If there is more than one such branch, one is selected at
    random and then used for all instances.

.. class:: Splitter_UnknownsToAll

    Bases: :class:`Splitter`

    Splits instances with an unknown value of the feature into all branches.

.. class:: Splitter_UnknownsToRandom

    Bases: :class:`Splitter`

    Selects a random branch for ambiguous instances.

.. class:: Splitter_UnknownsToBranch

    Bases: :class:`Splitter`

    Constructs an additional branch for ambiguous instances.
    The branch's description is "unknown".

.. class:: Splitter_UnknownsAsBranchSizes

    Bases: :class:`Splitter`

    Splits instances with an unknown value of the feature according to
    the proportions of instances in each branch.

.. class:: Splitter_UnknownsAsSelector

    Bases: :class:`Splitter`

    Splits instances with an unknown value of the feature according to
    the distribution proposed by the selector (usually the same as the
    proportions of instances in the branches).

Descenders
=============================

Descenders decide where instances that cannot be unambiguously put
into a single branch should go during classification (the branches are
selected with a :obj:`SplitConstructor`, while a :obj:`Splitter` sorts
instances during learning).
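
For example, to let the subtrees vote in proportion to branch sizes when
an instance has missing values, one could set the learner's descender (a
sketch; a ``descender`` component on :obj:`TreeLearner` is an assumption
based on the component descriptions in this section)::

    import Orange

    lenses = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    learner.descender = \
        Orange.classification.tree.Descender_UnknownMergeAsBranchSizes()
    tree = learner(lenses)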

.. class:: Descender

    An abstract base tree descender. It descends
    an instance as deep as possible, according to the values
    of the instance's features. The :obj:`Descender` calls the node's
    :obj:`~Node.branch_selector` to get the branch index. If it's a
    simple index, the corresponding branch is followed. If not, the
    descender decides what to do. A descender can choose a single
    branch (for instance, the one that is the most recommended by the
    :obj:`~Node.branch_selector`) or it can let the branches vote.

    There are three possible outcomes of a descent:

    #. The descender reaches a leaf. This happens when
       there were no unknown or out-of-range values, or when the
       descender selected a single branch and continued the descent
       despite them. The descender returns the :obj:`Node` it has reached.
    #. The node's :obj:`~Node.branch_selector` returned a distribution and
       the :obj:`Descender` decided to stop the descent at this (internal)
       node. It returns the current :obj:`Node`.
    #. The node's :obj:`~Node.branch_selector` returned a distribution and the
       :obj:`Node` wants to split the instance (i.e., to decide the class
       by voting). It returns a :obj:`Node` and the vote-weights for
       the branches.  The weights can correspond, for example,  to the
       distribution returned by the node's :obj:`~Node.branch_selector`, or to
       the number of learning instances that were assigned to each branch.

    .. method:: __call__(node, instance)

        Descends until it reaches a leaf or a node in
        which a vote of subtrees is required. A tuple
        of two elements is returned. If it reached a leaf, the tuple contains
        the leaf node and None. If not, it contains a node and
        a list of floats (weights of votes).

.. class:: Descender_UnknownToNode

    Bases: :obj:`Descender`

    When an instance cannot be classified into a single branch, the current
    node is returned. Thus, the node's :obj:`~Node.node_classifier`
    will be used to make a decision. Therefore, internal nodes
    need to have :obj:`Node.node_classifier` defined.

.. class:: Descender_UnknownToBranch

    Bases: :obj:`Descender`

    Classifies instances with an unknown value to a special branch. This
    makes sense only if the tree itself was constructed with
    :obj:`Splitter_UnknownsToBranch`.

.. class:: Descender_UnknownToCommonBranch

    Bases: :obj:`Descender`

    Classifies instances with unknown values to the branch with the
    highest number of instances. If there is more than one such branch,
    a random branch is chosen for each instance.

.. class:: Descender_UnknownToCommonSelector

    Bases: :obj:`Descender`

    Classifies instances with unknown values to the branch which received
    the highest recommendation by the selector.

.. class:: Descender_UnknownMergeAsBranchSizes

    Bases: :obj:`Descender`

    The subtrees vote for the instance's class; the vote is weighted
    according to the sizes of the branches.

.. class:: Descender_UnknownMergeAsSelector

    Bases: :obj:`Descender`

    The subtrees vote for the instance's class; the vote is weighted
    according to the selector's proposal.

Pruning
=======

.. index::
    pair: classification trees; pruning

The pruners construct a shallow copy of a tree. The pruned tree's
:obj:`Node` objects contain references to the same contingency matrices,
node classifiers, branch selectors, ...  as the original tree.

Pruners cannot construct a new :obj:`~Node.node_classifier`.  Thus, for
pruning, internal nodes must have :obj:`~Node.node_classifier` defined
(the default).

.. class:: Pruner

    An abstract base tree pruner.

    .. method:: __call__(tree)

        :param tree: either
            a :obj:`Node` or the C++ version of the classifier,
            as saved in :obj:`TreeClassifier.base_classifier`.

        The resulting pruned tree is of the same type as the argument.
        The original tree remains intact.

.. class:: Pruner_SameMajority

    Bases: :class:`Pruner`

    A tree can have subtrees where all the leaves have
    the same majority class. This is allowed because leaves can still
    have different class distributions and thus predict different
    probabilities.  The :obj:`Pruner_SameMajority` prunes the tree so
    that there is no subtree in which all the nodes would have the same
    majority class.

    This pruner will only prune the nodes in which the node classifier
    is a :obj:`~Orange.classification.majority.ConstantClassifier`
    (or a derived class).

    The pruning works from the leaves to the root.
    If siblings have (at least one) common majority class, they can be pruned.

.. class:: Pruner_m

    Bases: :class:`Pruner`

    Prunes a tree by comparing m-estimates of static and dynamic
    error as defined in (Bratko, 2002).

    .. attribute:: m

        Parameter m for m-estimation.
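
A pruner is applied to an induced tree along these lines (a sketch; the
``m`` value and keyword-style construction are illustrative assumptions)::

    import Orange

    lenses = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(lenses)
    pruner = Orange.classification.tree.Pruner_m(m=2.0)
    # prune a shallow copy of the tree's root node; the original is intact
    pruned_root = pruner(tree.tree)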

Printing the tree
=================

The tree printing functions are very flexible. They can print, for
example, the number of instances, the proportion of the majority class
in a node and similar, or more complex statistics like the proportion of
instances in a particular class divided by the proportion of instances of
this class in the parent node. Users may also pass their own functions to
print certain elements.

The easiest way to print the tree is to print the :obj:`TreeClassifier`::

    >>> print tree
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)


Format string
-------------

Format strings are printed at every leaf or internal node, with certain
format specifiers replaced by data from the tree node. Specifiers are
generally of the form **%[^]<precision><quantity><divisor>**.

**^** at the start indicates that the number should be multiplied by 100,
which is useful for proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, ``%N`` denotes the number of instances in
the node, hence ``%6.2N`` would mean output to two decimal digits
and six places altogether. If left out, a default format ``5.3`` is
used, unless the numbers are multiplied by 100, in which case the default
is ``.0`` (no decimals, the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node with.
``bP`` means division by the same quantity in the parent node; for instance,
``%NbP`` gives the number of instances in the node divided by the
number of instances in the parent node. Precision formatting can be added,
e.g. ``%6.2NbP``. ``bA`` denotes division by the same quantity over the entire
data set, so ``%NbA`` will tell you the proportion of instances (out
of the entire training data set) that fell into that node. If division is
impossible since the parent node does not exist or some data is missing,
a dot is printed out instead.

**<quantity>** defines what to print and is the only required element.
It can be:

``V``
    The predicted value at that node. Precision
    or divisor cannot be defined here.

``N``
    The number of instances in the node.

``M``
    The number of instances in the majority class (that is, the class
    predicted by the node).

``m``
    The proportion of instances in the majority class.

``A``
    The average class for instances in the node; this is available only for
    regression trees.

``E``
    Standard error for the class of instances in the node; available only for
    regression trees.

``I``
    Print out the confidence interval. The modifier is used as
    ``%I(95)`` or (more complicated) ``%5.3I(95)bP``.

``C``
    The number of instances in the given class.  For a classification
    example, ``%5.3C="Iris-virginica"bP`` denotes the number of instances
    of Iris-virginica divided by the number of instances of this class in
    the parent node (instances that are *not* Iris-virginica could be
    described with ``%5.3CbP!="Iris-virginica"``).

    For regression trees, use operators =, !=, <, <=, >, and >=, as in
    ``%C<22``, with optional precision and divisor. Intervals are also
    possible: ``%C[20, 22]`` gives the number of instances between
    20 and 22 (inclusive) and ``%C(20, 22)`` gives the number of such
    instances excluding the boundaries. Mixing of parentheses is allowed,
    e.g. ``%C(20, 22]``.  Add ``!`` for instances outside the interval,
    like ``%C!(20, 22]``.

``c``
    Same as ``C``, except that it computes the proportion of the class
    instead of the number of instances.

``D``
    The number of instances in each class. Precision and the divisor
    are applied to each number in the distribution.  This quantity cannot
    be computed for regression trees.

``d``
    Same as ``D``, except that it shows proportions of instances.

<user defined formats>
    Instructions and examples for adding formats are at the end of this
    section.

.. rubric:: Examples on classification trees

A tree on the iris data set with the depth limited to three
levels is built as follows:

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

Printing the predicted class at each node, together with the number
of instances in the majority class and the total number of instances in
the node, requires a custom format string::

    >>> print tree.to_string(leaf_str="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

The number of instances as
compared to the entire data set and to the parent node::

    >>> print tree.to_string(leaf_str="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)

``%M`` is the number of instances in the majority class. Dividing by
the number of all instances from this class on the entire data set
is described with ``%MbA``. Add ``^`` in front for multiplication by
100. The percent sign *after* that is printed out literally, just as the
comma and parentheses. For the proportion of this class in the parent node,
``bA`` is replaced with ``bP``.

To print the number of versicolors in each node, together with the
proportion of versicolors among the instances in this particular node
and among all versicolors, use the following::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

It gives::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, to print the distributions, use the format string ``%D``::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

As the order of classes is the same as in ``data.domain.class_var.values``
(setosa, versicolor, virginica), there are 49 versicolors and 3 virginicas
in the node at ``petal length<5.350``. To print the proportions within
nodes rounded to two decimals use ``%.2d``::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

The most trivial format string for internal nodes is one that prints
node predictions. ``.`` in the following example specifies
that ``node_str`` should be the same as ``leaf_str``.

::

    tree.to_string(leaf_str="%V", node_str=".")

The output::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

A node *root* has appeared and the tree looks one level
deeper. This is needed to also print the data for the tree root.

To observe how the number
of virginicas decreases down the tree, try::

    print tree.to_string(leaf_str='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', node_str='.')

Interpretation: ``CbA="Iris-virginica"`` is
the number of instances of virginica, divided by the total number
of instances in this class. Add ``^.1`` and the result will be
multiplied by 100 and printed with one decimal. The trailing ``%`` is
printed out literally. In parentheses, the same quantity is divided by
the instances in the parent node. The whole format string is enclosed in
single quotes, so that the double quotes inside it can specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

If :meth:`~TreeClassifier.to_string` cannot compute something, in this case
because the root has no parent, it prints out a dot.

The final example with classification trees prints the distributions in
the nodes, the distribution compared to the parent, the proportions compared
to the parent, and the predicted class in the leaves::

    >>> print tree.to_string(leaf_str='"%V   %D %.2DbP %.2dbP"', node_str='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]


.. rubric:: Examples on regression trees

The regression trees examples use a tree induced from the housing data
set. Without other arguments, :meth:`TreeClassifier.to_string` prints the
following::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9

To add the standard error in both internal nodes and leaves, and
the 90% confidence intervals in the leaves, use::

    >>> print tree.to_string(leaf_str="[SE: %E]\t %V %I(90)", node_str="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

The predicted value (``%V``) and the average (``%A``) may differ because
a regression tree does not always predict the leaf average, but whatever
the :obj:`~Node.node_classifier` in a leaf returns.  As ``%V`` uses
:obj:`Orange.feature.Continuous`'s function for printing the
value, the number has the same number of decimals as in the data file.

Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of instances within a certain range. For instance,
to print the number of instances with values below 22 and compare
it with values in the parent nodes, use::

    >>> print tree.to_string(leaf_str="%C<22 (%cbP<22)", node_str=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that, among the instances with tax above
534, the proportion of instances with the class value below 22 is 30 times
that in the parent node.

To count the same for all instances *outside*
the interval [20, 22] and print out the proportions as percents, use::

    >>> print tree.to_string(leaf_str="%C![20,22] (%^cbP![20,22]%)", node_str=".")

The format string ``%c![20, 22]`` denotes the proportion of instances
(within the node) whose values are below 20 or above 22.
``%cbP![20, 22]`` divides it by the same statistic computed on the parent.
A ``^`` is added for percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)


Defining custom printouts
-------------------------

:meth:`TreeClassifier.to_string`'s argument :obj:`user_formats` can be used to
print other information.  :obj:`user_formats` should
contain a list of tuples with a regular expression and a function to be
called when that expression is found in the format string. Expressions
from :obj:`user_formats` are checked before the built-in expressions
discussed above.

The regular expression should describe a string like those used above,
for instance ``%.2DbP``. When a leaf or internal node
is printed, the format string (:obj:`leaf_str` or :obj:`node_str`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The passed function will get five arguments: the format string
(:obj:`leaf_str` or :obj:`node_str`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

The function can use several utility functions provided in the module.

.. autofunction:: insert_str

.. autofunction:: insert_dot

.. autofunction:: insert_num

.. autofunction:: by_whom

The module also includes reusable regular expressions:

.. autodata:: fs

.. autodata:: by

For a trivial example, ``%V`` is implemented with the
following tuple::

    (re.compile("%V"), replaceV)

And ``replaceV`` is defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insert_str(strg, mo, str(node.node_classifier.default_value))

``replaceV`` takes the value predicted at the node
(``node.node_classifier.default_value``), converts it to a string
and passes it to :func:`insert_str`.

A more complex regular expression is the one for the proportion of the
majority class, defined as ``"%"+fs+"M"+by``. It uses the two partial
expressions defined above (:obj:`fs` and :obj:`by`).

The following code prints the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node:

.. literalinclude:: code/orngTree2.py
   :lines: 7-31

``get_margin`` computes the margin from the distribution. The replacing
function, ``replaceB``, computes the margin for the node.  If the :data:`by`
group is present, we call :func:`by_whom` to get the node by whose
margin this node's margin is to be divided. If this node (usually the
parent) does not exist or if its margin is zero, :func:`insert_dot`
inserts a dot; otherwise, :func:`insert_num` is called, which inserts the
number in the user-specified format.  ``my_format`` contains the regular
expression and the callback function.

Printing the tree with

.. literalinclude:: code/orngTree2.py
    :lines: 33

yields::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)

Plotting with Dot
---------------------------

To produce images of trees, first create a .dot file with
:meth:`TreeClassifier.dot`. If it was saved to "tree5.dot", plot a gif
with the following command::

    dot -Tgif tree5.dot -otree5.gif
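
The .dot file itself can be produced along these lines (a sketch: the tree
is built as in the iris examples above, and :meth:`~TreeClassifier.dot` is
assumed to accept a file name plus the same formatting arguments as
:meth:`~TreeClassifier.to_string`)::

    import Orange

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris, max_depth=3)
    # write the tree to tree5.dot using a to_string-style format string
    tree.dot("tree5.dot", leaf_str="%V (%M out of %N)")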

Check GraphViz's dot documentation for more options and
output formats.


===========================
C4.5 Classifier and Learner
===========================

C4.5 is, as a standard benchmark in machine learning, incorporated in
Orange. The implementation uses the original C4.5 code, so the resulting
tree is exactly like the one that would be built by the standalone C4.5.
The built tree is only made accessible in Python.

:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most of Orange's
learning algorithms, C4.5 does not accept weighted instances.

Building the C4.5 plug-in
=========================

C4.5 is not distributed with Orange, but it can be added as a
plug-in. A C compiler is needed for the procedure: on Windows MS Visual C
(CL.EXE and LINK.EXE must be on the PATH), on Linux and OS X gcc (OS X
users can download it from Apple).

Orange must be installed prior to building C4.5.

#. Download
   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `RuleQuest site <http://www.rulequest.com/>`_ and extract
   them. The files will be modified later in the process.
#. Download
   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
   and unzip its contents into the directory R8/Src of the C4.5 sources
   (this directory contains, for instance, the file average.c).
#. Run buildC45.py, which will build the plug-in and put it next to
   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type ``import Orange`` and
   create ``Orange.classification.tree.C45Learner()``. This should
   succeed.
#. Finally, you can remove the C4.5 sources.

The buildC45.py script creates .h files that wrap Quinlan's .i files and
ensure that they are not included twice. It modifies the C4.5 sources to
include .h's instead of .i's (this step can hardly fail). Then it compiles
ensemble.c into c45.dll or c45.so and puts it next to Orange. In the end
it checks if the built C4.5 gives the same results as the original.

.. autoclass:: C45Learner
    :members:

.. autoclass:: C45Classifier
    :members:

.. class:: C45Node

    This class is a reimplementation of the corresponding *struct* from
    Quinlan's C4.5 code.

    .. attribute:: node_type

        Type of the node:  :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split instances based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute and "subsets" use discrete attributes but with subsetting,
        so that several values can go into the same branch.

    .. attribute:: leaf

        Value returned by that leaf. The field is defined for internal
        nodes as well.

    .. attribute:: items

        Number of (learning) instances in the node.

    .. attribute:: class_dist

        Class distribution for the node (of type
        :obj:`Orange.statistics.distribution.Discrete`).

    .. attribute:: tested

        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.

    .. attribute:: cut

        A threshold for continuous attributes, if the node is of type
        :obj:`Cut`. Undefined otherwise.

    .. attribute:: mapping

        Mapping for nodes of type :obj:`Subset`. Element ``mapping[i]``
        gives the branch index for an instance whose value of :obj:`tested`
        is *i*. Here, *i* denotes an index of value, not a
        :class:`Orange.data.Value`.

    .. attribute:: branch

        A list of branches stemming from this node.

Examples
========

This script constructs the same learner as you would get by calling
the usual C4.5:

.. literalinclude:: code/tree_c45.py
   :lines: 7-14

Both C4.5 command-line symbols and variable names can be used. The
following lines produce the same result::

    tree = Orange.classification.tree.C45Learner(data, m=100)
    tree = Orange.classification.tree.C45Learner(data, min_objs=100)

A veteran C4.5 user might prefer :func:`C45Learner.commandline`::

    lrn = Orange.classification.tree.C45Learner()
    lrn.commandline("-m 1 -s")
    tree = lrn(data)

The following script prints out the tree in the same format as C4.5 does.

.. literalinclude:: code/tree_c45_printtree.py

For the leaves, just the value in ``node.leaf`` is printed. Since
:obj:`C45Node` does not know to which attribute it belongs, we need to
convert it to a string through ``classvar``, which is passed as an extra
argument to the recursive part of printTree.

For discrete splits without subsetting, we print out all attribute values
and recursively call the function for all branches. Continuous splits
are equally easy to handle.

For discrete splits with subsetting, we iterate through the branches,
retrieve the corresponding values that go into each branch,
turn the values into strings and print them out, separately treating
the case when only a single value goes into the branch.
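
The logic can be sketched as follows (a simplified sketch using the
:obj:`C45Node` attributes described above; the included script remains the
authoritative version, the :obj:`Subset` case is skipped, and the branch
order for cuts is an assumption)::

    def print_c45_tree(node, classvar, lev=0):
        indent = "|    " * lev
        if node.node_type == 0:                  # Leaf: index into classvar
            print indent + "--> %s" % classvar.values[int(node.leaf)]
        elif node.node_type == 1:                # Branch: discrete split
            for i, branch in enumerate(node.branch):
                print indent + "%s = %s" % (node.tested.name,
                                            node.tested.values[i])
                print_c45_tree(branch, classvar, lev + 1)
        elif node.node_type == 2:                # Cut: continuous split
            print indent + "%s <= %.1f" % (node.tested.name, node.cut)
            print_c45_tree(node.branch[0], classvar, lev + 1)
            print indent + "%s > %.1f" % (node.tested.name, node.cut)
            print_c45_tree(node.branch[1], classvar, lev + 1)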

=================
SimpleTreeLearner
=================

.. include:: /SimpleTreeLearner.txt

Examples
========

:obj:`SimpleTreeLearner` is used in much the same way as :obj:`TreeLearner`.
A typical example of using :obj:`SimpleTreeLearner` would be to build a random
forest:

.. literalinclude:: code/simple_tree_random_forest.py


References
==========

Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`.
Addison Wesley.

Koutsofios, E., North, S. C. (1993). `Drawing Graphs with dot`. AT&T Bell
Laboratories, Murray Hill, NJ.

`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_:
a home page of AT&T's dot and similar software packages.

"""

"""
TODO C++ aliases

SplitConstructor.discrete/continuous_split_constructor -> SplitConstructor.discrete
Node.examples -> Node.instances
"""

from Orange.core import \
     TreeLearner as _TreeLearner, \
         TreeClassifier as _TreeClassifier, \
         SimpleTreeLearner, \
         SimpleTreeClassifier, \
         C45Learner as _C45Learner, \
         C45Classifier as _C45Classifier, \
         C45TreeNode as C45Node, \
         C45TreeNodeList as C45NodeList, \
         TreeDescender as Descender, \
              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
         TreeExampleSplitter as Splitter, \
              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
         TreeNode as Node, \
         TreeNodeList as NodeList, \
         TreePruner as Pruner, \
              TreePruner_SameMajority as Pruner_SameMajority, \
              TreePruner_m as Pruner_m, \
         TreeSplitConstructor as SplitConstructor, \
              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
              TreeSplitConstructor_Measure as SplitConstructor_Score, \
                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
         TreeStopCriteria as StopCriteria, \
              TreeStopCriteria_Python as StopCriteria_Python, \
              TreeStopCriteria_common as StopCriteria_common

import Orange.core
import operator
import base64
import re
import Orange.data
import Orange.feature.scoring
import warnings
1425
1426class C45Learner(Orange.classification.Learner):
1427    """
1428    :class:`C45Learner`'s attributes have double names - those that
1429    you know from C4.5 command lines and the corresponding names of C4.5's
1430    internal variables. All defaults are set as in C4.5; if you change
1431    nothing, you are running C4.5.
1432
1433    Constructs a :obj:`C45Classifier` when given data.
1434
1435    .. attribute:: gain_ratio (g)
1436       
1437        Determines whether to use information gain (false, default)
1438        or gain ratio for selection of attributes (true).
1439
1440    .. attribute:: batch (b)
1441
1442        Turn on batch mode (no windows, no iterations); this option is
1443        not documented in C4.5 manuals. It conflicts with "window",
1444        "increment" and "trials".
1445
1446    .. attribute:: subset (s)
1447       
        Enables subsetting (default: false, no subsetting).
1449 
1450    .. attribute:: prob_thresh (p)
1451
1452        Probabilistic threshold for continuous attributes (default: false).
1453
1454    .. attribute:: min_objs (m)
1455       
1456        Minimal number of objects (instances) in leaves (default: 2).
1457
1458    .. attribute:: window (w)
1459
        Initial window size (default: maximum of 20% and twice the
1461        square root of the number of data objects).
1462
1463    .. attribute:: increment (i)
1464
1465        The maximum number of objects that can be added to the window
1466        at each iteration (default: 20% of the initial window size).
1467
1468    .. attribute:: cf (c)
1469
        Pruning confidence level (default: 25%).
1471
1472    .. attribute:: trials (t)
1473
1474        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1475
1476    .. attribute:: prune
1477       
        Return a pruned tree (not an original C4.5 option; default: true).
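
    A sketch of typical use, with an attribute set as a keyword argument
    (this assumes the optional C4.5 plug-in is built and the voting data
    set is available)::

        import Orange

        data = Orange.data.Table("voting")
        lrn = Orange.classification.tree.C45Learner(min_objs=100)
        tree = lrn(data)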
1479    """
1480
    # Maps new snake_case names to the old camelCase names understood by
    # the base learner (the "probThresh" spelling is assumed to follow
    # the pattern of the other entries).
    _rename_new_old = { "min_objs": "minObjs", "prob_thresh": "probThresh",
            "gain_ratio": "gainRatio" }
1484    _rename_old_new = dict((a, b) for b, a in _rename_new_old.items())
1485
1486    @classmethod
1487    def _rename_dict(cls, dic):
1488        return dict((cls._rename_arg(a), b) for a, b in dic.items())
1489
1490    @classmethod
1491    def _rename_arg(cls, a):
1492        if a in cls._rename_old_new:
1493            Orange.misc.deprecation_warning(a, cls._rename_old_new[a], stacklevel=4)
1494        return cls._rename_new_old.get(a, a)
1495
1496    def __new__(cls, instances=None, weightID=0, **argkw):
1497        self = Orange.classification.Learner.__new__(cls, **cls._rename_dict(argkw))
1498        if instances:
1499            self.__init__(**cls._rename_dict(argkw))
1500            return self.__call__(instances, weightID)
1501        else:
1502            return self
1503
1504    def __init__(self, **kwargs):
1505        self.base = _C45Learner(**self._rename_dict(kwargs))
1506
1507    def __setattr__(self, name, value):
1508        nn = self._rename_arg(name)
1509        if name != "base" and nn in self.base.__dict__:
1510            self.base.__dict__[nn] = value
1511        else:
1512            self.__dict__[nn] = value
1513
1514    def __getattr__(self, name):
1515        nn = self._rename_arg(name)
        if name != "base" and nn in self.base.__dict__:
1517            return self.base.__dict__[nn]
1518        else:
1519            return self.__dict__[nn]
1520
1521    def __call__(self, *args, **kwargs):
1522        return C45Classifier(self.base(*args, **self._rename_dict(kwargs)))
1523
1524    def commandline(self, ln):
1525        """
1526        Set the arguments with a C4.5 command line.
1527        """
1528        self.base.commandline(ln)
1529
1530
1531class C45Classifier(Orange.classification.Classifier):
1532    """
    A faithful reimplementation of Quinlan's C4.5 that
    uses a tree composed of :class:`C45Node` instead of C4.5's original
    tree structure.
1536
1537    .. attribute:: tree
1538
1539        C4.5 tree stored as :obj:`C45Node`.
1540    """
1541
1542    def __init__(self, base_classifier):
1543        self.nativeClassifier = base_classifier
1544        for k, v in self.nativeClassifier.__dict__.items():
1545            self.__dict__[k] = v
1546
1547    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
1548                 *args, **kwdargs):
1549        """Classify a new instance.
1550       
1551        :param instance: instance to be classified.
1552        :type instance: :class:`Orange.data.Instance`
1553        :param result_type:
1554              :class:`Orange.classification.Classifier.GetValue` or \
1555              :class:`Orange.classification.Classifier.GetProbabilities` or
1556              :class:`Orange.classification.Classifier.GetBoth`
1557       
1558        :rtype: :class:`Orange.data.Value`,
1559              :class:`Orange.statistics.Distribution` or a tuple with both
1560        """
1561        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
1562
1563    def __setattr__(self, name, value):
1564        if name == "nativeClassifier":
1565            self.__dict__[name] = value
1566            return
1567        if name in self.nativeClassifier.__dict__:
1568            self.nativeClassifier.__dict__[name] = value
1569        self.__dict__[name] = value
1570
1571    def __str__(self):
1572        return self.dump()
1573
1574
1575    def dump(self):
1576        """
1577        Print the tree in the same form as Ross Quinlan's
1578        C4.5 program.
1579
1580        ::
1581
1582            import Orange
1583
1584            data = Orange.data.Table("voting")
1585            c45 = Orange.classification.tree.C45Learner(data)
1586            print c45
1587
1588        prints
1589
1590        ::
1591
1592            physician-fee-freeze = n: democrat (253.4)
1593            physician-fee-freeze = y:
1594            |   synfuels-corporation-cutback = n: republican (145.7)
1595            |   synfuels-corporation-cutback = y:
1596            |   |   mx-missile = y: democrat (6.0)
1597            |   |   mx-missile = n:
1598            |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
1599            |   |   |   adoption-of-the-budget-resolution = y:
1600            |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
1601            |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
1602
1603
1604        The standalone C4.5 would print::
1605
1606            physician-fee-freeze = n: democrat (253.4/5.9)
1607            physician-fee-freeze = y:
1608            |   synfuels-corporation-cutback = n: republican (145.7/6.2)
1609            |   synfuels-corporation-cutback = y:
1610            |   |   mx-missile = y: democrat (6.0/2.4)
1611            |   |   mx-missile = n:
1612            |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
1613            |   |   |   adoption-of-the-budget-resolution = y:
1614            |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
1615            |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
1616
1617        C4.5 also prints out the number of errors on learning data in
1618        each node.
1619        """
1620        return  _c45_printTree0(self.tree, self.class_var, 0)
1621
1622    to_string = dump
1623
1624def _c45_showBranch(node, classvar, lev, i):
1625    var = node.tested
1626    str_ = ""
    if node.node_type == 1: # discrete split: one branch per attribute value
1628        str_ += "\n" + "|   "*lev + "%s = %s:" % (var.name, var.values[i])
1629        str_ += _c45_printTree0(node.branch[i], classvar, lev + 1)
    elif node.node_type == 2: # continuous split at threshold node.cut
1631        str_ += "\n" + "|   "*lev + "%s %s %.1f:" % (var.name, ["<=", ">"][i], node.cut)
1632        str_ += _c45_printTree0(node.branch[i], classvar, lev + 1)
    else: # discrete split with subsetting: node.mapping assigns values to branches
1634        inset = filter(lambda a:a[1] == i, enumerate(node.mapping))
1635        inset = [var.values[j[0]] for j in inset]
1636        if len(inset) == 1:
1637            str_ += "\n" + "|   "*lev + "%s = %s:" % (var.name, inset[0])
1638        else:
1639            str_ += "\n" + "|   "*lev + "%s in {%s}:" % (var.name, ", ".join(inset))
1640        str_ += _c45_printTree0(node.branch[i], classvar, lev + 1)
1641    return str_
1642
1643
1644def _c45_printTree0(node, classvar, lev):
1645    var = node.tested
1646    str_ = ""
    if node.node_type == 0: # leaf
1648        str_ += "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items)
1649    else:
        for i, branch in enumerate(node.branch): # print leaf branches first
            if not branch.node_type:
                str_ += _c45_showBranch(node, classvar, lev, i)
        for i, branch in enumerate(node.branch): # then the subtrees
            if branch.node_type:
                str_ += _c45_showBranch(node, classvar, lev, i)
1656    return str_
1657
1658def _printTreeC45(tree):
1659    print _c45_printTree0(tree.tree, tree.class_var, 0)
1660
1661
1662import Orange.feature.scoring as fscoring
1663
1664class TreeLearner(Orange.core.Learner):
1665    """
1666    A classification or regression tree learner. If a set of instances
1667    is given on initialization, a :class:`TreeClassifier` is built and
1668    returned instead. All attributes can also be set on initialization.
1669
1670    **The tree induction process**
1671
    #. The learning instances are copied, unless
       :obj:`store_instances` is `False` and the instances
       are already stored in a :obj:`~Orange.data.Table`.
    #. A priori class probabilities are computed. A list of
1676       candidate features for the split is compiled; in the beginning,
1677       all features are candidates.
1678    #. The recursive part. The contingency matrix is computed by
1679       :obj:`contingency_computer`. Contingencies are used by :obj:`split`,
1680       :obj:`stop` and :obj:`splitter`.
1681    #. If the induction should :obj:`stop`, a :obj:`~Node.node_classifier`
1682       is built by calling :obj:`node_learner` with the given instances,
1683       weight ID and the contingency matrix. As the learner uses
1684       contingencies whenever possible, the :obj:`contingency_computer`
1685       will affect the :obj:`~Node.node_classifier`. The node is returned.
1686    #. If the induction continues, a :obj:`split` is called.
1687       If :obj:`split` fails to return a branch selector, induction stops
1688       and the :obj:`Node` is returned.
    #. The feature used by the split (if any) is removed from the candidate list.
1690    #. Instances are divided into child nodes with :obj:`splitter`.
1691       The process recursively continues with step 3 for
1692       each of the non-empty subsets. If the splitter returned weights,
1693       they are used for each branch.
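
    A minimal sketch of triggering this process, with a few of the
    attributes described below (the data set name is illustrative)::

        import Orange

        data = Orange.data.Table("iris")
        tree = Orange.classification.tree.TreeLearner(
            data, min_instances=5, max_depth=4)
        print tree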
1694
1695    **Attributes**
1696
1697    .. attribute:: node_learner
1698
1699        Induces a classifier from instances in a node. It is used both
1700        for internal nodes and leaves. The default is
1701        :obj:`Orange.classification.majority.MajorityLearner`.
1702
1703    .. attribute:: descender
1704
1705        The descender that the induced :obj:`TreeClassifier` will
1706        use. The default is :obj:`Descender_UnknownMergeAsSelector`.
1707        It votes with the :obj:`branch_selector`'s distribution for
1708        vote weights.
1709
1710    .. attribute:: contingency_computer
1711
1712        Defines the computation of contingency matrices (used by
1713        :obj:`split`, :obj:`stop`, :obj:`splitter`). It can be used,
1714        for example, to change the treatment of unknown values. By
1715        default ordinary contingency matrices are computed.
1716
1717    **Split construction**
1718
1719    .. attribute:: split
1720       
1721        A :obj:`SplitConstructor` or a function with the same signature as
1722        :obj:`SplitConstructor.__call__`. It is useful for prototyping
1723        new tree induction algorithms. If :obj:`split` is defined, other
1724        arguments that affect split construction are ignored. These include
1725        :obj:`binarization`, :obj:`measure`, :obj:`worst_acceptable` and
1726        :obj:`min_subset`. Default: :class:`SplitConstructor_Combined`
1727        with separate constructors for discrete and continuous
1728        features. Discrete features are used as they are, while
        continuous ones are binarized. Features are scored with gain ratio.
1730        At least two instances in a leaf are required for
1731        discrete and five for continuous features.
1732
1733    .. attribute:: binarization
1734
1735        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
1736        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
1737        0, do not use binarization (use :class:`SplitConstructor_Feature`).
1738        Default: 0.
1739
1740    .. attribute:: measure
1741   
1742        A score to evaluate features for splitting instances in a
1743        node.  A subclass of :class:`Orange.feature.scoring.Score`
1744        (perhaps :class:`~Orange.feature.scoring.InfoGain`,
1745        :class:`~Orange.feature.scoring.GainRatio`,
1746        :class:`~Orange.feature.scoring.Gini`,
1747        :class:`~Orange.feature.scoring.Relief`, or
1748        :class:`~Orange.feature.scoring.MSE`). Default:
1749        :class:`Orange.feature.scoring.GainRatio`.
1750
1751    .. attribute:: relief_m, relief_k
1752
1753        Set `m` and `k` for :class:`~Orange.feature.scoring.Relief`,
1754        if chosen.
1755
1756    .. attribute:: splitter
1757
1758        :class:`Splitter` or a function with the same
1759        signature as :obj:`Splitter.__call__`. The default is
        :class:`Splitter_UnknownsAsSelector`, which splits the
1761        learning instances according to distributions given by the
1762        selector.
1763
1764    **Pruning**
1765
1766    .. attribute:: worst_acceptable
1767
1768        The lowest required feature score. If the score of the best
1769        feature is below this margin, the tree is not grown further
1770        (default: 0).
1771
1772    .. attribute:: min_subset
1773
1774        The lowest required number of instances in non-null leaves (default: 0).
1775
1776    .. attribute:: min_instances
1777
        Data subsets with fewer than :obj:`min_instances`
        instances are not split any further. Therefore, all leaves in the tree
1780        will contain at least that many instances (default: 0).
1781
1782    .. attribute:: max_depth
1783
1784        Maximal tree depth. If 0, only root is generated.
1785        The default is 100.
1786
1787    .. attribute:: max_majority
1788
1789        Induction stops when the proportion of majority class in the
1790        node exceeds the value set by this parameter (default: 1.0).
1791 
1792    .. attribute:: stop
1793
1794        :class:`StopCriteria` or a function with the same signature as
1795        :obj:`StopCriteria.__call__`. Useful for prototyping new tree
1796        induction algorithms.  When used, parameters  :obj:`max_majority`
1797        and :obj:`min_instances` will not be  considered.  The default
1798        stopping criterion stops induction when all instances in a node
1799        belong to the same class.
1800
1801    .. attribute:: m_pruning
1802
1803        If non-zero, invokes an error-based bottom-up post-pruning,
1804        where m-estimate is used to estimate class probabilities
1805        (default: 0).
1806
1807    .. attribute:: same_majority_pruning
1808
1809        If true, invokes a bottom-up post-pruning by removing the
1810        subtrees of which all leaves classify to the same class
1811        (default: False).
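
    For example, both post-pruning options might be combined as follows
    (the values are illustrative)::

        lrn = Orange.classification.tree.TreeLearner(
            m_pruning=2, same_majority_pruning=True)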
1812
1813    **Record keeping**
1814
1815    .. attribute:: store_distributions
1816   
1817    .. attribute:: store_contingencies
1818   
1819    .. attribute:: store_instances
1820   
1821    .. attribute:: store_node_classifier
1822
1823        Determines whether to store class distributions,
1824        contingencies and instances in :class:`Node`, and whether the
        :obj:`Node.node_classifier` should be built for internal nodes
        as well (it is needed by the :obj:`Descender` or for post-pruning).
        Not storing distributions but storing contingencies does not
        save any memory, since the distribution actually points to the
        same object that is stored in :obj:`contingency.classes`.
1830        By default everything except :obj:`store_instances` is enabled.
1831
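    For instance, storing the learning instances in each node might be
    enabled with (a sketch)::

        lrn = Orange.classification.tree.TreeLearner(store_instances=True)
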
1832    """
1833    def __new__(cls, data=None, weightID=0, **argkw):
1834        self = Orange.core.Learner.__new__(cls, **argkw)
1835        if data:
1836            self.__init__(**argkw)
1837            return self.__call__(data, weightID)
1838        else:
1839            return self
1840
1841    def __init__(self, **kw):
1842
        # name, build function, parameters
        # build functions are not saved as function references
        # because that would cause problems with object copies
1846        for n, (fn, _) in self._built_fn.items():
1847            self.__dict__["_handset_" + n] = False
1848
1849        #measure has to be before split
1850        self.measure = None
1851        self.split = None
1852        self.stop = None
1853        self.splitter = None
1854
1855        for n, (fn, _) in self._built_fn.items():
1856            self.__dict__[n] = fn(self)
1857
1858        for k, v in kw.items():
1859            self.__setattr__(k, v)
1860
1861    def __call__(self, instances, weight=0):
1862        """
1863        Return a classifier from the given instances.
1864        """
1865        bl = self._base_learner()
1866
1867        #set the scoring criteria for regression if it was not
1868        #set by the user
1869        if not self._handset_split and not self.measure:
1870            measure = fscoring.GainRatio() \
1871                if instances.domain.class_var.var_type == Orange.feature.Type.Discrete \
1872                else fscoring.MSE()
1873            bl.split.continuous_split_constructor.measure = measure
1874            bl.split.discrete_split_constructor.measure = measure
1875
        if self.splitter is not None:
1877            bl.example_splitter = self.splitter
1878
1879        #post pruning
1880        tree = bl(instances, weight)
1881        if getattr(self, "same_majority_pruning", 0):
1882            tree = Pruner_SameMajority(tree)
1883        if getattr(self, "m_pruning", 0):
1884            tree = Pruner_m(tree, m=self.m_pruning)
1885
1886        return TreeClassifier(base_classifier=tree)
1887
1888    def __setattr__(self, name, value):
1889        self.__dict__[name] = value
1890        for n, (fn, v) in self._built_fn.items():
1891            if name in v:
1892                if not self.__dict__["_handset_" + n]:
1893                    self.__dict__[n] = fn(self)
1894                else:
1895                    warnings.warn("Changing \"" + name + "\" does not have any effect as \"" + n + "\" was already set", UserWarning, 2)
            elif n == name:
                if value is None:
                    # unset by hand: rebuild "n" from its source attributes
                    self.__dict__[n] = fn(self)
                    self.__dict__["_handset_" + n] = False
                else:
                    self.__dict__["_handset_" + n] = True
1905
1906    def __delattr__(self, name):
1907        self.__setattr__(name, None) #use the __setattr__
1908        del self.__dict__[name]
1909
1910    def instance(self):
1911        """
1912        DEPRECATED. Return a base learner - an object
1913        of :class:`_TreeLearner`.
1914        This method is left for backwards compatibility.
1915        """
1916        return self._base_learner()
1917
1918    def _build_split(self):
1919        """
1920        Return the split constructor built according to object attributes.
1921        """
1922        split = SplitConstructor_Combined()
1923        split.continuous_split_constructor = \
1924            SplitConstructor_Threshold()
1925        binarization = getattr(self, "binarization", 0)
1926        if binarization == 1:
1927            split.discrete_split_constructor = \
1928                SplitConstructor_ExhaustiveBinary()
1929        elif binarization == 2:
1930            split.discrete_split_constructor = \
1931                SplitConstructor_OneAgainstOthers()
1932        else:
1933            split.discrete_split_constructor = \
1934                SplitConstructor_Feature()
1935
1936        measures = {"infoGain": fscoring.InfoGain,
1937            "gainRatio": fscoring.GainRatio,
1938            "gini": fscoring.Gini,
1939            "relief": fscoring.Relief,
1940            "retis": fscoring.MSE
1941            }
1942
1943        measure = self.measure
1944        if isinstance(measure, str):
1945            measure = measures[measure]()
1946        if not measure:
1947            measure = fscoring.GainRatio()
1948
1949        measureIsRelief = isinstance(measure, fscoring.Relief)
1950        relM = getattr(self, "relief_m", None)
1951        if relM and measureIsRelief:
1952            measure.m = relM
1953
1954        relK = getattr(self, "relief_k", None)
1955        if relK and measureIsRelief:
1956            measure.k = relK
1957
1958        split.continuous_split_constructor.measure = measure
1959        split.discrete_split_constructor.measure = measure
1960
1961        wa = getattr(self, "worst_acceptable", 0)
1962        if wa:
1963            split.continuous_split_constructor.worst_acceptable = wa
1964            split.discrete_split_constructor.worst_acceptable = wa
1965
1966        ms = getattr(self, "min_subset", 0)
1967        if ms:
1968            split.continuous_split_constructor.min_subset = ms
1969            split.discrete_split_constructor.min_subset = ms
1970
1971        return split
1972
1973    def _build_stop(self):
1974        """
1975        Return the stop criteria built according to object's attributes.
1976        """
1977        stop = Orange.classification.tree.StopCriteria_common()
1978        mm = getattr(self, "max_majority", 1.0)
1979        if mm < 1.0:
1980            stop.max_majority = self.max_majority
1981        me = getattr(self, "min_instances", 0)
1982        if me:
1983            stop.min_instances = self.min_instances
1984        return stop
1985
1986    def _base_learner(self):
1987        learner = _TreeLearner()
1988
1989        learner.split = self.split
1990        learner.stop = self.stop
1991
1992        for a in ["store_distributions", "store_contingencies",
1993            "store_node_classifier", "node_learner", "max_depth", "contingency_computer", "descender" ]:
1994            if hasattr(self, a):
1995                setattr(learner, a, getattr(self, a))
1996
1997        if hasattr(self, "store_instances"):
1998            learner.store_examples = self.store_instances
1999
2000        return learner
2001
2002    _built_fn = {
2003            "split": [ _build_split, [ "binarization", "measure", "relief_m", "relief_k", "worst_acceptable", "min_subset" ] ], \
2004            "stop": [ _build_stop, ["max_majority", "min_instances" ] ]
2005        }
2006
2007
2008
2009TreeLearner = Orange.misc.deprecated_members({
2010          "mForPruning": "m_pruning",
2011          "sameMajorityPruning": "same_majority_pruning",
2012          "reliefM": "relief_m",
2013          "reliefK": "relief_k",
2014          "storeDistributions": "store_distributions",
2015          "storeContingencies": "store_contingencies",
2016          "storeExamples": "store_instances",
2017          "store_examples": "store_instances",
2018          "storeNodeClassifier": "store_node_classifier",
2019          "worstAcceptable": "worst_acceptable",
2020          "minSubset": "min_subset",
2021          "maxMajority": "max_majority",
2022          "minExamples": "min_instances",
2023          "maxDepth": "max_depth",
2024          "nodeLearner": "node_learner",
2025          "min_examples": "min_instances"
2026}, wrap_methods=[])(TreeLearner)
2027
2028#
2029# the following is for the output
2030#
2031
2032fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2033""" Defines the multiplier by 100 (``^``) and the format
2034for the number of decimals (e.g. ``5.3``). The corresponding
2035groups are named ``m100`` and ``fs``. """
2036
2037by = r"(?P<by>(b(P|A)))?"
""" Defines bP or bA or nothing; the result is in group ``by``. """
2039
2040bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2041opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2042opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2043intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2044fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2045re_V = re.compile("%V")
2046re_N = re.compile("%" + fs + "N" + by)
2047re_M = re.compile("%" + fs + "M" + by)
2048re_m = re.compile("%" + fs + "m" + by)
2049re_Ccont = re.compile("%" + fs + "C" + by + opc)
2050re_Cdisc = re.compile("%" + fs + "C" + by + opd)
2051re_ccont = re.compile("%" + fs + "c" + by + opc)
2052re_cdisc = re.compile("%" + fs + "c" + by + opd)
2053re_Cconti = re.compile("%" + fs + "C" + by + fromto)
2054re_cconti = re.compile("%" + fs + "c" + by + fromto)
2055re_D = re.compile("%" + fs + "D" + by)
2056re_d = re.compile("%" + fs + "d" + by)
2057re_AE = re.compile("%" + fs + "(?P<AorE>A|E)" + bysub)
2058re_I = re.compile("%" + fs + "I" + intrvl)
2059
2060def insert_str(s, mo, sub):
2061    """ Replace the part of s which is covered by mo
2062    with the string sub. """
2063    return s[:mo.start()] + sub + s[mo.end():]
2064
2065def insert_dot(s, mo):
2066    """ Replace the part of s which is covered by mo
2067    with a dot.  You should use this when the
2068    function cannot compute the desired quantity; it is called, for instance,
2069    when it needs to divide by something in the parent, but the parent
2070    doesn't exist.
2071    """
2072    return s[:mo.start()] + "." + s[mo.end():]
2073
2074def insert_num(s, mo, n):
2075    """ Replace the part of s matched by mo with the number n,
2076    formatted as specified by the user, that is, it multiplies
2077    it by 100, if needed, and prints with the right number of
2078    places and decimals. It does so by checking the mo
2079    for a group named m100 (representing the ``^`` in the format string)
2080    and a group named fs representing the part giving the number o
2081    f decimals (e.g. ``5.3``).
2082    """
2083    grps = mo.groupdict()
2084    m100 = grps.get("m100", None)
2085    if m100:
2086        n *= 100
2087    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2088    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
2089
2090def by_whom(by, parent, tree):
    """ If by equals bP, return parent, else return
2092    ``tree.tree``. This is used to find what to divide the quantity
2093    with, when division is required.
2094    """
2095    if by == "bP":
2096        return parent
2097    else:
2098        return tree.tree
2099
2100def replaceV(strg, mo, node, parent, tree):
2101    return insert_str(strg, mo, str(node.node_classifier.default_value))
2102
2103def replaceN(strg, mo, node, parent, tree):
2104    by = mo.group("by")
2105    N = node.distribution.abs
2106    if by:
2107        whom = by_whom(by, parent, tree)
2108        if whom and whom.distribution:
2109            if whom.distribution.abs > 1e-30:
2110                N /= whom.distribution.abs
2111        else:
2112            return insert_dot(strg, mo)
2113    return insert_num(strg, mo, N)
2114
2115
2116def replaceM(strg, mo, node, parent, tree):
2117    by = mo.group("by")
2118    maj = int(node.node_classifier.default_value)
2119    N = node.distribution[maj]
2120    if by:
2121        whom = by_whom(by, parent, tree)
2122        if whom and whom.distribution:
2123            if whom.distribution[maj] > 1e-30:
2124                N /= whom.distribution[maj]
2125        else:
2126            return insert_dot(strg, mo)
2127    return insert_num(strg, mo, N)
2128
2129
2130def replacem(strg, mo, node, parent, tree):
2131    by = mo.group("by")
2132    maj = int(node.node_classifier.default_value)
2133    if node.distribution.abs > 1e-30:
2134        N = node.distribution[maj] / node.distribution.abs
        if by:
            whom = by_whom(by, parent, tree)
            if whom and whom.distribution:
2137                byN = whom.distribution[maj] / whom.distribution.abs
2138                if byN > 1e-30:
2139                    N /= byN
2140            else:
2141                return insert_dot(strg, mo)
2142    else:
2143        N = 0.
2144    return insert_num(strg, mo, N)
2145
2146
2147def replaceCdisc(strg, mo, node, parent, tree):
2148    if tree.class_var.var_type != Orange.feature.Type.Discrete:
2149        return insert_dot(strg, mo)
2150
2151    by, op, cls = mo.group("by", "op", "cls")
2152    N = node.distribution[cls]
2153    if op == "!=":
2154        N = node.distribution.abs - N
2155    if by:
2156        whom = by_whom(by, parent, tree)
2157        if whom and whom.distribution:
2158            if whom.distribution[cls] > 1e-30:
2159                N /= whom.distribution[cls]
2160        else:
2161            return insert_dot(strg, mo)
2162    return insert_num(strg, mo, N)
2163
2164
2165def replacecdisc(strg, mo, node, parent, tree):
2166    if tree.class_var.var_type != Orange.feature.Type.Discrete:
2167        return insert_dot(strg, mo)
2168
2169    op, by, cls = mo.group("op", "by", "cls")
2170    N = node.distribution[cls]
2171    if node.distribution.abs > 1e-30:
2172        N /= node.distribution.abs
2173        if op == "!=":
2174            N = 1 - N
2175    if by:
2176        whom = by_whom(by, parent, tree)
2177        if whom and whom.distribution:
2178            if whom.distribution[cls] > 1e-30:
2179                N /= whom.distribution[cls] / whom.distribution.abs
2180        else:
2181            return insert_dot(strg, mo)
2182    return insert_num(strg, mo, N)
2183
2184
2185__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2186
2187def replaceCcont(strg, mo, node, parent, tree):
2188    if tree.class_var.var_type != Orange.feature.Type.Continuous:
2189        return insert_dot(strg, mo)
2190
2191    by, op, num = mo.group("by", "op", "num")
2192    op = __opdict[op]
2193    num = float(num)
2194    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2195    if by:
2196        whom = by_whom(by, parent, tree)
2197        if whom and whom.distribution:
2198            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2199            if byN > 1e-30:
2200                N /= byN
2201        else:
2202            return insert_dot(strg, mo)
2203
2204    return insert_num(strg, mo, N)
2205
2206
2207def replaceccont(strg, mo, node, parent, tree):
2208    if tree.class_var.var_type != Orange.feature.Type.Continuous:
2209        return insert_dot(strg, mo)
2210
2211    by, op, num = mo.group("by", "op", "num")
2212    op = __opdict[op]
2213    num = float(num)
2214    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2215    if node.distribution.abs > 1e-30:
2216        N /= node.distribution.abs
2217    if by:
2218        whom = by_whom(by, parent, tree)
2219        if whom and whom.distribution:
2220            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2221            if byN > 1e-30:
2222                N /= byN / whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
2223        else:
2224            return insert_dot(strg, mo)
2225    return insert_num(strg, mo, N)
2226
2227
2228def extractInterval(mo, dist):
2229    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
2230    lower, upper = float(lower), float(upper)
2231    if out:
2232        lop = lowin == "(" and operator.le or operator.lt
        hop = upin == ")" and operator.ge or operator.gt
2234        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
2235    else:
2236        lop = lowin == "(" and operator.gt or operator.ge
2237        hop = upin == ")" and operator.lt or operator.le
2238        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())
2239
2240
2241def replaceCconti(strg, mo, node, parent, tree):
2242    if tree.class_var.var_type != Orange.feature.Type.Continuous:
2243        return insert_dot(strg, mo)
2244
2245    by = mo.group("by")
2246    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
2247    if by:
2248        whom = by_whom(by, parent, tree)
2249        if whom and whom.distribution:
2250            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
2251            if byN > 1e-30:
2252                N /= byN
2253        else:
2254            return insert_dot(strg, mo)
2255
2256    return insert_num(strg, mo, N)
2257
2258
2259def replacecconti(strg, mo, node, parent, tree):
2260    if tree.class_var.var_type != Orange.feature.Type.Continuous:
2261        return insert_dot(strg, mo)
2262
2263    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
2264    ab = node.distribution.abs
2265    if ab > 1e-30:
2266        N /= ab
2267
2268    by = mo.group("by")
2269    if by:
2270        whom = by_whom(by, parent, tree)
2271        if whom and whom.distribution:
2272            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
2273            if byN > 1e-30:
2274                N /= byN / whom.distribution.abs
2275        else:
2276            return insert_dot(strg, mo)
2277
2278    return insert_num(strg, mo, N)
2279
2280
2281def replaceD(strg, mo, node, parent, tree):
2282    if tree.class_var.var_type != Orange.feature.Type.Discrete:
2283        return insert_dot(strg, mo)
2284
2285    fs, by, m100 = mo.group("fs", "by", "m100")
2286    dist = list(node.distribution)
2287    if by:
2288        whom = by_whom(by, parent, tree)
2289        if whom:
2290            for i, d in enumerate(whom.distribution):
2291                if d > 1e-30:
2292                    dist[i] /= d
2293        else:
2294            return insert_dot(strg, mo)
2295    mul = m100 and 100 or 1
2296    fs = fs or (m100 and ".0" or "5.3")
2297    return insert_str(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N * mul) for N in dist]) + "]")
2298
2299
2300def replaced(strg, mo, node, parent, tree):
2301    if tree.class_var.var_type != Orange.feature.Type.Discrete:
2302        return insert_dot(strg, mo)
2303
2304    fs, by, m100 = mo.group("fs", "by", "m100")
2305    dist = list(node.distribution)
2306    ab = node.distribution.abs
2307    if ab > 1e-30:
2308        dist = [d / ab for d in dist]
2309    if by:
2310        whom = by_whom(by, parent, tree)
2311        if whom:
2312            for i, d in enumerate(whom.distribution):
2313                if d > 1e-30:
2314                    dist[i] /= d / whom.distribution.abs # abs > d => d>1e-30 => abs>1e-30
2315        else:
2316            return insert_dot(strg, mo)
2317    mul = m100 and 100 or 1
2318    fs = fs or (m100 and ".0" or "5.3")
2319    return insert_str(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N * mul) for N in dist]) + "]")
2320
2321
2322def replaceAE(strg, mo, node, parent, tree):
2323    if tree.class_var.var_type != Orange.feature.Type.Continuous:
2324        return insert_dot(strg, mo)
2325
2326    AorE, bysub, by = mo.group("AorE", "bysub", "by")
2327
2328    if AorE == "A":
2329        A = node.distribution.average()
2330    else:
2331        A = node.distribution.error()
2332    if by:
2333        whom = by_whom("b" + by, parent, tree)
2334        if whom:
2335            if AorE == "A":
2336                avg = whom.distribution.average()
2337            else:
2338                avg = whom.distribution.error()
2339            if bysub == "b":
2340                if avg > 1e-30:
2341                    A /= avg
2342            else:
2343                A -= avg
2344        else:
2345            return insert_dot(strg, mo)
2346    return insert_num(strg, mo, A)
2347
2348
2349Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }
2350
2351def replaceI(strg, mo, node, parent, tree):
2352    if tree.class_var.var_type != Orange.feature.Type.Continuous:
2353        return insert_dot(strg, mo)
2354
2355    fs = mo.group("fs") or "5.3"
2356    intrvl = float(mo.group("intp") or mo.group("intv") or "95") / 100.
2357    mul = mo.group("m100") and 100 or 1
2358
    if intrvl not in Z:
        raise SystemError("Cannot compute %5.3f%% confidence intervals" % intrvl)
2361
2362    av = node.distribution.average()
2363    il = node.distribution.error() * Z[intrvl]
2364    return insert_str(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av - il) * mul, (av + il) * mul))
2365
2366
# This class is more a collection of functions, merged into a class so
# that they don't need to pass around too many arguments. It is
# constructed, used and discarded; it is not meant to store any information.
2370class _TreeDumper:
2371    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
2372         (re_M, replaceM), (re_m, replacem),
2373         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
2374         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
2375         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
2376         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
2377         (re_I, replaceI) ]
2378
2379    def node(self):
2380        return self.tree.tree if "tree" in self.tree.__dict__ else self.tree
2381
2382    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
2383        maxDepth, simpleFirst, tree, **kw):
2384        self.stringFormats = stringFormats
2385        self.minExamples = minExamples
2386        self.maxDepth = maxDepth
2387        self.simpleFirst = simpleFirst
2388        self.tree = tree
2389        self.__dict__.update(kw)
2390
2391        if leafStr:
2392            self.leafStr = leafStr
2393        else:
2394            if self.node().node_classifier.class_var.var_type == \
2395                    Orange.feature.Type.Discrete:
2396                self.leafStr = "%V (%^.2m%)"
2397            else:
2398                self.leafStr = "%V"
2399
2400        if nodeStr == ".":
2401            self.nodeStr = self.leafStr
2402        else:
2403            self.nodeStr = nodeStr
2404
2405
2406    def formatString(self, strg, node, parent):
2407        if hasattr(strg, "__call__"):
2408            return strg(node, parent, self.tree)
2409
2410        if not node:
2411            return "<null node>"
2412
2413        for rgx, replacer in self.stringFormats:
2414            if not node.distribution:
2415                strg = rgx.sub(".", strg)
2416            else:
2417                strt = 0
2418                while True:
2419                    mo = rgx.search(strg, strt)
2420                    if not mo:
2421                        break
2422                    strg = replacer(strg, mo, node, parent, self.tree)
2423                    strt = mo.start() + 1
2424
2425        return strg
2426
2427
2428    def showBranch(self, node, parent, lev, i):
2429        bdes = node.branch_descriptions[i]
2430        bdes = node.branch_selector.class_var.name + \
2431            (bdes[0] not in "<=>" and "=" or "") + bdes
2432        if node.branches[i]:
2433            nodedes = self.nodeStr and ": " + \
2434                self.formatString(self.nodeStr, node.branches[i], node) or ""
2435        else:
2436            nodedes = "<null node>"
2437        return "|    "*lev + bdes + nodedes
2438
2439
2440    def dumpTree0(self, node, parent, lev):
2441        if node.branches:
2442            if node.distribution.abs < self.minExamples or \
2443                lev > self.maxDepth:
2444                return "|    "*lev + ". . .\n"
2445
2446            res = ""
2447            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
2448                leafsep = "\n" + ("|    "*lev) + "    "
2449            else:
2450                leafsep = ""
2451            if self.simpleFirst:
2452                for i, branch in enumerate(node.branches):
2453                    if not branch or not branch.branches:
2454                        if self.leafStr == self.nodeStr:
2455                            res += "%s\n" % \
2456                                self.showBranch(node, parent, lev, i)
2457                        else:
2458                            res += "%s: %s\n" % \
2459                                (self.showBranch(node, parent, lev, i),
2460                                 leafsep +
2461                                 self.formatString(self.leafStr, branch, node))
2462            for i, branch in enumerate(node.branches):
2463                if branch and branch.branches:
2464                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
2465                                       self.dumpTree0(branch, node, lev + 1))
2466                elif not self.simpleFirst:
2467                    if self.leafStr == self.nodeStr:
2468                        res += "%s\n" % self.showBranch(node, parent, lev, i)
2469                    else:
2470                        res += "%s: %s\n" % \
2471                            (self.showBranch(node, parent, lev, i),
2472                             leafsep +
2473                             self.formatString(self.leafStr, branch, node))
2474            return res
2475        else:
2476            return self.formatString(self.leafStr, node, parent)
2477
2478
2479    def dumpTree(self):
2480        node = self.node()
2481        if self.nodeStr:
2482            lev, res = 1, "root: %s\n" % \
2483                self.formatString(self.nodeStr, node, None)
2484            self.maxDepth += 1
2485        else:
2486            lev, res = 0, ""
2487        return res + self.dumpTree0(node, None, lev)
2488
2489
2490    def dotTree0(self, node, parent, internalName):
2491        if node.branches:
2492            if node.distribution.abs < self.minExamples or \
2493                len(internalName) - 1 > self.maxDepth:
2494                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
2495                    _quoteName(internalName))
2496                return
2497
2498            label = node.branch_selector.class_var.name
2499            if self.nodeStr:
2500                label += "\\n" + self.formatString(self.nodeStr, node, parent)
2501            self.fle.write('%s [ shape=%s label="%s"]\n' % \
2502                (_quoteName(internalName), self.nodeShape, label))
2503
2504            for i, branch in enumerate(node.branches):
2505                if branch:
2506                    internalBranchName = "%s-%d" % (internalName, i)
2507                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
2508                        (_quoteName(internalName),
2509                         _quoteName(internalBranchName),
2510                         node.branch_descriptions[i]))
2511                    self.dotTree0(branch, node, internalBranchName)
2512
2513        else:
2514            self.fle.write('%s [ shape=%s label="%s"]\n' % \
2515                (_quoteName(internalName), self.leafShape,
2516                self.formatString(self.leafStr, node, parent)))
2517
2518
2519    def dotTree(self, internalName="n"):
2520        self.fle.write("digraph G {\n")
2521        self.dotTree0(self.node(), None, internalName)
2522        self.fle.write("}\n")
2523
2524def _quoteName(x):
2525    return '"%s"' % (base64.b64encode(x))
2526
2527class TreeClassifier(Orange.classification.Classifier):
2528    """
2529
2530    Classifies instances according to the tree stored in :obj:`tree`.
2531
2532    **The classification process**
2533
2534    :obj:`TreeClassifier` uses the :obj:`descender` to descend the
2535    instance from the root. If the :obj:`descender` returns only a
2536    :obj:`Node` and no distribution, the descend should stop as the node
2537    was unambiguously selected. The node's :obj:`~Node.node_classifier`
2538    decides the class.
2539
2540    If the descender returns a :obj:`Node` and a distribution, as it
2541    happens, for example, if the instance's value for the :obj:`Node`'s
2542    feature is unknown, the same process repeats for all subtrees and
2543    their predictions are combined.
2544
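    A sketch of obtaining both the predicted value and the class
    distribution (assuming ``tree`` and ``data`` from the examples
    above)::

        value, dist = tree(data[0],
            Orange.classification.Classifier.GetBoth)
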
2545    **Attributes**
2546
2547    .. attribute:: tree
2548
2549        The root of the tree, as a :class:`Node`.
2550
2551    .. attribute:: descender
2552
2553        A :obj:`Descender` used to descend an instance from the root as
2554        deeply as possible according to the instance's feature values.
2555    """
2556
2557    def __init__(self, base_classifier=None):
2558        if not base_classifier: base_classifier = _TreeClassifier()
2559        self.nativeClassifier = base_classifier
2560        for k, v in self.nativeClassifier.__dict__.items():
2561            self.__dict__[k] = v
2562
2563    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
2564                 *args, **kwdargs):
2565        """Classify a new instance.
2566
2568        :param instance: instance to be classified.
2569        :type instance: :class:`Orange.data.Instance`
2570        :param result_type:
2571              :class:`Orange.classification.Classifier.GetValue` or \
2572              :class:`Orange.classification.Classifier.GetProbabilities` or
2573              :class:`Orange.classification.Classifier.GetBoth`
2574       
2575        :rtype: :class:`Orange.data.Value`,
2576              :class:`Orange.statistics.Distribution` or a tuple with both
2577        """
2578        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
2579
2580    def __setattr__(self, name, value):
2581        if name == "nativeClassifier":
2582            self.__dict__[name] = value
2583            return
2584        if name in self.nativeClassifier.__dict__:
2585            self.nativeClassifier.__dict__[name] = value
2586        self.__dict__[name] = value
2587
2588    def __str__(self):
2589        return self.to_string()
2590
2591    @Orange.misc.deprecated_keywords({"fileName": "file_name", \
2592        "leafStr": "leaf_str", "nodeStr": "node_str", \
2593        "userFormats": "user_formats", "minExamples": "min_instances", \
2594        "min_examples": "min_instances", \
2595        "maxDepth": "max_depth", "simpleFirst": "simple_first"})
2596    def to_string(self, leaf_str="", node_str="", \
2597            user_formats=[], min_instances=0, max_depth=1e10, \
2598            simple_first=True):
2599        """
2600        Return a string representation of a tree.
2601
2602        :arg leaf_str: The format string for the tree leaves. If
2603          left empty, ``"%V (%^.2m%)"`` will be used for classification trees
2604          and ``"%V"`` for regression trees.
2605        :type leaf_str: string
2606        :arg node_str: The format string for the internal nodes.
2607          If left empty (as it is by default), no data is printed out for
2608          internal nodes. If set to ``"."``, the same string is
2609          used as for leaves.
2610        :type node_str: string
2611        :arg max_depth: If set, it limits the depth to which the tree is
2612          printed out.
2613        :type max_depth: integer
        :arg min_instances: If set, subtrees with fewer than the given
          number of instances are not printed.
2616        :type min_instances: integer
2617        :arg simple_first: If True (default), the branches with a single
2618          node are printed before the branches with larger subtrees.
2619          If False, the branches are printed in order of
2620          appearance.
2621        :type simple_first: boolean
        :arg user_formats: A list of (regular expression, callback
          function) pairs with which the user can print out additional
          information in the nodes.
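
        For example, a hypothetical ``%n`` format that prints the number
        of instances in a node could be added as follows (a sketch; the
        name ``replace_n`` is illustrative)::

            import re

            def replace_n(strg, mo, node, parent, tree):
                # splice the instance count over the matched "%n"
                return strg[:mo.start()] + \
                    str(int(node.distribution.abs)) + strg[mo.end():]

            print tree.to_string(leaf_str="%V (%n)",
                user_formats=[(re.compile("%n"), replace_n)])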
2625        """
2626        return _TreeDumper(leaf_str, node_str, user_formats +
2627            _TreeDumper.defaultStringFormats, min_instances,
2628            max_depth, simple_first, self).dumpTree()
2629
2630    dump = to_string
2631
2632    @Orange.misc.deprecated_keywords({"fileName": "file_name", \
2633        "leafStr": "leaf_str", "nodeStr": "node_str", \
2634        "leafShape": "leaf_shape", "nodeShape": "node_shape", \
2635        "userFormats": "user_formats", \
2636        "minExamples": "min_instances", \
2637        "min_examples": "min_instances", \
2638        "maxDepth": "max_depth", "simpleFirst": "simple_first"})
2639    def dot(self, file_name, leaf_str="", node_str="", \
2640            leaf_shape="plaintext", node_shape="plaintext", \
2641            user_formats=[], min_instances=0, max_depth=1e10, \
2642            simple_first=True):
2643        """ Print the tree to a file in a format used by `GraphViz
2644        <http://www.research.att.com/sw/tools/graphviz>`_.  Uses the
2645        same parameters as :meth:`to_string` plus two which define the shape
2646        of internal nodes and leaves of the tree:
2647
2648        :param leaf_shape: Shape of the outline around leaves of the tree.
2649            If "plaintext", no outline is used (default: "plaintext").
2650        :type leaf_shape: string
2651        :param node_shape: Shape of the outline around internal nodes
2652            of the tree. If "plaintext", no outline is used (default: "plaintext")
2653        :type node_shape: string
2654
2655        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
2656        for various outlines supported by GraphViz.
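
        A sketch of typical use::

            tree.dot("tree.dot", leaf_str="%V", node_shape="ellipse")

        The resulting file can then be rendered with, for instance,
        ``dot -Tpng tree.dot -o tree.png``.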
2657        """
        fle = open(file_name, "wt") if isinstance(file_name, basestring) else file_name
2659
2660        _TreeDumper(leaf_str, node_str, user_formats +
2661            _TreeDumper.defaultStringFormats, min_instances,
2662            max_depth, simple_first, self,
2663            leafShape=leaf_shape, nodeShape=node_shape, fle=fle).dotTree()
2664
2665    def count_nodes(self):
2666        """
2667        Return the number of nodes.
2668        """
2669        return _countNodes(self.tree)
2670
2671    def count_leaves(self):
2672        """
2673        Return the number of leaves.
2674        """
2675        return _countLeaves(self.tree)
2676
2677    def to_network(self):
2678        net = Orange.network.DiGraph()
2679        if self.class_var.var_type == Orange.feature.Type.Discrete:
2680            domain = Orange.data.Domain([self.class_var] +
2681                [Orange.feature.Continuous(name) for name in
2682                 ["instances", "majority proportion"] + list(self.class_var.values)], None)
2683        else:
2684            domain = Orange.data.Domain([self.class_var] +
2685                [Orange.feature.Continuous(name) for name in
2686                 ["error", "instances"]], None)
2687        domain = Orange.data.Domain(domain)
2688        data = Orange.data.Table(domain)
2689        self.to_network0(self.tree, net, data)
2690        net.set_items(data)
2691        return net
2692
2693    def to_network0(self, node, net, table):
2694        node_id = len(table)
2695        net.add_node(node_id)
2696        d = node.distribution
2697        maj = node.node_classifier.default_value
2698        if self.class_var.var_type == Orange.feature.Type.Discrete:
2699            if d.abs > 1e-6:
2700                table.append([maj, d.abs, d[maj]] + [x / d.abs for x in d])
2701            else:
2702                table.append([maj] + [0] * (2 + len(d)))
2703        else:
2704            table.append([maj, d.error(), d.abs])
2705        if node.branches:
2706            for branch in node.branches:
2707                if branch:
2708                    child_id = self.to_network0(branch, net, table)
2709                    net.add_edge(node_id, child_id)
2710        return node_id
2711
2712def _countNodes(node):
2713    count = 0
2714    if node:
2715        count += 1
2716        if node.branches:
2717            for node in node.branches:
2718                count += _countNodes(node)
2719    return count
2720
2721def _countLeaves(node):
2722    count = 0
2723    if node:
2724        if node.branches: # internal node
2725            for node in node.branches:
2726                count += _countLeaves(node)
2727        else:
2728            count += 1
2729    return count
2730