source: orange/orange/Orange/classification/tree.py @ 9349:fa13a2c52fcd

Revision 9349:fa13a2c52fcd, 96.8 KB checked in by mitar, 2 years ago (diff)

Changed way of linking to code in documentation.

Line 
1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8*******************************
9Classification trees (``tree``)
10*******************************
11
12Orange includes multiple implementations of classification tree learners:
13a very flexible :class:`TreeLearner`, a fast :class:`SimpleTreeLearner`,
14and a :class:`C45Learner`, which uses the C4.5 tree induction
15algorithm.
16
17The following code builds a :obj:`TreeClassifier` on the Iris data set
18(with the depth limited to three levels):
19
20.. literalinclude:: code/orngTree1.py
21   :lines: 1-4
22
23See `Decision tree learning
24<http://en.wikipedia.org/wiki/Decision_tree_learning>`_ on Wikipedia
25for introduction to classification trees.
26
27======================
28Learner and Classifier
29======================
30
31.. autoclass:: TreeLearner
32    :members:
33
34.. autoclass:: TreeClassifier
35    :members:
36
37.. class:: Node
38
39    Classification trees are a hierarchy of :obj:`Node` classes.
40
41    Node stores the instances belonging to the node, a branch selector,
42    a list of branches (if the node is not a leaf) with their descriptions
43    and strengths, and a classifier.
44
45    .. attribute:: distribution
46   
47        A distribution of learning instances.
48
49    .. attribute:: contingency
50
51        Complete contingency matrices for the learning instances.
52
53    .. attribute:: examples, weightID
54
55        Learning instances and the ID of weight meta attribute. The root
56        of the tree actually stores all instances, while other nodes
57        store only reference to instances in the root node.
58
59    .. attribute:: node_classifier
60
61        A classifier for instances coming to the node. If the node is a
62        leaf, it chooses the class (or class distribution) of an instance.
63
64    Internal nodes have additional attributes. The lists :obj:`branches`,
65    :obj:`branch_descriptions` and :obj:`branch_sizes` are of the
66    same length.
67
68    .. attribute:: branches
69
70        A list of subtrees. Each element is a :obj:`Node` or None.
71        If None, the node is empty.
72
73    .. attribute:: branch_descriptions
74
75        A list with strings describing branches. They are constructed
76        by :obj:`SplitConstructor`. A string can contain anything,
77        for example 'red' or '>12.3'.
78
79    .. attribute:: branch_sizes
80
81        A (weighted) number of training instances for
82        each branch. It can be used, for instance, for modeling
83        probabilities when classifying instances with unknown values.
84
85    .. attribute:: branch_selector
86
87        A :obj:`~Orange.classification.Classifier` that returns a branch
88        for each instance (as
89        :obj:`Orange.data.Value` in ``[0, len(branches)-1]``).  When an
90        instance cannot be classified unambiguously, the selector can
91        return a discrete distribution, which proposes how to divide
92        the instance between the branches. Whether the proposition will
93        be used depends upon the :obj:`Splitter` (for learning)
94        or :obj:`Descender` (for classification).
95
96    .. method:: tree_size()
97       
98        Return the number of nodes in the subtrees (including the node,
99        excluding null-nodes).
100
101========
102Examples
103========
104
105Tree Structure
106==============
107
108This example works with the lenses data set:
109
110.. literalinclude:: code/treestructure.py
111   :lines: 7-10
112
113The following function counts the number of nodes in a tree:
114
115.. literalinclude:: code/treestructure.py
116   :lines: 12-21
117
118If node is None, the function above return 0. Otherwise, the size is 1
119(this node) plus the sizes of all subtrees. The algorithm need to check
120if a node is internal (it has a :obj:`~Node.branch_selector`), as leaves
121don't have the :obj:`~Node.branches` attribute.
122
123    >>> tree_size(tree_classifier.tree)
124    10
125
126Note that a :obj:`Node` already has a built-in method
127:func:`~Node.tree_size`.
128
129Trees can be printed with a simple recursive function:
130
131.. literalinclude:: code/treestructure.py
132   :lines: 26-41
133
134The crux of the example is not in the formatting (\\n's etc.);
135what matters is everything but the print statements. The code
136separately handles three node types:
137
138* For null nodes (a node to which no learning instances were classified),
139  it just prints "<null node>".
140* For internal nodes, it prints a node description:
141  the feature's name and distribution of classes. :obj:`Node`'s
142  branch description is an :obj:`~Orange.classification.Classifier`,
143  and its ``class_var`` is the feature whose name is printed.  Class
144  distributions are printed as well (they are assumed to be stored).
145  The :obj:`printTree0` with a level increased by 1 to increase the
146  indent is recursively called for each branch.
147* If the node is a leaf, it prints the distribution of learning instances
148  in the node and the class to which the instances in the node would
149  be classified. We assume that the :obj:`~Node.node_classifier` is a
150  :obj:`DefaultClassifier`. A better print function should be aware of
151  possible alternatives.
152
153The wrapper function that accepts either a
154:obj:`TreeClassifier` or a :obj:`Node` can be written as follows:
155
156.. literalinclude:: code/treestructure.py
157   :lines: 43-49
158
159It's straightforward: if ``x`` is a
160:obj:`TreeClassifier`, it prints ``x.tree``; if it's :obj:`Node` it
161print ``x``. If it's of some other type,
162an exception is raised. The output::
163
164    >>> print_tree(tree_classifier)
165    tear_rate (<15.000, 5.000, 4.000>)
166    : reduced --> none (<12.000, 0.000, 0.000>)
167    : normal
168       astigmatic (<3.000, 5.000, 4.000>)
169       : no
170          age (<1.000, 5.000, 0.000>)
171          : young --> soft (<0.000, 2.000, 0.000>)
172          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
173          : presbyopic --> none (<1.000, 1.000, 0.000>)
174       : yes
175          prescription (<2.000, 0.000, 4.000>)
176          : myope --> hard (<0.000, 0.000, 3.000>)
177          : hypermetrope --> none (<2.000, 0.000, 1.000>)
178
179The tree structure examples conclude with a simple pruning function,
180written entirely in Python and unrelated to any :class:`Pruner`. It limits
181the tree depth (the number of internal nodes on any path down the tree).
182For example, to get a two-level tree, call cut_tree(root, 2). The function
183is recursive, with the second argument (level) decreasing at each call;
184when zero, the current node will be made a leaf:
185
186.. literalinclude:: code/treestructure.py
187   :lines: 54-62
188
189The function acts only when :obj:`node` and :obj:`node.branch_selector`
190are defined. If the level is not zero, is recursively calls  the function
191for each branch. Otherwise, it clears the selector, branches and branch
192descriptions.
193
194    >>> cutTree(tree.tree, 2)
195    >>> printTree(tree)
196    tear_rate (<15.000, 5.000, 4.000>)
197    : reduced --> none (<12.000, 0.000, 0.000>)
198    : normal
199       astigmatic (<3.000, 5.000, 4.000>)
200       : no --> soft (<1.000, 5.000, 0.000>)
201       : yes --> hard (<2.000, 0.000, 4.000>)
202
203Setting learning parameters
204===========================
205
206Let us construct a :obj:`TreeLearner` to play with:
207
208.. literalinclude:: code/treelearner.py
209   :lines: 7-10
210
211There are three crucial components in learning: the
212:obj:`~TreeLearner.split` and :obj:`~TreeLearner.stop` criteria, and the
213example :obj:`~TreeLearner.splitter`. The default ``stop`` is set with:
214
215    >>> learner.stop = Orange.classification.tree.StopCriteria_common()
216
217The default stopping parameters are:
218
219    >>> print learner.stop.max_majority, learner.stop.min_examples
220    1.0 0.0
221
222The defaults only stop splitting when no instances are left or all of
223them are in the same class.
224
225If the minimal subset that is allowed to be split further is set to five
226instances, the resulting tree is smaller.
227
228    >>> learner.stop.min_instances = 5.0
229    >>> tree = learner(data)
230    >>> print tree
231    tear_rate=reduced: none (100.00%)
232    tear_rate=normal
233    |    astigmatic=no
234    |    |    age=pre-presbyopic: soft (100.00%)
235    |    |    age=presbyopic: none (50.00%)
236    |    |    age=young: soft (100.00%)
237    |    astigmatic=yes
238    |    |    prescription=hypermetrope: none (66.67%)
239    |    |    prescription=myope: hard (100.00%)
240
241We can also limit the maximal proportion of majority class.
242
243    >>> learner.stop.max_majority = 0.5
244    >>> tree = learner(data)
245    >>> print tree
246    none (62.50%)
247
248Redefining tree induction components
249====================================
250
251This example shows how to use a custom stop function.  First, the
252``def_stop`` function defines the default stop function. The first tree
253has some added randomness; the induction also stops in 20% of the
254cases when ``def_stop`` returns False. The stopping criteria for the
255second tree is completely random: it stops induction in 20% of cases.
256Note that in the second case lambda function still has three parameters,
257even though in does not need any, since so many are necessary
258for :obj:`~TreeLearner.stop`.
259
260.. literalinclude:: code/tree3.py
261   :lines: 8-23
262
263=================================
264Learner and Classifier Components
265=================================
266
267Split constructors
268=====================
269
270.. class:: SplitConstructor
271
272    Decide how to divide learning instances, ie. define branching criteria.
273   
274    The :obj:`SplitConstructor` should use the domain
275    contingency when possible, both for speed and adaptability.
276    Sometimes domain contingency does
277    not suffice, for example if ReliefF score is used.
278
279    A :obj:`SplitConstructor` can veto further tree induction by returning
280    no classifier. This is generally related to the number of learning
281    instances that would go in each branch. If there are no splits with
282    more than :obj:`SplitConstructor.min_subset` instances in the branches
283    (null nodes are allowed), the induction is stopped.
284
285    Split constructors that cannot handle a particular feature
286    type (discrete, continuous) quietly skip them. When in doubt, use
287    :obj:`SplitConstructor_Combined`, which delegates features to
288    specialized split constructors.
289
290    The same split constructors can be used both for classification and
291    regression, if the chosen score (for :obj:`SplitConstructor_Score`
292    and derived classes) supports both.
293
294    .. attribute:: min_subset
295
296        The minimal (weighted) number in non-null leaves.
297
298    .. method:: __call__(data, [ weightID, contingency, apriori_distribution, candidates, clsfr])
299
300        :param data: in any acceptable form.
301        :param weightID: Optional; the default of 0 means that all
302            instances have a weight of 1.0.
303        :param contingency: a domain contingency
304        :param apriori_distribution: apriori class probabilities.
305        :type apriori_distribution: :obj:`Orange.statistics.distribution.Distribution`
306        :param candidates: only consider these
307            features (one boolean for each feature).
308        :param clsfr: a node classifier (if it was constructed, that is,
309            if :obj:`store_node_classifier` is True)
310
311        Construct a split. Return a tuple (:obj:`branch_selector`,
312        :obj:`branch_descriptions` (a list), :obj:`subset_sizes`
313        (the number of instances for each branch, may also be
314        empty), :obj:`quality` (higher numbers mean better splits),
315        :obj:`spent_feature`). If no split is constructed,
316        the :obj:`selector`, :obj:`branch_descriptions` and
317        :obj:`subset_sizes` are None, while :obj:`quality` is 0.0 and
318        :obj:`spent_feature` is -1.
319
320        If the chosen feature will be useless in the future and
321        should not be considered for splitting in any of the subtrees
322        (typically, when discrete features are used as-they-are, without
323        any binarization or subsetting), then it should return the index
324        of this feature as :obj:`spent_feature`. If no features are spent,
325        :obj:`spent_feature` is -1.
326
327.. class:: SplitConstructor_Score
328
329    Bases: :class:`SplitConstructor`
330
331    An abstract base class that compare splits
332    with a :class:`Orange.feature.scoring.Score`.  All split
333    constructors except for :obj:`SplitConstructor_Combined` are derived
334    from this class.
335
336    .. attribute:: measure
337
338        A :class:`Orange.feature.scoring.Score` for split evaluation. It
339        has to handle the class type - for example, you cannot use
340        :class:`~Orange.feature.scoring.GainRatio` for regression or
341        :class:`~Orange.feature.scoring.MSE` for classification.
342
343    .. attribute:: worst_acceptable
344
345        The lowest allowed split quality.  The value strongly depends
346        on chosen :obj:`measure` component. Default is 0.0.
347
348.. class:: SplitConstructor_Feature
349
350    Bases: :class:`SplitConstructor_Score`
351
352    Each value of a discrete feature corresponds to a branch.  The feature
353    with the highest score (:obj:`~Measure.measure`) is selected. When
354    tied, a random feature is selected.
355
356    The constructed :obj:`branch_selector` is an instance of
357    :obj:`orange.ClassifierFromVarFD`, that returns a value of the selected
358    feature. :obj:`branch_description` contains the feature's
359    values. The feature is marked as spent (it cannot reappear in the
360    node's subtrees).
361
362.. class:: SplitConstructor_ExhaustiveBinary
363
364    Bases: :class:`SplitConstructor_Score`
365
366    Finds the binarization with the highest score among all features. In
367    case of ties, a random feature is selected.
368
369    The constructed :obj:`branch_selector` is an instance
370    :obj:`orange.ClassifierFromVarFD`, that returns a value of the
371    selected feature. Its :obj:`transformer` contains a ``MapIntValue``
372    that maps values of the feature into a binary feature. Branches
373    with a single feature value are described with that value and
374    branches with more than one are described with ``[<val1>, <val2>,
375    ..., <valn>]``. Only binary features are marked as spent.
376
377.. class:: SplitConstructor_Threshold
378
379    Bases: :class:`SplitConstructor_Score`
380
381    The only split constructor for continuous features. It divides the
382    range of feature values with a threshold that maximizes the split's
383    quality. In case of ties, a random feature is selected.  The feature
384    that yields the best binary split is returned.
385
386    The constructed :obj:`branch_selector` is an instance of
387    :obj:`orange.ClassifierFromVarFD` with an attached :obj:`transformer`,
388    of type :obj:`Orange.feature.discretization.ThresholdDiscretizer`. The
389    branch descriptions are "<threshold" and ">=threshold". The feature
390    is not spent.
391
392.. class:: SplitConstructor_OneAgainstOthers
393   
394    Bases: :class:`SplitConstructor_Score`
395
396    Undocumented.
397
398.. class:: SplitConstructor_Combined
399
400    Bases: :class:`SplitConstructor`
401
402    Uses different split constructors for discrete and continuous
403    features. Each split constructor is called with appropriate
404    features. Both construct a candidate for a split; the better of them
405    is used.
406
407    The choice of the split is not probabilistically fair, when
408    multiple candidates have the same score. For example, if there
409    are nine discrete features with the highest score the split
410    constructor for discrete features will select one of them. Now,
411    if there is also a single continuous feature with the same score,
412    :obj:`SplitConstructor_Combined` would randomly select between the
413    proposed discrete feature and continuous feature, unaware that the
414    discrete feature  has already competed with eight others.  So,
415    the probability for selecting (each) discrete feature would be
416    1/18 instead of 1/10. Although incorrect, this should not affect
417    the performance.
418
419    .. attribute: discrete_split_constructor
420
421        Split constructor for discrete features;
422        for instance, :obj:`SplitConstructor_Feature` or
423        :obj:`SplitConstructor_ExhaustiveBinary`.
424
425    .. attribute: continuous_split_constructor
426
427        Split constructor for continuous features; it
428        can be either :obj:`SplitConstructor_Threshold` or a
429        a custom split constructor.
430
431
432StopCriteria and StopCriteria_common
433============================================
434
435:obj:`StopCriteria` determines when to stop the induction of subtrees.
436
437.. class:: StopCriteria
438
439    Provides the basic stopping criteria: the tree induction stops
440    when there is at most one instance left (the actual, not weighted,
441    number). The induction also stops when all instances are in the
442    same class (for discrete problems) or have the same outcome value
443    (for regression problems).
444
445    .. method:: __call__(instances[, weightID, domain contingencies])
446
447        Return True (stop) of False (continue the induction).
448        Contingencies are not used for counting. Derived classes should
449        use the contingencies whenever possible.
450
451.. class:: StopCriteria_common
452
453    Pre-pruning with additional criteria.
454
455    .. attribute:: max_majority
456
457        Maximum proportion of majority class. When exceeded,
458        induction stops.
459
460    .. attribute:: min_instances
461
462        Minimum number of instances for splitting. Subsets with less
463        than :obj:`min_instances` instances are not split further.
464        The sample count is weighed.
465
466
467Splitters
468=================
469
470Splitters sort learning instances into branches (the branches are selected
471with a :obj:`SplitConstructor`, while a :obj:`Descender` decides the
472branch for an instance during classification).
473
474Most splitters call :obj:`Node.branch_selector` and assign
475instances correspondingly. When the value is unknown they choose a
476particular branch or skip the instance.
477
478Some splitters can also split instances: a weighed instance is
479used in more than than one subset. Each branch has a weight ID (usually,
480each its own ID) and all instances in that branch should have this meta attribute.
481
482An instance that
483hasn't been split has only one additional attribute (weight
484ID corresponding to the subset to which it went). Instance that is split
485between, say, three subsets, has three new meta attributes, one for each
486subset. The weights are used only when needed; when there is no
487splitting - no weight IDs are returned.
488
489.. class:: Splitter
490
491    An abstract base class that splits instances
492    into subsets.
493
494    .. method:: __call__(node, instances[, weightID])
495
496        :param node: a node.
497        :type node: :obj:`Node`
498        :param instances: a set of instances
499        :param weightID: weight ID.
500       
501        Use the information in :obj:`Node` (particularly the
502        :obj:`~Node.branch_selector`) to split the given set of instances into
503        subsets.  Return a tuple with a list of instance subsets and
504        a list of weights.  The list of weights is either a
505        list of integers or None when no weights are added.
506
507.. class:: Splitter_IgnoreUnknowns
508
509    Bases: :class:`Splitter`
510
511    Ignores the instances for which no single branch can be determined.
512
513.. class:: Splitter_UnknownsToCommon
514
515    Bases: :class:`Splitter`
516
517    Places all ambiguous instances to a branch with the highest number of
518    instances. If there is more than one such branch, one is selected at
519    random and then used for all instances.
520
521.. class:: Splitter_UnknownsToAll
522
523    Bases: :class:`Splitter`
524
525    Splits instances with an unknown value of the feature into all branches.
526
527.. class:: Splitter_UnknownsToRandom
528
529    Bases: :class:`Splitter`
530
531    Selects a random branch for ambiguous instances.
532
533.. class:: Splitter_UnknownsToBranch
534
535    Bases: :class:`Splitter`
536
537    Constructs an additional branch for ambiguous instances.
538    The branch's description is "unknown".
539
540.. class:: Splitter_UnknownsAsBranchSizes
541
542    Bases: :class:`Splitter`
543
544    Splits instances with unknown value of the feature according to
545    proportions of instances in each branch.
546
547.. class:: Splitter_UnknownsAsSelector
548
549    Bases: :class:`Splitter`
550
551    Splits instances with unknown value of the feature according to
552    distribution proposed by selector (usually the same as proportions
553    of instances in branches).
554
555Descenders
556=============================
557
558Descenders decide where should the instances that cannot be unambiguously
559put in a single branch go during classification (the branches are selected
560with a :obj:`SplitConstructor`, while a :obj:`Splitter` sorts instances
561during learning).
562
563.. class:: Descender
564
565    An abstract base tree descender. It descends a
566    an instance as deep as possible, according to the values
567    of instance's features. The :obj:`Descender`: calls the node's
568    :obj:`~Node.branch_selector` to get the branch index. If it's a
569    simple index, the corresponding branch is followed. If not, the
570    descender decides what to do. A descender can choose a single
571    branch (for instance, the one that is the most recommended by the
572    :obj:`~Node.branch_selector`) or it can let the branches vote.
573
574    Three are possible outcomes of a descent:
575
576    #. The descender reaches a leaf. This happens when
577       there were no unknown or out-of-range values, or when the
578       descender selected a single branch and continued the descend
579       despite them. The descender returns the :obj:`Node` it has reached.
580    #. Node's :obj:`~Node.branch_selector` returned a distribution and
581       :obj:`Descender` decided to stop the descend at this (internal)
582       node. It returns the current :obj:`Node`.
583    #. Node's :obj:`~Node.branch_selector` returned a distribution and the
584       :obj:`Node` wants to split the instance (i.e., to decide the class
585       by voting). It returns a :obj:`Node` and the vote-weights for
586       the branches.  The weights can correspond, for example,  to the
587       distribution returned by node's :obj:`~Node.branch_selector`, or to
588       the number of learning instances that were assigned to each branch.
589
590    .. method:: __call__(node, instance)
591
592        Descends until it reaches a leaf or a node in
593        which a vote of subtrees is required. A tuple
594        of two elements is returned. If it reached a leaf, the tuple contains
595        the leaf node and None. If not, it contains a node and
596        a list of floats (weights of votes).
597
598.. class:: Descender_UnknownToNode
599
600    Bases: :obj:`Descender`
601
602    When instance cannot be classified into a single branch, the current
603    node is returned. Thus, the node's :obj:`~Node.node_classifier`
604    will be used to make a decision. Therefore, internal nodes
605    need to have :obj:`Node.node_classifier` defined.
606
607.. class:: Descender_UnknownToBranch
608
609    Bases: :obj:`Descender`
610
611    Classifies instances with unknown value to a special branch. This
612    makes sense only if the tree itself was constructed with
613    :obj:`Splitter_UnknownsToBranch`.
614
615.. class:: Descender_UnknownToCommonBranch
616
617    Bases: :obj:`Descender`
618
619    Classifies instances with unknown values to the branch with the
620    highest number of instances. If there is more than one such branch,
621    random branch is chosen for each instance.
622
623.. class:: Descender_UnknownToCommonSelector
624
625    Bases: :obj:`Descender`
626
627    Classifies instances with unknown values to the branch which received
628    the highest recommendation by the selector.
629
630.. class:: Descender_UnknownMergeAsBranchSizes
631
632    Bases: :obj:`Descender`
633
634    The subtrees vote for the instance's class; the vote is weighted
635    according to the sizes of the branches.
636
637.. class:: Descender_UnknownMergeAsSelector
638
639    Bases: :obj:`Descender`
640
641    The subtrees vote for the instance's class; the vote is weighted
642    according to the selectors proposal.
643
644Pruning
645=======
646
647.. index::
648    pair: classification trees; pruning
649
650The pruners construct a shallow copy of a tree. The pruned tree's
651:obj:`Node` contain references to the same contingency matrices,
652node classifiers, branch selectors, ...  as the original tree.
653
654Pruners cannot construct a new :obj:`~Node.node_classifier`.  Thus, for
655pruning, internal nodes must have :obj:`~Node.node_classifier` defined
656(the default).
657
658.. class:: Pruner
659
660    An abstract base tree pruner.
661
662    .. method:: __call__(tree)
663
664        :param tree: either
665            a :obj:`Node` or (the C++ version of the classifier,
666            saved in :obj:`TreeClassfier.base_classifier`).
667
668        The resulting pruned tree is of the same type as the argument.
669        The original tree remains intact.
670
671.. class:: Pruner_SameMajority
672
673    Bases: :class:`Pruner`
674
675    A tree can have a subtrees where all the leaves have
676    the same majority class. This is allowed because leaves can still
677    have different class distributions and thus predict different
678    probabilities.  The :obj:`Pruner_SameMajority` prunes the tree so
679    that there is no subtree in which all the nodes would have the same
680    majority class.
681
682    This pruner will only prune the nodes in which the node classifier
683    is a :obj:`~Orange.classification.majority.ConstantClassifier`
684    (or a derived class).
685
686    The pruning works from leaves to the root.
687    It siblings have (at least one) common majority class, they can be pruned.
688
689.. class:: Pruner_m
690
691    Bases: :class:`Pruner`
692
693    Prunes a tree by comparing m-estimates of static and dynamic
694    error as defined in (Bratko, 2002).
695
696    .. attribute:: m
697
698        Parameter m for m-estimation.
699
700Printing the tree
701=================
702
703The tree printing functions are very flexible. They can print, for
704example, numbers of instances, proportions of majority class in nodes
705and similar, or more complex statistics like the proportion of instances
706in a particular class divided by the proportion of instances of this
707class in a parent node. Users may also pass their own functions to print
708certain elements.
709
710The easiest way to print the tree is to print :func:`TreeClassifier`::
711
712    >>> print tree
713    petal width<0.800: Iris-setosa (100.00%)
714    petal width>=0.800
715    |    petal width<1.750
716    |    |    petal length<5.350: Iris-versicolor (94.23%)
717    |    |    petal length>=5.350: Iris-virginica (100.00%)
718    |    petal width>=1.750
719    |    |    petal length<4.850: Iris-virginica (66.67%)
720    |    |    petal length>=4.850: Iris-virginica (100.00%)
721
722
723Format string
724-------------
725
726Format strings are printed at every leaf or internal node with the certain
727format specifiers replaced by data from the tree node. Specifiers are
728generally of form **%[^]<precision><quantity><divisor>**.
729
730**^** at the start tells that the number should be multiplied by 100,
731which is useful for proportions like percentages.
732
733**<precision>** is in the same format as in Python (or C) string
734formatting. For instance, ``%N`` denotes the number of instances in
735the node, hence ``%6.2N`` would mean output to two decimal digits
736and six places altogether. If left out, a default format ``5.3`` is
737used, unless the numbers are multiplied by 100, in which case the default
738is ``.0`` (no decimals, the number is rounded to the nearest integer).
739
740**<divisor>** tells what to divide the quantity in that node with.
741``bP`` means division by the same quantity in the parent node; for instance,
742``%NbP`` gives the number of instances in the node divided by the
743number of instances in parent node. Precision formatting can be added,
744e.g. ``%6.2NbP``. ``bA`` denotes division by the same quantity over the entire
745data set, so ``%NbA`` will tell you the proportion of instances (out
746of the entire training data set) that fell into that node. If division is
747impossible since the parent node does not exist or some data is missing,
748a dot is printed out instead.
749
750**<quantity>** defines what to print and is the only required element.
751It can be:
752
753``V``
754    The predicted value at that node. Precision
755    or divisor can not be defined here.
756
757``N``
758    The number of instances in the node.
759
760``M``
761    The number of instances in the majority class (that is, the class
762    predicted by the node).
763
764``m``
765    The proportion of instances in the majority class.
766
767``A``
768    The average class for instances the node; this is available only for
769    regression trees.
770
771``E``
772    Standard error for class of instances in the node; available only for
773    regression trees.
774
775``I``
776    Print out the confidence interval. The modifier is used as
777    ``%I(95)`` of (more complicated) ``%5.3I(95)bP``.
778
779``C``
780    The number of instances in the given class.  For a classification
781    example, ``%5.3C="Iris-virginica"bP`` denotes the number of instances
782    of Iris-virginica by the number of instances this class in the parent
783    node ( instances that are *not* Iris-virginica could be described with
784    ``%5.3CbP!="Iris-virginica"``).
785
786    For regression trees, use operators =, !=, <, <=, >, and >=, as in
787    ``%C<22``, with optional precision and divisor. Intervals are also
788    possible: ``%C[20, 22]`` gives the number of instances between
789    20 and 22 (inclusive) and ``%C(20, 22)`` gives the number of such
790    instances excluding the boundaries. Mixing of parentheses is allowed,
791    e.g. ``%C(20, 22]``.  Add ``!`` for instances outside the interval,
792    like ``%C!(20, 22]``.
793
794``c``
795    Same as ``C``, except that it computes the proportion of the class
796    instead of the number of instances.
797
798``D``
799    The number of instances in each class. Precision and the divisor
800    are applied to each number in the distribution.  This quantity can
801    not be computed for regression trees.
802
803``d``
804    Same as ``D``, except that it shows proportions of instances.
805
806<user defined formats>
807    Instructions and examples of added formats are at the end of this
808    section.
809
810.. rubric:: Examples on classification trees
811
812A tree on the iris data set with the depth limited to three
813levels is built as follows:
814   
815.. literalinclude:: code/orngTree1.py
816   :lines: 1-4
817
818Printing the predicted class at each node, the number
819of instances in the majority class with the total number of instances in
820the node requires a custom format string::
821
822    >>> print tree.format(leaf_str="%V (%M out of %N)")
823    petal width<0.800: Iris-setosa (50.000 out of 50.000)
824    petal width>=0.800
825    |    petal width<1.750
826    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
827    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
828    |    petal width>=1.750
829    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
830    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)
831
832The number of instances as
833compared to the entire data set and to the parent node::
834
835    >>> print tree.format(leaf_str="%V (%^MbA%, %^MbP%)")
836    petal width<0.800: Iris-setosa (100%, 100%)
837    petal width>=0.800
838    |    petal width<1.750
839    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
840    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
841    |    petal width>=1.750
842    |    |    petal length<4.850: Iris-virginica (4%, 4%)
843    |    |    petal length>=4.850: Iris-virginica (86%, 96%)
844
845``%M`` is the number of instances in the majority class. Dividing by
846the number of all instances from this class on the entire data set
847is described with ``%MbA``. Add ``^`` in front for mutiplication with
848100. The percent sign *after* that is printed out literally, just as the
849comma and parentheses. For the proportion of this class in the parent the
850``bA`` is replaced with ``bA``.
851
852To print the number of versicolors in each node, together with the
853proportion of versicolors among the instances in this particular node
854and among all versicolors, use the following::
855
856    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)
857
858It gives::
859
860    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
861    petal width>=0.800
862    |    petal width<1.750
863    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
864    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
865    |    petal width>=1.750
866    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
867    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)
868
869Finally, to print the distributions using a format string ``%D``::
870
871    petal width<0.800: [50.000, 0.000, 0.000]
872    petal width>=0.800
873    |    petal width<1.750
874    |    |    petal length<5.350: [0.000, 49.000, 3.000]
875    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
876    |    petal width>=1.750
877    |    |    petal length<4.850: [0.000, 1.000, 2.000]
878    |    |    petal length>=4.850: [0.000, 0.000, 43.000]
879
880As the order of classes is the same as in ``data.domain.class_var.values``
881(setosa, versicolor, virginica), there are 49 versicolors and 3 virginicae
882in the node at ``petal length<5.350``. To print the proportions within
883nodes rounded to two decimals use ``%.2d``::
884
885    petal width<0.800: [1.00, 0.00, 0.00]
886    petal width>=0.800
887    |    petal width<1.750
888    |    |    petal length<5.350: [0.00, 0.94, 0.06]
889    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
890    |    petal width>=1.750
891    |    |    petal length<4.850: [0.00, 0.33, 0.67]
892    |    |    petal length>=4.850: [0.00, 0.00, 1.00]
893
894The most trivial format string for internal nodes is for printing
895node predictions. ``.`` in the following example specifies
896that the node_str should be the same as leaf_str.
897
898::
899
900    tree.format(leaf_str="%V", node_str=".")
901 
902The output::
903
904    root: Iris-setosa
905    |    petal width<0.800: Iris-setosa
906    |    petal width>=0.800: Iris-versicolor
907    |    |    petal width<1.750: Iris-versicolor
908    |    |    |    petal length<5.350: Iris-versicolor
909    |    |    |    petal length>=5.350: Iris-virginica
910    |    |    petal width>=1.750: Iris-virginica
911    |    |    |    petal length<4.850: Iris-virginica
912    |    |    |    petal length>=4.850: Iris-virginica
913
914A node *root* has appeared and the tree looks one level
915deeper. This is needed to also print the data for tree root.
916
917To observe how the number
918of virginicas decreases down the tree try::
919
920    print tree.format(leaf_str='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', node_str='.')
921
922Interpretation: ``CbA="Iris-virginica"`` is
923the number of instances from virginica, divided by the total number
924of instances in this class. Add ``^.1`` and the result will be
925multiplied and printed with one decimal. The trailing ``%`` is printed
926out. In parentheses the same thing was divided by
927the instances in the parent node. The single quotes were used for strings, so
928that double quotes inside the string can specify the class.
929
930::
931
932    root: 100.0% (.%)
933    |    petal width<0.800: 0.0% (0.0%)
934    |    petal width>=0.800: 100.0% (100.0%)
935    |    |    petal width<1.750: 10.0% (10.0%)
936    |    |    |    petal length<5.350: 6.0% (60.0%)
937    |    |    |    petal length>=5.350: 4.0% (40.0%)
938    |    |    petal width>=1.750: 90.0% (90.0%)
939    |    |    |    petal length<4.850: 4.0% (4.4%)
940    |    |    |    petal length>=4.850: 86.0% (95.6%)
941
942If :meth:`~TreeClassifier.format` cannot compute something, in this case
943because the root has no parent, it prints out a dot.
944
945The final example with classification trees prints the distributions in
946nodes, the distribution compared to the parent, the proportions compared
947to the parent and the predicted class in the leaves::
948
949    >>> print tree.format(leaf_str='"%V   %D %.2DbP %.2dbP"', node_str='"%D %.2DbP %.2dbP"')
950    root: [50.000, 50.000, 50.000] . .
951    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
952    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
953    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
954    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
955    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
956    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
957    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
958    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
959    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
960    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
961    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
962    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
963    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]
964
965
966.. rubric:: Examples on regression trees
967
968The regression trees examples use a tree induced from the housing data
969set. Without other argumets, :meth:`TreeClassifier.format` prints the
970following::
971
972    RM<6.941
973    |    LSTAT<14.400
974    |    |    DIS<1.385: 45.6
975    |    |    DIS>=1.385: 22.9
976    |    LSTAT>=14.400
977    |    |    CRIM<6.992: 17.1
978    |    |    CRIM>=6.992: 12.0
979    RM>=6.941
980    |    RM<7.437
981    |    |    CRIM<7.393: 33.3
982    |    |    CRIM>=7.393: 14.4
983    |    RM>=7.437
984    |    |    TAX<534.500: 45.9
985    |    |    TAX>=534.500: 21.9
986
987To add the standard error in both internal nodes and leaves, and
988the 90% confidence intervals in the leaves, use::
989
990    >>> print tree.format(leaf_str="[SE: %E]\t %V %I(90)", node_str="[SE: %E]")
991    root: [SE: 0.409]
992    |    RM<6.941: [SE: 0.306]
993    |    |    LSTAT<14.400: [SE: 0.320]
994    |    |    |    DIS<1.385: [SE: 4.420]:
995    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
996    |    |    |    DIS>=1.385: [SE: 0.244]:
997    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
998    |    |    LSTAT>=14.400: [SE: 0.333]
999    |    |    |    CRIM<6.992: [SE: 0.338]:
1000    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
1001    |    |    |    CRIM>=6.992: [SE: 0.448]:
1002    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
1003    |    RM>=6.941: [SE: 1.031]
1004    |    |    RM<7.437: [SE: 0.958]
1005    |    |    |    CRIM<7.393: [SE: 0.692]:
1006    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
1007    |    |    |    CRIM>=7.393: [SE: 2.157]:
1008    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
1009    |    |    RM>=7.437: [SE: 1.124]
1010    |    |    |    TAX<534.500: [SE: 0.817]:
1011    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
1012    |    |    |    TAX>=534.500: [SE: 0.000]:
1013    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]
1014
1015The predicted value (``%V``) and the average (``%A``) may differ because
1016a regression tree does not always predict the leaf average, but whatever
1017the :obj:`~Node.node_classifier` in a leaf returns.  As ``%V`` uses the
1018:obj:`Orange.data.variable.Continuous`' function for printing the
1019value, the number has the same number of decimals as in the data file.
1020
1021Regression trees cannot print the distributions in the same way
1022as classification trees. They instead offer a set of operators for
1023observing the number of instances within a certain range. For instance,
1024to print the number of instances with values below 22 and compare
1025it with values in the parent nodes use::
1026
1027    >>> print tree.format(leaf_str="%C<22 (%cbP<22)", node_str=".")
1028    root: 277.000 (.)
1029    |    RM<6.941: 273.000 (1.160)
1030    |    |    LSTAT<14.400: 107.000 (0.661)
1031    |    |    |    DIS<1.385: 0.000 (0.000)
1032    |    |    |    DIS>=1.385: 107.000 (1.020)
1033    |    |    LSTAT>=14.400: 166.000 (1.494)
1034    |    |    |    CRIM<6.992: 93.000 (0.971)
1035    |    |    |    CRIM>=6.992: 73.000 (1.040)
1036    |    RM>=6.941: 4.000 (0.096)
1037    |    |    RM<7.437: 3.000 (1.239)
1038    |    |    |    CRIM<7.393: 0.000 (0.000)
1039    |    |    |    CRIM>=7.393: 3.000 (15.333)
1040    |    |    RM>=7.437: 1.000 (0.633)
1041    |    |    |    TAX<534.500: 0.000 (0.000)
1042    |    |    |    TAX>=534.500: 1.000 (30.000)</xmp>
1043
1044The last line, for instance, says the the number of instances with the
1045class below 22 is among those with tax above 534 is 30 times higher than
1046the number of such instances in its parent node.
1047
1048To count the same for all instances *outside*
1049interval [20, 22] and print out the proportions as percents use::
1050
1051    >>> print tree.format(leaf_str="%C![20,22] (%^cbP![20,22]%)", node_str=".")
1052
1053The format string  ``%c![20, 22]`` denotes the proportion of instances
1054(within the node) whose values are below 20 or above 22. ``%cbP![20,
105522]`` derives same statistics computed on the parent. A ``^`` is added
1056for percentages.
1057
1058::
1059
1060    root: 439.000 (.%)
1061    |    RM<6.941: 364.000 (98%)
1062    |    |    LSTAT<14.400: 200.000 (93%)
1063    |    |    |    DIS<1.385: 5.000 (127%)
1064    |    |    |    DIS>=1.385: 195.000 (99%)
1065    |    |    LSTAT>=14.400: 164.000 (111%)
1066    |    |    |    CRIM<6.992: 91.000 (96%)
1067    |    |    |    CRIM>=6.992: 73.000 (105%)
1068    |    RM>=6.941: 75.000 (114%)
1069    |    |    RM<7.437: 46.000 (101%)
1070    |    |    |    CRIM<7.393: 43.000 (100%)
1071    |    |    |    CRIM>=7.393: 3.000 (100%)
1072    |    |    RM>=7.437: 29.000 (98%)
1073    |    |    |    TAX<534.500: 29.000 (103%)
1074    |    |    |    TAX>=534.500: 0.000 (0%)
1075
1076
1077Defining custom printouts
1078-------------------------
1079
1080:meth:`TreeClassifier.format`'s argument :obj:`user_formats` can be used to
1081print other information.  :obj:`~TreeClassifier.format.user_formats` should
1082contain a list of tuples with a regular expression and a function to be
1083called when that expression is found in the format string. Expressions
1084from :obj:`user_formats` are checked before the built-in expressions
1085discussed above.
1086
1087The regular expression should describe a string like used above,
1088for instance ``%.2DbP``. When a leaf or internal node
1089is printed, the format string (:obj:`leaf_str` or :obj:`node_str`)
1090is checked for these regular expressions and when the match is found,
1091the corresponding callback function is called.
1092
1093The passed function will get five arguments: the format string
1094(:obj:`leaf_str` or :obj:`node_str`), the match object, the node which is
1095being printed, its parent (can be None) and the tree classifier.
1096The function should return the format string in which the part described
1097by the match object (that is, the part that is matched by the regular
1098expression) is replaced by whatever information your callback function
1099is supposed to give.
1100
1101The function can use several utility function provided in the module.
1102
1103.. autofunction:: insert_str
1104
1105.. autofunction:: insert_dot
1106
1107.. autofunction:: insert_num
1108
1109.. autofunction:: by_whom
1110
1111The module also includes reusable regular expressions:
1112
1113.. autodata:: fs
1114
1115.. autodata:: by
1116
1117For a trivial example, ``%V`` is implemented with the
1118following tuple::
1119
1120    (re.compile("%V"), replaceV)
1121
1122And ``replaceV`` is defined by::
1123
1124    def replaceV(strg, mo, node, parent, tree):
1125        return insert_str(strg, mo, str(node.node_classifier.default_value))
1126
1127``replaceV`` takes the value predicted at the node
1128(``node.node_classifier.default_value`` ), converts it to a string
1129and passes it to :func:`insert_str`.
1130
1131A more complex regular expression is the one for the proportion of
1132majority class, defined as ``"%"+fs+"M"+by``. It uses the two partial
1133expressions defined above (:obj:`fs` and :obj:`by`).
1134
1135The following code prints the classification margin for each node,
1136that is, the difference between the proportion of the largest and the
1137second largest class in the node:
1138
1139.. literalinclude:: code/orngTree2.py
1140   :lines: 7-31
1141
1142``get_margin`` computes the margin from the distribution. The replacing
1143function, ``replaceB``, computes the margin for the node.  If :data:`by`
1144group is present, we call :func:`by_whom` to get the node with whose
1145margin this node's margin is to be divided. If this node (usually the
1146parent) does not exist of if its margin is zero, :func:`insert_dot`
1147inserts dot, otherwise :func:`insert_num` is called which inserts the
1148number in the user-specified format.  ``my_format`` contains the regular
1149expression and the callback function.
1150
1151Printing the tree with
1152
1153.. literalinclude:: code/orngTree2.py
1154    :lines: 33
1155
1156yields::
1157
1158    petal width<0.800: Iris-setosa 100% (100.00%)
1159    petal width>=0.800
1160    |    petal width<1.750
1161    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
1162    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
1163    |    petal width>=1.750
1164    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
1165    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)
1166
1167Plotting with Dot
1168---------------------------
1169
1170To produce images of trees, first create a .dot file with
1171:meth:`TreeClassifier.dot`. If it was saved to "tree5.dot", plot a gif
1172with the following command::
1173
1174    dot -Tgif tree5.dot -otree5.gif
1175
1176Check GraphViz's dot documentation for more options and
1177output formats.
1178
1179
1180===========================
1181C4.5 Classifier and Learner
1182===========================
1183
1184C4.5 is, as  a standard benchmark in machine learning, incorporated in
1185Orange. The implementation uses the original C4.5 code, so the resulting
1186tree is exactly like the one that would be build by standalone C4.5. The
1187tree build is only made accessible in Python.
1188
1189:class:`C45Learner` and :class:`C45Classifier` behave
1190like any other Orange learner and classifier. Unlike most of Orange
1191learning algorithms, C4.5 does not accepts weighted instances.
1192
1193Building the C4.5 plug-in
1194=========================
1195
1196C4.5 is not distributed with Orange, but it can be added as a
1197plug-in. A C compiler is needed for the procedure: on Windows MS Visual C
1198(CL.EXE and LINK.EXE must be on the PATH), on Linux and OS X gcc (OS X
1199users can download it from Apple).
1200
1201Orange must be installed prior to building C4.5.
1202
1203#. Download
1204   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
1205   from the `Rule Quest's site <http://www.rulequest.com/>`_ and extract
1206   them. The files will be modified in the
1207   further process.
1208#. Download
1209   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
1210   and unzip its contents into the directory R8/Src of the C4.5 sources
1211   (this directory contains, for instance, the file average.c).
1212#. Run buildC45.py, which will build the plug-in and put it next to
1213   orange.pyd (or orange.so on Linux/Mac).
1214#. Run python, type ``import Orange`` and
1215   create ``Orange.classification.tree.C45Learner()``. This should
1216   succeed.
1217#. Finally, you can remove C4.5 sources.
1218
1219The buildC45.py creates .h files that wrap Quinlan's .i files and
1220ensure that they are not included twice. It modifies C4.5 sources to
1221include .h's instead of .i's (this step can hardly fail). Then it compiles
1222ensemble.c into c45.dll or c45.so and puts it next to Orange. In the end
1223it checks if the built C4.5 gives the same results as the original.
1224
1225.. autoclass:: C45Learner
1226    :members:
1227
1228.. autoclass:: C45Classifier
1229    :members:
1230
1231.. class:: C45Node
1232
1233    This class is a reimplementation of the corresponding *struct* from
1234    Quinlan's C4.5 code.
1235
1236    .. attribute:: node_type
1237
1238        Type of the node:  :obj:`C45Node.Leaf` (0),
1239        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
1240        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
1241        split instances based on values of a discrete attribute,
1242        "cuts" cut them according to a threshold value of a continuous
1243        attributes and "subsets" use discrete attributes but with subsetting
1244        so that several values can go into the same branch.
1245
1246    .. attribute:: leaf
1247
1248        Value returned by that leaf. The field is defined for internal
1249        nodes as well.
1250
1251    .. attribute:: items
1252
1253        Number of (learning) instances in the node.
1254
1255    .. attribute:: class_dist
1256
1257        Class distribution for the node (of type
1258        :obj:`Orange.statistics.distribution.Discrete`).
1259
1260    .. attribute:: tested
1261       
1262        The attribute used in the node's test. If node is a leaf,
1263        obj:`tested` is None, if node is of type :obj:`Branch` or :obj:`Cut`
1264        :obj:`tested` is a discrete attribute, and if node is of type
1265        :obj:`Cut` then :obj:`tested` is a continuous attribute.
1266
1267    .. attribute:: cut
1268
1269        A threshold for continuous attributes, if node is of type :obj:`Cut`.
1270        Undefined otherwise.
1271
1272    .. attribute:: mapping
1273
1274        Mapping for nodes of type :obj:`Subset`. Element ``mapping[i]``
1275        gives the index for an instance whose value of :obj:`tested` is *i*.
1276        Here, *i* denotes an index of value, not a :class:`Orange.data.Value`.
1277
1278    .. attribute:: branch
1279       
1280        A list of branches stemming from this node.
1281
1282Examples
1283========
1284
1285This
1286script constructs the same learner as you would get by calling
1287the usual C4.5:
1288
1289.. literalinclude:: code/tree_c45.py
1290   :lines: 7-14
1291
1292Both C4.5 command-line symbols and variable names can be used. The
1293following lines produce the same result::
1294
1295    tree = Orange.classification.tree.C45Learner(data, m=100)
1296    tree = Orange.classification.tree.C45Learner(data, min_objs=100)
1297
1298A veteran C4.5 might prefer :func:`C45Learner.commandline`::
1299
1300    lrn = Orange.classification.tree.C45Learner()
1301    lrn.commandline("-m 1 -s")
1302    tree = lrn(data)
1303
1304The following script prints out the tree same format as C4.5 does.
1305
1306.. literalinclude:: code/tree_c45_printtree.py
1307
1308For the leaves just the value in ``node.leaf`` in printed. Since
1309:obj:`C45Node` does not know to which attribute it belongs, we need to
1310convert it to a string through ``classvar``, which is passed as an extra
1311argument to the recursive part of printTree.
1312
1313For discrete splits without subsetting, we print out all attribute values
1314and recursively call the function for all branches. Continuous splits
1315are equally easy to handle.
1316
1317For discrete splits with subsetting, we iterate through branches,
1318retrieve the corresponding values that go into each branch to inset,
1319turn the values into strings and print them out, separately treating
1320the case when only a single value goes into the branch.
1321
1322=================
1323SimpleTreeLearner
1324=================
1325
1326.. include:: /SimpleTreeLearner.txt
1327       
1328Examples
1329========
1330
1331:obj:`SimpleTreeLearner` is used in much the same way as :obj:`TreeLearner`.
1332A typical example of using :obj:`SimpleTreeLearner` would be to build a random
1333forest:
1334
1335.. literalinclude:: code/simple_tree_random_forest.py
1336
1337
1338References
1339==========
1340
1341Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`, Addison
1342Wesley, 2002.
1343
1344E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories,
1345Murray Hill NJ, U.S.A., October 1993.
1346
1347`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_
1348A home page of AT&T's dot and similar software packages.
1349
1350"""
1351
1352"""
1353TODO C++ aliases
1354
1355SplitConstructor.discrete/continuous_split_constructor -> SplitConstructor.discrete
1356Node.examples -> Node.instances
1357"""
1358
1359from Orange.core import \
1360     TreeLearner as _TreeLearner, \
1361         TreeClassifier as _TreeClassifier, \
1362         SimpleTreeLearner, \
1363         SimpleTreeClassifier, \
1364         C45Learner as _C45Learner, \
1365         C45Classifier as _C45Classifier, \
1366         C45TreeNode as C45Node, \
1367         C45TreeNodeList as C45NodeList, \
1368         TreeDescender as Descender, \
1369              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
1370              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
1371              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
1372              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
1373              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
1374         TreeExampleSplitter as Splitter, \
1375              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
1376              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
1377              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
1378              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
1379              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
1380              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
1381              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
1382         TreeNode as Node, \
1383         TreeNodeList as NodeList, \
1384         TreePruner as Pruner, \
1385              TreePruner_SameMajority as Pruner_SameMajority, \
1386              TreePruner_m as Pruner_m, \
1387         TreeSplitConstructor as SplitConstructor, \
1388              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
1389              TreeSplitConstructor_Measure as SplitConstructor_Score, \
1390                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
1391                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
1392                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
1393                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
1394         TreeStopCriteria as StopCriteria, \
1395              TreeStopCriteria_Python as StopCriteria_Python, \
1396              TreeStopCriteria_common as StopCriteria_common
1397
1398import Orange.core
1399import operator
1400import base64
1401import re
1402import Orange.data
1403import Orange.feature.scoring
1404import warnings
1405
1406class C45Learner(Orange.classification.Learner):
1407    """
1408    :class:`C45Learner`'s attributes have double names - those that
1409    you know from C4.5 command lines and the corresponding names of C4.5's
1410    internal variables. All defaults are set as in C4.5; if you change
1411    nothing, you are running C4.5.
1412
1413    Constructs a :obj:`C45Classifier` when given data.
1414
1415    .. attribute:: gain_ratio (g)
1416       
1417        Determines whether to use information gain (false, default)
1418        or gain ratio for selection of attributes (true).
1419
1420    .. attribute:: batch (b)
1421
1422        Turn on batch mode (no windows, no iterations); this option is
1423        not documented in C4.5 manuals. It conflicts with "window",
1424        "increment" and "trials".
1425
1426    .. attribute:: subset (s)
1427       
1428        Enables subsetting (default: false, no subsetting),
1429 
1430    .. attribute:: prob_thresh (p)
1431
1432        Probabilistic threshold for continuous attributes (default: false).
1433
1434    .. attribute:: min_objs (m)
1435       
1436        Minimal number of objects (instances) in leaves (default: 2).
1437
1438    .. attribute:: window (w)
1439
1440        Initial windows size (default: maximum of 20% and twice the
1441        square root of the number of data objects).
1442
1443    .. attribute:: increment (i)
1444
1445        The maximum number of objects that can be added to the window
1446        at each iteration (default: 20% of the initial window size).
1447
1448    .. attribute:: cf (c)
1449
1450        Prunning confidence level (default: 25%).
1451
1452    .. attribute:: trials (t)
1453
1454        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1455
1456    .. attribute:: prune
1457       
1458        Return pruned tree (not an original C4.5 option) (default: true)
1459    """
1460
1461    _rename_new_old = { "min_objs": "minObjs", "probTresh": "prob_tresh",
1462            "gain_ratio": "gainRatio" }
1463    #_rename_new_old = {}
1464    _rename_old_new = dict((a,b) for b,a in _rename_new_old.items())
1465
1466    @classmethod
1467    def _rename_dict(cls, dic):
1468        return dict((cls._rename_arg(a),b) for a,b in dic.items())
1469
1470    @classmethod
1471    def _rename_arg(cls, a):
1472        if a in cls._rename_old_new:
1473            Orange.misc.deprecation_warning(a, cls._rename_old_new[a], stacklevel=4)
1474        return cls._rename_new_old.get(a, a)
1475
1476    def __new__(cls, instances = None, weightID = 0, **argkw):
1477        self = Orange.classification.Learner.__new__(cls, **cls._rename_dict(argkw))
1478        if instances:
1479            self.__init__(**cls._rename_dict(argkw))
1480            return self.__call__(instances, weightID)
1481        else:
1482            return self
1483       
1484    def __init__(self, **kwargs):
1485        self.base = _C45Learner(**self._rename_dict(kwargs))
1486
1487    def __setattr__(self, name, value):
1488        nn = self._rename_arg(name)
1489        if name != "base" and nn in self.base.__dict__:
1490            self.base.__dict__[nn] = value
1491        else:
1492            self.__dict__[nn] = value
1493
1494    def __getattr__(self, name):
1495        nn = self._rename_arg(name)
1496        if name != " base" and nn in self.base.__dict__:
1497            return self.base.__dict__[nn]
1498        else:
1499            return self.__dict__[nn]
1500
1501    def __call__(self, *args, **kwargs):
1502        return C45Classifier(self.base(*args, **self._rename_dict(kwargs)))
1503
1504    def commandline(self, ln):
1505        """
1506        Set the arguments with a C4.5 command line.
1507        """
1508        self.base.commandline(ln)
1509   
1510 
1511class C45Classifier(Orange.classification.Classifier):
1512    """
1513    A faithful reimplementation of Quinlan's C4.5, but
1514    uses a tree composed of :class:`C45Node` instead of C4.5's original
1515    tree structure.
1516
1517    .. attribute:: tree
1518
1519        C4.5 tree stored as :obj:`C45Node`.
1520    """
1521
1522    def __init__(self, base_classifier):
1523        self.nativeClassifier = base_classifier
1524        for k, v in self.nativeClassifier.__dict__.items():
1525            self.__dict__[k] = v
1526 
1527    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
1528                 *args, **kwdargs):
1529        """Classify a new instance.
1530       
1531        :param instance: instance to be classified.
1532        :type instance: :class:`Orange.data.Instance`
1533        :param result_type:
1534              :class:`Orange.classification.Classifier.GetValue` or \
1535              :class:`Orange.classification.Classifier.GetProbabilities` or
1536              :class:`Orange.classification.Classifier.GetBoth`
1537       
1538        :rtype: :class:`Orange.data.Value`,
1539              :class:`Orange.statistics.Distribution` or a tuple with both
1540        """
1541        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
1542
1543    def __setattr__(self, name, value):
1544        if name == "nativeClassifier":
1545            self.__dict__[name] = value
1546            return
1547        if name in self.nativeClassifier.__dict__:
1548            self.nativeClassifier.__dict__[name] = value
1549        self.__dict__[name] = value
1550
1551    def __str__(self):
1552        return self.dump()
1553   
1554
1555    def dump(self): 
1556        """
1557        Print the tree in the same form as Ross Quinlan's
1558        C4.5 program.
1559
1560        ::
1561
1562            import Orange
1563
1564            data = Orange.data.Table("voting")
1565            c45 = Orange.classification.tree.C45Learner(data)
1566            print c45
1567
1568        prints
1569
1570        ::
1571
1572            physician-fee-freeze = n: democrat (253.4)
1573            physician-fee-freeze = y:
1574            |   synfuels-corporation-cutback = n: republican (145.7)
1575            |   synfuels-corporation-cutback = y:
1576            |   |   mx-missile = y: democrat (6.0)
1577            |   |   mx-missile = n:
1578            |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
1579            |   |   |   adoption-of-the-budget-resolution = y:
1580            |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
1581            |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
1582
1583
1584        The standalone C4.5 would print::
1585
1586            physician-fee-freeze = n: democrat (253.4/5.9)
1587            physician-fee-freeze = y:
1588            |   synfuels-corporation-cutback = n: republican (145.7/6.2)
1589            |   synfuels-corporation-cutback = y:
1590            |   |   mx-missile = y: democrat (6.0/2.4)
1591            |   |   mx-missile = n:
1592            |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
1593            |   |   |   adoption-of-the-budget-resolution = y:
1594            |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
1595            |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
1596
1597        C4.5 also prints out the number of errors on learning data in
1598        each node.
1599        """
1600        return  _c45_printTree0(self.tree, self.class_var, 0)
1601
1602    format = dump
1603
1604def _c45_showBranch(node, classvar, lev, i):
1605    var = node.tested
1606    str_ = ""
1607    if node.node_type == 1:
1608        str_ += "\n"+"|   "*lev + "%s = %s:" % (var.name, var.values[i])
1609        str_ += _c45_printTree0(node.branch[i], classvar, lev+1)
1610    elif node.node_type == 2:
1611        str_ += "\n"+"|   "*lev + "%s %s %.1f:" % (var.name, ["<=", ">"][i], node.cut)
1612        str_ += _c45_printTree0(node.branch[i], classvar, lev+1)
1613    else:
1614        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
1615        inset = [var.values[j[0]] for j in inset]
1616        if len(inset)==1:
1617            str_ += "\n"+"|   "*lev + "%s = %s:" % (var.name, inset[0])
1618        else:
1619            str_ +=  "\n"+"|   "*lev + "%s in {%s}:" % (var.name, ", ".join(inset))
1620        str_ += _c45_printTree0(node.branch[i], classvar, lev+1)
1621    return str_
1622       
1623       
1624def _c45_printTree0(node, classvar, lev):
1625    var = node.tested
1626    str_ = ""
1627    if node.node_type == 0:
1628        str_ += "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items) 
1629    else:
1630        for i, branch in enumerate(node.branch):
1631            if not branch.node_type:
1632                str_ += _c45_showBranch(node, classvar, lev, i)
1633        for i, branch in enumerate(node.branch):
1634            if branch.node_type:
1635                str_ += _c45_showBranch(node, classvar, lev, i)
1636    return str_
1637
1638def _printTreeC45(tree):
1639    print _c45_printTree0(tree.tree, tree.class_var, 0)
1640
1641
1642import Orange.feature.scoring as fscoring
1643
1644class TreeLearner(Orange.core.Learner):
1645    """
1646    A classification or regression tree learner. If a set of instances
1647    is given on initialization, a :class:`TreeClassifier` is built and
1648    returned instead. All attributes can also be set on initialization.
1649
1650    **The tree induction process**
1651
1652    #. The learning instances are copied, unless
1653       :obj:`store_instances` is `False` and the instance
1654       already are stored in a :obj:`~Orange.data.Table`.
1655    #. Apriori class probabilities are computed. A list of
1656       candidate features for the split is compiled; in the beginning,
1657       all features are candidates.
1658    #. The recursive part. The contingency matrix is computed by
1659       :obj:`contingency_computer`. Contingencies are used by :obj:`split`,
1660       :obj:`stop` and :obj:`splitter`.
1661    #. If the induction should :obj:`stop`, a :obj:`~Node.node_classifier`
1662       is built by calling :obj:`node_learner` with the given instances,
1663       weight ID and the contingency matrix. As the learner uses
1664       contingencies whenever possible, the :obj:`contingency_computer`
1665       will affect the :obj:`~Node.node_classifier`. The node is returned.
1666    #. If the induction continues, a :obj:`split` is called.
1667       If :obj:`split` fails to return a branch selector, induction stops
1668       and the :obj:`Node` is returned.
1669    #. The feature spent (if any) is removed from the candidate list.
1670    #. Instances are divided into child nodes with :obj:`splitter`.
1671       The process recursively continues with step 3 for
1672       each of the non-empty subsets. If the splitter returned weights,
1673       they are used for each branch.
1674
1675    **Attributes**
1676
1677    .. attribute:: node_learner
1678
1679        Induces a classifier from instances in a node. It is used both
1680        for internal nodes and leaves. The default is
1681        :obj:`Orange.classification.majority.MajorityLearner`.
1682
1683    .. attribute:: descender
1684
1685        The descender that the induced :obj:`TreeClassifier` will
1686        use. The default is :obj:`Descender_UnknownMergeAsSelector`.
1687        It votes with the :obj:`branch_selector`'s distribution for
1688        vote weights.
1689
1690    .. attribute:: contingency_computer
1691
1692        Defines the computation of contingency matrices (used by
1693        :obj:`split`, :obj:`stop`, :obj:`splitter`). It can be used,
1694        for example, to change the treatment of unknown values. By
1695        default ordinary contingency matrices are computed.
1696
1697    **Split construction**
1698
1699    .. attribute:: split
1700       
1701        A :obj:`SplitConstructor` or a function with the same signature as
1702        :obj:`SplitConstructor.__call__`. It is useful for prototyping
1703        new tree induction algorithms. If :obj:`split` is defined, other
1704        arguments that affect split construction are ignored. These include
1705        :obj:`binarization`, :obj:`measure`, :obj:`worst_acceptable` and
1706        :obj:`min_subset`. Default: :class:`SplitConstructor_Combined`
1707        with separate constructors for discrete and continuous
1708        features. Discrete features are used as they are, while
1709        continuous are binarized. Features are scored with gain ratio.
1710        At least two instances in a leaf are required for
1711        discrete and five for continuous features.
1712
1713    .. attribute:: binarization
1714
1715        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
1716        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
1717        0, do not use binarization (use :class:`SplitConstructor_Feature`).
1718        Default: 0.
1719
1720    .. attribute:: measure
1721   
1722        A score to evaluate features for splitting instances in a
1723        node.  A subclass of :class:`Orange.feature.scoring.Score`
1724        (perhaps :class:`~Orange.feature.scoring.InfoGain`,
1725        :class:`~Orange.feature.scoring.GainRatio`,
1726        :class:`~Orange.feature.scoring.Gini`,
1727        :class:`~Orange.feature.scoring.Relief`, or
1728        :class:`~Orange.feature.scoring.MSE`). Default:
1729        :class:`Orange.feature.scoring.GainRatio`.
1730
1731    .. attribute:: relief_m, relief_k
1732
1733        Set `m` and `k` for :class:`~Orange.feature.scoring.Relief`,
1734        if chosen.
1735
1736    .. attribute:: splitter
1737
1738        :class:`Splitter` or a function with the same
1739        signature as :obj:`Splitter.__call__`. The default is
1740        :class:`Splitter_UnknownsAsSelector` that splits the
1741        learning instances according to distributions given by the
1742        selector.
1743
1744    **Pruning**
1745
1746    .. attribute:: worst_acceptable
1747
1748        The lowest required feature score. If the score of the best
1749        feature is below this margin, the tree is not grown further
1750        (default: 0).
1751
1752    .. attribute:: min_subset
1753
1754        The lowest required number of instances in non-null leaves (default: 0).
1755
1756    .. attribute:: min_instances
1757
1758        Data subsets with less than :obj:`min_instances`
1759        instances are not split any further. Therefore, all leaves in the tree
1760        will contain at least that many instances (default: 0).
1761
1762    .. attribute:: max_depth
1763
1764        Maximal tree depth. If 0, only root is generated.
1765        The default is 100.
1766
1767    .. attribute:: max_majority
1768
1769        Induction stops when the proportion of majority class in the
1770        node exceeds the value set by this parameter (default: 1.0).
1771 
1772    .. attribute:: stop
1773
1774        :class:`StopCriteria` or a function with the same signature as
1775        :obj:`StopCriteria.__call__`. Useful for prototyping new tree
1776        induction algorithms.  When used, parameters  :obj:`max_majority`
1777        and :obj:`min_instances` will not be  considered.  The default
1778        stopping criterion stops induction when all instances in a node
1779        belong to the same class.
1780
1781    .. attribute:: m_pruning
1782
1783        If non-zero, invokes an error-based bottom-up post-pruning,
1784        where m-estimate is used to estimate class probabilities
1785        (default: 0).
1786
1787    .. attribute:: same_majority_pruning
1788
1789        If true, invokes a bottom-up post-pruning by removing the
1790        subtrees of which all leaves classify to the same class
1791        (default: False).
1792
1793    **Record keeping**
1794
1795    .. attribute:: store_distributions
1796   
1797    .. attribute:: store_contingencies
1798   
1799    .. attribute:: store_instances
1800   
1801    .. attribute:: store_node_classifier
1802
1803        Determines whether to store class distributions,
1804        contingencies and instances in :class:`Node`, and whether the
1805        :obj:`Node.node_classifier` should be build for internal nodes
1806        also (it is needed by the :obj:`Descender` or for post-pruning).
1807        Not storing distributions but storing contingencies does not
1808        save any memory, since distributions actually points to the
1809        same distribution that is stored in :obj:`contingency.classes`.
1810        By default everything except :obj:`store_instances` is enabled.
1811
1812    """
1813    def __new__(cls, data=None, weightID=0, **argkw):
1814        self = Orange.core.Learner.__new__(cls, **argkw)
1815        if data:
1816            self.__init__(**argkw)
1817            return self.__call__(data, weightID)
1818        else:
1819            return self
1820   
1821    def __init__(self, **kw):
1822
1823        #name, buildfunction, parameters
1824        #buildfunctions are not saved as function references
1825        #because that would make problems with object copies
1826        for n,(fn,_) in self._built_fn.items():
1827            self.__dict__["_handset_" + n] = False
1828
1829        #measure has to be before split
1830        self.measure = None
1831        self.split = None
1832        self.stop = None
1833        self.splitter = None
1834       
1835        for n,(fn,_) in self._built_fn.items():
1836            self.__dict__[n] = fn(self)
1837
1838        for k,v in kw.items():
1839            self.__setattr__(k,v)
1840     
1841    def __call__(self, instances, weight=0):
1842        """
1843        Return a classifier from the given instances.
1844        """
1845        bl = self._base_learner()
1846
1847        #set the scoring criteria for regression if it was not
1848        #set by the user
1849        if not self._handset_split and not self.measure:
1850            measure = fscoring.GainRatio() \
1851                if instances.domain.class_var.var_type == Orange.data.Type.Discrete \
1852                else fscoring.MSE()
1853            bl.split.continuous_split_constructor.measure = measure
1854            bl.split.discrete_split_constructor.measure = measure
1855         
1856        if self.splitter != None:
1857            bl.example_splitter = self.splitter
1858
1859        #post pruning
1860        tree = bl(instances, weight)
1861        if getattr(self, "same_majority_pruning", 0):
1862            tree = Pruner_SameMajority(tree)
1863        if getattr(self, "m_pruning", 0):
1864            tree = Pruner_m(tree, m=self.m_pruning)
1865
1866        return TreeClassifier(base_classifier=tree) 
1867
1868    def __setattr__(self, name, value):
1869        self.__dict__[name] = value
1870        for n,(fn,v) in self._built_fn.items():
1871            if name in v:
1872                if not self.__dict__["_handset_" + n]:
1873                    self.__dict__[n] = fn(self)
1874                else:
1875                    warnings.warn("Changing \"" + name + "\" does not have any effect as \"" + n + "\" was already set", UserWarning, 2)
1876            elif n == name:
1877                if value == None:
1878                    self.__dict__[n] = fn(self)
1879                    self.__dict__["_handset_" + n] = False
1880                    #print n, "was now disabled by hand"
1881                else:
1882                    self.__dict__["_handset_" + n] = True
1883                    #print n, "is now handset"
1884        #print self.__dict__
1885
1886    def __delattr__(self, name):
1887        self.__setattr__(name, None) #use the __setattr__
1888        del self.__dict__[name]
1889
1890    def instance(self):
1891        """
1892        DEPRECATED. Return a base learner - an object
1893        of :class:`_TreeLearner`.
1894        This method is left for backwards compatibility.
1895        """
1896        return self._base_learner()
1897
1898    def _build_split(self):
1899        """
1900        Return the split constructor built according to object attributes.
1901        """
1902        split = SplitConstructor_Combined()
1903        split.continuous_split_constructor = \
1904            SplitConstructor_Threshold()
1905        binarization = getattr(self, "binarization", 0)
1906        if binarization == 1:
1907            split.discrete_split_constructor = \
1908                SplitConstructor_ExhaustiveBinary()
1909        elif binarization == 2:
1910            split.discrete_split_constructor = \
1911                SplitConstructor_OneAgainstOthers()
1912        else:
1913            split.discrete_split_constructor = \
1914                SplitConstructor_Feature()
1915
1916        measures = {"infoGain": fscoring.InfoGain,
1917            "gainRatio": fscoring.GainRatio,
1918            "gini": fscoring.Gini,
1919            "relief": fscoring.Relief,
1920            "retis": fscoring.MSE
1921            }
1922
1923        measure = self.measure
1924        if isinstance(measure, str):
1925            measure = measures[measure]()
1926        if not measure:
1927            measure = fscoring.GainRatio()
1928
1929        measureIsRelief = isinstance(measure, fscoring.Relief)
1930        relM = getattr(self, "relief_m", None)
1931        if relM and measureIsRelief:
1932            measure.m = relM
1933       
1934        relK = getattr(self, "relief_k", None)
1935        if relK and measureIsRelief:
1936            measure.k = relK
1937
1938        split.continuous_split_constructor.measure = measure
1939        split.discrete_split_constructor.measure = measure
1940
1941        wa = getattr(self, "worst_acceptable", 0)
1942        if wa:
1943            split.continuous_split_constructor.worst_acceptable = wa
1944            split.discrete_split_constructor.worst_acceptable = wa
1945
1946        ms = getattr(self, "min_subset", 0)
1947        if ms:
1948            split.continuous_split_constructor.min_subset = ms
1949            split.discrete_split_constructor.min_subset = ms
1950
1951        return split
1952
1953    def _build_stop(self):
1954        """
1955        Return the stop criteria built according to object's attributes.
1956        """
1957        stop = Orange.classification.tree.StopCriteria_common()
1958        mm = getattr(self, "max_majority", 1.0)
1959        if mm < 1.0:
1960            stop.max_majority = self.max_majority
1961        me = getattr(self, "min_instances", 0)
1962        if me:
1963            stop.min_examples = self.min_instances
1964        return stop
1965
1966    def _base_learner(self):
1967        learner = _TreeLearner()
1968
1969        learner.split = self.split
1970        learner.stop = self.stop
1971
1972        for a in ["store_distributions", "store_contingencies",
1973            "store_node_classifier", "node_learner", "max_depth", "contingency_computer", "descender" ]:
1974            if hasattr(self, a):
1975                setattr(learner, a, getattr(self, a))
1976
1977        if hasattr(self, "store_instances"):
1978            learner.store_examples = self.store_instances
1979
1980        return learner
1981
1982    _built_fn = { 
1983            "split": [ _build_split, [ "binarization", "measure", "relief_m", "relief_k", "worst_acceptable", "min_subset" ] ], \
1984            "stop": [ _build_stop, ["max_majority", "min_instances" ] ] 
1985        }
1986
1987
1988
1989TreeLearner = Orange.misc.deprecated_members({
1990          "mForPruning": "m_pruning",
1991          "sameMajorityPruning": "same_majority_pruning",
1992          "reliefM": "relief_m",
1993          "reliefK": "relief_k",
1994          "storeDistributions": "store_distributions",
1995          "storeContingencies": "store_contingencies",
1996          "storeExamples": "store_instances",
1997          "store_examples": "store_instances",
1998          "storeNodeClassifier": "store_node_classifier",
1999          "worstAcceptable": "worst_acceptable",
2000          "minSubset": "min_subset",
2001          "maxMajority": "max_majority",
2002          "minExamples": "min_instances",
2003          "maxDepth": "max_depth",
2004          "nodeLearner": "node_learner",
2005          "min_examples": "min_instances"
2006}, wrap_methods=[])(TreeLearner)
2007
2008#
2009# the following is for the output
2010#
2011
2012fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2013""" Defines the multiplier by 100 (``^``) and the format
2014for the number of decimals (e.g. ``5.3``). The corresponding
2015groups are named ``m100`` and ``fs``. """
2016
2017by = r"(?P<by>(b(P|A)))?"
2018""" Defines bP or bA or nothing; the result is in groups by. """
2019
2020bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2021opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2022opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2023intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2024fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2025re_V = re.compile("%V")
2026re_N = re.compile("%"+fs+"N"+by)
2027re_M = re.compile("%"+fs+"M"+by)
2028re_m = re.compile("%"+fs+"m"+by)
2029re_Ccont = re.compile("%"+fs+"C"+by+opc)
2030re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2031re_ccont = re.compile("%"+fs+"c"+by+opc)
2032re_cdisc = re.compile("%"+fs+"c"+by+opd)
2033re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2034re_cconti = re.compile("%"+fs+"c"+by+fromto)
2035re_D = re.compile("%"+fs+"D"+by)
2036re_d = re.compile("%"+fs+"d"+by)
2037re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2038re_I = re.compile("%"+fs+"I"+intrvl)
2039
2040def insert_str(s, mo, sub):
2041    """ Replace the part of s which is covered by mo
2042    with the string sub. """
2043    return s[:mo.start()] + sub + s[mo.end():]
2044
2045def insert_dot(s, mo):
2046    """ Replace the part of s which is covered by mo
2047    with a dot.  You should use this when the
2048    function cannot compute the desired quantity; it is called, for instance,
2049    when it needs to divide by something in the parent, but the parent
2050    doesn't exist.
2051    """
2052    return s[:mo.start()] + "." + s[mo.end():]
2053
2054def insert_num(s, mo, n):
2055    """ Replace the part of s matched by mo with the number n,
2056    formatted as specified by the user, that is, it multiplies
2057    it by 100, if needed, and prints with the right number of
2058    places and decimals. It does so by checking the mo
2059    for a group named m100 (representing the ``^`` in the format string)
2060    and a group named fs representing the part giving the number o
2061    f decimals (e.g. ``5.3``).
2062    """
2063    grps = mo.groupdict()
2064    m100 = grps.get("m100", None)
2065    if m100:
2066        n *= 100
2067    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2068    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
2069
2070def by_whom(by, parent, tree):
2071    """ If by equals bp, return parent, else return
2072    ``tree.tree``. This is used to find what to divide the quantity
2073    with, when division is required.
2074    """
2075    if by=="bP":
2076        return parent
2077    else:
2078        return tree.tree
2079
2080def replaceV(strg, mo, node, parent, tree):
2081    return insert_str(strg, mo, str(node.node_classifier.default_value))
2082
2083def replaceN(strg, mo, node, parent, tree):
2084    by = mo.group("by")
2085    N = node.distribution.abs
2086    if by:
2087        whom = by_whom(by, parent, tree)
2088        if whom and whom.distribution:
2089            if whom.distribution.abs > 1e-30:
2090                N /= whom.distribution.abs
2091        else:
2092            return insert_dot(strg, mo)
2093    return insert_num(strg, mo, N)
2094       
2095
2096def replaceM(strg, mo, node, parent, tree):
2097    by = mo.group("by")
2098    maj = int(node.node_classifier.default_value)
2099    N = node.distribution[maj]
2100    if by:
2101        whom = by_whom(by, parent, tree)
2102        if whom and whom.distribution:
2103            if whom.distribution[maj] > 1e-30:
2104                N /= whom.distribution[maj]
2105        else:
2106            return insert_dot(strg, mo)
2107    return insert_num(strg, mo, N)
2108       
2109
2110def replacem(strg, mo, node, parent, tree):
2111    by = mo.group("by")
2112    maj = int(node.node_classifier.default_value)
2113    if node.distribution.abs > 1e-30:
2114        N = node.distribution[maj] / node.distribution.abs
2115        if by:
2116            if whom and whom.distribution:
2117                byN = whom.distribution[maj] / whom.distribution.abs
2118                if byN > 1e-30:
2119                    N /= byN
2120            else:
2121                return insert_dot(strg, mo)
2122    else:
2123        N = 0.
2124    return insert_num(strg, mo, N)
2125
2126
2127def replaceCdisc(strg, mo, node, parent, tree):
2128    if tree.class_var.var_type != Orange.data.Type.Discrete:
2129        return insert_dot(strg, mo)
2130   
2131    by, op, cls = mo.group("by", "op", "cls")
2132    N = node.distribution[cls]
2133    if op == "!=":
2134        N = node.distribution.abs - N
2135    if by:
2136        whom = by_whom(by, parent, tree)
2137        if whom and whom.distribution:
2138            if whom.distribution[cls] > 1e-30:
2139                N /= whom.distribution[cls]
2140        else:
2141            return insert_dot(strg, mo)
2142    return insert_num(strg, mo, N)
2143
2144   
2145def replacecdisc(strg, mo, node, parent, tree):
2146    if tree.class_var.var_type != Orange.data.Type.Discrete:
2147        return insert_dot(strg, mo)
2148   
2149    op, by, cls = mo.group("op", "by", "cls")
2150    N = node.distribution[cls]
2151    if node.distribution.abs > 1e-30:
2152        N /= node.distribution.abs
2153        if op == "!=":
2154            N = 1 - N
2155    if by:
2156        whom = by_whom(by, parent, tree)
2157        if whom and whom.distribution:
2158            if whom.distribution[cls] > 1e-30:
2159                N /= whom.distribution[cls] / whom.distribution.abs
2160        else:
2161            return insert_dot(strg, mo)
2162    return insert_num(strg, mo, N)
2163
2164
2165__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2166
2167def replaceCcont(strg, mo, node, parent, tree):
2168    if tree.class_var.var_type != Orange.data.Type.Continuous:
2169        return insert_dot(strg, mo)
2170   
2171    by, op, num = mo.group("by", "op", "num")
2172    op = __opdict[op]
2173    num = float(num)
2174    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2175    if by:
2176        whom = by_whom(by, parent, tree)
2177        if whom and whom.distribution:
2178            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2179            if byN > 1e-30:
2180                N /= byN
2181        else:
2182            return insert_dot(strg, mo)
2183
2184    return insert_num(strg, mo, N)
2185   
2186   
2187def replaceccont(strg, mo, node, parent, tree):
2188    if tree.class_var.var_type != Orange.data.Type.Continuous:
2189        return insert_dot(strg, mo)
2190   
2191    by, op, num = mo.group("by", "op", "num")
2192    op = __opdict[op]
2193    num = float(num)
2194    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2195    if node.distribution.abs > 1e-30:
2196        N /= node.distribution.abs
2197    if by:
2198        whom = by_whom(by, parent, tree)
2199        if whom and whom.distribution:
2200            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2201            if byN > 1e-30:
2202                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
2203        else:
2204            return insert_dot(strg, mo)
2205    return insert_num(strg, mo, N)
2206
2207
2208def extractInterval(mo, dist):
2209    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
2210    lower, upper = float(lower), float(upper)
2211    if out:
2212        lop = lowin == "(" and operator.le or operator.lt
2213        hop = upin == ")" and operator.ge or operator.ge
2214        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
2215    else:
2216        lop = lowin == "(" and operator.gt or operator.ge
2217        hop = upin == ")" and operator.lt or operator.le
2218        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())
2219
2220   
2221def replaceCconti(strg, mo, node, parent, tree):
2222    if tree.class_var.var_type != Orange.data.Type.Continuous:
2223        return insert_dot(strg, mo)
2224
2225    by = mo.group("by")
2226    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
2227    if by:
2228        whom = by_whom(by, parent, tree)
2229        if whom and whom.distribution:
2230            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
2231            if byN > 1e-30:
2232                N /= byN
2233        else:
2234            return insert_dot(strg, mo)
2235       
2236    return insert_num(strg, mo, N)
2237
2238           
2239def replacecconti(strg, mo, node, parent, tree):
2240    if tree.class_var.var_type != Orange.data.Type.Continuous:
2241        return insert_dot(strg, mo)
2242
2243    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
2244    ab = node.distribution.abs
2245    if ab > 1e-30:
2246        N /= ab
2247
2248    by = mo.group("by")
2249    if by:
2250        whom = by_whom(by, parent, tree)
2251        if whom and whom.distribution:
2252            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
2253            if byN > 1e-30:
2254                N /= byN/whom.distribution.abs
2255        else:
2256            return insert_dot(strg, mo)
2257       
2258    return insert_num(strg, mo, N)
2259
2260   
2261def replaceD(strg, mo, node, parent, tree):
2262    if tree.class_var.var_type != Orange.data.Type.Discrete:
2263        return insert_dot(strg, mo)
2264
2265    fs, by, m100 = mo.group("fs", "by", "m100")
2266    dist = list(node.distribution)
2267    if by:
2268        whom = by_whom(by, parent, tree)
2269        if whom:
2270            for i, d in enumerate(whom.distribution):
2271                if d > 1e-30:
2272                    dist[i] /= d
2273        else:
2274            return insert_dot(strg, mo)
2275    mul = m100 and 100 or 1
2276    fs = fs or (m100 and ".0" or "5.3")
2277    return insert_str(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")
2278
2279
2280def replaced(strg, mo, node, parent, tree):
2281    if tree.class_var.var_type != Orange.data.Type.Discrete:
2282        return insert_dot(strg, mo)
2283
2284    fs, by, m100 = mo.group("fs", "by", "m100")
2285    dist = list(node.distribution)
2286    ab = node.distribution.abs
2287    if ab > 1e-30:
2288        dist = [d/ab for d in dist]
2289    if by:
2290        whom = by_whom(by, parent, tree)
2291        if whom:
2292            for i, d in enumerate(whom.distribution):
2293                if d > 1e-30:
2294                    dist[i] /= d/whom.distribution.abs # abs > d => d>1e-30 => abs>1e-30
2295        else:
2296            return insert_dot(strg, mo)
2297    mul = m100 and 100 or 1
2298    fs = fs or (m100 and ".0" or "5.3")
2299    return insert_str(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")
2300
2301
2302def replaceAE(strg, mo, node, parent, tree):
2303    if tree.class_var.var_type != Orange.data.Type.Continuous:
2304        return insert_dot(strg, mo)
2305
2306    AorE, bysub, by = mo.group("AorE", "bysub", "by")
2307   
2308    if AorE == "A":
2309        A = node.distribution.average()
2310    else:
2311        A = node.distribution.error()
2312    if by:
2313        whom = by_whom("b"+by, parent, tree)
2314        if whom:
2315            if AorE == "A":
2316                avg = whom.distribution.average()
2317            else:
2318                avg = whom.distribution.error()
2319            if bysub == "b":
2320                if avg > 1e-30:
2321                    A /= avg
2322            else:
2323                A -= avg
2324        else:
2325            return insert_dot(strg, mo)
2326    return insert_num(strg, mo, A)
2327
2328
2329Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }
2330
2331def replaceI(strg, mo, node, parent, tree):
2332    if tree.class_var.var_type != Orange.data.Type.Continuous:
2333        return insert_dot(strg, mo)
2334
2335    fs = mo.group("fs") or "5.3"
2336    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
2337    mul = mo.group("m100") and 100 or 1
2338
2339    if not Z.has_key(intrvl):
2340        raise SystemError, "Cannot compute %5.3f% confidence intervals" % intrvl
2341
2342    av = node.distribution.average()   
2343    il = node.distribution.error() * Z[intrvl]
2344    return insert_str(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))
2345
2346
2347# This class is more a collection of function, merged into a class so
2348# that they don't need to transfer too many arguments. It will be
2349# constructed, used and discarded, it is not meant to store any information.
2350class _TreeDumper:
2351    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
2352         (re_M, replaceM), (re_m, replacem), 
2353         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
2354         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
2355         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
2356         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE), 
2357         (re_I, replaceI) ]
2358
2359    def node(self):
2360        return self.tree.tree if "tree" in self.tree.__dict__ else self.tree
2361
2362    def __init__(self, leafStr, nodeStr, stringFormats, minExamples, 
2363        maxDepth, simpleFirst, tree, **kw):
2364        self.stringFormats = stringFormats
2365        self.minExamples = minExamples
2366        self.maxDepth = maxDepth
2367        self.simpleFirst = simpleFirst
2368        self.tree = tree
2369        self.__dict__.update(kw)
2370
2371        if leafStr:
2372            self.leafStr = leafStr
2373        else:
2374            if self.node().node_classifier.class_var.var_type == \
2375                    Orange.data.Type.Discrete:
2376                self.leafStr = "%V (%^.2m%)"
2377            else:
2378                self.leafStr = "%V"
2379
2380        if nodeStr == ".":
2381            self.nodeStr = self.leafStr
2382        else:
2383            self.nodeStr = nodeStr
2384       
2385
2386    def formatString(self, strg, node, parent):
2387        if hasattr(strg, "__call__"):
2388            return strg(node, parent, self.tree)
2389       
2390        if not node:
2391            return "<null node>"
2392       
2393        for rgx, replacer in self.stringFormats:
2394            if not node.distribution:
2395                strg = rgx.sub(".", strg)
2396            else:
2397                strt = 0
2398                while True:
2399                    mo = rgx.search(strg, strt)
2400                    if not mo:
2401                        break
2402                    strg = replacer(strg, mo, node, parent, self.tree)
2403                    strt = mo.start()+1
2404                       
2405        return strg
2406       
2407
2408    def showBranch(self, node, parent, lev, i):
2409        bdes = node.branch_descriptions[i]
2410        bdes = node.branch_selector.class_var.name + \
2411            (bdes[0] not in "<=>" and "=" or "") + bdes
2412        if node.branches[i]:
2413            nodedes = self.nodeStr and ": " + \
2414                self.formatString(self.nodeStr, node.branches[i], node) or ""
2415        else:
2416            nodedes = "<null node>"
2417        return "|    "*lev + bdes + nodedes
2418       
2419       
2420    def dumpTree0(self, node, parent, lev):
2421        if node.branches:
2422            if node.distribution.abs < self.minExamples or \
2423                lev > self.maxDepth:
2424                return "|    "*lev + ". . .\n"
2425           
2426            res = ""
2427            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
2428                leafsep = "\n"+("|    "*lev)+"    "
2429            else:
2430                leafsep = ""
2431            if self.simpleFirst:
2432                for i, branch in enumerate(node.branches):
2433                    if not branch or not branch.branches:
2434                        if self.leafStr == self.nodeStr:
2435                            res += "%s\n" % \
2436                                self.showBranch(node, parent, lev, i)
2437                        else:
2438                            res += "%s: %s\n" % \
2439                                (self.showBranch(node, parent, lev, i),
2440                                 leafsep + 
2441                                 self.formatString(self.leafStr, branch, node))
2442            for i, branch in enumerate(node.branches):
2443                if branch and branch.branches:
2444                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
2445                                       self.dumpTree0(branch, node, lev+1))
2446                elif not self.simpleFirst:
2447                    if self.leafStr == self.nodeStr:
2448                        res += "%s\n" % self.showBranch(node, parent, lev, i)
2449                    else:
2450                        res += "%s: %s\n" % \
2451                            (self.showBranch(node, parent, lev, i),
2452                             leafsep + 
2453                             self.formatString(self.leafStr, branch, node))
2454            return res
2455        else:
2456            return self.formatString(self.leafStr, node, parent)
2457
2458
2459    def dumpTree(self):
2460        node = self.node()
2461        if self.nodeStr:
2462            lev, res = 1, "root: %s\n" % \
2463                self.formatString(self.nodeStr, node, None)
2464            self.maxDepth += 1
2465        else:
2466            lev, res = 0, ""
2467        return res + self.dumpTree0(node, None, lev)
2468       
2469
2470    def dotTree0(self, node, parent, internalName):
2471        if node.branches:
2472            if node.distribution.abs < self.minExamples or \
2473                len(internalName)-1 > self.maxDepth:
2474                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
2475                    _quoteName(internalName))
2476                return
2477               
2478            label = node.branch_selector.class_var.name
2479            if self.nodeStr:
2480                label += "\\n" + self.formatString(self.nodeStr, node, parent)
2481            self.fle.write('%s [ shape=%s label="%s"]\n' % \
2482                (_quoteName(internalName), self.nodeShape, label))
2483           
2484            for i, branch in enumerate(node.branches):
2485                if branch:
2486                    internalBranchName = "%s-%d" % (internalName,i)
2487                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
2488                        (_quoteName(internalName), 
2489                         _quoteName(internalBranchName), 
2490                         node.branch_descriptions[i]))
2491                    self.dotTree0(branch, node, internalBranchName)
2492                   
2493        else:
2494            self.fle.write('%s [ shape=%s label="%s"]\n' % \
2495                (_quoteName(internalName), self.leafShape, 
2496                self.formatString(self.leafStr, node, parent)))
2497
2498
2499    def dotTree(self, internalName="n"):
2500        self.fle.write("digraph G {\n")
2501        self.dotTree0(self.node(), None, internalName)
2502        self.fle.write("}\n")
2503
2504def _quoteName(x):
2505    return '"%s"' % (base64.b64encode(x))
2506
2507class TreeClassifier(Orange.classification.Classifier):
2508    """
2509
2510    Classifies instances according to the tree stored in :obj:`tree`.
2511
2512    **The classification process**
2513
2514    :obj:`TreeClassifier` uses the :obj:`descender` to descend the
2515    instance from the root. If the :obj:`descender` returns only a
2516    :obj:`Node` and no distribution, the descend should stop as the node
2517    was unambiguously selected. The node's :obj:`~Node.node_classifier`
2518    decides the class.
2519
2520    If the descender returns a :obj:`Node` and a distribution, as it
2521    happens, for example, if the instance's value for the :obj:`Node`'s
2522    feature is unknown, the same process repeats for all subtrees and
2523    their predictions are combined.
2524
2525    **Attributes**
2526
2527    .. attribute:: tree
2528
2529        The root of the tree, as a :class:`Node`.
2530
2531    .. attribute:: descender
2532
2533        A :obj:`Descender` used to descend an instance from the root as
2534        deeply as possible according to the instance's feature values.
2535    """
2536   
2537    def __init__(self, base_classifier=None):
2538        if not base_classifier: base_classifier = _TreeClassifier()
2539        self.nativeClassifier = base_classifier
2540        for k, v in self.nativeClassifier.__dict__.items():
2541            self.__dict__[k] = v
2542 
2543    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
2544                 *args, **kwdargs):
2545        """Classify a new instance.
2546
2547     
2548        :param instance: instance to be classified.
2549        :type instance: :class:`Orange.data.Instance`
2550        :param result_type:
2551              :class:`Orange.classification.Classifier.GetValue` or \
2552              :class:`Orange.classification.Classifier.GetProbabilities` or
2553              :class:`Orange.classification.Classifier.GetBoth`
2554       
2555        :rtype: :class:`Orange.data.Value`,
2556              :class:`Orange.statistics.Distribution` or a tuple with both
2557        """
2558        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
2559
2560    def __setattr__(self, name, value):
2561        if name == "nativeClassifier":
2562            self.__dict__[name] = value
2563            return
2564        if name in self.nativeClassifier.__dict__:
2565            self.nativeClassifier.__dict__[name] = value
2566        self.__dict__[name] = value
2567   
2568    def __str__(self):
2569        return self.format()
2570
2571    @Orange.misc.deprecated_keywords({"fileName": "file_name", \
2572        "leafStr": "leaf_str", "nodeStr": "node_str", \
2573        "userFormats": "user_formats", "minExamples": "min_examples", \
2574        "maxDepth": "max_depth", "simpleFirst": "simple_first"})
2575    def format(self, leaf_str = "", node_str = "", \
2576            user_formats=[], min_examples=0, max_depth=1e10, \
2577            simple_first=True):
2578        """
2579        Return a string representation of a tree.
2580
2581        :arg leaf_str: The format string for the tree leaves. If
2582          left empty, ``"%V (%^.2m%)"`` will be used for classification trees
2583          and ``"%V"`` for regression trees.
2584        :type leaf_str: string
2585        :arg node_str: The format string for the internal nodes.
2586          If left empty (as it is by default), no data is printed out for
2587          internal nodes. If set to ``"."``, the same string is
2588          used as for leaves.
2589        :type node_str: string
2590        :arg max_depth: If set, it limits the depth to which the tree is
2591          printed out.
2592        :type max_depth: integer
2593        :arg min_examples: If set, the subtrees with less than the given
2594          number of examples are not printed.
2595        :type min_examples: integer
2596        :arg simple_first: If True (default), the branches with a single
2597          node are printed before the branches with larger subtrees.
2598          If False, the branches are printed in order of
2599          appearance.
2600        :type simple_first: boolean
2601        :arg user_formats: A list of regular expressions and callback
2602          function through which the user can print out other specific
2603          information in the nodes.
2604        """
2605        return _TreeDumper(leaf_str, node_str, user_formats + 
2606            _TreeDumper.defaultStringFormats, min_examples, 
2607            max_depth, simple_first, self).dumpTree()
2608
2609    dump = format
2610
2611    @Orange.misc.deprecated_keywords({"fileName": "file_name", \
2612        "leafStr": "leaf_str", "nodeStr": "node_str", \
2613        "leafShape": "leaf_shape", "nodeShape": "node_shape", \
2614        "userFormats": "user_formats", "minExamples": "min_examples", \
2615        "maxDepth": "max_depth", "simpleFirst": "simple_first"})
2616    def dot(self, file_name, leaf_str = "", node_str = "", \
2617            leaf_shape="plaintext", node_shape="plaintext", \
2618            user_formats=[], min_examples=0, max_depth=1e10, \
2619            simple_first=True):
2620        """ Print the tree to a file in a format used by `GraphViz
2621        <http://www.research.att.com/sw/tools/graphviz>`_.  Uses the
2622        same parameters as :meth:`format` plus two which define the shape
2623        of internal nodes and leaves of the tree:
2624
2625        :param leaf_shape: Shape of the outline around leaves of the tree.
2626            If "plaintext", no outline is used (default: "plaintext").
2627        :type leaf_shape: string
2628        :param node_shape: Shape of the outline around internal nodes
2629            of the tree. If "plaintext", no outline is used (default: "plaintext")
2630        :type node_shape: string
2631
2632        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
2633        for various outlines supported by GraphViz.
2634        """
2635        fle = type(file_name) == str and open(file_name, "wt") or file_name
2636
2637        _TreeDumper(leaf_str, node_str, user_formats + 
2638            _TreeDumper.defaultStringFormats, min_examples, 
2639            max_depth, simple_first, self,
2640            leafShape=leaf_shape, nodeShape=node_shape, fle=fle).dotTree()
2641
2642    def count_nodes(self):
2643        """
2644        Return the number of nodes.
2645        """
2646        return _countNodes(self.tree)
2647
2648    def count_leaves(self):
2649        """
2650        Return the number of leaves.
2651        """
2652        return _countLeaves(self.tree)
2653
2654def _countNodes(node):
2655    count = 0
2656    if node:
2657        count += 1
2658        if node.branches:
2659            for node in node.branches:
2660                count += _countNodes(node)
2661    return count
2662
2663def _countLeaves(node):
2664    count = 0
2665    if node:
2666        if node.branches: # internal node
2667            for node in node.branches:
2668                count += _countLeaves(node)
2669        else:
2670            count += 1
2671    return count
2672
Note: See TracBrowser for help on using the repository browser.