source: orange/orange/Orange/classification/tree.py @ 7710:363cb7f443ec

1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
12This page describes the Orange trees. It first describes the basic
13components and procedures: it starts with the structure that represents
14the tree, then it defines how the tree is used for classification,
15how it is built and pruned. The order might seem strange,
16but the things are rather complex and this order is perhaps a
17bit easier to follow. After you have some idea about what the
18principal components do, we described the
19concrete classes that you can use as
20components for a tree learner.
21
Classification trees are represented as a tree-like hierarchy of
:obj:`Node` classes.

.. class:: Node

    Node stores information about the learning examples belonging
    to the node, a branch selector, a list of branches (if the node is
    not a leaf) with their descriptions and strengths, and a classifier.

    .. attribute:: distribution

        Stores a distribution of the learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`TreeLearnerBase`'s :obj:`storeDistributions` flag to false.

    .. attribute:: contingency

        Stores complete contingency matrices for the learning examples
        belonging to the node. Storing contingencies can be enabled by
        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies get computed and stored to
        :obj:`Node`, but are removed shortly afterwards.
        The details are given in the
        description of the :obj:`TreeLearnerBase` object.

    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the
        corresponding ID of the weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.

    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if
        :obj:`TreeLearnerBase`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.

    If the node is a leaf, the remaining fields are None.
    If it's an internal node, there are several additional fields.

    .. attribute:: branches

        Stores a list of subtrees, given as :obj:`Node`.
        An element can be None; in this case the node is empty.

    .. attribute:: branchDescriptions

        A list with string descriptions for branches, constructed by
        :obj:`SplitConstructor`. It can contain different kinds
        of descriptions, but basically, expect things like 'red' or '>12.3'.

    .. attribute:: branchSizes

        Gives a (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.

    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified into any branch, the selector
        can return an :obj:`Orange.data.Value` containing a special value
        (``sVal``) which should be a discrete distribution
        (:obj:`DiscDistribution`). This should represent the
        :obj:`branchSelector`'s opinion on how to divide the
        example between the branches. Whether the proposal is
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).

    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none if it is a leaf.

    .. method:: treeSize()

        Return the number of nodes in the subtrees (including the
        node, excluding null-nodes).

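For illustration, here is a minimal sketch that walks these fields to
count the leaves. It is hedged: it assumes the default components and
the lenses dataset used in the examples below, and relies only on the
attributes documented above (an internal node is recognized by a
defined :obj:`branchSelector`)::

    import Orange

    data = Orange.data.Table("lenses")
    treeClassifier = Orange.classification.tree.TreeLearnerBase()(data)

    def countLeaves(node):
        if not node:
            return 0  # a null-node
        if node.branchSelector:  # an internal node
            return sum(countLeaves(branch) for branch in node.branches)
        return 1  # a leaf

    print countLeaves(treeClassifier.tree)
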
==============
Classification
==============

.. class:: _TreeClassifier

    Classifies examples according to a tree stored in :obj:`tree`.

    .. attribute:: tree

        The root of the tree, represented as a :class:`Node`.

    Classification would be straightforward if there were no unknown
    values or, in general, examples that cannot be placed into a
    single branch. The response in such cases is determined by a
    component :obj:`descender`.

    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.

    In general there are three possible outcomes of a descent.

    #. The descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them over by
       selecting a single branch and continued the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node. Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` wants to split the example (i.e., to decide the
       class by voting).

    In the third case, the descender returns a :obj:`Node` and the
    vote-weights for the branches.
    The weights can correspond to the distribution returned by
    :obj:`branchSelector`, to the number of learning examples that
    were assigned to each branch, or to something else.

    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.

    When voting, subtrees do not predict the class but probabilities
    of classes. The predictions are multiplied by weights, summed and
    the most probable class is returned.

    .. method:: vote()

    .. method:: classDistribution()

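    The voting scheme can be paraphrased in Python. The following is a
    hedged sketch, not the actual implementation: it assumes the
    :obj:`Descender` call signature documented below and that node
    classifiers can return class probabilities when called with
    :obj:`orange.Classifier.GetProbabilities`::

        import orange

        def classDistribution(descender, node, example):
            # Descend as far as possible; the descender returns the
            # reached node and, optionally, vote weights for its branches.
            node, weights = descender(node, example)
            if not weights:
                # A leaf, or an internal node where the descent stopped:
                # the node's classifier decides.
                return list(node.nodeClassifier(example,
                    orange.Classifier.GetProbabilities))
            # Otherwise the subtrees vote: weight, sum and normalize the
            # class probabilities predicted by the non-null branches.
            votes = None
            for branch, weight in zip(node.branches, weights):
                if not branch:
                    continue
                dist = [p * weight
                        for p in classDistribution(descender, branch, example)]
                votes = dist if votes is None \
                    else [v + p for v, p in zip(votes, dist)]
            total = sum(votes)
            return [v / total for v in votes]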

The rest of this section is only for those interested in the C++ code.
======================================================================

If you'd like to understand how the classification works in C++,
start reading at :obj:`TTreeClassifier::vote`. It gets a
:obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
vote weights. For each node, it calls
:obj:`TTreeClassifier::classDistribution` and then multiplies
and sums the distributions. :obj:`vote` returns a normalized
distribution of predictions.

A new overload of :obj:`TTreeClassifier::classDistribution` gets
an additional parameter, a :obj:`Node`. This is done
for the sake of recursion. The normal version of
:obj:`classDistribution` simply calls the overloaded one with the
tree root as an additional parameter. :obj:`classDistribution`
uses :obj:`descender`. If the descender reaches a leaf, it calls
:obj:`nodeClassifier`, otherwise it calls :obj:`vote`.

Thus, the :obj:`TreeClassifier`'s :obj:`vote` and
:obj:`classDistribution` are written in the form of a double
recursion. The recursive calls do not happen at each node of the
tree but only at nodes where a vote is needed (that is, at nodes
where the descender halts).

For predicting a class, :obj:`operator()` calls the
descender. If it reaches a leaf, the class is predicted by the
leaf's :obj:`nodeClassifier`. Otherwise, it calls
:obj:`vote`. From now on, :obj:`vote` and
:obj:`classDistribution` interweave down the tree and return
a distribution of predictions. :obj:`operator()` then simply
chooses the most probable class.

========
Learning
========

The main learning object is :obj:`TreeLearnerBase`. It is basically
a skeleton into which the user must plug the components for particular
functions. For easier use, defaults are provided.

Components that govern the structure of the tree are :obj:`split`
(of type :obj:`SplitConstructor`), :obj:`stop` (of
type :obj:`StopCriteria`) and :obj:`exampleSplitter`
(of type :obj:`ExampleSplitter`).

.. class:: SplitConstructor

    Finds a suitable criterion for dividing the learning (and later testing)
    examples coming to the node. The data it gets is a set of examples
    (and, optionally, an ID of the weight meta-attribute), a domain
    contingency computed from the examples, apriori class probabilities,
    a list of candidate attributes it should consider and a node
    classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).

    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold; one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain contingency
    does not suffice; for example, when ReliefF is used as a measure
    of quality of attributes. In this case, there's no other way but
    to use the examples and ignore the precomputed contingencies.

    The split constructor should consider only the attributes in the
    candidate list (the list is a vector of booleans, one for each
    attribute).

    :obj:`SplitConstructor` returns most of the data we talked
    about when describing the :obj:`Node`. It returns a classifier
    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
    branch descriptions and a list with the number of examples that
    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`TreeLearnerBase` will find the number itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.

    In addition, it returns a quality of the split; a number without
    any fixed meaning except that higher numbers mean better splits.

    If the constructed splitting criterion uses an attribute in such
    a way that the attribute is 'completely spent' and should not be
    considered as a split criterion in any of the subtrees (the
    typical case of this are discrete attributes that are used
    as-they-are, that is, without any binarization or subsetting),
    then it should report the index of this attribute. Some splits
    do not spend any attribute; this is indicated by returning a
    negative index.

    A :obj:`SplitConstructor` can veto further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.

    .. attribute:: minSubset

        Sets the minimal number of examples in non-null leaves. As
        always in Orange (where not specified otherwise), "number of
        examples" refers to the weighted number of examples.

    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])

        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`selector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.

        :param examples: Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
            list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori_distribution: Should be of type
            :obj:`orange.Distribution`, and candidates should be a Python
            list of objects which are interpreted as booleans.

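    A split constructor can also be called outside of tree induction.
    A hedged sketch (it assumes the lenses dataset used in the examples
    below, and that the optional arguments may be omitted in a direct
    call)::

        import Orange

        data = Orange.data.Table("lenses")
        split = Orange.classification.tree.SplitConstructor_Attribute()
        split.measure = Orange.feature.scoring.GainRatio()
        selector, descriptions, sizes, quality, spent = split(data)
        print descriptions, quality, spent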

.. class:: StopCriteria

    Given a set of examples, a weight ID and contingency matrices, decides
    whether to continue the induction or not. The basic criterion checks
    whether there are any examples and whether they belong to at least
    two different classes (if the class is discrete). Derived components
    check things like the number of examples and the proportion of
    majority classes.

    As opposed to :obj:`SplitConstructor` and similar basic classes,
    :obj:`StopCriteria` is not an abstract but a fully functional
    class that provides the basic stopping criteria. That is, the tree
    induction stops when there is at most one example left; in this case,
    it is not the weighted but the actual number of examples that counts.
    Besides that, the induction stops when all examples are in the same
    class (for discrete problems) or have the same value of the outcome
    (for regression problems).

    .. method:: __call__(examples[, weightID, domain contingencies])

        Decides whether to stop (true) or continue (false) the induction.
        If contingencies are given, they are used for checking whether
        the examples are in the same class (but not for counting the
        examples). Derived classes should use the contingencies whenever
        they are available, but otherwise compute them only when they
        really need them; if contingencies are not given,
        :obj:`StopCriteria` will work without them.

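    Since :obj:`StopCriteria` is fully functional, a custom criterion can
    simply wrap the default one. A hedged sketch, relying on the
    Python-callback mechanism described in the Classes section below:
    stop whenever the default criterion would, or at random in about 20%
    of the cases::

        import Orange
        from random import randint

        defStop = Orange.classification.tree.StopCriteria()

        def randomStop(examples, weightID, contingency):
            return defStop(examples, weightID, contingency) or randint(1, 5) == 1

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.stop = randomStop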

.. class:: ExampleSplitter

    An abstract base class for objects that split sets of examples into
    subsets. The derived classes differ in their treatment of examples which
    cannot be unambiguously placed into a single branch (usually due
    to an unknown value of the crucial attribute).

    Just like the :obj:`Descender` decides the branch for an
    example during classification, the :obj:`ExampleSplitter`
    sorts the learning examples into branches.

    :obj:`ExampleSplitter` is given a :obj:`Node` (from which
    it can use various data, though most splitters only use the
    :obj:`branchSelector`), a set of examples to be divided, and
    the weight ID. The result is a list of subsets of examples
    and, optionally, a list of new weight IDs.

    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most
    :obj:`ExampleSplitter`'s simply call the node's
    :obj:`branchSelector` and assign examples to the corresponding
    branches. When the value is unknown they choose a particular
    branch or simply skip the example.

    Some enhanced splitters can split examples. An example (actually,
    a pointer to it) is copied to more than one subset. To facilitate
    real splitting, weights are needed. Each branch is assigned a
    weight ID (each would usually have its own ID) and all examples
    that are in that branch (either completely or partially) should
    have this meta attribute. If an example hasn't been split, it
    has only one additional attribute - with the weight ID corresponding
    to the subset to which it went. An example that is split between,
    say, three subsets, has three new meta attributes, one for each
    subset. The IDs of weight meta attributes are returned by the
    :obj:`ExampleSplitter` to be used at induction of the
    corresponding subtrees.

    Note that weights are used only when needed. When no splitting
    occurred - because the splitter is not able to do it or because
    there was no need for splitting - no weight IDs are returned.

    .. method:: __call__(node, examples[, weightID])

        Use the information in :obj:`node` (particularly the
        :obj:`branchSelector`) to split the given set of examples into subsets.
        Return a tuple with a list of example generators and a list of weights.
        The list of weights is either an ordinary Python list of integers or
        None when no splitting of examples occurs and thus no weights are
        needed.

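    For illustration, a splitter can be exercised directly on an induced
    node. A hedged sketch (it assumes a tree built with stored examples,
    so that the root's example table is available, and that a splitter
    may be invoked outside of induction)::

        import Orange

        data = Orange.data.Table("lenses")
        learner = Orange.classification.tree.TreeLearnerBase()
        learner.storeExamples = True
        root = learner(data).tree

        splitter = Orange.classification.tree.ExampleSplitter_UnknownsToAll()
        subsets, weights = splitter(root, root.examples)
        print [len(subset) for subset in subsets], weights
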
.. class:: TreeLearnerBase

    TreeLearnerBase has a number of components.

    .. attribute:: split

        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous attributes
        are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        attributes and five examples in a leaf for continuous attributes.

    .. attribute:: stop

        Object of type :obj:`StopCriteria`. The default stopping
        criterion stops induction when all examples in a node belong
        to the same class.

    .. attribute:: splitter

        Object of type :obj:`ExampleSplitter`. The default splitter
        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits
        the learning examples according to distributions given by the
        selector.

    .. attribute:: contingencyComputer

        By default, this slot is left empty and ordinary contingency
        matrices are computed for examples at each node. If the need arises,
        one can change the way the matrices are computed. This can be
        used to change the way that unknown values are treated when
        assessing qualities of attributes. As mentioned earlier,
        the computed matrices can be used by the split constructor and by the
        stopping criteria. On the other hand, they can be (and are)
        ignored by some split constructors.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_MergeAsSelector`, which votes using
        the :obj:`branchSelector`'s distribution for vote weights.

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is generated.
        The default is 100, to prevent any infinite tree induction due
        to mis-settings in the stop criteria. If you are sure you need
        larger trees, increase it. If you, on the other hand, want
        to lower this hard limit, you can do so as well.

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decides whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same distribution
        that is stored in :obj:`contingency.classes`.

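    The components are plugged in as ordinary attributes. A hedged
    configuration sketch, using only the components and flags documented
    above::

        import Orange

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.split = Orange.classification.tree.SplitConstructor_Attribute()
        learner.split.measure = Orange.feature.scoring.GainRatio()
        learner.splitter = Orange.classification.tree.ExampleSplitter_UnknownsToCommon()
        learner.maxDepth = 4
        learner.storeExamples = True

        data = Orange.data.Table("lenses")
        treeClassifier = learner(data)
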
    The :obj:`TreeLearnerBase` first sets the defaults for missing
    components. Although stored in the actual :obj:`TreeLearnerBase`'s
    fields, they are removed when the induction is finished.

    Then it ensures that examples are stored in a table. This is needed
    because the algorithm juggles with pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to assure that pointers
    remain pointing to examples even if the user later changes the
    example table. If they are in a table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.

    Apriori class probabilities are computed. At this point we check
    the sum of example weights; if it's zero, there are no examples and
    we cannot proceed. A list of candidate attributes is set; in the
    beginning, all attributes are candidates for the split criterion.

    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments
    are a set of examples, a weight meta-attribute ID (a tricky thing,
    it can be always the same as the original or can change to
    accommodate splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).

    The contingency matrix is computed next. This happens
    even if the flag :obj:`storeContingencies` is false.
    If the :obj:`contingencyComputer` is given we use it,
    otherwise we construct just an ordinary contingency matrix.

    The :obj:`stop` is called to see whether it's worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if
    the :obj:`forceNodeClassifier` flag is set.

    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
    a choice of :obj:`contingencyComputer` will, in many cases,
    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
    return no classifier; if so, and if the classifier would be
    needed for classification, the :obj:`TreeClassifier`'s function
    returns DK or an empty distribution. If you're writing your own
    tree classifier - pay attention.

    If the induction is to continue, the :obj:`split` component is called.
    If it fails to return a branch selector, induction stops and the
    :obj:`Node` is returned.

    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide
    the examples as described above.

    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they wish.

    The :obj:`TreeLearnerBase` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    the corresponding weight is used for each branch. Besides that, the
    attribute spent by the splitter (if any) is removed from the
    list of candidates for the subtree.

    A subset of examples is stored in its corresponding tree node,
    if so requested. If not, the new weight attributes are removed (if
    any were created).

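    The recursion can be paraphrased in Python. This is a hedged sketch
    of the control flow only, using the component call signatures
    documented above; the contingency computation, the apriori
    distribution and the per-branch weight handling are simplified away::

        import Orange

        def induce(learner, examples, weightID, candidates):
            node = Orange.classification.tree.Node()
            if learner.stop(examples, weightID):
                node.nodeClassifier = learner.nodeLearner(examples, weightID)
                return node
            (node.branchSelector, node.branchDescriptions, node.branchSizes,
             quality, spent) = learner.split(examples, weightID, None, candidates)
            if not node.branchSelector:
                node.nodeClassifier = learner.nodeLearner(examples, weightID)
                return node
            if spent >= 0:  # the spent attribute is no longer a candidate
                candidates = candidates[:spent] + [0] + candidates[spent + 1:]
            subsets, weights = learner.splitter(node, examples, weightID)
            node.branches = [induce(learner, subset, weightID, candidates)
                             if len(subset) else None for subset in subsets]
            return node
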
Pruning
=======

Tree pruners derived from :obj:`Pruner` can be given either a
:obj:`Node` (presumably, but not necessarily, a root) or a
:obj:`TreeClassifier`. The result is a new, pruned :obj:`Node`
or a new :obj:`TreeClassifier` with a pruned tree. The original
tree remains intact.

Note however that pruners construct only a shallow copy of a tree.
The :obj:`Node`'s of the pruned tree contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. Thus, you may modify a pruned tree structure
(manually cut it, add new nodes, replace components) but modifying,
for instance, some node's :obj:`nodeClassifier` (the
:obj:`nodeClassifier` itself, not a reference to it!) would modify
the node's :obj:`nodeClassifier` in the corresponding node of
the original tree.

Talking about node classifiers - pruners cannot construct a
:obj:`nodeClassifier` nor merge the :obj:`nodeClassifier`'s of the pruned
subtrees into classifiers for new leaves. Thus, if you want to build
a prunable tree, internal nodes must have their :obj:`nodeClassifier`
defined. Fortunately, all you need to do is nothing; if you leave
the :obj:`TreeLearnerBase`'s flags as they are by default, the
:obj:`nodeClassifier`'s are created.

=======
Classes
=======

Several classes described above are already functional and can
(and mostly will) be used as they are. Those classes are :obj:`Node`,
:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes
the other classes.

Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner`
and :obj:`Classifier` are among the Orange classes that can be subtyped
in Python and have the call operator overloaded in such a way that it
is called back from C++ code. You can thus program your own components
for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed
information on how this is done, and what can go wrong, is given in a
separate page, dedicated to callbacks to Python XXXXXXXXXX.

SplitConstructors
=====================

Split construction is almost as exciting as waiting for a delayed flight.
Boring, that is. Split constructors are programmed as spaghetti code
that juggles with contingency matrices, with separate cases for discrete
and continuous classes... Most split constructors work either for
discrete or for continuous attributes. The suggested practice is
to use a :obj:`SplitConstructor_Combined` that can handle
both by simply delegating attributes to specialized split constructors.

Note: split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)

The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).

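For instance, a regression tree needs a regression-capable measure. A
hedged sketch of the two choices, using the measures named above::

    import Orange

    # for classification trees: an information-based measure
    clsSplit = Orange.classification.tree.SplitConstructor_Threshold()
    clsSplit.measure = Orange.feature.scoring.GainRatio()

    # for regression trees: mean squared error
    regSplit = Orange.classification.tree.SplitConstructor_Threshold()
    regSplit.measure = Orange.feature.scoring.MSE()
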
.. class:: SplitConstructor_Measure

    Bases: :class:`SplitConstructor`

    An abstract base class for split constructors that employ
    an :obj:`Orange.feature.scoring.Measure` to assess the quality of a split.
    At present,
    all split constructors except for :obj:`SplitConstructor_Combined`
    are derived from this class.

    .. attribute:: measure

        A component of type :obj:`Orange.feature.scoring.Measure` used for
        the evaluation of a split. Note that you must select a subclass of
        :obj:`Orange.feature.scoring.Measure` capable of handling your
        class type
        - you cannot use :obj:`Orange.feature.scoring.GainRatio`
        for building regression trees or :obj:`Orange.feature.scoring.MSE`
        for classification trees.

    .. attribute:: worstAcceptable

        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
        :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Attribute

    Bases: :class:`SplitConstructor_Measure`

    Attempts to use a discrete attribute as a split; each value of the
    attribute corresponds to a branch in the tree. Attributes are
    evaluated with the :obj:`measure` and the one with the
    highest score is used for a split. If there is more than one
    attribute with the highest score, one of them is selected at random.

    The constructed :obj:`branchSelector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. If the attribute is
    :obj:`Orange.data.variable.Discrete`, the
    :obj:`branchDescriptions` are the attribute's values. The
    attribute is marked as spent, so that it cannot reappear in the
    node's subtrees.

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Measure`

    Works on discrete attributes. For each attribute, it determines
    which binarization of the attribute gives the split with the
    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
    it returns one of those with the highest score.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. This time, however, its :obj:`transformer`
    contains an instance of :obj:`MapIntValue` that maps the values
    of the attribute into a binary attribute. Branch descriptions are
    of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding to
    more than one value of the attribute. Branches that correspond to
    a single value of the attribute are described with this value. If
    the attribute was originally binary, it is spent and cannot be
    used in the node's subtrees. Otherwise, it can reappear in the
    subtrees.

.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Measure`

    This is currently the only constructor for splits on continuous
    attributes. It divides the range of the attribute's values with a threshold
    that maximizes the split's quality. As always, if there is more than
    one split with the highest score, a random threshold is selected.
    The attribute that yields the best binary split is returned.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached
    :obj:`transformer`. This time, :obj:`transformer` is of type
    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
    "<threshold" and ">=threshold". The attribute is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Measure`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    This constructor delegates the task of finding the optimal split
    to separate split constructors for discrete and for continuous
    attributes. Each split constructor is called, given only attributes
    of the appropriate type as candidates. Both construct a candidate for
    a split; the better of them is selected.

    (Note that there is a problem when several candidates have the same
    score. Suppose there are nine discrete attributes with the highest
    score; the split constructor for discrete attributes will select
    one of them. Now, let us suppose that there is a single continuous
    attribute with the same score. :obj:`SplitConstructor_Combined`
    would randomly select between the proposed discrete attribute and
    the continuous attribute, not aware of the fact that the discrete
    attribute has already competed with eight other discrete attributes. So,
    the probability for selecting (each) discrete attribute would be 1/18
    instead of 1/10. Although not really correct, we doubt that this
    would affect the tree's performance; many other machine learning
    systems simply choose the first attribute with the highest score
    anyway.)

    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
    the attribute is spent is decided by the winning split constructor.

    .. attribute:: discreteSplitConstructor

        Split constructor for discrete attributes; can be, for instance,
        :obj:`SplitConstructor_Attribute` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuousSplitConstructor

        Split constructor for continuous attributes; at the moment, it
        can be either :obj:`SplitConstructor_Threshold` or a split
        constructor you programmed in Python.

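A hedged sketch of assembling the combined constructor by hand::

    import Orange

    tree = Orange.classification.tree
    learner = tree.TreeLearnerBase()
    learner.split = tree.SplitConstructor_Combined()
    learner.split.discreteSplitConstructor = tree.SplitConstructor_ExhaustiveBinary()
    learner.split.continuousSplitConstructor = tree.SplitConstructor_Threshold()
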
StopCriteria and StopCriteria_common
============================================

:obj:`StopCriteria` determines when to stop the induction of subtrees, as
described in detail in the description of the learning process. XXXXXXXXXX

.. class:: StopCriteria_common

    :obj:`StopCriteria_common` contains additional criteria for pre-pruning:
    it checks the proportion of majority class and the number of weighted
    examples.

    .. attribute:: maxMajority

        Maximal proportion of majority class. When this is exceeded,
        induction stops.

    .. attribute:: minExamples

        Minimal number of examples in internal nodes. Subsets with fewer
        than :obj:`minExamples` examples are not split any further.
        The example count is weighted.

.. class:: StopCriteria_Python

    Undocumented.

Classes derived from ExampleSplitter
========================================

:obj:`ExampleSplitter` is the third crucial component of
:obj:`TreeLearnerBase`. Its function is described in the
description of the learning process. XXXXXXXXXX

.. class:: ExampleSplitter_IgnoreUnknowns

    Bases: :class:`ExampleSplitter`

    Simply ignores the examples for which no single branch can be determined.

.. class:: ExampleSplitter_UnknownsToCommon

    Bases: :class:`ExampleSplitter`

    Places all such examples into the branch with the highest number of
    examples. If there is more than one such branch, one is selected at
    random and then used for all examples.

.. class:: ExampleSplitter_UnknownsToAll

    Bases: :class:`ExampleSplitter`

    Places examples with an unknown value of the attribute into all branches.

.. class:: ExampleSplitter_UnknownsToRandom

    Bases: :class:`ExampleSplitter`

    Selects a random branch for such examples.

.. class:: ExampleSplitter_UnknownsToBranch

    Bases: :class:`ExampleSplitter`

    Constructs an additional branch to contain all such examples.
    The branch's description is "unknown".

.. class:: ExampleSplitter_UnknownsAsBranchSizes

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the proportions of examples in each branch.

.. class:: ExampleSplitter_UnknownsAsSelector

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the distribution proposed by the selector (which is in most cases the
    same as the proportions of examples in the branches).

Descender and derived classes
=================================

This is the classifier's counterpart of :obj:`ExampleSplitter`. It
decides the destiny of examples that need to be classified and cannot
be unambiguously put into a single branch. The detailed function of this
class is given in the description of classification with trees. XXXXXX

.. class:: Descender

    An abstract base object for tree descenders.

    .. method:: __call__(node, example)

        Descends down the tree until it reaches a leaf or a node in
        which a vote of subtrees is required. In both cases, a tuple
        of two elements is returned; in the former, the tuple contains
        the reached node and None, in the latter it
        contains a node and weights of votes for subtrees (a list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the returned
        vote weights.

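The descender is chosen at learning time, through the
:obj:`TreeLearnerBase`'s :obj:`descender` attribute. A hedged sketch::

    import Orange

    learner = Orange.classification.tree.TreeLearnerBase()
    # classify examples with unknown values into the largest branch
    learner.descender = Orange.classification.tree.Descender_UnknownsToCommonBranch()
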
.. class:: Descender_UnknownsToNode

    Bases: :obj:`Descender`

    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`nodeClassifier`
    will be used to make a decision. It is your responsibility to see
    that even the internal nodes have their :obj:`nodeClassifier`
    (i.e., don't disable creating node classifiers or manually remove
    them after the induction, that's all).

.. class:: Descender_UnknownsToBranch

    Bases: :obj:`Descender`

    Classifies examples with unknown values into a special branch. This
    makes sense only if the tree itself was constructed with
    :obj:`ExampleSplitter_UnknownsToBranch`.

.. class:: Descender_UnknownsToCommonBranch

    Bases: :obj:`Descender`

    Classifies examples with unknown values into the branch with the
    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.

.. class:: Descender_UnknownsToCommonSelector

    Bases: :obj:`Descender`

    Classifies examples with unknown values into the branch which received
    the highest recommendation by the selector.

.. class:: Descender_MergeAsBranchSizes

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the sizes of the branches.

.. class:: Descender_MergeAsSelector

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.

Pruner and derived classes
==============================

.. index::
    pair: classification trees; pruning

Classes derived from :obj:`Pruner` prune the trees as
described in the section on pruning XXXXXXXX - make sure you read it
to understand what the pruners will do to your trees.

.. class:: Pruner

    This is an abstract base class which defines nothing useful, only
    a pure virtual call operator.

    .. method:: __call__(tree)

        Prunes a tree. The argument can be either a tree classifier or
        a tree node; the result is of the same type as the argument.

.. class:: Pruner_SameMajority

    Bases: :class:`Pruner`

    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
    with more than one leaf) in which all the leaves have the same majority
    class. (This is allowed because those leaves can still have different
    distributions of classes and thus predict different probabilities.)
    However, this can be undesired when we're only interested in the
    class prediction or a simple tree interpretation. The
    :obj:`Pruner_SameMajority` prunes the tree so that there is no
    subtree in which all the nodes would have the same majority class.

    This pruner will only prune the nodes in which the node classifier
    is of class :obj:`orange.DefaultClassifier` (or a derived class).

    Note that the leaves with more than one majority class require some
    special handling. The pruning goes backwards, from leaves to the root.
    When siblings are compared, the algorithm checks whether they
    have (at least one) common majority class. If so, they can be pruned.

.. class:: Pruner_m

    Bases: :class:`Pruner`

    Prunes a tree by comparing m-estimates of static and dynamic
    error as defined in (Bratko, 2002).

    .. attribute:: m

        Parameter m for m-estimation.

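A hedged usage sketch (assuming a tree built with the default flags, so
that internal nodes have their node classifiers)::

    import Orange

    data = Orange.data.Table("lenses")
    treeClassifier = Orange.classification.tree.TreeLearnerBase()(data)

    samePruner = Orange.classification.tree.Pruner_SameMajority()
    prunedClassifier = samePruner(treeClassifier)   # a new TreeClassifier

    mPruner = Orange.classification.tree.Pruner_m()
    mPruner.m = 2.0
    prunedRoot = mPruner(treeClassifier.tree)       # a new Node
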
========
Examples
========

This page does not provide examples for programming your own components,
such as, for instance, a :obj:`SplitConstructor`. Those examples
can be found on a page dedicated to callbacks to Python XXXXXXXX.

Tree Structure
==============

To have something to work on, we'll take the data from the lenses dataset
and build a tree using the default components (part of `treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 7-10

How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?

.. _lenses.tab: code/lenses.tab
.. _treestructure.py: code/treestructure.py

.. literalinclude:: code/treestructure.py
   :lines: 12-21

If the node is None, we have a null-node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (this node) plus the
sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.

    >>> treeSize(treeClassifier.tree)
    10

Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.

Let us now write a simple script that prints out a tree. The recursive
part of the function will get a node and its level (part of `treestructure.py`_, uses `lenses.tab`_).

.. literalinclude:: code/treestructure.py
   :lines: 26-41

Don't waste time on studying formatting tricks (\n's etc.), this is just
for nicer output. What matters is everything but the print statements.
First, we check whether the node is a null-node (a node to which no
learning examples were classified). If this is so, we just print out
"<null node>" and return.

After handling null nodes, the remaining nodes are internal nodes and leaves.
For internal nodes, we print a node description consisting of the
attribute's name and distribution of classes. :obj:`Node`'s branch
selector is, for all currently defined splits, an instance of a
class derived from :obj:`orange.Classifier` (in fact, it is an
:obj:`orange.ClassifierFromVarFD`, but an :obj:`orange.Classifier` would
suffice), and its :obj:`classVar` XXXXX points to the attribute we seek.
So we print its name. We will also assume that storing class distributions
has not been disabled and print them as well. A more capable function for
printing trees (as the one defined in XXXXXXXXXX) has an alternative
means to get the distribution when this fails. Then we iterate
through the branches; for each we print a branch description and recursively
call :obj:`printTree0` with a level increased by 1 (to increase
the indent).

Finally, if the node is a leaf, we print out the distribution of
learning examples in the node and the class to which the examples in
the node would be classified. We again assume that the :obj:`nodeClassifier`
is the default one - a :obj:`DefaultClassifier`. A better print
function should be aware of possible alternatives.

Now, we just need to write a simple function to call our printTree0.
We could write something like...

::

    def printTree(x):
        printTree0(x.tree, 0)

... but we won't. Let us learn how to handle arguments of different
types. Let's write a function that will accept either a :obj:`TreeClassifier`
or a :obj:`Node`; just like Pruners XXXXXX, remember? Part of `treestructure.py`_, uses `lenses.tab`_.

.. literalinclude:: code/treestructure.py
   :lines: 43-49

It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's a
:obj:`Node` we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. (Note that we could also use

::

    if type(x) == Orange.classification.tree.TreeClassifier:

but this would only work if :obj:`x` were of type
:obj:`TreeClassifier` and not of any derived types. The latter,
however, do not exist yet...)

    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> none (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

For a final exercise, let us write a simple pruning procedure. It will
be written entirely in Python, unrelated to any :obj:`Pruner`. Our
procedure will limit the tree depth - the maximal depth (here defined
as the number of internal nodes on any path down the tree) shall be
given as an argument. For example, to get a two-level tree, we would
call cutTree(root, 2). The function will be recursive, with the second
argument (level) decreasing at each call; when zero, the current node
will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 54-62

There's nothing to prune at null-nodes or leaves, so we act only when
:obj:`node` and :obj:`node.branchSelector` are defined. If the level is
not zero, we call the function for each branch. Otherwise, we clear
the selector, branches and branch descriptions.

    >>> cutTree(tree.tree, 2)
    >>> printTree(tree)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no --> soft (<1.000, 5.000, 0.000>)
       : yes --> hard (<2.000, 0.000, 4.000>)

Learning
========

You've already seen a simple example of using a :obj:`TreeLearnerBase`.
You can just call it and let it fill the empty slots with the default
components. This section will teach you three things: what the
missing components are (and how to set the same components yourself),
how to use alternative components to get a different tree and,
finally, how to write a skeleton for tree induction in Python.

Default components for TreeLearnerBase
======================================

Let us construct a :obj:`TreeLearnerBase` to play with.

.. _treelearner.py: code/treelearner.py

`treelearner.py`_, uses `lenses.tab`_:

.. literalinclude:: code/treelearner.py
   :lines: 7-10

There are three crucial components in learning: the split and stop
criteria, and the :obj:`exampleSplitter` (there are some others,
which become important during classification; we'll talk about them
later). They are not defined; if you use the learner, the slots are
filled temporarily but later cleared again.

::

    >>> print learner.split
    None
    >>> learner(data)
    <TreeClassifier instance at 0x01F08760>
    >>> print learner.split
    None

Stopping criteria
=================

The stop is trivial. The default is set by

::

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

Well, this is actually done in C++ and it uses a global component
that is constructed once for all, but apart from that we did
effectively the same thing.

We can now examine the default stopping parameters.

    >>> print learner.stop.maxMajority, learner.stop.minExamples
    1.0 0.0

Not very restrictive. This keeps splitting the examples until
there's nothing left to split or all the examples are in the same
class. Let us set the minimal subset that we allow to be split to
five examples and see what comes out.

    >>> learner.stop.minExamples = 5.0
    >>> tree = learner(data)
    >>> printTree(tree)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> soft (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

OK, that's better. If we want an even smaller tree, we can also limit
the maximal proportion of majority class.

    >>> learner.stop.maxMajority = 0.5
    >>> tree = learner(data)
    >>> printTree(tree)
    --> none (<15.000, 5.000, 4.000>)

Undocumented
============

.. class:: NodeList

    Undocumented.

.. class:: C45NodeList

    Undocumented.

1177===========================
1178C4.5 Classifier and Learner
1179===========================
1180
1181As C4.5 is a standard benchmark in machine learning,
1182it is incorporated in Orange, although Orange has its own
1183implementation of decision trees.
1184
1185The implementation uses the original Quinlan's code for learning so the
1186tree you get is exactly like the one that would be build by standalone
1187C4.5. Upon return, however, the original tree is copied to Orange
1188components that contain exactly the same information plus what is needed
1189to make them visible from Python. To be sure that the algorithm behaves
1190just as the original, we use a dedicated class :class:`C45Node`
1191instead of reusing the components used by Orange's tree inducer
1192(ie, :class:`Node`). This, however, could be done and probably
1193will be done in the future; we shall still retain :class:`C45Node`
1194but offer transformation to :class:`Node` so that routines
1195that work on Orange trees will also be usable for C45 trees.
1196
1197:class:`C45Learner` and :class:`C45Classifier` behave
1198like any other Orange learner and classifier. Unlike most of Orange
1199learning algorithms, C4.5 does not accepts weighted examples.
1200
1201Building the C4.5 plug-in
1202=========================
1203
1204We haven't been able to obtain the legal rights to distribute
1205C4.5 and therefore couldn't statically link it into Orange. Instead,
1206it's incorporated as a plug-in which you'll need to build yourself.
1207The procedure is trivial, but you'll need a C compiler. On Windows,
1208the scripts we provide work with MS Visual C and the files CL.EXE
1209and LINK.EXE must be on the PATH. On Linux you're equipped with
1210gcc. Mac OS X comes without gcc, but you can download it for
1211free from Apple.
1212
1213Orange must be installed prior to building C4.5. (This is because
1214the build script will copy the created file next to Orange,
1215which it obviously can't if Orange isn't there yet.)
1216
1217#. Download the
1218   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
1219   from the `Rule Quest's site <http://www.rulequest.com/>`_ and extract
1220   them into some temporary directory. The files will be modified in the
1221   further process, so don't use your copy of Quinlan's sources that you
1222   need for another purpose.
1223#. Download
1224   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
1225   and unzip its contents into the directory R8/Src of the Quinlan's
1226   stuff (it's the directory that contains, for instance, the file
1227   average.c).
1228#. Run buildC45.py, which will build the plug-in and put it next to
1229   orange.pyd (or orange.so on Linux/Mac).
1230#. Run python, type :samp:`import Orange` and
1231   create create :samp:`Orange.classification.tree.C45Learner()`.
1232   If this fails, something went wrong; see the diagnostic messages from
1233   buildC45.py and read the below paragraph.
1234#. Finally, you can remove the Quinlan's stuff, along with everything
1235   created by buildC45.py.
1236
1237If the process fails, here's what buildC45.py really does: it creates
1238.h files that wrap Quinlan's .i files and ensure that they are not
1239included twice. It modifies C4.5 sources to include .h's instead of
1240.i's. This step can hardly fail. Then follows the platform dependent
1241step which compiles ensemble.c (which includes all the Quinlan's .c
1242files it needs) into c45.dll or c45.so and puts it next to Orange.
1243If this fails, but you do have a C compiler and linker, and you know
1244how to use them, you can compile the ensemble.c into a dynamic
1245library yourself. See the compile and link steps in buildC45.py,
1246if it helps. Anyway, after doing this check that the built C4.5
1247gives the same results as the original.

.. class:: C45Learner

    :class:`C45Learner`'s attributes have double names - those that
    you know from C4.5 command lines and the corresponding names of C4.5's
    internal variables. All defaults are set as in C4.5; if you change
    nothing, you are running C4.5.

    .. attribute:: gainRatio (g)

        Determines whether to use information gain (false, default)
        or gain ratio for selection of attributes (true).

    .. attribute:: batch (b)

        Turn on batch mode (no windows, no iterations); this option is
        not documented in C4.5 manuals. It conflicts with "window",
        "increment" and "trials".

    .. attribute:: subset (s)

        Enables subsetting (default: false, no subsetting).

    .. attribute:: probThresh (p)

        Probabilistic threshold for continuous attributes (default: false).

    .. attribute:: minObjs (m)

        Minimal number of objects (examples) in leaves (default: 2).

    .. attribute:: window (w)

        Initial window size (default: maximum of 20% and twice the
        square root of the number of data objects).

    .. attribute:: increment (i)

        The maximum number of objects that can be added to the window
        at each iteration (default: 20% of the initial window size).

    .. attribute:: cf (c)

        Pruning confidence level (default: 25%).

    .. attribute:: trials (t)

        Set the number of trials in iterative (i.e. non-batch) mode
        (default: 10).

    .. attribute:: prune

        Return pruned tree (not an original C4.5 option) (default: true).

:class:`C45Learner` also offers another way of setting
the arguments: its method :obj:`C45Learner.commandline`
takes a string and parses it the same way as C4.5 would
parse its command line.

.. class:: C45Classifier

    A faithful reimplementation of Quinlan's function from C4.5. The only
    difference (and the only reason it's been rewritten) is that it uses
    a tree composed of :class:`C45Node` instead of C4.5's
    original tree structure.

    .. attribute:: tree

        C4.5 tree stored as a tree of :obj:`C45Node`.

.. class:: C45Node

    This class is a reimplementation of the corresponding *struct* from
    Quinlan's C4.5 code.

    .. attribute:: nodeType

        Type of the node: :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split examples based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute and "subsets" use discrete attributes but with subsetting,
        so that several values can go into the same branch.

    .. attribute:: leaf

        Value returned by that leaf. The field is defined for internal
        nodes as well.

    .. attribute:: items

        Number of (learning) examples in the node.

    .. attribute:: classDist

        Class distribution for the node (of type
        :obj:`orange.DiscDistribution`).

    .. attribute:: tested

        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.

    .. attribute:: cut

        A threshold for continuous attributes, if the node is of type
        :obj:`Cut`. Undefined otherwise.

    .. attribute:: mapping

        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the index of the branch for an example whose value of
        :obj:`tested` is *i*. Here, *i* denotes an index of the value, not
        a :class:`Orange.data.Value`.

    .. attribute:: branch

        A list of branches stemming from this node.

Examples
========

.. _tree_c45.py: code/tree_c45.py
.. _iris.tab: code/iris.tab

The simplest way to use :class:`C45Learner` is to call it. This
script constructs the same learner as you would get by calling
the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):

.. literalinclude:: code/tree_c45.py
   :lines: 7-14

Arguments can be set by the usual mechanism (the two lines below do the
same, except that one uses command-line symbols and the other the
internal variable names)

::

    tree = Orange.classification.tree.C45Learner(data, m=100)
    tree = Orange.classification.tree.C45Learner(data, minObjs=100)

The way preferred by veteran C4.5 users might be through the
:obj:`C45Learner.commandline` method.

::

    lrn = Orange.classification.tree.C45Learner()
    lrn.commandline("-m 1 -s")
    tree = lrn(data)

There's nothing special about using :obj:`C45Classifier` - it's
just like any other. To demonstrate what the structure of
:class:`C45Node`'s looks like, we will show a script that prints
it out in the same format as C4.5 does.

.. literalinclude:: code/tree_c45_printtree.py

Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
:obj:`C45Node` does not know to which attribute it belongs), we need to
convert it to a string through :obj:`classVar`, which is passed as an
extra argument to the recursive part of printTree.

For discrete splits without subsetting, we print out all attribute values
and recursively call the function for all branches. Continuous splits are
equally easy to handle.

For discrete splits with subsetting, we iterate through branches, retrieve
the corresponding values that go into each branch into the list inset, turn
the values into strings and print them out, separately treating the
case when only a single value goes into the branch.
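
If you only need to traverse the nodes, a bare-bones sketch like the
following (ours, not part of `tree_c45.py`_) suffices; it relies only on
the :obj:`nodeType`, :obj:`tested`, :obj:`branch` and :obj:`leaf`
attributes described above::

    import Orange

    def walk(node, classvar, level=0):
        if node.nodeType == 0:  # C45Node.Leaf
            print "|   " * level + classvar.values[int(node.leaf)]
        else:
            for branch in node.branch:
                print "|   " * level + node.tested.name
                walk(branch, classvar, level + 1)

    data = Orange.data.Table("iris")
    c45 = Orange.classification.tree.C45Learner(data)
    walk(c45.tree, c45.classVar)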

Printing out C45 Tree
=====================

.. autofunction:: printTreeC45

===============
orngTree module
===============

.. autoclass:: TreeLearner
    :members:

.. autoclass:: TreeClassifier
    :members:

For a somewhat more complex example, here's how to write your own stop
function. The example itself is more funny than useful. It constructs
and prints two trees. For the first one we define the *defStop*
function, which is used by default, and combine it with a random function
so that the stop criteria will also be met in an additional 20% of the
cases when *defStop* is false. The second tree is built such that it
considers only the random function as the stopping criterion. Note that in
the second case the lambda function still has three parameters, since this
is the required number of parameters for a stop function (see
:class:`StopCriteria`). Part of `tree3.py`_ (uses `iris.tab`_):

.. _tree3.py: code/tree3.py

.. literalinclude:: code/tree3.py
   :lines: 8-23

The output is not shown here since the resulting trees are rather
big.
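
For reference, here is a condensed, self-contained sketch of the second
variant; the stop function ignores its three arguments and halts the
induction at random in about 20% of the calls (the names are ours, not
taken from `tree3.py`_)::

    import random
    import Orange

    data = Orange.data.Table("iris")

    learner = Orange.classification.tree.TreeLearner()
    # the stop function receives the examples, the weight meta id and the
    # contingency; returning True stops further splitting at that node
    learner.stop = lambda examples, weightID, contingency: \
        random.random() < 0.2
    tree = learner(data)
    Orange.classification.tree.printTree(tree)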

Printing the Tree
=================

The included printing functions can
print out practically anything you'd like to
know, from the number of examples, the proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in a parent node. And even more, you can
define your own callback functions to be used for printing.

Before we go on: you can read all about the function and use it to its
full extent, or you can just call it, giving it the tree as the sole
argument, and it will print out the usual textual representation of the
tree. If you're satisfied with that, you can stop here.

The magic is in the format string. It is a string which is printed
out at every leaf or internal node, with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.

**^** at the start tells that the number should be multiplied by 100.
It's useful for printing proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, :samp:`%N` denotes the number of examples in the node,
hence :samp:`%6.2N` would mean output to two decimal digits and six places
altogether. If left out, a default format :samp:`5.3` is used, unless you
multiply the numbers by 100, in which case the default is :samp:`.0`
(no decimals, the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node with.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` will give the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the
entire data set, so :samp:`%NbA` will give you the proportion of examples
(out of the entire training data set) that fell into that node. If division
is impossible since the parent node does not exist or some data is missing,
a dot is printed out instead of the quantity.

**<quantity>** is the only required element. It defines what to print.
For instance, :samp:`%N` would print out the number of examples in the node.
Possible quantities are

:samp:`V`
    The value predicted at that node. You cannot define the precision
    or divisor here.

:samp:`N`
    The number of examples in the node.

:samp:`M`
    The number of examples in the majority class (that is, the class
    predicted by the node).

:samp:`m`
    The proportion of examples in the majority class.

:samp:`A`
    The average class for examples in the node; this is available only for
    regression trees.

:samp:`E`
    Standard error for class of examples in the node; available for
    regression trees.

:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.

:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this will give the number of examples of Iris-virginica divided by the
    number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`.

    For regression trees, you can use operators =, !=, <, <=, >, and >=,
    as in :samp:`%C<22` - add the precision and divisor if you wish. You can
    also check the number of examples in a certain interval: :samp:`%C[20, 22]`
    will give you the number of examples between 20 and 22 (inclusive)
    and :samp:`%C(20, 22)` will give the number of such examples excluding the
    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
    If you would like the examples outside the interval, add a :samp:`!`,
    like :samp:`%C!(20, 22]`.

:samp:`c`
    Same as above, except that it computes the proportion of the class
    instead of the number of examples.

:samp:`D`
    Prints out the number of examples in each class. You can use both the
    precision (it is applied to each number in the distribution) and the
    divisor. This quantity cannot be computed for regression trees.

:samp:`d`
    Same as above, except that it shows proportions of examples. This
    again doesn't work with regression trees.

<user defined formats>
    You can add more, if you wish. Instructions and examples are given at
    the end of this section.


Examples
========

We shall build a small tree from the iris data set - we shall limit the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):

.. literalinclude:: code/orngTree1.py
   :lines: 0-4

.. _orngTree1.py: code/orngTree1.py

The easiest way to call the function is to pass the tree as the only
argument::

    >>> Orange.classification.tree.printTree(tree)
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)

Let's now print out the predicted class at each node, the number
of examples in the majority class together with the total number of
examples in the node::

    >>> Orange.classification.tree.printTree(tree, leafStr="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

Would you like to know how the number of examples declines as
compared to the entire data set and to the parent node? We find
it with this::

    >>> Orange.classification.tree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)

Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.

And now for the output: all examples of setosa fall into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.

If you find this guesswork annoying, so do we. Let us print out the
number of versicolors in each node, together with the proportion of
versicolors among the examples in this particular node and among all
versicolors. So,

::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

gives the following output::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, we may want to print out the distributions, using a simple
string :samp:`%D`::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

We haven't tried printing out any information for internal nodes.
To start with the most trivial case, we shall print the prediction
at each node.

::

    Orange.classification.tree.printTree(tree, leafStr="%V", nodeStr=".")

says that the nodeStr should be the same as leafStr (not very useful
here, since leafStr is trivial anyway).

::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

Note that the output is somewhat different now: there appeared another
node called *root* and the tree looks one level deeper. This is
needed to print out the data for that node too.

Now for something more complicated: let us observe how the number
of virginicas decreases down the tree::

    Orange.classification.tree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')

Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is
printed out literally. In parentheses we print the same thing, except that
we divide by the examples in the parent node. Note the use of single
quotes, so that we can use double quotes inside the string when we
specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

See what's in the parentheses in the root node? If :func:`printTree`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.

For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think why). In the leaves we shall also add the predicted
class. So now we'll have to call the function like this.

::

    >>> Orange.classification.tree.printTree(tree, leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]

To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set. Called with the tree as the
only argument, :func:`printTree` prints the tree like this::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9

Let us add the standard error in both internal nodes and leaves, and the
90% confidence intervals in the leaves::

    >>> Orange.classification.tree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
:samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
so the printed number has the same number of decimals
as in the data file.

Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of examples within a certain range. For instance,
let us check the number of examples with values below 22, and compare
this number with values in the parent nodes::

    >>> Orange.classification.tree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that among the examples with TAX
above 534, the number of examples with the class value below 22 is 30
times the number of such examples in the parent node.

For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percents.

::

    >>> Orange.classification.tree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")

OK, let's observe the format string one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. With :samp:`%cbP![20, 22]` we divide this by the
same statistic computed on the parent. Add a :samp:`^` and you have the
percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)


Defining Your Own Printout Functions
====================================

:func:`dumpTree`'s argument :obj:`userFormats` can be used to print out
some other information in the leaves or nodes. If provided,
:obj:`userFormats` should contain a list of tuples with a regular expression
and a callback function to be called when that expression is found in the
format string. Expressions from :obj:`userFormats` are checked before
the built-in expressions discussed above, so you can override the built-ins
if you want to.

The regular expression should describe a string like those we used above,
for instance the string :samp:`%.2DbP`. When a leaf or internal node
is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The callback function will get five arguments: the format string
(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

The function can use several utility functions provided in the module.

.. autofunction:: insertStr

.. autofunction:: insertDot

.. autofunction:: insertNum

.. autofunction:: byWhom


There are also a few regular expression fragments that you may want to
reuse. The two you are most likely to use are:

.. autodata:: fs

.. autodata:: by

For a trivial example, :samp:`%V` is implemented like this. There is the
following tuple in the list of built-in formats::

    (re.compile("%V"), replaceV)

:obj:`replaceV` is a function defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to *insertStr* to do the replacement.

A more complex regular expression is the one for the proportion of the
majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
two partial expressions defined above.
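
To see the machinery end to end before a more realistic example, here is
a minimal sketch of a user-defined format; the specifier :samp:`%Y` and
the function :obj:`replaceY` are invented names for illustration, and the
callback simply prints the node's total example weight,
:samp:`node.distribution.abs`::

    import re
    import Orange
    from Orange.classification.tree import insertStr

    def replaceY(strg, mo, node, parent, tree):
        # replace the matched "%Y" with the node's total example weight
        return insertStr(strg, mo, "%.1f" % node.distribution.abs)

    myFormats = [(re.compile("%Y"), replaceY)]

    data = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(data)
    Orange.classification.tree.printTree(tree, leafStr="%V (%Y)",
        userFormats=myFormats)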

Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):

.. _orngTree2.py: code/orngTree2.py

.. literalinclude:: code/orngTree2.py
   :lines: 7-31

We first define getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If the quantity needs to be divided by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node with whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot; otherwise we call :func:`insertNum`, which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.

We can now print out the iris tree:

.. literalinclude:: code/orngTree2.py
    :lines: 33

And we get::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)


Plotting the Tree using Dot
===========================

.. autofunction:: printDot

.. autofunction:: dotTree

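For instance, assuming the output file name is passed as the second
argument (as in the old orngTree module), a sketch like the following
writes the tree to "tree5.dot"::

    import Orange

    data = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(data)
    Orange.classification.tree.printDot(tree, "tree5.dot",
        leafStr="%V", nodeStr=".")
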
Suppose you saved the tree in a file "tree5.dot". You can then
render it as a GIF by executing the following command line

::

    dot -Tgif tree5.dot -otree5.gif

GraphViz's dot has quite a few other output formats; check
its documentation to learn which.

References
==========

Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`,
Addison Wesley.

Koutsofios, E., North, S. C. (1993). Drawing Graphs with dot. AT&T Bell
Laboratories, Murray Hill NJ, USA.

`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_:
the home page of AT&T's dot and related software packages.

"""

from Orange.core import \
     TreeLearner as TreeLearnerBase, \
         TreeClassifier as _TreeClassifier, \
         C45Learner, \
         C45Classifier, \
         C45TreeNode as C45Node, \
         C45TreeNodeList as C45NodeList, \
         TreeDescender as Descender, \
              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
         TreeExampleSplitter as Splitter, \
              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
         TreeNode as Node, \
         TreeNodeList as NodeList, \
         TreePruner as Pruner, \
              TreePruner_SameMajority as Pruner_SameMajority, \
              TreePruner_m as Pruner_m, \
         TreeSplitConstructor as SplitConstructor, \
              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
         TreeStopCriteria as StopCriteria, \
              TreeStopCriteria_Python as StopCriteria_Python, \
              TreeStopCriteria_common as StopCriteria_common

import Orange.core
import operator
import base64
import re
import Orange.data
import Orange.feature.scoring
import Orange.classification.tree

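# Helpers for printTreeC45: recursively print a C45Node tree in C4.5's
# textual format. nodeType 1 prints discrete branches, nodeType 2 prints
# threshold cuts, any other internal type prints value subsets; leaves
# (nodeType 0) print the class and the example count.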
def _c45_showBranch(node, classvar, lev, i):
    var = node.tested
    if node.nodeType == 1:
        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
        _c45_printTree0(node.branch[i], classvar, lev+1)
    elif node.nodeType == 2:
        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
        _c45_printTree0(node.branch[i], classvar, lev+1)
    else:
        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
        inset = [var.values[j[0]] for j in inset]
        if len(inset)==1:
            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
        else:
            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
        _c45_printTree0(node.branch[i], classvar, lev+1)


def _c45_printTree0(node, classvar, lev):
    var = node.tested
    if node.nodeType == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
    else:
        for i, branch in enumerate(node.branch):
            if not branch.nodeType:
                _c45_showBranch(node, classvar, lev, i)
        for i, branch in enumerate(node.branch):
            if branch.nodeType:
                _c45_showBranch(node, classvar, lev, i)

def printTreeC45(tree):
    """
    Prints the tree given as an argument in the same form as Ross Quinlan's
    C4.5 program.

    ::

        import Orange

        data = Orange.data.Table("voting")
        c45 = Orange.classification.tree.C45Learner(data)
        Orange.classification.tree.printTreeC45(c45)

    will print out

    ::

        physician-fee-freeze = n: democrat (253.4)
        physician-fee-freeze = y:
        |   synfuels-corporation-cutback = n: republican (145.7)
        |   synfuels-corporation-cutback = y:
        |   |   mx-missile = y: democrat (6.0)
        |   |   mx-missile = n:
        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
        |   |   |   adoption-of-the-budget-resolution = y:
        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)

    If you run the standalone C4.5 (Orange does use the original C4.5
    code internally) on the same data, it will print out

    ::

        physician-fee-freeze = n: democrat (253.4/5.9)
        physician-fee-freeze = y:
        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
        |   synfuels-corporation-cutback = y:
        |   |   mx-missile = y: democrat (6.0/2.4)
        |   |   mx-missile = n:
        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
        |   |   |   adoption-of-the-budget-resolution = y:
        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)

    which is remarkably similar, except that C4.5 tested the tree on
    the learning data and has also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously can't do
    (nor is there any need for it to).

    """
    _c45_printTree0(tree.tree, tree.classVar, 0)


class TreeLearner(Orange.core.Learner):
    """
    Assembles the generic classification or regression tree learner
    (from Orange's objects for induction of decision trees).
    :class:`TreeLearner` is essentially a wrapper
    around :class:`TreeLearnerBase`, provided for easier use of the latter.
    It sets a number of parameters used in induction that
    can also be set after the creation of the object, most often through
    the object's attributes. If upon initialization
    :class:`TreeLearner` is given a set of examples, then an instance
    of :class:`TreeClassifier` is returned instead.

    The values of attributes can also be set in the constructor.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    **Split construction**

    .. attribute:: split

        Defines a function that will be used in place of
        :obj:`SplitConstructor`.
        Useful when prototyping new tree induction
        algorithms. When this parameter is defined, other parameters that
        affect the procedures for growing of the tree are ignored. These
        include :obj:`binarization`, :obj:`measure`,
        :obj:`worstAcceptable` and :obj:`minSubset`. (Default:
        :class:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while
        continuous attributes are binarized.
        Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for
        discrete and five examples in a leaf for continuous attributes.)

    .. attribute:: binarization

        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
        0, do not use binarization (use :class:`SplitConstructor_Feature`).
        Default: 0.

    .. attribute:: measure

        Measure for scoring the attributes when deciding which of the
        attributes will be used for splitting of the example set in the node.
        Can be either an attribute scoring object or one of
        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
        "gini" (:class:`Orange.feature.scoring.Gini`),
        "relief" (:class:`Orange.feature.scoring.Relief`),
        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
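
        So, to use the Gini index for scoring, one can run the learner
        like this:
        :samp:`l = Orange.classification.tree.TreeLearner(data, measure="gini")`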

    .. attribute:: reliefM, reliefK

        Set `m` and `k` to the given values if the :obj:`measure` is relief.

    **Pruning**

    .. attribute:: worstAcceptable

        Used in pre-pruning, sets the lowest required attribute
        score. If the score of the best attribute is below this margin, the
        tree at that node is not grown further (default: 0).

        So, to allow splitting only when gainRatio (the default measure)
        is greater than 0.6, one should run the learner like this:
        :samp:`l = Orange.classification.tree.TreeLearner(data, worstAcceptable=0.6)`

    .. attribute:: minSubset

        Minimal number of examples in non-null leaves (default: 0).

    .. attribute:: minExamples

        Data subsets with less than :obj:`minExamples`
        examples are not split any further, that is, all leaves in the tree
        will contain at least that many examples (default: 0).

    .. attribute:: maxDepth

        Gives maximal tree depth; 0 means that only root is generated.
        The default is 100.

    .. attribute:: maxMajority

        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0). E.g.
        to stop the induction as soon as the majority class reaches 70%,
        you should write
        :samp:`tree2 = Orange.classification.tree.TreeLearner(data, maxMajority=0.7)`

        This is an example of a tree on the iris data set, built with
        the above argument - the numbers show the majority class
        proportion at each node. You can find more details in the
        script tree2.py, which induces and prints this tree.

        ::

            root: 0.333
            |    petal width<0.800: 1.000
            |    petal width>=0.800: 0.500
            |    |    petal width<1.750: 0.907
            |    |    petal width>=1.750: 0.978


    .. attribute:: stop

        Used for passing a function which is used in place of
        :class:`StopCriteria`. Useful when prototyping new
        tree induction algorithms. See the documentation on
        :class:`StopCriteria` for more info on this function.
        When used, parameters :obj:`maxMajority` and :obj:`minExamples`
        will not be considered (default: None).
        The default stopping criterion stops induction when all examples
        in a node belong to the same class.

    .. attribute:: mForPruning

        If non-zero, invokes an error-based bottom-up post-pruning,
        where m-estimate is used to estimate class probabilities
        (default: 0).

    .. attribute:: sameMajorityPruning

        If true, invokes a bottom-up post-pruning by removing the
        subtrees of which all leaves classify to the same class
        (default: False).

    **Record keeping**

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Determines whether to store class distributions, contingencies and
        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. You won't save any memory
        by storing contingencies but not distributions, since the
        distribution actually points to the same object that is stored
        in :obj:`contingency.classes`. (Default: True for everything
        except :obj:`storeExamples`, which defaults to False.)

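        For example, to keep the learning examples in each node, one can
        build the tree with
        :samp:`tree = Orange.classification.tree.TreeLearner(data, storeExamples=True)`.
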
    """
    def __new__(cls, examples = None, weightID = 0, **argkw):
        self = Orange.core.Learner.__new__(cls, **argkw)
        if examples:
            self.__init__(**argkw)
            return self.__call__(examples, weightID)
        else:
            return self

    def __init__(self, **kw):
        self.learner = None
        self.__dict__.update(kw)

    def __call__(self, examples, weight=0):
        """
        Return a classifier from the given examples.
        """
        bl = self._base_learner()

        # if no custom split constructor or measure is given, choose a
        # scoring measure that fits the type of the class variable
        if not hasattr(self, "split") and not hasattr(self, "measure"):
            if examples.domain.classVar.varType == Orange.data.Type.Discrete:
                measure = Orange.feature.scoring.GainRatio()
            else:
                measure = Orange.feature.scoring.MSE()
            bl.split.continuousSplitConstructor.measure = measure
            bl.split.discreteSplitConstructor.measure = measure

        tree = bl(examples, weight)
        if getattr(self, "sameMajorityPruning", 0):
            tree = Pruner_SameMajority(tree)
        if getattr(self, "mForPruning", 0):
            tree = Pruner_m(tree, m = self.mForPruning)

        return TreeClassifier(baseClassifier=tree)

    def instance(self):
        """
        DEPRECATED. Return a base learner - an object
        of :class:`TreeLearnerBase`.
        This method is left for backwards compatibility.
        """
        return self._base_learner()

    def _build_split(self, learner):

        if hasattr(self, "split"):
            learner.split = self.split
        else:
            learner.split = SplitConstructor_Combined()
            learner.split.continuousSplitConstructor = \
                SplitConstructor_Threshold()
            binarization = getattr(self, "binarization", 0)
            if binarization == 1:
                learner.split.discreteSplitConstructor = \
                    SplitConstructor_ExhaustiveBinary()
            elif binarization == 2:
                learner.split.discreteSplitConstructor = \
                    SplitConstructor_OneAgainstOthers()
            else:
                learner.split.discreteSplitConstructor = \
                    SplitConstructor_Feature()

            measures = {"infoGain": Orange.feature.scoring.InfoGain,
                "gainRatio": Orange.feature.scoring.GainRatio,
                "gini": Orange.feature.scoring.Gini,
                "relief": Orange.feature.scoring.Relief,
                "retis": Orange.feature.scoring.MSE
                }

            measure = getattr(self, "measure", None)
            if isinstance(measure, str):
                measure = measures[measure]()
            if not measure:
                measure = Orange.feature.scoring.GainRatio()

            measureIsRelief = isinstance(measure, Orange.feature.scoring.Relief)
            relM = getattr(self, "reliefM", None)
            if relM and measureIsRelief:
                measure.m = relM

            relK = getattr(self, "reliefK", None)
            if relK and measureIsRelief:
                measure.k = relK

            learner.split.continuousSplitConstructor.measure = measure
            learner.split.discreteSplitConstructor.measure = measure

            wa = getattr(self, "worstAcceptable", 0)
            if wa:
                learner.split.continuousSplitConstructor.worstAcceptable = wa
                learner.split.discreteSplitConstructor.worstAcceptable = wa

            ms = getattr(self, "minSubset", 0)
            if ms:
                learner.split.continuousSplitConstructor.minSubset = ms
                learner.split.discreteSplitConstructor.minSubset = ms

    def _base_learner(self):
        learner = TreeLearnerBase()

        self._build_split(learner)

        if hasattr(self, "stop"):
            learner.stop = self.stop
        else:
            learner.stop = Orange.classification.tree.StopCriteria_common()
            mm = getattr(self, "maxMajority", 1.0)
            if mm < 1.0:
                learner.stop.maxMajority = self.maxMajority
            me = getattr(self, "minExamples", 0)
            if me:
                learner.stop.minExamples = self.minExamples

        for a in ["storeDistributions", "storeContingencies", "storeExamples",
            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
            if hasattr(self, a):
                setattr(learner, a, getattr(self, a))

        return learner

# The functions below implement the textual output (printing) of trees.


fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
""" Defines the multiplier by 100 (:samp:`^`) and the format
for the number of decimals (e.g. :samp:`5.3`). The corresponding
groups are named :samp:`m100` and :samp:`fs`. """

by = r"(?P<by>(b(P|A)))?"
""" Defines bP or bA or nothing; the result is in the group
:samp:`by`. """

bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
re_V = re.compile("%V")
re_N = re.compile("%"+fs+"N"+by)
re_M = re.compile("%"+fs+"M"+by)
re_m = re.compile("%"+fs+"m"+by)
re_Ccont = re.compile("%"+fs+"C"+by+opc)
re_Cdisc = re.compile("%"+fs+"C"+by+opd)
re_ccont = re.compile("%"+fs+"c"+by+opc)
re_cdisc = re.compile("%"+fs+"c"+by+opd)
re_Cconti = re.compile("%"+fs+"C"+by+fromto)
re_cconti = re.compile("%"+fs+"c"+by+fromto)
re_D = re.compile("%"+fs+"D"+by)
re_d = re.compile("%"+fs+"d"+by)
re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
re_I = re.compile("%"+fs+"I"+intrvl)

def insertStr(s, mo, sub):
    """ Replace the part of s which is covered by mo
    with the string sub. """
    return s[:mo.start()] + sub + s[mo.end():]


def insertDot(s, mo):
    """ Replace the part of s which is covered by mo
    with a dot. You should use this when the
    function cannot compute the desired quantity; it is called, for instance,
    when it needs to divide by something in the parent, but the parent
    doesn't exist.
    """
    return s[:mo.start()] + "." + s[mo.end():]

def insertNum(s, mo, n):
    """ Replace the part of s matched by mo with the number n,
    formatted as specified by the user, that is, it multiplies
    it by 100, if needed, and prints it with the right number of
    places and decimals. It does so by checking the mo
    for a group named m100 (representing the :samp:`^` in the format string)
    and a group named fs representing the part giving the number
    of decimals (e.g. :samp:`5.3`).
    """
    grps = mo.groupdict()
    m100 = grps.get("m100", None)
    if m100:
        n *= 100
    fs = grps.get("fs") or (m100 and ".0" or "5.3")
    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]

def byWhom(by, parent, tree):
    """ If by equals bP, it returns parent, else it returns
    :samp:`tree.tree`. This is used to find what to divide the quantity
    with, when division is required.
    """
    if by=="bP":
        return parent
    else:
        return tree.tree

def replaceV(strg, mo, node, parent, tree):
    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

def replaceN(strg, mo, node, parent, tree):
    by = mo.group("by")
    N = node.distribution.abs
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution.abs > 1e-30:
                N /= whom.distribution.abs
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def replaceM(strg, mo, node, parent, tree):
    by = mo.group("by")
    maj = int(node.nodeClassifier.defaultValue)
    N = node.distribution[maj]
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[maj] > 1e-30:
                N /= whom.distribution[maj]
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def replacem(strg, mo, node, parent, tree):
    by = mo.group("by")
    maj = int(node.nodeClassifier.defaultValue)
    if node.distribution.abs > 1e-30:
        N = node.distribution[maj] / node.distribution.abs
        if by:
            # find the node to divide by (the parent for bP, the root for bA)
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
                byN = whom.distribution[maj] / whom.distribution.abs
                if byN > 1e-30:
                    N /= byN
            else:
                return insertDot(strg, mo)
    else:
        N = 0.
    return insertNum(strg, mo, N)

def replaceCdisc(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    by, op, cls = mo.group("by", "op", "cls")
    N = node.distribution[cls]
    if op == "!=":
        N = node.distribution.abs - N
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[cls] > 1e-30:
                N /= whom.distribution[cls]
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def replacecdisc(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    op, by, cls = mo.group("op", "by", "cls")
    N = node.distribution[cls]
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
        if op == "!=":
            N = 1 - N
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[cls] > 1e-30:
                N /= whom.distribution[cls] / whom.distribution.abs
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}

def replaceCcont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


2578def replaceccont(strg, mo, node, parent, tree):
2579    if tree.classVar.varType != Orange.data.Type.Continuous:
2580        return insertDot(strg, mo)
2581   
2582    by, op, num = mo.group("by", "op", "num")
2583    op = __opdict[op]
2584    num = float(num)
2585    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2586    if node.distribution.abs > 1e-30:
2587        N /= node.distribution.abs
2588    if by:
2589        whom = byWhom(by, parent, tree)
2590        if whom and whom.distribution:
2591            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2592            if byN > 1e-30:
2593                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
2594        else:
2595            return insertDot(strg, mo)
2596    return insertNum(strg, mo, N)
2597
2598
def extractInterval(mo, dist):
    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
    lower, upper = float(lower), float(upper)
    if out:
        lop = lowin == "(" and operator.le or operator.lt
        hop = upin == ")" and operator.ge or operator.gt
        return filter(lambda x: lop(x[0], lower) or hop(x[0], upper), dist.items())
    else:
        lop = lowin == "(" and operator.gt or operator.ge
        hop = upin == ")" and operator.lt or operator.le
        return filter(lambda x: lop(x[0], lower) and hop(x[0], upper), dist.items())


def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN/whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N*mul) for N in dist]) + "]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs # abs >= d, so d > 1e-30 implies abs > 1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N*mul) for N in dist]) + "]")


def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)


Z = {0.75: 1.15, 0.80: 1.28, 0.85: 1.44, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}
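# Two-sided critical values of the standard normal distribution for the
# supported confidence levels; e.g. P(|X| <= 1.96) is approximately 0.95.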

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
    mul = mo.group("m100") and 100 or 1

    if intrvl not in Z:
        raise SystemError, "Cannot compute %5.3f confidence intervals" % intrvl

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))


# This class is more a collection of functions, merged into a class so
# that they do not need to pass around too many arguments. It is
# constructed, used and discarded; it is not meant to store any state.
class _TreeDumper:
    # (regex, replacer) pairs; each regex recognizes one %-escape in the
    # format string and the replacer substitutes the computed value.
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI)]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
        maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr


    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    # resume the search one character past the match start
                    # to avoid re-matching at the same position
                    strt = mo.start()+1

        return strg


    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes


    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n" + ("|    "*lev) + "    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)


    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)


    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName + chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                 self.formatString(self.leafStr, node, parent)))


    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")


def _quoteName(x):
    # Node names are base64-encoded and quoted so that arbitrary
    # characters in attribute names cannot break the dot syntax.
    return '"%s"' % (base64.b64encode(x))

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier:
            baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
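
        A short usage sketch (``tree`` is assumed to be a trained
        :obj:`TreeClassifier` and ``instance`` an instance from the
        same domain)::

            value, probabilities = tree(instance,
                Orange.classification.Classifier.GetBoth)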
        """
        return self.nativeClassifier(instance, result_type, *args, **kwdargs)

    def __setattr__(self, name, value):
        if name == "nativeClassifier":
            self.__dict__[name] = value
            return
        if name in self.nativeClassifier.__dict__:
            self.nativeClassifier.__dict__[name] = value
        self.__dict__[name] = value

    def dump(self, leafStr="", nodeStr="", **argkw):
        """
        Return a string representation of the tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" is used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches that end in a
          single node are printed before the branches with larger subtrees.
          If False, the branches are printed in order of appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of regular expressions and callback
          functions through which the user can print out additional
          information in the nodes.
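
        A minimal usage sketch, assuming the :obj:`TreeLearner` from this
        module and the Iris data set shipped with Orange; the format
        string is just an example (``%V`` prints the majority class,
        ``%N`` the number of examples in the node)::

            import Orange
            data = Orange.data.Table("iris")
            tree = Orange.classification.tree.TreeLearner(data)
            print tree.dump(leafStr="%V (%N)")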
        """
        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
            self).dumpTree()

    def dot(self, fileName, leafStr="", nodeStr="", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """Print the tree to a file in a format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        Uses the same parameters as :meth:`dump` plus two
        parameters which define the shapes used for internal
        nodes and leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
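
        A short usage sketch, assuming ``tree`` is a trained
        :obj:`TreeClassifier`; the file name and shapes are arbitrary::

            tree.dot("tree.dot", nodeShape="box", leafShape="ellipse")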
        """
        fle = type(fileName) == str and file(fileName, "wt") or fileName

        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()

    def count_nodes(self):
        """
        Return the number of nodes of the tree.
        """
        return _countNodes(self.tree if isinstance(self, _TreeClassifier) or \
            isinstance(self, TreeClassifier) else self)

    def count_leaves(self):
        """
        Return the number of leaves in the tree.
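
        A usage sketch (``tree`` is assumed to be a trained
        :obj:`TreeClassifier`)::

            print tree.count_nodes(), tree.count_leaves()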
        """
        return _countLeaves(self.tree if isinstance(self, _TreeClassifier) or \
            isinstance(self, TreeClassifier) else self)

dumpTree = TreeClassifier.dump
""" An alias for :obj:`TreeClassifier.dump`. """

def printTree(*a, **aa):
    """
    Print out the tree (call :func:`dumpTree` with the same
    arguments and print out the result).
    """
    print dumpTree(*a, **aa)

printTxt = printTree
""" An alias for :func:`printTree`. Left for compatibility. """

printDot = TreeClassifier.dot
""" An alias for :obj:`TreeClassifier.dot`. """

dotTree = printDot
""" An alias for :func:`printDot`. Left for compatibility. """

countNodes = TreeClassifier.count_nodes
countLeaves = TreeClassifier.count_leaves

def _countNodes(node):
    count = 0
    if node:
        count += 1
        if node.branches:
            for branch in node.branches:
                count += _countNodes(branch)
    return count

def _countLeaves(node):
    count = 0
    if node:
        if node.branches: # internal node
            for branch in node.branches:
                count += _countLeaves(branch)
        else:
            count += 1
    return count
