source: orange/orange/Orange/classification/tree.py @ 7738:669f73e64a0c

1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
12To build a small tree (:obj:`TreeClassifier`) from the iris data set
13(with the depth limited to three levels), use
14(part of `orngTree1.py`_, uses `iris.tab`_):
15
16.. literalinclude:: code/orngTree1.py
17   :lines: 1-4
18
19.. _orngTree1.py: code/orngTree1.py
20
21
22This page first describes the learner and the classifier, and then
23defines the base classes (individual components) of the trees
24and the tree-building process.
25
26.. autoclass:: TreeLearner
27    :members:
28
29.. autoclass:: TreeClassifier
30    :members:
31
32========
33Examples
34========
35
36For example, here's how to write your own stop
37function. The example constructs
38and prints two trees. For the first one we define the *defStop*
39function, which is used by default, and combine it with a random function
40so that the stop criteria will also be met in 20% of the cases
41when *defStop* is false. For the second tree the stopping criteria
42is random. Note that in
43the second case lambda function still has three parameters, since this is
44a necessary number of parameters for the stop function (:obj:`StopCriteria`).
45Part of `tree3.py`_ (uses  `iris.tab`_):
46
47.. _tree3.py: code/tree3.py
48
49.. literalinclude:: code/tree3.py
50   :lines: 8-23
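
The excerpt is roughly along these lines. This is only a sketch: it
assumes that the learner accepts keyword arguments for its slots and
that a plain Python function can serve as the stop criterion, as the
:obj:`StopCriteria` description below suggests::

    import random
    import Orange

    data = Orange.data.Table("iris")

    # the default criterion, combined with a random one that also
    # stops in about 20% of the cases where defStop says "continue"
    defStop = Orange.classification.tree.StopCriteria()
    f = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or random.random() < 0.2
    tree = Orange.classification.tree.TreeLearner(data, stop=f)

    # a purely random criterion; the lambda still takes three
    # parameters, as required of a stop function
    fr = lambda examples, weightID, contingency: random.random() < 0.2
    treeRandom = Orange.classification.tree.TreeLearner(data, stop=fr)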

The output is not shown here since the resulting trees are rather
big.

Tree Structure
==============

To have something to work on, we'll take the data from the lenses
dataset and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 7-10
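
In outline, the excerpt does something like this (a sketch)::

    import Orange

    data = Orange.data.Table("lenses")
    treeClassifier = Orange.classification.tree.TreeLearner(data)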

How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?

.. _lenses.tab: code/lenses.tab
.. _treestructure.py: code/treestructure.py

.. literalinclude:: code/treestructure.py
   :lines: 12-21

If the node is None, we have a null node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (this node) plus the
sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.
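
The excerpt implements something along these lines (a sketch that uses
only the :obj:`Node` attributes documented below)::

    def treeSize(node):
        # null nodes don't count
        if not node:
            return 0
        size = 1
        # an internal node has a branchSelector; a leaf does not
        if node.branchSelector:
            for branch in node.branches:
                size += treeSize(branch)
        return size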

    >>> treeSize(treeClassifier.tree)
    10

Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.

Let us now write a script that prints out a tree. The recursive
part of the function will get a node and its level
(part of `treestructure.py`_, uses `lenses.tab`_).

.. literalinclude:: code/treestructure.py
   :lines: 26-41

Don't waste time studying the formatting tricks (\n's etc.); they are
just for nicer output. What matters is everything but the print statements.
First, we check whether the node is a null node (a node to which no
learning examples were classified). If so, we just print out
"<null node>" and return.

After handling null nodes, the remaining nodes are internal nodes and
leaves. For internal nodes, we print a node description consisting of the
attribute's name and the distribution of classes. :obj:`Node`'s branch
description is, for all currently defined splits, an instance of a
class derived from :obj:`orange.Classifier` (in fact, it is an
:obj:`orange.ClassifierFromVarFD`, but an :obj:`orange.Classifier` would
suffice), and its :obj:`classVar` points to the attribute we seek.
So we print its name. We will also assume that storing class distributions
has not been disabled and print them as well.
Then we iterate
through the branches; for each we print a branch description and recursively
call :obj:`printTree0` with the level increased by 1 (to increase
the indent).

Finally, if the node is a leaf, we print out the distribution of
learning examples in the node and the class to which the examples in
the node would be classified. We again assume that the :obj:`nodeClassifier`
is the default one - a :obj:`DefaultClassifier`. A better print
function should be aware of possible alternatives.
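
In outline (a sketch; the actual excerpt may differ in formatting
details)::

    def printTree0(node, level):
        if not node:
            print " " * level + "<null node>"
            return
        if node.branchSelector:
            # internal node: the attribute's name and the class distribution
            print "\n" + " " * level + "%s (%s)" % (
                node.branchSelector.classVar.name, node.distribution),
            for i in range(len(node.branches)):
                print "\n" + " " * level + ": %s" % node.branchDescriptions[i],
                printTree0(node.branches[i], level + 1)
        else:
            # leaf: the distribution and the predicted class
            print "--> %s (%s)" % (node.nodeClassifier.defaultValue,
                                   node.distribution),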

Now we just need a simple function that calls our :obj:`printTree0`.
We could write something like...

::

    def printTree(x):
        printTree0(x.tree, 0)

... but we won't. Let us learn how to handle arguments of different
types. Let's write a function that will accept either a
:obj:`TreeClassifier` or a :obj:`Node`.
Part of `treestructure.py`_, uses `lenses.tab`_.

.. literalinclude:: code/treestructure.py
   :lines: 43-49
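
In outline (again, just a sketch)::

    def printTree(x):
        if isinstance(x, Orange.classification.tree.TreeClassifier):
            printTree0(x.tree, 0)
        elif isinstance(x, Orange.classification.tree.Node):
            printTree0(x, 0)
        else:
            raise TypeError, "invalid parameter"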

It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's
a :obj:`Node` we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. The output::

    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> none (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

For a final exercise, let us write a simple pruning function. It will
be written entirely in Python, unrelated to any :obj:`Pruner`. It
will limit the maximal tree depth (the number of internal nodes on any
path down the tree) given as an argument.
For example, to get a two-level tree, we would
call :samp:`cutTree(root, 2)`. The function will be recursive, with the
second argument (level) decreasing at each call; when it reaches zero,
the current node will be made a leaf (part of `treestructure.py`_, uses
`lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 54-62

There's nothing to prune at null nodes or leaves, so we act only when
:obj:`node` and :obj:`node.branchSelector` are defined. If the level is
not zero, we call the function for each branch. Otherwise, we clear
the selector, branches and branch descriptions.
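
A sketch of the excerpt::

    def cutTree(node, level):
        # nothing to prune at null nodes or leaves
        if node and node.branchSelector:
            if level:
                for branch in node.branches:
                    cutTree(branch, level - 1)
            else:
                # turn the node into a leaf
                node.branchSelector = None
                node.branches = None
                node.branchDescriptions = None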

    >>> cutTree(treeClassifier.tree, 2)
    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no --> soft (<1.000, 5.000, 0.000>)
       : yes --> hard (<2.000, 0.000, 4.000>)

Learning
========

You could just call :class:`TreeLearner` and let it fill the empty
slots with the default
components. This section will teach you three things: what the
missing components are (and how to set the same components yourself),
how to use alternative components to get a different tree and,
finally, how to write a skeleton for tree induction in Python.

.. _treelearner.py: code/treelearner.py

Let us construct a :obj:`TreeLearner` to play with (`treelearner.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treelearner.py
   :lines: 7-10

There are three crucial components in learning: the :obj:`~TreeLearner.split` and
:obj:`~TreeLearner.stop`
criteria, and the example :obj:`~TreeLearner.splitter` (there are some others,
which become important during classification; we'll talk about them
later). They are not defined; if you use the learner, the slots are
filled temporarily but cleared again afterwards.

::

    >>> print learner.split
    None
    >>> learner(data)
    <TreeClassifier instance at 0x01F08760>
    >>> print learner.split
    None

Stopping criteria
=================

The stop is trivial. The default is set by

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

We can now examine the default stopping parameters.

    >>> print learner.stop.maxMajority, learner.stop.minExamples
    1.0 0.0

Not very restrictive. This keeps splitting the examples until
there's nothing left to split or all the examples are in the same
class. Let us set the minimal subset that we allow to be split to
five examples and see what comes out.

    >>> learner.stop.minExamples = 5.0
    >>> tree = learner(data)
    >>> print tree.dump()
    tear_rate=reduced: none (100.00%)
    tear_rate=normal
    |    astigmatic=no
    |    |    age=pre-presbyopic: soft (100.00%)
    |    |    age=presbyopic: none (50.00%)
    |    |    age=young: soft (100.00%)
    |    astigmatic=yes
    |    |    prescription=hypermetrope: none (66.67%)
    |    |    prescription=myope: hard (100.00%)

OK, that's better. If we want an even smaller tree, we can also limit
the maximal proportion of the majority class.

    >>> learner.stop.maxMajority = 0.5
    >>> tree = learner(data)
    >>> print tree.dump()
    none (62.50%)


=================================
Learner and Classifier Components
=================================

Classification trees are represented as a tree-like hierarchy of
:obj:`Node` classes.

Several classes described here are already functional and can
(and mostly will) be used as they are: :obj:`Node`,
:obj:`_TreeLearner` and :obj:`TreeClassifier`.
The classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter` and :obj:`Descender`
are among the Orange (C++ implemented)
classes that can be subtyped
in Python. You can thus program your own components based on
these classes.

.. class:: Node

    Node stores information about the learning examples belonging
    to the node, a branch selector, a list of branches (if the node is
    not a leaf) with their descriptions and strengths, and a classifier.

    .. attribute:: distribution

        Stores a distribution for the learning examples belonging to the
        node. Storing distributions can be disabled by setting the
        :obj:`_TreeLearner`'s :obj:`storeDistributions` flag to false.

    .. attribute:: contingency

        Stores complete contingency matrices for the learning examples
        belonging to the node. Storing contingencies can be enabled by
        setting the :obj:`_TreeLearner`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies get computed and stored in the
        :obj:`Node`, but are removed shortly afterwards.
        The details are given in the
        description of the :obj:`_TreeLearner` object.

    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the
        corresponding ID of the weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.

    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if the
        :obj:`_TreeLearner`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by the
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; the space consumption of the
        default :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.

    If the node is a leaf, the remaining fields are None.
    If it's an internal node, there are several additional fields.

    .. attribute:: branches

        Stores a list of subtrees, given as :obj:`Node` objects.
        An element can be None; in this case the node is empty.

    .. attribute:: branchDescriptions

        A list with string descriptions for branches, constructed by
        the :obj:`SplitConstructor`. It can contain different kinds
        of descriptions, but basically, expect things like 'red' or '>12.3'.

    .. attribute:: branchSizes

        Gives a (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.

    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in the range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified into any branch, the selector
        can return an :obj:`Orange.data.Value` containing a special value
        (sVal) which should be a discrete distribution
        (DiscDistribution). This should represent the
        :obj:`branchSelector`'s opinion of how to divide the
        example between the branches. Whether the proposition will be
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).

    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none if it is a leaf.

    .. method:: treeSize()

        Return the number of nodes in the subtrees (including the
        node, excluding null-nodes).

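    As a quick illustration of these fields, a hedged sketch that counts
    the leaves of a tree, using only the attributes documented above::

        def countLeaves(node):
            # null nodes are neither leaves nor internal nodes
            if not node:
                return 0
            # a node without a branchSelector is a leaf
            if not node.branchSelector:
                return 1
            return sum(countLeaves(branch) for branch in node.branches)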

.. class:: _TreeClassifier

    Classifies examples according to a tree stored in :obj:`tree`.
    Not meant to be used directly. The :class:`TreeLearner` class
    constructs a :class:`TreeClassifier`.

    .. attribute:: tree

        The root of the tree, represented as a :class:`Node`.

    Classification would be straightforward if there were no unknown
    values or, in general, examples that cannot be placed into a
    single branch. The response in such cases is determined by the
    component :obj:`descender`.

    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.

    In general there are three possible outcomes of a descent.

    #. The descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them over by
       selecting a single branch and continuing the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node. Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` wants to split the example (i.e., to decide the
       class by voting).

    In the last case, the descender returns a :obj:`Node` and the
    vote-weights for the branches.
    The weights can correspond to the distribution returned by
    the :obj:`branchSelector`, to the number of learning examples that
    were assigned to each branch, or to something else.

    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.

    When voting, subtrees do not predict the class but the probabilities
    of classes. The predictions are multiplied by weights, summed and
    the most probable class is returned.

    .. method:: vote()

        Gets a
        :obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
        vote weights. For each branch, it calls
        :obj:`classDistribution` and then multiplies
        and sums the distributions. :obj:`vote` returns a normalized
        distribution of predictions.

    .. method:: classDistribution()

        Gets
        an additional parameter, a :obj:`Node` (defaulting to the tree
        root). :obj:`classDistribution` uses the :obj:`descender`. If the
        descender reaches a leaf, it calls
        :obj:`nodeClassifier`, otherwise it calls :obj:`vote`.

        Thus, :obj:`vote` and
        :obj:`classDistribution` are written in the form of a double
        recursion. The recursive calls do not happen at each node of the
        tree but only at nodes where a vote is needed (that is, at nodes
        where the descender halts).

    .. method:: __call__

        Calls the
        descender. If it reaches a leaf, the class is predicted by the
        leaf's :obj:`nodeClassifier`. Otherwise, it calls
        :obj:`vote`. From then on, :obj:`vote` and
        :obj:`classDistribution` interweave down the tree and return
        a distribution of predictions. This method simply
        chooses the most probable class.

.. class:: _TreeLearner

    The main learning object is :obj:`_TreeLearner`. It is basically
    a skeleton into which the user must plug the components for particular
    functions. This class is not meant to be used directly. You should
    rather use :class:`TreeLearner`.

    Components that govern the structure of the tree are :obj:`split`
    (of type :obj:`SplitConstructor`), :obj:`stop` (of
    type :obj:`StopCriteria`) and :obj:`exampleSplitter`
    (of type :obj:`ExampleSplitter`).

    .. attribute:: split

        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`_TreeLearner`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous
        attributes are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        and five examples in a leaf for continuous attributes.

    .. attribute:: stop

        Object of type :obj:`StopCriteria`. The default stopping
        criterion stops induction when all examples in a node belong
        to the same class.

    .. attribute:: splitter

        Object of type :obj:`ExampleSplitter`. The default splitter
        is :obj:`ExampleSplitter_UnknownsAsSelector`, which splits
        the learning examples according to distributions given by the
        selector.

    .. attribute:: contingencyComputer

        By default, this slot is left empty and ordinary contingency
        matrices are computed for examples at each node. If the need
        arises, one can change the way the matrices are computed. This
        can be used to change the way that unknown values are treated
        when assessing the qualities of attributes. As mentioned earlier,
        the computed matrices can be used by the split constructor and
        by the stopping criteria. On the other hand, they can be (and
        are) ignored by some split constructors.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_MergeAsSelector`, which votes using
        the :obj:`branchSelector`'s distribution for vote weights.

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent any infinite tree
        induction due to mis-set stopping criteria. If you are sure you
        need larger trees, increase it. If you, on the other hand, want
        to lower this hard limit, you can do so as well.

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decide whether to store class distributions, contingencies
        and examples in the :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same object
        that is stored in :obj:`contingency.classes`.

    The :obj:`_TreeLearner` first sets the defaults for missing
    components. Although stored in the actual :obj:`_TreeLearner`'s
    fields, they are removed when the induction is finished.

    Then it ensures that the examples are stored in a table. This is
    needed because the algorithm juggles with pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to ensure that the pointers
    keep pointing to valid examples even if the user later changes the
    example table. If they are in a table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.

    Apriori class probabilities are computed. At this point we check
    the sum of example weights; if it's zero, there are no examples and
    we cannot proceed. A list of candidate attributes is set; in the
    beginning, all attributes are candidates for the split criterion.

    Now comes the recursive part of the :obj:`_TreeLearner`. Its arguments
    are a set of examples, a weight meta-attribute ID (a tricky thing,
    it can always be the same as the original or can change to
    accommodate splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).

    The contingency matrix is computed next. This happens
    even if the flag :obj:`storeContingencies` is false.
    If the :obj:`contingencyComputer` is given we use it,
    otherwise we construct just an ordinary contingency matrix.

    The :obj:`stop` is called to see whether it's worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if the
    :obj:`forceNodeClassifier` flag is set.

    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
    a choice of :obj:`contingencyComputer` will, in many cases,
    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
    return no classifier; if so and if the classifier would be
    needed for classification, the :obj:`TreeClassifier`'s function
    returns DK or an empty distribution. If you're writing your own
    tree classifier - pay attention.

    If the induction is to continue, the :obj:`split` component is called.
    If it fails to return a branch selector, induction stops and the
    :obj:`Node` is returned.

    The :obj:`_TreeLearner` then uses the :obj:`ExampleSplitter` to divide
    the examples as described above.

    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they wish.

    The :obj:`_TreeLearner` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    the corresponding weight is used for each branch. Besides, the
    attribute spent by the split constructor (if any) is removed from
    the list of candidates for the subtree.

    A subset of examples is stored in its corresponding tree node,
    if so requested. If not, the new weight attributes are removed (if
    any were created).
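
    The procedure can be summarized in a Python skeleton. This is only a
    sketch: the component call signatures follow the descriptions given
    elsewhere on this page, :obj:`PyNode` is a hypothetical stand-in for
    :obj:`Node`, and contingencies, the apriori distribution and the
    depth limit are omitted::

        class PyNode(object):
            # hypothetical stand-in for Node with just the fields we need
            def __init__(self):
                self.branchSelector = None
                self.branches = []
                self.branchDescriptions = []
                self.nodeClassifier = None

        def induce(examples, weightID, candidates,
                   split, stop, splitter, nodeLearner):
            node = PyNode()
            node.nodeClassifier = nodeLearner(examples, weightID)
            # stop criterion: return a leaf
            if stop(examples, weightID):
                return node
            # the split constructor can veto further induction;
            # the apriori distribution is omitted (None) in this sketch
            selector, descriptions, sizes, quality, spent = \
                split(examples, weightID, None, candidates)
            if not selector:
                return node
            node.branchSelector = selector
            node.branchDescriptions = descriptions
            # a completely spent attribute is removed from the candidates
            if spent >= 0:
                candidates = candidates[:spent] + [False] + candidates[spent + 1:]
            # divide the examples and recurse into the non-empty subsets
            subsets, weights = splitter(node, examples, weightID)
            if not weights:
                weights = [weightID] * len(subsets)
            for subset, wID in zip(subsets, weights):
                node.branches.append(
                    induce(subset, wID, candidates,
                           split, stop, splitter, nodeLearner)
                    if subset else None)
            return node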


Split constructors
==================

Split construction is almost as exciting as waiting for a delayed flight.
Boring, that is. Split constructors juggle
with contingency matrices, with separate cases for discrete
and continuous classes... Most split constructors work either for
discrete or for continuous attributes. We suggest using
a :obj:`SplitConstructor_Combined` that delegates
attributes to specialized split constructors.

Split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)

The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).

.. class:: SplitConstructor

    Finds a suitable criterion for dividing the learning (and later
    testing) examples coming to the node. The data it gets is a set of
    examples (and, optionally, an ID of the weight meta-attribute), a
    domain contingency computed from the examples, apriori class
    probabilities, a list of candidate attributes it should consider and
    a node classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).

    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold: one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain
    contingency does not suffice; for example, when ReliefF is used as a
    measure of the quality of attributes. In this case, there's no other
    way but to use the examples and ignore the precomputed contingencies.

    The :obj:`SplitConstructor` returns most of the data we talked
    about when describing the :obj:`Node`. It returns a classifier
    to be used as the :obj:`Node`'s :obj:`branchSelector`, a list of
    branch descriptions and a list with the number of examples that
    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`_TreeLearner` will find the numbers itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.

    In addition, it returns the quality of the split; a number without
    any fixed meaning except that higher numbers mean better splits.

    If the constructed splitting criterion uses an attribute in such
    a way that the attribute is 'completely spent' and should not be
    considered as a split criterion in any of the subtrees (the
    typical case of this are discrete attributes that are used
    as they are, that is, without any binarization or subsetting),
    then it should report the index of this attribute. Some splits
    do not spend any attribute; this is indicated by returning a
    negative index.

    A :obj:`SplitConstructor` can veto the further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.

    .. attribute:: minSubset

        Sets the minimal number of examples in non-null leaves. As
        always in Orange (where not specified otherwise), "number of
        examples" refers to the weighted number of examples.

    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])

        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`selector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.

        :param examples: Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
            list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori_distribution: Optional; should be of type
            :obj:`orange.Distribution`.
        :param candidates: The split constructor should consider only
            the attributes in the candidate list (a Python list with one
            boolean for each attribute).
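
    A hedged usage sketch, assuming the optional arguments may be
    omitted as the signature above suggests::

        import Orange

        data = Orange.data.Table("lenses")
        split = Orange.classification.tree.SplitConstructor_Attribute(
            measure=Orange.feature.scoring.GainRatio())
        # returns the five-tuple described above
        selector, descriptions, sizes, quality, spent = split(data, 0)
        print selector.classVar.name, descriptions, quality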


.. class:: SplitConstructor_Measure

    Bases: :class:`SplitConstructor`

    An abstract base class for split constructors that employ
    a :class:`Orange.feature.scoring.Measure` to assess the quality of a
    split. All split constructors except for :obj:`SplitConstructor_Combined`
    are derived from this class.

    .. attribute:: measure

        A component of type :class:`Orange.feature.scoring.Measure` used for
        split evaluation. You must select a
        :class:`Orange.feature.scoring.Measure` capable of handling your
        class type
        - for example, you cannot use :class:`Orange.feature.scoring.GainRatio`
        for building regression trees or :class:`Orange.feature.scoring.MSE`
        for classification trees.

    .. attribute:: worstAcceptable

        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
        :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Attribute

    Bases: :class:`SplitConstructor_Measure`

    Attempts to use a discrete attribute as a split; each value of the
    attribute corresponds to a branch in the tree. Attributes are
    evaluated with the :obj:`measure` and the one with the
    highest score is used for the split. If there is more than one
    attribute with the highest score, one of them is selected at random.

    The constructed :obj:`branchSelector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. If the attribute is
    :obj:`Orange.data.variable.Discrete`,
    the :obj:`branchDescriptions` are the attribute's values. The
    attribute is marked as spent, so that it cannot reappear in the
    node's subtrees.

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Measure`

    Works on discrete attributes. For each attribute, it determines
    which binarization of the attribute gives the split with the
    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
    it returns one of those with the highest score.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. This time, however, its :obj:`transformer`
    contains an instance of :obj:`MapIntValue` that maps the values
    of the attribute into a binary attribute. Branch descriptions are
    of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding
    to more than one value of the attribute. Branches that correspond to
    a single value of the attribute are described with this value. If
    the attribute was originally binary, it is spent and cannot be
    used in the node's subtrees. Otherwise, it can reappear in the
    subtrees.


.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Measure`

    This is currently the only constructor for splits with continuous
    attributes. It divides the range of attribute values with a threshold
    that maximizes the split's quality. As always, if there is more than
    one split with the highest score, a random threshold is selected.
    The attribute that yields the highest-scoring binary split is returned.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached
    :obj:`transformer`. This time, :obj:`transformer` is of type
    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
    "<threshold" and ">=threshold". The attribute is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Measure`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    This constructor delegates the task of finding the optimal split
    to separate split constructors for discrete and for continuous
    attributes. Each split constructor is called, given only attributes
    of the appropriate type as candidates. Both construct a candidate
    for a split; the better of them is selected.

    (Note that there is a problem when more candidates have the same
    score. Suppose there are nine discrete attributes with the highest
    score; the split constructor for discrete attributes will select
    one of them. Now, let us suppose that there is a single continuous
    attribute with the same score. :obj:`SplitConstructor_Combined`
    would randomly select between the proposed discrete attribute and
    the continuous attribute, not aware of the fact that the discrete
    attribute has already competed with eight other discrete attributes.
    So, the probability for selecting (each) discrete attribute would be
    1/18 instead of 1/10. Although not really correct, we doubt that this
    would affect the tree's performance; many other machine learning
    systems simply choose the first attribute with the highest score
    anyway.)

    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
    the attribute is spent are decided by the winning split constructor.

    .. attribute:: discreteSplitConstructor

        Split constructor for discrete attributes; can be, for instance,
        :obj:`SplitConstructor_Attribute` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuousSplitConstructor

        Split constructor for continuous attributes; at the moment, it
        can be either :obj:`SplitConstructor_Threshold` or a
        split constructor you programmed in Python.


StopCriteria and StopCriteria_common
========================================

:obj:`StopCriteria` determines when to stop the induction of subtrees.

.. class:: StopCriteria

    Given a set of examples, a weight ID and contingency matrices,
    decides whether to continue the induction or not. The basic
    criterion checks whether there are any examples and whether they
    belong to at least two different classes (if the class is discrete).
    Derived components check things like the number of examples and the
    proportion of majority classes.

    As opposed to :obj:`SplitConstructor` and similar basic classes,
    :obj:`StopCriteria` is not an abstract but a fully functional
    class that provides the basic stopping criteria. That is, the tree
    induction stops when there is at most one example left; in this case,
    it is not the weighted but the actual number of examples that counts.
    Besides that, the induction stops when all examples are in the same
    class (for discrete problems) or have the same value of the outcome
    (for regression problems).

    .. method:: __call__(examples[, weightID, domain contingencies])

        Decides whether to stop (true) or continue (false) the induction.
        If contingencies are given, they are used for checking whether
        the examples are in the same class (but not for counting the
        examples). If contingencies are not given, :obj:`StopCriteria`
        will work without them. Derived classes should use the
        contingencies whenever they are available, and otherwise compute
        them only when they really need them.



.. class:: StopCriteria_common

    :obj:`StopCriteria_common` contains additional criteria for
    pre-pruning: it checks the proportion of majority class and the
    number of weighted examples.

    .. attribute:: maxMajority

        Maximal proportion of majority class. When this is exceeded,
        induction stops.

    .. attribute:: minExamples

        Minimal number of examples in internal nodes. Subsets with fewer
        than :obj:`minExamples` examples are not split any further.
        The example count is weighted.

.. class:: StopCriteria_Python

    Undocumented.

Example Splitters
=================

Just like the :obj:`Descender` decides the branch for an
example during classification, the :obj:`ExampleSplitter`
sorts the learning examples into branches.

The :obj:`ExampleSplitter` is given a :obj:`Node` (from which
it can use various data, though most splitters only use the
:obj:`branchSelector`), a set of examples to be divided, and
the weight ID. The result is a list of subsets of examples
and, optionally, a list of new weight IDs.

Most
:obj:`ExampleSplitter` classes simply call the node's
:obj:`branchSelector` and assign examples to the corresponding
branches. When the value is unknown they choose a particular
branch or simply skip the example.

Some enhanced splitters can split examples. An example (actually,
a pointer to it) is copied to more than one subset. To facilitate
real splitting, weights are needed. Each branch is assigned a
weight ID (each would usually have its own ID) and all examples
that are in that branch (either completely or partially) should
have this meta attribute. If an example hasn't been split, it
has only one additional attribute - with the weight ID corresponding
to the subset to which it went. An example that is split between,
say, three subsets, has three new meta attributes, one for each
subset. The IDs of the weight meta attributes are returned by the
:obj:`ExampleSplitter` to be used at the induction of the
corresponding subtrees.

Note that weights are used only when needed. When no splitting
occurred - because the splitter is not able to do it or because
there was no need for splitting - no weight IDs are returned.


.. class:: ExampleSplitter

    An abstract base class for objects that split sets of examples into
    subsets. The derived classes differ in how they treat examples which
    cannot be unambiguously placed into a single branch (usually due
    to an unknown value of the crucial attribute).

    .. method:: __call__(node, examples[, weightID])

        Use the information in :obj:`node` (particularly the
        :obj:`branchSelector`) to split the given set of examples into
        subsets. Return a tuple with a list of example generators and a
        list of weights. The list of weights is either an ordinary Python
        list of integers or None when no splitting of examples occurs and
        thus no weights are needed.


.. class:: ExampleSplitter_IgnoreUnknowns

    Bases: :class:`ExampleSplitter`

    Simply ignores the examples for which no single branch can be determined.

.. class:: ExampleSplitter_UnknownsToCommon

    Bases: :class:`ExampleSplitter`

    Places all such examples into the branch with the highest number of
    examples. If there is more than one such branch, one is selected at
    random and then used for all examples.

.. class:: ExampleSplitter_UnknownsToAll

    Bases: :class:`ExampleSplitter`

    Places examples with an unknown value of the attribute into all
    branches.

.. class:: ExampleSplitter_UnknownsToRandom

    Bases: :class:`ExampleSplitter`

    Selects a random branch for such examples.

.. class:: ExampleSplitter_UnknownsToBranch

    Bases: :class:`ExampleSplitter`

    Constructs an additional branch to contain all such examples.
    The branch's description is "unknown".

.. class:: ExampleSplitter_UnknownsAsBranchSizes

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the proportions of examples in each branch.

.. class:: ExampleSplitter_UnknownsAsSelector

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the distribution proposed by the selector (which is in most cases
    the same as the proportions of examples in the branches).
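
For instance, to replace the default splitter (a hedged sketch; it
assumes the learner exposes the :obj:`splitter` slot, as described with
:obj:`_TreeLearner` above)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    # place examples with unknown values into all branches
    learner.splitter = \
        Orange.classification.tree.ExampleSplitter_UnknownsToAll()
    tree = learner(data)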

Descenders
==========

This is a classifier's counterpart of the :class:`ExampleSplitter`. It
decides the destiny of examples that need to be classified and cannot
be unambiguously put into a branch.

.. class:: Descender

    An abstract base object for tree descenders.

    .. method:: __call__(node, example)

        Descends down the tree until it reaches a leaf or a node in
        which a vote of the subtrees is required. In both cases, a tuple
        of two elements is returned; in the former, the tuple contains
        the reached node and None, in the latter it
        contains a node and the weights of votes for the subtrees (a
        list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the returned
        vote weights.

.. class:: Descender_UnknownsToNode

    Bases: :obj:`Descender`

    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`nodeClassifier`
    will be used to make the decision. It is your responsibility to see
    that even the internal nodes have their :obj:`nodeClassifier`
    (i.e., don't disable creating node classifiers or manually remove
    them after the induction, that's all).

.. class:: Descender_UnknownsToBranch

    Bases: :obj:`Descender`

    Classifies examples with an unknown value into a special branch.
    This makes sense only if the tree itself was constructed with
    :obj:`ExampleSplitter_UnknownsToBranch`.

.. class:: Descender_UnknownsToCommonBranch

    Bases: :obj:`Descender`

    Classifies examples with unknown values into the branch with the
    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.

.. class:: Descender_UnknownsToCommonSelector

    Bases: :obj:`Descender`

    Classifies examples with unknown values into the branch which
    received the highest recommendation by the selector.

.. class:: Descender_MergeAsBranchSizes

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the sizes of the branches.

.. class:: Descender_MergeAsSelector

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.
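
Continuing the sketch from the previous section, one could likewise
swap the descender (the :obj:`descender` slot is described with
:obj:`_TreeLearner` above)::

    # make subtrees vote, weighted by branch sizes
    learner.descender = \
        Orange.classification.tree.Descender_MergeAsBranchSizes()
    tree = learner(data)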

Pruning
=======

.. index::
    pair: classification trees; pruning

Tree pruners derived from :obj:`Pruner` can be given either a
:obj:`Node` (presumably, but not necessarily, a root) or a
:obj:`_TreeClassifier`. The result is a new :obj:`Node`
or a :obj:`_TreeClassifier` with a pruned tree. The original
tree remains intact.

The pruners construct only a shallow copy of the tree.
The nodes of the pruned tree contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. Thus, you may modify the pruned tree's structure
(manually cut it, add new nodes, replace components), but modifying,
for instance, some node's :obj:`nodeClassifier` (the
:obj:`nodeClassifier` itself, not a reference to it!) would modify
the node's :obj:`nodeClassifier` in the corresponding node of
the original tree.

Pruners cannot construct a
:obj:`nodeClassifier` nor merge the :obj:`nodeClassifier`'s of the
pruned subtrees into classifiers for the new leaves. Thus, if you want
to build a prunable tree, the internal nodes must have their
:obj:`nodeClassifier` defined. Fortunately, this is the default.

.. class:: Pruner

    This is an abstract base class which defines nothing useful, only
    a pure virtual call operator.

    .. method:: __call__(tree)

        Prunes a tree. The argument can be either a tree classifier or
        a tree node; the result is of the same type as the argument.

.. class:: Pruner_SameMajority

    Bases: :class:`Pruner`

    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
    with more than one leaf) in which all the leaves have the same
    majority class. (This is allowed because those leaves can still have
    different distributions of classes and thus predict different
    probabilities.) However, this can be undesired when we're only
    interested in the class prediction or a simple tree interpretation.
    The :obj:`Pruner_SameMajority` prunes the tree so that there is no
    subtree in which all the nodes would have the same majority class.

    This pruner will only prune the nodes in which the node classifier
    is of class :obj:`orange.DefaultClassifier` (or a derived class).

    Note that the leaves with more than one majority class require some
    special handling. The pruning goes backwards, from the leaves to the
    root. When siblings are compared, the algorithm checks whether they
    have (at least one) common majority class. If so, they can be pruned.

.. class:: Pruner_m

    Bases: :class:`Pruner`

    Prunes a tree by comparing m-estimates of static and dynamic
    error as defined in (Bratko, 2002).

    .. attribute:: m

        Parameter m for m-estimation.
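
A hedged usage sketch::

    import Orange

    data = Orange.data.Table("lenses")
    treeClassifier = Orange.classification.tree.TreeLearner(data)
    pruner = Orange.classification.tree.Pruner_m(m=2.0)
    # returns a pruned shallow copy; the original tree remains intact
    prunedRoot = pruner(treeClassifier.tree)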

Printing the tree
=================

The included printing functions can
print out practically anything you'd like to
know, from the number of examples, the proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in the parent node. And even more: you can
define your own callback functions to be used for printing.

Before we go on: you can read all about the function and use it to its
full extent, or you can just call it, giving it the tree as the sole
argument, and it will print out the usual textual representation of the
tree. If you're satisfied with that, you can stop here.

The magic is in the format string. It is a string which is printed
out at every leaf or internal node with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.

**^** at the start tells that the number should be multiplied by 100.
It's useful for printing proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, :samp:`%N` denotes the number of examples in
the node, hence :samp:`%6.2N` would mean output to two decimal digits
and six places altogether. If left out, a default format :samp:`5.3` is
used, unless you multiply the numbers by 100, in which case the default
is :samp:`.0` (no decimals, the number is rounded to the nearest
integer).

**<divisor>** tells what to divide the quantity in that node with.
:samp:`bP` means division by the same quantity in the parent node; for
instance, :samp:`%NbP` gives the number of examples in the node divided
by the number of examples in the parent node. You can also use precision
formatting, e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same
quantity over the entire data set, so :samp:`%NbA` will tell you the
proportion of examples (out of the entire training data set) that fell
into that node. If division is impossible since the parent node does
not exist or some data is missing, a dot is printed out instead of the
quantity.

**<quantity>** is the only required element. It defines what to print.
For instance, :samp:`%N` would print out the number of examples in the node.
Possible quantities are

:samp:`V`
    The value predicted at that node. You cannot define the precision
    or divisor here.

:samp:`N`
    The number of examples in the node.

:samp:`M`
    The number of examples in the majority class (that is, the class
    predicted by the node).

:samp:`m`
    The proportion of examples in the majority class.

:samp:`A`
    The average class for examples in the node; this is available only
    for regression trees.

:samp:`E`
    Standard error for the class of examples in the node; available for
    regression trees.

:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.

:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this will print the number of examples of Iris-virginica divided
    by the number of examples of this class in the parent node. If you
    are interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`

    For regression trees, you can use operators =, !=, <, <=, >, and >=,
    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
    check the number of examples in a certain interval: :samp:`%C[20, 22]`
    will give you the number of examples between 20 and 22 (inclusive)
    and :samp:`%C(20, 22)` will give the number of such examples excluding the
    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
    If you would like the examples outside the interval, add a :samp:`!`,
    like :samp:`%C!(20, 22]`.

:samp:`c`
    Same as above, except that it computes the proportion of the class
    instead of the number of examples.

:samp:`D`
    Prints out the number of examples in each class. You can use both
    the precision (it is applied to each number in the distribution) and
    the divisor. This quantity cannot be computed for regression trees.

:samp:`d`
    Same as above, except that it shows proportions of examples. This
    again doesn't work with regression trees.

<user defined formats>
    You can add more, if you will. Instructions and examples are given at
    the end of this section.
1247
1248
1249Examples
1250--------
1251
1252We shall build a small tree from the iris data set - we shall limit the
1253depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):
1254
1255.. literalinclude:: code/orngTree1.py
1256   :lines: 1-4
1257
1258.. _orngTree1.py: code/orngTree1.py
1259
1260The easiest way to call the function is to pass the tree as the only
1261argument::
1262
1263    >>> print tree.dump()
1264    petal width<0.800: Iris-setosa (100.00%)
1265    petal width>=0.800
1266    |    petal width<1.750
1267    |    |    petal length<5.350: Iris-versicolor (94.23%)
1268    |    |    petal length>=5.350: Iris-virginica (100.00%)
1269    |    petal width>=1.750
1270    |    |    petal length<4.850: Iris-virginica (66.67%)
1271    |    |    petal length>=4.850: Iris-virginica (100.00%)
1272
1273Let's now print out the predicted class at each node, the number
1274of examples in the majority class with the total number of examples
1275in the node::
1276
1277    >>> print tree.dump(leafStr="%V (%M out of %N)")
1278    petal width<0.800: Iris-setosa (50.000 out of 50.000)
1279    petal width>=0.800
1280    |    petal width<1.750
1281    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
1282    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
1283    |    petal width>=1.750
1284    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
1285    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)
1286
1287Would you like to know how the number of examples declines as
1288compared to the entire data set and to the parent node? We find
1289it with this::
1290
1291    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)")
1292    petal width<0.800: Iris-setosa (100%, 100%)
1293    petal width>=0.800
1294    |    petal width<1.750
1295    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
1296    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
1297    |    petal width>=1.750
1298    |    |    petal length<4.850: Iris-virginica (4%, 4%)
1299    |    |    petal length>=4.850: Iris-virginica (86%, 96%)
1300
1301Let us first read the format string. :samp:`%M` is the number of
1302examples in the majority class. We want it divided by the number of
1303all examples from this class on the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
1305that is just printed out literally, just as the comma and parentheses
1306(see the output). The string for showing the proportion of this class
1307in the parent is the same except that we have :samp:`bP` instead
1308of :samp:`bA`.
1309
And now for the output: all examples of setosa went into the first node.
1311For versicolor, we have 98% in one node; the rest is certainly
1312not in the neighbouring node (petal length>=5.350) since all
1313versicolors from the node petal width<1.750 went to petal length<5.350
1314(we know this from the 100% in that line). Virginica is the
1315majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.
1317
1318If you find this guesswork annoying - so do I. Let us print out the
1319number of versicolors in each node, together with the proportion of
1320versicolors among the examples in this particular node and among all
1321versicolors. So,
1322
1323::
1324
    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'
1326
1327gives the following output::
1328
1329    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
1330    petal width>=0.800
1331    |    petal width<1.750
1332    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
1333    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
1334    |    petal width>=1.750
1335    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
1336    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)
1337
1338Finally, we may want to print out the distributions, using a simple
1339string :samp:`%D`::
1340
1341    petal width<0.800: [50.000, 0.000, 0.000]
1342    petal width>=0.800
1343    |    petal width<1.750
1344    |    |    petal length<5.350: [0.000, 49.000, 3.000]
1345    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
1346    |    petal width>=1.750
1347    |    |    petal length<4.850: [0.000, 1.000, 2.000]
1348    |    |    petal length>=4.850: [0.000, 0.000, 43.000]
1349
1350What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::
1356
1357    petal width<0.800: [1.00, 0.00, 0.00]
1358    petal width>=0.800
1359    |    petal width<1.750
1360    |    |    petal length<5.350: [0.00, 0.94, 0.06]
1361    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
1362    |    petal width>=1.750
1363    |    |    petal length<4.850: [0.00, 0.33, 0.67]
1364    |    |    petal length>=4.850: [0.00, 0.00, 1.00]
1365
1366We haven't tried printing out any information for internal nodes.
1367To start with the most trivial case, we shall print the prediction
1368at each node.
1369
1370::
1371
1372    tree.dump(leafStr="%V", nodeStr=".")
1373   
1374says that the nodeStr should be the same as leafStr (not very useful
1375here, since leafStr is trivial anyway).
1376
1377::
1378
1379    root: Iris-setosa
1380    |    petal width<0.800: Iris-setosa
1381    |    petal width>=0.800: Iris-versicolor
1382    |    |    petal width<1.750: Iris-versicolor
1383    |    |    |    petal length<5.350: Iris-versicolor
1384    |    |    |    petal length>=5.350: Iris-virginica
1385    |    |    petal width>=1.750: Iris-virginica
1386    |    |    |    petal length<4.850: Iris-virginica
1387    |    |    |    petal length>=4.850: Iris-virginica
1388
Note that the output is somewhat different now: another node, called
*root*, has appeared and the tree looks one level deeper. This is
needed to print out the data for that node, too.
1392
1393Now for something more complicated: let us observe how the number
1394of virginicas decreases down the tree::
1395
1396    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')
1397
1398Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
1399the number of examples from class virginica, divided by the total number
1400of examples in this class. Add :samp:`^.1` and the result will be
1401multiplied and printed with one decimal. The trailing :samp:`%` is printed
1402out. In parentheses we print the same thing except that we divide by the
1403examples in the parent node. Note the use of single quotes, so we can
1404use the double quotes inside the string, when we specify the class.
1405
1406::
1407
1408    root: 100.0% (.%)
1409    |    petal width<0.800: 0.0% (0.0%)
1410    |    petal width>=0.800: 100.0% (100.0%)
1411    |    |    petal width<1.750: 10.0% (10.0%)
1412    |    |    |    petal length<5.350: 6.0% (60.0%)
1413    |    |    |    petal length>=5.350: 4.0% (40.0%)
1414    |    |    petal width>=1.750: 90.0% (90.0%)
1415    |    |    |    petal length<4.850: 4.0% (4.4%)
1416    |    |    |    petal length>=4.850: 86.0% (95.6%)
1417
1418See what's in the parentheses in the root node? If :meth:`~TreeClassifier.dump`
1419cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
1421will count all classes *except* virginica.
1422
1423For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think why). In the leaves we shall also add the predicted
class. So now we'll have to call the function like this.
1428
1429::
1430
1431    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
1432    root: [50.000, 50.000, 50.000] . .
1433    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
1434    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
1435    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
1436    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
1437    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
1438    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
1439    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
1440    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
1441    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
1442    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
1443    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
1444    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
1445    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]
1446
To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set.
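A minimal sketch of the induction (an assumption on our side: `housing.tab`
is available like the other example data sets, and we reuse the depth
limit from the iris example)::

    import Orange
    housing = Orange.data.Table("housing")
    tree = Orange.classification.tree.TreeLearner(housing, maxDepth=3)

Called with the tree as the only argument, :meth:`TreeClassifier.dump`
prints the tree like this::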
1450
1451    RM<6.941
1452    |    LSTAT<14.400
1453    |    |    DIS<1.385: 45.6
1454    |    |    DIS>=1.385: 22.9
1455    |    LSTAT>=14.400
1456    |    |    CRIM<6.992: 17.1
1457    |    |    CRIM>=6.992: 12.0
1458    RM>=6.941
1459    |    RM<7.437
1460    |    |    CRIM<7.393: 33.3
1461    |    |    CRIM>=7.393: 14.4
1462    |    RM>=7.437
1463    |    |    TAX<534.500: 45.9
1464    |    |    TAX>=534.500: 21.9
1465
1466Let us add the standard error in both internal nodes and leaves, and the
146790% confidence intervals in the leaves::
1468
1469    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
1470    root: [SE: 0.409]
1471    |    RM<6.941: [SE: 0.306]
1472    |    |    LSTAT<14.400: [SE: 0.320]
1473    |    |    |    DIS<1.385: [SE: 4.420]:
1474    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
1475    |    |    |    DIS>=1.385: [SE: 0.244]:
1476    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
1477    |    |    LSTAT>=14.400: [SE: 0.333]
1478    |    |    |    CRIM<6.992: [SE: 0.338]:
1479    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
1480    |    |    |    CRIM>=6.992: [SE: 0.448]:
1481    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
1482    |    RM>=6.941: [SE: 1.031]
1483    |    |    RM<7.437: [SE: 0.958]
1484    |    |    |    CRIM<7.393: [SE: 0.692]:
1485    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
1486    |    |    |    CRIM>=7.393: [SE: 2.157]:
1487    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
1488    |    |    RM>=7.437: [SE: 1.124]
1489    |    |    |    TAX<534.500: [SE: 0.817]:
1490    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
1491    |    |    |    TAX>=534.500: [SE: 0.000]:
1492    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]
1493
What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
Since :samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
the printed number has the same number of decimals
as in the data file.
1502
1503Regression trees cannot print the distributions in the same way
1504as classification trees. They instead offer a set of operators for
1505observing the number of examples within a certain range. For instance,
1506let us check the number of examples with values below 22, and compare
1507this number with values in the parent nodes::
1508
1509    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".")
1510    root: 277.000 (.)
1511    |    RM<6.941: 273.000 (1.160)
1512    |    |    LSTAT<14.400: 107.000 (0.661)
1513    |    |    |    DIS<1.385: 0.000 (0.000)
1514    |    |    |    DIS>=1.385: 107.000 (1.020)
1515    |    |    LSTAT>=14.400: 166.000 (1.494)
1516    |    |    |    CRIM<6.992: 93.000 (0.971)
1517    |    |    |    CRIM>=6.992: 73.000 (1.040)
1518    |    RM>=6.941: 4.000 (0.096)
1519    |    |    RM<7.437: 3.000 (1.239)
1520    |    |    |    CRIM<7.393: 0.000 (0.000)
1521    |    |    |    CRIM>=7.393: 3.000 (15.333)
1522    |    |    RM>=7.437: 1.000 (0.633)
1523    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)
1525
The last line, for instance, says that among the examples with tax above
534, the proportion of examples with the class below 22 is 30 times as
high as in its parent node.
1529
1530For another exercise, let's count the same for all examples *outside*
1531interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percentages.
1533
1534::
1535
1536    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")
1537
1538OK, let's observe the format string for one last time. :samp:`%c![20, 22]`
1539would be the proportion of examples (within the node) whose values are
below 20 or above 22. With :samp:`%cbP![20, 22]` we divide this by the same
statistic computed on the parent. Add a :samp:`^` and you have the percentages.
1542
1543::
1544
1545    root: 439.000 (.%)
1546    |    RM<6.941: 364.000 (98%)
1547    |    |    LSTAT<14.400: 200.000 (93%)
1548    |    |    |    DIS<1.385: 5.000 (127%)
1549    |    |    |    DIS>=1.385: 195.000 (99%)
1550    |    |    LSTAT>=14.400: 164.000 (111%)
1551    |    |    |    CRIM<6.992: 91.000 (96%)
1552    |    |    |    CRIM>=6.992: 73.000 (105%)
1553    |    RM>=6.941: 75.000 (114%)
1554    |    |    RM<7.437: 46.000 (101%)
1555    |    |    |    CRIM<7.393: 43.000 (100%)
1556    |    |    |    CRIM>=7.393: 3.000 (100%)
1557    |    |    RM>=7.437: 29.000 (98%)
1558    |    |    |    TAX<534.500: 29.000 (103%)
1559    |    |    |    TAX>=534.500: 0.000 (0%)
1560
1561
Defining Your Own Printout Functions
1563------------------------------------
1564
1565:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to print out
1566some other information in the leaves or nodes. If provided,
1567:obj:`userFormats` should contain a list of tuples with a regular expression
1568and a callback function to be called when that expression is found in the
1569format string. Expressions from :obj:`userFormats` are checked before
1570the built-in expressions discussed above, so you can override the built-ins
1571if you want to.
1572
1573The regular expression should describe a string like those we used above,
1574for instance the string :samp:`%.2DbP`. When a leaf or internal node
1575is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
1576is checked for these regular expressions and when the match is found,
1577the corresponding callback function is called.
1578
1579The callback function will get five arguments: the format string
1580(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
1581being printed, its parent (can be None) and the tree classifier.
1582The function should return the format string in which the part described
1583by the match object (that is, the part that is matched by the regular
1584expression) is replaced by whatever information your callback function
1585is supposed to give.
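
For illustration, a do-nothing callback for a made-up marker :samp:`%X`
(the marker and the name replaceX are ours, not part of the module;
insertStr is a module helper documented below) could look like this::

    import re
    from Orange.classification.tree import insertStr

    def replaceX(strg, mo, node, parent, tree):
        # replace the part of the format string matched by mo with "X"
        return insertStr(strg, mo, "X")

    myFormats = [(re.compile("%X"), replaceX)]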
1586
The function can use several utility functions provided in the module.
1588
1589.. autofunction:: insertStr
1590
1591.. autofunction:: insertDot
1592
1593.. autofunction:: insertNum
1594
1595.. autofunction:: byWhom
1596
1597
1598There are also a few pieces of regular expression that you may want to reuse.
1599The two you are likely to use are:
1600
1601.. autodata:: fs
1602
1603.. autodata:: by
1604
1605For a trivial example, :samp:`%V` is implemented like this. There is the
1606following tuple in the list of built-in formats::
1607
1608    (re.compile("%V"), replaceV)
1609
1610:obj:`replaceV` is a function defined by::
1611
1612    def replaceV(strg, mo, node, parent, tree):
1613        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
1614
1615It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
1617and passes it to *insertStr* to do the replacement.
1618
1619A more complex regular expression is the one for the proportion of
1620majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
1621two partial expressions defined above.
1622
Let's say we would like to print the classification margin for each node,
1624that is, the difference between the proportion of the largest and the
1625second largest class in the node (part of `orngTree2.py`_):
1626
1627.. _orngTree2.py: code/orngTree2.py
1628
1629.. literalinclude:: code/orngTree2.py
1630   :lines: 7-31
1631
We first defined getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If we need to divide the quantity by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node by whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
1638to insert a dot, otherwise we call :func:`insertNum` which will insert
1639the number, obeying the format specified by the user. myFormat is a list
1640containing the regular expression and the callback function.
1641
1642We can now print out the iris tree:
1643
1644.. literalinclude:: code/orngTree2.py
1645    :lines: 33
1646
1647And we get::
1648
1649    petal width<0.800: Iris-setosa 100% (100.00%)
1650    petal width>=0.800
1651    |    petal width<1.750
1652    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
1653    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
1654    |    petal width>=1.750
1655    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
1656    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)
1657
1658Plotting the Tree using Dot
1659---------------------------
1660
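A sketch of producing such a file (an assumption: the legacy `orngTree`
interface, whose `printDot` function writes a tree in dot format, is
available; adapt the call to the interface you use)::

    import Orange, orngTree

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris)
    orngTree.printDot(tree, fileName="tree5.dot")
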
Suppose you saved the tree in a file "tree5.dot". You can then
render it as a GIF by executing the following command line
1663
1664::
1665   
1666    dot -Tgif tree5.dot -otree5.gif
1667
GraphViz's dot supports quite a few other output formats; check
its documentation to learn which.
1670
1671
1672
1673===========================
1674C4.5 Classifier and Learner
1675===========================
1676
1677As C4.5 is a standard benchmark in machine learning,
1678it is incorporated in Orange, although Orange has its own
1679implementation of decision trees.
1680
The implementation uses Quinlan's original code for learning, so the
tree you get is exactly like the one that would be built by the
standalone C4.5. Upon return, however, the original tree is copied to Orange
1684components that contain exactly the same information plus what is needed
1685to make them visible from Python. To be sure that the algorithm behaves
1686just as the original, we use a dedicated class :class:`C45Node`
1687instead of reusing the components used by Orange's tree inducer
(i.e., :class:`Node`). This, however, could be done and probably
1689will be done in the future; we shall still retain :class:`C45Node`
1690but offer transformation to :class:`Node` so that routines
1691that work on Orange trees will also be usable for C45 trees.
1692
:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most Orange
learning algorithms, C4.5 does not accept weighted examples.
1696
1697Building the C4.5 plug-in
1698=========================
1699
1700We haven't been able to obtain the legal rights to distribute
1701C4.5 and therefore couldn't statically link it into Orange. Instead,
1702it's incorporated as a plug-in which you'll need to build yourself.
1703The procedure is trivial, but you'll need a C compiler. On Windows,
1704the scripts we provide work with MS Visual C and the files CL.EXE
1705and LINK.EXE must be on the PATH. On Linux you're equipped with
1706gcc. Mac OS X comes without gcc, but you can download it for
1707free from Apple.
1708
1709Orange must be installed prior to building C4.5. (This is because
1710the build script will copy the created file next to Orange,
1711which it obviously can't if Orange isn't there yet.)
1712
1713#. Download the
1714   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `RuleQuest site <http://www.rulequest.com/>`_ and extract
   them into some temporary directory. The files will be modified in the
   process that follows, so don't use a copy of Quinlan's sources that you
   need for another purpose.
1719#. Download
1720   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
   and unzip its contents into the directory R8/Src of Quinlan's
   sources (it's the directory that contains, for instance, the file
1723   average.c).
1724#. Run buildC45.py, which will build the plug-in and put it next to
1725   orange.pyd (or orange.so on Linux/Mac).
1726#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
   If this fails, something went wrong; see the diagnostic messages from
   buildC45.py and read the paragraph below.
#. Finally, you can remove Quinlan's sources, along with everything
1731   created by buildC45.py.
1732
1733If the process fails, here's what buildC45.py really does: it creates
1734.h files that wrap Quinlan's .i files and ensure that they are not
1735included twice. It modifies C4.5 sources to include .h's instead of
1736.i's. This step can hardly fail. Then follows the platform dependent
1737step which compiles ensemble.c (which includes all the Quinlan's .c
1738files it needs) into c45.dll or c45.so and puts it next to Orange.
1739If this fails, but you do have a C compiler and linker, and you know
1740how to use them, you can compile the ensemble.c into a dynamic
1741library yourself. See the compile and link steps in buildC45.py,
1742if it helps. Anyway, after doing this check that the built C4.5
1743gives the same results as the original.
1744
1745.. class:: C45Learner
1746
1747    :class:`C45Learner`'s attributes have double names - those that
1748    you know from C4.5 command lines and the corresponding names of C4.5's
1749    internal variables. All defaults are set as in C4.5; if you change
1750    nothing, you are running C4.5.
1751
1752    .. attribute:: gainRatio (g)
1753       
        Determines whether to use information gain (false, default)
        or gain ratio for selection of attributes (true).
1756
1757    .. attribute:: batch (b)
1758
1759        Turn on batch mode (no windows, no iterations); this option is
1760        not documented in C4.5 manuals. It conflicts with "window",
1761        "increment" and "trials".
1762
1763    .. attribute:: subset (s)
1764       
        Enables subsetting (default: false, no subsetting).
1766 
1767    .. attribute:: probThresh (p)
1768
1769        Probabilistic threshold for continuous attributes (default: false).
1770
1771    .. attribute:: minObjs (m)
1772       
1773        Minimal number of objects (examples) in leaves (default: 2).
1774
1775    .. attribute:: window (w)
1776
        Initial window size (default: maximum of 20% and twice the
1778        square root of the number of data objects).
1779
1780    .. attribute:: increment (i)
1781
1782        The maximum number of objects that can be added to the window
1783        at each iteration (default: 20% of the initial window size).
1784
1785    .. attribute:: cf (c)
1786
        Pruning confidence level (default: 25%).
1788
1789    .. attribute:: trials (t)
1790
1791        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1792
1793    .. attribute:: prune
1794       
        Return the pruned tree (not an original C4.5 option; default: true).
1796
1797
:class:`C45Learner` also offers another way of setting
1799the arguments: it provides a function :obj:`C45Learner.commandLine`
1800which is given a string and parses it the same way as C4.5 would
1801parse its command line.
1802
1803.. class:: C45Classifier
1804
1805    A faithful reimplementation of Quinlan's function from C4.5. The only
1806    difference (and the only reason it's been rewritten) is that it uses
1807    a tree composed of :class:`C45Node` instead of C4.5's
1808    original tree structure.
1809
1810    .. attribute:: tree
1811
1812        C4.5 tree stored as a tree of :obj:`C45Node`.
1813
1814
1815.. class:: C45Node
1816
1817    This class is a reimplementation of the corresponding *struct* from
1818    Quinlan's C4.5 code.
1819
1820    .. attribute:: nodeType
1821
1822        Type of the node:  :obj:`C45Node.Leaf` (0),
1823        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
1824        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
1825        split examples based on values of a discrete attribute,
1826        "cuts" cut them according to a threshold value of a continuous
        attribute, and "subsets" use discrete attributes but with subsetting
1828        so that several values can go into the same branch.
1829
1830    .. attribute:: leaf
1831
1832        Value returned by that leaf. The field is defined for internal
1833        nodes as well.
1834
1835    .. attribute:: items
1836
1837        Number of (learning) examples in the node.
1838
1839    .. attribute:: classDist
1840
1841        Class distribution for the node (of type
1842        :obj:`orange.DiscDistribution`).
1843
1844    .. attribute:: tested
1845       
1846        The attribute used in the node's test. If node is a leaf,
        :obj:`tested` is None; if node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute, and if node
        is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.
1850
1851    .. attribute:: cut
1852
1853        A threshold for continuous attributes, if node is of type :obj:`Cut`.
1854        Undefined otherwise.
1855
1856    .. attribute:: mapping
1857
1858        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the index of the branch for an example whose value of
        :obj:`tested` is *i*. Here, *i* denotes the index of the value,
        not a :class:`Orange.data.Value`.
1861
1862    .. attribute:: branch
1863       
1864        A list of branches stemming from this node.
1865
1866Examples
1867========
1868
1869.. _tree_c45.py: code/tree_c45.py
1870.. _iris.tab: code/iris.tab
1871
1872The simplest way to use :class:`C45Learner` is to call it. This
1873script constructs the same learner as you would get by calling
1874the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):
1875
1876.. literalinclude:: code/tree_c45.py
1877   :lines: 7-14
1878
Arguments can be set by the usual mechanism (the two lines below do the
same, except that one uses command-line symbols and the other the internal
1881variable names)
1882
1883::
1884
1885    tree = Orange.classification.tree.C45Learner(data, m=100)
1886    tree = Orange.classification.tree.C45Learner(data, minObjs=100)
1887
The way that veteran C4.5 users might prefer is through the
method :obj:`C45Learner.commandline`.
1890
1891::
1892
    lrn = Orange.classification.tree.C45Learner()
1894    lrn.commandline("-m 1 -s")
1895    tree = lrn(data)
1896
1897There's nothing special about using :obj:`C45Classifier` - it's
just like any other classifier. To demonstrate what the structure of
:class:`C45Node` looks like, we will show a script that prints
1900it out in the same format as C4.5 does.
1901
1902.. literalinclude:: code/tree_c45_printtree.py
1903
1904Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
1906:obj:`C45Node` does not know to which attribute it belongs), we need to
1907convert it to a string through :obj:`classVar`, which is passed as an
1908extra argument to the recursive part of printTree.
1909
1910For discrete splits without subsetting, we print out all attribute values
1911and recursively call the function for all branches. Continuous splits are
1912equally easy to handle.
1913
1914For discrete splits with subsetting, we iterate through branches, retrieve
the corresponding values that go into each branch (the list inset), turn
1916the values into strings and print them out, separately treating the
1917case when only a single value goes into the branch.
1918
1919Printing out C45 Tree
1920=====================
1921
1922.. autofunction:: printTreeC45
1923
1924References
1925==========
1926
Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`, Addison
Wesley.
1929
1930E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories,
1931Murray Hill NJ, U.S.A., October 1993.
1932
1933`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_
The home page of AT&T's dot and similar software packages.
1935
1936"""
1937
1938from Orange.core import \
1939     TreeLearner as _TreeLearner, \
1940         TreeClassifier as _TreeClassifier, \
1941         C45Learner, \
1942         C45Classifier, \
1943         C45TreeNode as C45Node, \
1944         C45TreeNodeList as C45NodeList, \
1945         TreeDescender as Descender, \
1946              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
1947              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
1948              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
1949              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
1950              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
1951         TreeExampleSplitter as Splitter, \
1952              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
1953              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
1954              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
1955              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
1956              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
1957              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
1958              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
1959         TreeNode as Node, \
1960         TreeNodeList as NodeList, \
1961         TreePruner as Pruner, \
1962              TreePruner_SameMajority as Pruner_SameMajority, \
1963              TreePruner_m as Pruner_m, \
1964         TreeSplitConstructor as SplitConstructor, \
1965              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
1966              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
1967                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
1968                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
1969                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
1970                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
1971         TreeStopCriteria as StopCriteria, \
1972              TreeStopCriteria_Python as StopCriteria_Python, \
1973              TreeStopCriteria_common as StopCriteria_common
1974
1975import Orange.core
1976import operator
1977import base64
1978import re
1979import Orange.data
1980import Orange.feature.scoring
1981import Orange.classification.tree
1982
1983def _c45_showBranch(node, classvar, lev, i):
1984    var = node.tested
1985    if node.nodeType == 1:
1986        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
1987        _c45_printTree0(node.branch[i], classvar, lev+1)
1988    elif node.nodeType == 2:
1989        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
1990        _c45_printTree0(node.branch[i], classvar, lev+1)
1991    else:
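        # subset split: collect the attribute values that map to branch i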
1992        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
1993        inset = [var.values[j[0]] for j in inset]
1994        if len(inset)==1:
1995            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
1996        else:
1997            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
1998        _c45_printTree0(node.branch[i], classvar, lev+1)
1999       
2000       
2001def _c45_printTree0(node, classvar, lev):
2002    var = node.tested
2003    if node.nodeType == 0:
2004        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
2005    else:
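        # show branches that end in a leaf first, then those with
        # subtrees, matching the order of C4.5's own output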
2006        for i, branch in enumerate(node.branch):
2007            if not branch.nodeType:
2008                _c45_showBranch(node, classvar, lev, i)
2009        for i, branch in enumerate(node.branch):
2010            if branch.nodeType:
2011                _c45_showBranch(node, classvar, lev, i)
2012
2013def printTreeC45(tree):
2014    """
2015    Prints the tree given as an argument in the same form as Ross Quinlan's
2016    C4.5 program.
2017
2018    ::
2019
2020        import Orange
2021
2022        data = Orange.data.Table("voting")
2023        c45 = Orange.classification.tree.C45Learner(data)
2024        Orange.classification.tree.printTreeC45(c45)
2025
2026    will print out
2027
2028    ::
2029
2030        physician-fee-freeze = n: democrat (253.4)
2031        physician-fee-freeze = y:
2032        |   synfuels-corporation-cutback = n: republican (145.7)
2033        |   synfuels-corporation-cutback = y:
2034        |   |   mx-missile = y: democrat (6.0)
2035        |   |   mx-missile = n:
2036        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
2037        |   |   |   adoption-of-the-budget-resolution = y:
2038        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
2039        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
2040
2041
    If you run the original C4.5 (that is, the standalone C4.5; Orange
    does use the original C4.5 learning code) on the same data, it will print out
2043
2044    ::
2045
2046        physician-fee-freeze = n: democrat (253.4/5.9)
2047        physician-fee-freeze = y:
2048        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
2049        |   synfuels-corporation-cutback = y:
2050        |   |   mx-missile = y: democrat (6.0/2.4)
2051        |   |   mx-missile = n:
2052        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
2053        |   |   |   adoption-of-the-budget-resolution = y:
2054        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
2055        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
2056
    which is reassuringly similar, except that C4.5 tested the tree on
    the learning data and has also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously can't do
    (nor is there any need that it should).
2061
2062    """
2063    _c45_printTree0(tree.tree, tree.classVar, 0)
2064
2065
2066import Orange.feature.scoring as fscoring
2067
2068class TreeLearner(Orange.core.Learner):
2069    """
2070    Assembles the generic classification or regression tree learner
2071    (from Orange's objects for induction of decision trees).
2072    :class:`TreeLearner` is essentially a wrapper
2073    around :class:`_TreeLearner`, provided for easier use of the latter.
2074    It sets a number of parameters used in induction that
2075    can also be set after the creation of the object, most often through
2076    the object's attributes. If upon initialization
2077    :class:`TreeLearner` is given a set of examples, then an instance
    of :class:`TreeClassifier` is returned instead.
2079
    The values of attributes can also be set in the constructor.
2081
2082    .. attribute:: nodeLearner
2083
2084        Induces a classifier from examples belonging to a node. The
2085        same learner is used for internal nodes and for leaves. The
2086        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
2087
2088    **Split construction**
2089
2090    .. attribute:: split
2091       
2092        Defines a function that will be used in place of
2093        :obj:`SplitConstructor`.
2094        Useful when prototyping new tree induction
2095        algorithms. When this parameter is defined, other parameters that
2096        affect the procedures for growing of the tree are ignored. These
2097        include :obj:`binarization`, :obj:`measure`,
        :obj:`worstAcceptable` and :obj:`minSubset` (default:
        :class:`SplitConstructor_Combined`
2100        with separate constructors for discrete and continuous attributes.
2101        Discrete attributes are used as they are, while
2102        continuous attributes are binarized.
2103        Gain ratio is used to select attributes.
2104        A minimum of two examples in a leaf is required for
2105        discrete and five examples in a leaf for continuous attributes.)
2106
2107    .. attribute:: binarization
2108
2109        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
2110        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
        0, do not use binarization (use :class:`SplitConstructor_Feature`).
2112        Default: 0.
2113
2114    .. attribute:: measure
2115   
2116        Measure for scoring of the attributes when deciding which of the
2117        attributes will be used for splitting of the example set in the node.
2118        Can be either a :class:`Orange.feature.scoring.Measure` or one of
2119        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
2120        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
2121        "gini" (:class:`Orange.feature.scoring.Gini`),
2122        "relief" (:class:`Orange.feature.scoring.Relief`),
2123        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
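
        For instance, a sketch of selecting a measure by name::

            learner = Orange.classification.tree.TreeLearner(measure="gini")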
2124
2125    .. attribute:: reliefM, reliefK
2126
        Set `m` and `k` to the given values if the :obj:`measure` is relief.
2128
2129    .. attribute:: splitter
2130
        Object of type :class:`Splitter`. The default splitter
        is :class:`Splitter_UnknownsAsSelector`, which splits
2133        the learning examples according to distributions given by the
2134        selector.
2135
2136    **Pruning**
2137
2138    .. attribute:: worstAcceptable
2139
2140        Used in pre-pruning, sets the lowest required attribute
2141        score. If the score of the best attribute is below this margin, the
2142        tree at that node is not grown further (default: 0).
2143
2144        So, to allow splitting only when gainRatio (the default measure)
2145        is greater than 0.6, set :samp:`worstAcceptable=0.6`.
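
        A sketch::

            # split a node further only if the best attribute's score
            # (gainRatio by default) exceeds 0.6
            learner = Orange.classification.tree.TreeLearner(worstAcceptable=0.6)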
2146
2147    .. attribute:: minSubset
2148
2149        Minimal number of examples in non-null leaves (default: 0).
2150
2151    .. attribute:: minExamples
2152
2153        Data subsets with less than :obj:`minExamples`
2154        examples are not split any further, that is, all leaves in the tree
        will contain at least that many examples (default: 0).
2156
2157    .. attribute:: maxDepth
2158
2159        Gives maximal tree depth;  0 means that only root is generated.
2160        The default is 100.
2161
2162    .. attribute:: maxMajority
2163
2164        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0).
2166        To stop the induction as soon as the majority class reaches 70%,
2167        you should use :samp:`maxMajority=0.7`, as in the following
2168        example. The numbers show the majority class
2169        proportion at each node. The script `tree2.py`_ induces and
2170        prints this tree.
2171
2172        .. _tree2.py: code/tree2.py
2173
2174        ::
2175
2176            root: 0.333
2177            |    petal width<=0.800: 1.000
2178            |    petal width>0.800: 0.500
2179            |    |    petal width<=1.750: 0.907
2180            |    |    petal width>1.750: 0.978
2181   
2182    .. attribute:: stop
2183
2184        Used for passing a function which is used in place of
2185        :class:`StopCriteria`. Useful when prototyping new
        tree induction algorithms. See the documentation on
2187        :class:`StopCriteria` for more info on this function.
2188        When used, parameters  :obj:`maxMajority` and :obj:`minExamples`
2189        will not be  considered (default: None).
2190        The default stopping criterion stops induction when all examples
2191        in a node belong to the same class.
2192
2193    .. attribute:: mForPruning
2194
2195        If non-zero, invokes an error-based bottom-up post-pruning,
2196        where m-estimate is used to estimate class probabilities
2197        (default: 0).
2198
2199    .. attribute:: sameMajorityPruning
2200
2201        If true, invokes a bottom-up post-pruning by removing the
        subtrees whose leaves all classify to the same class
2203        (default: False).
2204
2205    **Record keeping**
2206
2207    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
2208
2209        Determines whether to store class distributions, contingencies and
2210        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. By default everything except
        :obj:`storeExamples` is enabled. You won't save any memory by not storing
        distributions but storing contingencies, since distributions actually point to
2214        the same distribution that is stored in
2215        :obj:`contingency.classes`. (default: True except for
2216        storeExamples, which defaults to False).
2217   
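    **Examples**

    A minimal usage sketch (assuming `iris.tab` is in the working
    directory; because the examples are passed to the constructor, a
    :class:`TreeClassifier` is returned directly)::

        import Orange

        iris = Orange.data.Table("iris")
        classifier = Orange.classification.tree.TreeLearner(iris, maxDepth=3)
        print classifier.dump()
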
2218    """
2219    def __new__(cls, examples = None, weightID = 0, **argkw):
2220        self = Orange.core.Learner.__new__(cls, **argkw)
2221        if examples:
2222            self.__init__(**argkw)
2223            return self.__call__(examples, weightID)
2224        else:
2225            return self
2226     
2227    def __init__(self, **kw):
2228        self.split = None
2229        self.stop = None
2230        self.measure = None
2231        self.splitter = None
2232        self.__dict__.update(kw)
2233     
2234    def __call__(self, examples, weight=0):
2235        """
2236        Return a classifier from the given examples.
2237        """
2238        bl = self._base_learner()
2239
2240        #set the scoring criteria for regression if it was not
2241        #set by the user
2242        if not self.split and not self.measure:
2243            measure = fscoring.GainRatio() \
2244                if examples.domain.classVar.varType == Orange.data.Type.Discrete \
2245                else fscoring.MSE()
2246            bl.split.continuousSplitConstructor.measure = measure
2247            bl.split.discreteSplitConstructor.measure = measure
2248         
2249        if self.splitter != None:
2250            bl.splitter = self.splitter
2251
2252
2253        #post pruning
2254        tree = bl(examples, weight)
2255        if getattr(self, "sameMajorityPruning", 0):
2256            tree = Pruner_SameMajority(tree)
2257        if getattr(self, "mForPruning", 0):
2258            tree = Pruner_m(tree, m=self.mForPruning)
2259
2260        return TreeClassifier(baseClassifier=tree) 
2261
2262    def instance(self):
2263        """
2264        DEPRECATED. Return a base learner - an object
2265        of :class:`_TreeLearner`.
2266        This method is left for backwards compatibility.
2267        """
2268        return self._base_learner()
2269
2270    def build_split(self):
2271        """
        Return the split constructor built according to the object's attributes.
2273        """
2274       
2275        split = self.split
2276
2277        if not split:
2278            split = SplitConstructor_Combined()
2279            split.continuousSplitConstructor = \
2280                SplitConstructor_Threshold()
2281            binarization = getattr(self, "binarization", 0)
2282            if binarization == 1:
2283                split.discreteSplitConstructor = \
2284                    SplitConstructor_ExhaustiveBinary()
2285            elif binarization == 2:
2286                split.discreteSplitConstructor = \
2287                    SplitConstructor_OneAgainstOthers()
2288            else:
2289                split.discreteSplitConstructor = \
2290                    SplitConstructor_Feature()
2291
2292            measures = {"infoGain": fscoring.InfoGain,
2293                "gainRatio": fscoring.GainRatio,
2294                "gini": fscoring.Gini,
2295                "relief": fscoring.Relief,
2296                "retis": fscoring.MSE
2297                }
2298
2299            measure = self.measure
2300            if isinstance(measure, str):
2301                measure = measures[measure]()
2302            if not measure:
2303                measure = fscoring.GainRatio()
2304
2305            measureIsRelief = isinstance(measure, fscoring.Relief)
2306            relM = getattr(self, "reliefM", None)
2307            if relM and measureIsRelief:
2308                measure.m = relM
2309           
2310            relK = getattr(self, "reliefK", None)
2311            if relK and measureIsRelief:
2312                measure.k = relK
2313
2314            split.continuousSplitConstructor.measure = measure
2315            split.discreteSplitConstructor.measure = measure
2316
2317            wa = getattr(self, "worstAcceptable", 0)
2318            if wa:
2319                split.continuousSplitConstructor.worstAcceptable = wa
2320                split.discreteSplitConstructor.worstAcceptable = wa
2321
2322            ms = getattr(self, "minSubset", 0)
2323            if ms:
2324                split.continuousSplitConstructor.minSubset = ms
2325                split.discreteSplitConstructor.minSubset = ms
2326
2327        return split
2328
2329    def build_stop(self):
2330        """
        Return the stop criteria built according to the object's attributes.
2332        """
2333        stop = self.stop
2334        if not stop:
2335            stop = Orange.classification.tree.StopCriteria_common()
2336            mm = getattr(self, "maxMajority", 1.0)
2337            if mm < 1.0:
2338                stop.maxMajority = self.maxMajority
2339            me = getattr(self, "minExamples", 0)
2340            if me:
2341                stop.minExamples = self.minExamples
2342        return stop
2343
2344    def _base_learner(self):
2345        learner = _TreeLearner()
2346
2347        learner.split = self.build_split()
2348        learner.stop = self.build_stop()
2349
2350        for a in ["storeDistributions", "storeContingencies", "storeExamples", 
2351            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
2352            if hasattr(self, a):
2353                setattr(learner, a, getattr(self, a))
2354
2355        return learner
2356
2357# the following is for the output
2358
2359
2360fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2361""" Defines the multiplier by 100 (:samp:`^`) and the format
2362for the number of decimals (e.g. :samp:`5.3`). The corresponding
2363groups are named :samp:`m100` and :samp:`fs`. """
2364
2365by = r"(?P<by>(b(P|A)))?"
""" Defines bP or bA or nothing; the result is in the group named by. """
2367
2368bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2369opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2370opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2371intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2372fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2373re_V = re.compile("%V")
2374re_N = re.compile("%"+fs+"N"+by)
2375re_M = re.compile("%"+fs+"M"+by)
2376re_m = re.compile("%"+fs+"m"+by)
2377re_Ccont = re.compile("%"+fs+"C"+by+opc)
2378re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2379re_ccont = re.compile("%"+fs+"c"+by+opc)
2380re_cdisc = re.compile("%"+fs+"c"+by+opd)
2381re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2382re_cconti = re.compile("%"+fs+"c"+by+fromto)
2383re_D = re.compile("%"+fs+"D"+by)
2384re_d = re.compile("%"+fs+"d"+by)
2385re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2386re_I = re.compile("%"+fs+"I"+intrvl)
2387
2388def insertStr(s, mo, sub):
2389    """ Replace the part of s which is covered by mo
2390    with the string sub. """
2391    return s[:mo.start()] + sub + s[mo.end():]
2392
2393
2394def insertDot(s, mo):
2395    """ Replace the part of s which is covered by mo
2396    with a dot.  You should use this when the
2397    function cannot compute the desired quantity; it is called, for instance,
2398    when it needs to divide by something in the parent, but the parent
2399    doesn't exist.
2400    """
2401    return s[:mo.start()] + "." + s[mo.end():]
2402
2403def insertNum(s, mo, n):
2404    """ Replace the part of s matched by mo with the number n,
2405    formatted as specified by the user, that is, it multiplies
2406    it by 100, if needed, and prints with the right number of
2407    places and decimals. It does so by checking the mo
2408    for a group named m100 (representing the :samp:`^` in the format string)
    and a group named fs representing the part giving the number
    of decimals (e.g. :samp:`5.3`).
2411    """
2412    grps = mo.groupdict()
2413    m100 = grps.get("m100", None)
2414    if m100:
2415        n *= 100
2416    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2417    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
2418
2419def byWhom(by, parent, tree):
    """ If by equals "bP", it returns the parent; otherwise it returns
    :samp:`tree.tree`. This is used to find what to divide the quantity
    by, when division is required.
2423    """
2424    if by=="bP":
2425        return parent
2426    else:
2427        return tree.tree
2428
2429def replaceV(strg, mo, node, parent, tree):
2430    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
2431
2432def replaceN(strg, mo, node, parent, tree):
2433    by = mo.group("by")
2434    N = node.distribution.abs
2435    if by:
2436        whom = byWhom(by, parent, tree)
2437        if whom and whom.distribution:
2438            if whom.distribution.abs > 1e-30:
2439                N /= whom.distribution.abs
2440        else:
2441            return insertDot(strg, mo)
2442    return insertNum(strg, mo, N)
2443       
2444
2445def replaceM(strg, mo, node, parent, tree):
2446    by = mo.group("by")
2447    maj = int(node.nodeClassifier.defaultValue)
2448    N = node.distribution[maj]
2449    if by:
2450        whom = byWhom(by, parent, tree)
2451        if whom and whom.distribution:
2452            if whom.distribution[maj] > 1e-30:
2453                N /= whom.distribution[maj]
2454        else:
2455            return insertDot(strg, mo)
2456    return insertNum(strg, mo, N)
2457       
2458
2459def replacem(strg, mo, node, parent, tree):
2460    by = mo.group("by")
2461    maj = int(node.nodeClassifier.defaultValue)
2462    if node.distribution.abs > 1e-30:
2463        N = node.distribution[maj] / node.distribution.abs
        if by:
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
2466                byN = whom.distribution[maj] / whom.distribution.abs
2467                if byN > 1e-30:
2468                    N /= byN
2469            else:
2470                return insertDot(strg, mo)
2471    else:
2472        N = 0.
2473    return insertNum(strg, mo, N)
2474
2475
2476def replaceCdisc(strg, mo, node, parent, tree):
2477    if tree.classVar.varType != Orange.data.Type.Discrete:
2478        return insertDot(strg, mo)
2479   
2480    by, op, cls = mo.group("by", "op", "cls")
2481    N = node.distribution[cls]
2482    if op == "!=":
2483        N = node.distribution.abs - N
2484    if by:
2485        whom = byWhom(by, parent, tree)
2486        if whom and whom.distribution:
2487            if whom.distribution[cls] > 1e-30:
2488                N /= whom.distribution[cls]
2489        else:
2490            return insertDot(strg, mo)
2491    return insertNum(strg, mo, N)
2492
2493   
2494def replacecdisc(strg, mo, node, parent, tree):
2495    if tree.classVar.varType != Orange.data.Type.Discrete:
2496        return insertDot(strg, mo)
2497   
2498    op, by, cls = mo.group("op", "by", "cls")
2499    N = node.distribution[cls]
2500    if node.distribution.abs > 1e-30:
2501        N /= node.distribution.abs
2502        if op == "!=":
2503            N = 1 - N
2504    if by:
2505        whom = byWhom(by, parent, tree)
2506        if whom and whom.distribution:
2507            if whom.distribution[cls] > 1e-30:
2508                N /= whom.distribution[cls] / whom.distribution.abs
2509        else:
2510            return insertDot(strg, mo)
2511    return insertNum(strg, mo, N)
2512
2513
2514__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2515
2516def replaceCcont(strg, mo, node, parent, tree):
2517    if tree.classVar.varType != Orange.data.Type.Continuous:
2518        return insertDot(strg, mo)
2519   
2520    by, op, num = mo.group("by", "op", "num")
2521    op = __opdict[op]
2522    num = float(num)
2523    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2524    if by:
2525        whom = byWhom(by, parent, tree)
2526        if whom and whom.distribution:
2527            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2528            if byN > 1e-30:
2529                N /= byN
2530        else:
2531            return insertDot(strg, mo)
2532
2533    return insertNum(strg, mo, N)
2534   
2535   
2536def replaceccont(strg, mo, node, parent, tree):
2537    if tree.classVar.varType != Orange.data.Type.Continuous:
2538        return insertDot(strg, mo)
2539   
2540    by, op, num = mo.group("by", "op", "num")
2541    op = __opdict[op]
2542    num = float(num)
2543    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2544    if node.distribution.abs > 1e-30:
2545        N /= node.distribution.abs
2546    if by:
2547        whom = byWhom(by, parent, tree)
2548        if whom and whom.distribution:
2549            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2550            if byN > 1e-30:
2551                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
2552        else:
2553            return insertDot(strg, mo)
2554    return insertNum(strg, mo, N)
2555
2556
2557def extractInterval(mo, dist):
2558    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
2559    lower, upper = float(lower), float(upper)
2560    if out:
2561        lop = lowin == "(" and operator.le or operator.lt
        hop = upin == ")" and operator.ge or operator.gt
2563        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
2564    else:
2565        lop = lowin == "(" and operator.gt or operator.ge
2566        hop = upin == ")" and operator.lt or operator.le
2567        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())
2568
2569   
2570def replaceCconti(strg, mo, node, parent, tree):
2571    if tree.classVar.varType != Orange.data.Type.Continuous:
2572        return insertDot(strg, mo)
2573
2574    by = mo.group("by")
2575    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
2576    if by:
2577        whom = byWhom(by, parent, tree)
2578        if whom and whom.distribution:
2579            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
2580            if byN > 1e-30:
2581                N /= byN
2582        else:
2583            return insertDot(strg, mo)
2584       
2585    return insertNum(strg, mo, N)
2586
2587           
2588def replacecconti(strg, mo, node, parent, tree):
2589    if tree.classVar.varType != Orange.data.Type.Continuous:
2590        return insertDot(strg, mo)
2591
2592    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
2593    ab = node.distribution.abs
2594    if ab > 1e-30:
2595        N /= ab
2596
2597    by = mo.group("by")
2598    if by:
2599        whom = byWhom(by, parent, tree)
2600        if whom and whom.distribution:
2601            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
2602            if byN > 1e-30:
2603                N /= byN/whom.distribution.abs
2604        else:
2605            return insertDot(strg, mo)
2606       
2607    return insertNum(strg, mo, N)
2608
2609   
def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs # abs >= d, so d > 1e-30 implies abs > 1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")


def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)


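# Two-sided z-scores for the supported confidence levels; used by
# replaceI below to turn the standard error into a confidence interval.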
Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
    mul = mo.group("m100") and 100 or 1

    if not Z.has_key(intrvl):
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (intrvl*100)

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))


# This class is more a collection of functions, merged into a class so
# that they do not need to pass around too many arguments. It is
# constructed, used and discarded; it is not meant to store any state.
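#
# Typical use, mirroring TreeClassifier.dump below ("formats" stands for
# the combined user and default string formats):
#   _TreeDumper(leafStr, nodeStr, formats, minExamples, maxDepth,
#               simpleFirst, tree).dumpTree()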
class _TreeDumper:
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI) ]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
        maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr


    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    # resume the search just past the start of the last
                    # match, so that a replacement which still matches the
                    # pattern cannot cause an infinite loop
                    strt = mo.start()+1

        return strg


    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes


    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n"+("|    "*lev)+"    "
            else:
                leafsep = ""
            if self.simpleFirst:
                # print the branches that end in a leaf (or a null node) first
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            # then recurse into the subtrees; with simpleFirst off, the
            # leaves are printed here as well, in order of appearance
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)


    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)


    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName+chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                self.formatString(self.leafStr, node, parent)))


    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    # base64-encode the name so that arbitrary attribute and branch names
    # yield valid, collision-free DOT node identifiers
    return '"%s"' % (base64.b64encode(x))

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier:
            baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
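
        For example, the following sketch (assuming ``classifier`` is a
        :obj:`TreeClassifier` and ``data`` an :obj:`Orange.data.Table`
        built elsewhere) retrieves both the value and the distribution::

            value, dist = classifier(data[0],
                Orange.classification.Classifier.GetBoth)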
2875        """
2876        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
2877
    def __setattr__(self, name, value):
        if name == "nativeClassifier":
            self.__dict__[name] = value
            return
        if name in self.nativeClassifier.__dict__:
            self.nativeClassifier.__dict__[name] = value
        self.__dict__[name] = value

    def dump(self, leafStr="", nodeStr="", **argkw):
        """
        Return a string representation of a tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" will be used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, the subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches consisting of a
          single node are printed before the branches with larger subtrees.
          If False, the branches are printed in order of appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of regular expressions and callback
          functions through which the user can print out other specific
          information in the nodes.
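
        For example, assuming ``tree`` is a :obj:`TreeClassifier` built
        on some data, the following sketch prints the internal nodes
        with the same format string as the leaves, limited to three
        levels of depth::

            print tree.dump(nodeStr=".", maxDepth=3)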
2913        """
2914        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + 
2915            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0), 
2916            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
2917            self).dumpTree()
2918
    def dot(self, fileName, leafStr="", nodeStr="", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """Print the tree to a file in a format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        Uses the same parameters as :meth:`dump` defined above,
        plus two parameters which define the shapes used for internal
        nodes and leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
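
        For example, the following sketch (assuming ``tree`` is a
        :obj:`TreeClassifier`) writes the tree to ``tree.dot``, which can
        then be rendered with, e.g., ``dot -Tpng tree.dot -o tree.png``::

            tree.dot("tree.dot", leafShape="box", nodeShape="ellipse")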
2935        """
2936        fle = type(fileName) == str and file(fileName, "wt") or fileName
2937
2938        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + 
2939            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0), 
2940            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
2941            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()
2942
    def count_nodes(self):
        """
        Return the number of nodes.
        """
        return _countNodes(self.tree)

    def count_leaves(self):
        """
        Return the number of leaves.
        """
        return _countLeaves(self.tree)

def _countNodes(node):
    count = 0
    if node:
        count += 1
        if node.branches:
            for branch in node.branches:
                count += _countNodes(branch)
    return count

def _countLeaves(node):
    count = 0
    if node:
        if node.branches: # internal node
            for branch in node.branches:
                count += _countLeaves(branch)
        else:
            count += 1
    return count
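
# A minimal usage sketch for the counters above, assuming `tree` is a
# TreeClassifier built elsewhere:
#   print tree.count_nodes(), tree.count_leaves()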