source: orange/orange/Orange/classification/tree.py @ 7755:6e519f6df8fe

1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
12To build a small tree (:obj:`TreeClassifier`) from the iris data set
13(with the depth limited to three levels), use
14(part of `orngTree1.py`_, uses `iris.tab`_):
15
16.. literalinclude:: code/orngTree1.py
17   :lines: 1-4
18
.. _orngTree1.py: code/orngTree1.py
.. _iris.tab: code/iris.tab
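
The script's opening lines are not reproduced in this excerpt; they are
presumably along these lines (a sketch assuming the Orange 2.x API; the
``maxDepth`` argument is an assumption standing in for whatever depth
limit the script actually sets)::

    import Orange

    data = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(data, maxDepth=3)
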

This page first describes the learner and the classifier, and then
defines the base classes (individual components) of the trees
and the tree-building process.

.. autoclass:: TreeLearner
    :members:

.. autoclass:: TreeClassifier
    :members:

========
Examples
========

For example, here's how to write your own stop
function. The example constructs
and prints two trees. For the first one we define the *defStop*
function, which is used by default, and combine it with a random function
so that the stop criterion is also met in 20% of the cases
where *defStop* is false. For the second tree the stopping criterion
is entirely random. Note that in
the second case the lambda function still takes three parameters, since
that is the required signature for a stop function (:obj:`StopCriteria`).
Part of `tree3.py`_ (uses `iris.tab`_):

.. _tree3.py: code/tree3.py

.. literalinclude:: code/tree3.py
   :lines: 8-23

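If the linked file is not at hand, the key part looks roughly like this
(a sketch assuming the Orange 2.x API; ``randint(1, 5) == 1`` supplies
the extra 20% chance of stopping)::

    from random import randint

    import Orange

    data = Orange.data.Table("iris")

    # the default stop criterion, made to fire in an extra 20% of cases
    defStop = Orange.classification.tree.StopCriteria()
    learner = Orange.classification.tree.TreeLearner()
    learner.stop = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or randint(1, 5) == 1
    treeCombined = learner(data)

    # an entirely random stop criterion; note the three required parameters
    learner.stop = lambda examples, weightID, contingency: randint(1, 5) == 1
    treeRandom = learner(data)
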
The output is not shown here since the resulting trees are rather
big.

Tree Structure
==============

To have something to work on, we'll take the data from the lenses
dataset and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 7-10

How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?

.. _lenses.tab: code/lenses.tab
.. _treestructure.py: code/treestructure.py

.. literalinclude:: code/treestructure.py
   :lines: 12-21

If the node is None, we have a null node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (this node) plus the
sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.

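Since the file itself is not shown above, here is a sketch of such a
function, written exactly as the preceding paragraph describes it::

    def treeSize(node):
        if not node:
            return 0                         # a null node counts as zero
        size = 1                             # this node
        if node.branchSelector:              # internal node: add the subtrees
            for branch in node.branches:
                size += treeSize(branch)
        return size
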
    >>> treeSize(treeClassifier.tree)
    10

Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.

Let us now write a script that prints out a tree. The recursive
part of the function will get a node and its level
(part of `treestructure.py`_, uses `lenses.tab`_).

.. literalinclude:: code/treestructure.py
   :lines: 26-41

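A sketch of :obj:`printTree0`, reconstructed from the description below
and the output shown later (the exact spacing is a guess)::

    def printTree0(node, level):
        if not node:
            print " " * level + "<null node>",
            return
        if node.branchSelector:
            # internal node: print the attribute's name and the distribution
            nodeDesc = node.branchSelector.classVar.name
            print "\n" + "   " * level + "%s (%s)" % (nodeDesc, node.distribution),
            for i in range(len(node.branches)):
                print "\n" + "   " * level + ": %s" % node.branchDescriptions[i],
                printTree0(node.branches[i], level + 1)
        else:
            # leaf: print the distribution and the predicted class
            majorClass = node.nodeClassifier.defaultValue
            print "--> %s (%s)" % (majorClass, node.distribution),
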
Don't waste time studying the formatting tricks (\n's etc.); they are just
for nicer output. What matters is everything but the print statements.
First, we check whether the node is a null node (a node to which no
learning examples were classified). If so, we just print out
"<null node>" and return.

After handling null nodes, the remaining nodes are either internal nodes
or leaves. For internal nodes, we print a node description consisting of the
attribute's name and the distribution of classes. :obj:`Node`'s branch
selector is, for all currently defined splits, an instance of a
class derived from :obj:`orange.Classifier` (in fact, it is a
:obj:`orange.ClassifierFromVarFD`, but an :obj:`orange.Classifier` would
suffice), and its :obj:`classVar` points to the attribute we seek.
So we print its name. We will also assume that storing class distributions
has not been disabled and print them as well.
Then we iterate
through the branches; for each we print a branch description and recursively
call :obj:`printTree0` with the level increased by 1 (to increase
the indent).

Finally, if the node is a leaf, we print out the distribution of
learning examples in the node and the class to which the examples in
the node would be classified. We again assume that the :obj:`nodeClassifier`
is the default one - a :obj:`DefaultClassifier`. A better print
function should be aware of possible alternatives.

Now, we just need to write a simple function to call our printTree0.
We could write something like...

::

    def printTree(x):
        printTree0(x.tree, 0)

... but we won't. Let us learn how to handle arguments of different
types. Let's write a function that will accept either a
:obj:`TreeClassifier` or a :obj:`Node`.
Part of `treestructure.py`_, uses `lenses.tab`_.

.. literalinclude:: code/treestructure.py
   :lines: 43-49
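
A sketch of those lines, for reference (the exact exception message is a
guess)::

    def printTree(x):
        if isinstance(x, Orange.classification.tree.TreeClassifier):
            printTree0(x.tree, 0)
        elif isinstance(x, Orange.classification.tree.Node):
            printTree0(x, 0)
        else:
            raise TypeError("invalid parameter")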

It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's a
:obj:`Node`, we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. The output::

    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> none (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

For a final exercise, let us write a simple pruning function. It will
be written entirely in Python, unrelated to any :obj:`Pruner`. It
will limit the maximal tree depth (the number of internal nodes on any
path down the tree) given as an argument.
For example, to get a two-level tree, we would
call cutTree(root, 2). The function will be recursive, with the second
argument (level) decreasing at each call; when it reaches zero, the
current node will be made a leaf (part of `treestructure.py`_, uses
`lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 54-62

There's nothing to prune at null nodes or leaves, so we act only when
:obj:`node` and :obj:`node.branchSelector` are defined. If the level is
not zero, we call the function for each branch. Otherwise, we clear
the selector, branches and branch descriptions.
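
A sketch of such a function, following the description above::

    def cutTree(node, level):
        # nothing to prune at null nodes or leaves
        if node and node.branchSelector:
            if level:
                for branch in node.branches:
                    cutTree(branch, level - 1)
            else:
                # make the current node a leaf
                node.branchSelector = None
                node.branches = None
                node.branchDescriptions = None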

    >>> cutTree(treeClassifier.tree, 2)
    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no --> soft (<1.000, 5.000, 0.000>)
       : yes --> hard (<2.000, 0.000, 4.000>)

Learning
========

You could just call :class:`TreeLearner` and let it fill the empty
slots with the default
components. This section will teach you three things: what the
missing components are (and how to set the same components yourself),
how to use alternative components to get a different tree and,
finally, how to write a skeleton for tree induction in Python.

.. _treelearner.py: code/treelearner.py

Let us construct a :obj:`TreeLearner` to play with (`treelearner.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treelearner.py
   :lines: 7-10

There are three crucial components in learning: the :obj:`~TreeLearner.split` and
:obj:`~TreeLearner.stop`
criteria, and the example :obj:`~TreeLearner.splitter` (there are some others,
which become important during classification; we'll talk about them
later). They are not defined; if you use the learner, the slots are
filled temporarily but later cleared again.

::

    >>> print learner.split
    None
    >>> learner(data)
    <TreeClassifier instance at 0x01F08760>
    >>> print learner.split
    None

Stopping criteria
=================

The stop is trivial. The default is set by

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

We can now examine the default stopping parameters.

    >>> print learner.stop.maxMajority, learner.stop.minExamples
    1.0 0.0

Not very restrictive. This keeps splitting the examples until
there's nothing left to split or all the examples are in the same
class. Let us set the minimal subset that we allow to be split to
five examples and see what comes out.

    >>> learner.stop.minExamples = 5.0
    >>> tree = learner(data)
    >>> print tree.dump()
    tear_rate=reduced: none (100.00%)
    tear_rate=normal
    |    astigmatic=no
    |    |    age=pre-presbyopic: soft (100.00%)
    |    |    age=presbyopic: none (50.00%)
    |    |    age=young: soft (100.00%)
    |    astigmatic=yes
    |    |    prescription=hypermetrope: none (66.67%)
    |    |    prescription=myope: hard (100.00%)

OK, that's better. If we want an even smaller tree, we can also limit
the maximal proportion of the majority class.

    >>> learner.stop.maxMajority = 0.5
    >>> tree = learner(data)
    >>> print tree.dump()
    none (62.50%)

=================================
Learner and Classifier Components
=================================

Classification trees are represented as a tree-like hierarchy of
:obj:`Node` classes.

Several classes described above are already functional and can
(and mostly will) be used as they are: :obj:`Node`,
:obj:`_TreeLearner` and :obj:`TreeClassifier`.
The classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter` and :obj:`Descender`
are among the Orange (C++ implemented)
classes that can be subclassed
in Python, so you can program your own components based on
these classes.

.. class:: Node

    Node stores information about the learning examples belonging
    to the node, a branch selector, a list of branches (if the node is
    not a leaf) with their descriptions and strengths, and a classifier.

    .. attribute:: distribution

        Stores a distribution of the learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`_TreeLearner`'s :obj:`storeDistributions` flag to false.

    .. attribute:: contingency

        Stores complete contingency matrices for the learning examples
        belonging to the node. Storing contingencies can be enabled by
        setting :obj:`_TreeLearner`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies get computed and stored to
        :obj:`Node`, but are removed shortly afterwards.
        The details are given in the
        description of the :obj:`_TreeLearner` object.

    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the
        corresponding ID of the weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.

    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if the
        :obj:`_TreeLearner`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.

    If the node is a leaf, the remaining fields are None.
    If it's an internal node, there are several additional fields.

    .. attribute:: branches

        Stores a list of subtrees, each given as a :obj:`Node`.
        An element can be None; in this case the branch is empty
        (a null node).

    .. attribute:: branchDescriptions

        A list with string descriptions for branches, constructed by
        :obj:`SplitConstructor`. It can contain different kinds
        of descriptions, but basically, expect things like 'red' or '>12.3'.

    .. attribute:: branchSizes

        Gives a (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.

    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified into any branch, the selector
        can return an :obj:`Orange.data.Value` containing a special value
        (sVal), which should be a discrete distribution
        (:obj:`DiscDistribution`). This represents the
        :obj:`branchSelector`'s opinion on how to divide the
        example between the branches. Whether the proposition will be
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).

    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none if it is a leaf.

    .. method:: treeSize()

        Return the number of nodes in the subtrees (including the
        node, excluding null nodes).
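
As a small illustration of these attributes, here is a hedged sketch
that counts the leaves of a tree by walking :obj:`Node` objects
directly::

    def countLeaves(node):
        if not node:
            return 0                         # a null node
        if not node.branchSelector:
            return 1                         # a leaf
        return sum(countLeaves(branch) for branch in node.branches)
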
.. class:: _TreeClassifier

    Classifies examples according to a tree stored in :obj:`tree`.
    Not meant to be used directly. The :class:`TreeLearner` class
    constructs a :class:`TreeClassifier`.

    .. attribute:: tree

        The root of the tree, represented as a :class:`Node`.

    Classification would be straightforward if there were no unknown
    values or, in general, examples that cannot be placed into a
    single branch. The response in such cases is determined by a
    component :obj:`descender`.

    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is the most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.

    In general there are three possible outcomes of a descent.

    #. The descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them over by
       selecting a single branch and continuing the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node.  Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to split the example (i.e., to decide
       the class by voting). In this case, it returns a :obj:`Node` and
       the vote-weights for the branches.
       The weights can correspond to the distribution returned by
       :obj:`branchSelector`, to the number of learning examples that
       were assigned to each branch, or to something else.

    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.

    When voting, subtrees do not predict the class but probabilities
    of classes. The predictions are multiplied by weights, summed and
    the most probable class is returned.

    .. method:: vote()

        Gets a
        :obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
        vote weights. For each branch, it calls
        :obj:`classDistribution`, multiplies the resulting distributions
        by the vote weights and sums them up. :obj:`vote` returns a
        normalized distribution of predictions.

    .. method:: classDistribution()

        Gets
        an additional parameter, a :obj:`Node` (defaulting to the tree root).
        :obj:`classDistribution` uses :obj:`descender`. If the descender
        reaches a leaf, it calls
        :obj:`nodeClassifier`, otherwise it calls :obj:`vote`.

        Thus, :obj:`vote` and
        :obj:`classDistribution` are written as a double
        recursion. The recursive calls do not happen at each node of the
        tree but only at nodes where a vote is needed (that is, at nodes
        where the descender halts).

    .. method:: __call__

        Calls the
        descender. If it reaches a leaf, the class is predicted by the
        leaf's :obj:`nodeClassifier`. Otherwise, it calls
        :obj:`vote`. From then on, :obj:`vote` and
        :obj:`classDistribution` interweave down the tree and return
        a distribution of predictions. This method simply
        chooses the most probable class.

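To make the double recursion concrete, here is a hedged Python sketch of
how :obj:`vote` and :obj:`classDistribution` interweave. The real
implementation is in C++, and the exact calling conventions used here
are assumptions; normalization details are simplified::

    def classDistribution(node, example, descender):
        node, voteWeights = descender(node, example)
        if voteWeights is None:
            # a leaf was reached, or the descender halted: ask the node
            return node.nodeClassifier(
                example, Orange.classification.Classifier.GetProbabilities)
        return vote(node, example, voteWeights, descender)

    def vote(node, example, voteWeights, descender):
        votes = None
        for branch, weight in zip(node.branches, voteWeights):
            if not branch:
                continue                     # skip null branches
            dist = [weight * p
                    for p in classDistribution(branch, example, descender)]
            votes = dist if votes is None else \
                [v + p for v, p in zip(votes, dist)]
        total = sum(votes)
        return [v / total for v in votes]
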
.. class:: _TreeLearner

    The main learning object is :obj:`_TreeLearner`. It is basically
    a skeleton into which the user must plug the components for particular
    functions. This class is not meant to be used directly. You should
    rather use :class:`TreeLearner`.

    Components that govern the structure of the tree are :obj:`split`
    (of type :obj:`SplitConstructor`), :obj:`stop` (of
    type :obj:`StopCriteria`) and :obj:`exampleSplitter`
    (of type :obj:`ExampleSplitter`).

    .. attribute:: split

        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`_TreeLearner`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous attributes
        are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        and five examples in a leaf for continuous attributes.

    .. attribute:: stop

        Object of type :obj:`StopCriteria`. The default stopping
        criterion stops induction when all examples in a node belong
        to the same class.

    .. attribute:: splitter

        Object of type :obj:`ExampleSplitter`. The default splitter
        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits
        the learning examples according to distributions given by the
        selector.

    .. attribute:: contingencyComputer

        By default, this slot is left empty and ordinary contingency
        matrices are computed for examples at each node. If need arises,
        one can change the way the matrices are computed. This can be
        used to change the way that unknown values are treated when
        assessing qualities of attributes. As mentioned earlier,
        the computed matrices can be used by the split constructor and
        by the stopping criteria. On the other hand, they can be (and
        are) ignored by some splitting constructors.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_UnknownMergeAsSelector` which votes using
        the :obj:`branchSelector`'s distribution for vote weights.

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent infinite tree induction
        due to a misconfigured stop criterion. If you are sure you need
        larger trees, increase it. If you, on the other hand, want
        to lower this hard limit, you can do so as well.

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decide whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same object
        that is stored in :obj:`contingency.classes`.

    The :obj:`_TreeLearner` first sets the defaults for missing
    components. Although stored in the actual :obj:`_TreeLearner`'s
    fields, they are removed when the induction is finished.

    Then it ensures that examples are stored in a table. This is needed
    because the algorithm juggles with pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to assure that pointers
    remain pointing to examples even if the user later changes the
    example table. If they are in the table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.

    Apriori class probabilities are computed. At this point we check
    the sum of example weights; if it's zero, there are no examples and
    we cannot proceed. A list of candidate attributes is set; in the
    beginning, all attributes are candidates for the split criterion.

    Now comes the recursive part of the :obj:`_TreeLearner`. Its arguments
    are a set of examples, a weight meta-attribute ID (a tricky thing,
    it can always be the same as the original or can change to
    accommodate splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).

    The contingency matrix is computed next. This happens
    even if the flag :obj:`storeContingencies` is false.
    If the :obj:`contingencyComputer` is given, we use it;
    otherwise, we construct just an ordinary contingency matrix.

    A :obj:`stop` is called to see whether it's worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if the
    :obj:`forceNodeClassifier` flag is set.

    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
    the choice of :obj:`contingencyComputer` will, in many cases,
    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
    return no classifier; if so, and if the classifier would be
    needed for classification, the :obj:`TreeClassifier`'s function
    returns DK or an empty distribution. If you're writing your own
    tree classifier - pay attention.

    If the induction is to continue, a :obj:`split` component is called.
    If it fails to return a branch selector, induction stops and the
    :obj:`Node` is returned.

    :obj:`_TreeLearner` then uses the :obj:`ExampleSplitter` to divide
    the examples as described above.

    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they will.

    The :obj:`_TreeLearner` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    a corresponding weight is used for each branch. Besides, the
    attribute spent by the splitter (if any) is removed from the
    list of candidates for the subtree.

    A subset of examples is stored in its corresponding tree node,
    if so requested. If not, the new weight attributes are removed (if
    any were created).

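The recursion just described can be condensed into a hedged Python
skeleton. The real induction is implemented in C++; weight handling,
contingency storage and example storing are omitted here, and the
component call signatures are assumptions based on the descriptions
below::

    def induceTree(examples, weightID, candidates, learner):
        contingency = Orange.statistics.contingency.Domain(examples)
        node = Orange.classification.tree.Node()
        node.distribution = contingency.classes

        # ask the stop criterion whether to make this node a leaf
        if learner.stop(examples, weightID, contingency):
            node.nodeClassifier = learner.nodeLearner(examples, weightID)
            return node

        node.branchSelector, node.branchDescriptions, node.branchSizes, \
            quality, spent = learner.split(examples, weightID, contingency,
                                           node.distribution, candidates, None)
        if not node.branchSelector:
            # the split constructor vetoed further induction
            node.nodeClassifier = learner.nodeLearner(examples, weightID)
            return node

        if spent >= 0:
            candidates = list(candidates)
            candidates[spent] = False        # a spent attribute can't reappear

        subsets, weights = learner.splitter(node, examples, weightID)
        node.branches = [induceTree(subset, weightID, candidates, learner)
                         if subset else None for subset in subsets]
        return node
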

Split constructors
=====================

Split construction is almost as exciting as waiting for a delayed flight.
Boring, that is. Split constructors juggle
with contingency matrices, with separate cases for discrete
and continuous classes... Most split constructors work either for
discrete or for continuous attributes. We suggest using
a :obj:`SplitConstructor_Combined` that delegates
attributes to specialized split constructors.

Split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)

The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).

.. class:: SplitConstructor

    Finds a suitable criterion for dividing the learning (and later testing)
    examples coming to the node. The data it gets is a set of examples
    (and, optionally, an ID of the weight meta-attribute), a domain
    contingency computed from examples, apriori class probabilities,
    a list of candidate attributes it should consider and a node
    classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).

    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold; one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain contingency
    does not suffice; for example, when ReliefF is used as a measure
    of quality of attributes. In this case, there's no other way but
    to use the examples and ignore the precomputed contingencies.

    :obj:`SplitConstructor` returns most of the data we talked
    about when describing the :obj:`Node`. It returns a classifier
    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
    branch descriptions and a list with the number of examples that
    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`_TreeLearner` will find the numbers itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.

    In addition, it returns the quality of the split: a number without
    any fixed meaning except that higher numbers mean better splits.

    If the constructed splitting criterion uses an attribute in such
    a way that the attribute is 'completely spent' and should not be
    considered as a split criterion in any of the subtrees (the
    typical case of this are discrete attributes that are used
    as-they-are, that is, without any binarization or subsetting),
    then it should report the index of this attribute. Some splits
    do not spend any attribute; this is indicated by returning a
    negative index.

    A :obj:`SplitConstructor` can veto the further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.

    .. attribute:: minSubset

        Sets the minimal number of examples in non-null leaves. As
        always in Orange (where not specified otherwise), "number of
        examples" refers to the weighted number of examples.

    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])

        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`selector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.

        :param examples:  Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
            list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori_distribution: Should be of type
            :obj:`orange.Distribution`.
        :param candidates: A Python list of objects which are interpreted
            as booleans; the split constructor should consider only
            the attributes marked as candidates.

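Split constructors can also be written in Python. Here is a hedged
sketch of one that delegates to :obj:`SplitConstructor_Attribute` and
vetoes splits whose quality falls below a threshold; the six-argument
call convention follows the prose above, and whether the learner accepts
a plain function here, as it does for ``stop``, is an assumption::

    baseSplit = Orange.classification.tree.SplitConstructor_Attribute(
        measure=Orange.feature.scoring.GainRatio())

    def thresholdedSplit(examples, weightID, contingency,
                         apriori, candidates, nodeClassifier):
        selector, descriptions, sizes, quality, spent = baseSplit(
            examples, weightID, contingency,
            apriori, candidates, nodeClassifier)
        if quality < 0.1:
            # veto further induction by returning no selector
            return None, None, None, 0.0, -1
        return selector, descriptions, sizes, quality, spent

    learner = Orange.classification.tree.TreeLearner()
    learner.split = thresholdedSplit
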
.. class:: SplitConstructor_Measure

    Bases: :class:`SplitConstructor`

    An abstract base class for split constructors that employ
    a :class:`Orange.feature.scoring.Measure` to assess the quality of a split.
    All split constructors except for :obj:`SplitConstructor_Combined`
    are derived from this class.

    .. attribute:: measure

        A component of type :class:`Orange.feature.scoring.Measure` used for
        split evaluation. You must select a
        :class:`Orange.feature.scoring.Measure` capable of handling your
        class type
        - for example, you cannot use :class:`Orange.feature.scoring.GainRatio`
        for building regression trees or :class:`Orange.feature.scoring.MSE`
        for classification trees.

    .. attribute:: worstAcceptable

        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
        :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Attribute

    Bases: :class:`SplitConstructor_Measure`

    Attempts to use a discrete attribute as a split; each value of the
    attribute corresponds to a branch in the tree. Attributes are
    evaluated with the :obj:`measure` and the one with the
    highest score is used for a split. If there is more than one
    attribute with the highest score, one of them is selected at random.

    The constructed :obj:`branchSelector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. If the attribute is
    :obj:`Orange.data.variable.Discrete`, the
    :obj:`branchDescriptions` are the attribute's values. The
    attribute is marked as spent, so that it cannot reappear in the
    node's subtrees.

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Measure`

    Works on discrete attributes. For each attribute, it determines
    which binarization of the attribute gives the split with the
    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
    it returns one of those with the highest score.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. This time, however, its :obj:`transformer`
    contains an instance of :obj:`MapIntValue` that maps the values
    of the attribute into a binary attribute. Branch descriptions are
    of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding to
    more than one value of the attribute. Branches that correspond to
    a single value of the attribute are described with this value. If
    the attribute was originally binary, it is spent and cannot be
    used in the node's subtrees. Otherwise, it can reappear in the
    subtrees.

.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Measure`

    This is currently the only constructor for splits with continuous
    attributes. It divides the range of the attribute's values with a
    threshold that maximizes the split's quality. As always, if there is
    more than one split with the highest score, a random threshold is
    selected. The attribute that yields the best binary split is returned.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached
    :obj:`transformer`. This time, :obj:`transformer` is of type
    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
    "<threshold" and ">=threshold". The attribute is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Measure`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    This constructor delegates the task of finding the optimal split
    to separate split constructors for discrete and for continuous
    attributes. Each split constructor is called, given only attributes
    of the appropriate type as candidates. Both construct a candidate for
    a split; the better of them is selected.

    (Note that there is a problem when more candidates have the same
    score. Suppose there are nine discrete attributes with the highest
    score; the split constructor for discrete attributes will select
    one of them. Now, let us suppose that there is a single continuous
    attribute with the same score. :obj:`SplitConstructor_Combined`
    would randomly select between the proposed discrete attribute and
    the continuous attribute, not aware of the fact that the discrete
    attribute has already competed with eight other discrete attributes.
    So, the probability for selecting (each) discrete attribute would be
    1/18 instead of 1/10. Although not really correct, we doubt that this
    would affect the tree's performance; many other machine learning
    systems simply choose the first attribute with the highest score
    anyway.)

    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
    the attribute is spent is decided by the winning split constructor.

    .. attribute:: discreteSplitConstructor

        Split constructor for discrete attributes; can be, for instance,
        :obj:`SplitConstructor_Attribute` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuousSplitConstructor

        Split constructor for continuous attributes; at the moment, it
        can be either :obj:`SplitConstructor_Threshold` or a
        split constructor you programmed in Python.


StopCriteria and StopCriteria_common
============================================

:obj:`StopCriteria` determines when to stop the induction of subtrees.

.. class:: StopCriteria

    Given a set of examples, the weight ID and contingency matrices,
    decide whether to continue the induction or not. The basic criterion
    checks whether there are any examples and whether they belong to at
    least two different classes (if the class is discrete). Derived
    components check things like the number of examples and the
    proportion of majority classes.

    As opposed to :obj:`SplitConstructor` and similar basic classes,
    :obj:`StopCriteria` is not an abstract but a fully functional
    class that provides the basic stopping criteria. That is, the tree
    induction stops when there is at most one example left; in this case,
    it is not the weighted but the actual number of examples that counts.
    Besides that, the induction stops when all examples are in the same
    class (for discrete problems) or have the same value of the outcome
    (for regression problems).

    .. method:: __call__(examples[, weightID, domain contingencies])

        Decides whether to stop (true) or continue (false) the induction.
        If contingencies are given, they are used for checking whether
        the examples are in the same class (but not for counting the
        examples). Derived classes should use the contingencies whenever
        possible; if they are not given, :obj:`StopCriteria`
        will work without them, and derived classes should then compute
        them only when they really need them.



.. class:: StopCriteria_common

    :obj:`StopCriteria_common` contains additional criteria for
    pre-pruning: it checks the proportion of majority class and the
    number of weighted examples.

    .. attribute:: maxMajority

        Maximal proportion of majority class. When this is exceeded,
        induction stops.

    .. attribute:: minExamples

        Minimal number of examples in internal nodes. Subsets with fewer
        than :obj:`minExamples` examples are not split any further.
        The example count is weighted.

.. class:: StopCriteria_Python

    Undocumented.

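Though :obj:`StopCriteria_Python` itself is undocumented, stopping rules
can be written in Python as plain callables with the three-parameter
signature used in the examples above. A hedged sketch::

    defStop = Orange.classification.tree.StopCriteria()

    def fewExamplesStop(examples, weightID, contingency):
        # stop when the basic criterion fires or fewer than ten examples remain
        return defStop(examples, weightID, contingency) or len(examples) < 10

    learner = Orange.classification.tree.TreeLearner()
    learner.stop = fewExamplesStop
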
Example Splitters
=================

Just like the :obj:`Descender` decides the branch for an
example during classification, the :obj:`ExampleSplitter`
sorts the learning examples into branches.

:obj:`ExampleSplitter` is given a :obj:`Node` (it can use
various of its fields, but most splitters use only the
:obj:`branchSelector`), a set of examples to be divided, and
the weight ID. The result is a list of subsets of examples
and, optionally, a list of new weight IDs.

Most
:obj:`ExampleSplitter` classes simply call the node's
:obj:`branchSelector` and assign examples to the corresponding
branches. When the value is unknown they choose a particular
branch or simply skip the example.

Some enhanced splitters can split examples: an example (actually,
a pointer to it) is copied to more than one subset. To facilitate
real splitting, weights are needed. Each branch is assigned a
weight ID (each would usually have its own ID) and all examples
that are in that branch (either completely or partially) should
have this meta attribute. If an example hasn't been split, it
has only one additional attribute - with the weight ID corresponding
to the subset to which it went. An example that is split between,
say, three subsets, has three new meta attributes, one for each
subset. The IDs of the weight meta attributes are returned by the
:obj:`ExampleSplitter` to be used at induction of the
corresponding subtrees.

Note that weights are used only when needed. When no splitting
occurred - because the splitter is not able to do it or because
there was no need for splitting - no weight IDs are returned.

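Splitters, too, can be written in Python. Here is a hedged sketch of one
that sends examples with unknown values down a random branch, much like
:obj:`ExampleSplitter_UnknownsToRandom`; whether the learner accepts a
plain function in the :obj:`splitter` slot is an assumption::

    import random

    def randomUnknownsSplitter(node, examples, weightID=0):
        subsets = [Orange.data.Table(examples.domain)
                   for branch in node.branches]
        for example in examples:
            branch = node.branchSelector(example)
            if branch.isSpecial():
                # unknown value: pick a branch at random
                index = random.randrange(len(node.branches))
            else:
                index = int(branch)
            subsets[index].append(example)
        return subsets, None                 # no weights were created
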
.. class:: ExampleSplitter

    An abstract base class for objects that split sets of examples into
    subsets. The derived classes differ in how they treat examples which
    cannot be unambiguously placed into a single branch (usually due
    to an unknown value of the crucial attribute).

    .. method:: __call__(node, examples[, weightID])

        Use the information in :obj:`node` (particularly the
        :obj:`branchSelector`) to split the given set of examples into subsets.
        Return a tuple with a list of example generators and a list of weights.
        The list of weights is either an ordinary Python list of integers
        or None when no splitting of examples occurs and thus no weights
        are needed.


.. class:: ExampleSplitter_IgnoreUnknowns

    Bases: :class:`ExampleSplitter`

    Simply ignores the examples for which no single branch can be determined.

.. class:: ExampleSplitter_UnknownsToCommon

    Bases: :class:`ExampleSplitter`

    Places all such examples into the branch with the highest number of
    examples. If there is more than one such branch, one is selected at
    random and then used for all examples.

.. class:: ExampleSplitter_UnknownsToAll

    Bases: :class:`ExampleSplitter`

    Places examples with an unknown value of the attribute into all branches.

.. class:: ExampleSplitter_UnknownsToRandom

    Bases: :class:`ExampleSplitter`

    Selects a random branch for such examples.

.. class:: ExampleSplitter_UnknownsToBranch

    Bases: :class:`ExampleSplitter`

    Constructs an additional branch to contain all such examples.
    The branch's description is "unknown".

.. class:: ExampleSplitter_UnknownsAsBranchSizes

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the proportions of examples in each branch.

.. class:: ExampleSplitter_UnknownsAsSelector

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the distribution proposed by the selector (which is in most cases the
    same as the proportions of examples in branches).

Descenders
=============================

This is the classifier's counterpart of the :class:`ExampleSplitter`. It
decides the destiny of examples that need to be classified and cannot
be unambiguously put into a single branch.

.. class:: Descender

    An abstract base object for tree descenders.

    .. method:: __call__(node, example)

        Descends down the tree until it reaches a leaf or a node in
        which a vote of subtrees is required. In both cases, a tuple
        of two elements is returned; in the former, the tuple contains
        the reached node and None, in the latter it
        contains a node and weights of votes for subtrees (a list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the returned
        vote weights.

.. class:: Descender_UnknownsToNode

    Bases: :obj:`Descender`

    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`nodeClassifier`
    will be used to make a decision. It is your responsibility to see
    that even the internal nodes have their :obj:`nodeClassifier`
    (i.e., don't disable creating node classifiers, and don't remove
    them manually after the induction).

.. class:: Descender_UnknownsToBranch

    Bases: :obj:`Descender`

    Classifies examples with an unknown value to a special branch. This
    makes sense only if the tree itself was constructed with
    :obj:`ExampleSplitter_UnknownsToBranch`.

.. class:: Descender_UnknownsToCommonBranch

    Bases: :obj:`Descender`

    Classifies examples with unknown values to the branch with the
    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.

.. class:: Descender_UnknownsToCommonSelector

    Bases: :obj:`Descender`

    Classifies examples with unknown values to the branch which received
    the highest recommendation by the selector.

.. class:: Descender_MergeAsBranchSizes

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the sizes of the branches.

.. class:: Descender_MergeAsSelector

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.

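Descenders are plugged into the learner like any other component. A
hedged sketch (assuming the :class:`TreeLearner` wrapper forwards the
``descender`` attribute to the underlying :obj:`_TreeLearner`)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    learner.descender = \
        Orange.classification.tree.Descender_UnknownsToCommonBranch()
    tree = learner(data)
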
Pruning
=======

.. index::
    pair: classification trees; pruning

Tree pruners derived from :obj:`Pruner` can be given either a
:obj:`Node` (presumably, but not necessarily, a root) or a
:obj:`_TreeClassifier`. The result is a new :obj:`Node`
or a :obj:`_TreeClassifier` with a pruned tree. The original
tree remains intact.

The pruners construct only a shallow copy of the tree.
The pruned tree's nodes contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. Thus, you may modify a pruned tree structure
(manually cut it, add new nodes, replace components) but modifying,
for instance, some node's :obj:`nodeClassifier` (the
:obj:`nodeClassifier` itself, not a reference to it!) would also modify
the :obj:`nodeClassifier` in the corresponding node of
the original tree.

Pruners cannot construct a
:obj:`nodeClassifier` nor merge the node classifiers of the pruned
subtrees into classifiers for the new leaves. Thus, if you want to build
a prunable tree, internal nodes must have their :obj:`nodeClassifier`
defined. Fortunately, this is the default.

.. class:: Pruner

    This is an abstract base class which defines nothing useful, only
    a pure virtual call operator.

    .. method:: __call__(tree)

        Prunes a tree. The argument can be either a tree classifier or
        a tree node; the result is of the same type as the argument.

.. class:: Pruner_SameMajority

    Bases: :class:`Pruner`

    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
    with more than one leaf) in which all the leaves have the same majority
    class. (This is allowed because those leaves can still have different
    distributions of classes and thus predict different probabilities.)
    However, this can be undesired when we're only interested in the
    class prediction or a simple tree interpretation. The
    :obj:`Pruner_SameMajority` prunes the tree so that there is no
    subtree in which all the nodes would have the same majority class.

    This pruner will only prune the nodes in which the node classifier
    is of class :obj:`orange.DefaultClassifier` (or from a derived class).

    Note that the leaves with more than one majority class require some
    special handling. The pruning goes backwards, from leaves to the root.
    When siblings are compared, the algorithm checks whether they
    have (at least one) common majority class. If so, they can be pruned.

.. class:: Pruner_m

    Bases: :class:`Pruner`

    Prunes a tree by comparing m-estimates of static and dynamic
    error as defined in (Bratko, 2002).

    .. attribute:: m

        Parameter m for m-estimation.

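For instance, m-error pruning of a tree's root node might look like this
(a sketch; per the description above, a :obj:`Pruner` accepts a node or
a tree classifier and returns a result of the same type)::

    import Orange

    data = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(data)

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    prunedRoot = pruner(tree.tree)           # the original tree stays intact
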
Printing the tree
=================

The included printing functions can
print out practically anything you'd like to
know, from the number of examples, the proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in a parent node. And even more, you can
define your own callback functions to be used for printing.

Before we go on: you can read all about the function and use it to its
full extent, or you can just call it, giving it the tree as the sole
argument, and it will print out the usual textual representation of the
tree. If you're satisfied with that, you can stop here.

The magic is in the format string. It is a string which is printed
out at every leaf or internal node with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.

**^** at the start indicates that the number should be multiplied by 100.
It's useful for printing proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, :samp:`%N` denotes the number of examples in the node,
hence :samp:`%6.2N` would mean output to two decimal digits and six places
altogether. If left out, a default format :samp:`5.3` is used, unless you
multiply the numbers by 100, in which case the default is :samp:`.0`
(no decimals, the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node with.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` will give the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the entire data
set, so :samp:`%NbA` will tell you the proportion of examples (out of the entire
training data set) that fell into that node. If division is impossible
since the parent node does not exist or some data is missing, a dot is
printed out instead of the quantity.

**<quantity>** is the only required element. It defines what to print.
For instance, :samp:`%N` would print out the number of examples in the node.
Possible quantities are

:samp:`V`
    The value predicted at that node. You cannot define the precision
    or divisor here.

:samp:`N`
    The number of examples in the node.

:samp:`M`
    The number of examples in the majority class (that is, the class
    predicted by the node).

:samp:`m`
    The proportion of examples in the majority class.

:samp:`A`
    The average class for examples in the node; this is available only for
    regression trees.

:samp:`E`
    Standard error for the class of examples in the node; available for
    regression trees.

:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.

:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance in, :samp:`%5.3C="Iris-virginica"bP`
    - this will give the number of examples of Iris-virginica divided by the
    number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`

    For regression trees, you can use operators =, !=, <, <=, >, and >=,
    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
    check the number of examples in a certain interval: :samp:`%C[20, 22]`
    will give you the number of examples between 20 and 22 (inclusive)
    and :samp:`%C(20, 22)` will give the number of such examples excluding the
    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
    If you would like the examples outside the interval, add a :samp:`!`,
    like :samp:`%C!(20, 22]`.

:samp:`c`
    Same as above, except that it computes the proportion of the class
    instead of the number of examples.

:samp:`D`
    Prints out the number of examples in each class. You can use both
    precision (it is applied to each number in the distribution) and the
    divisor. This quantity cannot be computed for regression trees.

:samp:`d`
    Same as above, except that it shows proportions of examples. This
    again doesn't work with regression trees.

<user defined formats>
    You can add more, if you will. Instructions and examples are given at
    the end of this section.


Examples
--------

We shall build a small tree from the iris data set - we shall limit the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

.. _orngTree1.py: code/orngTree1.py

The easiest way to call the function is to pass the tree as the only
argument::

    >>> print tree.dump()
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)

Let's now print out the predicted class at each node, and the number
of examples in the majority class together with the total number of
examples in the node::

    >>> print tree.dump(leafStr="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

Would you like to know how the number of examples declines as
compared to the entire data set and to the parent node? We find
it with this::

    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)
Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.

And now for the output: all examples of setosa went into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.

If you find this guesswork annoying - so do I. Let us print out the
number of versicolors in each node, together with the proportion of
versicolors among the examples in this particular node and among all
versicolors. So,

::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

gives the following output::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, we may want to print out the distributions, using a simple
string :samp:`%D`::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

We haven't tried printing out any information for internal nodes.
To start with the most trivial case, we shall print the prediction
at each node.

::

    tree.dump(leafStr="%V", nodeStr=".")

says that the nodeStr should be the same as leafStr (not very useful
here, since leafStr is trivial anyway).

::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

Note that the output is somewhat different now: another node called
*root* has appeared and the tree looks one level deeper. This is
needed to print out the data for the root node too.

Now for something more complicated: let us observe how the number
of virginicas decreases down the tree::

    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')

Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is
printed out. In parentheses we print the same thing, except that we divide
by the examples in the parent node. Note the use of single quotes, so that
we can use the double quotes inside the string when we specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

See what's in the parentheses in the root node? If :meth:`~TreeClassifier.dump`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.

For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think why). In the leaves we shall also add the predicted
class. So now we'll have to call the function like this.

::

    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]

To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set. Called without arguments,
:meth:`TreeClassifier.dump` prints the tree like this::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9

Let us add the standard error in both internal nodes and leaves, and the
90% confidence intervals in the leaves::

    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
Since :samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
the printed number has the same number of decimals
as in the data file.

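To compare the two on the housing tree, you can simply put both
modifiers in the leaves (output not shown)::

    print tree.dump(leafStr="%V (average %A)")
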
Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of examples within a certain range. For instance,
let us check the number of examples with values below 22, and compare
this number with values in the parent nodes::

    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that among the examples with tax above
534, the proportion of examples with the class below 22 is 30 times
the corresponding proportion in the parent node.

For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percents.

::

    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")

OK, let's observe the format string for one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. With :samp:`%cbP![20, 22]` we divide this by the
same statistic computed on the parent. Add a :samp:`^` and you have
the percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)


Defining Your Own Printout Functions
------------------------------------

:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to
print out some other information in the leaves or nodes. If provided,
:obj:`userFormats` should contain a list of tuples with a regular expression
and a callback function to be called when that expression is found in the
format string. Expressions from :obj:`userFormats` are checked before
the built-in expressions discussed above, so you can override the built-ins
if you want to.

The regular expression should describe a string like those we used above,
for instance the string :samp:`%.2DbP`. When a leaf or internal node
is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The callback function will get five arguments: the format string
(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

The function can use several utility functions provided in the module.

.. autofunction:: insertStr

.. autofunction:: insertDot

.. autofunction:: insertNum

.. autofunction:: byWhom

There are also a few pieces of regular expressions that you may want
to reuse. The two you are likely to use are:

.. autodata:: fs

.. autodata:: by

For a trivial example, :samp:`%V` is implemented like this. There is the
following tuple in the list of built-in formats::

    (re.compile("%V"), replaceV)

:obj:`replaceV` is a function defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to *insertStr* to do the replacement.

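Putting this together, here is a minimal sketch of a user-defined format:
a hypothetical :samp:`%U` (the names :samp:`re_U` and :samp:`replaceU` are
ours, not part of the module) that prints the predicted value in upper
case::

    import re
    import Orange
    from Orange.classification.tree import insertStr

    re_U = re.compile("%U")

    def replaceU(strg, mo, node, parent, tree):
        # like the built-in %V, but upper-cases the predicted value
        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue).upper())

    data = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(data, maxDepth=3)
    print tree.dump(leafStr="%U (%^MbP%)", userFormats=[(re_U, replaceU)])

Since :obj:`userFormats` are checked before the built-in formats, the
remaining :samp:`%^MbP` in the string is still handled by the built-ins.
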
A more complex regular expression is the one for the proportion of
majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
two partial expressions defined above.

Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):

.. _orngTree2.py: code/orngTree2.py

.. literalinclude:: code/orngTree2.py
   :lines: 7-31

We first defined getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If we need to divide the quantity by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node with whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot, otherwise we call :func:`insertNum` which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.

We can now print out the iris tree:

.. literalinclude:: code/orngTree2.py
    :lines: 33

And we get::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)

Plotting the Tree using Dot
---------------------------

Suppose you saved the tree in a file "tree5.dot". You can then
print it out as a gif if you execute the following command line

::

    dot -Tgif tree5.dot -otree5.gif

GraphViz's dot has quite a few other output formats, check
its documentation to learn which.
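
For instance, assuming dot is on your PATH, PNG and PostScript versions
can be produced in the same way::

    dot -Tpng tree5.dot -otree5.png
    dot -Tps tree5.dot -otree5.ps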


===========================
C4.5 Classifier and Learner
===========================

As C4.5 is a standard benchmark in machine learning,
it is incorporated in Orange, although Orange has its own
implementation of decision trees.

The implementation uses the original Quinlan's code for learning, so the
tree you get is exactly like the one that would be built by the standalone
C4.5. Upon return, however, the original tree is copied to Orange
components that contain exactly the same information plus what is needed
to make them visible from Python. To be sure that the algorithm behaves
just as the original, we use a dedicated class :class:`C45Node`
instead of reusing the components used by Orange's tree inducer
(i.e., :class:`Node`). This, however, could be done and probably
will be done in the future; we shall still retain :class:`C45Node`
but offer transformation to :class:`Node` so that routines
that work on Orange trees will also be usable for C4.5 trees.

:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most of Orange
learning algorithms, C4.5 does not accept weighted examples.

Building the C4.5 plug-in
=========================

We haven't been able to obtain the legal rights to distribute
C4.5 and therefore couldn't statically link it into Orange. Instead,
it's incorporated as a plug-in which you'll need to build yourself.
The procedure is trivial, but you'll need a C compiler. On Windows,
the scripts we provide work with MS Visual C and the files CL.EXE
and LINK.EXE must be on the PATH. On Linux you're equipped with
gcc. Mac OS X comes without gcc, but you can download it for
free from Apple.

Orange must be installed prior to building C4.5. (This is because
the build script will copy the created file next to Orange,
which it obviously can't if Orange isn't there yet.)

#. Download the
   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `Rule Quest's site <http://www.rulequest.com/>`_ and extract
   them into some temporary directory. The files will be modified in the
   further process, so don't use a copy of Quinlan's sources that you
   need for another purpose.
#. Download
   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
   and unzip its contents into the directory R8/Src of the Quinlan's
   sources (it's the directory that contains, for instance, the file
   average.c).
#. Run buildC45.py, which will build the plug-in and put it next to
   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
   If this fails, something went wrong; see the diagnostic messages from
   buildC45.py and read the paragraph below.
#. Finally, you can remove the Quinlan's stuff, along with everything
   created by buildC45.py.

If the process fails, here's what buildC45.py really does: it creates
.h files that wrap Quinlan's .i files and ensure that they are not
included twice. It modifies C4.5 sources to include .h's instead of
.i's. This step can hardly fail. Then follows the platform-dependent
step which compiles ensemble.c (which includes all the Quinlan's .c
files it needs) into c45.dll or c45.so and puts it next to Orange.
If this fails, but you do have a C compiler and linker, and you know
how to use them, you can compile ensemble.c into a dynamic
library yourself. See the compile and link steps in buildC45.py,
if it helps. Anyway, after doing this, check that the built C4.5
gives the same results as the original.

.. autoclass:: C45Learner
    :members:

.. autoclass:: C45Classifier
    :members:

.. class:: C45Node

    This class is a reimplementation of the corresponding *struct* from
    Quinlan's C4.5 code.

    .. attribute:: nodeType

        Type of the node: :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split examples based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute and "subsets" use discrete attributes but with subsetting
        so that several values can go into the same branch.

    .. attribute:: leaf

        Value returned by that leaf. The field is defined for internal
        nodes as well.

    .. attribute:: items

        Number of (learning) examples in the node.

    .. attribute:: classDist

        Class distribution for the node (of type
        :obj:`orange.DiscDistribution`).

    .. attribute:: tested

        The attribute used in the node's test. If node is a leaf,
        :obj:`tested` is None; if node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute, and if
        node is of type :obj:`Cut`, :obj:`tested` is a continuous
        attribute.

    .. attribute:: cut

        A threshold for continuous attributes, if node is of type :obj:`Cut`.
        Undefined otherwise.

    .. attribute:: mapping

        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the branch index for an example whose value of :obj:`tested`
        is *i*. Here, *i* denotes an index of the value, not a
        :class:`Orange.data.Value`.

    .. attribute:: branch

        A list of branches stemming from this node.

Examples
========

.. _tree_c45.py: code/tree_c45.py
.. _iris.tab: code/iris.tab

The simplest way to use :class:`C45Learner` is to call it. This
script constructs the same learner as you would get by calling
the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):

.. literalinclude:: code/tree_c45.py
   :lines: 7-14

Arguments can be set by the usual mechanism (the two lines below do the
same, except that one uses command-line symbols and the other internal
variable names)

::

    tree = Orange.classification.tree.C45Learner(data, m=100)
    tree = Orange.classification.tree.C45Learner(data, minObjs=100)

The way that could be preferred by veteran C4.5 users might be through
the method :obj:`C45Learner.commandline`.

::

    lrn = Orange.classification.tree.C45Learner()
    lrn.commandline("-m 1 -s")
    tree = lrn(data)

There's nothing special about using :obj:`C45Classifier` - it's
just like any other. To demonstrate what the structure of
:class:`C45Node`'s looks like, we will show a script that prints
it out in the same format as C4.5 does.

.. literalinclude:: code/tree_c45_printtree.py

Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
:obj:`C45Node` does not know to which attribute it belongs), we need to
convert it to a string through :obj:`classVar`, which is passed as an
extra argument to the recursive part of printTree.

For discrete splits without subsetting, we print out all attribute values
and recursively call the function for all branches. Continuous splits are
equally easy to handle.

For discrete splits with subsetting, we iterate through branches, retrieve
the corresponding values that go into each branch into :samp:`inset`, turn
the values into strings and print them out, separately treating the
case when only a single value goes into the branch.

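For a taste of working with this structure directly, here is a minimal
sketch (ours, not part of the module) that counts the leaves of a
:class:`C45Node` tree, assuming a classifier built as above::

    def countLeaves(node):
        # leaves have nodeType C45Node.Leaf (0); internal nodes
        # keep their subtrees in the list node.branch
        if node.nodeType == 0:
            return 1
        return sum(countLeaves(branch) for branch in node.branch)

    tree = Orange.classification.tree.C45Learner(data)
    print countLeaves(tree.tree)
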
References
==========

Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`,
Addison Wesley.

Koutsofios, E., North, S. C. (1993). Drawing Graphs with dot. AT&T Bell
Laboratories, Murray Hill NJ, U.S.A.

`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_:
A home page of AT&T's dot and similar software packages.

"""

1869from Orange.core import \
1870     TreeLearner as _TreeLearner, \
1871         TreeClassifier as _TreeClassifier, \
1872         C45Learner as _C45Learner, \
1873         C45Classifier as _C45Classifier, \
1874         C45TreeNode as C45Node, \
1875         C45TreeNodeList as C45NodeList, \
1876         TreeDescender as Descender, \
1877              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
1878              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
1879              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
1880              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
1881              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
1882         TreeExampleSplitter as Splitter, \
1883              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
1884              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
1885              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
1886              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
1887              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
1888              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
1889              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
1890         TreeNode as Node, \
1891         TreeNodeList as NodeList, \
1892         TreePruner as Pruner, \
1893              TreePruner_SameMajority as Pruner_SameMajority, \
1894              TreePruner_m as Pruner_m, \
1895         TreeSplitConstructor as SplitConstructor, \
1896              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
1897              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
1898                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
1899                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
1900                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
1901                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
1902         TreeStopCriteria as StopCriteria, \
1903              TreeStopCriteria_Python as StopCriteria_Python, \
1904              TreeStopCriteria_common as StopCriteria_common
1905
1906import Orange.core
1907import operator
1908import base64
1909import re
1910import Orange.data
1911import Orange.feature.scoring
1912
class C45Learner(Orange.classification.Learner):
    """
    :class:`C45Learner`'s attributes have double names - those that
    you know from C4.5 command lines and the corresponding names of C4.5's
    internal variables. All defaults are set as in C4.5; if you change
    nothing, you are running C4.5.

    .. attribute:: gainRatio (g)

        Determines whether to use information gain (false, default)
        or gain ratio for selection of attributes (true).

    .. attribute:: batch (b)

        Turn on batch mode (no windows, no iterations); this option is
        not documented in C4.5 manuals. It conflicts with "window",
        "increment" and "trials".

    .. attribute:: subset (s)

        Enables subsetting (default: false, no subsetting).

    .. attribute:: probThresh (p)

        Probabilistic threshold for continuous attributes (default: false).

    .. attribute:: minObjs (m)

        Minimal number of objects (examples) in leaves (default: 2).

    .. attribute:: window (w)

        Initial window size (default: maximum of 20% and twice the
        square root of the number of data objects).

    .. attribute:: increment (i)

        The maximum number of objects that can be added to the window
        at each iteration (default: 20% of the initial window size).

    .. attribute:: cf (c)

        Pruning confidence level (default: 25%).

    .. attribute:: trials (t)

        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).

    .. attribute:: prune

        Return pruned tree (not an original C4.5 option) (default: true).
    """

    def __new__(cls, instances = None, weightID = 0, **argkw):
        self = Orange.classification.Learner.__new__(cls, **argkw)
        if instances:
            self.__init__(**argkw)
            return self.__call__(instances, weightID)
        else:
            return self

    def __init__(self, **kwargs):
        self.base = _C45Learner(**kwargs)

    def __setattr__(self, name, value):
        # forward known C4.5 parameters to the wrapped _C45Learner;
        # everything else (including "base" itself) goes to this object
        if name != "base" and name in self.base.__dict__:
            self.base.__dict__[name] = value
        else:
            self.__dict__[name] = value

    def __getattr__(self, name):
        # look up attributes that are not on this object on the
        # wrapped _C45Learner
        if name != "base":
            return self.base.__dict__[name]
        else:
            return self.__dict__["base"]

    def __call__(self, *args, **kwargs):
        return C45Classifier(self.base(*args, **kwargs))

    def commandline(self, ln):
        """
        Set the arguments with a C4.5 command line.
        """
        self.base.commandline(ln)

1999class C45Classifier(Orange.classification.Classifier):
2000    """
2001    A faithful reimplementation of Quinlan's function from C4.5. The only
2002    difference (and the only reason it's been rewritten) is that it uses
2003    a tree composed of :class:`C45Node` instead of C4.5's
2004    original tree structure.
2005
2006    .. attribute:: tree
2007
2008        C4.5 tree stored as a tree of :obj:`C45Node`.
2009    """
2010
2011    def __init__(self, baseClassifier):
2012        self.nativeClassifier = baseClassifier
2013        for k, v in self.nativeClassifier.__dict__.items():
2014            self.__dict__[k] = v
2015 
2016    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
2017                 *args, **kwdargs):
2018        """Classify a new instance.
2019       
2020        :param instance: instance to be classified.
2021        :type instance: :class:`Orange.data.Instance`
2022        :param result_type:
2023              :class:`Orange.classification.Classifier.GetValue` or \
2024              :class:`Orange.classification.Classifier.GetProbabilities` or
2025              :class:`Orange.classification.Classifier.GetBoth`
2026       
2027        :rtype: :class:`Orange.data.Value`,
2028              :class:`Orange.statistics.Distribution` or a tuple with both
2029        """
2030        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
2031
2032    def __setattr__(self, name, value):
2033        if name == "nativeClassifier":
2034            self.__dict__[name] = value
2035            return
2036        if name in self.nativeClassifier.__dict__:
2037            self.nativeClassifier.__dict__[name] = value
2038        self.__dict__[name] = value
2039   
2040    def dump(self): 
2041        """
        Print the tree in the same form as Ross Quinlan's
        C4.5 program.
2044
2045        ::
2046
2047            import Orange
2048
2049            data = Orange.data.Table("voting")
2050            c45 = Orange.classification.tree.C45Learner(data)
2051            print c45.dump()
2052
2053        will print out
2054
2055        ::
2056
2057            physician-fee-freeze = n: democrat (253.4)
2058            physician-fee-freeze = y:
2059            |   synfuels-corporation-cutback = n: republican (145.7)
2060            |   synfuels-corporation-cutback = y:
2061            |   |   mx-missile = y: democrat (6.0)
2062            |   |   mx-missile = n:
2063            |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
2064            |   |   |   adoption-of-the-budget-resolution = y:
2065            |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
2066            |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
2067
2068
        If you run the standalone C4.5 (which uses the very same code) on the
        same data, it will print out
2070
2071        ::
2072
2073            physician-fee-freeze = n: democrat (253.4/5.9)
2074            physician-fee-freeze = y:
2075            |   synfuels-corporation-cutback = n: republican (145.7/6.2)
2076            |   synfuels-corporation-cutback = y:
2077            |   |   mx-missile = y: democrat (6.0/2.4)
2078            |   |   mx-missile = n:
2079            |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
2080            |   |   |   adoption-of-the-budget-resolution = y:
2081            |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
2082            |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
2083
        which is reassuringly similar, except that C4.5 tested the tree on
        the learning data and also printed out the number of errors
        in each node - something which :obj:`dump` obviously can't do
        (nor is there any need that it should).
2088
2089        """
2090        return  _c45_printTree0(self.tree, self.classVar, 0)
2091
2092def _c45_showBranch(node, classvar, lev, i):
2093    var = node.tested
2094    str_ = ""
2095    if node.nodeType == 1:
2096        str_ += "\n"+"|   "*lev + "%s = %s:" % (var.name, var.values[i])
2097        str_ += _c45_printTree0(node.branch[i], classvar, lev+1)
2098    elif node.nodeType == 2:
2099        str_ += "\n"+"|   "*lev + "%s %s %.1f:" % (var.name, ["<=", ">"][i], node.cut)
2100        str_ += _c45_printTree0(node.branch[i], classvar, lev+1)
2101    else:
2102        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
2103        inset = [var.values[j[0]] for j in inset]
2104        if len(inset)==1:
2105            str_ += "\n"+"|   "*lev + "%s = %s:" % (var.name, inset[0])
2106        else:
2107            str_ +=  "\n"+"|   "*lev + "%s in {%s}:" % (var.name, ", ".join(inset))
2108        str_ += _c45_printTree0(node.branch[i], classvar, lev+1)
2109    return str_
2110       
2111       
2112def _c45_printTree0(node, classvar, lev):
2113    var = node.tested
2114    str_ = ""
2115    if node.nodeType == 0:
2116        str_ += "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items) 
2117    else:
2118        for i, branch in enumerate(node.branch):
2119            if not branch.nodeType:
2120                str_ += _c45_showBranch(node, classvar, lev, i)
2121        for i, branch in enumerate(node.branch):
2122            if branch.nodeType:
2123                str_ += _c45_showBranch(node, classvar, lev, i)
2124    return str_
2125
2126def _printTreeC45(tree):
2127    print _c45_printTree0(tree.tree, tree.classVar, 0)
2128
2129
2130import Orange.feature.scoring as fscoring
2131
2132class TreeLearner(Orange.core.Learner):
2133    """
2134    Assembles the generic classification or regression tree learner
2135    (from Orange's objects for induction of decision trees).
    :class:`TreeLearner` is essentially a wrapper
    around :class:`_TreeLearner`, provided for easier use of the latter.
    It sets a number of parameters used in induction; these
    can also be set after the creation of the object, most often through
    the object's attributes. If upon initialization
    :class:`TreeLearner` is given a set of examples, then an instance
    of :class:`TreeClassifier` is returned instead.

    The values of attributes can also be set in the constructor.
2145
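    For example, using the :obj:`maxDepth` and :obj:`sameMajorityPruning`
    attributes described below::

        import Orange
        data = Orange.data.Table("iris")
        tree = Orange.classification.tree.TreeLearner(data,
            maxDepth=3, sameMajorityPruning=True)
        print tree.dump()
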
2146    .. attribute:: nodeLearner
2147
2148        Induces a classifier from examples belonging to a node. The
2149        same learner is used for internal nodes and for leaves. The
2150        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
2151
2152    **Split construction**
2153
    .. attribute:: split

        Defines a function that will be used in place of
        :obj:`SplitConstructor`.
        Useful when prototyping new tree induction
        algorithms. When this parameter is defined, other parameters that
        affect the procedure for growing the tree are ignored. These
        include :obj:`binarization`, :obj:`measure`,
        :obj:`worstAcceptable` and :obj:`minSubset`. (Default:
        :class:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while
        continuous attributes are binarized.
        Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for
        discrete and five examples in a leaf for continuous attributes.)
2170
2171    .. attribute:: binarization
2172
2173        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
2174        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
2175        0, do not use binarization (use :class:`SplitConstructor_Attribute`).
2176        Default: 0.
2177
2178    .. attribute:: measure
2179   
2180        Measure for scoring of the attributes when deciding which of the
2181        attributes will be used for splitting of the example set in the node.
2182        Can be either a :class:`Orange.feature.scoring.Measure` or one of
2183        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
2184        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
2185        "gini" (:class:`Orange.feature.scoring.Gini`),
2186        "relief" (:class:`Orange.feature.scoring.Relief`),
2187        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
2188
    .. attribute:: reliefM, reliefK

        Set `m` and `k` to the given values if the :obj:`measure` is relief.

    .. attribute:: splitter

        Object of type :class:`Splitter`. The default splitter
        is :class:`Splitter_UnknownsAsSelector`, which splits
        the learning examples according to distributions given by the
        selector.
2199
2200    **Pruning**
2201
2202    .. attribute:: worstAcceptable
2203
2204        Used in pre-pruning, sets the lowest required attribute
2205        score. If the score of the best attribute is below this margin, the
2206        tree at that node is not grown further (default: 0).
2207
2208        So, to allow splitting only when gainRatio (the default measure)
2209        is greater than 0.6, set :samp:`worstAcceptable=0.6`.
2210
2211    .. attribute:: minSubset
2212
2213        Minimal number of examples in non-null leaves (default: 0).
2214
    .. attribute:: minExamples

        Data subsets with fewer than :obj:`minExamples`
        examples are not split any further, that is, all leaves in the tree
        will contain at least that many examples (default: 0).

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100.

    .. attribute:: maxMajority

        Induction stops when the proportion of the majority class in the
        node exceeds the value set by this parameter (default: 1.0).
2230        To stop the induction as soon as the majority class reaches 70%,
2231        you should use :samp:`maxMajority=0.7`, as in the following
2232        example. The numbers show the majority class
2233        proportion at each node. The script `tree2.py`_ induces and
2234        prints this tree.
2235
2236        .. _tree2.py: code/tree2.py
2237
2238        ::
2239
2240            root: 0.333
2241            |    petal width<=0.800: 1.000
2242            |    petal width>0.800: 0.500
2243            |    |    petal width<=1.750: 0.907
2244            |    |    petal width>1.750: 0.978
2245   
2246    .. attribute:: stop
2247
2248        Used for passing a function which is used in place of
2249        :class:`StopCriteria`. Useful when prototyping new
2250        tree induction algorithms. See a documentation on
2251        :class:`StopCriteria` for more info on this function.
2252        When used, parameters  :obj:`maxMajority` and :obj:`minExamples`
2253        will not be  considered (default: None).
2254        The default stopping criterion stops induction when all examples
2255        in a node belong to the same class.
2256
2257    .. attribute:: mForPruning
2258
2259        If non-zero, invokes an error-based bottom-up post-pruning,
2260        where m-estimate is used to estimate class probabilities
2261        (default: 0).
2262
2263    .. attribute:: sameMajorityPruning
2264
2265        If true, invokes a bottom-up post-pruning by removing the
2266        subtrees of which all leaves classify to the same class
2267        (default: False).
2268
2269    **Record keeping**
2270
2271    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
2272
        Determines whether to store class distributions, contingencies and
        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes (default: everything except
        :obj:`storeExamples` is enabled). You won't save any memory by not
        storing distributions but storing contingencies, since distributions
        actually point to the same distribution that is stored in
        :obj:`contingency.classes`.
2281   
2282    """
2283    def __new__(cls, examples = None, weightID = 0, **argkw):
2284        self = Orange.core.Learner.__new__(cls, **argkw)
2285        if examples:
2286            self.__init__(**argkw)
2287            return self.__call__(examples, weightID)
2288        else:
2289            return self
2290     
2291    def __init__(self, **kw):
2292        self.split = None
2293        self.stop = None
2294        self.measure = None
2295        self.splitter = None
2296        self.__dict__.update(kw)
2297     
2298    def __call__(self, examples, weight=0):
2299        """
2300        Return a classifier from the given examples.
2301        """
2302        bl = self._base_learner()
2303
2304        #set the scoring criteria for regression if it was not
2305        #set by the user
2306        if not self.split and not self.measure:
2307            measure = fscoring.GainRatio() \
2308                if examples.domain.classVar.varType == Orange.data.Type.Discrete \
2309                else fscoring.MSE()
2310            bl.split.continuousSplitConstructor.measure = measure
2311            bl.split.discreteSplitConstructor.measure = measure
2312         
2313        if self.splitter != None:
2314            bl.splitter = self.splitter
2315
2316
2317        #post pruning
2318        tree = bl(examples, weight)
2319        if getattr(self, "sameMajorityPruning", 0):
2320            tree = Pruner_SameMajority(tree)
2321        if getattr(self, "mForPruning", 0):
2322            tree = Pruner_m(tree, m=self.mForPruning)
2323
2324        return TreeClassifier(baseClassifier=tree) 
2325
2326    def instance(self):
2327        """
2328        DEPRECATED. Return a base learner - an object
2329        of :class:`_TreeLearner`.
2330        This method is left for backwards compatibility.
2331        """
2332        return self._base_learner()
2333
2334    def build_split(self):
2335        """
2336        Return the split constructor built according to object attributes.
2337        """
2338       
2339        split = self.split
2340
2341        if not split:
2342            split = SplitConstructor_Combined()
2343            split.continuousSplitConstructor = \
2344                SplitConstructor_Threshold()
2345            binarization = getattr(self, "binarization", 0)
2346            if binarization == 1:
2347                split.discreteSplitConstructor = \
2348                    SplitConstructor_ExhaustiveBinary()
2349            elif binarization == 2:
2350                split.discreteSplitConstructor = \
2351                    SplitConstructor_OneAgainstOthers()
2352            else:
2353                split.discreteSplitConstructor = \
2354                    SplitConstructor_Feature()
2355
2356            measures = {"infoGain": fscoring.InfoGain,
2357                "gainRatio": fscoring.GainRatio,
2358                "gini": fscoring.Gini,
2359                "relief": fscoring.Relief,
2360                "retis": fscoring.MSE
2361                }
2362
2363            measure = self.measure
2364            if isinstance(measure, str):
2365                measure = measures[measure]()
2366            if not measure:
2367                measure = fscoring.GainRatio()
2368
2369            measureIsRelief = isinstance(measure, fscoring.Relief)
2370            relM = getattr(self, "reliefM", None)
2371            if relM and measureIsRelief:
2372                measure.m = relM
2373           
2374            relK = getattr(self, "reliefK", None)
2375            if relK and measureIsRelief:
2376                measure.k = relK
2377
2378            split.continuousSplitConstructor.measure = measure
2379            split.discreteSplitConstructor.measure = measure
2380
2381            wa = getattr(self, "worstAcceptable", 0)
2382            if wa:
2383                split.continuousSplitConstructor.worstAcceptable = wa
2384                split.discreteSplitConstructor.worstAcceptable = wa
2385
2386            ms = getattr(self, "minSubset", 0)
2387            if ms:
2388                split.continuousSplitConstructor.minSubset = ms
2389                split.discreteSplitConstructor.minSubset = ms
2390
2391        return split
2392
2393    def build_stop(self):
2394        """
2395        Return the stop criteria built according to object's attributes.
2396        """
2397        stop = self.stop
2398        if not stop:
2399            stop = Orange.classification.tree.StopCriteria_common()
2400            mm = getattr(self, "maxMajority", 1.0)
2401            if mm < 1.0:
2402                stop.maxMajority = self.maxMajority
2403            me = getattr(self, "minExamples", 0)
2404            if me:
2405                stop.minExamples = self.minExamples
2406        return stop
2407
2408    def _base_learner(self):
2409        learner = _TreeLearner()
2410
2411        learner.split = self.build_split()
2412        learner.stop = self.build_stop()
2413
2414        for a in ["storeDistributions", "storeContingencies", "storeExamples", 
2415            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
2416            if hasattr(self, a):
2417                setattr(learner, a, getattr(self, a))
2418
2419        return learner
2420
2421#
2422# the following is for the output
2423#
2424
2425fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2426""" Defines the multiplier by 100 (:samp:`^`) and the format
2427for the number of decimals (e.g. :samp:`5.3`). The corresponding
2428groups are named :samp:`m100` and :samp:`fs`. """
2429
by = r"(?P<by>(b(P|A)))?"
""" Defines bP or bA, or nothing; the result is in the group :samp:`by`. """
2432
2433bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2434opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2435opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2436intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2437fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2438re_V = re.compile("%V")
2439re_N = re.compile("%"+fs+"N"+by)
2440re_M = re.compile("%"+fs+"M"+by)
2441re_m = re.compile("%"+fs+"m"+by)
2442re_Ccont = re.compile("%"+fs+"C"+by+opc)
2443re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2444re_ccont = re.compile("%"+fs+"c"+by+opc)
2445re_cdisc = re.compile("%"+fs+"c"+by+opd)
2446re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2447re_cconti = re.compile("%"+fs+"c"+by+fromto)
2448re_D = re.compile("%"+fs+"D"+by)
2449re_d = re.compile("%"+fs+"d"+by)
2450re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2451re_I = re.compile("%"+fs+"I"+intrvl)
2452
2453def insertStr(s, mo, sub):
2454    """ Replace the part of s which is covered by mo
2455    with the string sub. """
2456    return s[:mo.start()] + sub + s[mo.end():]
2457
2458
2459def insertDot(s, mo):
2460    """ Replace the part of s which is covered by mo
2461    with a dot.  You should use this when the
2462    function cannot compute the desired quantity; it is called, for instance,
2463    when it needs to divide by something in the parent, but the parent
2464    doesn't exist.
2465    """
2466    return s[:mo.start()] + "." + s[mo.end():]
2467
def insertNum(s, mo, n):
    """ Replace the part of s matched by mo with the number n,
    formatted as specified by the user, that is, it multiplies
    it by 100, if needed, and prints with the right number of
    places and decimals. It does so by checking the mo
    for a group named m100 (representing the :samp:`^` in the format string)
    and a group named fs representing the part giving the number
    of decimals (e.g. :samp:`5.3`).
    """
    grps = mo.groupdict()
    m100 = grps.get("m100", None)
    if m100:
        n *= 100
    fs = grps.get("fs") or (m100 and ".0" or "5.3")
    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
2483
def byWhom(by, parent, tree):
    """ If :samp:`by` equals :samp:`bP`, it returns :samp:`parent`, else
    it returns :samp:`tree.tree`. This is used to find what to divide
    the quantity with, when division is required.
    """
    if by=="bP":
        return parent
    else:
        return tree.tree
2493
2494def replaceV(strg, mo, node, parent, tree):
2495    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
2496
2497def replaceN(strg, mo, node, parent, tree):
2498    by = mo.group("by")
2499    N = node.distribution.abs
2500    if by:
2501        whom = byWhom(by, parent, tree)
2502        if whom and whom.distribution:
2503            if whom.distribution.abs > 1e-30:
2504                N /= whom.distribution.abs
2505        else:
2506            return insertDot(strg, mo)
2507    return insertNum(strg, mo, N)
2508       
2509
2510def replaceM(strg, mo, node, parent, tree):
2511    by = mo.group("by")
2512    maj = int(node.nodeClassifier.defaultValue)
2513    N = node.distribution[maj]
2514    if by:
2515        whom = byWhom(by, parent, tree)
2516        if whom and whom.distribution:
2517            if whom.distribution[maj] > 1e-30:
2518                N /= whom.distribution[maj]
2519        else:
2520            return insertDot(strg, mo)
2521    return insertNum(strg, mo, N)
2522       
2523
def replacem(strg, mo, node, parent, tree):
    by = mo.group("by")
    maj = int(node.nodeClassifier.defaultValue)
    if node.distribution.abs > 1e-30:
        N = node.distribution[maj] / node.distribution.abs
        if by:
            # find the node to divide by, as the other replace* functions do
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
                byN = whom.distribution[maj] / whom.distribution.abs
                if byN > 1e-30:
                    N /= byN
            else:
                return insertDot(strg, mo)
    else:
        N = 0.
    return insertNum(strg, mo, N)
2539
2540
2541def replaceCdisc(strg, mo, node, parent, tree):
2542    if tree.classVar.varType != Orange.data.Type.Discrete:
2543        return insertDot(strg, mo)
2544   
2545    by, op, cls = mo.group("by", "op", "cls")
2546    N = node.distribution[cls]
2547    if op == "!=":
2548        N = node.distribution.abs - N
2549    if by:
2550        whom = byWhom(by, parent, tree)
2551        if whom and whom.distribution:
2552            if whom.distribution[cls] > 1e-30:
2553                N /= whom.distribution[cls]
2554        else:
2555            return insertDot(strg, mo)
2556    return insertNum(strg, mo, N)
2557
2558   
2559def replacecdisc(strg, mo, node, parent, tree):
2560    if tree.classVar.varType != Orange.data.Type.Discrete:
2561        return insertDot(strg, mo)
2562   
2563    op, by, cls = mo.group("op", "by", "cls")
2564    N = node.distribution[cls]
2565    if node.distribution.abs > 1e-30:
2566        N /= node.distribution.abs
2567        if op == "!=":
2568            N = 1 - N
2569    if by:
2570        whom = byWhom(by, parent, tree)
2571        if whom and whom.distribution:
2572            if whom.distribution[cls] > 1e-30:
2573                N /= whom.distribution[cls] / whom.distribution.abs
2574        else:
2575            return insertDot(strg, mo)
2576    return insertNum(strg, mo, N)
2577
2578
2579__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2580
2581def replaceCcont(strg, mo, node, parent, tree):
2582    if tree.classVar.varType != Orange.data.Type.Continuous:
2583        return insertDot(strg, mo)
2584   
2585    by, op, num = mo.group("by", "op", "num")
2586    op = __opdict[op]
2587    num = float(num)
2588    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2589    if by:
2590        whom = byWhom(by, parent, tree)
2591        if whom and whom.distribution:
2592            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2593            if byN > 1e-30:
2594                N /= byN
2595        else:
2596            return insertDot(strg, mo)
2597
2598    return insertNum(strg, mo, N)
2599   
2600   
def replaceccont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN / whom.distribution.abs  # abs >= byN, so byN > 1e-30 => abs > 1e-30
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)



def extractInterval(mo, dist):
    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
    lower, upper = float(lower), float(upper)
    if out:
        lop = lowin == "(" and operator.le or operator.lt
        # the second alternative must be operator.gt (the original repeated
        # operator.ge), mirroring the lt/le pair in the branch below
        hop = upin == ")" and operator.ge or operator.gt
        return filter(lambda x: lop(x[0], lower) or hop(x[0], upper), dist.items())
    else:
        lop = lowin == "(" and operator.gt or operator.ge
        hop = upin == ")" and operator.lt or operator.le
        return filter(lambda x: lop(x[0], lower) and hop(x[0], upper), dist.items())

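# Interval notation, as the regexes appear to capture it: "(" and ")" mark
# open bounds, "[" and "]" closed ones, so e.g. "[20, 22)" keeps the items
# with 20 <= x < 22, while the "out" group (a complement marker in the
# format string) selects everything outside the interval instead.
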

def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN / whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N * mul) for N in dist]) + "]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d / ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d / whom.distribution.abs  # abs > d => d>1e-30 => abs>1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N * mul) for N in dist]) + "]")

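# A hedged reading of the two distribution formats above: replaceD prints
# absolute class counts, replaced the relative frequencies (counts divided
# by distribution.abs). "fs" is an optional printf-style field spec and
# "m100" multiplies by 100, so a hypothetical leaf with counts [2, 6]
# would render as "[2.000, 6.000]" for the uppercase form and, with m100,
# as "[25, 75]" for the lowercase one.
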

def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b" + by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)


Z = {0.75: 1.15, 0.80: 1.28, 0.85: 1.44, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95") / 100.
    mul = mo.group("m100") and 100 or 1

    if not Z.has_key(intrvl):
        # the original "%5.3f%" was an incomplete format (a runtime
        # ValueError); escape the percent sign and report a percentage
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (intrvl * 100)

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av - il) * mul, (av + il) * mul))

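# The interval printed by replaceI is a plain normal-approximation
# confidence interval around the node's average: for, say, a 95% request
# the bounds are
#
#     av - 1.96 * SE   and   av + 1.96 * SE
#
# where SE is node.distribution.error() and 1.96 is Z[0.95] from the table
# above; only the tabulated confidence levels are supported.
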

# This class is more a collection of functions, merged into a class so
# that they do not need to pass around too many arguments. It is
# constructed, used and discarded; it is not meant to store any information.
class _TreeDumper:
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI)]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
                 maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr

    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
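                    # resume scanning just past the start of the previous
                    # match rather than at its end: the replacement changed
                    # the string, and advancing by one character guarantees
                    # progress even if the substituted text could match the
                    # pattern again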
                    strt = mo.start() + 1

        return strg

    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    " * lev + bdes + nodedes


    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    " * lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n" + ("|    " * lev) + "    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev + 1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)


    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)


    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName) - 1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName + chr(i + 65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            # the leaf name must be quoted like the internal ones: the edge
            # above is written with the quoted (base64) name, so an unquoted
            # declaration here would define a different, unconnected node
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                self.formatString(self.leafStr, node, parent)))


    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    return '"%s"' % base64.b64encode(x)
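
# base64-encoding the name guarantees an identifier DOT accepts regardless
# of which characters the path-derived name contains; for instance,
# _quoteName("nAB") yields '"bkFC"'.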

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier:
            baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
        """
        return self.nativeClassifier(instance, result_type, *args, **kwdargs)

    def __setattr__(self, name, value):
        if name == "nativeClassifier":
            self.__dict__[name] = value
            return
        if name in self.nativeClassifier.__dict__:
            self.nativeClassifier.__dict__[name] = value
        self.__dict__[name] = value

    def dump(self, leafStr="", nodeStr="", **argkw):
        """
        Return a string representation of a tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" will be used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, the subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches with a single
          node are printed before the branches with larger subtrees.
          If False, the branches are printed in order of
          appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of (regular expression, callback function)
          pairs through which the user can print out additional
          information in the nodes.
        """
        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
            self).dumpTree()

    def dot(self, fileName, leafStr="", nodeStr="", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """Print the tree to a file in the format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        Uses the same parameters as :meth:`dump` defined above,
        plus two parameters which define the shapes used for the internal
        nodes and the leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
        """
        fle = type(fileName) == str and file(fileName, "wt") or fileName

        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()

    def count_nodes(self):
        """
        Return the number of nodes.
        """
        return _countNodes(self.tree)

    def count_leaves(self):
        """
        Return the number of leaves.
        """
        return _countLeaves(self.tree)

def _countNodes(node):
    count = 0
    if node:
        count += 1
        if node.branches:
            for branch in node.branches:
                count += _countNodes(branch)
    return count


def _countLeaves(node):
    count = 0
    if node:
        if node.branches:  # internal node
            for branch in node.branches:
                count += _countLeaves(branch)
        else:
            count += 1
    return count

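
# A hedged usage sketch (not part of the module): build a classifier on the
# iris data set shipped with Orange and exercise dump, dot and the node
# counters, assuming the Python 2 / Orange 2.x API documented above.
#
#     import Orange
#     data = Orange.data.Table("iris")
#     tree = Orange.classification.tree.TreeLearner(data)
#     print tree.dump(leafStr="%V (%^.2m%)", nodeStr=".")
#     tree.dot("iris.dot", leafShape="ellipse", nodeShape="box")
#     print tree.count_nodes(), tree.count_leaves()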