"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
12To build a small tree (:obj:`TreeClassifier`) from the iris data set
13(with the depth limited to three levels), use
14(part of `orngTree1.py`_, uses `iris.tab`_):
15
16.. literalinclude:: code/orngTree1.py
17   :lines: 1-4
18
.. _orngTree1.py: code/orngTree1.py
.. _iris.tab: code/iris.tab
20
21
22This page first describes the learner and the classifier, and then
23defines the base classes (individual components) of the trees
24and the tree-building process.
25
26.. autoclass:: TreeLearner
27    :members:
28
29.. autoclass:: TreeClassifier
30    :members:
31
32========
33Examples
34========
35
For example, here's how to write your own stop
function. The example constructs
and prints two trees. For the first one we define the *defStop*
function, which is used by default, and combine it with a random function
so that the stopping criterion is also met in 20% of the cases
when *defStop* is false. For the second tree the stopping criterion
is entirely random. Note that in
the second case the lambda function still has three parameters, since this
is the number of parameters a stop function (:obj:`StopCriteria`) must
accept. Part of `tree3.py`_ (uses `iris.tab`_):
46
47.. _tree3.py: code/tree3.py
48
49.. literalinclude:: code/tree3.py
50   :lines: 8-23
51
52The output is not shown here since the resulting trees are rather
53big.
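
In outline, the included code does something like this (a sketch; see
`tree3.py`_ for the exact version)::

    import random
    import Orange

    data = Orange.data.Table("iris")

    defStop = Orange.classification.tree.StopCriteria()

    # stop when the default criterion says so, or in 20% of cases anyway
    f = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or random.randint(1, 5) == 1
    tree = Orange.classification.tree.TreeLearner(data, stop=f)

    # a purely random stopping criterion; the lambda still needs
    # all three parameters
    fr = lambda examples, weightID, contingency: random.randint(1, 5) == 1
    treeRandom = Orange.classification.tree.TreeLearner(data, stop=fr)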
54
55Tree Structure
56==============
57
To have something to work on, we'll take the data from the lenses
data set and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):
60
61.. literalinclude:: code/treestructure.py
62   :lines: 7-10
63
64How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?
65
66.. _lenses.tab: code/lenses.tab
67.. _treestructure.py: code/treestructure.py
68
69.. literalinclude:: code/treestructure.py
70   :lines: 12-21
71
If the node is None, we have a null node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (this node) plus the
sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.
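
The included function amounts to the following (a sketch matching the
description above)::

    def treeSize(node):
        # null nodes do not count
        if not node:
            return 0
        size = 1
        # only internal nodes have a branchSelector and, with it, branches
        if node.branchSelector:
            for branch in node.branches:
                size += treeSize(branch)
        return size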
78
79    >>> treeSize(treeClassifier.tree)
80    10
81
Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.
84
85Let us now write a script that prints out a tree. The recursive
86part of the function will get a node and its level
87(part of `treestructure.py`_, uses `lenses.tab`_).
88
89.. literalinclude:: code/treestructure.py
90   :lines: 26-41
91
Don't waste time studying the formatting tricks (\n's etc.); they are
only there for nicer output. What matters is everything but the print
statements. First, we check whether the node is a null node (a node
to which no learning examples were classified). If so, we just print
out "<null node>" and return.
97
After handling null nodes, the remaining nodes are internal nodes and
leaves. For internal nodes, we print the node description, consisting
of the attribute's name and the distribution of classes. :obj:`Node`'s
branch description is, for all currently defined splits, an instance
of a class derived from :obj:`orange.Classifier` (in fact, it is an
:obj:`orange.ClassifierFromVarFD`, but an :obj:`orange.Classifier` would
suffice), and its :obj:`classVar` points to the attribute we seek,
so we print its name. We will also assume that storing class
distributions has not been disabled and print them as well.
Then we iterate
through the branches; for each we print a branch description and
recursively call :obj:`printTree0` with the level increased by 1 (to
increase the indent).
111
112Finally, if the node is a leaf, we print out the distribution of
113learning examples in the node and the class to which the examples in
114the node would be classified. We again assume that the :obj:`nodeClassifier`
115is the default one - a :obj:`DefaultClassifier`. A better print
116function should be aware of possible alternatives.
117
118Now, we just need to write a simple function to call our printTree0.
119We could write something like...
120
121::
122
123    def printTree(x):
124        printTree0(x.tree, 0)
125
126... but we won't. Let us learn how to handle arguments of different
127types. Let's write a function that will accept either a
128:obj:`TreeClassifier` or a :obj:`Node`.
129Part of `treestructure.py`_, uses `lenses.tab`_.
130
131.. literalinclude:: code/treestructure.py
132   :lines: 43-49
133
It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's a
:obj:`Node`, we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. The output::
139
140    >>> printTree(treeClassifier)
141    tear_rate (<15.000, 5.000, 4.000>)
142    : reduced --> none (<12.000, 0.000, 0.000>)
143    : normal
144       astigmatic (<3.000, 5.000, 4.000>)
145       : no
146          age (<1.000, 5.000, 0.000>)
147          : young --> soft (<0.000, 2.000, 0.000>)
148          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
149          : presbyopic --> none (<1.000, 1.000, 0.000>)
150       : yes
151          prescription (<2.000, 0.000, 4.000>)
152          : myope --> hard (<0.000, 0.000, 3.000>)
153          : hypermetrope --> none (<2.000, 0.000, 1.000>)
154
155For a final exercise, let us write a simple pruning function. It will
156be written entirely in Python, unrelated to any :obj:`Pruner`. It
157will limit the maximal tree depth (the number of internal nodes on any
158path down the tree) given as an argument.
159For example, to get a two-level tree, we would
160call cutTree(root, 2). The function will be recursive, with the second
161argument (level) decreasing at each call; when zero, the current node
162will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_):
163
164.. literalinclude:: code/treestructure.py
165   :lines: 54-62
166
167There's nothing to prune at null-nodes or leaves, so we act only when
168:obj:`node` and :obj:`node.branchSelector` are defined. If level is
169not zero, we call the function for each branch. Otherwise, we clear
170the selector, branches and branch descriptions.
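
The included function amounts to roughly this (a sketch matching the
description above)::

    def cutTree(node, level):
        # act only on internal nodes; there is nothing to prune
        # at null nodes or leaves
        if node and node.branchSelector:
            if level:
                for branch in node.branches:
                    cutTree(branch, level - 1)
            else:
                # turn the node into a leaf
                node.branchSelector = None
                node.branches = None
                node.branchDescriptions = None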
171
    >>> cutTree(treeClassifier.tree, 2)
    >>> printTree(treeClassifier)
174    tear_rate (<15.000, 5.000, 4.000>)
175    : reduced --> none (<12.000, 0.000, 0.000>)
176    : normal
177       astigmatic (<3.000, 5.000, 4.000>)
178       : no --> soft (<1.000, 5.000, 0.000>)
179       : yes --> hard (<2.000, 0.000, 4.000>)
180
181Learning
182========
183
You could just call :class:`TreeLearner` and let it fill the empty
slots with the default
components. This section will teach you three things: what the
missing components are (and how to set them yourself),
how to use alternative components to get a different tree and,
finally, how to write a skeleton for tree induction in Python.
190
191.. _treelearner.py: code/treelearner.py
192
Let us construct a :obj:`TreeLearner` to play with (part of `treelearner.py`_, uses `lenses.tab`_):
194
195.. literalinclude:: code/treelearner.py
196   :lines: 7-10
197
There are three crucial components in learning: the split and stop
criteria, and the :obj:`exampleSplitter` (there are some others,
which become important during classification; we'll talk about them
later). They are not defined; if you use the learner, the slots are
filled temporarily but later cleared again.
203
204::
205
206    >>> print learner.split
207    None
208    >>> learner(data)
209    <TreeClassifier instance at 0x01F08760>
210    >>> print learner.split
211    None
212
213Stopping criteria
214-----------------
215
The stop criterion is trivial. The default is set by
217
218    >>> learner.stop = Orange.classification.tree.StopCriteria_common()
219
220We can now examine the default stopping parameters.
221
222    >>> print learner.stop.maxMajority, learner.stop.minExamples
223    1.0 0.0
224
225Not very restrictive. This keeps splitting the examples until
226there's nothing left to split or all the examples are in the same
227class. Let us set the minimal subset that we allow to be split to
228five examples and see what comes out.
229
230    >>> learner.stop.minExamples = 5.0
231    >>> tree = learner(data)
232    >>> print tree.dump()
233    tear_rate=reduced: none (100.00%)
234    tear_rate=normal
235    |    astigmatic=no
236    |    |    age=pre-presbyopic: soft (100.00%)
237    |    |    age=presbyopic: none (50.00%)
238    |    |    age=young: soft (100.00%)
239    |    astigmatic=yes
240    |    |    prescription=hypermetrope: none (66.67%)
241    |    |    prescription=myope: hard (100.00%)
242
OK, that's better. If we want an even smaller tree, we can also limit
the maximal proportion of the majority class.
245
246    >>> learner.stop.maxMajority = 0.5
247    >>> tree = learner(data)
248    >>> print tree.dump()
249    none (62.50%)
250
251
252Printing the Tree
253=================
254
The included printing functions can
print out practically anything you'd like to
know, from the number of examples, the proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in the parent node. On top of that, you can
define your own callback functions to be used for printing.
262
263Before we go on: you can read all about the function and use it to its
264full extent, or you can just call it, giving it the tree as the sole
265argument and it will print out the usual textual representation of the
266tree. If you're satisfied with that, you can stop here.
267
The magic is in the format string. It is a string which is printed
out at every leaf or internal node, with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.
272
**^** at the start indicates that the number should be multiplied by 100.
It's useful for printing proportions as percentages.
275
276**<precision>** is in the same format as in Python (or C) string
277formatting. For instance, :samp:`%N` denotes the number of examples in the node,
278hence :samp:`%6.2N` would mean output to two decimal digits and six places
279altogether. If left out, a default format :samp:`5.3` is used, unless you
280multiply the numbers by 100, in which case the default is :samp:`.0`
281(no decimals, the number is rounded to the nearest integer).
282
**<divisor>** tells what to divide the quantity in that node by.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` gives the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the entire
data set, so :samp:`%NbA` gives the proportion of examples (out of the entire
training data set) that fell into that node. If division is impossible
since the parent node does not exist or some data is missing, a dot is
printed out instead of the quantity.
292
293**<quantity>** is the only required element. It defines what to print.
294For instance, :samp:`%N` would print out the number of examples in the node.
295Possible quantities are
296
297:samp:`V`
298    The value predicted at that node. You cannot define the precision
299    or divisor here.
300
301:samp:`N`
302    The number of examples in the node.
303
304:samp:`M`
305    The number of examples in the majority class (that is, the class
306    predicted by the node).
307
308:samp:`m`
309    The proportion of examples in the majority class.
310
:samp:`A`
    The average class for the examples in the node; this is available
    only for regression trees.
314
:samp:`E`
    The standard error for the class of examples in the node; available
    for regression trees.
318
:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.
322
:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this gives the number of examples of Iris-virginica divided by the
    number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`
330
331    For regression trees, you can use operators =, !=, <, <=, >, and >=,
332    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
333    check the number of examples in a certain interval: :samp:`%C[20, 22]`
334    will give you the number of examples between 20 and 22 (inclusive)
335    and :samp:`%C(20, 22)` will give the number of such examples excluding the
336    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
337    If you would like the examples outside the interval, add a :samp:`!`,
338    like :samp:`%C!(20, 22]`.
339 
340:samp:`c`
341    Same as above, except that it computes the proportion of the class
342    instead of the number of examples.
343
:samp:`D`
    Prints out the number of examples in each class. You can use both
    the precision (it is applied to each number in the distribution)
    and the divisor. This quantity cannot be computed for regression trees.
348
349:samp:`d`
350    Same as above, except that it shows proportions of examples. This
351    again doesn't work with regression trees.
352
353<user defined formats>
354    You can add more, if you will. Instructions and examples are given at
355    the end of this section.
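
For example, the multiplier, precision, quantity and divisor can all be
combined: :samp:`%^.1NbA` prints the percentage (with one decimal) of
all training examples that reached the node::

    >>> print tree.dump(leafStr="%V (%^.1NbA% of examples)")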
356
357
358Examples
359========
360
We shall build a small tree from the iris data set, limiting the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):
363
364.. literalinclude:: code/orngTree1.py
365   :lines: 1-4
366
367.. _orngTree1.py: code/orngTree1.py
368
369The easiest way to call the function is to pass the tree as the only
370argument::
371
372    >>> print tree.dump()
373    petal width<0.800: Iris-setosa (100.00%)
374    petal width>=0.800
375    |    petal width<1.750
376    |    |    petal length<5.350: Iris-versicolor (94.23%)
377    |    |    petal length>=5.350: Iris-virginica (100.00%)
378    |    petal width>=1.750
379    |    |    petal length<4.850: Iris-virginica (66.67%)
380    |    |    petal length>=4.850: Iris-virginica (100.00%)
381
Let's now print out the predicted class at each node, and the number
of examples in the majority class along with the total number of examples
in the node::
385
386    >>> print tree.dump(leafStr="%V (%M out of %N)")
387    petal width<0.800: Iris-setosa (50.000 out of 50.000)
388    petal width>=0.800
389    |    petal width<1.750
390    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
391    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
392    |    petal width>=1.750
393    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
394    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)
395
Would you like to know how the number of examples declines as
compared to the entire data set and to the parent node? We can find
out with this::
399
400    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)")
401    petal width<0.800: Iris-setosa (100%, 100%)
402    petal width>=0.800
403    |    petal width<1.750
404    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
405    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
406    |    petal width>=1.750
407    |    |    petal length<4.850: Iris-virginica (4%, 4%)
408    |    |    petal length>=4.850: Iris-virginica (86%, 96%)
409
Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.
418
And now for the output: all examples of setosa fall into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.
426
427If you find this guesswork annoying - so do I. Let us print out the
428number of versicolors in each node, together with the proportion of
429versicolors among the examples in this particular node and among all
430versicolors. So,
431
432::
433
    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'
435
436gives the following output::
437
438    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
439    petal width>=0.800
440    |    petal width<1.750
441    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
442    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
443    |    petal width>=1.750
444    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
445    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)
446
447Finally, we may want to print out the distributions, using a simple
448string :samp:`%D`::
449
450    petal width<0.800: [50.000, 0.000, 0.000]
451    petal width>=0.800
452    |    petal width<1.750
453    |    |    petal length<5.350: [0.000, 49.000, 3.000]
454    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
455    |    petal width>=1.750
456    |    |    petal length<4.850: [0.000, 1.000, 2.000]
457    |    |    petal length>=4.850: [0.000, 0.000, 43.000]
458
What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us the proportions within the node, rounded to
two decimals::
465
466    petal width<0.800: [1.00, 0.00, 0.00]
467    petal width>=0.800
468    |    petal width<1.750
469    |    |    petal length<5.350: [0.00, 0.94, 0.06]
470    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
471    |    petal width>=1.750
472    |    |    petal length<4.850: [0.00, 0.33, 0.67]
473    |    |    petal length>=4.850: [0.00, 0.00, 1.00]
474
475We haven't tried printing out any information for internal nodes.
476To start with the most trivial case, we shall print the prediction
477at each node.
478
479::
480
    print tree.dump(leafStr="%V", nodeStr=".")
482   
The dot for nodeStr says that it should be the same as leafStr (not very
useful here, since leafStr is trivial anyway).
485
486::
487
488    root: Iris-setosa
489    |    petal width<0.800: Iris-setosa
490    |    petal width>=0.800: Iris-versicolor
491    |    |    petal width<1.750: Iris-versicolor
492    |    |    |    petal length<5.350: Iris-versicolor
493    |    |    |    petal length>=5.350: Iris-virginica
494    |    |    petal width>=1.750: Iris-virginica
495    |    |    |    petal length<4.850: Iris-virginica
496    |    |    |    petal length>=4.850: Iris-virginica
497
Note that the output is somewhat different now: another node, called
*root*, has appeared and the tree looks one level deeper. This is
needed to print out the data for that node too.
501
502Now for something more complicated: let us observe how the number
503of virginicas decreases down the tree::
504
505    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')
506
Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is
printed out literally. In parentheses we print the same thing, except that
we divide by the examples in the parent node. Note the use of single
quotes, so that we can use double quotes inside the string when we
specify the class.
514
515::
516
517    root: 100.0% (.%)
518    |    petal width<0.800: 0.0% (0.0%)
519    |    petal width>=0.800: 100.0% (100.0%)
520    |    |    petal width<1.750: 10.0% (10.0%)
521    |    |    |    petal length<5.350: 6.0% (60.0%)
522    |    |    |    petal length>=5.350: 4.0% (40.0%)
523    |    |    petal width>=1.750: 90.0% (90.0%)
524    |    |    |    petal length<4.850: 4.0% (4.4%)
525    |    |    |    petal length>=4.850: 86.0% (95.6%)
526
See what's in the parentheses in the root node? If :meth:`TreeClassifier.dump`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.
531
For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think about why). In the leaves we shall also add the
predicted class. So now we'll have to call the function like this.
537
538::
539
540    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
541    root: [50.000, 50.000, 50.000] . .
542    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
543    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
544    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
545    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
546    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
547    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
548    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
549    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
550    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
551    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
552    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
553    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
554    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]
555
To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set. Called without additional
arguments, :meth:`TreeClassifier.dump` prints the tree like this::
559
560    RM<6.941
561    |    LSTAT<14.400
562    |    |    DIS<1.385: 45.6
563    |    |    DIS>=1.385: 22.9
564    |    LSTAT>=14.400
565    |    |    CRIM<6.992: 17.1
566    |    |    CRIM>=6.992: 12.0
567    RM>=6.941
568    |    RM<7.437
569    |    |    CRIM<7.393: 33.3
570    |    |    CRIM>=7.393: 14.4
571    |    RM>=7.437
572    |    |    TAX<534.500: 45.9
573    |    |    TAX>=534.500: 21.9
574
Let us add the standard error in both internal nodes and leaves, and the
90% confidence intervals in the leaves::
577
578    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
579    root: [SE: 0.409]
580    |    RM<6.941: [SE: 0.306]
581    |    |    LSTAT<14.400: [SE: 0.320]
582    |    |    |    DIS<1.385: [SE: 4.420]:
583    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
584    |    |    |    DIS>=1.385: [SE: 0.244]:
585    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
586    |    |    LSTAT>=14.400: [SE: 0.333]
587    |    |    |    CRIM<6.992: [SE: 0.338]:
588    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
589    |    |    |    CRIM>=6.992: [SE: 0.448]:
590    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
591    |    RM>=6.941: [SE: 1.031]
592    |    |    RM<7.437: [SE: 0.958]
593    |    |    |    CRIM<7.393: [SE: 0.692]:
594    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
595    |    |    |    CRIM>=7.393: [SE: 2.157]:
596    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
597    |    |    RM>=7.437: [SE: 1.124]
598    |    |    |    TAX<534.500: [SE: 0.817]:
599    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
600    |    |    |    TAX>=534.500: [SE: 0.000]:
601    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]
602
What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
And since :samp:`%V` prints the value through the
:obj:`Orange.data.variable.Continuous` attribute's formatting,
the printed number has the same number of decimals
as in the data file.
611
612Regression trees cannot print the distributions in the same way
613as classification trees. They instead offer a set of operators for
614observing the number of examples within a certain range. For instance,
615let us check the number of examples with values below 22, and compare
616this number with values in the parent nodes::
617
618    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".")
619    root: 277.000 (.)
620    |    RM<6.941: 273.000 (1.160)
621    |    |    LSTAT<14.400: 107.000 (0.661)
622    |    |    |    DIS<1.385: 0.000 (0.000)
623    |    |    |    DIS>=1.385: 107.000 (1.020)
624    |    |    LSTAT>=14.400: 166.000 (1.494)
625    |    |    |    CRIM<6.992: 93.000 (0.971)
626    |    |    |    CRIM>=6.992: 73.000 (1.040)
627    |    RM>=6.941: 4.000 (0.096)
628    |    |    RM<7.437: 3.000 (1.239)
629    |    |    |    CRIM<7.393: 0.000 (0.000)
630    |    |    |    CRIM>=7.393: 3.000 (15.333)
631    |    |    RM>=7.437: 1.000 (0.633)
632    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)
634
The last line, for instance, says that among the examples with tax above
534, the proportion of examples with the class below 22 is 30 times
higher than the same proportion in its parent node.
638
For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percentages.
642
643::
644
645    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")
646
OK, let's observe the format string one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. With :samp:`%cbP![20, 22]` we divide this by the
same statistic computed on the parent. Add a :samp:`^` and you have the
percentages.
651
652::
653
654    root: 439.000 (.%)
655    |    RM<6.941: 364.000 (98%)
656    |    |    LSTAT<14.400: 200.000 (93%)
657    |    |    |    DIS<1.385: 5.000 (127%)
658    |    |    |    DIS>=1.385: 195.000 (99%)
659    |    |    LSTAT>=14.400: 164.000 (111%)
660    |    |    |    CRIM<6.992: 91.000 (96%)
661    |    |    |    CRIM>=6.992: 73.000 (105%)
662    |    RM>=6.941: 75.000 (114%)
663    |    |    RM<7.437: 46.000 (101%)
664    |    |    |    CRIM<7.393: 43.000 (100%)
665    |    |    |    CRIM>=7.393: 3.000 (100%)
666    |    |    RM>=7.437: 29.000 (98%)
667    |    |    |    TAX<534.500: 29.000 (103%)
668    |    |    |    TAX>=534.500: 0.000 (0%)
669
670
Defining Your Own Printout Functions
672====================================
673
674:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to print out
675some other information in the leaves or nodes. If provided,
676:obj:`userFormats` should contain a list of tuples with a regular expression
677and a callback function to be called when that expression is found in the
678format string. Expressions from :obj:`userFormats` are checked before
679the built-in expressions discussed above, so you can override the built-ins
680if you want to.
681
682The regular expression should describe a string like those we used above,
683for instance the string :samp:`%.2DbP`. When a leaf or internal node
684is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
685is checked for these regular expressions and when the match is found,
686the corresponding callback function is called.
687
688The callback function will get five arguments: the format string
689(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
690being printed, its parent (can be None) and the tree classifier.
691The function should return the format string in which the part described
692by the match object (that is, the part that is matched by the regular
693expression) is replaced by whatever information your callback function
694is supposed to give.
695
The function can use several utility functions provided in the module.
697
698.. autofunction:: insertStr
699
700.. autofunction:: insertDot
701
702.. autofunction:: insertNum
703
704.. autofunction:: byWhom
705
706
There are also a few regular expression fragments that you may want
to reuse. The two you are most likely to need are:
709
710.. autodata:: fs
711
712.. autodata:: by
713
714For a trivial example, :samp:`%V` is implemented like this. There is the
715following tuple in the list of built-in formats::
716
717    (re.compile("%V"), replaceV)
718
719:obj:`replaceV` is a function defined by::
720
721    def replaceV(strg, mo, node, parent, tree):
722        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
723
It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to :func:`insertStr` to do the replacement.
727
728A more complex regular expression is the one for the proportion of
729majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
730two partial expressions defined above.
731
Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):
735
736.. _orngTree2.py: code/orngTree2.py
737
738.. literalinclude:: code/orngTree2.py
739   :lines: 7-31
740
We first defined getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If we need to divide the quantity by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node by whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot; otherwise we call :func:`insertNum`, which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.
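
The gist of the included code, reconstructed from the description above
(a sketch; see `orngTree2.py`_ for the exact version)::

    import re
    import Orange
    from Orange.classification.tree import fs, by, byWhom, \
        insertDot, insertNum

    def getMargin(dist):
        # difference between the two largest class proportions
        if dist.abs < 1e-30:
            return 0.
        l = list(dist)
        l.sort()
        return (l[-1] - l[-2]) / dist.abs

    def replaceB(strg, mo, node, parent, tree):
        margin = getMargin(node.distribution)
        whose = mo.group("by")
        if whose:
            # divide by the margin of the node indicated by the 'by' group
            divider = byWhom(whose, parent, tree)
            if not (divider and divider.distribution):
                return insertDot(strg, mo)
            divMargin = getMargin(divider.distribution)
            if divMargin < 1e-30:
                return insertDot(strg, mo)
            margin /= divMargin
        return insertNum(strg, mo, margin)

    myFormat = [(re.compile("%" + fs + "B" + by), replaceB)]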
750
751
752We can now print out the iris tree:
753
754.. literalinclude:: code/orngTree2.py
755    :lines: 33
756
757And we get::
758
759    petal width<0.800: Iris-setosa 100% (100.00%)
760    petal width>=0.800
761    |    petal width<1.750
762    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
763    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
764    |    petal width>=1.750
765    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
766    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)
767
768
769Plotting the Tree using Dot
770===========================
771
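The module does not prescribe how the .dot file is produced. A minimal
hand-rolled exporter, using only the :obj:`Node` attributes documented
below, could look like this (a sketch, not part of the module)::

    def writeDot(tree, fileName):
        # export a tree to a Graphviz .dot file
        f = open(fileName, "wt")
        f.write("digraph classification_tree {\n")
        counter = [0]

        def walk(node):
            myName = "n%i" % counter[0]
            counter[0] += 1
            if node.branchSelector:
                # internal node: label it with the split attribute
                f.write('%s [label="%s"];\n' %
                        (myName, node.branchSelector.classVar.name))
                for description, branch in zip(node.branchDescriptions,
                                               node.branches):
                    if branch:
                        childName = "n%i" % counter[0]
                        walk(branch)
                        f.write('%s -> %s [label="%s"];\n' %
                                (myName, childName, description))
            else:
                # leaf: label it with the predicted value
                f.write('%s [label="%s"];\n' %
                        (myName, node.nodeClassifier.defaultValue))

        walk(tree.tree)
        f.write("}\n")
        f.close()

    writeDot(tree, "tree5.dot")
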
Suppose you saved the tree in a file "tree5.dot". You can then
render it as a GIF by executing the following command line
774
775::
776   
777    dot -Tgif tree5.dot -otree5.gif
778
GraphViz's dot supports quite a few other output formats; check
its documentation to learn which.
781
782
783
Classification trees are represented as a tree-like hierarchy of
:obj:`Node` objects.
786
787.. class:: Node
788
789    Node stores information about the learning examples belonging
790    to the node, a branch selector, a list of branches (if the node is
791    not a leaf) with their descriptions and strengths, and a classifier.
792
793    .. attribute:: distribution
794   
        Stores a distribution for learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`TreeLearnerBase`'s :obj:`storeDistributions` flag to false.
798
799    .. attribute:: contingency
800
801        Stores complete contingency matrices for the learning examples
802        belonging to the node. Storing contingencies can be enabled by
803        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`
804        flag to true. Note that even when the flag is not
805        set, the contingencies get computed and stored to
806        :obj:`Node`, but are removed shortly afterwards.
807        The details are given in the
808        description of the :obj:`TreeLearnerBase` object.
809
    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the
        corresponding ID of the weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.
820
    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if
        :obj:`TreeLearnerBase`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.
834
835    If the node is a leaf, the remaining fields are None.
836    If it's an internal node, there are several additional fields.
837
    .. attribute:: branches

        Stores a list of subtrees, given as :obj:`Node` objects.
        An element can be None; in this case the corresponding branch
        is empty.
842
843    .. attribute:: branchDescriptions
844
845        A list with string descriptions for branches, constructed by
846        :obj:`SplitConstructor`. It can contain different kinds
847        of descriptions, but basically, expect things like 'red' or '>12.3'.
848
    .. attribute:: branchSizes

        Gives the (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.
855
    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in the range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified into any branch, the selector
        can return an :obj:`Orange.data.Value` containing a special value
        (sVal) which should be a discrete distribution
        (DiscDistribution). This should represent the
        :obj:`branchSelector`'s opinion of how to divide the
        example between the branches. Whether the proposal will be
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).
871
872    The lists :obj:`branches`, :obj:`branchDescriptions` and
873    :obj:`branchSizes` are of the same length; all of them are
874    defined if the node is internal and none if it is a leaf.
875
876    .. method:: treeSize()
877       
878        Return the number of nodes in the subtrees (including the
879        node, excluding null-nodes).
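
    A similar traversal can, for instance, count only the leaves
    (a sketch using just the attributes described above)::

        def countLeaves(node):
            # null nodes count for nothing
            if not node:
                return 0
            # a node without a branchSelector is a leaf
            if not node.branchSelector:
                return 1
            return sum(countLeaves(branch) for branch in node.branches)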
880
881============
882Base classes
883============
884
.. class:: _TreeClassifier

    Classifies examples according to the tree stored in :obj:`tree`.
    Not meant to be used directly. The :class:`TreeLearner` class
    constructs a :class:`TreeClassifier`.
890
891    .. attribute:: tree
892
893        The root of the tree, represented as a :class:`Node`.
894   
895    Classification would be straightforward if there were no unknown
896    values or, in general, examples that cannot be placed into a
897    single branch. The response in such cases is determined by a
898    component :obj:`descender`.
899
    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is the most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.
909
    In general there are three possible outcomes of a descent.

    #. The descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them over by
       selecting a single branch and continuing the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node. Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` wants to split the example (i.e., to decide the
       class by voting). In this case, it returns a :obj:`Node` and
       the vote-weights for the branches. The weights can correspond
       to the distribution returned by :obj:`branchSelector`, to the
       number of learning examples that were assigned to each branch,
       or to something else.
929
    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.
938
939    When voting, subtrees do not predict the class but probabilities
940    of classes. The predictions are multiplied by weights, summed and
941    the most probable class is returned.
942
943    .. method:: vote()
944
945    .. method:: classDistribution()
946
    If you'd like to understand how the classification works in C++,
    start reading at :obj:`TTreeClassifier::vote`. It gets a
    :obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
    vote weights. For each node, it calls
    :obj:`TTreeClassifier::classDistribution` and then multiplies
    and sums the distributions. :obj:`vote` returns a normalized
    distribution of predictions.
954
    A new overload of :obj:`TTreeClassifier::classDistribution` gets
    an additional parameter, a :obj:`Node`. This is done
    for the sake of recursion. The normal version of
    :obj:`classDistribution` simply calls the overloaded version with
    the tree root as the additional parameter. :obj:`classDistribution`
    uses :obj:`descender`. If the descender reaches a leaf, it calls
    :obj:`nodeClassifier`, otherwise it calls :obj:`vote`.
962
    Thus, the :obj:`TreeClassifier`'s :obj:`vote` and
    :obj:`classDistribution` are written in the form of double
    recursion. The recursive calls do not happen at each node of the
    tree but only at nodes where a vote is needed (that is, at nodes
    where the descender halts).
968
    For predicting a class, :obj:`operator()` calls the
    descender. If it reaches a leaf, the class is predicted by the
    leaf's :obj:`nodeClassifier`. Otherwise, it calls
    :obj:`vote`. From there on, :obj:`vote` and
    :obj:`classDistribution` interweave down the tree and return
    a distribution of predictions. :obj:`operator()` then simply
    chooses the most probable class.
976
977.. class:: TreeLearnerBase
978
979    The main learning object is :obj:`TreeLearnerBase`. It is basically
980    a skeleton into which the user must plug the components for particular
981    functions. This class is not meant to be used directly. You should
982    rather use :class:`TreeLearner`.
983
    Components that govern the structure of the tree are :obj:`split`
    (of type :obj:`SplitConstructor`), :obj:`stop` (of
    type :obj:`StopCriteria`) and :obj:`exampleSplitter`
    (of type :obj:`ExampleSplitter`).
988
    .. attribute:: split

        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous attributes
        are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        and five examples in a leaf for continuous attributes.
998   
999    .. attribute:: stop
1000
1001        Object of type :obj:`StopCriteria`. The default stopping
1002        criterion stops induction when all examples in a node belong
1003        to the same class.
1004
1005    .. attribute:: splitter
1006
1007        Object of type :obj:`ExampleSplitter`. The default splitter
1008        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits
1009        the learning examples according to distributions given by the
1010        selector.
1011
1012    .. attribute:: contingencyComputer
1013   
1014        By default, this slot is left empty and ordinary contingency
1015        matrices are computed for examples at each node. If need arises,
1016        one can change the way the matrices are computed. This can be
1017        used to change the way that unknown values are treated when
1018        assessing qualities of attributes. As mentioned earlier,
1019        the computed matrices can be used by split constructor and by
1020        stopping criteria. On the other hand, they can be (and are)
1021        ignored by some splitting constructors.
1022
1023    .. attribute:: nodeLearner
1024
1025        Induces a classifier from examples belonging to a node. The
1026        same learner is used for internal nodes and for leaves. The
1027        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
1028
    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_UnknownMergeAsSelector`, which votes using
        the :obj:`branchSelector`'s distribution for vote weights.
1035
    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent infinite tree induction
        due to a misconfigured stop criterion. If you are sure you need
        larger trees, increase it. If you, on the other hand, want
        to lower this hard limit, you can do so as well.
1043
    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decide whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same distribution
        that is stored in :obj:`contingency.classes`.
1054
1055    The :obj:`TreeLearnerBase` first sets the defaults for missing
1056    components. Although stored in the actual :obj:`TreeLearnerBase`'s
1057    fields, they are removed when the induction is finished.
1058
    Then it ensures that the examples are stored in a table. This is
    needed because the algorithm juggles pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to ensure that the pointers
    remain pointing to examples even if the user later changes the
    example table. If they are in a table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.
1069
1070    Apriori class probabilities are computed. At this point we check
1071    the sum of example weights; if it's zero, there are no examples and
1072    we cannot proceed. A list of candidate attributes is set; in the
1073    beginning, all attributes are candidates for the split criterion.
1074
    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its
    arguments are a set of examples, a weight meta-attribute ID (a tricky
    thing, it can always be the same as the original or can change to
    accommodate splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).
1081
1082    The contingency matrix is computed next. This happens
1083    even if the flag :obj:`storeContingencies` is false.
1084    If the :obj:`contingencyComputer` is given we use it,
1085    otherwise we construct just an ordinary contingency matrix.
1086
    The :obj:`stop` is called to see whether it's worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if
    the :obj:`forceNodeClassifier` flag is set.
1091
1092    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
1093    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
1094    the given examples, weight ID and the just computed matrix. If
1095    the learner can use the matrix (and the default,
1096    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
1097    a choice of :obj:`contingencyComputer` will, in many cases,
1098    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
1099    return no classifier; if so and if the classifier would be
1100    needed for classification, the :obj:`TreeClassifier`'s function
1101    returns DK or an empty distribution. If you're writing your own
1102    tree classifier - pay attention.
1103
1104    If the induction is to continue, a :obj:`split` component is called.
1105    If it fails to return a branch selector, induction stops and the
1106    :obj:`Node` is returned.
1107
    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide
    the examples as described above.
1110
    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they wish.
1114
    The :obj:`TreeLearnerBase` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    the corresponding weight is used for each branch. In addition, the
    attribute spent by the splitter (if any) is removed from the
    list of candidates for the subtree.
1120
1121    A subset of examples is stored in its corresponding tree node,
1122    if so requested. If not, the new weight attributes are removed (if
1123    any were created).
1124
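
The procedure just described can be approximated in pure Python. The
following is only a rough sketch: it follows the call conventions
documented on this page, uses a plain dictionary instead of
:obj:`Node`, assumes the :obj:`Orange.statistics.contingency.Domain`
constructor for the domain contingency, and sidesteps
:obj:`ExampleSplitter` by simply skipping examples with unknown
values::

    import Orange

    def induceTree(examples, weightID, split, stop, nodeLearner):
        contingency = Orange.statistics.contingency.Domain(examples)
        node = {"classifier": nodeLearner(examples, weightID),
                "selector": None, "descriptions": None, "branches": None}
        if stop(examples, weightID, contingency):
            return node                  # make a leaf
        selector, descriptions, sizes, quality, spent = \
            split(examples, weightID)
        if not selector:
            return node                  # the split constructor vetoed
        node["selector"], node["descriptions"] = selector, descriptions
        # distribute the examples among the branches
        subsets = [Orange.data.Table(examples.domain)
                   for i in range(len(descriptions))]
        for ex in examples:
            # the selector returns a discrete Value; int() gives
            # the branch index
            branch = selector(ex)
            if not branch.isSpecial():
                subsets[int(branch)].append(ex)
        node["branches"] = [induceTree(subset, weightID, split,
                                       stop, nodeLearner)
                            if len(subset) else None
                            for subset in subsets]
        return node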
1125=======
1126Classes
1127=======
1128
Several classes described above are already functional and can
(and mostly will) be used as they are. Those classes are :obj:`Node`,
:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes
the other classes.
1133
Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner`
and :obj:`Classifier` are among the Orange classes that can be subtyped
in Python and have the call operator overloaded in such a way that it
is called back from C++ code. You can thus program your own components
for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed
information on how this is done, and what can go wrong, is given in a
separate page, dedicated to callbacks to Python XXXXXXXXXX.
1142
1143SplitConstructors
1144=====================
1145
1146Split construction is almost as exciting as waiting for a delayed flight.
1147Boring, that is. Split constructors are programmed as spaghetti code
1148that juggles with contingency matrices, with separate cases for discrete
1149and continuous classes... Most split constructors work either for
1150discrete or for continuous attributes. The suggested practice is
1151to use a :obj:`SplitConstructor_Combined` that can handle
1152both by simply delegating attributes to specialized split constructors.
1153
Note: split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)
1159
The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).
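
For instance, a classification tree that scores splits with gain ratio
on both kinds of attributes could be put together like this (a sketch;
it assumes that :obj:`SplitConstructor_Combined`'s two component slots
are named as shown)::

    import Orange

    data = Orange.data.Table("iris")

    learner = Orange.classification.tree.TreeLearner()
    learner.split = Orange.classification.tree.SplitConstructor_Combined(
        discreteSplitConstructor=
            Orange.classification.tree.SplitConstructor_Attribute(
                measure=Orange.feature.scoring.GainRatio()),
        continuousSplitConstructor=
            Orange.classification.tree.SplitConstructor_Threshold(
                measure=Orange.feature.scoring.GainRatio()))
    tree = learner(data)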
1164
1165.. class:: SplitConstructor
1166
    Finds a suitable criterion for dividing the learning (and later
    testing) examples coming to the node. The data it gets is a set of
    examples (and, optionally, an ID of the weight meta-attribute), a
    domain contingency computed from the examples, apriori class
    probabilities, a list of candidate attributes it should consider
    and a node classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).
1174
    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold; one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain
    contingency does not suffice, for example, when ReliefF is used as
    a measure of quality of attributes. In this case, there's no other
    way but to use the examples and ignore the precomputed contingencies.
1183
1184    The split constructor should consider only the attributes in the
1185    candidate list (the list is a vector of booleans, one for each
1186    attribute).
1187
1188    :obj:`SplitConstructor` returns most of the data we talked
1189    about when describing the :obj:`Node`. It returns a classifier
1190    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
1191    branch descriptions and a list with the number of examples that
1192    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`TreeLearnerBase` will find the number itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.
1198
1199    In addition, it returns a quality of the split; a number without
1200    any fixed meaning except that higher numbers mean better splits.
1201
1202    If the constructed splitting criterion uses an attribute in such
1203    a way that the attribute is 'completely spent' and should not be
1204    considered as a split criterion in any of the subtrees (the
1205    typical case of this are discrete attributes that are used
1206    as-they-are, that is, without any binarization or subsetting),
1207    then it should report the index of this attribute. Some splits
1208    do not spend any attribute; this is indicated by returning a
1209    negative index.
1210
    A :obj:`SplitConstructor` can veto further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.
1218
1219    .. attribute:: minSubset
1220
1221        Sets the minimal number of examples in non-null leaves. As
1222        always in Orange (where not specified otherwise), "number of
1223        examples" refers to the weighted number of examples.
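
    For example, to require at least five (weighted) examples in each
    branch, the component can be plugged into a learner as in the
    Learning section above (a sketch)::

        learner = Orange.classification.tree.TreeLearner()
        learner.split = \
            Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
        learner.split.minSubset = 5.0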
1224   
1225    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])
1226
        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`selector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.

        :param examples:  Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
            list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori_distribution: Should be of type
            :obj:`orange.Distribution`.
        :param candidates: A Python list of objects which are
            interpreted as booleans.
1243
1244
1245
1246.. class:: SplitConstructor_Measure
1247
1248    Bases: :class:`SplitConstructor`
1249
    An abstract base class for split constructors that employ
    a subclass of :obj:`Orange.feature.scoring.Measure` to assess the
    quality of a split. At present,
    all split constructors except for :obj:`SplitConstructor_Combined`
    are derived from this class.
1255
1256    .. attribute:: measure
1257
1258        A component of type :obj:`Orange.feature.scoring.Measure` used for
        evaluation of a split. Note that you must select a subclass of
        :obj:`Orange.feature.scoring.Measure` capable of handling your
1261        class type
1262        - you cannot use :obj:`Orange.feature.scoring.GainRatio`
1263        for building regression trees or :obj:`Orange.feature.scoring.MSE`
1264        for classification trees.
1265
1266    .. attribute:: worstAcceptable
1267
1268        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
1270        :obj:`measure` component. Default is 0.0.
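
A sketch of configuring such a constructor (the choice of the Gini
measure and of the threshold values is arbitrary)::

    import Orange

    split = Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    split.measure = Orange.feature.scoring.Gini()  # scoring of candidate splits
    split.worstAcceptable = 0.01  # reject splits with lower quality
    split.minSubset = 5           # branches need >= 5 (weighted) examples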
1271
1272.. class:: SplitConstructor_Attribute
1273
1274    Bases: :class:`SplitConstructor_Measure`
1275
1276    Attempts to use a discrete attribute as a split; each value of the
1277    attribute corresponds to a branch in the tree. Attributes are
1278    evaluated with the :obj:`measure` and the one with the
1279    highest score is used for a split. If there is more than one
    attribute with the highest score, one of them is selected at random.
1281
1282    The constructed :obj:`branchSelector` is an instance of
1283    :obj:`orange.ClassifierFromVarFD` that returns a value of the
1284    selected attribute. If the attribute is
1285    :obj:`Orange.data.variable.Discrete`,
1286    :obj:`branchDescription`'s are the attribute's values. The
1287    attribute is marked as spent, so that it cannot reappear in the
1288    node's subtrees.
1289
1290.. class:: SplitConstructor_ExhaustiveBinary
1291
1292    Bases: :class:`SplitConstructor_Measure`
1293
1294    Works on discrete attributes. For each attribute, it determines
1295    which binarization of the attribute gives the split with the
1296    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
1298    it returns one of those with the highest score.
1299
    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
1302    selected attribute. This time, however, its :obj:`transformer`
1303    contains an instance of :obj:`MapIntValue` that maps the values
1304    of the attribute into a binary attribute. Branch descriptions are
1305    of form "[<val1>, <val2>, ...<valn>]" for branches corresponding to
1306    more than one value of the attribute. Branches that correspond to
1307    a single value of the attribute are described with this value. If
1308    the attribute was originally binary, it is spent and cannot be
1309    used in the node's subtrees. Otherwise, it can reappear in the
1310    subtrees.
1311
1312
1313.. class:: SplitConstructor_Threshold
1314
1315    Bases: :class:`SplitConstructor_Measure`
1316
    This is currently the only constructor for splits on continuous
    attributes. It divides the range of the attribute's values with a
    threshold that maximizes the split's quality. As always, if there
    is more than one split with the highest score, a random threshold
    is selected. The attribute that yields the best binary split is
    returned.
1322
1323    The constructed :obj:`branchSelector` is again an instance of
1324    :obj:`orange.ClassifierFromVarFD` with an attached
1325    :obj:`transformer`. This time, :obj:`transformer` is of type
1326    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
1327    "<threshold" and ">=threshold". The attribute is not spent.
1328
1329.. class:: SplitConstructor_OneAgainstOthers
1330   
1331    Bases: :class:`SplitConstructor_Measure`
1332
1333    Undocumented.
1334
1335.. class:: SplitConstructor_Combined
1336
1337    Bases: :class:`SplitConstructor`
1338
1339    This constructor delegates the task of finding the optimal split
1340    to separate split constructors for discrete and for continuous
1341    attributes. Each split constructor is called, given only attributes
1342    of appropriate types as candidates. Both construct a candidate for
1343    a split; the better of them is selected.
1344
    (Note that there is a problem when more candidates have the same
    score. Let there be nine discrete attributes with the highest
    score; the split constructor for discrete attributes will select
    one of them. Now, let us suppose that there is a single continuous
    attribute with the same score. :obj:`SplitConstructor_Combined`
    would randomly select between the proposed discrete attribute and
    the continuous attribute, not aware of the fact that the discrete
    attribute has already competed with eight other discrete attributes.
    The probability for selecting (each) discrete attribute is therefore
    1/18 instead of the correct 1/10. Although not really correct, we
    doubt that this would affect the tree's performance; many other
    machine learning systems simply choose the first attribute with the
    highest score anyway.)
1358
    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
    the attribute is spent are decided by the winning split constructor.
1361
    .. attribute:: discreteSplitConstructor
1363
1364        Split constructor for discrete attributes; can be, for instance,
1365        :obj:`SplitConstructor_Attribute` or
1366        :obj:`SplitConstructor_ExhaustiveBinary`.
1367
    .. attribute:: continuousSplitConstructor
1369
1370        Split constructor for continuous attributes; at the moment, it
1371        can be either :obj:`SplitConstructor_Threshold` or a
1372        split constructor you programmed in Python.
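
For example, a combined constructor is assembled like this (this is
essentially what the :class:`TreeLearner` wrapper at the end of this
page does internally when :obj:`binarization` is set to 1)::

    import Orange

    split = Orange.classification.tree.SplitConstructor_Combined()
    split.discreteSplitConstructor = \
        Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    split.continuousSplitConstructor = \
        Orange.classification.tree.SplitConstructor_Threshold()

    # both sub-constructors usually share a single scoring measure
    measure = Orange.feature.scoring.GainRatio()
    split.discreteSplitConstructor.measure = measure
    split.continuousSplitConstructor.measure = measure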
1373
1379
1380
1381StopCriteria and StopCriteria_common
1382============================================
1383
:obj:`StopCriteria` determines when to stop the induction of subtrees.
1385
1386.. class:: StopCriteria
1387
1388    Given a set of examples, weight ID and contingency matrices, decide
1389    whether to continue the induction or not. The basic criterion checks
1390    whether there are any examples and whether they belong to at least
1391    two different classes (if the class is discrete). Derived components
1392    check things like the number of examples and the proportion of
1393    majority classes.
1394
1395    As opposed to :obj:`SplitConstructor` and similar basic classes,
1396    :obj:`StopCriteria` is not an abstract but a fully functional
1397    class that provides the basic stopping criteria. That is, the tree
1398    induction stops when there is at most one example left; in this case,
1399    it is not the weighted but the actual number of examples that counts.
1400    Besides that, the induction stops when all examples are in the same
1401    class (for discrete problems) or have the same value of the outcome
1402    (for regression problems).
1403
    .. method:: __call__(examples[, weightID, contingencies])
1405
1406        Decides whether to stop (true) or continue (false) the induction.
1407        If contingencies are given, they are used for checking whether
1408        the examples are in the same class (but not for counting the
        examples). If contingencies are not given, :obj:`StopCriteria`
        will work without them. Derived classes should use the
        contingencies whenever they are available, and compute them
        themselves only when they really need them.
1414
1415
1416
1417.. class:: StopCriteria_common
1418
    Contains additional criteria for pre-pruning: it checks the
    proportion of majority class and the (weighted) number of examples.
1422
    .. attribute:: maxMajority
1424
1425        Maximal proportion of majority class. When this is exceeded,
1426        induction stops.
1427
1428    .. attribute:: minExamples
1429
        Minimal number of examples in internal nodes. Subsets with fewer
        than :obj:`minExamples` examples are not split any further.
        The example count is weighted.
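
A sketch of plugging such a criterion into the base learner (the
threshold values are arbitrary)::

    import Orange

    stop = Orange.classification.tree.StopCriteria_common()
    stop.maxMajority = 0.8   # stop once 80% of examples share one class
    stop.minExamples = 5     # don't split (weighted) subsets smaller than 5

    learner = Orange.classification.tree.TreeLearnerBase()
    learner.stop = stop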
1433
1434.. class:: StopCriteria_Python
1435
1436    Undocumented.
1437
Classes derived from ExampleSplitter
====================================
1440
1441The :obj:`ExampleSplitter` sorts the learning examples into branches.
1442
1443.. class:: ExampleSplitter
1444
1445    Just like the :obj:`Descender` decides the branch for an
1446    example during classification, the :obj:`ExampleSplitter`
1447    sorts the learning examples into branches.
1448
    :obj:`ExampleSplitter` is given a :obj:`Node` (from which
    it can use various data, though most splitters only use the
    :obj:`branchSelector`), a set of examples to be divided, and
    the weight ID. The result is a list of subsets of examples
    and, optionally, a list of new weight ID's.
1454
    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most
    :obj:`ExampleSplitter`'s simply call the node's
    :obj:`branchSelector` and assign examples to the corresponding
    branches. When the value is unknown, they choose a particular
    branch or simply skip the example.
1460
    Some enhanced splitters can split examples: an example (actually,
    a pointer to it) is copied to more than one subset. To facilitate
    real splitting, weights are needed. Each branch is assigned a
    weight ID (each would usually have its own ID) and all examples
    that are in that branch (either completely or partially) should
    have this meta attribute. If an example hasn't been split, it
    has only one additional attribute - with the weight ID corresponding
    to the subset to which it went. An example that is split between,
    say, three subsets has three new meta attributes, one for each
    subset. The ID's of the weight meta attributes are returned by the
    :obj:`ExampleSplitter` to be used at induction of the
    corresponding subtrees.
1473
    Note that weights are used only when needed. When no splitting
    occurred - because the splitter is not able to do it or because
    there was no need for splitting - no weight ID's are returned.
1477
    To sum up, :obj:`ExampleSplitter` is an abstract base class for
    objects that split sets of examples into subsets. The derived
    classes differ in the treatment of examples which cannot be
    unambiguously placed into a single branch (usually due to an
    unknown value of the crucial attribute).
1482
1483    .. method:: __call__(node, examples[, weightID])
1484       
1485        Use the information in :obj:`node` (particularly the
1486        :obj:`branchSelector`) to split the given set of examples into subsets.
1487        Return a tuple with a list of example generators and a list of weights.
1488        The list of weights is either an ordinary python list of integers or
        None when no splitting of examples occurs and thus no weights are
1490        needed.
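
This module exports the splitters below under shortened aliases, such
as :obj:`Splitter_IgnoreUnknowns` (see the imports at the end of this
page). A sketch of plugging one into the base learner, assuming (as in
Orange's tree learner) that the splitter is exposed through the
:obj:`exampleSplitter` component::

    import Orange

    learner = Orange.classification.tree.TreeLearnerBase()
    # send examples with unknown attribute values down all branches
    learner.exampleSplitter = \
        Orange.classification.tree.Splitter_UnknownsToAll()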
1491
1492
1493.. class:: ExampleSplitter_IgnoreUnknowns
1494
1495    Bases: :class:`ExampleSplitter`
1496
1497    Simply ignores the examples for which no single branch can be determined.
1498
1499.. class:: ExampleSplitter_UnknownsToCommon
1500
1501    Bases: :class:`ExampleSplitter`
1502
    Places all such examples into the branch with the highest number of
    examples. If there is more than one such branch, one is selected at
    random and then used for all examples.
1506
1507.. class:: ExampleSplitter_UnknownsToAll
1508
1509    Bases: :class:`ExampleSplitter`
1510
1511    Places examples with unknown value of the attribute into all branches.
1512
1513.. class:: ExampleSplitter_UnknownsToRandom
1514
1515    Bases: :class:`ExampleSplitter`
1516
1517    Selects a random branch for such examples.
1518
1519.. class:: ExampleSplitter_UnknownsToBranch
1520
1521    Bases: :class:`ExampleSplitter`
1522
1523    Constructs an additional branch to contain all such examples.
1524    The branch's description is "unknown".
1525
1526.. class:: ExampleSplitter_UnknownsAsBranchSizes
1527
1528    Bases: :class:`ExampleSplitter`
1529
1530    Splits examples with unknown value of the attribute according to
1531    proportions of examples in each branch.
1532
1533.. class:: ExampleSplitter_UnknownsAsSelector
1534
1535    Bases: :class:`ExampleSplitter`
1536
1537    Splits examples with unknown value of the attribute according to
1538    distribution proposed by selector (which is in most cases the same
1539    as proportions of examples in branches).
1540
1541Descender and derived classes
1542=================================
1543
This is a classifier's counterpart for :obj:`ExampleSplitter`. It
decides the destiny of examples that need to be classified and cannot
be unambiguously put in a single branch. The detailed function of this
class is given in the description of classification with trees.
1548
1549.. class:: Descender
1550
1551    An abstract base object for tree descenders.
1552
1553    .. method:: __call__(node, example)
1554
        Descends the tree until it reaches a leaf or a node in
        which a vote of subtrees is required. In both cases, a tuple
        of two elements is returned; in the former case, the tuple contains
        the reached node and None, in the latter case it
        contains a node and the weights of votes for subtrees (a list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the returned
        vote weights.
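
A sketch of replacing the descender on an induced tree, assuming (as
in Orange's tree classifier) that it is exposed as a :obj:`descender`
attribute; note that this module exports the descenders under slightly
different aliases, such as :obj:`Descender_UnknownMergeAsBranchSizes`
(see the imports at the end of this page)::

    import Orange

    data = Orange.data.Table("iris")
    tc = Orange.classification.tree.TreeLearnerBase()(data)
    # when a branch cannot be chosen, let subtrees vote by branch sizes
    tc.descender = \
        Orange.classification.tree.Descender_UnknownMergeAsBranchSizes()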
1567
1568.. class:: Descender_UnknownsToNode
1569
1570    Bases: :obj:`Descender`
1571
    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`nodeClassifier`
    will be used to make a decision. It is your responsibility to see
    that even the internal nodes have their :obj:`nodeClassifier`
    (i.e., don't disable creating node classifiers or manually remove
    them after the induction).
1578
1579.. class:: Descender_UnknownsToBranch
1580
1581    Bases: :obj:`Descender`
1582
1583    Classifies examples with unknown value to a special branch. This
1584    makes sense only if the tree itself was constructed with
1585    :obj:`ExampleSplitter_UnknownsToBranch`.
1586
1587.. class:: Descender_UnknownsToCommonBranch
1588
1589    Bases: :obj:`Descender`
1590
    Classifies examples with unknown values to the branch with the
    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.
1594
1595.. class:: Descender_UnknownsToCommonSelector
1596
1597    Bases: :obj:`Descender`
1598
1599    Classifies examples with unknown values to the branch which received
1600    the highest recommendation by the selector.
1601
1602.. class:: Descender_MergeAsBranchSizes
1603
1604    Bases: :obj:`Descender`
1605
1606    Makes the subtrees vote for the example's class; the vote is
1607    weighted according to the sizes of the branches.
1608
1609.. class:: Descender_MergeAsSelector
1610
1611    Bases: :obj:`Descender`
1612
    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.
1615
1616Pruning
1617=======
1618
1619.. index::
1620    pair: classification trees; pruning
1621
1622Tree pruners derived from :obj:`Pruner` can be given either a
1623:obj:`Node` (presumably, but not necessarily a root) or a
1624:obj:`_TreeClassifier`. The result is a new :obj:`Node`
1625or a :obj:`_TreeClassifier` with a pruned tree. The original
1626tree remains intact.
1627
1628The pruners construct only a shallow copy of a tree.
The :obj:`Node`'s of the pruned tree contain references to the same
1630contingency matrices, node classifiers, branch selectors, ...
1631as the original tree. Thus, you may modify a pruned tree structure
1632(manually cut it, add new nodes, replace components) but modifying,
1633for instance, some node's :obj:`nodeClassifier` (a
1634:obj:`nodeClassifier` itself, not a reference to it!) would modify
1635the node's :obj:`nodeClassifier` in the corresponding node of
1636the original tree.
1637
1638Pruners cannot construct a
:obj:`nodeClassifier` nor merge the :obj:`nodeClassifier`'s of the pruned
1640subtrees into classifiers for new leaves. Thus, if you want to build
1641a prunable tree, internal nodes must have their :obj:`nodeClassifier`
1642defined. Fortunately, this is the default.
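
A short sketch of pruning an induced tree (`lenses` is just an example
data set; the calling convention follows :obj:`Pruner.__call__` below
and the use of :obj:`Pruner_m` in :class:`TreeLearner` at the end of
this page)::

    import Orange

    data = Orange.data.Table("lenses")
    tc = Orange.classification.tree.TreeLearnerBase()(data)

    pruned = Orange.classification.tree.Pruner_SameMajority(tc)
    pruned_m = Orange.classification.tree.Pruner_m(tc, m=2.0)
    # tc itself remains intact; the pruned trees share its components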
1643
1644.. class:: Pruner
1645
1646    This is an abstract base class which defines nothing useful, only
1647    a pure virtual call operator.
1648
1649    .. method:: __call__(tree)
1650
1651        Prunes a tree. The argument can be either a tree classifier or
1652        a tree node; the result is of the same type as the argument.
1653
1654.. class:: Pruner_SameMajority
1655
1656    Bases: :class:`Pruner`
1657
    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
    with more than one leaf) in which all the leaves have the same majority
    class. (This is allowed because those leaves can still have different
    distributions of classes and thus predict different probabilities.)
    However, this can be undesirable when we're only interested in the
1663    class prediction or a simple tree interpretation. The
1664    :obj:`Pruner_SameMajority` prunes the tree so that there is no
1665    subtree in which all the nodes would have the same majority class.
1666
1667    This pruner will only prune the nodes in which the node classifier
1668    is of class :obj:`orange.DefaultClassifier` (or from a derived class).
1669
1670    Note that the leaves with more than one majority class require some
1671    special handling. The pruning goes backwards, from leaves to the root.
1672    When siblings are compared, the algorithm checks whether they
1673    have (at least one) common majority class. If so, they can be pruned.
1674
1675.. class:: Pruner_m
1676
1677    Bases: :class:`Pruner`
1678
1679    Prunes a tree by comparing m-estimates of static and dynamic
1680    error as defined in (Bratko, 2002).
1681
1682    .. attribute:: m
1683
1684        Parameter m for m-estimation.
1685
1686Undocumented
1687============
1688
1689.. class:: NodeList
1690
1691    Undocumented.
1692
1693.. class:: C45NodeList
1694
1695    Undocumented.
1696
1697===========================
1698C4.5 Classifier and Learner
1699===========================
1700
1701As C4.5 is a standard benchmark in machine learning,
1702it is incorporated in Orange, although Orange has its own
1703implementation of decision trees.
1704
1705The implementation uses the original Quinlan's code for learning so the
tree you get is exactly like the one that would be built by the standalone
1707C4.5. Upon return, however, the original tree is copied to Orange
1708components that contain exactly the same information plus what is needed
1709to make them visible from Python. To be sure that the algorithm behaves
1710just as the original, we use a dedicated class :class:`C45Node`
1711instead of reusing the components used by Orange's tree inducer
(i.e., :class:`Node`). This, however, could be done and probably
1713will be done in the future; we shall still retain :class:`C45Node`
1714but offer transformation to :class:`Node` so that routines
1715that work on Orange trees will also be usable for C45 trees.
1716
1717:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most of Orange's
learning algorithms, C4.5 does not accept weighted examples.
1720
1721Building the C4.5 plug-in
1722=========================
1723
1724We haven't been able to obtain the legal rights to distribute
1725C4.5 and therefore couldn't statically link it into Orange. Instead,
1726it's incorporated as a plug-in which you'll need to build yourself.
1727The procedure is trivial, but you'll need a C compiler. On Windows,
1728the scripts we provide work with MS Visual C and the files CL.EXE
1729and LINK.EXE must be on the PATH. On Linux you're equipped with
1730gcc. Mac OS X comes without gcc, but you can download it for
1731free from Apple.
1732
1733Orange must be installed prior to building C4.5. (This is because
1734the build script will copy the created file next to Orange,
1735which it obviously can't if Orange isn't there yet.)
1736
1737#. Download the
1738   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `RuleQuest's site <http://www.rulequest.com/>`_ and extract
   them into some temporary directory. The files will be modified in
   the process, so don't use a copy of Quinlan's sources that you
1742   need for another purpose.
1743#. Download
1744   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
   and unzip its contents into the directory R8/Src of Quinlan's
   sources (it's the directory that contains, for instance, the file
1747   average.c).
1748#. Run buildC45.py, which will build the plug-in and put it next to
1749   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
   If this fails, something went wrong; see the diagnostic messages from
   buildC45.py and read the paragraph below.
#. Finally, you can remove Quinlan's sources, along with everything
1755   created by buildC45.py.
1756
1757If the process fails, here's what buildC45.py really does: it creates
1758.h files that wrap Quinlan's .i files and ensure that they are not
1759included twice. It modifies C4.5 sources to include .h's instead of
.i's. This step can hardly fail. Then follows the platform-dependent
1761step which compiles ensemble.c (which includes all the Quinlan's .c
1762files it needs) into c45.dll or c45.so and puts it next to Orange.
1763If this fails, but you do have a C compiler and linker, and you know
1764how to use them, you can compile the ensemble.c into a dynamic
1765library yourself. See the compile and link steps in buildC45.py,
if it helps. Anyway, after doing this, check that the built C4.5
1767gives the same results as the original.
1768
1769.. class:: C45Learner
1770
1771    :class:`C45Learner`'s attributes have double names - those that
1772    you know from C4.5 command lines and the corresponding names of C4.5's
1773    internal variables. All defaults are set as in C4.5; if you change
1774    nothing, you are running C4.5.
1775
1776    .. attribute:: gainRatio (g)
1777       
        Determines whether to use information gain (false, default)
1779        or gain ratio for selection of attributes (true).
1780
1781    .. attribute:: batch (b)
1782
1783        Turn on batch mode (no windows, no iterations); this option is
1784        not documented in C4.5 manuals. It conflicts with "window",
1785        "increment" and "trials".
1786
1787    .. attribute:: subset (s)
1788       
        Enables subsetting (default: false, no subsetting).
1790 
1791    .. attribute:: probThresh (p)
1792
1793        Probabilistic threshold for continuous attributes (default: false).
1794
1795    .. attribute:: minObjs (m)
1796       
1797        Minimal number of objects (examples) in leaves (default: 2).
1798
1799    .. attribute:: window (w)
1800
        Initial window size (default: maximum of 20% and twice the
1802        square root of the number of data objects).
1803
1804    .. attribute:: increment (i)
1805
1806        The maximum number of objects that can be added to the window
1807        at each iteration (default: 20% of the initial window size).
1808
1809    .. attribute:: cf (c)
1810
        Pruning confidence level (default: 25%).
1812
1813    .. attribute:: trials (t)
1814
1815        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1816
1817    .. attribute:: prune
1818       
        Return a pruned tree (not an original C4.5 option; default: true).
1820
1821
:class:`C45Learner` also offers another way for setting
the arguments: it provides a function :obj:`C45Learner.commandLine`
which is given a string and parses it the same way as C4.5 would
parse its command line.
1826
1827.. class:: C45Classifier
1828
1829    A faithful reimplementation of Quinlan's function from C4.5. The only
1830    difference (and the only reason it's been rewritten) is that it uses
1831    a tree composed of :class:`C45Node` instead of C4.5's
1832    original tree structure.
1833
1834    .. attribute:: tree
1835
1836        C4.5 tree stored as a tree of :obj:`C45Node`.
1837
1838
1839.. class:: C45Node
1840
1841    This class is a reimplementation of the corresponding *struct* from
1842    Quinlan's C4.5 code.
1843
1844    .. attribute:: nodeType
1845
1846        Type of the node:  :obj:`C45Node.Leaf` (0),
1847        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
1848        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
1849        split examples based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute, and "subsets" use discrete attributes but with subsetting
1852        so that several values can go into the same branch.
1853
1854    .. attribute:: leaf
1855
1856        Value returned by that leaf. The field is defined for internal
1857        nodes as well.
1858
1859    .. attribute:: items
1860
1861        Number of (learning) examples in the node.
1862
1863    .. attribute:: classDist
1864
1865        Class distribution for the node (of type
1866        :obj:`orange.DiscDistribution`).
1867
1868    .. attribute:: tested
1869       
        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.
1874
1875    .. attribute:: cut
1876
1877        A threshold for continuous attributes, if node is of type :obj:`Cut`.
1878        Undefined otherwise.
1879
1880    .. attribute:: mapping
1881
        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the index of the branch for an example whose value of
        :obj:`tested` is *i*. Here, *i* denotes an index of the value,
        not a :class:`Orange.data.Value`.
1885
1886    .. attribute:: branch
1887       
1888        A list of branches stemming from this node.
1889
1890Examples
1891========
1892
1893.. _tree_c45.py: code/tree_c45.py
1894.. _iris.tab: code/iris.tab
1895
1896The simplest way to use :class:`C45Learner` is to call it. This
1897script constructs the same learner as you would get by calling
1898the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):
1899
1900.. literalinclude:: code/tree_c45.py
1901   :lines: 7-14
1902
Arguments can be set by the usual mechanism (the two lines below do the
1904same, except that one uses command-line symbols and the other internal
1905variable names)
1906
1907::
1908
1909    tree = Orange.classification.tree.C45Learner(data, m=100)
1910    tree = Orange.classification.tree.C45Learner(data, minObjs=100)
1911
The way that could be preferred by a veteran C4.5 user might be through
the method :obj:`C45Learner.commandLine`.
1914
1915::
1916
    lrn = Orange.classification.tree.C45Learner()
    lrn.commandLine("-m 1 -s")
1919    tree = lrn(data)
1920
There's nothing special about using :obj:`C45Classifier` - it's
just like any other classifier. To demonstrate what the structure of
:class:`C45Node`'s looks like, we will show a script that prints
1924it out in the same format as C4.5 does.
1925
1926.. literalinclude:: code/tree_c45_printtree.py
1927
1928Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
1930:obj:`C45Node` does not know to which attribute it belongs), we need to
1931convert it to a string through :obj:`classVar`, which is passed as an
1932extra argument to the recursive part of printTree.
1933
1934For discrete splits without subsetting, we print out all attribute values
1935and recursively call the function for all branches. Continuous splits are
1936equally easy to handle.
1937
For discrete splits with subsetting, we iterate through branches, retrieve
the corresponding values that go into each branch into :samp:`inset`, turn
1940the values into strings and print them out, separately treating the
1941case when only a single value goes into the branch.
1942
1943Printing out C45 Tree
1944=====================
1945
1946.. autofunction:: printTreeC45
1947
1948TODO
1949====
1950
1951This page does not provide examples for programming your own components,
1952such as, for instance, a :obj:`SplitConstructor`. Those examples
1953can be found on a page dedicated to callbacks to Python.
1954
1955References
1956==========
1957
1958Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`, Addison
1959Wesley, 2002.
1960
Koutsofios, E., North, S.C. (1993). `Drawing Graphs with dot`. AT&T Bell
Laboratories, Murray Hill NJ, U.S.A., October 1993.

`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_:
a home page of AT&T's dot and similar software packages.
1966
1967"""
1968
1969from Orange.core import \
1970     TreeLearner as TreeLearnerBase, \
1971         TreeClassifier as _TreeClassifier, \
1972         C45Learner, \
1973         C45Classifier, \
1974         C45TreeNode as C45Node, \
1975         C45TreeNodeList as C45NodeList, \
1976         TreeDescender as Descender, \
1977              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
1978              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
1979              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
1980              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
1981              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
1982         TreeExampleSplitter as Splitter, \
1983              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
1984              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
1985              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
1986              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
1987              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
1988              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
1989              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
1990         TreeNode as Node, \
1991         TreeNodeList as NodeList, \
1992         TreePruner as Pruner, \
1993              TreePruner_SameMajority as Pruner_SameMajority, \
1994              TreePruner_m as Pruner_m, \
1995         TreeSplitConstructor as SplitConstructor, \
1996              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
1997              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
1998                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
1999                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
2000                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
2001                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
2002         TreeStopCriteria as StopCriteria, \
2003              TreeStopCriteria_Python as StopCriteria_Python, \
2004              TreeStopCriteria_common as StopCriteria_common
2005
2006import Orange.core
2007import operator
2008import base64
2009import re
2010import Orange.data
2011import Orange.feature.scoring
2012import Orange.classification.tree
2013
2014def _c45_showBranch(node, classvar, lev, i):
2015    var = node.tested
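    # nodeType codes (see C45Node above): 0 = Leaf, 1 = Branch (discrete),
    # 2 = Cut (continuous threshold), 3 = Subset (grouped discrete values)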
2016    if node.nodeType == 1:
2017        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
2018        _c45_printTree0(node.branch[i], classvar, lev+1)
2019    elif node.nodeType == 2:
2020        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
2021        _c45_printTree0(node.branch[i], classvar, lev+1)
2022    else:
2023        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
2024        inset = [var.values[j[0]] for j in inset]
2025        if len(inset)==1:
2026            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
2027        else:
2028            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
2029        _c45_printTree0(node.branch[i], classvar, lev+1)
2030       
2031       
2032def _c45_printTree0(node, classvar, lev):
2033    var = node.tested
2034    if node.nodeType == 0:
2035        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
2036    else:
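        # like C4.5's own printout: leaf branches first, then subtrees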
2037        for i, branch in enumerate(node.branch):
2038            if not branch.nodeType:
2039                _c45_showBranch(node, classvar, lev, i)
2040        for i, branch in enumerate(node.branch):
2041            if branch.nodeType:
2042                _c45_showBranch(node, classvar, lev, i)
2043
2044def printTreeC45(tree):
2045    """
2046    Prints the tree given as an argument in the same form as Ross Quinlan's
2047    C4.5 program.
2048
2049    ::
2050
2051        import Orange
2052
2053        data = Orange.data.Table("voting")
2054        c45 = Orange.classification.tree.C45Learner(data)
2055        Orange.classification.tree.printTreeC45(c45)
2056
2057    will print out
2058
2059    ::
2060
2061        physician-fee-freeze = n: democrat (253.4)
2062        physician-fee-freeze = y:
2063        |   synfuels-corporation-cutback = n: republican (145.7)
2064        |   synfuels-corporation-cutback = y:
2065        |   |   mx-missile = y: democrat (6.0)
2066        |   |   mx-missile = n:
2067        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
2068        |   |   |   adoption-of-the-budget-resolution = y:
2069        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
2070        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
2071
2072
    If you run the original C4.5 (that is, the standalone C4.5 program;
    Orange does use the original C4.5 code) on the same data, it will print out
2074
2075    ::
2076
2077        physician-fee-freeze = n: democrat (253.4/5.9)
2078        physician-fee-freeze = y:
2079        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
2080        |   synfuels-corporation-cutback = y:
2081        |   |   mx-missile = y: democrat (6.0/2.4)
2082        |   |   mx-missile = n:
2083        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
2084        |   |   |   adoption-of-the-budget-resolution = y:
2085        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
2086        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
2087
    which is reassuringly similar, except that C4.5 tested the tree on
    the learning data and has also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously can't do
    (nor is there any need for it to).
2092
2093    """
2094    _c45_printTree0(tree.tree, tree.classVar, 0)
2095
2096
2097import Orange.feature.scoring as fscoring
2098
2099class TreeLearner(Orange.core.Learner):
2100    """
2101    Assembles the generic classification or regression tree learner
2102    (from Orange's objects for induction of decision trees).
2103    :class:`TreeLearner` is essentially a wrapper
2104    around :class:`TreeLearnerBase`, provided for easier use of the latter.
2105    It sets a number of parameters used in induction that
2106    can also be set after the creation of the object, most often through
2107    the object's attributes. If upon initialization
2108    :class:`TreeLearner` is given a set of examples, then an instance
2109    of :class:`TreeClassifier` object is returned instead.
    of :class:`TreeClassifier` is returned instead.
    The values of attributes can also be set in the constructor.
2112
2113    .. attribute:: nodeLearner
2114
2115        Induces a classifier from examples belonging to a node. The
2116        same learner is used for internal nodes and for leaves. The
2117        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
2118
2119    **Split construction**
2120
2121    .. attribute:: split
2122       
2123        Defines a function that will be used in place of
2124        :obj:`SplitConstructor`.
2125        Useful when prototyping new tree induction
2126        algorithms. When this parameter is defined, other parameters that
2127        affect the procedures for growing of the tree are ignored. These
2128        include :obj:`binarization`, :obj:`measure`,
2129        :obj:`worstAcceptable` and :obj:`minSubset` (Default:
        :class:`SplitConstructor_Combined`
2131        with separate constructors for discrete and continuous attributes.
2132        Discrete attributes are used as they are, while
2133        continuous attributes are binarized.
2134        Gain ratio is used to select attributes.
2135        A minimum of two examples in a leaf is required for
2136        discrete and five examples in a leaf for continuous attributes.)
2137
2138    .. attribute:: binarization
2139
2140        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
2141        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
2142        0, do not use binarization (use :class:`SplitConstructor_Attribute`).
2143        Default: 0.
2144
2145    .. attribute:: measure
2146   
2147        Measure for scoring of the attributes when deciding which of the
2148        attributes will be used for splitting of the example set in the node.
2149        Can be either a :class:`Orange.feature.scoring.Measure` or one of
2150        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
2151        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
2152        "gini" (:class:`Orange.feature.scoring.Gini`),
2153        "relief" (:class:`Orange.feature.scoring.Relief`),
2154        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
2155
2156    .. attribute:: reliefM, reliefK
2157
        Set `m` and `k` to the given values if the :obj:`measure` is Relief.
2159
2160    **Pruning**
2161
2162    .. attribute:: worstAcceptable
2163
2164        Used in pre-pruning, sets the lowest required attribute
2165        score. If the score of the best attribute is below this margin, the
2166        tree at that node is not grown further (default: 0).
2167
2168        So, to allow splitting only when gainRatio (the default measure)
2169        is greater than 0.6, set :samp:`worstAcceptable=0.6`.
2170
2171    .. attribute:: minSubset
2172
2173        Minimal number of examples in non-null leaves (default: 0).
2174
2175    .. attribute:: minExamples
2176
2177        Data subsets with less than :obj:`minExamples`
2178        examples are not split any further, that is, all leaves in the tree
        will contain at least that many examples (default: 0).
2180
2181    .. attribute:: maxDepth
2182
2183        Gives maximal tree depth;  0 means that only root is generated.
2184        The default is 100.
2185
2186    .. attribute:: maxMajority
2187
2188        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0).
2190        To stop the induction as soon as the majority class reaches 70%,
2191        you should use :samp:`maxMajority=0.7`, as in the following
2192        example. The numbers show the majority class
2193        proportion at each node. The script `tree2.py`_ induces and
2194        prints this tree.
2195
2196        .. _tree2.py: code/tree2.py
2197
2198        ::
2199
2200            root: 0.333
2201            |    petal width<=0.800: 1.000
2202            |    petal width>0.800: 0.500
2203            |    |    petal width<=1.750: 0.907
2204            |    |    petal width>1.750: 0.978
2205   
2206    .. attribute:: stop
2207
2208        Used for passing a function which is used in place of
2209        :class:`StopCriteria`. Useful when prototyping new
2210        tree induction algorithms. See a documentation on
2211        :class:`StopCriteria` for more info on this function.
        When used, parameters :obj:`maxMajority` and :obj:`minExamples`
        will not be considered (default: None).
2214        The default stopping criterion stops induction when all examples
2215        in a node belong to the same class.
2216
2217    .. attribute:: mForPruning
2218
2219        If non-zero, invokes an error-based bottom-up post-pruning,
2220        where m-estimate is used to estimate class probabilities
2221        (default: 0).
2222
2223    .. attribute:: sameMajorityPruning
2224
2225        If true, invokes a bottom-up post-pruning by removing the
2226        subtrees of which all leaves classify to the same class
2227        (default: False).
2228
2229    **Record keeping**
2230
2231    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
2232
        Determines whether to store class distributions, contingencies and
        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. By default everything except
        :obj:`storeExamples` is enabled. You won't save any memory by not
        storing distributions but storing contingencies, since distributions
        actually point to the same distribution that is stored in
        :obj:`contingency.classes`. (Default: True for all except
        :obj:`storeExamples`, which defaults to False.)
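
    **Example**

    A sketch of typical use, relying only on the attributes documented
    above (the data set and the parameter values are arbitrary)::

        import Orange

        data = Orange.data.Table("iris")
        learner = Orange.classification.tree.TreeLearner(
            measure="gini", binarization=1, maxDepth=5, mForPruning=2)
        classifier = learner(data)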
2241   
2242    """
2243    def __new__(cls, examples = None, weightID = 0, **argkw):
2244        self = Orange.core.Learner.__new__(cls, **argkw)
2245        if examples:
2246            self.__init__(**argkw)
2247            return self.__call__(examples, weightID)
2248        else:
2249            return self
2250     
2251    def __init__(self, **kw):
2252        self.split = None
2253        self.stop = None
2254        self.measure = None
2255        self.__dict__.update(kw)
2256     
2257    def __call__(self, examples, weight=0):
2258        """
2259        Return a classifier from the given examples.
2260        """
2261        bl = self._base_learner()
2262
        #choose the default scoring measure if neither a split
        #constructor nor a measure was set by the user
2265        if not self.split and not self.measure:
2266            measure = fscoring.GainRatio() \
2267                if examples.domain.classVar.varType == Orange.data.Type.Discrete \
2268                else fscoring.MSE()
2269            bl.split.continuousSplitConstructor.measure = measure
2270            bl.split.discreteSplitConstructor.measure = measure
2271         
2272        #post pruning
2273        tree = bl(examples, weight)
2274        if getattr(self, "sameMajorityPruning", 0):
2275            tree = Pruner_SameMajority(tree)
2276        if getattr(self, "mForPruning", 0):
2277            tree = Pruner_m(tree, m=self.mForPruning)
2278
2279        return TreeClassifier(baseClassifier=tree) 
2280
2281    def instance(self):
2282        """
2283        DEPRECATED. Return a base learner - an object
2284        of :class:`TreeLearnerBase`.
2285        This method is left for backwards compatibility.
2286        """
2287        return self._base_learner()
2288
2289    def build_split(self):
2290        """
        Return the split constructor built according to the object's attributes.
2292        """
2293       
2294        split = self.split
2295
2296        if not split:
2297            split = SplitConstructor_Combined()
2298            split.continuousSplitConstructor = \
2299                SplitConstructor_Threshold()
2300            binarization = getattr(self, "binarization", 0)
2301            if binarization == 1:
2302                split.discreteSplitConstructor = \
2303                    SplitConstructor_ExhaustiveBinary()
2304            elif binarization == 2:
2305                split.discreteSplitConstructor = \
2306                    SplitConstructor_OneAgainstOthers()
2307            else:
2308                split.discreteSplitConstructor = \
2309                    SplitConstructor_Feature()
2310
2311            measures = {"infoGain": fscoring.InfoGain,
2312                "gainRatio": fscoring.GainRatio,
2313                "gini": fscoring.Gini,
2314                "relief": fscoring.Relief,
2315                "retis": fscoring.MSE
2316                }
2317
2318            measure = self.measure
2319            if isinstance(measure, str):
2320                measure = measures[measure]()
2321            if not measure:
2322                measure = fscoring.GainRatio()
2323
2324            measureIsRelief = isinstance(measure, fscoring.Relief)
2325            relM = getattr(self, "reliefM", None)
2326            if relM and measureIsRelief:
2327                measure.m = relM
2328           
2329            relK = getattr(self, "reliefK", None)
2330            if relK and measureIsRelief:
2331                measure.k = relK
2332
2333            split.continuousSplitConstructor.measure = measure
2334            split.discreteSplitConstructor.measure = measure
2335
2336            wa = getattr(self, "worstAcceptable", 0)
2337            if wa:
2338                split.continuousSplitConstructor.worstAcceptable = wa
2339                split.discreteSplitConstructor.worstAcceptable = wa
2340
2341            ms = getattr(self, "minSubset", 0)
2342            if ms:
2343                split.continuousSplitConstructor.minSubset = ms
2344                split.discreteSplitConstructor.minSubset = ms
2345
2346        return split
2347
2348    def build_stop(self):
2349        """
        Return the stop criteria built according to the object's attributes.
2351        """
2352        stop = self.stop
2353        if not stop:
2354            stop = Orange.classification.tree.StopCriteria_common()
2355            mm = getattr(self, "maxMajority", 1.0)
2356            if mm < 1.0:
2357                stop.maxMajority = self.maxMajority
2358            me = getattr(self, "minExamples", 0)
2359            if me:
2360                stop.minExamples = self.minExamples
2361        return stop
2362
2363    def _base_learner(self):
2364        learner = TreeLearnerBase()
2365
2366        learner.split = self.build_split()
2367        learner.stop = self.build_stop()
2368
2369        for a in ["storeDistributions", "storeContingencies", "storeExamples", 
2370            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
2371            if hasattr(self, a):
2372                setattr(learner, a, getattr(self, a))
2373
2374        return learner
2375
2376# the following is for the output
2377
2378
2379fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2380""" Defines the multiplier by 100 (:samp:`^`) and the format
2381for the number of decimals (e.g. :samp:`5.3`). The corresponding
2382groups are named :samp:`m100` and :samp:`fs`. """
2383
2384by = r"(?P<by>(b(P|A)))?"
2385""" Defines bP or bA or nothing; the result is in groups by. """
2386
2387bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2388opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2389opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2390intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2391fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2392re_V = re.compile("%V")
2393re_N = re.compile("%"+fs+"N"+by)
2394re_M = re.compile("%"+fs+"M"+by)
2395re_m = re.compile("%"+fs+"m"+by)
2396re_Ccont = re.compile("%"+fs+"C"+by+opc)
2397re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2398re_ccont = re.compile("%"+fs+"c"+by+opc)
2399re_cdisc = re.compile("%"+fs+"c"+by+opd)
2400re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2401re_cconti = re.compile("%"+fs+"c"+by+fromto)
2402re_D = re.compile("%"+fs+"D"+by)
2403re_d = re.compile("%"+fs+"d"+by)
2404re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2405re_I = re.compile("%"+fs+"I"+intrvl)
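# For example, re_N matches "%6.2NbP" (fs="6.2" sets width and decimals,
# by="bP" divides by the parent's quantity), and re_m matches "%^.0mbA"
# (m100="^" multiplies by 100, i.e. prints a percentage).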
2406
2407def insertStr(s, mo, sub):
2408    """ Replace the part of s which is covered by mo
2409    with the string sub. """
2410    return s[:mo.start()] + sub + s[mo.end():]
2411
2412
2413def insertDot(s, mo):
2414    """ Replace the part of s which is covered by mo
2415    with a dot.  You should use this when the
2416    function cannot compute the desired quantity; it is called, for instance,
2417    when it needs to divide by something in the parent, but the parent
2418    doesn't exist.
2419    """
2420    return s[:mo.start()] + "." + s[mo.end():]
2421
2422def insertNum(s, mo, n):
2423    """ Replace the part of s matched by mo with the number n,
2424    formatted as specified by the user, that is, it multiplies
2425    it by 100, if needed, and prints with the right number of
2426    places and decimals. It does so by checking the mo
2427    for a group named m100 (representing the :samp:`^` in the format string)
2428    and a group named fs representing the part giving the number o
2429    f decimals (e.g. :samp:`5.3`).
2430    """
2431    grps = mo.groupdict()
2432    m100 = grps.get("m100", None)
2433    if m100:
2434        n *= 100
2435    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2436    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
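# e.g. insertNum("%5.3N", re_N.search("%5.3N"), 12) gives "12.000", and
# insertNum("%^.0m", re_m.search("%^.0m"), 0.25) gives "25" ("^" scales by 100)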
2437
2438def byWhom(by, parent, tree):
    """ If by equals "bP", it returns parent, else it returns
2440    :samp:`tree.tree`. This is used to find what to divide the quantity
2441    with, when division is required.
2442    """
2443    if by=="bP":
2444        return parent
2445    else:
2446        return tree.tree
2447
2448def replaceV(strg, mo, node, parent, tree):
2449    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
2450
2451def replaceN(strg, mo, node, parent, tree):
2452    by = mo.group("by")
2453    N = node.distribution.abs
2454    if by:
2455        whom = byWhom(by, parent, tree)
2456        if whom and whom.distribution:
2457            if whom.distribution.abs > 1e-30:
2458                N /= whom.distribution.abs
2459        else:
2460            return insertDot(strg, mo)
2461    return insertNum(strg, mo, N)
2462       
2463
2464def replaceM(strg, mo, node, parent, tree):
2465    by = mo.group("by")
2466    maj = int(node.nodeClassifier.defaultValue)
2467    N = node.distribution[maj]
2468    if by:
2469        whom = byWhom(by, parent, tree)
2470        if whom and whom.distribution:
2471            if whom.distribution[maj] > 1e-30:
2472                N /= whom.distribution[maj]
2473        else:
2474            return insertDot(strg, mo)
2475    return insertNum(strg, mo, N)
2476       
2477
2478def replacem(strg, mo, node, parent, tree):
2479    by = mo.group("by")
2480    maj = int(node.nodeClassifier.defaultValue)
2481    if node.distribution.abs > 1e-30:
2482        N = node.distribution[maj] / node.distribution.abs
        if by:
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
2485                byN = whom.distribution[maj] / whom.distribution.abs
2486                if byN > 1e-30:
2487                    N /= byN
2488            else:
2489                return insertDot(strg, mo)
2490    else:
2491        N = 0.
2492    return insertNum(strg, mo, N)
2493
2494
2495def replaceCdisc(strg, mo, node, parent, tree):
2496    if tree.classVar.varType != Orange.data.Type.Discrete:
2497        return insertDot(strg, mo)
2498   
2499    by, op, cls = mo.group("by", "op", "cls")
2500    N = node.distribution[cls]
2501    if op == "!=":
2502        N = node.distribution.abs - N
2503    if by:
2504        whom = byWhom(by, parent, tree)
2505        if whom and whom.distribution:
2506            if whom.distribution[cls] > 1e-30:
2507                N /= whom.distribution[cls]
2508        else:
2509            return insertDot(strg, mo)
2510    return insertNum(strg, mo, N)
2511
2512   
2513def replacecdisc(strg, mo, node, parent, tree):
2514    if tree.classVar.varType != Orange.data.Type.Discrete:
2515        return insertDot(strg, mo)
2516   
2517    op, by, cls = mo.group("op", "by", "cls")
2518    N = node.distribution[cls]
2519    if node.distribution.abs > 1e-30:
2520        N /= node.distribution.abs
2521        if op == "!=":
2522            N = 1 - N
2523    if by:
2524        whom = byWhom(by, parent, tree)
2525        if whom and whom.distribution:
2526            if whom.distribution[cls] > 1e-30:
2527                N /= whom.distribution[cls] / whom.distribution.abs
2528        else:
2529            return insertDot(strg, mo)
2530    return insertNum(strg, mo, N)
2531
2532
2533__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2534
2535def replaceCcont(strg, mo, node, parent, tree):
2536    if tree.classVar.varType != Orange.data.Type.Continuous:
2537        return insertDot(strg, mo)
2538   
2539    by, op, num = mo.group("by", "op", "num")
2540    op = __opdict[op]
2541    num = float(num)
2542    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2543    if by:
2544        whom = byWhom(by, parent, tree)
2545        if whom and whom.distribution:
2546            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2547            if byN > 1e-30:
2548                N /= byN
2549        else:
2550            return insertDot(strg, mo)
2551
2552    return insertNum(strg, mo, N)
2553   
2554   
2555def replaceccont(strg, mo, node, parent, tree):
2556    if tree.classVar.varType != Orange.data.Type.Continuous:
2557        return insertDot(strg, mo)
2558   
2559    by, op, num = mo.group("by", "op", "num")
2560    op = __opdict[op]
2561    num = float(num)
2562    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2563    if node.distribution.abs > 1e-30:
2564        N /= node.distribution.abs
2565    if by:
2566        whom = byWhom(by, parent, tree)
2567        if whom and whom.distribution:
2568            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2569            if byN > 1e-30:
2570                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
2571        else:
2572            return insertDot(strg, mo)
2573    return insertNum(strg, mo, N)
2574
2575
2576def extractInterval(mo, dist):
2577    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
2578    lower, upper = float(lower), float(upper)
2579    if out:
2580        lop = lowin == "(" and operator.le or operator.lt
        hop = upin == ")" and operator.ge or operator.gt
2582        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
2583    else:
2584        lop = lowin == "(" and operator.gt or operator.ge
2585        hop = upin == ")" and operator.lt or operator.le
2586        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())
2587
2588   
# Absolute number of examples whose continuous class value lies in
# (or outside) the given interval; with "by", divided by the matching
# count in the reference node.
def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)

# Proportion of examples whose continuous class value lies in (or
# outside) the given interval, relative to all examples in the node
# (and, with "by", relative to the proportion in the reference node).
def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN/whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)

# Distribution of absolute class frequencies in the node, printed as a
# list; "by" divides each frequency by the corresponding frequency in
# the reference node.
def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")

# Distribution of relative class frequencies in the node, printed as a
# list; "by" divides each proportion by the corresponding proportion in
# the reference node.
def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs # abs >= d, so d>1e-30 => abs>1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")

# Average ("A") or standard error ("E") of the continuous class in the
# node; the optional reference node's statistic is either divided out
# (when bysub is "b") or subtracted.
def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)

# Two-sided z-scores for the confidence levels supported by replaceI.
Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }

# Confidence interval for the average of a continuous class, computed
# as average +/- z * standard error.
def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
    mul = mo.group("m100") and 100 or 1

    if not Z.has_key(intrvl):
        # the original format string ended with a bare "%", which raises
        # "incomplete format" at run time; escape it and report a percentage
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (100*intrvl)

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))

# This class is more a collection of functions than a proper class,
# merged into one so that its methods don't need to pass around too many
# arguments. An instance is constructed, used and discarded; it is not
# meant to store any state.
class _TreeDumper:
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI) ]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
        maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr

    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    strt = mo.start()+1

        return strg

    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes

    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n"+("|    "*lev)+"    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)

    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)

    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName+chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            # the leaf's name must be quoted like everywhere else, or its
            # declaration would not match the edges written above
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                self.formatString(self.leafStr, node, parent)))

    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    return '"%s"' % (base64.b64encode(x))

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier: baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or \
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
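
        A minimal sketch of use (assumes ``classifier`` is a trained
        :class:`TreeClassifier` and ``data`` the table it was trained
        on)::

            value, probs = classifier(data[0],
                Orange.classification.Classifier.GetBoth)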
2894        """
2895        return self.nativeClassifier(instance, result_type, *args, **kwdargs)
2896
2897    def __setattr__(self, name, value):
2898        if name == "nativeClassifier":
2899            self.__dict__[name] = value
2900            return
2901        if name in self.nativeClassifier.__dict__:
2902            self.nativeClassifier.__dict__[name] = value
2903        self.__dict__[name] = value
2904   
    def dump(self, leafStr = "", nodeStr = "", **argkw):
        """
        Return a string representation of a tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" will be used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, the subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches with a single
          node are printed before the branches with larger subtrees.
          If False, the branches are printed in order of
          appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of pairs of regular expressions and
          callback functions through which the user can print out other
          specific information in the nodes.
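
        A sketch of typical use (assumes ``iris`` is a loaded
        :class:`Orange.data.Table`; the format strings shown are the
        defaults, spelled out for illustration)::

            tree = Orange.classification.tree.TreeLearner(iris)
            print tree.dump(leafStr="%V (%^.2m%)", nodeStr=".", maxDepth=3)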
2932        """
2933        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + 
2934            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0), 
2935            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
2936            self).dumpTree()
2937
    def dot(self, fileName, leafStr = "", nodeStr = "", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """ Print the tree to a file in a format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        Uses the same parameters as :meth:`dump` defined above,
        plus two parameters which define the shapes used for internal
        nodes and leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
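
        A sketch of typical use (the file name is arbitrary); the
        exported file can then be rendered with the GraphViz
        command-line tools::

            tree.dot("tree.dot", leafShape="box", nodeShape="ellipse")
            # in a shell: dot -Tpng tree.dot -o tree.png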
2954        """
2955        fle = type(fileName) == str and file(fileName, "wt") or fileName
2956
2957        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + 
2958            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0), 
2959            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
2960            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()
2961
    def count_nodes(self):
        """
        Return the number of nodes.
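
        A minimal usage sketch (assumes ``tree`` is a trained
        :class:`TreeClassifier`)::

            print tree.count_nodes(), tree.count_leaves()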
2965        """
2966        return _countNodes(self.tree)
2967
2968    def count_leaves(self):
2969        """
2970        Return the number of leaves.
2971        """
2972        return _countLeaves(self.tree)
2973
2974def _countNodes(node):
2975    count = 0
2976    if node:
2977        count += 1
2978        if node.branches:
2979            for node in node.branches:
2980                count += _countNodes(node)
2981    return count
2982
2983def _countLeaves(node):
2984    count = 0
2985    if node:
2986        if node.branches: # internal node
2987            for node in node.branches:
2988                count += _countLeaves(node)
2989        else:
2990            count += 1
2991    return count
2992