source: orange/orange/Orange/classification/tree.py @ 7719:32f708ef6a26

1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
12To build a small tree (:obj:`TreeClassifier`) from the iris data set
13(with the depth limited to three levels), use
14(part of `orngTree1.py`_, uses `iris.tab`_):
15
16.. literalinclude:: code/orngTree1.py
17   :lines: 1-4
18
19.. _orngTree1.py: code/orngTree1.py
20
21
22This page first describes the learner and the classifier, and then
23defines the base classes (individual components) of the trees
24and the tree-building process.
25
26.. autoclass:: TreeLearner
27    :members:
28
29.. autoclass:: TreeClassifier
30    :members:
31
32========
33Examples
34========
35
For example, here is how to write your own stop
function. The example constructs
and prints two trees. For the first one we define the *defStop*
function, which is used by default, and combine it with a random function
so that the stop criterion is also met in 20% of the cases
when *defStop* is false. For the second tree the stopping criterion
is entirely random. Note that in
the second case the lambda function still takes three parameters, since
that is the number of parameters required of a stop function
(:obj:`StopCriteria`).
Part of `tree3.py`_ (uses `iris.tab`_):

.. _tree3.py: code/tree3.py

.. literalinclude:: code/tree3.py
   :lines: 8-23

The output is not shown here since the resulting trees are rather
big.
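
In outline, the approach described above looks like this (a hedged
sketch; the authoritative version is in `tree3.py`_, and the exact
argument with which the stop function is installed is assumed from the
:obj:`stop` attribute used later on this page)::

    from random import randint

    # the default stop criterion, to be combined with a random one
    defStop = Orange.classification.tree.StopCriteria()

    # stop when defStop says so, or in about 20% of the remaining cases
    fStop = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or randint(1, 5) == 1
    learner = Orange.classification.tree.TreeLearner(stop=fStop)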

Tree Structure
==============

To have something to work on, we'll take the data from the lenses
data set and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 7-10

How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?

.. _lenses.tab: code/lenses.tab
.. _treestructure.py: code/treestructure.py

.. literalinclude:: code/treestructure.py
   :lines: 12-21

If node is None, we have a null node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (for this node) plus the
sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.
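
Spelled out, the logic just described looks like this (a sketch of
what `treestructure.py`_ does; attribute names follow the :obj:`Node`
documentation below)::

    def treeSize(node):
        if not node:
            return 0                    # null nodes don't count
        size = 1                        # count this node
        if node.branchSelector:         # internal node: add the subtrees
            for branch in node.branches:
                size += treeSize(branch)
        return size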

    >>> treeSize(treeClassifier.tree)
    10

Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.

Let us now write a script that prints out a tree. The recursive
part of the function will get a node and its level
(part of `treestructure.py`_, uses `lenses.tab`_).

.. literalinclude:: code/treestructure.py
   :lines: 26-41

Don't waste time studying the formatting tricks (\n's etc.); they are
here just for nicer output. What matters is everything but the print
statements. First, we check whether the node is a null node (a node
to which no learning examples were classified). If so, we just print
out "<null node>" and return.

After handling null nodes, the remaining nodes are either internal
nodes or leaves. For internal nodes, we print a node description
consisting of the attribute's name and the distribution of classes.
:obj:`Node`'s branch selector is, for all currently defined splits,
an instance of a class derived from :obj:`orange.Classifier` (in fact,
it is an :obj:`orange.ClassifierFromVarFD`, but an
:obj:`orange.Classifier` would suffice), and its :obj:`classVar`
points to the attribute we seek. So we print its name. We will also
assume that storing class distributions has not been disabled and
print them as well. Then we iterate through the branches; for each we
print a branch description and recursively call :obj:`printTree0`
with the level increased by 1 (to increase the indent).

Finally, if the node is a leaf, we print out the distribution of
learning examples in the node and the class to which the examples in
the node would be classified. We again assume that the
:obj:`nodeClassifier` is the default one - a :obj:`DefaultClassifier`.
A better print function should be aware of possible alternatives.
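
Put together, the recursive part looks roughly like this (a sketch of
the logic just described; the authoritative code is in
`treestructure.py`_)::

    def printTree0(node, level):
        if not node:
            print " " * level + "<null node>"
            return

        if node.branchSelector:
            # internal node: attribute name, distribution, then branches
            nodeDesc = node.branchSelector.classVar.name
            nodeCont = node.distribution
            print "\n" + "   " * level + "%s (%s)" % (nodeDesc, nodeCont),
            for i in range(len(node.branches)):
                print "\n" + "   " * level + ": %s" % node.branchDescriptions[i],
                printTree0(node.branches[i], level + 1)
        else:
            # leaf: distribution and the predicted class
            nodeCont = node.distribution
            majorClass = node.nodeClassifier.defaultValue
            print "--> %s (%s)" % (majorClass, nodeCont),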

Now, we just need to write a simple function to call our printTree0.
We could write something like...

::

    def printTree(x):
        printTree0(x.tree, 0)

... but we won't. Let us learn how to handle arguments of different
types. Let's write a function that will accept either a
:obj:`TreeClassifier` or a :obj:`Node`.
Part of `treestructure.py`_, uses `lenses.tab`_.

.. literalinclude:: code/treestructure.py
   :lines: 43-49

It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's a
:obj:`Node`, we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. The output::

    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> none (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

For a final exercise, let us write a simple pruning function. It will
be written entirely in Python, unrelated to any :obj:`Pruner`. It
will limit the maximal tree depth (the number of internal nodes on any
path down the tree) given as an argument.
For example, to get a two-level tree, we would
call cutTree(root, 2). The function will be recursive, with the second
argument (level) decreasing at each call; when zero, the current node
will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 54-62

There's nothing to prune at null nodes or leaves, so we act only when
:obj:`node` and :obj:`node.branchSelector` are defined. If level is
not zero, we call the function for each branch. Otherwise, we clear
the selector, branches and branch descriptions.
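
As a sketch of what `treestructure.py`_ does, following the
description above::

    def cutTree(node, level):
        # act only on internal, non-null nodes
        if node and node.branchSelector:
            if level:
                for branch in node.branches:
                    cutTree(branch, level - 1)
            else:
                # turn the node into a leaf
                node.branchSelector = None
                node.branches = None
                node.branchDescriptions = None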

    >>> cutTree(tree.tree, 2)
    >>> printTree(tree)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no --> soft (<1.000, 5.000, 0.000>)
       : yes --> hard (<2.000, 0.000, 4.000>)

Learning
========

You could just call :class:`TreeLearner` and let it fill the empty
slots with the default
components. This section will teach you three things: what the
missing components are (and how to set the same components yourself),
how to use alternative components to get a different tree and,
finally, how to write a skeleton for tree induction in Python.

.. _treelearner.py: code/treelearner.py

Let us construct a :obj:`TreeLearner` to play with (part of
`treelearner.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treelearner.py
   :lines: 7-10

There are three crucial components in learning: the split and stop
criteria, and the :obj:`exampleSplitter` (there are some others,
which become important during classification; we'll talk about them
later). They are not defined; if you use the learner, the slots are
filled temporarily but later cleared again.

FIXME: the following example is not true anymore.

::

    >>> print learner.split
    None
    >>> learner(data)
    <TreeClassifier instance at 0x01F08760>
    >>> print learner.split
    None

Stopping criteria
-----------------

The stop is trivial. The default is set by

::

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

Well, this is actually done in C++ and it uses a global component
that is constructed once and for all, but apart from that we did
effectively the same thing.

We can now examine the default stopping parameters.

    >>> print learner.stop.maxMajority, learner.stop.minExamples
    1.0 0.0

Not very restrictive. This keeps splitting the examples until
there's nothing left to split or all the examples are in the same
class. Let us set the minimal subset that we allow to be split to
five examples and see what comes out.

    >>> learner.stop.minExamples = 5.0
    >>> tree = learner(data)
    >>> print tree.dump()
    tear_rate=reduced: none (100.00%)
    tear_rate=normal
    |    astigmatic=no
    |    |    age=pre-presbyopic: soft (100.00%)
    |    |    age=presbyopic: none (50.00%)
    |    |    age=young: soft (100.00%)
    |    astigmatic=yes
    |    |    prescription=hypermetrope: none (66.67%)
    |    |    prescription=myope: hard (100.00%)

OK, that's better. If we want an even smaller tree, we can also limit
the maximal proportion of the majority class.

    >>> learner.stop.maxMajority = 0.5
    >>> tree = learner(data)
    >>> print tree.dump()
    none (62.50%)


=================
Printing the Tree
=================

The included printing functions can
print out practically anything you'd like to
know, from the number of examples, proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in a parent node. And even more, you can
define your own callback functions to be used for printing.

Before we go on: you can read all about the function and use it to its
full extent, or you can just call it, giving it the tree as the sole
argument, and it will print out the usual textual representation of the
tree. If you're satisfied with that, you can stop here.

The magic is in the format string. It is a string which is printed
out at every leaf or internal node, with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.

**^** at the start indicates that the number should be multiplied by 100.
It's useful for printing proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, :samp:`%N` denotes the number of examples in the node,
hence :samp:`%6.2N` would mean output to two decimal digits and six places
altogether. If left out, a default format :samp:`5.3` is used, unless you
multiply the numbers by 100, in which case the default is :samp:`.0`
(no decimals, the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node with.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` gives the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the entire data
set, so :samp:`%NbA` will tell you the proportion of examples (out of the entire
training data set) that fell into that node. If division is impossible
because the parent node does not exist or some data is missing, a dot is
printed out instead of the quantity.

**<quantity>** is the only required element. It defines what to print.
For instance, :samp:`%N` would print out the number of examples in the node.
Possible quantities are

:samp:`V`
    The value predicted at that node. You cannot define the precision
    or divisor here.

:samp:`N`
    The number of examples in the node.

:samp:`M`
    The number of examples in the majority class (that is, the class
    predicted by the node).

:samp:`m`
    The proportion of examples in the majority class.

:samp:`A`
    The average class for examples in the node; this is available only
    for regression trees.

:samp:`E`
    Standard error for class of examples in the node; available for
    regression trees.

:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.

:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this will print the number of examples of Iris-virginica divided by
    the number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`

    For regression trees, you can use operators =, !=, <, <=, >, and >=,
    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
    check the number of examples in a certain interval: :samp:`%C[20, 22]`
    will give you the number of examples between 20 and 22 (inclusive)
    and :samp:`%C(20, 22)` will give the number of such examples excluding the
    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
    If you would like the examples outside the interval, add a :samp:`!`,
    like :samp:`%C!(20, 22]`.

:samp:`c`
    Same as above, except that it computes the proportion of the class
    instead of the number of examples.

:samp:`D`
    Prints out the number of examples in each class. You can use both
    the precision (it is applied to each number in the distribution) and
    the divisor. This quantity cannot be computed for regression trees.

:samp:`d`
    Same as above, except that it shows proportions of examples. This
    again doesn't work with regression trees.

<user defined formats>
    You can add more, if you will. Instructions and examples are given at
    the end of this section.


Examples
========

We shall build a small tree from the iris data set - we shall limit the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

.. _orngTree1.py: code/orngTree1.py

The easiest way to call the function is to pass the tree as the only
argument::

    >>> print tree.dump()
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)

Let's now print out the predicted class at each node, and the number
of examples in the majority class together with the total number of
examples in the node::

    >>> print tree.dump(leafStr="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

Would you like to know how the number of examples declines as
compared to the entire data set and to the parent node? We find
it with this::

    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)

Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.

And now for the output: all examples of setosa fell into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.

If you find this guesswork annoying - so do I. Let us print out the
number of versicolors in each node, together with the proportion of
versicolors among the examples in this particular node and among all
versicolors. So,

::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

gives the following output::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, we may want to print out the distributions, using a simple
string :samp:`%D`::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can
use :samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

We haven't tried printing out any information for internal nodes.
To start with the most trivial case, we shall print the prediction
at each node.

::

    tree.dump(leafStr="%V", nodeStr=".")

A :samp:`nodeStr` of :samp:`.` says that it should be the same as
:samp:`leafStr` (not very useful here, since :samp:`leafStr` is
trivial anyway).

::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

Note that the output is somewhat different now: another node,
called *root*, appeared and the tree looks one level deeper. This is
needed to print out the data for that node too.

Now for something more complicated: let us observe how the number
of virginicas decreases down the tree::

    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')

Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is printed
out. In parentheses we print the same thing except that we divide by the
examples in the parent node. Note the use of single quotes, so we can
use double quotes inside the string, when we specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

See what's in the parentheses in the root node? If :meth:`TreeClassifier.dump`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.

For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent,
and the proportions compared to the parent (the latter two are not
the same - think about why). In the leaves we shall also add the
predicted class. So now we'll have to call the function like this.

::

    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]

To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set. Called with the tree as the
only argument, :meth:`TreeClassifier.dump` prints the tree like this::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9


Let us add the standard error in both internal nodes and leaves, and the
90% confidence intervals in the leaves::

    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
:samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
so the printed number has the same number of decimals
as in the data file.

Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of examples within a certain range. For instance,
let us check the number of examples with values below 22, and compare
this number with values in the parent nodes::

    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that among the examples with tax
above 534, the number of examples with the class below 22 is 30 times
higher than the number of such examples in its parent node.

For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the
bounds). And let us print out the proportions as percents.

::

    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")

OK, let's observe the format string for one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. :samp:`%cbP![20, 22]` divides this by the same
statistic computed in the parent node. Add a :samp:`^` and you have the percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)


Defining Your Own Printout Functions
====================================

:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to print out
some other information in the leaves or nodes. If provided,
:obj:`userFormats` should contain a list of tuples with a regular expression
and a callback function to be called when that expression is found in the
format string. Expressions from :obj:`userFormats` are checked before
the built-in expressions discussed above, so you can override the built-ins
if you want to.

The regular expression should describe a string like those we used above,
for instance the string :samp:`%.2DbP`. When a leaf or internal node
is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The callback function will get five arguments: the format string
(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

The function can use several utility functions provided in the module.

.. autofunction:: insertStr

.. autofunction:: insertDot

.. autofunction:: insertNum

.. autofunction:: byWhom


There are also a few pieces of regular expression that you may want to reuse.
The two you are likely to use are:

.. autodata:: fs

.. autodata:: by

For a trivial example, :samp:`%V` is implemented like this. There is the
following tuple in the list of built-in formats::

    (re.compile("%V"), replaceV)

:obj:`replaceV` is a function defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to *insertStr* to do the replacement.
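
To see how such a pair plugs into :meth:`TreeClassifier.dump`, here is a
hedged sketch of registering a custom specifier (the specifier
:samp:`%W` and the callback below are hypothetical examples, not
built-ins; :obj:`insertStr` is the utility documented above)::

    import re

    def replaceW(strg, mo, node, parent, tree):
        # replace the matched %W with the (weighted) number of examples
        return Orange.classification.tree.insertStr(
            strg, mo, str(node.distribution.abs))

    print tree.dump(leafStr="%V %W",
                    userFormats=[(re.compile("%W"), replaceW)])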

A more complex regular expression is the one for the proportion of
the majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
two partial expressions defined above.

Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):

.. _orngTree2.py: code/orngTree2.py

.. literalinclude:: code/orngTree2.py
   :lines: 7-31

We first define getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If we need to divide the quantity by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node with whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot, otherwise we call :func:`insertNum` which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.


We can now print out the iris tree:

.. literalinclude:: code/orngTree2.py
    :lines: 33

And we get::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)


Plotting the Tree using Dot
===========================

Suppose you saved the tree in a file "tree5.dot". You can then
render it as a GIF image by executing the following command line

::

    dot -Tgif tree5.dot -otree5.gif

GraphViz's dot has quite a few other output formats; check
its documentation to learn which.
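
The same conversion can also be driven from Python. A minimal sketch
using only the standard library (it assumes GraphViz's dot is on your
PATH)::

    import subprocess

    # render the saved .dot file to a GIF image
    subprocess.check_call(["dot", "-Tgif", "tree5.dot", "-otree5.gif"])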



Classification trees are represented as a tree-like hierarchy of
:obj:`Node` classes.

.. class:: Node

    Node stores information about the learning examples belonging
    to the node, a branch selector, a list of branches (if the node is
    not a leaf) with their descriptions and strengths, and a classifier.

    .. attribute:: distribution

        Stores a distribution for learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`TreeLearnerBase`'s :obj:`storeDistributions` flag to false.

    .. attribute:: contingency

        Stores complete contingency matrices for the learning examples
        belonging to the node. Storing contingencies can be enabled by
        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies get computed and stored to
        :obj:`Node`, but are removed shortly afterwards.
        The details are given in the
        description of the :obj:`TreeLearnerBase` object.

    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the ID of the
        corresponding weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.

    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if
        :obj:`Node`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.

    If the node is a leaf, the remaining fields are None.
    If it's an internal node, there are several additional fields.

    .. attribute:: branches

        Stores a list of subtrees, given as :obj:`Node`.
        An element can be None; in this case the node is empty.

    .. attribute:: branchDescriptions

        A list with string descriptions for branches, constructed by
        :obj:`SplitConstructor`. It can contain different kinds
        of descriptions, but basically, expect things like 'red' or '>12.3'.

    .. attribute:: branchSizes

        Gives a (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.

    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified to any branch, the selector
        can return a :obj:`Orange.data.Value` containing a special value
        (sVal) which should be a discrete distribution
        (DiscDistribution). This should represent the
        :obj:`branchSelector`'s opinion of how to divide the
        example between the branches. Whether the proposition will be
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).

    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none of them if it is a leaf.

    .. method:: treeSize()

        Return the number of nodes in the subtrees (including the
        node, excluding null-nodes).
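
    For instance, walking these attributes by hand might look like this
    (a sketch using only the names documented above)::

        node = treeClassifier.tree          # the root Node
        print node.distribution             # class distribution at the root
        if node.branchSelector:             # internal node?
            print node.branchSelector.classVar.name
            for desc, size in zip(node.branchDescriptions, node.branchSizes):
                print desc, size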

==============
Classification
==============

.. class:: _TreeClassifier

    Classifies examples according to a tree stored in :obj:`tree`.

    .. attribute:: tree

        The root of the tree, represented as a :class:`Node`.

    Classification would be straightforward if there were no unknown
    values or, in general, examples that cannot be placed into a
    single branch. The response in such cases is determined by a
    component :obj:`descender`.

    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is the most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.

    In general there are three possible outcomes of a descent.

    #. The descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them over by
       selecting a single branch and continued the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node. Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Node` wants to split the example (i.e., to decide the
       class by voting).

    In the third case, the descender returns a :obj:`Node` and the
    vote-weights for the branches.
    The weights can correspond to the distribution returned by
    :obj:`branchSelector`, to the number of learning examples that
    were assigned to each branch, or to something else.

    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.

    When voting, subtrees do not predict the class but probabilities
    of classes. The predictions are multiplied by weights, summed and
    the most probable class is returned.

    .. method:: vote()

    .. method:: classDistribution()

The rest of this section is only for those interested in the C++ code.
=======================================================================

If you'd like to understand how the classification works in C++,
start reading at :obj:`TTreeClassifier::vote`. It gets a
:obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
vote weights. For each node, it calls
:obj:`TTreeClassifier::classDistribution` and then multiplies
and sums the distributions. :obj:`vote` returns a normalized
distribution of predictions.

A new overload of :obj:`TTreeClassifier::classDistribution` gets
an additional parameter, a :obj:`Node`. This is done
for the sake of recursion. The normal version of
:obj:`classDistribution` simply calls the overloaded one with the
tree root as an additional parameter. :obj:`classDistribution`
uses :obj:`descender`. If the descender reaches a leaf, it calls
:obj:`nodeClassifier`, otherwise it calls :obj:`vote`.

Thus, the :obj:`TreeClassifier`'s :obj:`vote` and
:obj:`classDistribution` are written in the form of a double
recursion. The recursive calls do not happen at each node of the
tree but only at nodes where a vote is needed (that is, at nodes
where the descender halts).

For predicting a class, :obj:`operator()` calls the
descender. If it reaches a leaf, the class is predicted by the
leaf's :obj:`nodeClassifier`. Otherwise, it calls
:obj:`vote`. From now on, :obj:`vote` and
:obj:`classDistribution` interweave down the tree and return
a distribution of predictions. :obj:`operator()` then simply
chooses the most probable class.

========
Learning
========

The main learning object is :obj:`TreeLearnerBase`. It is basically
a skeleton into which the user must plug the components for particular
functions. For easier use, defaults are provided.

Components that govern the structure of the tree are :obj:`split`
(of type :obj:`SplitConstructor`), :obj:`stop` (of
type :obj:`StopCriteria`) and :obj:`exampleSplitter`
(of type :obj:`ExampleSplitter`).


.. class:: TreeLearnerBase

    TreeLearnerBase has a number of components.

    .. attribute:: split

        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous attributes
        are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        and five examples in a leaf for continuous attributes.

    .. attribute:: stop

        Object of type :obj:`StopCriteria`. The default stopping
        criterion stops induction when all examples in a node belong
        to the same class.

    .. attribute:: splitter

        Object of type :obj:`ExampleSplitter`. The default splitter
        is :obj:`ExampleSplitter_UnknownsAsSelector`, which splits
        the learning examples according to distributions given by the
        selector.

    .. attribute:: contingencyComputer

        By default, this slot is left empty and ordinary contingency
        matrices are computed for examples at each node. If need arises,
        one can change the way the matrices are computed. This can be
        used to change the way that unknown values are treated when
        assessing qualities of attributes. As mentioned earlier,
        the computed matrices can be used by the split constructor and by
        stopping criteria. On the other hand, they can be (and are)
        ignored by some splitting constructors.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_UnknownMergeAsSelector`, which votes using
        the :obj:`branchSelector`'s distribution for vote weights.

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent any infinite tree
        induction due to misconfigured stop criteria. If you are sure
        you need larger trees, increase it. If you, on the other hand,
        want to lower this hard limit, you can do so as well.

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decides whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same distribution
        that is stored in :obj:`contingency.classes`.
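
    A hedged configuration sketch (it assumes :obj:`TreeLearnerBase` can
    be instantiated and configured directly; the attribute names are the
    ones documented above)::

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.stop = Orange.classification.tree.StopCriteria_common()
        learner.stop.minExamples = 5.0   # don't split subsets smaller than 5
        learner.maxDepth = 4             # hard depth limit (default is 100)
        learner.storeExamples = 1        # keep the examples in the nodes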

    The :obj:`TreeLearnerBase` first sets the defaults for missing
    components. Although stored in the actual :obj:`TreeLearnerBase`'s
    fields, they are removed when the induction is finished.

    Then it ensures that the examples are stored in a table. This is
    needed because the algorithm juggles with pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to ensure that pointers
    remain pointing to examples even if the user later changes the
    example table. If they are in a table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.

    Apriori class probabilities are computed. At this point we check
    the sum of example weights; if it's zero, there are no examples and
    we cannot proceed. A list of candidate attributes is set; in the
    beginning, all attributes are candidates for the split criterion.

    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments
    are a set of examples, a weight meta-attribute ID (a tricky thing,
    it can be always the same as the original or can change to
    accommodate splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).

    The contingency matrix is computed next. This happens
    even if the flag :obj:`storeContingencies` is false.
    If the :obj:`contingencyComputer` is given we use it,
    otherwise we construct just an ordinary contingency matrix.

    A :obj:`stop` is called to see whether it's worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if
    the :obj:`forceNodeClassifier` flag is set.

    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
    a choice of :obj:`contingencyComputer` will, in many cases,
    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
    return no classifier; if so and if the classifier would be
    needed for classification, the :obj:`TreeClassifier`'s function
    returns DK or an empty distribution. If you're writing your own
    tree classifier - pay attention.

    If the induction is to continue, a :obj:`split` component is called.
    If it fails to return a branch selector, induction stops and the
    :obj:`Node` is returned.

    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide
    the examples as described above.

    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they wish.

    The :obj:`TreeLearnerBase` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    a corresponding weight is used for each branch. Besides, the
    attribute spent by the splitter (if any) is removed from the
    list of candidates for the subtree.

    A subset of examples is stored in its corresponding tree node,
    if so requested. If not, the new weight attributes are removed (if
    any were created).
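
    The steps above, in the form of a Python skeleton (a hedged sketch,
    not the C++ implementation; calling conventions are simplified and
    follow the component descriptions on this page)::

        def induceTree(examples, weightID, candidates,
                       split, stop, splitter, nodeLearner):
            node = Node()
            if stop(examples, weightID):
                # a leaf: only the node classifier is needed
                node.nodeClassifier = nodeLearner(examples, weightID)
                return node

            # construct a split; returning no selector vetoes the induction
            (node.branchSelector, node.branchDescriptions,
             node.branchSizes, quality, spent) = \
                split(examples, weightID, None, candidates)
            if not node.branchSelector:
                node.nodeClassifier = nodeLearner(examples, weightID)
                return node

            # a completely spent attribute cannot reappear in the subtrees
            if spent >= 0:
                candidates = list(candidates)
                candidates[spent] = 0

            # divide the examples among the branches and recurse
            subsets, weights = splitter(node, examples, weightID)
            node.branches = []
            for subset in subsets:
                if subset:
                    node.branches.append(induceTree(subset, weightID,
                        candidates, split, stop, splitter, nodeLearner))
                else:
                    node.branches.append(None)   # an empty (null) branch
            return node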
1141
Pruning
=======

Tree pruners derived from :obj:`Pruner` can be given either a
:obj:`Node` (presumably, but not necessarily, a root) or a
:obj:`TreeClassifier`. The result is a new, pruned :obj:`Node`
or a new :obj:`TreeClassifier` with a pruned tree. The original
tree remains intact.

Note however that pruners construct only a shallow copy of the tree.
The pruned tree's :obj:`Node`\ s contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. Thus, you may modify a pruned tree's structure
(manually cut it, add new nodes, replace components), but modifying,
for instance, some node's :obj:`nodeClassifier` (the
:obj:`nodeClassifier` itself, not a reference to it!) would modify
the node's :obj:`nodeClassifier` in the corresponding node of
the original tree.

Talking about node classifiers - pruners cannot construct a
:obj:`nodeClassifier` nor merge the :obj:`nodeClassifier`\ s of the pruned
subtrees into classifiers for new leaves. Thus, if you want to build
a prunable tree, internal nodes must have their :obj:`nodeClassifier`
defined. Fortunately, all you need to do is nothing; if you leave
the :obj:`TreeLearnerBase`'s flags at their defaults, the
:obj:`nodeClassifier`\ s are created.
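
For instance (a hedged sketch; :obj:`Pruner_m` and its :obj:`m`
attribute are assumed to be available in your version - any
:obj:`Pruner` subclass is applied the same way)::

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2
    prunedTree = pruner(tree)   # accepts a TreeClassifier or a Node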
1168
=======
Classes
=======

Several classes described above are already functional and can
(and mostly will) be used as they are. Those classes are :obj:`Node`,
:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes
the other classes.

Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner`
and :obj:`Classifier` are among the Orange classes that can be subtyped
in Python and have the call operator overloaded in such a way that it
is called back from C++ code. You can thus program your own components
for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed
information on how this is done and what can go wrong is given in a
separate page, dedicated to callbacks to Python XXXXXXXXXX.

SplitConstructors
=====================

Split construction is almost as exciting as waiting for a delayed flight.
Boring, that is. Split constructors are programmed as spaghetti code
that juggles with contingency matrices, with separate cases for discrete
and continuous classes... Most split constructors work either for
discrete or for continuous attributes. The suggested practice is
to use a :obj:`SplitConstructor_Combined` that can handle
both by simply delegating attributes to specialized split constructors.

Note: split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)

The same components can be used for inducing either classification or
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).
1208
.. class:: SplitConstructor

    Finds a suitable criterion for dividing the learning (and later testing)
    examples coming to the node. The data it gets is a set of examples
    (and, optionally, an ID of the weight meta-attribute), a domain
    contingency computed from the examples, apriori class probabilities,
    a list of candidate attributes it should consider and a node
    classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).

    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold; one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain contingency
    does not suffice; for example, when ReliefF is used as a measure
    of the quality of attributes. In this case, there's no other way but
    to use the examples and ignore the precomputed contingencies.

    The split constructor should consider only the attributes in the
    candidate list (the list is a vector of booleans, one for each
    attribute).

    :obj:`SplitConstructor` returns most of the data we talked
    about when describing the :obj:`Node`. It returns a classifier
    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
    branch descriptions and a list with the number of examples that
    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`TreeLearnerBase` will find the number itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.

    In addition, it returns the quality of the split; a number without
    any fixed meaning except that higher numbers mean better splits.

    If the constructed splitting criterion uses an attribute in such
    a way that the attribute is 'completely spent' and should not be
    considered as a split criterion in any of the subtrees (the
    typical case of this are discrete attributes that are used
    as-they-are, that is, without any binarization or subsetting),
    then it should report the index of this attribute. Some splits
    do not spend any attribute; this is indicated by returning a
    negative index.

    A :obj:`SplitConstructor` can veto the further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.

    .. attribute:: minSubset

        Sets the minimal number of examples in non-null leaves. As
        always in Orange (where not specified otherwise), "number of
        examples" refers to the weighted number of examples.

1269    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])
1270
1271        Construct a split. Returns a tuple (:obj:`branchSelector`,
1272        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`branchSelector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.
1278
1279        :param examples:  Examples can be given in any acceptable form
1280            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
1281            list of examples).
1282        :param weightID: Optional; the default of 0 means that all
1283            examples have a weight of 1.0.
        :param apriori_distribution: Should be of type
            :obj:`orange.Distribution`.
        :param candidates: Should be a Python list of objects
            which are interpreted as booleans.
1287
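    For illustration, a split constructor can also be called directly.
    Below is a minimal sketch that assumes the signature documented
    above (the exact call may differ between Orange versions; note also
    that this module exposes :obj:`SplitConstructor_Attribute` under the
    name :obj:`SplitConstructor_Feature`). It scores the attributes of
    the lenses data with gain ratio and prints the proposed split::

        import Orange

        data = Orange.data.Table("lenses")
        split = Orange.classification.tree.SplitConstructor_Feature()
        split.measure = Orange.feature.scoring.GainRatio()
        apriori = Orange.core.Distribution(data.domain.classVar, data)
        # one boolean per attribute: consider all of them as candidates
        candidates = [True] * len(data.domain.attributes)
        selector, descriptions, sizes, quality, spent = \
            split(data, 0, apriori, candidates)
        print descriptions, quality, spent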
1288
1289
1290.. class:: SplitConstructor_Measure
1291
1292    Bases: :class:`SplitConstructor`
1293
    An abstract base class for split constructors that employ
    an :obj:`Orange.feature.scoring.Measure` to assess the quality of a split.
1296    At present,
1297    all split constructors except for :obj:`SplitConstructor_Combined`
1298    are derived from this class.
1299
1300    .. attribute:: measure
1301
1302        A component of type :obj:`Orange.feature.scoring.Measure` used for
        evaluation of a split. Note that you must select a subclass of
        :obj:`Orange.feature.scoring.Measure` capable of handling your
        class type - you cannot use :obj:`Orange.feature.scoring.GainRatio`
        for building regression trees or :obj:`Orange.feature.scoring.MSE`
        for classification trees.
1309
1310    .. attribute:: worstAcceptable
1311
1312        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
1314        :obj:`measure` component. Default is 0.0.
1315
1316.. class:: SplitConstructor_Attribute
1317
1318    Bases: :class:`SplitConstructor_Measure`
1319
1320    Attempts to use a discrete attribute as a split; each value of the
1321    attribute corresponds to a branch in the tree. Attributes are
1322    evaluated with the :obj:`measure` and the one with the
1323    highest score is used for a split. If there is more than one
    attribute with the highest score, one of them is selected at random.
1325
1326    The constructed :obj:`branchSelector` is an instance of
1327    :obj:`orange.ClassifierFromVarFD` that returns a value of the
1328    selected attribute. If the attribute is
1329    :obj:`Orange.data.variable.Discrete`,
    :obj:`branchDescriptions` are the attribute's values. The
1331    attribute is marked as spent, so that it cannot reappear in the
1332    node's subtrees.
1333
1334.. class:: SplitConstructor_ExhaustiveBinary
1335
1336    Bases: :class:`SplitConstructor_Measure`
1337
1338    Works on discrete attributes. For each attribute, it determines
1339    which binarization of the attribute gives the split with the
1340    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
1342    it returns one of those with the highest score.
1343
    The constructed :obj:`branchSelector` is again an instance of
1345    :obj:`orange.ClassifierFromVarFD` that returns a value of the
1346    selected attribute. This time, however, its :obj:`transformer`
1347    contains an instance of :obj:`MapIntValue` that maps the values
1348    of the attribute into a binary attribute. Branch descriptions are
1349    of form "[<val1>, <val2>, ...<valn>]" for branches corresponding to
1350    more than one value of the attribute. Branches that correspond to
1351    a single value of the attribute are described with this value. If
1352    the attribute was originally binary, it is spent and cannot be
1353    used in the node's subtrees. Otherwise, it can reappear in the
1354    subtrees.
1355
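For instance, the :class:`TreeLearner` wrapper defined in this module
selects this constructor when :obj:`binarization` is set to 1; a short
sketch::

    import Orange

    data = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(data, binarization=1)
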
1356
1357.. class:: SplitConstructor_Threshold
1358
1359    Bases: :class:`SplitConstructor_Measure`
1360
1361    This is currently the only constructor for splits with continuous
    attributes. It divides the range of attribute values with a threshold
1363    that maximizes the split's quality. As always, if there is more than
1364    one split with the highest score, a random threshold is selected.
1365    The attribute that yields the highest binary split is returned.
1366
1367    The constructed :obj:`branchSelector` is again an instance of
1368    :obj:`orange.ClassifierFromVarFD` with an attached
1369    :obj:`transformer`. This time, :obj:`transformer` is of type
1370    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
1371    "<threshold" and ">=threshold". The attribute is not spent.
1372
1373.. class:: SplitConstructor_OneAgainstOthers
1374   
1375    Bases: :class:`SplitConstructor_Measure`
1376
1377    Undocumented.
1378
1379.. class:: SplitConstructor_Combined
1380
1381    Bases: :class:`SplitConstructor`
1382
1383    This constructor delegates the task of finding the optimal split
1384    to separate split constructors for discrete and for continuous
1385    attributes. Each split constructor is called, given only attributes
1386    of appropriate types as candidates. Both construct a candidate for
1387    a split; the better of them is selected.
1388
    (Note that there is a problem when several candidates have the same
    score. Let there be nine discrete attributes with the highest
1391    score; the split constructor for discrete attributes will select
1392    one of them. Now, let us suppose that there is a single continuous
1393    attribute with the same score. :obj:`SplitConstructor_Combined`
1394    would randomly select between the proposed discrete attribute and
1395    the continuous attribute, not aware of the fact that the discrete
1396    has already competed with eight other discrete attributes. So,
    the probability of selecting (each) discrete attribute would be 1/18
1398    instead of 1/10. Although not really correct, we doubt that this
1399    would affect the tree's performance; many other machine learning
1400    systems simply choose the first attribute with the highest score
1401    anyway.)
1402
1403    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
1404    the attribute is spent is decided by the winning split constructor.
1405
    .. attribute:: discreteSplitConstructor
1407
1408        Split constructor for discrete attributes; can be, for instance,
1409        :obj:`SplitConstructor_Attribute` or
1410        :obj:`SplitConstructor_ExhaustiveBinary`.
1411
    .. attribute:: continuousSplitConstructor
1413
1414        Split constructor for continuous attributes; at the moment, it
1415        can be either :obj:`SplitConstructor_Threshold` or a
1416        split constructor you programmed in Python.
1417
1423
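    For illustration, a combined split constructor resembling the default
    one assembled by :class:`TreeLearner` (see :obj:`TreeLearner.build_split`
    below) can be put together and plugged into the base learner like
    this::

        import Orange
        from Orange.classification import tree

        split = tree.SplitConstructor_Combined()
        split.discreteSplitConstructor = tree.SplitConstructor_ExhaustiveBinary()
        split.continuousSplitConstructor = tree.SplitConstructor_Threshold()
        measure = Orange.feature.scoring.GainRatio()
        split.discreteSplitConstructor.measure = measure
        split.continuousSplitConstructor.measure = measure

        learner = tree.TreeLearnerBase()
        learner.split = split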
1424
1425StopCriteria and StopCriteria_common
1426============================================
1427
:obj:`StopCriteria` determines when to stop the induction of subtrees.
1429
1430.. class:: StopCriteria
1431
1432    Given a set of examples, weight ID and contingency matrices, decide
1433    whether to continue the induction or not. The basic criterion checks
1434    whether there are any examples and whether they belong to at least
1435    two different classes (if the class is discrete). Derived components
1436    check things like the number of examples and the proportion of
1437    majority classes.
1438
1439    As opposed to :obj:`SplitConstructor` and similar basic classes,
1440    :obj:`StopCriteria` is not an abstract but a fully functional
1441    class that provides the basic stopping criteria. That is, the tree
1442    induction stops when there is at most one example left; in this case,
1443    it is not the weighted but the actual number of examples that counts.
1444    Besides that, the induction stops when all examples are in the same
1445    class (for discrete problems) or have the same value of the outcome
1446    (for regression problems).
1447
1448    .. method:: __call__(examples[, weightID, domain contingencies])
1449
1450        Decides whether to stop (true) or continue (false) the induction.
1451        If contingencies are given, they are used for checking whether
1452        the examples are in the same class (but not for counting the
1453        examples). Derived classes should use the contingencies whenever
1454        possible. If contingencies are not given, :obj:`StopCriteria`
1455        will work without them. Derived classes should also use them if
1456        they are available, but otherwise compute them only when they
1457        really need them.
1458
1459
1460
1461.. class:: StopCriteria_common
1462
    :obj:`StopCriteria_common` contains additional criteria for pre-pruning:
1464    it checks the proportion of majority class and the number of weighted
1465    examples.
1466
1467    .. attribute:: maxMajor
1468
1469        Maximal proportion of majority class. When this is exceeded,
1470        induction stops.
1471
1472    .. attribute:: minExamples
1473
        Minimal number of examples in internal nodes. Subsets with fewer
        than :obj:`minExamples` examples are not split any further.
        The example count is weighted.
1477
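    For illustration, a configured stopping object can be plugged into the
    base learner directly; this mirrors what :obj:`TreeLearner.build_stop`
    below does::

        from Orange.classification import tree

        stop = tree.StopCriteria_common()
        stop.maxMajority = 0.8  # stop when the majority class exceeds 80%
        stop.minExamples = 5    # do not split subsets of less than 5 examples

        learner = tree.TreeLearnerBase()
        learner.stop = stop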
1478.. class:: StopCriteria_Python
1479
1480    Undocumented.
1481
Classes derived from ExampleSplitter
1483========================================
1484
1485The :obj:`ExampleSplitter` sorts the learning examples into branches.
1486
1487.. class:: ExampleSplitter
1488
1489    Just like the :obj:`Descender` decides the branch for an
1490    example during classification, the :obj:`ExampleSplitter`
1491    sorts the learning examples into branches.
1492
1493    :obj:`ExampleSplitter` is given a :obj:`Node` (from which
    it can use various data, though most splitters only use the
1495    :obj:`branchSelector`), a set of examples to be divided, and
1496    the weight ID. The result is a list of subsets of examples
1497    and, optionally, a list of new weight ID's.
1498
1499    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most
1500    of :obj:`ExampleSplitters` simply call the node's
1501    :obj:`branchSelector` and assign examples to corresponding
1502    branches. When the value is unknown they choose a particular
1503    branch or simply skip the example.
1504
1505    Some enhanced splitters can split examples. An example (actually,
1506    a pointer to it) is copied to more than one subset. To facilitate
1507    real splitting, weights are needed. Each branch is assigned a
1508    weight ID (each would usually have its own ID) and all examples
1509    that are in that branch (either completely or partially) should
1510    have this meta attribute. If an example hasn't been split, it
1511    has only one additional attribute - with weight ID corresponding
    to the subset to which it went. An example that is split between,
1513    say, three subsets, has three new meta attributes, one for each
1514    subset. ID's of weight meta attributes are returned by the
1515    :obj:`ExampleSplitter` to be used at induction of the
1516    corresponding subtrees.
1517
1518    Note that weights are used only when needed. When no splitting
    occurred - because the splitter is not able to do it or because
1520    there was no need for splitting - no weight ID's are returned.
1521
1522    An abstract base class for objects that split sets of examples into
1523    subsets. The derived classes differ in treatment of examples which
1524    cannot be unambiguously placed into a single branch (usually due
1525    to unknown value of the crucial attribute).
1526
1527    .. method:: __call__(node, examples[, weightID])
1528       
1529        Use the information in :obj:`node` (particularly the
1530        :obj:`branchSelector`) to split the given set of examples into subsets.
1531        Return a tuple with a list of example generators and a list of weights.
1532        The list of weights is either an ordinary python list of integers or
1533        a None when no splitting of examples occurs and thus no weights are
1534        needed.
1535
1536
1537.. class:: ExampleSplitter_IgnoreUnknowns
1538
1539    Bases: :class:`ExampleSplitter`
1540
1541    Simply ignores the examples for which no single branch can be determined.
1542
1543.. class:: ExampleSplitter_UnknownsToCommon
1544
1545    Bases: :class:`ExampleSplitter`
1546
    Places all such examples into the branch with the highest number of
1548    examples. If there is more than one such branch, one is selected at
1549    random and then used for all examples.
1550
1551.. class:: ExampleSplitter_UnknownsToAll
1552
1553    Bases: :class:`ExampleSplitter`
1554
1555    Places examples with unknown value of the attribute into all branches.
1556
1557.. class:: ExampleSplitter_UnknownsToRandom
1558
1559    Bases: :class:`ExampleSplitter`
1560
1561    Selects a random branch for such examples.
1562
1563.. class:: ExampleSplitter_UnknownsToBranch
1564
1565    Bases: :class:`ExampleSplitter`
1566
1567    Constructs an additional branch to contain all such examples.
1568    The branch's description is "unknown".
1569
1570.. class:: ExampleSplitter_UnknownsAsBranchSizes
1571
1572    Bases: :class:`ExampleSplitter`
1573
1574    Splits examples with unknown value of the attribute according to
1575    proportions of examples in each branch.
1576
1577.. class:: ExampleSplitter_UnknownsAsSelector
1578
1579    Bases: :class:`ExampleSplitter`
1580
1581    Splits examples with unknown value of the attribute according to
1582    distribution proposed by selector (which is in most cases the same
1583    as proportions of examples in branches).
1584
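For illustration, a different splitter can be plugged into the base
learner. The sketch below assumes the component is exposed as
:obj:`exampleSplitter` on :class:`TreeLearnerBase` and uses
:obj:`Splitter_UnknownsToAll`, the name under which this module imports
:obj:`ExampleSplitter_UnknownsToAll`::

    from Orange.classification import tree

    learner = tree.TreeLearnerBase()
    learner.exampleSplitter = tree.Splitter_UnknownsToAll()
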
1585Descender and derived classes
1586=================================
1587
1588This is a classifier's counterpart for :obj:`ExampleSplitter`. It
1589decides the destiny of examples that need to be classified and cannot
1590be unambiguously put in a branch. The detailed function of this class
is given in the description of classification with trees. XXXXXX
1592
1593.. class:: Descender
1594
1595    An abstract base object for tree descenders.
1596
1597    .. method:: __call__(node, example)
1598
1599        Descends down the tree until it reaches a leaf or a node in
1600        which a vote of subtrees is required. In both cases, a tuple
1601        of two elements is returned; in the former, the tuple contains
        the reached node and None; in the latter, it
        contains a node and weights of votes for subtrees (a list of floats).
1604
1605        :obj:`Descender`'s that never split examples always descend
1606        to a leaf, but they differ in the treatment of examples with
1607        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
1609        :obj:`Descender`'s that do split examples differ in returned
1610        vote weights.
1611
1612.. class:: Descender_UnknownsToNode
1613
1614    Bases: :obj:`Descender`
1615
    When an example cannot be classified into a single branch, the
1617    current node is returned. Thus, the node's :obj:`NodeClassifier`
1618    will be used to make a decision. It is your responsibility to see
1619    that even the internal nodes have their :obj:`NodeClassifier`
    (i.e., do not disable creating node classifiers and do not remove
    them after the induction).
1622
1623.. class:: Descender_UnknownsToBranch
1624
1625    Bases: :obj:`Descender`
1626
1627    Classifies examples with unknown value to a special branch. This
1628    makes sense only if the tree itself was constructed with
1629    :obj:`ExampleSplitter_UnknownsToBranch`.
1630
1631.. class:: Descender_UnknownsToCommonBranch
1632
1633    Bases: :obj:`Descender`
1634
1635    Classifies examples with unknown values to the branch with the
1636    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.
1638
1639.. class:: Descender_UnknownsToCommonSelector
1640
1641    Bases: :obj:`Descender`
1642
1643    Classifies examples with unknown values to the branch which received
1644    the highest recommendation by the selector.
1645
1646.. class:: Descender_MergeAsBranchSizes
1647
1648    Bases: :obj:`Descender`
1649
1650    Makes the subtrees vote for the example's class; the vote is
1651    weighted according to the sizes of the branches.
1652
1653.. class:: Descender_MergeAsSelector
1654
1655    Bases: :obj:`Descender`
1656
1657    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.
1659
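For illustration, the descender can be replaced after induction. The
sketch below assumes the induced classifier exposes it as
:obj:`descender` and uses :obj:`Descender_UnknownMergeAsBranchSizes`,
the name under which this module imports the component::

    import Orange
    from Orange.classification import tree

    data = Orange.data.Table("lenses")
    classifier = tree.TreeLearnerBase()(data)
    classifier.descender = tree.Descender_UnknownMergeAsBranchSizes()
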
1660Pruner and derived classes
1661==============================
1662
1663.. index::
1664    pair: classification trees; pruning
1665
Classes derived from :obj:`Pruner` prune the trees as
described in the section on pruning XXXXXXXX - make sure you read it
1668to understand what the pruners will do to your trees.
1669
1670.. class:: Pruner
1671
1672    This is an abstract base class which defines nothing useful, only
1673    a pure virtual call operator.
1674
1675    .. method:: __call__(tree)
1676
1677        Prunes a tree. The argument can be either a tree classifier or
1678        a tree node; the result is of the same type as the argument.
1679
1680.. class:: Pruner_SameMajority
1681
1682    Bases: :class:`Pruner`
1683
    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
1685    with more than one leaf) in which all the leaves have the same majority
1686    class. (This is allowed because those leaves can still have different
1687    distributions of classes and thus predict different probabilities.)
1688    However, this can be undesired when we're only interested in the
1689    class prediction or a simple tree interpretation. The
1690    :obj:`Pruner_SameMajority` prunes the tree so that there is no
1691    subtree in which all the nodes would have the same majority class.
1692
1693    This pruner will only prune the nodes in which the node classifier
1694    is of class :obj:`orange.DefaultClassifier` (or from a derived class).
1695
1696    Note that the leaves with more than one majority class require some
1697    special handling. The pruning goes backwards, from leaves to the root.
1698    When siblings are compared, the algorithm checks whether they
1699    have (at least one) common majority class. If so, they can be pruned.
1700
1701.. class:: Pruner_m
1702
1703    Bases: :class:`Pruner`
1704
1705    Prunes a tree by comparing m-estimates of static and dynamic
1706    error as defined in (Bratko, 2002).
1707
1708    .. attribute:: m
1709
1710        Parameter m for m-estimation.
1711
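For example, the pruners can be applied to an induced tree directly;
this is what :class:`TreeLearner` below does when
:obj:`sameMajorityPruning` or :obj:`mForPruning` are set::

    import Orange
    from Orange.classification import tree

    data = Orange.data.Table("lenses")
    classifier = tree.TreeLearnerBase()(data)
    pruned = tree.Pruner_SameMajority(classifier)
    pruned = tree.Pruner_m(pruned, m=2)
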
1712Undocumented
1713============
1714
1715.. class:: NodeList
1716
1717    Undocumented.
1718
1719.. class:: C45NodeList
1720
1721    Undocumented.
1722
1723===========================
1724C4.5 Classifier and Learner
1725===========================
1726
1727As C4.5 is a standard benchmark in machine learning,
1728it is incorporated in Orange, although Orange has its own
1729implementation of decision trees.
1730
1731The implementation uses the original Quinlan's code for learning so the
tree you get is exactly like the one that would be built by the standalone
1733C4.5. Upon return, however, the original tree is copied to Orange
1734components that contain exactly the same information plus what is needed
1735to make them visible from Python. To be sure that the algorithm behaves
1736just as the original, we use a dedicated class :class:`C45Node`
1737instead of reusing the components used by Orange's tree inducer
(i.e., :class:`Node`). This, however, could be done and probably
1739will be done in the future; we shall still retain :class:`C45Node`
1740but offer transformation to :class:`Node` so that routines
1741that work on Orange trees will also be usable for C45 trees.
1742
1743:class:`C45Learner` and :class:`C45Classifier` behave
1744like any other Orange learner and classifier. Unlike most of Orange
learning algorithms, C4.5 does not accept weighted examples.
1746
1747Building the C4.5 plug-in
1748=========================
1749
1750We haven't been able to obtain the legal rights to distribute
1751C4.5 and therefore couldn't statically link it into Orange. Instead,
1752it's incorporated as a plug-in which you'll need to build yourself.
1753The procedure is trivial, but you'll need a C compiler. On Windows,
1754the scripts we provide work with MS Visual C and the files CL.EXE
1755and LINK.EXE must be on the PATH. On Linux you're equipped with
1756gcc. Mac OS X comes without gcc, but you can download it for
1757free from Apple.
1758
1759Orange must be installed prior to building C4.5. (This is because
1760the build script will copy the created file next to Orange,
1761which it obviously can't if Orange isn't there yet.)
1762
1763#. Download the
1764   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `RuleQuest's site <http://www.rulequest.com/>`_ and extract
1766   them into some temporary directory. The files will be modified in the
1767   further process, so don't use your copy of Quinlan's sources that you
1768   need for another purpose.
1769#. Download
1770   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
1771   and unzip its contents into the directory R8/Src of the Quinlan's
1772   stuff (it's the directory that contains, for instance, the file
1773   average.c).
1774#. Run buildC45.py, which will build the plug-in and put it next to
1775   orange.pyd (or orange.so on Linux/Mac).
1776#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
1778   If this fails, something went wrong; see the diagnostic messages from
   buildC45.py and read the paragraph below.
1780#. Finally, you can remove the Quinlan's stuff, along with everything
1781   created by buildC45.py.
1782
1783If the process fails, here's what buildC45.py really does: it creates
1784.h files that wrap Quinlan's .i files and ensure that they are not
1785included twice. It modifies C4.5 sources to include .h's instead of
1786.i's. This step can hardly fail. Then follows the platform dependent
1787step which compiles ensemble.c (which includes all the Quinlan's .c
1788files it needs) into c45.dll or c45.so and puts it next to Orange.
1789If this fails, but you do have a C compiler and linker, and you know
1790how to use them, you can compile the ensemble.c into a dynamic
1791library yourself. See the compile and link steps in buildC45.py,
if it helps. Anyway, after doing this, check that the built C4.5
1793gives the same results as the original.
1794
1795.. class:: C45Learner
1796
1797    :class:`C45Learner`'s attributes have double names - those that
1798    you know from C4.5 command lines and the corresponding names of C4.5's
1799    internal variables. All defaults are set as in C4.5; if you change
1800    nothing, you are running C4.5.
1801
1802    .. attribute:: gainRatio (g)
1803       
        Determines whether to use information gain (false, default)
1805        or gain ratio for selection of attributes (true).
1806
1807    .. attribute:: batch (b)
1808
1809        Turn on batch mode (no windows, no iterations); this option is
1810        not documented in C4.5 manuals. It conflicts with "window",
1811        "increment" and "trials".
1812
1813    .. attribute:: subset (s)
1814       
        Enables subsetting (default: false, no subsetting).
1816 
1817    .. attribute:: probThresh (p)
1818
1819        Probabilistic threshold for continuous attributes (default: false).
1820
1821    .. attribute:: minObjs (m)
1822       
1823        Minimal number of objects (examples) in leaves (default: 2).
1824
1825    .. attribute:: window (w)
1826
        Initial window size (default: maximum of 20% and twice the
1828        square root of the number of data objects).
1829
1830    .. attribute:: increment (i)
1831
1832        The maximum number of objects that can be added to the window
1833        at each iteration (default: 20% of the initial window size).
1834
1835    .. attribute:: cf (c)
1836
        Pruning confidence level (default: 25%).
1838
1839    .. attribute:: trials (t)
1840
1841        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1842
1843    .. attribute:: prune
1844       
        Returns a pruned tree (not an original C4.5 option) (default: true).
1846
1847
1848:class:`C45Learner` also offers another way for setting
1849the arguments: it provides a function :obj:`C45Learner.commandLine`
1850which is given a string and parses it the same way as C4.5 would
1851parse its command line. XXXXXXXXXXX
1852
1853.. class:: C45Classifier
1854
1855    A faithful reimplementation of Quinlan's function from C4.5. The only
1856    difference (and the only reason it's been rewritten) is that it uses
1857    a tree composed of :class:`C45Node` instead of C4.5's
1858    original tree structure.
1859
1860    .. attribute:: tree
1861
1862        C4.5 tree stored as a tree of :obj:`C45Node`.
1863
1864
1865.. class:: C45Node
1866
1867    This class is a reimplementation of the corresponding *struct* from
1868    Quinlan's C4.5 code.
1869
1870    .. attribute:: nodeType
1871
1872        Type of the node:  :obj:`C45Node.Leaf` (0),
1873        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
1874        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
1875        split examples based on values of a discrete attribute,
1876        "cuts" cut them according to a threshold value of a continuous
1877        attributes and "subsets" use discrete attributes but with subsetting
1878        so that several values can go into the same branch.
1879
1880    .. attribute:: leaf
1881
1882        Value returned by that leaf. The field is defined for internal
1883        nodes as well.
1884
1885    .. attribute:: items
1886
1887        Number of (learning) examples in the node.
1888
1889    .. attribute:: classDist
1890
1891        Class distribution for the node (of type
1892        :obj:`orange.DiscDistribution`).
1893
1894    .. attribute:: tested
1895       
        The attribute used in the node's test. If node is a leaf,
        :obj:`tested` is None; if node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if node
        is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.
1900
1901    .. attribute:: cut
1902
1903        A threshold for continuous attributes, if node is of type :obj:`Cut`.
1904        Undefined otherwise.
1905
1906    .. attribute:: mapping
1907
1908        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the index of the branch for examples whose value of :obj:`tested` is *i*.
1910        Here, *i* denotes an index of value, not a :class:`Orange.data.Value`.
1911
1912    .. attribute:: branch
1913       
1914        A list of branches stemming from this node.
1915
1916Examples
1917========
1918
1919.. _tree_c45.py: code/tree_c45.py
1920.. _iris.tab: code/iris.tab
1921
1922The simplest way to use :class:`C45Learner` is to call it. This
1923script constructs the same learner as you would get by calling
1924the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):
1925
1926.. literalinclude:: code/tree_c45.py
1927   :lines: 7-14
1928
Arguments can be set by the usual mechanism (the two lines below do the
1930same, except that one uses command-line symbols and the other internal
1931variable names)
1932
1933::
1934
1935    tree = Orange.classification.tree.C45Learner(data, m=100)
1936    tree = Orange.classification.tree.C45Learner(data, minObjs=100)
1937
Veteran C4.5 users might prefer to set the arguments through the
method :obj:`C45Learner.commandline`.
1940
1941::
1942
    lrn = Orange.classification.tree.C45Learner()
1944    lrn.commandline("-m 1 -s")
1945    tree = lrn(data)
1946
1947There's nothing special about using :obj:`C45Classifier` - it's
1948just like any other. To demonstrate what the structure of
:class:`C45Node`'s looks like, we will show a script that prints
1950it out in the same format as C4.5 does.
1951
1952.. literalinclude:: code/tree_c45_printtree.py
1953
1954Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
1956:obj:`C45Node` does not know to which attribute it belongs), we need to
1957convert it to a string through :obj:`classVar`, which is passed as an
1958extra argument to the recursive part of printTree.
1959
1960For discrete splits without subsetting, we print out all attribute values
1961and recursively call the function for all branches. Continuous splits are
1962equally easy to handle.
1963
For discrete splits with subsetting, we iterate through branches, retrieve
the corresponding values that go into each branch (into :samp:`inset`), turn
1966the values into strings and print them out, separately treating the
1967case when only a single value goes into the branch.
1968
1969Printing out C45 Tree
1970=====================
1971
1972.. autofunction:: printTreeC45
1973
1974TODO
1975====
1976
1977This page does not provide examples for programming your own components,
1978such as, for instance, a :obj:`SplitConstructor`. Those examples
1979can be found on a page dedicated to callbacks to Python.
1980
1981References
1982==========
1983
1984Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`, Addison
1985Wesley, 2002.
1986
1987E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories,
1988Murray Hill NJ, U.S.A., October 1993.
1989
1990`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_
1991A home page of AT&T's dot and similar software packages.
1992
1993"""
1994
1995from Orange.core import \
1996     TreeLearner as TreeLearnerBase, \
1997         TreeClassifier as _TreeClassifier, \
1998         C45Learner, \
1999         C45Classifier, \
2000         C45TreeNode as C45Node, \
2001         C45TreeNodeList as C45NodeList, \
2002         TreeDescender as Descender, \
2003              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
2004              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
2005              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
2006              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
2007              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
2008         TreeExampleSplitter as Splitter, \
2009              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
2010              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
2011              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
2012              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
2013              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
2014              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
2015              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
2016         TreeNode as Node, \
2017         TreeNodeList as NodeList, \
2018         TreePruner as Pruner, \
2019              TreePruner_SameMajority as Pruner_SameMajority, \
2020              TreePruner_m as Pruner_m, \
2021         TreeSplitConstructor as SplitConstructor, \
2022              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
2023              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
2024                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
2025                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
2026                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
2027                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
2028         TreeStopCriteria as StopCriteria, \
2029              TreeStopCriteria_Python as StopCriteria_Python, \
2030              TreeStopCriteria_common as StopCriteria_common
2031
2032import Orange.core
2033import operator
2034import base64
2035import re
2036import Orange.data
2037import Orange.feature.scoring
2038import Orange.classification.tree
2039
2040def _c45_showBranch(node, classvar, lev, i):
2041    var = node.tested
2042    if node.nodeType == 1:
2043        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
2044        _c45_printTree0(node.branch[i], classvar, lev+1)
2045    elif node.nodeType == 2:
2046        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
2047        _c45_printTree0(node.branch[i], classvar, lev+1)
2048    else:
2049        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
2050        inset = [var.values[j[0]] for j in inset]
2051        if len(inset)==1:
2052            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
2053        else:
2054            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
2055        _c45_printTree0(node.branch[i], classvar, lev+1)
2056       
2057       
2058def _c45_printTree0(node, classvar, lev):
2059    var = node.tested
2060    if node.nodeType == 0:
2061        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
2062    else:
2063        for i, branch in enumerate(node.branch):
2064            if not branch.nodeType:
2065                _c45_showBranch(node, classvar, lev, i)
2066        for i, branch in enumerate(node.branch):
2067            if branch.nodeType:
2068                _c45_showBranch(node, classvar, lev, i)
2069
2070def printTreeC45(tree):
2071    """
2072    Prints the tree given as an argument in the same form as Ross Quinlan's
2073    C4.5 program.
2074
2075    ::
2076
2077        import Orange
2078
2079        data = Orange.data.Table("voting")
2080        c45 = Orange.classification.tree.C45Learner(data)
2081        Orange.classification.tree.printTreeC45(c45)
2082
2083    will print out
2084
2085    ::
2086
2087        physician-fee-freeze = n: democrat (253.4)
2088        physician-fee-freeze = y:
2089        |   synfuels-corporation-cutback = n: republican (145.7)
2090        |   synfuels-corporation-cutback = y:
2091        |   |   mx-missile = y: democrat (6.0)
2092        |   |   mx-missile = n:
2093        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
2094        |   |   |   adoption-of-the-budget-resolution = y:
2095        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
2096        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
2097
2098
    If you run the standalone C4.5 on the same data (Orange does use the original C4.5 code for learning), it will print out
2100
2101    ::
2102
2103        physician-fee-freeze = n: democrat (253.4/5.9)
2104        physician-fee-freeze = y:
2105        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
2106        |   synfuels-corporation-cutback = y:
2107        |   |   mx-missile = y: democrat (6.0/2.4)
2108        |   |   mx-missile = n:
2109        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
2110        |   |   |   adoption-of-the-budget-resolution = y:
2111        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
2112        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
2113
    which is almost identical, except that C4.5 tested the tree on
    the learning data and also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously cannot
    do (nor is there any need for it to).
2118
2119    """
2120    _c45_printTree0(tree.tree, tree.classVar, 0)
2121
2122
2123import Orange.feature.scoring as fscoring
2124
2125class TreeLearner(Orange.core.Learner):
2126    """
2127    Assembles the generic classification or regression tree learner
2128    (from Orange's objects for induction of decision trees).
2129    :class:`TreeLearner` is essentially a wrapper
2130    around :class:`TreeLearnerBase`, provided for easier use of the latter.
2131    It sets a number of parameters used in induction that
2132    can also be set after the creation of the object, most often through
2133    the object's attributes. If upon initialization
2134    :class:`TreeLearner` is given a set of examples, then an instance
2135    of :class:`TreeClassifier` object is returned instead.
2136
    The values of attributes can also be set in the constructor.
2138
2139    .. attribute:: nodeLearner
2140
2141        Induces a classifier from examples belonging to a node. The
2142        same learner is used for internal nodes and for leaves. The
2143        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
2144
2145    **Split construction**
2146
2147    .. attribute:: split
2148       
2149        Defines a function that will be used in place of
2150        :obj:`SplitConstructor`.
2151        Useful when prototyping new tree induction
2152        algorithms. When this parameter is defined, other parameters that
2153        affect the procedures for growing of the tree are ignored. These
2154        include :obj:`binarization`, :obj:`measure`,
2155        :obj:`worstAcceptable` and :obj:`minSubset` (Default:
        :class:`SplitConstructor_Combined`
2157        with separate constructors for discrete and continuous attributes.
2158        Discrete attributes are used as they are, while
2159        continuous attributes are binarized.
2160        Gain ratio is used to select attributes.
2161        A minimum of two examples in a leaf is required for
2162        discrete and five examples in a leaf for continuous attributes.)
2163
2164    .. attribute:: binarization
2165
2166        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
2167        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
2168        0, do not use binarization (use :class:`SplitConstructor_Attribute`).
2169        Default: 0.
2170
2171    .. attribute:: measure
2172   
2173        Measure for scoring of the attributes when deciding which of the
2174        attributes will be used for splitting of the example set in the node.
2175        Can be either a :class:`Orange.feature.scoring.Measure` or one of
2176        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
2177        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
2178        "gini" (:class:`Orange.feature.scoring.Gini`),
2179        "relief" (:class:`Orange.feature.scoring.Relief`),
2180        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
2181
2182    .. attribute:: reliefM, reliefK
2183
        Set `m` and `k` to the given values if the :obj:`measure` is relief.
2185
2186    **Pruning**
2187
2188    .. attribute:: worstAcceptable
2189
2190        Used in pre-pruning, sets the lowest required attribute
2191        score. If the score of the best attribute is below this margin, the
2192        tree at that node is not grown further (default: 0).
2193
2194        So, to allow splitting only when gainRatio (the default measure)
2195        is greater than 0.6, set :samp:`worstAcceptable=0.6`.
2196
2197    .. attribute:: minSubset
2198
2199        Minimal number of examples in non-null leaves (default: 0).
2200
2201    .. attribute:: minExamples
2202
2203        Data subsets with less than :obj:`minExamples`
2204        examples are not split any further, that is, all leaves in the tree
        will contain at least that many examples (default: 0).
2206
2207    .. attribute:: maxDepth
2208
        Gives maximal tree depth; 0 means that only the root is generated.
2210        The default is 100.
2211
2212    .. attribute:: maxMajority
2213
2214        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0).
2216        To stop the induction as soon as the majority class reaches 70%,
2217        you should use :samp:`maxMajority=0.7`, as in the following
2218        example. The numbers show the majority class
2219        proportion at each node. The script `tree2.py`_ induces and
2220        prints this tree.
2221
2222        .. _tree2.py: code/tree2.py
2223
2224        ::
2225
2226            root: 0.333
2227            |    petal width<=0.800: 1.000
2228            |    petal width>0.800: 0.500
2229            |    |    petal width<=1.750: 0.907
2230            |    |    petal width>1.750: 0.978
2231   
2232    .. attribute:: stop
2233
2234        Used for passing a function which is used in place of
2235        :class:`StopCriteria`. Useful when prototyping new
2236        tree induction algorithms. See a documentation on
2237        :class:`StopCriteria` for more info on this function.
2238        When used, parameters  :obj:`maxMajority` and :obj:`minExamples`
        will not be considered (default: None).
2240        The default stopping criterion stops induction when all examples
2241        in a node belong to the same class.
2242
2243    .. attribute:: mForPruning
2244
2245        If non-zero, invokes an error-based bottom-up post-pruning,
2246        where m-estimate is used to estimate class probabilities
2247        (default: 0).
2248
2249    .. attribute:: sameMajorityPruning
2250
2251        If true, invokes a bottom-up post-pruning by removing the
2252        subtrees of which all leaves classify to the same class
2253        (default: False).
2254
2255    **Record keeping**
2256
2257    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
2258
2259        Determines whether to store class distributions, contingencies and
2260        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. By default everything except
        :obj:`storeExamples` is enabled. You won't save any memory by not storing
        distributions but storing contingencies, since distributions actually point to
        the same distribution that is stored in
2265        :obj:`contingency.classes`. (default: True except for
2266        storeExamples, which defaults to False).
2267   
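    For instance, a learner combining several of the attributes described
    above (a sketch that uses only documented attributes)::

        import Orange

        data = Orange.data.Table("iris")
        learner = Orange.classification.tree.TreeLearner(
            measure="infoGain", maxDepth=3, minExamples=5,
            sameMajorityPruning=True)
        classifier = learner(data)
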
2268    """
2269    def __new__(cls, examples = None, weightID = 0, **argkw):
2270        self = Orange.core.Learner.__new__(cls, **argkw)
2271        if examples:
2272            self.__init__(**argkw)
2273            return self.__call__(examples, weightID)
2274        else:
2275            return self
2276     
2277    def __init__(self, **kw):
2278        self.split = None
2279        self.stop = None
2280        self.measure = None
2281        self.__dict__.update(kw)
2282     
2283    def __call__(self, examples, weight=0):
2284        """
2285        Return a classifier from the given examples.
2286        """
2287        bl = self._base_learner()
2288
2289        #set the scoring criteria for regression if it was not
2290        #set by the user
2291        if not self.split and not self.measure:
2292            measure = fscoring.GainRatio() \
2293                if examples.domain.classVar.varType == Orange.data.Type.Discrete \
2294                else fscoring.MSE()
2295            bl.split.continuousSplitConstructor.measure = measure
2296            bl.split.discreteSplitConstructor.measure = measure
2297         
2298        #post pruning
2299        tree = bl(examples, weight)
2300        if getattr(self, "sameMajorityPruning", 0):
2301            tree = Pruner_SameMajority(tree)
2302        if getattr(self, "mForPruning", 0):
2303            tree = Pruner_m(tree, m=self.mForPruning)
2304
2305        return TreeClassifier(baseClassifier=tree) 
2306
2307    def instance(self):
2308        """
2309        DEPRECATED. Return a base learner - an object
2310        of :class:`TreeLearnerBase`.
2311        This method is left for backwards compatibility.
2312        """
2313        return self._base_learner()
2314
2315    def build_split(self):
2316        """
2317        Return the split constructor built according to object attributes.
2318        """
2319       
2320        split = self.split
2321
2322        if not split:
2323            split = SplitConstructor_Combined()
2324            split.continuousSplitConstructor = \
2325                SplitConstructor_Threshold()
2326            binarization = getattr(self, "binarization", 0)
2327            if binarization == 1:
2328                split.discreteSplitConstructor = \
2329                    SplitConstructor_ExhaustiveBinary()
2330            elif binarization == 2:
2331                split.discreteSplitConstructor = \
2332                    SplitConstructor_OneAgainstOthers()
2333            else:
2334                split.discreteSplitConstructor = \
2335                    SplitConstructor_Feature()
2336
2337            measures = {"infoGain": fscoring.InfoGain,
2338                "gainRatio": fscoring.GainRatio,
2339                "gini": fscoring.Gini,
2340                "relief": fscoring.Relief,
2341                "retis": fscoring.MSE
2342                }
2343
2344            measure = self.measure
2345            if isinstance(measure, str):
2346                measure = measures[measure]()
2347            if not measure:
2348                measure = fscoring.GainRatio()
2349
2350            measureIsRelief = isinstance(measure, fscoring.Relief)
2351            relM = getattr(self, "reliefM", None)
2352            if relM and measureIsRelief:
2353                measure.m = relM
2354           
2355            relK = getattr(self, "reliefK", None)
2356            if relK and measureIsRelief:
2357                measure.k = relK
2358
2359            split.continuousSplitConstructor.measure = measure
2360            split.discreteSplitConstructor.measure = measure
2361
2362            wa = getattr(self, "worstAcceptable", 0)
2363            if wa:
2364                split.continuousSplitConstructor.worstAcceptable = wa
2365                split.discreteSplitConstructor.worstAcceptable = wa
2366
2367            ms = getattr(self, "minSubset", 0)
2368            if ms:
2369                split.continuousSplitConstructor.minSubset = ms
2370                split.discreteSplitConstructor.minSubset = ms
2371
2372        return split
2373
2374    def build_stop(self):
2375        """
2376        Return the stop criteria built according to object's attributes.
2377        """
2378        stop = self.stop
2379        if not stop:
2380            stop = Orange.classification.tree.StopCriteria_common()
2381            mm = getattr(self, "maxMajority", 1.0)
2382            if mm < 1.0:
2383                stop.maxMajority = self.maxMajority
2384            me = getattr(self, "minExamples", 0)
2385            if me:
2386                stop.minExamples = self.minExamples
2387        return stop
2388
2389    def _base_learner(self):
2390        learner = TreeLearnerBase()
2391
2392        learner.split = self.build_split()
2393        learner.stop = self.build_stop()
2394
2395        for a in ["storeDistributions", "storeContingencies", "storeExamples", 
2396            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
2397            if hasattr(self, a):
2398                setattr(learner, a, getattr(self, a))
2399
2400        return learner
2401
2402# the following is for the output
2403
2404
2405fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2406""" Defines the multiplier by 100 (:samp:`^`) and the format
2407for the number of decimals (e.g. :samp:`5.3`). The corresponding
2408groups are named :samp:`m100` and :samp:`fs`. """
2409
2410by = r"(?P<by>(b(P|A)))?"
2411""" Defines bP or bA or nothing; the result is in groups by. """
2412
2413bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2414opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2415opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2416intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2417fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2418re_V = re.compile("%V")
2419re_N = re.compile("%"+fs+"N"+by)
2420re_M = re.compile("%"+fs+"M"+by)
2421re_m = re.compile("%"+fs+"m"+by)
2422re_Ccont = re.compile("%"+fs+"C"+by+opc)
2423re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2424re_ccont = re.compile("%"+fs+"c"+by+opc)
2425re_cdisc = re.compile("%"+fs+"c"+by+opd)
2426re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2427re_cconti = re.compile("%"+fs+"c"+by+fromto)
2428re_D = re.compile("%"+fs+"D"+by)
2429re_d = re.compile("%"+fs+"d"+by)
2430re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2431re_I = re.compile("%"+fs+"I"+intrvl)
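
# A rough map of the format codes handled below, inferred from the
# replace* functions that follow (an orientation aid, not a full spec):
#   %V    - the value predicted at the node
#   %N    - number of examples, optionally relative to the parent (bP)
#           or to the whole tree (bA)
#   %M/%m - absolute/relative number of examples of the majority class
#   %C/%c - absolute/relative number of examples matching a comparison
#           (continuous class), a class value (discrete class), or an
#           interval
#   %D/%d - absolute/relative class distribution (discrete class)
#   %A/%E - average/error of the class values (regression trees)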
2432
2433def insertStr(s, mo, sub):
2434    """ Replace the part of s which is covered by mo
2435    with the string sub. """
2436    return s[:mo.start()] + sub + s[mo.end():]
2437
2438
2439def insertDot(s, mo):
2440    """ Replace the part of s which is covered by mo
2441    with a dot.  You should use this when the
2442    function cannot compute the desired quantity; it is called, for instance,
2443    when it needs to divide by something in the parent, but the parent
2444    doesn't exist.
2445    """
2446    return s[:mo.start()] + "." + s[mo.end():]
2447
2448def insertNum(s, mo, n):
2449    """ Replace the part of s matched by mo with the number n,
2450    formatted as specified by the user, that is, it multiplies
2451    it by 100, if needed, and prints with the right number of
2452    places and decimals. It does so by checking the mo
2453    for a group named m100 (representing the :samp:`^` in the format string)
    and a group named fs representing the part giving the number of
    decimals (e.g. :samp:`5.3`).
2456    """
2457    grps = mo.groupdict()
2458    m100 = grps.get("m100", None)
2459    if m100:
2460        n *= 100
2461    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2462    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
2463
2464def byWhom(by, parent, tree):
2465    """ If by equals bp, it returns parent, else it returns
2466    :samp:`tree.tree`. This is used to find what to divide the quantity
2467    with, when division is required.
2468    """
2469    if by=="bP":
2470        return parent
2471    else:
2472        return tree.tree
2473
2474def replaceV(strg, mo, node, parent, tree):
2475    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
2476
2477def replaceN(strg, mo, node, parent, tree):
2478    by = mo.group("by")
2479    N = node.distribution.abs
2480    if by:
2481        whom = byWhom(by, parent, tree)
2482        if whom and whom.distribution:
2483            if whom.distribution.abs > 1e-30:
2484                N /= whom.distribution.abs
2485        else:
2486            return insertDot(strg, mo)
2487    return insertNum(strg, mo, N)
2488       
2489
2490def replaceM(strg, mo, node, parent, tree):
2491    by = mo.group("by")
2492    maj = int(node.nodeClassifier.defaultValue)
2493    N = node.distribution[maj]
2494    if by:
2495        whom = byWhom(by, parent, tree)
2496        if whom and whom.distribution:
2497            if whom.distribution[maj] > 1e-30:
2498                N /= whom.distribution[maj]
2499        else:
2500            return insertDot(strg, mo)
2501    return insertNum(strg, mo, N)
2502       
2503
2504def replacem(strg, mo, node, parent, tree):
2505    by = mo.group("by")
2506    maj = int(node.nodeClassifier.defaultValue)
2507    if node.distribution.abs > 1e-30:
2508        N = node.distribution[maj] / node.distribution.abs
        if by:
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
2511                byN = whom.distribution[maj] / whom.distribution.abs
2512                if byN > 1e-30:
2513                    N /= byN
2514            else:
2515                return insertDot(strg, mo)
2516    else:
2517        N = 0.
2518    return insertNum(strg, mo, N)
2519
2520
2521def replaceCdisc(strg, mo, node, parent, tree):
2522    if tree.classVar.varType != Orange.data.Type.Discrete:
2523        return insertDot(strg, mo)
2524   
2525    by, op, cls = mo.group("by", "op", "cls")
2526    N = node.distribution[cls]
2527    if op == "!=":
2528        N = node.distribution.abs - N
2529    if by:
2530        whom = byWhom(by, parent, tree)
2531        if whom and whom.distribution:
2532            if whom.distribution[cls] > 1e-30:
2533                N /= whom.distribution[cls]
2534        else:
2535            return insertDot(strg, mo)
2536    return insertNum(strg, mo, N)
2537
2538   
2539def replacecdisc(strg, mo, node, parent, tree):
2540    if tree.classVar.varType != Orange.data.Type.Discrete:
2541        return insertDot(strg, mo)
2542   
2543    op, by, cls = mo.group("op", "by", "cls")
2544    N = node.distribution[cls]
2545    if node.distribution.abs > 1e-30:
2546        N /= node.distribution.abs
2547        if op == "!=":
2548            N = 1 - N
2549    if by:
2550        whom = byWhom(by, parent, tree)
2551        if whom and whom.distribution:
2552            if whom.distribution[cls] > 1e-30:
2553                N /= whom.distribution[cls] / whom.distribution.abs
2554        else:
2555            return insertDot(strg, mo)
2556    return insertNum(strg, mo, N)
2557
2558
2559__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2560
2561def replaceCcont(strg, mo, node, parent, tree):
2562    if tree.classVar.varType != Orange.data.Type.Continuous:
2563        return insertDot(strg, mo)
2564   
2565    by, op, num = mo.group("by", "op", "num")
2566    op = __opdict[op]
2567    num = float(num)
2568    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2569    if by:
2570        whom = byWhom(by, parent, tree)
2571        if whom and whom.distribution:
2572            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2573            if byN > 1e-30:
2574                N /= byN
2575        else:
2576            return insertDot(strg, mo)
2577
2578    return insertNum(strg, mo, N)
2579   
2580   
2581def replaceccont(strg, mo, node, parent, tree):
2582    if tree.classVar.varType != Orange.data.Type.Continuous:
2583        return insertDot(strg, mo)
2584   
2585    by, op, num = mo.group("by", "op", "num")
2586    op = __opdict[op]
2587    num = float(num)
2588    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
2589    if node.distribution.abs > 1e-30:
2590        N /= node.distribution.abs
2591    if by:
2592        whom = byWhom(by, parent, tree)
2593        if whom and whom.distribution:
2594            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
2595            if byN > 1e-30:
2596                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
2597        else:
2598            return insertDot(strg, mo)
2599    return insertNum(strg, mo, N)
2600
2601
2602def extractInterval(mo, dist):
2603    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
2604    lower, upper = float(lower), float(upper)
2605    if out:
2606        lop = lowin == "(" and operator.le or operator.lt
        hop = upin == ")" and operator.ge or operator.gt
2608        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
2609    else:
2610        lop = lowin == "(" and operator.gt or operator.ge
2611        hop = upin == ")" and operator.lt or operator.le
2612        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())
2613
2614   
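# Substitutes the interval forms of %C (absolute, below) and %c
# (relative, further below) for continuous classes: the weight of
# examples whose class value lies in the interval given in the format.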
def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN/whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


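# Substitutes the %D (absolute counts) and %d (relative frequencies)
# formats: the class distribution of the node, printed as a list and,
# with "by", divided element-wise by the distribution of the
# referenced parent node.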
def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs # abs >= d, so d > 1e-30 implies abs > 1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")


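# Substitutes the %A (average) and %E (error estimate) formats for
# regression trees; with the "b" modifier the value is divided by, and
# without it subtracted from, the corresponding value in the
# referenced parent node.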
def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)


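# Critical values of the standard normal distribution for the
# confidence levels supported by the %I format.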
Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
    mul = mo.group("m100") and 100 or 1

    if not Z.has_key(intrvl):
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (intrvl*100)

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))


# This class is more a collection of functions, merged into a class so
# that they do not need to pass around too many arguments. It is
# constructed, used and discarded; it is not meant to store any state.
class _TreeDumper:
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI) ]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
        maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr


    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    strt = mo.start()+1

        return strg


    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes


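    # Recursively renders the subtree below the given node; when
    # simpleFirst is set, branches that end in a leaf are printed
    # before branches with larger subtrees.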
    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n"+("|    "*lev)+"    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)


    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)


    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName+chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                self.formatString(self.leafStr, node, parent)))


    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    return '"%s"' % (base64.b64encode(x))

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier: baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
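
        For example, a minimal sketch (``tree`` and ``data`` stand for
        a trained :obj:`TreeClassifier` and the table it was built
        from)::

            value, probs = tree(data[0],
                Orange.classification.Classifier.GetBoth)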
        """
        return self.nativeClassifier(instance, result_type, *args, **kwdargs)

    def __setattr__(self, name, value):
        if name == "nativeClassifier":
            self.__dict__[name] = value
            return
        if name in self.nativeClassifier.__dict__:
            self.nativeClassifier.__dict__[name] = value
        self.__dict__[name] = value

    def dump(self, leafStr = "", nodeStr = "", **argkw):
        """
        Return a string representation of a tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" will be used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches that end in a
          single node (a leaf) are printed before the branches with larger
          subtrees. If False, the branches are printed in order of
          appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of regular expressions and callback
          functions through which the user can print out additional
          information in the nodes.
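
        For example, a minimal usage sketch (assuming the iris data
        set)::

            data = Orange.data.Table("iris")
            tree = Orange.classification.tree.TreeLearner(data)
            print tree.dump(leafStr="%V (%M out of %N)")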
        """
        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
            self).dumpTree()

    def dot(self, fileName, leafStr = "", nodeStr = "", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """ Prints the tree to a file in a format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        Uses the same parameters as :meth:`dump` defined above
        plus two parameters that define the shapes used for internal
        nodes and leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used
            (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
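
        For example, a sketch that writes the tree to a hypothetical
        ``tree.dot`` with boxed internal nodes and elliptic leaves::

            tree.dot("tree.dot", nodeShape="box", leafShape="ellipse")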
        """
        fle = type(fileName) == str and file(fileName, "wt") or fileName

        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()

    def count_nodes(self):
        """
        Return the number of nodes.
        """
        return _countNodes(self.tree)

    def count_leaves(self):
        """
        Return the number of leaves.
        """
        return _countLeaves(self.tree)

def _countNodes(node):
    count = 0
    if node:
        count += 1
        if node.branches:
            for branch in node.branches:
                count += _countNodes(branch)
    return count

def _countLeaves(node):
    count = 0
    if node:
        if node.branches: # internal node
            for branch in node.branches:
                count += _countLeaves(branch)
        else:
            count += 1
    return count