source: orange/orange/Orange/classification/tree.py @ 7718:c0ba2cbb5af1

"""

.. index:: classification tree

.. index::
   single: classification; tree

********************
Classification trees
********************

To build a small tree (:obj:`TreeClassifier`) from the iris data set
(with the depth limited to three levels), use
(part of `orngTree1.py`_, uses `iris.tab`_):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

.. _orngTree1.py: code/orngTree1.py
.. _iris.tab: code/iris.tab


This page first describes the learner and the classifier, and then
defines the base classes (individual components) of the trees
and the tree-building process.

.. autoclass:: TreeLearner
    :members:

.. autoclass:: TreeClassifier
    :members:

========
Examples
========

For example, here's how to write your own stop
function. The example constructs and prints two trees. For the first
tree we define the *defStop* function, which is used by default, and
combine it with a random function so that the stop criterion is also
met in 20% of the cases when *defStop* is false. For the second tree
the stopping criterion is entirely random. Note that in the second
case the lambda function still has three parameters, since this is the
number of parameters a stop function (:obj:`StopCriteria`) must accept.
Part of `tree3.py`_ (uses `iris.tab`_):

.. _tree3.py: code/tree3.py

.. literalinclude:: code/tree3.py
   :lines: 8-23
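
Since the included file is not reproduced here, this is a sketch of
what it presumably contains, consistent with the description above
(the actual file may differ in details)::

    import random
    import Orange

    data = Orange.data.Table("iris")

    defStop = Orange.classification.tree.StopCriteria_common()

    # stop when the default criterion fires, or randomly in about 20%
    # of the remaining cases
    f = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or random.random() < 0.2
    l = Orange.classification.tree.TreeLearner(data, stop=f)

    # a purely random stop; the lambda still takes the three arguments
    # that any stop function receives
    f = lambda x, y, z: random.random() < 0.2
    l = Orange.classification.tree.TreeLearner(data, stop=f)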

The output is not shown here since the resulting trees are rather
big.

Tree Structure
==============

To have something to work on, we'll take the data from the lenses
data set and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 7-10

How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?

.. _lenses.tab: code/lenses.tab
.. _treestructure.py: code/treestructure.py

.. literalinclude:: code/treestructure.py
   :lines: 12-21
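
The file is not shown here; a sketch of a :samp:`treeSize` function
that matches the explanation below::

    def treeSize(node):
        # a null-node does not count
        if not node:
            return 0
        size = 1
        # an internal node has a branchSelector; add the subtree sizes
        if node.branchSelector:
            for branch in node.branches:
                size += treeSize(branch)
        return size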

If node is None, we have a null-node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (this node) plus the
sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.

    >>> treeSize(treeClassifier.tree)
    10

Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.

Let us now write a script that prints out a tree. The recursive
part of the function will get a node and its level
(part of `treestructure.py`_, uses `lenses.tab`_).

.. literalinclude:: code/treestructure.py
   :lines: 26-41
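
Again, a sketch of what the included code presumably looks like; the
script targets Python 2, like the rest of this documentation::

    def printTree0(node, level):
        if not node:
            # a null-node: no learning examples were classified here
            print " " * level + "<null node>",
            return

        if node.branchSelector:
            # internal node: print the attribute's name and the class
            # distribution, then recurse into each branch, indented
            nodeDesc = node.branchSelector.classVar.name
            nodeCont = node.distribution
            print "\n" + " " * level + "%s (%s)" % (nodeDesc, nodeCont),
            for i in range(len(node.branches)):
                print "\n" + " " * level + ": %s" % node.branchDescriptions[i],
                printTree0(node.branches[i], level + 1)
        else:
            # leaf: print the predicted class and the distribution
            nodeCont = node.distribution
            majorClass = node.nodeClassifier.defaultValue
            print "--> %s (%s)" % (majorClass, nodeCont),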

Don't waste time studying the formatting tricks (\n's etc.); they are
only there for nicer output. What matters is everything but the print
statements. First, we check whether the node is a null-node (a node
to which no learning examples were classified). If so, we just print
out "<null node>" and return.

After handling null nodes, the remaining nodes are either internal
nodes or leaves. For internal nodes, we print a node description
consisting of the attribute's name and the distribution of classes.
:obj:`Node`'s branch description is, for all currently defined splits,
an instance of a class derived from :obj:`orange.Classifier` (in fact,
it is an :obj:`orange.ClassifierFromVarFD`, but an
:obj:`orange.Classifier` would suffice), and its :obj:`classVar` points
to the attribute we seek. So we print its name. We will also assume
that storing class distributions has not been disabled and print them
as well. Then we iterate through the branches; for each we print a
branch description and recursively call :obj:`printTree0` with the
level increased by 1 (to increase the indent).

Finally, if the node is a leaf, we print out the distribution of
learning examples in the node and the class to which the examples in
the node would be classified. We again assume that the
:obj:`nodeClassifier` is the default one - a :obj:`DefaultClassifier`.
A better print function should be aware of possible alternatives.

Now, we just need a simple function to call our printTree0.
We could write something like...

::

    def printTree(x):
        printTree0(x.tree, 0)

... but we won't. Let us learn how to handle arguments of different
types. Let's write a function that will accept either a
:obj:`TreeClassifier` or a :obj:`Node`.
Part of `treestructure.py`_, uses `lenses.tab`_.

.. literalinclude:: code/treestructure.py
   :lines: 43-49
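
A sketch of such a dispatch function, consistent with the description
below (the included file may differ in details)::

    def printTree(x):
        if isinstance(x, Orange.classification.tree.TreeClassifier):
            printTree0(x.tree, 0)
        elif isinstance(x, Orange.classification.tree.Node):
            printTree0(x, 0)
        else:
            raise TypeError, "invalid parameter"
        print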

It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's a
:obj:`Node`, we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. The output::

    >>> printTree(treeClassifier)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no
          age (<1.000, 5.000, 0.000>)
          : young --> soft (<0.000, 2.000, 0.000>)
          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
          : presbyopic --> none (<1.000, 1.000, 0.000>)
       : yes
          prescription (<2.000, 0.000, 4.000>)
          : myope --> hard (<0.000, 0.000, 3.000>)
          : hypermetrope --> none (<2.000, 0.000, 1.000>)

For a final exercise, let us write a simple pruning function. It will
be written entirely in Python, unrelated to any :obj:`Pruner`. It
will limit the maximal tree depth (the number of internal nodes on any
path down the tree) given as an argument.
For example, to get a two-level tree, we would
call cutTree(root, 2). The function will be recursive, with the second
argument (level) decreasing at each call; when it reaches zero, the
current node will be made a leaf (part of `treestructure.py`_, uses
`lenses.tab`_):

.. literalinclude:: code/treestructure.py
   :lines: 54-62
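
A sketch of the pruning function, matching the description below::

    def cutTree(node, level):
        # nothing to prune at null-nodes or leaves
        if node and node.branchSelector:
            if level:
                for branch in node.branches:
                    cutTree(branch, level - 1)
            else:
                # turn the node into a leaf: clear the selector,
                # branches and branch descriptions
                node.branchSelector = None
                node.branches = None
                node.branchDescriptions = None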

There's nothing to prune at null-nodes or leaves, so we act only when
:obj:`node` and :obj:`node.branchSelector` are defined. If level is
not zero, we call the function for each branch. Otherwise, we clear
the selector, branches and branch descriptions.

    >>> cutTree(tree.tree, 2)
    >>> printTree(tree)
    tear_rate (<15.000, 5.000, 4.000>)
    : reduced --> none (<12.000, 0.000, 0.000>)
    : normal
       astigmatic (<3.000, 5.000, 4.000>)
       : no --> soft (<1.000, 5.000, 0.000>)
       : yes --> hard (<2.000, 0.000, 4.000>)

Learning
========

You could just call :class:`TreeLearner` and let it fill the empty
slots with the default
components. This section will teach you three things: what the
missing components are (and how to set the same components yourself),
how to use alternative components to get a different tree and,
finally, how to write a skeleton for tree induction in Python.

.. _treelearner.py: code/treelearner.py

Let us construct a :obj:`TreeLearner` to play with (part of
`treelearner.py`_, uses `lenses.tab`_):

.. literalinclude:: code/treelearner.py
   :lines: 7-10
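
The construction is presumably as simple as this (a sketch; the file
itself is not shown)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()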

There are three crucial components in learning: the split and stop
criteria, and the :obj:`exampleSplitter` (there are some others,
which become important during classification; we'll talk about them
later). They are not defined; if you use the learner, the slots are
filled temporarily but later cleared again.

FIXME: the following example is not true anymore.

::

    >>> print learner.split
    None
    >>> learner(data)
    <TreeClassifier instance at 0x01F08760>
    >>> print learner.split
    None

Stopping criteria
-----------------

The stop is trivial. The default is set by

::

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()

Well, this is actually done in C++ and it uses a global component
that is constructed once and for all, but apart from that we did
effectively the same thing.

We can now examine the default stopping parameters.

    >>> print learner.stop.maxMajority, learner.stop.minExamples
    1.0 0.0

Not very restrictive. This keeps splitting the examples until
there's nothing left to split or all the examples are in the same
class. Let us set the minimal subset that we allow to be split to
five examples and see what comes out.

    >>> learner.stop.minExamples = 5.0
    >>> tree = learner(data)
    >>> print tree.dump()
    tear_rate=reduced: none (100.00%)
    tear_rate=normal
    |    astigmatic=no
    |    |    age=pre-presbyopic: soft (100.00%)
    |    |    age=presbyopic: none (50.00%)
    |    |    age=young: soft (100.00%)
    |    astigmatic=yes
    |    |    prescription=hypermetrope: none (66.67%)
    |    |    prescription=myope: hard (100.00%)

OK, that's better. If we want an even smaller tree, we can also limit
the maximal proportion of the majority class.

    >>> learner.stop.maxMajority = 0.5
    >>> tree = learner(data)
    >>> print tree.dump()
    none (62.50%)


=================
Printing the Tree
=================

The included printing functions can
print out practically anything you'd like to
know, from the number of examples, the proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in a parent node. And even more, you can
define your own callback functions to be used for printing.

Before we go on: you can read all about the function and use it to its
full extent, or you can just call it, giving it the tree as the sole
argument, and it will print out the usual textual representation of the
tree. If you're satisfied with that, you can stop here.

The magic is in the format string. It is a string which is printed
out at every leaf or internal node, with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.

**^** at the start indicates that the number should be multiplied by 100.
It's useful for printing proportions as percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, :samp:`%N` denotes the number of examples in the node,
hence :samp:`%6.2N` would mean output to two decimal digits and six places
altogether. If left out, a default format :samp:`5.3` is used, unless you
multiply the numbers by 100, in which case the default is :samp:`.0`
(no decimals; the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node by.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` will give the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the entire
data set, so :samp:`%NbA` will tell you the proportion of examples (out of the
entire training data set) that fell into that node. If division is impossible
because the parent node does not exist or some data is missing, a dot is
printed out instead of the quantity.

**<quantity>** is the only required element. It defines what to print.
For instance, :samp:`%N` would print out the number of examples in the node.
Possible quantities are

:samp:`V`
    The value predicted at that node. You cannot define the precision
    or divisor here.

:samp:`N`
    The number of examples in the node.

:samp:`M`
    The number of examples in the majority class (that is, the class
    predicted by the node).

:samp:`m`
    The proportion of examples in the majority class.

:samp:`A`
    The average class for examples in the node; this is available only for
    regression trees.

:samp:`E`
    Standard error for the class of examples in the node; available for
    regression trees.

:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.

:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this will give the number of examples of Iris-virginica divided by the
    number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`

    For regression trees, you can use operators =, !=, <, <=, >, and >=,
    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
    check the number of examples in a certain interval: :samp:`%C[20, 22]`
    will give you the number of examples between 20 and 22 (inclusive)
    and :samp:`%C(20, 22)` will give the number of such examples excluding the
    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
    If you would like the examples outside the interval, add a :samp:`!`,
    like :samp:`%C!(20, 22]`.

:samp:`c`
    Same as above, except that it computes the proportion of the class
    instead of the number of examples.

:samp:`D`
    Prints out the number of examples in each class. You can use both the
    precision (it is applied to each number in the distribution) and the
    divisor. This quantity cannot be computed for regression trees.

:samp:`d`
    Same as above, except that it shows proportions of examples. This
    again doesn't work with regression trees.

<user defined formats>
    You can add more, if you will. Instructions and examples are given at
    the end of this section.


Examples
========

We shall build a small tree from the iris data set - we shall limit the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

.. _orngTree1.py: code/orngTree1.py

The easiest way to call the function is to pass the tree as the only
argument::

    >>> print tree.dump()
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)

Let's now print out the predicted class at each node, the number
of examples in the majority class and the total number of examples
in the node::

    >>> print tree.dump(leafStr="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

Would you like to know how the number of examples declines as
compared to the entire data set and to the parent node? We find
it with this::

    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)

Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.

And now for the output: all examples of setosa go into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.

If you find this guesswork annoying - so do I. Let us print out the
number of versicolors in each node, together with the proportion of
versicolors among the examples in this particular node and among all
versicolors. So,

::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

gives the following output::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, we may want to print out the distributions, using a simple
string :samp:`%D`::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

We haven't tried printing out any information for internal nodes.
To start with the most trivial case, we shall print the prediction
at each node.

::

    tree.dump(leafStr="%V", nodeStr=".")

says that the nodeStr should be the same as leafStr (not very useful
here, since leafStr is trivial anyway).

::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

Note that the output is somewhat different now: another node, called
*root*, has appeared and the tree looks one level deeper. This is
needed to print out the data for that node too.

Now for something more complicated: let us observe how the number
of virginicas decreases down the tree::

    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')

Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is
printed out literally. In parentheses we print the same thing except that
we divide by the examples in the parent node. Note the use of single
quotes, so we can use the double quotes inside the string, when we
specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

See what's in the parentheses in the root node? If :meth:`TreeClassifier.dump`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.

For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think why). In the leaves we shall also add the predicted
class. So now we'll have to call the function like this.

::

    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]

To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set. Called with the tree as the
only argument, :meth:`TreeClassifier.dump` prints the tree like this::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9

Let us add the standard error in both internal nodes and leaves, and the
90% confidence intervals in the leaves::

    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
:samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
so the printed number has the same number of decimals
as in the data file.

Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of examples within a certain range. For instance,
let us check the number of examples with values below 22, and compare
this number with values in the parent nodes::

    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that among the examples with tax
above 534, the number of examples with the class below 22 is 30 times
higher than the number of such examples in the parent node.

For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percentages.

::

    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")

OK, let's observe the format string one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. :samp:`%cbP![20, 22]` divides this by the same
statistic computed on the parent. Add a :samp:`^` and you have the
percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)


Defining Your Own Printout functions
====================================

:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to print out
some other information in the leaves or nodes. If provided,
:obj:`userFormats` should contain a list of tuples with a regular expression
and a callback function to be called when that expression is found in the
format string. Expressions from :obj:`userFormats` are checked before
the built-in expressions discussed above, so you can override the built-ins
if you want to.

The regular expression should describe a string like those we used above,
for instance the string :samp:`%.2DbP`. When a leaf or internal node
is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The callback function will get five arguments: the format string
(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

The function can use several utility functions provided in the module.

.. autofunction:: insertStr

.. autofunction:: insertDot

.. autofunction:: insertNum

.. autofunction:: byWhom


There are also a few pieces of regular expression that you may want to reuse.
The two you are likely to use are:

.. autodata:: fs

.. autodata:: by

For a trivial example, :samp:`%V` is implemented like this. There is the
following tuple in the list of built-in formats::

    (re.compile("%V"), replaceV)

:obj:`replaceV` is a function defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to *insertStr* to do the replacement.

A more complex regular expression is the one for the proportion of
the majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
two partial expressions defined above.

Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):

.. _orngTree2.py: code/orngTree2.py

.. literalinclude:: code/orngTree2.py
   :lines: 7-31
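
The file isn't reproduced here; a sketch of the code that the
explanation below refers to (details may differ from the actual
`orngTree2.py`_; in particular, it assumes that the :data:`by` pattern
defines a regular-expression group named "by")::

    import re
    import Orange
    from Orange.classification.tree import insertNum, insertDot, byWhom, fs, by

    def getMargin(dist):
        # margin: difference between the two largest class proportions
        if dist.abs < 1e-30:
            return 0
        l = list(dist)
        l.sort()
        return (l[-1] - l[-2]) / dist.abs

    def replaceB(strg, mo, node, parent, tree):
        margin = getMargin(node.distribution)
        if margin and mo.group("by"):
            # divide by the margin of the node indicated by bP/bA
            whom = byWhom(mo.group("by"), parent, tree)
            if whom and whom.distribution:
                divMargin = getMargin(whom.distribution)
                if divMargin > 1e-30:
                    margin /= divMargin
                else:
                    return insertDot(strg, mo)
            else:
                return insertDot(strg, mo)
        return insertNum(strg, mo, margin)

    myFormat = [(re.compile("%" + fs + "B" + by), replaceB)]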

We first define getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If the quantity is to be divided by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node whose margin
this node's margin is to be divided by. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot; otherwise we call :func:`insertNum`, which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.


We can now print out the iris tree:

.. literalinclude:: code/orngTree2.py
    :lines: 33

And we get::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)


Plotting the Tree using Dot
===========================

Suppose you saved the tree in a file "tree5.dot". You can then
render it as a GIF by executing the following command line

::

    dot -Tgif tree5.dot -otree5.gif

GraphViz's dot has quite a few other output formats; check
its documentation to learn which.
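
To produce the .dot file in the first place, the old orngTree module
offered a printDot function; assuming an equivalent helper is still
available here, the call would look roughly like this (a sketch, not a
verified API)::

    import Orange

    data = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(data, maxDepth=3)
    # hypothetical helper mirroring orngTree.printDot
    Orange.classification.tree.printDot(tree, fileName="tree5.dot",
        internalNodeShape="ellipse", leafShape="box")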



Classification trees are represented as a tree-like hierarchy of
:obj:`Node` classes.

.. class:: Node

    Node stores information about the learning examples belonging
    to the node, a branch selector, a list of branches (if the node is
    not a leaf) with their descriptions and strengths, and a classifier.

    .. attribute:: distribution

        Stores a distribution for learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`TreeLearnerBase`'s storeDistributions flag to false.

    .. attribute:: contingency

        Stores complete contingency matrices for the learning examples
        belonging to the node. Storing contingencies can be enabled by
        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies get computed and stored to
        :obj:`Node`, but are removed shortly afterwards.
        The details are given in the
        description of the :obj:`TreeLearnerBase` object.

    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the
        corresponding ID of the weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.

    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if
        :obj:`Node`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.

    If the node is a leaf, the remaining fields are None.
    If it's an internal node, there are several additional fields.

    .. attribute:: branches

        Stores a list of subtrees, given as :obj:`Node`.
        An element can be None; in this case the node is empty.

    .. attribute:: branchDescriptions

        A list with string descriptions for branches, constructed by
        :obj:`SplitConstructor`. It can contain different kinds
        of descriptions, but basically, expect things like 'red' or '>12.3'.

    .. attribute:: branchSizes

        Gives a (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.

    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified to any branch, the selector
        can return an :obj:`Orange.data.Value` containing a special value
        (sVal) which should be a discrete distribution
        (DiscDistribution). This should represent the
        :obj:`branchSelector`'s opinion of how to divide the
        example between the branches. Whether the proposition will be
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).

    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none of them if it is a leaf.

    .. method:: treeSize()

        Return the number of nodes in the subtrees (including the
        node, excluding null-nodes).
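
For instance, with the tree built from the lenses data earlier, these
attributes can be inspected directly (a sketch; the output is omitted)::

    root = treeClassifier.tree
    print root.branchSelector.classVar.name   # attribute used at the root
    print root.branchDescriptions             # e.g. 'reduced', 'normal'
    print root.distribution                   # class distribution at the root
    print root.treeSize()                     # same as our treeSize exercise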

==============
Classification
==============

.. class:: _TreeClassifier

    Classifies examples according to a tree stored in :obj:`tree`.

    .. attribute:: tree

        The root of the tree, represented as a :class:`Node`.

    Classification would be straightforward if there were no unknown
    values or, in general, examples that cannot be placed into a
    single branch. The response in such cases is determined by a
    component :obj:`descender`.

    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is the most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.

    In general there are three possible outcomes of a descent.

    #. Descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them by
       selecting a single branch and continued the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node.  Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Node` wants to split the example (i.e., to decide the
       class by voting). In this case, the descender returns the
       reached :obj:`Node` and the vote-weights for the branches.
       The weights can correspond to the distribution returned by
       :obj:`branchSelector`, to the number of learning examples that
       were assigned to each branch, or to something else.

    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.

    When voting, subtrees do not predict the class but probabilities
    of classes. The predictions are multiplied by weights, summed and
    the most probable class is returned.

    .. method:: vote()

    .. method:: classDistribution()


The rest of this section is only for those interested in the C++ code.
======================================================================

If you'd like to understand how the classification works in C++,
start reading at :obj:`TTreeClassifier::vote`. It gets a
:obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
vote weights. For each node, it calls
:obj:`TTreeClassifier::classDistribution` and then multiplies
and sums the distribution. :obj:`vote` returns a normalized
distribution of predictions.

A new overload of :obj:`TTreeClassifier::classDistribution` gets
an additional parameter, a :obj:`Node`. This is done
for the sake of recursion. The normal version of
:obj:`classDistribution` simply calls the overloaded one with the
tree root as an additional parameter. :obj:`classDistribution`
uses :obj:`descender`. If the descender reaches a leaf, it calls
:obj:`nodeClassifier`, otherwise it calls :obj:`vote`.

Thus, the :obj:`TreeClassifier`'s :obj:`vote` and
:obj:`classDistribution` are written in a form of double
recursion. The recursive calls do not happen at each node of the
tree but only at nodes where a vote is needed (that is, at nodes
where the descender halts).

For predicting a class, :obj:`operator()` calls the
descender. If it reaches a leaf, the class is predicted by the
leaf's :obj:`nodeClassifier`. Otherwise, it calls
:obj:`vote`. From now on, :obj:`vote` and
:obj:`classDistribution` interweave down the tree and return
a distribution of predictions. :obj:`operator()` then simply
chooses the most probable class.

========
Learning
========

The main learning object is :obj:`TreeLearnerBase`. It is basically
a skeleton into which the user must plug the components for particular
functions. For easier use, defaults are provided.

Components that govern the structure of the tree are :obj:`split`
(of type :obj:`SplitConstructor`), :obj:`stop` (of
type :obj:`StopCriteria`) and :obj:`exampleSplitter`
(of type :obj:`ExampleSplitter`).


.. class:: TreeLearnerBase

    TreeLearnerBase has a number of components.

    .. attribute:: split

        Object of type :obj:`SplitConstructor`. Default value,
        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous attributes
        are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        and five examples in a leaf for continuous attributes.

    .. attribute:: stop

        Object of type :obj:`StopCriteria`. The default stopping
        criterion stops induction when all examples in a node belong
        to the same class.

    .. attribute:: splitter

        Object of type :obj:`ExampleSplitter`. The default splitter
        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits
        the learning examples according to distributions given by the
        selector.

    .. attribute:: contingencyComputer

        By default, this slot is left empty and ordinary contingency
        matrices are computed for examples at each node. If need arises,
        one can change the way the matrices are computed. This can be
        used to change the way that unknown values are treated when
        assessing qualities of attributes. As mentioned earlier,
        the computed matrices can be used by the split constructor and by
        the stopping criteria. On the other hand, they can be (and are)
        ignored by some splitting constructors.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_UnknownMergeAsSelector` which votes using
        the :obj:`branchSelector`'s distribution for vote weights.

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent any infinite tree
        induction due to misconfigured stop criteria. If you are sure you
        need larger trees, increase it. If you, on the other hand, want
        to lower this hard limit, you can do so as well.

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decide whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same distribution
        that is stored in :obj:`contingency.classes`.

    The :obj:`TreeLearnerBase` first sets the defaults for missing
    components. Although stored in the actual :obj:`TreeLearnerBase`'s
    fields, they are removed when the induction is finished.

    Then it ensures that examples are stored in a table. This is needed
    because the algorithm juggles with pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to ensure that pointers
    remain pointing to examples even if the user later changes the
    example table. If they are in the table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.

    Apriori class probabilities are computed. At this point we check
    the sum of example weights; if it's zero, there are no examples and
    we cannot proceed. A list of candidate attributes is set; in the
    beginning, all attributes are candidates for the split criterion.

    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments
    are a set of examples, a weight meta-attribute ID (a tricky thing,
    it can be always the same as the original or can change to
    accommodate splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).

    The contingency matrix is computed next. This happens
    even if the flag :obj:`storeContingencies` is false.
    If the :obj:`contingencyComputer` is given we use it,
    otherwise we construct just an ordinary contingency matrix.

    A :obj:`stop` is called to see whether it's worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if
    the :obj:`forceNodeClassifier` flag is set.

    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
    a choice of :obj:`contingencyComputer` will, in many cases,
    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
    return no classifier; if so and if the classifier would be
    needed for classification, the :obj:`TreeClassifier`'s function
    returns DK or an empty distribution. If you're writing your own
    tree classifier - pay attention.

    If the induction is to continue, a :obj:`split` component is called.
    If it fails to return a branch selector, induction stops and the
    :obj:`Node` is returned.

    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide
    the examples as described above.

    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they will.

    The :obj:`TreeLearnerBase` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    a corresponding weight is used for each branch. Besides, the
    attribute spent by the splitter (if any) is removed from the
    list of candidates for the subtree.

    A subset of examples is stored in its corresponding tree node,
    if so requested. If not, the new weight attributes are removed (if
    any were created).
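
To illustrate the storage flags described above, one might run something
like the following (a sketch; it assumes the :obj:`TreeLearner` wrapper
forwards these flags under the same names)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearner()
    learner.storeExamples = True
    learner.storeContingencies = True
    tree = learner(data)

    root = tree.tree
    print len(root.examples)        # the root stores the "master table"
    print root.contingency.classes  # the object root.distribution points to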

Pruning
=======

Tree pruners derived from :obj:`Pruner` can be given either a
:obj:`Node` (presumably, but not necessarily, a root) or a
:obj:`TreeClassifier`. The result is a new, pruned :obj:`Node`
or a new :obj:`TreeClassifier` with a pruned tree. The original
tree remains intact.
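
For instance (a sketch; :obj:`Pruner_SameMajority` is assumed to be one
of the pruners this module provides, as in older Orange versions)::

    import Orange

    data = Orange.data.Table("lenses")
    tree = Orange.classification.tree.TreeLearner(data)

    pruner = Orange.classification.tree.Pruner_SameMajority()
    prunedRoot = pruner(tree.tree)   # the original tree remains intact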

Note however that pruners construct only a shallow copy of a tree.
The :obj:`Node` objects of the pruned tree contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. Thus, you may modify a pruned tree structure
(manually cut it, add new nodes, replace components) but modifying,
for instance, some node's :obj:`nodeClassifier` (a
:obj:`nodeClassifier` itself, not a reference to it!) would modify
the node's :obj:`nodeClassifier` in the corresponding node of
the original tree.

Talking about node classifiers - pruners cannot construct a
:obj:`nodeClassifier` nor merge the :obj:`nodeClassifier` of the pruned
subtrees into classifiers for new leaves. Thus, if you want to build
a prunable tree, internal nodes must have their :obj:`nodeClassifier`
defined. Fortunately, all you need to do is nothing; if you leave
the :obj:`TreeLearnerBase`'s flags as they are by default, the
node classifiers are created.

=======
Classes
=======

Several classes described above are already functional and can
(and mostly will) be used as they are. Those classes are :obj:`Node`,
:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes
the other classes.

Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner`
and :obj:`Classifier` are among the Orange classes that can be subtyped
in Python and have the call operator overloaded in such a way that it
is called back from C++ code. You can thus program your own components
for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed
information on how this is done and what can go wrong, is given in a
separate page, dedicated to callbacks to Python XXXXXXXXXX.

SplitConstructors
=====================

Split construction is almost as exciting as waiting for a delayed flight.
Boring, that is. Split constructors are programmed as spaghetti code
that juggles with contingency matrices, with separate cases for discrete
and continuous classes... Most split constructors work either for
discrete or for continuous attributes. The suggested practice is
to use a :obj:`SplitConstructor_Combined` that can handle
both by simply delegating attributes to specialized split constructors.

Note: split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)

The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).

.. class:: SplitConstructor

    Finds a suitable criterion for dividing the learning (and later testing)
    examples coming to the node. The data it gets is a set of examples
    (and, optionally, an ID of the weight meta-attribute), a domain
    contingency computed from examples, apriori class probabilities,
    a list of candidate attributes it should consider and a node
    classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).

    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold; one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain contingency
    does not suffice; for example, when ReliefF is used as a measure
    of quality of attributes. In this case, there's no other way but
    to use the examples and ignore the precomputed contingencies.

    The split constructor should consider only the attributes in the
    candidate list (the list is a vector of booleans, one for each
    attribute).

    :obj:`SplitConstructor` returns most of the data we talked
    about when describing the :obj:`Node`. It returns a classifier
    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
    branch descriptions and a list with the number of examples that
    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`TreeLearnerBase` will find the number itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.

    In addition, it returns a quality of the split; a number without
    any fixed meaning except that higher numbers mean better splits.

    If the constructed splitting criterion uses an attribute in such
    a way that the attribute is 'completely spent' and should not be
    considered as a split criterion in any of the subtrees (the
    typical case of this are discrete attributes that are used
    as-they-are, that is, without any binarization or subsetting),
    then it should report the index of this attribute. Some splits
    do not spend any attribute; this is indicated by returning a
    negative index.

    A :obj:`SplitConstructor` can veto the further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.

    .. attribute:: minSubset

        Sets the minimal number of examples in non-null leaves. As
        always in Orange (where not specified otherwise), "number of
        examples" refers to the weighted number of examples.

    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])

        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`selector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.

        :param examples: Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
            list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori_distribution: Should be of type
            :obj:`orange.Distribution`.
        :param candidates: Should be a Python list of objects which are
            interpreted as booleans.
1287
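    For illustration, a split constructor can also be called directly.
    A minimal sketch, assuming the call signature documented above
    (it uses `iris.tab`_ and a threshold split)::

        import Orange
        from Orange.classification import tree

        data = Orange.data.Table("iris")
        split = tree.SplitConstructor_Threshold()
        split.measure = Orange.feature.scoring.GainRatio()

        apriori = Orange.core.Distribution(data.domain.classVar, data)
        candidates = [True] * len(data.domain.attributes)

        selector, descriptions, sizes, quality, spent = \
            split(data, 0, apriori, candidates)
        print descriptions, quality, spent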


.. class:: SplitConstructor_Measure

    Bases: :class:`SplitConstructor`

    An abstract base class for split constructors that employ
    an :obj:`Orange.feature.scoring.Measure` to assess the quality of a
    split. At present,
    all split constructors except for :obj:`SplitConstructor_Combined`
    are derived from this class.

    .. attribute:: measure

        A component of type :obj:`Orange.feature.scoring.Measure` used for
        evaluation of a split. Note that you must select a subclass of
        :obj:`Orange.feature.scoring.Measure` capable of handling your
        class type
        - you cannot use :obj:`Orange.feature.scoring.GainRatio`
        for building regression trees or :obj:`Orange.feature.scoring.MSE`
        for classification trees.

    .. attribute:: worstAcceptable

        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
        :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Attribute

    Bases: :class:`SplitConstructor_Measure`

    Attempts to use a discrete attribute as a split; each value of the
    attribute corresponds to a branch in the tree. Attributes are
    evaluated with the :obj:`measure` and the one with the
    highest score is used for a split. If there is more than one
    attribute with the highest score, one of them is selected at random.

    The constructed :obj:`branchSelector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. If the attribute is
    :obj:`Orange.data.variable.Discrete`,
    :obj:`branchDescription`'s are the attribute's values. The
    attribute is marked as spent, so that it cannot reappear in the
    node's subtrees.

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Measure`

    Works on discrete attributes. For each attribute, it determines
    which binarization of the attribute gives the split with the
    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
    it returns one of those with the highest score.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. This time, however, its :obj:`transformer`
    contains an instance of :obj:`MapIntValue` that maps the values
    of the attribute into a binary attribute. Branch descriptions are
    of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding to
    more than one value of the attribute. Branches that correspond to
    a single value of the attribute are described with this value. If
    the attribute was originally binary, it is spent and cannot be
    used in the node's subtrees. Otherwise, it can reappear in the
    subtrees.


.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Measure`

    This is currently the only constructor for splits with continuous
    attributes. It divides the range of attribute values with a threshold
    that maximizes the split's quality. As always, if there is more than
    one split with the highest score, a random threshold is selected.
    The attribute that yields the best binary split is returned.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached
    :obj:`transformer`. This time, :obj:`transformer` is of type
    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
    "<threshold" and ">=threshold". The attribute is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Measure`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    This constructor delegates the task of finding the optimal split
    to separate split constructors for discrete and for continuous
    attributes. Each split constructor is called, given only attributes
    of the appropriate type as candidates. Both construct a candidate for
    a split; the better of them is selected.

    (Note that there is a problem when more candidates have the same
    score. Let there be nine discrete attributes with the highest
    score; the split constructor for discrete attributes will select
    one of them. Now, let us suppose that there is a single continuous
    attribute with the same score. :obj:`SplitConstructor_Combined`
    would randomly select between the proposed discrete attribute and
    the continuous attribute, not aware of the fact that the discrete
    attribute has already competed with eight other discrete attributes.
    So, the probability for selecting (each) discrete attribute would be
    1/18 instead of 1/10. Although not really correct, we doubt that this
    would affect the tree's performance; many other machine learning
    systems simply choose the first attribute with the highest score
    anyway.)

    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
    the attribute is spent are decided by the winning split constructor.

    .. attribute:: discreteSplitConstructor

        Split constructor for discrete attributes; can be, for instance,
        :obj:`SplitConstructor_Attribute` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuousSplitConstructor

        Split constructor for continuous attributes; at the moment, it
        can be either :obj:`SplitConstructor_Threshold` or a
        split constructor you programmed in Python.
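A typical combined constructor can be put together as follows (a sketch
of what :class:`TreeLearner` assembles internally; see its source below)::

    import Orange
    from Orange.classification import tree

    split = tree.SplitConstructor_Combined()
    split.continuousSplitConstructor = tree.SplitConstructor_Threshold()
    split.discreteSplitConstructor = tree.SplitConstructor_ExhaustiveBinary()
    measure = Orange.feature.scoring.GainRatio()
    split.continuousSplitConstructor.measure = measure
    split.discreteSplitConstructor.measure = measure
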
StopCriteria and StopCriteria_common
====================================

:obj:`StopCriteria` determines when to stop the induction of subtrees.

.. class:: StopCriteria

    Given a set of examples, a weight ID and contingency matrices, decides
    whether to continue the induction or not. The basic criterion checks
    whether there are any examples and whether they belong to at least
    two different classes (if the class is discrete). Derived components
    check things like the number of examples and the proportion of
    majority classes.

    As opposed to :obj:`SplitConstructor` and similar basic classes,
    :obj:`StopCriteria` is not an abstract but a fully functional
    class that provides the basic stopping criteria. That is, the tree
    induction stops when there is at most one example left; in this case,
    it is not the weighted but the actual number of examples that counts.
    Besides that, the induction stops when all examples are in the same
    class (for discrete problems) or have the same value of the outcome
    (for regression problems).

    .. method:: __call__(examples[, weightID, domain contingencies])

        Decides whether to stop (true) or continue (false) the induction.
        If contingencies are given, they are used for checking whether
        the examples are in the same class (but not for counting the
        examples); if they are not given, :obj:`StopCriteria` works
        without them. Derived classes should use the contingencies
        whenever possible and compute them themselves only when they
        really need them.


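Since the stopping criterion is invoked as a function of three arguments,
it can be replaced by a Python function with the same signature. A
minimal sketch (relying on :class:`TreeLearner`'s :obj:`stop` parameter,
documented below)::

    import Orange
    from Orange.classification import tree

    defStop = tree.StopCriteria()

    def myStop(examples, weightID, contingency):
        # stop when the default criterion fires or when
        # fewer than 10 examples reach the node
        return defStop(examples, weightID, contingency) or \
            len(examples) < 10

    data = Orange.data.Table("iris")
    classifier = tree.TreeLearner(data, stop=myStop)
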
.. class:: StopCriteria_common

    :obj:`StopCriteria_common` contains additional criteria for
    pre-pruning: it checks the proportion of majority class and the
    number of weighted examples.

    .. attribute:: maxMajority

        Maximal proportion of majority class. When this is exceeded,
        induction stops.

    .. attribute:: minExamples

        Minimal number of examples for a node to be split further.
        Subsets with fewer than :obj:`minExamples` examples are not
        split any further. The example count is weighted.

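For instance (a minimal sketch mirroring what :class:`TreeLearner`
sets up internally)::

    from Orange.classification import tree

    learner = tree.TreeLearnerBase()
    learner.stop = tree.StopCriteria_common()
    learner.stop.maxMajority = 0.8   # stop when majority class exceeds 80%
    learner.stop.minExamples = 5     # do not split smaller (weighted) subsets
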
.. class:: StopCriteria_Python

    Undocumented.

ExampleSplitter and derived classes
========================================

The :obj:`ExampleSplitter` sorts the learning examples into branches.

.. class:: ExampleSplitter

    Just like the :obj:`Descender` decides the branch for an
    example during classification, the :obj:`ExampleSplitter`
    sorts the learning examples into branches.

    :obj:`ExampleSplitter` is given a :obj:`Node` (from which
    it can use various data, though most splitters only use the
    :obj:`branchSelector`), a set of examples to be divided, and
    the weight ID. The result is a list of subsets of examples
    and, optionally, a list of new weight IDs.

    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most
    :obj:`ExampleSplitter`'s simply call the node's
    :obj:`branchSelector` and assign examples to the corresponding
    branches. When the value is unknown they choose a particular
    branch or simply skip the example.

    Some enhanced splitters can split examples. An example (actually,
    a pointer to it) is copied to more than one subset. To facilitate
    real splitting, weights are needed. Each branch is assigned a
    weight ID (each would usually have its own ID) and all examples
    that are in that branch (either completely or partially) should
    have this meta attribute. If an example hasn't been split, it
    has only one additional attribute - with the weight ID corresponding
    to the subset to which it went. An example that is split among,
    say, three subsets has three new meta attributes, one for each
    subset. The IDs of the weight meta attributes are returned by the
    :obj:`ExampleSplitter` to be used at induction of the
    corresponding subtrees.

    Note that weights are used only when needed. When no splitting
    occurred - because the splitter is not able to do it or because
    there was no need for splitting - no weight IDs are returned.

    :obj:`ExampleSplitter` is thus an abstract base class for objects
    that split sets of examples into subsets. The derived classes differ
    in the treatment of examples which cannot be unambiguously placed
    into a single branch (usually due to an unknown value of the crucial
    attribute).

    .. method:: __call__(node, examples[, weightID])

        Use the information in :obj:`node` (particularly the
        :obj:`branchSelector`) to split the given set of examples into
        subsets. Return a tuple with a list of example generators and a
        list of weights. The list of weights is either an ordinary python
        list of integers or None when no splitting of examples occurs and
        thus no weights are needed.


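A splitter is a replaceable component of the tree learner. A sketch
(assuming :obj:`TreeLearnerBase` exposes an :obj:`exampleSplitter`
component; note that this module imports the splitters under shortened
names, e.g. :obj:`Splitter_UnknownsToAll`)::

    from Orange.classification import tree

    learner = tree.TreeLearnerBase()
    # route examples with unknown attribute values into all branches
    learner.exampleSplitter = tree.Splitter_UnknownsToAll()
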
.. class:: ExampleSplitter_IgnoreUnknowns

    Bases: :class:`ExampleSplitter`

    Simply ignores the examples for which no single branch can be determined.

.. class:: ExampleSplitter_UnknownsToCommon

    Bases: :class:`ExampleSplitter`

    Places all such examples into the branch with the highest number of
    examples. If there is more than one such branch, one is selected at
    random and then used for all examples.

.. class:: ExampleSplitter_UnknownsToAll

    Bases: :class:`ExampleSplitter`

    Places examples with an unknown value of the attribute into all
    branches.

.. class:: ExampleSplitter_UnknownsToRandom

    Bases: :class:`ExampleSplitter`

    Selects a random branch for such examples.

.. class:: ExampleSplitter_UnknownsToBranch

    Bases: :class:`ExampleSplitter`

    Constructs an additional branch to contain all such examples.
    The branch's description is "unknown".

.. class:: ExampleSplitter_UnknownsAsBranchSizes

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the proportions of examples in each branch.

.. class:: ExampleSplitter_UnknownsAsSelector

    Bases: :class:`ExampleSplitter`

    Splits examples with an unknown value of the attribute according to
    the distribution proposed by the selector (which is in most cases
    the same as the proportions of examples in branches).

Descender and derived classes
=================================

This is a classifier's counterpart of the :obj:`ExampleSplitter`. It
decides the fate of examples that need to be classified and cannot
be unambiguously put into a single branch. The detailed behaviour of
this class is given in the section on classification with trees.

.. class:: Descender

    An abstract base object for tree descenders.

    .. method:: __call__(node, example)

        Descends down the tree until it reaches a leaf or a node in
        which a vote of subtrees is required. In both cases, a tuple
        of two elements is returned; in the former case, the tuple
        contains the reached node and None, in the latter it
        contains a node and weights of votes for subtrees (a list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the
        returned vote weights.

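As with splitters, the descender is a replaceable component. A sketch
(assuming :obj:`TreeLearnerBase` exposes a :obj:`descender` attribute;
note that this module imports the descenders under slightly different
names, e.g. :obj:`Descender_UnknownMergeAsBranchSizes`)::

    from Orange.classification import tree

    learner = tree.TreeLearnerBase()
    # let subtrees vote for examples that cannot be sent down one branch
    learner.descender = tree.Descender_UnknownMergeAsBranchSizes()
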
.. class:: Descender_UnknownsToNode

    Bases: :obj:`Descender`

    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`NodeClassifier`
    will be used to make a decision. It is your responsibility to see
    that even the internal nodes have their :obj:`NodeClassifier`
    (i.e., do not disable the construction of node classifiers, and do
    not remove them manually after the induction).

.. class:: Descender_UnknownsToBranch

    Bases: :obj:`Descender`

    Classifies examples with an unknown value to a special branch. This
    makes sense only if the tree itself was constructed with
    :obj:`ExampleSplitter_UnknownsToBranch`.

.. class:: Descender_UnknownsToCommonBranch

    Bases: :obj:`Descender`

    Classifies examples with unknown values to the branch with the
    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.

.. class:: Descender_UnknownsToCommonSelector

    Bases: :obj:`Descender`

    Classifies examples with unknown values to the branch which received
    the highest recommendation by the selector.

.. class:: Descender_MergeAsBranchSizes

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the sizes of the branches.

.. class:: Descender_MergeAsSelector

    Bases: :obj:`Descender`

    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.

Pruner and derived classes
==============================

.. index::
    pair: classification trees; pruning

Classes derived from :obj:`Pruner` prune the trees as described in
the section on pruning - make sure you read it to understand what
the pruners will do to your trees.

.. class:: Pruner

    This is an abstract base class which defines nothing useful, only
    a pure virtual call operator.

    .. method:: __call__(tree)

        Prunes a tree. The argument can be either a tree classifier or
        a tree node; the result is of the same type as the argument.

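For illustration, the pruners can be applied to an induced tree directly,
just as :class:`TreeLearner` does when :obj:`sameMajorityPruning` or
:obj:`mForPruning` is set (a sketch, using `iris.tab`_)::

    import Orange
    from Orange.classification import tree

    data = Orange.data.Table("iris")
    base = tree.TreeLearnerBase()(data)     # a bare tree classifier
    base = tree.Pruner_SameMajority(base)
    base = tree.Pruner_m(base, m=2.0)
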
.. class:: Pruner_SameMajority

    Bases: :class:`Pruner`

    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
    with more than one leaf) in which all the leaves have the same majority
    class. (This is allowed because those leaves can still have different
    distributions of classes and thus predict different probabilities.)
    However, this can be undesired when we're only interested in the
    class prediction or a simple tree interpretation. The
    :obj:`Pruner_SameMajority` prunes the tree so that there is no
    subtree in which all the nodes would have the same majority class.

    This pruner will only prune the nodes in which the node classifier
    is of class :obj:`orange.DefaultClassifier` (or a derived class).

    Note that the leaves with more than one majority class require some
    special handling. The pruning goes backwards, from leaves to the root.
    When siblings are compared, the algorithm checks whether they
    have (at least one) common majority class. If so, they can be pruned.

.. class:: Pruner_m

    Bases: :class:`Pruner`

    Prunes a tree by comparing m-estimates of static and dynamic
    error as defined in (Bratko, 2002).

    .. attribute:: m

        Parameter m for m-estimation.

Undocumented
============

.. class:: NodeList

    Undocumented.

.. class:: C45NodeList

    Undocumented.

===========================
C4.5 Classifier and Learner
===========================

As C4.5 is a standard benchmark in machine learning,
it is incorporated in Orange, although Orange has its own
implementation of decision trees.

The implementation uses the original Quinlan's code for learning, so the
tree you get is exactly like the one that would be built by the standalone
C4.5. Upon return, however, the original tree is copied to Orange
components that contain exactly the same information plus what is needed
to make them visible from Python. To be sure that the algorithm behaves
just as the original, we use a dedicated class :class:`C45Node`
instead of reusing the components used by Orange's tree inducer
(i.e., :class:`Node`). This, however, could be done and probably
will be done in the future; we shall still retain :class:`C45Node`
but offer transformation to :class:`Node` so that routines
that work on Orange trees will also be usable for C45 trees.

:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most of Orange
learning algorithms, C4.5 does not accept weighted examples.

Building the C4.5 plug-in
=========================

We haven't been able to obtain the legal rights to distribute
C4.5 and therefore couldn't statically link it into Orange. Instead,
it's incorporated as a plug-in which you'll need to build yourself.
The procedure is trivial, but you'll need a C compiler. On Windows,
the scripts we provide work with MS Visual C and the files CL.EXE
and LINK.EXE must be on the PATH. On Linux you're equipped with
gcc. Mac OS X comes without gcc, but you can download it for
free from Apple.

Orange must be installed prior to building C4.5. (This is because
the build script will copy the created file next to Orange,
which it obviously can't if Orange isn't there yet.)

#. Download the
   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
   from the `RuleQuest site <http://www.rulequest.com/>`_ and extract
   them into some temporary directory. The files will be modified in the
   further process, so don't use your copy of Quinlan's sources that you
   need for another purpose.
#. Download
   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
   and unzip its contents into the directory R8/Src of the Quinlan's
   stuff (it's the directory that contains, for instance, the file
   average.c).
#. Run buildC45.py, which will build the plug-in and put it next to
   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
   If this fails, something went wrong; see the diagnostic messages from
   buildC45.py and read the paragraph below.
#. Finally, you can remove the Quinlan's stuff, along with everything
   created by buildC45.py.

If the process fails, here's what buildC45.py really does: it creates
.h files that wrap Quinlan's .i files and ensure that they are not
included twice. It modifies C4.5 sources to include .h's instead of
.i's. This step can hardly fail. Then follows the platform dependent
step which compiles ensemble.c (which includes all the Quinlan's .c
files it needs) into c45.dll or c45.so and puts it next to Orange.
If this fails, but you do have a C compiler and linker, and you know
how to use them, you can compile the ensemble.c into a dynamic
library yourself. See the compile and link steps in buildC45.py,
if it helps. Anyway, after doing this check that the built C4.5
gives the same results as the original.

.. class:: C45Learner

    :class:`C45Learner`'s attributes have double names - those that
    you know from C4.5 command lines and the corresponding names of C4.5's
    internal variables. All defaults are set as in C4.5; if you change
    nothing, you are running C4.5.

    .. attribute:: gainRatio (g)

        Determines whether to use information gain (false, default)
        or gain ratio for selection of attributes (true).

    .. attribute:: batch (b)

        Turn on batch mode (no windows, no iterations); this option is
        not documented in C4.5 manuals. It conflicts with "window",
        "increment" and "trials".

    .. attribute:: subset (s)

        Enables subsetting (default: false, no subsetting).

    .. attribute:: probThresh (p)

        Probabilistic threshold for continuous attributes (default: false).

    .. attribute:: minObjs (m)

        Minimal number of objects (examples) in leaves (default: 2).

    .. attribute:: window (w)

        Initial window size (default: maximum of 20% and twice the
        square root of the number of data objects).

    .. attribute:: increment (i)

        The maximum number of objects that can be added to the window
        at each iteration (default: 20% of the initial window size).

    .. attribute:: cf (c)

        Pruning confidence level (default: 25%).

    .. attribute:: trials (t)

        Set the number of trials in iterative (i.e. non-batch) mode
        (default: 10).

    .. attribute:: prune

        Return a pruned tree (not an original C4.5 option) (default: true).


:class:`C45Learner` also offers another way of setting
the arguments: it provides the method :obj:`C45Learner.commandline`,
which is given a string and parses it the same way as C4.5 would
parse its command line.

.. class:: C45Classifier

    A faithful reimplementation of Quinlan's function from C4.5. The only
    difference (and the only reason it's been rewritten) is that it uses
    a tree composed of :class:`C45Node` instead of C4.5's
    original tree structure.

    .. attribute:: tree

        C4.5 tree stored as a tree of :obj:`C45Node`.


.. class:: C45Node

    This class is a reimplementation of the corresponding *struct* from
    Quinlan's C4.5 code.

    .. attribute:: nodeType

        Type of the node: :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split examples based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute and "subsets" use discrete attributes but with subsetting
        so that several values can go into the same branch.

    .. attribute:: leaf

        Value returned by that leaf. The field is defined for internal
        nodes as well.

    .. attribute:: items

        Number of (learning) examples in the node.

    .. attribute:: classDist

        Class distribution for the node (of type
        :obj:`orange.DiscDistribution`).

    .. attribute:: tested

        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if
        the node is of type :obj:`Cut`, :obj:`tested` is a continuous
        attribute.

    .. attribute:: cut

        A threshold for continuous attributes, if the node is of type
        :obj:`Cut`. Undefined otherwise.

    .. attribute:: mapping

        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the branch index for an example whose value of :obj:`tested`
        is *i*. Here, *i* denotes an index of value, not a
        :class:`Orange.data.Value`.

    .. attribute:: branch

        A list of branches stemming from this node.

Examples
========

.. _tree_c45.py: code/tree_c45.py
.. _iris.tab: code/iris.tab

The simplest way to use :class:`C45Learner` is to call it. This
script constructs the same learner as you would get by calling
the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):

.. literalinclude:: code/tree_c45.py
   :lines: 7-14

Arguments can be set by the usual mechanism (the two lines below do the
same, except that one uses command-line symbols and the other internal
variable names)

::

    tree = Orange.classification.tree.C45Learner(data, m=100)
    tree = Orange.classification.tree.C45Learner(data, minObjs=100)

The way that might be preferred by veteran C4.5 users is through the
method :obj:`C45Learner.commandline`.

::

    lrn = Orange.classification.tree.C45Learner()
    lrn.commandline("-m 1 -s")
    tree = lrn(data)

There's nothing special about using :obj:`C45Classifier` - it's
just like any other. To demonstrate what the structure of
:class:`C45Node`'s looks like, we will show a script that prints
it out in the same format as C4.5 does.

.. literalinclude:: code/tree_c45_printtree.py

Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
:obj:`C45Node` does not know to which attribute it belongs), we need to
convert it to a string through :obj:`classVar`, which is passed as an
extra argument to the recursive part of printTree.

For discrete splits without subsetting, we print out all attribute values
and recursively call the function for all branches. Continuous splits are
equally easy to handle.

For discrete splits with subsetting, we iterate through branches, collect
the corresponding values that go into each branch (the list
:samp:`inset`), turn the values into strings and print them out,
treating separately the case when only a single value goes into the
branch.

Printing out C45 Tree
=====================

.. autofunction:: printTreeC45

TODO
====

This page does not provide examples for programming your own components,
such as, for instance, a :obj:`SplitConstructor`. Those examples
can be found on a page dedicated to callbacks to Python.

References
==========

Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`.
Addison Wesley.

Koutsofios, E., North, S. C. (1993). Drawing Graphs with dot. AT&T Bell
Laboratories, Murray Hill, NJ, USA.

`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_:
a home page of AT&T's dot and similar software packages.

1993"""
1994
1995from Orange.core import \
1996     TreeLearner as TreeLearnerBase, \
1997         TreeClassifier as _TreeClassifier, \
1998         C45Learner, \
1999         C45Classifier, \
2000         C45TreeNode as C45Node, \
2001         C45TreeNodeList as C45NodeList, \
2002         TreeDescender as Descender, \
2003              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
2004              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
2005              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
2006              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
2007              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
2008         TreeExampleSplitter as Splitter, \
2009              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
2010              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
2011              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
2012              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
2013              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
2014              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
2015              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
2016         TreeNode as Node, \
2017         TreeNodeList as NodeList, \
2018         TreePruner as Pruner, \
2019              TreePruner_SameMajority as Pruner_SameMajority, \
2020              TreePruner_m as Pruner_m, \
2021         TreeSplitConstructor as SplitConstructor, \
2022              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
2023              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
2024                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
2025                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
2026                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
2027                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
2028         TreeStopCriteria as StopCriteria, \
2029              TreeStopCriteria_Python as StopCriteria_Python, \
2030              TreeStopCriteria_common as StopCriteria_common
2031
2032import Orange.core
2033import operator
2034import base64
2035import re
2036import Orange.data
2037import Orange.feature.scoring
2038import Orange.classification.tree
2039
def _c45_showBranch(node, classvar, lev, i):
    var = node.tested
    if node.nodeType == 1:
        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
        _c45_printTree0(node.branch[i], classvar, lev+1)
    elif node.nodeType == 2:
        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
        _c45_printTree0(node.branch[i], classvar, lev+1)
    else:
        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
        inset = [var.values[j[0]] for j in inset]
        if len(inset)==1:
            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
        else:
            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
        _c45_printTree0(node.branch[i], classvar, lev+1)


def _c45_printTree0(node, classvar, lev):
    if node.nodeType == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
    else:
        for i, branch in enumerate(node.branch):
            if not branch.nodeType:
                _c45_showBranch(node, classvar, lev, i)
        for i, branch in enumerate(node.branch):
            if branch.nodeType:
                _c45_showBranch(node, classvar, lev, i)

def printTreeC45(tree):
    """
    Prints the tree given as an argument in the same form as Ross Quinlan's
    C4.5 program.

    ::

        import Orange

        data = Orange.data.Table("voting")
        c45 = Orange.classification.tree.C45Learner(data)
        Orange.classification.tree.printTreeC45(c45)

    will print out

    ::

        physician-fee-freeze = n: democrat (253.4)
        physician-fee-freeze = y:
        |   synfuels-corporation-cutback = n: republican (145.7)
        |   synfuels-corporation-cutback = y:
        |   |   mx-missile = y: democrat (6.0)
        |   |   mx-missile = n:
        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
        |   |   |   adoption-of-the-budget-resolution = y:
        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)


    If you run the original C4.5 (that is, the standalone program;
    remember that Orange uses the original C4.5 code) on the same data,
    it will print out

    ::

        physician-fee-freeze = n: democrat (253.4/5.9)
        physician-fee-freeze = y:
        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
        |   synfuels-corporation-cutback = y:
        |   |   mx-missile = y: democrat (6.0/2.4)
        |   |   mx-missile = n:
        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
        |   |   |   adoption-of-the-budget-resolution = y:
        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)

    which is almost the same, except that C4.5 tested the tree on
    the learning data and has also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously can't do
    (nor is there any need for it).

    """
    _c45_printTree0(tree.tree, tree.classVar, 0)


class TreeLearner(Orange.core.Learner):
    """
    Assembles the generic classification or regression tree learner
    (from Orange's objects for induction of decision trees).
    :class:`TreeLearner` is essentially a wrapper
    around :class:`TreeLearnerBase`, provided for easier use of the latter.
    It sets a number of parameters used in induction that
    can also be set after the creation of the object, most often through
    the object's attributes. If upon initialization
    :class:`TreeLearner` is given a set of examples, then an instance
    of :class:`TreeClassifier` object is returned instead.

    The values of attributes can also be set in the constructor.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    **Split construction**

    .. attribute:: split

        Defines a function that will be used in place of
        :obj:`SplitConstructor`.
        Useful when prototyping new tree induction
        algorithms. When this parameter is defined, other parameters that
        affect the procedures for growing of the tree are ignored. These
        include :obj:`binarization`, :obj:`measure`,
        :obj:`worstAcceptable` and :obj:`minSubset` (default:
        :class:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while
        continuous attributes are binarized.
        Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for
        discrete and five examples in a leaf for continuous attributes.)

    .. attribute:: binarization

        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
        0, do not use binarization (use :class:`SplitConstructor_Attribute`).
        Default: 0.

    .. attribute:: measure

        Measure for scoring of the attributes when deciding which of the
        attributes will be used for splitting of the example set in the node.
        Can be either a :class:`Orange.feature.scoring.Measure` or one of
        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
        "gini" (:class:`Orange.feature.scoring.Gini`),
        "relief" (:class:`Orange.feature.scoring.Relief`),
        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".

    .. attribute:: reliefM, reliefK

        Set `m` and `k` to the given values if the :obj:`measure` is
        relief.

    **Pruning**

    .. attribute:: worstAcceptable

        Used in pre-pruning, sets the lowest required attribute
        score. If the score of the best attribute is below this margin, the
        tree at that node is not grown further (default: 0).

        So, to allow splitting only when gainRatio (the default measure)
        is greater than 0.6, set :samp:`worstAcceptable=0.6`.

    .. attribute:: minSubset

        Minimal number of examples in non-null leaves (default: 0).

    .. attribute:: minExamples

        Data subsets with less than :obj:`minExamples`
        examples are not split any further, that is, all leaves in the tree
        will contain at least that many examples (default: 0).

    .. attribute:: maxDepth

        Gives maximal tree depth; 0 means that only the root is generated.
        The default is 100.

    .. attribute:: maxMajority

        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0).
        To stop the induction as soon as the majority class reaches 70%,
        you should use :samp:`maxMajority=0.7`, as in the following
        example. The numbers show the majority class
        proportion at each node. The script `tree2.py`_ induces and
        prints this tree.

        .. _tree2.py: code/tree2.py

        ::

            root: 0.333
            |    petal width<=0.800: 1.000
            |    petal width>0.800: 0.500
            |    |    petal width<=1.750: 0.907
            |    |    petal width>1.750: 0.978

    .. attribute:: stop

        Used for passing a function which is used in place of
        :class:`StopCriteria`. Useful when prototyping new
        tree induction algorithms. See the documentation on
        :class:`StopCriteria` for more info on this function.
        When used, parameters :obj:`maxMajority` and :obj:`minExamples`
        will not be considered (default: None).
        The default stopping criterion stops induction when all examples
        in a node belong to the same class.

    .. attribute:: mForPruning

        If non-zero, invokes an error-based bottom-up post-pruning,
        where m-estimate is used to estimate class probabilities
        (default: 0).

    .. attribute:: sameMajorityPruning

        If true, invokes a bottom-up post-pruning by removing the
        subtrees of which all leaves classify to the same class
        (default: False).

    **Record keeping**

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Determines whether to store class distributions, contingencies and
        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. By default everything except
        :obj:`storeExamples` is enabled. You won't save any memory by not
        storing distributions but storing contingencies, since distributions
        actually point to the same distribution that is stored in
        :obj:`contingency.classes`. (default: True except for
        storeExamples, which defaults to False).

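    For example, a tree with exhaustive binarization, information gain and
    pre- and post-pruning could be induced as follows (a sketch, using
    `iris.tab`)::

        import Orange

        data = Orange.data.Table("iris")
        classifier = Orange.classification.tree.TreeLearner(data,
            binarization=1, measure="infoGain",
            minExamples=5, mForPruning=2)
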
    """
    def __new__(cls, examples = None, weightID = 0, **argkw):
        self = Orange.core.Learner.__new__(cls, **argkw)
        if examples:
            self.__init__(**argkw)
            return self.__call__(examples, weightID)
        else:
            return self

    def __init__(self, **kw):
        self.__dict__.update(kw)

    def __call__(self, examples, weight=0):
        """
        Return a classifier from the given examples.
        """
        bl = self._base_learner()

        # pick the split measure according to the class type if neither
        # a custom split constructor nor a measure was given
        if not hasattr(self, "split") and not hasattr(self, "measure"):
            if examples.domain.classVar.varType == Orange.data.Type.Discrete:
                measure = Orange.feature.scoring.GainRatio()
            else:
                measure = Orange.feature.scoring.MSE()
            bl.split.continuousSplitConstructor.measure = measure
            bl.split.discreteSplitConstructor.measure = measure

        tree = bl(examples, weight)
        if getattr(self, "sameMajorityPruning", 0):
            tree = Pruner_SameMajority(tree)
        if getattr(self, "mForPruning", 0):
            tree = Pruner_m(tree, m = self.mForPruning)

        return TreeClassifier(baseClassifier=tree)

    def instance(self):
        """
        DEPRECATED. Return a base learner - an object
        of :class:`TreeLearnerBase`.
        This method is left for backwards compatibility.
        """
        return self._base_learner()

    def _build_split(self, learner):

        if hasattr(self, "split"):
            learner.split = self.split
        else:
            learner.split = SplitConstructor_Combined()
            learner.split.continuousSplitConstructor = \
                SplitConstructor_Threshold()
            binarization = getattr(self, "binarization", 0)
            if binarization == 1:
                learner.split.discreteSplitConstructor = \
                    SplitConstructor_ExhaustiveBinary()
            elif binarization == 2:
                learner.split.discreteSplitConstructor = \
                    SplitConstructor_OneAgainstOthers()
            else:
                learner.split.discreteSplitConstructor = \
                    SplitConstructor_Feature()

            measures = {"infoGain": Orange.feature.scoring.InfoGain,
                "gainRatio": Orange.feature.scoring.GainRatio,
                "gini": Orange.feature.scoring.Gini,
                "relief": Orange.feature.scoring.Relief,
                "retis": Orange.feature.scoring.MSE
                }

            measure = getattr(self, "measure", None)
            if isinstance(measure, str):
                measure = measures[measure]()
            if not measure:
                measure = Orange.feature.scoring.GainRatio()

            measureIsRelief = isinstance(measure, Orange.feature.scoring.Relief)
            relM = getattr(self, "reliefM", None)
            if relM and measureIsRelief:
                measure.m = relM

            relK = getattr(self, "reliefK", None)
            if relK and measureIsRelief:
                measure.k = relK

            learner.split.continuousSplitConstructor.measure = measure
            learner.split.discreteSplitConstructor.measure = measure

            wa = getattr(self, "worstAcceptable", 0)
            if wa:
                learner.split.continuousSplitConstructor.worstAcceptable = wa
                learner.split.discreteSplitConstructor.worstAcceptable = wa

            ms = getattr(self, "minSubset", 0)
            if ms:
                learner.split.continuousSplitConstructor.minSubset = ms
                learner.split.discreteSplitConstructor.minSubset = ms

    def _base_learner(self):
        learner = TreeLearnerBase()

        self._build_split(learner)

        if hasattr(self, "stop"):
            learner.stop = self.stop
        else:
            learner.stop = Orange.classification.tree.StopCriteria_common()
            mm = getattr(self, "maxMajority", 1.0)
            if mm < 1.0:
                learner.stop.maxMajority = self.maxMajority
            me = getattr(self, "minExamples", 0)
            if me:
                learner.stop.minExamples = self.minExamples

        for a in ["storeDistributions", "storeContingencies", "storeExamples",
            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
            if hasattr(self, a):
                setattr(learner, a, getattr(self, a))

        return learner

# the following is for the output


fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
""" Defines the multiplier by 100 (:samp:`^`) and the format
for the number of decimals (e.g. :samp:`5.3`). The corresponding
groups are named :samp:`m100` and :samp:`fs`. """

by = r"(?P<by>(b(P|A)))?"
""" Defines bP or bA or nothing; the result is in the group named by. """

# like 'by', but also accepts 's' (subtract) besides 'b' (divide by)
bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
# a comparison operator followed by a number, e.g. ">=2.5"
opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
# an (in)equality with a quoted discrete class value, e.g. '="versicolor"'
opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
# an optional percentage, given either as "(95%)" or "(.95)"
intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
# an interval with open/closed bounds, optionally negated with "!"
fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
re_V = re.compile("%V")
re_N = re.compile("%"+fs+"N"+by)
re_M = re.compile("%"+fs+"M"+by)
re_m = re.compile("%"+fs+"m"+by)
re_Ccont = re.compile("%"+fs+"C"+by+opc)
re_Cdisc = re.compile("%"+fs+"C"+by+opd)
re_ccont = re.compile("%"+fs+"c"+by+opc)
re_cdisc = re.compile("%"+fs+"c"+by+opd)
re_Cconti = re.compile("%"+fs+"C"+by+fromto)
re_cconti = re.compile("%"+fs+"c"+by+fromto)
re_D = re.compile("%"+fs+"D"+by)
re_d = re.compile("%"+fs+"d"+by)
re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
re_I = re.compile("%"+fs+"I"+intrvl)

def insertStr(s, mo, sub):
    """ Replace the part of s which is covered by mo
    with the string sub. """
    return s[:mo.start()] + sub + s[mo.end():]


def insertDot(s, mo):
    """ Replace the part of s which is covered by mo
    with a dot. You should use this when the
    function cannot compute the desired quantity; it is called, for instance,
    when it needs to divide by something in the parent, but the parent
    doesn't exist.
    """
    return s[:mo.start()] + "." + s[mo.end():]

def insertNum(s, mo, n):
    """ Replace the part of s matched by mo with the number n,
    formatted as specified by the user, that is, multiplied
    by 100, if needed, and printed with the right number of
    places and decimals. It does so by checking the mo
    for a group named m100 (representing the :samp:`^` in the format string)
    and a group named fs representing the part giving the number
    of decimals (e.g. :samp:`5.3`).
    """
    grps = mo.groupdict()
    m100 = grps.get("m100", None)
    if m100:
        n *= 100
    fs = grps.get("fs") or (m100 and ".0" or "5.3")
    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]

def byWhom(by, parent, tree):
    """ If by equals "bP", return parent, else return
    :samp:`tree.tree`. This is used to find what to divide the quantity
    with, when division is required.
    """
    if by=="bP":
        return parent
    else:
        return tree.tree

def replaceV(strg, mo, node, parent, tree):
    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

def replaceN(strg, mo, node, parent, tree):
    by = mo.group("by")
    N = node.distribution.abs
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution.abs > 1e-30:
                N /= whom.distribution.abs
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def replaceM(strg, mo, node, parent, tree):
    by = mo.group("by")
    maj = int(node.nodeClassifier.defaultValue)
    N = node.distribution[maj]
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[maj] > 1e-30:
                N /= whom.distribution[maj]
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)

def replacem(strg, mo, node, parent, tree):
    by = mo.group("by")
    maj = int(node.nodeClassifier.defaultValue)
    if node.distribution.abs > 1e-30:
        N = node.distribution[maj] / node.distribution.abs
        if by:
            # bug fix: 'whom' was used here without being computed first
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
                byN = whom.distribution[maj] / whom.distribution.abs
                if byN > 1e-30:
                    N /= byN
            else:
                return insertDot(strg, mo)
    else:
        N = 0.
    return insertNum(strg, mo, N)


def replaceCdisc(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    by, op, cls = mo.group("by", "op", "cls")
    N = node.distribution[cls]
    if op == "!=":
        N = node.distribution.abs - N
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[cls] > 1e-30:
                N /= whom.distribution[cls]
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def replacecdisc(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    op, by, cls = mo.group("op", "by", "cls")
    N = node.distribution[cls]
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
        if op == "!=":
            N = 1 - N
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[cls] > 1e-30:
                N /= whom.distribution[cls] / whom.distribution.abs
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)

__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}

def replaceCcont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceccont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)

def extractInterval(mo, dist):
    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
    lower, upper = float(lower), float(upper)
    if out:
        lop = lowin == "(" and operator.le or operator.lt
        # bug fix: both alternatives used operator.ge; a closed upper
        # bound "]" must exclude the bound itself from the outside
        hop = upin == ")" and operator.ge or operator.gt
        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
    else:
        lop = lowin == "(" and operator.gt or operator.ge
        hop = upin == ")" and operator.lt or operator.le
        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())

def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN / whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)

def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    # The format string is built in two steps: "%%%sf" % fs yields,
    # e.g., "%5.3f", which then formats the (possibly scaled) number.
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs # abs > d, so d > 1e-30 implies abs > 1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")



def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)

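# Two-sided z-scores of the normal distribution for the confidence
# levels supported by replaceI below (e.g. 1.96 for a 95% interval).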
Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
    mul = mo.group("m100") and 100 or 1

    if intrvl not in Z:
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (intrvl*100)

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))


# This class is more a collection of functions, merged into a class so
# that they do not need to pass around too many arguments. It is
# constructed, used and discarded; it is not meant to store any state.
class _TreeDumper:
    # Pairs of (compiled regular expression, replacer function); each
    # match of a pattern in a format string is handed to its replacer.
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI) ]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
        maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr

    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    strt = mo.start()+1

        return strg

    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes

    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n"+("|    "*lev)+"    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)


    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)

    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName+chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            # The leaf's name must be quoted as well; otherwise it would
            # not match the quoted name used in the edge written above.
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                self.formatString(self.leafStr, node, parent)))

    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    # Node names are base64-encoded so that arbitrary attribute values
    # yield identifiers that are safe inside a quoted dot string.
    return '"%s"' % (base64.b64encode(x))

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier: baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or \
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
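
        A minimal usage sketch (``classifier`` and ``instance`` stand for
        any trained :class:`TreeClassifier` and :class:`Orange.data.Instance`)::

            value, probabilities = classifier(instance,
                Orange.classification.Classifier.GetBoth)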
        """
        return self.nativeClassifier(instance, result_type, *args, **kwdargs)

    def __setattr__(self, name, value):
        if name == "nativeClassifier":
            self.__dict__[name] = value
            return
        if name in self.nativeClassifier.__dict__:
            self.nativeClassifier.__dict__[name] = value
        self.__dict__[name] = value

    def dump(self, leafStr = "", nodeStr = "", **argkw):
        """
        Return a string representation of a tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" will be used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, the subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches with a single
          node are printed before the branches with larger subtrees.
          If False, the branches are printed in order of
          appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of regular expressions and callback
          functions through which the user can print out other specific
          information in the nodes.
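
        A small usage sketch (``tree_classifier`` stands for any trained
        :class:`TreeClassifier`; the format string is the documented
        default for classification trees)::

            print tree_classifier.dump(leafStr="%V (%^.2m%)", maxDepth=3)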
        """
        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
            self).dumpTree()

    def dot(self, fileName, leafStr = "", nodeStr = "", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """ Print the tree to a file in a format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        It uses the same parameters as :meth:`dump` defined above,
        plus two parameters that define the shapes used for internal
        nodes and leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
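
        A minimal usage sketch (``tree_classifier`` stands for any trained
        :class:`TreeClassifier`; "oval" and "box" are standard GraphViz
        shapes)::

            tree_classifier.dot("tree.dot", leafShape="oval", nodeShape="box")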
        """
        fle = isinstance(fileName, basestring) and file(fileName, "wt") or fileName

        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()

    def count_nodes(self):
        """
        Return the number of nodes.
        """
        return _countNodes(self.tree)

    def count_leaves(self):
        """
        Return the number of leaves.
        """
        return _countLeaves(self.tree)

def _countNodes(node):
    count = 0
    if node:
        count += 1
        if node.branches:
            for branch in node.branches:
                count += _countNodes(branch)
    return count

def _countLeaves(node):
    count = 0
    if node:
        if node.branches: # internal node
            for branch in node.branches:
                count += _countLeaves(branch)
        else:
            count += 1
    return count
