source: orange/orange/Orange/classification/tree.py @ 7713:d27afc892c9c

1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
12To build a small tree (:obj:`TreeClassifier`) from the iris data set
13(with the depth limited to three levels), use
14(part of `orngTree1.py`_, uses `iris.tab`_):
15
16.. literalinclude:: code/orngTree1.py
17   :lines: 1-4
18
19.. _orngTree1.py: code/orngTree1.py
20
21
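The included file is not reproduced here; as a hedged sketch (assuming
the camelCase :samp:`maxDepth` keyword documented for the learner below),
its first four lines look roughly like this::

    import Orange

    iris = Orange.data.Table("iris")
    tree = Orange.classification.tree.TreeLearner(iris, maxDepth=3)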

This page first describes the learner and the classifier, and then
defines the base classes (individual components) of the trees
and the tree-building process.

.. autoclass:: TreeLearner
    :members:

.. autoclass:: TreeClassifier
    :members:


For a somewhat more complex example, here is how to write your own stop
function. The example itself is more amusing than useful. It constructs
and prints two trees. For the first one we define the *defStop*
function, which is used by default, and combine it with a random function
so that the stop criterion is also met in an additional 20% of the cases
when *defStop* is false. The second tree is built such that it
considers only the random function as the stopping criterion. Note that in
the second case the lambda function still has three parameters, since this is
the required number of parameters for a stop function (:obj:`StopCriteria`).
Part of `tree3.py`_ (uses `iris.tab`_):

.. _tree3.py: code/tree3.py

.. literalinclude:: code/tree3.py
   :lines: 8-23

The output is not shown here since the resulting trees are rather
big.
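
A hedged sketch of the two stop functions described above (the file
itself is not reproduced here; :samp:`randint(1, 5) == 1` supplies the
extra 20%)::

    import Orange
    from random import randint, seed
    seed(0)

    data = Orange.data.Table("iris")

    # the default stop criterion, combined with a 20% random stop
    defStop = Orange.classification.tree.StopCriteria()
    f = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or randint(1, 5) == 1
    l = Orange.classification.tree.TreeLearner(data, stop=f)

    # the random function as the sole stop criterion; note the
    # three parameters required of any stop function
    fr = lambda examples, weightID, contingency: randint(1, 5) == 1
    lr = Orange.classification.tree.TreeLearner(data, stop=fr)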

=================
Printing the Tree
=================

The included printing functions can
print out practically anything you'd like to
know, from the number of examples, proportion of examples of majority
class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in a parent node. And even more, you can
define your own callback functions to be used for printing.

Before we go on: you can read all about the function and use it to its
full extent, or you can just call it, giving it the tree as the sole
argument and it will print out the usual textual representation of the
tree. If you're satisfied with that, you can stop here.

The magic is in the format string. It is a string which is printed
out at every leaf or internal node with certain format specifiers
replaced by data from the tree node. Specifiers are generally of form
**%[^]<precision><quantity><divisor>**.

**^** at the start indicates that the number should be multiplied by 100.
It's useful for printing proportions like percentages.

**<precision>** is in the same format as in Python (or C) string
formatting. For instance, :samp:`%N` denotes the number of examples in the node,
hence :samp:`%6.2N` would mean output to two decimal digits and six places
altogether. If left out, a default format :samp:`5.3` is used, unless you
multiply the numbers by 100, in which case the default is :samp:`.0`
(no decimals, the number is rounded to the nearest integer).

**<divisor>** tells what to divide the quantity in that node by.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` will give the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the entire data
set, so :samp:`%NbA` will give you the proportion of examples (out of the entire
training data set) that fell into that node. If division is impossible
since the parent node does not exist or some data is missing, a dot is
printed out instead of the quantity.
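
For instance (a hedged illustration; the numbers depend on the induced
tree), the following prints, at each leaf, the prediction and the
percentage of all training examples that reached that leaf::

    >>> print tree.dump(leafStr="%V (%^NbA%)")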

**<quantity>** is the only required element. It defines what to print.
For instance, :samp:`%N` would print out the number of examples in the node.
Possible quantities are

:samp:`V`
    The value predicted at that node. You cannot define the precision
    or divisor here.

:samp:`N`
    The number of examples in the node.

:samp:`M`
    The number of examples in the majority class (that is, the class
    predicted by the node).

:samp:`m`
    The proportion of examples in the majority class.

:samp:`A`
    The average class for examples in the node; this is available only for
    regression trees.

:samp:`E`
    Standard error for the class of examples in the node; available for
    regression trees.

:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.

:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this will give the number of examples of Iris-virginica divided by the
    number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`.

    For regression trees, you can use operators =, !=, <, <=, >, and >=,
    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
    check the number of examples in a certain interval: :samp:`%C[20, 22]`
    will give you the number of examples between 20 and 22 (inclusive)
    and :samp:`%C(20, 22)` will give the number of such examples excluding the
    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
    If you would like the examples outside the interval, add a :samp:`!`,
    like :samp:`%C!(20, 22]`.

:samp:`c`
    Same as above, except that it computes the proportion of the class
    instead of the number of examples.

:samp:`D`
    Prints out the number of examples in each class. You can use both the
    precision (it is applied to each number in the distribution) and the
    divisor. This quantity cannot be computed for regression trees.

:samp:`d`
    Same as above, except that it shows proportions of examples. This
    again doesn't work with regression trees.

<user defined formats>
    You can add more, if you will. Instructions and examples are given at
    the end of this section.

Examples
========

We shall build a small tree from the iris data set - we shall limit the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):

.. literalinclude:: code/orngTree1.py
   :lines: 1-4

.. _orngTree1.py: code/orngTree1.py

The easiest way to call the function is to pass the tree as the only
argument::

    >>> print tree.dump()
    petal width<0.800: Iris-setosa (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (94.23%)
    |    |    petal length>=5.350: Iris-virginica (100.00%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (66.67%)
    |    |    petal length>=4.850: Iris-virginica (100.00%)

Let's now print out the predicted class at each node, along with the
number of examples in the majority class and the total number of examples
in the node::

    >>> print tree.dump(leafStr="%V (%M out of %N)")
    petal width<0.800: Iris-setosa (50.000 out of 50.000)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)

Would you like to know how the number of examples declines as
compared to the entire data set and to the parent node? We find
it with this::

    >>> print tree.dump(leafStr="%V (%^MbA%, %^MbP%)")
    petal width<0.800: Iris-setosa (100%, 100%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica (4%, 4%)
    |    |    petal length>=4.850: Iris-virginica (86%, 96%)

Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.

And now for the output: all examples of setosa fall into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.

If you find this guesswork annoying - so do I. Let us print out the
number of versicolors in each node, together with the proportion of
versicolors among the examples in this particular node and among all
versicolors. So,

::

    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'

gives the following output::

    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
    |    petal width>=1.750
    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)

Finally, we may want to print out the distributions, using a simple
string :samp:`%D`::

    petal width<0.800: [50.000, 0.000, 0.000]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.000, 49.000, 3.000]
    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.000, 1.000, 2.000]
    |    |    petal length>=4.850: [0.000, 0.000, 43.000]

What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::

    petal width<0.800: [1.00, 0.00, 0.00]
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: [0.00, 0.94, 0.06]
    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
    |    petal width>=1.750
    |    |    petal length<4.850: [0.00, 0.33, 0.67]
    |    |    petal length>=4.850: [0.00, 0.00, 1.00]

We haven't tried printing out any information for internal nodes.
To start with the most trivial case, we shall print the prediction
at each node.

::

    tree.dump(leafStr="%V", nodeStr=".")

says that the nodeStr should be the same as leafStr (not very useful
here, since leafStr is trivial anyway).

::

    root: Iris-setosa
    |    petal width<0.800: Iris-setosa
    |    petal width>=0.800: Iris-versicolor
    |    |    petal width<1.750: Iris-versicolor
    |    |    |    petal length<5.350: Iris-versicolor
    |    |    |    petal length>=5.350: Iris-virginica
    |    |    petal width>=1.750: Iris-virginica
    |    |    |    petal length<4.850: Iris-virginica
    |    |    |    petal length>=4.850: Iris-virginica

Note that the output is somewhat different now: another node called
*root* appeared and the tree looks one level deeper. This is
needed to print out the data for that node too.

Now for something more complicated: let us observe how the number
of virginicas decreases down the tree::

    print tree.dump(leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')

Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is printed
out. In parentheses we print the same thing except that we divide by the
examples in the parent node. Note the use of single quotes, so we can
use the double quotes inside the string, when we specify the class.

::

    root: 100.0% (.%)
    |    petal width<0.800: 0.0% (0.0%)
    |    petal width>=0.800: 100.0% (100.0%)
    |    |    petal width<1.750: 10.0% (10.0%)
    |    |    |    petal length<5.350: 6.0% (60.0%)
    |    |    |    petal length>=5.350: 4.0% (40.0%)
    |    |    petal width>=1.750: 90.0% (90.0%)
    |    |    |    petal length<4.850: 4.0% (4.4%)
    |    |    |    petal length>=4.850: 86.0% (95.6%)

See what's in the parentheses in the root node? If :meth:`TreeClassifier.dump`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.

For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think why). In the leaves we shall also add the predicted
class. So now we'll have to call the function like this.

::

    >>> print tree.dump(leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
    root: [50.000, 50.000, 50.000] . .
    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]

To explore the possibilities when printing regression trees, we are going
to induce a tree from the housing data set. Called with the tree as the
only argument, :meth:`TreeClassifier.dump` prints the tree like this::

    RM<6.941
    |    LSTAT<14.400
    |    |    DIS<1.385: 45.6
    |    |    DIS>=1.385: 22.9
    |    LSTAT>=14.400
    |    |    CRIM<6.992: 17.1
    |    |    CRIM>=6.992: 12.0
    RM>=6.941
    |    RM<7.437
    |    |    CRIM<7.393: 33.3
    |    |    CRIM>=7.393: 14.4
    |    RM>=7.437
    |    |    TAX<534.500: 45.9
    |    |    TAX>=534.500: 21.9

Let us add the standard error in both internal nodes and leaves, and the
90% confidence intervals in the leaves::

    >>> print tree.dump(leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
    root: [SE: 0.409]
    |    RM<6.941: [SE: 0.306]
    |    |    LSTAT<14.400: [SE: 0.320]
    |    |    |    DIS<1.385: [SE: 4.420]:
    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
    |    |    |    DIS>=1.385: [SE: 0.244]:
    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
    |    |    LSTAT>=14.400: [SE: 0.333]
    |    |    |    CRIM<6.992: [SE: 0.338]:
    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
    |    |    |    CRIM>=6.992: [SE: 0.448]:
    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
    |    RM>=6.941: [SE: 1.031]
    |    |    RM<7.437: [SE: 0.958]
    |    |    |    CRIM<7.393: [SE: 0.692]:
    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
    |    |    |    CRIM>=7.393: [SE: 2.157]:
    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
    |    |    RM>=7.437: [SE: 1.124]
    |    |    |    TAX<534.500: [SE: 0.817]:
    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
    |    |    |    TAX>=534.500: [SE: 0.000]:
    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]

What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns. Also,
:samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
so the printed number has the same number of decimals
as in the data file.

Regression trees cannot print the distributions in the same way
as classification trees. They instead offer a set of operators for
observing the number of examples within a certain range. For instance,
let us check the number of examples with values below 22, and compare
this number with values in the parent nodes::

    >>> print tree.dump(leafStr="%C<22 (%cbP<22)", nodeStr=".")
    root: 277.000 (.)
    |    RM<6.941: 273.000 (1.160)
    |    |    LSTAT<14.400: 107.000 (0.661)
    |    |    |    DIS<1.385: 0.000 (0.000)
    |    |    |    DIS>=1.385: 107.000 (1.020)
    |    |    LSTAT>=14.400: 166.000 (1.494)
    |    |    |    CRIM<6.992: 93.000 (0.971)
    |    |    |    CRIM>=6.992: 73.000 (1.040)
    |    RM>=6.941: 4.000 (0.096)
    |    |    RM<7.437: 3.000 (1.239)
    |    |    |    CRIM<7.393: 0.000 (0.000)
    |    |    |    CRIM>=7.393: 3.000 (15.333)
    |    |    RM>=7.437: 1.000 (0.633)
    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)

The last line, for instance, says that among the examples with tax above
534, the proportion of examples with the class below 22 is 30 times as
high as in the parent node.

For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percentages.

::

    >>> print tree.dump(leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")

OK, let's observe the format string for one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. By :samp:`%cbP![20, 22]` we divide this by the same
statistic computed on the parent. Add a :samp:`^` and you have percentages.

::

    root: 439.000 (.%)
    |    RM<6.941: 364.000 (98%)
    |    |    LSTAT<14.400: 200.000 (93%)
    |    |    |    DIS<1.385: 5.000 (127%)
    |    |    |    DIS>=1.385: 195.000 (99%)
    |    |    LSTAT>=14.400: 164.000 (111%)
    |    |    |    CRIM<6.992: 91.000 (96%)
    |    |    |    CRIM>=6.992: 73.000 (105%)
    |    RM>=6.941: 75.000 (114%)
    |    |    RM<7.437: 46.000 (101%)
    |    |    |    CRIM<7.393: 43.000 (100%)
    |    |    |    CRIM>=7.393: 3.000 (100%)
    |    |    RM>=7.437: 29.000 (98%)
    |    |    |    TAX<534.500: 29.000 (103%)
    |    |    |    TAX>=534.500: 0.000 (0%)

Defining Your Own Printout functions
====================================

:meth:`TreeClassifier.dump`'s argument :obj:`userFormats` can be used to print out
some other information in the leaves or nodes. If provided,
:obj:`userFormats` should contain a list of tuples with a regular expression
and a callback function to be called when that expression is found in the
format string. Expressions from :obj:`userFormats` are checked before
the built-in expressions discussed above, so you can override the built-ins
if you want to.

The regular expression should describe a string like those we used above,
for instance the string :samp:`%.2DbP`. When a leaf or internal node
is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
is checked for these regular expressions and when a match is found,
the corresponding callback function is called.

The callback function will get five arguments: the format string
(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
being printed, its parent (can be None) and the tree classifier.
The function should return the format string in which the part described
by the match object (that is, the part that is matched by the regular
expression) is replaced by whatever information your callback function
is supposed to give.

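A minimal skeleton obeying that contract; the quantity it inserts (the
node's weighted example count, via the :func:`insertStr` helper described
below) is chosen only for illustration::

    def replaceN2(strg, mo, node, parent, tree):
        # replace the matched specifier with the node's example count
        return insertStr(strg, mo, str(node.distribution.abs))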

The function can use several utility functions provided in the module.

.. autofunction:: insertStr

.. autofunction:: insertDot

.. autofunction:: insertNum

.. autofunction:: byWhom


There are also a few pieces of regular expression that you may want to reuse.
The two you are likely to use are:

.. autodata:: fs

.. autodata:: by

For a trivial example, :samp:`%V` is implemented like this. There is the
following tuple in the list of built-in formats::

    (re.compile("%V"), replaceV)

:obj:`replaceV` is a function defined by::

    def replaceV(strg, mo, node, parent, tree):
        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))

It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to *insertStr* to do the replacement.

A more complex regular expression is the one for the proportion of
the majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
two partial expressions defined above.

Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):

.. _orngTree2.py: code/orngTree2.py

.. literalinclude:: code/orngTree2.py
   :lines: 7-31

We first defined getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If we need to divide the quantity by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node by whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot, otherwise we call :func:`insertNum` which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.


We can now print out the iris tree:

.. literalinclude:: code/orngTree2.py
    :lines: 33

And we get::

    petal width<0.800: Iris-setosa 100% (100.00%)
    petal width>=0.800
    |    petal width<1.750
    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
    |    petal width>=1.750
    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)

Plotting the Tree using Dot
===========================

Suppose you saved the tree in a file "tree5.dot". You can then
print it out as a gif if you execute the following command line

::

    dot -Tgif tree5.dot -otree5.gif

GraphViz's dot has quite a few other output formats, check
its documentation to learn which.

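How the .dot file itself is produced depends on the version of this
module; a hedged sketch, assuming a :samp:`dot` method that accepts the
same format strings as :meth:`TreeClassifier.dump`::

    tree.dot("tree5.dot", leafStr="%V", nodeStr=".")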


Classification trees are represented as a tree-like hierarchy of
:obj:`Node` classes.

.. class:: Node

    Node stores information about the learning examples belonging
    to the node, a branch selector, a list of branches (if the node is
    not a leaf) with their descriptions and strengths, and a classifier.

    .. attribute:: distribution

        Stores a distribution for learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`TreeLearnerBase`'s storeDistributions flag to false.

    .. attribute:: contingency

        Stores complete contingency matrices for the learning examples
        belonging to the node. Storing contingencies can be enabled by
        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies get computed and stored to
        :obj:`Node`, but are removed shortly afterwards.
        The details are given in the
        description of the :obj:`TreeLearnerBase` object.

    .. attribute:: examples, weightID

        Store a set of learning examples for the node and the
        corresponding ID of the weight meta attribute. The root of the
        tree stores a "master table" of examples, while other nodes'
        :obj:`Orange.data.Table` contain references to examples in
        the root's :obj:`Orange.data.Table`. Examples are only stored
        if a corresponding flag (:obj:`storeExamples`) has been
        set while building the tree; to conserve space, storing
        is disabled by default.

    .. attribute:: nodeClassifier

        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, this is
        used to decide the final class (or class distribution) of an
        example. If it's an internal node, it is stored if
        :obj:`Node`'s flag :obj:`storeNodeClassifier`
        is set. Since the :obj:`nodeClassifier` is needed by
        :obj:`Descender` and for pruning (see far below),
        this is the default behaviour; space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should
        never disable this if you intend to prune the tree later.

    If the node is a leaf, the remaining fields are None.
    If it's an internal node, there are several additional fields.

    .. attribute:: branches

        Stores a list of subtrees, given as :obj:`Node`.
        An element can be None; in this case the node is empty.

    .. attribute:: branchDescriptions

        A list with string descriptions for branches, constructed by
        :obj:`SplitConstructor`. It can contain different kinds
        of descriptions, but basically, expect things like 'red' or '>12.3'.

    .. attribute:: branchSizes

        Gives a (weighted) number of training examples that went into
        each branch. This can be used later, for instance, for
        modeling probabilities when classifying examples with
        unknown values.

    .. attribute:: branchSelector

        Gives a branch for each example. The same object is used during
        learning and classifying. The :obj:`branchSelector` is of
        type :obj:`orange.Classifier`, since its job is similar to that
        of a classifier: it gets an example and returns a discrete
        :obj:`Orange.data.Value` in range :samp:`[0, len(branches)-1]`.
        When an example cannot be classified into any branch, the selector
        can return an :obj:`Orange.data.Value` containing a special value
        (sVal) which should be a discrete distribution
        (DiscDistribution). This should represent the
        :obj:`branchSelector`'s opinion of how to divide the
        example between the branches. Whether the proposition will be
        used or not depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).

    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none if it is a leaf.

    .. method:: treeSize()

        Return the number of nodes in the subtrees (including the
        node, excluding null-nodes).

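For instance, assuming :samp:`tree` is a :obj:`TreeClassifier` induced as
above (its root :obj:`Node` is in the :obj:`tree` attribute, as described
below), the size of the whole tree can be read off the root::

    >>> print tree.tree.treeSize()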

==============
Classification
==============

.. class:: _TreeClassifier

    Classifies examples according to a tree stored in :obj:`tree`.

    .. attribute:: tree

        The root of the tree, represented as a :class:`Node`.

    Classification would be straightforward if there were no unknown
    values or, in general, examples that cannot be placed into a
    single branch. The response in such cases is determined by a
    component :obj:`descender`.

    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as possible,
    according to the values of the example's attributes. The
    :obj:`Descender` calls the node's :obj:`branchSelector` to get
    the branch index. If it's a simple index, the corresponding branch
    is followed. If not, it's up to the descender to decide what to do,
    and that's where descenders differ. A :obj:`descender` can choose
    a single branch (for instance, the one that is most recommended
    by the :obj:`branchSelector`) or it can let the branches vote.

    In general there are three possible outcomes of a descent.

    #. The descender reaches a leaf. This happens when nothing went wrong
       (there are no unknown or out-of-range values in the example) or
       when things went wrong, but the descender smoothed them over by
       selecting a single branch and continuing the descent. In this
       case, the descender returns the reached :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node. Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Node` wants to split the example (i.e., to decide the
       class by voting). In this case, the descender returns a
       :obj:`Node` and the vote-weights for the branches.

    The weights can correspond to the distribution returned by
    :obj:`branchSelector`, to the number of learning examples that
    were assigned to each branch, or to something else.

    :obj:`TreeClassifier` uses the descender to descend from the root.
    If it returns only a :obj:`Node` and no distribution, the
    descent should stop; it does not matter whether it's a leaf (the
    first case above) or an internal node (the second case). The node's
    :obj:`nodeClassifier` is used to decide the class. If the descender
    returns a :obj:`Node` and a distribution, the :obj:`TreeClassifier`
    recursively calls itself for each of the subtrees and the
    predictions are weighted as requested by the descender.

    When voting, subtrees do not predict the class but probabilities
    of classes. The predictions are multiplied by weights, summed and
    the most probable class is returned.

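    A plain-Python sketch of that weighted vote; the weights and the
    subtree distributions are made up for illustration::

        weights = [0.7, 0.3]                        # vote-weights from the descender
        dists = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3]]  # distributions from the subtrees
        summed = [sum(w * d[i] for w, d in zip(weights, dists))
                  for i in range(3)]                # [0.69, 0.22, 0.09]
        prediction = summed.index(max(summed))      # the most probable class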

    .. method:: vote()

    .. method:: classDistribution()


The rest of this section is only for those interested in the C++ code.
======================================================================

If you'd like to understand how the classification works in C++,
start reading at :obj:`TTreeClassifier::vote`. It gets a
:obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
vote weights. For each node, it calls
:obj:`TTreeClassifier::classDistribution` and then multiplies
and sums the distributions. :obj:`vote` returns a normalized
distribution of predictions.

A new overload of :obj:`TTreeClassifier::classDistribution` gets
an additional parameter, a :obj:`Node`. This is done
for the sake of recursion. The normal version of
:obj:`classDistribution` simply calls the overloaded one with the
tree root as an additional parameter. :obj:`classDistribution`
uses :obj:`descender`. If the descender reaches a leaf, it calls
:obj:`nodeClassifier`, otherwise it calls :obj:`vote`.

Thus, the :obj:`TreeClassifier`'s :obj:`vote` and
:obj:`classDistribution` are written in the form of double
recursion. The recursive calls do not happen at each node of the
tree but only at nodes where a vote is needed (that is, at nodes
where the descender halts).

For predicting a class, :obj:`operator()` calls the
descender. If it reaches a leaf, the class is predicted by the
leaf's :obj:`nodeClassifier`. Otherwise, it calls
:obj:`vote`. From now on, :obj:`vote` and
:obj:`classDistribution` interweave down the tree and return
a distribution of predictions. :obj:`operator()` then simply
chooses the most probable class.

========
Learning
========

The main learning object is :obj:`TreeLearnerBase`. It is basically
a skeleton into which the user must plug the components for particular
functions. For easier use, defaults are provided.

Components that govern the structure of the tree are :obj:`split`
(of type :obj:`SplitConstructor`), :obj:`stop` (of
type :obj:`StopCriteria`) and :obj:`exampleSplitter`
(of type :obj:`ExampleSplitter`).


.. class:: TreeLearnerBase

    TreeLearnerBase has a number of components.

    .. attribute:: split

        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined`
        with separate constructors for discrete and continuous attributes.
        Discrete attributes are used as they are, while continuous attributes
        are binarized. Gain ratio is used to select attributes.
        A minimum of two examples in a leaf is required for discrete
        and five examples in a leaf for continuous attributes.

    .. attribute:: stop

        Object of type :obj:`StopCriteria`. The default stopping
        criterion stops induction when all examples in a node belong
        to the same class.

    .. attribute:: splitter

        Object of type :obj:`ExampleSplitter`. The default splitter
        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits
        the learning examples according to distributions given by the
        selector.

    .. attribute:: contingencyComputer

        By default, this slot is left empty and ordinary contingency
        matrices are computed for examples at each node. If need arises,
        one can change the way the matrices are computed. This can be
        used to change the way that unknown values are treated when
        assessing qualities of attributes. As mentioned earlier,
        the computed matrices can be used by the split constructor and
        by the stopping criteria. On the other hand, they can be (and
        are) ignored by some split constructors.

    .. attribute:: nodeLearner

        Induces a classifier from examples belonging to a node. The
        same learner is used for internal nodes and for leaves. The
        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.

    .. attribute:: descender

        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_UnknownMergeAsSelector` which votes using
        the :obj:`branchSelector`'s distribution for vote weights.

    .. attribute:: maxDepth

        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent any infinite tree
        induction due to mis-set stopping criteria. If you are sure you
        need larger trees, increase it. If you, on the other hand, want
        to lower this hard limit, you can do so as well.

    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier

        Decide whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by not storing distributions but storing contingencies,
        since the distribution actually points to the same distribution
        that is stored in :obj:`contingency.classes`.

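    A hedged sketch of plugging components into this skeleton, using only
    component names documented on this page (whether you instantiate
    :obj:`TreeLearnerBase` directly or go through a wrapper depends on
    your version)::

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.split = Orange.classification.tree.SplitConstructor_Combined()
        learner.stop = Orange.classification.tree.StopCriteria_common()
        learner.maxDepth = 5       # hard limit on the tree depth
        learner.storeExamples = 1  # keep examples in the nodes
        tree = learner(data)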

    The :obj:`TreeLearnerBase` first sets the defaults for missing
    components. Although stored in the actual :obj:`TreeLearnerBase`'s
    fields, they are removed when the induction is finished.

    Then it ensures that examples are stored in a table. This is needed
    because the algorithm juggles with pointers to examples. If
    examples are in a file or are fed through a filter, they are copied
    to a table. Even if they are already in a table, they are copied
    if :obj:`storeExamples` is set. This is to ensure that pointers
    remain pointing to examples even if the user later changes the
    example table. If they are in the table and the :obj:`storeExamples`
    flag is clear, we just use them as they are. This will obviously
    crash in a multi-threaded system if one changes the table during
    the tree induction. Well... don't do it.

    A priori class probabilities are computed. At this point we check
    the sum of example weights; if it's zero, there are no examples and
    we cannot proceed. A list of candidate attributes is set; in the
    beginning, all attributes are candidates for the split criterion.

    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments
    are a set of examples, a weight meta-attribute ID (a tricky thing,
    it can stay the same as the original or change to
    accommodate the splitting of examples among branches), the apriori class
    distribution and a list of candidates (represented as a vector
    of Boolean values).

    The contingency matrix is computed next. This happens
    even if the flag :obj:`storeContingencies` is false.
    If the :obj:`contingencyComputer` is given we use it,
    otherwise we construct just an ordinary contingency matrix.

    A :obj:`stop` is called to see whether it is worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is only built if the
    :obj:`forceNodeClassifier` flag is set.

    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it won't touch the examples. Thus,
    a choice of :obj:`contingencyComputer` will, in many cases,
    affect the :obj:`nodeClassifier`. The :obj:`nodeLearner` can
    return no classifier; if so and if the classifier would be
    needed for classification, the :obj:`TreeClassifier`'s function
    returns DK or an empty distribution. If you're writing your own
    tree classifier - pay attention.

    If the induction is to continue, a :obj:`split` component is called.
    If it fails to return a branch selector, induction stops and the
    :obj:`Node` is returned.

    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide
    the examples as described above.

    The contingency gets removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they wish.

    The :obj:`TreeLearnerBase` then recursively calls itself for each of
    the non-empty subsets. If the splitter returns a list of weights,
    the corresponding weight is used for each branch. Besides that, the
    attribute spent by the splitter (if any) is removed from the
    list of candidates for the subtree.

    A subset of examples is stored in its corresponding tree node,
    if so requested. If not, the new weight attributes are removed (if
    any were created).

Pruning
=======

Tree pruners derived from :obj:`Pruner` can be given either a
:obj:`Node` (presumably, but not necessarily, a root) or a
:obj:`TreeClassifier`. The result is a new, pruned :obj:`Node`
or a new :obj:`TreeClassifier` with a pruned tree. The original
tree remains intact.

Note, however, that pruners construct only a shallow copy of the tree.
The pruned tree's :obj:`Node` contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. Thus, you may modify a pruned tree's structure
(manually cut it, add new nodes, replace components) but modifying,
for instance, some node's :obj:`nodeClassifier` (the
:obj:`nodeClassifier` itself, not a reference to it!) would modify
the node's :obj:`nodeClassifier` in the corresponding node of
the original tree.

Talking about node classifiers - pruners cannot construct a
:obj:`nodeClassifier` nor merge the :obj:`nodeClassifier` of the pruned
subtrees into classifiers for new leaves. Thus, if you want to build
a prunable tree, internal nodes must have their :obj:`nodeClassifier`
defined. Fortunately, all you need to do is nothing; if you leave
the :obj:`TreeLearnerBase`'s flags as they are by default, the
:obj:`nodeClassifier`'s are created.

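A hedged sketch, assuming the m-error-estimate pruner is exposed as
:obj:`Pruner_m`, as it was in orngTree::

    pruner = Orange.classification.tree.Pruner_m(m=2)
    prunedTree = pruner(tree)   # the original tree remains intact
    print prunedTree.dump()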

=======
Classes
=======

Several classes described above are already functional and can
(and mostly will) be used as they are. Those classes are :obj:`Node`,
:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes
the other classes.

Classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner`
and :obj:`Classifier` are among the Orange classes that can be subtyped
in Python and have the call operator overloaded in such a way that it
can be called back from C++ code. You can thus program your own components
for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`. The detailed
information on how this is done, and what can go wrong, is given in a
separate page, dedicated to callbacks to Python XXXXXXXXXX.

SplitConstructors
=====================

Split construction is almost as exciting as waiting for a delayed flight.
Boring, that is. Split constructors are programmed as spaghetti code
that juggles with contingency matrices, with separate cases for discrete
and continuous classes... Most split constructors work either for
discrete or for continuous attributes. The suggested practice is
to use a :obj:`SplitConstructor_Combined` that can handle
both by simply delegating attributes to specialized split constructors.

Note: split constructors that cannot handle attributes of a particular
type (discrete, continuous) do not report an error or a warning but
simply skip the attribute. It is your responsibility to use a correct
split constructor for your dataset. (May we again suggest
using :obj:`SplitConstructor_Combined`?)

The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen accordingly
is the 'measure' attribute for the :obj:`SplitConstructor_Measure`
class (and derived classes).

.. class:: SplitConstructor

    Finds a suitable criterion for dividing the learning (and later testing)
    examples coming to the node. The data it gets is a set of examples
    (and, optionally, an ID of the weight meta-attribute), a domain
    contingency computed from examples, apriori class probabilities,
    a list of candidate attributes it should consider and a node
    classifier (if it was constructed, that is, if
    :obj:`storeNodeClassifier` is left true).

    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold; one is that it's faster
    and the other is that the contingency matrices are not necessarily
    constructed by simply counting the examples. Why and how is
    explained later. There are, however, cases when the domain contingency
    does not suffice; for example, when ReliefF is used as a measure
    of quality of attributes. In this case, there's no other way but
    to use the examples and ignore the precomputed contingencies.

    The split constructor should consider only the attributes in the
    candidate list (the list is a vector of booleans, one for each
    attribute).

    :obj:`SplitConstructor` returns most of the data we talked
    about when describing the :obj:`Node`. It returns a classifier
    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
    branch descriptions and a list with the number of examples that
    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in branches;
    in this case, the :obj:`TreeLearnerBase` will find the number itself
    after splitting the example set into subsets. However, if a split
    constructor can provide the numbers at no extra computational
    cost, it should do so.

    In addition, it returns the quality of the split; a number without
    any fixed meaning except that higher numbers mean better splits.

    If the constructed splitting criterion uses an attribute in such
    a way that the attribute is 'completely spent' and should not be
    considered as a split criterion in any of the subtrees (the
    typical case of this are discrete attributes that are used
    as-they-are, that is, without any binarization or subsetting),
    then it should report the index of this attribute. Some splits
    do not spend any attribute; this is indicated by returning a
    negative index.

    A :obj:`SplitConstructor` can veto the further tree induction
    by returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the branches.
    :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null nodes,
    however, are allowed. If there is no split where this condition
    is met, :obj:`SplitConstructor` stops the induction.

    .. attribute:: minSubset

        Sets the minimal number of examples in non-null leaves. As
        always in Orange (where not specified otherwise), "number of
        examples" refers to the weighted number of examples.

    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])

        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`selector`, :obj:`branchDescriptions`
        and :obj:`subsetSizes` are None, while :obj:`quality` is 0.0 and
        :obj:`spentAttribute` is -1.

        :param examples:  Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`, or a
            list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori-distribution: Should be of type
            :obj:`orange.Distribution` and candidates should be a Python
            list of objects which are interpreted as booleans.

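A hedged sketch of calling a split constructor directly, using the
measure-based constructor for discrete attributes described below::

    split = Orange.classification.tree.SplitConstructor_Attribute()
    split.measure = Orange.feature.scoring.GainRatio()
    selector, descriptions, sizes, quality, spent = split(data)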


.. class:: SplitConstructor_Measure

    Bases: :class:`SplitConstructor`

    An abstract base class for split constructors that employ
    a :obj:`Orange.feature.scoring.Measure` to assess the quality of a split.
    At present,
    all split constructors except for :obj:`SplitConstructor_Combined`
    are derived from this class.

    .. attribute:: measure

        A component of type :obj:`Orange.feature.scoring.Measure` used for
        evaluation of a split. Note that you must select a subclass of
        :obj:`Orange.feature.scoring.Measure` capable of handling your
        class type
        - you cannot use :obj:`Orange.feature.scoring.GainRatio`
        for building regression trees or :obj:`Orange.feature.scoring.MSE`
        for classification trees.

    .. attribute:: worstAcceptable

        The lowest required split quality for a split to be acceptable.
        Note that this value makes sense only in connection with a
        :obj:`measure` component. Default is 0.0.

.. class:: SplitConstructor_Attribute

    Bases: :class:`SplitConstructor_Measure`

    Attempts to use a discrete attribute as a split; each value of the
    attribute corresponds to a branch in the tree. Attributes are
    evaluated with the :obj:`measure` and the one with the
    highest score is used for a split. If there is more than one
    attribute with the highest score, one of them is selected at random.

    The constructed :obj:`branchSelector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. If the attribute is
    :obj:`Orange.data.variable.Discrete`,
    :obj:`branchDescription`'s are the attribute's values. The
    attribute is marked as spent, so that it cannot reappear in the
    node's subtrees.

.. class:: SplitConstructor_ExhaustiveBinary

    Bases: :class:`SplitConstructor_Measure`

    Works on discrete attributes. For each attribute, it determines
    which binarization of the attribute gives the split with the
    highest score. If more than one split has the highest score,
    one of them is selected at random. After trying all the attributes,
    it returns one of those with the highest score.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. This time, however, its :obj:`transformer`
    contains an instance of :obj:`MapIntValue` that maps the values
    of the attribute into a binary attribute. Branch descriptions are
    of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding to
    more than one value of the attribute. Branches that correspond to
    a single value of the attribute are described with this value. If
    the attribute was originally binary, it is spent and cannot be
    used in the node's subtrees. Otherwise, it can reappear in the
    subtrees.


.. class:: SplitConstructor_Threshold

    Bases: :class:`SplitConstructor_Measure`

    This is currently the only constructor for splits with continuous
    attributes. It divides the range of attribute values with a threshold
    that maximizes the split's quality. As always, if there is more than
    one split with the highest score, a random threshold is selected.
    The attribute that yields the highest binary split is returned.

    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` with an attached
    :obj:`transformer`. This time, :obj:`transformer` is of type
    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
    "<threshold" and ">=threshold". The attribute is not spent.

.. class:: SplitConstructor_OneAgainstOthers

    Bases: :class:`SplitConstructor_Measure`

    Undocumented.

.. class:: SplitConstructor_Combined

    Bases: :class:`SplitConstructor`

    This constructor delegates the task of finding the optimal split
    to separate split constructors for discrete and for continuous
    attributes. Each split constructor is called, given only attributes
    of the appropriate type as candidates. Both construct a candidate for
    a split; the better of them is selected.

    (Note that there is a problem when more candidates have the same
    score. Suppose there are nine discrete attributes with the highest
    score; the split constructor for discrete attributes will select
    one of them. Now, let us suppose that there is a single continuous
    attribute with the same score. :obj:`SplitConstructor_Combined`
    would randomly select between the proposed discrete attribute and
    the continuous attribute, not aware of the fact that the discrete
    attribute has already competed with eight other discrete attributes.
    So, the probability for selecting (each) discrete attribute would be
    1/18 instead of 1/10. Although not really correct, we doubt that this
    would affect the tree's performance; many other machine learning
    systems simply choose the first attribute with the highest score
    anyway.)

    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
    the attribute is spent is decided by the winning split constructor.

    .. attribute:: discreteSplitConstructor

        Split constructor for discrete attributes; can be, for instance,
        :obj:`SplitConstructor_Attribute` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuousSplitConstructor

        Split constructor for continuous attributes; at the moment, it
        can be either :obj:`SplitConstructor_Threshold` or a
        split constructor you programmed in Python.
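
    A hedged sketch of assembling the combined constructor from the
    documented attributes::

        split = Orange.classification.tree.SplitConstructor_Combined()
        split.discreteSplitConstructor = \
            Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
        split.continuousSplitConstructor = \
            Orange.classification.tree.SplitConstructor_Threshold()
        learner.split = split   # plug into a TreeLearnerBase as above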
1216
1217StopCriteria and StopCriteria_common
1218============================================
1219
1220obj:`StopCriteria` determines when to stop the induction of subtrees.
1221
1222.. class:: StopCriteria
1223
1224    Given a set of examples, weight ID and contingency matrices, decide
1225    whether to continue the induction or not. The basic criterion checks
1226    whether there are any examples and whether they belong to at least
1227    two different classes (if the class is discrete). Derived components
1228    check things like the number of examples and the proportion of
1229    majority classes.
1230
1231    As opposed to :obj:`SplitConstructor` and similar basic classes,
1232    :obj:`StopCriteria` is not an abstract but a fully functional
1233    class that provides the basic stopping criteria. That is, the tree
1234    induction stops when there is at most one example left; in this case,
1235    it is not the weighted but the actual number of examples that counts.
1236    Besides that, the induction stops when all examples are in the same
1237    class (for discrete problems) or have the same value of the outcome
1238    (for regression problems).
1239
    .. method:: __call__(examples[, weightID, domainContingency])
1241
1242        Decides whether to stop (true) or continue (false) the induction.
1243        If contingencies are given, they are used for checking whether
1244        the examples are in the same class (but not for counting the
1245        examples). Derived classes should use the contingencies whenever
1246        possible. If contingencies are not given, :obj:`StopCriteria`
1247        will work without them. Derived classes should also use them if
1248        they are available, but otherwise compute them only when they
1249        really need them.
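
    For example, the basic criterion can be called directly (a sketch;
    assumes `lenses.tab`)::

        import Orange

        data = Orange.data.Table("lenses")
        stop = Orange.classification.tree.StopCriteria()
        print stop(data)   # false - induction should continue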
1250
1251
1252
1253.. class:: StopCriteria_common
1254
    :obj:`StopCriteria_common` contains additional criteria for pre-pruning:
    it checks the proportion of the majority class and the number of weighted
    examples.
1258
    .. attribute:: maxMajority

        Maximal proportion of majority class. When this is exceeded,
        induction stops.
1263
1264    .. attribute:: minExamples
1265
        Minimal number of examples in internal nodes. Subsets with fewer
        than :obj:`minExamples` examples are not split any further.
        The example count is weighted.
1269
1270.. class:: StopCriteria_Python
1271
1272    Undocumented.
1273
1274Classes derived from ExampleSplitters
1275========================================
1276
1277The :obj:`ExampleSplitter` sorts the learning examples into branches.
1278
1279.. class:: ExampleSplitter
1280
1281    Just like the :obj:`Descender` decides the branch for an
1282    example during classification, the :obj:`ExampleSplitter`
1283    sorts the learning examples into branches.
1284
    :obj:`ExampleSplitter` is given a :obj:`Node` (from which
    it can use various data, though most splitters only use the
    :obj:`branchSelector`), a set of examples to be divided, and
1288    the weight ID. The result is a list of subsets of examples
1289    and, optionally, a list of new weight ID's.
1290
    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most
    :obj:`ExampleSplitter` subclasses simply call the node's
    :obj:`branchSelector` and assign examples to the corresponding
    branches. When the value is unknown they choose a particular
    branch or simply skip the example.
1296
1297    Some enhanced splitters can split examples. An example (actually,
1298    a pointer to it) is copied to more than one subset. To facilitate
1299    real splitting, weights are needed. Each branch is assigned a
1300    weight ID (each would usually have its own ID) and all examples
1301    that are in that branch (either completely or partially) should
1302    have this meta attribute. If an example hasn't been split, it
1303    has only one additional attribute - with weight ID corresponding
    to the subset to which it went. An example that is split between,
    say, three subsets has three new meta attributes, one for each
    subset. The IDs of the weight meta attributes are returned by the
    :obj:`ExampleSplitter` to be used at induction of the
    corresponding subtrees.

    Note that weights are used only when needed. When no splitting
    occurred - because the splitter is not able to do it or because
    there was no need for splitting - no weight IDs are returned.
1313
    :obj:`ExampleSplitter` is thus an abstract base class for objects
    that split sets of examples into subsets. The derived classes differ
    in their treatment of examples which cannot be unambiguously placed
    into a single branch (usually due to an unknown value of the crucial
    attribute).
1318
1319    .. method:: __call__(node, examples[, weightID])
1320       
1321        Use the information in :obj:`node` (particularly the
1322        :obj:`branchSelector`) to split the given set of examples into subsets.
1323        Return a tuple with a list of example generators and a list of weights.
        The list of weights is either an ordinary Python list of integers
        or None when no splitting of examples occurs and thus no weights
        are needed.
1327
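    A splitter is usually not called directly but plugged into the
    learner; note that this module exports the splitters under shortened
    names such as :obj:`Splitter_UnknownsToAll` (a sketch)::

        import Orange

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.exampleSplitter = \
            Orange.classification.tree.Splitter_UnknownsToAll()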
1328
1329.. class:: ExampleSplitter_IgnoreUnknowns
1330
1331    Bases: :class:`ExampleSplitter`
1332
1333    Simply ignores the examples for which no single branch can be determined.
1334
1335.. class:: ExampleSplitter_UnknownsToCommon
1336
1337    Bases: :class:`ExampleSplitter`
1338
    Places all such examples into the branch with the highest number of
    examples. If there is more than one such branch, one is selected at
    random and then used for all examples.
1342
1343.. class:: ExampleSplitter_UnknownsToAll
1344
1345    Bases: :class:`ExampleSplitter`
1346
1347    Places examples with unknown value of the attribute into all branches.
1348
1349.. class:: ExampleSplitter_UnknownsToRandom
1350
1351    Bases: :class:`ExampleSplitter`
1352
1353    Selects a random branch for such examples.
1354
1355.. class:: ExampleSplitter_UnknownsToBranch
1356
1357    Bases: :class:`ExampleSplitter`
1358
1359    Constructs an additional branch to contain all such examples.
1360    The branch's description is "unknown".
1361
1362.. class:: ExampleSplitter_UnknownsAsBranchSizes
1363
1364    Bases: :class:`ExampleSplitter`
1365
1366    Splits examples with unknown value of the attribute according to
1367    proportions of examples in each branch.
1368
1369.. class:: ExampleSplitter_UnknownsAsSelector
1370
1371    Bases: :class:`ExampleSplitter`
1372
    Splits examples with unknown value of the attribute according to
    the distribution proposed by the selector (which is in most cases
    the same as the proportions of examples in the branches).
1376
1377Descender and derived classes
1378=================================
1379
1380This is a classifier's counterpart for :obj:`ExampleSplitter`. It
1381decides the destiny of examples that need to be classified and cannot
1382be unambiguously put in a branch. The detailed function of this class
is given in the description of classification with trees. XXXXXX
1384
1385.. class:: Descender
1386
1387    An abstract base object for tree descenders.
1388
1389    .. method:: __call__(node, example)
1390
1391        Descends down the tree until it reaches a leaf or a node in
1392        which a vote of subtrees is required. In both cases, a tuple
        of two elements is returned; in the former, the tuple contains
        the reached node and None, while in the latter it contains
        a node and the weights of votes for subtrees (a list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the
        returned vote weights.
1403
1404.. class:: Descender_UnknownsToNode
1405
1406    Bases: :obj:`Descender`
1407
    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`NodeClassifier`
    will be used to make a decision. It is your responsibility to
    ensure that even the internal nodes have their :obj:`NodeClassifier`
    (i.e., don't disable creating node classifiers or manually remove
    them after the induction).
1414
1415.. class:: Descender_UnknownsToBranch
1416
1417    Bases: :obj:`Descender`
1418
1419    Classifies examples with unknown value to a special branch. This
1420    makes sense only if the tree itself was constructed with
1421    :obj:`ExampleSplitter_UnknownsToBranch`.
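
    A minimal sketch of such a pairing, using the shortened names this
    module exports (:obj:`Splitter_UnknownsToBranch` and
    :obj:`Descender_UnknownToBranch`) and assuming `lenses.tab`::

        import Orange

        data = Orange.data.Table("lenses")
        learner = Orange.classification.tree.TreeLearnerBase()
        learner.exampleSplitter = \
            Orange.classification.tree.Splitter_UnknownsToBranch()
        classifier = learner(data)
        classifier.descender = \
            Orange.classification.tree.Descender_UnknownToBranch()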
1422
1423.. class:: Descender_UnknownsToCommonBranch
1424
1425    Bases: :obj:`Descender`
1426
1427    Classifies examples with unknown values to the branch with the
1428    highest number of examples. If there is more than one such branch,
    a random branch is chosen for each example that is to be classified.
1430
1431.. class:: Descender_UnknownsToCommonSelector
1432
1433    Bases: :obj:`Descender`
1434
1435    Classifies examples with unknown values to the branch which received
1436    the highest recommendation by the selector.
1437
1438.. class:: Descender_MergeAsBranchSizes
1439
1440    Bases: :obj:`Descender`
1441
1442    Makes the subtrees vote for the example's class; the vote is
1443    weighted according to the sizes of the branches.
1444
1445.. class:: Descender_MergeAsSelector
1446
1447    Bases: :obj:`Descender`
1448
1449    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.
1451
1452Pruner and derived classes
1453==============================
1454
1455.. index::
1456    pair: classification trees; pruning
1457
Classes derived from :obj:`Pruner` prune the trees as
described in the section on pruning XXXXXXXX - make sure you read it
to understand what the pruners will do to your trees.
1461
1462.. class:: Pruner
1463
1464    This is an abstract base class which defines nothing useful, only
1465    a pure virtual call operator.
1466
1467    .. method:: __call__(tree)
1468
1469        Prunes a tree. The argument can be either a tree classifier or
1470        a tree node; the result is of the same type as the argument.
1471
1472.. class:: Pruner_SameMajority
1473
1474    Bases: :class:`Pruner`
1475
    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
1477    with more than one leaf) in which all the leaves have the same majority
1478    class. (This is allowed because those leaves can still have different
1479    distributions of classes and thus predict different probabilities.)
1480    However, this can be undesired when we're only interested in the
1481    class prediction or a simple tree interpretation. The
1482    :obj:`Pruner_SameMajority` prunes the tree so that there is no
1483    subtree in which all the nodes would have the same majority class.
1484
1485    This pruner will only prune the nodes in which the node classifier
    is of class :obj:`orange.DefaultClassifier` (or a derived class).
1487
1488    Note that the leaves with more than one majority class require some
1489    special handling. The pruning goes backwards, from leaves to the root.
1490    When siblings are compared, the algorithm checks whether they
1491    have (at least one) common majority class. If so, they can be pruned.
1492
1493.. class:: Pruner_m
1494
1495    Bases: :class:`Pruner`
1496
1497    Prunes a tree by comparing m-estimates of static and dynamic
1498    error as defined in (Bratko, 2002).
1499
1500    .. attribute:: m
1501
1502        Parameter m for m-estimation.
1503
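    Pruners are applied to an already induced tree (a sketch mirroring
    how :class:`TreeLearner` invokes them; assumes `lenses.tab`)::

        import Orange

        data = Orange.data.Table("lenses")
        unpruned = Orange.classification.tree.TreeLearnerBase()(data)
        pruned = Orange.classification.tree.Pruner_m(unpruned, m=2.0)
        pruned = Orange.classification.tree.Pruner_SameMajority(pruned)
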
1504========
1505Examples
1506========
1507
1508This page does not provide examples for programming your own components,
1509such as, for instance, a :obj:`SplitConstructor`. Those examples
1510can be found on a page dedicated to callbacks to Python XXXXXXXX.
1511
1512Tree Structure
1513==============
1514
To have something to work on, we'll take the data from the lenses
data set and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):
1517
1518.. literalinclude:: code/treestructure.py
1519   :lines: 7-10
1520
1521How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?
1522
1523.. _lenses.tab: code/lenses.tab
1524.. _treestructure.py: code/treestructure.py
1525
1526.. literalinclude:: code/treestructure.py
1527   :lines: 12-21
1528
1529If node is None, we have a null-node; null nodes don't count,
1530so we return 0. Otherwise, the size is 1 (this node) plus the
1531sizes of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
1533attempt to skip the if statement: leaves don't have an empty list
1534of branches, they don't have a list of branches at all.
1535
1536    >>> treeSize(treeClassifier.tree)
1537    10
1538
Don't forget that this was only an exercise - :obj:`Node` has a
1540built-in method :obj:`Node.treeSize` that does exactly the same.
1541
1542Let us now write a simple script that prints out a tree. The recursive
1543part of the function will get a node and its level (part of `treestructure.py`_, uses `lenses.tab`_).
1544
1545.. literalinclude:: code/treestructure.py
1546   :lines: 26-41
1547
Don't waste time studying the formatting tricks (\n's etc.); they are
just for nicer output. What matters is everything but the print
statements. First, we check whether the node is a null-node (a node
to which no learning examples were classified). If this is so, we just
print out "<null node>" and return.
1553
1554After handling null nodes, remaining nodes are internal nodes and leaves.
1555For internal nodes, we print a node description consisting of the
attribute's name and the distribution of classes. :obj:`Node`'s
:obj:`branchSelector` is, for all currently defined splits, an instance
of a class derived from :obj:`orange.Classifier` (in fact, it is an
:obj:`orange.ClassifierFromVarFD`, but an :obj:`orange.Classifier` would
suffice), and its :obj:`classVar` XXXXX points to the attribute we seek.
1561So we print its name. We will also assume that storing class distributions
has not been disabled and print them as well. A more capable function
for printing trees (:meth:`TreeClassifier.dump`) has an alternative
means to get the distribution when this fails. Then we iterate
through the branches; for each, we print the branch description and
recursively call :obj:`printTree0` with the level increased by 1 (to
increase the indent).
1568
1569Finally, if the node is a leaf, we print out the distribution of
1570learning examples in the node and the class to which the examples in
1571the node would be classified. We again assume that the :obj:`nodeClassifier`
1572is the default one - a :obj:`DefaultClassifier`. A better print
1573function should be aware of possible alternatives.
1574
1575Now, we just need to write a simple function to call our printTree0.
1576We could write something like...
1577
1578::
1579
1580    def printTree(x):
1581        printTree0(x.tree, 0)
1582
1583... but we won't. Let us learn how to handle arguments of different
1584types. Let's write a function that will accept either a :obj:`TreeClassifier`
1585or a :obj:`Node`; just like :obj:`Pruner`, remember? Part of `treestructure.py`_, uses `lenses.tab`_.
1586
1587.. literalinclude:: code/treestructure.py
1588   :lines: 43-49
1589
1590It's fairly straightforward: if :obj:`x` is of type derived from
1591:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's
1592:obj:`Node` we just call :obj:`printTree0` with :obj:`x`. If it's
1593of some other type, we don't know how to handle it and thus raise
1594an exception. (Note that we could also use
1595
1596::
1597
1598    if isinstance(x, Orange.classification.tree.TreeClassifier)
1599
but this would only work if :obj:`x` were of type
:obj:`TreeClassifier` and not of any derived types. The latter,
1602however, do not exist yet...)
1603
1604    >>> printTree(treeClassifier)
1605    tear_rate (<15.000, 5.000, 4.000>)
1606    : reduced --> none (<12.000, 0.000, 0.000>)
1607    : normal
1608       astigmatic (<3.000, 5.000, 4.000>)
1609       : no
1610          age (<1.000, 5.000, 0.000>)
1611          : young --> soft (<0.000, 2.000, 0.000>)
1612          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
1613          : presbyopic --> none (<1.000, 1.000, 0.000>)
1614       : yes
1615          prescription (<2.000, 0.000, 4.000>)
1616          : myope --> hard (<0.000, 0.000, 3.000>)
1617          : hypermetrope --> none (<2.000, 0.000, 1.000>)
1618
1619For a final exercise, let us write a simple pruning procedure. It will
1620be written entirely in Python, unrelated to any :obj:`Pruner`. Our
1621procedure will limit the tree depth - the maximal depth (here defined
1622as the number of internal nodes on any path down the tree) shall be
1623given as an argument. For example, to get a two-level tree, we would
1624call cutTree(root, 2). The function will be recursive, with the second
1625argument (level) decreasing at each call; when zero, the current node
1626will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_):
1627
1628.. literalinclude:: code/treestructure.py
1629   :lines: 54-62
1630
1631There's nothing to prune at null-nodes or leaves, so we act only when
1632:obj:`node` and :obj:`node.branchSelector` are defined. If level is
1633not zero, we call the function for each branch. Otherwise, we clear
1634the selector, branches and branch descriptions.
1635
1636    >>> cutTree(tree.tree, 2)
1637    >>> printTree(tree)
1638    tear_rate (<15.000, 5.000, 4.000>)
1639    : reduced --> none (<12.000, 0.000, 0.000>)
1640    : normal
1641       astigmatic (<3.000, 5.000, 4.000>)
1642       : no --> soft (<1.000, 5.000, 0.000>)
1643       : yes --> hard (<2.000, 0.000, 4.000>)
1644
1645Learning
1646========
1647
1648You've already seen a simple example of using a :obj:`TreeLearnerBase`.
1649You can just call it and let it fill the empty slots with the default
components. This section will teach you three things: what the
missing components are (and how to set the same components yourself),
1652how to use alternative components to get a different tree and,
1653finally, how to write a skeleton for tree induction in Python.
1654
1655Default components for TreeLearnerBase
1656======================================
1657
1658Let us construct a :obj:`TreeLearnerBase` to play with.
1659
1660.. _treelearner.py: code/treelearner.py
1661
1662`treelearner.py`_, uses `lenses.tab`_:
1663
1664.. literalinclude:: code/treelearner.py
1665   :lines: 7-10
1666
1667There are three crucial components in learning: the split and stop
1668criteria, and the :obj:`exampleSplitter` (there are some others,
1669which become important during classification; we'll talk about them
1670later). They are not defined; if you use the learner, the slots are
1671filled temporarily but later cleared again.
1672
1673::
1674
1675    >>> print learner.split
1676    None
1677    >>> learner(data)
1678    <TreeClassifier instance at 0x01F08760>
1679    >>> print learner.split
1680    None
1681
1682Stopping criteria
1683=================
1684
The stop criterion is trivial. The default is set by

::

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()
1689
1690Well, this is actually done in C++ and it uses a global component
1691that is constructed once for all, but apart from that we did
1692effectively the same thing.
1693
1694We can now examine the default stopping parameters.
1695
1696    >>> print learner.stop.maxMajority, learner.stop.minExamples
1697    1.0 0.0
1698
1699Not very restrictive. This keeps splitting the examples until
1700there's nothing left to split or all the examples are in the same
1701class. Let us set the minimal subset that we allow to be split to
1702five examples and see what comes out.
1703
1704    >>> learner.stop.minExamples = 5.0
1705    >>> tree = learner(data)
1706    >>> print tree.dump()
1707    tear_rate=reduced: none (100.00%)
1708    tear_rate=normal
1709    |    astigmatic=no
1710    |    |    age=pre-presbyopic: soft (100.00%)
1711    |    |    age=presbyopic: none (50.00%)
1712    |    |    age=young: soft (100.00%)
1713    |    astigmatic=yes
1714    |    |    prescription=hypermetrope: none (66.67%)
1715    |    |    prescription=myope: hard (100.00%)
1716
1717OK, that's better. If we want an even smaller tree, we can also limit
1718the maximal proportion of majority class.
1719
1720    >>> learner.stop.maxMajority = 0.5
1721    >>> tree = learner(data)
1722    >>> print tree.dump()
1723    none (62.50%)
1724
1725Undocumented
1726============
1727
1728.. class:: NodeList
1729
1730    Undocumented.
1731
1732.. class:: C45NodeList
1733
1734    Undocumented.
1735
1736===========================
1737C4.5 Classifier and Learner
1738===========================
1739
1740As C4.5 is a standard benchmark in machine learning,
1741it is incorporated in Orange, although Orange has its own
1742implementation of decision trees.
1743
The implementation uses the original Quinlan's code for learning, so the
tree you get is exactly like the one that would be built by the standalone
C4.5. Upon return, however, the original tree is copied to Orange
components that contain exactly the same information plus what is needed
to make them visible from Python. To be sure that the algorithm behaves
just as the original, we use a dedicated class :class:`C45Node`
instead of reusing the components used by Orange's tree inducer
(i.e., :class:`Node`). This, however, could be done and probably
1752will be done in the future; we shall still retain :class:`C45Node`
1753but offer transformation to :class:`Node` so that routines
1754that work on Orange trees will also be usable for C45 trees.
1755
1756:class:`C45Learner` and :class:`C45Classifier` behave
like any other Orange learner and classifier. Unlike most Orange
learning algorithms, C4.5 does not accept weighted examples.
1759
1760Building the C4.5 plug-in
1761=========================
1762
1763We haven't been able to obtain the legal rights to distribute
1764C4.5 and therefore couldn't statically link it into Orange. Instead,
1765it's incorporated as a plug-in which you'll need to build yourself.
1766The procedure is trivial, but you'll need a C compiler. On Windows,
1767the scripts we provide work with MS Visual C and the files CL.EXE
1768and LINK.EXE must be on the PATH. On Linux you're equipped with
1769gcc. Mac OS X comes without gcc, but you can download it for
1770free from Apple.
1771
1772Orange must be installed prior to building C4.5. (This is because
1773the build script will copy the created file next to Orange,
1774which it obviously can't if Orange isn't there yet.)
1775
1776#. Download the
1777   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
1778   from the `Rule Quest's site <http://www.rulequest.com/>`_ and extract
1779   them into some temporary directory. The files will be modified in the
1780   further process, so don't use your copy of Quinlan's sources that you
1781   need for another purpose.
1782#. Download
1783   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
1784   and unzip its contents into the directory R8/Src of the Quinlan's
1785   stuff (it's the directory that contains, for instance, the file
1786   average.c).
1787#. Run buildC45.py, which will build the plug-in and put it next to
1788   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
   If this fails, something went wrong; see the diagnostic messages from
   buildC45.py and read the paragraph below.
1793#. Finally, you can remove the Quinlan's stuff, along with everything
1794   created by buildC45.py.
1795
1796If the process fails, here's what buildC45.py really does: it creates
1797.h files that wrap Quinlan's .i files and ensure that they are not
1798included twice. It modifies C4.5 sources to include .h's instead of
1799.i's. This step can hardly fail. Then follows the platform dependent
1800step which compiles ensemble.c (which includes all the Quinlan's .c
1801files it needs) into c45.dll or c45.so and puts it next to Orange.
1802If this fails, but you do have a C compiler and linker, and you know
1803how to use them, you can compile the ensemble.c into a dynamic
1804library yourself. See the compile and link steps in buildC45.py,
if it helps. Anyway, after doing this, check that the built C4.5
gives the same results as the original.
1807
1808.. class:: C45Learner
1809
1810    :class:`C45Learner`'s attributes have double names - those that
1811    you know from C4.5 command lines and the corresponding names of C4.5's
1812    internal variables. All defaults are set as in C4.5; if you change
1813    nothing, you are running C4.5.
1814
1815    .. attribute:: gainRatio (g)
1816       
        Determines whether to use information gain (false, default)
1818        or gain ratio for selection of attributes (true).
1819
1820    .. attribute:: batch (b)
1821
1822        Turn on batch mode (no windows, no iterations); this option is
1823        not documented in C4.5 manuals. It conflicts with "window",
1824        "increment" and "trials".
1825
1826    .. attribute:: subset (s)
1827       
        Enables subsetting (default: false, no subsetting).
1829 
1830    .. attribute:: probThresh (p)
1831
1832        Probabilistic threshold for continuous attributes (default: false).
1833
1834    .. attribute:: minObjs (m)
1835       
1836        Minimal number of objects (examples) in leaves (default: 2).
1837
1838    .. attribute:: window (w)
1839
        Initial window size (default: maximum of 20% and twice the
1841        square root of the number of data objects).
1842
1843    .. attribute:: increment (i)
1844
1845        The maximum number of objects that can be added to the window
1846        at each iteration (default: 20% of the initial window size).
1847
1848    .. attribute:: cf (c)
1849
        Pruning confidence level (default: 25%).
1851
1852    .. attribute:: trials (t)
1853
1854        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1855
1856    .. attribute:: prune
1857       
        Return a pruned tree (not an original C4.5 option; default: true).
1859
1860
:class:`C45Learner` also offers another way of setting
the arguments: it provides a method :obj:`C45Learner.commandline`
1863which is given a string and parses it the same way as C4.5 would
1864parse its command line. XXXXXXXXXXX
1865
1866.. class:: C45Classifier
1867
1868    A faithful reimplementation of Quinlan's function from C4.5. The only
1869    difference (and the only reason it's been rewritten) is that it uses
1870    a tree composed of :class:`C45Node` instead of C4.5's
1871    original tree structure.
1872
1873    .. attribute:: tree
1874
1875        C4.5 tree stored as a tree of :obj:`C45Node`.
1876
1877
1878.. class:: C45Node
1879
1880    This class is a reimplementation of the corresponding *struct* from
1881    Quinlan's C4.5 code.
1882
1883    .. attribute:: nodeType
1884
1885        Type of the node:  :obj:`C45Node.Leaf` (0),
1886        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
1887        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
1888        split examples based on values of a discrete attribute,
1889        "cuts" cut them according to a threshold value of a continuous
1890        attributes and "subsets" use discrete attributes but with subsetting
1891        so that several values can go into the same branch.
1892
1893    .. attribute:: leaf
1894
1895        Value returned by that leaf. The field is defined for internal
1896        nodes as well.
1897
1898    .. attribute:: items
1899
1900        Number of (learning) examples in the node.
1901
1902    .. attribute:: classDist
1903
1904        Class distribution for the node (of type
1905        :obj:`orange.DiscDistribution`).
1906
1907    .. attribute:: tested
1908       
        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.
1913
1914    .. attribute:: cut
1915
1916        A threshold for continuous attributes, if node is of type :obj:`Cut`.
1917        Undefined otherwise.
1918
1919    .. attribute:: mapping
1920
        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the index of the branch for an example whose value of
        :obj:`tested` has index *i*. Here, *i* denotes the index of the
        value, not a :class:`Orange.data.Value`.
1924
1925    .. attribute:: branch
1926       
1927        A list of branches stemming from this node.
1928
1929Examples
1930========
1931
1932.. _tree_c45.py: code/tree_c45.py
1933.. _iris.tab: code/iris.tab
1934
1935The simplest way to use :class:`C45Learner` is to call it. This
1936script constructs the same learner as you would get by calling
1937the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):
1938
1939.. literalinclude:: code/tree_c45.py
1940   :lines: 7-14
1941
Arguments can be set by the usual mechanism (the two lines below do the
same, except that one uses command-line symbols and the other internal
variable names)
1945
1946::
1947
1948    tree = Orange.classification.tree.C45Learner(data, m=100)
1949    tree = Orange.classification.tree.C45Learner(data, minObjs=100)
1950
The way preferred by veteran C4.5 users might be through the
method :obj:`C45Learner.commandline`.
1953
1954::
1955
    lrn = Orange.classification.tree.C45Learner()
1957    lrn.commandline("-m 1 -s")
1958    tree = lrn(data)
1959
There's nothing special about using :obj:`C45Classifier` - it's
just like any other classifier. To demonstrate what the structure of
:class:`C45Node`'s looks like, we will show a script that prints
it out in the same format as C4.5 does.
1964
1965.. literalinclude:: code/tree_c45_printtree.py
1966
1967Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
1969:obj:`C45Node` does not know to which attribute it belongs), we need to
1970convert it to a string through :obj:`classVar`, which is passed as an
1971extra argument to the recursive part of printTree.
1972
1973For discrete splits without subsetting, we print out all attribute values
1974and recursively call the function for all branches. Continuous splits are
1975equally easy to handle.
1976
For discrete splits with subsetting, we iterate through the branches,
collect into :samp:`inset` the values that go into each branch, turn
the values into strings and print them out, treating separately the
case when only a single value goes into the branch.
1981
1982Printing out C45 Tree
1983=====================
1984
1985.. autofunction:: printTreeC45
1986
1987References
1988==========
1989
1990Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`, Addison
1991Wesley, 2002.
1992
1993E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories,
1994Murray Hill NJ, U.S.A., October 1993.
1995
1996`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_
1997A home page of AT&T's dot and similar software packages.
1998
1999"""
2000
2001from Orange.core import \
2002     TreeLearner as TreeLearnerBase, \
2003         TreeClassifier as _TreeClassifier, \
2004         C45Learner, \
2005         C45Classifier, \
2006         C45TreeNode as C45Node, \
2007         C45TreeNodeList as C45NodeList, \
2008         TreeDescender as Descender, \
2009              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
2010              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
2011              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
2012              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
2013              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
2014         TreeExampleSplitter as Splitter, \
2015              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
2016              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
2017              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
2018              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
2019              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
2020              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
2021              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
2022         TreeNode as Node, \
2023         TreeNodeList as NodeList, \
2024         TreePruner as Pruner, \
2025              TreePruner_SameMajority as Pruner_SameMajority, \
2026              TreePruner_m as Pruner_m, \
2027         TreeSplitConstructor as SplitConstructor, \
2028              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
2029              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
2030                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
2031                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
2032                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
2033                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
2034         TreeStopCriteria as StopCriteria, \
2035              TreeStopCriteria_Python as StopCriteria_Python, \
2036              TreeStopCriteria_common as StopCriteria_common
2037
2038import Orange.core
2039import operator
2040import base64
2041import re
2042import Orange.data
2043import Orange.feature.scoring
2044import Orange.classification.tree
2045
2046def _c45_showBranch(node, classvar, lev, i):
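    # Print one branch of `node` in C4.5's format: discrete splits
    # (nodeType 1) print "attr = value", cuts (nodeType 2) print
    # "attr <= t" or "attr > t", and subset splits list every value
    # mapped to branch i.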
2047    var = node.tested
2048    if node.nodeType == 1:
2049        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
2050        _c45_printTree0(node.branch[i], classvar, lev+1)
2051    elif node.nodeType == 2:
2052        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
2053        _c45_printTree0(node.branch[i], classvar, lev+1)
2054    else:
2055        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
2056        inset = [var.values[j[0]] for j in inset]
2057        if len(inset)==1:
2058            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
2059        else:
2060            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
2061        _c45_printTree0(node.branch[i], classvar, lev+1)
2062       
2063       
2064def _c45_printTree0(node, classvar, lev):
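    # Leaves print their class and example count; internal nodes print
    # their leaf branches first and the subtree branches afterwards,
    # as C4.5 does.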
2065    var = node.tested
2066    if node.nodeType == 0:
2067        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
2068    else:
2069        for i, branch in enumerate(node.branch):
2070            if not branch.nodeType:
2071                _c45_showBranch(node, classvar, lev, i)
2072        for i, branch in enumerate(node.branch):
2073            if branch.nodeType:
2074                _c45_showBranch(node, classvar, lev, i)
2075
2076def printTreeC45(tree):
2077    """
2078    Prints the tree given as an argument in the same form as Ross Quinlan's
2079    C4.5 program.
2080
2081    ::
2082
2083        import Orange
2084
2085        data = Orange.data.Table("voting")
2086        c45 = Orange.classification.tree.C45Learner(data)
2087        Orange.classification.tree.printTreeC45(c45)
2088
2089    will print out
2090
2091    ::
2092
2093        physician-fee-freeze = n: democrat (253.4)
2094        physician-fee-freeze = y:
2095        |   synfuels-corporation-cutback = n: republican (145.7)
2096        |   synfuels-corporation-cutback = y:
2097        |   |   mx-missile = y: democrat (6.0)
2098        |   |   mx-missile = n:
2099        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
2100        |   |   |   adoption-of-the-budget-resolution = y:
2101        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
2102        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
2103
2104
    If you run the original C4.5 (that is, the standalone C4.5 program;
    Orange does use the original C4.5 code) on the same data, it will print out
2106
2107    ::
2108
2109        physician-fee-freeze = n: democrat (253.4/5.9)
2110        physician-fee-freeze = y:
2111        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
2112        |   synfuels-corporation-cutback = y:
2113        |   |   mx-missile = y: democrat (6.0/2.4)
2114        |   |   mx-missile = n:
2115        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
2116        |   |   |   adoption-of-the-budget-resolution = y:
2117        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
2118        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
2119
    which is reassuringly similar, except that C4.5 tested the tree on
    the learning data and has also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously can't do
    (nor is there any need that it should).
2124
2125    """
2126    _c45_printTree0(tree.tree, tree.classVar, 0)
2127
2128
2129class TreeLearner(Orange.core.Learner):
2130    """
2131    Assembles the generic classification or regression tree learner
2132    (from Orange's objects for induction of decision trees).
2133    :class:`TreeLearner` is essentially a wrapper
2134    around :class:`TreeLearnerBase`, provided for easier use of the latter.
2135    It sets a number of parameters used in induction that
2136    can also be set after the creation of the object, most often through
2137    the object's attributes. If upon initialization
2138    :class:`TreeLearner` is given a set of examples, then an instance
2139    of :class:`TreeClassifier` object is returned instead.
2140
2141    The values of attributes can be also be set in the constructor.
2142
2143    .. attribute:: nodeLearner
2144
2145        Induces a classifier from examples belonging to a node. The
2146        same learner is used for internal nodes and for leaves. The
2147        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
2148
2149    **Split construction**
2150
2151    .. attribute:: split
2152       
2153        Defines a function that will be used in place of
2154        :obj:`SplitConstructor`.
2155        Useful when prototyping new tree induction
2156        algorithms. When this parameter is defined, other parameters that
2157        affect the procedures for growing of the tree are ignored. These
2158        include :obj:`binarization`, :obj:`measure`,
        :obj:`worstAcceptable` and :obj:`minSubset` (default:
        :class:`SplitConstructor_Combined`
2161        with separate constructors for discrete and continuous attributes.
2162        Discrete attributes are used as they are, while
2163        continuous attributes are binarized.
2164        Gain ratio is used to select attributes.
2165        A minimum of two examples in a leaf is required for
2166        discrete and five examples in a leaf for continuous attributes.)
2167
2168    .. attribute:: binarization
2169
2170        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
2171        If 2, use :class:`SplitConstructor_OneAgainstOthers`. If
2172        0, do not use binarization (use :class:`SplitConstructor_Attribute`).
2173        Default: 0.
2174
2175    .. attribute:: measure
2176   
2177        Measure for scoring of the attributes when deciding which of the
2178        attributes will be used for splitting of the example set in the node.
2179        Can be either a :class:`Orange.feature.scoring.Measure` or one of
2180        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
2181        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
2182        "gini" (:class:`Orange.feature.scoring.Gini`),
2183        "relief" (:class:`Orange.feature.scoring.Relief`),
2184        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
2185
2186    .. attribute:: reliefM, reliefK
2187
        Set `m` and `k` to the given values if the :obj:`measure` is relief.
2189
2190    **Pruning**
2191
2192    .. attribute:: worstAcceptable
2193
2194        Used in pre-pruning, sets the lowest required attribute
2195        score. If the score of the best attribute is below this margin, the
2196        tree at that node is not grown further (default: 0).
2197
2198        So, to allow splitting only when gainRatio (the default measure)
2199        is greater than 0.6, one should run the learner like this:
2200        :samp:`l = Orange.classification.tree.TreeLearner(data, worstAcceptable=0.6)`
2201
2202    .. attribute:: minSubset
2203
2204        Minimal number of examples in non-null leaves (default: 0).
2205
2206    .. attribute:: minExamples
2207
        Data subsets with fewer than :obj:`minExamples`
        examples are not split any further; that is, all leaves in the tree
        will contain at least that many examples (default: 0).
2211
2212    .. attribute:: maxDepth
2213
        Gives maximal tree depth; 0 means that only the root node is
        generated. The default is 100.
2216
2217    .. attribute:: maxMajority
2218
        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0).
        To stop the induction as soon as the majority class reaches 70%,
        you should use :samp:`maxMajority = 0.7`.
2223
2224        This is an example of the tree on iris data set, with
2225        :samp:`maxMajority = 0.7`. The numbers show the majority class
2226        proportion at each node. The script `tree2.py`_ induces and
2227        prints this tree.
2228
2229        .. _tree2.py: code/tree2.py
2230
2231        ::
2232
2233            root: 0.333
2234            |    petal width<=0.800: 1.000
2235            |    petal width>0.800: 0.500
2236            |    |    petal width<=1.750: 0.907
2237            |    |    petal width>1.750: 0.978
2238   
2239    .. attribute:: stop
2240
2241        Used for passing a function which is used in place of
2242        :class:`StopCriteria`. Useful when prototyping new
        tree induction algorithms. See the documentation on
2244        :class:`StopCriteria` for more info on this function.
2245        When used, parameters  :obj:`maxMajority` and :obj:`minExamples`
2246        will not be  considered (default: None).
2247        The default stopping criterion stops induction when all examples
2248        in a node belong to the same class.
2249
2250    .. attribute:: mForPruning
2251
2252        If non-zero, invokes an error-based bottom-up post-pruning,
2253        where m-estimate is used to estimate class probabilities
2254        (default: 0).
2255
2256    .. attribute:: sameMajorityPruning
2257
2258        If true, invokes a bottom-up post-pruning by removing the
2259        subtrees of which all leaves classify to the same class
2260        (default: False).
2261
2262    **Record keeping**
2263
2264    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
2265
        Determines whether to store class distributions, contingencies and
        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. By default everything except
        :obj:`storeExamples` is enabled. You won't save any memory by storing
        contingencies but not distributions, since the distribution is the
        same object as the one stored in :obj:`contingency.classes`
        (default: True, except for :obj:`storeExamples`, which defaults
        to False).
2274   
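    For example (a sketch; assumes `lenses.tab`)::

        import Orange

        data = Orange.data.Table("lenses")
        tree = Orange.classification.tree.TreeLearner(data, maxDepth=2,
            sameMajorityPruning=True)
        print tree.dump()
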
2275    """
2276    def __new__(cls, examples = None, weightID = 0, **argkw):
2277        self = Orange.core.Learner.__new__(cls, **argkw)
2278        if examples:
2279            self.__init__(**argkw)
2280            return self.__call__(examples, weightID)
2281        else:
2282            return self
2283     
2284    def __init__(self, **kw):
2285        self.learner = None
2286        self.__dict__.update(kw)
2287     
2288    def __call__(self, examples, weight=0):
2289        """
2290        Return a classifier from the given examples.
2291        """
2292        bl = self._base_learner()
2293
        # build the second part of the split: if neither a split constructor
        # nor a measure was given, pick the default measure for the class type
2295        if not hasattr(self, "split") and not hasattr(self, "measure"):
2296            if examples.domain.classVar.varType == Orange.data.Type.Discrete:
2297                measure = Orange.feature.scoring.GainRatio()
2298            else:
2299                measure = Orange.feature.scoring.MSE()
2300            bl.split.continuousSplitConstructor.measure = measure
2301            bl.split.discreteSplitConstructor.measure = measure
2302           
2303        tree = bl(examples, weight)
2304        if getattr(self, "sameMajorityPruning", 0):
2305            tree = Pruner_SameMajority(tree)
2306        if getattr(self, "mForPruning", 0):
2307            tree = Pruner_m(tree, m = self.mForPruning)
2308
2309        return TreeClassifier(baseClassifier=tree) 
2310
2311    def instance(self):
2312        """
2313        DEPRECATED. Return a base learner - an object
2314        of :class:`TreeLearnerBase`.
2315        This method is left for backwards compatibility.
2316        """
2317        return self._base_learner()
2318
2319    def _build_split(self, learner):
2320
2321        if hasattr(self, "split"):
2322            learner.split = self.split
2323        else:
2324            learner.split = SplitConstructor_Combined()
2325            learner.split.continuousSplitConstructor = \
2326                SplitConstructor_Threshold()
2327            binarization = getattr(self, "binarization", 0)
2328            if binarization == 1:
2329                learner.split.discreteSplitConstructor = \
2330                    SplitConstructor_ExhaustiveBinary()
2331            elif binarization == 2:
2332                learner.split.discreteSplitConstructor = \
2333                    SplitConstructor_OneAgainstOthers()
2334            else:
2335                learner.split.discreteSplitConstructor = \
2336                    SplitConstructor_Feature()
2337
2338            measures = {"infoGain": Orange.feature.scoring.InfoGain,
2339                "gainRatio": Orange.feature.scoring.GainRatio,
2340                "gini": Orange.feature.scoring.Gini,
2341                "relief": Orange.feature.scoring.Relief,
2342                "retis": Orange.feature.scoring.MSE
2343                }
2344
2345            measure = getattr(self, "measure", None)
2346            if isinstance(measure, str):
2347                measure = measures[measure]()
2348            if not measure:
2349                measure = Orange.feature.scoring.GainRatio()
2350
2351            measureIsRelief = isinstance(measure, Orange.feature.scoring.Relief)
2352            relM = getattr(self, "reliefM", None)
2353            if relM and measureIsRelief:
2354                measure.m = relM
2355           
2356            relK = getattr(self, "reliefK", None)
2357            if relK and measureIsRelief:
2358                measure.k = relK
2359
2360            learner.split.continuousSplitConstructor.measure = measure
2361            learner.split.discreteSplitConstructor.measure = measure
2362
2363            wa = getattr(self, "worstAcceptable", 0)
2364            if wa:
2365                learner.split.continuousSplitConstructor.worstAcceptable = wa
2366                learner.split.discreteSplitConstructor.worstAcceptable = wa
2367
2368            ms = getattr(self, "minSubset", 0)
2369            if ms:
2370                learner.split.continuousSplitConstructor.minSubset = ms
2371                learner.split.discreteSplitConstructor.minSubset = ms
2372
2373    def _base_learner(self):
2374        learner = TreeLearnerBase()
2375
2376        self._build_split(learner)
2377
2378        if hasattr(self, "stop"):
2379            learner.stop = self.stop
2380        else:
2381            learner.stop = Orange.classification.tree.StopCriteria_common()
2382            mm = getattr(self, "maxMajority", 1.0)
2383            if mm < 1.0:
2384                learner.stop.maxMajority = self.maxMajority
2385            me = getattr(self, "minExamples", 0)
2386            if me:
2387                learner.stop.minExamples = self.minExamples
2388
2389        for a in ["storeDistributions", "storeContingencies", "storeExamples", 
2390            "storeNodeClassifier", "nodeLearner", "maxDepth"]:
2391            if hasattr(self, a):
2392                setattr(learner, a, getattr(self, a))
2393
2394        return learner
2395
2396# the following is for the output
2397
2398
2399fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2400""" Defines the multiplier by 100 (:samp:`^`) and the format
2401for the number of decimals (e.g. :samp:`5.3`). The corresponding
2402groups are named :samp:`m100` and :samp:`fs`. """
2403
2404by = r"(?P<by>(b(P|A)))?"
2405""" Defines bP or bA or nothing; the result is in groups by. """
2406
2407bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2408opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2409opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2410intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2411fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2412re_V = re.compile("%V")
2413re_N = re.compile("%"+fs+"N"+by)
2414re_M = re.compile("%"+fs+"M"+by)
2415re_m = re.compile("%"+fs+"m"+by)
2416re_Ccont = re.compile("%"+fs+"C"+by+opc)
2417re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2418re_ccont = re.compile("%"+fs+"c"+by+opc)
2419re_cdisc = re.compile("%"+fs+"c"+by+opd)
2420re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2421re_cconti = re.compile("%"+fs+"c"+by+fromto)
2422re_D = re.compile("%"+fs+"D"+by)
2423re_d = re.compile("%"+fs+"d"+by)
2424re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2425re_I = re.compile("%"+fs+"I"+intrvl)
2426
2427def insertStr(s, mo, sub):
2428    """ Replace the part of s which is covered by mo
2429    with the string sub. """
2430    return s[:mo.start()] + sub + s[mo.end():]
2431
2432
2433def insertDot(s, mo):
2434    """ Replace the part of s which is covered by mo
2435    with a dot.  You should use this when the
2436    function cannot compute the desired quantity; it is called, for instance,
2437    when it needs to divide by something in the parent, but the parent
2438    doesn't exist.
2439    """
2440    return s[:mo.start()] + "." + s[mo.end():]
2441
2442def insertNum(s, mo, n):
2443    """ Replace the part of s matched by mo with the number n,
2444    formatted as specified by the user, that is, it multiplies
2445    it by 100, if needed, and prints with the right number of
2446    places and decimals. It does so by checking the mo
2447    for a group named m100 (representing the :samp:`^` in the format string)
2448    and a group named fs representing the part giving the number o
2449    f decimals (e.g. :samp:`5.3`).
2450    """
2451    grps = mo.groupdict()
2452    m100 = grps.get("m100", None)
2453    if m100:
2454        n *= 100
2455    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2456    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
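
# For instance (hypothetical values): for the specifier "%^3.1N" the match
# carries m100="^" and fs="3.1", so insertNum renders n=0.5 as "50.0".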
2457
2458def byWhom(by, parent, tree):
2459    """ If by equals bp, it returns parent, else it returns
2460    :samp:`tree.tree`. This is used to find what to divide the quantity
2461    with, when division is required.
2462    """
2463    if by=="bP":
2464        return parent
2465    else:
2466        return tree.tree
2467
2468def replaceV(strg, mo, node, parent, tree):
2469    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
2470
2471def replaceN(strg, mo, node, parent, tree):
2472    by = mo.group("by")
2473    N = node.distribution.abs
2474    if by:
2475        whom = byWhom(by, parent, tree)
2476        if whom and whom.distribution:
2477            if whom.distribution.abs > 1e-30:
2478                N /= whom.distribution.abs
2479        else:
2480            return insertDot(strg, mo)
2481    return insertNum(strg, mo, N)
2482       
2483
2484def replaceM(strg, mo, node, parent, tree):
2485    by = mo.group("by")
2486    maj = int(node.nodeClassifier.defaultValue)
2487    N = node.distribution[maj]
2488    if by:
2489        whom = byWhom(by, parent, tree)
2490        if whom and whom.distribution:
2491            if whom.distribution[maj] > 1e-30:
2492                N /= whom.distribution[maj]
2493        else:
2494            return insertDot(strg, mo)
2495    return insertNum(strg, mo, N)
2496       
2497
2498def replacem(strg, mo, node, parent, tree):
2499    by = mo.group("by")
2500    maj = int(node.nodeClassifier.defaultValue)
2501    if node.distribution.abs > 1e-30:
2502        N = node.distribution[maj] / node.distribution.abs
        if by:
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
                byN = whom.distribution[maj] / whom.distribution.abs
                if byN > 1e-30:
                    N /= byN
            else:
                return insertDot(strg, mo)
2510    else:
2511        N = 0.
2512    return insertNum(strg, mo, N)
2513
2514
2515def replaceCdisc(strg, mo, node, parent, tree):
2516    if tree.classVar.varType != Orange.data.Type.Discrete:
2517        return insertDot(strg, mo)
2518   
2519    by, op, cls = mo.group("by", "op", "cls")
2520    N = node.distribution[cls]
2521    if op == "!=":
2522        N = node.distribution.abs - N
2523    if by:
2524        whom = byWhom(by, parent, tree)
2525        if whom and whom.distribution:
2526            if whom.distribution[cls] > 1e-30:
2527                N /= whom.distribution[cls]
2528        else:
2529            return insertDot(strg, mo)
2530    return insertNum(strg, mo, N)
2531
2532   
2533def replacecdisc(strg, mo, node, parent, tree):
2534    if tree.classVar.varType != Orange.data.Type.Discrete:
2535        return insertDot(strg, mo)
2536   
2537    op, by, cls = mo.group("op", "by", "cls")
2538    N = node.distribution[cls]
2539    if node.distribution.abs > 1e-30:
2540        N /= node.distribution.abs
2541        if op == "!=":
2542            N = 1 - N
2543    if by:
2544        whom = byWhom(by, parent, tree)
2545        if whom and whom.distribution:
2546            if whom.distribution[cls] > 1e-30:
2547                N /= whom.distribution[cls] / whom.distribution.abs
2548        else:
2549            return insertDot(strg, mo)
2550    return insertNum(strg, mo, N)
2551
2552
2553__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
2554
def replaceCcont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceccont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def extractInterval(mo, dist):
    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
    lower, upper = float(lower), float(upper)
    if out:
        lop = lowin == "(" and operator.le or operator.lt
        hop = upin == ")" and operator.ge or operator.gt
        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items())
    else:
        lop = lowin == "(" and operator.gt or operator.ge
        hop = upin == ")" and operator.lt or operator.le
        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items())
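
# A small sketch of extractInterval's behaviour, added for illustration.
# The hand-rolled pattern below only mimics the group names the real
# interval specifiers use, and a plain dict stands in for the
# distribution; both are assumptions made for this example.
#
#     >>> import re
#     >>> rx = re.compile(r"(?P<out>!?)(?P<lowin>[([])(?P<lower>[^,]+)"
#     ...                 r", ?(?P<upper>[^)\]]+)(?P<upin>[)\]])")
#     >>> mo = rx.match("(3, 5]")
#     >>> extractInterval(mo, {3.0: 2.0, 4.0: 5.0, 5.0: 1.0})
#     [(4.0, 5.0), (5.0, 1.0)]
#
# "(" excludes the lower bound and "]" includes the upper one, so only
# the items for 4.0 and 5.0 are kept (dict order may vary).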


def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN/whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom:
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs # abs > d => d>1e-30 => abs>1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]")
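
# A note on the "%%%sf" % fs % (N*mul) idiom above, added for clarity:
# the string is formatted in two passes. The first % splices the
# user-supplied precision into a format string, the second applies it.
#
#     >>> "%%%sf" % "5.3" % 0.5
#     '0.500'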


def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom:
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)
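
# For orientation (a note added here): %A prints the average and %E the
# error of the class value in a regression-tree node. Judging from the
# code above, when the "bysub" group is "b" the node's statistic is
# divided by that of the referenced node, otherwise it is subtracted.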


Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 }

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100.
    mul = mo.group("m100") and 100 or 1

    if intrvl not in Z:
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % intrvl

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))
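
# A minimal numeric sketch of the interval computed above, using made-up
# numbers (an average of 5.8 and a standard error of 0.2): the half-width
# of the interval is the error times the z-score for the requested level.
#
#     >>> av, err = 5.8, 0.2
#     >>> il = err * Z[0.95]                    # 1.96 -> half-width 0.392
#     >>> "[%5.3f-%5.3f]" % (av - il, av + il)
#     '[5.408-6.192]'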


# This class is more a collection of functions than a proper class: the
# functions are merged into a class only so that they need not pass
# around so many arguments. An instance is constructed, used and
# discarded; it is not meant to store any state.
class _TreeDumper:
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
         (re_M, replaceM), (re_m, replacem),
         (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
         (re_Ccont, replaceCcont), (re_ccont, replaceccont),
         (re_Cconti, replaceCconti), (re_cconti, replacecconti),
         (re_D, replaceD), (re_d, replaced), (re_AE, replaceAE),
         (re_I, replaceI)]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
        maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr


    def formatString(self, strg, node, parent):
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    # continue one character past the start of the previous
                    # match, so the text just inserted is not matched again
                    strt = mo.start()+1

        return strg


    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + \
            (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + \
                self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes


    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n"+("|    "*lev)+"    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % \
                                self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % \
                                (self.showBranch(node, parent, lev, i),
                                 leafsep +
                                 self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % \
                            (self.showBranch(node, parent, lev, i),
                             leafsep +
                             self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)


    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % \
                self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)


    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or \
                len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % \
                    _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName+chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % \
                        (_quoteName(internalName),
                         _quoteName(internalBranchName),
                         node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            # leaf names must be quoted like the internal names above,
            # otherwise the edges written above would point elsewhere
            self.fle.write('%s [ shape=%s label="%s"]\n' % \
                (_quoteName(internalName), self.leafShape,
                self.formatString(self.leafStr, node, parent)))


    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    return '"%s"' % (base64.b64encode(x))

class TreeClassifier(Orange.classification.Classifier):
    """
    Wraps :class:`_TreeClassifier`.
    """

    def __init__(self, baseClassifier=None):
        if not baseClassifier: baseClassifier = _TreeClassifier()
        self.nativeClassifier = baseClassifier
        for k, v in self.nativeClassifier.__dict__.items():
            self.__dict__[k] = v

    def __call__(self, instance, result_type=Orange.classification.Classifier.GetValue,
                 *args, **kwdargs):
        """Classify a new instance.

        :param instance: instance to be classified.
        :type instance: :class:`Orange.data.Instance`
        :param result_type:
              :class:`Orange.classification.Classifier.GetValue` or
              :class:`Orange.classification.Classifier.GetProbabilities` or
              :class:`Orange.classification.Classifier.GetBoth`

        :rtype: :class:`Orange.data.Value`,
              :class:`Orange.statistics.Distribution` or a tuple with both
        """
        return self.nativeClassifier(instance, result_type, *args, **kwdargs)

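    # A usage sketch added for illustration, assuming a classifier built
    # on iris.tab as in the examples above:
    #
    #     >>> value, probs = classifier(instance,
    #     ...     Orange.classification.Classifier.GetBoth)
    #
    # GetValue returns only the predicted Orange.data.Value;
    # GetProbabilities returns only the class distribution.
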
    def __setattr__(self, name, value):
        if name == "nativeClassifier":
            self.__dict__[name] = value
            return
        if name in self.nativeClassifier.__dict__:
            self.nativeClassifier.__dict__[name] = value
        self.__dict__[name] = value

    def dump(self, leafStr = "", nodeStr = "", **argkw):
        """
        Return a string representation of a tree.

        :arg leafStr: The format string for printing the tree leaves. If
          left empty, "%V (%^.2m%)" will be used for classification trees
          and "%V" for regression trees.
        :type leafStr: string
        :arg nodeStr: The format string for printing out the internal nodes.
          If left empty (as it is by default), no data is printed out for
          internal nodes. If set to :samp:`"."`, the same string is
          used as for leaves.
        :type nodeStr: string
        :arg maxDepth: If set, it limits the depth to which the tree is
          printed out.
        :type maxDepth: integer
        :arg minExamples: If set, the subtrees with fewer than the given
          number of examples are not printed.
        :type minExamples: integer
        :arg simpleFirst: If True (default), the branches with a single
          node are printed before the branches with larger subtrees.
          If False, the branches are printed in order of
          appearance.
        :type simpleFirst: boolean
        :arg userFormats: A list of (regular expression, callback function)
          pairs through which the user can print out additional
          information in the nodes.
        """
        return _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True),
            self).dumpTree()

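    # A usage sketch added for illustration: the default output, a variant
    # that also prints internal nodes, and a custom userFormats entry. The
    # (regex, replacer) convention follows defaultStringFormats above; the
    # %QUUX specifier and its replacer are made up for this example.
    #
    #     >>> print tree_classifier.dump(leafStr="%V (%^.2m%)", nodeStr=".")
    #     >>> import re
    #     >>> re_QUUX = re.compile("%QUUX")
    #     >>> def replaceQUUX(strg, mo, node, parent, tree):
    #     ...     return insertStr(strg, mo, str(int(node.distribution.abs)))
    #     >>> print tree_classifier.dump(leafStr="%V: %QUUX examples",
    #     ...     userFormats=[(re_QUUX, replaceQUUX)])
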
    def dot(self, fileName, leafStr = "", nodeStr = "", leafShape="plaintext", nodeShape="plaintext", **argkw):
        """ Prints the tree to a file in a format used by
        `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
        Uses the same parameters as :meth:`dump` defined above
        plus two parameters which define the shapes used for internal
        nodes and leaves of the tree:

        :param leafShape: Shape of the outline around leaves of the tree.
            If "plaintext", no outline is used (default: "plaintext").
        :type leafShape: string
        :param nodeShape: Shape of the outline around internal nodes
            of the tree. If "plaintext", no outline is used (default: "plaintext").
        :type nodeShape: string

        Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
        for various outlines supported by GraphViz.
        """
        fle = type(fileName) == str and file(fileName, "wt") or fileName

        _TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +
            _TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),
            argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), self,
            leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()

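    # A usage sketch added for illustration: write the tree to iris.dot
    # and render it with the GraphViz command-line tool, for example with
    # "dot -Tpng iris.dot -o iris.png".
    #
    #     >>> tree_classifier.dot("iris.dot", leafShape="box", nodeStr="%V")
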
    def count_nodes(self):
        """
        Return the number of nodes.
        """
        return _countNodes(self.tree)

    def count_leaves(self):
        """
        Return the number of leaves.
        """
        return _countLeaves(self.tree)

def _countNodes(node):
    count = 0
    if node:
        count += 1
        if node.branches:
            for branch in node.branches:
                count += _countNodes(branch)
    return count

def _countLeaves(node):
    count = 0
    if node:
        if node.branches: # internal node
            for branch in node.branches:
                count += _countLeaves(branch)
        else:
            count += 1
    return count

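# A quick sanity check added for illustration, assuming a classifier
# built on iris.tab as in the examples above: every node is either a
# leaf or an internal node, so count_leaves() <= count_nodes().
#
#     >>> print tree_classifier.count_nodes(), tree_classifier.count_leaves()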