source: orange/orange/doc/modules/orngTree.htm @ 6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 34.3 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

Made XPM version of the icon 32x32.

<html>
<head>
<link rel=stylesheet href="../style.css" type="text/css">
</head>
<body>

<h1>orngTree: Orange Decision Trees Module</h1>
<index name="classifiers/classification trees">
<index name="modules/classification trees">

<P>Module orngTree implements a class <code>TreeLearner</code> for
building both decision and regression
trees. <code>orngTree.TreeLearner</code> is essentially a wrapper
around <a
href="../reference/TreeLearner.htm"><code>orange.TreeLearner</code></a>,
provided for easier use of the latter.</p>

<P>The module also contains functions for counting the number of nodes
and leaves in the tree.</P>

<P>The module also includes versatile functions for printing out the tree. They can print
practically anything you'd like to know, from simple quantities such as the number of
examples or the proportion of examples of the majority class in a node, to more complex
statistics such as the proportion of examples in a particular class divided by the proportion
of examples of this class in the parent node. You can also
define your own callback functions to be used for printing.</P>

<h2>TreeLearner</h2>

<p><code><INDEX name="classes/TreeLearner (in
orngTree)">TreeLearner</code> is a class that assembles the generic
classification tree learner (from Orange's objects for induction of
decision trees). It sets a number of parameters used in induction;
these can also be set after the object is created, most often through
the object's attributes. If <code>TreeLearner</code> is given a set of
examples upon initialization, it returns a
<code>TreeClassifier</code> instead of a learner.</p>

<h3>Split construction</h3>
<dl class=attributes>
  <dt>measure</dt>
  <dd>Measure used for scoring the attributes when deciding which
  attribute to use for splitting the example set in the node.
  Can take one of the following values: "infoGain", "gainRatio", "gini",
  "relief" (default: "gainRatio").</dd>

  <dt>split</dt>
  <dd>Defines a function that will be used in place of Orange's
  <code>TreeSplitConstructor</code> (see <a
  href="../reference/TreeLearner.htm">documentation on
  TreeLearner</a>). Useful when prototyping new tree induction
  algorithms. When this parameter is defined, other parameters that
  affect the procedures for growing the tree are ignored. These
  include <code>binarization</code>, <code>measure</code>,
  <code>worstAcceptable</code> and <code>minSubset</code> (default:
  None).</dd>

  <dt>binarization</dt>
  <dd>If True, the induction constructs binary trees (default: False).</dd>
</dl>
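<P>To give a feeling for what the <code>measure</code> options score, here is a minimal sketch of the textbook definitions of information gain, gain ratio and Gini decrease for a discrete attribute. It is plain Python and does not use Orange; the function names are made up for this illustration, and Orange's own implementations may differ in details such as scaling or the handling of unknown values.</P>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of symbols."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of symbols."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(values, labels):
    """Return (info gain, gain ratio, Gini decrease) for splitting on `values`.

    Hypothetical helper: `values` holds the attribute value of each example,
    `labels` the class of each example.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)

    # Weighted impurity of the subsets produced by the split
    rest_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    rest_gini = sum(len(g) / n * gini(g) for g in groups.values())

    info_gain = entropy(labels) - rest_entropy
    split_info = entropy(values)  # entropy of the attribute itself
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, gain_ratio, gini(labels) - rest_gini


if __name__ == "__main__":
    print(split_scores(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
```

<P>Gain ratio divides the information gain by the entropy of the attribute itself, which penalizes attributes with many values.</P>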

<h3>Pruning</h3>
<index name="classification trees/pruning">
<index name="pruning classification trees">
<dl class="attributes">
  <dt>worstAcceptable</dt>
  <dd><P>Used in pre-pruning, sets the lowest required attribute
  score. If the score of the best attribute is below this margin, the
  tree at that node is not grown further (default: 0).</P>

  <P>So, to allow splitting only when gainRatio (the default measure) is greater than 0.6, run the learner like this:
  <xmp class="code">l = orngTree.TreeLearner(data, worstAcceptable=0.6)</xmp></P>
  </dd>

  <dt>minSubset</dt>
  <dd>Minimal number of examples in
  non-null leaves (default: 0).</dd>

  <dt>minExamples</dt>
  <dd>Data subsets with fewer than <code>minExamples</code>
  examples are not split any further (default: 0).</dd>

  <dt>maxMajority</dt>
  <dd>Induction stops when the proportion of majority class in the
  node exceeds the value set by this parameter (default: 1.0). For example, to stop the induction as soon as the majority class reaches 70%, use
  <xmp class="code">tree2 = orngTree.TreeLearner(data, maxMajority=0.7)</xmp>

  <P>This is the tree built on the iris data set with the above arguments; the numbers show the majority class proportion at each node. You can find more details in the script <a href="tree2.py">tree2.py</a>, which induces and prints this tree.</P>
  <xmp class="printout">root: 0.333
|    petal width<0.800: 1.000
|    petal width>=0.800: 0.500
|    |    petal width<1.750: 0.907
|    |    petal width>=1.750: 0.978</xmp>
  </dd>

  <dt>stop</dt>
  <dd>Used for passing a function which is used in place of
  <code>TreeStopCriteria</code>. Useful when prototyping new
  tree induction algorithms. See the documentation on <a
  href="../reference/TreeLearner.htm">TreeStopCriteria</a> for more
  info on this function. When used, parameters
  <code>maxMajority</code> and <code>minExamples</code> will not be
  considered (default: None).</dd>

  <dt>mForPruning</dt>
  <dd>If non-zero, invokes an error-based bottom-up post-pruning,
  where m-estimate is used to estimate class probabilities (default: 0).</dd>

  <dt>sameMajorityPruning</dt>
  <dd>If true, invokes a bottom-up post-pruning by removing the
  subtrees of which all leaves classify to the same class
  (default: False).</dd>
</dl>
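<P>The m-estimate mentioned under <code>mForPruning</code> is the standard formula (n<sub>c</sub> + m&middot;p<sub>a</sub>) / (n + m), where n<sub>c</sub> is the number of examples of the class in the node, n the number of all examples in the node and p<sub>a</sub> the prior probability of the class. The sketch below only illustrates this formula; the function name is hypothetical and this is not the pruning code Orange runs.</P>

```python
def m_estimate(class_count, total, prior, m):
    """m-estimate of a class probability.

    class_count -- examples of the class in the node
    total       -- all examples in the node
    prior       -- prior (a priori) probability of the class
    m           -- weight given to the prior; m=0 gives the relative frequency
    """
    return (class_count + m * prior) / (total + m)


if __name__ == "__main__":
    # A small leaf with 2 of 3 examples in the class, prior 1/3:
    for m in (0, 2, 10):
        print(m, round(m_estimate(2, 3, 1.0 / 3, m), 3))
```

<P>The larger the <code>m</code>, the more the estimate in a small node is pulled towards the prior, so small subtrees look less reliable and are more likely to be pruned.</P>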

<h3>Record keeping</h3>
<dl class="attributes">
  <dt>storeDistributions, storeContingencies, storeExamples,
  storeNodeClassifier</dt>
  <dd>Determine whether to store class distributions, contingencies and
  examples in TreeNodes, and whether the nodeClassifier should be
  built for internal nodes. You won't save any memory by storing
  contingencies but not distributions, since the distribution actually points to
  the same distribution that is stored in
  <code>contingency.classes</code> (default: True except for
  storeExamples, which defaults to False).</dd>
</dl>

<P>For a slightly more complex example, here's how to write your own stop function. The example is more amusing than useful. It constructs and prints two trees. For the first one we define the <code>defStop</code> function, which is the one used by default, and combine it with a random function so that the stop criterion is also met in an additional 20% of the cases in which <code>defStop</code> is false. The second tree is built using only the random function as the stopping criterion. Note that in the second case the lambda function still has three parameters, since the stop function must accept three arguments (for more, see the section on <a href="../reference/TreeLearner.htm">Orange Trees</a> in the Orange Reference).
</p>

<p class="header"><a href="tree3.py">tree3.py</a> (uses <a href=
"iris.tab">iris.tab</a>)</p>

<XMP class=code>import orange, orngTree
# whrandom was removed from newer Pythons; the random module provides randint
from random import randint

data = orange.ExampleTable("iris.tab")

# Default stop criteria, OR-ed with a random stop in about 20% of the cases
defStop = orange.TreeStopCriteria()
f = lambda examples, weightID, contingency: defStop(examples, weightID, contingency) or randint(1, 5)==1
l = orngTree.TreeLearner(data, stop=f)
orngTree.printTxt(l, leafFields=['major', 'contingency'])

# Random stop only; the function must still accept three arguments
f = lambda x,y,z: randint(1, 5)==1
l = orngTree.TreeLearner(data, stop=f)
orngTree.printTxt(l, leafFields=['major', 'contingency'])
</XMP>

<p>The output is not shown here since the resulting trees are rather
big.</p>

<h2>Tree Size</h2>

<p><b><code>countNodes(tree)</code></b> returns the number of nodes in the tree.</p>

<p class=section>Arguments</p>
<dl class=arguments>
  <dt>tree</dt>
  <dd>The tree for which to count the nodes.</dd>
</dl>

<p><b><code>countLeaves(tree)</code></b> returns the number of leaves in the tree.</p>

<p class=section>Arguments</p>
<dl class=arguments>
  <dt>tree</dt>
  <dd>The tree for which to count the leaves.</dd>
</dl>
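<P>Both counts are simple recursive traversals. The sketch below shows the idea on a hypothetical tree made of nested dictionaries (a node is a dict with an optional <code>"branches"</code> list, and <code>None</code> stands for a null branch); it does not use Orange's tree nodes and is not the module's actual code.</P>

```python
def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`, the node included."""
    if node is None:          # null branch: nothing to count
        return 0
    return 1 + sum(count_nodes(b) for b in node.get("branches", []))


def count_leaves(node):
    """Number of nodes without branches in the subtree rooted at `node`."""
    if node is None:
        return 0
    branches = node.get("branches", [])
    if not branches:
        return 1
    return sum(count_leaves(b) for b in branches)


# A tree with the shape root -> (leaf, inner -> (inner -> 2 leaves, inner -> 2 leaves))
leaf = {}
example_tree = {"branches": [leaf,
                             {"branches": [{"branches": [leaf, leaf]},
                                           {"branches": [leaf, leaf]}]}]}

if __name__ == "__main__":
    print(count_nodes(example_tree), count_leaves(example_tree))
```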


<h2>Printing the Tree</h2>
<index name="classification trees/printing">

<P>Function <code>dumpTree</code> dumps a tree to a string, and <code>printTree</code> prints out the tree (<code>printTxt</code> is an alias for <code>printTree</code>, kept for compatibility). Both functions take the same arguments.</P>

<P>Before we go on: you can read all about the function and use it to its full extent, or you can just call it with the tree as the sole argument, and it will print out the usual textual representation of the tree. If you're satisfied with that, you can stop here.</P>

<p class=section>Arguments</p>
<dl class=arguments>
  <dt>tree</dt>
  <dd>The tree to be printed out.</dd>

  <dt>leafStr</dt>
  <dd>The format string for printing the tree leaves. If left empty, "%V (%^.2m%)" will be used for classification trees and "%V" for regression trees.</dd>

  <dt>nodeStr</dt>
  <dd>The format string for printing out the internal nodes. If left empty (as it is by default), no data is printed out for internal nodes. If set to <code>"."</code>, the same string is used as for leaves.</dd>

  <dt>maxDepth</dt>
  <dd>If set, it limits the depth to which the tree is printed out.</dd>

  <dt>minExamples</dt>
  <dd>If set, the subtrees with fewer than the given number of examples are not printed.</dd>

  <dt>simpleFirst</dt>
  <dd>If <code>True</code> (default), the branches with a single node are printed before the branches with larger subtrees. If you set it to <code>False</code> (although there is little reason to), the branches are printed in order of appearance.</dd>

  <dt>userFormats</dt>
  <dd>A list of regular expressions and callback functions through which the user can print out other specific information in the nodes.</dd>
</dl>

<P>The magic is in the format string. It is printed out at every leaf or internal node, with certain format specifiers replaced by data from the tree node. Specifiers are generally of the form
<B><code>%<em>[^]</em><em>&lt;precision&gt;</em><em>&lt;quantity&gt;</em><em>&lt;divisor&gt;</em></code></B>.</P>

<P><B><EM>^</EM></B> at the start tells that the number should be multiplied by 100. It's useful for printing proportions as percentages, but it makes no sense to multiply, say, the number of examples at the node (although the function will allow it).</P>

<P><B><em>&lt;precision&gt;</em></B> is in the same format as in Python (or C) string formatting. For instance, <code>%N</code> denotes the number of examples in the node, hence <code>%6.2N</code> would mean output with two decimal digits and six places altogether. If left out, the default format <code>5.3</code> is used, unless you multiply the numbers by 100, in which case the default is <code>.0</code> (no decimals, the number is rounded to the nearest integer).</P>

<P><B><em>&lt;divisor&gt;</em></B> tells what to divide the quantity in that node by. <code>bP</code> means division by the same quantity in the parent node; for instance, <code>%NbP</code> will give the number of examples in the node divided by the number of examples in the parent node. You can add precision formatting, e.g. <code>%6.2NbP</code>. <code>bA</code> is division by the same quantity over the entire data set, so <code>%NbA</code> will tell you the proportion of examples (out of the entire training data set) that fell into that node. If division is impossible because the parent node does not exist or some data is missing, a dot is printed out instead of the quantity.</P>

<P><B><em>&lt;quantity&gt;</em></B> is the only required element. It defines what to print. For instance, <code>%N</code> would print out the number of examples in the node. Possible quantities are
<dl class=arguments_sm>
<dt>V</dt>
<dd>The value predicted at that node. You cannot define the precision or divisor here.</dd>

<dt>N</dt>
<dd>The number of examples in the node.</dd>

<dt>M</dt>
<dd>The number of examples in the majority class (that is, the class predicted by the node).</dd>

<dt>m</dt>
<dd>The proportion of examples in the majority class.</dd>

<dt>A</dt>
<dd>The average class of examples in the node; this is available only for regression trees.</dd>

<dt>E</dt>
<dd>Standard error for the class of examples in the node; available for regression trees.</dd>

<dt>I</dt>
<dd>Print out the confidence interval. The modifier is used as <code>%I(95)</code> or (more complicated) <code>%5.3I(95)bP</code>.</dd>

<dt>C</dt>
<dd><P>The number of examples in the given class. For classification trees, this modifier is used as, for instance, <code>%5.3CbP="Iris-virginica"</code> - this will give the number of examples of Iris-virginica divided by the number of examples of this class in the parent node. If you are interested in examples that are <em>not</em> Iris-virginica, say <code>%5.3CbP!="Iris-virginica"</code>.</P>

<P>For regression trees, you can use operators =, !=, &lt;, &lt;=, &gt;, and &gt;=, as in <code>%C&lt;22</code> - add the precision and divisor if you will. You can also check the number of examples in a certain interval: <code>%C[20, 22]</code> will give you the number of examples between 20 and 22 (inclusive) and <code>%C(20, 22)</code> will give the number of such examples excluding the boundaries. You can of course mix the parentheses, e.g. <code>%C(20, 22]</code>. If you would like the examples outside the interval, add a <code>!</code>, like <code>%C!(20, 22]</code>.</P></dd>

<dt>c</dt>
<dd>Same as above, except that it computes the proportion of the class instead of the number of examples.</dd>

<dt>D</dt>
<dd>Prints out the number of examples in each class. You can use both the precision (it is applied to each number in the distribution) and the divisor. This quantity cannot be computed for regression trees.</dd>

<dt>d</dt>
<dd>Same as above, except that it shows proportions of examples. This again doesn't work with regression trees.</dd>

<dt>&lt;user defined formats&gt;</dt>
<dd>You can add more, if you will. Instructions and examples are given at the end of this section.</dd>
</dl>
</P>
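<P>To make the specifier syntax concrete, here is a small stand-alone sketch that interprets a subset of it (<code>N</code>, <code>M</code> and <code>m</code> with an optional <code>^</code>, precision and <code>bP</code>/<code>bA</code> divisor). It works on plain dictionaries with a hypothetical <code>"counts"</code> list of per-class counts, and the regular expression and the <code>render</code> function are written for this illustration only; orngTree's own patterns and replacement functions are more general. For <code>M</code> and <code>m</code>, the sketch assumes that the divisor refers to the node's majority class in the reference node.</P>

```python
import re

# Hypothetical pattern: "%", optional "^", optional precision,
# a quantity letter, optional divisor (bP or bA).
SPEC = re.compile(
    r"%(?P<m100>\^?)(?P<fs>\d*\.?\d*)(?P<quantity>[NMm])(?P<by>(?:b[PA])?)")


def render(fmt, node, parent=None, root=None):
    """Fill the specifiers in `fmt` with data from `node`."""
    counts = node["counts"]
    major = counts.index(max(counts))  # index of the node's majority class

    def quantity(n, q):
        total = float(sum(n["counts"]))
        if q == "N":
            return total
        in_major = float(n["counts"][major])
        if q == "M":
            return in_major
        return in_major / total if total else None  # "m"

    def substitute(mo):
        q = mo.group("quantity")
        value = quantity(node, q)
        by = mo.group("by")
        if by:
            ref = parent if by == "bP" else root
            divisor = quantity(ref, q) if ref is not None else None
            if value is None or not divisor:
                return "."  # cannot divide: print a dot instead
            value /= divisor
        if value is None:
            return "."
        if mo.group("m100"):
            value *= 100
        # Default precision: 5.3, or .0 when multiplied by 100
        fs = mo.group("fs") or (".0" if mo.group("m100") else "5.3")
        return ("%" + fs + "f") % value

    return SPEC.sub(substitute, fmt)


if __name__ == "__main__":
    leaf = {"counts": [0, 49, 3]}
    print(render("%M out of %N (%^.2m%)", leaf))
```

<P>A percent sign that is not followed by a recognized specifier is left as it is, which is why a trailing <code>%</code> can be used to print the percent sign itself.</P>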

<P>Now for some examples. We shall build a small tree from the iris data set - we shall limit the depth to three levels.</P>

<p class="header">part of <a href="orngTree1.py">orngTree1.py</a></p>
<xmp class="code">import orange, orngTree
data = orange.ExampleTable("iris")
tree = orngTree.TreeLearner(data, maxDepth=3)
</xmp>

<P>The easiest way to call the function is to pass the tree as the only argument. Calling <code>orngTree.printTree(tree)</code> will print
<xmp class="printout">petal width<0.800: Iris-setosa (100.00%)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor (94.23%)
|    |    petal length>=5.350: Iris-virginica (100.00%)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica (66.67%)
|    |    petal length>=4.850: Iris-virginica (100.00%)
</xmp>
</P>

<P>Let's now print out the predicted class at each leaf, together with the number of examples in the majority class and the total number of examples in the node:
<code>orngTree.printTree(tree, leafStr="%V (%M out of %N)")</code>.
<xmp class="printout">petal width<0.800: Iris-setosa (50.000 out of 50.000)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
|    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
|    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)
</xmp>
</P>

<P>Would you like to know how the number of examples declines as compared to the entire data set and to the parent node? We find it with this: <code>orngTree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)")</code>
<xmp class="printout">petal width<0.800: Iris-setosa (100%, 100%)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor (98%, 100%)
|    |    petal length>=5.350: Iris-virginica (4%, 40%)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica (4%, 4%)
|    |    petal length>=4.850: Iris-virginica (86%, 96%)
</xmp>
</P>
<P>Let us first read the format string. <code>%M</code> is the number of examples in the majority class. We want it divided by the number of all examples from this class in the entire data set, hence <code>%MbA</code>. To have it multiplied by 100, we say <code>%^MbA</code>. The percent sign <em>after</em> that is just printed out literally, just like the comma and parentheses (see the output). The string for showing the proportion of this class in the parent is the same except that we have <code>bP</code> instead of <code>bA</code>.</P>

<P>And now for the output: all examples of setosa fall into the first node. For versicolor, we have 98% in one node; the rest is certainly not in the neighbouring node (petal length&gt;=5.350) since all versicolors from the node petal width&lt;1.750 went to petal length&lt;5.350 (we know this from the <code>100%</code> in that line). Virginica is the majority class in the three nodes that together contain 94% of this class (4+4+86). The rest must have gone to the same node as versicolor.</P>

<P>If you find this guesswork annoying - so do I. Let us print out the number of versicolors in each node, together with the proportion of versicolors among the examples in this particular node and among all versicolors. So, the format string<br>
<code>'%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'</code><br>gives the following output:</P>

<xmp class="printout">petal width<0.800: 0.000 (0% of node, 0% of versicolors)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
|    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
|    petal width>=1.750
|    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
|    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)
</xmp>

<P>Finally, we may want to print out the distributions, using a simple string <code>%D</code>.</P>
<xmp class="printout">petal width<0.800: [50.000, 0.000, 0.000]
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: [0.000, 49.000, 3.000]
|    |    petal length>=5.350: [0.000, 0.000, 2.000]
|    petal width>=1.750
|    |    petal length<4.850: [0.000, 1.000, 2.000]
|    |    petal length>=4.850: [0.000, 0.000, 43.000]
</xmp>
<P>What is the order of numbers here? If you check <code>data.domain.classVar.values</code>, you'll learn that the order is setosa, versicolor, virginica; so in the node at petal length&lt;5.350 we have 49 versicolors and 3 virginicae. To print out the proportions, we can use, for instance, <code>%.2d</code> - this gives us proportions within the node, rounded to two decimals.</P>
<xmp class="printout">petal width<0.800: [1.00, 0.00, 0.00]
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: [0.00, 0.94, 0.06]
|    |    petal length>=5.350: [0.00, 0.00, 1.00]
|    petal width>=1.750
|    |    petal length<4.850: [0.00, 0.33, 0.67]
|    |    petal length>=4.850: [0.00, 0.00, 1.00]
</xmp>

<P>We haven't yet printed any information for internal nodes. To start with the most trivial case, we shall print the prediction at each node.</P>
<xmp class="code">orngTree.printTree(tree, leafStr="%V", nodeStr=".")</xmp>
<P>Setting <code>nodeStr</code> to <code>"."</code> says that <code>nodeStr</code> should be the same as <code>leafStr</code> (not very useful here, since <code>leafStr</code> is trivial anyway).</P>
<xmp class="printout">root: Iris-setosa
|    petal width<0.800: Iris-setosa
|    petal width>=0.800: Iris-versicolor
|    |    petal width<1.750: Iris-versicolor
|    |    |    petal length<5.350: Iris-versicolor
|    |    |    petal length>=5.350: Iris-virginica
|    |    petal width>=1.750: Iris-virginica
|    |    |    petal length<4.850: Iris-virginica
|    |    |    petal length>=4.850: Iris-virginica
</xmp>

<P>Note that the output is somewhat different now: another node called <em>root</em> has appeared and the tree looks one level deeper. This is needed to print out the data for that node, too.</P>

<P>Now for something more complicated: let us observe how the number of virginicas decreases down the tree:</P>
<xmp class="code">orngTree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')</xmp>
<P>Let's first interpret the format string: <code>CbA="Iris-virginica"</code> is the number of examples from class virginica, divided by the total number of examples in this class. Add <code>^.1</code> and the result will be multiplied by 100 and printed with one decimal. The trailing <code>%</code> is printed out literally. In parentheses we print the same thing except that we divide by the number of such examples in the parent node. Note the use of single quotes, so that we can use double quotes inside the string when we specify the class.</P>
<xmp class="printout">root: 100.0% (.%)
|    petal width<0.800: 0.0% (0.0%)
|    petal width>=0.800: 100.0% (100.0%)
|    |    petal width<1.750: 10.0% (10.0%)
|    |    |    petal length<5.350: 6.0% (60.0%)
|    |    |    petal length>=5.350: 4.0% (40.0%)
|    |    petal width>=1.750: 90.0% (90.0%)
|    |    |    petal length<4.850: 4.0% (4.4%)
|    |    |    petal length>=4.850: 86.0% (95.6%)
</xmp>
<P>See what's in the parentheses in the root node? If <code>printTree</code> cannot compute something (in this case because the root has no parent), it prints out a dot. You can also replace <code>=</code> by <code>!=</code> and it will count all classes <em>except</em> virginica.</P>

<P>For one final example with classification trees, we shall print the distributions in the nodes, the distributions compared to the parent and the proportions compared to the parent (the latter two are not the same - think why). In the leaves we shall also add the predicted class. So now we'll have to call the function like this.</P>
<xmp class="code">orngTree.printTree(tree, leafStr='%V   %D %.2DbP %.2dbP', nodeStr='%D %.2DbP %.2dbP')</xmp>
<p>Here's the result:</p>
<xmp class="printout">root: [50.000, 50.000, 50.000] . .
|    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
|        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
|    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
|    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
|    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
|    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
|    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
|    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
|    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
|    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
|    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
|    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
|    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]
</xmp>
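<P>The difference between <code>%DbP</code> and <code>%dbP</code> is easy to check by hand. The sketch below recomputes both for the node petal width&lt;1.750 from the class counts shown in the printout; the two helper functions are written for this illustration and are not part of orngTree. A class that is absent from the parent is shown as 0, which is what the printout appears to do.</P>

```python
def counts_by_parent(node, parent):
    """%DbP: per-class counts divided by the parent's per-class counts."""
    return [n / p if p else 0.0 for n, p in zip(node, parent)]


def proportions_by_parent(node, parent):
    """%dbP: per-class proportions divided by the parent's proportions."""
    n_total, p_total = float(sum(node)), float(sum(parent))
    return [(n / n_total) / (p / p_total) if p else 0.0
            for n, p in zip(node, parent)]


node = [0, 49, 5]      # petal width<1.750
parent = [0, 50, 50]   # petal width>=0.800

if __name__ == "__main__":
    print(["%.2f" % x for x in counts_by_parent(node, parent)])
    print(["%.2f" % x for x in proportions_by_parent(node, parent)])
```

<P>The first list tells what share of the parent's examples of each class ended up in the node; the second tells how much more (or less) frequent each class is in the node than in the parent.</P>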

<P>To explore the possibilities when printing regression trees, we shall induce a tree from the housing data set. Called with the tree as the only argument, <code>printTree</code> prints the tree like this:</P>

<xmp class="printout">RM<6.941
|    LSTAT<14.400
|    |    DIS<1.385: 45.6
|    |    DIS>=1.385: 22.9
|    LSTAT>=14.400
|    |    CRIM<6.992: 17.1
|    |    CRIM>=6.992: 12.0
RM>=6.941
|    RM<7.437
|    |    CRIM<7.393: 33.3
|    |    CRIM>=7.393: 14.4
|    RM>=7.437
|    |    TAX<534.500: 45.9
|    |    TAX>=534.500: 21.9
</xmp>

<P>Let us add the standard error in both internal nodes and leaves, and the 90% confidence intervals in the leaves. So we want to call it like this:</P>
<xmp class="code">orngTree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")</xmp>

<xmp class="printout">root: [SE: 0.409]
|    RM<6.941: [SE: 0.306]
|    |    LSTAT<14.400: [SE: 0.320]
|    |    |    DIS<1.385: [SE: 4.420]:
|    |    |        [SE: 4.420]   45.6 [38.331-52.829]
|    |    |    DIS>=1.385: [SE: 0.244]:
|    |    |        [SE: 0.244]   22.9 [22.504-23.306]
|    |    LSTAT>=14.400: [SE: 0.333]
|    |    |    CRIM<6.992: [SE: 0.338]:
|    |    |        [SE: 0.338]   17.1 [16.584-17.691]
|    |    |    CRIM>=6.992: [SE: 0.448]:
|    |    |        [SE: 0.448]   12.0 [11.243-12.714]
|    RM>=6.941: [SE: 1.031]
|    |    RM<7.437: [SE: 0.958]
|    |    |    CRIM<7.393: [SE: 0.692]:
|    |    |        [SE: 0.692]   33.3 [32.214-34.484]
|    |    |    CRIM>=7.393: [SE: 2.157]:
|    |    |        [SE: 2.157]   14.4 [10.862-17.938]
|    |    RM>=7.437: [SE: 1.124]
|    |    |    TAX<534.500: [SE: 0.817]:
|    |    |        [SE: 0.817]   45.9 [44.556-47.237]
|    |    |    TAX>=534.500: [SE: 0.000]:
|    |    |        [SE: 0.000]   21.9 [21.900-21.900]
</xmp>
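<P>This page does not say how <code>%I</code> computes the interval. A common choice, and the one assumed in the sketch below, is the normal approximation: the mean plus or minus a quantile of the standard normal distribution times the standard error. The function is hypothetical, uses only the Python standard library, and may differ from what orngTree does in the quantile it uses and in rounding.</P>

```python
from statistics import NormalDist


def confidence_interval(mean, std_error, level=90):
    """Two-sided interval mean +/- z * SE under a normal approximation.

    level -- confidence level in percent, e.g. 90 as in %I(90)
    """
    z = NormalDist().inv_cdf(0.5 + level / 200.0)
    half_width = z * std_error
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    low, high = confidence_interval(22.9, 0.244, level=90)
    print("[%.3f-%.3f]" % (low, high))
```

<P>With a standard error of zero the interval collapses to a single value, as in the last leaf of the printout.</P>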

<P>What's the difference between <code>%V</code>, the predicted value, and <code>%A</code>, the average? Doesn't a regression tree always predict the leaf average anyway? Not necessarily: the tree predicts whatever the <code>nodeClassifier</code> in a leaf returns. But you're mostly right. The visible difference is in the number of decimals: <code>%V</code> uses the <code>FloatVariable</code>'s function for printing out the value, so the printed number has the same number of decimals as in the original file from which the data was read.</P>

<P>Regression trees cannot print the distributions in the same way as classification trees. They instead offer a set of operators for observing the number of examples within a certain range. For instance, let us print the number of examples with values below 22, and compare the proportion of such examples with that in the parent node.</P>
<xmp class="code">orngTree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".")</xmp>

<xmp class="printout">root: 277.000 (.)
|    RM<6.941: 273.000 (1.160)
|    |    LSTAT<14.400: 107.000 (0.661)
|    |    |    DIS<1.385: 0.000 (0.000)
|    |    |    DIS>=1.385: 107.000 (1.020)
|    |    LSTAT>=14.400: 166.000 (1.494)
|    |    |    CRIM<6.992: 93.000 (0.971)
|    |    |    CRIM>=6.992: 73.000 (1.040)
|    RM>=6.941: 4.000 (0.096)
|    |    RM<7.437: 3.000 (1.239)
|    |    |    CRIM<7.393: 0.000 (0.000)
|    |    |    CRIM>=7.393: 3.000 (15.333)
|    |    RM>=7.437: 1.000 (0.633)
|    |    |    TAX<534.500: 0.000 (0.000)
|    |    |    TAX>=534.500: 1.000 (30.000)</xmp>

<P>The last line, for instance, says that the proportion of examples with the class below 22 among those with tax above 534 is 30 times higher than the proportion of such examples in its parent node.</P>

<P>For another exercise, let's count the same for all examples <em>outside</em> the interval [20, 22] (given like this, the interval includes the bounds). And let us print out the proportions as percentages.</P>

<xmp class="code">orngTree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")</xmp>

<P>OK, let's observe the format string one last time. <code>%c![20, 22]</code> would be the proportion of examples (within the node) whose values are below 20 or above 22. With <code>%cbP![20, 22]</code> we divide this by the same statistic computed on the parent. Add a <code>^</code> and you have the percentages.</P>

<xmp class="printout">root: 439.000 (.%)
|    RM<6.941: 364.000 (98%)
|    |    LSTAT<14.400: 200.000 (93%)
|    |    |    DIS<1.385: 5.000 (127%)
|    |    |    DIS>=1.385: 195.000 (99%)
|    |    LSTAT>=14.400: 164.000 (111%)
|    |    |    CRIM<6.992: 91.000 (96%)
|    |    |    CRIM>=6.992: 73.000 (105%)
|    RM>=6.941: 75.000 (114%)
|    |    RM<7.437: 46.000 (101%)
|    |    |    CRIM<7.393: 43.000 (100%)
|    |    |    CRIM>=7.393: 3.000 (100%)
|    |    RM>=7.437: 29.000 (98%)
|    |    |    TAX<534.500: 29.000 (103%)
|    |    |    TAX>=534.500: 0.000 (0%)
</xmp>
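<P>The interval notation used with <code>%C</code> and <code>%c</code> follows the usual mathematical convention: a square bracket includes the bound, a parenthesis excludes it, and a leading <code>!</code> selects the values outside the interval. The following sketch counts values according to such a specification; the regular expression and the function are assumptions made for this illustration and are not taken from orngTree.</P>

```python
import re

# Hypothetical pattern for specifications such as "[20, 22]" or "!(20, 22]"
INTERVAL = re.compile(
    r"(?P<out>!?)(?P<left>[\[(])\s*(?P<lower>-?\d+(?:\.\d+)?)\s*,"
    r"\s*(?P<upper>-?\d+(?:\.\d+)?)\s*(?P<right>[\])])$")


def count_interval(values, spec):
    """Count the values inside (or, with a leading '!', outside) the interval."""
    mo = INTERVAL.match(spec)
    if mo is None:
        raise ValueError("not an interval: %r" % spec)
    lower, upper = float(mo.group("lower")), float(mo.group("upper"))

    def inside(v):
        above = v >= lower if mo.group("left") == "[" else v > lower
        below = v <= upper if mo.group("right") == "]" else v < upper
        return above and below

    n_inside = sum(1 for v in values if inside(v))
    return len(values) - n_inside if mo.group("out") else n_inside


if __name__ == "__main__":
    values = [19, 20, 21, 22, 23]
    for spec in ("[20, 22]", "(20, 22)", "(20, 22]", "![20,22]"):
        print(spec, count_interval(values, spec))
```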


<h3>Defining Your Own Printout Functions</h3>

<P><code>dumpTree</code>'s argument <code>userFormats</code> can be used to print out some other information in the leaves or nodes. If provided, <code>userFormats</code> should contain a list of tuples, each with a regular expression and a callback function to be called when that expression is found in the format string. Expressions from <code>userFormats</code> are checked before the built-in expressions discussed above, so you can override the built-ins if you want to.</P>

<P>The regular expression should describe a string like those we used above, for instance the string <code>%.2DbP</code>. When a leaf or internal node is printed out, the format string (<code>leafStr</code> or <code>nodeStr</code>) is checked for these regular expressions and when a match is found, the corresponding callback function is called.</P>

<P>The callback function will get five arguments: the format string (<code>leafStr</code> or <code>nodeStr</code>), the match object, the node which is being printed, its parent (can be <code>None</code>) and the tree classifier. The function should return the format string in which the part described by the match object (that is, the part that is matched by the regular expression) is replaced by whatever information your callback function is supposed to give.</P>

<P>The function can use several utility functions provided in the module.</P>
<dl class="attributes">
<dt>insertStr(s, mo, sub)</dt>
<dd>Replaces the part of <code>s</code> which is covered by <code>mo</code> by the string <code>sub</code>.</dd>

<dt>insertDot(s, mo)</dt>
<dd>Calls <code>insertStr(s, mo, ".")</code>. You should use this when the function cannot compute the desired quantity; it is called, for instance, when it needs to divide by something in the parent, but the parent doesn't exist.</dd>

<dt>insertNum(s, mo, n)</dt>
<dd>Replaces the part of <code>s</code> matched by <code>mo</code> by the number <code>n</code>, formatted as specified by the user, that is, it multiplies it by 100, if needed, and prints it with the right number of places and decimals. It does so by checking the <code>mo</code> for a group named <code>m100</code> (representing the <code>^</code> in the format string) and a group named <code>fs</code> representing the part giving the number of decimals (e.g. <code>5.3</code>).</dd>

<dt>byWhom(by, parent, tree)</dt>
<dd>If <code>by</code> equals <code>bP</code>, it returns <code>parent</code>, else it returns <code>tree.tree</code>. This is used to find what to divide the quantity by, when division is required.</dd>
</dl>
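<P>To show how these pieces fit together, here is a stand-alone sketch with plain-Python stand-ins for <code>insertStr</code>, <code>insertDot</code> and <code>insertNum</code>, and a small loop that calls a callback for every match of its regular expression. The stand-ins (<code>insert_str</code>, <code>insert_dot</code>, <code>insert_num</code>), the dispatch loop (<code>apply_formats</code>), the <code>%S</code> specifier and the dictionary used as a node are all made up for this illustration; they follow the descriptions above but are not the module's actual code.</P>

```python
import re


def insert_str(s, mo, sub):
    """Replace the part of s covered by the match object mo with sub."""
    return s[:mo.start()] + sub + s[mo.end():]


def insert_dot(s, mo):
    """Insert a dot where a quantity cannot be computed."""
    return insert_str(s, mo, ".")


def insert_num(s, mo, n):
    """Insert n, formatted according to the m100 and fs groups of mo."""
    m100 = mo.group("m100")
    fs = mo.group("fs") or (".0" if m100 else "5.3")
    if m100:
        n *= 100
    return insert_str(s, mo, ("%" + fs + "f") % n)


def apply_formats(fmt, formats, node, parent, tree):
    """Call each callback for every match of its pattern in fmt."""
    for pattern, callback in formats:
        pos = 0
        mo = pattern.search(fmt, pos)
        while mo:
            new_fmt = callback(fmt, mo, node, parent, tree)
            # Continue searching after the text the callback inserted
            pos = max(mo.start(), mo.end() + len(new_fmt) - len(fmt))
            fmt = new_fmt
            mo = pattern.search(fmt, pos)
    return fmt


# Hypothetical user format "%S": the number of examples stored in the node
def replace_size(fmt, mo, node, parent, tree):
    return insert_num(fmt, mo, float(len(node["examples"])))


my_formats = [(re.compile(r"%(?P<m100>\^?)(?P<fs>\d*\.?\d*)S"), replace_size)]

if __name__ == "__main__":
    node = {"examples": [1, 2, 3]}
    print(apply_formats("size=%.1S", my_formats, node, None, None))
```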
504
505<P>There are also a few pieces of regular expression that you may want to reuse. The two you are likely to use are</P>
506<dl class="attributes-sm">
507<dt>fs</dt>
508<dd>Defines the multiplier by 100 (<code>^</code>) and the format for the number of decimals (e.g. <code>5.3</code>). The corresponding groups are named <code>m100</code> and <code>fs</code>.</dd>
509
510<dt>by</dt>
<dd>Defines <code>bP</code> or <code>bA</code> or nothing; the result is in the group <code>by</code>.</dd>
512</dl>
513
514<P>For a trivial example, "%V" is implemented like this. There is the following tuple in the list of built-in formats: <code>(re.compile("%V"), replaceV)</code>. <code>replaceV</code> is a function defined by:</P>
<xmp class="code">def replaceV(strg, mo, node, parent, tree):
    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))</xmp>
517<P>It therefore takes the value predicted at the node (<code>node.nodeClassifier.defaultValue</code>), converts it to a string and passes it to <code>insertStr</code> to do the replacement.</P>
518
519<P>A more complex regular expression is the one for the proportion of majority class, defined as <code>"%"+fs+"M"+by</code>. It uses the two partial expressions defined above.</P>
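<P>The following sketch shows how such a composed expression splits a format specifier into its named groups. The definitions of <code>fs</code> and <code>by</code> below are assumptions written to match the description above; the patterns actually defined in the module may differ in detail.</P>

```python
import re

# Assumed definitions of the two fragments (illustration only).
fs = r"(?P<m100>\^?)(?P<fs>\d*\.?\d*)"
by = r"(?P<by>b[PA])?"

# Composed the same way as the majority-class format described above.
majority = re.compile("%" + fs + "M" + by)

for spec in ["%M", "%^M", "%5.3MbP", "%^.1MbA"]:
    mo = majority.match(spec)
    # Show which part of the specifier ended up in which group.
    print(spec, repr(mo.group("m100")), repr(mo.group("fs")), repr(mo.group("by")))
```

<P>With these assumed fragments, a missing <code>bP</code>/<code>bA</code> suffix leaves the <code>by</code> group as <code>None</code>, which is what a callback can test to decide whether any division is needed.</P>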
520
<P>Let's say we'd like to print the classification margin for each node, that is, the difference between the proportion of the largest and the second largest class in the node.</P>
522
523<p class="header">part of <a href="orngTree2.py">orngTree2.py</a></p>
<xmp class="code">import re
import orngTree

def getMargin(dist):
    # Difference between the proportions of the two largest classes
    if dist.abs < 1e-30:
        return 0
    l = list(dist)
    l.sort()
    return (l[-1] - l[-2]) / dist.abs

def replaceB(strg, mo, node, parent, tree):
    margin = getMargin(node.distribution)

    by = mo.group("by")
    if margin and by:
        # Node whose margin we divide by: the parent for bP, else tree.tree
        whom = orngTree.byWhom(by, parent, tree)
        if whom and whom.distribution:
            divMargin = getMargin(whom.distribution)
            if divMargin > 1e-30:
                margin /= divMargin
            else:
                return orngTree.insertDot(strg, mo)
        else:
            return orngTree.insertDot(strg, mo)
    return orngTree.insertNum(strg, mo, margin)


myFormat = [(re.compile("%"+orngTree.fs+"B"+orngTree.by), replaceB)]</xmp>
549
<P>We first define <code>getMargin</code>, which takes a distribution and computes the margin. The callback function, <code>replaceB</code>, computes the margin for the node. If we need to divide the quantity by something (that is, if the <code>by</code> group is present), we call <code>orngTree.byWhom</code> to get the node by whose margin this node's margin is to be divided. If that node (usually the parent) does not exist or if its margin is zero, we call <code>insertDot</code> to insert a dot; otherwise we call <code>insertNum</code>, which inserts the number in the format specified by the user.</P>
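<P>The arithmetic can be checked on plain lists of class counts. This is only a sketch: <code>margin_from_counts</code> is a hypothetical stand-in that works on a Python list instead of an Orange distribution, and the counts are chosen for illustration.</P>

```python
def margin_from_counts(counts):
    # Margin = (largest count - second largest count) / total count.
    total = float(sum(counts))
    if total < 1e-30 or len(counts) < 2:
        return 0.0
    ordered = sorted(counts)
    return (ordered[-1] - ordered[-2]) / total

# Illustrative class counts for a node and for its parent.
node_margin = margin_from_counts([0, 49, 3])     # (49 - 3) / 52
parent_margin = margin_from_counts([0, 49, 5])   # (49 - 5) / 54

print("%.0f%%" % (100 * node_margin))                    # 88%
print("%.2f%%" % (100 * node_margin / parent_margin))    # 108.57%
```

<P>A relative margin above 100% simply means that the node separates its two largest classes better than the node it is compared with.</P>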
551
552<P><code>myFormat</code> is a list containing the regular expression and the callback function.</P>
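<P>To illustrate how a list of such pairs can drive the formatting, here is a minimal sketch of a loop that applies every (regular expression, callback) pair to a format string. The function <code>apply_formats</code>, the format <code>%U</code> and the use of a dictionary as a node are all hypothetical; this is not how the module itself is necessarily written.</P>

```python
import re

def replace_upper(strg, mo, node, parent, tree):
    # Hypothetical callback: substitute the node's name in upper case.
    return strg[:mo.start()] + node["name"].upper() + strg[mo.end():]

user_formats = [(re.compile("%U"), replace_upper)]

def apply_formats(strg, node, parent, tree, formats):
    for pattern, callback in formats:
        start = 0
        while True:
            mo = pattern.search(strg, start)
            if mo is None:
                break
            # The callback returns the whole string with the match replaced.
            strg = callback(strg, mo, node, parent, tree)
            # Continue after the replaced position to avoid looping forever.
            start = mo.start() + 1
    return strg

print(apply_formats("%U: %U!", {"name": "leaf"}, None, None, user_formats))
# LEAF: LEAF!
```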
553
554<P>We can now print out the iris tree, for instance using the following call.</P>
<xmp class="code">orngTree.printTree(tree, leafStr="%V %^B% (%^3.2BbP%)", userFormats = myFormat)</xmp>
556
557<P>And this is what we get.</P>
<xmp class="printout">petal width<0.800: Iris-setosa 100% (100.00%)
petal width>=0.800
|    petal width<1.750
|    |    petal length<5.350: Iris-versicolor 88% (108.57%)
|    |    petal length>=5.350: Iris-virginica 100% (122.73%)
|    petal width>=1.750
|    |    petal length<4.850: Iris-virginica 33% (34.85%)
|    |    petal length>=4.850: Iris-virginica 100% (104.55%)
</xmp>
567
568
569<h2>Plotting the Tree using Dot</h2>
570
<p>Function <code>printDot</code> prints the tree to a file in a format used by <a
href="http://www.research.att.com/sw/tools/graphviz">GraphViz</a>.
It uses the same parameters as <code>printTxt</code> defined above, and
in addition two parameters which define the shape used for internal
nodes and leaves of the tree:</p>
576
577<p class=section>Arguments</p>
578<dl class=arguments>
579  <dt>leafShape</dt>
  <dd>Shape of the outline around leaves of the tree. If "plaintext",
  no outline is used (default: "plaintext")</dd>
582
583  <dt>internalNodeShape</dt>
584  <dd>Shape of the outline around internal nodes of the tree. If "plaintext",
585  no outline is used (default: "box")</dd>
586</dl>
587
588<p>Check <a
589href="http://www.graphviz.org/doc/info/shapes.html">Polygon-based
590Nodes</a> for various outlines supported by GraphViz.</p>
591
<P>Suppose you saved the tree in a file <code>tree5.dot</code>. You can then render it as a GIF by executing the following command line:</P>
<XMP class=code>dot -Tgif tree5.dot -otree5.gif
</XMP>
<P>GraphViz's dot supports quite a few other output formats; check its documentation to learn which.</P>
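<P>If you render trees often, the command line can be assembled from Python. The sketch below only builds the argument list in the same form as the command above; <code>dot_command</code> is a hypothetical helper, and actually running the command requires GraphViz to be installed.</P>

```python
import os

def dot_command(dot_file, fmt="gif"):
    # Build "dot -T<fmt> <input> -o<output>", deriving the output
    # name from the input name.
    base, _ = os.path.splitext(dot_file)
    return ["dot", "-T" + fmt, dot_file, "-o" + base + "." + fmt]

print(" ".join(dot_command("tree5.dot")))          # dot -Tgif tree5.dot -otree5.gif
print(" ".join(dot_command("tree5.dot", "png")))   # dot -Tpng tree5.dot -otree5.png
```

<P>The resulting list can be passed to <code>subprocess.call</code> to run the conversion.</P>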
597
598<H2>References</H2>
599
600<P>E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories,
601Murray Hill NJ, U.S.A., October 1993.</P>
602
603<p><a href="http://www.research.att.com/sw/tools/graphviz/">Graphviz -
604open source graph drawing software</a>. A home page of AT&T's dot and
605similar software packages.</p>
606
607</body>
608</html> 