orngTree: Orange Decision Trees Module
Module orngTree implements a class TreeLearner for building both decision and regression trees. orngTree.TreeLearner is essentially a wrapper around Orange's core objects for tree induction, provided for easier use of the latter.
The module also contains functions for counting the number of nodes and leaves in the tree.
The module also includes functions for printing out the tree. These are rather versatile and can print out practically anything you'd like to know: from the number of examples, or the proportion of examples of the majority class in a node, to more complex statistics like the proportion of examples in a particular class divided by the proportion of examples of this class in the parent node. Beyond that, you can define your own callback functions to be used for printing.
TreeLearner is a class that assembles the generic classification tree learner (from Orange's objects for induction of decision trees). It sets a number of parameters used in induction; these can also be set after the creation of the object, most often through the object's attributes. If TreeLearner is given a set of examples upon initialization, a TreeClassifier object is returned instead.
- measure
- Measure for scoring of the attributes when deciding which of the attributes will be used for splitting of the example set in the node. Can take one of the following values: "infoGain", "gainRatio", "gini", "relief" (default: "gainRatio").
- split
- Defines a function that will be used in place of Orange's TreeSplitConstructor (see documentation on TreeLearner). Useful when prototyping new tree induction algorithms. When this parameter is defined, other parameters that affect the procedures for growing the tree are ignored: binarization, measure, worstAcceptable and minSubset (default: None).
- binarization
- If True, the induction constructs binary trees (default: False).
- worstAcceptable
- Used in pre-pruning; sets the lowest required attribute score. If the score of the best attribute is below this margin, the tree at that node is not grown further (default: 0).
So, to allow splitting only when gainRatio (the default measure) is greater than 0.6, one should run the learner like this:
l = orngTree.TreeLearner(data, worstAcceptable=0.6)
- minSubset
- Minimal number of examples in non-null leaves (default: 0).
- minExamples
- Data subsets with fewer than minExamples examples are not split any further; that is, all leaves in the tree will contain at least that many examples (default: 0).
- maxMajority
- Induction stops when the proportion of the majority class in the node exceeds the value set by this parameter (default: 1.0). E.g., to stop the induction as soon as the majority class reaches 70%, you should say
tree2 = orngTree.TreeLearner(data, maxMajority=0.7)
This is an example of a tree induced from the iris data set with the above argument - the numbers show the majority class proportion at each node. You can find more details in the script tree2.py, which induces and prints this tree.
root: 0.333
|    petal width<0.800: 1.000
|    petal width>=0.800: 0.500
|    |    petal width<1.750: 0.907
|    |    petal width>=1.750: 0.978
- stop
- Used for passing a function which is used in place of TreeStopCriteria. Useful when prototyping new tree induction algorithms. See the documentation on TreeStopCriteria for more info on this function. When used, the parameters maxMajority and minExamples will not be considered (default: None).
- mForPruning
- If non-zero, invokes error-based bottom-up post-pruning, where the m-estimate is used to estimate class probabilities (default: 0).
- sameMajorityPruning
- If true, invokes bottom-up post-pruning by removing the subtrees of which all leaves classify to the same class (default: False).
- storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
- Determines whether to store class distributions, contingencies and examples in TreeNodes, and whether the nodeClassifier should be built for internal nodes. By default everything except storeExamples is enabled. You won't save any memory by not storing distributions but storing contingencies, since distributions actually point to the same distributions that are stored in contingency.classes (default: True, except for storeExamples, which defaults to False).
For a somewhat more complex example, here's how to write your own stop function. The example itself is more amusing than useful. It constructs and prints two trees. For the first one we define the defStop function, which is used by default, and combine it with a random function so that the stop criterion is also met in an additional 20% of cases when defStop is false. The second tree is built so that it considers only the random function as the stopping criterion. Note that in the second case the lambda function still has three parameters, since this is the required number of parameters for the stop function (for more, see the section on Orange Trees in the Orange Reference).
The output is not shown here since the resulting trees are rather big.
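The idea can be sketched in plain Python (an illustration only, not the Orange API; make_random_stop and def_stop are hypothetical names, and the three-parameter signature mirrors the stop function contract described above):

```python
import random

def make_random_stop(def_stop, extra=0.2, rng=None):
    """Wrap a default stop criterion so that induction also stops,
    at random, in roughly `extra` of the cases where def_stop is false."""
    rng = rng or random.Random(0)
    def stop(examples, weightID, contingency):
        # Stop when the default criterion says so, or by chance.
        return def_stop(examples, weightID, contingency) or rng.random() < extra
    return stop
```

With a def_stop that never fires, the wrapper stops in about 20% of the calls, which is what the first tree in the example relies on.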
countNodes(tree) returns the number of nodes of tree.
- tree
- The tree for which to count the nodes.
countLeaves(tree) returns the number of leaves in the tree.
- tree
- The tree for which to count the leaves.
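The logic of these two functions can be sketched over a toy tree structure (a hypothetical Node class with a branches list; in Orange the nodes are TreeNode objects, but the recursion is the same idea):

```python
class Node:
    def __init__(self, branches=None):
        self.branches = branches or []   # empty list for leaves

def count_nodes(node):
    # Every node counts itself plus all nodes in its subtrees.
    return 1 + sum(count_nodes(b) for b in node.branches if b)

def count_leaves(node):
    # A node with no branches is a leaf.
    if not node.branches:
        return 1
    return sum(count_leaves(b) for b in node.branches if b)
```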
Printing the Tree
dumpTree dumps a tree to a string, and printTree prints out the tree (printTxt is an alias for printTree, kept for compatibility). The functions have the same arguments.
Before we go on: you can read all about the function and use it to its full extent, or you can just call it, giving it the tree as the sole argument and it will print out the usual textual representation of the tree. If you're satisfied with that, you can stop here.
- tree
- The tree to be printed out.
- leafStr
- The format string for printing the tree leaves. If left empty, "%V (%^.2m%)" will be used for classification trees and "%V" for regression trees.
- nodeStr
- The string for printing out the internal nodes. If left empty (as it is by default), no data is printed out for internal nodes. If set to ".", the same string is used as for leaves.
- maxDepth
- If set, it limits the depth to which the tree is printed out.
- minExamples
- If set, the subtrees with fewer than the given number of examples are not printed.
- simpleFirst
- If True (default), the branches with a single node are printed before the branches with larger subtrees. If you set it to False, the branches are printed in order of appearance.
- userFormats
- A list of regular expressions and callback functions through which the user can print out other specific information in the nodes.
The magic is in the format string. It is a string which is printed out at every leaf or internal node, with certain format specifiers replaced by data from the tree node. Specifiers are generally of the form %[^]<precision><quantity><divisor>.
^ at the start tells that the number should be multiplied by 100. It's useful for printing proportions like percentages, but it makes no sense to multiply, say, the number of examples at the node (although the function will allow it).
<precision> is in the same format as in Python (or C) string formatting. For instance,
%N denotes the number of examples in the node, hence
%6.2N would mean output to two decimal digits and six places altogether. If left out, a default format
5.3 is used, unless you multiply the numbers by 100, in which case the default is
.0 (no decimals, the number is rounded to the nearest integer).
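Since the precision behaves like Python's own %-formatting, a quick sanity check in plain Python (not the orngTree code itself) shows what the two defaults do:

```python
# "%6.2N" in the tree format string behaves like Python's "%6.2f":
# six places altogether, two of them decimals.
formatted = "%6.2f" % 3.14159    # -> "  3.14"

# The default when the number is multiplied by 100 is ".0":
# no decimals, rounded to the nearest integer.
percent = "%.0f" % 94.23         # -> "94"
```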
<divisor> tells what to divide the quantity in that node by. bP means division by the same quantity in the parent node; for instance, %NbP will give the number of examples in the node divided by the number of examples in the parent node. You can combine this with precision formatting, e.g. %6.2NbP. bA is division by the same quantity over the entire data set, so %NbA will tell you the proportion of examples (out of the entire training data set) that fell into that node. If division is impossible because the parent node does not exist or some data is missing, a dot is printed out instead of the quantity.
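The divisor logic above can be sketched in a few lines of plain Python (a toy illustration with a hypothetical render_quantity helper, not the orngTree implementation):

```python
def render_quantity(n, by_value=None, percent=False, fmt=None):
    """Divide n by by_value if a divisor is given; print '.' when
    division is impossible (e.g. the parent node does not exist)."""
    if by_value is not None:
        if not by_value:              # missing or empty divisor -> a dot
            return "."
        n = n / float(by_value)
    if percent:                       # the ^ modifier: multiply by 100
        n *= 100
        fmt = fmt or ".0"             # default when multiplying by 100
    fmt = fmt or "5.3"                # general default precision
    return ("%" + fmt + "f") % n
```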
<quantity> is the only required element. It defines what to print. For instance,
%N would print out the number of examples in the node. Possible quantities are
- V
- The value predicted at that node. You cannot define the precision or divisor here.
- N
- The number of examples in the node.
- M
- The number of examples in the majority class (that is, the class predicted by the node).
- m
- The proportion of examples in the majority class.
- A
- The average class for examples in the node; this is available only for regression trees.
- E
- Standard error for the class of examples in the node; available for regression trees.
- I
- Prints out the confidence interval. The modifier is used as, e.g., %I(95) or (more complicated) %5.3I(95)bP.
- C
- The number of examples in the given class. For classification trees, this modifier is used as, for instance, %5.3C="Iris-virginica"bP - this will give the number of examples of Iris-virginica divided by the number of examples of this class in the parent node. If you are interested in examples that are not Iris-virginica, say %5.3CbP!="Iris-virginica". For regression trees, you can use operators =, !=, <, <=, >, and >=, as in %C<22 - add the precision and divisor if you will. You can also check the number of examples in a certain interval: %C[20, 22] will give you the number of examples between 20 and 22 (inclusive) and %C(20, 22) will give the number of such examples excluding the boundaries. You can of course mix the parentheses, e.g. %C(20, 22]. If you would like the examples outside the interval, add a !, as in %C![20, 22].
- c
- Same as above, except that it computes the proportion of the class instead of the number of examples.
- D
- Prints out the number of examples in each class. You can use both precision (it is applied to each number in the distribution) and the divisor. This quantity cannot be computed for regression trees.
- d
- Same as above, except that it shows proportions of examples. This again doesn't work with regression trees.
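The interval forms of C described above boil down to a small counting routine; a plain-Python sketch (a hypothetical helper, not the orngTree code):

```python
def count_in_class_interval(values, lo, hi, lo_open=False, hi_open=False,
                            negate=False):
    """Count class values inside [lo, hi], with open bounds for the
    '(' / ')' forms; negate=True mimics the leading '!'."""
    def inside(v):
        above = v > lo if lo_open else v >= lo
        below = v < hi if hi_open else v <= hi
        return above and below
    return sum(1 for v in values if inside(v) != negate)
```

For example, over the values [19, 20, 21, 22, 23], the closed interval [20, 22] contains three values, its negation two, and the open interval (20, 22) just one.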
Now for some examples. We shall build a small tree from the iris data set - we shall limit the depth to three levels.
part of orngTree1.py
The easiest way to call the function is to pass the tree as the only argument. Calling
orngTree.printTree(tree) will print
Let's now print out the predicted class at each node, and the number of examples in the majority class together with the total number of examples in the node, by calling
orngTree.printTree(tree, leafStr="%V (%M out of %N)").
Would you like to know how the number of examples declines as compared to the entire data set and to the parent node? We find it with this:
orngTree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)")
Let us first read the format string.
%M is the number of examples in the majority class. We want it divided by the number of all examples from this class on the entire data set, hence
%MbA. To have it multipied by 100, we say
%^MbA. The percent sign after that is just printed out literally, just as the comma and parentheses (see the output). The string for showing the proportion of this class in the parent is the same except that we have
bP instead of bA.
And now for the output: all examples of setosa fall into the first node. For versicolor, we have 98% in one node; the rest is certainly not in the neighbouring node (petal length>=5.350), since all versicolors from the node petal width<1.750 went to petal length<5.350 (we know this from the 100% in that line). Virginica is the majority class in the three nodes that together contain 94% of this class (4+4+86). The rest must have gone to the same node as versicolor.
If you find this guesswork annoying - so do I. Let us print out the number of versicolors in each node, together with the proportion of versicolors among the examples in this particular node and among all versicolors. So,
'%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'
gives the following output:
Finally, we may want to print out the distributions, using the simple string %D.
What is the order of numbers here? If you check
data.domain.classVar.values, you'll learn that the order is setosa, versicolor, virginica; so in the node at petal length<5.350 we have 49 versicolors and 3 virginicae. To print out the proportions, we can use, for instance, %.2d - this gives us the proportions within the node, rounded to two decimals.
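What %.2d computes for that node is easy to check by hand in plain Python:

```python
# Class counts in the node petal length<5.350, in the order
# setosa, versicolor, virginica:
counts = [0, 49, 3]
total = float(sum(counts))

# Proportions within the node, rounded to two decimals (what %.2d shows).
proportions = [round(c / total, 2) for c in counts]   # -> [0.0, 0.94, 0.06]
```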
We haven't yet tried printing out information for internal nodes. To start with the most trivial case, we shall print the prediction at each node; nodeStr should thus be the same as leafStr (not very useful here, since leafStr is trivial anyway).
Note that the output is somewhat different now: another node called root appears and the tree looks one level deeper. This is needed so the data for that node can be printed out too.
Now for something more complicated: let us observe how the number of virginicas decreases down the tree:
Let's first interpret the format string:
CbA="Iris-virginica" is the number of examples from class virginica, divided by the total number of examples in this class. Add
^.1 and the result will be multiplied and printed with one decimal. The trailing
% is printed out. In parentheses we print the same thing except that we divide by the examples in the parent node. Note the use of single quotes, so we can use the double quotes inside the string, when we specify the class.
See what's in the parentheses in the root node? If printTree cannot compute something (in this case it's because the root has no parent), it prints out a dot. You can also replace = with !=, and it will count all classes except virginica.
For one final example with classification trees, we shall print the distributions in the nodes, the distribution compared to the parent and the proportions compared to the parent (the latter two are not the same - think about why). In the leaves we shall also add the predicted class. So now we'll have to call the function like this.
Here's the result:
To explore the possibilities when printing regression trees, we will induce a tree from the housing data set. Called with the tree as the only argument,
printTree prints the tree like this:
Let us add the standard error in both internal nodes and leaves, and the 90% confidence intervals in the leaves. So we want to call it like this:
What's the difference between %V, the predicted value, and %A, the average? Doesn't a regression tree always predict the leaf average anyway? Not necessarily: the tree predicts whatever the nodeClassifier in a leaf returns. But you're mostly right. The difference is in the number of decimals: %V uses the FloatVariable's function for printing out the value, which makes the printed number have the same number of decimals as in the original file from which the data was read.
Regression trees cannot print the distributions in the same way as classification trees. They instead offer a set of operators for observing the number of examples within a certain range. For instance, let us check the number of examples with values below 22, and compare this number with values in the parent nodes.
The last line, for instance, says that the proportion of examples with the class below 22, among those with tax above 534, is 30 times as high as in its parent node.
For another exercise, let's count the same for all examples outside the interval [20, 22] (given like this, the interval includes the bounds). And let us print out the proportions as percentages.
OK, let's observe the format string for one last time.
%c![20, 22] would be the proportion of examples (within the node) whose values are below 20 or above 22. By
%cbP![20, 22] we derive this by the same statistics computed on the parent. Add a
^ and you have the percentages.
Defining Your Own Printout Functions
The argument userFormats can be used to print out other information in the leaves or nodes. If provided, userFormats should contain a list of tuples with a regular expression and a callback function to be called when that expression is found in the format string. Expressions from userFormats are checked before the built-in expressions discussed above, so you can override the built-ins if you want to.
The regular expression should describe a string like those we used above, for instance the string %.2DbP. When a leaf or internal node is printed out, the format string (leafStr or nodeStr) is checked for these regular expressions, and when a match is found, the corresponding callback function is called.
The callback function will get five arguments: the format string (leafStr or nodeStr), the match object, the node which is being printed, its parent (can be None) and the tree classifier. The function should return the format string in which the part described by the match object (that is, the part that is matched by the regular expression) is replaced by whatever information your callback function is supposed to give.
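A minimal sketch of this contract (the "%X" specifier, replace_X callback and the constant it substitutes are all made up for illustration; insert_str reimplements the behaviour of the module's insertStr helper):

```python
import re

def insert_str(s, mo, sub):
    """Replace the part of s covered by match object mo with sub."""
    return s[:mo.start()] + sub + s[mo.end():]

# A hypothetical user format: replace every "%X" with fixed node info.
pattern = re.compile("%X")

def replace_X(strg, mo, node, parent, tree):
    # A real callback would inspect `node` (and maybe `parent`/`tree`);
    # here we substitute a constant just to show the five-argument contract.
    return insert_str(strg, mo, "<node-info>")
```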
The function can use several utility functions provided in the module.
- insertStr(s, mo, sub)
- Replaces the part of s which is covered by mo with the string sub.
- insertDot(s, mo)
- Calls insertStr(s, mo, "."). You should use this when the function cannot compute the desired quantity; it is called, for instance, when it needs to divide by something in the parent, but the parent doesn't exist.
- insertNum(s, mo, n)
- Replaces the part of s covered by mo with the number n, formatted as specified by the user: it multiplies it by 100 if needed, and prints it with the right number of places and decimals. It does so by checking mo for a group representing the multiplier (the ^ in the format string) and a group named fs representing the part that gives the number of places and decimals (e.g. 5.3).
- byWhom(by, parent, tree)
- If by equals bP, it returns parent; otherwise it returns tree.tree. This is used to find what to divide the quantity by when division is required.
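The described behaviour is a one-liner; a plain-Python sketch (assuming, as the text says, that the classifier exposes its root node as tree.tree):

```python
def by_whom(by, parent, tree):
    """Pick the node whose statistics serve as the divisor:
    the parent for 'bP', the whole tree's root otherwise."""
    return parent if by == "bP" else tree.tree
```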
There are also a few pieces of regular expressions that you may want to reuse. The two you are likely to use are:
- fs
- Defines the multiplier by 100 (^) and the format for the number of places and decimals (e.g. 5.3).
- by
- Matches the divisor, bP or bA, or nothing; the result is in a group named by.
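A sketch of what such reusable pieces might look like (the exact patterns in orngTree may differ; the group name m100 is an assumption, while fs and by come from the text above):

```python
import re

# Hypothetical reconstructions of the two reusable pieces:
fs = r"(?P<m100>\^?)(?P<fs>\d*\.?\d*)"   # optional ^, then places.decimals
by = r"(?P<by>b[PA])?"                   # optional divisor: bP or bA

# Composed the same way as the built-in "%"+fs+"M"+by expression:
pattern = re.compile("%" + fs + "M" + by)
mo = pattern.match("%^.1MbA")
```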
For a trivial example, "%V" is implemented like this. The list of built-in formats contains the tuple (re.compile("%V"), replaceV). The function replaceV takes the value predicted at the node (node.nodeClassifier.defaultValue), converts it to a string and passes it to insertStr to do the replacement.
A more complex regular expression is the one for the proportion of majority class, defined as
"%"+fs+"M"+by. It uses the two partial expressions defined above.
Let's say we would like to print the classification margin for each node, that is, the difference between the proportions of the largest and the second largest class in the node.
part of orngTree2.py
We first defined getMargin, which gets the distribution and computes the margin. The callback, replaceB, computes the margin for the node. If we need to divide the quantity by something (that is, if the by group is present), we call orngTree.byWhom to get the node by whose margin this node's margin is to be divided. If this node (usually the parent) does not exist, or if its margin is zero, we call insertDot to insert a dot; otherwise we call insertNum, which will insert the number, obeying the format specified by the user.
myFormat is a list containing the regular expression and the callback function.
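The margin computation itself can be sketched in plain Python (the orngTree version of getMargin works on Orange distribution objects; here we use a plain list of class counts):

```python
def get_margin(dist):
    """Classification margin of a node: the difference between the
    proportions of the two most frequent classes in `dist`."""
    total = float(sum(dist))
    if not total:
        return 0.0
    props = sorted((c / total for c in dist), reverse=True)
    second = props[1] if len(props) > 1 else 0.0
    return props[0] - second
```

For the node with class counts [0, 49, 3] from the earlier example, the margin is (49-3)/52.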
We can now print out the iris tree, for instance using the following call.
And this is what we get.
Plotting the Tree using Dot
printDot prints the tree to a file in a format used by GraphViz.
It uses the same parameters as printTxt defined above, plus two parameters which define the shapes used for the internal nodes and the leaves of the tree:
- leafShape
- Shape of the outline around the leaves of the tree. If "plaintext", no outline is used (default: "plaintext").
- nodeShape
- Shape of the outline around the internal nodes of the tree. If "plaintext", no outline is used (default: "box").
Check Polygon-based Nodes for various outlines supported by GraphViz.
Suppose you saved the tree in a file tree5.dot. You can then print it out as a gif by executing the following command line:
dot -Tgif tree5.dot -otree5.gif
E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories, Murray Hill NJ, U.S.A., October 1993.
Graphviz - open source graph drawing software. The home page of AT&T's dot and similar software packages.