Changeset 7407:a15246b89496 in orange


Timestamp:
02/04/11 11:39:46 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
9cd19a17ae7f421e6e73d664fa9587080db4f50e
Message:

Trees slowly progressing.

File:
1 edited

  • orange/Orange/classification/tree.py

    r7359 r7407  
    917917    error as defined in (Bratko, 2002). 
    918918 
    919     .. attributes:: m 
     919    .. attribute:: m 
    920920 
    921921        Parameter m for m-estimation. 
     
    14251425 
    14261426.. autofunction:: countLeaves 
     1427 
     1428Printing the Tree 
     1429================= 
     1430 
     1431The included printing functions can 
     1432print out practically anything you'd like to 
     1433know, from the number of examples, proportion of examples of majority 
     1434class in nodes and similar, to more complex statistics like the 
     1435proportion of examples in a particular class divided by the proportion 
     1436of examples of this class in a parent node. And even more, you can 
     1437define your own callback functions to be used for printing. 
     1438 
     1439 
     1440.. autofunction:: dumpTree 
     1441 
     1442.. autofunction:: printTree 
     1443 
     1444.. autofunction:: printTxt 
     1445 
     1446Before we go on: you can read all about the function and use it to its 
     1447full extent, or you can just call it, giving it the tree as the sole 
     1448argument and it will print out the usual textual representation of the 
     1449tree. If you're satisfied with that, you can stop here. 
     1450 
     1451The magic is in the format string. It is a string which is printed 
     1452out at every leaf or internal node with certain format specifiers 
     1453replaced by data from the tree node. Specifiers are generally of the form 
     1454**%[^]<precision><quantity><divisor>**. 
     1455 
     1456**^** at the start tells that the number should be multiplied by 100. 
     1457It's useful for printing proportions like percentages. 
     1458 
     1459**<precision>** is in the same format as in Python (or C) string 
     1460formatting. For instance, :samp:`%N` denotes the number of examples in the node, 
     1461hence :samp:`%6.2N` would mean output to two decimal digits and six places 
     1462altogether. If left out, a default format :samp:`5.3` is used, unless you  
     1463multiply the numbers by 100, in which case the default is :samp:`.0` 
     1464(no decimals, the number is rounded to the nearest integer). 
     1465 
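As a rough illustration of the precision rules above, the following sketch
formats a number the way the text describes. The helper name
``format_quantity`` and its arguments are made up for this example, and it
assumes that the precision is simply handed to Python's ``%f`` conversion;
it is not the module's own code.

```python
def format_quantity(value, precision=None, percent=False):
    """Format a number following the rules described above (illustrative only)."""
    if percent:
        # A leading ^ multiplies the number by 100 ...
        value = value * 100.0
    if precision is None:
        # ... and changes the default precision from 5.3 to .0.
        precision = ".0" if percent else "5.3"
    return ("%" + precision + "f") % value


print(format_quantity(50))                    # default 5.3 gives 50.000
print(format_quantity(52, precision="6.2"))   # like %6.2N
print(format_quantity(0.98, percent=True))    # like %^MbA gives 98
```
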
     1466**<divisor>** tells what to divide the quantity in that node with. 
     1467:samp:`bP` means division by the same quantity in the parent node; for instance, 
     1468:samp:`%NbP` will give the number of examples in the node divided by the 
     1469number of examples in the parent node. You can also use precision formatting, 
     1470e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the entire data  
     1471set, so :samp:`%NbA` will tell you the proportion of examples (out of the entire 
     1472training data set) that fell into that node. If division is impossible 
     1473since the parent node does not exist or some data is missing, a dot is 
     1474printed out instead of the quantity. 
     1475 
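The divisor rule can be pictured with a few lines of plain Python. This is
only a sketch: ``divide_quantity`` is a hypothetical helper, and the node
sizes are taken from the iris tree used in the examples later in this
section (52 examples in the leaf, 54 in its parent, 150 in the data set).

```python
def divide_quantity(value, reference):
    """Return value/reference as text, or a dot when there is no reference."""
    if reference is None:
        # No parent node (or missing data): print a dot instead of a number.
        return "."
    return "%5.3f" % (float(value) / reference)


print(divide_quantity(52, 54))     # like %NbP for the leaf
print(divide_quantity(52, 150))    # like %NbA for the leaf
print(divide_quantity(150, None))  # like %NbP for the root
```
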
     1476**<quantity>** is the only required element. It defines what to print. 
     1477For instance, :samp:`%N` would print out the number of examples in the node. 
     1478Possible quantities are 
     1479 
     1480:samp:`V` 
     1481    The value predicted at that node. You cannot define the precision  
     1482    or divisor here. 
     1483 
     1484:samp:`N` 
     1485    The number of examples in the node. 
     1486 
     1487:samp:`M` 
     1488    The number of examples in the majority class (that is, the class  
     1489    predicted by the node). 
     1490 
     1491:samp:`m` 
     1492    The proportion of examples in the majority class. 
     1493 
     1494:samp:`A` 
     1495    The average class for examples in the node; this is available only for  
     1496    regression trees. 
     1497 
     1498:samp:`E` 
     1499    Standard error for class of examples in the node; available for 
     1500    regression trees. 
     1501 
     1502:samp:`I` 
     1503    Print out the confidence interval. The modifier is used as  
     1504    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`. 
     1505 
     1506:samp:`C` 
     1507    The number of examples in the given class. For classification trees,  
     1508    this modifier is used as in, for instance, :samp:`%5.3CbP="Iris-virginica"`;  
     1509    this gives the number of examples of Iris-virginica divided by the  
     1510    number of examples of this class in the parent node. If you are  
     1511    interested in examples that are *not* Iris-virginica, say  
     1512    :samp:`%5.3CbP!="Iris-virginica"`. 
     1513 
     1514    For regression trees, you can use operators =, !=, <, <=, >, and >=,  
     1515    as in :samp:`%C<22` - add the precision and divisor if you will. You can also 
     1516    check the number of examples in a certain interval: :samp:`%C[20, 22]` 
     1517    will give you the number of examples between 20 and 22 (inclusive) 
     1518    and :samp:`%C(20, 22)` will give the number of such examples excluding the 
     1519    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`. 
     1520    If you would like the examples outside the interval, add a :samp:`!`, 
     1521    like :samp:`%C!(20, 22]`. 
     1522  
     1523:samp:`c` 
     1524    Same as above, except that it computes the proportion of the class 
     1525    instead of the number of examples. 
     1526 
     1527:samp:`D` 
     1528    Prints out the number of examples in each class. You can use both 
     1529    precision (it is applied to each number in the distribution) and the 
     1530    divisor. This quantity cannot be computed for regression trees. 
     1531 
     1532:samp:`d` 
     1533    Same as above, except that it shows proportions of examples. This 
     1534    again doesn't work with regression trees. 
     1535 
     1536<user defined formats> 
     1537    You can add more, if you will. Instructions and examples are given at 
     1538    the end of this section. 
     1539 
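To make the shape of a specifier more concrete, here is a small sketch that
picks one apart with a regular expression. The pattern and the group names
are assumptions made for this illustration, and only the simple quantities
``N``, ``M`` and ``m`` are covered; the module's own expressions may differ.

```python
import re

# Hypothetical pattern for %[^]<precision><quantity><divisor>.
SPECIFIER = re.compile(
    r"%(?P<percent>\^?)"          # optional ^: multiply by 100
    r"(?P<precision>\d*\.?\d*)"   # optional precision such as 6.2 or .0
    r"(?P<quantity>[NMm])"        # the required quantity letter
    r"(?P<divisor>b[PA])?"        # optional bP (parent) or bA (all data)
)


def parse_specifier(text):
    """Split one specifier into its parts, or return None if it does not match."""
    mo = SPECIFIER.match(text)
    if mo is None:
        return None
    return {
        "percent": mo.group("percent") == "^",
        "precision": mo.group("precision") or None,
        "quantity": mo.group("quantity"),
        "divisor": mo.group("divisor"),
    }


print(parse_specifier("%6.2NbP"))
print(parse_specifier("%^MbA"))
```
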
     1540 
     1541Examples 
     1542======== 
     1543 
     1544We shall build a small tree from the iris data set - we shall limit the 
     1545depth to three levels. 
     1546 
     1547Part of orngTree1.py:: 
     1548 
     1549    import orange, orngTree 
     1550    data = orange.ExampleTable("iris") 
     1551    tree = orngTree.TreeLearner(data, maxDepth=3) 
     1552 
     1553The easiest way to call the function is to pass the tree as the only  
     1554argument:: 
     1555 
     1556    >>> orngTree.printTree(tree) 
     1557    petal width<0.800: Iris-setosa (100.00%) 
     1558    petal width>=0.800 
     1559    |    petal width<1.750 
     1560    |    |    petal length<5.350: Iris-versicolor (94.23%) 
     1561    |    |    petal length>=5.350: Iris-virginica (100.00%) 
     1562    |    petal width>=1.750 
     1563    |    |    petal length<4.850: Iris-virginica (66.67%) 
     1564    |    |    petal length>=4.850: Iris-virginica (100.00%) 
     1565 
     1566Let's now print out the predicted class at each node, the number 
     1567of examples in the majority class with the total number of examples 
     1568in the node:: 
     1569 
     1570    >>> orngTree.printTree(tree, leafStr="%V (%M out of %N)") 
     1571    petal width<0.800: Iris-setosa (50.000 out of 50.000) 
     1572    petal width>=0.800 
     1573    |    petal width<1.750 
     1574    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000) 
     1575    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000) 
     1576    |    petal width>=1.750 
     1577    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000) 
     1578    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000) 
     1579 
     1580Would you like to know how the number of examples declines as 
     1581compared to the entire data set and to the parent node? We find 
     1582it with this:: 
     1583 
     1584    >>> orngTree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)") 
     1585    petal width<0.800: Iris-setosa (100%, 100%) 
     1586    petal width>=0.800 
     1587    |    petal width<1.750 
     1588    |    |    petal length<5.350: Iris-versicolor (98%, 100%) 
     1589    |    |    petal length>=5.350: Iris-virginica (4%, 40%) 
     1590    |    petal width>=1.750 
     1591    |    |    petal length<4.850: Iris-virginica (4%, 4%) 
     1592    |    |    petal length>=4.850: Iris-virginica (86%, 96%) 
     1593 
     1594Let us first read the format string. :samp:`%M` is the number of  
     1595examples in the majority class. We want it divided by the number of 
     1596all examples from this class on the entire data set, hence :samp:`%MbA`. 
     1597To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after* 
     1598that is just printed out literally, just as the comma and parentheses 
     1599(see the output). The string for showing the proportion of this class 
     1600in the parent is the same except that we have :samp:`bP` instead  
     1601of :samp:`bA`. 
     1602 
     1603And now for the output: all examples of setosa fall into the first node. 
     1604For versicolor, we have 98% in one node; the rest is certainly 
     1605not in the neighbouring node (petal length>=5.350) since all 
     1606versicolors from the node petal width<1.750 went to petal length<5.350 
     1607(we know this from the 100% in that line). Virginica is the  
     1608majority class in the three nodes that together contain 94% of this 
     1609class (4+4+86). The rest must have gone to the same node as versicolor. 
     1610 
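The percentages in that printout can be checked by hand. The sketch below
redoes the arithmetic in plain Python, using the class counts that the
``%D`` printouts further down report for the same tree; the function name
``majority_shares`` is invented for this illustration.

```python
# Class counts in the order setosa, versicolor, virginica.
ROOT = [50, 50, 50]
LEAVES = [
    # (description, leaf counts, parent counts)
    ("petal width<0.800", [50, 0, 0], ROOT),
    ("petal length<5.350", [0, 49, 3], [0, 49, 5]),
    ("petal length>=5.350", [0, 0, 2], [0, 49, 5]),
    ("petal length<4.850", [0, 1, 2], [0, 1, 45]),
    ("petal length>=4.850", [0, 0, 43], [0, 1, 45]),
]


def majority_shares(leaf, parent, root=ROOT):
    """Return the %^MbA and %^MbP values of a leaf as rounded percentages."""
    cls = leaf.index(max(leaf))                 # index of the majority class
    by_all = 100.0 * leaf[cls] / root[cls]      # share of the class in the data
    by_parent = 100.0 * leaf[cls] / parent[cls] # share of the class in the parent
    return "%.0f" % by_all, "%.0f" % by_parent


for name, leaf, parent in LEAVES:
    print(name, majority_shares(leaf, parent))
```
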
     1611If you find this guesswork annoying - so do I. Let us print out the 
     1612number of versicolors in each node, together with the proportion of 
     1613versicolors among the examples in this particular node and among all 
     1614versicolors. So, 
     1615 
     1616:: 
     1617 
     1618    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)' 
     1619 
     1620gives the following output:: 
     1621 
     1622    petal width<0.800: 0.000 (0% of node, 0% of versicolors) 
     1623    petal width>=0.800 
     1624    |    petal width<1.750 
     1625    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors) 
     1626    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors) 
     1627    |    petal width>=1.750 
     1628    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors) 
     1629    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors) 
     1630 
     1631Finally, we may want to print out the distributions, using a simple  
     1632string :samp:`%D`:: 
     1633 
     1634    petal width<0.800: [50.000, 0.000, 0.000] 
     1635    petal width>=0.800 
     1636    |    petal width<1.750 
     1637    |    |    petal length<5.350: [0.000, 49.000, 3.000] 
     1638    |    |    petal length>=5.350: [0.000, 0.000, 2.000] 
     1639    |    petal width>=1.750 
     1640    |    |    petal length<4.850: [0.000, 1.000, 2.000] 
     1641    |    |    petal length>=4.850: [0.000, 0.000, 43.000] 
     1642 
     1643What is the order of numbers here? If you check  
     1644:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,  
     1645versicolor, virginica; so in the node at petal length<5.350 we have 49 
     1646versicolors and 3 virginicae. To print out the proportions, we can use  
     1647:samp:`%.2d` - this gives us proportions within the node, rounded to  
     1648two decimals:: 
     1649 
     1650    petal width<0.800: [1.00, 0.00, 0.00] 
     1651    petal width>=0.800 
     1652    |    petal width<1.750 
     1653    |    |    petal length<5.350: [0.00, 0.94, 0.06] 
     1654    |    |    petal length>=5.350: [0.00, 0.00, 1.00] 
     1655    |    petal width>=1.750 
     1656    |    |    petal length<4.850: [0.00, 0.33, 0.67] 
     1657    |    |    petal length>=4.850: [0.00, 0.00, 1.00] 
     1658 
     1659We haven't tried printing out any information for internal nodes. 
     1660To start with the most trivial case, we shall print the prediction 
     1661at each node. 
     1662 
     1663:: 
     1664 
     1665    orngTree.printTree(tree, leafStr="%V", nodeStr=".") 
     1666     
     1667says that the nodeStr should be the same as leafStr (not very useful  
     1668here, since leafStr is trivial anyway). 
     1669 
     1670::  
     1671 
     1672    root: Iris-setosa 
     1673    |    petal width<0.800: Iris-setosa 
     1674    |    petal width>=0.800: Iris-versicolor 
     1675    |    |    petal width<1.750: Iris-versicolor 
     1676    |    |    |    petal length<5.350: Iris-versicolor 
     1677    |    |    |    petal length>=5.350: Iris-virginica 
     1678    |    |    petal width>=1.750: Iris-virginica 
     1679    |    |    |    petal length<4.850: Iris-virginica 
     1680    |    |    |    petal length>=4.850: Iris-virginica 
     1681 
     1682Note that the output is somewhat different now: there appeared another 
     1683node called *root* and the tree looks one level deeper. This is 
     1684needed to print out the data for that node too. 
     1685 
     1686Now for something more complicated: let us observe how the number 
     1687of virginicas decreases down the tree:: 
     1688 
     1689    orngTree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.') 
     1690 
     1691Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is  
     1692the number of examples from class virginica, divided by the total number 
     1693of examples in this class. Add :samp:`^.1` and the result will be 
     1694multiplied by 100 and printed with one decimal. The trailing :samp:`%` is printed 
     1695out. In parentheses we print the same thing except that we divide by the 
     1696examples in the parent node. Note the use of single quotes, so we can 
     1697use the double quotes inside the string, when we specify the class. 
     1698 
     1699:: 
     1700 
     1701    root: 100.0% (.%) 
     1702    |    petal width<0.800: 0.0% (0.0%) 
     1703    |    petal width>=0.800: 100.0% (100.0%) 
     1704    |    |    petal width<1.750: 10.0% (10.0%) 
     1705    |    |    |    petal length<5.350: 6.0% (60.0%) 
     1706    |    |    |    petal length>=5.350: 4.0% (40.0%) 
     1707    |    |    petal width>=1.750: 90.0% (90.0%) 
     1708    |    |    |    petal length<4.850: 4.0% (4.4%) 
     1709    |    |    |    petal length>=4.850: 86.0% (95.6%) 
     1710 
     1711See what's in the parentheses in the root node? If :func:`printTree` 
     1712cannot compute something (in this case it's because the root has no parent), 
     1713it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it  
     1714will count all classes *except* virginica. 
     1715 
     1716For one final example with classification trees, we shall print the 
     1717distributions in the nodes, the distribution compared to the parent 
     1718and the proportions compared to the parent (the latter things are not 
     1719the same - think why). In the leaves we shall also add the predicted 
     1720class. So now we'll have to call the function like this. 
     1721 
     1722:: 
     1723 
     1724    >>> orngTree.printTree(tree, leafStr='%V   %D %.2DbP %.2dbP', nodeStr='%D %.2DbP %.2dbP') 
     1725    root: [50.000, 50.000, 50.000] . . 
     1726    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]: 
     1727    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00] 
     1728    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50] 
     1729    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19] 
     1730    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]: 
     1731    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62] 
     1732    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]: 
     1733    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80] 
     1734    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96] 
     1735    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]: 
     1736    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68] 
     1737    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]: 
     1738    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02] 
     1739 
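Why ``%DbP`` and ``%dbP`` differ is easiest to see on one node. The sketch
below recomputes both for the node petal width<1.750, whose counts and
parent counts are taken from the printout above. The helper names are made
up, and treating a class that is absent from the parent as 0 is an
assumption that merely matches what the printout shows.

```python
def safe_div(a, b):
    # Assumption: a class absent from the parent is shown as 0, as the
    # printout above does for setosa below the first split.
    return a / float(b) if b else 0.0


def counts_by_parent(node, parent):
    """%DbP: each class count divided by the same class count in the parent."""
    return [safe_div(n, p) for n, p in zip(node, parent)]


def proportions_by_parent(node, parent):
    """%dbP: each class proportion divided by that proportion in the parent."""
    node_total, parent_total = float(sum(node)), float(sum(parent))
    return [safe_div(n / node_total, p / parent_total)
            for n, p in zip(node, parent)]


node, parent = [0, 49, 5], [0, 50, 50]   # petal width<1.750 and its parent
print(["%.2f" % x for x in counts_by_parent(node, parent)])
print(["%.2f" % x for x in proportions_by_parent(node, parent)])
```

The first list depends only on how many examples of each class reached the
node, while the second also accounts for the node being smaller than its
parent, which is why its values can exceed 1.
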
     1740To explore the possibilities when printing regression trees, we are going  
     1741to induce a tree from the housing data set. Called with the tree as the 
     1742only argument, :func:`printTree` prints the tree like this:: 
     1743 
     1744    RM<6.941 
     1745    |    LSTAT<14.400 
     1746    |    |    DIS<1.385: 45.6 
     1747    |    |    DIS>=1.385: 22.9 
     1748    |    LSTAT>=14.400 
     1749    |    |    CRIM<6.992: 17.1 
     1750    |    |    CRIM>=6.992: 12.0 
     1751    RM>=6.941 
     1752    |    RM<7.437 
     1753    |    |    CRIM<7.393: 33.3 
     1754    |    |    CRIM>=7.393: 14.4 
     1755    |    RM>=7.437 
     1756    |    |    TAX<534.500: 45.9 
     1757    |    |    TAX>=534.500: 21.9 
     1758 
     1759Let us add the standard error in both internal nodes and leaves, and the 
     176090% confidence intervals in the leaves:: 
     1761 
     1762    >>> orngTree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]") 
     1763    root: [SE: 0.409] 
     1764    |    RM<6.941: [SE: 0.306] 
     1765    |    |    LSTAT<14.400: [SE: 0.320] 
     1766    |    |    |    DIS<1.385: [SE: 4.420]: 
     1767    |    |    |        [SE: 4.420]   45.6 [38.331-52.829] 
     1768    |    |    |    DIS>=1.385: [SE: 0.244]: 
     1769    |    |    |        [SE: 0.244]   22.9 [22.504-23.306] 
     1770    |    |    LSTAT>=14.400: [SE: 0.333] 
     1771    |    |    |    CRIM<6.992: [SE: 0.338]: 
     1772    |    |    |        [SE: 0.338]   17.1 [16.584-17.691] 
     1773    |    |    |    CRIM>=6.992: [SE: 0.448]: 
     1774    |    |    |        [SE: 0.448]   12.0 [11.243-12.714] 
     1775    |    RM>=6.941: [SE: 1.031] 
     1776    |    |    RM<7.437: [SE: 0.958] 
     1777    |    |    |    CRIM<7.393: [SE: 0.692]: 
     1778    |    |    |        [SE: 0.692]   33.3 [32.214-34.484] 
     1779    |    |    |    CRIM>=7.393: [SE: 2.157]: 
     1780    |    |    |        [SE: 2.157]   14.4 [10.862-17.938] 
     1781    |    |    RM>=7.437: [SE: 1.124] 
     1782    |    |    |    TAX<534.500: [SE: 0.817]: 
     1783    |    |    |        [SE: 0.817]   45.9 [44.556-47.237] 
     1784    |    |    |    TAX>=534.500: [SE: 0.000]: 
     1785    |    |    |        [SE: 0.000]   21.9 [21.900-21.900] 
     1786 
     1787What's the difference between :samp:`%V`, the predicted value, and  
     1788:samp:`%A`, the average? Doesn't a regression tree always predict the 
     1789leaf average anyway? Not necessarily: the tree predicts whatever the 
     1790:attr:`TreeClassifier.nodeClassifier` in a leaf returns.  
     1791Also, :samp:`%V` uses the  
     1792:obj:`orange.FloatVariable`'s function for printing out the value,  
     1793so the printed number has the same number of decimals  
     1794as in the data file. 
     1795 
     1796Regression trees cannot print the distributions in the same way 
     1797as classification trees. They instead offer a set of operators for 
     1798observing the number of examples within a certain range. For instance, 
     1799let us check the number of examples with values below 22, and compare 
     1800this number with values in the parent nodes:: 
     1801 
     1802    >>> orngTree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".") 
     1803    root: 277.000 (.) 
     1804    |    RM<6.941: 273.000 (1.160) 
     1805    |    |    LSTAT<14.400: 107.000 (0.661) 
     1806    |    |    |    DIS<1.385: 0.000 (0.000) 
     1807    |    |    |    DIS>=1.385: 107.000 (1.020) 
     1808    |    |    LSTAT>=14.400: 166.000 (1.494) 
     1809    |    |    |    CRIM<6.992: 93.000 (0.971) 
     1810    |    |    |    CRIM>=6.992: 73.000 (1.040) 
     1811    |    RM>=6.941: 4.000 (0.096) 
     1812    |    |    RM<7.437: 3.000 (1.239) 
     1813    |    |    |    CRIM<7.393: 0.000 (0.000) 
     1814    |    |    |    CRIM>=7.393: 3.000 (15.333) 
     1815    |    |    RM>=7.437: 1.000 (0.633) 
     1816    |    |    |    TAX<534.500: 0.000 (0.000) 
     1817    |    |    |    TAX>=534.500: 1.000 (30.000) 
     1818 
     1819The last line, for instance, says that the proportion of examples with the 
     1820class below 22 among those with tax above 534 is 30 times higher 
     1821than the proportion of such examples in its parent node. 
     1822 
     1823For another exercise, let's count the same for all examples *outside* 
     1824interval [20, 22] (given like this, the interval includes the bounds). 
     1825And let us print out the proportions as percents. 
     1826 
     1827:: 
     1828 
     1829    >>> orngTree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".") 
     1830 
     1831OK, let's observe the format string for one last time. :samp:`%c![20, 22]` 
     1832would be the proportion of examples (within the node) whose values are 
     1833below 20 or above 22. By :samp:`%cbP![20, 22]` we divide this by the same 
     1834statistic computed on the parent. Add a :samp:`^` and you have the percentages. 
     1835 
     1836:: 
     1837 
     1838    root: 439.000 (.%) 
     1839    |    RM<6.941: 364.000 (98%) 
     1840    |    |    LSTAT<14.400: 200.000 (93%) 
     1841    |    |    |    DIS<1.385: 5.000 (127%) 
     1842    |    |    |    DIS>=1.385: 195.000 (99%) 
     1843    |    |    LSTAT>=14.400: 164.000 (111%) 
     1844    |    |    |    CRIM<6.992: 91.000 (96%) 
     1845    |    |    |    CRIM>=6.992: 73.000 (105%) 
     1846    |    RM>=6.941: 75.000 (114%) 
     1847    |    |    RM<7.437: 46.000 (101%) 
     1848    |    |    |    CRIM<7.393: 43.000 (100%) 
     1849    |    |    |    CRIM>=7.393: 3.000 (100%) 
     1850    |    |    RM>=7.437: 29.000 (98%) 
     1851    |    |    |    TAX<534.500: 29.000 (103%) 
     1852    |    |    |    TAX>=534.500: 0.000 (0%) 
     1853 
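The interval tests used in the last two calls boil down to a simple count.
The following sketch mimics them in plain Python; ``count_in_interval`` and
the list of class values are invented for illustration and have nothing to
do with the housing data.

```python
def count_in_interval(values, low, high, low_closed=True, high_closed=True,
                      negate=False):
    """Count values inside (or, with negate, outside) an interval."""
    def inside(v):
        above = v >= low if low_closed else v > low
        below = v <= high if high_closed else v < high
        return above and below

    # With negate=True only the values that fail the test are counted.
    return sum(1 for v in values if inside(v) != negate)


values = [19.5, 20.0, 21.0, 22.0, 23.1]   # made-up class values
print(count_in_interval(values, 20, 22))                    # like %C[20, 22]
print(count_in_interval(values, 20, 22, low_closed=False))  # like %C(20, 22]
print(count_in_interval(values, 20, 22, negate=True))       # like %C![20, 22]
```
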
     1854 
     1855Defining Your Own Printout Functions 
     1856==================================== 
     1857 
     1858:func:`dumpTree`'s argument :obj:`userFormats` can be used to print out 
     1859some other information in the leaves or nodes. If provided, 
     1860:obj:`userFormats` should contain a list of tuples with a regular expression 
     1861and a callback function to be called when that expression is found in the 
     1862format string. Expressions from :obj:`userFormats` are checked before 
     1863the built-in expressions discussed above, so you can override the built-ins 
     1864if you want to. 
     1865 
     1866The regular expression should describe a string like those we used above, 
     1867for instance the string :samp:`%.2DbP`. When a leaf or internal node 
     1868is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)  
     1869is checked for these regular expressions and when the match is found,  
     1870the corresponding callback function is called. 
     1871 
     1872The callback function will get five arguments: the format string  
     1873(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is 
     1874being printed, its parent (can be None) and the tree classifier. 
     1875The function should return the format string in which the part described 
     1876by the match object (that is, the part that is matched by the regular 
     1877expression) is replaced by whatever information your callback function 
     1878is supposed to give. 
     1879 
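The protocol described above can be tried out without a real tree. In the
sketch below a plain dictionary stands in for a node, and ``insert_str``,
``expand`` and both handlers are hypothetical stand-ins written for this
example, not the module's own functions. It shows the five-argument callback
signature and why a user format that is checked first overrides a built-in
one.

```python
import re


def insert_str(strg, mo, text):
    # Hypothetical stand-in for the module's insertStr helper: put text
    # where the regular expression matched.
    return strg[:mo.start()] + text + strg[mo.end():]


def builtin_n(strg, mo, node, parent, tree):
    # "Built-in" handler: number of examples with three decimals.
    return insert_str(strg, mo, "%5.3f" % node["examples"])


def user_n(strg, mo, node, parent, tree):
    # User handler for the same specifier: a plain integer instead.
    return insert_str(strg, mo, "%d" % node["examples"])


def expand(format_str, node, parent=None, tree=None, user_formats=()):
    """Replace every specifier; user formats are tried before built-ins."""
    builtin_formats = [(re.compile("%N"), builtin_n)]
    for regex, callback in list(user_formats) + builtin_formats:
        mo = regex.search(format_str)
        while mo:
            # The callback returns the whole string with the match replaced.
            format_str = callback(format_str, mo, node, parent, tree)
            mo = regex.search(format_str)
    return format_str


node = {"examples": 52}   # a plain dict standing in for a tree node
print(expand("%N examples", node))
print(expand("%N examples", node, user_formats=[(re.compile("%N"), user_n)]))
```
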
     1880The function can use several utility functions provided in the module. 
     1881 
     1882.. autofunction:: insertStr 
     1883 
     1884.. autofunction:: insertDot 
     1885 
     1886.. autofunction:: insertNum 
     1887 
     1888.. autofunction:: byWhom 
     1889 
     1890 
     1891There are also a few pieces of regular expression that you may want to reuse.  
     1892The two you are likely to use are: 
     1893 
     1894.. autodata:: fs 
     1895 
     1896<dt>fs</dt> 
     1897 
     1898<dt>by</dt> 
     1899<dd>Defines <code>bP</code> or <code>bA</code> or nothing; the result is in groups <code>by</code>.</dd> 
     1900</dl> 
     1901 
     1902<P>For a trivial example, "%V" is implemented like this. There is the following tuple in the list of built-in formats: <code>(re.compile("%V"), replaceV)</code>. <code>replaceV</code> is a function defined by:</P> 
     1903<xmp class="code">def replaceV(strg, mo, node, parent, tree): 
     1904    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))</xmp> 
     1905<P>It therefore takes the value predicted at the node (<code>node.nodeClassifier.defaultValue</code>), converts it to a string and passes it to <code>insertStr</code> to do the replacement.</P> 
     1906 
     1907<P>A more complex regular expression is the one for the proportion of majority class, defined as <code>"%"+fs+"M"+by</code>. It uses the two partial expressions defined above.</P> 
     1908 
     1909<P>Let's say we'd like to print the classification margin for each node, that is, the difference between the proportion of the largest and the second largest class in the node.</P> 
     1910 
     1911<p class="header">part of <a href="orngTree2.py">orngTree2.py</a></p> 
     1912<xmp class="code">def getMargin(dist): 
     1913    if dist.abs < 1e-30: 
     1914        return 0 
     1915    l = list(dist) 
     1916    l.sort() 
     1917    return (l[-1] - l[-2]) / dist.abs 
     1918 
     1919def replaceB(strg, mo, node, parent, tree): 
     1920    margin = getMargin(node.distribution) 
     1921 
     1922    by = mo.group("by") 
     1923    if margin and by: 
     1924        whom = orngTree.byWhom(by, parent, tree) 
     1925        if whom and whom.distribution: 
     1926            divMargin = getMargin(whom.distribution) 
     1927            if divMargin > 1e-30: 
     1928                margin /= divMargin 
     1929            else: 
     1930                return orngTree.insertDot(strg, mo) 
     1931        else: 
     1932            return orngTree.insertDot(strg, mo) 
     1933    return orngTree.insertNum(strg, mo, margin) 
     1934 
     1935 
     1936myFormat = [(re.compile("%"+orngTree.fs+"B"+orngTree.by), replaceB)]</xmp> 
     1937 
     1938<P>We first defined <code>getMargin</code>, which gets the distribution and computes the margin. The callback function, <code>replaceB</code>, computes the margin for the node. If we need to divide the quantity by something (that is, if the <code>by</code> group is present), we call <code>orngTree.byWhom</code> to get the node by whose margin this node's margin is to be divided. If this node (usually the parent) does not exist or if its margin is zero, we call <code>insertDot</code> to insert a dot; otherwise we call <code>insertNum</code>, which inserts the number in the format specified by the user.</P> 
     1939 
     1940<P><code>myFormat</code> is a list containing the regular expression and the callback function.</P> 
     1941 
     1942<P>We can now print out the iris tree, for instance using the following call.</P> 
     1943<xmp class="code">orngTree.printTree(tree, leafStr="%V %^B% (%^3.2BbP%)", userFormats = myFormat)</xmp> 
     1944 
     1945<P>And this is what we get.</P> 
     1946<xmp class="printout">petal width<0.800: Iris-setosa 100% (100.00%) 
     1947petal width>=0.800 
     1948|    petal width<1.750 
     1949|    |    petal length<5.350: Iris-versicolor 88% (108.57%) 
     1950|    |    petal length>=5.350: Iris-virginica 100% (122.73%) 
     1951|    petal width>=1.750 
     1952|    |    petal length<4.850: Iris-virginica 33% (34.85%) 
     1953|    |    petal length>=4.850: Iris-virginica 100% (104.55%) 
     1954</xmp> 
     1955 
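The numbers in this printout can be reproduced from the class counts alone.
The sketch below is a plain-Python illustration, not part of orngTree2.py:
it uses ordinary lists in place of Orange distributions, and the counts for
the leaf petal length<5.350 and its parent are those shown in the
distribution printouts earlier in this section.

```python
def margin(counts):
    """Difference between the two largest class proportions in a node."""
    total = float(sum(counts))
    if total == 0:
        return 0.0
    ordered = sorted(counts)
    return (ordered[-1] - ordered[-2]) / total


leaf, parent = [0, 49, 3], [0, 49, 5]    # petal length<5.350 and its parent
print("%.0f%%" % (100 * margin(leaf)))                    # like %^B%
print("%.2f%%" % (100 * margin(leaf) / margin(parent)))   # like %^3.2BbP%
```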
     1956 
     1957<h2>Plotting the Tree using Dot</h2> 
     1958 
     1959<p>Function <code>printDot</code> prints the tree to a file in a format used by <a 
     1960href="http://www.research.att.com/sw/tools/graphviz">GraphViz</a>. 
     1961It uses the same parameters as <code>printTxt</code> defined above, and 
     1962in addition two parameters which define the shape used for internal 
     1963nodes and leaves of the tree: 
     1964 
     1965<p class=section>Arguments</p> 
     1966<dl class=arguments> 
     1967  <dt>leafShape</dt> 
     1968  <dd>Shape of the outline around leaves of the tree. If "plaintext", 
     1969  no outline is used (default: "plaintext")</dd> 
     1970 
     1971  <dt>internalNodeShape</dt> 
     1972  <dd>Shape of the outline around internal nodes of the tree. If "plaintext", 
     1973  no outline is used (default: "box")</dd> 
     1974</dl> 
     1975 
     1976<p>Check <a 
     1977href="http://www.graphviz.org/doc/info/shapes.html">Polygon-based 
     1978Nodes</a> for various outlines supported by GraphViz.</p> 
     1979 
     1980<P>Suppose you saved the tree in a file <code>tree5.dot</code>. You can then convert it to a GIF by executing the following command: 
     1981<XMP class=code>dot -Tgif tree5.dot -otree5.gif 
     1982</XMP> 
     1983</P> 
     1984<P>GraphViz's dot has quite a few other output formats; check its documentation to learn which.</P> 
     1985 
    14271986 
    14281987 
     
    15582117""" 
    15592118 
    1560 The module also contains functions for counting the number of nodes 
    1561 and leaves in the tree. It also includes functions for  
    1562 printing out the tree, which are 
    1563 rather versatile and can print out practically anything you'd like to 
    1564 know, from the number of examples, proportion of examples of majority 
    1565 class in nodes and similar, to more complex statistics like the 
    1566 proportion of examples in a particular class divided by the proportion 
    1567 of examples of this class in a parent node. And even more, you can 
    1568 define your own callback functions to be used for printing. 
    1569  
    15702119<P>For a slightly more complex example, here's how to write your own stop function. The example itself is more amusing than useful. It constructs and prints two trees. For the first one we define the <code>defStop</code> function, which is used by default, and combine it with a random function so that the stop criterion will also be met in an additional 20% of the cases when <code>defStop</code> is false. The second tree is built so that it considers only the random function as the stopping criterion. Note that in the second case the lambda function still has three parameters, since this is the number of parameters required for the stop function (for more, see the section on <a href="../reference/TreeLearner.htm">Orange Trees</a> in Orange Reference). 
    15712120</p> 
     
    15942143 
    15952144 
    1596 <h2>Tree Size</h2> 
    1597  
    1598  
    1599 <h2>Printing the Tree</h2> 
     2145 
     2146 
     2147 
     2148 
    16002149<index name="classification trees/printing"> 
    1601  
    1602 <P>Function <code>dumpTree</code> dumps a tree to a string, and <code>printTree</code> prints out the tree (<code>printTxt</code> is an alias for <code>printTree</code>, kept for compatibility). These functions take the same arguments.</P> 
    1603  
    1604 <P>Before we go on: you can read all about the function and use it to its full extent, or you can just call it, giving it the tree as the sole argument and it will print out the usual textual representation of the tree. If you're satisfied with that, you can stop here.</P> 
    1605  
    1606 <p class=section>Arguments</p> 
    1607 <dl class=arguments> 
    1608   <dt>tree</dt> 
    1609   <dd>The tree to be printed out.</dd> 
    1610  
    1611   <dt>leafStr</dt> 
    1612   <dd>The format string for printing the tree leaves. If left empty, "%V (%^.2m%)" will be used for classification trees and "%V" for regression trees.</dd> 
    1613  
    1614   <dt>nodeStr</dt> 
    1615   <dd>The string for printing out the internal nodes. If left empty (as it is by default), no data is printed out for internal nodes. If set to <code>"."</code>, the same string is used as for leaves.</dd> 
    1616  
    1617   <dt>maxDepth</dt> 
    1618   <dd>If set, it limits the depth to which the tree is printed out.</dd> 
    1619  
    1620   <dt>minExamples</dt> 
    1621   <dd>If set, the subtrees with less than the given number of examples are not printed.</dd> 
    1622  
    1623   <dt>simpleFirst</dt> 
    1624   <dd>If <code>True</code> (default), the branches with a single node are printed before the branches with larger subtrees. If you set it to <code>False</code> (which I don't know why you would), the branches are printed in order of appearance.</dd> 
    1625  
    1626   <dt>userFormats</dt> 
    1627   <dd>A list of regular expressions and callback functions through which the user can print out other specific information in the nodes.</dd> 
    1628 </dl> 
    1629  
    1630 <P>The magic is in the format string. It is a string which is printed out at every leaf or internal node with the certain format specifiers replaced by data from the tree node. Specifiers are generally of form 
    1631 <B><code>%<em>[^]</em><em>&lt;precision&gt;</em><em>&lt;quantity&gt;</em><em>&lt;divisor&gt;</em>.</code></B> 
    1632 </center> 
    1633  
    1634 <P><B><EM>^</EM></B> at the start tells that the number should be multiplied by 100. It's useful for printing proportions like percentages, but it makes no sense to multiply, say, the number of examples at the node (although the function will allow it).</P> 
    1635  
    1636 <P><B><em>&lt;precision&gt;</em></B> is in the same format as in Python (or C) string formatting. For instance, <code>%N</code> denotes the number of examples in the node, hence <code>%6.2N</code> would mean output to two decimal digits and six places altogether. If left out, a default format <code>5.3</code> is used, unless you multiply the numbers by 100, in which case the default is <code>.0</code> (no decimals, the number is rounded to the nearest integer).</P> 
    1637  
    1638 <P><B><em>&lt;divisor&gt;</em></B> tells what to divide the quantity in that node by. <code>bP</code> means division by the same quantity in the parent node; for instance, <code>%NbP</code> will give the number of examples in the node divided by the number of examples in the parent node. You can also use precision formatting, e.g. <code>%6.2NbP</code>. <code>bA</code> is division by the same quantity over the entire data set, so <code>%NbA</code> will tell you the proportion of examples (out of the entire training data set) that fell into that node. If division is impossible since the parent node does not exist or some data is missing, a dot is printed out instead of the quantity.</P> 
    1639  
    1640 <P><B><em>&lt;quantity&gt;</em></B> is the only required element. It defines what to print. For instance, <code>%N</code> would print out the number of examples in the node. Possible quantities are 
    1641 <dl class=arguments_sm> 
    1642 <dt>V</dt> 
    1643 <dd>The value predicted at that node. You cannot define the precision or divisor here.</dd> 
    1644  
    1645 <dt>N</dt> 
    1646 <dd>The number of examples in the node.</dd> 
    1647  
    1648 <dt>M</dt> 
    1649 <dd>The number of examples in the majority class (that is, the class predicted by the node).</dd> 
    1650  
    1651 <dt>m</dt> 
    1652 <dd>The proportion of examples in the majority class.</dd> 
    1653  
    1654 <dt>A</dt> 
    1655 <dd>The average class for examples in the node; this is available only for regression trees.</dd> 
    1656  
    1657 <dt>E</dt> 
    1658 <dd>Standard error for class of examples in the node; available for regression trees.</dd> 
    1659  
    1660 <dt>I</dt> 
    1661 <dd>Print out the confidence interval. The modifier is used as <code>%I(95)</code> or (more complicated) <code>%5.3I(95)bP</code>.</dd> 
    1662  
    1663 <dt>C</dt> 
    1664 <dd>The number of examples in the given class. For classification trees, this modifier is used as, for instance, <code>%5.3CbP="Iris-virginica"</code> - this will give the number of examples of Iris-virginica divided by the number of examples of this class in the parent node. If you are interested in examples that are <em>not</em> Iris-virginica, say <code>%5.3CbP!="Iris-virginica"</code> 
    1665  
    1666 For regression trees, you can use operators =, !=, &lt;, &lt;=, &gt;, and &gt;=, as in <code>%C&lt;22</code> - add the precision and divisor if you will. You can also check the number of examples in a certain interval: <code>%C[20, 22]</code> will give you the number of examples between 20 and 22 (inclusive) and <code>%C(20, 22)</code> will give the number of such examples excluding the boundaries. You can of course mix the parentheses, e.g. <code>%C(20, 22]</code>. If you would like the examples outside the interval, add a <code>!</code>, like <code>%C!(20, 22]</code>.</dd> 
    1667  
    1668 <dt>c</dt> 
    1669 <dd>Same as above, except that it computes the proportion of the class instead of the number of examples.</dd> 
    1670  
    1671 <dt>D</dt> 
    1672 <dd>Prints out the number of examples in each class. You can use both, precision (it is applied to each number in the distribution) or the divisor. This quantity cannot be computed for regression trees.</dd> 
    1673  
    1674 <dt>d</dt> 
    1675 <dd>Same as above, except that it shows proportions of examples. This again doesn't work with regression trees.</dd> 
    1676 </dl> 
    1677  
    1678 <dt>&lt;user defined formats&gt;</dt> 
    1679 <dd>You can add more, if you will. Instructions and examples are given at the end of this section.</dd> 
    1680 </P> 
    1681  
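The structure of a specifier described above can be illustrated with a short, self-contained sketch. The two partial patterns `fs` and `by` are copied from the module source shown in this changeset; the combined pattern `spec`, the group name `quantity` and the helper `parse_spec` are hypothetical names introduced only for this illustration, and the sketch handles just the simple one-letter quantities `N`, `M` and `m`.

```python
import re

# Partial patterns as they appear in the module source.
fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
by = r"(?P<by>(b(P|A)))?"

# Hypothetical combined pattern: %[^]<precision><quantity><divisor>,
# restricted here to the quantities N, M and m.
spec = re.compile("%" + fs + "(?P<quantity>[NMm])" + by)


def parse_spec(text):
    """Split a format specifier into its four parts (illustration only)."""
    mo = spec.match(text)
    if mo is None:
        return None
    return {
        "times100": mo.group("m100") == "^",  # leading ^ means multiply by 100
        "precision": mo.group("fs"),          # e.g. "6.2"; empty if omitted
        "quantity": mo.group("quantity"),     # the only required element
        "divisor": mo.group("by"),            # "bP", "bA" or None
    }


print(parse_spec("%6.2NbP"))
print(parse_spec("%^MbA"))
```

An omitted precision comes back as an empty string and an omitted divisor as `None`, which is what lets a caller fall back to the defaults described above.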
    1682 <P>Now for some examples. We shall build a small tree from the iris data set - we shall limit the depth to three levels.</P> 
    1683  
    1684 <p class="header">part of <a href="orngTree1.py">orngTree1.py</a></p> 
    1685 <xmp class="code">import orange, orngTree 
    1686 data = orange.ExampleTable("iris") 
    1687 tree = orngTree.TreeLearner(data, maxDepth=3) 
    1688 </xmp> 
    1689  
    1690 <P>The easiest way to call the function is to pass the tree as the only argument. Calling <code>orngTree.printTree(tree)</code> will print 
    1691 <xmp class="printout">petal width<0.800: Iris-setosa (100.00%) 
    1692 petal width>=0.800 
    1693 |    petal width<1.750 
    1694 |    |    petal length<5.350: Iris-versicolor (94.23%) 
    1695 |    |    petal length>=5.350: Iris-virginica (100.00%) 
    1696 |    petal width>=1.750 
    1697 |    |    petal length<4.850: Iris-virginica (66.67%) 
    1698 |    |    petal length>=4.850: Iris-virginica (100.00%) 
    1699 </xmp> 
    1700 </P> 
    1701  
    1702 <P>Let's now print out the predicted class at each node, the number of examples in the majority class with the total number of examples in the node, 
    1703 <code>orngTree.printTree(tree, leafStr="%V (%M out of %N)")</code>. 
    1704 <xmp class="printout">petal width<0.800: Iris-setosa (50.000 out of 50.000) 
    1705 petal width>=0.800 
    1706 |    petal width<1.750 
    1707 |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000) 
    1708 |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000) 
    1709 |    petal width>=1.750 
    1710 |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000) 
    1711 |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000) 
    1712 </xmp> 
    1713 </P> 
    1714  
    1715 <P>Would you like to know how the number of examples declines as compared to the entire data set and to the parent node? We find it with this: <code>orngTree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)")</code> 
    1716 <xmp class="printout">petal width<0.800: Iris-setosa (100%, 100%) 
    1717 petal width>=0.800 
    1718 |    petal width<1.750 
    1719 |    |    petal length<5.350: Iris-versicolor (98%, 100%) 
    1720 |    |    petal length>=5.350: Iris-virginica (4%, 40%) 
    1721 |    petal width>=1.750 
    1722 |    |    petal length<4.850: Iris-virginica (4%, 4%) 
    1723 |    |    petal length>=4.850: Iris-virginica (86%, 96%) 
    1724 </xmp> 
    1725 <P>Let us first read the format string. <code>%M</code> is the number of examples in the majority class. We want it divided by the number of all examples from this class on the entire data set, hence <code>%MbA</code>. To have it multiplied by 100, we say <code>%^MbA</code>. The percent sign <em>after</em> that is just printed out literally, just as the comma and parentheses (see the output). The string for showing the proportion of this class in the parent is the same except that we have <code>bP</code> instead of <code>bA</code>.</P> 
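The arithmetic behind `%^MbA` and `%^MbP` can be checked by hand. The sketch below is not the module's code; it recomputes two of the printed lines from the class distributions shown later in this section, assuming the divisor is the count of the node's majority class in the reference node. The helper name `majority_ratio` is hypothetical.

```python
def majority_ratio(node_dist, ref_dist):
    """Count of the node's majority class divided by the count of the
    same class in a reference distribution (parent or whole data set)."""
    majority = max(range(len(node_dist)), key=lambda i: node_dist[i])
    return node_dist[majority] / float(ref_dist[majority])


whole_data = [50, 50, 50]   # setosa, versicolor, virginica
parent = [0, 49, 5]         # node "petal width<1.750"
left = [0, 49, 3]           # leaf "petal length<5.350"
right = [0, 0, 2]           # leaf "petal length>=5.350"

for leaf in (left, right):
    by_all = 100 * majority_ratio(leaf, whole_data)
    by_parent = 100 * majority_ratio(leaf, parent)
    # ".0" is the default precision when the value is multiplied by 100
    print("(%.0f%%, %.0f%%)" % (by_all, by_parent))
```

The two lines printed are `(98%, 100%)` and `(4%, 40%)`, matching the leaves under petal width&lt;1.750 in the output above.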
    1726  
    1727 <P>And now for the output: all examples of setosa fall into the first node. For versicolor, we have 98% in one node; the rest is certainly not in the neighbouring node (petal length&gt;=5.350) since all versicolors from the node petal width&lt;1.750 went to petal length&lt;5.350 (we know this from the <code>100%</code> in that line). Virginica is the majority class in the three nodes that together contain 94% of this class (4+4+86). The rest must have gone to the same node as versicolor.</P> 
    1728  
    1729 <P>If you find this guesswork annoying - so do I. Let us print out the number of versicolors in each node, together with the proportion of versicolors among the examples in this particular node and among all versicolors. So,<br> 
    1730 <code>'%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)'</code><br>gives the following output:</P> 
    1731  
    1732 <xmp class="printout">petal width<0.800: 0.000 (0% of node, 0% of versicolors) 
    1733 petal width>=0.800 
    1734 |    petal width<1.750 
    1735 |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors) 
    1736 |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors) 
    1737 |    petal width>=1.750 
    1738 |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors) 
    1739 |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors) 
    1740 </xmp> 
    1741  
    1742 <P>Finally, we may want to print out the distributions, using a simple string <code>%D</code>.</P> 
    1743 <xmp class="printout">petal width<0.800: [50.000, 0.000, 0.000] 
    1744 petal width>=0.800 
    1745 |    petal width<1.750 
    1746 |    |    petal length<5.350: [0.000, 49.000, 3.000] 
    1747 |    |    petal length>=5.350: [0.000, 0.000, 2.000] 
    1748 |    petal width>=1.750 
    1749 |    |    petal length<4.850: [0.000, 1.000, 2.000] 
    1750 |    |    petal length>=4.850: [0.000, 0.000, 43.000] 
    1751 </xmp> 
    1752 <P>What is the order of numbers here? If you check <code>data.domain.classVar.values</code>, you'll learn that the order is setosa, versicolor, virginica; so in the node at petal length&lt;5.350 we have 49 versicolors and 3 virginicae. To print out the proportions, we can use, for instance, <code>%.2d</code> - this gives us proportions within the node, rounded to two decimals.</P> 
    1753 <xmp class="printout">petal width<0.800: [1.00, 0.00, 0.00] 
    1754 petal width>=0.800 
    1755 |    petal width<1.750 
    1756 |    |    petal length<5.350: [0.00, 0.94, 0.06] 
    1757 |    |    petal length>=5.350: [0.00, 0.00, 1.00] 
    1758 |    petal width>=1.750 
    1759 |    |    petal length<4.850: [0.00, 0.33, 0.67] 
    1760 |    |    petal length>=4.850: [0.00, 0.00, 1.00] 
    1761 </xmp> 
    1762  
    1763 <P>We haven't yet printed any information for internal nodes. To start with the most trivial case, we shall print the prediction at each node. The call 
    1764 <xmp class="code">orngTree.printTree(tree, leafStr="%V", nodeStr=".")</xmp> says that the <code>nodeStr</code> should be the same as <code>leafStr</code> (not very useful here, since <code>leafStr</code> is trivial anyway).</P> 
    1765 <xmp class="printout">root: Iris-setosa 
    1766 |    petal width<0.800: Iris-setosa 
    1767 |    petal width>=0.800: Iris-versicolor 
    1768 |    |    petal width<1.750: Iris-versicolor 
    1769 |    |    |    petal length<5.350: Iris-versicolor 
    1770 |    |    |    petal length>=5.350: Iris-virginica 
    1771 |    |    petal width>=1.750: Iris-virginica 
    1772 |    |    |    petal length<4.850: Iris-virginica 
    1773 |    |    |    petal length>=4.850: Iris-virginica 
    1774 </xmp> 
    1775  
    1776 <P>Note that the output is somewhat different now: another node called <em>root</em> has appeared and the tree looks one level deeper. This is needed to print out the data for that node too.</P> 
    1777  
    1778 <P>Now for something more complicated: let us observe how the number of virginicas decreases down the tree:</P> 
    1779 <xmp class="code">orngTree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')</xmp> 
    1780 <P>Let's first interpret the format string: <code>CbA="Iris-virginica"</code> is the number of examples from class virginica, divided by the total number of examples in this class. Add <code>^.1</code> and the result will be multiplied by 100 and printed with one decimal. The trailing <code>%</code> is printed out. In parentheses we print the same thing except that we divide by the examples in the parent node. Note the use of single quotes, so we can use the double quotes inside the string, when we specify the class.</P> 
    1781 <xmp class="printout">root: 100.0% (.%) 
    1782 |    petal width<0.800: 0.0% (0.0%) 
    1783 |    petal width>=0.800: 100.0% (100.0%) 
    1784 |    |    petal width<1.750: 10.0% (10.0%) 
    1785 |    |    |    petal length<5.350: 6.0% (60.0%) 
    1786 |    |    |    petal length>=5.350: 4.0% (40.0%) 
    1787 |    |    petal width>=1.750: 90.0% (90.0%) 
    1788 |    |    |    petal length<4.850: 4.0% (4.4%) 
    1789 |    |    |    petal length>=4.850: 86.0% (95.6%) 
    1790 </xmp> 
    1791 <P>See what's in the parentheses in the root node? If <code>printTree</code> cannot compute something (in this case it's because the root has no parent), it prints out a dot. You can also replace <code>=</code> by <code>!=</code> and it will count all classes <em>except</em> virginica.</P> 
    1792  
    1793 <P>For one final example with classification trees, we shall print the distributions in the nodes, the distribution compared to the parent and the proportions compared to the parent (the latter two are not the same - think why). In the leaves we shall also add the predicted class. So now we'll have to call the function like this.</P> 
    1794 <xmp class="code">orngTree.printTree(tree, leafStr='%V   %D %.2DbP %.2dbP', nodeStr='%D %.2DbP %.2dbP')</xmp> 
    1795 <p>Here's the result:</p> 
    1796 <xmp class="printout">root: [50.000, 50.000, 50.000] . . 
    1797 |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]: 
    1798 |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00] 
    1799 |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50] 
    1800 |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19] 
    1801 |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]: 
    1802 |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62] 
    1803 |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]: 
    1804 |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80] 
    1805 |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96] 
    1806 |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]: 
    1807 |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68] 
    1808 |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]: 
    1809 |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02] 
    1810 </xmp> 
    1811  
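Why `%DbP` and `%dbP` differ can be seen by recomputing one line of the printout above. This is only a sketch with hypothetical helper names, not the module's implementation; it assumes that a class with no examples in the parent is shown as zero, as the printout suggests.

```python
def divide(node_values, parent_values):
    """Element-wise division; classes absent from the parent give 0.0
    (an assumption made to mirror the printout)."""
    return [n / float(p) if p else 0.0
            for n, p in zip(node_values, parent_values)]


def proportions(dist):
    """Turn class counts into within-node proportions."""
    total = float(sum(dist))
    return [x / total for x in dist]


parent = [0, 49, 5]   # node "petal width<1.750"
node = [0, 49, 3]     # leaf "petal length<5.350"

# %.2DbP: counts divided by the parent's counts
counts_by_parent = divide(node, parent)
# %.2dbP: within-node proportions divided by the parent's proportions
props_by_parent = divide(proportions(node), proportions(parent))

print(["%.2f" % x for x in counts_by_parent])
print(["%.2f" % x for x in props_by_parent])
```

The first list says which share of the parent's examples of each class reached the node; the second says how much more (or less) common each class became in the node than it was in the parent, which is why it can exceed 1.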
    1812 <P>To explore the possibilities when printing regression trees, we shall induce a tree from the housing data set. Called with the tree as the only argument, <code>printTree</code> prints the tree like this:</P> 
    1813  
    1814 <xmp class="printout">RM<6.941 
    1815 |    LSTAT<14.400 
    1816 |    |    DIS<1.385: 45.6 
    1817 |    |    DIS>=1.385: 22.9 
    1818 |    LSTAT>=14.400 
    1819 |    |    CRIM<6.992: 17.1 
    1820 |    |    CRIM>=6.992: 12.0 
    1821 RM>=6.941 
    1822 |    RM<7.437 
    1823 |    |    CRIM<7.393: 33.3 
    1824 |    |    CRIM>=7.393: 14.4 
    1825 |    RM>=7.437 
    1826 |    |    TAX<534.500: 45.9 
    1827 |    |    TAX>=534.500: 21.9 
    1828 </xmp> 
    1829  
    1830 <P>Let us add the standard error in both internal nodes and leaves, and the 90% confidence intervals in the leaves. So we want to call it like this:</P> 
    1831 <xmp class="code">orngTree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")</xmp> 
    1832  
    1833 <xmp class="printout">root: [SE: 0.409] 
    1834 |    RM<6.941: [SE: 0.306] 
    1835 |    |    LSTAT<14.400: [SE: 0.320] 
    1836 |    |    |    DIS<1.385: [SE: 4.420]: 
    1837 |    |    |        [SE: 4.420]   45.6 [38.331-52.829] 
    1838 |    |    |    DIS>=1.385: [SE: 0.244]: 
    1839 |    |    |        [SE: 0.244]   22.9 [22.504-23.306] 
    1840 |    |    LSTAT>=14.400: [SE: 0.333] 
    1841 |    |    |    CRIM<6.992: [SE: 0.338]: 
    1842 |    |    |        [SE: 0.338]   17.1 [16.584-17.691] 
    1843 |    |    |    CRIM>=6.992: [SE: 0.448]: 
    1844 |    |    |        [SE: 0.448]   12.0 [11.243-12.714] 
    1845 |    RM>=6.941: [SE: 1.031] 
    1846 |    |    RM<7.437: [SE: 0.958] 
    1847 |    |    |    CRIM<7.393: [SE: 0.692]: 
    1848 |    |    |        [SE: 0.692]   33.3 [32.214-34.484] 
    1849 |    |    |    CRIM>=7.393: [SE: 2.157]: 
    1850 |    |    |        [SE: 2.157]   14.4 [10.862-17.938] 
    1851 |    |    RM>=7.437: [SE: 1.124] 
    1852 |    |    |    TAX<534.500: [SE: 0.817]: 
    1853 |    |    |        [SE: 0.817]   45.9 [44.556-47.237] 
    1854 |    |    |    TAX>=534.500: [SE: 0.000]: 
    1855 |    |    |        [SE: 0.000]   21.9 [21.900-21.900] 
    1856 </xmp> 
    1857  
    1858 <P>What's the difference between <code>%V</code>, the predicted value, and <code>%A</code>, the average? Doesn't a regression tree always predict the leaf average anyway? Not necessarily; the tree predicts whatever the <code>nodeClassifier</code> in a leaf returns. But you're mostly right. The difference is in the number of decimals: <code>%V</code> uses the <code>FloatVariable</code>'s function for printing out the value, so the printed number has the same number of decimals as in the original file from which the data was read.</P> 
    1859  
    1860 <P>Regression trees cannot print the distributions in the same way as classification trees. They instead offer a set of operators for observing the number of examples within a certain range. For instance, let us check the number of examples with values below 22, and compare this number with values in the parent nodes.</P> 
    1861 <xmp class="code">orngTree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".")</xmp> 
    1862  
    1863 <xmp class="printout">root: 277.000 (.) 
    1864 |    RM<6.941: 273.000 (1.160) 
    1865 |    |    LSTAT<14.400: 107.000 (0.661) 
    1866 |    |    |    DIS<1.385: 0.000 (0.000) 
    1867 |    |    |    DIS>=1.385: 107.000 (1.020) 
    1868 |    |    LSTAT>=14.400: 166.000 (1.494) 
    1869 |    |    |    CRIM<6.992: 93.000 (0.971) 
    1870 |    |    |    CRIM>=6.992: 73.000 (1.040) 
    1871 |    RM>=6.941: 4.000 (0.096) 
    1872 |    |    RM<7.437: 3.000 (1.239) 
    1873 |    |    |    CRIM<7.393: 0.000 (0.000) 
    1874 |    |    |    CRIM>=7.393: 3.000 (15.333) 
    1875 |    |    RM>=7.437: 1.000 (0.633) 
    1876 |    |    |    TAX<534.500: 0.000 (0.000) 
    1877 |    |    |    TAX>=534.500: 1.000 (30.000)</xmp> 
    1878  
    1879 <P>The last line, for instance, says that the proportion of examples with the class below 22 among those with tax above 534 is 30 times higher than the proportion of such examples in its parent node.</P> 
    1880  
    1881 <P>For another exercise, let's count the same for all examples <em>outside</em> interval [20, 22] (given like this, the interval includes the bounds). And let us print out the proportions as percents.</P> 
    1882  
    1883 <xmp class="code">orngTree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")</xmp> 
    1884  
    1885 <P>OK, let's observe the format string one last time. <code>%c![20, 22]</code> would be the proportion of examples (within the node) whose values are below 20 or above 22. With <code>%cbP![20, 22]</code> we divide this by the same statistic computed on the parent. Add a <code>^</code> and you have the percentages.</P> 
    1886  
    1887 <xmp class="printout">root: 439.000 (.%) 
    1888 |    RM<6.941: 364.000 (98%) 
    1889 |    |    LSTAT<14.400: 200.000 (93%) 
    1890 |    |    |    DIS<1.385: 5.000 (127%) 
    1891 |    |    |    DIS>=1.385: 195.000 (99%) 
    1892 |    |    LSTAT>=14.400: 164.000 (111%) 
    1893 |    |    |    CRIM<6.992: 91.000 (96%) 
    1894 |    |    |    CRIM>=6.992: 73.000 (105%) 
    1895 |    RM>=6.941: 75.000 (114%) 
    1896 |    |    RM<7.437: 46.000 (101%) 
    1897 |    |    |    CRIM<7.393: 43.000 (100%) 
    1898 |    |    |    CRIM>=7.393: 3.000 (100%) 
    1899 |    |    RM>=7.437: 29.000 (98%) 
    1900 |    |    |    TAX<534.500: 29.000 (103%) 
    1901 |    |    |    TAX>=534.500: 0.000 (0%) 
    1902 </xmp> 
    1903  
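The interval operators for regression trees (`[`, `(`, `]`, `)` and the leading `!`) can be sketched as a plain counting function. This is an illustration under stated assumptions, not the module's implementation: the function name and its keyword arguments are hypothetical, and the class values are made up.

```python
def count_in_interval(values, lower, upper,
                      include_lower=True, include_upper=True,
                      outside=False):
    """Count values inside (or, with outside=True, outside) an interval.

    include_lower / include_upper correspond to '[' vs '(' and ']' vs ')';
    outside=True corresponds to the leading '!'.
    """
    def inside(v):
        above = v >= lower if include_lower else v > lower
        below = v <= upper if include_upper else v < upper
        return above and below

    hits = sum(1 for v in values if inside(v))
    return len(values) - hits if outside else hits


values = [19.0, 20.0, 21.0, 22.0, 23.0]   # made-up class values

print(count_in_interval(values, 20, 22))                       # like %C[20, 22]
print(count_in_interval(values, 20, 22, False, False))         # like %C(20, 22)
print(count_in_interval(values, 20, 22, include_lower=False))  # like %C(20, 22]
print(count_in_interval(values, 20, 22, outside=True))         # like %C![20, 22]
```

Dividing any of these counts by the node size gives the lower-case `%c` variant.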
    1904  
    1905 <h3>Defining Your Own Printout functions</h3> 
    1906  
    1907 <P><code>dumpTree</code>'s argument <code>userFormats</code> can be used to print out some other information in the leaves or nodes. If provided, <code>userFormats</code> should contain a list of tuples with a regular expression and a callback function to be called when that expression is found in the format string. Expressions from <code>userFormats</code> are checked before the built-in expressions discussed above, so you can override the built-ins if you want to.</P> 
    1908  
    1909 <P>The regular expression should describe a string like those we used above, for instance the string <code>%.2DbP</code>. When a leaf or internal node is printed out, the format string (<code>leafStr</code> or <code>nodeStr</code>) is checked for these regular expressions and when the match is found, the corresponding callback function is called.</P> 
    1910  
    1911 <P>The callback function will get five arguments: the format string (<code>leafStr</code> or <code>nodeStr</code>), the match object, the node which is being printed, its parent (can be <code>None</code>) and the tree classifier. The function should return the format string in which the part described by the match object (that is, the part that is matched by the regular expression) is replaced by whatever information your callback function is supposed to give.</P> 
    1912  
    1913 <P>The function can use several utility functions provided in the module.</P> 
    1914 <dl class="attributes"> 
    1915 <dt>insertStr(s, mo, sub)</dt> 
    1916 <dd>Replaces the part of <code>s</code> which is covered by <code>mo</code> by the string <code>sub</code>.</dd> 
    1917  
    1918 <dt>insertDot(s, mo)</dt> 
    1919 <dd>Calls <code>insertStr(s, mo, ".")</code>. You should use this when the function cannot compute the desired quantity; it is called, for instance, when it needs to divide by something in the parent, but the parent doesn't exist.</dd> 
    1920  
    1921 <dt>insertNum(s, mo, n)</dt> 
    1922 <dd>Replaces the part of <code>s</code> matched by <code>mo</code> by the number <code>n</code>, formatted as specified by the user, that is, it multiplies it by 100, if needed, and prints with the right number of places and decimals. It does so by checking the <code>mo</code> for a group named <code>m100</code> (representing the <code>^</code> in the format string) and a group named <code>fs</code> representing the part giving the number of decimals (e.g. <code>5.3</code>).</dd> 
    1923  
    1924 <dt>byWhom(by, parent, tree)</dt> 
    1925 <dd>If <code>by</code> equals <code>bP</code>, it returns <code>parent</code>, else it returns <code>tree.tree</code>. This is used to find what to divide the quantity with, when division is required.</dd> 
    1926 </dl> 
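How a callback and `insertStr` cooperate can be tried out without Orange. In the sketch below only the body of `insert_str` is taken from the module source; the node stand-ins (plain dictionaries), the specifiers `%L` and `%S`, both callbacks and the dispatch loop `format_node` are hypothetical and only approximate how format strings might be processed.

```python
import re


def insert_str(s, mo, sub):
    # Same one-liner as insertStr in the module source.
    return s[:mo.start()] + sub + s[mo.end():]


def replace_label(strg, mo, node, parent, tree):
    # Hypothetical callback: %L becomes the node's label.
    return insert_str(strg, mo, node["label"])


def replace_share(strg, mo, node, parent, tree):
    # Hypothetical callback: %S becomes the node size relative to the
    # parent, or a dot when there is no parent to divide by.
    if parent is None:
        return insert_str(strg, mo, ".")
    return insert_str(strg, mo, "%.2f" % (node["n"] / float(parent["n"])))


formats = [(re.compile("%L"), replace_label),
           (re.compile("%S"), replace_share)]


def format_node(strg, node, parent=None, tree=None):
    """Apply each (regex, callback) pair until no match is left."""
    for rgx, callback in formats:
        start = 0
        while True:
            mo = rgx.search(strg, start)
            if mo is None:
                break
            strg = callback(strg, mo, node, parent, tree)
            start = mo.start() + 1
    return strg


root = {"label": "root", "n": 150}
child = {"label": "petal width<0.800", "n": 50}
print(format_node("%L: %S", root))
print(format_node("%L: %S", child, root))
```

Each callback receives the whole format string and returns it with only the matched part replaced, so several specifiers in one string can be handled one after another.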
    1927  
    1928 <P>There are also a few pieces of regular expression that you may want to reuse. The two you are likely to use are</P> 
    1929 <dl class="attributes-sm"> 
    1930 <dt>fs</dt> 
    1931 <dd>Defines the multiplier by 100 (<code>^</code>) and the format for the number of decimals (e.g. <code>5.3</code>). The corresponding groups are named <code>m100</code> and <code>fs</code>.</dd> 
    1932  
    1933 <dt>by</dt> 
    1934 <dd>Defines <code>bP</code> or <code>bA</code> or nothing; the result is in the group <code>by</code>.</dd> 
    1935 </dl> 
    1936  
    1937 <P>For a trivial example, "%V" is implemented like this. There is the following tuple in the list of built-in formats: <code>(re.compile("%V"), replaceV)</code>. <code>replaceV</code> is a function defined by:</P> 
    1938 <xmp class="code">def replaceV(strg, mo, node, parent, tree): 
    1939     return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))</xmp> 
    1940 <P>It therefore takes the value predicted at the node (<code>node.nodeClassifier.defaultValue</code>), converts it to a string and passes it to <code>insertStr</code> to do the replacement.</P> 
    1941  
    1942 <P>A more complex regular expression is the one for the proportion of majority class, defined as <code>"%"+fs+"M"+by</code>. It uses the two partial expressions defined above.</P> 
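As a rough illustration of how such an expression and number formatting fit together, here is a self-contained sketch. The patterns `fs` and `by` are copied from the module source; `insert_num` is a simplified stand-in written from the description of `insertNum` above (multiply by 100 when `^` is present, default precision `5.3`, or `.0` when multiplied), not the module's actual body.

```python
import re

# Partial patterns as they appear in the module source.
fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
by = r"(?P<by>(b(P|A)))?"

re_M = re.compile("%" + fs + "M" + by)


def insert_num(s, mo, n):
    """Simplified stand-in for insertNum (assumption, see lead-in)."""
    grps = mo.groupdict()
    m100 = grps.get("m100")
    if m100:
        n *= 100.0
    # Defaults described in the text: 5.3, or .0 when multiplied by 100.
    precision = grps.get("fs") or (".0" if m100 else "5.3")
    return s[:mo.start()] + ("%" + precision + "f") % n + s[mo.end():]


for text, value in [("%M", 49), ("%^M%", 0.94), ("%^.1MbA%", 0.98)]:
    mo = re_M.search(text)
    print(insert_num(text, mo, value))
```

With these inputs the sketch prints `49.000`, `94%` and `98.0%`; the literal percent sign after the specifier is left untouched because it lies outside the match.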
    1943  
    1944 <P>Let's say we'd like to print the classification margin for each node, that is, the difference between the proportion of the largest and the second largest class in the node.</P> 
    1945  
    1946 <p class="header">part of <a href="orngTree2.py">orngTree2.py</a></p> 
    1947 <xmp class="code">def getMargin(dist): 
    1948     if dist.abs < 1e-30: 
    1949         return 0 
    1950     l = list(dist) 
    1951     l.sort() 
    1952     return (l[-1] - l[-2]) / dist.abs 
    1953  
    1954 def replaceB(strg, mo, node, parent, tree): 
    1955     margin = getMargin(node.distribution) 
    1956  
    1957     by = mo.group("by") 
    1958     if margin and by: 
    1959         whom = orngTree.byWhom(by, parent, tree) 
    1960         if whom and whom.distribution: 
    1961             divMargin = getMargin(whom.distribution) 
    1962             if divMargin > 1e-30: 
    1963                 margin /= divMargin 
    1964             else: 
    1965                 return orngTree.insertDot(strg, mo) 
    1966         else: 
    1967             return orngTree.insertDot(strg, mo) 
    1968     return orngTree.insertNum(strg, mo, margin) 
    1969  
    1970  
    1971 myFormat = [(re.compile("%"+orngTree.fs+"B"+orngTree.by), replaceB)]</xmp> 
    1972  
    1973 <P>We first defined <code>getMargin</code> which gets the distribution and computes the margin. The callback function, <code>replaceB</code>, computes the margin for the node. If we need to divide the quantity by something (that is, if the <code>by</code> group is present), we call <code>orngTree.byWhom</code> to get the node by whose margin this node's margin is to be divided. If this node (usually the parent) does not exist or if its margin is zero, we call <code>insertDot</code> to insert a dot, otherwise we call <code>insertNum</code> which will insert the number, obeying the format specified by the user.</P> 
    1974  
    1975 <P><code>myFormat</code> is a list containing the regular expression and the callback function.</P> 
    1976  
    1977 <P>We can now print out the iris tree, for instance using the following call.</P> 
    1978 <xmp class="code">orngTree.printTree(tree, leafStr="%V %^B% (%^3.2BbP%)", userFormats = myFormat)</xmp> 
    1979  
    1980 <P>And this is what we get.</P> 
    1981 <xmp class="printout">petal width<0.800: Iris-setosa 100% (100.00%) 
    1982 petal width>=0.800 
    1983 |    petal width<1.750 
    1984 |    |    petal length<5.350: Iris-versicolor 88% (108.57%) 
    1985 |    |    petal length>=5.350: Iris-virginica 100% (122.73%) 
    1986 |    petal width>=1.750 
    1987 |    |    petal length<4.850: Iris-virginica 33% (34.85%) 
    1988 |    |    petal length>=4.850: Iris-virginica 100% (104.55%) 
    1989 </xmp> 
    1990  
    1991  
    1992 <h2>Plotting the Tree using Dot</h2> 
    1993  
    1994 <p>Function <code>printDot</code> prints the tree to a file in a format used by <a 
    1995 href="http://www.research.att.com/sw/tools/graphviz">GraphViz</a>. 
    1996 It uses the same parameters as <code>printTxt</code> defined above, plus 
    1997 two additional parameters which define the shape used for internal 
    1998 nodes and leaves of the tree: 
    1999  
    2000 <p class=section>Arguments</p> 
    2001 <dl class=arguments> 
    2002   <dt>leafShape</dt> 
    2003   <dd>Shape of the outline around leaves of the tree. If "plaintext", 
    2004   no outline is used (default: "plaintext")</dd> 
    2005  
    2006   <dt>internalNodeShape</dt> 
    2007   <dd>Shape of the outline around internal nodes of the tree. If "plaintext", 
    2008   no outline is used (default: "box")</dd> 
    2009 </dl> 
    2010  
    2011 <p>Check <a 
    2012 href="http://www.graphviz.org/doc/info/shapes.html">Polygon-based 
    2013 Nodes</a> for various outlines supported by GraphViz.</p> 
    2014  
    2015 <P>Suppose you saved the tree in a file <code>tree5.dot</code>. You can then print it out as a gif if you execute the following command line 
    2016 <XMP class=code>dot -Tgif tree5.dot -otree5.gif 
    2017 </XMP> 
    2018 </P> 
    2019 GraphViz's dot has quite a few other output formats; check its documentation to learn which.</P> 
    20202150 
    20212151References 
     
    22922422        return learner 
    22932423 
     2424#counting 
     2425 
    22942426def __countNodes(node): 
    22952427    count = 0 
     
    23352467import re 
    23362468fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)" 
     2469""" Defines the multiplier by 100 (:samp:`^`) and the format 
     2470for the number of decimals (e.g. :samp:`5.3`). The corresponding  
     2471groups are named :samp:`m100` and :samp:`fs`. """ 
     2472 
    23372473by = r"(?P<by>(b(P|A)))?" 
    23382474bysub = r"((?P<bysub>b|s)(?P<by>P|A))?" 
     
    23562492re_I = re.compile("%"+fs+"I"+intrvl) 
    23572493 
     2494def insertStr(s, mo, sub): 
     2495    """ Replace the part of s which is covered by mo  
     2496    with the string sub. """ 
     2497    return s[:mo.start()] + sub + s[mo.end():] 
     2498 
     2499 
    23582500def insertDot(s, mo): 
     2501    """ Replace the part of s which is covered by mo  
     2502    with a dot.  You should use this when the  
     2503    function cannot compute the desired quantity; it is called, for instance,  
     2504    when it needs to divide by something in the parent, but the parent  
     2505    doesn't exist. 
     2506    """ 
    23592507    return s[:mo.start()] + "." + s[mo.end():] 
    23602508 
    2361 def insertStr(s, mo, sub): 
    2362     return s[:mo.start()] + sub + s[mo.end():] 
    2363  
    23642509def insertNum(s, mo, n): 
     2510    """ Replace the part of s matched by mo with the number n,  
     2511    formatted as specified by the user, that is, it multiplies  
     2512    it by 100, if needed, and prints with the right number of  
     2513    places and decimals. It does so by checking the mo 
     2514    for a group named m100 (representing the :samp:`^` in the format string)  
     2515    and a group named fs representing the part giving the number of 
     2516    decimals (e.g. :samp:`5.3`). 
     2517    """ 
    23652518    grps = mo.groupdict() 
    23662519    m100 = grps.get("m100", None) 
     
    23712524 
    23722525def byWhom(by, parent, tree): 
    2373         if by=="bP": 
    2374             return parent 
    2375         else: 
    2376             return tree.tree 
     2526    """ If by equals bP, it returns parent, else it returns  
     2527    :samp:`tree.tree`. This is used to find what to divide the quantity  
     2528    by, when division is required. 
     2529    """ 
     2530    if by=="bP": 
     2531        return parent 
     2532    else: 
     2533        return tree.tree 
    23772534 
    23782535def replaceV(strg, mo, node, parent, tree): 
     
    27822939 
    27832940def dumpTree(tree, leafStr = "", nodeStr = "", **argkw): 
    2784     return __TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + __TreeDumper.defaultStringFormats, 
    2785                         argkw.get("minExamples", 0), argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), 
    2786                         tree).dumpTree() 
     2941    """ 
     2942    Return a string representation of a tree. 
     2943 
     2944    :arg tree: The tree to dump to string. 
     2945    :type tree: :class:`TreeClassifier` 
     2946    :arg leafStr: The format string for printing the tree leaves. If  
     2947      left empty, "%V (%^.2m%)" will be used for classification trees 
     2948      and "%V" for regression trees. 
     2949    :type leafStr: string 
     2950    :arg nodeStr: The format string for printing out the internal nodes. 
     2951      If left empty (as it is by default), no data is printed out for 
     2952      internal nodes. If set to :samp:`"."`, the same string is 
     2953      used as for leaves. 
     2954    :type nodeStr: string 
     2955    :arg maxDepth: If set, it limits the depth to which the tree is 
     2956      printed out. 
     2957    :type maxDepth: integer 
     2958    :arg minExamples: If set, the subtrees with fewer than the given  
     2959      number of examples are not printed. 
     2960    :type minExamples: integer 
     2961    :arg simpleFirst: If True (default), the branches with a single  
     2962      node are printed before the branches with larger subtrees.  
     2963      If False, the branches are printed in order of 
     2964      appearance. 
     2965    :type simpleFirst: boolean 
     2966    :arg userFormats: A list of regular expressions and callback  
     2967      functions through which the user can print out other specific  
     2968      information in the nodes. 
     2969    """ 
     2970    return __TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) +  
     2971        __TreeDumper.defaultStringFormats, argkw.get("minExamples", 0),  
     2972        argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), 
     2973        tree).dumpTree() 
    27872974 
    27882975 
    27892976def printTree(*a, **aa): 
     2977    """ 
     2978    Print out the tree (call :func:`dumpTree` with the same 
     2979    arguments and print out the result). 
     2980    """ 
    27902981    print dumpTree(*a, **aa) 
    27912982 
    27922983printTxt = printTree 
     2984""" An alias for :func:`printTree`. Left for compatibility. """ 
    27932985 
    27942986 
     
    28032995printDot = dotTree 
    28042996         
    2805 ##import orange, orngTree, os 
    2806 ##os.chdir("c:\\d\\ai\\orange\\doc\\datasets") 
    2807 ##data = orange.ExampleTable("iris") 
    2808 ###data = orange.ExampleTable("housing") 
    2809 ##tree = orngTree.TreeLearner(data) 
    2810 ##printTxt(tree) 
    2811 ###print printTree(tree, '%V %4.2NbP %.3C!="Iris-virginica"') 
    2812 ###print printTree(tree, '%A %I(95) %C![20,22)bP', ".", maxDepth=3) 
    2813 ###dotTree("c:\\d\\ai\\orange\\x.dot", tree, '%A', maxDepth= 3) 
     2997 
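
The docstrings added in this changeset describe how a format specifier is matched and then replaced by a formatted number. The sketch below illustrates that mechanism in isolation. It is not the module's code: the body of insertNum is elided in the diff, so the formatting logic here, the default of "5.3", the snake_case names and the "%...Q" specifier are all assumptions made for illustration. Only the `fs` pattern and the slicing done by insertStr are taken from the changeset.

```python
import re

# Same pattern as `fs` in the changeset: an optional "^" (multiply by 100)
# followed by an optional width/precision such as "5.3".
fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"

# Hypothetical specifier "%...Q", used only in this illustration.
re_Q = re.compile("%" + fs + "Q")


def insert_str(s, mo, sub):
    """Replace the part of s covered by the match object mo with sub."""
    return s[:mo.start()] + sub + s[mo.end():]


def insert_num(s, mo, n):
    """Assumed behaviour of insertNum: scale n by 100 if "^" was given,
    then format it with the requested width and precision."""
    groups = mo.groupdict()
    if groups.get("m100"):
        n *= 100
    spec = groups.get("fs") or "5.3"  # assumed default, not from the source
    return insert_str(s, mo, ("%" + spec + "f") % n)


if __name__ == "__main__":
    text = "share: %^.1Q%"
    match = re_Q.search(text)
    print(insert_num(text, match, 0.256))
```

The named groups keep the replacement function independent of the specifier letter: any pattern built as "%" + fs + letter can reuse the same helper, which appears to be why the module compiles one regular expression per quantity.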