Changeset 7359:7faa00b955f5 in orange


Timestamp:
02/04/11 00:14:40 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
821b8e412c1144ac3550dfcc1814237d17a5f4b3
Message:

Trees slowly progressing.

Location:
orange
Files:
2 edited

  • orange/Orange/classification/tree.py

    r7326 r7359  
    2626        Stores a distribution for learning examples belonging to the node. 
    2727        Storing distributions can be disabled by setting the  
    28         :obj:`TreeLearner`'s storeDistributions flag to false. 
     28        :obj:`TreeLearnerBase`'s storeDistributions flag to false. 
    2929 
    3030    .. attribute:: contingency 
     
    3232        Stores complete contingency matrices for the learning examples  
    3333        belonging to the node. Storing contingencies can be enabled by  
    34         setting :obj:`TreeLearner`'s :obj:`storeContingencies`  
    35         flag to <CODE>true</CODE>. Note that even when the flag is not  
     34        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`  
     35        flag to true. Note that even when the flag is not  
    3636        set, the contingencies get computed and stored to  
    3737        :obj:`TreeNode`, but are removed shortly afterwards.  
    3838        The details are given in the  
    39         description of the :obj:`TreeLearner`object. 
     39        description of the :obj:`TreeLearnerBase` object. 
    4040 
    4141    .. attribute:: examples, weightID 
     
    9494        When an example cannot be classified to any branch, the selector 
    9595        can return a :obj:`orange.Value` containing a special value 
    96         (<code>sVal</code>) which should be a discrete distribution 
    97         (<code>DiscDistribution</code>). This should represent a 
     96        (sVal) which should be a discrete distribution 
     97        (DiscDistribution). This should represent a 
    9898        :obj:`branchSelector`'s opinion of how to divide the 
    9999        example between the branches. Whether the proposition will be 
     
    205205======== 
    206206 
    207 The main learning object is :obj:`TreeLearner`. It is basically  
     207The main learning object is :obj:`TreeLearnerBase`. It is basically  
    208208a skeleton into which the user must plug the components for particular  
    209209functions. For easier use, defaults are provided. 
     
    243243    go into each branch. Just what we need for the :obj:`TreeNode`. 
    244244    It can return an empty list for the number of examples in branches; 
    245     in this case, the :obj:`TreeLearner` will find the number itself 
     245    in this case, the :obj:`TreeLearnerBase` will find the number itself 
    246246    after splitting the example set into subsets. However, if a split 
    247247    constructor can provide the numbers at no extra computational 
     
    263263    by returning no classifier. This can happen for many reasons. 
    264264    A general one is related to number of examples in the branches. 
    265     :obj:`TreeSplitConstructor` has a field <code>minSubset</code>, 
     265    :obj:`TreeSplitConstructor` has a field :obj:`minSubset`, 
    266266    which sets the minimal number of examples in a branch; null nodes, 
    267267    however, are allowed. If there is no split where this condition 
     
    274274        examples" refers to the weighted number of examples. 
    275275     
    276     .. method:: test() 
    277  
    278         Bla bla. 
    279  
    280276    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])  
    281277 
     
    377373        needed. 
    378374 
    379 .. class:: TreeLearner 
    380  
    381     TreeLearner has a number of components. 
     375.. class:: TreeLearnerBase 
     376 
     377    TreeLearnerBase has a number of components. 
    382378 
    383379    .. attribute:: split 
    384380 
    385381        Object of type :obj:`TreeSplitConstructor`. Default value,  
    386         provided by :obj:`TreeLearner`, is :obj:`SplitConstructor_Combined` 
     382        provided by :obj:`TreeLearnerBase`, is :obj:`SplitConstructor_Combined` 
    387383        with separate constructors for discrete and continuous attributes.  
    388384        Discrete attributes are used as they are, while continuous attributes 
     
    419415        Induces a classifier from examples belonging to a node. The 
    420416        same learner is used for internal nodes and for leaves. The 
    421         default :obj:`nodeLearner` is :obj:`MajorityLearner`. 
     417        default :obj:`nodeLearner` is :obj:`orange.MajorityLearner`. 
    422418 
    423419    .. attribute:: descender 
     
    447443        that is stored in <code>contingency.classes</code>. 
    448444 
    449     The :obj:`TreeLearner` first sets the defaults for missing 
    450     components. Although stored in the actual :obj:`TreeLearner`'s 
     445    The :obj:`TreeLearnerBase` first sets the defaults for missing 
     446    components. Although stored in the actual :obj:`TreeLearnerBase`'s 
    451447    fields, they are removed when the induction is finished. 
    452448 
     
    467463    beginning, all attributes are candidates for the split criterion. 
    468464 
    469     Now comes the recursive part of the :obj:`TreeLearner`. Its arguments  
     465    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its arguments  
    470466    are a set of examples, a weight meta-attribute ID (a tricky thing, 
    471467    it can always be the same as the original or can change to  
     
    475471 
    476472    The contingency matrix is computed next. This happens 
    477     even if the flag :obj:`storeContingencies` is <CODE>false</CODE>. 
     473    even if the flag :obj:`storeContingencies` is false. 
    478474    If the <code>contingencyComputer</code> is given, we use it, 
    479475    otherwise we construct just an ordinary contingency matrix. 
     
    500496    :obj:`TreeNode` is returned. 
    501497 
    502     :obj:`TreeLearner` than uses :obj:`ExampleSplitter` to divide  
     498    :obj:`TreeLearnerBase` then uses :obj:`ExampleSplitter` to divide  
    503499    the examples as described above. 
    504500 
     
    507503    :obj:`exampleSplitter` can use the contingency matrices if they wish. 
    508504 
    509     The :obj:`TreeLearner` then recursively calls itself for each of  
     505    The :obj:`TreeLearnerBase` then recursively calls itself for each of  
    510506    the non-empty subsets. If the splitter returns a list of weights,  
    511507    a corresponding weight is used for each branch. Besides, the  
     
    541537a prunable tree, internal nodes must have their :obj:`nodeClassifier` 
    542538defined. Fortunately, all you need to do is nothing; if you leave 
    543 the :obj:`TreeLearner`'s flags as they are by default, the 
     539the :obj:`TreeLearnerBase`'s flags as they are by default, the 
     544540:obj:`nodeClassifier` objects are created. 
    545541 
     
    550546Several classes described above are already functional and can 
    551547(and mostly will) be used as they are. Those classes are :obj:`TreeNode`, 
    552 :obj:`TreeLearner` and :obj:`TreeClassifier`. This section describe  
     548:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section describes  
    553549the other classes. 
    554550 
     
    558554in Python and have the call operator overloaded in such a way that it 
    559555is called back from C++ code. You can thus program your own components 
    560 for :obj:`orange.TreeLearner` and :obj:`TreeClassifier`. The detailed  
     556for :obj:`orange.TreeLearnerBase` and :obj:`TreeClassifier`. The detailed  
    561557information on how this is done and what can go wrong is given in a  
    562558separate page, dedicated to callbacks to Python XXXXXXXXXX. 
     
    746742 
    747743:obj:`TreeExampleSplitter` is the third crucial component of 
    748 :obj:`TreeLearner`. Its function is described in  
     744:obj:`TreeLearnerBase`. Its function is described in  
    749745description of the learning process. XXXXXXXXXX 
    750746 
     
    814810        which a vote of subtrees is required. In both cases, a tuple  
    815811        of two elements is returned; in the former, the tuple contains  
    816         the reached node and <CODE>None</CODE>, in the latter in  
     812        the reached node and None, in the latter it  
    817813        contains a node and weights of votes for subtrees (a list of floats). 
    818814 
     
    874870TreePruner and derived classes 
    875871============================== 
    876  
    877872 
    878873.. index:: 
     
    10681063======== 
    10691064 
    1070 You've already seen a simple example of using a :obj:`TreeLearner`. 
     1065You've already seen a simple example of using a :obj:`TreeLearnerBase`. 
    10711066You can just call it and let it fill the empty slots with the default 
    10721067components. This section will teach you three things: what are the 
     
    10751070finally, how to write a skeleton for tree induction in Python. 
    10761071 
    1077 Default components for TreeLearner 
    1078 ================================== 
    1079  
    1080 Let us construct a :obj:`TreeLearner` to play with. 
     1072Default components for TreeLearnerBase 
     1073====================================== 
     1074 
     1075Let us construct a :obj:`TreeLearnerBase` to play with. 
    10811076 
    10821077.. _treelearner.py: code/treelearner.py 
     
    14161411.. autofunction:: c45_printTree 
    14171412 
     1413=============== 
     1414orngTree module 
     1415=============== 
     1416 
     1417.. autoclass:: TreeLearner 
     1418    :members: 
     1419 
     1420 
     1421Tree size 
     1422========= 
     1423 
     1424.. autofunction:: countNodes 
     1425 
     1426.. autofunction:: countLeaves 
     1427 
     1428 
     1429 
    14181430""" 
    14191431 
    14201432from Orange.core import \ 
    1421      TreeLearner, \ 
     1433     TreeLearner as TreeLearnerBase, \ 
    14221434         TreeClassifier, \ 
    14231435         C45Learner, \ 
     
    14561468 
    14571469 
    1458 #from orngC45 
    1459 def c45_showBranch(node, classvar, lev, i): 
     1470def _c45_showBranch(node, classvar, lev, i): 
    14601471    var = node.tested 
    14611472    if node.nodeType == 1: 
    14621473        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]), 
    1463         c45_printTree0(node.branch[i], classvar, lev+1) 
     1474        _c45_printTree0(node.branch[i], classvar, lev+1) 
    14641475    elif node.nodeType == 2: 
    14651476        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut), 
    1466         c45_printTree0(node.branch[i], classvar, lev+1) 
     1477        _c45_printTree0(node.branch[i], classvar, lev+1) 
    14671478    else: 
    14681479        inset = filter(lambda a:a[1]==i, enumerate(node.mapping)) 
     
    14721483        else: 
    14731484            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)), 
    1474         c45_printTree0(node.branch[i], classvar, lev+1) 
     1485        _c45_printTree0(node.branch[i], classvar, lev+1) 
    14751486         
    14761487         
    1477 def c45_printTree0(node, classvar, lev): 
     1488def _c45_printTree0(node, classvar, lev): 
    14781489    var = node.tested 
    14791490    if node.nodeType == 0: 
     
    14821493        for i, branch in enumerate(node.branch): 
    14831494            if not branch.nodeType: 
    1484                 c45_showBranch(node, classvar, lev, i) 
     1495                _c45_showBranch(node, classvar, lev, i) 
    14851496        for i, branch in enumerate(node.branch): 
    14861497            if branch.nodeType: 
    1487                 c45_showBranch(node, classvar, lev, i) 
     1498                _c45_showBranch(node, classvar, lev, i) 
    14881499 
    14891500def c45_printTree(tree): 
     
    15351546    (nor is there any need it should). 
    15361547 
     1548    """ 
     1549    _c45_printTree0(tree.tree, tree.classVar, 0) 
     1550 
     1551 
     1552 
     1553 
     1554# 
     1555# From  orngTree 
     1556# 
     1557 
    15371558""" 
    1538     c45_printTree0(tree.tree, tree.classVar, 0) 
    1539     print 
    1540 #end orngC45 
    1541  
    1542  
    1543  
     1559 
     1560The module also contains functions for counting the number of nodes 
     1561and leaves in the tree. It also includes functions for  
     1562printing out the tree, which are 
     1563rather versatile and can print out practically anything you'd like to 
     1564know, from the number of examples, proportion of examples of majority 
     1565class in nodes and similar, to more complex statistics like the 
     1566proportion of examples in a particular class divided by the proportion 
     1567of examples of this class in a parent node. And even more, you can 
     1568define your own callback functions to be used for printing. 
     1569 
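<P>For instance, here is a minimal sketch of counting nodes and leaves (an illustration only; it assumes the iris data set is available as iris.tab):</P>

<XMP class=code>import orange, orngTree
data = orange.ExampleTable("iris")
tree = orngTree.TreeLearner(data)
print "nodes:", orngTree.countNodes(tree)
print "leaves:", orngTree.countLeaves(tree)
</XMP>
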
     1570<P>For a slightly more complex example, here's how to write your own stop function. The example itself is more amusing than useful. It constructs and prints two trees. For the first one we define the <code>defStop</code> function, which is used by default, and combine it with a random function so that the stopping criterion is also met in an additional 20% of the cases in which <code>defStop</code> is false. The second tree is built so that it considers only the random function as the stopping criterion. Note that in the second case the lambda function still has three parameters, since this is the required number of parameters for the stop function (for more, see the section on <a href="../reference/TreeLearner.htm">Orange Trees</a> in Orange Reference). 
     1571</p> 
     1572 
     1573<p class="header"><a href="tree3.py">tree3.py</a> (uses <a href= 
     1574"iris.tab">iris.tab</a>)</p> 
     1575 
     1576<XMP class=code>import orange, orngTree 
     1578from random import randint 
     1578 
     1579data = orange.ExampleTable("iris.tab") 
     1580 
     1581defStop = orange.TreeStopCriteria() 
     1582f = lambda examples, weightID, contingency: defStop(examples, weightID, contingency) or randint(1, 5)==1 
     1583l = orngTree.TreeLearner(data, stop=f) 
     1584orngTree.printTxt(l, leafFields=['major', 'contingency']) 
     1585 
     1586f = lambda x,y,z: randint(1, 5)==1 
     1587l = orngTree.TreeLearner(data, stop=f) 
     1588orngTree.printTxt(l, leafFields=['major', 'contingency']) 
     1589</XMP> 
     1590 
     1591<p>The output is not shown here since the resulting trees are rather 
     1592big.</p> 
     1593 
     1594 
     1595 
     1596<h2>Tree Size</h2> 
     1597 
     1598 
     1599<h2>Printing the Tree</h2> 
     1600<index name="classification trees/printing"> 
     1601 
     1602<P>Function <code>dumpTree</code> dumps a tree to a string, and <code>printTree</code> prints out the tree (<code>printTxt</code> is an alias for <code>printTree</code>, kept for compatibility). The functions take the same arguments.</P> 
     1603 
     1604<P>Before we go on: you can read all about the function and use it to its full extent, or you can just call it, giving it the tree as the sole argument and it will print out the usual textual representation of the tree. If you're satisfied with that, you can stop here.</P> 
     1605 
     1606<p class=section>Arguments</p> 
     1607<dl class=arguments> 
     1608  <dt>tree</dt> 
     1609  <dd>The tree to be printed out.</dd> 
     1610 
     1611  <dt>leafStr</dt> 
     1612  <dd>The format string for printing the tree leaves. If left empty, "%V (%^.2m%)" will be used for classification trees and "%V" for regression trees.</dd> 
     1613 
     1614  <dt>nodeStr</dt> 
     1615  <dd>The string for printing out the internal nodes. If left empty (as it is by default), no data is printed out for internal nodes. If set to <code>"."</code>, the same string is used as for leaves.</dd> 
     1616 
     1617  <dt>maxDepth</dt> 
     1618  <dd>If set, it limits the depth to which the tree is printed out.</dd> 
     1619 
     1620  <dt>minExamples</dt> 
     1621  <dd>If set, the subtrees with fewer than the given number of examples are not printed.</dd> 
     1622 
     1623  <dt>simpleFirst</dt> 
     1624  <dd>If <code>True</code> (default), the branches with a single node are printed before the branches with larger subtrees. If you set it to <code>False</code> (which I don't know why you would), the branches are printed in order of appearance.</dd> 
     1625 
     1626  <dt>userFormats</dt> 
     1627  <dd>A list of regular expressions and callback functions through which the user can print out other specific information in the nodes.</dd> 
     1628</dl> 
     1629 
     1630<P>The magic is in the format string. It is a string which is printed out at every leaf or internal node, with certain format specifiers replaced by data from the tree node. Specifiers are generally of the form 
     1631<B><code>%<em>[^]</em><em>&lt;precision&gt;</em><em>&lt;quantity&gt;</em><em>&lt;divisor&gt;</em></code></B>.</P> 
     1633 
     1634<P><B><EM>^</EM></B> at the start tells that the number should be multiplied by 100. It's useful for printing proportions like percentages, but it makes no sense to multiply, say, the number of examples at the node (although the function will allow it).</P> 
     1635 
     1636<P><B><em>&lt;precision&gt;</em></B> is in the same format as in Python (or C) string formatting. For instance, <code>%N</code> denotes the number of examples in the node, hence <code>%6.2N</code> would mean output to two decimal digits and six places altogether. If left out, a default format <code>5.3</code> is used, unless you multiply the numbers by 100, in which case the default is <code>.0</code> (no decimals, the number is rounded to the nearest integer).</P> 
     1637 
     1638<P><B><em>&lt;divisor&gt;</em></B> tells what to divide the quantity in that node with. <code>bP</code> means division by the same quantity in the parent node; for instance, <code>%NbP</code> will give the number of examples in the node divided by the number of examples in the parent node. You can also use precision formatting, e.g. <code>%6.2NbP</code>. <code>bA</code> is division by the same quantity over the entire data set, so <code>%NbA</code> will tell you the proportion of examples (out of the entire training data set) that fell into that node. If division is impossible since the parent node does not exist or some data is missing, a dot is printed out instead of the quantity.</P> 
     1639 
     1640<P><B><em>&lt;quantity&gt;</em></B> is the only required element. It defines what to print. For instance, <code>%N</code> would print out the number of examples in the node. Possible quantities are 
     1641<dl class=arguments_sm> 
     1642<dt>V</dt> 
     1643<dd>The value predicted at that node. You cannot define the precision or divisor here.</dd> 
     1644 
     1645<dt>N</dt> 
     1646<dd>The number of examples in the node.</dd> 
     1647 
     1648<dt>M</dt> 
     1649<dd>The number of examples in the majority class (that is, the class predicted by the node).</dd> 
     1650 
     1651<dt>m</dt> 
     1652<dd>The proportion of examples in the majority class.</dd> 
     1653 
     1654<dt>A</dt> 
     1655<dd>The average class for examples in the node; this is available only for regression trees.</dd> 
     1656 
     1657<dt>E</dt> 
     1658<dd>Standard error for class of examples in the node; available for regression trees.</dd> 
     1659 
     1660<dt>I</dt> 
     1661<dd>Print out the confidence interval. The modifier is used as <code>%I(95)</code> or (more complicated) <code>%5.3I(95)bP</code>.</dd> 
     1662 
     1663<dt>C</dt> 
     1664<dd>The number of examples in the given class. For classification trees, this modifier is used as, for instance, <code>%5.3C="Iris-virginica"bP</code> - this will give the number of examples of Iris-virginica divided by the number of examples of this class in the parent node. If you are interested in examples that are <em>not</em> Iris-virginica, say <code>%5.3CbP!="Iris-virginica"</code>. 
     1665 
     1666For regression trees, you can use operators =, !=, &lt;, &lt;=, &gt;, and &gt;=, as in <code>%C&lt;22</code> - add the precision and divisor if you will. You can also check the number of examples in a certain interval: <code>%C[20, 22]</code> will give you the number of examples between 20 and 22 (inclusive) and <code>%C(20, 22)</code> will give the number of such examples excluding the boundaries. You can of course mix the parentheses, e.g. <code>%C(20, 22]</code>. If you would like the examples outside the interval, add a <code>!</code>, like <code>%C!(20, 22]</code>.</dd> 
     1667 
     1668<dt>c</dt> 
     1669<dd>Same as above, except that it computes the proportion of the class instead of the number of examples.</dd> 
     1670 
     1671<dt>D</dt> 
     1672<dd>Prints out the number of examples in each class. You can use both the precision (it is applied to each number in the distribution) and the divisor. This quantity cannot be computed for regression trees.</dd> 
     1673 
     1674<dt>d</dt> 
     1675<dd>Same as above, except that it shows proportions of examples. This again doesn't work with regression trees.</dd> 
     1676<dt>&lt;user defined formats&gt;</dt> 
     1677<dd>You can add more, if you will. Instructions and examples are given at the end of this section.</dd> 
     1678</dl> 
     1679 
     1680</P> 
     1681 
     1682<P>Now for some examples. We shall build a small tree from the iris data set - we shall limit the depth to three levels.</P> 
     1683 
     1684<p class="header">part of <a href="orngTree1.py">orngTree1.py</a></p> 
     1685<xmp class="code">import orange, orngTree 
     1686data = orange.ExampleTable("iris") 
     1687tree = orngTree.TreeLearner(data, maxDepth=3) 
     1688</xmp> 
     1689 
     1690<P>The easiest way to call the function is to pass the tree as the only argument. Calling <code>orngTree.printTree(tree)</code> will print 
     1691<xmp class="printout">petal width<0.800: Iris-setosa (100.00%) 
     1692petal width>=0.800 
     1693|    petal width<1.750 
     1694|    |    petal length<5.350: Iris-versicolor (94.23%) 
     1695|    |    petal length>=5.350: Iris-virginica (100.00%) 
     1696|    petal width>=1.750 
     1697|    |    petal length<4.850: Iris-virginica (66.67%) 
     1698|    |    petal length>=4.850: Iris-virginica (100.00%) 
     1699</xmp> 
     1700</P> 
     1701 
     1702<P>Let's now print out the predicted class at each node, and the number of examples in the majority class together with the total number of examples in the node, using 
     1703<code>orngTree.printTree(tree, leafStr="%V (%M out of %N)")</code>. 
     1704<xmp class="printout">petal width<0.800: Iris-setosa (50.000 out of 50.000) 
     1705petal width>=0.800 
     1706|    petal width<1.750 
     1707|    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000) 
     1708|    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000) 
     1709|    petal width>=1.750 
     1710|    |    petal length<4.850: Iris-virginica (2.000 out of 3.000) 
     1711|    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000) 
     1712</xmp> 
     1713</P> 
     1714 
     1715<P>Would you like to know how the number of examples declines as compared to the entire data set and to the parent node? We find it with this: <code>orngTree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)")</code> 
     1716<xmp class="printout">petal width<0.800: Iris-setosa (100%, 100%) 
     1717petal width>=0.800 
     1718|    petal width<1.750 
     1719|    |    petal length<5.350: Iris-versicolor (98%, 100%) 
     1720|    |    petal length>=5.350: Iris-virginica (4%, 40%) 
     1721|    petal width>=1.750 
     1722|    |    petal length<4.850: Iris-virginica (4%, 4%) 
     1723|    |    petal length>=4.850: Iris-virginica (86%, 96%) 
     1724</xmp> 
     1725<P>Let us first read the format string. <code>%M</code> is the number of examples in the majority class. We want it divided by the number of all examples from this class on the entire data set, hence <code>%MbA</code>. To have it multiplied by 100, we say <code>%^MbA</code>. The percent sign <em>after</em> that is just printed out literally, as are the comma and parentheses (see the output). The string for showing the proportion of this class in the parent is the same except that we have <code>bP</code> instead of <code>bA</code>.</P> 
     1726 
     1727<P>And now for the output: all examples of setosa fell into the first node. For versicolor, we have 98% in one node; the rest is certainly not in the neighbouring node (petal length&gt;=5.350) since all versicolors from the node petal width&lt;1.750 went to petal length&lt;5.350 (we know this from the <code>100%</code> in that line). Virginica is the majority class in the three nodes that together contain 94% of this class (4+4+86). The rest must have gone to the same node as versicolor.</P> 
     1728 
     1729<P>If you find this guesswork annoying - so do I. Let us print out the number of versicolors in each node, together with the proportion of versicolors among the examples in this particular node and among all versicolors. So,<br> 
     1730<code>'%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)</code><br>gives the following output:</P> 
     1731 
     1732<xmp class="printout">petal width<0.800: 0.000 (0% of node, 0% of versicolors) 
     1733petal width>=0.800 
     1734|    petal width<1.750 
     1735|    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors) 
     1736|    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors) 
     1737|    petal width>=1.750 
     1738|    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors) 
     1739|    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors) 
     1740</xmp> 
     1741 
     1742<P>Finally, we may want to print out the distributions, using a simple string <code>%D</code>.</P> 
     1743<xmp class="printout">petal width<0.800: [50.000, 0.000, 0.000] 
     1744petal width>=0.800 
     1745|    petal width<1.750 
     1746|    |    petal length<5.350: [0.000, 49.000, 3.000] 
     1747|    |    petal length>=5.350: [0.000, 0.000, 2.000] 
     1748|    petal width>=1.750 
     1749|    |    petal length<4.850: [0.000, 1.000, 2.000] 
     1750|    |    petal length>=4.850: [0.000, 0.000, 43.000] 
     1751</xmp> 
     1752<P>What is the order of numbers here? If you check <code>data.domain.classVar.values</code>, you'll learn that the order is setosa, versicolor, virginica; so in the node at petal length&lt;5.350 we have 49 versicolors and 3 virginicas. To print out the proportions, we can use, for instance, <code>%.2d</code> - this gives us proportions within the node, rounded to two decimals.</P> 
     1753<xmp class="printout">petal width<0.800: [1.00, 0.00, 0.00] 
     1754petal width>=0.800 
     1755|    petal width<1.750 
     1756|    |    petal length<5.350: [0.00, 0.94, 0.06] 
     1757|    |    petal length>=5.350: [0.00, 0.00, 1.00] 
     1758|    petal width>=1.750 
     1759|    |    petal length<4.850: [0.00, 0.33, 0.67] 
     1760|    |    petal length>=4.850: [0.00, 0.00, 1.00] 
     1761</xmp> 
     1762 
     1763<P>We haven't yet tried printing out any information for internal nodes. To start with the most trivial case, we shall print the prediction at each node. The call 
     1764<xmp class="code">orngTree.printTree(tree, leafStr="%V", nodeStr=".")</xmp> says that the <code>nodeStr</code> should be the same as <code>leafStr</code> (not very useful here, since <code>leafStr</code> is trivial anyway).</P> 
     1765<xmp class="printout">root: Iris-setosa 
     1766|    petal width<0.800: Iris-setosa 
     1767|    petal width>=0.800: Iris-versicolor 
     1768|    |    petal width<1.750: Iris-versicolor 
     1769|    |    |    petal length<5.350: Iris-versicolor 
     1770|    |    |    petal length>=5.350: Iris-virginica 
     1771|    |    petal width>=1.750: Iris-virginica 
     1772|    |    |    petal length<4.850: Iris-virginica 
     1773|    |    |    petal length>=4.850: Iris-virginica 
     1774</xmp> 
     1775 
     1776<P>Note that the output is somewhat different now: another node called <em>root</em> has appeared and the tree looks one level deeper. This is needed to print out the data for that node too.</P> 
     1777 
     1778<P>Now for something more complicated: let us observe how the number of virginicas decreases down the tree:</P> 
     1779<xmp class="code">orngTree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')</xmp> 
     1780<P>Let's first interpret the format string: <code>CbA="Iris-virginica"</code> is the number of examples from class virginica, divided by the total number of examples in this class. Add <code>^.1</code> and the result will be multiplied and printed with one decimal. The trailing <code>%</code> is printed out. In parentheses we print the same thing except that we divide by the examples in the parent node. Note the use of single quotes, so we can use the double quotes inside the string, when we specify the class.</P> 
     1781<xmp class="printout">root: 100.0% (.%) 
     1782|    petal width<0.800: 0.0% (0.0%) 
     1783|    petal width>=0.800: 100.0% (100.0%) 
     1784|    |    petal width<1.750: 10.0% (10.0%) 
     1785|    |    |    petal length<5.350: 6.0% (60.0%) 
     1786|    |    |    petal length>=5.350: 4.0% (40.0%) 
     1787|    |    petal width>=1.750: 90.0% (90.0%) 
     1788|    |    |    petal length<4.850: 4.0% (4.4%) 
     1789|    |    |    petal length>=4.850: 86.0% (95.6%) 
     1790</xmp> 
     1791<P>See what's in the parentheses in the root node? If <code>printTree</code> cannot compute something (in this case it's because the root has no parent), it prints out a dot. You can also replace <code>=</code> by <code>!=</code> and it will count all classes <em>except</em> virginica.</P> 
     1792 
     1793<P>For one final example with classification trees, we shall print the distributions in the nodes, the distribution compared to the parent, and the proportions compared to the parent (the latter two are not the same - think why). In the leaves we shall also add the predicted class. So now we'll have to call the function like this.</P> 
     1794<xmp class="code">orngTree.printTree(tree, leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')</xmp> 
     1795<p>Here's the result:</p> 
     1796<xmp class="printout">root: [50.000, 50.000, 50.000] . . 
     1797|    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]: 
     1798|        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00] 
     1799|    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50] 
     1800|    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19] 
     1801|    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]: 
     1802|    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62] 
     1803|    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]: 
     1804|    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80] 
     1805|    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96] 
     1806|    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]: 
     1807|    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68] 
     1808|    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]: 
     1809|    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02] 
     1810</xmp> 
     1811 
     1812<P>To explore the possibilities when printing regression trees, we are going to induce a tree from the housing data set. Called with the tree as the only argument, <code>printTree</code> prints the tree like this:</P> 
     1813 
     1814<xmp class="printout">RM<6.941 
     1815|    LSTAT<14.400 
     1816|    |    DIS<1.385: 45.6 
     1817|    |    DIS>=1.385: 22.9 
     1818|    LSTAT>=14.400 
     1819|    |    CRIM<6.992: 17.1 
     1820|    |    CRIM>=6.992: 12.0 
     1821RM>=6.941 
     1822|    RM<7.437 
     1823|    |    CRIM<7.393: 33.3 
     1824|    |    CRIM>=7.393: 14.4 
     1825|    RM>=7.437 
     1826|    |    TAX<534.500: 45.9 
     1827|    |    TAX>=534.500: 21.9 
     1828</xmp> 
     1829 
     1830<P>Let us add the standard error in both internal nodes and leaves, and the 90% confidence intervals in the leaves. So we want to call it like this:</P> 
     1831<xmp class="code">orngTree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")</xmp> 
     1832 
     1833<xmp class="printout">root: [SE: 0.409] 
     1834|    RM<6.941: [SE: 0.306] 
     1835|    |    LSTAT<14.400: [SE: 0.320] 
     1836|    |    |    DIS<1.385: [SE: 4.420]: 
     1837|    |    |        [SE: 4.420]   45.6 [38.331-52.829] 
     1838|    |    |    DIS>=1.385: [SE: 0.244]: 
     1839|    |    |        [SE: 0.244]   22.9 [22.504-23.306] 
     1840|    |    LSTAT>=14.400: [SE: 0.333] 
     1841|    |    |    CRIM<6.992: [SE: 0.338]: 
     1842|    |    |        [SE: 0.338]   17.1 [16.584-17.691] 
     1843|    |    |    CRIM>=6.992: [SE: 0.448]: 
     1844|    |    |        [SE: 0.448]   12.0 [11.243-12.714] 
     1845|    RM>=6.941: [SE: 1.031] 
     1846|    |    RM<7.437: [SE: 0.958] 
     1847|    |    |    CRIM<7.393: [SE: 0.692]: 
     1848|    |    |        [SE: 0.692]   33.3 [32.214-34.484] 
     1849|    |    |    CRIM>=7.393: [SE: 2.157]: 
     1850|    |    |        [SE: 2.157]   14.4 [10.862-17.938] 
     1851|    |    RM>=7.437: [SE: 1.124] 
     1852|    |    |    TAX<534.500: [SE: 0.817]: 
     1853|    |    |        [SE: 0.817]   45.9 [44.556-47.237] 
     1854|    |    |    TAX>=534.500: [SE: 0.000]: 
     1855|    |    |        [SE: 0.000]   21.9 [21.900-21.900] 
     1856</xmp> 
     1857 
     1858<P>What's the difference between <code>%V</code>, the predicted value, and <code>%A</code>, the average? Doesn't a regression tree always predict the leaf average anyway? Not necessarily: the tree predicts whatever the <code>nodeClassifier</code> in a leaf returns. But you're mostly right. The difference is in the number of decimals: <code>%V</code> uses the <code>FloatVariable</code>'s function for printing out the value, which makes the printed number have the same number of decimals as in the original file from which the data was read.</P> 
     1859 
     1860<P>Regression trees cannot print the distributions in the same way as classification trees. They instead offer a set of operators for observing the number of examples within a certain range. For instance, let us check the number of examples with values below 22, and compare this number with values in the parent nodes.</P> 
     1861<xmp class="code">orngTree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".")</xmp> 
     1862 
     1863<xmp class="printout">root: 277.000 (.) 
     1864|    RM<6.941: 273.000 (1.160) 
     1865|    |    LSTAT<14.400: 107.000 (0.661) 
     1866|    |    |    DIS<1.385: 0.000 (0.000) 
     1867|    |    |    DIS>=1.385: 107.000 (1.020) 
     1868|    |    LSTAT>=14.400: 166.000 (1.494) 
     1869|    |    |    CRIM<6.992: 93.000 (0.971) 
     1870|    |    |    CRIM>=6.992: 73.000 (1.040) 
     1871|    RM>=6.941: 4.000 (0.096) 
     1872|    |    RM<7.437: 3.000 (1.239) 
     1873|    |    |    CRIM<7.393: 0.000 (0.000) 
     1874|    |    |    CRIM>=7.393: 3.000 (15.333) 
     1875|    |    RM>=7.437: 1.000 (0.633) 
     1876|    |    |    TAX<534.500: 0.000 (0.000) 
     1877|    |    |    TAX>=534.500: 1.000 (30.000)</xmp> 
     1878 
     1879<P>The last line, for instance, says that among the examples with tax above 534, the proportion of examples with the class below 22 is 30 times higher than in the parent node.</P> 
     1880 
     1881<P>For another exercise, let's count the same for all examples <em>outside</em> the interval [20, 22] (given like this, the interval includes the bounds), and let us print out the proportions as percentages.</P> 
     1882 
     1883<xmp class="code">orngTree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")</xmp> 
     1884 
     1885<P>OK, let's observe the format string one last time. <code>%c![20, 22]</code> would be the proportion of examples (within the node) whose values are below 20 or above 22. With <code>%cbP![20, 22]</code> we divide this by the same statistic computed on the parent. Add a <code>^</code> and you have the percentages.</P> 
     1886 
     1887<xmp class="printout">root: 439.000 (.%) 
     1888|    RM<6.941: 364.000 (98%) 
     1889|    |    LSTAT<14.400: 200.000 (93%) 
     1890|    |    |    DIS<1.385: 5.000 (127%) 
     1891|    |    |    DIS>=1.385: 195.000 (99%) 
     1892|    |    LSTAT>=14.400: 164.000 (111%) 
     1893|    |    |    CRIM<6.992: 91.000 (96%) 
     1894|    |    |    CRIM>=6.992: 73.000 (105%) 
     1895|    RM>=6.941: 75.000 (114%) 
     1896|    |    RM<7.437: 46.000 (101%) 
     1897|    |    |    CRIM<7.393: 43.000 (100%) 
     1898|    |    |    CRIM>=7.393: 3.000 (100%) 
     1899|    |    RM>=7.437: 29.000 (98%) 
     1900|    |    |    TAX<534.500: 29.000 (103%) 
     1901|    |    |    TAX>=534.500: 0.000 (0%) 
     1902</xmp> 
     1903 
     1904 
     1905<h3>Defining Your Own Printout Functions</h3> 
     1906 
     1907<P><code>dumpTree</code>'s argument <code>userFormats</code> can be used to print out some other information in the leaves or nodes. If provided, <code>userFormats</code> should contain a list of tuples with a regular expression and a callback function to be called when that expression is found in the format string. Expressions from <code>userFormats</code> are checked before the built-in expressions discussed above, so you can override the built-ins if you want to.</P> 
     1908 
     1909<P>The regular expression should describe a string like those we used above, for instance the string <code>%.2DbP</code>. When a leaf or internal node is printed out, the format string (<code>leafStr</code> or <code>nodeStr</code>) is checked for these regular expressions and when the match is found, the corresponding callback function is called.</P> 
     1910 
     1911<P>The callback function will get five arguments: the format string (<code>leafStr</code> or <code>nodeStr</code>), the match object, the node which is being printed, its parent (can be <code>None</code>) and the tree classifier. The function should return the format string in which the part described by the match object (that is, the part that is matched by the regular expression) is replaced by whatever information your callback function is supposed to give.</P> 
     1912 
     1913<P>The function can use several utility functions provided in the module.</P> 
     1914<dl class="attributes"> 
     1915<dt>insertStr(s, mo, sub)</dt> 
     1916<dd>Replaces the part of <code>s</code> which is covered by <code>mo</code> by the string <code>sub</code>.</dd> 
     1917 
     1918<dt>insertDot(s, mo)</dt> 
     1919<dd>Calls <code>insertStr(s, mo, ".")</code>. You should use this when the function cannot compute the desired quantity; it is called, for instance, when it needs to divide by something in the parent, but the parent doesn't exist.</dd> 
     1920 
     1921<dt>insertNum(s, mo, n)</dt> 
     1922<dd>Replaces the part of <code>s</code> matched by <code>mo</code> by the number <code>n</code>, formatted as specified by the user, that is, it multiplies it by 100, if needed, and prints it with the right number of places and decimals. It does so by checking the <code>mo</code> for a group named <code>m100</code> (representing the <code>^</code> in the format string) and a group named <code>fs</code> representing the part giving the number of decimals (e.g. <code>5.3</code>).</dd> 
     1923 
     1924<dt>byWhom(by, parent, tree)</dt> 
     1925<dd>If <code>by</code> equals <code>bp</code>, it returns <code>parent</code>, else it returns <code>tree.tree</code>. This is used to find what to divide the quantity with, when division is required.</dd> 
     1926</dl> 
     1927 
     1928<P>There are also a few pieces of regular expression that you may want to reuse. The two you are likely to use are</P> 
     1929<dl class="attributes-sm"> 
     1930<dt>fs</dt> 
     1931<dd>Defines the multiplier by 100 (<code>^</code>) and the format for the number of decimals (e.g. <code>5.3</code>). The corresponding groups are named <code>m100</code> and <code>fs</code>.</dd> 
     1932 
     1933<dt>by</dt> 
     1934<dd>Defines <code>bP</code> or <code>bA</code> or nothing; the result is in groups <code>by</code>.</dd> 
     1935</dl> 
     1936 
     1937<P>For a trivial example, "%V" is implemented like this. There is the following tuple in the list of built-in formats: <code>(re.compile("%V"), replaceV)</code>. <code>replaceV</code> is a function defined by:</P> 
     1938<xmp class="code">def replaceV(strg, mo, node, parent, tree): 
     1939    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))</xmp> 
     1940<P>It therefore takes the value predicted at the node (<code>node.nodeClassifier.defaultValue</code>), converts it to a string and passes it to <code>insertStr</code> to do the replacement.</P> 
     1941 
     1942<P>A more complex regular expression is the one for the proportion of majority class, defined as <code>"%"+fs+"M"+by</code>. It uses the two partial expressions defined above.</P> 
     1943 
     1944<P>Let's say we would like to print the classification margin for each node, that is, the difference between the proportions of the largest and the second largest class in the node.</P> 
     1945 
     1946<p class="header">part of <a href="orngTree2.py">orngTree2.py</a></p> 
     1947<xmp class="code">def getMargin(dist): 
     1948    if dist.abs < 1e-30: 
     1949        return 0 
     1950    l = list(dist) 
     1951    l.sort() 
     1952    return (l[-1] - l[-2]) / dist.abs 
     1953 
     1954def replaceB(strg, mo, node, parent, tree): 
     1955    margin = getMargin(node.distribution) 
     1956 
     1957    by = mo.group("by") 
     1958    if margin and by: 
     1959        whom = orngTree.byWhom(by, parent, tree) 
     1960        if whom and whom.distribution: 
     1961            divMargin = getMargin(whom.distribution) 
     1962            if divMargin > 1e-30: 
     1963                margin /= divMargin 
     1964            else: 
     1965                orngTree.insertDot(strg, mo) 
     1966        else: 
     1967            return orngTree.insertDot(strg, mo) 
     1968    return orngTree.insertNum(strg, mo, margin) 
     1969 
     1970 
     1971myFormat = [(re.compile("%"+orngTree.fs+"B"+orngTree.by), replaceB)]</xmp> 
     1972 
     1973<P>We first defined <code>getMargin</code>, which gets the distribution and computes the margin. The callback function, <code>replaceB</code>, computes the margin for the node. If we need to divide the quantity by something (that is, if the <code>by</code> group is present), we call <code>orngTree.byWhom</code> to get the node with whose margin this node's margin is to be divided. If this node (usually the parent) does not exist or if its margin is zero, we call <code>insertDot</code> to insert a dot; otherwise we call <code>insertNum</code>, which will insert the number, obeying the format specified by the user.</P> 
     1974 
     1975<P><code>myFormat</code> is a list containing the regular expression and the callback function.</P> 
     1976 
     1977<P>We can now print out the iris tree, for instance using the following call.</P> 
     1978<xmp class="code">orngTree.printTree(tree, leafStr="%V %^B% (%^3.2BbP%)", userFormats = myFormat)</xmp> 
     1979 
     1980<P>And this is what we get.</P> 
     1981<xmp class="printout">petal width<0.800: Iris-setosa 100% (100.00%) 
     1982petal width>=0.800 
     1983|    petal width<1.750 
     1984|    |    petal length<5.350: Iris-versicolor 88% (108.57%) 
     1985|    |    petal length>=5.350: Iris-virginica 100% (122.73%) 
     1986|    petal width>=1.750 
     1987|    |    petal length<4.850: Iris-virginica 33% (34.85%) 
     1988|    |    petal length>=4.850: Iris-virginica 100% (104.55%) 
     1989</xmp> 
     1990 
     1991 
     1992<h2>Plotting the Tree using Dot</h2> 
     1993 
     1994<p>Function <code>printDot</code> prints the tree to a file in a format used by <a 
     1995href="http://www.research.att.com/sw/tools/graphviz">GraphViz</a>. 
     1996It uses the same parameters as <code>printTxt</code> defined above, and 
     1997in addition two parameters which define the shapes used for internal 
     1998nodes and leaves of the tree:</p> 
     1999 
     2000<p class=section>Arguments</p> 
     2001<dl class=arguments> 
     2002  <dt>leafShape</dt> 
     2003  <dd>Shape of the outline around leaves of the tree. If "plaintext", 
     2004  no outline is used (default: "plaintext")</dd> 
     2005 
     2006  <dt>internalNodeShape</dt> 
     2007  <dd>Shape of the outline around internal nodes of the tree. If "plaintext", 
     2008  no outline is used (default: "box")</dd> 
     2009</dl> 
     2010 
     2011<p>Check <a 
     2012href="http://www.graphviz.org/doc/info/shapes.html">Polygon-based 
     2013Nodes</a> for various outlines supported by GraphViz.</p> 
     2014 
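<P>For illustration, here is a sketch of producing such a file (we assume the output file name is passed as the second argument; the shape parameters are the ones described above):</P>

<XMP class=code>import orange, orngTree
data = orange.ExampleTable("iris")
tree = orngTree.TreeLearner(data, maxDepth=3)
orngTree.printDot(tree, "tree5.dot", internalNodeShape="ellipse", leafShape="box")
</XMP>
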
     2015<P>Suppose you saved the tree in a file <code>tree5.dot</code>. You can then print it out as a gif if you execute the following command line 
     2016<XMP class=code>dot -Tgif tree5.dot -otree5.gif 
     2017</XMP> 
     2018</P> 
     2019<P>GraphViz's dot has quite a few other output formats; check its documentation to learn about them.</P> 
     2020 
     2021References 
     2022========== 
     2023 
     2024E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories, 
     2025Murray Hill NJ, U.S.A., October 1993. 
     2026 
     2027<p><a href="http://www.research.att.com/sw/tools/graphviz/">Graphviz - 
     2028open source graph drawing software</a>. A home page of AT&T's dot and 
     2029similar software packages.</p> 
     2030 
     2031""" 
     2032 
     2033import orange 
     2034import base64 
     2035from warnings import warn 
     2036 
     2037class TreeLearner(orange.Learner): 
     2038    """ 
     2039    Assembles the generic classification or regression tree learner  
     2040    (from Orange's objects for induction of decision trees).  
     2041    :class:`TreeLearner` is essentially a wrapper 
     2042    around :class:`TreeLearnerBase`, provided for easier use of the latter. 
     2043    It sets a number of parameters used in induction that 
     2044    can also be set after the creation of the object, most often through 
     2045    the object's attributes. If upon initialization 
     2046    :class:`TreeLearner` is given a set of examples, then an instance 
     2047    of :class:`TreeClassifier` is returned instead. 
     2048 
     2049    The values of attributes can also be set in the constructor.  
     2050 
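    For illustration, a minimal usage sketch of both forms (``data`` stands
    for any loaded example table)::

        learner = orngTree.TreeLearner(maxDepth=3)     # constructs a learner
        tree = orngTree.TreeLearner(data, maxDepth=3)  # induces and returns a TreeClassifier
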
     2051    .. attribute:: nodeLearner 
     2052 
     2053        Induces a classifier from examples belonging to a node. The 
     2054        same learner is used for internal nodes and for leaves. The 
     2055        default :obj:`nodeLearner` is :obj:`MajorityLearner`. 
     2056 
     2057    **Split construction** 
     2058 
     2059    .. attribute:: split 
     2060         
     2061        Defines a function that will be used in place of 
     2062        :obj:`TreeSplitConstructor`.  
     2063        Useful when prototyping new tree induction 
     2064        algorithms. When this parameter is defined, other parameters that 
     2065        affect the procedures for growing of the tree are ignored. These 
     2066        include :obj:`binarization`, :obj:`measure`, 
     2067        :obj:`worstAcceptable` and :obj:`minSubset` (Default: 
     2068        :class:`TreeSplitConstructor_Combined`  
     2069        with separate constructors for discrete and continuous attributes. 
     2070        Discrete attributes are used as they are, while  
     2071        continuous attributes are binarized. 
     2072        Gain ratio is used to select attributes.  
     2073        A minimum of two examples in a leaf is required for  
     2074        discrete and five examples in a leaf for continuous attributes.) 
     2075 
     2076    .. attribute:: binarization 
     2077 
     2078        If 1, :class:`TreeSplitConstructor_ExhaustiveBinary` is used. 
     2079        If 2, use :class:`TreeSplitConstructor_OneAgainstOthers`. If 
     2080        0, do not use binarization (use :class:`TreeSplitConstructor_Attribute`). 
     2081        Default: 0. 
     2082 
     2083    .. attribute:: measure 
     2084     
     2085        Measure for scoring of the attributes when deciding which of the 
     2086        attributes will be used for splitting of the example set in the node. 
     2087        Can be either a measure XXXXX or one of 
     2088        "infoGain" (:class:`orange.MeasureAttribute_info`),  
     2089        "gainRatio" (:class:`orange.MeasureAttribute_gainRatio`),  
     2090        "gini" (:class:`orange.MeasureAttribute_gini`), 
     2091        "relief" (:class:`orange.MeasureAttribute_relief`), 
     2092        "retis" (:class:`orange.MeasureAttribute_MSE`). Default: "gainRatio". 
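
        For instance, to score splits with the Gini index (a sketch; ``data``
        is any loaded example table):
        :samp:`l = orngTree.TreeLearner(data, measure="gini")`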
     2093 
     2094    .. attribute:: reliefM, reliefK 
     2095 
     2096        Set `m` and `k` to the given values if the :obj:`measure` is relief. 
     2097 
     2098    **Pruning** 
     2099 
     2100    .. attribute:: worstAcceptable 
     2101 
     2102        Used in pre-pruning, sets the lowest required attribute 
     2103        score. If the score of the best attribute is below this margin, the 
     2104        tree at that node is not grown further (default: 0). 
     2105 
     2106        So, to allow splitting only when gainRatio (the default measure) 
     2107        is greater than 0.6, one should run the learner like this: 
     2108        :samp:`l = orngTree.TreeLearner(data, worstAcceptable=0.6)` 
     2109 
     2110    .. attribute:: minSubset 
     2111 
     2112        Minimal number of examples in non-null leaves (default: 0). 
     2113 
     2114    .. attribute:: minExamples 
     2115 
     2116        Data subsets with less than :obj:`minExamples` 
     2117        examples are not split any further, that is, all leaves in the tree 
     2118        will contain at least that many examples (default: 0). 
     2119 
     2120    .. attribute:: maxDepth 
     2121 
     2122        Gives the maximal tree depth; 0 means that only the root is generated.  
     2123        The default is 100.  
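
        For instance, :samp:`tree = orngTree.TreeLearner(data, maxDepth=2)`
        limits the induced tree to a depth of two (a sketch; ``data`` is any
        loaded example table).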
     2124 
     2125    .. attribute:: maxMajority 
     2126 
     2127        Induction stops when the proportion of majority class in the 
     2128        node exceeds the value set by this parameter (default: 1.0). E.g. 
     2129        to stop the induction as soon as the majority class reaches 70%, 
     2130        you should say  
     2131        :samp:`tree2 = orngTree.TreeLearner(data, maxMajority=0.7)` 
     2132 
     2133        This is an example of a tree induced from the iris data set with 
     2134        the above arguments - the numbers show the majority class  
     2136        proportion at each node. You can find more details in the  
     2137        script tree2.py, which induces and prints this tree. 
     2138 
     2139        :: 
     2140 
     2141            root: 0.333 
     2142            |    petal width<0.800: 1.000 
     2143            |    petal width>=0.800: 0.500 
     2144            |    |    petal width<1.750: 0.907 
     2145            |    |    petal width>=1.750: 0.978 
     2146     
     2147    .. attribute:: stop 
     2148 
     2149        Used for passing a function which is used in place of 
     2150        :class:`TreeStopCriteria`. Useful when prototyping new 
     2151        tree induction algorithms. See the documentation on  
     2152        :class:`TreeStopCriteria` for more info on this function.  
     2153        When used, parameters  :obj:`maxMajority` and :obj:`minExamples`  
     2154        will not be considered (default: None). 
     2155        The default stopping criterion stops induction when all examples in a node belong to the same class. 
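
        For illustration, a sketch of a custom stopping function that
        combines the default criterion with a random one, in the spirit of
        the tree3.py example above (``data`` is any loaded example table)::

            import orange, orngTree
            from random import randint
            defStop = orange.TreeStopCriteria()
            f = lambda examples, weightID, contingency: \
                defStop(examples, weightID, contingency) or randint(1, 5) == 1
            tree = orngTree.TreeLearner(data, stop=f)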
     2156 
     2157    .. attribute:: mForPruning 
     2158 
     2159        If non-zero, invokes an error-based bottom-up post-pruning, 
     2160        where m-estimate is used to estimate class probabilities  
     2161        (default: 0). 
     2162 
     2163    .. attribute:: sameMajorityPruning 
     2164 
     2165        If true, invokes a bottom-up post-pruning by removing the 
     2166        subtrees whose leaves all classify to the same class 
     2167        (default: False). 
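
        For instance, to apply m-estimate pruning with m=2 (a sketch;
        ``data`` is any loaded example table):
        :samp:`tree = orngTree.TreeLearner(data, mForPruning=2)`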
     2168 
     2169    **Record keeping** 
     2170 
     2171    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier 
     2172 
     2173        Determines whether to store class distributions, contingencies and 
     2174        examples in :class:`TreeNode`, and whether the :obj:`nodeClassifier` 
     2175        should be built for internal nodes. By default everything except  
     2176        :obj:`storeExamples` is enabled. You won't save any memory by not storing  
     2177        distributions but storing contingencies, since the distribution actually points to 
     2178        the same distribution that is stored in 
     2179        :obj:`contingency.classes`. (default: True except for 
     2180        storeExamples, which defaults to False). 
     2181     
     2182    """ 
     2183    def __new__(cls, examples = None, weightID = 0, **argkw): 
     2184        self = orange.Learner.__new__(cls, **argkw) 
     2185        if examples: 
     2186            self.__init__(**argkw) 
     2187            return self.__call__(examples, weightID) 
     2188        else: 
     2189            return self 
     2190       
     2191    def __init__(self, **kw): 
     2192        self.learner = None 
     2193        self.__dict__.update(kw) 
     2194       
     2195    def __setattr__(self, name, value): 
     2196        if name in ["split", "binarization", "measure", "worstAcceptable", "minSubset", 
     2197              "stop", "maxMajority", "minExamples", "nodeLearner", "maxDepth", "reliefM", "reliefK"]: 
     2198            self.learner = None 
     2199        self.__dict__[name] = value 
     2200 
     2201    def __call__(self, examples, weight=0): 
     2202        """ 
     2203        Return a classifier from the given examples. 
     2204        """ 
     2205        if not self.learner: 
     2206            self.learner = self.instance() 
     2207        if not hasattr(self, "split") and not hasattr(self, "measure"): 
     2208            if examples.domain.classVar.varType == orange.VarTypes.Discrete: 
     2209                measure = orange.MeasureAttribute_gainRatio() 
     2210            else: 
     2211                measure = orange.MeasureAttribute_MSE() 
     2212            self.learner.split.continuousSplitConstructor.measure = measure 
     2213            self.learner.split.discreteSplitConstructor.measure = measure 
     2214             
     2215        tree = self.learner(examples, weight) 
     2216        if getattr(self, "sameMajorityPruning", 0): 
     2217            tree = orange.TreePruner_SameMajority(tree) 
     2218        if getattr(self, "mForPruning", 0): 
     2219            tree = orange.TreePruner_m(tree, m = self.mForPruning) 
     2220        return tree 
     2221 
     2222    def instance(self): 
     2223        """ 
     2224        Return the constructed learner - an object of :class:`TreeLearnerBase`. 
     2225        """ 
     2226        learner = orange.TreeLearner() 
     2227 
     2228        hasSplit = hasattr(self, "split") 
     2229        if hasSplit: 
     2230            learner.split = self.split 
     2231        else: 
     2232            learner.split = orange.TreeSplitConstructor_Combined() 
     2233            learner.split.continuousSplitConstructor = orange.TreeSplitConstructor_Threshold() 
     2234            binarization = getattr(self, "binarization", 0) 
     2235            if binarization == 1: 
     2236                learner.split.discreteSplitConstructor = orange.TreeSplitConstructor_ExhaustiveBinary() 
     2237            elif binarization == 2: 
     2238                learner.split.discreteSplitConstructor = orange.TreeSplitConstructor_OneAgainstOthers() 
     2239            else: 
     2240                learner.split.discreteSplitConstructor = orange.TreeSplitConstructor_Attribute() 
     2241 
     2242            measures = {"infoGain": orange.MeasureAttribute_info, 
     2243                "gainRatio": orange.MeasureAttribute_gainRatio, 
     2244                "gini": orange.MeasureAttribute_gini, 
     2245                "relief": orange.MeasureAttribute_relief, 
     2246                "retis": orange.MeasureAttribute_MSE 
     2247                } 
     2248 
     2249            measure = getattr(self, "measure", None) 
     2250            if type(measure) == str: 
     2251                measure = measures[measure]() 
     2252            if not measure:    # hasSplit is False in this branch 
     2253                measure = orange.MeasureAttribute_gainRatio() 
     2254 
     2255            measureIsRelief = type(measure) == orange.MeasureAttribute_relief 
     2256            relM = getattr(self, "reliefM", None) 
     2257            if relM and measureIsRelief: 
     2258                measure.m = relM 
     2259             
     2260            relK = getattr(self, "reliefK", None) 
     2261            if relK and measureIsRelief: 
     2262                measure.k = relK 
     2263 
     2264            learner.split.continuousSplitConstructor.measure = measure 
     2265            learner.split.discreteSplitConstructor.measure = measure 
     2266 
     2267            wa = getattr(self, "worstAcceptable", 0) 
     2268            if wa: 
     2269                learner.split.continuousSplitConstructor.worstAcceptable = wa 
     2270                learner.split.discreteSplitConstructor.worstAcceptable = wa 
     2271 
     2272            ms = getattr(self, "minSubset", 0) 
     2273            if ms: 
     2274                learner.split.continuousSplitConstructor.minSubset = ms 
     2275                learner.split.discreteSplitConstructor.minSubset = ms 
     2276 
     2277        if hasattr(self, "stop"): 
     2278            learner.stop = self.stop 
     2279        else: 
     2280            learner.stop = orange.TreeStopCriteria_common() 
     2281            mm = getattr(self, "maxMajority", 1.0) 
     2282            if mm < 1.0: 
     2283                learner.stop.maxMajority = self.maxMajority 
     2284            me = getattr(self, "minExamples", 0) 
     2285            if me: 
     2286                learner.stop.minExamples = self.minExamples 
     2287 
     2288        for a in ["storeDistributions", "storeContingencies", "storeExamples", "storeNodeClassifier", "nodeLearner", "maxDepth"]: 
     2289            if hasattr(self, a): 
     2290                setattr(learner, a, getattr(self, a)) 
     2291 
     2292        return learner 
     2293 
     2294def __countNodes(node): 
     2295    count = 0 
     2296    if node: 
     2297        count += 1 
     2298        if node.branches: 
     2299            for node in node.branches: 
     2300                count += __countNodes(node) 
     2301    return count 
     2302 
     2303def countNodes(tree): 
     2304    """ 
     2305    Return the number of nodes in the tree. 
     2306 
     2307    :param tree: The tree for which to count the nodes. 
     2308    :type tree: :class:`TreeClassifier` 
     2309    """ 
     2310    return __countNodes(type(tree) == orange.TreeClassifier and tree.tree or tree) 
     2311 
     2312 
     2313def __countLeaves(node): 
     2314    count = 0 
     2315    if node: 
     2316        if node.branches: # internal node 
     2317            for node in node.branches: 
     2318                count += __countLeaves(node) 
     2319        else: 
     2320            count += 1 
     2321    return count 
     2322 
     2323def countLeaves(tree): 
     2324    """ 
     2325    Return the number of leaves in the tree. 
     2326 
     2327    :param tree: The tree for which to count the leaves. 
     2328    :type tree: :class:`TreeClassifier` 
     2329    """ 
     2330    return __countLeaves(type(tree) == orange.TreeClassifier and tree.tree or tree) 
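
# A small usage sketch for countNodes and countLeaves, kept as a comment in
# the style of the examples at the bottom of this module (the import and data
# names are assumptions):
##import orange, orngTree
##tree = orngTree.TreeLearner(orange.ExampleTable("iris"))
##print orngTree.countNodes(tree), orngTree.countLeaves(tree)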
     2331 
     2332 
     2333# The following implements the textual (dumpTree, printTree) and dot (dotTree) output of trees. 
     2334 
     2335import re, base64    # base64 is used by _quoteName below 
     2336fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)" 
     2337by = r"(?P<by>(b(P|A)))?" 
     2338bysub = r"((?P<bysub>b|s)(?P<by>P|A))?" 
     2339opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)" 
     2340opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"' 
     2341intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)' 
     2342fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))" 
     2343re_V = re.compile("%V") 
     2344re_N = re.compile("%"+fs+"N"+by) 
     2345re_M = re.compile("%"+fs+"M"+by) 
     2346re_m = re.compile("%"+fs+"m"+by) 
     2347re_Ccont = re.compile("%"+fs+"C"+by+opc) 
     2348re_Cdisc = re.compile("%"+fs+"C"+by+opd) 
     2349re_ccont = re.compile("%"+fs+"c"+by+opc) 
     2350re_cdisc = re.compile("%"+fs+"c"+by+opd) 
     2351re_Cconti = re.compile("%"+fs+"C"+by+fromto) 
     2352re_cconti = re.compile("%"+fs+"c"+by+fromto) 
     2353re_D = re.compile("%"+fs+"D"+by) 
     2354re_d = re.compile("%"+fs+"d"+by) 
     2355re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub) 
     2356re_I = re.compile("%"+fs+"I"+intrvl) 
     2357 
     2358def insertDot(s, mo): 
     2359    return s[:mo.start()] + "." + s[mo.end():] 
     2360 
     2361def insertStr(s, mo, sub): 
     2362    return s[:mo.start()] + sub + s[mo.end():] 
     2363 
     2364def insertNum(s, mo, n): 
     2365    grps = mo.groupdict() 
     2366    m100 = grps.get("m100", None) 
     2367    if m100: 
     2368        n *= 100 
     2369    fs = grps.get("fs") or (m100 and ".0" or "5.3") 
     2370    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():] 
     2371 
     2372def byWhom(by, parent, tree): 
     2373    if by == "bP": 
     2374        return parent 
     2375    else: 
     2376        return tree.tree 
     2377 
     2378def replaceV(strg, mo, node, parent, tree): 
     2379    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue)) 
     2380 
     2381def replaceN(strg, mo, node, parent, tree): 
     2382    by = mo.group("by") 
     2383    N = node.distribution.abs 
     2384    if by: 
     2385        whom = byWhom(by, parent, tree) 
     2386        if whom and whom.distribution: 
     2387            if whom.distribution.abs > 1e-30: 
     2388                N /= whom.distribution.abs 
     2389        else: 
     2390            return insertDot(strg, mo) 
     2391    return insertNum(strg, mo, N) 
     2392         
     2393 
     2394def replaceM(strg, mo, node, parent, tree): 
     2395    by = mo.group("by") 
     2396    maj = int(node.nodeClassifier.defaultValue) 
     2397    N = node.distribution[maj] 
     2398    if by: 
     2399        whom = byWhom(by, parent, tree) 
     2400        if whom and whom.distribution: 
     2401            if whom.distribution[maj] > 1e-30: 
     2402                N /= whom.distribution[maj] 
     2403        else: 
     2404            return insertDot(strg, mo) 
     2405    return insertNum(strg, mo, N) 
     2406         
     2407 
     2408def replacem(strg, mo, node, parent, tree): 
     2409    by = mo.group("by") 
     2410    maj = int(node.nodeClassifier.defaultValue) 
     2411    if node.distribution.abs > 1e-30: 
     2412        N = node.distribution[maj] / node.distribution.abs 
     2413        if by: 
                     whom = byWhom(by, parent, tree) 
     2414            if whom and whom.distribution: 
     2415                byN = whom.distribution[maj] / whom.distribution.abs 
     2416                if byN > 1e-30: 
     2417                    N /= byN 
     2418            else: 
     2419                return insertDot(strg, mo) 
     2420    else: 
     2421        N = 0. 
     2422    return insertNum(strg, mo, N) 
     2423 
     2424 
     2425def replaceCdisc(strg, mo, node, parent, tree): 
     2426    if tree.classVar.varType != orange.VarTypes.Discrete: 
     2427        return insertDot(strg, mo) 
     2428     
     2429    by, op, cls = mo.group("by", "op", "cls") 
     2430    N = node.distribution[cls] 
     2431    if op == "!=": 
     2432        N = node.distribution.abs - N 
     2433    if by: 
     2434        whom = byWhom(by, parent, tree) 
     2435        if whom and whom.distribution: 
     2436            if whom.distribution[cls] > 1e-30: 
     2437                N /= whom.distribution[cls] 
     2438        else: 
     2439            return insertDot(strg, mo) 
     2440    return insertNum(strg, mo, N) 
     2441 
     2442     
     2443def replacecdisc(strg, mo, node, parent, tree): 
     2444    if tree.classVar.varType != orange.VarTypes.Discrete: 
     2445        return insertDot(strg, mo) 
     2446     
     2447    op, by, cls = mo.group("op", "by", "cls") 
     2448    N = node.distribution[cls] 
     2449    if node.distribution.abs > 1e-30: 
     2450        N /= node.distribution.abs 
     2451        if op == "!=": 
     2452            N = 1 - N 
     2453    if by: 
     2454        whom = byWhom(by, parent, tree) 
     2455        if whom and whom.distribution: 
     2456            if whom.distribution[cls] > 1e-30: 
     2457                N /= whom.distribution[cls] / whom.distribution.abs 
     2458        else: 
     2459            return insertDot(strg, mo) 
     2460    return insertNum(strg, mo, N) 
     2461 
     2462 
     2463import operator 
     2464__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge, "=": operator.eq, "!=": operator.ne} 
     2465 
     2466def replaceCcont(strg, mo, node, parent, tree): 
     2467    if tree.classVar.varType != orange.VarTypes.Continuous: 
     2468        return insertDot(strg, mo) 
     2469     
     2470    by, op, num = mo.group("by", "op", "num") 
     2471    op = __opdict[op] 
     2472    num = float(num) 
     2473    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.) 
     2474    if by: 
     2475        whom = byWhom(by, parent, tree) 
     2476        if whom and whom.distribution: 
     2477            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.) 
     2478            if byN > 1e-30: 
     2479                N /= byN 
     2480        else: 
     2481            return insertDot(strg, mo) 
     2482 
     2483    return insertNum(strg, mo, N) 
     2484     
     2485     
     2486def replaceccont(strg, mo, node, parent, tree): 
     2487    if tree.classVar.varType != orange.VarTypes.Continuous: 
     2488        return insertDot(strg, mo) 
     2489     
     2490    by, op, num = mo.group("by", "op", "num") 
     2491    op = __opdict[op] 
     2492    num = float(num) 
     2493    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.) 
     2494    if node.distribution.abs > 1e-30: 
     2495        N /= node.distribution.abs 
     2496    if by: 
     2497        whom = byWhom(by, parent, tree) 
     2498        if whom and whom.distribution: 
     2499            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.) 
     2500            if byN > 1e-30: 
     2501                N /= byN/whom.distribution.abs # abs > byN, so byN>1e-30 => abs>1e-30 
     2502        else: 
     2503            return insertDot(strg, mo) 
     2504    return insertNum(strg, mo, N) 
     2505 
     2506 
     2507def extractInterval(mo, dist): 
     2508    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin") 
     2509    lower, upper = float(lower), float(upper) 
     2510    if out: 
     2511        lop = lowin == "(" and operator.le or operator.lt 
     2512        hop = upin == ")" and operator.ge or operator.gt 
     2513        return filter(lambda x:lop(x[0], lower) or hop(x[0], upper), dist.items()) 
     2514    else: 
     2515        lop = lowin == "(" and operator.gt or operator.ge 
     2516        hop = upin == ")" and operator.lt or operator.le 
     2517        return filter(lambda x:lop(x[0], lower) and hop(x[0], upper), dist.items()) 
     2518 
     2519     
     2520def replaceCconti(strg, mo, node, parent, tree): 
     2521    if tree.classVar.varType != orange.VarTypes.Continuous: 
     2522        return insertDot(strg, mo) 
     2523 
     2524    by = mo.group("by") 
     2525    N = sum([x[1] for x in extractInterval(mo, node.distribution)]) 
     2526    if by: 
     2527        whom = byWhom(by, parent, tree) 
     2528        if whom and whom.distribution: 
     2529            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)]) 
     2530            if byN > 1e-30: 
     2531                N /= byN 
     2532        else: 
     2533            return insertDot(strg, mo) 
     2534         
     2535    return insertNum(strg, mo, N) 
     2536 
     2537             
     2538def replacecconti(strg, mo, node, parent, tree): 
     2539    if tree.classVar.varType != orange.VarTypes.Continuous: 
     2540        return insertDot(strg, mo) 
     2541 
     2542    N = sum([x[1] for x in extractInterval(mo, node.distribution)]) 
     2543    ab = node.distribution.abs 
     2544    if ab > 1e-30: 
     2545        N /= ab 
     2546 
     2547    by = mo.group("by") 
     2548    if by: 
     2549        whom = byWhom(by, parent, tree) 
     2550        if whom and whom.distribution: 
     2551            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)]) 
     2552            if byN > 1e-30: 
     2553                N /= byN/whom.distribution.abs 
     2554        else: 
     2555            return insertDot(strg, mo) 
     2556         
     2557    return insertNum(strg, mo, N) 
     2558 
     2559     
     2560def replaceD(strg, mo, node, parent, tree): 
     2561    if tree.classVar.varType != orange.VarTypes.Discrete: 
     2562        return insertDot(strg, mo) 
     2563 
     2564    fs, by, m100 = mo.group("fs", "by", "m100") 
     2565    dist = list(node.distribution) 
     2566    if by: 
     2567        whom = byWhom(by, parent, tree) 
     2568        if whom: 
     2569            for i, d in enumerate(whom.distribution): 
     2570                if d > 1e-30: 
     2571                    dist[i] /= d 
     2572        else: 
     2573            return insertDot(strg, mo) 
     2574    mul = m100 and 100 or 1 
     2575    fs = fs or (m100 and ".0" or "5.3") 
     2576    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]") 
     2577 
     2578 
     2579def replaced(strg, mo, node, parent, tree): 
     2580    if tree.classVar.varType != orange.VarTypes.Discrete: 
     2581        return insertDot(strg, mo) 
     2582 
     2583    fs, by, m100 = mo.group("fs", "by", "m100") 
     2584    dist = list(node.distribution) 
     2585    ab = node.distribution.abs 
     2586    if ab > 1e-30: 
     2587        dist = [d/ab for d in dist] 
     2588    if by: 
     2589        whom = byWhom(by, parent, tree) 
     2590        if whom: 
     2591            for i, d in enumerate(whom.distribution): 
     2592                if d > 1e-30: 
     2593                    dist[i] /= d/whom.distribution.abs # abs > d => d>1e-30 => abs>1e-30 
     2594        else: 
     2595            return insertDot(strg, mo) 
     2596    mul = m100 and 100 or 1 
     2597    fs = fs or (m100 and ".0" or "5.3") 
     2598    return insertStr(strg, mo, "["+", ".join(["%%%sf" % fs % (N*mul) for N in dist])+"]") 
     2599 
     2600 
     2601def replaceAE(strg, mo, node, parent, tree): 
     2602    if tree.classVar.varType != orange.VarTypes.Continuous: 
     2603        return insertDot(strg, mo) 
     2604 
     2605    AorE, bysub, by = mo.group("AorE", "bysub", "by") 
     2606     
     2607    if AorE == "A": 
     2608        A = node.distribution.average() 
     2609    else: 
     2610        A = node.distribution.error() 
     2611    if by: 
     2612        whom = byWhom("b"+by, parent, tree) 
     2613        if whom: 
     2614            if AorE == "A": 
     2615                avg = whom.distribution.average() 
     2616            else: 
     2617                avg = whom.distribution.error() 
     2618            if bysub == "b": 
     2619                if avg > 1e-30: 
     2620                    A /= avg 
     2621            else: 
     2622                A -= avg 
     2623        else: 
     2624            return insertDot(strg, mo) 
     2625    return insertNum(strg, mo, A) 
     2626 
     2627 
     2628Z = { 0.75:1.15, 0.80:1.28, 0.85:1.44, 0.90:1.64, 0.95:1.96, 0.99:2.58 } 
     2629 
     2630def replaceI(strg, mo, node, parent, tree): 
     2631    if tree.classVar.varType != orange.VarTypes.Continuous: 
     2632        return insertDot(strg, mo) 
     2633 
     2634    fs = mo.group("fs") or "5.3" 
     2635    intrvl = float(mo.group("intp") or mo.group("intv") or "95")/100. 
     2636    mul = mo.group("m100") and 100 or 1 
     2637 
     2638    if not Z.has_key(intrvl): 
     2639        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (intrvl*100) 
     2640 
     2641    av = node.distribution.average()     
     2642    il = node.distribution.error() * Z[intrvl] 
     2643    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul)) 
     2644 
     2645 
     2646# This class is essentially a collection of functions, merged into a class so that 
     2647# they do not need to pass too many arguments around. It is constructed, used and 
     2648# discarded; it is not meant to store any state. 
     2649class __TreeDumper: 
     2650    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN), (re_M, replaceM), (re_m, replacem), 
     2651                              (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc), 
     2652                              (re_Ccont, replaceCcont), (re_ccont, replaceccont), 
     2653                              (re_Cconti, replaceCconti), (re_cconti, replacecconti), 
     2654                              (re_D, replaceD), (re_d, replaced), 
     2655                              (re_AE, replaceAE), (re_I, replaceI) 
     2656                             ] 
     2657 
     2658    def __init__(self, leafStr, nodeStr, stringFormats, minExamples, maxDepth, simpleFirst, tree, **kw): 
     2659        self.stringFormats = stringFormats 
     2660        self.minExamples = minExamples 
     2661        self.maxDepth = maxDepth 
     2662        self.simpleFirst = simpleFirst 
     2663        self.tree = tree 
     2664        self.__dict__.update(kw) 
     2665 
     2666        if leafStr: 
     2667            self.leafStr = leafStr 
     2668        else: 
     2669            if tree.classVar.varType == orange.VarTypes.Discrete: 
     2670                self.leafStr = "%V (%^.2m%)" 
     2671            else: 
     2672                self.leafStr = "%V" 
     2673 
     2674        if nodeStr == ".": 
     2675            self.nodeStr = self.leafStr 
     2676        else: 
     2677            self.nodeStr = nodeStr 
     2678         
     2679 
     2680    def formatString(self, strg, node, parent): 
     2681        if hasattr(strg, "__call__"): 
     2682            return strg(node, parent, self.tree) 
     2683         
     2684        if not node: 
     2685            return "<null node>" 
     2686         
     2687        for rgx, replacer in self.stringFormats: 
     2688            if not node.distribution: 
     2689                strg = rgx.sub(".", strg) 
     2690            else: 
     2691                strt = 0 
     2692                while True: 
     2693                    mo = rgx.search(strg, strt) 
     2694                    if not mo: 
     2695                        break 
     2696                    strg = replacer(strg, mo, node, parent, self.tree) 
     2697                    strt = mo.start()+1 
     2698                         
     2699        return strg 
     2700         
     2701 
     2702    def showBranch(self, node, parent, lev, i): 
     2703        bdes = node.branchDescriptions[i] 
     2704        bdes = node.branchSelector.classVar.name + (bdes[0] not in "<=>" and "=" or "") + bdes 
     2705        if node.branches[i]: 
     2706            nodedes = self.nodeStr and ": "+self.formatString(self.nodeStr, node.branches[i], node) or "" 
     2707        else: 
     2708            nodedes = "<null node>" 
     2709        return "|    "*lev + bdes + nodedes 
     2710         
     2711         
     2712    def dumpTree0(self, node, parent, lev): 
     2713        if node.branches: 
     2714            if node.distribution.abs < self.minExamples or lev > self.maxDepth: 
     2715                return "|    "*lev + ". . .\n" 
     2716             
     2717            res = "" 
     2718            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr: 
     2719                leafsep = "\n"+("|    "*lev)+"    " 
     2720            else: 
     2721                leafsep = "" 
     2722            if self.simpleFirst: 
     2723                for i, branch in enumerate(node.branches): 
     2724                    if not branch or not branch.branches: 
     2725                        if self.leafStr == self.nodeStr: 
     2726                            res += "%s\n" % self.showBranch(node, parent, lev, i) 
     2727                        else: 
     2728                            res += "%s: %s\n" % (self.showBranch(node, parent, lev, i), 
     2729                                                 leafsep + self.formatString(self.leafStr, branch, node)) 
     2730            for i, branch in enumerate(node.branches): 
     2731                if branch and branch.branches: 
     2732                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i), 
     2733                                       self.dumpTree0(branch, node, lev+1)) 
     2734                elif not self.simpleFirst: 
     2735                    if self.leafStr == self.nodeStr: 
     2736                        res += "%s\n" % self.showBranch(node, parent, lev, i) 
     2737                    else: 
     2738                        res += "%s: %s\n" % (self.showBranch(node, parent, lev, i), 
     2739                                             leafsep+self.formatString(self.leafStr, branch, node)) 
     2740            return res 
     2741        else: 
     2742            return self.formatString(self.leafStr, node, parent) 
     2743 
     2744 
     2745    def dumpTree(self): 
     2746        if self.nodeStr: 
     2747            lev, res = 1, "root: %s\n" % self.formatString(self.nodeStr, self.tree.tree, None) 
     2748            self.maxDepth += 1 
     2749        else: 
     2750            lev, res = 0, "" 
     2751        return res + self.dumpTree0(self.tree.tree, None, lev) 
     2752         
     2753 
     2754    def dotTree0(self, node, parent, internalName): 
     2755        if node.branches: 
     2756            if node.distribution.abs < self.minExamples or len(internalName)-1 > self.maxDepth: 
     2757                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % _quoteName(internalName)) 
     2758                return 
     2759                 
     2760            label = node.branchSelector.classVar.name 
     2761            if self.nodeStr: 
     2762                label += "\\n" + self.formatString(self.nodeStr, node, parent) 
     2763            self.fle.write('%s [ shape=%s label="%s"]\n' % (_quoteName(internalName), self.nodeShape, label)) 
     2764             
     2765            for i, branch in enumerate(node.branches): 
     2766                if branch: 
     2767                    internalBranchName = internalName+chr(i+65) 
     2768                    self.fle.write('%s -> %s [ label="%s" ]\n' % (_quoteName(internalName), _quoteName(internalBranchName), node.branchDescriptions[i])) 
     2769                    self.dotTree0(branch, node, internalBranchName) 
     2770                     
     2771        else: 
     2772            self.fle.write('%s [ shape=%s label="%s"]\n' % (_quoteName(internalName), self.leafShape, self.formatString(self.leafStr, node, parent))) 
     2773 
     2774 
     2775    def dotTree(self, internalName="n"): 
     2776        self.fle.write("digraph G {\n") 
     2777        self.dotTree0(self.tree.tree, None, internalName) 
     2778        self.fle.write("}\n") 
     2779 
     2780def _quoteName(x): 
     2781    return '"%s"' % (base64.b64encode(x)) 
     2782 
     2783def dumpTree(tree, leafStr = "", nodeStr = "", **argkw): 
     2784    return __TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + __TreeDumper.defaultStringFormats, 
     2785                        argkw.get("minExamples", 0), argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), 
     2786                        tree).dumpTree() 
     2787 
     2788 
     2789def printTree(*a, **aa): 
     2790    print dumpTree(*a, **aa) 
     2791 
     2792printTxt = printTree 
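
# A sketch of text output with an explicit leaf format string; the format
# '%V (%^.2m%)' used below is the same default used for discrete classes in
# __TreeDumper above (the import and data names are assumptions):
##import orange, orngTree
##tree = orngTree.TreeLearner(orange.ExampleTable("iris"))
##printTree(tree, leafStr="%V (%^.2m%)")   # class value and majority proportion in percent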
     2793 
     2794 
     2795def dotTree(tree, fileName, leafStr = "", nodeStr = "", leafShape="plaintext", nodeShape="plaintext", **argkw): 
     2796    fle = type(fileName) == str and file(fileName, "wt") or fileName 
     2797 
     2798    __TreeDumper(leafStr, nodeStr, argkw.get("userFormats", []) + __TreeDumper.defaultStringFormats, 
     2799                 argkw.get("minExamples", 0), argkw.get("maxDepth", 1e10), argkw.get("simpleFirst", True), 
     2800                 tree, 
     2801                 leafShape = leafShape, nodeShape = nodeShape, fle = fle).dotTree() 
     2802                         
     2803printDot = dotTree 
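
# A sketch of dot output; the file name and node shapes are arbitrary, and the
# resulting file can be rendered with GraphViz (e.g. dot -Tpng):
##import orange, orngTree
##tree = orngTree.TreeLearner(orange.ExampleTable("iris"))
##dotTree(tree, "iris_tree.dot", nodeShape="ellipse", leafShape="box")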
     2804         
     2805##import orange, orngTree, os 
     2806##os.chdir("c:\\d\\ai\\orange\\doc\\datasets") 
     2807##data = orange.ExampleTable("iris") 
     2808###data = orange.ExampleTable("housing") 
     2809##tree = orngTree.TreeLearner(data) 
     2810##printTxt(tree) 
     2811###print printTree(tree, '%V %4.2NbP %.3C!="Iris-virginica"') 
     2812###print printTree(tree, '%A %I(95) %C![20,22)bP', ".", maxDepth=3) 
     2813###dotTree("c:\\d\\ai\\orange\\x.dot", tree, '%A', maxDepth= 3) 
  • orange/orngC45.py

    r6538 r7359  
    1 def showBranch(node, classvar, lev, i): 
    2     var = node.tested 
    3     if node.nodeType == 1: 
    4         print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]), 
    5         printTree0(node.branch[i], classvar, lev+1) 
    6     elif node.nodeType == 2: 
    7         print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut), 
    8         printTree0(node.branch[i], classvar, lev+1) 
    9     else: 
    10         inset = filter(lambda a:a[1]==i, enumerate(node.mapping)) 
    11         inset = [var.values[j[0]] for j in inset] 
    12         if len(inset)==1: 
    13             print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]), 
    14         else: 
    15             print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)), 
    16         printTree0(node.branch[i], classvar, lev+1) 
    17          
    18          
    19 def printTree0(node, classvar, lev): 
    20     var = node.tested 
    21     if node.nodeType == 0: 
    22         print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items), 
    23     else: 
    24         for i, branch in enumerate(node.branch): 
    25             if not branch.nodeType: 
    26                 showBranch(node, classvar, lev, i) 
    27         for i, branch in enumerate(node.branch): 
    28             if branch.nodeType: 
    29                 showBranch(node, classvar, lev, i) 
    30  
    31 def printTree(tree): 
    32     printTree0(tree.tree, tree.classVar, 0) 
    33     print 
     1from Orange.classification.tree import c45_printTree as printTree 