Changeset 9024:526fbdb71e6b in orange


Timestamp:
09/26/11 23:36:00 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
83a7fd96a7685eafed99a924eae45b9503db7114
Message:

Orange.classification.tree documentation updates.

File:
1 edited

  • orange/Orange/classification/tree.py

    r9023 r9024  
    2222for introduction to classification trees. 
    2323 
    24 This page first describes the learner and the classifier, and then 
    25 the individual components of the trees and the 
    26 tree-building process. 
     24====================== 
     25Learner and Classifier 
     26====================== 
    2727 
    2828.. autoclass:: TreeLearner 
     
    187187          : hypermetrope --> none (<2.000, 0.000, 1.000>) 
    188188 
    189 We conclude the tree structure examples with a simple pruning  
     189The tree structure examples conclude with a simple pruning  
    190190function, written entirely in Python and unrelated to any :obj:`Pruner`. It  
    191191limits the maximal tree depth (the number of internal nodes on any path 
     
    510510particular branch or simply skip the instance. 
    511511 
    512  
    513512Some enhanced splitters can split instances. An instance (actually, 
    514513a pointer to it) is copied to more than one subset. To facilitate 
     
    526525for splitting - no weight ID's are returned. 
    527526 
    528  
    529527.. class:: Splitter 
    530528 
     
    601599Descenders 
    602600============================= 
    603  
    604601 
    605602Descenders decide the where should the instances that cannot be 
     
    769766================= 
    770767 
    771 The included printing functions can print out practically anything you'd 
    772 like to know, from the number of instances, proportion of instances of 
    773 majority class in nodes and similar, to more complex statistics like the 
    774 proportion of instances in a particular class divided by the proportion 
    775 of examples of this class in a parent node. And even more, you can define 
    776 your own callback functions to be used for printing. 
    777  
    778 Before we go on: you can read all about the function and use it to its 
    779 full extent, or you can just call it, giving it the tree as the sole 
    780 argument and it will print out the usual textual representation of the 
    781 tree. If you're satisfied with that, you can stop here. 
    782  
    783 The magic is in the format string. It is a string which is printed 
    784 out at every leaf or internal node with the certain format specifiers 
    785 replaced by data from the tree node. Specifiers are generally of form 
    786 **%[^]<precision><quantity><divisor>**. 
    787  
    788 **^** at the start tells that the number should be multiplied by 100. 
    789 It's useful for printing proportions like percentages. 
    790  
    791 **<precision>** is in the same format as in Python (or C) string 
    792 formatting. For instance, :samp:`%N` denotes the number of examples in 
    793 the node, hence :samp:`%6.2N` would mean output to two decimal digits 
    794 and six places altogether. If left out, a default format :samp:`5.3` is 
    795 used, unless you multiply the numbers by 100, in which case the default 
    796 is :samp:`.0` (no decimals, the number is rounded to the nearest integer). 
    797  
    798 **<divisor>** tells what to divide the quantity in that node with. 
    799 :samp:`bP` means division by the same quantity in the parent node; for instance, 
    800 :samp:`%NbP` will tell give the number of examples in the node divided by the 
    801 number of examples in parent node. You can add use precision formatting, 
    802 e.g. :samp:`%6.2NbP.` bA is division by the same quantity over the entire 
    803 data set, so :samp:`%NbA` will tell you the proportion of examples (out 
    804 of the entire training data set) that fell into that node. If division is 
    805 impossible since the parent node does not exist or some data is missing, 
    806 a dot is printed out instead of the quantity. 
    807  
    808 **<quantity>** is the only required element. It defines what to print. 
    809 For instance, :samp:`%N` would print out the number of examples in the node. 
    810 Possible quantities are 
    811  
    812 :samp:`V` 
    813     The value predicted at that node. You cannot define the precision  
    814     or divisor here. 
    815  
    816 :samp:`N` 
    817     The number of examples in the node. 
    818  
    819 :samp:`M` 
    820     The number of examples in the majority class (that is, the class  
    821     predicted by the node). 
    822  
    823 :samp:`m` 
    824     The proportion of examples in the majority class. 
    825  
    826 :samp:`A` 
    827     The average class for examples the node; this is available only for  
    828     regression trees. 
    829  
    830 :samp:`E` 
    831     Standard error for class of examples in the node; available for 
    832     regression trees. 
    833  
    834 :samp:`I` 
    835     Print out the confidence interval. The modifier is used as  
    836     :samp:`%I(95)` of (more complicated) :samp:`%5.3I(95)bP`. 
    837  
    838 :samp:`C` 
    839     The number of examples in the given class. For classification trees,  
    840     this modifier is used as, for instance in, :samp:`%5.3C="Iris-virginica"bP`  
    841     - this will tell the number of examples of Iris-virginica by the  
    842     number of examples this class in the parent node. If you are  
    843     interested in examples that are *not* Iris-virginica, say 
    844     :samp:`%5.3CbP!="Iris-virginica"` 
    845  
    846     For regression trees, you can use operators =, !=, <, <=, >, and >=,  
    847     as in :samp:`%C<22` - add the precision and divisor if you will. You 
    848     can also check the number of examples in a certain interval: 
    849     :samp:`%C[20, 22]` will give you the number of examples between 20 
    850     and 22 (inclusive) and :samp:`%C(20, 22)` will give the number of 
    851     such examples excluding the boundaries. You can of course mix the 
    852     parentheses, e.g. :samp:`%C(20, 22]`.  If you would like the examples 
    853     outside the interval, add a :samp:`!`, like :samp:`%C!(20, 22]`. 
    854  
    855 :samp:`c` 
    856     Same as above, except that it computes the proportion of the class 
    857     instead of the number of examples. 
    858  
    859 :samp:`D` 
    860     Prints out the number of examples in each class. You can use both, 
    861     precision (it is applied to each number in the distribution) or the 
    862     divisor. This quantity cannot be computed for regression trees. 
    863  
    864 :samp:`d` 
    865     Same as above, except that it shows proportions of examples. This 
    866     again doesn't work with regression trees. 
    867  
    868 <user defined formats> 
    869     You can add more, if you will. Instructions and examples are given at 
    870     the end of this section. 
    871  
    872 .. rubric:: Examples 
    873  
    874 We shall build a small tree from the iris data set - we shall limit the 
    875 depth to three levels: 
    876  
    877 .. literalinclude:: code/orngTree1.py 
    878    :lines: 1-4 
    879  
    880 .. _orngTree1.py: code/orngTree1.py 
    881  
    882 The easiest way to call the function is to pass the tree as the only  
    883 argument:: 
     768The tree printing functions are very flexible. They can print 
     769out practically anything, from the number of instances, proportion 
     770of instances of majority class in nodes and similar, to more complex 
     771statistics like the proportion of instances in a particular class divided 
     772by the proportion of instances of this class in a parent node. Users 
     773may also pass their own functions to print certain elements. 
     774 
     775The easiest way to print the tree is to call :func:`TreeClassifier.dump` 
     776without any arguments:: 
    884777 
    885778    >>> print tree.dump() 
     
    893786    |    |    petal length>=4.850: Iris-virginica (100.00%) 
    894787 
     788 
     789The printout can be customized with the format string, which is printed 
     790out at every leaf or internal node with the certain format specifiers 
     791replaced by data from the tree node. Specifiers are generally of form 
     792**%[^]<precision><quantity><divisor>**. 
     793 
     794**^** at the start tells that the number should be multiplied by 100, 
     795which is useful for proportions like percentages. 
     796 
     797**<precision>** is in the same format as in Python (or C) string 
     798formatting. For instance, ``%N`` denotes the number of instances in 
     799the node, hence ``%6.2N`` would mean output to two decimal digits 
     800and six places altogether. If left out, a default format ``5.3`` is 
     801used, unless the numbers are multiplied by 100, in which case the default 
     802is ``.0`` (no decimals, the number is rounded to the nearest integer). 
     803 
     804**<divisor>** tells what to divide the quantity in that node with. 
     805``bP`` means division by the same quantity in the parent node; for instance, 
     806``%NbP`` gives the number of instances in the node divided by the 
     807number of instances in parent node. Precision formatting can be added, 
     808e.g. ``%6.2NbP``. ``bA`` denotes division by the same quantity over the entire 
     809data set, so ``%NbA`` will tell you the proportion of instances (out 
     810of the entire training data set) that fell into that node. If division is 
     811impossible since the parent node does not exist or some data is missing, 
     812a dot is printed out instead. 
     813 
     814**<quantity>** is the only required element. It defines what to print. 
     815For instance, ``%N`` would print out the number of instances in the node. 
     816Possible quantities are 
     817 
     818``V`` 
     819    The value predicted at that node. You cannot define the precision  
     820    or divisor here. 
     821 
     822``N`` 
     823    The number of instances in the node. 
     824 
     825``M`` 
     826    The number of instances in the majority class (that is, the class  
     827    predicted by the node). 
     828 
     829``m`` 
     830    The proportion of instances in the majority class. 
     831 
     832``A`` 
     833    The average class for instances in the node; this is available only for  
     834    regression trees. 
     835 
     836``E`` 
     837    Standard error for class of instances in the node; available for 
     838    regression trees. 
     839 
     840``I`` 
     841    Print out the confidence interval. The modifier is used as  
     842    ``%I(95)`` or (more complicated) ``%5.3I(95)bP``. 
     843 
     844``C`` 
     845    The number of instances in the given class. For classification trees,  
     846    this modifier is used as, for instance, ``%5.3C="Iris-virginica"bP``  
     847    - this will give the number of instances of Iris-virginica divided by the  
     848    number of instances of this class in the parent node. If you are  
     849    interested in instances that are *not* Iris-virginica, say 
     850    ``%5.3CbP!="Iris-virginica"`` 
     851 
     852    For regression trees, you can use operators =, !=, <, <=, >, and >=,  
     853    as in ``%C<22`` - add the precision and divisor if you will. You 
     854    can also check the number of instances in a certain interval: 
     855    ``%C[20, 22]`` will give you the number of instances between 20 
     856    and 22 (inclusive) and ``%C(20, 22)`` will give the number of 
     857    such instances excluding the boundaries. You can of course mix the 
     858    parentheses, e.g. ``%C(20, 22]``.  If you would like the instances 
     859    outside the interval, add a ``!``, like ``%C!(20, 22]``. 
     860 
     861``c`` 
     862    Same as ``C``, except that it computes the proportion of the class 
     863    instead of the number of instances. 
     864 
     865``D`` 
     866    Prints out the number of instances in each class. You can use both 
     867    precision (it is applied to each number in the distribution) and the 
     868    divisor. This quantity cannot be computed for regression trees. 
     869 
     870``d`` 
     871    Same as ``D``, except that it shows proportions of instances. This 
     872    again doesn't work with regression trees. 
     873 
     874<user defined formats> 
     875    You can add more, if you will. Instructions and examples are given at 
     876    the end of this section. 
     877 
     878.. rubric:: Examples 
     879 
     880We shall build a small tree from the iris data set - we shall limit the 
     881depth to three levels: 
     882 
     883.. literalinclude:: code/orngTree1.py 
     884   :lines: 1-4 
     885 
     886.. _orngTree1.py: code/orngTree1.py 
     887 
    895888Let's now print out the predicted class at each node, the number 
    896 of examples in the majority class with the total number of examples in 
     889of instances in the majority class with the total number of instances in 
    897890the node:: 
    898891 
     
    907900    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000) 
    908901 
    909 Would you like to know how the number of examples declines as 
     902Would you like to know how the number of instances declines as 
    910903compared to the entire data set and to the parent node? We find it 
    911904with this:: 
     
    921914    |    |    petal length>=4.850: Iris-virginica (86%, 96%) 
    922915 
    923 Let us first read the format string. :samp:`%M` is the number of  
    924 examples in the majority class. We want it divided by the number of 
    925 all examples from this class on the entire data set, hence :samp:`%MbA`. 
    926 To have it multipied by 100, we say :samp:`%^MbA`. The percent sign 
     916Let us first read the format string. ``%M`` is the number of  
     917instances in the majority class. We want it divided by the number of 
     918all instances from this class on the entire data set, hence ``%MbA``. 
     919To have it multiplied by 100, we say ``%^MbA``. The percent sign 
    927920*after* that is just printed out literally, just as the comma and 
    928921parentheses (see the output). The string for showing the proportion 
    929 of this class in the parent is the same except that we have :samp:`bP` 
    930 instead of :samp:`bA`. 
    931  
    932 And now for the output: all examples of setosa for into the first node. 
     922of this class in the parent is the same except that we have ``bP`` 
     923instead of ``bA``. 
     924 
     925And now for the output: all instances of setosa fall into the first node. 
    933926For versicolor, we have 98% in one node; the rest is certainly 
    934927not in the neighbouring node (petal length>=5.350) since all versicolors 
     
    940933If you find this guesswork annoying - so do I. Let us print out the 
    941934number of versicolors in each node, together with the proportion of 
    942 versicolors among the examples in this particular node and among all 
     935versicolors among the instances in this particular node and among all 
    943936versicolors. So, 
    944937 
     
    959952 
    960953Finally, we may want to print out the distributions, using a simple  
    961 string :samp:`%D`:: 
     954string ``%D``:: 
    962955 
    963956    petal width<0.800: [50.000, 0.000, 0.000] 
     
    971964 
    972965What is the order of numbers here? If you check  
    973 :samp:`data.domain.class_var.values` , you'll learn that the order is setosa,  
     966``data.domain.class_var.values`` , you'll learn that the order is setosa,  
    974967versicolor, virginica; so in the node at petal length<5.350 we have 49 
    975968versicolors and 3 virginicae. To print out the proportions, we can use 
    976 :samp:`%.2d` - this gives us proportions within node, rounded on two 
     969``%.2d`` - this gives us proportions within node, rounded to two 
    977970decimals:: 
    978971 
     
    10181011    print tree.dump(leaf_str='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', node_str='.') 
    10191012 
    1020 Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is  
    1021 the number of examples from class virginica, divided by the total number 
    1022 of examples in this class. Add :samp:`^.1` and the result will be 
    1023 multiplied and printed with one decimal. The trailing :samp:`%` is printed 
     1013Let's first interpret the format string: ``CbA="Iris-virginica"`` is  
     1014the number of instances from class virginica, divided by the total number 
     1015of instances in this class. Add ``^.1`` and the result will be 
     1016multiplied and printed with one decimal. The trailing ``%`` is printed 
    10241017out. In parentheses we print the same thing except that we divide by 
    1025 the examples in the parent node. Note the use of single quotes, so we 
     1018the instances in the parent node. Note the use of single quotes, so we 
    10261019can use the double quotes inside the string, when we specify the class. 
    10271020 
     
    10401033See what's in the parentheses in the root node? If :meth:`~TreeClassifier.dump` 
    10411034cannot compute something (in this case it's because the root has no parent), 
    1042 it prints out a dot. You can also eplace :samp:`=` by :samp:`!=` and it 
     1035it prints out a dot. You can also replace ``=`` by ``!=`` and it 
    10431036will count all classes *except* virginica. 
    10441037 
    1045 For one final example with classification trees, we shall print the 
     1038For one final example with classification trees, we shall print the 
    10461039distributions in the nodes, the distribution compared to the parent 
    10471040and the proportions compared to the parent (the latter things are not 
     
    11141107    |    |    |        [SE: 0.000]   21.9 [21.900-21.900] 
    11151108 
    1116 What's the difference between :samp:`%V`, the predicted value and  
    1117 :samp:`%A` the average? Doesn't a regression tree always predict the 
     1109What's the difference between ``%V``, the predicted value and  
     1110``%A`` the average? Doesn't a regression tree always predict the 
    11181111leaf average anyway? Not necessarily, the tree predicts whatever the 
    11191112:obj:`~Node.node_classifier` in a leaf returns.  
    1120 As :samp:`%V` uses the :obj:`Orange.data.variable.Continuous`' function 
     1113As ``%V`` uses the :obj:`Orange.data.variable.Continuous`' function 
    11211114for printing out the value, the printed number has the same 
    11221115number of decimals as in the data file. 
     
    11241117Regression trees cannot print the distributions in the same way 
    11251118as classification trees. They instead offer a set of operators for 
    1126 observing the number of examples within a certain range. For instance, 
    1127 let us check the number of examples with values below 22, and compare 
     1119observing the number of instances within a certain range. For instance, 
     1120let us check the number of instances with values below 22, and compare 
    11281121this number with values in the parent nodes:: 
    11291122 
     
    11451138    |    |    |    TAX>=534.500: 1.000 (30.000)</xmp> 
    11461139 
    1147 The last line, for instance, says the the number of examples with the 
     1140The last line, for instance, says that the number of instances with the 
    11481141class below 22 among those with tax above 534 is 30 times higher than 
    1149 the number of such examples in its parent node. 
    1150  
    1151 For another exercise, let's count the same for all examples *outside* 
     1142the number of such instances in its parent node. 
     1143 
     1144For another exercise, let's count the same for all instances *outside* 
    11521145interval [20, 22] (given like this, the interval includes the bounds). 
    11531146And let us print out the proportions as percents. 
     
    11571150    >>> print tree.dump(leaf_str="%C![20,22] (%^cbP![20,22]%)", node_str=".") 
    11581151 
    1159 OK, let's observe the format string for one last time. :samp:`%c![20, 
    1160 22]` would be the proportion of examples (within the node) whose values 
    1161 are below 20 or above 22. By :samp:`%cbP![20, 22]` we derive this by 
    1162 the same statistics computed on the parent. Add a :samp:`^` and you have 
     1152OK, let's observe the format string for one last time. ``%c![20, 22]``  
     1153would be the proportion of instances (within the node) whose values 
     1154are below 20 or above 22. By ``%cbP![20, 22]`` we divide this by 
     1155the same statistics computed on the parent. Add a ``^`` and you have 
    11631156the percentages. 
    11641157 
     
    11941187 
    11951188The regular expression should describe a string like those we used above, 
    1196 for instance the string :samp:`%.2DbP`. When a leaf or internal node 
     1189for instance the string ``%.2DbP``. When a leaf or internal node 
    11971190is printed out, the format string (:obj:`leaf_str` or :obj:`node_str`) 
    11981191is checked for these regular expressions and when the match is found, 
     
    12251218.. autodata:: by 
    12261219 
    1227 For a trivial example, :samp:`%V` is implemented like this. There is the 
     1220For a trivial example, ``%V`` is implemented like this. There is the 
    12281221following tuple in the list of built-in formats:: 
    12291222 
     
    12361229 
    12371230It therefore takes the value predicted at the node 
    1238 (:samp:`node.node_classifier.default_value` ), converts it to a string 
     1231(``node.node_classifier.default_value`` ), converts it to a string 
    12391232and passes it to *insert_str* to do the replacement. 
    12401233 
    12411234A more complex regular expression is the one for the proportion of 
    1242 majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the two partial 
     1235majority class, defined as ``"%"+fs+"M"+by``. It uses the two partial 
    12431236expressions defined above. 
    12441237 
     
    13031296:class:`C45Learner` and :class:`C45Classifier` behave 
    13041297like any other Orange learner and classifier. Unlike most of Orange  
    1305 learning algorithms, C4.5 does not accepts weighted examples. 
     1298learning algorithms, C4.5 does not accept weighted instances. 
    13061299 
    13071300Building the C4.5 plug-in 
     
    13261319#. Run buildC45.py, which will build the plug-in and put it next to  
    13271320   orange.pyd (or orange.so on Linux/Mac). 
    1328 #. Run python, type :samp:`import Orange` and 
    1329    create :samp:`Orange.classification.tree.C45Learner()`. This should 
     1321#. Run python, type ``import Orange`` and 
     1322   create ``Orange.classification.tree.C45Learner()``. This should 
    13301323   succeed. 
    13311324#. Finally, you can remove C4.5 sources. 
     
    13531346        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2), 
    13541347        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches" 
    1355         split examples based on values of a discrete attribute, 
     1348        split instances based on values of a discrete attribute, 
    13561349        "cuts" cut them according to a threshold value of a continuous 
    13571350        attribute and "subsets" use discrete attributes but with subsetting 
     
    13651358    .. attribute:: items 
    13661359 
    1367         Number of (learning) examples in the node. 
     1360        Number of (learning) instances in the node. 
    13681361 
    13691362    .. attribute:: class_dist 
     
    13861379    .. attribute:: mapping 
    13871380 
    1388         Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]` 
    1389         gives the index for an example whose value of :obj:`tested` is *i*.  
     1381        Mapping for nodes of type :obj:`Subset`. Element ``mapping[i]`` 
     1382        gives the index for an instance whose value of :obj:`tested` is *i*.  
    13901383        Here, *i* denotes an index of value, not a :class:`Orange.data.Value`. 
    13911384 
     
    15741567    .. attribute:: min_objs (m) 
    15751568         
    1576         Minimal number of objects (examples) in leaves (default: 2). 
     1569        Minimal number of objects (instances) in leaves (default: 2). 
    15771570 
    15781571    .. attribute:: window (w) 
     
    19151908 
    19161909        So, to allow splitting only when gain ratio (the default measure) 
    1917         is greater than 0.6, set :samp:`worst_acceptable=0.6`. 
     1910        is greater than 0.6, set ``worst_acceptable=0.6``. 
    19181911 
    19191912    .. attribute:: min_subset 
     
    19691962 
    19701963        Determines whether to store class distributions, 
    1971         contingencies and examples in :class:`Node`, and whether the 
     1964        contingencies and instances in :class:`Node`, and whether the 
    19721965        :obj:`Node.node_classifier` should be built for internal nodes 
    19731966        also (it is needed by the :obj:`Descender` or for pruning). 
     
    19781971 
    19791972    """ 
    1980     def __new__(cls, examples = None, weightID = 0, **argkw): 
     1973    def __new__(cls, data=None, weightID=0, **argkw): 
    19811974        self = Orange.core.Learner.__new__(cls, **argkw) 
    1982         if examples: 
     1975        if data: 
    19831976            self.__init__(**argkw) 
    1984             return self.__call__(examples, weightID) 
     1977            return self.__call__(data, weightID) 
    19851978        else: 
    19861979            return self 
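
The ``__new__`` shown in this hunk either returns the learner itself or, when data is passed, trains immediately and returns the resulting classifier. A minimal standalone sketch of that dispatch follows. It assumes Python 3 and plain ``object`` in place of ``Orange.core.Learner``; ``SketchLearner`` and its majority-label "training" are hypothetical stand-ins, not Orange code.

```python
class SketchLearner(object):
    """Hypothetical learner showing construct-or-train dispatch in __new__."""

    def __new__(cls, data=None, weightID=0, **argkw):
        self = object.__new__(cls)
        if data:
            # Data was given: initialise, train, and hand back the model.
            self.__init__(**argkw)
            return self.__call__(data, weightID)
        else:
            # No data: return the (not yet trained) learner itself.
            return self

    def __init__(self, data=None, weightID=0, **argkw):
        # Keep the keyword options, as a real learner would keep its settings.
        self.options = argkw

    def __call__(self, data, weightID=0):
        # Stand-in for training: the "model" is just the majority label.
        labels = list(data)
        return max(set(labels), key=labels.count)


# Without data we get a learner that can be called later.
learner = SketchLearner(max_depth=3)
model_later = learner(["x", "y", "y"])

# With data the constructor call returns the trained "model" directly.
model_now = SketchLearner(["a", "b", "a"])
```

Because ``__new__`` returns something that is not an instance of the class in the second case, Python does not call ``__init__`` again on the result.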
     
    21782171 
    21792172fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)" 
    2180 """ Defines the multiplier by 100 (:samp:`^`) and the format 
    2181 for the number of decimals (e.g. :samp:`5.3`). The corresponding  
    2182 groups are named :samp:`m100` and :samp:`fs`. """ 
     2173""" Defines the multiplier by 100 (``^``) and the format 
     2174for the number of decimals (e.g. ``5.3``). The corresponding  
     2175groups are named ``m100`` and ``fs``. """ 
    21832176 
    21842177by = r"(?P<by>(b(P|A)))?" 
     
    22242217    it by 100, if needed, and prints with the right number of  
    22252218    places and decimals. It does so by checking the mo 
    2226     for a group named m100 (representing the :samp:`^` in the format string)  
     2219    for a group named m100 (representing the ``^`` in the format string)  
    22272220    and a group named fs representing the part giving the number of 
    2228     decimals (e.g. :samp:`5.3`). 
     2221    decimals (e.g. ``5.3``). 
    22292222    """ 
    22302223    grps = mo.groupdict() 
     
    22372230def by_whom(by, parent, tree): 
    22382231    """ If by equals bp, return parent, else return 
    2239     :samp:`tree.tree`. This is used to find what to divide the quantity  
     2232    ``tree.tree``. This is used to find what to divide the quantity  
    22402233    with, when division is required. 
    22412234    """ 
     
    27662759        :arg node_str: The format string for the internal nodes. 
    27672760          If left empty (as it is by default), no data is printed out for 
    2768           internal nodes. If set to :samp:`"."`, the same string is 
     2761          internal nodes. If set to ``"."``, the same string is 
    27692762          used as for leaves. 
    27702763        :type node_str: string 
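
The format-specifier grammar documented in this changeset can be tried outside Orange. The sketch below copies the ``fs`` and ``by`` patterns from the diff, joins them into the ``"%"+fs+"M"+by`` expression mentioned in the documentation, and splits a specifier into its parts. It assumes Python 3; ``describe`` is a hypothetical helper written for this illustration, and the default precision it fills in follows the prose above (``5.3``, or ``.0`` when ``^`` is present).

```python
import re

# Partial patterns copied from the diff of tree.py.
fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
by = r"(?P<by>(b(P|A)))?"

# Pattern for the majority-class quantity, built as "%"+fs+"M"+by.
re_M = re.compile("%" + fs + "M" + by)


def describe(spec):
    """Hypothetical helper: split a %...M... specifier into named parts."""
    mo = re_M.search(spec)
    if mo is None:
        return None
    grps = mo.groupdict()
    times_100 = grps["m100"] == "^"
    # Documented defaults: "5.3", or ".0" when the value is multiplied by 100.
    precision = grps["fs"] or (".0" if times_100 else "5.3")
    return {
        "times_100": times_100,
        "precision": precision,
        "divisor": grps["by"],  # "bP", "bA", or None when no divisor is given
    }


print(describe("%^.1MbA"))
print(describe("%M"))
```

A specifier for a different quantity, such as ``%6.2NbP``, does not match this particular pattern, so ``describe`` returns ``None`` for it.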