source: orange/orange/doc/reference/C45Learner.htm @ 8118:965942355747

Revision 8118:965942355747, 10.9 KB, checked in by mitar, 3 years ago
<html> <HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD> <body>

<index name="classifiers+c45">
<h1>C4.5 Classifier and Learner</h1>

<P>C4.5 is a standard benchmark in machine learning. For this
reason, it is incorporated in Orange, although Orange has its own
implementation of decision trees.</P>

<P><CODE>C45Learner</CODE> uses Quinlan's original code for learning, so the tree you get is exactly like the one that would be built by the standalone C4.5. Upon return, however, the original tree is copied into Orange components that contain exactly the same information, plus what is needed to make them accessible from Python. To ensure that the algorithm behaves just as the original, we use a dedicated class, <CODE>orange.C45TreeNode</CODE>, instead of reusing the components used by Orange's tree inducer (i.e., <CODE>orange.TreeNode</CODE>). This, however, could be done and probably will be done in the future; we shall still retain <CODE>orange.C45TreeNode</CODE> but offer a transformation to <CODE>orange.TreeNode</CODE>, so that routines that work on Orange trees will also be usable for C4.5 trees.</P>

<P><CODE>C45Learner</CODE> and <CODE>C45Classifier</CODE> behave
like any other Orange learner and classifier. Unlike most Orange learning algorithms, however, C4.5 does not accept weighted examples.</P>


<H2>Building the C4.5 plug-in</H2>

<P>We haven't been able to obtain the legal rights to distribute
C4.5 and therefore couldn't statically link it into Orange. Instead, it's incorporated as a plug-in which you'll need to build yourself. The procedure is trivial, but you'll need a C compiler. On Windows, the scripts we provide work with MS Visual C; the files CL.EXE and LINK.EXE must be on the PATH. On Linux you're already equipped with gcc. Mac OS X comes without gcc, but you can download it for free from Apple.</P>

<P>Orange must be installed prior to building C4.5. (This is because the build script will copy the created file next to Orange, which it obviously can't do if Orange isn't there yet.)</P>

<OL>
<LI>Download the <A href="http://www.rulequest.com/Personal/c4.5r8.tar.gz">C4.5 (Release 8) sources</a> from the <a
href="http://www.rulequest.com/">RuleQuest site</a> and extract them into some temporary directory. The files will be modified in the process, so don't use a copy of Quinlan's sources that you need for another purpose.</LI>

<LI>Download <a href="/download/buildC45.zip">buildC45.zip</a> and unzip its contents into the directory R8/Src of Quinlan's sources (the directory that contains, for instance, the file average.c).</LI>

<LI>Run buildC45.py, which will build the plug-in and put it next to orange.pyd (or orange.so on Linux/Mac).</LI>

<LI>Run python, import orange and create <CODE>orange.C45Learner()</CODE>. If this fails, something went wrong; see the diagnostic messages from buildC45.py and read the paragraph below.</LI>

<LI>Finally, you can remove Quinlan's sources, along with everything created by buildC45.py.</LI>
</OL>

<P><B>If the process fails</B>, here's what buildC45.py really does: it creates .h files that wrap Quinlan's .i files and ensure that they are not included twice, and it modifies the C4.5 sources to include the .h's instead of the .i's. This step can hardly fail. Then follows the platform-dependent step, which compiles ensemble.c (which includes all the Quinlan's .c files it needs) into c45.dll (or c45.so) and puts it next to Orange. If this step fails but you do have a C compiler and linker, and you know how to use them, you can compile ensemble.c into a dynamic library yourself; see the compile and link steps in buildC45.py. In any case, after doing this, check that the built C4.5 gives the same results as the original.</P>

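<P>The wrapping step can be illustrated with a short Python sketch. This is only a hypothetical illustration of the include-guard idea described above, not the actual code from buildC45.py; the function name and the guard naming scheme are invented for the example.</P>

```python
import os

def wrap_i_file(filename):
    # Build the text of a .h wrapper that includes a Quinlan .i file
    # exactly once, using a conventional include guard.
    base = os.path.splitext(os.path.basename(filename))[0]
    guard = "C45_%s_I_H" % base.upper()
    return ("#ifndef %s\n"
            "#define %s\n"
            "#include \"%s.i\"\n"
            "#endif\n") % (guard, guard, base)

print(wrap_i_file("average.i"))
```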
<H2>C45Learner</H2>
<index name="classes/C45Learner">

<P><CODE>C45Learner</CODE>'s attributes have two names each: the one you know from the C4.5 command line and the corresponding name of C4.5's internal variable. All defaults are set as in C4.5; if you change nothing, you are running C4.5.</P>

<P class=section>Attributes</P>
<DL class=attributes>
 <DT>gainRatio (g)</DT>
 <DD>Determines whether to use information gain (<CODE>false</CODE>, default) or gain ratio
 for selection of attributes (<CODE>true</CODE>)</DD>

 <DT>batch (b)</DT>
 <DD>Turns on batch mode (no windows, no iterations); this option is not documented in the C4.5 manuals. It conflicts with "window", "increment" and "trials".</DD>

 <DT>subset (s)</DT>
 <DD>Enables subsetting (default: <CODE>false</CODE>, no subsetting)</DD>

 <DT>probThresh (p)</DT>
 <DD>Probabilistic threshold for continuous attributes (default: <CODE>false</CODE>)</DD>

 <DT>minObjs (m)</DT>
 <DD>Minimal number of objects (examples) in leaves (default: 2)</DD>

 <DT>window (w)</DT>
 <DD>Initial window size (default: the maximum of 20% and twice the square root of the number of data objects)</DD>

 <DT>increment (i)</DT>
 <DD>The maximum number of objects that can be added to the window at each iteration (default: 20% of the initial window size)</DD>

 <DT>cf (c)</DT>
 <DD>Pruning confidence level (default: 25%)</DD>

 <DT>trials (t)</DT>
 <DD>The number of trials in iterative (i.e. non-batch) mode (default: 10)</DD>

 <DT>prune</DT>
 <DD>Return a pruned tree (not an original C4.5 option) (default: <CODE>true</CODE>)</DD>
</DL>

<P><CODE>C45Learner</CODE> also offers another way of setting
the arguments: the method <CODE>commandline</CODE>,
which is given a string and parses it the same way C4.5 would
parse its command line.</P>
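<P>To give an idea of what such parsing amounts to, here is a hedged sketch in plain Python of how a C4.5-style option string could be mapped onto the attribute names from the table above. This is an illustration only, not the actual implementation of <CODE>commandline</CODE>, and it covers only some of the switches.</P>

```python
def parse_c45_options(cmdline):
    # Switches that take a numeric argument, mapped to attribute names.
    takes_value = {"-m": "minObjs", "-w": "window", "-i": "increment",
                   "-c": "cf", "-t": "trials"}
    # Boolean switches.
    flags = {"-g": "gainRatio", "-b": "batch",
             "-s": "subset", "-p": "probThresh"}
    opts = {}
    tokens = cmdline.split()
    while tokens:
        tok = tokens.pop(0)
        if tok in takes_value:
            arg = tokens.pop(0)
            opts[takes_value[tok]] = float(arg) if "." in arg else int(arg)
        elif tok in flags:
            opts[flags[tok]] = True
        else:
            raise ValueError("unknown switch: %r" % tok)
    return opts

print(parse_c45_options("-m 1 -s"))  # {'minObjs': 1, 'subset': True}
```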


<H2>C45Classifier</H2>
<index name="classes/C45Classifier">

<P><CODE>C45Classifier</CODE> contains a faithful reimplementation of Quinlan's classification function from C4.5. The only difference (and the only reason it had to be rewritten) is that it uses a tree composed of <CODE>orange.C45TreeNode</CODE>s instead of C4.5's original tree structure.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>tree</DT>
<DD>C4.5 tree stored as a tree of <CODE>C45TreeNode</CODE>s.</DD>
</DL>


<H2>C45TreeNode</H2>
<index name="classes/C45TreeNode">

<P>This class is a reimplementation of the corresponding <CODE>struct</CODE> from Quinlan's C4.5 code.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>nodeType</DT>
<DD>Type of the node: <CODE>C45TreeNode.Leaf</CODE> (0), <CODE>C45TreeNode.Branch</CODE> (1), <CODE>C45TreeNode.Cut</CODE> (2) or <CODE>C45TreeNode.Subset</CODE> (3). "Leaves" are leaves, "branches" split examples based on the values of a discrete attribute, "cuts" split them according to a threshold on a continuous attribute, and "subsets" use a discrete attribute but with subsetting, so that several values can go into the same branch.</DD>

<DT>leaf</DT>
<DD>The <CODE>Value</CODE> returned by that leaf. The field is defined for internal nodes as well.</DD>

<DT>items</DT>
<DD>Number of (learning) examples in the node.</DD>

<DT>classDist</DT>
<DD>Class distribution for the node (of type <CODE>DiscDistribution</CODE>).</DD>

<DT>tested</DT>
<DD>The attribute used in the node's test. If the node is a leaf, <CODE>tested</CODE> is <CODE>None</CODE>; if the node is of type <CODE>Branch</CODE> or <CODE>Subset</CODE>, <CODE>tested</CODE> is a discrete attribute; and if the node is of type <CODE>Cut</CODE>, <CODE>tested</CODE> is a continuous attribute.</DD>

<DT>cut</DT>
<DD>The threshold for continuous attributes, if the node is of type <CODE>Cut</CODE>. Undefined otherwise.</DD>

<DT>mapping</DT>
<DD>Mapping for nodes of type <CODE>Subset</CODE>. Element <CODE>mapping[i]</CODE> gives the index of the branch followed by an example whose value of <CODE>tested</CODE> is <CODE>i</CODE>. Here, <CODE>i</CODE> denotes the index of a value, not a <CODE>Value</CODE>.</DD>

<DT>branch</DT>
<DD>A list of branches stemming from this node.</DD>
</DL>
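<P>The role of <CODE>mapping</CODE> can be made concrete with a few lines of plain Python. The helper below is hypothetical (it is not part of Orange); it collects, for a given branch, the value indices routed into it.</P>

```python
def values_for_branch(mapping, branch_index):
    # Value indices v whose examples follow the branch branch_index.
    return [v for v, b in enumerate(mapping) if b == branch_index]

# A hypothetical Subset node over a three-valued attribute:
# values 0 and 2 share branch 0, value 1 goes to branch 1.
mapping = [0, 1, 0]
print(values_for_branch(mapping, 0))  # [0, 2]
print(values_for_branch(mapping, 1))  # [1]
```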

<HR>

<H2>Examples</H2>

<P>The simplest way to use <CODE>C45Learner</CODE> is to call it. This
script constructs the same learner as you would get by running C4.5 with its default settings.</P>

<p class="header">part of <a href="c45.py">c45.py</a> (uses <a
href="lenses.tab">lenses.tab</a>)</p>
<XMP class="code">import orange

data = orange.ExampleTable("lenses")
tree = orange.C45Learner(data)

for i in data[:5]:
    print tree(i), i.getclass()
</XMP>

<P>Arguments can be set by the usual mechanism (the two lines below do the
same thing, except that one uses command-line symbols and the other internal
variable names):</P>

<XMP class="code">tree = orange.C45Learner(data, m=100)
tree = orange.C45Learner(data, minObjs=100)
</XMP>

<P>The way that veteran C4.5 users might prefer is through the
method <CODE>commandline</CODE>:</P>

<XMP class="code">lrn = orange.C45Learner()
lrn.commandline("-m 1 -s")
tree = lrn(data)
</XMP>


<P>There's nothing special about using <CODE>C45Classifier</CODE> - it behaves just like any other classifier. To demonstrate what the structure of <CODE>C45TreeNode</CODE>s looks like, we will show a script that prints a tree out in the same format as C4.5 does. (You can find the script in the module orngC45.)</P>

<PRE class=code>
def printTree0(node, classvar, lev):
    var = node.tested

    if node.nodeType == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),

    elif node.nodeType == 1:
        for i, val in enumerate(var.values):
            print ("\n"+"|    "*lev + "%s = %s:") % (var.name, val),
            printTree0(node.branch[i], classvar, lev+1)

    elif node.nodeType == 2:
        print ("\n"+"|    "*lev + "%s &lt;= %.1f:") % (var.name, node.cut),
        printTree0(node.branch[0], classvar, lev+1)
        print ("\n"+"|    "*lev + "%s > %.1f:") % (var.name, node.cut),
        printTree0(node.branch[1], classvar, lev+1)

    elif node.nodeType == 3:
        for i, branch in enumerate(node.branch):
            inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
            inset = [var.values[j[0]] for j in inset]
            if len(inset)==1:
                print ("\n"+"|    "*lev + "%s = %s:") % (var.name, inset[0]),
            else:
                print ("\n"+"|    "*lev + "%s in {%s}:") % (var.name, reduce(lambda x,y:x+", "+y, inset)),
            printTree0(branch, classvar, lev+1)


def printTree(tree):
    printTree0(tree.tree, tree.classVar, 0)
    print
</PRE>

<P>Leaves are the simplest. We just print out the value contained in <CODE>node.leaf</CODE>. Since this is not a qualified value (i.e., <CODE>C45TreeNode</CODE> does not know to which attribute it belongs), we need to convert it to a string through <CODE>classVar</CODE>, which is passed as an extra argument to the recursive part of <CODE>printTree</CODE>.</P>

<P>For discrete splits without subsetting, we print out all attribute values and recursively call the function for all branches. Continuous splits are equally easy to handle.</P>

<P>For discrete splits with subsetting, we iterate through the branches, retrieve the corresponding values that go into each branch into <CODE>inset</CODE>, turn the values into strings and print them out, treating separately the case when only a single value goes into the branch.</P>

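<P>As an aside, the filter/reduce idiom used above for subset nodes can be written more directly in plain Python. The following equivalent sketch is not part of orngC45 (the function name is invented); it builds the branch label from the attribute name, its value names and the node's mapping.</P>

```python
def subset_branch_label(name, values, mapping, branch_index):
    # Names of the attribute values that the given branch receives.
    inset = [values[v] for v, b in enumerate(mapping) if b == branch_index]
    if len(inset) == 1:
        return "%s = %s" % (name, inset[0])
    return "%s in {%s}" % (name, ", ".join(inset))

print(subset_branch_label("age", ["young", "pre-presbyopic", "presbyopic"],
                          [0, 1, 1], 1))  # age in {pre-presbyopic, presbyopic}
```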
</BODY></HTML>