source: orange/docs/tutorial/rst/classification.rst @ 9385:fd37d2ce5541

Classification
==============

.. index:: classification
.. index:: supervised data mining

A substantial part of Orange is devoted to machine learning methods
for classification, or supervised data mining. These methods start
from data that incorporates class-labeled instances, like
:download:`voting.tab <code/voting.tab>`::

   >>> data = orange.ExampleTable("voting.tab")
   >>> data[0]
   ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican']
   >>> data[0].getclass()
   <orange.Value 'party'='republican'>

Supervised data mining attempts to develop predictive models from such
data that, given the set of feature values, predict a corresponding
class.
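
A few standard accessors reveal the shape of such a data set. Here is
a minimal sketch continuing the session above (the instance and
attribute counts are those of the voting data set described below)::

   >>> len(data)
   435
   >>> len(data.domain.attributes)
   16
   >>> data.domain.classVar.name
   'party'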

.. index:: classifiers
.. index::
   single: classifiers; naive Bayesian

There are two types of objects important for classification: learners
and classifiers. Orange has a number of built-in learners. For
instance, ``orange.BayesLearner`` is a naive Bayesian learner. When
data is passed to a learner (e.g., ``orange.BayesLearner(data)``), it
returns a classifier. When a data instance is presented to a classifier,
it returns a class, a vector of class probabilities, or both.
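
The two-step nature of learners and classifiers can also be written
out explicitly; a minimal sketch::

   learner = orange.BayesLearner()   # a learner, not yet given any data
   classifier = learner(data)        # calling it with data returns a classifier
   print classifier(data[0])        # calling the classifier with an instance returns a class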

A Simple Classifier
-------------------

Let us see how this works in practice. We will
construct a naive Bayesian classifier from the voting data set, and
will use it to classify the first five instances from this data set
(:download:`classifier.py <code/classifier.py>`, uses :download:`voting.tab <code/voting.tab>`)::

   import orange
   data = orange.ExampleTable("voting")
   classifier = orange.BayesLearner(data)
   for i in range(5):
       c = classifier(data[i])
       print "original", data[i].getclass(), "classified as", c

The script loads the data, uses it to construct a classifier with the
naive Bayesian method, and then classifies the first five instances of
the data set. Note that both the original class and the class assigned
by the classifier are printed out.

The data set that we use includes votes for each of the U.S. House of
Representatives congressmen on 16 key votes; the class is the
representative's party. There are 435 data instances - 267 democrats
and 168 republicans - in the data set (see the UCI ML Repository and
its voting-records data set for a further description). This is how our
classifier performs on the first five instances::

   original republican classified as republican
   original republican classified as republican
   original democrat classified as republican
   original democrat classified as democrat
   original democrat classified as democrat

Naive Bayes made a mistake on the third instance, but otherwise predicted
correctly.

Obtaining Class Probabilities
-----------------------------

To find out the probability that the classifier assigns to, say, the
democrat class, we need to call the classifier with the additional
parameter ``orange.GetProbabilities``. Also, note that democrats have
class index 1; we can find this out by printing
``data.domain.classVar.values`` (:download:`classifier2.py <code/classifier2.py>`, uses :download:`voting.tab <code/voting.tab>`)::

   import orange
   data = orange.ExampleTable("voting")
   classifier = orange.BayesLearner(data)
   print "Possible classes:", data.domain.classVar.values
   print "Probabilities for democrats:"
   for i in range(5):
       p = classifier(data[i], orange.GetProbabilities)
       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())

The output of this script is::

   Possible classes: <republican, democrat>
   Probabilities for democrats:
   1: 0.000 (originally republican)
   2: 0.000 (originally republican)
   3: 0.005 (originally democrat)
   4: 0.998 (originally democrat)
   5: 0.957 (originally democrat)

The printout shows, for example, that naive Bayes not only
misclassified the third instance, but missed quite substantially: it
assigned only a 0.005 probability to the correct class.

.. note::
   Python list indexes start with 0.

.. note::
   The ordering of class values depends on the occurrence of classes
   in the input data set.
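
Besides ``orange.GetProbabilities``, a classifier can also return the
class and the probabilities at once with ``orange.GetBoth``. A minimal
sketch, reusing the ``classifier`` built above::

   value, probabilities = classifier(data[3], orange.GetBoth)
   print "%s with democrat probability %5.3f" % (value, probabilities[1])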

Classification tree
-------------------

.. index:: classifiers
.. index::
   single: classifiers; classification trees

The classification tree learner (yes, this is the same *decision tree*)
is a native Orange learner, but because it is a rather complex object
that is, for versatility, composed of a number of other objects (for
attribute estimation, stopping criteria, etc.), a wrapper module called
``orngTree`` was built around it to simplify the use of classification
trees and to assemble the learner with the usual (default)
components. Here is a script that uses it (:download:`tree.py <code/tree.py>`,
uses :download:`voting.tab <code/voting.tab>`)::

   import orange, orngTree
   data = orange.ExampleTable("voting")

   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
   print "Possible classes:", data.domain.classVar.values
   print "Probabilities for democrats:"
   for i in range(5):
       p = tree(data[i], orange.GetProbabilities)
       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())

   orngTree.printTxt(tree)

.. note::
   The script for the classification tree is almost the same as the one
   for naive Bayes (:download:`classifier2.py <code/classifier2.py>`), except that we have imported
   another module (``orngTree``) and used the learner
   ``orngTree.TreeLearner`` to build a classifier called ``tree``.

.. note::
   For those of you that are at home with machine learning: the
   default parameters of the tree learner assume that a single example
   is enough to form a leaf, that gain ratio is used for measuring the
   quality of attributes considered for internal nodes of the tree,
   and that no pruning takes place after the tree is constructed.
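
These defaults can be overridden by passing parameters to the
learner. A hedged sketch (the parameter names ``measure`` and
``minExamples`` are taken from the ``orngTree`` module; treat them as
an assumption if your version differs)::

   # assumed orngTree parameters: score attributes by information gain
   # instead of gain ratio, and require at least five examples in a leaf
   tree2 = orngTree.TreeLearner(data, measure="infoGain", minExamples=5)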

The resulting tree with default parameters would be rather big, so we
have additionally requested that leaves that share a common predecessor
(node) be pruned if they classify to the same class, and that the tree
be post-pruned using the m-error estimate pruning method with
parameter m set to 2.0. The output of our script is::

   Possible classes: <republican, democrat>
   Probabilities for democrats:
   1: 0.051 (originally republican)
   2: 0.027 (originally republican)
   3: 0.989 (originally democrat)
   4: 0.985 (originally democrat)
   5: 0.985 (originally democrat)

Notice that all of the instances are classified correctly. The last
line of the script prints out the tree that was used for
classification::

   physician-fee-freeze=n: democrat (98.52%)
   physician-fee-freeze=y
   |    synfuels-corporation-cutback=n: republican (97.25%)
   |    synfuels-corporation-cutback=y
   |    |    mx-missile=n
   |    |    |    el-salvador-aid=y
   |    |    |    |    adoption-of-the-budget-resolution=n: republican (85.33%)
   |    |    |    |    adoption-of-the-budget-resolution=y
   |    |    |    |    |    anti-satellite-test-ban=n: democrat (99.54%)
   |    |    |    |    |    anti-satellite-test-ban=y: republican (100.00%)
   |    |    |    el-salvador-aid=n
   |    |    |    |    handicapped-infants=n: republican (100.00%)
   |    |    |    |    handicapped-infants=y: democrat (99.77%)
   |    |    mx-missile=y
   |    |    |    religious-groups-in-schools=y: democrat (99.54%)
   |    |    |    religious-groups-in-schools=n
   |    |    |    |    immigration=y: republican (98.63%)
   |    |    |    |    immigration=n
   |    |    |    |    |    handicapped-infants=n: republican (98.63%)
   |    |    |    |    |    handicapped-infants=y: democrat (99.77%)

The printout includes the feature on which the tree branches in the
internal nodes. For leaves, it shows the class label to which the
tree would classify. The probability of that class, as
estimated from the training data set, is also displayed.
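
The textual printout can be customized with format strings. A hedged
sketch using the ``leafStr`` argument of ``orngTree.printTxt`` (the
``%V`` and ``%N`` format codes, for the majority class and the number
of training examples in a leaf, are an assumption about your
``orngTree`` version)::

   # print the majority class and the number of examples in each leaf
   orngTree.printTxt(tree, leafStr="%V (%N examples)")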

If you are more of a *visual* type, you may like the graphical
presentation of the tree better. This was achieved by printing the
tree to a so-called dot file (the line of the script required for this
is ``orngTree.printDot(tree, fileName='tree.dot',
internalNodeShape="ellipse", leafShape="box")``), which was then
compiled to PNG using a program called `dot`_.
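
To reproduce the figure, a minimal sketch (the file name is arbitrary,
and Graphviz's ``dot`` has to be installed separately)::

   import orange, orngTree
   data = orange.ExampleTable("voting")
   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
   # write the tree in dot format; compile it on the command line with
   #   dot -Tpng tree.dot -o tree.png
   orngTree.printDot(tree, fileName='tree.dot',
                     internalNodeShape="ellipse", leafShape="box")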

.. image:: files/tree.png
   :alt: A graphical presentation of a classification tree

.. _dot: http://graphviz.org/

Nearest neighbors and majority classifiers
------------------------------------------

.. index:: classifiers
.. index::
   single: classifiers; k nearest neighbours
.. index::
   single: classifiers; majority classifier

Let us check on two other classifiers here. The majority classifier always
classifies to the majority class of the training set, and predicts
class probabilities that are equal to the class distribution of the training
set. While useless as such, it is often good to compare this
simplest classifier to any other classifier you test; if your
other classifier is not significantly better than the majority classifier,
this may be a reason to sit back and think.
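
For a quick impression of this behavior, a minimal sketch (the same
``orange.MajorityLearner`` is used in the full script below)::

   import orange
   data = orange.ExampleTable("voting")
   majority = orange.MajorityLearner(data)
   # the prediction does not depend on the instance, and the probabilities
   # equal the class distribution of the training set
   print majority(data[0])
   print majority(data[0], orange.GetProbabilities)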

The second classifier that we introduce here is based on the k-nearest
neighbors algorithm, an instance-based method that finds the k examples
from the training set that are most similar to the instance that has to be
classified. From the set obtained in this way, it estimates class
probabilities and uses the most frequent class for prediction.

The following script takes naive Bayes and the classification tree (which we
have already seen), along with the majority and k-nearest neighbors
classifiers (the new ones), and prints predictions for the first 10 instances
of the voting data set (:download:`handful.py <code/handful.py>`, uses :download:`voting.tab <code/voting.tab>`)::

   import orange, orngTree
   data = orange.ExampleTable("voting")

   # setting up the classifiers
   majority = orange.MajorityLearner(data)
   bayes = orange.BayesLearner(data)
   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
   knn = orange.kNNLearner(data, k=21)

   majority.name="Majority"; bayes.name="Naive Bayes";
   tree.name="Tree"; knn.name="kNN"

   classifiers = [majority, bayes, tree, knn]

   # print the header
   print "Possible classes:", data.domain.classVar.values
   print "Probability for republican:"
   print "Original Class",
   for l in classifiers:
       print "%-13s" % (l.name),
   print

   # classify first 10 instances and print probabilities
   for example in data[:10]:
       print "(%-10s)  " % (example.getclass()),
       for c in classifiers:
           p = apply(c, [example, orange.GetProbabilities])
           print "%5.3f        " % (p[0]),
       print

The code is somewhat long, due to our effort to print the results
nicely. The first part of the code sets up our four classifiers and
gives them names. The classifiers are then put into a list named
``classifiers`` (this is convenient: if we needed to add
another classifier, we would just define it and put it in the list,
and the rest of the code would stay the same). The script then prints
the header with the names of the classifiers, and finally uses the
classifiers to compute the probabilities of classes. Note the built-in
function ``apply``, which we have not met yet: it simply calls the
function given as its first argument and passes it the arguments given
in the list. In our case, ``apply`` invokes our classifiers with a data
instance and a request to compute probabilities. The output of our
script is::

   Possible classes: <republican, democrat>
   Probability for republican:
   Original Class Majority      Naive Bayes   Tree          kNN
   (republican)   0.386         1.000         0.949         1.000
   (republican)   0.386         1.000         0.973         1.000
   (democrat  )   0.386         0.995         0.011         0.138
   (democrat  )   0.386         0.002         0.015         0.468
   (democrat  )   0.386         0.043         0.015         0.035
   (democrat  )   0.386         0.228         0.015         0.442
   (democrat  )   0.386         1.000         0.973         0.977
   (republican)   0.386         1.000         0.973         1.000
   (republican)   0.386         1.000         0.973         1.000
   (democrat  )   0.386         0.000         0.015         0.000

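As an aside, ``apply`` is a relic of older Python; the same call can be
written directly. A minimal sketch of the equivalent line::

   p = c(example, orange.GetProbabilities)  # same as apply(c, [example, orange.GetProbabilities])
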
.. note::
   The prediction of the majority class classifier does not depend on the
   instance it classifies (of course!).

.. note::
   At this stage, it would be inappropriate to say anything conclusive
   about the predictive quality of the classifiers - for this, we will
   need statistical methods for comparing classification models.