source: orange/docs/tutorial/rst/classification.rst @ 9994:1073e0304a87

Classification
==============

.. index:: classification
.. index:: supervised data mining

A substantial part of Orange is devoted to machine learning methods
for classification, or supervised data mining. These methods start
from data that includes class-labeled instances, like
:download:`voting.tab <code/voting.tab>`::

   >>> data = orange.ExampleTable("voting.tab")
   >>> data[0]
   ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican']
   >>> data[0].getclass()
   <orange.Value 'party'='republican'>

Supervised data mining attempts to develop predictive models from such
data that, given the set of feature values, predict a corresponding
class.

.. index:: classifiers
.. index::
   single: classifiers; naive Bayesian

There are two types of objects important for classification: learners
and classifiers. Orange has a number of built-in learners. For
instance, ``orange.BayesLearner`` is a naive Bayesian learner. When
data is passed to a learner (e.g., ``orange.BayesLearner(data)``), it
returns a classifier. When a data instance is presented to a classifier,
it returns a class, a vector of class probabilities, or both.

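To make the distinction explicit, here is a minimal sketch (on the
same voting data) that first constructs an untrained learner and only
later applies it to data to obtain a classifier::

   import orange
   data = orange.ExampleTable("voting")

   learner = orange.BayesLearner()   # a learner; holds no model yet
   classifier = learner(data)        # training a learner returns a classifier
   print classifier(data[0])         # calling a classifier returns a class
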
A Simple Classifier
-------------------

Let us see how this works in practice. We will
construct a naive Bayesian classifier from the voting data set and
use it to classify the first five instances
(:download:`classifier.py <code/classifier.py>`)::

   import orange
   data = orange.ExampleTable("voting")
   classifier = orange.BayesLearner(data)
   for i in range(5):
       c = classifier(data[i])
       print "original", data[i].getclass(), "classified as", c

The script loads the data, uses it to construct a classifier with the
naive Bayesian method, and then classifies the first five instances of
the data set. Note that both the original class and the class assigned
by the classifier are printed out.

The data set that we use includes votes of each U.S. House of
Representatives Congressman on 16 key votes; the class is the
representative's party. There are 435 data instances - 267 democrats
and 168 republicans - in the data set (see the UCI ML Repository and
its voting-records data set for further description). This is how our
classifier performs on the first five instances::

   original republican classified as republican
   original republican classified as republican
   original democrat classified as republican
   original democrat classified as democrat
   original democrat classified as democrat

Naive Bayes made a mistake on the third instance, but otherwise
predicted correctly.

Obtaining Class Probabilities
-----------------------------

To find out the probability that the classifier assigns to, say, the
democrat class, we need to call the classifier with an additional
parameter, ``orange.GetProbabilities``. Also note that the democrats
have class index 1; we can check this by printing
``data.domain.classVar.values`` (:download:`classifier2.py <code/classifier2.py>`)::

   import orange
   data = orange.ExampleTable("voting")
   classifier = orange.BayesLearner(data)
   print "Possible classes:", data.domain.classVar.values
   print "Probabilities for democrats:"
   for i in range(5):
       p = classifier(data[i], orange.GetProbabilities)
       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())

The output of this script is::

   Possible classes: <republican, democrat>
   Probabilities for democrats:
   1: 0.000 (originally republican)
   2: 0.000 (originally republican)
   3: 0.005 (originally democrat)
   4: 0.998 (originally democrat)
   5: 0.957 (originally democrat)

The printout, for example, shows that on the third instance naive
Bayes not only misclassified, but missed quite substantially: it
assigned a probability of only 0.005 to the correct class.

.. note::
   Python list indexes start with 0.

.. note::
   The ordering of class values depends on the occurrence of classes in
   the input data set.
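
Because of this, hard-coding a class index is fragile. A safer sketch
(assuming, as in Orange 2.x, that ``values`` supports standard list
operations) looks the index up at run time::

   import orange
   data = orange.ExampleTable("voting")
   classifier = orange.BayesLearner(data)
   # find the index of the "democrat" class value in this data set
   democrat = list(data.domain.classVar.values).index("democrat")
   p = classifier(data[0], orange.GetProbabilities)
   print "P(democrat) =", p[democrat]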

Classification tree
-------------------

.. index:: classifiers
.. index::
   single: classifiers; classification trees

The classification tree learner (yes, this is the same *decision tree*)
is a native Orange learner, but because it is a rather complex object
that, for the sake of versatility, is composed of a number of other
objects (for attribute estimation, stopping criteria, etc.), a wrapper
module called ``orngTree`` was built around it to simplify the use of
classification trees and to assemble the learner with some usual
(default) components. Here is a script that uses it (:download:`tree.py <code/tree.py>`)::

   import orange, orngTree
   data = orange.ExampleTable("voting")

   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
   print "Possible classes:", data.domain.classVar.values
   print "Probabilities for democrats:"
   for i in range(5):
       p = tree(data[i], orange.GetProbabilities)
       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())

   orngTree.printTxt(tree)

.. note::
   The script for the classification tree is almost the same as the one
   for naive Bayes (:download:`classifier2.py <code/classifier2.py>`),
   except that we have imported another module (``orngTree``) and used
   the learner ``orngTree.TreeLearner`` to build a classifier called
   ``tree``.

.. note::
   For those of you who are at home with machine learning: the default
   parameters of the tree learner assume that a single example is
   enough to have a leaf for it, gain ratio is used to measure the
   quality of attributes that are considered for the internal nodes of
   the tree, and no pruning takes place after the tree is constructed.

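If you prefer to spell such settings out, a rough sketch follows; the
parameter names come from the ``orngTree`` module, but treat the exact
default values here as an assumption rather than a definitive list::

   import orange, orngTree
   data = orange.ExampleTable("voting")
   # approximately the defaults: gain ratio as the attribute quality
   # measure, and no post-pruning (m set to 0)
   tree = orngTree.TreeLearner(data, measure="gainRatio", mForPruning=0)
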
The resulting tree with default parameters would be rather big, so we
have additionally requested that leaves that share a common predecessor
(node) are pruned if they classify to the same class, and that the
tree is post-pruned using the m-error estimate pruning method with
parameter m set to 2.0. The output of our script is::

   Possible classes: <republican, democrat>
   Probabilities for democrats:
   1: 0.051 (originally republican)
   2: 0.027 (originally republican)
   3: 0.989 (originally democrat)
   4: 0.985 (originally democrat)
   5: 0.985 (originally democrat)

Notice that all of the instances are classified correctly. The last
line of the script prints out the tree that was used for
classification::

   physician-fee-freeze=n: democrat (98.52%)
   physician-fee-freeze=y
   |    synfuels-corporation-cutback=n: republican (97.25%)
   |    synfuels-corporation-cutback=y
   |    |    mx-missile=n
   |    |    |    el-salvador-aid=y
   |    |    |    |    adoption-of-the-budget-resolution=n: republican (85.33%)
   |    |    |    |    adoption-of-the-budget-resolution=y
   |    |    |    |    |    anti-satellite-test-ban=n: democrat (99.54%)
   |    |    |    |    |    anti-satellite-test-ban=y: republican (100.00%)
   |    |    |    el-salvador-aid=n
   |    |    |    |    handicapped-infants=n: republican (100.00%)
   |    |    |    |    handicapped-infants=y: democrat (99.77%)
   |    |    mx-missile=y
   |    |    |    religious-groups-in-schools=y: democrat (99.54%)
   |    |    |    religious-groups-in-schools=n
   |    |    |    |    immigration=y: republican (98.63%)
   |    |    |    |    immigration=n
   |    |    |    |    |    handicapped-infants=n: republican (98.63%)
   |    |    |    |    |    handicapped-infants=y: democrat (99.77%)

For internal nodes, the printout shows the feature on which the tree
branches. For leaves, it shows the class label that the tree would
assign. The probability of that class, as estimated from the training
data set, is also displayed.

If you are more of a *visual* type, you may like the graphical
presentation of the tree better. This was achieved by printing out the
tree to a so-called dot file (the line of the script required for this
is ``orngTree.printDot(tree, fileName='tree.dot',
internalNodeShape="ellipse", leafShape="box")``), which was then
compiled to PNG using the program `dot`_.

.. image:: files/tree.png
   :alt: A graphical presentation of a classification tree

.. _dot: http://graphviz.org/
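
A minimal sketch of the whole export step (reusing the ``tree``
classifier built above; the ``dot`` invocation assumes Graphviz is
installed)::

   import orange, orngTree
   data = orange.ExampleTable("voting")
   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
   # write the tree in Graphviz dot format
   orngTree.printDot(tree, fileName='tree.dot',
                     internalNodeShape="ellipse", leafShape="box")

and then, at the command line::

   dot -Tpng tree.dot -o tree.png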

Nearest neighbors and majority classifiers
------------------------------------------

.. index:: classifiers
.. index::
   single: classifiers; k nearest neighbours
.. index::
   single: classifiers; majority classifier

Let us check two other classifiers here. The majority classifier always
classifies to the majority class of the training set, and predicts
class probabilities that are equal to the class distribution of the
training set. While useless as such, it is often good to compare this
simplest classifier to any other classifier you test - if your other
classifier is not significantly better than the majority classifier,
then this may be a reason to sit back and think.

The second classifier we are introducing here is based on the k-nearest
neighbors algorithm, an instance-based method that finds the k examples
from the training set that are most similar to the instance to be
classified. From the set obtained in this way, it estimates class
probabilities and uses the most frequent class for prediction.

The following script takes the naive Bayes and classification tree
classifiers (which we have already met), and the majority and k-nearest
neighbors classifiers (the new ones), and prints predictions for the
first 10 instances of the voting data set
(:download:`handful.py <code/handful.py>`)::

   import orange, orngTree
   data = orange.ExampleTable("voting")

   # setting up the classifiers
   majority = orange.MajorityLearner(data)
   bayes = orange.BayesLearner(data)
   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
   knn = orange.kNNLearner(data, k=21)

   majority.name="Majority"; bayes.name="Naive Bayes";
   tree.name="Tree"; knn.name="kNN"

   classifiers = [majority, bayes, tree, knn]

   # print the head
   print "Possible classes:", data.domain.classVar.values
   print "Probability for republican:"
   print "Original Class",
   for l in classifiers:
       print "%-13s" % (l.name),
   print

   # classify first 10 instances and print probabilities
   for example in data[:10]:
       print "(%-10s)  " % (example.getclass()),
       for c in classifiers:
           p = apply(c, [example, orange.GetProbabilities])
           print "%5.3f        " % (p[0]),
       print

The code is somewhat long, due to our effort to print the results
nicely. The first part of the code sets up our four classifiers and
gives them names. The classifiers are then put into the list
``classifiers`` (this is convenient: if we needed to add another
classifier, we would just define it and add it to the list, and the
rest of the code would not have to change). The script then prints the
header with the names of the classifiers, and finally uses the
classifiers to compute the probabilities of classes. Note the special
function ``apply``, which we have not met yet: it simply calls the
function given as its first argument, passing it the arguments given
in the list. In our case, ``apply`` invokes our classifiers with a
data instance and a request to compute probabilities. The output of
our script is::

   Possible classes: <republican, democrat>
   Probability for republican:
   Original Class Majority      Naive Bayes   Tree          kNN
   (republican)   0.386         1.000         0.949         1.000
   (republican)   0.386         1.000         0.973         1.000
   (democrat  )   0.386         0.995         0.011         0.138
   (democrat  )   0.386         0.002         0.015         0.468
   (democrat  )   0.386         0.043         0.015         0.035
   (democrat  )   0.386         0.228         0.015         0.442
   (democrat  )   0.386         1.000         0.973         0.977
   (republican)   0.386         1.000         0.973         1.000
   (republican)   0.386         1.000         0.973         1.000
   (democrat  )   0.386         0.000         0.015         0.000

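As an aside, ``apply`` is a Python 2 built-in, and the call above is
equivalent to calling the classifier directly. A minimal sketch of the
equivalence::

   import orange
   data = orange.ExampleTable("voting")
   bayes = orange.BayesLearner(data)
   # the two calls below compute the same thing in Python 2
   p1 = apply(bayes, [data[0], orange.GetProbabilities])
   p2 = bayes(data[0], orange.GetProbabilities)
   print p1[0], p2[0]   # same probability for republican
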
.. note::
   The prediction of the majority classifier does not depend on the
   instance it classifies (of course!).

.. note::
   At this stage, it would be inappropriate to say anything conclusive
   about the predictive quality of the classifiers - for this, we will
   need to resort to statistical methods for the comparison of
   classification models.