source: orange/docs/tutorial/rst/classification.rst @ 11361:7727b6fa13dd

Revision 11361:7727b6fa13dd, 6.9 KB checked in by Ales Erjavec <ales.erjavec@…>, 14 months ago (diff)

Fixed an indentation error in the tutorial.

Classification
==============

.. index:: classification
.. index::
   single: data mining; supervised

Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on class-labeled data, such as the senate voting records. Here is code that loads this data set, displays the first data instance, and shows its class (``republican``)::

   >>> data = Orange.data.Table("voting")
   >>> data[0]
   ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican']
   >>> data[0].get_class()
   <orange.Value 'party'='republican'>

Learners and Classifiers
------------------------

.. index::
   single: classification; learner
.. index::
   single: classification; classifier
.. index::
   single: classification; naive Bayesian classifier

Classification uses two types of objects: learners and classifiers. Learners consider class-labeled data and return a classifier. Given a data instance (a vector of feature values), classifiers return a predicted class::

    >>> import Orange
    >>> data = Orange.data.Table("voting")
    >>> learner = Orange.classification.bayes.NaiveLearner()
    >>> classifier = learner(data)
    >>> classifier(data[0])
    <orange.Value 'party'='republican'>

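The learner/classifier split is a general pattern, not specific to Orange. As an illustration only, here is a minimal pure-Python sketch of the same protocol, using a hypothetical majority-class learner (not Orange's implementation):

```python
# Minimal sketch of the learner/classifier protocol; illustration only,
# not Orange's implementation.
from collections import Counter

class MajorityClassifier:
    """A classifier maps a data instance to a predicted class."""
    def __init__(self, majority_class):
        self.majority_class = majority_class

    def __call__(self, instance):
        return self.majority_class

class MajorityLearner:
    """A learner considers class-labeled data and returns a classifier."""
    def __call__(self, data):
        # data: list of (feature_vector, class_label) pairs
        counts = Counter(label for _, label in data)
        return MajorityClassifier(counts.most_common(1)[0][0])

data = [(["n", "y"], "republican"),
        (["y", "n"], "democrat"),
        (["n", "n"], "republican")]
learner = MajorityLearner()     # construct a learner
classifier = learner(data)      # learner + data -> classifier
print(classifier(["y", "y"]))   # prints: republican
```

Calling the learner with data yields a classifier; calling the classifier with an instance yields a class, exactly as in the Orange session above.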
Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used the classifier to predict the class of the first data item. The following script uses the same concepts to predict the classes of the first five instances in the data set:

.. literalinclude:: code/classification-classifier1.py
   :lines: 4-

The script outputs::

    republican; originally republican
    republican; originally republican
    republican; originally democrat
      democrat; originally democrat
      democrat; originally democrat

The naive Bayesian classifier made a mistake in the third instance, but otherwise predicted correctly. No wonder, since it was tested on the same data it was trained on.

Probabilistic Classification
----------------------------

To find out the probability that the classifier assigns to, say, the democrat class, we need to call the classifier with an additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities:

.. literalinclude:: code/classification-classifier2.py
   :lines: 4-

The output of the script also shows how badly the naive Bayesian classifier missed the class for the third data item::

   Probabilities for democrat:
   0.000; originally republican
   0.000; originally republican
   0.005; originally democrat
   0.998; originally democrat
   0.957; originally democrat

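Where do such probabilities come from? As an illustration only, here is the naive Bayesian computation for a single feature, with invented counts that do not come from the voting data: multiply the class prior by the conditional probability of the observed value, then normalize over the classes:

```python
# Toy naive Bayesian computation for one feature; the counts below are
# invented for illustration and are not taken from the voting data set.
counts = {
    "democrat":   {"y": 200, "n": 50},
    "republican": {"y": 30,  "n": 150},
}

def class_probabilities(vote):
    totals = {c: sum(v.values()) for c, v in counts.items()}
    n = float(sum(totals.values()))
    # P(class) * P(vote | class), for each class
    joint = {c: (totals[c] / n) * (counts[c][vote] / float(totals[c]))
             for c in counts}
    z = sum(joint.values())          # normalizing constant
    return {c: p / z for c, p in joint.items()}

probs = class_probabilities("y")
print("%.3f" % probs["democrat"])    # prints: 0.870
```

With several features, a naive Bayesian model multiplies one such conditional term per feature, which is where the near-0 and near-1 probabilities above come from.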
Cross-Validation
----------------

.. index:: cross-validation

Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any measure of accuracy should be estimated on an independent test set. One such procedure is `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time using different training and test subsets sampled from the original data set:

.. literalinclude:: code/classification-cv.py
   :lines: 3-

.. index::
   single: classification; scoring
.. index::
   single: classification; area under ROC
.. index::
   single: classification; accuracy

Cross-validation expects a list of learners, and the performance estimators in turn return a list of scores, one for each learner. There was just one learner in the script above, hence a list of size one. The script estimates classification accuracy and area under the ROC curve. The latter score is very high, indicating very good performance of the naive Bayesian learner on the senate voting data set::

   Accuracy: 0.90
   AUC:      0.97
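The mechanics behind such a procedure can be sketched without Orange: split the instance indices into k folds, and in each run hold one fold out for testing while training on the rest. A minimal index-splitting sketch (illustration only, not Orange's implementation):

```python
# Sketch of k-fold cross-validation index splitting; illustration only,
# not Orange's implementation.
def cross_validation_folds(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    fold_size = n_instances // k
    for fold in range(k):
        start = fold * fold_size
        # the last fold absorbs any remainder
        end = start + fold_size if fold < k - 1 else n_instances
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

# every instance appears in exactly one test fold
for train, test in cross_validation_folds(10, 5):
    print(test)
```

Each run would train the learner on the training indices, score the resulting classifier on the held-out indices, and the final estimate averages the per-fold scores.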


Handful of Classifiers
----------------------

Orange implements a wide range of classification algorithms, including:

- logistic regression (``Orange.classification.logreg``)
- k-nearest neighbors (``Orange.classification.knn``)
- support vector machines (``Orange.classification.svm``)
- classification trees (``Orange.classification.tree``)
- classification rules (``Orange.classification.rules``)

Some of these are used in the code below, which estimates the probability of a target class on test data. This time, the training and test data sets are disjoint:

.. index::
   single: classification; logistic regression
.. index::
   single: classification; trees
.. index::
   single: classification; k-nearest neighbors

.. literalinclude:: code/classification-other.py

For these five data items, there are no major differences among the predictions of the classification algorithms::

   Probabilities for republican:
   original class  tree      k-NN      lr
   republican      0.949     1.000     1.000
   republican      0.972     1.000     1.000
   democrat        0.011     0.078     0.000
   democrat        0.015     0.001     0.000
   democrat        0.015     0.032     0.000

The following code cross-validates several learners. Notice the difference from the code above: cross-validation requires learners, while there, the learners were immediately given the data and the calls returned classifiers.

.. literalinclude:: code/classification-cv2.py

Logistic regression wins in area under the ROC curve::

            nbc  tree lr
   Accuracy 0.90 0.95 0.94
   AUC      0.97 0.94 0.99

Reporting on Classification Models
----------------------------------

Classification models are objects that expose every component of their structure. For instance, one can traverse a classification tree in code and observe the associated data instances, probabilities, and conditions. Often, however, a textual output of the model is sufficient. For logistic regression and trees, this is illustrated in the script below:

.. literalinclude:: code/classification-models.py

The logistic regression part of the output is::

   class attribute = survived
   class values = <no, yes>

         Feature       beta  st. error     wald Z          P OR=exp(beta)

       Intercept      -1.23       0.08     -15.15      -0.00
    status=first       0.86       0.16       5.39       0.00       2.36
   status=second      -0.16       0.18      -0.91       0.36       0.85
    status=third      -0.92       0.15      -6.12       0.00       0.40
       age=child       1.06       0.25       4.30       0.00       2.89
      sex=female       2.42       0.14      17.04       0.00      11.25

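The OR=exp(beta) column lists odds ratios, obtained by exponentiating the corresponding coefficient; for example, for ``status=first``:

```python
import math

# Odds ratio for status=first: exponentiate its coefficient (beta = 0.86)
beta = 0.86
print(round(math.exp(beta), 2))   # prints: 2.36
```

A passenger in first class thus has roughly 2.4 times the odds of survival of the baseline, all other features being equal.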
Trees can also be rendered in the `dot <http://en.wikipedia.org/wiki/DOT_language>`_ language::

   tree.dot(file_name="0.dot", node_shape="ellipse", leaf_shape="box")

The following figure shows an example of such rendering.

.. image:: files/tree.png
   :alt: A graphical presentation of a classification tree