Timestamp:
01/06/13 20:18:48 (16 months ago)
Author:
Miha Stajdohar <miha.stajdohar@…>
Branch:
default
Children:
11059:83e86ea77981, 11060:340b8bf1cbb4
Parents:
11057:3da1cf37de17 (diff), 11056:a68fd2fce444 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merged with tutorial updates.

File:
1 edited

  • docs/tutorial/rst/classification.rst

    r9994 r11058  
    33 
    44.. index:: classification 
    5 .. index:: supervised data mining 
     5.. index::  
     6   single: data mining; supervised 
    67 
    7 A substantial part of Orange is devoted to machine learning methods 
    8 for classification, or supervised data mining. These methods start 
    9 from the data that incorporates class-labeled instances, like 
    10 :download:`voting.tab <code/voting.tab>`:: 
     8Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on 
     9data with class-labeled instances, such as the senate voting data. Here is the code that loads this data set, displays the first data instance, and shows its class (``republican``):: 
    1110 
    12    >>> data = orange.ExampleTable("voting.tab") 
     11   >>> data = Orange.data.Table("voting") 
    1312   >>> data[0] 
    1413   ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican'] 
    15    >>> data[0].getclass() 
     14   >>> data[0].get_class() 
    1615   <orange.Value 'party'='republican'> 
    1716 
    18 Supervised data mining attempts to develop predictive models from such 
    19 data that, given the set of feature values, predict a corresponding 
    20 class. 
     17Learners and Classifiers 
     18------------------------ 
    2119 
    22 .. index:: classifiers 
    2320.. index:: 
    24    single: classifiers; naive Bayesian 
     21   single: classification; learner 
     22.. index:: 
     23   single: classification; classifier 
     24.. index:: 
     25   single: classification; naive Bayesian classifier 
    2526 
    26 There are two types of objects important for classification: learners 
    27 and classifiers. Orange has a number of build-in learners. For 
    28 instance, ``orange.BayesLearner`` is a naive Bayesian learner. When 
    29 data is passed to a learner (e.g., ``orange.BayesLearner(data))``, it 
    30 returns a classifier. When data instance is presented to a classifier, 
    31 it returns a class, vector of class probabilities, or both. 
     27Classification uses two types of objects: learners and classifiers. Learners are given class-labeled data and return a classifier. Given a data instance (a vector of feature values), a classifier returns a predicted class:: 
    3228 
    33 A Simple Classifier 
    34 ------------------- 
     29    >>> import Orange 
     30    >>> data = Orange.data.Table("voting") 
     31    >>> learner = Orange.classification.bayes.NaiveLearner() 
     32    >>> classifier = learner(data) 
     33    >>> classifier(data[0]) 
     34    <orange.Value 'party'='republican'> 
    3535 
    36 Let us see how this works in practice. We will 
    37 construct a naive Bayesian classifier from voting data set, and 
    38 will use it to classify the first five instances from this data set 
    39 (:download:`classifier.py <code/classifier.py>`):: 
     36Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used the classifier to predict the class of the first data item. The following code uses the same concepts to predict the classes of the first five instances in the data set: 
    4037 
    41    import orange 
    42    data = orange.ExampleTable("voting") 
    43    classifier = orange.BayesLearner(data) 
    44    for i in range(5): 
    45        c = classifier(data[i]) 
    46        print "original", data[i].getclass(), "classified as", c 
     38.. literalinclude:: code/classification-classifier1.py 
     39   :lines: 4- 
    4740 
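The learner/classifier protocol described above can also be sketched in plain Python, without Orange; ``majority_learner`` below is a made-up stand-in for an Orange learner, not part of its API:

```python
from collections import Counter

def majority_learner(data):
    # A learner: takes class-labeled training data, returns a classifier.
    labels = [label for _, label in data]
    majority = Counter(labels).most_common(1)[0][0]
    def classifier(instance):
        # The classifier ignores the instance and always predicts the
        # majority class seen during training.
        return majority
    return classifier

# Toy stand-in for the voting data: (feature values, class) pairs.
data = [(["n", "y"], "republican"),
        (["y", "y"], "democrat"),
        (["n", "n"], "democrat")]
classifier = majority_learner(data)
print(classifier(["y", "n"]))  # -> democrat
```

The essential point is only the two-step shape: calling the learner on data yields a classifier, and calling the classifier on an instance yields a class.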
    48 The script loads the data, uses it to constructs a classifier using 
    49 naive Bayesian method, and then classifies first five instances of the 
    50 data set. Note that both original class and the class assigned by a 
    51 classifier is printed out. 
     41The script outputs:: 
    5242 
    53 The data set that we use includes votes for each of the U.S.  House of 
    54 Representatives Congressmen on the 16 key votes; a class is a 
    55 representative's party. There are 435 data instances - 267 democrats 
    56 and 168 republicans - in the data set (see UCI ML Repository and 
    57 voting-records data set for further description).  This is how our 
    58 classifier performs on the first five instances: 
     43    republican; originally republican 
     44    republican; originally republican 
     45    republican; originally democrat 
     46      democrat; originally democrat 
     47      democrat; originally democrat 
    5948 
    60    1: republican (originally republican) 
    61    2: republican (originally republican) 
    62    3: republican (originally democrat) 
    63    4: democrat (originally democrat) 
    64    5: democrat (originally democrat) 
     49The naive Bayesian classifier made a mistake in the third instance, but otherwise predicted correctly. No wonder, since this was also the data it was trained on. 
    6550 
    66 Naive Bayes made a mistake at a third instance, but otherwise predicted 
    67 correctly. 
    68  
    69 Obtaining Class Probabilities 
    70 ----------------------------- 
     51Probabilistic Classification 
     52---------------------------- 
    7153 
    7254To find out the probability that the classifier assigns 
    7355to, say, the democrat class, we need to call the classifier with 
    74 additional parameter ``orange.GetProbabilities``. Also, note that the 
    75 democrats have a class index 1. We find this out with print 
    76 ``data.domain.classVar.values`` (:download:`classifier2.py <code/classifier2.py>`):: 
     56an additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities: 
    7757 
    78    import orange 
    79    data = orange.ExampleTable("voting") 
    80    classifier = orange.BayesLearner(data) 
    81    print "Possible classes:", data.domain.classVar.values 
    82    print "Probabilities for democrats:" 
    83    for i in range(5): 
    84        p = classifier(data[i], orange.GetProbabilities) 
    85        print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass()) 
     58.. literalinclude:: code/classification-classifier2.py 
     59   :lines: 4- 
    8660 
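A classifier that reports class probabilities can be sketched the same way; the ``return_probabilities`` flag below is purely illustrative, standing in for Orange's output-type parameter:

```python
from collections import Counter

def frequency_learner(data):
    # A learner whose classifier can also report class probabilities,
    # here simply the class frequencies observed in the training data.
    counts = Counter(label for _, label in data)
    total = sum(counts.values())
    probabilities = {c: n / total for c, n in counts.items()}
    def classifier(instance, return_probabilities=False):
        if return_probabilities:
            return probabilities
        return max(probabilities, key=probabilities.get)
    return classifier

data = [(["n", "y"], "republican"),
        (["y", "y"], "democrat"),
        (["n", "n"], "democrat"),
        (["y", "n"], "democrat")]
classifier = frequency_learner(data)
print(classifier(["y", "y"], return_probabilities=True))
# -> {'republican': 0.25, 'democrat': 0.75}
```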
    87 The output of this script is:: 
     61The output of the script also shows how badly the naive Bayesian classifier missed the class for the third data item:: 
    8862 
    89    Possible classes: <republican, democrat> 
    90    Probabilities for democrats: 
    91    1: 0.000 (originally republican) 
    92    2: 0.000 (originally republican) 
    93    3: 0.005 (originally democrat) 
    94    4: 0.998 (originally democrat) 
    95    5: 0.957 (originally democrat) 
     63   Probabilities for democrat: 
     64   0.000; originally republican 
     65   0.000; originally republican 
     66   0.005; originally democrat 
     67   0.998; originally democrat 
     68   0.957; originally democrat 
    9669 
    97 The printout, for example, shows that with the third instance 
    98 naive Bayes has not only misclassified, but the classifier missed 
    99 quite substantially; it has assigned only a 0.005 probability to 
    100 the correct class. 
     70Cross-Validation 
     71---------------- 
    10172 
    102 .. note:: 
    103    Python list indexes start with 0. 
     73.. index:: cross-validation 
    10474 
    105 .. note:: 
    106    The ordering of class values depend on occurence of classes in the 
    107    input data set. 
     75Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any performance measure that assesses accuracy should be estimated on an independent test set. One such procedure is `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time considering different training and test subsets sampled from the original data set: 
    10876 
    109 Classification tree 
    110 ------------------- 
     77.. literalinclude:: code/classification-cv.py 
     78   :lines: 3- 
    11179 
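The mechanics of cross-validation can be sketched without Orange; the helper names below are hypothetical, and the toy majority learner stands in for a real one:

```python
from collections import Counter

def majority_learner(train):
    # Minimal stand-in learner: always predicts the majority class.
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def cross_validate(learner, data, folds=5):
    # Each fold k holds out every folds-th instance (offset k) for testing
    # and trains a fresh classifier on the remaining instances.
    accuracies = []
    for k in range(folds):
        test = data[k::folds]
        train = [d for i, d in enumerate(data) if i % folds != k]
        classifier = learner(train)
        correct = sum(classifier(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds

# Toy data: eight instances of class "a", two of class "b".
data = [(i, "a") for i in range(8)] + [(8, "b"), (9, "b")]
print(cross_validate(majority_learner, data))  # -> 0.8
```

Note that the procedure is handed a *learner*, not a classifier: a new classifier is trained inside every fold.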
    112 .. index:: classifiers 
    11380.. index:: 
    114    single: classifiers; classification trees 
     81   single: classification; scoring 
     82.. index:: 
     83   single: classification; area under ROC 
     84.. index:: 
     85   single: classification; accuracy 
    11586 
    116 Classification tree learner (yes, this is the same *decision tree*) 
    117 is a native Orange learner, but because it is a rather 
    118 complex object that is for its versatility composed of a number of 
    119 other objects (for attribute estimation, stopping criterion, etc.), 
    120 a wrapper (module) called ``orngTree`` was build around it to simplify 
    121 the use of classification trees and to assemble the learner with 
    122 some usual (default) components. Here is a script with it (:download:`tree.py <code/tree.py>`):: 
     87Cross-validation expects a list of learners, and the performance estimators return a list of scores, one for every learner. There was just one learner in the script above, hence a list of size one was used. The script estimates classification accuracy and area under the ROC curve. The latter score is very high, indicating a very good performance of the naive Bayesian learner on the senate voting data set:: 
    12388 
    124    import orange, orngTree 
    125    data = orange.ExampleTable("voting") 
     89   Accuracy: 0.90 
     90   AUC:      0.97 
     91 
     92 
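Both reported scores are easy to compute by hand. As a sketch, AUC can be estimated as a pairwise ranking statistic over prediction scores (the scores below are made up, not from the voting data):

```python
def auc(scored):
    # AUC equals the probability that a randomly chosen positive instance
    # receives a higher score than a randomly chosen negative one
    # (ties counted as one half).
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# (score, true class) pairs; class 1 is the target class.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.2, 0)]
print(round(auc(scored), 3))  # -> 0.833
```
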
     93A Handful of Classifiers 
     94------------------------ 
     95 
     96Orange implements a wide range of classification algorithms, including: 
     97 
     98- logistic regression (``Orange.classification.logreg``) 
     99- k-nearest neighbors (``Orange.classification.knn``) 
     100- support vector machines (``Orange.classification.svm``) 
     101- classification trees (``Orange.classification.tree``) 
     102- classification rules (``Orange.classification.rules``) 
     103 
     104Some of these are used in the code below, which estimates the probability of a target class on test data. This time, the training and test data sets are disjoint: 
     105 
     106.. index:: 
     107   single: classification; logistic regression 
     108.. index:: 
     109   single: classification; trees 
     110.. index:: 
     111   single: classification; k-nearest neighbors 
     112 
     113.. literalinclude:: code/classification-other.py 
     114 
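The disjoint training/test split mentioned above can be sketched with a hypothetical helper (this is not the code of the tutorial's script):

```python
import random

def split(data, p=0.7, seed=1):
    # Shuffle (reproducibly, via a fixed seed) and cut the data into
    # disjoint training and test sets, with a fraction p for training.
    rnd = random.Random(seed)
    shuffled = list(data)
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * p)
    return shuffled[:cut], shuffled[cut:]

train, test = split(range(10))
print(len(train), len(test))  # -> 7 3
```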
     115For these five data items, there are no major differences between the predictions of the observed classification algorithms:: 
     116 
     117   Probabilities for republican: 
     118   original class  tree      k-NN      lr        
     119   republican      0.949     1.000     1.000 
     120   republican      0.972     1.000     1.000 
     121   democrat        0.011     0.078     0.000 
     122   democrat        0.015     0.001     0.000 
     123   democrat        0.015     0.032     0.000 
     124 
     125The following code cross-validates several learners. Notice the difference between this and the code above: cross-validation requires learners, while in the script above the learners were immediately given the data and the calls returned classifiers. 
     126 
     127.. literalinclude:: code/classification-cv2.py 
     128 
     129Logistic regression wins in the area under the ROC curve:: 
     130 
     131            nbc  tree lr   
     132   Accuracy 0.90 0.95 0.94 
     133   AUC      0.97 0.94 0.99 
     134 
     135Reporting on Classification Models 
     136---------------------------------- 
     137 
     138Classification models are objects that expose every component of their structure. For instance, one can traverse a classification tree in code and observe the associated data instances, probabilities and conditions. Often, however, it is sufficient to provide a textual output of the model. For logistic regression and trees, this is illustrated in the script below: 
     139 
     140.. literalinclude:: code/classification-models.py 
     141 
     142The logistic regression part of the output is:: 
    126143    
    127    tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2) 
    128    print "Possible classes:", data.domain.classVar.values 
    129    print "Probabilities for democrats:" 
    130    for i in range(5): 
    131        p = tree(data[i], orange.GetProbabilities) 
    132        print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass()) 
     144   class attribute = survived 
     145   class values = <no, yes> 
     146 
     147         Feature       beta  st. error     wald Z          P OR=exp(beta) 
    133148    
    134    orngTree.printTxt(tree) 
     149       Intercept      -1.23       0.08     -15.15      -0.00 
     150    status=first       0.86       0.16       5.39       0.00       2.36 
     151   status=second      -0.16       0.18      -0.91       0.36       0.85 
     152    status=third      -0.92       0.15      -6.12       0.00       0.40 
     153       age=child       1.06       0.25       4.30       0.00       2.89 
     154      sex=female       2.42       0.14      17.04       0.00      11.25 
    135155 
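The last column of this printout can be recomputed from the beta column, since OR = exp(beta); the coefficients below are copied from the table above:

```python
import math

# Coefficients (beta) from the logistic regression printout;
# the odds-ratio column is simply exp(beta), rounded to two decimals.
betas = {"status=first": 0.86, "status=second": -0.16,
         "status=third": -0.92, "age=child": 1.06, "sex=female": 2.42}
odds_ratios = {f: round(math.exp(b), 2) for f, b in betas.items()}
print(odds_ratios["sex=female"])  # -> 11.25
```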
    136 .. note::  
    137    The script for classification tree is almost the same as the one 
    138    for naive Bayes (:download:`classifier2.py <code/classifier2.py>`), except that we have imported 
    139    another module (``orngTree``) and used learner 
    140    ``orngTree.TreeLearner`` to build a classifier called ``tree``. 
     156Trees can also be rendered in the `dot <http://en.wikipedia.org/wiki/DOT_language>`_ format:: 
    141157 
    142 .. note:: 
    143    For those of you that are at home with machine learning: the 
    144    default parameters for tree learner assume that a single example is 
    145    enough to have a leaf for it, gain ratio is used for measuring the 
    146    quality of attributes that are considered for internal nodes of the 
    147    tree, and after the tree is constructed the subtrees no pruning 
    148    takes place. 
     158   tree.dot(file_name="0.dot", node_shape="ellipse", leaf_shape="box") 
    149159 
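For a sense of what such a ``.dot`` file contains, here is a hand-written sketch of a two-leaf tree in the DOT language (hypothetical node names and percentages, not actual Orange output):

```python
# Internal nodes are drawn as ellipses, leaves as boxes, and edges are
# labeled with the feature value that leads to each subtree.
lines = ['digraph tree {',
         '  "physician-fee-freeze" [shape=ellipse];',
         '  "democrat (98.5%)" [shape=box];',
         '  "republican (97.3%)" [shape=box];',
         '  "physician-fee-freeze" -> "democrat (98.5%)" [label="n"];',
         '  "physician-fee-freeze" -> "republican (97.3%)" [label="y"];',
         '}']
dot = "\n".join(lines)
print(dot)
```

A file with this content can be compiled to an image with Graphviz, e.g. ``dot -Tpng 0.dot -o tree.png``.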
    150 The resulting tree with default parameters would be rather big, so we 
    151 have additionally requested that leaves that share common predecessor 
    152 (node) are pruned if they classify to the same class, and requested 
    153 that tree is post-pruned using m-error estimate pruning method with 
    154 parameter m set to 2.0. The output of our script is:: 
    155  
    156    Possible classes: <republican, democrat> 
    157    Probabilities for democrats: 
    158    1: 0.051 (originally republican) 
    159    2: 0.027 (originally republican) 
    160    3: 0.989 (originally democrat) 
    161    4: 0.985 (originally democrat) 
    162    5: 0.985 (originally democrat) 
    163  
    164 Notice that all of the instances are classified correctly. The last 
    165 line of the script prints out the tree that was used for 
    166 classification:: 
    167  
    168    physician-fee-freeze=n: democrat (98.52%) 
    169    physician-fee-freeze=y 
    170    |    synfuels-corporation-cutback=n: republican (97.25%) 
    171    |    synfuels-corporation-cutback=y 
    172    |    |    mx-missile=n 
    173    |    |    |    el-salvador-aid=y 
    174    |    |    |    |    adoption-of-the-budget-resolution=n: republican (85.33%) 
    175    |    |    |    |    adoption-of-the-budget-resolution=y 
    176    |    |    |    |    |    anti-satellite-test-ban=n: democrat (99.54%) 
    177    |    |    |    |    |    anti-satellite-test-ban=y: republican (100.00%) 
    178    |    |    |    el-salvador-aid=n 
    179    |    |    |    |    handicapped-infants=n: republican (100.00%) 
    180    |    |    |    |    handicapped-infants=y: democrat (99.77%) 
    181    |    |    mx-missile=y 
    182    |    |    |    religious-groups-in-schools=y: democrat (99.54%) 
    183    |    |    |    religious-groups-in-schools=n 
    184    |    |    |    |    immigration=y: republican (98.63%) 
    185    |    |    |    |    immigration=n 
    186    |    |    |    |    |    handicapped-infants=n: republican (98.63%) 
    187    |    |    |    |    |    handicapped-infants=y: democrat (99.77%) 
    188  
    189 The printout includes the feature on which the tree branches in the 
    190 internal nodes. For leaves, it shows the the class label to which a 
    191 tree would make a classification. The probability of that class, as 
    192 estimated from the training data set, is also displayed. 
    193  
    194 If you are more of a *visual* type, you may like the graphical  
    195 presentation of the tree better. This was achieved by printing out a 
    196 tree in so-called dot file (the line of the script required for this 
    197 is ``orngTree.printDot(tree, fileName='tree.dot', 
    198 internalNodeShape="ellipse", leafShape="box")``), which was then 
    199 compiled to PNG using program called `dot`_. 
     160The following figure shows an example of such a rendering. 
    200161 
    201162.. image:: files/tree.png 
    202163   :alt: A graphical presentation of a classification tree 
    203  
    204 .. _dot: http://graphviz.org/ 
    205  
    206 Nearest neighbors and majority classifiers 
    207 ------------------------------------------ 
    208  
    209 .. index:: classifiers 
    210 .. index::  
    211    single: classifiers; k nearest neighbours 
    212 .. index::  
    213    single: classifiers; majority classifier 
    214  
    215 Let us here check on two other classifiers. Majority classifier always 
    216 classifies to the majority class of the training set, and predicts  
    217 class probabilities that are equal to class distributions from the training 
    218 set. While being useless as such, it may often be good to compare this 
    219 simplest classifier to any other classifier you test &ndash; if your 
    220 other classifier is not significantly better than majority classifier, 
    221 than this may a reason to sit back and think. 
    222  
    223 The second classifier we are introducing here is based on k-nearest 
    224 neighbors algorithm, an instance-based method that finds k examples 
    225 from training set that are most similar to the instance that has to be 
    226 classified. From the set it obtains in this way, it estimates class 
    227 probabilities and uses the most frequent class for prediction. 
    228  
    229 The following script takes naive Bayes, classification tree (what we 
    230 have already learned), majority and k-nearest neighbors classifier 
    231 (new ones) and prints prediction for first 10 instances of voting data 
    232 set (:download:`handful.py <code/handful.py>`):: 
    233  
    234    import orange, orngTree 
    235    data = orange.ExampleTable("voting") 
    236     
    237    # setting up the classifiers 
    238    majority = orange.MajorityLearner(data) 
    239    bayes = orange.BayesLearner(data) 
    240    tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2) 
    241    knn = orange.kNNLearner(data, k=21) 
    242     
    243    majority.name="Majority"; bayes.name="Naive Bayes"; 
    244    tree.name="Tree"; knn.name="kNN" 
    245     
    246    classifiers = [majority, bayes, tree, knn] 
    247     
    248    # print the head 
    249    print "Possible classes:", data.domain.classVar.values 
    250    print "Probability for republican:" 
    251    print "Original Class", 
    252    for l in classifiers: 
    253        print "%-13s" % (l.name), 
    254    print 
    255     
    256    # classify first 10 instances and print probabilities 
    257    for example in data[:10]: 
    258        print "(%-10s)  " % (example.getclass()), 
    259        for c in classifiers: 
    260            p = apply(c, [example, orange.GetProbabilities]) 
    261            print "%5.3f        " % (p[0]), 
    262        print 
    263  
    264 The code is somehow long, due to our effort to print the results 
    265 nicely. The first part of the code sets-up our four classifiers, and 
    266 gives them names. Classifiers are then put into the list denoted with 
    267 variable ``classifiers`` (this is nice since, if we would need to add 
    268 another classifier, we would just define it and put it in the list, 
    269 and for the rest of the code we would not worry about it any 
    270 more). The script then prints the header with the names of the 
    271 classifiers, and finally uses the classifiers to compute the 
    272 probabilities of classes. Note for a special function ``apply`` that 
    273 we have not met yet: it simply calls a function that is given as its 
    274 first argument, and passes it the arguments that are given in the 
    275 list. In our case, ``apply`` invokes our classifiers with a data 
    276 instance and request to compute probabilities. The output of our 
    277 script is:: 
    278  
    279    Possible classes: <republican, democrat> 
    280    Probability for republican: 
    281    Original Class Majority      Naive Bayes   Tree          kNN 
    282    (republican)   0.386         1.000         0.949         1.000 
    283    (republican)   0.386         1.000         0.973         1.000 
    284    (democrat  )   0.386         0.995         0.011         0.138 
    285    (democrat  )   0.386         0.002         0.015         0.468 
    286    (democrat  )   0.386         0.043         0.015         0.035 
    287    (democrat  )   0.386         0.228         0.015         0.442 
    288    (democrat  )   0.386         1.000         0.973         0.977 
    289    (republican)   0.386         1.000         0.973         1.000 
    290    (republican)   0.386         1.000         0.973         1.000 
    291    (democrat  )   0.386         0.000         0.015         0.000 
    292  
    293 .. note:: 
    294    The prediction of majority class classifier does not depend on the 
    295    instance it classifies (of course!). 
    296  
    297 .. note::  
    298    At this stage, it would be inappropriate to say anything conclusive 
    299    on the predictive quality of the classifiers - for this, we will 
    300    need to resort to statistical methods on comparison of 
    301    classification models. 