Changeset 11058:f7f0c7b584d7 in orange


Timestamp:
01/06/13 20:18:48 (16 months ago)
Author:
Miha Stajdohar <miha.stajdohar@…>
Branch:
default
Children:
11059:83e86ea77981, 11060:340b8bf1cbb4
Parents:
11057:3da1cf37de17 (diff), 11056:a68fd2fce444 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merged with tutorial updates.

Location:
docs/tutorial/rst
Files:
78 deleted
4 edited

Legend:

Unmodified
Added
Removed
  • docs/tutorial/rst/classification.rst

    r9994 r11058  
    33 
    44.. index:: classification 
    5 .. index:: supervised data mining 
     5.. index::  
     6   single: data mining; supervised 
    67 
    7 A substantial part of Orange is devoted to machine learning methods 
    8 for classification, or supervised data mining. These methods start 
    9 from the data that incorporates class-labeled instances, like 
    10 :download:`voting.tab <code/voting.tab>`:: 
     8Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on 
     9data with class-labeled instances, such as the senate voting data. Here is code that loads this data set, displays the first data instance and prints its class (``republican``)::
    1110 
    12    >>> data = orange.ExampleTable("voting.tab") 
     11   >>> data = Orange.data.Table("voting") 
    1312   >>> data[0] 
    1413   ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican'] 
    15    >>> data[0].getclass() 
     14   >>> data[0].get_class() 
    1615   <orange.Value 'party'='republican'> 
    1716 
    18 Supervised data mining attempts to develop predictive models from such 
    19 data that, given the set of feature values, predict a corresponding 
    20 class. 
     17Learners and Classifiers 
     18------------------------ 
    2119 
    22 .. index:: classifiers 
    2320.. index:: 
    24    single: classifiers; naive Bayesian 
     21   single: classification; learner 
     22.. index:: 
     23   single: classification; classifier 
     24.. index:: 
     25   single: classification; naive Bayesian classifier 
    2526 
    26 There are two types of objects important for classification: learners 
    27 and classifiers. Orange has a number of build-in learners. For 
    28 instance, ``orange.BayesLearner`` is a naive Bayesian learner. When 
    29 data is passed to a learner (e.g., ``orange.BayesLearner(data))``, it 
    30 returns a classifier. When data instance is presented to a classifier, 
    31 it returns a class, vector of class probabilities, or both. 
     27Classification uses two types of objects: learners and classifiers. Learners consider class-labeled data and return a classifier. Given a data instance (a vector of feature values), classifiers return a predicted class:: 
    3228 
    33 A Simple Classifier 
    34 ------------------- 
     29    >>> import Orange 
     30    >>> data = Orange.data.Table("voting") 
     31    >>> learner = Orange.classification.bayes.NaiveLearner() 
     32    >>> classifier = learner(data) 
     33    >>> classifier(data[0]) 
     34    <orange.Value 'party'='republican'> 
    3535 
    36 Let us see how this works in practice. We will 
    37 construct a naive Bayesian classifier from voting data set, and 
    38 will use it to classify the first five instances from this data set 
    39 (:download:`classifier.py <code/classifier.py>`):: 
     36Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used the classifier to predict the class of the first data item. The following code uses the same concepts to predict the classes of the first five instances in the data set:
    4037 
    41    import orange 
    42    data = orange.ExampleTable("voting") 
    43    classifier = orange.BayesLearner(data) 
    44    for i in range(5): 
    45        c = classifier(data[i]) 
    46        print "original", data[i].getclass(), "classified as", c 
     38.. literalinclude:: code/classification-classifier1.py
     39   :lines: 4- 
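The included file is not shown in this changeset; a minimal sketch of such a script, assuming only the Orange 2.x calls used above (the output formatting is illustrative)::

   import Orange

   data = Orange.data.Table("voting")
   learner = Orange.classification.bayes.NaiveLearner()
   classifier = learner(data)

   # predict the class of the first five instances and compare the
   # prediction with the class recorded in the data
   for d in data[:5]:
       print "%10s; originally %s" % (classifier(d), d.get_class())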
    4740 
    48 The script loads the data, uses it to constructs a classifier using 
    49 naive Bayesian method, and then classifies first five instances of the 
    50 data set. Note that both original class and the class assigned by a 
    51 classifier is printed out. 
     41The script outputs:: 
    5242 
    53 The data set that we use includes votes for each of the U.S.  House of 
    54 Representatives Congressmen on the 16 key votes; a class is a 
    55 representative's party. There are 435 data instances - 267 democrats 
    56 and 168 republicans - in the data set (see UCI ML Repository and 
    57 voting-records data set for further description).  This is how our 
    58 classifier performs on the first five instances: 
     43    republican; originally republican 
     44    republican; originally republican 
     45    republican; originally democrat 
     46      democrat; originally democrat 
     47      democrat; originally democrat 
    5948 
    60    1: republican (originally republican) 
    61    2: republican (originally republican) 
    62    3: republican (originally democrat) 
    63    4: democrat (originally democrat) 
    64    5: democrat (originally democrat) 
     49The naive Bayesian classifier made a mistake on the third instance, but otherwise predicted correctly. No wonder, since this was also the data it was trained on.
    6550 
    66 Naive Bayes made a mistake at a third instance, but otherwise predicted 
    67 correctly. 
    68  
    69 Obtaining Class Probabilities 
    70 ----------------------------- 
     51Probabilistic Classification 
     52---------------------------- 
    7153 
    7254To find out the probability that the classifier assigns 
    7355to, say, the democrat class, we need to call the classifier with 
    74 additional parameter ``orange.GetProbabilities``. Also, note that the 
    75 democrats have a class index 1. We find this out with print 
    76 ``data.domain.classVar.values`` (:download:`classifier2.py <code/classifier2.py>`):: 
     56an additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities:
    7757 
    78    import orange 
    79    data = orange.ExampleTable("voting") 
    80    classifier = orange.BayesLearner(data) 
    81    print "Possible classes:", data.domain.classVar.values 
    82    print "Probabilities for democrats:" 
    83    for i in range(5): 
    84        p = classifier(data[i], orange.GetProbabilities) 
    85        print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass()) 
     58.. literalinclude:: code/classification-classifier2.py
     59   :lines: 4- 
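The referenced script is likewise not shown here; a minimal sketch of probabilistic classification, assuming the ``GetProbabilities`` flag described above (the index of the ``democrat`` class, 1 in this data set, depends on the order of class values in the data)::

   import Orange

   data = Orange.data.Table("voting")
   classifier = Orange.classification.bayes.NaiveLearner(data)

   print "Probabilities for democrat:"
   for d in data[:5]:
       # ask the classifier for a class distribution instead of a single value
       p = classifier(d, Orange.classification.Classifier.GetProbabilities)
       print "%5.3f; originally %s" % (p[1], d.get_class())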
    8660 
    87 The output of this script is:: 
     61The output of the script also shows how badly the naive Bayesian classifier missed the class for the third data item::
    8862 
    89    Possible classes: <republican, democrat> 
    90    Probabilities for democrats: 
    91    1: 0.000 (originally republican) 
    92    2: 0.000 (originally republican) 
    93    3: 0.005 (originally democrat) 
    94    4: 0.998 (originally democrat) 
    95    5: 0.957 (originally democrat) 
     63   Probabilities for democrat: 
     64   0.000; originally republican 
     65   0.000; originally republican 
     66   0.005; originally democrat 
     67   0.998; originally democrat 
     68   0.957; originally democrat 
    9669 
    97 The printout, for example, shows that with the third instance 
    98 naive Bayes has not only misclassified, but the classifier missed 
    99 quite substantially; it has assigned only a 0.005 probability to 
    100 the correct class. 
     70Cross-Validation 
     71---------------- 
    10172 
    102 .. note:: 
    103    Python list indexes start with 0. 
     73.. index:: cross-validation 
    10474 
    105 .. note:: 
    106    The ordering of class values depend on occurence of classes in the 
    107    input data set. 
     75Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any performance measure that assesses accuracy should be estimated on an independent test set. One such procedure is `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time using different training and test subsets sampled from the original data set:
    10876 
    109 Classification tree 
    110 ------------------- 
     77.. literalinclude:: code/classification-cv.py
     78   :lines: 3- 
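The cross-validation script is referenced but not reproduced in this diff; a minimal sketch under the assumption that ``Orange.evaluation.testing.cross_validation`` and the ``CA``/``AUC`` scoring functions are available (they are in Orange 2.5, but treat the exact names as assumptions)::

   import Orange

   data = Orange.data.Table("voting")
   bayes = Orange.classification.bayes.NaiveLearner()

   # cross_validation takes a list of learners and returns test results
   res = Orange.evaluation.testing.cross_validation([bayes], data, folds=5)
   print "Accuracy: %.2f" % Orange.evaluation.scoring.CA(res)[0]
   print "AUC:      %.2f" % Orange.evaluation.scoring.AUC(res)[0]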
    11179 
    112 .. index:: classifiers 
    11380.. index:: 
    114    single: classifiers; classification trees 
     81   single: classification; scoring 
     82.. index:: 
     83   single: classification; area under ROC 
     84.. index:: 
     85   single: classification; accuracy 
    11586 
    116 Classification tree learner (yes, this is the same *decision tree*) 
    117 is a native Orange learner, but because it is a rather 
    118 complex object that is for its versatility composed of a number of 
    119 other objects (for attribute estimation, stopping criterion, etc.), 
    120 a wrapper (module) called ``orngTree`` was build around it to simplify 
    121 the use of classification trees and to assemble the learner with 
    122 some usual (default) components. Here is a script with it (:download:`tree.py <code/tree.py>`):: 
     87Cross-validation expects a list of learners, and the performance estimators likewise return a list of scores, one for every learner. There was just one learner in the script above, hence a list of size one. The script estimates classification accuracy and the area under the ROC curve. The latter score is very high, indicating very good performance of the naive Bayesian learner on the senate voting data set::
    12388 
    124    import orange, orngTree 
    125    data = orange.ExampleTable("voting") 
     89   Accuracy: 0.90 
     90   AUC:      0.97 
     91 
     92 
     93Handful of Classifiers 
     94---------------------- 
     95 
     96Orange includes a wide range of classification algorithms, including:
     97 
     98- logistic regression (``Orange.classification.logreg``) 
     99- k-nearest neighbors (``Orange.classification.knn``) 
     100- support vector machines (``Orange.classification.svm``) 
     101- classification trees (``Orange.classification.tree``) 
     102- classification rules (``Orange.classification.rules``) 
     103 
     104Several of these are used in the code below, which estimates the probability of a target class on test data. This time, the training and test data sets are disjoint:
     105 
     106.. index:: 
     107   single: classification; logistic regression 
     108.. index:: 
     109   single: classification; trees 
     110.. index:: 
     111   single: classification; k-nearest neighbors 
     112 
     113.. literalinclude:: code/classification-other.py
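classification-other.py is not reproduced in this diff; a rough sketch of the idea, assuming the learners live in the modules listed above (the split into training and test sets and the parameter values are illustrative)::

   import Orange

   data = Orange.data.Table("voting")

   # use the first five instances for testing and the rest for training
   selection = [0] * 5 + [1] * (len(data) - 5)
   test = data.select(selection, 0)
   train = data.select(selection, 1)

   tree = Orange.classification.tree.TreeLearner(train)
   knn = Orange.classification.knn.kNNLearner(train, k=21)
   lr = Orange.classification.logreg.LogRegLearner(train)

   # republican is the first class value (index 0) in this data set
   print "Probabilities for republican:"
   print "original class  tree      k-NN      lr"
   for d in test:
       probs = [c(d, Orange.classification.Classifier.GetProbabilities)[0]
                for c in (tree, knn, lr)]
       print "%-15s" % d.get_class() + " ".join("%9.3f" % p for p in probs)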
     114 
     115For these five data items, there are no major differences between the predictions of the classification algorithms considered::
     116 
     117   Probabilities for republican: 
     118   original class  tree      k-NN      lr        
     119   republican      0.949     1.000     1.000 
     120   republican      0.972     1.000     1.000 
     121   democrat        0.011     0.078     0.000 
     122   democrat        0.015     0.001     0.000 
     123   democrat        0.015     0.032     0.000 
     124 
     125The following code cross-validates several learners. Notice the difference from the code above: cross-validation requires learners (not yet given any data), while in the script above the learners were immediately given the data and the calls returned classifiers.
     126 
     127.. literalinclude:: code/classification-cv2.py
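classification-cv2.py is also only referenced; a sketch of cross-validating several learners, under the same assumptions about ``Orange.evaluation`` as above::

   import Orange

   data = Orange.data.Table("voting")

   nbc = Orange.classification.bayes.NaiveLearner()
   nbc.name = "nbc"
   tree = Orange.classification.tree.TreeLearner()
   tree.name = "tree"
   lr = Orange.classification.logreg.LogRegLearner()
   lr.name = "lr"
   learners = [nbc, tree, lr]

   res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
   print "         " + " ".join("%-4s" % l.name for l in learners)
   print "Accuracy " + " ".join("%.2f" % s for s in Orange.evaluation.scoring.CA(res))
   print "AUC      " + " ".join("%.2f" % s for s in Orange.evaluation.scoring.AUC(res))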
     128 
     129Logistic regression wins in area under the ROC curve::
     130 
     131            nbc  tree lr   
     132   Accuracy 0.90 0.95 0.94 
     133   AUC      0.97 0.94 0.99 
     134 
     135Reporting on Classification Models 
     136---------------------------------- 
     137 
     138Classification models are objects that expose every component of their structure. For instance, one can traverse a classification tree in code and observe the associated data instances, probabilities and conditions. It is often sufficient, however, to provide a textual output of the model. For logistic regression and trees, this is illustrated in the script below:
     139 
     140.. literalinclude:: code/classification-models.py
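The models script is referenced but not included; a sketch of printing a logistic regression model and a tree. ``Orange.classification.logreg.dump()`` and ``to_string()`` are assumptions about the Orange 2.x API; ``dot()`` is used later in this section::

   import Orange

   data = Orange.data.Table("titanic")

   tree = Orange.classification.tree.TreeLearner(data)
   lr = Orange.classification.logreg.LogRegLearner(data)

   # textual report of the logistic regression model (assumed helper)
   print Orange.classification.logreg.dump(lr)

   # textual and graphical (dot) rendering of the tree (to_string assumed)
   print tree.to_string()
   tree.dot(file_name="tree.dot", node_shape="ellipse", leaf_shape="box")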
     141 
     142The logistic regression part of the output is::
    126143    
    127    tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2) 
    128    print "Possible classes:", data.domain.classVar.values 
    129    print "Probabilities for democrats:" 
    130    for i in range(5): 
    131        p = tree(data[i], orange.GetProbabilities) 
    132        print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass()) 
     144   class attribute = survived 
     145   class values = <no, yes> 
     146 
     147         Feature       beta  st. error     wald Z          P OR=exp(beta) 
    133148    
    134    orngTree.printTxt(tree) 
     149       Intercept      -1.23       0.08     -15.15      -0.00 
     150    status=first       0.86       0.16       5.39       0.00       2.36 
     151   status=second      -0.16       0.18      -0.91       0.36       0.85 
     152    status=third      -0.92       0.15      -6.12       0.00       0.40 
     153       age=child       1.06       0.25       4.30       0.00       2.89 
     154      sex=female       2.42       0.14      17.04       0.00      11.25 
    135155 
    136 .. note::  
    137    The script for classification tree is almost the same as the one 
    138    for naive Bayes (:download:`classifier2.py <code/classifier2.py>`), except that we have imported 
    139    another module (``orngTree``) and used learner 
    140    ``orngTree.TreeLearner`` to build a classifier called ``tree``. 
     156Trees can also be rendered in `dot <http://en.wikipedia.org/wiki/DOT_language>`_:: 
    141157 
    142 .. note:: 
    143    For those of you that are at home with machine learning: the 
    144    default parameters for tree learner assume that a single example is 
    145    enough to have a leaf for it, gain ratio is used for measuring the 
    146    quality of attributes that are considered for internal nodes of the 
    147    tree, and after the tree is constructed the subtrees no pruning 
    148    takes place. 
     158   tree.dot(file_name="0.dot", node_shape="ellipse", leaf_shape="box") 
    149159 
    150 The resulting tree with default parameters would be rather big, so we 
    151 have additionally requested that leaves that share common predecessor 
    152 (node) are pruned if they classify to the same class, and requested 
    153 that tree is post-pruned using m-error estimate pruning method with 
    154 parameter m set to 2.0. The output of our script is:: 
    155  
    156    Possible classes: <republican, democrat> 
    157    Probabilities for democrats: 
    158    1: 0.051 (originally republican) 
    159    2: 0.027 (originally republican) 
    160    3: 0.989 (originally democrat) 
    161    4: 0.985 (originally democrat) 
    162    5: 0.985 (originally democrat) 
    163  
    164 Notice that all of the instances are classified correctly. The last 
    165 line of the script prints out the tree that was used for 
    166 classification:: 
    167  
    168    physician-fee-freeze=n: democrat (98.52%) 
    169    physician-fee-freeze=y 
    170    |    synfuels-corporation-cutback=n: republican (97.25%) 
    171    |    synfuels-corporation-cutback=y 
    172    |    |    mx-missile=n 
    173    |    |    |    el-salvador-aid=y 
    174    |    |    |    |    adoption-of-the-budget-resolution=n: republican (85.33%) 
    175    |    |    |    |    adoption-of-the-budget-resolution=y 
    176    |    |    |    |    |    anti-satellite-test-ban=n: democrat (99.54%) 
    177    |    |    |    |    |    anti-satellite-test-ban=y: republican (100.00%) 
    178    |    |    |    el-salvador-aid=n 
    179    |    |    |    |    handicapped-infants=n: republican (100.00%) 
    180    |    |    |    |    handicapped-infants=y: democrat (99.77%) 
    181    |    |    mx-missile=y 
    182    |    |    |    religious-groups-in-schools=y: democrat (99.54%) 
    183    |    |    |    religious-groups-in-schools=n 
    184    |    |    |    |    immigration=y: republican (98.63%) 
    185    |    |    |    |    immigration=n 
    186    |    |    |    |    |    handicapped-infants=n: republican (98.63%) 
    187    |    |    |    |    |    handicapped-infants=y: democrat (99.77%) 
    188  
    189 The printout includes the feature on which the tree branches in the 
    190 internal nodes. For leaves, it shows the the class label to which a 
    191 tree would make a classification. The probability of that class, as 
    192 estimated from the training data set, is also displayed. 
    193  
    194 If you are more of a *visual* type, you may like the graphical  
    195 presentation of the tree better. This was achieved by printing out a 
    196 tree in so-called dot file (the line of the script required for this 
    197 is ``orngTree.printDot(tree, fileName='tree.dot', 
    198 internalNodeShape="ellipse", leafShape="box")``), which was then 
    199 compiled to PNG using program called `dot`_. 
     160The following figure shows an example of such a rendering.
    200161 
    201162.. image:: files/tree.png 
    202163   :alt: A graphical presentation of a classification tree 
    203  
    204 .. _dot: http://graphviz.org/ 
    205  
    206 Nearest neighbors and majority classifiers 
    207 ------------------------------------------ 
    208  
    209 .. index:: classifiers 
    210 .. index::  
    211    single: classifiers; k nearest neighbours 
    212 .. index::  
    213    single: classifiers; majority classifier 
    214  
    215 Let us here check on two other classifiers. Majority classifier always 
    216 classifies to the majority class of the training set, and predicts  
    217 class probabilities that are equal to class distributions from the training 
    218 set. While being useless as such, it may often be good to compare this 
    219 simplest classifier to any other classifier you test &ndash; if your 
    220 other classifier is not significantly better than majority classifier, 
    221 than this may a reason to sit back and think. 
    222  
    223 The second classifier we are introducing here is based on k-nearest 
    224 neighbors algorithm, an instance-based method that finds k examples 
    225 from training set that are most similar to the instance that has to be 
    226 classified. From the set it obtains in this way, it estimates class 
    227 probabilities and uses the most frequent class for prediction. 
    228  
    229 The following script takes naive Bayes, classification tree (what we 
    230 have already learned), majority and k-nearest neighbors classifier 
    231 (new ones) and prints prediction for first 10 instances of voting data 
    232 set (:download:`handful.py <code/handful.py>`):: 
    233  
    234    import orange, orngTree 
    235    data = orange.ExampleTable("voting") 
    236     
    237    # setting up the classifiers 
    238    majority = orange.MajorityLearner(data) 
    239    bayes = orange.BayesLearner(data) 
    240    tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2) 
    241    knn = orange.kNNLearner(data, k=21) 
    242     
    243    majority.name="Majority"; bayes.name="Naive Bayes"; 
    244    tree.name="Tree"; knn.name="kNN" 
    245     
    246    classifiers = [majority, bayes, tree, knn] 
    247     
    248    # print the head 
    249    print "Possible classes:", data.domain.classVar.values 
    250    print "Probability for republican:" 
    251    print "Original Class", 
    252    for l in classifiers: 
    253        print "%-13s" % (l.name), 
    254    print 
    255     
    256    # classify first 10 instances and print probabilities 
    257    for example in data[:10]: 
    258        print "(%-10s)  " % (example.getclass()), 
    259        for c in classifiers: 
    260            p = apply(c, [example, orange.GetProbabilities]) 
    261            print "%5.3f        " % (p[0]), 
    262        print 
    263  
    264 The code is somehow long, due to our effort to print the results 
    265 nicely. The first part of the code sets-up our four classifiers, and 
    266 gives them names. Classifiers are then put into the list denoted with 
    267 variable ``classifiers`` (this is nice since, if we would need to add 
    268 another classifier, we would just define it and put it in the list, 
    269 and for the rest of the code we would not worry about it any 
    270 more). The script then prints the header with the names of the 
    271 classifiers, and finally uses the classifiers to compute the 
    272 probabilities of classes. Note for a special function ``apply`` that 
    273 we have not met yet: it simply calls a function that is given as its 
    274 first argument, and passes it the arguments that are given in the 
    275 list. In our case, ``apply`` invokes our classifiers with a data 
    276 instance and request to compute probabilities. The output of our 
    277 script is:: 
    278  
    279    Possible classes: <republican, democrat> 
    280    Probability for republican: 
    281    Original Class Majority      Naive Bayes   Tree          kNN 
    282    (republican)   0.386         1.000         0.949         1.000 
    283    (republican)   0.386         1.000         0.973         1.000 
    284    (democrat  )   0.386         0.995         0.011         0.138 
    285    (democrat  )   0.386         0.002         0.015         0.468 
    286    (democrat  )   0.386         0.043         0.015         0.035 
    287    (democrat  )   0.386         0.228         0.015         0.442 
    288    (democrat  )   0.386         1.000         0.973         0.977 
    289    (republican)   0.386         1.000         0.973         1.000 
    290    (republican)   0.386         1.000         0.973         1.000 
    291    (democrat  )   0.386         0.000         0.015         0.000 
    292  
    293 .. note:: 
    294    The prediction of majority class classifier does not depend on the 
    295    instance it classifies (of course!). 
    296  
    297 .. note::  
    298    At this stage, it would be inappropriate to say anything conclusive 
    299    on the predictive quality of the classifiers - for this, we will 
    300    need to resort to statistical methods on comparison of 
    301    classification models. 
  • docs/tutorial/rst/ensembles.rst

    r9994 r11058  
    11.. index:: ensembles 
     2 
     3Ensembles 
     4========= 
     5 
     6`Learning of ensembles <http://en.wikipedia.org/wiki/Ensemble_learning>`_ combines the predictions of separate models to improve accuracy. The models may come from different training data samples, or may use different learners on the same data set. Learners may also be diversified by changing their parameter sets.
     7 
     8In Orange, ensembles are simply wrappers around learners. They behave just like any other learner. Given the data, they return models that can predict the outcome for any data instance:: 
     9 
     10   >>> import Orange 
     11   >>> data = Orange.data.Table("housing") 
     12   >>> tree = Orange.classification.tree.TreeLearner() 
     13   >>> btree = Orange.ensemble.bagging.BaggedLearner(tree) 
     14   >>> btree 
     15   BaggedLearner 'Bagging' 
     16   >>> btree(data) 
     17   BaggedClassifier 'Bagging' 
     18   >>> btree(data)(data[0]) 
     19   <orange.Value 'MEDV'='24.6'> 
     20 
     21The last line builds a predictor (``btree(data)``) and then uses it on the first data instance.
     22 
     23Most ensemble methods can wrap either classification or regression learners. Exceptions are task-specialized techniques such as boosting. 
     24 
     25Bagging and Boosting 
     26-------------------- 
     27 
    228.. index::  
    329   single: ensembles; bagging 
     30 
     31`Bootstrap aggregating <http://en.wikipedia.org/wiki/Bootstrap_aggregating>`_, or bagging, samples the training data uniformly and with replacement to train different predictors. Their independent predictions are then combined into a single prediction by majority vote (classification) or averaging (regression).
     32 
    433.. index::  
    534   single: ensembles; boosting 
    635 
    7 Ensemble learners 
    8 ================= 
     36In general, boosting is a technique that combines weak learners into a single strong learner. Orange implements `AdaBoost <http://en.wikipedia.org/wiki/AdaBoost>`_, which assigns weights to data instances according to the performance of the learner. AdaBoost uses these weights to iteratively resample the instances, focusing on those that are harder to classify. In the aggregation, AdaBoost gives more weight to individual classifiers that perform better on their training sets.
    937 
    10 Building ensemble classifiers in Orange is simple and easy. Starting 
    11 from learners/classifiers that can predict probabilities and, if 
    12 needed, use example weights, ensembles are actually wrappers that can 
    13 aggregate predictions from a list of constructed classifiers. These 
    14 wrappers behave exactly like other Orange learners/classifiers. We 
    15 will here first show how to use a module for bagging and boosting that 
    16 is included in Orange distribution (:py:mod:`Orange.ensemble` module), and 
    17 then, for a somehow more advanced example build our own ensemble 
    18 learner. Using this module, using it is very easy: you have to define 
    19 a learner, give it to bagger or booster, which in turn returns a new 
    20 (boosted or bagged) learner. Here goes an example (:download:`ensemble3.py <code/ensemble3.py>`):: 
     38The following script wraps a classification tree in boosted and bagged learners, and tests all three learners through cross-validation:
    2139 
    22    import orange, orngTest, orngStat, orngEnsemble 
    23    data = orange.ExampleTable("promoters") 
    24     
    25    majority = orange.MajorityLearner() 
    26    majority.name = "default" 
    27    knn = orange.kNNLearner(k=11) 
    28    knn.name = "k-NN (k=11)" 
    29     
    30    bagged_knn = orngEnsemble.BaggedLearner(knn, t=10) 
    31    bagged_knn.name = "bagged k-NN" 
    32    boosted_knn = orngEnsemble.BoostedLearner(knn, t=10) 
    33    boosted_knn.name = "boosted k-NN" 
    34     
    35    learners = [majority, knn, bagged_knn, boosted_knn] 
    36    results = orngTest.crossValidation(learners, data, folds=10) 
    37    print "        Learner   CA     Brier Score" 
    38    for i in range(len(learners)): 
    39        print ("%15s:  %5.3f  %5.3f") % (learners[i].name, 
    40            orngStat.CA(results)[i], orngStat.BrierScore(results)[i]) 
     40.. literalinclude:: code/ensemble-bagging.py 
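ensemble-bagging.py is referenced above; a minimal sketch of the idea, with the data set, fold count and the boosting module path (``Orange.ensemble.boosting.BoostedLearner``) as assumptions::

   import Orange

   data = Orange.data.Table("promoters")

   tree = Orange.classification.tree.TreeLearner()
   tree.name = "tree"
   boost = Orange.ensemble.boosting.BoostedLearner(tree)
   boost.name = "boost"
   bagg = Orange.ensemble.bagging.BaggedLearner(tree)
   bagg.name = "bagg"

   learners = [tree, boost, bagg]
   res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
   for learner, auc in zip(learners, Orange.evaluation.scoring.AUC(res)):
       print "%5s: %.2f" % (learner.name, auc)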
    4141 
    42 Most of the code is used for defining and naming objects that learn, 
    43 and the last piece of code is to report evaluation results. Notice 
    44 that to bag or boost a learner, it takes only a single line of code 
    45 (like, ``bagged_knn = orngEnsemble.BaggedLearner(knn, t=10)``)! 
    46 Parameter ``t`` in bagging and boosting refers to number of 
    47 classifiers that will be used for voting (or, if you like better, 
    48 number of iterations by boosting/bagging). Depending on your random 
    49 generator, you may get something like:: 
     42The benefit of the two ensembling techniques, assessed in terms of area under the ROC curve, is obvious::
    5043 
    51            Learner   CA     Brier Score 
    52            default:  0.473  0.501 
    53        k-NN (k=11):  0.859  0.240 
    54        bagged k-NN:  0.813  0.257 
    55       boosted k-NN:  0.830  0.244 
     44    tree: 0.83 
     45   boost: 0.90 
     46    bagg: 0.91 
    5647 
     48Stacking 
     49-------- 
    5750 
     51.. index::  
     52   single: ensembles; stacking 
     53 
     54Consider partitioning a training set into a held-in and a held-out set. Assume that our task is the prediction of y, either the probability of the target class in classification or a real value in regression. We are given a set of learners. We train them on the held-in set and obtain a vector of predictions on the held-out set, where each element of the vector corresponds to the prediction of an individual predictor. We can now learn how to combine these predictions into a target prediction by training a new predictor on the data set of predictions and the true values of y from the held-out set. The technique is called `stacked generalization <http://en.wikipedia.org/wiki/Ensemble_learning#Stacking>`_, or stacking for short. Instead of a single split into held-in and held-out data sets, the vectors of predictions are obtained through cross-validation.
     55 
     56Orange provides a wrapper for stacking that is given a set of base learners and a meta learner: 
     57 
     58.. literalinclude:: code/ensemble-stacking.py 
     59   :lines: 3- 
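A sketch of what the stacking script might contain, using the ``StackedClassificationLearner`` signature quoted below (the base learners and the data set are illustrative)::

   import Orange

   data = Orange.data.Table("promoters")

   bayes = Orange.classification.bayes.NaiveLearner()
   tree = Orange.classification.tree.TreeLearner()
   knn = Orange.classification.knn.kNNLearner()
   base_learners = [bayes, tree, knn]

   # stack the base learners; the default meta learner is naive Bayes
   stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners)

   learners = [stack, bayes, tree, knn]
   names = ["stacking", "bayes", "tree", "knn"]
   res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
   for name, auc in zip(names, Orange.evaluation.scoring.AUC(res)):
       print "%9s: %.3f" % (name, auc)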
     60 
     61By default, the meta classifier is a naive Bayesian classifier. Changing it to logistic regression may be a good idea as well::
     62 
     63    stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners, \ 
     64               meta_learner=Orange.classification.logreg.LogRegLearner) 
     65 
     66Stacking is often better than each of the base learners alone, as also demonstrated by running our script:: 
     67 
     68   stacking: 0.967 
     69      bayes: 0.933 
     70       tree: 0.836 
     71        knn: 0.947 
     72 
     73Random Forests 
     74-------------- 
     75 
     76.. index::  
     77   single: ensembles; random forests 
     78 
     79A `random forest <http://en.wikipedia.org/wiki/Random_forest>`_ is an ensemble of tree predictors. The diversity of the trees is achieved by randomizing the feature selection for node split criteria: instead of the best feature, one is picked at random from a set of the best features. Another source of randomization is the bootstrap sample of the data from which each tree is grown. Predictions from typically several hundred trees are aggregated by voting. Constructing so many trees may be computationally demanding, so Orange uses a special tree inducer (``Orange.classification.tree.SimpleTreeLearner``, used by default) optimized for speed in random forest construction:
     80 
     81.. literalinclude:: code/ensemble-forest.py 
     82   :lines: 3- 
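ensemble-forest.py is only referenced; a sketch with the ``trees`` parameter of ``RandomForestLearner`` and the choice of data set as assumptions::

   import Orange

   data = Orange.data.Table("voting")

   forest = Orange.ensemble.forest.RandomForestLearner(trees=50)
   forest.name = "forest"
   bayes = Orange.classification.bayes.NaiveLearner()
   bayes.name = "bayes"
   knn = Orange.classification.knn.kNNLearner()
   knn.name = "knn"

   learners = [forest, bayes, knn]
   res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
   for learner, auc in zip(learners, Orange.evaluation.scoring.AUC(res)):
       print "%6s: %.3f" % (learner.name, auc)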
     83 
     84Random forests are often superior to other base classification or regression learners::
     85 
     86   forest: 0.976 
     87    bayes: 0.935 
     88      knn: 0.952 
  • docs/tutorial/rst/index.rst

    r11047 r11058  
    33############### 
    44 
    5 If you are new to Orange, then this is probably the best place to start. This 
    6 tutorial was written with a purpose to provide a gentle tutorial over basic 
    7 functionality of Orange. As Orange is integrated within `Python <http://www.python.org/>`_, the tutorial 
    8 is in essence a guide through some basic Orange scripting in this language. 
    9 Although relying on Python, those of you who have some knowledge on programming 
    10 won't need to learn Python first: the tutorial should be simple enough to skip 
    11 learning Python itself. 
     5This is a gentle introduction to scripting in Orange. Orange is a `Python <http://www.python.org/>`_ library, and the tutorial is a guide through Orange scripting in this language.
    126 
    13 Contents: 
     7Here we assume you have already `downloaded and installed Orange <http://orange.biolab.si/download/>`_ and have a working version of Python. Python scripts can be run in a terminal window, in integrated environments like `PyCharm <http://www.jetbrains.com/pycharm/>`_ and `PythonWin <http://wiki.python.org/moin/PythonWin>`_, 
     8or in shells like `iPython <http://ipython.scipy.org/moin/>`_. Whichever environment you are using, try now to import Orange. Below, we use a Python shell::
     9 
     10   % python 
     11   >>> import Orange 
     12   >>> Orange.version.version 
     13   '2.6a2.dev-a55510d' 
     14   >>> 
     15 
     16If this produces no errors or warnings, Orange and Python are properly 
     17installed and you are ready to continue with this tutorial.
     18 
     19******** 
     20Contents 
     21******** 
    1422 
    1523.. toctree:: 
    1624   :maxdepth: 1 
    1725 
    18    start.rst 
    19    load-data.rst 
    20    basic-exploration.rst 
     26   data.rst 
    2127   classification.rst 
    22    evaluation.rst 
    23    learners-in-python.rst 
    2428   regression.rst 
    25    association-rules.rst 
    26    feature-subset-selection.rst 
    2729   ensembles.rst 
    28    discretization.rst 
     30   python-learners.rst 
    2931 
    3032**************** 
    31 Index and search 
     33Index and Search 
    3234**************** 
    3335 
  • docs/tutorial/rst/regression.rst

    r9385 r11058  
    1 .. index:: regression 
    2  
    31Regression 
    42========== 
    53 
    6 At the time of writing of this part of tutorial, there were 
    7 essentially two different learning methods for regression modelling: 
    8 regression trees and instance-based learner (k-nearest neighbors). In 
    9 this lesson, we will see that using regression is just like using 
    10 classifiers, and evaluation techniques are not much different either. 
     4.. index:: regression 
     5 
     6From the interface point of view, regression methods in Orange are very similar to classification. Both are intended for supervised data mining, and both require class-labeled data. Just like in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. Regression models are given data items and predict the value of a continuous class:
     7 
     8.. literalinclude:: code/regression.py 
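regression.py is referenced but not shown; a minimal sketch of the learner/regressor interface, with ``Orange.regression.linear.LinearRegressionLearner`` as an assumed example of a regression learner::

   import Orange

   data = Orange.data.Table("housing")

   # a regression learner, given data, returns a regressor
   learner = Orange.regression.linear.LinearRegressionLearner()
   model = learner(data)

   # the regressor predicts the continuous class of a data instance
   for d in data[:3]:
       print "predicted %s, real %s" % (model(d), d.get_class())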
     9 
     10 
     11Handful of Regressors 
     12--------------------- 
    1113 
    1214.. index:: 
    13    single: regression; regression trees 
     15   single: regression; tree 
    1416 
    15 Few simple regressors 
    16 --------------------- 
     17Let us start with regression trees. Below is an example script that builds a tree from data on housing prices and prints it out in textual form:
    1718 
    18 Let us start with regression trees. Below is an example script that builds 
    19 the tree from :download:`housing.tab <code/housing.tab>` data set and prints 
    20 out the tree in textual form (:download:`regression1.py <code/regression1.py>`):: 
     19.. literalinclude:: code/regression-tree.py 
     20   :lines: 3- 
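regression-tree.py is likewise only referenced; a sketch assuming the classification tree learner also handles a continuous class (as in Orange 2.x) and that the resulting model offers ``to_string()`` for the textual rendering shown below::

   import Orange

   data = Orange.data.Table("housing")

   # build a regression tree (the tutorial's tree is pruned; parameters omitted here)
   tree = Orange.classification.tree.TreeLearner(data)
   print tree.to_string()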
    2121 
    22    import orange, orngTree 
     22The script outputs the tree:: 
    2323    
    24    data = orange.ExampleTable("housing.tab") 
    25    rt = orngTree.TreeLearner(data, measure="retis", mForPruning=2, minExamples=20) 
    26    orngTree.printTxt(rt, leafStr="%V %I") 
    27     
    28 Notice special setting for attribute evaluation measure! Following is 
    29 the output of this script:: 
    30     
    31    RM<6.941: 19.9 [19.333-20.534] 
    32    RM>=6.941 
    33    |    RM<7.437 
    34    |    |    CRIM>=7.393: 14.4 [10.172-18.628] 
    35    |    |    CRIM<7.393 
    36    |    |    |    DIS<1.886: 45.7 [37.124-54.176] 
    37    |    |    |    DIS>=1.886: 32.7 [31.656-33.841] 
    38    |    RM>=7.437 
    39    |    |    TAX<534.500: 45.9 [44.295-47.498] 
    40    |    |    TAX>=534.500: 21.9 [21.900-21.900] 
     24   RM<=6.941: 19.9 
     25   RM>6.941 
     26   |    RM<=7.437 
     27   |    |    CRIM>7.393: 14.4 
     28   |    |    CRIM<=7.393 
     29   |    |    |    DIS<=1.886: 45.7 
     30   |    |    |    DIS>1.886: 32.7 
     31   |    RM>7.437 
     32   |    |    TAX<=534.500: 45.9 
     33   |    |    TAX>534.500: 21.9 
     34 
     35The following script initializes a few other regressors and shows their predictions for the first five data instances in the housing price data set:
    4136 
    4237.. index:: 
    43    single: regression; k nearest neighbours 
     38   single: regression; mars 
     39   single: regression; linear 
    4440 
    45 Predicting continues classes is just like predicting crisp ones. In 
    46 this respect, the following script will be nothing new. It uses both 
    47 regression trees and k-nearest neighbors, and also uses a majority 
    48 learner which for regression simply returns an average value from 
    49 learning data set (:download:`regression2.py <code/regression2.py>`):: 
     41.. literalinclude:: code/regression-other.py 
     42   :lines: 3- 
    5043 
    51    import orange, orngTree, orngTest, orngStat 
    52     
    53    data = orange.ExampleTable("housing.tab") 
    54    selection = orange.MakeRandomIndices2(data, 0.5) 
    55    train_data = data.select(selection, 0) 
    56    test_data = data.select(selection, 1) 
    57     
    58    maj = orange.MajorityLearner(train_data) 
    59    maj.name = "default" 
    60     
    61    rt = orngTree.TreeLearner(train_data, measure="retis", mForPruning=2, minExamples=20) 
    62    rt.name = "reg. tree" 
    63     
    64    k = 5 
    65    knn = orange.kNNLearner(train_data, k=k) 
    66    knn.name = "k-NN (k=%i)" % k 
    67     
    68    regressors = [maj, rt, knn] 
    69     
    70    print "\n%10s " % "original", 
    71    for r in regressors: 
    72      print "%10s " % r.name, 
    73    print 
    74     
    75    for i in range(10): 
    76      print "%10.1f " % test_data[i].getclass(), 
    77      for r in regressors: 
    78        print "%10.1f " % r(test_data[i]), 
    79      print 
     44Looks like the housing prices are not that hard to predict:: 
    8045 
    81 The otput of this script is:: 
     46   y    lin  mars tree 
     47   21.4 24.8 23.0 20.1 
     48   15.7 14.4 19.0 17.3 
     49   36.5 35.7 35.6 33.8 
    8250 
    83      original     default   reg. tree  k-NN (k=5) 
    84          24.0        50.0        25.0        24.6 
    85          21.6        50.0        25.0        22.0 
    86          34.7        50.0        35.4        26.6 
    87          28.7        50.0        25.0        36.2 
    88          27.1        50.0        21.7        18.9 
    89          15.0        50.0        21.7        18.9 
    90          18.9        50.0        21.7        18.9 
    91          18.2        50.0        21.7        21.0 
    92          17.5        50.0        21.7        16.6 
    93          20.2        50.0        21.7        23.1 
     51Cross-Validation
     52---------------- 
    9453 
    95 .. index: mean squared error 
     54Just like for classification, the same evaluation module (``Orange.evaluation``) is available for regression. Its testing submodule includes procedures such as cross-validation and leave-one-out testing, and functions in the scoring submodule assess the accuracy of the tested models:
    9655 
    97 Evaluation and scoring 
    98 ---------------------- 
     56.. literalinclude:: code/regression-other.py 
     57   :lines: 3- 
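A sketch of scoring regressors with cross-validation, assuming ``Orange.evaluation.scoring.RMSE`` exists alongside the testing procedures used earlier (the learners shown are illustrative)::

   import Orange

   data = Orange.data.Table("housing")

   lin = Orange.regression.linear.LinearRegressionLearner()
   lin.name = "lin"
   tree = Orange.classification.tree.TreeLearner()
   tree.name = "tree"
   learners = [lin, tree]

   res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
   print "Learner  RMSE"
   for learner, err in zip(learners, Orange.evaluation.scoring.RMSE(res)):
       print "%-8s %.2f" % (learner.name, err)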
    9958 
    100 For our third and last example for regression, let us see how we can 
    101 use cross-validation testing and for a score function use 
    102 (:download:`regression3.py <code/regression3.py>`, uses `housing.tab <code/housing.tab>`):: 
     59.. index:: 
     60   single: regression; root mean squared error 
    10361 
    104    import orange, orngTree, orngTest, orngStat 
    105     
    106    data = orange.ExampleTable("housing.tab") 
    107     
    108    maj = orange.MajorityLearner() 
    109    maj.name = "default" 
    110    rt = orngTree.TreeLearner(measure="retis", mForPruning=2, minExamples=20) 
    111    rt.name = "regression tree" 
    112    k = 5 
    113    knn = orange.kNNLearner(k=k) 
    114    knn.name = "k-NN (k=%i)" % k 
    115    learners = [maj, rt, knn] 
    116     
    117    data = orange.ExampleTable("housing.tab") 
    118    results = orngTest.crossValidation(learners, data, folds=10) 
    119    mse = orngStat.MSE(results) 
    120     
    121    print "Learner        MSE" 
    122    for i in range(len(learners)): 
    123      print "%-15s %5.3f" % (learners[i].name, mse[i]) 
     62`MARS <http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines>`_ has the lowest root mean squared error:: 
    12463 
    125 Again, compared to classification tasks, this is nothing new. The only 
    126 news in the above script is a mean squared error evaluation function 
    127 (``orngStat.MSE``). The scripts prints out the following report:: 
     64   Learner  RMSE 
     65   lin      4.83 
     66   mars     3.84 
     67   tree     5.10 
    12868 
    129    Learner        MSE 
    130    default         84.777 
    131    regression tree 40.096 
    132    k-NN (k=5)      17.532 
    133  
    134 Other scoring techniques are available to evaluate the success of 
    135 regression. Script below uses a range of them, plus features a nice 
    136 implementation where a list of scoring techniques is defined 
    137 independetly from the code that reports on the results (part of 
    138 :download:`regression4.py <code/regression4.py>`):: 
    139  
    140    lr = orngRegression.LinearRegressionLearner(name="lr") 
    141    rt = orngTree.TreeLearner(measure="retis", mForPruning=2, 
    142                              minExamples=20, name="rt") 
    143    maj = orange.MajorityLearner(name="maj") 
    144    knn = orange.kNNLearner(k=10, name="knn") 
    145    learners = [maj, lr, rt, knn] 
    146     
    147    # evaluation and reporting of scores 
    148    results = orngTest.learnAndTestOnTestData(learners, train, test) 
    149    scores = [("MSE", orngStat.MSE), 
    150              ("RMSE", orngStat.RMSE), 
    151              ("MAE", orngStat.MAE), 
    152              ("RSE", orngStat.RSE), 
    153              ("RRSE", orngStat.RRSE), 
    154              ("RAE", orngStat.RAE), 
    155              ("R2", orngStat.R2)] 
    156     
    157    print "Learner  " + "".join(["%-7s" % s[0] for s in scores]) 
    158    for i in range(len(learners)): 
    159        print "%-8s " % learners[i].name + "".join(["%6.3f " % s[1](results)[i] for s in scores]) 
    160  
    161 Here, we used a number of different scores, including: 
    162  
    163 * MSE - mean squared errror, 
    164 * RMSE - root mean squared error, 
    165 * MAE - mean absolute error, 
    166 * RSE - relative squared error, 
    167 * RRSE - root relative squared error, 
    168 * RAE - relative absolute error, and 
    169 * R2 - coefficient of determinatin, also referred to as R-squared. 
    170  
    171 For precise definition of these measures, see :py:mod:`Orange.statistics`. Running 
    172 the script above yields:: 
    173  
    174    Learner  MSE    RMSE   MAE    RSE    RRSE   RAE    R2 
    175    maj      84.777  9.207  6.659  1.004  1.002  1.002 -0.004 
    176    lr       23.729  4.871  3.413  0.281  0.530  0.513  0.719 
    177    rt       40.096  6.332  4.569  0.475  0.689  0.687  0.525 
    178    knn      17.244  4.153  2.670  0.204  0.452  0.402  0.796 
    179  