Timestamp: 01/06/13 00:27:59 (16 months ago)
Author: blaz <blaz.zupan@…>
Branch: default
Message: new tutorial

File: 1 edited

  • docs/tutorial/rst/regression.rst

    --- r9385
    +++ r11051
-.. index:: regression
-
 Regression
 ==========

-At the time of writing of this part of the tutorial, there were
-essentially two different learning methods for regression modelling:
-regression trees and an instance-based learner (k-nearest neighbors). In
-this lesson, we will see that using regression is just like using
-classifiers, and evaluation techniques are not much different either.
+.. index:: regression
+
+From the interface point of view, regression methods in Orange are very similar to classification. Both are intended for supervised data mining and require class-labeled data. Just like in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. Regression models are given data items to predict the value of a continuous class:
+
+.. literalinclude:: code/regression.py
+
+
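The included script itself is not part of this changeset. A minimal sketch of
what ``code/regression.py`` might contain, assuming the Orange 2.x scripting
interface (``Orange.data.Table`` and ``Orange.regression.linear`` are
assumptions, not shown in the diff)::

   import Orange

   # learners accept data and return regressors (assumed Orange 2.x API)
   data = Orange.data.Table("housing")
   learner = Orange.regression.linear.LinearRegressionLearner()
   model = learner(data)

   # regressors map data instances to continuous class values
   for instance in data[:3]:
       print "%.1f (observed %.1f)" % (
           float(model(instance)), float(instance.get_class()))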
+Handful of Regressors
+---------------------

 .. index::
-   single: regression; regression trees
+   single: regression; tree

-Few simple regressors
----------------------
+Let us start with regression trees. Below is an example script that builds the tree from data on housing prices and prints out the tree in textual form:

-Let us start with regression trees. Below is an example script that builds
-the tree from the :download:`housing.tab <code/housing.tab>` data set and prints
-out the tree in textual form (:download:`regression1.py <code/regression1.py>`)::
+.. literalinclude:: code/regression-tree.py
+   :lines: 3-

-   import orange, orngTree
+The script outputs the tree::

-   data = orange.ExampleTable("housing.tab")
-   rt = orngTree.TreeLearner(data, measure="retis", mForPruning=2, minExamples=20)
-   orngTree.printTxt(rt, leafStr="%V %I")
-
-Notice the special setting for the attribute evaluation measure! Following is
-the output of this script::
-
-   RM<6.941: 19.9 [19.333-20.534]
-   RM>=6.941
-   |    RM<7.437
-   |    |    CRIM>=7.393: 14.4 [10.172-18.628]
-   |    |    CRIM<7.393
-   |    |    |    DIS<1.886: 45.7 [37.124-54.176]
-   |    |    |    DIS>=1.886: 32.7 [31.656-33.841]
-   |    RM>=7.437
-   |    |    TAX<534.500: 45.9 [44.295-47.498]
-   |    |    TAX>=534.500: 21.9 [21.900-21.900]
+   RM<=6.941: 19.9
+   RM>6.941
+   |    RM<=7.437
+   |    |    CRIM>7.393: 14.4
+   |    |    CRIM<=7.393
+   |    |    |    DIS<=1.886: 45.7
+   |    |    |    DIS>1.886: 32.7
+   |    RM>7.437
+   |    |    TAX<=534.500: 45.9
+   |    |    TAX>534.500: 21.9
+
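``code/regression-tree.py`` is likewise not shown. Judging from the replaced
``orngTree`` script and the printed tree, it would be along these lines; the
module path ``Orange.regression.tree``, the parameter names ``m_pruning`` and
``min_instances`` (counterparts of ``mForPruning`` and ``minExamples`` above),
and ``to_string()`` are assumptions::

   import Orange

   data = Orange.data.Table("housing")
   # m-estimate pruning and a minimum leaf size, as in the old orngTree call
   tree = Orange.regression.tree.TreeLearner(data, m_pruning=2.0,
                                             min_instances=20)
   # assumed textual rendering of the induced regression tree
   print tree.to_string()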
+Following is the initialization of a few other regressors and their predictions for the first five data instances in the housing price data set:

 .. index::
-   single: regression; k nearest neighbours
+   single: regression; mars
+   single: regression; linear

-Predicting continuous classes is just like predicting crisp ones. In
-this respect, the following script will be nothing new. It uses both
-regression trees and k-nearest neighbors, and also uses a majority
-learner which for regression simply returns an average value from the
-learning data set (:download:`regression2.py <code/regression2.py>`)::
+.. literalinclude:: code/regression-other.py
+   :lines: 3-

-   import orange, orngTree, orngTest, orngStat
-
-   data = orange.ExampleTable("housing.tab")
-   selection = orange.MakeRandomIndices2(data, 0.5)
-   train_data = data.select(selection, 0)
-   test_data = data.select(selection, 1)
-
-   maj = orange.MajorityLearner(train_data)
-   maj.name = "default"
-
-   rt = orngTree.TreeLearner(train_data, measure="retis", mForPruning=2, minExamples=20)
-   rt.name = "reg. tree"
-
-   k = 5
-   knn = orange.kNNLearner(train_data, k=k)
-   knn.name = "k-NN (k=%i)" % k
-
-   regressors = [maj, rt, knn]
-
-   print "\n%10s " % "original",
-   for r in regressors:
-     print "%10s " % r.name,
-   print
-
-   for i in range(10):
-     print "%10.1f " % test_data[i].getclass(),
-     for r in regressors:
-       print "%10.1f " % r(test_data[i]),
-     print
+Looks like the housing prices are not that hard to predict::

-The output of this script is::
+   y    lin  mars tree
+   21.4 24.8 23.0 20.1
+   15.7 14.4 19.0 17.3
+   36.5 35.7 35.6 33.8

-      original     default   reg. tree  k-NN (k=5)
-          24.0        50.0        25.0        24.6
-          21.6        50.0        25.0        22.0
-          34.7        50.0        35.4        26.6
-          28.7        50.0        25.0        36.2
-          27.1        50.0        21.7        18.9
-          15.0        50.0        21.7        18.9
-          18.9        50.0        21.7        18.9
-          18.2        50.0        21.7        21.0
-          17.5        50.0        21.7        16.6
-          20.2        50.0        21.7        23.1
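The new ``code/regression-other.py`` is not included either. Given the column
labels of the output (``lin``, ``mars``, ``tree``), a sketch could look like
the following; ``Orange.regression.linear``, ``Orange.regression.earth`` (an
Orange MARS implementation) and ``Orange.regression.tree`` are assumed module
paths, not taken from the diff::

   import Orange

   data = Orange.data.Table("housing")

   # one learner per output column; names match the printed header
   learners = [Orange.regression.linear.LinearRegressionLearner(name="lin"),
               Orange.regression.earth.EarthLearner(name="mars"),
               Orange.regression.tree.TreeLearner(name="tree")]
   models = [learner(data) for learner in learners]

   print "y    " + " ".join("%-4s" % learner.name for learner in learners)
   for d in data[:3]:
       predictions = " ".join("%4.1f" % float(m(d)) for m in models)
       print "%4.1f %s" % (float(d.get_class()), predictions)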
+Cross Validation
+----------------

-.. index: mean squared error
+Just like for classification, the same evaluation module (``Orange.evaluation``) is available for regression. Its testing submodule includes procedures such as cross-validation and leave-one-out testing, and functions in the scoring submodule can assess the accuracy from the test results:

-Evaluation and scoring
-----------------------
+.. literalinclude:: code/regression-other.py
+   :lines: 3-

-For our third and last regression example, let us see how we can
-use cross-validation testing with mean squared error as the score function
-(:download:`regression3.py <code/regression3.py>`, uses `housing.tab <code/housing.tab>`)::
+.. index::
+   single: regression; root mean squared error

-   import orange, orngTree, orngTest, orngStat
-
-   data = orange.ExampleTable("housing.tab")
-
-   maj = orange.MajorityLearner()
-   maj.name = "default"
-   rt = orngTree.TreeLearner(measure="retis", mForPruning=2, minExamples=20)
-   rt.name = "regression tree"
-   k = 5
-   knn = orange.kNNLearner(k=k)
-   knn.name = "k-NN (k=%i)" % k
-   learners = [maj, rt, knn]
-
-   results = orngTest.crossValidation(learners, data, folds=10)
-   mse = orngStat.MSE(results)
-
-   print "Learner        MSE"
-   for i in range(len(learners)):
-     print "%-15s %5.3f" % (learners[i].name, mse[i])
+`MARS <http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines>`_ has the lowest root mean squared error::

-Again, compared to classification tasks, this is nothing new. The only
-novelty in the above script is the mean squared error evaluation function
-(``orngStat.MSE``). The script prints out the following report::
+   Learner  RMSE
+   lin      4.83
+   mars     3.84
+   tree     5.10

-   Learner        MSE
-   default         84.777
-   regression tree 40.096
-   k-NN (k=5)      17.532
-
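The cross-validation script of the new tutorial is also only referenced, not
shown. A sketch of 10-fold cross-validation scored by root mean squared
error, assuming ``Orange.evaluation.testing.cross_validation`` and
``Orange.evaluation.scoring.RMSE`` as the modern counterparts of the
``orngTest.crossValidation`` and ``orngStat`` calls above::

   import Orange

   data = Orange.data.Table("housing")
   learners = [Orange.regression.linear.LinearRegressionLearner(name="lin"),
               Orange.regression.earth.EarthLearner(name="mars"),
               Orange.regression.tree.TreeLearner(name="tree")]

   # 10-fold cross-validation; one RMSE score per learner
   results = Orange.evaluation.testing.cross_validation(learners, data,
                                                        folds=10)
   rmse = Orange.evaluation.scoring.RMSE(results)

   print "Learner  RMSE"
   for learner, score in zip(learners, rmse):
       print "%-8s %.2f" % (learner.name, score)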
-Other scoring techniques are available to evaluate the success of
-regression. The script below uses a range of them and features a nice
-implementation where a list of scoring techniques is defined
-independently of the code that reports on the results (part of
-:download:`regression4.py <code/regression4.py>`)::
-
-   lr = orngRegression.LinearRegressionLearner(name="lr")
-   rt = orngTree.TreeLearner(measure="retis", mForPruning=2,
-                             minExamples=20, name="rt")
-   maj = orange.MajorityLearner(name="maj")
-   knn = orange.kNNLearner(k=10, name="knn")
-   learners = [maj, lr, rt, knn]
-
-   # evaluation and reporting of scores
-   results = orngTest.learnAndTestOnTestData(learners, train, test)
-   scores = [("MSE", orngStat.MSE),
-             ("RMSE", orngStat.RMSE),
-             ("MAE", orngStat.MAE),
-             ("RSE", orngStat.RSE),
-             ("RRSE", orngStat.RRSE),
-             ("RAE", orngStat.RAE),
-             ("R2", orngStat.R2)]
-
-   print "Learner  " + "".join(["%-7s" % s[0] for s in scores])
-   for i in range(len(learners)):
-       print "%-8s " % learners[i].name + "".join(["%6.3f " % s[1](results)[i] for s in scores])
-
-Here, we used a number of different scores, including:
-
-* MSE - mean squared error,
-* RMSE - root mean squared error,
-* MAE - mean absolute error,
-* RSE - relative squared error,
-* RRSE - root relative squared error,
-* RAE - relative absolute error, and
-* R2 - coefficient of determination, also referred to as R-squared.
-
-For precise definitions of these measures, see :py:mod:`Orange.statistics`. Running
-the script above yields::
-
-   Learner  MSE    RMSE   MAE    RSE    RRSE   RAE    R2
-   maj      84.777  9.207  6.659  1.004  1.002  1.002 -0.004
-   lr       23.729  4.871  3.413  0.281  0.530  0.513  0.719
-   rt       40.096  6.332  4.569  0.475  0.689  0.687  0.525
-   knn      17.244  4.153  2.670  0.204  0.452  0.402  0.796
-
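For reference, the scores listed above have standard definitions that agree
with the reported table (RMSE is the square root of MSE, RRSE the square root
of RSE, and R2 equals 1 - RSE). With observed values y_i, predictions
\hat{y}_i, their mean \bar{y} and n test instances, in LaTeX::

   \mathrm{MSE}  = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 , \qquad
   \mathrm{RMSE} = \sqrt{\mathrm{MSE}} , \qquad
   \mathrm{MAE}  = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

   \mathrm{RSE}  = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} , \qquad
   \mathrm{RRSE} = \sqrt{\mathrm{RSE}} , \qquad
   \mathrm{RAE}  = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i - \bar{y}|} , \qquad
   R^2 = 1 - \mathrm{RSE}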