source: orange/docs/tutorial/rst/regression.rst @ 9385:fd37d2ce5541

Revision 9385:fd37d2ce5541, 6.2 KB, checked in by mitar, 2 years ago

Cleaned up tutorial.

.. index:: regression

Regression
==========

At the time of writing this part of the tutorial, there were
essentially two different learning methods for regression modelling:
regression trees and instance-based learners (k-nearest neighbors). In
this lesson, we will see that using regressors is just like using
classifiers, and that the evaluation techniques are not much different
either.

.. index::
   single: regression; regression trees

A few simple regressors
-----------------------

Let us start with regression trees. Below is an example script that builds
a tree from the :download:`housing.tab <code/housing.tab>` data set and prints
it out in textual form (:download:`regression1.py <code/regression1.py>`)::

   import orange, orngTree

   data = orange.ExampleTable("housing.tab")
   rt = orngTree.TreeLearner(data, measure="retis", mForPruning=2, minExamples=20)
   orngTree.printTxt(rt, leafStr="%V %I")

Notice the special setting for the attribute evaluation measure! The
following is the output of this script::

   RM<6.941: 19.9 [19.333-20.534]
   RM>=6.941
   |    RM<7.437
   |    |    CRIM>=7.393: 14.4 [10.172-18.628]
   |    |    CRIM<7.393
   |    |    |    DIS<1.886: 45.7 [37.124-54.176]
   |    |    |    DIS>=1.886: 32.7 [31.656-33.841]
   |    RM>=7.437
   |    |    TAX<534.500: 45.9 [44.295-47.498]
   |    |    TAX>=534.500: 21.9 [21.900-21.900]

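To get a feel for what such a tree does, the two ideas behind it can be sketched in a few lines of plain Python (an illustration only, not Orange's implementation; the helper names and toy data below are invented for this example): a leaf predicts the mean class of the training examples that reach it, and a candidate split is scored by how much it reduces the variance of the class, which is what an MSE-based measure rewards.

```python
# Plain-Python sketch of regression tree mechanics (illustration only):
# a leaf predicts the mean target of its examples; splits are scored by
# the weighted variance of the two branches (lower is better).

def mean(ys):
    return sum(ys) / len(ys)

def variance(ys):
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def split_score(xs, ys, threshold):
    """Weighted variance of the targets in the two branches."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    if not left or not right:
        return variance(ys)
    n = len(ys)
    return (len(left) * variance(left) + len(right) * variance(right)) / n

# Toy data: one attribute value per example, and a continuous class.
xs = [5.0, 5.5, 6.0, 7.1, 7.5, 8.0]
ys = [18.0, 19.0, 20.0, 44.0, 46.0, 47.0]

# A threshold between the two clusters reduces variance far more
# than a threshold inside one of them.
print(split_score(xs, ys, 6.9))   # small weighted variance
print(split_score(xs, ys, 5.2))   # much larger
print(mean([y for x, y in zip(xs, ys) if x < 6.9]))  # leaf prediction: 19.0
```

The intervals printed next to the leaf values above are confidence intervals around exactly this kind of leaf mean.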
.. index::
   single: regression; k nearest neighbours

Predicting continuous classes is just like predicting crisp ones. In
this respect, the following script will be nothing new. It uses both
regression trees and k-nearest neighbors, and also uses a majority
learner, which for regression simply returns the average value of the
learning data set (:download:`regression2.py <code/regression2.py>`)::

   import orange, orngTree, orngTest, orngStat

   data = orange.ExampleTable("housing.tab")
   selection = orange.MakeRandomIndices2(data, 0.5)
   train_data = data.select(selection, 0)
   test_data = data.select(selection, 1)

   maj = orange.MajorityLearner(train_data)
   maj.name = "default"

   rt = orngTree.TreeLearner(train_data, measure="retis", mForPruning=2, minExamples=20)
   rt.name = "reg. tree"

   k = 5
   knn = orange.kNNLearner(train_data, k=k)
   knn.name = "k-NN (k=%i)" % k

   regressors = [maj, rt, knn]

   print "\n%10s " % "original",
   for r in regressors:
     print "%10s " % r.name,
   print

   for i in range(10):
     print "%10.1f " % test_data[i].getclass(),
     for r in regressors:
       print "%10.1f " % r(test_data[i]),
     print

The output of this script is::

     original     default   reg. tree  k-NN (k=5)
         24.0        50.0        25.0        24.6
         21.6        50.0        25.0        22.0
         34.7        50.0        35.4        26.6
         28.7        50.0        25.0        36.2
         27.1        50.0        21.7        18.9
         15.0        50.0        21.7        18.9
         18.9        50.0        21.7        18.9
         18.2        50.0        21.7        21.0
         17.5        50.0        21.7        16.6
         20.2        50.0        21.7        23.1

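The two simplest regressors used above are easy to sketch in plain Python (an illustration of the idea, not of the Orange API; the function names and toy data below are invented): the majority ("default") learner returns the mean class of the training data regardless of the input, while k-NN averages the classes of the k training examples nearest to the query.

```python
# Plain-Python sketch of the simplest regressors (illustration only):
# the default learner predicts the training mean; k-NN predicts the
# average class of the k nearest training examples.

def default_predict(train_ys):
    """Predict the mean of the training targets, ignoring the query."""
    return sum(train_ys) / len(train_ys)

def knn_predict(train, query, k):
    """train is a list of (feature_vector, y) pairs."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    return sum(y for _, y in nearest) / k

train = [([1.0], 10.0), ([2.0], 12.0), ([3.0], 14.0), ([10.0], 40.0)]
print(default_predict([y for _, y in train]))  # 19.0
print(knn_predict(train, [2.5], k=2))          # average of 12.0 and 14.0 -> 13.0
```

This is why the "default" column above is constant: with no attributes to condition on, the best guess under squared error is the training mean.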
.. index:: mean squared error

Evaluation and scoring
----------------------

For our third and last regression example, let us see how we can use
cross-validation testing with mean squared error as the score function
(:download:`regression3.py <code/regression3.py>`, uses :download:`housing.tab <code/housing.tab>`)::

   import orange, orngTree, orngTest, orngStat

   data = orange.ExampleTable("housing.tab")

   maj = orange.MajorityLearner()
   maj.name = "default"
   rt = orngTree.TreeLearner(measure="retis", mForPruning=2, minExamples=20)
   rt.name = "regression tree"
   k = 5
   knn = orange.kNNLearner(k=k)
   knn.name = "k-NN (k=%i)" % k
   learners = [maj, rt, knn]

   results = orngTest.crossValidation(learners, data, folds=10)
   mse = orngStat.MSE(results)

   print "Learner        MSE"
   for i in range(len(learners)):
     print "%-15s %5.3f" % (learners[i].name, mse[i])

Again, compared to classification tasks, this is nothing new. The only
novelty in the above script is the mean squared error evaluation function
(``orngStat.MSE``). The script prints out the following report::

   Learner        MSE
   default         84.777
   regression tree 40.096
   k-NN (k=5)      17.532

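What ``orngStat.MSE`` measures is simple to write down by hand. The sketch below (plain Python with invented toy values, taken from the prediction table earlier) shows the mean squared error itself, plus a minimal round-robin assignment of examples to folds; note that Orange's ``crossValidation`` does more than this, e.g. it can stratify the folds.

```python
# Sketch of the quantities behind the report above (illustration only):
# MSE is the average squared difference between true and predicted
# class; cross-validation assigns each example to one of `folds` folds.

def mse(true, predicted):
    return sum((t - p) ** 2 for t, p in zip(true, predicted)) / len(true)

def kfold_indices(n, folds):
    """Assign each of n examples to a fold, round-robin (no stratification)."""
    return [i % folds for i in range(n)]

# First four rows of the earlier prediction table (true vs. reg. tree).
true      = [24.0, 21.6, 34.7, 28.7]
predicted = [25.0, 25.0, 35.4, 25.0]
print(mse(true, predicted))   # about 6.685
print(kfold_indices(6, 3))    # [0, 1, 2, 0, 1, 2]
```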
Other scoring techniques are available to evaluate the success of
regression. The script below uses a range of them, and also features a nice
implementation in which the list of scoring techniques is defined
independently of the code that reports the results (part of
:download:`regression4.py <code/regression4.py>`)::

   lr = orngRegression.LinearRegressionLearner(name="lr")
   rt = orngTree.TreeLearner(measure="retis", mForPruning=2,
                             minExamples=20, name="rt")
   maj = orange.MajorityLearner(name="maj")
   knn = orange.kNNLearner(k=10, name="knn")
   learners = [maj, lr, rt, knn]

   # evaluation and reporting of scores
   results = orngTest.learnAndTestOnTestData(learners, train, test)
   scores = [("MSE", orngStat.MSE),
             ("RMSE", orngStat.RMSE),
             ("MAE", orngStat.MAE),
             ("RSE", orngStat.RSE),
             ("RRSE", orngStat.RRSE),
             ("RAE", orngStat.RAE),
             ("R2", orngStat.R2)]

   print "Learner  " + "".join(["%-7s" % s[0] for s in scores])
   for i in range(len(learners)):
       print "%-8s " % learners[i].name + "".join(["%6.3f " % s[1](results)[i] for s in scores])

Here, we used a number of different scores, including:

* MSE - mean squared error,
* RMSE - root mean squared error,
* MAE - mean absolute error,
* RSE - relative squared error,
* RRSE - root relative squared error,
* RAE - relative absolute error, and
* R2 - coefficient of determination, also referred to as R-squared.

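The measures listed above are all short formulas, and can be sketched in a few lines of plain Python (an illustration of the definitions, not of ``orngStat`` itself; the toy values are invented). The "relative" scores divide the model's error by the error of the default predictor that always returns the mean of the true values:

```python
# Plain-Python sketch of the regression scores listed above
# (illustration only, not the orngStat implementation).

def mse(true, pred):
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)

def rmse(true, pred):
    return mse(true, pred) ** 0.5

def mae(true, pred):
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def rse(true, pred):
    # squared error relative to always predicting the mean
    m = sum(true) / len(true)
    return (sum((t - p) ** 2 for t, p in zip(true, pred))
            / sum((t - m) ** 2 for t in true))

def rrse(true, pred):
    return rse(true, pred) ** 0.5

def rae(true, pred):
    m = sum(true) / len(true)
    return (sum(abs(t - p) for t, p in zip(true, pred))
            / sum(abs(t - m) for t in true))

def r2(true, pred):
    # R-squared is one minus the relative squared error
    return 1.0 - rse(true, pred)

true = [3.0, -0.5, 2.0, 7.0]
pred = [2.5, 0.0, 2.0, 8.0]
print(rmse(true, pred), mae(true, pred), r2(true, pred))
```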
For precise definitions of these measures, see :py:mod:`Orange.statistics`. Running
the script above yields::

   Learner  MSE    RMSE   MAE    RSE    RRSE   RAE    R2
   maj      84.777  9.207  6.659  1.004  1.002  1.002 -0.004
   lr       23.729  4.871  3.413  0.281  0.530  0.513  0.719
   rt       40.096  6.332  4.569  0.475  0.689  0.687  0.525
   knn      17.244  4.153  2.670  0.204  0.452  0.402  0.796

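Note that these scores are not independent: RMSE is the square root of MSE, RRSE is the square root of RSE, and R2 equals 1 - RSE. This can be checked directly against the numbers in the table (copied below, to the three printed decimals):

```python
# Consistency check on the reported scores: for each learner,
# RMSE = sqrt(MSE), RRSE = sqrt(RSE) and R2 = 1 - RSE,
# up to rounding to three decimals. Rows copied from the table above.
rows = {
    "maj": dict(MSE=84.777, RMSE=9.207, RSE=1.004, RRSE=1.002, R2=-0.004),
    "lr":  dict(MSE=23.729, RMSE=4.871, RSE=0.281, RRSE=0.530, R2=0.719),
    "rt":  dict(MSE=40.096, RMSE=6.332, RSE=0.475, RRSE=0.689, R2=0.525),
    "knn": dict(MSE=17.244, RMSE=4.153, RSE=0.204, RRSE=0.452, R2=0.796),
}
for name, s in rows.items():
    assert abs(s["MSE"] ** 0.5 - s["RMSE"]) < 0.005
    assert abs(s["RSE"] ** 0.5 - s["RRSE"]) < 0.005
    assert abs((1.0 - s["RSE"]) - s["R2"]) < 0.005
print("all rows consistent")
```

This also explains the default learner's scores: a model no better than predicting the mean has RSE of about 1 and R2 of about 0.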
Note: See TracBrowser for help on using the repository browser.