
Build your own learner
======================

.. index::
   single: classifiers; in Python

This part of the tutorial shows how to build learners and classifiers
in Python, that is, how to build your own learners and
classifiers. Especially for those of you who want to test your own
methods or combine existing techniques in Orange, this is
a very important topic. Developing your own learners in Python makes
prototyping of new methods fast and enjoyable.

There are different ways to build learners/classifiers in Python. We
will take the route that shows how to do this correctly, in the sense
that you will be able to use your learner just like any learner
that Orange originally provides. What is distinct about Orange learners
is the way they are invoked and what they return. Let us start with an
example. Say that we have a ``Learner()``, which is some learner in
Orange. The learner can be called in two different ways::

    learner = Learner()
    classifier = Learner(data)

In the first line, the learner is invoked without a data set; in
that case it should return an instance of the learner, so that later
you may say ``classifier = learner(data)`` or call
some validation procedure with the ``learner`` itself (say
``orngEval.CrossValidation([learner], data)``). In the second
line, the learner is called with the data and returns a classifier.

Classifiers should be called with a data instance to classify,
and should return either a class value (by default), class
probabilities, or both::

    value = classifier(instance)
    value = classifier(instance, orange.GetValue)
    probabilities = classifier(instance, orange.GetProbabilities)
    value, probabilities = classifier(instance, orange.GetBoth)

Here is a short example::

    > python
    >>> import orange
    >>> data = orange.ExampleTable("voting")
    >>> learner = orange.BayesLearner()
    >>> classifier = learner(data)
    >>> classifier(data[0])
    republican
    >>> classifier(data[0], orange.GetBoth)
    (republican, [0.99999994039535522, 7.9730767765795463e-008])
    >>> classifier(data[0], orange.GetProbabilities)
    [0.99999994039535522, 7.9730767765795463e-008]
    >>>
    >>> c = orange.BayesLearner(data)
    >>> c(data[12])
    democrat
    >>>

We will assume here that our learner and the corresponding classifier
will be defined in a single file (module) that will not contain any
other code. This helps with code reuse, so that if you want to use your
new method anywhere else, you just import it from that file. Each such
module will contain a class ``Learner_Class`` and a class
``Classifier``. We will use this schema to define a learner that uses
a naive Bayesian classifier with embedded discretization of the training
data. Then we will show how to write a naive Bayesian classifier in
Python (that is, how to do this from scratch). We conclude with a Python
implementation of bagging. A bare skeleton of the schema is sketched
below; the concrete implementations follow in the next sections.
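
Here is a minimal, hypothetical sketch of that schema. The attribute
names and placeholder bodies are ours and only illustrate the shape of
such a module; the real implementations are given below::

    import orange

    class Learner_Class:
        def __init__(self, name='my learner', **kwds):
            self.__dict__.update(kwds)
            self.name = name

        def __call__(self, examples, weight=None):
            # learn something from the examples here, then wrap the
            # result in a Classifier
            model = None  # placeholder for the learned model
            return Classifier(model=model, name=self.name)

    class Classifier:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

        def __call__(self, example, result_type=orange.GetValue):
            # use self.model to classify the example and return a value,
            # probabilities, or both, depending on result_type
            pass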

.. _naive bayes with discretization:

Naive Bayes with discretization
-------------------------------

Let us build a learner/classifier that is an extension of the built-in
naive Bayes and which discretizes the data before learning. We will
define a module :download:`nbdisc.py <code/nbdisc.py>` that will implement two classes, Learner
and Classifier. Following is the Python code for the Learner class (part
of :download:`nbdisc.py <code/nbdisc.py>`)::

    class Learner(object):
        def __new__(cls, examples=None, name='discretized bayes', **kwds):
            learner = object.__new__(cls, **kwds)
            if examples:
                learner.__init__(name) # force init
                return learner(examples)
            else:
                return learner # invokes the __init__

        def __init__(self, name='discretized bayes'):
            self.name = name

        def __call__(self, data, weight=None):
            disc = orange.Preprocessor_discretize( \
                data, method=orange.EntropyDiscretization())
            model = orange.BayesLearner(disc, weight)
            return Classifier(classifier = model)

``Learner`` has three methods. Method ``__new__`` creates the
object and returns a learner or classifier, depending on whether examples
were passed to the call. If examples were passed as an argument,
the method calls the learner (invoking its ``__call__``
method). Method ``__init__`` is invoked whenever an instance of the class
is created. Notice that all it does is remember the only
argument that this class can be called with, i.e. the argument
``name``, which defaults to "discretized bayes". If you expect any
other arguments for your learners, you should handle them here (store
them as the object's attributes using ``self``).
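
A minimal illustration of the two invocation modes, assuming
:download:`nbdisc.py <code/nbdisc.py>` is importable and the iris data set
is available (as in the test further below)::

    import orange, nbdisc
    data = orange.ExampleTable("iris")

    learner = nbdisc.Learner()          # no data: __new__ returns a learner
    classifier1 = learner(data)         # training happens in __call__
    classifier2 = nbdisc.Learner(data)  # data given: __new__ trains directly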

If we have created an instance of the learner (and did not pass the
examples as an argument), the next call of this learner will invoke the
method ``__call__``, where the essence of our learner is
implemented. Notice also that we have included an argument for a vector
of instance weights, which is passed to the naive Bayesian learner. In our
learner, we first discretize the data using Fayyad & Irani's
entropy-based discretization, then build a naive Bayesian model and
finally pass it to the class ``Classifier``. You may expect that at its
first invocation the ``Classifier`` will just remember the model we
have called it with (part of :download:`nbdisc.py <code/nbdisc.py>`)::

    class Classifier:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

        def __call__(self, example, resultType = orange.GetValue):
            return self.classifier(example, resultType)

The method ``__init__`` in ``Classifier`` is rather general: it makes
``Classifier`` remember all keyword arguments it was called with. They are
then accessed as ``Classifier``'s attributes
(``self.argument_name``). When ``Classifier`` is called, it expects an
example and an optional argument that specifies the type of result to
be returned. The ``self.__dict__.update(kwds)`` idiom is illustrated on
its own below.
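
For example, this is all the idiom does (the class and attribute names
here are purely illustrative)::

    class Remember:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

    r = Remember(classifier="some model", name="demo")
    print r.name        # prints: demo
    print r.classifier  # prints: some model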

This completes our code for the naive Bayesian classifier with
discretization. You can see that the code is fairly short (fewer than
20 lines), and it can be easily extended or changed if we want to do
something else as well (like feature subset selection, ...).

Here are a few lines to test our code::

    >>> import orange, nbdisc
    >>> data = orange.ExampleTable("iris")
    >>> classifier = nbdisc.Learner(data)
    >>> print classifier(data[100])
    Iris-virginica
    >>> classifier(data[100], orange.GetBoth)
    (<orange.Value 'iris'='Iris-virginica'>, <0.000, 0.001, 0.999>)
    >>>

For a more elaborate test that also shows the use of a learner (that
is not given the data at its initialization), here is a script that
does 10-fold cross validation (:download:`nbdisc_test.py <code/nbdisc_test.py>`, uses :download:`iris.tab <code/iris.tab>` and
:download:`nbdisc.py <code/nbdisc.py>`)::

    import orange, orngEval, nbdisc
    data = orange.ExampleTable("iris")
    results = orngEval.CrossValidation([nbdisc.Learner()], data)
    print "Accuracy = %5.3f" % orngEval.CA(results)[0]

The accuracy on this data set is about 92%. You may try to obtain
better accuracy by using some other type of discretization, or by trying
some other learner on this data (hint: k-NN should perform better); a
possible comparison is sketched below.
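
For instance, here is one way such a comparison could look, reusing the
cross-validation call from above (the choice ``k=10`` is only
illustrative)::

    import orange, orngEval, nbdisc

    data = orange.ExampleTable("iris")
    nb = nbdisc.Learner()
    knn = orange.kNNLearner(k=10)
    knn.name = "knn"

    learners = [nb, knn]
    results = orngEval.CrossValidation(learners, data)
    for i in range(len(learners)):
        print learners[i].name, orngEval.CA(results)[i]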

Python implementation of naive Bayesian classifier
----------------------------------------------------

.. index::
   single: naive Bayesian classifier; in Python

The naive Bayesian classifier we will implement in this lesson uses the
standard naive Bayesian algorithm also described in Mitchell: Machine
Learning, 1997 (pages 177-180). Essentially, if a data instance is
described with :math:`n` features :math:`a_i`, then the
instance is classified to the class :math:`c` from the set of possible
classes :math:`V` according to the naive Bayes classifier:

.. math::
   c=\arg\max_{v_j\in V} P(v_j)\prod_{i=1}^n P(a_i|v_j)

We will also compute a vector of elements:

.. math::
   p_j = P(v_j)\prod_{i=1}^n P(a_i|v_j)

which, after normalization such that :math:`\sum_j p_j` is
equal to 1, represent class probabilities. The class probabilities (priors)
and conditional probabilities in the above formulas are estimated
from training data: the class probability is equal to the relative class
frequency, while the conditional probability of an attribute value given
the class is computed as the proportion of instances with the
value of the :math:`i`-th attribute equal to :math:`a_i` among the instances
from class :math:`v_j`.

To complicate things just a little bit, the :math:`m`-estimate (see
Mitchell, and Cestnik IJCAI-1990) will be used instead of relative
frequency when computing the conditional probabilities. So
(following the example in Mitchell), when assessing :math:`P=P({\rm
Wind}={\rm strong}|{\rm PlayTennis}={\rm no})` we find that the total
number of training examples with PlayTennis=no is :math:`n=5`, and of
these there are :math:`n_c=3` for which Wind=strong; using
relative frequency, the corresponding probability would be:

.. math::
   P={n_c\over n}

Relative frequency has a problem when the number of instances is
small; to alleviate that, the m-estimate assumes that there are m
imaginary cases (m is also referred to as the equivalent sample size)
with equal probability p of class values. Our conditional
probability using the m-estimate is then computed as:

.. math::
   P={n_c+m p\over n+m}

Often, instead of a uniform class probability :math:`p`, the relative class
frequency as estimated from the training data is taken.
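
As a quick numerical check of the formulas above (plain Python, using the
PlayTennis counts from the text; ``m=2`` and a uniform ``p=0.5`` are
illustrative choices only)::

    n, n_c = 5.0, 3.0   # examples with PlayTennis=no, of which Wind=strong
    m, p = 2.0, 0.5     # equivalent sample size and prior (assumed values)

    print "relative frequency:", n_c / n               # 0.6
    print "m-estimate:", (n_c + m * p) / (n + m)       # 0.571...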

We will develop a module called bayes.py that will implement our naive
Bayes learner and classifier. The structure of the module will be as
with `naive bayes with discretization`_. Again, we will implement two classes, one for
learning and the other one for classification. Here is the ``Learner_Class``
class (part of :download:`bayes.py <code/bayes.py>`)::

    class Learner_Class:
        def __init__(self, m=0.0, name='std naive bayes', **kwds):
            self.__dict__.update(kwds)
            self.m = m
            self.name = name

        def __call__(self, examples, weight=None, **kwds):
            for k in kwds.keys():
                self.__dict__[k] = kwds[k]
            domain = examples.domain

            # first, compute class probabilities
            n_class = [0.] * len(domain.classVar.values)
            for e in examples:
                n_class[int(e.getclass())] += 1

            p_class = [0.] * len(domain.classVar.values)
            for i in range(len(domain.classVar.values)):
                p_class[i] = n_class[i] / len(examples)

            # count examples with specific attribute and
            # class value, pc[attribute][value][class]

            # initialization of pc
            pc = []
            for i in domain.attributes:
                p = [[0.]*len(domain.classVar.values) for i in range(len(i.values))]
                pc.append(p)

            # count instances, store them in pc
            for e in examples:
                c = int(e.getclass())
                for i in range(len(domain.attributes)):
                    if not e[i].isSpecial():
                        pc[i][int(e[i])][c] += 1.0

            # compute conditional probabilities
            for i in range(len(domain.attributes)):
                for j in range(len(domain.attributes[i].values)):
                    for k in range(len(domain.classVar.values)):
                        pc[i][j][k] = (pc[i][j][k] + self.m * p_class[k]) / \
                            (n_class[k] + self.m)

            return Classifier(m = self.m, domain=domain, p_class=p_class, \
                p_cond=pc, name=self.name)

Initialization of ``Learner_Class`` saves two attributes, ``m``
and ``name``, of the classifier. Notice that both parameters are
optional, and the default value for ``m`` is 0, making the naive Bayes
m-estimate equal to the relative frequency unless the user specifies some
other value for m. Function ``__call__`` is called with the training
data set, computes class and conditional probabilities and constructs a
``Classifier``, passing it the probabilities along with some other variables
required for classification (part of :download:`bayes.py <code/bayes.py>`)::

    class Classifier:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

        def __call__(self, example, result_type=orange.GetValue):
            # compute the class probabilities
            p = map(None, self.p_class)
            for c in range(len(self.domain.classVar.values)):
                for a in range(len(self.domain.attributes)):
                    if not example[a].isSpecial():
                        p[c] *= self.p_cond[a][int(example[a])][c]

            # normalize probabilities to sum to 1
            sum = 0.
            for pp in p: sum += pp
            if sum > 0:
                for i in range(len(p)): p[i] = p[i]/sum

            # find the class with the highest probability
            v_index = p.index(max(p))
            v = orange.Value(self.domain.classVar, v_index)

            # return the value based on the requested return type
            if result_type == orange.GetValue:
                return v
            if result_type == orange.GetProbabilities:
                return p
            return (v, p)

        def show(self):
            print 'm=', self.m
            print 'class prob=', self.p_class
            print 'cond prob=', self.p_cond

Upon initialization, the classifier will store the values of the
parameters it was called with (``__init__``). When called with a data
instance, it will first compute the class probabilities using the
prior probabilities sent by the learner. The probabilities will be
normalized to sum to 1. The class with the highest probability will then
be found, and the classifier will accordingly predict this
class. Notice that we have also added a method called ``show``, which
reports on m, the class probabilities and the conditional probabilities::

    >>> import orange, bayes
    >>> data = orange.ExampleTable("voting")
    >>> classifier = bayes.Learner(data)
    >>> classifier.show()
    m= 0.0
    class prob= [0.38620689655172413, 0.61379310344827587]
    cond prob= [[[0.79761904761904767, 0.38202247191011235], ...]]
    >>>

The following script tests our naive Bayes and compares it to
10-nearest neighbors. Running the script (do run it yourself) reports
classification accuracies of just about 90% (:download:`bayes_test.py <code/bayes_test.py>`, uses
:download:`bayes.py <code/bayes.py>` and :download:`voting.tab <code/voting.tab>`)::

    import orange, orngEval, bayes
    data = orange.ExampleTable("voting")

    bayes = bayes.Learner(m=2, name='my bayes')
    knn = orange.kNNLearner(k=10)
    knn.name = "knn"

    learners = [knn, bayes]
    results = orngEval.CrossValidation(learners, data)
    for i in range(len(learners)):
        print learners[i].name, orngEval.CA(results)[i]

Bagging
-------

Here we show how to use the schema that allows us to build our own
learners/classifiers for bagging. While you can find bagging,
boosting, and other ensemble-related methods in the :py:mod:`Orange.ensemble`
module, we thought that explaining how to code bagging in Python makes for
a nice example. The following pseudo-code (from
Witten & Frank: Data Mining) illustrates the main idea of bagging::

    MODEL GENERATION
    Let n be the number of instances in the training data.
    For each of t iterations:
        Sample n instances with replacement from training data.
        Apply the learning algorithm to the sample.
        Store the resulting model.

    CLASSIFICATION
    For each of the t models:
        Predict class of instance using model.
    Return class that has been predicted most often.

Following the above idea, our ``Learner_Class`` will need
to build t classifiers and will have to pass them to ``Classifier``,
which, given a data instance, will use them for
classification. We will allow the parameter t to be specified by the user,
10 being the default.

The code for the ``Learner_Class`` is therefore (part of
:download:`bagging.py <code/bagging.py>`)::

    class Learner_Class:
        def __init__(self, learner, t=10, name='bagged classifier'):
            self.t = t
            self.name = name
            self.learner = learner

        def __call__(self, examples, weight=None):
            n = len(examples)
            classifiers = []
            for i in range(self.t):
                selection = []
                for i in range(n):
                    selection.append(random.randrange(n))
                data = examples.getitems(selection)
                classifiers.append(self.learner(data))

            return Classifier(classifiers = classifiers, \
                name=self.name, domain=examples.domain)

Upon initialization, ``__init__`` stores the base learner (the one that
will be bagged), the value of the parameter t, and the name of the
classifier. Note that while the learner requires the base learner to
be specified, the parameters t and name are optional.
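
For instance, the bagged learner might be constructed in either of the
following ways (``tree`` stands for any base learner, as in the test
script at the end of this section)::

    baggedTree = bagging.Learner(learner=tree)                       # t=10, default name
    baggedTree5 = bagging.Learner(learner=tree, t=5, name='bagged tree')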

When the learner is called with examples, a list of t classifiers is
built and stored in the variable ``classifiers``. Notice that for data
sampling with replacement, a list of data instance indices is built
(``selection``) and then used to sample the data from the training
examples (``examples.getitems``). Finally, a ``Classifier`` is constructed
with the list of classifiers, the name and the domain information (part of
:download:`bagging.py <code/bagging.py>`)::

    class Classifier:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

        def __call__(self, example, resultType = orange.GetValue):
            freq = [0.] * len(self.domain.classVar.values)
            for c in self.classifiers:
                freq[int(c(example))] += 1
            index = freq.index(max(freq))
            value = orange.Value(self.domain.classVar, index)
            for i in range(len(freq)):
                freq[i] = freq[i]/len(self.classifiers)
            if resultType == orange.GetValue: return value
            elif resultType == orange.GetProbabilities: return freq
            else: return (value, freq)

At initialization, ``Classifier`` stores all parameters it was
invoked with. When called with a data instance, a list ``freq`` is
initialized whose length equals the number of classes and which
records the number of models that classify the instance to each
class. The class that the majority of models voted for is returned. While
it may be possible to return the class's index, or even its name, by
convention classifiers in Orange return a ``Value`` object instead.

Notice that while, originally, bagging was not intended to compute
class probabilities, we compute these as the proportion of models
that voted for a certain class (this is probably incorrect, but it
suffices for our example, and does not hurt if only class values and
not probabilities are used).
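
A brief sketch of how these vote-based probabilities could be inspected,
assuming bagging.py provides the same ``Learner`` wrapper used in the test
script below (the data set and base learner also follow that script;
picking ``data[0]`` is arbitrary)::

    import orange, orngTree, bagging
    data = orange.ExampleTable("adult_sample")

    tree = orngTree.TreeLearner(minExamples=30)
    classifier = bagging.Learner(learner=tree, t=5)(data)

    value, votes = classifier(data[0], orange.GetBoth)
    print value, votes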

Here is the code that tests the bagging we have just implemented. It
compares a decision tree and its bagged variant. Run it yourself to
see which one is better (:download:`bagging_test.py <code/bagging_test.py>`, uses :download:`bagging.py <code/bagging.py>` and
:download:`adult_sample.tab <code/adult_sample.tab>`)::

    import orange, orngTree, orngEval, bagging
    data = orange.ExampleTable("adult_sample")

    tree = orngTree.TreeLearner(mForPrunning=10, minExamples=30)
    tree.name = "tree"
    baggedTree = bagging.Learner(learner=tree, t=5)

    learners = [tree, baggedTree]

    results = orngEval.crossValidation(learners, data, folds=5)
    for i in range(len(learners)):
        print learners[i].name, orngEval.CA(results)[i]
