Changeset 9890:90ee12181920 in orange
- 02/07/12 10:29:29 (2 years ago)
- 1 edited
Imputers
--------

:obj:`ImputerConstructor` is the abstract root in a hierarchy of classes
…

.. attribute:: defaults

    An instance of :obj:`Orange.data.Instance` with the default values to be
    imputed instead of missing values. Examples to be imputed must be from
    the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

…
pessimistic imputations.

User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example "LENGTH" is the
only attribute to get imputed with value 1234:

…

:obj:`DefaultClassifier`. A float must be given, because integer values are
interpreted as indexes of discrete features. Discrete feature "T-OR-D" is
imputed using :class:`Orange.classification.ConstantClassifier`, which is
given the index of value "THROUGH" as an argument.

…

Using imputers
--------------

Imputation is also used by learning algorithms and other methods that are
not capable of handling unknown values.

Learners with imputer as a component
====================================
Learners that cannot handle missing values should provide a slot for an
imputer constructor. An example of such a class is
:obj:`~Orange.classification.logreg.LogRegLearner`, with the attribute
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`,
which imputes the average value by default. When given learning instances,
:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor` to
get an imputer and use it to impute the missing values in the learning
data. Imputed data is then used by the actual learning algorithm. Also,
when a classifier :obj:`~Orange.classification.logreg.LogRegClassifier` is
constructed, the imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification, the same imputer is used for imputation of missing values
in (testing) examples.

…
it is recommended to use imputation according to the described procedure.

Wrapper for learning algorithms
===============================

In a learning/classification process, imputation is needed on two
occasions. Before learning, the imputer needs to process the training
examples. Afterwards, the imputer is called for each instance to be
classified. For example, in cross validation, imputation should be done on
training folds only. Imputing the missing values on all data and
subsequently performing cross-validation will give overly optimistic
results.
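The train-folds-only rule above can be illustrated with a small sketch in plain Python. This is not Orange's API; the mean imputer and the toy data are invented for illustration only.

```python
# A minimal sketch of the rule above: the imputer's statistics come from
# the training fold only, and the same trained imputer is then applied to
# testing instances.

def train_mean_imputer(rows):
    """Return an imputer that fills None with per-column training means."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        known = [r[c] for r in rows if r[c] is not None]
        means.append(sum(known) / len(known))
    return lambda row: [means[c] if row[c] is None else row[c]
                        for c in range(n_cols)]

train_fold = [[1.0, 2.0], [3.0, None], [5.0, 4.0]]
test_fold = [[None, 6.0]]

impute = train_mean_imputer(train_fold)   # trained on the training fold
print([impute(r) for r in test_fold])     # prints [[3.0, 6.0]]
```

Training the imputer on all data instead of `train_fold` is exactly the mistake that produces overly optimistic cross-validation scores.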
Most of Orange's learning algorithms do not use imputers because they can
appropriately handle the missing values. The Bayesian classifier, for
instance, simply skips the corresponding attributes in the formula, while
classification/regression trees have components for handling the missing
values in various ways.

If for any reason you want to run these algorithms on imputed data, you
can use this wrapper.
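As a rough sketch of the wrapper idea, in plain Python rather than Orange's actual classes (all names below are illustrative):

```python
# Minimal sketch of the impute-then-learn wrapper pattern described above.
# None of these names are Orange's actual API.

class ImputeLearnerSketch:
    def __init__(self, base_learner, imputer_constructor):
        self.base_learner = base_learner
        self.imputer_constructor = imputer_constructor

    def __call__(self, data):
        imputer = self.imputer_constructor(data)    # train the imputer
        imputed = [imputer(row) for row in data]    # impute training data
        classifier = self.base_learner(imputed)     # learn on imputed data
        return ImputeClassifierSketch(classifier, imputer)

class ImputeClassifierSketch:
    def __init__(self, classifier, imputer):
        self.classifier = classifier
        self.imputer = imputer

    def __call__(self, row):
        # impute the instance first, then let the base classifier decide
        return self.classifier(self.imputer(row))

# Toy components: impute None with 0, classify a row by the sign of its sum.
zero_imputer_constructor = lambda data: (
    lambda row: [0 if v is None else v for v in row])
sum_learner = lambda data: (
    lambda row: "pos" if sum(row) >= 0 else "neg")

learner = ImputeLearnerSketch(sum_learner, zero_imputer_constructor)
model = learner([[1, None], [-2, 3]])
print(model([None, -4]))  # prints neg
```

The trained imputer travels inside the returned classifier, so testing instances are imputed with statistics from the learning data.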
.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an
    instance of some classifier.

    .. attribute:: base_learner

    …

    .. attribute:: imputer_constructor

        An instance of a class derived from :obj:`ImputerConstructor` or a
        class with the same call operator.

    .. attribute:: dont_impute_classifier

        If set, the classifier will not be wrapped into an imputer. Use
        this if the classifier can handle examples with missing values.

    The learner is best illustrated by its code - here's its complete
    …

            return ImputeClassifier(base_classifier, trained_imputer)
    During learning, :obj:`ImputeLearner` will first construct the imputer.
    It will then impute the data and call the given :obj:`base_learner` to
    construct a classifier. For instance, :obj:`base_learner` could be a
    learner for logistic regression and the result would be a logistic
    regression model. If the classifier can handle unknown values (that is,
    if :obj:`dont_impute_classifier` is set), it is returned as is;
    otherwise it is wrapped into :obj:`ImputeClassifier`, which is given
    the base classifier and the imputer it can use to impute the missing
    values in (testing) examples.

.. class:: ImputeClassifier

…

    .. method:: __call__

        This class's constructor accepts and stores two arguments, the
        classifier and the imputer. The call operator for classification
        looks like this::

            def __call__(self, ex, what=orange.GetValue):
    …

.. note::
    In this setup the imputer is trained on the training data; even if you
    do cross validation, the imputer will be trained on the right data. In
    the classification phase, the imputer will be used to impute testing
    data.
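The :obj:`dont_impute_classifier` branch described earlier can be sketched in plain Python (illustrative names, not Orange's actual code):

```python
# Sketch of the learning step with the dont_impute_classifier switch.
# The names and signature are invented for illustration.

def impute_learner_call(data, base_learner, imputer_constructor,
                        dont_impute_classifier=False):
    trained_imputer = imputer_constructor(data)
    imputed_data = [trained_imputer(row) for row in data]
    base_classifier = base_learner(imputed_data)
    if dont_impute_classifier:
        # the base classifier copes with missing values on its own
        return base_classifier
    # otherwise wrap it, so testing instances are imputed as well
    return lambda row: base_classifier(trained_imputer(row))
```

Either way the base classifier is trained on imputed data; the switch only decides whether testing instances are imputed before classification.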
.. rubric:: Code of ImputeLearner and ImputeClassifier

The learner is called with
:obj:`Orange.feature.imputation.ImputeLearner(base_learner=<someLearner>, imputer=<someImputerConstructor>)`.
When given examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` by the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

Note that this code is slightly simplified; the omitted details handle
non-essential technical issues that are unrelated to imputation::

    …
        return self.base_classifier(self.imputer(ex), what)
Write your own imputer
----------------------
Imputation classes provide the Python-callback functionality. The simplest
way to write custom imputation constructors or imputers is to write a
Python function that behaves like the built-in Orange classes. For
imputers, it is enough to write a function that gets an instance as its
argument; imputation for data tables will then use that function.

Special imputation procedures, or separate procedures for various
attributes, as demonstrated in the description of
:obj:`~Orange.feature.imputation.ImputerConstructor_model`, are achieved
by encoding them in a constructor that accepts a data table and the id of
the weight meta-attribute, and returns the imputer. The benefit of
implementing an imputer constructor is that you can use it as a component
for learners (for example, in logistic regression) or wrappers, and that
way properly use the classifier in testing procedures.

..
    This was commented out:

    Examples
    --------

    Missing values sometimes have a special meaning, so they need to be
    replaced by a designated value. Sometimes we know what to replace the
    missing value with; for instance, in a medical problem, some laboratory
    tests might not be done when it is known what their results would be.
    In that case, we impute a certain fixed value instead of the missing
    one. In the most complex case, we assign values that are computed based
    on some model; we can, for instance, impute the average or majority
    value, or even a value computed from the values of other, known
    features, using a classifier.

    In general, the imputer itself needs to be trained. This is, of course,
    not needed when the imputer imputes a certain fixed value. However,
    when it imputes the average or majority value, it needs to compute the
    statistics on the training examples, and use them afterwards for
    imputation of training and testing examples.

    While reading this document, bear in mind that imputation is a part of
    the learning process. If we fit the imputation model, for instance, by
    learning how to predict the feature's value from other features, or
    even if we simply compute the average or the minimal value for the
    feature and use it in imputation, this should only be done on learning
    data. Orange provides simple means for doing that.

    This page will first explain how to construct various imputers. Then
    follow the examples for `proper use of imputers <#using-imputers>`_.
    Finally, quite often you will want to use imputation with special
    requests, such as certain features' missing values getting replaced by
    constants and others by values computed using models induced from
    specified other features. For instance, in one of the studies we worked
    on, the patient's pulse rate needed to be estimated using regression
    trees that included the scope of the patient's injuries, sex and age;
    some attributes' values were replaced by the most pessimistic ones and
    others were computed with regression trees based on values of all
    features. If you are using learners that need the imputer as a
    component, you will need to `write your own imputer constructor
    <#write-your-own-imputer-constructor>`_. This is trivial and is
    explained at the end of this page.
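Following the "write your own imputer" recipe, a custom imputer constructor is just a callable that takes a data table and a weight id and returns an imputer. A plain-Python sketch, with invented names and toy data (Orange's real constructors return :obj:`Imputer_model`-like objects):

```python
# Sketch: a custom "impute the minimum" constructor with the
# (data, weight_id) signature described above. The weight id must be
# accepted even if it is ignored, as the text notes.

def minimal_imputer_constructor(data, weight_id=0):
    n_cols = len(data[0])
    minima = [min(r[c] for r in data if r[c] is not None)
              for c in range(n_cols)]
    def imputer(row):
        return [minima[c] if row[c] is None else row[c]
                for c in range(n_cols)]
    return imputer

data = [[2.0, None], [1.0, 7.0], [4.0, 5.0]]
impute = minimal_imputer_constructor(data)
print(impute([None, None]))  # prints [1.0, 5.0]
```

Packaged this way, the constructor can be plugged in wherever a built-in imputer constructor is expected, which is the benefit the section describes.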