.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
    :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
========

:obj:`ImputerConstructor` is the abstract root of a hierarchy of classes
that take training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new instance with the missing
values imputed (the original instance is left intact). When called with an
:obj:`Orange.data.Table`, it returns a new table with imputed instances.

.. class:: ImputerConstructor

    .. attribute:: imputeClass

        Indicates whether to impute the class value. Defaults to True.

    .. attribute:: deterministic

        Indicates whether to initialize the random generator with the
        instance's CRC. Defaults to False.

Simple imputation
=================

Simple imputers always impute the same value for a particular attribute,
disregarding the values of other attributes. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

        An :obj:`Orange.data.Instance` with the default values to be
        imputed instead of the missing ones. Instances to be imputed must
        be from the same domain as :obj:`defaults`.
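The contract described above can be illustrated with a small, Orange-independent sketch (the function name and representation here are hypothetical, not part of the Orange API): an imputer holds a row of default values and fills in the missing entries of an instance, leaving the original untouched.

```python
def impute_defaults(instance, defaults, missing=None):
    """Return a new row with missing entries replaced by defaults.

    `instance` and `defaults` are plain lists over the same domain;
    `missing` marks an unknown value. The input list is left intact,
    mirroring the behaviour described for Orange's Imputer.
    """
    if len(instance) != len(defaults):
        raise ValueError("instance and defaults must share a domain")
    return [d if v is missing else v for v, d in zip(instance, defaults)]

row = ['A', 1853, None, 2]
defaults = ['?', 0, 804, 0]
print(impute_defaults(row, defaults))  # ['A', 1853, 804, 2]
print(row)                             # original unchanged
```

Calling the imputer on a whole table then amounts to applying this function to each row.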
Instances of this class can be constructed by
:obj:`Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, they impute the smallest, largest or average
value encountered in the training examples.

For discrete features, they impute the lowest value (the one with index 0,
that is, attr.values[0]), the highest (attr.values[-1]) or the most common
value encountered in the data, respectively.

The first two imputers are mostly useful when the discrete values are
ordered according to their impact on the class (for instance, possible
values for symptoms of some disease can be ordered according to their
seriousness). The minimal and maximal imputers then represent optimistic
and pessimistic imputations.

The following code loads the bridges data, and first imputes the values in
a single example and then in the whole table.

:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23

This example shows what the imputer does, not how it should be used. Do not
impute all the data and then use it for cross-validation. As warned at the
top of this page, see the instructions for actual `use of imputers
<#using-imputers>`_.

.. note:: The :obj:`ImputerConstructor` classes have a dual-purpose
   constructor: if given the data, the constructor returns an
   :obj:`Imputer` - the above call is equivalent to calling
   :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`.

You can also construct an :obj:`Orange.feature.imputation.Imputer_defaults`
yourself and specify your own defaults.
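The three constructors above can be mimicked in plain Python (a sketch with invented names, not Orange code): given the known training values of one feature, compute the value each constructor would impute.

```python
from collections import Counter

def training_defaults(column, kind):
    """Return the value a minimal/maximal/average-style constructor
    would impute for one feature.

    `column` is a list of known (non-missing) training values; `kind`
    is 'minimal', 'maximal' or 'average'. For discrete features,
    'average' means the most common value, as described in the text.
    """
    if kind == 'minimal':
        return min(column)
    if kind == 'maximal':
        return max(column)
    if kind == 'average':
        if all(isinstance(v, (int, float)) for v in column):
            return sum(column) / len(column)
        return Counter(column).most_common(1)[0][0]
    raise ValueError("unknown kind: %s" % kind)

print(training_defaults([804, 1000, 1200], 'minimal'))        # 804
print(training_defaults(['SHORT', 'MEDIUM', 'MEDIUM'], 'average'))  # MEDIUM
```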
Alternatively, you can leave some values unspecified, in which case the
imputer won't impute them, as in the following example. Here, the only
attribute whose values will get imputed is "LENGTH"; the imputed value
will be 1234.

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

The constructor of :obj:`Orange.feature.imputation.Imputer_defaults`
accepts either an argument of type :obj:`Orange.data.Domain` (in which
case it constructs an empty instance for :obj:`defaults`) or an example.
Be careful with the latter:
:obj:`Orange.feature.imputation.Imputer_defaults` keeps a reference to the
instance, not a copy. You can make a copy yourself to avoid problems:
instead of `Imputer_defaults(data)`, write
`Imputer_defaults(Orange.data.Instance(data))`.

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

        Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

        If true (default is False), the random generator is initialized
        for each example using the example's hash value as a seed. As a
        result, the same example is always imputed the same values.

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the attribute's value from the
    values of other attributes. :obj:`ImputerConstructor_model` is given a
    learning algorithm (two, actually - one for discrete and one for
    continuous attributes) and constructs a classifier for each attribute.
    The constructed imputer :obj:`Imputer_model` stores a list of
    classifiers which are used when needed.

    .. attribute:: learner_discrete, learner_continuous

        Learners for discrete and for continuous attributes.
        If either of them is missing, the attributes of the corresponding
        type won't get imputed.

    .. attribute:: use_class

        Tells whether the imputer is allowed to use the class value. As
        this is most often undesired, this option is set to False by
        default. It can, however, be useful in a more complex design in
        which one imputer would be used for learning examples (this one
        would use the class value) and another for testing examples (which
        would not use the class value, as it is unavailable at that
        moment).

.. class:: Imputer_model

    .. attribute:: models

        A list of classifiers, each corresponding to one attribute of the
        examples whose values are to be imputed. The :obj:`classVar` of
        each model should equal the corresponding attribute of the
        examples. If any classifier is missing (that is, the corresponding
        element of the list is :obj:`None`), the corresponding attribute's
        values will not be imputed.

.. rubric:: Examples

The following imputer predicts the missing attribute values using
classification and regression trees with a minimum of 20 examples in a
leaf. Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

We could even use the same learner for discrete and continuous attributes,
as :class:`Orange.classification.tree.TreeLearner` checks the class type
and constructs regression or classification trees accordingly. The common
parameters, such as the minimal number of examples in leaves, are used in
both cases.

You can also use different learning algorithms for discrete and
continuous attributes.
A common setup will probably be to use
:class:`Orange.classification.bayes.BayesLearner` for discrete and
:class:`Orange.regression.mean.MeanLearner` (which just remembers the
average) for continuous attributes. Part of
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

You can also construct an :class:`Imputer_model` yourself. You will do
this if different attributes need different treatment. Brace for an
example that is a bit more complex. First we construct an
:class:`Imputer_model` and initialize an empty list of models. The
following code snippets are from
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 108-109

Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and
"THROUGH". Since "LANES" is continuous, it suffices to construct a
:obj:`DefaultClassifier` with the default value 2.0 (don't forget the
decimal part, or else Orange will think you mean an index of a discrete
value - how else could it tell?). For the discrete attribute "T-OR-D" we
could construct a :class:`Orange.classification.ConstantClassifier` and
give the index of the value "THROUGH" as an argument. But we shall do it
more nicely, by constructing an :class:`Orange.data.Value`. Both
classifiers are stored at the appropriate places in :obj:`imputer.models`.

.. literalinclude:: code/imputation-complex.py
    :lines: 110-112


"LENGTH" will be computed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of
course).
Note that we initialized the domain by simply giving a list with the names
of the attributes, with the original domain as an additional argument in
which Orange will look up the named attributes.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

We printed the tree just to see what it looks like::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for the "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done with
:class:`Orange.classifier.ClassifierByLookupTable`, which would be faster
than what we plan here; see the corresponding documentation on lookup
classifiers. Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

:obj:`compute_span` could also be written as a class, if you prefer. What
matters is that it behaves like a classifier, that is, gets an example and
returns a value. The second argument tells, as usual, what the caller
expects the classifier to return - a value, a distribution or both. Since
the caller, :obj:`Imputer_model`, always wants values, we shall ignore the
argument (at the risk of having problems in the future when imputers might
handle distributions as well).

Missing values as special values
================================

Missing values sometimes have a special meaning. The fact that something
was not measured can sometimes tell a lot. Be cautious, however, when
using such values in decision models: if the decision not to measure
something (for instance, performing a laboratory test on a patient) is
based on the expert's knowledge of the class value, such unknown values
clearly should not be used in models.
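The encoding itself can be sketched in a few lines of plain Python (an Orange-independent illustration; the function name and representation are made up): a discrete column gains an extra "NA" value, while a continuous column is split into a "def"/"undef" indicator plus an average-filled copy.

```python
def as_value(column, discrete, missing=None):
    """Encode missingness as information.

    For a discrete column, unknowns become the extra value 'NA'.
    For a continuous column, return a ('def'/'undef') indicator column
    together with a copy in which unknowns are replaced by the average
    of the known values.
    """
    if discrete:
        return ['NA' if v is missing else v for v in column]
    known = [v for v in column if v is not missing]
    avg = sum(known) / len(known)
    indicator = ['undef' if v is missing else 'def' for v in column]
    filled = [avg if v is missing else v for v in column]
    return indicator, filled

print(as_value(['IRON', None, 'WOOD'], discrete=True))
# ['IRON', 'NA', 'WOOD']
ind, filled = as_value([1874, None], discrete=False)
print(ind, filled)  # ['def', 'undef'] [1874, 1874.0]
```

Orange's class, described next, performs essentially this transformation at the level of domains and attributes.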
.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete attribute is replaced
    with a new attribute that has one additional value, "NA". The new
    attribute computes its values on the fly from the old one, copying the
    regular values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete
    attribute with values "def" and "undef", telling whether the value of
    the continuous attribute was defined or not. The new attribute's name
    equals the original's with "_def" appended. The original continuous
    attribute remains in the domain and its unknowns are replaced by
    averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue`, which converts the example
    into the new domain; the conversion imputes the values of discrete
    attributes. If continuous attributes are present, their unknown values
    are replaced by averages.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new attributes constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous attributes. Present only if there
        are any.

The following code shows what this imputer actually does to the domain.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

Seemingly, the two examples have the same attributes (with
:samp:`imputed` having a few additional ones). If you check
:samp:`original.domain == imputed.domain`, however, you will see that this
first impression is false. The attributes only have the same names; they
are different attributes, and Orange does not really care about attribute
names.

Therefore, if we wrote :samp:`imputed[i]` the program would fail, since
:samp:`imputed` has no attribute :samp:`i`. But it has an attribute with
the same name (which usually even has the same value). We therefore use
:samp:`i.name` to index the attributes of :samp:`imputed`. (Using names
for indexing is not fast, though; if you do it a lot, compute the integer
index with :samp:`imputed.domain.index(i.name)`.)

For continuous attributes, there is an additional attribute with "_def"
appended; we get it by :samp:`i.name+"_def"`.

The first continuous attribute, "ERECTED", is defined. Its value remains
1874 and the additional attribute "ERECTED_def" has the value "def". Not
so for "LENGTH": its undefined value is replaced by the average (1567) and
the new attribute has the value "undef". The undefined discrete attribute
"CLEAR-G" (like all other undefined discrete attributes) is assigned the
value "NA".

Using imputers
==============

To properly use the imputation classes in a learning process, they must be
trained on training examples only. Imputing the missing values and
subsequently using the data set in cross-validation will give overly
optimistic results.
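The rule can be made concrete with a small, Orange-independent sketch (the helper names here are hypothetical): learn the imputation values on the training part only, then apply them to both the training and the testing part.

```python
def fit_defaults(rows, missing=None):
    """Learn one default per column (the minimum of the known values),
    using the training rows only."""
    cols = list(zip(*rows))
    return [min(v for v in col if v is not missing) for col in cols]

def apply_defaults(rows, defaults, missing=None):
    """Impute with defaults learned elsewhere; this step never looks at
    the given rows to choose the imputed values."""
    return [[d if v is missing else v for v, d in zip(r, defaults)]
            for r in rows]

train = [[1, 10], [3, None], [2, 30]]
test = [[None, 25]]
defaults = fit_defaults(train)         # fitted on the training part only
print(apply_defaults(test, defaults))  # [[1, 25]]
```

Fitting `fit_defaults` on the union of train and test would be exactly the mistake the paragraph above warns against.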
Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values generally provide a slot
for an imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To
it you can assign an imputer constructor - one of the above constructors
or a specific constructor you wrote yourself. When given learning
examples, :obj:`Orange.classification.logreg.LogRegLearner` passes them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to
get an imputer (again, one of the above or a specific imputer you
programmed). It immediately uses the imputer to impute the missing values
in the learning data set, so they can be used by the actual learning
algorithm. Moreover, when the classifier
:obj:`Orange.classification.logreg.LogRegClassifier` is constructed, the
imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification time, the imputer is used to impute the missing values in
the (testing) examples.

Although details may vary from algorithm to algorithm, this is how
imputation is generally used in Orange's learners. If you write your own
learners, it is recommended that you use imputation according to the
described procedure.

Wrapper for learning algorithms
===============================

Imputation is used by learning algorithms and other methods that are not
capable of handling unknown values. It imputes missing values, calls the
learner and, if imputation is also needed by the classifier, wraps the
classifier into a wrapper that imputes missing values in the examples to
be classified.
.. literalinclude:: code/imputation-logreg.py
    :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot
handle missing values should, in principle, provide slots for an imputer
constructor. For instance,
:obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and
even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they
    can handle missing values appropriately. The Bayesian classifier, for
    instance, simply skips the corresponding attributes in the formula,
    while classification/regression trees have components for handling
    missing values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter for a
    separate page, but we show its code here as another demonstration of
    how to use the imputers - logistic regression is implemented in
    essentially the same way as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an
    instance of some classifier. There are a few attributes that need to
    be set, though.

    .. attribute:: base_learner

        The wrapped learner.

    .. attribute:: imputer_constructor

        An instance of a class derived from :obj:`ImputerConstructor`
        (or a class with the same call operator).

    .. attribute:: dont_impute_classifier

        If given and set (this attribute is optional), the classifier
        will not be wrapped into an imputer.
        Do this if the classifier doesn't mind if the examples it is given
        have missing values.

    The learner is best illustrated by its code - here is its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this: :obj:`ImputeLearner` first constructs
    the imputer (that is, calls :obj:`self.imputer_constructor` to get a
    trained imputer). Then it uses the imputer to impute the data and
    calls the given :obj:`base_learner` to construct a classifier. For
    instance, :obj:`base_learner` could be a learner for logistic
    regression and the result would be a logistic regression model. If
    the classifier can handle unknown values (that is, if
    :obj:`dont_impute_classifier` is set), we return it as it is;
    otherwise, we wrap it into an :obj:`ImputeClassifier`, which is given
    the base classifier and the imputer it can use to impute the missing
    values in (testing) examples.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: baseClassifier

        The wrapped classifier.

    .. attribute:: imputer

        An imputer for imputation of unknown values.

    .. method:: __call__

        This class is even more trivial than the learner. Its constructor
        accepts two arguments, the classifier and the imputer, which are
        stored in the corresponding attributes.
        The call operator which does the classification then looks like
        this::

            def __call__(self, ex, what=orange.GetValue):
                return self.base_classifier(self.imputer(ex), what)

        It imputes the missing values by calling the :obj:`imputer` and
        passes the imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if
   you do cross-validation, the imputer will be trained on the proper
   training data only. In the classification phase, we again use the
   imputer that was trained on the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments
into the instance's dictionary. You are expected to call it like
:obj:`ImputeLearner(base_learner=<someLearner>,
imputer=<someImputerConstructor>)`. When the learner is called with
examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` with the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and
the classifier's prediction is returned.
Note that this code is slightly simplified; the omitted details handle
non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally if needed, it can sometimes happen that an expert will be able
to tell you exactly what to put in the data instead of the missing values.
In this example we shall suppose that we want to impute the minimal value
of each feature. We will try to determine whether the naive Bayesian
classifier with its implicit internal imputation works better than one
that uses imputation by minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

This should output::

    Without imputation: 0.903
    With imputation: 0.899
.. note::
   We constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same
   instance is used twice in each fold. Once it is given the examples as
   they are and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`; the second time it
   is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is
   wrapped into an :obj:`Orange.feature.imputation.ImputeClassifier`. We
   thus have only one learner, which produces two different classifiers
   in each round of testing.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all
Orange classes do so; refer to the documentation on `subtyping the Orange
classes in Python <callbacks.htm>`_ for a list). If you want to write
your own imputation constructor or imputer, you simply need to program a
Python function that behaves like the built-in Orange classes. (For an
imputer it is even less: you only need to write a function that gets an
example as an argument; imputation of example tables will then use that
function.)

You will most often write the imputation constructor when you have a
special imputation procedure or separate procedures for various
attributes, as we demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically
only need to pack everything we have written there into an imputer
constructor that accepts a data set and the id of the weight
meta-attribute (ignore it if you will, but you must accept two arguments)
and returns the imputer (probably an
:obj:`Orange.feature.imputation.Imputer_model`).
The benefit of implementing an imputer constructor, as opposed to what we
did above, is that you can use such a constructor as a component for
Orange learners (like logistic regression) or for wrappers from the
module orngImpute, and in this way properly use it in classifier testing
procedures.
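Following the description above, a minimal imputer constructor can be written as a plain function. This is an Orange-independent sketch with invented names: it accepts the data and a weight id (ignored here, but the two-argument signature is kept) and returns a callable that imputes single instances.

```python
def my_imputer_constructor(data, weight_id=0):
    """A hand-written imputer constructor, in the spirit of the text:
    accept the data and the weight id, and return an imputer. The
    returned imputer fills unknowns (None) with the per-column minimum
    observed in `data`."""
    cols = list(zip(*data))
    defaults = [min(v for v in col if v is not None) for col in cols]

    def imputer(instance):
        # Behave like an imputer for single instances: return a new,
        # fully defined row, leaving the original untouched.
        return [d if v is None else v for v, d in zip(instance, defaults)]

    return imputer

imp = my_imputer_constructor([[1, 5], [2, None], [3, 7]])
print(imp([None, None]))  # [1, 5]
```

A constructor of this shape can then be plugged in wherever an imputer constructor component is expected.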