Changeset 9806:01ddd2a2ff48 in orange


Timestamp:
02/06/12 13:17:48 (2 years ago)
Author:
tomazc <tomaz.curk@…>
Branch:
default
rebase_source:
d3c410b3069550234ec7d5373e7b973a1e0b0012
Message:

Updated documentation on Orange.feature.imputation.

Files:
2 added
2 edited

  • Orange/feature/imputation.py

    r9671 r9806  
    1 """ 
    2 ########################### 
    3 Imputation (``imputation``) 
    4 ########################### 
    5  
    6 .. index:: imputation 
    7  
    8 .. index::  
    9    single: feature; value imputation 
    10  
    11  
    12 Imputation is the procedure of replacing missing feature values with 
    13 appropriate values. It is needed by methods (learning algorithms and 
    14 others) that cannot handle unknown values, for instance logistic 
    15 regression. 
    16  
    17 Missing values sometimes have a special meaning, so they need to be replaced 
    18 by a designated value. Sometimes we know what to replace the missing value 
    19 with; for instance, in a medical problem, some laboratory tests might not be 
    20 done when it is known what their results would be. In that case, we impute a 
    21 certain fixed value instead of the missing one. In the most complex case, we 
    22 assign values that are computed based on some model; we can, for instance, 
    23 impute the average or majority value, or even a value computed from the values 
    24 of other, known features, using a classifier. 
    25  
    26 In a learning/classification process, imputation is needed on two occasions. 
    27 Before learning, the imputer needs to process the training examples. 
    28 Afterwards, the imputer is called for each example to be classified. 
    29  
    30 In general, the imputer itself needs to be trained. This is, of course, not 
    31 needed when the imputer imputes a certain fixed value. However, when it imputes 
    32 the average or majority value, it needs to compute the statistics on the 
    33 training examples, and use them afterwards for imputation of training and 
    34 testing examples. 
    35  
    36 While reading this document, bear in mind that imputation is a part of the 
    37 learning process. If we fit the imputation model, for instance, by learning 
    38 how to predict the feature's value from other features, or even if we  
    39 simply compute the average or the minimal value for the feature and use it 
    40 in imputation, this should only be done on learning data. If cross validation 
    41 is used for sampling, imputation should be done on training folds only. Orange 
    42 provides simple means for doing that. 
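The principle above can be sketched in plain Python, independent of Orange (the helper names here are ours, not part of the Orange API): imputation statistics are computed on the training fold only and then applied to both training and testing rows.

```python
def train_imputer(rows):
    """Compute per-column minima on training rows only (None = missing)."""
    cols = range(len(rows[0]))
    return [min(r[c] for r in rows if r[c] is not None) for c in cols]

def impute(row, defaults):
    """Return a new row with missing values replaced by the defaults."""
    return [d if v is None else v for v, d in zip(row, defaults)]

train = [[1.0, 5.0], [3.0, None], [2.0, 7.0]]
test = [[None, 6.0]]

defaults = train_imputer(train)          # statistics from training fold only
imputed_test = [impute(r, defaults) for r in test]
```

In a cross-validation loop, `train_imputer` would be called anew on each training fold, never on the full data set.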
    43  
    44 This page will first explain how to construct various imputers. Then follow 
    45 examples of the `proper use of imputers <#using-imputers>`_. Finally, quite 
    46 often you will want to use imputation with special requests, such as certain 
    47 features' missing values getting replaced by constants and others by values 
    48 computed using models induced from specified other features. For instance, 
    49 in one of the studies we worked on, the patient's pulse rate needed to be 
    50 estimated using regression trees that included the scope of the patient's 
    51 injuries, sex and age; some attributes' values were replaced by the most 
    52 pessimistic ones and others were computed with regression trees based on 
    53 values of all features. If you are using learners that need the imputer as a 
    54 component, you will need to `write your own imputer constructor 
    55 <#write-your-own-imputer-constructor>`_. This is trivial and is explained at 
    56 the end of this page. 
    57  
    58 Wrapper for learning algorithms 
    59 =============================== 
    60  
    61 This wrapper can be used with learning algorithms that cannot handle missing 
    62 values: it will impute the missing values using the imputer, call the 
    63 learning algorithm and, if imputation is also needed by the classifier, wrap 
    64 the resulting classifier into another wrapper that will impute the missing 
    65 values in examples to be classified. 
    66  
    67 Even so, the module is somewhat redundant, as all learners that cannot handle  
    68 missing values should, in principle, provide the slots for imputer constructor. 
    69 For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute  
    70 :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even 
    71 if you don't set it, it will do some imputation by default. 
    72  
    73 .. class:: ImputeLearner 
    74  
    75     Wraps a learner and performs imputation of missing values before learning. 
    76  
    77     Most of Orange's learning algorithms do not use imputers because they can 
    78     appropriately handle the missing values. Bayesian classifier, for instance, 
    79     simply skips the corresponding attributes in the formula, while 
    80     classification/regression trees have components for handling the missing 
    81     values in various ways. 
    82  
    83     If for any reason you want these algorithms to run on imputed data, 
    84     you can use this wrapper. The class is described on a separate page, 
    85     but we shall show its code here as another demonstration of how to 
    86     use the imputers - logistic regression is implemented essentially in 
    87     the same way as the classes below. 
    88  
    89     This is basically a learner, so the constructor will return either an 
    90     instance of :obj:`ImputeLearner` or, if called with examples, an instance 
    91     of some classifier. There are a few attributes that need to be set, though. 
    92  
    93     .. attribute:: base_learner  
    94      
    95     A wrapped learner. 
    96  
    97     .. attribute:: imputer_constructor 
    98      
    99     An instance of a class derived from :obj:`ImputerConstructor` (or a class 
    100     with the same call operator). 
    101  
    102     .. attribute:: dont_impute_classifier 
    103  
    104     If given and set (this attribute is optional), the classifier will not be 
    105     wrapped into an imputer. Do this if the classifier doesn't mind if the 
    106     examples it is given have missing values. 
    107  
    108     The learner is best illustrated by its code - here's its complete 
    109     :obj:`__call__` method:: 
    110  
    111         def __call__(self, data, weight=0): 
    112             trained_imputer = self.imputer_constructor(data, weight) 
    113             imputed_data = trained_imputer(data, weight) 
    114             base_classifier = self.base_learner(imputed_data, weight) 
    115             if self.dont_impute_classifier: 
    116                 return base_classifier 
    117             else: 
    118                 return ImputeClassifier(base_classifier, trained_imputer) 
    119  
    120     So "learning" goes like this. :obj:`ImputeLearner` will first construct 
    121     the imputer (that is, call :obj:`self.imputer_constructor` to get a 
    122     trained imputer). Then it will use the imputer to impute the data, and 
    123     call the given :obj:`base_learner` to construct a classifier. For 
    124     instance, :obj:`base_learner` could be a learner for logistic regression 
    125     and the result would be a logistic regression model. If the classifier 
    126     can handle unknown values (that is, if :obj:`dont_impute_classifier` is 
    127     set), we return it as it is; otherwise we wrap it into 
    128     :obj:`ImputeClassifier`, which is given the base classifier and the 
    129     imputer to impute the missing values in (testing) examples. 
    130  
    131 .. class:: ImputeClassifier 
    132  
    133     Objects of this class are returned by :obj:`ImputeLearner` when given data. 
    134  
    135     .. attribute:: baseClassifier 
    136      
    137     A wrapped classifier. 
    138  
    139     .. attribute:: imputer 
    140      
    141     An imputer for imputation of unknown values. 
    142  
    143     .. method:: __call__  
    144      
    145     This class is even more trivial than the learner. Its constructor accepts  
    146     two arguments, the classifier and the imputer, which are stored into the 
    147     corresponding attributes. The call operator which does the classification 
    148     then looks like this:: 
    149  
    150         def __call__(self, ex, what=orange.GetValue): 
    151             return self.base_classifier(self.imputer(ex), what) 
    152  
    153     It imputes the missing values by calling the :obj:`imputer` and passes 
    154     the imputed example to the base classifier. 
    155  
    156 .. note::  
    157    In this setup the imputer is trained on the training data - even if you do 
    158    cross validation, the imputer will be trained on the right data. In the 
    159    classification phase we again use the imputer which was trained on the 
    160    training data only. 
    161  
    162 .. rubric:: Code of ImputeLearner and ImputeClassifier  
    163  
    164 :obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into 
    165 the instance's dictionary. You are expected to call it like 
    166 :obj:`ImputeLearner(base_learner=<someLearner>, 
    167 imputer_constructor=<someImputerConstructor>)`. When the learner is called with 
    168 examples, it trains the imputer, imputes the data, induces a 
    169 :obj:`base_classifier` by the :obj:`base_learner` and constructs an 
    170 :obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the 
    171 :obj:`imputer`. For classification, the missing values are imputed and the 
    172 classifier's prediction is returned. 
    172  
    173 Note that this code is slightly simplified; the omitted details handle 
    174 non-essential technical issues that are unrelated to imputation:: 
    175  
    176     class ImputeLearner(orange.Learner): 
    177         def __new__(cls, examples = None, weightID = 0, **keyw): 
    178             self = orange.Learner.__new__(cls, **keyw) 
    179             self.__dict__.update(keyw) 
    180             if examples: 
    181                 return self.__call__(examples, weightID) 
    182             else: 
    183                 return self 
    184      
    185         def __call__(self, data, weight=0): 
    186             trained_imputer = self.imputer_constructor(data, weight) 
    187             imputed_data = trained_imputer(data, weight) 
    188             base_classifier = self.base_learner(imputed_data, weight) 
    189             return ImputeClassifier(base_classifier, trained_imputer) 
    190      
    191     class ImputeClassifier(orange.Classifier): 
    192         def __init__(self, base_classifier, imputer): 
    193             self.base_classifier = base_classifier 
    194             self.imputer = imputer 
    195      
    196         def __call__(self, ex, what=orange.GetValue): 
    197             return self.base_classifier(self.imputer(ex), what) 
    198  
    199 .. rubric:: Example 
    200  
    201 Although most of Orange's learning algorithms will take care of imputation 
    202 internally, if needed, it can sometimes happen that an expert will be able to 
    203 tell you exactly what to put in the data instead of the missing values. In this 
    204 example we shall suppose that we want to impute the minimal value of each 
    205 feature. We will try to determine whether the naive Bayesian classifier with 
    206 its implicit internal imputation works better than one that uses imputation by 
    207 minimal values. 
    208  
    209 :download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`): 
    210  
    211 .. literalinclude:: code/imputation-minimal-imputer.py 
    212     :lines: 7- 
    213      
    214 Should output this:: 
    215  
    216     Without imputation: 0.903 
    217     With imputation: 0.899 
    218  
    219 .. note::  
    220    Note that we constructed just one instance of \ 
    221    :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is 
    222    used twice in each fold: once it is given the examples as they are and 
    223    returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`; 
    224    the second time it is called by :obj:`imba` and the \ 
    225    :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped 
    226    into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only 
    227    one learner, which produces two different classifiers in each round of 
    228    testing. 
    229  
    230 Abstract imputers 
    231 ================= 
    232  
    233 As is common in Orange, imputation is done by a pair of classes: one that does 
    234 the work and another that constructs it. :obj:`ImputerConstructor` is the 
    235 abstract root of a hierarchy of classes that get the training data (with an 
    236 optional id for weight) and construct an instance of a class derived from 
    237 :obj:`Imputer`. An :obj:`Imputer` can be called with an 
    238 :obj:`Orange.data.Instance` and it will return a new example with the missing 
    239 values imputed (it will leave the original example intact!). If an imputer is 
    240 called with an :obj:`Orange.data.Table`, it will return a new example table 
    241 with imputed examples. 
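The constructor/imputer pairing can be sketched in plain Python (the class names below are hypothetical stand-ins for minimal-value imputation, not the actual Orange classes):

```python
class MinimalImputerConstructor:
    """Called with the data, returns a trained imputer (Orange-style pattern)."""
    def __call__(self, rows):
        defaults = [min(v for v in col if v is not None)
                    for col in zip(*rows)]
        return MinimalImputer(defaults)

class MinimalImputer:
    """Called with a row, returns a new imputed row; the original stays intact."""
    def __init__(self, defaults):
        self.defaults = defaults
    def __call__(self, row):
        return [d if v is None else v for v, d in zip(row, self.defaults)]

imputer = MinimalImputerConstructor()([[1, None], [2, 5]])
row = [None, None]
new_row = imputer(row)
```

Calling the constructor and then the imputer in one expression mirrors the dual-purpose constructors described below.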
    242  
    243 .. class:: ImputerConstructor 
    244  
    245     .. attribute:: imputeClass 
    246      
    247     Tells whether to impute the class value (default) or not. 
    248  
    249 Simple imputation 
    250 ================= 
    251  
    252 The simplest imputers always impute the same value for a particular attribute, 
    253 disregarding the values of other attributes. They all use the same imputer 
    254 class, :obj:`Imputer_defaults`. 
    255      
    256 .. class:: Imputer_defaults 
    257  
    258     .. attribute::  defaults 
    259      
    260     An example with the default values to be imputed instead of the missing 
    261     ones. Examples to be imputed must be from the same domain as :obj:`defaults`. 
    262  
    263     Instances of this class can be constructed by  
    264     :obj:`Orange.feature.imputation.ImputerConstructor_minimal`,  
    265     :obj:`Orange.feature.imputation.ImputerConstructor_maximal`, 
    266     :obj:`Orange.feature.imputation.ImputerConstructor_average`.  
    267  
    268     For continuous features, they will impute the smallest, largest or 
    269     average value encountered in the training examples. For discrete features, 
    270     they will impute the lowest (the one with index 0, e.g. attr.values[0]), 
    271     the highest (attr.values[-1]), or the most common value encountered in 
    272     the data. The first two imputers will mostly be used when the discrete 
    273     values are ordered according to their impact on the class (for instance, 
    274     possible values for symptoms of some disease can be ordered according to 
    275     their seriousness). The minimal and maximal imputers will then represent 
    276     optimistic and pessimistic imputations. 
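A rough sketch of how such defaults could be computed, in plain Python with a hypothetical helper name (Orange's own imputer constructors do this internally):

```python
from collections import Counter

def simple_defaults(rows, discrete):
    """Per-column defaults: the most common value for discrete columns,
    the minimum for continuous ones (None marks a missing value)."""
    defaults = []
    for i, col in enumerate(zip(*rows)):
        known = [v for v in col if v is not None]
        if discrete[i]:
            defaults.append(Counter(known).most_common(1)[0][0])
        else:
            defaults.append(min(known))
    return defaults

rows = [["a", 2.0], ["b", None], ["a", 1.5]]
defaults = simple_defaults(rows, discrete=[True, False])
```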
    277  
    278     The following code will load the bridges data, and first impute the values 
    279     in a single example and then in the whole table. 
    280  
    281 :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
    282  
    283 .. literalinclude:: code/imputation-complex.py 
    284     :lines: 9-23 
    285  
    286 This example shows what the imputer does, not how it is to be used. Don't 
    287 impute all the data and then use it for cross-validation. As warned at the top 
    288 of this page, see the instructions for actual `use of 
    289 imputers <#using-imputers>`_. 
    290  
    291 .. note:: :obj:`ImputerConstructor` is another class with a dual-purpose 
    292   constructor: if you give the constructor the data, it will return an \ 
    293   :obj:`Imputer` - the above call is equivalent to calling \ 
    294   :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`. 
    295  
    296 You can also construct the :obj:`Orange.feature.imputation.Imputer_defaults` 
    297 yourself and specify your own defaults. Or leave some values unspecified, in 
    298 which case the imputer won't impute them, as in the following example. Here, 
    299 the only attribute whose values will get imputed is "LENGTH"; the imputed value 
    300 will be 1234. 
    301  
    302 .. literalinclude:: code/imputation-complex.py 
    303     :lines: 56-69 
    304  
    305 :obj:`Orange.feature.imputation.Imputer_defaults`'s constructor will accept an 
    306 argument of type :obj:`Orange.data.Domain` (in which case it will construct an 
    307 empty instance for :obj:`defaults`) or an example. Be careful with this: 
    308 :obj:`Orange.feature.imputation.Imputer_defaults` will keep a reference to the 
    309 instance and not a copy. You can make a copy yourself to avoid problems: 
    310 instead of `Imputer_defaults(data[0])` you may want to write 
    311 `Imputer_defaults(Orange.data.Instance(data[0]))`. 
    312  
    313 Random imputation 
    314 ================= 
    315  
    316 .. class:: Imputer_Random 
    317  
    318     Imputes random values. The corresponding constructor is 
    319     :obj:`ImputerConstructor_Random`. 
    320  
    321     .. attribute:: impute_class 
    322      
    323     Tells whether to impute the class values or not. Defaults to True. 
    324  
    325     .. attribute:: deterministic 
    326  
    327     If true (default is False), the random generator is initialized for each 
    328     example using the example's hash value as a seed. This results in the same 
    329     examples always being imputed the same values. 
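The deterministic behaviour can be illustrated in plain Python (the function name and row representation are ours, not Orange API): the row's own hash seeds the generator, so equal rows always receive equal imputations.

```python
import random

def impute_random(row, values, deterministic=False):
    """Replace missing entries (None) with random choices from the
    per-column value lists; with deterministic=True the row's own hash
    seeds the generator, so equal rows get equal imputations."""
    rng = random.Random(hash(tuple(row)) if deterministic else None)
    return [rng.choice(values[i]) if v is None else v
            for i, v in enumerate(row)]

values = [["a", "b"], [1, 2, 3]]
row = ["a", None]
same = impute_random(row, values, deterministic=True)
again = impute_random(row, values, deterministic=True)
```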
    330      
    331 Model-based imputation 
    332 ====================== 
    333  
    334 .. class:: ImputerConstructor_model 
    335  
    336     Model-based imputers learn to predict the attribute's value from values of 
    337     other attributes. :obj:`ImputerConstructor_model` is given a learning 
    338     algorithm (two, actually - one for discrete and one for continuous 
    339     attributes) and constructs a classifier for each attribute. The 
    340     constructed imputer :obj:`Imputer_model` stores a list of classifiers which 
    341     are used when needed. 
    342  
    343     .. attribute:: learner_discrete, learner_continuous 
    344      
    345     Learner for discrete and for continuous attributes. If any of them is 
    346     missing, the attributes of the corresponding type won't get imputed. 
    347  
    348     .. attribute:: use_class 
    349      
    350     Tells whether the imputer is allowed to use the class value. As this is 
    351     most often undesired, this option is by default set to False. It can 
    352     however be useful for a more complex design in which we would use one 
    353     imputer for learning examples (this one would use the class value) and 
    354     another for testing examples (which would not use the class value as this 
    355     is unavailable at that moment). 
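The idea can be sketched in plain Python; here a simple 1-nearest-neighbour predictor stands in for the per-attribute learner (all names are illustrative, not Orange API):

```python
def model_impute(rows):
    """Per-column model-based imputation sketch: for each column with a
    missing value, 'train' a 1-nearest-neighbour predictor on the rows
    where that column is known, using the remaining known columns as
    features. A real imputer plugs in arbitrary learners here."""
    n = len(rows[0])
    result = [list(r) for r in rows]
    for c in range(n):
        train = [r for r in rows if r[c] is not None]
        for r in result:
            if r[c] is None:
                feats = [i for i in range(n) if i != c and r[i] is not None]
                nearest = min(train, key=lambda t: sum(
                    (t[i] - r[i]) ** 2 for i in feats))
                r[c] = nearest[c]
    return result

rows = [[1.0, 10.0], [5.0, 50.0], [1.2, None]]
imputed = model_impute(rows)
```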
    356  
    357 .. class:: Imputer_model 
    358  
    359     .. attribute:: models 
    360  
    361     A list of classifiers, each corresponding to one attribute of the examples 
    362     whose values are to be imputed. The :obj:`classVar`'s of the models should 
    363     equal the examples' attributes. If any of the classifiers is missing (that 
    364     is, the corresponding element of the list is :obj:`None`), the corresponding 
    365     attribute's values will not be imputed. 
    366  
    367 .. rubric:: Examples 
    368  
    369 The following imputer predicts the missing attribute values using 
    370 classification and regression trees with the minimum of 20 examples in a leaf.  
    371 Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
    372  
    373 .. literalinclude:: code/imputation-complex.py 
    374     :lines: 74-76 
    375  
    376 We could even use the same learner for discrete and continuous attributes, 
    377 as :class:`Orange.classification.tree.TreeLearner` checks the class type 
    378 and constructs regression or classification trees accordingly. The  
    379 common parameters, such as the minimal number of 
    380 examples in leaves, are used in both cases. 
    381  
    382 You can also use different learning algorithms for discrete and 
    383 continuous attributes. Probably a common setup will be to use 
    384 :class:`Orange.classification.bayes.BayesLearner` for discrete and  
    385 :class:`Orange.regression.mean.MeanLearner` (which 
    386 just remembers the average) for continuous attributes. Part of  
    387 :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
    388  
    389 .. literalinclude:: code/imputation-complex.py 
    390     :lines: 91-94 
    391  
    392 You can also construct an :class:`Imputer_model` yourself. You will do  
    393 this if different attributes need different treatment. Brace for an  
    394 example that will be a bit more complex. First we shall construct an  
    395 :class:`Imputer_model` and initialize an empty list of models.  
    396 The following code snippets are from 
    397 :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
    398  
    399 .. literalinclude:: code/imputation-complex.py 
    400     :lines: 108-109 
    401  
    402 Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and 
    403 "THROUGH". Since "LANES" is continuous, it suffices to construct a 
    404 :obj:`DefaultClassifier` with the default value 2.0 (don't forget the 
    405 decimal part, or else Orange will think you mean the index of a discrete 
    406 value - how could it tell?). For the discrete attribute "T-OR-D", we could 
    407 construct a :class:`Orange.classification.ConstantClassifier` and give the index of value 
    408 "THROUGH" as an argument. But we shall do it more elegantly, by constructing a 
    409 :class:`Orange.data.Value`. Both classifiers will be stored at the appropriate places 
    410 in :obj:`imputer.models`. 
    411  
    412 .. literalinclude:: code/imputation-complex.py 
    413     :lines: 110-112 
    414  
    415  
    416 "LENGTH" will be computed with a regression tree induced from "MATERIAL",  
    417 "SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of 
    418 course). Note that we initialized the domain by simply giving a list with 
    419 the names of the attributes, with the domain as an additional argument 
    420 in which Orange will look for the named attributes. 
    421  
    422 .. literalinclude:: code/imputation-complex.py 
    423     :lines: 114-119 
    424  
    425 We printed the tree just to see what it looks like. 
    426  
    427 :: 
    428  
    429     SPAN=SHORT: 1158 
    430     SPAN=LONG: 1907 
    431     SPAN=MEDIUM 
    432     |    ERECTED<1908.500: 1325 
    433     |    ERECTED>=1908.500: 1528 
    435  
    436 Small and nice. Now for "SPAN". Wooden bridges and walkways are short, 
    437 while the others are mostly medium. This could be done with 
    438 :class:`Orange.classification.lookup.ClassifierByLookupTable` - it would be 
    439 faster than what we plan here. See the corresponding documentation on lookup 
    440 classifiers. Here we are going to do it with a Python function. 
    441  
    442 .. literalinclude:: code/imputation-complex.py 
    443     :lines: 121-128 
    444  
    445 :obj:`compute_span` could also be written as a class, if you'd prefer 
    446 it. It's important that it behaves like a classifier, that is, gets an example 
    447 and returns a value. The second argument tells, as usual, what the caller 
    448 expects the classifier to return - a value, a distribution or both. Since the 
    449 caller, :obj:`Imputer_model`, always wants values, we shall ignore the argument 
    450 (at the risk of having problems in the future when imputers might handle 
    451 distributions as well). 
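For illustration only, such a classifier-like function could look as follows outside of Orange, with plain dicts standing in for examples and an illustrative decision rule (the argument default stands in for orange.GetValue):

```python
# what=0 stands in for orange.GetValue; the names below are illustrative.
def compute_span(example, what=0):
    """Behaves like a classifier: takes an example (here a dict of
    attribute values) and returns a value for "SPAN". The 'what'
    argument is accepted but ignored, as values are always wanted."""
    if example["TYPE"] == "WOOD" or example["PURPOSE"] == "WALK":
        return "SHORT"
    return "MEDIUM"

span = compute_span({"TYPE": "WOOD", "PURPOSE": "RR"})
```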
    452  
    453 Missing values as special values 
    454 ================================ 
    455  
    456 Missing values sometimes have a special meaning. The fact that something was 
    457 not measured can sometimes tell a lot. Be, however, cautious when using such 
    458 values in decision models; if the decision not to measure something (for 
    459 instance performing a laboratory test on a patient) is based on the expert's 
    460 knowledge of the class value, such unknown values clearly should not be used 
    461 in models. 
    462  
    463 .. class:: ImputerConstructor_asValue 
    464  
    465     Constructs a new domain in which each 
    466     discrete attribute is replaced with a new attribute that has one value more: 
    467     "NA". The new attribute will compute its values on the fly from the old one, 
    468     copying the normal values and replacing the unknowns with "NA". 
    469  
    470     For continuous attributes, it will 
    471     construct a two-valued discrete attribute with values "def" and "undef", 
    472     telling whether the continuous attribute was defined or not. The attribute's 
    473     name will equal the original's with "_def" appended. The original continuous 
    474     attribute will remain in the domain and its unknowns will be replaced by 
    475     averages. 
    476  
    477     :class:`ImputerConstructor_asValue` has no specific attributes. 
    478  
    479     It constructs :class:`Imputer_asValue` (I bet you 
    480     wouldn't guess). It converts the example into the new domain, which imputes  
    481     the values for discrete attributes. If continuous attributes are present, it  
    482     will also replace their values by the averages. 
    483  
    484 .. class:: Imputer_asValue 
    485  
    486     .. attribute:: domain 
    487  
    488         The domain with the new attributes constructed by  
    489         :class:`ImputerConstructor_asValue`. 
    490  
    491     .. attribute:: defaults 
    492  
    493         Default values for continuous attributes. Present only if there are any. 
    494  
    495 The following code shows what this imputer actually does to the domain. 
    496 Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
    497  
    498 .. literalinclude:: code/imputation-complex.py 
    499     :lines: 137-151 
    500  
    501 The script's output looks like this:: 
    502  
    503     [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE] 
    504  
    505     [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE] 
    506  
    507     RIVER: M -> M 
    508     ERECTED: 1874 -> 1874 (def) 
    509     PURPOSE: RR -> RR 
    510     LENGTH: ? -> 1567 (undef) 
    511     LANES: 2 -> 2 (def) 
    512     CLEAR-G: ? -> NA 
    513     T-OR-D: THROUGH -> THROUGH 
    514     MATERIAL: IRON -> IRON 
    515     SPAN: ? -> NA 
    516     REL-L: ? -> NA 
    517     TYPE: SIMPLE-T -> SIMPLE-T 
    518  
    519 Seemingly, the two examples have the same attributes (with 
    520 :samp:`imputed` having a few additional ones). If you check this by 
    521 :samp:`original.domain[0] == imputed.domain[0]`, you shall see that this 
    522 first impression is False. The attributes only have the same names, 
    523 but they are different attributes; Orange does not identify attributes 
    524 by their names alone. 
    526  
    527 Therefore, if we wrote :samp:`imputed[i]` the program would fail 
    528 since :samp:`imputed` has no attribute :samp:`i`. But it has an 
    529 attribute with the same name (which usually even has the same value). We 
    530 therefore use :samp:`i.name` to index the attributes of 
    531 :samp:`imputed`. (Using names for indexing is not fast, though; if you do 
    532 it a lot, compute the integer index with 
    533 :samp:`imputed.domain.index(i.name)`.) 
    534  
    535 For continuous attributes, there is an additional attribute with "_def" 
    536 appended; we get it by :samp:`i.name+"_def"`. 
    537  
    538 The first continuous attribute, "ERECTED", is defined. Its value remains 1874 
    539 and the additional attribute "ERECTED_def" has the value "def". Not so for 
    540 "LENGTH". Its undefined value is replaced by the average (1567) and the new 
    541 attribute has the value "undef". The undefined discrete attribute "CLEAR-G" (and 
    542 all other undefined discrete attributes) is assigned the value "NA". 
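The transformation can be sketched in plain Python (the function name and row encoding are illustrative, not Orange API): discrete missing values become the extra value "NA", while each continuous attribute gains a def/undef indicator and its missing values are replaced by the training average.

```python
def as_value(row, kinds, averages):
    """Sketch of the "missing as value" transformation. kinds[i] is "d"
    (discrete) or "c" (continuous); averages holds the per-column
    training averages for continuous attributes."""
    out = []
    for i, v in enumerate(row):
        if kinds[i] == "d":
            out.append("NA" if v is None else v)
        else:
            # Indicator first, mirroring the ERECTED_def/ERECTED ordering.
            out.append("def" if v is not None else "undef")
            out.append(averages[i] if v is None else v)
    return out

row = as_value([1874, None, None], kinds=["c", "c", "d"],
               averages={0: 1900.0, 1: 1567.0})
```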
    543  
    544 Using imputers 
    545 ============== 
    546  
    547 To properly use the imputation classes in a learning process, they must be 
    548 trained on training examples only. Imputing the missing values first and then 
    549 using the data set in cross-validation will give overly optimistic results. 
    550  
    551 Learners with imputer as a component 
    552 ------------------------------------ 
    553  
    554 Orange learners that cannot handle missing values will generally provide a slot 
    555 for the imputer component. An example of such a class is 
    556 :obj:`Orange.classification.logreg.LogRegLearner` with an attribute called 
    557 :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you 
    558 can assign an imputer constructor - one of the above constructors or a specific 
    559 constructor you wrote yourself. When given learning examples, 
    560 :obj:`Orange.classification.logreg.LogRegLearner` will pass them to 
    561 :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an 
    562 imputer (again some of the above or a specific imputer you programmed). It will 
    563 immediately use the imputer to impute the missing values in the learning data 
    564 set, so it can be used by the actual learning algorithm. Besides, when the 
    565 classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed, 
    566 the imputer will be stored in its attribute 
    567 :obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At 
    568 classification, the imputer will be used for imputation of missing values in 
    569 (testing) examples. 
    570  
    571 Although details may vary from algorithm to algorithm, this is how the 
    572 imputation is generally used in Orange's learners. Also, if you write your own 
    573 learners, it is recommended that you use imputation according to the described 
    574 procedure. 
    575  
    576 Write your own imputer 
    577 ====================== 
    578  
    579 Imputation classes provide the Python-callback functionality (not all Orange 
    580 classes do so; refer to the documentation on `subtyping the Orange classes 
    581 in Python <callbacks.htm>`_ for a list). If you want to write your own 
    582 imputation constructor or imputer, you simply need to write a Python 
    583 function that behaves like the built-in Orange classes. For an imputer it is 
    584 even simpler: you only need a function that takes an example as its 
    585 argument; imputation of example tables will then use that function. 
    586  
    587 You will most often write the imputation constructor when you have a special 
    588 imputation procedure or separate procedures for various attributes, as we've  
    589 demonstrated in the description of 
    590 :obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only  
    591 need to pack everything we've written there into an imputer constructor that 
    592 will accept a data set and the id of the weight meta-attribute (ignore it if 
    593 you will, but you must accept two arguments), and return the imputer (probably 
    594 an :obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing an 
    595 imputer constructor, as opposed to what we did above, is that you can use such a 
    596 constructor as a component for Orange learners (like logistic regression) or 
    597 for wrappers from module orngImpute, and that way use it properly in 
    598 classifier testing procedures. 
    599  
    600 """ 
    601  
    6021import Orange.core as orange 
    6032from orange import ImputerConstructor_minimal  
  • docs/reference/rst/Orange.feature.imputation.rst

    r9372 r9806  
    1 .. automodule:: Orange.feature.imputation 
     1.. py:currentmodule:: Orange.feature.imputation 
     2 
     3.. index:: imputation 
     4 
     5.. index:: 
     6   single: feature; value imputation 
     7 
     8*************************** 
     9Imputation (``imputation``) 
     10*************************** 
     11 
     12Imputation replaces missing feature values with appropriate values, in this 
     13case with minimal values: 
     14 
     15.. literalinclude:: code/imputation-values.py 
     16   :lines: 7- 
     17 
     18The output of this code is:: 
     19 
     20    Example with missing values 
     21    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD'] 
     22    Imputed values: 
     23    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD'] 
     24    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD'] 
     25 
     26Imputers 
     27================= 
     28 
     29:obj:`ImputerConstructor` is the abstract root of a hierarchy of classes 
     30that take training data and construct an instance of a class derived from 
     31:obj:`Imputer`. When an :obj:`Imputer` is called with an 
     32:obj:`Orange.data.Instance`, it returns a new example with the 
     33missing values imputed (leaving the original example intact). If an imputer is 
     34called with an :obj:`Orange.data.Table`, it returns a new example table 
     35with imputed instances. 
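This calling convention can be sketched in plain Python. Note that `MISSING`, the dict-based "examples" and the classes below are hypothetical stand-ins for illustration, not Orange's real :obj:`Instance`/:obj:`Table` API:

```python
MISSING = None  # stand-in for Orange's unknown value '?'

class Imputer:
    """Sketch of the Imputer calling convention: a single example in, a
    new example out; a table (list) in, a new table out. Originals are
    left intact."""

    def impute_value(self, feature, value):
        raise NotImplementedError

    def __call__(self, data):
        if isinstance(data, list):          # a "table": impute each row
            return [self(row) for row in data]
        # a single "example" (dict feature -> value): build a new one
        return {f: (self.impute_value(f, v) if v is MISSING else v)
                for f, v in data.items()}

class ImputeZero(Imputer):
    """Toy subclass that imputes 0 everywhere."""
    def impute_value(self, feature, value):
        return 0

row = {"LENGTH": MISSING, "LANES": 2}
imp = ImputeZero()
print(imp(row))    # missing LENGTH replaced by 0
print(row)         # the original example is unchanged
```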
     36 
     37.. class:: ImputerConstructor 
     38 
     39    .. attribute:: imputeClass 
     40 
     41    Indicates whether to impute the class value. Default is True. 
     42 
     43    .. attribute:: deterministic 
     44 
     45    Indicates whether to initialize the random generator with the example's CRC. Default is False. 
     46 
     47Simple imputation 
     48================= 
     49 
     50Simple imputers always impute the same value for a particular attribute, 
     51disregarding the values of other attributes. They all use the same class 
     52:obj:`Imputer_defaults`. 
     53 
     54.. class:: Imputer_defaults 
     55 
     56    .. attribute::  defaults 
     57 
     58    An instance :obj:`Orange.data.Instance` with the default values to be 
     59    imputed instead of missing. Examples to be imputed must be from the same 
     60    domain as :obj:`defaults`. 
     61 
     62Instances of this class can be constructed by 
     63:obj:`Orange.feature.imputation.ImputerConstructor_minimal`, 
     64:obj:`Orange.feature.imputation.ImputerConstructor_maximal`, 
     65:obj:`Orange.feature.imputation.ImputerConstructor_average`. 
     66 
     67For continuous features, they will impute the smallest, 
     68largest or the average value encountered in the training examples. 
     69 
     70For discrete features, they will impute the lowest value (the one with index 0, 
     71i.e. attr.values[0]), the highest (attr.values[-1]), 
     72or the most common value encountered in the data, respectively. 
     73 
     74The first two imputers 
     75will mostly be used when the discrete values are ordered according to their 
     76impact on the class (for instance, possible values for symptoms of some 
     77disease can be ordered according to their seriousness). The minimal and maximal 
     78imputers will then represent optimistic and pessimistic imputations. 
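The statistics these constructors compute can be sketched in plain Python. The function names and the list-based "columns" are assumptions for illustration; for the discrete "minimal" case, note that Orange picks the value with index 0 in the declared value order, while this sketch only covers the continuous case:

```python
from collections import Counter

MISSING = None  # stand-in for Orange's unknown value '?'

def constructor_minimal(column):
    # Continuous case of ImputerConstructor_minimal: the smallest value
    # seen among the known training values. (For discrete features
    # Orange uses attr.values[0], not the minimum of observed values.)
    known = [v for v in column if v is not MISSING]
    return min(known)

def constructor_maximal(column):
    # Continuous case of ImputerConstructor_maximal.
    known = [v for v in column if v is not MISSING]
    return max(known)

def constructor_average(column, discrete=False):
    # ImputerConstructor_average: mean for continuous features,
    # most common value for discrete ones.
    known = [v for v in column if v is not MISSING]
    if discrete:
        return Counter(known).most_common(1)[0][0]
    return sum(known) / len(known)

lengths = [1000, MISSING, 1200, 800]
spans = ["SHORT", "LONG", MISSING, "SHORT"]
print(constructor_minimal(lengths))               # 800
print(constructor_maximal(lengths))               # 1200
print(constructor_average(lengths))               # 1000.0
print(constructor_average(spans, discrete=True))  # SHORT
```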
     79 
     80The following code loads the bridges data, then imputes the values 
     81first in a single example and then in the whole table. 
     82 
     83:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
     84 
     85.. literalinclude:: code/imputation-complex.py 
     86    :lines: 9-23 
     87 
     88This example shows what the imputer does, not how it is to be used. Do not 
     89impute all the data and then use it for cross-validation. As warned at the top 
     90of this page, see the instructions for the actual `use of 
     91imputers <#using-imputers>`_. 
     92 
     93.. note:: :obj:`ImputerConstructor` classes have a dual-purpose 
     94  constructor: if you give the constructor the data, it will return an \ 
     95  :obj:`Imputer` - the above call is equivalent to calling \ 
     96  :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`. 
     97 
     98You can also construct the :obj:`Orange.feature.imputation.Imputer_defaults` 
     99yourself and specify your own defaults. Or leave some values unspecified, in 
     100which case the imputer won't impute them, as in the following example. Here, 
     101the only attribute whose values will get imputed is "LENGTH"; the imputed value 
     102will be 1234. 
     103 
     104.. literalinclude:: code/imputation-complex.py 
     105    :lines: 56-69 
     106 
      107:obj:`Orange.feature.imputation.Imputer_defaults`'s constructor will accept an 
      108argument of type :obj:`Orange.data.Domain` (in which case it will construct an 
      109empty instance for :obj:`defaults`) or an example. (Be careful with this: 
      110:obj:`Orange.feature.imputation.Imputer_defaults` will keep a reference to the 
      111instance, not a copy. But you can make a copy yourself to avoid problems: 
      112instead of `Imputer_defaults(data[0])` you may want to write 
      113`Imputer_defaults(Orange.data.Instance(data[0]))`.) 
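The reference-versus-copy pitfall is ordinary Python aliasing; here is a minimal sketch with a hypothetical `ImputerDefaults` class and dict-based examples standing in for Orange objects:

```python
class ImputerDefaults:
    """Sketch of the reference pitfall: the imputer keeps the very object
    you pass in, so later edits to that object change the defaults too."""
    def __init__(self, defaults):
        self.defaults = defaults       # a reference, not a copy

row = {"LENGTH": 1234}
imp = ImputerDefaults(row)
row["LENGTH"] = 0                      # mutates the shared object ...
print(imp.defaults["LENGTH"])          # ... so the default changed to 0

imp2 = ImputerDefaults(dict(row))      # pass a copy to stay safe
row["LENGTH"] = 99
print(imp2.defaults["LENGTH"])         # the copy still holds 0
```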
     114 
     115Random imputation 
     116================= 
     117 
     118.. class:: Imputer_Random 
     119 
     120    Imputes random values. The corresponding constructor is 
     121    :obj:`ImputerConstructor_Random`. 
     122 
     123    .. attribute:: impute_class 
     124 
     125    Tells whether to impute the class values or not. Defaults to True. 
     126 
     127    .. attribute:: deterministic 
     128 
      129If true (default is False), the random generator is seeded for each 
      130example with the example's hash value. As a result, the same example 
      131is always imputed the same values. 
     132 
     133Model-based imputation 
     134====================== 
     135 
     136.. class:: ImputerConstructor_model 
     137 
      138    Model-based imputers learn to predict the attribute's value from values of 
      139    other attributes. :obj:`ImputerConstructor_model` is given a learning 
      140    algorithm (two, actually - one for discrete and one for continuous 
      141    attributes) and constructs a classifier for each attribute. The 
      142    constructed imputer :obj:`Imputer_model` stores a list of classifiers which 
      143    are used when needed. 
     144 
     145    .. attribute:: learner_discrete, learner_continuous 
     146 
     147    Learner for discrete and for continuous attributes. If any of them is 
     148    missing, the attributes of the corresponding type won't get imputed. 
     149 
     150    .. attribute:: use_class 
     151 
     152    Tells whether the imputer is allowed to use the class value. As this is 
     153    most often undesired, this option is by default set to False. It can 
     154    however be useful for a more complex design in which we would use one 
     155    imputer for learning examples (this one would use the class value) and 
     156    another for testing examples (which would not use the class value as this 
     157    is unavailable at that moment). 
     158 
     159.. class:: Imputer_model 
     160 
      161    .. attribute:: models 
     162 
      163    A list of classifiers, each corresponding to one attribute of the examples 
      164    whose values are to be imputed. The :obj:`classVar`'s of the models should 
      165    equal the examples' attributes. If any classifier is missing (that is, if 
      166    the corresponding element of the list is :obj:`None`), the corresponding 
      167    attribute's values will not be imputed. 
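The per-attribute dispatch of :obj:`Imputer_model` can be sketched without Orange; `impute_with_models`, the dict-based example and the lambda "classifiers" are hypothetical stand-ins for illustration:

```python
MISSING = None  # stand-in for Orange's unknown value '?'

def impute_with_models(instance, models):
    """Sketch of Imputer_model's logic: one model per attribute, applied
    only where the value is missing; a None entry in `models` means the
    attribute is skipped and stays unimputed."""
    out = dict(instance)               # work on a copy, keep the original
    for attr, model in models.items():
        if out[attr] is MISSING and model is not None:
            out[attr] = model(out)     # model predicts from other values
    return out

models = {
    "LANES": lambda inst: 2.0,         # constant "classifier"
    "T-OR-D": lambda inst: "THROUGH",
    "LENGTH": None,                    # no model: left unimputed
}
bridge = {"LANES": MISSING, "T-OR-D": MISSING, "LENGTH": MISSING}
print(impute_with_models(bridge, models))
```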
     168 
     169.. rubric:: Examples 
     170 
     171The following imputer predicts the missing attribute values using 
     172classification and regression trees with the minimum of 20 examples in a leaf. 
     173Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
     174 
     175.. literalinclude:: code/imputation-complex.py 
     176    :lines: 74-76 
     177 
     178We could even use the same learner for discrete and continuous attributes, 
     179as :class:`Orange.classification.tree.TreeLearner` checks the class type 
     180and constructs regression or classification trees accordingly. The 
     181common parameters, such as the minimal number of 
     182examples in leaves, are used in both cases. 
     183 
     184You can also use different learning algorithms for discrete and 
     185continuous attributes. Probably a common setup will be to use 
     186:class:`Orange.classification.bayes.BayesLearner` for discrete and 
     187:class:`Orange.regression.mean.MeanLearner` (which 
     188just remembers the average) for continuous attributes. Part of 
     189:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
     190 
     191.. literalinclude:: code/imputation-complex.py 
     192    :lines: 91-94 
     193 
     194You can also construct an :class:`Imputer_model` yourself. You will do 
     195this if different attributes need different treatment. Brace for an 
     196example that will be a bit more complex. First we shall construct an 
     197:class:`Imputer_model` and initialize an empty list of models. 
     198The following code snippets are from 
     199:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
     200 
     201.. literalinclude:: code/imputation-complex.py 
     202    :lines: 108-109 
     203 
      204Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and 
      205"THROUGH". Since "LANES" is continuous, it suffices to construct a 
      206:obj:`DefaultClassifier` with the default value 2.0 (don't forget the 
      207decimal part, or else Orange will think you mean an index of a discrete 
      208value - how could it tell?). For the discrete attribute "T-OR-D", we could 
      209construct an :class:`Orange.classification.ConstantClassifier` and give the index of the value 
      210"THROUGH" as an argument. But we shall do it more elegantly, by constructing a 
      211:class:`Orange.data.Value`. Both classifiers will be stored at the appropriate places 
      212in :obj:`imputer.models`. 
     213 
     214.. literalinclude:: code/imputation-complex.py 
     215    :lines: 110-112 
     216 
     217 
     218"LENGTH" will be computed with a regression tree induced from "MATERIAL", 
     219"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of 
     220course). Note that we initialized the domain by simply giving a list with 
     221the names of the attributes, with the domain as an additional argument 
     222in which Orange will look for the named attributes. 
     223 
     224.. literalinclude:: code/imputation-complex.py 
     225    :lines: 114-119 
     226 
     227We printed the tree just to see what it looks like. 
     228 
     229:: 
     230 
      231    SPAN=SHORT: 1158 
      232    SPAN=LONG: 1907 
      233    SPAN=MEDIUM 
      234    |    ERECTED<1908.500: 1325 
      235    |    ERECTED>=1908.500: 1528 
     237 
      238Small and nice. Now for the "SPAN". Wooden bridges and walkways are short, 
      239while the others are mostly medium. This could be done with 
      240:class:`Orange.classification.lookup.ClassifierByLookupTable` - that would be faster 
      241than what we plan here. See the corresponding documentation on lookup 
      242classifiers. Here we are going to do it with a Python function. 
     243 
     244.. literalinclude:: code/imputation-complex.py 
     245    :lines: 121-128 
     246 
      247:obj:`compute_span` could also be written as a class, if you'd prefer 
      248it. It's important that it behaves like a classifier, that is, gets an example 
      249and returns a value. The second argument tells, as usual, what the caller expects 
      250the classifier to return - a value, a distribution or both. Since the caller, 
      251:obj:`Imputer_model`, always wants values, we shall ignore the argument 
      252(at the risk of having problems in the future, when imputers might handle 
      253distributions as well). 
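Such a function classifier can be sketched as below. The exact rule (wood or walkway implies short, everything else medium) and the dict-based example are assumptions for illustration, as is the `GET_VALUE` stand-in for :obj:`orange.GetValue`:

```python
GET_VALUE = 0  # stand-in for orange.GetValue

def compute_span(example, what=GET_VALUE):
    """A plain Python function acting as a classifier: it takes an
    example and returns a value. The second argument (value/distribution
    flag) is accepted but ignored, since the caller always wants values."""
    if example["MATERIAL"] == "WOOD" or example["PURPOSE"] == "WALK":
        return "SHORT"
    return "MEDIUM"

print(compute_span({"MATERIAL": "WOOD", "PURPOSE": "RR"}))   # SHORT
print(compute_span({"MATERIAL": "IRON", "PURPOSE": "RR"}))   # MEDIUM
```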
     254 
     255Missing values as special values 
     256================================ 
     257 
      258Missing values sometimes have a special meaning. The fact that something was 
      259not measured can sometimes tell a lot. Be, however, cautious when using such 
      260values in decision models; if the decision not to measure something (for 
      261instance, to skip a laboratory test on a patient) is based on the expert's 
      262knowledge of the class value, such unknown values clearly should not be used 
      263in models. 
     264 
     265.. class:: ImputerConstructor_asValue 
     266 
     267    Constructs a new domain in which each 
     268    discrete attribute is replaced with a new attribute that has one value more: 
     269    "NA". The new attribute will compute its values on the fly from the old one, 
     270    copying the normal values and replacing the unknowns with "NA". 
     271 
     272    For continuous attributes, it will 
     273    construct a two-valued discrete attribute with values "def" and "undef", 
     274    telling whether the continuous attribute was defined or not. The attribute's 
     275    name will equal the original's with "_def" appended. The original continuous 
     276    attribute will remain in the domain and its unknowns will be replaced by 
     277    averages. 
     278 
     279    :class:`ImputerConstructor_asValue` has no specific attributes. 
     280 
     281    It constructs :class:`Imputer_asValue` (I bet you 
     282    wouldn't guess). It converts the example into the new domain, which imputes 
     283    the values for discrete attributes. If continuous attributes are present, it 
     284    will also replace their values by the averages. 
     285 
     286.. class:: Imputer_asValue 
     287 
     288    .. attribute:: domain 
     289 
     290        The domain with the new attributes constructed by 
     291        :class:`ImputerConstructor_asValue`. 
     292 
     293    .. attribute:: defaults 
     294 
     295        Default values for continuous attributes. Present only if there are any. 
     296 
     297The following code shows what this imputer actually does to the domain. 
     298Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
     299 
     300.. literalinclude:: code/imputation-complex.py 
     301    :lines: 137-151 
     302 
     303The script's output looks like this:: 
     304 
     305    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE] 
     306 
     307    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE] 
     308 
     309    RIVER: M -> M 
     310    ERECTED: 1874 -> 1874 (def) 
     311    PURPOSE: RR -> RR 
     312    LENGTH: ? -> 1567 (undef) 
     313    LANES: 2 -> 2 (def) 
     314    CLEAR-G: ? -> NA 
     315    T-OR-D: THROUGH -> THROUGH 
     316    MATERIAL: IRON -> IRON 
     317    SPAN: ? -> NA 
     318    REL-L: ? -> NA 
     319    TYPE: SIMPLE-T -> SIMPLE-T 
     320 
      321At first glance, the two examples have the same attributes (with 
      322:samp:`imputed` having a few additional ones). But if you check 
      323:samp:`original.domain[0] == imputed.domain[0]`, you will see that the 
      324comparison is False. The attributes only have the same names, 
      325but they are different attributes. (If you have read this far, you already 
      326know that Orange does not really care about attribute 
      327names.) 
     328 
      329Therefore, if we wrote :samp:`imputed[i]` the program would fail 
      330since :samp:`imputed` has no attribute :samp:`i`. But it has an 
      331attribute with the same name (which usually even has the same value). We 
      332therefore use :samp:`i.name` to index the attributes of 
      333:samp:`imputed`. (Using names for indexing is not fast, though; if you do 
      334it a lot, compute the integer index with 
      335:samp:`imputed.domain.index(i.name)`.) 
     336 
     337For continuous attributes, there is an additional attribute with "_def" 
     338appended; we get it by :samp:`i.name+"_def"`. 
     339 
     340The first continuous attribute, "ERECTED" is defined. Its value remains 1874 
     341and the additional attribute "ERECTED_def" has value "def". Not so for 
     342"LENGTH". Its undefined value is replaced by the average (1567) and the new 
     343attribute has value "undef". The undefined discrete attribute "CLEAR-G" (and 
     344all other undefined discrete attributes) is assigned the value "NA". 
     345 
     346Using imputers 
     347============== 
     348 
     349To properly use the imputation classes in learning process, they must be 
     350trained on training examples only. Imputing the missing values and subsequently 
     351using the data set in cross-validation will give overly optimistic results. 
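The right ordering of operations can be sketched in plain Python; `fit_imputer` and `apply_imputer` are hypothetical helpers (averaging the known values), not Orange functions:

```python
def fit_imputer(train_rows):
    """'Train' a simple imputer: remember per-column averages of the
    known (non-missing) training values."""
    cols = train_rows[0].keys()
    return {c: sum(r[c] for r in train_rows if r[c] is not None) /
               sum(1 for r in train_rows if r[c] is not None)
            for c in cols}

def apply_imputer(defaults, rows):
    """Return new rows with missing values replaced by the defaults."""
    return [{c: (defaults[c] if v is None else v) for c, v in r.items()}
            for r in rows]

# Right: split first, fit the imputer on the training part only.
data = [{"x": 1.0}, {"x": None}, {"x": 3.0}, {"x": 4.0}]
train, test = data[:3], data[3:]
defaults = fit_imputer(train)             # average of 1.0 and 3.0 -> 2.0
train_imp = apply_imputer(defaults, train)
test_imp = apply_imputer(defaults, test)  # test filled with TRAIN statistics

# Wrong: calling fit_imputer(data) before splitting would leak test-set
# statistics into training and give overly optimistic evaluation results.
```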
     352 
     353Learners with imputer as a component 
     354------------------------------------ 
     355 
     356Orange learners that cannot handle missing values will generally provide a slot 
     357for the imputer component. An example of such a class is 
     358:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called 
     359:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you 
     360can assign an imputer constructor - one of the above constructors or a specific 
     361constructor you wrote yourself. When given learning examples, 
     362:obj:`Orange.classification.logreg.LogRegLearner` will pass them to 
     363:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an 
     364imputer (again some of the above or a specific imputer you programmed). It will 
     365immediately use the imputer to impute the missing values in the learning data 
     366set, so it can be used by the actual learning algorithm. Besides, when the 
     367classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed, 
     368the imputer will be stored in its attribute 
     369:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At 
     370classification, the imputer will be used for imputation of missing values in 
     371(testing) examples. 
     372 
     373Although details may vary from algorithm to algorithm, this is how the 
     374imputation is generally used in Orange's learners. Also, if you write your own 
     375learners, it is recommended that you use imputation according to the described 
     376procedure. 
     377 
     378Wrapper for learning algorithms 
     379=============================== 
     380 
      381Imputation is used by learning algorithms and other methods that cannot 
      382handle unknown values. The wrapper imputes the missing values, 
      383calls the learner and, if imputation is also needed by the classifier, 
      384wraps the resulting classifier into another wrapper that imputes missing 
      385values in examples to classify. 
     386 
     387.. literalinclude:: code/imputation-logreg.py 
     388   :lines: 7- 
     389 
     390The output of this code is:: 
     391 
     392    Without imputation: 0.945 
     393    With imputation: 0.954 
     394 
     395Even so, the module is somewhat redundant, as all learners that cannot handle 
     396missing values should, in principle, provide the slots for imputer constructor. 
     397For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute 
     398:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even 
     399if you don't set it, it will do some imputation by default. 
     400 
     401.. class:: ImputeLearner 
     402 
      403    Wraps a learner and performs data imputation before learning. 
     404 
     405    Most of Orange's learning algorithms do not use imputers because they can 
     406    appropriately handle the missing values. Bayesian classifier, for instance, 
     407    simply skips the corresponding attributes in the formula, while 
     408    classification/regression trees have components for handling the missing 
     409    values in various ways. 
     410 
      411    If for any reason you want these algorithms to run on imputed data, 
      412    you can use this wrapper. The class is described on a separate 
      413    page, but we show its code here as another demonstration of how to 
      414    use the imputers - logistic regression is implemented essentially the same 
      415    way as the classes below. 
      416 
      417    This is basically a learner, so the constructor will return either an 
      418    instance of :obj:`ImputeLearner` or, if called with examples, an instance 
      419    of some classifier. There are a few attributes that need to be set, though. 
     420 
     421    .. attribute:: base_learner 
     422 
     423    A wrapped learner. 
     424 
     425    .. attribute:: imputer_constructor 
     426 
     427    An instance of a class derived from :obj:`ImputerConstructor` (or a class 
     428    with the same call operator). 
     429 
     430    .. attribute:: dont_impute_classifier 
     431 
     432    If given and set (this attribute is optional), the classifier will not be 
     433    wrapped into an imputer. Do this if the classifier doesn't mind if the 
     434    examples it is given have missing values. 
     435 
     436    The learner is best illustrated by its code - here's its complete 
     437    :obj:`__call__` method:: 
     438 
     439        def __call__(self, data, weight=0): 
     440            trained_imputer = self.imputer_constructor(data, weight) 
     441            imputed_data = trained_imputer(data, weight) 
     442            base_classifier = self.base_learner(imputed_data, weight) 
     443            if self.dont_impute_classifier: 
     444                return base_classifier 
     445            else: 
     446                return ImputeClassifier(base_classifier, trained_imputer) 
     447 
      448    So "learning" goes like this. :obj:`ImputeLearner` will first construct 
      449    the imputer (that is, call :obj:`self.imputer_constructor` to get a trained 
      450    imputer). Then it will use the imputer to impute the data, and call the 
      451    given :obj:`base_learner` to construct a classifier. For instance, 
      452    :obj:`base_learner` could be a learner for logistic regression and the 
      453    result would be a logistic regression model. If the classifier can handle 
      454    unknown values (that is, if :obj:`dont_impute_classifier` is set), we return 
      455    it as it is; otherwise we wrap it into :obj:`ImputeClassifier`, which is given 
      456    the base classifier and the imputer it can use to impute the missing 
      457    values in (testing) examples. 
     458 
     459.. class:: ImputeClassifier 
     460 
     461    Objects of this class are returned by :obj:`ImputeLearner` when given data. 
     462 
     463    .. attribute:: baseClassifier 
     464 
     465    A wrapped classifier. 
     466 
     467    .. attribute:: imputer 
     468 
     469    An imputer for imputation of unknown values. 
     470 
     471    .. method:: __call__ 
     472 
     473    This class is even more trivial than the learner. Its constructor accepts 
     474    two arguments, the classifier and the imputer, which are stored into the 
     475    corresponding attributes. The call operator which does the classification 
     476    then looks like this:: 
     477 
     478        def __call__(self, ex, what=orange.GetValue): 
     479            return self.base_classifier(self.imputer(ex), what) 
     480 
      481    It imputes the missing values by calling the :obj:`imputer` and passes the 
      482    imputed example to the base classifier. 
     483 
      484.. note:: 
      485   In this setup the imputer is trained on the training data - even if you do 
      486   cross-validation, the imputer will be trained on the right data. In the 
      487   classification phase we again use the imputer that was trained on the 
      488   training data only. 
     489 
     490.. rubric:: Code of ImputeLearner and ImputeClassifier 
     491 
      492:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into 
      493the instance's dictionary. You are expected to call it like 
      494:obj:`ImputeLearner(base_learner=<someLearner>, 
      495imputer_constructor=<someImputerConstructor>)`. When the learner is called with 
      496examples, it trains the imputer, imputes the data, induces a :obj:`base_classifier` 
      497with the :obj:`base_learner` and constructs an :obj:`ImputeClassifier` that stores the 
      498:obj:`base_classifier` and the :obj:`imputer`. For classification, the missing 
      499values are imputed and the classifier's prediction is returned. 
     500 
     501Note that this code is slightly simplified, although the omitted details handle 
     502non-essential technical issues that are unrelated to imputation:: 
     503 
     504    class ImputeLearner(orange.Learner): 
     505        def __new__(cls, examples = None, weightID = 0, **keyw): 
     506            self = orange.Learner.__new__(cls, **keyw) 
     507            self.__dict__.update(keyw) 
     508            if examples: 
     509                return self.__call__(examples, weightID) 
     510            else: 
     511                return self 
     512 
     513        def __call__(self, data, weight=0): 
     514            trained_imputer = self.imputer_constructor(data, weight) 
     515            imputed_data = trained_imputer(data, weight) 
     516            base_classifier = self.base_learner(imputed_data, weight) 
     517            return ImputeClassifier(base_classifier, trained_imputer) 
     518 
     519    class ImputeClassifier(orange.Classifier): 
     520        def __init__(self, base_classifier, imputer): 
     521            self.base_classifier = base_classifier 
     522            self.imputer = imputer 
     523 
     524        def __call__(self, ex, what=orange.GetValue): 
     525            return self.base_classifier(self.imputer(ex), what) 
     526 
     527.. rubric:: Example 
     528 
      529Although most of Orange's learning algorithms will take care of imputation 
      530internally, if needed, it can sometimes happen that an expert will be able to 
      531tell you exactly what to put in the data instead of the missing values. In this 
      532example we suppose that we want to impute the minimal value of each 
      533feature. We will try to determine whether the naive Bayesian classifier with 
      534its implicit internal imputation works better than one that uses imputation by 
      535minimal values. 
     536 
     537:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`): 
     538 
     539.. literalinclude:: code/imputation-minimal-imputer.py 
     540    :lines: 7- 
     541 
      542Should output this:: 
     543 
     544    Without imputation: 0.903 
     545    With imputation: 0.899 
     546 
      547.. note:: 
      548   Note that we constructed just one instance of \ 
      549   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is 
      550   used twice in each fold. The first time it is given the examples as they are 
      551   and returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`. 
      552   The second time it is called by :obj:`imba`, and the \ 
      553   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped 
      554   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only one 
      555   learner, but it produces two different classifiers in each round of 
      556   testing. 
     557 
     558Write your own imputer 
     559====================== 
     560 
     561Imputation classes provide the Python-callback functionality (not all Orange 
     562classes do so, refer to the documentation on `subtyping the Orange classes 
     563in Python <callbacks.htm>`_ for a list). If you want to write your own 
      564imputation constructor or an imputer, you simply need to program a Python 
      565function that behaves like the built-in Orange classes (for an imputer, 
      566even less is needed: a function that takes an example as its argument 
      567suffices; imputation of example tables will then use that function). 
     568 
     569You will most often write the imputation constructor when you have a special 
     570imputation procedure or separate procedures for various attributes, as we've 
     571demonstrated in the description of 
     572:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only 
      573need to pack everything we've written there into an imputer constructor that 
      574will accept a data set and the id of the weight meta-attribute (ignore it if 
      575you will, but you must accept two arguments), and return the imputer (probably 
      576an :obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing an 
      577imputer constructor, as opposed to what we did above, is that you can use such a 
      578constructor as a component for Orange learners (like logistic regression) or 
      579for wrappers from module orngImpute, and that way use it properly in 
      580classifier testing procedures. 
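The required shape of such a constructor can be sketched in plain Python; `my_imputer_constructor`, the dict-based rows and the averaging "model" are hypothetical stand-ins, not Orange's API:

```python
MISSING = None  # stand-in for Orange's unknown value '?'

def my_imputer_constructor(data, weight_id=0):
    """Sketch of a hand-written imputer constructor: a callable taking a
    data set and the id of the weight meta-attribute (which may be
    ignored, but must be accepted) and returning an imputer."""
    cols = data[0].keys()
    # The "model" here is just a per-column average over known values.
    defaults = {c: sum(r[c] for r in data if r[c] is not MISSING) /
                   sum(r[c] is not MISSING for r in data)
                for c in cols}

    def imputer(instance):
        # The returned imputer: a function taking one example and
        # returning a new example with missing values filled in.
        return {c: (defaults[c] if v is MISSING else v)
                for c, v in instance.items()}

    return imputer

data = [{"LENGTH": 1000.0}, {"LENGTH": MISSING}, {"LENGTH": 2000.0}]
imp = my_imputer_constructor(data, 0)
print(imp({"LENGTH": MISSING}))   # filled with the average, 1500.0
```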