source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9890:90ee12181920

Revision 9890:90ee12181920, 20.7 KB checked in by tomazc <tomaz.curk@…>, 2 years ago

Orange.feature.imputation. Closes #1073.

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
--------

:obj:`ImputerConstructor` is the abstract root of a hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new instance with the
missing values imputed (leaving the original instance intact). If an imputer
is called with an :obj:`Orange.data.Table`, it returns a new data table with
imputed instances.

.. class:: ImputerConstructor

    .. attribute:: impute_class

    Indicates whether to impute the class value. Defaults to True.
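
The construct-then-call protocol can be sketched in plain Python. This is an
illustrative minimal implementation with invented names
(``MinimalImputerConstructor``, ``MinimalImputer``), not the Orange code;
missing values are represented by ``None`` and instances by plain lists:

```python
# A minimal sketch of the imputer protocol: a constructor is called with
# training data and returns an imputer; the imputer maps an instance to a
# copy with missing values (here: None) filled in.
# These classes are illustrative only, not the Orange implementation.

class MinimalImputerConstructor:
    def __call__(self, data):
        # Learn the smallest observed value for each column.
        minima = [min(v for v in col if v is not None)
                  for col in zip(*data)]
        return MinimalImputer(minima)

class MinimalImputer:
    def __init__(self, defaults):
        self.defaults = defaults

    def __call__(self, instance):
        # Return a new instance; the original is left intact.
        return [d if v is None else v
                for v, d in zip(instance, self.defaults)]

data = [[1, 5], [3, None], [None, 2]]
imputer = MinimalImputerConstructor()(data)
print(imputer([None, None]))  # [1, 2] -- the smallest value of each column
```

The same two-step shape, training an imputer first and applying it to
individual instances later, is what all the constructors below follow.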

Simple imputation
=================

Simple imputers always impute the same value for a particular feature,
disregarding the values of other features. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

    An instance of :obj:`~Orange.data.Instance` with the default values to be
    imputed instead of missing values. Examples to be imputed must be from the
    same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, these impute the smallest, largest or average
value encountered in the training examples. For discrete features,
they impute the lowest value (the one with index 0, e.g. attr.values[0]),
the highest value (attr.values[-1]), and the most common value encountered
in the data, respectively. If the values of a discrete feature are ordered
according to their impact on the class (for example, possible values for
symptoms of some disease can be ordered according to their seriousness),
the minimal and maximal imputers will represent optimistic and
pessimistic imputations.
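
For a discrete feature with ordered values, the three defaults can be
sketched in plain Python. The ``severity`` feature, its value list and the
training column are invented for illustration:

```python
# Sketch of how the three simple constructors would derive their defaults
# from training data (illustrative, not the Orange implementation).
from collections import Counter

severity = ["none", "mild", "severe"]          # ordered discrete values
observed = ["mild", "none", "severe", "mild"]  # training column

minimal = severity[0]        # optimistic: the value with index 0
maximal = severity[-1]       # pessimistic: the highest value
most_common = Counter(observed).most_common(1)[0][0]

print(minimal, maximal, most_common)  # none severe mild
```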

User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example, "LENGTH" is the
only attribute that gets imputed, with value 1234:

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

If :obj:`~Orange.feature.imputation.Imputer_defaults`'s constructor is given
an argument of type :obj:`~Orange.data.Domain`, it constructs an empty
instance for :obj:`defaults`. If an instance is given, the reference to that
instance is kept. To avoid problems associated with
``Imputer_defaults(data[0])``, it is better to provide a copy of the
instance: ``Imputer_defaults(Orange.data.Instance(data[0]))``.

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (defaults to False), the random generator is initialized for each
    instance using the instance's hash value as the seed. As a result, the
    same instance is always imputed with the same (random) values.
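
Deterministic per-instance seeding can be sketched in plain Python; the
``impute_random`` function and its list-based instances are invented for
illustration, not the Orange implementation:

```python
# Seeding the generator with the instance's hash makes repeated imputations
# of the same instance identical within a run (illustrative sketch).
import random

def impute_random(instance, values, deterministic=True):
    rng = random.Random(hash(tuple(instance)) if deterministic else None)
    return [rng.choice(values) if v is None else v for v in instance]

a = impute_random([None, "x", None], ["a", "b", "c"])
b = impute_random([None, "x", None], ["a", "b", "c"])
print(a == b)  # True: same instance, same seed, same imputed values
```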

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict a feature's value from the values
    of other features. :obj:`ImputerConstructor_model` is given two learning
    algorithms and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers
    that are used for imputation.

    .. attribute:: learner_discrete, learner_continuous

    Learners for discrete and for continuous attributes. If either of them is
    missing, the attributes of the corresponding type will not get imputed.

    .. attribute:: use_class

    Tells whether the imputer may use the class attribute. Defaults to
    False. This is useful in more complex designs in which one imputer is
    used on learning instances, where it uses the class value,
    and a second imputer is used on testing instances, where the class is not
    available.

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute to be imputed.
    The :obj:`class_var` of each model should equal the corresponding
    attribute of the instances. If an element is :obj:`None`, the
    corresponding attribute's values are not imputed.
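
The idea of one model per attribute can be sketched in plain Python. The
helpers ``make_mean_model`` and ``impute_with_models`` are invented for
illustration; a real model would predict from the other feature values:

```python
# Sketch of model-based imputation: one predictive model per attribute;
# a missing value is filled with that model's prediction (illustrative,
# not the Orange implementation).

def make_mean_model(data, col):
    # A trivial "model" that ignores the other features and predicts
    # the training mean of its own column.
    values = [row[col] for row in data if row[col] is not None]
    mean = sum(values) / len(values)
    return lambda instance: mean

def impute_with_models(instance, models):
    return [models[i](instance) if v is None else v
            for i, v in enumerate(instance)]

data = [[1.0, 10.0], [3.0, None], [None, 20.0]]
models = [make_mean_model(data, col) for col in range(2)]
print(impute_with_models([None, None], models))  # [2.0, 15.0]
```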

.. rubric:: Examples

Examples are taken from :download:`imputation-complex.py
<code/imputation-complex.py>`. The following imputer predicts the missing
attribute values using classification and regression trees with at least
20 examples in a leaf.

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

A common setup, where different learning algorithms are used for discrete
and continuous features, is to use
:class:`~Orange.classification.bayes.NaiveLearner` for discrete and
:class:`~Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes:

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

To construct a user-defined :class:`Imputer_model`:

.. literalinclude:: code/imputation-complex.py
    :lines: 108-112

A list of empty models, :obj:`Imputer_model.models`, is initialized first.
The continuous feature "LANES" is imputed with the value 2 using
:obj:`DefaultClassifier`. A float must be given, because integer values are
interpreted as indices of discrete features. The discrete feature "T-OR-D" is
imputed using :class:`~Orange.classification.ConstantClassifier`, which is
given the index of the value "THROUGH" as an argument.

Feature "LENGTH" is imputed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (feature "LENGTH" is used as the class attribute here).
The domain is initialized by giving a list of feature names and a domain as
an additional argument in which Orange will look for the features.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

The inferred tree should look like this::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Wooden bridges and walkways are short, while the others are mostly of medium
span. This could be encoded in feature "SPAN" using
:class:`Orange.classification.lookup.ClassifierByLookupTable`, which is
faster than the Python function used here:

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

If :obj:`compute_span` is written as a class, it must behave like a
classifier: it accepts an example and returns a value. The second
argument tells what the caller expects the classifier to return - a value,
a distribution or both. Currently, :obj:`Imputer_model`
always expects values, so the argument can be ignored.
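
Such a classifier-like callable can be sketched in plain Python. The
``ComputeSpan`` class, the ``GET_VALUE`` constant and the tuple-based
instances are invented for illustration, not the Orange API:

```python
# Sketch of a classifier-like callable: anything that accepts an instance
# (and a "return what" flag, which may be ignored) and returns a value can
# serve as a per-feature model (illustrative, not the Orange API).

GET_VALUE = 0  # stands in for orange.GetValue

class ComputeSpan:
    """Predicts "SPAN" from other feature values, here by a simple rule."""
    def __call__(self, instance, what=GET_VALUE):
        # The second argument is ignored; a value is always returned.
        material, purpose = instance
        if material == "WOOD" or purpose == "WALK":
            return "SHORT"
        return "MEDIUM"

compute_span = ComputeSpan()
print(compute_span(("WOOD", "RR")))   # SHORT
print(compute_span(("IRON", "RR")))   # MEDIUM
```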

Missing values as special values
================================

Missing values sometimes have a special meaning. Caution is needed when
using such values in decision models. When the decision not to measure
something (for example, not to perform a laboratory test on a patient) is
based on the expert's knowledge of the class value, such missing values
clearly should not be used in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete feature is replaced
    with a new feature that has one additional value: "NA". The new feature
    computes its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete attribute
    with values "def" and "undef", telling whether the value is defined or
    not. The feature's name equals the original's with "_def" appended.
    The original continuous feature remains in the domain and its
    unknowns are replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue` that converts examples into
    the new domain.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new features constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous features.

The following code shows what the imputer actually does to the domain:

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

The two instances have the same attributes, with :samp:`imputed` having a few
additional ones. Comparing :samp:`original.domain[0] == imputed.domain[0]`
results in False: while their names are the same, they represent different
features. Writing :samp:`imputed[i]` would therefore fail, since
:samp:`imputed` has no attribute :samp:`i`, only an attribute with the same
name. Using :samp:`i.name` to index the attributes of :samp:`imputed` works,
but it is not fast. If used frequently, it is better to compute the index
with :samp:`imputed.domain.index(i.name)`.

For each continuous feature, there is an additional feature whose name ends
in "_def", accessible as :samp:`i.name+"_def"`. The value of the first
continuous feature "ERECTED" remains 1874, and the additional attribute
"ERECTED_def" has the value "def". The undefined value in "LENGTH" is
replaced by the average (1567) and the new attribute has the value "undef".
The undefined discrete attribute "CLEAR-G" (and all other undefined discrete
attributes) is assigned the value "NA".
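
The transformation can be sketched in plain Python. The ``as_value``
function and the dict-based instances are invented for illustration, not the
Orange implementation:

```python
# Sketch of the "missing as value" transformation: each continuous feature
# gains a "<name>_def" indicator and its unknowns become the average, while
# discrete unknowns become "NA" (illustrative sketch).

def as_value(instance, continuous, averages):
    out = {}
    for name, value in instance.items():
        if name in continuous:
            out[name + "_def"] = "undef" if value is None else "def"
            out[name] = averages[name] if value is None else value
        else:
            out[name] = "NA" if value is None else value
    return out

bridge = {"ERECTED": 1874, "LENGTH": None, "CLEAR-G": None}
print(as_value(bridge, continuous={"ERECTED", "LENGTH"},
               averages={"ERECTED": 1905, "LENGTH": 1567}))
```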

Using imputers
--------------

Imputation is also used by learning algorithms and other methods that are
not capable of handling missing values.

Learners with imputer as a component
====================================

Learners that cannot handle missing values should provide a slot
for an imputer constructor. An example of such a class is
:obj:`~Orange.classification.logreg.LogRegLearner`, with the attribute
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`,
which imputes the average value by default. When given learning instances,
:obj:`~Orange.classification.logreg.LogRegLearner` passes them to
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor` to
get an imputer, and uses the imputer to impute the missing values in the
learning data. The imputed data is then used by the actual learning
algorithm. When a classifier
:obj:`~Orange.classification.logreg.LogRegClassifier` is constructed,
the imputer is stored in its attribute
:obj:`~Orange.classification.logreg.LogRegClassifier.imputer`. At
classification, the same imputer is used for imputation of missing values
in (testing) examples.

Details may vary from algorithm to algorithm, but this is how imputation
is generally used. When writing user-defined learners,
it is recommended to use imputation according to the described procedure.

The choice of imputer depends on the problem domain. In this
example we impute the minimal value of each feature:

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

.. note::

   Only one instance of
   :obj:`~Orange.classification.logreg.LogRegLearner` is constructed, and it
   is then used twice in each fold. The first time it is given the original
   instances as they are and returns an instance of
   :obj:`~Orange.classification.logreg.LogRegClassifier`. The second time it
   is called by :obj:`imra`, and the resulting
   :obj:`~Orange.classification.logreg.LogRegClassifier` gets wrapped
   into an :obj:`~Orange.feature.imputation.ImputeClassifier`. There is thus
   only one learner, which produces two different classifiers in each round
   of testing.

Wrapper for learning algorithms
===============================

In a learning/classification process, imputation is needed on two occasions.
Before learning, the imputer needs to process the training examples.
Afterwards, the imputer is called on each instance to be classified. For
example, in cross-validation, imputation should be done on the training
folds only; imputing the missing values on all data and then performing
cross-validation gives overly optimistic results.
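
The fold-wise discipline can be sketched in plain Python. The
``train_mean_imputer`` helper and the list-based folds are invented for
illustration:

```python
# Sketch of fold-wise imputation in cross-validation: the imputer is fitted
# on the training fold only, and then applied to both folds (illustrative).

def train_mean_imputer(train):
    known = [v for v in train if v is not None]
    mean = sum(known) / len(known)
    return lambda column: [mean if v is None else v for v in column]

train_fold = [1.0, None, 3.0]
test_fold = [None, 5.0]

impute = train_mean_imputer(train_fold)  # fit on the training fold only
print(impute(train_fold))  # [1.0, 2.0, 3.0]
print(impute(test_fold))   # [2.0, 5.0] -- test statistics never leak in
```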

Most of Orange's learning algorithms do not use imputers because they can
appropriately handle missing values. The Bayesian classifier, for instance,
simply skips the corresponding attributes in the formula, while
classification/regression trees have components for handling missing
values in various ways.

If you nevertheless want to run these algorithms on imputed data,
you can use this wrapper.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` or a class
    with the same call operator.

    .. attribute:: dont_impute_classifier

    If set and a table is given, the classifier is not
    wrapped into an imputer. This can be done if the classifier can handle
    missing values itself.

    The learner is best illustrated by its code - here is its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    During learning, :obj:`ImputeLearner` first constructs
    the imputer. It then imputes the data and calls the
    given :obj:`base_learner` to construct a classifier. For instance,
    :obj:`base_learner` could be a learner for logistic regression and the
    result would be a logistic regression model. If the classifier can handle
    unknown values (that is, if :obj:`dont_impute_classifier` is set),
    it is returned as is; otherwise it is wrapped into an
    :obj:`ImputeClassifier`, which holds the base classifier and
    the imputer used to impute the missing values in (testing) data.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: base_classifier

    The wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class's constructor accepts and stores two arguments,
    the classifier and the imputer. The call operator for classification
    looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the imputed instance to the base classifier.

.. note::
   In this setup the imputer is trained on the training data. Even during
   cross-validation, the imputer is trained on the right data. In the
   classification phase, the imputer is used to impute testing data.

.. rubric:: Code of ImputeLearner and ImputeClassifier

The learner is called with
:obj:`Orange.feature.imputation.ImputeLearner(base_learner=<someLearner>, imputer_constructor=<someImputerConstructor>)`.
When given examples, it trains the imputer, imputes the data,
induces a :obj:`base_classifier` with the
:obj:`base_learner` and constructs an :obj:`ImputeClassifier` that stores
the :obj:`base_classifier` and the :obj:`imputer`. At classification, the
missing values are imputed and the classifier's prediction is returned.

This is slightly simplified code that omits details on handling
non-essential technical issues unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

Write your own imputer
----------------------

Imputation classes provide the Python-callback functionality. The simplest
way to write custom imputation constructors or imputers is to write a Python
function that behaves like the built-in Orange classes. For imputers it is
enough to write a function that takes an instance as its argument.
Imputation of data tables will then use that function.

Special imputation procedures, or separate procedures for various
attributes, as demonstrated in the description of
:obj:`~Orange.feature.imputation.ImputerConstructor_model`,
are achieved by encoding them in a constructor that accepts a data table and
the id of the weight meta-attribute, and returns the imputer. The benefit of
implementing an imputer constructor is that you can use it as a component
for learners (for example, in logistic regression) or wrappers, and in that
way properly use the classifier in testing procedures.
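
The shape of such a function-based constructor can be sketched in plain
Python. The ``minimal_constructor`` function and its list-based data are
invented for illustration; a real constructor would accept an
:obj:`Orange.data.Table` and the weight id:

```python
# Sketch of a function-based imputer constructor: a plain function that
# takes training data (and a weight id, ignored here) and returns an
# imputer function, mirroring the constructor/imputer protocol above
# (illustrative plain Python, not the Orange callback API).

def minimal_constructor(data, weight_id=0):
    minima = [min(v for v in col if v is not None) for col in zip(*data)]

    def imputer(instance):
        return [m if v is None else v for v, m in zip(instance, minima)]

    return imputer

data = [[2, None], [None, 7], [4, 5]]
impute = minimal_constructor(data)
print(impute([None, None]))  # [2, 5]
```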



..
    This was commented out:
    Examples
    --------

    Missing values sometimes have a special meaning, so they need to be replaced
    by a designated value. Sometimes we know what to replace the missing value
    with; for instance, in a medical problem, some laboratory tests might not be
    done when it is known what their results would be. In that case, we impute
    certain fixed value instead of the missing. In the most complex case, we assign
    values that are computed based on some model; we can, for instance, impute the
    average or majority value or even a value which is computed from values of
    other, known feature, using a classifier.

    In general, imputer itself needs to be trained. This is, of course, not needed
    when the imputer imputes certain fixed value. However, when it imputes the
    average or majority value, it needs to compute the statistics on the training
    examples, and use it afterwards for imputation of training and testing
    examples.

    While reading this document, bear in mind that imputation is a part of the
    learning process. If we fit the imputation model, for instance, by learning
    how to predict the feature's value from other features, or even if we
    simply compute the average or the minimal value for the feature and use it
    in imputation, this should only be done on learning data. Orange
    provides simple means for doing that.

    This page will first explain how to construct various imputers. Then follow
    the examples for `proper use of imputers <#using-imputers>`_. Finally, quite
    often you will want to use imputation with special requests, such as certain
    features' missing values getting replaced by constants and other by values
    computed using models induced from specified other features. For instance,
    in one of the studies we worked on, the patient's pulse rate needed to be
    estimated using regression trees that included the scope of the patient's
    injuries, sex and age, some attributes' values were replaced by the most
    pessimistic ones and others were computed with regression trees based on
    values of all features. If you are using learners that need the imputer as a
    component, you will need to `write your own imputer constructor
    <#write-your-own-imputer-constructor>`_. This is trivial and is explained at
    the end of this page.