source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9905:3fd7a62a81ee

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
-----------------

:obj:`ImputerConstructor` is the abstract root of a hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance` it returns a new instance with the
missing values imputed (leaving the original instance intact). If the imputer
is called with an :obj:`Orange.data.Table` it returns a new data table with
imputed instances.

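The following sketch illustrates this constructor/imputer protocol (it
assumes the bridges data used throughout this page and
:obj:`ImputerConstructor_average`, one of the constructors described below)::

    import Orange

    data = Orange.data.Table("bridges")
    imputer_constructor = Orange.feature.imputation.ImputerConstructor_average()
    imputer = imputer_constructor(data)   # an Imputer trained on the data
    instance = imputer(data[0])           # a new instance; data[0] is unchanged
    table = imputer(data)                 # a new table with imputed instances
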
.. class:: ImputerConstructor

    .. attribute:: impute_class

    Indicates whether to impute the class value. Defaults to True.

Simple imputation
=================

Simple imputers always impute the same value for a particular feature,
disregarding the values of other features. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

    An instance of :obj:`~Orange.data.Instance` with the default values that
    are imputed instead of missing values. Examples to be imputed must be from
    the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, they will impute the smallest, largest or the average
value encountered in the training examples. For discrete features,
they will impute the lowest value (the one with index 0, e.g. attr.values[0]),
the highest value (attr.values[-1]) or the most common value encountered in
the data, respectively. If the values of a discrete feature are ordered
according to their impact on the class (for example, possible values for
symptoms of some disease can be ordered according to their seriousness),
the minimal and maximal imputers will represent optimistic and
pessimistic imputations.

User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example "LENGTH" is the
only attribute to get imputed, with the value 1234:

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

If :obj:`~Orange.feature.imputation.Imputer_defaults`'s constructor is given
an argument of type :obj:`~Orange.data.Domain`, it constructs an empty instance
for :obj:`defaults`. If an instance is given, the reference to that
instance is kept. To avoid problems associated with `Imputer_defaults
(data[0])`, it is better to provide a copy of the instance:
`Imputer_defaults(Orange.data.Instance(data[0]))`.

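The two ways of constructing :obj:`Imputer_defaults` described above can be
sketched as follows (assuming ``data`` is the bridges table; the value 1234
mirrors the example above)::

    # from a domain: defaults starts as an empty instance,
    # values to impute are then set by feature name
    imputer = Orange.feature.imputation.Imputer_defaults(data.domain)
    imputer.defaults["LENGTH"] = 1234

    # from an instance: pass a copy, so the original is not shared
    imputer = Orange.feature.imputation.Imputer_defaults(
        Orange.data.Instance(data[0]))
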
Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (defaults to False), the random generator is seeded for each
    instance with the instance's hash value. As a result, equal instances
    are always imputed with the same (random) values.

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the feature's value from the values
    of other features. :obj:`ImputerConstructor_model` is given two learning
    algorithms and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers that
    are used for imputation.

    .. attribute:: learner_discrete, learner_continuous

    Learners for discrete and for continuous attributes, respectively. If
    either is missing, the attributes of the corresponding type are not
    imputed.

    .. attribute:: use_class

    Tells whether the imputer is allowed to use the class attribute. Defaults
    to False. It is useful in more complex designs in which one imputer is
    used on learning instances, where it uses the class value,
    and a second imputer on testing instances, where the class is not
    available.

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute to be imputed.
    The :obj:`class_var` of each model must match the corresponding feature.
    If an element is :obj:`None`, the corresponding attribute's values are
    not imputed.

.. rubric:: Examples

Examples are taken from :download:`imputation-complex.py
<code/imputation-complex.py>`. The following imputer predicts the missing
attribute values using classification and regression trees with a minimum
of 20 examples in a leaf.

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

A common setup, where different learning algorithms are used for discrete
and continuous features, is to use
:class:`~Orange.classification.bayes.NaiveLearner` for discrete and
:class:`~Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes:

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

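The same setup can be sketched roughly as follows (attribute and learner
names as referenced above; ``data`` is assumed to be already loaded)::

    imputer_cons = Orange.feature.imputation.ImputerConstructor_model()
    imputer_cons.learner_discrete = Orange.classification.bayes.NaiveLearner()
    imputer_cons.learner_continuous = Orange.regression.mean.MeanLearner()
    imputer = imputer_cons(data)   # builds one model per feature
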
To construct a user-defined :class:`Imputer_model`:

.. literalinclude:: code/imputation-complex.py
    :lines: 108-112

A list of empty models, :obj:`Imputer_model.models`, is initialized first.
The continuous feature "LANES" is imputed with the value 2 using
:obj:`DefaultClassifier`. A float must be given, because integer values are
interpreted as indices of discrete values. The discrete feature "T-OR-D" is
imputed using :class:`~Orange.classification.ConstantClassifier`, which is
given the index of the value "THROUGH" as an argument.

Feature "LENGTH" is imputed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (feature "LENGTH" is used as the class attribute here).
The domain is constructed from a list of feature names, with the original
domain passed as an additional argument in which Orange looks up the
features.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

The inferred tree should look like this::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Wooden bridges and walkways are short, while the others are mostly
medium. This could be encoded in feature "SPAN" using
:class:`Orange.classifier.ClassifierByLookupTable`, which is faster than the
Python function used here:

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

If :obj:`compute_span` is written as a class, it must behave like a
classifier: it accepts an example and returns a value. The second
argument tells what the caller expects the classifier to return - a value,
a distribution or both. Currently, :obj:`Imputer_model`
always expects values and the argument can be ignored.

Missing values as special values
================================

Missing values sometimes have a special meaning. Caution is needed when
using such values in decision models. When the decision not to measure
something (for example, not performing a laboratory test on a patient) is
based on the expert's knowledge of the class value, such missing values
clearly should not be used in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete feature is replaced
    with a new feature that has one more value: "NA". The new feature
    computes its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete attribute
    with values "def" and "undef", telling whether the value is defined or
    not. The feature's name equals the original's with "_def" appended.
    The original continuous feature remains in the domain and its
    unknowns are replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue` that converts examples into
    the new domain.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new features constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous features.

The following code shows what the imputer actually does to the domain:

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

The two instances have the same attributes, with :samp:`imputed` having a few
additional ones. Comparing :samp:`original.domain[0] == imputed.domain[0]`
gives False: while the names are the same, they represent different
features. Writing :samp:`imputed[i]`, where :samp:`i` is a feature from the
original domain, would therefore fail, since :samp:`imputed` has no attribute
:samp:`i`, only an attribute with the same name. Using :samp:`i.name` to
index the attributes of :samp:`imputed` works, yet it is not fast. If used
frequently, it is better to compute the index with
:samp:`imputed.domain.index(i.name)`.

For continuous features, there is an additional feature with the suffix
"_def" in its name, accessible by :samp:`i.name+"_def"`. The value of the
first continuous feature "ERECTED" remains 1874, and the additional attribute
"ERECTED_def" has the value "def". The undefined value in "LENGTH" is
replaced by the average (1567) and the new attribute has the value "undef".
The undefined discrete attribute "CLEAR-G" (and all other undefined discrete
attributes) is assigned the value "NA".

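A short sketch of accessing the new features by name, assuming ``original``
and ``imputed`` are the two instances from the script above::

    idx = imputed.domain.index("ERECTED_def")       # look the index up once
    print imputed.domain[idx].name, imputed["ERECTED_def"]
    print imputed["LENGTH"], imputed["LENGTH_def"]  # the average and "undef"
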
Using imputers
--------------

Imputation is also used by learning algorithms and other methods that are not
capable of handling unknown values.

Imputer as a component
======================

Learners that cannot handle missing values should provide a slot
for an imputer constructor. An example of such a class is
:obj:`~Orange.classification.logreg.LogRegLearner` with the attribute
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`,
which imputes the average value by default. When given learning instances,
:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor` to get
an imputer and use it to impute the missing values in the learning data.
The imputed data is then used by the actual learning algorithm. When a
classifier :obj:`~Orange.classification.logreg.LogRegClassifier` is
constructed, the imputer is stored in its attribute
:obj:`~Orange.classification.logreg.LogRegClassifier.imputer`. During
classification the same imputer is used to impute missing values
in (testing) examples.

Details may vary from algorithm to algorithm, but this is how imputation
is generally used. When writing user-defined learners,
it is recommended to use imputation according to the described procedure.

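For example, plugging an imputer constructor into such a slot takes a single
assignment (a sketch with the attribute names referenced above; the complete
example follows below)::

    lr = Orange.classification.logreg.LogRegLearner()
    lr.imputer_constructor = Orange.feature.imputation.ImputerConstructor_minimal()
    # calling lr(data) now imputes the missing values before fitting
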
The choice of the imputer depends on the problem domain. In this example the
minimal value of each feature is imputed:

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

.. note::

   Just one instance of
   :obj:`~Orange.classification.logreg.LogRegLearner` is constructed and then
   used twice in each fold. Once it is given the original instances as they
   are and returns an instance of
   :obj:`~Orange.classification.logreg.LogRegClassifier`. The second time it
   is called by :obj:`imra` and the resulting classifier gets wrapped
   into :obj:`~Orange.feature.imputation.ImputeClassifier`. There is only one
   learner, which produces two different classifiers in each round of
   testing.

Wrappers for learning
=====================

In a learning/classification process, imputation is needed on two occasions.
Before learning, the imputer needs to process the training instances.
Afterwards, the imputer is called for each instance to be classified. For
example, in cross validation, imputation should be done on training folds
only. Imputing the missing values on all data and subsequently performing
cross-validation will give overly optimistic results.

Most of Orange's learning algorithms do not use imputers because they can
appropriately handle the missing values. Bayesian classifier, for instance,
simply skips the corresponding attributes in the formula, while
classification/regression trees have components for handling the missing
values in various ways. A wrapper is provided for learning algorithms that
require imputed data.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` or a class
    with the same call operator.

    .. attribute:: dont_impute_classifier

    If set and a table is given, the resulting classifier is not
    wrapped into an imputer. This can be done if the classifier can handle
    missing values.

    The learner is best illustrated by its code - here's its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    During learning, :obj:`ImputeLearner` will first construct
    the imputer. It will then impute the data and call the
    given :obj:`base_learner` to construct a classifier. For instance,
    :obj:`base_learner` could be a learner for logistic regression and the
    result would be a logistic regression model. If the classifier can handle
    unknown values (that is, if :obj:`dont_impute_classifier` is set),
    it is returned as is; otherwise it is wrapped into
    :obj:`ImputeClassifier`, which holds the base classifier and
    the imputer used to impute the missing values in (testing) data.

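A short usage sketch (the keyword arguments mirror the attributes above and
are stored by the ``__new__`` method shown further below; the bridges data is
assumed)::

    imputing_learner = Orange.feature.imputation.ImputeLearner(
        base_learner=Orange.classification.bayes.NaiveLearner(),
        imputer_constructor=Orange.feature.imputation.ImputerConstructor_average())
    classifier = imputing_learner(data)   # an ImputeClassifier
    print classifier(data[0])             # data[0] is imputed before prediction
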
.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given data.

    .. attribute:: base_classifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class's constructor accepts and stores two arguments,
    the classifier and the imputer. The call operator for classification
    looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the imputed instance to the base classifier.

.. note::
   In this setup the imputer is trained on the training data. Even during
   cross validation, the imputer will be trained on the right data. In the
   classification phase, the imputer will be used to impute testing data.

.. rubric:: Code of ImputeLearner and ImputeClassifier

The learner is called with
:obj:`Orange.feature.imputation.ImputeLearner(base_learner=<someLearner>, imputer_constructor=<someImputerConstructor>)`.
When given examples, it trains the imputer, imputes the data,
induces a :obj:`base_classifier` with the
:obj:`base_learner` and constructs an :obj:`ImputeClassifier` that stores the
:obj:`base_classifier` and the :obj:`imputer`. For classification, the missing
values are imputed and the classifier's prediction is returned.

This is slightly simplified code, which omits details on how to handle
non-essential technical issues unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples = None, weightID = 0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

Write your own imputer
----------------------

Imputation classes provide the Python callback functionality. The simplest
way to write custom imputation constructors or imputers is to write a Python
function that behaves like the built-in Orange classes. For imputers it is
enough to write a function that gets an instance as its argument. Imputation
for data tables will then use that function.

Special imputation procedures, or separate procedures for various attributes,
as demonstrated in the description of
:obj:`~Orange.feature.imputation.ImputerConstructor_model`,
are achieved by writing a constructor that accepts a data table and the
id of the weight meta-attribute, and returns the imputer. The benefit of
implementing an imputer constructor is that you can use it as a component
for learners (for example, in logistic regression) or wrappers, and in that
way properly use the classifier in testing procedures.

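For instance, a minimal imputer constructor can be written as a plain
function. The function name below is made up for illustration and it merely
delegates to a built-in constructor, but it could build any callable that
maps an instance to an imputed instance::

    def minimal_imputer_constructor(data, weight_id=0):
        # accepts a data table and the id of the weight meta-attribute,
        # returns an imputer -- here simply the built-in minimal-value one
        return Orange.feature.imputation.ImputerConstructor_minimal(data)

    lr = Orange.classification.logreg.LogRegLearner()
    lr.imputer_constructor = minimal_imputer_constructor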


..
    This was commented out:
    Examples
    --------

    Missing values sometimes have a special meaning, so they need to be replaced
    by a designated value. Sometimes we know what to replace the missing value
    with; for instance, in a medical problem, some laboratory tests might not be
    done when it is known what their results would be. In that case, we impute
    certain fixed value instead of the missing. In the most complex case, we assign
    values that are computed based on some model; we can, for instance, impute the
    average or majority value or even a value which is computed from values of
    other, known feature, using a classifier.

    In general, imputer itself needs to be trained. This is, of course, not needed
    when the imputer imputes certain fixed value. However, when it imputes the
    average or majority value, it needs to compute the statistics on the training
    examples, and use it afterwards for imputation of training and testing
    examples.

    While reading this document, bear in mind that imputation is a part of the
    learning process. If we fit the imputation model, for instance, by learning
    how to predict the feature's value from other features, or even if we
    simply compute the average or the minimal value for the feature and use it
    in imputation, this should only be done on learning data. Orange
    provides simple means for doing that.

    This page will first explain how to construct various imputers. Then follow
    the examples for `proper use of imputers <#using-imputers>`_. Finally, quite
    often you will want to use imputation with special requests, such as certain
    features' missing values getting replaced by constants and other by values
    computed using models induced from specified other features. For instance,
    in one of the studies we worked on, the patient's pulse rate needed to be
    estimated using regression trees that included the scope of the patient's
    injuries, sex and age, some attributes' values were replaced by the most
    pessimistic ones and others were computed with regression trees based on
    values of all features. If you are using learners that need the imputer as a
    component, you will need to `write your own imputer constructor
    <#write-your-own-imputer-constructor>`_. This is trivial and is explained at
    the end of this page.