source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9809:03b6a2f2caa5

Revision 9809:03b6a2f2caa5, 20.6 KB checked in by tomazc <tomaz.curk@…>, 2 years ago (diff)

Orange.feature.imputation

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
=================

:obj:`ImputerConstructor` is the abstract root of the hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new instance with the
missing values imputed (leaving the original instance intact). If an imputer
is called with an :obj:`Orange.data.Table`, it returns a new data table with
imputed instances.
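
This calling contract can be sketched in plain Python (this is not the
Orange API; ``SketchImputer`` and the use of ``None`` for Orange's ``?``
are stand-ins for illustration only):

```python
# A minimal sketch of the Imputer call contract: a single instance in,
# a new imputed instance out; a table in, a new table of imputed
# instances out. The original data is never modified.

class SketchImputer:
    def __init__(self, defaults):
        self.defaults = defaults  # one default value per feature position

    def impute_instance(self, instance):
        # Build a NEW list; the original instance is left intact.
        return [d if v is None else v
                for v, d in zip(instance, self.defaults)]

    def __call__(self, data):
        # A table (list of instances) yields a new table; a single
        # instance yields a single new instance.
        if data and isinstance(data[0], list):
            return [self.impute_instance(inst) for inst in data]
        return self.impute_instance(data)

imp = SketchImputer(defaults=[0, "N"])
row = [None, "N"]
print(imp(row))  # [0, 'N']
print(row)       # original unchanged: [None, 'N']
```
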

.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Indicates whether to impute the class value. Defaults to True.

    .. attribute:: deterministic

    Indicates whether to seed the random generator with the example's CRC,
    so that repeated imputations of the same example give the same result.
    Defaults to False.
Simple imputation
=================

Simple imputers always impute the same value for a particular feature,
disregarding the values of other features. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute::  defaults

    An instance of :obj:`Orange.data.Instance` with the default values to be
    imputed instead of missing values. Examples to be imputed must be from
    the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, these impute the smallest, largest or average
value encountered in the training examples. For discrete features,
they impute the lowest value (the one with index 0, e.g. attr.values[0]),
the highest value (attr.values[-1]), or the most common value encountered
in the data, respectively. If the values of a discrete feature are ordered
according to their impact on the class (for example, possible values for
symptoms of some disease can be ordered according to their seriousness),
the minimal and maximal imputers represent optimistic and pessimistic
imputations.
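
The way these three constructors derive defaults from training data can be
sketched in plain Python (not the Orange API; ``continuous_defaults`` and
``discrete_defaults`` are illustrative names, and ``None`` stands in for a
missing value):

```python
# Deriving the default per feature: minimum, maximum or average for
# continuous features; lowest-index, highest-index or most common value
# for discrete ones.

from collections import Counter

def continuous_defaults(column, mode):
    known = [v for v in column if v is not None]
    if mode == "minimal":
        return min(known)
    if mode == "maximal":
        return max(known)
    return sum(known) / len(known)          # "average"

def discrete_defaults(values, column, mode):
    # `values` is the feature's ordered value list, like attr.values.
    if mode == "minimal":
        return values[0]                    # attr.values[0]
    if mode == "maximal":
        return values[-1]                   # attr.values[-1]
    known = [v for v in column if v is not None]
    return Counter(known).most_common(1)[0][0]   # most common value

lengths = [804, None, 1000, 1200]
print(continuous_defaults(lengths, "minimal"))   # 804
spans = ["SHORT", "MEDIUM", None, "SHORT"]
print(discrete_defaults(["SHORT", "MEDIUM", "LONG"], spans, "average"))  # SHORT
```
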

User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example "LENGTH" is the
only attribute to get imputed, with value 1234:

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

If the constructor of :obj:`~Orange.feature.imputation.Imputer_defaults` is
given an argument of type :obj:`~Orange.data.Domain`, it constructs an empty
instance for :obj:`defaults`. If an instance is given, the reference to the
instance will be kept. To avoid problems associated with
``Imputer_defaults(data[0])``, it is better to provide a copy of the
instance: ``Imputer_defaults(Orange.data.Instance(data[0]))``.
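
The reference-versus-copy pitfall is easy to demonstrate in plain Python
(not the Orange API; the two assignments below stand in for the two ways of
constructing :obj:`Imputer_defaults`):

```python
# If the imputer keeps a reference to data[0], later changes to data[0]
# silently change what gets imputed; storing a copy avoids that.

data0 = [804, "SHORT"]

by_reference = data0       # like Imputer_defaults(data[0])
by_copy = list(data0)      # like Imputer_defaults(Orange.data.Instance(data[0]))

data0[0] = 9999            # the data is modified later
print(by_reference[0])     # 9999 -- the defaults changed under us
print(by_copy[0])          # 804  -- the defaults stay as intended
```
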

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (defaults to False), the random generator is seeded for each
    instance with the instance's hash value. As a result, the same instance
    is always imputed with the same (random) values.
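
The deterministic variant can be sketched in plain Python (not the Orange
API; Orange uses the example's CRC as a seed, and ``zlib.crc32`` of the
instance's ``repr`` is used here as a stand-in):

```python
# Deterministic random imputation: seed the generator with a hash of the
# instance's values, so the same instance always receives the same
# "random" imputed values.

import random
import zlib

def impute_random(instance, choices, deterministic=False):
    rng = random.Random()
    if deterministic:
        rng.seed(zlib.crc32(repr(instance).encode()))
    return [rng.choice(choices) if v is None else v for v in instance]

row = [None, "N", None]
a = impute_random(row, ["A", "B", "C"], deterministic=True)
b = impute_random(row, ["A", "B", "C"], deterministic=True)
print(a == b)   # True: same instance, same imputed values
```
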

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the feature's value from the
    values of other features. :obj:`ImputerConstructor_model` is given two
    learning algorithms and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers
    that are used for imputation.

    .. attribute:: learner_discrete, learner_continuous

    Learner for discrete and for continuous attributes. If either is
    missing, the attributes of the corresponding type will not get imputed.

    .. attribute:: use_class

    Tells whether the imputer is allowed to use the class attribute.
    Defaults to False. It is useful in more complex designs in which one
    imputer is used on learning instances, where it uses the class value,
    and a second imputer on testing instances, where the class is not
    available.

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute to be imputed.
    The :obj:`class_var` of each model should equal the corresponding
    instance attribute. If an element is :obj:`None`, the corresponding
    attribute's values are not imputed.
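
The idea behind the per-attribute model list can be sketched in plain
Python (not the Orange API; the "models" here are trained mean predictors
standing in for real classifiers and regressors):

```python
# One predictive model per feature, trained to predict that feature from
# the others; a None entry in the model list leaves that feature
# unimputed.

def train_mean_model(rows, target):
    # A stand-in "learner": predicts the mean of the known target values,
    # ignoring the other features.
    known = [r[target] for r in rows if r[target] is not None]
    mean = sum(known) / len(known)
    return lambda row: mean

rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
models = [train_mean_model(rows, 0), None]   # impute feature 0 only

def impute(row, models):
    return [m(row) if v is None and m is not None else v
            for v, m in zip(row, models)]

print(impute(rows[1], models))   # [2.0, 20.0]
```
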

.. rubric:: Examples

Examples are taken from :download:`imputation-complex.py
<code/imputation-complex.py>`. The following imputer predicts the missing
attribute values using classification and regression trees with a minimum
of 20 examples in a leaf.

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

A common setup, in which different learning algorithms are used for discrete
and continuous features, is to use
:class:`~Orange.classification.bayes.NaiveLearner` for discrete and
:class:`~Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes:

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

To construct a user-defined :class:`Imputer_model`:

.. literalinclude:: code/imputation-complex.py
    :lines: 108-112

A list of empty models, :obj:`Imputer_model.models`, is initialized first.
The continuous feature "LANES" is imputed with the value 2 using
:obj:`DefaultClassifier`. A float must be given, because integer values are
interpreted as indices of discrete features. The discrete feature "T-OR-D"
is imputed using :class:`Orange.classification.ConstantClassifier`, which is
given the index of the value "THROUGH" as an argument.

The feature "LENGTH" is imputed with a regression tree induced from
"MATERIAL", "SPAN" and "ERECTED" (feature "LENGTH" is used as the class
attribute here). The domain is initialized by giving a list of feature
names, with the original domain as an additional argument in which Orange
will look up the features.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

The inferred tree should look like this::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Wooden bridges and walkways are short, while the others are mostly
medium. This could be encoded in feature "SPAN" using
:class:`Orange.classification.lookup.ClassifierByLookupTable`, which is
faster than the Python function used here:

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

If :obj:`compute_span` is written as a class, it must behave like a
classifier: it accepts an example and returns a value. The second
argument tells the classifier what the caller expects it to return - a
value, a distribution or both. Currently, :obj:`Imputer_model`
always expects values, so the argument can be ignored.
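
The shape of such a function-as-classifier can be sketched in plain Python
(not the Orange API; ``GET_VALUE`` stands in for ``orange.GetValue``, and
the rule itself is a hypothetical simplification in the spirit of the
bridges example):

```python
# A plain function used as an imputation "classifier": it receives an
# example and a flag saying what to return; since Imputer_model always
# asks for a value, the flag can be ignored.

GET_VALUE = 0   # stand-in for orange.GetValue

def compute_span(example, return_what=GET_VALUE):
    # Hypothetical rule: wooden bridges and walkways are short,
    # the others are medium.
    material, purpose = example
    if material == "WOOD" or purpose == "WALK":
        return "SHORT"
    return "MEDIUM"

print(compute_span(("WOOD", "RR")))    # SHORT
print(compute_span(("IRON", "RR")))    # MEDIUM
```
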

Missing values as special values
================================

Missing values sometimes have a special meaning, so caution is needed when
using such values in decision models. When the decision not to measure
something (for example, not performing a laboratory test on a patient) is
based on the expert's knowledge of the class value, such missing values
clearly should not be used in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete feature is replaced
    with a new feature that has one more value: "NA". The new feature
    computes its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete attribute
    with values "def" and "undef", telling whether the value is defined or
    not. The feature's name will equal the original's with "_def" appended.
    The original continuous feature will remain in the domain and its
    unknowns will be replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue` that converts examples into
    the new domain.
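
The two per-feature transformations can be sketched in plain Python (not
the Orange API; ``as_value_discrete`` and ``as_value_continuous`` are
illustrative names, and ``None`` stands in for an unknown value):

```python
# Discrete features: unknowns become the extra value "NA".
# Continuous features: the value is paired with a "def"/"undef"
# indicator, and an unknown is replaced by the training average.

def as_value_discrete(v):
    return "NA" if v is None else v

def as_value_continuous(v, average):
    if v is None:
        return average, "undef"
    return v, "def"

print(as_value_discrete(None))              # NA
print(as_value_continuous(None, 1567.0))    # (1567.0, 'undef')
print(as_value_continuous(1874.0, 1567.0))  # (1874.0, 'def')
```
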

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new features constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous features.

The following code shows what the imputer actually does to the domain:

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

Seemingly, the two examples have the same attributes (with
:samp:`imputed` having a few additional ones). However, comparing
:samp:`original.domain[0] == imputed.domain[0]` gives False: while
the names are the same, they represent different features. Consequently,
indexing with a feature from the original domain, :samp:`imputed[i]`, fails,
since :samp:`imputed` has no attribute :samp:`i` - only an attribute with
the same name. Using :samp:`i.name` to index the attributes of
:samp:`imputed` works, yet it is not fast. If frequently used, it is
better to compute the index with :samp:`imputed.domain.index(i.name)`.

For continuous features, there is an additional feature with the name suffix
"_def", accessible as :samp:`i.name+"_def"`. The value of the first
continuous feature "ERECTED" remains 1874, and the additional attribute
"ERECTED_def" has the value "def". The undefined value in "LENGTH" is
replaced by the average (1567) and the new attribute has the value "undef".
The undefined discrete attribute "CLEAR-G" (and all other undefined discrete
attributes) is assigned the value "NA".

Using imputers
==============

Imputation must run on training data only. Imputing the missing values
and subsequently using the data in cross-validation will give overly
optimistic results.

Learners with imputer as a component
------------------------------------

Learners that cannot handle missing values provide a slot for the imputer
component. An example of such a class is
:obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor`.

When given learning instances,
:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor` to get
an imputer, and will use that imputer to impute the missing values in the
learning data. The imputed data is then used by the actual learning
algorithm. When the classifier
:obj:`Orange.classification.logreg.LogRegClassifier` is constructed,
the imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification time, the same imputer is used for imputation of missing
values in (testing) examples.

Details may vary from algorithm to algorithm, but this is how imputation
is generally used. When writing user-defined learners,
it is recommended to use imputation according to the described procedure.

Wrapper for learning algorithms
===============================

Imputation is used by learning algorithms and other methods that are not
capable of handling unknown values. It will impute missing values,
call the learner and, if imputation is also needed by the classifier,
wrap the classifier into a wrapper that imputes missing values in
examples to classify.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot
handle missing values should, in principle, provide the slots for an imputer
constructor. For instance,
:obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and
even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they
    can appropriately handle the missing values. The Bayesian classifier,
    for instance, simply skips the corresponding attributes in the formula,
    while classification/regression trees have components for handling the
    missing values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter of a separate
    page, but we shall show its code here as another demonstration of how to
    use the imputers - logistic regression is implemented essentially the
    same as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier. There are a few attributes that need to be set,
    though.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a class
    with the same call operator).

    .. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not be
    wrapped into an imputer. Do this if the classifier doesn't mind if the
    examples it is given have missing values.

    The learner is best illustrated by its code - here is its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this. :obj:`ImputeLearner` will first construct
    the imputer (that is, call :obj:`self.imputer_constructor` to get a
    trained imputer). Then it will use the imputer to impute the data, and
    call the given :obj:`base_learner` to construct a classifier. For
    instance, :obj:`base_learner` could be a learner for logistic regression
    and the result would be a logistic regression model. If the classifier
    can handle unknown values (that is, if :obj:`dont_impute_classifier` is
    set), we return it as it is; otherwise we wrap it into
    :obj:`ImputeClassifier`, which is given the base classifier and the
    imputer it can use to impute the missing values in (testing) examples.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: baseClassifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor accepts
    two arguments, the classifier and the imputer, which are stored in the
    corresponding attributes. The call operator which does the classification
    then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you
   do cross validation, the imputer will be trained on the right data. In
   the classification phase we again use the imputer which was trained on
   the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments
into the instance's dictionary. You are expected to call it like
``ImputeLearner(base_learner=<someLearner>,
imputer_constructor=<someImputerConstructor>)``. When the learner is called
with examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` with the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

Note that this code is slightly simplified: the omitted details handle
non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally if needed, it can sometimes happen that an expert will be able
to tell you exactly what to put in the data instead of the missing values.
In this example we shall suppose that we want to impute the minimal value
of each feature. We will try to determine whether the naive Bayesian
classifier with its implicit internal imputation works better than one that
uses imputation by minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

Should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   Note that we constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance
   is used twice in each fold. Once it is given the examples as they are,
   and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`. The second time it
   is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have
   only one learner, but it produces two different classifiers in each
   round of testing.
492
493Write your own imputer
494======================
495
496Imputation classes provide the Python-callback functionality (not all Orange
497classes do so, refer to the documentation on `subtyping the Orange classes
498in Python <callbacks.htm>`_ for a list). If you want to write your own
499imputation constructor or an imputer, you need to simply program a Python
500function that will behave like the built-in Orange classes (and even less,
501for imputer, you only need to write a function that gets an example as
502argument, imputation for example tables will then use that function).
503
504You will most often write the imputation constructor when you have a special
505imputation procedure or separate procedures for various attributes, as we've
506demonstrated in the description of
507:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only
508need to pack everything we've written there to an imputer constructor that
509will accept a data set and the id of the weight meta-attribute (ignore it if
510you will, but you must accept two arguments), and return the imputer (probably
511:obj:`Orange.feature.imputation.Imputer_model`. The benefit of implementing an
512imputer constructor as opposed to what we did above is that you can use such a
513constructor as a component for Orange learners (like logistic regression) or
514for wrappers from module orngImpute, and that way properly use the in
515classifier testing procedures.
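
The required shape of such a constructor can be sketched in plain Python
(not the Orange API; ``minimal_imputer_constructor`` is an illustrative
name, ``None`` stands in for a missing value, and the returned function
plays the role of the imputer):

```python
# A user-written imputer constructor as a plain function: it must accept
# the data and a weight id (even if the weight is ignored) and return an
# imputer -- here, a function mapping an instance to an imputed copy.

def minimal_imputer_constructor(data, weight_id=0):
    # "Training": remember the per-feature minimum of the known values.
    mins = []
    for col in zip(*data):
        known = [v for v in col if v is not None]
        mins.append(min(known))

    def imputer(instance):
        return [m if v is None else v for v, m in zip(instance, mins)]

    return imputer

data = [[1.0, 5.0], [None, 2.0], [3.0, None]]
imp = minimal_imputer_constructor(data)
print(imp([None, None]))   # [1.0, 2.0]
```
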