source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9810:a936f359ec27

Revision 9810:a936f359ec27, 20.6 KB checked in by tomazc <tomaz.curk@…>, 2 years ago (diff)

Orange.feature.imputation

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
========

:obj:`ImputerConstructor` is the abstract root of the hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance` it returns a new instance with the
missing values imputed (leaving the original instance intact). If an imputer
is called with an :obj:`Orange.data.Table` it returns a new data table with
imputed instances.

.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Indicates whether to impute the class value. Defaults to True.

    .. attribute:: deterministic

    Indicates whether to seed the random generator with the instance's CRC,
    so that repeated imputations of the same instance give the same result.
    Defaults to False.

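The division of labour between the two classes can be sketched in plain Python (an illustrative stand-in using dicts with ``None`` for missing values, not Orange's actual implementation): a constructor inspects the training data and returns an imputer, which in turn maps an instance to an imputed copy.

```python
# Illustrative sketch of the ImputerConstructor/Imputer protocol,
# using plain dicts in place of Orange's Instance/Table (None = missing).

def minimal_imputer_constructor(data):
    """Inspect training data and return an imputer callable."""
    features = data[0].keys()
    minima = {f: min(row[f] for row in data if row[f] is not None)
              for f in features}

    def imputer(instance):
        # Return a new instance; the original is left intact.
        return {f: (minima[f] if v is None else v)
                for f, v in instance.items()}

    return imputer

train = [{"a": 3, "b": 7}, {"a": 1, "b": None}, {"a": 2, "b": 5}]
impute = minimal_imputer_constructor(train)
print(impute({"a": None, "b": None}))  # {'a': 1, 'b': 5}
```

The same object can be applied to any number of instances, which mirrors how a trained :obj:`Imputer` is reused for a whole table.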
Simple imputation
=================

Simple imputers always impute the same value for a particular feature,
disregarding the values of other features. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute::  defaults

    An instance of :obj:`Orange.data.Instance` with the default values to be
    imputed instead of missing values. Examples to be imputed must be from
    the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.
For continuous features, they impute the smallest, the largest or the
average value encountered in the training examples. For discrete features,
they impute the lowest value (the one with index 0, e.g. attr.values[0]),
the highest value (attr.values[-1]), or the most common value encountered
in the data, respectively. If the values of a discrete feature are ordered
according to their impact on the class (for example, possible values for
symptoms of a disease can be ordered according to their seriousness),
the minimal and maximal imputers will then represent optimistic and
pessimistic imputations.
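
How the constructors derive their defaults can be sketched as follows (a plain-Python illustration with ``None`` for unknowns, not Orange's internals): the mean for continuous features and the most common value for discrete ones.

```python
# Sketch of how simple-imputation defaults are derived from training data:
# mean for continuous features, mode for discrete ones (None = missing).
from collections import Counter

def averages_and_modes(rows, continuous):
    """Compute a per-feature default from the non-missing training values."""
    defaults = {}
    for f in rows[0]:
        vals = [r[f] for r in rows if r[f] is not None]
        if f in continuous:
            defaults[f] = sum(vals) / len(vals)   # average value
        else:
            defaults[f] = Counter(vals).most_common(1)[0][0]  # modal value
    return defaults

rows = [{"LENGTH": 1000, "SPAN": "SHORT"},
        {"LENGTH": 2000, "SPAN": "SHORT"},
        {"LENGTH": None, "SPAN": "LONG"}]
print(averages_and_modes(rows, {"LENGTH"}))
# {'LENGTH': 1500.0, 'SPAN': 'SHORT'}
```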

User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example "LENGTH" is the
only attribute to get imputed, with value 1234:

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

If the constructor of :obj:`~Orange.feature.imputation.Imputer_defaults` is
given an argument of type :obj:`~Orange.data.Domain`, it constructs an empty
instance for :obj:`defaults`. If an instance is given, the reference to the
instance will be kept. To avoid problems associated with
``Imputer_defaults(data[0])``, it is better to provide a copy of the
instance: ``Imputer_defaults(Orange.data.Instance(data[0]))``.
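
The aliasing problem can be demonstrated with plain Python objects (an illustrative dict-based stand-in; references to :obj:`Orange.data.Instance` behave analogously):

```python
# Why Imputer_defaults(data[0]) is risky: the imputer keeps a reference,
# so mutating the original instance silently changes the defaults.
# (Plain-dict sketch; Orange.data.Instance behaves analogously.)

class DefaultsImputer:
    def __init__(self, defaults):
        self.defaults = defaults          # reference, not a copy

    def __call__(self, instance):
        return {f: (self.defaults.get(f) if v is None else v)
                for f, v in instance.items()}

row = {"LENGTH": 1000, "SPAN": "SHORT"}
shared = DefaultsImputer(row)             # keeps a live reference
safe = DefaultsImputer(dict(row))         # analogue of Instance(data[0])

row["LENGTH"] = 9999                      # mutate the original instance
print(shared({"LENGTH": None, "SPAN": None})["LENGTH"])  # 9999
print(safe({"LENGTH": None, "SPAN": None})["LENGTH"])    # 1000
```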

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (defaults to False), the random generator is initialized for
    each instance using the instance's hash value as a seed. This results
    in the same instances always being imputed with the same (random)
    values.
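
The effect of :obj:`deterministic` can be sketched in plain Python, where a CRC of the instance seeds the generator so that repeated imputations of the same instance agree (an illustration, not Orange's implementation):

```python
# Sketch of deterministic random imputation: seed the generator with a
# checksum of the instance so the same instance always gets the same
# values (plain Python; None marks a missing value).
import random
import zlib

def random_impute(instance, choices, deterministic=False):
    rng = random.Random()
    if deterministic:
        key = zlib.crc32(repr(sorted(instance.items())).encode())
        rng.seed(key)                     # same instance -> same seed
    return {f: (rng.choice(choices[f]) if v is None else v)
            for f, v in instance.items()}

choices = {"SPAN": ["SHORT", "MEDIUM", "LONG"]}
inst = {"SPAN": None}
a = random_impute(inst, choices, deterministic=True)
b = random_impute(inst, choices, deterministic=True)
print(a == b)  # True: same instance, same imputed value
```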

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the feature's value from values
    of other features. :obj:`ImputerConstructor_model` is given two learning
    algorithms and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers
    that are used for imputation.

    .. attribute:: learner_discrete, learner_continuous

    Learner for discrete and for continuous attributes. If either is
    missing, the attributes of the corresponding type will not get imputed.

    .. attribute:: use_class

    Tells whether the imputer can use the class attribute. Defaults to
    False. It is useful in more complex designs in which one imputer is
    used on learning instances, where it uses the class value,
    and a second imputer on testing instances, where the class is not
    available.

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute to be
    imputed. The :obj:`class_var` of each model should equal one of the
    instances' attributes. If an element is :obj:`None`, the corresponding
    attribute's values are not imputed.
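
The scheme can be sketched in plain Python: one model per feature, each trained to predict that feature, collected into an imputer. A toy mean-predicting "learner" stands in for a real learning algorithm such as a tree or naive Bayes.

```python
# Sketch of model-based imputation: for each feature, train a model that
# predicts it, then use the models to fill missing values (None = missing).

def mean_learner(rows, target):
    """Toy learner: the 'model' always predicts the training mean."""
    vals = [r[target] for r in rows if r[target] is not None]
    mean = sum(vals) / len(vals)
    return lambda row: mean               # ignores the other features

def build_imputer(rows, learner):
    models = {f: learner(rows, f) for f in rows[0]}   # one model per feature
    def imputer(row):
        return {f: (models[f](row) if v is None else v)
                for f, v in row.items()}
    return imputer

rows = [{"x": 1.0, "y": 10.0}, {"x": 3.0, "y": None}]
impute = build_imputer(rows, mean_learner)
print(impute({"x": None, "y": None}))  # {'x': 2.0, 'y': 10.0}
```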

.. rubric:: Examples

Examples are taken from :download:`imputation-complex.py
<code/imputation-complex.py>`. The following imputer predicts the missing
attribute values using classification and regression trees with a minimum
of 20 examples in a leaf.

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

A common setup, in which different learning algorithms are used for discrete
and continuous features, is to use
:class:`~Orange.classification.bayes.NaiveLearner` for discrete and
:class:`~Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes:

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

To construct a user-defined :class:`Imputer_model`:

.. literalinclude:: code/imputation-complex.py
    :lines: 108-112

A list of empty models, :obj:`Imputer_model.models`, is initialized first.
The continuous feature "LANES" is imputed with the value 2 using
:obj:`DefaultClassifier`. A float must be given, because integer values are
interpreted as indices of discrete features. The discrete feature "T-OR-D"
is imputed using :class:`Orange.classification.ConstantClassifier`, which
is given the index of the value "THROUGH" as an argument.

The feature "LENGTH" is imputed with a regression tree induced from
"MATERIAL", "SPAN" and "ERECTED" (the feature "LENGTH" is used as the class
attribute here). The domain is constructed by giving a list of feature
names, with the data's domain as an additional argument in which Orange
looks up the features.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

The inferred tree should look like this::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Wooden bridges and walkways are short, while the others are mostly
medium. This could be encoded in the feature "SPAN" using
:class:`Orange.classification.lookup.ClassifierByLookupTable`, which is
faster than the Python function used here:

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

Whether :obj:`compute_span` is written as a function or a class, it must
behave like a classifier: it accepts an example and returns a value. The
second argument tells what the caller expects the classifier to return: a
value, a distribution or both. :obj:`Imputer_model` currently always
expects values, so the argument can be ignored.
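
A callable that fills this role can be sketched as follows (illustrative names and rules; the actual :obj:`compute_span` in the example script may differ):

```python
# Sketch of a classifier-like callable usable in Imputer_model.models:
# it takes an example and a "what" flag and returns a value. A class with
# __call__ would work the same way (plain-dict stand-in, not Orange API).

GET_VALUE = 0  # stands in for orange.GetValue

def compute_span(example, what=GET_VALUE):
    # Derive SPAN from other features; Imputer_model always asks for a
    # value, so the second argument can be ignored.
    if example["TYPE"] == "WOOD" or example["PURPOSE"] == "WALK":
        return "SHORT"
    return "MEDIUM"

print(compute_span({"TYPE": "WOOD", "PURPOSE": "RR"}))   # SHORT
print(compute_span({"TYPE": "STEEL", "PURPOSE": "RR"}))  # MEDIUM
```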

Missing values as special values
================================

Missing values sometimes have a special meaning. Caution is needed when
using such values in decision models. When the decision not to measure
something (for example, not to perform a laboratory test on a patient) is
based on the expert's knowledge of the class value, such missing values
clearly should not be used in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete feature is replaced
    with a new feature that has one more value: "NA". The new feature
    computes its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete attribute
    with values "def" and "undef", telling whether the value is defined or
    not. The feature's name will equal the original's with "_def" appended.
    The original continuous feature will remain in the domain and its
    unknowns will be replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue` that converts examples into
    the new domain.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new features constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous features.

The following code shows what the imputer actually does to the domain:

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

The two instances have the same attributes, with :samp:`imputed` having a
few additional ones. Comparing
:samp:`original.domain[0] == imputed.domain[0]` gives False: while the
names are the same, they represent different features. Indexing
:samp:`imputed` with a feature :samp:`i` from the original domain would
therefore fail, since :samp:`imputed` has no attribute :samp:`i`, only an
attribute with the same name. Using :samp:`i.name` to index the attributes
of :samp:`imputed` works, yet it is not fast. If used frequently, it is
better to compute the index once with
:samp:`imputed.domain.index(i.name)`.

For each continuous feature there is an additional feature whose name has
the suffix "_def", accessible by :samp:`i.name+"_def"`. The value of the
first continuous feature "ERECTED" remains 1874, and the additional
attribute "ERECTED_def" has the value "def". The undefined value in
"LENGTH" is replaced by the average (1567) and the new attribute has the
value "undef". The undefined discrete attribute "CLEAR-G" (and all other
undefined discrete attributes) is assigned the value "NA".
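
The transformation can be summarised in plain Python (dict stand-ins; Orange implements this with computed features in a new :obj:`~Orange.data.Domain`):

```python
# Sketch of the as-value transformation: discrete unknowns become "NA";
# each continuous feature gains a two-valued companion "<name>_def" and
# its own unknowns are replaced by the training average (None = missing).

def as_value(instance, continuous, averages):
    out = {}
    for f, v in instance.items():
        if f in continuous:
            out[f + "_def"] = "def" if v is not None else "undef"
            out[f] = v if v is not None else averages[f]
        else:
            out[f] = v if v is not None else "NA"
    return out

res = as_value({"ERECTED": 1874, "LENGTH": None, "CLEAR-G": None},
               continuous={"ERECTED", "LENGTH"},
               averages={"ERECTED": 1905.0, "LENGTH": 1567.0})
print(res)
# {'ERECTED_def': 'def', 'ERECTED': 1874, 'LENGTH_def': 'undef',
#  'LENGTH': 1567.0, 'CLEAR-G': 'NA'}
```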

Using imputers
==============

Imputation must run on training data only. Imputing the missing values
and subsequently using the data in cross-validation will give overly
optimistic results.

Learners with imputer as a component
------------------------------------

Learners that cannot handle missing values provide a slot for the imputer
component. An example of such a class is
:obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor`.

When given learning instances,
:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor` to get
an imputer and use it to impute the missing values in the learning data.
The imputed data is then used by the actual learning algorithm. When a
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is
constructed, the imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification time, the same imputer is used to impute missing values
in (testing) examples.

Details may vary from algorithm to algorithm, but this is how imputation
is generally used. When writing user-defined learners,
it is recommended to use imputation according to the described procedure.
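
The described procedure can be sketched for a user-defined learner (plain-Python stand-ins; the class and function names here are illustrative, not Orange API):

```python
# Sketch of a learner with an imputer component: train the imputer on the
# training data, learn from the imputed data, and store the imputer in the
# classifier so it can impute testing examples too (None = missing).

class MyLearner:
    def __init__(self, imputer_constructor, fit):
        self.imputer_constructor = imputer_constructor
        self.fit = fit

    def __call__(self, data):
        imputer = self.imputer_constructor(data)       # train the imputer
        model = self.fit([imputer(row) for row in data])
        return MyClassifier(model, imputer)            # keep it for testing

class MyClassifier:
    def __init__(self, model, imputer):
        self.model, self.imputer = model, imputer

    def __call__(self, instance):
        return self.model(self.imputer(instance))      # impute, then classify

def constant_imputer_constructor(data):
    return lambda row: {f: (0 if v is None else v) for f, v in row.items()}

def fit(rows):                        # toy "learner": sign of feature x
    return lambda row: "pos" if row["x"] > 0 else "nonpos"

learner = MyLearner(constant_imputer_constructor, fit)
clf = learner([{"x": 1}, {"x": None}])
print(clf({"x": None}))  # nonpos: the missing x is imputed as 0
```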

Wrapper for learning algorithms
===============================

Imputation is used by learning algorithms and other methods that cannot
handle unknown values. The wrapper imputes missing values,
calls the learner and, if imputation is also needed by the classifier,
wraps the classifier so that it imputes missing values in instances to
classify.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot
handle missing values should, in principle, provide slots for an imputer
constructor. For instance,
:obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and
even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they
    can appropriately handle the missing values. The Bayesian classifier,
    for instance, simply skips the corresponding attributes in the formula,
    while classification/regression trees have components for handling the
    missing values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter of a
    separate page, but we shall show its code here as another demonstration
    of how to use the imputers - logistic regression is implemented
    essentially the same way as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an
    instance of some classifier. There are a few attributes that need to be
    set, though.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a
    class with the same call operator).

    .. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not
    be wrapped into an imputer. Do this if the classifier doesn't mind if
    the examples it is given have missing values.

    The learner is best illustrated by its code - here's its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this: :obj:`ImputeLearner` will first construct
    the imputer, that is, call :obj:`self.imputer_constructor` to get a
    (trained) imputer. It will then use the imputer to impute the data and
    call the given :obj:`base_learner` to construct a classifier. For
    instance, :obj:`base_learner` could be a learner for logistic
    regression and the result would be a logistic regression model. If the
    classifier can handle unknown values (that is, if
    :obj:`dont_impute_classifier` is set), we return it as it is; otherwise
    we wrap it into :obj:`ImputeClassifier`, which is given the base
    classifier and the imputer that it can use to impute the missing values
    in (testing) examples.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: base_classifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor
    accepts two arguments, the classifier and the imputer, which are stored
    in the corresponding attributes. The call operator that does the
    classification then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data; even if you
   use cross-validation, the imputer will be trained on the right data. In
   the classification phase we again use the imputer that was trained on
   the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments
into the instance's dictionary. You are expected to call it like
``ImputeLearner(base_learner=<someLearner>,
imputer_constructor=<someImputerConstructor>)``. When the learner is called
with examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` with the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

Note that this code is slightly simplified: the omitted details handle
non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally if needed, it can sometimes happen that an expert can tell you
exactly what to put in the data instead of the missing values. In this
example we suppose that we want to impute the minimal value of each
feature. We will try to determine whether the naive Bayesian classifier
with its implicit internal imputation works better than one that uses
imputation by minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

This should output::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   We constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance
   is used twice in each fold. Once it is given the examples as they are
   and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`; the second time it
   is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have
   only one learner, but it produces two different classifiers in each
   round of testing.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all
Orange classes do so; refer to the documentation on `subtyping the Orange
classes in Python <callbacks.htm>`_ for a list). To write your own
imputation constructor or imputer, you simply program a Python function
that behaves like the built-in Orange classes. For an imputer you need
even less: a function that accepts a single example and returns its imputed
version; imputation of example tables will then use that function.

You will most often write the imputation constructor when you have a
special imputation procedure or separate procedures for various attributes,
as demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically
only need to pack everything we have written there into an imputer
constructor that accepts a data set and the id of the weight meta-attribute
(ignore it if you will, but you must accept two arguments) and returns the
imputer (probably an :obj:`Orange.feature.imputation.Imputer_model`). The
benefit of implementing an imputer constructor, as opposed to what we did
above, is that you can use such a constructor as a component for Orange
learners (like logistic regression) or for wrappers from module orngImpute,
and that way properly use it in classifier testing procedures.
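
Such a constructor can be sketched as an ordinary Python function (dict stand-ins; a real one would accept an :obj:`Orange.data.Table` and return, for example, an :obj:`Imputer_model`):

```python
# Sketch of a hand-written imputer constructor following the convention
# above: it accepts a data set and a weight id (which it may ignore) and
# returns an imputer; the imputer itself is just a function taking a
# single example (plain dicts with None = missing, not Orange classes).

def my_imputer_constructor(data, weight_id=0):
    # Special procedure per attribute: minimum for "LANES",
    # a fixed value for "T-OR-D", leave everything else alone.
    lanes_min = min(r["LANES"] for r in data if r["LANES"] is not None)

    def imputer(example):
        fixed = dict(example)             # never mutate the original
        if fixed.get("LANES") is None:
            fixed["LANES"] = lanes_min
        if fixed.get("T-OR-D") is None:
            fixed["T-OR-D"] = "THROUGH"
        return fixed

    return imputer

impute = my_imputer_constructor([{"LANES": 2, "T-OR-D": "THROUGH"},
                                 {"LANES": 4, "T-OR-D": None}])
print(impute({"LANES": None, "T-OR-D": None}))
# {'LANES': 2, 'T-OR-D': 'THROUGH'}
```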