source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9853:c4448c907bcb

Revision 9853:c4448c907bcb, 20.5 KB checked in by Matija Polajnar <matija.polajnar@…>, 2 years ago (diff)

Fix rst indentation.

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
========

:obj:`ImputerConstructor` is the abstract root in a hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new instance with the
missing values imputed (leaving the original instance intact). If an imputer
is called with an :obj:`Orange.data.Table`, it returns a new data table with
imputed instances.
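
This call pattern can be sketched in plain Python (an illustration of the
behaviour only, not the Orange API; instances are modelled as lists and
missing values as ``None``):

```python
def impute_instance(instance, defaults):
    """Return a new instance with missing values (None) replaced by
    the corresponding default; the original is left intact."""
    return [d if v is None else v for v, d in zip(instance, defaults)]

def impute_table(table, defaults):
    """Impute every instance of a table, returning a new table."""
    return [impute_instance(inst, defaults) for inst in table]

row = ['A', 1853, 'RR', None, 2]
defaults = ['A', 0, 'RR', 804, 0]
print(impute_instance(row, defaults))  # ['A', 1853, 'RR', 804, 2]
print(row)  # original still contains None
```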

.. class:: ImputerConstructor

    .. attribute:: impute_class

    Indicates whether to impute the class value. Defaults to True.

Simple imputation
=================

Simple imputers always impute the same value for a particular feature,
disregarding the values of other features. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute::  defaults

    An instance of :obj:`Orange.data.Instance` with the default values to be
    imputed instead of missing values. Examples to be imputed must be from
    the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, these constructors impute the smallest, largest or
average value encountered in the training examples. For discrete features,
they impute the lowest value (the one with index 0, i.e. ``attr.values[0]``),
the highest value (``attr.values[-1]``) or the most common value encountered
in the data, respectively. If the values of a discrete feature are ordered
according to their impact on the class (for example, possible values for
symptoms of some disease can be ordered according to their seriousness), the
minimal and maximal imputers will represent optimistic and pessimistic
imputations.
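
As an illustration (plain Python, not the Orange API), the defaults picked
by these constructors could be computed from one feature's column of known
values like this:

```python
from collections import Counter

def minimal_default(values):
    """Smallest known value of a continuous feature."""
    return min(v for v in values if v is not None)

def maximal_default(values):
    """Largest known value of a continuous feature."""
    return max(v for v in values if v is not None)

def average_default(values):
    """Average of the known values of a continuous feature."""
    known = [v for v in values if v is not None]
    return sum(known) / float(len(known))

def most_common_default(values):
    """Most common known value of a discrete feature."""
    known = [v for v in values if v is not None]
    return Counter(known).most_common(1)[0][0]
```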

User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example "LENGTH" is the
only attribute to get imputed, with value 1234:

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

If the constructor of :obj:`~Orange.feature.imputation.Imputer_defaults` is
given an argument of type :obj:`~Orange.data.Domain`, it constructs an empty
instance for :obj:`defaults`. If an instance is given, the reference to the
instance will be kept. To avoid problems associated with
``Imputer_defaults(data[0])``, it is better to provide a copy of the
instance: ``Imputer_defaults(Orange.data.Instance(data[0]))``.

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (defaults to False), the random generator is initialized for
    each instance using the instance's hash value as a seed. This results in
    the same instances always being imputed with the same (random) values.
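
The deterministic behaviour can be sketched as follows (plain Python; the
Orange class works on its own data structures, but the seeding idea is the
same):

```python
import random

def impute_random(instance, candidates, deterministic=False):
    """Replace missing values (None) with random picks from candidates.
    With deterministic=True the generator is seeded with the instance's
    hash, so the same instance is always imputed the same way."""
    seed = hash(tuple(instance)) if deterministic else None
    rng = random.Random(seed)
    return [rng.choice(candidates) if v is None else v for v in instance]

a = impute_random([None, 1, None], [0, 1, 2, 3], deterministic=True)
b = impute_random([None, 1, None], [0, 1, 2, 3], deterministic=True)
# a == b: repeated imputations of the same instance agree
```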

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the feature's value from the
    values of other features. :obj:`ImputerConstructor_model` is given two
    learning algorithms and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers
    that are used for imputation.

    .. attribute:: learner_discrete, learner_continuous

    Learners for discrete and for continuous attributes. If either of them
    is missing, the attributes of the corresponding type will not get
    imputed.

    .. attribute:: use_class

    Tells whether the imputer can use the class attribute. Defaults to
    False. It is useful in more complex designs in which one imputer is used
    on learning instances, where it uses the class value, and a second
    imputer on testing instances, where the class is not available.

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute to be imputed.
    The :obj:`class_var` of each model should equal the corresponding
    attribute of the instances. If an element is :obj:`None`, the
    corresponding attribute's values are not imputed.
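
A minimal sketch of the imputation step itself (plain Python; ``models``
plays the role of :obj:`Imputer_model.models`, with ``None`` entries for
attributes that are not imputed, and the lambda is a hypothetical stand-in
for a trained classifier):

```python
def impute_with_models(instance, models):
    """Each models[i] is a callable that predicts feature i from the
    whole instance; None entries leave the feature untouched."""
    return [models[i](instance)
            if v is None and models[i] is not None else v
            for i, v in enumerate(instance)]

# hypothetical models: feature 0 is never imputed, feature 1 is
# "predicted" as twice the value of feature 0
models = [None, lambda inst: 2 * inst[0]]
print(impute_with_models([3, None], models))  # [3, 6]
```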

.. rubric:: Examples

Examples are taken from :download:`imputation-complex.py
<code/imputation-complex.py>`. The following imputer predicts the missing
attribute values using classification and regression trees with a minimum
of 20 examples in a leaf.

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

A common setup, where different learning algorithms are used for discrete
and continuous features, is to use
:class:`~Orange.classification.bayes.NaiveLearner` for discrete and
:class:`~Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes:

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

To construct a user-defined :class:`Imputer_model`:

.. literalinclude:: code/imputation-complex.py
    :lines: 108-112

A list of empty models, :obj:`Imputer_model.models`, is initialized first.
The continuous feature "LANES" is imputed with the value 2 using
:obj:`DefaultClassifier`. A float must be given, because integer values are
interpreted as indices of discrete features. The discrete feature "T-OR-D"
is imputed using :class:`Orange.classification.ConstantClassifier`, which is
given the index of the value "THROUGH" as an argument.

The feature "LENGTH" is imputed with a regression tree induced from
"MATERIAL", "SPAN" and "ERECTED" (feature "LENGTH" is used as the class
attribute here). The domain is initialized by giving a list of feature names
and a domain as an additional argument in which Orange will look for the
features.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

This is how the inferred tree should look::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Wooden bridges and walkways are short, while the others are mostly
medium. This could be encoded in feature "SPAN" using
:class:`Orange.classifier.ClassifierByLookupTable`, which is faster than the
Python function used here:

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

If :obj:`compute_span` is written as a class, it must behave like a
classifier: it accepts an example and returns a value. The second
argument tells what the caller expects the classifier to return: a value,
a distribution or both. Currently, :obj:`Imputer_model`
always expects values, so the argument can be ignored.

Missing values as special values
================================

Missing values sometimes have a special meaning. Caution is needed when
using such values in decision models. When the decision not to measure
something (for example, performing a laboratory test on a patient) is based
on the expert's knowledge of the class value, such missing values clearly
should not be used in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete feature is replaced
    with a new feature that has one more value: "NA". The new feature
    computes its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete attribute
    with values "def" and "undef", telling whether the value is defined or
    not. The feature's name will equal the original's with "_def" appended.
    The original continuous feature will remain in the domain and its
    unknowns will be replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue` that converts the example into
    the new domain.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new features constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous features.

The following code shows what the imputer actually does to the domain:

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

The two examples have the same attributes, with :samp:`imputed` having a few
additional ones. Comparing :samp:`original.domain[0] == imputed.domain[0]`
will result in False: while the names are the same, they represent different
features. Writing :samp:`imputed[i]` would fail, since :samp:`imputed` has
no attribute :samp:`i`, but it has an attribute with the same name.
Using :samp:`i.name` to index the attributes of :samp:`imputed` will work,
yet it is not fast. If used frequently, it is better to compute the index
with :samp:`imputed.domain.index(i.name)`.

For continuous features, there is an additional feature with the name suffix
"_def", which is accessible as :samp:`i.name+"_def"`. The value of the first
continuous feature "ERECTED" remains 1874, and the additional attribute
"ERECTED_def" has the value "def". The undefined value in "LENGTH" is
replaced by the average (1567) and the new attribute has the value "undef".
The undefined discrete attribute "CLEAR-G" (and all other undefined discrete
attributes) is assigned the value "NA".
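
The per-value transformation shown in the output can be sketched in plain
Python (illustration only; the average 1567 would come from the training
data):

```python
def as_value_discrete(value):
    """Unknown discrete values become the extra value 'NA'."""
    return 'NA' if value is None else value

def as_value_continuous(value, average):
    """Return (indicator, value): the indicator says whether the
    original value was defined; unknowns get the training average."""
    return ('undef', average) if value is None else ('def', value)

print(as_value_continuous(1874, 1567))  # ('def', 1874)
print(as_value_continuous(None, 1567))  # ('undef', 1567)
print(as_value_discrete(None))          # 'NA'
```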

Using imputers
==============

Imputation must run on training data only. Imputing the missing values
and subsequently using the data in cross-validation will give overly
optimistic results.
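
A sketch of the correct protocol in plain Python: the imputer is fitted on
the training fold only, and the same fitted defaults are then applied to the
test fold:

```python
def fit_minimal_imputer(train):
    """Learn per-column minimal defaults from the training fold only,
    returning an imputer function for single instances."""
    columns = list(zip(*train))
    defaults = [min(v for v in col if v is not None) for col in columns]
    def impute(instance):
        return [d if v is None else v for v, d in zip(instance, defaults)]
    return impute

train = [[1, 4], [3, None], [2, 6]]
test = [[None, 5]]
imputer = fit_minimal_imputer(train)     # fitted on training data only
print([imputer(inst) for inst in test])  # [[1, 5]]
```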

Learners with imputer as a component
------------------------------------

Learners that cannot handle missing values provide a slot for the imputer
component. An example of such a class is
:obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`.

When given learning instances,
:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor` to get
an imputer and use it to impute the missing values in the learning data.
The imputed data is then used by the actual learning algorithm. When a
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is
constructed, the imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification time, the same imputer is used for imputation of missing
values in (testing) examples.

Details may vary from algorithm to algorithm, but this is how imputation
is generally used. When writing user-defined learners,
it is recommended to use imputation according to the described procedure.

Wrapper for learning algorithms
===============================

Imputation is also used with learning algorithms and other methods that are
not capable of handling unknown values. A wrapper imputes missing values,
calls the learner and, if imputation is also needed by the classifier,
wraps the classifier so that it imputes missing values in instances to
classify.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot
handle missing values should, in principle, provide slots for an imputer
constructor. For instance,
:obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputer_constructor`,
and even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they
    can appropriately handle the missing values. The Bayesian classifier,
    for instance, simply skips the corresponding attributes in the formula,
    while classification/regression trees have components for handling the
    missing values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter of a separate
    page, but we shall show its code here as another demonstration of how to
    use the imputers - logistic regression is implemented essentially the
    same way as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier. There are a few attributes that need to be set,
    though.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a class
    with the same call operator).

    .. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not be
    wrapped into an imputer. Do this if the classifier doesn't mind if the
    examples it is given have missing values.

    The learner is best illustrated by its code - here's its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this. :obj:`ImputeLearner` will first construct
    the imputer (that is, call :obj:`self.imputer_constructor`) to get a
    trained imputer. Then it will use the imputer to impute the data, and
    call the given :obj:`base_learner` to construct a classifier. For
    instance, :obj:`base_learner` could be a learner for logistic regression
    and the result would be a logistic regression model. If the classifier
    can handle unknown values (that is, if :obj:`dont_impute_classifier` is
    set), we return it as it is; otherwise we wrap it into
    :obj:`ImputeClassifier`, which is given the base classifier and the
    imputer it can use to impute the missing values in (testing) examples.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: base_classifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor accepts
    two arguments, the classifier and the imputer, which are stored into the
    corresponding attributes. The call operator which does the classification
    then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you
   do cross-validation, the imputer will be trained on the right data. In
   the classification phase we again use the imputer which was trained on
   the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments
into the instance's dictionary. You are expected to call it like
:obj:`ImputeLearner(base_learner=<someLearner>,
imputer_constructor=<someImputerConstructor>)`. When the learner is called
with examples, it
trains the imputer, imputes the data, induces a :obj:`base_classifier` by
the :obj:`base_learner` and constructs an :obj:`ImputeClassifier` that
stores the :obj:`base_classifier` and the :obj:`imputer`. For
classification, the missing values are imputed and the classifier's
prediction is returned.

Note that this code is slightly simplified; the omitted details handle
non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally, if needed, it can sometimes happen that an expert will be able
to tell you exactly what to put in the data instead of the missing values.
In this example we shall suppose that we want to impute the minimal value
of each feature. We will try to determine whether the naive Bayesian
classifier with its implicit internal imputation works better than one that
uses imputation by minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

Should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   Note that we constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance
   is used twice in each fold: once it is given the examples as they are
   (and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`), and the second time
   it is called by :obj:`imba`, where the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only
   one learner, but it produces two different classifiers in each round of
   testing.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all
Orange classes do so; refer to the documentation on `subtyping the Orange
classes in Python <callbacks.htm>`_ for a list). If you want to write your
own imputation constructor or imputer, you simply need to program a Python
function that behaves like the built-in Orange classes. For an imputer it
is even simpler: you only need to write a function that gets an example as
its argument; imputation of example tables will then use that function.

You will most often write the imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as we've
demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically
only need to pack everything we've written there into an imputer constructor
that accepts a data set and the id of the weight meta-attribute (ignore it
if you will, but you must accept two arguments), and returns the imputer
(probably an :obj:`Orange.feature.imputation.Imputer_model`). The benefit of
implementing an imputer constructor, as opposed to what we did above, is
that you can use such a constructor as a component for Orange learners (like
logistic regression) or for wrappers from module orngImpute, and that way
properly use it in classifier testing procedures.
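
For example, a minimal user-defined imputer constructor with the
two-argument signature described above might look like this (a plain-Python
sketch, not the Orange API; it imputes column averages and ignores the
weight id):

```python
def average_imputer_constructor(data, weight_id=0):
    """Accept a data set and a weight id (ignored here) and return an
    imputer: a function mapping one example to an imputed copy."""
    columns = list(zip(*data))
    averages = []
    for col in columns:
        known = [v for v in col if v is not None]
        averages.append(sum(known) / float(len(known)))
    def imputer(example):
        return [avg if v is None else v
                for v, avg in zip(example, averages)]
    return imputer

impute = average_imputer_constructor([[1.0, None], [3.0, 4.0]])
print(impute([None, None]))  # [2.0, 4.0]
```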