source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9808:0a0cb189ea89

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with suitable substitute values; in
the example below, each missing value is replaced with the feature's minimal
value:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
=================

:obj:`ImputerConstructor` is the abstract root of a hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new instance with the
missing values imputed (leaving the original instance intact). If an imputer
is called with an :obj:`Orange.data.Table`, it returns a new data table with
imputed instances.

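For illustration, here is a minimal sketch of this pattern, assuming the
bridges data set used elsewhere on this page::

    import Orange

    bridges = Orange.data.Table("bridges")

    # the constructor is given training data and returns a trained imputer
    imputer = Orange.feature.imputation.ImputerConstructor_minimal(bridges)

    # calling the imputer with an instance returns a new, imputed instance
    imputed_instance = imputer(bridges[10])

    # calling it with a table returns a new table with all instances imputed
    imputed_table = imputer(bridges)
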
.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Indicates whether to impute the class value. Defaults to True.

    .. attribute:: deterministic

    Indicates whether to initialize the random number generator with each
    instance's CRC, so that the same instance is always imputed in the same
    way. Defaults to False.

Simple imputation
=================

Simple imputers always impute the same value for a particular feature,
disregarding the values of other features. They all use the same class,
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

    An :obj:`Orange.data.Instance` holding the default values that are
    imputed in place of missing values. Instances to be imputed must be from
    the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, these impute the smallest, the largest or the average
value encountered in the training examples, respectively. For discrete
features, they impute the value with the lowest index (attr.values[0]),
the value with the highest index (attr.values[-1]), or the most common value
encountered in the data. If the values of a discrete feature are ordered
according to their impact on the class (for example, possible values for
symptoms of some disease can be ordered according to their seriousness),
the minimal and maximal imputers represent optimistic and pessimistic
imputations.

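As a sketch of how these constructors are used (again assuming the bridges
data set), the defaults chosen by the average imputer can be inspected
directly, since the constructed imputer is an :obj:`Imputer_defaults`::

    import Orange

    bridges = Orange.data.Table("bridges")
    imputer = Orange.feature.imputation.ImputerConstructor_average(bridges)

    # averages for continuous features, most common values for discrete ones
    print(imputer.defaults)
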
User-defined defaults can be given when constructing an
:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
unspecified do not get imputed. In the following example, "LENGTH" is the
only attribute that gets imputed, with the value 1234:

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

If the constructor of :obj:`~Orange.feature.imputation.Imputer_defaults` is
given an argument of type :obj:`~Orange.data.Domain`, it constructs an empty
instance for :obj:`defaults`. If an instance is given, the reference to the
instance is kept. To avoid problems associated with
`Imputer_defaults(data[0])`, it is better to provide a copy of the instance:
`Imputer_defaults(Orange.data.Instance(data[0]))`.

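A short sketch of this idea, following the "LENGTH" example above (the data
set and the instance index are only illustrative)::

    import Orange

    data = Orange.data.Table("bridges")

    # an empty instance is constructed for defaults: nothing gets imputed yet
    imputer = Orange.feature.imputation.Imputer_defaults(data.domain)

    # only features with a specified default are imputed
    imputer.defaults["LENGTH"] = 1234

    imputed = imputer(data[10])
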
Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (defaults to False), the random generator is initialized for each
    instance using the instance's hash value as a seed. The same instance is
    thus always imputed with the same (random) values.

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict a feature's value from the values of
    other features. :obj:`ImputerConstructor_model` is given two learning
    algorithms and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers that
    are used for imputation.

    .. attribute:: learner_discrete, learner_continuous

    Learners for discrete and for continuous attributes. If either of them is
    missing, the attributes of the corresponding type are not imputed.

    .. attribute:: use_class

    Tells whether the imputer can use the class attribute. Defaults to
    False. This is useful in more complex designs in which one imputer is used
    on learning instances, where it uses the class value,
    and a second imputer on testing instances, where the class is not
    available.

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute to be imputed.
    The :obj:`class_var` of each model must match the corresponding attribute
    of the instances. If an element is :obj:`None`, the corresponding
    attribute's values are not imputed.

.. rubric:: Examples

Examples are taken from :download:`imputation-complex.py
<code/imputation-complex.py>`. The following imputer predicts the missing
attribute values using classification and regression trees with the minimum
of 20 examples in a leaf.

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

A common setup, where different learning algorithms are used for discrete
and continuous features, is to use
:class:`~Orange.classification.bayes.NaiveLearner` for discrete and
:class:`~Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes:

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

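In outline, such a constructor is set up roughly as follows (a sketch; the
exact code is in the referenced script)::

    import Orange

    data = Orange.data.Table("bridges")

    imputer_constructor = Orange.feature.imputation.ImputerConstructor_model()
    imputer_constructor.learner_discrete = Orange.classification.bayes.NaiveLearner()
    imputer_constructor.learner_continuous = Orange.regression.mean.MeanLearner()

    # calling the constructor with data returns the trained Imputer_model
    imputer = imputer_constructor(data)
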
You can also construct an :class:`Imputer_model` yourself. You will do
this if different attributes need different treatment. Brace for an
example that is a bit more complex: we first construct an
:class:`Imputer_model` and initialize an empty list of models.

To construct a user-defined :class:`Imputer_model`:

.. literalinclude:: code/imputation-complex.py
    :lines: 108-112

A list of empty models is first initialized. Continuous feature "LANES" is
imputed with the value 2, using :obj:`DefaultClassifier` with the default
value 2.0. A float must be given, because integer values are interpreted as
indices of discrete features. Discrete feature "T-OR-D" is imputed using
:class:`Orange.classification.ConstantClassifier`, which is given the index
of the value "THROUGH" as an argument. Both classifiers are stored at the
appropriate places in :obj:`Imputer_model.models`.

Feature "LENGTH" is imputed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (feature "LENGTH" is used as the class attribute here).
The new domain is constructed by simply giving a list of feature names, with
the original domain as an additional argument in which Orange looks up the
features.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

This is what the inferred tree should look like::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done with
:class:`Orange.classification.lookup.ClassifierByLookupTable`, which would be
faster than what we plan here; see the documentation on lookup classifiers.
Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

:obj:`compute_span` could also be written as a class, if you prefer.
What matters is that it behaves like a classifier: it gets an example
and returns a value. The second argument tells, as usual, what the caller
expects the classifier to return - a value, a distribution or both. Since the
caller, :obj:`Imputer_model`, always wants values, we simply ignore the
argument (at the risk of having problems in the future, when imputers might
handle distributions as well).

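For illustration, the same idea written as a class might look roughly like
this (a sketch; the rule, the variable and the names are only illustrative,
the actual function is in the referenced script)::

    class ComputeSpan:
        def __init__(self, span_var):
            # the (discrete) SPAN variable whose values we return
            self.span_var = span_var

        def __call__(self, example, return_what=None):
            # behaves like a classifier: takes an example, returns a value;
            # the second argument is ignored, since the imputer wants values
            if example["TYPE"] == "WOOD" or example["PURPOSE"] == "WALK":
                return Orange.data.Value(self.span_var, "SHORT")
            else:
                return Orange.data.Value(self.span_var, "MEDIUM")
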
Missing values as special values
================================

Missing values sometimes have a special meaning: the fact that something was
not measured can tell a lot. Be cautious, however, when using such
values in decision models; if the decision not to measure something (for
instance, performing a laboratory test on a patient) is based on the expert's
knowledge of the class value, such unknown values clearly should not be used
in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each
    discrete attribute is replaced with a new attribute that has one value more:
    "NA". The new attribute will compute its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it will
    construct a two-valued discrete attribute with values "def" and "undef",
    telling whether the continuous attribute was defined or not. The attribute's
    name will equal the original's with "_def" appended. The original continuous
    attribute will remain in the domain and its unknowns will be replaced by
    averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs :class:`Imputer_asValue` (I bet you
    wouldn't guess). It converts the example into the new domain, which imputes
    the values for discrete attributes. If continuous attributes are present, it
    will also replace their values by the averages.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new attributes constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous attributes. Present only if there are any.

The following code shows what this imputer actually does to the domain.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

At first glance, the two examples seem to have the same attributes (with
:samp:`imputed` having a few additional ones). But if you check
:samp:`original.domain[0] == imputed.domain[0]`, you will see that the
comparison is False. The attributes only have the same names,
but they are different attributes. (As explained elsewhere in the
documentation, Orange does not really care about attribute names.)

Therefore, if we wrote :samp:`imputed[i]` the program would fail,
since :samp:`imputed` has no attribute :samp:`i`. But it has an
attribute with the same name (which usually even has the same value). We
therefore use :samp:`i.name` to index the attributes of
:samp:`imputed`. (Using names for indexing is not fast, though; if you do
it a lot, compute the integer index with
:samp:`imputed.domain.index(i.name)`.)

For continuous attributes, there is an additional attribute with "_def"
appended to the name; we get it by :samp:`i.name+"_def"`.

The first continuous attribute, "ERECTED", is defined. Its value remains 1874
and the additional attribute "ERECTED_def" has the value "def". Not so for
"LENGTH": its undefined value is replaced by the average (1567) and the new
attribute has the value "undef". The undefined discrete attribute "CLEAR-G"
(and every other undefined discrete attribute) is assigned the value "NA".

Using imputers
==============

To properly use the imputation classes in a learning process, they must be
trained on training examples only. Imputing the missing values first and
subsequently using the data set in cross-validation will give overly
optimistic results.

Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values will generally provide a slot
for the imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you
can assign an imputer constructor - one of the above constructors or a specific
constructor you wrote yourself. When given learning examples,
:obj:`Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an
imputer (again some of the above or a specific imputer you programmed). It will
immediately use the imputer to impute the missing values in the learning data
set, so it can be used by the actual learning algorithm. Besides, when the
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed,
the imputer will be stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification, the imputer will be used for imputation of missing values in
(testing) examples.

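A minimal sketch of this setup, assuming the voting data set used in the
example below (the attribute name follows the documentation above)::

    import Orange

    voting = Orange.data.Table("voting")

    lr = Orange.classification.logreg.LogRegLearner()
    lr.imputerConstructor = Orange.feature.imputation.ImputerConstructor_average()

    # the learner imputes the training data itself and stores the imputer
    # in the constructed classifier, which uses it at classification time
    classifier = lr(voting)
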
Although details may vary from algorithm to algorithm, this is how the
imputation is generally used in Orange's learners. Also, if you write your own
learners, it is recommended that you use imputation according to the described
procedure.

Wrapper for learning algorithms
===============================

Imputation is used by learning algorithms and other methods that cannot
handle unknown values. The wrapper imputes the missing values,
calls the learner and, if imputation is also needed by the classifier,
wraps the resulting classifier into another wrapper that imputes missing
values in the examples to be classified.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot handle
missing values should, in principle, provide a slot for an imputer constructor.
For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even
if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they can
    appropriately handle the missing values. The Bayesian classifier, for
    instance, simply skips the corresponding attributes in the formula, while
    classification/regression trees have components for handling the missing
    values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter for a separate
    page, but we show its code here as another demonstration of how to
    use the imputers - logistic regression is implemented in essentially the
    same way as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier. There are a few attributes that need to be set, though.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a class
    with the same call operator).

    .. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not be
    wrapped into an imputer. Do this if the classifier doesn't mind if the
    examples it is given have missing values.

    The learner is best illustrated by its code - here is its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this. :obj:`ImputeLearner` will first construct
    the imputer (that is, call :obj:`self.imputer_constructor`) to get a
    trained imputer. Then it will use the imputer to impute the data, and call
    the given :obj:`base_learner` to construct a classifier. For instance,
    :obj:`base_learner` could be a learner for logistic regression and the
    result would be a logistic regression model. If the classifier can handle
    unknown values (that is, if :obj:`dont_impute_classifier` is set), we
    return it as it is; otherwise we wrap it into :obj:`ImputeClassifier`,
    which is given the base classifier and the imputer, which it can use to
    impute the missing values in (testing) examples.

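    For example, constructing such a wrapped learner might look roughly like
    this (a sketch; the naive Bayesian learner and the minimal-value imputer
    are arbitrary choices)::

        import Orange

        imputing_bayes = Orange.feature.imputation.ImputeLearner(
            base_learner=Orange.classification.bayes.NaiveLearner(),
            imputer_constructor=Orange.feature.imputation.ImputerConstructor_minimal())

        classifier = imputing_bayes(Orange.data.Table("voting"))
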
.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given data.

    .. attribute:: base_classifier

    The wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor accepts
    two arguments, the classifier and the imputer, which are stored in the
    corresponding attributes. The call operator which does the classification
    then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes the
    imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you do
   cross-validation, the imputer will be trained on the right data. In the
   classification phase we again use the imputer that was trained on the
   training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into
the instance's dictionary. You are expected to call it like
:obj:`ImputeLearner(base_learner=<someLearner>,
imputer_constructor=<someImputerConstructor>)`. When the learner is called with
examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` with the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

Note that this code is slightly simplified; the omitted details handle
non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples = None, weightID = 0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally, if needed, it can sometimes happen that an expert can
tell you exactly what to put in the data instead of the missing values. In this
example we suppose that we want to impute the minimal value of each
feature. We will try to determine whether the naive Bayesian classifier with
its implicit internal imputation works better than one that uses imputation by
minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

This should output::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   Note that we constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is
   used twice in each fold. The first time it is given the examples as they
   are and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`. The second time it is
   called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only
   one learner, but it produces two different classifiers in each round of
   testing.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all Orange
classes do so; refer to the documentation on `subtyping the Orange classes
in Python <callbacks.htm>`_ for a list). If you want to write your own
imputation constructor or imputer, you simply need to write a Python
function that behaves like the built-in Orange classes. It is even easier for
an imputer: you only need to write a function that gets an example as its
argument; imputation of example tables will then use that function.

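A sketch of such a hand-written imputer (the feature name, the replacement
value and the use of :obj:`is_special` are only illustrative)::

    import Orange

    def impute_length(instance):
        # return a copy of the instance with an unknown "LENGTH" filled in
        instance = Orange.data.Instance(instance)
        if instance["LENGTH"].is_special():
            instance["LENGTH"] = 1000
        return instance
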
You will most often write an imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as we
demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only
need to pack everything we have written there into an imputer constructor that
accepts a data set and the id of the weight meta-attribute (ignore it if
you will, but you must accept two arguments), and returns the imputer (probably
an :obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing
an imputer constructor, as opposed to what we did above, is that you can use
such a constructor as a component for Orange learners (like logistic
regression) or for the wrappers described above (formerly module orngImpute),
and in that way use imputation properly in classifier testing procedures.
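
A sketch of such a constructor (the choice of learners is only illustrative;
any function that accepts the data and a weight id and returns an imputer
will do)::

    import Orange

    def impute_by_models(data, weight=0):
        constructor = Orange.feature.imputation.ImputerConstructor_model()
        constructor.learner_discrete = Orange.classification.bayes.NaiveLearner()
        constructor.learner_continuous = Orange.regression.mean.MeanLearner()
        # return the trained imputer, just like the built-in constructors do
        return constructor(data, weight)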