.. Source: orange/docs/reference/rst/Orange.feature.imputation.rst, revision 9807:56bf3eae608e, checked in by tomazc <tomaz.curk@…>.

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
=================

:obj:`ImputerConstructor` is the abstract root in the hierarchy of classes
that accept training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new example with the
missing values imputed (leaving the original example intact). If an imputer is
called with an :obj:`Orange.data.Table`, it returns a new example table
with imputed instances.

.. class:: ImputerConstructor

    .. attribute:: imputeClass

        Indicates whether to impute the class value. Default is True.

    .. attribute:: deterministic

        Indicates whether to initialize the random generator with the
        example's CRC. Default is False.

Simple imputation
=================

Simple imputers always impute the same value for a particular attribute,
disregarding the values of other attributes. They all use the same class
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute::  defaults

        An :obj:`Orange.data.Instance` with the default values to be
        imputed instead of missing values. Examples to be imputed must be from
        the same :obj:`~Orange.data.Domain` as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`~Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`~Orange.feature.imputation.ImputerConstructor_maximal`, and
:obj:`~Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, they will impute the smallest, the largest or the
average value encountered in the training examples. For discrete features,
they will impute the lowest value (the one with index 0, e.g.
``attr.values[0]``), the highest value (``attr.values[-1]``), or the most
common value encountered in the data, respectively.

If the values of a discrete feature are ordered according to their
impact on the class (for example, possible values for symptoms of some
disease can be ordered according to their seriousness),
the minimal and maximal imputers will then represent optimistic and
pessimistic imputations.
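
The three strategies can be sketched in plain Python. This is only an
illustration of the rules described above and not Orange's implementation:
``MISSING``, ``fit_defaults`` and ``impute`` are hypothetical names, rows are
plain lists, and each column is described either by the string
``"continuous"`` or by its ordered list of discrete values.

```python
from collections import Counter

MISSING = None  # stand-in for Orange's unknown value ("?")


def fit_defaults(rows, kinds, strategy):
    """Compute one default per column.

    kinds[i] is "continuous" or the ordered list of discrete values;
    strategy is "minimal", "maximal" or "average".
    """
    defaults = []
    for col, kind in enumerate(kinds):
        known = [row[col] for row in rows if row[col] is not MISSING]
        if kind == "continuous":
            if strategy == "minimal":
                defaults.append(min(known))
            elif strategy == "maximal":
                defaults.append(max(known))
            else:
                defaults.append(sum(known) / len(known))
        else:
            if strategy == "minimal":
                defaults.append(kind[0])       # value with index 0
            elif strategy == "maximal":
                defaults.append(kind[-1])      # value with the last index
            else:
                defaults.append(Counter(known).most_common(1)[0][0])
    return defaults


def impute(row, defaults):
    """Return a new row; the original row is left intact."""
    return [d if v is MISSING else v for v, d in zip(row, defaults)]


rows = [[1000, "SHORT"], [2000, "MEDIUM"], [MISSING, "MEDIUM"], [1500, MISSING]]
kinds = ["continuous", ["SHORT", "MEDIUM", "LONG"]]
minimal = fit_defaults(rows, kinds, "minimal")
maximal = fit_defaults(rows, kinds, "maximal")
average = fit_defaults(rows, kinds, "average")
```

Note that for discrete columns the minimal and maximal defaults come from the
declared order of values, not from the data, which is what makes the
optimistic/pessimistic reading possible.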

You can also construct the :obj:`~Orange.feature.imputation.Imputer_defaults`
yourself and specify your own defaults, or leave some values unspecified, in
which case the imputer won't impute them, as in the following example. Here,
the only attribute whose values will get imputed is "LENGTH"; the imputed value
will be 1234.

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

The constructor of :obj:`Orange.feature.imputation.Imputer_defaults` accepts an
argument of type :obj:`Orange.data.Domain` (in which case it constructs an
empty instance for :obj:`defaults`) or an example. Be careful with the latter:
:obj:`Orange.feature.imputation.Imputer_defaults` will hold a reference to the
instance and not a copy. You can make a copy yourself to avoid problems:
instead of ``Imputer_defaults(data[0])`` you may want to write
``Imputer_defaults(Orange.data.Instance(data[0]))``.
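
The difference between holding a reference and holding a copy can be shown
with a small plain-Python analogy. ``DefaultsImputer`` is a hypothetical
stand-in and does not use the Orange API; lists play the role of data
instances.

```python
import copy


class DefaultsImputer:
    """Hypothetical stand-in that keeps the defaults object it is given."""

    def __init__(self, defaults):
        self.defaults = defaults  # a reference, not a copy

    def __call__(self, row):
        return [d if v is None else v for v, d in zip(row, self.defaults)]


data = [[1853, "WOOD"], [1900, "IRON"]]
shared = DefaultsImputer(data[0])               # shares data[0]
isolated = DefaultsImputer(copy.copy(data[0]))  # owns a private copy

data[0][0] = 0  # a later change to the data set

shared_result = shared([None, "STEEL"])
isolated_result = isolated([None, "STEEL"])
```

After the data set is modified, ``shared`` imputes the changed value while
``isolated`` still imputes the value that was current when it was built.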

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

        Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

        If true (default is False), the random generator is initialized for
        each example using the example's hash value as a seed. As a result,
        identical examples are always imputed the same values.
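
A minimal sketch of the deterministic mode, assuming a CRC of the example's
textual form is an acceptable seed. ``impute_random`` and ``choices`` are
hypothetical names, and this is not how Orange computes its seed internally.

```python
import random
import zlib


def impute_random(row, choices, deterministic=False, rng=None):
    """Replace None entries with a random candidate from choices[i]."""
    if deterministic:
        # Seed a private generator from the example itself, so that equal
        # examples always receive the same imputed values.
        seed = zlib.crc32(repr(row).encode("utf-8"))
        rng = random.Random(seed)
    elif rng is None:
        rng = random.Random()
    return [rng.choice(choices[i]) if value is None else value
            for i, value in enumerate(row)]


row = ["A", None, 2, None]
choices = [["A", "M"], [804, 1000, 1200], [1, 2, 4], ["SHORT", "MEDIUM", "LONG"]]
first = impute_random(row, choices, deterministic=True)
second = impute_random(row, choices, deterministic=True)
```

With ``deterministic=True`` the two calls return identical rows; without it,
repeated calls may differ.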

Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the attribute's value from values of
    other attributes. :obj:`ImputerConstructor_model` is given a learning
    algorithm (two, actually - one for discrete and one for continuous
    attributes) and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers which
    are used when needed.

    .. attribute:: learner_discrete, learner_continuous

        Learner for discrete and for continuous attributes. If any of them is
        missing, the attributes of the corresponding type won't get imputed.

    .. attribute:: use_class

        Tells whether the imputer is allowed to use the class value. As this is
        most often undesired, this option is by default set to False. It can
        however be useful for a more complex design in which we would use one
        imputer for learning examples (this one would use the class value) and
        another for testing examples (which would not use the class value as
        this is unavailable at that moment).

.. class:: Imputer_model

    .. attribute:: models

        A list of classifiers, each corresponding to one attribute of the
        examples whose values are to be imputed. The class variables
        (:obj:`classVar`) of the models should equal the examples' attributes.
        If any classifier is missing (that is, the corresponding element of
        the list is :obj:`None`), the corresponding attribute's values will
        not be imputed.

.. rubric:: Examples

The following imputer predicts the missing attribute values using
classification and regression trees with a minimum of 20 examples in a leaf.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

We could even use the same learner for discrete and continuous attributes,
as :class:`Orange.classification.tree.TreeLearner` checks the class type
and constructs regression or classification trees accordingly. The
common parameters, such as the minimal number of
examples in leaves, are used in both cases.

You can also use different learning algorithms for discrete and
continuous attributes. A common setup is probably to use
:class:`Orange.classification.bayes.NaiveLearner` for discrete and
:class:`Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes. Part of
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

You can also construct an :class:`Imputer_model` yourself. You will do
this if different attributes need different treatment. Brace for an
example that will be a bit more complex. First we shall construct an
:class:`Imputer_model` and initialize an empty list of models.
The following code snippets are from
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 108-109

Attributes "LANES" and "T-OR-D" will always be imputed values 2 and
"THROUGH". Since "LANES" is continuous, it suffices to construct a
:class:`Orange.classification.ConstantClassifier` with the default value 2.0
(don't forget the decimal part, or else Orange will think you are talking
about the index of a discrete value - how could it tell?). For the discrete
attribute "T-OR-D", we could construct a
:class:`Orange.classification.ConstantClassifier` and give the index of the
value "THROUGH" as an argument. But we shall do it more nicely, by
constructing an :class:`Orange.data.Value`. Both classifiers will be stored
at the appropriate places in :obj:`imputer.models`.

.. literalinclude:: code/imputation-complex.py
    :lines: 110-112

"LENGTH" will be computed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of
course). Note that we initialized the domain by simply giving a list with
the names of the attributes, with the domain as an additional argument
in which Orange will look for the named attributes.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

We printed the tree just to see what it looks like.

::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for the "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done by
:class:`Orange.classification.lookup.ClassifierByLookupTable` - this would be
faster than what we plan here. See the corresponding documentation on lookup
classifiers. Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

:obj:`compute_span` could also be written as a class, if you'd prefer
it. It's important that it behaves like a classifier, that is, it gets an
example and returns a value. The second argument tells, as usual, what the
caller expects the classifier to return - a value, a distribution or both.
Since the caller, :obj:`Imputer_model`, always wants values, we shall ignore
the argument (at the risk of having problems in the future when imputers might
handle distributions as well).
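
The idea of mixing constant models, a Python function and missing models can
be sketched without Orange. Everything here is a simplification made for
illustration: examples are dictionaries, ``models`` is keyed by attribute name
rather than being a positional list, ``GET_VALUE`` is a hypothetical stand-in
for the "what to return" flag, and the attributes tested in ``compute_span``
("TYPE" and "PURPOSE") are assumptions about how "wooden bridges and walkways"
are encoded.

```python
GET_VALUE = 0  # hypothetical stand-in for the "what to return" flag


def compute_span(example, return_what=GET_VALUE):
    """Behave like a classifier: take an example, return a value."""
    if example["TYPE"] == "WOOD" or example["PURPOSE"] == "WALK":
        return "SHORT"
    return "MEDIUM"


def constant(value):
    """Build a classifier-like callable that always predicts `value`."""
    return lambda example, return_what=GET_VALUE: value


models = {
    "LANES": constant(2.0),
    "T-OR-D": constant("THROUGH"),
    "SPAN": compute_span,
    "LENGTH": None,  # no model: this attribute is never imputed
}


def impute_with_models(example, models):
    """Fill unknown (None) values using the model of each attribute."""
    imputed = dict(example)
    for name, model in models.items():
        if model is not None and imputed.get(name) is None:
            # Each model sees the original example, not earlier imputations.
            imputed[name] = model(example)
    return imputed


bridge = {"PURPOSE": "WALK", "TYPE": "SIMPLE-T", "LANES": None,
          "T-OR-D": None, "SPAN": None, "LENGTH": None}
result = impute_with_models(bridge, models)
```

Known values are never overwritten, and attributes whose model is ``None``
keep their unknown values.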

Missing values as special values
================================

Missing values sometimes have a special meaning. The fact that something was
not measured can sometimes tell a lot. Be cautious, however, when using such
values in decision models; if the decision not to measure something (for
instance, performing a laboratory test on a patient) is based on the expert's
knowledge of the class value, such unknown values clearly should not be used
in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each
    discrete attribute is replaced with a new attribute that has one value more:
    "NA". The new attribute will compute its values on the fly from the old one,
    copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it will
    construct a two-valued discrete attribute with values "def" and "undef",
    telling whether the continuous attribute was defined or not. The attribute's
    name will equal the original's with "_def" appended. The original continuous
    attribute will remain in the domain and its unknowns will be replaced by
    averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs :class:`Imputer_asValue` (I bet you
    wouldn't guess). It converts the example into the new domain, which imputes
    the values for discrete attributes. If continuous attributes are present, it
    will also replace their unknown values with the averages.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new attributes constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous attributes. Present only if there are any.

The following code shows what this imputer actually does to the domain.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

The two examples seemingly have the same attributes (with
:samp:`imputed` having a few additional ones). If you check this with
:samp:`original.domain[0] == imputed.domain[0]`, you will see that the
first impression is wrong: the comparison gives False. The attributes only
have the same names, but they are different attributes. (If you are reading
this page, which is already a bit advanced, you know that Orange does not
really care about attribute names.)

Therefore, if we wrote :samp:`imputed[i]`, the program would fail
since :samp:`imputed` has no attribute :samp:`i`. But it has an
attribute with the same name (which usually even has the same value). We
therefore use :samp:`i.name` to index the attributes of
:samp:`imputed`. (Using names for indexing is not fast, though; if you do
it a lot, compute the integer index with
:samp:`imputed.domain.index(i.name)`.)

For continuous attributes, there is an additional attribute with "_def"
appended; we get it by :samp:`i.name+"_def"`.

The first continuous attribute, "ERECTED", is defined. Its value remains 1874
and the additional attribute "ERECTED_def" has the value "def". Not so for
"LENGTH". Its undefined value is replaced by the average (1567) and the new
attribute has the value "undef". The undefined discrete attribute "CLEAR-G" (and
all other undefined discrete attributes) is assigned the value "NA".
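
The same transformation can be sketched on plain dictionaries. This is an
illustration only and does not construct Orange domains: ``fit_as_value`` and
``as_value`` are hypothetical names, and the sample values below are made up
for the sketch rather than taken from ``bridges.tab``.

```python
def fit_as_value(rows, continuous):
    """Learn the average of every continuous column from the known values."""
    averages = {}
    for name in continuous:
        known = [row[name] for row in rows if row[name] is not None]
        averages[name] = sum(known) / len(known)
    return averages


def as_value(row, continuous, averages):
    """Convert a row: add "<name>_def" indicators and the "NA" value."""
    converted = {}
    for name, value in row.items():
        if name in continuous:
            converted[name + "_def"] = "undef" if value is None else "def"
            converted[name] = averages[name] if value is None else value
        else:
            converted[name] = "NA" if value is None else value
    return converted


continuous = {"ERECTED", "LENGTH"}
rows = [
    {"ERECTED": 1874, "LENGTH": None, "SPAN": None},
    {"ERECTED": 1900, "LENGTH": 1000, "SPAN": "SHORT"},
    {"ERECTED": 1950, "LENGTH": 2000, "SPAN": "LONG"},
]
averages = fit_as_value(rows, continuous)
converted = as_value(rows[0], continuous, averages)
```

The indicator column keeps the information that a value was missing even
though the continuous column itself no longer contains unknowns.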

Using imputers
==============

To be used properly in the learning process, the imputation classes must be
trained on training examples only. Imputing the missing values and subsequently
using the data set in cross-validation will give overly optimistic results.
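
A minimal sketch of the correct order of operations, using average imputation
on a single column of plain Python values. The helper names and the simple
interleaved fold assignment are assumptions made for the illustration.

```python
def mean_of_known(values):
    known = [v for v in values if v is not None]
    return sum(known) / len(known)


def impute(values, default):
    return [default if v is None else v for v in values]


def cross_validation_splits(values, k):
    """Yield (train, test) pairs; item j goes to the test part of fold j % k."""
    for i in range(k):
        train = [v for j, v in enumerate(values) if j % k != i]
        test = [v for j, v in enumerate(values) if j % k == i]
        yield train, test


values = [1.0, None, 3.0, 8.0, None, 12.0]

fold_defaults = []
for train, test in cross_validation_splits(values, 3):
    default = mean_of_known(train)        # learned from the training part only
    train_imputed = impute(train, default)
    test_imputed = impute(test, default)  # the same default is reused for testing
    fold_defaults.append(default)

# Imputing once, before splitting, would use a value that has seen test data.
leaky_default = mean_of_known(values)
```

Each fold gets its own default, computed without looking at that fold's test
part, whereas ``leaky_default`` is influenced by every example in the data set.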

Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values will generally provide a slot
for the imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you
can assign an imputer constructor - one of the above constructors or a specific
constructor you wrote yourself. When given learning examples,
:obj:`Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an
imputer (again one of the above or a specific imputer you programmed). It will
immediately use the imputer to impute the missing values in the learning data
set, so it can be used by the actual learning algorithm. In addition, when the
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed,
the imputer will be stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification, the imputer will be used for imputation of missing values in
(testing) examples.

Although details may vary from algorithm to algorithm, this is how
imputation is generally used in Orange's learners. If you write your own
learners, it is recommended that you use imputation according to the described
procedure.

Wrapper for learning algorithms
===============================

Imputation is used by learning algorithms and other methods that are not
capable of handling unknown values. The wrapper imputes the missing values,
calls the learner and, if imputation is also needed by the classifier,
wraps the classifier into a wrapper that imputes missing values in the
examples to classify.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot handle
missing values should, in principle, provide the slots for an imputer
constructor. For instance, :obj:`Orange.classification.logreg.LogRegLearner`
has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even
if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they can
    appropriately handle the missing values. The Bayesian classifier, for
    instance, simply skips the corresponding attributes in the formula, while
    classification/regression trees have components for handling the missing
    values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter of a separate
    page, but we shall show its code here as another demonstration of how to
    use the imputers - logistic regression is implemented essentially the same
    as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier. There are a few attributes that need to be set, though.

    .. attribute:: base_learner

        A wrapped learner.

    .. attribute:: imputer_constructor

        An instance of a class derived from :obj:`ImputerConstructor` (or a
        class with the same call operator).

    .. attribute:: dont_impute_classifier

        If given and set (this attribute is optional), the classifier will not
        be wrapped into an imputer. Do this if the classifier doesn't mind if
        the examples it is given have missing values.

    The learner is best illustrated by its code - here's its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this. :obj:`ImputeLearner` will first construct
    the imputer (that is, call :obj:`self.imputer_constructor` to get a trained
    imputer). Then it will use the imputer to impute the data, and call the
    given :obj:`base_learner` to construct a classifier. For instance,
    :obj:`base_learner` could be a learner for logistic regression and the
    result would be a logistic regression model. If the classifier can handle
    unknown values (that is, if :obj:`dont_impute_classifier` is set), we
    return it as it is; otherwise we wrap it into :obj:`ImputeClassifier`,
    which is given the base classifier and the imputer which it can use to
    impute the missing values in (testing) examples.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given data.

    .. attribute:: base_classifier

        A wrapped classifier.

    .. attribute:: imputer

        An imputer for imputation of unknown values.

    .. method:: __call__

        This class is even more trivial than the learner. Its constructor
        accepts two arguments, the classifier and the imputer, which are
        stored into the corresponding attributes. The call operator which does
        the classification then looks like this::

            def __call__(self, ex, what=orange.GetValue):
                return self.base_classifier(self.imputer(ex), what)

        It imputes the missing values by calling the :obj:`imputer` and passes
        the imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you do
   cross validation, the imputer will be trained on the right data. In the
   classification phase we again use the imputer which was trained on the
   training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into
the instance's dictionary. You are expected to call it like
``ImputeLearner(base_learner=<someLearner>, imputer_constructor=<someImputerConstructor>)``.
When the learner is called with examples, it
trains the imputer, imputes the data, induces a :obj:`base_classifier` with the
:obj:`base_learner` and constructs an :obj:`ImputeClassifier` that stores the
:obj:`base_classifier` and the :obj:`imputer`. For classification, the missing
values are imputed and the classifier's prediction is returned.

Note that this code is slightly simplified, although the omitted details handle
non-essential technical issues that are unrelated to imputation::

    import orange  # legacy core module that provides Learner and Classifier

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            # Train the imputer on the learning data, then learn on imputed data.
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            # Impute the example first, then delegate to the wrapped classifier.
            return self.base_classifier(self.imputer(ex), what)
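
The same wrapper pattern can be tried out without Orange. The following is a
self-contained sketch with toy components: ``average_imputer_constructor`` and
``threshold_learner`` are hypothetical stand-ins, data are ``(features,
label)`` pairs, and the classes only mirror the structure of the code above
rather than reproduce it.

```python
def average_imputer_constructor(data, weight_id=0):
    """Learn column means from the training data and return an imputer."""
    n_columns = len(data[0][0])
    means = []
    for column in range(n_columns):
        known = [x[column] for x, _ in data if x[column] is not None]
        means.append(sum(known) / len(known))

    def imputer(features):
        return [m if v is None else v for v, m in zip(features, means)]
    return imputer


def threshold_learner(data, weight_id=0):
    """Toy learner that cannot handle None: splits on the mean of feature 0."""
    cut = sum(x[0] for x, _ in data) / len(data)
    return lambda features: "high" if features[0] >= cut else "low"


class ImputeClassifier:
    def __init__(self, base_classifier, imputer):
        self.base_classifier = base_classifier
        self.imputer = imputer

    def __call__(self, features):
        return self.base_classifier(self.imputer(features))


class ImputeLearner:
    def __init__(self, base_learner, imputer_constructor,
                 dont_impute_classifier=False):
        self.base_learner = base_learner
        self.imputer_constructor = imputer_constructor
        self.dont_impute_classifier = dont_impute_classifier

    def __call__(self, data, weight_id=0):
        trained_imputer = self.imputer_constructor(data, weight_id)
        imputed_data = [(trained_imputer(x), y) for x, y in data]
        base_classifier = self.base_learner(imputed_data, weight_id)
        if self.dont_impute_classifier:
            return base_classifier
        return ImputeClassifier(base_classifier, trained_imputer)


train = [([1.0], "low"), ([3.0], "low"), ([None], "high"), ([11.0], "high")]
learner = ImputeLearner(threshold_learner, average_imputer_constructor)
classifier = learner(train)
prediction_missing = classifier([None])  # imputed before classification
prediction_known = classifier([2.0])
```

Calling ``threshold_learner(train)`` directly would fail on the unknown value,
which is the situation the wrapper is meant to handle.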

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally, if needed, it can sometimes happen that an expert will be able to
tell you exactly what to put in the data instead of the missing values. In this
example we shall suppose that we want to impute the minimal value of each
feature. We will try to determine whether the naive Bayesian classifier with
its implicit internal imputation works better than one that uses imputation by
minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

The script should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   We constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is
   used twice in each fold. The first time it is given the examples as they
   are (and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`).
   The second time it is called by :obj:`imba` and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only
   one learner, which produces two different classifiers in each round of
   testing.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all Orange
classes do so; refer to the documentation on `subtyping the Orange classes
in Python <callbacks.htm>`_ for a list). If you want to write your own
imputation constructor or an imputer, you simply need to program a Python
function that will behave like the built-in Orange classes. For an imputer
even less is needed: you only need to write a function that gets an example as
an argument; imputation for example tables will then use that function.

You will most often write the imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as we've
demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only
need to pack everything we've written there into an imputer constructor that
will accept a data set and the id of the weight meta-attribute (ignore it if
you will, but you must accept two arguments), and return the imputer (probably
:obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing an
imputer constructor as opposed to what we did above is that you can use such a
constructor as a component for Orange learners (like logistic regression) or
for wrappers from module orngImpute, and that way properly use it in
classifier testing procedures.
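
The shape of such a constructor can be sketched as a plain function that
accepts the data and a weight id and returns an imputer function. This is an
illustration under simplifying assumptions and not the Orange callback
interface itself: examples are dictionaries, ``median_imputer_constructor`` is
a hypothetical name, and the weight id is accepted but ignored.

```python
from statistics import median


def median_imputer_constructor(data, weight_id=0):
    """Accept a data set and a weight id (ignored here); return an imputer."""
    medians = {}
    for name in data[0]:
        known = [row[name] for row in data if row[name] is not None]
        medians[name] = median(known)

    def imputer(example):
        # Gets a single example and returns a new one with unknowns filled in.
        return {name: medians[name] if value is None else value
                for name, value in example.items()}

    return imputer


data = [
    {"LENGTH": 1000, "LANES": 2},
    {"LENGTH": None, "LANES": 4},
    {"LENGTH": 3000, "LANES": None},
    {"LENGTH": 1200, "LANES": 2},
]
imputer = median_imputer_constructor(data, 0)
imputed = imputer({"LENGTH": None, "LANES": None})
```

Because the constructor is called anew on each training set, the medians are
always learned from the data the learner actually sees.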