source: orange/docs/reference/rst/Orange.feature.imputation.rst @ 9806:01ddd2a2ff48

.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

***************************
Imputation (``imputation``)
***************************

Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers
=================

:obj:`ImputerConstructor` is the abstract root of a hierarchy of classes
that take training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new instance with the missing
values imputed (the original instance is left intact). When called with an
:obj:`Orange.data.Table`, it returns a new table of imputed instances.

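
A minimal sketch of this two-step protocol, assuming the ``bridges`` data set
used elsewhere on this page (the instance index is arbitrary)::

    import Orange

    data = Orange.data.Table("bridges")

    # step 1: give the constructor training data to obtain an Imputer
    imputer = Orange.feature.imputation.ImputerConstructor_minimal(data)

    # step 2: call the imputer with an instance or a whole table;
    # both calls return imputed copies and leave the originals intact
    imputed_instance = imputer(data[10])
    imputed_table = imputer(data)
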
.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Indicates whether to impute the class value. Defaults to True.

    .. attribute:: deterministic

    Indicates whether to seed the random generator with the example's CRC,
    so that repeated imputations of the same example give the same result.
    Defaults to False.

Simple imputation
=================

Simple imputers always impute the same value for a particular attribute,
disregarding the values of other attributes. They all use the same class
:obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

    An instance of :obj:`Orange.data.Instance` with the default values to be
    imputed instead of the missing ones. Examples to be imputed must be from
    the same domain as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`Orange.feature.imputation.ImputerConstructor_minimal`,
:obj:`Orange.feature.imputation.ImputerConstructor_maximal` and
:obj:`Orange.feature.imputation.ImputerConstructor_average`.

For continuous features, they impute the smallest, largest or average
value encountered in the training examples.

For discrete features, they impute the lowest value (the one with index 0,
e.g. ``attr.values[0]``), the highest value (``attr.values[-1]``) or the
most common value encountered in the data.

The first two imputers are mostly useful when the discrete values are ordered
according to their impact on the class (for instance, possible values for
symptoms of some disease can be ordered according to their seriousness). The
minimal and maximal imputers then represent optimistic and pessimistic
imputations.

The following code loads the bridges data and imputes the missing values,
first in a single example and then in the whole table.

:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23

This example shows what the imputer does, not how it should be used. Do not
impute all the data and then use it for cross-validation; see the
instructions for actual `use of imputers <#using-imputers>`_ below.

.. note:: :obj:`ImputerConstructor` is another class with a dual-purpose
  constructor: if it is given data, it returns an :obj:`Imputer` - the
  call above is equivalent to
  :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`.

You can also construct the :obj:`Orange.feature.imputation.Imputer_defaults`
yourself and specify your own defaults. Or leave some values unspecified, in
which case the imputer won't impute them, as in the following example. Here,
the only attribute whose values will get imputed is "LENGTH"; the imputed value
will be 1234.

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

The constructor of :obj:`Orange.feature.imputation.Imputer_defaults` accepts
either an argument of type :obj:`Orange.data.Domain` (in which case it
constructs an empty instance for :obj:`defaults`) or an example. Be careful
with the latter: :obj:`Orange.feature.imputation.Imputer_defaults` keeps a
reference to the instance, not a copy. To avoid problems you can make the
copy yourself: instead of ``Imputer_defaults(data[0])`` you may want to write
``Imputer_defaults(Orange.data.Instance(data[0]))``.

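
A short sketch of the safer variant described above (the instance indices are
arbitrary; ``bridges`` is the data set used throughout this page)::

    import Orange

    data = Orange.data.Table("bridges")

    # build the imputer from a copy of the first instance, so that later
    # changes to data[0] do not silently change the imputation defaults
    imputer = Orange.feature.imputation.Imputer_defaults(
        Orange.data.Instance(data[0]))
    imputed = imputer(data[10])
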
Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (the default is False), the random generator is seeded with each
    example's hash value, so the same example is always imputed the same
    values.

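
A minimal usage sketch, assuming the constructor behaves like the other
constructors on this page::

    import Orange

    data = Orange.data.Table("bridges")

    imputer = Orange.feature.imputation.ImputerConstructor_Random(data)
    imputer.deterministic = True   # same example -> same imputed values
    imputed = imputer(data)
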
Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the attribute's value from the
    values of other attributes. :obj:`ImputerConstructor_model` is given a
    learning algorithm (two, actually - one for discrete and one for
    continuous attributes) and constructs a classifier for each attribute.
    The constructed imputer :obj:`Imputer_model` stores a list of classifiers
    which are used when needed.

    .. attribute:: learner_discrete, learner_continuous

    Learners for discrete and for continuous attributes. If either of them is
    missing, attributes of the corresponding type won't get imputed.

    .. attribute:: use_class

    Tells whether the imputer is allowed to use the class value. As this is
    most often undesired, this option is set to False by default. It can,
    however, be useful in a more complex design in which we would use one
    imputer for learning examples (this one would use the class value) and
    another for testing examples (which would not use the class value, as it
    is unavailable at that moment).

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute of the examples
    whose values are to be imputed. The :obj:`classVar` of each model should
    equal the corresponding attribute of the examples. If a classifier is
    missing (that is, if the corresponding element of the list is :obj:`None`),
    the corresponding attribute's values will not be imputed.

.. rubric:: Examples

The following imputer predicts the missing attribute values using
classification and regression trees with a minimum of 20 examples in a leaf.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

We could even use the same learner for discrete and continuous attributes,
as :class:`Orange.classification.tree.TreeLearner` checks the class type
and constructs regression or classification trees accordingly. The
common parameters, such as the minimal number of
examples in leaves, are used in both cases.

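
A sketch of that pattern, using only the constructor and the attribute names
described above (the leaf-size parameter of the tree learner is left at its
default, since its exact keyword is not shown on this page)::

    import Orange

    data = Orange.data.Table("bridges")

    imputer_constructor = Orange.feature.imputation.ImputerConstructor_model()
    tree_learner = Orange.classification.tree.TreeLearner()

    # the same learner serves both attribute types; it induces a regression
    # or a classification tree depending on the attribute being imputed
    imputer_constructor.learner_discrete = tree_learner
    imputer_constructor.learner_continuous = tree_learner

    imputer = imputer_constructor(data)
    imputed_data = imputer(data)
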
You can also use different learning algorithms for discrete and
continuous attributes. A common setup would be to use
:class:`Orange.classification.bayes.BayesLearner` for discrete and
:class:`Orange.regression.mean.MeanLearner` (which
just remembers the average) for continuous attributes. Part of
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

You can also construct an :class:`Imputer_model` yourself. You will do
this if different attributes need different treatment. Brace for an
example that is a bit more complex: first we construct an
:class:`Imputer_model` and initialize an empty list of models.
The following code snippets are from
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 108-109

Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and
"THROUGH". Since "LANES" is continuous, it suffices to construct a
:obj:`DefaultClassifier` with the default value 2.0 (don't forget the
decimal part, or else Orange will think you are referring to the index of a
discrete value - how could it tell otherwise?). For the discrete attribute
"T-OR-D" we could construct a :class:`Orange.classification.ConstantClassifier`
and give the index of the value "THROUGH" as an argument. But we will do it
more elegantly, by constructing an :class:`Orange.data.Value`. Both classifiers
are stored at the appropriate places in :obj:`imputer.models`.

.. literalinclude:: code/imputation-complex.py
    :lines: 110-112

"LENGTH" will be computed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of
course). Note that we initialized the domain by simply giving a list with
the names of the attributes, with the domain as an additional argument
in which Orange will look for the named attributes.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

We printed the tree just to see what it looks like::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done with
:class:`Orange.classification.lookup.ClassifierByLookupTable` - that would be
faster than what we do here; see the documentation on lookup classifiers.
Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

:obj:`compute_span` could also be written as a class, if you prefer.
It is important that it behaves like a classifier, that is, gets an example
and returns a value. The second argument tells, as usual, what the caller
expects the classifier to return - a value, a distribution or both. Since the
caller, :obj:`Imputer_model`, always wants values, we simply ignore the
argument (at the risk of having problems in the future, when imputers might
handle distributions as well).

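
A hedged sketch of the shape such a function takes; the rule below is
illustrative, not the exact rule from the downloadable script, and it assumes
"WALK" is one of the values of "PURPOSE"::

    import Orange

    data = Orange.data.Table("bridges")
    span = data.domain["SPAN"]

    def compute_span(example, return_what=None):
        # behaves like a classifier: takes an example (plus the usual second
        # argument saying whether to return a value, a distribution or both,
        # which we ignore) and returns a Value for SPAN
        if example["MATERIAL"] == "WOOD" or example["PURPOSE"] == "WALK":
            return Orange.data.Value(span, "SHORT")
        return Orange.data.Value(span, "MEDIUM")

    # the function can then be stored at SPAN's position in imputer.models
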
Missing values as special values
================================

Missing values sometimes have a special meaning. The fact that something was
not measured can sometimes tell a lot. Be cautious, however, when using such
values in decision models; if the decision not to measure something (for
instance, performing a laboratory test on a patient) is based on the expert's
knowledge of the class value, such unknown values clearly should not be used
in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each
    discrete attribute is replaced with a new attribute that has one value
    more: "NA". The new attribute computes its values on the fly from the old
    one, copying the normal values and replacing the unknowns with "NA".

    For continuous attributes, it
    constructs a two-valued discrete attribute with values "def" and "undef",
    telling whether the continuous attribute was defined or not. The
    attribute's name will equal the original's with "_def" appended. The
    original continuous attribute will remain in the domain and its unknowns
    will be replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs :class:`Imputer_asValue` (I bet you wouldn't guess). The
    imputer converts the example into the new domain, which imputes the
    values for discrete attributes. If continuous attributes are present, it
    will also replace their values by the averages.

.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new attributes constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous attributes. Present only if there are any.

The following code shows what this imputer actually does to the domain.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

Seemingly, the two examples have the same attributes (with
:samp:`imputed` having a few additional ones). If you check this with
:samp:`original.domain[0] == imputed.domain[0]`, you will see that the
first impression is false: the attributes only have the same names,
but they are different attributes. (If you are reading this page, which is
already somewhat advanced, you probably know that Orange does not really
care about attribute names.)

Therefore, if we wrote :samp:`imputed[i]` the program would fail,
since :samp:`imputed` has no attribute :samp:`i`. But it has an
attribute with the same name (which usually even has the same value). We
therefore use :samp:`i.name` to index the attributes of
:samp:`imputed`. (Using names for indexing is not fast, though; if you do
it a lot, compute the integer index with
:samp:`imputed.domain.index(i.name)`.)

For continuous attributes, there is an additional attribute with "_def"
appended; we get it by :samp:`i.name+"_def"`.

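
A self-contained sketch of that indexing (the instance index is arbitrary;
the script above walks through one particular example of the same data)::

    import Orange

    data = Orange.data.Table("bridges")
    imputer = Orange.feature.imputation.ImputerConstructor_asValue(data)
    imputed = imputer(data[10])

    # index by name: the attributes of the original and the imputed domain
    # are distinct objects that merely share their names
    length = imputed["LENGTH"]               # the (possibly imputed) value
    length_defined = imputed["LENGTH_def"]   # "def" or "undef"
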
The first continuous attribute, "ERECTED", is defined. Its value remains 1874
and the additional attribute "ERECTED_def" has the value "def". Not so for
"LENGTH": its undefined value is replaced by the average (1567) and the new
attribute has the value "undef". The undefined discrete attribute "CLEAR-G"
(and every other undefined discrete attribute) is assigned the value "NA".

Using imputers
==============

To use the imputation classes properly in a learning process, they must be
trained on training examples only. Imputing the missing values and
subsequently using the data set in cross-validation will give overly
optimistic results.

Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values generally provide a slot
for the imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you
can assign an imputer constructor - one of the above constructors or a specific
constructor you wrote yourself. When given learning examples,
:obj:`Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an
imputer (again, one of the above or a specific imputer you programmed) and
immediately use the imputer to impute the missing values in the learning data
set, so they can be used by the actual learning algorithm. When the
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed,
the imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification time, the imputer is used to impute the missing values in the
(testing) examples.

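
A hedged sketch of that pattern, using only the names mentioned above
(logistic regression with a minimal-value imputer as its component, on the
``voting`` data set used later on this page)::

    import Orange

    data = Orange.data.Table("voting")

    learner = Orange.classification.logreg.LogRegLearner()
    # the component is a *constructor*; the learner trains it on the learning
    # data and stores the resulting imputer in the classifier it returns
    learner.imputerConstructor = \
        Orange.feature.imputation.ImputerConstructor_minimal()

    classifier = learner(data)
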
Although details may vary from algorithm to algorithm, this is how
imputation is generally used in Orange's learners. If you write your own
learners, it is recommended that you use imputation according to the
described procedure.

Wrapper for learning algorithms
===============================

Imputation is used by learning algorithms and other methods that cannot
handle unknown values. The wrapper imputes the missing values, calls the
learner and, if the classifier also needs imputation, wraps the classifier
into another wrapper that imputes the missing values in examples to be
classified.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot handle
missing values should, in principle, provide a slot for an imputer constructor.
For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even
if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they can
    appropriately handle the missing values. The Bayesian classifier, for
    instance, simply skips the corresponding attributes in the formula, while
    classification/regression trees have components for handling the missing
    values in various ways.

    If for any reason you want these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter of a separate
    page, but we show its code here as another demonstration of how to
    use the imputers - logistic regression is implemented essentially the same
    way as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier. There are a few attributes that need to be set, though.

    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a class
    with the same call operator).

    .. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not be
    wrapped into an imputer. Do this if the classifier doesn't mind if the
    examples it is given have missing values.

    The learner is best illustrated by its code - here is its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)

    So "learning" goes like this. :obj:`ImputeLearner` first constructs
    the imputer, that is, calls :obj:`self.imputer_constructor` to get a
    (trained) imputer. Then it uses the imputer to impute the data and calls
    the given :obj:`base_learner` to construct a classifier. For instance,
    :obj:`base_learner` could be a learner for logistic regression and the
    result would be a logistic regression model. If the classifier can handle
    unknown values (that is, if :obj:`dont_impute_classifier` is set), we
    return it as it is; otherwise we wrap it into :obj:`ImputeClassifier`,
    which is given the base classifier and the imputer it can use to impute
    the missing values in (testing) examples.

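
A hedged usage sketch of the wrapper, using the attribute names listed above
(naive Bayes on the ``voting`` data set, with minimal-value imputation)::

    import Orange

    data = Orange.data.Table("voting")

    imputing_learner = Orange.feature.imputation.ImputeLearner(
        base_learner=Orange.classification.bayes.NaiveLearner(),
        imputer_constructor=Orange.feature.imputation.ImputerConstructor_minimal())

    # returns an ImputeClassifier that imputes before classifying
    classifier = imputing_learner(data)
    prediction = classifier(data[0])
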
.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given data.

    .. attribute:: base_classifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor accepts
    two arguments, the classifier and the imputer, which are stored in the
    corresponding attributes. The call operator which does the classification
    looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes the
    imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you do
   cross-validation, the imputer will be trained on the right data. In the
   classification phase we again use the imputer that was trained on the
   training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into
the instance's dictionary. You are expected to call it like
:obj:`ImputeLearner(base_learner=<someLearner>,
imputer_constructor=<someImputerConstructor>)`. When the learner is called with
examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` with the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

This code is slightly simplified; the omitted details handle non-essential
technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally, if needed, it can sometimes happen that an expert will be able to
tell you exactly what to put in the data instead of the missing values. In this
example we suppose that we want to impute the minimal value of each
feature. We will try to determine whether the naive Bayesian classifier with
its implicit internal imputation works better than one that uses imputation by
minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

This should output::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   We constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is
   used twice in each fold: once it is given the examples as they are and
   returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`;
   the second time it is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only
   one learner, but it produces two different classifiers in each round of
   testing.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all Orange
classes do so; refer to the documentation on `subtyping the Orange classes
in Python <callbacks.htm>`_ for a list). If you want to write your own
imputation constructor or imputer, you simply need to program a Python
function that behaves like the built-in Orange classes. For an imputer it is
even less: you only need to write a function that gets an example as its
argument; imputation of example tables will then use that function.

You will most often write the imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as we
demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only
need to pack everything we wrote there into an imputer constructor that
accepts a data set and the id of the weight meta-attribute (ignore it if
you will, but you must accept two arguments) and returns the imputer
(probably an :obj:`Orange.feature.imputation.Imputer_model`). The benefit of
implementing an imputer constructor, as opposed to what we did above, is that
you can use such a constructor as a component for Orange learners (like
logistic regression) or for wrappers from the ``orngImpute`` module (such as
:obj:`ImputeLearner`), and in that way properly use it in classifier testing
procedures.
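
A hedged sketch of such a constructor, written as a plain Python function; the
body just delegates to the minimal-value constructor, while a real one would
build per-attribute models as described above::

    import Orange

    def my_imputer_constructor(data, weight_id=0):
        # an imputer constructor must accept the data and the id of the
        # weight meta-attribute (even if it ignores the weight) and
        # return an imputer
        return Orange.feature.imputation.ImputerConstructor_minimal(data)

    # such a function can then serve as the imputer-constructor component
    # of a learner, as described in "Learners with imputer as a component"
    learner = Orange.classification.logreg.LogRegLearner()
    learner.imputerConstructor = my_imputer_constructor
    classifier = learner(Orange.data.Table("voting"))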