source: orange/orange/Orange/feature/imputation.py @ 7611:94c86145af60

Revision 7611:94c86145af60, 29.9 KB checked in by tomazc <tomaz.curk@…>, 3 years ago

Documentation and code refactoring at Bohinj retreat.

"""

.. index:: imputation

.. index::
   single: feature; value imputation


Imputation is a procedure that replaces missing feature values with
appropriate values. It is needed by methods (learning algorithms and others)
that cannot handle unknown values, for instance logistic regression.

Missing values sometimes have a special meaning, so they need to be replaced
by a designated value. Sometimes we know what to replace the missing value
with; for instance, in a medical problem, some laboratory tests might not be
done when it is known what their results would be. In that case, we impute a
certain fixed value instead of the missing one. In the most complex case, we
assign values that are computed based on some model; we can, for instance,
impute the average or majority value, or even a value computed from the values
of other, known features, using a classifier.

In a learning/classification process, imputation is needed on two occasions.
Before learning, the imputer needs to process the training examples.
Afterwards, the imputer is called for each example to be classified.

In general, the imputer itself needs to be trained. This is, of course, not
needed when the imputer imputes a certain fixed value. However, when it
imputes the average or majority value, it needs to compute the statistics on
the training examples and use them afterwards for imputation of training and
testing examples.

While reading this document, bear in mind that imputation is a part of the
learning process. If we fit the imputation model, for instance, by learning
how to predict the feature's value from other features, or even if we
simply compute the average or the minimal value for the feature and use it
in imputation, this should only be done on learning data. If cross validation
is used for sampling, imputation should be done on training folds only. Orange
provides simple means for doing that.

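The discipline can be illustrated outside of Orange with a small plain-Python
sketch (the data and helper names below are made up for the illustration, not
part of the Orange API):

```python
# Illustrative sketch (plain Python, not the Orange API): the imputation
# statistic is computed on the training fold only, then reused everywhere.
def train_mean(column):
    """Mean of the known (non-None) values of a training column."""
    known = [v for v in column if v is not None]
    return sum(known) / float(len(known))

def impute(column, value):
    """Replace the missing entries with the given value."""
    return [value if v is None else v for v in column]

train = [1.0, None, 3.0]
test = [None, 5.0]

mean = train_mean(train)              # 2.0, from the training fold only
train_imputed = impute(train, mean)   # [1.0, 2.0, 3.0]
test_imputed = impute(test, mean)     # [2.0, 5.0] - same statistic reused
```

The crucial point is that ``train_mean`` never sees the testing fold; the
testing examples are imputed with a value derived from the training data
alone.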
This page will first explain how to construct various imputers. Then follow
the examples of `proper use of imputers <#using-imputers>`_. Finally, quite
often you will want to use imputation with special requests, such as certain
features' missing values getting replaced by constants and others by values
computed using models induced from specified other features. For instance,
in one of the studies we worked on, the patient's pulse rate needed to be
estimated using regression trees that included the scope of the patient's
injuries, sex and age; some attributes' values were replaced by the most
pessimistic ones and others were computed with regression trees based on
values of all features. If you are using learners that need the imputer as a
component, you will need to `write your own imputer constructor
<#write-your-own-imputer-constructor>`_. This is trivial and is explained at
the end of this page.
54
Wrapper for learning algorithms
===============================

This wrapper can be used with learning algorithms that cannot handle missing
values: it will impute the missing values in the examples using the imputer,
call the learning algorithm and, if the imputation is also needed by the
classifier, wrap the resulting classifier into another wrapper that will
impute the missing values in examples to be classified.

Even so, the module is somewhat redundant, as all learners that cannot handle
missing values should, in principle, provide the slots for imputer constructor.
For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even
if you don't set it, it will do some imputation by default.
69
.. class:: ImputeLearner

    Wraps a learner and performs imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they can
    appropriately handle the missing values. Bayesian classifier, for instance,
    simply skips the corresponding attributes in the formula, while
    classification/regression trees have components for handling the missing
    values in various ways.

    If for any reason you want to use these algorithms to run on imputed data,
    you can use this wrapper. The class description is a matter of a separate
    page, but we shall show its code here as another demonstration of how to
    use the imputers - logistic regression is implemented essentially the same
    as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    of some classifier. There are a few attributes that need to be set, though.

    .. attribute:: baseLearner

    A wrapped learner.

    .. attribute:: imputerConstructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a class
    with the same call operator).

    .. attribute:: dontImputeClassifier

    If given and set (this attribute is optional), the classifier will not be
    wrapped into an imputer. Do this if the classifier doesn't mind if the
    examples it is given have missing values.
104
    The learner is best illustrated by its code - here's its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputerConstructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            baseClassifier = self.baseLearner(imputed_data, weight)
            if self.dontImputeClassifier:
                return baseClassifier
            else:
                return ImputeClassifier(baseClassifier, trained_imputer)
116
    So "learning" goes like this. :obj:`ImputeLearner` will first construct
    the imputer (that is, call :obj:`self.imputerConstructor`) to get a
    trained imputer. Then it will use the imputer to impute the data, and call
    the given :obj:`baseLearner` to construct a classifier. For instance,
    :obj:`baseLearner` could be a learner for logistic regression and the
    result would be a logistic regression model. If the classifier can handle
    unknown values (that is, if :obj:`dontImputeClassifier` is set), we return
    it as it is; otherwise we wrap it into :obj:`ImputeClassifier`, which is
    given the base classifier and the imputer which it can use to impute the
    missing values in (testing) examples.
127
.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given data.

    .. attribute:: baseClassifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor accepts
    two arguments, the classifier and the imputer, which are stored into the
    corresponding attributes. The call operator which does the classification
    then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.baseClassifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes the
    imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you do
   cross validation, the imputer will be trained on the right data. In the
   classification phase we again use the imputer which was trained on the
   training data only.
158
.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into
the instance's dictionary. You are expected to call it like
:obj:`ImputeLearner(baseLearner=<someLearner>,
imputerConstructor=<someImputerConstructor>)`. When the learner is called with
examples, it trains the imputer, imputes the data, induces a
:obj:`baseClassifier` by the :obj:`baseLearner` and constructs
:obj:`ImputeClassifier` that stores the :obj:`baseClassifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

Note that this code is slightly simplified; the omitted details handle
non-essential technical issues that are unrelated to imputation::
172
    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputerConstructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            baseClassifier = self.baseLearner(imputed_data, weight)
            return ImputeClassifier(baseClassifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, baseClassifier, imputer):
            self.baseClassifier = baseClassifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.baseClassifier(self.imputer(ex), what)
195
.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally, if needed, it can sometimes happen that an expert will be able to
tell you exactly what to put in the data instead of the missing values. In
this example we shall suppose that we want to impute the minimal value of each
feature. We will try to determine whether the naive Bayesian classifier with
its implicit internal imputation works better than one that uses imputation by
minimal values.

`imputation-minimal-imputer.py`_ (uses `voting.tab`_):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

Should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   Note that we constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is
   used twice in each fold. Once it is given the examples as they are, and it
   returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`.
   The second time it is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have only
   one learner, but which produces two different classifiers in each round of
   testing.
226
Abstract imputers
=================

As is common in Orange, imputation is done by pairs of classes: one that does
the work and another that constructs it. :obj:`ImputerConstructor` is an
abstract root of the hierarchy of classes that get the training data (with an
optional id for weight) and construct an instance of a class derived from
:obj:`Imputer`. An :obj:`Imputer` can be called with an
:obj:`Orange.data.Instance` and it will return a new example with the missing
values imputed (it will leave the original example intact!). If an imputer is
called with an :obj:`Orange.data.Table`, it will return a new example table
with imputed examples.

.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Tells whether to impute the class value (default) or not.
245
Simple imputation
=================

The simplest imputers always impute the same value for a particular attribute,
disregarding the values of other attributes. They all use the same imputer
class, :obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

    An example with the default values to be imputed instead of the missing
    ones. Examples to be imputed must be from the same domain as
    :obj:`defaults`.

    Instances of this class can be constructed by
    :obj:`Orange.feature.imputation.ImputerConstructor_minimal`,
    :obj:`Orange.feature.imputation.ImputerConstructor_maximal`,
    :obj:`Orange.feature.imputation.ImputerConstructor_average`.

    For continuous features, they will impute the smallest, largest or the
    average values encountered in the training examples. For discrete
    features, they will impute the lowest (the one with index 0, e.g.
    attr.values[0]), the highest (attr.values[-1]), or the most common value
    encountered in the data, respectively. The first two imputers will mostly
    be used when the discrete values are ordered according to their impact on
    the class (for instance, possible values for symptoms of some disease can
    be ordered according to their seriousness). The minimal and maximal
    imputers will then represent optimistic and pessimistic imputations.

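The values these constructors compute can be sketched in plain Python (an
illustration of the semantics only, not the Orange API; ``ordered_values``
stands in for the attribute's ``values`` list):

```python
# Illustrative sketch (plain Python, not the Orange API) of the values the
# three constructors put into Imputer_defaults.
def continuous_defaults(values):
    known = [v for v in values if v is not None]
    return {"minimal": min(known),
            "maximal": max(known),
            "average": sum(known) / float(len(known))}

def discrete_defaults(values, ordered_values):
    # ordered_values plays the role of attr.values
    known = [v for v in values if v is not None]
    return {"minimal": ordered_values[0],                  # attr.values[0]
            "maximal": ordered_values[-1],                 # attr.values[-1]
            "average": max(set(known), key=known.count)}   # most common value
```

Note that for discrete features "minimal" and "maximal" refer to the order of
the declared values, not to their frequencies in the data.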
    The following code will load the bridges data, and first impute the values
    in a single example and then in the whole table.

`imputation-complex.py`_ (uses `bridges.tab`_):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23

This example shows what the imputer does, not how it is to be used. Don't
impute all the data and then use it for cross-validation. As warned at the top
of this page, see the instructions for actual `use of
imputers <#using-imputers>`_.

.. note:: :obj:`ImputerConstructor` is another class with a schizophrenic
  constructor: if you give the constructor the data, it will return an
  :obj:`Imputer` - the above call is equivalent to calling
  :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`.
292
You can also construct the :obj:`Orange.feature.imputation.Imputer_defaults`
yourself and specify your own defaults. Or leave some values unspecified, in
which case the imputer won't impute them, as in the following example. Here,
the only attribute whose values will get imputed is "LENGTH"; the imputed
value will be 1234.

`imputation-complex.py`_ (uses `bridges.tab`_):

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

:obj:`Orange.feature.imputation.Imputer_defaults`'s constructor will accept an
argument of type :obj:`Orange.data.Domain` (in which case it will construct an
empty instance for :obj:`defaults`) or an example. (Be careful with this:
:obj:`Orange.feature.imputation.Imputer_defaults` will keep a reference to the
instance and not a copy. But you can make a copy yourself to avoid problems:
instead of `Imputer_defaults(data[0])` you may want to write
`Imputer_defaults(Orange.data.Instance(data[0]))`.)
311
Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: imputeClass

    Tells whether to impute the class values or not. Defaults to :obj:`True`.

    .. attribute:: deterministic

    If true (default is :obj:`False`), the random generator is initialized for
    each example using the example's hash value as a seed. This results in the
    same examples always being imputed the same values.

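The effect of :obj:`deterministic` can be sketched in plain Python (an
illustration only, not the Orange implementation; the example is represented
as a tuple so it is hashable):

```python
import random

# Illustrative sketch (plain Python, not the Orange API): with
# deterministic=True the generator is seeded with the example's hash,
# so the same example is always imputed the same random values.
def impute_random(example, choices, deterministic=False):
    rng = random.Random(hash(example)) if deterministic else random.Random()
    return [rng.choice(choices) if v is None else v for v in example]

ex = ("a", None, "c")
first = impute_random(ex, ["x", "y", "z"], deterministic=True)
second = impute_random(ex, ["x", "y", "z"], deterministic=True)
# first == second; the known values "a" and "c" are left untouched
```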
Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the attribute's value from values of
    other attributes. :obj:`ImputerConstructor_model` is given a learning
    algorithm (two, actually - one for discrete and one for continuous
    attributes) and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers
    which are used when needed.

    .. attribute:: learnerDiscrete, learnerContinuous

    Learners for discrete and for continuous attributes. If either of them is
    missing, the attributes of the corresponding type won't get imputed.

    .. attribute:: useClass

    Tells whether the imputer is allowed to use the class value. As this is
    most often undesired, this option is by default set to :obj:`False`. It
    can however be useful for a more complex design in which we would use one
    imputer for learning examples (this one would use the class value) and
    another for testing examples (which would not use the class value, as it
    is unavailable at that moment).

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute of the examples
    whose values are to be imputed. The :obj:`classVar`'s of the models should
    equal the examples' attributes. If any classifier is missing (that is, if
    the corresponding element of the list is :obj:`None`), the corresponding
    attribute's values will not be imputed.
365
.. rubric:: Examples

The following imputer predicts the missing attribute values using
classification and regression trees with the minimum of 20 examples in a
leaf.

Part of imputation.py (uses `bridges.tab`_)::

    import orngTree
    imputer = orange.ImputerConstructor_model()
    imputer.learnerContinuous = imputer.learnerDiscrete = orngTree.TreeLearner(minSubset=20)
    imputer = imputer(data)

We could even use the same learner for discrete and continuous attributes!
(The way this functions is rather tricky. If you desire to know:
:obj:`orngTree.TreeLearner` is a learning algorithm written in Python -
Orange doesn't mind, it will wrap it into a C++ wrapper for Python-written
learners which then calls back the Python code. When given the examples to
learn from, :obj:`orngTree.TreeLearner` checks the class type. If it's
continuous, it will set the :obj:`orange.TreeLearner` to construct
regression trees, and if it's discrete, it will set the components for
classification trees. The common parameters, such as the minimal number of
examples in leaves, are used in both cases.)

You can of course use different learning algorithms for discrete and
continuous attributes. Probably a common setup will be to use
:obj:`BayesLearner` for discrete and :obj:`MajorityLearner` (which
just remembers the average) for continuous attributes, as follows.

Part of imputation.py (uses `bridges.tab`_)::

    imputer = orange.ImputerConstructor_model()
    imputer.learnerContinuous = orange.MajorityLearner()
    imputer.learnerDiscrete = orange.BayesLearner()
    imputer = imputer(data)
398
You can also construct an :obj:`Imputer_model` yourself. You will do this if
different attributes need different treatment. Brace for an example that will
be a bit more complex. First we shall construct an :obj:`Imputer_model` and
initialize an empty list of models.

Part of imputation.py (uses `bridges.tab`_)::

    imputer = orange.Imputer_model()
    imputer.models = [None] * len(data.domain)

Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and
"THROUGH". Since "LANES" is continuous, it suffices to construct a
:obj:`DefaultClassifier` with the default value 2.0 (don't forget the
decimal part, or else Orange will think you talk about an index of a discrete
value - how could it tell?). For the discrete attribute "T-OR-D", we could
construct a :obj:`DefaultClassifier` and give the index of value
"THROUGH" as an argument. But we shall do it more nicely, by constructing a
:obj:`Value`. Both classifiers will be stored at the appropriate places
in :obj:`imputer.models`::

    imputer.models[data.domain.index("LANES")] = orange.DefaultClassifier(2.0)

    tord = orange.DefaultClassifier(orange.Value(data.domain["T-OR-D"], "THROUGH"))
    imputer.models[data.domain.index("T-OR-D")] = tord
421
"LENGTH" will be computed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of
course). Note that we initialized the domain by simply giving a list with the
names of the attributes, with the domain as an additional argument in which
Orange will look for the named attributes::

    import orngTree
    len_domain = orange.Domain(["MATERIAL", "SPAN", "ERECTED", "LENGTH"], data.domain)
    len_data = orange.ExampleTable(len_domain, data)
    len_tree = orngTree.TreeLearner(len_data, minSubset=20)
    imputer.models[data.domain.index("LENGTH")] = len_tree
    orngTree.printTxt(len_tree)

We printed the tree just to see what it looks like::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528
440
Small and nice. Now for the "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done with a
:obj:`ClassifierByLookupTable` - it would be faster than what we plan here;
see the corresponding documentation on lookup classifiers. Here we will do it
with a Python function::

    spanVar = data.domain["SPAN"]

    def computeSpan(ex, returnWhat):
        if ex["TYPE"] == "WOOD" or ex["PURPOSE"] == "WALK":
            span = "SHORT"
        else:
            span = "MEDIUM"
        return orange.Value(spanVar, span)

    imputer.models[data.domain.index("SPAN")] = computeSpan

:obj:`computeSpan` could also be written as a class, if you'd prefer
it. It's important that it behaves like a classifier, that is, gets an example
and returns a value. The second argument tells, as usual, what the caller
expects the classifier to return - a value, a distribution or both. Since the
caller, :obj:`Imputer_model`, always wants values, we shall ignore the
argument (at the risk of having problems in the future when imputers might
handle distributions as well).
463
464
Treating the missing values as special values
=============================================

Missing values sometimes have a special meaning. The fact that something was
not measured can sometimes tell a lot. Be, however, cautious when using such
values in decision models; if the decision not to measure something (for
instance performing a laboratory test on a patient) is based on the expert's
knowledge of the class value, such unknown values clearly should not be used
in models.

.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete attribute is replaced with
    a new attribute that has one value more: "NA". The new attribute computes
    its values on the fly from the old one, copying the normal values and
    replacing the unknowns with "NA".

    For continuous attributes, :obj:`ImputerConstructor_asValue` will
    construct a two-valued discrete attribute with values "def" and "undef",
    telling whether the continuous attribute was defined or not. The
    attribute's name will equal the original's with "_def" appended. The
    original continuous attribute will remain in the domain and its unknowns
    will be replaced by averages.

    :obj:`ImputerConstructor_asValue` has no specific attributes.

    The constructed imputer is named :obj:`Imputer_asValue` (I bet you
    wouldn't guess). It converts the example into the new domain, which
    imputes the values for discrete attributes. If continuous attributes are
    present, it will also replace their values by the averages.

.. class:: Imputer_asValue

    .. attribute:: domain

    The domain with the new attributes constructed by
    :obj:`ImputerConstructor_asValue`.

    .. attribute:: defaults

    Default values for continuous attributes. Present only if there are any.
499
Here's a script that shows what this imputer actually does to the domain.

Part of imputation.py (uses `bridges.tab`_)::

    imputer = orange.ImputerConstructor_asValue(data)

    original = data[19]
    imputed = imputer(data[19])

    print original.domain
    print
    print imputed.domain
    print

    for i in original.domain:
        print "%s: %s -> %s" % (original.domain[i].name, original[i], imputed[i.name]),
        if original.domain[i].varType == orange.VarTypes.Continuous:
            print "(%s)" % imputed[i.name+"_def"]
        else:
            print
    print
521
The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D,
    MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH,
    LANES_def, LANES, CLEAR-G, T-OR-D,
    MATERIAL, SPAN, REL-L, TYPE]


    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T
544
Seemingly, the two examples have the same attributes (with
:obj:`imputed` having a few additional ones). But if you check this with
:obj:`original.domain[0] == imputed.domain[0]`, you will see that the first
impression is wrong: the comparison is :obj:`False`. The attributes only have
the same names, but they are different attributes; Orange does not identify
attributes by their names.

Therefore, if we wrote :obj:`imputed[i]` the program would fail
since :obj:`imputed` has no attribute :obj:`i`. But it has an
attribute with the same name (which even usually has the same value). We
therefore use :obj:`i.name` to index the attributes of
:obj:`imputed`. (Using names for indexing is not fast, though; if you do
it a lot, compute the integer index with
:obj:`imputed.domain.index(i.name)`.)

For continuous attributes, there is an additional attribute with "_def"
appended; we get it by :obj:`i.name+"_def"`. Not really nice, but it
works.

The first continuous attribute, "ERECTED", is defined. Its value remains 1874
and the additional attribute "ERECTED_def" has value "def". Not so for
"LENGTH". Its undefined value is replaced by the average (1567) and the new
attribute has value "undef". The undefined discrete attribute "CLEAR-G" (and
all other undefined discrete attributes) is assigned the value "NA".

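The per-value transformation can be sketched in plain Python (an illustration
of the behaviour described above, not the Orange implementation):

```python
# Illustrative sketch (plain Python, not the Orange API) of the asValue
# transformation applied to a single value.
def as_value_discrete(value):
    """Discrete attributes get the extra value "NA" instead of unknowns."""
    return "NA" if value is None else value

def as_value_continuous(value, average):
    """Continuous attributes keep their value (or get the average imputed)
    and gain a "def"/"undef" indicator value."""
    if value is None:
        return average, "undef"
    return value, "def"

# Mirrors the "ERECTED", "LENGTH" and "CLEAR-G" rows of the output above:
# as_value_continuous(1874, 1567) -> (1874, "def")
# as_value_continuous(None, 1567) -> (1567, "undef")
# as_value_discrete(None)         -> "NA"
```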
Using imputers
==============

To properly use the imputation classes in a learning process, they must be
trained on training examples only. Imputing the missing values and
subsequently using the data set in cross-validation will give overly
optimistic results.

Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values will generally provide a
slot for the imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it
you can assign an imputer constructor - one of the above constructors or a
specific constructor you wrote yourself. When given learning examples,
:obj:`Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an
imputer (again one of the above or a specific imputer you programmed). It will
immediately use the imputer to impute the missing values in the learning data
set, so they can be used by the actual learning algorithm. Besides, when the
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is
constructed, the imputer will be stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification, the imputer will be used for imputation of missing values in
(testing) examples.

Although details may vary from algorithm to algorithm, this is how the
imputation is generally used in Orange's learners. Also, if you write your own
learners, it is recommended that you use imputation according to the described
procedure.
602
Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all Orange
classes do so; refer to the documentation on `subtyping the Orange classes
in Python <callbacks.htm>`_ for a list). If you want to write your own
imputer constructor or imputer, you simply need to program a Python
function that will behave like the built-in Orange classes (even less is
needed for an imputer: you only need to write a function that gets an example
as an argument, and imputation of example tables will then use that function).

You will most often write the imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as we've
demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only
need to pack everything we've written there into an imputer constructor that
will accept a data set and the id of the weight meta-attribute (ignore it if
you will, but you must accept two arguments), and return the imputer (probably
:obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing
an imputer constructor, as opposed to what we did above, is that you can use
such a constructor as a component for Orange learners (like logistic
regression) or for wrappers from module orngImpute, and in that way properly
use it in classifier testing procedures.

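The required shape of such a pair of callables can be sketched in plain Python
(an illustration of the calling convention only, not the Orange API; the
average-based logic and the list-of-lists data representation are made up for
the example):

```python
# Illustrative sketch (plain Python, not the Orange API): an imputer
# constructor is any callable taking the data and a weight id and returning
# an imputer; the imputer maps an example to an imputed copy.
def average_imputer_constructor(data, weight=0):
    # compute per-attribute averages of the known values on the given data
    means = []
    for column in zip(*data):
        known = [v for v in column if v is not None]
        means.append(sum(known) / float(len(known)))

    def imputer(example):
        return [m if v is None else v for v, m in zip(example, means)]

    return imputer

imputer = average_imputer_constructor([[1.0, None], [3.0, 4.0]])
imputer([None, None])  # -> [2.0, 4.0]
```

Because the constructor accepts ``(data, weight)`` and returns a callable over
single examples, a wrapper in the spirit of :obj:`ImputeLearner` could use it
in place of the built-in constructors.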
.. _imputation-minimal-imputer.py: code/imputation-minimal-imputer.py
.. _imputation-complex.py: code/imputation-complex.py
.. _voting.tab: code/voting.tab
.. _bridges.tab: code/bridges.tab

"""

import Orange.core as orange
from orange import ImputerConstructor_minimal
from orange import ImputerConstructor_maximal
from orange import ImputerConstructor_average
from orange import Imputer_defaults
from orange import ImputerConstructor_model
from orange import Imputer_model
from orange import ImputerConstructor_asValue

class ImputeLearner(orange.Learner):
    def __new__(cls, examples=None, weightID=0, **keyw):
        self = orange.Learner.__new__(cls, **keyw)
        self.dontImputeClassifier = False
        self.__dict__.update(keyw)
        if examples:
            return self.__call__(examples, weightID)
        else:
            return self

    def __call__(self, data, weight=0):
        trained_imputer = self.imputerConstructor(data, weight)
        imputed_data = trained_imputer(data, weight)
        baseClassifier = self.baseLearner(imputed_data, weight)
        if self.dontImputeClassifier:
            return baseClassifier
        else:
            return ImputeClassifier(baseClassifier, trained_imputer)

class ImputeClassifier(orange.Classifier):
    def __init__(self, baseClassifier, imputer, **argkw):
        self.baseClassifier = baseClassifier
        self.imputer = imputer
        self.__dict__.update(argkw)

    def __call__(self, ex, what=orange.GetValue):
        return self.baseClassifier(self.imputer(ex), what)