# source:orange/Orange/doc/reference/imputation.htm@9671:a7b056375472

Revision 9671:a7b056375472, 24.4 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

<index name="imputation">
<h1>Imputation</h1>

<P>Imputation is the procedure of replacing missing attribute values with appropriate values. It is needed by methods (learning algorithms and others) that cannot handle unknown values, for instance logistic regression.</P>

<P>Missing values sometimes have a special meaning, so they need to be replaced by a designated value. Sometimes we know what to replace a missing value with; in a medical problem, for instance, some laboratory tests might not be done when it is already known what their results would be. In that case, we impute a certain fixed value instead of the missing one. In the most complex case, we assign values that are computed based on some model; we can, for instance, impute the average or majority value, or even a value predicted from the values of other, known attributes by a classifier.</P>
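<P>To make the idea of designated values concrete, here is a minimal plain-Python sketch (an illustration only, not Orange's API; the attribute names and designated values are hypothetical):</P>

```python
# Hypothetical sketch: each attribute whose missing value has a known meaning
# gets a designated replacement value; None marks a missing value.
DESIGNATED = {"LAB_TEST": "normal", "LANES": 2}

def impute_designated(example, designated):
    """Return a copy of the example with missing values replaced
    by their designated values, where one is known."""
    imputed = dict(example)
    for name, value in example.items():
        if value is None and name in designated:
            imputed[name] = designated[name]
    return imputed

# A lab test skipped because the result was obviously normal:
patient = {"LAB_TEST": None, "LANES": None, "AGE": 41}
imputed_patient = impute_designated(patient, DESIGNATED)
```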

<P>In a learning/classification process, imputation is needed on two occasions. Before learning, the imputer needs to process the training examples. Afterwards, the imputer is called for each example to be classified.</P>

<P>In general, the imputer itself needs to be trained. This is, of course, not needed when the imputer imputes a certain fixed value. However, when it imputes the average or majority value, it needs to compute the statistics on the training examples, and use them afterwards for imputation of both training and testing examples.</P>
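<P>The distinction between training an imputer and applying it can be sketched in a few lines of plain Python (an illustration of the idea only, not Orange's API):</P>

```python
# "Training" the imputer = computing a statistic on the training values;
# applying it = filling unknowns (None) with that precomputed statistic.

def train_average(values):
    """Average of the known (non-None) values."""
    known = [v for v in values if v is not None]
    return float(sum(known)) / len(known)

def impute(values, default):
    """Replace missing values with the trained default; originals stay intact."""
    return [default if v is None else v for v in values]

train_lengths = [1200.0, None, 1600.0, 2000.0]
test_lengths = [None, 1300.0]

avg = train_average(train_lengths)        # computed on training data only
imputed_train = impute(train_lengths, avg)
imputed_test = impute(test_lengths, avg)  # the same statistic is reused
```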

<P>While reading this document, bear in mind that imputation is a part of the learning process. If we fit the imputation model, for instance, by learning how to predict the attribute's value from other attributes, or even if we simply compute the average or the minimal value for the attribute and use it in imputation, this should only be done on the learning data. If cross-validation is used for sampling, imputation should be done on the training folds only. Orange provides simple means for doing that.</P>

<P>This page first explains how to construct various imputers, and then gives examples of <A href="#use">proper use of imputers</A>. Finally, quite often you will want to use imputation with special requests, such as some attributes' missing values being replaced by constants and others by values computed using models induced from specified other attributes. For instance, in one of the studies we worked on, some attributes' values were replaced by the most pessimistic ones, while others, such as the patient's pulse rate, were estimated with regression trees based on the scope of the patient's injuries, sex and age. If you are using learners that need the imputer as a component, you will need to <A href="#callback">write your own imputer constructor</A>. This is trivial and is explained at the end of this page.</P>


<H2>Abstract imputers</H2>

<P>As is common in Orange, imputation is done by a pair of classes: one that does the work and another that constructs it. <CODE><INDEX name="classes/ImputerConstructor">ImputerConstructor</INDEX></CODE> is the abstract root of the hierarchy of classes that get the training data (with an optional id for the weight) and construct an instance of a class derived from <CODE>Imputer</CODE>. When an <CODE>Imputer</CODE> is called with an <CODE>Example</CODE>, it returns a new example with the missing values imputed (it leaves the original example intact!). If the imputer is called with an <CODE>ExampleTable</CODE>, it returns a new example table with imputed examples.</P>

<P class=section>Attributes of <CODE>ImputerConstructor</CODE></P>
<DL class=attributes>
<DT>imputeClass</DT>
<DD>Tells whether to impute the class value (default) or not.</DD>
</DL>


<H2>Simple imputation</H2>

<P>The simplest imputers always impute the same value for a particular attribute, disregarding the values of other attributes. They all use the same imputer class, <CODE><INDEX name="classes/Imputer_defaults">Imputer_defaults</INDEX></CODE>.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>defaults</DT>
<DD>An example with the default values to be imputed instead of the missing ones. Examples to be imputed must be from the same domain as <CODE>defaults</CODE>.</DD>
</DL>

<P><CODE>Imputer_defaults</CODE> is constructed by <CODE><index name="classes/ImputerConstructor_minimal">ImputerConstructor_minimal</index></CODE>, <CODE><index name="classes/ImputerConstructor_maximal">ImputerConstructor_maximal</index></CODE> and <CODE><index name="classes/ImputerConstructor_average">ImputerConstructor_average</index></CODE>. For continuous attributes, they impute the smallest, the largest or the average value encountered in the training examples. For discrete attributes, they impute the lowest value (the one with index 0, <I>eg</I> <CODE>attr.values[0]</CODE>), the highest (<CODE>attr.values[-1]</CODE>), or the most common value encountered in the data. The first two imputers will mostly be used when the discrete values are ordered according to their impact on the class (for instance, possible values for symptoms of some disease can be ordered according to their seriousness). The minimal and maximal imputers then represent optimistic and pessimistic imputations.</P>
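<P>In plain Python terms (a sketch of the idea, not Orange's implementation), the statistics gathered by these three constructors could be computed like this:</P>

```python
# Sketches of the defaults produced by the three constructors: minimal and
# maximal, plus the average (continuous) and majority (discrete) defaults
# used by ImputerConstructor_average. None marks a missing value.

def known(values):
    return [v for v in values if v is not None]

def minimal_default(values):
    return min(known(values))

def maximal_default(values):
    return max(known(values))

def average_default(values):
    k = known(values)
    return float(sum(k)) / len(k)

def majority_default(values):
    k = known(values)
    return max(set(k), key=k.count)   # most common discrete value

lengths = [1000.0, None, 1500.0, 2000.0]     # a continuous attribute
spans = ["SHORT", "MEDIUM", None, "MEDIUM"]  # a discrete attribute
```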

<P>The following code loads the bridges data, and first imputes the values in a single example, then in the whole table.</P>

<P class="header">part of <A href="imputation.py">imputation.py</A> (uses <a href="bridges.tab">bridges.tab</a>)</P>
<XMP class=code>import orange

data = orange.ExampleTable("bridges")

impmin = orange.ImputerConstructor_minimal(data)

print "Example w/ missing values"
print data[19]
print "Imputed:"
print impmin(data[19])
print

impdata = impmin(data)
for i in range(20, 25):
    print data[i]
    print impdata[i]
    print
</XMP>

<P>This example shows what the imputer does, not how it is to be used. Don't impute all the data and then use it for cross-validation. As warned at the top of this page, see the instructions for actual <A href="#use">use of imputers</A>.</P>

<P>Note that <CODE>ImputerConstructor</CODE>s are another Orange class with a schizophrenic constructor: if you give the constructor the data, it will return an <CODE>Imputer</CODE> - the above call is equivalent to calling <CODE>orange.ImputerConstructor_minimal()(data)</CODE>.</P>

<P>You can also construct the <CODE>Imputer_defaults</CODE> yourself and specify your own defaults. Or leave some values unspecified, in which case the imputer won't impute them, as in the following example.</P>

<P class="header">part of <A href="imputation.py">imputation.py</A> (uses <a href="bridges.tab">bridges.tab</a>)</P>
<XMP class=code>imputer = orange.Imputer_defaults(data.domain)
imputer.defaults["LENGTH"] = 1234
</XMP>

<P>Here, the only attribute whose values will get imputed is "LENGTH"; the imputed value will be 1234.</P>

<P>The <CODE>Imputer_defaults</CODE>' constructor will accept an argument of type <CODE>Domain</CODE> (in which case it will construct an empty example for <CODE>defaults</CODE>) or an example. (Be careful with this: <CODE>Imputer_defaults</CODE> will store a reference to the example, not a copy. But you can make a copy yourself to avoid problems: instead of <CODE>orange.Imputer_defaults(data[0])</CODE> you may want to write <CODE>orange.Imputer_defaults(orange.Example(data[0]))</CODE>.)</P>


<H2>Random imputation</H2>

<P><CODE><INDEX name="classes/Imputer_Random">Imputer_Random</INDEX></CODE> imputes random values. The corresponding constructor is <CODE><INDEX name="classes/ImputerConstructor_Random">ImputerConstructor_Random</INDEX></CODE>.</P>

<DL class=attributes>
<DT>imputeClass</DT>
<DD>Tells whether to impute the class values or not (default: true).</DD>

<DT>deterministic</DT>
<DD>If true (default is false), the random generator is initialized for each example, using the example's hash value as a seed. As a result, the same example is always imputed the same values.</DD>
</DL>

<A name="model"></A>
<H2>Model-based imputation</H2>

<P>Model-based imputers learn to predict the attribute's value from the values of other attributes. <CODE><INDEX name="classes/ImputerConstructor_model">ImputerConstructor_model</INDEX></CODE> is given a learning algorithm (two, actually - one for discrete and one for continuous attributes) and constructs a classifier for each attribute. The constructed imputer <CODE>Imputer_model</CODE> stores a list of classifiers which are used when needed.</P>

<P class=section>Attributes of <CODE>ImputerConstructor_model</CODE></P>
<DL class=attributes>
<DT>learnerDiscrete, learnerContinuous</DT>
<DD>Learners for discrete and for continuous attributes. If either of them is missing, the attributes of the corresponding type won't get imputed.</DD>

<DT>useClass</DT>
<DD>Tells whether the imputer is allowed to use the class value. As this is most often undesired, this option is by default set to <CODE>false</CODE>. It can however be useful for a more complex design in which we would use one imputer for learning examples (this one would use the class value) and another for testing examples (which would not use the class value, as it is unavailable at that moment).</DD>
</DL>

<P class=section>Attributes of <CODE>Imputer_model</CODE></P>
<DL class=attributes>
<DT>models</DT>
<DD>A list of classifiers, each corresponding to one attribute of the examples whose values are to be imputed. The <CODE>classVar</CODE>s of the models should equal the examples' attributes. If any classifier is missing (that is, the corresponding element of the list is <CODE>None</CODE>), the corresponding attribute's values will not be imputed.</DD>
</DL>
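<P>The way the imputer walks through such a model list can be sketched in plain Python (a simplification, not Orange's code; the example and models below are hypothetical, and a "model" is anything callable that maps an example to a value):</P>

```python
# Each attribute has a slot in the model list; None means "leave unimputed".
def impute_with_models(example, models):
    imputed = list(example)                 # work on a copy
    for i, model in enumerate(models):
        if imputed[i] is None and model is not None:
            imputed[i] = model(example)     # predicted from the original example
    return imputed

# Three hypothetical attributes: the first is imputed with a constant, the
# second is never imputed, the third is predicted from the first attribute.
models = [lambda ex: 2.0,
          None,
          lambda ex: "SHORT" if ex[0] == "WOOD" else "MEDIUM"]

bridge = ["WOOD", None, None]
imputed_bridge = impute_with_models(bridge, models)
```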

<P>The following imputer predicts the missing attribute values using classification and regression trees with a minimum of 20 examples in a leaf.</P>

<P class="header">part of <A href="imputation.py">imputation.py</A> (uses <a href="bridges.tab">bridges.tab</a>)</P>
<XMP class=code>import orngTree
imputer = orange.ImputerConstructor_model()
imputer.learnerContinuous = imputer.learnerDiscrete = orngTree.TreeLearner(minSubset = 20)
imputer = imputer(data)
</XMP>

<P>We could even use the same learner for discrete and continuous attributes! (The way this functions is rather tricky. If you desire to know: <CODE>orngTree.TreeLearner</CODE> is a learning algorithm written in Python - Orange doesn't mind, it will wrap it into a C++ wrapper for Python-written learners, which then calls back the Python code. When given the examples to learn from, <CODE>orngTree.TreeLearner</CODE> checks the class type. If it's continuous, it will set up <CODE>orange.TreeLearner</CODE> to construct regression trees, and if it's discrete, it will set the components for classification trees. The common parameters, such as the minimal number of examples in leaves, are used in both cases.)</P>

<P>You can of course use different learning algorithms for discrete and continuous attributes. Probably a common setup will be to use <CODE>BayesLearner</CODE> for discrete and <CODE>MajorityLearner</CODE> (which just remembers the average) for continuous attributes, as follows.</P>

<P class="header">part of <A href="imputation.py">imputation.py</A> (uses <a href="bridges.tab">bridges.tab</a>)</P>
<XMP class=code>imputer = orange.ImputerConstructor_model()
imputer.learnerContinuous = orange.MajorityLearner()
imputer.learnerDiscrete = orange.BayesLearner()
imputer = imputer(data)
</XMP>

<P>You can also construct an <CODE>Imputer_model</CODE> yourself. You will do this if different attributes need different treatment. Brace for an example that will be a bit more complex. First we shall construct an <CODE>Imputer_model</CODE> and initialize an empty list of models.</P>

<P class="header">part of <A href="imputation.py">imputation.py</A> (uses <a href="bridges.tab">bridges.tab</a>)</P>
<XMP class=code>imputer = orange.Imputer_model()
imputer.models = [None] * len(data.domain)
</XMP>

<P>Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and "THROUGH", respectively. Since "LANES" is continuous, it suffices to construct a <CODE>DefaultClassifier</CODE> with the default value 2.0 (don't forget the decimal part, or else Orange will think you are talking about the index of a discrete value - how could it tell?). For the discrete attribute "T-OR-D", we could construct a <CODE>DefaultClassifier</CODE> and give the index of the value "THROUGH" as an argument. But we shall do it more neatly, by constructing a <CODE>Value</CODE>. Both classifiers are stored at the appropriate places in <CODE>imputer.models</CODE>.</P>

<XMP class=code>imputer.models[data.domain.index("LANES")] = orange.DefaultClassifier(2.0)

tord = orange.DefaultClassifier(orange.Value(data.domain["T-OR-D"], "THROUGH"))
imputer.models[data.domain.index("T-OR-D")] = tord
</XMP>

<P>"LENGTH" will be computed with a regression tree induced from "MATERIAL", "SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of course). Note that we initialized the domain by simply giving a list with the names of the attributes, with the original domain as an additional argument in which Orange will look for the named attributes.</P>

<XMP class=code>import orngTree
len_domain = orange.Domain(["MATERIAL", "SPAN", "ERECTED", "LENGTH"], data.domain)
len_data = orange.ExampleTable(len_domain, data)
len_tree = orngTree.TreeLearner(len_data, minSubset=20)
imputer.models[data.domain.index("LENGTH")] = len_tree
orngTree.printTxt(len_tree)
</XMP>

<P>We printed the tree just to see what it looks like.</P>

<XMP class=code>SPAN=SHORT: 1158
SPAN=LONG: 1907
SPAN=MEDIUM
|    ERECTED<1908.500: 1325
|    ERECTED>=1908.500: 1528
</XMP>

<P>Small and nice. Now for "SPAN". Wooden bridges and walkways are short, while the others are mostly medium. This could be done with a <a href="lookup.htm"><CODE>ClassifierByLookupTable</CODE></A> - that would be faster than what we plan here; see the corresponding documentation on lookup classifiers. Here we are going to do it with a Python function.</P>

<XMP class=code>spanVar = data.domain["SPAN"]

def computeSpan(ex, returnWhat):
    if ex["TYPE"] == "WOOD" or ex["PURPOSE"] == "WALK":
        span = "SHORT"
    else:
        span = "MEDIUM"
    return orange.Value(spanVar, span)

imputer.models[data.domain.index("SPAN")] = computeSpan
</XMP>

<P><CODE>computeSpan</CODE> could also be written as a class, if you'd prefer it. It is important that it behaves like a classifier, that is, gets an example and returns a value. The second argument tells, as usual, what the caller expects the classifier to return - a value, a distribution or both. Since the caller, <CODE>Imputer_model</CODE>, always wants values, we shall ignore the argument (at the risk of having problems in the future, when imputers might handle distributions as well).</P>

<P>OK, that's enough. Other attributes' values will remain unknown.</P>


<H3>Treating the missing values as special values</H3>

<P>Missing values sometimes have a special meaning. The fact that something was not measured can sometimes tell a lot. Be, however, cautious when using such values in decision models; if the decision not to measure something (for instance, performing a laboratory test on a patient) is based on the expert's knowledge of the class value, such unknown values clearly should not be used in models.</P>

<P><CODE><INDEX name="classes/ImputerConstructor_asValue">ImputerConstructor_asValue</INDEX></CODE> constructs a new domain in which each discrete attribute is replaced with a new attribute that has one value more: "NA". The new attribute computes its values on the fly from the old one, copying the normal values and replacing the unknowns with "NA".</P>

<P>For continuous attributes, <CODE>ImputerConstructor_asValue</CODE> will construct a two-valued discrete attribute with values "def" and "undef", telling whether the continuous attribute was defined or not. The attribute's name will equal the original's with "_def" appended. The original continuous attribute will remain in the domain and its unknowns will be replaced by averages.</P>
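<P>For a single continuous attribute, this treatment can be sketched in plain Python (an illustration of the idea, not Orange's code):</P>

```python
# Build the indicator column ("def"/"undef") and fill unknowns (None)
# with the average of the known values, mirroring the "as value" treatment
# of a continuous attribute.
def as_value_continuous(values):
    known = [v for v in values if v is not None]
    avg = float(sum(known)) / len(known)
    indicator = ["def" if v is not None else "undef" for v in values]
    filled = [avg if v is None else v for v in values]
    return indicator, filled

lengths = [1000.0, None, 2000.0]
indicator, filled = as_value_continuous(lengths)
# indicator is ["def", "undef", "def"]; the unknown is replaced by 1500.0
```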

<P><CODE>ImputerConstructor_asValue</CODE> has no specific attributes.</P>

<P>The constructed imputer is named <CODE>Imputer_asValue</CODE> (I bet you wouldn't guess). It converts the example into the new domain, which imputes the values for discrete attributes. If continuous attributes are present, it will also replace their values by the averages.</P>

<P class=section>Attributes of <CODE>Imputer_asValue</CODE></P>
<DL class=attributes>
<DT>domain</DT>
<DD>The domain with the new attributes constructed by <CODE>ImputerConstructor_asValue</CODE>.</DD>

<DT>defaults</DT>
<DD>Default values for continuous attributes. Present only if there are any.</DD>
</DL>

<P>Here's a script that shows what this imputer actually does to the domain.</P>

<P class="header">part of <A href="imputation.py">imputation.py</A> (uses <a href="bridges.tab">bridges.tab</a>)</P>
<XMP class=code>imputer = orange.ImputerConstructor_asValue(data)

original = data[19]
imputed = imputer(data[19])

print original.domain
print
print imputed.domain
print

for i in original.domain:
    print "%s: %s -> %s" % (original.domain[i].name, original[i], imputed[i.name]),
    if original.domain[i].varType == orange.VarTypes.Continuous:
        print "(%s)" % imputed[i.name+"_def"]
    else:
        print
print
</XMP>

<P>The script's output looks like this.</P>

<XMP class=code>[RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D,
MATERIAL, SPAN, REL-L, TYPE]

[RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH,
LANES_def, LANES, CLEAR-G, T-OR-D,
MATERIAL, SPAN, REL-L, TYPE]


RIVER: M -> M
ERECTED: 1874 -> 1874 (def)
PURPOSE: RR -> RR
LENGTH: ? -> 1567 (undef)
LANES: 2 -> 2 (def)
CLEAR-G: ? -> NA
T-OR-D: THROUGH -> THROUGH
MATERIAL: IRON -> IRON
SPAN: ? -> NA
REL-L: ? -> NA
TYPE: SIMPLE-T -> SIMPLE-T
</XMP>

<P>Seemingly, the two examples have the same attributes (with <CODE>imputed</CODE> having a few additional ones). But if you check whether <CODE>original.domain[0] == imputed.domain[0]</CODE>, you will see that this first impression is wrong: the comparison yields <CODE>False</CODE>. The attributes only have the same names, but they are different attributes. (As explained elsewhere in the reference, Orange does not really care about attribute names.)</P>

<P>Therefore, if we wrote <CODE>imputed[i]</CODE>, the program would fail since <CODE>imputed</CODE> has no attribute <CODE>i</CODE>. It does, however, have an attribute with the same name (which usually even has the same value). We therefore use <CODE>i.name</CODE> to index the attributes of <CODE>imputed</CODE>. (Using names for indexing is not fast, though; if you do it a lot, compute the integer index with <CODE>imputed.domain.index(i.name)</CODE>.)</P>

<P>For continuous attributes, there is an additional attribute with "_def" appended; we get it by <CODE>i.name+"_def"</CODE>. Not really nice, but it works.</P>

<P>The first continuous attribute, "ERECTED", is defined. Its value remains 1874 and the additional attribute "ERECTED_def" has the value "def". Not so for "LENGTH". Its undefined value is replaced by the average (1567) and the new attribute has the value "undef". The undefined discrete attribute "CLEAR-G" (and all other undefined discrete attributes) is assigned the value "NA".</P>

<A name="use"></A>
<H2>Using imputers</H2>

<P>To use the imputation classes properly in a learning process, they must be trained on training examples only. Imputing the missing values in the whole data set and subsequently using it in cross-validation will give overly optimistic results.</P>
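<P>The required discipline can be sketched in plain Python (the data and fold indices are hypothetical, and the average stands in for any trained imputation model): within each fold, the statistic is computed from the training part only and merely applied to the test part.</P>

```python
def fit_average(values):
    """Train the "imputer" on the training fold: compute the average."""
    known = [v for v in values if v is not None]
    return float(sum(known)) / len(known)

def apply_imputer(values, default):
    """Apply the trained statistic; no peeking at the test fold."""
    return [default if v is None else v for v in values]

data = [1.0, None, 3.0, None, 5.0, 7.0]
folds = [[0, 1, 2], [3, 4, 5]]           # hypothetical fold indices

imputed_tests = []
for test_fold in folds:
    train = [v for i, v in enumerate(data) if i not in test_fold]
    default = fit_average(train)         # computed on the training fold only
    test = [data[i] for i in test_fold]
    imputed_tests.append(apply_imputer(test, default))
```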

<H3>Learners with Imputer as a Component</H3>

<P>Orange learners that cannot handle missing values will generally provide a slot for the imputer component. An example of such a class is the <A href="LogisticLearner.htm">logistic regression learner</A> with an attribute called <CODE>imputerConstructor</CODE>. To it you can assign an imputer constructor - one of the above constructors or a specific constructor you wrote yourself. When given learning examples, <CODE>LogRegLearner</CODE> will pass them to <CODE>imputerConstructor</CODE> to get an imputer (again, one of the above or a specific imputer you programmed). It will immediately use the imputer to impute the missing values in the learning data set, so they can be used by the actual learning algorithm. Besides, when the classifier (<CODE>LogRegClassifier</CODE>) is constructed, the imputer will be stored in its attribute <CODE>imputer</CODE>. At classification, the imputer will be used for imputation of missing values in (testing) examples.</P>

<P>Although details may vary from algorithm to algorithm, this is how imputation is generally used in Orange's learners. Also, if you write your own learners, it is recommended that you use imputation according to the described procedure.</P>

<H3>Module orngImpute</H3>

<P>Most of Orange's learning algorithms do not use imputers, because they can appropriately handle missing values themselves. The Bayesian classifier, for instance, simply skips the corresponding attributes in the formula, while classification/regression trees have components for handling missing values in various ways.</P>

<P>If for any reason you want these algorithms to run on imputed data, you can use the wrappers provided in the module orngImpute. The module's description is a matter of a separate page, but we shall show its code here as another demonstration of how to use the imputers - logistic regression is implemented essentially the same way as the classes below.</P>

<P class="header">The complete code of module orngImpute.py</P>
<XMP class=code>import orange

class ImputeLearner(orange.Learner):
    def __new__(cls, examples = None, weightID = 0, **keyw):
        self = orange.Learner.__new__(cls, **keyw)
        self.__dict__.update(keyw)
        if examples:
            return self.__call__(examples, weightID)
        else:
            return self

    def __call__(self, data, weight=0):
        trained_imputer = self.imputerConstructor(data, weight)
        imputed_data = trained_imputer(data, weight)
        baseClassifier = self.baseLearner(imputed_data, weight)
        return ImputeClassifier(baseClassifier, trained_imputer)

class ImputeClassifier(orange.Classifier):
    def __init__(self, baseClassifier, imputer):
        self.baseClassifier = baseClassifier
        self.imputer = imputer

    def __call__(self, ex, what=orange.GetValue):
        return self.baseClassifier(self.imputer(ex), what)
</XMP>

<P><CODE>ImputeLearner</CODE> puts the keyword arguments into the instance's dictionary. You are expected to call it like <CODE>ImputeLearner(baseLearner=&lt;someLearner&gt;, imputerConstructor=&lt;someImputerConstructor&gt;)</CODE>. When the learner is called with examples, it trains the imputer, imputes the data, induces a <CODE>baseClassifier</CODE> with the <CODE>baseLearner</CODE> and constructs an <CODE>ImputeClassifier</CODE> that stores the <CODE>baseClassifier</CODE> and the <CODE>imputer</CODE>. For classification, the missing values are imputed and the base classifier's prediction is returned.</P>

<P>Note that this code is slightly simplified; the omitted details handle non-essential technical issues that are unrelated to imputation.</P>


<A name="callback"></A>