Changeset 9809:03b6a2f2caa5 in orange
- Timestamp: 02/06/12 18:37:52 (16 months ago)
- Branch: default
- rebase_source: b1c2c3fafdf6f708a990f9372702518ab230282c
- File: 1 edited
  - docs/reference/rst/Orange.feature.imputation.rst (modified) (5 diffs)
--- docs/reference/rst/Orange.feature.imputation.rst (r9808)
+++ docs/reference/rst/Orange.feature.imputation.rst (r9809)
@@ -159,9 +159,4 @@
 :lines: 91-94
 
-To construct a yourself. You will do
-this if different attributes need different treatment. Brace for an
-example that will be a bit more complex. First we shall construct an
-:class:`Imputer_model` and initialize an empty list of models.
-
 To construct a user-defined :class:`Imputer_model`:
 
@@ -169,16 +164,15 @@
 :lines: 108-112
 
-A list of empty models is first initialized. Continuous feature "LANES" is
-imputed with value 2, using :obj:`DefaultClassifier` with the default value
-2.0. A float must be given, because integer values are interpreted as indexes
-of discrete features. Discrete feature "T-OR-D" is imputed using
-:class:`Orange.classification.ConstantClassifier` which is given the index
-of value "THROUGH" as an argument. Both classifiers are stored at the
-appropriate places in :obj:`Imputer_model.models`.
+A list of empty models is first initialized :obj:`Imputer_model.models`.
+Continuous feature "LANES" is imputed with value 2 using
+:obj:`DefaultClassifier`. A float must be given, because integer values are
+interpreted as indexes of discrete features. Discrete feature "T-OR-D" is
+imputed using :class:`Orange.classification.ConstantClassifier` which is
+given the index of value "THROUGH" as an argument.
 
 Feature "LENGTH" is computed with a regression tree induced from "MATERIAL",
 "SPAN" and "ERECTED" (feature "LENGTH" is used as class attribute here).
-The domain is initialized by simply giving a list of feature names and
-domain as an additional argument where Orange will look for features.
+Domain is initialized by giving a list of feature names and domain as an
+additional argument where Orange will look for features.
 
 .. literalinclude:: code/imputation-complex.py
@@ -194,51 +188,44 @@
 </XMP>
 
-Small and nice. Now for the "SPAN". Wooden bridges and walkways are short,
-while the others are mostly medium. This could be done by
-:class:`Orange.classifier.ClassifierByLookupTable` - this would be faster
-than what we plan here. See the corresponding documentation on lookup
-classifier. Here we are going to do it with a Python function.
+Wooden bridges and walkways are short, while the others are mostly
+medium. This could be encoded in feature "SPAN" using
+:class:`Orange.classifier.ClassifierByLookupTable`, which is faster than the
+Python function used here:
 
 .. literalinclude:: code/imputation-complex.py
 :lines: 121-128
 
-:obj:`compute_span` could also be written as a class, if you'd prefer
-it. It's important that it behaves like a classifier, that is, gets an example
-and returns a value. The second element tells, as usual, what the caller expect
-the classifier to return - a value, a distribution or both. Since the caller,
-:obj:`Imputer_model`, always wants values, we shall ignore the argument
-(at risk of having problems in the future when imputers might handle
-distribution as well).
+If :obj:`compute_span` is written as a class it must behave like a
+classifier: it accepts an example and returns a value. The second
+argument tells what the caller expects the classifier to return - a value,
+a distribution or both. Currently, :obj:`Imputer_model`,
+always expects values and the argument can be ignored.
 
 Missing values as special values
 ================================
 
-Missing values sometimes have a special meaning. The fact that something was
-not measured can sometimes tell a lot. Be, however, cautious when using such
-values in decision models; it the decision not to measure something (for
-instance performing a laboratory test on a patient) is based on the expert's
-knowledge of the class value, such unknown values clearly should not be used
-in models.
+Missing values sometimes have a special meaning. Cautious is needed when
+using such values in decision models. When the decision not to measure
+something (for example, performing a laboratory test on a patient) is based
+on the expert's knowledge of the class value, such missing values clearly
+should not be used in models.
 
 .. class:: ImputerConstructor_asValue
 
-Constructs a new domain in which each
-discrete attribute is replaced with a new attribute that has one value more:
-"NA". The new attribute will compute its values on the fly from the old one,
+Constructs a new domain in which each discrete feature is replaced
+with a new feature that has one more value: "NA". The new feature
+computes its values on the fly from the old one,
 copying the normal values and replacing the unknowns with "NA".
 
-For continuous attributes, it will
-construct a two-valued discrete attribute with values "def" and "undef",
-telling whether the continuous attribute was defined or not. The attribute's
-name will equal the original's with "_def" appended. The original continuous
-attribute will remain in the domain and its unknowns will be replaced by
-averages.
+For continuous attributes, it constructs a two-valued discrete attribute
+with values "def" and "undef", telling whether the value is defined or
+not. The features's name will equal the original's with "_def" appended.
+The original continuous feature will remain in the domain and its
+unknowns will be replaced by averages.
 
 :class:`ImputerConstructor_asValue` has no specific attributes.
 
-It constructs :class:`Imputer_asValue` (I bet you
-wouldn't guess). It converts the example into the new domain, which imputes
-the values for discrete attributes. If continuous attributes are present, it
-will also replace their values by the averages.
+It constructs :class:`Imputer_asValue` that converts the example into
+the new domain.
 
 .. class:: Imputer_asValue
@@ -246,13 +233,12 @@
 .. attribute:: domain
 
-The domain with the new attributes constructed by
+The domain with the new feature constructed by
 :class:`ImputerConstructor_asValue`.
 
 .. attribute:: defaults
 
-Default values for continuous attributes. Present only if there are any.
-
-The following code shows what this imputer actually does to the domain.
-Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):
+Default values for continuous features.
+
+The following code shows what the imputer actually does to the domain:
 
 .. literalinclude:: code/imputation-complex.py
@@ -278,59 +264,50 @@
 
 Seemingly, the two examples have the same attributes (with
-:samp:`imputed` having a few additional ones). If you check this by
-:samp:`original.domain[0] == imputed.domain[0]`, you shall see that this
-first glance is False. The attributes only have the same names,
-but they are different attributes. If you read this page (which is already a
-bit advanced), you know that Orange does not really care about the attribute
-names).
-
-Therefore, if we wrote :samp:`imputed[i]` the program would fail
-since :samp:`imputed` has no attribute :samp:`i`. But it has an
-attribute with the same name (which even usually has the same value). We
-therefore use :samp:`i.name` to index the attributes of
-:samp:`imputed`. (Using names for indexing is not fast, though; if you do
-it a lot, compute the integer index with
-:samp:`imputed.domain.index(i.name)`.)</P>
-
-For continuous attributes, there is an additional attribute with "_def"
-appended; we get it by :samp:`i.name+"_def"`.
-
-The first continuous attribute, "ERECTED" is defined. Its value remains 1874
-and the additional attribute "ERECTED_def" has value "def". Not so for
-"LENGTH". Its undefined value is replaced by the average (1567) and the new
-attribute has value "undef". The undefined discrete attribute "CLEAR-G" (and
-all other undefined discrete attributes) is assigned the value "NA".
+:samp:`imputed` having a few additional ones). Comparing
+:samp:`original.domain[0] == imputed.domain[0]` will result in False. While
+the names are same, they represent different features. Writting,
+:samp:`imputed[i]` would fail since :samp:`imputed` has no attribute
+:samp:`i`, but it has an attribute with the same name. Using
+:samp:`i.name` to index the attributes of
+:samp:`imputed` will work, yet it is not fast. If a frequently used, it is
+better to compute the index with :samp:`imputed.domain.index(i.name)`.
+
+For continuous features, there is an additional feature with name prefix
+"_def", which is accessible by :samp:`i.name+"_def"`. The value of the first
+continuous feature "ERECTED" remains 1874, and the additional attribute
+"ERECTED_def" has value "def". The undefined value in "LENGTH" is replaced
+by the average (1567) and the new attribute has value "undef". The
+undefined discrete attribute "CLEAR-G" (and all other undefined discrete
+attributes) is assigned the value "NA".
 
 Using imputers
 ==============
 
-To properly use the imputation classes in learning process, they must be
-trained on training examples only. Imputing the missing values and subsequently
-using the data set in cross-validation will give overly optimistic results.
+Imputation must run on training data only. Imputing the missing values
+and subsequently using the data in cross-validation will give overly
+optimistic results.
 
 Learners with imputer as a component
 ------------------------------------
 
-Orange learners that cannot handle missing values will generally provide a slot
-for the imputer component. An example of such a class is
-:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
-:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you
-can assign an imputer constructor - one of the above constructors or a specific
-constructor you wrote yourself. When given learning examples,
-:obj:`Orange.classification.logreg.LogRegLearner` will pass them to
-:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an
-imputer (again some of the above or a specific imputer you programmed). It will
-immediately use the imputer to impute the missing values in the learning data
-set, so it can be used by the actual learning algorithm. Besides, when the
+Learners that cannot handle missing values provide a slot for the imputer
+component. An example of such a class is
+:obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called
+:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor`.
+
+When given learning instances,
+:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
+:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor` to get
+an imputer and used it to impute the missing values in the learning data.
+Imputed data is then used by the actual learning algorithm. Also, when a
 classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed,
-the imputer will be stored in its attribute
+the imputer is stored in its attribute
 :obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
-classification, the imputer will be used for imputation of missing values in
-(testing) examples.
-
-Although details may vary from algorithm to algorithm, this is how the
-imputation is generally used in Orange's learners. Also, if you write your own
-learners, it is recommended that you use imputation according to the described
-procedure.
+classification, the same imputer is used for imputation of missing values
+in (testing) examples.
+
+Details may vary from algorithm to algorithm, but this is how the imputation
+is generally used. When write user-defined learners,
+it is recommended to use imputation according to the described procedure.
 
 Wrapper for learning algorithms
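
The revised "Using imputers" text in this changeset says that imputation must be fitted on training data only and that the stored imputer is then reused at classification time. A minimal sketch of that idea in plain Python follows. It does not use the Orange API: `fit_mean_imputer` and `impute` are hypothetical helper names, missing values are assumed to be encoded as `None`, and mean imputation stands in for whatever imputer a learner would actually construct.

```python
from statistics import mean


def fit_mean_imputer(rows):
    """Learn one default per column (the mean of the non-missing values).

    Hypothetical helper, not part of Orange. Raises StatisticsError if a
    column has no known values at all.
    """
    columns = zip(*rows)
    return [mean(v for v in col if v is not None) for col in columns]


def impute(rows, defaults):
    """Return a copy of rows with each None replaced by its column default."""
    return [[d if v is None else v for v, d in zip(row, defaults)]
            for row in rows]


# Toy data: two continuous columns, None marks a missing value.
train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
test = [[None, 100.0]]

# Defaults are computed from the training rows only ...
defaults = fit_mean_imputer(train)

# ... and the same defaults are applied to both training and testing rows,
# so nothing from the test rows leaks into the fitted imputer.
train_imputed = impute(train, defaults)
test_imputed = impute(test, defaults)
```

In a cross-validation loop, the call to `fit_mean_imputer` would be repeated inside each fold on that fold's training part, rather than once on the whole data set beforehand.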
