Timestamp:
02/06/12 18:37:52 (2 years ago)
Author:
tomazc <tomaz.curk@…>
Branch:
default
rebase_source:
b1c2c3fafdf6f708a990f9372702518ab230282c
Message:

Orange.feature.imputation

File:
1 edited

  • docs/reference/rst/Orange.feature.imputation.rst

    r9808 r9809  
    159159    :lines: 91-94 
    160160 
    161 To construct a  yourself. You will do 
    162 this if different attributes need different treatment. Brace for an 
    163 example that will be a bit more complex. First we shall construct an 
    164 :class:`Imputer_model` and initialize an empty list of models. 
    165  
    166161To construct a user-defined :class:`Imputer_model`: 
    167162 
     
    169164    :lines: 108-112 
    170165 
    171 A list of empty models is first initialized. Continuous feature "LANES" is 
    172 imputed with value 2, using :obj:`DefaultClassifier` with the default value 
    173 2.0. A float must be given, because integer values are interpreted as indexes 
    174 of discrete features. Discrete feature "T-OR-D" is imputed using 
    175 :class:`Orange.classification.ConstantClassifier` which is given the index 
    176 of value "THROUGH" as an argument. Both classifiers are stored at the 
    177 appropriate places in :obj:`Imputer_model.models`. 
     166A list of empty models is first initialized in :obj:`Imputer_model.models`. 
     167Continuous feature "LANES" is imputed with value 2 using 
     168:obj:`DefaultClassifier`. A float must be given, because integer values are 
     169interpreted as indexes of discrete features. Discrete feature "T-OR-D" is 
     170imputed using :class:`Orange.classification.ConstantClassifier` which is 
     171given the index of value "THROUGH" as an argument. 
    178172 
    179173Feature "LENGTH" is computed with a regression tree induced from "MATERIAL", 
    180174"SPAN" and "ERECTED" (feature "LENGTH" is used as class attribute here). 
    181 The domain is initialized by simply giving a list of feature names and 
    182 domain as an additional argument where Orange will look for features. 
     175Domain is initialized by giving a list of feature names and domain as an 
     176additional argument where Orange will look for features. 
    183177 
    184178.. literalinclude:: code/imputation-complex.py 
     
    194188    </XMP> 
    195189 
    196 Small and nice. Now for the "SPAN". Wooden bridges and walkways are short, 
    197 while the others are mostly medium. This could be done by 
    198 :class:`Orange.classifier.ClassifierByLookupTable` - this would be faster 
    199 than what we plan here. See the corresponding documentation on lookup 
    200 classifier. Here we are going to do it with a Python function. 
     190Wooden bridges and walkways are short, while the others are mostly 
     191medium. This could be encoded in feature "SPAN" using 
     192:class:`Orange.classifier.ClassifierByLookupTable`, which is faster than the 
     193Python function used here: 
    201194 
    202195.. literalinclude:: code/imputation-complex.py 
    203196    :lines: 121-128 
    204197 
    205 :obj:`compute_span` could also be written as a class, if you'd prefer 
    206 it. It's important that it behaves like a classifier, that is, gets an example 
    207 and returns a value. The second element tells, as usual, what the caller expect 
    208 the classifier to return - a value, a distribution or both. Since the caller, 
    209 :obj:`Imputer_model`, always wants values, we shall ignore the argument 
    210 (at risk of having problems in the future when imputers might handle 
    211 distribution as well). 
     198If :obj:`compute_span` is written as a class it must behave like a 
     199classifier: it accepts an example and returns a value. The second 
     200argument tells what the caller expects the classifier to return - a value, 
     201a distribution or both. Currently, :obj:`Imputer_model` 
     202always expects values, so the argument can be ignored. 
    212203 
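The classifier-like behaviour described above can be sketched without Orange. This is a minimal illustration, not the code from ``imputation-complex.py``: a plain dict stands in for an example, the ``GET_*`` constants are hypothetical stand-ins for the "value, distribution or both" argument, and the feature names ``TYPE`` and ``PURPOSE`` are assumed for the rule "wooden bridges and walkways are short".

```python
# Hypothetical stand-ins: a dict plays the role of an example, and these
# constants mimic the "return a value / a distribution / both" argument.
GET_VALUE, GET_PROBABILITIES, GET_BOTH = 0, 1, 2


class ComputeSpan(object):
    """Classifier-like callable: accepts an example, returns a SPAN value."""

    def __call__(self, example, return_what=GET_VALUE):
        # return_what is accepted for interface compatibility but ignored,
        # because the imputer in this sketch only ever asks for values.
        # "TYPE" and "PURPOSE" are assumed feature names.
        if example["TYPE"] == "WOOD" or example["PURPOSE"] == "WALK":
            return "SHORT"
        return "MEDIUM"


compute_span = ComputeSpan()
wooden = {"TYPE": "WOOD", "PURPOSE": "HIGHWAY", "SPAN": None}
steel = {"TYPE": "STEEL", "PURPOSE": "RR", "SPAN": None}

span_for_wooden = compute_span(wooden)
span_for_steel = compute_span(steel, GET_BOTH)
```

Writing the rule as a class rather than a function changes nothing for the caller, as long as the instance can be called with an example and an optional second argument.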
    213204Missing values as special values 
    214205================================ 
    215206 
    216 Missing values sometimes have a special meaning. The fact that something was 
    217 not measured can sometimes tell a lot. Be, however, cautious when using such 
    218 values in decision models; it the decision not to measure something (for 
    219 instance performing a laboratory test on a patient) is based on the expert's 
    220 knowledge of the class value, such unknown values clearly should not be used 
    221 in models. 
     207Missing values sometimes have a special meaning. Caution is needed when 
     208using such values in decision models. When the decision not to measure 
     209something (for example, performing a laboratory test on a patient) is based 
     210on the expert's knowledge of the class value, such missing values clearly 
     211should not be used in models. 
    222212 
    223213.. class:: ImputerConstructor_asValue 
    224214 
    225     Constructs a new domain in which each 
    226     discrete attribute is replaced with a new attribute that has one value more: 
    227     "NA". The new attribute will compute its values on the fly from the old one, 
     215    Constructs a new domain in which each discrete feature is replaced 
     216    with a new feature that has one more value: "NA". The new feature 
     217    computes its values on the fly from the old one, 
    228218    copying the normal values and replacing the unknowns with "NA". 
    229219 
    230     For continuous attributes, it will 
    231     construct a two-valued discrete attribute with values "def" and "undef", 
    232     telling whether the continuous attribute was defined or not. The attribute's 
    233     name will equal the original's with "_def" appended. The original continuous 
    234     attribute will remain in the domain and its unknowns will be replaced by 
    235     averages. 
     220    For continuous attributes, it constructs a two-valued discrete attribute 
     221    with values "def" and "undef", telling whether the value is defined or 
     222    not.  The feature's name will equal the original's with "_def" appended. 
     223    The original continuous feature will remain in the domain and its 
     224    unknowns will be replaced by averages. 
    236225 
    237226    :class:`ImputerConstructor_asValue` has no specific attributes. 
    238227 
    239     It constructs :class:`Imputer_asValue` (I bet you 
    240     wouldn't guess). It converts the example into the new domain, which imputes 
    241     the values for discrete attributes. If continuous attributes are present, it 
    242     will also replace their values by the averages. 
     228    It constructs :class:`Imputer_asValue` that converts the example into 
     229    the new domain. 
    243230 
    244231.. class:: Imputer_asValue 
     
    246233    .. attribute:: domain 
    247234 
    248         The domain with the new attributes constructed by 
     235        The domain with the new feature constructed by 
    249236        :class:`ImputerConstructor_asValue`. 
    250237 
    251238    .. attribute:: defaults 
    252239 
    253         Default values for continuous attributes. Present only if there are any. 
    254  
    255 The following code shows what this imputer actually does to the domain. 
    256 Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`): 
     240        Default values for continuous features. 
     241 
     242The following code shows what the imputer actually does to the domain: 
    257243 
    258244.. literalinclude:: code/imputation-complex.py 
     
    278264 
    279265Seemingly, the two examples have the same attributes (with 
    280 :samp:`imputed` having a few additional ones). If you check this by 
    281 :samp:`original.domain[0] == imputed.domain[0]`, you shall see that this 
    282 first glance is False. The attributes only have the same names, 
    283 but they are different attributes. If you read this page (which is already a 
    284 bit advanced), you know that Orange does not really care about the attribute 
    285 names). 
    286  
    287 Therefore, if we wrote :samp:`imputed[i]` the program would fail 
    288 since :samp:`imputed` has no attribute :samp:`i`. But it has an 
    289 attribute with the same name (which even usually has the same value). We 
    290 therefore use :samp:`i.name` to index the attributes of 
    291 :samp:`imputed`. (Using names for indexing is not fast, though; if you do 
    292 it a lot, compute the integer index with 
    293 :samp:`imputed.domain.index(i.name)`.)</P> 
    294  
    295 For continuous attributes, there is an additional attribute with "_def" 
    296 appended; we get it by :samp:`i.name+"_def"`. 
    297  
    298 The first continuous attribute, "ERECTED" is defined. Its value remains 1874 
    299 and the additional attribute "ERECTED_def" has value "def". Not so for 
    300 "LENGTH". Its undefined value is replaced by the average (1567) and the new 
    301 attribute has value "undef". The undefined discrete attribute "CLEAR-G" (and 
    302 all other undefined discrete attributes) is assigned the value "NA". 
     266:samp:`imputed` having a few additional ones). Comparing 
     267:samp:`original.domain[0] == imputed.domain[0]` will result in False. While 
     268the names are the same, they represent different features. Writing 
     269:samp:`imputed[i]` would fail since :samp:`imputed` has no attribute 
     270:samp:`i`, but it has an attribute with the same name. Using 
     271:samp:`i.name` to index the attributes of 
     272:samp:`imputed` will work, yet it is not fast. If used frequently, it is 
     273better to compute the index with :samp:`imputed.domain.index(i.name)`. 
     274 
     275For continuous features, there is an additional feature with name suffix 
     276"_def", which is accessible by :samp:`i.name+"_def"`. The value of the first 
     277continuous feature "ERECTED" remains 1874, and the additional attribute 
     278"ERECTED_def" has value "def". The undefined value  in "LENGTH" is replaced 
     279by the average (1567) and the new attribute has value "undef". The 
     280undefined discrete attribute  "CLEAR-G" (and all other undefined discrete 
     281attributes) is assigned the value "NA". 
    303282 
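The transformation described for :class:`ImputerConstructor_asValue` can be sketched in plain Python. This is an illustration of the idea only and does not use Orange's domain machinery: rows are dicts, ``None`` marks an unknown, the function name ``as_value`` is hypothetical, and the three rows are made-up toy data rather than values from ``bridges.tab``.

```python
def as_value(rows, discrete, continuous):
    """Return copies of rows in which unknowns (None) are made explicit."""
    # Average of the known values of each continuous feature.
    averages = {}
    for name in continuous:
        known = [row[name] for row in rows if row[name] is not None]
        averages[name] = sum(known) / float(len(known)) if known else None

    result = []
    for row in rows:
        new_row = dict(row)
        # Discrete features gain one extra value, "NA", for unknowns.
        for name in discrete:
            if row[name] is None:
                new_row[name] = "NA"
        # Continuous features get a "_def" companion and an averaged value.
        for name in continuous:
            missing = row[name] is None
            new_row[name + "_def"] = "undef" if missing else "def"
            if missing:
                new_row[name] = averages[name]
        result.append(new_row)
    return result


# Toy data, invented for this sketch.
rows = [
    {"CLEAR-G": "G", "LENGTH": 1000.0},
    {"CLEAR-G": None, "LENGTH": None},
    {"CLEAR-G": "N", "LENGTH": 2000.0},
]
imputed = as_value(rows, discrete=["CLEAR-G"], continuous=["LENGTH"])
```

The original rows are left untouched, which mirrors the point made above: the imputed data lives in a new domain whose features only share names with the original ones.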
    304283Using imputers 
    305284============== 
    306285 
    307 To properly use the imputation classes in learning process, they must be 
    308 trained on training examples only. Imputing the missing values and subsequently 
    309 using the data set in cross-validation will give overly optimistic results. 
     286Imputation must run on training data only. Imputing the missing values 
     287and subsequently using the data in cross-validation will give overly 
     288optimistic results. 
    310289 
    311290Learners with imputer as a component 
    312291------------------------------------ 
    313292 
    314 Orange learners that cannot handle missing values will generally provide a slot 
    315 for the imputer component. An example of such a class is 
    316 :obj:`Orange.classification.logreg.LogRegLearner` with an attribute called 
    317 :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it you 
    318 can assign an imputer constructor - one of the above constructors or a specific 
    319 constructor you wrote yourself. When given learning examples, 
    320 :obj:`Orange.classification.logreg.LogRegLearner` will pass them to 
    321 :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get an 
    322 imputer (again some of the above or a specific imputer you programmed). It will 
    323 immediately use the imputer to impute the missing values in the learning data 
    324 set, so it can be used by the actual learning algorithm. Besides, when the 
     293Learners that cannot handle missing values provide a slot for the imputer 
     294component. An example of such a class is 
     295:obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called 
     296:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor`. 
     297 
     298When given learning instances, 
     299:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to 
     300:obj:`~Orange.classification.logreg.LogRegLearner.imputerConstructor` to get 
     301an imputer and use it to impute the missing values in the learning data. 
     302Imputed data is then used by the actual learning algorithm. Also, when a 
    325303classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed, 
    326 the imputer will be stored in its attribute 
     304the imputer is stored in its attribute 
    327305:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At 
    328 classification, the imputer will be used for imputation of missing values in 
    329 (testing) examples. 
    330  
    331 Although details may vary from algorithm to algorithm, this is how the 
    332 imputation is generally used in Orange's learners. Also, if you write your own 
    333 learners, it is recommended that you use imputation according to the described 
    334 procedure. 
     306classification, the same imputer is used for imputation of missing values 
     307in (testing) examples. 
     308 
     309Details may vary from algorithm to algorithm, but this is how the imputation 
     310is generally used. When writing user-defined learners, 
     311it is recommended to use imputation according to the described procedure. 
    335312 
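The procedure described above (build the imputer from the learning data, impute, learn, then keep the imputer in the classifier) can be sketched as follows. All class names here are hypothetical and none of this is Orange's API: a mean imputer and a nearest-centroid learner are swapped in purely to keep the example short, rows are lists of floats, and ``None`` marks a missing value.

```python
class MeanImputerConstructor(object):
    """Hypothetical imputer constructor: learns column means from data."""

    def __call__(self, rows):
        means = []
        for j in range(len(rows[0])):
            known = [row[j] for row in rows if row[j] is not None]
            means.append(sum(known) / float(len(known)))
        return MeanImputer(means)


class MeanImputer(object):
    def __init__(self, means):
        self.means = means

    def __call__(self, row):
        # Replace each missing value with the mean learned at training time.
        return [m if v is None else v for v, m in zip(row, self.means)]


class CentroidLearner(object):
    """Toy learner with a slot for the imputer constructor."""

    def __init__(self, imputer_constructor):
        self.imputer_constructor = imputer_constructor

    def __call__(self, rows, labels):
        # The imputer is built from the learning data only.
        imputer = self.imputer_constructor(rows)
        sums, counts = {}, {}
        for row, label in zip(rows, labels):
            row = imputer(row)
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        centroids = dict((label, [s / counts[label] for s in acc])
                         for label, acc in sums.items())
        # The same imputer is stored for use at classification time.
        return CentroidClassifier(centroids, imputer)


class CentroidClassifier(object):
    def __init__(self, centroids, imputer):
        self.centroids = centroids
        self.imputer = imputer

    def __call__(self, row):
        row = self.imputer(row)

        def distance(label):
            return sum((a - b) ** 2
                       for a, b in zip(row, self.centroids[label]))

        return min(self.centroids, key=distance)


# Toy data, invented for this sketch.
train_rows = [[1.0, 2.0], [None, 2.0], [9.0, 8.0], [11.0, None]]
train_labels = ["low", "low", "high", "high"]

learner = CentroidLearner(MeanImputerConstructor())
classifier = learner(train_rows, train_labels)
prediction = classifier([None, 7.5])
```

Because the imputer is created inside the learner call, running this learner within cross-validation would rebuild the imputer on each training fold, so test examples never influence the imputed values.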
    336313Wrapper for learning algorithms 