02/07/12 10:29:29 (2 years ago)
tomazc <tomaz.curk@…>

Orange.feature.imputation. Closes #1073.

1 edited


  • docs/reference/rst/Orange.feature.imputation.rst

    r9853 r9890  
    27 ================= 
29 29 :obj:`ImputerConstructor` is the abstract root in a hierarchy of classes
52 52    .. attribute:: defaults
    54     An instance :obj:`Orange.data.Instance` with the default values to be 
     54    An instance :obj:`~Orange.data.Instance` with the default values to be 
55 55    imputed instead of missing values. Examples to be imputed must be from the
56 56    same :obj:`~Orange.data.Domain` as :obj:`defaults`.
71 71 pessimistic imputations.
73 User-defined defaults can be given when constructing a :obj:`~Orange.feature
74 .imputation.Imputer_defaults`. Values that are left unspecified do not get
75 imputed. In the following example "LENGTH" is the
 73 User-defined defaults can be given when constructing a
 74 :obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left
 75 unspecified do not get imputed. In the following example "LENGTH" is the
76 76 only attribute to get imputed, with value 1234:
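The default-value idea can be sketched in plain Python. This is an illustrative sketch only, not the Orange API; the `make_defaults_imputer` helper and the dict-based instances are hypothetical stand-ins for `Imputer_defaults` and Orange data instances. Only features listed among the defaults receive a replacement value; all others keep their missing value.

```python
# Illustrative sketch -- not the Orange API. `make_defaults_imputer` is a
# hypothetical helper mirroring the behaviour of Imputer_defaults.
def make_defaults_imputer(defaults):
    """Return a function replacing missing (None) values using `defaults`."""
    def impute(instance):
        return {feature: (defaults.get(feature) if value is None else value)
                for feature, value in instance.items()}
    return impute

impute = make_defaults_imputer({"LENGTH": 1234})
print(impute({"LENGTH": None, "WIDTH": None, "T-OR-D": "THROUGH"}))
# {'LENGTH': 1234, 'WIDTH': None, 'T-OR-D': 'THROUGH'}
```

"WIDTH" stays missing because no default was supplied for it, which matches the documented behaviour that unspecified values do not get imputed.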
164 164 :obj:`DefaultClassifier`. A float must be given, because integer values are
165 165 interpreted as indexes of discrete features. Discrete feature "T-OR-D" is
    166 imputed using :class:`Orange.classification.ConstantClassifier` which is 
 166 imputed using :class:`~Orange.classification.ConstantClassifier`, which is
167 167 given the index of value "THROUGH" as an argument.
278 278 Using imputers
    279 ============== 
    281 Imputation must run on training data only. Imputing the missing values 
    282 and subsequently using the data in cross-validation will give overly 
    283 optimistic results. 
 281 Imputation is also used by learning algorithms and other methods that are not
 282 capable of handling unknown values.
285 284 Learners with imputer as a component
    286 ------------------------------------ 
    288 Learners that cannot handle missing values provide a slot for the imputer 
    289 component. An example of such a class is 
    290 :obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called 
    291 :obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`. 
    293 When given learning instances, 
 287 Learners that cannot handle missing values should provide a slot
 288 for an imputer constructor. An example of such a class is
 289 :obj:`~Orange.classification.logreg.LogRegLearner` with the attribute
 291 which imputes the average value by default. When given learning instances,
294 292 :obj:`~Orange.classification.logreg.LogRegLearner` will pass them to
295 293 :obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor` to get
296 294 an imputer and use it to impute the missing values in the learning data.
297 295 The imputed data is then used by the actual learning algorithm. Also, when a
    298 classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed, 
 296 classifier :obj:`~Orange.classification.logreg.LogRegClassifier` is constructed,
299 298 the imputer is stored in its attribute
    300 :obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At 
 299 :obj:`~Orange.classification.logreg.LogRegClassifier.imputer`. At
301 300 classification, the same imputer is used for imputation of missing values
302 301 in (testing) examples.
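The component pattern described above can be condensed into a plain-Python sketch. All names here (`Learner`, `Classifier`, `average_imputer_constructor`) are hypothetical stand-ins, not Orange's classes: the learner holds an imputer-constructor slot, trains the imputer on the training data, and the resulting classifier keeps that same imputer for use at classification time.

```python
# Plain-Python sketch (not the Orange API) of a learner with an imputer
# constructor as a component.
def average_imputer_constructor(data):
    """Train an imputer that fills missing (None) values with column averages."""
    averages = [sum(v for v in col if v is not None) /
                max(1, sum(v is not None for v in col))
                for col in zip(*data)]
    def impute(row):
        return [a if v is None else v for v, a in zip(row, averages)]
    return impute

class Learner:
    def __init__(self, imputer_constructor=average_imputer_constructor):
        self.imputer_constructor = imputer_constructor  # the imputer "slot"
    def __call__(self, data, labels):
        imputer = self.imputer_constructor(data)        # trained on training data
        imputed = [imputer(row) for row in data]
        # ... fit the actual model on `imputed` here; we store a trivial rule
        majority = max(set(labels), key=labels.count)
        return Classifier(imputer, majority)

class Classifier:
    def __init__(self, imputer, majority):
        self.imputer, self.majority = imputer, majority
    def __call__(self, row):
        row = self.imputer(row)  # same imputer as was trained during learning
        return self.majority

clf = Learner()([[1.0, None], [3.0, 4.0]], ["a", "a"])
print(clf([None, 2.0]))  # imputes the row, then predicts
```

The point of the pattern is visible in `Classifier.__call__`: testing examples are imputed by the imputer that was fitted on training data only.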
306 305 it is recommended to use imputation according to the described procedure.
 307 The choice of imputer depends on the problem domain. In this
 308 example, we want to impute the minimal value of each feature.
 310 .. literalinclude:: code/imputation-logreg.py
 311    :lines: 7-
 313 The output of this code is::
 315     Without imputation: 0.945
 316     With imputation: 0.954
 318 .. note::
 320   Just one instance of
 321   :obj:`~Orange.classification.logreg.LogRegLearner` is constructed and then
 322   used twice in each fold. First it is given the original instances as they
 323   are and returns an instance of
 324   :obj:`~Orange.classification.logreg.LogRegClassifier`. The second time it is
 325   called by :obj:`imra` and the
 326   :obj:`~Orange.classification.logreg.LogRegClassifier` it returns gets wrapped
 327   into :obj:`~Orange.feature.imputation.Classifier`. There is only one
 328   learner, which produces two different classifiers in each round of
 329   testing.
308 331 Wrapper for learning algorithms
    311 Imputation is also used by learning algorithms and other methods that are not 
    312 capable of handling unknown values. It imputes missing values, 
    313 calls the learner and, if imputation is also needed by the classifier, 
    314 it wraps the classifier that imputes missing values in instances to classify. 
    316 .. literalinclude:: code/imputation-logreg.py 
    317    :lines: 7- 
    319 The output of this code is:: 
    321     Without imputation: 0.945 
    322     With imputation: 0.954 
    324 Even so, the module is somewhat redundant, as all learners that cannot handle 
    325 missing values should, in principle, provide the slots for imputer constructor. 
    326 For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an 
    327 attribute 
    328 :obj:`Orange.classification.logreg.LogRegLearner.imputer_constructor`, 
    329 and even if you don't set it, it will do some imputation by default. 
 334 In a learning/classification process, imputation is needed on two occasions.
 335 Before learning, the imputer needs to process the training examples.
 336 Afterwards, the imputer is called for each instance to be classified. For
 337 example, in cross-validation, imputation should be done on the training folds
 338 only. Imputing the missing values on all data and subsequently performing
 339 cross-validation will give overly optimistic results.
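The rule above — fit the imputer inside each fold, on the training part only — can be sketched generically. This is plain Python with hypothetical helper names (`cross_validate`, `make_mean_imputer`), not Orange's testing machinery:

```python
# Sketch: the imputer is fitted per fold, on training rows only.
def cross_validate(data, k, make_imputer, learn):
    folds = [data[i::k] for i in range(k)]
    predictions = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        imputer = make_imputer(train)            # fitted on the training fold only
        model = learn([imputer(row) for row in train])
        predictions += [model(imputer(row)) for row in test]
    return predictions

def make_mean_imputer(train):
    """Replace a missing (None) value with the training mean of the column."""
    values = [row[0] for row in train if row[0] is not None]
    mean = sum(values) / len(values)
    return lambda row: [mean if row[0] is None else row[0]]

data = [[1.0], [None], [3.0], [4.0]]
identity_model = lambda train: (lambda row: row[0])
print(cross_validate(data, 2, make_mean_imputer, identity_model))
# [1.0, 3.0, 2.0, 4.0]
```

Fitting `make_mean_imputer` on the full `data` instead would leak information from the test fold into the imputed values, which is exactly the overly optimistic setup the text warns against.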
 341 Most of Orange's learning algorithms do not use imputers because they can
 342 appropriately handle missing values. The Bayesian classifier, for instance,
 343 simply skips the corresponding attributes in the formula, while
 344 classification/regression trees have components for handling missing
 345 values in various ways.
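The way a naive Bayes-style classifier can "skip the corresponding attributes in the formula" can be shown with a minimal sketch; `class_score` and its arguments are hypothetical illustrations, not Orange's implementation:

```python
import math

# Sketch (not Orange's code): a naive Bayes-style log score simply omits the
# factor for any feature whose value is missing, instead of imputing it.
def class_score(instance, log_prior, feature_log_probs):
    total = log_prior
    for feature, value in instance.items():
        if value is None:
            continue                    # missing value: skip this factor
        total += feature_log_probs[feature][value]
    return total

probs = {"f1": {"a": math.log(0.5)}, "f2": {"x": math.log(0.25)}}
print(class_score({"f1": "a", "f2": None}, 0.0, probs))  # log(0.5)
```

Because the missing feature contributes no factor at all, no imputer is needed for such a classifier.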
 347 If, for any reason, you want to run these algorithms on imputed data,
 348 you can use this wrapper.
331 350 .. class:: ImputeLearner
    333     Wraps a learner and performs data discretization before learning. 
    335     Most of Orange's learning algorithms do not use imputers because they can 
    336     appropriately handle the missing values. Bayesian classifier, for instance, 
    337     simply skips the corresponding attributes in the formula, while 
    338     classification/regression trees have components for handling the missing 
    339     values in various ways. 
    341     If for any reason you want to use these algorithms to run on imputed data, 
    342     you can use this wrapper. The class description is a matter of a separate 
    343     page, but we shall show its code here as another demonstration of how to 
    344     use the imputers - logistic regression is implemented essentially the same 
    345     as the below classes. 
     352    Wraps a learner and performs data imputation before learning. 
347 354    This is basically a learner, so the constructor will return either an
348 355    instance of :obj:`ImputeLearner` or, if called with examples, an instance
    349     of some classifier. There are a few attributes that need to be set, though. 
     356    of some classifier. 
351 358    .. attribute:: base_learner
355 362    .. attribute:: imputer_constructor
    357     An instance of a class derived from :obj:`ImputerConstructor` (or a class 
    358     with the same call operator). 
     364    An instance of a class derived from :obj:`ImputerConstructor` or a class 
     365    with the same call operator. 
360 367    .. attribute:: dont_impute_classifier
    362     If given and set (this attribute is optional), the classifier will not be 
    363     wrapped into an imputer. Do this if the classifier doesn't mind if the 
    364     examples it is given have missing values. 
 369    If set and a table is given, the classifier is not
 370    wrapped into an imputer. This can be done if the classifier can handle
 371    missing values.
366 373    The learner is best illustrated by its code - here's its complete
376 383                return ImputeClassifier(base_classifier, trained_imputer)
    378     So "learning" goes like this. :obj:`ImputeLearner` will first construct 
    379     the imputer (that is, call :obj:`self.imputer_constructor` to get a (trained) 
    380     imputer. Than it will use the imputer to impute the data, and call the 
     385    During learning, :obj:`ImputeLearner` will first construct 
     386    the imputer. It will then impute the data and call the 
381 387    given :obj:`base_learner` to construct a classifier. For instance,
382 388    :obj:`base_learner` could be a learner for logistic regression and the
383 389    result would be a logistic regression model. If the classifier can handle
    384     unknown values (that is, if :obj:`dont_impute_classifier`, we return it as 
    385     it is, otherwise we wrap it into :obj:`ImputeClassifier`, which is given 
    386     the base classifier and the imputer which it can use to impute the missing 
    387     values in (testing) examples. 
 390    unknown values (that is, if :obj:`dont_impute_classifier` is set),
 391    it is returned as is, otherwise it is wrapped into
 392    :obj:`ImputeClassifier`, which holds the base classifier and
 393    the imputer used to impute the missing values in (testing) data.
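The learning flow just described can be condensed into a plain-Python sketch. These classes are simplified stand-ins for illustration (no Orange tables or weight ids; `zero_imputer_constructor` is hypothetical), but they follow the same branching on `dont_impute_classifier`:

```python
# Sketch of the wrapper pattern described above (not the real Orange classes):
# ImputeLearner trains an imputer, learns on imputed data, and wraps the
# result in ImputeClassifier unless dont_impute_classifier is set.
class ImputeLearner:
    def __init__(self, base_learner, imputer_constructor,
                 dont_impute_classifier=False):
        self.base_learner = base_learner
        self.imputer_constructor = imputer_constructor
        self.dont_impute_classifier = dont_impute_classifier

    def __call__(self, data):
        trained_imputer = self.imputer_constructor(data)
        base_classifier = self.base_learner([trained_imputer(r) for r in data])
        if self.dont_impute_classifier:
            return base_classifier     # classifier copes with missing values itself
        return ImputeClassifier(base_classifier, trained_imputer)

class ImputeClassifier:
    def __init__(self, base_classifier, imputer):
        self.base_classifier = base_classifier
        self.imputer = imputer
    def __call__(self, row):
        return self.base_classifier(self.imputer(row))

def zero_imputer_constructor(data):
    return lambda row: [0 if v is None else v for v in row]

learner = ImputeLearner(base_learner=lambda data: (lambda row: sum(row)),
                        imputer_constructor=zero_imputer_constructor)
clf = learner([[1, None]])
print(clf([None, 2]))  # 2
```

With `dont_impute_classifier` set, the base classifier is returned bare and must handle missing values on its own.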
389 395 .. class:: ImputeClassifier
401 407    .. method:: __call__
    403     This class is even more trivial than the learner. Its constructor accepts 
    404     two arguments, the classifier and the imputer, which are stored into the 
    405     corresponding attributes. The call operator which does the classification 
    406     then looks like this:: 
     409    This class's constructor accepts and stores two arguments, 
     410    the classifier and the imputer. The call operator for classification 
     411    looks like this:: 
408 413        def __call__(self, ex, what=orange.GetValue):
414 419 .. note::
    415    In this setup the imputer is trained on the training data - even if you do 
     420   In this setup the imputer is trained on the training data. Even during 
416 421   cross validation, the imputer will be trained on the right data. In the
    417    classification phase we again use the imputer which was classified on the 
    418    training data only. 
     422   classification phase, the imputer will be used to impute testing data. 
420 424 .. rubric:: Code of ImputeLearner and ImputeClassifier
    422 :obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into 
    423 the instance's  dictionary. You are expected to call it like 
    424 :obj:`ImputeLearner(base_learner=<someLearner>, 
    425 imputer=<someImputerConstructor>)`. When the learner is called with 
    426 examples, it 
    427 trains the imputer, imputes the data, induces a :obj:`base_classifier` by the 
    428 :obj:`base_cearner` and constructs :obj:`ImputeClassifier` that stores the 
 426 The learner is called with
 427 :obj:`Orange.feature.imputation.ImputeLearner(base_learner=<someLearner>, imputer=<someImputerConstructor>)`.
 428 When given examples, it trains the imputer, imputes the data,
 429 induces a :obj:`base_classifier` with the
 430 :obj:`base_learner` and constructs an :obj:`ImputeClassifier` that stores the
429 431 :obj:`base_classifier` and the :obj:`imputer`. For classification, the missing
430 432 values are imputed and the classifier's prediction is returned.
    432 Note that this code is slightly simplified, although the omitted details handle 
 434 This is slightly simplified code; the omitted details handle
433 435 non-essential technical issues that are unrelated to imputation::
456 458            return self.base_classifier(self.imputer(ex), what)
    458 .. rubric:: Example 
    460 Although most Orange's learning algorithms will take care of imputation 
    461 internally, if needed, it can sometime happen that an expert will be able to 
    462 tell you exactly what to put in the data instead of the missing values. In this 
    463 example we shall suppose that we want to impute the minimal value of each 
    464 feature. We will try to determine whether the naive Bayesian classifier with 
    465 its  implicit internal imputation works better than one that uses imputation by 
    466 minimal values. 
    468 :download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`): 
    470 .. literalinclude:: code/imputation-minimal-imputer.py 
    471     :lines: 7- 
    473 Should ouput this:: 
    475     Without imputation: 0.903 
    476     With imputation: 0.899 
    478 .. note:: 
    479    Note that we constructed just one instance of \ 
    480    :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is 
    481    used twice in each fold, once it is given the examples as they are (and 
    482    returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`. 
    483    The second time it is called by :obj:`imba` and the \ 
    484    :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped 
    485    into :obj:`Orange.feature.imputation.Classifier`. We thus have only one 
    486    learner, but which produces two different classifiers in each round of 
    487    testing. 
489 460 Write your own imputer
    490 ====================== 
    492 Imputation classes provide the Python-callback functionality (not all Orange 
    493 classes do so, refer to the documentation on `subtyping the Orange classes 
    494 in Python <callbacks.htm>`_ for a list). If you want to write your own 
    495 imputation constructor or an imputer, you need to simply program a Python 
    496 function that will behave like the built-in Orange classes (and even less, 
    497 for imputer, you only need to write a function that gets an example as 
    498 argument, imputation for example tables will then use that function). 
    500 You will most often write the imputation constructor when you have a special 
    501 imputation procedure or separate procedures for various attributes, as we've 
    502 demonstrated in the description of 
    503 :obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only 
    504 need to pack everything we've written there to an imputer constructor that 
    505 will accept a data set and the id of the weight meta-attribute (ignore it if 
    506 you will, but you must accept two arguments), and return the imputer (probably 
    507 :obj:`Orange.feature.imputation.Imputer_model`. The benefit of implementing an 
    508 imputer constructor as opposed to what we did above is that you can use such a 
    509 constructor as a component for Orange learners (like logistic regression) or 
    510 for wrappers from module orngImpute, and that way properly use the in 
    511 classifier testing procedures. 
 463 Imputation classes provide the Python-callback functionality. The simplest
 464 way to write custom imputation constructors or imputers is to write a Python
 465 function that behaves like the built-in Orange classes. For imputers, it is
 466 enough to write a function that takes an instance as its argument. Imputation
 467 for data tables will then use that function.
 469 Special imputation procedures, or separate procedures for various attributes,
 470 as demonstrated in the description of
 472 are achieved by encoding them in a constructor that accepts a data table and
 473 the id of the weight meta-attribute, and returns the imputer. The benefit of
 474 implementing an imputer constructor is that you can use it as a component
 475 for learners (for example, in logistic regression) or wrappers, and that way
 476 properly use the classifier in testing procedures.
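A custom imputer constructor written as a plain callable can be sketched like this. The function name and the minimal-value strategy are illustrative assumptions (the text mentions minimal-value imputation as an example), not the Orange API; what matters is the interface: accept a data table and the weight meta-attribute id, and return an imputer.

```python
# Sketch of a custom imputer constructor (hypothetical, not the Orange API).
# The two-argument signature matches the interface a constructor must expose
# so it can be plugged into learners or wrappers as a component.
def minimal_imputer_constructor(data, weight_id=None):
    """Build an imputer replacing missing (None) values with the per-column
    minimum seen in the training data; weight_id is accepted for interface
    compatibility and ignored here."""
    minima = [min(v for v in col if v is not None) for col in zip(*data)]
    def imputer(row):
        return [m if v is None else v for v, m in zip(row, minima)]
    return imputer

impute = minimal_imputer_constructor([[2.0, 5.0], [1.0, None], [3.0, 7.0]])
print(impute([None, None]))  # [1.0, 5.0]
```

Because the constructor is just a two-argument callable returning a callable, a learner or wrapper can invoke it on training data and carry the returned imputer into the classification phase.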
     481    This was commented out: 
     482    Examples 
     483    -------- 
     485    Missing values sometimes have a special meaning, so they need to be replaced 
     486    by a designated value. Sometimes we know what to replace the missing value 
     487    with; for instance, in a medical problem, some laboratory tests might not be 
     488    done when it is known what their results would be. In that case, we impute 
 489    a certain fixed value instead of the missing one. In the most complex case, we assign
     490    values that are computed based on some model; we can, for instance, impute the 
     491    average or majority value or even a value which is computed from values of 
 492    other, known features, using a classifier.
     494    In general, imputer itself needs to be trained. This is, of course, not needed 
     495    when the imputer imputes certain fixed value. However, when it imputes the 
     496    average or majority value, it needs to compute the statistics on the training 
     497    examples, and use it afterwards for imputation of training and testing 
     498    examples. 
     500    While reading this document, bear in mind that imputation is a part of the 
     501    learning process. If we fit the imputation model, for instance, by learning 
     502    how to predict the feature's value from other features, or even if we 
     503    simply compute the average or the minimal value for the feature and use it 
     504    in imputation, this should only be done on learning data. Orange 
     505    provides simple means for doing that. 
     507    This page will first explain how to construct various imputers. Then follow 
     508    the examples for `proper use of imputers <#using-imputers>`_. Finally, quite 
     509    often you will want to use imputation with special requests, such as certain 
     510    features' missing values getting replaced by constants and other by values 
     511    computed using models induced from specified other features. For instance, 
     512    in one of the studies we worked on, the patient's pulse rate needed to be 
     513    estimated using regression trees that included the scope of the patient's 
     514    injuries, sex and age, some attributes' values were replaced by the most 
     515    pessimistic ones and others were computed with regression trees based on 
     516    values of all features. If you are using learners that need the imputer as a 
     517    component, you will need to `write your own imputer constructor 
     518    <#write-your-own-imputer-constructor>`_. This is trivial and is explained at 
     519    the end of this page. 