.. Changeset of 02/06/12 13:17:48 by tomazc <tomaz.curk@…>: updated documentation on
   Orange.feature.imputation (docs/reference/rst/Orange.feature.imputation.rst,
   r9372 -> r9806).
.. py:currentmodule:: Orange.feature.imputation

.. index:: imputation

.. index::
   single: feature; value imputation

Imputation (``imputation``)
===========================
Imputation replaces missing feature values with appropriate values, in this
case with minimal values:

.. literalinclude:: code/imputation-values.py
   :lines: 7-

The output of this code is::

    Example with missing values
    ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
    Imputed values:
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
    ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
:obj:`ImputerConstructor` is the abstract root in the hierarchy of classes
that take training data and construct an instance of a class derived from
:obj:`Imputer`. When an :obj:`Imputer` is called with an
:obj:`Orange.data.Instance`, it returns a new example with the missing
values imputed (leaving the original example intact). If the imputer is
called with an :obj:`Orange.data.Table`, it returns a new example table
with imputed instances.
.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Indicates whether to impute the class value. Default is True.

    .. attribute:: deterministic

    Indicates whether to initialize the random generator with the example's
    CRC, so that the same example is always imputed the same values. Default
    is False.
Simple imputation
-----------------

Simple imputers always impute the same value for a particular attribute,
disregarding the values of other attributes. They all use the same class,
:obj:`Imputer_defaults`.
.. class:: Imputer_defaults

    .. attribute::  defaults

    An instance of :obj:`Orange.data.Instance` with the default values to be
    imputed instead of the missing ones. Examples to be imputed must be from
    the same domain as :obj:`defaults`.

Instances of this class can be constructed by
:obj:`ImputerConstructor_minimal`, :obj:`ImputerConstructor_maximal` and
:obj:`ImputerConstructor_average`.
For continuous features, they will impute the smallest, largest or the
average value encountered in the training examples. For discrete features,
they will impute the lowest value (the one with index 0, e.g.
``attr.values[0]``), the highest (``attr.values[-1]``), or the most common
value encountered in the data.
The first two imputers will mostly be used when the discrete values are
ordered according to their impact on the class (for instance, possible
values for symptoms of some disease can be ordered according to their
seriousness). The minimal and maximal imputers will then represent
optimistic and pessimistic imputations.
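For intuition, minimal-value imputation can be sketched in a few lines of
plain Python. This is only an illustration of the idea, not Orange's
implementation; the data layout (rows as lists, ``None`` for a missing
value) is an assumption of the sketch.

```python
# Sketch: for every column, replace None with the smallest value
# seen in the training rows.

def train_minimal_imputer(rows):
    """Return per-column minima computed on the training rows."""
    n_cols = len(rows[0])
    minima = []
    for c in range(n_cols):
        seen = [r[c] for r in rows if r[c] is not None]
        minima.append(min(seen) if seen else None)
    return minima

def impute(row, minima):
    """Return a new row with None entries replaced by the column minima."""
    return [m if v is None else v for v, m in zip(row, minima)]

train = [[1.0, 5.0], [3.0, None], [2.0, 7.0]]
minima = train_minimal_imputer(train)
print(impute([None, None], minima))   # -> [1.0, 5.0]
```

A maximal or average imputer differs only in the per-column statistic.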
The following code loads the bridges data and imputes the values first in a
single example and then in the whole table.

:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23
This example shows what the imputer does, not how it is to be used. Do not
impute all the data and then use it for cross-validation. As warned at the
top of this page, see the instructions for actual `use of
imputers <#using-imputers>`_.
.. note:: :obj:`ImputerConstructor` is another class with a dual-purpose
  constructor: if it is given the data, it returns an :obj:`Imputer` - the
  above call is equivalent to calling
  :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`.
You can also construct an :obj:`Orange.feature.imputation.Imputer_defaults`
yourself and specify your own defaults. Or leave some values unspecified, in
which case the imputer won't impute them, as in the following example. Here,
the only attribute whose values will get imputed is "LENGTH"; the imputed
value will be 1234.

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69
The constructor of :obj:`Orange.feature.imputation.Imputer_defaults` accepts
an argument of type :obj:`Orange.data.Domain` (in which case it constructs
an empty instance for :obj:`defaults`) or an example. (Be careful with this:
:obj:`Orange.feature.imputation.Imputer_defaults` will keep a reference to
the instance, not a copy. But you can make a copy yourself to avoid
problems: instead of ``Imputer_defaults(data[0])`` you may want to write
``Imputer_defaults(Orange.data.Instance(data[0]))``.)
Random imputation
-----------------

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

    .. attribute:: deterministic

    If true (default is False), the random generator is initialized for each
    example using the example's hash value as a seed. This results in the
    same examples always being imputed the same values.
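The effect of :obj:`deterministic` can be sketched in plain Python. This is
an illustration of the idea, not Orange's implementation; seeding a
generator with the example's hash makes the imputed values reproducible per
example within a run.

```python
import random

# Sketch: impute a random value for each missing entry; when
# deterministic, seed the generator with the example's hash so the
# same example always receives the same imputed values.

def random_impute(row, choices, deterministic=True):
    rng = random.Random(hash(tuple(row)) if deterministic else None)
    return [rng.choice(choices[c]) if v is None else v
            for c, v in enumerate(row)]

choices = [["A", "B"], [1, 2, 3]]
a = random_impute(["A", None], choices)
b = random_impute(["A", None], choices)
print(a == b)   # -> True (same example, same imputed values)
```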
Model-based imputation
----------------------

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the attribute's value from the
    values of other attributes. :obj:`ImputerConstructor_model` is given a
    learning algorithm (two, actually - one for discrete and one for
    continuous attributes) and constructs a classifier for each attribute.
    The constructed imputer :obj:`Imputer_model` stores a list of
    classifiers, which are used when needed.

    .. attribute:: learner_discrete, learner_continuous

    Learner for discrete and for continuous attributes. If either is
    missing, the attributes of the corresponding type won't get imputed.

    .. attribute:: use_class

    Tells whether the imputer is allowed to use the class value. As this is
    most often undesired, this option is by default set to False. It can
    however be useful in a more complex design in which we would use one
    imputer for learning examples (this one would use the class value) and
    another for testing examples (which would not use the class value, as it
    is unavailable at that moment).
.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute of the
    examples whose values are to be imputed. The :obj:`classVar`'s of the
    models should equal the examples' attributes. If any classifier is
    missing (that is, the corresponding element of the list is
    :obj:`None`), the corresponding attribute's values will not be imputed.
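The idea of one model per attribute can be sketched in plain Python. This
is an illustration only: the trivial "always predict the training mean"
model stands in for a real classifier or regressor trained on the other
attributes.

```python
# Sketch: build one "model" per column and consult it only where
# the value is missing.

def column_mean_model(rows, col):
    """A trivial stand-in for a learned model: always predicts the
    training mean of the column (a real setup would train a model
    on the remaining columns)."""
    seen = [r[col] for r in rows if r[col] is not None]
    mean = sum(seen) / len(seen)
    return lambda row: mean

def build_imputer(rows):
    return [column_mean_model(rows, c) for c in range(len(rows[0]))]

def impute(row, models):
    return [models[c](row) if v is None else v for c, v in enumerate(row)]

train = [[1.0, 10.0], [2.0, None], [3.0, 20.0]]
models = build_imputer(train)
print(impute([None, 4.0], models))   # -> [2.0, 4.0]
```

A ``None`` entry in the model list would correspond to "do not impute this
attribute", as described above.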
.. rubric:: Examples

The following imputer predicts the missing attribute values using
classification and regression trees with a minimum of 20 examples in a
leaf. Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76
We could even use the same learner for discrete and continuous attributes,
as :class:`Orange.classification.tree.TreeLearner` checks the class type
and constructs regression or classification trees accordingly. The common
parameters, such as the minimal number of examples in leaves, are used in
both cases.

You can also use different learning algorithms for discrete and continuous
attributes. A common setup would be to use
:class:`Orange.classification.bayes.BayesLearner` for discrete and
:class:`Orange.regression.mean.MeanLearner` (which just remembers the
average) for continuous attributes. Part of
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94
You can also construct an :class:`Imputer_model` yourself. You will do this
if different attributes need different treatment. Brace for an example that
is a bit more complex. First we construct an :class:`Imputer_model` and
initialize an empty list of models. The following code snippets are from
:download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 108-109
Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and
"THROUGH". Since "LANES" is continuous, it suffices to construct a
:obj:`DefaultClassifier` with the default value 2.0 (don't forget the
decimal part, or else Orange will think you are talking about the index of
a discrete value - how could it tell?). For the discrete attribute
"T-OR-D", we could construct a
:class:`Orange.classification.ConstantClassifier` and give the index of the
value "THROUGH" as an argument. But we shall do it more elegantly, by
constructing an :class:`Orange.data.Value`. Both classifiers are stored at
the appropriate places in :obj:`imputer.models`.

.. literalinclude:: code/imputation-complex.py
    :lines: 110-112
"LENGTH" will be computed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of
course). Note that we initialized the domain by simply giving a list with
the names of the attributes, with the domain as an additional argument in
which Orange will look for the named attributes.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119
We printed the tree just to see what it looks like::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528
Small and nice. Now for "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done with
:class:`Orange.classifier.ClassifierByLookupTable` - it would be faster
than what we plan here; see the corresponding documentation on lookup
classifiers. Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128
:obj:`compute_span` could also be written as a class, if you'd prefer that.
It is important that it behaves like a classifier, that is, gets an example
and returns a value. The second argument tells, as usual, what the caller
expects the classifier to return - a value, a distribution or both. Since
the caller, :obj:`Imputer_model`, always wants values, we shall ignore the
argument (at the risk of having problems in the future, when imputers might
handle distributions as well).
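The "gets an example, returns a value" contract can be sketched in plain
Python. The column position and the rule below are made up for
illustration; only the shape of the callable matters.

```python
# Sketch of a classifier-like callable: takes an example, returns a value.

MATERIAL = 0   # hypothetical column position of "MATERIAL"

def compute_span(example, return_what=None):
    """Impute "SPAN" from "MATERIAL": wooden bridges are short,
    everything else medium. The second argument is accepted and
    ignored, as discussed above."""
    return "SHORT" if example[MATERIAL] == "WOOD" else "MEDIUM"

print(compute_span(["WOOD", None]))   # -> SHORT
print(compute_span(["IRON", None]))   # -> MEDIUM
```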
Missing values as special values
--------------------------------

Missing values sometimes have a special meaning: the fact that something
was not measured can itself tell a lot. Be cautious, however, when using
such values in decision models; if the decision not to measure something
(for instance, performing a laboratory test on a patient) is based on the
expert's knowledge of the class value, such unknown values clearly should
not be used in models.
.. class:: ImputerConstructor_asValue

    Constructs a new domain in which each discrete attribute is replaced
    with a new attribute that has one value more: "NA". The new attribute
    computes its values on the fly from the old one, copying the normal
    values and replacing the unknowns with "NA".

    For continuous attributes, it constructs a two-valued discrete
    attribute with values "def" and "undef", telling whether the continuous
    attribute was defined or not. The attribute's name will equal the
    original's with "_def" appended. The original continuous attribute will
    remain in the domain and its unknowns will be replaced by averages.

    :class:`ImputerConstructor_asValue` has no specific attributes.

    It constructs an :class:`Imputer_asValue` (I bet you wouldn't guess).
    The imputer converts the example into the new domain, which imputes the
    values for discrete attributes. If continuous attributes are present,
    it will also replace their values by the averages.
.. class:: Imputer_asValue

    .. attribute:: domain

        The domain with the new attributes constructed by
        :class:`ImputerConstructor_asValue`.

    .. attribute:: defaults

        Default values for continuous attributes. Present only if there are
        any.
The following code shows what this imputer actually does to the domain.
Part of :download:`imputation-complex.py <code/imputation-complex.py>` (uses :download:`bridges.tab <code/bridges.tab>`):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T
Seemingly, the two examples have the same attributes (with :samp:`imputed`
having a few additional ones). But if you check this with
:samp:`original.domain[0] == imputed.domain[0]`, you will see that the
first impression is False. The attributes only have the same names; they
are different attributes, since Orange identifies attributes by their
descriptors, not by their names.

Therefore, if we wrote :samp:`imputed[i]` the program would fail, since
:samp:`imputed` has no attribute :samp:`i`. But it has an attribute with
the same name (which even usually has the same value). We therefore use
:samp:`i.name` to index the attributes of :samp:`imputed`. (Using names
for indexing is not fast, though; if you do it a lot, compute the integer
index first.)
For continuous attributes, there is an additional attribute with "_def"
appended; we get it with :samp:`i.name+"_def"`.

The first continuous attribute, "ERECTED", is defined. Its value remains
1874 and the additional attribute "ERECTED_def" has the value "def". Not so
for "LENGTH": its undefined value is replaced by the average (1567) and the
new attribute has the value "undef". The undefined discrete attribute
"CLEAR-G" (and every other undefined discrete attribute) is assigned the
value "NA".
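The whole transformation can be sketched in plain Python. This is an
illustration of the idea, not Orange's implementation; rows are lists with
``None`` for a missing value, and discrete columns are identified by index.

```python
# Sketch of "missing value as a value": discrete columns get an extra
# "NA" value; continuous columns get a companion "def"/"undef" indicator
# and their unknowns become training averages.

def as_value(rows, discrete_cols):
    n_cols = len(rows[0])
    means = {}                       # training averages, continuous only
    for c in range(n_cols):
        if c not in discrete_cols:
            seen = [r[c] for r in rows if r[c] is not None]
            means[c] = sum(seen) / len(seen)

    def transform(row):
        out = []
        for c, v in enumerate(row):
            if c in discrete_cols:
                out.append("NA" if v is None else v)
            else:
                out.append(means[c] if v is None else v)
                out.append("undef" if v is None else "def")
        return out

    return transform

train = [["WOOD", 1000.0], ["IRON", None], [None, 2000.0]]
t = as_value(train, discrete_cols={0})
print(t(["IRON", None]))   # -> ['IRON', 1500.0, 'undef']
print(t([None, 1200.0]))   # -> ['NA', 1200.0, 'def']
```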
Using imputers
--------------

To properly use imputation classes in a learning process, they must be
trained on training examples only. Imputing the missing values and then
using the data set in cross-validation will give overly optimistic results.
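In plain Python, the correct fold-wise procedure looks like this. This is a
sketch using a minimal-value imputer over lists with ``None`` for missing
values, not Orange code; the point is that the imputer is trained inside
each fold, on the training part only.

```python
# Sketch: train the imputer on the training part of each fold only,
# then apply it to the held-out part.

def minima_imputer(rows):
    n = len(rows[0])
    mins = [min(r[c] for r in rows if r[c] is not None) for c in range(n)]
    return lambda row: [m if v is None else v for v, m in zip(row, mins)]

data = [[1.0, None], [2.0, 5.0], [None, 6.0], [4.0, 7.0]]
results = []
for fold in range(2):
    test_part = [r for i, r in enumerate(data) if i % 2 == fold]
    train_part = [r for i, r in enumerate(data) if i % 2 != fold]
    imp = minima_imputer(train_part)   # trained on the training part only
    results.append([imp(r) for r in test_part])

print(results[0])   # -> [[1.0, 5.0], [2.0, 6.0]]
print(results[1])   # -> [[2.0, 5.0], [4.0, 7.0]]
```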
Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values will generally provide a
slot for the imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner`, with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. To it
you can assign an imputer constructor - one of the above constructors or a
specific constructor you wrote yourself. When given learning examples,
:obj:`Orange.classification.logreg.LogRegLearner` will pass them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get
an imputer (again, one of the above or a specific imputer you programmed).
It will immediately use the imputer to impute the missing values in the
learning data set, so they can be used by the actual learning algorithm.
Moreover, when the classifier
:obj:`Orange.classification.logreg.LogRegClassifier` is constructed, the
imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification, the imputer is used to impute the missing values in
(testing) examples.
Although details may vary from algorithm to algorithm, this is how
imputation is generally used in Orange's learners. If you write your own
learners, it is recommended that you use imputation according to the
described procedure.

Wrapper for learning algorithms
-------------------------------
Imputation is used by learning algorithms and other methods that cannot
handle unknown values: the wrapper imputes the missing values, calls the
learner and, if imputation is also needed by the classifier, wraps the
classifier into another wrapper that imputes missing values in examples to
be classified.

.. literalinclude:: code/imputation-logreg.py
   :lines: 7-

The output of this code is::

    Without imputation: 0.945
    With imputation: 0.954

Even so, the module is somewhat redundant, as all learners that cannot
handle missing values should, in principle, provide slots for an imputer
constructor. For instance,
:obj:`Orange.classification.logreg.LogRegLearner` has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and
even if you don't set it, it will do some imputation by default.
.. class:: ImputeLearner

    Wraps a learner and performs data imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they
    can appropriately handle the missing values themselves. The Bayesian
    classifier, for instance, simply skips the corresponding attributes in
    the formula, while classification/regression trees have components for
    handling missing values in various ways.

    If for any reason you want these algorithms to run on imputed data, you
    can use this wrapper. The class description is a matter for a separate
    page, but we show its code here as another demonstration of how to use
    the imputers - logistic regression is implemented essentially the same
    way as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an
    instance of some classifier. There are a few attributes that need to be
    set, though.
    .. attribute:: base_learner

    A wrapped learner.

    .. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a
    class with the same call operator).

    .. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not
    be wrapped into an imputer. Do this if the classifier doesn't mind if
    the examples it is given have missing values.

    The learner is best illustrated by its code - here is its complete
    :obj:`__call__` method::
        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            if self.dont_impute_classifier:
                return base_classifier
            else:
                return ImputeClassifier(base_classifier, trained_imputer)
    So "learning" goes like this: :obj:`ImputeLearner` first constructs the
    imputer (that is, calls :obj:`self.imputer_constructor` to get a
    trained imputer). Then it uses the imputer to impute the data, and
    calls the given :obj:`base_learner` to construct a classifier. For
    instance, :obj:`base_learner` could be a learner for logistic
    regression and the result would be a logistic regression model. If the
    classifier can handle unknown values (that is, if
    :obj:`dont_impute_classifier` is set), we return it as it is; otherwise
    we wrap it into :obj:`ImputeClassifier`, which is given the base
    classifier and the imputer it can use to impute the missing values in
    (testing) examples.
.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: baseClassifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for the imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor
    accepts two arguments, the classifier and the imputer, which are stored
    in the corresponding attributes. The call operator, which does the
    classification, then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the resulting example to the base classifier.
.. note::
   In this setup the imputer is trained on the training data - even if you
   do cross-validation, the imputer will be trained on the right data. In
   the classification phase, we again use the imputer that was trained on
   the training data only.
.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments
into the instance's dictionary. You are expected to call it like
``ImputeLearner(base_learner=<someLearner>,
imputer=<someImputerConstructor>)``. When the learner is called with
examples, it trains the imputer, imputes the data, induces a
:obj:`base_classifier` with the :obj:`base_learner` and constructs an
:obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

This code is slightly simplified; the omitted details handle non-essential
technical issues unrelated to imputation::
    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)
.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally if needed, it can sometimes happen that an expert can tell you
exactly what to put in the data instead of the missing values. In this
example we suppose that we want to impute the minimal value of each
feature. We will try to determine whether the naive Bayesian classifier
with its implicit internal imputation works better than one that uses
imputation by minimal values.

:download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

Should output this::

    Without imputation: 0.903
    With imputation: 0.899
.. note::
   We constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance
   is used twice in each fold. Once it is given the examples as they are
   (and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`); the second time it
   is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.ImputeClassifier`. We thus have
   only one learner, but it produces two different classifiers in each
   round of testing.
Write your own imputer
----------------------

Imputation classes provide the Python-callback functionality (not all
Orange classes do; refer to the documentation on `subtyping the Orange
classes in Python <callbacks.htm>`_ for a list). To write your own
imputation constructor or imputer, you simply need to program a Python
function that behaves like the built-in Orange classes. For an imputer,
even less is needed: you only have to write a function that gets an
example as an argument; imputation of example tables will then use that
function.
You will most often write an imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as
we've demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically
only need to pack everything we've written there into an imputer
constructor that accepts a data set and the id of the weight
meta-attribute (ignore it if you will, but you must accept two arguments),
and returns the imputer (probably an
:obj:`Orange.feature.imputation.Imputer_model`). The benefit of
implementing an imputer constructor, as opposed to what we did above, is
that you can use such a constructor as a component for Orange learners
(such as logistic regression) or for the wrappers from module orngImpute,
and in that way use it properly in classifier testing procedures.
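The two-argument contract can be sketched in plain Python. This is an
illustration only: ``my_imputer_constructor`` is a made-up name, the data
layout (rows as lists, ``None`` for missing) is an assumption of the
sketch, and a real constructor would return an Orange imputer object
rather than a closure.

```python
# Sketch of the contract described above: an imputer constructor takes
# (data, weight_id) and returns an imputer; an imputer takes an example
# and returns an imputed example.

def my_imputer_constructor(data, weight_id=0):
    """Accepts the data and the weight id (ignored here, but the
    two-argument signature is required) and returns an imputer."""
    n = len(data[0])
    means = []
    for c in range(n):
        seen = [r[c] for r in data if r[c] is not None]
        means.append(sum(seen) / len(seen))

    def my_imputer(example):
        return [means[c] if v is None else v
                for c, v in enumerate(example)]

    return my_imputer

imputer = my_imputer_constructor([[1.0, 2.0], [3.0, None]])
print(imputer([None, None]))   # -> [2.0, 2.0]
```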