Changeset 7611:94c86145af60 in orange


Timestamp:
02/05/11 01:17:11 (3 years ago)
Author:
tomazc <tomaz.curk@…>
Branch:
default
Convert:
ba5e62ccda1000255927b8518ad151ee46e95c83
Message:

Documentation and code refactoring at Bohinj retreat.

Location:
orange
Files:
4 edited

  • orange/Orange/feature/imputation.py

    r7216 r7611  
"""

.. index:: imputation

.. index::
   single: feature; value imputation


Imputation is the procedure of replacing missing feature values with
suitable substitutes. It is needed by methods (learning algorithms and
others) that cannot handle unknown values, for instance logistic
regression.

Missing values sometimes have a special meaning, so they need to be replaced
by a designated value. Sometimes we know what to replace the missing value
with; for instance, in a medical problem, some laboratory tests might not be
done when it is known what their results would be. In that case, we impute a
certain fixed value instead of the missing one. In the most complex case, we
assign values that are computed based on some model; we can, for instance,
impute the average or majority value, or even a value predicted from the
values of other, known features by a classifier.

In a learning/classification process, imputation is needed on two occasions.
Before learning, the imputer needs to process the training examples.
Afterwards, the imputer is called for each example to be classified.

In general, the imputer itself needs to be trained. This is, of course, not
needed when the imputer imputes a certain fixed value. However, when it
imputes the average or majority value, it needs to compute the statistics on
the training examples, and use them afterwards for imputation of training
and testing examples.
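
The idea of a trained imputer can be sketched in plain Python (a toy
stand-in for illustration, not Orange's actual API): training computes
per-feature statistics from the training examples, and the resulting
imputer applies them to any example. ::

```python
# Toy sketch of a trainable imputer: training computes per-feature averages
# on the training examples; the returned imputer fills None with them.
def train_average_imputer(rows):
    n_features = len(rows[0])
    averages = []
    for i in range(n_features):
        known = [row[i] for row in rows if row[i] is not None]
        averages.append(sum(known) / len(known))

    def impute(row):
        # Return a new row; the original is left intact.
        return [averages[i] if v is None else v for i, v in enumerate(row)]

    return impute

train = [[1.0, 4.0], [3.0, None], [None, 2.0]]
imputer = train_average_imputer(train)
print(imputer([None, None]))  # the statistics come from the training data
```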

While reading this document, bear in mind that imputation is a part of the
learning process. If we fit the imputation model, for instance, by learning
how to predict the feature's value from other features, or even if we simply
compute the average or the minimal value for the feature and use it in
imputation, this should only be done on the learning data. If cross
validation is used for sampling, imputation should be done on training folds
only. Orange provides simple means for doing that.

This page will first explain how to construct various imputers. Then follow
the examples of `proper use of imputers <#using-imputers>`_. Finally, quite
often you will want to use imputation with special requests, such as certain
features' missing values getting replaced by constants and others by values
computed using models induced from specified other features. For instance,
in one of the studies we worked on, the patient's pulse rate needed to be
estimated using regression trees that included the scope of the patient's
injuries, sex and age; some attributes' values were replaced by the most
pessimistic ones and others were computed with regression trees based on
values of all features. If you are using learners that need the imputer as a
component, you will need to `write your own imputer constructor
<#write-your-own-imputer-constructor>`_. This is trivial and is explained at
the end of this page.

Wrapper for learning algorithms
===============================

This wrapper can be used with learning algorithms that cannot handle missing
values: it will impute the missing values using the imputer, call the
learning algorithm and, if the imputation is also needed by the classifier,
wrap the resulting classifier into another wrapper that will impute the
missing values in examples to be classified.

Even so, the module is somewhat redundant, as all learners that cannot
handle missing values should, in principle, provide the slots for imputer
constructor. For instance, :obj:`Orange.classification.logreg.LogRegLearner`
has an attribute
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and
even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

    Wraps a learner and performs imputation before learning.

    Most of Orange's learning algorithms do not use imputers because they
    can appropriately handle the missing values. Bayesian classifier, for
    instance, simply skips the corresponding attributes in the formula,
    while classification/regression trees have components for handling the
    missing values in various ways.

    If for any reason you want to use these algorithms to run on imputed
    data, you can use this wrapper. The class description is a matter of a
    separate page, but we shall show its code here as another demonstration
    of how to use the imputers - logistic regression is implemented
    essentially the same as the classes below.

    This is basically a learner, so the constructor will return either an
    instance of :obj:`ImputeLearner` or, if called with examples, an
    instance of some classifier. There are a few attributes that need to be
    set, though.

    .. attribute:: baseLearner

    A wrapped learner.

    .. attribute:: imputerConstructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a
    class with the same call operator).

    .. attribute:: dontImputeClassifier

    If given and set (this attribute is optional), the classifier will not
    be wrapped into an imputer. Do this if the classifier doesn't mind if
    the examples it is given have missing values.

    The learner is best illustrated by its code - here's its complete
    :obj:`__call__` method::

        def __call__(self, data, weight=0):
            trained_imputer = self.imputerConstructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            baseClassifier = self.baseLearner(imputed_data, weight)
            if self.dontImputeClassifier:
                return baseClassifier
            else:
                return ImputeClassifier(baseClassifier, trained_imputer)

    So "learning" goes like this. :obj:`ImputeLearner` will first construct
    the imputer (that is, call :obj:`self.imputerConstructor`) to get a
    trained imputer. Then it will use the imputer to impute the data, and
    call the given :obj:`baseLearner` to construct a classifier. For
    instance, :obj:`baseLearner` could be a learner for logistic regression
    and the result would be a logistic regression model. If the classifier
    can handle unknown values (that is, if :obj:`dontImputeClassifier` is
    set), we return it as it is; otherwise we wrap it into
    :obj:`ImputeClassifier`, which is given the base classifier and the
    imputer which it can use to impute the missing values in (testing)
    examples.

.. class:: ImputeClassifier

    Objects of this class are returned by :obj:`ImputeLearner` when given
    data.

    .. attribute:: baseClassifier

    A wrapped classifier.

    .. attribute:: imputer

    An imputer for imputation of unknown values.

    .. method:: __call__

    This class is even more trivial than the learner. Its constructor
    accepts two arguments, the classifier and the imputer, which are stored
    into the corresponding attributes. The call operator which does the
    classification then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.baseClassifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes
    the imputed example to the base classifier.

.. note::
   In this setup the imputer is trained on the training data - even if you
   do cross validation, the imputer will be trained on the right data. In
   the classification phase we again use the imputer which was trained on
   the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments
into the instance's dictionary. You are expected to call it like
:obj:`ImputeLearner(baseLearner=<someLearner>,
imputerConstructor=<someImputerConstructor>)`. When the learner is called
with examples, it trains the imputer, imputes the data, induces a
:obj:`baseClassifier` by the :obj:`baseLearner` and constructs
:obj:`ImputeClassifier` that stores the :obj:`baseClassifier` and the
:obj:`imputer`. For classification, the missing values are imputed and the
classifier's prediction is returned.

Note that this code is slightly simplified; the omitted details handle
non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputerConstructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            baseClassifier = self.baseLearner(imputed_data, weight)
            return ImputeClassifier(baseClassifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, baseClassifier, imputer):
            self.baseClassifier = baseClassifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.baseClassifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation
internally, if needed, it can sometimes happen that an expert will be able
to tell you exactly what to put in the data instead of the missing values.
In this example we shall suppose that we want to impute the minimal value
of each feature. We will try to determine whether the naive Bayesian
classifier with its implicit internal imputation works better than one that
uses imputation by minimal values.

`imputation-minimal-imputer.py`_ (uses `voting.tab`_):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

Should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note::
   Note that we constructed just one instance of
   :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance
   is used twice in each fold. Once it is given the examples as they are
   and returns an instance of
   :obj:`Orange.classification.bayes.NaiveClassifier`. The second time it
   is called by :obj:`imba`, and the
   :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped
   into :obj:`Orange.feature.imputation.Classifier`. We thus have only one
   learner, but it produces two different classifiers in each round of
   testing.

Abstract imputers
=================

As is common in Orange, imputation is done by pairs of classes: one that
does the work and another that constructs it. :obj:`ImputerConstructor` is
an abstract root of the hierarchy of classes that get the training data
(with an optional id for weight) and construct an instance of a class
derived from :obj:`Imputer`. An :obj:`Imputer` can be called with an
:obj:`Orange.data.Instance` and it will return a new example with the
missing values imputed (it will leave the original example intact!). If an
imputer is called with an :obj:`Orange.data.Table`, it will return a new
example table with imputed examples.
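
This calling convention - a new example for a single instance, a new table
for a table, originals untouched - can be illustrated with a plain-Python
stand-in (dicts and lists instead of Orange's :obj:`Instance` and
:obj:`Table`; the names here are illustrative only). ::

```python
# Plain-Python stand-in: rows are dicts; the imputer never mutates its input.
def make_imputer(defaults):
    def impute(data):
        if isinstance(data, list):                     # a "table": impute each row
            return [impute(row) for row in data]
        return {k: (defaults[k] if v is None else v)   # a single "example"
                for k, v in data.items()}
    return impute

impute = make_imputer({"LENGTH": 1234, "LANES": 2})
row = {"LENGTH": None, "LANES": 4}
new_row = impute(row)
print(new_row)        # {'LENGTH': 1234, 'LANES': 4}
print(row["LENGTH"])  # None - the original example is left intact
```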

.. class:: ImputerConstructor

    .. attribute:: imputeClass

    Tells whether to impute the class value (default) or not.

Simple imputation
=================

The simplest imputers always impute the same value for a particular
attribute, disregarding the values of other attributes. They all use the
same imputer class, :obj:`Imputer_defaults`.

.. class:: Imputer_defaults

    .. attribute:: defaults

    An example with the default values to be imputed instead of the missing
    ones. Examples to be imputed must be from the same domain as
    :obj:`defaults`.

    Instances of this class can be constructed by
    :obj:`Orange.feature.imputation.ImputerConstructor_minimal`,
    :obj:`Orange.feature.imputation.ImputerConstructor_maximal`,
    :obj:`Orange.feature.imputation.ImputerConstructor_average`.

    For continuous features, they will impute the smallest, largest or the
    average values encountered in the training examples. For discrete
    features, they will impute the lowest (the one with index 0, e.g.
    attr.values[0]), the highest (attr.values[-1]), or the most common
    value encountered in the data. The first two imputers will mostly be
    used when the discrete values are ordered according to their impact on
    the class (for instance, possible values for symptoms of some disease
    can be ordered according to their seriousness). The minimal and maximal
    imputers will then represent optimistic and pessimistic imputations.

    The following code will load the bridges data, and first impute the
    values in a single example and then in the whole table.

`imputation-complex.py`_ (uses `bridges.tab`_):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23

This example shows what the imputer does, not how it is to be used. Don't
impute all the data and then use it for cross-validation. As warned at the
top of this page, see the instructions for actual `use of imputers
<#using-imputers>`_.
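
The statistics these three constructors compute can be sketched without
Orange: minimum, maximum and average for continuous columns, and the most
common value for discrete ones (a simplified stand-in; for discrete values
Orange uses the value order of the attribute, which we approximate here by
sorting). ::

```python
from collections import Counter

# Simplified stand-in for ImputerConstructor_minimal/_maximal/_average.
def column_default(column, discrete, mode="average"):
    known = [v for v in column if v is not None]
    if discrete:
        if mode == "minimal":
            return sorted(set(known))[0]            # "lowest" value
        if mode == "maximal":
            return sorted(set(known))[-1]           # "highest" value
        return Counter(known).most_common(1)[0][0]  # most common value
    if mode == "minimal":
        return min(known)
    if mode == "maximal":
        return max(known)
    return sum(known) / len(known)

lengths = [1000.0, None, 2000.0, 1500.0]
print(column_default(lengths, discrete=False, mode="minimal"))  # 1000.0
print(column_default(lengths, discrete=False, mode="average"))  # 1500.0
```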

.. note:: :obj:`ImputerConstructor` is another class with a dual-purpose
  constructor: if you give the constructor the data, it will return an
  :obj:`Imputer` - the above call is equivalent to calling
  :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`.

You can also construct the :obj:`Orange.feature.imputation.Imputer_defaults`
yourself and specify your own defaults. Or leave some values unspecified, in
which case the imputer won't impute them, as in the following example. Here,
the only attribute whose values will get imputed is "LENGTH"; the imputed
value will be 1234.

`imputation-complex.py`_ (uses `bridges.tab`_):

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

:obj:`Orange.feature.imputation.Imputer_defaults`'s constructor will accept
an argument of type :obj:`Orange.data.Domain` (in which case it will
construct an empty instance for :obj:`defaults`) or an example. Be careful
with this: :obj:`Orange.feature.imputation.Imputer_defaults` will keep a
reference to the instance and not a copy. You can make a copy yourself to
avoid problems: instead of `Imputer_defaults(data[0])` you may want to write
`Imputer_defaults(Orange.data.Instance(data[0]))`.
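
The reference-versus-copy caveat is the usual Python one; here it is in a
plain-dict sketch (the names are illustrative, not Orange's API). ::

```python
# Keeping a reference means later changes to the example leak into the
# imputer's defaults; storing a copy avoids that.
example = {"LENGTH": 1234, "LANES": 2}

defaults_by_ref = example          # reference: shared with the caller
defaults_copied = dict(example)    # copy: safe against later changes

example["LENGTH"] = 9999
print(defaults_by_ref["LENGTH"])   # 9999 - changed under our feet
print(defaults_copied["LENGTH"])   # 1234 - unaffected
```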

Random imputation
=================

.. class:: Imputer_Random

    Imputes random values. The corresponding constructor is
    :obj:`ImputerConstructor_Random`.

    .. attribute:: imputeClass

    Tells whether to impute the class values or not. Defaults to
    :obj:`True`.

    .. attribute:: deterministic

    If true (default is :obj:`False`), the random generator is initialized
    for each example using the example's hash value as a seed. This results
    in the same examples always being imputed the same values.

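The deterministic behaviour can be sketched in plain Python: seeding the
generator with a hash of the example makes repeated imputations of the same
example agree (a toy stand-in; Orange's actual hashing differs, and
Python's string hashing is only stable within one process). ::

```python
import random

# Toy stand-in for Imputer_Random with deterministic=True: the generator is
# seeded from the example itself, so the same example always receives the
# same random imputations (within a single process).
def random_impute(row, choices, deterministic=True):
    rng = random.Random(hash(tuple(row)) if deterministic else None)
    return [rng.choice(choices[i]) if v is None else v
            for i, v in enumerate(row)]

choices = [[1, 2, 3], ["A", "B"]]
row = [None, None]
first = random_impute(row, choices)
second = random_impute(row, choices)
print(first == second)  # True - deterministic seeding
```
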
Model-based imputation
======================

.. class:: ImputerConstructor_model

    Model-based imputers learn to predict the attribute's value from values
    of other attributes. :obj:`ImputerConstructor_model` is given a learning
    algorithm (two, actually - one for discrete and one for continuous
    attributes) and constructs a classifier for each attribute. The
    constructed imputer :obj:`Imputer_model` stores a list of classifiers
    which are used when needed.

    .. attribute:: learnerDiscrete, learnerContinuous

    Learners for discrete and for continuous attributes. If either of them
    is missing, the attributes of the corresponding type won't get imputed.

    .. attribute:: useClass

    Tells whether the imputer is allowed to use the class value. As this is
    most often undesired, this option is by default set to :obj:`False`. It
    can however be useful for a more complex design in which we would use
    one imputer for learning examples (this one would use the class value)
    and another for testing examples (which would not use the class value,
    as it is unavailable at that moment).

.. class:: Imputer_model

    .. attribute:: models

    A list of classifiers, each corresponding to one attribute of the
    examples whose values are to be imputed. The :obj:`classVar`'s of the
    models should equal the examples' attributes. If any classifier is
    missing (that is, the corresponding element of the list is
    :obj:`None`), the corresponding attribute's values will not be imputed.

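The structure of :obj:`Imputer_model` - one predictor per attribute,
applied only where a value is missing - can be sketched in plain Python
with trivial stand-in "models" (plain functions instead of trained
classifiers; all names here are illustrative). ::

```python
# Each "model" predicts one attribute's value from the rest of the example;
# None in the model mapping means that attribute is never imputed.
def model_impute(row, models):
    imputed = dict(row)
    for attr, model in models.items():
        if model is not None and imputed[attr] is None:
            imputed[attr] = model(row)
    return imputed

models = {
    "LANES": lambda row: 2.0,                                    # constant default
    "SPAN": lambda row: "SHORT" if row["TYPE"] == "WOOD" else "MEDIUM",
    "LENGTH": None,                                              # leave unimputed
}
row = {"LANES": None, "SPAN": None, "LENGTH": None, "TYPE": "WOOD"}
print(model_impute(row, models))
```
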
.. rubric:: Examples

The following imputer predicts the missing attribute values using
classification and regression trees with the minimum of 20 examples in a
leaf.

Part of imputation.py (uses bridges.tab)::

    import orngTree
    imputer = orange.ImputerConstructor_model()
    imputer.learnerContinuous = imputer.learnerDiscrete = orngTree.TreeLearner(minSubset=20)
    imputer = imputer(data)

We could even use the same learner for discrete and continuous attributes!
The way this functions is rather tricky. If you desire to know:
:obj:`orngTree.TreeLearner` is a learning algorithm written in Python -
Orange doesn't mind, it will wrap it into a C++ wrapper for Python-written
learners, which then calls back the Python code. When given the examples to
learn from, :obj:`orngTree.TreeLearner` checks the class type. If it's
continuous, it will set the :obj:`orange.TreeLearner` to construct
regression trees, and if it's discrete, it will set the components for
classification trees. The common parameters, such as the minimal number of
examples in leaves, are used in both cases.

You can of course use different learning algorithms for discrete and
continuous attributes. Probably a common setup will be to use
:obj:`BayesLearner` for discrete and :obj:`MajorityLearner` (which just
remembers the average) for continuous attributes, as follows.

Part of imputation.py (uses bridges.tab)::

    imputer = orange.ImputerConstructor_model()
    imputer.learnerContinuous = orange.MajorityLearner()
    imputer.learnerDiscrete = orange.BayesLearner()
    imputer = imputer(data)

You can also construct an :obj:`Imputer_model` yourself. You will do this
if different attributes need different treatment. Brace for an example that
will be a bit more complex. First we shall construct an
:obj:`Imputer_model` and initialize an empty list of models.

Part of imputation.py (uses bridges.tab)::

    imputer = orange.Imputer_model()
    imputer.models = [None] * len(data.domain)

Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and
"THROUGH". Since "LANES" is continuous, it suffices to construct a
:obj:`DefaultClassifier` with the default value 2.0 (don't forget the
decimal part, or else Orange will think you talk about an index of a
discrete value - how could it tell?). For the discrete attribute "T-OR-D",
we could construct a :obj:`DefaultClassifier` and give the index of value
"THROUGH" as an argument. But we shall do it more nicely, by constructing a
:obj:`Value`. Both classifiers will be stored at the appropriate places in
:obj:`imputer.models`. ::

    imputer.models[data.domain.index("LANES")] = orange.DefaultClassifier(2.0)

    tord = orange.DefaultClassifier(orange.Value(data.domain["T-OR-D"], "THROUGH"))
    imputer.models[data.domain.index("T-OR-D")] = tord

"LENGTH" will be computed with a regression tree induced from "MATERIAL",
"SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of
course). Note that we initialized the domain by simply giving a list with
the names of the attributes, with the domain as an additional argument in
which Orange will look for the named attributes. ::

    import orngTree
    len_domain = orange.Domain(["MATERIAL", "SPAN", "ERECTED", "LENGTH"], data.domain)
    len_data = orange.ExampleTable(len_domain, data)
    len_tree = orngTree.TreeLearner(len_data, minSubset=20)
    imputer.models[data.domain.index("LENGTH")] = len_tree
    orngTree.printTxt(len_tree)

We printed the tree just to see what it looks like. ::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for the "SPAN". Wooden bridges and walkways are short,
while the others are mostly medium. This could be done by
:obj:`ClassifierByLookupTable` - this would be faster than what we plan
here; see the corresponding documentation on lookup classifiers. Here we
will do it with a Python function. ::

    spanVar = data.domain["SPAN"]

    def computeSpan(ex, returnWhat):
        if ex["TYPE"] == "WOOD" or ex["PURPOSE"] == "WALK":
            span = "SHORT"
        else:
            span = "MEDIUM"
        return orange.Value(spanVar, span)

    imputer.models[data.domain.index("SPAN")] = computeSpan

:obj:`computeSpan` could also be written as a class, if you'd prefer it.
It's important that it behaves like a classifier, that is, it gets an
example and returns a value. The second argument tells, as usual, what the
caller expects the classifier to return - a value, a distribution or both.
Since the caller, :obj:`Imputer_model`, always wants values, we shall
ignore the argument (at the risk of having problems in the future, when
imputers might handle distributions as well).

Treating the missing values as special values
=============================================

Missing values sometimes have a special meaning. The fact that something
was not measured can sometimes tell a lot. Be, however, cautious when using
such values in decision models; if the decision not to measure something
(for instance performing a laboratory test on a patient) is based on the
expert's knowledge of the class value, such unknown values clearly should
not be used in models.

:obj:`ImputerConstructor_asValue` constructs a new domain in which each
discrete attribute is replaced with a new attribute that has one value
more: "NA". The new attribute will compute its values on the fly from the
old one, copying the normal values and replacing the unknowns with "NA".

For continuous attributes, :obj:`ImputerConstructor_asValue` will construct
a two-valued discrete attribute with values "def" and "undef", telling
whether the continuous attribute was defined or not. The attribute's name
will equal the original's with "_def" appended. The original continuous
attribute will remain in the domain and its unknowns will be replaced by
averages.

:obj:`ImputerConstructor_asValue` has no specific attributes.

The constructed imputer is named :obj:`Imputer_asValue` (I bet you wouldn't
guess). It converts the example into the new domain, which imputes the
values for discrete attributes. If continuous attributes are present, it
will also replace their values by the averages.

.. class:: Imputer_asValue

    .. attribute:: domain

    The domain with the new attributes constructed by
    :obj:`ImputerConstructor_asValue`.

    .. attribute:: defaults

    Default values for continuous attributes. Present only if there are
    any.

Here's a script that shows what this imputer actually does to the domain.

Part of imputation.py (uses bridges.tab)::

    imputer = orange.ImputerConstructor_asValue(data)

    original = data[19]
    imputed = imputer(data[19])

    print original.domain
    print
    print imputed.domain
    print

    for i in original.domain:
        print "%s: %s -> %s" % (original.domain[i].name, original[i], imputed[i.name]),
        if original.domain[i].varType == orange.VarTypes.Continuous:
            print "(%s)" % imputed[i.name+"_def"]
        else:
            print
    print

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D,
    MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH,
    LANES_def, LANES, CLEAR-G, T-OR-D,
    MATERIAL, SPAN, REL-L, TYPE]


    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

Seemingly, the two examples have the same attributes (with :obj:`imputed`
having a few additional ones). If you check this by
:obj:`original.domain[0] == imputed.domain[0]`, you shall see that this
first impression is :obj:`False`. The attributes only have the same names,
but they are different attributes. If you read this page (which is already
a bit advanced), you know that Orange does not really care about the
attribute names.

Therefore, if we wrote :obj:`imputed[i]` the program would fail since
:obj:`imputed` has no attribute :obj:`i`. But it has an attribute with the
same name (which even usually has the same value). We therefore use
:obj:`i.name` to index the attributes of :obj:`imputed`. (Using names for
indexing is not fast, though; if you do it a lot, compute the integer index
with :obj:`imputed.domain.index(i.name)`.)

For continuous attributes, there is an additional attribute with "_def"
appended; we get it by :obj:`i.name+"_def"`. Not really nice, but it works.

The first continuous attribute, "ERECTED", is defined. Its value remains
1874 and the additional attribute "ERECTED_def" has the value "def". Not so
for "LENGTH". Its undefined value is replaced by the average (1567) and the
new attribute has the value "undef". The undefined discrete attribute
"CLEAR-G" (and all other undefined discrete attributes) is assigned the
value "NA".

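The transformation can be mimicked in plain Python: discrete unknowns
become "NA", while each continuous attribute gains a "_def" indicator and
has its unknowns replaced by the training average (a simplified dict-based
stand-in for :obj:`Imputer_asValue`; names are illustrative). ::

```python
# Simplified stand-in for Imputer_asValue on dict-rows: continuous columns
# are those with a known training average.
def as_value(row, continuous_averages):
    out = {}
    for attr, value in row.items():
        if attr in continuous_averages:            # continuous attribute
            out[attr + "_def"] = "def" if value is not None else "undef"
            out[attr] = value if value is not None else continuous_averages[attr]
        else:                                      # discrete attribute
            out[attr] = value if value is not None else "NA"
    return out

averages = {"LENGTH": 1567.0}
row = {"LENGTH": None, "CLEAR-G": None, "T-OR-D": "THROUGH"}
print(as_value(row, averages))
```
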
Using imputers
==============

To properly use the imputation classes in a learning process, they must be
trained on training examples only. Imputing the missing values and
subsequently using the data set in cross-validation will give overly
optimistic results.
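
The rule - fit the imputer on the training fold only, then apply the same
trained imputer to both folds - looks like this in a plain-Python sketch (a
toy average imputer standing in for any imputer constructor). ::

```python
# Correct use in a train/test split: statistics come from the training fold
# only, and the same trained imputer is applied to both folds.
def fit_average(rows):
    cols = list(zip(*rows))
    avgs = [sum(v for v in c if v is not None) /
            max(1, sum(v is not None for v in c)) for c in cols]
    return lambda row: [avgs[i] if v is None else v for i, v in enumerate(row)]

train = [[1.0], [3.0]]
test = [[None]]                      # unseen fold with a missing value
imputer = fit_average(train)         # trained on the training fold only
imputed_train = [imputer(r) for r in train]
imputed_test = [imputer(r) for r in test]
print(imputed_test)  # [[2.0]] - the training average, not a test statistic
```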
     577 
     578Learners with imputer as a component 
     579------------------------------------ 
     580 
Orange learners that cannot handle missing values will generally provide a
slot for the imputer component. An example of such a class is
:obj:`Orange.classification.logreg.LogRegLearner` with an attribute called
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`. You can
assign to it an imputer constructor: one of the constructors above or a
specific constructor you wrote yourself. When given learning examples,
:obj:`Orange.classification.logreg.LogRegLearner` passes them to
:obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor` to get
an imputer (again, one of the above or a specific imputer you programmed). It
immediately uses the imputer to impute the missing values in the learning
data set, so the actual learning algorithm can use it. Besides, when the
classifier :obj:`Orange.classification.logreg.LogRegClassifier` is
constructed, the imputer is stored in its attribute
:obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At
classification time, that imputer is used for imputation of missing values in
(testing) examples.
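The learner/classifier pattern just described can be sketched in plain
Python. The class names here are hypothetical (they mirror, but are not,
Orange's ``ImputeLearner``/``LogRegLearner`` machinery), and models and rows
are simple callables and dicts:

```python
class ImputingClassifier:
    """Holds the trained model together with the imputer fitted on the
    training data; imputes each example before classifying it."""
    def __init__(self, model, imputer):
        self.model = model
        self.imputer = imputer

    def __call__(self, row):
        return self.model(self.imputer(row))


class ImputingLearner:
    """Sketch of the pattern: call imputer_constructor on the training
    data, impute the training examples, train the base learner on the
    imputed data, and store the imputer in the returned classifier."""
    def __init__(self, base_learner, imputer_constructor):
        self.base_learner = base_learner
        self.imputer_constructor = imputer_constructor

    def __call__(self, data):
        imputer = self.imputer_constructor(data)
        imputed = [imputer(row) for row in data]
        model = self.base_learner(imputed)
        return ImputingClassifier(model, imputer)
```

The important property is that the same imputer object is used both for the
training data and, later, for every example passed to the classifier.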
     597 
Although details may vary from algorithm to algorithm, this is how imputation
is generally used in Orange's learners. If you write your own learners, it is
recommended that you use imputation according to this procedure as well.
     602 
     603Write your own imputer 
     604====================== 
     605 
Imputation classes provide the Python-callback functionality (not all Orange
classes do; refer to the documentation on `subtyping the Orange classes 
in Python <callbacks.htm>`_ for a list). To write your own imputation
constructor or imputer, you simply program a Python function that behaves
like the built-in Orange classes. For an imputer this is especially easy: you
only need to write a function that takes an example as its argument;
imputation of example tables will then apply that function to each example.
     613 
You will most often write an imputation constructor when you have a special
imputation procedure or separate procedures for various attributes, as
demonstrated in the description of
:obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only
need to pack everything written there into an imputer constructor that
accepts a data set and the id of the weight meta-attribute (you may ignore
it, but you must accept two arguments) and returns the imputer (probably
:obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing
an imputer constructor, as opposed to what we did above, is that you can use
such a constructor as a component for Orange learners (like logistic
regression) or for wrappers from module orngImpute, and in that way use
imputation properly in classifier testing procedures.
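A custom imputer constructor following the two-argument convention described
above can be sketched like this. The function name is hypothetical and the
data is represented as plain dicts; the real Orange constructor would receive
an example table and return an ``Imputer_model``:

```python
from collections import Counter

def my_imputer_constructor(data, weight_id=None):
    """Sketch of a custom imputer constructor: accepts the data and the
    weight meta-attribute id (accepted but ignored here, matching the
    two-argument convention) and returns an imputer callable. Numeric
    columns are filled with the average, others with the majority value."""
    fill = {}
    for c in data[0]:
        known = [r[c] for r in data if r[c] is not None]
        if all(isinstance(v, (int, float)) for v in known):
            fill[c] = sum(known) / len(known)          # average value
        else:
            fill[c] = Counter(known).most_common(1)[0][0]  # majority value

    def imputer(row):
        # The imputer itself: a function applied to one example at a time.
        return {c: (fill[c] if v is None else v) for c, v in row.items()}

    return imputer
```

Because it takes two arguments and returns a per-example imputer, a
constructor of this shape could be plugged into the imputer-constructor slot
of a learner, as described in the previous section.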
     626 
     627.. _imputation-minimal-imputer.py: code/imputation-minimal-imputer.py 
     628.. _imputation-complex.py: code/imputation-complex.py 
     629.. _voting.tab: code/voting.tab 
     630.. _bridges.tab: code/bridges.tab 
     631 
     632""" 
     633 
    1634import Orange.core as orange 
     635from orange import ImputerConstructor_minimal  
     636from orange import ImputerConstructor_maximal 
     637from orange import ImputerConstructor_average 
     638from orange import Imputer_defaults 
     639from orange import ImputerConstructor_model 
     640from orange import Imputer_model 
     641from orange import ImputerConstructor_asValue  
    2642 
    3643class ImputeLearner(orange.Learner): 
  • orange/Orange/feature/selection.py

    r7405 r7611  
    6565.. automethod:: Orange.feature.selection.filterRelieff 
    6666 
    67  
    68 ======== 
    69 Examples 
    70 ======== 
     67.. rubric:: Examples 
    7168 
    7269Following is a script that defines a new classifier that is based 
  • orange/doc/Orange/rst/Orange.feature.rst

    r7274 r7611  
    1 ============== 
    2 Orange.feature 
    3 ============== 
     1******* 
     2Feature 
     3******* 
    44 
    55.. automodule:: Orange.feature 
  • orange/orngImpute.py

    r7322 r7611  
    1 from Orange.feature.impute import Learner as ImputeLearner 
    2 from Orange.feature.impute import Classifier as ImputeClassifier 
     1from Orange.feature.impute import ImputeLearner as ImputeLearner 
     2from Orange.feature.impute import ImputeClassifier as ImputeClassifier 