source: orange/orange/Orange/classification/logreg.py @ 7757:c694f77187ab

Revision 7757:c694f77187ab, 46.6 KB checked in by jzbontar <jure.zbontar@…>, 3 years ago (diff)

Merged logistic regression documentation.

1"""
2.. index: logistic regression
3.. index:
4   single: classification; logistic regression
5
6*******************
7Logistic regression
8*******************
9
10Implements `logistic regression
11<http://en.wikipedia.org/wiki/Logistic_regression>`_ with an extension for
12proper treatment of discrete features.  The algorithm can handle various
13anomalies in features, such as constant variables and singularities, which
14could make fitting of logistic regression almost impossible. Stepwise
15logistic regression, which iteratively selects the most informative
16features, is also supported.
17
18Logistic regression is a popular classification method that comes
19from statistics. The model is described by a linear combination of
20coefficients,
21
22.. math::
23   
24    F = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \ldots + \\beta_k X_k
25
26and the probability (p) of a class value is computed as:
27
28.. math::
29
30    p = \\frac{\exp(F)}{1 + \exp(F)}
31
32
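For illustration (with made-up coefficients), a model with :math:`\\beta_0 = -1.2`
and :math:`\\beta_1 = 0.9` assigns an instance with :math:`X_1 = 2` the value
:math:`F = -1.2 + 0.9 \cdot 2 = 0.6` and, therefore, the probability

.. math::

    p = \\frac{\exp(0.6)}{1 + \exp(0.6)} \\approx 0.65
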
33.. class :: LogRegClassifier
34
35    :obj:`LogRegClassifier` stores estimated values of regression
36    coefficients and their significances, and uses them to predict
37    classes and class probabilities using the equations described above.
38
39    .. attribute :: beta
40
41        Estimated regression coefficients.
42
43    .. attribute :: beta_se
44
45        Estimated standard errors for regression coefficients.
46
47    .. attribute :: wald_Z
48
49        Wald Z statistics for beta coefficients. Wald Z is computed
50        as beta/beta_se.
51
52    .. attribute :: P
53
54        List of P-values for the beta coefficients: for each coefficient,
55        the probability of observing a value at least this extreme if the
56        true coefficient were 0.0. The P-value is computed from the squared
57        Wald Z statistic, which follows a chi-square distribution.
58
59    .. attribute :: likelihood
60
61        The probability of the sample (i.e. the learning examples) under
62        the fitted model, as a function of the regression
63        parameters.
64
65    .. attribute :: fitStatus
66
67        Tells how the model fitting ended: either regularly
68        (:obj:`LogRegFitter.OK`), or it was interrupted because a beta
69        coefficient escaped towards infinity (:obj:`LogRegFitter.Infinity`),
70        or because the values did not converge (:obj:`LogRegFitter.Divergence`). The
71        value indicates the reliability of the classifier; the classifier
72        itself is usable in any of these cases.
73
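    A minimal sketch of inspecting such a classifier (a hypothetical
    session using the titanic data set from the examples below)::

        import Orange
        from Orange.classification.logreg import LogRegLearner
        data = Orange.core.ExampleTable("titanic")
        classifier = LogRegLearner(data)
        print classifier.beta[0]                   # intercept
        print classifier.beta[1], classifier.P[1]  # first coefficient and its P-value
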
74.. autoclass:: LogRegLearner
75
76.. class:: LogRegFitter
77
78    :obj:`LogRegFitter` is the abstract base class for logistic fitters. It
79    defines the form of the call operator and the constants denoting its
80    (un)success:
81
82    .. attribute:: OK
83
84        The fitter converged to the optimal fit.
85
86    .. attribute:: Infinity
87
88        The fitter failed because one or more beta coefficients escaped towards infinity.
89
90    .. attribute:: Divergence
91
92        The beta coefficients failed to converge, but none of them escaped towards infinity.
93
94    .. attribute:: Constant
95
96        There is a constant attribute that causes the matrix to be singular.
97
98    .. attribute:: Singularity
99
100        The matrix is singular.
101
102
103    .. method:: __call__(examples, weightID)
104
105        Performs the fitting. There can be two different cases: either
106        the fitting succeeded in finding a set of beta coefficients (although
107        possibly with difficulties) or the fitting failed altogether. The
108        two cases return different results.
109
110        `(status, beta, beta_se, likelihood)`
111            The fitter managed to fit the model. The first element of
112            the tuple, the status, reports any problems that occurred; it can
113            be either :obj:`OK`, :obj:`Infinity` or :obj:`Divergence`. In
114            the latter two cases, the returned values may still be useful for
115            making predictions, but it is recommended that you inspect
116            the coefficients and their errors before deciding
117            whether to use the model.
118
119        `(status, attribute)`
120            The fitter failed and the returned attribute is responsible
121            for it. The type of failure is reported in status, which
122            can be either :obj:`Constant` or :obj:`Singularity`.
123
124        The proper way of calling the fitter is to expect and handle all
125        the situations described. For instance, if fitter is an instance
126        of a fitter and examples contains a set of suitable examples,
127        a script should look like this::
128
129            res = fitter(examples)
130            if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]:
131               status, beta, beta_se, likelihood = res
132               < proceed by doing something with what you got >
133            else:
134               status, attr = res
135               < remove the attribute or complain to the user or ... >
136
137
138.. class :: LogRegFitter_Cholesky
139
140    :obj:`LogRegFitter_Cholesky` is the sole fitter available at the
141    moment. It is a C++ translation of `Alan Miller's logistic regression
142    code <http://users.bigpond.net.au/amiller/>`_. It uses the Newton-Raphson
143    algorithm to iteratively minimize the least squares error computed from
144    the learning examples.
145
146
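A custom fitter can be plugged into :obj:`LogRegLearner` through its
:obj:`fitter` parameter; a minimal sketch (``data`` is a hypothetical data
table)::

    from Orange.classification import logreg
    learner = logreg.LogRegLearner()
    learner.fitter = logreg.LogRegFitter_Cholesky()
    classifier = learner(data)
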
147.. autoclass:: StepWiseFSS
148.. autofunction:: dump
149
150
151
152Examples
153--------
154
155The first example shows a very simple induction of a logistic regression
156classifier (`logreg-run.py`_, uses `titanic.tab`_).
157
158.. literalinclude:: code/logreg-run.py
159
160Result::
161
162    Classification accuracy: 0.778282598819
163
164    class attribute = survived
165    class values = <no, yes>
166
167        Attribute       beta  st. error     wald Z          P OR=exp(beta)
168
169        Intercept      -1.23       0.08     -15.15      -0.00
170     status=first       0.86       0.16       5.39       0.00       2.36
171    status=second      -0.16       0.18      -0.91       0.36       0.85
172     status=third      -0.92       0.15      -6.12       0.00       0.40
173        age=child       1.06       0.25       4.30       0.00       2.89
174       sex=female       2.42       0.14      17.04       0.00      11.25
175
176The next example shows how to handle singularities in data sets
177(`logreg-singularities.py`_, uses `adult_sample.tab`_).
178
179.. literalinclude:: code/logreg-singularities.py
180
181The first few lines of the output of this script are::
182
183    <=50K <=50K
184    <=50K <=50K
185    <=50K <=50K
186    >50K >50K
187    <=50K >50K
188
189    class attribute = y
190    class values = <>50K, <=50K>
191
192                               Attribute       beta  st. error     wald Z          P OR=exp(beta)
193
194                               Intercept       6.62      -0.00       -inf       0.00
195                                     age      -0.04       0.00       -inf       0.00       0.96
196                                  fnlwgt      -0.00       0.00       -inf       0.00       1.00
197                           education-num      -0.28       0.00       -inf       0.00       0.76
198                 marital-status=Divorced       4.29       0.00        inf       0.00      72.62
199            marital-status=Never-married       3.79       0.00        inf       0.00      44.45
200                marital-status=Separated       3.46       0.00        inf       0.00      31.95
201                  marital-status=Widowed       3.85       0.00        inf       0.00      46.96
202    marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63
203        marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19
204                 occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72
205
206If :obj:`removeSingular` is set to 0, inducing a logistic regression
207classifier raises an error::
208
209    Traceback (most recent call last):
210      File "logreg-singularities.py", line 4, in <module>
211        lr = classification.logreg.LogRegLearner(table, removeSingular=0)
212      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner
213        return lr(examples, weightID)
214      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__
215        lr = learner(examples, weight)
216    orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked
217
218We can see that the attribute workclass is causing a singularity.
219
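To have such features removed automatically, set :obj:`removeSingular` to 1
(a sketch; ``table`` holds the data from this example)::

    lr = Orange.classification.logreg.LogRegLearner(table, removeSingular=1)
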
220The example below shows how stepwise logistic regression can help to
221improve classification performance (`logreg-stepwise.py`_, uses `ionosphere.tab`_):
222
223.. literalinclude:: code/logreg-stepwise.py
224
225The output of this script is::
226
227    Learner      CA
228    logistic     0.841
229    filtered     0.846
230
231    Number of times attributes were used in cross-validation:
232     1 x a21
233    10 x a22
234     8 x a23
235     7 x a24
236     1 x a25
237    10 x a26
238    10 x a27
239     3 x a28
240     7 x a29
241     9 x a31
242     2 x a16
243     7 x a12
244     1 x a32
245     8 x a15
246    10 x a14
247     4 x a17
248     7 x a30
249    10 x a11
250     1 x a10
251     1 x a13
252    10 x a34
253     2 x a19
254     1 x a18
255    10 x a3
256    10 x a5
257     4 x a4
258     4 x a7
259     8 x a6
260    10 x a9
261    10 x a8
262
263.. _logreg-run.py: code/logreg-run.py
264.. _logreg-singularities.py: code/logreg-singularities.py
265.. _logreg-stepwise.py: code/logreg-stepwise.py
266
267.. _ionosphere.tab: code/ionosphere.tab
268.. _adult_sample.tab: code/adult_sample.tab
269.. _titanic.tab: code/titanic.tab
270
271"""
272
273from Orange.core import LogRegLearner, LogRegClassifier, LogRegFitter, LogRegFitter_Cholesky
274
275import Orange
276import math, os
277import warnings
278from numpy import *
279from numpy.linalg import *
280
281
282##########################################################################
283## Print out methods
284
285def dump(classifier):
286    """ Return a formatted string of all major features of a logistic
287    regression classifier.
288
289    :param classifier: logistic regression classifier
290    """
291
292    # print out class values
293    out = ['']
294    out.append("class attribute = " + classifier.domain.classVar.name)
295    out.append("class values = " + str(classifier.domain.classVar.values))
296    out.append('')
297   
298    # get the longest attribute name
299    longest=0
300    for at in classifier.continuizedDomain.attributes:
301        if len(at.name)>longest:
302            longest=len(at.name);
303
304    # print out the head
305    formatstr = "%"+str(longest)+"s %10s %10s %10s %10s %10s"
306    out.append(formatstr % ("Feature", "beta", "st. error", "wald Z", "P", "OR=exp(beta)"))
307    out.append('')
308    formatstr = "%"+str(longest)+"s %10.2f %10.2f %10.2f %10.2f"   
309    out.append(formatstr % ("Intercept", classifier.beta[0], classifier.beta_se[0], classifier.wald_Z[0], classifier.P[0]))
310    formatstr = "%"+str(longest)+"s %10.2f %10.2f %10.2f %10.2f %10.2f"   
311    for i in range(len(classifier.continuizedDomain.attributes)):
312        out.append(formatstr % (classifier.continuizedDomain.attributes[i].name, classifier.beta[i+1], classifier.beta_se[i+1], classifier.wald_Z[i+1], abs(classifier.P[i+1]), math.exp(classifier.beta[i+1])))
313
314    return '\n'.join(out)
315       
316
317def has_discrete_values(domain):
318    for at in domain.attributes:
319        if at.varType == Orange.core.VarTypes.Discrete:
320            return 1
321    return 0
322
323class LogRegLearner(Orange.classification.Learner):
324    """ Logistic regression learner.
325
326    Implements logistic regression. If data instances are provided to
327    the constructor, the learning algorithm is called and the resulting
328    classifier is returned instead of the learner.
329
330    :param table: data table with either discrete or continuous features
331    :type table: Orange.data.Table
332    :param weightID: the ID of the weight meta attribute
333    :type weightID: int
334    :param removeSingular: set to 1 if you want automatic removal of disturbing features, such as constants and singularities
335    :type removeSingular: bool
336    :param fitter: the fitting algorithm (by default the Newton-Raphson fitting algorithm is used)
337    :param stepwiseLR: set to 1 if you wish to use stepwise logistic regression
338    :type stepwiseLR: bool
339    :param addCrit: parameter for stepwise feature selection
340    :type addCrit: float
341    :param deleteCrit: parameter for stepwise feature selection
342    :type deleteCrit: float
343    :param numFeatures: parameter for stepwise feature selection
344    :type numFeatures: int
345    :rtype: :obj:`LogRegLearner` or :obj:`LogRegClassifier`
346
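    A minimal usage sketch (with a hypothetical data table ``table``)::

        learner = LogRegLearner()            # a learner, e.g. for use in cross-validation
        classifier = LogRegLearner(table)    # or induce a classifier directly
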
347    """
348    def __new__(cls, instances=None, weightID=0, **argkw):
349        self = Orange.classification.Learner.__new__(cls, **argkw)
350        if instances:
351            self.__init__(**argkw)
352            return self.__call__(instances, weightID)
353        else:
354            return self
355
356    def __init__(self, removeSingular=0, fitter = None, **kwds):
357        self.__dict__.update(kwds)
358        self.removeSingular = removeSingular
359        self.fitter = fitter
360
361    def __call__(self, examples, weight=0):
362        imputer = getattr(self, "imputer", None) or None
363        if getattr(self, "removeMissing", 0):
364            examples = Orange.core.Preprocessor_dropMissing(examples)
365##        if hasDiscreteValues(examples.domain):
366##            examples = createNoDiscTable(examples)
367        if not len(examples):
368            return None
369        if getattr(self, "stepwiseLR", 0):
370            addCrit = getattr(self, "addCrit", 0.2)
371            removeCrit = getattr(self, "removeCrit", 0.3)
372            numFeatures = getattr(self, "numFeatures", -1)
373            attributes = StepWiseFSS(examples, addCrit = addCrit, deleteCrit = removeCrit, imputer = imputer, numFeatures = numFeatures)
374            tmpDomain = Orange.core.Domain(attributes, examples.domain.classVar)
375            tmpDomain.addmetas(examples.domain.getmetas())
376            examples = examples.select(tmpDomain)
377        learner = Orange.core.LogRegLearner()
378        learner.imputerConstructor = imputer
379        if imputer:
380            examples = self.imputer(examples)(examples)
381        examples = Orange.core.Preprocessor_dropMissing(examples)
382        if self.fitter:
383            learner.fitter = self.fitter
384        if self.removeSingular:
385            lr = learner.fitModel(examples, weight)
386        else:
387            lr = learner(examples, weight)
388        while isinstance(lr, Orange.core.Variable):
389            if isinstance(lr.getValueFrom, Orange.core.ClassifierFromVar) and isinstance(lr.getValueFrom.transformer, Orange.core.Discrete2Continuous):
390                lr = lr.getValueFrom.variable
391            attributes = examples.domain.attributes[:]
392            if lr in attributes:
393                attributes.remove(lr)
394            else:
395                attributes.remove(lr.getValueFrom.variable)
396            newDomain = Orange.core.Domain(attributes, examples.domain.classVar)
397            newDomain.addmetas(examples.domain.getmetas())
398            examples = examples.select(newDomain)
399            lr = learner.fitModel(examples, weight)
400        return lr
401
402
403
404class UnivariateLogRegLearner(Orange.classification.Learner):
405    def __new__(cls, instances=None, **argkw):
406        self = Orange.classification.Learner.__new__(cls, **argkw)
407        if instances:
408            self.__init__(**argkw)
409            return self.__call__(instances)
410        else:
411            return self
412
413    def __init__(self, **kwds):
414        self.__dict__.update(kwds)
415
416    def __call__(self, examples):
417        examples = createFullNoDiscTable(examples)
418        classifiers = map(lambda x: LogRegLearner(Orange.core.Preprocessor_dropMissing(examples.select(Orange.core.Domain(x, examples.domain.classVar)))), examples.domain.attributes)
419        maj_classifier = LogRegLearner(Orange.core.Preprocessor_dropMissing(examples.select(Orange.core.Domain(examples.domain.classVar))))
420        beta = [maj_classifier.beta[0]] + [x.beta[1] for x in classifiers]
421        beta_se = [maj_classifier.beta_se[0]] + [x.beta_se[1] for x in classifiers]
422        P = [maj_classifier.P[0]] + [x.P[1] for x in classifiers]
423        wald_Z = [maj_classifier.wald_Z[0]] + [x.wald_Z[1] for x in classifiers]
424        domain = examples.domain
425
426        return UnivariateLogRegClassifier(beta = beta, beta_se = beta_se, P = P, wald_Z = wald_Z, domain = domain)
427
428class UnivariateLogRegClassifier(Orange.core.Classifier):
429    def __init__(self, **kwds):
430        self.__dict__.update(kwds)
431
432    def __call__(self, example, resultType = Orange.core.GetValue):
433        # classification not implemented yet; for now this class only provides regression coefficients and their statistics
434        pass
435   
436
437class LogRegLearnerGetPriors(object):
438    def __new__(cls, instances=None, weightID=0, **argkw):
439        self = object.__new__(cls)
440        if instances:
441            self.__init__(**argkw)
442            return self.__call__(instances, weightID)
443        else:
444            return self
445
446    def __init__(self, removeSingular=0, **kwds):
447        self.__dict__.update(kwds)
448        self.removeSingular = removeSingular
449    def __call__(self, examples, weight=0):
450        # the function below extends the data set with unknown values
451        def createLogRegExampleTable(data, weightID):
452            setsOfData = []
453            for at in data.domain.attributes:
454                # for each attribute create a new ExampleTable newData
455                # add a new attribute (a continuous variable) to dataOrig, dataFinal and newData
456                if at.varType == Orange.core.VarTypes.Continuous:
457                    atDisc = Orange.core.FloatVariable(at.name + "Disc")
458                    newDomain = Orange.core.Domain(data.domain.attributes+[atDisc,data.domain.classVar])
459                    newDomain.addmetas(data.domain.getmetas())
460                    newData = Orange.core.ExampleTable(newDomain,data)
461                    altData = Orange.core.ExampleTable(newDomain,data)
462                    for i,d in enumerate(newData):
463                        d[atDisc] = 0
464                        d[weightID] = 1*data[i][weightID]
465                    for i,d in enumerate(altData):
466                        d[atDisc] = 1
467                        d[at] = 0
468                        d[weightID] = 0.000001*data[i][weightID]
469                elif at.varType == Orange.core.VarTypes.Discrete:
470                # add one more value to attribute "at" in dataOrig, dataFinal and newData; the new value is the attribute name + "X"
471                    atNew = Orange.core.EnumVariable(at.name, values = at.values + [at.name+"X"])
472                    newDomain = Orange.core.Domain(filter(lambda x: x!=at, data.domain.attributes)+[atNew,data.domain.classVar])
473                    newDomain.addmetas(data.domain.getmetas())
474                    newData = Orange.core.ExampleTable(newDomain,data)
475                    altData = Orange.core.ExampleTable(newDomain,data)
476                    for i,d in enumerate(newData):
477                        d[atNew] = data[i][at]
478                        d[weightID] = 1*data[i][weightID]
479                    for i,d in enumerate(altData):
480                        d[atNew] = at.name+"X"
481                        d[weightID] = 0.000001*data[i][weightID]
482                newData.extend(altData)
483                setsOfData.append(newData)
484            return setsOfData
485                 
486        learner = LogRegLearner(imputer = Orange.core.ImputerConstructor_average(), removeSingular = self.removeSingular)
487        # get Original Model
488        orig_model = learner(examples,weight)
489        if orig_model.fit_status:
490            print "Warning: model did not converge"
491
492        # get extended Model (you should not change data)
493        if weight == 0:
494            weight = Orange.core.newmetaid()
495            examples.addMetaAttribute(weight, 1.0)
496        extended_set_of_examples = createLogRegExampleTable(examples, weight)
497        extended_models = [learner(extended_examples, weight) \
498                           for extended_examples in extended_set_of_examples]
499
500##        print examples[0]
501##        printOUT(orig_model)
502##        print orig_model.domain
503##        print orig_model.beta
504##        print orig_model.beta[orig_model.continuizedDomain.attributes[-1]]
505##        for i,m in enumerate(extended_models):
506##            print examples.domain.attributes[i]
507##            printOUT(m)
508           
509       
510        # compute the deviations
511        # get sum of all betas
512        beta = 0
513        betas_ap = []
514        for m in extended_models:
515            beta_add = m.beta[m.continuizedDomain.attributes[-1]]
516            betas_ap.append(beta_add)
517            beta = beta + beta_add
518       
519        # subtract it from the intercept
520        #print "beta", beta
521        logistic_prior = orig_model.beta[0]+beta
522       
523        # compare it to bayes prior
524        bayes = Orange.core.BayesLearner(examples)
525        bayes_prior = math.log(bayes.distribution[1]/bayes.distribution[0])
526
527        # normalize errors
528##        print "bayes", bayes_prior
529##        print "lr", orig_model.beta[0]
530##        print "lr2", logistic_prior
531##        print "dist", Orange.core.Distribution(examples.domain.classVar,examples)
532##        print "prej", betas_ap
533
534        # error normalization - to avoid errors due to assumption of independence of unknown values
535        dif = bayes_prior - logistic_prior
536        positives = sum(filter(lambda x: x>=0, betas_ap))
537        negatives = -sum(filter(lambda x: x<0, betas_ap))
538        if not negatives == 0:
539            kPN = positives/negatives
540            diffNegatives = dif/(1+kPN)
541            diffPositives = kPN*diffNegatives
542            kNegatives = (negatives-diffNegatives)/negatives
543            kPositives = positives/(positives-diffPositives)
544    ##        print kNegatives
545    ##        print kPositives
546
547            for i,b in enumerate(betas_ap):
548                if b<0: betas_ap[i]*=kNegatives
549                else: betas_ap[i]*=kPositives
550        #print "potem", betas_ap
551
552        # return the original model and the corresponding a priori zeros
553        return (orig_model, betas_ap)
554        #return (bayes_prior,orig_model.beta[examples.domain.classVar],logistic_prior)
555
556class LogRegLearnerGetPriorsOneTable:
557    def __init__(self, removeSingular=0, **kwds):
558        self.__dict__.update(kwds)
559        self.removeSingular = removeSingular
560    def __call__(self, examples, weight=0):
561        # the function below extends the data set with unknown values
562        def createLogRegExampleTable(data, weightID):
563            finalData = Orange.core.ExampleTable(data)
564            origData = Orange.core.ExampleTable(data)
565            for at in data.domain.attributes:
566                # for each attribute create a new ExampleTable newData
567                # add a new attribute (a continuous variable) to dataOrig, dataFinal and newData
568                if at.varType == Orange.core.VarTypes.Continuous:
569                    atDisc = Orange.core.FloatVariable(at.name + "Disc")
570                    newDomain = Orange.core.Domain(origData.domain.attributes+[atDisc,data.domain.classVar])
571                    newDomain.addmetas(origData.domain.getmetas())
572                    finalData = Orange.core.ExampleTable(newDomain,finalData)
573                    newData = Orange.core.ExampleTable(newDomain,origData)
574                    origData = Orange.core.ExampleTable(newDomain,origData)
575                    for d in origData:
576                        d[atDisc] = 0
577                    for d in finalData:
578                        d[atDisc] = 0
579                    for i,d in enumerate(newData):
580                        d[atDisc] = 1
581                        d[at] = 0
582                        d[weightID] = 100*data[i][weightID]
583                       
584                elif at.varType == Orange.core.VarTypes.Discrete:
585                # add one more value to attribute "at" in dataOrig, dataFinal and newData; the new value is the attribute name + "X"
586                    atNew = Orange.core.EnumVariable(at.name, values = at.values + [at.name+"X"])
587                    newDomain = Orange.core.Domain(filter(lambda x: x!=at, origData.domain.attributes)+[atNew,origData.domain.classVar])
588                    newDomain.addmetas(origData.domain.getmetas())
589                    temp_finalData = Orange.core.ExampleTable(finalData)
590                    finalData = Orange.core.ExampleTable(newDomain,finalData)
591                    newData = Orange.core.ExampleTable(newDomain,origData)
592                    temp_origData = Orange.core.ExampleTable(origData)
593                    origData = Orange.core.ExampleTable(newDomain,origData)
594                    for i,d in enumerate(origData):
595                        d[atNew] = temp_origData[i][at]
596                    for i,d in enumerate(finalData):
597                        d[atNew] = temp_finalData[i][at]                       
598                    for i,d in enumerate(newData):
599                        d[atNew] = at.name+"X"
600                        d[weightID] = 10*data[i][weightID]
601                finalData.extend(newData)
602            return finalData
603                 
604        learner = LogRegLearner(imputer = Orange.core.ImputerConstructor_average(), removeSingular = self.removeSingular)
605        # get Original Model
606        orig_model = learner(examples,weight)
607
608        # get extended Model (you should not change data)
609        if weight == 0:
610            weight = Orange.core.newmetaid()
611            examples.addMetaAttribute(weight, 1.0)
612        extended_examples = createLogRegExampleTable(examples, weight)
613        extended_model = learner(extended_examples, weight)
614
615##        print examples[0]
616##        printOUT(orig_model)
617##        print orig_model.domain
618##        print orig_model.beta
619
620##        printOUT(extended_model)       
621        # compute the deviations
622        # get sum of all betas
623        beta = 0
624        betas_ap = []
625        for m in [extended_model]:  # this variant builds a single extended model
626            beta_add = m.beta[m.continuizedDomain.attributes[-1]]
627            betas_ap.append(beta_add)
628            beta = beta + beta_add
629       
630        # subtract it from the intercept
631        #print "beta", beta
632        logistic_prior = orig_model.beta[0]+beta
633       
634        # compare it to bayes prior
635        bayes = Orange.core.BayesLearner(examples)
636        bayes_prior = math.log(bayes.distribution[1]/bayes.distribution[0])
637
638        # normalize errors
639        #print "bayes", bayes_prior
640        #print "lr", orig_model.beta[0]
641        #print "lr2", logistic_prior
642        #print "dist", Orange.core.Distribution(examples.domain.classVar,examples)
643        k = (bayes_prior-orig_model.beta[0])/(logistic_prior-orig_model.beta[0])
644        #print "prej", betas_ap
645        betas_ap = [k*x for x in betas_ap]               
646        #print "potem", betas_ap
647
648        # return the original model and the corresponding a priori zeros
649        return (orig_model, betas_ap)
650        #return (bayes_prior,orig_model.beta[data.domain.classVar],logistic_prior)
651
652
653######################################
654#### Fitters for logistic regression (logreg) learner ####
655######################################
656
657def pr(x, betas):
658    k = math.exp(dot(x, betas))
659    return k / (1+k)
660
661def lh(x,y,betas):
662    llh = 0.0
663    for i,x_i in enumerate(x):
664        p = pr(x_i,betas)  # predicted probability for this instance
665        llh += y[i]*log(max(p,1e-6)) + (1-y[i])*log(max(1-p,1e-6))
666    return llh
667
668
669def diag(vector):
670    mat = identity(len(vector), float)
671    for i,v in enumerate(vector):
672        mat[i][i] = v
673    return mat
674   
675class SimpleFitter(Orange.core.LogRegFitter):
676    def __init__(self, penalty=0, se_penalty = False):
677        self.penalty = penalty
678        self.se_penalty = se_penalty
679    def __call__(self, data, weight=0):
680        ml = data.native(0)
681        for i in range(len(data.domain.attributes)):
682          a = data.domain.attributes[i]
683          if a.varType == Orange.core.VarTypes.Discrete:
684            for m in ml:
685              m[i] = a.values.index(m[i])
686        for m in ml:
687          m[-1] = data.domain.classVar.values.index(m[-1])
688        Xtmp = array(ml)
689        y = Xtmp[:,-1]   # true probabilities (1's or 0's)
690        one = reshape(array([1]*len(data)), (len(data),1)) # intercept column
691        X=concatenate((one, Xtmp[:,:-1]),1)  # intercept first, then data
692
693        betas = array([0.0] * (len(data.domain.attributes)+1))
694        oldBetas = array([1.0] * (len(data.domain.attributes)+1))
695        N = len(data)
696
697        pen_matrix = array([self.penalty] * (len(data.domain.attributes)+1))
698        if self.se_penalty:
699            p = array([pr(X[i], betas) for i in range(len(data))])
700            W = identity(len(data), float)
701            pp = p * (1.0-p)
702            for i in range(N):
703                W[i,i] = pp[i]
704            se = sqrt(diagonal(inv(dot(transpose(X), dot(W, X)))))
705            for i,p in enumerate(pen_matrix):
706                pen_matrix[i] *= se[i]
707        # predict the probability for an instance, x and betas are vectors
708        # start the computation
709        likelihood = 0.
710        likelihood_new = 1.
711        while abs(likelihood - likelihood_new)>1e-5:
712            likelihood = likelihood_new
713            oldBetas = betas
714            p = array([pr(X[i], betas) for i in range(len(data))])
715
716            W = identity(len(data), float)
717            pp = p * (1.0-p)
718            for i in range(N):
719                W[i,i] = pp[i]
720
721            WI = inv(W)
722            z = dot(X, betas) + dot(WI, y - p)
723
724            tmpA = inv(dot(transpose(X), dot(W, X))+diag(pen_matrix))
725            tmpB = dot(transpose(X), y-p)
726            betas = oldBetas + dot(tmpA,tmpB)
727#            betaTemp = matrixmultiply(matrixmultiply(matrixmultiply(matrixmultiply(tmpA,transpose(X)),W),X),oldBetas)
728#            print betaTemp
729#            tmpB = matrixmultiply(transpose(X), matrixmultiply(W, z))
730#            betas = matrixmultiply(tmpA, tmpB)
731            likelihood_new = lh(X,y,betas)-self.penalty*sum([b*b for b in betas])
732            print likelihood_new
733
734           
735           
736##        XX = sqrt(diagonal(inverse(matrixmultiply(transpose(X),X))))
737##        yhat = array([pr(X[i], betas) for i in range(len(data))])
738##        ss = sum((y - yhat) ** 2) / (N - len(data.domain.attributes) - 1)
739##        sigma = math.sqrt(ss)
740        p = array([pr(X[i], betas) for i in range(len(data))])
741        W = identity(len(data), float)
742        pp = p * (1.0-p)
743        for i in range(N):
744            W[i,i] = pp[i]
745        diXWX = sqrt(diagonal(inv(dot(transpose(X), dot(W, X)))))
746        xTemp = dot(dot(inv(dot(transpose(X), dot(W, X))),transpose(X)),y)
747        beta = []
748        beta_se = []
749        print "likelihood ridge", likelihood
750        for i in range(len(betas)):
751            beta.append(betas[i])
752            beta_se.append(diXWX[i])
753        return (self.OK, beta, beta_se, 0)
754
755def pr_bx(bx):
756    if bx > 35:
757        return 1
758    if bx < -35:
759        return 0
760    return exp(bx)/(1+exp(bx))
761
762class BayesianFitter(Orange.core.LogRegFitter):
763    def __init__(self, penalty=0, anch_examples=[], tau = 0):
764        self.penalty = penalty
765        self.anch_examples = anch_examples
766        self.tau = tau
767
768    def create_array_data(self,data):
769        if not len(data):
770            return (array([]),array([]))
771        # convert data to numeric
772        ml = data.native(0)
773        for i,a in enumerate(data.domain.attributes):
774          if a.varType == Orange.core.VarTypes.Discrete:
775            for m in ml:
776              m[i] = a.values.index(m[i])
777        for m in ml:
778          m[-1] = data.domain.classVar.values.index(m[-1])
779        Xtmp = array(ml)
780        y = Xtmp[:,-1]   # true probabilities (1's or 0's)
781        one = reshape(array([1]*len(data)), (len(data),1)) # intercept column
782        X=concatenate((one, Xtmp[:,:-1]),1)  # intercept first, then data
783        return (X,y)
784   
785    def __call__(self, data, weight=0):
786        (X,y)=self.create_array_data(data)
787
788        exTable = Orange.core.ExampleTable(data.domain)
789        for id,ex in self.anch_examples:
790            exTable.extend(Orange.core.ExampleTable(ex,data.domain))
791        (X_anch,y_anch)=self.create_array_data(exTable)
792
793        betas = array([0.0] * (len(data.domain.attributes)+1))
794
795        likelihood,betas = self.estimate_beta(X,y,betas,[0]*(len(betas)),X_anch,y_anch)
796
797        # get attribute groups atGroup = [(startIndex, number of values), ...)
798        ats = data.domain.attributes
799        atVec=reduce(lambda x,y: x+[(y,not y==x[-1][0])], [a.getValueFrom and a.getValueFrom.whichVar or a for a in ats],[(ats[0].getValueFrom and ats[0].getValueFrom.whichVar or ats[0],0)])[1:]
800        atGroup=[[0,0]]
801        for v_i,v in enumerate(atVec):
802            if v[1]==0: atGroup[-1][1]+=1
803            else:       atGroup.append([v_i,1])
804       
805        # compute zero values for attributes
806        sumB = 0.
807        for ag in atGroup:
808            X_temp = concatenate((X[:,:ag[0]+1],X[:,ag[0]+1+ag[1]:]),1)
809            if len(X_anch):
810                X_anch_temp = concatenate((X_anch[:,:ag[0]+1],X_anch[:,ag[0]+1+ag[1]:]),1)
811            else: X_anch_temp = X_anch
812##            print "1", concatenate((betas[:i+1],betas[i+2:]))
813##            print "2", betas
814            likelihood_temp,betas_temp=self.estimate_beta(X_temp,y,concatenate((betas[:ag[0]+1],betas[ag[0]+ag[1]+1:])),[0]+[1]*(len(betas)-1-ag[1]),X_anch_temp,y_anch)
815            print "finBetas", betas, betas_temp
816            print "betas", betas[0], betas_temp[0]
817            sumB += betas[0]-betas_temp[0]
818        apriori = Orange.core.Distribution(data.domain.classVar, data)
819        aprioriProb = apriori[0]/apriori.abs
820       
821        print "final result", sumB, math.log((1-aprioriProb)/aprioriProb), betas[0]
822           
823        beta = []
824        beta_se = []
825        print "likelihood2", likelihood
826        for i in range(len(betas)):
827            beta.append(betas[i])
828            beta_se.append(0.0)
829        return (self.OK, beta, beta_se, 0)
830
831     
832       
833    def estimate_beta(self,X,y,betas,const_betas,X_anch,y_anch):
834        N,N_anch = len(y),len(y_anch)
835        r,r_anch = array([dot(X[i], betas) for i in range(N)]),\
836                   array([dot(X_anch[i], betas) for i in range(N_anch)])
837        p    = array([pr_bx(ri) for ri in r])
838        X_sq = X*X
839
840        max_delta      = [1.]*len(const_betas)
841        likelihood     = -1.e+10
842        likelihood_new = -1.e+9
843        while abs(likelihood - likelihood_new)>0.01 and max(max_delta)>0.01:
844            likelihood = likelihood_new
845            print likelihood
846            betas_temp = [b for b in betas]
847            for j in range(len(betas)):
848                if const_betas[j]: continue
849                dl = dot(X[:,j],transpose(y-p))
850                for xi,x in enumerate(X_anch):
851                    dl += self.penalty*x[j]*(y_anch[xi] - pr_bx(r_anch[xi]*self.penalty))
852
853                ddl = dot(X_sq[:,j],transpose(p*(1-p)))
854                for xi,x in enumerate(X_anch):
855                    ddl += self.penalty*x[j]*pr_bx(r[xi]*self.penalty)*(1-pr_bx(r[xi]*self.penalty))
856
857                if j==0:
858                    dv = dl/max(ddl,1e-6)
859                elif betas[j] == 0: # special handling due to non-defined first and second derivatives
860                    dv = (dl-self.tau)/max(ddl,1e-6)
861                    if dv < 0:
862                        dv = (dl+self.tau)/max(ddl,1e-6)
863                        if dv > 0:
864                            dv = 0
865                else:
866                    dl -= sign(betas[j])*self.tau
867                    dv = dl/max(ddl,1e-6)
868                    if not sign(betas[j] + dv) == sign(betas[j]):
869                        dv = -betas[j]
870                dv = min(max(dv,-max_delta[j]),max_delta[j])
871                r+= X[:,j]*dv
872                p = array([pr_bx(ri) for ri in r])
873                if N_anch:
874                    r_anch+=X_anch[:,j]*dv
875                betas[j] += dv
876                max_delta[j] = max(2*abs(dv),max_delta[j]/2)
877            likelihood_new = lh(X,y,betas)
878            for xi,x in enumerate(X_anch):
879                try:
880                    likelihood_new += y_anch[xi]*r_anch[xi]*self.penalty-log(1+exp(r_anch[xi]*self.penalty))
881                except:
882                    likelihood_new += r_anch[xi]*self.penalty*(y_anch[xi]-1)
883            likelihood_new -= sum([abs(b) for b in betas[1:]])*self.tau
884            if likelihood_new < likelihood:
885                max_delta = [md/4 for md in max_delta]
886                likelihood_new = likelihood
887                likelihood = likelihood_new + 1.
888                betas = [b for b in betas_temp]
889        print "betas", betas
890        print "init_like", likelihood_new
891        print "pure_like", lh(X,y,betas)
892        return (likelihood,betas)
893   
894############################################################
895#  Feature subset selection for logistic regression
896
897def get_likelihood(fitter, examples):
898    res = fitter(examples)
899    if res[0] in [fitter.OK]: #, fitter.Infinity, fitter.Divergence]:
900       status, beta, beta_se, likelihood = res
901       if sum([abs(b) for b in beta])<sum([abs(b) for b in beta_se]):
902           return -100*len(examples)
903       return likelihood
904    else:
905       return -100*len(examples)
906       
907
908
909class StepWiseFSS(Orange.classification.Learner):
910  """Implementation of algorithm described in [Hosmer and Lemeshow, Applied Logistic Regression, 2000].
911
912  Perform stepwise logistic regression and return a list of the
913  most "informative" features. Each step of the algorithm is composed
914  of two parts. The first is backward elimination, where each already
915  chosen feature is tested for a significant contribution to the overall
916  model. If the worst among all tested features has higher significance
917  than is specified in :obj:`deleteCrit`, the feature is removed from
918  the model. The second step is forward selection, which is similar to
919  backward elimination. It loops through all the features that are not
920  in the model and tests whether they contribute to the common model
921  with significance lower than :obj:`addCrit`. The algorithm stops when
922  no feature in the model is to be removed and no feature not in the
923  model is to be added. By setting :obj:`numFeatures` larger than -1,
924  the algorithm will also stop once the number of features in the model
925  reaches that number.
926
927  Significances are assessed via the likelihood ratio chi-square
928  test. The usual F-test is not appropriate because the errors are assumed to
929  follow a binomial distribution.
930
931  If :obj:`table` is specified, stepwise logistic regression implemented
932  in :obj:`StepWiseFSS` is performed and a list of chosen features
933  is returned. If :obj:`table` is not specified an instance of
934  :obj:`StepWiseFSS` with all parameters set is returned.
935
936  :param table: data set
937  :type table: Orange.data.Table
938
939  :param addCrit: "Alpha" level used to judge whether a variable is important enough to be added to the new set (e.g. if addCrit is 0.2, a feature is added if its P is lower than 0.2).
940  :type addCrit: float
941
942  :param deleteCrit: Similar to addCrit, but used in backward elimination. It should be higher than addCrit!
943  :type deleteCrit: float
944
945  :param numFeatures: maximum number of selected features, use -1 for infinity.
946  :type numFeatures: int
947  :rtype: :obj:`StepWiseFSS` or list of features
948
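  A minimal usage sketch (``table`` is a loaded data table; the criteria
  values here are arbitrary)::

      features = StepWiseFSS(table, addCrit=0.05, deleteCrit=0.2, numFeatures=5)
      print [f.name for f in features]
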
949  """
950
951  def __new__(cls, instances=None, **argkw):
952      self = Orange.classification.Learner.__new__(cls, **argkw)
953      if instances:
954          self.__init__(**argkw)
955          return self.__call__(instances)
956      else:
957          return self
958
959
960  def __init__(self, addCrit=0.2, deleteCrit=0.3, numFeatures = -1, **kwds):
961    self.__dict__.update(kwds)
962    self.addCrit = addCrit
963    self.deleteCrit = deleteCrit
964    self.numFeatures = numFeatures
965  def __call__(self, examples):
966    if getattr(self, "imputer", 0):
967        examples = self.imputer(examples)(examples)
968    if getattr(self, "removeMissing", 0):
969        examples = Orange.core.Preprocessor_dropMissing(examples)
970    continuizer = Orange.core.DomainContinuizer(zeroBased=1,continuousTreatment=Orange.core.DomainContinuizer.Leave,
971                                           multinomialTreatment = Orange.core.DomainContinuizer.FrequentIsBase,
972                                           classTreatment = Orange.core.DomainContinuizer.Ignore)
973    attr = []
974    remain_attr = examples.domain.attributes[:]
975
976    # get LL for Majority Learner
977    tempDomain = Orange.core.Domain(attr,examples.domain.classVar)
978    #tempData  = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
979    tempData  = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
980
981    ll_Old = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData)
982    ll_Best = -1000000
983    length_Old = float(len(tempData))
984
985    stop = 0
986    while not stop:
987        # LOOP until all variables are added or no further deletion nor addition of attribute is possible
988        worstAt = None
989        # if there are more than 1 attribute then perform backward elimination
990        if len(attr) >= 2:
991            minG = 1000
992            worstAt = attr[0]
993            ll_Best = ll_Old
994            length_Best = length_Old
995            for at in attr:
996                # check each attribute: does its presence sufficiently increase the LL?
997
998                tempAttr = filter(lambda x: x!=at, attr)
999                tempDomain = Orange.core.Domain(tempAttr,examples.domain.classVar)
1000                tempDomain.addmetas(examples.domain.getmetas())
1001                # using the reduced domain, calculate P for the LL change.
1002                tempDomain  = continuizer(Orange.core.Preprocessor_dropMissing(examples.select(tempDomain)))
1003                tempData = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
1004
1005                ll_Delete = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData)
1006                length_Delete = float(len(tempData))
1007                length_Avg = (length_Delete + length_Old)/2.0
1008
1009                G=-2*length_Avg*(ll_Delete/length_Delete-ll_Old/length_Old)
1010
1011                # set new worst attribute               
1012                if G<minG:
1013                    worstAt = at
1014                    minG=G
1015                    ll_Best = ll_Delete
1016                    length_Best = length_Delete
1017            # deletion of attribute
1018           
1019            if worstAt.varType==Orange.core.VarTypes.Continuous:
1020                P=lchisqprob(minG,1);
1021            else:
1022                P=lchisqprob(minG,len(worstAt.values)-1);
1023            if P>=self.deleteCrit:
1024                attr.remove(worstAt)
1025                remain_attr.append(worstAt)
1026                nodeletion=0
1027                ll_Old = ll_Best
1028                length_Old = length_Best
1029            else:
1030                nodeletion=1
1031        else:
1032            nodeletion = 1
1033            # END OF DELETION PART
1034           
1035        # if enough attributes have been chosen, stop the procedure
1036        if self.numFeatures>-1 and len(attr)>=self.numFeatures:
1037            remain_attr=[]
1038         
1039        # for each attribute in the remaining
1040        maxG=-1
1041        ll_Best = ll_Old
1042        length_Best = length_Old
1043        bestAt = None
1044        for at in remain_attr:
1045            tempAttr = attr + [at]
1046            tempDomain = Orange.core.Domain(tempAttr,examples.domain.classVar)
1047            tempDomain.addmetas(examples.domain.getmetas())
1048            # using the extended domain, calculate P for the LL improvement.
1049            tempDomain  = continuizer(Orange.core.Preprocessor_dropMissing(examples.select(tempDomain)))
1050            tempData = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
1051            ll_New = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData)
1052
1053            length_New = float(len(tempData)) # get number of examples in tempData to normalize likelihood
1054
1055            # P=PR(CHI^2>G), G=-2(L(0)-L(1))=2(E(0)-E(1))
1056            length_avg = (length_New + length_Old)/2
1057            G=-2*length_avg*(ll_Old/length_Old-ll_New/length_New);
1058            if G>maxG:
1059                bestAt = at
1060                maxG=G
1061                ll_Best = ll_New
1062                length_Best = length_New
1063        if not bestAt:
1064            stop = 1
1065            continue
1066       
1067        if bestAt.varType==Orange.core.VarTypes.Continuous:
1068            P=lchisqprob(maxG,1);
1069        else:
1070            P=lchisqprob(maxG,len(bestAt.values)-1);
1071        # add the attribute with the smallest P to the selected attributes (attr)
1072        if P<=self.addCrit:
1073            attr.append(bestAt)
1074            remain_attr.remove(bestAt)
1075            ll_Old = ll_Best
1076            length_Old = length_Best
1077
1078        if (P>self.addCrit and nodeletion) or (bestAt == worstAt):
1079            stop = 1
1080
1081    return attr
1082
1083
1084class StepWiseFSSFilter(object):
1085    def __new__(cls, instances=None, **argkw):
1086        self = object.__new__(cls, **argkw)
1087        if instances:
1088            self.__init__(**argkw)
1089            return self.__call__(instances)
1090        else:
1091            return self
1092   
1093    def __init__(self, addCrit=0.2, deleteCrit=0.3, numFeatures = -1):
1094        self.addCrit = addCrit
1095        self.deleteCrit = deleteCrit
1096        self.numFeatures = numFeatures
1097
1098    def __call__(self, examples):
1099        attr = StepWiseFSS(examples, addCrit=self.addCrit, deleteCrit = self.deleteCrit, numFeatures = self.numFeatures)
1100        return examples.select(Orange.core.Domain(attr, examples.domain.classVar))
1101               
1102
1103####################################
1104##  PROBABILITY CALCULATIONS
1105
1106def lchisqprob(chisq,df):
1107    """
1108    Return the (1-tailed) probability value associated with the provided
1109    chi-square value and df.  Adapted from chisq.c in Gary Perlman's |Stat.
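
    For illustration (approximate value): ``lchisqprob(5.2, 1)`` is about
    0.023, the probability that a chi-square variable with one degree of
    freedom exceeds 5.2.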
1110    """
1111    BIG = 20.0
1112    def ex(x):
1113        BIG = 20.0
1114        if x < -BIG:
1115            return 0.0
1116        else:
1117            return math.exp(x)
1118    if chisq <=0 or df < 1:
1119        return 1.0
1120    a = 0.5 * chisq
1121    if df%2 == 0:
1122        even = 1
1123    else:
1124        even = 0
1125    if df > 1:
1126        y = ex(-a)
1127    if even:
1128        s = y
1129    else:
1130        s = 2.0 * zprob(-math.sqrt(chisq))
1131    if (df > 2):
1132        chisq = 0.5 * (df - 1.0)
1133        if even:
1134            z = 1.0
1135        else:
1136            z = 0.5
1137        if a > BIG:
1138            if even:
1139                e = 0.0
1140            else:
1141                e = math.log(math.sqrt(math.pi))
1142            c = math.log(a)
1143            while (z <= chisq):
1144                e = math.log(z) + e
1145                s = s + ex(c*z-a-e)
1146                z = z + 1.0
1147            return s
1148        else:
1149            if even:
1150                e = 1.0
1151            else:
1152                e = 1.0 / math.sqrt(math.pi) / math.sqrt(a)
1153            c = 0.0
1154            while (z <= chisq):
1155                e = e * (a/float(z))
1156                c = c + e
1157                z = z + 1.0
1158            return (c*y+s)
1159    else:
1160        return s
1161
1162
1163def zprob(z):
1164    """
1165    Returns the area under the normal curve 'to the left of' the given z value.
1166    Thus::
1167
1168        for z<0, zprob(z) = 1-tail probability
1169        for z>0, 1.0-zprob(z) = 1-tail probability
1170        for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability
1171
1172    Adapted from z.c in Gary Perlman's |Stat.
1173    """
1174    Z_MAX = 6.0    # maximum meaningful z-value
1175    if z == 0.0:
1176        x = 0.0
1177    else:
1178        y = 0.5 * math.fabs(z)
1179        if y >= (Z_MAX*0.5):
1180            x = 1.0
1181        elif (y < 1.0):
1182            w = y*y
1183            x = ((((((((0.000124818987 * w
1184                -0.001075204047) * w +0.005198775019) * w
1185                -0.019198292004) * w +0.059054035642) * w
1186                -0.151968751364) * w +0.319152932694) * w
1187                -0.531923007300) * w +0.797884560593) * y * 2.0
1188        else:
1189            y = y - 2.0
1190            x = (((((((((((((-0.000045255659 * y
1191                +0.000152529290) * y -0.000019538132) * y
1192                -0.000676904986) * y +0.001390604284) * y
1193                -0.000794620820) * y -0.002034254874) * y
1194                +0.006549791214) * y -0.010557625006) * y
1195                +0.011630447319) * y -0.009279453341) * y
1196                +0.005353579108) * y -0.002141268741) * y
1197                +0.000535310849) * y +0.999936657524
1198    if z > 0.0:
1199        prob = ((x+1.0)*0.5)
1200    else:
1201        prob = ((1.0-x)*0.5)
1202    return prob
1203
1204   