source:orange/orange/Orange/classification/logreg.py@9349:fa13a2c52fcd

Revision 9349:fa13a2c52fcd, 46.6 KB checked in by mitar, 2 years ago (diff)

Changed way of linking to code in documentation.

Line
1"""
2.. index: logistic regression
3.. index:
4   single: classification; logistic regression
5
6********************************
7Logistic regression (logreg)
8********************************
9
10Implements logistic regression
11<http://en.wikipedia.org/wiki/Logistic_regression>_ with an extension for
12proper treatment of discrete features.  The algorithm can handle various
13anomalies in features, such as constant variables and singularities, that
14could make fitting of logistic regression almost impossible. Stepwise
15logistic regression, which iteratively selects the most informative
16features, is also supported.
17
18Logistic regression is a popular classification method that comes
19from statistics. The model is described by a linear combination of
20coefficients,
21
22.. math::
23
24    F = \\beta_0 + \\beta_1*X_1 + \\beta_2*X_2 + ... + \\beta_k*X_k
25
26and the probability (p) of a class value is  computed as:
27
28.. math::
29
30    p = \\frac{\exp(F)}{1 + \exp(F)}
31
32
33.. class :: LogRegClassifier
34
35    :obj:LogRegClassifier stores estimated values of regression
36    coefficients and their significances, and uses them to predict
37    classes and class probabilities using the equations described above.
38
39    .. attribute :: beta
40
41        Estimated regression coefficients.
42
43    .. attribute :: beta_se
44
45        Estimated standard errors for regression coefficients.
46
47    .. attribute :: wald_Z
48
49        Wald Z statistics for beta coefficients. Wald Z is computed
50        as beta/beta_se.
51
52    .. attribute :: P
53
54        List of P-values for beta coefficients, that is, the probability
55        that beta coefficients differ from 0.0. The probability is
56        computed from squared Wald Z statistics that is distributed with
57        Chi-Square distribution.
58
59    .. attribute :: likelihood
60
61        The probability of the sample (ie. learning examples) observed on
62        the basis of the derived model, as a function of the regression
63        parameters.
64
65    .. attribute :: fitStatus
66
67        Tells how the model fitting ended - either regularly
68        (:obj:LogRegFitter.OK), or it was interrupted due to one of beta
69        coefficients escaping towards infinity (:obj:LogRegFitter.Infinity)
70        or since the values didn't converge (:obj:LogRegFitter.Divergence). The
71        value tells about the classifier's "reliability"; the classifier
72        itself is useful in either case.
73
74.. autoclass:: LogRegLearner
75
76.. class:: LogRegFitter
77
78    :obj:LogRegFitter is the abstract base class for logistic fitters. It
79    defines the form of call operator and the constants denoting its
80    (un)success:
81
82    .. attribute:: OK
83
84        Fitter succeeded to converge to the optimal fit.
85
86    .. attribute:: Infinity
87
88        Fitter failed due to one or more beta coefficients escaping towards infinity.
89
90    .. attribute:: Divergence
91
92        Beta coefficients failed to converge, but none of beta coefficients escaped.
93
94    .. attribute:: Constant
95
96        There is a constant attribute that causes the matrix to be singular.
97
98    .. attribute:: Singularity
99
100        The matrix is singular.
101
102
103    .. method:: __call__(examples, weightID)
104
105        Performs the fitting. There can be two different cases: either
106        the fitting succeeded to find a set of beta coefficients (although
107        possibly with difficulties) or the fitting failed altogether. The
108        two cases return different results.
109
110        (status, beta, beta_se, likelihood)
111            The fitter managed to fit the model. The first element of
112            the tuple, result, tells about the problems occurred; it can
113            be either :obj:OK, :obj:Infinity or :obj:Divergence. In
114            the latter cases, returned values may still be useful for
115            making predictions, but it's recommended that you inspect
116            the coefficients and their errors and make your decision
117            whether to use the model or not.
118
119        (status, attribute)
120            The fitter failed and the returned attribute is responsible
121            for it. The type of failure is reported in status, which
122            can be either :obj:Constant or :obj:Singularity.
123
124        The proper way of calling the fitter is to expect and handle all
125        the situations described. For instance, if fitter is an instance
126        of some fitter and examples contain a set of suitable examples,
127        a script should look like this::
128
129            res = fitter(examples)
130            if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]:
131               status, beta, beta_se, likelihood = res
132               < proceed by doing something with what you got >
133            else:
134               status, attr = res
135               < remove the attribute or complain to the user or ... >
136
137
138.. class :: LogRegFitter_Cholesky
139
140    :obj:LogRegFitter_Cholesky is the sole fitter available at the
141    moment. It is a C++ translation of Alan Miller's logistic regression
142    code <http://users.bigpond.net.au/amiller/>_. It uses Newton-Raphson
143    algorithm to iteratively minimize least squares error computed from
144    learning examples.
145
146
147.. autoclass:: StepWiseFSS
148.. autofunction:: dump
149
150
151
152Examples
153--------
154
155The first example shows a very simple induction of a logistic regression
156classifier (:download:logreg-run.py <code/logreg-run.py>, uses :download:titanic.tab <code/titanic.tab>).
157
158.. literalinclude:: code/logreg-run.py
159
160Result::
161
162    Classification accuracy: 0.778282598819
163
164    class attribute = survived
165    class values = <no, yes>
166
167        Attribute       beta  st. error     wald Z          P OR=exp(beta)
168
169        Intercept      -1.23       0.08     -15.15      -0.00
170     status=first       0.86       0.16       5.39       0.00       2.36
171    status=second      -0.16       0.18      -0.91       0.36       0.85
172     status=third      -0.92       0.15      -6.12       0.00       0.40
173        age=child       1.06       0.25       4.30       0.00       2.89
174       sex=female       2.42       0.14      17.04       0.00      11.25
175
176The next examples shows how to handle singularities in data sets
177(:download:logreg-singularities.py <code/logreg-singularities.py>, uses :download:adult_sample.tab <code/adult_sample.tab>).
178
179.. literalinclude:: code/logreg-singularities.py
180
181The first few lines of the output of this script are::
182
183    <=50K <=50K
184    <=50K <=50K
185    <=50K <=50K
186    >50K >50K
187    <=50K >50K
188
189    class attribute = y
190    class values = <>50K, <=50K>
191
192                               Attribute       beta  st. error     wald Z          P OR=exp(beta)
193
194                               Intercept       6.62      -0.00       -inf       0.00
195                                     age      -0.04       0.00       -inf       0.00       0.96
196                                  fnlwgt      -0.00       0.00       -inf       0.00       1.00
197                           education-num      -0.28       0.00       -inf       0.00       0.76
198                 marital-status=Divorced       4.29       0.00        inf       0.00      72.62
199            marital-status=Never-married       3.79       0.00        inf       0.00      44.45
200                marital-status=Separated       3.46       0.00        inf       0.00      31.95
201                  marital-status=Widowed       3.85       0.00        inf       0.00      46.96
202    marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63
203        marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19
204                 occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72
205
206If :obj:removeSingular is set to 0, inducing a logistic regression
207classifier would return an error::
208
209    Traceback (most recent call last):
210      File "logreg-singularities.py", line 4, in <module>
211        lr = classification.logreg.LogRegLearner(table, removeSingular=0)
212      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner
213        return lr(examples, weightID)
214      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__
215        lr = learner(examples, weight)
216    orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked
217
218We can see that the attribute workclass is causing a singularity.
219
220The example below shows, how the use of stepwise logistic regression can help to
221gain in classification performance (:download:logreg-stepwise.py <code/logreg-stepwise.py>, uses :download:ionosphere.tab <code/ionosphere.tab>):
222
223.. literalinclude:: code/logreg-stepwise.py
224
225The output of this script is::
226
227    Learner      CA
228    logistic     0.841
229    filtered     0.846
230
231    Number of times attributes were used in cross-validation:
232     1 x a21
233    10 x a22
234     8 x a23
235     7 x a24
236     1 x a25
237    10 x a26
238    10 x a27
239     3 x a28
240     7 x a29
241     9 x a31
242     2 x a16
243     7 x a12
244     1 x a32
245     8 x a15
246    10 x a14
247     4 x a17
248     7 x a30
249    10 x a11
250     1 x a10
251     1 x a13
252    10 x a34
253     2 x a19
254     1 x a18
255    10 x a3
256    10 x a5
257     4 x a4
258     4 x a7
259     8 x a6
260    10 x a9
261    10 x a8
262
263"""
264
265from Orange.core import LogRegLearner, LogRegClassifier, LogRegFitter, LogRegFitter_Cholesky
266
267import Orange
268import math, os
269import warnings
270from numpy import *
271from numpy.linalg import *
272
273
274##########################################################################
275## Print out methods
276
277def dump(classifier):
278    """ Formatted string of all major features in logistic
279    regression classifier.
280
281    :param classifier: logistic regression classifier
282    """
283
284    # print out class values
285    out = ['']
286    out.append("class attribute = " + classifier.domain.classVar.name)
287    out.append("class values = " + str(classifier.domain.classVar.values))
288    out.append('')
289
290    # get the longest attribute name
291    longest=0
292    for at in classifier.continuizedDomain.attributes:
293        if len(at.name)>longest:
294            longest=len(at.name);
295
296    # print out the head
297    formatstr = "%"+str(longest)+"s %10s %10s %10s %10s %10s"
298    out.append(formatstr % ("Feature", "beta", "st. error", "wald Z", "P", "OR=exp(beta)"))
299    out.append('')
300    formatstr = "%"+str(longest)+"s %10.2f %10.2f %10.2f %10.2f"
301    out.append(formatstr % ("Intercept", classifier.beta[0], classifier.beta_se[0], classifier.wald_Z[0], classifier.P[0]))
302    formatstr = "%"+str(longest)+"s %10.2f %10.2f %10.2f %10.2f %10.2f"
303    for i in range(len(classifier.continuizedDomain.attributes)):
304        out.append(formatstr % (classifier.continuizedDomain.attributes[i].name, classifier.beta[i+1], classifier.beta_se[i+1], classifier.wald_Z[i+1], abs(classifier.P[i+1]), math.exp(classifier.beta[i+1])))
305
306    return '\n'.join(out)
307
308
309def has_discrete_values(domain):
310    for at in domain.attributes:
311        if at.varType == Orange.core.VarTypes.Discrete:
312            return 1
313    return 0
314
315class LogRegLearner(Orange.classification.Learner):
316    """ Logistic regression learner.
317
318    Implements logistic regression. If data instances are provided to
319    the constructor, the learning algorithm is called and the resulting
320    classifier is returned instead of the learner.
321
322    :param table: data table with either discrete or continuous features
323    :type table: Orange.data.Table
324    :param weightID: the ID of the weight meta attribute
325    :type weightID: int
326    :param removeSingular: set to 1 if you want automatic removal of disturbing features, such as constants and singularities
327    :type removeSingular: bool
328    :param fitter: the fitting algorithm (by default the Newton-Raphson fitting algorithm is used)
329    :param stepwiseLR: set to 1 if you wish to use stepwise logistic regression
330    :type stepwiseLR: bool
331    :param addCrit: parameter for stepwise feature selection
333    :param deleteCrit: parameter for stepwise feature selection
334    :type deleteCrit: float
335    :param numFeatures: parameter for stepwise feature selection
336    :type numFeatures: int
337    :rtype: :obj:LogRegLearner or :obj:LogRegClassifier
338
339    """
340    def __new__(cls, instances=None, weightID=0, **argkw):
341        self = Orange.classification.Learner.__new__(cls, **argkw)
342        if instances:
343            self.__init__(**argkw)
344            return self.__call__(instances, weightID)
345        else:
346            return self
347
348    def __init__(self, removeSingular=0, fitter = None, **kwds):
349        self.__dict__.update(kwds)
350        self.removeSingular = removeSingular
351        self.fitter = None
352
353    def __call__(self, examples, weight=0):
354        imputer = getattr(self, "imputer", None) or None
355        if getattr(self, "removeMissing", 0):
356            examples = Orange.core.Preprocessor_dropMissing(examples)
357##        if hasDiscreteValues(examples.domain):
358##            examples = createNoDiscTable(examples)
359        if not len(examples):
360            return None
361        if getattr(self, "stepwiseLR", 0):
363            removeCrit = getattr(self, "removeCrit", 0.3)
364            numFeatures = getattr(self, "numFeatures", -1)
365            attributes = StepWiseFSS(examples, addCrit = addCrit, deleteCrit = removeCrit, imputer = imputer, numFeatures = numFeatures)
366            tmpDomain = Orange.core.Domain(attributes, examples.domain.classVar)
368            examples = examples.select(tmpDomain)
369        learner = Orange.core.LogRegLearner()
370        learner.imputerConstructor = imputer
371        if imputer:
372            examples = self.imputer(examples)(examples)
373        examples = Orange.core.Preprocessor_dropMissing(examples)
374        if self.fitter:
375            learner.fitter = self.fitter
376        if self.removeSingular:
377            lr = learner.fitModel(examples, weight)
378        else:
379            lr = learner(examples, weight)
380        while isinstance(lr, Orange.core.Variable):
381            if isinstance(lr.getValueFrom, Orange.core.ClassifierFromVar) and isinstance(lr.getValueFrom.transformer, Orange.core.Discrete2Continuous):
382                lr = lr.getValueFrom.variable
383            attributes = examples.domain.attributes[:]
384            if lr in attributes:
385                attributes.remove(lr)
386            else:
387                attributes.remove(lr.getValueFrom.variable)
388            newDomain = Orange.core.Domain(attributes, examples.domain.classVar)
390            examples = examples.select(newDomain)
391            lr = learner.fitModel(examples, weight)
392        return lr
393
394
395
396class UnivariateLogRegLearner(Orange.classification.Learner):
397    def __new__(cls, instances=None, **argkw):
398        self = Orange.classification.Learner.__new__(cls, **argkw)
399        if instances:
400            self.__init__(**argkw)
401            return self.__call__(instances)
402        else:
403            return self
404
405    def __init__(self, **kwds):
406        self.__dict__.update(kwds)
407
408    def __call__(self, examples):
409        examples = createFullNoDiscTable(examples)
410        classifiers = map(lambda x: LogRegLearner(Orange.core.Preprocessor_dropMissing(examples.select(Orange.core.Domain(x, examples.domain.classVar)))), examples.domain.attributes)
411        maj_classifier = LogRegLearner(Orange.core.Preprocessor_dropMissing(examples.select(Orange.core.Domain(examples.domain.classVar))))
412        beta = [maj_classifier.beta[0]] + [x.beta[1] for x in classifiers]
413        beta_se = [maj_classifier.beta_se[0]] + [x.beta_se[1] for x in classifiers]
414        P = [maj_classifier.P[0]] + [x.P[1] for x in classifiers]
415        wald_Z = [maj_classifier.wald_Z[0]] + [x.wald_Z[1] for x in classifiers]
416        domain = examples.domain
417
418        return Univariate_LogRegClassifier(beta = beta, beta_se = beta_se, P = P, wald_Z = wald_Z, domain = domain)
419
420class UnivariateLogRegClassifier(Orange.core.Classifier):
421    def __init__(self, **kwds):
422        self.__dict__.update(kwds)
423
424    def __call__(self, example, resultType = Orange.core.GetValue):
425        # classification not implemented yet. For now its use is only to provide regression coefficients and its statistics
426        pass
427
428
429class LogRegLearnerGetPriors(object):
430    def __new__(cls, instances=None, weightID=0, **argkw):
431        self = object.__new__(cls, **argkw)
432        if instances:
433            self.__init__(**argkw)
434            return self.__call__(instances, weightID)
435        else:
436            return self
437
438    def __init__(self, removeSingular=0, **kwds):
439        self.__dict__.update(kwds)
440        self.removeSingular = removeSingular
441    def __call__(self, examples, weight=0):
442        # next function changes data set to a extended with unknown values
443        def createLogRegExampleTable(data, weightID):
444            setsOfData = []
445            for at in data.domain.attributes:
446                # za vsak atribut kreiraj nov newExampleTable newData
447                # v dataOrig, dataFinal in newData dodaj nov atribut -- continuous variable
448                if at.varType == Orange.core.VarTypes.Continuous:
449                    atDisc = Orange.core.FloatVariable(at.name + "Disc")
450                    newDomain = Orange.core.Domain(data.domain.attributes+[atDisc,data.domain.classVar])
452                    newData = Orange.core.ExampleTable(newDomain,data)
453                    altData = Orange.core.ExampleTable(newDomain,data)
454                    for i,d in enumerate(newData):
455                        d[atDisc] = 0
456                        d[weightID] = 1*data[i][weightID]
457                    for i,d in enumerate(altData):
458                        d[atDisc] = 1
459                        d[at] = 0
460                        d[weightID] = 0.000001*data[i][weightID]
461                elif at.varType == Orange.core.VarTypes.Discrete:
462                # v dataOrig, dataFinal in newData atributu "at" dodaj ee  eno  vreednost, ki ima vrednost kar  ime atributa +  "X"
463                    atNew = Orange.core.EnumVariable(at.name, values = at.values + [at.name+"X"])
464                    newDomain = Orange.core.Domain(filter(lambda x: x!=at, data.domain.attributes)+[atNew,data.domain.classVar])
466                    newData = Orange.core.ExampleTable(newDomain,data)
467                    altData = Orange.core.ExampleTable(newDomain,data)
468                    for i,d in enumerate(newData):
469                        d[atNew] = data[i][at]
470                        d[weightID] = 1*data[i][weightID]
471                    for i,d in enumerate(altData):
472                        d[atNew] = at.name+"X"
473                        d[weightID] = 0.000001*data[i][weightID]
474                newData.extend(altData)
475                setsOfData.append(newData)
476            return setsOfData
477
478        learner = LogRegLearner(imputer = Orange.core.ImputerConstructor_average(), removeSingular = self.removeSingular)
479        # get Original Model
480        orig_model = learner(examples,weight)
481        if orig_model.fit_status:
482            print "Warning: model did not converge"
483
484        # get extended Model (you should not change data)
485        if weight == 0:
486            weight = Orange.core.newmetaid()
488        extended_set_of_examples = createLogRegExampleTable(examples, weight)
489        extended_models = [learner(extended_examples, weight) \
490                           for extended_examples in extended_set_of_examples]
491
492##        print examples[0]
493##        printOUT(orig_model)
494##        print orig_model.domain
495##        print orig_model.beta
496##        print orig_model.beta[orig_model.continuizedDomain.attributes[-1]]
497##        for i,m in enumerate(extended_models):
498##            print examples.domain.attributes[i]
499##            printOUT(m)
500
501
502        # izracunas odstopanja
503        # get sum of all betas
504        beta = 0
505        betas_ap = []
506        for m in extended_models:
509            beta = beta + beta_add
510
511        # substract it from intercept
512        #print "beta", beta
513        logistic_prior = orig_model.beta[0]+beta
514
515        # compare it to bayes prior
516        bayes = Orange.core.BayesLearner(examples)
517        bayes_prior = math.log(bayes.distribution[1]/bayes.distribution[0])
518
519        # normalize errors
520##        print "bayes", bayes_prior
521##        print "lr", orig_model.beta[0]
522##        print "lr2", logistic_prior
523##        print "dist", Orange.core.Distribution(examples.domain.classVar,examples)
524##        print "prej", betas_ap
525
526        # error normalization - to avoid errors due to assumption of independence of unknown values
527        dif = bayes_prior - logistic_prior
528        positives = sum(filter(lambda x: x>=0, betas_ap))
529        negatives = -sum(filter(lambda x: x<0, betas_ap))
530        if not negatives == 0:
531            kPN = positives/negatives
532            diffNegatives = dif/(1+kPN)
533            diffPositives = kPN*diffNegatives
534            kNegatives = (negatives-diffNegatives)/negatives
535            kPositives = positives/(positives-diffPositives)
536    ##        print kNegatives
537    ##        print kPositives
538
539            for i,b in enumerate(betas_ap):
540                if b<0: betas_ap[i]*=kNegatives
541                else: betas_ap[i]*=kPositives
542        #print "potem", betas_ap
543
544        # vrni originalni model in pripadajoce apriorne niclele
545        return (orig_model, betas_ap)
546        #return (bayes_prior,orig_model.beta[examples.domain.classVar],logistic_prior)
547
548class LogRegLearnerGetPriorsOneTable:
549    def __init__(self, removeSingular=0, **kwds):
550        self.__dict__.update(kwds)
551        self.removeSingular = removeSingular
552    def __call__(self, examples, weight=0):
553        # next function changes data set to a extended with unknown values
554        def createLogRegExampleTable(data, weightID):
555            finalData = Orange.core.ExampleTable(data)
556            origData = Orange.core.ExampleTable(data)
557            for at in data.domain.attributes:
558                # za vsak atribut kreiraj nov newExampleTable newData
559                # v dataOrig, dataFinal in newData dodaj nov atribut -- continuous variable
560                if at.varType == Orange.core.VarTypes.Continuous:
561                    atDisc = Orange.core.FloatVariable(at.name + "Disc")
562                    newDomain = Orange.core.Domain(origData.domain.attributes+[atDisc,data.domain.classVar])
564                    finalData = Orange.core.ExampleTable(newDomain,finalData)
565                    newData = Orange.core.ExampleTable(newDomain,origData)
566                    origData = Orange.core.ExampleTable(newDomain,origData)
567                    for d in origData:
568                        d[atDisc] = 0
569                    for d in finalData:
570                        d[atDisc] = 0
571                    for i,d in enumerate(newData):
572                        d[atDisc] = 1
573                        d[at] = 0
574                        d[weightID] = 100*data[i][weightID]
575
576                elif at.varType == Orange.core.VarTypes.Discrete:
577                # v dataOrig, dataFinal in newData atributu "at" dodaj ee  eno  vreednost, ki ima vrednost kar  ime atributa +  "X"
578                    atNew = Orange.core.EnumVariable(at.name, values = at.values + [at.name+"X"])
579                    newDomain = Orange.core.Domain(filter(lambda x: x!=at, origData.domain.attributes)+[atNew,origData.domain.classVar])
581                    temp_finalData = Orange.core.ExampleTable(finalData)
582                    finalData = Orange.core.ExampleTable(newDomain,finalData)
583                    newData = Orange.core.ExampleTable(newDomain,origData)
584                    temp_origData = Orange.core.ExampleTable(origData)
585                    origData = Orange.core.ExampleTable(newDomain,origData)
586                    for i,d in enumerate(origData):
587                        d[atNew] = temp_origData[i][at]
588                    for i,d in enumerate(finalData):
589                        d[atNew] = temp_finalData[i][at]
590                    for i,d in enumerate(newData):
591                        d[atNew] = at.name+"X"
592                        d[weightID] = 10*data[i][weightID]
593                finalData.extend(newData)
594            return finalData
595
596        learner = LogRegLearner(imputer = Orange.core.ImputerConstructor_average(), removeSingular = self.removeSingular)
597        # get Original Model
598        orig_model = learner(examples,weight)
599
600        # get extended Model (you should not change data)
601        if weight == 0:
602            weight = Orange.core.newmetaid()
604        extended_examples = createLogRegExampleTable(examples, weight)
605        extended_model = learner(extended_examples, weight)
606
607##        print examples[0]
608##        printOUT(orig_model)
609##        print orig_model.domain
610##        print orig_model.beta
611
612##        printOUT(extended_model)
613        # izracunas odstopanja
614        # get sum of all betas
615        beta = 0
616        betas_ap = []
617        for m in extended_models:
620            beta = beta + beta_add
621
622        # substract it from intercept
623        #print "beta", beta
624        logistic_prior = orig_model.beta[0]+beta
625
626        # compare it to bayes prior
627        bayes = Orange.core.BayesLearner(examples)
628        bayes_prior = math.log(bayes.distribution[1]/bayes.distribution[0])
629
630        # normalize errors
631        #print "bayes", bayes_prior
632        #print "lr", orig_model.beta[0]
633        #print "lr2", logistic_prior
634        #print "dist", Orange.core.Distribution(examples.domain.classVar,examples)
635        k = (bayes_prior-orig_model.beta[0])/(logistic_prior-orig_model.beta[0])
636        #print "prej", betas_ap
637        betas_ap = [k*x for x in betas_ap]
638        #print "potem", betas_ap
639
640        # vrni originalni model in pripadajoce apriorne niclele
641        return (orig_model, betas_ap)
642        #return (bayes_prior,orig_model.beta[data.domain.classVar],logistic_prior)
643
644
645######################################
646#### Fitters for logistic regression (logreg) learner ####
647######################################
648
649def pr(x, betas):
650    k = math.exp(dot(x, betas))
651    return k / (1+k)
652
653def lh(x,y,betas):
654    llh = 0.0
655    for i,x_i in enumerate(x):
656        pr = pr(x_i,betas)
657        llh += y[i]*log(max(pr,1e-6)) + (1-y[i])*log(max(1-pr,1e-6))
658    return llh
659
660
661def diag(vector):
662    mat = identity(len(vector), Float)
663    for i,v in enumerate(vector):
664        mat[i][i] = v
665    return mat
666
667class SimpleFitter(Orange.core.LogRegFitter):
668    def __init__(self, penalty=0, se_penalty = False):
669        self.penalty = penalty
670        self.se_penalty = se_penalty
671    def __call__(self, data, weight=0):
672        ml = data.native(0)
673        for i in range(len(data.domain.attributes)):
674          a = data.domain.attributes[i]
675          if a.varType == Orange.core.VarTypes.Discrete:
676            for m in ml:
677              m[i] = a.values.index(m[i])
678        for m in ml:
679          m[-1] = data.domain.classVar.values.index(m[-1])
680        Xtmp = array(ml)
681        y = Xtmp[:,-1]   # true probabilities (1's or 0's)
682        one = reshape(array([1]*len(data)), (len(data),1)) # intercept column
683        X=concatenate((one, Xtmp[:,:-1]),1)  # intercept first, then data
684
685        betas = array([0.0] * (len(data.domain.attributes)+1))
686        oldBetas = array([1.0] * (len(data.domain.attributes)+1))
687        N = len(data)
688
689        pen_matrix = array([self.penalty] * (len(data.domain.attributes)+1))
690        if self.se_penalty:
691            p = array([pr(X[i], betas) for i in range(len(data))])
692            W = identity(len(data), Float)
693            pp = p * (1.0-p)
694            for i in range(N):
695                W[i,i] = pp[i]
696            se = sqrt(diagonal(inverse(matrixmultiply(transpose(X), matrixmultiply(W, X)))))
697            for i,p in enumerate(pen_matrix):
698                pen_matrix[i] *= se[i]
699        # predict the probability for an instance, x and betas are vectors
700        # start the computation
701        likelihood = 0.
702        likelihood_new = 1.
703        while abs(likelihood - likelihood_new)>1e-5:
704            likelihood = likelihood_new
705            oldBetas = betas
706            p = array([pr(X[i], betas) for i in range(len(data))])
707
708            W = identity(len(data), Float)
709            pp = p * (1.0-p)
710            for i in range(N):
711                W[i,i] = pp[i]
712
713            WI = inverse(W)
714            z = matrixmultiply(X, betas) + matrixmultiply(WI, y - p)
715
716            tmpA = inverse(matrixmultiply(transpose(X), matrixmultiply(W, X))+diag(pen_matrix))
717            tmpB = matrixmultiply(transpose(X), y-p)
718            betas = oldBetas + matrixmultiply(tmpA,tmpB)
719#            betaTemp = matrixmultiply(matrixmultiply(matrixmultiply(matrixmultiply(tmpA,transpose(X)),W),X),oldBetas)
720#            print betaTemp
721#            tmpB = matrixmultiply(transpose(X), matrixmultiply(W, z))
722#            betas = matrixmultiply(tmpA, tmpB)
723            likelihood_new = lh(X,y,betas)-self.penalty*sum([b*b for b in betas])
724            print likelihood_new
725
726
727
728##        XX = sqrt(diagonal(inverse(matrixmultiply(transpose(X),X))))
729##        yhat = array([pr(X[i], betas) for i in range(len(data))])
730##        ss = sum((y - yhat) ** 2) / (N - len(data.domain.attributes) - 1)
731##        sigma = math.sqrt(ss)
732        p = array([pr(X[i], betas) for i in range(len(data))])
733        W = identity(len(data), Float)
734        pp = p * (1.0-p)
735        for i in range(N):
736            W[i,i] = pp[i]
737        diXWX = sqrt(diagonal(inverse(matrixmultiply(transpose(X), matrixmultiply(W, X)))))
738        xTemp = matrixmultiply(matrixmultiply(inverse(matrixmultiply(transpose(X), matrixmultiply(W, X))),transpose(X)),y)
739        beta = []
740        beta_se = []
741        print "likelihood ridge", likelihood
742        for i in range(len(betas)):
743            beta.append(betas[i])
744            beta_se.append(diXWX[i])
745        return (self.OK, beta, beta_se, 0)
746
747def pr_bx(bx):
748    if bx > 35:
749        return 1
750    if bx < -35:
751        return 0
752    return exp(bx)/(1+exp(bx))
753
754class BayesianFitter(Orange.core.LogRegFitter):
755    def __init__(self, penalty=0, anch_examples=[], tau = 0):
756        self.penalty = penalty
757        self.anch_examples = anch_examples
758        self.tau = tau
759
760    def create_array_data(self,data):
761        if not len(data):
762            return (array([]),array([]))
763        # convert data to numeric
764        ml = data.native(0)
765        for i,a in enumerate(data.domain.attributes):
766          if a.varType == Orange.core.VarTypes.Discrete:
767            for m in ml:
768              m[i] = a.values.index(m[i])
769        for m in ml:
770          m[-1] = data.domain.classVar.values.index(m[-1])
771        Xtmp = array(ml)
772        y = Xtmp[:,-1]   # true probabilities (1's or 0's)
773        one = reshape(array([1]*len(data)), (len(data),1)) # intercept column
774        X=concatenate((one, Xtmp[:,:-1]),1)  # intercept first, then data
775        return (X,y)
776
777    def __call__(self, data, weight=0):
778        (X,y)=self.create_array_data(data)
779
780        exTable = Orange.core.ExampleTable(data.domain)
781        for id,ex in self.anch_examples:
782            exTable.extend(Orange.core.ExampleTable(ex,data.domain))
783        (X_anch,y_anch)=self.create_array_data(exTable)
784
785        betas = array([0.0] * (len(data.domain.attributes)+1))
786
787        likelihood,betas = self.estimate_beta(X,y,betas,[0]*(len(betas)),X_anch,y_anch)
788
789        # get attribute groups atGroup = [(startIndex, number of values), ...)
790        ats = data.domain.attributes
791        atVec=reduce(lambda x,y: x+[(y,not y==x[-1][0])], [a.getValueFrom and a.getValueFrom.whichVar or a for a in ats],[(ats[0].getValueFrom and ats[0].getValueFrom.whichVar or ats[0],0)])[1:]
792        atGroup=[[0,0]]
793        for v_i,v in enumerate(atVec):
794            if v[1]==0: atGroup[-1][1]+=1
795            else:       atGroup.append([v_i,1])
796
797        # compute zero values for attributes
798        sumB = 0.
799        for ag in atGroup:
800            X_temp = concatenate((X[:,:ag[0]+1],X[:,ag[0]+1+ag[1]:]),1)
801            if X_anch:
802                X_anch_temp = concatenate((X_anch[:,:ag[0]+1],X_anch[:,ag[0]+1+ag[1]:]),1)
803            else: X_anch_temp = X_anch
804##            print "1", concatenate((betas[:i+1],betas[i+2:]))
805##            print "2", betas
806            likelihood_temp,betas_temp=self.estimate_beta(X_temp,y,concatenate((betas[:ag[0]+1],betas[ag[0]+ag[1]+1:])),[0]+[1]*(len(betas)-1-ag[1]),X_anch_temp,y_anch)
807            print "finBetas", betas, betas_temp
808            print "betas", betas[0], betas_temp[0]
809            sumB += betas[0]-betas_temp[0]
810        apriori = Orange.core.Distribution(data.domain.classVar, data)
811        aprioriProb = apriori[0]/apriori.abs
812
813        print "koncni rezultat", sumB, math.log((1-aprioriProb)/aprioriProb), betas[0]
814
815        beta = []
816        beta_se = []
817        print "likelihood2", likelihood
818        for i in range(len(betas)):
819            beta.append(betas[i])
820            beta_se.append(0.0)
821        return (self.OK, beta, beta_se, 0)
822
823
824
825    def estimate_beta(self,X,y,betas,const_betas,X_anch,y_anch):
826        N,N_anch = len(y),len(y_anch)
827        r,r_anch = array([dot(X[i], betas) for i in range(N)]),\
828                   array([dot(X_anch[i], betas) for i in range(N_anch)])
829        p    = array([pr_bx(ri) for ri in r])
830        X_sq = X*X
831
832        max_delta      = [1.]*len(const_betas)
833        likelihood     = -1.e+10
834        likelihood_new = -1.e+9
835        while abs(likelihood - likelihood_new)>0.01 and max(max_delta)>0.01:
836            likelihood = likelihood_new
837            print likelihood
838            betas_temp = [b for b in betas]
839            for j in range(len(betas)):
840                if const_betas[j]: continue
841                dl = matrixmultiply(X[:,j],transpose(y-p))
842                for xi,x in enumerate(X_anch):
843                    dl += self.penalty*x[j]*(y_anch[xi] - pr_bx(r_anch[xi]*self.penalty))
844
845                ddl = matrixmultiply(X_sq[:,j],transpose(p*(1-p)))
846                for xi,x in enumerate(X_anch):
847                    ddl += self.penalty*x[j]*pr_bx(r[xi]*self.penalty)*(1-pr_bx(r[xi]*self.penalty))
848
849                if j==0:
850                    dv = dl/max(ddl,1e-6)
851                elif betas[j] == 0: # special handling due to non-defined first and second derivatives
852                    dv = (dl-self.tau)/max(ddl,1e-6)
853                    if dv < 0:
854                        dv = (dl+self.tau)/max(ddl,1e-6)
855                        if dv > 0:
856                            dv = 0
857                else:
858                    dl -= sign(betas[j])*self.tau
859                    dv = dl/max(ddl,1e-6)
860                    if not sign(betas[j] + dv) == sign(betas[j]):
861                        dv = -betas[j]
862                dv = min(max(dv,-max_delta[j]),max_delta[j])
863                r+= X[:,j]*dv
864                p = array([pr_bx(ri) for ri in r])
865                if N_anch:
866                    r_anch+=X_anch[:,j]*dv
867                betas[j] += dv
868                max_delta[j] = max(2*abs(dv),max_delta[j]/2)
869            likelihood_new = lh(X,y,betas)
870            for xi,x in enumerate(X_anch):
871                try:
872                    likelihood_new += y_anch[xi]*r_anch[xi]*self.penalty-log(1+exp(r_anch[xi]*self.penalty))
873                except:
874                    likelihood_new += r_anch[xi]*self.penalty*(y_anch[xi]-1)
875            likelihood_new -= sum([abs(b) for b in betas[1:]])*self.tau
876            if likelihood_new < likelihood:
877                max_delta = [md/4 for md in max_delta]
878                likelihood_new = likelihood
879                likelihood = likelihood_new + 1.
880                betas = [b for b in betas_temp]
881        print "betas", betas
882        print "init_like", likelihood_new
883        print "pure_like", lh(X,y,betas)
884        return (likelihood,betas)
885
886############################################################
887#  Feature subset selection for logistic regression
888
889def get_likelihood(fitter, examples):
890    res = fitter(examples)
891    if res[0] in [fitter.OK]: #, fitter.Infinity, fitter.Divergence]:
892       status, beta, beta_se, likelihood = res
893       if sum([abs(b) for b in beta])<sum([abs(b) for b in beta_se]):
894           return -100*len(examples)
895       return likelihood
896    else:
897       return -100*len(examples)
898
899
900
901class StepWiseFSS(Orange.classification.Learner):
902  """Implementation of algorithm described in [Hosmer and Lemeshow, Applied Logistic Regression, 2000].
903
904  Perform stepwise logistic regression and return a list of the
905  most "informative" features. Each step of the algorithm is composed
906  of two parts. The first is backward elimination, where each already
907  chosen feature is tested for a significant contribution to the overall
908  model. If the worst among all tested features has higher significance
909  than is specified in :obj:deleteCrit, the feature is removed from
910  the model. The second step is forward selection, which is similar to
911  backward elimination. It loops through all the features that are not
912  in the model and tests whether they contribute to the common model
913  with significance lower that :obj:addCrit. The algorithm stops when
914  no feature in the model is to be removed and no feature not in the
915  model is to be added. By setting :obj:numFeatures larger than -1,
916  the algorithm will stop its execution when the number of features in model
917  exceeds that number.
918
919  Significances are assesed via the likelihood ration chi-square
920  test. Normal F test is not appropriate, because errors are assumed to
922
923  If :obj:table is specified, stepwise logistic regression implemented
924  in :obj:StepWiseFSS is performed and a list of chosen features
925  is returned. If :obj:table is not specified an instance of
926  :obj:StepWiseFSS with all parameters set is returned.
927
928  :param table: data set
929  :type table: Orange.data.Table
930
931  :param addCrit: "Alpha" level to judge if variable has enough importance to be added in the new set. (e.g. if addCrit is 0.2, then features is added if its P is lower than 0.2)
933
934  :param deleteCrit: Similar to addCrit, just that it is used at backward elimination. It should be higher than addCrit!
935  :type deleteCrit: float
936
937  :param numFeatures: maximum number of selected features, use -1 for infinity.
938  :type numFeatures: int
939  :rtype: :obj:StepWiseFSS or list of features
940
941  """
942
943  def __new__(cls, instances=None, **argkw):
944      self = Orange.classification.Learner.__new__(cls, **argkw)
945      if instances:
946          self.__init__(**argkw)
947          return self.__call__(instances)
948      else:
949          return self
950
951
952  def __init__(self, addCrit=0.2, deleteCrit=0.3, numFeatures = -1, **kwds):
953    self.__dict__.update(kwds)
955    self.deleteCrit = deleteCrit
956    self.numFeatures = numFeatures
957  def __call__(self, examples):
958    if getattr(self, "imputer", 0):
959        examples = self.imputer(examples)(examples)
960    if getattr(self, "removeMissing", 0):
961        examples = Orange.core.Preprocessor_dropMissing(examples)
962    continuizer = Orange.core.DomainContinuizer(zeroBased=1,continuousTreatment=Orange.core.DomainContinuizer.Leave,
963                                           multinomialTreatment = Orange.core.DomainContinuizer.FrequentIsBase,
964                                           classTreatment = Orange.core.DomainContinuizer.Ignore)
965    attr = []
966    remain_attr = examples.domain.attributes[:]
967
968    # get LL for Majority Learner
969    tempDomain = Orange.core.Domain(attr,examples.domain.classVar)
970    #tempData  = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
971    tempData  = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
972
973    ll_Old = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData)
974    ll_Best = -1000000
975    length_Old = float(len(tempData))
976
977    stop = 0
978    while not stop:
979        # LOOP until all variables are added or no further deletion nor addition of attribute is possible
980        worstAt = None
981        # if there are more than 1 attribute then perform backward elimination
982        if len(attr) >= 2:
983            minG = 1000
984            worstAt = attr[0]
985            ll_Best = ll_Old
986            length_Best = length_Old
987            for at in attr:
988                # check all attribute whether its presence enough increases LL?
989
990                tempAttr = filter(lambda x: x!=at, attr)
991                tempDomain = Orange.core.Domain(tempAttr,examples.domain.classVar)
993                # domain, calculate P for LL improvement.
994                tempDomain  = continuizer(Orange.core.Preprocessor_dropMissing(examples.select(tempDomain)))
995                tempData = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
996
997                ll_Delete = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData)
998                length_Delete = float(len(tempData))
999                length_Avg = (length_Delete + length_Old)/2.0
1000
1001                G=-2*length_Avg*(ll_Delete/length_Delete-ll_Old/length_Old)
1002
1003                # set new worst attribute
1004                if G<minG:
1005                    worstAt = at
1006                    minG=G
1007                    ll_Best = ll_Delete
1008                    length_Best = length_Delete
1009            # deletion of attribute
1010
1011            if worstAt.varType==Orange.core.VarTypes.Continuous:
1012                P=lchisqprob(minG,1);
1013            else:
1014                P=lchisqprob(minG,len(worstAt.values)-1);
1015            if P>=self.deleteCrit:
1016                attr.remove(worstAt)
1017                remain_attr.append(worstAt)
1018                nodeletion=0
1019                ll_Old = ll_Best
1020                length_Old = length_Best
1021            else:
1022                nodeletion=1
1023        else:
1024            nodeletion = 1
1025            # END OF DELETION PART
1026
1027        # if enough attributes has been chosen, stop the procedure
1028        if self.numFeatures>-1 and len(attr)>=self.numFeatures:
1029            remain_attr=[]
1030
1031        # for each attribute in the remaining
1032        maxG=-1
1033        ll_Best = ll_Old
1034        length_Best = length_Old
1035        bestAt = None
1036        for at in remain_attr:
1037            tempAttr = attr + [at]
1038            tempDomain = Orange.core.Domain(tempAttr,examples.domain.classVar)
1040            # domain, calculate P for LL improvement.
1041            tempDomain  = continuizer(Orange.core.Preprocessor_dropMissing(examples.select(tempDomain)))
1042            tempData = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))
1043            ll_New = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData)
1044
1045            length_New = float(len(tempData)) # get number of examples in tempData to normalize likelihood
1046
1047            # P=PR(CHI^2>G), G=-2(L(0)-L(1))=2(E(0)-E(1))
1048            length_avg = (length_New + length_Old)/2
1049            G=-2*length_avg*(ll_Old/length_Old-ll_New/length_New);
1050            if G>maxG:
1051                bestAt = at
1052                maxG=G
1053                ll_Best = ll_New
1054                length_Best = length_New
1055        if not bestAt:
1056            stop = 1
1057            continue
1058
1059        if bestAt.varType==Orange.core.VarTypes.Continuous:
1060            P=lchisqprob(maxG,1);
1061        else:
1062            P=lchisqprob(maxG,len(bestAt.values)-1);
1063        # Add attribute with smallest P to attributes(attr)
1065            attr.append(bestAt)
1066            remain_attr.remove(bestAt)
1067            ll_Old = ll_Best
1068            length_Old = length_Best
1069
1070        if (P>self.addCrit and nodeletion) or (bestAt == worstAt):
1071            stop = 1
1072
1073    return attr
1074
1075
1076class StepWiseFSSFilter(object):
1077    def __new__(cls, instances=None, **argkw):
1078        self = object.__new__(cls, **argkw)
1079        if instances:
1080            self.__init__(**argkw)
1081            return self.__call__(instances)
1082        else:
1083            return self
1084
1085    def __init__(self, addCrit=0.2, deleteCrit=0.3, numFeatures = -1):
1087        self.deleteCrit = deleteCrit
1088        self.numFeatures = numFeatures
1089
1090    def __call__(self, examples):
1092        return examples.select(Orange.core.Domain(attr, examples.domain.classVar))
1093
1094
1095####################################
1096##  PROBABILITY CALCULATIONS
1097
1098def lchisqprob(chisq,df):
1099    """
1100    Return the (1-tailed) probability value associated with the provided
1101    chi-square value and df.  Adapted from chisq.c in Gary Perlman's |Stat.
1102    """
1103    BIG = 20.0
1104    def ex(x):
1105        BIG = 20.0
1106        if x < -BIG:
1107            return 0.0
1108        else:
1109            return math.exp(x)
1110    if chisq <=0 or df < 1:
1111        return 1.0
1112    a = 0.5 * chisq
1113    if df%2 == 0:
1114        even = 1
1115    else:
1116        even = 0
1117    if df > 1:
1118        y = ex(-a)
1119    if even:
1120        s = y
1121    else:
1122        s = 2.0 * zprob(-math.sqrt(chisq))
1123    if (df > 2):
1124        chisq = 0.5 * (df - 1.0)
1125        if even:
1126            z = 1.0
1127        else:
1128            z = 0.5
1129        if a > BIG:
1130            if even:
1131                e = 0.0
1132            else:
1133                e = math.log(math.sqrt(math.pi))
1134            c = math.log(a)
1135            while (z <= chisq):
1136                e = math.log(z) + e
1137                s = s + ex(c*z-a-e)
1138                z = z + 1.0
1139            return s
1140        else:
1141            if even:
1142                e = 1.0
1143            else:
1144                e = 1.0 / math.sqrt(math.pi) / math.sqrt(a)
1145            c = 0.0
1146            while (z <= chisq):
1147                e = e * (a/float(z))
1148                c = c + e
1149                z = z + 1.0
1150            return (c*y+s)
1151    else:
1152        return s
1153
1154
1155def zprob(z):
1156    """
1157    Returns the area under the normal curve 'to the left of' the given z value.
1158    Thus::
1159
1160    for z<0, zprob(z) = 1-tail probability
1161    for z>0, 1.0-zprob(z) = 1-tail probability
1162    for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability
1163
1164    Adapted from z.c in Gary Perlman's |Stat.
1165    """
1166    Z_MAX = 6.0    # maximum meaningful z-value
1167    if z == 0.0:
1168    x = 0.0
1169    else:
1170    y = 0.5 * math.fabs(z)
1171    if y >= (Z_MAX*0.5):
1172        x = 1.0
1173    elif (y < 1.0):
1174        w = y*y
1175        x = ((((((((0.000124818987 * w
1176            -0.001075204047) * w +0.005198775019) * w
1177              -0.019198292004) * w +0.059054035642) * w
1178            -0.151968751364) * w +0.319152932694) * w
1179          -0.531923007300) * w +0.797884560593) * y * 2.0
1180    else:
1181        y = y - 2.0
1182        x = (((((((((((((-0.000045255659 * y
1183                 +0.000152529290) * y -0.000019538132) * y
1184               -0.000676904986) * y +0.001390604284) * y
1185             -0.000794620820) * y -0.002034254874) * y
1186               +0.006549791214) * y -0.010557625006) * y
1187             +0.011630447319) * y -0.009279453341) * y
1188           +0.005353579108) * y -0.002141268741) * y
1189         +0.000535310849) * y +0.999936657524
1190    if z > 0.0:
1191    prob = ((x+1.0)*0.5)
1192    else:
1193    prob = ((1.0-x)*0.5)
1194    return prob
1195
1196
Note: See TracBrowser for help on using the repository browser.