source: orange/orange/Orange/optimization/__init__.py @ 7616:c392c6b940c3

1"""
2.. index:: optimization
3
4Wrappers for Tuning Parameters and Thresholds
5
6Classes for two very useful purposes: tuning learning algorithm's parameters
7using internal validation and tuning the threshold for classification into
8positive class.
9
10*****************
11Tuning parameters
12*****************
13
14Two classes support tuning parameters.
15:obj:`Orange.optimization.Tune1Parameter` for fitting a single parameter and
16:obj:`Orange.optimization.TuneMParameters` fitting multiple parameters at once,
17trying all possible combinations. When called with examples and, optionally, id
18of meta attribute with weights, they find the optimal setting of arguments
19using the cross validation. The classes can also be used as ordinary learning
20algorithms - they are in fact derived from
21:obj:`Orange.classification.Learner`.
22
23Both classes have a common parent, :obj:`Orange.optimization.TuneParameters`,
24and a few common attributes.
25
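For a first impression, here is a minimal sketch of fitting a single parameter
(``data`` stands for any loaded data table; the tree learner and its
``minSubset`` parameter are just one possible choice)::

    import Orange

    learner = Orange.classification.tree.TreeLearner()
    tuner = Orange.optimization.Tune1Parameter(object=learner,
                parameter="minSubset",
                values=[1, 2, 3, 5, 10])
    classifier = tuner(data)  # a classifier fitted with the best value
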
.. autoclass:: Orange.optimization.TuneParameters
   :members:

.. autoclass:: Orange.optimization.Tune1Parameter
   :members:

.. autoclass:: Orange.optimization.TuneMParameters
   :members:

**************************
Setting Optimal Thresholds
**************************

Some models may perform well in terms of AUC, which measures the ability to
distinguish between examples of two classes, but have low classification
accuracies. The reason may be in the threshold: in binary problems,
classifiers usually classify into the more probable class, while sometimes,
when class distributions are highly skewed, a modified threshold would give
better accuracies. Here are two classes that can help.

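A minimal sketch of the intended use (assuming ``data`` is a binary-class
data table)::

    learner = Orange.optimization.ThresholdLearner(
        learner=Orange.classification.bayes.NaiveLearner())
    classifier = learner(data)
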
.. autoclass:: Orange.optimization.ThresholdLearner
   :members:

.. autoclass:: Orange.optimization.ThresholdClassifier
   :members:

Examples
========

This is how you use the learner.

part of `optimization-thresholding1.py`_

.. literalinclude:: code/optimization-thresholding1.py

The output::

    W/out threshold adjustement: 0.633
    With adjusted thredhold: 0.659
    With threshold at 0.80: 0.449

shows that fitting the threshold is good (well, although a 2.5 percent
increase in accuracy absolutely guarantees you a publication at ICML, the
difference is still unimportant), while setting it at 80% is a bad idea. Or
is it?

part of `optimization-thresholding2.py`_

.. literalinclude:: code/optimization-thresholding2.py

The script first divides the data into training and testing examples. It
trains a naive Bayesian classifier and then wraps it into
:obj:`Orange.optimization.ThresholdClassifier` with thresholds of .2, .5 and
.8. The three models are tested on the left-out examples, and we compute the
confusion matrices from the results. The printout::

    0.20: TP 60.000, TN 1.000
    0.50: TP 42.000, TN 24.000
    0.80: TP 2.000, TN 43.000

shows how the varying threshold changes the balance between the number of true
positives and negatives.

.. autoclass:: Orange.optimization.PreprocessedLearner
   :members:

.. _optimization-thresholding1.py: code/optimization-thresholding1.py
.. _optimization-thresholding2.py: code/optimization-thresholding2.py

"""

import Orange.core
import Orange.classification
import Orange.data
import Orange.evaluation.scoring
import Orange.evaluation.testing
import Orange.misc

class TuneParameters(Orange.classification.Learner):
    
    """.. attribute:: examples
    
        Data table with either discrete or continuous features.
    
    .. attribute:: weightID
    
        The ID of the weight meta attribute.
    
    .. attribute:: object
    
        The learning algorithm whose parameters are to be tuned. This can be,
        for instance, :obj:`Orange.classification.tree.TreeLearner`. You will
        usually use the wrapped learners from modules, not the built-in
        classifiers, such as :obj:`Orange.classification.tree.TreeLearner`
        directly, since the arguments to be fitted are easier to address in
        the wrapped versions. But in principle it doesn't matter.
    
    .. attribute:: evaluate
    
        The statistic used to evaluate the performance. The default is
        :obj:`Orange.evaluation.scoring.CA`, so the learner will be fitted for
        the optimal classification accuracy. You can replace it with, for
        instance, :obj:`Orange.evaluation.scoring.AUC` to optimize the AUC.
        Statistics can return either a single value (classification accuracy),
        a list with a single value (this is what
        :obj:`Orange.evaluation.scoring.CA` actually does), or arbitrary
        objects which the compare function below must be able to compare.
    
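        For instance, to optimize the AUC instead of classification accuracy
        (a sketch; ``tuner`` is any of the tuning learners described below)::

            tuner.evaluate = Orange.evaluation.scoring.AUC
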
    .. attribute:: folds
    
        The number of folds used in internal cross-validation. Default is 5.
    
    .. attribute:: compare
    
        The function used to compare the results. The function should accept
        two arguments (e.g. two classification accuracies, AUCs or whatever
        the result of evaluate is) and return a positive value if the first
        argument is better, 0 if they are equal and a negative value if the
        first is worse than the second. The default compare function is cmp.
        You don't need to change this if evaluate is such that higher values
        mean a better classifier.
    
    .. attribute:: returnWhat
    
        Decides what the result of tuning should be. Possible values are:
    
        * TuneParameters.returnNone (or 0): tuning returns nothing,
        * TuneParameters.returnParameters (or 1): it returns the optimal
          value(s) of parameter(s),
        * TuneParameters.returnLearner (or 2): it returns the learner set to
          the optimal parameters,
        * TuneParameters.returnClassifier (or 3): it returns a classifier
          trained with the optimal parameters on the entire data set. This is
          the default setting.
        
        Regardless of this, the learner (given as object) is left set to the
        optimal parameters.
    
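        For example, to get just the best parameter value instead of a
        classifier (a sketch; ``learner`` stands for the wrapped learner)::

            tuner = Orange.optimization.Tune1Parameter(object=learner,
                        parameter="minSubset", values=[1, 5, 10],
                        returnWhat=Orange.optimization.Tune1Parameter.returnParameters)
            best = tuner(data)
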
    .. attribute:: verbose
    
        If 0 (default), the class doesn't print anything. If set to 1, it
        prints out the optimal value found; if set to 2, it prints out all
        the tried values and the related results.
    
    If the tuner returns the classifier, it behaves as a learning algorithm:
    as the examples below demonstrate, it can be called with examples and the
    result is a "trained" classifier. It can, for instance, be used in
    cross-validation.

    Out of these attributes, the only mandatory one is object. The concrete
    tuning classes add two more - the attributes that tell which parameter(s)
    to optimize and which values to try.
    
    """
    
    returnNone = 0
    returnParameters = 1
    returnLearner = 2
    returnClassifier = 3
    
    def __new__(cls, examples=None, weightID=0, **argkw):
        self = Orange.classification.Learner.__new__(cls, **argkw)
        self.__dict__.update(argkw)
        if examples:
            return self.__call__(examples, weightID)
        else:
            return self

    def findobj(self, name):
        # resolve a possibly dotted parameter name relative to self.object;
        # returns the object holding the attribute and the attribute's name
        names = name.split(".")
        lastobj = self.object
        for i in names[:-1]:
            lastobj = getattr(lastobj, i)
        return lastobj, names[-1]
        
class Tune1Parameter(TuneParameters):
    
    """Class :obj:`Orange.optimization.Tune1Parameter` tunes a single
    parameter.
    
    .. attribute:: parameter
    
        The name of the parameter (or a list of names, if the same parameter
        is stored at multiple places - see the examples) to be tuned.
    
    .. attribute:: values
    
        A list of the parameter's values to be tried.
    
    To show how it works, we shall fit the minimal number of examples in a
    leaf for a tree classifier.
    
    part of `optimization-tuning1.py`_

    .. literalinclude:: code/optimization-tuning1.py
        :lines: 3-11

    Set up like this, when the tuner is called it sets learner.minSubset to 1,
    2, 3, 4, 5, 10, 15 and 20, and measures the AUC in 5-fold cross
    validation. It then resets learner.minSubset to the optimal value found
    and, since we left returnWhat at the default (returnClassifier),
    constructs and returns the classifier from the entire data set. So, what
    we get is a classifier, but if we'd also like to know what the optimal
    value was, we can get it from learner.minSubset.

    Tuning is of course not limited to setting numeric parameters. You can,
    for instance, try to find the optimal criterion for assessing the quality
    of attributes by tuning parameter="measure", trying settings like
    values=[orange.MeasureAttribute_gainRatio(),
    orange.MeasureAttribute_gini()], as sketched below.
    
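    A sketch of such a setup (hypothetical; it assumes the old ``orange``
    module is imported alongside :obj:`Orange`)::

        learner = Orange.classification.tree.TreeLearner()
        tuner = Orange.optimization.Tune1Parameter(object=learner,
                    parameter="measure",
                    values=[orange.MeasureAttribute_gainRatio(),
                            orange.MeasureAttribute_gini()])
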
    Since the tuner returns a classifier and thus behaves like a learner, it
    can be used in a cross-validation. Let us see whether tuning a tree indeed
    enhances the AUC or not. We shall reuse the tuner from above, add another
    tree learner, and test them both.
    
    part of `optimization-tuning1.py`_

    .. literalinclude:: code/optimization-tuning1.py
        :lines: 13-18
    
    This will take some time: for each of 8 values for minSubset it will
    perform 5-fold cross validation inside a 10-fold cross validation -
    altogether 400 trees. Plus, it will learn the optimal tree afterwards for
    each fold. Add a tree without tuning, and you get 420 trees built.
    
    Well, not that long, and the results are good::
    
        Untuned tree: 0.930
        Tuned tree: 0.986
    
    .. _optimization-tuning1.py: code/optimization-tuning1.py
    
    """
    
    def __call__(self, table, weight=None, verbose=0):
        verbose = verbose or getattr(self, "verbose", 0)
        evaluate = getattr(self, "evaluate", Orange.evaluation.scoring.CA)
        folds = getattr(self, "folds", 5)
        compare = getattr(self, "compare", cmp)
        returnWhat = getattr(self, "returnWhat", 
                             Tune1Parameter.returnClassifier)

        if isinstance(self.parameter, (list, tuple)):
            to_set = [self.findobj(ld) for ld in self.parameter]
        else:
            to_set = [self.findobj(self.parameter)]

        cvind = Orange.core.MakeRandomIndicesCV(table, folds)
        # honour the documented compare attribute when picking the winner
        findBest = Orange.misc.selection.BestOnTheFly(compare,
                                         seed=table.checksum(), 
                                         callCompareOn1st=True)
        tableAndWeight = weight and (table, weight) or table
        for par in self.values:
            for i in to_set:
                setattr(i[0], i[1], par)
            res = evaluate(Orange.evaluation.testing.testWithIndices(
                                        [self.object], tableAndWeight, cvind))
            findBest.candidate((res, par))
            if verbose == 2:
                print "*** optimization %s: %s" % (par, res)

        bestpar = findBest.winner()[1]
        for i in to_set:
            setattr(i[0], i[1], bestpar)

        if verbose:
            print "*** Optimal parameter: %s = %s" % (self.parameter, bestpar)

        if returnWhat == Tune1Parameter.returnNone:
            return None
        elif returnWhat == Tune1Parameter.returnParameters:
            return bestpar
        elif returnWhat == Tune1Parameter.returnLearner:
            return self.object
        else:
            classifier = self.object(table)
            classifier.setattr("fittedParameter", bestpar)
            return classifier

class TuneMParameters(TuneParameters):
    
    """The use of :obj:`Orange.optimization.TuneMParameters` differs from
    :obj:`Orange.optimization.Tune1Parameter` only in the specification of
    tuning parameters.
    
    .. attribute:: parameters
    
        A list of two-element tuples, each containing the name of a parameter
        and its possible values.
    
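        For example (a hypothetical sketch; the names must be attributes of
        the wrapped learner)::

            tuner.parameters = [("minSubset", [2, 5, 10, 20]),
                                ("maxDepth", [5, 10])]
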
    For exercise we can try to tune both settings mentioned above, the minimal
    number of examples in leaves and the splitting criterion, by setting the
    tuner as follows:
    
    `optimization-tuningm.py`_

    .. literalinclude:: code/optimization-tuningm.py
        
    Everything else stays as above, in the examples for
    :obj:`Orange.optimization.Tune1Parameter`.
    
    .. _optimization-tuningm.py: code/optimization-tuningm.py
        
    """
    
    def __call__(self, table, weight=None, verbose=0):
        evaluate = getattr(self, "evaluate", Orange.evaluation.scoring.CA)
        folds = getattr(self, "folds", 5)
        compare = getattr(self, "compare", cmp)
        verbose = verbose or getattr(self, "verbose", 0)
        returnWhat = getattr(self, "returnWhat",
                             TuneParameters.returnClassifier)
        progressCallback = getattr(self, "progressCallback", lambda i: None)
        
        to_set = []
        parnames = []
        for par in self.parameters:
            if isinstance(par[0], (list, tuple)):
                to_set.append([self.findobj(ld) for ld in par[0]])
                parnames.append(par[0])
            else:
                to_set.append([self.findobj(par[0])])
                parnames.append([par[0]])

        cvind = Orange.core.MakeRandomIndicesCV(table, folds)
        # honour the documented compare attribute when picking the winner
        findBest = Orange.misc.selection.BestOnTheFly(compare,
                                         seed=table.checksum(), 
                                         callCompareOn1st=True)
        tableAndWeight = weight and (table, weight) or table
        # the counter below enumerates all combinations of parameter values,
        # so the total number of tests is the product of the value-list
        # lengths, not their sum
        numOfTests = reduce(lambda x, y: x * y,
                            [len(x[1]) for x in self.parameters], 1)
        milestones = set(range(0, numOfTests, max(numOfTests / 100, 1)))
        for itercount, valueindices in enumerate(Orange.misc.counters.LimitedCounter( \
                                        [len(x[1]) for x in self.parameters])):
            values = [self.parameters[i][1][x] for i, x \
                      in enumerate(valueindices)]
            for pi, value in enumerate(values):
                for i, par in enumerate(to_set[pi]):
                    setattr(par[0], par[1], value)
                    if verbose == 2:
                        print "%s: %s" % (parnames[pi][i], value)
                        
            res = evaluate(Orange.evaluation.testing.testWithIndices(
                                        [self.object], tableAndWeight, cvind))
            if itercount in milestones:
                progressCallback(100.0 * itercount / numOfTests)
            
            findBest.candidate((res, values))
            if verbose == 2:
                print "===> Result: %s\n" % res

        bestpar = findBest.winner()[1]
        if verbose:
            print "*** Optimal set of parameters: ",
        for pi, value in enumerate(bestpar):
            for i, par in enumerate(to_set[pi]):
                setattr(par[0], par[1], value)
                if verbose:
                    print "%s: %s" % (parnames[pi][i], value),
        if verbose:
            print
        
        if returnWhat == TuneParameters.returnNone:
            return None
        elif returnWhat == TuneParameters.returnParameters:
            return bestpar
        elif returnWhat == TuneParameters.returnLearner:
            return self.object
        else:
            classifier = self.object(table)
            classifier.setattr("fittedParameters", bestpar)
            return classifier

class ThresholdLearner(Orange.classification.Learner):
    
    """:obj:`Orange.optimization.ThresholdLearner` is a class that wraps
    another learner. When given the data, it calls the wrapped learner to
    build a classifier, then it uses the classifier to predict the class
    probabilities on the training examples. From the stored probabilities, it
    computes the threshold that would give the optimal classification
    accuracy, and wraps the classifier and the threshold into an instance of
    :obj:`Orange.optimization.ThresholdClassifier`.

    Note that the learner doesn't perform internal cross-validation. Also, the
    learner doesn't work for multivalued classes. If you don't understand why,
    think harder. If you still don't, try to program it yourself; this should
    help. :)

    :obj:`Orange.optimization.ThresholdLearner` has the same interface as any
    learner: if the constructor is given examples, it returns a classifier,
    else it returns a learner. It has two attributes.
    
    .. attribute:: learner
    
        The wrapped learner, for example an instance of
        :obj:`Orange.classification.bayes.NaiveLearner`.
    
    .. attribute:: storeCurve
    
        If set, the resulting classifier will contain an attribute curve, with
        a list of tuples containing thresholds and classification accuracies
        at that threshold.
    
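        A sketch of inspecting the stored curve (assuming ``data`` is a
        binary-class data table)::

            learner = Orange.optimization.ThresholdLearner(
                learner=Orange.classification.bayes.NaiveLearner(),
                storeCurve=1)
            classifier = learner(data)
            for threshold, CA in classifier.curve:
                print "%.3f: %.3f" % (threshold, CA)
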
    """
    
    def __new__(cls, examples=None, weightID=0, **kwds):
        self = Orange.classification.Learner.__new__(cls, **kwds)
        self.__dict__.update(kwds)
        if examples:
            return self.__call__(examples, weightID)
        else:
            return self

    def __call__(self, examples, weightID=0):
        if not hasattr(self, "learner"):
            raise AttributeError("learner not set")
        
        classifier = self.learner(examples, weightID)
        threshold, optCA, curve = Orange.wrappers.ThresholdCA(classifier, 
                                                          examples, 
                                                          weightID)
        if getattr(self, "storeCurve", 0):
            return ThresholdClassifier(classifier, threshold, curve=curve)
        else:
            return ThresholdClassifier(classifier, threshold)

class ThresholdClassifier(Orange.classification.Classifier):
    
    """:obj:`Orange.optimization.ThresholdClassifier`, used by both
    :obj:`Orange.optimization.ThresholdLearner` and
    :obj:`Orange.optimization.ThresholdLearner_fixed`, is therefore another
    wrapper class, containing a classifier and a threshold. When it needs to
    classify an example, it calls the wrapped classifier to predict
    probabilities. The example is classified into the second class only if
    the probability of that class is above the threshold.

    .. attribute:: classifier
    
        The wrapped classifier, normally the one built by the
        ThresholdLearner's wrapped learner, e.g. a naive Bayesian classifier.
    
    .. attribute:: threshold
    
        The threshold for classification into the second class.
    
    The two attributes can be set as attributes or given to the constructor as
    ordinary arguments.
    
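    A minimal sketch of manual use (assuming ``classifier`` is any binary
    probabilistic classifier and ``example`` an example from its domain)::

        tc = Orange.optimization.ThresholdClassifier(classifier, 0.3)
        value = tc(example)
        value, probs = tc(example, Orange.classification.Classifier.GetBoth)
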
    """
    
    def __init__(self, classifier, threshold, **kwds):
        self.classifier = classifier
        self.threshold = threshold
        self.__dict__.update(kwds)

    def __call__(self, example, what=Orange.classification.Classifier.GetValue):
        probs = self.classifier(example, self.GetProbabilities)
        if what == self.GetProbabilities:
            return probs
        value = Orange.data.Value(self.classifier.classVar, probs[1] > \
                                  self.threshold)
        if what == Orange.classification.Classifier.GetValue:
            return value
        else:
            return (value, probs)

class ThresholdLearner_fixed(Orange.classification.Learner):
    
    """There's also a dumb variant of
    :obj:`Orange.optimization.ThresholdLearner`, a class called
    :obj:`Orange.optimization.ThresholdLearner_fixed`. Instead of finding the
    optimal threshold it uses a prescribed one. So, it has the following two
    attributes.
    
    .. attribute:: learner
    
        The wrapped learner, for example an instance of
        :obj:`Orange.classification.bayes.NaiveLearner`.
    
    .. attribute:: threshold
    
        The threshold to use in classification.
    
    What this learner does is therefore simple: to learn, it calls the wrapped
    learner and puts the resulting classifier together with the threshold into
    an instance of :obj:`Orange.optimization.ThresholdClassifier`.
    
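    A minimal sketch (assuming ``data`` is a binary-class data table)::

        learner = Orange.optimization.ThresholdLearner_fixed(
            learner=Orange.classification.bayes.NaiveLearner(),
            threshold=0.3)
        classifier = learner(data)
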
    """
    
    def __new__(cls, examples=None, weightID=0, **kwds):
        self = Orange.classification.Learner.__new__(cls, **kwds)
        self.__dict__.update(kwds)
        if examples:
            return self.__call__(examples, weightID)
        else:
            return self

    def __call__(self, examples, weightID=0):
        if not hasattr(self, "learner"):
            raise AttributeError("learner not set")
        if not hasattr(self, "threshold"):
            raise AttributeError("threshold not set")
        if len(examples.domain.classVar.values) != 2:
            raise ValueError("ThresholdLearner handles binary classes only")
        
        return ThresholdClassifier(self.learner(examples, weightID), 
                                   self.threshold)

class PreprocessedLearner(object):
    
    """A wrapper that passes data through the given preprocessor (or a list
    of preprocessors, applied in order) before handing it to the wrapped
    learner. The wrapped learner can be used like any other learner.
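
    A minimal sketch (``Preprocessor_dropMissing`` is just one possible
    preprocessor here; ``data`` stands for a loaded data table)::

        pp = Orange.optimization.PreprocessedLearner(
            Orange.core.Preprocessor_dropMissing())
        learner = pp.wrapLearner(Orange.classification.bayes.NaiveLearner())
        classifier = learner(data)

    """
    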
    def __new__(cls, preprocessor=None, learner=None):
        self = object.__new__(cls)
        if learner is not None:
            self.__init__(preprocessor)
            return self.wrapLearner(learner)
        else:
            return self
        
    def __init__(self, preprocessor=None, learner=None):
        if isinstance(preprocessor, list):
            self.preprocessors = preprocessor
        elif preprocessor is not None:
            self.preprocessors = [preprocessor]
        else:
            self.preprocessors = []
        if learner:
            self.wrapLearner(learner)
        
    def processData(self, data, weightId=None):
        # run the data (and, if given, the weight id) through all
        # preprocessors; a preprocessor may return either a data table
        # or a (data, weightId) tuple
        hadWeight = hasWeight = weightId is not None
        for preprocessor in self.preprocessors:
            if hasWeight:
                t = preprocessor(data, weightId) 
            else:
                t = preprocessor(data)
                
            if isinstance(t, tuple):
                data, weightId = t
                hasWeight = True
            else:
                data = t
        if hadWeight:
            return data, weightId
        else:
            return data

    def wrapLearner(self, learner):
        # build a learner subclass that preprocesses the data before
        # training; __reduce__ keeps the wrapper picklable
        class WrappedLearner(learner.__class__):
            preprocessor = self
            wrappedLearner = learner
            name = getattr(learner, "name", "")
            def __call__(self, data, weightId=0, getData=False):
                t = self.preprocessor.processData(data, weightId or 0)
                processed, procW = t if isinstance(t, tuple) else (t, 0)
                classifier = self.wrappedLearner(processed, procW)
                if getData:
                    return classifier, processed
                else:
                    return classifier
                
            def __reduce__(self):
                return PreprocessedLearner, (self.preprocessor.preprocessors, \
                                             self.wrappedLearner)
            
            def __getattr__(self, name):
                return getattr(learner, name)
            
        return WrappedLearner()