source: orange/orange/optimization/__init__.py @ 9669:165371b04b4a

1"""
2.. index:: optimization
3
4Wrappers for Tuning Parameters and Thresholds
5
6Classes for two very useful purposes: tuning learning algorithm's parameters
7using internal validation and tuning the threshold for classification into
8positive class.
9
10*****************
11Tuning parameters
12*****************
13
14Two classes support tuning parameters.
15:obj:`Orange.optimization.Tune1Parameter` for fitting a single parameter and
16:obj:`Orange.optimization.TuneMParameters` fitting multiple parameters at once,
17trying all possible combinations. When called with examples and, optionally, id
18of meta attribute with weights, they find the optimal setting of arguments
19using the cross validation. The classes can also be used as ordinary learning
20algorithms - they are in fact derived from
21:obj:`Orange.classification.Learner`.
22
23Both classes have a common parent, :obj:`Orange.optimization.TuneParameters`,
24and a few common attributes.
25
26.. autoclass:: Orange.optimization.TuneParameters
27   :members:
28
29.. autoclass:: Orange.optimization.Tune1Parameter
30   :members:
31 
32.. autoclass:: Orange.optimization.TuneMParameters
33   :members:
34   
35**************************
36Setting Optimal Thresholds
37**************************

Some models may perform well in terms of AUC, which measures the ability to
distinguish between examples of two classes, but still have a low
classification accuracy. The reason may be the threshold: in binary problems,
classifiers usually classify into the more probable class, but when class
distributions are highly skewed, a modified threshold can give better
accuracy. Here are two classes that can help.

.. autoclass:: Orange.optimization.ThresholdLearner
   :members:

.. autoclass:: Orange.optimization.ThresholdClassifier
   :members:

Examples
========

This is how you use the learner.

part of :download:`optimization-thresholding1.py <code/optimization-thresholding1.py>`

.. literalinclude:: code/optimization-thresholding1.py
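
Roughly, such a script might look like this: a sketch only, since the included
file is not shown here - the data set name and the ``cross_validation``/``CA``
helpers are assumptions, and the actual file may differ::

    import Orange

    data = Orange.data.Table("bupa")  # assumed data set
    learners = [Orange.classification.bayes.NaiveLearner(),
                Orange.optimization.ThresholdLearner(
                    learner=Orange.classification.bayes.NaiveLearner()),
                Orange.optimization.ThresholdLearner_fixed(
                    learner=Orange.classification.bayes.NaiveLearner(),
                    threshold=0.8)]
    res = Orange.evaluation.testing.cross_validation(learners, data)
    CAs = Orange.evaluation.scoring.CA(res)
    print "W/out threshold adjustment: %5.3f" % CAs[0]
    print "With adjusted threshold: %5.3f" % CAs[1]
    print "With threshold at 0.80: %5.3f" % CAs[2]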

The output::

    W/out threshold adjustment: 0.633
    With adjusted threshold: 0.659
    With threshold at 0.80: 0.449

shows that fitting the threshold is good (well, although a 2.5 percent
increase in accuracy absolutely guarantees you a publication at ICML, the
difference is still unimportant), while setting the threshold at 80% is a bad
idea. Or is it?

part of :download:`optimization-thresholding2.py <code/optimization-thresholding2.py>`

.. literalinclude:: code/optimization-thresholding2.py
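
Again the file itself is not shown here; based on the description below, it
might look roughly like this (a sketch - the data set name and the
``test_on_data``/``confusion_matrices`` helpers are assumptions)::

    import Orange

    data = Orange.data.Table("bupa")  # assumed data set
    ri2 = Orange.core.MakeRandomIndices2(data, 0.7)
    train = data.select(ri2, 0)
    test = data.select(ri2, 1)

    bayes = Orange.classification.bayes.NaiveLearner(train)

    thresholds = [.2, .5, .8]
    models = [Orange.optimization.ThresholdClassifier(bayes, thr)
              for thr in thresholds]

    res = Orange.evaluation.testing.test_on_data(models, test)
    cm = Orange.evaluation.scoring.confusion_matrices(res)

    for i, thr in enumerate(thresholds):
        print "%1.2f: TP %5.3f, TN %5.3f" % (thr, cm[i].TP, cm[i].TN)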

The script first divides the data into training and testing examples. It
trains a naive Bayesian classifier and then wraps it into
:obj:`Orange.optimization.ThresholdClassifier` instances with thresholds of
.2, .5 and .8. The three models are tested on the left-out examples, and we
compute the confusion matrices from the results. The printout::

    0.20: TP 60.000, TN 1.000
    0.50: TP 42.000, TN 24.000
    0.80: TP 2.000, TN 43.000

shows how the varying threshold changes the balance between the number of
true positives and negatives.

.. autoclass:: Orange.optimization.PreprocessedLearner
   :members:

"""

import Orange.core
import Orange.classification
import Orange.evaluation.scoring
import Orange.evaluation.testing
import Orange.misc

class TuneParameters(Orange.classification.Learner):

    """.. attribute:: examples

        Data table with either discrete or continuous features

    .. attribute:: weightID

        The ID of the weight meta attribute

    .. attribute:: object

        The learning algorithm whose parameters are to be tuned. This can be,
        for instance, :obj:`Orange.classification.tree.TreeLearner`. You will
        usually use the wrapped learners from modules rather than built-in
        classifiers such as :obj:`Orange.classification.tree.TreeLearner`
        directly, since the arguments to be fitted are easier to address in
        the wrapped versions, but in principle it doesn't matter.

    .. attribute:: evaluate

        The statistics to evaluate. The default is
        :obj:`Orange.evaluation.scoring.CA`, so the learner will be fit for
        optimal classification accuracy. You can replace it with, for
        instance, :obj:`Orange.evaluation.scoring.AUC` to optimize the AUC.
        Statistics can return either a single value (classification
        accuracy), a list with a single value (this is what
        :obj:`Orange.evaluation.scoring.CA` actually does), or arbitrary
        objects which the compare function below must be able to compare.

    .. attribute:: folds

        The number of folds used in internal cross-validation. Default is 5.

    .. attribute:: compare

        The function used to compare the results. The function should accept
        two arguments (e.g. two classification accuracies, AUCs or whatever
        the result of evaluate is) and return a positive value if the first
        argument is better, 0 if they are equal and a negative value if the
        first is worse than the second. The default compare function is cmp.
        You don't need to change this if evaluate is such that higher values
        mean a better classifier.

    .. attribute:: returnWhat

        Decides what should be the result of tuning. Possible values are:

        * TuneParameters.returnNone (or 0): tuning will return nothing,
        * TuneParameters.returnParameters (or 1): return the optimal value(s) of parameter(s),
        * TuneParameters.returnLearner (or 2): return the learner set to optimal parameters,
        * TuneParameters.returnClassifier (or 3): return a classifier trained with the optimal parameters on the entire data set. This is the default setting.

        Regardless of this, the learner (given as object) is left set to the
        optimal parameters.

    .. attribute:: verbose

        If 0 (default), the class doesn't print anything. If set to 1, it
        will print out the optimal value found; if set to 2, it will print
        out all tried values and the related evaluation results.

    If the tuner returns a classifier, it behaves as a learning algorithm: as
    the examples below demonstrate, it can be called with examples, and the
    result is a "trained" classifier. It can, for instance, be used in
    cross-validation.

    Of these attributes, the only necessary argument is object. The real
    tuning classes add two additional attributes: those that specify which
    parameter(s) to optimize and which values to try.
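
    For illustration, a minimal sketch of using returnWhat to obtain just the
    optimal parameter value instead of a trained classifier (the learner,
    parameter and data set below are assumptions, not fixed by this class)::

        learner = Orange.classification.tree.TreeLearner()
        tuner = Tune1Parameter(object=learner,
                               parameter="minSubset",
                               values=[1, 2, 5, 10],
                               returnWhat=TuneParameters.returnParameters)
        best = tuner(Orange.data.Table("voting"))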

    """

    returnNone = 0
    returnParameters = 1
    returnLearner = 2
    returnClassifier = 3

    def __new__(cls, examples=None, weightID=0, **argkw):
        self = Orange.classification.Learner.__new__(cls, **argkw)
        self.__dict__.update(argkw)
        if examples:
            return self.__call__(examples, weightID)
        else:
            return self

    def findobj(self, name):
        # Resolve a dotted parameter name and return the object holding the
        # attribute together with the attribute's name.
        names = name.split(".")
        lastobj = self.object
        for i in names[:-1]:
            lastobj = getattr(lastobj, i)
        return lastobj, names[-1]

class Tune1Parameter(TuneParameters):

    """Class :obj:`Orange.optimization.Tune1Parameter` tunes a single
    parameter.

    .. attribute:: parameter

        The name of the parameter (or a list of names, if the same parameter
        is stored at multiple places - see the examples) to be tuned.

    .. attribute:: values

        A list of parameter's values to be tried.

    To show how it works, we shall fit the minimal number of examples in a
    leaf for a tree classifier.

    part of :download:`optimization-tuning1.py <code/optimization-tuning1.py>`

    .. literalinclude:: code/optimization-tuning1.py
        :lines: 3-11
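
    Since the included lines are not shown in this source view, the setup
    described below might look roughly like this (a sketch; the data set name
    is an assumption)::

        learner = Orange.classification.tree.TreeLearner()
        data = Orange.data.Table("voting")
        tuner = Tune1Parameter(object=learner,
                               parameter="minSubset",
                               values=[1, 2, 3, 4, 5, 10, 15, 20],
                               evaluate=Orange.evaluation.scoring.AUC,
                               verbose=2)
        classifier = tuner(data)
        print "Optimal setting:", learner.minSubset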

    Set up like this, when the tuner is called it will set learner.minSubset
    to 1, 2, 3, 4, 5, 10, 15 and 20, and measure the AUC in 5-fold cross
    validation. It will then reset learner.minSubset to the optimal value
    found and, since we left returnWhat at the default (returnClassifier),
    construct and return a classifier trained with that value on the entire
    data set. So what we get is a classifier, but if we'd also like to know
    what the optimal value was, we can read it from learner.minSubset.

    Tuning is of course not limited to setting numeric parameters. You can,
    for instance, try to find the optimal criteria for assessing the quality
    of attributes by tuning parameter="measure", trying settings like
    values=[orange.MeasureAttribute_gainRatio(),
    orange.MeasureAttribute_gini()].
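
    Written out as code, such a tuner might be set up like this (a sketch
    using the measure names from the paragraph above)::

        tuner = Tune1Parameter(object=learner,
                               parameter="measure",
                               values=[orange.MeasureAttribute_gainRatio(),
                                       orange.MeasureAttribute_gini()])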

    Since the tuner returns a classifier and thus behaves like a learner, it
    can be used in a cross-validation. Let us see whether tuning the tree
    indeed enhances the AUC or not. We shall reuse the tuner from above, add
    another tree learner, and test them both.

    part of :download:`optimization-tuning1.py <code/optimization-tuning1.py>`

    .. literalinclude:: code/optimization-tuning1.py
        :lines: 13-18

    This will take some time: for each of the 8 values of minSubset it will
    perform 5-fold cross validation inside a 10-fold cross validation -
    altogether 400 trees. Plus, it will learn the optimal tree afterwards for
    each fold. Add a tree without tuning, and you get 420 trees built.

    Well, not that long, and the results are good::

        Untuned tree: 0.930
        Tuned tree: 0.986

    """

    def __call__(self, table, weight=None, verbose=0):
        verbose = verbose or getattr(self, "verbose", 0)
        evaluate = getattr(self, "evaluate", Orange.evaluation.scoring.CA)
        folds = getattr(self, "folds", 5)
        compare = getattr(self, "compare", cmp)
        returnWhat = getattr(self, "returnWhat",
                             Tune1Parameter.returnClassifier)

        if isinstance(self.parameter, (list, tuple)):
            to_set = [self.findobj(ld) for ld in self.parameter]
        else:
            to_set = [self.findobj(self.parameter)]

        cvind = Orange.core.MakeRandomIndicesCV(table, folds)
        findBest = Orange.misc.selection.BestOnTheFly(seed=table.checksum(),
                                                      callCompareOn1st=True)
        tableAndWeight = weight and (table, weight) or table
        for par in self.values:
            for i in to_set:
                setattr(i[0], i[1], par)
            res = evaluate(Orange.evaluation.testing.test_with_indices(
                                        [self.object], tableAndWeight, cvind))
            findBest.candidate((res, par))
            if verbose == 2:
                print '*** optimization %s: %s' % (par, res)

        bestpar = findBest.winner()[1]
        for i in to_set:
            setattr(i[0], i[1], bestpar)

        if verbose:
            print "*** Optimal parameter: %s = %s" % (self.parameter, bestpar)

        if returnWhat == Tune1Parameter.returnNone:
            return None
        elif returnWhat == Tune1Parameter.returnParameters:
            return bestpar
        elif returnWhat == Tune1Parameter.returnLearner:
            return self.object
        else:
            classifier = self.object(table)
            classifier.setattr("fittedParameter", bestpar)
            return classifier

class TuneMParameters(TuneParameters):

    """The use of :obj:`Orange.optimization.TuneMParameters` differs from
    :obj:`Orange.optimization.Tune1Parameter` only in the specification of
    tuning parameters.

    .. attribute:: parameters

        A list of two-element tuples, each containing the name of a parameter
        and its possible values.

    As an exercise, we can try to tune both settings mentioned above, the
    minimal number of examples in leaves and the splitting criteria, by
    setting the tuner as follows:

    :download:`optimization-tuningm.py <code/optimization-tuningm.py>`

    .. literalinclude:: code/optimization-tuningm.py
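
    As the file is not shown in this source view, a sketch of such a setup
    (the value lists are assumptions)::

        learner = Orange.classification.tree.TreeLearner()
        tuner = TuneMParameters(object=learner,
                                parameters=[("minSubset", [2, 5, 10, 20]),
                                            ("measure",
                                             [orange.MeasureAttribute_gainRatio(),
                                              orange.MeasureAttribute_gini()])])
        classifier = tuner(Orange.data.Table("voting"))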

    Everything else stays as above, in the examples for
    :obj:`Orange.optimization.Tune1Parameter`.

    """

    def __call__(self, table, weight=None, verbose=0):
        evaluate = getattr(self, "evaluate", Orange.evaluation.scoring.CA)
        folds = getattr(self, "folds", 5)
        compare = getattr(self, "compare", cmp)
        verbose = verbose or getattr(self, "verbose", 0)
        returnWhat = getattr(self, "returnWhat",
                             Tune1Parameter.returnClassifier)
        progressCallback = getattr(self, "progressCallback", lambda i: None)

        to_set = []
        parnames = []
        for par in self.parameters:
            if isinstance(par[0], (list, tuple)):
                to_set.append([self.findobj(ld) for ld in par[0]])
                parnames.append(par[0])
            else:
                to_set.append([self.findobj(par[0])])
                parnames.append([par[0]])

        cvind = Orange.core.MakeRandomIndicesCV(table, folds)
        findBest = Orange.misc.selection.BestOnTheFly(seed=table.checksum(),
                                                      callCompareOn1st=True)
        tableAndWeight = weight and (table, weight) or table
        # The number of combinations tried is the product (not the sum) of
        # the numbers of candidate values, since LimitedCounter enumerates
        # all combinations.
        numOfTests = 1
        for x in self.parameters:
            numOfTests *= len(x[1])
        milestones = set(range(0, numOfTests, max(numOfTests / 100, 1)))
        for itercount, valueindices in enumerate(Orange.misc.counters.LimitedCounter( \
                                        [len(x[1]) for x in self.parameters])):
            values = [self.parameters[i][1][x] for i, x \
                      in enumerate(valueindices)]
            for pi, value in enumerate(values):
                for i, par in enumerate(to_set[pi]):
                    setattr(par[0], par[1], value)
                    if verbose == 2:
                        print "%s: %s" % (parnames[pi][i], value)

            res = evaluate(Orange.evaluation.testing.test_with_indices(
                                        [self.object], tableAndWeight, cvind))
            if itercount in milestones:
                progressCallback(100.0 * itercount / numOfTests)

            findBest.candidate((res, values))
            if verbose == 2:
                print "===> Result: %s\n" % res

        bestpar = findBest.winner()[1]
        if verbose:
            print "*** Optimal set of parameters: ",
        for pi, value in enumerate(bestpar):
            for i, par in enumerate(to_set[pi]):
                setattr(par[0], par[1], value)
                if verbose:
                    print "%s: %s" % (parnames[pi][i], value),
        if verbose:
            print

        if returnWhat == Tune1Parameter.returnNone:
            return None
        elif returnWhat == Tune1Parameter.returnParameters:
            return bestpar
        elif returnWhat == Tune1Parameter.returnLearner:
            return self.object
        else:
            classifier = self.object(table)
            classifier.setattr("fittedParameters", bestpar)
            return classifier

class ThresholdLearner(Orange.classification.Learner):

    """:obj:`Orange.optimization.ThresholdLearner` is a class that wraps
    another learner. When given the data, it calls the wrapped learner to
    build a classifier, then it uses the classifier to predict the class
    probabilities on the training examples. From the stored probabilities, it
    computes the threshold that would give the optimal classification
    accuracy, and wraps the classifier and the threshold into an instance of
    :obj:`Orange.optimization.ThresholdClassifier`.

    Note that the learner doesn't perform internal cross-validation. Also,
    the learner doesn't work for multivalued classes. If you don't understand
    why, think harder. If you still don't, try to program it yourself, this
    should help. :)

    :obj:`Orange.optimization.ThresholdLearner` has the same interface as any
    learner: if the constructor is given examples, it returns a classifier,
    else it returns a learner. It has two attributes.

    .. attribute:: learner

        The wrapped learner, for example an instance of
        :obj:`Orange.classification.bayes.NaiveLearner`.

    .. attribute:: storeCurve

        If `True`, the resulting classifier will contain an attribute curve,
        with a list of tuples containing thresholds and classification
        accuracies at that threshold (default `False`).

    """

    def __new__(cls, examples=None, weightID=0, **kwds):
        self = Orange.classification.Learner.__new__(cls, **kwds)
        if examples:
            self.__init__(**kwds)
            return self.__call__(examples, weightID)
        else:
            return self

    def __init__(self, learner=None, storeCurve=False, **kwds):
        self.learner = learner
        self.storeCurve = storeCurve
        self.__dict__.update(kwds)

    def __call__(self, examples, weightID=0):
        if self.learner is None:
            raise AttributeError("Learner not set.")

        classifier = self.learner(examples, weightID)
        threshold, optCA, curve = Orange.wrappers.ThresholdCA(classifier,
                                                              examples,
                                                              weightID)
        if self.storeCurve:
            return ThresholdClassifier(classifier, threshold, curve=curve)
        else:
            return ThresholdClassifier(classifier, threshold)

class ThresholdClassifier(Orange.classification.Classifier):

    """:obj:`Orange.optimization.ThresholdClassifier`, used by both
    :obj:`Orange.optimization.ThresholdLearner` and
    :obj:`Orange.optimization.ThresholdLearner_fixed`, is therefore another
    wrapper class, containing a classifier and a threshold. When it needs to
    classify an example, it calls the wrapped classifier to predict
    probabilities. The example is classified into the second class only if
    the probability of that class is above the threshold.

    .. attribute:: classifier

        The wrapped classifier, normally the one related to the
        ThresholdLearner's learner, e.g. an instance of
        :obj:`Orange.classification.bayes.NaiveLearner`.

    .. attribute:: threshold

        The threshold for classification into the second class.

    The two attributes can be set as attributes or given to the constructor
    as ordinary arguments.
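
    For illustration, a minimal sketch (the wrapped classifier and data set
    are assumptions)::

        data = Orange.data.Table("voting")
        bayes = Orange.classification.bayes.NaiveLearner(data)
        tc = ThresholdClassifier(bayes, 0.3)
        print tc(data[0]), tc(data[0], tc.GetProbabilities)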

    """

    def __init__(self, classifier, threshold, **kwds):
        self.classifier = classifier
        self.threshold = threshold
        self.__dict__.update(kwds)

    def __call__(self, example, what=Orange.classification.Classifier.GetValue):
        probs = self.classifier(example, self.GetProbabilities)
        if what == self.GetProbabilities:
            return probs
        value = Orange.data.Value(self.classifier.classVar,
                                  probs[1] > self.threshold)
        if what == Orange.classification.Classifier.GetValue:
            return value
        else:
            return (value, probs)

class ThresholdLearner_fixed(Orange.classification.Learner):
    """There's also a dumb variant of
    :obj:`Orange.optimization.ThresholdLearner`, a class called
    :obj:`Orange.optimization.ThresholdLearner_fixed`. Instead of finding the
    optimal threshold it uses a prescribed one. So, it has the following two
    attributes.

    .. attribute:: learner

        The wrapped learner, for example an instance of
        :obj:`Orange.classification.bayes.NaiveLearner`.

    .. attribute:: threshold

        Threshold to use in classification.

    What this guy does is therefore simple: to learn, it calls the learner
    and puts the resulting classifier together with the threshold into an
    instance of ThresholdClassifier.
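
    A minimal sketch (the wrapped learner and data set are assumptions)::

        learner = ThresholdLearner_fixed(
            learner=Orange.classification.bayes.NaiveLearner(),
            threshold=0.8)
        classifier = learner(Orange.data.Table("voting"))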

    """
    def __new__(cls, examples=None, weightID=0, **kwds):
        self = Orange.classification.Learner.__new__(cls, **kwds)
        if examples:
            self.__init__(**kwds)
            return self.__call__(examples, weightID)
        else:
            return self

    def __init__(self, learner=None, threshold=None, **kwds):
        self.learner = learner
        self.threshold = threshold
        self.__dict__.update(kwds)

    def __call__(self, examples, weightID=0):
        if self.learner is None:
            raise AttributeError("Learner not set.")
        if self.threshold is None:
            raise AttributeError("Threshold not set.")
        if len(examples.domain.classVar.values) != 2:
            raise ValueError("ThresholdLearner_fixed handles binary classes only.")

        return ThresholdClassifier(self.learner(examples, weightID),
                                   self.threshold)

class PreprocessedLearner(object):
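    """Wraps a learner so that the data is passed through a chain of
    preprocessors before being used to induce a classifier. Each preprocessor
    may return either a new data table or a (data, weightId) tuple; the
    wrapped learner is then called on the preprocessed data.

    Can be constructed either with a preprocessor (or a list of
    preprocessors) and a learner at once, or used as a wrapper factory:
    PreprocessedLearner(preprocessor).wrapLearner(learner).
    """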
    def __new__(cls, preprocessor=None, learner=None):
        self = object.__new__(cls)
        if learner is not None:
            self.__init__(preprocessor)
            return self.wrapLearner(learner)
        else:
            return self

    def __init__(self, preprocessor=None, learner=None):
        if isinstance(preprocessor, list):
            self.preprocessors = preprocessor
        elif preprocessor is not None:
            self.preprocessors = [preprocessor]
        else:
            self.preprocessors = []
        if learner:
            self.wrapLearner(learner)

    def processData(self, data, weightId=None):
        hadWeight = hasWeight = weightId is not None
        for preprocessor in self.preprocessors:
            if hasWeight:
                t = preprocessor(data, weightId)
            else:
                t = preprocessor(data)

            # A preprocessor may return either a data table or a
            # (data, weightId) tuple.
            if isinstance(t, tuple):
                data, weightId = t
                hasWeight = True
            else:
                data = t
        if hadWeight:
            return data, weightId
        else:
            return data

    def wrapLearner(self, learner):
        class WrappedLearner(learner.__class__):
            preprocessor = self
            wrappedLearner = learner
            name = getattr(learner, "name", "")
            def __call__(self, data, weightId=0, getData=False):
                t = self.preprocessor.processData(data, weightId or 0)
                processed, procW = t if isinstance(t, tuple) else (t, 0)
                classifier = self.wrappedLearner(processed, procW)
                if getData:
                    return classifier, processed
                else:
                    return classifier

            def __reduce__(self):
                # Make wrapped learners picklable.
                return PreprocessedLearner, (self.preprocessor.preprocessors, \
                                             self.wrappedLearner)

            def __getattr__(self, name):
                return getattr(learner, name)

        return WrappedLearner()