source: orange/docs/tutorial/rst/python-learners.rst @ 11052:c22077a09e63
Revision 11052:c22077a09e63, checked in by blaz <blaz.zupan@…> ("new tutorial")

Learners in Python
==================

.. index::
   single: classifiers; in Python

Orange comes with plenty of classification and regression algorithms, but it is also fun to make new ones. You can build them from scratch, or wrap existing learners and add some preprocessing to construct new variants. Learners in Orange have to adhere to certain rules; let us observe them on a classification algorithm::

   >>> import Orange
   >>> data = Orange.data.Table("titanic")
   >>> learner = Orange.classification.logreg.LogRegLearner()
   >>> classifier = learner(data)
   >>> classifier(data[0])
   <orange.Value 'survived'='no'>

When a learner is given data, it returns a predictor; in our case, a classifier. Classifiers are passed data instances and return a class value. They can also return a probability distribution, or the class value together with the distribution::

   >>> classifier(data[0], Orange.classification.Classifier.GetProbabilities)
   <0.593, 0.407>
   >>> classifier(data[0], Orange.classification.Classifier.GetBoth)
   (<orange.Value 'survived'='no'>, <0.593, 0.407>)

Regression is similar, except that a regression model returns only the predicted continuous value.

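This learner/classifier calling convention is easy to mimic in plain Python. The sketch below is our own illustration, not Orange code: ``MajorityLearner`` and the ``GetValue``/``GetProbabilities``/``GetBoth`` flags are stand-ins for the protocol described above.

```python
# A standalone mimic of the learner/classifier protocol; all names here
# are our own stand-ins, not Orange's implementation.
from collections import Counter

GetValue, GetProbabilities, GetBoth = 0, 1, 2

class MajorityLearner:
    """Called with data, returns a classifier (a callable)."""
    def __call__(self, data):
        # data: list of (features, class_label) pairs
        counts = Counter(label for _, label in data)
        total = sum(counts.values())
        dist = {label: n / total for label, n in counts.items()}
        return MajorityClassifier(counts.most_common(1)[0][0], dist)

class MajorityClassifier:
    """Ignores the instance and predicts the majority class."""
    def __init__(self, value, dist):
        self.value, self.dist = value, dist
    def __call__(self, instance, result_type=GetValue):
        if result_type == GetProbabilities:
            return self.dist
        if result_type == GetBoth:
            return self.value, self.dist
        return self.value

data = [((1,), "no"), ((2,), "no"), ((3,), "yes")]
classifier = MajorityLearner()(data)
print(classifier(data[0]))                    # no
print(classifier(data[0], GetProbabilities))  # {'no': 0.666..., 'yes': 0.333...}
```

The point of the split into two classes is the same as in Orange: the learner holds what is needed to train, the classifier holds only what is needed to predict.
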
Notice also that the constructor of the learner can be given the data, and in that case it will construct and return a classifier (what else could it do?)::

   >>> classifier = Orange.classification.logreg.LogRegLearner(data)
   >>> classifier(data[42])
   <orange.Value 'survived'='no'>

Now we are ready to build our own learner. We will do this for a classification problem.

Classifier with Feature Selection
---------------------------------

Consider naive Bayesian classifiers. They perform well, but can lose accuracy when there are many features, especially when these are correlated. Feature selection can help: we wrap the naive Bayesian classifier with feature subset selection, so that it learns only from a few of the most informative features. We will assume the data contains only discrete features and will score them with information gain. Here is an example that sets the scorer (``gain``) and uses it to find the best five features of a classification data set:

.. literalinclude:: code/py-score-features.py
   :lines: 3-

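Information gain itself is simple enough to compute by hand. The following self-contained sketch (our own helper names, not Orange's scoring API) scores the discrete features of a toy data set and keeps the best one:

```python
# Score discrete features with information gain and keep the top k.
# entropy() and info_gain() are our own toy helpers, not Orange code.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature_index):
    # Gain = class entropy minus the weighted entropy of the groups
    # induced by the feature's values.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature_index]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data set: feature 0 determines the class, feature 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]
scores = [(info_gain(rows, labels, i), i) for i in range(2)]
best = [i for _, i in sorted(scores, reverse=True)[:1]]
print(best)  # [0]: feature 0 carries all the information
```

On this toy set the informative feature gets a gain of 1.0 bit and the noisy one a gain of 0.
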
We need to incorporate the feature selection within the learner, at the point where it receives the data. Learners for classification tasks inherit from ``Orange.classification.PyLearner``:

.. literalinclude:: code/py-small.py
   :lines: 3-17

The initialization part of the learner (``__init__``) simply stores the base learner (in our case a naive Bayesian classifier), the name of the learner, and the number of features we would like to use. Invocation of the learner (``__call__``) scores the features, stores the best ones in a list (``best``), constructs a data domain from them, and then uses that domain to transform the data (``Orange.data.Table(domain, data)``) so that it includes only the best features. Besides the most informative features, we also need to include the class. The learner then returns the classifier through the generic ``Orange.classification.PyClassifier``, where the actual prediction model is passed through the ``classifier`` argument.

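The wrapping pattern (score, select, project, delegate) can also be shown without Orange at all. In this sketch, ``SmallLearner``, ``purity``, and ``table_learner`` are our own stand-ins for ``Orange.classification.PyLearner`` and ``PyClassifier``:

```python
# Sketch of wrapping a base learner with feature selection. All names
# here are our own toy stand-ins, not Orange's classes.
from collections import Counter, defaultdict

def purity(rows, labels, i):
    # Stand-in scorer: fraction of items in the majority class of each
    # group induced by feature i (1.0 means a perfect split).
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[i]].append(label)
    return sum(Counter(g).most_common(1)[0][1] for g in groups.values()) / len(labels)

def table_learner(rows, labels):
    # Base learner: memorize the majority label per distinct feature tuple.
    by_key = defaultdict(list)
    for row, label in zip(rows, labels):
        by_key[row].append(label)
    table = {k: Counter(v).most_common(1)[0][0] for k, v in by_key.items()}
    default = Counter(labels).most_common(1)[0][0]
    return lambda row: table.get(row, default)

class SmallLearner:
    def __init__(self, base=table_learner, m=1):
        self.base, self.m = base, m

    def __call__(self, rows, labels):
        # Score all features, keep the m best, project the data onto them.
        ranked = sorted(range(len(rows[0])),
                        key=lambda i: purity(rows, labels, i), reverse=True)
        best = ranked[:self.m]
        model = self.base([tuple(r[i] for i in best) for r in rows], labels)
        # The returned classifier projects each instance the same way.
        return lambda row: model(tuple(row[i] for i in best))

rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]
classifier = SmallLearner(m=1)(rows, labels)
print(classifier(("a", "x")))  # yes: only the informative feature 0 is used
```

The essential detail mirrors what the Orange version does with the domain: the projection chosen at training time must be applied again to every instance at prediction time.
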
Note that learners in Orange also accept a weight vector, which records the importance of the training data items. This is useful for several algorithms, such as boosting.

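A tiny illustration (again our own code, not Orange's) of what a weight vector changes: with boosting-style upweighting, a minority class can out-vote the majority.

```python
# Weighted majority vote: instance weights change which class wins.
from collections import defaultdict

def weighted_majority(labels, weights):
    # Sum the weight assigned to each class and return the heaviest one.
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

labels = ["no", "no", "yes"]
print(weighted_majority(labels, [1, 1, 1]))  # no: plain majority
print(weighted_majority(labels, [1, 1, 5]))  # yes: the upweighted item wins
```
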
Let's check if this works::

   >>> data = Orange.data.Table("promoters")
   >>> s_learner = SmallLearner(m=3)
   >>> classifier = s_learner(data)
   >>> classifier(data[20])
   <orange.Value 'y'='mm'>
   >>> classifier(data[20], Orange.classification.Classifier.GetProbabilities)
   <0.439, 0.561>

It does! We constructed a naive Bayesian classifier that uses only three features. But how do we know the best number of features to use? It is time to construct one more learner.

Estimation of Feature Set Size
------------------------------

Given training data, what is the best number of features to use with a learning algorithm? We can estimate that through cross-validation: we check the possible feature set sizes and measure how well a classifier trained on such a reduced feature set performs. When we are done, we take the feature set size with the best performance and build a classifier on the entire training set. This procedure is often referred to as internal cross-validation. We wrap it into a new learner:

.. literalinclude:: code/py-small.py
   :lines: 19-31

Again, our code stores the arguments at initialization (``__init__``). The learner invocation (``__call__``) selects the best value of the parameter ``m``, the size of the feature set, and uses it to construct the final classifier.

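Internal cross-validation is a generic recipe and can be sketched independently of Orange: score each candidate value of ``m`` with k-fold cross-validation on the training data, then refit the winner on all of it. The learner and data below are toy stand-ins of our own making.

```python
# A generic sketch of internal cross-validation for picking a parameter m.
# prefix_learner(m) is a toy learner that memorizes the majority label
# for each tuple of the first m feature values.
from collections import Counter, defaultdict

def k_folds(n, k):
    # Yield (train_indices, test_indices) for k folds over n items.
    for fold in range(k):
        yield ([i for i in range(n) if i % k != fold],
               [i for i in range(n) if i % k == fold])

def cv_accuracy(learner, rows, labels, k=3):
    correct = total = 0
    for train, test in k_folds(len(rows), k):
        model = learner([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(model(rows[i]) == labels[i] for i in test)
        total += len(test)
    return correct / total

def select_and_fit(make_learner, rows, labels, candidates):
    # Pick the candidate with the best internal CV score, refit on all data.
    best = max(candidates, key=lambda m: cv_accuracy(make_learner(m), rows, labels))
    return make_learner(best)(rows, labels)

def prefix_learner(m):
    def learn(rows, labels):
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row[:m]].append(label)
        majority = {k: Counter(v).most_common(1)[0][0] for k, v in groups.items()}
        default = Counter(labels).most_common(1)[0][0]
        return lambda row: majority.get(row[:m], default)
    return learn

# Feature 0 determines the class; feature 1 is a unique id (pure noise).
rows = [("a", i) for i in range(6)] + [("b", i) for i in range(6, 12)]
labels = ["yes"] * 6 + ["no"] * 6
classifier = select_and_fit(prefix_learner, rows, labels, candidates=[1, 2])
print(classifier(("a", 99)))  # yes: m=1 wins, m=2 overfits the noisy id
```

On this data, ``m=1`` scores a perfect internal CV accuracy while ``m=2`` memorizes unrepeatable id values and falls back to the default label on every test fold, which is exactly the failure mode internal cross-validation is meant to detect.
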
We can now compare the three classification algorithms: the base classifier (naive Bayesian), the classifier with a fixed number of selected features, and the classifier that estimates the optimal number of features from the training set:

.. literalinclude:: code/py-small.py
   :lines: 39-45

And the result? The classifier with the optimized feature set size wins, though not substantially; the results would be more pronounced on data sets with a larger number of features::

   opt_small: 0.942, small: 0.937, nbc: 0.933