# source:orange/orange/doc/ofb/c_nb.htm@6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 8.3 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

Made XPM version of the icon 32x32.

Line
4<body>
5
6<p class="Path">
7Prev: <a href="c_nb_disc.htm">Naive Bayes with Discretization</a>,
8Next: <a href="c_bagging.htm">Bagging</a>,
9Up: <a href="c_pythonlearner.htm">Build Your Own Learner</a>
10</p>
11
12<H1>My First Orange Classifier</H1>
13
14<p>The naive Bayesian classification method we will implement in
15this lesson uses standard naive Bayesian algorithm also described
16in Michell: Machine Learning, 1997 (pages 177-180). Essentially, if
17an instance is described with n attributes a<sub>i</sub> (i from 1
18to n), then the class that instance is classified to a class v from
19set of possible classes V according to naive Bayes classifier
20is:</p>
21
22<center><img src="f1.gif" alt="formula for v"></center>
23
24<p>We will also compute a vector of elements</p>
25
26<center><img src="f2.gif" alt="formula for pj"></center>
27
28<p>which, after normalization so that the sum of elements is equal
29to 1, will (also by Mitchell) denote the class probabilities. The
30class probabilities and conditional probabilities (priors) in above
31formulas are estimated from training data: class probability is
32equal to the relative class frequency, while the conditional
33probability of attribute value given class is computed by figuring
34out the proportion of instances with a value of i-th attribute
35equal to a<sub>i</sub> among instances that from class
36v<sub>j</sub>.</p>
37
38<p>To complicate things just a little bit, m-estimate (see
39Mitchell, and Cestnik IJCAI-1990) will be used instead of relative
40frequency when computing prior conditional probabilities. So
41(following the example in Mitchell), when assessing
42P=P(Wind=strong|PlayTennis=no) we find that the total number of
43training examples with PlayTennis=no is n=5, and of these there are
44nc=3 for which Wind=strong, than using relative frequency the
45corresponding probability would be</p>
46
47<center><img src="f3.gif" alt="formula for P"></center>
48
49<p>Relative frequency has a problem when number of instance is
50small, and to alleviate that m-estimate assumes that there are m
51imaginary cases (m is also referred to as equivalent sample size)
52with equal probability of class values p. Our conditional
53probability using m-estimate is then computed as</p>
54
55<center><img src="f4.gif" alt="formula for Pm"></center>
56
57<p>Often, instead of uniform class probability p, a relative class
58frequency as estimated from training data is taken.</p>
59
60<p></p>
61
62<p>We will develop a module called bayes.py that will implement our
63naive Bayes learner and classifier. The structure of the module
64will be as with <a href="c_nb_disc.htm">previous example</a>.
65Again, we will implement two classes, one for learning and the other
66on for classification. Here is a <code>Learner</code>: class</p>
67
68<p class="header">class Learner_Class from <a href=
69"bayes.py">bayes.py</a></p>
70<xmp class="code">class Learner(object):
71    def __new__(cls, examples=None, **kwds):
72        learner = object.__new__(cls, **kwds)
73        if examples:
74            learner.__init__(**kwds)
75            return learner(examples)
76        else:
77            return learner
78
79    def __init__(self, m=0.0, name='std naive bayes', **kwds):
80        self.__dict__.update(kwds)
81        self.m = m
82        self.name = name
83
84    def __call__(self, examples, weight=None, **kwds):
85        for k in kwds.keys():
86            self.__dict__[k] = kwds[k]
87        domain = examples.domain
88
89        # first, compute class probabilities
90        n_class = [0.] * len(domain.classVar.values)
91        for e in examples:
92            n_class[int(e.getclass())] += 1
93
94        p_class = [0.] * len(domain.classVar.values)
95        for i in range(len(domain.classVar.values)):
96            p_class[i] = n_class[i] / len(examples)
97
98        # count examples with specific attribute and
99        # class value, pc[attribute][value][class]
100
101        # initialization of pc
102        pc = []
103        for i in domain.attributes:
104            p = [[0.]*len(domain.classVar.values) for i in range(len(i.values))]
105            pc.append(p)
106
107        # count instances, store them in pc
108        for e in examples:
109            c = int(e.getclass())
110            for i in range(len(domain.attributes)):
111                if not e[i].isSpecial():
112                    pc[i][int(e[i])][c] += 1.0
113
114        # compute conditional probabilities
115        for i in range(len(domain.attributes)):
116            for j in range(len(domain.attributes[i].values)):
117                for k in range(len(domain.classVar.values)):
118                    pc[i][j][k] = (pc[i][j][k] + self.m * p_class[k])/ \
119                                  (n_class[k] + self.m)
120
121        return Classifier(m = self.m, domain=domain, p_class=p_class, \
122                                       p_cond=pc, name=self.name)
123
124</xmp>
125
126<p>Initialization of Learner_Class saves the two attributes, m and
127name of the classifier. Notice that both parameters are optional,
128and the default value for m is 0, making naive Bayes m-estimate
129equal to relative frequency unless the user specifies some other
130value for m. Function <code>__call__</code> is called with the training data
131set, computes class and conditional probabilities and calls
132classifiers, passing the probabilities along with some other
133variables required for classification.</p>
134
135<p class="header">class Classifier from <a href="bayes.py">bayes.py</a></p>
136<xmp class="code">class Classifier:
137    def __init__(self, **kwds):
138        self.__dict__.update(kwds)
139
140    def __call__(self, example, result_type=orange.GetValue):
141        # compute the class probabilities
142        p = map(None, self.p_class)
143        for c in range(len(self.domain.classVar.values)):
144            for a in range(len(self.domain.attributes)):
145                if not example[a].isSpecial():
146                    p[c] *= self.p_cond[a][int(example[a])][c]
147
148        # normalize probabilities to sum to 1
149        sum =0.
150        for pp in p: sum += pp
151        if sum>0:
152            for i in range(len(p)): p[i] = p[i]/sum
153
154        # find the class with highest probability
155        v_index = p.index(max(p))
156        v = orange.Value(self.domain.classVar, v_index)
157
158        # return the value based on requested return type
159        if result_type == orange.GetValue:
160            return v
161        if result_type == orange.GetProbabilities:
162            return p
163        return (v,p)
164
165    def show(self):
166        print 'm=', self.m
167        print 'class prob=', self.p_class
168        print 'cond prob=', self.p_cond
169</xmp>
170
171
172<p>Upon the first invocation, the classifier will store the values of
173the parameters it was called with (<code>__init__</code>). When called with a
174data instance, it will first compute the class probabilities using
175the prior probabilities sent by the learner. The probabilities will
176be normalized to sum to 1. The class will then be found that has
177the highest probability, and the classifier will accordingly
178predict to this class. Notice that we have also added a method
179called show, which reports on m, class probabilities and
180conditional probabilities:</p>
181
183<pre class="code">
184> <strong>python</strong>
185>>> <strong>import orange, bayes</strong>
186>>> <strong>data = orange.ExampleTable("voting")</strong>
187>>> <strong>classifier = bayes.Learner(data)</strong>
188>>> <strong>classifier.show()</strong>
189m= 0.0
190class prob= [0.38620689655172413, 0.61379310344827587]
191cond prob= [[[0.79761904761904767, 0.38202247191011235], ...]]
192>>>
193</pre>
194
195
196<p>The following script tests our naive Bayes, and compares it to
19710-nearest neighbors. Running the script (do you it yourself)
198reports classification accuracies just about 90% (somehow, on this
199data set, kNN does better; smrc&hellip;).</p>
200
201<p class="header"><a href="bayes_test.py">bayes_test.py</a> (uses <a href="bayes.py">bayes.py</a> and <a href="voting.tab">voting.tab</a>)</p>
202<xmp class="code">import orange, orngEval, bayes
203data = orange.ExampleTable("voting")
204
205bayes = bayes.Learner(m=2, name='my bayes')
206knn = orange.kNNLearner(k=10)
207knn.name = "knn"
208
209learners = [knn,bayes]
210results = orngEval.CrossValidation(learners, data)
211for i in range(len(learners)):
212    print learners[i].name, orngEval.CA(results)[i]
213</xmp>
214
215<hr><br><p class="Path">
216Prev: <a href="c_nb_disc.htm">Naive Bayes with Discretization</a>,
217Next: <a href="c_bagging.htm">Bagging</a>,
218Up: <a href="c_pythonlearner.htm">Build Your Own Learner</a>
219</p>
220
221</body>
222</html>
Note: See TracBrowser for help on using the repository browser.