# source:orange/orange/doc/reference/contingency.htm@6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 23.5 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

Made XPM version of the icon 32x32.

<body>

<h1>Contingency Matrix</h1>

<P>A contingency matrix contains conditional distributions. It works for both discrete and continuous attributes; although the examples on this page are mostly limited to discrete attributes, the same can be done with continuous values.</P>

<p class="header">(uses <a href="monk1.tab">monk1.tab</a>)</p>
<XMP class=code>>>> import orange
>>> data = orange.ExampleTable("monk1")
>>> cont = orange.ContingencyAttrClass("e", data)
>>> for val, dist in cont.items():
...     print val, dist
1 <0.000, 108.000>
2 <72.000, 36.000>
3 <72.000, 36.000>
4 <72.000, 36.000>
</XMP>

<P>As this simple example shows, a contingency is similar to a dictionary (or a list; it is a bit ambiguous), where attribute values serve as keys and class distributions are the dictionary values. Here the attribute <CODE>e</CODE> is called the outer attribute and the class is the inner. That is not the only possible configuration of a contingency matrix: the class can also be the outer attribute, or there can be no class at all, in which case the matrix shows the distributions of one attribute's values given the value of another.</P>
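<P>The dictionary view can be imitated without the <CODE>orange</CODE> module. The sketch below is only an illustration, not how Orange stores contingencies: it regenerates what is assumed to be the complete Monk 1 domain (432 examples; the value ranges of the attributes are an assumption, the target concept is the one quoted later on this page) and counts classes for each value of <CODE>e</CODE>. The names <CODE>examples</CODE> and <CODE>table</CODE> are made up for this example.</P>

```python
from collections import defaultdict
from itertools import product

# Assumed Monk 1 domain: a, b, d take values 1..3; c, f take 1..2; e takes 1..4.
# The class follows the target concept y = (a == b) or (e == 1).
examples = [
    {"a": a, "b": b, "c": c, "d": d, "e": e, "f": f,
     "y": int(a == b or e == 1)}
    for a, b, c, d, e, f in product(range(1, 4), range(1, 4), range(1, 3),
                                    range(1, 4), range(1, 5), range(1, 3))
]

# Outer key: a value of e. Inner list: counts of class 0 and class 1.
table = defaultdict(lambda: [0, 0])
for ex in examples:
    table[ex["e"]][ex["y"]] += 1

for value in sorted(table):
    print("%s %s" % (value, table[value]))
```

<P>Under these assumptions the counts agree with the distributions printed by <CODE>ContingencyAttrClass</CODE> above.</P>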

<P>Contingencies are implemented by a hierarchy of classes:</P>
<B>
<P style="margin-bottom: 0cm; margin-left: 1cm"><CODE>Contingency</CODE></P>
<P style="margin-top: 0cm; margin-bottom: 0cm; margin-left: 2cm"><CODE>ContingencyClass</CODE></P>
<P style="margin-top: 0cm; margin-bottom: 0cm; margin-left: 3cm"><CODE>ContingencyClassAttr</CODE></P>
<P style="margin-top: 0cm; margin-bottom: 0cm; margin-left: 3cm"><CODE>ContingencyAttrClass</CODE></P>
<P style="margin-top: 0cm; margin-left: 2cm"><CODE>ContingencyAttrAttr</CODE></P>
</B>

<P>The base object is <CODE>Contingency</CODE>. Derived from it is <CODE>ContingencyClass</CODE>, in which one of the attributes is the class attribute; <CODE>ContingencyClass</CODE> is a base for two classes, <CODE>ContingencyAttrClass</CODE> and <CODE>ContingencyClassAttr</CODE>, the former having the class as the inner and the latter as the outer attribute. Class <CODE>ContingencyAttrAttr</CODE> is derived directly from <CODE>Contingency</CODE> and represents contingency matrices in which none of the attributes is the class attribute.</P>

<P>The most commonly used of the above classes is <CODE>ContingencyAttrClass</CODE>, which represents conditional probabilities of classes given the attribute value.</P>


<H2>General Contingency Matrix</H2>

<P>Here is what all contingency matrices have in common.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>outerVariable</DT>
<DD>The outer attribute descriptor. In the above case, it is <CODE>e</CODE>.</DD>

<DT>innerVariable</DT>
<DD>The inner attribute descriptor. In the above case, it is the class attribute.</DD>

<DT>outerDistribution</DT>
<DD>The distribution of the outer attribute's values, that is, the sums of rows. In the above case, the distribution of <CODE>e</CODE> is &lt;108.000, 108.000, 108.000, 108.000&gt;.</DD>

<DT>innerDistribution</DT>
<DD>The distribution of the inner attribute. In the above case, it is the class distribution, which is &lt;216.000, 216.000&gt;.</DD>

<DT>innerDistributionUnknown</DT>
<DD>The distribution of the inner attribute for the examples where the outer attribute was unknown. This is the difference between the <CODE>innerDistribution</CODE> and the sum of all distributions in the matrix.</DD>

<DT>varType</DT>
<DD>The <CODE>varType</CODE> for the outer attribute (discrete, continuous...); <CODE>varType</CODE> equals <CODE>outerVariable.varType</CODE> and <CODE>outerDistribution.varType</CODE>.</DD>
</DL>
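<P>How the three distributions relate can be checked on a toy example. The following plain-Python sketch does not use Orange; the data, the use of <CODE>None</CODE> for an unknown outer value and all variable names are assumptions made for illustration.</P>

```python
# Toy (outer value, inner value) pairs; None marks an unknown outer value.
pairs = [("1", 1), ("1", 1), ("2", 0), ("2", 1),
         (None, 0), (None, 1), (None, 1)]

outer_values = ["1", "2"]
matrix = dict((v, [0, 0]) for v in outer_values)  # one row per outer value
inner_distribution = [0, 0]                       # counted over all examples

for outer, inner in pairs:
    inner_distribution[inner] += 1
    if outer is not None:          # rows only hold examples with a known outer value
        matrix[outer][inner] += 1

# Sums of rows give the distribution of the outer attribute.
outer_distribution = [sum(matrix[v]) for v in outer_values]

# Inner distribution minus the column sums leaves the examples
# whose outer value was unknown.
column_sums = [sum(matrix[v][i] for v in outer_values) for i in range(2)]
inner_unknown = [t - s for t, s in zip(inner_distribution, column_sums)]

print("outer distribution: %s" % outer_distribution)
print("inner distribution: %s" % inner_distribution)
print("inner distribution, outer unknown: %s" % inner_unknown)
```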


<P class=section>Methods</P>
<DL class=attributes>
<DT>&lt;standard list/dictionary operations&gt;</DT>
<DD>A contingency matrix is a cross between a dictionary and a list. It supports the standard dictionary methods <CODE>keys</CODE>, <CODE>values</CODE> and <CODE>items</CODE>.

<XMP class=code>>>> print cont.keys()
['1', '2', '3', '4']
>>> print cont.values()
[<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
>>> print cont.items()
[('1', <0.000, 108.000>), ('2', <72.000, 36.000>), ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]
</XMP>

<P>Although the keys returned by the above functions are strings, you can index the contingency with anything that converts into values of the outer attribute: strings, numbers or instances of <CODE>Value</CODE>.</p>

<XMP class=code>>>> print cont[0]
<0.000, 108.000>
>>> print cont["1"]
<0.000, 108.000>
>>> print cont[orange.Value(data.domain["e"], "1")]
<0.000, 108.000>
</XMP>

<P>Naturally, the length of a <CODE>Contingency</CODE> equals the number of values of the outer attribute. The only odd thing is that iterating through a contingency (with a <CODE>for</CODE> loop, for instance) does not return keys, as with dictionaries, but dictionary values.</P>

<XMP class=code>>>> for i in cont:
...     print i
<0.000, 108.000>
<72.000, 36.000>
<72.000, 36.000>
<72.000, 36.000>
</XMP>

<P>If <CODE>cont</CODE> behaved like a normal dictionary, the above script would print out the strings from '1' to '4'.</P>
</dd>

<DT>add(outer_value, inner_value)</DT>
<DD>Adds an element to the contingency matrix.</DD>

<DT>normalize()</DT>
<DD>Normalizes all distributions (rows) in the contingency to sum to 1. It doesn't change the <CODE>innerDistribution</CODE> or <CODE>outerDistribution</CODE>.

<XMP class=code>>>> cont.normalize()
>>> for val, dist in cont.items():
...     print val, dist
1 <0.000, 1.000>
2 <0.667, 0.333>
3 <0.667, 0.333>
4 <0.667, 0.333>
</XMP>
</DD>
</DL>
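<P>Row normalization itself is simple arithmetic. A minimal sketch in plain Python follows, assuming the matrix is held as a dictionary of count lists. Unlike <CODE>normalize()</CODE>, which changes the contingency in place, the hypothetical helper <CODE>normalize_rows</CODE> returns a copy.</P>

```python
def normalize_rows(matrix):
    """Return a copy of matrix in which every non-empty row sums to 1."""
    result = {}
    for key, counts in matrix.items():
        total = float(sum(counts))
        # An all-zero row cannot be normalized; leave it as zeros.
        result[key] = [c / total if total else 0.0 for c in counts]
    return result

counts = {"1": [0, 108], "2": [72, 36], "3": [72, 36], "4": [72, 36]}
probs = normalize_rows(counts)

for key in sorted(probs):
    print("%s <%s>" % (key, ", ".join("%.3f" % p for p in probs[key])))
```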


<H2>Contingency</H2>

<P>The base class is, for once, not abstract. Its constructor expects two attribute descriptors, the first one for the outer and the second for the inner attribute. It initializes empty distributions and it is up to you to fill them. This is, for instance, how to manually reproduce the results of the script at the top of the page.</P>

<p class="header">(uses <a href="monk1.tab">monk1.tab</a>)</p>
<XMP class=code>import orange
data = orange.ExampleTable("monk1")

# outer attribute first, inner attribute (here the class) second
cont = orange.Contingency(data.domain["e"], data.domain.classVar)
for ex in data:
    cont[ex["e"]][ex.getclass()] += 1

print "Contingency items:"
for val, dist in cont.items():
    print val, dist
print
</XMP>

<P>The "reproduction" is not perfect. We didn't care about unknown values and haven't computed <CODE>innerDistribution</CODE> and <CODE>outerDistribution</CODE>. The better way to do it is by using the method <CODE>add</CODE>, so that the loop becomes:</p>

<XMP class=code>for ex in data:
    cont.add(ex["e"], ex.getclass())
</XMP>

<P>It's not only simpler, but also correctly handles unknown values and updates <CODE>innerDistribution</CODE> and <CODE>outerDistribution</CODE>.</P>


<H2>ContingencyClass</H2>

<P><CODE>ContingencyClass</CODE> is an abstract base class for contingency matrices that contain the class attribute, either as the inner or the outer attribute. It offers a function that makes filling the contingency clearer.</P>

<P>After reading through the rest of this page you might ask yourself why we need to separate the classes <CODE>ContingencyAttrClass</CODE>, <CODE>ContingencyClassAttr</CODE> and <CODE>ContingencyAttrAttr</CODE>, given that the underlying matrix is the same. This is to avoid confusion about what is in the inner and the outer variable. Contingency matrices are most often used to compute conditional probabilities of classes or attributes. By separating the classes and giving each of them specialized methods for the probabilities that are most natural to compute from it, the user (i.e., you or the method that gets passed the matrix) is relieved from checking what kind of matrix it got, that is, where the class is and where the attribute is.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>classVar</DT>
<DD>The class attribute descriptor. This is always equal either to <CODE>innerVariable</CODE> or <CODE>outerVariable</CODE>.</DD>

<DT>variable</DT>
<DD>The attribute descriptor. This is always equal either to <CODE>innerVariable</CODE> or <CODE>outerVariable</CODE>.</DD>
</DL>


<P class=section>Methods</P>
<DL class=attributes>
<DT>add_attrclass(attribute_value, class_value)</DT>
<DD>Adds an element to the contingency. The difference between this and <CODE>Contingency.add</CODE> is that the attribute value is always the first argument and the class value the second, regardless of whether the attribute is actually the outer variable or the inner.</DD>
</DL>


<H2>ContingencyAttrClass</H2>

<P><CODE>ContingencyAttrClass</CODE> is derived from <CODE>ContingencyClass</CODE>. Here, the attribute is the outer variable (hence <CODE>variable=outerVariable</CODE>) and the class is the inner (<CODE>classVar=innerVariable</CODE>), so this form of contingency matrix is suitable for computing the conditional probabilities of classes given a value of an attribute.</P>

<P>Calling <CODE>add_attrclass(v, c)</CODE> is here equivalent to calling <CODE>add(v, c)</CODE>. In addition to this, the class supports computation of the contingency from examples, as you have already seen in the example at the top of this page.</P>

<P class=section>Methods</P>
<DL class=attributes>
<DT>ContingencyAttrClass(attribute, class_attribute)</DT>
<DD>The inherited constructor, which does exactly the same as <CODE>Contingency</CODE>'s constructor.</DD>

<DT>ContingencyAttrClass(attribute, examples[, weightID])</DT>
<DD>Constructor that constructs the contingency and computes the data from the given examples. If these are weighted, the meta attribute with example weights can be specified.
</DD>

<DT>p_class(attribute_value)</DT>
<DD>Returns the distribution of classes given the <CODE>attribute_value</CODE>. If the matrix is normalized, this is equivalent to returning <CODE>self[attribute_value]</CODE>. The result is returned as a normalized <CODE>Distribution</CODE>.</DD>

<DT>p_class(attribute_value, class_value)</DT>
<DD><P>Returns the conditional probability of <CODE>class_value</CODE> given the <CODE>attribute_value</CODE>. If the matrix is normalized, this is equivalent to returning <CODE>self[attribute_value][class_value]</CODE>.</P>

<P>Don't confuse the order of arguments: the attribute value is the first, the class value is the second, just as in <CODE>add_attrclass</CODE>. Although counterintuitive in this instance (since the returned value represents the conditional probability P(class_value|attribute_value)), this order is uniform for all (applicable) methods of classes derived from <CODE>ContingencyClass</CODE>.</P>
</DD>
</DL>
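<P>What the two forms of <CODE>p_class</CODE> compute can be sketched in plain Python. This is an illustration over raw counts, not Orange's implementation; the function below merely borrows the method's name and argument order, and the dictionary <CODE>counts</CODE> stands in for the contingency.</P>

```python
def p_class(counts, attribute_value, class_value=None):
    """Class probabilities given an attribute value, from raw counts.

    counts maps attribute values to lists of per-class counts.
    Without class_value the whole normalized row is returned,
    otherwise only P(class_value | attribute_value).
    """
    row = counts[attribute_value]
    total = float(sum(row))
    distribution = [c / total for c in row]
    if class_value is None:
        return distribution
    return distribution[class_value]

counts = {"1": [0, 108], "2": [72, 36], "3": [72, 36], "4": [72, 36]}

for value in sorted(counts):
    # attribute value first, class value second
    print("p(1|%s) = %5.3f" % (value, p_class(counts, value, 1)))
```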

<P>You have already seen this form of matrix at the top of the page. Here we only explore what we have newly learned about it.</P>

<p class="header">(uses <a href="monk1.tab">monk1.tab</a>)</p>
<XMP class=code>import orange
data = orange.ExampleTable("monk1")
cont = orange.ContingencyAttrClass("e", data)

print "Inner variable: ", cont.innerVariable.name
print "Outer variable: ", cont.outerVariable.name
print
print "Class variable: ", cont.classVar.name
print "Attribute:      ", cont.variable.name
print

# class distribution for each value of the attribute
print "Distributions:"
for val in cont.variable:
    print "  p(.|%s) = %s" % (val.native(), cont.p_class(val))
print

# probability of a single class value for each value of the attribute
firstclass = orange.Value(cont.classVar, 1)
firstnative = firstclass.native()
print "Probabilities of class '%s'" % firstnative
for val in cont.variable:
    print "  p(%s|%s) = %5.3f" % (firstnative, val.native(), cont.p_class(val, firstclass))
</XMP>

<P>The inner and the outer variable and their relations to the class and the attribute are as expected.</P>

<XMP class=code>Inner variable:  y
Outer variable:  e

Class variable:  y
Attribute:       e
</XMP>

<P>Distributions are normalized and probabilities are elements from the normalized distributions. Knowing that the target concept is <CODE>y := (e=1) or (a=b)</CODE>, distributions are as expected: when <CODE>e</CODE> equals 1, class 1 has a 100% probability, while for the rest the probability is one third, which agrees with the probability that two independent three-valued attributes have the same value.</P>

<XMP class=code>Distributions:
  p(.|1) = <0.000, 1.000>
  p(.|2) = <0.667, 0.333>
  p(.|3) = <0.667, 0.333>
  p(.|4) = <0.667, 0.333>

Probabilities of class '1'
  p(1|1) = 1.000
  p(1|2) = 0.333
  p(1|3) = 0.333
  p(1|4) = 0.333
</XMP>

<P>Manual computation using <CODE>add_attrclass</CODE> is similar to (to be precise: exactly the same as) computation using <CODE>add</CODE>.</P>

<XMP class=code>cont = orange.ContingencyAttrClass(data.domain["e"], data.domain.classVar)
for ex in data:
    cont.add_attrclass(ex["e"], ex.getclass())
</XMP>


<H2>ContingencyClassAttr</H2>

<P><CODE>ContingencyClassAttr</CODE> is similar to <CODE>ContingencyAttrClass</CODE> except that here the class attribute is the outer and the attribute the inner variable. As a consequence, this form of contingency matrix is suitable for computing conditional probabilities of attribute values given class values. The constructor and <CODE>add_attrclass</CODE> nevertheless get the arguments in the same order as for <CODE>ContingencyAttrClass</CODE>, that is, attribute first, class second.</P>

<P class=section>Methods</P>
<DL class=attributes>
<DT>ContingencyClassAttr(attribute, class_attribute)</DT>
<DD>The inherited constructor is exactly the same as <CODE>Contingency</CODE>'s constructor, except that the argument order is reversed (in <CODE>Contingency</CODE>, the outer attribute is given first, while here the first argument, <CODE>attribute</CODE>, is the inner attribute).</DD>

<DT>ContingencyClassAttr(attribute, examples[, weightID])</DT>
<DD>Constructs the contingency and computes the data from the given examples. If these are weighted, the meta attribute with example weights can be specified.
</DD>

<DT>p_attr(class_value)</DT>
<DD>Returns the distribution of attribute values given the <CODE>class_value</CODE>. If the matrix is normalized, this is equivalent to returning <CODE>self[class_value]</CODE>. The result is returned as a normalized <CODE>Distribution</CODE>.</DD>

<DT>p_attr(attribute_value, class_value)</DT>
<DD>Returns the conditional probability of <CODE>attribute_value</CODE> given the <CODE>class_value</CODE>. If the matrix is normalized, this is equivalent to returning <CODE>self[class_value][attribute_value]</CODE>.</DD>
</DL>

<P>As you can see, the class is rather similar to <CODE>ContingencyAttrClass</CODE>, except that it has <CODE>p_attr</CODE> instead of <CODE>p_class</CODE>. If you, for instance, take the above script and replace the class name, the first group of print statements outputs</P>

<p class="header"><a href="contingency4.py">part of the output from contingency4.py</a>
(uses <a href="monk1.tab">monk1.tab</a>)</p>
<XMP class=code>Inner variable:  e
Outer variable:  y

Class variable:  y
Attribute:       e
</XMP>

<P>This is exactly the reverse of the printout from <CODE>ContingencyAttrClass</CODE>. To print out the distributions, the only difference now is that you need to iterate through the values of the class attribute and call <CODE>p_attr</CODE>. For instance,</P>

<p class="header">(uses <a href="monk1.tab">monk1.tab</a>)</p>
<XMP class=code>for val in cont.classVar:
    print "  p(.|%s) = %s" % (val.native(), cont.p_attr(val))
</XMP>

<P>will print</P>

<XMP class=code>  p(.|0) = <0.000, 0.333, 0.333, 0.333>
  p(.|1) = <0.500, 0.167, 0.167, 0.167>
</XMP>

<P>If the class value is '0', then attribute <CODE>e</CODE> cannot be '1' (the first value), but can be anything else, with equal probabilities of 0.333. If the class value is '1', <CODE>e</CODE> is '1' in exactly half of the examples (work out why this is so); in the remaining cases, <CODE>e</CODE> is again distributed uniformly.</p>
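<P>These numbers can be checked by hand from the counts that <CODE>ContingencyAttrClass</CODE> printed earlier on this page: turning the columns of that matrix into rows and normalizing them gives the distributions of <CODE>e</CODE> within each class. The plain-Python sketch below does just that; it is an illustration with made-up variable names, not a description of Orange's internals.</P>

```python
# Counts of (class 0, class 1) for each value of e, as printed earlier.
attr_class = {"1": [0, 108], "2": [72, 36], "3": [72, 36], "4": [72, 36]}
e_values = sorted(attr_class)

# Transpose: one row per class value, one column per value of e.
class_attr = dict(
    (c, [attr_class[v][c] for v in e_values]) for c in (0, 1)
)

# Normalize each row by the number of examples in that class.
p_attr = {}
for c, row in class_attr.items():
    total = float(sum(row))
    p_attr[c] = [n / total for n in row]

for c in sorted(p_attr):
    print("p(.|%s) = <%s>" % (c, ", ".join("%.3f" % p for p in p_attr[c])))
```

<P>Class 1 has 216 examples, 108 of which have <CODE>e</CODE> equal to 1, hence the probability of one half.</P>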


<H2>ContingencyAttrAttr</H2>

<P><CODE>ContingencyAttrAttr</CODE> stores contingency matrices in which none of the attributes is the class attribute. This is rather similar to <CODE>Contingency</CODE>, except that it has an additional constructor and methods for getting the conditional probabilities.</P>

<P class=section>Methods</P>
<DL class=attributes>
<DT>ContingencyAttrAttr(outer_variable, inner_variable)</DT>
<DD>This constructor is exactly the same as that of <CODE>Contingency</CODE>.</DD>

<DT>ContingencyAttrAttr(outer_variable, inner_variable, examples[, weightID])</DT>
<DD>Computes the contingency from the given <CODE>examples</CODE>.</DD>

<DT>p_attr(outer_value)</DT>
<DD>Returns the probability distribution of the inner variable given the <CODE>outer_value</CODE>.</DD>

<DT>p_attr(outer_value, inner_value)</DT>
<DD>Returns the conditional probability of the <CODE>inner_value</CODE> given the <CODE>outer_value</CODE>.</DD>
</DL>

<P>In the following example, we shall use <CODE>ContingencyAttrAttr</CODE> on the dataset "bridges" to determine which material is used for bridges of different lengths.</P>

<p class="header">(uses <a href="bridges.tab">bridges.tab</a>)</p>
<XMP class=code>
import orange
data = orange.ExampleTable("bridges")
cont = orange.ContingencyAttrAttr("SPAN", "MATERIAL", data)

cont.normalize()
for val in cont.outerVariable:
    print "%s:" % val.native()
    # skip materials that never occur for this span
    for inval, p in cont[val].items():
        if p:
            print "   %s (%i%%)" % (inval, int(100*p+0.5))
    print
</XMP>

<P>The output tells us that short bridges are mostly wooden or iron, while the long ones (and most of the medium-sized ones) are made from steel.</P>

<XMP class=code>SHORT:
   WOOD (56%)
   IRON (44%)

MEDIUM:
   WOOD (9%)
   IRON (11%)
   STEEL (79%)

LONG:
   STEEL (100%)
</XMP>

<P>Like all other contingency matrices, this one can also be computed "manually".</P>

<p class="header">(uses <a href="bridges.tab">bridges.tab</a>)</p>
<XMP class=code>cont = orange.ContingencyAttrAttr(data.domain["SPAN"], data.domain["MATERIAL"])
for ex in data:
    cont.add(ex["SPAN"], ex["MATERIAL"])
</XMP>


<H2>Contingencies with Continuous Values</H2>

<P>What happens if one or both attributes are continuous? First of all, contingencies can be built for such attributes as well. Just imagine a contingency as a dictionary with attribute values as keys and objects of type <CODE>Distribution</CODE> as values.</P>

<P>If the outer attribute is continuous, you can use either its values or ordinary floating point numbers for indexing. The index must be one of the values that exist in the contingency matrix.</P>

<P>The following script queries for a distribution in between the first two keys, which triggers an exception.</P>

<p class="header"><a href="contingency6.py">part of the output from contingency6.py</a>
(uses <a href="iris.tab">iris.tab</a>)</p>
<XMP class=code>import orange
data = orange.ExampleTable("iris")
cont = orange.ContingencyAttrClass(0, data)

# a value halfway between the first two keys is not a key itself
midkey = (cont.keys()[0] + cont.keys()[1])/2.0
print "cont[%5.3f] = %s" % (midkey, cont[midkey])
</XMP>

<P>If you still find such contingencies useful, you need to be careful about what you pass as indices. Always use the values from <CODE>keys()</CODE> directly, instead of manually entering the keys' values you see printed. If, for instance, you print out the first key, see that it is <CODE>4.500</CODE> and then request <CODE>cont[4.500]</CODE>, this can give an index error due to rounding.</p>
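<P>The same pitfall exists with ordinary Python dictionaries keyed by floating point numbers, which makes it easy to demonstrate without Orange. The sketch below is only an analogy: the key <CODE>0.1 + 0.2</CODE> is chosen because it prints as 0.300 without being equal to 0.3, and how Orange stores and compares continuous keys internally is not assumed here.</P>

```python
# A dictionary keyed by a float that prints as 0.300 but is not exactly 0.3.
stored_key = 0.1 + 0.2
table = {stored_key: [3, 1]}

shown = "%5.3f" % stored_key   # what a printout of the key would show
retyped = float(shown)         # the number a reader would type back in

print("printed key: %s" % shown)
print("retyped key found: %s" % (retyped in table))
print("original key found: %s" % (stored_key in table))
```

<P>Looking up the retyped number fails, while the key taken from the dictionary itself works, which is why the values from <CODE>keys()</CODE> should be used directly.</P>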

<P>Contingencies with continuous inner attributes are more useful. First, indexing by discrete values is easier than by continuous ones. Second, class <A href="distributions.htm"><CODE>Distribution</CODE></A> covers both discrete and continuous distributions, so even the methods <CODE>p_class</CODE> and <CODE>p_attr</CODE> will work, except that what they return is not the probability but the density (interpolated, if necessary). See the page about <A href="distributions.htm"><CODE>Distribution</CODE></A> for more information.</p>

<P>For instance, if you build a <CODE>ContingencyClassAttr</CODE> on the iris dataset, you can ask about the probability of the sepal length 5.5.</P>

<p class="header">(uses <a href="iris.tab">iris.tab</a>)</p>
<XMP class=code>import orange
data = orange.ExampleTable("iris")
cont = orange.ContingencyClassAttr("sepal length", data)

# attribute value first, class value second
for val in cont.classVar:
    print "  p(%s|%s) = %5.3f" % (5.5, val.native(), cont.p_attr(5.5, val))
</XMP>

<P>The script's output is</P>

<XMP class=code>  p(5.5|Iris-setosa) = 2.000
  p(5.5|Iris-versicolor) = 5.000
  p(5.5|Iris-virginica) = 1.000
</XMP>

<P>These numbers represent the number of examples with a sepal length of 5.5. If the matrix were normalized, the numbers would be divided by the total number of examples in classes setosa, versicolor and virginica, respectively.</P>

<H2>Computing Contingencies for All Attributes</H2>

<P>Computing contingency matrices requires iterating through the examples. We often need to compute <CODE>ContingencyAttrClass</CODE> or <CODE>ContingencyClassAttr</CODE> for all attributes in the dataset, and this is faster if we do it for all attributes at once, in a single pass. That's taken care of by class <CODE>DomainContingency</CODE>.</P>

<P><CODE>DomainContingency</CODE> is basically a list of contingencies, either of type <CODE>ContingencyAttrClass</CODE> or <CODE>ContingencyClassAttr</CODE>, with two additional fields and a constructor that computes the contingencies.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>classIsOuter</DT>
<DD>Tells whether the class is the outer or the inner attribute. Effectively, this tells whether the elements of the list are <CODE>ContingencyAttrClass</CODE> or <CODE>ContingencyClassAttr</CODE>.</DD>

<DT>classes</DT>
<DD>Contains the distribution of class values on the entire dataset.</DD>
</DL>


<P class=section>Methods</P>
<DL class=attributes>
<DT>DomainContingency(examples[, weightID][, classIsOuter=0|1])</DT>
<DD>The constructor needs to be given a list of examples. It constructs a list of contingencies; if <CODE>classIsOuter</CODE> is 0 (default), these will be <CODE>ContingencyAttrClass</CODE>, if 1, <CODE>ContingencyClassAttr</CODE> are used. It then iterates through the examples and computes the contingencies.</DD>

<DT>list-like operations</DT>
<DD>The only real difference between <CODE>DomainContingency</CODE> and an ordinary Python list (except for the additional methods and fields, of course) is that its elements can be indexed not only by numbers, but also by attribute names and descriptors, as shown in the example below.</DD>

<DT>normalize</DT>
<DD>Calls <CODE>normalize</CODE> for each contingency.</DD>
</DL>

<P>The following script will print the contingencies for attributes "a", "b" and "e" for the dataset Monk 1.</P>

<p class="header">(uses <a href="monk1.tab">monk1.tab</a>)</p>
<XMP class=code>import orange
data = orange.ExampleTable("monk1")

dc = orange.DomainContingency(data)
print dc["a"]
print dc["b"]
print dc["e"]
</XMP>

<P>The contingencies in the <CODE>DomainContingency</CODE> <CODE>dc</CODE> are of type <CODE>ContingencyAttrClass</CODE> and tell us conditional distributions of classes, given the value of the attribute. To compute the distribution of attribute values given the class, one needs to get a list of <CODE>ContingencyClassAttr</CODE>.</p>
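<P>The idea of filling the contingencies for all attributes in one pass over the examples can be sketched in plain Python. The sketch does not use Orange and does not claim to mirror how <CODE>DomainContingency</CODE> is implemented; it again assumes the complete 432-example Monk 1 domain (the attribute value ranges are an assumption), and <CODE>domain_cont</CODE> and <CODE>class_distribution</CODE> are hypothetical names.</P>

```python
from itertools import product

# Assumed Monk 1 domain: attribute names and their number of values.
names = ["a", "b", "c", "d", "e", "f"]
sizes = [3, 3, 2, 3, 4, 2]

examples = []
for values in product(*[range(1, n + 1) for n in sizes]):
    ex = dict(zip(names, values))
    ex["y"] = int(ex["a"] == ex["b"] or ex["e"] == 1)  # target concept
    examples.append(ex)

# One attribute-class contingency per attribute, addressable by name:
# attribute value -> [count of class 0, count of class 1].
domain_cont = dict(
    (name, dict((v, [0, 0]) for v in range(1, n + 1)))
    for name, n in zip(names, sizes)
)
class_distribution = [0, 0]

for ex in examples:                 # a single pass over the examples
    class_distribution[ex["y"]] += 1
    for name in names:
        domain_cont[name][ex[name]][ex["y"]] += 1

for name in ("a", "b", "e"):
    print("%s %s" % (name, sorted(domain_cont[name].items())))
```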