source: orange/orange/Orange/statistics/contingency.py @ 9550:bd71f96b33d5

Revision 9550:bd71f96b33d5, 17.6 KB checked in by lanz <lan.zagar@…>, 2 years ago (diff)

Documentation fixes.

"""
.. index:: Contingency table

=================
Contingency table
=================

A contingency table contains conditional distributions. Unless explicitly
'normalized', it contains absolute frequencies, that is, the number of
instances with a particular combination of two variables' values. If the
table is normalized by dividing each cell by the row sum, the cells represent
conditional probabilities of the column variable (here denoted as
``innerVariable``) conditioned on the row variable (``outerVariable``).

Contingency tables are usually constructed for discrete variables. Tables
for continuous variables have certain limitations described in a :ref:`separate
section <contcont>`.

The example below loads the monks-1 data set and prints out the conditional
class distribution given the value of `e`.

.. literalinclude:: code/statistics-contingency.py
    :lines: 1-7

This code prints out::

    1 <0.000, 108.000>
    2 <72.000, 36.000>
    3 <72.000, 36.000>
    4 <72.000, 36.000>

Contingencies behave like lists of distributions (in this case, class
distributions) indexed by values (of `e`, in this
example). Distributions are, in turn, indexed by values (class values,
here). The variable `e` from the above example is called the outer
variable, and the class is the inner. This can also be reversed. It is
also possible to use features for both the outer and the inner variable, so
the table shows distributions of one variable's values given the
value of another. There is a corresponding hierarchy of classes:
:obj:`Table` is a base class for :obj:`VarVar` (both
variables are attributes) and :obj:`Class` (one variable is
the class). The latter is the base class for
:obj:`VarClass` and :obj:`ClassVar`.

The most commonly used of the above classes is :obj:`VarClass`, which
can compute and store conditional probabilities of classes given the feature value.

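The step from absolute frequencies to conditional probabilities can be
illustrated without Orange. The following is a minimal plain-Python sketch, not
Orange's implementation: ``build_contingency`` and ``normalize_rows`` are
hypothetical helper names, and the counts are copied from the output printed
above for the variable `e`.

```python
from collections import defaultdict


def build_contingency(triples):
    # Accumulate weights into cells: outer value -> inner value -> frequency.
    table = defaultdict(lambda: defaultdict(float))
    for outer_value, inner_value, weight in triples:
        table[outer_value][inner_value] += weight
    return {outer: dict(row) for outer, row in table.items()}


def normalize_rows(table):
    # Divide every cell by its row sum, so each row becomes p(inner | outer).
    # Rows that sum to zero are left as all zeros to avoid dividing by zero.
    normalized = {}
    for outer, row in table.items():
        total = sum(row.values())
        normalized[outer] = {inner: (count / total if total else 0.0)
                             for inner, count in row.items()}
    return normalized


# (value of e, class value, absolute frequency), as printed above.
triples = [
    ("1", "0", 0.0), ("1", "1", 108.0),
    ("2", "0", 72.0), ("2", "1", 36.0),
    ("3", "0", 72.0), ("3", "1", 36.0),
    ("4", "0", 72.0), ("4", "1", 36.0),
]

counts = build_contingency(triples)
probabilities = normalize_rows(counts)
print(probabilities["2"])
```

Each row of ``probabilities`` sums to one, while ``counts`` keeps the original
absolute frequencies.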

Contingency tables
==================

.. class:: Table

    Provides a base class for storing and manipulating contingency
    tables. Although it is not abstract, it is seldom used directly but rather
    through more convenient derived classes described below.

    .. attribute:: outerVariable

        Outer variable (:class:`Orange.data.variable.Variable`), whose values are
        used as the first, outer index.

    .. attribute:: innerVariable

        Inner variable (:class:`Orange.data.variable.Variable`), whose values are
        used as the second, inner index.

    .. attribute:: outerDistribution

        The marginal distribution (:class:`Distribution`) of the outer variable.

    .. attribute:: innerDistribution

        The marginal distribution (:class:`Distribution`) of the inner variable.

    .. attribute:: innerDistributionUnknown

        The distribution (:class:`distribution.Distribution`) of the inner variable for
        instances for which the outer variable was undefined. This is the
        difference between the ``innerDistribution`` and the (unconditional)
        distribution of the inner variable.

    .. attribute:: varType

        The type of the outer variable (:obj:`Orange.data.Type`, usually
        :obj:`Orange.data.variable.Discrete` or
        :obj:`Orange.data.variable.Continuous`); equals
        ``outerVariable.varType`` and ``outerDistribution.varType``.

    .. method:: __init__(outer_variable, inner_variable)

        Construct an instance of contingency table for the given pair of
        variables.

        :param outer_variable: Descriptor of the outer variable
        :type outer_variable: Orange.data.variable.Variable
        :param inner_variable: Descriptor of the inner variable
        :type inner_variable: Orange.data.variable.Variable

    .. method:: add(outer_value, inner_value[, weight=1])

        Add an element to the contingency table by adding ``weight`` to the
        corresponding cell.

        :param outer_value: The value for the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value for the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float

    .. method:: normalize()

        Normalize all distributions (rows) in the table to sum to ``1``::

            >>> cont.normalize()
            >>> for val, dist in cont.items():
            ...     print val, dist

        Output: ::

            1 <0.000, 1.000>
            2 <0.667, 0.333>
            3 <0.667, 0.333>
            4 <0.667, 0.333>

        .. note::

            This method does not change the ``innerDistribution`` or
            ``outerDistribution``.

    With respect to indexing, a contingency table is a cross between a dictionary
    and a list. It supports the standard dictionary methods ``keys``, ``values`` and
    ``items``. ::

        >>> print cont.keys()
        ['1', '2', '3', '4']
        >>> print cont.values()
        [<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
        >>> print cont.items()
        [('1', <0.000, 108.000>), ('2', <72.000, 36.000>),
        ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]

    Although keys returned by the above functions are strings, a contingency can
    be indexed by anything that can be converted into values of the outer
    variable: strings, numbers or instances of ``Orange.data.Value``. ::

        >>> print cont[0]
        <0.000, 108.000>
        >>> print cont["1"]
        <0.000, 108.000>
        >>> print cont[Orange.data.Value(data.domain["e"], "1")]
        <0.000, 108.000>

    The length of the table equals the number of values of the outer
    variable. However, iterating through a contingency
    does not return keys, as with dictionaries, but distributions. ::

        >>> for i in cont:
        ...     print i
        <0.000, 108.000>
        <72.000, 36.000>
        <72.000, 36.000>
        <72.000, 36.000>

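The mix of dictionary-like and list-like behaviour described for
:obj:`Table` can be mimicked in a few lines of plain Python. This is only a
sketch under stated assumptions: ``MiniContingency`` is a hypothetical
stand-in, it stores rows as plain tuples instead of distributions, and it is
not how Orange implements the table.

```python
class MiniContingency:
    # Hypothetical stand-in for illustration; not Orange's Table.

    def __init__(self, outer_values, rows):
        self._outer_values = list(outer_values)
        self._rows = list(rows)

    def _position(self, key):
        # Accept either a position (int) or the name of an outer value (str).
        if isinstance(key, int):
            return key
        return self._outer_values.index(key)

    def __getitem__(self, key):
        return self._rows[self._position(key)]

    def __len__(self):
        # One row per value of the outer variable.
        return len(self._outer_values)

    def __iter__(self):
        # Unlike a dict, iteration yields the rows, not the keys.
        return iter(self._rows)

    def keys(self):
        return list(self._outer_values)

    def values(self):
        return list(self._rows)

    def items(self):
        return list(zip(self._outer_values, self._rows))


cont = MiniContingency(
    ["1", "2", "3", "4"],
    [(0.0, 108.0), (72.0, 36.0), (72.0, 36.0), (72.0, 36.0)],
)
print(cont[0] == cont["1"])
print(len(cont), list(cont) == cont.values())
```

Indexing by position and by value name reaches the same row, and iterating
produces exactly ``len(cont)`` rows.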

.. class:: Class

    An abstract base class for contingency tables that contain the class,
    either as the inner or the outer variable.

    .. attribute:: classVar (read only)

        The class attribute descriptor; always equal to either
        :obj:`Table.innerVariable` or :obj:`Table.outerVariable`.

    .. attribute:: variable

        The variable; always equal to either ``innerVariable`` or ``outerVariable``.

    .. method:: add_var_class(variable_value, class_value[, weight=1])

        Add an element to the contingency by increasing the corresponding count. The
        difference between this and :obj:`Table.add` is that the variable
        value is always the first argument and the class value the second,
        regardless of which one is inner and which one is outer.

        :param variable_value: Variable value
        :type variable_value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float

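The difference between ``add`` and ``add_var_class`` is easiest to see side by
side. The sketch below is a hedged plain-Python illustration:
``MiniClassTable`` and its ``class_is_outer`` flag are hypothetical names
chosen for this example, and the cell storage is a simple dictionary rather
than Orange's internal structure.

```python
from collections import defaultdict


class MiniClassTable:
    # Hypothetical stand-in for a table in which one variable is the class.

    def __init__(self, class_is_outer):
        self.class_is_outer = class_is_outer
        self.cells = defaultdict(float)  # (outer value, inner value) -> weight

    def add(self, outer_value, inner_value, weight=1.0):
        # Arguments follow the table layout: outer first, inner second.
        self.cells[(outer_value, inner_value)] += weight

    def add_var_class(self, variable_value, class_value, weight=1.0):
        # Arguments follow their meaning: variable first, class second,
        # whichever of the two happens to be the outer variable.
        if self.class_is_outer:
            self.add(class_value, variable_value, weight)
        else:
            self.add(variable_value, class_value, weight)


var_class = MiniClassTable(class_is_outer=False)  # laid out like VarClass
class_var = MiniClassTable(class_is_outer=True)   # laid out like ClassVar
for table in (var_class, class_var):
    table.add_var_class("e=1", "y=1")

print(dict(var_class.cells))
print(dict(class_var.cells))
```

The same call fills the cell ``("e=1", "y=1")`` in the first table and the
cell ``("y=1", "e=1")`` in the second.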

.. class:: VarClass

    A class derived from :obj:`Class` in which the variable is
    used as :obj:`Table.outerVariable` and the class as
    :obj:`Table.innerVariable`. This form is suitable for
    computing conditional class probabilities given the variable value.

    Calling ``VarClass.add_var_class(v, c)`` is equivalent to
    ``Table.add(v, c)``. Like :obj:`Table`,
    :obj:`VarClass` can compute contingency from instances.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`VarClass` for the given pair of
        variables. Inherited from :obj:`Table`.

        :param feature: Outer variable
        :type feature: Orange.data.variable.Variable
        :param class_variable: Class variable; used as ``innerVariable``
        :type class_variable: Orange.data.variable.Variable

    .. method:: __init__(feature, data[, weightId])

        Compute the contingency table from data.

        :param feature: Outer variable
        :type feature: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_class(value)

        Return the probability distribution of classes given the value of the
        variable.

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_class(value, class_value)

        Return the conditional probability of ``class_value`` given the
        feature value, p(class_value|value) (note the order of arguments!).

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. literalinclude:: code/statistics-contingency3.py
        :lines: 1-23

    The inner and the outer variable and their relations to the class are
    as follows::

        Inner variable:  y
        Outer variable:  e

        Class variable:  y
        Feature:         e

    Distributions are normalized, and probabilities are elements from the
    normalized distributions. Knowing that the target concept is
    y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1
    has a 100% probability, while for the rest, the probability is about one third,
    which agrees with the probability that two three-valued independent features
    have the same value. ::

        Distributions:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>

        Probabilities of class '1'
          p(1|1) = 1.000
          p(1|2) = 0.338
          p(1|3) = 0.341
          p(1|4) = 0.331

        Distributions from a matrix computed manually:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>

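The "one third" argument can be checked by enumerating the target concept
directly. The sketch below does not use Orange or the data file; it assumes,
as stated above, that `a` and `b` are three-valued, that `e` has four values,
and that the remaining features do not influence the class. Every combination
is counted once, so the result is exact rather than estimated from a sample.

```python
from fractions import Fraction
from itertools import product

# Count class values for every value of e: e -> [count of y=0, count of y=1].
counts = {e: [0, 0] for e in (1, 2, 3, 4)}
for a, b, e in product((1, 2, 3), (1, 2, 3), (1, 2, 3, 4)):
    y = int(e == 1 or a == b)  # target concept: y := (e=1) or (a=b)
    counts[e][y] += 1

# Normalize each row to obtain p(y | e) as exact fractions.
p_class = {e: [Fraction(n, sum(row)) for n in row] for e, row in counts.items()}

for e in sorted(p_class):
    row = p_class[e]
    print("p(.|%d) = <%.3f, %.3f>" % (e, float(row[0]), float(row[1])))
# p(.|1) = <0.000, 1.000>
# p(.|2) = <0.667, 0.333>
# p(.|3) = <0.667, 0.333>
# p(.|4) = <0.667, 0.333>
```

The exact values are 2/3 and 1/3; the distributions printed by the script
above differ slightly because they are computed from the instances in the data
set.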

.. class:: ClassVar

    :obj:`ClassVar` is similar to :obj:`VarClass` except
    that the class is the outer and the variable the inner variable. This form of
    contingency table is suitable for computing conditional probabilities of
    the variable given the class. All methods get the two arguments in the same
    order as in :obj:`VarClass`.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`ClassVar` for the given pair of
        variables. Inherited from :obj:`Table`, except for the reversed
        order of arguments.

        :param feature: Inner variable
        :type feature: Orange.data.variable.Variable
        :param class_variable: Class variable; used as ``outerVariable``
        :type class_variable: Orange.data.variable.Variable

    .. method:: __init__(feature, data[, weightId])

        Compute contingency table from the data.

        :param feature: Descriptor of the inner variable
        :type feature: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(class_value)

        Return the probability distribution of the variable given the class.

        :param class_value: The value of the class
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(value, class_value)

        Return the conditional probability of the value given the
        class, p(value|class_value).

        :param value: Value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 1-27

    The roles of the feature and the class are reversed compared to
    :obj:`VarClass`::

        Inner variable:  e
        Outer variable:  y

        Class variable:  y
        Feature:         e

    Distributions given the class can be printed out by calling :meth:`p_attr`.

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 30-31

    will print::

        p(.|0) = <0.000, 0.333, 0.333, 0.333>
        p(.|1) = <0.500, 0.167, 0.167, 0.167>

    If the class value is `0`, the attribute `e` cannot be `1` (the first
    value), while the distribution across the other values is uniform. If the class
    value is `1`, `e` is `1` for exactly half of the instances, and the distribution
    of the other values is again uniform.

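The two distributions above follow directly from the target concept
y := (e=1) or (a=b). As a hedged cross-check that does not involve Orange, the
sketch below enumerates every combination of `a`, `b` and `e` (assuming three,
three and four values respectively, with the other features irrelevant) and
tabulates `e` within each class, which is the layout :obj:`ClassVar` uses.

```python
from itertools import product

# class value -> counts for e = 1, 2, 3, 4 (class is outer, e is inner).
counts = {0: [0, 0, 0, 0], 1: [0, 0, 0, 0]}
for a, b, e in product((1, 2, 3), (1, 2, 3), (1, 2, 3, 4)):
    y = int(e == 1 or a == b)
    counts[y][e - 1] += 1

# Normalize each row to obtain p(e | y).
p_attr = {y: [n / sum(row) for n in row] for y, row in counts.items()}

for y in sorted(p_attr):
    print("p(.|%d) = <%s>" % (y, ", ".join("%.3f" % p for p in p_attr[y])))
# p(.|0) = <0.000, 0.333, 0.333, 0.333>
# p(.|1) = <0.500, 0.167, 0.167, 0.167>
```

Both classes cover 18 of the 36 combinations, which is why `e` = 1 accounts for
exactly half of class 1.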

.. class:: VarVar

    Contingency table in which neither variable is the class. The class
    is derived from :obj:`Table`, and adds an additional constructor and
    a method for getting conditional probabilities.

    .. method:: __init__(outer_variable, inner_variable)

        Inherited from :obj:`Table`.

    .. method:: __init__(outer_variable, inner_variable, data[, weightId])

        Compute the contingency from the given instances.

        :param outer_variable: Outer variable
        :type outer_variable: Orange.data.variable.Variable
        :param inner_variable: Inner variable
        :type inner_variable: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(outer_value)

        Return the probability distribution of the inner variable given the
        outer variable value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(outer_value, inner_value)

        Return the conditional probability of the inner_value
        given the outer_value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value of the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    The following example investigates which material is used for
    bridges of different lengths.

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 1-17

    Short bridges are mostly wooden or iron, while the long ones (and most of the
    medium-sized ones) are made of steel::

        SHORT:
           WOOD (56%)
           IRON (44%)

        MEDIUM:
           WOOD (9%)
           IRON (11%)
           STEEL (79%)

        LONG:
           STEEL (100%)

    Like all other contingency tables, this one can also be computed "manually".

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 18-


Contingencies for entire domain
===============================

A list of contingency tables, either :obj:`VarClass` or
:obj:`ClassVar`.

.. class:: Domain

    .. method:: __init__(data[, weightId=0, classIsOuter=0|1])

        Compute a list of contingency tables.

        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int
        :param classIsOuter: `True`, if class is the outer variable
        :type classIsOuter: bool

        .. note::

            ``classIsOuter`` cannot be given as a positional argument,
            but needs to be passed by keyword.

    .. attribute:: classIsOuter (read only)

        Tells whether the class is the outer or the inner variable.

    .. attribute:: classes

        Contains the distribution of class values on the entire dataset.

    .. method:: normalize()

        Call normalize for all contingencies.

    The following script prints the contingency tables for features
    "a", "b" and "e" for the dataset Monk 1.

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 9

    Contingency tables of type :obj:`VarClass` give
    the conditional distributions of classes, given the value of the variable.

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 12-

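Conceptually, the domain contingency is one feature-by-class table per
feature, all filled in a single pass over the data. The following plain-Python
sketch illustrates that idea only: ``domain_contingency`` is a hypothetical
helper, the instances are generated from the Monk 1 concept
y := (e=1) or (a=b) instead of being loaded from a file, and only the features
"a", "b" and "e" are modelled.

```python
from itertools import product

FEATURES = ("a", "b", "e")
VALUES = {"a": (1, 2, 3), "b": (1, 2, 3), "e": (1, 2, 3, 4)}


def domain_contingency(instances, features):
    # One table per feature: feature value -> [count of y=0, count of y=1].
    tables = {name: {} for name in features}
    for instance, y in instances:
        for name in features:
            row = tables[name].setdefault(instance[name], [0, 0])
            row[y] += 1
    return tables


# Generate every combination of the three features with its class value.
instances = []
for a, b, e in product(VALUES["a"], VALUES["b"], VALUES["e"]):
    instances.append(({"a": a, "b": b, "e": e}, int(e == 1 or a == b)))

tables = domain_contingency(instances, FEATURES)
for name in FEATURES:
    print(name, tables[name])
```

Here the class is the inner variable in every table, which corresponds to a
list of :obj:`VarClass` tables; swapping the roles of the key and the row index
would give the :obj:`ClassVar` layout.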

.. _contcont:

Contingency tables for continuous variables
===========================================

If the outer variable is continuous, the index must be one of the
values that do exist in the contingency table; other values raise an
exception:

.. literalinclude:: code/statistics-contingency6.py
    :lines: 1-4,17-

Since even rounding can be a problem, the only safe way to get the key
is to take it from the contingency's ``keys``.

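The rounding problem is a general property of floating-point keys and can be
reproduced with an ordinary dictionary. The sketch below is only an analogy:
the table contents are made up for illustration, and a plain ``dict`` raises
``KeyError``, which is not necessarily the exception Orange raises.

```python
# Keys are the exact floats that occurred in the (made-up) data.
table = {0.1 + 0.2: (3.0, 1.0), 5.5: (2.0, 5.0)}

# A key typed by hand looks the same when rounded but is a different float.
try:
    table[0.3]
    found = True
except KeyError:
    found = False

# The safe route: pick the key from the table's own keys.
safe_key = next(k for k in table.keys() if abs(k - 0.3) < 1e-9)
row = table[safe_key]

print(found, safe_key == 0.3, row)
```

The lookup with the literal ``0.3`` fails because ``0.1 + 0.2`` is stored as a
slightly different number, while the key taken from ``keys()`` always matches.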

Contingency tables with a discrete outer variable and a continuous inner variable
are more useful, since the methods :obj:`VarClass.p_class`
and :obj:`ClassVar.p_attr` use the primitive density estimation
provided by :obj:`Orange.statistics.distribution.Distribution`.

For example, :obj:`ClassVar` on the iris dataset can return the
probability of the sepal length 5.5 for different classes:

.. literalinclude:: code/statistics-contingency7.py

The script outputs::

    Estimated frequencies for e=5.5
      f(5.5|Iris-setosa) = 2.000
      f(5.5|Iris-versicolor) = 5.000
      f(5.5|Iris-virginica) = 1.000

"""

from Orange.core import Contingency as Table
from Orange.core import ContingencyAttrAttr as VarVar
from Orange.core import ContingencyClass as Class
from Orange.core import ContingencyAttrClass as VarClass
from Orange.core import ContingencyClassAttr as ClassVar

from Orange.core import DomainContingency as Domain