source: orange/docs/reference/rst/Orange.statistics.contingency.rst @ 10246:11b418321f79

Revision 10246:11b418321f79, 17.3 KB checked in by janezd <janez.demsar@…>, 2 years ago

Unified argument names in 2.5 and 3.0; numerous other changes in documentation

.. py:currentmodule:: Orange.statistics.contingency

.. index:: Contingency table

=================
Contingency table
=================

A contingency table contains conditional distributions. Unless explicitly
'normalized', it contains absolute frequencies, that is, the number of
instances with a particular combination of two variables' values. If the table is
normalized by dividing each cell by the row sum, the cells represent conditional
probabilities of the column variable (here denoted as ``innerVariable``)
conditioned on the row variable (``outerVariable``).

Contingency tables are usually constructed for discrete variables. Tables
for continuous variables have certain limitations described in a :ref:`separate
section <contcont>`.

The example below loads the monks-1 data set and prints out the conditional
class distribution given the value of `e`.

.. literalinclude:: code/statistics-contingency.py
    :lines: 1-7

This code prints out::

    1 <0.000, 108.000>
    2 <72.000, 36.000>
    3 <72.000, 36.000>
    4 <72.000, 36.000>

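The counts above can be reproduced without Orange. The following is a minimal sketch in plain Python 3, not the library's implementation: it enumerates what is assumed to be the full Monk 1 instance space (the cardinalities of ``a`` to ``f`` are an assumption), labels each instance with the target concept ``y := (e=1) or (a=b)`` quoted later in this document, and counts class values per value of ``e``. The helpers ``monk1_instances`` and ``contingency`` are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import product


def monk1_instances():
    """Yield every instance of the (assumed) full Monk 1 instance space."""
    # Assumed cardinalities: a, b, d take 3 values; c, f take 2; e takes 4.
    sizes = {"a": 3, "b": 3, "c": 2, "d": 3, "e": 4, "f": 2}
    names = list(sizes)
    for values in product(*(range(1, sizes[n] + 1) for n in names)):
        inst = dict(zip(names, values))
        # Target concept quoted in this document: y := (e=1) or (a=b).
        inst["y"] = int(inst["e"] == 1 or inst["a"] == inst["b"])
        yield inst


def contingency(instances, outer, inner):
    """Count how often each (outer value, inner value) pair occurs."""
    table = defaultdict(Counter)
    for inst in instances:
        table[inst[outer]][inst[inner]] += 1
    return table


cont = contingency(monk1_instances(), "e", "y")
for e_value in sorted(cont):
    # One row per value of e; columns are the counts of class 0 and class 1.
    print(e_value, [cont[e_value][y] for y in (0, 1)])
```

Under these assumptions the rows come out as 0 and 108 for ``e=1`` and as 72 and 36 for the other three values, matching the output shown above.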
Contingencies behave like lists of distributions (in this case, class
distributions) indexed by values (of `e`, in this
example). Distributions are, in turn, indexed by values (class values,
here). The variable `e` from the above example is called the outer
variable, and the class is the inner. This can also be reversed. It is
also possible to use features for both the outer and the inner variable, so
the table shows distributions of one variable's values given the
value of another. There is a corresponding hierarchy of classes:
:obj:`Table` is a base class for :obj:`VarVar` (both
variables are attributes) and :obj:`Class` (one variable is
the class). The latter is the base class for
:obj:`VarClass` and :obj:`ClassVar`.

The most commonly used of the above classes is :obj:`VarClass`, which
can compute and store conditional probabilities of classes given the feature value.

Contingency tables
==================

.. class:: Table

    Provides a base class for storing and manipulating contingency
    tables. Although it is not abstract, it is seldom used directly but rather
    through more convenient derived classes described below.

    .. attribute:: outerVariable

        Outer variable (:class:`Orange.feature.Descriptor`), whose values are
        used as the first, outer index.

    .. attribute:: innerVariable

        Inner variable (:class:`Orange.feature.Descriptor`), whose values are
        used as the second, inner index.

    .. attribute:: outerDistribution

        The marginal distribution (:class:`Distribution`) of the outer variable.

    .. attribute:: innerDistribution

        The marginal distribution (:class:`Distribution`) of the inner variable.

    .. attribute:: innerDistributionUnknown

        The distribution (:class:`distribution.Distribution`) of the inner variable for
        instances for which the outer variable was undefined. This is the
        difference between the ``innerDistribution`` and the (unconditional)
        distribution of the inner variable.

    .. attribute:: varType

        The type of the outer variable (:obj:`Orange.feature.Type`, usually
        :obj:`Orange.feature.Discrete` or
        :obj:`Orange.feature.Continuous`); equals
        ``outerVariable.varType`` and ``outerDistribution.varType``.

    .. method:: __init__(outer_variable, inner_variable)

        Construct an instance of contingency table for the given pair of
        variables.

        :param outer_variable: Descriptor of the outer variable
        :type outer_variable: Orange.feature.Descriptor
        :param inner_variable: Descriptor of the inner variable
        :type inner_variable: Orange.feature.Descriptor

    .. method:: add(outer_value, inner_value[, weight=1])

        Add an element to the contingency table by adding ``weight`` to the
        corresponding cell.

        :param outer_value: The value for the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value for the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float

    .. method:: normalize()

        Normalize all distributions (rows) in the table to sum to ``1``::

            >>> cont.normalize()
            >>> for val, dist in cont.items():
            ...     print val, dist

        Output: ::

            1 <0.000, 1.000>
            2 <0.667, 0.333>
            3 <0.667, 0.333>
            4 <0.667, 0.333>

        .. note::

            This method does not change the ``innerDistribution`` or
            ``outerDistribution``.

    With respect to indexing, a contingency table is a cross between a dictionary
    and a list. It supports the standard dictionary methods ``keys``, ``values`` and
    ``items``. ::

        >>> print cont.keys()
        ['1', '2', '3', '4']
        >>> print cont.values()
        [<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
        >>> print cont.items()
        [('1', <0.000, 108.000>), ('2', <72.000, 36.000>),
        ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]

    Although keys returned by the above functions are strings, a contingency can
    be indexed by anything that can be converted into values of the outer
    variable: strings, numbers or instances of ``Orange.data.Value``. ::

        >>> print cont[0]
        <0.000, 108.000>
        >>> print cont["1"]
        <0.000, 108.000>
        >>> print cont[Orange.data.Value(data.domain["e"], "1")]
        <0.000, 108.000>

    The length of the table equals the number of values of the outer
    variable. However, iterating through a contingency
    does not return keys, as with dictionaries, but distributions. ::

        >>> for i in cont:
        ...     print i
        <0.000, 108.000>
        <72.000, 36.000>
        <72.000, 36.000>
        <72.000, 36.000>


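The behaviour described for :obj:`Table` (weighted ``add``, row-wise ``normalize`` that leaves the marginals alone, dictionary-style ``keys``/``values``/``items``, indexing by position or by value, and iteration over distributions) can be illustrated with a small stand-in. This is only a sketch in plain Python 3; ``SketchTable`` and its attribute names are hypothetical and do not reproduce Orange's implementation.

```python
class SketchTable:
    """Hypothetical stand-in for a contingency table; not Orange's class."""

    def __init__(self, outer_values, inner_values):
        self.outer_values = list(outer_values)
        self.inner_values = list(inner_values)
        # One row (distribution) per outer value, one cell per inner value.
        self.rows = [[0.0] * len(self.inner_values) for _ in self.outer_values]
        # Marginal counts; normalize() leaves them untouched.
        self.outer_distribution = [0.0] * len(self.outer_values)
        self.inner_distribution = [0.0] * len(self.inner_values)

    def _outer_index(self, key):
        # An int is taken as a position, anything else as the value itself.
        if isinstance(key, int):
            return key
        return self.outer_values.index(key)

    def add(self, outer_value, inner_value, weight=1):
        i = self._outer_index(outer_value)
        j = self.inner_values.index(inner_value)
        self.rows[i][j] += weight
        self.outer_distribution[i] += weight
        self.inner_distribution[j] += weight

    def normalize(self):
        # Divide each row by its sum; empty rows are left as they are.
        for row in self.rows:
            total = sum(row)
            if total:
                row[:] = [cell / total for cell in row]

    def keys(self):
        return list(self.outer_values)

    def values(self):
        return list(self.rows)

    def items(self):
        return list(zip(self.outer_values, self.rows))

    def __getitem__(self, key):
        return self.rows[self._outer_index(key)]

    def __len__(self):
        return len(self.rows)

    def __iter__(self):
        # Iteration yields distributions, not keys.
        return iter(self.rows)


cont = SketchTable(["1", "2", "3", "4"], ["0", "1"])
cont.add("1", "1", 108)
for e_value in ("2", "3", "4"):
    cont.add(e_value, "0", 72)
    cont.add(e_value, "1", 36)

print(cont[0] is cont["1"])     # position 0 and value "1" address the same row
cont.normalize()
print(cont["2"])                # roughly two thirds and one third
print(cont.outer_distribution)  # marginal counts are still absolute
```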
.. class:: Class

    An abstract base class for contingency tables that contain the class,
    either as the inner or the outer variable.

    .. attribute:: classVar (read only)

        The class attribute descriptor; always equal to either
        :obj:`Table.innerVariable` or :obj:`Table.outerVariable`.

    .. attribute:: variable

        Variable; always equal to either ``innerVariable`` or ``outerVariable``.

    .. method:: add_var_class(variable_value, class_value[, weight=1])

        Add an element to the contingency by increasing the corresponding count. The
        difference between this and :obj:`Table.add` is that the variable
        value is always the first argument and the class value the second,
        regardless of which one is inner and which one is outer.

        :param variable_value: Variable value
        :type variable_value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float


.. class:: VarClass

    A class derived from :obj:`Class` in which the variable is
    used as :obj:`Table.outerVariable` and the class as
    :obj:`Table.innerVariable`. This form is suitable for
    computing conditional class probabilities given the variable value.

    Calling ``VarClass.add_var_class(v, c)`` is equivalent to
    ``Table.add(v, c)``. Like :obj:`Table`,
    :obj:`VarClass` can compute the contingency from instances.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`VarClass` for the given pair of
        variables. Inherited from :obj:`Table`.

        :param feature: Outer variable
        :type feature: Orange.feature.Descriptor
        :param class_variable: Class variable; used as ``innerVariable``
        :type class_variable: Orange.feature.Descriptor

    .. method:: __init__(feature, data[, weightId])

        Compute the contingency table from data.

        :param feature: Outer variable
        :type feature: Orange.feature.Descriptor
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_class(value)

        Return the probability distribution of classes given the value of the
        variable.

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution


    .. method:: p_class(value, class_value)

        Return the conditional probability of ``class_value`` given the
        feature value, p(class_value|value) (note the order of arguments!)

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. literalinclude:: code/statistics-contingency3.py
        :lines: 1-23

    The inner and the outer variable and their relations to the class are
    as follows::

        Inner variable:  y
        Outer variable:  e

        Class variable:  y
        Feature:         e

    Distributions are normalized, and probabilities are elements from the
    normalized distributions. Knowing that the target concept is
    ``y := (e=1) or (a=b)``, distributions are as expected: when e equals 1, class 1
    has a 100% probability, while for the rest, the probability is about one third, which
    agrees with the probability that two three-valued independent features
    have the same value. ::

        Distributions:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>

        Probabilities of class '1'
          p(1|1) = 1.000
          p(1|2) = 0.338
          p(1|3) = 0.341
          p(1|4) = 0.331

        Distributions from a matrix computed manually:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>


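The two forms of ``p_class`` amount to normalizing one row of counts and, optionally, picking a single cell from it. The sketch below shows this on a nested dictionary in plain Python 3; the function and the dictionary layout are hypothetical illustrations rather than Orange's API, and the counts are the absolute frequencies from the first example in this document.

```python
def p_class(counts, value, class_value=None):
    """Return p(.|value) as a dict, or p(class_value|value) as a float."""
    row = counts[value]
    total = sum(row.values())  # assumes the row is not empty
    dist = {c: n / total for c, n in row.items()}
    return dist if class_value is None else dist[class_value]


# Outer key: value of e; inner key: class value.
counts = {
    "1": {"0": 0, "1": 108},
    "2": {"0": 72, "1": 36},
}

print(p_class(counts, "2"))       # whole class distribution given e=2
print(p_class(counts, "1", "1"))  # single probability p(1|1)
```

Note the order of arguments in the second call: the feature value comes first and the class value second, as in the method described above.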
.. class:: ClassVar

    :obj:`ClassVar` is similar to :obj:`VarClass` except
    that the class is outside and the variable is inside. This form of
    contingency table is suitable for computing conditional probabilities of
    the variable given the class. All methods take the two arguments in the same
    order as in :obj:`VarClass`.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`ClassVar` for the given pair of
        variables. Inherited from :obj:`Table`, except for the reversed
        order of arguments.

        :param feature: Feature; used as ``innerVariable``
        :type feature: Orange.feature.Descriptor
        :param class_variable: Class variable; used as ``outerVariable``
        :type class_variable: Orange.feature.Descriptor

    .. method:: __init__(feature, data[, weightId])

        Compute the contingency table from the data.

        :param feature: Descriptor of the inner variable
        :type feature: Orange.feature.Descriptor
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(class_value)

        Return the probability distribution of the variable given the class.

        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(value, class_value)

        Return the conditional probability of the value given the
        class, p(value|class_value).

        :param value: Value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 1-27

    The roles of the feature and the class are reversed compared to
    :obj:`VarClass`::

        Inner variable:  e
        Outer variable:  y

        Class variable:  y
        Feature:         e

    Distributions given the class can be printed out by calling :meth:`p_attr`.

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 30-31

    will print::

        p(.|0) = <0.000, 0.333, 0.333, 0.333>
        p(.|1) = <0.500, 0.167, 0.167, 0.167>

    If the class value is `0`, the attribute `e` cannot be `1` (the first
    value), while the distribution across the other values is uniform. If the class
    value is `1`, `e` is `1` for exactly half of the instances, and the distribution of
    the other values is again uniform.

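A table with the class outside is the transpose of the table with the class inside. The following plain Python 3 sketch, using hypothetical helpers ``transpose`` and ``p_attr`` on nested dictionaries (not Orange's API), starts from the absolute frequencies printed in the first example of this document and derives the distributions of `e` given the class.

```python
def transpose(counts):
    """Swap the outer and the inner variable of a nested-dict table."""
    flipped = {}
    for outer, row in counts.items():
        for inner, n in row.items():
            flipped.setdefault(inner, {})[outer] = n
    return flipped


def p_attr(class_var_counts, class_value):
    """Return the distribution of the feature given the class value."""
    row = class_var_counts[class_value]
    total = sum(row.values())
    return {value: n / total for value, n in row.items()}


# Feature outside, class inside (counts from the first example).
var_class = {
    "1": {"0": 0, "1": 108},
    "2": {"0": 72, "1": 36},
    "3": {"0": 72, "1": 36},
    "4": {"0": 72, "1": 36},
}
class_var = transpose(var_class)

for c in ("0", "1"):
    dist = p_attr(class_var, c)
    cells = ", ".join("%.3f" % dist[e] for e in ("1", "2", "3", "4"))
    print("p(.|%s) = <%s>" % (c, cells))
```

With these counts the two printed lines agree with the distributions shown above.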
.. class:: VarVar

    Contingency table in which neither of the variables is the class. The class
    is derived from :obj:`Table`, and adds an additional constructor and
    a method for getting conditional probabilities.

    .. method:: __init__(outer_variable, inner_variable)

        Inherited from :obj:`Table`.

    .. method:: __init__(outer_variable, inner_variable, data[, weightId])

        Compute the contingency from the given instances.

        :param outer_variable: Outer variable
        :type outer_variable: Orange.feature.Descriptor
        :param inner_variable: Inner variable
        :type inner_variable: Orange.feature.Descriptor
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(outer_value)

        Return the probability distribution of the inner variable given the
        outer variable value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(outer_value, inner_value)

        Return the conditional probability of the inner_value
        given the outer_value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value of the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    The following example investigates which material is used for
    bridges of different lengths.

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 1-17

    Short bridges are mostly wooden or iron, while long bridges (and most of the
    medium-sized ones) are made of steel::

        SHORT:
           WOOD (56%)
           IRON (44%)

        MEDIUM:
           WOOD (9%)
           IRON (11%)
           STEEL (79%)

        LONG:
           STEEL (100%)

    Like all other contingency tables, this one can also be computed "manually".

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 18-


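The optional instance weights change only what is added to each cell: the weight instead of 1. The sketch below shows a weighted two-feature contingency and both forms of ``p_attr`` in plain Python 3. The records are invented toy values, not the bridges data set, and the function names are hypothetical.

```python
from collections import defaultdict

# Invented toy records: (outer value, inner value, instance weight).
records = [
    ("SHORT", "WOOD", 1.0),
    ("SHORT", "IRON", 1.0),
    ("SHORT", "WOOD", 2.0),
    ("LONG", "STEEL", 0.5),
    ("LONG", "STEEL", 1.5),
]


def weighted_contingency(records):
    """Sum instance weights per (outer value, inner value) cell."""
    table = defaultdict(lambda: defaultdict(float))
    for outer, inner, weight in records:
        table[outer][inner] += weight
    return table


def p_attr(table, outer_value, inner_value=None):
    """Return p(.|outer_value) as a dict, or p(inner_value|outer_value)."""
    row = table[outer_value]
    total = sum(row.values())
    if inner_value is None:
        return {inner: w / total for inner, w in row.items()}
    return row.get(inner_value, 0.0) / total


table = weighted_contingency(records)
print(p_attr(table, "SHORT"))          # {'WOOD': 0.75, 'IRON': 0.25}
print(p_attr(table, "LONG", "STEEL"))  # 1.0
```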
Contingencies for entire domain
===============================

:obj:`Domain` is a list of contingency tables, either :obj:`VarClass` or
:obj:`ClassVar`.

.. class:: Domain

    .. method:: __init__(data[, weight_id=0, class_is_outer=0|1])

        Compute a list of contingency tables.

        :param data: A set of instances
        :type data: Orange.data.Table
        :param weight_id: meta attribute with weights of instances
        :type weight_id: int
        :param class_is_outer: `True`, if class is the outer variable
        :type class_is_outer: bool

        .. note::

            ``class_is_outer`` needs to be given as a keyword argument.

    .. attribute:: class_is_outer (read only)

        Tells whether the class is the outer or the inner variable.

    .. attribute:: classes

        Contains the distribution of class values on the entire dataset.

    .. method:: normalize()

        Call normalize for all contingencies.

    The following script prints the contingency tables for features
    "a", "b" and "e" for the dataset Monk 1.

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 9

    Contingency tables of type :obj:`VarClass` give
    the conditional distributions of classes, given the value of the variable.

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 12-

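Conceptually, a domain contingency is one table per feature, all oriented the same way. A minimal plain Python 3 sketch follows; the function ``domain_contingency`` is hypothetical and the four instances are invented toy data. The keyword-only ``class_is_outer`` mirrors the note above that the argument must be passed by keyword.

```python
from collections import Counter, defaultdict


def domain_contingency(instances, features, class_name, *, class_is_outer=False):
    """Return {feature: nested counts}, one contingency table per feature."""
    tables = {}
    for feature in features:
        if class_is_outer:
            outer, inner = class_name, feature   # ClassVar-like orientation
        else:
            outer, inner = feature, class_name   # VarClass-like orientation
        table = defaultdict(Counter)
        for inst in instances:
            table[inst[outer]][inst[inner]] += 1
        tables[feature] = table
    return tables


# Invented toy data, for illustration only.
data = [
    {"a": 1, "b": 1, "y": 1},
    {"a": 1, "b": 2, "y": 0},
    {"a": 2, "b": 2, "y": 1},
    {"a": 2, "b": 1, "y": 0},
]

var_class = domain_contingency(data, ["a", "b"], "y")
class_var = domain_contingency(data, ["a", "b"], "y", class_is_outer=True)

print(dict(var_class["a"][1]))  # class counts among instances with a=1
print(dict(class_var["a"][1]))  # counts of a among instances with y=1
```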
.. _contcont:

Contingency tables for continuous variables
===========================================

If the outer variable is continuous, the index must be one of the
values that exist in the contingency table; other values raise an
exception:

.. literalinclude:: code/statistics-contingency6.py
    :lines: 1-4,17-

Since even rounding can be a problem, the only safe way to get the key
is to take it from the contingency's ``keys``.

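The rounding problem is a general property of floating-point keys and can be seen with a plain Python 3 dictionary standing in for a table with a continuous outer variable (the dictionary and its single entry are an invented illustration, not Orange's behaviour):

```python
# Toy table keyed by a continuous outer value.
counts = {0.3: 5}

# A value that prints like 0.3 but is a different float.
computed = 0.1 + 0.2
print(computed in counts)  # False

# Safe approach: pick an existing key, here the one closest to the target.
key = min(counts, key=lambda k: abs(k - computed))
print(counts[key])  # 5
```

Choosing the nearest existing key is only one possible policy; the point is that the lookup key comes from the stored keys rather than from a freshly computed number.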
Contingency tables with a discrete outer variable and a continuous inner variable
are more useful, since the methods :obj:`VarClass.p_class`
and :obj:`ClassVar.p_attr` use the primitive density estimation
provided by :obj:`Orange.statistics.distribution.Distribution`.

For example, :obj:`ClassVar` on the iris dataset can return the
estimated frequency of the sepal length 5.5 for different classes:

.. literalinclude:: code/statistics-contingency7.py

The script outputs::

    Estimated frequencies for e=5.5
      f(5.5|Iris-setosa) = 2.000
      f(5.5|Iris-versicolor) = 5.000
      f(5.5|Iris-virginica) = 1.000


.. automodule:: Orange.statistics.contingency