# source:orange/orange/Orange/statistics/contingency.py@9136:b714663d7f58

Revision 9136:b714663d7f58, 17.6 KB checked in by markotoplak, 2 years ago

fixed Orange.statistics.contingecy a little: examples were not showing

"""
=================
Contingency table
=================

A contingency table contains conditional distributions. Unless explicitly
'normalized', it contains absolute frequencies, that is, the number of
instances with a particular combination of two variables' values. If it is
normalized by dividing each cell by the row sum, it represents the conditional
probabilities of the column variable (here denoted as ``innerVariable``)
conditioned on the row variable (``outerVariable``).

Contingency tables are usually constructed for discrete variables. Tables
for continuous variables have certain limitations described in a :ref:`separate
section <contcont>`.

The example below loads the monks-1 data set and prints out the conditional
class distribution given the value of `e`.

.. literalinclude:: code/statistics-contingency.py
    :lines: 1-7

This code prints out::

    1 <0.000, 108.000>
    2 <72.000, 36.000>
    3 <72.000, 36.000>
    4 <72.000, 36.000>

Contingencies behave like lists of distributions (in this case, class
distributions) indexed by values (of `e`, in this
example). Distributions are, in turn, indexed by values (class values,
here). The variable `e` from the above example is called the outer
variable, and the class is the inner. This can also be reversed. It is
also possible to use features for both the outer and the inner variable, so
the table shows distributions of one variable's values given the
value of another.  There is a corresponding hierarchy of classes:
:obj:`Table` is a base class for :obj:`VarVar` (both
variables are attributes) and :obj:`Class` (one variable is
the class).  The latter is the base class for
:obj:`VarClass` and :obj:`ClassVar`.

The most commonly used of the above classes is :obj:`VarClass`, which
can compute and store conditional probabilities of classes given the feature value.
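
The step from absolute frequencies to conditional probabilities can be
sketched in plain Python, independently of Orange. The counts are the ones
printed above for monks-1; the names ``counts`` and ``normalize_rows`` are
illustrative only and are not part of the Orange API.

```python
# Absolute frequencies from the example above: each outer value of e maps
# to the counts of the two class values (the inner variable).
counts = {
    "1": [0.0, 108.0],
    "2": [72.0, 36.0],
    "3": [72.0, 36.0],
    "4": [72.0, 36.0],
}


def normalize_rows(table):
    # Divide each cell by its row sum; rows that sum to zero are copied as is.
    result = {}
    for outer_value, row in table.items():
        total = sum(row)
        result[outer_value] = [c / total for c in row] if total else list(row)
    return result


probabilities = normalize_rows(counts)
for outer_value in sorted(probabilities):
    row = probabilities[outer_value]
    print(outer_value, ["%.3f" % p for p in row])
```

Each normalized row is the distribution of the class given one value of `e`,
so every row sums to 1 while the original ``counts`` stay untouched.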

Contingency tables
==================

.. class:: Table

    Provides a base class for storing and manipulating contingency
    tables. Although it is not abstract, it is seldom used directly but rather
    through more convenient derived classes described below.

    .. attribute:: outerVariable

       Outer variable (:class:`Orange.data.variable.Variable`) whose values are
       used as the first, outer index.

    .. attribute:: innerVariable

       Inner variable (:class:`Orange.data.variable.Variable`) whose values are
       used as the second, inner index.

    .. attribute:: outerDistribution

        The marginal distribution (:class:`Distribution`) of the outer variable.

    .. attribute:: innerDistribution

        The marginal distribution (:class:`Distribution`) of the inner variable.

    .. attribute:: innerDistributionUnknown

        The distribution (:class:`distribution.Distribution`) of the inner variable for
        instances for which the outer variable was undefined. This is the
        difference between the ``innerDistribution`` and the (unconditional)
        distribution of the inner variable.

    .. attribute:: varType

        The type of the outer variable (:obj:`Orange.data.Type`, usually
        :obj:`Orange.data.variable.Discrete` or
        :obj:`Orange.data.variable.Continuous`); equals
        ``outerVariable.varType`` and ``outerDistribution.varType``.

    .. method:: __init__(outer_variable, inner_variable)

        Construct an instance of contingency table for the given pair of
        variables.

        :param outer_variable: Descriptor of the outer variable
        :type outer_variable: Orange.data.variable.Variable
        :param inner_variable: Descriptor of the inner variable
        :type inner_variable: Orange.data.variable.Variable

    .. method:: add(outer_value, inner_value[, weight=1])

        Add an element to the contingency table by adding ``weight`` to the
        corresponding cell.

        :param outer_value: The value for the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value for the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float

    .. method:: normalize()

        Normalize all distributions (rows) in the table to sum to ``1``::

            >>> cont.normalize()
            >>> for val, dist in cont.items():
            ...     print val, dist

        Output: ::

            1 <0.000, 1.000>
            2 <0.667, 0.333>
            3 <0.667, 0.333>
            4 <0.667, 0.333>

        .. note::

            This method does not change the ``innerDistribution`` or
            ``outerDistribution``.

    With respect to indexing, a contingency table is a cross between a dictionary
    and a list. It supports the standard dictionary methods ``keys``, ``values`` and
    ``items``. ::

        >>> print cont.keys()
        ['1', '2', '3', '4']
        >>> print cont.values()
        [<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
        >>> print cont.items()
        [('1', <0.000, 108.000>), ('2', <72.000, 36.000>),
        ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]

    Although keys returned by the above functions are strings, a contingency can
    be indexed by anything that can be converted into values of the outer
    variable: strings, numbers or instances of ``Orange.data.Value``. ::

        >>> print cont[0]
        <0.000, 108.000>
        >>> print cont["1"]
        <0.000, 108.000>
        >>> print cont[Orange.data.Value(data.domain["e"], "1")]
        <0.000, 108.000>

    The length of the table equals the number of values of the outer
    variable. However, iterating through a contingency
    does not return keys, as with dictionaries, but distributions. ::

        >>> for i in cont:
        ...     print i
        <0.000, 108.000>
        <72.000, 36.000>
        <72.000, 36.000>
        <72.000, 36.000>
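
The dictionary/list hybrid behaviour described above can be imitated with a
small stand-in class. This is only a sketch in plain Python: ``MiniContingency``
and its internals are hypothetical and do not reflect how Orange implements
:obj:`Table`; rows are plain lists of counts rather than distributions.

```python
class MiniContingency(object):
    # Hypothetical stand-in for a contingency table over discrete variables.

    def __init__(self, outer_values, inner_values):
        self.outer_values = list(outer_values)
        self.inner_values = list(inner_values)
        self.rows = [[0.0] * len(self.inner_values) for _ in self.outer_values]

    def _outer_index(self, key):
        # Accept either a position (int) or a value name (str).
        if isinstance(key, int):
            return key
        return self.outer_values.index(key)

    def add(self, outer_value, inner_value, weight=1.0):
        row = self.rows[self._outer_index(outer_value)]
        row[self.inner_values.index(inner_value)] += weight

    def __getitem__(self, key):
        return self.rows[self._outer_index(key)]

    def __len__(self):
        return len(self.outer_values)

    def __iter__(self):
        # Unlike a dict, iteration yields the rows, not the keys.
        return iter(self.rows)

    def keys(self):
        return list(self.outer_values)

    def values(self):
        return list(self.rows)

    def items(self):
        return list(zip(self.outer_values, self.rows))


cont = MiniContingency(["1", "2", "3", "4"], ["0", "1"])
cont.add("1", "1", 108)
for e in ["2", "3", "4"]:
    cont.add(e, "0", 72)
    cont.add(e, "1", 36)

print(cont.keys())
print(cont[0] == cont["1"])
print(len(cont), len(list(cont)))
```

Positional and by-name indexing reach the same row, and iterating produces as
many rows as there are outer values.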


.. class:: Class

    An abstract base class for contingency tables that contain the class,
    either as the inner or the outer variable.

    .. attribute:: classVar (read only)

        The class attribute descriptor; always equal to either
        :obj:`Table.innerVariable` or :obj:`Table.outerVariable`.

    .. attribute:: variable

        The descriptor of the variable (feature); always equal to either
        ``innerVariable`` or ``outerVariable``.

    .. method:: add_var_class(variable_value, class_value[, weight=1])

        Add an element to the contingency by increasing the corresponding count. The
        difference between this and :obj:`Table.add` is that the variable
        value is always the first argument and the class value the second,
        regardless of which one is inner and which one is outer.

        :param variable_value: Variable value
        :type variable_value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float
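
The argument-order guarantee of ``add_var_class`` can be illustrated with a
minimal sketch. The class ``ClassTable``, its ``class_is_outer`` flag and the
values ``"sunny"`` and ``"yes"`` are invented for this illustration; Orange's
own classes are implemented differently.

```python
class ClassTable(object):
    # Hypothetical minimal table; cells are keyed by (outer, inner) pairs.

    def __init__(self, class_is_outer):
        self.class_is_outer = class_is_outer
        self.cells = {}

    def add(self, outer_value, inner_value, weight=1.0):
        key = (outer_value, inner_value)
        self.cells[key] = self.cells.get(key, 0.0) + weight

    def add_var_class(self, variable_value, class_value, weight=1.0):
        # The caller always passes (variable, class); the table decides
        # which of the two becomes the outer index.
        if self.class_is_outer:
            self.add(class_value, variable_value, weight)
        else:
            self.add(variable_value, class_value, weight)


var_class = ClassTable(class_is_outer=False)
class_var = ClassTable(class_is_outer=True)
for table in (var_class, class_var):
    table.add_var_class("sunny", "yes")
    table.add_var_class("sunny", "yes", 2.0)

print(var_class.cells)
print(class_var.cells)
```

The same two calls fill both tables; only the position of the class value in
the cell key differs.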


.. class:: VarClass

    A class derived from :obj:`Class` in which the variable is
    used as :obj:`Table.outerVariable` and the class as the
    :obj:`Table.innerVariable`. This form is suitable for
    computing conditional class probabilities given the variable value.

    Calling :obj:`VarClass.add_var_class(v, c)` is equivalent to
    :obj:`Table.add(v, c)`. Like :obj:`Table`,
    :obj:`VarClass` can compute contingency from instances.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`VarClass` for the given pair of
        variables. Inherited from :obj:`Table`.

        :param feature: Outer variable
        :type feature: Orange.data.variable.Variable
        :param class_variable: Class variable; used as ``innerVariable``
        :type class_variable: Orange.data.variable.Variable

    .. method:: __init__(feature, data[, weightId])

        Compute the contingency table from data.

        :param feature: Outer variable
        :type feature: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_class(value)

        Return the probability distribution of classes given the value of the
        variable.

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution


    .. method:: p_class(value, class_value)

        Return the conditional probability of ``class_value`` given the
        feature value, p(class_value|value) (note the order of arguments!).

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. literalinclude:: code/statistics-contingency3.py
        :lines: 1-23

    The inner and the outer variable and their relations to the class are
    as follows::

        Inner variable:  y
        Outer variable:  e

        Class variable:  y
        Feature:         e

    Distributions are normalized, and probabilities are elements from the
    normalized distributions. Knowing that the target concept is
    y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1
    has a 100% probability, while for the rest, the probability is about one third, which
    agrees with the probability that two three-valued independent features
    have the same value. ::

        Distributions:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>

        Probabilities of class '1'
          p(1|1) = 1.000
          p(1|2) = 0.338
          p(1|3) = 0.341
          p(1|4) = 0.331

        Distributions from a matrix computed manually:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>
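
The one-third figure mentioned above can be checked by enumeration. The
sketch assumes, as the text does, that the two features are independent and
that each takes its three values with equal probability; the probability that
they agree is then the number of matching pairs divided by the number of all
pairs.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 3]          # three values for each of the two features
pairs = list(product(values, repeat=2))
matching = [pair for pair in pairs if pair[0] == pair[1]]

p_equal = Fraction(len(matching), len(pairs))
print(p_equal)              # 1/3
```

Three of the nine equally likely pairs match, which is close to the observed
values 0.338, 0.341 and 0.331.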


.. class:: ClassVar

    :obj:`ClassVar` is similar to :obj:`VarClass` except
    that the class is the outer variable and the feature is the inner. This form of
    contingency table is suitable for computing conditional probabilities of
    the variable given the class. All methods get the two arguments in the same
    order as in :obj:`VarClass`.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`ClassVar` for the given pair of
        variables. Inherited from :obj:`Table`, except for the reversed
        order of arguments.

        :param feature: Inner variable
        :type feature: Orange.data.variable.Variable
        :param class_variable: Class variable; used as ``outerVariable``
        :type class_variable: Orange.data.variable.Variable

    .. method:: __init__(feature, data[, weightId])

        Compute contingency table from the data.

        :param feature: Descriptor of the inner variable
        :type feature: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(class_value)

        Return the probability distribution of the variable given the class.

        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(value, class_value)

        Return the conditional probability of the value given the
        class, p(value|class_value).

        :param value: Value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 1-27

    The roles of the feature and the class are reversed compared to
    :obj:`VarClass`::

        Inner variable:  e
        Outer variable:  y

        Class variable:  y
        Feature:         e

    Distributions given the class can be printed out by calling :meth:`p_attr`.

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 30-31

    will print::

        p(.|0) = <0.000, 0.333, 0.333, 0.333>
        p(.|1) = <0.500, 0.167, 0.167, 0.167>

    If the class value is `0`, the attribute `e` cannot be `1` (the first
    value), while the distribution across other values is uniform.  If the class
    value is `1`, `e` is `1` for exactly half of the instances, and the distribution of
    other values is again uniform.
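
These class-conditional distributions follow directly from the counts. The
plain-Python sketch below assumes the monks-1 frequencies shown at the top of
this document (0/108 for ``e=1`` and 72/36 for the other values); the variable
names are illustrative and are not Orange attributes.

```python
# Rows indexed by e (outer), columns by class, as in a VarClass-style table.
var_class = [
    [0.0, 108.0],   # e = 1
    [72.0, 36.0],   # e = 2
    [72.0, 36.0],   # e = 3
    [72.0, 36.0],   # e = 4
]

# Swap the roles: rows indexed by class, columns by e.
class_var = [list(column) for column in zip(*var_class)]

# Normalize each row to obtain p(e | class).
p_attr = [[cell / sum(row) for cell in row] for row in class_var]

for class_value, dist in enumerate(p_attr):
    formatted = ", ".join("%.3f" % p for p in dist)
    print("p(.|%d) = <%s>" % (class_value, formatted))
```

Transposing the counts before normalizing is what distinguishes the two
table types: the same cells are divided by class totals instead of by
feature-value totals.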

.. class:: VarVar

    Contingency table in which none of the variables is the class.  The class
    is derived from :obj:`Table`, and adds an additional constructor and
    method for getting conditional probabilities.

    .. method:: __init__(outer_variable, inner_variable)

        Inherited from :obj:`Table`.

    .. method:: __init__(outer_variable, inner_variable, data[, weightId])

        Compute the contingency from the given instances.

        :param outer_variable: Outer variable
        :type outer_variable: Orange.data.variable.Variable
        :param inner_variable: Inner variable
        :type inner_variable: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(outer_value)

        Return the probability distribution of the inner variable given the
        outer variable value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(outer_value, inner_value)

        Return the conditional probability of the inner_value
        given the outer_value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value of the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    The following example investigates which material is used for
    bridges of different lengths.

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 1-17

    Short bridges are mostly wooden or iron, while the long ones (and most of the
    medium-sized ones) are made of steel::

        SHORT:
           WOOD (56%)
           IRON (44%)

        MEDIUM:
           WOOD (9%)
           IRON (11%)
           STEEL (79%)

        LONG:
           STEEL (100%)

    Like all other contingency tables, this one can also be computed "manually".

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 18-


Contingencies for entire domain
===============================

A list of contingency tables, either :obj:`VarClass` or
:obj:`ClassVar`.

.. class:: Domain

    .. method:: __init__(data[, weightId=0, classIsOuter=0|1])

        Compute a list of contingency tables.

        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int
        :param classIsOuter: `True`, if class is the outer variable
        :type classIsOuter: bool

        .. note::

            ``classIsOuter`` cannot be given as a positional argument,
            but needs to be passed by keyword.

    .. attribute:: classIsOuter (read only)

        Tells whether the class is the outer or the inner variable.

    .. attribute:: classes

        Contains the distribution of class values on the entire dataset.

    .. method:: normalize()

        Call normalize for all contingencies.

    The following script prints the contingency tables for features
    "a", "b" and "e" for the dataset Monk 1.

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 9

    Contingency tables of type :obj:`VarClass` give
    the conditional distributions of classes, given the value of the variable.

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 12-
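
The idea of one contingency table per feature can be sketched without Orange.
The data set below is invented toy data (two features followed by the class),
and ``domain_contingency`` is a hypothetical helper, not the constructor of
:obj:`Domain`; it builds feature-outer tables only.

```python
from collections import defaultdict

# Invented toy data: two feature values followed by the class value.
feature_names = ["a", "b"]
rows = [
    ("1", "x", "yes"),
    ("1", "y", "no"),
    ("2", "x", "yes"),
    ("2", "x", "yes"),
]


def domain_contingency(rows, n_features):
    # One table per feature; each maps feature value -> class value -> count.
    tables = [defaultdict(lambda: defaultdict(float)) for _ in range(n_features)]
    for row in rows:
        class_value = row[-1]
        for table, feature_value in zip(tables, row[:n_features]):
            table[feature_value][class_value] += 1.0
    return tables


tables = domain_contingency(rows, len(feature_names))
for name, table in zip(feature_names, tables):
    for feature_value in sorted(table):
        print(name, feature_value, dict(table[feature_value]))
```

A single pass over the instances updates every table, so the result is a list
with one entry per feature, in the order of the features.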

.. _contcont:

Contingency tables for continuous variables
===========================================

If the outer variable is continuous, the index must be one of the
values that do exist in the contingency table; other values raise an
exception:

.. literalinclude:: code/statistics-contingency6.py
    :lines: 1-4,17-

Since even rounding can be a problem, the only safe way to get the key
is to take it from the contingency's ``keys``.
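
The rounding problem is a general property of floating-point keys and can be
reproduced with an ordinary dictionary. In this sketch the dictionary merely
stands in for a table with a continuous outer variable; the keys and row
labels are invented.

```python
# A dict stands in for a table whose outer variable is continuous.
table = {0.1: "row for 0.1", 0.3: "row for 0.3"}

computed = 0.1 + 0.2           # mathematically 0.3, but not the same float
print(computed == 0.3)         # False
print(computed in table)       # False

# Safe lookup: pick the stored key closest to the value of interest.
nearest = min(table.keys(), key=lambda k: abs(k - computed))
print(table[nearest])          # row for 0.3
```

Taking the key from ``keys()`` guarantees that the lookup uses exactly the
stored value rather than a recomputed approximation of it.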

Contingency tables with a discrete outer variable and a continuous inner variable
are more useful, since the methods :obj:`VarClass.p_class`
and :obj:`ClassVar.p_attr` use the primitive density estimation
provided by :obj:`Orange.statistics.distribution.Distribution`.

For example, :obj:`ClassVar` on the iris dataset can return the
estimated frequency of the sepal length 5.5 for different classes:

.. literalinclude:: code/statistics-contingency7.py

The script outputs::

    Estimated frequencies for e=5.5
      f(5.5|Iris-setosa) = 2.000
      f(5.5|Iris-versicolor) = 5.000
      f(5.5|Iris-virginica) = 1.000

"""

from Orange.core import Contingency as Table
from Orange.core import ContingencyAttrAttr as VarVar
from Orange.core import ContingencyClass as Class
from Orange.core import ContingencyAttrClass as VarClass
from Orange.core import ContingencyClassAttr as ClassVar

from Orange.core import DomainContingency as Domain