source: orange/orange/Orange/statistics/contingency.py @ 7759:06a126b6aa4a

Revision 7759:06a126b6aa4a, 18.7 KB checked in by markotoplak, 3 years ago (diff)

Put Orange.statistics in the index.

"""
=================
Contingency table
=================

Contingency tables contain conditional distributions. Unless explicitly
'normalized', they contain absolute frequencies, that is, the number of
instances with a particular combination of two variables' values. If they are
normalized by dividing each cell by the row sum, they represent conditional
probabilities of the column variable (here denoted as ``innerVariable``)
conditioned by the row variable (``outerVariable``).
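
As a minimal sketch of the row normalization described above, the following
plain-Python snippet divides each cell by its row sum. It does not use Orange;
the helper name ``normalize_rows``, the dictionary-of-lists layout and the
counts are assumptions made only for illustration.

```python
def normalize_rows(table):
    """Return a copy of `table` with every row divided by its row sum.

    `table` maps each outer value to a list of counts, one per inner value.
    Rows whose sum is zero are copied unchanged to avoid division by zero.
    """
    normalized = {}
    for outer_value, counts in table.items():
        total = float(sum(counts))
        if total > 0:
            normalized[outer_value] = [c / total for c in counts]
        else:
            normalized[outer_value] = list(counts)
    return normalized


# Illustrative counts: two outer values, two inner values each.
counts = {"1": [0, 108], "2": [72, 36]}
probs = normalize_rows(counts)
print(probs["2"])  # conditional probabilities of the inner values given "2"
```

After normalization every non-empty row sums to 1, so each row can be read as
the distribution of the inner variable given one value of the outer variable.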

Contingency tables are usually constructed for discrete variables. Tables
for continuous variables have certain limitations described in a :ref:`separate
section <contcont>`.

The example below loads the monks-1 data set and prints out the conditional
class distribution given the value of `e`.

.. _statistics-contingency: code/statistics-contingency.py

part of `statistics-contingency`_ (uses monks-1.tab)

.. literalinclude:: code/statistics-contingency.py
    :lines: 1-7

This code prints out::

    1 <0.000, 108.000>
    2 <72.000, 36.000>
    3 <72.000, 36.000>
    4 <72.000, 36.000>
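
The counts above can be cross-checked without Orange. The sketch below rests
on three assumptions: that the data set contains every combination of the six
features exactly once, that the features `a` to `f` have 3, 3, 2, 3, 4 and 2
values, and that the class is 1 when `e` is 1 or `a` equals `b`. The function
name is hypothetical.

```python
from itertools import product

# Assumed numbers of values for the features a, b, c, d, e, f.
CARDINALITIES = (3, 3, 2, 3, 4, 2)


def class_counts_given_e():
    """Count instances per class (index 0 and 1) for each value of e."""
    counts = {e: [0, 0] for e in range(1, 5)}
    ranges = [range(1, n + 1) for n in CARDINALITIES]
    for a, b, c, d, e, f in product(*ranges):
        # Assumed target concept: y := (e = 1) or (a = b).
        y = 1 if (e == 1 or a == b) else 0
        counts[e][y] += 1
    return counts


counts = class_counts_given_e()
for e in sorted(counts):
    print("%d <%.3f, %.3f>" % (e, counts[e][0], counts[e][1]))
```

Under these assumptions each value of `e` covers 108 instances, and for `e`
other than 1 one third of them (those with `a` equal to `b`) fall into class 1.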

Contingencies behave like lists of distributions (in this case, class
distributions) indexed by values (of `e`, in this
example). Distributions are, in turn, indexed by values (class values,
here). The variable `e` from the above example is called the outer
variable, and the class is the inner. This can also be reversed. It is
also possible to use features for both the outer and the inner variable, so
the table shows distributions of one variable's values given the
value of another.  There is a corresponding hierarchy of classes:
:obj:`Table` is a base class for :obj:`VarVar` (both
variables are attributes) and :obj:`Class` (one variable is
the class).  The latter is the base class for
:obj:`VarClass` and :obj:`ClassVar`.

The most commonly used of the above classes is :obj:`VarClass`, which
can compute and store conditional probabilities of classes given the feature value.

Contingency tables
==================

.. class:: Table

    Provides a base class for storing and manipulating contingency
    tables. Although it is not abstract, it is seldom used directly but rather
    through more convenient derived classes described below.

    .. attribute:: outerVariable

       Outer variable (:class:`Orange.data.variable.Variable`) whose values are
       used as the first, outer index.

    .. attribute:: innerVariable

       Inner variable (:class:`Orange.data.variable.Variable`) whose values are
       used as the second, inner index.

    .. attribute:: outerDistribution

        The marginal distribution (:class:`Distribution`) of the outer variable.

    .. attribute:: innerDistribution

        The marginal distribution (:class:`Distribution`) of the inner variable.

    .. attribute:: innerDistributionUnknown

        The distribution (:class:`distribution.Distribution`) of the inner variable for
        instances for which the outer variable was undefined. This is the
        difference between the ``innerDistribution`` and the (unconditional)
        distribution of the inner variable.

    .. attribute:: varType

        The type of the outer variable (:obj:`Orange.data.Type`, usually
        :obj:`Orange.data.variable.Discrete` or
        :obj:`Orange.data.variable.Continuous`); equals
        ``outerVariable.varType`` and ``outerDistribution.varType``.

    .. method:: __init__(outer_variable, inner_variable)

        Construct an instance of contingency table for the given pair of
        variables.

        :param outer_variable: Descriptor of the outer variable
        :type outer_variable: Orange.data.variable.Variable
        :param inner_variable: Descriptor of the inner variable
        :type inner_variable: Orange.data.variable.Variable

    .. method:: add(outer_value, inner_value[, weight=1])

        Add an element to the contingency table by adding ``weight`` to the
        corresponding cell.

        :param outer_value: The value for the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value for the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float

    .. method:: normalize()

        Normalize all distributions (rows) in the table to sum to ``1``::

            >>> cont.normalize()
            >>> for val, dist in cont.items():
            ...     print val, dist

        Output: ::

            1 <0.000, 1.000>
            2 <0.667, 0.333>
            3 <0.667, 0.333>
            4 <0.667, 0.333>

        .. note::

            This method does not change the ``innerDistribution`` or
            ``outerDistribution``.

    With respect to indexing, a contingency table is a cross between a dictionary
    and a list. It supports the standard dictionary methods ``keys``, ``values`` and
    ``items``. ::

        >>> print cont.keys()
        ['1', '2', '3', '4']
        >>> print cont.values()
        [<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
        >>> print cont.items()
        [('1', <0.000, 108.000>), ('2', <72.000, 36.000>),
        ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]

    Although the keys returned by the above functions are strings, a contingency can
    be indexed by anything that can be converted into values of the outer
    variable: strings, numbers or instances of ``Orange.data.Value``. ::

        >>> print cont[0]
        <0.000, 108.000>
        >>> print cont["1"]
        <0.000, 108.000>
        >>> print cont[Orange.data.Value(data.domain["e"], "1")]
        <0.000, 108.000>

    The length of the table equals the number of values of the outer
    variable. However, iterating through a contingency
    does not return keys, as with dictionaries, but distributions. ::

        >>> for i in cont:
        ...     print i
        <0.000, 108.000>
        <72.000, 36.000>
        <72.000, 36.000>
        <72.000, 36.000>
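
The indexing behaviour described for :obj:`Table` can be imitated in a few
lines of plain Python. The class below is only a sketch and is not part of
Orange: the name ``MiniContingency`` is hypothetical, rows are ordinary lists
rather than distributions, and keys are matched by their string form.

```python
class MiniContingency(object):
    """Toy table: one row of counts per outer value."""

    def __init__(self, outer_values, rows):
        self._outer_values = list(outer_values)   # e.g. ['1', '2', '3', '4']
        self._rows = [list(row) for row in rows]  # one row per outer value

    def _index(self, key):
        # Integers are positions; anything else is looked up by its string form.
        if isinstance(key, int):
            return key
        return self._outer_values.index(str(key))

    def __getitem__(self, key):
        return self._rows[self._index(key)]

    def __len__(self):
        return len(self._rows)

    def __iter__(self):
        # Unlike a dictionary, iteration yields the rows, not the keys.
        return iter(self._rows)

    def keys(self):
        return list(self._outer_values)

    def values(self):
        return list(self._rows)

    def items(self):
        return list(zip(self._outer_values, self._rows))


cont = MiniContingency(["1", "2", "3", "4"],
                       [[0, 108], [72, 36], [72, 36], [72, 36]])
print(cont[0] == cont["1"])  # position 0 and key "1" address the same row
for row in cont:
    print(row)
```

Defining ``__iter__`` to yield rows is what makes the object list-like under
iteration while ``keys``, ``values`` and ``items`` keep it dictionary-like.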


.. class:: Class

    An abstract base class for contingency tables that contain the class,
    either as the inner or the outer variable.

    .. attribute:: classVar (read only)

        The class attribute descriptor; always equal to either
        :obj:`Table.innerVariable` or :obj:`Table.outerVariable`.

    .. attribute:: variable

        The variable; always equal to either ``innerVariable`` or ``outerVariable``.

    .. method:: add_var_class(variable_value, class_value[, weight=1])

        Add an element to the contingency by increasing the corresponding count. The
        difference between this and :obj:`Table.add` is that the variable
        value is always the first argument and the class value the second,
        regardless of which one is inner and which one is outer.

        :param variable_value: Variable value
        :type variable_value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :param weight: Instance weight
        :type weight: float
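
The argument routing performed by ``add_var_class`` can be sketched as follows.
This is an illustration in plain Python, not Orange's implementation; the class
``ToyClassTable`` and its ``class_is_outer`` flag are hypothetical.

```python
class ToyClassTable(object):
    """Toy table that remembers whether the class is the outer variable."""

    def __init__(self, class_is_outer):
        self.class_is_outer = class_is_outer
        self.cells = {}  # (outer_value, inner_value) -> accumulated weight

    def add(self, outer_value, inner_value, weight=1):
        key = (outer_value, inner_value)
        self.cells[key] = self.cells.get(key, 0) + weight

    def add_var_class(self, variable_value, class_value, weight=1):
        # The caller always passes (variable, class); only the routing differs.
        if self.class_is_outer:
            self.add(class_value, variable_value, weight)
        else:
            self.add(variable_value, class_value, weight)


var_class = ToyClassTable(class_is_outer=False)
class_var = ToyClassTable(class_is_outer=True)
for table in (var_class, class_var):
    table.add_var_class("2", "0", weight=1.5)
print(var_class.cells)
print(class_var.cells)
```

The same call fills the cell ("2", "0") in the first table and ("0", "2") in
the second, so code that adds instances need not know the table's layout.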


.. class:: VarClass

    A class derived from :obj:`Class` in which the variable is
    used as :obj:`Table.outerVariable` and the class as the
    :obj:`Table.innerVariable`. This form is suitable for
    computing conditional class probabilities given the variable value.

    Calling ``VarClass.add_var_class(v, c)`` is equivalent to
    ``Table.add(v, c)``. Like :obj:`Table`,
    :obj:`VarClass` can compute the contingency from instances.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`VarClass` for the given pair of
        variables. Inherited from :obj:`Table`.

        :param feature: Outer variable
        :type feature: Orange.data.variable.Variable
        :param class_variable: Class variable; used as ``innerVariable``
        :type class_variable: Orange.data.variable.Variable

    .. method:: __init__(feature, data[, weightId])

        Compute the contingency table from data.

        :param feature: Outer variable
        :type feature: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_class(value)

        Return the probability distribution of classes given the value of the
        variable.

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution


    .. method:: p_class(value, class_value)

        Return the conditional probability of the class_value given the
        feature value, p(class_value|value) (note the order of arguments!)

        :param value: The value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. _statistics-contingency3.py: code/statistics-contingency3.py

    part of `statistics-contingency3.py`_ (uses monks-1.tab)

    .. literalinclude:: code/statistics-contingency3.py
        :lines: 1-23

    The inner and the outer variable and their relations to the class are
    as follows::

        Inner variable:  y
        Outer variable:  e

        Class variable:  y
        Feature:         e

    Distributions are normalized, and probabilities are elements from the
    normalized distributions. Knowing that the target concept is
    y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1
    has a 100% probability, while for the rest, the probability is about one
    third, which agrees with the probability that two independent three-valued
    features have the same value. ::

        Distributions:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>

        Probabilities of class '1'
          p(1|1) = 1.000
          p(1|2) = 0.338
          p(1|3) = 0.341
          p(1|4) = 0.331

        Distributions from a matrix computed manually:
          p(.|1) = <0.000, 1.000>
          p(.|2) = <0.662, 0.338>
          p(.|3) = <0.659, 0.341>
          p(.|4) = <0.669, 0.331>


.. class:: ClassVar

    :obj:`ClassVar` is similar to :obj:`VarClass` except
    that the class is outside and the variable is inside. This form of
    contingency table is suitable for computing conditional probabilities of
    the variable given the class. All methods get the two arguments in the same
    order as :obj:`VarClass`.

    .. method:: __init__(feature, class_variable)

        Construct an instance of :obj:`ClassVar` for the given pair of
        variables. Inherited from :obj:`Table`, except for the reversed
        order of arguments.

        :param feature: Inner variable
        :type feature: Orange.data.variable.Variable
        :param class_variable: Class variable; used as ``outerVariable``
        :type class_variable: Orange.data.variable.Variable

    .. method:: __init__(feature, data[, weightId])

        Compute the contingency table from the data.

        :param feature: Descriptor of the inner variable
        :type feature: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(class_value)

        Return the probability distribution of the variable given the class.

        :param class_value: The class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(value, class_value)

        Return the conditional probability of the value given the
        class, p(value|class_value).

        :param value: Value of the variable
        :type value: int, float, string or :obj:`Orange.data.Value`
        :param class_value: Class value
        :type class_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    .. _statistics-contingency4.py: code/statistics-contingency4.py

    part of `statistics-contingency4.py`_ (uses monks-1.tab)

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 1-27

    The roles of the feature and the class are reversed compared to
    :obj:`VarClass`::

        Inner variable:  e
        Outer variable:  y

        Class variable:  y
        Feature:         e

    Distributions given the class can be printed out by calling :meth:`p_attr`.

    part of `statistics-contingency4.py`_ (uses monks-1.tab)

    .. literalinclude:: code/statistics-contingency4.py
        :lines: 30-31

    This prints::

        p(.|0) = <0.000, 0.333, 0.333, 0.333>
        p(.|1) = <0.500, 0.167, 0.167, 0.167>

    If the class value is `0`, the attribute `e` cannot be `1` (the first
    value), while the distribution across the other values is uniform.  If the class
    value is `1`, `e` is `1` for exactly half of the instances, and the distribution of
    the other values is again uniform.
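
The two distributions above can be reproduced with a short plain-Python
enumeration. The sketch does not use Orange and assumes that the data set holds
every feature combination exactly once, that the features `a` to `f` have 3, 3,
2, 3, 4 and 2 values, and that the target concept is y := (e=1) or (a=b). The
function name is hypothetical.

```python
from itertools import product

CARDINALITIES = (3, 3, 2, 3, 4, 2)  # assumed value counts for a, b, c, d, e, f


def p_e_given_class():
    """Return {class: [p(e=1|class), ..., p(e=4|class)]}."""
    counts = {0: [0, 0, 0, 0], 1: [0, 0, 0, 0]}
    ranges = [range(1, n + 1) for n in CARDINALITIES]
    for a, b, c, d, e, f in product(*ranges):
        y = 1 if (e == 1 or a == b) else 0  # assumed target concept
        counts[y][e - 1] += 1
    # Normalize each row (one per class) by its own sum.
    return {y: [n / float(sum(row)) for n in row] for y, row in counts.items()}


for y, dist in sorted(p_e_given_class().items()):
    print("p(.|%d) = <%s>" % (y, ", ".join("%.3f" % p for p in dist)))
```

Here the class is the outer index and `e` the inner one, which is the
:obj:`ClassVar` layout: each row is a distribution over the values of `e`.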

.. class:: VarVar

    Contingency table in which none of the variables is the class.  The class
    is derived from :obj:`Table`, and adds an additional constructor and
    a method for getting conditional probabilities.

    .. method:: __init__(outer_variable, inner_variable)

        Inherited from :obj:`Table`.

    .. method:: __init__(outer_variable, inner_variable, data[, weightId])

        Compute the contingency from the given instances.

        :param outer_variable: Outer variable
        :type outer_variable: Orange.data.variable.Variable
        :param inner_variable: Inner variable
        :type inner_variable: Orange.data.variable.Variable
        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int

    .. method:: p_attr(outer_value)

        Return the probability distribution of the inner variable given the
        outer variable value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: Orange.statistics.distribution.Distribution

    .. method:: p_attr(outer_value, inner_value)

        Return the conditional probability of the inner_value
        given the outer_value.

        :param outer_value: The value of the outer variable
        :type outer_value: int, float, string or :obj:`Orange.data.Value`
        :param inner_value: The value of the inner variable
        :type inner_value: int, float, string or :obj:`Orange.data.Value`
        :rtype: float

    The following example investigates which material is used for
    bridges of different lengths.

    .. _statistics-contingency5.py: code/statistics-contingency5.py

    part of `statistics-contingency5.py`_ (uses bridges.tab)

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 1-17

    Short bridges are mostly made of wood or iron, while long bridges (and most
    of the medium-sized ones) are made of steel::

        SHORT:
           WOOD (56%)
           IRON (44%)

        MEDIUM:
           WOOD (9%)
           IRON (11%)
           STEEL (79%)

        LONG:
           STEEL (100%)

    Like all other contingency tables, this one can also be computed "manually".

    .. literalinclude:: code/statistics-contingency5.py
        :lines: 18-


Contingencies for entire domain
===============================

A list of contingency tables, either :obj:`VarClass` or
:obj:`ClassVar`.

.. class:: Domain

    .. method:: __init__(data[, weightId=0, classIsOuter=0|1])

        Compute a list of contingency tables.

        :param data: A set of instances
        :type data: Orange.data.Table
        :param weightId: meta attribute with weights of instances
        :type weightId: int
        :param classIsOuter: `True`, if the class is the outer variable
        :type classIsOuter: bool

        .. note::

            ``classIsOuter`` cannot be given as a positional argument,
            but needs to be passed by keyword.

    .. attribute:: classIsOuter (read only)

        Tells whether the class is the outer or the inner variable.

    .. attribute:: classes

        Contains the distribution of class values on the entire dataset.

    .. method:: normalize()

        Call normalize for all contingencies.

    The following script prints the contingency tables for features
    "a", "b" and "e" for the dataset Monk 1.

    .. _statistics-contingency8: code/statistics-contingency8.py

    part of `statistics-contingency8`_ (uses monks-1.tab)

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 9

    Contingency tables of type :obj:`VarClass` give
    the conditional distributions of classes, given the value of the variable.

    part of `statistics-contingency8`_ (uses monks-1.tab)

    .. literalinclude:: code/statistics-contingency8.py
        :lines: 12-
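
The idea of one contingency table per feature can be sketched in plain Python.
This is not Orange's implementation: the function ``domain_contingency``, its
``class_is_outer`` flag, the row format and the toy rows are all hypothetical
and serve only to illustrate the two layouts.

```python
def domain_contingency(rows, feature_names, class_is_outer=False):
    """Build one table per feature from (features_dict, class_value) rows.

    Each table maps (outer_value, inner_value) to a count.
    """
    tables = []
    for name in feature_names:
        table = {}
        for features, class_value in rows:
            value = features[name]
            # The flag decides which of the two values becomes the outer index.
            key = (class_value, value) if class_is_outer else (value, class_value)
            table[key] = table.get(key, 0) + 1
        tables.append(table)
    return tables


# Hypothetical toy rows, not taken from any real data set.
rows = [({"a": 1, "b": 1}, 1), ({"a": 1, "b": 2}, 0), ({"a": 2, "b": 2}, 1)]
names = ["a", "b"]
for name, table in zip(names, domain_contingency(rows, names)):
    print("%s %s" % (name, sorted(table.items())))
```

With ``class_is_outer=False`` the tables follow the :obj:`VarClass` layout
(feature value first); with ``class_is_outer=True`` they follow :obj:`ClassVar`.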


.. _contcont:

Contingency tables for continuous variables
===========================================

If the outer variable is continuous, the index must be one of the
values that do exist in the contingency table; other values raise an
exception.

.. _statistics-contingency6: code/statistics-contingency6.py

part of `statistics-contingency6`_ (uses monks-1.tab)

.. literalinclude:: code/statistics-contingency6.py
    :lines: 1-4,17-

Since even rounding can be a problem, the only safe way to get the key
is to take it from the contingencies' ``keys``.
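
The rounding problem is easy to reproduce with an ordinary dictionary, which
serves here as a stand-in for a table with a continuous outer variable. The
snippet is a plain-Python analogy rather than Orange code, and ``nearest_key``
is a hypothetical helper.

```python
# A toy table keyed by a continuous outer value that was computed, not typed.
table = {0.1 + 0.2: [2, 1]}

# The "obvious" literal 0.3 is a different float, so the lookup would fail.
print(0.3 in table)

# Taking the key from the table itself always works.
key = list(table.keys())[0]
print(table[key])


def nearest_key(table, value):
    """Return the existing key closest to `value` (hypothetical helper)."""
    return min(table.keys(), key=lambda k: abs(k - value))


print(table[nearest_key(table, 0.3)])
```

Looking up the nearest existing key is one way to go from an approximate value
to a key that is guaranteed to be present.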

Contingency tables with a discrete outer variable and a continuous inner variable
are more useful, since methods :obj:`VarClass.p_class`
and :obj:`ClassVar.p_attr` use the primitive density estimation
provided by :obj:`Orange.statistics.distribution.Distribution`.

For example, :obj:`ClassVar` on the iris dataset can return the
probability of the sepal length 5.5 for different classes.

.. _statistics-contingency7: code/statistics-contingency7.py

part of `statistics-contingency7`_ (uses iris.tab)

.. literalinclude:: code/statistics-contingency7.py

The script outputs::

    Estimated frequencies for e=5.5
      f(5.5|Iris-setosa) = 2.000
      f(5.5|Iris-versicolor) = 5.000
      f(5.5|Iris-virginica) = 1.000

"""

from Orange.core import Contingency as Table
from Orange.core import ContingencyAttrAttr as VarVar
from Orange.core import ContingencyClass as Class
from Orange.core import ContingencyAttrClass as VarClass
from Orange.core import ContingencyClassAttr as ClassVar

from Orange.core import DomainContingency as Domain