Contingency table (contingency)
A contingency table contains conditional distributions. Unless explicitly ‘normalized’, it contains absolute frequencies, that is, the number of instances with a particular combination of two variables’ values. If the table is normalized by dividing each cell by the row sum, the rows represent conditional probabilities of the column variable (here denoted as innerVariable) conditioned on the row variable (outerVariable).
Contingency tables are usually constructed for discrete variables. Tables for continuous variables have certain limitations described in a separate section.
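The row normalization described above can be sketched in plain Python. This is only an illustration and not Orange's implementation: the helper name normalize_rows is hypothetical, and the counts are the monks-1 frequencies of the class given e that this page prints.

```python
# Plain-Python sketch; not Orange's implementation.
# Rows are values of the outer variable e, columns are class values 0 and 1.
counts = {
    "1": [0.0, 108.0],
    "2": [72.0, 36.0],
    "3": [72.0, 36.0],
    "4": [72.0, 36.0],
}


def normalize_rows(table):
    """Divide each cell by its row sum; rows that sum to zero are copied unchanged."""
    normalized = {}
    for outer_value, row in table.items():
        total = sum(row)
        normalized[outer_value] = [c / total for c in row] if total else list(row)
    return normalized


probs = normalize_rows(counts)
for outer_value in sorted(probs):
    row = ", ".join("%.3f" % p for p in probs[outer_value])
    print("%s <%s>" % (outer_value, row))
```

Each normalized row sums to 1 and can be read as the distribution of the inner variable given that outer value.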
The example below loads the monks-1 data set and prints out the conditional class distribution given the value of e.
import Orange
monks = Orange.data.Table("monks-1.tab")
cont = Orange.statistics.contingency.VarClass("e", monks)
for val, dist in cont.items():
    print val, dist
This code prints out:
1 <0.000, 108.000>
2 <72.000, 36.000>
3 <72.000, 36.000>
4 <72.000, 36.000>
Contingencies behave like lists of distributions (in this case, class distributions) indexed by values (of e, in this example). Distributions are, in turn, indexed by values (class values, here). The variable e from the above example is called the outer variable, and the class is the inner variable. This can also be reversed. It is also possible to use features for both the outer and the inner variable, so that the table shows distributions of one variable’s values given the value of another. There is a corresponding hierarchy of classes: Table is a base class for VarVar (both variables are attributes) and Class (one variable is the class). The latter is the base class for VarClass and ClassVar.
The most commonly used of the above classes is VarClass which can compute and store conditional probabilities of classes given the feature value.
Contingency tables
- class Orange.statistics.contingency.Table
Provides a base class for storing and manipulating contingency tables. Although it is not abstract, it is seldom used directly but rather through more convenient derived classes described below.
- outerVariable
Outer variable (Orange.feature.Descriptor) whose values are used as the first, outer index.
- innerVariable
Inner variable (Orange.feature.Descriptor) whose values are used as the second, inner index.
- outerDistribution
The marginal distribution (Distribution) of the outer variable.
- innerDistribution
The marginal distribution (Distribution) of the inner variable.
- innerDistributionUnknown
The distribution (distribution.Distribution) of the inner variable for instances for which the outer variable was undefined. This is the difference between the innerDistribution and the (unconditional) distribution of the inner variable.
- varType
The type of the outer variable (Orange.feature.Type, usually Orange.feature.Discrete or Orange.feature.Continuous); equals outerVariable.varType and outerDistribution.varType.
- __init__(outer_variable, inner_variable)
Construct an instance of contingency table for the given pair of variables.
Parameters: - outer_variable (Orange.feature.Descriptor) – Descriptor of the outer variable
- inner_variable (Orange.feature.Descriptor) – Descriptor of the inner variable
- add(outer_value, inner_value[, weight=1])
Add an element to the contingency table by adding weight to the corresponding cell.
Parameters: - outer_value (int, float, string or Orange.data.Value) – The value for the outer variable
- inner_value (int, float, string or Orange.data.Value) – The value for the inner variable
- weight (float) – Instance weight
- normalize()
Normalize all distributions (rows) in the table to sum to 1:
>>> cont.normalize()
>>> for val, dist in cont.items():
...     print val, dist
Output:
1 <0.000, 1.000>
2 <0.667, 0.333>
3 <0.667, 0.333>
4 <0.667, 0.333>
Note
This method does not change the innerDistribution or outerDistribution.
With respect to indexing, a contingency table is a cross between a dictionary and a list. It supports the standard dictionary methods keys, values and items.
>>> print cont.keys()
['1', '2', '3', '4']
>>> print cont.values()
[<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
>>> print cont.items()
[('1', <0.000, 108.000>), ('2', <72.000, 36.000>), ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]
Although keys returned by the above functions are strings, contingency can be indexed by anything that can be converted into values of the outer variable: strings, numbers or instances of Orange.data.Value.
>>> print cont[0]
<0.000, 108.000>
>>> print cont["1"]
<0.000, 108.000>
>>> print cont[Orange.data.Value(monks.domain["e"], "1")]
The length of the table equals the number of values of the outer variable. However, iterating through contingency does not return keys, as with dictionaries, but distributions.
>>> for i in cont:
...     print i
<0.000, 108.000>
<72.000, 36.000>
<72.000, 36.000>
<72.000, 36.000>
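The indexing behaviour described above (dictionary-style keys, values and items, lookup by index or by value name, and list-style iteration over distributions) can be imitated with a small stand-in class. The sketch below is plain Python; TinyContingency is a hypothetical name and says nothing about how Orange implements its tables.

```python
class TinyContingency(object):
    """Hypothetical stand-in for a table with a discrete outer variable."""

    def __init__(self, outer_values, rows):
        self._outer_values = list(outer_values)   # e.g. ['1', '2', '3', '4']
        self._rows = [list(r) for r in rows]      # one distribution per outer value

    def _position(self, key):
        # Integers are treated as value indices, anything else as a value name.
        if isinstance(key, int):
            return key
        return self._outer_values.index(str(key))

    def __getitem__(self, key):
        return self._rows[self._position(key)]

    def __len__(self):
        return len(self._outer_values)

    def __iter__(self):
        # Unlike a dictionary, iteration yields the distributions, not the keys.
        return iter(self._rows)

    def keys(self):
        return list(self._outer_values)

    def values(self):
        return list(self._rows)

    def items(self):
        return list(zip(self._outer_values, self._rows))


cont = TinyContingency(
    ["1", "2", "3", "4"],
    [[0.0, 108.0], [72.0, 36.0], [72.0, 36.0], [72.0, 36.0]],
)
print(cont[0] == cont["1"])   # the index and the value name reach the same row
print(len(cont))
```

The one deliberate departure from dictionary behaviour is __iter__, which walks the rows in the order of the outer variable's values.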
- class Orange.statistics.contingency.Class
An abstract base class for contingency tables that contain the class, either as the inner or the outer variable.
- classVar (read only)
The class attribute descriptor; always equal to either Table.innerVariable or Table.outerVariable.
- variable
The variable; always equal to either innerVariable or outerVariable.
- add_var_class(variable_value, class_value[, weight=1])
Add an element to contingency by increasing the corresponding count. The difference between this and Table.add is that the variable value is always the first argument and class value the second, regardless of which one is inner and which one is outer.
Parameters: - variable_value (int, float, string or Orange.data.Value) – Variable value
- class_value (int, float, string or Orange.data.Value) – Class value
- weight (float) – Instance weight
- class Orange.statistics.contingency.VarClass
A class derived from Class in which the variable is used as Table.outerVariable and the class as Table.innerVariable. This form is suitable for computing conditional class probabilities given the variable value.
Calling VarClass.add_var_class(v, c) is equivalent to Table.add(v, c). Like Table, VarClass can compute the contingency from instances.
- __init__(feature, class_variable)
Construct an instance of VarClass for the given pair of variables. Inherited from Table.
Parameters: - feature (Orange.feature.Descriptor) – Outer variable
- class_variable (Orange.feature.Descriptor) – Class variable; used as innerVariable
- __init__(feature, data[, weightId])
Compute the contingency table from data.
Parameters: - feature (Orange.feature.Descriptor) – Outer variable
- data (Orange.data.Table) – A set of instances
- weightId (int) – meta attribute with weights of instances
- p_class(value)
Return the probability distribution of classes given the value of the variable.
Parameters: value (int, float, string or Orange.data.Value) – The value of the variable
Return type: Orange.statistics.distribution.Distribution
- p_class(value, class_value)
Returns the conditional probability of the class_value given the feature value, p(class_value|value) (note the order of arguments!)
Parameters: - value (int, float, string or Orange.data.Value) – The value of the variable
- class_value – The class value
Return type: float
import Orange.statistics.contingency
monks = Orange.data.Table("monks-1.tab")
cont = Orange.statistics.contingency.VarClass("e", monks)
print "Inner variable: ", cont.inner_variable.name
print "Outer variable: ", cont.outer_variable.name
print
print "Class variable: ", cont.class_var.name
print "Feature: ", cont.variable.name
print
print "Distributions:"
for val in cont.variable:
    print " p(.|%s) = %s" % (val.native(), cont.p_class(val))
print
first_class = Orange.data.Value(cont.class_var, 1)
first_native = first_class.native()
print "Probabilities of class '%s'" % first_native
for val in cont.variable:
    print " p(%s|%s) = %5.3f" % (first_native, val.native(), cont.p_class(val, first_class))
The inner and the outer variable and their relations to the class are as follows:
Inner variable: y
Outer variable: e
Class variable: y
Feature: e
Distributions are normalized, and probabilities are elements from the normalized distributions. Knowing that the target concept is y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1 has a 100% probability, while for the rest, probability is one third, which agrees with a probability that two three-valued independent features have the same value.
Distributions:
 p(.|1) = <0.000, 1.000>
 p(.|2) = <0.662, 0.338>
 p(.|3) = <0.659, 0.341>
 p(.|4) = <0.669, 0.331>
Probabilities of class '1'
 p(1|1) = 1.000
 p(1|2) = 0.338
 p(1|3) = 0.341
 p(1|4) = 0.331
Distributions from a matrix computed manually:
 p(.|1) = <0.000, 1.000>
 p(.|2) = <0.662, 0.338>
 p(.|3) = <0.659, 0.341>
 p(.|4) = <0.669, 0.331>
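The one-third figure can be checked by enumeration. The sketch below is plain Python rather than Orange code. It assumes that a and b each take the values 1 to 3, that e takes the values 1 to 4, and that every combination occurs exactly once; the remaining features are left out because they do not enter the target concept.

```python
from itertools import product

a_values = b_values = (1, 2, 3)   # two three-valued features
e_values = (1, 2, 3, 4)           # e has four values

# counts[e] = [number of combinations with y = 0, number with y = 1]
counts = dict((e, [0, 0]) for e in e_values)
for a, b, e in product(a_values, b_values, e_values):
    y = int(e == 1 or a == b)     # target concept: y := (e=1) or (a=b)
    counts[e][y] += 1

p_class = {}
for e in e_values:
    total = float(sum(counts[e]))
    p_class[e] = [c / total for c in counts[e]]
    print("p(.|%d) = <%.3f, %.3f>" % (e, p_class[e][0], p_class[e][1]))
```

Under these assumptions class 1 is certain when e equals 1, and for every other value of e its probability is exactly 3/9, the chance that a and b coincide.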
- class Orange.statistics.contingency.ClassVar
ClassVar is similar to VarClass except that the class is the outer variable and the feature is the inner variable. This form of contingency table is suitable for computing conditional probabilities of the variable given the class. All methods take the two arguments in the same order as VarClass.
- __init__(feature, class_variable)
Construct an instance of ClassVar for the given pair of variables. Inherited from Table, except for the reversed order of arguments.
Parameters: - feature (Orange.feature.Descriptor) – Feature; used as innerVariable
- class_variable (Orange.feature.Descriptor) – Class variable; used as outerVariable
- __init__(feature, data[, weightId])
Compute contingency table from the data.
Parameters: - feature (Orange.feature.Descriptor) – Descriptor of the inner variable
- data (Orange.data.Table) – A set of instances
- weightId (int) – meta attribute with weights of instances
- p_attr(class_value)
Return the probability distribution of the variable given the class.
Parameters: class_value (int, float, string or Orange.data.Value) – The class value
Return type: Orange.statistics.distribution.Distribution
- p_attr(value, class_value)
Returns the conditional probability of the value given the class, p(value|class_value).
Parameters: - value (int, float, string or Orange.data.Value) – Value of the variable
- class_value – Class value
Return type: float
import Orange.statistics.contingency
monks = Orange.data.Table("monks-1.tab")
cont = Orange.statistics.contingency.ClassVar("e", monks)
print "Inner variable: ", cont.inner_variable.name
print "Outer variable: ", cont.outer_variable.name
print
print "Class variable: ", cont.class_var.name
print "Attribute: ", cont.variable.name
print
print "Distributions:"
for val in cont.class_var:
    print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))
print
first_value = Orange.data.Value(cont.variable, 0)
first_native = first_value.native()
print "Probabilities for e='%s'" % first_native
for val in cont.class_var:
    print " p(%s|%s) = %5.3f" % (first_native, val.native(), cont.p_attr(first_value, val))
print
cont = Orange.statistics.contingency.ClassVar(monks.domain["e"], monks.domain.class_var)
for ins in monks:
    cont.add_var_class(ins["e"], ins.get_class())
The roles of the feature and the class are reversed compared to VarClass:
Inner variable: e
Outer variable: y
Class variable: y
Attribute: e
Distributions given the class can be printed out by calling p_attr().
for val in cont.class_var:
    print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))
will print:
 p(.|0) = <0.000, 0.333, 0.333, 0.333>
 p(.|1) = <0.500, 0.167, 0.167, 0.167>
If the class value is ‘0’, the attribute e cannot be 1 (the first value), while distribution across other values is uniform. If the class value is 1, e is 1 for exactly half of instances, and distribution of other values is again uniform.
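These distributions follow from transposing the counts of the class given e and normalizing each class row. The sketch below shows this in plain Python, using the absolute frequencies printed for monks-1 at the top of this page; the function p_attr here is a hypothetical helper that only borrows the method's name and is not Orange's implementation.

```python
# Counts of the class (columns 0 and 1) for each value of e (rows).
var_class = {
    "1": [0.0, 108.0],
    "2": [72.0, 36.0],
    "3": [72.0, 36.0],
    "4": [72.0, 36.0],
}
e_values = ["1", "2", "3", "4"]
class_values = ["0", "1"]

# ClassVar view: the outer index is the class, the inner index is the value of e.
class_var = {}
for column, class_value in enumerate(class_values):
    class_var[class_value] = [var_class[e][column] for e in e_values]


def p_attr(class_value):
    """Distribution of e given the class (hypothetical helper)."""
    row = class_var[class_value]
    total = sum(row)
    return [count / total for count in row]


for class_value in class_values:
    row = ", ".join("%.3f" % p for p in p_attr(class_value))
    print("p(.|%s) = <%s>" % (class_value, row))
```

Class 1 covers 216 instances, 108 of which have e equal to 1, which gives the one half; the remaining 108 are split evenly over the other three values.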
- class Orange.statistics.contingency.VarVar
Contingency table in which none of the variables is the class. The class is derived from Table, and adds an additional constructor and method for getting conditional probabilities.
- __init__(outer_variable, inner_variable, data[, weightId])
Compute the contingency from the given instances.
Parameters: - outer_variable (Orange.feature.Descriptor) – Outer variable
- inner_variable (Orange.feature.Descriptor) – Inner variable
- data (Orange.data.Table) – A set of instances
- weightId (int) – meta attribute with weights of instances
- p_attr(outer_value)
Return the probability distribution of the inner variable given the outer variable value.
Parameters: outer_value (int, float, string or Orange.data.Value) – The value of the outer variable
Return type: Orange.statistics.distribution.Distribution
- p_attr(outer_value, inner_value)
Return the conditional probability of the inner_value given the outer_value.
Parameters: - outer_value (int, float, string or Orange.data.Value) – The value of the outer variable
- inner_value (int, float, string or Orange.data.Value) – The value of the inner variable
Return type: float
The following example investigates which material is used for bridges of different lengths.
import Orange
bridges = Orange.data.Table("bridges.tab")
cont = Orange.statistics.contingency.VarVar("SPAN", "MATERIAL", bridges)
print "Distributions:"
for val in cont.outer_variable:
    print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))
print
cont.normalize()
for val in cont.outer_variable:
    print "%s:" % val.native()
    for inval, p in cont[val].items():
        if p:
            print " %s (%i%%)" % (inval, int(100*p+0.5))
    print
Short bridges are mostly wooden or iron, and the longer (and most of the middle sized) are made from steel:
SHORT:
 WOOD (56%)
 IRON (44%)
MEDIUM:
 WOOD (9%)
 IRON (11%)
 STEEL (79%)
LONG:
 STEEL (100%)
Like all other contingency tables, this one can also be computed “manually”.
cont = Orange.statistics.contingency.VarVar(bridges.domain["SPAN"], bridges.domain["MATERIAL"])
for ins in bridges:
    cont.add(ins["SPAN"], ins["MATERIAL"])
print "Distributions from a matrix computed manually:"
for val in cont.outer_variable:
    print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))
print
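What a manual fill amounts to can be sketched in plain Python. ManualTable below is a hypothetical class, not Orange's Table; it assumes that add accumulates the weight into the cell and into both marginal distributions, and that normalize rescales only the rows, in line with the note on normalize. The (span, material) pairs are made up for illustration and are not the bridges data.

```python
from collections import defaultdict


class ManualTable(object):
    """Hypothetical sketch of weighted cell accumulation."""

    def __init__(self):
        self.cells = defaultdict(lambda: defaultdict(float))
        self.outer_distribution = defaultdict(float)
        self.inner_distribution = defaultdict(float)

    def add(self, outer_value, inner_value, weight=1.0):
        # Assumption: the cell and both marginals grow by the same weight.
        self.cells[outer_value][inner_value] += weight
        self.outer_distribution[outer_value] += weight
        self.inner_distribution[inner_value] += weight

    def normalize(self):
        # Only the rows are rescaled; the marginals keep their counts.
        for row in self.cells.values():
            total = sum(row.values())
            if total:
                for inner_value in row:
                    row[inner_value] /= total


# Made-up pairs for illustration only.
pairs = [("SHORT", "WOOD"), ("SHORT", "IRON"), ("MEDIUM", "STEEL"),
         ("MEDIUM", "STEEL"), ("LONG", "STEEL")]
table = ManualTable()
for span, material in pairs:
    table.add(span, material)
table.add("MEDIUM", "IRON", weight=2.0)   # one instance carrying weight 2
table.normalize()
print(dict(table.cells["MEDIUM"]))
print(table.outer_distribution["MEDIUM"])
```

After normalization the MEDIUM row holds probabilities, while outer_distribution still reports the total weight seen for MEDIUM.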
Contingencies for entire domain
A list of contingency tables, either VarClass or ClassVar.
- class Orange.statistics.contingency.Domain
- __init__(data[, weight_id=0, class_is_outer=0|1])
Compute a list of contingency tables.
Parameters: - data (Orange.data.Table) – A set of instances
- weight_id (int) – meta attribute with weights of instances
- class_is_outer (bool) – True, if class is the outer variable
Note
class_is_outer needs to be given as keyword argument.
- class_is_outer (read only)
Tells whether the class is the outer or the inner variable.
- classes
Contains the distribution of class values on the entire dataset.
- normalize()
Call normalize for all contingencies.
The following script prints the contingency tables for features “a”, “b” and “e” for the dataset Monk 1.
dc = Orange.statistics.contingency.Domain(monks)
print "a: ", dc["a"]
print "b: ", dc["b"]
print "c: ", dc["e"]
Contingency tables of type VarClass give the conditional distributions of classes, given the value of the variable.
print "Distributions of feature values given the class value"
dc = Orange.statistics.contingency.Domain(monks, classIsOuter = 1)
print "a: ", dc["a"]
print "b: ", dc["b"]
print "c: ", dc["e"]
print
Contingency tables for continuous variables
If the outer variable is continuous, the index must be one of the values that do exist in the contingency table; other values raise an exception:
import Orange
iris = Orange.data.Table("iris.tab")
cont = Orange.statistics.contingency.VarClass(0, iris)
midkey = (cont.keys()[0] + cont.keys()[1])/2.0
print "cont[%5.3f] =" % midkey, cont[midkey]
Since even rounding can be a problem, the only safe way to get a key is to take it from the contingency’s keys.
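The rounding problem is a general property of exact lookups on floating-point keys. The sketch below uses a plain Python dictionary as a stand-in for a table with a continuous outer variable; the keys and row labels are made up, and picking the nearest stored key is only one possible way of taking a key from the table itself.

```python
# Exact-match lookup on float keys, with a plain dict as a stand-in.
table = {0.3: "row for 0.3", 5.1: "row for 5.1"}

computed = 0.1 + 0.2          # mathematically 0.3, but a different float
print(computed == 0.3)        # False
print(computed in table)      # False, so table[computed] would raise KeyError

# Safe: take the key from the table itself, here the closest stored key.
nearest = min(table.keys(), key=lambda k: abs(k - computed))
print(table[nearest])
```

A key obtained from the table's own keys always matches, whereas one produced by arithmetic may differ in the last bits.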
Contingency tables with a discrete outer variable and a continuous inner variable are more useful, since the methods VarClass.p_class and ClassVar.p_attr use the primitive density estimation provided by Orange.statistics.distribution.Distribution.
For example, ClassVar on the iris dataset can return the probability of the sepal length 5.5 for different classes:
import Orange
iris = Orange.data.Table("iris")
cont = Orange.statistics.contingency.ClassVar("sepal length", iris)
print "Inner variable: ", cont.inner_variable.name
print "Outer variable: ", cont.outer_variable.name
print
print "Class variable: ", cont.class_var.name
print "Attribute: ", cont.variable.name
print
print "Distributions:"
for val in cont.class_var:
    print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))
print
print "Estimated for e=5.5"
for val in cont.class_var:
    print " f(%s|%s) = %5.3f" % (5.5, val.native(), cont.p_attr(5.5, val))
print
cont = Orange.statistics.contingency.ClassVar(iris.domain["sepal length"],
                                              iris.domain.class_var)
for ins in iris:
    cont.add_var_class(ins["sepal length"], ins.get_class())
print "Distributions from a matrix computed manually:"
for val in cont.class_var:
    print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))
print
The script outputs:
Estimated frequencies for e=5.5
f(5.5|Iris-setosa) = 2.000
f(5.5|Iris-versicolor) = 5.000
f(5.5|Iris-virginica) = 1.000