orngCA: Orange Correspondence Analysis
Correspondence analysis is an exploratory technique applied to the analysis of contingency tables. The module implements correspondence analysis for two-way frequency cross-tabulation tables.
The module contains one class, CA, which wraps all the mathematical functions, and a function input for loading a contingency table from a file. The class can be constructed by providing a contingency table as a parameter to the constructor. The contingency table is encoded either as Python nested lists (a "list-of-lists") or using the numpy types matrix and array. The class also includes a method input(filename) that reads the contingency table from a file in which each row of the table is represented by a line of comma-separated numbers. Different means of passing the contingency table to a correspondence analysis method are illustrated in the following snippet:
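As a minimal sketch of these encodings, assuming NumPy is available, the block below builds the same small table (the first two rows of the smoking data used later in this document) as a nested list and as a numpy array, and reads it back from a comma-separated file. The helper `read_contingency_table` is hypothetical and only mimics the file format described above; it is not the module's own `input` routine, and passing the result to the CA constructor is left out.

```python
import os
import tempfile

import numpy as np

# The same table of counts in two of the encodings mentioned above.
as_lists = [[4, 2, 3, 2], [4, 3, 7, 4]]    # nested "list-of-lists"
as_array = np.array(as_lists)              # numpy array


def read_contingency_table(filename):
    """Hypothetical helper: parse one comma-separated row of counts per line."""
    with open(filename) as handle:
        return [[float(cell) for cell in line.split(",")]
                for line in handle if line.strip()]


# Write the table to a temporary file in the described format and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("4,2,3,2\n4,3,7,4\n")
from_file = read_contingency_table(tmp.name)
os.remove(tmp.name)

# All three encodings describe the same table.
print(np.array_equal(as_array, np.array(from_file)))
```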
Class CA
Attributes
The attributes provide access to the contingency table and various matrices created in the analysis process.
- dataMatrix
- A contingency table as provided by the user.
- A
- Principal axes of the column clouds.
- B
- Principal axes of the row clouds.
- D
- Matrix whose diagonal elements are singular values of the decomposition.
- F
- Coordinates of the row profiles with respect to principal axes in the matrix B.
- G
- Coordinates of the column profiles with respect to principal axes in the matrix A.
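How matrices of this kind can arise is easiest to see in code. The sketch below is a textbook formulation of correspondence analysis through a singular value decomposition, written with NumPy. It is not the module's source: the variable names are chosen for illustration, and the assumption that `singular_values`, `row_coords` and `col_coords` play the roles of D, F and G (including their scaling) is not confirmed by the module.

```python
import numpy as np

# Counts from the smoking example (rows: staff groups, columns: smoking
# categories), consistent with the row and column totals of that table.
table = np.array([[4, 2, 3, 2],
                  [4, 3, 7, 4],
                  [25, 10, 12, 4],
                  [18, 24, 33, 13],
                  [10, 6, 7, 2]], dtype=float)

P = table / table.sum()            # correspondence matrix (relative frequencies)
r = P.sum(axis=1)                  # row masses
c = P.sum(axis=0)                  # column masses

# Standardized residuals from the independence model r * c^T.
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# The singular values play the role of the diagonal of D (assumption).
U, singular_values, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the row and column profiles (roles of F and G).
row_coords = (U * singular_values) / np.sqrt(r)[:, None]
col_coords = (Vt.T * singular_values) / np.sqrt(c)[:, None]

print(row_coords[:, :2])   # first two dimensions of the row points
print(col_coords[:, :2])   # first two dimensions of the column points
```

In this formulation the mass-weighted sum of squared row coordinates equals the sum of squared singular values, which is the total inertia of the table.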
Methods
- getA(), getB(), ..., getG()
- Return the matrices A to G, respectively.
- getPrincipalRowProfilesCoordinates(dim = (0,1))
- Returns coordinates of the row profiles with respect to principal axes A. Only the coordinates listed in the tuple dim are returned. dim is optional; if omitted, the first two dimensions are returned.
- getPrincipalColProfilesCoordinates(dim = (0,1))
- Returns coordinates of the column profiles with respect to principal axes B. Only the coordinates listed in the tuple dim are returned. If dim is omitted, the first two dimensions are returned.
- DecompositionOfInertia(axis = 0)
- Returns the decomposition of the inertia across the axes. Columns of this matrix represent the contributions of the rows or columns to the inertia of each axis. If axis equals 0, inertia is decomposed across rows. If axis equals 1, inertia is decomposed across columns. This parameter is optional and defaults to 0.
- InertiaOfAxis(percentage = 0)
- Returns a numpy array whose elements are the inertias of the axes. If percentage = 1, the percentage of inertia of each axis is returned instead.
- ContributionOfPointsToAxis(rowColumn = 0, axis = 0, percentage = 0)
- Returns a numpy array whose elements are the contributions of points to the inertia of axis. The argument rowColumn defines whether the calculation is performed for row points (the default) or column points. The values are expressed as percentages if percentage = 1.
- PointsWithMostInertia(rowColumn = 0, axis = (0, 1))
- Returns indices of row or column points sorted by decreasing contribution to the axes given in the tuple axis.
- PlotScreeDiagram()
- Creates a canvas and plots a scree diagram in it.
- Biplot(dim = (0, 1))
- Plots row points and column points on a 2D canvas. If arguments are omitted, the first two dimensions are displayed; otherwise the tuple dim defines the principal axes.
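The quantities behind the contribution and ranking methods can be sketched with NumPy as follows. This uses the standard definition (mass times squared principal coordinate, divided by the inertia of the axis) on the smoking data from the example section. Whether the module computes them in exactly this way is an assumption, as is the choice to rank points by the summed inertia they contribute to the selected axes. All names are illustrative.

```python
import numpy as np

# Smoking data (rows: staff groups, columns: smoking categories).
table = np.array([[4, 2, 3, 2],
                  [4, 3, 7, 4],
                  [25, 10, 12, 4],
                  [18, 24, 33, 13],
                  [10, 6, 7, 2]], dtype=float)

P = table / table.sum()
r = P.sum(axis=1)                                  # row masses
c = P.sum(axis=0)                                  # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

n_axes = min(table.shape) - 1                      # non-trivial axes (3 here)
sv = sv[:n_axes]
F = (U[:, :n_axes] * sv) / np.sqrt(r)[:, None]     # row principal coordinates

inertia = sv ** 2                                  # inertia of each axis
# Share of each axis' inertia due to each row point; every column sums to 1.
row_contrib = r[:, None] * F ** 2 / inertia

# Rank rows by the inertia they contribute to the chosen axes together
# (summing over axes is an assumption about how to combine them).
axes = (0, 1)
absolute = (r[:, None] * F ** 2)[:, list(axes)].sum(axis=1)
ranking = np.argsort(-absolute)

print(np.round(100 * row_contrib, 1))              # contributions in percent
print(ranking)                                     # row indices, largest first
```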
Examples of use
The data table below shows the smoking habits of different employees in a company.
Smoking category (columns) by staff group (rows):

| Staff Group | (1) None | (2) Light | (3) Medium | (4) Heavy | Row Totals |
|---|---|---|---|---|---|
| (1) Senior Managers | 4 | 2 | 3 | 2 | 11 |
| (2) Junior Managers | 4 | 3 | 7 | 4 | 18 |
| (3) Senior Employees | 25 | 10 | 12 | 4 | 51 |
| (4) Junior Employees | 18 | 24 | 33 | 13 | 88 |
| (5) Secretaries | 10 | 6 | 7 | 2 | 25 |
| Column Totals | 61 | 45 | 62 | 25 | 193 |
The 4 column values in each row of the table can be viewed as coordinates in a 4-dimensional space, and the (Euclidean) distances can be computed between the 5 row points in that space. These distances summarize all information about the similarities between the rows in the table above. The correspondence analysis module can be used to find a lower-dimensional space in which the row points are positioned so as to retain all, or almost all, of the information about the differences between the rows. The information about the similarities between the rows (types of employees in this case) can then be presented in a simple 2-dimensional graph. While this may not appear particularly useful for small tables like the one shown above, the presentation and interpretation of very large tables (e.g., differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) can greatly benefit from the simplification achieved by correspondence analysis (e.g., representing the 10 consumer items in a 2-dimensional space). The same analysis can be performed on the columns of the table.
The following lines load the modules and data needed for the analysis. The analysis is started in the last line.
After the analysis finishes, the results can be inspected:
Points in the two-dimensional display that are close to each other are similar with regard to the pattern of relative frequencies across the columns, i.e. they have similar row profiles. In the resulting plot, the Senior Employees and Secretaries lie relatively close together along the first and most important axis. This can also be seen by examining the row profiles: these two groups of employees show very similar patterns of relative frequencies across the categories of smoking intensity.
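The similarity of row profiles can be checked directly. The following NumPy sketch, with names chosen for illustration, divides each row by its total and compares plain Euclidean distances between profiles. Correspondence analysis itself weights these differences by the column masses, so this is only a rough check.

```python
import numpy as np

groups = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
table = np.array([[4, 2, 3, 2],
                  [4, 3, 7, 4],
                  [25, 10, 12, 4],
                  [18, 24, 33, 13],
                  [10, 6, 7, 2]], dtype=float)

# Row profiles: relative frequencies of the smoking categories within each group.
profiles = table / table.sum(axis=1, keepdims=True)

for name, profile in zip(groups, profiles):
    print("%-17s" % name, np.round(profile, 3))


def distance(i, j):
    """Euclidean distance between the profiles of rows i and j."""
    return float(np.linalg.norm(profiles[i] - profiles[j]))


print(distance(2, 4))   # Senior Employees vs Secretaries
print(distance(2, 3))   # Senior Employees vs Junior Employees
```

The first distance comes out smaller than the second, in line with the two groups lying close together in the plot.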
Lines 17-19 print out the singular values, eigenvalues, and percentages of inertia explained. These values are important for deciding how many axes are needed to represent the data. The dimensions are "extracted" to maximize the distances between the row or column points, and successive dimensions "explain" less and less of the overall inertia.
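The relation between these three quantities can be sketched with NumPy, independently of the module: the eigenvalues (inertias of the axes) are the squared singular values of the standardized residual matrix, and the percentages are their shares of the total inertia. The variable names are illustrative, and the computation is the textbook one rather than the module's confirmed implementation.

```python
import numpy as np

table = np.array([[4, 2, 3, 2],
                  [4, 3, 7, 4],
                  [25, 10, 12, 4],
                  [18, 24, 33, 13],
                  [10, 6, 7, 2]], dtype=float)

P = table / table.sum()
r = P.sum(axis=1)                       # row masses
c = P.sum(axis=0)                       # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Keep the min(rows, columns) - 1 non-trivial axes; the last one is numerically zero.
n_axes = min(table.shape) - 1
singular_values = np.linalg.svd(S, compute_uv=False)[:n_axes]
eigenvalues = singular_values ** 2
percentages = 100 * eigenvalues / eigenvalues.sum()

print("singular values:", np.round(singular_values, 4))
print("eigenvalues:    ", np.round(eigenvalues, 4))
print("percentages:    ", np.round(percentages, 2))
print("total inertia:  ", round(float(eigenvalues.sum()), 4))
```

The total inertia computed this way equals the chi-square statistic of the table divided by the grand total of 193.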
Lines 21-22 print out the principal row coordinates with respect to the first two axes, and lines 24-25 show the decomposition of inertia.
The last two statements plot a scree diagram and a biplot. A scree diagram is a plot of the amount of inertia accounted for by successive dimensions, i.e. the percentage of inertia plotted against the components in order of magnitude from largest to smallest. It is usually used to identify the components that contribute most to the inertia: one looks for a change in slope in the diagram, keeps the components before it, and discards the remaining ones, which appear simply as debris at the bottom of the slope. A biplot is a plot of row and column points in two-dimensional space.
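To make the idea of a scree diagram concrete without a plotting canvas, the sketch below prints a text version for the smoking data: one bar per axis, with length proportional to its percentage of inertia. This only illustrates what the diagram shows and is not how PlotScreeDiagram() draws it. The bar scale of one character per two percent is an arbitrary choice.

```python
import numpy as np

table = np.array([[4, 2, 3, 2],
                  [4, 3, 7, 4],
                  [25, 10, 12, 4],
                  [18, 24, 33, 13],
                  [10, 6, 7, 2]], dtype=float)

P = table / table.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Inertia of each non-trivial axis, as a percentage of the total.
n_axes = min(table.shape) - 1
inertia = np.linalg.svd(S, compute_uv=False)[:n_axes] ** 2
percentages = 100 * inertia / inertia.sum()

# One line per axis, largest first; '#' marks roughly two percent of inertia.
for k, share in enumerate(percentages, start=1):
    bar = "#" * int(round(share / 2))
    print("axis %d %6.2f%% %s" % (k, share, bar))
```

A sharp drop after the first bar or two is the change in slope that suggests how many axes to keep.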
