source: orange/orange/Orange/projection/correspondence.py @ 9349:fa13a2c52fcd

Revision 9349:fa13a2c52fcd, 22.2 KB, checked in by mitar, 2 years ago

Changed way of linking to code in documentation.

"""\
********************************************
Correspondence Analysis (``correspondence``)
********************************************

Correspondence analysis is a descriptive/exploratory technique designed
to analyze simple two-way and multi-way tables containing some measure
of correspondence between the rows and columns. It does this by mapping
the rows and the columns into two sets of factor scores, respectively,
while preserving the similarity structure of the table's rows and
columns.

It is similar in nature to PCA, but is used for the analysis of
qualitative (categorical) data, whereas PCA is suited to quantitative
data.

.. autoclass:: CA
    :members:
    :exclude-members: A, B, D, F, G

Example
-------

The data table given below represents the smoking habits of different
employees in a company (computed from :download:`smokers_ct.tab <code/smokers_ct.tab>`).

    ================  ====  =====  ======  =====  ==========
                           Smoking category
    ----------------  --------------------------  ----------
    Staff group       None  Light  Medium  Heavy  Row totals
    ================  ====  =====  ======  =====  ==========
    Senior managers   4     2      3       2      11
    Junior managers   4     3      7       4      18
    Senior employees  25    10     12      4      51
    Junior employees  18    24     33      13     88
    Secretaries       10    6      7       2      25
    Column totals     61    45     62      25     193
    ================  ====  =====  ======  =====  ==========

The four column values in each row of the table can be viewed as
coordinates in a four-dimensional space, and the (Euclidean) distances
between the five row points in that space can be computed. These
distances summarize all of the information about the similarities
between the rows of the table above. Correspondence analysis can be
used to find a lower-dimensional space in which the row points are
positioned in a manner that retains all, or almost all, of the
information about the differences between the rows. All of the
information about the similarities between the rows (types of employees
in this case) can then be presented in a simple two-dimensional graph.
While this may not appear particularly useful for small tables like the
one shown above, the presentation and interpretation of very large
tables (e.g., the differential preference for 10 consumer items among
100 groups of respondents in a consumer survey) can greatly benefit
from the simplification achieved via correspondence analysis (e.g.,
representing the 10 consumer items in a two-dimensional space). The
same analysis can be performed on the columns of the table.
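
For illustration, these pairwise Euclidean distances between the row
points can be computed directly with plain numpy; this is a small
standalone sketch (independent of the module itself), using the smoking
counts from the table above::

    import numpy

    counts = numpy.array([[4, 2, 3, 2],
                          [4, 3, 7, 4],
                          [25, 10, 12, 4],
                          [18, 24, 33, 13],
                          [10, 6, 7, 2]], dtype=float)

    # distances[i, j] is the Euclidean distance between row points i and j
    diffs = counts[:, None, :] - counts[None, :, :]
    distances = numpy.sqrt((diffs ** 2).sum(axis=-1))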

So let's load the data, compute the contingency table and do the analysis
(:download:`correspondence.py <code/correspondence.py>`, uses :download:`smokers_ct.tab <code/smokers_ct.tab>`)::

    import Orange
    from Orange.projection import correspondence
    from Orange.statistics import contingency

    data = Orange.data.Table("smokers_ct.tab")
    staff = data.domain["Staff group"]
    smoking = data.domain["Smoking category"]

    # Compute the contingency table
    cont = contingency.VarVar(staff, smoking, data)

    c = correspondence.CA(cont, staff.values, smoking.values)

    print "Row profiles"
    print c.row_profiles()
    print
    print "Column profiles"
    print c.column_profiles()

which produces matrices of relative frequencies (normalized across rows
and columns respectively)::

    Row profiles
    [[ 0.36363636  0.18181818  0.27272727  0.18181818]
     [ 0.22222222  0.16666667  0.38888889  0.22222222]
     [ 0.49019608  0.19607843  0.23529412  0.07843137]
     [ 0.20454545  0.27272727  0.375       0.14772727]
     [ 0.4         0.24        0.28        0.08      ]]

    Column profiles
    [[ 0.06557377  0.06557377  0.40983607  0.29508197  0.16393443]
     [ 0.04444444  0.06666667  0.22222222  0.53333333  0.13333333]
     [ 0.0483871   0.11290323  0.19354839  0.53225806  0.11290323]
     [ 0.08        0.16        0.16        0.52        0.08      ]]

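A row profile is simply a row of the contingency table divided by its
row total; a minimal numpy sketch (standalone, using the counts from
the table above) reproduces the row profiles shown above::

    import numpy

    counts = numpy.array([[4, 2, 3, 2],
                          [4, 3, 7, 4],
                          [25, 10, 12, 4],
                          [18, 24, 33, 13],
                          [10, 6, 7, 2]], dtype=float)

    # divide each row by its total; each resulting profile sums to one
    row_profiles = counts / counts.sum(axis=1, keepdims=True)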
The points in the two-dimensional correspondence analysis display that
are close to each other are similar with regard to the pattern of
relative frequencies across the columns, i.e. they have similar row
profiles. After producing the plot it can be noticed that, along the
most important first axis, the Senior employees and the Secretaries lie
relatively close together. This can also be seen by examining the row
profiles: these two groups of employees show very similar patterns of
relative frequencies across the categories of smoking intensity. ::

    c.plot_biplot()

.. image:: files/correspondence-biplot.png

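This closeness can also be quantified without plotting. A quick
standalone check (using the smoking counts from the table above) shows
that the profile of the Secretaries lies much nearer to that of the
Senior employees than to, say, that of the Junior managers::

    import numpy

    counts = numpy.array([[4, 2, 3, 2],
                          [4, 3, 7, 4],
                          [25, 10, 12, 4],
                          [18, 24, 33, 13],
                          [10, 6, 7, 2]], dtype=float)
    profiles = counts / counts.sum(axis=1, keepdims=True)

    # rows: 0 senior managers, 1 junior managers, 2 senior employees,
    #       3 junior employees, 4 secretaries
    near = numpy.linalg.norm(profiles[4] - profiles[2])  # secretaries vs. senior employees
    far = numpy.linalg.norm(profiles[4] - profiles[1])   # secretaries vs. junior managers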
The following lines print the singular values, the eigenvalues and the
percentages of inertia explained. These are important values for
deciding how many axes are needed to represent the data. The dimensions
are "extracted" so as to maximize the distances between the row or
column points, and successive dimensions "explain" less and less of the
overall inertia. ::

    import numpy

    print "Singular values:"
    print numpy.diag(c.D)
    print "Eigen values:"
    print numpy.square(numpy.diag(c.D))
    print "Percentage of inertia:"
    print c.inertia_of_axes() / numpy.sum(c.inertia_of_axes()) * 100.0
    print

which outputs::

    Singular values:
    [  2.73421115e-01   1.00085866e-01   2.03365208e-02   1.20036007e-16]
    Eigen values:
    [  7.47591059e-02   1.00171805e-02   4.13574080e-04   1.44086430e-32]
    Percentage of inertia:
    [  8.78492893e+01   1.16387938e+01   5.11916964e-01   1.78671526e-29]

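The relationship between these quantities can be verified directly:
each eigenvalue is the square of the corresponding singular value, and
the percentages of inertia are the eigenvalues rescaled to sum to 100.
A small standalone sketch, using the first three singular values
printed above::

    import numpy

    singular_values = numpy.array([2.73421115e-01, 1.00085866e-01, 2.03365208e-02])
    eigenvalues = singular_values ** 2
    percentages = eigenvalues / eigenvalues.sum() * 100.0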
The principal row coordinates with respect to the first two axes, and
the decomposition of inertia, are printed with::

    print "Principal row coordinates:"
    print c.row_factors()
    print
    print "Decomposition of inertia:"
    print c.column_inertia()

Let us also plot the scree diagram. A scree diagram plots the amount of
inertia accounted for by successive dimensions, i.e. the percentage of
inertia against the components, in order of magnitude from largest to
smallest. It is usually used to select the components that contribute
most of the inertia: one looks for the place where the slope of the
plot levels off, and the remaining factors, which appear as mere debris
at the bottom of the slope, are discarded. ::

    c.plot_scree_diagram()

.. image:: files/correspondence-scree-plot.png

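In practice the scree diagram is often read together with the
cumulative inertia. A small standalone sketch of one such selection
rule, keeping the smallest number of leading axes whose cumulative
percentage of inertia exceeds a chosen threshold (here 95%, with the
percentages printed above)::

    import numpy

    percentages = numpy.array([87.8492893, 11.6387938, 0.511916964])
    cumulative = numpy.cumsum(percentages)

    # smallest number of leading axes covering >= 95% of the inertia
    n_axes = int(numpy.searchsorted(cumulative, 95.0) + 1)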
Multiple Correspondence Analysis
================================

Utility Functions
-----------------

.. autofunction:: burt_table

"""

import numpy
import numpy.linalg
import orange
import operator

from Orange.misc import deprecation_warning
from Orange.statistics import contingency


def input(filename):
    """ Loads a contingency matrix from a file. The expected format is
    a simple tab delimited matrix of numerical values.

    :param filename: Filename
    :type filename: str

    """
    table = [[], []]
    try:
        file = open(filename)
        try:
            table = file.readlines()
        finally:
            file.close()
        table = [map(int, el.split()) for el in table]
    except IOError:
        pass
    return table


class CA(object):
    """ Main class used for the computation of correspondence analysis.
    """
    def __init__(self, contingency_table, row_labels=[], col_labels=[]):
        """ Initialize the CA instance.

        :param contingency_table: A contingency table (can be
            :class:`Orange.statistics.contingency.Table` or
            :class:`numpy.ndarray` or list-of-lists)

        :param row_labels: An optional list of row labels
        :type row_labels: list
        :param col_labels: An optional list of column labels
        :type col_labels: list

        """
        # Calculating correspondence analysis from the data matrix;
        # the algorithm described in the book (put reference) is used.

        self.row_labels = row_labels
        self.col_labels = col_labels

        if isinstance(contingency_table, contingency.Table):
            contingency_table = numpy.asarray([[e for e in row] for row in contingency_table])

        self.__data_matrix = numpy.matrix(contingency_table, float)
        sum_elem = numpy.sum(self.__data_matrix)

        # __corr is the matrix of relative frequencies of the elements
        # in the data matrix
        self.__corr = self.__data_matrix / sum_elem
        self.__col_sums = numpy.sum(self.__corr, 0)
        self.__row_sums = numpy.sum(self.__corr, 1)

        self.__col_profiles = numpy.matrix(numpy.diag((1. / numpy.array(self.__col_sums))[0])) * self.__corr.T
        self.__row_profiles = numpy.matrix(numpy.diag((1. / numpy.array(self.__row_sums).reshape(1, -1))[0])) * self.__corr

        self.__a, self.__d, self.__b = self.__calculate_svd()

        self.__f = numpy.diag((1. / self.__row_sums).reshape(1, -1).tolist()[0]) * self.__a * self.__d
        self.__g = numpy.diag((1. / self.__col_sums).tolist()[0]) * self.__b * self.__d.T

    def __calculate_svd(self):
        """ Compute the generalized SVD.

        This function calculates the decomposition A = N * D_mi * M',
        where N' * diag(row_sums)^-1 * N = I and
        M' * diag(col_sums)^-1 * M = I. The decomposition is calculated
        in four steps:

            i)   B = diag(row_sums)^-1/2 * A * diag(col_sums)^-1/2
            ii)  find the ordinary SVD of B: B = U * D * V'
            iii) N = diag(row_sums)^1/2 * U
                 M = diag(col_sums)^1/2 * V
                 D_mi = D
            iv)  A = N * D_mi * M'

        Returns (N, D_mi, M).

        """
        a = self.__corr - self.__row_sums * self.__col_sums
        b = numpy.diag(numpy.sqrt((1. / self.__row_sums).reshape(1, -1).tolist()[0])) * a * numpy.diag(numpy.sqrt((1. / self.__col_sums).tolist()[0]))
        u, d, v = numpy.linalg.svd(b, 0)
        N = numpy.diag(numpy.sqrt(self.__row_sums.reshape(1, -1).tolist()[0])) * u
        M = numpy.diag(numpy.sqrt(self.__col_sums.tolist()[0])) * numpy.transpose(v)
        d = numpy.diag(d.tolist())

        return (N, d, M)

    @property
    def data_matrix(self):
        """ The :obj:`numpy.matrix` representation of the input
        contingency table.
        """
        return self.__data_matrix

    @property
    def A(self):
        """ The columns of A define the principal axes of the column
        clouds (same as column_principal_axes).
        """
        return self.__a

    @property
    def D(self):
        """ The elements on the diagonal of D are the singular values.
        """
        return self.__d

    @property
    def B(self):
        """ The columns of B define the principal axes of the row
        clouds (same as row_principal_axes).
        """
        return self.__b

    @property
    def F(self):
        """ Coordinates of the row profiles with respect to the
        principal axes B (same as row_factors).
        """
        return self.__f

    @property
    def G(self):
        """ Coordinates of the column profiles with respect to the
        principal axes A (same as column_factors).
        """
        return self.__g

    @property
    def row_principal_axes(self):
        """ A :obj:`numpy.matrix` of the principal axes (in columns)
        of the row points.
        """
        return self.__b

    @property
    def column_principal_axes(self):
        """ A :obj:`numpy.matrix` of the principal axes (in columns)
        of the column points.
        """
        return self.__a

    def row_factors(self):
        """ Return a :obj:`numpy.matrix` of factor scores (coordinates)
        for the rows of the input matrix.
        """
        return self.__f

    def column_factors(self):
        """ Return a :obj:`numpy.matrix` of factor scores (coordinates)
        for the columns of the input matrix.
        """
        return self.__g

    def row_profiles(self):
        """ Return a :obj:`numpy.matrix` of row profiles, i.e. the rows
        of the ``data_matrix`` normalized by the row sums.
        """
        return self.__row_profiles

    def column_profiles(self):
        """ Return a :obj:`numpy.matrix` of column profiles, i.e. the
        columns of the ``data_matrix`` normalized by the column sums
        (one profile per row of the returned matrix).
        """
        return self.__col_profiles

    def row_inertia(self):
        """ Return the contribution of the rows to the inertia across
        the principal axes.
        """
        return numpy.multiply(self.__row_sums, numpy.multiply(self.__f, self.__f))

    def column_inertia(self):
        """ Return the contribution of the columns to the inertia
        across the principal axes.
        """
        return numpy.multiply(numpy.transpose(self.__col_sums), numpy.multiply(self.__g, self.__g))

    def inertia_of_axes(self):
        """ Return a :obj:`numpy.ndarray` whose elements are the
        inertias of the principal axes.
        """
        return numpy.array(numpy.sum(self.column_inertia(), 0).tolist()[0])

    def ordered_row_indices(self, axes=None):
        """ Return the indices of the rows with the most inertia. If
        ``axes`` is given, only the inertia in those principal axes is
        taken into account.

        :param axes: Axes to take into account.
        """
        inertia = self.row_inertia()
        if axes:
            inertia = inertia[:, axes]
        indices = numpy.ravel(numpy.argsort(numpy.sum(inertia, 1), 0))
        return list(reversed(indices))

    def ordered_column_indices(self, axes=None):
        """ Return the indices of the columns with the most inertia. If
        ``axes`` is given, only the inertia in those principal axes is
        taken into account.

        :param axes: Axes to take into account.
        """
        inertia = self.column_inertia()
        if axes:
            inertia = inertia[:, axes]

        indices = numpy.ravel(numpy.argsort(numpy.sum(inertia, 1), 0))
        return list(reversed(indices))

    def plot_scree_diagram(self):
        """ Plot a scree diagram of the inertia.
        """
        import pylab
        inertia = self.inertia_of_axes()
        inertia *= 100.0 / numpy.sum(inertia)

        pylab.plot(range(1, min(self.__data_matrix.shape) + 1), inertia)
        pylab.axis([0, min(self.__data_matrix.shape) + 1, 0, 100])
        pylab.show()

    def plot_biplot(self, axes=(0, 1)):
        """ Plot a biplot of the row and column points in the two
        principal axes given by ``axes``.
        """
        import pylab
        if len(axes) != 2:
            raise ValueError("Axes tuple must be of length two")
        rows = self.row_factors()[:, axes]
        columns = self.column_factors()[:, axes]
        pylab.plot(numpy.ravel(rows[:, 0]), numpy.ravel(rows[:, 1]), "ro")
        pylab.plot(numpy.ravel(columns[:, 0]), numpy.ravel(columns[:, 1]), "bs")

        if self.row_labels:
            for i, coord in enumerate(rows):
                x, y = coord.T
                pylab.text(x, y, self.row_labels[i], horizontalalignment='center')
        if self.col_labels:
            for i, coord in enumerate(columns):
                x, y = coord.T
                pylab.text(x, y, self.col_labels[i], horizontalalignment='center')
        pylab.grid()
        pylab.show()


    """\
    Deprecated interface. Do not use.

    """
#    def getPrincipalRowProfilesCoordinates(self, dim = (0, 1)):
#       """ Return principal coordinates of row profiles with respect
#       to principal axis B.
#
#       :param dim: Defines which principal axes should be taken into account.
#       :type dim: list
#
#       """
#       deprecation_warning("getPrincipalRowProfilesCoordinates", "row_factors")
#       if len(dim) == 0:
#           raise ValueError("Dim tuple cannot be of length zero")
#       return numpy.array(numpy.take(self.__f, dim, 1))
#
#    def getPrincipalColProfilesCoordinates(self, dim = (0, 1)):
#       """ Return principal coordinates of column profiles with respect
#       to principal axes A.
#
#       :param dim: Defines which principal axes should be taken into account.
#       :type dim: list
#
#       """
#       deprecation_warning("getPrincipalColProfilesCoordinates", "column_factors")
#       if len(dim) == 0:
#           raise ValueError("dim tuple cannot be of length zero")
#       return numpy.array(numpy.take(self.__g, dim, 1))
#
#    def getStandardRowCoordinates(self):
#        deprecation_warning("getStandardRowCoordinates", "standard_row_factors")
#        dinv = numpy.where(self.__d != 0, 1. / self.__d, 0)
#        return numpy.matrixmultiply(self.__f, numpy.transpose(dinv))
#
#    def getStandardColCoordinates(self):
#        deprecation_warning("getStandardColCoordinates", "standard_column_factors")
#        dinv = numpy.where(self.__d != 0, 1. / self.__d, 0)
#        return numpy.matrixmultiply(self.__g, dinv)
#
#    def DecompositionOfInertia(self, rows=False):
#        """ Returns decomposition of the inertia across the principal axes.
#        Columns of this matrix represent contribution of the rows or columns
#        to the inertia of axes. If ``rows`` is True inertia is
#        decomposed across rows, across columns if False.
#
#        """
#        if rows:
#            deprecation_warning("DecompositionOfInertia", "row_inertia")
#            return numpy.multiply(self.__row_sums, numpy.multiply(self.__f, self.__f))
#        else:
#            deprecation_warning("DecompositionOfInertia", "column_inertia")
#            return numpy.multiply(numpy.transpose(self.__col_sums), numpy.multiply(self.__g, self.__g))
#
#    def InertiaOfAxis(self, percentage = False):
#        """ Return :obj:`numpy.ndarray` whose elements are inertias of axes.
#        If ``percentage`` is True percentages of inertias of each axis are returned.
#
#        """
#        deprecation_warning("InertiaOfAxis", "inertia_of_axes")
#        inertias = numpy.array(numpy.sum(self.DecompositionOfInertia(), 0).tolist()[0])
#        if percentage:
#            return inertias / numpy.sum(inertias) * 100
#        else:
#            return inertias
#
#    def ContributionOfPointsToAxis(self, rowColumn = 0, axis = 0, percentage = 0):
#        """ Returns :obj:`numpy.ndarray` whose elements are contributions
#        of points to the inertia of axis. Argument ``rowColumn`` defines if
#        the calculation will be performed for row (default action) or
#        column points. The values can be represented in percentages if
#        percentage = 1.
#
#        """
#        deprecation_warning("ContributionOfPointsToAxis", "nothing")
#        contribution = numpy.array(numpy.transpose(self.DecompositionOfInertia(rowColumn)[:, axis]).tolist()[0])
#        if percentage:
#            return contribution / numpy.sum(contribution) * 100
#        else:
#            return contribution
#
#    def PointsWithMostInertia(self, rowColumn = 0, axis = (0, 1)):
#        """ Returns indices of row or column points sorted in decreasing
#        value of their contribution to axes defined in a tuple axis.
#
#        """
#        deprecation_warning("PointsWithMostInertia", "ordered_row_indices")
#        contribution = self.ContributionOfPointsToAxis(rowColumn = rowColumn, axis = axis[0], percentage = 0) + \
#                        self.ContributionOfPointsToAxis(rowColumn = rowColumn, axis = axis[1], percentage = 0)
#        tmp = zip(range(len(contribution)), contribution)
#
#        tmp.sort(lambda x, y: cmp(x[1], y[1]))
#
#        a = [i for (i, v) in tmp]
#        a.reverse()
#        return a


def burt_table(data, attributes):
    """ Construct a Burt table (all-values cross-tabulation) from data
    for the given attributes.

    Return an ordered list of (attribute, value) pairs and a
    :obj:`numpy.ndarray` with the tabulations.

    :param data: Data table.
    :type data: :class:`Orange.data.Table`

    :param attributes: List of attributes (must be discrete).
    :type attributes: list

    Example ::

        >>> data = Orange.data.Table("smokers_ct")
        >>> items, counts = burt_table(data, [data.domain["Staff group"], data.domain["Smoking category"]])

    """
    values = [(attr, value) for attr in attributes for value in attr.values]
    table = numpy.zeros((len(values), len(values)))
    counts = [len(attr.values) for attr in attributes]
    offsets = [sum(counts[: i]) for i in range(len(attributes))]
    for i in range(len(attributes)):
        for j in range(i + 1):
            attr1 = attributes[i]
            attr2 = attributes[j]

            cm = contingency.VarVar(attr1, attr2, data)
            cm = numpy.array([list(row) for row in cm])

            start1, end1 = offsets[i], offsets[i] + counts[i]
            start2, end2 = offsets[j], offsets[j] + counts[j]

            table[start1: end1, start2: end2] += cm
            if i != j:  # also fill the upper part
                table[start2: end2, start1: end1] += cm.T

    return values, table

if __name__ == '__main__':
    a = numpy.random.random_integers(0, 100, 100).reshape(10, -1)
    c = CA(a)
    c.plot_scree_diagram()
    c.plot_biplot()

    # Author punctuation counts (from 'Correspondence Analysis' by
    # Herve Abdi and Lynne J. Williams)
    data = [[7836,   13112,  6026 ],
            [53655,  102383, 42413],
            [115615, 184541, 59226],
            [161926, 340479, 62754],
            [38177,  105101, 12670],
            [46371,  58367,  14299]]

    c = CA(data, ["Rousseau", "Chateaubriand", "Hugo", "Zola", "Proust", "Giraudoux"],
                 ["period", "comma", "other"])
    c.plot_scree_diagram()
    c.plot_biplot()

    import Orange
    data = Orange.data.Table("../../doc/datasets/smokers_ct")
    staff = data.domain["Staff group"]
    smoking = data.domain["Smoking category"]
    cont = contingency.VarVar(staff, smoking, data)

    c = CA(cont, staff.values, smoking.values)
    c.plot_scree_diagram()
    c.plot_biplot()