source: orange/Orange/doc/modules/orngCA.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 11.7 KB checked in by anze <anze.staric@…>, 2 years ago

Moved orange to Orange (part 2)

<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>
<h1>orngCA: Orange Correspondence Analysis</h1>

<P>Correspondence analysis is an exploratory technique for analyzing
contingency tables. This module implements correspondence analysis for
two-way frequency cross-tabulation tables.</P>

<P>The module contains one class, <CODE>CA</CODE>, which wraps all the
mathematical functions, and a function, <CODE>input</CODE>, for loading
a contingency table from a file. The class is constructed by passing a
contingency table to the constructor. The contingency table is encoded
as nested Python lists ("list-of-lists") or as one of the numpy types
<CODE>matrix</CODE> and <CODE>array</CODE>. The function
<CODE>input(filename)</CODE> reads the contingency table from a file in
which each row of the table is a line of comma-separated numbers. The
following snippet shows both ways of passing a contingency table to
<CODE>CA</CODE>:</P>

<XMP class="code">>>> import orngCA
>>> data = [[72,    39,    26,    23,     4],
...         [95,    58,    66,    84,    41],
...         [80,    73,    83,     4,    96],
...         [79,    93,    35,    73,    63]]
>>> c = orngCA.CA(data)
>>>
>>> data = orngCA.input('contigencyTable')
>>> c = orngCA.CA(data)
</XMP>
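
<P>The file format described above (one table row per line, values
separated by commas) can be illustrated with a minimal sketch in plain
Python. The helper <CODE>read_contingency_table</CODE> is a hypothetical
stand-in written for this illustration; it is not the module's own
<CODE>input</CODE> function and may differ from it in detail.</P>

```python
import os
import tempfile


def read_contingency_table(filename):
    """Hypothetical reader: one table row per line, comma-separated numbers."""
    table = []
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip empty lines
                table.append([float(value) for value in line.split(",")])
    return table


# Write the table from the snippet above to a temporary file, then read it back.
rows = ["72,39,26,23,4",
        "95,58,66,84,41",
        "80,73,83,4,96",
        "79,93,35,73,63"]
path = os.path.join(tempfile.mkdtemp(), "contingencyTable")
with open(path, "w") as handle:
    handle.write("\n".join(rows) + "\n")

table = read_contingency_table(path)
print(table[0])
```

<P>The resulting list-of-lists has the same shape as the table typed
in by hand, so either form could be passed to the constructor.</P>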

<H2>Class CA</H2>

<H3>Attributes</H3>

<P>The attributes provide access to the contingency table and to the
matrices computed during the analysis.</P>

<dl class="attributes">
<dt>dataMatrix</dt>
<dd>The contingency table as provided by the user.</dd>
<dt>A</dt>
<dd>Principal axes of the column clouds.</dd>
<dt>B</dt>
<dd>Principal axes of the row clouds.</dd>
<dt>D</dt>
<dd>Matrix whose diagonal elements are the singular values of the
decomposition.</dd>
<dt>F</dt><dd>Coordinates of the row profiles with respect to the
principal axes in the matrix <B>B</B>.</dd>
<dt>G</dt><dd>Coordinates of the
column profiles with respect to the principal axes in the matrix <B>A</B>.</dd>
</dl>


<H3>Methods</H3>
<dl class="attributes">
<dt>getA(), getB(), ..., getG()</dt><dd>Return the matrices <B>A</B> to
<B>G</B>, respectively.</dd>
<dt>getPrincipalRowProfilesCoordinates(dim = (0,1))</dt>
<dd>Returns the coordinates of the row profiles with respect to the
principal axes <B>A</B>. Only the coordinates for the dimensions listed
in the tuple <CODE>dim</CODE> are returned. <CODE>dim</CODE> is optional;
if omitted, the first two dimensions are returned.</dd>
<dt>getPrincipalColProfilesCoordinates(dim = (0,1))</dt>
<dd>Returns the coordinates of the column profiles with respect to the
principal axes <B>B</B>. Only the coordinates for the dimensions listed
in the tuple <CODE>dim</CODE> are returned. If <CODE>dim</CODE> is
omitted, the first two dimensions are returned.</dd>
<dt>DecompositionOfInertia(axis = 0)</dt>
<dd>Returns the decomposition of the inertia across the axes. Columns of
the returned matrix represent the contributions of the rows or columns
to the inertia of the axes. If <CODE>axis</CODE> equals 0, the inertia
is decomposed across rows. If <CODE>axis</CODE> equals 1, the inertia is
decomposed across columns. This parameter is optional and defaults to
0.</dd>
<dt>InertiaOfAxis(percentage = 0)</dt>
<dd>Returns a numpy <CODE>array</CODE> whose elements are the inertias of
the axes. If <CODE>percentage = 1</CODE>, the inertia of each axis is
returned as a percentage.</dd>
<dt>ContributionOfPointsToAxis(rowColumn = 0, axis = 0, percentage =
0)</dt><dd>Returns a numpy <CODE>array</CODE> whose elements are the
contributions of the points to the inertia of the axis. The argument
<CODE>rowColumn</CODE> defines whether the calculation is performed for
row points (the default) or column points. The values are expressed as
percentages if <CODE>percentage = 1</CODE>.</dd>
<dt>PointsWithMostInertia(rowColumn = 0, axis = (0, 1))</dt>
<dd>Returns the indices of the row or column points, sorted in
decreasing order of their contribution to the axes given in the tuple
<CODE>axis</CODE>.</dd>
<dt>PlotScreeDiagram()</dt>
<dd>Creates a canvas and plots a scree diagram in it.</dd>
<dt>Biplot(dim = (0, 1))</dt>
<dd>Plots the row points and column points on a 2D canvas. If the
argument is omitted, the first two dimensions are displayed; otherwise
the tuple <CODE>dim</CODE> defines the principal axes.</dd>
</dl>

<h2>Examples of use</h2>
<P>The table below shows the smoking habits of different groups of
employees in a company.</P>


<TABLE WIDTH=476 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=0 CELLSPACING=0>
    <COL WIDTH=78>
    <COL WIDTH=79>
    <COL WIDTH=79>
    <COL WIDTH=79>
    <COL WIDTH=79>
    <COL WIDTH=78>
    <TR>
        <TD WIDTH=78><P><BR></P></TD>
        <TD COLSPAN=4 WIDTH=318><P ALIGN=CENTER>Smoking category</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER><BR></P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>Staff Group</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>(1) None</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>(2) Light</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>(3) Medium</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>(4) Heavy</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>Row Totals</P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>(1) Senior Managers</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>4</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>2</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>3</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>2</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>11</P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>(2) Junior Managers</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>4</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>3</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>7</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>4</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>18</P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>(3) Senior Employees</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>25</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>10</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>12</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>4</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>51</P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>(4) Junior Employees</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>18</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>24</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>33</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>13</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>88</P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>(5) Secretaries</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>10</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>6</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>7</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>2</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>25</P></TD>
    </TR>
    <TR>
        <TD WIDTH=78><P ALIGN=CENTER>Column Totals</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>61</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>45</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>62</P></TD>
        <TD WIDTH=79><P ALIGN=CENTER>25</P></TD>
        <TD WIDTH=78><P ALIGN=CENTER>193</P></TD>
    </TR>
</TABLE>

<P>The 4 column values in each row of the table can be viewed as
coordinates in a 4-dimensional space, and the (Euclidean) distances
between the 5 row points can be computed in that space. These
distances summarize all the information about the similarities between
the rows of the table above. The correspondence analysis module can be
used to find a lower-dimensional space in which the row points are
positioned so that all, or almost all, of the information about the
differences between the rows is retained. The similarities between the
rows (types of employees in this case) can then be presented in a
simple 2-dimensional graph. While this may not appear particularly
useful for small tables like the one shown above, the presentation and
interpretation of very large tables (e.g., differential preference for
10 consumer items among 100 groups of respondents in a consumer
survey) can benefit greatly from the simplification achieved by
correspondence analysis (e.g., representing the 10 consumer items in a
2-dimensional space). The same analysis can be performed on the
columns of the table.</P>

<P>The following lines load the modules and data needed for the
analysis. The analysis is run in the last line.</P>

<XMP class="code">
 1    import orange
 2    from orngCA import CA
 3    from numpy import diag, square
 4    data = [[4, 2, 3, 2],
 5            [4, 3, 7, 4],
 6            [25, 10, 12, 4],
 7            [18, 24, 33, 13],
 8            [10, 6, 7, 2]]
 9   
10    c = CA(data)
</XMP>

<P>After the analysis finishes, the results can be inspected:</P>
<XMP class="code">
11    print "Column profiles:"
12    print c._CA__colProfiles
13    print
14    print "Row profiles:"
15    print c._CA__rowProfiles
16    print

Column profiles:
[[ 0.06557377  0.06557377  0.40983607  0.29508197  0.16393443]
 [ 0.04444444  0.06666667  0.22222222  0.53333333  0.13333333]
 [ 0.0483871   0.11290323  0.19354839  0.53225806  0.11290323]
 [ 0.08        0.16        0.16        0.52        0.08      ]]

Row profiles:
[[ 0.36363636  0.18181818  0.27272727  0.18181818]
 [ 0.22222222  0.16666667  0.38888889  0.22222222]
 [ 0.49019608  0.19607843  0.23529412  0.07843137]
 [ 0.20454545  0.27272727  0.375       0.14772727]
 [ 0.4         0.24        0.28        0.08      ]]

</XMP>
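
<P>The profiles printed above can be reproduced without the module: a
row profile is a table row divided by its row total, and a column
profile is a table column divided by its column total. The sketch below
uses plain Python lists; the variable names are chosen for this
illustration only and are not part of <CODE>orngCA</CODE>.</P>

```python
table = [[4, 2, 3, 2],
         [4, 3, 7, 4],
         [25, 10, 12, 4],
         [18, 24, 33, 13],
         [10, 6, 7, 2]]

# Row profile: each row divided by its row total.
row_profiles = [[value / float(sum(row)) for value in row] for row in table]

# Column profile: each column divided by its column total.
columns = list(zip(*table))
col_profiles = [[value / float(sum(col)) for value in col] for col in columns]

for profile in row_profiles:
    print(" ".join("%.8f" % value for value in profile))
```

<P>Every profile sums to 1, which is why profiles, unlike raw counts,
can be compared between groups of different sizes.</P>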

<P>Points that lie close to each other in the two-dimensional display
are similar with regard to the pattern of relative frequencies across
the columns, i.e. they have similar row profiles. Once the plot is
produced, it shows that Senior Employees and Secretaries are relatively
close together along the first and most important axis. The row
profiles confirm this: the two groups show very similar patterns of
relative frequencies across the categories of smoking intensity.</P>
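
<P>The similarity of the two row profiles can also be checked
numerically. The sketch below computes plain Euclidean distances
between row profiles and looks up the group nearest to Senior
Employees. It is only an illustration: correspondence analysis itself
weights the distances (the chi-square distance), and the names used
here are not part of <CODE>orngCA</CODE>.</P>

```python
from math import sqrt

groups = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
table = [[4, 2, 3, 2],
         [4, 3, 7, 4],
         [25, 10, 12, 4],
         [18, 24, 33, 13],
         [10, 6, 7, 2]]

# Row profiles: counts divided by the row total.
profiles = [[value / float(sum(row)) for value in row] for row in table]


def distance(p, q):
    """Plain (unweighted) Euclidean distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


reference = groups.index("Senior Employees")
others = [i for i in range(len(groups)) if i != reference]
nearest = min(others, key=lambda i: distance(profiles[reference], profiles[i]))
print(groups[nearest])
```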


<P>Lines 17-19 print the singular values, the eigenvalues and the
percentages of inertia explained. These values help decide how many
axes are needed to represent the data. The dimensions are "extracted"
so as to maximize the distances between the row or column points, and
successive dimensions "explain" less and less of the overall
inertia.</P>

<XMP class="code">
17    print "Singular values: " + str(diag(c.D))
18    print "Eigen values: " + str(square(diag(c.D)))
19    print "Percentage of Inertia:" + str(c.PercentageOfInertia())
20    print

Singular values:
[  2.73421115e-01   1.00085866e-01   2.03365208e-02   1.20036007e-16]
Eigen values:
[  7.47591059e-02   1.00171805e-02   4.13574080e-04   1.44086430e-32]
Percentage of Inertia:
[  8.78492893e+01   1.16387938e+01   5.11916964e-01   1.78671526e-29]
</XMP>
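
<P>The first singular value and eigenvalue printed above can be
cross-checked independently of the module. The sketch below assumes the
textbook formulation of correspondence analysis, in which the singular
values come from the matrix of standardized residuals; it does not
reproduce the internals of <CODE>orngCA</CODE>, and all names in it are
illustrative. To stay free of dependencies it finds the largest
eigenvalue by power iteration instead of a full decomposition.</P>

```python
from math import sqrt

table = [[4, 2, 3, 2],
         [4, 3, 7, 4],
         [25, 10, 12, 4],
         [18, 24, 33, 13],
         [10, 6, 7, 2]]

n = float(sum(sum(row) for row in table))
row_mass = [sum(row) / n for row in table]
col_mass = [sum(col) / n for col in zip(*table)]
n_rows, n_cols = len(row_mass), len(col_mass)

# Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j).
S = [[(table[i][j] / n - row_mass[i] * col_mass[j])
      / sqrt(row_mass[i] * col_mass[j])
      for j in range(n_cols)] for i in range(n_rows)]

# Total inertia is the sum of squared residuals (chi-square statistic / n).
total_inertia = sum(s * s for row in S for s in row)

# M = S^T S; its eigenvalues are the squared singular values of S.
M = [[sum(S[i][a] * S[i][b] for i in range(n_rows)) for b in range(n_cols)]
     for a in range(n_cols)]

# Power iteration for the largest eigenvalue of M.
v = [1.0] + [0.0] * (n_cols - 1)
for _ in range(200):
    w = [sum(M[a][b] * v[b] for b in range(n_cols)) for a in range(n_cols)]
    norm = sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Rayleigh quotient of the converged unit vector.
leading_eigenvalue = sum(v[a] * sum(M[a][b] * v[b] for b in range(n_cols))
                         for a in range(n_cols))

print("Total inertia:          %.6f" % total_inertia)
print("Leading eigenvalue:     %.6f" % leading_eigenvalue)
print("Leading singular value: %.6f" % sqrt(leading_eigenvalue))
```

<P>The total inertia computed this way equals the sum of the
eigenvalues, so it is the quantity that the successive axes share
between them.</P>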

<P>Lines 21-22 print the principal row coordinates with respect to
the first two axes, and lines 24-25 show the decomposition of inertia.</P>

<XMP class="code">
21    print "Principal row coordinates:"
22    print c.getPrincipalRowProfilesCoordinates()
23    print
24    print "Decomposition Of Inertia:"
25    print c.DecompositionOfInertia()
</XMP>

<P>The last two statements plot a scree diagram and a biplot. A scree
diagram plots the amount of inertia accounted for by successive
dimensions, i.e. the percentage of inertia against the components,
ordered from largest to smallest. It is usually used to select the
components that contribute most to the inertia: one looks for a change
in slope in the diagram, beyond which the remaining components appear
to be mere debris at the bottom of the slope and are discarded. A
biplot is a plot of the row and column points in a two-dimensional
space.</P>

<XMP class="code">
27    c.PlotScreeDiagram()
</XMP>

<P>
<img src="scree.png">
</P>

<XMP class="code">
28    c.Biplot()
</XMP>

<P>
<img src="biplot.png">
</P>
</body> </html>