source: orange/docs/tutorial/rst/basic-exploration.rst @ 9386:b95da3693f19

Revision 9386:b95da3693f19, 14.6 KB checked in by mitar, 2 years ago (diff)

Renamed tutorial files.

Basic data exploration
======================

.. index:: basic data exploration

Until now we have looked only at data files that contain solely
nominal (discrete) attributes. Let's now make things more interesting
and look at a file with a mixture of attribute types. We will first
use the adult data set from the UCI ML Repository. The prediction
task associated with this data set is to determine whether a person,
characterized by 14 attributes such as education, race, and
occupation, makes over $50K/year. Because the original data set
:download:`adult.tab <code/adult.tab>` is rather big (32561 data
instances, about 4 MBytes), we will first create a smaller sample of
about 3% of the instances and use it in our examples. If you are
curious how we do this, here is the code
(:download:`sample_adult.py <code/sample_adult.py>`)::

   import orange
   data = orange.ExampleTable("adult")
   selection = orange.MakeRandomIndices2(data, 0.03)
   sample = data.select(selection, 0)
   sample.save("adult_sample.tab")

The script loads the data and prepares a selection vector of 0's and
1's whose length equals the number of data instances; it is told that
about 3% of its entries should be 0's. The instances with a
corresponding 0 in the selection vector are then selected and stored
in an object called *sample*, and the sampled data is finally saved
to a file. Note that ``MakeRandomIndices2`` performs a stratified
selection, i.e., the class distributions of the original and sampled
data should be nearly the same.
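
The idea behind the index-making step can be sketched in plain Python.
This is a minimal, non-stratified sketch; the ``make_random_indices2``
helper below is ours, not Orange's::

   import random

   def make_random_indices2(n, p0, seed=42):
       # Build a 0/1 selection vector of length n with approximately
       # a proportion p0 of 0's; unlike Orange's MakeRandomIndices2,
       # this sketch does not stratify by class.
       rng = random.Random(seed)
       return [0 if rng.random() < p0 else 1 for _ in range(n)]

   selection = make_random_indices2(1000, 0.03)
   # positions marked 0 are the ones kept in the sample
   sample = [i for i, s in enumerate(selection) if s == 0]

A stratified version would draw the 0's separately within each class,
so that the sample preserves the class distribution.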

Basic characteristics of data sets
----------------------------------

.. index::
   single: basic data exploration; attributes
.. index::
   single: basic data exploration; classes
.. index::
   single: basic data exploration; missing values

For classification data sets, the basic characteristics most often
reported are the number of classes, the number of attributes (and how
many of these are nominal and how many continuous), whether the data
contains missing values, and the class distribution. Below is a
script that reports all of this (:download:`data_characteristics.py <code/data_characteristics.py>`, :download:`adult_sample.tab <code/adult_sample.tab>`)::

   import orange
   data = orange.ExampleTable("adult_sample")

   # report on number of classes and attributes
   print "Classes:", len(data.domain.classVar.values)
   print "Attributes:", len(data.domain.attributes), ",",

   # count number of continuous and discrete attributes
   ncont=0; ndisc=0
   for a in data.domain.attributes:
       if a.varType == orange.VarTypes.Discrete:
           ndisc = ndisc + 1
       else:
           ncont = ncont + 1
   print ncont, "continuous,", ndisc, "discrete"

   # obtain class distribution
   c = [0] * len(data.domain.classVar.values)
   for e in data:
       c[int(e.getclass())] += 1
   print "Instances: ", len(data), "total",
   for i in range(len(data.domain.classVar.values)):
       print ",", c[i], "with class", data.domain.classVar.values[i],
   print

The first part we know already: the script imports the Orange library
into Python and loads the data. The information on the domain (class
and attribute names, types, values, etc.) is stored in
``data.domain``. Information on the class variable is accessible
through the ``data.domain.classVar`` object, which stores a vector of
the class's values; its length is obtained with the function
``len()``. Similarly, the list of attributes is stored in
``data.domain.attributes``. Note that this list can be indexed to
obtain information on the i-th attribute, e.g.,
``data.domain.attributes[i]``.
83
To count the number of continuous and discrete attributes, we first
initialized two counters (``ncont``, ``ndisc``) and then iterated
through the attributes (``a`` is an iteration variable that in each
pass of the loop is associated with a single attribute). The field
``varType`` contains the type of the attribute: for discrete
attributes ``varType`` equals ``orange.VarTypes.Discrete``, and for
continuous ones it equals ``orange.VarTypes.Continuous``.
91
To obtain the number of instances for each class, we first
initialized a vector ``c`` whose length equals the number of
different classes. Then we iterated through the data:
``e.getclass()`` returns the class of an instance ``e``, and casting
it to an integer turns it into a class index (a number in the range
from 0 to n-1, where n is the number of classes) that is used to
select the element of ``c`` to increment.
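
The counting idiom itself can be tried without Orange; in this sketch
a toy list of class labels stands in for the data, and
``values.index`` plays the role of ``int(e.getclass())``::

   # Class values as in the adult data, followed by a toy label list.
   values = [">50K", "<=50K"]
   labels = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]

   # One counter per class; each label maps to its index in `values`.
   c = [0] * len(values)
   for lab in labels:
       c[values.index(lab)] += 1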
99
Throughout the code, notice that a print statement in Python prints
the items that follow it on the line. The items are separated by
commas, and Python will by default put a blank between them when
printing. It will also print a newline, unless the print statement
ends with a comma. It is also possible to use the print statement
with formatting directives, just like in C or C++, but this is beyond
the scope of this text.

Running the above script, we obtain the following output::

   Classes: 2
   Attributes: 14 , 6 continuous, 8 discrete
   Instances:  977 total , 236 with class >50K , 741 with class <=50K

If you would like the class distributions printed as proportions of
each class in the data set, the last part of the script needs to be
changed slightly. This time we have also used string formatting with
print (part of :download:`data_characteristics2.py <code/data_characteristics2.py>`)::

   # obtain class distribution
   c = [0] * len(data.domain.classVar.values)
   for e in data:
       c[int(e.getclass())] += 1
   print "Instances: ", len(data), "total",
   r = [0.] * len(c)
   for i in range(len(c)):
       r[i] = c[i]*100./len(data)
   for i in range(len(data.domain.classVar.values)):
       print ", %d(%4.1f%s) with class %s" % (c[i], r[i], '%', data.domain.classVar.values[i]),
   print

The new script outputs the following information::

   Classes: 2
   Attributes: 14 , 6 continuous, 8 discrete
   Instances:  977 total , 236(24.2%) with class >50K , 741(75.8%) with class <=50K

As it turns out, more people earn less than earn more... On a more
technical side, such information may be important when you build your
classifier: the base error for this data set is 1-.758 = .242, and
your constructed models should at least be better than this.
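
The base (majority-class) error follows directly from the class
counts reported by the script; a quick check in plain Python::

   # Class counts from the sampled adult data: >50K and <=50K.
   counts = [236, 741]
   total = sum(counts)

   # A classifier that always predicts the majority class errs
   # exactly on the minority instances.
   baseline_error = 1.0 - max(counts) / float(total)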

Contingency matrix
------------------

.. index::
   single: basic data exploration; class distribution

Another interesting piece of information we can obtain from the data
is the distribution of classes for each value of a discrete
attribute, and the mean for each continuous attribute (we leave the
computation of standard deviations and other statistics to you).
Let's compute the means of the continuous attributes first (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::

   print "Continuous attributes:"
   for a in range(len(data.domain.attributes)):
       if data.domain.attributes[a].varType == orange.VarTypes.Continuous:
           d = 0.; n = 0
           for e in data:
               if not e[a].isSpecial():
                   d += e[a]
                   n += 1
           print "  %s, mean=%3.2f" % (data.domain.attributes[a].name, d/n)

This script iterates through the attributes (outer for loop) and, for
those that are continuous (the if statement), computes a sum over all
instances. The single new trick the script uses is checking whether
an instance has a defined attribute value: for instance ``e`` and
attribute ``a``, ``e[a].isSpecial()`` is true if the value is not
defined (unknown). The variable ``n`` stores the number of instances
with a defined value for the attribute. For our sampled adult data
set, this part of the code outputs::

   Continuous attributes:
     age, mean=37.74
     fnlwgt, mean=189344.06
     education-num, mean=9.97
     capital-gain, mean=1219.90
     capital-loss, mean=99.49
     hours-per-week, mean=40.27

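
The skip-the-unknowns pattern is easy to isolate. In this sketch
``None`` stands in for Orange's special (undefined) values, and the
``mean_known`` helper is ours::

   def mean_known(column):
       # Mean over defined values only; None plays the role that
       # isSpecial() detects for Orange instances.
       d = 0.0
       n = 0
       for v in column:
           if v is not None:
               d += v
               n += 1
       return d / n if n else None

   ages = [25, None, 40, 55, None, 40]
   m = mean_known(ages)  # (25 + 40 + 55 + 40) / 4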
For nominal attributes, we could now compose code that computes, for
each attribute, how many times a specific value appears with each
class. Instead, we use the built-in method ``DomainContingency``,
which does just that. All that our script then has to do is print it
out in a readable form (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::

   print "\nNominal attributes (contingency matrix for classes:", data.domain.classVar.values, ")"
   cont = orange.DomainContingency(data)
   for a in data.domain.attributes:
       if a.varType == orange.VarTypes.Discrete:
           print "  %s:" % a.name
           for v in range(len(a.values)):
               sum = 0
               for cv in cont[a][v]:
                   sum += cv
               print "    %s, total %d, %s" % (a.values[v], sum, cont[a][v])
           print

Notice that the first part of this script is similar to the one
dealing with continuous attributes, except that the for loop is a
little simpler. With continuous attributes, the loop iterated over
attribute indices, whereas the script above iterates through the
members of ``data.domain.attributes``, which are objects that
represent attributes. Data structures in Orange that are addressed by
attribute can most often be addressed by attribute index, by
attribute name (a string), or by an object that represents the
attribute.
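
The three addressing styles can be imitated in plain Python; the
``Attr`` class and ``find_attr`` helper below are hypothetical
stand-ins for Orange's attribute objects and lookup, not its API::

   class Attr(object):
       def __init__(self, name):
           self.name = name

   attributes = [Attr("age"), Attr("workclass"), Attr("sex")]

   def find_attr(attrs, key):
       # Accept an integer index, an attribute name, or the object.
       if isinstance(key, int):
           return attrs[key]
       if isinstance(key, str):
           return next(a for a in attrs if a.name == key)
       return key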
208
The output of the code above is rather long (this data set has some
attributes with rather large sets of values), so we show only the
output for two attributes::

   Nominal attributes (contingency matrix for classes: <>50K, <=50K> )
     workclass:
       Private, total 729, <170.000, 559.000>
       Self-emp-not-inc, total 62, <19.000, 43.000>
       Self-emp-inc, total 22, <10.000, 12.000>
       Federal-gov, total 27, <10.000, 17.000>
       Local-gov, total 53, <14.000, 39.000>
       State-gov, total 39, <10.000, 29.000>
       Without-pay, total 1, <0.000, 1.000>
       Never-worked, total 0, <0.000, 0.000>

     sex:
       Female, total 330, <28.000, 302.000>
       Male, total 647, <208.000, 439.000>

First, notice that in these vectors the first number refers to the
higher income and the second to the lower income (e.g., from this
data it appears that women earn less than men). Also notice that
Orange outputs the tuples; to change this, we would need another loop
that iterates through the members of each tuple. You may also foresee
that it would be interesting to compute proportions rather than
numbers of instances in the contingency matrix above, but we leave
that as an exercise.
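
The counting that ``DomainContingency`` performs can be sketched with
a plain dictionary; here toy (value, class) pairs stand in for the
data::

   classes = [">50K", "<=50K"]
   rows = [("Male", ">50K"), ("Male", "<=50K"), ("Female", "<=50K"),
           ("Male", "<=50K"), ("Female", "<=50K")]

   # Map each attribute value to one count per class.
   cont = {}
   for value, cls in rows:
       counts = cont.setdefault(value, [0] * len(classes))
       counts[classes.index(cls)] += 1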

Missing values
--------------

.. index::
   single: missing values; statistics

It is often interesting to see, for a given attribute, what
proportion of the instances have that attribute's value unknown. We
have already learned that the function ``isSpecial()`` can be used to
determine whether, for a specific instance and attribute, the value
is not defined. Let us use this function to compute the proportion of
missing values for each attribute (:download:`report_missing.py <code/report_missing.py>`, uses :download:`adult_sample.tab <code/adult_sample.tab>`)::

   import orange
   data = orange.ExampleTable("adult_sample")

   natt = len(data.domain.attributes)
   missing = [0.] * natt
   for i in data:
       for j in range(natt):
           if i[j].isSpecial():
               missing[j] += 1
   missing = map(lambda x, l=len(data):x/l*100., missing)

   print "Missing values per attribute:"
   atts = data.domain.attributes
   for i in range(natt):
       print "  %5.1f%s %s" % (missing[i], '%', atts[i].name)

The integer variable ``natt`` stores the number of attributes in the
data set. The array ``missing`` stores the number of missing values
per attribute; its size is therefore equal to ``natt``, and all of
its elements are initially 0 (in fact, 0.0, since we purposely made
it a real number, which helps later when we convert the counts to
percentages).
271
The only line that may look strange is ``missing = map(lambda x,
l=len(data):x/l*100., missing)``. This line could be replaced with a
for loop, but we wanted to show how Python code can look unusual yet
gain in compactness. The function ``map`` takes a vector (in our case
``missing``) and applies a function to each of its elements, thus
obtaining a new vector. The function it applies is defined inline and
is called a lambda expression in Python. Our lambda function takes a
single argument (when mapped, an element of the vector ``missing``)
and returns its value divided by the number of data instances
(``len(data)``) and multiplied by 100, to turn it into a
percentage. The ``map`` call thus normalizes the elements of
``missing`` to express the proportion of missing values over the
instances of the data set.
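
The equivalence is easy to check on toy counts; both forms below
divide each count by the number of instances and multiply by 100::

   missing = [0, 44, 0, 19]  # toy missing-value counts per attribute
   n = 977                   # number of instances

   by_map = list(map(lambda x: x / float(n) * 100.0, missing))
   by_comprehension = [x / float(n) * 100.0 for x in missing]

A list comprehension is usually considered the more readable of the
two in modern Python.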

Finally, let us see what the script we have been working on outputs::

   Missing values per attribute:
       0.0% age
       4.5% workclass
       0.0% fnlwgt
       0.0% education
       0.0% education-num
       0.0% marital-status
       4.5% occupation
       0.0% relationship
       0.0% race
       0.0% sex
       0.0% capital-gain
       0.0% capital-loss
       0.0% hours-per-week
       1.9% native-country

In our sampled data set, just three attributes contain missing
values.

Distributions of feature values
-------------------------------

For some of the tasks above, Orange provides a shortcut by means of
the ``orange.DomainDistributions`` function, which returns an object
that holds averages and mean square errors for continuous attributes,
value frequencies for discrete attributes, and, for both, the number
of instances where the attribute's value is missing. The use of this
object is exemplified in the following script (:download:`data_characteristics4.py <code/data_characteristics4.py>`,
uses :download:`adult_sample.tab <code/adult_sample.tab>`)::

   import orange
   data = orange.ExampleTable("adult_sample")
   dist = orange.DomainDistributions(data)

   print "Average values and mean square errors:"
   for i in range(len(data.domain.attributes)):
       if data.domain.attributes[i].varType == orange.VarTypes.Continuous:
           print "%s, mean=%5.2f +- %5.2f" % \
               (data.domain.attributes[i].name, dist[i].average(), dist[i].error())

   print "\nFrequencies for values of discrete attributes:"
   for i in range(len(data.domain.attributes)):
       a = data.domain.attributes[i]
       if a.varType == orange.VarTypes.Discrete:
           print "%s:" % a.name
           for j in range(len(a.values)):
               print "  %s: %d" % (a.values[j], int(dist[i][j]))

   print "\nNumber of items where attribute is not defined:"
   for i in range(len(data.domain.attributes)):
       a = data.domain.attributes[i]
       print "  %2d %s" % (dist[i].unknowns, a.name)

Check this script out. Its results should match the results we
derived with the other scripts in this lesson.
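
What ``DomainDistributions`` gathers can also be imitated per column
in plain Python; the ``column_summary`` helper below is a
hypothetical sketch, with ``None`` marking missing entries::

   def column_summary(column):
       # Average for numeric columns, value frequencies for nominal
       # ones, plus a count of missing (None) entries in either case.
       known = [v for v in column if v is not None]
       summary = {"unknowns": len(column) - len(known)}
       if known and all(isinstance(v, (int, float)) for v in known):
           summary["average"] = sum(known) / float(len(known))
       else:
           freq = {}
           for v in known:
               freq[v] = freq.get(v, 0) + 1
           summary["frequencies"] = freq
       return summary

   age = column_summary([25, 40, None, 55])
   sex = column_summary(["Male", "Female", "Male", None])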