<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="load_data.htm">Load-in the data</a>,
Next: <a href="classification.htm">Classification</a>,
Up: <a href="default.htm">On Tutorial 'Orange for Beginners'</a>
</p>

<H1>Basic Data Exploration</H1>
<index name="basic data exploration">

<p>Until now (in <a href="load_data.htm">Loading-in the data</a>) we
have looked only at data files that include solely nominal (discrete)
attributes. Let's make things more interesting now and look at
another file with a mixture of attribute types. We will use the adult
data set from the UCI ML Repository. The prediction task related to this
data set is to determine whether a person characterized by 14
attributes like education, race, occupation, etc., makes over
$50K/year. Because the original set <a
href="../datasets/adult.tab">adult.tab</a> is rather big (32561 data
instances, about 4 MBytes), we will first create a smaller sample of
about 3% of instances and use it in our examples. If you are curious
how we do this, here is the code:</p>

<p class="header"><a href="sample_adult.py">sample_adult.py</a> (uses <a
href="../datasets/adult.tab">adult.tab</a> and generates <a href=
"../datasets/adult_sample.tab">adult_sample.tab</a>)</p>

<xmp class="code">import orange
# load the full adult data set
data = orange.ExampleTable("adult")
# build a 0/1 index vector with about 3% of 0's
selection = orange.MakeRandomIndices2(data, 0.03)
# keep the instances marked with 0 and save them
sample = data.select(selection, 0)
sample.save("adult_sample.tab")
</xmp>

<p>The script above loads the data and prepares a selection vector of
length equal to the number of data instances. The vector contains
0&rsquo;s and 1&rsquo;s, and MakeRandomIndices2 is told that about 3%
of its elements should be 0&rsquo;s. Then, the instances that have a
corresponding 0 in the selection vector are selected and stored in an
object called &ldquo;sample&rdquo;. The sampled data is then saved in
a file. Note that MakeRandomIndices2 performs a stratified selection,
i.e., the class distribution of the original and sampled data should
be nearly the same.</p>
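
<p>If you would like to check for yourself that the selection really is
stratified, a short sketch like the one below (using only constructs from
this lesson; the helper function class_counts is ours, not part of
Orange) compares the class distribution of the original and the sampled
data:</p>

<xmp class="code">import orange

def class_counts(data):
    # count instances of each class value by hand
    c = [0] * len(data.domain.classVar.values)
    for e in data:
        c[int(e.getclass())] += 1
    return c

data = orange.ExampleTable("adult")
selection = orange.MakeRandomIndices2(data, 0.03)
sample = data.select(selection, 0)
print "original:", class_counts(data)
print "sample:  ", class_counts(sample)
</xmp>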

<h2>Basic Data Set Characteristics</h2>

<p>For classification data sets, the basic data characteristics are most
often the number of classes, the number of attributes (and of these, how
many are nominal and how many continuous), information on whether the
data contains missing values, and the class distribution. Below is a
script that reports all of this:</p>
<index name="basic data exploration/attributes">
<index name="basic data exploration/classes">
<index name="basic data exploration/missing values">


<p class="header"><a href=
"data_characteristics.py">data_characteristics.py</a> (uses
<a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">import orange
data = orange.ExampleTable("adult_sample")

# report on number of classes and attributes
print "Classes:", len(data.domain.classVar.values)
print "Attributes:", len(data.domain.attributes), ",",

# count number of continuous and discrete attributes
ncont=0; ndisc=0
for a in data.domain.attributes:
    if a.varType == orange.VarTypes.Discrete:
        ndisc = ndisc + 1
    else:
        ncont = ncont + 1
print ncont, "continuous,", ndisc, "discrete"

# obtain class distribution
c = [0] * len(data.domain.classVar.values)
for e in data:
    c[int(e.getclass())] += 1
print "Instances: ", len(data), "total",
for i in range(len(data.domain.classVar.values)):
    print ",", c[i], "with class", data.domain.classVar.values[i],
print
</xmp>

<p>The first part is the one that we know already: the script imports
the Orange library into Python and loads the data. The information on
the domain (class and attribute names, types, values, etc.) is stored in
data.domain. Information on the class variable is accessible through the
data.domain.classVar object, where data.domain.classVar.values stores
a vector of its values. Its length is obtained using the function
len(). Similarly, the list of attributes is stored in
data.domain.attributes. Notice that to obtain the information on the i-th
attribute, this list can be indexed, e.g.,
data.domain.attributes[i].</p>
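
<p>To get a feel for these objects, you can also poke at the domain from an
interactive Python shell; a minimal sketch (the values printed will of
course depend on the data set):</p>

<xmp class="code">import orange
data = orange.ExampleTable("adult_sample")

# the class variable and the list of its values
print data.domain.classVar.name, data.domain.classVar.values

# name and type of the first attribute
a = data.domain.attributes[0]
print a.name, a.varType == orange.VarTypes.Continuous
</xmp>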

<p>To count the number of continuous and discrete attributes, we
first initialized two counters (ncont, ndisc), and then iterated
through the attributes (the variable a is an iteration variable that is,
in each pass of the loop, associated with a single attribute). The field
varType contains the type of the attribute; for discrete attributes,
varType is equal to orange.VarTypes.Discrete, and for continuous
attributes varType is equal to orange.VarTypes.Continuous.</p>
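
<p>If you prefer a more compact formulation, the same counts can be
obtained with list comprehensions; this is just a sketch of an
alternative, not the way the lesson&rsquo;s script does it:</p>

<xmp class="code"># count discrete attributes, then derive the number of continuous ones
ndisc = len([a for a in data.domain.attributes
             if a.varType == orange.VarTypes.Discrete])
ncont = len(data.domain.attributes) - ndisc
print ncont, "continuous,", ndisc, "discrete"
</xmp>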

<p>To obtain the number of instances for each class, we first
initialized a vector c whose length equals the number of
different classes. Then, we iterated through the data;
<code>e.getclass()</code> returns the class of an instance e, and
int() turns it into a class index (a number in the range from 0 to n-1,
where n is the number of classes) that is used as the index of the
element of c that should be incremented.</p>

<p>Throughout the code, notice that a print statement in Python prints
whatever items follow it on the same line. The items are
separated with commas, and Python will by default put a blank between
them when printing. It will also print a new line, unless the print
statement ends with a comma. It is possible to use the print statement in
Python with formatting directives, just like in C or C++, but this is
beyond the scope of this text.</p>
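
<p>For the curious, here is a tiny illustrative example of both printing
styles (the numbers are made up):</p>

<xmp class="code">n = 977
# comma-separated items: blanks are inserted, the trailing comma suppresses the newline
print "Instances:", n,
print "(done)"
# C-style formatting directives, as used later in this lesson
print "Instances: %d (%.1f%% of the full set)" % (n, 3.0)
</xmp>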

<p>Running the above script, we obtain the following output
(although running from the Command Prompt is shown, you may
equally use PythonWin to load and run the script; see one of our <a
href="load_data.htm">previous lessons</a> on this):</p>

<pre class="code">
> <b>python data_characteristics.py</b>
Classes: 2
Attributes: 14 , 6 continuous, 8 discrete
Instances:  977 total , 236 with class >50K , 741 with class <=50K
>
</pre>

<p>If you would like the class distribution printed as proportions of
each class in the data set, then the last part of the script needs
to be slightly changed. This time, we have used string formatting
with print as well:</p>

<p class="header">part of <a href=
"data_characteristics2.py">data_characteristics2.py</a> (uses
<a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code"># obtain class distribution
c = [0] * len(data.domain.classVar.values)
for e in data:
    c[int(e.getclass())] += 1
print "Instances: ", len(data), "total",
r = [0.] * len(c)
for i in range(len(c)):
    r[i] = c[i]*100./len(data)
for i in range(len(data.domain.classVar.values)):
    print ", %d(%4.1f%s) with class %s" % (c[i], r[i], '%', data.domain.classVar.values[i]),
print
</xmp>

<p>The new script outputs the following information:</p>

<pre class="code">
> <b>python data_characteristics2.py</b>
Classes: 2
Attributes: 14 , 6 continuous, 8 discrete
Instances:  977 total , 236(24.2%) with class >50K , 741(75.8%) with class <=50K
>
</pre>

<p>As it turns out, there are more people that earn less than those
that earn more&hellip; On a more technical side, such information may
be important when you build your classifier; the base error for this
data set is 1-.758 = .242, and your constructed models should be
better than this.</p>
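
<p>As a sanity check, the base (majority-class) error can be computed
directly from the class counts we obtained above; a minimal sketch:</p>

<xmp class="code"># c is the class-count vector computed earlier, e.g. [236, 741] for our sample
n = sum(c)
base_error = 1.0 - max(c) / float(n)
print "base error: %4.3f" % base_error
</xmp>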

<h2>Contingency matrix for nominal and mean for continuous
attributes</h2>

<index name="basic data exploration/class distribution">
<p>Another interesting piece of information that we can obtain from
the data is the distribution of classes for each value of a discrete
attribute, and the mean for continuous attributes (we will leave the
computation of the standard deviation and other statistics to
you). Let&rsquo;s compute the means of continuous attributes first:</p>

<p class="header">part of <a href=
"data_characteristics3.py">data_characteristics3.py</a> (uses
<a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">print "Continuous attributes:"
for a in range(len(data.domain.attributes)):
    if data.domain.attributes[a].varType == orange.VarTypes.Continuous:
        d = 0.; n = 0
        for e in data:
            if not e[a].isSpecial():
                d += e[a]
                n += 1
        print "  %s, mean=%3.2f" % (data.domain.attributes[a].name, d/n)
</xmp>

<p>This script iterates through the attributes (outer for loop), and for
attributes that are continuous (first if statement) computes the sum
over all instances. The single new trick that the script uses is that it
checks whether the instance has a defined attribute value. Namely, for an
instance e and attribute a, e[a].isSpecial() is true if the value is
not defined (unknown). The variable n stores the number of instances with
a defined value of the attribute. For our sampled adult data set, this part
of the code outputs:</p>

<xmp class="code">Continuous attributes:
  age, mean=37.74
  fnlwgt, mean=189344.06
  education-num, mean=9.97
  capital-gain, mean=1219.90
  capital-loss, mean=99.49
  hours-per-week, mean=40.27
</xmp>
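
<p>If you would like to go on and compute the standard deviation that we
left to you above, one possible sketch extends the same loop with a second
pass over the data (assuming data is loaded as before; this is our own
addition, not part of data_characteristics3.py):</p>

<xmp class="code">import math

print "Continuous attributes (with standard deviation):"
for a in range(len(data.domain.attributes)):
    if data.domain.attributes[a].varType == orange.VarTypes.Continuous:
        # first pass: mean over the instances with a defined value
        vals = [float(e[a]) for e in data if not e[a].isSpecial()]
        mean = sum(vals) / len(vals)
        # second pass: average squared deviation from the mean
        var = sum([(v - mean) ** 2 for v in vals]) / len(vals)
        print "  %s, mean=%3.2f, sd=%3.2f" % \
            (data.domain.attributes[a].name, mean, math.sqrt(var))
</xmp>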

<p>For nominal attributes, we could now compose code that computes,
for each attribute, how many times a specific value appears for each
class. Instead, we use the built-in method DomainContingency, which
does just that. All that our script needs to do is print the result
out in a readable form.</p>

<p class="header">part of <a href=
"data_characteristics3.py">data_characteristics3.py</a> (uses
<a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">print "\nNominal attributes (contingency matrix for classes:", data.domain.classVar.values, ")"
cont = orange.DomainContingency(data)
for a in data.domain.attributes:
    if a.varType == orange.VarTypes.Discrete:
        print "  %s:" % a.name
        for v in range(len(a.values)):
            sum = 0
            for cv in cont[a][v]:
                sum += cv
            print "    %s, total %d, %s" % (a.values[v], sum, cont[a][v])
        print
</xmp>

<p>Notice that the first part of this script is similar to the one
that deals with continuous attributes, except that the for
loop is a little bit simpler. With continuous attributes, the
iterator in the loop was an attribute index, whereas in the script
above we iterate through the members of data.domain.attributes, which
are objects that represent attributes. Data structures that may be
addressed in Orange by attribute can most often be addressed either
by attribute index, by attribute name (a string), or by an object that
represents the attribute.</p>
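
<p>For example, the contingencies computed above can be looked up in all
three ways; a minimal sketch (assuming, as we believe is the case for
DomainContingency, that indexing by position, by name and by variable
object are all supported):</p>

<xmp class="code">cont = orange.DomainContingency(data)
a = data.domain.attributes[1]   # some discrete attribute, e.g. workclass

print cont[1]        # by attribute index
print cont[a.name]   # by attribute name (a string)
print cont[a]        # by the object that represents the attribute
</xmp>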

<p>The output of the code above is rather long (some attributes in
this data set have quite large sets of values), so we show
only the output for two attributes:</p>

<p class="header">partial output of running the script <a href=
"data_characteristics3.py">data_characteristics3.py</a> (uses
<a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">Nominal attributes (contingency matrix for classes: <>50K, <=50K> )
  workclass:
    Private, total 729, <170.000, 559.000>
    Self-emp-not-inc, total 62, <19.000, 43.000>
    Self-emp-inc, total 22, <10.000, 12.000>
    Federal-gov, total 27, <10.000, 17.000>
    Local-gov, total 53, <14.000, 39.000>
    State-gov, total 39, <10.000, 29.000>
    Without-pay, total 1, <0.000, 1.000>
    Never-worked, total 0, <0.000, 0.000>

  sex:
    Female, total 330, <28.000, 302.000>
    Male, total 647, <208.000, 439.000>
</xmp>

<p>First, notice that in the vectors the first number refers to the
higher income, and the second number to the lower income (e.g., from
this data it looks like women earn less than men). Notice that
Orange outputs the tuples in the form &ldquo;&lt; tuple-data
&gt;&rdquo;. To change this, we would need another loop that would
iterate through the members of the tuples. You may also foresee that it
would be interesting to compute proportions rather than numbers of
instances in the above contingency matrix, but that we leave for your
exercise.</p>
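
<p>If you do want to try that exercise, one possible sketch normalizes each
row of the contingency by its total (skipping values that never occur); this
is our own variation, not part of the lesson&rsquo;s scripts:</p>

<xmp class="code">cont = orange.DomainContingency(data)
for a in data.domain.attributes:
    if a.varType == orange.VarTypes.Discrete:
        print "  %s:" % a.name
        for v in range(len(a.values)):
            counts = [cv for cv in cont[a][v]]
            total = sum(counts)
            if total > 0:
                props = ["%4.2f" % (cv / total) for cv in counts]
                print "    %s, %s" % (a.values[v], props)
</xmp>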

<h2>Missing Values</h2>
<index name="missing values/statistics">

<p>It is often interesting to see, for a given attribute, what
proportion of the instances have that attribute's value unknown. We have
already learned that the function isSpecial() can be used to
determine whether, for a specific instance and attribute, the value is
not defined. Let us use this function to compute the proportion of missing
values for each attribute:</p>

<p class="header"><a href="report_missing.py">report_missing.py</a> (uses <a
href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">import orange
data = orange.ExampleTable("adult_sample")

natt = len(data.domain.attributes)
missing = [0.] * natt
for i in data:
    for j in range(natt):
        if i[j].isSpecial():
            missing[j] += 1
missing = map(lambda x, l=len(data):x/l*100., missing)

print "Missing values per attribute:"
atts = data.domain.attributes
for i in range(natt):
    print "  %5.1f%s %s" % (missing[i], '%', atts[i].name)
</xmp>

<p>The integer variable natt stores the number of attributes in the data
set. The array missing stores the number of missing values per
attribute; its size is therefore equal to natt, and all of its
elements are initially 0 (in fact, 0.0, since we purposely
initialized it as a real number, which helps later when we
convert the counts to percentages).</p>

<p>The only line that possibly looks (very?) strange is <code>missing = map(lambda x, l=len(data):x/l*100., missing)</code>. This line could be
replaced with a for loop, but we just wanted to have it here to show
how coding in Python may look unusual, yet gain in
conciseness. The function map takes a vector (in our case missing)
and executes a function on every one of its elements, thus obtaining a
new vector. The function it executes is in our case defined inline,
and is called a lambda expression in Python. You can see that our
lambda function takes a single argument (when mapped, an element of
the vector missing) and returns its value divided by the number of
data instances (len(data)) and multiplied by 100, to turn it
into a percentage. Thus, the map function in fact normalizes the
elements of missing to express the proportion of missing values over
the instances of the data set.</p>
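
<p>For comparison, the for loop that the map/lambda line replaces would
look roughly like this:</p>

<xmp class="code"># equivalent to: missing = map(lambda x, l=len(data): x/l*100., missing)
for j in range(natt):
    missing[j] = missing[j] / len(data) * 100.
</xmp>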

<p>Finally, let us see what the script we have just been
working on outputs:</p>

<pre class="code">
>>> <b>python report_missing.py</b>
Missing values per attribute:
    0.0% age
    4.5% workclass
    0.0% fnlwgt
    0.0% education
    0.0% education-num
    0.0% marital-status
    4.5% occupation
    0.0% relationship
    0.0% race
    0.0% sex
    0.0% capital-gain
    0.0% capital-loss
    0.0% hours-per-week
    1.9% native-country
</pre>

<p>In our sampled data set, just three attributes contain
missing values.</p>

<h2>Basic Data Analysis with orange.DomainDistributions</h2>

<p>For some of the tasks above, Orange can provide a shortcut by
means of the orange.DomainDistributions function, which returns an
object that holds averages and mean square errors for continuous
attributes, value frequencies for discrete attributes, and, for both,
the number of instances where the attribute has a missing value.
The use of this object is exemplified in the following script:</p>

<p class="header"><a href=
"data_characteristics4.py">data_characteristics4.py</a> (uses
<a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<xmp class="code">import orange
data = orange.ExampleTable("adult_sample")
dist = orange.DomainDistributions(data)

print "Average values and mean square errors:"
for i in range(len(data.domain.attributes)):
    if data.domain.attributes[i].varType == orange.VarTypes.Continuous:
        print "%s, mean=%5.2f +- %5.2f" % \
            (data.domain.attributes[i].name, dist[i].average(), dist[i].error())

print "\nFrequencies for values of discrete attributes:"
for i in range(len(data.domain.attributes)):
    a = data.domain.attributes[i]
    if a.varType == orange.VarTypes.Discrete:
        print "%s:" % a.name
        for j in range(len(a.values)):
            print "  %s: %d" % (a.values[j], int(dist[i][j]))

print "\nNumber of items where attribute is not defined:"
for i in range(len(data.domain.attributes)):
    a = data.domain.attributes[i]
    print "  %2d %s" % (dist[i].unknowns, a.name)
</xmp>

<p>Check this script out. Its results should match the results we derived with the other scripts in this lesson.</p>


<h2>Where Next?</h2>


<p>This lesson taught some basics of Orange scripting, as well as
(though not really intentionally) some basics of Python programming.
Perhaps the most important part of it was accessing important
pieces of data, like data instances, attribute values of data
instances, the class variable and its properties, the vector of objects
that store information about attributes, etc. What we have shown
here was very much inclined toward the classical machine learning type
of data analysis, where the data is classified and the classes are
nominal. Of course, data may not be like that; it may be labelled with
continuous classes or may not be classified at all. In any case,
the concepts that we have presented may apply to different types of
data sets as well.</p>

<p>From here, your pathway through our tutorial may not follow the
order presented in the list of topics. Instead of learning how to
<a href="classification.htm">build classification models</a>,
you may want to branch off to see how Orange deals with <a href=
"regression.htm">regression</a> or some other tasks. Though, we have
to admit that Orange and its authors are currently highly biased toward predictive data mining and
building classification models, so those sections of this tutorial may
be more elaborate than others.</p>

<hr><br><p class="Path">
Prev: <a href="load_data.htm">Load-in the data</a>,
Next: <a href="classification.htm">Classification</a>,
Up: <a href="default.htm">On Tutorial 'Orange for Beginners'</a>
</p>

</body>
</html>
