.. source: orange/docs/tutorial/rst/basic-exploration.rst @ 9994:1073e0304a87
   Revision 9994:1073e0304a87, 14.5 KB, checked in by Matija Polajnar, 2 years ago

Basic data exploration
======================

.. index:: basic data exploration

Until now we have looked only at data files that include solely
nominal (discrete) attributes. Let's make things more interesting now
and look at a file with a mixture of attribute types. We will first
use the adult data set from the UCI ML Repository. The prediction task
associated with this data set is to determine whether a person
characterized by 14 attributes like education, race, occupation, etc.,
makes over $50K/year. Because the original data set :download:`adult.tab <code/adult.tab>` is
rather big (32561 data instances, about 4 MBytes), we will first
create a smaller sample of about 3% of the instances and use it in our
examples. If you are curious how we do this, here is the code
(:download:`sample_adult.py <code/sample_adult.py>`)::

   import orange
   data = orange.ExampleTable("adult")
   selection = orange.MakeRandomIndices2(data, 0.03)
   sample = data.select(selection, 0)
   sample.save("adult_sample.tab")

The script above loads the data and prepares a selection vector of
length equal to the number of data instances, consisting of 0's and
1's, with about 3% of the entries being 0. Then, the instances that
have a corresponding 0 in the selection vector are selected and stored
in an object called *sample*. The sampled data is then saved to a
file. Note that ``MakeRandomIndices2`` performs a stratified selection,
i.e., the class distributions of the original and the sampled data
should be nearly the same.
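To see what stratification means in practice, here is a minimal sketch of stratified sampling in plain Python, without Orange: instances are grouped by class label and the same proportion is drawn from each group. The data and function names are made up for illustration.

```python
import random

def stratified_sample(instances, proportion, seed=42):
    """Sample roughly `proportion` of the instances while keeping the
    class distribution close to that of the original data."""
    random.seed(seed)
    by_class = {}
    for inst in instances:
        # the last element of each instance is its class label
        by_class.setdefault(inst[-1], []).append(inst)
    sample = []
    for label, group in by_class.items():
        k = max(1, int(round(len(group) * proportion)))
        sample.extend(random.sample(group, k))
    return sample

# 25% of instances in one class, 75% in the other, as a toy example
data = [("a", ">50K")] * 25 + [("b", "<=50K")] * 75
sample = stratified_sample(data, 0.2)
# the 25/75 class ratio is preserved exactly: 5 and 15 instances
```

Because each class is sampled separately, the class proportions in the sample match the original by construction, which is what ``MakeRandomIndices2`` guarantees approximately.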

Basic characteristics of data sets
----------------------------------

.. index::
   single: basic data exploration; attributes
.. index::
   single: basic data exploration; classes
.. index::
   single: basic data exploration; missing values

For classification data sets, the basic characteristics most often
reported are the number of classes, the number of attributes (and how
many of these are nominal and how many continuous), whether the data
contains missing values, and the class distribution. Below is a script
that reports all of this (:download:`data_characteristics.py <code/data_characteristics.py>`, :download:`adult_sample.tab <code/adult_sample.tab>`)::

   import orange
   data = orange.ExampleTable("adult_sample")

   # report on number of classes and attributes
   print "Classes:", len(data.domain.classVar.values)
   print "Attributes:", len(data.domain.attributes), ",",

   # count number of continuous and discrete attributes
   ncont = 0; ndisc = 0
   for a in data.domain.attributes:
       if a.varType == orange.VarTypes.Discrete:
           ndisc = ndisc + 1
       else:
           ncont = ncont + 1
   print ncont, "continuous,", ndisc, "discrete"

   # obtain class distribution
   c = [0] * len(data.domain.classVar.values)
   for e in data:
       c[int(e.getclass())] += 1
   print "Instances: ", len(data), "total",
   for i in range(len(data.domain.classVar.values)):
       print ",", c[i], "with class", data.domain.classVar.values[i],

The first part we know already: the script imports the Orange library
into Python and loads the data. The information on the domain (class
and attribute names, types, values, etc.) is stored in
``data.domain``. Information on the class variable is accessible
through the ``data.domain.classVar`` object, which stores a vector of
the class's values. Its length is obtained using the function
``len()``. Similarly, the list of attributes is stored in
``data.domain.attributes``. Notice that to obtain the information on
the i-th attribute, this list can be indexed, e.g.,
``data.domain.attributes[i]``.

To count the number of continuous and discrete attributes, we first
initialize two counters (``ncont``, ``ndisc``), and then iterate
through the attributes (variable ``a`` is an iteration variable that is
in each loop associated with a single attribute). The field ``varType``
contains the type of the attribute; for discrete attributes,
``varType`` is equal to ``orange.VarTypes.Discrete``, and for
continuous attributes ``varType`` is equal to
``orange.VarTypes.Continuous``.

To obtain the number of instances for each class, we first initialize
a vector ``c`` of length equal to the number of different classes.
Then, we iterate through the data; ``e.getclass()`` returns the class
of an instance ``e``, and ``int()`` turns it into a class index (a
number in the range from 0 to n-1, where n is the number of classes)
that is used as the index of the element of ``c`` to be incremented.
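The counting scheme above can be tried in plain Python on its own, with the class indices already given as integers (standing in for ``int(e.getclass())``); this is a standalone illustration, not Orange code.

```python
# Count instances per class, where each instance's class is an
# integer index from 0 to n-1.
class_indices = [0, 1, 1, 0, 1, 1, 1]   # e.g. int(e.getclass()) for each e
n_classes = 2

c = [0] * n_classes
for idx in class_indices:
    c[idx] += 1          # increment the counter for this instance's class
# c is now [2, 5]: two instances of class 0, five of class 1
```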

Throughout the code, notice that a print statement in Python prints
whatever items follow it on the line. The items are separated with
commas, and Python will by default put a blank between them when
printing. It will also print a new line, unless the print statement
ends with a comma. It is also possible to use the print statement with
formatting directives, just like in C or C++, but this is beyond the
scope of this text.

Running the above script, we obtain the following output::

   Classes: 2
   Attributes: 14 , 6 continuous, 8 discrete
   Instances:  977 total , 236 with class >50K , 741 with class <=50K

If you would like the class distribution printed as proportions of
each class in the data set, then the last part of the script needs to
be slightly changed. This time, we have also used string formatting
with print (part of :download:`data_characteristics2.py <code/data_characteristics2.py>`)::

   # obtain class distribution
   c = [0] * len(data.domain.classVar.values)
   for e in data:
       c[int(e.getclass())] += 1
   print "Instances: ", len(data), "total",
   r = [0.] * len(c)
   for i in range(len(c)):
       r[i] = c[i]*100./len(data)
   for i in range(len(data.domain.classVar.values)):
       print ", %d(%4.1f%s) with class %s" % (c[i], r[i], '%', data.domain.classVar.values[i]),

The new script outputs the following information::

   Classes: 2
   Attributes: 14 , 6 continuous, 8 discrete
   Instances:  977 total , 236(24.2%) with class >50K , 741(75.8%) with class <=50K

As it turns out, there are more people that earn less than those that
earn more... On a more technical side, such information may be
important when you build your classifier: the baseline error for this
data set is 1 - 0.758 = 0.242, and any model you construct should do
better than this.
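The majority-class baseline can be computed directly from the class counts; a quick sketch of the arithmetic, using the counts from the sampled data above:

```python
# Baseline (majority-class) error: a classifier that always predicts
# the most frequent class is wrong on exactly the remaining instances.
counts = {">50K": 236, "<=50K": 741}       # class counts from the sample
total = sum(counts.values())               # 977 instances
majority = max(counts.values())            # 741

baseline_error = 1.0 - majority / float(total)
# about 0.242; a useful model must err less often than this
```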

Contingency matrix
------------------

.. index::
   single: basic data exploration; class distribution

Another interesting piece of information that we can obtain from the
data is the distribution of classes for each value of a discrete
attribute, and the means of continuous attributes (we will leave the
computation of standard deviations and other statistics to you). Let's
compute the means of the continuous attributes first (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::

   print "Continuous attributes:"
   for a in range(len(data.domain.attributes)):
       if data.domain.attributes[a].varType == orange.VarTypes.Continuous:
           d = 0.; n = 0
           for e in data:
               if not e[a].isSpecial():
                   d += e[a]
                   n += 1
           print "  %s, mean=%3.2f" % (data.domain.attributes[a].name, d/n)

This script iterates through the attributes (the outer for loop), and
for attributes that are continuous (the first if statement) computes a
sum over all instances. The one new trick the script uses is checking
whether an instance has a defined attribute value: for an instance
``e`` and attribute ``a``, ``e[a].isSpecial()`` is true if the value is
not defined (unknown). The variable ``n`` stores the number of
instances with a defined value for the attribute. For our sampled adult
data set, this part of the code outputs::

   Continuous attributes:
     age, mean=37.74
     fnlwgt, mean=189344.06
     education-num, mean=9.97
     capital-gain, mean=1219.90
     capital-loss, mean=99.49
     hours-per-week, mean=40.27
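The same missing-value-aware mean can be sketched in plain Python, with ``None`` playing the role of an undefined value (the ``isSpecial()`` check); the data here is made up for illustration.

```python
def mean_ignoring_missing(values):
    """Mean of a column where None marks a missing value, mirroring
    the isSpecial() check in the Orange script above."""
    total = 0.0
    n = 0
    for v in values:
        if v is not None:        # skip undefined (missing) entries
            total += v
            n += 1
    return total / n if n else None

ages = [39, None, 50, 38, None, 53]
m = mean_ignoring_missing(ages)   # (39 + 50 + 38 + 53) / 4 = 45.0
```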

For the nominal attributes, we could now compose code that computes,
for each attribute, how many times each specific value appears for each
class. Instead, we use the built-in method ``DomainContingency``, which
does just that. All that our script does, mainly, is print it out in a
readable form (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::

   print "\nNominal attributes (contingency matrix for classes:", data.domain.classVar.values, ")"
   cont = orange.DomainContingency(data)
   for a in data.domain.attributes:
       if a.varType == orange.VarTypes.Discrete:
           print "  %s:" % a.name
           for v in range(len(a.values)):
               sum = 0
               for cv in cont[a][v]:
                   sum += cv
               print "    %s, total %d, %s" % (a.values[v], sum, cont[a][v])

Notice that the first part of this script is similar to the one
dealing with continuous attributes, except that the for loop is a
little simpler. With continuous attributes, the iterator in the loop
was an attribute index, whereas in the script above we iterate through
the members of ``data.domain.attributes``, which are objects that
represent attributes. Data structures in Orange that may be addressed
by attribute can most often be addressed by attribute index, by
attribute name (a string), or by an object that represents the
attribute.
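The essence of what ``DomainContingency`` computes for one attribute can be sketched in plain Python with a dictionary of value-by-class counts; this standalone illustration uses made-up (value, class) pairs.

```python
# Count (attribute value, class) pairs for one attribute -- the core
# of a contingency matrix.
pairs = [("Male", ">50K"), ("Male", "<=50K"), ("Female", "<=50K"),
         ("Male", "<=50K"), ("Female", "<=50K"), ("Male", ">50K")]

contingency = {}
for value, cls in pairs:
    row = contingency.setdefault(value, {})   # one row per attribute value
    row[cls] = row.get(cls, 0) + 1            # one column per class
# contingency == {"Male": {">50K": 2, "<=50K": 2}, "Female": {"<=50K": 2}}
```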

The output of the code above is rather long (this data set has some
attributes with rather large sets of values), so we show only the
output for two attributes::

   Nominal attributes (contingency matrix for classes: <>50K, <=50K> )
     workclass:
       Private, total 729, <170.000, 559.000>
       Self-emp-not-inc, total 62, <19.000, 43.000>
       Self-emp-inc, total 22, <10.000, 12.000>
       Federal-gov, total 27, <10.000, 17.000>
       Local-gov, total 53, <14.000, 39.000>
       State-gov, total 39, <10.000, 29.000>
       Without-pay, total 1, <0.000, 1.000>
       Never-worked, total 0, <0.000, 0.000>
     sex:
       Female, total 330, <28.000, 302.000>
       Male, total 647, <208.000, 439.000>

First, notice that in the vectors the first number refers to the
higher income and the second number to the lower income (e.g., from
this data it looks like women earn less than men). Notice also that
Orange outputs the tuples as-is. To change this, we would need another
loop that iterates through the members of the tuples. You may also
foresee that it would be interesting to compute proportions rather than
numbers of instances in the above contingency matrix, but we leave that
as an exercise.

Missing values
--------------

.. index::
   single: missing values; statistics

It is often interesting to see, for a given attribute, what proportion
of the instances have that attribute's value unknown. We have already
learned that the function ``isSpecial()`` can be used to determine
whether, for a specific instance and attribute, the value is not
defined. Let us use this function to compute the proportion of missing
values per attribute (:download:`report_missing.py <code/report_missing.py>`)::

   import orange
   data = orange.ExampleTable("adult_sample")

   natt = len(data.domain.attributes)
   missing = [0.] * natt
   for i in data:
       for j in range(natt):
           if i[j].isSpecial():
               missing[j] += 1
   missing = map(lambda x, l=len(data): x/l*100., missing)

   print "Missing values per attribute:"
   atts = data.domain.attributes
   for i in range(natt):
       print "  %5.1f%s %s" % (missing[i], '%', atts[i].name)

The integer variable ``natt`` stores the number of attributes in the
data set. The array ``missing`` stores the number of missing values
per attribute; its size is therefore equal to ``natt``, and all of its
elements are initially 0 (in fact, 0.0, since we purposely initialized
it with real numbers, which helps later when we convert the counts to
percentages).

The only line that possibly looks (very?) strange is ``missing =
map(lambda x, l=len(data): x/l*100., missing)``. This line could be
replaced with a for loop, but we wanted to include it here to show how
code in Python may look strange, yet may gain in efficiency. The
function ``map`` takes a vector (in our case ``missing``) and executes
a function on each of its elements, thus obtaining a new vector. The
function it executes is, in our case, defined inline, in what Python
calls a lambda expression. You can see that our lambda function takes
a single argument (when mapped, an element of the vector ``missing``)
and returns its value normalized by the number of data instances
(``len(data)``) and multiplied by 100, to turn it into a percentage.
Thus, the map function in fact normalizes the elements of ``missing``
to express the proportion of missing values over the instances of the
data set.
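The ``map``/``lambda`` normalization can be tried on its own with made-up counts (the ``list(...)`` wrapper is added only so the sketch also runs under Python 3, where ``map`` returns an iterator):

```python
# Convert raw missing-value counts to percentages of the data set size,
# exactly as the map/lambda line in the script above does.
counts = [0., 44., 0., 19.]   # hypothetical missing-value counts
n = 977                       # number of data instances

percents = list(map(lambda x, l=n: x / l * 100., counts))
# roughly [0.0, 4.5, 0.0, 1.9] percent
```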

Finally, let us see what the script we have just been working on
outputs::

   Missing values per attribute:
       0.0% age
       4.5% workclass
       0.0% fnlwgt
       0.0% education
       0.0% education-num
       0.0% marital-status
       4.5% occupation
       0.0% relationship
       0.0% race
       0.0% sex
       0.0% capital-gain
       0.0% capital-loss
       0.0% hours-per-week
       1.9% native-country

In our sampled data set, just three attributes contain missing
values.

Distributions of feature values
-------------------------------

For some of the tasks above, Orange can provide a shortcut by means of
the ``orange.DomainDistributions`` function, which returns an object
that holds the averages and mean square errors for continuous
attributes, the value frequencies for discrete attributes, and, for
both, the number of instances where the specific attribute has a
missing value. The use of this object is exemplified in the following
script (:download:`data_characteristics4.py <code/data_characteristics4.py>`)::

   import orange
   data = orange.ExampleTable("adult_sample")
   dist = orange.DomainDistributions(data)

   print "Average values and mean square errors:"
   for i in range(len(data.domain.attributes)):
       if data.domain.attributes[i].varType == orange.VarTypes.Continuous:
           print "%s, mean=%5.2f +- %5.2f" % \
               (data.domain.attributes[i].name, dist[i].average(), dist[i].error())

   print "\nFrequencies for values of discrete attributes:"
   for i in range(len(data.domain.attributes)):
       a = data.domain.attributes[i]
       if a.varType == orange.VarTypes.Discrete:
           print "%s:" % a.name
           for j in range(len(a.values)):
               print "  %s: %d" % (a.values[j], int(dist[i][j]))

   print "\nNumber of items where attribute is not defined:"
   for i in range(len(data.domain.attributes)):
       a = data.domain.attributes[i]
       print "  %2d %s" % (dist[i].unknowns, a.name)

Check this script out. Its results should match those we have derived
with the other scripts in this lesson.
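The kind of per-attribute summary that ``DomainDistributions`` gathers for a continuous attribute can be sketched in plain Python: a mean, a standard error of the mean, and a count of missing (``None``) entries. This is a standalone illustration, not the Orange API, and whether Orange's ``error()`` uses exactly this formula is an assumption here.

```python
import math

def summarize(values):
    """Mean, standard error of the mean, and count of missing (None)
    entries for one column of values."""
    known = [v for v in values if v is not None]
    n = len(known)
    mean = sum(known) / float(n)
    variance = sum((v - mean) ** 2 for v in known) / n   # population variance
    error = math.sqrt(variance / n)                      # std. error of mean
    unknowns = len(values) - n
    return mean, error, unknowns

mean, error, unknowns = summarize([40.0, None, 50.0, 30.0, None])
# mean == 40.0, with two missing entries counted separately
```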
