<html>
<head>
<title>Adult Data Base</title>
</head>
<body>
<h1>Info on Adult Data Base</h1>
<pre>
This data was extracted from the Census Bureau database found at
http://www.census.gov/ftp/pub/DES/www/welcome.html

Donor: Ronny Kohavi and Barry Becker,
       Data Mining and Visualization
       Silicon Graphics.
       e-mail: ronnyk@sgi.com for questions.

Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
48842 instances, mix of continuous and discrete    (train=32561, test=16281)
45222 if instances with unknown values are removed (train=30162, test=15060)

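The instance counts above can be cross-checked with a little arithmetic.
The sketch below only re-adds the quoted figures and compares the train
share with the nominal 2/3; the dictionary and function names are
illustrative and not part of MLC++.

```python
# Instance counts quoted in the text above.
ALL = {"train": 32561, "test": 16281, "total": 48842}
NO_UNKNOWNS = {"train": 30162, "test": 15060, "total": 45222}

def check_split(counts):
    """Return the train fraction after confirming train + test == total."""
    assert counts["train"] + counts["test"] == counts["total"]
    return counts["train"] / counts["total"]

train_frac_all = check_split(ALL)            # close to 2/3
train_frac_clean = check_split(NO_UNKNOWNS)  # close to 2/3
# Number of instances dropped because they contain unknown values.
removed = ALL["total"] - NO_UNKNOWNS["total"]
```
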
Duplicate or conflicting instances : 6

Class probabilities for adult.all file
Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
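
The class probabilities above imply a simple reference point: a
classifier that always predicts the majority label.  The sketch below
derives that baseline error from the quoted percentages.  Note that the
percentages are for the adult.all file, so treating them as a baseline
for the test split is an assumption; the names are illustrative.

```python
# Class probabilities (percent) quoted above for the adult.all file.
CLASS_PCT = {
    "all":         {">50K": 23.93, "<=50K": 76.07},
    "no_unknowns": {">50K": 24.78, "<=50K": 75.22},
}

def majority_baseline(pcts):
    """Return (majority label, error % of always predicting that label)."""
    label = max(pcts, key=pcts.get)
    error = round(100.0 - pcts[label], 2)
    return label, error

baseline = {name: majority_baseline(p) for name, p in CLASS_PCT.items()}
```
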
Extraction was done by Barry Becker from the 1994 Census database.  A set of
  reasonably clean records was extracted using the following conditions:
  ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

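The extraction condition above translates directly into a record filter.
The sketch below assumes each record is a dictionary keyed by the four
variable names used in the condition; the sample records are invented
for illustration and are not taken from the Census data.

```python
def is_clean_record(rec):
    """Python form of ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))."""
    return (rec["AAGE"] > 16 and rec["AGI"] > 100
            and rec["AFNLWGT"] > 1 and rec["HRSWK"] > 0)

# Hypothetical records, invented for illustration only.
records = [
    {"AAGE": 45, "AGI": 50000, "AFNLWGT": 150000, "HRSWK": 38},
    {"AAGE": 16, "AGI": 5000, "AFNLWGT": 120000, "HRSWK": 20},  # AAGE not > 16
    {"AAGE": 52, "AGI": 30000, "AFNLWGT": 90000, "HRSWK": 0},   # HRSWK not > 0
]
clean = [r for r in records if is_clean_record(r)]
```
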
The prediction task is to determine whether a person makes over 50K
a year.

First cited in:
@inproceedings{kohavi-nbtree,
   author={Ron Kohavi},
   title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
          Decision-Tree Hybrid},
   booktitle={Proceedings of the Second International Conference on
              Knowledge Discovery and Data Mining},
   year = 1996,
   pages={to appear}}

Accuracy reported as follows (after removal of unknowns from
   train/test sets):
   C4.5       : 84.46+-0.30
   Naive-Bayes: 83.88+-0.30
   NBTree     : 85.90+-0.28

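The accuracies above are the complements of the C4.5, Naive-Bayes and
NBTree error rates in the table that follows.  The source does not say
how the +- values were obtained; the sketch below assumes they are one
binomial standard error on the 15060 test instances, an interpretation
that happens to reproduce the quoted figures when rounded to two
decimals.

```python
import math

N_TEST = 15060  # test instances after removal of unknowns (from the text)

# Accuracy (%) and reported +- value, as quoted above.
REPORTED = {
    "C4.5":        (84.46, 0.30),
    "Naive-Bayes": (83.88, 0.30),
    "NBTree":      (85.90, 0.28),
}

def error_rate(accuracy_pct):
    """Error rate (%) is the complement of accuracy (%)."""
    return round(100.0 - accuracy_pct, 2)

def binomial_std_error(accuracy_pct, n):
    """One standard error (%) of a proportion estimated from n trials.

    Assumption: this is how the +- values were computed; the source
    does not state it.
    """
    p = accuracy_pct / 100.0
    return round(100.0 * math.sqrt(p * (1.0 - p) / n), 2)

# Maps each algorithm to (error %, assumed standard error %).
summary = {
    name: (error_rate(acc), binomial_std_error(acc, N_TEST))
    for name, (acc, _) in REPORTED.items()
}
```
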
The following algorithms were later run, with the error rates shown below,
all after removal of unknowns and using the original train/test split.
All these numbers are straight runs using MLC++ with default values.

   Algorithm               Error
-- ----------------        -----
1  C4.5                    15.54
2  C4.5-auto               14.46
3  C4.5 rules              14.94
4  Voted ID3 (0.6)         15.64
5  Voted ID3 (0.8)         16.47
6  T2                      16.84
7  1R                      19.54
8  NBTree                  14.10
9  CN2                     16.00
10 HOODG                   14.82
11 FSS Naive Bayes         14.05
12 IDTM (Decision table)   14.46
13 Naive-Bayes             16.12
14 Nearest-neighbor (1)    21.42
15 Nearest-neighbor (3)    20.35
16 OC1                     15.04
17 Pebls                   Crashed.  Unknown why (bounds WERE increased)

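To compare the runs in the table above, the error rates can be loaded
into a dictionary and sorted.  This is only a convenience sketch over
the transcribed numbers; the crashed Pebls run is recorded as None and
left out of the ranking.

```python
# Error rates (%) transcribed from the table above; None marks the crashed run.
ERRORS = {
    "C4.5": 15.54,
    "C4.5-auto": 14.46,
    "C4.5 rules": 14.94,
    "Voted ID3 (0.6)": 15.64,
    "Voted ID3 (0.8)": 16.47,
    "T2": 16.84,
    "1R": 19.54,
    "NBTree": 14.10,
    "CN2": 16.00,
    "HOODG": 14.82,
    "FSS Naive Bayes": 14.05,
    "IDTM (Decision table)": 14.46,
    "Naive-Bayes": 16.12,
    "Nearest-neighbor (1)": 21.42,
    "Nearest-neighbor (3)": 20.35,
    "OC1": 15.04,
    "Pebls": None,
}

# Keep only the runs that produced a number, then sort by ascending error.
completed = {name: err for name, err in ERRORS.items() if err is not None}
ranking = sorted(completed.items(), key=lambda item: item[1])
best_name, best_error = ranking[0]
worst_name, worst_error = ranking[-1]
```
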
The original data was converted as follows:
1. Discretize agrossincome into two ranges with threshold 50,000.
2. Convert U.S. to US to avoid periods.
3. Convert Unknown to "?".
4. Run MLC++ GenCVFiles to generate data,test.

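Steps 1 to 3 above can be sketched as a per-record transformation.  The
record layout, the field names other than agrossincome, and the choice
to place an income of exactly 50,000 in the '<=50K' range are
assumptions made for illustration; step 4 (GenCVFiles) is an MLC++
utility and is not reproduced here.

```python
def convert_record(rec, income_field="agrossincome", threshold=50000):
    """Apply conversion steps 1-3 to one record (a dict of field -> value)."""
    out = {}
    for key, value in rec.items():
        if key == income_field:
            # Step 1: discretize income into two ranges.
            # Assumption: exactly `threshold` falls in the lower range.
            out[key] = ">50K" if value > threshold else "<=50K"
        elif isinstance(value, str):
            # Step 2: drop the periods in "U.S.".
            value = value.replace("U.S.", "US")
            # Step 3: mark unknown values with "?".
            out[key] = "?" if value == "Unknown" else value
        else:
            out[key] = value
    return out

# Hypothetical record, invented for illustration only.
sample = {"agrossincome": 62000, "native_country": "U.S.",
          "occupation": "Unknown", "age": 40}
converted = convert_record(sample)
```
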
Description of fnlwgt (final weight): The weights on the CPS files are
controlled to independent estimates of the civilian noninstitutional
population of the US.  These are prepared monthly for us by Population
Division here at the Census Bureau.  We use 3 sets of controls. These
are:

  1.  A single cell estimate of the population 16+ for each state.
  2.  Controls for Hispanic Origin by age and sex.
  3.  Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake"
through them 6 times so that by the end we come back to all the
controls we used.

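"Raking" as described above is commonly implemented as iterative
proportional fitting: weights are rescaled to match each set of control
totals in turn, and the cycle is repeated.  The sketch below is a toy
version with two invented control sets (state and sex) rather than the
three used by the Census Bureau; the records, totals and function names
are all hypothetical, and only the 6 passes come from the text.

```python
def weighted_totals(records, weights, field):
    """Sum the weights of the records in each category of `field`."""
    totals = {}
    for rec, w in zip(records, weights):
        totals[rec[field]] = totals.get(rec[field], 0.0) + w
    return totals

def rake(records, controls, passes=6):
    """Rescale weights to each control set in turn, `passes` times over."""
    weights = [1.0] * len(records)
    for _ in range(passes):
        for field, targets in controls:
            totals = weighted_totals(records, weights, field)
            weights = [w * targets[rec[field]] / totals[rec[field]]
                       for rec, w in zip(records, weights)]
    return weights

# Toy sample and control totals, invented for illustration only.
SAMPLE = [
    {"state": "A", "sex": "M"}, {"state": "A", "sex": "M"},
    {"state": "A", "sex": "F"}, {"state": "B", "sex": "M"},
    {"state": "B", "sex": "F"}, {"state": "B", "sex": "F"},
]
CONTROLS = [
    ("state", {"A": 60.0, "B": 40.0}),
    ("sex", {"M": 45.0, "F": 55.0}),
]

weights = rake(SAMPLE, CONTROLS)
state_totals = weighted_totals(SAMPLE, weights, "state")
sex_totals = weighted_totals(SAMPLE, weights, "sex")
```

The last control set applied is matched exactly; earlier ones are
matched approximately, with the gap shrinking on every pass, which is
one way to read "by the end we come back to all the controls we used".
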
The term estimate refers to population totals derived from CPS by
creating "weighted tallies" of any specified socio-economic
characteristics of the population.

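A "weighted tally" in this sense is the sum of fnlwgt over the records
that share a characteristic.  The sketch below shows the idea on
invented records; the field names other than fnlwgt and all values are
hypothetical.

```python
def weighted_tally(records, predicate, weight_field="fnlwgt"):
    """Estimate a population total: sum the weights of matching records."""
    return sum(rec[weight_field] for rec in records if predicate(rec))

# Hypothetical records, invented for illustration only.
people = [
    {"fnlwgt": 1000, "hours_per_week": 40},
    {"fnlwgt": 2000, "hours_per_week": 20},
    {"fnlwgt": 1500, "hours_per_week": 45},
]

# Estimated number of people working at least 40 hours a week.
full_time = weighted_tally(people, lambda r: r["hours_per_week"] >= 40)
# Estimated size of the whole population represented by the sample.
everyone = weighted_tally(people, lambda r: True)
```
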
People with similar demographic characteristics should have similar
weights.  There is one important caveat to remember about this
statement.  That is that since the CPS sample is actually a collection
of 51 state samples, each with its own probability of selection, the
statement only applies within state.
</pre>
</body>
</html>