source: orange/orange/doc/datasets/breast-cancer-wisconsin.htm @ 1760:9d4bb141fb0e

Revision 1760:9d4bb141fb0e, 5.7 KB checked in by blaz <blaz.zupan@…>, 9 years ago

data info file

<html>
<head>
<title>Breast Cancer Wisconsin Data Base</title>
</head>
<body>
<h1>Info on Breast Cancer Wisconsin Data Base</h1>
<pre>
Citation Request:
   This breast cancer database was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg.  If you publish results
   when using this database, then please include this information in your
   acknowledgements.  Also, please cite one or more of:

   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

   2. William H. Wolberg and O. L. Mangasarian: "Multisurface method of
      pattern separation for medical diagnosis applied to breast cytology",
      Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
      December 1990, pp 9193-9196.

   3. O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition
      via linear programming: Theory and application to medical diagnosis",
      in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
      Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

   4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming
      discrimination of two linearly inseparable sets", Optimization Methods
      and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

2. Sources:
   -- Dr. William H. Wolberg (physician)
      University of Wisconsin Hospitals
      Madison, Wisconsin
      USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
      Received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992

3. Past Usage:

   Attributes 2 through 10 have been used to represent instances.
   Each instance has one of 2 possible classes: benign or malignant.

   1. Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of
      pattern separation for medical diagnosis applied to breast cytology. In
      Proceedings of the National Academy of Sciences, 87,
      9193-9196.
      -- Size of data set: only 369 instances (at that point in time)
      -- Collected classification results: 1 trial only
      -- Two pairs of parallel hyperplanes were found to be consistent with
         50% of the data
         -- Accuracy on remaining 50% of dataset: 93.5%
      -- Three pairs of parallel hyperplanes were found to be consistent with
         67% of data
         -- Accuracy on remaining 33% of dataset: 95.9%

   2. Zhang, J. (1992). Selecting typical instances in instance-based
      learning.  In Proceedings of the Ninth International Machine
      Learning Conference (pp. 470-479).  Aberdeen, Scotland: Morgan
      Kaufmann.
      -- Size of data set: only 369 instances (at that point in time)
      -- Applied 4 instance-based learning algorithms
      -- Collected classification results averaged over 10 trials
      -- Best accuracy result:
         -- 1-nearest neighbor: 93.7%
         -- trained on 200 instances, tested on the other 169
      -- Also of interest:
         -- Using only typical instances: 92.2% (storing only 23.1 instances)
         -- trained on 200 instances, tested on the other 169

4. Relevant Information:

   Samples arrive periodically as Dr. Wolberg reports his clinical cases.
   The database therefore reflects this chronological grouping of the data.
   This grouping information appears immediately below, having been removed
   from the data itself:

     Group 1: 367 instances (January 1989)
     Group 2:  70 instances (October 1989)
     Group 3:  31 instances (February 1990)
     Group 4:  17 instances (April 1990)
     Group 5:  48 instances (August 1990)
     Group 6:  49 instances (Updated January 1991)
     Group 7:  31 instances (June 1991)
     Group 8:  86 instances (November 1991)
     -----------------------------------------
     Total:   699 points (as of the donated database on 15 July 1992)

   Note that the results summarized above in Past Usage refer to a dataset
   of size 369, while Group 1 has only 367 instances.  This is because it
   originally contained 369 instances; 2 were removed.  The following
   statements summarize changes to the original Group 1's set of data:

   #####  Group 1 : 367 points: 200B 167M (January 1989)
   #####  Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   #####  Revised Nov 22, 1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
   #####                  : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
   #####                  : Changed 0 to 1 in field 6 of sample 1219406
   #####                  : Changed 0 to 1 in field 8 of following sample:
   #####                  : 1182404,2,3,1,1,1,2,0,1,1,1
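
   As a quick consistency check on the figures in this section, the group
   sizes can be summed in a few lines of Python.  This is only an
   illustrative sketch; the variable names are not part of the database.

```python
# Group sizes as listed in the grouping table (Group 1 to Group 8).
group_sizes = [367, 70, 31, 17, 48, 49, 31, 86]

total = sum(group_sizes)
print(total)  # 699

# Group 1 originally held 369 instances; 2 were removed.
original_group1 = 369
removed = 2
print(original_group1 - removed)  # 367
```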

5. Number of Instances: 699 (as of 15 July 1992)

6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

8. Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing
   (i.e., unavailable) attribute value, now denoted by "?".

9. Class distribution:

   Benign: 458 (65.5%)
   Malignant: 241 (34.5%)
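
   The percentages above follow from the class counts and the total of 699
   instances.  A minimal Python sketch of that arithmetic, with
   illustrative variable names:

```python
# Class counts as stated in the class distribution.
counts = {"benign": 458, "malignant": 241}
total = sum(counts.values())

for label, n in counts.items():
    share = 100.0 * n / total
    print("%s: %d (%.1f%%)" % (label, n, share))
```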
</pre>
</body>
</html>