source: orange/docs/tutorial/rst/data.rst @ 11556:c9ec072c64a1

Revision 11556:c9ec072c64a1, 8.7 KB checked in by blaz <blaz.zupan@…>, 11 months ago

corrected error with 25th data instance

The Data
========

.. index:: data

This section describes how to load and save data. It also shows how to explore the data and its domain description, how to report basic data set statistics, and how to sample the data.

Data Input
----------

.. index::
   single: data; input

Orange can read files in its native format and in other data formats. A file in the native format starts with a line of feature (attribute) names, followed by a line that gives their types (continuous, discrete, or string). The third line contains meta information that identifies dependent features (class), irrelevant features (ignore), or meta features (meta). Here are the first few lines of the data set :download:`lenses.tab <code/lenses.tab>` on the prescription of eye lenses [CJ1987]_::

   age       prescription  astigmatic    tear_rate     lenses
   discrete  discrete      discrete      discrete      discrete
                                                       class
   young     myope         no            reduced       none
   young     myope         no            normal        soft
   young     myope         yes           reduced       none
   young     myope         yes           normal        hard
   young     hypermetrope  no            reduced       none


Values are tab-delimited. The data set has four attributes (age of the patient, spectacle prescription, presence of astigmatism, and tear production rate) and an associated three-valued dependent variable that encodes the lens prescription for the patient (hard contact lenses, soft contact lenses, no lenses). The type and meta descriptions may be abbreviated to a single letter, so the header of this data set could also read::

   age       prescription  astigmatic    tear_rate     lenses
   d         d             d             d             d
                                                       c

The rest of the table gives the data. Note that there are five instances in the table above; check the original file to see the others. Orange is rather liberal about the names of attribute values, so they do not all need to start with a letter as in our example.

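
The three-line header described above can be sketched in plain Python. This is only an illustration of the file layout, not Orange's own reader: ``parse_tab``, ``TYPES`` and ``ROLES`` are hypothetical names, and the one-letter abbreviations other than ``d`` (discrete) and ``c`` (class) are assumed from the first letters of the full names.

```python
# Illustrative only: a plain-Python reading of the native header layout.
# parse_tab, TYPES and ROLES are hypothetical names, not part of Orange.

# First letter of the type given on the second header line.
TYPES = {"c": "continuous", "d": "discrete", "s": "string"}
# First letter of the flag on the third header line; empty means a regular feature.
ROLES = {"": "feature", "c": "class", "i": "ignore", "m": "meta"}

# The lines shown above, written with real tab characters.
LENSES_TAB = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "young\tmyope\tyes\treduced\tnone\n"
    "young\tmyope\tyes\tnormal\thard\n"
    "young\thypermetrope\tno\treduced\tnone\n"
)


def parse_tab(text):
    """Split tab-delimited text into column descriptions and data rows."""
    rows = [line.split("\t") for line in text.splitlines()]
    names, types, flags = rows[0], rows[1], rows[2]
    columns = []
    for name, kind, flag in zip(names, types, flags):
        columns.append({
            "name": name,
            "type": TYPES[kind.strip()[:1]],
            "role": ROLES[flag.strip()[:1]],
        })
    return columns, rows[3:]


columns, instances = parse_tab(LENSES_TAB)
for column in columns:
    print("%-13s %-9s %s" % (column["name"], column["type"], column["role"]))
print("%d instances" % len(instances))
```

Looking only at the first letter of each type and flag is what lets the full and the abbreviated header be read by the same code.
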
You may download :download:`lenses.tab <code/lenses.tab>` to a target directory and open a Python shell there. Alternatively, just execute the code below; this particular data set comes with the Orange installation, and Orange knows where to find it:

>>> import Orange
>>> data = Orange.data.Table("lenses")

Note that no suffix is needed in the file name, as Orange checks whether any matching file in the current directory is of a readable type. The call to ``Orange.data.Table`` creates an object called ``data`` that holds your data set and information about the lenses domain:

>>> print data.domain.features
<Orange.feature.Discrete 'age', Orange.feature.Discrete 'prescription', Orange.feature.Discrete 'astigmatic', Orange.feature.Discrete 'tear_rate'>
>>> print data.domain.class_var
Orange.feature.Discrete 'lenses'
>>> for d in data[:3]:
...     print d
...
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']

The following script wraps up everything we have done so far and lists the first five data instances with a ``soft`` prescription:

.. literalinclude:: code/data-lenses.py

Note that ``data`` is an object that holds both the data and information on the domain. We showed above how to access attribute and class names, but there is much more information there, including feature types, the sets of values of categorical features, and more.

Saving the Data
---------------

Data objects can be saved to a file:

>>> data.save("new_data.tab")

This time, we have to provide the extension so that Orange knows which data format to use. The extension for Orange's native data format is ".tab". The following code saves only the data items with a myope prescription:

.. literalinclude:: code/data-save.py

Exploration of Data Domain
--------------------------

.. index::
   single: data; features
.. index::
   single: data; domain
.. index::
   single: data; class

A data table object stores information on data instances as well as on the data domain. The domain holds the names of the features and of the optional classes, their types and, if they are categorical, their value names.

.. literalinclude:: code/data-domain1.py

Orange's objects often behave like Python lists and dictionaries, and can be indexed or accessed through feature names.

.. literalinclude:: code/data-domain2.py
   :lines: 5-

Data Instances
--------------

.. index::
   single: data; instances
.. index::
   single: data; examples

A data table stores data instances (or examples). These can be indexed or traversed like any Python list. Data instances can be considered as vectors whose elements are accessed through an index or through a feature name.

.. literalinclude:: code/data-instances1.py

The script above displays the following output::

   First three data instances:
   [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
   [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
   [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']
   25-th data instance:
   [4.8, 3.4, 1.9, 0.2, 'Iris-setosa']
   Value of 'sepal width' for the first instance: 3.5
   The 3rd value of the 25th data instance: 1.9

The iris data set we have used above has four continuous attributes. Here's a script that computes their means:

.. literalinclude:: code/data-instances2.py
   :lines: 3-

The script above also illustrates indexing of data instances with objects that store features; in ``d[x]``, the variable ``x`` is an Orange object. Here's the output::

   Feature         Mean
   sepal length    5.84
   sepal width     3.05
   petal length    3.76
   petal width     1.20


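
As a rough plain-Python analogue of indexing by feature and averaging, the sketch below uses only the four iris instances printed earlier, so its means differ from those of the full data set. It does not use Orange; ``FEATURES``, ``ROWS``, ``value`` and ``mean`` are hypothetical names, and the feature names are taken from the output above.

```python
# Plain-Python sketch; it does not use Orange.
FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

# The four instances printed earlier (the first three and the 25th).
ROWS = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
    [4.7, 3.2, 1.3, 0.2, "Iris-setosa"],
    [4.8, 3.4, 1.9, 0.2, "Iris-setosa"],
]


def value(row, feature):
    """Return the value of a feature given by name or by position."""
    index = FEATURES.index(feature) if isinstance(feature, str) else feature
    return row[index]


def mean(feature):
    """Average one feature over all rows."""
    return sum(value(row, feature) for row in ROWS) / float(len(ROWS))


print("%-15s %s" % ("Feature", "Mean"))
for name in FEATURES:
    print("%-15s %.2f" % (name, mean(name)))
```

Accepting either a name or a position in ``value`` mirrors the two ways of accessing an instance that this section describes.
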
Slightly more complicated, but more interesting, is code that computes per-class averages:

.. literalinclude:: code/data-instances3.py
   :lines: 3-

Of the four features, petal width and length look quite discriminative for the type of iris::

   Feature             Iris-setosa Iris-versicolor  Iris-virginica
   sepal length               5.01            5.94            6.59
   sepal width                3.42            2.77            2.97
   petal length               1.46            4.26            5.55
   petal width                0.24            1.33            2.03

Finally, here is a quick piece of code that computes the class distribution for another data set:

.. literalinclude:: code/data-instances4.py

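
To illustrate what a class distribution is, the sketch below counts the class values of the five lenses instances listed in the Data Input section. It is a minimal plain-Python example that does not use Orange and is not the included script; ``LENSES_CLASSES`` is a hypothetical name.

```python
from collections import Counter

# Class values (last column) of the five lenses instances listed earlier.
LENSES_CLASSES = ["none", "soft", "none", "hard", "none"]

# Count how often each class value occurs.
distribution = Counter(LENSES_CLASSES)
total = float(len(LENSES_CLASSES))

# Report the counts and their share of all instances, sorted by class name.
for label, count in sorted(distribution.items()):
    print("%-5s %d (%.0f%%)" % (label, count, 100 * count / total))
```
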
Missing Values
--------------

.. index::
   single: data; missing values

Consider the following exploration of the congressional voting data set::

   >>> data = Orange.data.Table("voting.tab")
   >>> data[2]
   ['?', 'y', 'y', '?', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'n', 'y', 'y', 'n', 'n', 'democrat']
   >>> data[2][0].is_special()
   1
   >>> data[2][1].is_special()
   0

This particular data instance includes missing data (represented with '?') for the first and the fourth feature. We can use the method ``is_special()`` to detect which parts of the data are missing. In the original data set file, the missing values are, by default, represented with a blank space. Below, we use ``is_special()`` to examine each feature and report the proportion of instances for which this feature is undefined:

.. literalinclude:: code/data-missing.py

The first few lines of the output of this script are::

    2.8% handicapped-infants
   11.0% water-project-cost-sharing
    2.5% adoption-of-the-budget-resolution
    2.5% physician-fee-freeze
    3.4% el-salvador-aid

A one-liner that reports the number of data instances with at least one missing value is::

    >>> sum(any(d[x].is_special() for x in data.domain.features) for d in data)
    203


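
The same checks can be sketched in plain Python on the single instance printed above. This is an assumption-laden analogue rather than Orange code: ``is_missing`` is a hypothetical stand-in for ``is_special()``, and it simply treats the string ``'?'`` as the missing-value marker.

```python
# Plain-Python sketch; '?' marks a missing value as in the printout above.
INSTANCE = ['?', 'y', 'y', '?', 'y', 'y', 'n', 'n', 'n',
            'n', 'y', 'n', 'y', 'y', 'n', 'n', 'democrat']


def is_missing(value):
    """Hypothetical stand-in for the is_special() check on a single value."""
    return value == '?'


features = INSTANCE[:-1]  # the last element is the class
missing_at = [i for i, v in enumerate(features) if is_missing(v)]
share = 100.0 * len(missing_at) / len(features)

print("missing at positions %s" % missing_at)
print("%.1f%% of the features are undefined" % share)

# Same pattern as the one-liner above, here on a one-element data set.
rows = [INSTANCE]
incomplete = sum(any(is_missing(v) for v in row[:-1]) for row in rows)
print("%d instance(s) with at least one missing value" % incomplete)
```
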
Data Subsetting
---------------

.. index::
   single: data; subsetting

``Orange.data.Table`` accepts a list of data items and returns a new data set. This is useful for any data subsetting:

.. literalinclude:: code/data-subsetting.py
   :lines: 3-

The code outputs::

   Subsetting from 150 to 99 instances.

The new data set inherits the data description (domain) from the original data set. Changing the domain requires setting up a new domain descriptor, which is useful for any kind of feature selection:

.. literalinclude:: code/data-featureselection.py
   :lines: 3-

.. index::
   single: feature; selection

By default, ``Orange.data.Domain`` assumes that the last feature in the list of features passed as the argument is the class variable. This can be changed with an optional argument::

   >>> nd = Orange.data.Domain(data.domain.features[:2], False)
   >>> print nd.class_var
   None
   >>> nd = Orange.data.Domain(data.domain.features[:2], True)
   >>> print nd.class_var
   Orange.feature.Continuous 'sepal width'

The first call to ``Orange.data.Domain`` constructed a classless domain, while the second used the last feature as the class and constructed a domain with one input feature and a continuous class.

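
Both ideas in this section, selecting instances and selecting features, can be illustrated without Orange. The sketch below is a plain-Python analogue that works on the five lenses instances from the Data Input section; ``select`` and the other names are hypothetical, and it does not build an ``Orange.data.Domain``.

```python
# Plain-Python sketch; it does not use Orange.
NAMES = ["age", "prescription", "astigmatic", "tear_rate", "lenses"]
ROWS = [
    ["young", "myope", "no", "reduced", "none"],
    ["young", "myope", "no", "normal", "soft"],
    ["young", "myope", "yes", "reduced", "none"],
    ["young", "myope", "yes", "normal", "hard"],
    ["young", "hypermetrope", "no", "reduced", "none"],
]

# Row subsetting: keep the instances that satisfy a condition.
column = NAMES.index("tear_rate")
subset = [row for row in ROWS if row[column] == "normal"]
print("Subsetting from %d to %d instances." % (len(ROWS), len(subset)))


def select(rows, names, keep):
    """Project rows onto the columns named in keep, in that order."""
    indices = [names.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]


# Feature selection: two features plus the class, which stays last.
new_names = ["astigmatic", "tear_rate", "lenses"]
for row in select(subset, NAMES, new_names):
    print(row)
```

Keeping the class as the last selected column follows the default convention described above, where the last feature in the list is taken as the class variable.
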
**References**

.. [CJ1987] Cendrowska J (1987) PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27, 349-370.