source: orange/docs/tutorial/rst/load-data.rst @ 9386:b95da3693f19

Revision 9386:b95da3693f19, 7.7 KB checked in by mitar, 2 years ago (diff)

Renamed tutorial files.

Data input
==========

.. index:: data input

Orange is a machine learning and data mining suite, so loading data
is one of its essential functions (we tried not to stop there, though,
so read on). Orange supports the C4.5, Assistant, Retis, and
tab-delimited (native Orange) data formats. Of these, you may be most
familiar with C4.5, so we will say a few words about it later.
Orange's native format is the simplest, so most of our data files
come in this flavor.

Let us start with an example in Orange's native data format. Consider
an artificial data set :download:`lenses.tab <code/lenses.tab>` on the
prescription of eye lenses [CJ1987]_. The data set has four attributes
(age of the patient, spectacle prescription, presence of astigmatism,
and tear production rate) plus a three-valued class that gives the
appropriate lens prescription for the patient (hard contact lenses,
soft contact lenses, no lenses). You may have already guessed that
this data set can, in principle, be used to build a classifier that
prescribes the right lenses based on the four attributes. But before
we do that, let us see how the data file is composed and how to read
it in Orange. Here are the first few lines of
:download:`lenses.tab <code/lenses.tab>`::

   age       prescription  astigmatic    tear_rate     lenses
   discrete  discrete      discrete      discrete      discrete
                                                       class
   young     myope         no            reduced       none
   young     myope         no            normal        soft
   young     myope         yes           reduced       none
   young     myope         yes           normal        hard
   young     hypermetrope  no            reduced       none

The first line of the file lists the names of the attributes and the
class. The second line gives the type of each column. Here, all
attributes are nominal (or discrete), hence the ``discrete`` keyword
in every column. If you get tired of typing ``discrete``, you may use
``d`` instead. We will later see that attributes may also be
continuous, in which case their columns carry the ``continuous``
keyword (or just ``c``). The third line adds an optional description
to each column. Note that ``lenses`` is a special variable, since it
represents the class to which each data instance is assigned. This is
denoted by ``class`` in the third line of the last column. Other
keywords may be used in this line that do not appear in our example.
For instance, for attributes that we would like to ignore, we can use
the ``ignore`` keyword (or simply ``i``). There are further keywords
as well, but for the sake of simplicity we skip them here.

The rest of the table gives the data. The table above shows only the
first 5 instances (check the original file to see the others). Orange
is rather liberal about attribute value names, so they do not all
need to start with a letter as in our example.
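
To make the three-line header concrete, here is a minimal sketch that
splits tab-delimited text into names, types, flags, and data rows. It
uses only the Python standard library (Python 3 syntax) and does not
involve Orange. The helper name ``parse_tab_text`` and the inline
sample are assumptions made for illustration; Orange's own reader
handles more cases, such as the ``d`` and ``c`` abbreviations and the
``ignore`` flag.

```python
import csv
import io

# First lines of lenses.tab, with real tab characters between columns.
SAMPLE = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)


def parse_tab_text(text):
    """Split tab-delimited text into (names, types, flags, rows).

    Hypothetical helper for illustration; not part of Orange.
    """
    lines = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, types, flags = lines[0], lines[1], lines[2]
    rows = lines[3:]
    return names, types, flags, rows


names, types, flags, rows = parse_tab_text(SAMPLE)

# The column flagged "class" in the third line is the class variable.
class_name = names[flags.index("class")]
# Every other column is treated as an attribute in this sketch.
attribute_names = [n for n, f in zip(names, flags) if f != "class"]

print("Attributes:", attribute_names)
print("Class:", class_name)
print("Data rows:", len(rows))
```

The third header line is mostly empty, which is why the class column
is located by position rather than by name.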

Attribute values are separated by tab characters (<TAB>). This is
rather hard to see above (it looks as if spaces were used), so to
verify it, open the original :download:`lenses.tab <code/lenses.tab>`
data set in your favorite text editor. Alternatively, the authors of
this text prefer to edit these files in a spreadsheet program (and
save them in tab-delimited format), so a snapshot of the data set as
edited in Excel can look like this:

.. image:: files/excel.png
   :alt: Data in Excel
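
Another way to make the tab characters visible is to print a line
with ``repr``. The sketch below is plain Python 3 and independent of
Orange; the ``line`` string is typed in by hand to mirror one data row
of ``lenses.tab``, rather than read from the file.

```python
# One data line as it would appear in lenses.tab (tabs written as \t here).
line = "young\tmyope\tno\treduced\tnone\n"

# repr() shows tab characters as \t, so they cannot be mistaken for spaces.
print(repr(line))

# Splitting on tabs should give one value per column (five in lenses.tab).
values = line.rstrip("\n").split("\t")
print(values)
```

If a line splits into fewer values than there are columns, spaces were
probably typed where tabs were expected.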

Now create a directory and save :download:`lenses.tab <code/lenses.tab>`
in it (right-click on the link and choose "Save Target As ..."). Open
a terminal (cmd shell on Windows, Terminal on Mac OS X), change to the
directory you have just created, and run Python. In the interactive
Python shell, import Orange and load the data file:

>>> import orange
>>> data = orange.ExampleTable("lenses")
>>>

This creates an object called ``data`` that holds your data set and
information about the lenses domain. Note that no suffix was needed
for the file name: Orange looks through the current directory and
checks whether any files of the types it knows are available. This
time, it found lenses.tab.
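
As a rough mental model only, the lookup can be pictured as trying a
list of known suffixes in turn. The sketch below is not Orange's
actual implementation: the helper ``find_data_file`` and the suffix
list are assumptions chosen for illustration, and the code uses only
the Python 3 standard library.

```python
import os
import tempfile

# Assumed subset of suffixes, for illustration only; Orange knows more formats.
KNOWN_SUFFIXES = (".tab", ".data")


def find_data_file(stem, directory="."):
    """Return the first existing file named stem + known suffix, else None."""
    for suffix in KNOWN_SUFFIXES:
        candidate = os.path.join(directory, stem + suffix)
        if os.path.isfile(candidate):
            return candidate
    return None


# Demonstrate with a temporary directory holding an empty lenses.tab.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "lenses.tab"), "w").close()

found = find_data_file("lenses", workdir)
missing = find_data_file("car", workdir)
print(os.path.basename(found))
print(missing)
```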

How do we know that ``data`` really contains our data set? Let's
check by printing the attribute names and the first 3 data items:

>>> print data.domain.attributes
<age, prescription, astigmatic, tear_rate>
>>> for i in range(3):
...     print data[i]
...
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
>>>

Now let's put together a script, :download:`lenses.py <code/lenses.py>`,
that reads the lenses data, prints the names of the attributes and the
class, and lists the first 5 data instances::

   import orange

   # Load lenses.tab from the current directory (no suffix needed).
   data = orange.ExampleTable("lenses")

   # A trailing comma keeps the output on the same line.
   print "Attributes:",
   for i in data.domain.attributes:
       print i.name,
   print
   print "Class:", data.domain.classVar.name

   print "First 5 data items:"
   for i in range(5):
       print data[i]

A few comments on this script are in order. First, note that ``data``
is an object that holds both the data and information on the domain.
We showed above how to access attribute and class names, but you may
correctly expect that there is much more information there, including
attribute types, the values they may hold, etc. Also notice the
particular syntax Python uses for ``for`` loops: the line that
declares the loop ends with ``:``, and whatever is in the loop is
indented (all statements within one loop must share the same
indentation).

Save :download:`lenses.py <code/lenses.py>` in your working directory,
which should now contain both lenses.py and lenses.tab. Now let's run
the script we have just written::

   > python lenses.py
   Attributes: age prescription astigmatic tear_rate
   Class: lenses
   First 5 data items:
   ['young', 'myope', 'no', 'reduced', 'none']
   ['young', 'myope', 'no', 'normal', 'soft']
   ['young', 'myope', 'yes', 'reduced', 'none']
   ['young', 'myope', 'yes', 'normal', 'hard']
   ['young', 'hypermetrope', 'no', 'reduced', 'none']
   >

Now, we promised to say something about C4.5 data files, whose syntax
was (and perhaps still is) common within the machine learning
community due to the extensive use of this program. C4.5 data sets
are described in two files: the file with the extension ".data" holds
the actual data, whereas the domain (attribute and class names and
types) is described in a separate ".names" file. Instead of going into
how exactly these files are formed, we just show by example that
Orange can handle them. For this purpose, download
:download:`car.data <code/car.data>` and
:download:`car.names <code/car.names>` and run the following code::

   > python
   >>> import orange
   >>> car_data = orange.ExampleTable("car")
   >>> print car_data.domain.attributes
   <buying, maint, doors, persons, lugboot, safety>
   >>>

If you prefer to store domain information and data in a single file,
or if you like looking at your data in a spreadsheet, you can now save
your C4.5 data in Orange's native (.tab) format:

>>> orange.saveTabDelimited("car.tab", car_data)
>>>

Similarly, saving to C4.5 format is possible through ``orange.saveC45``.

All of the above applies if you run Python from the Command Prompt. If
you use PythonWin, however, you have to tell it exactly where your
data is located. You can either specify the absolute path of your data
files (type your commands in the Interactive Window):

>>> car_data = orange.ExampleTable("c:/orange/car")
>>>

or set the working directory through Python's ``os`` module:

>>> import os
>>> os.chdir("c:/orange")
>>>
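
Both options can be tried out without Orange. The following sketch
uses only the Python 3 standard library, with a temporary directory
standing in for ``c:/orange``; the variable names are illustrative
assumptions.

```python
import os
import tempfile

# A temporary directory stands in for c:/orange in this sketch.
data_dir = tempfile.mkdtemp()

# Option 1: build an absolute path to the data file (without suffix, as above).
car_path = os.path.join(data_dir, "car")
print(os.path.isabs(car_path))

# Option 2: change the working directory, so that a bare "car" resolves there.
previous_dir = os.getcwd()
os.chdir(data_dir)
in_data_dir = os.path.samefile(os.getcwd(), data_dir)
print(in_data_dir)

# Restore the original working directory when done.
os.chdir(previous_dir)
```

Changing the working directory affects every later relative path in
the session, so passing an absolute path is the less intrusive of the
two options.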

**References**

.. [CJ1987] Cendrowska J (1987) PRISM: An algorithm for inducing modular rules,
   International Journal of Man-Machine Studies, 27, 349-370.