source: orange/Orange/doc/ofb/load_data.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 10.6 KB checked in by anze <anze.staric@…>, 2 years ago

Moved orange to Orange (part 2)

<html><HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
</HEAD>
<body>

<p class="Path">
Prev: <a href="start.htm">Start with Orange</a>,
Next: <a href="basic_exploration.htm">Basic Data Exploration</a>,
Up: <a href="default.htm">On Tutorial 'Orange for Beginners'</a>
</p>

<H1>Load-In The Data</H1>
<index name="data input">

<p>Orange is a machine learning and data mining suite, so loading
data is one of its essential functions (it does much more than
that, though, so read on). Orange supports C4.5, Assistant, Retis,
and tab-delimited (native Orange) data formats. Of these, you may
be most familiar with C4.5, so we will say a few words about it
below. Orange&rsquo;s native format is the simplest, so most of our
data files will come in this flavor.</p>

<p>Let us start with an example in Orange&rsquo;s native data
format. Consider an artificial data set on lens prescription (from
Cendrowska, J. "PRISM: An algorithm for inducing modular rules",
International Journal of Man-Machine Studies, 1987, 27, 349-370).
The data set has four attributes (age of the patient, spectacle
prescription, whether the patient is astigmatic, and tear
production rate) plus an associated three-valued class that gives
the appropriate lens prescription for the patient (hard contact
lenses, soft contact lenses, no lenses). You may have already
guessed that this data set can, in principle, be used to build a
classifier that, based on the four attributes, prescribes the right
lenses. But before we do that, let us see how the data set file is
composed and how to read it in Orange.</p>

<p class=header>first few lines of <a href="lenses.tab">lenses.tab</a></p>
<xmp class=code>age       prescription  astigmatic    tear_rate     lenses
discrete  discrete      discrete      discrete      discrete
                                                    class
young     myope         no            reduced       none
young     myope         no            normal        soft
young     myope         yes           reduced       none
young     myope         yes           normal        hard
young     hypermetrope  no            reduced       none
</xmp>

<p>The first line of the file lists the names of the attributes and
the class. The second line gives the type of each attribute. Here,
all attributes are nominal (or discrete), hence the
&ldquo;discrete&rdquo; keyword in every column. If you get tired of
typing &ldquo;discrete&rdquo;, you may use &ldquo;d&rdquo; instead.
We will later find that attributes may also be continuous, in which
case their columns carry the appropriate keyword (or just
&ldquo;c&rdquo;). The third line can add an additional description
to each column. Note that &ldquo;lenses&rdquo; is a special
variable, since it represents the class into which each data
instance is classified. This is denoted by &ldquo;class&rdquo; in
the third line of the last column. Other keywords that we have not
used in our example may also appear in this line. For instance, for
attributes that we would like to ignore, we can use the
&ldquo;ignore&rdquo; keyword (or simply &ldquo;i&rdquo;). There are
further keywords as well, but for the sake of simplicity we skip
them here.</p>
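<p>To make the three header lines concrete, here is a minimal
sketch that reads them with Python&rsquo;s standard csv module
instead of Orange. It is only an illustration of the layout
described above: the helper name read_tab_header and the embedded
sample text are assumptions made for this example, and the code is
written for Python 3, unlike the Orange sessions in this
tutorial.</p>

```python
import csv
import io

# The first lines of lenses.tab, with real tab characters between columns.
SAMPLE = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)


def read_tab_header(stream):
    """Return (names, types, flags) taken from the three header lines.

    Hypothetical helper for illustration; this is not part of Orange.
    """
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # line 1: attribute and class names
    types = next(reader)   # line 2: "discrete"/"d", or "c" for continuous
    flags = next(reader)   # line 3: optional keywords such as "class"
    return names, types, flags


names, types, flags = read_tab_header(io.StringIO(SAMPLE))

# The column flagged "class" is the class variable; columns flagged
# "ignore" (or "i") are skipped; everything else is an attribute.
class_name = names[flags.index("class")]
attributes = [
    name for name, flag in zip(names, flags)
    if flag not in ("class", "ignore", "i")
]

print("Attributes:", attributes)
print("Class:", class_name)
```

<p>Orange does all of this for you when it loads a file; the sketch
only shows which piece of information sits on which line.</p>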

<p>The rest of the table gives the data. Note that the excerpt
above shows only 5 instances (check the original file to see the
others). Orange is rather liberal about attribute value names, so
they need not all start with a letter as in our example.</p>

<p>Attribute values are separated by tab characters (&lt;TAB&gt;).
This is hard to see above (it looks as if spaces were used), so to
verify it, open the original data set <a href=
"lenses.tab">lenses.tab</a> in your favorite text editor.
Alternatively, you can edit these files in a spreadsheet program
and save them in tab-delimited format, which is what the authors of
this text prefer. A snapshot of the data set as edited in Excel can
look like this:</p>

<img border=0 src="excel.png" alt="Data in Excel">
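<p>If you would rather produce such a file from a script than from
a spreadsheet, the following sketch writes the same kind of rows
with tab separators using Python&rsquo;s standard csv module. It is
a Python 3 illustration that does not use Orange, and it writes to
an in-memory buffer rather than to a real file. Printing a line
with repr() makes the otherwise invisible tab characters show up as
\t.</p>

```python
import csv
import io

# Three header rows followed by one data row, as in lenses.tab.
rows = [
    ["age", "prescription", "astigmatic", "tear_rate", "lenses"],
    ["discrete", "discrete", "discrete", "discrete", "discrete"],
    ["", "", "", "", "class"],
    ["young", "myope", "no", "reduced", "none"],
]

# Write to an in-memory buffer; pass an open file object instead of
# the buffer to create a real .tab file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

text = buffer.getvalue()
lines = text.splitlines()
first_line = lines[0]

# repr() shows each tab as \t, so the separators become visible.
print(repr(first_line))
print("tabs in first line:", first_line.count("\t"))
```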

<p>To load the file <a href="lenses.tab">lenses.tab</a> in
Orange, first decide in which folder on your hard disk you want to
put it (you may need to create this folder first). Say you want to
work in the c:\orange directory. Right-click on <a href=
"lenses.tab">lenses.tab</a>, choose the &ldquo;Save Target
As&hellip;&rdquo; command, and save lenses.tab in c:\orange. Let us
first see how you would load the file from the Command Prompt.
Open a command prompt window
(Start-&gt;Programs-&gt;Accessories-&gt;Command Prompt), change the
directory to c:\orange, and run Python. Import the Orange library
and use ExampleTable to read in the data. The whole dialog should
look something like this (the text typed in by the user is marked
in bold):</p>

<pre class=code>
> <b>cd c:\orange</b>
> <b>python</b>
>>> <b>import orange</b>
>>> <b>data = orange.ExampleTable("lenses")</b>
</pre>

<p>This creates an object called data that holds your data set and
information about the lenses domain. Note that no suffix was needed
for the file name: Orange searches the current directory and checks
whether a file of any type it knows is available under that name.
This time, it found lenses.tab.</p>
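<p>The idea of finding a file from its name without a suffix can be
pictured with a small standard-library sketch. This is not
Orange&rsquo;s actual search code: the helper name find_data_file
and the list of suffixes it tries are assumptions made for
illustration, and the code is written for Python 3.</p>

```python
import os
import tempfile

# Assumed list of suffixes to try, in order. The formats Orange really
# recognizes, and the order in which it tries them, may differ.
KNOWN_SUFFIXES = (".tab", ".data")


def find_data_file(stem, folder="."):
    """Return the first existing file named stem + suffix, or None.

    Hypothetical helper for illustration; this is not part of Orange.
    """
    for suffix in KNOWN_SUFFIXES:
        candidate = os.path.join(folder, stem + suffix)
        if os.path.isfile(candidate):
            return candidate
    return None


# Try it in a throwaway folder that contains only an empty lenses.tab.
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "lenses.tab"), "w").close()
    found = find_data_file("lenses", folder)
    found_name = os.path.basename(found)
    missing = find_data_file("car", folder)

print(found_name)
print(missing)
```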

<p>How do we know that data really contains our data set?
Let&rsquo;s check by printing the attribute names and the first
3 data items:</p>

<pre class=code>
>>> <b>print data.domain.attributes</b>
&lt;age, prescription, astigmatic, tear_rate&gt;
>>> <b>for i in range(3):</b>
...     print data[i]
...     
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
>>>
</pre>

<p>Now let&rsquo;s put together a script file that reads the lenses
data, prints out the names of the attributes and the class, and
lists the first 5 data instances:</p>


<p class=header><a href="lenses.py">lenses.py</a> (uses <a href="lenses.tab">lenses.tab</a>)</p>
<xmp class=code>import orange
data = orange.ExampleTable("lenses")
print "Attributes:",
for i in data.domain.attributes:
    print i.name,
print
print "Class:", data.domain.classVar.name

print "First 5 data items:"
for i in range(5):
    print data[i]
</xmp>

<p>A few comments on this script are in order. First, note that
data is an object that holds both the data and information on the
domain. We show above how to access attribute and class names, but
as you may expect, there is much more information there, including
each attribute&rsquo;s type, the values it may hold, etc. Also
notice the particular syntax Python uses for &ldquo;for&rdquo;
loops: the line that declares the loop ends with &ldquo;:&rdquo;,
and whatever is in the loop is indented (every statement within a
loop must be indented by the same amount).</p>

<p>Put the <a href="lenses.py">lenses.py</a> script in your
working directory (e.g., c:\orange or similar). This directory
should now contain the files lenses.py and lenses.tab. Now
let&rsquo;s see what happens when we run the script we have just
written:</p>

<pre class=code>
> <b>cd c:\orange</b>
> <b>python lenses.py</b>
Attributes: age prescription astigmatic tear_rate
Class: lenses
First 5 data items:
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
['young', 'myope', 'yes', 'normal', 'hard']
['young', 'hypermetrope', 'no', 'reduced', 'none']
>
</pre>

<p>Now, we promised to say something about C4.5 data files, whose
syntax is fairly common within the machine learning community due
to the extensive use of this program. C4.5 data sets are described
in two files: the file with the extension &ldquo;.data&rdquo; holds
the actual data, whereas the domain (attribute and class names and
types) is described in a separate &ldquo;.names&rdquo; file.
Instead of going into how exactly these files are formed, we just
show by example that Orange can handle them. For this purpose,
download <a href="car.data">car.data</a> and <a href=
"car.names">car.names</a> of the <a href=
"/blaz/hint/car_dataset.asp">car evaluation dataset</a>, and run the
following code from your Command Prompt:</p>

<p class="header">loading of C4.5 file (uses <a href="car.data">car.data</a>
and <a href="car.names">car.names</a>)</p>
<pre class=code>
> <b>python</b>
>>> <b>import orange</b>
>>> <b>car_data = orange.ExampleTable("car")</b>
>>> <b>print car_data.domain.attributes</b>
&lt;buying, maint, doors, persons, lugboot, safety&gt;
>>>
</pre>

<p>If you prefer storing domain information and data in a single
file, or if you would rather look at your data in a spreadsheet,
you may now save your C4.5 data in Orange&rsquo;s native (.tab)
format:</p>

<pre class=code>
>>> <b>orange.saveTabDelimited("car.tab", car_data)</b>
>>>
</pre>

<p>Similarly, saving to C4.5 format is possible through
orange.saveC45.</p>

<p>All of the above applies if you run Python from the Command
Prompt. If you use PythonWin, however, you have to tell it exactly
where your data is located. You may either specify the absolute
path of your data files, like this (type your commands in the
Interactive Window):</p>

<pre class=code>
>>> <b>car_data = orange.ExampleTable("c:\\orange\\car")</b>
>>>
</pre>

<p>or set a working directory through Python&rsquo;s os
library:</p>

<pre class=code>
>>> <b>import os</b>
>>> <b>os.chdir("c:\\orange")</b>
>>>
</pre>

<p>Double backslashes (&ldquo;\\&rdquo;) are needed because Python
treats a single backslash in an ordinary string as the start of an
escape sequence. If you do not like this (I don&rsquo;t), you can
put &ldquo;r&rdquo; in front of any string that specifies a file
path, like:</p>

<pre class=code>
>>> <b>car_data = orange.ExampleTable(r"c:\orange\car")</b>
>>>
</pre>
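<p>The difference between the two spellings can be checked directly
in Python. The short sketch below needs no Orange and is written
for Python 3. The folder name c:\temp is a hypothetical example
chosen only because \t happens to be an escape sequence; it does
not appear elsewhere in this tutorial.</p>

```python
# Two spellings of the same Windows path used in this tutorial.
escaped = "c:\\orange\\car"
raw = r"c:\orange\car"
print(escaped == raw)

# Hypothetical folder name, chosen because "\t" is an escape sequence:
# without doubling the backslash or using the r prefix, Python stores
# a tab character instead of a backslash followed by the letter t.
broken = "c:\temp"
fixed = r"c:\temp"

# repr() makes the difference visible.
print(repr(broken), len(broken))
print(repr(fixed), len(fixed))
```

<p>A path damaged in this way usually shows up only later, as a
&ldquo;file not found&rdquo; error, which is why doubling the
backslashes or using the r prefix consistently is a good
habit.</p>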

<p>In PythonWin, you would probably like to use the scripts that
come with this tutorial without editing them every time to type in
your specific file paths. To do this, just after opening PythonWin,
change the working directory in the Interactive Window (use
os.chdir() as described above). Now open the script (say,
lenses.py) using the File-&gt;Open&hellip; menu. To run it, make
sure the script&rsquo;s window is active and press F5. The output
of your script is printed in the Interactive Window, just like in
the snapshot below.</p>

<img src="python_win_source.png" alt="Orange Scripting in PythonWin">

<p>So much for loading the data. Remember, we have also learned
how to save a data set and how to print out some basic information
on the data domain. In the next lesson, you will learn how
to <a href="basic_exploration.htm">extract some more basic
information on the data sets</a>, and how to use Python to derive
some basic statistics of the data.</p>

<hr><br><p class="Path">
Prev: <a href="start.htm">Start with Orange</a>,
Next: <a href="basic_exploration.htm">Basic Data Exploration</a>,
Up: <a href="default.htm">On Tutorial 'Orange for Beginners'</a>
</p>

</body></html>