source: orange/Orange/doc/reference/tabdelimited.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 15.4 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

<html>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>

<H1>Tab-delimited and similar formats</H1>
<index name="tab-delimited file format">

<P>Besides supporting several common file formats used in machine learning (C4.5...), Orange introduces a more capable format that supports many additional features. It comes in several variations. The most powerful is the old-style tab-delimited file, whose header gives the names of the attributes, their types and their roles (ordinary attribute, class, meta-attribute). The new-style tab-delimited file has a simpler header; when attribute types are omitted, Orange attempts to guess them. Comma-separated files are essentially the same as new-style tab-delimited files, except that values are separated by commas instead of tabs. Finally, Orange for Windows can also read files in Excel format, provided that Excel is installed. Excel files are organized like new-style tab-delimited files.</P>

<P>All file formats begin with a header: three lines in the old style and a single line in the new style. The remaining lines contain examples. These are given as lists of symbolic values, separated by tabs in .tab and .txt files and by commas in .csv files, or occupying a row in an Excel file. In .tab, .txt and .csv files, lines that begin with "|" are comments and are ignored. Excel files have no comments. Lines that are entirely empty (except for the delimiters) are skipped.</P>
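
<P>The comment and empty-line rules above can be sketched in a few lines of plain Python. This is only an illustration and not Orange's own reader; the function name <CODE>read_rows</CODE> and the sample lines are made up for this example.</P>

```python
def read_rows(lines, delimiter="\t"):
    """Yield each row as a list of fields, skipping comments and empty lines.

    Hypothetical helper for illustration; Orange's parser is not shown here.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("|"):
            continue  # comment line
        fields = line.split(delimiter)
        if all(field.strip() == "" for field in fields):
            continue  # nothing on the line except delimiters
        yield fields


sample = ["a\tb", "| a comment", "\t", "1\t2", ""]
rows = list(read_rows(sample))
print(rows)
```

<P>For comma-separated files the same function would be called with <CODE>delimiter=","</CODE>.</P>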


<H3>Domain description - older version (.tab)</H3>

<P>The first line of a file in the older format contains the names of the attributes. Names can contain any character except CR, LF, NUL and TAB. Spaces are allowed.</P>

<P>The second line contains the types of the attributes, one entry for each attribute. Attributes can be of the following types.
<UL>
<LI><B>Discrete attributes</B>, denoted by <CODE>d</CODE> or <CODE>discrete</CODE>. Alternatively, you can list the possible values of the attribute instead of "d" or "discrete"; values should be separated by spaces. Spaces contained in values must be "escaped" (that is, preceded by a backslash): a value named "light blue" would be written as <CODE>light\ blue</CODE>. Listing the values is useful since it prescribes their order; if you only describe the attribute type by "d", the values will be ordered as they are encountered when reading the examples. The corresponding attribute descriptor is of type <A href="Variable.htm#EnumVariable"><CODE>EnumVariable</CODE></A>.</LI>
<LI><B>Continuous attributes</B>, denoted by <CODE>c</CODE> or <CODE>continuous</CODE>. They are described by an instance of <A href="Variable.htm#FloatVariable"><CODE>FloatVariable</CODE></A>.</LI>
<LI><B>String attributes</B>, marked by <CODE>string</CODE> and described by <a href="Variable.htm#StringVariable"><CODE>StringVariable</CODE></A>.</LI>
<LI><B>Basket</B>, marked by <CODE>basket</CODE>; this does not create a single attribute but rather tells the parser that this column lists values of optional continuous meta attributes. There can only be one basket. The column needs a name to simplify the parser, yet the name is not used anywhere. See the <a href="#basket">section on baskets</a> for details.</LI>
<LI><B>Python attributes</B> are meant for advanced users who want to store additional data with examples or who write specific learning algorithms that can make use of such data. The definition and use of such attributes is described on a separate page on the <a href="PythonVariable.htm"><CODE>Python attribute type</CODE></A>.</LI>
</UL>

<P><B>Change: </B> The (undocumented) symbols that could be used for declaring continuous and discrete attributes ('f' and 'float', and 'e' and 'enum') have been removed.</P>

<P>Note that at the moment, Orange learning methods can only use discrete and continuous attributes. String and Python attributes can be used as meta-attributes describing examples, or you can use them in specific learning methods and other algorithms.</P>

<P>The third line of the file contains optional flags.
<DL>
<DT><CODE>i</CODE> or <CODE>ignore</CODE></DT>
<DD>Attributes with this flag are ignored, <EM>i.e.</EM> not read into the table.</DD>

<DT><CODE>c</CODE> or <CODE>class</CODE></DT>
<DD>Denotes the class attribute. There can be at most one such attribute; if there is none, the last attribute is the class.</DD>

<DT><CODE>m</CODE> or <CODE>meta</CODE></DT>
<DD>Denotes a meta-attribute. Such attributes are not used by any learning algorithm (or by algorithms for, say, measuring distances between examples) but are stored with examples. Meta attributes are most often used for weighting the examples.</DD>

<DT><CODE>-dc</CODE></DT>
<DD>Followed by a value that serves as a "don't care" symbol for this attribute. This option can be given more than once for an attribute if don't cares are denoted by several different symbols. See below for the details.</DD>
</DL>

<P>A basket column can be ignored; other flags have no effect on it.</P>

<P>The first few lines of the iris dataset in this format look like this:</P>

<XMP class="code">sepal length   sepal width   petal length   petal width   iris
c              c             c              c             d
                                                          class
5.1            3.5           1.4            0.2           Iris-setosa
4.9            3.0           1.4            0.2           Iris-setosa
4.7            3.2           1.3            0.2           Iris-setosa
4.6            3.1           1.5            0.2           Iris-setosa
</XMP>
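
<P>To make the three-line header concrete, here is a minimal sketch in plain Python that combines the names, types and flags into one description per column. It is not Orange's parser: the function <CODE>parse_old_header</CODE> and the dictionary layout are assumptions made for this illustration, and value lists and escaped spaces in the type line are left unparsed.</P>

```python
ROLE_FLAGS = {"i": "ignore", "ignore": "ignore",
              "c": "class", "class": "class",
              "m": "meta", "meta": "meta"}


def parse_old_header(names_line, types_line, flags_line):
    """Combine the three header lines into one description per column.

    Hypothetical helper for illustration only.
    """
    names = names_line.rstrip("\r\n").split("\t")
    types = types_line.rstrip("\r\n").split("\t")
    flags = flags_line.rstrip("\r\n").split("\t")
    # Trailing empty cells may be missing, so pad the shorter lines.
    types += [""] * (len(names) - len(types))
    flags += [""] * (len(names) - len(flags))

    columns = []
    for name, type_, flag in zip(names, types, flags):
        role, dont_care = "attribute", []
        tokens = iter(flag.split())
        for token in tokens:
            if token == "-dc":
                # The token after -dc is a per-attribute "don't care" symbol.
                dont_care.append(next(tokens, ""))
            elif token in ROLE_FLAGS:
                role = ROLE_FLAGS[token]
        columns.append({"name": name, "type": type_.strip(),
                        "role": role, "dont_care": dont_care})

    # With no explicit class flag, the last ordinary attribute is the class.
    if not any(col["role"] == "class" for col in columns):
        for col in reversed(columns):
            if col["role"] == "attribute":
                col["role"] = "class"
                break
    return columns


iris = parse_old_header(
    "sepal length\tsepal width\tpetal length\tpetal width\tiris",
    "c\tc\tc\tc\td",
    "\t\t\t\tclass",
)
for column in iris:
    print(column["name"], column["type"], column["role"])
```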

<H3>Domain description - new version (.txt, .csv and .xls)</H3>

<P>The newer version of the tab-delimited format is much simpler yet still powerful. The domain description is given in a single line which, in its simplest form, contains only the names of the attributes. In this case, Orange determines the attribute types itself, using this procedure:</P>

<OL>
<LI>If an attribute descriptor with the same name is found among the known descriptors (passed by <CODE>use</CODE> or determined by reuse), it is used, which specifies the attribute type as well.</LI>
<LI>If the attribute is new, its values in the file are checked:
<UL>
<LI>attributes whose values are digits from 0-9 (or some subset of these) are discrete; this is to cover domains with coded attribute values,</LI>
<LI>attributes whose values can be parsed as numbers (in .txt) or whose cells contain numbers (in Excel) are continuous,</LI>
<LI>attributes which have more than 20 different values, of which less than half appear in more than one example, are strings and are put among meta attributes,</LI>
<LI>other attributes are discrete.</LI>
</UL>
Symbolic values representing unknown values ("?", "~", "NA"...) are ignored.</LI>
</OL>
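
<P>The value checks in the second step can be approximated as follows. This is a rough sketch in plain Python, not Orange's implementation: the function name <CODE>guess_type</CODE> and the exact set of unknown symbols are assumptions, and the first step (reusing known descriptors) is not modeled.</P>

```python
from collections import Counter

# Symbols treated as unknown and skipped; the exact set is an assumption.
UNKNOWN = {"", "?", "~", "*", "NA"}


def guess_type(values):
    """Guess 'discrete', 'continuous' or 'string' from a column's values."""
    known = [v for v in values if v not in UNKNOWN]

    # Only single digits 0-9: assume coded discrete values.
    if all(len(v) == 1 and v in "0123456789" for v in known):
        return "discrete"

    # Everything parses as a number: continuous.
    try:
        for v in known:
            float(v)
        return "continuous"
    except ValueError:
        pass

    # Many distinct values, most of them seen only once: string (meta).
    counts = Counter(known)
    repeated = sum(1 for n in counts.values() if n > 1)
    if len(counts) > 20 and repeated < len(counts) / 2:
        return "string"

    return "discrete"


print(guess_type(["0", "1", "?", "2"]))
print(guess_type(["5.1", "4.9", "NA"]))
print(guess_type(["Iris-setosa", "Iris-virginica", "Iris-setosa"]))
```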

<P>The last non-ignored non-meta attribute becomes the class attribute. It is not possible to specify a classless domain in these file formats.</P>

<P>This procedure is not foolproof. You can have continuous attributes whose values are accidentally only digits from 0-9. Or you can have a discrete attribute with values 1.1, 1.2, 1.3 and 2.1. You may also want to designate some other attribute as the class attribute, ignore another attribute and have a few meta attributes. This can be achieved with prefixes.</P>

<P>A prefixed attribute name consists of a one- or two-letter prefix, followed by "#" and the name. The first letter of the prefix can be "m" for meta-attributes, "i" to ignore the attribute, or "c" to define the class attribute. As always, only one attribute can be the class attribute. The second letter denotes the attribute type: "D" for discrete, "C" for continuous, "S" for string attributes and "B" for baskets.</P>

<P>In most cases, however, the attribute detection mechanism will suffice. Therefore, Iris can be given like this:</P>

<XMP class="code">sepal length   sepal width   petal length   petal width   iris
5.1            3.5           1.4            0.2           Iris-setosa
4.9            3.0           1.4            0.2           Iris-setosa
4.7            3.2           1.3            0.2           Iris-setosa
4.6            3.1           1.5            0.2           Iris-setosa
</XMP>

<P>If you would like to ignore the first attribute, use the second as the class, explicitly require the third attribute to be discrete and have the fourth attribute be a continuous weight, you would "correct" the first line like this:</P>

<XMP class="code">i#sepal length   c#sepal width    D#petal length    mC#petal width    iris
</XMP>
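
<P>The prefix notation can be decoded with a short routine such as the one below. It is a sketch in plain Python under the assumption that the lowercase letter gives the role and the uppercase letter the type, as in the header above; <CODE>parse_prefixed_name</CODE> is a made-up name and this is not how Orange itself is implemented.</P>

```python
ROLES = {"m": "meta", "i": "ignore", "c": "class"}
TYPES = {"D": "discrete", "C": "continuous", "S": "string", "B": "basket"}


def parse_prefixed_name(field):
    """Split e.g. 'mC#petal width' into name, role and type.

    Role and type are None when the prefix does not give them.
    Hypothetical helper for illustration only.
    """
    plain = {"name": field, "role": None, "type": None}
    prefix, separator, name = field.partition("#")
    if not separator or not 1 <= len(prefix) <= 2:
        return plain
    role_letter = prefix[0] if prefix[0] in ROLES else ""
    type_letter = prefix[len(role_letter):]
    if type_letter and type_letter not in TYPES:
        return plain  # not a recognized prefix; keep it as part of the name
    return {"name": name,
            "role": ROLES.get(role_letter),
            "type": TYPES.get(type_letter)}


header = "i#sepal length\tc#sepal width\tD#petal length\tmC#petal width\tiris"
columns = [parse_prefixed_name(field) for field in header.split("\t")]
for column in columns:
    print(column)
```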

<a name="basket"></a>
<H3>Baskets</H3>

<P>Baskets can be used for storing sparse data in tab-delimited files. They were specifically designed for text mining needs. If text mining and sparse data are not your business, you can skip this section.</P>

<P>Baskets are given as a list of space-separated <code>&lt;name&gt;=&lt;value&gt;</code> atoms. For each atom, a continuous meta attribute named &lt;name&gt; is created and added to the domain as optional if it is not already there, and a meta value for that variable is added to the example. If the value is 1, you can omit the <code>=&lt;value&gt;</code> part.</P>

<P>It is not possible to put meta attributes of types other than continuous in the basket.</P>

<P>A tab-delimited file with a basket can look like this:
<xmp class="code">K       Ca      b_foo     Ba  y
c       c       basket    c   c
        meta              i   class
0.06    8.75    a b a c   0   1
0.48            b=2 d     0   1
0.39    7.78              0   1
0.57    8.22    c=13      0   1</xmp>
These are the examples read from such a file:
<xmp class="code">[0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
[0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
[0.39, 1], {"Ca":7.78}
[0.57, 1], {"Ca":8.22, "c":13.000}</xmp>
</P>
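
<P>How a single basket cell turns into meta values can be sketched as follows. The function <CODE>parse_basket</CODE> is a made-up name for this illustration, not part of Orange. It assumes that repeated names are summed, which matches the value 2 for a in the first example above, and it does not handle escaped spaces.</P>

```python
def parse_basket(cell):
    """Turn 'a b=2 a' into a dict mapping names to float values.

    Hypothetical helper for illustration only.
    """
    values = {}
    for atom in cell.split():
        name, separator, value = atom.partition("=")
        # A bare name counts as 1; repeated names are summed (an assumption).
        amount = float(value) if separator else 1.0
        values[name] = values.get(name, 0.0) + amount
    return values


for cell in ["a b a c", "b=2 d", "", "c=13"]:
    print(repr(cell), parse_basket(cell))
```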

<P>It is recommended to have the basket as the last column, especially if it contains a lot of data.</P>

<P>Note a few things. The basket column's name, b_foo, is not used. In the first example, the value of a is 2 since it appears twice. The ordinary meta attribute, Ca, appears in all examples, even in those where its value is undefined. Meta attributes from the basket appear only where they are defined. This is due to the different nature of these meta attributes: Ca is required while the others are optional.
<xmp class="code">>>> d.domain.getmetas()
{-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
>>> d.domain.getmetas(False)
{-22: FloatVariable 'Ca'}
>>> d.domain.getmetas(True)
{-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
</xmp>
To fully understand all this, you should read the documentation on <a href="Domain.htm#meta-attributes">meta attributes in <code>Domain</code></a> and on the <a href="basket.htm">basket file format</a> (a simple format that is limited to baskets only).
</P>


<H3>Comma separated files</H3>

<P>Comma separated files are just like new-style tab-delimited files, except that commas are used instead of tabs. For instance, for the documentation on censoring, we downloaded the new Wisconsin breast cancer data from UCI:</P>

<XMP class=code>119513,N,31,18.02,27.6,117.5,1013,0.09489,0.1036, <...>
8423,N,61,17.99,10.38,122.8,1001,0.1184,0.2776, <...>
842517,N,116,21.37,17.44,137.5,1373,0.08836,0.1189, <...>
843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839, <...>
<...>
</XMP>

<P>To import this data into Orange, we only needed to add the first line, which gives the attribute names.</P>

<XMP class=code>m#ID,c#recur,time,radius,texture,perimeter,area,smoothness, <...>
</XMP>

<P>Since we don't want the ID to be used for learning, we turned it into a meta-attribute. We also needed to tell Orange that "recur" is the class attribute.</P>


<H3>Undefined values</H3>

<P>By default, empty fields, <CODE>?</CODE> and <CODE>NA</CODE> are interpreted as "don't care", and "~" and "*" as "don't know". You can't change this: these symbols are reserved.</P>

<P>You can, however, specify additional symbols to denote undefined values. This can be done either per attribute or for all attributes at once. Per-attribute unknowns are specified using the <CODE>-dc</CODE> option in old-style tab-delimited files. For instance, if unknowns for some attribute are given as "UNK", add <CODE>-dc UNK</CODE> in the third line. There is no similar option in .txt and .csv files.</P>

<P>General symbols for unknown values are not specified in the file but given as keyword arguments to <CODE>ExampleTable</CODE>. Three keywords are recognized: <CODE>DC</CODE> and <CODE>DK</CODE> give the symbols for don't cares and don't knows, respectively, and <CODE>NA</CODE> gives a symbol for both; <CODE>DC</CODE> and <CODE>DK</CODE> take priority over <CODE>NA</CODE>. Only one symbol can be specified for each kind of undefined value.</P>

<P>Although data in other formats (such as C4.5) can also be loaded by calling <CODE>ExampleTable</CODE>, these keyword arguments only affect the formats described on this page.</P>

<P>To show how this works, we shall use the file <A href="undefineds.tab">undefineds.tab</A>, which looks like this.</P>

<XMP class=code>a               b               c
d               d               d
-dc X -dc UNK   -dc UNAVAILABLE
0               0               0
1               1               1
                                1
*               *               *
?               ?               ?
.               .               .
GDC             GDC             GDC
GDK             GDK             GDK
X               X               X
UNK             UNK             UNK
UNAVAILABLE     UNAVAILABLE     UNAVAILABLE
</XMP>

<P>Let's load and print it.</P>

<p class="header">part of <a href="undefineds.py">undefineds.py</a> (uses <a href="undefineds.tab">undefineds.tab</a>)</p>
<XMP class=code>import orange
data = orange.ExampleTable("undefineds", DK="GDK", DC="GDC")

for ex in data:
    print ex
</XMP>

<P>Here's how the file is interpreted.</P>

<XMP class=code>['0', '0', '0']
['1', '1', '1']
['?', '?', '1']
['~', '~', '~']
['?', '?', '?']
['?', '?', '?']
['?', '?', '?']
['~', '~', '~']
['?', 'X', 'X']
['?', 'UNK', 'UNK']
['UNAVAILABLE', '?', 'UNAVAILABLE']
</XMP>

<P>As the call to <CODE>ExampleTable</CODE> specifies, the symbols GDC and GDK stand for don't care and don't know for all attributes. In addition, X and UNK denote don't cares for the first attribute, and UNAVAILABLE for the second. For the other attributes, these symbols are just ordinary values.</P>
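
<P>The interplay of reserved, per-attribute and general symbols can be imitated in plain Python. The sketch below is not Orange code: the function <CODE>interpret</CODE> and its argument names are assumptions made for this illustration. It also treats "." as a don't care because the listing above shows it read as "?", although the text does not list it among the reserved symbols. When only <CODE>NA</CODE> is given, the sketch reports it as a don't care, which is an arbitrary choice.</P>

```python
# Reserved symbols; "." is included only because the listing above shows it
# being read as "?" (an inference, not a documented rule).
DEFAULT_DC = {"", "?", "NA", "."}
DEFAULT_DK = {"~", "*"}


def interpret(value, attr_dc=(), DC=None, DK=None, NA=None):
    """Return '?' for don't care, '~' for don't know, else the value itself.

    Hypothetical helper for illustration only.
    """
    dc_symbol = DC if DC is not None else NA  # DC takes priority over NA
    dk_symbol = DK if DK is not None else NA  # DK takes priority over NA
    if value in DEFAULT_DC or value in attr_dc or value == dc_symbol:
        return "?"
    if value in DEFAULT_DK or value == dk_symbol:
        return "~"
    return value


# Per-attribute symbols from the third header line of undefineds.tab.
per_attribute = [("X", "UNK"), ("UNAVAILABLE",), ()]
symbols = ["0", "1", "*", "?", ".", "GDC", "GDK", "X", "UNK", "UNAVAILABLE"]
rows = [[s, s, s] for s in symbols]
rows.insert(2, ["", "", "1"])  # the row with two empty fields

result = [
    [interpret(v, dc, DC="GDC", DK="GDK") for v, dc in zip(row, per_attribute)]
    for row in rows
]
for row in result:
    print(row)
```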

<P>As you may have noticed, undefined values are printed as "?" and "~", regardless of how they were written in the files they were read from. Orange does not remember such details.</P>

<P>However, when saving the data back to files, you can specify the symbols to be used (for all attributes, not per attribute). This is done in a similar fashion as when reading the data: by giving the additional keyword arguments <CODE>DC</CODE>, <CODE>DK</CODE> and/or <CODE>NA</CODE> to the function <CODE>saveTabDelimited</CODE>, <CODE>saveTxt</CODE> or <CODE>saveCSV</CODE>.</P>

<P>For instance, if we save the file by <xmp class=CODE>orange.saveTabDelimited("undefined-saved-dc-dk", data, DC="GDC", DK="GDK")</xmp>all don't cares ("?") are written as "GDC" and don't knows as "GDK".</P>

<P>This mechanism should make it easier to export data to other data mining programs that can handle tab- or comma-delimited files. For specific problems, such as having several names denoting different types of unknowns, possibly in combination with other attribute values, you can easily program your own input/output routines in Python.</P>