source: orange/docs/reference/rst/Orange.data.formats.rst @ 9884:52decd2d77eb

Revision 9884:52decd2d77eb, 5.4 KB checked in by janezd <janez.demsar@…>, 2 years ago

Polished documentation on data.formats

.. py:currentmodule:: Orange.data

=======================
Loading and saving data
=======================

Tab-delimited format
====================
Orange prefers to open data files in its native, tab-delimited format. This format lets us specify the types of
features and optional flags along with the feature names, which can often result in shorter loading times. This
additional data is provided in the form of a three-line header: the first line contains the variable
names, the second line their types, and the third line optional
flags.

Example of the iris dataset in tab-delimited format (:download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/iris.tab
   :lines: 1-7

Feature types
-------------
 * discrete (or d) - imported as ``Orange.data.variable.Discrete``
 * continuous (or c) - imported as ``Orange.data.variable.Continuous``
 * string - imported as ``Orange.data.variable.String``
 * basket - used for storing sparse data; see the dedicated section on baskets below.

Optional flags
--------------
 * ignore (or i) - the feature will not be imported.
 * class (or c) - the feature will be imported as the class variable. Only one feature can be marked as class.
 * multiclass - the feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
 * meta (or m) - the feature will be imported as a meta attribute.
 * -dc
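
To make the header layout concrete, here is a minimal sketch of reading the three header lines by hand. It is not Orange's own parser: the function ``parse_header``, the alias tables and the sample column names are hypothetical, and only the abbreviations listed above are expanded.

```python
# Illustrative only: a hand-rolled reader for the three-line header.
# This is NOT Orange's parser; all names here are hypothetical.

# Abbreviations taken from the lists above. Note that "c" means
# "continuous" on the type line but "class" on the flag line.
TYPE_ALIASES = {"d": "discrete", "c": "continuous"}
FLAG_ALIASES = {"i": "ignore", "c": "class", "m": "meta"}


def parse_header(lines):
    """Return one dict per column from the first three lines of a file."""
    names, types, flags = (line.rstrip("\n").split("\t") for line in lines[:3])
    columns = []
    for i, name in enumerate(names):
        # The type and flag lines may be shorter than the name line.
        type_ = types[i].strip() if i < len(types) else ""
        flag = flags[i].strip() if i < len(flags) else ""
        columns.append({
            "name": name,
            "type": TYPE_ALIASES.get(type_, type_),
            "flag": FLAG_ALIASES.get(flag, flag),
        })
    return columns


# A made-up header: names, then types, then optional flags.
header = [
    "K\tCa\tBa\ty\n",
    "c\tc\tc\td\n",
    "\tmeta\ti\tclass\n",
]
columns = parse_header(header)
for column in columns:
    print(column)
```

Keeping separate alias tables for the type line and the flag line avoids confusing the two meanings of ``c``.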

Baskets
-------

Baskets can be used for storing sparse data in tab-delimited files. They were
specifically designed for text mining needs. If text mining and sparse data are
not your business, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` will be created and added to the domain
as optional if it is not already there. A meta value for that variable will be
added to the example. If the value is 1, you can omit the ``=<value>`` part.

It is not possible to put meta attributes of types other than continuous in the
basket.

A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.
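
The way a basket cell turns into meta values can be sketched in a few lines of plain Python. This is only an illustration of the rules described above, not Orange's implementation; ``parse_basket_cell`` is a hypothetical helper, and adding up repeated atoms that carry explicit values is an assumption carried over from the basket file format.

```python
from collections import Counter


def parse_basket_cell(cell):
    """Turn 'a b=2 a' into {'a': 2.0, 'b': 2.0}; repeated names add up."""
    values = Counter()
    for atom in cell.split():
        # "name=value" gives the value; a bare "name" counts as 1.
        name, _, value = atom.partition("=")
        values[name] += float(value) if value else 1.0
    return dict(values)


# The four basket cells from the example file above (the third is empty).
cells = ["a b a c", "b=2 d", "", "c=13"]
parsed = [parse_basket_cell(cell) for cell in cells]
for row in parsed:
    print(row)
```

An empty cell yields an empty dictionary, which mirrors the third example having no basket meta attributes at all.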

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional. With the data loaded into ``d``::

    >>> d.domain.getmetas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
    >>> d.domain.getmetas(False)
    {-22: FloatVariable 'Ca'}
    >>> d.domain.getmetas(True)
    {-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).

.. _basket-format:

Basket Format
=============

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here is an example of such a file::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

Unlike with other formats supported by Orange, the attributes that appear in
the domain are not defined in a header or in a separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.
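
The summing rule can be checked with a small sketch. The helper ``parse_basket_line`` is hypothetical and only mimics the behaviour described above; treating a name that contains a space (such as "oh damn") as a single attribute is an assumption.

```python
def parse_basket_line(line):
    """Parse one line of a .basket file into a {name: value} dict."""
    values = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty items, e.g. after a trailing comma
        name, _, value = item.partition("=")
        name = name.strip()
        # A bare name counts as 1.0; repeated names are added together.
        values[name] = values.get(name, 0.0) + (float(value) if value else 1.0)
    return values


first = parse_basket_line("nobody, expects, the, Spanish, Inquisition=5")
second = parse_basket_line(
    "our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise"
)
print(first["Inquisition"])                 # 5.0
print(second["surprise"], second["fear"])   # 6.0 2.0
```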

All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given explicitly with a value of zero). See also
the section on :ref:`meta attributes <meta-attributes>` in the reference for domain
descriptors.

Note that, at the time of writing, only association rules can
directly use examples presented in the basket format.


Other supported data formats
============================
Orange can import data from CSV or tab-delimited files in which the first line contains attribute names and the
remaining lines contain data. For such files, Orange tries to guess the types of features and treats the right-most
column as the class variable. If feature types are known in advance, Orange's native tab-delimited format should be used instead.
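
To give a feel for what type guessing involves, here is a rough sketch for a header-only CSV file. The exact heuristics Orange uses are not described here, so the rule below (a column is continuous if every non-empty cell parses as a number, discrete otherwise) is an assumption, and ``guess_columns`` is a hypothetical helper rather than part of Orange.

```python
import csv
import io


def is_number(text):
    """Return True if the text can be read as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def guess_columns(csv_text):
    """Guess a type for every column; mark the right-most one as class."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(names):
        # Ignore empty cells, which stand for missing values here.
        cells = [row[i] for row in data if i < len(row) and row[i] != ""]
        numeric = bool(cells) and all(is_number(cell) for cell in cells)
        columns.append({
            "name": name,
            "type": "continuous" if numeric else "discrete",
            "class": i == len(names) - 1,
        })
    return columns


# Made-up sample data, for illustration only.
sample = "height,weight,label\n1.8,75,tall\n1.6,60,short\n"
guessed = guess_columns(sample)
for column in guessed:
    print(column)
```

A guess like this can go wrong, for instance when a discrete feature is coded with digits, which is one reason to declare the types explicitly in the tab-delimited header.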