source: orange/docs/reference/rst/Orange.data.formats.rst @ 9928:9dddd16dbc01

Revision 9928:9dddd16dbc01, 5.4 KB checked in by Matija Polajnar <matija.polajnar@…>, 2 years ago

Improve multilabel and multitarget documentation introduction.

.. py:currentmodule:: Orange.data

=======================
Loading and saving data
=======================

.. _tab-delimited:

Tab-delimited format
====================
Orange prefers to open data files in its native, tab-delimited format. This format lets us specify the types of features
and optional flags along with the feature names, which can often result in shorter loading times. This additional
information is provided in the form of a three-line header: the first line contains the variable
names, the second line their types, and the third line optional
flags.

Example of the iris dataset in tab-delimited format (:download:`iris.tab <code/iris.tab>`)

.. literalinclude:: code/iris.tab
   :lines: 1-7

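The three-line header described above can be read with a few lines of plain Python. The following is only a minimal sketch and not Orange's own loader: the function name ``read_tab_header`` and the sample data are hypothetical, and the sample is not the contents of ``iris.tab``.

```python
import csv
import io

# Hypothetical sample data, made up for illustration only.
SAMPLE = (
    "height\tcolour\tid\tlabel\n"
    "c\td\tstring\td\n"
    "\t\tmeta\tclass\n"
    "1.5\tred\tx1\tyes\n"
    "2.0\tblue\tx2\tno\n"
)


def read_tab_header(stream):
    """Return (columns, rows); columns is a list of (name, type, flag) tuples."""
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # first line: variable names
    types = next(reader)   # second line: variable types
    flags = next(reader)   # third line: optional flags
    # Pad short type/flag lines so that every column has an entry.
    width = len(names)
    types += [""] * (width - len(types))
    flags += [""] * (width - len(flags))
    columns = list(zip(names, types, flags))
    rows = list(reader)    # everything after the header is data
    return columns, rows


columns, rows = read_tab_header(io.StringIO(SAMPLE))
```

Here ``columns`` pairs every name with its type and flag, and ``rows`` holds the remaining lines as lists of strings.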
Feature types
-------------
 * discrete (or d) - imported as :obj:`Orange.feature.Discrete`
 * continuous (or c) - imported as :obj:`Orange.feature.Continuous`
 * string - imported as :obj:`Orange.feature.String`
 * basket - used for storing sparse data; see the section on baskets below.

Optional flags
--------------
 * ignore (or i) - the feature will not be imported.
 * class (or c) - the feature will be imported as the class variable. Only one feature can be marked as class.
 * multiclass - the feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
 * meta (or m) - the feature will be imported as a meta attribute.
 * -dc

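Note that the abbreviation ``c`` means "continuous" on the type line but "class" on the flag line. The sketch below expands the abbreviations listed above and enforces the single-class rule. It is an illustration only: ``expand_header`` and the two alias tables are hypothetical names, each flag cell is assumed to hold a single flag, and unknown tokens (such as ``-dc``) are passed through unchanged.

```python
# Abbreviations taken from the two lists above; the dictionaries themselves
# are illustrative and not part of Orange.
TYPE_ALIASES = {"d": "discrete", "c": "continuous"}
FLAG_ALIASES = {"i": "ignore", "c": "class", "m": "meta"}


def expand_header(types, flags):
    """Expand abbreviated types and flags; reject more than one class."""
    full_types = [TYPE_ALIASES.get(t, t) for t in types]
    # Assumption: one flag per cell; unrecognised flags are kept as written.
    full_flags = [FLAG_ALIASES.get(f, f) for f in flags]
    if full_flags.count("class") > 1:
        raise ValueError("only one feature can be marked as class")
    return full_types, full_flags


types, flags = expand_header(
    ["c", "c", "basket", "c", "c"],
    ["", "m", "", "i", "c"],
)
```

The same letter therefore expands differently depending on the header line in which it appears.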
Baskets
-------

Baskets can be used for storing sparse data in tab-delimited files. They were
specifically designed for text mining needs. If text mining and sparse data are
not your business, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` will be created and added to the domain
as optional if it is not already there. A meta value for that variable will be
added to the example. If the value is 1, you can omit the ``=<value>`` part.

It is not possible to put meta attributes of types other than continuous in the
basket.

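As a rough illustration of the atom syntax, the following sketch turns the content of one basket cell into a dictionary of values. It is not Orange's parser, and ``parse_basket_cell`` is a hypothetical name. Repeated names are summed, which matches the behaviour shown in the example later in this section.

```python
def parse_basket_cell(cell):
    """Parse space-separated ``name=value`` atoms into a dict of floats."""
    values = {}
    for atom in cell.split():
        name, sep, raw = atom.partition("=")
        # An atom without "=<value>" stands for the value 1.
        value = float(raw) if sep else 1.0
        # Repeated names accumulate.
        values[name] = values.get(name, 0.0) + value
    return values


first = parse_basket_cell("a b a c")
second = parse_basket_cell("b=2 d")
empty = parse_basket_cell("")
```

An empty cell yields an empty dictionary, so an example without basket data gets no basket meta values.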
A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional. ::

    >>> d.domain.getmetas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
    >>> d.domain.getmetas(False)
    {-22: FloatVariable 'Ca'}
    >>> d.domain.getmetas(True)
    {-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).

.. _basket-format:

Basket Format
=============

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here is an example of such a file. ::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

Unlike in other formats supported by Orange, the attributes that appear in the
domain are not defined in any header or separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.

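The summation rule can be checked with a small sketch that parses one line of a basket file. This is an illustration under stated assumptions, not Orange's reader: ``parse_basket_line`` is a hypothetical name, and whitespace around each comma-separated item is assumed to be insignificant.

```python
def parse_basket_line(line):
    """Parse one comma-separated basket line into a dict of floats."""
    values = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty items, e.g. from a trailing comma
        name, sep, raw = item.partition("=")
        # An item without "=<value>" has the default value of 1.0.
        value = float(raw) if sep else 1.0
        name = name.strip()
        # Values of repeated attributes are added.
        values[name] = values.get(name, 0.0) + value
    return values


first = parse_basket_line("nobody, expects, the, Spanish, Inquisition=5")
second = parse_basket_line(
    "our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise"
)
```

For the second line this gives 6.0 for "surprise" and 2.0 for "fear", as described above.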
All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given, but initialized to zero). See also the
section on :ref:`meta attributes <meta-attributes>` in the reference for domain
descriptors.

Note that, at the time of writing, only association rules can
directly use examples presented in the basket format.


Other supported data formats
============================
Orange can import data from CSV or tab-delimited files in which the first line contains attribute names and the
remaining lines contain data. For such files, Orange tries to guess the types of features and treats the right-most
column as the class variable. If feature types are known in advance, Orange's native tab-delimited format should be used.
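To give a feeling for what such guessing involves, here is a deliberately simple sketch. The heuristic is an assumption made for illustration (a column is continuous if every non-empty cell parses as a number) and is not Orange's actual rule; ``guess_columns`` and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical CSV content, made up for illustration only.
SAMPLE = "width,shape,label\n1.5,round,yes\n2,square,no\n"


def is_number(text):
    """Return True if the text can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_columns(stream):
    """Return a list of (name, guessed type, role) tuples."""
    reader = csv.reader(stream)
    names = next(reader)          # first line: attribute names
    rows = list(reader)           # remaining lines: data
    guessed = []
    for index, name in enumerate(names):
        cells = [row[index] for row in rows if row[index] != ""]
        numeric = bool(cells) and all(is_number(c) for c in cells)
        kind = "continuous" if numeric else "discrete"
        # The right-most column is treated as the class variable.
        role = "class" if index == len(names) - 1 else "feature"
        guessed.append((name, kind, role))
    return guessed


guessed = guess_columns(io.StringIO(SAMPLE))
```

A guess like this can go wrong, for instance when a discrete feature is coded with numbers, which is one reason to declare the types explicitly in the header.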