source: orange/docs/reference/rst/Orange.data.formats.rst @ 9535:6ad805782021

Revision 9535:6ad805782021, 5.4 KB checked in by jzbontar <jure.zbontar@…>, 2 years ago (diff)

Basket format documentation - added links

.. py:currentmodule:: Orange.data

=======================
Loading and saving data
=======================

Tab-delimited format
====================
Orange prefers to open data files in its native, tab-delimited format. This format lets us specify the types of features
and optional flags along with the feature names, which can often result in shorter loading times. This additional data
is provided in the form of a 3-line header: the first line contains the feature names, the second the feature types,
and the third the optional flags.

Example of the iris dataset in tab-delimited format (:download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/iris.tab
   :lines: 1-7

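To make the 3-line header concrete, here is a minimal sketch in plain Python that splits such a file into its header and data rows. It does not use Orange; the helper ``read_tab`` and the sample text are hypothetical and only mirror the layout described above (names, then types, then flags).

```python
import csv
import io

# A tiny, made-up sample in the layout described above (not the real iris.tab).
SAMPLE = (
    "sepal length\tpetal width\tiris\n"
    "c\tc\td\n"
    "\t\tclass\n"
    "5.1\t0.2\tIris-setosa\n"
    "7.0\t1.4\tIris-versicolor\n"
)


def read_tab(text):
    """Split tab-delimited text into (header, rows).

    header is a list of (name, type, flags) tuples, one per column,
    built from the first three lines; rows holds the remaining lines.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    names = next(reader)   # line 1: feature names
    types = next(reader)   # line 2: feature types
    flags = next(reader)   # line 3: optional flags (cells may be empty)
    header = list(zip(names, types, flags))
    rows = list(reader)    # everything after the header is data
    return header, rows


header, rows = read_tab(SAMPLE)
for name, kind, flag in header:
    print(name, kind, flag or "-")
```

The sketch keeps every cell as a string; interpreting the type and flag cells is a separate step.
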
Feature types
-------------
 * discrete (or d) - imported as Orange.data.variable.Discrete
 * continuous (or c) - imported as Orange.data.variable.Continuous
 * string - imported as Orange.data.variable.String
 * basket - used for storing sparse data. See the dedicated section on baskets below.

Optional flags
--------------
 * ignore (or i) - the feature will not be imported.
 * class (or c) - the feature will be imported as the class variable. Only one feature can be marked as class.
 * multiclass - the feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
 * meta (or m) - the feature will be imported as a meta attribute.
 * -dc

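The abbreviations above overlap: ``c`` means continuous in the type row but class in the flag row. The following sketch, in plain Python and independent of Orange, expands the abbreviations and checks the "only one class" rule. The helper names are hypothetical, and treating a flag cell as a space-separated list of flags is an assumption made for illustration.

```python
# Abbreviations taken from the two lists above; anything else is kept as written.
TYPE_ALIASES = {"d": "discrete", "c": "continuous"}
FLAG_ALIASES = {"i": "ignore", "c": "class", "m": "meta"}


def normalize(type_cell, flag_cell):
    """Expand the abbreviations in one column's type and flag cells."""
    type_cell = type_cell.strip()
    kind = TYPE_ALIASES.get(type_cell, type_cell)
    # Assumption: several flags in one cell are separated by spaces.
    flags = [FLAG_ALIASES.get(flag, flag) for flag in flag_cell.split()]
    return kind, flags


def class_column(columns):
    """Return the name of the class column, or None if there is none."""
    marked = [name for name, _kind, flags in columns if "class" in flags]
    if len(marked) > 1:
        raise ValueError("only one feature can be marked as class: %r" % marked)
    return marked[0] if marked else None


# A made-up header, given as parallel lists of names, types and flags.
names = ["sepal length", "id", "iris"]
types = ["c", "string", "d"]
flags = ["", "m", "c"]
columns = [(n,) + normalize(t, f) for n, t, f in zip(names, types, flags)]
print(columns)
print(class_column(columns))
```
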
Baskets
-------

Baskets can be used for storing sparse data in tab-delimited files. They were
specifically designed for text mining needs. If text mining and sparse data are
not your business, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` will be created and added to the domain
as optional if it is not already there. A meta value for that variable will be
added to the example. If the value is 1, you can omit the ``=<value>`` part.

It is not possible to put meta attributes of types other than continuous in the
basket.

A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

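The way a basket cell turns into meta values can be sketched in a few lines of plain Python. This is only an illustration of the rules stated above (space-separated atoms, a default value of 1, repeated names added up), not Orange's own loader; ``parse_basket`` is a hypothetical helper.

```python
def parse_basket(cell):
    """Turn a space-separated basket cell into a {name: value} dict."""
    values = {}
    for atom in cell.split():
        name, sep, raw = atom.partition("=")
        value = float(raw) if sep else 1.0  # a bare name counts as 1
        # Repeated names accumulate, which is why "a b a c" gives a == 2.
        values[name] = values.get(name, 0.0) + value
    return values


# The basket cells of the four data lines in the file above.
for cell in ["a b a c", "b=2 d", "", "c=13"]:
    print(repr(cell), "->", parse_basket(cell))
```

An empty cell yields an empty dict, which matches the third example carrying no basket values at all.
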
It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional.  ::

    >>> d.domain.getmetas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
    >>> d.domain.getmetas(False)
    {-22: FloatVariable 'Ca'}
    >>> d.domain.getmetas(True)
    {-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).

.. _basket-format:

Basket Format
=============

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here's an example of such a file. ::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

Unlike with other formats supported by Orange, the attributes that appear in
the domain aren't defined in any header or separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.

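The summing rule can be checked with a short sketch in plain Python. It follows only what is described above (comma-separated items, optional ``=<value>``, default 1.0, repeated names added up) and is not Orange's actual reader; ``parse_basket_line`` is a hypothetical name.

```python
def parse_basket_line(line):
    """Parse one line of a .basket file into a {name: value} dict."""
    values = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty items such as a trailing comma
        name, sep, raw = item.partition("=")
        name = name.strip()
        value = float(raw) if sep else 1.0  # default value is 1.0
        # Values of an attribute that appears more than once are added.
        values[name] = values.get(name, 0.0) + value
    return values


second = parse_basket_line(
    "our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise"
)
print(second["surprise"], second["fear"])
```

Because items are split on commas rather than spaces, a name such as ``oh damn`` in the last line of the sample file stays a single attribute.
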
All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given explicitly with a value of zero). See also
the section on :ref:`meta attributes <meta-attributes>` in the reference for domain
descriptors.

Note that at the time of writing this reference, only association rules can
directly use examples presented in the basket format.


Other supported data formats
============================
Orange can import data from CSV or tab-delimited files where the first line contains attribute names, followed by
lines containing data. For such files, Orange tries to guess the types of features and treats the right-most
column as the class variable. If feature types are known in advance, Orange's native tab-delimited format should be used instead.
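
As a rough illustration of what such guessing involves, the sketch below labels a column continuous when every value parses as a number and discrete otherwise, and marks the right-most column as the class. The numeric test is an assumption made for this example and is not Orange's actual heuristic; ``guess_columns`` is a hypothetical helper.

```python
import csv
import io


def is_number(text):
    """Return True if text can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_columns(text, delimiter=","):
    """Guess (name, type, role) for each column of a headered file.

    Assumption for illustration: all-numeric columns are continuous,
    everything else is discrete. The right-most column is the class.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    names = next(reader)  # first line: attribute names
    rows = list(reader)   # remaining lines: data
    guessed = []
    for index, name in enumerate(names):
        column = [row[index] for row in rows]
        numeric = bool(column) and all(is_number(v) for v in column)
        kind = "continuous" if numeric else "discrete"
        role = "class" if index == len(names) - 1 else "feature"
        guessed.append((name, kind, role))
    return guessed


print(guess_columns("a,b,y\n1.5,red,yes\n2,blue,no\n"))
```

A guess like this can go wrong, for example when a discrete feature is coded with digits, which is one reason to state the types explicitly in the native format.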