source: orange/docs/reference/rst/Orange.data.formats.rst @ 9531:feba07f2b199

Revision 9531:feba07f2b199, 5.3 KB checked in by jzbontar <jure.zbontar@…>, 2 years ago

Basket format documentation

.. py:currentmodule:: Orange.data

=======================
Loading and saving data
=======================

Tab-delimited format
====================
Orange prefers to open data files in its native, tab-delimited format. This format lets us specify feature types
and optional flags along with the feature names, which can often result in shorter loading times. This additional
information is provided in a three-line header: the first line contains the feature names, the second the feature
types, and the third the optional flags.

Example of the iris dataset in tab-delimited format (:download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/iris.tab
   :lines: 1-7

Feature types
-------------
 * discrete (or d) - imported as Orange.data.variable.Discrete
 * continuous (or c) - imported as Orange.data.variable.Continuous
 * string - imported as Orange.data.variable.String
 * basket - used for storing sparse data. More on basket formats in a dedicated section.

Optional flags
--------------
 * ignore (or i) - feature will not be imported
 * class (or c) - feature will be imported as the class variable. Only one feature can be marked as class.
 * multiclass - feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
 * meta (or m) - feature will be imported as a meta attribute.
 * -dc

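The three-line header can be illustrated with a small stand-alone sketch. The
helper ``parse_tab_header`` below is hypothetical and is not Orange's own
reader; it only pairs each feature name with the type and flags found in the
same column. The sample header is modelled on the iris data rather than copied
from ``iris.tab``, and treating several flags in one cell as space-separated is
an assumption.

```python
# Hypothetical helper, not part of Orange: reads the three header lines of a
# tab-delimited file and pairs each feature name with its type and flags.
def parse_tab_header(lines):
    names, types, flags = (line.rstrip("\n").split("\t") for line in lines[:3])
    # Pad short rows so a missing trailing cell reads as an empty string.
    width = len(names)
    types += [""] * (width - len(types))
    flags += [""] * (width - len(flags))
    return [
        # Assumption: multiple flags in one cell are separated by spaces.
        {"name": n, "type": t, "flags": f.split()}
        for n, t, f in zip(names, types, flags)
    ]


# Sample header modelled on the iris data (an assumption, not a copy of iris.tab).
header = [
    "sepal length\tsepal width\tpetal length\tpetal width\tiris\n",
    "c\tc\tc\tc\td\n",
    "\t\t\t\tclass\n",
]
columns = parse_tab_header(header)
```

Here the last column is described as a discrete feature carrying the ``class``
flag, while the first four are continuous features with no flags.
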
Baskets
-------

Baskets can be used for storing sparse data in tab-delimited files. They were
specifically designed for text mining needs. If text mining and sparse data are
not your business, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` will be created and added to the domain
as optional if it is not already there. A meta value for that variable will be
added to the example. If the value is 1, you can omit the ``=<value>`` part.

It is not possible to put meta attributes of types other than continuous in the
basket.

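The atom syntax can be sketched in a few lines of plain Python. The function
``parse_basket_cell`` is a hypothetical illustration, not Orange's parser: it
splits one basket cell on whitespace, treats a missing ``=<value>`` as 1, and
sums the values of repeated names, which is the behaviour shown by the example
file in this section.

```python
from collections import defaultdict


# Hypothetical helper, not part of Orange: turns one basket cell into a
# mapping from meta attribute name to its continuous value.
def parse_basket_cell(cell):
    values = defaultdict(float)
    for atom in cell.split():
        name, sep, value = atom.partition("=")
        # An atom without "=<value>" counts as 1; repeated names are summed.
        values[name] += float(value) if sep else 1.0
    return dict(values)


first = parse_basket_cell("a b a c")
second = parse_basket_cell("b=2 d")
empty = parse_basket_cell("")
```

An empty cell yields an empty mapping, so an example without basket data
receives no optional meta values.
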
A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional.  ::

    >>> d.domain.getmetas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
    >>> d.domain.getmetas(False)
    {-22: FloatVariable 'Ca'}
    >>> d.domain.getmetas(True)
    {-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on meta
attributes in Domain and on the basket file format (a simple format that is
limited to baskets only).

Basket Format
=============

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here's an example of such a file. ::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

The attributes that appear in the domain aren't defined in any headers or even
separate files, as they are in other formats supported by Orange.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.

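The summing rule can be checked with a short stand-alone sketch. The function
``parse_basket_line`` is hypothetical and only mimics the behaviour described
above; it is not the reader Orange uses. Stripping the whitespace around each
item and keeping names that contain spaces (such as "oh damn") are assumptions.

```python
# Hypothetical helper, not part of Orange: parses one line of a .basket file
# into a mapping from attribute name to the sum of its values.
def parse_basket_line(line):
    values = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty items, e.g. after a trailing comma
        name, sep, value = item.partition("=")
        name = name.strip()
        # A bare name counts as 1.0; repeated names accumulate.
        values[name] = values.get(name, 0.0) + (float(value) if sep else 1.0)
    return values


second = parse_basket_line(
    "our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise"
)
fourth = parse_basket_line("to, the, Pope, and, nice, red, uniforms, oh damn")
```

For the second line of the sample file this gives 6.0 for "surprise" and 2.0
for "fear", in agreement with the values quoted above.
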
All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given explicitly with a value of zero). See also
the section on meta-attributes in the reference for domain descriptors.

Note that at the time of writing this reference, only association rules can
directly use examples presented in the basket format.


Other supported data formats
============================
Orange can import data from CSV or tab-delimited files where the first line contains attribute names, followed by
lines containing data. For such files, Orange tries to guess the feature types and treats the right-most
column as the class variable. If the feature types are known in advance, Orange's native tab-delimited format should be used.
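
When the types are known, a plain CSV file can be rewritten with the three-line
header before it is loaded. The sketch below uses only the Python standard
library; the function ``csv_to_tab`` and the sample column names are
hypothetical, and the caller supplies one type and one flag entry per column
using the abbreviations listed in the sections on feature types and flags.

```python
import csv
import io


# Hypothetical helper, not part of Orange: converts CSV text with a single
# header row into tab-delimited text with a names/types/flags header.
def csv_to_tab(csv_text, types, flags):
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, data = rows[0], rows[1:]
    if not (len(names) == len(types) == len(flags)):
        raise ValueError("need one type and one flag entry per column")
    lines = ["\t".join(row) for row in [names, list(types), list(flags)] + data]
    return "\n".join(lines) + "\n"


# Two continuous features and a discrete class column (sample data only).
tab_text = csv_to_tab(
    "x,y,label\n1.5,2,a\n0.3,4,b\n",
    types=["c", "c", "d"],
    flags=["", "", "class"],
)
```

The resulting text can be written to a ``.tab`` file, so that the types and the
class column are stated explicitly rather than guessed.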