.. py:currentmodule:: Orange.data

.. _Orange-data-formats:

=======================
Loading and saving data
=======================

.. _tab-delimited:

Tab-delimited format
====================
Orange prefers to open data files in its native, tab-delimited format. This format allows us to specify the types of features
and optional flags along with the feature names, which can often result in shorter loading times. This additional data
is provided in the form of a 3-line header. The first line contains variable
names, followed by their types in the second line and optional
flags in the third.

Example of the iris dataset in tab-delimited format (:download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/iris.tab
   :lines: 1-7

Feature types
-------------
 * discrete (or d) - imported as :obj:`Orange.feature.Discrete`
 * continuous (or c) - imported as :obj:`Orange.feature.Continuous`
 * string - imported as :obj:`Orange.feature.String`
 * basket - used for storing sparse data. The basket format is described in a dedicated section below.

Optional flags
--------------
 * ignore (or i) - the feature will not be imported.
 * class (or c) - the feature will be imported as the class variable. Only one feature can be marked as class.
 * multiclass - the feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
 * meta (or m) - the feature will be imported as a meta attribute.
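
The three header lines can be read with nothing more than the standard ``csv`` module. The following is a minimal sketch and not Orange's own loader: the ``read_header`` function, the alias tables and the sample text are hypothetical names introduced for illustration, and the sketch assumes at most one flag per column.

```python
import csv
import io

# Short forms listed above, mapped to their long names.
# Note that "c" means "continuous" as a type but "class" as a flag.
TYPE_ALIASES = {"d": "discrete", "c": "continuous"}
FLAG_ALIASES = {"i": "ignore", "c": "class", "m": "meta"}


def read_header(text):
    """Return one dict per column with its name, type and flag (or None)."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    names, types, flags = next(rows), next(rows), next(rows)
    columns = []
    for i, name in enumerate(names):
        # Tolerate header rows that are shorter than the name row.
        type_ = types[i].strip() if i < len(types) else ""
        flag = flags[i].strip() if i < len(flags) else ""
        columns.append({
            "name": name,
            "type": TYPE_ALIASES.get(type_, type_),
            "flag": FLAG_ALIASES.get(flag, flag) or None,
        })
    return columns


# Hypothetical sample modelled on the iris data: four continuous
# features followed by a discrete class column.
sample = (
    "sepal length\tsepal width\tpetal length\tpetal width\tiris\n"
    "c\tc\tc\tc\td\n"
    "\t\t\t\tclass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
)
header = read_header(sample)
```

Keeping the type and flag aliases in separate tables matters because the abbreviation ``c`` is shared between the two header lines.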

Baskets
-------

Baskets can be used for storing sparse data in tab-delimited files. They were
specifically designed for text mining needs. If text mining and sparse data are
not your business, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` will be created and added to the domain
as optional if it is not already there. A meta value for that variable will be
added to the example. If the value is 1, you can omit the ``=<value>`` part.

It is not possible to put meta attributes of types other than continuous in the
basket.

A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional.  ::

    >>> d.domain.getmetas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
    >>> d.domain.getmetas(False)
    {-22: FloatVariable 'Ca'}
    >>> d.domain.getmetas(True)
    {-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).
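
To make the counting rule for a basket column concrete, here is a small pure-Python sketch of how the space-separated atoms from the sample file could be turned into values. It is only an illustration under the rules stated in this section (a missing ``=<value>`` means 1, repeated names add up); ``parse_basket_cell`` is a hypothetical helper, not part of Orange.

```python
def parse_basket_cell(cell):
    """Turn a basket cell such as 'a b a c' or 'b=2 d' into {name: value}."""
    values = {}
    for atom in cell.split():
        name, sep, raw = atom.partition("=")
        # An atom without "=<value>" counts as 1.
        value = float(raw) if sep else 1.0
        # Repeated names accumulate, so 'a b a c' gives a == 2.
        values[name] = values.get(name, 0.0) + value
    return values


# The four basket cells from the sample file above (the third is empty).
cells = ["a b a c", "b=2 d", "", "c=13"]
parsed = [parse_basket_cell(cell) for cell in cells]
```

An empty cell yields an empty dictionary, which mirrors the third example above where only the required meta attribute ``Ca`` is present.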

.. _basket-format:

Basket Format
=============

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here is an example of such a file. ::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

Unlike with other formats supported by Orange, the attributes that appear in the
domain are not defined in any header or separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.
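
Because a basket file carries no header, the set of attributes can only be discovered by reading the data. The sketch below illustrates this for the file shown above; it is not Orange's reader. The names ``BASKET_TEXT``, ``parse_basket_line`` and ``attribute_names`` are hypothetical, and splitting on commas and trimming surrounding whitespace is an assumption based on that sample.

```python
BASKET_TEXT = """\
nobody, expects, the, Spanish, Inquisition=5
our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
to, the, Pope, and, nice, red, uniforms, oh damn
"""


def parse_basket_line(line):
    """Parse one comma-separated line into a {name: summed value} dict."""
    values = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, raw = item.partition("=")
        name = name.strip()
        # The default value is 1.0; repeated attributes are added together.
        values[name] = values.get(name, 0.0) + (float(raw) if sep else 1.0)
    return values


examples = [parse_basket_line(line)
            for line in BASKET_TEXT.splitlines() if line.strip()]

# The "domain" is simply every name seen anywhere in the file.
attribute_names = sorted(set().union(*examples))
```

Each example stores only the attributes that occur on its own line, which is what makes the representation sparse.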

All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given explicitly, but initialized to zero). See also
the section on :ref:`meta attributes <meta-attributes>` in the reference for domain
descriptors.

Note that at the time of writing this reference, only association rules can
directly use examples presented in the basket format.


Other supported data formats
============================
Orange can import data from CSV or tab-delimited files where the first line contains attribute names, followed by
lines containing data. For such files, Orange tries to guess the type of features and treats the right-most
column as the class variable. If feature types are known in advance, Orange's native tab-delimited format should be used.
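
The exact rules Orange uses to guess feature types are not described here. Purely as an illustration of the idea, the sketch below applies one plausible heuristic to a headerless-type CSV file: a column is called continuous if every non-empty value parses as a number, and the right-most column is marked as the class. The heuristic, the function names and the sample data are all assumptions made for this example.

```python
import csv
import io


def guess_type(values):
    """Assumed heuristic: numeric-looking columns are continuous."""
    present = [v for v in values if v.strip()]
    if not present:
        return "discrete"
    try:
        for v in present:
            float(v)
    except ValueError:
        return "discrete"
    return "continuous"


def describe_csv(text):
    """Return name, guessed type and class marker for every column."""
    rows = list(csv.reader(io.StringIO(text)))
    names, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(names):
        column = [row[i] for row in data if i < len(row)]
        columns.append({
            "name": name,
            "type": guess_type(column),
            # The right-most column is treated as the class variable.
            "is_class": i == len(names) - 1,
        })
    return columns


# Hypothetical data: two numeric features and a symbolic class.
sample = "height,weight,sex\n1.80,77,m\n1.65,58,f\n"
columns = describe_csv(sample)
```

A heuristic like this can misjudge columns such as numeric codes that are really categories, which is one reason to state the types explicitly in the tab-delimited header when they are known.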