source: orange/docs/reference/rst/Orange.data.formats.rst @ 11028:009ba5a75e30

Revision 11028:009ba5a75e30, 5.5 KB checked in by Miha Stajdohar <miha.stajdohar@…>, 18 months ago

Added a common documentation index.

.. py:currentmodule:: Orange.data

.. _Orange-data-formats:

=======================
Loading and saving data
=======================

.. _tab-delimited:

Tab-delimited format
====================
Orange prefers to open data files in its native, tab-delimited format. This format lets you specify
feature types and optional flags along with the feature names, which can often result in shorter
loading times. This additional information is provided in a three-line header: the first line
contains the variable names, the second their types, and the third optional flags.

Example of the iris dataset in tab-delimited format (:download:`iris.tab <code/iris.tab>`):

.. literalinclude:: code/iris.tab
   :lines: 1-7

Feature types
-------------
 * discrete (or d) - imported as :obj:`Orange.feature.Discrete`
 * continuous (or c) - imported as :obj:`Orange.feature.Continuous`
 * string - imported as :obj:`Orange.feature.String`
 * basket - used for storing sparse data. Baskets are described in their own section below.

Optional flags
--------------
 * ignore (or i) - the feature will not be imported.
 * class (or c) - the feature will be imported as the class variable. Only one feature can be marked as class.
 * multiclass - the feature is one of multiple classes. Data can have both multiple classes and an ordinary class.
 * meta (or m) - the feature will be imported as a meta attribute.
 * -dc

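The three-line header described above can be sketched in plain Python. This is
only an illustration of the layout, not Orange's own loader: the function name
``parse_tab_header``, the two alias tables and the sample header are assumptions
made for this example, and each flag cell is assumed to hold at most one flag.

```python
# Short aliases taken from the lists above; the rest of this sketch is hypothetical.
TYPE_ALIASES = {"d": "discrete", "c": "continuous"}
FLAG_ALIASES = {"i": "ignore", "c": "class", "m": "meta"}


def parse_tab_header(lines):
    """Return one dict per column from the first three lines of a tab-delimited file."""
    names, types, flags = (line.rstrip("\n").split("\t") for line in lines[:3])
    columns = []
    for index, name in enumerate(names):
        # The type and flag rows may be shorter than the name row.
        raw_type = types[index].strip() if index < len(types) else ""
        raw_flag = flags[index].strip() if index < len(flags) else ""
        columns.append({
            "name": name.strip(),
            "type": TYPE_ALIASES.get(raw_type, raw_type),
            "flag": FLAG_ALIASES.get(raw_flag, raw_flag),
        })
    return columns


# A made-up header in the spirit of the iris example (not copied from iris.tab).
header = [
    "sepal length\tpetal width\tiris\n",
    "c\tc\td\n",
    "\t\tclass\n",
]
columns = parse_tab_header(header)
for column in columns:
    print(column)
```

Note that ``c`` means continuous in the type row but class in the flag row, which
is why the sketch keeps two separate alias tables.
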
Baskets
-------

Baskets can be used for storing sparse data in tab-delimited files. They were
designed specifically for text mining. If you do not work with text mining or
sparse data, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` is created and added to the domain
as optional if it is not already there, and a meta value for that variable is
added to the example. If the value is 1, you can omit the ``=<value>`` part.

A basket cannot contain meta attributes of any type other than continuous.

A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional. ::

    >>> d.domain.getmetas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
    >>> d.domain.getmetas(False)
    {-22: FloatVariable 'Ca'}
    >>> d.domain.getmetas(True)
    {-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

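The way a basket cell turns into meta values can be mimicked in a few lines of
plain Python. This is a minimal sketch of the rules stated above (space-separated
atoms, a default value of 1, repeated names added together); the function name
``parse_basket_cell`` is hypothetical and the code does not use Orange itself.

```python
from collections import defaultdict


def parse_basket_cell(cell):
    """Turn a basket cell such as 'a b a c' into a {name: value} dict."""
    values = defaultdict(float)
    for atom in cell.split():
        name, separator, value = atom.partition("=")
        # A bare name counts as 1; repeated names accumulate.
        values[name] += float(value) if separator else 1.0
    return dict(values)


# The four basket cells from the example file above (the third one is empty).
for cell in ["a b a c", "b=2 d", "", "c=13"]:
    print(repr(cell), "->", parse_basket_cell(cell))
```

An empty cell yields an empty dictionary, which mirrors the fact that the
optional meta attributes appear only in the examples where they are defined.
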
To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).

.. _basket-format:

Basket Format
=============

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file, written as a comma-separated
list of name-value pairs. Here is an example of such a file::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have the default value of 1.0 and the last has a value of 5.0.

Unlike in other formats supported by Orange, the attributes that appear in the
domain are not defined in a header or in a separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.

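The addition rule can be checked against the second line of the example file
with a short plain-Python sketch. The helper ``parse_basket_line`` is a
hypothetical name and only imitates the behaviour described here; it is not the
reader that Orange uses.

```python
def parse_basket_line(line):
    """Sum the values of the comma-separated items on one line of a .basket file."""
    totals = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty items, e.g. after a trailing comma
        name, separator, value = item.partition("=")
        name = name.strip()
        # An item without '=<value>' has the default value of 1.0.
        totals[name] = totals.get(name, 0.0) + (float(value) if separator else 1.0)
    return totals


second = parse_basket_line(
    "our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise"
)
print(second["surprise"], second["fear"])
```

Because items are split on commas only, a name containing a space, such as
"oh damn" in the last line of the file, stays a single attribute in this sketch.
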
All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given explicitly with a value of zero). See also
the section on :ref:`meta attributes <meta-attributes>` in the reference for domain
descriptors.

Note that at the time of writing, only association rules can
directly use examples presented in the basket format.


Other supported data formats
============================
Orange can import data from CSV or tab-delimited files in which the first line contains attribute names and the
remaining lines contain data. For such files, Orange tries to guess the feature types and treats the right-most
column as the class variable. If the feature types are known in advance, use Orange's native tab-delimited format.
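
To give a feel for what guessing involves, here is a rough plain-Python sketch.
The heuristic is an assumption made for illustration (a column is treated as
continuous when every non-empty cell parses as a number, and as discrete
otherwise); the rules Orange actually applies are not described here and may
differ. The names ``guess_column_types`` and ``_is_number`` are hypothetical.

```python
import csv
import io


def _is_number(text):
    """Return True if the text can be read as a floating-point number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def guess_column_types(text, delimiter=","):
    """Guess a type per column and pick the right-most column as the class."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    names, data = rows[0], rows[1:]
    types = []
    for index in range(len(names)):
        # Empty cells are treated as missing and ignored (an assumption).
        cells = [row[index] for row in data if index < len(row) and row[index] != ""]
        is_numeric = bool(cells) and all(_is_number(cell) for cell in cells)
        types.append("continuous" if is_numeric else "discrete")
    return names, types, names[-1]


sample = "a,b,y\n1.5,red,yes\n2,blue,no\n"
names, types, class_name = guess_column_types(sample)
print(list(zip(names, types)), "class:", class_name)
```

A guess of this kind can go wrong, for instance when a discrete feature is coded
with digits, which is one reason to state the types explicitly when they are known.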