.. Source: orange/docs/reference/rst/Orange.data.table.rst, revision
   9848:00f5832a4c71, checked in by Matija Polajnar <matija.polajnar@…>.
   Commit message: "Fix rst indentation."

.. py:currentmodule:: Orange.data

======================
Data table (``Table``)
======================

Class :obj:`Orange.data.Table` holds a list of data instances of type
:obj:`Orange.data.Instance`. All instances belong to the same domain
(:obj:`Orange.data.Domain`).

Data tables are usually loaded from a file (see :doc:`Orange.data.formats`)::

    import Orange
    data = Orange.data.Table("titanic")

Data tables can also be created programmatically, as in the :ref:`code
below <example-table-prog1>`.


-------------------
List-like behaviour
-------------------

:obj:`Table` supports most list-like operations: getting, setting and
removing data instances, as well as the methods :obj:`append` and
:obj:`extend`. When setting an item, the new value must be either an
instance of the correct type or a Python list of appropriate length and
content, which is converted into a data instance of the corresponding
domain. Retrieving data instances returns references, not copies:
changing the retrieved instance changes the data in the table. Slicing
returns an ordinary Python list containing references to data instances,
not a new :obj:`Orange.data.Table`.

Following the Python convention, an empty data table evaluates to
``False``.
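
The reference semantics described above can be illustrated with a small
stand-in written in plain Python. This is only a sketch: ``MiniTable`` is a
hypothetical class invented for illustration, it is not part of Orange, and
it models nothing beyond indexing, slicing and truth testing.

```python
class MiniTable:
    """Hypothetical stand-in for a data table; not the Orange class."""

    def __init__(self, rows):
        # Each row is a mutable list standing in for a data instance.
        self._rows = [list(row) for row in rows]

    def __len__(self):
        # Python treats objects of length 0 as False.
        return len(self._rows)

    def __getitem__(self, index):
        # An int returns a reference to the stored row; a slice returns
        # an ordinary list of references, not a new MiniTable.
        return self._rows[index]


table = MiniTable([[1, 2], [3, 4], [5, 6]])

first = table[0]
first[1] = 20        # changes the row stored in the table
part = table[1:]     # a plain list of references

print(table[0])                                  # [1, 20]
print(type(part).__name__, part[0] is table[1])  # list True
print(bool(MiniTable([])))                       # False
```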

.. class:: Table

    .. attribute:: domain

        The domain to which the instances correspond. This
        attribute is read-only.

    .. attribute:: owns_instances

        ``True`` if the table contains the data instances and ``False`` if
        it contains references to instances owned by another table.

    .. attribute:: owner

        The actual owner of the data when ``owns_instances`` is ``False``.

    .. attribute:: version

        An integer that is increased when instances are added to or
        removed from the table. It does not detect changes to the data
        itself.

    .. attribute:: random_generator

        Random generator used by the method
        :obj:`random_instance`. If the method is called and
        ``random_generator`` is ``None``, a new generator is constructed
        with random seed 0 and stored here for future use.

    .. attribute:: attribute_load_status

        If the table was loaded from a file, this list of flags tells
        whether the feature descriptors were reused and how they
        matched. See :ref:`descriptor reuse <variable_descriptor_reuse>`
        for details.

    .. attribute:: meta_attribute_load_status

        A dictionary holding the same information for meta
        attributes, with keys corresponding to their ids and values to
        load statuses.

    .. method:: __init__(filename[, create_new_on])

        Read data from the given file. If the name includes an
        extension, it must be one of the known file formats
        (see :doc:`/Orange.data.formats`). If no extension is given, the
        directory is searched for any file with a recognized extension. If the
        file is not found, Orange also searches the directories
        specified in the environment variable ``ORANGE_DATA_PATH``.

        The optional flag ``create_new_on`` decides when variable
        descriptors are reused. See :ref:`descriptor reuse
        <variable_descriptor_reuse>` for more details.

        :param filename: the name of the file
        :type filename: str
        :param create_new_on: flag specifying when to reuse existing descriptors
        :type create_new_on: int

    .. _example-table-prog1:

    .. method:: __init__(domain)

        Construct an empty data table with the given domain.

        .. literalinclude:: code/datatable1.py
            :lines: 7-16

        The example :ref:`continues <example-table-prog2>`.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain

    .. method:: __init__(instances[, references])

        Construct a new data table containing the given data
        instances. These can be given either as another :obj:`Table`
        or as a Python list containing instances of
        :obj:`Orange.data.Instance`.

        If the optional second argument is ``True``, the first argument
        must be a :obj:`Table`. The new table will contain references
        to data stored in the given table. If the second argument is
        omitted or ``False``, data instances are copied.

        :param instances: data instances
        :type instances: Table or list
        :param references: if ``True``, the new table contains references
        :type references: bool

    .. _example-table-prog2:

    .. method:: __init__(domain, instances)

        Construct a new data table with a given domain and initialize
        it with the given instances. Instances can be given as a
        :obj:`Table` (if the domains do not match, the instances are
        converted), as a list containing either instances of
        :obj:`Orange.data.Instance` or lists, or as a numpy array.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain
        :param instances: data instances
        :type instances: Table or list or numpy.array

        The following example fills the data table created :ref:`above
        <example-table-prog1>` with some data from a list.

        .. literalinclude:: code/datatable1.py
            :lines: 29-34

        The following example shows initializing a data table from
        a numpy array.

        .. literalinclude:: code/datatable1.py
            :lines: 38-41

    .. method:: __init__(tables)

        Construct a table by combining data instances from a list of
        tables. All tables must have the same length. Domains are
        combined so that each (ordinary) feature appears only once in
        the resulting table. The class attribute is the last class
        attribute in the list of tables; for instance, if three tables
        are merged but the last one is class-less, the class attribute
        for the new table will come from the second table. Meta
        attributes for the new domain are merged based on ids: if the
        same attribute appears under two ids, it will be added
        twice. Conversely, if the same id is used for two different
        attributes in two tables, an exception is raised. As
        instances are merged, Orange checks that the features and meta
        attributes that appear in multiple tables have the same value
        in all of them. Missing values are allowed.

        Note that this is not an SQL join, as it does not
        try to find matches between the tables.

        :param tables: tables to be merged into the new table
        :type tables: list of instances of :obj:`Table`

        For example, suppose the file ``merge1.tab`` contains::

            a1    a2    m1    m2
            f     f     f     f
                        meta  meta
            1     2     3     4
            5     6     7     8
            9     10    11    12

        and ``merge2.tab`` contains::

            a1    a3    m1     m3
            f     f     f      f
                        meta   meta
            1     2.5   3      4.5
            5     6.5   7      8.5
            9     10.5  11     12.5

        The two tables can be loaded, merged and printed out by the
        following script.

        .. literalinclude:: code/datatable_merge.py

        This is what the output looks like::

            Domain 1:  [a1, a2], {-2:m1, -3:m2}
            Domain 2:  [a1, a3], {-2:m1, -4:m3}
            Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

               [1, 2], {"m1":3, "m2":4}
             + [1, 2.5], {"m1":3, "m3":4.5}
            -> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

               [5, 6], {"m1":7, "m2":8}
             + [5, 6.5], {"m1":7, "m3":8.5}
            -> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

               [9, 10], {"m1":11, "m2":12}
             + [9, 10.5], {"m1":11, "m3":12.5}
            -> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

        Merging succeeds since the values of ``a1`` and ``m1`` are the
        same for all matching instances from both tables.
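
        The consistency check performed while merging can be sketched in
        plain Python. The functions below are hypothetical illustrations,
        not Orange's implementation: instances are modelled as
        dictionaries, missing values as ``None``, and domains and meta
        ids are ignored.

        ```python
        def merge_rows(rows):
            """Combine dicts; shared keys must agree unless a value is missing."""
            merged = {}
            for row in rows:
                for name, value in row.items():
                    old = merged.get(name)
                    if old is not None and value is not None and old != value:
                        raise ValueError("conflicting values for %r" % name)
                    if old is None:
                        merged[name] = value
            return merged


        def merge_tables(tables):
            """Merge equally long tables position by position (no key matching)."""
            if len({len(t) for t in tables}) > 1:
                raise ValueError("all tables must have the same length")
            return [merge_rows(rows) for rows in zip(*tables)]


        # First rows of merge1.tab and merge2.tab from the example above.
        t1 = [{"a1": 1, "a2": 2, "m1": 3, "m2": 4}]
        t2 = [{"a1": 1, "a3": 2.5, "m1": 3, "m3": 4.5}]
        print(merge_tables([t1, t2]))
        ```

        Rows are paired purely by position, which is why this differs from
        an SQL join.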

    .. method:: append(inst)

        Append the given instance to the end of the table.

        :param inst: instance to be appended
        :type inst: :obj:`Orange.data.Instance` or a list

        .. literalinclude:: code/datatable1.py
            :lines: 21-24

    .. method:: extend(instances)

        Append the given list of instances to the end of the table.

        :param instances: instances to be appended
        :type instances: list


    .. method:: select(filt[, idx, negate=False])

        Return a subset of instances as a new :obj:`Table`. The first
        argument should be a list of the same length as the table; its
        elements should be integers or bools. The resulting table
        contains instances corresponding to non-zero elements of the
        list.

        If the second argument is given, it must be an integer;
        ``select`` then returns the data instances for which the
        corresponding elements of ``filt`` equal ``idx``.

        The third argument, ``negate``, can only be given as a
        keyword. Its effect is to negate the selection.

        Note: This method should be used when the selected data
        instances are going to be modified. In all other cases,
        the method :obj:`select_ref` is preferred.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

        One common use of this method is to split the data into
        folds. A list for the first argument can be prepared using
        ``Orange.core.MakeRandomIndicesCV``. The following example
        prepares a simple data table and indices for four-fold
        cross-validation, and then selects the training and testing sets
        for each fold.

        .. literalinclude:: code/datatable2.py
            :lines: 7-27

        The printout begins with::

            Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

            Fold 0: train
                 [0.000000]
                 [2.000000]
                 [3.000000]
                 [5.000000]
                 [7.000000]
                 [8.000000]
                 [9.000000]

                  : test
                 [1.000000]
                 [4.000000]
                 [6.000000]

        Another way of calling the method is to use a list of
        zeros and ones.

        .. literalinclude:: code/datatable2.py
            :lines: 29-31

        This prints out::

            [0.000000]
            [1.000000]
            [9.000000]
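
        The selection rule can be restated as a few lines of plain Python.
        This is a sketch only: the free function ``select`` below is a
        hypothetical stand-in that works on an ordinary list, not the
        Orange method, and the fold indices are copied from the printout
        above.

        ```python
        def select(rows, filt, idx=None, *, negate=False):
            """Return the rows picked by filt, following the rules above."""
            if len(filt) != len(rows):
                raise ValueError("filt must be as long as the table")
            picked = []
            for row, flag in zip(rows, filt):
                # Without idx, any non-zero element selects the row;
                # with idx, the element must equal idx.
                keep = bool(flag) if idx is None else flag == idx
                if keep != negate:
                    picked.append(row)
            return picked


        rows = list(range(10))
        folds = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]

        test = select(rows, folds, 0)
        train = select(rows, folds, 0, negate=True)
        print(test)    # [1, 4, 6]
        print(train)   # [0, 2, 3, 5, 7, 8, 9]
        print(select(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]))   # [0, 1, 9]
        ```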

    .. method:: select_ref(filt[, idx, negate=False])

        Same as :obj:`select`, except that the resulting table
        contains references to data instances in the original table
        instead of its own copies.

        In most cases, this method is preferred over :obj:`select`
        since it consumes much less memory.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: select_list(filt[, idx, negate=False])

        Same as :obj:`select`, except that it returns a Python list
        of data instances.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: list

    .. method:: get_items(indices)

        Return a table with the data instances at the given indices. For
        instance, ``data.get_items([0, 1, 9])`` returns a table with
        the instances at indices 0, 1 and 9.

        This method is useful when the data is going to be modified. If
        not, use :obj:`get_items_ref`.

        :param indices: indices of selected data instances
        :type indices: list of int
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items_ref(indices)

        Same as :obj:`get_items`, except that it returns a table with
        references to data instances instead of copies. This method is
        normally preferred over :obj:`get_items`.

        :param indices: indices of selected data instances
        :type indices: list of int
        :rtype: :obj:`Orange.data.Table`
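
        The practical difference between copies and references can be
        shown with ordinary Python lists. This sketch does not call
        Orange; rows are plain lists standing in for data instances.

        ```python
        import copy

        rows = [[0.0], [1.0], [2.0]]
        indices = [0, 2]

        copies = [copy.copy(rows[i]) for i in indices]   # like get_items
        refs = [rows[i] for i in indices]                # like get_items_ref

        copies[0][0] = 10.0   # the original row is untouched
        refs[1][0] = 20.0     # the original row changes too
        print(rows)           # [[0.0], [1.0], [20.0]]
        ```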

    .. method:: filter(conditions)

        Return a table with data instances matching the
        criteria. These can be given in the form of keyword arguments or a
        dictionary; with the latter, an additional keyword argument
        ``negate`` can be given to reverse the selection.

        Note that the method :obj:`filter_ref` is more memory efficient and
        should be preferred when data is not going to be modified.

        For example, young patients from the lenses data set can be
        selected by ::

            young = data.filter(age="young")

        More than one value can be allowed and more than one attribute
        checked. This selects all patients with age "young" or
        "presbyopic" who are astigmatic::

            young = data.filter(age=["young", "presbyopic"], astigm="y")

        The following has the same effect::

            young = data.filter({"age": ["young", "presbyopic"],
                                 "astigm": "y"})

        Selection can be reversed only with the latter form, by adding
        a keyword argument ``negate`` with value 1::

            young = data.filter({"age": ["young", "presbyopic"],
                                 "astigm": "y"},
                                negate=1)

        Filters for continuous features are specified by pairs of
        values. In the "bridges" data set, bridges with lengths between
        1000 and 2000 (inclusive) are selected by ::

            mid = data.filter(LENGTH=(1000, 2000))

        Bridges that are shorter or longer than that can be selected
        by inverting the range::

            mid = data.filter(LENGTH=(2000, 1000))
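
        The matching rules above can be summarised in a short plain-Python
        sketch. The helper ``matches`` is hypothetical and is not part of
        Orange; rows are modelled as dictionaries, and it is an assumption
        of this sketch that an inverted range excludes its end points.

        ```python
        def matches(row, conditions, negate=False):
            """Check one row (a dict) against a dict of conditions."""
            ok = True
            for name, cond in conditions.items():
                value = row[name]
                if isinstance(cond, tuple):
                    low, high = cond
                    if low <= high:
                        hit = low <= value <= high         # inside the range
                    else:
                        hit = value < high or value > low  # inverted: outside
                elif isinstance(cond, list):
                    hit = value in cond                    # any listed value
                else:
                    hit = value == cond                    # a single value
                ok = ok and hit
            return ok != negate


        patients = [
            {"age": "young", "astigm": "y"},
            {"age": "presbyopic", "astigm": "n"},
            {"age": "pre-presbyopic", "astigm": "y"},
        ]
        cond = {"age": ["young", "presbyopic"], "astigm": "y"}
        print([matches(p, cond) for p in patients])
        # [True, False, False]
        print([matches(p, cond, negate=True) for p in patients])
        # [False, True, True]

        bridges = [{"LENGTH": 900}, {"LENGTH": 1500}, {"LENGTH": 2500}]
        print([matches(b, {"LENGTH": (1000, 2000)}) for b in bridges])
        # [False, True, False]
        print([matches(b, {"LENGTH": (2000, 1000)}) for b in bridges])
        # [True, False, True]
        ```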

    .. method:: filter(filt)

            Similar to ``filter(conditions)``, except that the conditions
            are given as an :obj:`Orange.core.Filter`.

    .. method:: filter_ref(conditions)
                filter_ref(filter)

            Same as the two forms of :obj:`filter`, except that they
            return a table with references to instances instead of copies.

    .. method:: filter_list(conditions)
                filter_list(filter)

            Same as :obj:`filter`, except that they return a plain Python
            list of data instances.

    .. method:: filter_bool(conditions)
                filter_bool(filter)

            Return a list of bools denoting which data instances are
            accepted by the conditions or the filter.

    .. method:: translate(domain)

            Return a new data table in which data instances are
            translated into the given domain.

            :param domain: new domain
            :type domain: :obj:`Orange.data.Domain`
            :rtype: :obj:`Orange.data.Table`

    .. method:: translate(features[, keep_metas])

            Similar to ``translate(domain)``, except that the domain is
            given by a list of features. If ``keep_metas`` is ``True``,
            the new data instances will also have all the meta attributes
            from the original domain.

            :param features: features for the new data
            :type features: list
            :param keep_metas: if ``True``, keep the meta attributes of the original domain
            :type keep_metas: bool
            :rtype: :obj:`Orange.data.Table`

    .. method:: checksum()

            Return a CRC32 computed over all discrete and continuous
            features and class attributes of all data instances. Meta
            attributes and features of other types are ignored.

            :rtype: int
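
            A checksum of this kind is typically used to detect whether
            the data has changed. The sketch below shows the idea with
            Python's ``zlib.crc32``. The byte encoding used here is an
            assumption made for illustration, so the function will not
            reproduce the values returned by Orange.

            ```python
            import struct
            import zlib


            def checksum(rows):
                """CRC32 over rows of ints (discrete) and floats (continuous)."""
                crc = 0
                for row in rows:
                    for value in row:
                        # Fixed-width little-endian encoding:
                        # "<i" for ints, "<d" for floats.
                        fmt = "<i" if isinstance(value, int) else "<d"
                        crc = zlib.crc32(struct.pack(fmt, value), crc)
                return crc


            a = checksum([[0, 1.5], [2, 3.25]])
            b = checksum([[0, 1.5], [2, 3.25]])
            c = checksum([[0, 1.5], [2, 3.5]])
            print(a == b, a == c)   # True False
            ```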

    .. method:: has_missing_values()

            Return ``True`` if any data instance has a missing
            value. Meta attributes are not checked.

    .. method:: has_missing_classes()

            Return ``True`` if any instance has a missing class value.

    .. method:: random_instance()

            Return a random instance from the table. The data table's
            own :obj:`random_generator` is used, which is initially
            seeded with 0, so results are deterministic.
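
            The effect of a fixed seed can be demonstrated with Python's
            standard ``random`` module. This is an analogy only: Orange
            uses its own generator, and the sequences it produces will
            differ from those of ``random.Random``.

            ```python
            import random

            rows = ["a", "b", "c", "d", "e"]

            gen1 = random.Random(0)   # stands in for one table's generator
            gen2 = random.Random(0)   # a second generator with the same seed

            picks1 = [gen1.choice(rows) for _ in range(5)]
            picks2 = [gen2.choice(rows) for _ in range(5)]
            print(picks1 == picks2)   # True: same seed, same sequence
            ```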

    .. method:: remove_duplicates([weightID])

            Remove duplicate data instances. If ``weightID`` is given,
            a meta attribute is added that contains the number of
            instances merged into each remaining instance.

            :param weightID: id for meta attribute with weight
            :type weightID: int
            :rtype: None
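
            The counting of merged instances can be sketched in plain
            Python. The function below is a hypothetical illustration:
            unlike the method above it returns a new list instead of
            modifying the data in place, and keeping the order of first
            occurrence is an assumption of the sketch.

            ```python
            def remove_duplicates(rows):
                """Return unique rows, each with the count of rows merged into it."""
                counts = {}
                for row in rows:
                    key = tuple(row)
                    counts[key] = counts.get(key, 0) + 1
                # dicts keep insertion order, so first occurrences stay in order
                return [(list(key), n) for key, n in counts.items()]


            print(remove_duplicates([[1, "a"], [2, "b"], [1, "a"], [1, "a"]]))
            # [([1, 'a'], 3), ([2, 'b'], 1)]
            ```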

    .. method:: sort([features])

            Sort the data by attribute values. The argument gives the
            features ordered by importance. If omitted, the order from
            the domain is used. Note that the values of discrete
            features are not ordered alphabetically but according to
            their order in :obj:`Orange.data.variable.Discrete.values`.

            This sorts the data from the bridges data set by length
            and year of construction::

                data.sort(["LENGTH", "ERECTED"])
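
            Ordering by the declared value list rather than alphabetically
            can be sketched with a sort key in plain Python. The names
            ``values`` and ``sort_key`` and the sample rows are
            hypothetical and are not taken from Orange or the bridges
            data set.

            ```python
            # Declared order of values for a discrete feature.
            values = {"size": ["small", "medium", "large"]}

            rows = [
                {"size": "large", "year": 1900},
                {"size": "small", "year": 1950},
                {"size": "medium", "year": 1870},
                {"size": "small", "year": 1820},
            ]


            def sort_key(row, features):
                key = []
                for name in features:
                    value = row[name]
                    if name in values:
                        # discrete: rank by position in the declared list
                        value = values[name].index(value)
                    key.append(value)
                return key


            rows.sort(key=lambda row: sort_key(row, ["size", "year"]))
            print([(r["size"], r["year"]) for r in rows])
            # [('small', 1820), ('small', 1950), ('medium', 1870), ('large', 1900)]
            ```

            An alphabetical sort would instead put "large" first.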

    .. method:: shuffle()

            Randomly shuffle the data instances.

    .. method:: add_meta_attribute(id[, value=1])

            Add a meta value to all data instances. The first argument
            can be an integer id, or the name (a string) or the variable
            descriptor of a meta attribute registered in the domain.

    .. method:: remove_meta_attribute(id)

            Remove a meta attribute from all data instances.