source: orange/docs/reference/rst/Orange.data.table.rst @ 9883:33c19621e154

Revision 9883:33c19621e154, 17.3 KB checked in by janezd <janez.demsar@…>, 2 years ago

Polished documentation on data.Table

.. py:currentmodule:: Orange.data

======================
Data table (``Table``)
======================

Class `Orange.data.Table` holds a list of data instances of type
:obj:`Orange.data.Instance`. All instances belong to the same domain
(:obj:`Orange.data.Domain`).

Data tables are usually loaded from a file (see :doc:`Orange.data.formats`)::

    import Orange
    data = Orange.data.Table("titanic")

Data tables can also be created programmatically, as in the :ref:`code
below <example-table-prog1>`.


-------------------
List-like behaviour
-------------------

:obj:`Table` supports most list-like operations: getting, setting and
removing data instances, as well as the methods :obj:`append` and
:obj:`extend`. When setting an item, the new value must be either an
instance of the correct type or a Python list of appropriate length
and content, which is converted into a data instance of the
corresponding domain. Retrieving data instances returns references,
not copies: changing a retrieved instance changes the data in the
table. Slicing returns an ordinary Python list containing references to
data instances, not a new :obj:`Orange.data.Table`.

Following the Python convention, a data table is considered ``False``
when empty.
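
The behaviour described above can be tried out without Orange installed.
The sketch below uses a hypothetical stand-in class, ``TinyTable``, written
only for this illustration. It is not Orange's implementation; it merely
mimics three of the documented properties: indexing returns references,
slicing returns a plain list, and an empty table is falsy.

```python
class TinyTable(object):
    """Hypothetical stand-in for Orange.data.Table (illustration only)."""

    def __init__(self, rows=()):
        # Each row is a mutable list of values, playing the role of an instance.
        self._rows = [list(row) for row in rows]

    def __len__(self):
        # Python treats objects of length 0 as False.
        return len(self._rows)

    def __getitem__(self, index):
        # Integer index: a reference to the stored row, not a copy.
        # Slice: an ordinary Python list of references, not a new table.
        return self._rows[index]

    def append(self, row):
        self._rows.append(list(row))


table = TinyTable([[1, 2], [3, 4], [5, 6]])

first = table[0]
first[1] = 99                      # modifies the row stored in the table

part = table[1:]                   # a plain list, not a TinyTable
empty_is_false = not TinyTable()   # an empty table is falsy
```

After this runs, ``table[0]`` holds the modified value, and ``part[0]`` is
the very same object as ``table[1]``.
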

.. class:: Table

    .. attribute:: domain

        The domain to which the instances belong. This
        attribute is read-only.

    .. attribute:: owns_instances

        ``True`` if the table contains the data instances and ``False`` if
        it contains references to instances owned by another table.

    .. attribute:: owner

        The actual owner of the data when ``owns_instances`` is ``False``;
        ``None`` otherwise.

    .. attribute:: version

        An integer that is increased when instances are added to or
        removed from the table. It does not change when the values of
        existing instances are modified.

    .. attribute:: random_generator

        Random generator used by the method
        :obj:`random_instance`. If the method is called while
        ``random_generator`` is ``None``, a new generator is constructed
        with random seed 0 and stored here for future use.

    .. attribute:: attribute_load_status

        If the table was loaded from a file, this list of flags tells
        whether the feature descriptors were reused and how they
        matched. See :ref:`descriptor reuse <variable_descriptor_reuse>`
        for details.

    .. attribute:: meta_attribute_load_status

        A dictionary holding the same information for meta
        attributes, with keys corresponding to their ids and values to
        load statuses.

    .. method:: __init__(filename[, create_new_on])

        Read data from the given file. If the name includes the
        extension, it must be one of the known file formats (see
        :doc:`/Orange.data.formats`). If no extension is given, the directory
        is searched for any file with a recognized extension. If the
        file is not found, Orange will also search the directories
        specified in the environment variable `ORANGE_DATA_PATH`.

        The optional flag ``create_new_on`` decides when variable
        descriptors are reused. See :ref:`descriptor reuse
        <variable_descriptor_reuse>` for more details.

        :param filename: the name of the file
        :type filename: str
        :param create_new_on: flag specifying when to reuse existing descriptors
        :type create_new_on: int

    .. _example-table-prog1:

    .. method:: __init__(domain)

        Construct an empty data table with the given domain.

        .. literalinclude:: code/datatable1.py
            :lines: 7-16

        The example :ref:`continues <example-table-prog2>`.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain

    .. method:: __init__(instances[, references])

        Construct a new data table containing the given data
        instances. These can be given either as another :obj:`Table`
        or as a list of instances, each represented by a list of values
        or by an :obj:`Orange.data.Instance`.

        If the optional second argument is ``True``, the first argument
        must be a :obj:`Table`. The new table will contain references
        to data stored in the given table. If the second argument is
        omitted or ``False``, data instances are copied.

        :param instances: data instances
        :type instances: Table or list
        :param references: if ``True``, the new table contains references
        :type references: bool

    .. _example-table-prog2:

    .. method:: __init__(domain, instances)

        Construct a new data table with the given domain and initialize
        it with the given instances. Instances can be given as a
        :obj:`Table` (if domains do not match, they are converted),
        as a list containing either instances of
        :obj:`Orange.data.Instance` or lists, or as a numpy array.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain
        :param instances: data instances
        :type instances: Table or list or numpy.array

        The following example fills the data table created :ref:`above
        <example-table-prog1>` with some data from a list.

        .. literalinclude:: code/datatable1.py
            :lines: 29-34

        The following example shows initializing a data table from
        a numpy array.

        .. literalinclude:: code/datatable1.py
            :lines: 38-41

    .. method:: __init__(tables)

        Construct a table by combining data instances from a list of
        tables. All tables must have the same length. Domains are
        combined so that each (ordinary) feature appears only once in
        the resulting table. The class attribute is the last class
        attribute in the list of tables; for instance, if three tables
        are merged but the last one is class-less, the class attribute
        for the new table will come from the second table. Meta
        attributes for the new domain are merged based on ids: if the
        same attribute appears under two ids, it will be added
        twice. Conversely, if the same id refers to two different
        attributes in two tables, an exception is raised. As
        instances are merged, an exception is raised if a feature or
        a meta attribute that appears in multiple tables does not have the
        same value in all of them; the feature is allowed to have a
        missing value in one or more (or all) tables.

        Note that this is not SQL's join operator: it does not
        try to find matches between the tables but instead merges them
        row by row.

        :param tables: tables to be merged into the new table
        :type tables: list of instances of :obj:`Table`

        For example, suppose the file merge1.tab contains::

            a1    a2    m1    m2
            f     f     f     f
                        meta  meta
            1     2     3     4
            5     6     7     8
            9     10    11    12

        and merge2.tab contains::

            a1    a3    m1     m3
            f     f     f      f
                        meta   meta
            1     2.5   3      4.5
            5     6.5   7      8.5
            9     10.5  11     12.5

        The two tables can be loaded, merged and printed out by the
        following script.

        .. literalinclude:: code/datatable_merge.py

        This is what the output looks like::

            Domain 1:  [a1, a2], {-2:m1, -3:m2}
            Domain 2:  [a1, a3], {-2:m1, -4:m3}
            Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

               [1, 2], {"m1":3, "m2":4}
             + [1, 2.5], {"m1":3, "m3":4.5}
            -> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

               [5, 6], {"m1":7, "m2":8}
             + [5, 6.5], {"m1":7, "m3":8.5}
            -> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

               [9, 10], {"m1":11, "m2":12}
             + [9, 10.5], {"m1":11, "m3":12.5}
            -> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

        Merging succeeds since the values of `a1` and `m1` are the
        same for all matching instances from both tables.
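
        The row-by-row rule can be sketched in plain Python, independently of
        Orange. In the sketch below, rows are modelled as dictionaries and
        ``None`` stands for a missing value; the helper ``merge_rows`` is a
        hypothetical name introduced for this illustration and is not part
        of Orange. The data is taken from the two files above.

        ```python
        def merge_rows(row_a, row_b):
            """Merge two rows given as dicts; None stands for a missing value."""
            merged = dict(row_a)
            for name, value in row_b.items():
                known = merged.get(name)
                if known is not None and value is not None and known != value:
                    # The same attribute has two different known values.
                    raise ValueError("conflicting values for %r" % name)
                if known is None:
                    merged[name] = value
            return merged


        table1 = [{"a1": 1, "a2": 2, "m1": 3, "m2": 4},
                  {"a1": 5, "a2": 6, "m1": 7, "m2": 8},
                  {"a1": 9, "a2": 10, "m1": 11, "m2": 12}]
        table2 = [{"a1": 1, "a3": 2.5, "m1": 3, "m3": 4.5},
                  {"a1": 5, "a3": 6.5, "m1": 7, "m3": 8.5},
                  {"a1": 9, "a3": 10.5, "m1": 11, "m3": 12.5}]

        if len(table1) != len(table2):
            raise ValueError("all tables must have the same length")

        # Rows are paired by position, not matched by key as in an SQL join.
        merged = [merge_rows(a, b) for a, b in zip(table1, table2)]
        ```

        Changing ``a1`` in one row of ``table2`` would make ``merge_rows``
        raise an error for that row, which corresponds to the exception
        described above.
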

    .. method:: append(inst)

        Append the given instance to the end of the table.

        :param inst: instance to be appended
        :type inst: :obj:`Orange.data.Instance` or a list

        .. literalinclude:: code/datatable1.py
            :lines: 21-24

    .. method:: extend(instances)

        Append the given list of instances to the end of the table.

        :param instances: instances to be appended
        :type instances: list


    .. method:: select(filt[, idx, negate=False])

        Return a subset of instances as a new :obj:`Table`. The first
        argument should be a list of the same length as the table; its
        elements should be integers or bools. The resulting table
        contains the instances corresponding to non-zero elements of the
        list.

        If the second argument is given, it must be an integer;
        ``select`` then returns the data instances for which the
        corresponding element of `filt` equals `idx`.

        The third argument, `negate`, can only be given as a
        keyword. Its effect is to negate the selection.

        Note: This method should be used when the selected data
        instances are going to be modified. In all other cases, the method
        :obj:`select_ref` is preferred.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

        One common use of this method is to split the data into
        folds. A list for the first argument can be prepared using
        `Orange.data.sample.SubsetIndicesCV`. The following example
        prepares a simple data table and indices for four-fold
        cross-validation, and then selects the training and testing sets for
        each fold.

        .. literalinclude:: code/datatable2.py
            :lines: 7-27

        The printout begins with::

            Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

            Fold 0: train
                 [0.000000]
                 [2.000000]
                 [3.000000]
                 [5.000000]
                 [7.000000]
                 [8.000000]
                 [9.000000]

                  : test
                 [1.000000]
                 [4.000000]
                 [6.000000]

        Another way of calling the method is to pass a vector of
        zeros and ones.

        .. literalinclude:: code/datatable2.py
            :lines: 29-31

        This prints out::

            [0.000000]
            [1.000000]
            [9.000000]
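
        The selection rule itself is easy to model in plain Python. The
        function below is a hypothetical stand-in written for this
        illustration, not Orange's code; it works on a list of rows and
        follows the description above: non-zero elements select when
        ``idx`` is omitted, elements equal to ``idx`` select otherwise,
        and ``negate`` flips the choice. The zero-one vector at the end is
        an assumed input chosen to reproduce the last printout.

        ```python
        def select(rows, filt, idx=None, negate=False):
            """Return copies of the rows picked by the filter list ``filt``."""
            if len(filt) != len(rows):
                raise ValueError("filter list must be as long as the table")
            if idx is None:
                picked = [bool(f) for f in filt]      # non-zero elements select
            else:
                picked = [f == idx for f in filt]     # elements equal to idx select
            return [list(row) for row, keep in zip(rows, picked) if keep != negate]


        rows = [[float(i)] for i in range(10)]
        indices = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]       # fold indices from the printout

        test = select(rows, indices, 0)                # fold 0: rows 1, 4 and 6
        train = select(rows, indices, 0, negate=True)  # the remaining seven rows

        ends = select(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])   # rows 0, 1 and 9
        ```
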

    .. method:: select_ref(filt[, idx, negate=False])

        Same as :obj:`select`, except that the resulting table
        contains references to data instances in the original table
        instead of its own copy of data.

        In most cases, this method is preferred over :obj:`select`
        since it consumes less memory.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: select_list(filt[, idx, negate=False])

        Same as :obj:`select`, except that it returns a Python list
        with data instances.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: list

    .. method:: get_items(indices)

        Return a table with the data instances at the given indices. For
        instance, `data.get_items([0, 1, 9])` returns a table with
        instances with indices 0, 1 and 9.

        This method is useful when the data is going to be modified. If
        not, use :obj:`get_items_ref`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items_ref(indices)

        Same as :obj:`get_items`, except that it returns a table with
        references to data instances. This method is usually
        preferred over :obj:`get_items`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`
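
        The practical difference between the copying and the referencing
        variant can be shown with plain Python lists. Both functions below
        are hypothetical stand-ins that only share their names with the
        methods documented here; rows are modelled as lists of values.

        ```python
        import copy

        rows = [[0.0], [1.0], [2.0]]


        def get_items(rows, indices):
            # Independent copies: safe to modify.
            return [copy.deepcopy(rows[i]) for i in indices]


        def get_items_ref(rows, indices):
            # References to the original rows: cheaper, but changes show through.
            return [rows[i] for i in indices]


        copied = get_items(rows, [0, 2])
        copied[0][0] = -1.0          # the original row 0 is untouched

        shared = get_items_ref(rows, [0, 2])
        shared[1][0] = 20.0          # rows[2] changes as well
        ```
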

    .. method:: filter(conditions)

        Return a table with data instances matching the
        criteria. These can be given in the form of keyword arguments or a
        dictionary; with the latter, an additional keyword argument ``negate``
        can be given to reverse the selection.

        Note that the method :obj:`filter_ref` is more memory efficient and
        should be preferred when the data is not going to be modified.

        Young patients from the lenses data set can be selected by ::

            young = data.filter(age="young")

        More than one value can be allowed and more than one attribute
        checked. This selects all patients with age "young" or "presbyopic" who
        are astigmatic::

            young = data.filter(age=["young", "presbyopic"], astigm="y")

        The following has the same effect::

            young = data.filter({"age": ["young", "presbyopic"],
                                 "astigm": "y"})

        Selection can be reversed only in the latter form, by adding
        a keyword argument ``negate`` with value 1::

            young = data.filter({"age": ["young", "presbyopic"],
                                 "astigm": "y"},
                                negate=1)

        Filters for continuous features are specified by pairs of
        values. In the data set "bridges", bridges with lengths between
        1000 and 2000 (inclusive) are selected by ::

            mid = data.filter(LENGTH=(1000, 2000))

        Bridges that are shorter or longer than that can be selected
        by inverting the range::

            not_mid = data.filter(LENGTH=(2000, 1000))
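
        The meaning of an ordinary and an inverted pair can be written
        down as a small predicate. This is a plain-Python sketch with a
        hypothetical helper, ``in_range``, and made-up lengths. It assumes
        that the inverted form excludes both end points, which follows the
        wording "shorter or longer" above but is not confirmed against
        Orange itself.

        ```python
        def in_range(value, bounds):
            low, high = bounds
            if low <= high:
                return low <= value <= high       # ordinary, inclusive interval
            # Inverted pair (low > high): keep values outside [high, low].
            return value < high or value > low


        lengths = [800, 1000, 1500, 2000, 2600]   # made-up bridge lengths

        mid = [x for x in lengths if in_range(x, (1000, 2000))]
        outside = [x for x in lengths if in_range(x, (2000, 1000))]
        ```

        With these values, ``mid`` keeps 1000, 1500 and 2000, while
        ``outside`` keeps 800 and 2600.
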

    .. method:: filter(filt)

            Similar to the above, except that conditions are given as an
            :obj:`Orange.core.Filter`.

    .. method:: filter_ref(conditions)
                filter_ref(filt)

            Same as the two forms of :obj:`filter` above, except that they
            return a table with references to instances instead of their
            copies.

    .. method:: filter_list(conditions)
                filter_list(filt)

            Same as :obj:`filter`, except that they return a plain Python
            list of data instances.

    .. method:: filter_bool(conditions)
                filter_bool(filt)

            Return a list of bools denoting which data instances are
            accepted by the conditions or the filter.

    .. method:: translate(domain)

            Return a new data table in which data instances are
            translated into the given domain.

            :param domain: new domain
            :type domain: :obj:`Orange.data.Domain`
            :rtype: :obj:`Orange.data.Table`

    .. method:: translate(features[, keep_metas])

            Similar to the above, except that the domain is given by a
            list of features. If ``keep_metas`` is ``True``, the new data
            instances will also have all the meta attributes from the
            original domain.

            :param features: features for the new data
            :type features: list
            :param keep_metas: if ``True``, keep the meta attributes of the original domain
            :type keep_metas: bool
            :rtype: :obj:`Orange.data.Table`

    .. method:: checksum()

            Return a CRC32 computed over all discrete and continuous
            features and class attributes of all data instances.

            :rtype: int

    .. method:: has_missing_values()

            Return ``True`` if any data instance has a missing
            value. Meta attributes are not checked.

    .. method:: has_missing_classes()

            Return ``True`` if any instance is missing the class value.

    .. method:: random_instance()

            Return a random instance from the
            table. The data table's :obj:`random_generator` is used,
            which is initially seeded with 0, so results are
            deterministic.
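
            The effect of a fixed seed can be demonstrated with Python's
            standard ``random.Random``, used here only as a stand-in:
            Orange has its own generator, so the rows it actually picks
            will differ from those chosen below. The helper
            ``random_rows`` is hypothetical.

            ```python
            import random

            rows = [[float(i)] for i in range(10)]


            def random_rows(rows, count, seed=0):
                generator = random.Random(seed)   # fresh generator, fixed seed
                return [generator.choice(rows) for _ in range(count)]


            first_run = random_rows(rows, 5)
            second_run = random_rows(rows, 5)     # same seed, same sequence
            ```

            Both runs return the same sequence of rows, which is what
            "deterministic" means here.
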

    .. method:: remove_duplicates([weightID])

            Remove duplicates of data instances. If ``weightID`` is given,
            a meta attribute is added which contains the number of
            instances merged into each new instance.

            :param weightID: id for meta attribute with weight
            :type weightID: int
            :rtype: None
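
            The idea of collapsing duplicates while counting them can be
            sketched in plain Python. Unlike the method above, the
            hypothetical function below returns its result instead of
            changing the data in place, and it keeps rows in order of
            first occurrence, which is an assumption rather than a
            documented property of Orange.

            ```python
            def remove_duplicates(rows):
                """Return (unique_rows, weights), keeping first-occurrence order."""
                unique, weights, position = [], [], {}
                for row in rows:
                    key = tuple(row)
                    if key in position:
                        # Seen before: only increase the weight of the kept row.
                        weights[position[key]] += 1
                    else:
                        position[key] = len(unique)
                        unique.append(list(row))
                        weights.append(1)
                return unique, weights


            rows = [["young", "y"], ["old", "n"], ["young", "y"], ["young", "y"]]
            unique, weights = remove_duplicates(rows)
            ```

            Here the weights play the role of the meta attribute
            identified by ``weightID``.
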

    .. method:: sort([features])

            Sort the data by attribute values. The argument gives the
            features ordered by importance. If omitted, the order from
            the domain is used. Note that the values of discrete
            features are not ordered alphabetically but according to
            their order in :obj:`Orange.data.variable.Discrete.values`.

            This sorts the data from the bridges data set by the lengths
            and years of their construction::

                data.sort(["LENGTH", "ERECTED"])
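
            The difference between the declared order of values and
            alphabetical order can be seen in a small plain-Python
            example. The list ``values`` below is a made-up value list
            for a discrete feature, not one read from an actual
            descriptor.

            ```python
            values = ["young", "pre-presbyopic", "presbyopic"]   # declared order
            rank = dict((value, i) for i, value in enumerate(values))

            ages = ["presbyopic", "young", "pre-presbyopic", "young"]

            by_declared_order = sorted(ages, key=rank.__getitem__)
            alphabetical = sorted(ages)
            ```

            Sorting by declared order puts "young" first, whereas
            alphabetical sorting puts it last.
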

    .. method:: shuffle()

            Randomly shuffle the data instances.

    .. method:: add_meta_attribute(id[, value=1])

            Add a meta value to all data instances. The first argument
            can be an integer id, or a string or a variable descriptor
            of a meta attribute registered in the domain.

    .. method:: remove_meta_attribute(id)

            Remove a meta attribute from all data instances.