source: orange/docs/reference/rst/Orange.data.table.rst @ 9726:aec7c15289d3

Revision 9726:aec7c15289d3, 17.3 KB checked in by janezd <janez.demsar@…>, 2 years ago (diff)

Fixed a broken link

.. py:currentmodule:: Orange.data

======================
Data table (``Table``)
======================

Class `Orange.data.Table` holds a list of data instances of type
:obj:`Orange.data.Instance`. All instances belong to the same domain
(:obj:`Orange.data.Domain`).

Data tables are usually loaded from a file (see :doc:`Orange.data.formats`)::

    import Orange
    data = Orange.data.Table("titanic")

Data tables can also be created programmatically, as in the :ref:`code
below <example-table-prog1>`.


-------------------
List-like behaviour
-------------------

:obj:`Table` supports most list-like operations: getting, setting and
removing data instances, as well as methods :obj:`append` and
:obj:`extend`. The limitation is that a table can contain only
instances of :obj:`Orange.data.Instance`. When setting items, the item
must be either an instance of the correct type or a Python list of
appropriate length and content to be converted into a data instance of
the corresponding domain.

When retrieving data instances, what we get are references and not
copies. Changing the retrieved instance changes the data in the table,
too.
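
Ordinary Python lists of mutable objects behave the same way, so the
effect can be illustrated without Orange. The sketch below is only an
analogy: the nested lists are hypothetical stand-ins for data
instances, not :obj:`Orange.data.Instance` objects.

```python
# Plain nested lists stand in for a table of data instances
# (an assumption for illustration; Orange itself is not used here).
table = [[1.0, "yes"], [2.0, "no"]]

inst = table[0]   # a reference to the stored row, not a copy
inst[0] = 42.0    # modifying it changes the row held by the table

first_value = table[0][0]
```

To modify a retrieved row without affecting the container, make an
explicit copy first.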

Slicing returns ordinary Python lists containing the data instances,
not a new Table.

As usual in Python, the data table is considered False when empty.
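
Both behaviours follow from standard Python protocols. The class below
is a minimal sketch of how a container can return plain lists from
slices and test as False when empty; ``TinyTable`` is a hypothetical
stand-in and says nothing about how Orange implements its table.

```python
class TinyTable(object):
    """Hypothetical stand-in for a table; not Orange's implementation."""

    def __init__(self, rows=()):
        self._rows = list(rows)

    def __len__(self):
        # Python treats an object whose __len__ returns 0 as False.
        return len(self._rows)

    def __getitem__(self, index):
        # An integer index gives one row; a slice gives a plain list.
        return self._rows[index]


empty = TinyTable()
full = TinyTable([[1], [2], [3]])
head = full[:2]
```

Here ``head`` is a list, so table-only methods are not available on it.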

.. class:: Table

    .. attribute:: domain

        The domain to which the instances correspond. This
        attribute is read-only.

    .. attribute:: owns_instances

        True if the table contains the data instances, False if it
        contains just references to instances owned by another table.

    .. attribute:: owner

        If the table does not own the data instances, this attribute
        gives the actual owner.

    .. attribute:: version

        An integer that is increased whenever the table is
        changed. This is not foolproof, since the object cannot
        detect when individual instances are changed. It will, however,
        catch any additions and removals from the table.

    .. attribute:: random_generator

        Random generator that is used by method
        :obj:`random_instance`. If the method is called and
        random_generator is None, a new generator is constructed with
        random seed 0, and stored here for subsequent use.

    .. attribute:: attribute_load_status

        If the table was loaded from a file, this list of flags tells
        whether the feature descriptors were reused and how they
        matched. See :ref:`descriptor reuse <variable_descriptor_reuse>` for details.

    .. attribute:: meta_attribute_load_status

        Same as above, except that this is a dictionary for meta
        attributes, with keys corresponding to their ids.

    .. method:: __init__(filename[, create_new_on])

        Read data from the given file. If the name includes the
        extension it must be one of the known file formats (see
        :doc:`/Orange.data.formats`). If no extension is given, the directory
        is searched for any file with recognized extensions. If the
        file is not found, Orange will also search the directories
        specified in the environment variable `ORANGE_DATA_PATH`.

        The optional flag `create_new_on` decides when variable
        descriptors are reused. See :ref:`descriptor reuse
        <variable_descriptor_reuse>` for more details.

        :param filename: the name of the file
        :type filename: str
        :param create_new_on: flag specifying when to reuse existing descriptors
        :type create_new_on: int

    .. _example-table-prog1:

    .. method:: __init__(domain)

        Construct an empty data table with the given domain.

        .. literalinclude:: code/datatable1.py
            :lines: 7-16

        The example :ref:`continues <example-table-prog2>`.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain

    .. method:: __init__(instances[, references])

        Construct a new data table containing the given data
        instances. These can be given either as another :obj:`Table`
        or as a Python list containing instances of
        :obj:`Orange.data.Instance`.

        If the optional second argument is True, the first argument
        must be a :obj:`Table`. The new table will contain references
        to data stored in the given table. If the second argument is
        omitted or False, data instances are copied.

        :param instances: data instances
        :type instances: Table or list
        :param references: if True, the new table contains references
        :type references: bool

    .. _example-table-prog2:

    .. method:: __init__(domain, instances)

        Construct a new data table with a given domain and initialize
        it with the given instances. Instances can be given as a
        :obj:`Table` (if domains do not match, they are converted),
        as a list containing either instances of
        :obj:`Orange.data.Instance` or lists, or as a numpy array.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain
        :param instances: data instances
        :type instances: Table or list or numpy.array

        The following example fills the data table created :ref:`above
        <example-table-prog1>` with some data from a list.

        .. literalinclude:: code/datatable1.py
            :lines: 29-34

        The following example shows initializing a data table from
        a numpy array.

        .. literalinclude:: code/datatable1.py
            :lines: 38-41

    .. method:: __init__(tables)

        Construct a table by combining data instances from a list of
        tables. All tables must have the same length. Domains are
        combined so that each (ordinary) feature appears only once in
        the resulting table. The class attribute is the last class
        attribute in the list of tables; for instance, if three tables
        are merged but the last one is class-less, the class attribute
        for the new table will come from the second table. Meta
        attributes for the new domain are merged based on ids: if the
        same attribute appears under two ids, it will be added
        twice. Conversely, if the same id is used for two different
        attributes in two tables, an exception is raised. As
        instances are merged, Orange checks that the features and meta
        attributes that appear in multiple tables have the same value
        in all of them. Missing values are allowed.

        Note that this is not SQL's join operator, as it doesn't
        try to find matches between the tables.

        :param tables: tables to be merged into the new table
        :type tables: list of instances of :obj:`Table`

        For example, suppose the file merge1.tab contains::

            a1    a2    m1    m2
            f     f     f     f
                        meta  meta
            1     2     3     4
            5     6     7     8
            9     10    11    12

        and merge2.tab contains::

            a1    a3    m1     m3
            f     f     f      f
                        meta   meta
            1     2.5   3      4.5
            5     6.5   7      8.5
            9     10.5  11     12.5

        The two tables can be loaded, merged and printed out by the
        following script.

        .. literalinclude:: code/datatable_merge.py

        This is what the output looks like::

            Domain 1:  [a1, a2], {-2:m1, -3:m2}
            Domain 2:  [a1, a3], {-2:m1, -4:m3}
            Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

               [1, 2], {"m1":3, "m2":4}
             + [1, 2.5], {"m1":3, "m3":4.5}
            -> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

               [5, 6], {"m1":7, "m2":8}
             + [5, 6.5], {"m1":7, "m3":8.5}
            -> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

               [9, 10], {"m1":11, "m2":12}
             + [9, 10.5], {"m1":11, "m3":12.5}
            -> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

        Merging succeeds since the values of `a1` and `m1` are the
        same for all matching instances from both tables.

    .. method:: append(inst)

        Append the given instance to the end of the table.

        :param inst: instance to be appended
        :type inst: :obj:`Orange.data.Instance` or a list

        .. literalinclude:: code/datatable1.py
            :lines: 21-24

    .. method:: extend(instances)

        Append the given list of instances to the end of the table.

        :param instances: instances to be appended
        :type instances: list


    .. method:: select(filt[, idx, negate=False])

        Return a subset of instances as a new :obj:`Table`. The first
        argument should be a list of the same length as the table; its
        elements should be integers or bools. The resulting table
        contains instances corresponding to non-zero elements of the
        list.

        If the second argument is given, it must be an integer;
        select will then return the data instances for which the
        corresponding elements of `filt` match `idx`.

        The third argument, `negate`, can only be given as a
        keyword. Its effect is to negate the selection.

        Note: This method should be used when the selected data
        instances are going to be modified. In all other cases, method
        :obj:`select_ref` is preferred.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

        One common use of this method is to split the data into
        folds. A list for the first argument can be prepared using
        `Orange.core.MakeRandomIndicesCV`. The following example
        prepares a simple data table and indices for four-fold cross
        validation, and then selects the training and testing sets for
        each fold.

        .. literalinclude:: code/datatable2.py
            :lines: 7-27

        The printout begins with::

            Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

            Fold 0: train
                 [0.000000]
                 [2.000000]
                 [3.000000]
                 [5.000000]
                 [7.000000]
                 [8.000000]
                 [9.000000]

                  : test
                 [1.000000]
                 [4.000000]
                 [6.000000]

        Another form of calling the method is to use a vector of
        zeros and ones.

        .. literalinclude:: code/datatable2.py
            :lines: 29-31

        This prints out::

            [0.000000]
            [1.000000]
            [9.000000]

    .. method:: select_ref(filt[, idx, negate=False])

        Same as :obj:`select`, except that the resulting table
        contains references to data instances in the original table
        instead of its own copies.

        In most cases, this function is preferred over the former
        since it consumes much less memory.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: select_list(filt[, idx, negate=False])

        Same as :obj:`select`, except that it returns a Python list
        with data instances.

        :param filt: filter list
        :type filt: list of integers
        :param idx: selects which instances to pick
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: list

    .. method:: get_items(indices)

        Return a table with data instances indicated by indices. For
        instance, `data.get_items([0, 1, 9])` returns a table with
        instances with indices 0, 1 and 9.

        This function is useful when data is going to be modified. If
        not, use :obj:`get_items_ref`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items_ref(indices)

        Same as above, except that it returns a table with references
        to data instances instead of copies. This method is normally
        preferred over the above one.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: filter(conditions)

        Return a table with data instances matching the
        criteria. These can be given in the form of keyword arguments
        or a dictionary; with the latter, an additional keyword
        argument `negate` can be given to reverse the selection.

        Note that method :obj:`filter_ref` is more memory efficient and
        should be preferred when data is not going to be modified.

        For example, young patients from the lenses data set can be
        selected by ::

            young = data.filter(age="young")

        More than one value can be allowed and more than one attribute
        checked. This selects all patients with age "young" or
        "presbyopic" who are astigmatic::

            young = data.filter(age=["young", "presbyopic"], astigm="y")

        The following has the same effect::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"})

        Selection can be reversed only with the latter form, by adding
        a keyword argument `negate` with value 1::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"},
                                negate=1)

        Filters for continuous features are specified by pairs of
        values. In dataset "bridges", bridges with lengths between
        1000 and 2000 (inclusive) are selected by ::

            mid = data.filter(LENGTH=(1000, 2000))

        Bridges that are shorter or longer than that can be selected
        by inverting the range. ::

            mid = data.filter(LENGTH=(2000, 1000))

    .. method:: filter(filt)

            Similar to above, except that conditions are given as
            :obj:`Orange.core.Filter`.

    .. method:: filter_ref(conditions), filter_ref(filter)

            Same as the above two, except that they return a table
            with references to instances instead of their copies.

    .. method:: filter_list(conditions), filter_list(filter)

            As above, except that they return a plain Python list with
            data instances.

    .. method:: filter_bool(conditions), filter_bool(filter)

            Return a list of bools denoting which data instances are
            accepted by the conditions or the filter.

    .. method:: translate(domain)

            Return a new data table in which data instances are
            translated into the given domain.

            :param domain: new domain
            :type domain: :obj:`Orange.data.Domain`
            :rtype: :obj:`Orange.data.Table`

    .. method:: translate(features[, keep_metas])

            Similar to above, except that the domain is given by a
            list of features. If keep_metas is True, the new data
            instances will also have all the meta attributes from the
            original domain.

            :param features: features for the new data
            :type features: list
            :rtype: :obj:`Orange.data.Table`

    .. method:: checksum()

            Return a CRC32 computed over all discrete and continuous
            features and class attributes of all data instances. Meta
            attributes and features of other types are ignored.

            :rtype: int

    .. method:: has_missing_values()

            Return True if any data instance has a missing
            value. Meta attributes are not checked.

    .. method:: has_missing_classes()

            Return True if any instance has a missing class value.

    .. method:: random_instance()

            Return a random instance from the
            table. Data table's own :obj:`random_generator` is used,
            which is initially seeded to 0, so results are
            deterministic.

    .. method:: remove_duplicates([weightID])

            Remove duplicates of data instances. If weightID is given,
            a meta attribute is added which contains the number of
            instances merged into each new instance.

            :param weightID: id for meta attribute with weight
            :type weightID: int
            :rtype: None

    .. method:: sort([features])

            Sort the data by attribute values. The argument gives the
            features ordered by importance. If omitted, the order from
            the domain is used. Note that the values of discrete
            features are not ordered alphabetically but according to
            the :obj:`Orange.data.variable.Discrete.values`.

            This sorts the data from the bridges data set by the lengths
            and years of their construction::

                data.sort(["LENGTH", "ERECTED"])

    .. method:: shuffle()

            Randomly shuffle the data instances.

    .. method:: add_meta_attribute(id[, value=1])

            Add a meta value to all data instances. The first argument
            can be an integer id, or a string or a variable descriptor
            of a meta attribute registered in the domain.

    .. method:: remove_meta_attribute(id)

            Remove a meta attribute from all data instances.
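
The selection rule documented for :obj:`select` (and shared by
:obj:`select_ref` and :obj:`select_list`) can be modelled on plain
Python lists. The sketch below is an illustration only:
``select_sketch`` is a hypothetical helper that is not part of Orange,
and it returns a list rather than a table. It reuses the index vector
shown in the cross-validation printout for :obj:`select`.

```python
def select_sketch(rows, filt, idx=None, negate=False):
    """Hypothetical pure-Python model of the selection rule."""
    if len(filt) != len(rows):
        raise ValueError("filt must have the same length as the table")
    selected = []
    for row, flag in zip(rows, filt):
        # Without idx, non-zero elements select a row;
        # with idx, elements equal to idx select a row.
        hit = bool(flag) if idx is None else flag == idx
        if hit != bool(negate):
            selected.append(row)
    return selected


# Ten single-valued rows and the fold indices from the printout.
rows = [[float(i)] for i in range(10)]
indices = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]

test_fold0 = select_sketch(rows, indices, 0)
train_fold0 = select_sketch(rows, indices, 0, negate=True)
picked = select_sketch(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
```

With these inputs, the test set for fold 0 holds rows 1, 4 and 6, and
the training set holds the remaining seven rows, matching the printout.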