.. Source: orange/docs/reference/rst/Orange.data.table.rst @ 9372:aef193695ea9
   Revision 9372:aef193695ea9, 16.8 KB, checked in by mitar, 2 years ago:
   "Moved documentation to the separate directory."

.. py:currentmodule:: Orange.data

======================
Data table (``Table``)
======================

Class :obj:`Orange.data.Table` holds a list of data instances of type
:obj:`Orange.data.Instance`. All instances belong to the same domain.

-------------------
List-like behaviour
-------------------

:obj:`Table` supports most list-like operations: getting, setting, and
removing data instances, as well as the methods :obj:`append` and
:obj:`extend`. The limitation is that a table can contain only instances
of :obj:`Orange.data.Instance`. When setting items, the item must be
either an instance of the correct type or a Python list of
appropriate length and content, which is converted into a data instance
of the corresponding domain.

Retrieving a data instance returns a reference, not a copy. Changing
the retrieved instance changes the data in the table,
too.
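
The aliasing described above works the same way as for any Python list of
mutable objects. The following is a minimal sketch that uses plain nested
lists as a stand-in for a table and its instances; it makes no Orange calls,
and the names ``table``, ``inst`` and ``snapshot`` are only illustrative.

```python
import copy

# Stand-in for a table: a list of mutable rows (not Orange objects).
table = [[1.0, 2.0], [3.0, 4.0]]

inst = table[0]   # a reference to the stored row, not a copy
inst[1] = 99.0    # the change is visible through the table as well

snapshot = copy.deepcopy(table[1])  # an explicit copy is independent
snapshot[0] = -1.0                  # the table is unaffected

print(table)  # [[1.0, 99.0], [3.0, 4.0]]
```

To keep the stored data unchanged, modify an explicit copy instead of the
retrieved object.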

Slicing returns an ordinary Python list containing the data instances,
not a new :obj:`Table`.

As usual in Python, an empty data table is considered False.
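
The slicing and truth-value rules can be illustrated with a small stand-in
class. ``MiniTable`` is hypothetical and is not part of Orange; it only mimics
the two rules stated above, under the assumption that rows are plain Python
lists.

```python
class MiniTable:
    """Hypothetical stand-in that mimics the list-like rules described above."""

    def __init__(self, rows=()):
        self._rows = list(rows)

    def __len__(self):
        # Python treats an object whose length is 0 as False.
        return len(self._rows)

    def __getitem__(self, key):
        # A slice yields a plain list of rows, not a MiniTable;
        # an integer index yields the stored row itself.
        return self._rows[key]

    def append(self, row):
        self._rows.append(row)


table = MiniTable([[0], [1], [2]])
part = table[1:]

print(type(part).__name__, part)       # list [[1], [2]]
print(bool(MiniTable()), bool(table))  # False True
```

The rows in ``part`` are the same objects as those held by ``table``, so the
reference semantics apply to slices as well.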

-----------

.. class:: Table

    .. attribute:: domain

        The domain to which the instances correspond. This
        attribute is read-only.

    .. attribute:: owns_examples

        True if the table owns its data instances, False if it
        contains just references to instances owned by another table.

    .. attribute:: owner

        If the table does not own the data instances, this attribute
        gives the actual owner.

    .. attribute:: version

        An integer that is increased whenever the table is
        changed. This is not foolproof, since the object cannot
        detect when individual data instances are changed. It will,
        however, catch any additions to and removals from the table.

    .. attribute:: random_generator

        Random generator that is used by method
        :obj:`random_example`. If the method is called and
        ``random_generator`` is None, a new generator is constructed with
        random seed 0 and stored here for subsequent use.

    .. attribute:: attribute_load_status

        If the table was loaded from a file, this list of flags tells
        whether the feature descriptors were reused and how they
        matched. See :ref:`file-formats` for details.

    .. attribute:: meta_attribute_load_status

        Same as :obj:`attribute_load_status`, except that this is a
        dictionary for meta attributes, with keys corresponding to their ids.

    .. method:: __init__(filename[, create_new_on])

        Read data from the given file. If the name includes an
        extension, it must be one of the known file formats (see
        :ref:`file-formats`). If no extension is given, the directory
        is searched for any file with a recognized extension. If the
        file is not found, Orange also searches the directories
        specified in the environment variable `ORANGE_DATA_PATH`.

        The optional flag `create_new_on` decides when variable
        descriptors are reused. See :ref:`file-formats` for more details.

        :param filename: the name of the file
        :type filename: str
        :param create_new_on: flag specifying when to reuse existing descriptors
        :type create_new_on: int

    .. method:: __init__(domain)

        Construct an empty data table with the given domain.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain

        .. literalinclude:: code/datatable1.py
            :lines: 7-16

    .. method:: __init__(instances[, references])

        Construct a new data table containing the given data
        instances. These can be given either as another :obj:`Table`
        or as a Python list containing instances of
        :obj:`Orange.data.Instance`.

        If the optional second argument is True, the first argument
        must be a :obj:`Table`. The new table will contain references
        to data stored in the given table. If the second argument is
        omitted or False, data instances are copied.

        :param instances: data instances
        :type instances: Table or list
        :param references: if True, the new table contains references
        :type references: bool

    .. method:: __init__(domain, instances)

        Construct a new data table with a given domain and initialize
        it with the given instances. Instances can be given as a
        :obj:`Table` (if domains do not match, they are converted),
        as a list containing either instances of
        :obj:`Orange.data.Instance` or lists, or as a numpy array.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain
        :param instances: data instances
        :type instances: Table or list or numpy.array

        The following example fills the data table created above with
        some data from a list.

        .. literalinclude:: code/datatable1.py
            :lines: 29-34

        The following example shows initializing a data table from
        a numpy array.

        .. literalinclude:: code/datatable1.py
            :lines: 38-41

    .. method:: __init__(tables)

        Construct a table by combining data instances from a list of
        tables. All tables must have the same length. Domains are
        combined so that each (ordinary) feature appears only once in
        the resulting table. The class attribute is the last class
        attribute in the list of tables; for instance, if three tables
        are merged but the last one is class-less, the class attribute
        for the new table will come from the second table. Meta
        attributes for the new domain are merged based on ids: if the
        same attribute appears under two ids, it is added
        twice. Conversely, if the same id refers to two different
        attributes in two tables, an exception is raised. As
        instances are merged, Orange checks that the features and meta
        attributes that appear in multiple tables have the same value
        in all of them. Missing values are allowed.

        Note that this is not SQL's join operator, as it does not
        try to find matches between the tables.

        :param tables: tables to be merged into the new table
        :type tables: list of instances of :obj:`Table`

        For example, suppose the file merge1.tab contains::

            a1    a2    m1    m2
            f     f     f     f
                        meta  meta
            1     2     3     4
            5     6     7     8
            9     10    11    12

        and merge2.tab contains::

            a1    a3    m1     m3
            f     f     f      f
                        meta   meta
            1     2.5   3      4.5
            5     6.5   7      8.5
            9     10.5  11     12.5

        The two tables can be loaded, merged and printed out by the
        following script.

        .. literalinclude:: code/datatable_merge.py

        This is what the output looks like::

            Domain 1:  [a1, a2], {-2:m1, -3:m2}
            Domain 2:  [a1, a3], {-2:m1, -4:m3}
            Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

               [1, 2], {"m1":3, "m2":4}
             + [1, 2.5], {"m1":3, "m3":4.5}
            -> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

               [5, 6], {"m1":7, "m2":8}
             + [5, 6.5], {"m1":7, "m3":8.5}
            -> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

               [9, 10], {"m1":11, "m2":12}
             + [9, 10.5], {"m1":11, "m3":12.5}
            -> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

        Merging succeeds since the values of `a1` and `m1` are the
        same for all matching instances from both tables.

    .. method:: append(inst)

        Append the given instance to the end of the table.

        :param inst: instance to be appended
        :type inst: :obj:`Orange.data.Instance` or a list

        .. literalinclude:: code/datatable1.py
            :lines: 21-24

    .. method:: extend(instances)

        Append the given list of instances to the end of the table.

        :param instances: instances to be appended
        :type instances: list

    .. method:: select(filt[, idx, negate=False])

        Return a subset of instances as a new :obj:`Table`. The first
        argument should be a list of the same length as the table; its
        elements should be integers or bools. The resulting table
        contains the instances corresponding to non-zero elements of the
        list.

        If the second argument is given, it must be an integer;
        :obj:`select` then returns the data instances for which the
        corresponding element of `filt` equals `idx`.

        The third argument, `negate`, can only be given as a
        keyword. Its effect is to negate the selection.

        Note: This method should be used when the selected data
        instances are going to be modified. In all other cases, method
        :obj:`select_ref` is preferred.

        :param filt: filter list
        :type filt: list of integers
        :param idx: the value that elements of `filt` must match
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

        One common use of this method is to split the data into
        folds. A list for the first argument can be prepared using
        `Orange.core.MakeRandomIndicesCV`. The following example
        prepares a simple data table and indices for four-fold cross
        validation, and then selects the training and testing sets for
        each fold.

        .. literalinclude:: code/datatable2.py
            :lines: 7-27

        The printout begins with::

            Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

            Fold 0: train
                 [0.000000]
                 [2.000000]
                 [3.000000]
                 [5.000000]
                 [7.000000]
                 [8.000000]
                 [9.000000]

                  : test
                 [1.000000]
                 [4.000000]
                 [6.000000]

        Another way of calling the method is to use a vector of
        zeros and ones.

        .. literalinclude:: code/datatable2.py
            :lines: 29-31

        This prints out::

            [0.000000]
            [1.000000]
            [9.000000]

    .. method:: select_ref(filt[, idx, negate=False])

        Same as :obj:`select`, except that the resulting table
        contains references to data instances in the original table
        instead of its own copies.

        In most cases, this method is preferred over :obj:`select`
        since it consumes much less memory.

        :param filt: filter list
        :type filt: list of integers
        :param idx: the value that elements of `filt` must match
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: select_list(filt[, idx, negate=False])

        Same as :obj:`select`, except that it returns a Python list
        of data instances.

        :param filt: filter list
        :type filt: list of integers
        :param idx: the value that elements of `filt` must match
        :type idx: int
        :param negate: negates the selection
        :type negate: bool
        :rtype: list

    .. method:: get_items(indices)

        Return a table with data instances indicated by indices. For
        instance, `data.get_items([0, 1, 9])` returns a table with
        the instances at indices 0, 1 and 9.

        This method is useful when the data is going to be modified. If
        not, use :obj:`get_items_ref`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items_ref(indices)

        Same as :obj:`get_items`, except that it returns a table with
        references to data instances instead of copies. This method is
        normally preferred over :obj:`get_items`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: filter(conditions)

        Return a table with data instances matching the
        criteria. These can be given in the form of keyword arguments or a
        dictionary; with the latter, an additional keyword argument
        `negate` can be given to reverse the selection.

        Note that method :obj:`filter_ref` is more memory efficient and
        should be preferred when data is not going to be modified.

        For example, young patients from the lenses data set can be
        selected by ::

            young = data.filter(age="young")

        More than one value can be allowed and more than one attribute
        checked. This selects all patients with age "young" or
        "presbyopic" who are astigmatic::

            young = data.filter(age=["young", "presbyopic"], astigm="y")

        The following has the same effect::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"})

        Selection can be reversed only with the latter form, by adding
        a keyword argument `negate` with value 1::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"},
                                negate=1)

        Filters for continuous features are specified by pairs of
        values. In the "bridges" data set, bridges with lengths between
        1000 and 2000 (inclusive) are selected by ::

            mid = data.filter(LENGTH=(1000, 2000))

        Bridges that are shorter or longer than that can be selected
        by inverting the range::

            mid = data.filter(LENGTH=(2000, 1000))

    .. method:: filter(filt)

        Similar to :obj:`filter` with conditions, except that the
        conditions are given as an :obj:`Orange.core.Filter`.

    .. method:: filter_ref(conditions)
                filter_ref(filt)

        Same as the two forms of :obj:`filter`, except that they return
        a table with references to instances instead of copies.

    .. method:: filter_list(conditions)
                filter_list(filt)

        Same as the two forms of :obj:`filter`, except that they return
        a plain Python list of data instances.

    .. method:: filter_bool(conditions)
                filter_bool(filt)

        Return a list of bools denoting which data instances are
        accepted by the conditions or the filter.

    .. method:: translate(domain)

        Return a new data table in which data instances are
        translated into the given domain.

        :param domain: new domain
        :type domain: :obj:`Orange.data.Domain`
        :rtype: :obj:`Orange.data.Table`

    .. method:: translate(features[, keep_metas])

        Similar to the form above, except that the domain is given by a
        list of features. If `keep_metas` is True, the new data
        instances will also have all the meta attributes from the
        original domain.

        :param features: features for the new data
        :type features: list
        :param keep_metas: if True, keep the meta attributes of the original domain
        :type keep_metas: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: checksum()

        Return a CRC32 computed over all discrete and continuous
        features and class attributes of all data instances. Meta
        attributes and features of other types are ignored.

        :rtype: int

    .. method:: has_missing_values()

        Return True if any data instance has a missing
        value. Meta attributes are not checked.

    .. method:: has_missing_classes()

        Return True if any instance has a missing class value.

    .. method:: random_example()

        Return a random data instance from the
        table. The table's own :obj:`random_generator` is used,
        which is initially seeded with 0, so results are
        deterministic.

    .. method:: remove_duplicates([weightID])

        Remove duplicate data instances. If `weightID` is given,
        a meta attribute is added that contains the number of
        instances merged into each new instance.

        :param weightID: id for meta attribute with weight
        :type weightID: int
        :rtype: None

    .. method:: sort([features])

        Sort the data by attribute values. The argument gives the
        features ordered by importance. If omitted, the order from
        the domain is used. Note that the values of discrete
        features are not ordered alphabetically but according to
        the order in :obj:`Orange.data.variable.Discrete.values`.

        This sorts the data from the bridges data set by the lengths
        and years of their construction::

            data.sort(["LENGTH", "ERECTED"])

    .. method:: shuffle()

        Randomly shuffle the data instances.

    .. method:: add_meta_attribute(id[, value=1])

        Add a meta value to all data instances. The first argument
        can be an integer id, or a string or a variable descriptor
        of a meta attribute registered in the domain.

    .. method:: remove_meta_attribute(id)

        Remove a meta attribute from all data instances.
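
As a recap of the selection rule shared by ``select``, ``select_ref`` and
``select_list``, the following sketch reproduces it on ordinary Python lists.
It is a model of the documented behaviour rather than Orange's implementation;
the function name ``select_rows`` is hypothetical, and the data mirrors the
fold indices shown in the ``select`` example.

```python
def select_rows(rows, filt, idx=None, *, negate=False):
    """Pick rows by a filter list (illustrative helper, not an Orange API)."""
    if len(filt) != len(rows):
        raise ValueError("filter list must be as long as the table")
    picked = []
    for row, flag in zip(rows, filt):
        # Without idx, non-zero flags select; with idx, flags equal to idx select.
        keep = bool(flag) if idx is None else flag == idx
        if keep != negate:  # negate flips the decision
            picked.append(row)
    return picked


rows = list(range(10))
indices = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]

test = select_rows(rows, indices, 0)                 # fold 0
train = select_rows(rows, indices, 0, negate=True)   # everything else
mask = select_rows(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

print(train)  # [0, 2, 3, 5, 7, 8, 9]
print(test)   # [1, 4, 6]
print(mask)   # [0, 1, 9]
```

Making ``negate`` keyword-only in the sketch matches the statement that it can
only be given as a keyword.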