..
   Source: orange/docs/reference/rst/Orange.data.table.rst @ 10246:11b418321f79
   Revision 10246:11b418321f79, 20.7 KB, checked in by janezd, 2 years ago.
   Commit message: Unified argument names in 2.5 and 3.0; numerous other
   changes in documentation.

.. py:currentmodule:: Orange.data

======================
Data table (``Table``)
======================

Class `Orange.data.Table` holds a list of data instances of type
:obj:`Orange.data.Instance`. All instances belong to the same domain
(:obj:`Orange.data.Domain`).

Data tables are usually loaded from a file (see :doc:`Orange.data.formats`)::

    import Orange
    data = Orange.data.Table("titanic")

Data tables can also be created programmatically, as in the :ref:`code
below <example-table-prog1>`.

:obj:`Table` supports most list-like operations: getting, setting and
removing data instances, as well as the methods :obj:`append` and
:obj:`extend`. When setting an item, the new value must be either an
instance of the correct type or a Python list of appropriate length
and content, which is converted into a data instance of the
corresponding domain. Retrieving data instances returns references,
not copies: changing the retrieved instance changes the data in the
table. Slicing returns an ordinary Python list containing references to
data instances, not a new :obj:`Orange.data.Table`.

Following the Python convention, a data table is considered ``False``
when empty.
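
The list-like behaviour described above can be illustrated without Orange.
The following is a minimal plain-Python sketch; ``TableSketch`` is a
hypothetical stand-in written for this illustration and is not how
:obj:`Table` is implemented::

    class TableSketch(object):
        """Hypothetical stand-in for a data table; not Orange's code."""

        def __init__(self, rows):
            # Keep the row objects themselves, so indexing returns references.
            self._rows = list(rows)

        def __len__(self):
            # An empty table has length 0 and is therefore falsy.
            return len(self._rows)

        def __getitem__(self, index):
            # An integer returns the stored row; a slice returns a plain list.
            return self._rows[index]


    table = TableSketch([[1.0, "yes"], [2.0, "no"]])
    row = table[0]
    row[0] = 5.0                  # changes the data held by the table
    first_two = table[:2]         # an ordinary list of references
    empty_is_false = not TableSketch([])

After running the sketch, ``table[0][0]`` is ``5.0`` because ``row`` and
``table[0]`` are the same object.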

.. class:: Table

    .. attribute:: domain

        The domain to which the instances belong. This
        attribute is read-only.

    .. attribute:: owns_instances

        ``True`` if the table contains the data instances and ``False`` if
        it contains references to instances owned by another table.

    .. attribute:: owner

        The actual owner of the data when ``owns_instances`` is ``False``;
        ``None`` otherwise.

    .. attribute:: version

        An integer that is increased when instances are added to or
        removed from the table. Changes to the values of existing
        instances are not detected.

    .. attribute:: random_generator

        Random generator that is used by method
        :obj:`random_instance`. If the method is called and
        ``random_generator`` is ``None``, a new generator is constructed
        with random seed 0 and stored here for future use.

    .. attribute:: attribute_load_status

        If the table was loaded from a file, this list of flags tells
        whether the feature descriptors were reused and how they
        matched. See :ref:`descriptor reuse <variable_descriptor_reuse>`
        for details.

    .. attribute:: meta_attribute_load_status

        A dictionary holding the same information for meta
        attributes, with keys corresponding to their ids and values to
        load statuses.

    .. method:: __init__(filename[, create_new_on])

        Read data from the given file. If the name includes an
        extension, it must be one of the known file formats
        (see :doc:`/Orange.data.formats`). If no extension is given, the
        directory is searched for any file with a recognized extension. If the
        file is not found, Orange also searches the directories
        specified in the environment variable `ORANGE_DATA_PATH`.

        The optional flag ``create_new_on`` decides when variable
        descriptors are reused. See :ref:`descriptor reuse
        <variable_descriptor_reuse>` for more details.

        :param filename: the name of the file
        :type filename: str
        :param create_new_on: flag specifying when to reuse existing descriptors
        :type create_new_on: int

    .. _example-table-prog1:

    .. method:: __init__(domain)

        Construct an empty data table with the given domain.

        .. literalinclude:: code/datatable1.py
            :lines: 7-16

        The example :ref:`continues <example-table-prog2>`.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain

    .. method:: __init__(instances[, references])

        Construct a new data table containing the given data
        instances. These can be given either as another :obj:`Table`
        or as a list of instances, each represented either by a list of
        values or by an :obj:`Orange.data.Instance`.

        If the optional second argument is ``True``, the first argument
        must be a :obj:`Table`. The new table will contain references
        to data stored in the given table. If the second argument is
        omitted or ``False``, data instances are copied.

        :param instances: data instances
        :type instances: Table or list
        :param references: if ``True``, the new table contains references
        :type references: bool

    .. _example-table-prog2:

    .. method:: __init__(domain, instances)

        Construct a new data table with the given domain and initialize
        it with the given instances. Instances can be given as a
        :obj:`Table` (if domains do not match, they are converted) or
        as a list containing either instances of
        :obj:`Orange.data.Instance` or lists.

        This constructor can also be used for conversion from numpy
        arrays. The argument ``instances`` can be a numpy array. The number
        of variables in the domain must match the number of columns.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain
        :param instances: data instances
        :type instances: Table or list or numpy.array

        The following example fills the data table created :ref:`above
        <example-table-prog1>` with some data from a list.

        .. literalinclude:: code/datatable1.py
            :lines: 29-34

        The following example shows initializing a data table from
        a numpy array.

        .. literalinclude:: code/datatable1.py
            :lines: 38-41

    .. method:: __init__(tables)

        Construct a table by combining data instances from a list of
        tables. All tables must have the same length. Domains are
        combined so that each (ordinary) feature appears only once in
        the resulting table. The class attribute is the last class
        attribute in the list of tables; for instance, if three tables
        are merged but the last one is class-less, the class attribute
        for the new table will come from the second table. Meta
        attributes for the new domain are merged based on ids: if the
        same attribute appears under two ids, it will be added
        twice. Conversely, if the same id is used for two different
        attributes in two tables, an exception is raised. As
        instances are merged, an exception is raised if a feature or
        a meta attribute that appears in multiple tables does not have the
        same value in all of them; the feature is allowed to have a
        missing value in one or more (or all) tables.

        Note that this is not SQL's join operator: it does not
        try to find matches between the tables but instead merges them
        row by row.

        :param tables: tables to be merged into the new table
        :type tables: list of instances of :obj:`Table`

        For example, suppose the file merge1.tab contains::

            a1    a2    m1    m2
            f     f     f     f
                        meta  meta
            1     2     3     4
            5     6     7     8
            9     10    11    12

        and merge2.tab contains::

            a1    a3    m1     m3
            f     f     f      f
                        meta   meta
            1     2.5   3      4.5
            5     6.5   7      8.5
            9     10.5  11     12.5

        The two tables can be loaded, merged and printed out by the
        following script.

        .. literalinclude:: code/datatable_merge.py

        This is what the output looks like::

            Domain 1:  [a1, a2], {-2:m1, -3:m2}
            Domain 2:  [a1, a3], {-2:m1, -4:m3}
            Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

               [1, 2], {"m1":3, "m2":4}
             + [1, 2.5], {"m1":3, "m3":4.5}
            -> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

               [5, 6], {"m1":7, "m2":8}
             + [5, 6.5], {"m1":7, "m3":8.5}
            -> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

               [9, 10], {"m1":11, "m2":12}
             + [9, 10.5], {"m1":11, "m3":12.5}
            -> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

        Merging succeeds since the values of `a1` and `m1` are the
        same for all matching instances from both tables.
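
        The row-by-row merging rule can be sketched in plain Python. This is
        only an illustration under simplifying assumptions: rows are modelled
        as dictionaries, ``None`` stands for a missing value, and the helper
        names ``merge_rows`` and ``merge_tables`` are hypothetical, not part
        of Orange::

            def merge_rows(rows):
                """Merge dicts describing the same row in several tables."""
                merged = {}
                for row in rows:
                    for name, value in row.items():
                        known = merged.get(name)
                        if known is None:
                            # First occurrence, or only missing values so far.
                            merged[name] = value
                        elif value is not None and value != known:
                            raise ValueError("conflicting values for %r" % name)
                return merged


            def merge_tables(tables):
                """Merge equally long tables position by position."""
                if len(set(len(table) for table in tables)) > 1:
                    raise ValueError("all tables must have the same length")
                # No key matching takes place: row i meets row i.
                return [merge_rows(rows) for rows in zip(*tables)]


            table1 = [{"a1": 1, "a2": 2, "m1": 3, "m2": 4}]
            table2 = [{"a1": 1, "a3": 2.5, "m1": 3, "m3": 4.5}]
            merged = merge_tables([table1, table2])

        Here ``merged[0]`` combines the first rows of the two example files,
        while a differing value of ``a1`` in the two rows would raise an
        exception.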

    .. method:: append(instance)

        Append the given instance to the end of the table.

        :param instance: instance to be appended
        :type instance: :obj:`Orange.data.Instance` or a list

        .. literalinclude:: code/datatable1.py
            :lines: 21-24

    .. method:: extend(instances)

        Append the given list of instances to the end of the table.

        :param instances: instances to be appended
        :type instances: list

    .. method:: select(folds[, select, negate=False])

        Return a subset of instances as a new :obj:`Table`. The first
        argument should be a list of the same length as the table; its
        elements should be integers or bools. The resulting table
        contains instances corresponding to non-zero elements of the
        list.

        If the second argument is given, it must be an integer; method
        ``select`` will then return the data instances for which the
        corresponding element of ``folds`` matches the value of the
        argument ``select``.

        The third argument, `negate`, inverts the selection. It can
        only be given as a keyword.

        Note: This method should be used when the selected data
        instances are going to be modified later on. In all other
        cases, method :obj:`select_ref` is preferred.

        :param folds: list of fold indices corresponding to data instances
        :type folds: list
        :param select: select which instances to pick
        :type select: int
        :param negate: inverts the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

        One common use of this method is to split the data into
        folds. A list for the first argument can be prepared using
        `Orange.data.sample.SubsetIndicesCV`. The following example
        prepares a simple data table and indices for four-fold cross
        validation, and then selects the training and testing sets for
        each fold.

        .. literalinclude:: code/datatable2.py
            :lines: 7-27

        The printout begins with::

            Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

            Fold 0: train
                 [0.000000]
                 [2.000000]
                 [3.000000]
                 [5.000000]
                 [7.000000]
                 [8.000000]
                 [9.000000]

                  : test
                 [1.000000]
                 [4.000000]
                 [6.000000]

        Another form of calling the method is to use a vector of
        zeros and ones.

        .. literalinclude:: code/datatable2.py
            :lines: 29-31

        This prints out::

            [0.000000]
            [1.000000]
            [9.000000]
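
        The selection rule itself is simple enough to model on ordinary
        lists. The following sketch is not Orange's implementation;
        ``select_rows`` is a hypothetical helper that only mirrors the
        behaviour of the three arguments as described above::

            def select_rows(rows, folds, select=None, negate=False):
                """Plain-Python model of the selection rule."""
                if len(rows) != len(folds):
                    raise ValueError("folds must be as long as the table")
                if select is None:
                    # Without ``select``, non-zero (or True) elements pick a row.
                    chosen = [bool(fold) for fold in folds]
                else:
                    # With ``select``, a row is picked if its fold index equals it.
                    chosen = [fold == select for fold in folds]
                # ``negate`` flips the decision for every row.
                return [row for row, keep in zip(rows, chosen) if keep != negate]


            rows = list(range(10))
            folds = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]
            test = select_rows(rows, folds, 0)
            train = select_rows(rows, folds, 0, negate=True)
            picked = select_rows(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

        With the fold indices from the printout above, ``test`` holds rows
        1, 4 and 6, ``train`` holds the remaining seven rows, and ``picked``
        holds rows 0, 1 and 9.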

    .. method:: select_ref(folds[, select, negate=False])

        Same as :obj:`select`, except that the resulting table
        contains references to data instances in the original table
        instead of its own copy of data.

        In most cases, this method is preferred over :obj:`select`
        since it consumes less memory.

        :param folds: list of fold indices corresponding to data instances
        :type folds: list
        :param select: select which instances to pick
        :type select: int
        :param negate: inverts the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items(indices)

        Return a table with data instances indicated by indices. For
        instance, `data.get_items([0, 1, 9])` returns a table with
        instances with indices 0, 1 and 9.

        This method is useful when data is going to be modified. If
        not, use :obj:`get_items_ref`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items_ref(indices)

        Same as :obj:`get_items`, except that it returns a table with
        references to data instances. This method is usually
        preferred over :obj:`get_items`.

        :param indices: indices of selected data instances
        :type indices: list of ints
        :rtype: :obj:`Orange.data.Table`

    .. method:: filter(conditions)

        Return a table with data instances matching the
        criteria. These can be given in the form of keyword arguments or a
        dictionary; with the latter, an additional keyword argument
        ``negate`` can be given to reverse the selection.

        Note that method :obj:`filter_ref` is more memory efficient and
        should be preferred when data is not going to be modified.

        Young patients from the lenses data set can be selected by ::

            young = data.filter(age="young")

        More than one value can be allowed and more than one attribute
        checked. This selects all patients with age "young" or
        "presbyopic" who are astigmatic::

            young = data.filter(age=["young", "presbyopic"], astigm="y")

        The following has the same effect::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"})

        Selection can be reversed only in the latter form, by adding
        a keyword argument ``negate`` with value 1::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"},
                                negate=1)

        Filters for continuous features are specified by pairs of
        values. In dataset "bridges", bridges with lengths between
        1000 and 2000 (inclusive) are selected by ::

            mid = data.filter(LENGTH=(1000, 2000))

        Bridges that are shorter or longer than that can be selected
        by inverting the range. ::

            mid = data.filter(LENGTH=(2000, 1000))
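
        How such conditions might be evaluated can be sketched on plain
        dictionaries. The helpers below are hypothetical and are not Orange's
        filter code; in particular, the sketch assumes that an inverted range
        excludes both bounds, which the text above does not specify::

            def accepts(value, condition):
                """Check one value against one condition."""
                if isinstance(condition, tuple):
                    low, high = condition
                    if low <= high:
                        # Ordinary range: bounds are included.
                        return low <= value <= high
                    # Inverted range: accept values outside the interval.
                    return value < high or value > low
                if isinstance(condition, list):
                    # Several allowed discrete values.
                    return value in condition
                return value == condition


            def filter_flags(rows, conditions, negate=False):
                """Return one bool per row; all listed attributes must match."""
                flags = []
                for row in rows:
                    matched = all(accepts(row[name], cond)
                                  for name, cond in conditions.items())
                    flags.append(matched != bool(negate))
                return flags


            patients = [
                {"age": "young", "astigm": "y"},
                {"age": "presbyopic", "astigm": "n"},
                {"age": "pre-presbyopic", "astigm": "y"},
            ]
            conditions = {"age": ["young", "presbyopic"], "astigm": "y"}
            flags = filter_flags(patients, conditions)
            reversed_flags = filter_flags(patients, conditions, negate=1)

        Only the first patient satisfies both conditions, so ``flags`` marks
        the first row and ``reversed_flags`` marks the other two.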

    .. method:: filter(filt)

            Similar to the above, except that conditions are given as an
            :obj:`Orange.core.Filter`.

    .. method:: filter_ref(conditions), filter_ref(filter)

            Same as the above two, except that they return a table
            with references to instances instead of their copies.

    .. method:: filter_bool(conditions), filter_bool(filter)

            Return a list of bools denoting which data instances are
            accepted by the conditions or the filter.

    .. method:: translate(domain)

            Return a new data table in which data instances are
            translated into the given domain.

            :param domain: new domain
            :type domain: :obj:`Orange.data.Domain`
            :rtype: :obj:`Orange.data.Table`

    .. method:: translate(variables[, keep_metas])

            Similar to the above, except that the domain is given by a
            list of features. If ``keep_metas`` is ``True``, the new data
            instances will also have all the meta attributes from the
            original domain.

            :param variables: variables for the new data
            :type variables: list
            :rtype: :obj:`Orange.data.Table`

    .. method:: to_numpy(content, weightID, multinominal)

        Convert a data table to a numpy array. Raises an exception if the data
        contains undefined values. :obj:`to_numpyMA` converts to a masked
        array where the mask denotes the defined values. (For conversion
        from numpy, see the constructor.)

        The function returns a tuple with the array and, depending on
        arguments, some vectors. The argument ``content`` is a string
        separated into two parts by a slash. The part to the left of the
        slash describes the content of the array; the part to the right
        lists the vectors. The content is described with the following
        characters:

        ``a``
            features (without the class); can only appear on the left

        ``A``
            like ``a``, but raises an exception if there are no features

        ``c``
            class value represented as an index of the value (0, 1, 2...);
            if the data has no class, the column is omitted (if ``c`` is to
            the left of the slash) or the tuple will contain ``None``
            instead of the vector.

        ``C``
            like ``c``, but raises an exception if the data has no class

        ``w``
            instance weight; as for ``c``, the column is omitted or
            ``None`` is returned instead of the vector if the argument
            ``weightID`` is missing.

        ``W``
            instance weight; raises an exception if ``weightID``
            is missing.

        ``0``
            a vector of zeros

        ``1``
            a vector of ones

        The default content is ``a/cw``: an array with feature values and
        separate vectors with classes and weights. Specifying an empty string
        has the same effect. If the elements to the right of the slash repeat,
        the function returns the same Python object, e.g. in ``acc000/cwww`` the
        three weight vectors are one and the same Python object, so modifying
        one will change all three of them.

        This is the default behaviour on the iris data set with 150 data
        instances described by four features and a class value::

            >>> data = Orange.data.Table("iris")
            >>> a, c, w = data.to_numpy()
            >>> a.shape
            (150, 4)
            >>> c.shape
            (150,)
            >>> print w
            None
            >>> a[0]
            array([ 5.0999999 ,  3.5       ,  1.39999998,  0.2       ])
            >>> c[0]
            0.0

        For a more complicated example, the array will contain a column with
        the class, the features, a column of ones, two columns with the
        class and a column of zeros::

            >>> a, = data.to_numpy("ca1cc0")
            >>> a[0]
            array([ 0., 5.0999999, 3.5       , 1.39999998, 0.2       , 1., 0., 0., 0.])
            >>> a[130]
            array([ 2., 7.4000001, 2.79999995, 6.0999999 , 1.89999998, 1., 2., 2., 0.])
            >>> c[120]
            2.0
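
        The structure of the ``content`` string can be made explicit with a
        small parser. This is a sketch of the format as described above, not
        the code Orange uses; ``parse_content`` is a hypothetical helper::

            def parse_content(content="a/cw"):
                """Split a content string into array columns and vectors."""
                if not content:
                    content = "a/cw"      # an empty string means the default
                left, _, right = content.partition("/")
                allowed = set("aAcCwW01")
                for char in left + right:
                    if char not in allowed:
                        raise ValueError("unknown content character %r" % char)
                if "a" in right or "A" in right:
                    raise ValueError("features can only appear on the left")
                return list(left), list(right)


            columns, vectors = parse_content("ca1cc0")
            default_columns, default_vectors = parse_content("")

        The number of characters to the right of the slash determines how
        many vectors accompany the array in the returned tuple: two for the
        default ``a/cw`` and none for ``ca1cc0``, which matches the unpacking
        used in the two examples above.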

        The third argument specifies the treatment of non-continuous
        non-binary values (binary values are always translated to 0.0 or
        1.0). The argument's value can be
        :obj:`Orange.data.Table.Multinomial_Ignore` (such features are
        omitted), :obj:`Orange.data.Table.Multinomial_AsOrdinal` (the
        values' indices are treated as continuous numbers) or
        :obj:`Orange.data.Table.Multinomial_Error` (an exception is raised
        if such features are encountered). The default treatment is
        :obj:`Orange.data.Table.Multinomial_AsOrdinal`.

        When the class attribute is discrete and has more than two values,
        an exception is raised unless multinomial attributes are treated as
        ordinal. More options for treating multinomial values are available
        in :obj:`Orange.data.continuization`.

    .. method:: to_numpyMA(content, weightID, multinominal)

        Similar to :obj:`to_numpy` except that it returns a masked array
        with the mask representing the (un)defined values.

    .. method:: checksum()

            Return a CRC32 computed over all discrete and continuous
            features and class attributes of all data instances.

            :rtype: int

    .. method:: has_missing_values()

            Return ``True`` if any of the data instances has any missing
            values. Meta attributes are not checked.

    .. method:: has_missing_classes()

            Return ``True`` if any instance is missing the class value.

    .. method:: random_instance()

            Return a random instance from the
            table. The data table's :obj:`random_generator` is used,
            which is initially seeded with 0, so results are
            deterministic.
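
            The effect of a lazily created generator with a fixed seed can be
            shown with Python's own ``random`` module. This is only an
            analogy: ``RandomPicker`` is a hypothetical class, and Orange's
            generator is not Python's ``random.Random``::

                import random


                class RandomPicker(object):
                    """Hypothetical model of a lazily seeded generator."""

                    def __init__(self, rows):
                        self.rows = rows
                        self.random_generator = None

                    def random_instance(self):
                        if self.random_generator is None:
                            # Created on first use with seed 0 and then kept.
                            self.random_generator = random.Random(0)
                        return self.random_generator.choice(self.rows)


                picker_a = RandomPicker(list(range(100)))
                picker_b = RandomPicker(list(range(100)))
                draws_a = [picker_a.random_instance() for _ in range(5)]
                draws_b = [picker_b.random_instance() for _ in range(5)]

            Both pickers produce the same sequence of draws because each
            starts from the same seed.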

    .. method:: remove_duplicates([weightID])

            Remove duplicates of data instances. If ``weightID`` is given,
            a meta attribute is added which contains the number of
            instances merged into each new instance.

            :param weightID: id for meta attribute with weight
            :type weightID: int
            :rtype: None
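
            The bookkeeping behind the weights can be sketched on plain
            lists. Unlike the method documented here, which changes the table
            and returns ``None``, the hypothetical helper ``count_duplicates``
            returns its results; keeping rows in first-seen order is also an
            assumption of the sketch::

                from collections import OrderedDict


                def count_duplicates(rows):
                    """Return distinct rows and how often each occurred."""
                    counts = OrderedDict()
                    for row in rows:
                        key = tuple(row)
                        counts[key] = counts.get(key, 0) + 1
                    unique = [list(key) for key in counts]
                    weights = list(counts.values())
                    return unique, weights


                unique, weights = count_duplicates(
                    [[1, "a"], [2, "b"], [1, "a"], [1, "a"]])

            The row ``[1, "a"]`` occurs three times, so it is kept once with
            weight 3, while ``[2, "b"]`` gets weight 1.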

    .. method:: sort([variables])

            Sort the data table. The argument lists the variables to
            sort by, ordered by importance. If omitted, the order from
            the domain is used. Values of discrete
            features are not ordered alphabetically but according to
            the :obj:`Orange.feature.Discrete.values`.

            This sorts the data from the bridges data set by the lengths
            and years of their construction::

                data.sort(["LENGTH", "ERECTED"])
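
            The difference between alphabetical order and the order of
            declared values can be seen in a plain-Python sketch. The
            attribute names, the data and the helper ``sort_key`` are made up
            for this illustration and do not come from Orange::

                def sort_key(row, variables, value_order):
                    """Rank discrete values by their declared position."""
                    key = []
                    for name in variables:
                        value = row[name]
                        if name in value_order:
                            key.append(value_order[name].index(value))
                        else:
                            key.append(value)
                    return key


                value_order = {"size": ["small", "medium", "large"]}
                rows = [
                    {"size": "large", "year": 1900},
                    {"size": "small", "year": 1950},
                    {"size": "medium", "year": 1870},
                    {"size": "small", "year": 1820},
                ]
                rows.sort(key=lambda row: sort_key(row, ["size", "year"],
                                                   value_order))

            After sorting, the two ``small`` rows come first (ordered by
            year), followed by ``medium`` and ``large``; an alphabetical sort
            would have put ``large`` first.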

    .. method:: shuffle()

            Randomly shuffle the data instances.

    .. method:: add_meta_attribute(attr[, value=1])

            Add a meta value to all data instances. The first argument
            can be an integer id, or a string or a variable descriptor
            of a meta attribute registered in the domain.

    .. method:: remove_meta_attribute(attr)

            Remove a meta attribute from all data instances.