source: orange/docs/reference/rst/Orange.data.table.rst @ 11028:009ba5a75e30

Revision 11028:009ba5a75e30, 20.9 KB checked in by Miha Stajdohar <miha.stajdohar@…>, 17 months ago

Added a common documentation index.

.. py:currentmodule:: Orange.data

======================
Data table (``Table``)
======================

Class :obj:`Orange.data.Table` holds a list of data instances of type
:obj:`Orange.data.Instance`. All instances belong to the same domain
(:obj:`Orange.data.Domain`).

Data tables are usually loaded from a file (see :ref:`Orange-data-formats`)::

    import Orange
    data = Orange.data.Table("titanic")

Data tables can also be created programmatically, as in the :ref:`code
below <example-table-prog1>`.

:obj:`Table` supports most list-like operations: getting, setting and
removing data instances, as well as the methods :obj:`append` and
:obj:`extend`. When setting an item, the new value must be either a data
instance of the correct type or a Python list of appropriate length and
content, which is converted into a data instance of the corresponding
domain. Retrieving data instances returns references, not copies:
changing a retrieved instance changes the data in the table. Slicing
returns an ordinary Python list containing references to data instances,
not a new :obj:`Orange.data.Table`.

Following Python convention, a data table is considered ``False``
when empty.
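
The reference semantics described above can be illustrated without Orange.
The following is a minimal sketch in plain Python; ``TinyTable`` is a
hypothetical stand-in, not part of Orange, and it models only indexing,
slicing and truth testing.

```python
class TinyTable(object):
    """Hypothetical stand-in for a data table; not Orange's implementation."""

    def __init__(self, rows):
        # Each row is a mutable list standing in for a data instance.
        self._rows = [list(row) for row in rows]

    def __len__(self):
        # Python treats an object whose length is 0 as False.
        return len(self._rows)

    def __getitem__(self, index):
        # An integer index returns the stored row itself (a reference);
        # a slice returns an ordinary list of references to stored rows.
        return self._rows[index]


table = TinyTable([[1, 2], [3, 4], [5, 6]])
empty = TinyTable([])

row = table[0]
row[1] = 20            # changes the data held by the table
part = table[1:]       # a plain list, not a TinyTable

print(table[0])
print(isinstance(part, list))
print(part[0] is table[1])
print(bool(empty))
```

The real class also supports assignment, deletion, :obj:`append` and
:obj:`extend`, which this sketch leaves out.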

.. class:: Table

    .. attribute:: domain

        The domain to which the instances belong. This
        attribute is read-only.

    .. attribute:: owns_instances

        ``True`` if the table contains the data instances and ``False`` if
        it contains references to instances owned by another table.

    .. attribute:: owner

        The actual owner of the data when ``owns_instances`` is ``False``;
        ``None`` otherwise.

    .. attribute:: version

        An integer that is increased whenever instances are added to or
        removed from the table. It does not change when the data in
        existing instances is modified.

    .. attribute:: random_generator

        Random generator used by the method
        :obj:`random_instance`. If the method is called while
        ``random_generator`` is ``None``, a new generator is constructed
        with random seed 0 and stored here for future use.

    .. attribute:: attribute_load_status

        If the table was loaded from a file, this list of flags tells
        whether the feature descriptors were reused and how they
        matched. See :ref:`descriptor reuse <variable_descriptor_reuse>`
        for details.

    .. attribute:: meta_attribute_load_status

        A dictionary holding the same information for meta
        attributes, with keys corresponding to their ids and values to
        load statuses.

    .. method:: __init__(filename[, create_new_on])

        Read data from the given file. If the name includes an
        extension, it must be one of the known file formats
        (see :ref:`Orange-data-formats`). If no extension is given, the
        directory is searched for any file with a recognized extension. If the
        file is not found, Orange also searches the directories
        specified in the environment variable ``ORANGE_DATA_PATH``.

        The optional flag ``create_new_on`` decides when variable
        descriptors are reused. See :ref:`descriptor reuse
        <variable_descriptor_reuse>` for more details.

        :param filename: the name of the file
        :type filename: str
        :param create_new_on: flag specifying when to reuse existing descriptors
        :type create_new_on: int

    .. _example-table-prog1:

    .. method:: __init__(domain)

        Construct an empty data table with the given domain.

        .. literalinclude:: code/datatable1.py
            :lines: 7-16

        The example :ref:`continues <example-table-prog2>`.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain

    .. method:: __init__(instances[, references])

        Construct a new data table containing the given data
        instances. These can be given either as another :obj:`Table`
        or as a list of instances, each of which is either a list of
        values or an :obj:`Orange.data.Instance`.

        If the optional second argument is ``True``, the first argument
        must be a :obj:`Table`. The new table will contain references
        to data stored in the given table. If the second argument is
        omitted or ``False``, data instances are copied.

        :param instances: data instances
        :type instances: Table or list
        :param references: if ``True``, the new table contains references
        :type references: bool
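
        The difference between copying and referencing can be sketched in
        plain Python. The nested lists below are only stand-ins for tables
        and instances; nothing here calls Orange.

        ```python
        import copy

        source = [[1.0, "a"], [2.0, "b"]]   # stand-in for an existing table

        copied = copy.deepcopy(source)      # like references omitted or False
        referenced = list(source)           # like references=True: same rows

        source[0][0] = 99.0                 # modify a row of the original

        print(copied[0][0])                 # the copy keeps the old value
        print(referenced[0][0])             # the referencing list sees the change
        ```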

    .. _example-table-prog2:

    .. method:: __init__(domain, instances)

        Construct a new data table with the given domain and initialize
        it with the given instances. Instances can be given as a
        :obj:`Table` (if the domains do not match, the instances are
        converted) or as a list containing either instances of
        :obj:`Orange.data.Instance` or lists.

        This constructor can also be used for conversion from numpy
        arrays. The argument ``instances`` can be a numpy array. The number
        of variables in the domain must match the number of columns.

        :param domain: domain descriptor
        :type domain: Orange.data.Domain
        :param instances: data instances
        :type instances: Table or list or numpy.array

        The following example fills the data table created :ref:`above
        <example-table-prog1>` with some data from a list.

        .. literalinclude:: code/datatable1.py
            :lines: 29-34

        The following example shows initializing a data table from
        a numpy array.

        .. literalinclude:: code/datatable1.py
            :lines: 38-41

    .. method:: __init__(tables)

        Construct a table by combining data instances from a list of
        tables. All tables must have the same length. Domains are
        combined so that each (ordinary) feature appears only once in
        the resulting table. The class attribute is the last class
        attribute in the list of tables, while all other class attributes
        are added as ordinary features. For instance, if three tables
        are merged but the last one is class-less, the class attribute
        for the new table will come from the second table. Meta
        attributes for the new domain are merged based on ids: if the
        same attribute appears under two ids, it is added
        twice. If, on the other hand, the same id is used for two different
        attributes in two tables, an exception is raised. As
        instances are merged, an exception is raised if a feature or
        a meta attribute that appears in multiple tables does not have the
        same value in all of them; the feature is allowed to have a
        missing value in one or more (or all) tables.

        Note that this is not an SQL join: it does not
        try to find matches between the tables but instead merges them
        row by row.

        :param tables: tables to be merged into the new table
        :type tables: list of instances of :obj:`Table`

        For example, suppose the file ``merge1.tab`` contains::

            a1    a2    m1    m2
            f     f     f     f
                        meta  meta
            1     2     3     4
            5     6     7     8
            9     10    11    12

        and ``merge2.tab`` contains::

            a1    a3    m1     m3
            f     f     f      f
                        meta   meta
            1     2.5   3      4.5
            5     6.5   7      8.5
            9     10.5  11     12.5

        The two tables can be loaded, merged and printed out by the
        following script.

        .. literalinclude:: code/datatable_merge.py

        This is what the output looks like::

            Domain 1:  [a1, a2], {-2:m1, -3:m2}
            Domain 2:  [a1, a3], {-2:m1, -4:m3}
            Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

               [1, 2], {"m1":3, "m2":4}
             + [1, 2.5], {"m1":3, "m3":4.5}
            -> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

               [5, 6], {"m1":7, "m2":8}
             + [5, 6.5], {"m1":7, "m3":8.5}
            -> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

               [9, 10], {"m1":11, "m2":12}
             + [9, 10.5], {"m1":11, "m3":12.5}
            -> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

        Merging succeeds since the values of ``a1`` and ``m1`` are the
        same for all matching instances from both tables.
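
        The row-by-row rule can be sketched in plain Python. This is only an
        illustration under simplifying assumptions: instances are dicts keyed
        by variable name, ``None`` stands for a missing value, and the
        handling of class attributes and meta ids is left out. The helpers
        ``merge_rows`` and ``merge_tables`` are hypothetical, not Orange
        functions.

        ```python
        def merge_rows(rows):
            """Merge dicts describing the same instance in several tables."""
            merged = {}
            for row in rows:
                for name, value in row.items():
                    known = merged.get(name)
                    # Two defined but different values cannot be reconciled.
                    if known is not None and value is not None and known != value:
                        raise ValueError("conflicting values for %r" % name)
                    if known is None:
                        merged[name] = value
            return merged


        def merge_tables(tables):
            """Merge tables of equal length row by row (no key matching)."""
            if len(set(len(table) for table in tables)) > 1:
                raise ValueError("all tables must have the same length")
            return [merge_rows(rows) for rows in zip(*tables)]


        table1 = [{"a1": 1, "a2": 2, "m1": 3, "m2": 4},
                  {"a1": 5, "a2": 6, "m1": 7, "m2": 8}]
        table2 = [{"a1": 1, "a3": 2.5, "m1": 3, "m3": 4.5},
                  {"a1": 5, "a3": 6.5, "m1": 7, "m3": 8.5}]

        merged = merge_tables([table1, table2])
        for row in merged:
            print(sorted(row.items()))
        ```

        Rows are paired purely by position, which is why the tables must have
        the same length and why this differs from a join.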

    .. method:: append(instance)

        Append the given instance to the end of the table.

        :param instance: instance to be appended
        :type instance: :obj:`Orange.data.Instance` or a list

        .. literalinclude:: code/datatable1.py
            :lines: 21-24

    .. method:: extend(instances)

        Append the given list of instances to the end of the table.

        :param instances: instances to be appended
        :type instances: list

    .. method:: select(folds[, select, negate=False])

        Return a subset of instances as a new :obj:`Table`. The first
        argument should be a list of the same length as the table; its
        elements should be integers or bools. The resulting table
        contains instances corresponding to non-zero elements of the
        list.

        If the second argument is given, it must be an integer; the method
        then returns the data instances for which the
        corresponding elements of ``folds`` match the value of the
        argument ``select``.

        The third argument, ``negate``, inverts the selection. It can
        only be given as a keyword.

        Note: This method should be used when the selected data
        instances are going to be modified later on. In all other
        cases, method :obj:`select_ref` is preferred.

        :param folds: list of fold indices corresponding to data instances
        :type folds: list
        :param select: select which instances to pick
        :type select: int
        :param negate: inverts the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

        One common use of this method is to split the data into
        folds. A list for the first argument can be prepared using
        :obj:`Orange.data.sample.SubsetIndicesCV`. The following example
        prepares a simple data table and indices for four-fold cross
        validation, and then selects the training and testing sets for
        each fold.

        .. literalinclude:: code/datatable2.py
            :lines: 7-27

        The printout begins with::

            Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

            Fold 0: train
                 [0.000000]
                 [2.000000]
                 [3.000000]
                 [5.000000]
                 [7.000000]
                 [8.000000]
                 [9.000000]

                  : test
                 [1.000000]
                 [4.000000]
                 [6.000000]

        Another way of calling the method is to use a vector of
        zeros and ones.

        .. literalinclude:: code/datatable2.py
            :lines: 29-31

        This prints out::

            [0.000000]
            [1.000000]
            [9.000000]
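
        The selection rule itself can be sketched in plain Python. The
        function ``select_rows`` below is a hypothetical helper that works on
        ordinary lists and returns the selected rows themselves rather than
        copies; it is meant only to show how ``folds``, ``select`` and
        ``negate`` interact, using the fold indices from the printout above.

        ```python
        def select_rows(rows, folds, select=None, negate=False):
            """Return the rows picked by ``folds``; not Orange code."""
            if len(folds) != len(rows):
                raise ValueError("folds must be as long as the table")
            picked = []
            for row, fold in zip(rows, folds):
                # Without ``select`` any non-zero element picks the row;
                # with it, the element must equal ``select``.
                keep = bool(fold) if select is None else fold == select
                if keep != negate:
                    picked.append(row)
            return picked


        rows = [[float(i)] for i in range(10)]
        indices = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]

        test = select_rows(rows, indices, 0)
        train = select_rows(rows, indices, 0, negate=True)
        mask = select_rows(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

        print(test)
        print(train)
        print(mask)
        ```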

    .. method:: select_ref(folds[, select, negate=False])

        Same as :obj:`select`, except that the resulting table
        contains references to data instances in the original table
        instead of its own copy of the data.

        In most cases, this method is preferred over :obj:`select`
        since it consumes less memory.

        :param folds: list of fold indices corresponding to data instances
        :type folds: list
        :param select: select which instances to pick
        :type select: int
        :param negate: inverts the selection
        :type negate: bool
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items(indices)

        Return a table with the data instances indicated by ``indices``. For
        instance, ``data.get_items([0, 1, 9])`` returns a table with
        instances with indices 0, 1 and 9.

        This method is useful when the data is going to be modified. If
        not, use :obj:`get_items_ref`.

        :param indices: indices of selected data instances
        :type indices: list of int
        :rtype: :obj:`Orange.data.Table`

    .. method:: get_items_ref(indices)

        Same as :obj:`get_items`, except that it returns a table with
        references to data instances. This method is usually
        preferred over :obj:`get_items`.

        :param indices: indices of selected data instances
        :type indices: list of int
        :rtype: :obj:`Orange.data.Table`

    .. method:: filter(conditions)

        Return a table with data instances matching the
        criteria. These can be given in the form of keyword arguments or a
        dictionary; with the latter, an additional keyword argument ``negate``
        can be given to reverse the selection.

        Note that method :obj:`filter_ref` is more memory efficient and
        should be preferred when the data is not going to be modified.

        Young patients from the lenses data set can be selected by::

            young = data.filter(age="young")

        More than one value can be allowed and more than one attribute
        checked. This selects all patients with age "young" or "presbyopic"
        who are astigmatic::

            young = data.filter(age=["young", "presbyopic"], astigm="y")

        The following has the same effect::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"})

        Selection can be reversed only in the latter form, by adding
        a keyword argument ``negate`` with value 1::

            young = data.filter({"age": ["young", "presbyopic"],
                                "astigm": "y"},
                                negate=1)

        Filters for continuous features are specified by pairs of
        values. In the data set "bridges", bridges with lengths between
        1000 and 2000 (inclusive) are selected by::

            mid = data.filter(LENGTH=(1000, 2000))

        Bridges that are shorter or longer than that can be selected
        by inverting the range::

            mid = data.filter(LENGTH=(2000, 1000))
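
        How such conditions could be evaluated is sketched below in plain
        Python. The helpers ``matches`` and ``filter_flags`` are hypothetical
        and operate on dicts instead of data instances. Treating the
        boundaries of an inverted range as excluded is an assumption of this
        sketch, not something the description above specifies.

        ```python
        def matches(instance, conditions):
            """Check one instance (a dict) against filter-style conditions."""
            for name, wanted in conditions.items():
                value = instance[name]
                if isinstance(wanted, tuple):
                    low, high = wanted
                    if low <= high:
                        ok = low <= value <= high         # closed interval
                    else:
                        ok = value < high or value > low  # inverted: outside
                elif isinstance(wanted, list):
                    ok = value in wanted                  # any listed value
                else:
                    ok = value == wanted
                if not ok:
                    return False
            return True


        def filter_flags(instances, conditions, negate=False):
            """Return one bool per instance, optionally inverted."""
            return [matches(inst, conditions) != negate for inst in instances]


        bridges = [{"LENGTH": 800}, {"LENGTH": 1500}, {"LENGTH": 2400}]
        patients = [{"age": "young", "astigm": "y"},
                    {"age": "presbyopic", "astigm": "n"}]

        print(filter_flags(bridges, {"LENGTH": (1000, 2000)}))
        print(filter_flags(bridges, {"LENGTH": (2000, 1000)}))
        print(filter_flags(patients, {"age": ["young", "presbyopic"],
                                      "astigm": "y"}))
        ```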

    .. method:: filter(filt)

            Similar to the above, except that conditions are given as
            :obj:`Orange.core.Filter`.

    .. method:: filter_ref(conditions)
                filter_ref(filter)

            Same as the above two, except that they return a table
            with references to instances instead of their copies.

    .. method:: filter_bool(conditions)
                filter_bool(filter)

            Return a list of bools denoting which data instances are
            accepted by the conditions or the filter.

    .. method:: translate(domain)

            Return a new data table in which data instances are
            translated into the given domain.

            :param domain: new domain
            :type domain: :obj:`Orange.data.Domain`
            :rtype: :obj:`Orange.data.Table`

    .. method:: translate(variables[, keep_metas])

            Similar to the above, except that the domain is given by a
            list of features. If ``keep_metas`` is ``True``, the new data
            instances will also have all the meta attributes from the
            original domain.

            :param variables: variables for the new data
            :type variables: list
            :rtype: :obj:`Orange.data.Table`

    .. method:: to_numpy(content, weightID, multinominal)

        Convert a data table to a numpy array. Raises an exception if the data
        contains undefined values. :obj:`to_numpyMA` converts to a masked
        array in which the mask marks the undefined values. (For conversion
        from numpy, see the constructor.)

        The function returns a tuple with the array and, depending on the
        arguments, some vectors. The argument ``content`` is a string
        separated into two parts by a slash. The part to the left of the slash
        describes the content of the array; the part to the right
        lists the vectors. The content is described with the following
        characters:

        ``a``
            features (without the class); can only appear on the left

        ``A``
            like ``a``, but raises an exception if there are no features

        ``c``
            class value represented as an index of the value (0, 1, 2...);
            if the data has no class, the column is omitted (if ``c`` is to
            the left of the slash) or the tuple will contain ``None``
            instead of the vector.

        ``C``
            like ``c``, but raises an exception if the data has no class

        ``m``
            like ``c``, but one column for each target variable in a
            multi-target domain.

        ``M``
            synonymous with ``m``.

        ``w``
            instance weight; as with ``c``, the column is omitted or
            ``None`` is returned instead of the vector if the argument
            ``weightID`` is missing.

        ``W``
            instance weight; raises an exception if ``weightID``
            is missing.

        ``0``
            a vector of zeros

        ``1``
            a vector of ones

        The default content is ``a/cw``: an array with feature values and
        separate vectors with classes and weights. Specifying an empty string
        has the same effect. If the elements to the right of the slash repeat,
        the function returns the same Python object, e.g. in ``acc000/cwww`` the
        three weight vectors are one and the same Python object, so modifying
        one will change all three of them.

        This is the default behaviour on the iris data set with 150 data
        instances described by four features and a class value::

            >>> data = Orange.data.Table("iris")
            >>> a, c, w = data.to_numpy()
            >>> a.shape
            (150, 4)
            >>> c.shape
            (150,)
            >>> print w
            None
            >>> a[0]
            array([ 5.0999999 ,  3.5       ,  1.39999998,  0.2       ])
            >>> c[0]
            0.0

        For a more complicated example, the array will contain a column with
        the class, the features, a vector of ones, two vectors with classes
        and another vector of zeros::

            >>> a, = data.to_numpy("ca1cc0")
            >>> a[0]
            array([ 0., 5.0999999, 3.5       , 1.39999998, 0.2       , 1., 0., 0., 0.])
            >>> a[130]
            array([ 2., 7.4000001, 2.79999995, 6.0999999 , 1.89999998, 1., 2., 2., 0.])
            >>> c[120]
            2.0
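
        The structure of the ``content`` string can be made concrete with a
        small parser. The function ``split_content`` below is a hypothetical
        helper written for illustration; it only splits and validates the
        string according to the character list above and does not build any
        arrays. Rejecting ``A`` on the right of the slash is an assumption
        based on ``A`` being described as "like ``a``".

        ```python
        def split_content(content="a/cw"):
            """Split a content string into array columns and vectors."""
            if not content:
                content = "a/cw"        # an empty string means the default
            left, _, right = content.partition("/")
            allowed = set("aAcCmMwW01")
            for char in left + right:
                if char not in allowed:
                    raise ValueError("unknown content character %r" % char)
            if "a" in right or "A" in right:
                raise ValueError("features can only appear left of the slash")
            return list(left), list(right)


        print(split_content(""))
        print(split_content("ca1cc0"))
        print(split_content("acc000/cwww"))
        ```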

        The third argument specifies the treatment of non-continuous
        non-binary values (binary values are always translated to 0.0 or
        1.0). The argument's value can be
        :obj:`Orange.data.Table.Multinomial_Ignore` (such features are
        omitted), :obj:`Orange.data.Table.Multinomial_AsOrdinal` (the
        values' indices are treated as continuous numbers) or
        :obj:`Orange.data.Table.Multinomial_Error` (an exception is raised
        if such features are encountered). The default treatment is
        :obj:`Orange.data.Table.Multinomial_AsOrdinal`.

        When the class attribute is discrete and has more than two values,
        an exception is raised unless multinomial attributes are treated as
        ordinal. More options for treating multinomial values are available
        in :obj:`Orange.data.continuization`.
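
        The three treatments can be sketched for a single discrete column in
        plain Python. The constants and the function ``encode_discrete`` are
        hypothetical stand-ins chosen for this illustration; they are not
        the Orange flags themselves.

        ```python
        IGNORE, AS_ORDINAL, ERROR = range(3)   # stand-ins for the three flags


        def encode_discrete(values, column, treatment=AS_ORDINAL):
            """Encode one discrete column; return None if it is dropped.

            ``values`` lists the feature's possible values in their declared
            order; ``column`` holds the value of each instance.
            """
            if len(values) <= 2 or treatment == AS_ORDINAL:
                # Binary features always become 0.0 or 1.0; otherwise the
                # index of the value is used as a number.
                return [float(values.index(value)) for value in column]
            if treatment == IGNORE:
                return None
            raise ValueError("multinomial feature with %d values" % len(values))


        print(encode_discrete(["no", "yes"], ["yes", "no"], ERROR))
        print(encode_discrete(["a", "b", "c"], ["c", "a"]))
        print(encode_discrete(["a", "b", "c"], ["c", "a"], IGNORE))
        ```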

    .. method:: to_numpyMA(content, weightID, multinominal)

        Similar to :obj:`to_numpy` except that it returns a masked array
        with the mask representing the (un)defined values.

    .. method:: checksum()

            Return a CRC32 computed over all discrete and continuous
            features and class attributes of all data instances.

            :rtype: int

    .. method:: has_missing_values()

            Return ``True`` if any data instance has a missing
            value. Meta attributes are not checked.

    .. method:: has_missing_classes()

            Return ``True`` if any instance is missing its class value.

    .. method:: random_instance()

            Return a random instance from the
            table. The data table's :obj:`random_generator` is used,
            which is initially seeded with 0, so results are
            deterministic.
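
            The effect of the lazily created, fixed-seed generator can be
            sketched with Python's own ``random`` module. ``Sampler`` is a
            hypothetical stand-in, and ``random.Random`` is not the
            generator Orange uses, so the actual instances drawn will
            differ; only the determinism carries over.

            ```python
            import random


            class Sampler(object):
                """Hypothetical stand-in showing the lazy generator."""

                def __init__(self, rows):
                    self.rows = rows
                    self.random_generator = None

                def random_instance(self):
                    if self.random_generator is None:
                        # Created on first use with a fixed seed, then kept.
                        self.random_generator = random.Random(0)
                    return self.random_generator.choice(self.rows)


            first = Sampler(list(range(100)))
            second = Sampler(list(range(100)))
            draws_first = [first.random_instance() for _ in range(5)]
            draws_second = [second.random_instance() for _ in range(5)]

            print(draws_first == draws_second)
            ```

            Two tables with the same content therefore yield the same
            sequence of instances unless a different generator is assigned.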

    .. method:: remove_duplicates([weightID])

            Remove duplicates of data instances. If ``weightID`` is given,
            a meta attribute is added which contains the number of
            instances merged into each new instance.

            :param weightID: id for meta attribute with weight
            :type weightID: int
            :rtype: None
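
            The idea of merging duplicates while counting them can be
            sketched in plain Python. Unlike the method above, which
            returns ``None``, the hypothetical helper ``unique_with_counts``
            returns a new list; keeping rows in first-seen order is also
            an assumption of this sketch.

            ```python
            def unique_with_counts(rows):
                """Return unique rows with the number of merged copies."""
                counts = {}
                order = []
                for row in rows:
                    key = tuple(row)
                    if key not in counts:
                        counts[key] = 0
                        order.append(key)
                    counts[key] += 1
                return [(list(key), counts[key]) for key in order]


            rows = [[1, "a"], [2, "b"], [1, "a"], [1, "a"]]
            print(unique_with_counts(rows))
            ```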

    .. method:: sort([variables])

            Sort the data table. The argument gives the
            variables to sort by, ordered by importance. If omitted, the
            order from the domain is used. Values of discrete
            features are not ordered alphabetically but according to
            their order in :obj:`Orange.feature.Discrete.values`.

            This sorts the data from the bridges data set by the lengths
            and years of their construction::

                data.sort(["LENGTH", "ERECTED"])
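
            The ordering of discrete values can be sketched with an
            ordinary list sort. The value list ``sizes`` and the rows are
            made up for this illustration; the sort key uses the position
            of each value in the declared list instead of its spelling.

            ```python
            sizes = ["small", "medium", "large"]   # declared order of values

            rows = [["large", 3.0], ["small", 2.0],
                    ["medium", 1.0], ["small", 1.0]]

            # Most important column first: the declared position of the
            # discrete value, then the numeric column.
            rows.sort(key=lambda row: (sizes.index(row[0]), row[1]))

            print(rows)
            ```

            An alphabetical sort would instead have placed ``"large"``
            first.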

    .. method:: shuffle()

            Randomly shuffle the data instances.

    .. method:: add_meta_attribute(attr[, value=1])

            Add a meta value to all data instances. The first argument
            can be an integer id, or the name or the variable descriptor
            of a meta attribute registered in the domain.

    .. method:: remove_meta_attribute(attr)

            Remove a meta attribute from all data instances.