.. Source: orange/docs/reference/rst/Orange.statistics.distribution.rst @ 10372:ae567c0440c7
.. Revision 10372:ae567c0440c7, checked in by janezd <janez.demsar@…>: "Moved documentation about statistics.distribution to rst"
.. py:currentmodule:: Orange.statistics.distribution

.. index:: Distributions

=============
Distributions
=============

:obj:`Distribution` and derived classes store empirical
distributions of discrete and continuous variables.

.. class:: Distribution

    This class can store absolute or relative frequencies. It provides a
    convenience constructor which constructs instances of derived classes. ::

        >>> import Orange
        >>> data = Orange.data.Table("adult_sample")
        >>> disc = Orange.statistics.distribution.Distribution("workclass", data)
        >>> print disc
        <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
        >>> print type(disc)
        <type 'DiscDistribution'>

    The resulting distribution is of type :obj:`Discrete` (shown as
    ``DiscDistribution`` in the output above) since the variable `workclass` is
    discrete. The printed numbers are counts of instances that have a
    particular attribute value. ::

        >>> workclass = data.domain["workclass"]
        >>> for i in range(len(workclass.values)):
        ...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
                 Private: 685.000
        Self-emp-not-inc: 72.000
            Self-emp-inc: 28.000
             Federal-gov: 29.000
               Local-gov: 59.000
               State-gov: 43.000
             Without-pay: 2.000
            Never-worked: 0.000

    Distributions resemble dictionaries, supporting indexing by instances of
    :obj:`Orange.data.Value`, integers or floats (depending on the distribution
    type), and symbolic names (if :obj:`variable` is defined).

    For instance, the number of examples with `workclass="Private"` can be
    obtained in three ways::

        print "Private: ", disc["Private"]
        print "Private: ", disc[0]
        print "Private: ", disc[Orange.data.Value(workclass, "Private")]

    Elements cannot be removed from distributions.

    The length of a distribution equals the number of possible values for
    discrete distributions (if :obj:`variable` is set), the value with the
    highest index encountered (if the distribution is discrete and
    :obj:`variable` is :obj:`None`) or the number of different values
    encountered (for continuous distributions).

    .. attribute:: variable

        Variable to which the distribution applies; may be :obj:`None` if not
        applicable.

    .. attribute:: unknowns

        The number of instances for which the value of the variable was
        undefined.

    .. attribute:: abs

        Sum of all elements in the distribution. Usually it equals either
        :obj:`cases` if the instance stores absolute frequencies or 1 if the
        stored frequencies are relative, e.g. after calling :obj:`normalize`.

    .. attribute:: cases

        The number of instances from which the distribution is computed,
        excluding those on which the value was undefined. If instances were
        weighted, this is the sum of weights.

    .. attribute:: normalized

        :obj:`True` if the distribution is normalized.

    .. attribute:: random_generator

        A pseudo-random number generator (an instance of
        :obj:`Orange.misc.Random`) used by method :obj:`random`.

    .. method:: __init__(variable[, data[, weightId=0]])

        Construct either :obj:`Discrete` or :obj:`Continuous`, depending on
        the variable type. If the variable is the only argument, it must be an
        instance of :obj:`Orange.feature.Descriptor`. In that case, an empty
        distribution is constructed. If data is given as well, the variable can
        also be specified by name or index in the domain. The constructor then
        computes the distribution of the specified variable on the given data.
        If instances are weighted, the id of the meta-attribute with weights
        can be passed as the third argument.

        If the variable is given by descriptor, it doesn't need to exist in the
        domain, but it must be computable from the given instances. For example,
        the variable can be a discretized version of a variable from data.

    .. method:: keys()

        Return a list of possible values (if the distribution is discrete and
        :obj:`variable` is set) or a list of encountered values otherwise.

    .. method:: values()

        Return a list of frequencies of the values returned by :obj:`keys`.

    .. method:: items()

        Return a list of (value, frequency) pairs built from the two lists
        above.

    .. method:: native()

        Return the distribution as a list (for discrete distributions) or as a
        dictionary (for continuous distributions).

    .. method:: add(value[, weight=1])

        Increase the count of the element corresponding to ``value`` by
        ``weight``.

        :param value: Value
        :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions
        :param weight: Weight to be added to the count for ``value``
        :type weight: float

    .. method:: normalize()

        Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and
        :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are
        unchanged. This changes absolute frequencies into relative ones.

    .. method:: modus()

        Return the most common value. If there are multiple such values, one is
        chosen at random, although the chosen value will always be the same for
        the same distribution.

    .. method:: random()

        Return a random value based on the stored empirical probability
        distribution. For continuous distributions, this will always be one of
        the values which actually appeared (e.g. one of the values from
        :obj:`keys`).

        The method uses :obj:`random_generator`. If none has been constructed or
        assigned yet, a new one is constructed and stored for further use.

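The effect of ``normalize`` and the sampling performed by ``random`` can be
illustrated without Orange. The following is only a minimal pure-Python sketch
under stated assumptions, not Orange's implementation: the class
``TinyDistribution`` and its ``counts`` attribute are hypothetical names, ties
in ``modus`` are simply resolved to the lowest index, and sampling uses plain
cumulative sums.

```python
import random


class TinyDistribution(object):
    """Hypothetical stand-in for a discrete distribution (not Orange code)."""

    def __init__(self, counts):
        self.counts = [float(c) for c in counts]
        self.cases = sum(self.counts)  # instances with a known value
        self.abs = self.cases          # sum of all stored elements
        self.normalized = False

    def normalize(self):
        # Turn absolute frequencies into relative ones; `cases` is unchanged.
        if self.abs > 0:
            self.counts = [c / self.abs for c in self.counts]
            self.abs = 1.0
            self.normalized = True

    def modus(self):
        # Index of the most common value; on ties the lowest index wins
        # (a simplification of the behaviour described above).
        return max(range(len(self.counts)), key=lambda i: self.counts[i])

    def random(self, rng=random):
        # Draw an index with probability proportional to its frequency.
        threshold = rng.random() * self.abs
        running = 0.0
        for index, count in enumerate(self.counts):
            running += count
            if threshold < running:
                return index
        return len(self.counts) - 1  # guard against rounding at the top end


# Counts taken from the `workclass` example above.
dist = TinyDistribution([685, 72, 28, 29, 59, 43, 2, 0])
dist.normalize()
rng = random.Random(42)
samples = [dist.random(rng) for _ in range(1000)]
```

After ``normalize`` the stored frequencies sum to 1 while ``cases`` still holds
the original number of instances, which mirrors the description of
:obj:`Distribution.abs` and :obj:`Distribution.cases`.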

.. class:: Discrete

    Stores a discrete distribution of values. The class differs from its parent
    class in having a few additional constructors.

    .. method:: __init__(variable)

        Construct an instance of :obj:`Discrete` and set the variable
        attribute.

        :param variable: A discrete variable
        :type variable: Orange.feature.Discrete

    .. method:: __init__(frequencies)

        Construct an instance and initialize the frequencies from the list, but
        leave :obj:`Distribution.variable` empty.

        :param frequencies: A list of frequencies
        :type frequencies: list

        A distribution constructed in this way can be used, for instance, to
        generate random numbers from a given discrete distribution::

            disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2])
            for i in range(20):
                print disc.random(),

        This prints out approximately ten 0's, six 1's and four 2's. The values
        can be named by assigning a variable::

            v = Orange.feature.Discrete(values=["red", "green", "blue"])
            disc.variable = v

    .. method:: __init__(distribution)

        Copy constructor; makes a shallow copy of the given distribution.

        :param distribution: An existing discrete distribution
        :type distribution: Discrete

.. class:: Continuous

    Stores a continuous distribution, that is, a dictionary-like structure with
    values and their frequencies.

    .. method:: __init__(variable)

        Construct an instance of :obj:`Continuous` and set the variable
        attribute.

        :param variable: A continuous variable
        :type variable: Orange.feature.Continuous

    .. method:: __init__(frequencies)

        Construct an instance of :obj:`Continuous` and initialize it from
        the given dictionary with frequencies, whose keys and values must be integers.

        :param frequencies: Values and their corresponding frequencies
        :type frequencies: dict

    .. method:: __init__(distribution)

        Copy constructor; makes a shallow copy of the given distribution.

        :param distribution: An existing continuous distribution
        :type distribution: Continuous

    .. method:: average()

        Return the average value. Note that the average can also be
        computed using simpler and faster classes from module
        :obj:`Orange.statistics.basic`.

    .. method:: var()

        Return the variance of the distribution.

    .. method:: dev()

        Return the standard deviation.

    .. method:: error()

        Return the standard error.

    .. method:: percentile(p)

        Return the value at the `p`-th percentile.

        :param p: The percentile, must be between 0 and 100
        :type p: float
        :rtype: float

        For example, if `d_age` is a continuous distribution, the quartiles can
        be printed by ::

            print "Quartiles: %5.3f - %5.3f - %5.3f" % (
                 d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))

    .. method:: density(x)

        Return the probability density at `x`. If the value is not in
        :obj:`Distribution.keys`, it is interpolated.

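To make the statistics above concrete, the sketch below computes an average, a
variance and percentiles from a plain dictionary of values and frequencies. It
is an illustration under assumptions, not Orange's code: the function names are
hypothetical, the variance divides by the total frequency (the population
convention), and the percentile is taken as the smallest value whose cumulative
frequency reaches ``p`` percent. Orange may use different conventions, for
example interpolation between values.

```python
def average(freqs):
    """Frequency-weighted mean of a {value: frequency} dictionary."""
    total = float(sum(freqs.values()))
    return sum(value * freq for value, freq in freqs.items()) / total


def variance(freqs):
    """Frequency-weighted variance (population convention, an assumption)."""
    total = float(sum(freqs.values()))
    mean = average(freqs)
    return sum(freq * (value - mean) ** 2
               for value, freq in freqs.items()) / total


def percentile(freqs, p):
    """Smallest value whose cumulative frequency reaches p percent."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    target = sum(freqs.values()) * p / 100.0
    running = 0.0
    for value in sorted(freqs):
        running += freqs[value]
        if running >= target:
            return value
    return max(freqs)  # only reached through rounding at p == 100


# Hypothetical ages and how often each one occurred.
ages = {20: 2, 30: 5, 40: 2, 50: 1}
quartiles = [percentile(ages, p) for p in (25, 50, 75)]
```

With these made-up frequencies the average is 32.0, the variance is 76.0 and
the quartiles are 30, 30 and 40.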

.. class:: Gaussian

    A class imitating :obj:`Continuous` by returning the statistics and
    densities for a Gaussian distribution. The class is meant only as a
    convenient substitute in code which expects an instance of
    :obj:`Distribution`. For general use, the Python module :obj:`random`
    provides a comprehensive set of functions for various random distributions.

    .. attribute:: mean

        The mean value parameter of the Gaussian distribution.

    .. attribute:: sigma

        The standard deviation of the distribution.

    .. attribute:: abs

        The simulated number of instances; in effect, the Gaussian distribution
        density, as returned by method :obj:`density`, is multiplied by
        :obj:`abs`.

    .. method:: __init__([mean=0, sigma=1])

        Construct an instance, set :obj:`mean` and :obj:`sigma` to the given
        values and :obj:`abs` to 1.

    .. method:: __init__(distribution)

        Construct a distribution which approximates the given distribution,
        which must be either :obj:`Continuous`, in which case its
        average and deviation will be used for mean and sigma, or an existing
        :obj:`Gaussian`, which will be copied. Attribute :obj:`abs`
        is set to the given distribution's ``abs``.

    .. method:: average()

        Return :obj:`mean`.

    .. method:: dev()

        Return :obj:`sigma`.

    .. method:: var()

        Return the square of :obj:`sigma`.

    .. method:: density(x)

        Return the density at point ``x``, that is, the Gaussian distribution
        density multiplied by :obj:`abs`.

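The scaled density described for ``density`` can be written out directly. The
function below is a hedged sketch of that formula and not the class itself:
``gaussian_density`` and its ``abs_`` parameter are hypothetical names that
stand in for the ``mean``, ``sigma`` and ``abs`` attributes.

```python
import math


def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    """Normal probability density at x, multiplied by abs_."""
    z = (x - mean) / sigma
    return abs_ * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# With abs_ == 1 this is the ordinary normal density.
peak = gaussian_density(0.0)

# Simulating 100 instances scales every density value by 100.
scaled_peak = gaussian_density(0.0, abs_=100.0)
```

Scaling by ``abs`` lets the result be compared with absolute frequencies from
an empirical distribution rather than with probabilities.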

Class distributions
===================

There is a convenience function for computing empirical class distributions from
data.

.. function:: getClassDistribution(data[, weightID=0])

    Return a class distribution for the given data.

    :param data: A set of instances.
    :type data: Orange.data.Table
    :param weightID: The id of the meta attribute with instance weights
    :type weightID: int
    :rtype: :obj:`Discrete` or :obj:`Continuous`, depending on the class type

Distributions of all variables
==============================

Distributions of all variables can be computed and stored in
:obj:`Domain`. The list-like object can be indexed by variable
indices in the domain, as well as by variables and their names.

.. class:: Domain

    .. method:: __init__(data[, weightID=0])

        Construct an instance with distributions of all discrete and continuous
        variables from the given data.

        :param data: A set of instances.
        :type data: Orange.data.Table
        :param weightID: The id of the meta attribute with instance weights
        :type weightID: int

The script below computes distributions for all attributes in the data and
prints out distributions for discrete and averages for continuous attributes. ::

    dist = Orange.statistics.distribution.Domain(data)

    for d in dist:
        if d.variable.var_type == Orange.feature.Type.Discrete:
            print "%30s: %s" % (d.variable.name, d)
        else:
            print "%30s: avg. %5.3f" % (d.variable.name, d.average())

The distribution for, say, attribute `age` can be obtained by its index and also
by its name::

    dist_age = dist["age"]