source: orange/orange/Orange/statistics/distribution.py @ 9694:c756222dc4cc

Revision 9694:c756222dc4cc, 12.3 KB checked in by Jure Zbontar <jure.zbontar@…>, 2 years ago (diff)

Rename RandomGenerator with Random in code.

"""
.. index:: Distributions

=============
Distributions
=============

:obj:`Distribution` and derived classes store empirical
distributions of discrete and continuous variables.

.. class:: Distribution

    This class can store absolute or relative frequencies. It provides a
    convenience constructor which constructs instances of derived classes. ::

        >>> import Orange
        >>> data = Orange.data.Table("adult_sample")
        >>> disc = Orange.statistics.distribution.Distribution("workclass", data)
        >>> print disc
        <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
        >>> print type(disc)
        <type 'DiscDistribution'>

    The resulting distribution is of type :obj:`DiscDistribution` since the
    variable `workclass` is discrete. The printed numbers are counts of
    examples that have a particular attribute value. ::

        >>> workclass = data.domain["workclass"]
        >>> for i in range(len(workclass.values)):
        ...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
                 Private: 685.000
        Self-emp-not-inc: 72.000
            Self-emp-inc: 28.000
             Federal-gov: 29.000
               Local-gov: 59.000
               State-gov: 43.000
             Without-pay: 2.000
            Never-worked: 0.000

    Distributions resemble dictionaries, supporting indexing by instances of
    :obj:`Orange.data.Value`, integers or floats (depending on the distribution
    type), and symbolic names (if :obj:`variable` is defined).

    For instance, the number of examples with `workclass="Private"` can be
    obtained in three ways::

        print "Private: ", disc["Private"]
        print "Private: ", disc[0]
        print "Private: ", disc[Orange.data.Value(workclass, "Private")]

    Elements cannot be removed from distributions.

    The length of a distribution equals the number of possible values for
    discrete distributions (if :obj:`variable` is set), the value with the
    highest index encountered (if the distribution is discrete and
    :obj:`variable` is :obj:`None`) or the number of different values
    encountered (for continuous distributions).

    .. attribute:: variable

        Variable to which the distribution applies; may be :obj:`None` if not
        applicable.

    .. attribute:: unknowns

        The number of instances for which the value of the variable was
        undefined.

    .. attribute:: abs

        Sum of all elements in the distribution. Usually it equals either
        :obj:`cases` if the instance stores absolute frequencies or 1 if the
        stored frequencies are relative, e.g. after calling :obj:`normalize`.

    .. attribute:: cases

        The number of instances from which the distribution is computed,
        excluding those on which the value was undefined. If instances were
        weighted, this is the sum of weights.

    .. attribute:: normalized

        :obj:`True` if the distribution is normalized.

    .. attribute:: random_generator

        A pseudo-random number generator (an instance of
        :obj:`Orange.misc.Random`) used by method :obj:`random`.

    .. method:: __init__(variable[, data[, weightId=0]])

        Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`,
        depending on the variable type. If the variable is the only argument, it
        must be an instance of :obj:`Orange.data.variable.Variable`. In that case,
        an empty distribution is constructed. If data is given as well, the
        variable can also be specified by name or index in the
        domain. The constructor then computes the distribution of the specified
        variable on the given data. If instances are weighted, the id of the
        meta-attribute with weights can be passed as the third argument.

        If the variable is given by descriptor, it doesn't need to exist in the
        domain, but it must be computable from the given instances. For example,
        the variable can be a discretized version of a variable from data.

    .. method:: keys()

        Return a list of possible values (if the distribution is discrete and
        :obj:`variable` is set) or a list of encountered values otherwise.

    .. method:: values()

        Return a list of frequencies of the values returned by :obj:`keys`.

    .. method:: items()

        Return a list of (value, frequency) pairs built from the two lists
        above.

    .. method:: native()

        Return the distribution as a list (for discrete distributions) or as a
        dictionary (for continuous distributions).

    .. method:: add(value[, weight=1])

        Increase the count of the element corresponding to ``value`` by
        ``weight``.

        :param value: Value
        :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions
        :param weight: Weight to be added to the count for ``value``
        :type weight: float

    .. method:: normalize()

        Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and
        :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are
        unchanged. This changes absolute frequencies into relative ones.

    .. method:: modus()

        Return the most common value. If there are multiple such values, one is
        chosen at random, although the chosen value will always be the same for
        the same distribution.

    .. method:: random()

        Return a random value based on the stored empirical probability
        distribution. For continuous distributions, this will always be one of
        the values which actually appeared (e.g. one of the values from
        :obj:`keys`).

        The method uses :obj:`random_generator`. If none has been constructed or
        assigned yet, a new one is constructed and stored for further use.

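The bookkeeping described for :obj:`Distribution` (how ``abs``, ``cases``,
``unknowns`` and ``normalized`` relate to the stored counts) can be illustrated
with a small plain-Python sketch. It is not Orange's implementation: the class
name ``SketchDistribution`` is hypothetical, it runs on current Python without
Orange, and treating ``None`` as an undefined value is an assumption made only
for this illustration.

```python
# Hypothetical stand-in for a discrete distribution; not Orange's code.
class SketchDistribution:
    def __init__(self, n_values):
        self.counts = [0.0] * n_values  # one slot per possible value
        self.abs = 0.0         # sum of all stored elements
        self.cases = 0.0       # (weighted) number of known instances
        self.unknowns = 0.0    # (weighted) number of undefined values
        self.normalized = False

    def add(self, index, weight=1.0):
        # Assumption of this sketch: None stands for an undefined value.
        if index is None:
            self.unknowns += weight
            return
        self.counts[index] += weight
        self.abs += weight
        self.cases += weight

    def normalize(self):
        # Turn absolute frequencies into relative ones; cases and
        # unknowns are left untouched.
        if self.abs > 0:
            self.counts = [c / self.abs for c in self.counts]
            self.abs = 1.0
        self.normalized = True


dist = SketchDistribution(3)
for index in [0, 0, 1, None, 2, 0]:
    dist.add(index)
dist.normalize()
```

After ``normalize()`` the stored counts sum to 1 and ``abs`` is 1, while
``cases`` still reports the five known instances and ``unknowns`` the single
undefined one.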

.. class:: Discrete

    Stores a discrete distribution of values. The class differs from its parent
    class in having a few additional constructors.

    .. method:: __init__(variable)

        Construct an instance of :obj:`Discrete` and set the variable
        attribute.

        :param variable: A discrete variable
        :type variable: Orange.data.variable.Discrete

    .. method:: __init__(frequencies)

        Construct an instance and initialize the frequencies from the list, but
        leave `Distribution.variable` empty.

        :param frequencies: A list of frequencies
        :type frequencies: list

        A distribution constructed in this way can be used, for instance, to
        generate random numbers from a given discrete distribution::

            disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2])
            for i in range(20):
                print disc.random(),

        This prints out approximately ten 0's, six 1's and four 2's. The values
        can be named by assigning a variable::

            v = Orange.data.variable.Discrete(values=["red", "green", "blue"])
            disc.variable = v

    .. method:: __init__(distribution)

        Copy constructor; makes a shallow copy of the given distribution.

        :param distribution: An existing discrete distribution
        :type distribution: Discrete

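Drawing values in proportion to a list of frequencies, as in the
``[0.5, 0.3, 0.2]`` example for :obj:`Discrete`, can be sketched without Orange
as follows. The helper ``sample_index`` is hypothetical and uses Python's
standard :obj:`random` module with a cumulative-sum search; how Orange draws
its values internally is not stated here, so this shows only the general
technique.

```python
import random


def sample_index(frequencies, rng):
    # Draw an index with probability proportional to its frequency.
    total = sum(frequencies)
    threshold = rng.random() * total
    cumulative = 0.0
    for index, freq in enumerate(frequencies):
        cumulative += freq
        if threshold < cumulative:
            return index
    return len(frequencies) - 1  # guard against rounding at the upper end


rng = random.Random(0)  # fixed seed so that runs are repeatable
draws = [sample_index([0.5, 0.3, 0.2], rng) for _ in range(20)]
tally = [draws.count(i) for i in range(3)]
```

With only twenty draws the tallies vary noticeably around the expected ten,
six and four, which is why the text says "approximately".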

.. class:: Continuous

    Stores a continuous distribution, that is, a dictionary-like structure with
    values and their frequencies.

    .. method:: __init__(variable)

        Construct an instance of :obj:`Continuous` and set the variable
        attribute.

        :param variable: A continuous variable
        :type variable: Orange.data.variable.Continuous

    .. method:: __init__(frequencies)

        Construct an instance of :obj:`Continuous` and initialize it from
        the given dictionary with frequencies, whose keys and values must be integers.

        :param frequencies: Values and their corresponding frequencies
        :type frequencies: dict

    .. method:: __init__(distribution)

        Copy constructor; makes a shallow copy of the given distribution.

        :param distribution: An existing continuous distribution
        :type distribution: Continuous

    .. method:: average()

        Return the average value. Note that the average can also be
        computed using simpler and faster classes from module
        :obj:`Orange.statistics.basic`.

    .. method:: var()

        Return the variance of the distribution.

    .. method:: dev()

        Return the standard deviation.

    .. method:: error()

        Return the standard error.

    .. method:: percentile(p)

        Return the value at the `p`-th percentile.

        :param p: The percentile, must be between 0 and 100
        :type p: float
        :rtype: float

        For example, if `d_age` is a continuous distribution, the quartiles can
        be printed by ::

            print "Quartiles: %5.3f - %5.3f - %5.3f" % (
                 d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))

    .. method:: density(x)

        Return the probability density at `x`. If the value is not in
        :obj:`Distribution.keys`, it is interpolated.

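The statistics listed for :obj:`Continuous` can all be derived from a
dictionary of values and frequencies. The sketch below shows one way to do
that in plain Python; the helpers ``summarize`` and ``percentile`` are
hypothetical. Two choices are assumptions rather than documented Orange
behaviour: the variance divides by the total frequency (population variance),
and the percentile is taken as the smallest observed value whose cumulative
frequency reaches ``p`` percent.

```python
import math


def summarize(freqs):
    # freqs maps each observed value to its (possibly weighted) frequency.
    n = sum(freqs.values())
    mean = sum(v * f for v, f in freqs.items()) / n
    # Assumption: population variance (divide by n, not n - 1).
    var = sum(f * (v - mean) ** 2 for v, f in freqs.items()) / n
    dev = math.sqrt(var)
    error = dev / math.sqrt(n)  # standard error of the mean
    return mean, var, dev, error


def percentile(freqs, p):
    # Assumption: smallest observed value whose cumulative frequency
    # reaches p percent of the total.
    n = sum(freqs.values())
    target = n * p / 100.0
    cumulative = 0.0
    for value in sorted(freqs):
        cumulative += freqs[value]
        if cumulative >= target:
            return value
    return max(freqs)


ages = {20: 2, 30: 4, 40: 2}
mean, var, dev, error = summarize(ages)
median = percentile(ages, 50)
```

Because this percentile always returns a value that was actually observed, it
never interpolates between neighbouring values; other percentile definitions
do, and may give different results on the same data.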

.. class:: Gaussian

    A class imitating :obj:`Continuous` by returning the statistics and
    densities for a Gaussian distribution. The class is meant only as a
    convenient substitute in code which expects an instance of
    :obj:`Distribution`. For general use, the Python module :obj:`random`
    provides a comprehensive set of functions for various random distributions.

    .. attribute:: mean

        The mean value parameter of the Gaussian distribution.

    .. attribute:: sigma

        The standard deviation of the distribution.

    .. attribute:: abs

        The simulated number of instances; in effect, the Gaussian distribution
        density, as returned by method :obj:`density`, is multiplied by
        :obj:`abs`.

    .. method:: __init__([mean=0, sigma=1])

        Construct an instance, set :obj:`mean` and :obj:`sigma` to the given
        values and :obj:`abs` to 1.

    .. method:: __init__(distribution)

        Construct a distribution which approximates the given distribution,
        which must be either :obj:`Continuous`, in which case its
        average and deviation will be used for mean and sigma, or an existing
        :obj:`Gaussian`, which will be copied. Attribute :obj:`abs`
        is set to the given distribution's ``abs``.

    .. method:: average()

        Return :obj:`mean`.

    .. method:: dev()

        Return :obj:`sigma`.

    .. method:: var()

        Return the square of :obj:`sigma`.

    .. method:: density(x)

        Return the density at point ``x``, that is, the Gaussian distribution
        density multiplied by :obj:`abs`.

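The scaled density described for :obj:`Gaussian` is the standard normal
probability density multiplied by ``abs``. A minimal plain-Python sketch of
that formula follows; the function ``gaussian_density`` and its parameter
``abs_`` (named with a trailing underscore to avoid shadowing the built-in)
are hypothetical and independent of Orange.

```python
import math


def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    # Normal probability density at x, scaled by abs_ (the simulated
    # number of instances).
    z = (x - mean) / sigma
    return abs_ * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


peak = gaussian_density(0.0)                # density at the mean, abs_ = 1
scaled = gaussian_density(0.0, abs_=100.0)  # same point, 100 simulated instances
```

Scaling by ``abs`` makes the curve comparable with an empirical distribution
that stores absolute rather than relative frequencies.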

Class distributions
===================

There is a convenience function for computing empirical class distributions from
data.

.. function:: getClassDistribution(data[, weightID=0])

    Return a class distribution for the given data.

    :param data: A set of instances.
    :type data: Orange.data.Table
    :param weightID: An id for meta attribute with weights of instances
    :type weightID: int
    :rtype: :obj:`Discrete` or :obj:`Continuous`, depending on the class type

Distributions of all variables
==============================

Distributions of all variables can be computed and stored in
:obj:`Domain`. The list-like object can be indexed by variable
indices in the domain, as well as by variables and their names.

.. class:: Domain

    .. method:: __init__(data[, weightID=0])

        Construct an instance with distributions of all discrete and continuous
        variables from the given data.

        :param data: A set of instances.
        :type data: Orange.data.Table
        :param weightID: An id for meta attribute with weights of instances
        :type weightID: int

The script below computes distributions for all attributes in the data and
prints out distributions for discrete and averages for continuous attributes. ::

    dist = Orange.statistics.distribution.Domain(data)

    for d in dist:
        if d.variable.var_type == Orange.data.Type.Discrete:
            print "%30s: %s" % (d.variable.name, d)
        else:
            print "%30s: avg. %5.3f" % (d.variable.name, d.average())

The distribution for, say, attribute `age` can be obtained by its index and also
by its name::

    dist_age = dist["age"]

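A list-like object that can be indexed both by position and by name, as
described for :obj:`Domain`, can be sketched in plain Python as shown below.
The class ``SketchDomainDistributions`` is hypothetical and stores ordinary
Python objects in place of real distributions; it illustrates the indexing
idea only and does not reproduce how Orange implements it.

```python
class SketchDomainDistributions:
    # Hypothetical container; holds (name, distribution) pairs in order.
    def __init__(self, named_distributions):
        self._names = [name for name, _ in named_distributions]
        self._dists = [dist for _, dist in named_distributions]

    def __len__(self):
        return len(self._dists)

    def __iter__(self):
        return iter(self._dists)

    def __getitem__(self, key):
        # Accept either a position or a variable name.
        if isinstance(key, str):
            key = self._names.index(key)
        return self._dists[key]


dists = SketchDomainDistributions([
    ("age", {38: 2, 52: 1}),
    ("workclass", [685.0, 72.0]),
])
same = dists["age"] is dists[0]
```

Both lookups return the same stored object, so a name is simply a more
readable alternative to a position.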
"""


from Orange.core import Distribution
from Orange.core import DiscDistribution as Discrete
from Orange.core import ContDistribution as Continuous
from Orange.core import GaussianDistribution as Gaussian

from Orange.core import DomainDistributions as Domain