source: orange/orange/Orange/statistics/distribution.py @ 7671:65b33dc81393

Revision 7671:65b33dc81393, 12.2 KB checked in by janezd <janez.demsar@…>, 3 years ago

Added the missing modules

"""

Class :obj:`Distribution` and derived classes are used for storing empirical
distributions of discrete and continuous variables.

.. class:: Distribution

    A base class for storing distributions of variable values. The class can
    store absolute or relative frequencies. It provides a convenience
    constructor which constructs instances of derived classes. ::

        >>> import Orange
        >>> data = Orange.data.Table("adult_sample")
        >>> disc = Orange.statistics.distribution.Distribution("workclass", data)
        >>> print disc
        <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
        >>> print type(disc)
        <type 'DiscDistribution'>

    The resulting distribution is of type :obj:`DiscDistribution` since the
    variable `workclass` is discrete. The printed numbers are counts of
    examples that have a particular attribute value. ::

        >>> workclass = data.domain["workclass"]
        >>> for i in range(len(workclass.values)):
        ...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
                     Private: 685.000
            Self-emp-not-inc: 72.000
                Self-emp-inc: 28.000
                 Federal-gov: 29.000
                   Local-gov: 59.000
                   State-gov: 43.000
                 Without-pay: 2.000
                Never-worked: 0.000

    Distributions resemble dictionaries, supporting indexing by instances of
    :obj:`Orange.data.Value`, integers or floats (depending on the distribution
    type), and symbolic names (if :obj:`variable` is defined).

    For instance, the number of examples with `workclass="Private"` can be
    obtained in three ways::

        print "Private: ", disc["Private"]
        print "Private: ", disc[0]
        print "Private: ", disc[Orange.data.Value(workclass, "Private")]

    Elements cannot be removed from distributions.

    The length of a distribution equals the number of possible values for
    discrete distributions (if :obj:`variable` is set), the value with the
    highest index encountered (if the distribution is discrete and
    :obj:`variable` is :obj:`None`) or the number of different values
    encountered (for continuous distributions).

    .. attribute:: variable

        Variable to which the distribution applies; may be :obj:`None` if not
        applicable.

    .. attribute:: unknowns

        The number of instances for which the value of the variable was
        undefined.

    .. attribute:: abs

        Sum of all elements in the distribution. Usually it equals either
        :obj:`cases` if the instance stores absolute frequencies or 1 if the
        stored frequencies are relative, e.g. after calling :obj:`normalize`.

    .. attribute:: cases

        The number of instances from which the distribution is computed,
        excluding those on which the value was undefined. If instances were
        weighted, this is the sum of weights.

    .. attribute:: normalized

        :obj:`True` if distribution is normalized.

    .. attribute:: randomGenerator

        A pseudo-random number generator used for method :obj:`random`.

    .. method:: __init__(variable[, data[, weightId=0]])

        Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`,
        depending on the variable type. If the variable is the only argument, it
        must be an instance of :obj:`Orange.data.variable.Variable`. In that case,
        an empty distribution is constructed. If data is given as well, the
        variable can also be specified by name or index in the
        domain. The constructor then computes the distribution of the specified
        variable on the given data. If instances are weighted, the id of the
        meta-attribute with weights can be passed as the third argument.

        If the variable is given by descriptor, it doesn't need to exist in the
        domain, but it must be computable from the given instances. For example,
        the variable can be a discretized version of a variable from data.

    .. method:: keys()

        Return a list of possible values (if the distribution is discrete and
        :obj:`variable` is set) or a list of encountered values otherwise.

    .. method:: values()

        Return a list of frequencies of the values returned by :obj:`keys`.

    .. method:: items()

        Return a list of (value, frequency) pairs built from the lists
        returned by :obj:`keys` and :obj:`values`.

    .. method:: native()

        Return the distribution as a list (for discrete distributions) or as a
        dictionary (for continuous distributions).

    .. method:: add(value[, weight=1])

        Increase the count of the element corresponding to ``value`` by
        ``weight``.

        :param value: Value
        :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions
        :param weight: Weight to be added to the count for ``value``
        :type weight: float

    .. method:: normalize()

        Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and
        :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are
        unchanged. This changes absolute frequencies into relative ones.

    .. method:: modus()

        Return the most common value. If there are multiple such values, one is
        chosen at random, although the chosen value will always be the same for
        the same distribution.

    .. method:: random()

        Return a random value based on the stored empirical probability
        distribution. For continuous distributions, this will always be one of
        the values which actually appeared (e.g. one of the values from
        :obj:`keys`).

        The method uses :obj:`randomGenerator`. If none has been constructed or
        assigned yet, a new one is constructed and stored for further use.

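The bookkeeping behind :obj:`normalize` and :obj:`modus` can be illustrated
without Orange. The following is a minimal pure-Python sketch, not Orange's
implementation: the helper names ``normalize_counts`` and ``most_common_index``
are hypothetical, and ties are resolved by taking the first maximum rather than
by Orange's random choice.

```python
def normalize_counts(counts):
    # Divide every count by the total (the role played by ``abs``),
    # turning absolute frequencies into relative ones.
    total = float(sum(counts))
    if total == 0.0:
        return list(counts)  # assumption: an empty distribution is left unchanged
    return [count / total for count in counts]


def most_common_index(counts):
    # Index of the largest count; the first maximum wins on ties (assumption).
    return counts.index(max(counts))


# Counts for ``workclass`` taken from the example in the class description.
counts = [685.0, 72.0, 28.0, 29.0, 59.0, 43.0, 2.0, 0.0]
relative = normalize_counts(counts)
top = most_common_index(counts)
```

Here ``relative`` sums to 1 up to floating-point rounding, and ``top`` is the
index of ``Private``.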

.. class:: Discrete

    Stores a discrete distribution of values. The class differs from its parent
    class in having a few additional constructors.

    .. method:: __init__(variable)

        Construct an instance of :obj:`Discrete` and set the variable
        attribute.

        :param variable: A discrete variable
        :type variable: Orange.data.variable.Discrete

    .. method:: __init__(frequencies)

        Construct an instance and initialize the frequencies from the list, but
        leave `Distribution.variable` empty.

        :param frequencies: A list of frequencies
        :type frequencies: list

        A distribution constructed in this way can be used, for instance, to
        generate random numbers from a given discrete distribution::

            disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2])
            for i in range(20):
                print disc.random(),

        This prints out approximately ten 0's, six 1's and four 2's. The values
        can be named by assigning a variable::

            v = Orange.data.variable.Discrete(values=["red", "green", "blue"])
            disc.variable = v

    .. method:: __init__(distribution)

        Copy constructor; makes a shallow copy of the given distribution.

        :param distribution: An existing discrete distribution
        :type distribution: Discrete

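Drawing from a list of frequencies, as :obj:`random` does for a distribution
built from ``[0.5, 0.3, 0.2]``, can be sketched with a cumulative sum. This is
an illustration under stated assumptions rather than Orange's algorithm:
``sample_index`` is a hypothetical helper, and Python's ``random.Random`` stands
in for :obj:`randomGenerator`.

```python
import random


def sample_index(freqs, rng):
    # Draw an index with probability proportional to freqs[index].
    threshold = rng.random() * sum(freqs)
    cumulative = 0.0
    for index, freq in enumerate(freqs):
        cumulative += freq
        if threshold < cumulative:
            return index
    return len(freqs) - 1  # guard against floating-point round-off


rng = random.Random(42)  # fixed seed so that a run is repeatable
draws = [sample_index([0.5, 0.3, 0.2], rng) for _ in range(20)]
tally = [draws.count(i) for i in range(3)]  # how often 0, 1 and 2 were drawn
```

With only 20 draws, ``tally`` fluctuates around the expected ten, six and four.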

.. class:: Continuous

    Stores a continuous distribution, that is, a dictionary-like structure with
    values and their frequencies.

    .. method:: __init__(variable)

        Construct an instance of :obj:`Continuous` and set the variable
        attribute.

        :param variable: A continuous variable
        :type variable: Orange.data.variable.Continuous

    .. method:: __init__(frequencies)

        Construct an instance of :obj:`Continuous` and initialize it from
        the given dictionary with frequencies, whose keys and values must be integers.

        :param frequencies: Values and their corresponding frequencies
        :type frequencies: dict

    .. method:: __init__(distribution)

        Copy constructor; makes a shallow copy of the given distribution.

        :param distribution: An existing continuous distribution
        :type distribution: Continuous

    .. method:: average()

        Return the average value. Note that the average can also be
        computed using simpler and faster classes from module
        :obj:`Orange.statistics.basic`.

    .. method:: var()

        Return the variance of the distribution.

    .. method:: dev()

        Return the standard deviation.

    .. method:: error()

        Return the standard error.

    .. method:: percentile(p)

        Return the value at the `p`-th percentile.

        :param p: The percentile, must be between 0 and 100
        :type p: float
        :rtype: float

        For example, if `d_age` is a continuous distribution, the quartiles can
        be printed by ::

            print "Quartiles: %5.3f - %5.3f - %5.3f" % (
                d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))

    .. method:: density(x)

        Return the probability density at `x`. If the value is not in
        :obj:`Distribution.keys`, it is interpolated.

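The statistics offered by :obj:`Continuous` can be computed by hand from a
dictionary of values and frequencies. The sketch below is not Orange's code and
makes several assumptions that the text above does not confirm: it uses the
population variance, takes the standard error as the deviation divided by the
square root of the number of cases, and picks percentiles by nearest rank
without interpolation. The names ``weighted_stats`` and ``nearest_rank_percentile``
and the sample data are hypothetical.

```python
import math


def weighted_stats(freqs):
    # freqs maps each observed value to its (possibly weighted) count.
    n = float(sum(freqs.values()))
    mean = sum(value * count for value, count in freqs.items()) / n
    var = sum(count * (value - mean) ** 2 for value, count in freqs.items()) / n
    dev = math.sqrt(var)
    error = dev / math.sqrt(n)  # assumption: standard error of the mean
    return mean, var, dev, error


def nearest_rank_percentile(freqs, p):
    # Smallest observed value whose cumulative share reaches p percent.
    n = float(sum(freqs.values()))
    cumulative = 0.0
    for value in sorted(freqs):
        cumulative += freqs[value]
        if cumulative / n * 100.0 >= p:
            return value
    return max(freqs)


ages = {20.0: 2, 30.0: 4, 40.0: 2}  # made-up data: value -> frequency
mean, var, dev, error = weighted_stats(ages)
median = nearest_rank_percentile(ages, 50)
```

For this made-up data the mean and the median are both 30.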

.. class:: Gaussian

    A class imitating :obj:`Continuous` by returning the statistics and
    densities for a Gaussian distribution. The class is meant only as a
    convenient substitute in code which expects an instance of
    :obj:`Distribution`. For general use, the Python module :obj:`random`
    provides a comprehensive set of functions for various random distributions.

    .. attribute:: mean

        The mean value parameter of the Gaussian distribution.

    .. attribute:: sigma

        The standard deviation of the distribution.

    .. attribute:: abs

        The simulated number of instances; in effect, the Gaussian distribution
        density, as returned by method :obj:`density`, is multiplied by
        :obj:`abs`.

    .. method:: __init__([mean=0, sigma=1])

        Construct an instance, set :obj:`mean` and :obj:`sigma` to the given
        values and :obj:`abs` to 1.

    .. method:: __init__(distribution)

        Construct a distribution which approximates the given distribution,
        which must be either :obj:`Continuous`, in which case its
        average and deviation will be used for mean and sigma, or an existing
        :obj:`Gaussian`, which will be copied. Attribute :obj:`abs`
        is set to the given distribution's ``abs``.

    .. method:: average()

        Return :obj:`mean`.

    .. method:: dev()

        Return :obj:`sigma`.

    .. method:: var()

        Return square of :obj:`sigma`.

    .. method:: density(x)

        Return the density at point ``x``, that is, the Gaussian distribution
        density multiplied by :obj:`abs`.

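The effect of :obj:`abs` on :obj:`density` can be written out directly. The
function below is a sketch of the standard Gaussian density formula scaled by
``abs``; the name ``gaussian_density`` and its argument names are hypothetical,
and the code does not call Orange.

```python
import math


def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    # Standard normal density formula, multiplied by the simulated
    # number of instances (``abs_`` mirrors the ``abs`` attribute).
    z = (x - mean) / sigma
    return abs_ * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


peak = gaussian_density(0.0)               # density at the mean with abs = 1
scaled = gaussian_density(0.0, abs_=10.0)  # same point, ten simulated instances
```

With ``abs_`` left at 1 the function is an ordinary probability density; any
other value simply rescales the curve.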

Class distributions
===================

There is a convenience function for computing empirical class distributions from
data.

.. function:: getClassDistribution(data[, weightID=0])

    Return a class distribution for the given data.

    :param data: A set of instances.
    :type data: Orange.data.Table
    :param weightID: An id for meta attribute with weights of instances
    :type weightID: int
    :rtype: :obj:`Discrete` or :obj:`Continuous`, depending on the class type

Distributions of all variables
==============================

Distributions of all variables can be computed and stored in
:obj:`Domain`. The list-like object can be indexed by variable
indices in the domain, as well as by variables and their names.

.. class:: Domain

    .. method:: __init__(data[, weightID=0])

        Construct an instance with distributions of all discrete and continuous
        variables from the given data.

        :param data: A set of instances.
        :type data: Orange.data.Table
        :param weightID: An id for meta attribute with weights of instances
        :type weightID: int

The script below computes distributions for all attributes in the data and
prints out distributions for discrete and averages for continuous attributes. ::

    import orange  # legacy module; provides VarTypes

    dist = Orange.statistics.distribution.Domain(data)

    for d in dist:
        if d.variable.varType == orange.VarTypes.Discrete:
            print "%30s: %s" % (d.variable.name, d)
        else:
            print "%30s: avg. %5.3f" % (d.variable.name, d.average())

The distribution for, say, attribute `age` can be obtained by its index and also
by its name::

    dist_age = dist["age"]

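Indexing by position and by name can be pictured with a small pure-Python
container. ``DomainSketch`` is a hypothetical stand-in written for illustration,
the rows are made up, and ``None`` is assumed to mark an unknown value; the real
:obj:`Domain` also accepts variable descriptors as indices.

```python
class DomainSketch(object):
    # One value -> count dictionary per column, indexable by position or name.

    def __init__(self, names, rows):
        self.names = list(names)
        self.dists = [{} for _ in self.names]
        for row in rows:
            for dist, value in zip(self.dists, row):
                if value is not None:  # assumption: None marks an unknown value
                    dist[value] = dist.get(value, 0) + 1

    def __getitem__(self, key):
        if isinstance(key, str):
            key = self.names.index(key)  # translate a name into a position
        return self.dists[key]

    def __len__(self):
        return len(self.dists)


rows = [("Private", 39), ("State-gov", 50), ("Private", None)]  # made-up rows
dist = DomainSketch(["workclass", "age"], rows)
dist_age = dist["age"]  # the same object as dist[1]
```

Both ``dist["age"]`` and ``dist[1]`` return the same dictionary, mirroring the
two ways of addressing a distribution described above.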
"""


from Orange.core import Distribution
from Orange.core import DiscDistribution as Discrete
from Orange.core import ContDistribution as Continuous
from Orange.core import GaussianDistribution as Gaussian

from Orange.core import DomainDistributions as Domain