Distributions (distribution)¶
Distribution and derived classes store empirical distributions of discrete and continuous variables.
- class Orange.statistics.distribution.Distribution¶
This class can store absolute or relative frequencies. It provides a convenience constructor which constructs instances of derived classes.
>>> import Orange
>>> data = Orange.data.Table("adult_sample")
>>> disc = Orange.statistics.distribution.Distribution("workclass", data)
>>> print disc
<685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
>>> print type(disc)
<type 'DiscDistribution'>
The resulting distribution is of type DiscDistribution since the variable workclass is discrete. The printed numbers are the counts of examples that have each particular attribute value.
>>> workclass = data.domain["workclass"]
>>> for i in range(len(workclass.values)):
...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
             Private: 685.000
    Self-emp-not-inc: 72.000
        Self-emp-inc: 28.000
         Federal-gov: 29.000
           Local-gov: 59.000
           State-gov: 43.000
         Without-pay: 2.000
        Never-worked: 0.000
Distributions resemble dictionaries: they support indexing by instances of Orange.data.Value, by integers or floats (depending on the distribution type), and by symbolic names (if variable is defined).
For instance, the number of examples with workclass="Private" can be obtained in three ways:
print "Private: ", disc["Private"]
print "Private: ", disc[0]
print "Private: ", disc[Orange.data.Value(workclass, "Private")]
Elements cannot be removed from distributions.
The length of a distribution equals the number of possible values for discrete distributions (if variable is set), the value with the highest index encountered (if the distribution is discrete and variable is None), or the number of different values encountered (for continuous distributions).
- unknowns¶
The number of instances for which the value of the variable was undefined.
- abs¶
Sum of all elements in the distribution. Usually it equals either cases if the instance stores absolute frequencies or 1 if the stored frequencies are relative, e.g. after calling normalize.
- cases¶
The number of instances from which the distribution is computed, excluding those on which the value was undefined. If instances were weighted, this is the sum of weights.
- random_generator¶
A pseudo-random number generator (an instance of Orange.misc.Random) used by method random.
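The relationship between unknowns, cases and abs can be illustrated with a minimal plain-Python sketch. The function summarize and its arguments are hypothetical and are not part of Orange; in particular, counting unknowns by weight rather than by number of instances is an assumption made here for simplicity.

```python
def summarize(values, weights=None):
    """Return (counts, unknowns, cases, abs_total) for a list of values.

    None marks an undefined value. Illustration only, not Orange's code.
    """
    if weights is None:
        weights = [1.0] * len(values)
    counts = {}
    unknowns = 0.0
    cases = 0.0
    for value, weight in zip(values, weights):
        if value is None:
            # Assumption: undefined values are tallied by weight.
            unknowns += weight
        else:
            counts[value] = counts.get(value, 0.0) + weight
            cases += weight  # sum of weights of instances with a defined value
    abs_total = sum(counts.values())  # equals cases while frequencies are absolute
    return counts, unknowns, cases, abs_total


counts, unknowns, cases, abs_total = summarize(
    ["Private", "Local-gov", None, "Private"],
    weights=[1.0, 2.0, 1.0, 0.5])
```

Until the frequencies are normalized, abs_total and cases coincide, which matches the description of abs above.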
- __init__(variable[, data[, weightId=0]])¶
Construct either DiscDistribution or ContDistribution, depending on the variable type. If the variable is the only argument, it must be an instance of Orange.feature.Descriptor; in that case, an empty distribution is constructed. If data is given as well, the variable can also be specified by name or by index in the domain. The constructor then computes the distribution of the specified variable on the given data. If instances are weighted, the id of the meta-attribute with weights can be passed as the third argument.
If variable is given by descriptor, it doesn’t need to exist in the domain, but it must be computable from given instances. For example, the variable can be a discretized version of a variable from data.
- keys()¶
Return a list of possible values (if the distribution is discrete and variable is set) or a list of encountered values otherwise.
- values()¶
Return a list of frequencies of the values returned by keys.
- items()¶
Return a list of pairs of elements of the above lists.
- native()¶
Return the distribution as a list (for discrete distributions) or as a dictionary (for continuous distributions).
- add(value[, weight=1])¶
Increase the count of the element corresponding to value by weight.
Parameters: - value (Orange.data.Value, string (if variable is set), int for discrete distributions or float for continuous distributions) – Value
- weight (float) – Weight to be added to the count for value
- normalize()¶
Divide the counts by their sum, set normalized to True and abs to 1. Attributes cases and unknowns are unchanged. This changes absolute frequencies into relative ones.
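As a rough illustration of the arithmetic behind normalize, the sketch below divides a list of counts by their sum. The function is a hypothetical stand-in written in plain Python; how Orange treats a distribution whose counts sum to zero is not stated here, so returning the counts unchanged in that case is an assumption.

```python
def normalize(counts):
    """Return relative frequencies for a list of absolute counts."""
    total = float(sum(counts))
    if total == 0:
        # Assumption: an all-zero distribution is left as it is.
        return list(counts)
    return [count / total for count in counts]


# Counts for workclass from the example above.
counts = [685.0, 72.0, 28.0, 29.0, 59.0, 43.0, 2.0, 0.0]
relative = normalize(counts)
```

After the call the relative frequencies sum to 1, which corresponds to abs being set to 1.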
- modus()¶
Return the most common value. If there are multiple such values, one is chosen at random, although the chosen value will always be the same for the same distribution.
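One way to obtain a tie-break that looks random yet is repeatable for the same distribution is to seed a generator from the distribution's contents. The sketch below shows that idea in plain Python; it is only an assumption about how such behaviour could be achieved, not a description of Orange's implementation, and the function name is hypothetical.

```python
import random


def modus(counts):
    """Return the index of the largest count, breaking ties repeatably."""
    best = max(counts)
    candidates = [i for i, count in enumerate(counts) if count == best]
    # Seed from the contents so equal distributions give equal answers.
    generator = random.Random(repr(counts))
    return generator.choice(candidates)


single = modus([1, 5, 2])
tied_first = modus([3, 3, 1])
tied_second = modus([3, 3, 1])
```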
- random()¶
Return a random value based on the stored empirical probability distribution. For continuous distributions, this will always be one of the values which actually appeared (e.g. one of the values from keys).
The method uses random_generator. If none has been constructed or assigned yet, a new one is constructed and stored for further use.
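The behaviour described for random can be sketched in plain Python as sampling in proportion to the stored frequencies, with a generator that is created on first use and then kept. The class FrequencySampler is hypothetical, and the fixed seed is an assumption chosen only to make the sketch reproducible.

```python
import random


class FrequencySampler(object):
    """Hypothetical helper: draws indices in proportion to frequencies."""

    def __init__(self, frequencies):
        self.frequencies = list(frequencies)
        self.random_generator = None  # constructed lazily, then reused

    def random(self):
        if self.random_generator is None:
            # Assumption: a fixed seed, for reproducibility of the sketch.
            self.random_generator = random.Random(0)
        threshold = self.random_generator.random() * sum(self.frequencies)
        running = 0.0
        for index, frequency in enumerate(self.frequencies):
            running += frequency
            if threshold < running:
                return index
        # Guard against rounding at the upper end of the range.
        return len(self.frequencies) - 1


sampler = FrequencySampler([0.5, 0.3, 0.2])
draws = [sampler.random() for _ in range(10000)]
```

With enough draws, the share of each index approaches its stored frequency.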
- class Orange.statistics.distribution.Discrete¶
Stores a discrete distribution of values. The class differs from its parent class in having a few additional constructors.
- __init__(variable)¶
Construct an instance of Discrete and set the variable attribute.
Parameters: variable (Orange.feature.Discrete) – A discrete variable
- __init__(frequencies)
Construct an instance and initialize the frequencies from the list, but leave Distribution.variable empty.
Parameters: frequencies (list) – A list of frequencies
A distribution constructed in this way can be used, for instance, to generate random numbers from a given discrete distribution:
disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2])
for i in range(20):
    print disc.random(),
This prints out approximately ten 0’s, six 1’s and four 2’s. The values can be named by assigning a variable:
v = Orange.feature.Discrete(values=["red", "green", "blue"])
disc.variable = v
- __init__(distribution)
Copy constructor; makes a shallow copy of the given distribution
Parameters: distribution (Discrete) – An existing discrete distribution
- class Orange.statistics.distribution.Continuous¶
Stores a continuous distribution, that is, a dictionary-like structure with values and their frequencies.
- __init__(variable)¶
Construct an instance of ContDistribution and set the variable attribute.
Parameters: variable (Orange.feature.Continuous) – A continuous variable
- __init__(frequencies)
Construct an instance of Continuous and initialize it from the given dictionary with frequencies, whose keys and values must be integers.
Parameters: frequencies (dict) – Values and their corresponding frequencies
- __init__(distribution)
Copy constructor; makes a shallow copy of the given distribution
Parameters: distribution (Continuous) – An existing continuous distribution
- average()¶
Return the average value. Note that the average can also be computed using simpler and faster classes from module Orange.statistics.basic.
- var()¶
Return the variance of distribution.
- dev()¶
Return the standard deviation.
- error()¶
Return the standard error.
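For orientation, the four statistics above can be computed from a value-to-frequency dictionary as in the plain-Python sketch below. The function name is hypothetical, and dividing by the total frequency (population variance) is an assumption; Orange may use a different normalization.

```python
import math


def weighted_stats(freq):
    """freq maps value -> frequency; return (mean, variance, deviation, error)."""
    n = float(sum(freq.values()))
    mean = sum(value * f for value, f in freq.items()) / n
    # Assumption: population variance, i.e. division by n rather than n - 1.
    variance = sum(f * (value - mean) ** 2 for value, f in freq.items()) / n
    deviation = math.sqrt(variance)
    error = deviation / math.sqrt(n)  # standard error of the mean
    return mean, variance, deviation, error


mean, variance, deviation, error = weighted_stats({1.0: 2, 2.0: 1, 4.0: 1})
```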
- percentile(p)¶
Return the value at the p-th percentile.
Parameters: p (float) – The percentile, must be between 0 and 100
Return type: float
For example, if d_age is a continuous distribution, the quartiles can be printed by
print "Quartiles: %5.3f - %5.3f - %5.3f" % (
    d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))
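A percentile over stored frequencies can be understood through the nearest-rank definition: walk the values in increasing order and stop once the cumulative frequency reaches p percent of the total. The plain-Python sketch below uses that definition as an assumption; Orange may interpolate between values instead, and the function here is hypothetical.

```python
def percentile(freq, p):
    """Smallest value whose cumulative frequency reaches p percent of the total."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    total = float(sum(freq.values()))
    target = total * p / 100.0
    running = 0.0
    for value in sorted(freq):
        running += freq[value]
        if running >= target:
            return value
    return max(freq)  # only reached through floating-point rounding


ages = {20.0: 1, 30.0: 2, 40.0: 1}
quartiles = [percentile(ages, p) for p in (25, 50, 75)]
```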
- density(x)¶
Return the probability density at x. If the value is not in Distribution.keys, it is interpolated.
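The interpolation step can be pictured as linear interpolation between the two nearest stored values. The sketch below shows that technique on a plain dictionary; both the linear scheme and the choice to return 0 outside the observed range are assumptions, since the exact method used by density is not specified here.

```python
import bisect


def interpolate_frequency(freq, x):
    """Linearly interpolate the stored frequency at x between neighbouring keys."""
    if x in freq:
        return float(freq[x])
    keys = sorted(freq)
    if x < keys[0] or x > keys[-1]:
        return 0.0  # assumption: nothing is extrapolated outside the data
    upper = bisect.bisect_left(keys, x)  # index of the first key greater than x
    x0, x1 = keys[upper - 1], keys[upper]
    t = (x - x0) / float(x1 - x0)
    return freq[x0] + t * (freq[x1] - freq[x0])


freq = {1.0: 2.0, 3.0: 6.0}
midpoint = interpolate_frequency(freq, 2.0)
```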
- class Orange.statistics.distribution.Gaussian¶
A class imitating Continuous by returning the statistics and densities for a Gaussian distribution. The class is meant only as a convenient substitute in code which expects an instance of Distribution. For general use, the Python module random provides a comprehensive set of functions for various random distributions.
- mean¶
The mean value parameter of the Gaussian distribution.
- sigma¶
The standard deviation of the distribution.
- abs¶
The simulated number of instances; in effect, the Gaussian distribution density, as returned by method density, is multiplied by abs.
- __init__([mean=0, sigma=1])¶
Construct an instance, set mean and sigma to the given values and abs to 1.
- __init__(distribution)
Construct a distribution which approximates the given distribution, which must be either Continuous, in which case its average and deviation will be used for mean and sigma, or an existing GaussianDistribution, which will be copied. Attribute abs is set to the given distribution’s abs.
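The role of abs as a scaling factor can be seen in a small plain-Python sketch of a Gaussian density. The class GaussianSketch is hypothetical and only mirrors the attribute names mean, sigma and abs used above; it is not Orange's class.

```python
import math


class GaussianSketch(object):
    """Hypothetical stand-in holding mean, sigma and abs."""

    def __init__(self, mean=0.0, sigma=1.0, total=1.0):
        self.mean = mean
        self.sigma = sigma
        self.abs = total  # simulated number of instances

    def density(self, x):
        z = (x - self.mean) / self.sigma
        pdf = math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2.0 * math.pi))
        return self.abs * pdf  # the normal density scaled by abs


unit = GaussianSketch()
scaled = GaussianSketch(mean=0.0, sigma=1.0, total=10.0)
peak = unit.density(0.0)
scaled_peak = scaled.density(0.0)
```

With abs equal to 1 the result is the ordinary normal density; a larger abs scales it as if that many instances had been observed.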
Class distributions¶
There is a convenience function for computing empirical class distributions from data.
- Orange.statistics.distribution.getClassDistribution(data[, weightID=0])¶
Return a class distribution for the given data.
Parameters: - data (Orange.data.Table) – A set of instances.
- weightID (int) – An id for meta attribute with weights of instances
Return type: Discrete or Continuous, depending on the class type
Distributions of all variables¶
Distributions of all variables can be computed and stored in Domain. The list-like object can be indexed by variable indices in the domain, as well as by variables and their names.
- class Orange.statistics.distribution.Domain¶
- __init__(data[, weightID=0])¶
Construct an instance with distributions of all discrete and continuous variables from the given data.
Parameters: - data (Orange.data.Table) – A set of instances.
- weightID (int) – An id for meta attribute with weights of instances
The script below computes distributions for all attributes in the data and prints out distributions for discrete and averages for continuous attributes.
dist = Orange.statistics.distribution.Domain(data)
for d in dist:
    if d.variable.var_type == Orange.feature.Type.Discrete:
        print "%30s: %s" % (d.variable.name, d)
    else:
        print "%30s: avg. %5.3f" % (d.variable.name, d.average())
The distribution for, say, attribute age can be obtained by its index and also by its name:
dist_age = dist["age"]
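The list-like behaviour with access by index or by name can be mimicked in plain Python as sketched below. The class DomainSketch, its constructor arguments and the sample rows are hypothetical; the sketch only counts values per column and skips undefined ones (None).

```python
class DomainSketch(object):
    """Hypothetical container of per-column value counts."""

    def __init__(self, names, rows):
        self.names = list(names)
        self.distributions = []
        for column in range(len(self.names)):
            counts = {}
            for row in rows:
                value = row[column]
                if value is not None:  # undefined values are not counted
                    counts[value] = counts.get(value, 0) + 1
            self.distributions.append(counts)

    def __len__(self):
        return len(self.distributions)

    def __getitem__(self, key):
        # Integers index by position, anything else is looked up by name.
        if isinstance(key, int):
            return self.distributions[key]
        return self.distributions[self.names.index(key)]


rows = [(39, "State-gov"), (50, "Private"), (39, None)]
dist = DomainSketch(["age", "workclass"], rows)
dist_age = dist["age"]
```

Because integer indexing raises IndexError past the end, the object can also be iterated with a for loop, as in the script above.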