Changeset 7626:b33c516eedd8 in orange


Timestamp:
02/08/11 22:57:04 (3 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Convert:
14b1b52c81eec19a99f157610fdd1dd21d94a69a
Message:

Added documentation about Distribution

File:
1 edited

  • orange/Orange/statistics/distributions.py

    r7623 r7626  
    2020    standard deviation of a variable. It does not include the median or any 
    2121    other statistics that can be computed on the fly, without remembering the 
    22     data; such statistics can be obtained using :obj:`ContDistribution`. !!!TODO 
     22    data; such statistics can be obtained using :obj:`ContDistribution`. 
    2323 
    2424    Instances of this class are seldom constructed manually; they are more often 
     
    126126.. _distributions-basic-stat.py: code/distributions-basic-stat.py 
    127127 
     128 
     129================================ 
     130Distributions of variable values 
     131================================ 
     132 
     133Class :obj:`Distribution` and derived classes are used for storing empirical 
     134distributions of discrete and continuous variables. 
     135 
     136.. class:: Distribution 
     137 
     138    A base class for storing distributions of variable values. The class can 
     139    store absolute or relative frequencies. Provides a convenience constructor 
     140    which constructs instances of derived classes. :: 
     141 
     142        >>> import Orange 
     143        >>> data = Orange.data.Table("adult_sample") 
     144        >>> disc = Orange.statistics.distributions.Distribution("workclass", data)
     145        >>> print disc
     146        <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
     147        >>> print type(disc)
     148        <type 'DiscDistribution'>
     149 
     150    The resulting distribution is of type :obj:`DiscDistribution` since variable 
     151    `workclass` is discrete. The printed numbers are counts of examples that have a particular
     152    attribute value. :: 
     153 
     154        >>> workclass = data.domain["workclass"] 
     155        >>> for i in range(len(workclass.values)):
     156        ...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
     157                 Private: 685.000 
     158        Self-emp-not-inc: 72.000 
     159            Self-emp-inc: 28.000 
     160             Federal-gov: 29.000 
     161               Local-gov: 59.000 
     162               State-gov: 43.000 
     163             Without-pay: 2.000 
     164            Never-worked: 0.000 
     165 
     166    Distributions resemble dictionaries, supporting indexing by instances of
     167    :obj:`Orange.data.Value`, integers or floats (depending on the distribution 
     168    type), and symbolic names (if :obj:`variable` is defined). 
     169 
     170    For instance, the number of examples with `workclass="Private"` can be
     171    obtained in three ways::
     172
     173        print "Private: ", disc["Private"]
     174        print "Private: ", disc[0]
     175        print "Private: ", disc[Orange.data.Value(workclass, "Private")]
     176 
     177    Elements cannot be removed from distributions. 
     178 
     179    The length of a distribution equals the number of possible values for discrete
     180    distributions (if :obj:`variable` is set), the value with the highest index
     181    encountered (if the distribution is discrete and :obj:`variable` is
     182    :obj:`None`) or the number of different values encountered (for continuous
     183    distributions).
     184 
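The dictionary-like indexing described above can be imitated in a few lines of plain Python. This is only a sketch for illustration: ``TinyDiscDistribution`` is a hypothetical name, it does not use Orange, and it assumes that values are identified either by an integer index or by a symbolic name.

```python
# Plain-Python sketch; TinyDiscDistribution is a hypothetical stand-in,
# not the Orange class documented here.
class TinyDiscDistribution(object):
    def __init__(self, names, data):
        self.names = list(names)               # possible values, in order
        self.counts = [0.0] * len(self.names)  # one counter per possible value
        for value in data:
            self.counts[self._index(value)] += 1.0

    def _index(self, key):
        # Accept either a symbolic name or an integer index.
        if isinstance(key, str):
            return self.names.index(key)
        return key

    def __getitem__(self, key):
        return self.counts[self._index(key)]

    def __len__(self):
        # Length is the number of possible values, not of observed ones.
        return len(self.counts)


disc = TinyDiscDistribution(["Private", "Local-gov", "Never-worked"],
                            ["Private", "Local-gov", "Private"])
```

Here ``disc["Private"]`` and ``disc[0]`` address the same counter, and ``len(disc)`` counts ``Never-worked`` even though it never occurs in the data.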
     185    .. attribute:: variable 
     186 
     187        Variable to which the distribution applies; may be :obj:`None` if not 
     188        applicable. 
     189 
     190    .. attribute:: unknowns 
     191 
     192        The number of instances for which the value of the variable was 
     193        undefined. 
     194 
     195    .. attribute:: abs 
     196 
     197        Sum of all elements in the distribution. Usually it equals either 
     198        :obj:`cases` if the instance stores absolute frequencies or 1 if the 
     199        stored frequencies are relative, e.g. after calling :obj:`normalize`. 
     200 
     201    .. attribute:: cases 
     202 
     203        The number of instances from which the distribution is computed, 
     204        excluding those on which the value was undefined. If instances were 
     205        weighted, this is the sum of weights. 
     206 
     207    .. attribute:: normalized 
     208 
     209        :obj:`True` if distribution is normalized. 
     210 
     211    .. attribute:: randomGenerator 
     212 
     213        A pseudo-random number generator used for method :obj:`random`. 
     214 
     215    .. method:: __init__(variable[, data[, weightId=0]]) 
     216 
     217        Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`, 
     218        depending on the variable type. If the variable is the only argument, it 
     219        must be an instance of :obj:`Orange.data.feature.Feature`. In that case, 
     220        an empty distribution is constructed. If data is given as well, the 
     221        variable can also be specified by name or index in the 
     222        domain. Constructor then computes the distribution of the specified 
     223        variable on the given data. If instances are weighted, the id of 
     224        meta-attribute with weights can be passed as the third argument. 
     225 
     226        If the variable is given by descriptor, it doesn't need to exist in the
     227        domain, but it must be computable from the given instances. For example, the
     228        variable can be a discretized version of a variable from data.
     229 
     230    .. method:: keys() 
     231 
     232        Return a list of possible values (if distribution is discrete and 
     233        :obj:`variable` is set) or a list of encountered values otherwise.
     234 
     235    .. method:: values() 
     236 
     237        Return a list of frequencies of values such as described above. 
     238 
     239    .. method:: items() 
     240 
     241        Return a list of pairs of elements of the above lists. 
     242 
     243    .. method:: native() 
     244 
     245        Return the distribution as a list (for discrete distributions) or as a 
     246        dictionary (for continuous distributions).
     247 
     248    .. method:: add(value[, weight=1]) 
     249 
     250        Increase the count of the element corresponding to ``value`` by 
     251        ``weight``. 
     252 
     253        :param value: Value 
     254        :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions 
     255        :param weight: Weight to be added to the count for ``value`` 
     256        :type weight: float 
     257 
     258    .. method:: normalize() 
     259 
     260        Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and 
     261        :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are 
     262        unchanged. This changes absolute frequencies into relative ones.
     263 
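As an illustration of what normalization does to the stored numbers, here is a plain-Python sketch (not Orange code). The helper ``normalize`` is hypothetical, and leaving an all-zero list unchanged is an assumption of this sketch rather than documented behaviour.

```python
def normalize(counts):
    """Turn absolute frequencies into relative ones (sketch, not Orange)."""
    total = float(sum(counts))
    if total == 0:
        # Assumption: an empty distribution is left as it is.
        return list(counts)
    return [c / total for c in counts]


# The counts printed for "workclass" in the example above.
relative = normalize([685, 72, 28, 29, 59, 43, 2])
```

After the call the elements sum to 1, which corresponds to :obj:`abs` being set to 1.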
     264    .. method:: modus() 
     265 
     266        Return the most common value. If there are multiple such values, one is 
     267        chosen at random, although the chosen value will always be the same for 
     268        the same distribution. 
     269 
     270    .. method:: random() 
     271 
     272        Return a random value based on the stored empirical probability 
     273        distribution. For continuous distributions, this will always be one of 
     274        the values which actually appeared (e.g. one of the values from 
     275        :obj:`keys`). 
     276 
     277        The method uses :obj:`randomGenerator`. If none has been constructed or 
     278        assigned yet, a new one is constructed and stored for further use. 
     279 
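Drawing a value from stored frequencies can be sketched with the standard :obj:`random` module. This is an illustration only: ``sample_index`` is a hypothetical helper, and walking the cumulative sum is one common technique, not necessarily the one used by :obj:`random` in this class.

```python
import random


def sample_index(frequencies, rng):
    """Return an index with probability proportional to its frequency."""
    total = float(sum(frequencies))
    threshold = rng.random() * total  # uniform point in [0, total)
    running = 0.0
    for index, freq in enumerate(frequencies):
        running += freq
        if threshold < running:
            return index
    return len(frequencies) - 1  # guard against floating-point rounding


rng = random.Random(0)  # explicit generator, so results are repeatable
draws = [sample_index([0.5, 0.3, 0.2], rng) for _ in range(2000)]
counts = [draws.count(i) for i in range(3)]
```

Passing the generator explicitly mirrors the role of :obj:`randomGenerator`: the same generator state gives the same sequence of draws.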
     280 
     281.. class:: DiscDistribution 
     282 
     283    Stores a discrete distribution of values. The class differs from its parent 
     284    class in having a few additional constructors. 
     285 
     286    .. method:: __init__(variable) 
     287 
     288        Construct an instance of :obj:`DiscDistribution` and set the variable 
     289        attribute. 
     290 
     291        :param variable: A discrete variable 
     292        :type variable: Orange.data.feature.Discrete 
     293 
     294    .. method:: __init__(frequencies) 
     295 
     296        Construct an instance and initialize the frequencies from the list, but 
     297        leave `Distribution.variable` empty.
     298 
     299        :param frequencies: A list of frequencies 
     300        :type frequencies: list 
     301 
     302        Distribution constructed in this way can be used, for instance, to 
     303        generate random numbers from a given discrete distribution::
     304 
     305            disc = orange.DiscDistribution([0.5, 0.3, 0.2]) 
     306            for i in range(20): 
     307                print disc.random(), 
     308 
     309        This prints out approximately ten 0's, six 1's and four 2's. The values
     310        can be named by assigning a variable::
     311 
     312            v = orange.EnumVariable(values = ["red", "green", "blue"]) 
     313            disc.variable = v 
     314 
     315    .. method:: __init__(distribution) 
     316 
     317        Copy constructor; makes a shallow copy of the given distribution.
     318 
     319        :param distribution: An existing discrete distribution 
     320        :type distribution: DiscDistribution 
     321 
     322 
     323.. class:: ContDistribution 
     324 
     325    Stores a continuous distribution, that is, a dictionary-like structure with 
     326    values and their frequencies. 
     327 
     328    .. method:: __init__(variable) 
     329 
     330        Construct an instance of :obj:`ContDistribution` and set the variable 
     331        attribute. 
     332 
     333        :param variable: A continuous variable 
     334        :type variable: Orange.data.feature.Continuous 
     335 
     336    .. method:: __init__(frequencies) 
     337 
     338        Construct an instance of :obj:`ContDistribution` and initialize it from 
     339        the given dictionary with frequencies, whose keys and values must be integers. 
     340 
     341        :param frequencies: Values and their corresponding frequencies 
     342        :type frequencies: dict 
     343 
     344    .. method:: __init__(distribution) 
     345 
     346        Copy constructor; makes a shallow copy of the given distribution.
     347 
     348        :param distribution: An existing continuous distribution 
     349        :type distribution: ContDistribution 
     350 
     351    .. method:: average() 
     352 
     353        Return the average value. Note that the average can also be computed 
     354        using a simpler and faster class 
     355        :obj:`Orange.statistics.distributions.BasicStatistics`. 
     356 
     357    .. method:: var() 
     358 
     359        Return the variance of distribution. 
     360 
     361    .. method:: dev() 
     362 
     363        Return the standard deviation. 
     364 
     365    .. method:: error() 
     366 
     367        Return the standard error. 
     368 
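The four statistics above can be computed from a dictionary of values and their frequencies roughly as follows. This is a plain-Python sketch with a hypothetical helper ``summarize``; it assumes the population form of the variance (dividing by the total weight) and the standard error taken as the deviation divided by the square root of the total weight. The class itself may use different conventions.

```python
import math


def summarize(freqs):
    """Return (average, variance, deviation, standard error) for a
    {value: frequency} dictionary. Sketch under the stated assumptions."""
    n = float(sum(freqs.values()))
    mean = sum(v * w for v, w in freqs.items()) / n
    # Assumption: population variance (divide by n, not n - 1).
    var = sum(w * (v - mean) ** 2 for v, w in freqs.items()) / n
    dev = math.sqrt(var)
    return mean, var, dev, dev / math.sqrt(n)


mean, var, dev, err = summarize({1.0: 1, 2.0: 2, 3.0: 1})
```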
     369    .. method:: percentile(p) 
     370 
     371        Return the value at the `p`-th percentile. 
     372 
     373        :param p: The percentile, must be between 0 and 100 
     374        :type p: float 
     375        :rtype: float 
     376 
     377        For example, if `d_age` is a continuous distribution, the quartiles can
     378        be printed by ::
     379
     380            print "Quartiles: %5.3f - %5.3f - %5.3f" % (
     381                d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))
     382 
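To make the idea of a percentile over stored frequencies concrete, here is a plain-Python sketch. The helper ``percentile`` is hypothetical and uses the nearest-rank rule, which always returns one of the stored values; the method documented above may interpolate between values instead.

```python
def percentile(freqs, p):
    """Nearest-rank percentile of a {value: frequency} dictionary (sketch)."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    total = float(sum(freqs.values()))
    target = total * p / 100.0
    running = 0.0
    ordered = sorted(freqs.items())
    for value, weight in ordered:
        running += weight  # cumulative frequency up to this value
        if running >= target:
            return value
    return ordered[-1][0]


ages = {20.0: 1, 30.0: 2, 40.0: 1}
quartiles = [percentile(ages, p) for p in (25, 50, 75)]
```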
     383    .. method:: density(x)
     384
     385        Return the probability density at `x`. If the value is not in
     386        :obj:`Distribution.keys`, it is interpolated.
     387 
     388 
     389.. class:: GaussianDistribution 
     390 
     391    A class imitating :obj:`ContDistribution` by returning the statistics and 
     392    densities for Gaussian distribution. The class is meant only as a
     393    convenient substitute in code which expects an instance of
     394    :obj:`Distribution`. For general use, Python module :obj:`random` 
     395    provides a comprehensive set of functions for various random distributions. 
     396 
     397    .. attribute:: mean 
     398 
     399        The mean value parameter of the Gaussian distribution.
     400 
     401    .. attribute:: sigma 
     402 
     403        The standard deviation of the distribution.
     404 
     405    .. attribute:: abs 
     406 
     407        The simulated number of instances; in effect, the Gaussian distribution 
     408        density, as returned by method :obj:`density`, is multiplied by
     409        :obj:`abs`. 
     410 
     411    .. method:: __init__([mean=0, sigma=1]) 
     412 
     413        Construct an instance, set :obj:`mean` and :obj:`sigma` to the given 
     414        values and :obj:`abs` to 1. 
     415 
     416    .. method:: __init__(distribution) 
     417 
     418        Construct a distribution which approximates the given distribution, 
     419        which must be either :obj:`ContDistribution`, in which case its 
     420        average and deviation will be used for mean and sigma, or an existing
     421        :obj:`GaussianDistribution`, which will be copied. Attribute :obj:`abs` 
     422        is set to the given distribution's ``abs``. 
     423 
     424    .. method:: average() 
     425 
     426        Return :obj:`mean`. 
     427 
     428    .. method:: dev() 
     429 
     430        Return :obj:`sigma`. 
     431 
     432    .. method:: var() 
     433 
     434        Return square of :obj:`sigma`. 
     435 
     436    .. method:: density(x) 
     437 
     438        Return the density at point ``x``, that is, the Gaussian distribution 
     439        density multiplied by :obj:`abs`.
     440 
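The scaled density described above corresponds to the usual Gaussian density formula multiplied by a constant. A plain-Python sketch, where ``gaussian_density`` and the parameter name ``abs_`` are hypothetical and stand in for the attributes :obj:`mean`, :obj:`sigma` and :obj:`abs`:

```python
import math


def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    """Gaussian probability density at x, scaled by abs_ (sketch)."""
    z = (x - mean) / sigma
    return abs_ * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


peak = gaussian_density(0.0)                  # standard normal at its mean
scaled_peak = gaussian_density(0.0, abs_=10)  # same point, ten simulated instances
```

With ``abs_`` left at 1 this is an ordinary probability density; a larger value scales the curve so that it integrates to the simulated number of instances.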
     441 
     442Class distributions 
     443=================== 
     444 
     445There is a convenience function for computing empirical class distributions from 
     446data. 
     447 
     448.. function:: getClassDistribution(data[, weightID=0]) 
     449 
     450    Return a class distribution for the given data. 
     451 
     452    :param data: A set of instances. 
     453    :type data: Orange.data.Table 
     454    :param weightID: An id for meta attribute with weights of instances 
     455    :type weightID: int 
     456    :rtype: :obj:`DiscDistribution` or :obj:`ContDistribution`, depending on the class type 
     457 
     458Distributions of all variables 
     459============================== 
     460 
     461Distributions of all variables can be computed and stored in 
     462:obj:`DomainDistributions`. The list-like object can be indexed by variable 
     463indices in the domain, as well as by variables and their names. 
     464 
     465.. class:: DomainDistributions 
     466 
     467    .. method:: __init__(data[, weightID=0]) 
     468 
     469        Construct an instance with distributions of all discrete and continuous 
     470        variables from the given data. 
     471 
     472        :param data: A set of instances.
     473        :type data: Orange.data.Table
     474        :param weightID: An id for meta attribute with weights of instances
     475        :type weightID: int
     476 
     477The script below computes distributions for all attributes in the data and 
     478prints out distributions for discrete and averages for continuous attributes. :: 
     479 
     480    dist = orange.DomainDistributions(data)
     481
     482    for d in dist:
     483        if d.variable.varType == orange.VarTypes.Discrete:
     484            print "%30s: %s" % (d.variable.name, d)
     485        else:
     486            print "%30s: avg. %5.3f" % (d.variable.name, d.average())
     487 
     488The distribution for, say, attribute `age` can be obtained by its index and also 
     489by its name:: 
     490 
     491    dist_age = dist["age"] 
    128492 
    129493================== 