Timestamp:
02/25/12 22:42:47 (2 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Message:

Moved documentation about statistics.distribution to rst

File:
1 edited

  • docs/reference/rst/Orange.statistics.distribution.rst

    r9372 r10372  
    1 .. automodule:: Orange.statistics.distribution 
     1.. py:currentmodule:: Orange.statistics.distribution 
     2 
     3.. index:: Distributions 
     4 
     5============= 
     6Distributions 
     7============= 
     8 
     9:obj:`Distribution` and derived classes store empirical 
     10distributions of discrete and continuous variables. 
     11 
     12.. class:: Distribution 
     13 
     14    This class can 
     15    store absolute or relative frequencies. It provides a convenience constructor 
     16    which constructs instances of derived classes. :: 
     17 
     18        >>> import Orange 
     19        >>> data = Orange.data.Table("adult_sample") 
     20        >>> disc = Orange.statistics.distribution.Distribution("workclass", data) 
     21        >>> print disc 
     22        <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000> 
     23        >>> print type(disc) 
     24        <type 'DiscDistribution'> 
     25 
     26    The resulting distribution is of type :obj:`DiscDistribution` since variable 
     27    `workclass` is discrete. The printed numbers are counts of examples that have a particular
     28    attribute value. ::
     29 
     30        >>> workclass = data.domain["workclass"] 
     31        >>> for i in range(len(workclass.values)): 
     32        ...     print "%20s: %5.3f" % (workclass.values[i], disc[i]) 
     33                 Private: 685.000 
     34        Self-emp-not-inc: 72.000 
     35            Self-emp-inc: 28.000 
     36             Federal-gov: 29.000 
     37               Local-gov: 59.000 
     38               State-gov: 43.000 
     39             Without-pay: 2.000 
     40            Never-worked: 0.000 
     41 
     42    Distributions resemble dictionaries, supporting indexing by instances of
     43    :obj:`Orange.data.Value`, integers or floats (depending on the distribution 
     44    type), and symbolic names (if :obj:`variable` is defined). 
     45 
     46    For instance, the number of examples with `workclass="Private"` can be
     47    obtained in three ways::
     48
     49        print "Private: ", disc["Private"]
     50        print "Private: ", disc[0]
     51        print "Private: ", disc[Orange.data.Value(workclass, "Private")]
     52 
     53    Elements cannot be removed from distributions. 
     54 
     55    The length of a distribution equals the number of possible values for discrete
     56    distributions (if :obj:`variable` is set), the value with the highest index
     57    encountered (if the distribution is discrete and :obj:`variable` is
     58    :obj:`None`) or the number of different values encountered (for continuous
     59    distributions).
     60 
     61    .. attribute:: variable 
     62 
     63        Variable to which the distribution applies; may be :obj:`None` if not 
     64        applicable. 
     65 
     66    .. attribute:: unknowns 
     67 
     68        The number of instances for which the value of the variable was 
     69        undefined. 
     70 
     71    .. attribute:: abs 
     72 
     73        Sum of all elements in the distribution. Usually it equals either 
     74        :obj:`cases` if the instance stores absolute frequencies or 1 if the 
     75        stored frequencies are relative, e.g. after calling :obj:`normalize`. 
     76 
     77    .. attribute:: cases 
     78 
     79        The number of instances from which the distribution is computed, 
     80        excluding those on which the value was undefined. If instances were 
     81        weighted, this is the sum of weights. 
     82 
     83    .. attribute:: normalized 
     84 
     85        :obj:`True` if distribution is normalized. 
     86 
     87    .. attribute:: random_generator 
     88 
     89        A pseudo-random number generator (:obj:`Orange.misc.Random`) used by method :obj:`random`.
     90 
     91    .. method:: __init__(variable[, data[, weightId=0]]) 
     92 
     93        Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`, 
     94        depending on the variable type. If the variable is the only argument, it 
     95        must be an instance of :obj:`Orange.feature.Descriptor`. In that case, 
     96        an empty distribution is constructed. If data is given as well, the 
     97        variable can also be specified by name or index in the 
     98        domain. Constructor then computes the distribution of the specified 
     99        variable on the given data. If instances are weighted, the id of 
     100        meta-attribute with weights can be passed as the third argument. 
     101 
     102        If variable is given by descriptor, it doesn't need to exist in the 
     103        domain, but it must be computable from given instances. For example, the 
     104        variable can be a discretized version of a variable from data. 
     105 
     106    .. method:: keys() 
     107 
     108        Return a list of possible values (if distribution is discrete and 
     109        :obj:`variable` is set) or a list of encountered values otherwise.
     110 
     111    .. method:: values() 
     112 
     113        Return a list of frequencies of the values returned by :obj:`keys`.
     114 
     115    .. method:: items() 
     116 
     117        Return a list of pairs of elements of the above lists. 
     118 
     119    .. method:: native() 
     120 
     121        Return the distribution as a list (for discrete distributions) or as a 
     122        dictionary (for continuous distributions).
     123 
     124    .. method:: add(value[, weight=1]) 
     125 
     126        Increase the count of the element corresponding to ``value`` by 
     127        ``weight``. 
     128 
     129        :param value: Value 
     130        :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions 
     131        :param weight: Weight to be added to the count for ``value`` 
     132        :type weight: float 
     133 
     134    .. method:: normalize() 
     135 
     136        Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and 
     137        :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are 
     138        unchanged. This changes absolute frequencies into relative ones.
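
The effect of normalization on the stored counts can be sketched in plain Python, independent of Orange. The function name ``normalize_counts`` is hypothetical and only illustrates the arithmetic; it does not reproduce the library's implementation or update attributes such as ``normalized``.

```python
def normalize_counts(counts):
    """Return relative frequencies that sum to 1 (illustrative helper)."""
    total = float(sum(counts))
    if total == 0:
        raise ValueError("cannot normalize an empty distribution")
    return [c / total for c in counts]

# Counts for `workclass` taken from the example above.
counts = [685.0, 72.0, 28.0, 29.0, 59.0, 43.0, 2.0, 0.0]
relative = normalize_counts(counts)
```

After the call, the sum of ``relative`` is 1 (up to floating-point round-off), which corresponds to ``abs`` being set to 1.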
     139 
     140    .. method:: modus() 
     141 
     142        Return the most common value. If there are multiple such values, one is 
     143        chosen at random, although the chosen value will always be the same for 
     144        the same distribution. 
     145 
     146    .. method:: random() 
     147 
     148        Return a random value based on the stored empirical probability 
     149        distribution. For continuous distributions, this will always be one of 
     150        the values which actually appeared (e.g. one of the values from 
     151        :obj:`keys`). 
     152 
     153        The method uses :obj:`random_generator`. If none has been constructed or 
     154        assigned yet, a new one is constructed and stored for further use. 
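
Drawing a value from an empirical discrete distribution can be sketched with the standard ``random`` module. This is a minimal illustration under the assumption of simple cumulative-sum sampling; ``sample_index`` is a hypothetical helper, not part of Orange, and the library may sample differently.

```python
import random

def sample_index(frequencies, rng):
    """Draw an index with probability proportional to its frequency."""
    total = float(sum(frequencies))
    threshold = rng.random() * total
    cumulative = 0.0
    for index, freq in enumerate(frequencies):
        cumulative += freq
        if threshold < cumulative:
            return index
    return len(frequencies) - 1  # guard against floating-point round-off

# One generator is created once and reused, mirroring a stored generator.
rng = random.Random(0)
draws = [sample_index([0.5, 0.3, 0.2], rng) for _ in range(20)]
```

Reusing a single seeded generator makes the sequence of draws reproducible, which is the reason for storing the generator rather than constructing a new one on each call.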
     155 
     156 
     157.. class:: Discrete 
     158 
     159    Stores a discrete distribution of values. The class differs from its parent 
     160    class in having a few additional constructors. 
     161 
     162    .. method:: __init__(variable) 
     163 
     164        Construct an instance of :obj:`Discrete` and set the variable 
     165        attribute. 
     166 
     167        :param variable: A discrete variable 
     168        :type variable: Orange.feature.Discrete 
     169 
     170    .. method:: __init__(frequencies) 
     171 
     172        Construct an instance and initialize the frequencies from the list, but 
     173        leave `Distribution.variable` empty. 
     174 
     175        :param frequencies: A list of frequencies 
     176        :type frequencies: list 
     177 
     178        A distribution constructed in this way can be used, for instance, to
     179        generate random numbers from a given discrete distribution:: 
     180 
     181            disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2]) 
     182            for i in range(20): 
     183                print disc.random(), 
     184 
     185        This prints out approximately ten 0's, six 1's and four 2's. The values
     186        can be named by assigning a variable:: 
     187 
     188            v = Orange.feature.Discrete(values=["red", "green", "blue"])
     189            disc.variable = v 
     190 
     191    .. method:: __init__(distribution) 
     192 
     193        Copy constructor; makes a shallow copy of the given distribution.
     194 
     195        :param distribution: An existing discrete distribution 
     196        :type distribution: Discrete 
     197 
     198 
     199.. class:: Continuous 
     200 
     201    Stores a continuous distribution, that is, a dictionary-like structure with 
     202    values and their frequencies. 
     203 
     204    .. method:: __init__(variable) 
     205 
     206        Construct an instance of :obj:`Continuous` and set the variable
     207        attribute. 
     208 
     209        :param variable: A continuous variable 
     210        :type variable: Orange.feature.Continuous 
     211 
     212    .. method:: __init__(frequencies) 
     213 
     214        Construct an instance of :obj:`Continuous` and initialize it from 
     215        the given dictionary with frequencies, whose keys and values must be integers. 
     216 
     217        :param frequencies: Values and their corresponding frequencies 
     218        :type frequencies: dict 
     219 
     220    .. method:: __init__(distribution) 
     221 
     222        Copy constructor; makes a shallow copy of the given distribution.
     223 
     224        :param distribution: An existing continuous distribution 
     225        :type distribution: Continuous 
     226 
     227    .. method:: average() 
     228 
     229        Return the average value. Note that the average can also be 
     230        computed using simpler and faster classes from module
     231        :obj:`Orange.statistics.basic`. 
     232 
     233    .. method:: var() 
     234 
     235        Return the variance of distribution. 
     236 
     237    .. method:: dev() 
     238 
     239        Return the standard deviation. 
     240 
     241    .. method:: error() 
     242 
     243        Return the standard error. 
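
The four statistics above can be illustrated on a plain dictionary of values and frequencies. This is a sketch under stated assumptions: ``summary`` is a hypothetical helper, and it divides by the number of cases (population variance); the source does not say whether Orange divides by ``n`` or ``n - 1``.

```python
import math

def summary(freqs):
    """Weighted mean, variance, standard deviation and standard error."""
    n = float(sum(freqs.values()))
    mean = sum(v * f for v, f in freqs.items()) / n
    # Population variance (divides by n); a sample variance would use n - 1.
    var = sum(f * (v - mean) ** 2 for v, f in freqs.items()) / n
    dev = math.sqrt(var)
    return mean, var, dev, dev / math.sqrt(n)

# Value 1.0 seen once, 2.0 twice, 3.0 once.
mean, var, dev, err = summary({1.0: 1, 2.0: 2, 3.0: 1})
```

Here the mean is 2.0 and the variance 0.5; the standard error is the standard deviation divided by the square root of the number of cases.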
     244 
     245    .. method:: percentile(p) 
     246 
     247        Return the value at the `p`-th percentile. 
     248 
     249        :param p: The percentile, must be between 0 and 100 
     250        :type p: float 
     251        :rtype: float 
     252 
     253        For example, if `d_age` is a continuous distribution, the quartiles can 
     254        be printed by :: 
     255 
     256            print "Quartiles: %5.3f - %5.3f - %5.3f" % (
     257                d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))
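
One way to read a percentile off a table of values and frequencies is sketched below in plain Python. It assumes the simplest definition, the smallest value whose cumulative frequency reaches ``p`` percent of the total; ``percentile_from_freqs`` is a hypothetical helper, and Orange may define or interpolate percentiles differently.

```python
def percentile_from_freqs(freqs, p):
    """Smallest value whose cumulative frequency reaches p percent."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    total = float(sum(freqs.values()))
    target = total * p / 100.0
    cumulative = 0.0
    for value in sorted(freqs):
        cumulative += freqs[value]
        if cumulative >= target:
            return value
    return max(freqs)  # guard against floating-point round-off

# Hypothetical ages with their counts (10 cases in total).
ages = {20.0: 2, 30.0: 4, 40.0: 3, 50.0: 1}
quartiles = [percentile_from_freqs(ages, p) for p in (25, 50, 75)]
```

For this made-up data the three quartiles are 30.0, 30.0 and 40.0.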
     258 
     259    .. method:: density(x)
     260 
     261        Return the probability density at `x`. If the value is not in 
     262        :obj:`Distribution.keys`, it is interpolated. 
     263 
     264 
     265.. class:: Gaussian 
     266 
     267    A class imitating :obj:`Continuous` by returning the statistics and 
     268    densities for a Gaussian distribution. The class is meant only as a
     269    convenient substitute in code which expects an instance of
     270    :obj:`Distribution`. For general use, Python module :obj:`random` 
     271    provides a comprehensive set of functions for various random distributions. 
     272 
     273    .. attribute:: mean 
     274 
     275        The mean value parameter of the Gaussian distribution.
     276 
     277    .. attribute:: sigma 
     278 
     279        The standard deviation of the distribution.
     280 
     281    .. attribute:: abs 
     282 
     283        The simulated number of instances; in effect, the Gaussian distribution 
     284        density, as returned by method :obj:`density`, is multiplied by
     285        :obj:`abs`. 
     286 
     287    .. method:: __init__([mean=0, sigma=1]) 
     288 
     289        Construct an instance, set :obj:`mean` and :obj:`sigma` to the given 
     290        values and :obj:`abs` to 1. 
     291 
     292    .. method:: __init__(distribution) 
     293 
     294        Construct a distribution which approximates the given distribution, 
     295        which must be either :obj:`Continuous`, in which case its 
     296        average and deviation will be used for mean and sigma, or an existing
     297        :obj:`Gaussian`, which will be copied. Attribute :obj:`abs`
     298        is set to the given distribution's ``abs``. 
     299 
     300    .. method:: average() 
     301 
     302        Return :obj:`mean`. 
     303 
     304    .. method:: dev() 
     305 
     306        Return :obj:`sigma`. 
     307 
     308    .. method:: var() 
     309 
     310        Return square of :obj:`sigma`. 
     311 
     312    .. method:: density(x) 
     313 
     314        Return the density at point ``x``, that is, the Gaussian distribution 
     315        density multiplied by :obj:`abs`. 
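
The scaled density described above is the standard normal density formula multiplied by ``abs``. A minimal sketch in plain Python follows; ``gaussian_density`` and its parameter ``abs_`` are illustrative names, not the Orange API.

```python
import math

def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    """Normal probability density at x, multiplied by abs_."""
    z = (x - mean) / sigma
    return abs_ * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

peak = gaussian_density(0.0)                 # unscaled density at the mean
scaled = gaussian_density(0.0, abs_=100.0)   # same point, simulating 100 instances
```

With ``abs_`` left at 1 the function is an ordinary probability density; a larger value scales every density by the simulated number of instances.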
     316 
     317 
     318Class distributions 
     319=================== 
     320 
     321There is a convenience function for computing empirical class distributions from 
     322data. 
     323 
     324.. function:: getClassDistribution(data[, weightID=0]) 
     325 
     326    Return a class distribution for the given data. 
     327 
     328    :param data: A set of instances. 
     329    :type data: Orange.data.Table 
     330    :param weightID: The id of the meta attribute with instance weights
     331    :type weightID: int 
     332    :rtype: :obj:`Discrete` or :obj:`Continuous`, depending on the class type 
     333 
     334Distributions of all variables 
     335============================== 
     336 
     337Distributions of all variables can be computed and stored in 
     338:obj:`Domain`. The list-like object can be indexed by variable 
     339indices in the domain, as well as by variables and their names. 
     340 
     341.. class:: Domain 
     342 
     343    .. method:: __init__(data[, weightID=0]) 
     344 
     345        Construct an instance with distributions of all discrete and continuous 
     346        variables from the given data. 
     347 
     348        :param data: A set of instances.
     349        :type data: Orange.data.Table
     350        :param weightID: The id of the meta attribute with instance weights
     351        :type weightID: int
     352 
     353The script below computes distributions for all attributes in the data and 
     354prints out distributions for discrete and averages for continuous attributes. :: 
     355 
     356    dist = Orange.statistics.distribution.Domain(data) 
     357 
     358    for d in dist: 
     359        if d.variable.var_type == Orange.feature.Type.Discrete: 
     360             print "%30s: %s" % (d.variable.name, d) 
     361        else: 
     362             print "%30s: avg. %5.3f" % (d.variable.name, d.average()) 
     363 
     364The distribution for, say, attribute `age` can be obtained by its index and also 
     365by its name:: 
     366 
     367    dist_age = dist["age"] 