Changeset 7626:b33c516eedd8 in orange
 Timestamp:
 02/08/11 22:57:04 (3 years ago)
 Branch:
 default
 Convert:
 14b1b52c81eec19a99f157610fdd1dd21d94a69a
 File:

 1 edited

orange/Orange/statistics/distributions.py
r7623 r7626
     standard deviation of a variable. It does not include the median or any
     other statistics that can be computed on the fly, without remembering the
-    data; such statistics can be obtained using :obj:`ContDistribution`. !!!TODO
+    data; such statistics can be obtained using :obj:`ContDistribution`.
 
     Instances of this class are seldom constructed manually; they are more often
…
 .. _distributionsbasicstat.py: code/distributionsbasicstat.py
 
+
+================================
+Distributions of variable values
+================================
+
+Class :obj:`Distribution` and derived classes are used for storing empirical
+distributions of discrete and continuous variables.
+
+.. class:: Distribution
+
+    A base class for storing distributions of variable values. The class can
+    store absolute or relative frequencies. Provides a convenience constructor
+    which constructs instances of derived classes. ::
+
+        >>> import Orange
+        >>> data = Orange.data.Table("adult_sample")
+        >>> disc = Orange.statistics.distributions.Distribution("workclass", data)
+        >>> print disc
+        <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
+        >>> print type(disc)
+        <type 'DiscDistribution'>
+
+    The resulting distribution is of type :obj:`DiscDistribution` since variable
+    `workclass` is discrete. The printed numbers are counts of examples that
+    have a particular attribute value. ::
+
+        >>> workclass = data.domain["workclass"]
+        >>> for i in range(len(workclass.values)):
+        ...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
+                     Private: 685.000
+            Self-emp-not-inc: 72.000
+                Self-emp-inc: 28.000
+                 Federal-gov: 29.000
+                   Local-gov: 59.000
+                   State-gov: 43.000
+                 Without-pay: 2.000
+                Never-worked: 0.000
+
+    Distributions resemble dictionaries, supporting indexing by instances of
+    :obj:`Orange.data.Value`, integers or floats (depending on the distribution
+    type), and symbolic names (if :obj:`variable` is defined).
+
+    For instance, the number of examples with `workclass="Private"` can be
+    obtained in three ways::
+
+        print "Private: ", disc["Private"]
+        print "Private: ", disc[0]
+        print "Private: ", disc[orange.Value(workclass, "Private")]
+
+    Elements cannot be removed from distributions.
+
+    The length of a distribution equals the number of possible values for
+    discrete distributions (if :obj:`variable` is set), the value with the
+    highest index encountered (if the distribution is discrete and
+    :obj:`variable` is :obj:`None`) or the number of different values
+    encountered (for continuous distributions).
+
+    .. attribute:: variable
+
+        Variable to which the distribution applies; may be :obj:`None` if not
+        applicable.
+
+    .. attribute:: unknowns
+
+        The number of instances for which the value of the variable was
+        undefined.
+
+    .. attribute:: abs
+
+        Sum of all elements in the distribution. Usually it equals either
+        :obj:`cases` if the instance stores absolute frequencies or 1 if the
+        stored frequencies are relative, e.g. after calling :obj:`normalize`.
+
+    .. attribute:: cases
+
+        The number of instances from which the distribution is computed,
+        excluding those on which the value was undefined. If instances were
+        weighted, this is the sum of weights.
+
+    .. attribute:: normalized
+
+        :obj:`True` if the distribution is normalized.
+
+    .. attribute:: randomGenerator
+
+        A pseudorandom number generator used for method :obj:`random`.
+
+    .. method:: __init__(variable[, data[, weightId=0]])
+
+        Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`,
+        depending on the variable type. If the variable is the only argument, it
+        must be an instance of :obj:`Orange.data.feature.Feature`. In that case,
+        an empty distribution is constructed. If data is given as well, the
+        variable can also be specified by name or index in the
+        domain. The constructor then computes the distribution of the specified
+        variable on the given data. If instances are weighted, the id of the
+        meta attribute with weights can be passed as the third argument.
+
+        If the variable is given by descriptor, it doesn't need to exist in the
+        domain, but it must be computable from the given instances. For example,
+        the variable can be a discretized version of a variable from data.
+
+    .. method:: keys()
+
+        Return a list of possible values (if the distribution is discrete and
+        :obj:`variable` is set) or a list of encountered values otherwise.
+
+    .. method:: values()
+
+        Return a list of frequencies of the values described above.
+
+    .. method:: items()
+
+        Return a list of pairs of elements of the above lists.
+
+    .. method:: native()
+
+        Return the distribution as a list (for discrete distributions) or as a
+        dictionary (for continuous distributions).
+
+    .. method:: add(value[, weight=1])
+
+        Increase the count of the element corresponding to ``value`` by
+        ``weight``.
+
+        :param value: Value
+        :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions
+        :param weight: Weight to be added to the count for ``value``
+        :type weight: float
+
+    .. method:: normalize()
+
+        Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and
+        :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are
+        unchanged.
+        This changes absolute frequencies into relative ones.
+
+    .. method:: modus()
+
+        Return the most common value. If there are multiple such values, one is
+        chosen at random, although the chosen value will always be the same for
+        the same distribution.
+
+    .. method:: random()
+
+        Return a random value based on the stored empirical probability
+        distribution. For continuous distributions, this will always be one of
+        the values which actually appeared (e.g. one of the values from
+        :obj:`keys`).
+
+        The method uses :obj:`randomGenerator`. If none has been constructed or
+        assigned yet, a new one is constructed and stored for further use.
+
+
+.. class:: DiscDistribution
+
+    Stores a discrete distribution of values. The class differs from its parent
+    class in having a few additional constructors.
+
+    .. method:: __init__(variable)
+
+        Construct an instance of :obj:`DiscDistribution` and set the variable
+        attribute.
+
+        :param variable: A discrete variable
+        :type variable: Orange.data.feature.Discrete
+
+    .. method:: __init__(frequencies)
+
+        Construct an instance and initialize the frequencies from the list, but
+        leave `Distribution.variable` empty.
+
+        :param frequencies: A list of frequencies
+        :type frequencies: list
+
+        A distribution constructed in this way can be used, for instance, to
+        generate random numbers from a given discrete distribution::
+
+            disc = orange.DiscDistribution([0.5, 0.3, 0.2])
+            for i in range(20):
+                print disc.random(),
+
+        This prints out approximately ten 0's, six 1's and four 2's. The values
+        can be named by assigning a variable::
+
+            v = orange.EnumVariable(values = ["red", "green", "blue"])
+            disc.variable = v
+
+    .. method:: __init__(distribution)
+
+        Copy constructor; makes a shallow copy of the given distribution.
+
+        :param distribution: An existing discrete distribution
+        :type distribution: DiscDistribution
+
+
+.. class:: ContDistribution
+
+    Stores a continuous distribution, that is, a dictionary-like structure with
+    values and their frequencies.
+
+    .. method:: __init__(variable)
+
+        Construct an instance of :obj:`ContDistribution` and set the variable
+        attribute.
+
+        :param variable: A continuous variable
+        :type variable: Orange.data.feature.Continuous
+
+    .. method:: __init__(frequencies)
+
+        Construct an instance of :obj:`ContDistribution` and initialize it from
+        the given dictionary with frequencies, whose keys and values must be integers.
+
+        :param frequencies: Values and their corresponding frequencies
+        :type frequencies: dict
+
+    .. method:: __init__(distribution)
+
+        Copy constructor; makes a shallow copy of the given distribution.
+
+        :param distribution: An existing continuous distribution
+        :type distribution: ContDistribution
+
+    .. method:: average()
+
+        Return the average value. Note that the average can also be computed
+        using a simpler and faster class
+        :obj:`Orange.statistics.distributions.BasicStatistics`.
+
+    .. method:: var()
+
+        Return the variance of the distribution.
+
+    .. method:: dev()
+
+        Return the standard deviation.
+
+    .. method:: error()
+
+        Return the standard error.
+
+    .. method:: percentile(p)
+
+        Return the value at the `p`-th percentile.
+
+        :param p: The percentile, must be between 0 and 100
+        :type p: float
+        :rtype: float
+
+        For example, if `d_age` is a continuous distribution, the quartiles can
+        be printed by ::
+
+            print "Quartiles: %5.3f  %5.3f  %5.3f" % (
+                d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))
+
+    .. method:: density(x)
+
+        Return the probability density at `x`. If the value is not in
+        :obj:`Distribution.keys`, it is interpolated.
+
+
+.. class:: GaussianDistribution
+
+    A class imitating :obj:`ContDistribution` by returning the statistics and
+    densities for a Gaussian distribution. The class is meant only as a
+    convenient substitution for code which expects an instance of
+    :obj:`Distribution`. For general use, Python module :obj:`random`
+    provides a comprehensive set of functions for various random distributions.
+
+    .. attribute:: mean
+
+        The mean value parameter of the Gaussian distribution.
+
+    .. attribute:: sigma
+
+        The standard deviation of the distribution.
+
+    .. attribute:: abs
+
+        The simulated number of instances; in effect, the Gaussian distribution
+        density, as returned by method :obj:`density`, is multiplied by
+        :obj:`abs`.
+
+    .. method:: __init__([mean=0, sigma=1])
+
+        Construct an instance, set :obj:`mean` and :obj:`sigma` to the given
+        values and :obj:`abs` to 1.
+
+    .. method:: __init__(distribution)
+
+        Construct a distribution which approximates the given distribution,
+        which must be either a :obj:`ContDistribution`, in which case its
+        average and deviation will be used for mean and sigma, or an existing
+        :obj:`GaussianDistribution`, which will be copied. Attribute :obj:`abs`
+        is set to the given distribution's ``abs``.
+
+    .. method:: average()
+
+        Return :obj:`mean`.
+
+    .. method:: dev()
+
+        Return :obj:`sigma`.
+
+    .. method:: var()
+
+        Return the square of :obj:`sigma`.
+
+    .. method:: density(x)
+
+        Return the density at point ``x``, that is, the Gaussian distribution
+        density multiplied by :obj:`abs`.
+
+
+Class distributions
+===================
+
+There is a convenience function for computing empirical class distributions from
+data.
+
+.. function:: getClassDistribution(data[, weightID=0])
+
+    Return a class distribution for the given data.
+
+    :param data: A set of instances.
+    :type data: Orange.data.Table
+    :param weightID: The id of the meta attribute with instance weights
+    :type weightID: int
+    :rtype: :obj:`DiscDistribution` or :obj:`ContDistribution`, depending on the class type
+
+Distributions of all variables
+==============================
+
+Distributions of all variables can be computed and stored in
+:obj:`DomainDistributions`. The list-like object can be indexed by variable
+indices in the domain, as well as by variables and their names.
+
+.. class:: DomainDistributions
+
+    .. method:: __init__(data[, weightID=0])
+
+        Construct an instance with distributions of all discrete and continuous
+        variables from the given data.
+
+        :param data: A set of instances.
+        :type data: Orange.data.Table
+        :param weightID: The id of the meta attribute with instance weights
+        :type weightID: int
+
+The script below computes distributions for all attributes in the data and
+prints out distributions for discrete and averages for continuous attributes. ::
+
+    dist = orange.DomainDistributions(data)
+
+    for d in dist:
+        if d.variable.varType == orange.VarTypes.Discrete:
+            print "%30s: %s" % (d.variable.name, d)
+        else:
+            print "%30s: avg. %5.3f" % (d.variable.name, d.average())
+
+The distribution for, say, attribute `age` can be obtained by its index and also
+by its name::
+
+    dist_age = dist["age"]
 
 ==================
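
The semantics this changeset documents for discrete distributions (indexing by position or symbolic name, the relation between `abs` and `cases`, and what `normalize()` and `random()` do) can be illustrated with a small stand-alone sketch. It is not Orange code and does not reproduce Orange's implementation: the class name `SketchDiscDistribution`, the `names` argument (standing in for `variable`) and the `seed` argument are assumptions made for illustration only, and the sketch is written for Python 3 rather than the Python 2 used in the docstring examples.

```python
import random
from bisect import bisect_right
from itertools import accumulate


class SketchDiscDistribution:
    """Hypothetical stand-in mimicking the documented behaviour; not Orange's class."""

    def __init__(self, frequencies, names=None, seed=0):
        self.freqs = [float(f) for f in frequencies]
        # `names` plays the role of `variable`: it enables symbolic indexing.
        self.names = list(names) if names is not None else None
        self.cases = sum(self.freqs)   # number (or total weight) of instances
        self.abs = self.cases          # sum of all stored elements
        self.normalized = False
        self._rng = random.Random(seed)  # stands in for `randomGenerator`

    def _index(self, key):
        # Accept an integer position or, if names are known, a symbolic name.
        if isinstance(key, str):
            if self.names is None:
                raise KeyError("symbolic names require `names` to be set")
            return self.names.index(key)
        return key

    def __getitem__(self, key):
        return self.freqs[self._index(key)]

    def __len__(self):
        return len(self.freqs)

    def normalize(self):
        # Divide the counts by their sum; `abs` becomes 1, `cases` is unchanged.
        total = sum(self.freqs)
        if total <= 0:
            raise ValueError("cannot normalize an empty distribution")
        self.freqs = [f / total for f in self.freqs]
        self.abs = 1.0
        self.normalized = True

    def random(self):
        # Draw an index with probability proportional to its stored frequency.
        cumulative = list(accumulate(self.freqs))
        if cumulative[-1] <= 0:
            raise ValueError("cannot sample from an empty distribution")
        x = self._rng.random() * cumulative[-1]
        return min(bisect_right(cumulative, x), len(self.freqs) - 1)


disc = SketchDiscDistribution(
    [685, 72, 28], names=["Private", "Self-emp-not-inc", "Self-emp-inc"])
print(disc["Private"], disc[0])   # same element, by name and by index
disc.normalize()
print(disc.abs, disc.cases)       # abs is now 1.0, cases stays 785.0
print([disc.random() for _ in range(20)])
```

Sampling through cumulative sums works identically on absolute and relative frequencies, which is why `random()` here does not require a prior call to `normalize()`.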