source: orange/orange/doc/reference/distributions.htm @ 6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 13.9 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

<HTML>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>

<index name="distributions">
<H1>Distributions</H1>

<P>Objects derived from <CODE>Distribution</CODE> are used throughout Orange to store various distributions. These often, but not necessarily, describe the distribution of values of a certain attribute on some dataset. You will most often encounter two classes derived from <CODE>Distribution</CODE>: <CODE>DiscDistribution</CODE> stores discrete distributions and <CODE>ContDistribution</CODE> stores continuous ones. To some extent, both resemble dictionaries, with attribute values as keys and the number of examples with a particular value as elements.</P>

<H2>General Distributions</H2>

<P>Class <CODE><index name="classes/Distribution">Distribution</CODE> contains the methods common to the different types of distributions. In addition, its constructor can be used to construct objects of type <CODE>DiscDistribution</CODE> and <CODE>ContDistribution</CODE> (class <CODE>Distribution</CODE> itself is abstract, so no instances of that class can actually exist).</P>

<p class="header"><a href="distributions.py">part of distributions.py</a>
(uses <a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<XMP class="code">>>> import orange
>>> data = orange.ExampleTable("adult_sample")
>>> disc = orange.Distribution("workclass", data)
>>> print disc
<685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
>>> print type(disc)
<type 'DiscDistribution'>
</XMP>

<P>This simple script prints out the distribution of the attribute "workclass" on the dataset "adult_sample". The resulting distribution is of type <CODE>DiscDistribution</CODE> since the attribute is discrete. The printed numbers are the counts of examples that have a particular attribute value.</P>

<p class="header"><a href="distributions.py">part of distributions.py</a>
(uses <a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<XMP class="code">>>> workclass = data.domain["workclass"]
>>> for i in range(len(workclass.values)):
...   print "%20s: %5.3f" % (workclass.values[i], disc[i])
             Private: 685.000
    Self-emp-not-inc: 72.000
        Self-emp-inc: 28.000
         Federal-gov: 29.000
           Local-gov: 59.000
           State-gov: 43.000
         Without-pay: 2.000
        Never-worked: 0.000
</XMP>

<P>Enough introduction. Here are <CODE>Distribution</CODE>'s attributes and methods.</P>

<p class=section>Attributes</p>

<DL class=attributes>
<DT>variable</DT>
<DD>Descriptor of the attribute to which the distribution applies. Can be left empty when not applicable.</DD>


<DT>unknowns</DT>
<DD>Number of examples for which the attribute value was unknown. This field is not changed by normalization (see below).</DD>


<DT>abs</DT>
<DD>Sum of all elements in the distribution.</DD>


<DT>cases</DT>
<DD>(Weighted) number of examples on which the distribution is computed, not including the examples on which the observed attribute had an unknown value. This equals <CODE>abs</CODE> as long as the distribution is not normalized.</DD>


<DT>normalized</DT>
<DD>If true, the distribution is normalized, i.e. the distribution sums to 1.</DD>


<DT>supportsDiscrete, supportsContinuous</DT>
<DD>Tell whether the distribution supports the protocol for working with discrete/continuous values. This is a rather internal detail; still, you can use these flags to check whether the distribution is discrete or continuous.</DD>

<DT>randomGenerator</DT>
<DD>A random generator needed for method <CODE>random()</CODE>.</DD>
</DL>
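
<P>To make the relationship between <CODE>abs</CODE>, <CODE>cases</CODE>, <CODE>unknowns</CODE> and <CODE>normalized</CODE> concrete, here is a minimal pure-Python sketch. It does not use the <CODE>orange</CODE> module and is not Orange's implementation: the class <CODE>ToyDistribution</CODE>, its <CODE>counts</CODE> list, the use of <CODE>None</CODE> for an unknown value and the handling of an empty distribution are all assumptions made for illustration.</P>

```python
class ToyDistribution(object):
    """Toy model of the bookkeeping fields above; not Orange's implementation."""

    def __init__(self, n_values):
        self.counts = [0.0] * n_values   # one slot per discrete value
        self.unknowns = 0.0              # weight of examples with unknown value
        self.abs = 0.0                   # sum of all elements
        self.cases = 0.0                 # weighted number of known examples
        self.normalized = False

    def add(self, index, weight=1.0):
        # Assumption of this sketch: None stands for an unknown value.
        if index is None:
            self.unknowns += weight
            return
        self.counts[index] += weight
        self.abs += weight
        self.cases += weight

    def normalize(self):
        # Assumption of this sketch: an empty distribution is left untouched.
        if self.abs == 0:
            return
        self.counts = [c / self.abs for c in self.counts]
        self.abs = 1.0                   # cases and unknowns stay as they were
        self.normalized = True


toy = ToyDistribution(3)
for value in [0, 0, 1, None, 2, 0]:
    toy.add(value)
toy.normalize()
```

<P>After <CODE>normalize()</CODE>, the elements of the toy object sum to 1 and <CODE>abs</CODE> is 1.0, while <CODE>cases</CODE> still holds the number of examples with a known value and <CODE>unknowns</CODE> the number with an unknown one.</P>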

<p class=section>Methods</p>

<DL class=attributes>
<DT>orange.Distribution(attribute[, examples[, weightID]])</DT>
<DD><P>Constructs either <CODE>DiscDistribution</CODE> or <CODE>ContDistribution</CODE>, depending on the attribute type. If <CODE>attribute</CODE> is the only argument, it must be an attribute descriptor (see <A href="Variable.htm"><CODE>Variable</CODE></A>); in that case, an empty distribution is constructed. If <CODE>examples</CODE> are given as well, the attribute's distribution is computed, as seen in the above example; in that case, <CODE>attribute</CODE> can also be given by name or by its position in the domain. If examples are weighted, the id of the meta-attribute with weights is passed as the third argument (default is 0, no weights).</P>

<P>If the attribute is given by descriptor, it doesn't need to exist in the domain, but it must be computable from the given examples. This way, it is possible to obtain distributions for attributes constructed by constructive induction or for discretized attributes without translating the entire dataset. There is an example of this in the documentation on <A href="Variable.htm#getValueFrom">attribute descriptors</A>.</P>
</DD>


<DT>&lt;standard dictionary operations&gt;</DT>
<DD><P>For getting elements of discrete distributions, indices of type <CODE>Value</CODE>, integers and symbolic names (if <CODE>variable</CODE> is defined) can be used. For continuous distributions, use <CODE>Value</CODE> or a continuous number (e.g. <CODE>cont[3.14]</CODE>).</P>

<P>To get the number of examples with <CODE>workclass</CODE>="Private", you can use any of the three forms below:</P>

<XMP class="code"># by symbolic name (requires disc.variable to be set)
print "Private: ", disc["Private"]
# by index of the value
print "Private: ", disc[0]
# by an orange.Value
print "Private: ", disc[orange.Value(workclass, "Private")]
</XMP>

<P>Elements cannot be removed from distributions.</P>

<P>The length of a distribution equals the number of possible values for discrete distributions (if <CODE>variable</CODE> is set), the value with the highest index encountered (if the distribution is discrete and <CODE>variable</CODE> is not set) or the number of different values encountered (for continuous distributions).</P>
</DD>


<DT>keys(), values(), items()</DT>
<DD>Return a list of values, a list of example counts and a list of (value, frequency) pairs, respectively. For instance, the distribution from the last example in the section "General Distributions" could be printed out by

<XMP class="code">for val, num in disc.items():
    print "%20s: %5.3f" % (val, num)
</XMP>
</dd>


<DT>native()</DT>
<DD>Converts the distribution into a list (for discrete distributions) or a dictionary (for continuous distributions).</dd>


<DT>add(value[, weight])</DT>
<DD>Adds a value to the distribution, as if an example with weight <CODE>weight</CODE> (default is 1.0) had been added. <CODE>value</CODE> can be an <CODE>orange.Value</CODE>, an index (for discrete distributions), a continuous number (for continuous distributions) or a symbolic value, if <CODE>variable</CODE> is set.</DD>


<DT>normalize()</DT>
<DD>Divides all elements of the distribution by their sum (<CODE>abs</CODE>), sets <CODE>normalized</CODE> to <CODE>true</CODE> and <CODE>abs</CODE> to 1.0. Fields <CODE>cases</CODE> and <CODE>unknowns</CODE> are unchanged.</DD>


<DT>modus()</DT>
<DD>Returns the most common value of the attribute. If there is more than one such value, one is chosen at random (but always the same one for a particular distribution). This is explained further on the page about <A href="random.htm">randomness in Orange</A>.</DD>

<DT>random()</DT>
<DD><P>Returns a random value, where the probabilities of values are as given by the distribution. For continuous distributions, the returned value will always be one of the values that occur in the distribution (i.e. one of the values returned by <CODE>keys()</CODE>), not an arbitrary continuous value from the distribution's range.</P>

<P>This method uses the distribution's <CODE>randomGenerator</CODE>. If none has been constructed and/or assigned yet, one is constructed and stored for further use.</P>
</DD>

</DL>
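
<P>The behaviour described for <CODE>modus()</CODE> and <CODE>random()</CODE> can be imitated in plain Python. The sketch below is only an illustration under stated assumptions, not Orange's code: the functions <CODE>modus</CODE> and <CODE>draw</CODE> are hypothetical helpers that operate on a list of counts, and breaking ties with a generator created from a fixed seed is just one possible way to get a choice that is random yet repeatable. How Orange actually seeds its generator is described on the page about randomness.</P>

```python
import random


def modus(counts, seed=0):
    """Index of the largest count; ties are broken reproducibly.

    Assumption of this sketch: a fresh generator with a fixed seed is used,
    so the same list of counts always yields the same answer.
    """
    top = max(counts)
    tied = [i for i, c in enumerate(counts) if c == top]
    return random.Random(seed).choice(tied)


def draw(counts, rng):
    """Draw an index with probability proportional to its count."""
    threshold = rng.random() * sum(counts)
    running = 0.0
    for index, count in enumerate(counts):
        running += count
        if threshold < running:
            return index
    return len(counts) - 1  # guard against floating-point round-off


rng = random.Random(42)                     # plays the role of randomGenerator
samples = [draw([685, 72, 28], rng) for _ in range(1000)]
tie_choice = modus([3, 7, 7])               # either 1 or 2, but always the same
```

<P>Because the counts need not sum to 1, the same <CODE>draw</CODE> function works for both normalized and unnormalized distributions.</P>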

<H2>Discrete distributions</H2>
<index name="classes/DiscDistribution">
<index name="distributions/discrete">

<P>Discrete distributions can be constructed directly.</P>

<P class=section>Methods</P>
<DL class=attributes>
<DT>DiscDistribution(attribute)</DT>
<DD>Stores the attribute descriptor (which must belong to a discrete attribute) in <CODE>variable</CODE> and allocates a list of the appropriate size for the distribution.</DD>

<DT>DiscDistribution(list)</DT>
<DD>This form of the constructor initializes the distribution from a list, but leaves <CODE>variable</CODE> at <CODE>None</CODE>. You can use such a distribution for random number generation.

<XMP class="code">disc = orange.DiscDistribution([0.5, 0.3, 0.2])
for i in range(20):
    print disc.random(),
</XMP>

<P>This will print out approximately ten 0's, six 1's and four 2's. To name the values, you can assign an attribute descriptor.</P>

<XMP class=code>v = orange.EnumVariable(values = ["red", "green", "blue"])
disc.variable = v
</XMP>
</DD>

<DT>DiscDistribution(distribution)</DT>
<DD>A copy constructor, which initializes a new distribution as a copy of an existing one.</DD>

<DT>DiscDistribution()</DT>
<DD>A constructor that creates a distribution and leaves all fields blank, 0 and <CODE>None</CODE>.</DD>
</DL>

<P>Besides those constructors, there are no other specific operations for discrete distributions.</P>
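
<P>The "approximately ten 0's, six 1's and four 2's" in the <CODE>DiscDistribution(list)</CODE> example above are simply the probabilities multiplied by the number of draws. The following sketch reproduces the experiment with Python 3's standard <CODE>random</CODE> module instead of Orange, so the variable names and the seed are assumptions of the sketch and the actual tallies will differ from those of <CODE>disc.random()</CODE>.</P>

```python
import random
from collections import Counter

probabilities = [0.5, 0.3, 0.2]
names = ["red", "green", "blue"]        # plays the role of the EnumVariable values
n_draws = 20

# Expected number of times each index is drawn: probability * number of draws.
expected = [p * n_draws for p in probabilities]

rng = random.Random(0)                  # fixed seed, only for repeatability
draws = rng.choices(range(len(probabilities)), weights=probabilities, k=n_draws)

tally = Counter(draws)                              # counts keyed by index
named_tally = Counter(names[i] for i in draws)      # counts keyed by value name
```

<P>With only 20 draws the observed tallies can deviate noticeably from the expected ones; the agreement improves as the number of draws grows.</P>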


<H2>Continuous distributions</H2>
<index name="distributions/continuous">
<index name="classes/ContDistribution">

<P>Continuous distribution (<CODE><INDEX>ContDistribution</INDEX></CODE>) offers constructors similar to those of discrete distributions, except that instead of a list it expects a dictionary, such as the one returned by <CODE>native</CODE>. It also has some specific methods.</P>

<P class=section>Methods</P>
<DL class=attributes>

<DT>ContDistribution(attribute)</DT>
<DD>Stores the attribute descriptor (which must belong to a continuous attribute) in <CODE>variable</CODE>.</dd>

<DT>ContDistribution(dictionary)</DT>
<DD>Initializes the distribution with the values from the dictionary. All keys and values must be numbers.</DD>

<DT>ContDistribution(distribution)</DT>
<DD>A copy constructor that initializes the distribution as a copy of an existing distribution.</DD>

<DT>ContDistribution()</DT>
<DD>Constructor that leaves everything blank, 0 and <CODE>None</CODE>.</DD>

<DT>average()</DT>
<DD>Returns the average value.</DD>

<DT>var(), dev(), error()</DT>
<DD>Return the variance, deviation and standard error of the distribution, respectively.</DD>

<DT>percentile(p)</DT>
<DD>Returns the p-th percentile of the distribution, i.e. a value x such that <CODE>p</CODE> percent of the attribute's values are smaller than x. <CODE>p</CODE> must be a value between 0 and 100. For instance, if <CODE>dage</CODE> is a continuous distribution, quartiles can be printed by

<XMP class=code>print "Quartiles: %5.3f - %5.3f - %5.3f" % \
  (dage.percentile(25), dage.percentile(50), dage.percentile(75))
</XMP>
</DD>

<DT>density(x)</DT>
<DD>Returns the probability density at <CODE>x</CODE>. If the value is not present, it is interpolated.</dd>
</DL>
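
<P>Since a continuous distribution behaves like a dictionary of value/count pairs, its summary statistics can be sketched directly on such a dictionary. The functions below are hypothetical helpers written for this page, not Orange's code, and they make two assumptions that the text above does not settle: the variance divides by the total weight (population form), and the percentile uses the simple nearest-rank rule rather than any interpolation Orange may perform. The sample dictionary <CODE>ages</CODE> is made up.</P>

```python
import math


def average(dist):
    """Weighted mean of a {value: count} dictionary."""
    total = sum(dist.values())
    return sum(value * count for value, count in dist.items()) / total


def variance(dist):
    """Weighted variance; dividing by the total weight is an assumption."""
    mean = average(dist)
    total = sum(dist.values())
    return sum(count * (value - mean) ** 2 for value, count in dist.items()) / total


def percentile(dist, p):
    """Nearest-rank percentile: smallest value whose cumulative weight
    reaches p percent of the total. Orange may interpolate instead."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    target = sum(dist.values()) * p / 100.0
    running = 0.0
    for value in sorted(dist):
        running += dist[value]
        if running >= target:
            return value
    return max(dist)  # guard against floating-point round-off


ages = {20.0: 2, 30.0: 5, 40.0: 2, 50.0: 1}   # made-up data: value -> count
mean_age = average(ages)
deviation = math.sqrt(variance(ages))
quartiles = [percentile(ages, p) for p in (25, 50, 75)]
```

<P>As with <CODE>random()</CODE>, this nearest-rank version only ever returns values that actually occur in the dictionary.</P>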

<H2>Gaussian distribution</H2>
<index name="distribution/Gaussian">
<index name="classes/GaussianDistribution">

<P>Represents a Gaussian distribution.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>mean, sigma</DT>
<DD>Parameters of the distribution.</DD>
<DT>abs</DT>
<DD>For discrete and continuous distributions, this field represents the number of "examples". For a Gaussian distribution, it is the integral under the density function; in effect, the normal Gaussian density function is multiplied by <CODE>abs</CODE>.</DD>
</DL>

<P class=section>Methods</P>
<DL class=attributes>
<DT>GaussianDistribution([mean, sigma])</DT>
<DD>Constructs a Gaussian distribution. The default <CODE>mean</CODE> and <CODE>sigma</CODE> are 0.0 and 1.0 (normalized distribution), and <CODE>abs</CODE> is set to 1.0.</DD>

<DT>GaussianDistribution(distribution)</DT>
<DD>Constructs a Gaussian distribution by approximating another distribution. The given distribution must support the continuous protocol (i.e. it must be able to provide its average and deviation). In other words, <CODE>distribution</CODE> must be either a <CODE>ContDistribution</CODE>, whose average and deviation become <CODE>mean</CODE> and <CODE>sigma</CODE> of the new distribution, or a <CODE>GaussianDistribution</CODE>, which is simply copied. <CODE>abs</CODE> is set to <CODE>distribution.abs</CODE>.</DD>

<DT>average()</DT>
<DD>Returns <CODE>mean</CODE>.</DD>

<DT>dev(), error()</DT>
<DD>Return <CODE>sigma</CODE>.</DD>

<DT>var()</DT>
<DD>Returns <CODE>sigma</CODE><SUP>2</SUP>.</DD>

<DT>density(x)</DT>
<DD>Returns the density at point <CODE>x</CODE> (the Gaussian function multiplied by <CODE>abs</CODE>).</DD>
</DL>
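
<P>The statement that the Gaussian density is multiplied by <CODE>abs</CODE>, so that <CODE>abs</CODE> is the integral under the curve, can be checked numerically. The function <CODE>gaussian_density</CODE> below is a stand-alone sketch of the standard formula, with parameter names chosen for this example (<CODE>abs_</CODE> avoids shadowing Python's built-in <CODE>abs</CODE>); it is not taken from Orange's source.</P>

```python
import math


def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    """Normal density at x, scaled by abs_ (the area under the curve)."""
    z = (x - mean) / sigma
    return abs_ * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


peak = gaussian_density(0.0)                 # height of the standard normal at its mean
scaled_peak = gaussian_density(0.0, abs_=10.0)

# Riemann sum over [-8, 8] with step 0.01; it should come out close to abs_.
step = 0.01
area = sum(gaussian_density(-8.0 + step * i, abs_=10.0) for i in range(1601)) * step
```

<P>Here <CODE>scaled_peak</CODE> is ten times <CODE>peak</CODE>, and <CODE>area</CODE> is approximately 10, the value passed as <CODE>abs_</CODE>.</P>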


<H2>Computing class distributions</H2>
<index name="distribution/computation from examples">

<P>Class distributions can be computed by calling <CODE>orange.Distribution(data.domain.classVar, data, weightID)</CODE> (<CODE>weightID</CODE> can be left out if examples are not weighted). Since this is a frequent operation, a shortcut is provided.</p>

<P><B><CODE>orange.getClassDistribution(examples[, weightID])</CODE></B> computes the distribution of class values for the given data set. The result is of type <CODE>DiscDistribution</CODE> or <CODE>ContDistribution</CODE>.</P>
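
<P>Conceptually, a class distribution for a discrete class is a (weighted) count of class values. The sketch below shows that idea in plain Python under clearly artificial assumptions: examples are tuples whose last element is the class, weights come as a parallel list rather than a meta-attribute, and both the helper <CODE>class_distribution</CODE> and the three sample rows are made up for illustration.</P>

```python
from collections import OrderedDict


def class_distribution(examples, class_values, weights=None):
    """Weighted count of class values; the class is the last tuple element."""
    counts = OrderedDict((value, 0.0) for value in class_values)
    if weights is None:
        weights = [1.0] * len(examples)     # unweighted: every example counts once
    for example, weight in zip(examples, weights):
        counts[example[-1]] += weight
    return counts


# Made-up rows: (workclass, age, class)
examples = [
    ("Private", 39, "<=50K"),
    ("State-gov", 50, ">50K"),
    ("Private", 28, "<=50K"),
]
class_values = ["<=50K", ">50K"]

unweighted = class_distribution(examples, class_values)
weighted = class_distribution(examples, class_values, weights=[1.0, 2.5, 0.5])
```

<P>Declaring the class values up front keeps values with no examples in the result with a count of 0, just as "Never-worked" appears with 0.000 in the workclass listing earlier on this page.</P>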


<H2>Computing distributions for all attributes</H2>

<P>Orange can compute distributions for all attributes in a single iteration over the examples and store them in an object of type <CODE><B>DomainDistributions</B></CODE>. Its constructor accepts examples and, optionally, the ID of a meta-attribute with weights. The resulting object is list-like, except that not only integers but also attribute descriptors and names can be used for indexing.</P>

<P>The script below computes distributions for all attributes in the data and prints out the distributions for discrete attributes and the averages for continuous attributes.</P>

<p class="header"><a href="distributions.py">part of distributions.py</a>
(uses <a href="../datasets/adult_sample.tab">adult_sample.tab</a>)</p>
<XMP class="code">dist = orange.DomainDistributions(data)

for d in dist:
    # d.variable tells which attribute each distribution belongs to
    if d.variable.varType == orange.VarTypes.Discrete:
        print "%30s: %s" % (d.variable.name, d)
    else:
        print "%30s: avg. %5.3f" % (d.variable.name, d.average())
</XMP>

<P>To get the distribution for, say, attribute "age", you can use its index in the domain, its name or its descriptor.</P>

<XMP class="code">dist_age = dist["age"]
</XMP>
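
<P>The idea of collecting every attribute's distribution in one pass, and of a list-like result that can also be indexed by name, can be sketched without Orange as follows. Everything here is hypothetical and only meant to illustrate the description above: the class <CODE>ToyDomainDistributions</CODE>, the use of <CODE>None</CODE> for unknown values and the sample rows are assumptions, and indexing by descriptor is left out.</P>

```python
from collections import defaultdict


class ToyDomainDistributions(object):
    """List-like container of per-attribute counts, indexable by position or name."""

    def __init__(self, attribute_names, examples):
        self.names = list(attribute_names)
        self.dists = [defaultdict(float) for _ in self.names]
        for example in examples:                 # a single iteration over the data
            for dist, value in zip(self.dists, example):
                if value is not None:            # None marks an unknown value here
                    dist[value] += 1.0

    def __len__(self):
        return len(self.dists)

    def __iter__(self):
        return iter(self.dists)

    def __getitem__(self, key):
        if isinstance(key, str):                 # attribute name -> position
            key = self.names.index(key)
        return self.dists[key]


# Made-up rows: (workclass, age)
rows = [("Private", 39.0), ("State-gov", 50.0), ("Private", None)]
all_dists = ToyDomainDistributions(["workclass", "age"], rows)
dist_age = all_dists["age"]
```

<P>Here <CODE>all_dists["age"]</CODE> and <CODE>all_dists[1]</CODE> refer to the same object, mirroring the name and index access described above.</P>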

</BODY>
</HTML>