source: orange/Orange/doc/reference/MeasureAttribute.htm @ 9671:a7b056375472

<HTML>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>

<H1>Relevance of Attributes</H1>
<index name="attribute scoring">

<P>There are a number of different measures for assessing the
relevance of attributes with respect to how much information they
contain about the corresponding class. These procedures are also
known as attribute scoring. Orange implements several methods that
all stem from <CODE>MeasureAttribute</CODE>. Most of the common ones
compute certain statistics on conditional distributions of class
values given the attribute values; in Orange, these are derived from
<CODE>MeasureAttributeFromProbabilities</CODE>.</P>

<HR>

<H2>Base Classes</H2>

<H3>MeasureAttribute</H3>

<P><CODE><INDEX
name="classes/MeasureAttribute">MeasureAttribute</CODE> is the base
class for a wide range of classes that measure the quality of
attributes. The class itself is, naturally, abstract. Its fields
merely describe what kinds of attributes it can handle and what kind
of data it requires.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>handlesDiscrete</DT>
<DD>Tells whether the measure can handle discrete attributes.</DD>

<DT>handlesContinuous</DT>
<DD>Tells whether the measure can handle continuous attributes.</DD>

<dt>computesThresholds</dt>
<dd>Tells whether the measure implements the <code>thresholdFunction</code>.</dd>

<DT>needs</DT>
<DD>Tells what kind of data the measure needs. This can be either <CODE>MeasureAttribute.NeedsGenerator</CODE>, <CODE>MeasureAttribute.NeedsDomainContingency</CODE> or <CODE>MeasureAttribute.NeedsContingency_Class</CODE>. The first needs an example generator (Relief is an example of such a measure), the second can compute the quality from a <a href="contingency.htm"><CODE>DomainContingency</CODE></A>, and the last needs only the contingency (<CODE>ContingencyAttrClass</CODE>), the attribute distribution and the apriori class distribution. Most measures are content with the last.</DD>
</DL>

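<P>For quick orientation, these fields can simply be printed. The sketch below is not part of the original example scripts; it inspects two of the measures described later on this page.</P>

<XMP class=code>import orange

for name, meas in [("information gain", orange.MeasureAttribute_info()),
                   ("ReliefF", orange.MeasureAttribute_relief())]:
    print name
    print "  handles discrete attributes:  ", bool(meas.handlesDiscrete)
    print "  handles continuous attributes:", bool(meas.handlesContinuous)
    print "  computes thresholds:          ", bool(meas.computesThresholds)
    print "  needs an example generator:   ", meas.needs == orange.MeasureAttribute.NeedsGenerator
</XMP>
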
<P>Several (but not all) measures can treat unknown attribute values in different ways, depending on the field <B><CODE>unknownsTreatment</CODE></B> (this field is not defined in <CODE>MeasureAttribute</CODE> but in many derived classes). Undefined values can be
<UL style="margin-top: 0cm">
<LI><B>ignored (<CODE>MeasureAttribute.IgnoreUnknowns</CODE>)</B>; this has the same effect as if the examples for which the attribute value is unknown were removed.</LI>

<LI><B>punished (<CODE>MeasureAttribute.ReduceByUnknown</CODE>)</B>; the attribute quality is reduced by the proportion of unknown values. In impurity measures, this can be interpreted as if the impurity is decreased only on examples for which the value is defined and stays the same for the others, and the attribute quality is the average impurity decrease.</LI>

<LI><B>imputed (<CODE>MeasureAttribute.UnknownsToCommon</CODE>)</B>; here, undefined values are replaced by the most common attribute value. If you want a more clever imputation, you should do it in advance.</LI>

<LI><B>treated as a separate value (<CODE>MeasureAttribute.UnknownsAsValue</CODE>)</B>.</LI>
</UL>

<P>The default treatment is <CODE>ReduceByUnknown</CODE>, which is optimal in most cases and does not make additional presumptions (unlike, for instance, <CODE>UnknownsToCommon</CODE>, which supposes that missing values are not, for instance, results of measurements that were not performed because of information extracted from the other attributes). Use other treatments if you know that they make better sense on your data.</P>

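<P>Setting the treatment is just a matter of assigning the field on a measure that defines it; a minimal sketch (any of the constants listed above can be used):</P>

<XMP class=code>import orange

# MeasureAttribute_info (information gain, described below) defines
# unknownsTreatment; the default would be ReduceByUnknown
meas = orange.MeasureAttribute_info()
meas.unknownsTreatment = orange.MeasureAttribute.IgnoreUnknowns
</XMP>
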
<P>The only method supported by all measures is the call operator, to which we pass the data and get a number representing the quality of the attribute. The number does not have any absolute meaning and can vary widely for different attribute measures. The only common characteristic is that the higher the value, the better the attribute. If the attribute is so bad that its quality cannot be measured, the measure returns <CODE>MeasureAttribute.Rejected</CODE>. None of the measures described here do so.</P>

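<P>When scoring attributes in a loop, a simple comparison guards against rejected attributes. The sketch below uses the lenses data from the examples further down; none of its attributes is actually rejected, so the check is shown only for illustration.</P>

<XMP class=code>import orange
data = orange.ExampleTable("lenses")
meas = orange.MeasureAttribute_info()

for attr in data.domain.attributes:
    score = meas(attr, data)
    if score == orange.MeasureAttribute.Rejected:
        print "%-12s rejected" % attr.name
    else:
        print "%-12s %6.4f" % (attr.name, score)
</XMP>
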
<P>There are different sets of arguments that the call operator can accept. Not all classes will accept all kinds of arguments. Relief, for instance, cannot be computed from contingencies alone. Besides, the attribute and the class need to be of the correct type for a particular measure.</P>

<P class=section>Methods</P>
<DL class=attributes>
<DT>__call__(attribute, examples[, apriori class distribution][, weightID])<br>
__call__(attribute, domain contingency[, apriori class distribution])<br>
__call__(contingency, class distribution[, apriori class distribution])</DT>

<DD><P>There are three call operators just to make your life simpler and faster. When working with the data, your method might have already computed, for instance, a contingency matrix. If so and if the quality measure you use is OK with that (as most measures are), you can pass the contingency matrix and the measure will compute much faster. If, on the other hand, you only have examples and haven't computed any statistics on them, you can pass the examples (and, optionally, an id for a meta-attribute with weights) and the measure will compute the contingency itself, if needed.</P>

<P>Argument <CODE>attribute</CODE> gives the attribute whose quality is to be assessed. This can be either a descriptor, an index into the domain or a name. In the first form, if the attribute is given by descriptor, it doesn't need to be in the domain. It needs to be computable from the attributes in the domain, though.</P>

<P>The data is given either as <CODE>examples</CODE> (and, optionally, an id for a meta-attribute with weights), a <CODE>domain contingency</CODE> (a list of contingencies) or a <CODE>contingency</CODE> matrix and a <CODE>class distribution</CODE>. If you use the latter form, what you should give as the class distribution depends upon what you do with unknown values (if there are any). If <CODE>unknownsTreatment</CODE> is <CODE>IgnoreUnknowns</CODE>, the class distribution should be computed on examples for which the attribute value is defined. Otherwise, the class distribution should be the overall class distribution.</P>

<P>The optional argument with the <CODE>apriori class distribution</CODE> is most often ignored. It comes handy if the measure makes any probability estimates based on apriori class probabilities (such as the m-estimate).</P>
</DD>

<dt>thresholdFunction(attribute, examples[, weightID])</dt>
<dd>This function computes the qualities for different binarizations of the continuous attribute <code>attribute</code>. The attribute should of course be continuous. The result of the function is a list of tuples, where the first element is a threshold (thresholds lie in the middle between adjacent attribute values appearing in the data), the second is the measured quality of the corresponding binary attribute and the last is the distribution giving the number of examples below and above the threshold. The last element, though, may be missing; generally, if the particular measure can get the distribution without any computational burden, it will do so and the caller can use it. If not, the caller needs to compute it itself.</dd>
</DL>

<P>The script below shows different ways to assess the quality of astigmatic, tear rate and the first attribute (whichever it is) in the dataset lenses.</P>

<p class="header"><a href="MeasureAttribute1.py">part of measureattribute1.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<XMP class=code>import orange, random
data = orange.ExampleTable("lenses")

meas = orange.MeasureAttribute_info()

astigm = data.domain["astigmatic"]
print "Information gain of 'astigmatic': %6.4f" % meas(astigm, data)

classdistr = orange.Distribution(data.domain.classVar, data)
cont = orange.ContingencyAttrClass("tear_rate", data)
print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr)

dcont = orange.DomainContingency(data)
print "Information gain of the first attribute: %6.4f" % meas(0, dcont)
print
</XMP>

<P>As with many other classes in Orange, you can construct the object and use it on the fly. For instance, to measure the quality of the attribute "tear_rate", you could simply write</P>

<XMP class=code>>>> print orange.MeasureAttribute_info("tear_rate", data)
0.548794984818
</XMP>

<P>You shouldn't use this shortcut with ReliefF, though; see the explanation in the section on ReliefF.</P>

<P>It is also possible to assess the quality of attributes that do not exist in the dataset. For instance, you can assess the quality of discretized attributes without constructing a new domain and dataset that would include them.</P>

<p class="header"><a href="MeasureAttribute1a.py">measureattribute1a.py</a>
(uses <a href="iris.tab">iris.tab</a>)</p>
<XMP class=code>import orange, random
data = orange.ExampleTable("iris")

d1 = orange.EntropyDiscretization("petal length", data)
print orange.MeasureAttribute_info(d1, data)
</XMP>

<P>The quality of the new attribute <CODE>d1</CODE> is assessed on <CODE>data</CODE>, which does not include the new attribute at all. (Note that ReliefF won't do that since it would be too slow. ReliefF requires the attribute to be present in the dataset.)</p>

<P>Finally, you can compute the quality of meta-attributes. The following script adds a meta-attribute to an example table, initializes it to random values and measures its information gain.</P>

<p class="header"><a href="MeasureAttribute1.py">part of measureattribute1.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<XMP class=code>mid = orange.newmetaid()
data.domain.addmeta(mid, orange.EnumVariable(values = ["v0", "v1"]))
data.addMetaAttribute(mid)

rg = random.Random()
rg.seed(0)
for ex in data:
    ex[mid] = orange.Value(rg.randint(0, 1))

print "Information gain for a random meta attribute: %6.4f" % \
  orange.MeasureAttribute_info(mid, data)
</XMP>


<P>To show the computation of thresholds, we shall use the Iris data set.</P>

<p class="header"><a href="MeasureAttribute1a.py">measureattribute1a.py</a>
(uses <a href="iris.tab">iris.tab</a>)</p>
<xmp class="code">import orange
data = orange.ExampleTable("iris")

meas = orange.MeasureAttribute_relief()
for t in meas.thresholdFunction("petal length", data):
    print "%5.3f: %5.3f" % t
</xmp>

<P>If we hadn't constructed the attribute in advance, we could write <code>orange.MeasureAttribute_relief().thresholdFunction("petal length", data)</code>. This is not recommended for ReliefF, since it may be a lot slower.</P>

<P>The script below finds and prints out the best threshold for binarization of an attribute, that is, the threshold at which the resulting binary attribute has the optimal ReliefF score (or whichever measure is used).</P>
<xmp class="code">thresh, score, distr = meas.bestThreshold("petal length", data)
print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)</xmp>

<H3>MeasureAttributeFromProbabilities</H3>

<P><CODE><INDEX name="classes/MeasureAttributeFromProbabilities">MeasureAttributeFromProbabilities</CODE> is the abstract base class for attribute quality measures that can be computed from contingency matrices only. It relieves the derived classes from having to compute the contingency matrix by defining the first two forms of the call operator. (Well, that's not something you need to know if you only work in Python.) An additional feature of this class is that you can set probability estimators. If none are given, probabilities and conditional probabilities of classes are estimated by relative frequencies.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>unknownsTreatment</DT>
<DD>Defines what to do with unknown values. See the possibilities described above.</DD>

<DT>estimatorConstructor, conditionalEstimatorConstructor</DT>
<DD>The classes that are used to estimate unconditional and conditional probabilities of classes, respectively. You can set this to, for instance, <CODE>ProbabilityEstimatorConstructor_m</CODE> and <CODE>ConditionalProbabilityEstimatorConstructor_ByRows</CODE> (with estimator constructor again set to <CODE>ProbabilityEstimatorConstructor_m</CODE>), respectively.</DD>
</DL>


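<P>For instance, to use m-estimates instead of relative frequencies, the constructors can be set as in the following sketch (the value of m is arbitrary; this is not one of the linked example scripts):</P>

<XMP class=code>import orange
data = orange.ExampleTable("lenses")

meas = orange.MeasureAttribute_info()
meas.estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m = 2.0)
meas.conditionalEstimatorConstructor = \
    orange.ConditionalProbabilityEstimatorConstructor_ByRows(
        estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m = 2.0))

print "%6.4f" % meas("astigmatic", data)
</XMP>
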
<HR>

<H2>Measures for Classification Problems</H2>

<P>The following section describes the attribute quality measures suitable for discrete attributes and outcomes. See <A href="MeasureAttribute1.py">MeasureAttribute1.py</A>, <A href="MeasureAttribute1a.py">MeasureAttribute1a.py</A>, <A href="MeasureAttribute1b.py">MeasureAttribute1b.py</A>, <A href="MeasureAttribute2.py">MeasureAttribute2.py</A> and <A href="MeasureAttribute3.py">MeasureAttribute3.py</A> for more examples on their use.</P>

<H3>Information Gain</H3>
<index name="attribute scoring+information gain">

<P>The most popular measure, information gain (<CODE><INDEX name="classes/MeasureAttribute_info">MeasureAttribute_info</CODE>), measures the expected decrease of the entropy.</P>

<H3>Gain Ratio</H3>
<index name="attribute scoring+gain ratio">

<P>Gain ratio (<CODE><INDEX name="classes/MeasureAttribute_gainRatio">MeasureAttribute_gainRatio</CODE>) was introduced by Quinlan in order to avoid overestimation of multi-valued attributes. It is computed as the information gain divided by the entropy of the attribute's value distribution. (It has been shown, however, that such a measure still overestimates attributes with multiple values.)</P>

<H3>Gini index</H3>
<index name="attribute scoring+gini index">

<P>Gini index (<CODE>MeasureAttribute_gini</CODE>) was first introduced by Breiman and can be interpreted as the probability that two randomly chosen examples will have different classes.</P>

<H3>Relevance</H3>
<index name="attribute scoring+relevance">

<P>Relevance of attributes (<CODE><INDEX
name="classes/MeasureAttribute_relevance">MeasureAttribute_relevance</CODE>)
is a measure that discriminates between attributes on the basis of
their potential value in the formation of decision rules.</P>

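<P>Since all of these measures share the same interface, comparing them only requires constructing several instances. The sketch below is not one of the linked example scripts; it scores every attribute of the lenses data with information gain, gain ratio, gini index and relevance.</P>

<XMP class=code>import orange
data = orange.ExampleTable("lenses")

measures = [("information gain", orange.MeasureAttribute_info()),
            ("gain ratio", orange.MeasureAttribute_gainRatio()),
            ("gini index", orange.MeasureAttribute_gini()),
            ("relevance", orange.MeasureAttribute_relevance())]

for attr in data.domain.attributes:
    print attr.name
    for name, meas in measures:
        print "    %-16s %6.4f" % (name, meas(attr, data))
</XMP>
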
<H3>Costs</H3>

<P><CODE><INDEX name="classes/MeasureAttribute_cost">MeasureAttribute_cost</CODE> evaluates attributes based on the "saving" achieved by knowing the value of the attribute, according to the specified cost matrix.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>cost</DT>
<DD>Cost matrix (see the page about <A href="CostMatrix.htm">cost matrices</A> for details)</DD>
</DL>

<P>If the cost of predicting the first class for an example that is actually in the second is 5, and the cost of the opposite error is 1, then an appropriate measure can be constructed and used for attribute 3 as follows.</P>

<XMP class=code>>>> meas = orange.MeasureAttribute_cost()
>>> meas.cost = ((0, 5), (1, 0))
>>> meas(3, data)
0.083333350718021393
</XMP>

<P>This tells us that knowing the value of attribute 3 would decrease the classification cost by approximately 0.083 per example.</P>


<H3>ReliefF</H3>
<index name="attribute scoring+ReliefF">

<P>ReliefF (<CODE><INDEX name="classes/MeasureAttribute_relief">MeasureAttribute_relief</CODE>) was first developed by Kira and Rendell and then substantially generalized and improved by Kononenko. It measures the usefulness of attributes based on their ability to distinguish between very similar examples belonging to different classes.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>k</DT>
<DD>Number of neighbours for each example. Default is 5.</DD>

<DT>m</DT>
<DD>Number of reference examples. Default is 100. Set to -1 to take all the examples.</DD>

<DT>checkCachedData</DT>
<DD>A flag best left alone unless you know what you are doing; see the explanation below.</DD>
</DL>

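<P>Both parameters can be set through the constructor, as in this sketch (the values are arbitrary):</P>

<XMP class=code>import orange
data = orange.ExampleTable("iris")

meas = orange.MeasureAttribute_relief(k = 20, m = 50)
print "%6.4f" % meas("petal length", data)
</XMP>
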
<P>Computation of ReliefF is rather slow since it needs to find the <CODE>k</CODE> nearest neighbours for each of <CODE>m</CODE> reference examples (or all examples, if <code>m</code> is set to -1). Since we normally compute ReliefF for all attributes in the dataset, <CODE>MeasureAttribute_relief</CODE> caches the results. When it is called to compute the quality of a certain attribute, it computes the qualities for all attributes in the dataset. When called again, it uses the stored results if the domain is still the same and the example table has not changed. Checking is done by comparing the data table version (see <A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A> for details) and then computing a checksum of the data and comparing it with the previous checksum. The latter can take some time on large tables, so you may want to disable it by setting <code>checkCachedData</code> to <code>False</code>. In most cases this will do no harm, except when the data is changed in such a way that it passes unnoticed by the 'version' control, in which case the computed ReliefFs can be false. Hence: disable it if you know that the data does not change or if you know what kind of changes are detected by the version control.</P>

<P>Caching will only have an effect if you use the same instance for all attributes in the domain. So, don't do this:</P>

<XMP class=code>for attr in data.domain.attributes:
    print orange.MeasureAttribute_relief(attr, data)
</XMP>

<P>In this script, the cached data dies together with the instance of <CODE>MeasureAttribute_relief</CODE>, which is constructed and destroyed for each attribute separately. It is much faster to do it like this:</P>

<XMP class=code>meas = orange.MeasureAttribute_relief()
for attr in data.domain.attributes:
    print meas(attr, data)
</XMP>

<P>When called for the first time, <CODE>meas</CODE> will compute ReliefF for all attributes and the subsequent calls simply return the stored data.</P>

<P>Class <CODE>MeasureAttribute_relief</CODE> works on discrete and continuous classes and thus implements functionality of algorithms ReliefF and RReliefF.</P>

<P>Note that ReliefF can also compute the threshold function, that is, the attribute quality at different thresholds for binarization.</P>

<p>Finally, here is an example which shows what can happen if you disable the computation of checksums.</p>

<xmp class="code">data = orange.ExampleTable("iris")
r1 = orange.MeasureAttribute_relief()
r2 = orange.MeasureAttribute_relief(checkCachedData = False)

print "%.3f\t%.3f" % (r1(0, data), r2(0, data))
for ex in data:
    ex[0] = 0
print "%.3f\t%.3f" % (r1(0, data), r2(0, data))
</xmp>

<p>The first print statement prints the same number, 0.321, twice. Then we set the first attribute to 0 in all examples. <code>r1</code> notices the change and returns -1 as its ReliefF, while <code>r2</code> does not and returns the same number, 0.321, which is now wrong.</p>


<H2>Measures for Regression Problems</H2>

<P>Except for ReliefF, the only attribute quality measure available for regression problems is based on the mean square error.</P>

<H3>Mean Square Error</H3>
<index name="attribute scoring/mean square error">
<index name="MSE of an attribute">

<P>The mean square error measure is implemented in class <CODE><INDEX name="classes/MeasureAttribute_MSE">MeasureAttribute_MSE</CODE>.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>unknownsTreatment</DT>
<DD>Tells what to do with unknown attribute values. See the description at the top of this page.</DD>

<DT>m</DT>
<DD>Parameter for m-estimate of error. Default is 0 (no m-estimate).</DD>
</DL>
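
<P>Neither RReliefF nor the mean square error measure appears in the scripts above, so here is a minimal sketch. It assumes a regression dataset with discrete attributes and a continuous class, such as servo.tab from the Orange datasets; with other data the scores will of course differ.</P>

<XMP class=code>import orange
data = orange.ExampleTable("servo")

mse = orange.MeasureAttribute_MSE()
relief = orange.MeasureAttribute_relief()

for attr in data.domain.attributes:
    print "%-10s MSE: %6.4f   ReliefF: %6.4f" % \
        (attr.name, mse(attr, data), relief(attr, data))
</XMP>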