source: orange/Orange/doc/reference/filters.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 14.6 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

<HTML>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>

<index name="filtering examples">
<H1>Filters</H1>

<P>Filters are objects used for selecting examples. They are related to <A href="preprocessing.htm">preprocessors</A> but are more limited: they can accept or reject examples, but cannot modify them. A further restriction is that a filter sees only individual examples, never the entire dataset. This matters for random selection of examples (see <A href="#randomfilter"><CODE>Filter_random</CODE></A>).</P>

<H2>General behavior</H2>

<P>All filters have two attributes.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT><CODE><B>negate</B></CODE></DT>
<DD>Inverts the filter's decisions.</DD>

<DT><CODE><B>domain</B></CODE></DT>
<DD>Domain to which examples are converted prior to checking (except for <CODE>Filter_random</CODE>, which ignores this field).</DD>
</DL>

<P>Besides the constructor, filters provide the call operator and a method that returns a list denoting which examples match the filter criterion.</P>

<P class="section">Methods</P>
<dl class="attributes">
<dt>__call__(example)</dt>
<dd>Checks whether the example matches the filter's criterion and returns either <code>True</code> or <code>False</code>.</dd>

<dt>__call__(examples)</dt>
<dd>When given an entire example table, returns the examples that match the criterion (as an <code>ExampleTable</code>).</dd>

<dt>selectionVector(examples)</dt>
<dd>Returns a list of bools of the same length as <code>examples</code>, denoting which examples are accepted. Equivalent to <code>[filter(ex) for ex in examples]</code>.</dd>
</dl>

<P>An alternative way to apply a filter to a table of examples is to call <A href="ExampleTable.htm#filter"><CODE>ExampleTable.filter</CODE></A>.</P>

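<P>The calling conventions described above can be illustrated without Orange. The following is a minimal sketch in plain Python, not Orange code: the class <CODE>ThresholdFilter</CODE>, its <CODE>threshold</CODE> attribute and the method name <CODE>selection_vector</CODE> are hypothetical, and plain numbers in a list stand in for examples in an example table.</P>

```python
class ThresholdFilter(object):
    """Hypothetical stand-in for a filter; not part of Orange."""

    def __init__(self, threshold, negate=False):
        self.threshold = threshold
        self.negate = negate

    def _accepts(self, example):
        # Decide on a single example, then invert the decision if negate is set.
        decision = example >= self.threshold
        return decision != self.negate

    def __call__(self, arg):
        # A whole collection yields the accepted examples;
        # a single example yields True or False.
        if isinstance(arg, (list, tuple)):
            return [ex for ex in arg if self._accepts(ex)]
        return self._accepts(arg)

    def selection_vector(self, examples):
        # One bool per example, in the original order.
        return [self(ex) for ex in examples]


examples = [3, 8, 5, 10]
keep_large = ThresholdFilter(5)
keep_small = ThresholdFilter(5, negate=True)

accepted = keep_large(examples)                # [8, 5, 10]
mask = keep_large.selection_vector(examples)   # [False, True, True, True]
rejected = keep_small(examples)                # [3]
```

<P>In this sketch the selection vector and the filtered list carry the same information; the vector is the more convenient form when the positions of the accepted examples are needed.</P>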
<H2>Random filter</H2>
<A name="randomfilter"></A>
<index name="classes/Filter_random">

<P><CODE>Filter_random</CODE> accepts an example with a given probability.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>prob</DT>
<DD>Probability of accepting an example.</DD>

<DT>randomGenerator</DT>
<DD>A random number generator that is used for making selections. If it is not set before the filter is used for the first time, a new generator is constructed and stored here for future use.</DD>
</DL>

<P>The inherited attribute <CODE>domain</CODE> is ignored.</P>

<p class="header"><a href="filter.py">part of filter.py</a></p>
<XMP class="code">>>> randomfilter = orange.Filter_random(prob = 0.7, randomGenerator = 24)
>>> for i in range(10):
...    print randomfilter(example),
1 1 0 1 0 0 0 1 1 1
</XMP>

<P>For this script, <CODE>example</CODE> should be some learning example; you can load any data and set <CODE>example = data[0]</CODE>. The script's result will always be the same. Although the probability of selecting an example is set to 0.7, the filter accepted the example six times out of ten. Since the filter sees only individual examples, it cannot guarantee the proportion of accepted examples; if you need to select exactly 70% of the examples in a dataset, use <A href="RandomIndices.htm">random indices</A>.</P>

<P>Setting the random generator ensures that the filter will always select the same examples, regardless of how many times you run the script or what you do (in Orange) before you run it. <CODE>randomGenerator=24</CODE> is a shortcut for <CODE>randomGenerator = orange.RandomGenerator(24)</CODE> or <CODE>randomGenerator = orange.RandomGenerator(initseed=24)</CODE>.</P>

<P>To select a subset of examples without calling the filter for each individual example, apply the filter to the whole table:</P>
<XMP class=code>data70 = randomfilter(data)
</XMP>


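<P>The difference between accepting each example with probability 0.7 and selecting exactly 70% of a dataset can be sketched in plain Python. This is an illustration only: it uses the standard <CODE>random</CODE> module rather than <CODE>orange.RandomGenerator</CODE>, so it will not reproduce Orange's selections, and the function names <CODE>bernoulli_select</CODE> and <CODE>exact_select</CODE> are hypothetical.</P>

```python
import random


def bernoulli_select(examples, prob, seed):
    """Accept each example independently with probability `prob`.

    Like a filter that sees one example at a time, this cannot control
    how many examples end up selected.
    """
    rng = random.Random(seed)
    return [ex for ex in examples if rng.random() < prob]


def exact_select(examples, prob, seed):
    """Select round(prob * n) examples; this needs the whole dataset."""
    rng = random.Random(seed)
    k = int(round(prob * len(examples)))
    chosen = set(rng.sample(range(len(examples)), k))
    return [ex for i, ex in enumerate(examples) if i in chosen]


examples = list(range(100))
per_example = bernoulli_select(examples, 0.7, seed=24)
exact = exact_select(examples, 0.7, seed=24)

# len(exact) is always 70; len(per_example) merely tends to be close to 70.
# Reusing the seed reproduces the same selection on every run.
repeated = bernoulli_select(examples, 0.7, seed=24)
```

<P>Fixing the seed plays the same role here as setting <CODE>randomGenerator</CODE> in the filter: the selection becomes reproducible, but its size is still not guaranteed.</P>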
<H2>Filtering examples with(out) unknown values</H2>
<index name="classes/Filter_isDefined">
<index name="classes/Filter_hasClass">

<P><CODE>Filter_isDefined</CODE> selects examples for which all attribute values are defined (known). By default, the filter checks all attributes; you can modify the list <CODE>check</CODE> to select the attributes to be checked. Meta-attributes are never checked. (There is an obsolete filter <CODE>Filter_hasSpecial</CODE>, which does the opposite: it selects examples with at least one unknown value in any attribute, including the class attribute. <CODE>Filter_hasSpecial</CODE> always checks all attributes except meta-attributes.) <CODE>Filter_hasClassValue</CODE> selects examples with a defined class value. You can use <CODE>negate</CODE> to invert the selection, as shown in the script below.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>check</DT>
<DD>A list of boolean elements specifying which attributes to check. Each element corresponds to an attribute in the domain. By default, <CODE>check</CODE> is <CODE>None</CODE>, meaning that all attributes are checked. When the filter's <CODE>domain</CODE> is set, the list is initialized to a list of <CODE>True</CODE>s, unless it already exists. You can also set <CODE>check</CODE> manually, even without setting the <CODE>domain</CODE>. The list can be indexed by ordinary integers (<I>e.g.</I>, <CODE>check[0]</CODE>); if <CODE>domain</CODE> is set, you can also address the list by attribute names or descriptors.</DD>
</DL>

<P>As for all Orange objects, it is not recommended to modify the <CODE>domain</CODE> after it has been set once, unless you know exactly what you are doing. In this particular case, changing the domain would disrupt the correspondence between the domain attributes and the <CODE>check</CODE> list, causing unpredictable behaviour.</P>


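<P>The role of the <CODE>check</CODE> list can be sketched in plain Python. This is a simplified model rather than Orange's implementation: examples are plain lists, <CODE>None</CODE> stands in for an unknown value, and the function name <CODE>is_defined</CODE> is hypothetical.</P>

```python
UNKNOWN = None  # hypothetical marker for an unknown value


def is_defined(example, check=None):
    """Accept the example if every checked attribute has a known value.

    `check` holds one boolean per attribute position; None means that
    all attributes are checked.
    """
    if check is None:
        check = [True] * len(example)
    return all(value is not UNKNOWN
               for value, checked in zip(example, check)
               if checked)


# Toy rows with the attributes (age, prescription, astigmatic).
rows = [
    [UNKNOWN, "myope", "no"],
    ["young", "myope", "yes"],
    ["young", UNKNOWN, "no"],
]

all_checked = [is_defined(row) for row in rows]
# Ignore the first attribute (age), as f.check["age"] = 0 would.
ignoring_age = [is_defined(row, [False, True, True]) for row in rows]
```

<P>Because the list is positional, it only makes sense for examples laid out in the attribute order it was built for, which is why changing the domain afterwards is discouraged.</P>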
<p class="header">part of <a href="filter.py">filter.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<XMP class="code">import orange

data = orange.ExampleTable("lenses")
data2 = data[:5]
data2[0][0] = "?"
data2[1].setclass("?")
print "First five examples"
for ex in data2:
    print ex

print "\nExamples without unknown values"
f = orange.Filter_isDefined(domain = data.domain)
for ex in f(data2):
    print ex

print "\nExamples without unknown values, ignoring 'age'"
f.check["age"] = 0
for ex in f(data2):
    print ex

print "\nExamples with unknown values (ignoring age)"
for ex in f(data2, negate=1):
    print ex

print "\nExamples with defined class"
for ex in orange.Filter_hasClassValue(data2):
    print ex

print "\nExamples with undefined class"
for ex in orange.Filter_hasClassValue(data2, negate=1):
    print ex
</XMP>

<H2>Filtering examples with(out) a meta value</H2>
<index name="classes/Filter_hasMeta">

<P><code>Filter_hasMeta</code> selects the examples that have (or, when <code>negate</code>d, that <em>do not have</em>) a meta attribute with the given id.</P>

<DL class=attributes>
<DT>id</DT>
<dd>The id of the meta attribute to look for.</dd>
</DL>

<P>This filter is especially useful with examples from the basket format and their optional meta attributes. If they come, for instance, from a text mining domain, we can use it to get the documents that contain a certain word.</P>


<p class="header">part of <a href="filter.py">filterm.py</a>
(uses <a href="inquisition.basket">inquisition.basket</a>)</p>
<xmp class="code">data = orange.ExampleTable("inquisition.basket")
haveSurprise = orange.Filter_hasMeta(data, id = data.domain.index("surprise"))</xmp>

<P>This example, which selects all instances that contain the word "surprise", gets the id of the meta attribute from the domain by searching for the attribute named "surprise". This meta attribute is optional and does not necessarily appear in all examples. To fully understand how this particular example works, you should be familiar with <a href="Domain.htm#meta-attributes">optional meta attributes</a> and the <a href="basket.htm">basket file format</a>.</P>

<P>This filter can, of course, also be used in other situations involving meta values that appear only in some examples. The corresponding attributes do not need to be registered in the domain.</P>


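<P>The idea of selecting by the presence of an optional meta value can be sketched in plain Python. The model below is an assumption made for illustration, not Orange's data structure: each document is a dictionary that maps a meta id to a value, the ids in <CODE>WORD_IDS</CODE> are made up, and the function name <CODE>has_meta</CODE> is hypothetical.</P>

```python
# Hypothetical ids for two words; real meta ids are assigned by Orange.
WORD_IDS = {"surprise": -2, "fear": -3}

# Each document stores only the meta values it actually has.
documents = [
    {-2: 1.0, -3: 1.0},   # contains "surprise" and "fear"
    {-3: 2.0},            # contains only "fear"
    {-2: 1.0},            # contains only "surprise"
]


def has_meta(metas, meta_id, negate=False):
    """Accept a document if it carries a meta value with the given id."""
    return (meta_id in metas) != negate


surprise_id = WORD_IDS["surprise"]
with_surprise = [doc for doc in documents if has_meta(doc, surprise_id)]
without_surprise = [doc for doc in documents
                    if has_meta(doc, surprise_id, negate=True)]
```

<P>Only the presence of the id is tested; the value stored under it does not influence the decision in this sketch.</P>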
<H2>Filtering by attribute values</H2>

<H3>Fast filter for single values</H3>
<index name="classes/Filter_sameValue">

<P><CODE>Filter_sameValue</CODE> is a fast filter for selecting examples with a particular value of some attribute.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>position</DT>
<DD>Position of the attribute in the domain.</DD>

<DT>value</DT>
<DD>The attribute's value.</DD>
</DL>

<P>If <CODE>domain</CODE> is not set, make sure that examples are from the right domain so that <CODE>position</CODE> applies to the attribute you want.</P>

<p class="header"><a href="filter.py">part of filter.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<XMP class="code">filteryoung = orange.Filter_sameValue()
age = data.domain["age"]
filteryoung.value = orange.Value(age, "young")
filteryoung.position = data.domain.attributes.index(age)
print "\nYoung examples"
for ex in filteryoung(data):
    print ex
</XMP>

<P>This script selects examples with age="young" from the lenses dataset. Setting the position is somewhat tricky: <CODE>data.domain.attributes</CODE> behaves as a list and provides the method <CODE>index</CODE>, which we can use to retrieve the position of the attribute <CODE>age</CODE>. The attribute <CODE>age</CODE> is also needed to construct a <CODE>Value</CODE>.</P>

<P>As you can see, this filter is quick but crude.</P>

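<P>Why the position must match the domain can be seen in a small plain-Python sketch. It is an illustration only: the rows are toy data rather than the contents of lenses.tab, and the function name <CODE>same_value</CODE> is hypothetical.</P>

```python
# Toy stand-in for a domain and a few examples; not read from lenses.tab.
attribute_names = ["age", "prescription", "astigmatic", "tear_rate"]
rows = [
    ["young", "myope", "no", "reduced"],
    ["presbyopic", "myope", "yes", "normal"],
    ["young", "hypermetrope", "yes", "normal"],
]


def same_value(rows, position, value):
    """Keep the rows whose value at `position` equals `value`."""
    return [row for row in rows if row[position] == value]


# Look the position up once, then compare by index only.
position = attribute_names.index("age")
young = same_value(rows, position, "young")

# The same position applied to rows in a different attribute order
# silently tests another attribute (here: prescription).
reordered = [[row[1], row[0], row[2], row[3]] for row in rows]
wrong = same_value(reordered, position, "young")
```

<P>Comparing by position avoids any lookup per example, which is where the speed comes from, but nothing verifies that the position still refers to the intended attribute.</P>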
<H3>Simple filter for continuous attributes</H3>

<P>The <CODE>ValueFilter</CODE> class provides several conditions for filtering values of continuous attributes: <CODE>ValueFilter.Equal</CODE>, <CODE>ValueFilter.Less</CODE>, <CODE>ValueFilter.LessEqual</CODE>, <CODE>ValueFilter.Greater</CODE>, <CODE>ValueFilter.GreaterEqual</CODE>, <CODE>ValueFilter.Between</CODE>, <CODE>ValueFilter.Outside</CODE>.</P>

<P>The following excerpt uses two of them: <CODE>ValueFilter.GreaterEqual</CODE>, which needs only one parameter, and <CODE>ValueFilter.Between</CODE>, which is defined by two parameters.</P>

<p class="header"><a href="filterv.py">part of filterv.py</a>
(uses <a href="iris.tab">iris.tab</a>)</p>

<XMP class="code">fcont = orange.Filter_values(domain = data.domain)

fcont[0] = (orange.ValueFilter.GreaterEqual, 7.6)
print "\n\nThe first attribute is greater than or equal to 7.6"
for ex in fcont(data):
    print ex

fcont[0] = (orange.ValueFilter.Between, 4.6, 5.0)
print "\n\nThe first attribute is between 4.6 and 5.0"
for ex in fcont(data):
    print ex
</XMP>


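<P>The seven conditions can be read as ordinary comparisons. The sketch below spells them out in plain Python; it is not Orange code, the function name <CODE>matches</CODE> is hypothetical, and treating <CODE>Between</CODE> as inclusive at both ends (with <CODE>Outside</CODE> as its complement) is an assumption that this page does not state.</P>

```python
import operator

# Conditions that take a single reference value.
ONE_PARAMETER = {
    "Equal": operator.eq,
    "Less": operator.lt,
    "LessEqual": operator.le,
    "Greater": operator.gt,
    "GreaterEqual": operator.ge,
}


def matches(value, condition, low, high=None):
    """Test a continuous value against a named condition."""
    if condition == "Between":
        # Assumption: both interval ends are inclusive.
        return low <= value <= high
    if condition == "Outside":
        return value < low or value > high
    return ONE_PARAMETER[condition](value, low)


sepal_lengths = [4.5, 4.8, 5.0, 7.6, 7.9]
large = [v for v in sepal_lengths if matches(v, "GreaterEqual", 7.6)]
middle = [v for v in sepal_lengths if matches(v, "Between", 4.6, 5.0)]
rest = [v for v in sepal_lengths if matches(v, "Outside", 4.6, 5.0)]
```

<P>With these definitions every value falls into exactly one of <CODE>Between</CODE> and <CODE>Outside</CODE> for the same pair of bounds.</P>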
<H3>Filter for multiple values and attributes</H3>
<index name="classes/Filter_Values">
<index name="classes/ValueFilterList">
<index name="classes/ValueFilter">
<index name="classes/ValueFilter_discrete">
<index name="classes/ValueFilter_continuous_">


<P><CODE>Filter_values</CODE> performs a function similar to that of <CODE>Filter_sameValue</CODE>, but can handle conjunctions and disjunctions of more complex conditions.</P>

<P class=section>Attributes</P>
<DL class=attributes>
<DT>conditions</DT>
<DD>A list of type <CODE>ValueFilterList</CODE> that contains conditions.</DD>

<DT>conjunction</DT>
<DD>Decides whether the filter will compute the conjunction or the disjunction of conditions. If <CODE>True</CODE>, an example is accepted if no values are rejected. If <CODE>False</CODE>, an example is accepted if at least one value is accepted.</DD>
</DL>

<P>Elements of the list <CODE>conditions</CODE> must be objects of type <CODE>ValueFilter_discrete</CODE> for discrete and <CODE>ValueFilter_continuous</CODE> for continuous attributes; both are derived from <CODE>ValueFilter</CODE>.</P>

<P>Both have the fields <CODE><B>position</B></CODE>, denoting the position of the checked attribute (just as in <CODE>Filter_sameValue</CODE>), and <CODE><B>acceptSpecial</B></CODE>, which determines whether undefined values are accepted (1), rejected (0) or simply ignored (-1, default).</P>

<P><CODE><B>ValueFilter_discrete</B></CODE> has the field <CODE><B>values</B></CODE> of type <CODE>ValueList</CODE>, which contains objects of type <CODE><B>Value</B></CODE> that represent the acceptable values.</P>

<P><CODE><B>ValueFilter_continuous</B></CODE> has the fields <CODE><B>min</B></CODE> and <CODE><B>max</B></CODE>, which define an interval, and the field <CODE><B>outside</B></CODE>, which tells whether values outside or inside the interval are accepted. The default is <CODE>False</CODE> (inside).</P>

<p class="header"><a href="filter.py">part of filter.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<XMP class="code">fya = orange.Filter_values()
fya.domain = data.domain
age, astigm = data.domain["age"], data.domain["astigmatic"]
fya.conditions.append(orange.ValueFilter_discrete(
                        position = data.domain.attributes.index(age),
                        values=[orange.Value(age,"young"),
                                orange.Value(age, "presbyopic")])
                     )
fya.conditions.append(orange.ValueFilter_discrete(
                        position = data.domain.attributes.index(astigm),
                        values=[orange.Value(astigm, "yes")])
                     )
for ex in fya(data):
    print ex
</XMP>

<P>This script selects examples whose age is "young" or "presbyopic" and which are astigmatic. Unknown values are ignored: if the value of one of the two attributes is missing, only the other is checked; if both are missing, the example is accepted.</P>

<P>The script first constructs the filter and assigns a domain. Then it appends both conditions to the filter's <CODE>conditions</CODE> field. Both are of type <CODE>orange.ValueFilter_discrete</CODE>, since the two attributes are discrete. The position of the attribute is obtained in the same way as for <CODE>Filter_sameValue</CODE>, described above.</P>

<P>The list of conditions can also be given to the filter's constructor. The following filter will accept examples whose age is "young" or "presbyopic" or who are astigmatic (<CODE>conjunction = 0</CODE>). In contrast to the filter above, unknown age is not acceptable (but examples with unknown age can still be accepted if they are astigmatic). Meanwhile, examples with unknown astigmatism are always accepted.</P>


<p class="header"><a href="filter.py">part of filter.py</a>
(uses <a href="lenses.tab">lenses.tab</a>)</p>
<XMP class="code">fya = orange.Filter_values(
   domain = data.domain,
   conditions = [
     orange.ValueFilter_discrete(
       position = data.domain.attributes.index(age),
       values = [orange.Value(age, "young"),
                 orange.Value(age, "presbyopic")],
       acceptSpecial = 0
     ),
     orange.ValueFilter_discrete(
       position = data.domain.attributes.index(astigm),
       values = [orange.Value(astigm, "yes")],
       acceptSpecial = 1
     )
   ],
   conjunction = 0
)
</XMP>

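<P>The interplay of <CODE>conjunction</CODE> and <CODE>acceptSpecial</CODE> described in this section can be modelled in a few lines of plain Python. This sketch follows the rules as stated on this page and is not Orange's implementation: the function names are hypothetical, <CODE>None</CODE> stands in for an unknown value, and each condition is a tuple of position, acceptable values and the <CODE>acceptSpecial</CODE> setting.</P>

```python
UNKNOWN = None  # hypothetical marker for an unknown value


def check_condition(value, accepted, accept_special=-1):
    """Return True (accepted), False (rejected) or None (ignored)."""
    if value is UNKNOWN:
        if accept_special == -1:
            return None            # the condition takes no part in the decision
        return accept_special == 1
    return value in accepted


def filter_values(example, conditions, conjunction=True):
    results = [check_condition(example[position], accepted, special)
               for position, accepted, special in conditions]
    if conjunction:
        # Accepted if no value is rejected.
        return not any(result is False for result in results)
    # Accepted if at least one value is accepted.
    return any(result is True for result in results)


# Examples are (age, astigmatic) pairs.
both = [(0, {"young", "presbyopic"}, -1), (1, {"yes"}, -1)]
either = [(0, {"young", "presbyopic"}, 0), (1, {"yes"}, 1)]

and_known = filter_values(("young", "yes"), both)            # True
and_missing = filter_values((UNKNOWN, UNKNOWN), both)        # True
or_unknown_age = filter_values((UNKNOWN, "no"), either, conjunction=False)   # False
or_unknown_astigm = filter_values(("pre-presbyopic", UNKNOWN), either,
                                  conjunction=False)         # True
```

<P>Under these rules, an ignored condition can never cause a rejection in a conjunction, and it can never cause an acceptance in a disjunction.</P>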
<P>If you don't find this filter attractive, use <a href="preprocessing.htm"><CODE>Preprocessor_take</CODE></a> instead, which is less flexible but more intelligent and friendly.</P>

</BODY>
</HTML>