<html>
<HEAD>
<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
</HEAD>

<BODY>
<h1>Removal of Unused Attribute Values</h1>
<index name="preprocessing+removal of unused values">

<P>The definition of a discrete attribute (<CODE>EnumVariable</CODE>) often declares values that do not actually appear in the data, either originally or as a consequence of some preprocessing. Such anomalies are handled by the class <CODE><INDEX name="classes/RemoveUnusedValues">RemoveUnusedValues</CODE> which, given an attribute and the data, determines whether there are any unused values and reduces the attribute if needed. There are four possible cases.</P>

<UL>
<LI>None of the attribute's values is used in the data: the value of this attribute is undefined for all examples. The attribute is thus useless and the class returns <CODE>None</CODE>.</LI>

<LI>The attribute has only one used value (or, possibly, only one value at all). Such an attribute is in fact useless and can probably be removed without harm. Nevertheless, its fate is decided by the flag <CODE>removeOneValued</CODE>, which is <CODE>False</CODE> by default, so such attributes are retained unless explicitly specified otherwise.</LI>

<LI>All of the attribute's values occur in the data (and the attribute has more than one value; otherwise the above case applies). The original attribute is returned.</LI>

<LI>There are some unused values. A new attribute is constructed and the unused values are omitted. The value of the new attribute is computed automatically from the value of the original attribute (<A href="lookup.htm"><CODE>ClassifierByLookupTable1</CODE></a> is used for mapping).</LI>
</UL>
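
<P>The four cases can be sketched in plain Python, independently of Orange. The function name <CODE>remove_unused_values</CODE> and its arguments are hypothetical and only mirror the description above; attributes are represented by plain lists of values, and undefined values by <CODE>None</CODE>. How a one-valued attribute with additional unused values is treated when <CODE>removeOneValued</CODE> is <CODE>False</CODE> is an assumption here (the sketch reduces it).</P>

```python
def remove_unused_values(declared, column, remove_one_valued=False):
    """Illustrative sketch only, not Orange's implementation.

    declared: list of values from the attribute's definition
    column:   values observed in the data, None meaning undefined
    Returns None, the original list, or a new reduced list.
    """
    observed = set(v for v in column if v is not None)
    # Keep the declared order of the values that are actually used.
    used = [v for v in declared if v in observed]

    if not used:
        return None          # case 1: no value is ever used
    if len(used) == 1 and remove_one_valued:
        return None          # case 2 with the flag set
    if len(used) == len(declared):
        return declared      # case 3 (or a retained one-valued attribute)
    return used              # case 4: new, reduced list of values
```

<P>With this sketch, <CODE>remove_unused_values(["0", "1", "2"], ["0", "2", "0"])</CODE> yields a new list without the value 1, while a column consisting only of <CODE>None</CODE> yields <CODE>None</CODE>.</P>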

<P class=section>Attributes</P>
<DL class=attributes>
<DT>removeOneValued</DT>
<DD>Decides whether attributes with only one used value are removed or retained (default: <CODE>False</CODE>, meaning they are retained).</DD>
</DL>


<P>Let us show the use of the class on a simple dataset with three examples, given by the following tab-delimited file.</P>

<XMP CLASS=CODE>a       b      c         d         y
0 1     0 1 2  discrete  discrete  discrete
                                   class
0       0      ?         0         0
1       2      ?         0         0
0       0      ?         0         1
</XMP>

<P>The script below constructs a list <CODE>newattrs</CODE> which contains, for each attribute from the original dataset, either the original attribute, <CODE>None</CODE> or a reduced attribute.</P>

<p class="header">part of <A href="unusedValues.py">unusedValues.py</a>
(uses <a href="unusedValues.tab">unusedValues.tab</a>)</P>
<XMP class="code">import orange
data = orange.ExampleTable("unusedValues")

# For each attribute: the original attribute, a reduced copy, or None
newattrs = [orange.RemoveUnusedValues(attr, data) for attr in data.domain.variables]

print
for attr in range(len(data.domain)):
    print data.domain[attr],
    if newattrs[attr] == data.domain[attr]:
        print "retained as is"
    elif newattrs[attr]:
        print "reduced, new values are", newattrs[attr].values
    else:
        print "removed"
</XMP>

<P>And here's the script's output.</P>
<XMP class="code">EnumVariable 'a' retained as is
EnumVariable 'b' reduced, new values are <0, 2>
EnumVariable 'c' removed
EnumVariable 'd' retained as is
EnumVariable 'y' retained as is
</xmp>

<P>Attributes <CODE>a</CODE> and <CODE>y</CODE> use all of their values and are left alone. In <CODE>b</CODE>, value 1 is not used and is removed (not from the original attribute, of course; a new attribute is created). <CODE>c</CODE> has no defined values and is removed altogether. <CODE>d</CODE> has a single value and is retained since <CODE>removeOneValued</CODE> was left at <CODE>False</CODE>; if we set it to <CODE>True</CODE>, this attribute would be removed as well.</P>
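
<P>To see why each attribute ends up where it does, the used values can be tallied by hand. The following plain-Python sketch does not use Orange; the names <CODE>declared</CODE>, <CODE>rows</CODE>, <CODE>used_values</CODE> and <CODE>surviving</CODE> are hypothetical, and it assumes that columns typed <CODE>discrete</CODE> declare only the values found in the file.</P>

```python
# Declared values per attribute, transcribed from the tab-delimited file.
# Assumption: "discrete" columns declare only the values they contain.
names = ["a", "b", "c", "d", "y"]
declared = {"a": ["0", "1"], "b": ["0", "1", "2"], "c": [], "d": ["0"], "y": ["0", "1"]}

# The three examples; None stands for the undefined value "?".
rows = [
    {"a": "0", "b": "0", "c": None, "d": "0", "y": "0"},
    {"a": "1", "b": "2", "c": None, "d": "0", "y": "0"},
    {"a": "0", "b": "0", "c": None, "d": "0", "y": "1"},
]

def used_values(name):
    """Declared values of the attribute that occur in at least one example."""
    seen = set(row[name] for row in rows if row[name] is not None)
    return [v for v in declared[name] if v in seen]

def surviving(remove_one_valued):
    """Names of attributes that are kept (as is or reduced)."""
    minimum = 2 if remove_one_valued else 1
    return [n for n in names if len(used_values(n)) >= minimum]
```

<P>Here <CODE>surviving(False)</CODE> keeps <CODE>a</CODE>, <CODE>b</CODE>, <CODE>d</CODE> and <CODE>y</CODE>, matching the output above, while <CODE>surviving(True)</CODE> drops <CODE>d</CODE> as well.</P>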

<P>The values of the new attribute for <CODE>b</CODE> are automatically computed from the original. The script can thus proceed as follows.</P>

<p class="header">part of <A href="unusedValues.py">unusedValues.py</a>
(uses <a href="unusedValues.tab">unusedValues.tab</a>)</P>
<XMP class="code"># Drop the None entries, which stand for removed attributes
filteredattrs = filter(bool, newattrs)
newdata = orange.ExampleTable(orange.Domain(filteredattrs), data)

print "\nOriginal example table"
for ex in data:
    print ex

print "\nReduced example table"
for ex in newdata:
    print ex
</XMP>

<P>List <CODE>newattrs</CODE> includes some original attributes (<CODE>a</CODE>, <CODE>d</CODE> and <CODE>y</CODE>), a new attribute (<CODE>b</CODE>) and a <CODE>None</CODE> (for <CODE>c</CODE>). The latter is removed by the call to <CODE>filter</CODE> at the beginning of this part of the script. We use <CODE>filteredattrs</CODE> to construct a new domain and then convert the original <CODE>data</CODE> to <CODE>newdata</CODE>. As the output shows, the two tables are the same except for the removed attribute <CODE>c</CODE>.</P>

<XMP class="code">Original example table
['0', '0', '?', '0', '0']
['1', '2', '?', '0', '0']
['0', '0', '?', '0', '1']

Reduced example table
['0', '0', '0', '0']
['1', '2', '0', '0']
['0', '0', '0', '1']
</XMP>
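
<P>The printed values of <CODE>b</CODE> are unchanged because each value of the original attribute is mapped to the corresponding value of the reduced one. The idea behind such a lookup can be sketched in plain Python as follows. This is not the <CODE>ClassifierByLookupTable1</CODE> interface; the names <CODE>lookup</CODE> and <CODE>translate</CODE> are hypothetical, and values are represented by their indices.</P>

```python
old_values = ["0", "1", "2"]   # values declared by the original attribute b
new_values = ["0", "2"]        # values of the reduced attribute

# lookup[i] is the index in new_values for old value i, or None if dropped.
lookup = [new_values.index(v) if v in new_values else None for v in old_values]

def translate(old_index):
    """Map an index into old_values to an index into new_values."""
    if old_index is None:      # an undefined value stays undefined
        return None
    return lookup[old_index]

# Column b of the three examples, as indices into old_values.
column_b = [0, 2, 0]
converted = [translate(i) for i in column_b]
```

<P>The table is <CODE>[0, None, 1]</CODE>: the old value 2 has index 2 in the original attribute but index 1 in the reduced one, so the stored indices change while the symbolic values stay the same.</P>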

</BODY></HTML>