source: orange/Orange/doc/reference/ExamplesDistance.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 9.2 KB checked in by anze <anze.staric@…>, 2 years ago

Moved orange to Orange (part 2)

1<html>
2<HEAD>
3<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
4<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
5</HEAD>
6
7<BODY>
8<index name="distances between examples">
9<h1>Distances between Examples</h1>
10
11<p>This page describes classes that implement different metrics for measuring distances between examples.</p>
12
13<P>Most (although not all) measures of distance between examples require some "learning", that is, adjusting the measure to the data. For instance, when the dataset contains continuous attributes, the distances between continuous values should be normalized, e.g. by dividing the distance by the range of possible values or by some interquartile distance, so that all attributes have, in principle, similar impact.</P>
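<P>The range normalization mentioned above can be sketched in plain Python as follows. This is only an illustration that does not use Orange; the helper names <CODE>range_normalizer</CODE> and <CODE>normalized_difference</CODE> and the sample values are assumptions made up for this sketch.</P>

```python
def range_normalizer(values):
    """Return (minimum, factor) so that (v - minimum) * factor lies in [0, 1].

    Hypothetical helper, not part of Orange.
    """
    lo, hi = min(values), max(values)
    # A constant attribute cannot separate examples; factor 0 ignores it.
    factor = 1.0 / (hi - lo) if hi > lo else 0.0
    return lo, factor


def normalized_difference(a, b, factor):
    """Absolute difference of two values, scaled by the attribute's factor."""
    return abs(a - b) * factor


# Two attributes on very different scales (made-up data).
ages = [20.0, 35.0, 50.0, 80.0]
incomes = [10000.0, 25000.0, 40000.0]

_, age_factor = range_normalizer(ages)
_, income_factor = range_normalizer(incomes)

# After normalization, half of each attribute's range counts the same.
age_diff = normalized_difference(20.0, 50.0, age_factor)
income_diff = normalized_difference(10000.0, 25000.0, income_factor)
```

<P>Without the scaling, the raw income difference (15000) would dominate the raw age difference (30) in any sum over attributes.</P>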
14
15<P>Different measures of distance thus appear in pairs - a class that measures the distance and a class that constructs it based on the data. The abstract classes representing such a pair are <CODE>ExamplesDistance</CODE> and <CODE>ExamplesDistanceConstructor</CODE>. Since most measures work on normalized distances between corresponding attributes, there is an abstract intermediate class <CODE>ExamplesDistance_Normalized</CODE> that takes care of normalizing. The remaining classes correspond to different ways of defining the distances, such as Manhattan or Euclidean distance.</P>
16
17<P>Unknown values are treated correctly only by Euclidean and Relief distance. For other measures of distance, the distance between an unknown and a known value, or between two unknown values, is always 0.5.</P>
18
19<hr>
20
21<H2>ExamplesDistance</H2>
22<index name="classes/ExamplesDistance">
23
24<P class=section>Methods</P>
25<DL class=attributes><DT>__call__(example1, example2)</DT>
26<DD>Returns the distance between the given examples as a floating point number.</DD>
27</DL>
28
29
30<H2>ExamplesDistanceConstructor</H2>
31<index name="classes/ExamplesDistanceConstructor">
32
33<P class=section>Methods</P>
34<DL class=attributes><DT>__call__([examples, weightID][, DomainDistributions][, DomainBasicAttrStat])</DT>
35<DD>Constructs an instance of <CODE>ExamplesDistance</CODE>. Not all the data needs to be given. Most measures can be constructed from <CODE>DomainBasicAttrStat</CODE>; if it is not given, they can fall back on either <CODE>examples</CODE> or <CODE>DomainDistributions</CODE>. Some (e.g. <CODE>ExamplesDistance_Hamming</CODE>) do not need any arguments at all.</DD>
36</DL>
37
38<H2>ExamplesDistance_Normalized</H2>
39<index name="classes/ExamplesDistance_Normalized">
40
41<P>This abstract class provides a function which is given two examples and returns a list of normalized distances between values of their attributes. Many distance measuring classes need such a function and are therefore derived from this class.</p>
42
43<P class=section>Attributes</P>
44<DL class=attributes>
45<DT>normalizers</DT>
46<DD>A precomputed list of normalizing factors for attribute values:
47<UL><LI>if a factor is positive, differences in the attribute's values are multiplied by it; for continuous attributes the factor is 1/(max_value-min_value) and for ordinal attributes the factor is 1/number-of-values. If either (or both) of the values is unknown, the distance is 0.5.</LI>
48<LI>if a factor is -1, the attribute is nominal; the distance between two values is 0 if they are the same (or at least one is unknown) and 1 if they are different.</LI>
49<LI>if a factor is 0, the attribute is ignored.</LI>
50</UL>
51</DD>
52
53<DT>bases, averages, variances</DT>
54<DD>The minimal values, averages and variances (continuous attributes only)</DD>
55
56<DT>domainVersion</DT>
57<DD>Stores the domain version for which the normalizers were computed. The domain version is increased each time the domain description is changed (i.e. attributes are added or removed); this is used for a quick check that the user is not attempting to measure distances between examples that do not correspond to the normalizers. Since domains are practically immutable (especially from Python), you do not need to care about this anyway.</DD>
58</DL>
59
60<P class=section>Methods</P>
61<DL class=attributes>
62<DT>attributeDistances(example1, example2)</DT>
63<DD>Returns a list of floats representing distances between pairs of attribute values of the two examples.</DD>
64</DL>
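<P>The three kinds of normalizing factors described above can be illustrated with a small plain-Python sketch. It is not Orange's implementation: the function name, the use of <CODE>None</CODE> for unknown values, and the choice to report 0.0 for ignored attributes are assumptions made for this example.</P>

```python
UNKNOWN = None  # assumption: unknown values are represented by None here


def attribute_distances(example1, example2, normalizers):
    """Per-attribute distances following the factor conventions above."""
    dists = []
    for a, b, factor in zip(example1, example2, normalizers):
        missing = a is UNKNOWN or b is UNKNOWN
        if factor == 0:
            # Ignored attribute (reported as 0.0 in this sketch).
            dists.append(0.0)
        elif factor < 0:
            # Nominal attribute: 0 if equal or any value unknown, else 1.
            dists.append(0.0 if missing or a == b else 1.0)
        elif missing:
            # Continuous or ordinal attribute with an unknown value.
            dists.append(0.5)
        else:
            dists.append(abs(a - b) * factor)
    return dists


# Made-up examples: continuous, nominal, ignored, continuous with unknown.
normalizers = [1.0 / 60.0, -1, 0, 0.01]
ex1 = [20.0, "red", 5.0, UNKNOWN]
ex2 = [50.0, "blue", 9.0, 30.0]
dist = attribute_distances(ex1, ex2, normalizers)
```

<P>Here <CODE>dist</CODE> holds approximately 0.5, then 1.0, 0.0 and 0.5; distance classes derived from the normalized base class combine such a list into a single number.</P>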
65
66<H2>ExamplesDistance_Hamming / ExamplesDistanceConstructor_Hamming</H2>
67<index name="Hamming distance">
68<index name="classes/ExamplesDistance_Hamming">
69<index name="classes/ExamplesDistanceConstructor_Hamming">
70
71<P>Hamming distance between two examples is defined as the number of attributes in which the two examples differ. Note that this measure is not really appropriate for examples that contain continuous attributes.</P>
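<P>As a rough illustration of the definition, and of why it suits continuous attributes poorly, here is a plain-Python sketch. The function <CODE>hamming_distance</CODE> is a hypothetical helper written for this page, not the Orange class.</P>

```python
def hamming_distance(example1, example2):
    """Number of positions in which the two examples differ."""
    if len(example1) != len(example2):
        raise ValueError("examples must have the same number of attributes")
    return float(sum(1 for a, b in zip(example1, example2) if a != b))


# Discrete attributes: two of the four values differ.
d_discrete = hamming_distance(
    ("young", "myope", "no", "reduced"),
    ("young", "hypermetrope", "yes", "reduced"),
)

# Continuous attributes: a tiny difference counts as much as a huge one.
d_tiny = hamming_distance((1.0, 2.5), (1.0, 2.5001))
d_huge = hamming_distance((1.0, 2.5), (1.0, 2500.0))
```

<P>Both <CODE>d_tiny</CODE> and <CODE>d_huge</CODE> equal 1.0, since only the fact that the values differ is counted, not by how much.</P>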
72
73<P>This class is derived directly from <CODE>ExamplesDistance</CODE>, not from <CODE>ExamplesDistance_Normalized</CODE>.</P>
74
75<P><B>Note: in some previous versions of Orange, this distance was wrongly referred to as Hamiltonian, not Hamming.</B> This has been corrected <em>without</em> providing any aliases for backward compatibility.</P>
76
77
78<H2>ExamplesDistance_Maximal / ExamplesDistanceConstructor_Maximal</H2>
79<index name="classes/ExamplesDistance_Maximal">
80<index name="classes/ExamplesDistanceConstructor_Maximal">
81
82<P>The maximal distance (also called infinite distance) between two examples is defined as the largest distance between a pair of corresponding attribute values. If <CODE>dist</CODE> is the result of <CODE>ExamplesDistance_Normalized.attributeDistances</CODE>, then <CODE>ExamplesDistance_Maximal</CODE> returns <CODE>max(dist)</CODE>.</P>
83
84
85<H2>ExamplesDistance_Manhattan / ExamplesDistanceConstructor_Manhattan</H2>
86<index name="Manhattan distance">
87<index name="classes/ExamplesDistance_Manhattan">
88<index name="classes/ExamplesDistanceConstructor_Manhattan">
89
90<P>Manhattan distance between two examples is the sum of absolute values of distances between pairs of attributes, i.e. <CODE>sum([abs(x) for x in dist])</CODE>, where <CODE>dist</CODE> is the result of <CODE>ExamplesDistance_Normalized.attributeDistances</CODE>.</P>
91
92<H2>ExamplesDistance_Euclidean / ExamplesDistanceConstructor_Euclidean</H2>
93<index name="Euclidean distance">
94<index name="classes/ExamplesDistance_Euclidean">
95<index name="classes/ExamplesDistanceConstructor_Euclidean">
96
97<P>Euclidean distance is the square root of the sum of squared per-attribute distances, i.e. <CODE>sqrt(sum([x*x for x in dist]))</CODE>, where <CODE>dist</CODE> is the result of <CODE>ExamplesDistance_Normalized.attributeDistances</CODE>.</P>
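<P>The maximal, Manhattan and Euclidean measures differ only in how they combine the list of per-attribute distances. The following plain-Python sketch compares them on one made-up list; the function names are illustrative and the list <CODE>dist</CODE> stands in for the result of <CODE>attributeDistances</CODE>. Unknown values are not handled here.</P>

```python
import math


def maximal(dist):
    """Largest per-attribute distance."""
    return max(abs(x) for x in dist)


def manhattan(dist):
    """Sum of absolute per-attribute distances."""
    return sum(abs(x) for x in dist)


def euclidean(dist):
    """Square root of the sum of squared per-attribute distances."""
    return math.sqrt(sum(x * x for x in dist))


# Made-up per-attribute distances for one pair of examples.
dist = [0.5, 1.0, 0.0, 0.5]

d_max = maximal(dist)    # 1.0
d_man = manhattan(dist)  # 2.0
d_euc = euclidean(dist)  # sqrt(1.5)
```

<P>For any list of distances the three results are ordered: maximal is at most Euclidean, which is at most Manhattan.</P>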
98
99<P class=section>Attributes</P>
100<DL class=attributes>
101<DT>distributions</DT>
102<DD>An object of type <CODE>DomainDistributions</CODE> that holds the distributions for all discrete attributes. This is needed to compute distances between known and unknown values.</DD>
103
104<DT>bothSpecialDist</DT>
105<DD>A list containing the distance between two unknown values for each discrete attribute.</DD>
106</DL>
107
108
109<P>This measure of distance deals with unknown values by computing the expected square of distance based on the distribution obtained from the "training" data. Squared distance between
110<UL>
111<LI>a known and unknown continuous attribute equals squared distance between the known and the average, plus variance</LI>
112<LI>two unknown continuous attributes equals double variance</LI>
113<LI>a known and an unknown discrete attribute equals the probability that the unknown attribute has a different value than the known one (i.e. 1 - probability of the known value)</LI>
114<LI>two unknown discrete attributes equals the probability that two randomly chosen values are different, which can be computed as 1 - sum of squares of probabilities.</LI>
115</UL>
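<P>The four rules can be written down as a short plain-Python sketch. The function names, the mean, the variance and the value distribution below are made up for illustration, and whether Orange applies the rules to raw or to normalized values is not specified here, so treat this as a sketch of the arithmetic only.</P>

```python
def sq_known_unknown_continuous(x, mean, variance):
    """E[(x - Y)^2] for unknown Y: squared distance to the mean plus variance."""
    return (x - mean) ** 2 + variance


def sq_both_unknown_continuous(variance):
    """E[(X - Y)^2] for two independent unknown values: twice the variance."""
    return 2.0 * variance


def sq_known_unknown_discrete(value, probabilities):
    """Probability that the unknown value differs from the known one."""
    return 1.0 - probabilities[value]


def sq_both_unknown_discrete(probabilities):
    """Probability that two independently drawn values differ."""
    return 1.0 - sum(p * p for p in probabilities.values())


# Made-up statistics from some "training" data.
mean, variance = 10.0, 4.0
probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}

c_one = sq_known_unknown_continuous(13.0, mean, variance)  # 9 + 4
c_two = sq_both_unknown_continuous(variance)               # 2 * 4
d_one = sq_known_unknown_discrete("a", probabilities)      # 1 - 0.5
d_two = sq_both_unknown_discrete(probabilities)            # 1 - 0.375
```

<P>The continuous rules follow from expanding the expected squared difference: the cross term vanishes around the mean, leaving the squared offset plus one variance for each unknown value.</P>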
116
117<P>Continuous cases can be handled by averages and variances inherited from <CODE>ExamplesDistance_Normalized</CODE>. The data for discrete cases are stored in <CODE>distributions</CODE> (used for unknown vs. known value) and in <CODE>bothSpecialDist</CODE> (the precomputed distance between two unknown values).</P>
118
119<P>See the output of <A href="examplesdistance-missing.py">examplesdistance-missing.py</A> for an example.</P>
120
121
122<H2>ExamplesDistance_Relief / ExamplesDistanceConstructor_Relief</H2>
123<index name="classes/ExamplesDistance_Relief">
124<index name="classes/ExamplesDistanceConstructor_Relief">
125
126<p><CODE>ExamplesDistance_Relief</CODE> is similar to Manhattan distance, but incorporates a more correct treatment of undefined values, as used by the ReliefF measure.</p>
127
128<HR>
129
130<H2>Example</H2>
131
132<P>If attributes are discrete, <CODE>ExamplesDistance_Manhattan</CODE> basically counts the number of attributes in which two examples differ. It is therefore easy to "check" its results.</P>
133
134<p class="header"><a href="examplesdistance.py">examplesdistance.py</a>
135(uses <a href="lenses.tab">lenses.tab</a>)</p>
136<XMP class=code>import orange
137
138data = orange.ExampleTable("lenses")
139
140distance = orange.ExamplesDistanceConstructor_Manhattan(data)
141
142ref = data[0]
143print "*** Reference example: ", ref
144for ex in data:
145    print ex, distance(ex, ref)
146</XMP>
147
148<P>The printout begins with:</P>
149
150<XMP class=code>*** Reference example:  ['young', 'myope', 'no', 'reduced', 'none']
151['young', 'myope', 'no', 'reduced', 'none'] 0.0
152['young', 'myope', 'no', 'normal', 'soft'] 1.0
153['young', 'myope', 'yes', 'reduced', 'none'] 1.0
154['young', 'myope', 'yes', 'normal', 'hard'] 2.0
155['young', 'hypermetrope', 'no', 'reduced', 'none'] 1.0
156['young', 'hypermetrope', 'no', 'normal', 'soft'] 2.0
157['young', 'hypermetrope', 'yes', 'reduced', 'none'] 2.0
158['young', 'hypermetrope', 'yes', 'normal', 'hard'] 3.0
159</XMP> 
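<P>The printed distances can be checked by hand, or with a few lines of plain Python that do not need Orange. The sketch below copies the rows from the printout and counts differing values. It assumes that the last value in each row is the class and is not part of the distance; this is inferred from the printout (the second row differs from the reference in two positions, yet its distance is 1.0). The helper name <CODE>count_differences</CODE> is made up for this check.</P>

```python
reference = ["young", "myope", "no", "reduced", "none"]

rows = [
    ["young", "myope", "no", "reduced", "none"],
    ["young", "myope", "no", "normal", "soft"],
    ["young", "myope", "yes", "reduced", "none"],
    ["young", "myope", "yes", "normal", "hard"],
    ["young", "hypermetrope", "no", "reduced", "none"],
    ["young", "hypermetrope", "no", "normal", "soft"],
    ["young", "hypermetrope", "yes", "reduced", "none"],
    ["young", "hypermetrope", "yes", "normal", "hard"],
]

# Distances as they appear in the printout above.
printed = [0.0, 1.0, 1.0, 2.0, 1.0, 2.0, 2.0, 3.0]


def count_differences(example, ref, n_attributes=4):
    """Count differing values among the first n_attributes (class excluded)."""
    pairs = zip(example[:n_attributes], ref[:n_attributes])
    return float(sum(1 for a, b in pairs if a != b))


computed = [count_differences(row, reference) for row in rows]
matches = computed == printed
```

<P>With the class column excluded, the counts agree with every printed distance.</P>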