Changeset 7599:5c0ed7601f79 in orange


Timestamp:
02/05/11 00:42:08 (3 years ago)
Author:
lan <lan@…>
Branch:
default
Convert:
9f4e59a95a5a9d8ddb14fe65406498d3b033e732
Message:

Some documentation, PearsonR and SpearmanR distance.

File:
1 edited

  • orange/Orange/distances/__init__.py

    r6848 r7599  
     1""" 
     2 
     3########################### 
     4Distances between Instances 
     5########################### 
     6 
     7This page describes classes that implement different metrics for measuring 
     8distances (dissimilarities) between instances. 
     9 
     10Most (although not all) measures of distance between instances require 
     11some "learning" - adjusting the measure to the data. For instance, when 
     12the dataset contains continuous features, the distances between continuous 
     13values should be normalized, e.g. by dividing the distance by the range 
     14of possible values or by some interquartile distance to ensure that all 
     15features have, in principle, similar impacts. 
     16 
     17Different measures of distance thus appear in pairs - a class that measures 
     18the distance and a class that constructs it based on the data. The abstract 
     19classes representing such a pair are `ExamplesDistance` and 
     20`ExamplesDistanceConstructor`. 
     21 
     22Since most measures work on normalized distances between corresponding 
     23features, there is an abstract intermediate class 
     24`ExamplesDistance_Normalized` that takes care of normalizing. 
     25The remaining classes correspond to different ways of defining the distances, 
     26such as Manhattan or Euclidean distance. 
     27 
     28Unknown values are treated correctly only by Euclidean and Relief distance. 
     29For other measures of distance, a distance between unknown and known or between 
     30two unknown values is always 0.5. 
     31 
     32.. class:: ExamplesDistance 
     33 
     34    .. method:: __call__(instance1, instance2) 
     35 
     36        Returns a distance between the given instances as a floating point number.  
     37 
     38.. class:: ExamplesDistanceConstructor 
     39 
     40    .. method:: __call__([instances, weightID][, DomainDistributions][, DomainBasicAttrStat]) 
     41 
     42        Constructs an instance of ExamplesDistance. 
     43        Not all the data needs to be given. Most measures can be constructed 
     44        from DomainBasicAttrStat; if it is not given, they can compute what they 
     45        need from either instances or DomainDistributions. 
     46        Some (e.g. ExamplesDistance_Hamming) do not need any arguments at all. 
     47 
     48.. class:: ExamplesDistance_Normalized 
     49 
     50    This abstract class provides a function which is given two instances 
     51    and returns a list of normalized distances between values of their 
     52    features. Many distance measuring classes need such a function and are 
     53    therefore derived from this class. 
     54 
     55    .. attribute:: normalizers 
     56     
     57        A precomputed list of normalizing factors for feature values 
     58         
     59        - If a factor is positive, differences in the feature's values 
     60          are multiplied by it; for continuous features the factor 
     61          is 1/(max_value-min_value) and for ordinal features 
     62          the factor is 1/number_of_values. If either (or both) of 
     63          the values are unknown, the distance is 0.5. 
     64        - If a factor is -1, the feature is nominal; the distance 
     65          between two values is 0 if they are the same (or at least 
     66          one is unknown) and 1 if they are different. 
     67        - If a factor is 0, the feature is ignored. 
     68 
     69    .. attribute:: bases, averages, variances 
     70 
     71        The minimal values, averages and variances 
     72        (continuous features only) 
     73 
     74    .. attribute:: domainVersion 
     75 
     76        Stores a domain version for which the normalizers were computed. 
     77        The domain version is increased each time a domain description is 
     78        changed (i.e. features are added or removed); this is used for a quick 
     79        check that the user is not attempting to measure distances between 
     80        instances that do not correspond to normalizers. 
     81        Since domains are practically immutable (especially from Python), 
     82        you don't need to care about this anyway.  
     83 
     84    .. method:: attributeDistances(instance1, instance2) 
     85 
     86        Returns a list of floats representing distances between pairs of 
     87        feature values of the two instances. 
     88 
     89 
     90.. class:: Hamming, HammingConstructor 
     91 
     92    Hamming distance between two instances is defined as the number of 
     93    features in which the two instances differ. Note that this measure 
     94    is not really appropriate for instances that contain continuous features. 
     95 
     96 
     97.. class:: Maximal, MaximalConstructor 
     98 
     99    The maximal distance between two instances is defined as the largest distance 
     100    between two corresponding feature values. If dist is the result of 
     101    ExamplesDistance_Normalized.attributeDistances, 
     102    then Maximal returns max(dist). 
     103 
     104 
     105.. class:: Manhattan, ManhattanConstructor 
     106 
     107    Manhattan distance between two instances is the sum of absolute values 
     108    of distances between pairs of features, e.g. ``sum(abs(x) for x in dist)`` 
     109    where dist is the result of ExamplesDistance_Normalized.attributeDistances. 
     110 
     111.. class:: Euclidean, EuclideanConstructor 
     112 
     113 
     114    Euclidean distance is the square root of the sum of squared per-feature distances, 
     115    i.e. ``sqrt(sum(x*x for x in dist))``, where dist is the result of 
     116    ExamplesDistance_Normalized.attributeDistances. 
     117 
     118    .. attribute:: distributions  
     119 
     120        An object of type DomainDistributions that holds the distributions 
     121        for all discrete features. This is needed to compute distances between 
     122        known and unknown values. 
     123 
     124    .. attribute:: bothSpecialDist 
     125 
     126        A list containing the distance between two unknown values for each 
     127        discrete feature. 
     128 
     129    This measure of distance deals with unknown values by computing the 
     130    expected square of distance based on the distribution obtained from the 
     131    "training" data. Squared distance between 
     132 
     133        - A known and unknown continuous attribute equals squared distance 
     134          between the known and the average, plus variance 
     135        - Two unknown continuous attributes equals double variance 
     136        - A known and unknown discrete attribute equals the probability 
     137          that the unknown attribute has a different value than the known 
     138          (i.e., 1 - probability of the known value) 
     139        - Two unknown discrete attributes equals the probability that two 
     140          randomly chosen values are different, which can be computed as 
     141          1 - sum of squares of probabilities. 
     142 
     143    Continuous cases can be handled by averages and variances inherited from 
     144    ExamplesDistance_Normalized. The data for discrete cases are stored in 
     145    distributions (used for unknown vs. known value) and in bothSpecialDist 
     146    (the precomputed distance between two unknown values). 
     147 
     148.. class:: Relief, ReliefConstructor 
     149 
     150    Relief is similar to Manhattan distance, but incorporates a more 
     151    correct treatment of undefined values, which is used by the ReliefF measure. 
     152 
     153    This class is derived directly from ExamplesDistance, not from ExamplesDistance_Normalized. 
     154             
     155 
     156.. autoclass:: PearsonR 
     157    :members: 
     158 
     159.. autoclass:: SpearmanR 
     160    :members: 
     161 
     162.. autoclass:: PearsonRConstructor 
     163    :members: 
     164 
     165.. autoclass:: SpearmanRConstructor 
     166    :members:     
     167 
     168 
     169""" 
     170 
     171import Orange 
     172 
    1173from orange import \ 
    2          AlignmentList, \ 
    3     DistanceMap, \ 
    4     DistanceMapConstructor, \ 
    5     ExampleDistConstructor, \ 
    6          ExampleDistBySorting, \ 
    7     ExampleDistVector, \ 
    8     ExamplesDistance, \ 
    9          ExamplesDistance_Hamming, \ 
    10          ExamplesDistance_Normalized, \ 
    11               ExamplesDistance_DTW, \ 
    12               ExamplesDistance_Euclidean, \ 
    13               ExamplesDistance_Manhattan, \ 
    14               ExamplesDistance_Maximal, \ 
    15          ExamplesDistance_Relief, \ 
    16     ExamplesDistanceConstructor, \ 
    17          ExamplesDistanceConstructor_DTW, \ 
    18          ExamplesDistanceConstructor_Euclidean, \ 
    19          ExamplesDistanceConstructor_Hamming, \ 
    20          ExamplesDistanceConstructor_Manhattan, \ 
    21          ExamplesDistanceConstructor_Maximal, \ 
    22          ExamplesDistanceConstructor_Relief 
     174     AlignmentList, \ 
     175     DistanceMap, \ 
     176     DistanceMapConstructor, \ 
     177     ExampleDistConstructor, \ 
     178     ExampleDistBySorting, \ 
     179     ExampleDistVector, \ 
     180     ExamplesDistance, \ 
     181     ExamplesDistance_Normalized, \ 
     182     ExamplesDistanceConstructor 
     183 
     184from orange import ExamplesDistance_Hamming as Hamming 
     185from orange import ExamplesDistance_DTW as DTW 
     186from orange import ExamplesDistance_Euclidean as Euclidean 
     187from orange import ExamplesDistance_Manhattan as Manhattan 
     188from orange import ExamplesDistance_Maximal as Maximal 
     189from orange import ExamplesDistance_Relief as Relief 
     190 
     191from orange import ExamplesDistanceConstructor_DTW as DTWConstructor 
     192from orange import ExamplesDistanceConstructor_Euclidean as EuclideanConstructor 
     193from orange import ExamplesDistanceConstructor_Hamming as HammingConstructor 
     194from orange import ExamplesDistanceConstructor_Manhattan as ManhattanConstructor 
     195from orange import ExamplesDistanceConstructor_Maximal as MaximalConstructor 
     196from orange import ExamplesDistanceConstructor_Relief as ReliefConstructor 
     197 
     198import statc 
     199 
     200class PearsonRConstructor(ExamplesDistanceConstructor): 
     201    """Constructs an instance of PearsonR. Not all the data needs to be given.""" 
     202     
     203    def __new__(cls, data=None, **argkw): 
     204        self = ExamplesDistanceConstructor.__new__(cls, **argkw) 
     205        self.__dict__.update(argkw) 
     206        if data: 
     207            return self.__call__(data) 
     208        else: 
     209            return self 
     210 
     211    def __call__(self, table): 
     212        indxs = [i for i, a in enumerate(table.domain.attributes) \ 
     213                 if a.varType==Orange.data.Type.Continuous] 
     214        return PearsonR(domain=table.domain, indxs=indxs) 
     215 
     216class PearsonR(ExamplesDistance): 
     217    """ 
     218    `Pearson correlation coefficient 
     219    <http://en.wikipedia.org/wiki/Pearson_product-moment\ 
     220    _correlation_coefficient>`_ 
     221    """ 
     222 
     223    def __init__(self, **argkw): 
     224        self.__dict__.update(argkw) 
     225         
     226    def __call__(self, e1, e2): 
     227        """ 
     228        :param e1: a data instance. 
     229        :param e2: a data instance. 
     230         
     231        Returns Pearson's dissimilarity between e1 and e2, 
     232        i.e. (1-r)/2 where r is Pearson's correlation coefficient. 
     233        """ 
     234        X1 = [] 
     235        X2 = [] 
     236        for i in self.indxs: 
     237            if not(e1[i].isSpecial() or e2[i].isSpecial()): 
     238                X1.append(float(e1[i])) 
     239                X2.append(float(e2[i])) 
     240        if not X1: 
     241            return 1.0 
     242        try: 
     243            return (1.0 - statc.pearsonr(X1, X2)[0]) / 2. 
     244        except: 
     245            return 1.0 
     246 
     247class SpearmanRConstructor(ExamplesDistanceConstructor): 
     248    """Constructs an instance of SpearmanR. Not all the data needs to be given.""" 
     249     
     250    def __new__(cls, data=None, **argkw): 
     251        self = ExamplesDistanceConstructor.__new__(cls, **argkw) 
     252        self.__dict__.update(argkw) 
     253        if data: 
     254            return self.__call__(data) 
     255        else: 
     256            return self 
     257 
     258    def __call__(self, table): 
     259        indxs = [i for i, a in enumerate(table.domain.attributes) \ 
     260                 if a.varType==Orange.data.Type.Continuous] 
     261        return SpearmanR(domain=table.domain, indxs=indxs) 
     262 
     263class SpearmanR(ExamplesDistance):   
     264 
     265    """`Spearman's rank correlation coefficient 
     266    <http://en.wikipedia.org/wiki/Spearman%27s_rank_\ 
     267    correlation_coefficient>`_""" 
     268 
     269    def __init__(self, **argkw): 
     270        self.__dict__.update(argkw) 
     271         
     272    def __call__(self, e1, e2): 
     273        """ 
     274        :param e1: a data instance. 
     275        :param e2: a data instance. 
     276         
     277        Returns Spearman's dissimilarity between e1 and e2, 
     278        i.e. (1-r)/2 where r is Spearman's rank coefficient. 
     279        """ 
     280        X1 = []; X2 = [] 
     281        for i in self.indxs: 
     282            if not(e1[i].isSpecial() or e2[i].isSpecial()): 
     283                X1.append(float(e1[i])) 
     284                X2.append(float(e2[i])) 
     285        if not X1: 
     286            return 1.0 
     287        try: 
     288            return (1.0 - statc.spearmanr(X1, X2)[0]) / 2. 
     289        except: 
     290            return 1.0 
     291 
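
The correlation-based dissimilarity that PearsonR computes, (1-r)/2 over the features known in both instances, can be illustrated without Orange or statc. The following is only a sketch under stated assumptions: instances are plain Python lists, `None` stands in for an unknown value, and the helper names `pearson_r` and `pearson_dissimilarity` are hypothetical, not part of the changeset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # r is undefined when either sequence is constant
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pearson_dissimilarity(row1, row2):
    """Return (1 - r) / 2 using only positions where both values are known.

    None is a stand-in for an unknown value (an assumption of this sketch).
    """
    pairs = [(a, b) for a, b in zip(row1, row2)
             if a is not None and b is not None]
    if not pairs:
        return 1.0  # nothing to compare: maximal dissimilarity
    xs, ys = zip(*pairs)
    try:
        return (1.0 - pearson_r(xs, ys)) / 2.0
    except ValueError:
        return 1.0  # undefined correlation is also treated as maximal
```

Mapping r from [-1, 1] to (1-r)/2 gives a value in [0, 1]: perfectly correlated instances are at distance 0 and perfectly anti-correlated ones at distance 1. Like the changeset code, the sketch falls back to 1.0 when no correlation can be computed.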
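
The rules the docstring gives for how the Euclidean measure treats unknown values can be written out as a small sketch. This is not Orange's implementation: `None` stands in for an unknown value, normalization is left out, and the function names and the `probabilities` dictionary (value to relative frequency in the training data) are hypothetical.

```python
def squared_distance_continuous(a, b, mean, variance):
    """Expected squared distance between two continuous values.

    mean and variance describe the feature in the training data.
    """
    if a is None and b is None:
        return 2.0 * variance  # two unknowns: double variance
    if a is None or b is None:
        known = b if a is None else a
        # squared distance to the average, plus variance
        return (known - mean) ** 2 + variance
    return (a - b) ** 2


def squared_distance_discrete(a, b, probabilities):
    """Expected squared distance between two discrete values."""
    if a is None and b is None:
        # probability that two randomly chosen values differ
        return 1.0 - sum(p * p for p in probabilities.values())
    if a is None or b is None:
        known = b if a is None else a
        # probability that the unknown value differs from the known one
        return 1.0 - probabilities[known]
    return 0.0 if a == b else 1.0
```

The two-unknowns discrete case depends only on the distribution, which is why it can be precomputed once per feature, as the docstring says of bothSpecialDist.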