Distance (distance)

The following example demonstrates how to compute distances between two instances:

import Orange

data = Orange.data.Table("iris")
# Build a Distance with a DistanceConstructor.
measure = Orange.distance.Euclidean(data)
print "Distance between first two examples:", \
    measure(data[0], data[1])  # use the Distance

A matrix with all pairwise distances can be computed with distance_matrix:

matrix = Orange.distance.distance_matrix(data)
print "Distance between first two examples:", \
    matrix[0,1]

Only the Euclidean and Relief distances treat unknown values correctly. For other measures, the distance between an unknown and a known value, or between two unknown values, is always 0.5.

Computing distances

Distance measures typically have to be adjusted to the data. For instance, when the data set contains continuous features, the distances between continuous values should be normalized so that all features have a similar impact, e.g. by dividing the distance by the range of the feature.
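The effect of dividing by the range can be illustrated with a minimal sketch in plain Python. It does not use Orange; the function name and the feature ranges are made up for illustration.

```python
def range_normalized_difference(a, b, min_value, max_value):
    """Absolute difference of two continuous values, divided by the feature's range."""
    span = max_value - min_value
    if span == 0:
        return 0.0  # a constant feature cannot separate instances
    return abs(a - b) / float(span)

# Hypothetical features: height in cm (range 150..200), weight in kg (range 40..90).
# The raw differences are 50 and 10; after normalization both lie in [0, 1].
height_diff = range_normalized_difference(150.0, 200.0, 150.0, 200.0)  # 1.0
weight_diff = range_normalized_difference(50.0, 60.0, 40.0, 90.0)      # 0.2
```

Without this step, a feature measured on a large scale would dominate any sum of per-feature differences.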

Distance measures thus appear in pairs: a constructor that is adjusted to the data, and the distance that it constructs:

class Orange.distance.DistanceConstructor
__call__([data, weightID][, distributions][, basic_stat])

Constructs a Distance. Not all arguments are required. Most measures can be constructed from basic_stat; if it is not given, instances or distributions can be used instead.

class Orange.distance.Distance
__call__(instance1, instance2)

Return a distance between the given instances (as a floating point number).
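The constructor/distance pairing can be sketched in plain Python as follows. This is not Orange's implementation: both class names are hypothetical, the data is a list of numeric tuples rather than an Orange.data.Table, and the constructor accepts only the data argument.

```python
class RangeScaledManhattanDistance(object):
    """Hypothetical 'Distance': a callable holding factors fitted to the data."""

    def __init__(self, factors):
        self.factors = factors

    def __call__(self, instance1, instance2):
        # Sum of absolute differences, each scaled by its feature's factor.
        return sum(f * abs(a - b)
                   for f, a, b in zip(self.factors, instance1, instance2))


class RangeScaledManhattan(object):
    """Hypothetical 'DistanceConstructor': derives the factors from the data."""

    def __call__(self, data):
        factors = []
        for column in zip(*data):
            span = max(column) - min(column)
            factors.append(1.0 / span if span else 0.0)
        return RangeScaledManhattanDistance(factors)


data = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0)]
measure = RangeScaledManhattan()(data)  # construct the distance from the data
d = measure(data[0], data[1])           # 0.5 * 1 + 0.05 * 20, i.e. about 1.5
```

The split lets the data-dependent statistics be computed once and reused for every pair of instances.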

Pairwise distances

Orange.distance.distance_matrix(data, distance_constructor=<type 'Orange.distance.Euclidean'>, progress_callback=None)

A helper function that computes an Orange.misc.SymMatrix of all pairwise distances between instances in data.

Parameters:
  • data (Orange.data.Table) – A data table
  • distance_constructor (Orange.distance.DistanceConstructor) – A DistanceConstructor instance (defaults to Euclidean).
  • progress_callback (function) – A function (taking one argument) to use for reporting progress.
Return type: Orange.misc.SymMatrix
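What such a helper does can be sketched in plain Python. The sketch below is an assumption-laden illustration, not the Orange function: it returns a list of lists instead of an Orange.misc.SymMatrix, takes a ready distance function instead of a constructor, and assumes the callback receives a percentage, which the description above does not specify.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(rows, distance=euclidean, progress_callback=None):
    """Return a symmetric list-of-lists with all pairwise distances."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Compute each pair once and mirror it across the diagonal.
            matrix[i][j] = matrix[j][i] = distance(rows[i], rows[j])
        if progress_callback is not None:
            progress_callback(100.0 * (i + 1) / n)  # assumed: percent done
    return matrix

reported = []
rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = pairwise_distances(rows, progress_callback=reported.append)
# matrix[0][1] == matrix[1][0] == 5.0 and matrix[0][2] == 10.0
```

Only the lower triangle is computed, which is the saving a symmetric matrix type makes possible.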

Measures

Distance measures are defined with two classes: a subclass of DistanceConstructor and a subclass of Distance.

class Orange.distance.Hamming
class Orange.distance.HammingDistance

The number of features in which the two instances differ. This measure is not appropriate for instances that contain continuous features.
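A minimal plain-Python sketch of this count, with instances represented as tuples (an assumption made for illustration; Orange works on its own instance objects):

```python
def hamming(instance1, instance2):
    """Number of positions in which the two instances have different values."""
    return sum(1 for a, b in zip(instance1, instance2) if a != b)

d = hamming(("red", "small", "round"), ("red", "large", "oval"))  # 2

# Continuous values that are almost equal still count as a full difference,
# which is why the measure suits discrete features only.
d_continuous = hamming((5.1,), (5.1000001,))  # 1
```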

class Orange.distance.Maximal
class Orange.distance.MaximalDistance

The maximal distance between two feature values. If dist is the result of feature_distances, then Maximal returns max(dist).

class Orange.distance.Manhattan
class Orange.distance.ManhattanDistance

The sum of absolute values of distances between pairs of features, e.g. sum(abs(x) for x in dist) where dist is the result of feature_distances.

class Orange.distance.Euclidean
class Orange.distance.EuclideanDistance

The square root of the sum of squared per-feature distances, i.e. sqrt(sum(x*x for x in dist)), where dist is the result of feature_distances.
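The aggregations used by Maximal, Manhattan and Euclidean can be compared on one list of per-feature distances. The values of dist below are made up; in Orange they would come from feature_distances.

```python
import math

dist = [0.5, 0.0, 0.25, 1.0]  # hypothetical per-feature distances

maximal = max(dist)                              # 1.0
manhattan = sum(abs(x) for x in dist)            # 1.75
euclidean = math.sqrt(sum(x * x for x in dist))  # sqrt(1.3125), roughly 1.146
```

For non-negative per-feature distances the three values are always ordered as maximal <= euclidean <= manhattan.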

distributions()

A Distribution containing the distributions for all discrete features used for computation of distances between known and unknown values.

both_special_dist()

A list containing the distance between two unknown values for each discrete feature.

Unknown values are handled by computing the expected squared distance based on the distribution from the “training” data. The squared distance between

  • a known and an unknown continuous value equals the squared distance between the known value and the average, plus the variance.
  • two unknown continuous values equals twice the variance.
  • a known and an unknown discrete value equals the probability that the unknown value differs from the known one (i.e., 1 - probability of the known value).
  • two unknown discrete values equals the probability that two randomly chosen values are different, which can be computed as 1 - sum of squares of probabilities.

Continuous cases are handled as inherited from DistanceNormalized. The data for discrete cases are stored in distributions (used for unknown vs. known value) and in both_special_dist (the precomputed distance between two unknown values).
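The four rules for unknown values can be written out as a small plain-Python sketch. The function names and the example statistics are hypothetical, and the sketch leaves out the normalization of continuous values.

```python
def sq_known_unknown_continuous(value, average, variance):
    # E[(value - X)^2] = (value - average)^2 + variance
    return (value - average) ** 2 + variance

def sq_both_unknown_continuous(variance):
    # E[(X - Y)^2] for independent X and Y with equal variance
    return 2.0 * variance

def sq_known_unknown_discrete(value, probabilities):
    # Probability that the unknown value differs from the known one
    return 1.0 - probabilities[value]

def sq_both_unknown_discrete(probabilities):
    # Probability that two independently drawn values differ
    return 1.0 - sum(p * p for p in probabilities.values())

probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}  # made-up distribution

c1 = sq_known_unknown_continuous(3.0, average=1.0, variance=2.0)  # 6.0
c2 = sq_both_unknown_continuous(variance=2.0)                      # 4.0
d1 = sq_known_unknown_discrete("a", probabilities)                 # 0.5
d2 = sq_both_unknown_discrete(probabilities)                       # 0.625
```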

class Orange.distance.Relief
class Orange.distance.ReliefDistance

Relief is similar to Manhattan distance, but treats undefined values in the way used by the ReliefF measure.

This class is derived directly from Distance.

class Orange.distance.PearsonR
class Orange.distance.PearsonRDistance(**argkw)

Pearson correlation coefficient.

__call__(e1, e2)
Parameters:
  • e1 – the first data instance.
  • e2 – the second data instance.

Returns Pearson’s dissimilarity between e1 and e2, i.e. (1-r)/2 where r is Pearson’s correlation coefficient.
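The mapping from correlation to dissimilarity can be sketched in plain Python. The sketch assumes each instance is a list of numeric feature values and that the correlation is taken across those values; it is an illustration, not the Orange implementation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def pearson_dissimilarity(x, y):
    # Maps r = 1 to 0, r = 0 to 0.5 and r = -1 to 1.
    return (1.0 - pearson_r(x, y)) / 2.0

same_trend = pearson_dissimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # 0.0
opposite_trend = pearson_dissimilarity([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # 1.0
```

The result always lies between 0 and 1, so perfectly correlated instances are at distance 0 even when their values differ in scale.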

class Orange.distance.SpearmanR
class Orange.distance.SpearmanRDistance(**argkw)

Spearman’s rank correlation coefficient.

__call__(e1, e2)
Parameters:
  • e1 – the first data instance.
  • e2 – the second data instance.

Returns Spearman’s dissimilarity between e1 and e2, i.e. (1-r)/2 where r is Spearman’s rank correlation coefficient.
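Spearman's coefficient is the Pearson correlation of the ranks, which the following plain-Python sketch makes explicit. It assumes numeric feature vectors and gives tied values the mean of their ranks; the helper names are hypothetical and this is not the Orange implementation.

```python
import math

def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def spearman_dissimilarity(x, y):
    r = pearson_r(average_ranks(x), average_ranks(y))
    return (1.0 - r) / 2.0

# A monotone but non-linear relation still gives r = 1, hence dissimilarity 0.
monotone = spearman_dissimilarity([10.0, 20.0, 30.0, 40.0], [1.0, 8.0, 27.0, 64.0])
reversed_order = spearman_dissimilarity([1.0, 2.0, 3.0, 4.0], [9.0, 7.0, 4.0, 1.0])
```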

class Orange.distance.Mahalanobis
class Orange.distance.MahalanobisDistance(domain, icm, **argkw)

Mahalanobis distance.

__call__(e1, e2)
Parameters:
  • e1 – the first data instance.
  • e2 – the second data instance.

Returns Mahalanobis distance between e1 and e2.
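As an illustration, the standard definition of the Mahalanobis distance, sqrt(d^T * icm * d) with d the difference of the two vectors, can be sketched in plain Python. The sketch assumes that icm stands for the inverse covariance matrix and represents it as a list of rows; the matrices below are made up.

```python
import math

def mahalanobis(x, y, icm):
    """sqrt(d^T * icm * d), where d = x - y and icm is an inverse covariance matrix."""
    d = [a - b for a, b in zip(x, y)]
    icm_d = [sum(row[j] * d[j] for j in range(len(d))) for row in icm]
    return math.sqrt(sum(d[i] * icm_d[i] for i in range(len(d))))

# Hypothetical covariance diag(4, 1), so its inverse is diag(0.25, 1).
icm = [[0.25, 0.0], [0.0, 1.0]]
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], icm)  # 1.0

# With the identity matrix the measure reduces to Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_plain = mahalanobis([3.0, 4.0], [0.0, 0.0], identity)  # 5.0
```

A difference of 2 along a feature with standard deviation 2 counts as one unit, which is how the measure accounts for the spread of each feature.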

Utilities

class Orange.distance.DistanceNormalized

An abstract class that provides normalization.

normalizers

A precomputed list of normalizing factors for feature values. They are:

  • 1/(max_value-min_value) for continuous and 1/number_of_values for ordinal features. These factors multiply the differences between the feature’s values. If either value is unknown, the distance is 0.5.
  • -1 for nominal features; the distance between two values is 0 if they are the same (or at least one is unknown) and 1 if they are different.
  • 0 for ignored features.

bases, averages, variances

The minimal values, averages and variances (continuous features only).

domain_version

The domain version changes each time a domain description is changed (i.e. features are added or removed).

feature_distances(instance1, instance2)

Return a list of floats representing normalized distances between pairs of feature values of the two instances.
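How the normalizers could drive such a per-feature computation is sketched below in plain Python. The sketch follows the description of normalizers literally and makes several assumptions: instances are tuples, None stands in for an unknown value, and ignored features contribute 0 rather than being skipped. It is not the Orange implementation.

```python
UNKNOWN = None  # stand-in for an unknown value in this sketch

def feature_distances(instance1, instance2, normalizers):
    dist = []
    for a, b, factor in zip(instance1, instance2, normalizers):
        if factor == 0:
            dist.append(0.0)  # ignored feature (assumed to contribute 0)
        elif factor < 0:
            # Nominal: 0 if equal or if either value is unknown, else 1.
            same = a is UNKNOWN or b is UNKNOWN or a == b
            dist.append(0.0 if same else 1.0)
        elif a is UNKNOWN or b is UNKNOWN:
            dist.append(0.5)  # continuous or ordinal with an unknown value
        else:
            dist.append(abs(a - b) * factor)
    return dist

# Hypothetical domain: a continuous feature with range 4, a nominal feature,
# and an ignored feature.
normalizers = [0.25, -1, 0]
known = feature_distances((1.0, "red", 7.0), (3.0, "blue", 9.0), normalizers)
# known == [0.5, 1.0, 0.0]
partly_unknown = feature_distances((UNKNOWN, UNKNOWN, 7.0), (3.0, "blue", 9.0),
                                   normalizers)
# partly_unknown == [0.5, 0.0, 0.0]
```

The resulting list is what the measures above aggregate, for example with max, sum or the square root of the sum of squares.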