Distance (distance)
The following example demonstrates how to compute distances between two instances:
import Orange
data = Orange.data.Table("iris")
# Build a distance measure from the data with a DistanceConstructor.
measure = Orange.distance.Euclidean(data)
# Call the resulting Distance on two instances.
print "Distance between first two examples:", \
    measure(data[0], data[1])
A matrix with all pairwise distances can be computed with distance_matrix:
matrix = Orange.distance.distance_matrix(data)
print "Distance between first two examples:", \
matrix[0,1]
Unknown values are treated correctly only by Euclidean and Relief distance. For other measures, a distance between unknown and known or between two unknown values is always 0.5.
Computing distances
Distance measures typically have to be adjusted to the data. For instance, when the data set contains continuous features, the distances between continuous values should be normalized so that all features have a similar impact, e.g. by dividing the distance by the feature's range.
Distance measures thus appear in pairs:
- a class that constructs the distance measure based on the data (subclass of DistanceConstructor, for example Euclidean), and returns it as
- a class that measures the distance between two instances (subclass of Distance, for example EuclideanDistance).
- class Orange.distance.DistanceConstructor
- class Orange.distance.Distance
- __call__(instance1, instance2)
Return a distance between the given instances (as a floating point number).
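The constructor/measure pairing can be illustrated without Orange. The following is a minimal sketch, not Orange's implementation: RangeNormalizedConstructor and RangeNormalizedDistance are hypothetical names, and instances are assumed to be plain lists of numbers. The constructor learns each feature's range from the data and returns a callable that computes a range-normalized Euclidean distance.

```python
import math


class RangeNormalizedDistance(object):
    """Hypothetical measure: holds the per-feature ranges learned from data."""

    def __init__(self, ranges):
        self.ranges = ranges

    def __call__(self, row1, row2):
        total = 0.0
        for a, b, r in zip(row1, row2, self.ranges):
            # A constant feature (range 0) contributes nothing.
            d = (a - b) / r if r else 0.0
            total += d * d
        return math.sqrt(total)


class RangeNormalizedConstructor(object):
    """Hypothetical constructor: fits the measure to the data."""

    def __call__(self, rows):
        columns = list(zip(*rows))
        ranges = [float(max(c) - min(c)) for c in columns]
        return RangeNormalizedDistance(ranges)


rows = [[0.0, 10.0], [1.0, 30.0], [2.0, 20.0]]
measure = RangeNormalizedConstructor()(rows)   # build the measure from the data
d01 = measure(rows[0], rows[1])                # use the measure
```

Without the normalization, the second feature would dominate the result simply because it is recorded on a larger scale.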
Pairwise distances
- Orange.distance.distance_matrix(data, distance_constructor=Orange.distance.Euclidean, progress_callback=None)
A helper function that computes an Orange.misc.SymMatrix of all pairwise distances between instances in data.
Parameters: - data (Orange.data.Table) – A data table
- distance_constructor (Orange.distance.DistanceConstructor) – A DistanceConstructor instance (defaults to Euclidean).
- progress_callback (function) – A function (taking one argument) used for reporting progress.
Return type: Orange.misc.SymMatrix
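What such a helper does can be sketched in plain Python. This is an illustration under stated assumptions, not Orange's code: pairwise_distances and lookup are hypothetical names, the symmetric matrix is stored as a list of lower-triangle rows rather than an Orange.misc.SymMatrix, and the progress callback is assumed to receive a percentage (the scale actually used by Orange is not stated above).

```python
def pairwise_distances(rows, measure, progress_callback=None):
    """Return the lower triangle (with diagonal) of all pairwise distances."""
    n = len(rows)
    lower = []
    for i in range(n):
        # Only j <= i is stored; the distance is assumed to be symmetric.
        lower.append([measure(rows[i], rows[j]) for j in range(i + 1)])
        if progress_callback is not None:
            progress_callback(100.0 * (i + 1) / n)
    return lower


def lookup(lower, i, j):
    """Symmetric access: (i, j) and (j, i) map to the same stored cell."""
    return lower[i][j] if j <= i else lower[j][i]


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


reported = []
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = pairwise_distances(rows, manhattan, reported.append)
d01 = lookup(matrix, 0, 1)
```

Storing only one triangle halves both the memory and the number of calls to the measure, which is the point of a symmetric matrix type.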
Measures
Distance measures are defined with two classes: a subclass of DistanceConstructor and a subclass of Distance.
- class Orange.distance.Hamming
- class Orange.distance.HammingDistance
The number of features in which the two instances differ. This measure is not appropriate for instances that contain continuous features.
- class Orange.distance.Maximal
- class Orange.distance.MaximalDistance
The maximal distance between two feature values. If dist is the result of feature_distances, then Maximal returns max(dist).
- class Orange.distance.Manhattan
- class Orange.distance.ManhattanDistance
The sum of absolute values of distances between pairs of features, i.e. sum(abs(x) for x in dist), where dist is the result of feature_distances.
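The three measures described so far differ only in how they aggregate per-feature information. The sketch below is illustrative: the function names are hypothetical, instances are plain lists, and dist is a hand-written stand-in for the list returned by feature_distances.

```python
def hamming(values1, values2):
    """Number of positions at which the two value lists differ."""
    return sum(1 for a, b in zip(values1, values2) if a != b)


def maximal(dist):
    """Largest per-feature distance."""
    return max(dist)


def manhattan(dist):
    """Sum of absolute per-feature distances."""
    return sum(abs(x) for x in dist)


dist = [0.25, 0.0, 0.5]   # stand-in for per-feature distances
n_differing = hamming(["red", "small", "round"], ["red", "large", "square"])
largest = maximal(dist)
total = manhattan(dist)
```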
- class Orange.distance.Euclidean
- class Orange.distance.EuclideanDistance
The square root of the sum of squared per-feature distances, i.e. sqrt(sum(x*x for x in dist)), where dist is the result of feature_distances.
- distributions()
A Distribution containing the distributions for all discrete features used for computation of distances between known and unknown values.
- both_special_dist()
A list containing the distance between two unknown values for each discrete feature.
Unknown values are handled by computing the expected squared distance based on the distribution from the “training” data. The squared distance between:
- a known and an unknown continuous value equals the squared distance between the known value and the average, plus the variance.
- two unknown continuous values equals twice the variance.
- a known and an unknown discrete value equals the probability that the unknown value differs from the known one (i.e., 1 - probability of the known value).
- two unknown discrete values equals the probability that two randomly chosen values differ, which can be computed as 1 - sum of squares of probabilities.
Continuous cases are handled as inherited from DistanceNormalized. The data for discrete cases are stored in distributions (used for unknown vs. known value) and in both_special_dist (the precomputed distance between two unknown values).
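The four rules for unknown values can be written down directly. The following sketch makes several assumptions that are not part of Orange: unknown values are represented by None, the helper names are hypothetical, the variance is the population variance of the training values, and normalization of continuous features is left out for brevity.

```python
def mean_and_variance(values):
    """Population mean and variance of the training values."""
    m = sum(values) / float(len(values))
    var = sum((v - m) ** 2 for v in values) / float(len(values))
    return m, var


def sq_dist_continuous(x, y, mean, var):
    """Expected squared distance; None marks an unknown value."""
    if x is None and y is None:
        return 2.0 * var
    if x is None or y is None:
        known = y if x is None else x
        return (known - mean) ** 2 + var
    return (x - y) ** 2


def sq_dist_discrete(a, b, probs):
    """Expected squared distance for a discrete feature with value probabilities."""
    if a is None and b is None:
        return 1.0 - sum(p * p for p in probs.values())
    if a is None or b is None:
        known = b if a is None else a
        return 1.0 - probs[known]
    return 0.0 if a == b else 1.0


training = [1.0, 2.0, 3.0, 6.0]
mean, var = mean_and_variance(training)
probs = {"a": 0.5, "b": 0.25, "c": 0.25}

known_vs_unknown = sq_dist_continuous(5.0, None, mean, var)
# Cross-check: average squared distance from 5.0 to every training value.
brute_force = sum((5.0 - v) ** 2 for v in training) / len(training)
```

The cross-check shows why the variance term appears: averaging the squared distance over all values the unknown could take gives the squared distance to the mean plus the spread around it.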
- class Orange.distance.Relief
- class Orange.distance.ReliefDistance
Relief is similar to Manhattan distance, but incorporates the treatment of undefined values that is used by the ReliefF measure.
This class is derived directly from Distance.
- class Orange.distance.PearsonR
- class Orange.distance.PearsonRDistance(**argkw)
Pearson correlation coefficient.
- __call__(e1, e2)
Parameters: - e1 – a data instance.
- e2 – a data instance.
Returns Pearson’s dissimilarity between e1 and e2, i.e. (1-r)/2, where r is Pearson’s correlation coefficient.
- class Orange.distance.SpearmanR
- class Orange.distance.SpearmanRDistance(**argkw)
Spearman’s rank correlation coefficient.
- __call__(e1, e2)
Parameters: - e1 – a data instance.
- e2 – a data instance.
Returns Spearman’s dissimilarity between e1 and e2, i.e. (1-r)/2, where r is Spearman’s rank correlation coefficient.
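Both correlation-based measures map a coefficient r in [-1, 1] to a dissimilarity (1-r)/2 in [0, 1]. The sketch below shows the arithmetic on plain lists of numbers. It is not Orange's implementation: the function names are hypothetical, Spearman's coefficient is computed as Pearson's coefficient on average ranks, and inputs with zero variance are not handled.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_dissimilarity(xs, ys):
    return (1.0 - pearson_r(xs, ys)) / 2.0


def spearman_dissimilarity(xs, ys):
    return (1.0 - pearson_r(average_ranks(xs), average_ranks(ys))) / 2.0


e1 = [1.0, 2.0, 3.0, 4.0]
e2 = [1.0, 4.0, 9.0, 16.0]   # monotone in e1 but not linear
e3 = [4.0, 3.0, 2.0, 1.0]    # e1 reversed
```

For e1 and e2 the Spearman dissimilarity is zero up to rounding, because the ranks agree exactly, while the Pearson dissimilarity is small but positive, because the relationship is not linear. For e1 and e3 both measures reach the maximum of 1.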
- class Orange.distance.Mahalanobis
Utilities
- class Orange.distance.DistanceNormalized
An abstract class that provides normalization.
- normalizers
A precomputed list of normalizing factors for feature values. They are:
- 1/(max_value-min_value) for continuous and 1/number_of_values for ordinal features. The difference between the two feature values is multiplied by this factor; if either value is unknown, the distance is 0.5.
- -1 for nominal features; the distance between two values is 0 if they are the same (or at least one is unknown) and 1 if they are different.
- 0 for ignored features.
- bases, averages, variances
The minimal values, averages and variances (continuous features only).
- domain_version
The domain version changes each time a domain description is changed (i.e. features are added or removed).
- feature_distances(instance1, instance2)
Return a list of floats representing normalized distances between pairs of feature values of the two instances.
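How the normalizing factors drive the per-feature distances can be sketched as follows. This is an illustration of the rules listed under normalizers, not Orange's code: instances are plain lists, unknown values are represented by None, and ignored features are assumed to contribute a distance of 0.

```python
def feature_distances(values1, values2, normalizers):
    """Per-feature normalized distances between two lists of values."""
    dist = []
    for a, b, factor in zip(values1, values2, normalizers):
        if factor == 0:
            # Ignored feature (assumed to contribute nothing).
            dist.append(0.0)
        elif factor < 0:
            # Nominal feature: 0 if equal or either value is unknown, else 1.
            same = a is None or b is None or a == b
            dist.append(0.0 if same else 1.0)
        elif a is None or b is None:
            # Continuous or ordinal feature with an unknown value.
            dist.append(0.5)
        else:
            # Continuous or ordinal feature: scaled absolute difference.
            dist.append(abs(a - b) * factor)
    return dist


# One continuous feature observed on [4, 8], one nominal, one ignored.
normalizers = [1.0 / (8.0 - 4.0), -1, 0]
known = feature_distances([5.0, "red", 7], [7.0, "blue", 9], normalizers)
with_unknowns = feature_distances([None, None, 1], [7.0, "blue", 2], normalizers)
```

A list like known is what the aggregations max(dist), sum(abs(x) for x in dist) and sqrt(sum(x*x for x in dist)) operate on.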