Last Updated: February 25, 2016
· tedtanne

Distance and Similarity Metric Calculations

Whatever data sets you work with, there is a good chance you will need some kind of distance or similarity computation. Comparing samples across their features or dimensions usually comes down to dot products, sums of squares, and L2 or Frobenius norms, and writing these by hand gets messy quickly. For ease of use I have found the following function from scikit-learn (sklearn) very useful:

sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)

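This method takes either a vector array or a distance matrix and returns a distance matrix. If the input is a vector array, the distances are computed; if it is already a distance matrix (metric='precomputed'), it is returned as is. Of interest is the ability to take a distance matrix and "safely" preserve compatibility with other algorithms that take vector arrays and can operate on sparse data.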
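As a rough sketch of the two input modes described above: the sample points are made up for illustration, and the example assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical sample points, one row per sample.
X = np.array([[1.0, 0.0], [1.0, 1.0], [4.0, 4.0]])

# Vector array in, (n_samples x n_samples) distance matrix out.
D = pairwise_distances(X, metric="euclidean")

# A matrix that already holds distances is passed through
# when metric="precomputed" is given.
D_again = pairwise_distances(D, metric="precomputed")

print(D.shape)
print(np.allclose(D, D_again))
```

This is handy when one piece of code has to accept both raw feature vectors and distances computed elsewhere.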

The function can compute many different metrics. scikit-learn itself provides ‘euclidean’, ‘l2’, ‘l1’, ‘manhattan’ and ‘cityblock’, which also accept sparse input. In addition, the function can use the distance and similarity metrics in scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘cosine’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
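To get a feel for how the metric name changes the result, here is a small sketch that runs a few of the names listed above on the same two points. The choice of metrics and the sparse example are only illustrative, and which names are available may depend on the installed scikit-learn and SciPy versions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances

X = np.array([[1.0, 0.0], [1.0, 1.0]])

# Same data, three different metric names.
results = {}
for name in ("manhattan", "cosine", "chebyshev"):
    results[name] = pairwise_distances(X, metric=name)
    print(name)
    print(results[name])

# Sparse input works with the scikit-learn metrics such as 'euclidean'.
D_sparse = pairwise_distances(csr_matrix(X), metric="euclidean")
print(D_sparse)
```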

In [34]: from sklearn.metrics.pairwise import euclidean_distances
In [35]: X = [[1,0],[1,1]]

In [36]: euclidean_distances(X,X)
Out[36]:
array([[ 0.,  1.],
       [ 1.,  0.]])
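For comparison, the same matrix can be built by hand from dot products and squared norms, using the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y. This is only a sketch of the manual bookkeeping the library saves you; the variable names are my own and it is not meant to mirror scikit-learn's internals.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[1.0, 0.0], [1.0, 1.0]])

# Squared L2 norm of every row.
sq_norms = (X ** 2).sum(axis=1)

# All pairwise dot products (Gram matrix).
gram = X @ X.T

# ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, broadcast over all pairs.
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram

# Clip tiny negative values caused by rounding before taking the root.
manual = np.sqrt(np.maximum(sq_dists, 0.0))

print(np.allclose(manual, euclidean_distances(X, X)))
```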

For more info: http://scikit-learn.org/