Is Cosine-Similarity of Embeddings Really About Similarity?

Author
Harald Steck, Chaitanya Ekanadham, Nathan Kallus
Year
2024
Is Cosine-Similarity of Embeddings Really About Similarity?

Harald Steck, Chaitanya Ekanadham, Nathan Kallus. 2024. (View Paper →)

Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’ For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations are employed when learning deep models; these have implicit and unintended effects when taking cosine-similarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.
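A minimal sketch of the paper's core argument, with made-up matrices and dimensions rather than the authors' code: in a matrix-factorization model X̂ = A Bᵀ, any invertible diagonal rescaling A → AD, B → BD⁻¹ leaves every prediction (and hence every user-item dot product) unchanged, yet it changes the cosine similarities between the item (or user) embedding rows, so those similarities are not uniquely determined by the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix-factorization embeddings (illustrative sizes): X_hat = A @ B.T
A = rng.normal(size=(5, 3))   # user embeddings
B = rng.normal(size=(7, 3))   # item embeddings

# Arbitrary invertible diagonal rescaling of the latent dimensions
D = np.diag([10.0, 0.1, 3.0])
A2 = A @ D
B2 = B @ np.linalg.inv(D)

# Predictions (and hence the fit of A @ B.T) are unchanged ...
assert np.allclose(A @ B.T, A2 @ B2.T)

def cosine(M):
    """Pairwise cosine similarities between the rows of M."""
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    return Mn @ Mn.T

# ... but the item-item cosine similarities are not.
print(np.max(np.abs(cosine(B) - cosine(B2))))  # noticeably > 0
```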

The paper emphasises the need for caution when applying cosine similarity to learned embeddings, since the resulting "similarities" may not reflect the semantic similarity they are assumed to capture. Product managers should validate the embedding techniques and similarity metrics they rely on rather than treating them as black boxes. The authors propose remedies such as training directly with respect to cosine similarity, avoiding the embedding space altogether (e.g. projecting back into the original space before comparing, as sketched below), or applying normalisation/debiasing before or during training rather than only afterwards.
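One of these remedies, sketched below under the same toy setup as above (names and dimensions are illustrative, not the authors' code): rather than taking cosine similarity between embedding rows, compare items in the original user-item space via the reconstructed matrix X̂ = A Bᵀ, which is invariant to the rescaling that makes embedding-space cosine similarities arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))       # toy user embeddings (illustrative)
B = rng.normal(size=(7, 3))       # toy item embeddings (illustrative)
D = np.diag([10.0, 0.1, 3.0])     # arbitrary rescaling of the latent axes

def item_cosine_in_original_space(A, B):
    """Item-item cosine similarity computed on columns of X_hat = A @ B.T,
    i.e. in the original user-item space rather than the embedding space."""
    X_hat = A @ B.T
    cols = X_hat / np.linalg.norm(X_hat, axis=0, keepdims=True)
    return cols.T @ cols

# Invariant to the rescaling that made embedding-space cosines arbitrary:
assert np.allclose(item_cosine_in_original_space(A, B),
                   item_cosine_in_original_space(A @ D, B @ np.linalg.inv(D)))
```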

It also underscores the need for transparency and explainability in machine-learning systems, particularly where cosine similarity is used as a proxy for semantic similarity, and it encourages exploring and integrating more robust alternative similarity measures. These insights can guide strategic decisions in product development, improving product effectiveness, transparency, and user satisfaction.