Last Updated: October 22, 2018
·
502
· carlosomar2

Finding Possible Duplicate Strings Using the Levenshtein Distance Metric in SQL

I was recently tasked with doing a report that would display addresses that might be shared between several users in our system. The report is going to be used by one reviewer that then will manually check the user account and verify that the address are indeed shared. Finding duplicate addresses involves finding strings that are very similar but not quite the same. Ie. 931 Main St should be reported to be the same as 931 Main Street. So the LIKE operator in SQL did not fulfill my needs. After some research and found the Liechtenstein distance algorithm and that helped me write the report.

The Levenshtein algorithm measures the difference between two strings. That is the minimum number of single-character edits that are required to change one string into the other. Single-character edits can be insertions, deletions, and substitutions.

I wrote a post about how I accomplished the task. https://nearsoft.com/blog/finding-duplicate-addresses-using-the-levenshtein-distance-metric-in-sql/. I found it amazing that as my career in SW continues I keep learning new stuff that seems pretty old that can be used to solve current problems. The business users were quite happy with the results.