Postgres Substring Search by Index
On a recent project I had to create an autosuggest feature that could look at 1.5 million rows and suggest based on sub-strings. I looked into a redis-based solution, a custom built solution, and an engine like Solr. I really didnt like the extra complexity that these solutions created, so I went looking into how to do it in postgres. The solution I came up with is very simple and works incredibly fast, <5ms.
It uses indexes and sub-string matching to do it and took only a few seconds to setup.
This being a rails 3 project, the first thing to do was create the migration to create the indexes:
class AddSubstringIndexes < ActiveRecord::Migration
def up
30.times {|i| execute "CREATE INDEX charity_name_substring_#{i+1} ON charities USING gin(to_tsvector('english', substring(name from 1 for #{i+1})));"}
end
def down
30.times {|i| execute "DROP INDEX charity_name_substring_#{i+1};"}
end
end
The migration will create a sub-string matcher that will work up to a string 30-characters long. It would be possible to add more linguistic matching functions to this such as fuzzystrmatch & soundex, but that wasnt necessary here.
Now that we have our indexes setup we can begin querying. Adding a suggest class method onto our model makes it easy to use throughout our app.
def self.suggest(query, options = {})
options.reverse_merge!({limit: 50})
rank = <<-RANK
ts_rank(to_tsvector(name), plainto_tsquery(#{sanitize(query)}))
RANK
ActiveRecord::Base.connection.select_all("SELECT id, name, mailing_address FROM charities WHERE to_tsvector('english', substring(name from 1 for #{query.length})) @@ plainto_tsquery('english', #{sanitize(query)}) ORDER BY #{rank} DESC LIMIT #{options[:limit]}")
end
We want the fastest possible query so we are gonna bypass active record and go for a straight SQL statement.
The method above takes a limit argument and uses reverse_merge to create a default.
We then create a rank method for postgres to use.
Finally we execute the query.