Last Updated: February 25, 2016
·
2.569K
· thekidcoder

Postgres Substring Search by Index

On a recent project I had to create an autosuggest feature that could look at 1.5 million rows and suggest based on sub-strings. I looked into a redis-based solution, a custom built solution, and an engine like Solr. I really didnt like the extra complexity that these solutions created, so I went looking into how to do it in postgres. The solution I came up with is very simple and works incredibly fast, <5ms.

It uses indexes and sub-string matching to do it and took only a few seconds to setup.

This being a rails 3 project, the first thing to do was create the migration to create the indexes:

class AddSubstringIndexes < ActiveRecord::Migration
  def up
    30.times {|i| execute "CREATE INDEX charity_name_substring_#{i+1} ON charities USING gin(to_tsvector('english', substring(name from 1 for #{i+1})));"}
  end

  def down
    30.times {|i| execute "DROP INDEX charity_name_substring_#{i+1};"}
  end
end

The migration will create a sub-string matcher that will work up to a string 30-characters long. It would be possible to add more linguistic matching functions to this such as fuzzystrmatch & soundex, but that wasnt necessary here.

Now that we have our indexes setup we can begin querying. Adding a suggest class method onto our model makes it easy to use throughout our app.


def self.suggest(query, options = {})
      options.reverse_merge!({limit: 50})
      rank = <<-RANK
        ts_rank(to_tsvector(name), plainto_tsquery(#{sanitize(query)}))
      RANK
      ActiveRecord::Base.connection.select_all("SELECT id, name, mailing_address FROM charities WHERE to_tsvector('english', substring(name from 1 for #{query.length})) @@ plainto_tsquery('english', #{sanitize(query)}) ORDER BY #{rank} DESC LIMIT #{options[:limit]}")
end

We want the fastest possible query so we are gonna bypass active record and go for a straight SQL statement.

The method above takes a limit argument and uses reverse_merge to create a default.
We then create a rank method for postgres to use.

Finally we execute the query.