Last Updated: February 25, 2016
· just3ws

Using Postgres Regex expressions to find very specific matches

Coderwall has been having issues with certain avatars generating 403 errors in the browser console. You probably wouldn't notice unless you had your dev tools open while you were browsing but there were a little over 25% of our Twitter avatars that weren't rendering properly due to trying to connect via HTTP instead of HTTPS. (And other changes to how Twitter resolves it's profile images but that's a bigger issue.) Fortunately with the power of Ruby and Postgres Regex selectors it's relatively trivial to find and transform the HTTP urls to use HTTPS.

User.where("twitter_token is not null AND thumbnail_url ~ '^http:' AND thumbnail_url ~ 'twimg\.com'").find_each(batch_size: 500) do |user|
    url = URI.parse(user.profile_url)
    puts "Update #{user.username} because #{user.profile_url} appears to be HTTP."
    url.scheme = 'https'
    user.update_attribute(:thumbnail_url, url.to_s)
    puts " ==> #{user.profile_url}"
  rescue URI::InvalidURIError
    ap url

Our User model has a field thumbnail_url which holds the URL for users who log into Coderwall via OAuth. Since the issue was currently a problem predominantly for Twitter logins, and LinkedIn doesn't allow for fetching the thumbnails via HTTPS, I first limited the query to known Twitter login users.

Next I want to filter by users who have thumbnail_url's that are not already using HTTPS. That number was near-zero after some poking I did to verify the issue, but better to be safe than sorry and it also helps with re-running the script, no sense in selecting a record that's already been updated. I used the Postgres Regex matcher thumbnail_url ~ '^http:' but could very well have used thumbnail_url ilike 'http:%' which can still utilize some indexes if available. The ilike vs 'like` is also preferable in this case because it is case-insensitive

The last part of the query thumbnail_url ~ 'twimg\.com' could have also been thumbnail_url ilike '' but was used because I was refactoring from matchers that were inside the block as I was testing this little bit of functionality.

Then I used the find_each(batch_size: 500) to fetch the records in batches of 500 to avoid excessive queries. Given this batch size I did about 56+ selects instead of 28,000+.

We can fetch Avatars from a few different places so we abstract the url via the profile_url method. I could have just fetched from thumbnail_url but it didn't make any difference in this case. I load the url into URI.parse so I can manipulate the url without string manipulation. I convert the scheme on the URI instance and then update the attribute on the model. The puts and ap (awesome_print) statements are just there to help me as I watch the process run.

While the query could have further been simplified via ilike wildcard statements the Postgres Regex expressions are very useful for matching data inside a Postgres instance and let you have extremely fine grained control over the your results without having to pull more than you absolutely must from the database. They come at a cost of being rather non-index friendly but they are a useful tool to have at your disposal when you need it.

To fix the Avatar issues I'll probably have to refetch the profile url's from Twitter but that's another article. :)