Using Postgres Regex expressions to find very specific matches
Coderwall has been having issues with certain avatars generating 403 errors in the browser console. You probably wouldn't notice unless you had your dev tools open while you were browsing but there were a little over 25% of our Twitter avatars that weren't rendering properly due to trying to connect via HTTP instead of HTTPS. (And other changes to how Twitter resolves it's profile images but that's a bigger issue.) Fortunately with the power of Ruby and Postgres Regex selectors it's relatively trivial to find and transform the HTTP urls to use HTTPS.
User.where("twitter_token is not null AND thumbnail_url ~ '^http:' AND thumbnail_url ~ 'twimg\.com'").find_each(batch_size: 500) do |user|
begin
url = URI.parse(user.profile_url)
puts "Update #{user.username} because #{user.profile_url} appears to be HTTP."
url.scheme = 'https'
user.update_attribute(:thumbnail_url, url.to_s)
puts " ==> #{user.profile_url}"
rescue URI::InvalidURIError
ap url
end
end
Our User
model has a field thumbnail_url
which holds the URL for users who log into Coderwall via OAuth. Since the issue was currently a problem predominantly for Twitter logins, and LinkedIn doesn't allow for fetching the thumbnails via HTTPS, I first limited the query to known Twitter login users.
Next I want to filter by users who have thumbnail_url
's that are not already using HTTPS. That number was near-zero after some poking I did to verify the issue, but better to be safe than sorry and it also helps with re-running the script, no sense in selecting a record that's already been updated. I used the Postgres Regex matcher thumbnail_url ~ '^http:'
but could very well have used thumbnail_url ilike 'http:%'
which can still utilize some indexes if available. The ilike
vs 'like` is also preferable in this case because it is case-insensitive
The last part of the query thumbnail_url ~ 'twimg\.com'
could have also been thumbnail_url ilike '%twimg.com%'
but was used because I was refactoring from matchers that were inside the block as I was testing this little bit of functionality.
Then I used the find_each(batch_size: 500)
to fetch the records in batches of 500 to avoid excessive queries. Given this batch size I did about 56+ select
s instead of 28,000+.
We can fetch Avatars from a few different places so we abstract the url via the profile_url
method. I could have just fetched from thumbnail_url
but it didn't make any difference in this case. I load the url into URI.parse
so I can manipulate the url without string manipulation. I convert the scheme
on the URI instance and then update the attribute on the model. The puts
and ap
(awesome_print) statements are just there to help me as I watch the process run.
While the query could have further been simplified via ilike
wildcard statements the Postgres Regex expressions are very useful for matching data inside a Postgres instance and let you have extremely fine grained control over the your results without having to pull more than you absolutely must from the database. They come at a cost of being rather non-index friendly but they are a useful tool to have at your disposal when you need it.
To fix the Avatar issues I'll probably have to refetch the profile url's from Twitter but that's another article. :)