Last Updated: February 25, 2016
· scott2b

A TwitterSearcher class for aggressive Twitter REST search using birdy

I've really been liking birdy for Twitter API consumption, mostly for its simplicity. But it's not without its quirks -- some stemming from the requests library, some from birdy's error handling, and some from the Twitter API itself.

Whether with birdy or any other Twitter API client library, I always seem to end up writing a bunch of scaffolding for managing the search API rate limit, pagination, etc. This gist is an attempt to handle all of that once and for all, and to deal with the quirks I've run into using the Twitter REST API in general.

The class is TwitterSearcher. It requires birdy and delorean. The gist is here: https://gist.github.com/scott2b/9219919

Using the Twitter API response headers to manage rate limiting eliminates the need to calculate delays and pace your queries yourself -- but do expect your queries to get delayed if you are doing aggressive searching.
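To make the header-based approach concrete, here is a minimal sketch of the wait calculation. It is not the code from the gist: the function name `seconds_until_reset` is hypothetical, it uses the standard library's `email.utils` rather than delorean, and it assumes the headers arrive as a mapping with lowercase keys. The `x-rate-limit-remaining` and `x-rate-limit-reset` headers are the ones Twitter sends on rate-limited endpoints; `date` is the ordinary HTTP Date header.

```python
from email.utils import parsedate_to_datetime


def seconds_until_reset(headers):
    """Return how many seconds to sleep before the next search call.

    `headers` is assumed to be a mapping of lowercase header names to
    string values (a hypothetical shape chosen for this sketch).
    """
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0  # calls are still available in the current window

    reset_epoch = int(headers["x-rate-limit-reset"])  # UTC epoch seconds

    # Compare against the server's clock (the Date header) instead of
    # time.time(), so a skewed local clock does not distort the wait.
    server_now = parsedate_to_datetime(headers["date"]).timestamp()
    return max(0.0, reset_epoch - server_now)
```

A caller would pass the result to `time.sleep()` before issuing the next request.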

As it is, this uses app-level authentication. It should be easy enough to modify the class to handle user authentication (at the cost of less throughput for a given searcher instance, due to API restrictions).

Usage looks something like this:

searcher = TwitterSearcher(
    TWITTER_CONSUMER_KEY,
    TWITTER_CONSUMER_SECRET,
    TWITTER_APP_CLIENT_ACCESS_TOKEN)
for query in my_query_generator():
    searcher.paginated_search(
        page_handler=my_page_handler,
        # see birdy AppClient docs and Twitter API docs for params
        # to pass in here:
        since_id=my_since_id,
        q=query,
        count=100,
        lang='en'
    )
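
The `my_page_handler` above is left undefined. As an illustration only, here is a sketch of what one might look like. The call signature is an assumption, not something taken from the gist: it supposes the handler receives one page of search results as a dict-like object with a `statuses` list, as in the Twitter search response body. Check the gist for what is actually passed in. The `collected` list is likewise hypothetical.

```python
# Hypothetical accumulator for this sketch; a real handler might write
# to a database or a queue instead.
collected = []


def my_page_handler(page):
    """Store (id, text) for every tweet on one page of search results.

    ASSUMPTION: `page` is a mapping with a "statuses" list, mirroring the
    Twitter search response body. The real handler signature may differ.
    """
    for status in page.get("statuses", []):
        collected.append((status["id"], status["text"]))
```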

TwitterSearcher will issue searches (and paginations) as quickly as you call them -- until the rate limit is hit, at which point it waits until the reset time specified by Twitter and then starts churning again.

The class uses the time provided in Twitter's response headers to avoid out-of-sync clock issues. It also handles the weird connection pool problem that requests does not propagate properly, reconnecting whenever a TwitterClientError matches the string "HTTPSConnectionPool". It will paginate up to the number of pages given to the instance (default_max_pages) or to the paginated_search method (max_pages). Pass in a page_handler callable to actually do something with the paginated responses.
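
The pagination cap and the reconnect-on-error behavior can be sketched roughly as follows. This is not the gist's implementation. The names `paginate`, `fetch_page`, and `reconnect` are hypothetical stand-ins so the example runs without a Twitter client, and the broad `except Exception` stands in for catching birdy's TwitterClientError. Walking backwards with `max_id` is the usual Twitter approach to paging through results, and I am assuming it here rather than quoting the class.

```python
def paginate(fetch_page, page_handler, max_pages=10, reconnect=None, **params):
    """Fetch up to `max_pages` pages, walking backwards with max_id.

    fetch_page(**params) is a hypothetical stand-in for the real client
    call; it must return a mapping with a "statuses" list.
    reconnect(), if given, is called before a single retry when the error
    message mentions "HTTPSConnectionPool".
    Returns the number of pages handed to page_handler.
    """
    pages = 0
    while pages < max_pages:
        try:
            page = fetch_page(**params)
        except Exception as exc:  # the real class checks TwitterClientError
            if reconnect is None or "HTTPSConnectionPool" not in str(exc):
                raise
            reconnect()  # e.g. build a fresh client object
            page = fetch_page(**params)  # one retry; a second failure propagates

        statuses = page.get("statuses", [])
        if not statuses:
            break  # nothing older left to fetch

        page_handler(page)
        pages += 1

        # Ask only for tweets older than the oldest one just seen.
        params["max_id"] = min(s["id"] for s in statuses) - 1
    return pages
```

Retrying only once keeps a persistent failure from turning into an endless loop; a more aggressive searcher might retry several times with a delay.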