Last Updated: February 25, 2016

Dynamic robots.txt in Rails

With certain Amazon Cloudfront setups (typically when the app itself is the distribution's origin), your entire site becomes available via the CDN url. For my site, Google then started showing two results for the same page: one with the canonical url, another with the Cloudfront url.

Canonical host

Cloudfront host

There are several ways to address this, two of which are:
* Canonical urls
* A dynamic robots.txt
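
The first strategy is not covered further in this post, so here is a minimal sketch of it in plain Ruby, outside of Rails. The helper name `canonical_link_tag`, the `CANONICAL_HOST` constant, and the `http` scheme are assumptions made for illustration. The idea is that the canonical link always names the canonical host, no matter which host served the page.

```ruby
require 'cgi'

# Hypothetical constant: the one host search engines should index.
CANONICAL_HOST = 'www.plugingeek.com'

# Build an absolute canonical <link> tag for a request path,
# regardless of whether the site or the CDN served the request.
def canonical_link_tag(path, host: CANONICAL_HOST, scheme: 'http')
  url = "#{scheme}://#{host}#{path}"
  %(<link rel="canonical" href="#{CGI.escapeHTML(url)}">)
end

puts canonical_link_tag('/plugins/42')
# => <link rel="canonical" href="http://www.plugingeek.com/plugins/42">
```

The tag belongs in the html head of every page. A relative href would not help here, because it resolves against whichever host served the page.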

Following the advice in an answer to a similar question on Stackoverflow, I currently use the following solution to render a dynamic robots.txt based on the request's host.

Routing

# config/routes.rb
#
# Dynamic robots.txt
get 'robots.:format' => 'robots#index'

Controller

# app/controllers/robots_controller.rb
class RobotsController < ApplicationController
  # No layout
  layout false

  # Render a robots.txt file based on whether the request
  # is performed against a canonical url or not
  # Prevent robots from indexing content served via a CDN twice
  def index
    if canonical_host?
      render 'allow'
    else
      render 'disallow'
    end
  end

  private

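  def canonical_host?
    # Anchor the pattern so that only plugingeek.com and its subdomains match,
    # not any host that merely contains the domain name
    request.host =~ /(\A|\.)plugingeek\.com\z/
  end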
end
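
Host matching is easy to get subtly wrong, since an unanchored pattern also matches any host that merely contains the domain. The following is a minimal sketch in plain Ruby, outside of Rails, of an anchored check. The constant name and the sample hostnames (including the Cloudfront one) are made up for illustration.

```ruby
# Hypothetical pattern: matches plugingeek.com and its subdomains only.
CANONICAL_HOST_PATTERN = /(\A|\.)plugingeek\.com\z/

# Return true or false instead of the match index or nil.
def canonical_host?(host)
  !!(host =~ CANONICAL_HOST_PATTERN)
end

hosts = %w[
  www.plugingeek.com
  plugingeek.com
  d1234abcd.cloudfront.net
  plugingeek.com.example.org
]

hosts.each do |host|
  puts "#{host}: #{canonical_host?(host)}"
end
# www.plugingeek.com: true
# plugingeek.com: true
# d1234abcd.cloudfront.net: false
# plugingeek.com.example.org: false
```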

Views

Based on the request.host we render one of two different .text.erb view files.

Allowing robots

# app/views/robots/allow.text.erb # Note the .text extension

# Allow robots to index the entire site except some specified routes
# rendered when site is visited with the default hostname
# http://www.robotstxt.org/

# ALLOW ROBOTS
User-agent: *
Disallow:

Banning spiders

# app/views/robots/disallow.text.erb # Note the .text extension

# Disallow robots from indexing any page on the site
# rendered when robot is visiting the site
# via the Cloudfront CDN URL
# to prevent duplicate indexing
# and search results referencing the Cloudfront URL

# DISALLOW ROBOTS
User-agent: *
Disallow: /
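
The two views differ only in their Disallow line: an empty value permits everything, while a single slash blocks the whole site. As an illustration of how specific routes could be excluded on the canonical host, here is a small plain-Ruby sketch. The method name `robots_txt` and the paths `/admin` and `/cart` are hypothetical and not part of the setup above.

```ruby
# Build a robots.txt body.
# An empty "Disallow:" permits everything, "Disallow: /" blocks the whole site.
def robots_txt(canonical:, disallowed_paths: [])
  lines = ['User-agent: *']
  if !canonical
    lines << 'Disallow: /'
  elsif disallowed_paths.empty?
    lines << 'Disallow:'
  else
    disallowed_paths.each { |path| lines << "Disallow: #{path}" }
  end
  lines.join("\n") + "\n"
end

puts robots_txt(canonical: true, disallowed_paths: ['/admin', '/cart'])
# User-agent: *
# Disallow: /admin
# Disallow: /cart
```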

Specs

Testing the setup with RSpec and Capybara can be done quite easily, too.

# spec/features/robots_spec.rb
require 'spec_helper'

feature "Robots" do
  context "canonical host" do
    scenario "allow robots to index the site" do
      Capybara.app_host = 'http://www.plugingeek.com'
      visit '/robots.txt'
      Capybara.app_host = nil

      expect(page).to have_content('# ALLOW ROBOTS')
      expect(page).to have_content('User-agent: *')
      expect(page).to have_content('Disallow:')
      expect(page).to have_no_content('Disallow: /')
    end
  end

  context "non-canonical host" do
    scenario "deny robots to index the site" do
      visit '/robots.txt'

      expect(page).to have_content('# DISALLOW ROBOTS')
      expect(page).to have_content('User-agent: *')
      expect(page).to have_content('Disallow: /')
    end
  end
end

# This would be the resulting docs
# Robots
#   canonical host
#      allow robots to index the site
#   non-canonical host
#      deny robots to index the site

As a last step, you might need to remove the static public/robots.txt if it's still present. Otherwise it is typically served as a static file, and the request never reaches the route defined above.

I hope you find this useful. Feel free to comment, helping to improve this technique even further.

#rails #cloudfront #robots.txt

Written by Thomas Klemm

2 Responses

tmaier

Wouldn't it be better to add your domain to the canonical link?
http://en.wikipedia.org/wiki/Canonical_link_element

Instead of

<link href='/p/tlmhnq' rel='canonical'>

you should write

<link href='https://coderwall.com/p/tlmhnq' rel='canonical'>

into your html head

over 1 year ago

thomasklemm

Thanks Tobias, that makes total sense. Seems I didn't jump to your solution when I stumbled across this problem, will do so in the future.

over 1 year ago
