When using the Amazon CloudFront CDN with a certain setup, your entire site becomes available via the CDN URL as well. In the Google search results, two different results started showing up for the same page: one with the canonical URL and one with the CloudFront URL.
There are various strategies for solving this, two of which are listed here:
* Canonical URLs (sketched briefly after this list)
* A dynamic robots.txt
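
The canonical URL approach is only sketched here: a `rel="canonical"` link tag in the layout that always points at the canonical hostname, so search engines fold the CDN-served copies into one entry. This is a minimal sketch, assuming `www.plugingeek.com` is the canonical host; adjust the host and layout file to your setup.

```erb
<%# app/views/layouts/application.html.erb -- sketch only %>
<%# Point search engines at the canonical hostname, even when the page
    is served via the CloudFront URL %>
<link rel="canonical" href="http://www.plugingeek.com<%= request.fullpath %>">
```

The rest of this post focuses on the second strategy, the dynamic robots.txt.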
Following the advice from a similar answer on Stack Overflow, I currently use the following solution to render a dynamic robots.txt based on the request's host parameter.
```ruby
# config/routes.rb

# Dynamic robots.txt
get 'robots.:format' => 'robots#index'
```
```ruby
# app/controllers/robots_controller.rb
class RobotsController < ApplicationController
  # No layout
  layout false

  # Render a robots.txt file based on whether the request
  # is performed against a canonical url or not.
  # Prevent robots from indexing content served via a CDN twice.
  def index
    if canonical_host?
      render 'allow'
    else
      render 'disallow'
    end
  end

  private

  def canonical_host?
    request.host =~ /plugingeek\.com/
  end
end
```
Based on the `request.host`, we render one of two different `.text.erb` view files.
```
# app/views/robots/allow.text.erb
# Note the .text extension
#
# Allow robots to index the entire site except some specified routes;
# rendered when the site is visited via the canonical hostname.
# http://www.robotstxt.org/

# ALLOW ROBOTS
User-agent: *
Disallow:
```
```
# app/views/robots/disallow.text.erb
# Note the .text extension
#
# Disallow robots from indexing any page on the site;
# rendered when a robot visits the site via the CloudFront CDN URL,
# to prevent duplicate indexing and search results
# referencing the CloudFront URL.

# DISALLOW ROBOTS
User-agent: *
Disallow: /
```
Testing the setup with RSpec and Capybara can be done quite easily, too.
```ruby
# spec/features/robots_spec.rb
require 'spec_helper'

feature "Robots" do
  context "canonical host" do
    scenario "allow robots to index the site" do
      Capybara.app_host = 'http://www.plugingeek.com'
      visit '/robots.txt'
      Capybara.app_host = nil

      expect(page).to have_content('# ALLOW ROBOTS')
      expect(page).to have_content('User-agent: *')
      expect(page).to have_content('Disallow:')
      expect(page).to have_no_content('Disallow: /')
    end
  end

  context "non-canonical host" do
    scenario "deny robots to index the site" do
      visit '/robots.txt'

      expect(page).to have_content('# DISALLOW ROBOTS')
      expect(page).to have_content('User-agent: *')
      expect(page).to have_content('Disallow: /')
    end
  end
end

# This would be the resulting docs
#
# Robots
#   canonical host
#     allow robots to index the site
#   non-canonical host
#     deny robots to index the site
```
As a last step, you might need to remove the static `public/robots.txt` if it's still present.
I hope you find this useful. Feel free to comment and help improve this technique even further.