
AngularJS SEO

UPDATE: As noted by birkof in the comments, in May 2014 Google announced better support for JavaScript (http://googlewebmastercentral.blogspot.de/2014/05/understanding-web-pages-better.html). That makes this post somewhat obsolete, although it still applies to Bing and other search engines.

The original post follows:

This is a report of my research on the state of the art of AngularJS SEO as of January 2014.

Problem:

  • Search bots do not execute JavaScript and, thus, all they see is the empty template for the application. This is fine if the app is password protected (the pages would not be indexed anyway) but unacceptable if we want our application pages indexed in search engines.

Solutions:

  • The easiest solution is not to make a SPA in the first place. We have to account for this problem when deciding whether to make a SPA or a traditional server-side MVC application.

  • The alternative is to have some server-side component serve the rendered content to the search bots.

Serving the rendered content to the search bots will create two new problems for us:

  • How can the search bot find out which URLs to index?
  • How can we serve the final DOM, instead of the empty template, to the search bot?

Telling the search bot what to index

A search bot has two ways to determine the URLs to index on a web site: one is crawling the site (following the links) and the other is reading the URL list from a sitemap.xml file (usually referenced from robots.txt). These approaches are not mutually exclusive and can be used together.

Ajax crawling

We can make our AJAX applications crawlable by following some conventions and then serving a snapshot of the page to the crawler. If we follow the conventions, the web crawler will send an http://www.example.com?_escaped_fragment_=/products/123 request when it finds an http://www.example.com/#!/products/123 URL, so we can detect a search bot request on the server. Of course, we still need a way to resolve the _escaped_fragment_ request, but that is another problem.

NOTE: As ylesaout points out in his comment, the ! in the URL is important because it is what tells the search bot that this URL is crawlable (that is, that it can request the _escaped_fragment_ URL from the server). Otherwise, the bot will just ignore the URL.

Site maps

We can also have a sitemap file if we want finer control over the search bot's behavior (instead of blindly following links, it will index whatever we decide to put in the sitemap file). We can also do cool things like indexing a page that is not linked anywhere (for instance, because it is accessed after clicking a button) or have some control over the URL that is listed in search results.

The problem with this approach is that we have to generate this sitemap file, and it depends on our routing logic; thus, it is a totally ad-hoc solution that will be different for every project. We can automate this, though, to a certain degree.
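One way to semi-automate it is to keep the route list in a single data structure and generate the XML from it. A minimal sketch (the `buildSitemap` helper and the hashbang URL scheme are assumptions for illustration, not part of AngularJS):

```javascript
// Sketch: build a sitemap.xml string from the application's route list.
function buildSitemap(baseUrl, routes) {
  var entries = routes.map(function (route) {
    // The crawler will translate these #! URLs into _escaped_fragment_ requests
    return '  <url><loc>' + baseUrl + '/#!' + route + '</loc></url>';
  });
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
         entries.join('\n') + '\n</urlset>';
}

var xml = buildSitemap('http://www.example.com', ['/products/123', '/about']);
```

The hard part remains producing the route list itself (for dynamic routes like /products/:id it means querying the database for the valid IDs), which is why this stays project-specific.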

Server side rendering

Although it is on the roadmap, we don't have anything like Rendr for AngularJS and, as far as I know, the easiest way to render the pages on the server without duplicating code is to run the application inside a headless browser (like PhantomJS), navigate to the appropriate URL, let Angular do its thing, and then read the resulting DOM and send it to the crawler.

This is the approach taken in the yearofmoo AngularJS SEO tutorial and in this GitHub project. Another possibility (discussed here) is to use jsdom instead of PhantomJS, but the idea is more or less the same (run the app on the server and take a snapshot of the resulting DOM).

This generation can be done on demand whenever the server detects a request from a search bot, but I think the performance of this approach would be very bad (we have to start a browser, load the application and navigate to the view) and, thus, my recommendation is to pregenerate the snapshots and store them as HTML files.
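Pregenerating implies a deterministic mapping from each route to the file where its snapshot is stored, so the web server can pick the right file when a bot request comes in. The naming scheme below is just one possible convention, not a standard:

```javascript
// Sketch: map an application route to the HTML file that stores its snapshot.
// The 'snapshots/' directory and the flattening scheme are assumptions.
function snapshotFile(route) {
  var name = route.replace(/^\//, '')   // drop the leading slash
                  .replace(/\//g, '-')  // flatten nested paths into one name
                  || 'index';           // an empty route means the home page
  return 'snapshots/' + name + '.html';
}
```

With such a scheme, the web server only needs to translate the `_escaped_fragment_` route into a path and serve the file if it exists.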

The easiest way to generate and store these snapshots is to use a service like prerender.io or BromBone (there are more here). They all work in a similar way: we point the service to our URL, it crawls our application at regular intervals and stores the generated DOM somewhere, and we configure our web server to serve the generated snapshots whenever a request from a search bot comes in.

If we don't want to use an external service, we can use a script to crawl our site (or read our sitemap.xml file), render the snapshots and store them in a location from which the server can later retrieve them (like the filesystem or a shared cache).
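The crawling part of such a script boils down to extracting the app's internal links from each rendered snapshot and feeding the unseen ones back into the queue. A naive extraction step could look like this (regex-based, good enough for a sketch but not for arbitrary HTML):

```javascript
// Sketch: collect unique #! routes from a rendered page to feed the crawl queue.
function extractRoutes(html) {
  var routes = [];
  var re = /href="#!([^"]+)"/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    if (routes.indexOf(match[1]) === -1) {
      routes.push(match[1]); // only queue routes we have not seen yet
    }
  }
  return routes;
}
```

Starting from the root page, repeating render-then-extract until the queue is empty visits every reachable route, which is essentially what the third-party services do.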

I found two Grunt plugins to do this as part of our build (we could do this on our CI server):

  • grunt-html-snapshots will read your robots.txt or sitemap.xml file (you still have to generate these, though) and render the pages in a PhantomJS browser. It is also available as a library.
  • grunt-html-snapshot (without the final s). It looks to me like this one will do the crawling for you instead of relying on a sitemap.xml file, but I still have to verify this.

You should also check this project on GitHub.

Further work

  • Set up a proof-of-concept application for the Grunt snapshot approach.
  • Do some more research on the possibility of using these snapshots to make the application start up faster (à la Twitter) and also as a cache.

UPDATE

After setting up a proof-of-concept application, I found that:

  • This article (http://www.ng-newsletter.com/posts/serious-angular-seo.html) explains the different options pretty well

  • Neither of the Grunt plugins I mentioned will do the crawling part, although it is not very difficult to implement yourself (see https://gist.github.com/joseraya/8547524)

  • The problem with snapshot generation during the build is that it might take a very long time (especially if you have many pages) because you have to wait for the pages to render (and it is not easy to be sure that a page has rendered completely). I would rather do the snapshot generation as an independent step

  • Zombie.js is not necessarily faster than PhantomJS (at least prerender.io with its phantom-service is fast enough compared with my Zombie.js script)

  • It is very easy to implement prerender.io's solution (either with your own server or with their hosted service), and the hosted service is not very expensive (100,000 pages cost more or less the same as the Heroku dyno you might use to run your own server)

  • The prerender-node middleware has a very good API and it seems easy to implement your own plugins (if you ever need to do so).
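For reference, wiring prerender-node into an Express app takes a couple of lines; a sketch based on the middleware's documented usage (the token only applies to the hosted service, and check the README for the exact options):

```javascript
// Sketch: serve snapshots to bots with the prerender-node Express middleware.
var express = require('express');
var app = express();

// The middleware detects bot requests (by user agent or _escaped_fragment_)
// and proxies them to a Prerender server; regular browsers get the normal SPA.
app.use(require('prerender-node').set('prerenderToken', 'YOUR_TOKEN'));
```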

For the project I was doing this research for, we will definitely try Prerender as our solution to the problem of SEO in AngularJS.

Comments

  • Google is behind Angular, so if they want this to become a common framework on the net, they should think about how to make it indexable by Google's crawlers.

  • Wow, great research. Thank you for sharing with the community. I am debating server-side vs SPA MVC for my next project now.

    An untested hack in my mind is to create a JSON data structure that describes the routes, consumed by both ngRoute and a server-side (Node?) micro app to generate the sitemap. This would avoid the need to manually craft the sitemap file. Does this sound remotely reasonable?

  • Right now I am more inclined to try the crawling approach instead of the sitemap (because of dynamic URLs), and I like the idea of pre-rendering the snapshots (as part of the build process or as a periodically-run script) instead of generating them on demand.

    If I have a /products/:id route, I need to go to the database to look up all the valid product IDs during sitemap generation, and I think it will be easier to just crawl the links in the app (as the third-party services do).

  • Great summary on the subject.

    I just noticed a little typo: http://www.example.com/#/products/123 should be http://www.example.com/#!/products/123. Without the !, crawlers won't request the page. Alternatively, you can use "classic" URLs (i.e. without #) by adding the fragment meta tag to your HTML head and using the History pushState API.

    As one of the SEO4Ajax co-founders, I think you may be interested in our service. Our solution integrates a crawler, so you get the simplicity of not having to generate a sitemap and the possibility of having all your pages pre-generated and ready to be served.

  • Thanks for the heads up! You are right that the ! is important!

    I like the idea of having a crawler pregenerate the snapshots, and I think that in a real-world situation (where the rendering of a page has to wait for data to be downloaded from some server) it might help to get a better rank, although it is not easy to quantify how much it would help.

    The one thing I like about Prerender, though, is that it is open source and, thus, I am more in control of my architecture. If the third-party service becomes unreliable I can always redirect my web server to my own Prerender server.

  • That's a great summary, thanks, @joseraya. I was wondering how I can endorse this post so it ranks higher. Coderwall is doing a great job at hiding this option :/

  • @joseraya: Moving from SEO4Ajax to any open source solution is quite easy. It should only be a matter of 2 or 3 lines in your HTTP server configuration.

  • Will my hosting provider support a PhantomJS setup, since it requires installation, or is there a JS file that will work?

  • Great summary, thanks!

    I think things are changing with respect to search indexers and JavaScript. I run a JavaScript error tracking service called {Track:js} (trackjs.com), and we capture quite a few errors generated by the GoogleBot. It appears to be executing a lot of JavaScript, storing cookies, and doing all sorts of the same things that regular browsers do.

  • Google announced that they're finally crawling JavaScript. Not like before, unpredictable and on a minimal scale, no. The whole thing.

    http://googlewebmastercentral.blogspot.de/2014/05/understanding-web-pages-better.html

  • That's right. I think that this post is pretty much obsolete right now :)

  • Yes, Google can crawl JavaScript now, and even Bing, but what about Facebook and Twitter? I saw someone's website that succeeded in displaying meta tag values in search results, but when I tried to post its URL on a Facebook wall, the dynamic page title and meta tags were ignored.
