This is a report of my research on the state of the art of AngularJS SEO as of January 2014.
The easiest solution is not to make a SPA in the first place. We have to account for this problem when deciding whether to build a SPA or a traditional server-side MVC application.
The alternative to this solution is to have some server-side component serve the rendered content to the search bots.
Serving the rendered content to the search bots creates two new problems for us:
- How can the search bot find out which URLs to index?
- How can we serve the final DOM instead of the empty template to the search bot?
Telling the search bot what to index
A search bot has two ways to determine the URLs to index on a web site: one is crawling the site (following its links) and the other is reading the URL list from a file (a robots.txt or sitemap.xml file). These approaches are not mutually exclusive and can be used together.
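For reference, the two approaches meet in robots.txt, which can point the bot at the sitemap via the standard Sitemap directive (the URL below is a made-up example):

```
User-agent: *
Sitemap: http://www.example.com/sitemap.xml
```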
We can make our Ajax applications crawlable by following some conventions and then serving a snapshot of the page to the search bots. If we follow the conventions, the web crawler will send an
http://www.example.com?_escaped_fragment_=/products/123 request when it finds an
http://www.example.com/#!/products/123 URL, so we can detect a search bot request on the server. Of course, we still need a way to resolve the
_escaped_fragment_ request, but that is another problem.
NOTE: As ylesaout points out in his comment, the ! in the URL is important because it is what tells the search bot that this URL is crawlable (that is, that it can request the _escaped_fragment_ URL from the server). Otherwise, the bot will just ignore the URL.
We can also have a sitemap file if we want finer control over the search bot's behavior (instead of blindly following links, it will index whatever we decide to put in the sitemap file). We can also do cool things like indexing a page that is not linked anywhere (for instance, because it is accessed after clicking a button) or having some control over the URL that is listed in search results.
The problem with this approach is that we have to generate the sitemap file ourselves and it depends on our routing logic, so it is a totally ad-hoc solution that will be different for every project. We can automate this, though, to a certain degree.
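As a sketch of that automation, the sitemap can be generated from the same route list the application uses, so the two never drift apart. The base URL and routes below are made-up examples:

```javascript
// Hypothetical sketch: build sitemap.xml from the client-side route list
// so the file stays in sync with our routing logic.
function buildSitemap(baseUrl, routes) {
  var entries = routes.map(function (route) {
    // Each route becomes a hashbang URL, per the crawlable-Ajax convention.
    return '  <url><loc>' + baseUrl + '#!' + route + '</loc></url>';
  });
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries.join('\n') +
    '\n</urlset>';
}

var sitemap = buildSitemap('http://www.example.com/', ['/products/123', '/about']);
```

A script like this could run as a build step, feeding the file to whichever snapshot tool reads sitemap.xml.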
Server side rendering
Although it is on the roadmap, we don't have anything like Rendr for AngularJS yet and, as far as I know, the easiest way to render the pages on the server without duplicating code is to run the application inside a headless browser (like PhantomJS), navigate to the appropriate URL, let Angular do its thing, and then read the resulting DOM and send it to the crawler.
This is the approach taken in the yearofmoo AngularJS SEO tutorial and in this GitHub project. Another possibility (discussed here) is to use jsdom instead of PhantomJS, but the idea is more or less the same (run the app on the server and take a snapshot of the resulting DOM).
The snapshot can be generated on demand whenever the server detects a request from a search bot, but I think the performance of this approach would be very bad (we have to start a browser, load the application and navigate to the view), so my recommendation is to pregenerate the snapshots and store them as HTML files.
The easiest way to generate and store these snapshots is to use a service like prerender.io or BromBone (there are more here). They all work in a similar way: we point the service to our URL, it crawls our application at regular intervals and stores the generated DOM somewhere, and we configure our web server to serve the stored snapshots whenever a request from a search bot comes in.
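All of these setups hinge on recognizing bot traffic in the first place. A minimal sketch of that check (the user-agent list is a small illustrative subset, not what any particular service actually uses):

```javascript
// Illustrative subset of crawler user-agent fragments; real middleware
// ships a much longer list.
var BOT_AGENTS = ['googlebot', 'bingbot', 'yandex', 'baiduspider'];

// A request is treated as a bot request if it follows the
// _escaped_fragment_ convention or its user agent matches a known crawler.
function isSearchBot(userAgent, queryString) {
  if (queryString && queryString.indexOf('_escaped_fragment_') !== -1) {
    return true;
  }
  var ua = (userAgent || '').toLowerCase();
  return BOT_AGENTS.some(function (bot) {
    return ua.indexOf(bot) !== -1;
  });
}
```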
If we don't want to use an external service, we can use a script to crawl our site (or read our sitemap.xml file), render the snapshots and store them somewhere the server can later retrieve them from (like the filesystem or a shared cache).
I found two grunt plugins to do this as part of our build (we could do this on our CI server):
grunt-html-snapshots will read your robots.txt or sitemap.xml file (you still have to generate these, though) and render the listed pages in a PhantomJS browser. It is also available as a library.
grunt-html-snaphot (without the final s). It looks to me like this one will do the crawling for you instead of relying on the sitemap.xml file, but I have to check this.
You should also check this project on GitHub.
- Set up a proof of concept application for the grunt snapshot approach
- Do some more research on the possibility of using these snapshots to make the application start up faster (à la Twitter) and also as a cache.
After setting up a proof of concept application I found that:
This article (http://www.ng-newsletter.com/posts/serious-angular-seo.html) explains the different options pretty well
None of the grunt plugins I mentioned does the crawling part, although it is not very difficult to implement yourself (see https://gist.github.com/joseraya/8547524)
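The crawling part essentially amounts to extracting the #! links from each rendered page and visiting them in turn. A minimal (regex-based, so approximate) sketch of the link-extraction step:

```javascript
// Pull the crawlable #! links out of a rendered page. A regex is a rough
// tool for HTML, but it is enough for a snapshot crawler sketch.
function extractHashbangLinks(html) {
  var links = [];
  var re = /href="([^"]*#![^"]*)"/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}
```

A crawler would seed this with the home page, render each discovered URL in the headless browser, and repeat until no new links appear.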
The problem with generating snapshots during the build is that it might take a very long time (especially if you have many pages), because you have to wait for the pages to render (and it is not easy to be sure that a page has rendered completely). I would rather do the snapshot generation as an independent step.
Zombie.js is not necessarily faster than PhantomJS (at least prerender.io with its PhantomJS-based service is fast enough compared with my Zombie.js script)
It is very easy to implement prerender.io's solution (either with your own server or with their hosted solution) and the hosted solution is not very expensive (100,000 pages costs more or less the same as the Heroku dyno you might use to run your own server)
The prerender-node middleware has a very good API and it seems easy to implement your own plugins (if you ever need to do so).
For the project I was doing this research for, we will definitely try prerender as our solution to the problem of SEO optimization in AngularJS.