As engineers and server ops folks know, the best systems are the ones which allow you to sleep at night. When there’s trouble, great systems contact engineers with the right information at the right times.
After 13 years of creating web applications, I’ve reduced my thoughts on monitoring your systems to the following eight strategies:
Sign up with Pingdom. Use it to request your website’s front page and to check for an expected word somewhere in the HTML body. On RockThePost, I have it set to ensure the phrase “Startup investing” is found. If Pingdom doesn't find that, it will contact me. I know something is majorly screwed up if Pingdom calls, and I'll step away from dinner to investigate. This is the easiest catch-all you can possibly set up.
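To make the idea concrete, here is a minimal Python sketch of the kind of check described above: request a page and confirm an expected phrase is in the body. It is not how Pingdom itself works, and the function names `body_contains` and `check_site` are made up for illustration. A hosted service is still the better choice, because a check running on your own server goes down with it.

```python
import urllib.request


def body_contains(body: str, phrase: str) -> bool:
    """Return True if the expected phrase appears in the page body."""
    return phrase in body


def check_site(url: str, phrase: str, timeout: float = 10.0) -> bool:
    """Fetch url and report whether it answered 200 with the phrase present."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
    return body_contains(body, phrase)


# Example call (the URL is a placeholder, not a real endpoint):
# check_site("https://www.example.com/", "Startup investing")
```

A failed check would then be handed to whatever contacts you; the sketch only answers the yes/no question.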
Use environmental monitoring like CloudWatch or New Relic. We built RockThePost with RightScale, which already has tons of monitoring. By default, we were able to automatically monitor all of the server’s internal resources, such as CPU, memory consumption, disk space, Apache requests, MySQL activity and more. If you’re on Amazon, you might want to look at CloudWatch. If you aren’t on Amazon, take a look at New Relic.
Log errors and other valuable event information from within your application. Even though your website might be returning front pages correctly and the server environment is healthy, your application might be broken. You need to institute an application exception handler in your code that writes the exception details to a log file on the filesystem. My favorite way to log stuff is to wrap all of the exception information, the server information (IP address, server name), and any relevant code information, such as the currently deployed tag, into a JSON string and then log all of it on just one line. The reason for doing this will make sense later.
Just before my exception handler loads, I also like to generate a uniqid. I write that unique id into the JSON log entry, but I also display that specific error’s uniqid to the user on the "We're Sorry" friendly error / apology page. That way, if someone contacts you later, they can say “Feature X didn’t work and here’s the error id I got.” Now you have an easy way to find what went wrong.
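The one-line JSON entry and the error id could be sketched like this in Python. This is only an illustration under assumptions: the field names, the helper names `new_error_id` and `format_exception_entry`, and the use of a random UUID fragment in place of a uniqid are all mine, not the original application's code.

```python
import json
import socket
import traceback
import uuid


def new_error_id() -> str:
    """Return a short random id to show the user and to write into the log."""
    return uuid.uuid4().hex[:12]


def format_exception_entry(exc: BaseException, error_id: str,
                           deployed_tag: str, level: str = "CRIT") -> str:
    """Pack exception, server, and release details into one JSON log line."""
    hostname = socket.gethostname()
    try:
        ip_address = socket.gethostbyname(hostname)
    except OSError:
        ip_address = "unknown"
    entry = {
        "error_id": error_id,
        "level": level,
        "type": type(exc).__name__,
        "message": str(exc),
        "trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "server_name": hostname,
        "server_ip": ip_address,
        "deployed_tag": deployed_tag,
    }
    # json.dumps escapes embedded newlines, so the entry stays on one line.
    return json.dumps(entry)
```

In a top-level exception handler you would generate the id first, write the returned line to your log file, and pass the same id to the apology page so the user can quote it back to you.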
Categorize how bad errors and/or log entries are. You want a convention for categorizing how bad an exception is when you log it. By default, I log my exceptions as CRIT, which means something totally unexpected has happened and someone was probably disappointed. If the problem isn’t so bad, I’ll log an ERR level error. For more severe errors, I will log EMERG and ALERT level errors. Here’s a general guide to the meaning we assigned to each of the logging error levels:
Consolidate all of your logs with PaperTrailApp. The biggest problem with storing log files on a server is that they are stuck on that server. You want to be able to consolidate all of the logs into a single interface with a smart way to filter through them. My favorite solution for this is PaperTrail. It's a web app which gives me a single live online scroll of all of the logs from all of my servers. If someone gets an exception, the details immediately appear in the PaperTrail interface. At our office, we have a big flat-screen TV which rotates through different screens of information. PaperTrail's event log is on one of the screens we rotate through.
Sign up with PagerDuty to get contacted about problems in your system. PagerDuty allows you to set who’s "on call". Even if you just have one developer who handles everything, PagerDuty can take a message and deliver it to the on-call engineer in a smart way. Instead of repeatedly messaging a person every time an error is received, PagerDuty knows how to de-duplicate a problem and only attempts to contact a person in an orderly manner, following a customizable schedule. I have my PagerDuty account set to email me, wait 5 minutes, then text me, wait 5 minutes, and then call my phone. If my phone rings, a robotic voice which sounds like Stephen Hawking reads the exception overview to me. I then have to clear the problem and mark the issue as "resolved". They even have a mobile app.
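The two behaviors described above, de-duplication and a stepped contact schedule, can be modeled in a few lines of Python. This is a toy model to show the logic, not PagerDuty's implementation or API; `ESCALATION_STEPS`, `contacts_due`, and `IncidentTracker` are hypothetical names, and the timings are simply the email, text, call schedule from the paragraph.

```python
from typing import List, Set, Tuple

# (minutes after the incident opens, contact method), per the schedule above.
ESCALATION_STEPS: List[Tuple[int, str]] = [(0, "email"), (5, "sms"), (10, "phone")]


def contacts_due(minutes_since_trigger: float) -> List[str]:
    """Return the contact methods whose delay has elapsed for an open incident."""
    return [method for delay, method in ESCALATION_STEPS
            if minutes_since_trigger >= delay]


class IncidentTracker:
    """Collapse repeated reports of the same problem into one open incident."""

    def __init__(self) -> None:
        self._open: Set[str] = set()

    def report(self, key: str) -> bool:
        """Return True only when this key opens a new incident."""
        if key in self._open:
            return False
        self._open.add(key)
        return True

    def resolve(self, key: str) -> None:
        """Mark the incident resolved so a later report opens a fresh one."""
        self._open.discard(key)
```

The point of the design is that a hundred identical exceptions produce one incident and at most three contact attempts, rather than a hundred messages.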
Create an intelligent convention and strategy for responding to your errors. PaperTrail gives you the ability to set up saved searches and then push them into PagerDuty. For example, I have CRIT, EMERG, and ALERT errors pushed to me via PagerDuty. I ignore problems at the ERR level and lower. This is the magic link that allows an exception thrown in the live environment to be pushed right to the developer via their preferred contact strategy. It's a good idea to push ALERT and EMERG severity problems into a special PagerDuty service which uses a more aggressive contact policy, so that developers are contacted faster.
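The routing rule described above is small enough to write down as code. The sketch below assumes the standard syslog severity keywords; the service names `pagerduty-urgent` and `pagerduty-default` are placeholders I made up, and in the setup described here the routing is configured through PaperTrail saved searches rather than written in application code.

```python
from typing import Optional

# Standard syslog severity keywords, most severe first.
LEVELS = ["EMERG", "ALERT", "CRIT", "ERR", "WARNING", "NOTICE", "INFO", "DEBUG"]


def route(level: str) -> Optional[str]:
    """Return the (hypothetical) paging service for a level, or None to ignore it."""
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {level}")
    if level in ("EMERG", "ALERT"):
        # Most severe: the service with the more aggressive contact policy.
        return "pagerduty-urgent"
    if level == "CRIT":
        return "pagerduty-default"
    # ERR and lower stay in the consolidated log and page nobody.
    return None
```

Keeping the rule this explicit makes it easy to revisit as you learn which errors actually deserve a phone call.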
Display a useful “oops we’re sorry” friendly error page with a ZenDesk support ticket form. Our oops page apologizes for the inconvenience, shows the user the error id number, allows them to file a support ticket via ZenDesk, and displays a list of ventures just so there’s still something a little relevant to look at.
This strategy for dealing with exceptions is what I would recommend in most cases to a small to mid-sized team. After you institute it, you can refine and adjust things as you go. If you methodically fix the errors that surface and adjust your logs to ignore unimportant errors, you’ll find yourself in a situation where the worst errors are cleared and your system’s stability is higher than it was before.