Keeping Time in a Modern Tech Stack
Time is something that most of us gloss over without much thought. However, when it comes to modern distributed systems like Cassandra and ZooKeeper, time is incredibly important.
What we need...
We need all our nodes to have their clocks synced within milliseconds of each other.
Why...
Hell could freeze over! No, seriously: when time is out of sync between your nodes, you will end up with havoc at some point. Our biggest fear was writes to Cassandra being overwritten because nodes were out of sync by something ridiculous like 0.1 seconds, but there are hundreds of other edge cases that make life miserable.
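To make the overwritten-writes fear concrete, here is a minimal sketch of last-write-wins conflict resolution, which is how Cassandra picks between two writes to the same cell. It is not Cassandra code: the Write class, node_clock_us and the skew values are hypothetical names and numbers made up for illustration, and tie-breaking is simplified.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_us: int  # timestamp taken from the writing node's local clock


def last_write_wins(current, incoming):
    """Keep the write with the higher timestamp (simplified: ties keep current)."""
    return incoming if incoming.timestamp_us > current.timestamp_us else current


def node_clock_us(true_time_us, skew_us):
    """Hypothetical node clock: the true time plus that node's skew."""
    return true_time_us + skew_us


# Assumption for this example: node A runs 0.1 s fast, node B is accurate.
SKEW_A_US = 100_000
SKEW_B_US = 0

# Real-world order: A writes at t = 1.000 s, B writes 50 ms later.
first = Write("from A (older)", node_clock_us(1_000_000, SKEW_A_US))
second = Write("from B (newer)", node_clock_us(1_050_000, SKEW_B_US))

winner = last_write_wins(first, second)
print(winner.value)  # prints "from A (older)"
```

The newer write from B is silently dropped, because A's fast clock stamped the older write with a later timestamp.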
Solutions... Problems...
So NTP is the obvious answer here, but it is way more complicated than just installing the NTP package, because we need all our nodes to have the exact same time.
Syncing every node with external NTP pools is unreliable: there is too much jitter. If one node ends up with a bad pool server, the time is messed up for that one node, and then you have lost your consistency. The goal of this entire project was to have a single true time for our entire network, with all our nodes in sync with that time.
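Why does a jittery path to a public pool matter so much? A rough sketch using the standard NTP on-wire calculation: the client estimates its offset from four timestamps, and that estimate quietly assumes the trip out and the trip back take the same time. The function name and the millisecond values below are made up for illustration.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire calculation.

    t0: client sends request   (client clock)
    t1: server receives it     (server clock)
    t2: server sends reply     (server clock)
    t3: client receives reply  (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# In both cases the client clock is actually perfectly in sync (true offset 0).
# Times are in milliseconds.

# Symmetric path: 10 ms out, 10 ms back.
print(ntp_offset_and_delay(0, 10, 10, 20))  # prints (0.0, 20)

# Asymmetric path: 5 ms out, 45 ms back.
print(ntp_offset_and_delay(0, 5, 5, 50))    # prints (-20.0, 50)
```

In the second case the client believes it is 20 ms off when it is not. Every node talking to a different public server over a different path picks up its own version of this error, which is exactly the inconsistency we were trying to avoid.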
What we did...
We built our own private NTP server cluster on 3 existing nodes. One node is a master that is synced with the Amazon NTP pool, and the other two provide HA and sync directly from the master.
Even better, these nodes can be swapped out for Raspberry Pis with GPS modules. The GPS receiver acts as a Stratum 0 reference clock, which makes the Pi a Stratum 1 server... great if you want to completely cut out public NTP pools.
Every other node in the network (app, db, etc.) gets its time directly from the single master, or from the standby slaves in the event of a failed master. This means that we have a single true time from the master that is in sync with wall clock time and consistent within a few microseconds across the network.
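The topology boils down to "who does each role sync from". Here is a rough sketch that renders the server lines of an ntp.conf for each role. This is not what the cookbook actually generates: the hostnames are hypothetical, and marking the master with prefer on clients is my assumption about one reasonable way to express "master first, standbys as fallback".

```python
# Hypothetical hostnames, for illustration only.
MASTER = "ntp-master.example.internal"
STANDBYS = ["ntp-standby1.example.internal", "ntp-standby2.example.internal"]
POOL = ["0.amazon.pool.ntp.org"]  # the one pool host named in full in this post


def upstream_servers(role, master, standbys, pool):
    """Return the hosts a node with the given role should sync from."""
    if role == "master":
        return list(pool)                 # only the master talks to the outside
    if role == "standby":
        return [master]                   # standbys sync directly from the master
    if role == "client":
        return [master] + list(standbys)  # standbys are there if the master fails
    raise ValueError(f"unknown role: {role!r}")


def render_server_lines(role, master, standbys, pool):
    """Render ntp.conf 'server' lines for the given role."""
    lines = []
    for host in upstream_servers(role, master, standbys, pool):
        line = f"server {host} iburst"
        if role == "client" and host == master:
            line += " prefer"  # assumption: nudge clients toward the master
        lines.append(line)
    return "\n".join(lines)


for role in ("master", "standby", "client"):
    print(f"# {role}")
    print(render_server_lines(role, MASTER, STANDBYS, POOL))
```

Only the master ever talks to the outside world, so whatever error it picks up from the public pool is shared by everyone instead of differing node to node.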
Protip:
The Ubuntu NTP pool is horrible. Regardless of your use case, change your /etc/ntp.conf to use a more stable pool like 0.amazon.pool.ntp.org, 1.amazon...
Wrapping this up...
So we built this into a Chef cookbook which manages all the master/slave election, client and server configuration, and NTP configuration automatically. It's open source on GitHub: https://github.com/evertrue/ntp_cluster and on the Supermarket.
If you are interested in using it, please let me know. I will be happy to help get you going.
http://edhurtig.com/2015/05/keeping-time-in-a-modern-tech-stack/