On keepalived, and dropping VIPs
Over the past year, we've been tinkering with high availability for web apps, and my favored way of achieving this has been a small group of webservers sitting behind a pair of load balancers, which themselves sit behind a virtual IP address (VIP).
The webservers would run on whatever already powers your app: PHP, Ruby, Node, etc.
The "load balancers" in our case are vanilla Ubuntu servers, with an nginx reverse proxy to the webservers, and (this is the critical part) keepalived to handle failing over the virtual IP, should the active load balancer become unresponsive.
Here's the rub
Fast forward a few months, and we've been using this setup for several low- to medium-traffic apps without incident... until we moved a high-traffic app onto it. All went well for about a week, and then we began to see the VIP itself become unresponsive.
During these outages, the servers themselves all stay up, and the active/standby states of the load balancers don't change -- the active simply stops responding on the VIP, while not losing its association with it:
ubuntu@lb1:~$ ip addr show | grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 10.1.2.11/22 brd 10.1.2.255 scope global eth0
inet 10.1.2.10/32 scope global eth0
Restarting the keepalived service gets us responding to the VIP again...until the next (seemingly random) failure.
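The odd part of this failure is that the VIP stays bound to the interface while it stops answering. A minimal sketch of how that "still bound" half could be checked from a script, assuming you feed it the text output of `ip addr show` (the function names `bound_addresses` and `vip_is_bound` are made up for illustration):

```python
import ipaddress


def bound_addresses(ip_addr_output):
    """Collect every inet/inet6 address found in `ip addr show` text output."""
    addrs = set()
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Address lines look like: "inet 10.1.2.10/32 scope global eth0"
        if len(fields) >= 2 and fields[0] in ("inet", "inet6"):
            addrs.add(ipaddress.ip_interface(fields[1]).ip)
    return addrs


def vip_is_bound(ip_addr_output, vip):
    """Return True if the VIP appears among the host's bound addresses."""
    return ipaddress.ip_address(vip) in bound_addresses(ip_addr_output)


# The output shown above, captured during an outage.
sample = """\
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 10.1.2.11/22 brd 10.1.2.255 scope global eth0
inet 10.1.2.10/32 scope global eth0
"""
```

With the sample above, `vip_is_bound(sample, "10.1.2.10")` is true even though the VIP was not responding at the time, which is exactly why checking the binding alone is not enough to detect this failure.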
Rooting out the cause
I'm still unclear on the exact reason for these failures, but it seems to have something to do with network congestion, and (upon enabling keepalived's detailed logging) is accompanied by these sorts of messages in the syslog:
Nov 12 12:05:06 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Nov 12 12:05:06 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.2.10
Nov 12 12:11:40 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Nov 12 12:11:40 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.2.10
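To see whether these elections line up with anything (traffic spikes, cron jobs, and so on), it can help to pull their timestamps out of the syslog. The following is a rough sketch, not a tested tool: the regular expression is written against the exact message format shown above, the names `election_times` and `gaps_in_seconds` are hypothetical, and because classic syslog timestamps carry no year, the caller has to supply one.

```python
import re
from datetime import datetime

# Matches the "forcing new election" lines as they appear above.
ELECTION_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"Keepalived_vrrp\[\d+\]: VRRP_Instance\((?P<instance>[^)]+)\) "
    r"Received lower prio advert, forcing new election"
)


def election_times(lines, year):
    """Return a datetime for every forced-election line in `lines`.

    `year` is an assumption supplied by the caller, since the log
    timestamps do not include one.
    """
    times = []
    for line in lines:
        match = ELECTION_RE.match(line)
        if match:
            # Normalize padding such as "Nov  2" to single spaces.
            stamp = " ".join(match.group("stamp").split())
            times.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    return times


def gaps_in_seconds(times):
    """Seconds elapsed between each pair of consecutive elections."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


log = [
    "Nov 12 12:05:06 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Received lower prio advert, forcing new election",
    "Nov 12 12:05:06 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.2.10",
    "Nov 12 12:11:40 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Received lower prio advert, forcing new election",
    "Nov 12 12:11:40 localhost Keepalived_vrrp[1031]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.2.10",
]
```

Calling `gaps_in_seconds(election_times(log, year=2015))` on the four lines above (the year here is arbitrary) gives a single gap of 394 seconds. Note that `%b` parsing depends on the locale, so this assumes English month abbreviations.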
Extensive googling turned up a handful of people having similar issues, but no pinned-down cause or solution.
Up to this point, we've been using Ubuntu 14.04 and the default version of keepalived that it provides (1.2.7), which is about three years old.
A shot in the dark
Largely on a whim, I decided to try a Launchpad package that provides a much newer version, 1.2.13.
I've been running this newer version with a continuous ping for a couple of days now, and have yet to see the VIP drop like it had been.
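A continuous ping only tells you about drops if someone is watching it. As a rough alternative, here is a sketch of a small watcher that records when the VIP stops answering. It swaps ICMP ping for a plain TCP connect (ICMP needs raw sockets and usually root), and it assumes nginx is listening on port 80 of the VIP; the function names, the port, and the intervals are all placeholders of mine.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def watch(host, port, interval=1.0, count=60):
    """Probe `count` times, `interval` seconds apart; return (time, ok) samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return samples


def outages(samples):
    """Collapse samples into (first_failure, last_failure) spans."""
    spans = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            spans.append((start, last))
            start = None
    if start is not None:  # still failing when sampling stopped
        spans.append((start, last))
    return spans


# Example (assumes the VIP from this post and nginx on port 80):
# for start, end in outages(watch("10.1.2.10", 80, count=3600)):
#     print(time.ctime(start), "->", time.ctime(end))
```

Run from a machine other than the load balancers themselves, since the active load balancer can usually reach its own VIP locally even when nobody else can.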
I don't know precisely why, but I'm calling this a qualified success. ¯\ (ツ) /¯