Now, as a background, our Redis servers are running on AWS EC2 instances and perform snapshotting to an EBS volume.
It turns out that this has already been observed and documented. From http://redis.io/topics/admin:
> The use of Redis persistence with EC2 EBS volumes is discouraged since EBS performance is usually poor. Use ephemeral storage to persist and then move your persistence files to EBS when possible.
>
> If you are deploying using a virtual machine that uses the Xen hypervisor you may experience slow fork() times. This may block Redis from a few milliseconds up to a few seconds depending on the dataset size. Check the latency page for more information. This problem is not common to other hypervisors.
EC2 uses a highly customized version of Xen. From http://redis.io/topics/latency:
> Linux VM on EC2 (Xen) 6.1GB RSS forked in 1460 milliseconds (239.3 milliseconds per GB)
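You can check the cost on your own instances: Redis reports the duration of its most recent fork in the `latest_fork_usec` field of `INFO stats`. To see why the cost scales with dataset size at all, here is a minimal Python sketch (Linux/macOS only, since it uses `os.fork()`): fork must duplicate the parent's page tables, so a process with a larger resident set forks more slowly. The 200 MB ballast size is arbitrary, chosen just to make the difference visible.

```python
import os
import time

def time_fork_ms():
    """Time a single fork() call; the child exits immediately."""
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child does no work
    elapsed_ms = (time.monotonic() - start) * 1000.0
    os.waitpid(pid, 0)
    return elapsed_ms

if __name__ == "__main__":
    baseline = time_fork_ms()
    # Grow this process's RSS by ~200 MB and touch every page so the
    # memory is actually resident, not just reserved.
    ballast = bytearray(200 * 1024 * 1024)
    for i in range(0, len(ballast), 4096):
        ballast[i] = 1
    loaded = time_fork_ms()
    print(f"fork with small RSS:  {baseline:.3f} ms")
    print(f"fork with ~200MB RSS: {loaded:.3f} ms")
```

On bare metal the absolute numbers stay small; the redis.io figure above shows how badly the same per-page work degrades under Xen.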
Yeah, really horrendous.
The solution: Move all our snapshots to persistence-only slaves. Since we already had a master/slave replication setup, this was simple enough to do.
```
redis 127.0.0.1:XXXX> config set save ""
OK
```
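To make the split permanent across restarts, the same arrangement can be expressed in `redis.conf`: the master never snapshots, while the slave keeps the usual save points. This is a sketch with a hypothetical master address and the stock save thresholds; adjust both to your deployment.

```conf
# master redis.conf: serve traffic, never fork for RDB snapshots
save ""

# slave redis.conf: persistence-only replica (hypothetical master address)
slaveof 10.0.0.1 6379
save 900 1
save 300 10
save 60 10000
```

Since only the slave forks for BGSAVE, the slow Xen fork no longer blocks the instance your clients talk to.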
And that's it. We have yet to experience a pronounced latency spike or receive a `Redis::Timeout` from our Redis clients since making this change.