At New Relic, we're just starting to give official definition to our operations team--or as we call it, Site Engineering. That is not to say that we haven't spent considerable effort in establishing the kind of operations team we want to be. While collecting 51 billion metrics/day has provided us some interesting challenges along the way, we haven't been sidetracked by building a complex infrastructure to support such a massive undertaking. In fact, it is our dedication to simplicity in operations that has provided us the opportunity to focus on what really matters: our users.
Below is a summary of our story thus far, and the lessons we've learned. You can also view the accompanying presentation given at OSCON 2012.
"Dedication to Efficiency, Getting Along on Little Power"
Operations, and the technology choices we make, should not be a burden or a distraction from our primary focus--delivering happiness to our users. Using a well-established and stable technology stack, we're able to maintain our infrastructure with little effort and few resources. By making technology choices that are uneventful and efficient, we aren't spending time fighting fires or building a complex system to prevent them.
Coupling Our Business to Technology Choices
We do not want our business choices to be predicated on our technology choices. We want to be able to get our applications into our users' hands as fast as possible and get their feedback just as quickly. As we learn what works for them and what doesn't, we'll need to make shifts. The same principles about avoiding tight coupling that we use in software development also apply to operations and our technology stack. If you've implemented a particular technology (programming language, datastore, etc.) that requires significant reworking as you make the inevitable shifts, then you've coupled too tightly.
Engineer, Don't Administer
We want to solve our infrastructure problems by engineering the solution, not administering it. When hiring, we look for individuals who know how to build the tools that remove pain points rather than individuals who are experts in a particular piece of technology. Companies and products shift over time, and we want people who can ease the friction from those shifts. We need generalists who take the best from multiple disparate areas. We need our operations staff to think like software developers. As we continue down the path of DevOps, our infrastructure will begin to look more and more like software.
Operations Should Be Interesting, Not Exciting
Firefighting may look exciting in the movies, but as an operations team, it is not what we want to be known for. The operations world is filled with brilliant and passionate people who are often fighting fires at 3am instead of working on hard and interesting problems. We want to change that paradigm. We want to be at a point where a failure at 3am doesn't wake anyone up. If we're spending our time fighting fires, we aren't spending our time solving our real problems. An operations team could be building self-healing and self-organizing infrastructures, or it could be resolving database replication problems at 3am. Which sounds more interesting?
We are deliberate when making choices about our processes and technology in operations. In the same way that ancient tools found in a buried city tell us something about that culture, our tool choices tell us something about ours. We look for tools that are mature and have a healthy ecosystem around them. We want to understand our tools intimately, because we're likely going to push them to their limits, and we want to know how to improve them as they start to show their stress points. We want tools that we can easily integrate into our world and swap out when we've reached their breaking point. We are extremely considerate of our culture, and we select our technology with the same level of consideration. What do your tools and processes say about you?
Optimize for Discovery
Our choices in technologies and process have aligned us to be optimized for discovery. We want to get our products out to our users quickly, so the barriers standing in the way of that are the first ones we break down. Implementing a continuous delivery pipeline was neither easy nor simple, but now our users can see improvements and features rapidly. Feature flags allow us to roll out features incrementally, to A/B test them, and to quickly iterate on ideas. We work in complex systems that will inevitably fail, so we want a resilient infrastructure where our mean time to recovery (MTTR) is small and mean time between failures (MTBF) isn't even a consideration.
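To make the idea of an incremental rollout concrete, here is a minimal sketch of a percentage-based feature flag. It is not a description of New Relic's actual implementation: the function names, the flag name, and the hashing scheme are all assumptions chosen for illustration. Each user is hashed into a stable bucket, so raising the rollout percentage only ever adds users to the feature.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99.

    Hypothetical helper: hashing the flag name together with the user id
    gives each flag its own independent, repeatable assignment of users.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is on for this user at the given rollout level.

    rollout_percent is expected to be between 0 (off for everyone)
    and 100 (on for everyone).
    """
    return rollout_bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    # "new-dashboard" is a made-up flag name used only for this example.
    users = [f"user-{n}" for n in range(1000)]
    for percent in (0, 10, 50, 100):
        enabled = sum(is_enabled("new-dashboard", u, percent) for u in users)
        print(f"{percent:3d}% rollout -> enabled for {enabled} of {len(users)} users")
```

Because the bucket depends only on the flag name and the user id, a user who sees the feature at 10% keeps seeing it at 50%. That stability helps keep A/B comparisons consistent while the rollout widens.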
Everyone in the company is behind our build, measure, iterate cycle. In fact, operations is at the center of this process. Without careful cultivation of our operations culture, we couldn't be optimized for discovery.