
How to Embrace Failure and Influence Scalability
By Krishna Sankar

As we continue our 10X scalability work on our object store layer and get deeper into the design and development, it dawned on us that our first and foremost criterion is to befriend failures and architect for them! We have heard these ideas before, but it always becomes real when one feels one's own growing pains.

We now understand very well what it means to “design the control plane for failure and tune the data plane for normal ops.”

A quick word about different planes. In any system there are essentially three planes – the Data Plane, the Management Plane and the Control Plane. The Data Plane is where the traffic flows (whether it is a storage cloud or a transportation system). The Management Plane monitors and reports activities. The Control Plane is the set of knobs, the signals, the throttling, the metadata and configuration and so forth.

The control plane can be:

  • Reactive – Wake up when failures occur and adjust the traffic
  • Proactive – Look for choke points before failures occur and take appropriate actions
  • Adaptive – Anticipate choke points and avoid them altogether either in design or at run-time
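
The three modes above can be sketched as throttling policies over a shared queue. This is a minimal illustration, not the actual object store logic; the thresholds, the 10-step projection, and all names are hypothetical.

```python
# Illustrative control-plane policies. QUEUE_LIMIT and WARN_LEVEL are
# made-up thresholds for the sketch, not values from a real system.
QUEUE_LIMIT = 1000   # hard limit: beyond this, requests start failing
WARN_LEVEL = 0.8     # proactive mode throttles at 80% of the limit

def reactive(queue_depth, failures_seen):
    """Wake up only when failures occur, then adjust the traffic."""
    return "throttle" if failures_seen else "accept"

def proactive(queue_depth, failures_seen):
    """Look for the choke point before failures occur and act early."""
    return "throttle" if queue_depth >= WARN_LEVEL * QUEUE_LIMIT else "accept"

def adaptive(recent_depths, queue_depth):
    """Anticipate the choke point from the recent trend and steer around it."""
    trend = (queue_depth - recent_depths[0]) / max(len(recent_depths), 1)
    projected = queue_depth + trend * 10   # naive 10-step look-ahead
    return "reroute" if projected >= QUEUE_LIMIT else "accept"
```

The progression is from reacting after the fact, to acting on a static early-warning threshold, to acting on a prediction; each step moves more intelligence into the control plane.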
Design for failure matters most in a cloud infrastructure at scale, because the system has so many moving parts. It is less a deterministic Newtonian machine and more “organized chaos!”
Earlier-era client-server systems were built with redundancy mechanisms like HSRP and clustering. That was fine when the server systems were relatively local and the scale spanned a few machines. The architecture was scale-up: you deployed a few larger machines (like the HP Superdomes for databases).
The current cloud architecture is scale-out: add more non-redundant, smaller machines to increase capacity. But this architecture does not relieve the architect of the responsibility for reliability and availability. What is happening is that the redundancy is moving from the data plane to the control plane, and the focus is shifting from MTBF (mean time between failures) to MTTR (mean time to repair). Accommodating failures is the job of the designers, not of ops. In short, we know the components will fail, in fact fail in droves, and so we expect the systems to stay up and running in spite of the failures!
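
One way to picture redundancy moving into the control plane is client-side failover: instead of one highly available server, the caller retries across cheap, non-redundant replicas. A minimal sketch, with an invented failure model and made-up names:

```python
# MTTR thinking: assume replicas WILL fail; stay up by recovering fast,
# not by preventing every failure. All names here are illustrative.

class ReplicaDown(Exception):
    """Raised when a replica cannot serve the request."""

def fetch(replica_id, healthy):
    # Toy stand-in for a network read; `healthy` simulates which
    # replicas are currently alive.
    if replica_id not in healthy:
        raise ReplicaDown(replica_id)
    return f"object-from-{replica_id}"

def get_with_failover(replicas, healthy):
    """Try each replica in turn; a single failure is routine, not an outage."""
    last_err = None
    for r in replicas:
        try:
            return fetch(r, healthy)
        except ReplicaDown as err:
            last_err = err   # a component failed; move on, don't page ops
    raise RuntimeError(f"all replicas down, last failure: {last_err}")
```

The design choice is that failure handling lives in code the designers wrote up front, so a dead machine changes latency, not availability.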
In the next blogs, we will talk through specific examples of how we embrace failure and influence scalability. The principles of Carnegie are not just for humans anymore; they apply equally to the machines we make, even when they are asleep and dreaming of electric sheep. Or do they?