Cloud Network Scalability

What is Scalability?

Scalability concerns the ability of a cloud system to grow, and the cost and performance implications of doing so. Because any cloud system is made up of many components, cloud scalability raises two questions:

1. Is there a maximum size of any part of the system?
2. If the size of a system is doubled, for example, does that mean the cost of running it (technical or otherwise) doubles, less than doubles, or more than doubles?

Scalability issues thus result in a maximum size and an optimum size (where the cost of providing service is lowest).

In a previous whitepaper, we discussed the economic drivers for adoption of cloud technology. We explained how cloud allows homogeneous hardware to be provisioned at scale, serving a multitude of customers’ heterogeneous needs. This in turn produces economies of scale (and thus cheaper or more profitable services) through lower purchase costs and statistical gain. However, this argument is predicated on those economies of scale continuing to exist as the platform itself scales. If those economies of scale tail off (due to scalability issues), these economic drivers diminish or may even start to turn against the cloud operator.

Dealing with Failure

Probability presents an inexorable challenge to conventional IT scaling. If each component has some independent chance of failing, then as the number of components in a system increases, the probability that at least one of them fails approaches one. To put it more simply, however carefully you design the individual components, if you try to scale up your IT system, there will be a size beyond which a failure becomes a racing certainty.
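The effect can be illustrated with a few lines of Python. This is a minimal sketch that assumes component failures are independent and that every component has the same failure probability; the function name and the figure of 0.1% per component are illustrative assumptions, not measurements.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent components fails,
    given that each fails with probability p over the period of interest."""
    # All n components survive with probability (1 - p) ** n;
    # the complement is the chance that at least one fails.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-component failure probability of 0.1%.
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"{n:>6} components: {prob_any_failure(0.001, n):.4f}")
```

Under these assumptions, a system of 1,000 such components has roughly a 63% chance of seeing at least one failure, and at 10,000 components a failure is all but certain.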

Traditional IT copes with resiliency issues through n:1 redundancy. This requires that a small number of identical components be deployed in parallel in a closely coupled manner (often using expensive custom hardware), with each component designed for as low a probability of failure as possible. On the assumption that component failures are independent, and that the system can survive with a single component running, the probability of failure of the system as a whole (which requires failure of every component) can be brought within an acceptably low range.
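The arithmetic behind this strategy can be sketched as follows. The example assumes independent failures and a system that survives as long as one component is running, as described above; the function name and the 1% per-component failure probability are hypothetical.

```python
def prob_system_failure(p: float, n: int) -> float:
    """Probability that all n redundant components fail together,
    assuming independent failures with per-component probability p."""
    # The system only fails if every one of the n components fails.
    return p ** n


if __name__ == "__main__":
    # Hypothetical per-component failure probability of 1%.
    for n in (1, 2, 3):
        print(f"{n} redundant component(s): {prob_system_failure(0.01, n):.0e}")
```

Each additional redundant component multiplies the system failure probability by p, which is why a handful of highly reliable components is enough in the traditional model. The independence assumption carries the whole result: any shared cause of failure makes the real figure worse.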

In a cloud environment, this strategy does not work. As the system scales, such failures become inevitable. n:1 provisioning becomes prohibitively expensive, and the custom hardware involved conflicts with the principle of using cheap commodity hardware. As a result, to cope with scaling cloud systems, service providers must ‘build to fail’, i.e. they must expect failure of individual components, and deal with that failure gracefully. For instance, a hardware failure of a compute node should not be a cause for alarm that requires emergency maintenance, manual intervention and data restore. Rather, the system should recover automatically, perhaps removing the potentially faulty hardware from the pool of resources used to run services; repair or replacement is then performed as part of routine maintenance.
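The recovery behaviour described above can be sketched in Python. This is an illustrative toy, not any particular platform's implementation: the class ResourcePool, its fields, and the "least loaded node" placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """Hypothetical pool of compute nodes; maps node name -> workloads on it."""
    healthy: dict = field(default_factory=dict)
    quarantined: set = field(default_factory=set)

    def handle_failure(self, node: str) -> None:
        """Take a failed node out of service and restart its workloads elsewhere."""
        workloads = self.healthy.pop(node, [])
        # The node stays out of the pool until routine maintenance deals with it.
        self.quarantined.add(node)
        for workload in workloads:
            if not self.healthy:
                raise RuntimeError("no healthy nodes left to recover onto")
            # Assumed placement rule: pick the node running the fewest workloads.
            target = min(self.healthy, key=lambda n: len(self.healthy[n]))
            self.healthy[target].append(workload)


if __name__ == "__main__":
    pool = ResourcePool({"node-a": ["vm1", "vm2"], "node-b": ["vm3"], "node-c": []})
    pool.handle_failure("node-a")
    print(pool.healthy, pool.quarantined)
```

The point of the sketch is that the failure is handled entirely in software: no operator is paged, the workloads are running again on the remaining nodes, and the suspect hardware simply waits in a quarantine list.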

Moving to a ‘build to fail’ model requires changes in architecture, in management practice, and in management technologies. For instance, features such as live migration, live recovery, and network-booted nodes (sometimes called stateless nodes) are particularly useful.

So the question becomes: do you cope with scale by scaling up or by scaling out? How do you cope with the limits of compute scale? Our paper, A Guide to Network Scalability, sets the scene in more detail, discusses the characteristics that produce scalability challenges, and describes Flexiant’s approach to network scalability.
