Scalability is fundamental to cloud, but unfortunately not all cloud services are created equal. I previously authored a whitepaper, “A Guide to Cloud Compute Scalability”, which looks at compute infrastructure to help service providers understand the issues and limitations regarding hardware, power, density, single vs. multi-tenancy and ongoing management. Today, we release the next paper in our scalability series, on scaling network resources.
In part 1, we looked at scaling up as a strategy for public cloud service providers. Today, I’m discussing scaling out.
Scale Out Discussed
Any scaling strategy will seek a balance between scale up and scale out. For instance, if a total load requires two CPU cores rather than one, a scale out strategy would probably not be appropriate. But going from one thousand to two thousand CPU cores will inevitably require scale out. Even in the latter case it is important to work out how to balance scale out and scale up; in other words, one needs to determine how large each compute node is.
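The node-sizing question above can be sketched with some rough arithmetic. This is only an illustration: the node sizes, the single spare node kept for failover, and the function names are assumptions made for the example, not figures from the whitepaper.

```python
import math


def nodes_needed(total_cores: int, cores_per_node: int, spare_nodes: int = 1) -> int:
    """Nodes required to carry total_cores, plus spare nodes for failover.

    spare_nodes is an assumed N+1 style allowance, not a recommendation.
    """
    if total_cores <= 0 or cores_per_node <= 0 or spare_nodes < 0:
        raise ValueError("core counts must be positive and spare_nodes non-negative")
    return math.ceil(total_cores / cores_per_node) + spare_nodes


def idle_fraction(total_cores: int, cores_per_node: int, spare_nodes: int = 1) -> float:
    """Fraction of provisioned cores that sit beyond the required load."""
    provisioned = nodes_needed(total_cores, cores_per_node, spare_nodes) * cores_per_node
    return (provisioned - total_cores) / provisioned


if __name__ == "__main__":
    # Compare a few hypothetical node sizes for a 2,000-core load.
    for size in (16, 32, 64):
        count = nodes_needed(2000, size)
        print(f"{size:>3} cores/node: {count:>4} nodes, "
              f"{idle_fraction(2000, size):.1%} of cores beyond load")
```

Under these assumptions, smaller nodes mean more machines to manage but less capacity tied up in the spare, while larger nodes mean fewer machines but a more expensive spare. That is one face of the balance between scale up and scale out.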
Cloud Scalability: What is it?
Cloud equals scalability, right? Wrong. As public cloud consumption increases, so too does the demand for cloud service providers. However, not all service providers are created equal, and not all have the right infrastructure in place to scale.
To build and deliver successful cloud services, the ability to scale must underpin a service provider’s compute infrastructure, networking, storage and overall business. So what is scalability?
Over the weekend there seems to have been a lot of comment on Randy Bias’s articles “The 6 Requirements of Enterprise-grade OpenStack” on CloudScaling’s blog, including some occasionally intemperate exchanges with Flexiant’s CEO, George Knox. I thought I’d review the articles and ignore, for the time being, the exchanges.
Three points by way of preamble: Firstly, I agree with about 90% of what Randy writes. In several areas, he’s bang on. However, in some areas I draw different conclusions, set out below. Secondly, Randy’s article is entitled “The 6 Requirements of Enterprise-grade OpenStack”, and deals (unsurprisingly) with the challenges of deploying OpenStack in the enterprise. Flexiant’s product, Flexiant Cloud Orchestrator, is aimed at the service provider. As I am sure Randy would be the first to admit, these are different requirements, and can have different challenges. Thirdly, just so you know where Flexiant is coming from, we are members of the OpenStack Foundation, but our product does not currently use OpenStack (for many of the reasons Randy mentions); however, it does use many of the components also used by OpenStack (for instance Qemu), many of which we’ve made contributions to.
Let’s pick one of the more controversial areas Randy mentioned, his requirement #5, and specifically the claim “OpenStack Default Networking is a Bust”, which I suspect may not have pleased some people. I think Randy’s bang on. Randy says, “The flat and multi_host networking model requires a single shared VLAN for all elastic (floating) IP addresses. This requires running spanning tree protocol (STP) across your switch fabric, a notoriously dangerous approach if you want high network uptime”. Relying on a layer 2 network based on STP to scale is the triumph of hope over experience.
If there’s something wrong with what Randy wrote, it’s only that he didn’t go far enough. Yes, STP is a protocol that scales poorly, but (some networking vendors will assure you) you can switch to TRILL or 802.1aq / SPB; admittedly these scale better. But that would be to miss the point. You cannot elastically scale a single layer 2 broadcast domain, whatever topology control protocol you use. Switches use CAM (Content Addressable Memory) tables to forward frames based on MAC address, and CAM table exhaustion is an obvious scaling issue, indeed one we’ve seen in relatively small clouds. This is especially likely to occur where the MAC address of every NIC of every constituent VM appears in each and every CAM table. There is a reason the internet (which scales pretty well) is not built as one large Ethernet network, and the same applies to clouds.
Randy also picks up on the problem of trying to shoe-horn all traffic through a single router, which is obviously not scalable. This is precisely why FCO (which currently does not use OpenStack, though it uses some of its components) has a technology called PVIP, which allows in-cluster floating IPs (elastic IPs, as Randy calls them with reference to the AWS infrastructure) to be routed at hypervisor level, then routed (rather than switched) on the upstream network if desired, and precisely why we don’t use Neutron.
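The CAM table point above can be made concrete with a back-of-the-envelope estimate. This is a rough sketch, not a model of any particular switch: the 16,000-entry table size, the VM and hypervisor counts, and the function names are all hypothetical, and it assumes the worst case in which every switch in the flat layer 2 domain learns every MAC address.

```python
def macs_in_flat_domain(vms: int, nics_per_vm: int, hypervisors: int = 0) -> int:
    """MAC addresses visible in one flat layer 2 domain.

    Counts one MAC per VM NIC plus one per hypervisor (an assumed simplification).
    """
    if vms < 0 or nics_per_vm < 0 or hypervisors < 0:
        raise ValueError("counts must be non-negative")
    return vms * nics_per_vm + hypervisors


def cam_utilisation(mac_count: int, cam_capacity: int) -> float:
    """Fraction of a switch's CAM table consumed if it learns every MAC."""
    if cam_capacity <= 0:
        raise ValueError("cam_capacity must be positive")
    return mac_count / cam_capacity


def max_vms_before_exhaustion(cam_capacity: int, nics_per_vm: int, hypervisors: int = 0) -> int:
    """Largest VM count whose MACs still fit in a single CAM table."""
    if nics_per_vm <= 0:
        raise ValueError("nics_per_vm must be positive")
    return max(0, (cam_capacity - hypervisors) // nics_per_vm)


if __name__ == "__main__":
    capacity = 16_000  # hypothetical CAM table size, not a vendor figure
    macs = macs_in_flat_domain(vms=5_000, nics_per_vm=2, hypervisors=200)
    print(f"MACs in domain: {macs}")
    print(f"CAM utilisation: {cam_utilisation(macs, capacity):.1%}")
    print(f"VM ceiling: {max_vms_before_exhaustion(capacity, 2, 200)}")
```

The ceiling here is set by the smallest CAM table in the broadcast domain rather than by compute capacity, which is why adding hypervisors alone does not make a flat layer 2 network elastic.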
But Randy’s claim “OpenStack Default Networking is a Bust” is, today, correct in my view.