Wednesday, March 28, 2012, 3:51 PM
Back in December I wrote about the role of network elasticity in cloud computing. Rapid elasticity is one of five essential characteristics of cloud computing identified by NIST (the National Institute of Standards and Technology). The others are on-demand self-service, broad network access, resource pooling, and measured service.
Although cloud/utility computing models continue to increase in popularity, the model is still very much a work in progress. Even the definition of what constitutes the “cloud” is still being debated in some corners. While it may be hard to define, I think the NIST definition of cloud computing provides a valuable baseline for discussing some of the broader requirements and obstacles associated with it. Toward that end, I want to turn our attention to what NIST calls on-demand self-service and why it’s so critical to the success of cloud computing.
NIST defines ‘on-demand self-service’ as follows: “A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.” (For our purposes here, read “consumer” as any user.) Without on-demand self-service, user adoption of utility computing models can be severely hampered. An application developer or an IT administrator should be able to order, at the click of a mouse, the exact environment they need for a specific application.
On-Demand Self-Service Today
To some degree, virtualization has made on-demand self-service a reality in the cloud. Most cloud-based solutions today allow users to configure their compute resources (number of CPUs, memory capacity per VM) and their storage resources (capacity). Well-designed orchestration management tools like OpenStack, CloudStack, and vCloud Director are making these capabilities even more promising and accessible to all cloud providers, both public and private. And while self-service provisioning of CPUs, memory, and storage is very useful for small, contained applications, it is still not sufficient for more complex (enterprise-class) applications that typically require some type of custom networking configuration.
What We Really Need
If users could define their own detailed networking requirements for each application using the cloud infrastructure, the real challenge would shift to the operations folks within the cloud infrastructure, who would need the ability to place workloads based on those specific network requirements. To do this, we first need to treat the network as a pooled resource, just like compute and storage. Then we need to make all user-relevant network resources configurable by the user on demand: capacity, proximity, path diversity, and data segmentation. Each user should be able to tell the network what resources they need for any given application they are running in the cloud. For example:
- Capacity: allow users to allocate capacity in real time to meet specific customer and/or application needs
- Proximity: provide real or emulated proximity to compute/storage resources that require low latency connectivity
- Path resiliency: provide multiple discrete paths for mission-critical applications
- Data segmentation: provide multiple discrete paths for security or compliance reasons
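To make the idea concrete, a per-application request for network resources along these four dimensions might be declared as a simple data structure. This is a purely hypothetical sketch; the class and field names are illustrative and not drawn from any real orchestration API.

```python
from dataclasses import dataclass

@dataclass
class NetworkRequest:
    """Hypothetical per-application declaration of network needs,
    mirroring the four dimensions listed above."""
    app_name: str
    capacity_mbps: int        # capacity: requested bandwidth
    max_latency_ms: float     # proximity, expressed as a latency bound
    min_disjoint_paths: int   # path resiliency
    isolated: bool            # data segmentation from other tenants

    def validate(self) -> bool:
        # Reject nonsensical requests before placement is attempted.
        return (self.capacity_mbps > 0
                and self.max_latency_ms > 0
                and self.min_disjoint_paths >= 1)

req = NetworkRequest("billing-db", capacity_mbps=2000,
                     max_latency_ms=0.5, min_disjoint_paths=2,
                     isolated=True)
print(req.validate())  # True
```

A declaration like this is what the operations side would consume when making the placement decisions discussed next.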
Once the user has defined these parameters, the cloud infrastructure must then have the ability to take the sum of all the individual application network requirements and dynamically make network workload placement decisions along with CPU and storage placement decisions that are already being made by orchestration management tools.
Like rapid elasticity, on-demand self-service is an essential requirement for cloud computing. It is already working with compute and storage in many cases, but to realize the full value of the cloud – i.e. to extend the cloud to all types of workloads - we also need to incorporate the same type of self-service access and configuration capabilities for key network resources and make sure they are configurable by users as well. This will be an important step in helping cloud computing reach its true potential.
Thursday, January 19, 2012, 4:16 PM
For a few years (maybe 10?), we seemed to have reached some sort of happy equilibrium in networking. Ethernet and IP won the protocol wars and came to dominate almost all voice and data traffic. Things moved along very nicely and stayed that way for some time. We were able to build functional networks that got data from point A to point B; we just added some capacity every few years and everything kept on ticking.
As with anything that settles into equilibrium, eventually a new stimulus comes along to upset it. This time it’s not the protocol wars. Rather, the stimulus seems to be the increased use of server virtualization, especially in large data center networks, and it is causing quite an imbalance in networking. It creates very basic issues like connectivity and preservation of policy for servers that are moving around a data center. It also presents more complicated issues, such as how to go from a relatively static network engineering mindset to one that must contemplate dynamic workload changes and movements.
In response we’ve seen two evolutionary responses from the network – each solving some of the problems, while creating new ones. As with any evolutionary approach, multiple reactions may appear in response to a new stimulus, but it will likely take some time before the winner appears. In the case of networking technology, the winning adaptation may not be one of the two discussed below, but for now we’ll take these as leading indicators of what to expect. Ultimately, the outcome may be shaped more by market forces than technology.
1. First Evolutionary Response: “The Fabric”
Here we see the network responding to the stimuli of virtual machine movement and so-called “east-west” traffic patterns. Existing routed and hierarchical network approaches tend to be sub-optimal for highly virtualized data centers. Many have written about this, so I won’t waste space re-hashing the issues, but suffice it to say that virtual machines want to be able to move anywhere in the data center, which is generally hampered when the network is organized as many routed subnets. And interconnecting servers through hierarchical networks can create considerable inefficiency when one server simply wants to talk to another.
The so-called network fabric attempts to ease primarily those two issues (and in some cases also tries to present the entire network as a single entity for simpler management). The aim of most fabric designs is to create a single large layer 2 domain so that virtual machines can be moved without concern for IP address changes. Other approaches are more layer 3 centric, but use some tracking logic to ensure that the network can keep up with moves at the virtual server layer. In either case, the network fabric should allow for VM mobility without re-addressing the VM. Fabrics also aim to better utilize all available network links. By doing this, the fabric eases some of the constraints in typical hierarchical networks and makes the network appear “flatter” by opening up more capacity for server-to-server needs.
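One common mechanism fabrics use to spread traffic across all available links is flow hashing (as in ECMP): each flow’s 5-tuple is hashed so that all packets of one flow stay on one path while different flows spread across the equal-cost links. The sketch below illustrates the idea in plain Python; it is not any vendor’s actual hashing implementation.

```python
import hashlib

def pick_link(src_ip, dst_ip, src_port, dst_port, proto, num_links):
    """ECMP-style link selection: hash the flow 5-tuple so every
    packet of a flow takes the same path, while different flows
    spread across the available equal-cost links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# The same flow always maps to the same link:
a = pick_link("10.0.0.1", "10.0.0.2", 49152, 443, "tcp", 4)
b = pick_link("10.0.0.1", "10.0.0.2", 49152, 443, "tcp", 4)
print(a == b)  # True
```

Because the mapping is per-flow rather than per-packet, packet ordering within a flow is preserved even though the aggregate traffic uses every link.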
Problems It Solves:
- A fully-connected, any-to-any meshed network makes the network a non-issue when it comes to VM mobility.
- Newer technologies allow full use of all the capacity without intentionally blocking links to prevent loops in the topology, making the network appear flatter for servers.
Problems It Creates:
- To be fully useful for virtualized data centers, the mesh needs to reduce or eliminate oversubscription, making for a very costly network. In essence, the entire network needs to be engineered for the most demanding workloads as workload location cannot be pre-supposed.
- All of that over-engineered capacity (i.e. for workloads below the most demanding) becomes unusable.
- The same new technologies that unlock all the potential capacity also create new challenges and complications; e.g., technologies like TRILL and SPB introduce complex link state protocols at layer 2, taking a once relatively simple connectivity layer and making it subject to the vagaries of non-deterministic protocols.
- These network control protocols also add more intelligence to the network versus the end points (virtual switches in servers) which can create conflicts with SDN architectures (more on that below.)
- In some architectures, “flat” means a single large broadcast domain, which can be very difficult to manage and troubleshoot.
- Inserting network services becomes problematic, especially if the fabric is proprietary or if the need for L3-L7 services negates any gains in “flatness.”
2. Second Evolutionary Response: Software Defined Networks
The goal of Software Defined Networks (SDN) is to move the control point of the network into software. The premise is that in order to truly respond to the new stimulus, both the currently known and the future unknown, we need to separate the intelligence that controls the network from the data paths. By moving the “control plane” into software, it can be more easily evolved and adapt at the rate of software evolutions, versus hardware evolutions.
In many cases, SDN is being conflated with OpenFlow or network virtualization, and vice versa. In reality, OpenFlow is a component of SDN: it aims to standardize the interface by which software-based network control planes speak to the various network elements. Network virtualization generally refers to the ability to define software paths through a network. These software-defined paths are generally instantiated at the very edge of the network (i.e. from the virtual switch on the server) and are likewise controlled by a software control plane. In this view, network virtualization could be seen as one implementation of SDN. Of course, all of this is relatively new, so many may disagree with my dissection and resulting taxonomies.
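The control/data-plane split at the heart of SDN can be sketched in a few lines: a software controller holds the global policy and programs dumb forwarding tables in the switches. This is a hypothetical toy, not a real controller or the OpenFlow API; all class and method names are illustrative.

```python
class Switch:
    """Data-plane element: no local intelligence, just a table lookup."""
    def __init__(self, name):
        self.name = name
        self.flow_table = {}          # match (dst) -> action

    def install_flow(self, match, action):
        self.flow_table[match] = action

    def forward(self, pkt_dst):
        # Unknown traffic is dropped; only the controller adds entries.
        return self.flow_table.get(pkt_dst, "drop")

class Controller:
    """Control plane in software: holds global policy, programs switches."""
    def __init__(self, switches):
        self.switches = switches

    def apply_policy(self, dst, out_port):
        # Push one consistent forwarding decision network-wide.
        for sw in self.switches:
            sw.install_flow(dst, f"output:{out_port}")

sw1, sw2 = Switch("edge1"), Switch("edge2")
ctrl = Controller([sw1, sw2])
ctrl.apply_policy("10.1.1.5", 3)
print(sw1.forward("10.1.1.5"))  # output:3
print(sw2.forward("10.9.9.9"))  # drop
```

The point of the sketch is the division of labor: evolving the `Controller` is a software change, while the `Switch` side stays minimal, which is exactly the argument for faster innovation cycles.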
Problems It Solves:
- A commoditized and dumb HW layer should (in theory) bring down costs.
- The ability to control and manipulate the network from software should (in theory) enable faster innovation cycles for new networking capabilities.
- The network control plane gets homogenized (in theory) reducing complexity.
- Intelligence can be managed from the host (e.g., the virtual switch), allowing for the creation of more coherent policies that are more closely tied to business and application needs.
- Compartmentalization and similar techniques restore some order to the network.
- There is a direct interface to orchestration and management infrastructure for the VM layer.
Problems It Creates:
- This approach requires cooperation and synchronization of intelligence/policy between the virtual network and the physical network, or requires completely overlaying the physical network with software-defined tunnels. In some cases, this introduces a new layer of complexity to manage, such as the tunnels themselves.
- It works best with fully-meshed fabrics that offer uniform performance characteristics. These fabrics are becoming more intelligent/complex (not less so, see above) – meaning users could end up paying for intelligence twice.
- It also brings up the obvious question – when using an SDN approach on top of an intelligent fabric, who is in control and is the result deterministic and predictable?
The Winning Adaptation?
I’m fairly certain we haven’t seen the end of this evolutionary phase. The new stimulus being applied to the network is only the beginning, and these first two adaptations may themselves morph a few times before we reach the next equilibrium. If we project from our current course, we’re heading to a place where the network is fully aligned to the dynamic and evolving needs of business applications, and takes on a more software-minded construct of its own, such that networks are designed and deployed as rapidly as the applications that ride on top of them. To get there, the industry needs to continue building on the work done to date to address at least the following issues (and probably a few more):
- Simplify the point of control (intelligence) in the software layer and reach the point where we have fully deterministic and predictable network behavior.
- Allow the actual physical network to morph dynamically to changing inputs, without unneeded intelligence and corresponding cost/complexity.
- Allow for the creation of specific software-based network topologies for each application, and allow all of the resulting network topologies to coexist in the physical network.
- Build service interposition into the workflow and respect/enforce the requirements for L3-L7 services directly at the physical layer.
SDN and network fabrics present an exciting vision of what’s possible and there is tremendous energy and passion behind both. In my view, these are both steps in the right direction. As the network adapts, it solves some problems, and creates some new ones. We’ll have reached an equilibrium when the new network solves more problems than it creates, and presents a clear evolutionary advantage relative to the current stimuli. To get there, we’ll certainly need more than SDN and/or Fabrics. It’s hard to predict what the end result will look like, and the answer will be shaped not only by technology, but also by market and economic forces. In the meantime, we’ll all continue solving as many problems as we can, introducing as few new problems as possible, and seeing where this evolutionary process takes us next.
Tuesday, January 17, 2012, 1:35 PM
Last week I was re-reading this blog post - dev2ops.org/blog/2010/2/22/what-is-devop... - about an emerging trend called “DevOps.” The article was written about a year ago and the concept has really matured and gained interest since then. At the risk of over-simplification, DevOps aims to smooth the disconnect that frequently occurs in companies between application developers (Dev) – the people tasked with technology innovation and change; and operations (Ops) – the people responsible for putting that innovation into production in the enterprise data center and the business.
The post got me thinking about the state of application development versus networking. Application development has moved into a new era of agility, led by process evolution (Waterfall to Agile), a rapid transformation of foundational tools and libraries, and new ways to leverage open source code and communities. As application developers have become more capable and agile, they have put pressure on the deployment infrastructure (servers in data centers) to be equally agile and responsive. And now, in turn, the deployment infrastructure is pushing on the network to deliver the new capabilities quickly and efficiently to the business.
In the broader context of information technology, it is clear that many businesses are becoming more and more reliant on the speed of their IT (or R&D) capabilities. This newfound ability to quickly develop new applications, either to directly support the business’ revenue (e.g. Google, Facebook, etc.) or to improve internal business operations, is no longer a luxury. Agile software development methodologies have allowed these businesses to rapidly develop new IT or product capabilities and refine them quickly to get to the final product. As the article highlights, this speed creates a tension – a “Wall of Confusion” – between agility (development) and stability (operations), and the DevOps movement has been focused on giving the operations folks the tools and processes to smooth those tensions. Technologies like server virtualization, orchestration tools, and even “Infrastructure as Code” concepts are facilitating this transition, making the computing infrastructure much more responsive to the needs of the applications.
Editor’s note: we added the third frame of the graphic for the purposes of this blog.
This “Wall of Confusion” could now be drawn between Operations and Networking. Today, the network is squarely Waterfall. It takes a long time to design and engineer a network to support a specific set of assumptions based on the gross application needs. In a world where the applications needs are changing rapidly, we no longer have that luxury. As Ops teams have become more fluid with server virtualization and tools for orchestration, they are now looking at the network to get “agilified.”
Technologies like OpenFlow and the general trend toward Software-Defined Networking (SDN) are emerging as enablers to help “virtualize” the network resources and bring programmatic aspects to the network. These are necessary steps in the evolution of the enterprise network – where it ultimately becomes just another cog in the rapid application delivery machinery that enables us to transition smoothly from business concept to deployment to delivery in synchronous, agile steps.
Thursday, November 17, 2011, 2:12 PM
The definition and requirements of cloud computing are still evolving, but NIST (National Institute of Standards and Technology) is always a good objective reference. NIST defines five essential characteristics of cloud computing: On-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. In this post, I focus on rapid elasticity and how it relates to the networking aspect of the cloud. In future posts I will comment on some of the other characteristics of cloud computing.
According to NIST, ‘Rapid elasticity’ is defined as: “Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the user, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.”
With the growth of new cloud computing models, the enterprise data center has become home to a lot of dynamic workloads that require this type of rapid elasticity. These workloads appear suddenly, migrate rapidly and change intensity unexpectedly. They offer significant benefits in the way of performance, flexibility and operational cost effectiveness, but they also create some interesting data center challenges. In order to achieve the real benefits of these dynamic workloads, the underlying infrastructure also has to be elastic. Most data center managers are focused on achieving elasticity for compute and storage aspects of the infrastructure, but there is also a third aspect that is often overlooked – the network.
In the data center, workloads are placed primarily based on their performance requirements in terms of the underlying compute horsepower, storage capacity needs or idle time energy savings requirements. Most will be moved more than once when servers need to be rotated out, power consumption needs to be re-balanced, workloads fail over or servers get consolidated. The problem is that when workloads are moved, there is little attention paid to their physical location on the network and the resulting effects it may have on performance (in terms of capacity/utilization of the links, latency due to hop counts, etc.), security (when and how data is co-mingled on the network) and resiliency (how many redundant paths exist for a given inter- or intra-workload communication path).
Trying to optimize workload placement across all three dimensions – compute, storage and network – quickly becomes a never-ending game of whack-a-mole. Instead, we need to make all three of these different sets of resources more fluid such that workloads can be placed based on the most important needs of that workload, and the other aspects of the infrastructure follow along. For example, if a workload requires very high compute performance, a data center operator should be able to move (or expand) the workload to the most capable virtual machines anywhere in the infrastructure, regardless of the network or storage implications. Those resources should have enough elasticity to meet the workload needs by adjusting themselves appropriately. Conversely, if a component of a workload needs high-capacity storage, the compute and network dimensions should accommodate without any adverse effects. This type of elasticity enables a number of advantages including shorter provisioning cycles, easier maintenance of existing applications, more rapid adoption of new web-based business models, better customer experience, reduced complexity and higher levels of security and data integrity.
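The “place by the dominant requirement, let the other dimensions follow” idea can be sketched as a tiny greedy placement routine. This is an illustrative toy, not a real scheduler; the host inventory and field names are made up for the example.

```python
# Hypothetical host inventory: free capacity along the three dimensions.
hosts = [
    {"name": "h1", "cpu_free": 8,  "storage_free_tb": 2,  "net_gbps_free": 10},
    {"name": "h2", "cpu_free": 32, "storage_free_tb": 1,  "net_gbps_free": 2},
    {"name": "h3", "cpu_free": 4,  "storage_free_tb": 20, "net_gbps_free": 40},
]

def place(workload):
    """Place a workload by its *dominant* requirement only, on the
    assumption (per the post) that the other two dimensions are
    elastic enough to adjust themselves afterward."""
    key = workload["dominant"]          # e.g. "cpu_free"
    candidates = [h for h in hosts if h[key] >= workload["need"]]
    # Of the hosts that qualify, pick the one with the most headroom.
    return max(candidates, key=lambda h: h[key])["name"] if candidates else None

print(place({"dominant": "cpu_free", "need": 16}))         # h2
print(place({"dominant": "storage_free_tb", "need": 10}))  # h3
```

Note what the sketch deliberately ignores: network capacity and proximity play no role in the decision, which is only safe if the network really is elastic enough to follow the workload, which is precisely the gap the post identifies.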
With the help of virtualization technologies, compute and storage resources have already come a long way, achieving greater degrees of elasticity than ever before. But in order to realize the full benefits and capabilities of cloud computing, all three dimensions must continue to evolve; most notably the network. Networking is still in the early stages of this transformation, one that will allow for rapid elasticity of workloads without compromise. Then, the resources provided by the network, such as connectivity, capacity, latency performance, path diversity and data segregation, will be fully accessible to workloads regardless of physical location. Only then will we see what’s truly possible in the cloud.
Monday, November 14, 2011, 1:54 PM
There have been a number of very high-profile service outages recently – Amazon AWS in April[1], United Airlines in June[2], and just this week, RIM. I doubt it’s the last. While it’s interesting to wonder how these companies could have let such critical services go down at all, there is a more fundamental problem here. It’s the network.
As RIM noted on its website, "The messaging and browsing issues many of you are still experiencing were caused by a core switch failure within RIMs Infrastructure."
In my view, RIM’s outage is just one more example of how broken and mis-applied “modern day” networking technology is for our mission critical services. Traditional networking gear held together by overly complex protocols and schemes simply isn’t working.
I won't join the bandwagon speculating on whether this is the nail in RIM’s coffin. As a former Crackberry addict, I'm still rooting for them! I do find it interesting, though, that a company that has operated a rock-solid enterprise network for so many years (the best in the enterprise business by most measures) is now the latest victim of a “networktastrophe.” From what I've read in the press (most of it unsubstantiated), it sounds like a large switch had a module failure and the supposed redundancy scheme (likely something like VRRP) did not operate as planned. A more likely explanation is that someone fat-fingered a command in a 1980s-era command-line interface and all hell broke loose. We can certainly blame RIM for not testing their redundancy architecture or procedures well enough, but let’s not overlook the fundamental problem: the network and its apparent fragility. Everything from the complexity (and corresponding fragility) of the protocols to the downright user-unfriendly management needs to be rethought for today’s critical data and service needs. We built these systems to solve a different problem (and under a different set of constraints) than we’re trying to address with them today. And the band-aid layers we've applied over the years have made the network an overly complex and inflexible house of cards, at risk of coming down at the slightest mis-key.
[1] Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region - aws.amazon.com/message/65648/
[2] United blames outage on network connectivity issue, Bloomberg Businessweek, June 18, 2011 - www.businessweek.com/ap/financialnews/D9...
Monday, November 7, 2011, 10:22 AM
I’ve written a lot lately about how “broken” the network is. Some readers have asked for guidance on the best place to start to fix it. That’s a big question but a fair one. Let’s start by trying to agree on the key problems.
The networking technology used today for data center server/storage interconnection was created for a different purpose – to enable connectivity over arbitrary topologies and link speeds without regard to applications. That problem’s been solved. What we have today is a dynamic enterprise data center that provides the infrastructure for a wide range of different applications with varying requirements that change hour to hour and minute to minute. Understanding that, I think there are five core areas that need to be addressed – see which and why below. I’ll delve into each of these in more detail in future posts, but wanted to share my initial thoughts and get yours. Are these the right targets? Are there more urgent networking issues to be addressed? Please comment with your own suggestions or advice.
1. Layering
Layering (or stacks) was originally introduced to create abstractions from physical layer transmission. At each layer, information can be exchanged with a peer device without needing to know anything about the layer above or below. This approach has been very effective in dealing with the many different types of physical media and varying capabilities at the transmission level, and also for abstracting the physical network from the applications. In modern data center environments, layering is being taken to new extremes – and it’s not working well. Architects are trying to create data centers that have maximal fluidity of resources with optimal utilization of power, space, and capacity. Many of these initiatives challenge ordered/layered networking concepts. The result is a frustrating game of whack-a-mole: for every new challenge stemming from these new directions, we see a corresponding set of tunneling, tagging, and encapsulation methods and schemes, all meant to fix or patch the current stack. Examples include VEPA/802.1Qbg, VN-Tag/VN-Link/802.1Qbh, VxLAN, PBB, LISP, OTV, emerging NVGRE schemes, and surely countless others. While many of these approaches may have merit in isolation, developing new headers, tags, or tunnels every time we need to solve an application layer problem is simply not practical or sustainable going forward. We need a simpler approach that allows us to achieve fluidity at the network level, without an endless parade of band-aids on the existing paradigm.
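To see what “yet another header” means in practice, consider VXLAN, one of the schemes listed above. Its entire contribution is an 8-byte header (per RFC 7348) wedged between a UDP transport and the inner Ethernet frame. The sketch below builds just that header; it is for illustration only and omits the outer UDP/IP encapsulation.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348: a flags byte with
    the I (valid-VNI) bit set (0x08), 3 reserved bytes, the 24-bit
    VXLAN Network Identifier, and 1 final reserved byte."""
    assert 0 <= vni < 2**24, "VNI is a 24-bit field"
    return struct.pack("!B3s3sB", 0x08, b"\x00\x00\x00",
                       vni.to_bytes(3, "big"), 0)

hdr = vxlan_header(5000)
print(len(hdr))                         # 8
print(hex(hdr[0]))                      # 0x8
print(int.from_bytes(hdr[4:7], "big"))  # 5000
```

Eight bytes looks harmless, but each such scheme brings its own addressing, its own failure modes, and its own management surface – which is the “parade of band-aids” problem in miniature.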
(A great slide from Dave Meyer of Cisco at the recent ONS summit at Stanford depicted this quite well.)
2. Protocols
Protocols enable us to create multi-vendor networks in which every device speaks the same language. Protocols also help devices exchange information with peers to build an understanding of the broader context of the network from each device’s vantage point. The fundamental assumption is that the network should, on its own, figure out how to organize itself optimally by exchanging information between devices. Like layering, protocols have been very useful in the development of networking, especially as a way to allow multiple devices to work together somewhat seamlessly. The issue in today’s data center is that we already know what we need to know about the topology, the application requirements, the redundancy needs, and pretty much everything else. This does not mean we need to do away with protocols altogether; there are areas where protocols make a lot of sense. But in places where we already have the salient information about the environment, why guess with complex protocols designed to derive the right answer? Why not just start with the answer?
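“Starting with the answer” can be sketched directly: when the topology is known up front, next-hop tables can be computed centrally with a plain breadth-first search instead of being discovered by a distributed link-state protocol. The four-node topology below is invented for the example.

```python
from collections import deque

# Known topology (adjacency lists) - no discovery protocol needed.
topology = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def next_hops(src):
    """Return {destination: first hop on a shortest path from src},
    computed centrally by BFS over the known topology."""
    table, visited = {}, {src}
    q = deque((n, n) for n in topology[src])
    while q:
        node, first_hop = q.popleft()
        if node in visited:
            continue
        visited.add(node)
        table[node] = first_hop
        for nbr in topology[node]:
            q.append((nbr, first_hop))
    return table

print(next_hops("A"))  # {'B': 'B', 'C': 'C', 'D': 'B'}
```

The computation is deterministic and trivially auditable, in contrast to a distributed protocol whose converged state depends on message timing and tie-breaking across many devices.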
3. Oversubscription Models and Hierarchies
The whole notion of aggregation/distribution switching makes a lot of sense in the traditional networking context. Distribution allows for low cost-per-port fan-out at the server or host level, while aggregation allows for bandwidth oversubscription into the core to economize on core capacity. This works great for traditional client-server computing traffic models, especially fat-client models.
Today’s data center dynamics are very different. Most applications now run closer to thin-client (e.g. browser-based), and often a small amount of client data can generate huge amounts of computational activity on the server side (think of a typical search engine query) in which massive amounts of data are moved around. To manage this computational need, applications are now constructed as hundreds or thousands of coordinated pieces networked together. The resulting traffic flows are heavily east-west oriented (i.e. server to server) versus north-south (client to server).
Looked at in this context, oversubscription becomes very difficult, as it doesn’t provide the necessary east-west capacity. So while high port fan-out may still be required to connect many servers together, instead of oversubscribing uplink capacity we now need to maximize the available server-to-server capacity. By not oversubscribing, our once-economical hierarchical network starts to look more fully subscribed (closer to 1:1), assuming we can model the actual sustained and interstitial server-to-server needs and still provide some oversubscription. However, in many data centers, virtual machines can be moved throughout the data center dynamically, which means that the only way to meet the needs of the applications would be to build a full “fat tree” meshed fabric – a very costly (and largely wasteful) endeavor, and also a path with not much runway left (see #4 below).
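The oversubscription arithmetic is simple enough to show with numbers. The top-of-rack configuration below (48 x 10 GbE server ports, 4 x 40 GbE uplinks) is an illustrative example, not a specific vendor product.

```python
def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of total server-facing capacity to total uplink capacity.
    1.0 means fully non-blocking; higher means oversubscribed."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 48 x 10 GbE server ports feeding 4 x 40 GbE uplinks:
ratio = oversubscription_ratio(48, 10, 4, 40)
print(f"{ratio}:1")  # 3.0:1
```

At 3:1, any sustained east-west load above a third of the edge capacity congests the uplinks; driving the ratio toward 1:1 triples the required core capacity, which is exactly the cost problem described above.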
4. Evolution of Switched Hierarchies of Point-to-Point Links
Today’s model of creating switched hierarchies of links has a critical flaw. In order to maintain the mesh as we build wider data centers, we need to build taller aggregation layers (with more oversubscription as we build up). The only way around this is to build denser and denser aggregation layers, so as to avoid ever going beyond the capacity of a single aggregation tier. But is that really feasible? What happens when we do need to grow larger? How about when we need to lower the oversubscription ratio (as per #3 above)? Or when servers move from 10 GbE to 40 GbE or 100 GbE?
A more interesting question is: how long will this runway last? We used to be able to advance an order of magnitude in Ethernet switching technology every three to four years (witness the evolution from 1 Mbps to 10 Mbps to 100 Mbps to 1 Gbps switched Ethernet). Fifteen years ago we would certainly have expected to have achieved “Terabit Ethernet” switching capacities by now. But we have not. We’re going on close to 16 years of 1 Gbps Ethernet switching, and we’ve only recently reached the point where 10 Gbps switching is somewhat feasible at smaller scales. Building large-scale 10 Gbps (or 40 Gbps via 4x 10 Gbps) aggregation is still costly, complex, and power hungry. Many doubt we will ever have feasible solutions for true 40 Gbps or 100 Gbps switched Ethernet at the aggregation layer, never mind Tbps. At the least, we know that each turn of the crank is taking longer and longer; it is hard to say right now whether switched hierarchies will ever get beyond 10 Gbps in any meaningful way.
5. Network Orchestration
On top of all this talk of bits and bytes, and speeds and feeds, is what the true next step for networking might be: network orchestration. The term “orchestration” is now common data center parlance for the actions required to allocate, provision, and apply virtualized resources to meet the different needs of workloads running across the data center. Think of literally thousands of virtual machines and storage blocks moving (or growing and contracting) to a conductor’s wand. Server and storage admins have access to some very sophisticated tools to perform this orchestration, including modeling, provisioning, auditing, rollback, and dynamic “elastic” re-orchestration. The network also needs to participate in this orchestration. To do this, we need a network that is orchestratable and can work in concert with the same or a similar set of tools used for compute/storage orchestration.
So, those are the five areas I think are critical to fixing the network. What are yours? If we can gain some consensus on the key problems, we have a much better chance of solving them.
Thursday, November 3, 2011, 11:55 AM
In a previous Plexxi Pulse, I referenced a blog post by W.R. Koss – Scale Out Networking and Solving Big Problems Matter. Commenting on the network's ability to scale to meet the higher utilization of today's compute resources, Koss writes:
". . . I believe some smart people will find a way to put this amount of capacity into the network and the practices of the past will not be the practices of the future. If you like solving big problems, I just described your next career endeavor. I would also state that this type of network capacity growth changes the fundamentals of how networks are built – hence it could usher in a new set of leaders for networking."
While Koss is pointing to one specific aspect of the network, I think he also speaks (intentionally or not) to a more fundamental issue.
The current networking model was not built for today’s data center – and is the primary cause of many problems in these environments today. If ever a technology segment needed some glass breaking, networking is it. As server and storage technologies have transformed themselves, networking has stood still. The networking technologies we depend on today solved some very important problems – just not today’s problems. In our effort to retrofit them for today, we’ve made them overly complex and inefficient. And they remain ill-equipped to meet the current and future needs of the data center and cloud computing.
Despite all these inadequacies, the existing networking market is enormous and involves large incumbents that have way too much to lose to lead the necessary changes. Naturally, the incumbents view data centers as just another market to sell the same (slightly modified) products and technologies. As Koss suggests, the opportunity exists for a new set of leaders that can approach the networking problem without the encumbrance of protecting an existing cash cow; innovators who are not afraid to tackle big problems and challenge the status quo when it doesn’t make sense. It’s a big job for sure, but it also promises an exciting future for an industry that has done so little to distinguish itself in the recent past.
So, who is ready to break some glass?
Wednesday, November 2, 2011, 12:30 PM
Timothy Prickett Morgan’s Register article about flatter networks - “No more tiers for flatter networks; solving the east-west traffic problem” - raises some really good questions about the state of the network and how it needs to evolve to meet the current and future needs of the enterprise data center. The author really homes in on the problem: “There is a disconnect between data centre networks and modern distributed applications and it is not a broken wire. It is a broken networking model.”
So right! The current networking model is severely broken, and has been for some time. But the “flatter” networks as defined by some solution vendors just obfuscate the problem. I would suggest we take our thinking about the solution a few steps further.
Is My Flatter Flatter than Your Flatter?
What enterprises really need are flat networks, not “flatter” networks. While leaf/spine architectures may be flatter than the traditional 3-tier design, what we're really doing is collapsing gear, not layers. The resulting tree structure in a chassis (the spine) can reduce administrative overhead and cut down on latency and bandwidth inefficiency (depending on how it is architected), but it also creates very large, dense spine layers that consume lots of space and power and become very costly over time. This is not what users are looking for.
If we're really going for "flat," why not enable east-west bandwidth directly, without any hierarchy? As Prickett Morgan writes, the problem is to make east-west bandwidth available to accommodate inter-application traffic (versus client/server traffic). The leaf/spine, fat-tree, and Clos architectures he mentions try to emulate east-west capacity, but do so at a very high cost by trying to provide full (or close to full) bisection bandwidth. If we modeled actual application needs, we'd find that most of that bandwidth is wasted - or, to be more specific, stranded, so that when it is needed it isn't there.
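A back-of-the-envelope calculation shows how much bandwidth gets stranded. The server count, NIC speed, and peak-demand figure below are assumptions chosen purely for illustration:

```python
# Illustration of stranded bandwidth in a non-blocking leaf/spine fabric.
# All figures are assumptions, not measurements from a real data center.

servers = 1024
nic_gbps = 10

# A non-blocking fabric provisions bisection bandwidth equal to half
# the total NIC capacity (any half of the servers can talk to the
# other half at full rate).
full_bisection = servers * nic_gbps / 2
print(full_bisection)  # 5120.0 Gbps across the spine

# Suppose peak east-west demand is only 20% of NIC capacity:
peak_demand = servers * nic_gbps * 0.20 / 2
stranded = full_bisection - peak_demand
print(f"{stranded:.0f} Gbps ({stranded / full_bisection:.0%}) provisioned but idle at peak")
```

Under these assumed numbers, 80% of the fabric's bisection bandwidth sits idle even at peak demand - capacity that was paid for in spine ports, optics, and power.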
Routed vs. Flat Networks
We're starting to see a related debate on how best to implement these virtualized networks. While building "flat" networks may make sense in terms of where and how bandwidth is laid out to match server-to-server traffic flows, some customers still want to be able to build segmented networks. Traditional routed network designs have been vilified as incompatible with flat networks, and these routed designs certainly have become problematic for the fluid VM layer that most designers of virtualized data centers are striving to achieve. So, in a typical knee-jerk reaction, the industry has shifted to espousing fully layer 2 networks, using a variety of new schemes to overcome the traditional deficiencies of large layer 2 domains: large broadcast domains, poor troubleshooting, and scalability challenges. Others are eschewing the bandwagon and continuing to offer fully routed architectures, relying on still more "fixes" to overcome the rigidity of layer 3 hops (specifically the inability to easily move VMs around). But does it really come down to an either/or question?
In my opinion, neither approach addresses the fundamental problem that Prickett Morgan raises: a disconnect from the actual stuff we are networking - the applications. That is, the very science of networking (ordered, layered sets of protocols and abstractions) is the problem. More protocols and abstractions don't make the problem better; they just put off the real pain for another day. Until we start with a clean slate, and with the understanding that we can't build sensible networks in an OSI-stack vacuum, we're just rearranging the proverbial deck chairs.