Eighteen months of planning, identifying candidate solutions, writing business and technology requirements, building test plans, and finally testing hardware have transpired. We have successfully proven that all fabrics tested have the capabilities we need. All tasks can be accomplished on all platforms, and all features, functions, and protocols required from a modern data center function appropriately for our use cases. This turns out to be quite a conundrum. In every architecture bake-off I have previously participated in, there was usually a clear winner: a solitary product that clearly solved a majority of the business or technology problems, and others that did not. For the first time we had four products, and all four solutions by any technical measure go above and beyond our requirements.
We found ourselves in an odd place. We have to make and defend our decision. To do that, we need to actually determine how to qualify and quantify why we preferred our choice. And when asked what the differences were, it really came down to the philosophical approach each manufacturer took to solving higher level problems we face in the data center of tomorrow. I use tomorrow–not today–for good reason.
All platforms tested utilized 100G backbone connectivity between leaves and spines. Each manufacturer provided high-capacity leaves to interconnect storage and UCS Fabric Interconnects, as well as low-density 10/25 Gbps leaves and a 1G option for telemetry. Since the intent of a data center switch is to provide a transitive low-latency buffer between hosts, we evaluated the hardware to take measure of the approach each manufacturer used when it comes to forwarding, filters, and buffer.
There has been a lot of discussion over the past 8 years, as we moved from 10 Gbps to 40 and now 100 Gbps interconnects, about how TCP behaves at scale over these high-throughput networks. Most issues come in one of two flavors: how the network deals with different types of traffic, and how scaling flows and packet rates generates nearly unnoticeable “micro bursts” that cause enough drops to impact application performance while going unnoticed in most network performance monitoring systems.
Mice and Elephants
These different types of traffic are colloquially referred to as elephants and mice. An elephant is a flow which persists for a long period of time. By definition, this gives the elephant ample opportunity to leverage TCP windowing to scale up its utilization of its path through the network. Shorter-running, low-capacity flows, termed mice, complete the transmission of their payload much faster than the elephants. Due to their limited lifetime, mice have limited opportunity to increase utilization. In the event that two packets, one from an elephant and one from a mouse, show up at the same interface and both get dropped, both TCP stacks cut their window in half. This decreases the throughput of both the elephant and the mouse flow. However, the elephant is already significantly further ahead of the mouse in the throughput game, so the elephant is less impacted by the congestion event. Simply put, this results in a statistical preference given to elephants over time.
This preference degrades application performance on interactive applications, especially when the distance between hosts increases. Why does this matter? Up to 90% of all transactions in the datacenter are considered mice, and only 10% are considered elephants, which means 90% of the sessions created between hosts in the datacenter are degraded by 10% of the sessions. To make matters worse, short-lived connections are generally associated with protocols that facilitate user interaction, and long-lived flows are generally associated with server-to-server communications. So mice are end users clicking links in their web browser and getting frustrated at the slow load of their web page. Elephants are background processes that may take 20 minutes or 28 minutes, and in most cases no one notices.
Now imagine a bunch of elephants walking down a highway filled with mice. The mess on the highway in your mind’s eye right now is exactly the result in the network. Short-lived sessions get squashed out, resulting in mice having to be re-transmitted by the two hosts involved in the conversation. This causes longer delays for user interaction, which greatly degrades the user experience. So remember, every time someone clicks a link in their web browser, hundreds of cute little mice are being sent to their sure demise among the stampede of elephants.
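To make the statistical preference for elephants a little more concrete, here is a minimal sketch of a toy Reno-style model. It is not a measurement from our testing: the flow sizes, the initial window, the window ceiling, and the rounds in which the drops occur are all assumptions picked for illustration, and the function names are hypothetical. It counts how many round trips a flow needs to finish with and without a single drop.

```python
def rounds_to_deliver(total_segments, loss_round=None, init_cwnd=10, max_cwnd=1000):
    """Count RTT rounds needed to deliver a payload under a toy Reno-style model."""
    cwnd = init_cwnd
    slow_start = True
    sent = 0
    rounds = 0
    while sent < total_segments:
        rounds += 1
        if rounds == loss_round:
            # A drop halves the window and ends the exponential growth phase.
            cwnd = max(cwnd // 2, 1)
            slow_start = False
        sent += cwnd
        # Double per RTT in slow start, otherwise add one segment per RTT.
        cwnd = cwnd * 2 if slow_start else cwnd + 1
        cwnd = min(cwnd, max_cwnd)  # assumed ceiling imposed by the path
    return rounds


def slowdown(total_segments, loss_round):
    """Ratio of completion time with one drop to completion time with none."""
    return rounds_to_deliver(total_segments, loss_round) / rounds_to_deliver(total_segments)


# Hypothetical flows: a 100-segment mouse dropped in round 3,
# and a 1,000,000-segment elephant dropped in round 50.
mouse = slowdown(100, loss_round=3)
elephant = slowdown(1_000_000, loss_round=50)
print(f"mouse takes {mouse:.2f}x as long, elephant {elephant:.2f}x")
```

With these assumed parameters, the single drop stretches the mouse's completion time by a larger factor than the elephant's. The model ignores retransmission timers and modern congestion control variants, so treat it as an illustration of the trend rather than a prediction.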
And as if we were talking about unreal numbers or particle physics, new magical and previously unmeasurable events called microbursts come onto the scene. A microburst is the result of fan-in in the network, which is when multiple source interfaces attempt to deliver packets to a single destination interface, causing instantaneous congestion. If there isn’t enough buffer to hold the pending packets before delivery, packets begin to drop. Network performance monitoring systems deal in averages over time. The averaging of interface utilization, combined with how quickly a micro burst rises and dissipates, can easily allow these events to go unnoticed. Most flows never leave the data center, but the large majority of application users, at least at our organization, are at remote locations. As noted with Mice and Elephants, these micro events affect all transactions fairly within the event window.
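The way averaging hides micro bursts can be shown with a little arithmetic. The sketch below is only an illustration: the 5-minute polling interval, the 20% baseline load, and the thousand 1 ms bursts at full line rate are assumed numbers, not values from our monitoring system, and the function name is hypothetical.

```python
def average_utilization(burst_count, burst_seconds, baseline, interval_seconds):
    """Average link utilization over one polling interval.

    Bursts are assumed to drive the link at 100% for their duration;
    the rest of the interval runs at the baseline fraction.
    """
    burst_time = burst_count * burst_seconds
    quiet_time = interval_seconds - burst_time
    return (burst_time * 1.0 + quiet_time * baseline) / interval_seconds


# Assumed scenario: 1000 bursts of 1 ms each inside a 300 s poll, 20% baseline.
avg = average_utilization(burst_count=1000, burst_seconds=0.001,
                          baseline=0.20, interval_seconds=300)
print(f"reported utilization: {avg:.1%}")
```

Under these assumptions a thousand moments of full congestion move the reported figure by a fraction of a percentage point, which is why a utilization graph alone gives no hint that packets were dropped.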
The algorithm for choosing what to drop, however, gives no weight to the distance a packet has to travel. This is important to note, as the bandwidth-delay product of TCP connections at distance allows these micro bursts to unfairly impact packets traveling greater distances.
To understand why this is a problem, let’s go back to our road analogy. Now imagine if, say, Chicago did not have freeways. The traffic generated by people trying to get to and from work in rush hour now competes for the same stop signs and stop lights as those needing to get through the city to a location hundreds of miles away. If you are the person trying to get to New York from Seattle, having to compete with the locals through rush hour doesn’t sound very fun at all. In the same vein, without knowledge of how far the packet has to travel, any increase in packet drops in the datacenter, no matter how trivial, will impact the performance of remote users significantly more than local users.
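A rough way to see why distance matters is to estimate how long a flow needs to grow its window back after a single drop. The sketch below assumes classic Reno congestion avoidance (the window is halved on loss, then grows by one segment per round trip) and a window sized to the bandwidth-delay product. The 100 Mbps per-flow rate, the 0.5 ms and 50 ms round-trip times, and the 1460-byte segment size are assumptions for illustration; newer congestion control algorithms recover differently.

```python
def recovery_time_s(bandwidth_bps, rtt_s, mss_bytes=1460):
    """Seconds for a Reno-style flow to regain its window after one halving."""
    # Window needed to fill the path: bandwidth-delay product in segments.
    window_segments = bandwidth_bps * rtt_s / (8 * mss_bytes)
    # After halving, the window regrows by one segment per round trip.
    rounds = window_segments / 2
    return rounds * rtt_s


local = recovery_time_s(100e6, 0.0005)   # assumed in-datacenter RTT of 0.5 ms
remote = recovery_time_s(100e6, 0.050)   # assumed remote-site RTT of 50 ms
print(f"local: {local * 1e3:.2f} ms, remote: {remote:.2f} s")
```

Because both the window size and the time per round trip scale with the round-trip time, recovery time in this model grows with the square of the distance. A drop that a local host shrugs off in about a millisecond can cost a remote user seconds.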
Manufacturer Provided Switch Models
What follows is speculative, as we did not have an effective mechanism for testing for this problem, but it is included to showcase our concern over the lack of scale in buffer allocations across most industry-wide switching hardware.
Dell PowerEdge Fabric Switches
In support of the Cumulus software solution, Dell provided switching hardware based on the Broadcom and Intel Chipsets.
Cisco 9300 Series Fabric Switches
Both Cisco solutions use the same Nexus 9300 switching hardware. Cisco has proposed using the 93108T, 93180YC, 93280YC and 9336C for leaves and the 9336C for the spine. The leaves which touch UCS will utilize the 9336 and 93240. These switches utilize Cisco’s latest LS3600FX2 ASIC. This matters because word on the street is that the Cisco UCS business unit is targeting the FX2 platform with the ability to run UCS Manager as a container on the switch. This would eliminate the need for dedicated hardware for the Fabric Interconnect function. For the out-of-band network and anything in-fabric requiring one-gigabit connectivity, the 93108T was suggested. When it comes to buffer, TCAM and SRAM allocation, Cisco has taken a novel approach and has leveraged in-house design and engineering to create what they are branding as the Cloud Scale ASIC line. Cloud Scale includes the LSE, LS1800FX, S6400, and LS3600FX2 chips, providing 2 x 900 Gbps slices, 1.8 Tbps in a single slice, 4 x 1.6 Tbps slices and 2 x 1.8 Tbps slices. The thought is that they can gain competitive advantage by tightly coupling software to the hardware platform and providing hardware which can dynamically use on-chip resources in a variable number of ways based on the profile selected during the boot process.
The allocation of these resources among Forwarding Tables, Access Control Entries, and Packet Buffers, as well as what qualifies as an elephant or a mouse, can then be adjusted by setting options in a profile and choosing this operating profile in the configuration. When the switch is reset, those allocations and qualifications take effect. The proposition that Cisco makes here is that they have developed intellectual property as a dynamic algorithm in their ASICs that can utilize this profile to identify and de-preference elephant flows, giving favor to mice. Marketing has branded this function Intelligent Buffers, and there are several papers showcasing the benefits of this type of technology. The logic looks sound. Regardless, we couldn’t find anything out there debunking this feature’s ability to deliver on its promise of allowing shorter-running flows to better maximize the throughput of the path and deal less with constraints related to congestion control.
What this does not do is increase the total available buffer for packets when under contention. The argument here is that it is less likely that your application should suffer because buffers are allocated to provide a statistically “fair” distribution of drops under load. The problem is that, as far as we can tell, this only accounts for how long the flow has been running, not how much latency (distance) is between the two hosts communicating. So remote high-latency locations connecting to the datacenter still suffer from drops due to micro bursts. Due to the remoteness of our sites, latency greatly affects usability for a majority of the employees we have to support. For this reason, we would really like to see a larger buffer allocation in combination with Intelligent Buffers. Today, it appears that one or the other is provided, but not both.
Arista 7050SX and 7280R Series Switches
Arista’s solution utilizes a switching architecture leveraging the Broadcom Trident 3 for cut-through small-buffer switching and Broadcom Jericho+ for large-buffer cut-through switching. The option for a high-throughput large-buffer switch is specific to Arista as far as we can tell. The switch comes with 4.5 gigabytes of packet buffer, which is directly attached to the Broadcom chipset in their Jericho-architecture switches.
The running trend with Arista is simplicity. They definitely took this approach when it came to buffer. Instead of a statistical model or custom algorithms, they quite simply threw money at the problem. Buffer, TCAM, and SRAM are the heart of the expense in a switch, which is why all of Arista’s competitors are doing everything they can to avoid putting any more buffer than needed in their hardware. The beauty in just adding a huge shared buffer is that it provides a predictable model with predictable latency. It is quite literally the least complicated solution when designing hardware.
Moreover, one could theoretically achieve results similar to what Intelligent Buffers deliver through the use of a fine-grained queuing strategy. It would be a very complex policy set involving byte counts to qualify a match parameter, and while this does appear to be a function within the Broadcom SDK, we couldn’t figure out how to leverage it in the Arista QoS configuration.
Brains vs Brawn: The great buffer debate
Since 2010, Arista and Cisco’s marketing departments have been very busy battling over buffer and whose implementation is the most appropriate for mostly edge-case scenarios. The one benefit, as you’ll see, is how their competition has greatly improved both manufacturers’ product lines very quickly. The timeline follows:
NetworkWorld: Arista 7124S bests Cisco Nexus 5010?
Decreasing latency became very important at the turn of the century as Computed Model Trading was transitioning to High Frequency Trading and Wall Street traders were constantly trying to decrease latency to and from the exchanges. This meant lower-latency switches could begin to drive sales in the Financial vertical.
In early 2007, Arastra (later to be renamed Arista) announces their first switches to market and the Extensible (Network) Operating System. The 7124S and 7148S are 10G cut-through Ethernet switches, using a switching mode that has been around since the 90s, and feature 24 and 48 ports of 10 Gigabit Ethernet. This is innovative in that cut-through switching allows drastic reductions in latency, and was largely refined in FibreChannel fabrics and Infiniband networks.
Cisco releases the Nexus 7000 and 5000 Series switches in 2008. The 7000 is split into two functions based on line card and VDC allocation, wherein cut-through switching is done in F-Series Modules and store-and-forward switching with advanced protocols is done in the M-Series Modules. The 5010 and 5020 are Cisco’s first cut-through switches, decreasing latency drastically at the edge. Cisco’s initial designs and software are evolutions of the MDS Storage Directors and Fabric and the Catalyst 6500 Series, utilizing SAN-OS as a base for NX-OS (the Nexus firmware) and the passive backplane plus split function of fabric modules to provide a crossbar fabric with mezzanine. In the chase for the lowest latency possible, the fabric side line-cards and the 5000 series were constrained to 480KB per port and the 7000 Series F1 Modules
- In January of 2010, Network World publishes an article with industry wide test results for 10GE switches announcing IBM Blade and Arista are their top recommendations in both features and performance, noting issues in the Nexus 5010 and HPE 10G switch.
Cisco’s Nexus 5010 is the only switch tested with a complete story on data/storage convergence, and its lengthy features list includes some outstanding virtualization capabilities. But high latency, usability gremlins and multicast leakage all hampered the Nexus 5010 in this test.
Round 1: Cisco marketing on the defense
- Cisco commissions Miercom to compare the Arista 7124 and Cisco Nexus 5010 switches, where Cisco claims their switches provide better guarantees for delivery of packets in Financial Networks.
Cisco Nexus 5010 and Arista 7124S were tested with 128-byte sized frames. Because the test results were strikingly similar between the two frame sizes, all test results that are shown are based on 128-byte sized packets.
It is understandable that different traffic profiles used in testing can produce different performance results. Traffic profiles should include more bursty characterized traffic for testing of switching products that will be employed in environments, such as financial markets, that will have surges of high volume, of short duration, and small transaction type traffic.
Detailed test results follow and demonstrate the advantages in using the Cisco Nexus 5000 in a network environment that consists of high, bursty traffic.
- Arista responds with a writeup giving their perspective on industry myths regarding microbursts, and goes on to detail how the constraints in the Miercom test were not in fact real-world conditions but ideal conditions from which the Nexus 5010’s architecture directly benefits. They also release a white paper discussing the effects of micro bursts on application performance in 10GE networks.
The two architectures are substantially different when it comes to buffering. The Nexus 5000 will perform better when most ports are congested at the same time. Arista 7100 switches will perform better when some ports get congested at any given time, which is the common case in most networks.
Nexus buffer is allocated in chunks of 140 bytes. Thus by picking packet sizes that fit nicely in the Nexus architecture, Cisco was able to achieve better performance. 
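Arista's point about the 140-byte allocation unit can be checked with a few lines of arithmetic. This is a sketch under a simplifying assumption that is not stated in either vendor's material: each packet occupies a whole number of 140-byte chunks with no additional per-packet overhead. The function name is hypothetical.

```python
import math

CHUNK_BYTES = 140  # allocation unit quoted in Arista's response


def buffer_efficiency(packet_bytes, chunk_bytes=CHUNK_BYTES):
    """Fraction of allocated buffer that holds packet data (assumed whole chunks)."""
    chunks = math.ceil(packet_bytes / chunk_bytes)
    return packet_bytes / (chunks * chunk_bytes)


# 128 bytes is the frame size used in the Miercom test; the rest are for contrast.
for size in (64, 128, 141, 1500):
    print(f"{size:>5} bytes -> {buffer_efficiency(size):.1%} of allocated buffer used")
```

Under that assumption a 128-byte frame fills over 90% of its chunk, while a 141-byte frame spills into a second chunk and uses roughly half of what it was given. That is the sense in which the chosen test frame size flatters the architecture.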
Cisco announces the Nexus 3000 and Nexus 5500 platform in 2011, moving to a shared buffer model, as opposed to per port, and increasing buffer capacity to 32MB.
- Miercom publishes Cisco-commissioned test data for the Nexus 3064, their lowest latency switch, targeted at the high-frequency trading and financial markets. Miercom makes the claim that the Nexus 3064 “demonstrated advantages over other products” and performs better than “similar switches they have tested in the past,” while not naming any particular vendor or product:
The Cisco Nexus switches performed exceptionally well and demonstrated advantages over other products we have evaluated, particularly in environments where traffic surges occur. The Cisco Nexus 3064 has proven it can support equal or better traffic throughput while maintaining a lower average latency.
The Nexus 3064 performed better than similar switches we have observed in prior testing, showing both low latency and low jitter. With its low latency capabilities, packets can be processed faster without error allowing more data to be processed through the network.
When compared to features of similar products, the Nexus 3064 was able to offer all of their features plus numerous extras, as well as several more Cisco proprietary protocols 
- Network World correlates the Nexus 3064 test as Cisco’s response to the January 2010 article, pointing out that Wall Street is an important market from a network vendor’s perspective. The implication is that Cisco is attempting to regain the foothold lost to the poor performance of the first-generation Nexus 5000.
Cisco’s High-Performance Trading Fabric includes existing switching products like the recently-announced 3064 and 5500 platforms.
“Cisco is worried about Arista in that market,” says Zeus Kerravala of the Yankee Group. “But it’s a little out of character for Cisco to respond like this,” with the High-Performance Trading Fabric launch. “Cisco’s always believed as goes Wall Street, so goes the rest of the world – it drives sales in other verticals. If you want to see where networks go, go to Wall Street.” 
Round 2: Arista Marketing on the Offense: Bigger Buffers
Arista makes the claim that, at scale, Big Data suffers from packet drops and retransmissions that limit the throughput of Big Data applications.
- Arista releases a white paper claiming that the 7508E and 7050 can provide increased performance for Big Data applications due to the increased buffer available in their deep buffer switches and shared buffer memory in their shallow buffer switches, as opposed to fixed port buffers on competitors’ hardware (assumed to be the Nexus 5010).
The design of the network fabric has a major effect on application performance, and switch buffers can have a major impact in this regard for big data clusters.
Packets will drop when buffers are exhausted, causing retransmissions and making throughput suffer, thus turning ‘good-put’ into ‘bad-put’. Observations made in realworld large-scale (defined as greater than 1K nodes) big data clusters indicate peaks in buffer utilization in excess of 40MB.
- Cisco enters the SDN market by completing the purchase of Insieme Networks in 2013, whose technology would eventually come to market as ACI and the Nexus 9000 Series Switches. The Nexus 9508 Chassis and Nexus 9396PX are announced in November. The 9508 is Cisco’s first Nexus switch with “deep buffer” in the marketing materials, and the 9396PX increases the buffer from 32 MB in the Nexus 5000 series to 50 MB of shared buffer.
- Cisco commissions a Miercom report claiming the Cisco Nexus 9508 can do more while consuming less power than the Arista 7508E.
Based on the test findings for throughput, latency and IP frame variance (jitter), the Arista 7508E switch lacks consistency in performance. Considering the requirements for a data center class switch to provide high throughput, low latency, low delay variance and consistent performance, the Cisco Nexus 9508 has proven to be a more appropriate platform to be deployed in a mission critical data center. In particular the Cisco Nexus 9508 is an optimally designed aggregation/spine switch that will provide high-density and high-performance 40GE connectivity.
- Lippis Enterprises releases a Lippis Report claiming the Arista 7508E offers the highest density, lowest latency, best congestion management, and least power consumption of any switch in its class. Granted, they did not test any Cisco gear, and they largely discount the switch’s inability to provide wire speed with packet sizes of 64 and 256 bytes. That said, this would be an issue only in extreme corner cases, as it is extremely unlikely that all packets showing up on an interface would be either 64 or 256 bytes in size, though average packet size on the general internet is 192 bytes.
The Arista 7500E is the only spine switch that offers both L2 and L3 forwarding at 288 40GbE scale and at ultra low latency. It’s also the only modular switch we have tested that offers 10/40 and 100GbE line card options. It offers the best congestion management system measured to date, thanks to its generous buffer allocation and VOQ buffer algorithm.
Round 3: Cisco adds Intelligence
With the Insieme Networks purchase, Cisco gained access to intellectual property which abstracts the functional components that make up the forwarding plane (buffer, TCAM and SRAM) into logical components on their LSE ASIC. This allows some level of post-production configuration of the pipeline through the switch, giving them the ability to handle buffer allocation and utilization with higher-level code in software: programmable buffers, coined Intelligent Buffers by their marketing department.
- Cisco commissions Miercom to validate their newly designed buffering algorithm against Arista’s 7150S, making the claim that the additional buffer and intelligent queuing strategy in the 9396PX does in fact allow it to absorb 7X the burst that the Arista 7150S can. This Miercom report and bake-off is the first we hear of “Intelligent buffering.” You have to give Cisco credit here. This was a huge marketing win and sets the stage for all that will come.
With a typical or default configuration, using only the default queue, the Arista 7150S-52 offers a maximum buffer space of 4.66 MB (Megabytes) that is accessible to user traffic. The Cisco Nexus 9396PX switch exhibited a combined 31.74 MB of buffer space available for user traffic–about seven times the buffer size of the Arista 7150S-52.
Miercom independently substantiates the superior performance of the Nexus 9300 with regards to buffer management and the handling of bursty traffic flows, accommodating longer-duration bursts and, in so doing, minimizing packet loss.
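To put the two buffer figures from the Miercom report into terms of time, here is a back-of-the-envelope sketch of how long a fan-in burst can last before a buffer overflows. The 3:1 fan-in at 10 Gbps is an assumed scenario, the entire buffer is assumed to be available to the one congested port, and a megabyte is taken as 10^6 bytes; none of that comes from the report, and the function name is hypothetical.

```python
def absorb_time_s(buffer_bytes, senders, port_bps):
    """Seconds a buffer can absorb fan-in before it overflows.

    Assumes `senders` ingress ports all transmit at line rate toward one
    egress port of the same speed, so the queue grows at (senders - 1) * rate.
    """
    excess_bps = (senders - 1) * port_bps
    return buffer_bytes * 8 / excess_bps


arista_7150s = absorb_time_s(4.66e6, senders=3, port_bps=10e9)
cisco_9396px = absorb_time_s(31.74e6, senders=3, port_bps=10e9)
print(f"4.66 MB lasts {arista_7150s * 1e3:.2f} ms, "
      f"31.74 MB lasts {cisco_9396px * 1e3:.2f} ms")
```

The ratio between the two is simply the ratio of the buffer sizes, a little under seven, which is where the "7X" headline comes from. In absolute terms both numbers are on the order of milliseconds in this scenario.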
- In August of 2014, Arista publishes a whitepaper making the claim that large buffers improve performance in IP-based storage networks. The claim here is pretty simple: the availability of deeper buffers allows a lossless fabric, minimally increasing latency in the storage protocol while preventing retransmissions.
To validate the importance of deep buffers at both the leaf and spine layers, one leaf switch is an Arista 7280SE switch which has 128MB of buffers per 10G port, while the other is the relatively shallow buffered Arista 7050 with a shared buffer of 9MB for all ports. At the spine layer, an Arista 7504 is used configured with 48 MB of buffer per 10G.
To simulate shallow buffering across the board, the 7280SE and 7504 were manually configured so as to have 1 MB/10G port of buffering available. The deep-buffered network was oversubscribed at a ratio of 5.33:1, while the shallow-buffered network saw a lower oversubscription ratio of only 4:1.
Running traffic at maximum capacity, TCP retransmits were measured. Looking at the retransmits in this simple 2 leaf east/west network, even with the very conservative approaches to buffer sizing.
The retransmits increase dramatically when lowering the buffer size to a level far greater than many datacenter switches can support today. This data shows that supporting high rates of east-west traffic requires large buffers at both the leaf and spine layers.
- Cisco commissions Miercom to test flow completion times in a data mining application network as throughput increases on a link. The intent is to showcase how Cisco’s intelligent buffering improves performance for mice by providing a low-latency path at the cost of drops, in comparison to using simple deep buffering. They model this by setting a fixed distribution of flow lengths wherein elephants make up only 5% of the throughput and mice make up the other 95%.
Expediting mice flows and regulating the elephant flows early under the intelligent buffer architecture on the Cisco Nexus 92160YC-X and 9272Q switches can bring orders of magnitude better performance for mission critical flows without causing elephant flows to slow down.
Intelligent buffering allows the elephant and mice flows to share network buffers gracefully: there is enough buffer space for the bursts of mice flows while the elephant flows are properly regulated to fully utilize the link capacity. Simple, deep buffering can lead to collateral damage in the form of longer queuing latency, and hence longer flow completion time for all flow types.
As a conclusion, the testing results through the Cisco Nexus 92160YC-X and 9272Q switches validated that the algorithm-based intelligent buffering and scheduling approach address the real-world network congestion problems caused by traffic bursts more efficiently, and demonstrated overall better application performance in comparison to the deep buffer approach represented by the Arista 7280SE-72 switch in this test.
- Cisco commissions Miercom to test Big Data across Cisco Nexus 9272Q, Nexus 92160YC, and Arista 7280SE-72. The test concludes that there was no difference in performance, and hence big buffers are not required when compared with Cisco’s Intelligent Buffering.
With the benchmarks that were run as part of this study, there was no discernable difference in the performance of the cluster with different switching platforms. Performance was measured primarily as job completion time for the DFSIO, TeraGen and TeraSort tests.
A large buffer is not required for dealing with the Hadoop tests. All buffer-occupancy figures revealed that average and maximum buffer utilization in all Hadoop test cases were very low. Across the entire set of benchmarks, the maximum instantaneous buffer occupancy that was noticed was under 15% of available buffers. 
- Arista publishes a whitepaper detailing how big-buffer switches decrease Flow Completion Times and Query Completion Times in Leaf Spine Cloud Topologies giving big-buffer switches between a 50 and 60 to 1 performance advantage.
We found that flow completion times (FCT) increased dramatically for small-buffer switches as the load on the spine increased. At 95% network loading and 95% flow completion the FCT with small-buffer switches increased to 600 milliseconds. With big buffer switches under the same conditions, FCT remained at less than 10 milliseconds, a 60:1 performance advantage.
Under 90% network load, and for 90% query completion, the small buffer network took more than 1 second. In contrast, under the same loading conditions, the QCT with big buffer switches best performance was a mere 20 msec, a 50:1 performance advantage.
Buffers in Summary
So we sifted through a significant amount of information only to feel like most of this is marketing material. On multiple occasions, both Cisco and Arista, when defending their own position, counter that the difference is marginal. Technically, Cisco has provided a single solution across their entire platform. This is a massive benefit to its customers as it simplifies their purchase decision. Cisco has also shown that their Intelligent Buffer algorithm delivers marked improvement in the handling of packets over their previous iteration of hardware. However, the novelty and benefits of that intelligence are reflected in the list price of their product.
By comparison, Arista has multiple options in their hardware lines targeting various use cases. This does constrain the deployment of compute and storage to PODs with the correct switching platform, but with appropriate planning and guidance in deployments in the datacenter any concerns along these lines could be rendered moot. We have the benefit of having a datacenter that is already organized into separate PODs that match and reflect the benefits of the various switch lines. It is also worth noting that there is definitely some level of insurance in just having a large buffer, as it’s much harder (or not feasible) to add buffer after the purchase, but one can always short stroke the allocation. One last note on this: what Arista considers its shallow-buffer switch still has more buffer than its competitors, and is drastically less expensive than the competition. Only when comparing their deep-buffer switches to competitors does the hardware land at a similar price point.
As was pointed out by our sales engineer when presented with the aforementioned references,
“You could argue that Arista can provide the same single product solution” when comparing against the Cisco Nexus 9000 Series (sic) “versus POD specific selections by simply deploying deep buffer (7500/7800/7280) at Spine and TOR instead of mixing and matching hardware types. Differing TOR solutions allow you to take workload into account and be a good steward of budget by not adding overkill into the design.”
At the end of the day, it really depends on how much weight you put on the Cisco-Miercom studies vs the Arista-Lippis Reports. From our perspective, we took the stance that all the reports provide great information that is really helpful when choosing hardware within a manufacturer. They did not, however, convince us that one manufacturer’s design and implementation was significantly better than the others.
The application of this philosophy really shows through in the wide range of how the manufacturers have engineered the interaction with the various products.
Cumulus: Distributed Linux Fabric
Cumulus Networks is the Linux administrator’s distributed fabric. Their approach to interaction when configuration is applied to the fabric is perfectly in tune with a Debian-based distribution of Linux. Effectively, they simply change the supported device from a cluster of computers to a cluster of switches. The network operating system is simply a collection of simple tools designed to interact cohesively with each other, running in the shell of the engineer’s choosing. They have chosen to customize open protocols to improve usability and accommodate their use case for holistic management of the datacenter fabric.
Arista: Distributed Programmable Switching
Arista has actually taken a very different approach, one that almost seems archaic in the times of Controllers and Centralization. Their switches operate in a traditional manner, distributed and independent of each other or any configuration management system, a fine testament to the end-to-end principle. They exchange information with each other using standard open protocols and each operates autonomously. Most importantly, the switches are the single source of truth, as opposed to a centralized cluster of controllers. CloudVision is simply a compliance and visibility system which, due to how Arista EOS (Extensible Operating System) was designed, can interact with the switches as a single system.
Cisco ACI: Network Policy Object Framework
Cisco’s Application Centric Infrastructure is a purpose-built platform realizing their vision that the modern data center is built from applications and policy. ACI leverages the same object-oriented programming concepts a software developer leverages in C++ or Java. Through abstraction, objects representing configurable network elements are instantiated, bound to other objects using inheritance or encapsulation, viewed, modified, and destroyed. The fabric is simply the application of the policy and elements programmed at the APIC (Application Policy Infrastructure Controller); it is nearly a stateless system. And while the fabric can live on without the APIC’s existence, their destinies are intertwined like two quarks in quantum entanglement. The physical switches could as well be considered lightweight APs in a controller-based wireless environment. Without the APICs, they are meaningless ports of ethernet, power supplies and fans.
Cisco DCNM: Configuration Management
Datacenter Network Manager takes a more traditional approach to the configuration management promised by Cisco’s APICs in the ACI model. As a central source of truth, DCNM provides an in-depth templating solution for nearly every configuration task in the datacenter. These templates generate traditional textual CLI configuration, which is then applied via NETCONF/YANG over BEEP. It takes the necessary technology steps to provide deployment of configuration, rollback, and the other necessities while still leaving the foundations seasoned network engineers know and understand.
During our testing, at several stages we were required to interact with vendor support staff. This section documents those events. Names and faces have been changed to protect the honest hardworking folks in the various TAC groups.
Cumulus NOS on Dell Hardware and their Support
Dell partnered with Cumulus to participate in our Cumulus POC. They provided seed gear as well as some newer hardware to flesh out the POC fabric. The Dell and Cumulus engineering teams were top notch and extremely supportive from the beginnings of the POC through the fabric build.
One bad account manager ruins the whole bunch
As the POC came to an end, we needed to start working on the financials, what support would look like, the bill of materials, and the legal work. Because Cumulus is smaller than its competitors, with a significantly smaller customer base in the central part of the United States, the Account Manager assigned to us was from the east coast. Our guess is that his other accounts are in the business of building and managing data centers, and are not mid-cap enterprise energy companies.
Either way, the account manager was anything but helpful and gave us the impression that his time was worth more than what our deployment had to offer. Needless to say, we had a really bad experience with the Cumulus Sales Staff. I was particularly hurt by this because I expected Cumulus would be the most open and easy to work with.
Dell and Cumulus engineering for the win
I would like to reiterate: Dell’s account manager was extremely helpful with hardware specifications, design, and what the hardware support looked like, and both the Dell and Cumulus engineering staff were wonderful to work with.
We ran into two bugs and a caveat when working through the Cloud Vision and Arista EOS POC, as noted below. I cannot adequately describe the level of service provided by Arista. It reminded me of the support you received from Cisco and Sun in the late 90s and early 2000s. The account team was extremely helpful. Their Sales Engineer easily put in 160 hours with us working through various designs and the benefits and pitfalls of each.
Arista Engineers were quick to be honest about any shortcomings or limitations when it came to features that existed but were not up to what they felt was enterprise worthy. On two occasions customized remote POC sessions were created on short notice to show us features that specifically benefit our business lines. To call the POC catered towards us would be an understatement. It was apparent that the engineering teams we worked with at their corporate offices when testing some of the niche products were directly involved in writing the code we were testing. They treated their showcase as if they were showing off a child, and welcomed and received our feedback with poise and grace.
On three occasions we contacted the Arista TAC for configuration support. They were patient and supportive in helping us get our configuration vetted.
Support Case 1: TermAttr failure
The first time we contacted Arista, it was with our Sales Engineer as we were attempting to get Cloud Vision to connect to the switches so we could begin deploying configuration. Several of the switches came with an old version of TermAttr, which was preventing the TermAttr switch daemon from connecting to Cloud Vision. They walked us through the process to get TermAttr upgraded manually and even shared with us a new, not-quite-released version of their Fabric Builder to test with in the POC.
Support Case 2: Storage, what storage?
The second time we called in, we had pulled the storage out from under the Cloud Vision cluster. The gentleman on the phone almost immediately stated we should just build it from scratch, because we might go through a bunch of effort only to have to do that anyhow. Not knowing the implications of rebuilding the fabric, we asked to proceed with the recovery instead. He patiently walked us through attempting to recover Cloud Vision. In the process we learned a lot about how Cloud Vision’s internals worked. Hours in, as he predicted, we had completely nuked the ability to get Cloud Vision back. We pulled a backup from the CLI and did the rebuild.
Support Case 3: Multicast Support in the Overlay
The third time we called in, we were attempting to test an as yet unsupported use case, where multicast would traverse the EVPN overlay in a Symmetric Routing Design. The TAC engineer worked with us for nearly 8 hours before figuring out that everything was actually working and that the problem was that the TTL of the packets being sent by the source was set to 1. Again we learned a lot about EOS and its flexibility, but what can’t be spoken of enough is the Arista TAC’s willingness to support the end customer even in something that’s “not supported.”
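For anyone chasing the same ghost: most IP stacks default the TTL of outgoing multicast packets to 1, so they never survive the first routed hop unless the sending application raises it. The following is a minimal Python sketch of how a sender could set and verify that value. The group address and port are hypothetical placeholders, and this is not the application we were actually testing with.

```python
import socket

# Hypothetical multicast group and port, for illustration only.
GROUP = "239.1.1.1"
PORT = 5000


def make_multicast_sender(ttl: int) -> socket.socket:
    """Return a UDP socket whose outgoing multicast packets carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Without this call the multicast TTL typically stays at 1, which keeps
    # the traffic on the local segment and out of any routed overlay.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def multicast_ttl(sock: socket.socket) -> int:
    """Read back the multicast TTL currently configured on the socket."""
    return sock.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL)


def send_test_packet(sock: socket.socket, payload: bytes = b"hello fabric") -> int:
    """Send one datagram to the (hypothetical) group; returns bytes sent."""
    return sock.sendto(payload, (GROUP, PORT))
```

Checking `multicast_ttl()` on the source, or looking at the TTL field in a packet capture at the first-hop leaf, would have saved us most of those 8 hours.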
TAC Impact Escalation
On two occasions we ran into actual issues which required us to interact with TAC.
Impact Event 1: Speed Change
The first time was due to a bug surrounding speed change on the PHY, wherein the change doesn’t take effect without shutting down the port and bringing it back up. A bug ID was issued on 4.22.0F (which was bleeding edge at the time). A patch was made available for the latest release (18.104.22.168F), which we upgraded to first. Then the mainline release that included the patch came out less than two weeks later as 4.22.1F. It was the fastest I have ever seen something go from problem to buttoned up, production release… ever.
Impact Event 2: TwinAX Failure
Our last interaction with TAC was when someone on our team was unplugging some cables and caught a TwinAX Fiber cable on his sleeve, kinking the fiber. As the cable was no longer functional, we turned in a TAC case. Support contacted us within 5 minutes to verify the part number, and a replacement was shipped with tracking information within the hour. The part arrived before I did the next morning. After replacement, we enclosed the old part in the shipping materials the new one came in, slapped the provided label on the outside, and shipped it back to Arista. Nothing spectacular, just efficient and effective: all you can ask for.
We had very little interaction with Cisco directly during our POCs. This was largely due to Cisco’s requirement to use a partner to implement ACI. At first we believed this was an unnecessary expense, leaving us extremely frustrated that Cisco was making us jump through this hoop. As we worked through the initial deployment of the fabric, it became apparent that a qualified partner is a must.
Push Button Build
We had been through a half dozen Labs and POCs on ACI in the past few years and were quite confident that a partner would not be needed. We quickly realized the naivety of this as we walked through the build process. For the first few years, ACI marketing in the labs at Cisco Live touted the simplicity of setup as “push button” and likened the install to wireless controllers and access points. This is due to the use of Power-on Auto Provisioning and Zero Touch via DHCP with fabric discovery by the APIC. As the product matured, it became apparent that Cisco has backed off on this messaging significantly in favor of quality control in installations.
At the end of the day, data center networks are just too complex to ever be “push button.” Specific to the programmatic nature of ACI, a lot of the decisions that would be made as you went through the evolution of a data center build must be made at day zero. It is just unreasonable to think that you can make all these decisions up front without a near expert level understanding of ACI. In the labs and POCs we had attended, this legwork had already been done to showcase the usability of ACI, and hence one misses out on the level of complexity in the build process. We greatly appreciate Cisco for pushing us to contract with a partner for the build, as it was money well spent regardless of the outcome of the POC. Our partner was extremely helpful in both walking us through the ACI and DCNM builds and in showing us where we needed to avoid major pitfalls in deployment.
DCNM TAC Support
With Data Center Network Manager we did run into one issue requiring escalation to TAC, in which we were attempting to apply a change and roll it back to test the configuration management aspects of the software.
We added a VLAN and pushed the configuration to the leafs. Then, when we tried to roll back to the previous configuration, the removal of the VLAN from the VPC peerlink prevented the peerlink from being rebuilt. This left the switch in a state that prevented DCNM from seeing the switch as manageable. We turned in a TAC case, and the TAC engineer was equally stuck. We pulled the state information out of the switches and DCNM and uploaded it to Cisco to resolve.
TAC did replicate the configuration in their lab and couldn’t find any reason why it wouldn’t have allowed rollback on our equipment in our lab. We ultimately didn’t have time to test further because the lab needed to be built to showcase to the team the following day, so we just factory reset all the switches and rebuilt DCNM.
We didn’t really give TAC the opportunity to get the issue resolved, however the fact remains that in a production environment there is no simple way to force DCNM to resynchronize with the switch without impact and possible downtime. This had us very concerned about the software really being production ready.
It’s Linux. I don’t think I can express this simply enough. If you can dream it, you can do it. Want to write automation or a native service to control flows in Rust or Go? Have at it. Python? Yes sir. Want to offload to a prebuilt automation engine? Salt, Puppet, Chef, or OpenDaylight are ready for deployment. That being said, we don’t have the time or space to cover all that could be done, so we’ll leave those with imaginations running wild right now to their own devices.
Go download Cumulus. It’s free. Walk through their lab. You will learn more getting to know how it works than I could possibly explain here. No novelties, just Linux on a switch or more Linux on a cluster of switches, and just like your favorite Linux distribution, building the cluster is 100% of the fun.
Arista Cloud Vision and Extensible Network Operating System
Cloud Vision is tight and light, with a minimal interface which gives you the ability to manage your switch inventory, perform provisioning functions both in firmware and in configuration, and track CVE and configuration compliance across the fabric. It does feel a little feature incomplete, but when they deliver on something it completely feels like they were asking, “What would a network engineer need to do?” The interface is built from the bottom up around a REST API, so anything you can do with the click of a mouse can be done with a REST API call. Whether it’s provisioning, changes, configuration, rollback, or upgrades, every function is right where the engineer would expect it. Most elements can be manipulated from a context menu with a right click of the mouse. There is ample telemetry visibility within Cloud Vision, and keeping snapshots of fabric state is as easy as specifying a command to be run on a schedule. The workflow is simple and straightforward, even in regards to extending the built-in functions of Cloud Vision.
Adding the ability to auto-provision elements of the fabric is almost too simple. Out of the box, ZTP is available as an API call at http://cloudvision.domain.local/ztp/bootstrap, where cloudvision.domain.local is either the DNS-resolvable hostname of the Cloud Vision host or the Cloud Vision cluster IP. If you already have a centralized DHCP and BOOTP server, you can simply add DHCP option 67 (known as bootfile), with the modified URL as its content, to the network hosting the switches’ management interfaces. If you don’t want to depend on the corporate DHCP server, Cloud Vision can self-host the full ZTP process with the built-in ISC DHCP server by adding a small configuration to support booting hosts. This is most likely the preferable method, as it allows the Cloud Vision management interface and switch interfaces to be isolated on a separate out-of-band telemetry network. Unfortunately, this cannot be done from the GUI; it requires the user to log in to Cloud Vision via SSH, drop to /bin/bash, update /etc/dhcpd/dhcp.conf, and enable and start the dhcpd-server via systemd. Any switch booted from ZTP shows up in an undefined container.
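To make the DHCP side concrete, here is a minimal Python sketch that renders an ISC dhcpd subnet block carrying the ZTP URL as option 67, which ISC dhcpd spells `option bootfile-name`. The subnet, address range, and router below are hypothetical documentation addresses, and the stanza is a bare-bones assumption rather than the configuration Arista ships or recommends.

```python
import ipaddress


def ztp_subnet_stanza(subnet: str, range_start: str, range_end: str,
                      router: str, bootstrap_url: str) -> str:
    """Render an ISC dhcpd subnet block that offers the ZTP bootstrap URL
    as DHCP option 67 (bootfile-name)."""
    net = ipaddress.ip_network(subnet)
    # Refuse addresses that fall outside the management subnet.
    for addr in (range_start, range_end, router):
        if ipaddress.ip_address(addr) not in net:
            raise ValueError(f"{addr} is not inside {net}")
    return "\n".join([
        f"subnet {net.network_address} netmask {net.netmask} {{",
        f"  range {range_start} {range_end};",
        f"  option routers {router};",
        f'  option bootfile-name "{bootstrap_url}";',
        "}",
    ])


if __name__ == "__main__":
    # Hypothetical out-of-band management network.
    print(ztp_subnet_stanza(
        "192.0.2.0/24", "192.0.2.100", "192.0.2.150", "192.0.2.1",
        "http://cloudvision.domain.local/ztp/bootstrap"))
```

The same option line works on a corporate ISC DHCP server; other DHCP servers expose option 67 under their own names.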
When applying or modifying configlets or rolling back configuration, the Cloud Vision GUI makes configuration management extremely simple. There is a built-in editor in the web interface that allows the engineer to write both simple configlets and config builders, which we will delve into further in Programmability. There is also the ability to test config builders and configlets against the actual switch from within the UI, to determine the outcome when executed. And when applying or rolling back configuration from the UI, the engineer is given color-coded diff output of the configuration, making it easy to pinpoint what is changing. To aid in rolling back configuration, there is a timeline allowing one to walk through what was done and see the individual configuration applied at each node; a rollback can also be executed from the context menu when a node is right-clicked. Auditing your configuration is two clicks: right click on the “tenant” and click “check compliance.” Switches that are out of compliance are highlighted yellow. Viewing the configuration of the highlighted switch will show the difference between what is applied and what is running.
The streaming telemetry generated by TermAttr at the switch provides Cloud Vision with millisecond resolution of state information. All of this information can be visualized in time-series within Cloud Vision. This feature is incredible when it comes to troubleshooting or tracking down events that could be causing issues. The ability to track the historical source-learning of MAC addresses feels like a feature that should have existed for years in competing network management platforms. And if there is something that needs to be tracked but isn’t provided in the telemetry data natively, scheduled snapshots can be configured. Snapshots are effectively a point-in-time copy of the state of the fabric. They can be set up to run both on a schedule and whenever configuration or firmware is deployed. Leveraging this is as simple as choosing the commands you wish to have executed on the switches, selecting the switches, and setting the frequency or occurrence of when they should be run. Cloud Vision then collects this information and provides a time-series graph showing when the data changed.
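The compliance check boils down to a diff between the designed configuration and what the switch is actually running. As a rough illustration of the idea, and not of how Cloud Vision implements it, the following Python sketch uses the standard library’s `difflib`; the function names and the sample configuration are made up for the example.

```python
import difflib
from typing import List


def config_drift(designed: str, running: str) -> List[str]:
    """Return unified-diff lines between the designed and running configuration.
    An empty list means the two are identical."""
    return list(difflib.unified_diff(
        designed.splitlines(), running.splitlines(),
        fromfile="designed", tofile="running", lineterm=""))


def in_compliance(designed: str, running: str) -> bool:
    """A switch is 'in compliance' here when there is no drift at all."""
    return not config_drift(designed, running)


if __name__ == "__main__":
    designed = "vlan 10\n   name web\n"
    running = "vlan 10\n   name web\nvlan 20\n   name rogue\n"
    for line in config_drift(designed, running):
        print(line)
```

Lines prefixed with `+` exist only on the switch, and lines prefixed with `-` exist only in the design, which is the same information the yellow highlight is pointing you to.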
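Conceptually, the snapshot timeline is a list of timestamped command outputs collapsed down to the moments where the output changed. The Python sketch below shows that reduction under the assumption that each snapshot is a simple `(timestamp, output)` pair; it says nothing about how Cloud Vision stores or processes snapshots internally.

```python
from typing import Iterable, List, Tuple

Snapshot = Tuple[float, str]  # (timestamp, raw command output)


def change_points(snapshots: Iterable[Snapshot]) -> List[Snapshot]:
    """Keep only the snapshots whose output differs from the one before it."""
    changes: List[Snapshot] = []
    previous = object()  # sentinel that never equals a real output
    for timestamp, output in sorted(snapshots, key=lambda s: s[0]):
        if output != previous:
            changes.append((timestamp, output))
            previous = output
    return changes
```

Feeding it hourly outputs of a single show command would return just the first sample plus every later sample where something moved.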
In 2014, Arista introduced Smart System Update as a more robust answer to the industry’s attempts at In-Service Software Upgrade. This is effectively three functions:
- Maintenance mode of a spine switch: A single EOS command is issued, which begins this process on a spine switch. The spine switch then utilizes the routing protocol to depreference routes traversing the switch which results in graceful reroute of traffic through the other switches participating in the same function (leaf or spine). Then the spine switch is seamlessly inserted back into the spine layer when the upgrade has completed and paths have stabilized by returning the routes to their previously homogenous state.
- Non-stop forwarding in the Leafs: The leaf leverages the ASIC’s ability to continue forwarding while the control plane reboots. This allows all current devices to continue to communicate without interruption during the restart of the switch post upgrade.
- In-Service Software Patching: Allows an individual process to be upgraded without system interruption. This is simply the byproduct of how Arista’s EOS was designed.
The latest Cloud Vision release (2019.1) further complements the Smart System Update function by simply being aware of the resiliency built into the fabric. This gives the engineer greater control over the upgrade process and helps minimize impact while it runs. It blocks simultaneous upgrades of MLAG pairs and upgrades each spine serially. This allows a single change process to be packaged up prior to the change window and executed with the ability to roll back upon failure. As one can imagine, the streamlining of this process greatly improves success when keeping current on firmware is a must.
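The two ordering rules are simple enough to sketch. The Python below builds upgrade batches where each spine gets its own batch and the two members of an MLAG pair never share one. The switch names are hypothetical, and upgrading spines before leaves is my assumption for the example, not something Cloud Vision dictates.

```python
from typing import List, Sequence, Tuple

MlagPair = Tuple[str, str]


def upgrade_batches(spines: Sequence[str],
                    mlag_pairs: Sequence[MlagPair]) -> List[List[str]]:
    """Order an upgrade so spines go one at a time and MLAG peers never overlap."""
    batches: List[List[str]] = [[spine] for spine in spines]  # serial spines
    if mlag_pairs:
        batches.append([first for first, _ in mlag_pairs])    # one peer per pair
        batches.append([second for _, second in mlag_pairs])  # then the other
    return batches


def violates_mlag_rule(batches: Sequence[Sequence[str]],
                       mlag_pairs: Sequence[MlagPair]) -> bool:
    """True if any batch would reboot both members of an MLAG pair together."""
    return any(a in batch and b in batch
               for batch in batches for a, b in mlag_pairs)
```

A plan that fails `violates_mlag_rule()` is exactly the kind of change the GUI refuses to let you schedule.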
Cloud Vision as a Platform
Cloud Vision has multiple modules which are provided as containers to be installed either side by side or within Cloud Vision itself, including 3rd party controller integration, ServiceNow integration, and InfoBlox integration. Out of the box it also has a function called VMTracer, which gives the engineer visibility into VM guest location in both the topology view of Cloud Vision and via eAPI. This can also be used to learn and dynamically apply VLAN pruning configuration on the fabric leaf based on connected guests; however, this is really counter-intuitive from our perspective. What we would like to see is the ability to modify configuration on the vSphere dVSwitch via API, allowing Cloud Vision to provision both VLAN and VNI configuration on the fabric and in vSphere.
Cisco ACI: Application Policy Infrastructure Controller Web Interface
The APIC’s UI feels primarily targeted at support functions. From a provisioning perspective its existence seems to be intended as a stopgap for those new to development and to help visualize the complexities that come with object-oriented concepts. This is not to say that the UI cannot be used to perform day-to-day moves, adds, and changes to the fabric; it’s just not ideal. Simple things can be relatively easy to find, such as the health of the fabric, an over-utilized port, or a MAC address or IP in the fabric. ACI can generate an analysis of the fabric between two hosts to help troubleshoot path connectivity. But make no mistake, the APIC is not Solarwinds. Historical graphs are non-existent. The current CPU and memory utilization as well as temperature and fan health are readily available in table form, just with no history from the UI. When it comes to serious troubleshooting, you are jumping into the APIC’s CLI or using a third party product. When it comes to troubleshooting a contract, you’re nearly out of luck. Fixes quickly become contracts with ‘permit any’ while trying to prove where the problem is.
This is really where an APM akin to Tetration and the like can really shine. We didn’t get deep into Tetration during the POC, but prior experience at CPOC and Cisco Live was enough for us to know what we were missing. Without something to give you visibility into what’s actually flowing through the data center, creating a policy, or attempting to fix one, is anything but easy. You’re left guessing.
Cisco has spent a lot of time adding features and improving the APIC’s UI over the lifetime of ACI. While this shows commitment to the product, things have moved around within the ACI interface across versions, making it difficult for the integrator to develop a solid process for deployment. Our implementation engineer (with multiple CCIEs) spent significant amounts of time searching the manual to find out where something was configured in the UI. In a specific case, something as simple as NTP configuration ended up taking more time than I think he would have liked. The lack of consistency with the interface across versions should really be addressed within Cisco, as it was definitely a point of frustration for the partner.
Cisco Data Center Network Manager Web Interface
Datacenter Network Manager started off as the Datacenter Module in CiscoWorks nearly twenty years ago. It transitioned through the Prime branding in 7.x and was the primary solution provided by Cisco for Datacenter Switch Management with the Nexus 7K/5K/2K. As of version 10, the interface was still difficult to navigate, and we found it of limited use as a single management platform for an entire fabric.
With 11, this has changed immensely. The interface is much simpler and significantly more polished. There is an integration with vCenter which provides visualization and analytics around where a host sits on the vSphere dVSwitch and through which physical switch.
We noticed it’s very difficult in a lot of areas to multitask in DCNM. When entering or viewing configuration elements, the UI opens up floating windows within the HTML-based UI. The problem is you can’t move those windows off of the screen to view what’s behind them. This meant having to cancel out of the window we were in and write down information before opening the aforementioned window back up and filling the form out again from scratch. From an interactivity perspective, when DCNM is busy you get a twirling fan of death and you’re forced to wait until whatever task is running completes.
The last frustration is that in multiple areas you get a side-by-side diff view of what is about to change when you apply a template. However, when rolling back configuration you get two separate tabs. This basically means you have to copy the contents of each tab into Notepad++, Sublime, or Vimdiff to compare what will change when you roll back a configuration.
These are only minor annoyances, but they are easily fixable by Cisco to provide a better user experience. That being said, the rigidity of the platform shows in the manual’s best practices for how you should interact with it. Cisco clearly states that you should in no case make any changes from the switch, as they will at a minimum be wiped out on the next update, and at worst cause consistency issues between DCNM and the switches you made changes on. A perfect example of this is the VLAN change noted prior, which when rolled back caused the switch to be stranded from the fabric.
Arista Cloud Vision and EOS
Arista has taken a completely different approach to network programmability. Each switch is autonomous allowing the network to distribute services without oversight from a centralized controller. This model has generated some interesting capabilities.
The EOS SDK
First, the switch is running a generic Linux kernel. EOS provides a very simplified POSIX-type user-land. And support exists within EOS for creation of daemons to be deployed, allowing one to write a piece of software against the EOS SDK in either C++ or Python (Go and C Bindings exist as a wrapper) to allow full control of the switch. In fact, this SDK is how Cloud Vision was actually written to interact with the switch. When importing switches into Cloud Vision, an application service daemon named TermAttr is deployed to the switch. TermAttr’s only function is to provide Cloud Vision fine grained live streamed telemetry information. This coupled with EOS’ Native REST API (Command API) gives Cloud Vision the necessary visibility and capability to provision and deploy configuration on the switch in real time.
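To give a flavor of the Command API, requests are JSON-RPC documents carrying a list of CLI commands, posted over HTTP(S) to the switch’s `/command-api` endpoint once the API is enabled. The Python sketch below only builds the request body; it sends nothing, and the helper name and example commands are my own choices for illustration.

```python
import json
from typing import Dict, Iterable


def eapi_request(commands: Iterable[str], request_id: str = "1",
                 fmt: str = "json") -> Dict[str, object]:
    """Build the JSON-RPC body for an eAPI 'runCmds' call."""
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": list(commands), "format": fmt},
        "id": request_id,
    }


if __name__ == "__main__":
    body = eapi_request(["show version", "show interfaces status"])
    # This string would be POSTed, with credentials, to
    # https://<switch>/command-api
    print(json.dumps(body, indent=2))
```

The response comes back as structured JSON per command, which is what makes it practical for a tool to drive the switch without screen-scraping.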
Secondarily, as EOS is built upon Linux, it has the ability to leverage VMs with KVM and containers on the switch for a more modular approach to deployment of distributed network applications and services. Theoretically, one could write a completely custom control plane and load it as a daemon within the switching infrastructure instead of relying on standard open protocols.
Cloud Vision itself is built on top of EOS and also has a significant level of programmability within it; after all, it is from the same codebase as EOS. You can leverage a C++/Python API or the eAPI (REST based) to control Cloud Vision. This allows Cloud Vision to be extended to provide functionality beyond what the manufacturer intended. However, all this aside, the real juice here is how configuration is applied from Cloud Vision.
Most shops end up writing some sort of templating system for generating configuration, sometimes as a web page but most of the time in Excel. To save you the time, Arista has written this functionality into Cloud Vision. They have also extended this function with the ability to leverage Python to make dynamic configlets, called config builders, which generate configuration from complex sets of information either supplied by the user or pulled from a database somewhere.
There is an entire training course written around config builder, and the materials and configlet examples are provided free of charge on GitHub. This makes it very easy to get started customizing the Fabric Builder to meet the business’s requirements and any specifics of a shop’s process.
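The idea behind a config builder is just “data in, configuration text out.” The sketch below is plain Python and does not use Cloud Vision’s builder API; the function name, the VLAN table, and the VNI numbering scheme are all assumptions made for the example.

```python
from typing import Dict


def build_vlan_configlet(vlans: Dict[int, str], vni_base: int = 10000) -> str:
    """Generate EOS-style VLAN and VXLAN mapping lines from a {vlan_id: name} table."""
    lines = []
    for vlan_id, name in sorted(vlans.items()):
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"invalid VLAN id: {vlan_id}")
        lines += [f"vlan {vlan_id}", f"   name {name}"]
    lines.append("interface Vxlan1")
    for vlan_id in sorted(vlans):
        # Hypothetical convention: VNI = base + VLAN id.
        lines.append(f"   vxlan vlan {vlan_id} vni {vni_base + vlan_id}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_vlan_configlet({10: "web", 20: "app"}))
```

Swap the dictionary for a spreadsheet export or a database query and you have replaced the Excel template most of us have been carrying around for years.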
UPDATE: In the latest version of CloudBuilder and CloudVision (2019.2), Arista has added the ability to use Jinja Templates to provide simple templating with variables when generating configlets. There is also a built in IPAM solution which greatly simplifies management of networks in whole in both day1 and day2 operational duties.
Events and Time-based Configuration
In the same vein as Cisco’s Embedded Event Manager, Arista provides an Event Framework which allows configuration elements to be tied to scripts which can generate configuration, evaluate statistical data, send notifications, etc. These can trigger on Control Plane Events, Syslog Events or at specific intervals, dates or timeframes. The difference is, the evaluated script can be a REST call, a locally evaluated bash script, python script, or any other interpreter you install on the switch (as opposed to only TCL in the Cisco world.)
Cisco’s ACI as a Programmable Object-Oriented Fabric
ACI’s object-based design provides a huge opportunity to begin treating network elements as constructs for a software developer to work with. Cisco provides a fine-grained and rich set of objects that can be manipulated to control every aspect of both the underlying infrastructure in the underlay and the policy and application containment in use in the overlay. It is clear in the design of the REST API that the intent was for the entire fabric to be interacted with via API by the application developer. While the API is used to program all aspects of the fabric, the real meat of the API is the set of objects allowing you to create policy for transport traffic.
The base container is a Tenant. This represents all the assets controllable by a group or entity and is how role-based user controls are applied, ultimately allowing the management plane to be cut into separate managed fabrics. Within a tenant, the elements making up policy include Outside Networks, Application Profiles, Endpoint Groups, Bridge Domains, Contexts, Contracts, Subjects, Aliases, and Filters. Application Profiles effectively define an application’s policy and at a minimum include an endpoint group and a contract. The endpoint group defines a set of hosts (virtual or physical) which are attached to the fabric. The contract is effectively a dynamic access-list composed of source and destination protocols and TCP/UDP ports, as well as the target endpoint group.
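In API terms, each of these elements is an object with a class name, attributes, and children. The Python sketch below assembles a small tenant tree using the APIC class names for a tenant (`fvTenant`), context (`fvCtx`), bridge domain (`fvBD`), filter (`vzFilter`), contract (`vzBrCP`), application profile (`fvAp`), and endpoint group (`fvAEPg`). The object names are hypothetical, the tree is deliberately minimal, and it has not been validated against a live APIC.

```python
import json


def mo(class_name, *children, **attributes):
    """Build one APIC-style managed object: {class: {attributes, children}}."""
    node = {"attributes": attributes}
    if children:
        node["children"] = list(children)
    return {class_name: node}


def web_tenant(name):
    """Tenant with one VRF, bridge domain, filter, contract, and EPG."""
    return mo(
        "fvTenant",
        mo("fvCtx", name="vrf-main"),
        mo("fvBD", mo("fvRsCtx", tnFvCtxName="vrf-main"), name="bd-web"),
        mo("vzFilter",
           mo("vzEntry", name="tcp-443", etherT="ip", prot="tcp",
              dFromPort="443", dToPort="443"),
           name="https"),
        mo("vzBrCP",
           mo("vzSubj", mo("vzRsSubjFiltAtt", tnVzFilterName="https"),
              name="https"),
           name="web-in"),
        mo("fvAp",
           mo("fvAEPg",
              mo("fvRsBd", tnFvBDName="bd-web"),
              mo("fvRsProv", tnVzBrCPName="web-in"),  # EPG provides the contract
              name="web"),
           name="storefront"),
        name=name,
    )


if __name__ == "__main__":
    print(json.dumps(web_tenant("demo"), indent=2))
```

After authenticating, a body like this is typically POSTed to the APIC under `/api/mo/uni.json`, and the whole policy lands as one transaction.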
If you’re using Postman templates to build policy-sets, mistakes leave lots of ugly little orphan objects lying around the APIC. This shouldn’t have been a surprise, but when the OCD within me felt the need to clean up after myself, it was a 6 hour stretch of gathering, filtering, associating, and destroying objects. By the time I was done, I had nearly written a Python module to find abandoned or unused objects to be destroyed in the APIC. The realization that came with having a highly programmable fabric was the fear of less capable programmers doing much more damaging things with this new playground. I won’t lie, my distrust of others is a large part of my opinion that programmable fabrics are a wonderfully dangerous horrible idea. That being said, if I were the maintainer of a massive Cisco distributed datacenter containing unpatched turn of the century systems in a PCI or DoD environment, this is definitely the product I would use to do my job; it is unmatched on the segmentation front.
With DCNM, we performed minimal testing from a programmatic perspective. We did confirm there is a well-rounded API that allows for provisioning and retrieval of configuration, images, inventory, links, policies, and VRFs. It is in no way as capable as ACI’s API. The API is driven more around interacting with the fabric as a group of switches instead of as one entity.
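The cleanup logic itself is not complicated: walk the object tree, collect everything that is defined, collect everything that is referenced, and subtract. The following Python sketch does that over an APIC-style nested dictionary. It is a reconstruction of the idea rather than the module I wrote, and it assumes references are carried in relation objects such as `vzRsSubjFiltAtt` (subject to filter) or `fvRsProv` and `fvRsCons` (EPG to contract).

```python
from typing import Dict, Iterator, List, Tuple


def walk(node: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (class_name, attributes) for every object in an APIC-style tree."""
    for class_name, body in node.items():
        yield class_name, body.get("attributes", {})
        for child in body.get("children", []):
            yield from walk(child)


def unused(tree: dict, defined_class: str,
           references: Dict[str, str]) -> List[str]:
    """Names of defined_class objects that no relation object points at.

    references maps a relation class to the attribute holding the target name,
    e.g. {"vzRsSubjFiltAtt": "tnVzFilterName"} for filters.
    """
    defined, referenced = set(), set()
    for class_name, attrs in walk(tree):
        if class_name == defined_class:
            defined.add(attrs["name"])
        elif class_name in references:
            referenced.add(attrs[references[class_name]])
    return sorted(defined - referenced)
```

Anything this returns is only a candidate for deletion; a real cleanup would also have to consider objects referenced from other tenants, such as shared contracts.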
It is really the scripting and modularity of the configuration elements as Templates that make DCNM so much easier to work with. First off, Cisco provides hundreds of Templates for common tasks. From creating a Virtual Port Channel to creating a stretched VXLAN, there are templates aplenty. These templates can contain scripting elements written in either Python or Cisco’s custom scripting language, which seems very Perl-like in its syntax. This is a far simpler approach than that of ACI, and more akin to what the average network engineer already does. Effectively, Cisco has simply wrapped a GUI around it.
This, in combination with the API, gives the average network engineer an easy win in the ability to customize a template.
The advantage in utilizing DCNM over ACI comes with the ability to leverage all of the programming constructs natively built into the more traditional NX-OS running on the switches. One can leverage Embedded Event Manager, build a container with Cisco’s OnePK, or leverage Ansible, Puppet, or Salt to gain additional programmability while keeping the traditional face of IOS at your beck and call.
Agility and Usability
As far as usability, there are some killer features that Cumulus has that no other manufacturer provides (as far as we know or have tested).
The biggest is the one I spoke of earlier, NetQ, which allows you to see statistical, state, and configuration information across all nodes in your cluster from a single node.
OSPF and BGP Unnumbered
OSPF Unnumbered isn’t necessarily something new, but it is something that is not supported in Arista at the moment. Basically, this allows the engineer to IP the loopback and then share the address of the loopback on all interfaces. Since OSPF doesn’t require the next-hop to be in the same network (as does IS-IS, which is my preferred protocol), you can effectively IP the switch with one IP address.
BGP Unnumbered is a poor choice of naming in my book. It gives one the impression that it’s just like OSPF Unnumbered, but it’s so much more. Effectively, it allows you to bypass the peer AS check, permitting peering with a large set of random BGP ASes from a single peer-group. This greatly simplifies configuration within your EVPN and underlay networks and, with authenticated BGP and hop-limits set on the peering, is just as secure as the legacy configuration would be. Granted, this is a pretty big change, and who knows what other interesting behavior this could cause, but at face value this seems to be a genius way to simplify configuration.
Cumulus also provides a plugin for most distributions that allows the server to advertise its own routes, allowing the servers to participate in the topology. While this could be done fairly easily with most operating systems, they provide this out of the box with a simple install, greatly improving the ability to provide host mobility in your multi-pod or multi-datacenter topologies. Unfortunately, we can’t speak to this as it was outside the scope of what we were testing, but it’s a slick feature nonetheless.
Arista Cloud Vision and EOS
The simplicity of Cloud Vision and the familiarity within the CLI of Arista’s EOS is a perfect match for the transitioning Network Engineer. Cloud Vision’s simple and intuitive interface provides a lot of little niceties that solve problems we need to deal with today. We have been really struggling with keeping the switches’ configuration in compliance in our current datacenter, so the configuration verification built into Cloud Vision was a pretty substantial win. Bundle that in with the ServiceNow integration, built-in analytics with streaming telemetry, firmware management, and the scoped vulnerability compliance component, and it’s almost the perfect solution.
The big misses are really the lack of the vSphere VMM integration that ACI does so elegantly, the lack of anything to match the promise that DCNM will be able to manage the VLAN configuration in the Fabric Interconnect in the near future, and the inability to support Private VLANs (due to a legal dispute with Cisco in 2017).
[^] Citation for legal result from Arista-Cisco Entanglements…
The EOS CLI
Since the switch is the heart of what makes Arista successful, they have included some very impressive functionality in the CLI, above and beyond what we have had with the IOS CLI in the past. Arista gives you the basic toolsets including: packet capture, config sessions, textual histograms over time of memory and CPU, command aliases, local and remote file management, TCP/IP services, menu creation constructs, banners, full AAA in the form of RADIUS and TACACS, command pipelines, and local log collection. There are improved ways to filter the configuration while performing configuration, detail resolution in key show commands, and always on live debugs.
The show active keyword has been added to help with configuration output. This is extremely helpful in lengthy configurations, as it shows you the configuration related to the current context you are in, including the parent tree. There are also additional sub-contexts within show running-config and show startup-config, giving one the ability to see all EVPN, BGP, VXLAN, or MPLS configuration.
Within various show commands you can now see the neighbor or name information. For example when looking at evpn peers, vxlan next-hop, or the mac-address table, you get not only the destination IP address of the VTEP but the name of the switch as well.
The last big wow we saw was the elimination of the need to enable debugs. This has always been a concern due to the ability to easily impact the control plane if the CPU gets over-consumed gathering debugs. Well, in the Arista world, all of the debug information in the switch is stored to a volatile log in /var/log/. They provide a special set of tools for viewing these in real time. What this means is that when an engineer is working through an issue and sees an event, (s)he doesn’t have to enable the debugs and wait for it to happen again. Instead, one can just jump into bash and qtless /var/log/ProcessName.qt where ProcessName is the name of the process of interest.
Cisco Application Centric Infrastructure
This change in paradigm is beyond significant for the traditional network engineer, as there is a stark gap between the skillsets required by the two functions needed to utilize the APICs for interaction with ACI. The support, troubleshooting, and debugging skillset of the seasoned network engineer will be required to see through the abstractions provided by the programmatic nature of ACI when things go wrong. This either raises the premium for a network engineer significantly or will require significant investment in current staff to help them convert their knowledge of networking pitfalls and architecture into the ability to solve problems and leverage ACI in a meaningful manner for the business. In every aspect, the tool-set changes, as one transitions operational duties from SNMP, NetFlow, and debugs in a CLI to REST JSON API calls, webhooks, and event streaming with Postman and the like. The tool-sets are anything but simple, and there is very little of the old to help engineers in the transition as they become acquainted with this new world.
Bring not your old ways…
The most beautiful part of ACI is also its biggest pain point. The transition is not transitional at all, but a forklift: new switches, new interface, new constructs, new tools, nothing of old… This leaves no ability for corporations and users to take baby steps. Attempting to leverage the APIC’s UI as a stopgap for traditional network engineers exposes the complexity of both what can be accomplished and what is required at a minimum. It is easy to lose one’s place traversing the dependency tree, and when something is missed, it is complex to track down what portion needs to be created and bound to make the policy available. While ACI provides for what one would think is a pristine walk into the new world, the reality feels like it may be too much change, too quickly.
Wait! Nothing’s changed…
In talking with peers in other organizations, the resultant workflow within ACI ends up looking very similar to what is done today in traditional networks, but with different tools. Spreadsheets, text files, or a templating tool still end up getting used, as they are today. However, instead of storing a textual configuration template with variables, the contents end up being a REST API call with variables. This new template is then pasted into Postman instead of being pasted into the CLI. The single difference is that instead of having to own a tool to push this template to all the switches, or paste it into every switch’s CLI one by one, the APIC provides a central point of configuration for pushing out your REST-driven configuration change.
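To show what "a REST API call with variables" looks like in practice, here is a minimal Python sketch that reads rows from a spreadsheet export and fills them into a nested JSON body. The class names (fvTenant, fvBD, fvSubnet) follow the ACI object model as I understand it and should be verified against the APIC documentation; the column names and sample rows are hypothetical, and the sketch only renders the bodies rather than sending them anywhere.

```python
import csv
import io
import json

# Hypothetical spreadsheet export: one row per network to create.
SHEET = """tenant,bridge_domain,gateway
Prod,bd-web,10.10.10.1/24
Prod,bd-app,10.10.20.1/24
"""


def build_payload(row: dict) -> dict:
    """Fill one row's variables into a nested, ACI-style JSON body."""
    return {
        "fvTenant": {
            "attributes": {"name": row["tenant"]},
            "children": [{
                "fvBD": {
                    "attributes": {"name": row["bridge_domain"]},
                    "children": [
                        {"fvSubnet": {"attributes": {"ip": row["gateway"]}}}
                    ],
                }
            }],
        }
    }


def render_all(sheet_text: str) -> list[str]:
    """Return one JSON request body per spreadsheet row."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    return [json.dumps(build_payload(row), indent=2) for row in reader]


if __name__ == "__main__":
    for body in render_all(SHEET):
        print(body)  # paste into Postman, or POST to the controller
```

Structurally this is the same job a CLI template did: the variables live in the spreadsheet, and only the rendering target has changed from text to JSON.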
Cisco ACI and vSphere VMM Integration
Virtual Machine Integration is simple to set up, and provides the ability both to keep the vSphere dVSwitch in sync with the fabric and to eliminate the effort of creating virtual networks by the Virtualization Team. We went through the process of connecting APIC to our POC vCenter installation. In our scenario specifically, the vCenter that managed the POC was inaccessible from the POC. All of the APICs must be able to communicate with vCenter and the VMKernel interfaces on hosts affected by changes to the dVSwitch. This required us to create NATs into the management network for all of the APICs. In a real implementation this would probably not be necessary, but it’s something to note depending on how isolated the vSphere management network is from your fabric management network. Either way, the APIC created an EPG in the VMware environment with all of the dependencies required. Contracts were created. Traffic successfully flowed between two endpoint groups on two separate hosts. It was a success, with minimal effort. This was by far the biggest win going through the POC process with ACI.
One of the reasons we evaluated ACI was its ability to segment east-west traffic in virtual environments as well. Currently we utilize Private VLANs for this function. In the Nexus 5K/7K architecture we have in place today, we have been constantly bitten by both bugs and limitations in scale in regards to our use of Private VLANs. This has resulted in long hours trying to packet capture traffic, which is yet another pain point on the 5K/7K. You can’t capture packets in Private VLANs on a 7K. The packets never make it to the monitor port for an architectural reason I can no longer recall.
To put it simply, our goal was to evade the latest IT dirty word blamed for everything by simply eliminating its use. Network as a group all agreed we’d be happy to be rid of Private VLANs. Well, to our surprise, Private VLANs are utilized within the dVSwitch to assist in this micro-segmentation model with ACI. This isn’t a tick against ACI, as we don’t necessarily envision this causing issues, and the novel way in which they are used is quite impressive. It is just unfortunate that we cannot with any level of honesty tell management that we have in fact eliminated Private VLAN utilization in the datacenter, should we go with ACI.
vSphere VMM integration Road Blocks
Really the only actual concern over this function isn’t related to ACI’s implementation or execution at all but company politics. The ability for the APIC to have administrative reach into the vSphere dVSwitch by utilizing the out of the box VMM integration requires your systems and networking teams to be able to cooperate. As silly as it is, company culture may hinder adoption of this feature. If it proves difficult to navigate the VMM integration, ACI’s benefits quickly dwindle.
Undesirable behavior and deployment recommendations
There were several comments made by the vendors which we found enlightening. They are documented here.
Arista Cloud Vision and EOS
Due to how Arista manages convergence during a switch reset or interface failure, the lower priority half of an MLAG will stay down for up to 5 minutes when the two MLAG switches lose communication. This is to prevent the possibility of black-holing traffic or causing session resets when the switch is going through a startup cycle. It is not readily apparent that this is what is going on without running the show mlag status command within the switch. The immediate symptom is that all ports on that switch will show error disabled. Note that these timers are user configurable within the switch configuration; we simply wanted to share a difference in expectation in case this catches someone else off guard.
Certain model switches may not have perfect feature parity when it comes to specific EVPN and VXLAN features across the various Broadcom Chipsets. Verify with your Sales Engineer before purchase.
If you have any future needs requiring the use of Converged Ethernet in the data center (also known as Fibre Channel over Ethernet) or native Fibre Channel, make sure you purchase the Cisco Nexus switches based on the FX or FX2 chipset.
VXLAN, specifically VTEP termination is supported on the Nexus 7000 and 5600 Series as well as the Nexus 9300 and 9500 series switches, however it is not supported on the Nexus 5000 and 5500 series switches, and has limited support on the Nexus 3000 Series. See your Sales Engineer before purchasing if the intent is to utilize legacy equipment in your fabric.
Cisco Datacenter Network Manager
- The role must be assigned to the elements within DCNM prior to building any overlays. You cannot build overlays until after the roles have been set on the various fabric components, or you will not be able to change the role. This includes creation of a vPC Pair.
- When showing the VM Guests on the dVSwitch in DCNM Topology View, you must sometimes hide and reshow the dVSwitch to get the layout to show up correctly.
- If you are expecting to see something in a frame but it doesn’t appear to be there, attempt to scroll while hovering over the frame. Sometimes the content gets stuck at the bottom of the box.
- It is required to use the identical port on the paired switch for vPC. While this is probably good practice, it seems very inflexible.
- Unable to pull configuration back into DCNM that was deployed directly to the switch. In fact Cisco states you should only perform configuration functions from within DCNM.
Cisco Application Centric Infrastructure
- Leverage a Cisco ACI Qualified Partner to perform the initial implementation.
- NTP accuracy is critical to the stability of the fabric. Before migrating to production, verify NTP is in sync across all peers.
- ACI must be built using the following process: Bring up the first APIC in the cluster. Discover the fabric. Bring up the rest of the APICs in the cluster. Perform the same process on the second site’s fabric. Add both fabrics’ APICs to the MSO for management.
- Network Centric vs Application Centric mode is configured per bridge-group.
- Network Centric mode is intended to be a transitional state for brownfield deployments, wherein the VXLAN fabric is just providing layer-2 transparency via the overlay. This means the SVI must exist outside the fabric. To perform symmetric routing, and host the SVI with anycast FHRP at the leaves, the bridge-group must be migrated to Application Centric Mode.
- When enabling Application Centric Mode, you are simply checking the checkbox on the bridge-group that turns on policy enforcement, converting the implicit permit-all to an implicit deny-all.
- In a virtual environment where one cannot leverage vSphere VMM Integration, ACI should not be used. Static port bindings quickly become unmanageable.
- It is required to use the identical port on the paired switch for vPC. While this is probably good practice, it seems very inflexible.
Notes Independent of Product or Manufacturer
For production environments, always deploy management connectivity of switches and controllers on a dedicated, separate out-of-band switching infrastructure that is only accessible via layer-3 host-port attached interfaces. This prevents layer-2 topology changes from breaking the switch fabric’s ability to be configured, managed, or rolled back. When utilizing EVPN in multisite topologies with three or more sites, make sure you use a dedicated pair of devices as intersite route reflectors.
Data center interconnects should always be dedicated unshared connections for fabric use only, preferably over lambda or dark fiber.
Remember, automation gives you the ability to break everything very quickly. Be wise: test, test, test, and test again, or you may automate yourself out of a job.
We have had the ability to talk with a smattering of engineers across various enterprises in the Midwestern United States who have replaced their fabrics over the past 5 years. It seems only a very small group of companies have done anything beyond migrating their topology to VXLAN from STP, TRILL, or FabricPath.
The few companies that we have heard of making great strides on programmatic provisioning of their fabric had significant pains in getting there. The consensus feels the same regardless of the product, which leads us to believe programmable fabrics are new playgrounds fraught with opportunities for failure that could very well be resume generating. That said, failure is where we learn and ultimately what will make us better engineers. It also gives our employers the opportunity to leverage agility, possibly providing gains in the marketplace.
Arista Cloud Vision and EOS
When it comes to the support staff’s ability to get work done, the simplicity of Cloud Vision and the familiarity within the CLI of Arista’s EOS is a perfect match for the transitioning Network Engineer. Cloud Vision offers a full configuration management platform from a single simple web interface, providing timelines, audit, and remediation within the platform. The Compliance module validates and notifies based on actual attack surface CVEs or bugs. Software updates are already stupid simple, and yet they are still being improved with each release of Cloud Vision. Programmatically, the fabric is well suited to be interacted with in a variety of ways.
The key wins from our perspective are the built-in, highly integrated interactions with ServiceNow and Compliance. It is remarkably simple to set up the ServiceNow integration, and the ways the two can leverage each other for a functional change process are ever expanding. From a compliance perspective, no third party tools are needed, and out of the box Cloud Vision feels best of breed in this function.
The principal concern in going with Arista is their reliance on a wide array of commodity chipsets, which could mean limited control over implementation and availability, as they are competing with other open manufacturers for the same chips.
Cisco Application Centric Infrastructure
From our perspective, ACI’s key feature is its ability to deploy configuration to vSphere via API. This alleviates workload on the VMware administrator and prevents delays due to a mismatch in configuration between VMware’s dVSwitch and the network fabric. In some organizations, this may be difficult to move the ball on, as it requires administrative control by ACI at the Datacenter level in vCenter. This could prove difficult to negotiate depending on the organizational structure and political landscape.
Cisco’s Application Centric Infrastructure is a white-list model for the Datacenter. By providing fine-grained control elements, ACI effectively dissolves the reliance on creating network policy around IP addressing, VLAN, and subnet. This allows the creation of a policy as generic or specific as a company needs. As with any good tool, ACI’s ability to provide this fine-grained control also comes at the price of complexity. Couple this with the fact that anything programmatic gives you the ability to do things fast, and little mistakes can make large messes, leaving lots of orphaned objects to clean up.
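The white-list idea can be sketched in a few lines of Python: traffic between two groups is dropped unless a contract names that exact pair, protocol, and port. This is a conceptual model only, not ACI's implementation; the Contract structure, the group names, and the assumption that members of the same group may talk freely are all illustrative.

```python
from typing import NamedTuple


class Contract(NamedTuple):
    consumer: str   # group initiating the traffic
    provider: str   # group offering the service
    protocol: str
    port: int


def is_permitted(contracts, src_group, dst_group, protocol, port,
                 enforced=True):
    """Return True if the flow is allowed under a white-list model."""
    if not enforced:
        return True   # policy enforcement off: implicit permit-all
    if src_group == dst_group:
        return True   # assumed: intra-group traffic is allowed
    # Enforcement on: implicit deny-all unless a contract matches exactly.
    return Contract(src_group, dst_group, protocol, port) in set(contracts)


if __name__ == "__main__":
    contracts = [Contract("web", "app", "tcp", 8443)]
    print(is_permitted(contracts, "web", "app", "tcp", 8443))  # True
    print(is_permitted(contracts, "web", "db", "tcp", 3306))   # False
```

Note that policy here is keyed on group membership rather than on IP address, VLAN, or subnet, which is the point of the model and also where the extra objects to manage come from.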
The greater lack of visibility in ACI for the development of the policy to be applied has been remediated by the integration of Tetration, which among other things provides a host-based firewall solution. A minority opinion among my peers holds that when pairing ACI with a host-based firewall solution such as Tetration (or its largest competitor, Illumio), the policy element quickly becomes less important.
Cisco Datacenter Network Manager
This rigid approach to best practices, in combination with the issues we ran into trying to roll back a relatively simple change, made DCNM feel a little risky in its current state. It looks to be on target to be a heavy contender with both Arista and ACI in the not so distant future. The promise of the ability to control the UCS Manager’s network components from within DCNM is a feature we eagerly await.
An aside: a bowl full of opinion on Cisco and its lack of Focus
In multiple areas it feels like Cisco is undercommitted to various technologies. They are actively competing with themselves in the wireless arena with Cisco Enterprise Wireless and Meraki. In the data center vertical they have DCNM and ACI.
In the Network Automation world, the lack of commonality in architecture between major components of their network leaves massive opportunities for failure. For example, ACI, Viptela, and DNA are three nearly identical systems focused on different areas of the network. And while they do share some constructs, the most complicated ones (like security policy) must be translated between systems. ISE seems to be the glue they intend on tying it all together with.
This in combination with the now complicated and ever-changing world of licensing is very frustrating. Cisco licensing and support used to only be complicated for OPEX companies that were cash limited. If you purchased all of the licensing up front, you never had to worry. Today, it seems working with Cisco is anything but easy. Complication abounds and not necessarily because it provides advantage in modularity, scalability, or performance.
Let it be known: the current model of licensing and support borrows dollars today with the promise to deliver something tomorrow. In the Cisco ONE, DNA, and ELA license models, we gain no more features today, at an immediate cost increase that is not trivial. This makes Cisco’s financials look impressive today as they benefit from long-term commitments of recurring cash flow, but when the majority of companies do not re-up on their licensing because Cisco has traded talented engineers for marketing and sales staff, the bell that tolled for Nortel will ring again.
As someone who has built a career on the backs of the fine engineering staff at Cisco, it sure feels like they are a Rube Goldberg machine held together with the tangled webs woven by their massive marketing machine. That being said, we all have invested large portions of our time and money, both personally and via business, into Cisco over the past 25 to 30 years, and not a single engineer I work with wants them to do anything but deliver products we can leverage to solve problems for our business lines.
Get back to the basics and get focused. Find where you have helped your customers succeed and expand from there. Build a toolset that engineers can use to solve problems. Keep it simple. Plan what you intend on doing and execute on that plan. Software companies come and go but novel customer focused hardware is hard to replicate.
- Chhabra, A., and Kiran, M. “Classifying Flows: Mice and Elephants” supercomputing.org, 19 Sept 2017.
- Cloudshark. “What is a micro burst and how to detect them?” cloudshark.io
- Heder, Brian. “Are your pipes too big? The problem with long fat networks.” Network World, 6 May 2014.
- Alizadeh, Mohammad, Edsall, Tom, et al. “CONGA: Distributed Congestion-Aware Load Balancing” SIGCOMM’14, Aug 2014.
- Newman, David. “Arista, Blade win top spot in data center switch test.” Network World, 18 Jan 2010.
- Lab Test Report DR1000401G. “Comparing Cisco Nexus 5010 and Arista 7124S in Financial Markets” Miercom, 26 Apr 2010.
- “Microbursts, Jitter, and Buffers.” Arista, Jan 2012.
- “Myths about Microbursts.” Arista, Aug 2010.
- “Cisco Nexus 3064 Performance Test”, Miercom, Apr 2011.
- “Cisco fabric launch seeks to undermine Arista, IBM.” Network World, Jun 2011.
- “Big Data Becoming a Common Problem.” Arista, Nov 2013.
- “Update: Center 40GE Switch Study: Nexus 9508 v Arista 7508E” Miercom, Jan 2014.
- “Arista 7500E Data Center Switch Test.” Lippis, Jan 2014.
- “Buffer Performance Testing: Cisco Nexus 9396PX v Arista 7150S” Miercom, Aug 2014.
- “Deploying IP Storage Infrastructures” Arista, Aug 2014.
- “Speeding of Applications in Data Center Networks: Cisco Nexus 92160YC v Nexus 9272Q v Arista 7280SE-72” Miercom, Feb 2016.
- “Network Switch Impact on Big Data — Hadoop-Cluster Data Processing: Cisco Nexus 9272Q v Nexus 92160YC v Arista 7280SE-72.” Miercom, Mar 2016.
- Bechtolsheim, Dale, Holbrook and Li. “Big Data Needs Big Buffer Switches.” Arista Sep 2016.