This is my hobby, one that I am paid handsomely for. I design and automate networks, write software, support applications, and learn all I can to be successful in that regard for my customers’ benefit. You should read “LACP and vSphere ESXi hosts: not a very good marriage” first. Dennis Lefeber does a really wonderful job of providing the VMware viewpoint. I agree with a lot of his rationale; however, I think his argument lacks some depth I would like to provide, and in doing so I may change your perspective on how you build data centers. As part of this discussion, we will dig together into the specific benefits interface bonding provides, from both a technical and a business-case standpoint.

But first, for those with short attention spans who only came here for the short version on LACP and LBT as implemented with VMware:

  1. Increase application performance by decreasing application latency with high-speed (line-rate) interfaces at the switch-server interconnect.
  2. Use LBT or LACP for its true business use-case: rapid failover during a failure or maintenance activities, increasing the headroom for a given interface, and decreasing transmit time between the server and switch.
  3. Maximize your value to your customer (the company, application and other infrastructure groups, your peers) by knowing how to turn business-cases into technical implementations.
  4. Remember, one-size-fits-all results in big shirts on little people or big people in little shirts. Don’t be a hardline proponent or opponent of any technology, vendor, or view. Always seek to understand the opposing viewpoint as there is probably a use-case where that fits as well.

Best of luck to the channel changers. I appreciate the click and read. For the rest of you knowledge seekers, let’s dig in.

At this point, I’ve built a LOT of data centers, using Cisco Nexus (Legacy and ACI), Cumulus, Big Switch (Arista CCF), Arista UCN, and Juniper EX/QFX. I’ve deployed VMware as far back as 4.5 through vxRail and NSX-T, as well as OpenStack, Nutanix, Oracle Private Cloud Architecture, IBM p-Series, and legacy Sun Microsystems virtualization. I’ve also had the pleasure of supporting some of the most advanced applications in the energy industry, aerospace, and retail, as well as the worst. My point is, I’m probably a hundred or so DC deployments in at this point, and I still learn things every day. So let’s go through some exploration together.

First, the topic that brought me to write this: a quick comparison of Link Aggregation Control Protocol (LACP) and VMware vSphere’s dVS implementation of Load Balance Teaming (LBT), as used at the switch-compute interconnect for VMware vSphere and vxRail deployments:

Link Aggregation Control Protocol (LACP) is an open standard protocol; Load Balance Teaming (LBT) is a proprietary feature within VMware ESXi. The comparison below covers each numbered category first for LACP, then for LBT.
LACP: 1. Mechanics and Authority

An active control-plane protocol that includes a peering process (LACP hello exchange) in which the specifics of how the member ports should forward traffic in a link aggregation group are negotiated during port link-up.

Channel configuration and negotiation result in a single logical port made from the two member ports.

LACP control-plane signaling allows for a constant exchange of state between the two link partners, permitting in-service changes to forwarding semantics.
LBT: 1. Mechanics and Authority

No control-plane protocol.

Each port in an uplink port group operates independently.

The VMware host is the single authority on forwarding semantics.

LACP: 2. Link Aggregation Group Member Availability

Link Fault Signaling in combination with LACP protocol timeouts results in a known minimum outage time by taking a failed path out of service.

In-service link availability is negotiated. The LACP max-links knob allows passive links to keep an active physical link while being excluded from packet distribution.
LBT: 2. Uplink Group Member Availability

In-service link availability relies on Link Fault Signaling (LFS), where the NIC identifies whether the link is up or down in both directions. Failure of LFS, the kernel driver, the NIC, or the PHY (SFP) can result in a traffic black hole. This limitation can be overcome through the use and configuration of Bidirectional Forwarding Detection (BFD).

VMware and the switch rely on Link Fault Signaling and take ports out of service by using the kernel driver to disable the port at the PHY/NIC (port shutdown).
LACP: 3. Link Aggregation Group Forwarding Distribution

Service routing in a Link Aggregation Group (LAG) is negotiated at startup as an XOR hash of one of the following:

3a. Source and Destination MAC
3b. Source and Destination MAC and IP
3c. Source and Destination IP and TCP/UDP port
3d. Other future options defined in the spec
LBT: 3. Uplink Port Group Forwarding Selection

Service routing in an uplink port group relies on VMware selecting a port in the group for a given server, allowing the IP addresses from that server to be bound to the attached switch port through implicit, passive MAC learning. When the port selection for an active MAC/IP changes during a balancing event, the upstream switch registers a mac-move event as the MAC is learned from a transmitted packet on the new port.
LACP: 4. Efficient use of the channel

4a. Entropy of the selected portions of the packet used in the hash.

4b. Single-interface serialization/deserialization rate (link speed) exceeding the maximum attainable bandwidth-delay capacity between the client’s transport-protocol socket and the server’s transport-protocol socket by 4 times (a short sketch of this follows the comparison).

4c. A well-distributed allocation of addressing and layer-4 ports used by the applications traversing a given LAG.

LBT: 4. Efficient use of the uplink port group

4a. Quantity of VMs consuming bandwidth on the uplink port group.

4b. Single-interface serialization/deserialization rate (link speed) exceeding the maximum attainable bandwidth-delay capacity between clients and server by 4 times.

4c. Dedicated cluster allocation based on application type for complex applications leveraging large UDP transfers, or where large quantities of services run on a single VM, as together these applications may exceed the bandwidth-delay capacity noted in 4b.
LACP: 5. Simplicity

LACP is well documented, standardized, and has been in use between network equipment for over 20 years. Even so, novice IT support staff may lack the understanding to implement it effectively, resulting in longer deployment times and/or impact when left misconfigured during maintenance activities.
LBT: 5. Simplicity

Less to configure. A simple (though arguably inelegant) solution results in moderate success in operational activities for novice server and network admins, as it is easy to grasp the mechanics.

LACP: 6. Business case for efficiency

The smallest unit of distribution is the source port at the transport layer (layer 4), so a single VM’s services can be distributed across multiple ports.

LACP’s ability to distribute the transactional requests (TCP sessions) for a single workload (VM) serving many clients is statistically more efficient. As a result, LACP should delay the need to horizontally scale the application.

There are very few conditions where using the L3+L4 XOR hash (3c) for packet distribution with LACP leads to interface contention and LBT would not.
LBT: 6. Business case for efficiency

The IP address is the smallest unit available for distribution, and most servers have a single IP. As each physical ESXi host tracks the vmnic’s network commit on the pNIC, workloads can be shifted across NICs to dynamically balance load. As a result, this works well for a great majority of business-cases and workloads.

With LBT-balanced workloads, the ability for a single application to scale across multiple cooperating VMs becomes a requirement at smaller scales than with LACP, as the vertical scaling limit for an interface under contention is per IP.
LACP: 6. Business case for user experience

For a monolithic application running on a single VM, network latency decreases for a static number of active users as you increase port count. This results from the ability for a single target service (IP:DST-PORT) to be hashed across multiple interfaces, as the client side (IP:SRC-PORT) provides the needed entropy.


LBT: 6. Business case for user experience

LBT can only decrease latency per workload (IP), so a monolithic application on a single VM with multiple services provides a poorer user experience per unit time as the active user base grows. This results from the technical limit of binding a single IP (target VM) to a single port in the group at a time. VMware can move it freely amongst the ports in the uplink port group, but it can never occupy more than one port simultaneously.
LACP: 7. Business case for impact resiliency

In a properly sized interconnect, as dictated by 4b above, LACP surpasses LBT’s ability to recover connectivity for realtime workloads. By virtualizing the port and having both link partners use LACP signaling to communicate aggregation-group member availability, LACP provides near-realtime failover in both directions with lossless transition.

The gain in resiliency requires additional CPU consumption on the network equipment and in the hypervisor’s LACP process due to the required use of LACP fast timers.


LBT: 7. Business case for impact resiliency

In a properly sized interconnect, as dictated by 4b above, near-realtime failover can be achieved for outbound traffic flows when an interface goes down, as the interface group is undersubscribed enough to allow for a transparent transition.

Inbound traffic flows require signaling via MAC learning to distribute new destination information for the mac-moves that result from the rebalance after a port failure. This results in minimal packet loss, which is only detrimental to realtime applications.

Some switching vendors provide in-place updates during mac-move events, which improves performance and decreases routing churn from LBT balancing events.
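To put 4b in concrete terms, here is a minimal Python sketch (with assumed, illustrative numbers, not measurements from any real environment) comparing the throughput ceiling of a single TCP socket, bounded by window size over round-trip time, against a link rate and the 4x headroom guideline above.

```python
# Rough check of the "link speed >= 4x single-flow bandwidth-delay capacity"
# guidance from 4b. All numbers below are illustrative assumptions.

def max_single_flow_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP socket's throughput: window / RTT."""
    return window_bytes * 8 / rtt_seconds

def meets_4x_rule(link_bps: float, window_bytes: int, rtt_seconds: float) -> bool:
    return link_bps >= 4 * max_single_flow_bps(window_bytes, rtt_seconds)

window = 2 * 1024 * 1024   # assumed 2 MiB TCP window
rtt = 0.001                # assumed 1 ms round trip inside the data center

for link in (10e9, 25e9, 100e9):
    flow = max_single_flow_bps(window, rtt)
    print(f"{link / 1e9:>5.0f}G link: single flow tops out near {flow / 1e9:.1f} Gbps, "
          f"4x headroom satisfied: {meets_4x_rule(link, window, rtt)}")
```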

Throughput, Speed, and Application Performance

It is often argued that LACP or LBT can be leveraged to horizontally scale bandwidth. Regardless of which you choose, these features SHOULD NEVER be used to scale the throughput needs of an application. This is not their contribution to the design. More importantly, if you value your contribution to your customer, bandwidth and throughput aren’t where you should be focusing your efforts. They do not reflect what your customer cares about. So where does that leave us?

There exists a dichotomy in viewpoints that is pervasive in the industry with regard to how technology staff measure application performance versus how the application user measures performance. Technology owners measure application performance as how heavily an available service is consumed, at time-scales of seconds. Let’s take a stroll through something I’m sure we’ve all done at some point in our careers. It’s time to build a new data center and management says “we want this future proofed, but we don’t want to just throw away money on what we aren’t utilizing.” So we promptly get to determining our historical utilization of interfaces and interconnects in the data center and WAN. We look at near-term trend lines built from averages that started out as bits-per-second averages over a 5-minute window on these ports. We create historical graphs going back the lifetime of the data center. On these graphs we consolidate the buckets into time segments of days, sometimes weeks. We might even get tricky and graph the maximum 5-minute bps average in that window rather than just the average of those values. What is invariably determined in most cases is that backups are the biggest consumer of bandwidth in the data center, with storage taking second and some large database servers taking third. While these are good things to know, it’s a misadventure if you want performant applications, because we end up minimizing our cost and maximizing utilization as a result.
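As a minimal illustration of that roll-up (synthetic sample data, not real telemetry), here is the kind of collapse from 5-minute averages into daily buckets described above; note how any sub-second burst has already been averaged away before we even start.

```python
# Sketch: collapse 5-minute bps averages into per-day average and max values,
# the roll-up capacity planners typically graph. The samples are synthetic.
from collections import defaultdict
from datetime import datetime, timedelta
import random

samples = []                      # (timestamp, bits per second), 5-minute spacing
t = datetime(2024, 1, 1)
for _ in range(7 * 24 * 12):      # one synthetic week of samples
    samples.append((t, random.uniform(0.5e9, 4e9)))
    t += timedelta(minutes=5)

per_day = defaultdict(list)
for ts, bps in samples:
    per_day[ts.date()].append(bps)

for day, values in sorted(per_day.items()):
    # "Max of the 5-minute averages" still hides anything shorter than 5 minutes.
    print(day, f"avg {sum(values) / len(values) / 1e9:.2f} Gbps,",
          f"max {max(values) / 1e9:.2f} Gbps")
```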

Now let’s think about this from the application user’s perspective. As users, we measure application performance in task executions over wait time in milliseconds. We know this because there is no easier way to irritate an application user than to twirl the “busy” pointer while something is loading, if only for half a second. Our expectation is an immediate response: processing and a reply faster than we can think. So how do we measure that? Consider that 24 frames per second is the bare minimum for something to be perceived as motion by the human eye. At that rate, the processing time from one frame of interaction to the next is only about 41 ms. Keep in mind this is the bare minimum, and that hundreds of transactions may be happening between the click at the client and the presentation of the response from the server.

The point of this fairly extensive argument is that bandwidth isn’t the target. If our goal is quantifiably happy users, minimizing application transaction time is the target. The easiest way to minimize latency is to decrease the time it takes to transmit a packet on the links between the client and the server. Switching manufacturers sell cut-through switches to decrease latency through their switches. The biggest bang for your buck, however, is the interface. The difference between a 10G interface and a 100G interface is a 90% reduction in serialization latency. I don’t have to look up prices to know that in bulk 100G transceivers are not 10x the price. The only place where one should be extra diligent is with leased circuits or rate-limited services in cloud and colocation facilities, as these are recurring costs and can come with monthly caps at commodity prices.
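For a sense of scale, here is a quick sketch of per-frame serialization time at common line rates; the 90% figure above is simply the ratio between the 10G and 100G transmit times.

```python
# Serialization (on-the-wire transmit) time for one 1500-byte frame at common
# line rates. This is only the per-hop transmit component of end-to-end latency.
FRAME_BITS = 1500 * 8

for rate_gbps in (1, 10, 25, 40, 100):
    serialization_us = FRAME_BITS / (rate_gbps * 1e9) * 1e6
    print(f"{rate_gbps:>3}G: {serialization_us:.2f} microseconds per 1500-byte frame")

# 10G comes out near 1.20 us and 100G near 0.12 us: a 90% reduction per frame.
```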

We have had NICs that provide 40G, 50G, and 100G server ports for years now. These interface speeds far exceed the usable throughput even the most high-end server could serve transactions over. In the specific use-cases where we see server-switch interconnects in the enterprise approach 20% utilization for seconds at a time, they almost certainly are not production interfaces for virtualization workloads. So how should one decide which interface speed to purchase to provide the latency required for the applications we intend to support? Just buy the fastest interfaces your network and server hardware support, then select your spine switches to accommodate 1:1 oversubscription between leaves, and hand off at the same interface rate across any DCI links between border leaves or services leaves through firewall interconnects. This prevents interface rate conversion from tying up buffers in switches where latency matters the most. Let the firewall be the throughput constraint, as it has to buffer large quantities of packets to inspect the data. And if there is a place where interface rate conversion must take place to bring connectivity down to 25G, 10G, or scarily 1G, make sure you have a router with a large amount of packet memory and make effective use of QoS policy to control which traffic is preferred for drop selection under contention. VMware makes this recommendation in many blogs and guides, but I think Pete Koehler provides the most succinct reasoning on why it’s worth the spend on the highest rate available.

Optimizing performance is often about shifting the bottleneck to a location that is the easiest to identify and control. Investing in higher quality, faster-performing switchgear helps shift potential contention back to the host, where it is easier to control through sophisticated schedulers, and remedy through faster storage devices. Good switches do indeed have up-front costs, but given the longer lifecycles of switches paired with the abilities for vSAN to easily accommodate newer, faster hardware, it is a wise step for any data center design. Pete Koehler [blogs.vmware.com]

The in-service lifetime cost difference, in terms of percentage of the overall deployment, between 10G and 100G is well under 2% in most cases, especially when accounting for licensing on VMware and service contracts for your switch hardware.
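Coming back to the 1:1 leaf-spine guidance above, here is a trivial sketch of the oversubscription arithmetic (the port counts are hypothetical; plug in your own leaf model).

```python
# Leaf oversubscription = server-facing capacity / spine-facing (uplink) capacity.
# A ratio of 1.0 or less means traffic never funnels down at the leaf-spine boundary.

def oversubscription(server_ports: int, server_gbps: int,
                     uplink_ports: int, uplink_gbps: int) -> float:
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical leaf: 32 x 100G server-facing ports, 8 x 400G uplinks to the spine.
ratio = oversubscription(32, 100, 8, 400)
print(f"Oversubscription ratio: {ratio:.2f}:1")   # 1.00:1, no rate funneling
```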

LBT signaling during port failure

The TX queue on the pNIC simply stops transmitting the packet in flight on the down interface and begins transmitting the lost packet from the same queue on the newly chosen interface. The host further signals the mac-move by transmitting a GARP message to the upstream switching and routing equipment on the newly selected port. This tells the switch to begin forwarding packets to the newly selected port.
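For illustration only, here is a hand-rolled scapy sketch of the kind of gratuitous ARP described above; the interface name, MAC, and IP are placeholders, and this is not the hypervisor’s actual implementation.

```python
# Sketch of a gratuitous ARP announcing a MAC/IP on a newly selected uplink.
# Interface, MAC, and IP are placeholder values; run with appropriate privileges.
from scapy.all import ARP, Ether, sendp

vm_mac = "00:50:56:aa:bb:cc"
vm_ip = "10.0.10.25"
new_uplink = "eth1"   # the newly selected physical port

garp = Ether(src=vm_mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=2,                                   # unsolicited ARP reply ("is-at")
    hwsrc=vm_mac, psrc=vm_ip,
    hwdst="ff:ff:ff:ff:ff:ff", pdst=vm_ip)  # sender IP == target IP: gratuitous

# Upstream switches learn vm_mac on the new port and update their forwarding tables.
sendp(garp, iface=new_uplink, verbose=False)
```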

With available queue capacity in the upstream switching fabric and a local layer-2 path between the two ports, in-flight packets destined to the VM are diverted to the new path as the MAC and IP are learned via the GARP message. In some switching designs this move can result in the revoking of the MAC and IP from the fabric forwarding tables and insertion of the new MAC and IP with the new next-hop. A small percentage of rebalanced, host-bound, in-flight packets will be lost. For some realtime applications this can be detrimental to the stability or durability of the application and the state and data managed by these critical applications.

The reality here is most enterprise applications do not require this level of availability. Further, most enterprise IT environments don’t have monitoring sufficient to isolate and identify these events even if they did. We often see these events get written off as an “application problem” by the infrastructure teams and a “network problem” by the application team, leaving the VMware admins oblivious to the problem entirely. To some, ignorance is bliss.

There is one very specific area of concern around LBT. The smallest unit LBT can balance, due to limits in how the switch makes forwarding decisions, is per IP. In a smaller enterprise, this is rarely an issue. But as the user count for a single monolithic application increases, the user experience will naturally degrade as the interface faces longer transmit times with growing per-user utilization. In large organizations, applications like Microsoft SharePoint and Teams, Oracle Financial Applications, Tibco, SAP, and the like can consume staggering amounts of bandwidth at peak usage times. The only mechanism available in these situations is to horizontally scale the VMs associated with these applications and balance workloads across them.

Most IT departments don’t have the knowledge-set to support horizontal scaling of applications. Enterprise applications are not generally designed with horizontal scaling in mind. Designing horizontally scalable applications is complex due to dependencies on scaling transient and durable state across the storage, middle-tier, and database layers.

LACP Signaling during port failure

As we have outlined above, LACP has the ability to distribute TCP sessions across multiple member interfaces. This theoretically allows LACP to provide additional horizontal scaling options by way of utilizing more ports for distribution of the workload. I know this sounds like we are attempting to meet the performance demands of the application by using link aggregation, but what we are actually attempting to do is minimize latency for any single session. Again VMware agrees:

LACP enables more advanced hashes that mix in things like source and destination port.  These hashes will allow for potentially balancing of traffic that is split across multiple connection sessions between the same two hosts. ⎯ John Nicholson [blogs.vmware.com]
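As a rough illustration of how an L3+L4 hash spreads multiple sessions between the same two hosts across LAG members, here is a small sketch; real switch ASICs use vendor-specific hash functions, so this is conceptual only.

```python
# Conceptual L3+L4 member selection: XOR the addresses and ports, then take the
# result modulo the number of active members. Vendor hash algorithms differ.
import ipaddress

def lag_member(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               active_members: int) -> int:
    key = (int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
           ^ src_port ^ dst_port)
    return key % active_members

# Sessions between the SAME client and server still spread across the members,
# because each session's ephemeral source port changes the hash input.
for sport in (49152, 49153, 49154, 49155):
    print(sport, "-> member", lag_member("10.0.10.25", "10.0.20.9", sport, 443, 2))
```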

So when LACP suffers a port failure, or a switch or NIC is placed in maintenance, LACP gracefully notifies the link partner that the interface should be removed from service. The logical port stays up, both sides continue to forward traffic across the virtual link aggregation port, and forwarding information does not need to be updated in the switch fabric. This allows us to perform maintenance, firmware or software upgrades, and configuration changes without the application taking impact.

LACP fast timers also allow sub-second identification of a path failure in the few scenarios where LFS does not identify a unidirectional or failed link. LACP fast timers place additional burden on both the switch and the VMware CPU. This is not always an optimal trade-off, especially on switches with low-end control-plane processors carrying the burden of large routing and/or EVPN tables with significant churn.

VMware vSAN Design Guide: Summary of Network Design Considerations

  • 25Gbps or faster (50/100Gbps) are required for vSAN Express Storage Architecture. 10 Gbps is supported for ESA-AF-0 ReadyNode profiles to support legacy/brownfield environments, but 25Gbps NICs should be used for these configurations.
  • For ReadyNodes hosting 200TB+ of capacity (Common to vSAN Max), 100Gbps is required for the storage cluster.
  • 10Gb networks at a minimum are required for vSAN Original Storage Architecture (OSA) all-flash configurations. 25Gbps or faster are recommended for best performance.
  • NIC teaming is recommended for availability/redundancy
  • Jumbo frames can provide benefits in a vSAN environment.
  • Consider vDS with NIOC to provide QoS for vSAN traffic.
  • The VMware vSAN Networking Design Guide reviews design options, best practices, and configuration details, including:
    • vSphere Teaming Considerations – IP Hash vs other vSphere teaming algorithms
    • Physical Topology/switch Considerations – Leaf Spine topology is preferred to legacy 3 tier designs or use of fabric extension.
    • vSAN Network Design for High Availability – Design considerations to achieve a highly available vSAN network
    • Load Balancing Considerations – How to achieve aggregated bandwidth via multiple physical uplinks for vSAN traffic in combination with other traffic types
    • vSAN with other Traffic Types – Detailed architectural examples and test results of using Network I/O Control with vSAN and other traffic types

Observations in data center design worth noting

  1. The higher the line-rate, the less time required for packets to leave one host and arrive at another. Faster link transmission results in lower latency; lower latency results in a more responsive application, faster transactions, and happy customers who continue to be customers.
  2. As a rule of thumb backups shouldn’t compete with the workload, and as a result, should never transit the production virtualization interface. Keep this in mind when selecting your backup solution.
  3. As of vSphere 5.5 LACP supports inclusion of the socket (TCP/UDP Port) in the hash for interface distribution, providing significant balancing improvement over IP hash and Dynamic LBT with IP routing.
  4. As of vSAN 6.0, it is recommended to leverage LACP for the vSAN and vMotion dVS.
  5. VCF has specific requirements and recommendations, so when using VCF for vxRail deployments, strictly follow the guide or run the risk of upgrade failures and prolonged outages.
  6. When using QoS in data center networks, opt for shaping over policing, and keep it simple.
  7. Isolate virtualized applications with high-throughput non-TCP traffic, such as multicast, Voice over IP, and broadcast video, on dedicated lower-rate interfaces and/or clusters which match the line-rate handoff in your WAN.
  8. For added redundancy, leverage leaf-spine architecture with Multi-chassis Link Aggregation or EVPN Active-Active Multihoming topologies for the leaf-server interconnect.
  9. Lossless ethernet functions should only be used in use-cases that specifically call for them. Pause frames can and will impact workloads in unexpected ways otherwise.
  10. Switch fabrics should ALWAYS enable maximum MTU on leaf-spine and DCI interfaces, while service-facing SVIs and Layer-3 interfaces should decrease the MTU such that there is a minimum of 100 bytes of headroom for other encapsulations which may be used by providers (see the back-of-the-envelope sketch after this list).
  11. TCP MSS is your friend; understand how it works and its limitations, and apply it liberally at the data center edge, in the campus, and across the WAN.
  12. Make your vendors compete for your business. Request Proof of Concepts (POCs) to showcase real world capabilities.
  13. Maintain a small lab reflecting what is in production, so when things break you have a place to test.
  14. As of VCF 4.1, VMware has made the recommendation to NOT use LACP despite its benefits. Heed this advice and follow VCF. I have been on two calls in the past few years, where VMware has refused to support a deployment because the design deviated from VCF guidelines.
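A quick back-of-the-envelope for points 10 and 11; the MTU and overhead values below are common defaults and assumptions, so substitute your own platform’s numbers.

```python
# Headroom and MSS math for observations 10 and 11. Overheads are textbook
# values; your provider's encapsulation stack may differ.
FABRIC_MTU = 9214          # example jumbo MTU on leaf-spine and DCI links
PROVIDER_HEADROOM = 100    # minimum headroom reserved for provider encapsulations
VXLAN_OVERHEAD = 50        # outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8
EDGE_MTU = 1500            # typical MTU at the data center edge / campus / WAN
IP_TCP_HEADERS = 40        # inner IPv4 20 + TCP 20, no options

print(f"Service-facing SVI/L3 MTU: {FABRIC_MTU - PROVIDER_HEADROOM}")
print(f"Inner frame room under VXLAN at fabric MTU: {FABRIC_MTU - VXLAN_OVERHEAD}")
print(f"TCP MSS clamp for a {EDGE_MTU}-byte edge MTU: {EDGE_MTU - IP_TCP_HEADERS}")
```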

References