If you are considering moving your VMware workloads to the cloud, or you have already tried and run into some quirky Layer 2 Extension networking issues, this is the blog for you.
If you have a VMware vCenter stack and physical networking on-premises, and you want to stretch Layer 2 from on-premises to your cloud vCenter in VMware Cloud on AWS, which is primarily NSX-based, you might encounter a problem.
From the available options, you likely selected the Standalone Edge, as it carries no additional licensing cost and is relatively quick to set up.
Session-based traffic that crosses your physical networking and the Standalone Edge appears to drop and never reaches its destination.
Here’s what we discovered and how you can resolve this issue.
The trouble ticket came to us in the following state: The client had three networks stretched from their on-premises data center to VMware Cloud on AWS. Two networks were considered “internal” and the third was a “DMZ”. Think of a web application with a Layer 2 extension across both sites, with the gateways, IDS, and IPS on-premises.
Traffic was working between the two internal networks, but not between the internal networks and the DMZ.
Here are the steps we took:
We checked all firewall rules and routes. Everything appeared to be configured correctly.
We sent ICMP packets and ran traceroutes across all three networks without any problems.
Even better: we initiated RDP, ODBC, and telnet/SSH sessions that would connect successfully and then immediately terminate. Does this feel “firewall-y” enough yet? But the firewall, IDS, IPS, and routes were already confirmed good! So, what next?
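If you want to reproduce the "connects, then dies" symptom without opening a full RDP or SSH client, a small probe can help. What follows is a minimal sketch in Python, not the tooling we used on the ticket. The function name `probe_tcp_session` and its parameters are hypothetical. It only tells apart three cases: the connection never opens, the connection opens and is then closed or reset, or the connection stays open for the hold period.

```python
import socket


def probe_tcp_session(host, port, hold_seconds=5.0, timeout=3.0):
    """Open a TCP connection and report whether it survives past the handshake.

    Hypothetical helper for illustration. ICMP and traceroute can succeed
    while this probe still fails, because it exercises a real TCP session.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Covers refused connections, unreachable hosts, and connect timeouts.
        return f"connect failed: {exc}"

    with sock:
        # Wait for the peer to send data, close, or reset the session.
        sock.settimeout(hold_seconds)
        try:
            data = sock.recv(1)
        except socket.timeout:
            # Nothing arrived and nothing broke during the hold period.
            return "session held open"
        except OSError as exc:
            return f"session reset after connect: {exc}"

        if data == b"":
            # An empty read means the peer closed the connection cleanly.
            return "session closed by peer after connect"
        return "session held open (peer sent data)"
```

Calling `probe_tcp_session("192.0.2.10", 3389)` (a placeholder documentation address) from a VM on each stretched network, in both directions, gives a quick matrix of which network pairs can hold a session and which cannot.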
We spent quite a few hours debugging NSX Edges and ASA appliances.
What we found
We determined that the traffic was, in fact, ingressing the Standalone Edge but would never egress. At no point did we see any indication that the Standalone Edge dropped the traffic, attempted to route it elsewhere, or actively rejected it.
Whatever was happening was definitely happening on the Standalone Edge. We removed and re-deployed the Standalone Edge to make sure this wasn’t a versioning issue, but the newly deployed Appliance had the same behavior.
We began a conversation with VMware and VMware Cloud on AWS support. After more hours of debugging every virtual and physical network interface in the path of the data, we came across a few forum and blog posts from 2009 (10 years ago as of the writing of this) that described similar behavior, having to do with certain kernels quietly dropping packets when TCP sequence numbers were randomized. This rang a bell with one of the VMware Cloud on AWS support staff, who recalled a bug from some time ago.
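For readers who have not met TCP sequence number randomization before, here is a simplified model of what it generally involves: a device in the path adds a per-connection offset to sequence numbers in one direction and subtracts it from acknowledgment numbers in the other. This is an illustrative sketch only. It does not reproduce the Standalone Edge's behavior or any vendor's implementation, and every name and value in it is hypothetical.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def rewrite_outbound_seq(seq, offset):
    """Sequence number as seen by the far side after the device adds its offset."""
    return (seq + offset) % SEQ_SPACE


def rewrite_inbound_ack(ack, offset):
    """Acknowledgment number restored for the original sender."""
    return (ack - offset) % SEQ_SPACE


def in_window(seq, expected, window):
    """True if seq falls within `window` bytes at or after `expected`."""
    return (seq - expected) % SEQ_SPACE < window


# Hypothetical values for one connection.
offset = 0x9E3779B9
client_isn = 1000

# The server sees the rewritten number and acknowledges it plus one.
seen_by_server = rewrite_outbound_seq(client_isn, offset)
server_ack = (seen_by_server + 1) % SEQ_SPACE

# On the return path the device restores what the client expects.
restored_ack = rewrite_inbound_ack(server_ack, offset)  # client_isn + 1
```

In this model, anything that tracks the connection but sees only one side of the rewrite ends up comparing numbers that differ by the offset, so `in_window` would return `False` for otherwise valid segments. That is at least consistent with the "quietly dropped" behavior those old posts described.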
We continued speaking with VMware staff and sure enough, this was considered a “bug” at VMware, but it didn’t have a supported fix for the Standalone Edge. We found a few solutions for NSX Manager and unrelated products, but nothing specifically for the Standalone Edge within supported tracks.
Eventually, we got the below information as a supported method to allow random sequence numbers with TCP traffic on the Standalone Edge. We implemented this for our client, and it works a charm and persists after a reboot. Just keep it in your back pocket for the day you have to redeploy!
So, if you are stretching Layer 2 using the NSX Standalone Edge, your session-based traffic isn’t reaching its destination, and you have TCP sequence number randomization enabled on your physical network, the above may be of use to you.
We hope you found the information and the script valuable. We also published a really handy Cheat Sheet for VMware Cloud on AWS recently that can provide some guidance on the criteria and limitations you should know as you adopt VMware Cloud on AWS for your next project.