VM Network Troubleshooting from Guest OS to Uplink: A Layer by Layer VMware Runbook

Virtual machine network problems rarely arrive with a clean label.

The ticket usually says something like “the VM is unreachable,” “the application cannot connect,” “ping fails,” “internet access is down,” or “VMs on different hosts cannot talk.” The underlying cause might be inside the guest OS, on the VM’s virtual NIC, in the port group, on the VLAN trunk, in the distributed switch, on a bad uplink, at the physical switch, in routing, or at a firewall boundary.

That is why a useful VMware troubleshooting process needs to be layered.

Broadcom KB 324542 frames VM network troubleshooting as a sequence of checks that should not be skipped: port group names, VM adapter connection state, guest OS networking, TCP/IP stack behavior, P2V hidden adapters, uplink isolation, VLAN configuration, jumbo frames, and packet capture. This article turns that KB into an operational ladder that an engineer can use during a real incident.

The goal is not to prove that the network, virtualization layer, firewall, or guest OS is “the problem.” The goal is to narrow the failure domain without creating a second outage.

Scenario

A virtual machine running on VMware vSphere has lost network connectivity.

The symptom may be isolated to one VM, several VMs on the same port group, VMs on one ESXi host, VMs after vMotion, VMs on different VLANs, or traffic to a specific destination. Broadcom’s KB lists common symptoms such as unreachable VMs, failed VM-to-VM communication across hosts, high latency, failed inbound or outbound traffic, unavailable internet access, and TCP/IP connection failures.

The runbook starts inside the guest OS and works outward to the physical and policy boundaries.

Why This Matters Operationally

The fastest way to waste time on a VM network issue is to start in the middle.

Changing a VLAN before checking the guest IP configuration can hide a simple OS issue. Rebuilding a port group before checking an uplink can create a broader outage. Blaming routing before testing the default gateway can pull the wrong team into the incident.

That matters in VCF and vSphere operations because VM networking crosses ownership boundaries. The same packet can touch the guest OS, vNIC, port group, vDS, host uplink, top-of-rack switch, default gateway, firewall, and routing domain before the application ever sees a response.

Symptoms and Risk

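Use this runbook when you see any of the symptoms described in the Scenario section: unreachable VMs, failed VM-to-VM communication across hosts, high latency, failed inbound or outbound traffic, unavailable internet access, or TCP/IP connection failures.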

The operational risk is not just downtime. It is accidental blast radius.

Do not change VLANs, uplink teaming, LACP, distributed switch policies, firewall rules, or physical switch trunks until you have captured the current state and identified the smallest safe test.

Troubleshooting Ladder at a Glance

The troubleshooting path moves from the VM outward: guest OS, vNIC and port group, VLAN, vSwitch or vDS, ESXi uplink, physical switch, default gateway and routing, firewall policy, and finally packet capture. Each layer should either prove connectivity, identify the break, or provide the evidence needed to hand off to the next owner.

This should be treated as a ladder, not a checklist of random ideas. If the VM cannot reach its default gateway, focus on Layer 2, VLAN, port group, uplink, and physical switch evidence first. If the VM can reach the gateway but cannot reach another subnet, Broadcom’s default-gateway troubleshooting guidance points toward Layer 3 routing rather than the local virtual switch path.

Prerequisites and Safety Checks

Before changing anything, collect the basics.

You need:

VM name

Guest OS type

VM IP address, subnet mask, default gateway, DNS servers

Destination IP, port, and protocol being tested

ESXi host currently running the VM

Cluster and vDS or standard vSwitch name

Port group name and VLAN ID

Physical uplinks used by the host

Whether NSX/vDefend Distributed Firewall applies

Whether this is a single VM, port group, host, cluster, or site-wide symptom

There is one important exception: if the unreachable VM is vCenter Server, be careful. Broadcom’s KB specifically calls out vCenter reachability as a scenario where opening a networking support case may be the best path, especially when vCenter networking is delivered through a vSphere Distributed Switch.

That warning exists for a reason. A vDS-backed vCenter outage can turn normal remediation into a control-plane recovery problem.

Runbook Stages

Stage 1: Define the Failure Domain

Start by proving the scope.

Ask four questions:

Is this one VM or multiple VMs?

Is it one port group or multiple port groups?

Is it one ESXi host or every host in the cluster?

Is the failure limited to one destination, one subnet, or all traffic?

This first step decides where the runbook branches.

A single VM problem usually starts with the guest OS, VM vNIC, or VM-specific policy. A port group-wide issue points toward VLAN, port group policy, or upstream trunking. A host-specific issue points toward that ESXi host’s uplinks, physical switch ports, or LACP/team configuration. A cross-subnet-only issue points toward routing or firewall policy.
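The branching above can be sketched as a small lookup. This is an illustrative Python sketch, not something taken from Broadcom's KB; the scope labels and the function name `starting_layer` are hypothetical.

```python
# Illustrative only: maps an observed failure scope to the layers this
# runbook suggests checking first. Labels and names are hypothetical.
FIRST_CHECKS = {
    "single_vm": ["guest OS", "VM vNIC", "VM-specific policy"],
    "port_group": ["VLAN", "port group policy", "upstream trunking"],
    "host": ["ESXi uplinks", "physical switch ports", "LACP/team configuration"],
    "cross_subnet_only": ["routing", "firewall policy"],
}


def starting_layer(scope):
    """Return the first layers to check for a given failure scope."""
    if scope not in FIRST_CHECKS:
        raise ValueError(f"unknown scope: {scope!r}")
    return FIRST_CHECKS[scope]
```

For the APP01 example below, `starting_layer("single_vm")` returns the guest OS, vNIC, and VM-specific policy as the first places to look.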

Document the failure in plain terms:

Source VM: APP01
Source IP: 10.20.30.41
Source Host: esxi07
Port Group: PG-App-Prod
VLAN: 230
Destination: 10.20.30.1 default gateway
Result: Ping fails from APP01, succeeds from APP02 on same port group
Scope: Single VM

That simple record prevents the incident from drifting.

Stage 2: Check the Guest OS First

A VM can be perfectly connected to the right port group and still fail because the guest OS is misconfigured.

From inside the guest, verify:

IP address

Subnet mask or prefix length

Default gateway

DNS settings

Static routes

Duplicate IP warnings

OS firewall profile

NIC driver state

Whether the OS thinks the cable is disconnected

For Windows:

ipconfig /all
route print
ping 127.0.0.1
ping <vm-ip>
ping <default-gateway-ip>
tracert <destination-ip>
Test-NetConnection <destination-ip> -Port <tcp-port>

For Linux:

ip addr
ip route
ping -c 4 127.0.0.1
ping -c 4 <vm-ip>
ping -c 4 <default-gateway-ip>
traceroute <destination-ip>
nc -vz <destination-ip> <tcp-port>

Interpret the results carefully.

If loopback fails, the problem is inside the OS TCP/IP stack. If the VM cannot ping its own IP, the guest stack or interface configuration is suspect. If the VM can ping itself but not the gateway, move outward to the vNIC, port group, VLAN, and uplink path. If the VM can ping the gateway but not a remote subnet, shift toward routing or firewall boundaries.
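The interpretation rules above amount to a first-failure-wins ladder. A minimal Python sketch, assuming the four ping results have already been collected by hand; the function name `next_step` and the returned wording are illustrative, not part of any VMware tool:

```python
def next_step(loopback_ok, self_ok, gateway_ok, remote_ok):
    """Return where to look next, given four guest-side ping results.

    Checks are evaluated from the guest outward; the first failure wins.
    """
    if not loopback_ok:
        return "guest OS TCP/IP stack"
    if not self_ok:
        return "guest stack or interface configuration"
    if not gateway_ok:
        return "vNIC, port group, VLAN, and uplink path"
    if not remote_ok:
        return "routing or firewall boundaries"
    return "ICMP path looks healthy; test the application port next"
```

Evaluating in this order keeps the investigation from skipping layers: a gateway failure is only meaningful once loopback and self-ping have passed.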

Broadcom’s KB explicitly includes guest OS networking and TCP/IP stack validation as part of the VM network troubleshooting sequence.

Stage 3: Verify the VM vNIC and Port Group Assignment

Next, confirm the virtual NIC exists, is connected, and is attached to the intended network.

In vSphere Client, check:

VM > Edit Settings

Network Adapter status

Connected checkbox

Connect at power on

Port group name

Adapter type

MAC address

Any recent network adapter changes

Broadcom’s KB starts the vSphere-side troubleshooting sequence by ensuring the VM’s port group exists on the vSwitch or vDS, is spelled correctly, and that the VM’s adapter is connected. It also notes that standard switches require VMkernel adapters to use their own port groups, so a VM should not be placed on a VMkernel port group.

This stage catches common mistakes:

Finding | Likely Cause | Action
Adapter disconnected | Manual change, automation issue, migration artifact | Reconnect only after confirming correct port group
Wrong port group | Template, clone, restore, or migration mistake | Move to correct port group
Port group missing on target host | Host not attached to vDS, standard switch inconsistency | Fix host/vDS membership or port group placement
Duplicate or stale guest NIC | P2V or OS-level hidden adapter | Clean up hidden adapter/IP conflict

If the VM was converted from physical to virtual, pay attention to hidden adapters. Broadcom’s KB calls out P2V hidden network adapters as a specific condition to check when troubleshooting VM networking.

Stage 4: Validate VLAN and Subnet Alignment

Many “VM network” incidents are really VLAN consistency problems.

Confirm:

VM IP subnet matches the intended VLAN

Port group VLAN ID is correct

Physical switch port mode matches the VMware tagging model

The VLAN is allowed on the trunk

Native VLAN expectations are understood

The same VLAN is available on every host where the VM can run

Broadcom’s VLAN configuration article describes three ESXi VLAN tagging methods: External Switch Tagging, Virtual Switch Tagging, and Virtual Guest Tagging. In EST, tagging is done on the physical switch and the ESXi port group VLAN ID is set to 0. In VST, tagging is done by the virtual switch and the ESXi uplinks connect to physical trunk ports with the appropriate VLAN configured on the port group. In VGT, tagging is done inside the guest OS and VLAN tags are preserved through the virtual switch.

Most enterprise VM port groups use VST. That means the usual check is:

VM subnet -> expected VLAN
Port group -> same VLAN ID
ESXi uplink -> physical trunk
Switchport -> VLAN allowed on trunk
Gateway -> SVI/router for that VLAN reachable

Do not assume the VLAN is correct because the port group name looks right. Validate the actual VLAN ID.
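The first link in that chain, whether the guest's IP and mask actually place it in the same subnet as its gateway, can be checked offline. A small sketch using Python's standard `ipaddress` module; the function name `same_subnet` is hypothetical and the addresses reuse the example values from Stage 1:

```python
import ipaddress


def same_subnet(vm_cidr, gateway_ip):
    """True if the gateway address falls inside the VM's configured subnet.

    vm_cidr is the guest's address with prefix length, e.g. "10.20.30.41/24".
    """
    vm = ipaddress.ip_interface(vm_cidr)
    return ipaddress.ip_address(gateway_ip) in vm.network


# Correct mask: the gateway is on-link.
aligned = same_subnet("10.20.30.41/24", "10.20.30.1")
# Gateway belongs to a different subnet: the guest config or VLAN mapping is wrong.
misaligned = same_subnet("10.20.30.41/24", "10.20.31.1")
```

This only validates the guest-side arithmetic. It says nothing about whether the port group and trunk carry the matching VLAN ID, which still has to be confirmed on the switch side.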

Stage 5: Check the vSwitch or Distributed Switch Path

Now move from the VM object to the switching layer.

For a standard vSwitch, confirm:

Port group exists on the host where the VM is running

Correct VLAN ID

Correct uplinks assigned

Teaming and failover settings

Security policy settings if relevant

MTU alignment if jumbo frames are required

For a vSphere Distributed Switch, confirm:

Host is attached to the correct vDS

Distributed port group exists

VM is connected to the expected distributed port

Port group VLAN policy is correct

Teaming and failover policy is correct

Active uplinks map to physical NICs that carry the required VLAN

No per-port override is changing the expected policy

This is where a lot of post-vMotion issues show up. The VM may land on a host where the distributed port group exists, but the physical uplink path does not actually carry the VLAN.

A clean test is to compare a working VM and a failing VM:

Comparison Point | Working VM | Failing VM
Same port group? | Yes/No | Yes/No
Same VLAN ID? | Yes/No | Yes/No
Same ESXi host? | Yes/No | Yes/No
Same active vmnic? | Yes/No | Yes/No
Same default gateway result? | Yes/No | Yes/No
Same firewall policy? | Yes/No | Yes/No

Broadcom’s default gateway troubleshooting guidance recommends comparing affected VMs against other VMs in the same port group/subnet, and using esxtop networking view when only some VMs have gateway connectivity issues.
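If the comparison points are recorded as key/value pairs, the differences between a working and a failing VM can be listed mechanically. A minimal Python sketch; the field names and the function `diff_vms` are hypothetical, and the values are filled in by hand from vSphere Client and esxtop rather than queried from any API:

```python
def diff_vms(working, failing):
    """Return {field: (working_value, failing_value)} for fields that differ."""
    keys = sorted(set(working) | set(failing))
    return {
        key: (working.get(key), failing.get(key))
        for key in keys
        if working.get(key) != failing.get(key)
    }


# Example values taken from the APP01/APP02 scenario in this runbook.
app02 = {"port_group": "PG-App-Prod", "vlan": 230, "host": "esxi08",
         "vmnic": "vmnic3", "gateway_ping": True}
app01 = {"port_group": "PG-App-Prod", "vlan": 230, "host": "esxi07",
         "vmnic": "vmnic2", "gateway_ping": False}
differences = diff_vms(app02, app01)
```

Here the port group and VLAN match, so the differences that remain (host and active vmnic) point toward the uplink path checked in the next stage.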

Stage 6: Isolate the ESXi Uplink and Teaming Path

If the problem appears host-specific or intermittent, check the uplink path.

On the ESXi host, use esxtop and press n for networking. Broadcom’s KB recommends using esxtop networking output to see which physical NIC a VM is using, then isolating physical switch ports one at a time to determine where connectivity is lost.

Useful ESXi checks:

esxtop
# Press n for networking view

net-stats -l

esxcli network nic list

esxcli network nic stats get -n vmnicX

Look for:

VM mapped to a different uplink than working VMs

Link down or speed/duplex mismatch

RX/TX errors

Dropped packets

Incorrect standby/active uplink order

LACP or EtherChannel mismatch

VLAN missing on one trunk but present on another

If the port group uses Route Based on Originating Virtual Port ID, a VM may consistently use one uplink until it moves or reconnects. If one uplink path is misconfigured, only a subset of VMs may fail. That symptom often looks random until you map VM traffic to the active pNIC.
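One way to make that pattern visible is to group test results by the uplink each VM is using. The sketch below is illustrative Python, assuming the VM-to-vmnic mapping has been read manually from the esxtop networking view; the function names and the sample observations are hypothetical:

```python
from collections import defaultdict


def results_by_uplink(observations):
    """Group gateway-ping results by uplink.

    observations: iterable of (vm_name, vmnic, gateway_ping_ok) tuples.
    Returns {vmnic: {"ok": [vm, ...], "failed": [vm, ...]}}.
    """
    summary = defaultdict(lambda: {"ok": [], "failed": []})
    for vm, vmnic, ok in observations:
        summary[vmnic]["ok" if ok else "failed"].append(vm)
    return dict(summary)


def suspect_uplinks(observations):
    """Uplinks where every observed VM failed and none succeeded."""
    return sorted(
        vmnic
        for vmnic, result in results_by_uplink(observations).items()
        if result["failed"] and not result["ok"]
    )


sample = [
    ("APP01", "vmnic2", False),
    ("APP03", "vmnic2", False),
    ("APP02", "vmnic3", True),
]
suspects = suspect_uplinks(sample)
```

With the sample data, only `vmnic2` is flagged, which is the kind of evidence that justifies asking the network team to inspect one specific switchport rather than the whole trunk design.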

If LACP or EtherChannel is in use, validate both sides. Broadcom’s VM network troubleshooting KB calls out port-channel techniques and recommends verifying that the physical switch ports are configured correctly for the channel.

Stage 7: Validate the Physical Switch Edge

At this stage, the virtualization team should have enough evidence to engage the network team with specifics.

Provide:

ESXi host: esxi07
VM: APP01
Port group: PG-App-Prod
VLAN: 230
Active vmnic: vmnic2
Switchport: ToR-A Eth1/17
Test: APP01 cannot ping 10.20.30.1 gateway
Working path: APP02 on esxi08 via vmnic3 can ping gateway
Request: Confirm switchport trunk allows VLAN 230 and MTU matches

Ask the network team to validate:

Access vs trunk mode

Allowed VLAN list

Native VLAN behavior

Port-channel membership

STP/portfast configuration

MTU

MAC address learning

ARP behavior

Interface errors or drops

ACLs on the switchport or SVI

This is also the right stage to check jumbo frames. Broadcom’s KB notes that if VMs require MTU 9000 and the VM network is configured for jumbo frames, the physical switch ports must also be configured for jumbo frames.
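When testing a jumbo-frame path with ping, the payload size has to leave room for headers. A short worked calculation in Python, assuming IPv4 without IP options (20-byte header) and an 8-byte ICMP header; the function name is hypothetical:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


jumbo_payload = max_icmp_payload(9000)     # 8972
standard_payload = max_icmp_payload(1500)  # 1472
```

So an end-to-end MTU 9000 test uses a payload of 8972 bytes with fragmentation disallowed, for example `ping -M do -s 8972 <destination-ip>` on Linux or `ping -f -l 8972 <destination-ip>` on Windows. If that fails while a 1472-byte payload succeeds, some hop in the path is not passing jumbo frames.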

Stage 8: Test Default Gateway, Routing, and Remote Subnets

Separate Layer 2 reachability from Layer 3 reachability.

Use this logic:

Can VM ping itself?
No -> guest OS / TCP/IP stack

Can VM ping another VM on same subnet?
No -> port group / VLAN / uplink / local firewall

Can VM ping default gateway?
No -> VLAN / uplink / physical switch / gateway SVI

Can VM ping remote subnet?
No -> routing / firewall / ACL / asymmetric path

Can VM connect to the TCP port on a remote host that answers ping?
No -> service listener / firewall / security policy / application path

Broadcom’s default gateway article states that if VMs on the same subnet and host cannot reach the gateway, check VLAN configuration on the port group and physical switch. It also states that if gateway connectivity succeeds but other subnets fail, the issue is likely routing/Layer 3 and the network team should investigate.

For TCP checks from ESXi or supporting hosts, nc is useful when you need to test whether a TCP port is reachable. Broadcom’s host network troubleshooting KB lists ping/vmkping, nc, openssl, tcpdump-uw, and esxcli network as ESXi troubleshooting tools, and notes that nc helps determine whether a TCP port is online or possibly blocked by a firewall.

Example:

nc -z <destination-ip> <tcp-port>

For guest-level testing, use the tool appropriate to the OS. On Windows (PowerShell):

Test-NetConnection <destination-ip> -Port 443

On Linux:

nc -vz <destination-ip> 443

A successful ping does not prove the application path is open. It only proves ICMP reachability.
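Where neither `nc` nor `Test-NetConnection` is available in the guest, the same TCP check can be approximated with Python's standard `socket` module. This is a minimal sketch, not a replacement for the tools above; the function name `tcp_port_open` and the default timeout are assumptions:

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port completes within the timeout.

    A False result covers refusal, timeout, and unreachable errors alike,
    so it does not by itself distinguish a closed port from a firewall drop.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `tcp_port_open("<destination-ip>", 443)` returns True only when the three-way handshake completes, which is a stronger statement about the application path than an ICMP reply.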

Stage 9: Check Firewall and Security Policy Boundaries

Firewall troubleshooting belongs near the end of the ladder, but it should not be ignored.

There may be multiple enforcement points:

Boundary | What to Check
Guest OS firewall | Windows Defender Firewall, Linux firewalld/iptables/nftables
NSX/vDefend Distributed Firewall | Rule match, applied-to scope, rule order, realization, exclusion list
Upstream firewall | Source/destination zones, service object, NAT, route symmetry
Physical ACL | SVI ACL, switchport ACL, routed interface ACL
Application listener | Service bound to correct IP and port

For NSX/vDefend DFW, Broadcom’s DFW troubleshooting guidance recommends checking rule source, destination, services, profiles, actions, applied-to scope, rule order, whether the rule is enabled, Traceflow, packet logs, and realized rules on ESXi hosts.

Do not “test” a firewall theory by broadly disabling security controls in production.

Safer tests include:

Verify rule hit counters.

Temporarily enable logging on the suspected rule.

Test a narrow source/destination/service tuple.

Use Traceflow where NSX applies.

Compare the VM against a known-good VM in the same security group.

Use a temporary allow rule only with change approval, scope, owner, and rollback.

If adding the VM to an exclusion list appears to remediate the problem, treat that as a diagnostic result, not the final fix. Broadcom’s DFW troubleshooting article includes the exclusion list as one troubleshooting step, but the durable fix should be a corrected policy, group membership, service definition, or rule order.

Stage 10: Use Packet Capture When the Evidence Is Still Ambiguous

Packet captures are the escalation tool that turns “it should work” into evidence.

Use them when:

The VM sends traffic but never receives replies.

The gateway ARP does not resolve.

One uplink works and another does not.

A firewall team needs proof of source, destination, and port.

The physical network team needs to know whether frames leave the ESXi host.

The application team says traffic never arrives.

Broadcom documents pktcap-uw as an ESXi packet capture tool included in ESXi 5.5 and later, capable of capturing traffic at multiple points in the hypervisor. The same Broadcom article warns not to store packet captures in /tmp; use an appropriate datastore path instead.

A practical pattern is to capture near the VM and near the uplink at the same time.

First identify the VM’s switchport and active uplink:

net-stats -l
esxtop
# Press n for networking view

Then capture at the VM vNIC side and uplink side:

mkdir /vmfs/volumes/<datastore>/Packet_Captures

pktcap-uw --switchport <switchport-id> \
  --capture VnicTx,VnicRx \
  -s 256 \
  --ip <gateway-or-destination-ip> \
  -o /vmfs/volumes/<datastore>/Packet_Captures/<host>.<vm>.switchport.pcapng &

pktcap-uw --uplink vmnicX \
  --capture UplinkSndKernel,UplinkRcvKernel \
  -s 256 \
  --ip <gateway-or-destination-ip> \
  -o /vmfs/volumes/<datastore>/Packet_Captures/<host>.vmnicX.uplink.pcapng &

Stop captures cleanly:

kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)

Broadcom’s pktcap-uw guidance describes --switchport as the capture point closest to the VM vNIC and --uplink as the capture point closest to the physical infrastructure.

Interpretation is straightforward:

Capture Result | Likely Meaning
Packet leaves VM vNIC but not uplink | vSwitch/vDS policy, port state, security filter, teaming path
Packet leaves uplink but no reply returns | Physical switch, VLAN, gateway, firewall, routing
Request and reply seen on uplink but not VM vNIC | Host switching, DFW/security filter, port state
Nothing leaves VM vNIC | Guest OS, application, local firewall, vNIC disconnected
ARP request leaves but no ARP reply | VLAN, gateway, physical switch, duplicate IP, upstream filtering

Packet capture should be short, scoped, and tied to an active test. Long, unfiltered captures create noise and operational risk.

Command Reference

Task | Command / Tool | Where
Show Windows IP configuration | ipconfig /all | Guest OS
Show Windows routes | route print | Guest OS
Test Windows TCP port | Test-NetConnection <ip> -Port <port> | Guest OS
Show Linux IP configuration | ip addr | Guest OS
Show Linux routes | ip route | Guest OS
Test Linux TCP port | nc -vz <ip> <port> | Guest OS
Test gateway | ping <gateway-ip> | Guest OS
Trace routed path | tracert / traceroute | Guest OS
Show ESXi networking view | esxtop, then n | ESXi
List VM switchports | net-stats -l | ESXi
Show physical NIC stats | esxcli network nic stats get -n vmnicX | ESXi
Capture VM-side traffic | pktcap-uw --switchport <id> | ESXi
Capture uplink traffic | pktcap-uw --uplink vmnicX | ESXi
Test ESXi TCP connectivity | nc -z <ip> <port> | ESXi

Validation Steps

Do not close the incident after the first successful ping.

For vMotion-sensitive issues, validate on more than one host. A VM that works only on one ESXi host is not fixed; it is pinned to a working path.

Rollback and Fallback Guidance

Troubleshooting should not leave the environment in a more fragile state.

Before changing a network setting, capture:

Object changed:
Original value:
New value:
Reason:
Approver:
Validation test:
Rollback step:
Rollback owner:
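
Teams that track changes in scripts rather than tickets could hold the same record in a small structure that flags blank fields before a change proceeds. The Python sketch below is only an illustration of that idea; the class name `ChangeRecord` and the method `missing_fields` are hypothetical:

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeRecord:
    """Mirrors the change template above; every field is free text."""
    object_changed: str
    original_value: str
    new_value: str
    reason: str
    approver: str
    validation_test: str
    rollback_step: str
    rollback_owner: str

    def missing_fields(self):
        """Names of fields left blank; an empty list means the record is complete."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A record with an empty approver or rollback step is a signal to pause, not to proceed and fill it in afterward.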

Safe fallback options include:

Reconnect the VM to the previously working port group.

Move the VM back to the previously working ESXi host.

Revert a port group VLAN change.

Restore original uplink teaming order.

Remove temporary firewall allow rules.

Revert guest firewall test changes.

Remove temporary static routes.

Stop packet captures and clean up capture files.

Avoid fallback actions that hide the root cause. For example, pinning a VM to one host might restore service, but it should be documented as a containment action, not the final resolution.

Practical Troubleshooting Patterns

Conclusion

VM network troubleshooting works best when it is boring.

Start in the guest. Validate the vNIC. Confirm the port group. Prove the VLAN. Check the distributed switch and uplink path. Validate the physical switch. Separate gateway reachability from routing. Then test firewall and application boundaries with specific source, destination, protocol, and port evidence.

The operational mistake is jumping layers too quickly. The operational discipline is proving where the packet stops.

Broadcom KB 324542 provides the vendor-backed troubleshooting sequence. The runbook above turns that sequence into a practical ladder for vSphere and VCF operations: guest OS to vNIC, port group to VLAN, distributed switch to uplink, physical network to routing, and firewall policy to final application validation.

External Sources

Broadcom KB 324542 — Troubleshooting virtual machine network connection issues

Broadcom KB 311764 — VLAN configuration on virtual switches, physical switches, and virtual machines

Broadcom KB 307777 — Troubleshooting virtual machine default gateway connection issues

Broadcom KB 341568 — Using the pktcap-uw tool in ESXi 5.5 and later

Broadcom KB 341078 — Troubleshooting network and TCP/UDP port connectivity issues on hosts

Broadcom KB 379438 — Troubleshooting Distributed Firewall


The post VM Network Troubleshooting from Guest OS to Uplink: A Layer by Layer VMware Runbook appeared first on Digital Thought Disruption.