ESXi PSOD Triage: Turning a Purple Screen into an Evidence-Driven Escalation

A purple screen on an ESXi host creates an immediate operational problem, but the bigger risk is what happens next.

The first reaction is usually to get the host back online. That is understandable, especially when workloads are down, HA is recovering virtual machines, or a cluster is running hot after losing capacity. But if the host is power-cycled too quickly, patched based on a partial error string, or returned to production without preserving evidence, the team may lose the only data that can explain what actually happened.

Broadcom KB 316522 is a useful example. The signature mentions NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307, and the KB ties a specific pattern to HPE Gen10, Gen10 Plus, and Gen11 platforms, the HPE iLO native driver, and ESXi 7.x / 8.x remediation guidance. But the important lesson is broader than one crash signature: a PSOD should be treated as an evidence workflow before it becomes a remediation workflow.

Broadcom’s KB identifies the affected hardware families, the vmkernel.log heap alert, and an example NOT_IMPLEMENTED backtrace, while also noting that the example values can vary by environment.

This runbook focuses on how to triage an ESXi PSOD in a VCF or vSphere environment: capture the screen, preserve the core dump, collect support bundles, correlate driver and firmware state, and decide whether the next move is rollback, targeted remediation, hardware isolation, or vendor escalation.

Scenario

You have an ESXi host that has stopped responding. vCenter shows the host as disconnected or not responding. Some virtual machines may have restarted through HA, some may still be unavailable, and the out-of-band console shows a purple diagnostic screen.

The visible error may include a recognizable string such as:

NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307

or a similar ASSERT, Exception, Spin count exceeded, Machine Check Exception, or device-driver-related failure.

The goal is not to diagnose the entire VMkernel from the console. The goal is to preserve enough evidence that the next decision is defensible.

Why This Matters Operationally

A PSOD is not a normal service restart. Broadcom describes purple screen errors as severe hardware or software errors that halt the server and prevent it from continuing. Common symptoms include the host being listed as not responding in vCenter, VMs becoming unresponsive, loss of ICMP or SSH access, and the host displaying a purple diagnostic screen on the console.

That distinction matters because the reboot is only one part of recovery. The incident also creates several operational questions:

| Question | Why it matters |
| --- | --- |
| Did the host successfully write a core dump? | Without it, support may only have logs and a screenshot. |
| Is this a one-off hardware event or a repeatable software signature? | The remediation path is different. |
| Did the crash follow a driver, firmware, ESXi, or vendor add-on change? | Recent lifecycle activity becomes evidence. |
| Is the issue isolated to one host model, one cluster, or one image baseline? | Scope determines whether to isolate, roll forward, roll back, or escalate. |
| Can the host safely return to production? | Returning a repeatedly crashing host can create secondary outages. |

A PSOD is a failure event, but it is also a diagnostic opportunity. Treat it that way.

Symptoms and Risk Signals

During the first pass, do not over-index on a single line of the purple screen. The line number in a signature can vary between ESXi builds, patches, and compiled code paths. Instead, capture a small set of repeatable evidence points.

Look for:

| Evidence | Example | Why it matters |
| --- | --- | --- |
| ESXi version and build | ESXi 7.x / 8.x build number | Determines known issue applicability and patch path. |
| Panic or exception string | NOT_IMPLEMENTED, #PF Exception, MCE, ASSERT | Helps route the issue class. |
| Stack trace pattern | World_DestroyHeap, CpuSched_StartWorld, driver modules | Helps compare recurrence across hosts. |
| Physical CPU / world | Same CPU or same world across crashes | Can suggest hardware affinity or userworld/module pattern. |
| VMK uptime | Host uptime before crash | Useful for recurrence timing. |
| Core dump status | Disk dump successful or failed | Determines whether deep analysis is possible. |
| Recent lifecycle changes | ESXi patch, firmware, driver, vendor add-on | Often the difference between rollback and escalation. |

Broadcom’s purple screen interpretation guidance explains that the stack trace represents what the VMkernel was doing at the time of the error and that the core dump section indicates VMkernel memory being copied to the configured dump location. It also recommends using repeated patterns across error message, stack trace, physical CPU, and world to distinguish likely software patterns from possible hardware patterns.
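One way to make that pattern comparison concrete is to keep one record per crash and tally which values repeat. The sketch below assumes a hypothetical, hand-maintained notes file with pipe-delimited fields (host, panic string, physical CPU, world); it is not an ESXi log format, and the function name `tally_field` is illustrative only.

```shell
# Tally how often each value of one field repeats across crash records.
# Input (stdin): one crash per line in a hypothetical, hand-maintained format:
#   host|panic_string|pcpu|world
# Usage: tally_field <field_number> < crashes.txt
tally_field() {
  awk -F'|' -v f="$1" '
    BEGIN { f += 0 }                      # force the field index to a number
    NF >= f { count[$f]++ }               # count each distinct value
    END { for (k in count) printf "%d %s\n", count[k], k }
  ' | sort -rn                            # most frequent value first
}
```

Running `tally_field 2` groups by panic string and `tally_field 3` groups by physical CPU. A value that repeats across different hosts leans toward a software pattern, while a value that repeats only on one host leans toward that host's hardware, but treat either as a hint rather than a conclusion.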

PSOD Triage Workflow

The runbook stages that follow are the operational path I would want a team to follow before making a rollback or patching decision. Notice that remediation does not happen first. Evidence capture does.

The key point is sequencing. A PSOD response should move from evidence preservation to controlled recovery to correlation to decision. Skipping straight to remediation creates a fragile root-cause story.

Prerequisites and Safety Checks

Before touching the host, confirm who owns each part of the response:

| Area | Owner | Safety check |
| --- | --- | --- |
| Workload recovery | vSphere / application operations | Confirm HA restart state, failed VMs, and application impact. |
| Host access | Platform operations | Confirm iLO/iDRAC/OOB console access and SSH policy. |
| Evidence handling | Platform operations / security | Confirm whether core dumps and support bundles can be shared externally. |
| Lifecycle data | VMware / hardware platform team | Confirm ESXi image, vendor add-on, driver, firmware, and BIOS state. |
| Escalation | VMware/Broadcom and hardware vendor support | Confirm SR ownership and upload path. |

Core dumps and support bundles deserve special handling. Broadcom notes that host support bundles can include host logs, VM descriptions, system state, and core dumps. Core dumps can include data from memory at the time of failure, and transmitting a support bundle grants VMware permission to examine the included data. Environments using vSphere Virtual Machine Encryption can also affect core dump handling and access.

That does not mean “do not collect evidence.” It means evidence collection should follow your security policy.

Runbook Stage 1: Capture the PSOD Before Reboot

When the purple screen is still visible, capture it.

Do not reset the host immediately. Broadcom explicitly warns not to reset an ESX/ESXi host while the purple screen is displayed and recommends taking a picture or screenshot that captures all visible technical data. The same guidance says to verify whether “Disk Dump Successful” appears and to allow more time if the dump has not completed; in some cases, dump completion may take up to an hour.

Capture:

| Item | How |
| --- | --- |
| Full console screenshot | OOB console screenshot or phone photo if necessary |
| Hostname and asset tag | vCenter inventory, hardware management console, CMDB |
| Time of failure | Include timezone and whether this is host, vCenter, or monitoring time |
| ESXi build | From console if visible, otherwise collect after reboot |
| Panic string | Exact first-line message and any file/line reference |
| Stack trace | Full visible backtrace, not just the first line |
| Dump status | Whether the screen reports dump progress, success, or failure |

A partial screenshot of only the first error line is not enough. The stack trace, CPU/world information, and dump status are part of the diagnostic record.
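The capture items above can also be written down as a small text record alongside the screenshot. This is a minimal sketch meant to run on a management workstation; the function name `psod_record` and every field name are hypothetical, so adapt them to your own ticket template.

```shell
# Print a plain-text PSOD capture record for the incident ticket.
# All field names are hypothetical placeholders, not a VMware format.
# Usage: psod_record <hostname> <failure_time_with_tz> <time_source> <dump_status> <panic_string>
psod_record() {
  # Stamp the record itself in UTC so it is unambiguous across teams
  printf 'recorded_at_utc: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'hostname: %s\n' "$1"
  printf 'failure_time: %s\n' "$2"     # include the timezone offset
  printf 'time_source: %s\n' "$3"      # host, vcenter, or monitoring
  printf 'dump_status: %s\n' "$4"      # progress, success, or failure
  printf 'panic_string: %s\n' "$5"     # exact first-line message
}
```

Keeping the time source as its own field avoids a common timeline problem later, when host, vCenter, and monitoring clocks disagree by a few minutes.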

Runbook Stage 2: Reboot Without Destroying the Investigation

After the dump completes, reboot the host through the cleanest available method. If the host is fully halted, the out-of-band power control may be the only practical option.

After boot:

Do not immediately return the host to normal workload placement.

Keep the host in maintenance mode or otherwise prevent automated workload return if recurrence risk is unknown.

Confirm whether vCenter reports an unread host kernel core dump.

Collect logs and support bundles before applying patches, removing drivers, or changing firmware.

The startup sequence can process configured core dump slots and create a core dump file after a PSOD, which can then be reviewed for corrective action and root-cause work.
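A minimal sketch of the hold step, assuming SSH access to the rebooted host and assuming `esxcli system maintenanceMode get` prints `Enabled` or `Disabled`. The function names are hypothetical. In a vCenter-managed cluster you may prefer to enter maintenance mode from vCenter so that HA and DRS see the state change.

```shell
# Return 0 if the host reports maintenance mode, non-zero otherwise.
# Assumes the command prints exactly "Enabled" or "Disabled".
in_maintenance_mode() {
  [ "$(esxcli system maintenanceMode get 2>/dev/null)" = "Enabled" ]
}

# Put the host on hold only if it is not already held.
hold_host_if_needed() {
  if in_maintenance_mode; then
    echo "host already in maintenance mode"
  else
    echo "entering maintenance mode"
    esxcli system maintenanceMode set --enable true
  fi
}
```

Call `hold_host_if_needed` once after boot and before any evidence collection, so that workloads do not land on the host while its stability is still unknown.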

Runbook Stage 3: Collect the Support Bundle

Broadcom’s vm-support guidance states that VMware Technical Support routinely requests diagnostic information for support requests and that the vm-support utility is present on all ESXi versions, though available options vary by release. The traditional command creates a compressed .tgz bundle locally on the host, and -w can write it to a specific VMFS datastore.

Use the datastore method when the host has enough accessible storage and your security policy allows it:

# Create a support bundle on a VMFS datastore
vm-support -w /vmfs/volumes/DATASTORE_NAME

For environments where saving locally is not preferred, Broadcom documents streaming vm-support over SSH to a client system:

# Stream vm-support to a local file from a management workstation
ssh root@ESXHostnameOrIPAddress vm-support -s > vm-support-ESXHostname.tgz

This method requires root authentication and is not usable with lockdown mode.

Collect the vCenter support bundle as well if the incident involved HA behavior, host disconnect events, lifecycle remediation, DRS activity, or cluster-level alarms.
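A streamed bundle can be truncated if the SSH session drops, so it is worth a quick sanity check before attaching it to a case. The sketch below runs on the management workstation and assumes the streamed file is gzip-compressed, as the .tgz name suggests; `check_bundle` is a hypothetical helper name.

```shell
# Sanity-check a streamed vm-support bundle before relying on it.
# Assumption: the bundle is a gzip-compressed tar archive (.tgz).
# Usage: check_bundle vm-support-ESXHostname.tgz
check_bundle() {
  bundle="$1"
  # An empty or missing file usually means the stream never started
  if [ ! -s "$bundle" ]; then
    echo "FAIL: $bundle is missing or empty"
    return 1
  fi
  # gzip -t reads the whole stream and fails on truncation or corruption
  if ! gzip -t "$bundle" 2>/dev/null; then
    echo "FAIL: $bundle is not a complete gzip stream"
    return 1
  fi
  echo "OK: $bundle ($(wc -c < "$bundle" | tr -d ' ') bytes)"
}
```

A passing check only shows that the compressed stream is intact. It does not confirm that the bundle contains the logs or dump that support will ask for.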

Runbook Stage 4: Preserve and Verify the Core Dump

Do not assume the dump exists just because the host crashed.

Check the configured dump targets:

# Check VMFS coredump files
esxcli system coredump file list

# Check coredump partition configuration
esxcli system coredump partition list

# Check network coredump configuration
esxcli system coredump network get

The ESXCLI command reference includes commands to create, list, set, and remove VMkernel dump files; it also includes commands to check file, partition, and network dump configuration.
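If you check many hosts, the network target check can be reduced to a pass or fail. The sketch below assumes that `esxcli system coredump network get` prints key and value lines that include `Enabled: true` or `Enabled: false`; confirm that shape on your own build before depending on it. The function name `netdump_enabled` is hypothetical.

```shell
# Read "esxcli system coredump network get" output on stdin and
# return 0 only if it reports the network dump target as enabled.
# Assumed output shape (verify on your build):
#    Enabled: true
#    Host VNic: vmk0
netdump_enabled() {
  awk '
    /^[[:space:]]*Enabled:/ { enabled = ($NF == "true") }
    END { exit (enabled ? 0 : 1) }
  '
}

# Example use on a host:
#   esxcli system coredump network get | netdump_enabled \
#     && echo "network dump target enabled" \
#     || echo "network dump target NOT enabled"
```

The same approach does not transfer directly to the file and partition commands, which print tables rather than key and value lines, so check those by eye or parse them separately.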

If the host uses a diagnostic partition, Broadcom documents extracting a VMkernel core dump by identifying the diagnostic partition with esxcli system coredump partition list or esxcfg-dumppart -t, changing to a datastore with enough space, and using esxcfg-dumppart --copy to produce a zdump file.

# Identify diagnostic partition
esxcli system coredump partition list

# Example extraction pattern after identifying the device path
cd /vmfs/volumes/DatastoreName/

esxcfg-dumppart --copy \
  --devname "/vmfs/devices/disks/identifier" \
  --zdumpname /vmfs/volumes/DatastoreName/hostname-date-zdump

If no coredump target exists, fix that as a preventive control after the incident. Broadcom’s coredump-to-file guidance notes the warning “No coredump target has been configured. Host core dumps cannot be saved,” and documents creating a VMFS dump file with esxcli system coredump file add, then enabling it with esxcli system coredump file set --smart --enable true. It also notes that Software iSCSI and Software FCoE are not supported for coredump locations.

# Create a VMFS coredump file
esxcli system coredump file add -d <datastore_UUID> -f <hostname>.dumpfile

# Enable smart selection for the dump file
esxcli system coredump file set --smart --enable true

# Verify Active and Configured are true
esxcli system coredump file list

For larger environments, configure network dump collection as a standard build item. Broadcom states that ESXi network coredump functionality helps capture diagnostic data through the network during a purple diagnostic screen, and documents configuring it with a VMkernel interface, destination server IP, and UDP port, then validating with esxcli system coredump network get and vmkping.

# Configure network coredump collector
esxcli system coredump network set \
  --interface-name vmk0 \
  --server-ipv4 <collector-or-vcenter-ip> \
  --server-port 6500

# Enable network coredump
esxcli system coredump network set --enable true

# Verify configuration
esxcli system coredump network get

# Confirm VMkernel network path
vmkping -I vmk0 <collector-or-vcenter-ip>

Runbook Stage 5: Build the Evidence Matrix

Once the host is booted and evidence is preserved, build a simple matrix. This gives support, hardware vendors, and internal change approvers the same view of the event.

| Evidence | Command or source | Notes |
| --- | --- | --- |
| ESXi version and build | vmware -vl | Match against KBs and release notes. |
| Installed VIBs/components | esxcli software vib list | Look for hardware vendor drivers and async drivers. |
| Loaded modules | esxcli system module list | Useful when a stack trace references a module or device path. |
| Coredump config | esxcli system coredump file list / partition list / network get | Confirms whether future crashes will be captured. |
| VMkernel logs | /var/log/vmkernel.log | Search for panic, heap, driver, storage, network, MCE, or NMI messages. |
| Hardware model | esxcli hardware platform get | Required for vendor advisories and compatibility checks. |
| Firmware / BIOS / iLO | Vendor tooling, OneView, iLO, iDRAC, OME, vLCM/HSM | Needed for hardware correlation. |
| Recent changes | vLCM, SDDC Manager, change record | Determines rollback versus roll-forward options. |

Useful first-pass commands:

# Version and build
vmware -vl

# Hardware platform
esxcli hardware platform get

# Coredump targets
esxcli system coredump file list
esxcli system coredump partition list
esxcli system coredump network get

Treat this as a triage set, not a final RCA. The goal is to avoid empty escalation: “Host crashed, please advise.”
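The first-pass commands can be wrapped so that each output lands in its own file and a failed command is recorded instead of stopping the run. This is a sketch under assumptions: the function names and output file names are hypothetical, and the target directory should be on a datastore or other location with enough free space.

```shell
# Run one command and save stdout/stderr to a file. Failures are
# reported but do not abort, so one missing tool does not stop collection.
save_output() {
  outfile="$1"; shift
  if "$@" > "$outfile" 2>&1; then
    echo "ok   $*"
  else
    echo "FAIL $*"
  fi
}

# Collect the triage set into one directory (file names are illustrative).
# Usage: collect_evidence /vmfs/volumes/DATASTORE_NAME/psod-evidence
collect_evidence() {
  dir="$1"
  mkdir -p "$dir" || return 1
  save_output "$dir/version.txt"            vmware -vl
  save_output "$dir/platform.txt"           esxcli hardware platform get
  save_output "$dir/vibs.txt"               esxcli software vib list
  save_output "$dir/modules.txt"            esxcli system module list
  save_output "$dir/coredump-file.txt"      esxcli system coredump file list
  save_output "$dir/coredump-partition.txt" esxcli system coredump partition list
  save_output "$dir/coredump-network.txt"   esxcli system coredump network get
}
```

Run it before any patching or driver change so the files reflect the failure state, then attach the directory to the incident record next to the support bundle.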

Runbook Stage 6: Compare the Signature Without Anchoring on It

This is where KB 316522 becomes useful.

Broadcom’s KB identifies a specific issue where ESXi hosts on HPE Gen10, Gen10 Plus, or Gen11 hardware can experience a PSOD. The KB lists a vmkernel.log alert similar to Unable to complete wait for non-empty heap, and an example backtrace containing NOT_IMPLEMENTED and World_DestroyHeap.

The KB’s stated cause is specific: when a kernel module exposing a character device does not behave as expected, a vmkpollcontext object can leak after a userspace poll() syscall; later, when the userspace process terminates, the VMkernel can PSOD with a NOT_IMPLEMENTED assert. The KB also says the HPE ilo kernel module used by HPE SMAD is known to cause this issue.

For remediation, Broadcom states:

| Environment | KB 316522 remediation guidance |
| --- | --- |
| ESXi 7.0 or later | Update the HPE iLO Native Driver component to v10.8.2 or later. |
| ESXi 8.0 or later | Update the HPE iLO Native Driver component to v10.8.2 or later and update ESXi to 8.0 Update 2b or later. |

The operational caution is this: do not assume every NOT_IMPLEMENTED purple screen is KB 316522. Match the platform, ESXi version, vendor module state, log alert, stack trace shape, and recent lifecycle history. A signature is evidence. It is not the entire case.
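One of those match points, the log alert, is easy to check mechanically. The sketch below counts lines containing the alert text quoted in the KB. It assumes you pass it a readable, uncompressed log such as /var/log/vmkernel.log or a copy extracted from the support bundle; `heap_alert_count` is a hypothetical helper name.

```shell
# Count lines in a log that contain the heap alert text quoted in KB 316522.
# Prints the count; a count of 0 means this one match point is absent.
# Usage: heap_alert_count /var/log/vmkernel.log
heap_alert_count() {
  grep -c 'Unable to complete wait for non-empty heap' "$1"
}
```

A non-zero count supports the KB match but does not prove it, and a zero count does not rule it out if the relevant log has already rotated. Use the result as one row in the evidence matrix, next to platform, driver version, and stack trace shape.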

Runbook Stage 7: Correlate Driver, Firmware, Build, and Vendor Image

A PSOD investigation usually becomes a lifecycle investigation.

For HPE environments, confirm whether the host is running a supported HPE custom ESXi image, a vendor add-on, or a manually assembled image. HPE’s VMware ESXi support page states that HPE servers require the HPE custom ESXi image or an ESXi image built with ImageBuilder that includes appropriate drivers for the boot controller and at least one network device. It also notes that drivers for newer network and storage controllers are integrated in the HPE custom ESXi image and are not part of VMware’s base ESXi image.

For clusters managed by vSphere Lifecycle Manager, use the image, vendor add-on, firmware and drivers add-on, and hardware support manager data as part of the evidence trail. VMware’s Cloud Foundation blog notes that firmware, driver, and BIOS/EFI versions can be inspected and monitored for compliance with the Broadcom Compatibility Guide and vSAN Compatibility Guide, and that vSphere Lifecycle Manager interfaces with a registered Hardware Support Manager to orchestrate firmware lifecycle operations.

Capture:

| Layer | Evidence to collect |
| --- | --- |
| ESXi base image | Version, build, patch level |
| Vendor add-on | HPE, Dell, Lenovo, Cisco, or other vendor package version |
| Device drivers | NIC, storage, NVMe, FC, iLO/iDRAC/platform agents |
| Firmware | BIOS/UEFI, BMC/iLO/iDRAC, NIC, HBA, RAID, disk firmware |
| Management agents | AMS, SMAD, CIM providers, vendor tools |
| Cluster lifecycle state | Desired image, compliance drift, recent remediation tasks |

The strongest escalation packet includes both the crash evidence and the lifecycle state. The support engineer should not have to ask which driver was installed, which firmware was active, or whether the host was recently remediated.

Runbook Stage 8: Decide Rollback, Roll Forward, or Escalate

The wrong move is to pick one answer for every PSOD. Use the evidence pattern.

| Condition | Preferred action | Why |
| --- | --- | --- |
| Known KB match, supported fix exists, and issue matches platform/build/driver pattern | Roll forward to the documented driver/ESXi fix during a controlled maintenance window | You have a supported remediation path. |
| Crash started immediately after a driver, firmware, or ESXi update and repeats on the same image | Consider rollback to the last known-good validated image while preserving evidence and opening a support case | The change is temporally tied to the incident. |
| Same host repeatedly crashes with different stack traces or same physical CPU indicators | Isolate host and engage hardware vendor diagnostics | Pattern may indicate hardware or platform fault. |
| Multiple hosts on the same model/image show the same signature | Treat as cluster image or vendor component issue; stop broad remediation until scoped | Prevents spreading a bad image or unsupported combination. |
| No core dump, no full screenshot, and no repeatable pattern | Fix evidence capture first, then monitor or escalate with limited confidence | RCA will be weak without dump and logs. |
| Production cluster is capacity constrained after host loss | Keep stability first; defer nonessential remediation until workload capacity is safe | Avoids creating a second outage during investigation. |

A rollback should not be emotional. It should be tied to a recent known change, a repeatable failure pattern, and an approved fallback image. A roll-forward should be tied to a vendor-documented fix, compatibility validation, and staged host remediation. Escalation should include enough artifacts for support to analyze the issue instead of recreating your evidence collection process.

Targeted Remediation Example: KB 316522 Pattern

When the evidence matches KB 316522, the remediation path should still be staged.

Recommended sequence:

1. Confirm affected hardware model: HPE Gen10, Gen10 Plus, or Gen11.
2. Confirm ESXi major version and build.
3. Confirm installed HPE iLO Native Driver component version.
4. Confirm whether the vmkernel.log heap alert and stack trace pattern match the KB.
5. Confirm whether HPE SMAD / AMS / iLO-related components are present.
6. Confirm the target driver and ESXi build are supported for the server model.
7. Remediate one host first in a maintenance window.
8. Validate stability before expanding to the cluster.
9. Document the final image state in vLCM / SDDC Manager / change records.

For ESXi 8.x hosts matching this KB, Broadcom’s resolution calls for both the HPE iLO Native Driver component v10.8.2 or later and ESXi 8.0 Update 2b or later.

That “and” matters. Updating only one layer may leave the environment in a partially remediated state.
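The "v10.8.2 or later" check is a dotted-version comparison, which is easy to get wrong by eye (10.10.0 is later than 10.8.2). The sketch below is a hypothetical helper, `version_ge`, that compares two plain dotted versions. It assumes you have already extracted the component version yourself, because installed VIB version strings can carry extra prefixes and build suffixes that this function does not parse.

```shell
# Return 0 if dotted version $1 is greater than or equal to $2.
# Compares numeric fields left to right; missing fields count as 0.
# Usage: version_ge <installed_version> <required_version>
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0     # installed is newer at this field
      if (xi < yi) exit 1     # installed is older at this field
    }
    exit 0                    # all fields equal
  }'
}

# Example: version_ge 10.8.1 10.8.2 || echo "iLO driver below the KB minimum"
```

For ESXi 8.x hosts this covers only the driver half of the resolution; the ESXi build still has to be checked against 8.0 Update 2b or later separately.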

Validation Steps After Recovery

After the host is back online and before it returns to full production placement, validate the following:

| Validation | Pass condition |
| --- | --- |
| Host boots cleanly | No immediate PSOD or management agent failure. |
| vCenter connectivity restored | Host reconnects without repeated disconnects. |
| Core dump target configured | File, partition, or network dump target is active and configured. |
| Support bundle collected | Bundle is stored securely and associated with the incident/SR. |
| Driver and firmware state captured | Evidence matrix includes current and previous versions. |
| Cluster health stable | HA, DRS, vSAN, NSX, and workload alarms reviewed as applicable. |
| Lifecycle compliance known | Host is compliant with intended image or intentionally held back. |
| Recurrence monitoring active | Logs and monitoring are watching for repeated stack or heap alerts. |

For VCF environments, also confirm whether SDDC Manager, vCenter, NSX, vSAN, and lifecycle tasks recorded relevant events around the incident window. A host PSOD may be local, but the recovery story is cluster-wide.

Rollback and Fallback Guidance

Rollback is appropriate when the evidence points to a recent change and a known-good target exists. It is not appropriate when the team is guessing.

Before rollback, confirm:

The previous ESXi image, vendor add-on, driver, and firmware combination is documented.

The previous state is still supported by the hardware vendor and VMware/Broadcom.

The rollback process has been tested or is operationally understood.

Workloads can tolerate the maintenance sequence.

Evidence from the failure state has already been collected.

Fallback options include:

| Fallback | Use when |
| --- | --- |
| Keep host in maintenance mode | Recurrence risk is unknown or evidence points to hardware. |
| Evacuate and isolate host | Cluster has enough capacity and host stability is suspect. |
| Revert to previous image | Recent lifecycle change is strongly correlated and rollback is supported. |
| Apply vendor-documented fix | KB match is strong and remediation is validated. |
| Open Broadcom and hardware vendor cases | Core dump analysis or hardware diagnosis is required. |

Do not remove vendor agents, disable platform modules, or downgrade drivers as an unsupported workaround unless directed by the vendor or support. Those changes may reduce observability, create supportability issues, or make later analysis harder.

What to Hand to Support

A good escalation packet should include:

| Artifact | Notes |
| --- | --- |
| Full PSOD screenshot | Include the entire visible stack, not just the first line. |
| vm-support bundle | Collected before remediation where possible. |
| Core dump / zdump | Preserve securely; follow data handling policy. |
| ESXi version/build | vmware -vl output. |
| Installed VIB/component list | Include vendor drivers and add-ons. |
| Hardware model and serial | Include host generation and platform details. |
| Firmware versions | BIOS/UEFI, BMC/iLO/iDRAC, NIC, HBA, RAID, disks. |
| vLCM / SDDC Manager image state | Desired image, compliance state, recent remediation tasks. |
| Incident timeline | Failure time, last lifecycle change, reboot time, validation steps. |
| Scope statement | One host, one cluster, one hardware model, or fleet-wide. |

This is the difference between “we had a PSOD” and “we have a reproducible evidence package.”

Conclusion

A PSOD is not just a crash screen. It is a time-sensitive evidence source.

The right operational posture is to slow down just enough to capture the facts: screenshot, dump status, support bundle, core dump, ESXi build, driver versions, firmware state, and recent lifecycle changes. Once that evidence is preserved, the team can make a disciplined decision: apply a known fix, roll back a suspect change, isolate a hardware candidate, or escalate with a useful support packet.

KB 316522 is a good reminder of why this matters. The visible signature is useful, but the real answer lives in the correlation between the stack, the platform, the driver, the ESXi build, and the lifecycle history. Treat the purple screen as the start of the investigation, not the end of it.

External Sources

Broadcom KB 316522: ESXi host may crash with PSOD with the message NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307

Broadcom KB 337182: ESX/ESXi host stops responding and displays a purple diagnostic screen

Broadcom KB 343033: Interpreting an ESX/ESXi host purple diagnostic screen

Broadcom KB 313542: Collecting diagnostic information for VMware ESX/ESXi using the vm-support command

Broadcom ESXCLI Command Reference: esxcli system namespace

Broadcom KB 343591: Extracting a core dump file from the diagnostic partition

Broadcom KB 314320: Configuring ESXi coredump to file instead of partition

Broadcom KB 344063: Configuring the network dump collector service in ESXi

Broadcom KB 327899: Data collected when gathering diagnostic information for VMware ESX/ESXi

HPE VMware ESXi support and certification matrix

VMware Cloud Foundation Blog: Firmware Lifecycle Made Simple with vSphere Lifecycle Manager

The post ESXi PSOD Triage: Turning a Purple Screen into an Evidence-Driven Escalation appeared first on Digital Thought Disruption.