PDL vs APD: The Storage Failure Model Every vSphere Operator Needs

Storage failures in vSphere are rarely just “storage is down.”

That phrase may be accurate from the application owner’s point of view, but it is not precise enough for the operator who has to decide what happens next. A host that has lost all paths to a datastore behaves differently from a host that has been told the device is permanently gone. A VM that is stalled on I/O behaves differently from a VM that vSphere HA can safely restart somewhere else. A storage fabric issue behaves differently from an administrative LUN removal.

That difference is the operational line between APD and PDL.

Broadcom KB 318712 frames the issue directly: Permanent Device Loss and All-Paths-Down are separate ESXi storage accessibility conditions, and each one changes how the host, management agents, vSphere HA, and affected virtual machines behave. (knowledge.broadcom.com)

This article builds the mental model first, then turns it into a practical runbook for vSphere and VCF operators.

TL;DR

PDL means ESXi has evidence the device is permanently unavailable. The storage array returns supported SCSI sense information, such as logical unit not supported, and ESXi stops trying to re-establish connectivity or issue commands to that device. (knowledge.broadcom.com)

APD means all paths are down, but ESXi does not know whether the loss is temporary or permanent. Because the host cannot classify the failure as permanent, it continues retrying I/O, including management-agent and VM guest I/O, which can make the host appear disconnected, unresponsive, or difficult to manage. (knowledge.broadcom.com)

VMCP matters, but it is not magic. vSphere HA VM Component Protection can react to APD and PDL storage failures, but the response depends on cluster configuration, APD timeout behavior, restart policy, whether another host can access the datastore, and whether the affected VM is protected by HA. (developer.broadcom.com)

For VCF designs, treat this primarily as a non-vSAN datastore protection model. Broadcom’s VCF design decision for PDL says VMCP can increase availability for VMs on non-vSAN datastores and explicitly notes that only VMs on non-vSAN datastores can be protected by VMCP in that design context. (knowledge.broadcom.com)

The Scenario: Same Symptom, Different Failure Model

A common incident starts like this:

A few VMs stop responding. A datastore shows inaccessible. A host flips to Disconnected or Not Responding in vCenter. vMotion tasks hang. The storage team reports that the LUN “dropped” or that “all paths are down.”

At this point, the wrong question is:

“Is storage down?”

The better question is:

“Did ESXi receive a permanent loss signal, or is it stuck in an unknown all-paths-down condition?”

That distinction determines whether you are dealing with a cleanup and failover problem or a host management and fabric recovery problem.

PDL and APD at a Glance

The storage failure model is easier to understand if you separate three things:

What the host can see

What the storage array communicated

What vSphere HA is allowed to do

Those three things define the decision path. Notice that APD is not simply “worse PDL.” It is a different classification problem. PDL is explicit. APD is ambiguous.

The core idea: PDL is a declared loss. APD is an unresolved loss.

That is why APD incidents can feel messier. ESXi may keep trying to communicate with a device it believes might return, and management operations can be affected while those retries are happening. Broadcom documents that APD can impact management agents because their commands may not receive responses until the device becomes accessible again. (knowledge.broadcom.com)

Scope and Terminology Guardrails

This article focuses on ESXi host behavior for datastore/device accessibility failures in vSphere and VCF environments.

The vSAN caveat matters. In a VCF environment, the design guidance for VMCP PDL response is specifically useful for non-vSAN datastores; Broadcom’s VCF design decision states that only VMs on non-vSAN datastores can be protected by VMCP in that context. (knowledge.broadcom.com)

For VxRail or vSAN incidents, APD/PDL strings may still appear around devices, disks, controllers, or paths, but the operational model should shift toward vSAN health, object availability, disk group state, resync, and hardware lifecycle workflows.

Assumptions for This Runbook

This runbook assumes:

vSphere HA is enabled or intended to be enabled on the cluster.

The affected VM runs on a datastore that may be accessible from more than one host.

You have at least one reliable path to investigate the host, such as vCenter, SSH, DCUI, iDRAC, iLO, or another out-of-band path.

Storage, network, and virtualization teams can coordinate during the incident.

You are not intentionally removing a datastore without completing the proper unmount and detach workflow.

One more assumption is important: do not assume HA will restart a VM just because HA is enabled. The vSphere API documents separate VMCP response settings for APD and PDL, with options that include disabled, warning/event behavior, conservative restart, and aggressive restart. (developer.broadcom.com)

The Operational Difference Between PDL and APD

Broadcom’s APD documentation states that a device enters APD when all paths are down and ESXi has no indication whether the loss is permanent or temporary. In that condition, ESXi continues attempts to establish connectivity and may continue sending I/O until it receives a response. (knowledge.broadcom.com)

For PDL, Broadcom documents that specific SCSI sense information, such as logical unit not supported, indicates permanent inaccessibility, after which ESXi no longer attempts to reconnect or issue commands to the device.

How PDL Usually Surfaces

PDL often appears after an array-side event that tells ESXi the device is gone or no longer valid.

Broadcom’s KB describes planned PDL as an intentional removal where the datastore should first be unmounted and the device detached before the array-side unpresentation occurs. Unplanned PDL happens when the storage device is unexpectedly unpresented without that host-side workflow. (knowledge.broadcom.com)

Symptoms can include inaccessible datastores, datastores showing 0 B, all paths marked dead, VMs showing inaccessible, vMotion tasks hanging, and vmkernel messages containing PERM LOSS or “device has been removed or is permanently inaccessible.” (knowledge.broadcom.com)

Example log patterns to look for:

PERM LOSS
Device is permanently unavailable
Logical Unit Not Supported
has been removed or is permanently inaccessible
Valid sense data: 0x5 0x25 0x0

The important operational point is that PDL is not a “wait and see” condition in the same way APD is. ESXi has classified the device as permanently lost, and the next action is either to correct the storage presentation mistake or clean up the device/datastore state safely.
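As a minimal sketch, the sense triple in the last pattern above can be pulled out of a log line and labeled. The function name is hypothetical and the sample line layout is an assumption. In standard SCSI terms, 0x5 is the ILLEGAL REQUEST sense key and 0x25/0x0 is LOGICAL UNIT NOT SUPPORTED. Other combinations are only reported, not interpreted.

```shell
# Hypothetical helper: read log lines on stdin, extract
# "Valid sense data: <key> <asc> <ascq>", and flag the one combination
# this article lists as a PDL indicator.
pdl_sense_check() {
  sed -n 's/.*Valid sense data: \(0x[0-9a-fA-F]*\) \(0x[0-9a-fA-F]*\) \(0x[0-9a-fA-F]*\).*/\1 \2 \3/p' |
  while read -r key asc ascq; do
    if [ "$key" = "0x5" ] && [ "$asc" = "0x25" ] && [ "$ascq" = "0x0" ]; then
      echo "$key $asc $ascq PDL-candidate (logical unit not supported)"
    else
      # Not the combination listed above; check array vendor guidance.
      echo "$key $asc $ascq other"
    fi
  done
}

# On a host (path taken from this article):
#   pdl_sense_check < /var/run/log/vmkernel.log
```

A match is a reason to move to the PDL branch and confirm with the storage team. It is not a substitute for that confirmation.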

How APD Usually Surfaces

APD is more ambiguous and often more operationally painful.

Broadcom documents that APD can make an ESXi host appear disconnected or not responding in vCenter, and affected VMs can become unresponsive. (knowledge.broadcom.com)

Another Broadcom KB on iSCSI APD shows how a duplicate IP address on the storage network can produce isolated host disconnects, lost connectivity alarms, NMP NO_CONNECT messages, and APD behavior. (knowledge.broadcom.com)

Example log patterns to look for:

All Paths Down
esx.problem.storage.apd.start
esx.problem.storage.apd.timeout
Not found (APD)
No connection
NMP Device is blocked
awaiting fast path state update

The APD timeout is also important. Broadcom documents that the APD timer defaults to 140 seconds; during that period the host continues attempting to re-establish connectivity. If connectivity does not return, the device enters the APD timeout state. (knowledge.broadcom.com)

That does not automatically mean the VMs restart. If vSphere HA is configured to respond to APD, Broadcom documents a common failover sequence of 140 seconds plus an additional default 3-minute delay before HA executes the configured response, assuming the APD condition persists. (knowledge.broadcom.com)

Why APD Can Make the Host Feel Broken

The most dangerous part of APD is not only that VMs lose access to storage. It is that the host management plane can become collateral damage.

During APD, ESXi can retry I/O from both userworld processes and virtual machines. Broadcom specifically calls out hostd management-agent I/O and VM guest I/O as part of the retry behavior. (knowledge.broadcom.com)

Broadcom’s APD management-agent KB documents cases where restarting management agents fails, esxcli commands fail to connect to localhost, hostd fails to start, and the host appears disconnected or not responding in the vSphere Client.

This is the mental shortcut:

PDL breaks access to a device. APD can break your ability to manage the host while it is trying to decide what happened.

Runbook Stage 0: Stabilize the Incident Before You Touch Storage

Start with containment. Do not jump directly into rescans or restarts.

First actions:

Open the incident bridge and name the failure domain.

Identify affected hosts, datastores, clusters, and VMs.

Confirm whether vCenter itself is on an affected datastore.

Freeze storage-side changes until the current presentation state is known.

Stop repeated rescan attempts unless a specific recovery step requires one.

Capture logs before rebooting hosts if the environment allows it.

Validate whether this is host-specific, fabric-specific, array-specific, or cluster-wide.

Be especially careful with rescans. Broadcom documents a case where rescanning while any LUN is in APD can cause VMs on other LUNs to stop responding temporarily or permanently.

Runbook Stage 1: Classify the Condition

Your first technical objective is to answer three questions:

Are all paths down?

Did ESXi receive a PDL signal?

Can any other host still access the datastore?

From the ESXi shell, start with logs. Use localcli when hostd/esxcli is impacted, but treat it as a break-glass operational tool rather than a normal-change workflow.

# Check APD and PDL indicators in host event logs
grep -Ei "apd|all paths down|pdl|permanent|perm loss" /var/run/log/vobd.log

# Check storage stack behavior
grep -Ei "APD|PERM LOSS|permanently inaccessible|No connection|Logical Unit" /var/run/log/vmkernel.log

For APD event tracking, Broadcom recommends looking in /var/run/log/vobd.log for strings such as esx.problem.storage.apd.start and esx.clear.storage.apd.exit.
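A rough sketch of how those two event strings can be paired to see which devices entered APD and have not yet exited. The function name is hypothetical, and the code assumes each event line carries the device as `identifier [<id>]`. Verify that against your own vobd.log before relying on it.

```shell
# Hypothetical helper: read vobd.log text on stdin and print identifiers
# that have an apd.start event with no later apd.exit event.
# Assumption: event lines contain "identifier [<id>]".
apd_open_devices() {
  awk '
    match($0, /identifier \[[^]]*\]/) {
      # "identifier [" is 12 characters; drop it and the closing bracket.
      id = substr($0, RSTART + 12, RLENGTH - 13)
      if ($0 ~ /storage\.apd\.start/) open[id] = 1
      if ($0 ~ /storage\.apd\.exit/)  delete open[id]
    }
    END { for (id in open) print id }'
}

# On a host:
#   apd_open_devices < /var/run/log/vobd.log
```

An empty result only means no unmatched start event was found in that log file. Rotated logs may hold the earlier half of a pair.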

Then check path state. If you need to confirm legacy multipath state for a specific device, Broadcom’s APD rescan KB shows the older esxcfg-mpath --list-paths --device <device> approach and notes that APD is likely when no paths are active.
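One way to turn path output into a per-device answer, as a sketch. It assumes path-list output where each path record has a `Device: <id>` line followed by a `State: <state>` line, as `storage core path list` prints on the hosts this was written against. The function name is hypothetical.

```shell
# Hypothetical helper: summarize path states per device from
# "storage core path list"-style text on stdin.
# Assumption: each record has "Device: <id>" and later "State: <state>".
summarize_paths() {
  awk '
    $1 == "Device:" { dev = $2 }
    $1 == "State:"  { total[dev]++; if ($2 == "active") active[dev]++ }
    END {
      for (d in total) {
        verdict = (active[d] > 0) ? "has-active-path" : "no-active-path"
        printf "%s total=%d active=%d %s\n", d, total[d], active[d] + 0, verdict
      }
    }'
}

# On a host:
#   localcli storage core path list | summarize_paths
```

A device reported as no-active-path is a candidate for APD or PDL. The logs, not the path count, decide which.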

Decision point:

If the logs show PERM LOSS, supported SCSI sense information, or “permanently inaccessible,” move down the PDL branch.

If all paths are dead and the logs show APD start/timeout/no connection behavior without PDL sense codes, move down the APD branch.

If some paths are still active, this may be path redundancy degradation rather than APD. Treat it as an urgent storage path resiliency issue, but do not classify it as APD unless all paths to the device are unavailable.
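The first two branches of that decision point can be sketched as a small classifier. It uses only the log patterns listed in this article, checks PDL first because an explicit signal outranks an ambiguous one, and does not look at path state, so the third branch still needs a separate path check. The function name is hypothetical.

```shell
# Hypothetical classifier: read vmkernel/vobd log text on stdin and print
# PDL, APD, or UNKNOWN based on the patterns named in this article.
classify_storage_event() {
  log=$(cat)
  if printf '%s\n' "$log" |
       grep -Eiq 'PERM LOSS|permanently inaccessible|Valid sense data: 0x5 0x25 0x0'; then
    echo PDL
  elif printf '%s\n' "$log" |
       grep -Eiq 'storage\.apd\.(start|timeout)|All Paths Down|Not found \(APD\)'; then
    echo APD
  else
    echo UNKNOWN
  fi
}

# On a host, scoped to one device ID (placeholder shown):
#   grep -h '<device>' /var/run/log/vobd.log /var/run/log/vmkernel.log | classify_storage_event
```

Treat the result as a triage hint. An UNKNOWN result means none of the listed patterns matched, not that the device is healthy.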

Runbook Stage 2A: PDL Branch

When you classify the event as PDL, stop thinking in terms of transient retry. Think in terms of why ESXi believes the device is permanently gone.

Confirm the storage-side intent

Ask the storage team:

Was the LUN unmapped?

Was the host or cluster removed from an initiator group?

Was a masking view changed?

Was a snapshot, clone, or replicated copy presented unexpectedly?

Did the LUN run out of space or become inaccessible on the array?

Was this part of a planned decommission?

Broadcom documents that an ESXi host losing datastore access with PERM LOSS messages can occur when the array indicates the device is no longer available or the LUN has been unmapped or deleted. The recommended checks include LUN presentation, cabling, fabric switches, and storage vendor investigation.

If the PDL was accidental

The safest pattern is:

Correct the array-side presentation.

Verify the host sees the device again.

Confirm no stale snapshot LUN or unexpected device identity issue exists.

Rescan only after the storage-side condition is corrected.

Validate datastore accessibility and VM state.

Do not blindly mount a datastore that may be a stale copy, snapshot, or incorrectly presented LUN.

If the PDL was part of a planned removal

Follow the proper storage-removal sequence:

Broadcom’s KB states that for planned PDL, the datastore must be unmounted and the device detached before array-side unpresentation.

If the PDL was unplanned and the datastore is gone

Broadcom’s cleanup guidance for unplanned PDL includes powering off and unregistering running VMs from the datastore, unmounting the datastore, rescanning all ESXi hosts that had visibility to the LUN, and checking for active references such as VMs, templates, ISO images, floppy images, and RDMs if the device remains listed.

Do not let the storage cleanup outrun the VM recovery plan.

Runbook Stage 2B: APD Branch

When you classify the event as APD, the priority is to restore connectivity at the layer that failed. ESXi is still treating the device as potentially recoverable, but the host may be degraded while waiting.

Identify the failure domain

Ask:

Is APD isolated to one host?

Is it isolated to one HBA, vmhba, vmkernel port, uplink, or storage VLAN?

Is it affecting every host in the cluster?

Is it affecting only one datastore or many?

Is the same datastore accessible from another host?

Did a storage fabric, switch, array controller, or network change happen recently?

For iSCSI and NFS, include normal network diagnostics:

# Example: verify VMkernel path to a storage target
vmkping -I vmkX <storage_target_ip>

# Example: inspect vmkernel logs for no-connect style errors
grep -Ei "NO_CONNECT|No connection|iscsi|nfs|APD" /var/run/log/vmkernel.log
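When there are several targets, the vmkping check can be looped. This is a sketch: the function name and the PING_CMD override are hypothetical, and the interface and addresses in the example are placeholders.

```shell
# Hypothetical helper: test each storage target from one VMkernel interface.
# PING_CMD defaults to vmkping; override it to dry-run off-host.
check_targets() {
  vmk=$1; shift
  failed=0
  for target in "$@"; do
    if ${PING_CMD:-vmkping} -I "$vmk" -c 3 "$target" >/dev/null 2>&1; then
      echo "$target reachable via $vmk"
    else
      echo "$target UNREACHABLE via $vmk"
      failed=1
    fi
  done
  # Non-zero exit if any target failed, so callers can branch on it.
  return "$failed"
}

# Example (placeholders):
#   check_targets vmk1 192.0.2.10 192.0.2.11
```

A reachable target does not prove the storage session is healthy. It only narrows the failure domain away from basic IP connectivity on that interface.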

For Fibre Channel, validate fabric login, zoning, target visibility, HBA state, and array masking from the storage side before repeatedly rescanning the host.

Avoid management-agent restart loops

If the host is in APD and hostd is stuck, restarting services may fail. Broadcom documents APD cases where restarting management agents fails with “Not all VMFS volumes were updated,” hostd fails to start, esxcli cannot connect to localhost, and the host appears disconnected or not responding.

That means the immediate fix is usually not “restart agents again.” The fix is to restore the missing storage path or remove the condition that keeps the host blocked.

Check open worlds if cleanup is blocked

If dead paths cannot be removed because the host still has open handles, Broadcom documents using:

localcli storage core device world list

to determine worlds accessing VMFS volumes in APD state.

For emergency recovery, Broadcom documents listing VM worlds and force-killing VM processes before removing dead paths, but this is disruptive and should be treated as a controlled incident action, not a routine troubleshooting step.

# List VM worlds
localcli vm process list

# Emergency only: force kill a VM world ID
localcli vm process kill --type=force --world-id=<WorldID>

Use that only when the outage decision has been made, affected workloads are identified, and the team understands the impact.

Plan for host reboot if residual APD state remains

Broadcom notes that APD has no clean recovery path in some cases, that the issue must be resolved at the array or fabric layer, and that affected ESXi hosts may require a reboot to remove residual references to devices in APD state.

That reboot can affect VMs that were not on the failed datastore. Broadcom explicitly warns that vMotion of unaffected VMs may not be possible while the management agents are impacted by APD, so rebooting an affected host can force an outage on otherwise unaffected VMs on that host.

Runbook Stage 3: Decide What vSphere HA and VMCP Should Do

This is where many incidents go sideways.

Operators expect HA to “just restart the VM.” HA can only do that safely when the VM is protected, the configured VMCP response allows it, and a suitable host can access the datastore.
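Those three conditions can be written down as an explicit check. This sketch queries nothing: the function name is hypothetical and the arguments are yes/no answers the operator supplies after looking at the cluster.

```shell
# Hypothetical decision helper mirroring the three conditions above.
# Usage: can_vmcp_restart <protected> <response_allows_restart> <other_host_has_datastore>
can_vmcp_restart() {
  protected=$1
  response_allows_restart=$2
  other_host_has_datastore=$3
  [ "$protected" = yes ] || { echo "no: VM is not HA protected"; return 1; }
  [ "$response_allows_restart" = yes ] || { echo "no: VMCP response does not restart VMs"; return 1; }
  [ "$other_host_has_datastore" = yes ] || { echo "no: no host can access the datastore"; return 1; }
  echo "yes: restart is possible"
}

# Example:
#   can_vmcp_restart yes yes no
```

The value is in forcing each answer to be stated during the incident rather than assumed.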

VM Component Protection is designed to detect and react to storage failures that may not immediately crash a VM but can affect VM health or quality of service. The vSphere API describes VMCP as part of vSphere HA and defines separate storage protection settings for APD and PDL. (developer.broadcom.com)

PDL response

For PDL, the VMCP response can be set to disabled, warning, restart aggressive, or cluster default. The API states that when PDL protection is set to restart aggressively, VMCP immediately terminates impacted VMs and attempts to restart them on a best-effort basis. (developer.broadcom.com)

In VCF, Broadcom’s design decision is explicit: set the “Datastore with PDL” response in vSphere HA to “Power off and restart VMs” for non-vSAN datastores, because availability can be increased when another host can still access the datastore.

APD response

For APD, the response choices include disabled, warning, conservative restart, aggressive restart, or cluster default. The API explains that conservative restart terminates impacted VMs after a configured delay if they are to be restarted, while aggressive restart may terminate impacted VMs even in cases where restart may not be possible. (developer.broadcom.com)

The APD timer sequence matters. Broadcom documents the common sequence as the 140-second APD timeout plus an additional 3 minutes before HA executes the configured response, if APD persists.
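The arithmetic behind that sequence, as a small sketch using only the defaults quoted above. Both values are configurable, so substitute your cluster's actual settings.

```shell
# Earliest point, in seconds after APD starts, at which HA would act.
apd_timeout_s=140          # default APD timeout
vmcp_delay_s=$((3 * 60))   # default additional delay before the HA response
earliest_response_s=$((apd_timeout_s + vmcp_delay_s))

echo "APD timeout declared at T+${apd_timeout_s}s"
echo "HA response no earlier than T+${earliest_response_s}s"
```

With the defaults, that is 320 seconds, or 5 minutes 20 seconds, of stalled I/O before VMCP acts on a persistent APD.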

Do not tune random host settings first

Broadcom’s HA best-practice KB warns that VMCP-related HA settings can be confused with host-level advanced settings, and it recommends following official HA and VMCP documentation over advanced host settings unless directed by VMware Support or a storage vendor.

That matters during incident review. Do not “fix” APD/PDL by changing host-level advanced settings unless support has tied the change to your failure mode.

Runbook Stage 4: Validate Recovery

After the storage team believes the issue is fixed, validation needs to happen from the vSphere side and the storage side.

Host validation

Check:

# Confirm no new APD events
grep -Ei 'apd\.(start|timeout|exit)' /var/run/log/vobd.log

# Confirm no continuing PDL or no-connection patterns
grep -Ei 'PERM LOSS|permanently inaccessible|No connection|Not found \(APD\)' /var/run/log/vmkernel.log

# Confirm paths are no longer all dead
localcli storage core path list | grep -E "Runtime Name:|Device:|State:"
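A plain grep also returns events from before the fix. To see only what happened afterwards, the events can be filtered by time. This sketch assumes each vobd.log line begins with an ISO-8601 timestamp followed by a colon, and the function name is hypothetical. Pass the cutoff without a trailing Z or fractional seconds so that string comparison orders correctly.

```shell
# Hypothetical helper: print APD start/timeout events at or after a cutoff.
# Assumption: the first field of each line is an ISO-8601 timestamp plus ":".
# Usage: apd_events_since 2024-05-01T12:00:00 < /var/run/log/vobd.log
apd_events_since() {
  cutoff=$1
  awk -v cutoff="$cutoff" '
    /storage\.apd\.(start|timeout)/ {
      ts = $1
      sub(/:$/, "", ts)
      # Appending "" forces string comparison of the two timestamps.
      if ((ts "") >= (cutoff "")) print
    }'
}
```

No output after the remediation time is the result you want. Any output means APD recurred after the fix.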

vCenter validation

Confirm:

Host is connected and responsive.

Datastore is accessible.

Datastore capacity is visible and not 0 B.

Affected VMs are powered on or intentionally powered off.

No VMs remain invalid, orphaned, or inaccessible without a recovery plan.

HA protection is restored.

VMCP settings are still what you expect at cluster and VM override levels.

There are no stale ISO, template, snapshot, or RDM references to removed datastores.

Storage validation

Confirm:

LUN masking and host group membership are correct.

SAN zoning or iSCSI/NFS network path is stable.

No path flapping remains.

Array logs align with the ESXi event timeline.

Any controller failover, firmware, port, or switch event is documented.

Storage-side remediation did not create a new stale presentation risk.

Rollback and Fallback Guidance

PDL and APD do not have the same rollback story.

For accidental PDL, rollback usually means restoring correct array presentation, then validating that ESXi sees the expected device identity. If the datastore was intentionally removed, rollback may be inappropriate; proceed with cleanup and VM recovery.

For APD, rollback means restoring the path that disappeared. That might be a fabric rollback, storage network rollback, zoning correction, cable replacement, storage controller recovery, duplicate-IP correction, or array-side failback. The host may still require reboot if APD references remain after connectivity returns.

For VMCP actions, fallback depends on where the VM landed:

If the VM restarted on a healthy host, validate application consistency.

If the VM was powered off but could not restart, confirm datastore accessibility from candidate hosts.

If every host lost access to the datastore, VMCP cannot create a valid restart target.

If aggressive APD policy powered off workloads that could not restart, document that as a policy/design issue, not just an incident artifact.

Common Mistakes That Make APD and PDL Incidents Worse

Mistake 1: Treating APD like a normal path failover

Path redundancy degradation is not the same as APD. If one path fails and another path is active, you have a redundancy issue. If every path to the device is down, you have APD or PDL classification work to do.

Mistake 2: Rescanning before the failure domain is known

A rescan can be appropriate after storage presentation is corrected. It can also be harmful when APD is active. Broadcom documents that a rescan while any LUN is in APD can affect VMs on other LUNs.

Mistake 3: Assuming disconnected means management network

A host in APD can appear disconnected or not responding because storage I/O is blocking management behavior. Treat “host disconnected” as a symptom, not a root cause.

Mistake 4: Expecting HA to restart VMs when no host has storage

If all hosts in the cluster lost access to the datastore, HA has nowhere valid to restart those VMs. VMCP improves response when there is a healthy restart target. It does not replace storage availability.

Mistake 5: Applying non-vSAN VMCP assumptions to vSAN

For VCF and vSphere environments with non-vSAN datastores, VMCP is central to the operational model. For VxRail/vSAN, use vSAN health, object compliance, disk group state, component availability, and resync behavior as the primary model.

Operational Design Implications

A good APD/PDL posture is designed before the incident.

Storage removal must be treated as a host-and-array workflow

For planned LUN removal, the clean sequence is not optional:

Evacuate or power off workloads
Unmount datastore
Detach device
Unpresent from array
Rescan hosts
Validate cleanup

Skipping host-side unmount and detach turns a planned activity into an unplanned PDL scenario.
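The sequence above can be printed as a dry-run checklist before the change window. Nothing here is executed. The function name is hypothetical, and the esxcli forms shown are the commonly documented unmount, detach, and rescan invocations, so confirm them against your ESXi version before use.

```shell
# Hypothetical dry-run: print the planned-removal steps in order for one
# datastore label and device ID. The host-side steps apply to every host
# that sees the device.
planned_removal_plan() {
  label=$1
  device=$2
  echo "1. Evacuate or power off workloads on datastore '$label'"
  echo "2. esxcli storage filesystem unmount -l '$label'"
  echo "3. esxcli storage core device set --state=off -d '$device'"
  echo "4. Ask the storage team to unpresent $device from the array"
  echo "5. esxcli storage core adapter rescan --all"
  echo "6. Validate cleanup: datastore and device no longer listed"
}

# Example (placeholders):
#   planned_removal_plan DS-OLD-01 naa.6000000000000000
```

Printing the plan rather than running it keeps the array-side step, which the host cannot perform, visibly in the middle of the sequence.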

VMCP settings should be reviewed per cluster

Do not assume every cluster should use the same APD policy.

For normal shared-storage clusters, conservative APD behavior may be safer because it avoids terminating VMs unless restart is expected to succeed.

For specific stretched or partition-tolerant designs, aggressive behavior may be intentional, but it needs design justification.

For non-vSAN VCF datastores, validate PDL response against the VCF design decision.

For VM overrides, verify that APD/PDL protection settings did not drift from the cluster intent.

Storage path monitoring should alert before APD

By the time APD occurs, all paths are gone. Better monitoring should alert on:

Path redundancy loss.

HBA or vmnic instability.

SAN switch port errors.

iSCSI packet loss or duplicate IP conditions.

NFS latency and disconnects.

Array target/controller failover anomalies.

Incident playbooks should separate classification from remediation

The first 10 minutes of the runbook should classify the failure. The next steps should remediate. Mixing those phases leads to rescans, restarts, and VM actions before anyone knows whether the failure is PDL, APD, path degradation, or an array-side administrative mistake.

Final Takeaway

PDL and APD are not just storage acronyms. They are operational models.

PDL says: ESXi has been told the device is permanently unavailable. Your job is to validate intent, correct presentation if it was accidental, clean up safely if it was not, and let HA/VMCP restart protected workloads where possible.

APD says: ESXi lost every path but does not know whether the loss is permanent. Your job is to restore fabric, array, or network connectivity quickly, avoid actions that worsen host management lockup, and understand that host reboot may become part of recovery.

The practical skill is not memorizing the acronyms. It is knowing which branch you are on before you start pushing buttons.

That is the difference between a controlled storage incident and a platform-wide scramble.

Sources:

Broadcom KB 318712: Permanent Device Loss and All-Paths-Down on host. (knowledge.broadcom.com)

Broadcom KB 318850: APD timeout behavior and 140-second default. (knowledge.broadcom.com)

Broadcom vSphere API: VM Component Protection settings and APD/PDL reaction model. (developer.broadcom.com)

Broadcom VCF design decision for PDL response on non-vSAN datastores. (knowledge.broadcom.com)

Broadcom KB 342615: APD management-agent impact and recovery constraints. (knowledge.broadcom.com)

Broadcom KB 414574: APD failover delay behavior. (knowledge.broadcom.com)

The post PDL vs APD: The Storage Failure Model Every vSphere Operator Needs appeared first on Digital Thought Disruption.