TL;DR
If you want clean accountability in VCF 9.0, anchor your operating model to the official hierarchy:
VCF private cloud -> VCF fleet -> VCF instance -> VCF domain -> vSphere clusters.
This post translates that hierarchy into an operating model: who owns what, where day-0/day-1/day-2 work happens, and how topology (single site vs two sites vs multi-region) changes your posture.
Scope and code levels referenced in this article (VCF 9.0 GA component set):
SDDC Manager: 9.0.0.0 build 24703748
vCenter: 9.0.0.0 build 24755230
ESX: 9.0.0.0 build 24755229
NSX: 9.0.0.0 build 24733065
VCF Operations: 9.0.0.0 build 24695812
VCF Operations fleet management: 9.0.0.0 build 24695816
VCF Automation: 9.0.0.0 build 24701403
VCF Identity Broker: 9.0.0.0 build 24695128
VCF Installer: 9.0.1.0 build 24962180 (used to deploy the 9.0.0.0 component set)
Architecture Diagram
Legend:
Fleet-level management components centralize governance and shared services across a fleet.
Instances retain discrete infrastructure management stacks (they do not merge into one giant vCenter/NSX).
Domains are your lifecycle and isolation boundary for compute + storage + network resources.
Table of Contents
Scenario
Assumptions
Scope and Code Levels
The Ownership Model You Actually Need
Day-0, Day-1, Day-2 Map
Decision Criteria: Fleet vs Instance vs Domain vs Cluster
Topology Posture
Identity Boundaries and SSO Scope
Failure Domain Analysis
Operational Runbook Snapshot
Anti-patterns
Troubleshooting Workflow
Conclusion
Scenario
You are aligning architects, operations, and leadership on what VCF 9.0 is actually managing, and how responsibilities split across:
Fleet-level services (VCF Operations, VCF Operations fleet management, VCF Automation, identity)
Instance-level foundations (SDDC Manager and the instance management domain)
Domain-level lifecycle and isolation (management domain and VI workload domains)
Consumption teams (VMs, Kubernetes, platforms, and automation templates)
The goal is predictable ownership and predictable blast radius.
Assumptions
You are deploying greenfield VCF 9.0.
You are building at least one VCF private cloud containing one or more VCF fleets.
You plan to deploy VCF Operations and VCF Automation early enough that the platform team uses them as the primary operational console for the environment.
You will support three physical site configurations over time:
Single site
Two sites in one region
Multi-region
Scope and Code Levels
This article is pinned to the VCF 9.0 GA component set (versions and builds listed in TL;DR). If you are on a later 9.0.x maintenance release, terminology remains consistent, but exact UI placement and lifecycle sequencing can shift.
Version Compatibility Matrix
| Layer | Component | Version | Build | Why you care operationally |
|---|---|---|---|---|
| Deployment | VCF Installer | 9.0.1.0 | 24962180 | The day-1 bring-up entrypoint for fleets and instances |
| Instance foundation | SDDC Manager | 9.0.0.0 | 24703748 | Drives instance lifecycle workflows and inventory control |
| Domain foundation | vCenter | 9.0.0.0 | 24755230 | Domain-level management boundary and API surface |
| Host layer | ESX | 9.0.0.0 | 24755229 | Your cluster capacity and patching blast radius unit |
| Network layer | NSX | 9.0.0.0 | 24733065 | Network segmentation, security policy, and edge services |
| Fleet services | VCF Operations | 9.0.0.0 | 24695812 | Central ops, visibility, grouping, and platform workflows |
| Fleet services | VCF Operations fleet management | 9.0.0.0 | 24695816 | Lifecycle for fleet services and related management components |
| Fleet services | VCF Automation | 9.0.0.0 | 24701403 | Self-service, governance, and policy-driven provisioning |
| Identity | VCF Identity Broker | 9.0.0.0 | 24695128 | Enables VCF Single Sign-On models and SSO scope decisions |
The Ownership Model You Actually Need
A clean VCF program usually stabilizes when you stop assigning ownership by product name and start assigning it by boundary:
Fleet boundary -> governance and shared services
Instance boundary -> discrete infrastructure footprint and operational control plane
Domain boundary -> lifecycle, isolation, and workload placement
Cluster boundary -> scale unit and maintenance blast radius
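The boundary list above is easier to reason about when you can query it. Here is a minimal sketch using plain Python dataclasses — an illustrative model of the hierarchy, not a VCF API — that makes one blast-radius question answerable in code:

```python
# Illustrative model of the VCF 9.0 hierarchy (not a VCF API):
# private cloud -> fleet -> instance -> domain -> cluster.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str  # scale unit and maintenance blast radius

@dataclass
class Domain:
    name: str
    kind: str  # "management" or "vi-workload"
    clusters: list[Cluster] = field(default_factory=list)

@dataclass
class Instance:
    name: str
    site: str
    domains: list[Domain] = field(default_factory=list)

@dataclass
class Fleet:
    name: str
    instances: list[Instance] = field(default_factory=list)

@dataclass
class PrivateCloud:
    name: str
    fleets: list[Fleet] = field(default_factory=list)

def blast_radius(cloud: PrivateCloud, failed_instance: str) -> list[str]:
    """Domains whose lifecycle operations are constrained when an instance
    management domain is impaired (existing workloads keep running)."""
    return [
        d.name
        for f in cloud.fleets
        for i in f.instances
        if i.name == failed_instance
        for d in i.domains
    ]
```

Extending this with ownership labels per boundary is one way to turn the RACI chart below into something you can lint against inventory.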
“Who owns what” chart
Use this as a starting point for your internal RACI.
| Construct or capability | Primary owner | Secondary owner | Day-2 responsibilities that must be explicit |
|---|---|---|---|
| VCF private cloud (org boundary) | Platform team | Security/GRC | Portfolio decisions, fleet count, policy and compliance guardrails |
| VCF fleet | Platform team | Architecture | Fleet service lifecycle, shared governance, change windows, identity posture |
| Fleet-level management components (VCF Operations, VCF Operations fleet management, VCF Automation) | Platform team | SRE/Operations | Backups, upgrades, integrations, tenant and RBAC guardrails |
| VCF instance | Platform team | Regional ops | Capacity lifecycle, adding domains, instance-level networking standards |
| Management domain | Platform team | VI admin | "Keep the platform running" discipline: patching, certificates, backups |
| VI workload domain | VI admin | Platform team | Day-2 LCM inside guardrails, cluster operations, domain health |
| Domain networking (NSX segments, T0/T1 patterns, edge capacity) | Platform team | Network/security | Network design standards, firewall policy model, edge scaling ceilings |
| VM provisioning and templates | App/platform teams | VI admin | Golden image ownership, config drift control, tagging standards |
| Kubernetes platform on vSphere | App/platform teams | Platform team | Namespace policy, cluster lifecycle, RBAC, platform SLOs |
| VCF Automation catalogs, projects, policies | Platform team | App/platform teams | Self-service guardrails, approvals, quotas, blueprint governance |
| FinOps reporting and showback | Platform team | Finance | Tagging accuracy, allocation rules, cost anomaly response |
Design-time vs day-2 operations
This split is where most teams get surprised.
Design-time decisions (day-0) are expensive to unwind:
Fleet count and boundaries
Instance placement (site and region alignment)
Domain topology (number of workload domains, shared vs dedicated services)
Identity model and SSO scope
Network consumption model (and how much change control you want to enforce)
Day-2 operations should be routine, repeatable, and low-toil:
Adding workload domains and clusters
Capacity rebalancing
Patch and upgrade sequencing
RBAC lifecycle and access review
Drift detection and remediation
Day-0, Day-1, Day-2 Map
Use this map to stop “platform work” from leaking into “workload work”, and vice versa.
| Phase | What you do | Where it happens | Why it matters |
|---|---|---|---|
| Day-0 | Decide VCF private cloud -> fleets -> instances -> domains topology | Architecture/design | This locks your governance and blast radius posture |
| Day-0 | Choose identity model and SSO scope | Architecture/security | Identity boundaries are hard to change later without operational pain |
| Day-0 | Define network consumption model and tenant isolation model | Platform + network/security | Network decisions dictate scale ceilings and operational toil |
| Day-1 | Deploy first fleet + first instance management domain | VCF Installer + first instance management domain | The first instance becomes the anchor location for fleet services |
| Day-1 | Stand up fleet-level management components | Fleet services (hosted in first instance management domain) | This is your "platform services layer" for operations and governance |
| Day-1 | Deploy initial VI workload domain(s) | Instance lifecycle workflows | Workload domains become your default lifecycle and isolation unit |
| Day-2 | Add instances (new sites or regions) | Fleet services + new instance management domain | Expands footprint while keeping governance centralized |
| Day-2 | Add workload domains and clusters | Instance workflows + domain operations | Expands capacity and isolates workloads cleanly |
| Day-2 | Operate identity, automation, and lifecycle | Fleet services | Centralizes day-2 governance across attached instances |
Decision Criteria: Fleet vs Instance vs Domain vs Cluster
Most “VCF design debates” are actually “where do I want the blast radius to stop?”
Quick decision table
| If you need… | Add a fleet | Add an instance | Add a domain | Add a cluster |
|---|---|---|---|---|
| Separate governance plane and change windows | Yes | No | No | No |
| Regulated isolation with hard separation | Often yes | Sometimes | Sometimes | No |
| New site or region footprint | Sometimes | Yes | No | No |
| More lifecycle isolation for workloads | No | No | Yes | Sometimes |
| Different SLA or patch cadence for a workload group | No | No | Yes | Sometimes |
| More capacity in same workload boundary | No | No | No | Yes |
| Separate SSO boundary | Yes (cleanest) | Sometimes | No | No |
| Reduced shared service blast radius | Yes | Sometimes | Sometimes | No |
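If you want the quick decision table to be repeatable rather than tribal knowledge, you can encode the unambiguous rows as a lookup. A hedged sketch in Python — the need strings and recommendations are simplified labels taken from the table, not an official mapping, and the "often/sometimes" rows deliberately fall through to human review:

```python
# Sketch: the unambiguous "Yes" rows of the decision table as a lookup.
# Ambiguous needs (the "often yes"/"sometimes" rows) intentionally fall
# through to architecture review rather than getting a confident answer.
DECISION_TABLE = {
    "separate governance plane and change windows": "fleet",
    "new site or region footprint": "instance",
    "more lifecycle isolation for workloads": "domain",
    "different sla or patch cadence for a workload group": "domain",
    "more capacity in same workload boundary": "cluster",
    "separate sso boundary": "fleet",
}

def recommend_boundary(need: str) -> str:
    """Return the construct whose addition most directly answers the need."""
    return DECISION_TABLE.get(need.lower().strip(), "review with architecture")
```

For example, `recommend_boundary("Separate SSO boundary")` returns `"fleet"`, matching the "cleanest" option in the table.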
Architecture Tradeoff Matrix
| Option | Governance isolation | Operational overhead | Scale ceiling | Typical use |
|---|---|---|---|---|
| One private cloud, one fleet | Lowest | Lowest | Medium to high, depending on identity model | Standard enterprise starting point |
| One private cloud, multiple fleets | High | Higher | Higher overall, but duplicated services | Regulated zones, different change windows |
| Multiple private clouds | Highest | Highest | Highest | Mergers, hard org separation, distinct GRC boundaries |
Topology Posture
You can support all three topologies with the same mental model. What changes is how you set your fleet and instance boundaries.
Single site
This is the simplest operating posture:
One VCF private cloud
One fleet
One instance
Management domain + one or more workload domains
Operational posture:
One change window for platform services is acceptable for most orgs.
Use workload domains to isolate “platform workloads” from “business workloads”.
Treat the management domain as the platform control plane. Keep it boring.
Two sites in one region
Challenge:
You want higher availability and operational continuity, but you do not want to turn every incident into a “distributed systems lesson”.
Solutions:
A) One fleet, one instance, stretched where justified
Best when latency is low enough and your network supports the design.
Operational reality: stretched designs raise complexity. Treat it as an advanced pattern, not the default.
B) One fleet, two instances (one per site)
Clear failure domains.
You can keep lifecycle and capacity operations site-aligned.
Fleet-level services still centralize governance.
C) Two fleets (one per site)
You get maximum isolation at the cost of duplicated fleet services and duplicated operations.
Use this when you need separate change windows or regulated separation even within a region.
Multi-region
Challenge:
Regions are real failure domains. Latency and inter-region dependency will punish “single control plane” assumptions.
Solutions:
A) One private cloud, one fleet, multiple instances (region aligned)
You keep centralized governance while allowing region-specific instances.
Plan identity carefully. SSO scope can grow too large, too quickly.
B) One private cloud, multiple fleets (region aligned or regulation aligned)
You isolate fleet services, identity posture, and change windows.
You pay in duplicated management footprint and duplicated operational effort.
C) Multiple private clouds
Use when organizational or regulatory boundaries require hard separation.
Expect duplicated tooling and duplicated platform practices unless you standardize aggressively.
Identity Boundaries and SSO Scope
In VCF 9.0, identity is not a footnote. It is a design-time decision that changes:
Admin experience
Audit posture
Blast radius during identity incidents
How your teams move between instances and domains
VCF Single Sign-On models you should reason about
Treat these as scope control knobs.
| Model | SSO scope | Availability posture | Operational overhead | When it fits |
|---|---|---|---|---|
| Fleet-wide SSO | Large | Lower (single identity service per fleet) | Low | One fleet, tight governance, smaller instance count |
| Cross-instance SSO | Balanced | Balanced | Medium | Larger fleets, want to limit identity blast radius |
| Single-instance SSO | Small (per instance) | Higher per instance | Higher | Regulated isolation or region autonomy |
Scale note you should plan for:
A common planning guideline is to size a VCF Identity Broker deployment for a limited number of instances. If you intend to exceed that, plan multiple identity brokers or multiple fleets.
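To make that guideline concrete, here is a small planning helper. The per-broker ceiling below is a placeholder, not an official VCF limit; substitute the number from your own sizing guidance before using it in a design review:

```python
# Planning sketch: minimum identity brokers for a fleet, given a per-broker
# instance ceiling. MAX_INSTANCES_PER_BROKER is a placeholder value, not an
# official VCF sizing limit; replace it with your validated guidance.
import math

MAX_INSTANCES_PER_BROKER = 4  # placeholder ceiling

def brokers_needed(instance_count: int) -> int:
    """Minimum identity brokers required to cover instance_count instances."""
    return math.ceil(instance_count / MAX_INSTANCES_PER_BROKER)
```

If the answer is greater than one, that is your cue to decide between multiple brokers in one fleet (cross-instance model) or multiple fleets.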
Separate IdP and separate SSO boundaries (do both)
You typically implement “separate identity boundaries” two ways.
A) Separate fleets, separate IdPs
Cleanest separation.
Strongest isolation for regulated tenants.
Duplicates fleet services footprint.
B) One fleet, multiple identity brokers (cross-instance model)
Keeps a single governance plane.
Reduces blast radius of identity events.
You must be disciplined about which instances authenticate through which broker.
Design-time warning:
Resetting VCF Single Sign-On is a non-trivial event. Treat identity changes like a change program, not a quick admin task.
Failure Domain Analysis
This is the mental model that reduces panic during incidents.
Practical blast radius map
| Failure | What breaks first | What usually keeps running | Your first triage question |
|---|---|---|---|
| Fleet services outage (VCF Operations, VCF Automation) | Visibility, governance workflows, self-service provisioning, central policy operations | Existing workloads in domains, core hypervisor operations | Is this governance down or is core infrastructure down? |
| Identity broker outage (in-scope instances) | Logins and SSO flows for in-scope components | Existing workloads and data plane continue | What is the SSO scope for this identity broker? |
| Instance management domain incident | Instance lifecycle workflows, management vCenter/NSX for that instance | Workloads can keep running, but operations become constrained | Can you still reach workload domain vCenter/NSX? |
| Workload domain incident | Domain-specific provisioning and lifecycle | Other domains and instances | Is isolation working the way you intended? |
| Cluster-level capacity failure | Placement, HA behavior, performance | Other clusters/domains | Did you design cluster boundaries around maintenance and failure? |
Operational Runbook Snapshot
This is the minimum you want documented before you call the platform “ready”.
Fleet services runbook (platform team)
Backups and restore procedures for:
VCF Operations
VCF Operations fleet management
VCF Automation
VCF Identity Broker (if used)
Certificate lifecycle and rotation plan
Upgrade sequencing plan:
Fleet-level management components first
Core instance components next (SDDC Manager, NSX, vCenter, ESX, vSAN)
Health checks:
Fleet service availability and telemetry ingestion
Identity broker health and readiness
Automation integration health
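The health checks above can start as something deliberately simple. This sketch probes TCP reachability of fleet service endpoints; the hostnames are hypothetical placeholders, and a successful TCP connect is only a liveness hint, not a substitute for the product's own health views:

```python
# Minimal reachability probe for fleet service endpoints. Hostnames are
# hypothetical placeholders for your environment; this checks only TCP
# reachability, not application health, and is not a supported VCF API.
import socket

FLEET_ENDPOINTS = {
    "vcf-operations": ("vcf-ops.example.internal", 443),
    "vcf-automation": ("vcf-auto.example.internal", 443),
    "identity-broker": ("vcf-idb.example.internal", 443),
}

def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def fleet_reachability() -> dict[str, bool]:
    """Probe every configured fleet service endpoint."""
    return {name: probe(h, p) for name, (h, p) in FLEET_ENDPOINTS.items()}
```

Run it from the same network segment your operators use, so the result reflects the admin experience rather than an internal shortcut path.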
Real-world RTO/RPO examples you can start with
These are starting targets that many teams use to set expectations. Tune them to your recovery strategy and staffing model.
Fleet services (ops and automation):
RPO: 4 to 24 hours depending on your backup cadence and whether you treat it as “governance state” vs “mission critical”
RTO: 2 to 8 hours depending on appliance recovery automation and runbook maturity
Identity services:
RPO: 1 to 8 hours
RTO: 1 to 4 hours, because identity outages create broad administrative impact
Instance management domain:
RPO: 15 minutes to 4 hours depending on backup tools and datastore replication
RTO: 4 to 24 hours depending on whether you can rebuild vs restore
The key is consistency: define targets per boundary and test them.
Anti-patterns
These are the patterns that inflate toil and create “mystery outages”.
Treating “fleet” and “instance” as synonyms.
Putting regulated tenants in the same fleet without a clear identity and change-window strategy.
Running business workloads in the management domain.
Sprawling workload domains with no lifecycle boundary strategy.
Trying to retrofit complex identity changes without a reset and rollback plan.
Assuming multi-region behaves like a LAN.
Troubleshooting Workflow
When something breaks, the fastest teams classify the problem by boundary first.
Identify the boundary:
Fleet service issue?
Identity issue?
Instance management domain issue?
Workload domain issue?
Cluster or host issue?
Confirm scope:
One domain, one instance, or the whole fleet?
Validate impact:
Provisioning impacted?
Visibility impacted?
Existing workloads impacted?
Choose the right console:
Fleet-level visibility and operations -> start with VCF Operations
Instance lifecycle workflows -> validate instance health and management domain state
Domain operational state -> validate domain vCenter and NSX health
Stabilize, then remediate:
Restore service first
Then fix drift, misconfiguration, or lifecycle backlog
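The boundary-first classification step above can be sketched as a first-match lookup. The symptom strings are illustrative labels, not output from any VCF tool, and the ordering encodes "check identity before fleet services, fleet services before domains":

```python
# Sketch of boundary-first triage: match a free-text symptom against an
# ordered list of keywords and return the boundary to investigate first.
# Keywords and boundary names are illustrative, not from any VCF tool.
TRIAGE_ORDER = [
    ("sso", "identity"),
    ("login", "identity"),
    ("self-service", "fleet-services"),
    ("visibility", "fleet-services"),
    ("lifecycle workflow", "instance-management-domain"),
    ("provisioning in one domain", "workload-domain"),
    ("ha event", "cluster"),
]

def classify(symptom: str) -> str:
    """First matching boundary for a free-text symptom description."""
    s = symptom.lower()
    for needle, boundary in TRIAGE_ORDER:
        if needle in s:
            return boundary
    return "unclassified: confirm scope manually"
```

The point is not the keyword matching; it is that the escalation path starts from a boundary, so the first responder opens the right console instead of the familiar one.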
Conclusion
If you want VCF 9.0 to feel operable at scale, you need an ownership model that matches the platform hierarchy:
VCF private cloud is your organizational boundary for platform outcomes.
Fleet is where you place shared governance services and shared operational responsibility.
Instance is your discrete infrastructure footprint, often aligned to a site or region.
Domain is the lifecycle and isolation boundary you use to protect workloads from each other.
Cluster is your capacity and maintenance blast radius unit.
When those boundaries map cleanly to "who owns what", day-2 operations become repeatable instead of heroic.
Sources
VMware Cloud Foundation 9.0 Documentation (Tech Docs): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html
VMware Cloud Foundation 9.0 Release Notes: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/vmware-cloud-foundation-90-release-notes.html
Design (VMware Cloud Foundation 9.0): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design.html
Architectural Options in VMware Cloud Foundation: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/vmware-cloud-foundation-concepts.html
Fleet Management (VCF Operations): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/overview-of-vmware-cloud-foundation-9/what-is-vmware-cloud-foundation-and-vmware-vsphere-foundation/vcf-operations-overview/fleet-management.html
VCF Single Sign-On Architecture: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/what-is/sso-architecture.html
Identity Providers and Protocols Supported for VCF Single Sign-On: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/what-is/protocols-suported-for–sso.html
Linking vCenter instances in VCF Operations: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/linking-vcenter-systems-in-vmware-cloud-foundation-operations.html
The post VCF 9.0 GA Mental Model Part 3: Day-0 to Day-2 Ownership Across Fleets, Instances, and Domains appeared first on Digital Thought Disruption.
