VCF 9.0 GA Mental Model Part 5: Topology Patterns for Single Site, Two Sites, and Multi-Region

TL;DR

If you want architects, operators, and leadership aligned, you need a topology mental model that starts with VCF objects and only then maps to your physical sites.

The hierarchy you should standardize on is Fleet -> Instance -> Domain -> Cluster.

Your topology decision is mostly about:

How many instances you deploy and how they map to sites and regions.

How many fleets you operate as governance, identity, and operational boundary lines.

Three practical deployment postures:

Single site: fastest path, smallest blast radius, simplest networking.

Two sites in one region: stretched clusters, stronger site resilience, tighter latency constraints.

Multi-region: multiple instances, DR-oriented operating model, more dependencies and more change control.

VCF 9.0 GA code levels referenced in this post (component set and build numbers):

SDDC Manager 9.0.0.0 build 24703748

vCenter 9.0.0.0 build 24755230

ESX 9.0.0.0 build 24755229

NSX 9.0.0.0 build 24733065

VCF Operations 9.0.0.0 build 24695812

VCF Automation 9.0.0.0 build 24701403

VCF Identity Broker 9.0.0.0 build 24695128

Note: the BOM for this release also calls out VCF Installer 9.0.1.0 build 24962180 as required to deploy all VCF 9.0.0.0 components.

Architecture Diagram

Table of Contents

Scenario

Scope and version alignment

Core concepts: mapping physical topology to fleets and instances

Decision criteria you should agree on up front

Challenge: choose your deployment posture

Architecture tradeoff matrix

One private cloud vs multiple fleets

Identity and SSO boundary patterns

Failure domain analysis

Day-0, day-1, day-2 map by topology

Who owns what

Operational runbook snapshot

Troubleshooting workflow

Anti-patterns

Best practices

Summary and takeaways

Conclusion

Scenario

You are about to deploy VCF 9.0 GA greenfield and you need a shared language for:

What VCF is actually managing.

Where you draw governance boundaries vs infrastructure boundaries.

What changes when you move from a single site to stretched sites to multiple regions.

How identity scope and fleet count become day-0 decisions with long tail operational consequences.

Scope and version alignment

This post assumes:

VCF 9.0.0.0 GA terminology and workflows.

Greenfield deployment using VCF Installer.

You deploy both VCF Operations and VCF Automation from day-1, even if you phase consumption later.

Version compatibility matrix

Use this as your “what are we talking about” anchor in architecture reviews and CAB meetings.

ComponentVersionBuildSDDC Manager9.0.0.024703748vCenter9.0.0.024755230ESX9.0.0.024755229NSX9.0.0.024733065VCF Operations9.0.0.024695812VCF Automation9.0.0.024701403VCF Identity Broker9.0.0.024695128VCF Installer9.0.1.024962180

Core concepts: mapping physical topology to fleets and instances

The physical words that matter

You will hear these terms used casually. Align them to your constraints:

Site: your contained fault domain boundary. Power, cooling, ToR switches, upstream routing, and physical security usually correlate here.

Region: one or more sites within synchronous replication latencies. Crossing regions is typically a disaster recovery process, not an HA event.

The VCF objects you should use in every design discussion

Fleet: your shared governance and shared platform services boundary. This is where you centralize fleet services like operations and automation.

Instance: a discrete VCF deployment footprint. Each instance contains its own management domain and workload domains.

Domain: the lifecycle and isolation boundary. You patch and evolve domains independently.

Management domain: hosts instance management components.

VI workload domain(s): run consumer workloads.

Practical rule:

If someone says “we need another vCenter,” you force the conversation back to domain and instance.

If someone says “we need separation,” you ask whether they mean governance separation (fleet) or workload isolation (domain).

Decision criteria you should agree on up front

These are design-time decisions that are expensive to reverse later.

Design-time decision criteria

Availability objective

Host failure only

Rack failure

Site or availability zone failure

Region failure

Latency and network fabric capability

ESX host to ESX host latency within clusters

Stretched VLAN and L2 adjacency requirements

Fleet-wide connectivity constraints between instances

Isolation objective

Logical isolation only

Physical isolation by cluster or domain

Regulated tenant isolation that requires separate identity and change control

Operating model maturity

Do you have a platform team that can own fleet services and identity lifecycle?

Do you have standardized change windows and patch discipline?

Day-2 reality check questions

Can you support fleet services as shared dependencies?

Can you operationalize backup schedules, certificate lifecycle, and password rotation consistently?

Can you troubleshoot cross-site failures without escalating everything to vendors?

Challenge: choose your deployment posture

You need a deployment posture that matches your physical topology without creating a governance model you cannot operate.

Solution A: Single site

This is your default starting posture unless you have a clear availability driver.

What it looks like

One fleet

One instance

One management domain

One or more workload domains

Design intent

Optimize for time-to-value and operational simplicity.

Keep latency and networking requirements straightforward.

Operational implications

You can still scale within the site by adding:

More clusters to domains

More workload domains

Potentially more instances if you need isolation at the instance boundary

Your DR posture becomes a separate conversation, usually backup/restore first, then replication and orchestration.

Solution B: Two sites in one region

This is the “site resilience” posture. It usually assumes stretched constructs and tighter network constraints.

What it looks like

One fleet

One instance stretched across two sites in the same region

A management domain designed for high availability across the two sites

Workload domains separated from management

Design intent

Tolerate a full site or availability zone failure for management and potentially workloads.

Reduce downtime for site-local incidents.

Hard constraints you must respect

Stretched designs require disciplined network engineering. VCF calls out maximum latency thresholds for:

ESX hosts within a vSphere cluster.

Hosts running NSX Edge nodes within the same NSX Edge cluster.

vSAN witness connectivity in stretched designs.

You also inherit “stretched gateway” and “first hop gateway HA” problems that are not optional in real outages.

Operational implications

Your patch workflow needs to understand site affinity and failover capacity.

Your network troubleshooting becomes as important as your vSphere troubleshooting.

Solution C: Multi-region

This is the “DR-oriented operating model” posture. You typically deploy multiple instances, each aligned to a region.

What it looks like

One fleet

Multiple instances

Each instance has its own management domain and workload domains

Regions are connected with a cross-region network for centralized management and visibility

Design intent

Separate failure domains by region.

Enable recovery workflows that survive region-level events, given adequate capacity and replication strategy.

Operational implications

You introduce an explicit dependency chain:

Fleet services live somewhere (commonly the first instance), and other instances rely on cross-region connectivity to reach them.

Change control becomes multi-region by default:

Certificates, identity, and patching must be coordinated across locations.

Architecture tradeoff matrix

Use this in design boards to stop circular debates.

AttributeSingle siteTwo sites in one regionMulti-regionPrimary goalSimplicitySite resilienceRegion separation and DR postureTypical instance count112+Network complexityLowHighMedium to highLatency sensitivityModerateHighMediumFleet service dependencyLocalLocal but stretchedCross-region dependencyOperational overheadLowHighHighCost driversHost count, storageStretched fabric, witness, failover capacityDuplicate capacity, replication, bandwidth

Cost model snapshot

This is not pricing. It is what actually moves your bill of materials.

Single site

Cheapest fleet service hosting footprint.

Lowest network engineering cost.

Two sites in one region

You pay for:

Higher-quality inter-site links

Stretched VLAN support

Additional failover capacity (because you are engineering for a site loss)

Multi-region

You pay for:

Duplicate management footprints per region

Data replication and orchestration tooling

Higher operational toil unless you automate day-2 heavily

One private cloud vs multiple fleets

Treat “private cloud” as your organizational wrapper. VCF objects start at fleet.

When one fleet is enough

Choose one fleet when:

You want centralized observability and automation.

You can accept shared governance services across multiple instances.

You want a standard operating model across locations.

When you should operate multiple fleets

Choose multiple fleets when you need:

Separate identity providers or separate SSO boundaries for regulated isolation.

Independent change windows and patch schedules.

Hard blast radius separation for fleet services.

Practical framing:

Fleet separation is about governance, identity scope, and shared service blast radius.

Domain separation is about workload isolation and lifecycle independence.

Identity and SSO boundary patterns

Identity is not a “later” decision. It is a boundary decision.

Challenge: unify access or isolate tenants

You need a model that matches your org and compliance posture.

Solution A: Fleet-wide SSO

Use this when:

You want one set of credentials and SSO across all components in the fleet.

You can tolerate that an identity broker impact affects the fleet.

Operational reality:

This is powerful for operator experience, but it increases the blast radius of identity outages.

Solution B: Cross-instance SSO

Use this when:

You want shared identity across a subset of instances, not necessarily all.

You want more control over blast radius than a single fleet-wide configuration.

Solution C: Single instance SSO boundaries

Use this when:

You need regulated or tenant isolation.

You need different identity providers or different authentication policies per instance.

You want to localize identity outages.

Embedded vs appliance identity broker

Treat this as a scaling and availability decision:

Embedded identity broker is simpler but inherits dependency on instance components.

Appliance identity broker adds overhead but improves availability and scale.

Design constraint worth calling out: there is a maximum number of instances that can connect to a single identity broker deployment.

Failure domain analysis

This is where topology turns into real operational outcomes.

Failure domains you should model

FailureWhat breaksWhat keeps runningTypical owner responseFleet services unavailableCentral observability, centralized automation, fleet management workflows, and optionally SSO experienceExisting workloads and instance-level management planes continue to runPlatform team restores fleet services, validates integrationsInstance management domain impairedDomain lifecycle actions, some instance operationsWorkloads may still run, but you lose safe lifecycle control and may lose some vCenter or NSX management functions depending on the failureVI admin + platform team coordinate recoveryWorkload domain impairedWorkloads in that domainOther domains and instances continueVI admin + app teams execute workload recovery runbooks

Practical RTO and RPO examples you can use as starting targets

These are real-world starting points, not vendor commitments:

Fleet services (VCF Operations, VCF Automation, identity broker)

RTO: 2 to 8 hours depending on how automated your restore is

RPO: 24 hours is a common baseline when backups are daily

Instance management domain components

RTO: 1 to 4 hours if you have clean backups and documented restore procedures

RPO: 24 hours baseline, tighter if you replicate critical config

Workload domains and applications

RTO and RPO are application-specific and often require orchestration tooling

If you cannot state these targets, you should at least agree on the priority order:

Identity and authentication

SDDC Manager and lifecycle

vCenter and NSX management

Workload recovery

Day-0, day-1, day-2 map by topology

Day-0: decisions you should lock

These apply to all topologies:

Fleet count and naming standard.

Instance to site or region mapping.

Domain strategy:

Management domain is not where you run business workloads.

Workload domains align to lifecycle and isolation needs.

Network and IP plan:

Treat subnet sizing as irreversible planning, not something you “fix later”.

Allocate address space with room for expansion.

Identity model:

Fleet-wide vs instance-level boundaries.

Corporate IdP integration and MFA policy alignment.

Certificate authority and certificate lifecycle plan.

Backup targets and backup schedule owners.

Day-1: bring-up sequence that fits the object model

A typical greenfield flow looks like:

Deploy VCF Installer appliance.

Start new fleet deployment and create the first instance.

License and stand up fleet services:

VCF Operations

VCF Automation

Deploy identity broker and configure VCF SSO with your directory.

Create workload domain(s) and establish network connectivity patterns.

Stand up VCF Automation constructs for consumption.

Day-2: operations you should operationalize early

Patch and lifecycle:

Domain-based upgrades and maintenance windows.

Explicit rollback plans when upgrading shared fleet services.

Backup and restore:

SFTP backup targets for management components.

Backup schedules for fleet services and for instance components.

Security lifecycle:

Password rotation and account management.

Certificate replacement and renewal.

Expansion:

Add workload domains, clusters, and potentially additional instances.

Who owns what

Use this to prevent “everyone owns it, so no one owns it.”

CapabilityPlatform teamVI adminApp and platform teamsFleet services lifecycleOwnConsultInformedVCF Operations configuration and alertsOwnConsultInformedVCF Automation provider setupOwnConsultInformedIdentity broker and SSO modelOwnConsultInformedInstance bring-up and healthOwnOwnInformedSDDC Manager operationsConsultOwnInformedvCenter and NSX in management domainConsultOwnInformedWorkload domain creation and lifecycleConsultOwnInformedWorkload provisioning via automationOwn the platformConsultOwn consumptionApplication deployment and runtimeInformedConsultOwn

Operational runbook snapshot

Keep this as a living page in your ops wiki.

Weekly

Review fleet service health and integrations.

Validate that all instances are reporting metrics and logs.

Confirm certificate expiration windows and rotation queue.

Monthly

Execute backup restore tests for:

Fleet services

Instance management components

Review capacity and “failover capacity” assumptions for your topology.

Quarterly

Patch at domain boundaries, not by ad hoc component upgrades.

Re-validate cross-site network latency and packet loss.

Run a tabletop exercise:

Fleet services outage

Instance outage

Site outage

Validation checklist

Use UI and workflow validation before you declare success:

In VCF Operations, confirm each VCF instance is visible and healthy.

Confirm your automation provider and tenant access paths work with the chosen identity model.

Confirm backups are running and stored off the platform.

Troubleshooting workflow

When something breaks, troubleshoot by boundary.

Provisioning failures

Check VCF Automation health and its integration to VCF Operations.

Validate identity provider connectivity and token issuance.

Validate network connectivity between fleet services and target instance.

Instance lifecycle failures

Inspect SDDC Manager alarms and recent change history.

Validate domain health and vCenter availability.

Check for drift from out-of-band changes.

Cross-site weirdness

Start with latency and MTU validation.

Validate gateway HA behavior for stretched segments.

Confirm site affinity rules for critical components.

Anti-patterns

Avoid these early and you remove a lot of future toil.

Treating a two-site stretched design like “just two data centers”.

Using a single fleet across regulated tenants when you actually need separate identity and change boundaries.

Running meaningful workloads in the management domain because it was “available”.

Designing IP space too tightly and assuming you can resize later.

Assuming multi-region means “active-active” without defining replication, orchestration, and capacity for failover.

Best practices

Standardize vocabulary in writing:

Fleet -> Instance -> Domain -> Cluster

Keep fleet services highly available and backed up like any other Tier 0 platform.

Make identity a design board item, not an implementation checkbox.

Use domains as your lifecycle boundary:

Patch domains, validate domains, roll back at domain boundaries.

Write failure-mode runbooks for:

Fleet services down

Instance down

Site down

Region down

Summary and takeaways

Your topology posture is an operating model decision, not just an architecture diagram.

Two sites in one region usually increases availability but also increases network and day-2 complexity.

Multi-region usually improves fault domain separation, but it introduces cross-region dependencies for fleet services unless you deliberately isolate with multiple fleets.

Decide identity scope and fleet count at day-0. The cost of changing later is always higher than the cost of deciding carefully now.

Conclusion

VCF 9.0 GA becomes easier to design and operate when you treat fleet, instance, and domain as explicit boundaries and then map them to site and region realities. Pick the simplest topology that meets your availability and isolation goals, and invest early in day-2 practices for identity, backups, lifecycle, and change control.

Sources

VMware Cloud Foundation 9.0 and later Documentation: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html

The post VCF 9.0 GA Mental Model Part 5: Topology Patterns for Single Site, Two Sites, and Multi-Region appeared first on Digital Thought Disruption.