
Everything is fine until one database disk fills up - and suddenly the entire platform is down because there’s no replica, no fallback, and no one had thought to ask “what happens if this specific thing fails?”
That’s the single point of failure problem. Not that components fail - everything fails eventually - but that one failure takes everything else down with it.
This guide explains what SPOFs are, why they’re so dangerous, how to find them in a real architecture, and the concrete patterns used to eliminate them.
If you’re following the series, this post builds directly on What Is High Availability? A Beginner’s Guide. For the full roadmap, the System Design Foundations series is the pillar page and System Design is the category page.
Table of Contents
- What Is a Single Point of Failure? (Definition)
- Why SPOFs Are More Dangerous Than Other Failures
- Common SPOFs in Web Systems
- A Typical SPOF Architecture (And What’s Wrong with It)
- How to Find SPOFs in Your Architecture
- How to Eliminate SPOFs
- The Database SPOF: The Hardest One to Fix
- Real-World SPOF Failures
- Interview Questions
- 1. What is a single point of failure and why does it matter in system design?
- 2. How would you identify SPOFs in a new system you’re reviewing?
- 3. A database is often called the most critical SPOF - how do you eliminate it?
- 4. What’s the difference between a SPOF at the application layer vs the database layer?
- 5. Can a cloud provider’s managed service still be a SPOF?
- 6. What’s the difference between eliminating a SPOF and achieving fault tolerance?
- Conclusion
- References
- YouTube Videos
What Is a Single Point of Failure? (Definition)
A single point of failure (SPOF) is any individual component in a system whose failure causes the entire system to stop working. No redundancy, no backup path, no automatic recovery - the system just goes down.
What makes a SPOF a SPOF isn’t that it can fail (every component can), but that the rest of the system depends on it with no alternative. Remove or break it, and everything that depends on it fails too.
SPOFs exist at every layer of a stack:
- A single web server with no load balancer and no standby
- A single database with no replica
- A single network switch that all services route through
- A single configuration service that every app reads at startup
High availability is fundamentally the discipline of identifying and eliminating these single points of failure. If you want to achieve three nines (99.9% uptime) or better, you can’t have any major SPOFs left in your system.
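To make the link between SPOFs and the nines concrete, here is a rough back-of-the-envelope sketch. It assumes component failures are independent (often not true in practice), the availability figures are made up, and the function names are purely illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain where every component must be up (each one is a SPOF)."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of a group that works as long as at least one copy is up.

    Assumes the copies fail independently, which is an idealization.
    """
    return 1 - (1 - availability) ** copies


# Hypothetical figures: four components, each up 99.9% of the time.
chain = [0.999, 0.999, 0.999, 0.999]
print(f"four SPOFs in series:   {serial_availability(chain):.4%}")

# The same chain with every component duplicated.
duplicated = [redundant_availability(a, 2) for a in chain]
print(f"each component doubled: {serial_availability(duplicated):.4%}")
```

Under these assumptions, four three-nines components chained in series land below three nines overall, while duplicating each one pushes the chain well above it. Real systems will differ because failures are correlated and failover itself takes time.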
Why SPOFs Are More Dangerous Than Other Failures
The danger of a SPOF isn’t just the failure itself - it’s the blast radius. A SPOF doesn’t just take itself down; it takes down everything that depends on it, which is often the entire system.
Consider a payment service with a single database node. When that node’s disk fills up:
- The database stops accepting writes
- Every app server trying to write a payment record gets an error
- The payment API returns 500s to all clients
- The frontend shows error pages to every user
- Revenue stops completely
The same failure on a properly designed system with primary-replica failover would look completely different: the primary fails, the replica is promoted automatically in seconds, app servers reconnect, and users experience a brief blip - or nothing at all if connection retry logic is in place.
The core distinction is failure isolation. Well-designed systems contain failures; systems with SPOFs propagate them.
Common SPOFs in Web Systems
SPOFs hide in plain sight. Here are the ones that cause the most incidents:
| Component | SPOF risk | What failure looks like |
|---|---|---|
| Single app server | High | All requests fail; users see errors |
| Single database (no replica) | Critical | App can’t read or write; total outage |
| Single load balancer (self-managed) | High | No traffic reaches any app server |
| Single DNS provider | High | Users can’t resolve your domain |
| Single CDN origin | Medium | Cached content keeps serving, but every cache miss fails |
| Single secrets/config store | Medium | Services can’t start or read config |
| Shared writable filesystem (NFS) | High | All services using it hang or error |
| Single region/data center | High | Region outage = complete outage |
The database SPOF is typically the most critical because it holds state. Losing the database means every stateful operation in your system fails simultaneously.
Secrets and config stores are often overlooked SPOFs. If your app reads its database password from a Vault instance at startup and that Vault instance is down, your app servers can’t start - even if the database itself is perfectly healthy.
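One common mitigation for already-running services is to keep the last known good value in memory, so a brief outage of the store does not break them. The sketch below is a minimal illustration, not a real Vault client: `CachedSecrets`, the `fetch` callable, and the one-hour default are all assumptions made for the example.

```python
import time


class CachedSecrets:
    """Wraps a secrets lookup and falls back to the last known good value.

    `fetch` is any callable that takes a secret name and returns its value,
    raising an exception when the store is unreachable (hypothetical interface).
    """

    def __init__(self, fetch, max_age_seconds=3600):
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._cache = {}  # name -> (value, fetched_at)

    def get(self, name):
        try:
            value = self._fetch(name)
        except Exception:
            cached = self._cache.get(name)
            if cached and time.monotonic() - cached[1] <= self._max_age:
                return cached[0]  # store is down: serve the stale copy
            raise  # nothing usable is cached, so fail loudly
        self._cache[name] = (value, time.monotonic())
        return value
```

An in-memory fallback like this only protects processes that have already fetched the secret once. A freshly started app server still needs the store to be reachable, which is why the store itself also needs redundancy.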
A Typical SPOF Architecture (And What’s Wrong with It)
This is what a small production stack looks like when HA hasn’t been considered:
flowchart TD
U[User] --> NGINX[Single Nginx<br/>⚠ SPOF]
NGINX --> APP[Single App Server<br/>⚠ SPOF]
APP --> DB[(Single Database<br/>⚠ SPOF)]
APP --> CACHE[(Single Redis<br/>⚠ SPOF)]
classDef spof fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000;
class NGINX,APP,DB,CACHE spof;
Every component in this diagram is a SPOF. Take down any one of them and the service goes offline. This architecture might be fine for a personal project or early prototype, but it’s not production-ready for a service anyone depends on.
How to Find SPOFs in Your Architecture
The basic method is to trace the critical path of a request and ask one directed question at each step: “if this component disappeared right now, would the service still function?”
For a typical web app processing a payment:
- DNS lookup → Is there a single DNS server? What if it’s unresponsive?
- Load balancer → Is there only one? Is it cloud-managed (HA built-in) or self-managed?
- App server → Is there only one? What if the process crashes?
- Database → Is there a replica? Is automatic failover configured?
- Cache (Redis) → Is it a single instance? What happens if the cache is unavailable - does the app fall back to the database, or does it error?
- Secrets store → If the secrets service is down, can running services still function?
Walk every dependency in this way. The answer “no, the service would fail” marks a SPOF. The answer “yes, traffic routes to another instance” means redundancy is in place.
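The walk described above can also be sketched as a small reachability check over a dependency graph: remove one component at a time and see whether a request can still get from the user to a database. This is a simplified model with hypothetical component names. It treats the service as working if any node in `targets` is reachable, and it assumes failover really does redirect traffic to the surviving nodes.

```python
from collections import deque


def can_serve(graph, source, targets, failed):
    """Return True if `source` can still reach any node in `targets` once `failed` is removed."""
    if source == failed:
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node in targets:
            return True
        for nxt in graph.get(node, []):
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def find_spofs(graph, source, targets):
    """List every component whose removal alone cuts `source` off from all `targets`."""
    components = set(graph) | {n for deps in graph.values() for n in deps}
    components.discard(source)
    return sorted(c for c in components if not can_serve(graph, source, targets, c))


# Hypothetical single-instance stack: every hop is a SPOF.
single = {"user": ["nginx"], "nginx": ["app"], "app": ["db"]}
print(find_spofs(single, "user", {"db"}))

# Hypothetical redundant stack: two of everything on the request path.
redundant = {
    "user": ["lb_a", "lb_b"],
    "lb_a": ["app1", "app2"],
    "lb_b": ["app1", "app2"],
    "app1": ["db_primary", "db_replica"],
    "app2": ["db_primary", "db_replica"],
}
print(find_spofs(redundant, "user", {"db_primary", "db_replica"}))
```

A graph like this only captures connectivity. It says nothing about whether the replica is up to date or whether promotion is automated, so it complements the manual walk rather than replacing it.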
A second useful method is to run failure mode exercises: pick a component and ask “if an engineer ran kill -9 on this process right now, what would users see?” Organizations like Netflix made this literal with Chaos Monkey - randomly terminating production instances to surface hidden SPOFs before they surface on their own at the worst possible moment.
How to Eliminate SPOFs
Eliminating a SPOF always involves the same core idea: add redundancy and automate recovery. Here’s what that looks like in practice:
flowchart TD
U[User] --> LB[Redundant Load Balancer<br/>Active-Passive Pair]
LB --> A1[App Server 1]
LB --> A2[App Server 2]
A1 --> PDB[(Primary Database)]
A2 --> PDB
PDB -->|Continuous replication| RDB[(Replica Database)]
A1 --> RC[Redis Cluster<br/>3-node]
A2 --> RC
classDef server fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000;
classDef db fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000000;
classDef cache fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000000;
classDef lb fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000;
class A1,A2 server;
class PDB,RDB db;
class RC cache;
class LB lb;
The key changes from the SPOF architecture:
- App servers: two instances behind a load balancer. If one crashes, the load balancer routes traffic to the other. This pairs directly with what’s covered in What Is Load Balancing and How It Works.
- Load balancer: an active-passive pair sharing a virtual IP, or a cloud-managed load balancer (AWS ALB, GCP Load Balancing) that has redundancy built into the service.
- Database: a primary with a replica, plus automatic failover so if the primary becomes unreachable, the replica is promoted without manual intervention.
- Cache: a Redis Cluster (or Redis Sentinel) instead of a single instance, so cache failures don’t cascade into application errors.
The principle across all of these: no component should be the only one capable of doing its job.
For stateless components like app servers, active-active works well - both instances serve traffic simultaneously. For stateful components like primary databases, active-passive is more common because concurrent writes require complex conflict resolution logic. See What Is High Availability? A Beginner’s Guide for a full breakdown of active-active vs active-passive.
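For the stateless, active-active case, the routing logic amounts to "rotate through whichever backends are currently healthy". The sketch below is a toy model of that idea, not how any particular load balancer is implemented; the class name and backend names are hypothetical, and real health checks would run on a timer rather than being set by hand.

```python
from itertools import count


class RoundRobinBalancer:
    """Rotates requests across backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._counter = count()

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        """Return the next healthy backend, or raise if none are left."""
        candidates = [b for b in self._backends if b in self._healthy]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["app1", "app2"])
print([lb.next_backend() for _ in range(4)])  # both servers share the traffic

lb.mark_down("app1")  # simulate a failed health check
print([lb.next_backend() for _ in range(2)])  # traffic now goes only to app2
```

With two backends, losing one leaves the service up at reduced capacity. Losing both raises an error, which is the point at which the app tier itself has become the outage.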
The Database SPOF: The Hardest One to Fix
The database deserves special attention because it’s stateful - it holds your actual data - and because eliminating its SPOF is more complex than adding another app server.
A database with no replica is the most dangerous SPOF in most production systems. If the disk fails, the instance crashes, or the host is terminated, you lose both read and write access to your data simultaneously. Depending on your backup strategy, you might also lose hours of data.
The solution has three parts:
1. Replication: A replica continuously receives changes from the primary. Reads can be routed to the replica (offloading the primary); more importantly, the replica is ready to become the new primary if the original fails.
2. Automatic failover: Replication alone isn’t enough. You need a system that detects primary failure and promotes a replica - quickly and automatically. PostgreSQL with Patroni, MySQL with Group Replication, or a managed service like AWS RDS Multi-AZ all handle this. Without automation, you’re dependent on an engineer waking up at 3 AM to run promotion commands manually.
3. Application-level connection failover: Your app needs to reconnect to the new primary after failover. Most database drivers support this with a connection string that lists multiple hosts, or via a DNS name that gets updated during failover. If the app holds a hardcoded connection to the old primary’s IP, failover won’t help - the app still can’t connect.
A working database SPOF elimination requires all three pieces. Having replication but no automatic failover, or having automatic failover but no connection retry in the app, still leaves you partially exposed.
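The third piece, application-level connection failover, can be sketched as "try each known host in turn, back off, and try again". This is a minimal illustration under stated assumptions: `connect_with_failover` and the `connect` callable are hypothetical, and a real driver would also need to confirm that the host it reached is the current primary.

```python
import time


def connect_with_failover(hosts, connect, attempts=3, delay_seconds=0.5):
    """Try each host in order, retrying the whole list a few times.

    `connect` is a caller-supplied function (hypothetical here) that takes a
    host name and returns a connection to a writable primary, or raises
    ConnectionError if the host is down or is not the primary.
    """
    last_error = None
    for attempt in range(attempts):
        for host in hosts:
            try:
                return connect(host)
            except ConnectionError as exc:
                last_error = exc  # remember why this host was rejected
        if attempt < attempts - 1:
            # Back off before the next pass so a failover in progress can finish.
            time.sleep(delay_seconds * (2 ** attempt))
    raise ConnectionError(f"no primary reachable among {hosts}") from last_error
```

In practice this logic usually lives in the driver or in the connection string rather than in application code. PostgreSQL’s libpq, for example, accepts a comma-separated host list together with `target_session_attrs=read-write`.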
Real-World SPOF Failures
GitLab Database Deletion (2017)
GitLab experienced an incident where an engineer accidentally deleted the production database directory on the primary server while intending to wipe the data directory on a secondary. The damage was compounded by multiple coincident failures: five separate backup and replication systems had all failed or were misconfigured, leaving no reliable recovery path. GitLab lost approximately six hours of data and experienced nearly 18 hours of downtime.
The incident became a public case study because GitLab published a transparent post-mortem. The root SPOF wasn’t just the single database - it was the lack of verified backups. A backup that has never been tested is not a backup; it’s a hope.
AWS US-East-1 S3 Outage (2017)
A typo in an AWS maintenance command removed more capacity from S3 than intended, triggering a cascade that disrupted S3, and the many AWS services that depend on it, across us-east-1 for several hours. Services that depended entirely on that one region were unavailable or badly degraded. Because the disruption was region-wide, spreading across Availability Zones inside us-east-1 did not help; services that could fail over to another region fared much better.
The lesson: a single AWS region is itself a SPOF if you don’t architect around it. Cloud infrastructure provides redundancy tools - Availability Zones, multi-region deployments - but they only help if you actually use them.
GitHub Network Partition (2018)
GitHub experienced a 24-hour degraded state after a brief network interruption caused their primary database to become unreachable. The replica was promoted, but because the primary had briefly been accessible to some nodes after the partition, the two databases had diverged. The result was data inconsistency that required careful intervention to resolve.
The incident highlighted a subtlety: eliminating the database SPOF isn’t just about “have a replica.” It’s about having a failover process that correctly handles split-brain scenarios - situations where two nodes briefly think they’re both the primary.
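One widely used guard against split-brain is to promote a replica only when a strict majority of independent observers agree that the primary is unreachable. The sketch below illustrates that voting rule only. The function names and observer names are hypothetical, and it does not describe how GitHub’s tooling works.

```python
def should_promote(votes_primary_down, total_observers):
    """Promote only when a strict majority of observers cannot reach the primary."""
    return votes_primary_down > total_observers // 2


def decide_failover(observations):
    """`observations` maps observer name -> True if that observer can reach the primary."""
    down_votes = sum(1 for reachable in observations.values() if not reachable)
    return should_promote(down_votes, len(observations))


# Two of three observers have lost the primary: promote.
print(decide_failover({"zone_a": False, "zone_b": False, "zone_c": True}))

# Only one observer has lost it, which is probably a local network issue: do not promote.
print(decide_failover({"zone_a": False, "zone_b": True, "zone_c": True}))
```

A majority vote only covers the decision to promote. A complete failover process also has to fence the old primary so it stops accepting writes, otherwise the two nodes can still diverge.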
Interview Questions
1. What is a single point of failure and why does it matter in system design?
A single point of failure is any component whose failure causes the entire system to stop working - no redundancy, no automatic recovery path. It matters because systems at scale fail all the time: hardware breaks, processes crash, networks partition. A SPOF transforms what should be a contained failure into a total outage. In system design interviews, identifying SPOFs and explaining how to eliminate them is a core skill that demonstrates you understand not just how a system works when things go right, but how it behaves when they don’t.
2. How would you identify SPOFs in a new system you’re reviewing?
I’d trace the critical request path from user to database and ask at each hop: “if this component disappeared right now, would the service still function?” Every component where the answer is “no” is a SPOF. Beyond the happy path, I’d also look at shared dependencies - a common secrets store, a shared filesystem, a single DNS resolver - because these are often overlooked. In practice, running a failure mode exercise (asking “what if an engineer killed this process right now?”) surfaces hidden assumptions quickly, and it’s exactly the approach Netflix automated with Chaos Monkey.
3. A database is often called the most critical SPOF - how do you eliminate it?
Eliminating the database SPOF requires three things working together: replication (a replica continuously receiving changes from the primary), automatic failover (a system that detects primary failure and promotes a replica without manual intervention), and application-layer connection failover (the app reconnects to the new primary after promotion). Having only replication without automatic failover still leaves you dependent on a human waking up at 3 AM. Having automatic failover but no connection retry in the app means the app keeps trying to reach the old primary’s IP. All three parts must work together, and the failover process must also handle split-brain scenarios where the network partition briefly makes two nodes think they’re both primary.
4. What’s the difference between a SPOF at the application layer vs the database layer?
SPOFs at the application layer are generally easier to eliminate: application servers are typically stateless, so you can run multiple instances behind a load balancer and any instance can handle any request. If one crashes, the others continue. Database SPOFs are harder because the database is stateful - it holds your data - so you can’t just “add another instance” without also solving data synchronization, conflict resolution, and failover promotion. The blast radius is also larger: a failed app server affects only the requests it was handling, while a failed primary database affects every service that depends on storage simultaneously.
5. Can a cloud provider’s managed service still be a SPOF?
Yes, absolutely. A managed database in a single availability zone is still a SPOF - if that AZ has a power failure or network issue, your database becomes unreachable. A managed load balancer deployed only in one region is a SPOF if the region goes down. Cloud providers offer tools to eliminate these risks - multi-AZ deployments, cross-region replication, global load balancers - but they’re opt-in. Using a managed service doesn’t automatically mean you’ve eliminated SPOFs; it just means you’ve delegated the underlying infrastructure management. The architecture choices about redundancy and placement still fall on you.
6. What’s the difference between eliminating a SPOF and achieving fault tolerance?
Eliminating a SPOF means adding enough redundancy that a single component failure doesn’t take the system down - but there may still be a brief transition period (seconds) during failover. Fault tolerance goes further: the system continues with zero user-visible interruption even during failure, typically through hardware-level mirroring, synchronous replication, or active-active configurations where requests are processed by multiple nodes simultaneously. SPOF elimination is achievable with standard cloud infrastructure and is the right target for most web services. Fault tolerance is significantly more expensive and is usually reserved for financial transaction systems or infrastructure where even seconds of downtime carries unacceptable cost or risk.
Conclusion
- A single point of failure is any component whose failure causes the entire system to fail - no alternatives, no automatic recovery.
- The danger of a SPOF is its blast radius: one failure takes down everything that depends on it, which is often the whole service.
- Common SPOFs include single app servers, single databases, self-managed single load balancers, shared filesystems, and single-region deployments.
- Find SPOFs by tracing the critical request path and asking “if this disappeared, would the service still work?” at each step.
- Eliminate SPOFs with redundancy at every layer - multiple app servers, primary-replica databases with automatic failover, redundant load balancers.
- The database SPOF requires all three pieces: replication, automatic failover, and application-level connection retry.
This wraps up Series 4 of the System Design Foundations series. Series 5 begins with How Logging Works in Backend Systems - a practical look at how production services track what’s happening inside them.
If you want to revisit the broader HA design context that makes SPOF elimination worthwhile, What Is High Availability? A Beginner’s Guide covers redundancy, failover, and the nines in depth. For the load balancing patterns that sit at the center of most SPOF elimination strategies, see What Is Load Balancing and How It Works.
References
- What is a single point of failure (SPOF)? - TechTarget
  https://www.techtarget.com/searchdatacenter/definition/Single-point-of-failure-SPOF
- GitLab.com database incident post-mortem - GitLab
  https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
- Summary of the Amazon S3 Service Disruption - AWS
  https://aws.amazon.com/message/41926/
YouTube Videos
- “How to Avoid Single Points of Failure (SPOF)? | System Design Interview”
  https://www.youtube.com/watch?v=WORO6R14b_k
- “#21: Single Point of Failure (SPOF) | System Design Fundamentals”
  https://www.youtube.com/watch?v=V8QBGpUckeU
- “Single Point of Failure (SPOF) in System Design”
  https://www.youtube.com/watch?v=Iy2YqgjXtRM