
System Design Fundamentals: Scalability, Load Balancing, Caching and Databases

Master system design fundamentals for technical interviews. Covers horizontal vs vertical scaling, load balancers, caching strategies, database sharding, CAP theorem, and designing for availability.

Yusuf Seyitoğlu · March 12, 2026 · 13 min read


System design interviews evaluate your ability to architect large-scale distributed systems. Unlike coding interviews with clear right/wrong answers, system design is about reasoning through trade-offs. This guide covers the building blocks and concepts that appear in almost every system design discussion.

How to Approach System Design Interviews

A good framework:

  1. Clarify requirements: functional (what it does) and non-functional (scale, latency, availability)
  2. Estimate scale: DAU, requests per second, data volume
  3. High-level design: draw the major components
  4. Deep dive: detail the components your interviewer cares about
  5. Identify bottlenecks: what breaks at scale, and how do you fix it?

Vertical vs Horizontal Scaling

Vertical scaling (scaling up): Give the server more RAM, CPU, or faster disks.

  • Simple: no code changes
  • Hard limit: you cannot scale a single machine forever
  • Single point of failure

Horizontal scaling (scaling out): Add more servers.

  • No hard limit: add servers as needed
  • Requires stateless application design
  • More complex: load balancing, session management, data consistency

Modern systems scale horizontally. Your application servers should be stateless: any request can be handled by any server. Store session data in Redis, not in-process memory.
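As a sketch of what "stateless" means in practice, the handler below reads session state from a shared external store on every request, so any server can serve any user. A plain dict stands in for Redis here, and `SessionStore`/`handle_request` are illustrative names, not a real API:

```python
class SessionStore:
    """Stand-in for Redis: a shared key-value store (TTL handling omitted)."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def handle_request(server_name, store, session_id):
    """Stateless handler: all session state comes from the shared store."""
    session = store.get(session_id)
    if session is None:
        return f"{server_name}: unauthenticated"
    return f"{server_name}: hello {session['user']}"


store = SessionStore()                  # shared by every app server
store.set("sess-1", {"user": "alice"})

# The same session works no matter which server the load balancer picks.
print(handle_request("server-1", store, "sess-1"))  # server-1: hello alice
print(handle_request("server-2", store, "sess-1"))  # server-2: hello alice
```

If the session lived in one server's process memory instead, the load balancer would have to pin each user to that server, defeating horizontal scaling.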

Load Balancers

A load balancer distributes incoming traffic across multiple servers.

code
                    ┌───────────────┐
Client ────────────►│ Load Balancer │
                    └───────┬───────┘
          ┌─────────────────┼─────────────────┐
          ▼                 ▼                 ▼
       Server 1          Server 2          Server 3

Load balancing algorithms

  • Round robin: distribute requests evenly in sequence (good for uniform workloads)
  • Least connections: send to the server with the fewest active connections (good for variable request durations)
  • IP hash: hash the client IP to a consistent server (useful for session stickiness)
  • Weighted round robin: distribute proportionally to server capacity
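The first two algorithms above can be sketched in a few lines (the server names and connection counts are made up for illustration):

```python
import itertools

servers = ["s1", "s2", "s3"]

# Round robin: cycle through the servers in order.
rr = itertools.cycle(servers)
picks = [next(rr) for _ in range(5)]
print(picks)  # ['s1', 's2', 's3', 's1', 's2']

# Least connections: pick the server with the fewest active connections.
def least_connections(active):
    return min(active, key=active.get)

print(least_connections({"s1": 7, "s2": 2, "s3": 5}))  # s2
```

Round robin needs no state about the backends; least connections needs the balancer to track in-flight requests, which is why it copes better with variable request durations.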

Layer 4 vs Layer 7

  • L4 (transport layer): routes based on IP/TCP; very fast, no content inspection
  • L7 (application layer): routes based on HTTP headers, URL, or cookies, enabling content-aware routing (send /api/* to API servers, /static/* to a CDN)

AWS ALB, Nginx, and HAProxy are L7 load balancers. AWS NLB is L4.
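The content-aware routing an L7 balancer performs boils down to matching request attributes against rules. A toy version, with an invented route table and pool names (real LBs express this in config, e.g. Nginx `location` blocks):

```python
# Prefix → backend pool, checked in order; first match wins.
ROUTES = [
    ("/api/", "api-servers"),
    ("/static/", "cdn"),
]

def route(path, default="web-servers"):
    """Pick a backend pool for a request path (L7-style routing)."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return default

print(route("/api/users/42"))   # api-servers
print(route("/static/app.js"))  # cdn
print(route("/home"))           # web-servers
```

An L4 balancer cannot do this: it never parses HTTP, so the path is invisible to it.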

Caching

Caching stores copies of frequently accessed data closer to the requester, reducing latency and database load.

Cache-Aside (Lazy Loading)

code
App → check cache → hit:  return cached data
                  → miss: query DB → store in cache → return data

Most common pattern. Cache only what is actually requested.
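A minimal cache-aside sketch, with dicts standing in for Redis and the database, and a counter to show the cache absorbing repeat reads:

```python
cache = {}
db = {"user:1": {"name": "alice"}}
db_queries = 0  # count DB hits to demonstrate the cache working

def get_user(key):
    global db_queries
    if key in cache:            # hit: serve straight from cache
        return cache[key]
    db_queries += 1             # miss: query the database...
    value = db.get(key)
    if value is not None:
        cache[key] = value      # ...and populate the cache on the way out
    return value

get_user("user:1")   # miss: goes to the DB
get_user("user:1")   # hit: served from cache
print(db_queries)    # 1
```

In production you would also set a TTL when populating the cache, so stale entries eventually expire even if invalidation is missed.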

Write-Through

Write to cache and DB simultaneously. Cache is always fresh but adds write latency.

Write-Behind (Write-Back)

Write to cache immediately, flush to DB asynchronously. Low write latency but risk of data loss if cache goes down before flush.

Cache Eviction Policies

  • LRU (Least Recently Used): evict the item not accessed for the longest time
  • LFU (Least Frequently Used): evict the item accessed least often
  • TTL (Time to Live): expire items after a fixed duration
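LRU is easy to build on `collections.OrderedDict`, which remembers insertion order and can move keys to the end in O(1). A sketch (capacity 2 just for demonstration; not a production cache):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._d = OrderedDict()

    def get(self, key):
        if key not in self._d:
            return None
        self._d.move_to_end(key)         # mark as most recently used
        return self._d[key]

    def put(self, key, value):
        if key in self._d:
            self._d.move_to_end(key)
        self._d[key] = value
        if len(self._d) > self.capacity:
            self._d.popitem(last=False)  # evict the least recently used item

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")          # touch "a", so "b" is now least recently used
c.put("c", 3)       # over capacity: evicts "b"
print(c.get("b"))   # None
print(c.get("a"))   # 1
```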

CDN (Content Delivery Network)

CDNs cache static assets (images, JS, CSS) at edge servers geographically close to users:

code
User (Tokyo) ──► CDN Edge (Tokyo) ──► Origin Server (US)  [on miss]
                      │
               cache hit: 5ms vs 150ms round trip to origin

Use a CDN for all static assets, and for cacheable API responses when possible.

Cache Invalidation

The hardest problem in caching. Strategies:

  • TTL-based expiry: simple, but stale data is possible
  • Event-driven invalidation: when data changes, explicitly delete or update the affected cache keys
  • Cache-aside with a short TTL: accept eventual consistency

Database Scaling

Read Replicas

Most web applications read far more than they write. Add read replicas to distribute read load:

code
Writes ──────────────► Primary DB
                           │ replication
                      ┌────┴────┐
                      ▼         ▼
Reads ──────────► Replica 1  Replica 2

Tradeoff: replication is asynchronous, so replicas may be slightly behind the primary (replication lag).

Database Sharding

Horizontally partition data across multiple database instances. Each shard holds a subset of the data.

Hash sharding: shard = hash(user_id) % num_shards

code
User 1 ──► Shard 0
User 2 ──► Shard 1
User 3 ──► Shard 0
User 4 ──► Shard 2

Range sharding: Users A–H go to Shard 0, I–P to Shard 1, Q–Z to Shard 2.

Sharding enables near-unlimited horizontal scale but complicates cross-shard queries and transactions.
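A concrete hash-sharding routing function, as a sketch. Note it uses a stable hash (MD5) rather than Python's built-in `hash()`, which is randomized per process and therefore unusable for routing; the function name is illustrative:

```python
import hashlib

def shard_for(user_id, num_shards):
    """Map a key to a shard index: hash(key) % num_shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

NUM_SHARDS = 4
for uid in [1, 2, 3, 4]:
    print(f"user {uid} -> shard {shard_for(uid, NUM_SHARDS)}")

# The same key always routes to the same shard:
assert shard_for(42, NUM_SHARDS) == shard_for(42, NUM_SHARDS)
```

The `% num_shards` step is also the weakness of naive hash sharding: changing the shard count remaps almost every key, which is why systems that reshard often use consistent hashing instead.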

SQL vs NoSQL for Scale

SQL databases (PostgreSQL, MySQL) are strongly consistent and support complex queries. They scale well with read replicas and vertical scaling, but horizontal sharding is complex.

NoSQL databases (DynamoDB, Cassandra, MongoDB) are designed for horizontal scaling from the ground up. They trade ACID guarantees and join support for massive throughput and availability.

Choose based on your access patterns:

  • Complex queries, strong consistency → SQL
  • High write throughput, simple lookups by key → NoSQL
  • Flexible schema, document model → MongoDB
  • Time-series data → TimescaleDB or InfluxDB

CAP Theorem

In a distributed system, you can only guarantee two of three:

  • Consistency (C): every read returns the most recent write
  • Availability (A): every request receives a response (not guaranteed to be current)
  • Partition Tolerance (P): the system keeps working even if some nodes cannot communicate

Network partitions happen in real distributed systems; you cannot avoid P. So the real choice is CP vs AP:

  • CP systems (prefer consistency): ZooKeeper, HBase. During a partition, they reject some requests.
  • AP systems (prefer availability): Cassandra, DynamoDB. During a partition, they may return stale data.

For most web apps, AP with eventual consistency is acceptable (showing a slightly stale tweet count is fine). For financial transactions, CP is required (cannot show wrong account balance).

Availability and Reliability

Availability is usually expressed in "nines":

Availability          Downtime per year
99% (2 nines)         3.65 days
99.9% (3 nines)       8.77 hours
99.99% (4 nines)      52 minutes
99.999% (5 nines)     5.26 minutes
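The table is just arithmetic on the allowed failure fraction, which you can reproduce in a couple of lines:

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766

def downtime_hours(availability_pct):
    """Hours of downtime per year allowed at a given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

print(round(downtime_hours(99.0) / 24, 2))    # 3.65 (days)
print(round(downtime_hours(99.9), 2))         # 8.77 (hours)
print(int(downtime_hours(99.99) * 60))        # 52 (minutes)
print(round(downtime_hours(99.999) * 60, 2))  # 5.26 (minutes)
```

Each extra nine cuts the budget by 10×, which is why five nines leaves barely five minutes a year for every deploy, failover, and incident combined.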

Patterns for high availability

Redundancy: No single points of failure. Multiple app servers, database replicas, multi-AZ deployments.

Health checks and auto-healing: Load balancers stop sending traffic to unhealthy instances. Auto-scaling groups replace failed instances automatically.

Circuit breaker: Prevent cascading failures. When a downstream service fails repeatedly, stop calling it for a period and return a fallback.
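A minimal circuit-breaker sketch of the pattern just described (class name, thresholds, and the fallback convention are all illustrative choices, not a standard API):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None = closed (calls allowed)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback          # open: fail fast, skip the call
            self.opened_at = None        # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0            # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback

def flaky():
    raise ConnectionError("downstream unavailable")

cb = CircuitBreaker(failure_threshold=2)
print(cb.call(flaky, fallback="cached"))  # cached (1st failure)
print(cb.call(flaky, fallback="cached"))  # cached (2nd failure: circuit opens)
print(cb.opened_at is not None)           # True: further calls now fail fast
```

The point is the open state: once tripped, callers get the fallback immediately instead of piling more load and latency onto a struggling downstream service.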

Graceful degradation: If a non-critical component fails (recommendations service, analytics), continue serving core functionality.

Retry with exponential backoff: Automatically retry transient failures with increasing delays.
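The delay schedule for exponential backoff is usually base × 2^attempt, capped, with random jitter added so retrying clients do not synchronize. A sketch that computes the delays (collected rather than slept on, to keep the example fast; in real code you would `time.sleep` each one):

```python
import random

def backoff_delays(retries, base=0.1, cap=10.0):
    """Delay (seconds) before each retry: capped exponential plus jitter."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + random.uniform(0, delay))  # add jitter
        # time.sleep(delays[-1]) would go here in real retry code
    return delays

random.seed(0)  # deterministic for the demo
for d in backoff_delays(5):
    print(round(d, 3))
```

Each delay lands between `base·2^attempt` and twice that, so bursts of failures spread out instead of hammering the recovering service in lockstep.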

Message Queues

Message queues decouple producers from consumers and enable async processing:

code
Web Server ──► Message Queue ──► Worker Service
  (fast)        (Kafka/SQS)     (slow processing)

Benefits:

  • The producer does not block waiting for slow processing
  • Workers can scale independently from the web tier
  • Messages are durable: they are not lost if a worker crashes
  • Traffic spikes are buffered: the queue absorbs bursts, and workers process at their own pace
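The decoupling above can be shown in miniature with the standard library: `queue.Queue` stands in for Kafka/SQS, a thread stands in for the worker service, and the "email" job is invented for the demo:

```python
import queue
import threading

jobs = queue.Queue()
processed = []

def worker():
    """Worker service: drain the queue at its own pace."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel value: shut down
            break
        processed.append(f"sent email to {job}")
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# "Web server": enqueue and return immediately, without waiting on the work.
for user in ["alice", "bob"]:
    jobs.put(user)

jobs.put(None)                   # tell the worker to stop
t.join()
print(processed)  # ['sent email to alice', 'sent email to bob']
```

A real broker adds what this toy lacks: persistence across crashes, delivery acknowledgements, and consumers on separate machines.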

Use cases: sending emails, processing images, generating reports, triggering webhooks, propagating events between microservices.

Back-of-Envelope Estimation

Interviewers expect you to estimate scale. Useful numbers to memorize:

Operation                                 Approximate time
L1 cache reference                        0.5 ns
RAM reference                             100 ns
SSD random read                           100 µs
Network round trip (same datacenter)      0.5 ms
Network round trip (cross-continent)      150 ms
HDD seek                                  10 ms

Traffic math: 1M DAU × 10 requests/day = ~116 requests/second. A single well-tuned server handles thousands of req/s for simple APIs.

Storage math: 1M users × 1KB profile = 1GB. 100M photos × 1MB = 100TB.
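The same arithmetic, spelled out so the units are explicit:

```python
# Traffic: requests per second from daily active users.
dau = 1_000_000
req_per_user_per_day = 10
rps = dau * req_per_user_per_day / 86_400  # 86,400 seconds in a day
print(round(rps))  # 116

# Storage: user profiles.
profile_bytes = 1_000                      # ~1 KB per profile
print(dau * profile_bytes / 1e9)           # 1.0 (GB)

# Storage: photos.
photos = 100_000_000
photo_bytes = 1_000_000                    # ~1 MB per photo
print(photos * photo_bytes / 1e12)         # 100.0 (TB)
```

In an interview, round aggressively (a day is ~10^5 seconds); the goal is the order of magnitude, not the third digit.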

Example: Design a URL Shortener

Quick walkthrough of a common interview question:

Functional requirements: Create short URLs; redirect short → long URL.

Non-functional: 100M URLs created/day, 10B redirects/day (read-heavy, 100:1 read/write).

Core components:

code
Client ──► Load Balancer ──► API Servers ──► Cache (Redis)
                                                  │ (miss)
                                                  ▼
                                     Database (short_url → long_url)

Data model:

code
{ short_code: "abc123", long_url: "https://...", created_at, user_id }

Short code generation: Base62 encode a counter, or take first 6 chars of MD5(long_url). Handle collisions.

Scale: 10B redirects/day = ~115,000 req/s. Cache hot short codes in Redis (99%+ of traffic served from cache), so the database sees mostly writes; the hot redirect path rarely touches it.

Practice on Froquiz

System design concepts appear in senior developer and staff engineer interviews. Explore our backend and infrastructure quizzes on Froquiz to reinforce the fundamentals.

Summary

  • Vertical scaling is simple but limited; horizontal scaling requires stateless app design
  • Load balancers distribute traffic; L7 LBs can route by URL/header
  • Cache-aside is the most common caching pattern; always set a TTL
  • Read replicas scale reads; sharding scales writes. Both add complexity
  • CAP theorem: network partitions are inevitable, so choose CP or AP based on your consistency needs
  • Message queues decouple services and absorb traffic spikes
  • Always clarify requirements and estimate scale before drawing architecture in interviews
