Chapter 14: Design YouTube

What is YouTube?

YouTube is a video-sharing platform that allows users to upload, view, and share videos.

Problem Clarification & Scope Definition

Clarification Questions

Question	Example Answer
What’s the primary focus?	Video upload and playback
Device types supported?	Web, mobile app, smart TVs
Are users global?	Yes, worldwide audience
Video duration or size limit?	1 GB max per upload
Use cloud providers?	Yes (AWS/GCP/Azure acceptable)

High-Level Requirements & Goals

Functional Requirements

Upload videos
Store and process video files
Stream videos efficiently to users
Serve multiple video qualities (e.g., 240p → 1080p)
Secure content access
Support recommendation and metadata search

Non-Functional Requirements

Highly scalable
Globally available
Fault-tolerant
Secure and cost-efficient
Smooth playback with adaptive streaming

Back-of-the-Envelope Estimation

Given assumptions:

Daily Active Users (DAU): 5 million
10% upload one 300 MB video per day
Average user watches 5 videos/day

Compute storage and bandwidth estimates

Uploads:

5M × 10% × 300 MB = 150 TB/day

CDN outgoing traffic (for playback):

5M × 5 × 300 MB = 7.5 PB/day

Storage cost (if 1 TB ≈ $20/month):

150 TB × $20 ≈ $3,000/month added by each day of raw uploads, excluding redundancy

CDN egress cost (~$0.02/GB):

7.5 PB = 7,500,000 GB; 7,500,000 × $0.02 = $150,000/day

⚡ Key insight: CDN cost dominates operational expenses — optimizing it is crucial for cost efficiency.
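
The arithmetic above can be re-checked with a short script. This is only a sketch: it uses the chapter's stated assumptions and decimal units (1 TB = 1,000,000 MB), and the variable names are illustrative.

```python
# Inputs copied from the assumptions in this section.
DAU = 5_000_000
UPLOADER_PERCENT = 10            # share of users uploading one video per day
VIDEO_SIZE_MB = 300
VIEWS_PER_USER = 5
STORAGE_USD_PER_TB_MONTH = 20
EGRESS_USD_PER_GB = 0.02

MB_PER_GB = 1_000                # decimal units, an assumption
MB_PER_TB = 1_000_000

uploads_per_day = DAU * UPLOADER_PERCENT // 100
upload_tb_per_day = uploads_per_day * VIDEO_SIZE_MB / MB_PER_TB

egress_gb_per_day = DAU * VIEWS_PER_USER * VIDEO_SIZE_MB / MB_PER_GB
egress_pb_per_day = egress_gb_per_day / 1_000_000

# Recurring monthly storage cost added by one day of uploads.
storage_usd_per_month = upload_tb_per_day * STORAGE_USD_PER_TB_MONTH
egress_usd_per_day = egress_gb_per_day * EGRESS_USD_PER_GB

print(f"uploads: {upload_tb_per_day:.0f} TB/day")
print(f"egress:  {egress_pb_per_day:.1f} PB/day")
print(f"storage: ${storage_usd_per_month:,.0f}/month per day of uploads")
print(f"egress:  ${egress_usd_per_day:,.0f}/day")
```

Changing any single input (for example, a smaller average watched size because viewers rarely finish a video) shows quickly how sensitive the egress bill is compared with storage.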

High-Level System Architecture

Core Components Overview

Client (Web/Mobile) — Video upload & playback interface
API Gateway / Load Balancer — Routes requests to backend microservices
Application Servers — Handle metadata, authentication, user requests
Metadata Database + Cache — Store video metadata (title, description, owner, tags, duration)
Object Storage (Blob Storage) — Store raw and processed videos (e.g., AWS S3, Google Cloud Storage)
Transcoding Service — Converts raw uploads into multiple formats/resolutions
CDN (Content Delivery Network) — Cache videos near end-users
Message Queues — Orchestrate asynchronous operations (upload → transcode → CDN sync)

Video Upload

Step-by-Step Flow

User uploads a video via pre-signed URL → stored temporarily in Raw Storage
Upload metadata (title, size, etc.) sent to API Server
Upload event triggers a message in the transcoding queue
Transcoding Service fetches the original file → generates multiple resolutions
Processed videos stored in Transcoded Storage
CDN caches the newest version
Completion handler updates video status in Metadata DB
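
The flow above can be sketched as a small in-memory simulation. Everything here is a stand-in: the dictionary plays the Metadata DB, `queue.Queue` plays Kafka/SQS, and the function names and status values are hypothetical rather than part of any real service.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Video:
    video_id: str
    title: str
    status: str = "UPLOADED"
    renditions: list = field(default_factory=list)


metadata_db = {}                  # stand-in for the Metadata DB
transcode_queue = queue.Queue()   # stand-in for Kafka/SQS


def on_upload_complete(video_id, title):
    """Steps 2-3: record metadata, then enqueue a transcoding job."""
    metadata_db[video_id] = Video(video_id, title)
    transcode_queue.put(video_id)


def on_transcode_complete(video_id):
    """Step 7: the completion handler marks the video as playable."""
    metadata_db[video_id].status = "READY"


def transcode_worker():
    """Steps 4-5: drain the queue and 'produce' one rendition per resolution."""
    while not transcode_queue.empty():
        video_id = transcode_queue.get()
        video = metadata_db[video_id]
        video.status = "TRANSCODING"
        video.renditions = [
            f"{video_id}/{res}.m3u8" for res in ("240p", "480p", "720p", "1080p")
        ]
        on_transcode_complete(video_id)


on_upload_complete("v1", "Demo video")
transcode_worker()
print(metadata_db["v1"].status, metadata_db["v1"].renditions)
```

The point of the queue is that the upload request returns as soon as the job is enqueued; the worker can run later, on a different machine, and be scaled independently.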

Core Components

Component	Purpose
Upload Service	Handles pre-signed URLs, access control
Message Queue (Kafka/SQS)	Connects upload → transcoding asynchronously
Transcoding Worker Pool	Dynamically scalable workers (using Kubernetes or AWS Batch)
Completion Handler	Metadata update; triggers email/notification

💡 Interview Expansion:

How do you ensure video integrity during upload?

How do you support resumable uploads (especially on mobile)?

How would you handle upload retry or failure?
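
One possible answer to all three questions is chunked upload with per-chunk checksums. The sketch below is a toy, assuming a hypothetical server-side `UploadSession` and a tiny chunk size for readability (a real client would more likely use chunks of several MB).

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny for the demo


def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class UploadSession:
    """Hypothetical server-side record of which chunks have arrived."""

    def __init__(self, total_chunks, expected_sha256):
        self.total_chunks = total_chunks
        self.expected_sha256 = expected_sha256
        self.chunks = {}

    def put_chunk(self, index, payload, sha256_hex):
        # Integrity: reject a chunk whose checksum does not match.
        if hashlib.sha256(payload).hexdigest() != sha256_hex:
            return False  # client should retry this chunk
        self.chunks[index] = payload  # re-sending the same chunk is harmless
        return True

    def missing(self):
        # Resume: the client asks which chunks are still needed.
        return [i for i in range(self.total_chunks) if i not in self.chunks]

    def finalize(self):
        if self.missing():
            raise ValueError("upload incomplete")
        data = b"".join(self.chunks[i] for i in range(self.total_chunks))
        if hashlib.sha256(data).hexdigest() != self.expected_sha256:
            raise ValueError("whole-file checksum mismatch")
        return data
```

After a dropped connection, the client calls `missing()` and sends only those chunks, so a retry never restarts the whole file.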

Video Streaming Architecture

Playback Flow

User requests playback → API returns video metadata
Player fetches the stream manifest (an HLS .m3u8 playlist or a DASH .mpd file)
CDN delivers video chunks sequentially (e.g., 4s segments)
Client dynamically switches bitrate according to network conditions
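
The last step, client-side bitrate switching, can be illustrated with a simple throughput-based rule. This is a sketch, not how any particular player works: the ladder values are borrowed from the Typical Output Formats table in this chapter, and the 0.8 safety factor is an assumption.

```python
# Bitrate ladder in Kbps, taken from the Typical Output Formats table.
LADDER_KBPS = {"240p": 400, "480p": 1000, "720p": 2500, "1080p": 5000}


def pick_rendition(throughput_kbps, safety=0.8):
    """Pick the highest rendition whose bitrate fits within a fraction
    of the measured throughput; fall back to the lowest one otherwise."""
    budget = throughput_kbps * safety
    fitting = [(kbps, name) for name, kbps in LADDER_KBPS.items() if kbps <= budget]
    if not fitting:
        return min(LADDER_KBPS, key=LADDER_KBPS.get)
    return max(fitting)[1]


for measured in (300, 1500, 3500, 10_000):
    print(measured, "Kbps ->", pick_rendition(measured))
```

A player would re-run a rule like this after each downloaded segment, which is why short segments let it react quickly to changing network conditions. Production players usually also consider buffer level, not only throughput.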

Key Streaming Protocols

HLS (HTTP Live Streaming)
MPEG-DASH
Microsoft Smooth Streaming

CDN Strategies

Cache popular videos near users
Regional PoPs (Points of Presence)
Hybrid push/pull cache population, with explicit invalidation when a video changes
Cache miss fallback → fetch from transcoded storage
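
The cache-miss fallback can be sketched as a small LRU cache sitting in front of an origin fetch. The `EdgeCache` class and the `origin_fetch` callback are hypothetical; a real PoP adds TTLs, request coalescing, and tiering on top of this idea.

```python
from collections import OrderedDict


class EdgeCache:
    """Toy LRU cache for one PoP; misses fall back to the origin."""

    def __init__(self, capacity, origin_fetch):
        self.capacity = capacity
        self.origin_fetch = origin_fetch  # e.g. a read from transcoded storage
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.origin_fetch(key)       # cache miss: go to the origin
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used
        return value


cache = EdgeCache(capacity=2, origin_fetch=lambda key: f"bytes:{key}")
for segment in ("a", "b", "a", "c", "b"):
    cache.get(segment)
print(cache.hits, cache.misses)
```

Popular segments stay resident because every hit refreshes their position, while rarely requested ones are evicted first.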

Video Transcoding System (DAG Architecture)

Why do we need transcoding?

Multiple device compatibility
Adaptive bitrate support
Save space & optimize performance

Typical Output Formats

Resolution	Codec	Approx. Bitrate
240p	H.264	400 Kbps
480p	H.264	1 Mbps
720p	H.264	2.5 Mbps
1080p	H.264	5 Mbps
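
A ladder like the one in the table is what an HLS master playlist advertises to the player. The sketch below builds a minimal one; the pixel dimensions assume 16:9 video, the per-rendition paths are hypothetical, and using the table's approximate bitrate as the `BANDWIDTH` value is a simplification.

```python
# (name, width, height, approx. bitrate in Kbps); widths assume 16:9 video.
LADDER = [
    ("240p", 426, 240, 400),
    ("480p", 854, 480, 1000),
    ("720p", 1280, 720, 2500),
    ("1080p", 1920, 1080, 5000),
]


def build_master_playlist(ladder=LADDER):
    """Return a minimal HLS master playlist with one variant per rendition."""
    lines = ["#EXTM3U"]
    for name, width, height, kbps in ladder:
        # BANDWIDTH is expressed in bits per second.
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={kbps * 1000},RESOLUTION={width}x{height}"
        )
        lines.append(f"{name}/playlist.m3u8")  # hypothetical path layout
    return "\n".join(lines) + "\n"


print(build_master_playlist())
```

A production playlist would normally also declare a `CODECS` attribute for each variant; it is omitted here to keep the example short.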

DAG (Directed Acyclic Graph)

Used to model transcoding jobs:

Nodes: Independent transcoding tasks (e.g., thumbnail generation, encoding, watermarking)
Edges: Dependency links
Enables parallel processing and recovery from partial failure
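
A DAG like this can be scheduled with Python's standard `graphlib` module (Python 3.9+). The task names below are hypothetical; the sketch groups tasks into "waves" in which every task can run in parallel because all of its dependencies have already finished.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (hypothetical task names)
TRANSCODE_DAG = {
    "inspect": set(),
    "split_audio_video": {"inspect"},
    "thumbnail": {"inspect"},
    "encode_240p": {"split_audio_video"},
    "encode_720p": {"split_audio_video"},
    "encode_audio": {"split_audio_video"},
    "package": {"encode_240p", "encode_720p", "encode_audio"},
}


def execution_waves(graph):
    """Group tasks into waves; tasks within a wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(TRANSCODE_DAG), start=1):
    print(number, wave)
```

If one encode fails, only that node and the nodes downstream of it need to be re-run, which is the partial-failure recovery mentioned above.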

Tools & Frameworks

FFmpeg, AWS Elastic Transcoder, GCP Transcoder API
Workflow orchestration via Airflow, Celery, or Kubernetes Jobs

Optimization

Performance Optimization

Chunked uploads for resume support
Parallel transcoding by partitioning video
Pre-computed thumbnails for faster UI rendering
CDN tier caching (regional + global levels)

Security Optimization

Pre-signed URLs for controlled upload access
Encrypted HLS (AES-128) or DRM (Widevine, PlayReady)
Token-based access for temporary playback
Digital watermarking for copyright tracking
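
Pre-signed URLs and temporary playback tokens rest on the same idea: the server signs a resource identifier plus an expiry time, and anything tampered with or expired fails verification. The sketch below uses HMAC-SHA256 from the standard library; the token layout, function names, and secret are assumptions for illustration, not the scheme of any cloud provider.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical; load from a secret manager in practice


def _signature(video_id, expires_at, secret):
    message = f"{video_id}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_playback_token(video_id, expires_at, secret=SECRET):
    """Return 'video_id:expires_at:signature' (expires_at is a Unix timestamp)."""
    return f"{video_id}:{expires_at}:{_signature(video_id, expires_at, secret)}"


def verify_playback_token(token, now=None, secret=SECRET):
    now = time.time() if now is None else now
    try:
        video_id, expires_str, signature = token.rsplit(":", 2)
        expires_at = int(expires_str)
    except ValueError:
        return False  # malformed token
    expected = _signature(video_id, expires_at, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    same = hmac.compare_digest(signature.encode(), expected.encode())
    return same and now < expires_at


token = sign_playback_token("v1", expires_at=int(time.time()) + 300)
print(verify_playback_token(token))
```

Because the expiry is part of the signed message, a client cannot extend a token's lifetime without invalidating the signature.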

Cost Optimization

Cold storage for infrequently watched videos
Adaptive transcoding based on video popularity
CDN cache-tier retention policy tuning
Batch processing during off-peak cloud hours

Fault Tolerance & Scalability

Component	Failure Handling
Upload Service	Retry policy + checkpointed resume
Transcoding Workers	Retries + job requeueing
Metadata DB	Primary-replica replication, automatic failover
Cache Layer	Data eviction & hot-node rebalancing
Queue System	Idempotent message processing
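
Idempotent message processing matters because most queues deliver at least once, so a worker may see the same message twice. A minimal sketch, assuming each message carries a unique `message_id` and using an in-memory set where a real system would use a durable store:

```python
processed_ids = set()          # stand-in for a durable deduplication store
transcode_jobs_started = []    # records the side effects for the demo


def handle_message(message):
    """Process a message at most once; return True only if work was done."""
    message_id = message["message_id"]
    if message_id in processed_ids:
        return False  # duplicate delivery: skip without side effects
    transcode_jobs_started.append(message["video_id"])
    processed_ids.add(message_id)
    return True


msg = {"message_id": "m-1", "video_id": "v1"}
print(handle_message(msg), handle_message(msg))
```

In a real worker the side effect and the "processed" record should be committed together (or the side effect itself made repeatable), otherwise a crash between the two steps reintroduces duplicates.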

Horizontal Scalability

Stateless API servers → easy scale-out
Video storage → partition by videoID prefix
Transcoding → serverless or containerized scaling
CDN → distributed PoPs for global reach
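
Partitioning storage by a videoID prefix can be sketched as follows. Deriving the prefix from a hash of the ID, rather than from the raw ID, is an assumption made here so that sequential IDs do not pile onto one partition; the shard count and key layout are also hypothetical.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count


def storage_prefix(video_id):
    """Two hex characters derived from the ID: 256 evenly spread buckets."""
    return hashlib.sha256(video_id.encode()).hexdigest()[:2]


def shard_for(video_id, num_shards=NUM_SHARDS):
    """Map the prefix bucket onto one of the storage shards."""
    return int(storage_prefix(video_id), 16) % num_shards


def object_key(video_id, rendition):
    """Hypothetical object-storage key with the prefix leading the path."""
    return f"{storage_prefix(video_id)}/{video_id}/{rendition}.m3u8"


print(object_key("abc", "720p"), "-> shard", shard_for("abc"))
```

The mapping is deterministic, so any stateless API server can compute where a video lives without a lookup. Changing `NUM_SHARDS` remaps most buckets, which is why schemes such as consistent hashing are often preferred when shards are added frequently.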

Summary

Reliable uploads
Efficient, DAG-based transcoding
Cost-effective global delivery via CDNs
Adaptive streaming and smart caching
Resilient queues for asynchronous workflows
Iterate with metrics to balance quality, latency, and spend
