Why Zero Downtime Matters
Every minute your site is offline you lose money, trust, and SEO juice. Zero-downtime deployment lets you ship new code while users keep clicking, buying, and smiling. The goal is simple: replace the old version with the new one without dropping a single request.
The Four Core Strategies at a Glance
- Blue-Green: two identical stacks, switch traffic in seconds
- Canary: feed the new version to 1 % of traffic, grow slowly
- Rolling: replace servers one by one behind a load balancer
- Feature Flags: hide new code behind toggles, activate without redeploy
Blue-Green Deployment Step by Step
1. Build the Green Stack
Create a clone of your live Blue environment: same VM image, same disk, same everything. Point Green at a copy of the production database or use a read replica.
2. Run Smoke Tests
Hit Green with automated health checks, API contracts, and end-to-end suites. If any test fails, tear Green down and fix the build.
3. Switch Traffic
Update the load balancer or flip the DNS record (keep the TTL low so the change propagates quickly) to route 100 % of traffic to Green. Keep Blue warm for instant rollback; a Kubernetes sketch of the cutover follows step 4.
4. Monitor and Cleanup
Watch error rates for fifteen minutes. If all is calm, terminate Blue and snapshot logs. If alarms fire, switch back to Blue in under ten seconds.
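If your stacks run on Kubernetes, the cutover in step 3 can be a one-line change: a Service selects pods by a version label, and pointing it from blue to green moves traffic in a single API call. The sketch below assumes hypothetical checkout Deployments labeled version: blue and version: green; adapt names and ports to your own setup.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: checkout
spec:
  selector:
    app: checkout
    version: green   # was "blue"; changing this one field cuts all traffic over
  ports:
    - port: 80
      targetPort: 8080
```

Re-applying the same manifest with version: blue is the ten-second rollback button mentioned in step 4.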
Pros and Cons
Blue-Green is brute-force simple and gives you a big red rollback button. The downside is cost: you pay for double infrastructure and you need enough database headroom for two active stacks.
Canary Release: Taste Before You Swallow
Start Tiny
Deploy the new build to a single pod, VM, or Lambda alias tagged as version v2. Route 1 % of traffic to it using headers, cookies, or random sampling.
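As one concrete (and assumed) setup, an Istio VirtualService handles the random-sampling variant with a 99/1 weighted split between subsets v1 and v2. The checkout host, subset labels, and weights below are illustrative, not a prescribed config.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 99
        - destination:
            host: checkout
            subset: v2
          weight: 1        # the canary slice
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```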
Measure Everything
Track latency P99, error rate, cart conversions, and custom business KPIs. Promote the canary to 5 %, 25 %, 50 %, 100 % only when metrics stay flat or improve.
Automatic Rollback
Set SLO violations as kill switches. If error budget burns faster than 2 % in five minutes, shift traffic back to v1 automatically.
Tools That Help
Flagger and Argo Rollouts expose canary objects with metric-based promotion, and they plug into traffic layers such as Istio, AWS App Mesh, or the load balancing built into Google Kubernetes Engine. You write the YAML once and the controller does the rest.
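For example, a Flagger Canary object might look like the sketch below: the controller walks traffic through the listed weights and rolls back automatically if success rate or latency breaches a threshold. Names, namespace, and threshold values are placeholders, not a drop-in config.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5                  # roll back after five failed checks
    stepWeights: [1, 5, 25, 50]   # the 1 % -> 100 % ladder from above
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99                 # % of successful requests
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500                # latency in milliseconds
        interval: 1m
```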
Rolling Deployment: Keep the Fleet Afloat
How It Works
Behind a load balancer you drain one node, update its code, health-check it, then return it to the pool. Repeat until the fleet is new.
The Draining Dance
Signal the node to stop accepting new connections. Wait for in-flight requests to finish—usually thirty seconds for REST, longer for WebSockets. Then terminate the old process.
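On Kubernetes, this draining dance is usually expressed as a preStop sleep plus a generous termination grace period, as in the pod-template excerpt below (image name and timings are assumptions):

```yaml
spec:
  terminationGracePeriodSeconds: 60        # room for slow in-flight requests
  containers:
    - name: api
      image: registry.example.com/api:v2   # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 30"]   # keep serving while the LB deregisters the pod
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
```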
Scaling Gotchas
Never roll more than 20 % of capacity at once or you risk brownouts. Use autoscaling buffers: if your nominal fleet is ten servers, scale to twelve before you start rolling.
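In a Kubernetes Deployment, those two rules (limit churn, keep a surge buffer) map directly onto the rolling-update strategy fields. The manifest below is a sketch with placeholder names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # the "scale to twelve" buffer
      maxUnavailable: 0    # never dip below nominal capacity
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2   # hypothetical image
          ports:
            - containerPort: 8080
```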
Database Migrations
Rolling deploys can collide with schema changes. Add nullable columns first, deploy code that reads both old and new shapes, then drop deprecated fields in a later release.
Feature Flags: Deploy Now, Release Later
Decouple Deploy and Release
Push the artifact once, then flip features on for internal users, beta testers, or 10 % of Canada. No new binary needed.
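Concretely, a flag definition often lives in a small config file or a flag service. The YAML below is an illustrative schema, not any particular vendor's format:

```yaml
# flags.yaml -- illustrative, not tied to a specific flag provider
flags:
  CHK-1234-new-checkout:        # ticket ID baked into the flag name
    enabled: true
    kill_switch: true           # can be forced off without a redeploy
    expires: 2025-06-30
    rollout:
      - segment: internal-users
        percentage: 100
      - segment: country:CA
        percentage: 10
```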
Flag Lifecycle
Name flags with ticket IDs, wrap them in kill switches, and set automatic expiry dates. Clean up stale flags every sprint to avoid technical debt.
Testing Matrix
Create a test that spins up the app with all flags on and another with all flags off. This catches interaction bugs before they hit production.
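In CI this can be a two-entry matrix. The GitHub Actions job below assumes the suite reads a hypothetical FEATURE_FLAGS_OVERRIDE variable and that `make test` is your entry point:

```yaml
jobs:
  flag-matrix:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        flag_state: [all_on, all_off]
    steps:
      - uses: actions/checkout@v4
      - name: Run the suite with flags forced ${{ matrix.flag_state }}
        run: make test
        env:
          FEATURE_FLAGS_OVERRIDE: ${{ matrix.flag_state }}
```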
Database Zero Downtime Patterns
Expand, Then Contract
Add new tables and columns without touching old ones. Dual-write in the application layer. Backfill data offline. Only when the new path is 100 % live do you remove the old columns.
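The expand step is just an additive migration. As a sketch, here is what it might look like as a Liquibase-style YAML changelog; the orders table and shipping_status column are hypothetical:

```yaml
databaseChangeLog:
  - changeSet:
      id: 101-add-shipping-status
      author: platform-team
      changes:
        - addColumn:
            tableName: orders
            columns:
              - column:
                  name: shipping_status
                  type: varchar(32)
                  constraints:
                    nullable: true    # nullable first, so old code keeps working
```

The matching contract changeset that drops the deprecated column ships only after the dual-write and backfill are complete.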
Blue-Green for Databases
Use read replicas or logical replication to keep Green DB in sync. Cut writes over by pausing the app for milliseconds using a feature flag. Reverse replication gives you a rollback path.
Load Balancer Tricks
Set the TTL to 30 s for DNS-based switches. Use connection draining on AWS ALB, Google Cloud Load Balancing, and NGINX Plus. For gRPC, enable graceful shutdown with GOAWAY frames so clients reconnect to fresh pods.
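As one example of the draining knob, the AWS Load Balancer Controller lets you set the target-group deregistration delay from an Ingress annotation; the manifest below is a sketch with placeholder names:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout
                port:
                  number: 80
```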
Monitoring the Invisible
Deploys fail quietly. Add synthetic checks that log in, add an item, and checkout every minute. Tag metrics with build SHA so you can diff v1.2.3 against v1.2.4 latency curves.
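A minimal version of such a probe is a Kubernetes CronJob that curls the deployed site every minute; a full journey test (log in, add an item, checkout) would swap the curl container for a headless-browser image. The host name and path below are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: synthetic-checkout
spec:
  schedule: "* * * * *"              # every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: curlimages/curl:latest
              args: ["-fsS", "https://shop.example.com/api/health"]   # fail the job on non-2xx
```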
Rollback Horror Stories
A major European retailer once blue-green switched without warming the Green JVM. The cold Java heap caused 4 s GC pauses and every user refreshed, creating a thundering herd. Always warm the pool with synthetic traffic before you cut live users.
Putting It Together: A Sample GitHub Actions Workflow
The workflow combines canary releases and feature flags; a YAML sketch follows the outline:
- Build Docker image tagged with git SHA
- Deploy to staging, run contract tests
- Create canary ReplicaSet at 1 % traffic
- Promote 10 %, 50 %, 100 % every 10 min if SLOs pass
- Flip feature flag for new checkout flow after 100 %
Each stage gates on Prometheus alerts; any violation triggers automatic rollback to the previous stable SHA.
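Here is a sketch of that workflow. It assumes images are pushed to GitHub Container Registry, that registry and cluster credentials are already configured on the runner, and that scripts/contract-tests.sh, scripts/wait-for-promotion.sh, and scripts/flip-flag.sh are your own hypothetical helpers; the Prometheus-based SLO gating lives in the canary controller, which wait-for-promotion.sh merely polls.

```yaml
name: canary-deploy
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image tagged with the git SHA
        run: |
          docker build -t ghcr.io/acme/checkout:${{ github.sha }} .
          docker push ghcr.io/acme/checkout:${{ github.sha }}

  staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging and run contract tests
        run: |
          kubectl --context staging set image deployment/checkout \
            checkout=ghcr.io/acme/checkout:${{ github.sha }}
          ./scripts/contract-tests.sh https://staging.example.com

  canary:
    needs: staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Roll out the canary and wait for metric-gated promotion
        run: |
          kubectl --context prod set image deployment/checkout \
            checkout=ghcr.io/acme/checkout:${{ github.sha }}
          ./scripts/wait-for-promotion.sh checkout    # exits non-zero if the controller rolled back
      - name: Flip the new-checkout feature flag at 100 %
        run: ./scripts/flip-flag.sh CHK-1234-new-checkout on
```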
Choosing the Right Strategy
| Strategy | Rollout Time | Risk | Cost | Best For |
|---|---|---|---|---|
| Blue-Green | 1 min | Low | High | Legacy monoliths |
| Canary | 10 min | Medium | Medium | API services |
| Rolling | 30 min | Medium | Low | Stateless microservices |
| Feature Flags | 0 min | Low | Low | UI changes |
Common Pitfalls and How to Dodge Them
Session Affinity
Sticky sessions break rolling deploys. Externalize session state to Redis or JWT instead.
Caching
New code may serialize objects differently. Version your cache keys so old and new shapes coexist.
Resource Limits
Canary pods on undersized nodes can OOM and skew metrics. Use guaranteed QoS and set equal CPU/memory requests and limits.
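The Guaranteed QoS class in Kubernetes simply means requests equal limits for every container, as in this excerpt (values are placeholders):

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # equal to the request -> Guaranteed QoS
    memory: "512Mi"
```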
Team Culture Checklist
- Merge to main only if the commit is production ready
- Every pull request links to the relevant monitoring dashboards
- On-call engineer owns the deploy, not a release manager
- Post-mortem every rollback within 24 h
Next Steps
Pick one service this sprint and implement canary releases. Start with 30 % synthetic traffic, add SLO gates, then invite real users. Once you trust the pipeline, extend it to the rest of the fleet. Zero downtime is not a myth; it is a habit you practice every deploy day.
Disclaimer: This article is for educational purposes only and was generated by an AI language model. Always test deployment strategies in a staging environment before touching production.