The complete guide to SLO tracking for modern infrastructure
Define SLOs, track compliance, manage error budgets. Complete guide to implementing SLO tracking for reliability and accountability.
up0 Team
What Is an SLO?
An SLO (Service Level Objective) is a measurable target for your service's reliability and performance.
It's the answer to: "How reliable does our service need to be?"
SLO vs SLA vs SLI
SLI (Service Level Indicator): The actual measurement
Uptime: 99.87%
P95 Latency: 142ms
Error Rate: 0.23%
SLO (Service Level Objective): Your target
Uptime: >= 99.9%
P95 Latency: <= 200ms
Error Rate: <= 1%
SLA (Service Level Agreement): The contractual commitment
Uptime: >= 99.5% or we pay credits
P95 Latency: <= 500ms or we owe refunds
In this guide, we focus on SLOs—the internal targets that drive reliability decisions.
Why SLOs Matter
1. Alignment
Without SLOs, different teams have different reliability expectations:
- Frontend wants "never break"
- Backend says "99% uptime is fine"
- Operations targets "zero incidents"
SLOs create a shared language.
2. Trade-off Decisions
SLOs help you answer: "Should we ship this feature or focus on reliability?"
If we're at 98% uptime (below our 99.9% SLO):
→ Focus on stability, delay feature release
If we're at 99.95% uptime (above our 99.9% SLO):
→ We have error budget, ship the feature
3. Alerting Strategy
SLOs drive your alerting rules. For example, against a 99% availability SLO (a 1% error budget):
At 99.5% (50% error budget remaining):
→ Start investigations, increase monitoring
At 99.2% (80% error budget used):
→ Critical alert, page on-call, pause deployments
At 99.0% (100% error budget used):
→ SLO breach, incident response mode
4. Accountability
With SLOs, you can say:
- "We're meeting our reliability commitment" or
- "We're not and here's what we'll do about it"
SLOs create accountability.
Defining Your SLOs
Step 1: Choose What to Measure
Availability SLO: Is the service responding?
SLO: 99.9% of requests succeed
Measured as: successful responses / total requests
Latency SLO: Is the service responding fast enough?
SLO: p95 latency <= 200ms
Measured as: 95th percentile response time
Freshness SLO: Is the data fresh enough?
SLO: 95% of data queries return data <= 5min old
Measured as: percentage of queries returning fresh data
Correctness SLO: Is the service giving correct results?
SLO: 99.99% of transactions processed correctly
Measured as: correct transactions / total transactions
Most services track 2-3 key SLOs, not everything.
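For reference, those few SLOs can be captured as plain data and shared between dashboards and alerting. A minimal sketch (the field names are illustrative, not a specific tool's schema):

```javascript
// Hypothetical SLO definitions for a customer-facing API.
// Targets and windows are examples; adjust to your own service.
const slos = [
  { name: 'availability', target: 0.999, window: '30d', sli: 'successful_requests / total_requests' },
  { name: 'latency_p95',  target: 200,   unit: 'ms', window: '30d', sli: '95th percentile response time' },
  { name: 'error_rate',   target: 0.001, window: '30d', sli: 'error_requests / total_requests' },
];
```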
Step 2: Choose Your Measurement Window
Per-minute: Too granular, too noisy
Minute-level uptime usually reads 100% (one failure in 1,000 requests barely registers)
Per-hour: Better for catching issues
Hourly uptime = 99.75% (issues within the hour are visible)
Per-day: Good operational window
Daily uptime = 99.2% (multi-hour outages show clearly)
Per-month/quarter: Standard for SLO compliance
Monthly uptime = 99.9% (the standard SLO window)
Quarterly uptime = 99.95% (longer window; slower to recover once budget is spent)
Most teams track:
- Hourly (for immediate decision making)
- Monthly (for SLO compliance)
Step 3: Set The Target
Your target should be:
1. Achievable without heroics
   - If you need to work weekends constantly to hit it, it's too aggressive
2. Meaningful for users
   - 99% vs 99.9% feels different to users
   - 99.99% is usually overkill for most services
3. Defensible financially
   - Higher SLOs cost more (redundancy, failover, testing)
   - Does the business value justify the cost?
Example SLO Targets
Customer-facing API
Availability: >= 99.9% (43 minutes downtime/month allowed)
Latency (p95): <= 200ms
Error rate: <= 0.1%
Internal batch processing
Availability: >= 99% (7.2 hours downtime/month allowed)
Latency: <= 30 minutes (job completion time)
No specific latency percentile needed
Background queue service
Availability: >= 99.5% (3.6 hours downtime/month allowed)
Processing latency (p95): <= 5 minutes
No user-facing SLO needed
Error Budgets
Once you define an SLO, you get an error budget: how much downtime you can tolerate while still meeting your SLO.
Calculating Error Budget
SLO: 99.9% availability
Meaning: the service must be up 99.9% of the time
Error budget = 100% - 99.9% = 0.1%
Monthly (30 days):
0.1% of 30 days = 0.03 days = 43 minutes
So you can tolerate 43 minutes of downtime per month
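That conversion is simple enough to keep as a small helper next to your SLO definitions. A minimal sketch:

```javascript
// Convert an SLO target into an error budget (allowed downtime) for a period.
function errorBudgetMinutes(sloTarget, periodDays) {
  const totalMinutes = periodDays * 24 * 60;
  return (1 - sloTarget) * totalMinutes;
}

console.log(errorBudgetMinutes(0.999, 30)); // ≈ 43.2 minutes per 30-day month
console.log(errorBudgetMinutes(0.999, 90)); // ≈ 129.6 minutes per quarter
```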
More Examples
| SLO | Error Budget (Month) | Error Budget (Quarter) |
|---|---|---|
| 99% | 7.2 hours | 21.6 hours |
| 99.5% | 3.6 hours | 10.8 hours |
| 99.9% | 43 minutes | 130 minutes |
| 99.95% | 21 minutes | 65 minutes |
| 99.99% | 4.3 minutes | 13 minutes |
Using Error Budget
At beginning of month:
Availability SLO: 99.9%
Error budget: 43 minutes
Consumed: 0 minutes
Remaining: 43 minutes
After small outage (15 minutes):
Consumed: 15 minutes
Remaining: 28 minutes
After second outage (20 minutes):
Consumed: 35 minutes
Remaining: 8 minutes
After third outage (10 minutes):
Consumed: 45 minutes (exceeded!)
Status: SLO BREACHED
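The same walkthrough can be replayed in code against the 43-minute budget, which is handy for sanity-checking your budget math. A minimal sketch:

```javascript
// Sum a month's outages against the monthly error budget.
function budgetStatus(budgetMinutes, outageMinutes) {
  const consumed = outageMinutes.reduce((sum, minutes) => sum + minutes, 0);
  const remaining = budgetMinutes - consumed;
  return { consumed, remaining, breached: remaining < 0 };
}

console.log(budgetStatus(43, [15]));         // { consumed: 15, remaining: 28, breached: false }
console.log(budgetStatus(43, [15, 20]));     // { consumed: 35, remaining: 8, breached: false }
console.log(budgetStatus(43, [15, 20, 10])); // { consumed: 45, remaining: -2, breached: true }
```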
Error Budget Policies
Aggressive: Spend budget to ship features
If budget > 50%: Ship features freely
If budget 10-50%: Ship only critical features
If budget < 10%: No new features, focus on stability
Conservative: Save budget for emergencies
If budget > 75%: Consider shipping
If budget 25-75%: Ship only essential features
If budget < 25%: Stability only
Balanced: Allow some spending, protect some buffer
Target remaining budget: >= 20%
If above 20%: OK to deploy and experiment
If below 20%: Stability focus
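Whichever policy you pick, it helps to encode it as an explicit check rather than tribal knowledge. A sketch of a deploy gate for the balanced policy, assuming the 20% buffer above:

```javascript
// Balanced policy: deploy freely only while at least 20% of the budget remains.
function canDeploy(remainingBudgetMinutes, monthlyBudgetMinutes, minBufferFraction = 0.2) {
  const remainingFraction = remainingBudgetMinutes / monthlyBudgetMinutes;
  return remainingFraction >= minBufferFraction;
}

console.log(canDeploy(35, 43)); // true: about 81% of budget remains, OK to deploy and experiment
console.log(canDeploy(5, 43));  // false: below the 20% buffer, stability focus
```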
Implementing SLO Tracking
1. Define Your SLIs (measurements)
Choose exactly how you'll measure each SLO:
Availability SLO: 99.9%
SLI Definition:
- Measure: HTTP requests to production API
- Success: Response code 2xx or 3xx
- Exclude: Requests from our health checks
- Exclude: Requests with invalid authentication
- Window: Rolling 30-day average
Being specific prevents interpretation issues.
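One way to prevent drift is to encode the inclusion and exclusion rules in the same code that counts requests. A minimal sketch (the request shape, health-check path, and status codes are assumptions for illustration, not part of the definition above):

```javascript
// Decide whether a request counts toward the availability SLI, and whether it succeeded.
// req.path and req.status are illustrative field names.
function classifyForSli(req) {
  const isHealthCheck = req.path === '/healthz';                  // assumed health-check path
  const isInvalidAuth = req.status === 401 || req.status === 403; // assumed auth-failure codes
  if (isHealthCheck || isInvalidAuth) {
    return { counted: false, success: false }; // excluded from the SLI entirely
  }
  const success = req.status >= 200 && req.status < 400; // 2xx or 3xx counts as success
  return { counted: true, success };
}

console.log(classifyForSli({ path: '/api/orders', status: 200 })); // { counted: true, success: true }
console.log(classifyForSli({ path: '/healthz', status: 200 }));    // { counted: false, success: false }
```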
2. Instrument Your Metrics
Collect the raw data:
// Track each request: record latency and success/failure even if the call throws
const startTime = Date.now();
let response;
try {
  response = await api.call();
} finally {
  const latency = Date.now() - startTime;
  // Record metrics
  metrics.record('request_total', 1);
  metrics.record('request_success', response?.ok ? 1 : 0);
  metrics.record('request_latency', latency);
}
3. Calculate SLI Values
From raw metrics, calculate SLI:
// Calculate hourly availability
const hourlyAvailability = successfulRequests / totalRequests;
// Calculate p95 latency (sort numerically; the default sort is lexicographic)
latencies.sort((a, b) => a - b);
const p95Latency = latencies[Math.floor(latencies.length * 0.95)];
// Calculate error rate
const errorRate = errorRequests / totalRequests;
4. Compare Against SLO
Check if you're meeting targets:
SLO: >= 99.9% availability
Current: 99.87%
Status: MISSING SLO ⚠️
SLO: p95 latency <= 200ms
Current: 142ms
Status: MEETING SLO ✓
SLO: error rate <= 0.1%
Current: 0.23%
Status: MISSING SLO ⚠️
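The comparison itself is easy to automate once you know which direction each SLI points (availability should be at or above target, latency and error rate at or below). A minimal sketch:

```javascript
// Compare a measured SLI against its target. For latency and error rate, lower is better.
function sloStatus(current, target, lowerIsBetter = false) {
  const meeting = lowerIsBetter ? current <= target : current >= target;
  return meeting ? 'MEETING SLO' : 'MISSING SLO';
}

console.log(sloStatus(99.87, 99.9));     // MISSING SLO (availability, %)
console.log(sloStatus(142, 200, true));  // MEETING SLO (p95 latency, ms)
console.log(sloStatus(0.23, 0.1, true)); // MISSING SLO (error rate, %)
```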
5. Track Error Budget
Calculate remaining error budget:
SLO: 99.9% uptime
Period: January (31 days)
Allowed downtime: 44.6 minutes
Actual downtime: 8.5 minutes
Remaining budget: 36.1 minutes (81%)
Monitoring and Alerting on SLOs
Real-time Dashboard
Create a dashboard showing:
┌─────────────────────────────────────┐
│ Availability (Month to Date) │
│ Current: 99.87% │
│ Target: 99.9% │
│ Status: MISSING (0.03%) │
│ │
│ Error Budget Remaining: 35 min (81%) │
├─────────────────────────────────────┤
│ Latency (p95, Last 24h) │
│ Current: 156ms │
│ Target: 200ms │
│ Status: OK │
├─────────────────────────────────────┤
│ Error Rate (Last Hour) │
│ Current: 0.19% │
│ Target: 0.1% │
│ Status: MISSING (0.09%) │
└─────────────────────────────────────┘
Alert Thresholds
Set up alerting based on SLO consumption rate:
Alert Level 1 (Warning):
- If burning error budget at 3x expected rate
- We'll breach SLO if this continues
- Action: Increase monitoring, start investigation
Alert Level 2 (Critical):
- If error budget remaining < 20% of the monthly budget
- We're on track to breach SLO
- Action: Page on-call, pause deployments
Alert Level 3 (Incident):
- If SLO breached (error budget exhausted)
- We've failed our commitment
- Action: Full incident response
Alert Rules
# Alert if burning budget too fast
IF (error_budget_consumed / days_elapsed) > (error_budget_monthly / 30) * 3
THEN alert "High SLO burn rate"
# Alert if breaching now
IF (current_availability < slo_target) for 5 minutes
THEN alert "SLO breach in progress"
# Alert on approaching breach
IF (error_budget_remaining / error_budget_monthly) < 0.2
THEN alert "Low error budget remaining"
Incident Response and SLOs
During an Incident
Track SLO impact:
Incident: Database connection pool exhaustion
Duration: 12 minutes
Services affected: API, Web app
Availability impact: month-to-date uptime dropped from 99.95% to 99.92%
Error budget consumed by this incident: 12 of 43 minutes (~28% of the monthly budget)
Estimated SLO status: Still on track against the 99.9% SLO
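The same arithmetic can run automatically in incident tooling: given the incident duration, the monthly budget, and month-to-date downtime, report budget consumed and the new availability figure. A minimal sketch using the numbers above (the 20 minutes of earlier downtime and ~28 elapsed days are assumptions for the example):

```javascript
// Estimate an incident's impact on the error budget and month-to-date availability.
function incidentImpact(incidentMinutes, monthlyBudgetMinutes, priorDowntimeMinutes, elapsedMinutes) {
  const budgetConsumed = incidentMinutes / monthlyBudgetMinutes;
  const totalDowntime = priorDowntimeMinutes + incidentMinutes;
  const monthToDateAvailability = 1 - totalDowntime / elapsedMinutes;
  return { budgetConsumed, monthToDateAvailability };
}

// 12-minute incident, 43-minute monthly budget, 20 minutes of earlier downtime, ~28 days elapsed.
console.log(incidentImpact(12, 43, 20, 28 * 24 * 60));
// → { budgetConsumed: ~0.28, monthToDateAvailability: ~0.9992 }
```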
Post-Incident
Analyze SLO implications:
Before incident: 99.95% uptime against a 99.9% SLO
After incident: 99.92% uptime against a 99.9% SLO
Status: SLO still met, but margin reduced
Another incident of similar duration would breach the SLO
Action: Increase database connection pool capacity
Priority: High (before next incident)
Improving to Meet SLOs
If You're Missing an SLO
1. Identify the cause
   - Is it a specific component failing?
   - Is it a performance bottleneck?
   - Is it degradation across the board?
2. Quantify the cost
   Current: 99.2%
   Target: 99.9%
   Gap: 0.7% = 5 hours/month
   Cost to fix: Estimated 200 hours engineering
   Benefit: Meets SLO, improves user trust
   ROI: Worth it
3. Prioritize improvements
   Quick wins (< 1 week):
   - Add caching layer (+0.2%)
   - Optimize slow queries (+0.1%)
   Medium-term (1-4 weeks):
   - Add database replicas (+0.3%)
   Long-term (1-3 months):
   - Redesign architecture (+0.2%)
4. Track progress
   Week 1: 99.2% → 99.4% (caching + queries)
   Week 2: 99.4% → 99.6% (replication)
   Week 3: 99.6% → 99.9% (architecture work)
   Status: SLO met
If You're Exceeding an SLO
Consider raising the target or investing budget elsewhere:
Current: 99.99% (4.3 minutes downtime/month)
Target: 99.9% (43 minutes allowed)
You have 10x more reliability than needed
Options:
1. Use extra reliability budget to ship features faster
2. Invest engineering time in new capabilities
3. Reduce ops complexity (save money)
4. Increase SLO target (push harder on reliability)
SLO Best Practices
1. Align with Business Impact
SLOs should reflect what matters to users:
Good: 99.9% availability (users notice when down)
Bad: 99.99% uptime on internal tool (nobody cares)
2. Use Meaningful Numbers
Round to numbers that make sense:
Good: 99.9%, 99.5%, 99% (clear meaning)
Bad: 99.837% (too precise, suggests false certainty)
3. Multiple Dimensions
Track SLOs in different ways:
Global SLO: 99.9% availability
Regional SLOs:
- US: 99.95%
- EU: 99.9%
- APAC: 99.8%
Service SLOs:
- API: 99.9%
- Web: 99.5%
- Batch: 99%
4. Review and Adjust
Quarterly, review whether SLOs are still appropriate:
Q1: 99.9% SLO
Q2: Still meeting it easily
Q3: Consider raising to 99.95%
Q4: Too aggressive, back to 99.9%
5. Communicate Clearly
Make sure everyone understands:
SLO: 99.9% availability
Meaning: ~43 minutes downtime/month allowed
Not meaning: "We'll only be up for 43 minutes this month"
Not meaning: "Any downtime under 43 minutes is fine, no questions asked"
Not meaning: "99.9% uptime is contractually guaranteed" (that's an SLA, not an SLO)
6. Alert on Trends
Alert not just on breaches, but on trends:
Alert if:
- SLI trending down (burn rate increasing)
- Burn rate running at 3x or more the expected pace
- Remaining error budget getting very low
SLO Levels by Organization
Startup
Availability SLO: 99% (baseline reliability)
Error Budget: 7.2 hours/month (lots of room)
Philosophy: Move fast, accept some downtime
Growth-stage
Availability SLO: 99.9%
Error Budget: 43 minutes/month
Philosophy: Need reliability to retain users
Enterprise
Availability SLO: 99.95%
Error Budget: 21 minutes/month
Philosophy: Reliability is core competency
Mission-critical
Availability SLO: 99.99%
Error Budget: 4.3 minutes/month
Philosophy: Downtime unacceptable
Tools for SLO Management
Dedicated SLO Platforms
- Nobl9 (SLO-focused SaaS)
- Datadog SLOs
- New Relic SLOs
- Grafana OnCall (with Prometheus)
DIY with Open Source
- Prometheus + custom rules
- Grafana dashboards
- CloudWatch alarms
- Custom scripts
The right tool depends on your:
- Existing monitoring stack
- Team size and expertise
- Budget
- Complexity of services
The Bottom Line
SLOs are powerful:
- Create alignment between engineering and business
- Drive smart decisions about reliability vs features
- Enable accountability through shared targets
- Build predictability through error budgets
- Improve communication about reliability
Without SLOs, your reliability efforts are aimless. With them, you're strategic.
Start with one simple SLO: "99.9% availability." Track it for a month. See how it influences your decisions. Then expand from there.
Track SLO compliance with up0 - Define SLOs, monitor error budgets, track compliance across regions. Get started free.