The complete guide to SLO tracking for modern infrastructure

Define SLOs, track compliance, manage error budgets. Complete guide to implementing SLO tracking for reliability and accountability.

up0 Team

12 min read

What Is an SLO?

An SLO (Service Level Objective) is a measurable target for your service's reliability and performance.

It's the answer to: "How reliable does our service need to be?"

SLO vs SLA vs SLI

SLI (Service Level Indicator): The actual measurement

Uptime: 99.87%
P95 Latency: 142ms
Error Rate: 0.23%

SLO (Service Level Objective): Your target

Uptime: >= 99.9%
P95 Latency: <= 200ms
Error Rate: <= 1%

SLA (Service Level Agreement): The contractual commitment

Uptime: >= 99.5% or we pay credits
P95 Latency: <= 500ms or we owe refunds

In this guide, we focus on SLOs—the internal targets that drive reliability decisions.
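
To make the distinction concrete, here is a minimal sketch in JavaScript (the values and object names are illustrative, not a standard API): the SLI is what you measured, the SLO is your internal target, and the SLA is the external commitment.

// Illustrative values only: SLI = measurement, SLO = internal target, SLA = contract
const sli = { availability: 0.9987, p95LatencyMs: 142, errorRate: 0.0023 };
const slo = { availability: 0.999, p95LatencyMs: 200, errorRate: 0.01 };
const sla = { availability: 0.995, p95LatencyMs: 500 };

// Internally you compare the SLI against the SLO...
const meetingSlo = sli.availability >= slo.availability;  // false: 99.87% < 99.9%
// ...while the SLA only comes into play with customers
const breachingSla = sli.availability < sla.availability; // false: 99.87% >= 99.5%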

Why SLOs Matter

1. Alignment

Without SLOs, different teams have different reliability expectations:

  • Frontend wants "never break"
  • Backend says "99% uptime is fine"
  • Operations targets "zero incidents"

SLOs create a shared language.

2. Trade-off Decisions

SLOs help you answer: "Should we ship this feature or focus on reliability?"

If we're at 98% uptime (below our 99.9% SLO):
→ Focus on stability, delay feature release

If we're at 99.95% uptime (above our 99.9% SLO):
→ We have error budget, ship the feature

3. Alerting Strategy

SLOs drive your alerting rules. For example, with a 99% availability SLO (a 1% error budget):

At 99.5% (50% of error budget used):
→ Start investigations, increase monitoring

At 99.2% (80% of error budget used):
→ Critical alert, page on-call, pause deployments

At 99.0% (100% of error budget used):
→ SLO breach, incident response mode

4. Accountability

With SLOs, you can say:

  • "We're meeting our reliability commitment" or
  • "We're not and here's what we'll do about it"

SLOs create accountability.

Defining Your SLOs

Step 1: Choose What to Measure

Availability SLO: Is the service responding?

SLO: 99.9% of requests succeed
Measured as: successful responses / total requests

Latency SLO: Is the service responding fast enough?

SLO: p95 latency <= 200ms
Measured as: 95th percentile response time

Freshness SLO: Is the data fresh enough?

SLO: 95% of data queries return data <= 5min old
Measured as: percentage of queries returning fresh data

Correctness SLO: Is the service giving correct results?

SLO: 99.99% of transactions processed correctly
Measured as: correct transactions / total transactions

Most services track 2-3 key SLOs, not everything.

Step 2: Choose Your Measurement Window

Per-minute: Too granular, too noisy

Most minutes read 100% (one failure in 1,000 requests doesn't show), while a single bad minute swings the number wildly

Per-hour: Better for catching issues

Hourly uptime = 99.75% (issues within the hour are visible)

Per-day: Good operational window

Daily uptime = 99.2% (multi-hour outages show clearly)

Per-month/quarter: Standard for SLO compliance

Monthly uptime = 99.9% (standard SLO window)
Quarterly uptime = 99.95% (stricter measure)

Most teams track:

  • Hourly (for immediate decision making)
  • Monthly (for SLO compliance)
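
As a rough sketch of how the window changes the picture, assuming you already collect per-request records (the requests array and its fields are hypothetical), the same data can be viewed hourly for fast feedback and over a rolling month for compliance:

// requests: [{ timestamp: epoch ms, ok: boolean }, ...] -- hypothetical input shape
function availability(reqs) {
  if (reqs.length === 0) return 1;
  return reqs.filter((r) => r.ok).length / reqs.length;
}

const HOUR = 60 * 60 * 1000;
const MONTH = 30 * 24 * HOUR;

// Hourly view: catches issues quickly, but is noisy
const lastHour = requests.filter((r) => Date.now() - r.timestamp < HOUR);
console.log('Hourly availability:', availability(lastHour));

// Rolling 30-day view: the number you report against the SLO
const last30Days = requests.filter((r) => Date.now() - r.timestamp < MONTH);
console.log('30-day availability:', availability(last30Days));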

Step 3: Set The Target

Your target should be:

  1. Achievable without heroics

    • If you need to work weekends constantly to hit it, it's too aggressive
  2. Meaningful for users

    • 99% vs 99.9% feels different to users
    • 99.99% is usually overkill for most services
  3. Defensible financially

    • Higher SLOs cost more (redundancy, failover, testing)
    • Does the business value justify the cost?

Example SLO Targets

Customer-facing API

Availability: >= 99.9% (43 minutes downtime/month allowed)
Latency (p95): <= 200ms
Error rate: <= 0.1%

Internal batch processing

Availability: >= 99% (7.2 hours downtime/month allowed)
Latency: <= 30 minutes (job completion time)
No specific latency percentile needed

Background queue service

Availability: >= 99.5% (3.6 hours downtime/month allowed)
Processing latency (p95): <= 5 minutes
No user-facing SLO needed

Error Budgets

Once you define an SLO, you get an error budget: how much downtime you can tolerate while still meeting your SLO.

Calculating Error Budget

SLO: 99.9% availability
Meaning: the service must be up 99.9% of the time

Error budget = 100% - 99.9% = 0.1%

Monthly (30 days):
0.1% of 30 days = 0.03 days = 43 minutes

So you can tolerate 43 minutes of downtime per month
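
The arithmetic is simple enough to keep in a small helper; a minimal sketch (the function name is ours):

// Convert an SLO target and a window length into allowed downtime
function errorBudgetMinutes(sloTarget, windowDays) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloTarget);
}

console.log(errorBudgetMinutes(0.999, 30));  // ~43.2 minutes per month
console.log(errorBudgetMinutes(0.9999, 30)); // ~4.3 minutes per month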

More Examples

SLO       Error Budget (Month)   Error Budget (Quarter)
99%       7.2 hours              21.6 hours
99.5%     3.6 hours              10.8 hours
99.9%     43 minutes             130 minutes
99.95%    21 minutes             65 minutes
99.99%    4.3 minutes            13 minutes

Using Error Budget

At beginning of month:

Availability SLO: 99.9%
Error budget: 43 minutes
Consumed: 0 minutes
Remaining: 43 minutes

After small outage (15 minutes):

Consumed: 15 minutes
Remaining: 28 minutes

After second outage (20 minutes):

Consumed: 35 minutes
Remaining: 8 minutes

After third outage (10 minutes):

Consumed: 45 minutes (exceeded!)
Status: SLO BREACHED

Error Budget Policies

Aggressive: Spend budget to ship features

If budget > 50%: Ship features freely
If budget 10-50%: Ship only critical features
If budget < 10%: No new features, focus on stability

Conservative: Save budget for emergencies

If budget > 75%: Consider shipping
If budget 25-75%: Ship only essential features
If budget < 25%: Stability only

Balanced: Allow some spending, protect some buffer

Target remaining budget: >= 20%
If above 20%: OK to deploy and experiment
If below 20%: Stability focus
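
A policy like the balanced one above can be encoded directly in deploy tooling; a minimal sketch (the threshold and function name are ours):

// Balanced policy: protect a 20% buffer of the monthly error budget
function deploymentPolicy(remainingMinutes, monthlyBudgetMinutes) {
  const remainingFraction = remainingMinutes / monthlyBudgetMinutes;
  return remainingFraction >= 0.2
    ? 'ok-to-deploy'     // enough buffer left: ship and experiment
    : 'stability-only';  // buffer gone: freeze features, focus on reliability
}

console.log(deploymentPolicy(30, 43)); // 'ok-to-deploy' (~70% remaining)
console.log(deploymentPolicy(8, 43));  // 'stability-only' (~19% remaining)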

Implementing SLO Tracking

1. Define Your SLIs (measurements)

Choose exactly how you'll measure each SLO:

Availability SLO: 99.9%

SLI Definition:
- Measure: HTTP requests to production API
- Success: Response code 2xx or 3xx
- Exclude: Requests from our health checks
- Exclude: Requests with invalid authentication
- Window: Rolling 30-day average

Being specific prevents interpretation issues.
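
One way to keep the definition unambiguous is to encode it next to the measurement code; a sketch with hypothetical field names:

// Machine-readable SLI definition (field names are illustrative)
const availabilitySli = {
  slo: 0.999,
  windowDays: 30,
  isCounted: (req) => !req.isHealthCheck && req.authValid,   // apply the exclusions
  isSuccess: (res) => res.status >= 200 && res.status < 400, // 2xx or 3xx counts as success
};

// Applying it to a batch of { req, res } pairs
const counted = events.filter((e) => availabilitySli.isCounted(e.req));
const good = counted.filter((e) => availabilitySli.isSuccess(e.res)).length;
const sliValue = good / counted.length;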

2. Instrument Your Metrics

Collect the raw data:

// Track each request
const startTime = Date.now();
const response = await api.call();
const latency = Date.now() - startTime;

// Record metrics
metrics.record('request_total', 1);
metrics.record('request_success', response.ok ? 1 : 0);
metrics.record('request_latency', latency);

3. Calculate SLI Values

From raw metrics, calculate SLI:

// Calculate hourly availability
const hourlyAvailability = successfulRequests / totalRequests;

// Calculate p95 latency (numeric sort; the default sort is lexicographic)
latencies.sort((a, b) => a - b);
const p95Latency = latencies[Math.floor(latencies.length * 0.95)];

// Calculate error rate
const errorRate = errorRequests / totalRequests;

4. Compare Against SLO

Check if you're meeting targets:

SLO: >= 99.9% availability
Current: 99.87%
Status: MISSING SLO ⚠️

SLO: p95 latency <= 200ms
Current: 142ms
Status: MEETING SLO ✓

SLO: error rate <= 0.1%
Current: 0.23%
Status: MISSING SLO ⚠️
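
The comparison itself is mechanical; a minimal sketch using the values above (the helper is ours):

// Compare measured SLI values against their targets
function checkSlo(name, current, target, higherIsBetter) {
  const ok = higherIsBetter ? current >= target : current <= target;
  console.log(`${name}: ${current} vs target ${target} -> ${ok ? 'MEETING SLO' : 'MISSING SLO'}`);
  return ok;
}

checkSlo('availability', 0.9987, 0.999, true); // MISSING SLO
checkSlo('p95 latency (ms)', 142, 200, false); // MEETING SLO
checkSlo('error rate', 0.0023, 0.001, false);  // MISSING SLO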

5. Track Error Budget

Calculate remaining error budget:

SLO: 99.9% uptime
Period: January (31 days)
Allowed downtime: 44.6 minutes
Actual downtime: 8.5 minutes
Remaining budget: 36.1 minutes (81%)
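
Tracking the budget is the same arithmetic applied to actual downtime; a sketch using the numbers above (the helper is ours):

// Remaining error budget for a reporting period
function remainingBudget(sloTarget, periodDays, downtimeMinutes) {
  const allowed = periodDays * 24 * 60 * (1 - sloTarget);
  const remaining = allowed - downtimeMinutes;
  return { allowed, remaining, remainingPct: (remaining / allowed) * 100 };
}

console.log(remainingBudget(0.999, 31, 8.5));
// -> allowed ~44.6 min, remaining ~36.1 min, remainingPct ~81%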

Monitoring and Alerting on SLOs

Real-time Dashboard

Create a dashboard showing:

┌─────────────────────────────────────┐
│ Availability (Month to Date)         │
│ Current:   99.87%                    │
│ Target:    99.9%                     │
│ Status:    MISSING (0.03%)           │
│                                     │
│ Error Budget Remaining: 35 min (81%) │
├─────────────────────────────────────┤
│ Latency (p95, Last 24h)             │
│ Current:   156ms                     │
│ Target:    200ms                     │
│ Status:    OK                        │
├─────────────────────────────────────┤
│ Error Rate (Last Hour)              │
│ Current:   0.19%                     │
│ Target:    0.1%                      │
│ Status:    MISSING (0.09%)           │
└─────────────────────────────────────┘

Alert Thresholds

Set up alerting based on SLO consumption rate:

Alert Level 1 (Warning):
- If burning error budget at 3x expected rate
- We'll breach SLO if this continues
- Action: Increase monitoring, start investigation

Alert Level 2 (Critical):
- If error budget remaining < 20% of month
- We're on track to breach SLO
- Action: Page on-call, pause deployments

Alert Level 3 (Incident):
- If SLO breached (error budget exhausted)
- We've failed our commitment
- Action: Full incident response

Alert Rules

# Alert if burning budget too fast
IF (error_budget_consumed / days_elapsed) > (error_budget_monthly / 30) * 3
THEN alert "High SLO burn rate"

# Alert if breaching now
IF (current_availability < slo_target) for 5 minutes
THEN alert "SLO breach in progress"

# Alert on approaching breach
IF (error_budget_remaining / error_budget_monthly) < 0.2
THEN alert "Low error budget remaining"

Incident Response and SLOs

During an Incident

Track SLO impact:

Incident: Database connection pool exhaustion
Duration: 12 minutes
Services affected: API, Web app
Availability impact: full outage of both services for 12 minutes
Error budget consumed: 12 minutes (28% of the 43-minute monthly budget)
Estimated SLO status: Still on track against the 99.9% SLO

Post-Incident

Analyze SLO implications:

Before incident: 22 minutes of error budget remaining (99.9% SLO)
After incident: 10 minutes remaining
Status: SLO still met, but margin reduced

Another incident of the same duration would breach the SLO
Action: Increase database connection pool capacity
Priority: High (before next incident)

Improving to Meet SLOs

If You're Missing an SLO

  1. Identify the cause

    • Is it a specific component failing?
    • Is it a performance bottleneck?
    • Is it degradation across the board?
  2. Quantify the cost

    Current: 99.2%
    Target: 99.9%
    Gap: 0.7% = 5 hours/month
    
    Cost to fix: Estimated 200 hours engineering
    Benefit: Meets SLO, improves user trust
    ROI: Worth it
    
  3. Prioritize improvements

    Quick wins (< 1 week):
    - Add caching layer (+0.2%)
    - Optimize slow queries (+0.1%)
    
    Medium-term (1-4 weeks):
    - Add database replicas (+0.3%)
    
    Long-term (1-3 months):
    - Redesign architecture (+0.2%)
    
  4. Track progress

    Week 1: 99.2% → 99.4% (caching + queries)
    Week 2: 99.4% → 99.6% (replication)
    Week 3: 99.6% → 99.9% (architecture work)
    Status: SLO met
    

If You're Exceeding an SLO

Consider raising the target or investing budget elsewhere:

Current: 99.99% (4.3 minutes downtime/month)
Target: 99.9% (43 minutes allowed)

You're using only a tenth of the downtime your target allows
Options:
1. Use extra reliability budget to ship features faster
2. Invest engineering time in new capabilities
3. Reduce ops complexity (save money)
4. Increase SLO target (push harder on reliability)

SLO Best Practices

1. Align with Business Impact

SLOs should reflect what matters to users:

Good: 99.9% availability (users notice when down)
Bad: 99.99% uptime on internal tool (nobody cares)

2. Use Meaningful Numbers

Round to numbers that make sense:

Good: 99.9%, 99.5%, 99% (clear meaning)
Bad: 99.837% (too precise, suggests false certainty)

3. Multiple Dimensions

Track SLOs in different ways:

Global SLO: 99.9% availability
Regional SLOs:
- US: 99.95%
- EU: 99.9%
- APAC: 99.8%

Service SLOs:
- API: 99.9%
- Web: 99.5%
- Batch: 99%

4. Review and Adjust

Quarterly, review whether SLOs are still appropriate:

Q1: 99.9% SLO
Q2: Still meeting it easily
Q3: Consider raising to 99.95%
Q4: Too aggressive, back to 99.9%

5. Communicate Clearly

Make sure everyone understands:

SLO: 99.9% availability
Meaning: ~43 minutes downtime/month allowed
Not meaning: We plan to be down 43 minutes this month
             We're happy with any amount of downtime under 43 min
             This is a contractual guarantee of 99.9% uptime (that's an SLA)

6. Alert on Trends

Alert not just on breaches, but on trends:

Alert if:
- Error budget is dropping faster than expected (burn rate increasing)
- Burn rate is 3x or more the expected rate
- Very little error budget remains

SLO Levels by Organization

Startup

SLO: 99% (broad reliability)
Error Budget: 7.2 hours/month (lots of room)
Philosophy: Move fast, accept some downtime

Growth-stage

Availability SLO: 99.9%
Error Budget: 43 minutes/month
Philosophy: Need reliability to retain users

Enterprise

Availability SLO: 99.95%
Error Budget: 21 minutes/month
Philosophy: Reliability is core competency

Mission-critical

Availability SLO: 99.99%
Error Budget: 4.3 minutes/month
Philosophy: Downtime unacceptable

Tools for SLO Management

Dedicated SLO Platforms

  • Nobl9 (SLO-focused SaaS)
  • Datadog SLOs
  • New Relic SLOs
  • Grafana OnCall (with Prometheus)

DIY with Open Source

  • Prometheus + custom rules
  • Grafana dashboards
  • CloudWatch alarms
  • Custom scripts

The right tool depends on your:

  • Existing monitoring stack
  • Team size and expertise
  • Budget
  • Complexity of services

The Bottom Line

SLOs are powerful:

  1. Create alignment between engineering and business
  2. Drive smart decisions about reliability vs features
  3. Enable accountability through shared targets
  4. Build predictability through error budgets
  5. Improve communication about reliability

Without SLOs, your reliability efforts are aimless. With them, you're strategic.

Start with one simple SLO: "99.9% availability." Track it for a month. See how it influences your decisions. Then expand from there.


Track SLO compliance with up0 - Define SLOs, monitor error budgets, track compliance across regions. Get started free.