Understanding latency percentiles: p50, p95, p99

Average latency lies. Learn why percentiles matter, how to read them, and what they reveal about your system's real performance.


Michael Torres

10 min read

Why Your Average Latency Lies to You

Your application's average response time is 50ms. Sounds good, right?

But what if that average is hiding the real story? What if most requests complete in 50ms, while 1% take 5 seconds?

If even 1% of your users experience 5-second delays, your application feels broken to them. An average of 50ms is meaningless to a user whose request took 5 seconds.

This is the critical insight behind latency percentiles.

What Are Latency Percentiles?

A percentile tells you: "X% of requests completed in Y milliseconds."

Common Percentiles

p50 (median): 50% of requests complete within this time

  • Half your users experience this or better
  • The typical, fast request

p95: 95% of requests complete within this time

  • 95% of your users have this experience
  • Closer to what users actually feel, since each user makes many requests

p99: 99% of requests complete within this time

  • The worst 1% of users experience this or worse
  • The "tail" of the distribution

p99.9: 99.9% of requests complete within this time

  • The very worst requests
  • Often called "tail latency"

A Real Example

Let's imagine you're monitoring an API endpoint that receives 10,000 requests in a minute. Sort the response times from fastest to slowest:

Request 1:     25ms
Request 2:     28ms
Request 3:     27ms
...
Request 5,000: 52ms  ← This is approximately p50 (median)
Request 5,001: 65ms
...
Request 9,500: 85ms  ← This is approximately p95
Request 9,501: 120ms
...
Request 9,900: 410ms
Request 9,901: 450ms ← This is approximately p99
Request 9,902: 520ms
...
Request 10,000: 5,200ms

So you might report:

  • p50: 52ms - Half of requests finish in 52ms or less
  • p95: 85ms - 95% of requests finish in 85ms or less
  • p99: 450ms - 99% of requests finish in 450ms or less

Why Percentiles Matter More Than Averages

The Average Can Mislead

Suppose you have response times: 10ms, 20ms, 30ms, 40ms, 5000ms

Average = (10 + 20 + 30 + 40 + 5000) / 5 = 1020ms
Median (p50) = 30ms

The average is pulled way up by that one slow request. But 80% of your users had a great experience (under 40ms).
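
Here's that arithmetic as a quick JavaScript sketch you can paste into a Node REPL:

const samples = [10, 20, 30, 40, 5000];

// The mean is dragged up by the single 5000ms outlier
const mean = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;

// The median (p50) reflects the typical request
const sorted = [...samples].sort((a, b) => a - b);
const median = sorted[Math.floor(sorted.length / 2)];

console.log(mean);   // 1020
console.log(median); // 30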

The average tells a lie. The percentile tells the truth.

Users Experience Percentiles, Not Averages

Your users don't care about average performance. Each user cares only about their experience:

  • A user at the p50 experienced a fast request
  • A user at the p95 experienced a reasonable one
  • A user at the p99 experienced a slow one

If you only track averages, you're ignoring the worst experiences.

Percentiles Show Distribution

Latency percentiles tell you about the distribution:

Healthy system:
p50: 50ms
p95: 80ms
p99: 120ms
- Tight distribution, predictable performance

Degraded system:
p50: 50ms
p95: 200ms
p99: 2500ms
- Wide spread, unpredictable performance, tail latency issues

The same p50 doesn't mean the same performance.

What Causes Tail Latency (p95, p99)?

Tail latency—those outlier requests that take much longer—comes from:

1. Lock Contention

When multiple requests compete for the same database lock, one wins immediately and the others wait.

Request 1: Acquires lock immediately -> 50ms
Request 2: Waits for lock -> 450ms (p99)
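
As a rough illustration, here's a minimal async-lock sketch in JavaScript (the SimpleLock class is a hypothetical stand-in, not a real library): the second request's latency is dominated by time spent waiting for the first to release the lock.

// Minimal FIFO lock: callers queue behind whoever currently holds it
class SimpleLock {
  constructor() { this.tail = Promise.resolve(); }
  acquire(fn) {
    const run = this.tail.then(fn);
    this.tail = run.catch(() => {}); // keep the queue alive even if fn throws
    return run;
  }
}

const lock = new SimpleLock();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Request 1 holds the lock for 400ms of "work"
lock.acquire(() => sleep(400));

// Request 2 only needs 50ms of work, but waits ~400ms first -> ~450ms total
const start = performance.now();
lock.acquire(() => sleep(50)).then(() =>
  console.log(`request 2 latency: ${Math.round(performance.now() - start)}ms`)
);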

2. Garbage Collection

A garbage collection pause stops all request processing:

Requests 1-100: Normal 50ms latency
GC pause for 200ms
Requests 101-150: All delayed by 200ms
Requests 151-200: Back to 50ms
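
You can see the same effect in miniature by blocking the event loop (a busy-wait standing in for a stop-the-world GC pause; this is a sketch, not how you'd measure a real collector):

// A request scheduled "now" should be handled almost immediately...
const start = performance.now();
setTimeout(() => {
  console.log(`handled after ${Math.round(performance.now() - start)}ms`); // ~200ms
}, 0);

// ...but a 200ms synchronous block (our stand-in for a GC pause) delays it,
// along with every other request that arrived during the pause.
const blockUntil = performance.now() + 200;
while (performance.now() < blockUntil) { /* busy-wait */ }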

3. Cache Misses

A cache miss requires a database hit:

Cache hit:  5ms
Cache miss: 300ms (p99)
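
A cache-aside read makes the split obvious. In this sketch, fetchUserFromDb is a hypothetical 300ms database call; hits skip it entirely:

const cache = new Map();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical slow database call (~300ms)
async function fetchUserFromDb(id) {
  await sleep(300);
  return { id };
}

async function getUser(id) {
  if (cache.has(id)) return cache.get(id); // hit: ~0-5ms
  const user = await fetchUserFromDb(id);  // miss: pays the full 300ms (the p99 path)
  cache.set(id, user);
  return user;
}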

4. Network Timeouts

Occasionally, a downstream service is slow:

Normal call:  100ms
Timeout call: 10,000ms (p99.9)
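
The usual defense is to bound the call. Assuming Node 18+ (global fetch and AbortSignal.timeout), a sketch looks like this; without the timeout, a hung dependency shows up as a 10-second p99.9 outlier:

// Fail fast after 2s instead of letting a slow downstream call run for 10s+
async function callDownstream(url) {
  const res = await fetch(url, { signal: AbortSignal.timeout(2000) });
  if (!res.ok) throw new Error(`downstream returned ${res.status}`);
  return res.json();
}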

5. Thundering Herd

When many clients retry simultaneously:

Normal load:   50ms
Retry spike:   2,000ms (p99)
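
A common client-side mitigation is exponential backoff with jitter, so retries spread out instead of landing in one synchronized spike. A minimal sketch (the helper names are illustrative):

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry with exponential backoff plus full jitter
async function withRetries(fn, attempts = 3, baseMs = 100) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === attempts - 1) throw err;        // out of retries
      const backoff = baseMs * 2 ** attempt;          // 100ms, 200ms, 400ms...
      await sleep(backoff + Math.random() * backoff); // jitter desynchronizes clients
    }
  }
}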

Setting Performance Thresholds

You should monitor and alert on percentiles, not just averages.

Bad Approach

Alert if average latency > 100ms

This misses the tail latency problem that affects your worst users.

Good Approach

Alert if p50 latency > 80ms
Alert if p95 latency > 150ms
Alert if p99 latency > 400ms

This tracks the distribution and alerts when things actually degrade.

Better Approach

Define SLOs by percentile:
p50 latency:  < 100ms  (fast response)
p95 latency:  < 200ms  (good response)
p99 latency:  < 500ms  (acceptable worst case)

Alert if you're trending toward SLO breach
Alert if you breach SLO
Track error budget consumption
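
For example, error-budget consumption for the "p99 < 500ms" SLO above can be tracked with a simple counter over a window of samples (a sketch, assuming you already collect per-request latencies):

// Fraction of the error budget left for an SLO like "99% of requests under 500ms"
function errorBudgetRemaining(latencies, sloMs = 500, target = 0.99) {
  const bad = latencies.filter((ms) => ms > sloMs).length;
  const allowedBad = latencies.length * (1 - target); // total budget for the window
  return 1 - bad / allowedBad; // 1 = untouched, 0 = exhausted, negative = SLO breached
}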

Real-World SLO Examples

High-performance API Service

p50: < 50ms
p95: < 100ms
p99: < 200ms

Why these numbers? Customers expect sub-100ms response times. The p99 being 2x p95 is acceptable for a healthy system.

Data Processing Service

p50: < 500ms
p95: < 2000ms
p99: < 5000ms

Processing takes longer, so thresholds are higher. But the ratio (p99 is 10x p50) still indicates potential tail issues worth watching.

Interactive Web Application

p50: < 200ms
p95: < 500ms
p99: < 1000ms

Users tolerate longer latencies for complex operations. The wider spread (p99 is 5x p50) is expected.

Actionable Guidance: What's Normal?

p95 to p99 Ratio

Healthy system: p99 is 1.5x - 4x p95

  • Tight distribution
  • Predictable performance
  • Few anomalies
Example: p95=100ms, p99=200ms (ratio: 2x) - Healthy

Warning signs: p99 is 5x+ p95

  • Wide tail
  • Many slow requests
  • Something is causing degradation
Example: p95=100ms, p99=1000ms (ratio: 10x) - Investigate
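
If you want this as an automated check, the ratio test is a one-liner (the 4x threshold is the rule of thumb above, not a universal constant):

// Flag a wide tail when p99 is more than ~4x p95
const tailLooksHealthy = (p95, p99, maxRatio = 4) => p99 / p95 <= maxRatio;

console.log(tailLooksHealthy(100, 200));  // true  - healthy
console.log(tailLooksHealthy(100, 1000)); // false - investigate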

Expected Ranges by Use Case

Service Type        p50          p95            p99
API Gateway         10-30ms      50-100ms       100-200ms
Web Server          50-100ms     150-300ms      500-1000ms
Database Query      10-50ms      100-200ms      500-2000ms
External API        100-500ms    1000-3000ms    3000-10000ms
Batch Processing    1-10s        30-60s         60-300s

These are rough guides. Your specific values depend on your architecture.

How to Measure Percentiles

In Your Application Code

// Collect one latency sample per request
const latencies = [];

const startTime = performance.now();

// ... do work ...

const endTime = performance.now();
latencies.push(endTime - startTime);

// Once you have a batch of samples, sort ascending and read off percentiles
latencies.sort((a, b) => a - b);
const p50Index = Math.floor(latencies.length * 0.50);
const p95Index = Math.floor(latencies.length * 0.95);
const p99Index = Math.floor(latencies.length * 0.99);

console.log(`p50: ${latencies[p50Index]}ms`);
console.log(`p95: ${latencies[p95Index]}ms`);
console.log(`p99: ${latencies[p99Index]}ms`);

With Monitoring Tools

Most modern monitoring platforms (Datadog, New Relic, Prometheus) calculate percentiles automatically:

up0_api_latency_p50{region="us-east-1"} 52
up0_api_latency_p95{region="us-east-1"} 85
up0_api_latency_p99{region="us-east-1"} 450

Log Analysis

You can also calculate from logs:

# Extract response times (last field), sort them, and pick the value at the 95th-percentile rank
awk '{print $NF}' access.log | sort -n | awk '{v[NR]=$1} END {print v[int(NR*0.95)]}'

Debugging High Tail Latency

When you see high p99 latencies:

Step 1: Narrow Down the Service

Which service has the high latency? Instrument all services to find the culprit.

Step 2: Identify the Pattern

Is it:

  • Constant? Something is slow for all requests
  • Spiky? Occasional slow requests (GC, lock contention)
  • Time-based? Happens at specific times (batch jobs, backups)
  • Load-dependent? Worse under high load

Step 3: Investigate the Cause

Based on the pattern:

  • Lock contention? Add metrics for lock wait times
  • GC? Check GC logs and JVM heap
  • Cache misses? Track cache hit rates
  • External service? Measure downstream latency
  • Capacity? Check CPU, memory, disk

Step 4: Measure the Impact

What percentage of users experience this latency? Is it worth fixing?

p99 = 2000ms affecting 1% of users = potentially worth fixing
p99.999 = 5000ms affecting 0.001% = probably not worth fixing
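
The back-of-the-envelope math is worth writing down. Assuming 1 million requests per day (a made-up traffic number for illustration):

const requestsPerDay = 1_000_000; // hypothetical traffic

console.log(requestsPerDay * 0.01);    // p99 tail:     10,000 slow requests/day
console.log(requestsPerDay * 0.00001); // p99.999 tail: ~10 slow requests/day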

Step 5: Fix or Accept

Either fix the root cause, or accept it and set your SLO accordingly.

Percentiles in Different Contexts

API Uptime Monitoring

Track response time percentiles across regions:

us-east-1 p95: 85ms
eu-west-1 p95: 140ms
ap-south-1 p95: 280ms

Regional differences are normal. Alert if a region's percentile spikes.

Database Monitoring

Track query latency percentiles:

SELECT queries p95: 250ms
UPDATE queries p95: 450ms

Write operations are slower. Track separately.
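
One simple way to keep them separate is to bucket samples by operation type before computing percentiles (a sketch; the bucket names are illustrative):

// One latency bucket per operation type, so slow writes don't hide behind fast reads
const buckets = { SELECT: [], UPDATE: [] };

function recordQueryLatency(op, ms) {
  (buckets[op] ??= []).push(ms); // create the bucket on first use
}

recordQueryLatency('SELECT', 12);
recordQueryLatency('UPDATE', 480);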

End-to-End User Experience

Real User Monitoring (RUM) tracks what users actually see:

Page load p95: 3200ms (includes network, rendering, etc.)

Always slower than backend latency. This is what users care about.

Key Takeaways

  1. Percentiles tell the truth, averages lie

    • Track p50, p95, and p99, not just the average
  2. Different percentiles matter for different things

    • p50: Is the system generally fast?
    • p95: Is the experience good for most users?
    • p99: Are edge cases handled well?
  3. Set SLOs by percentile

    • Define what "good performance" means
    • Alert on approaching or breaching SLOs
    • Track error budget
  4. Investigate high tail latency

    • Percentile ratio shows distribution health
    • Root cause could be locks, GC, cache, capacity, or network
  5. Monitor percentiles across all services and regions

    • Backend latencies
    • Database latencies
    • API calls
    • Regional differences

Track your latency percentiles with up0 - Monitor p50, p95, p99 across 20+ regions. Get started free.