Building a monitoring-first culture

Great monitoring isn't just tooling—it's culture. Learn how top engineering teams build observability into every decision.

David Kim

9 min read

The Problem: Reactive vs Proactive Monitoring

Most engineering teams are in reactive mode:

  1. User: "Your service is broken"
  2. Engineer: Gets paged at 3 AM
  3. Team: Debugs frantically for 2 hours
  4. Root cause: "We didn't know it could fail this way"
  5. Post-mortem: "We need better monitoring"
  6. Repeat: Next incident in 2 weeks

This cycle is expensive—in stress, productivity, and customer trust.

Monitoring-first teams work differently:

  1. Engineer: Deploys new service
  2. Monitoring system: "New endpoint discovered, setting up checks"
  3. Degradation occurs: System alerts before users notice
  4. Engineer: Fixes issue in 10 minutes
  5. No customer impact, no incident

The difference? Culture.

What Is a Monitoring-First Culture?

A monitoring-first culture means:

  1. Monitoring is a first-class concern

    • As important as code quality
    • Required before deployment
    • Measured and tracked
  2. Visibility is the default

    • Services are observable by default
    • Metrics are instrumented automatically
    • Dashboards exist before they're needed
  3. Proactive is expected

    • Monitoring prevents incidents
    • Teams own alerting rules
    • On-call is effective, not painful
  4. Data drives decisions

    • Reliability is measurable
    • Trade-offs are informed by data
    • SLOs guide priorities

Building Monitoring-First Culture: Key Changes

Change 1: Monitoring Is Definition of Done

In most teams, "done" means:

  • Code is written
  • Tests pass
  • Code review approved
  • Deployed to production

In monitoring-first teams, "done" means all of the above, PLUS:

Checklist: Definition of Done
✓ Code written and reviewed
✓ Unit tests passing
✓ Integration tests passing
✓ Deployed to staging
✓ Deployed to production
✓ Metrics instrumented
✓ Dashboards created
✓ Alerts configured
✓ Runbook written
✓ Team trained

If monitoring isn't done, the feature isn't done.
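
You can even enforce this automatically. Below is a minimal sketch of a pre-deploy gate in Python, assuming (purely for illustration) that each service repository keeps its monitoring artifacts in dedicated files and directories; the paths are placeholders, not a standard.

# check_monitoring_done.py - hypothetical pre-deploy gate.
# Assumes (for illustration) each service repo keeps monitoring artifacts at these paths.
import sys
from pathlib import Path

REQUIRED = {
    "metrics instrumented": "metrics.py",   # assumed module name
    "dashboard defined": "dashboards/",     # assumed directory
    "alerts configured": "alerts/",         # assumed directory
    "runbook written": "runbooks/",         # assumed directory
}

def monitoring_done(service_dir: str) -> bool:
    """Return True only if every required monitoring artifact exists."""
    root = Path(service_dir)
    missing = [name for name, path in REQUIRED.items() if not (root / path).exists()]
    for name in missing:
        print(f"NOT DONE: {name} ({REQUIRED[name]} missing)")
    return not missing

if __name__ == "__main__":
    sys.exit(0 if monitoring_done(sys.argv[1] if len(sys.argv) > 1 else ".") else 1)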

Change 2: Every Engineer Owns Monitoring

Not just the "monitoring team" (if you have one).

Every engineer who builds a service owns:

  • Metrics for that service
  • Dashboards for that service
  • Alerts for that service
  • Runbooks for that service
  • On-call support for that service

This creates accountability and prevents blind spots.

Change 3: Alerts Are Sacred

In reactive teams:

  • Lots of alerts
  • Most are false positives
  • Engineers mute them
  • Real problems get lost
  • Alert fatigue is normal

In monitoring-first teams:

  • Fewer alerts, all meaningful
  • Each alert has a runbook
  • On-call responds immediately
  • False positives are investigated and fixed
  • Alerts are predictive, not reactive

Rule: If you get paged, it's a real problem. If it's not, fix the alert.

Change 4: Observability Is Built In

Not added after the fact.

Monitoring-first teams:

New service being built?
→ Instrumentation happens day 1
→ Metrics framework is part of service skeleton
→ Logging is configured
→ Tracing is enabled
→ Dashboard template is created

Not:

Service deployed to production
→ "We should probably add monitoring"
→ 3 weeks later, basic metrics added
→ Alert fatigue from wrong thresholds
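
What does "instrumentation happens day 1" look like in practice? Here's a minimal sketch of a Python service skeleton using the prometheus_client library, with metrics and logging wired up before any real business logic exists; the port, endpoint, and metric names are placeholders.

# service_skeleton.py - minimal sketch of a service that is observable from day 1.
# Uses the prometheus_client library; names and port are illustrative.
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("example-service")

REQUESTS = Counter("http_requests_total", "Total requests", ["endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> str:
    """Business logic placeholder, already wrapped in metrics and logging."""
    start = time.perf_counter()
    try:
        result = "ok"  # real work would go here
        REQUESTS.labels(endpoint=endpoint, status="200").inc()
        return result
    except Exception:
        REQUESTS.labels(endpoint=endpoint, status="500").inc()
        log.exception("request failed")
        raise
    finally:
        LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    log.info("metrics exposed on :9100, service starting")
    handle_request("/healthz")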

Change 5: SLOs Drive Priorities

In reactive teams:

  • "We should improve reliability"
  • No one knows how much or which parts
  • Reliability work loses to feature work

In monitoring-first teams:

Availability SLO: 99.9%
Current: 99.2%
Error budget: Exhausted

Priority 1: Fix availability (not features)
Target: 99.9%
Timeline: 2 sprints

Data drives what matters.
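
The error budget arithmetic behind that decision is simple. A quick sketch using the numbers above:

# Error budget arithmetic for an availability SLO.
def error_budget_remaining(slo_target: float, measured_availability: float) -> float:
    """Fraction of the error budget left; negative means the budget is exhausted."""
    budget = 1.0 - slo_target            # allowed unavailability, e.g. 0.1%
    spent = 1.0 - measured_availability  # actual unavailability
    return (budget - spent) / budget

# Example from above: 99.9% target, 99.2% measured.
print(error_budget_remaining(0.999, 0.992))  # -7.0: exhausted, 8x the allowed unavailability spent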

Practical Steps to Build Monitoring-First Culture

Phase 1: Establish Foundations (Weeks 1-4)

Step 1: Define Core SLOs

Start with 1-2 SLOs per service:
- Availability: 99.9%
- Latency (p95): 200ms

Step 2: Create Monitoring Standards

Every service must have:
- Request rate metric
- Error rate metric
- Latency percentile metrics
- Resource utilization (CPU, memory)
- Custom business metrics
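
As a sketch, here's one way to encode that standard with the Python prometheus_client and psutil libraries; the metric names are illustrative, not a required convention.

# standard_metrics.py - one possible encoding of the per-service metric standard.
# Uses prometheus_client and psutil; metric names are illustrative.
import psutil
from prometheus_client import Counter, Gauge, Histogram

REQUEST_COUNT = Counter("service_requests_total", "Requests received", ["endpoint"])
ERROR_COUNT = Counter("service_errors_total", "Failed requests", ["endpoint"])
LATENCY = Histogram("service_request_duration_seconds", "Request latency", ["endpoint"])
CPU_PERCENT = Gauge("service_cpu_percent", "Process CPU utilization")
MEMORY_BYTES = Gauge("service_memory_rss_bytes", "Process resident memory")
ORDERS_PLACED = Counter("orders_placed_total", "Example business metric: orders placed")

def update_resource_metrics() -> None:
    """Refresh resource utilization gauges; call periodically."""
    proc = psutil.Process()
    CPU_PERCENT.set(proc.cpu_percent(interval=None))
    MEMORY_BYTES.set(proc.memory_info().rss)

Request and error rates then come out of the counters at query time, for example with PromQL's rate() function.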

Step 3: Set Up Dashboards

Create service template dashboard:
- Health status (green/yellow/red)
- Key SLI metrics
- Error breakdown
- Dependency health
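
If you use Grafana, the template dashboard can be provisioned through its HTTP API so every new service starts from the same layout. A rough sketch, assuming a Grafana instance and API token supplied via environment variables; the panel definitions are heavily abbreviated.

# provision_dashboard.py - sketch of creating a service dashboard template via
# Grafana's HTTP API. GRAFANA_URL and GRAFANA_TOKEN are placeholders you supply.
import os
import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")
GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]

def create_service_dashboard(service: str) -> None:
    """Create (or overwrite) a basic health dashboard for one service."""
    dashboard = {
        "title": f"{service} - service health",
        "panels": [
            # Heavily abbreviated: real panels need a datasource, queries, gridPos, etc.
            {"type": "stat", "title": "Health status"},
            {"type": "timeseries", "title": "Key SLI metrics"},
            {"type": "timeseries", "title": "Error breakdown"},
            {"type": "timeseries", "title": "Dependency health"},
        ],
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={"dashboard": dashboard, "overwrite": True},
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    create_service_dashboard("checkout")  # hypothetical service name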

Step 4: Define Alert Policy

- Each alert has a runbook
- Runbook says exactly what to do
- Team members trained on top 10 alerts
- Alert fatigue kept low: < 5 false positives per person per week
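
The "each alert has a runbook" rule is easy to enforce with a lint step over your alert rule files. A sketch, assuming Prometheus-style rule files and a runbook_url annotation (a common convention, not something Prometheus requires):

# lint_alert_rules.py - sketch of a check that every alert carries a runbook link.
# Assumes Prometheus-style rule files; runbook_url is a convention, not mandatory.
import sys
import yaml

def alerts_missing_runbooks(rule_file: str) -> list[str]:
    """Return names of alerting rules that have no runbook_url annotation."""
    with open(rule_file) as f:
        doc = yaml.safe_load(f) or {}
    missing = []
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" in rule and "runbook_url" not in rule.get("annotations", {}):
                missing.append(rule["alert"])
    return missing

if __name__ == "__main__":
    bad = [a for path in sys.argv[1:] for a in alerts_missing_runbooks(path)]
    for alert in bad:
        print(f"Alert '{alert}' has no runbook_url annotation")
    sys.exit(1 if bad else 0)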

Phase 2: Adoption (Weeks 5-12)

Step 1: Onboard First Service

Pick a critical service (not too complex)
- Add complete instrumentation
- Create comprehensive dashboard
- Define alerting rules
- Write runbook
- Train team
- Monitor for 2 weeks
- Iterate on alerts
- Use as template for others

Step 2: Team Training

Session 1 (1 hour): Why monitoring matters
- Show incident timeline without monitoring
- Show with monitoring
- Case study: saved incident

Session 2 (2 hours): How to use dashboards
- Reading different metric types
- Understanding alert context
- Finding relevant data quickly

Session 3 (3 hours): How to build monitoring
- Creating metrics
- Building dashboards
- Setting up alerts
- Writing runbooks

Session 4 (ongoing): Office hours
- Questions about their services
- Help with complex monitoring scenarios
- Share best practices

Step 3: Establish Monitoring Review

In code review, add monitoring review:
- Do metrics make sense?
- Are thresholds appropriate?
- Can we trace request flow?
- Will we know if this fails?
- If you can't answer yes to all of these, don't merge

Step 4: Retrospectives on Monitoring

After each incident:
- Did monitoring catch it?
- If yes: Did we respond fast enough?
- If no: What monitoring would have caught it?
- Add that monitoring
- Update playbook

Phase 3: Scale (Weeks 13+)

Step 1: Expand to All Services

Audit all services:
- Rank by criticality
- Rank by gaps in monitoring
- Create a monitoring plan
- Assign owners
- Track progress

Month 1: Top 5 services
Month 2: Next 10 services
Month 3: Remaining services
Month 4+: New services by default

Step 2: Create Shared Platform

Make monitoring easier for everyone:
- Monitoring templates for common patterns
- Automated metric collection
- Standard dashboard layouts
- Pre-built alert rules
- Shared runbooks
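
For example, the "pre-built alert rules" piece can be a shared helper that stamps out the same baseline rules for any service. A sketch, with illustrative metric names, thresholds, and runbook URLs:

# alert_templates.py - sketch of a shared helper that generates baseline
# Prometheus alert rules for any service. Names and thresholds are illustrative.
import yaml

def baseline_rules(service: str, error_rate_threshold: float = 0.01) -> dict:
    """Return a Prometheus rule group with a standard error-rate alert."""
    return {
        "groups": [{
            "name": f"{service}-baseline",
            "rules": [{
                "alert": f"{service.capitalize()}HighErrorRate",
                "expr": (
                    f'sum(rate(service_errors_total{{job="{service}"}}[5m]))'
                    f' / sum(rate(service_requests_total{{job="{service}"}}[5m]))'
                    f" > {error_rate_threshold}"
                ),
                "for": "10m",
                "labels": {"severity": "page"},
                "annotations": {
                    "summary": f"{service} error rate above {error_rate_threshold:.0%}",
                    "runbook_url": f"https://runbooks.example.com/{service}/high-error-rate",  # placeholder
                },
            }],
        }]
    }

if __name__ == "__main__":
    print(yaml.safe_dump(baseline_rules("checkout")))  # hypothetical service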

Step 3: Build Culture of Monitoring

Celebrate monitoring wins:
- "Monitoring caught this before users noticed"
- "Alert runbook worked perfectly"
- "New engineer added monitoring day 1"

Share learnings:
- Monthly "reliability wins" meeting
- Blog posts about outages prevented
- Metrics on reliability improvements

Step 4: Continuous Improvement

Quarterly reviews:
- Are SLOs still meaningful?
- Which alerts are actually useful?
- What monitoring gaps remain?
- Update standards based on learnings

Tools for Monitoring-First Culture

You don't need fancy tools, but good tools help:

Monitoring Infrastructure

  • Prometheus (metrics collection, alerting)
  • Grafana (dashboards, visualization)
  • ELK Stack (log aggregation)
  • Jaeger (distributed tracing)

Or managed services:

  • Datadog (full observability)
  • New Relic (full observability)
  • up0 (uptime/SLO focused)

Incident Management

  • PagerDuty (on-call, alerting)
  • Opsgenie (alerting, escalation)
  • Incident.io (incident response)

Documentation

  • Runbooks (playbooks for known issues)
  • Dashboards (service health visualization)
  • Status page (communication during incidents)

Key: Pick tools your team will actually use. Over-complicated monitoring systems get ignored.

Addressing Common Obstacles

Obstacle 1: "We're too small for monitoring"

False. You're too small not to have monitoring.

Small team = Everyone on call for production issues
Monitoring = Fewer 3 AM pages
Conclusion: Monitoring saves you time

Action: Start with 1 critical service and 5 key metrics.

Obstacle 2: "We don't have time to add monitoring"

Common perspective: Monitoring is overhead on feature work.

Reality: Monitoring is an investment that pays back immediately.

Without monitoring:
- 3 hours debugging incident per month
- 2 hours lost user productivity
- 1 customer lost
- Cost: $50k/month

With monitoring:
- 30 minutes preventing same incident
- 0 hours lost user productivity
- 0 customers lost
- Cost: $5k/month

ROI: 10x in 1 month, 100x in 6 months

Action: Quantify the cost of incidents. Invest in prevention.

Obstacle 3: "Our team doesn't know how to set up monitoring"

This is the real problem, and it's fixable.

Action:

  1. Hire or designate a "monitoring champion"
  2. Have them build templates for common patterns
  3. Everyone else follows templates
  4. Champion trains team through office hours
  5. Templates improve iteratively

Obstacle 4: "We get too many false alerts"

Root causes:

  1. Wrong thresholds
  2. Alerting on expected events
  3. Alert rules capturing noise

Action: For each alert going off:

  1. Is it a real problem? If not, update threshold
  2. Can we prevent it? If yes, add check before alerting
  3. Is this normal noise? If yes, remove alert

Track false positive rate. Goal: < 5% false positives.

Obstacle 5: "Our dashboards are too complex"

Complexity usually means:

  1. Too many metrics on one dashboard
  2. Metrics that don't relate to each other
  3. No clear story (what does this dashboard show?)

Action: Create focused dashboards:

  • Operator Dashboard: "Is service healthy right now?" (5 key metrics)
  • SRE Dashboard: "Are we meeting SLOs?" (SLI metrics, error budget)
  • Debug Dashboard: "What's actually happening?" (detailed metrics)

Different audiences, different dashboards.

Measuring Success

How do you know monitoring-first culture is working?

Metric 1: MTTR (Mean Time To Recovery)

Before: 120 minutes
Target: 30 minutes
After 3 months: 45 minutes
After 6 months: 25 minutes
→ Success
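
MTTR is worth computing from real incident records rather than gut feel. A minimal sketch with made-up timestamps:

# Mean Time To Recovery from incident records (illustrative data).
from datetime import datetime, timedelta

incidents = [  # (detected, resolved) pairs - replace with your incident tracker's data
    (datetime(2024, 1, 3, 2, 10), datetime(2024, 1, 3, 3, 55)),
    (datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 14, 40)),
    (datetime(2024, 2, 2, 9, 30), datetime(2024, 2, 2, 10, 5)),
]

def mttr(records: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean of (resolved - detected) across all incidents."""
    total = sum(((resolved - detected) for detected, resolved in records), timedelta())
    return total / len(records)

print(mttr(incidents))  # 1:00:00 for the sample data above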

Metric 2: Incidents Prevented

Track incidents that would have happened:
"Monitoring alert caught degradation before it became outage"
Target: 50% of would-be incidents prevented
Measure by reviewing potential failure scenarios

Metric 3: On-Call Satisfaction

Survey on-call engineers:
"When paged, was it a real problem?"
Before: 30% (lots of false positives)
Target: 90%+ (mostly real problems)
After 3 months: 85%
After 6 months: 92%
→ Success

Metric 4: SLO Compliance

Track whether you're meeting defined SLOs:
Before: No SLOs defined
After: 95% SLO compliance
→ Success (and a data-backed basis for feature vs reliability trade-offs)

Metric 5: Monitoring Coverage

Services with adequate monitoring:
Before: 20%
Target: 100%
After 3 months: 40%
After 6 months: 75%
After 9 months: 100%
→ Success

Cultural Shifts to Embrace

From "Monitoring Team" to "Everyone's Responsibility"

Before: "Tell the monitoring team and they'll fix it"
After: "As the author, you own monitoring for this service"
Impact: Monitoring gets better because authors understand their service

From "Reactive Firefighting" to "Proactive Prevention"

Before: Problem → Alert → Page → Investigate
After: Risk identification → Alert tuning → Problem prevented
Impact: Fewer incidents, better sleep, happier team

From "More Metrics Is Better" to "Right Metrics Matter"

Before: 10,000 metrics, 500 dashboards, 100 alerts
After: 100 meaningful metrics, 10 focused dashboards, 20 effective alerts
Impact: Faster problem identification, less alert fatigue

From "After Deployment" to "Before Deployment"

Before: "Deploy, then figure out monitoring"
After: "Monitor is part of definition of done"
Impact: Services are observable from day 1

Example: Small Team Implementation

Let's say you're a 5-engineer team with 3 services:

Month 1: Foundation

  • Week 1: Define 1 SLO per service (availability: 99.9%)
  • Week 2: Set up Prometheus, Grafana
  • Week 3: Instrument service #1 completely
  • Week 4: Create dashboards, alerts, runbooks for service #1

Month 2: Adoption

  • Week 5: Team training on dashboards
  • Weeks 6-7: Instrument services #2 and #3
  • Week 8: Create dashboards and alerts for remaining services

Month 3+: Operation and Improvement

  • Weekly: Review any false positive alerts
  • Monthly: Check SLO compliance
  • Quarterly: Adjust SLOs based on data

Investment: ~20% of engineer time for 1 quarter, then ~5% ongoing.

Return:

  • Incidents drop from 3/month to <1/month
  • MTTR drops from 120min to 30min
  • On-call satisfaction improves dramatically
  • Team sleeps better

Getting Started Today

You don't need to be perfect. You just need to start:

This week:

  1. Pick your most important service
  2. Define one SLO for it (99.9% availability)
  3. Start collecting availability metrics

Next week:

  1. Create a dashboard showing that metric
  2. Set up one alert for SLO breach
  3. Write a runbook for what to do when it alerts
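
If Prometheus is already collecting those availability metrics, the SLO-breach alert from step 2 can start life as a simple scripted check against the Prometheus HTTP API before you wire it into a full alerting pipeline. A sketch, with the server URL and metric names as assumptions:

# slo_check.py - sketch of a scripted SLO-breach check against the Prometheus
# HTTP API. PROM_URL and the metric names are assumptions for illustration.
import requests

PROM_URL = "http://localhost:9090"
SLO_TARGET = 0.999

# 30-day availability: 1 - (failed requests / all requests).
QUERY = (
    "1 - (sum(rate(service_errors_total[30d]))"
    " / sum(rate(service_requests_total[30d])))"
)

def availability() -> float:
    """Query Prometheus and return the measured availability as a fraction."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 1.0

if __name__ == "__main__":
    current = availability()
    if current < SLO_TARGET:
        print(f"SLO BREACH: availability {current:.4%} < target {SLO_TARGET:.1%}")
    else:
        print(f"OK: availability {current:.4%}")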

Following week:

  1. Train your team on the dashboard
  2. Adjust alert threshold based on real data
  3. Repeat for next service

That's it. You're now monitoring-first.


Start building a monitoring-first culture with up0. Multi-region monitoring helps teams adopt monitoring practices faster. Get started free.