Building a monitoring-first culture
Great monitoring isn't just tooling—it's culture. Learn how top engineering teams build observability into every decision.
David Kim
The Problem: Reactive vs Proactive Monitoring
Most engineering teams are in reactive mode:
- User: "Your service is broken"
- Engineer: Gets paged at 3 AM
- Team: Debugs frantically for 2 hours
- Root cause: "We didn't know it could fail this way"
- Post-mortem: "We need better monitoring"
- Repeat: Next incident in 2 weeks
This cycle is expensive—in stress, productivity, and customer trust.
Monitoring-first teams work differently:
- Engineer: Deploys new service
- Monitoring system: "New endpoint discovered, setting up checks"
- Degradation occurs: System alerts before users notice
- Engineer: Fixes issue in 10 minutes
- No customer impact, no incident
The difference? Culture.
What Is a Monitoring-First Culture?
A monitoring-first culture means:
1. Monitoring is a first-class concern
- As important as code quality
- Required before deployment
- Measured and tracked
2. Visibility is the default
- Services are observable by default
- Metrics are instrumented automatically
- Dashboards exist before they're needed
3. Proactive is expected
- Monitoring prevents incidents
- Teams own alerting rules
- On-call is effective, not painful
4. Data drives decisions
- Reliability is measurable
- Trade-offs are informed by data
- SLOs guide priorities
Building Monitoring-First Culture: Key Changes
Change 1: Monitoring Is Definition of Done
In most teams, "done" means:
- Code is written
- Tests pass
- Code review approved
- Deployed to production
In monitoring-first teams, "done" means all of the above, PLUS:
Checklist: Definition of Done
✓ Code written and reviewed
✓ Unit tests passing
✓ Integration tests passing
✓ Deployed to staging
✓ Deployed to production
✓ Metrics instrumented
✓ Dashboards created
✓ Alerts configured
✓ Runbook written
✓ Team trained
If monitoring isn't done, the feature isn't done.
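One way to keep that checklist honest is to make part of it machine-checkable. The sketch below is hypothetical: it assumes each service ships a small monitoring manifest (here, a monitoring.json with metrics, dashboard_url, alerts, and runbook_url keys) that CI verifies before a release. The file format and keys are placeholders, not a standard.

```python
# Hypothetical CI gate: fail the build if a service's monitoring manifest is
# missing any piece of the monitoring definition of done.
# The monitoring.json format and its keys are assumptions for this sketch.
import json
import sys

REQUIRED = ["metrics", "dashboard_url", "alerts", "runbook_url"]

def missing_items(path: str) -> list[str]:
    """Return the required manifest keys that are absent or empty."""
    with open(path) as f:
        manifest = json.load(f)
    return [key for key in REQUIRED if not manifest.get(key)]

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "monitoring.json"
    missing = missing_items(path)
    if missing:
        print(f"Monitoring definition of done not met, missing: {', '.join(missing)}")
        sys.exit(1)
    print("Monitoring definition of done satisfied")
```

Whatever the mechanism, the point is the same: the checklist is enforced, not aspirational.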
Change 2: Every Engineer Owns Monitoring
Not just the "monitoring team" (if you have one).
Every engineer who builds a service owns:
- Metrics for that service
- Dashboards for that service
- Alerts for that service
- Runbooks for that service
- On-call support for that service
This creates accountability and prevents blind spots.
Change 3: Alerts Are Sacred
In reactive teams:
- Lots of alerts
- Most are false positives
- Engineers mute them
- Real problems get lost
- Alert fatigue is normal
In monitoring-first teams:
- Fewer alerts, all meaningful
- Each alert has a runbook
- On-call responds immediately
- False positives are investigated and fixed
- Alerts are predictive, not reactive
Rule: If you get paged, it's a real problem. If it's not, fix the alert.
Change 4: Observability Is Built In
Not added after the fact.
Monitoring-first teams:
New service being built?
→ Instrumentation happens day 1
→ Metrics framework is part of service skeleton
→ Logging is configured
→ Tracing is enabled
→ Dashboard template is created
Not:
Service deployed to production
→ "We should probably add monitoring"
→ 3 weeks later, basic metrics added
→ Alert fatigue from wrong thresholds
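To make the "day 1" pattern concrete, here is a minimal sketch of a service skeleton with logging and the standard request, error, and latency metrics wired in before any business logic exists. It assumes Flask and prometheus_client are available; the service name, ports, and /healthz endpoint are placeholders.

```python
# Minimal sketch of a service skeleton that is observable from day 1.
# Assumes Flask and prometheus_client; names and ports are placeholders.
import logging
import time

from flask import Flask, g, request
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "HTTP requests",
                   ["method", "path", "status"])
LATENCY = Histogram("http_request_duration_seconds",
                    "Request latency in seconds", ["path"])

def create_app() -> Flask:
    # Logging is configured before the first request is ever served.
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
    app = Flask("example-service")

    @app.before_request
    def start_timer():
        g.start = time.perf_counter()

    @app.after_request
    def record_request(response):
        # Request rate, error rate (via the status label), and latency for every route.
        REQUESTS.labels(request.method, request.path, str(response.status_code)).inc()
        LATENCY.labels(request.path).observe(time.perf_counter() - g.start)
        return response

    @app.route("/healthz")
    def healthz():
        return "ok"

    return app

if __name__ == "__main__":
    start_http_server(9090)  # exposes /metrics for Prometheus to scrape
    create_app().run(port=8080)
```

Tracing configuration and a dashboard template ride along in the same skeleton; nothing ships without them.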
Change 5: SLOs Drive Priorities
In reactive teams:
- "We should improve reliability"
- No one knows how much or which parts
- Reliability work loses to feature work
In monitoring-first teams:
Availability SLO: 99.9%
Current: 99.2%
Error budget: Exhausted
Priority 1: Fix availability (not features)
Target: 99.9%
Timeline: 2 sprints
Data drives what matters.
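The error-budget arithmetic behind an example like this is simple enough to sanity-check by hand. A sketch, assuming a 30-day rolling window:

```python
# Back-of-the-envelope error budget math for a 99.9% availability SLO
# over a 30-day window, with the service currently at 99.2%.
SLO = 0.999
ACTUAL = 0.992
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

budget_minutes = (1 - SLO) * WINDOW_MINUTES    # ~43.2 minutes of allowed downtime
used_minutes = (1 - ACTUAL) * WINDOW_MINUTES   # ~345.6 minutes actually down

print(f"Error budget:     {budget_minutes:.1f} min")
print(f"Downtime used:    {used_minutes:.1f} min")
print(f"Budget remaining: {budget_minutes - used_minutes:.1f} min")  # negative means exhausted
```

When the remaining budget goes negative, the conversation about priorities is already over: reliability work comes first.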
Practical Steps to Build Monitoring-First Culture
Phase 1: Establish Foundations (Weeks 1-4)
Step 1: Define Core SLOs
Start with 1-2 SLOs per service:
- Availability: 99.9%
- Latency (p95): 200ms
Step 2: Create Monitoring Standards
Every service must have:
- Request rate metric
- Error rate metric
- Latency percentile metrics
- Resource utilization (CPU, memory)
- Custom business metrics
Step 3: Set Up Dashboards
Create service template dashboard:
- Health status (green/yellow/red)
- Key SLI metrics
- Error breakdown
- Dependency health
Step 4: Define Alert Policy
- Each alert has a runbook
- Runbook says exactly what to do
- Team members trained on top 10 alerts
- Alert fatigue < 5 false positives/week/person
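A lightweight way to enforce the first rule above, "each alert has a runbook," is to keep alert definitions in code and reject any that lack one. The snippet below is an illustrative sketch, not real Prometheus or PagerDuty configuration; the alert names, conditions, and URLs are made up.

```python
# Sketch of enforcing "every alert has a runbook" before rules are merged.
# Alert definitions here are illustrative placeholders, not real rules.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    condition: str      # e.g. a PromQL expression, kept as an opaque string here
    severity: str       # "page" or "ticket"
    runbook_url: str

ALERTS = [
    Alert("HighErrorRate", "error rate > 1% for 5m", "page",
          "https://wiki.example.com/runbooks/high-error-rate"),
    Alert("SLOBurnRateFast", "burning 30d error budget in < 6h", "page",
          "https://wiki.example.com/runbooks/slo-burn"),
]

def validate(alerts: list[Alert]) -> None:
    """Reject any alert definition that does not link to a runbook."""
    for alert in alerts:
        if not alert.runbook_url:
            raise ValueError(f"Alert {alert.name} has no runbook; fix before merging")

validate(ALERTS)
```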
Phase 2: Adoption (Weeks 5-12)
Step 1: Onboard First Service
Pick a critical service (not too complex)
- Add complete instrumentation
- Create comprehensive dashboard
- Define alerting rules
- Write runbook
- Train team
- Monitor for 2 weeks
- Iterate on alerts
- Use as template for others
Step 2: Team Training
Session 1 (1 hour): Why monitoring matters
- Show incident timeline without monitoring
- Show with monitoring
- Case study: saved incident
Session 2 (2 hours): How to use dashboards
- Reading different metric types
- Understanding alert context
- Finding relevant data quickly
Session 3 (3 hours): How to build monitoring
- Creating metrics
- Building dashboards
- Setting up alerts
- Writing runbooks
Session 4 (ongoing): Office hours
- Questions about their services
- Help with complex monitoring scenarios
- Share best practices
Step 3: Establish Monitoring Review
In code review, add monitoring review:
- Do metrics make sense?
- Are thresholds appropriate?
- Can we trace request flow?
- Will we know if this fails?
If you can't answer yes to all of these, don't merge.
Step 4: Retrospectives on Monitoring
After each incident:
- Did monitoring catch it?
- If yes: Did we respond fast enough?
- If no: What monitoring would have caught it?
- Add that monitoring
- Update playbook
Phase 3: Scale (Weeks 13+)
Step 1: Expand to All Services
Audit all services:
- Ranked by criticality
- Ranked by gaps in monitoring
- Create monitoring plan
- Assign owners
- Track progress
Month 1: Top 5 services
Month 2: Next 10 services
Month 3: Remaining services
Month 4+: New services by default
Step 2: Create Shared Platform
Make monitoring easier for everyone:
- Monitoring templates for common patterns
- Automated metric collection
- Standard dashboard layouts
- Pre-built alert rules
- Shared runbooks
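For the "monitoring templates for common patterns" item, a platform team might publish a helper like the one sketched below, so any team gets call, error, and latency metrics with a single decorator. It assumes prometheus_client; the metric and task names are placeholders.

```python
# Sketch of a shared instrumentation helper a platform team could publish so
# teams get standard metrics without hand-rolling them for each function.
import functools
import time

from prometheus_client import Counter, Histogram

CALLS = Counter("task_calls_total", "Task invocations", ["task", "outcome"])
DURATION = Histogram("task_duration_seconds", "Task duration in seconds", ["task"])

def observed(task_name: str):
    """Wrap a function with standard call-count, error, and latency metrics."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                CALLS.labels(task_name, "success").inc()
                return result
            except Exception:
                CALLS.labels(task_name, "error").inc()
                raise
            finally:
                DURATION.labels(task_name).observe(time.perf_counter() - start)
        return wrapper
    return decorator

@observed("send_invoice")
def send_invoice(invoice_id: int) -> None:
    ...  # business logic goes here
```

Adoption stays at one line per function, which is what makes a template stick.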
Step 3: Build Culture of Monitoring
Celebrate monitoring wins:
- "Monitoring caught this before users noticed"
- "Alert runbook worked perfectly"
- "New engineer added monitoring day 1"
Share learnings:
- Monthly "reliability wins" meeting
- Blog posts about outages prevented
- Metrics on reliability improvements
Step 4: Continuous Improvement
Quarterly reviews:
- Are SLOs still meaningful?
- Which alerts are actually useful?
- What monitoring gaps remain?
- Update standards based on learnings
Tools for Monitoring-First Culture
You don't need fancy tools, but good tools help:
Monitoring Infrastructure
- Prometheus (metrics collection, alerting)
- Grafana (dashboards, visualization)
- ELK Stack (log aggregation)
- Jaeger (distributed tracing)
Or managed services:
- Datadog (full observability)
- New Relic (full observability)
- up0 (uptime/SLO focused)
Incident Management
- PagerDuty (on-call, alerting)
- Opsgenie (alerting, escalation)
- Incident.io (incident response)
Documentation
- Runbooks (playbooks for known issues)
- Dashboards (service health visualization)
- Status page (communication during incidents)
Key: Pick tools your team will actually use. Over-complicated monitoring systems get ignored.
Addressing Common Obstacles
Obstacle 1: "We're too small for monitoring"
False. You're too small not to have monitoring.
Small team = Everyone on call for production issues
Monitoring = Fewer 3 AM pages
Conclusion: Monitoring saves you time
Action: Start with 1 critical service and 5 key metrics.
Obstacle 2: "We don't have time to add monitoring"
Common perspective: Monitoring is overhead on feature work.
Reality: Monitoring is an investment that pays back immediately.
Without monitoring:
- 3 hours debugging incident per month
- 2 hours lost user productivity
- 1 customer lost
- Cost: $50k/month
With monitoring:
- 30 minutes preventing same incident
- 0 hours lost user productivity
- 0 customers lost
- Cost: $5k/month
ROI: 10x in 1 month, 100x in 6 months
Action: Quantify the cost of incidents. Invest in prevention.
Obstacle 3: "Our team doesn't know how to set up monitoring"
This is the real problem, and it's fixable.
Action:
- Hire or designate a "monitoring champion"
- Have them build templates for common patterns
- Everyone else follows templates
- Champion trains team through office hours
- Templates improve iteratively
Obstacle 4: "We get too many false alerts"
Root causes:
- Wrong thresholds
- Alerting on expected events
- Alert rules capturing noise
Action: For each alert going off:
- Is it a real problem? If not, update threshold
- Can we prevent it? If yes, add check before alerting
- Is this normal noise? If yes, remove alert
Track false positive rate. Goal: < 5% false positives.
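Tracking that rate doesn't need tooling to start. A sketch, assuming on-call simply records each page and whether it pointed at a real problem:

```python
# Tiny sketch of computing the false-positive rate from an on-call log.
# The entries below are made-up examples.
pages = [
    {"alert": "HighErrorRate", "real_problem": True},
    {"alert": "DiskAlmostFull", "real_problem": False},  # threshold too tight
    {"alert": "SLOBurnRateFast", "real_problem": True},
]

false_positives = sum(1 for p in pages if not p["real_problem"])
rate = false_positives / len(pages)
print(f"False positive rate: {rate:.0%}")  # aim for under 5%
```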
Obstacle 5: "Our dashboards are too complex"
Complexity usually means:
- Too many metrics on one dashboard
- Metrics that don't relate to each other
- No clear story (what does this dashboard show?)
Action: Create focused dashboards:
- Operator Dashboard: "Is service healthy right now?" (5 key metrics)
- SRE Dashboard: "Are we meeting SLOs?" (SLI metrics, error budget)
- Debug Dashboard: "What's actually happening?" (detailed metrics)
Different audiences, different dashboards.
Measuring Success
How do you know monitoring-first culture is working?
Metric 1: MTTR (Mean Time To Recovery)
Before: 120 minutes
Target: 30 minutes
After 3 months: 45 minutes
After 6 months: 25 minutes
→ Success
Metric 2: Incidents Prevented
Track incidents that would have happened:
"Monitoring alert caught degradation before it became outage"
Target: 50% of would-be incidents prevented
Measure by reviewing potential failure scenarios
Metric 3: On-Call Satisfaction
Survey on-call engineers:
"When paged, was it a real problem?"
Before: 30% (lots of false positives)
Target: 90%+ (mostly real problems)
After 3 months: 85%
After 6 months: 92%
→ Success
Metric 4: SLO Compliance
Track whether you're meeting defined SLOs:
Before: No SLOs defined
After: 95% SLO compliance
→ Success (and a data-backed basis for trading off feature work against reliability work)
Metric 5: Monitoring Coverage
Services with adequate monitoring:
Before: 20%
Target: 100%
After 3 months: 40%
After 6 months: 75%
After 9 months: 100%
→ Success
Cultural Shifts to Embrace
From "Monitoring Team" to "Everyone's Responsibility"
Before: "Tell the monitoring team and they'll fix it"
After: "As the author, you own monitoring for this service"
Impact: Monitoring gets better because authors understand their service
From "Reactive Firefighting" to "Proactive Prevention"
Before: Problem → Alert → Page → Investigate
After: Risk identification → Alert tuning → Problem prevented
Impact: Fewer incidents, better sleep, happier team
From "More Metrics Is Better" to "Right Metrics Matter"
Before: 10,000 metrics, 500 dashboards, 100 alerts
After: 100 meaningful metrics, 10 focused dashboards, 20 effective alerts
Impact: Faster problem identification, less alert fatigue
From "After Deployment" to "Before Deployment"
Before: "Deploy, then figure out monitoring"
After: "Monitor is part of definition of done"
Impact: Services are observable from day 1
Example: Small Team Implementation
Let's say you're a 5-engineer team with 3 services:
Month 1: Foundation
- Week 1: Define 1 SLO per service (availability: 99.9%)
- Week 2: Set up Prometheus, Grafana
- Week 3: Instrument service #1 completely
- Week 4: Create dashboards, alerts, runbooks for service #1
Month 2: Adoption
- Week 5: Team training on dashboards
- Weeks 6-7: Instrument services #2 and #3
- Week 8: Create dashboards and alerts for remaining services
Month 3+: Operation and Improvement
- Weekly: Review any false positive alerts
- Monthly: Check SLO compliance
- Quarterly: Adjust SLOs based on data
Investment: ~20% of engineer time for 1 quarter, then ~5% ongoing.
Return:
- Incidents drop from 3/month to <1/month
- MTTR drops from 120min to 30min
- On-call satisfaction improves dramatically
- Team sleeps better
Getting Started Today
You don't need to be perfect. You just need to start:
This week:
- Pick your most important service
- Define one SLO for it (99.9% availability)
- Start collecting availability metrics
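Collecting that first availability metric can be as small as a looped probe against a health endpoint. A sketch: the URL and /healthz path are placeholders, and in practice the results would go to your metrics store rather than stdout.

```python
# Minimal availability probe: hit the service's health endpoint once a minute
# and record up/down. The URL is a placeholder for your own service.
import time
import urllib.request

URL = "https://api.example.com/healthz"

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

if __name__ == "__main__":
    while True:
        up = probe(URL)
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {'up' if up else 'DOWN'}")
        time.sleep(60)
```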
Next week:
- Create a dashboard showing that metric
- Set up one alert for SLO breach
- Write a runbook for what to do when it alerts
Following week:
- Train your team on the dashboard
- Adjust alert threshold based on real data
- Repeat for next service
That's it. You're now monitoring-first.
Start building monitoring-first culture with up0. Multi-region monitoring helps teams adopt monitoring practices faster. Get started free.