Observability platform for a healthcare SaaS
Built monitoring, alerting, and dashboards from scratch. Mean time to detection dropped from 47 minutes to under 3 minutes.
Context
A healthcare SaaS platform serving 200+ clinics had no centralised monitoring. The team discovered outages when customers called. Mean time to detection was 47 minutes — an eternity when patient scheduling depends on your uptime.
Logging existed, but it was scattered across services, with no correlation, no alerting, and no dashboards.
Challenge
Build a complete observability stack — metrics, logs, traces — across 8 microservices and 3 databases. The system processes sensitive health data, so all telemetry had to stay within their compliance boundary.
The team needed to go from “we find out when users complain” to “we know before users notice.”
Approach
Instrumentation first. We added OpenTelemetry to every service — structured logs, request traces, and business metrics (appointment bookings, API latency by clinic). No code rewrites, just instrumentation layers.
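To make that concrete, here is a minimal sketch of what such an instrumentation layer can look like in a Python service. The service name, metric names, the clinic_id attribute, and the book_appointment handler are all illustrative, not the client's actual code, and exporter wiring is omitted.

```python
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("scheduling-service")  # service name is illustrative
meter = metrics.get_meter("scheduling-service")

# Business metric: appointments booked, labelled by clinic.
bookings = meter.create_counter(
    "appointment_bookings_total",
    unit="1",
    description="Appointments booked, by clinic",
)

# API latency histogram, also labelled by clinic.
latency = meter.create_histogram(
    "api_request_duration_ms",
    unit="ms",
    description="API request duration, by clinic",
)


def book_appointment(clinic_id: str, payload: dict) -> None:
    """Hypothetical handler: existing logic is wrapped, not rewritten."""
    start = time.monotonic()
    with tracer.start_as_current_span("book_appointment") as span:
        span.set_attribute("clinic.id", clinic_id)
        ...  # existing booking logic stays as-is
    bookings.add(1, {"clinic_id": clinic_id})
    latency.record((time.monotonic() - start) * 1000, {"clinic_id": clinic_id})
```

The point of the pattern is that the handler body stays untouched; telemetry wraps around it.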
Centralised stack. Prometheus for metrics, Grafana for dashboards, Alertmanager for routing. All deployed within their existing AWS VPC to satisfy compliance requirements.
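A sketch of the corresponding Prometheus configuration, assuming static in-VPC targets; the hostnames, ports, and file names are placeholders, not the actual deployment.

```yaml
# prometheus.yml sketch; hostnames, ports, and file names are placeholders.
global:
  scrape_interval: 15s

rule_files:
  - alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.internal:9093"]

scrape_configs:
  - job_name: "scheduling-service"
    static_configs:
      # 9464 is the conventional OpenTelemetry Prometheus exporter port.
      - targets: ["scheduling.internal:9464"]
```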
Alert design. We wrote alerts based on symptoms, not causes. “API error rate above 1% for 5 minutes” rather than “CPU above 80%.” Fewer alerts, each one actionable. Routed through PagerDuty with clear escalation paths.
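The error-rate symptom above translates directly into a Prometheus alerting rule. A sketch, assuming the conventional http_requests_total metric name (the actual series names may differ):

```yaml
# alerts.yml sketch; http_requests_total is a conventional name, assumed here.
groups:
  - name: api-symptoms
    rules:
      - alert: HighApiErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 1% for the last 5 minutes"
```

The `for: 5m` clause is what keeps the rule symptom-based and noise-resistant: the condition has to hold, not just spike.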
Dashboards for humans. Three dashboards: executive (uptime, SLA), engineering (latency, error rates, saturation), and on-call (active alerts, recent deployments).
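As an illustration, the engineering dashboard's three signals map onto PromQL queries along these lines, again assuming conventional metric names such as http_requests_total and node_exporter's node_cpu_seconds_total:

```promql
# p99 API latency per service
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: CPU busy fraction per instance
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```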
Outcome
- Mean time to detection dropped from 47 minutes to under 3 minutes
- Alert noise reduced by 80%; every alert that fires now requires action
- Team resolved their first proactively detected incident within the first week
- On-call rotations now functional and sustainable
- Full compliance audit trail for telemetry data
Technologies
Prometheus, Grafana, Alertmanager, OpenTelemetry, AWS CloudWatch, PagerDuty.