Observability platform for a healthcare SaaS

Built monitoring, alerting, and dashboards from scratch. Mean time to detection dropped from 47 minutes to under 3.

MTTD: 47 minutes to under 3 minutes
April 2025
Prometheus, Grafana, Alertmanager, OpenTelemetry, AWS CloudWatch, PagerDuty

Context

A healthcare SaaS platform serving 200+ clinics had no centralised monitoring. The team discovered outages when customers called. Mean time to detection was 47 minutes — an eternity when patient scheduling depends on your uptime.
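
For readers unfamiliar with the metric, mean time to detection is the average gap between when an incident starts and when the team becomes aware of it. The sketch below shows the arithmetic only. The function name and the timestamps are hypothetical and are not data from this engagement.

```python
from datetime import datetime, timedelta


def mean_time_to_detection(incidents):
    """Average gap between incident start and detection.

    `incidents` is a list of (started_at, detected_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    gaps = [detected - started for started, detected in incidents]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incidents, made up for illustration only.
incidents = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 30)),     # 30 min
    (datetime(2025, 1, 14, 14, 0), datetime(2025, 1, 14, 15, 0)),  # 60 min
    (datetime(2025, 2, 2, 8, 15), datetime(2025, 2, 2, 9, 0)),     # 45 min
]

mttd = mean_time_to_detection(incidents)
print(mttd.total_seconds() / 60)  # 45.0
```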

They had logs, but these were scattered across services, with no correlation between them, no alerting, and no dashboards.

Challenge

Build a complete observability stack — metrics, logs, traces — across 8 microservices and 3 databases. The system processes sensitive health data, so all telemetry had to stay within their compliance boundary.

The team needed to go from “we find out when users complain” to “we know before users notice.”

Approach

Instrumentation first. We added OpenTelemetry to every service — structured logs, request traces, and business metrics (appointment bookings, API latency by clinic). No code rewrites, just instrumentation layers.
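
To illustrate what an instrumentation layer can look like without touching business logic, here is a minimal sketch using only the Python standard library. It is a stand-in for the idea, not the OpenTelemetry API and not the code used on this project. The `instrumented` decorator, the field names, and the `book_appointment` handler are all hypothetical.

```python
import functools
import json
import time
import uuid


def instrumented(operation, emit=print):
    """Wrap a handler so each call emits one structured log record.

    Stdlib stand-in for an instrumentation layer; not the OpenTelemetry API.
    Only the clinic id is recorded, never the handler's other arguments,
    so patient data stays out of the telemetry.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(clinic_id, *args, **kwargs):
            record = {
                "operation": operation,
                "clinic_id": clinic_id,
                "trace_id": uuid.uuid4().hex,
                "status": "ok",
            }
            start = time.perf_counter()
            try:
                return handler(clinic_id, *args, **kwargs)
            except Exception as exc:
                record["status"] = "error"
                record["error_type"] = type(exc).__name__
                raise
            finally:
                elapsed = time.perf_counter() - start
                record["duration_ms"] = round(elapsed * 1000, 3)
                emit(json.dumps(record))
        return wrapper
    return decorator


# Hypothetical handler: the business logic itself is unchanged.
@instrumented("book_appointment")
def book_appointment(clinic_id, patient_ref, slot):
    return {"clinic_id": clinic_id, "slot": slot, "confirmed": True}
```

Calling `book_appointment("clinic-17", "p-001", "09:30")` returns the booking as before and prints one JSON line carrying the operation name, clinic, a trace id, status, and duration. In a real deployment the `emit` callable would hand the record to whatever exporter the stack uses.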

Centralised stack. Prometheus for metrics, Grafana for dashboards, Alertmanager for routing. All deployed within their existing AWS VPC to satisfy compliance requirements.

Alert design. We wrote alerts based on symptoms, not causes. “API error rate above 1% for 5 minutes” rather than “CPU above 80%.” Fewer alerts, each one actionable. Routed through PagerDuty with clear escalation paths.
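
The logic behind a symptom-based alert such as "API error rate above 1% for 5 minutes" can be sketched in a few lines of Python. This is an illustration of the idea, similar in spirit to the `for` clause of a Prometheus alerting rule, and not the rule definitions used on this project. It assumes one sample per minute; the function name and sample data are hypothetical.

```python
def alert_fires(samples, threshold=0.01, hold_minutes=5):
    """Return True once the error rate has stayed above `threshold`
    for `hold_minutes` consecutive samples.

    `samples` is a list of (requests, errors) tuples, assumed to be
    taken one minute apart.
    """
    consecutive = 0
    for requests, errors in samples:
        rate = errors / requests if requests else 0.0
        # Any sample at or below the threshold resets the streak.
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive >= hold_minutes:
            return True
    return False


# Hypothetical data: a 3-minute spike at 5% errors, then recovery.
brief_spike = [(1000, 50)] * 3 + [(1000, 2)] * 5

# Hypothetical data: 2% errors sustained for 5 minutes.
sustained = [(1000, 20)] * 5

print(alert_fires(brief_spike))  # False
print(alert_fires(sustained))    # True
```

The hold period is what keeps noise down: a short spike that clears on its own never pages anyone, while a sustained problem that users would notice does.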

Dashboards for humans. Three dashboards: executive (uptime, SLA), engineering (latency, error rates, saturation), and on-call (active alerts, recent deployments).

Outcome

  • Mean time to detection dropped from 47 minutes to under 3 minutes
  • Alert noise reduced by 80% — every alert requires action
  • Team resolved their first proactively detected incident within the first week
  • On-call rotations now functional and sustainable
  • Full compliance audit trail for telemetry data

Technologies

Prometheus, Grafana, Alertmanager, OpenTelemetry, AWS CloudWatch, PagerDuty.


© 2026