Observability platform for a healthcare SaaS
Built monitoring, alerting, and dashboards from scratch. Mean time to detection dropped from 47 minutes to under 3 minutes.
Context
A healthcare SaaS platform serving 200+ clinics had no centralised monitoring. The team discovered outages when customers called. Mean time to detection was 47 minutes — an eternity when patient scheduling depends on your uptime.
Logging existed, but it was scattered across services, with no correlation, no alerting, and no dashboards.
Challenge
Build a complete observability stack — metrics, logs, traces — across 8 microservices and 3 databases. The system processes sensitive health data, so all telemetry had to stay within their compliance boundary.
The team needed to go from “we find out when users complain” to “we know before users notice.”
Approach
Instrumentation first. We added OpenTelemetry to every service — structured logs, request traces, and business metrics (appointment bookings, API latency by clinic). No code rewrites, just instrumentation layers.
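To make that concrete, here is a minimal sketch of what such an instrumentation layer can look like in a Python service. The service name, metric names, the clinic_id attribute, and the book_appointment handler are all illustrative, not the client's actual code, and exporter wiring is omitted.

```python
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("scheduling-service")  # service name is illustrative
meter = metrics.get_meter("scheduling-service")

# Business metric: appointments booked, labelled by clinic.
bookings = meter.create_counter(
    "appointment_bookings_total",
    unit="1",
    description="Appointments booked, by clinic",
)

# API latency histogram, also labelled by clinic.
latency = meter.create_histogram(
    "api_request_duration_ms",
    unit="ms",
    description="API request duration, by clinic",
)


def book_appointment(clinic_id: str, payload: dict) -> None:
    """Hypothetical handler: existing logic is wrapped, not rewritten."""
    start = time.monotonic()
    with tracer.start_as_current_span("book_appointment") as span:
        span.set_attribute("clinic.id", clinic_id)
        ...  # existing booking logic stays as-is
    bookings.add(1, {"clinic_id": clinic_id})
    latency.record((time.monotonic() - start) * 1000, {"clinic_id": clinic_id})
```

The point of the pattern is that the handler body stays untouched; telemetry wraps around it.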
Centralised stack. Prometheus for metrics, Grafana for dashboards, Alertmanager for routing. All deployed within their existing AWS VPC to satisfy compliance requirements.
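A sketch of the corresponding Prometheus configuration, assuming static in-VPC targets; the hostnames, ports, and file names are placeholders, not the actual deployment.

```yaml
# prometheus.yml sketch; hostnames, ports, and file names are placeholders.
global:
  scrape_interval: 15s

rule_files:
  - alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.internal:9093"]

scrape_configs:
  - job_name: "scheduling-service"
    static_configs:
      # 9464 is the conventional OpenTelemetry Prometheus exporter port.
      - targets: ["scheduling.internal:9464"]
```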
Alert design. We wrote alerts based on symptoms, not causes. “API error rate above 1% for 5 minutes” rather than “CPU above 80%.” Fewer alerts, each one actionable. Routed through PagerDuty with clear escalation paths.
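The error-rate symptom above translates directly into a Prometheus alerting rule. A sketch, assuming the conventional http_requests_total metric name (the actual series names may differ):

```yaml
# alerts.yml sketch; http_requests_total is a conventional name, assumed here.
groups:
  - name: api-symptoms
    rules:
      - alert: HighApiErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 1% for the last 5 minutes"
```

The `for: 5m` clause is what keeps the rule symptom-based and noise-resistant: the condition has to hold, not just spike.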
Dashboards for humans. Three dashboards: executive (uptime, SLA), engineering (latency, error rates, saturation), and on-call (active alerts, recent deployments).
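As an illustration, the engineering dashboard's three signals map onto PromQL queries along these lines, again assuming conventional metric names such as http_requests_total and node_exporter's node_cpu_seconds_total:

```promql
# p99 API latency per service
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: CPU busy fraction per instance
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```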
Outcome
- Mean time to detection dropped from 47 minutes to under 3 minutes
- Alert noise reduced by 80%; every alert that fires now requires action
- Team resolved their first proactively detected incident within the first week
- On-call rotations now functional and sustainable
- Full compliance audit trail for telemetry data
Technologies
Prometheus, Grafana, Alertmanager, OpenTelemetry, AWS CloudWatch, PagerDuty.