aws · cloudwatch · monitoring · devops · alb

The CloudWatch Alarm Nobody Sets Up (But Should)

High CPU alerts are standard. Alerts on unexpectedly low traffic are not, and they are often the difference between finding an outage yourself and hearing about it from your customers.

Every AWS account I've ever inherited has CloudWatch alarms for high CPU. Most have alarms for high error rates. Almost none have alarms for traffic that's suspiciously quiet.

That's a problem, because low traffic is often the scariest signal of all.

Why silence is louder than noise

High CPU means something is working hard. High error rates mean something is failing loudly. But when traffic drops 80% from its normal baseline at 2pm on a Tuesday, nothing is screaming. Everything just... stops.

That pattern has one of a few causes, and none of them are good:

  • A deployment went out and broke the routing rules
  • The load balancer looks healthy but the targets behind it aren't: they return 200s to health checks and drop real requests
  • A DNS change propagated incorrectly and half your traffic is hitting a dead endpoint
  • A feature flag got flipped the wrong way and a percentage of users are silently getting no response
  • Someone updated a security group and blocked port 443 in production

The first three items on that list have happened to me on client infrastructure. The DNS one cost a client about four hours of lost revenue before anyone noticed — because their on-call alert was set up for 5xx errors, not for traffic volume. Zero requests means zero errors. The alarm never fired.

The fix: an ALB RequestCount alarm with a lower threshold

The setup takes about 10 minutes in the console.

Go to CloudWatch → Alarms → Create Alarm.

Select the metric: ApplicationELB → Per AppELB Metrics → RequestCount. Choose your load balancer.

Set the statistic to Sum, period to 5 minutes.

For the threshold, you need your normal baseline. Open the metric graph and look at the last two weeks of traffic for the same time window you care about (business hours, or 24/7 if you're running something with consistent volume). Pick a number that represents "definitely not normal": typically 20% to 30% of baseline works. If your ALB normally sees 5,000 requests per 5 minutes during the day, set the alarm threshold at 1,000.
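If the alarm is managed in Terraform, the same arithmetic can live next to it so the threshold is derived from the baseline instead of being hard-coded. This is a minimal sketch: the variable names and the 20% default are illustrative assumptions, and the baseline itself still has to come from your own metric graph.

```hcl
# Hypothetical inputs: names and defaults are illustrative, not a convention.
variable "alb_baseline_requests_per_period" {
  description = "Typical RequestCount (Sum) per 5-minute period during the hours you care about."
  type        = number
  default     = 5000
}

variable "alb_low_traffic_percent" {
  description = "Percentage of baseline below which traffic counts as abnormally low."
  type        = number
  default     = 20

  validation {
    condition     = var.alb_low_traffic_percent > 0 && var.alb_low_traffic_percent < 100
    error_message = "The percentage must be greater than 0 and less than 100."
  }
}

locals {
  # Integer percentage keeps the arithmetic exact: 5000 * 20 / 100 = 1000.
  alb_low_traffic_threshold = floor(
    var.alb_baseline_requests_per_period * var.alb_low_traffic_percent / 100
  )
}
```

With the defaults this evaluates to 1,000, matching the example above. An alarm resource could then reference local.alb_low_traffic_threshold as its threshold.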

Condition: Lower than [threshold].

Alarm state action: notify an SNS topic. If you're already routing CloudWatch alerts to Slack, wire it to the same topic. If not, set up an SNS → Lambda → Slack webhook — 20 minutes, done once.
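For teams that keep alerting in Terraform, the SNS side of that wiring might look like the sketch below. The topic name, var.environment, and aws_lambda_function.slack_notifier are assumptions: the Lambda that posts to the Slack webhook is taken to exist already and is not shown here.

```hcl
# Topic that CloudWatch alarms publish to.
resource "aws_sns_topic" "alerts" {
  name = "${var.environment}-cloudwatch-alerts"
}

# Deliver every message on the topic to the (assumed) Slack notifier Lambda.
resource "aws_sns_topic_subscription" "alerts_to_slack" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_notifier.arn
}

# Without this permission SNS cannot invoke the function and deliveries fail.
resource "aws_lambda_permission" "allow_sns_invoke" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.slack_notifier.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.alerts.arn
}
```

Scoping the permission with source_arn means only this topic can trigger the function, which is usually what you want for an alerting path.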

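Total cost: effectively nothing. CloudWatch's free tier covers 10 alarm metrics; beyond that, a standard-resolution alarm costs roughly $0.10 a month at the time of writing. SNS costs fractions of a penny per notification.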

The Terraform version, because clicking through consoles doesn't scale

# Assumes var.environment, var.alb_low_traffic_threshold, aws_lb.main and
# aws_sns_topic.alerts are defined elsewhere in the configuration.
resource "aws_cloudwatch_metric_alarm" "alb_low_traffic" {
  alarm_name          = "${var.environment}-alb-low-request-count"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 2 # two consecutive 5-minute periods must breach
  metric_name         = "RequestCount"
  namespace           = "AWS/ApplicationELB"
  period              = 300 # seconds
  statistic           = "Sum"
  threshold           = var.alb_low_traffic_threshold

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_description = "ALB request count is significantly below normal baseline — possible deployment failure, routing issue, or upstream breakage."
  alarm_actions     = [aws_sns_topic.alerts.arn]
  ok_actions        = [aws_sns_topic.alerts.arn] # notify on recovery too

  # Missing data counts as a breach, so a fully silent ALB still alarms.
  treat_missing_data = "breaching"
}

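treat_missing_data = "breaching" is intentional. An ALB only publishes RequestCount data points while requests are flowing through it, so a total outage shows up as missing data, not as a zero. If CloudWatch stops receiving the metric entirely, whether because the ALB is gone or because traffic has stopped altogether, you want the alarm to fire, not to sit in an "Insufficient data" state while production is on fire.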

A few things worth knowing before you set it up

Your threshold needs to account for time of day. A 1,000 request threshold might be fine at 2pm but will fire every night at 3am if your traffic drops off outside business hours. You have a few options: use CloudWatch Anomaly Detection instead of a static threshold (it learns your traffic pattern), set up different alarms for business hours vs. off-hours, or suppress the alarm's actions during known low-traffic periods (for example, by disabling and re-enabling alarm actions on a schedule).
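The Anomaly Detection option can be expressed in Terraform roughly as follows. Treat it as a sketch under assumptions: the band width of 2 standard deviations is an arbitrary starting point to tune, and aws_lb.main, aws_sns_topic.alerts, and var.environment are assumed to exist in your configuration.

```hcl
resource "aws_cloudwatch_metric_alarm" "alb_low_traffic_anomaly" {
  alarm_name          = "${var.environment}-alb-request-count-below-band"
  comparison_operator = "LessThanLowerThreshold" # only alarm on the low side
  evaluation_periods  = 2
  threshold_metric_id = "band" # compare against the band, not a fixed number
  treat_missing_data  = "breaching"

  # Expected range learned from the metric's history.
  # The second argument is the band width in standard deviations (assumed: 2).
  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(requests, 2)"
    label       = "RequestCount (expected range)"
    return_data = true
  }

  # The raw metric the band is built from.
  metric_query {
    id          = "requests"
    return_data = true

    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"

      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  alarm_description = "ALB request count is below the expected range for this time of day."
  alarm_actions     = [aws_sns_topic.alerts.arn]
  ok_actions        = [aws_sns_topic.alerts.arn]
}
```

The model needs some history before the band is meaningful, so it may be worth running this alongside the static-threshold alarm for a couple of weeks before relying on it alone.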

Set an OK action too. When traffic recovers, you want a notification. Otherwise you're left wondering whether it resolved itself or whether someone manually fixed something without telling anyone.

Pair it with a 5xx error rate alarm. Low traffic + normal errors means the traffic genuinely disappeared. Low traffic + high error rate means traffic is arriving but something is actively broken. They tell you different things.
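One way to build that companion alarm is a metric math expression that divides target 5xx responses by total requests. The sketch below assumes a 5% threshold, which is a placeholder to tune, and reuses the assumed aws_lb.main, aws_sns_topic.alerts, and var.environment names.

```hcl
resource "aws_cloudwatch_metric_alarm" "alb_high_5xx_rate" {
  alarm_name          = "${var.environment}-alb-high-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5 # percent; an assumed starting point

  # Only this query returns data; the alarm evaluates the computed rate.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "Target 5xx rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"

    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"

      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"

    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"

      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  # No 5xx data points means no errors were recorded, so do not alarm on gaps.
  treat_missing_data = "notBreaching"

  alarm_description = "Share of requests answered with a target 5xx is above the threshold."
  alarm_actions     = [aws_sns_topic.alerts.arn]
  ok_actions        = [aws_sns_topic.alerts.arn]
}
```

Note the opposite treat_missing_data choice from the low-traffic alarm: here silence is the healthy case, so gaps should not page anyone.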

The metric nobody checks is usually the useful one

High CPU is easy to alarm on because it feels like a problem. Low traffic feels like a quiet afternoon — which is why it takes longer to notice when it's actually an incident.

A deployment that quietly breaks routing, a DNS misconfiguration that takes half your traffic to a dead IP, a load balancer that stops forwarding requests after a certificate renewal — none of these produce the kind of noise that wakes people up. They just stop producing signal.

Set this alarm up once per environment. It costs next to nothing, and it will tell you about an outage before your customers reach your inbox.
