What Fintech Infrastructure Teaches You That Nothing Else Will

Most DevOps engineers think about uptime in percentages. 99.9%. 99.99%. Five nines.

Financial market infrastructure doesn't work like that.

When the infrastructure you run carries trade orders for banks and financial institutions, uptime stops being a metric on a dashboard. It becomes something more immediate. Every second the system is down, orders aren't executing. Positions aren't being settled. People are picking up phones.

You stop saying "we had 99.95% availability last month." You start saying "we were down for four minutes on Tuesday and here's exactly what happened."

That shift changes how you build things.
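The gap between the two ways of talking is easy to see with arithmetic. Here is a minimal sketch, assuming a 30-day month (the function name and the 30-day window are my own choices for illustration, not from any particular SLA):

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime an availability target permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Assumed targets, matching the percentages people usually quote.
    for target in (99.9, 99.95, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.2f} minutes of downtime")
```

On those assumptions, 99.95% permits roughly 21.6 minutes a month. A four-minute outage on Tuesday fits comfortably inside the percentage and is still an incident somebody has to explain.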

The observability problem most teams get wrong

There's a version of observability that looks good in demos. You have Grafana dashboards with colourful graphs. Metrics everywhere. A monitoring tab that's permanently open on someone's second monitor.

Nobody looks at it until something breaks.

That's not observability. That's post-mortem archaeology.

Real observability means the system tells you something is wrong before your users do. Not a CPU spike that you happened to notice, but an alert that fires because a specific condition crossed a threshold you defined deliberately. The difference matters because in financial infrastructure, by the time a user reports a problem, the problem is already several minutes old.
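One way to picture a deliberately defined threshold is a rule that names the condition, how long it must hold, and who gets paged. The sketch below is illustrative only: the metric name, the 250 ms threshold, the three-sample window, and the on-call target are all hypothetical, and a real setup would normally live in an alerting system rather than in hand-written code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertRule:
    name: str          # hypothetical metric name
    threshold: float   # value the metric must exceed
    for_samples: int   # consecutive breaching samples required before firing
    notify: str        # hypothetical on-call target


def evaluate(rule: AlertRule, samples: Sequence[float]) -> Optional[str]:
    """Return an alert message if the last `for_samples` readings all breach."""
    if rule.for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    recent = samples[-rule.for_samples:]
    if len(recent) == rule.for_samples and all(s > rule.threshold for s in recent):
        return (f"ALERT {rule.name}: above {rule.threshold} for "
                f"{rule.for_samples} consecutive samples, page {rule.notify}")
    return None


if __name__ == "__main__":
    rule = AlertRule("order_ack_latency_ms", 250.0, 3, "trading-infra-oncall")
    print(evaluate(rule, [120, 300, 310]))       # None: only two breaching samples so far
    print(evaluate(rule, [120, 300, 310, 290]))  # fires: three breaches in a row
```

Requiring several consecutive breaches is a design choice: it trades a little detection speed for fewer pages caused by a single noisy reading.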

The question isn't "do we have monitoring?" Almost everyone has monitoring. The question is: does your monitoring surface the right things, fast enough, to the right people?

Most teams can't answer that confidently. They find out during incidents.

Change management isn't bureaucracy

Somewhere along the way, change management got a reputation as the thing that slows you down. The CAB meeting nobody wants. The ticket you have to fill in before touching production.

That reputation is earned by bad implementations, not by the idea itself.

In environments where a misconfigured routing rule can affect trade execution across multiple institutions, change management isn't overhead. It's the thing that keeps a routine deployment from becoming a Friday evening incident. Peer review before anything touches production. A rollback plan that exists before the change, not after. A known state you can return to if something behaves unexpectedly.
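Those three conditions can be checked mechanically before a change is allowed anywhere near production. This is a rough sketch, not a description of any real change-management tool: the ChangeRequest fields and the blocking_issues function are names I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    # All fields are hypothetical, chosen to mirror the three conditions above.
    summary: str
    author: str
    reviewers: List[str] = field(default_factory=list)
    rollback_plan: str = ""
    known_good_version: str = ""


def blocking_issues(change: ChangeRequest) -> List[str]:
    """List the reasons this change should not touch production yet."""
    issues = []
    independent = [r for r in change.reviewers if r != change.author]
    if not independent:
        issues.append("no peer review by someone other than the author")
    if not change.rollback_plan.strip():
        issues.append("no rollback plan written before the change")
    if not change.known_good_version.strip():
        issues.append("no known state to return to")
    return issues


if __name__ == "__main__":
    change = ChangeRequest(summary="Update routing rule", author="alice")
    for issue in blocking_issues(change):
        print("BLOCKED:", issue)
```

A check like this takes seconds to run. The point is not the code, it is that the rollback plan has to exist before the gate opens.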

The teams I've seen who resist change management most strongly are usually the same teams who spend the most time firefighting. The correlation isn't coincidental.

Done properly, change management is a feature. It's how you ship confidently instead of nervously.

The 3am runbook test

There's a simple test for any runbook: hand it to the person who will actually be woken up at 3am to use it, and watch them follow it cold.

If they have to stop and interpret anything, the runbook has failed. If they need to know context that isn't written down, the runbook has failed. If they have to make a judgment call that the runbook didn't anticipate, the runbook has failed.

Runbooks written by architects for engineers don't pass this test. Neither do runbooks written once and never updated. The only runbooks that work under pressure are the ones written by people who have been on-call, updated by people who have used them during incidents, and tested by people who are seeing them for the first time.

This sounds obvious. Most runbooks don't reflect it.
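Some of this can be tracked rather than hoped for. As a sketch under stated assumptions, the record below stores when a runbook was last updated, last used in an incident, and last tested cold. The field names and the 180-day staleness limit are hypothetical, and no check like this replaces actually watching someone follow the runbook.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RunbookRecord:
    # Hypothetical bookkeeping fields for one runbook.
    title: str
    last_updated: date
    last_incident_use: Optional[date]  # None if never used in an incident
    last_cold_test: Optional[date]     # None if never tested by a first-time reader


def runbook_gaps(rb: RunbookRecord, today: date, max_age_days: int = 180) -> List[str]:
    """Return the reasons this runbook is unlikely to hold up at 3am."""
    gaps = []
    if rb.last_cold_test is None:
        gaps.append("never tested cold by a first-time reader")
    if rb.last_incident_use is not None and rb.last_updated < rb.last_incident_use:
        gaps.append("not updated since it was last used in an incident")
    if (today - rb.last_updated).days > max_age_days:
        gaps.append(f"not updated in over {max_age_days} days")
    return gaps


if __name__ == "__main__":
    rb = RunbookRecord("Order gateway failover", date(2024, 1, 10), date(2024, 3, 1), None)
    for gap in runbook_gaps(rb, today=date(2024, 3, 15)):
        print("GAP:", gap)
```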

The common thread across all three is discipline applied before the incident, not during it. Observability built before the alert fires. Change controls in place before the deployment goes wrong. Runbooks written before the 3am call.

The boring operational work that nobody wants to do in advance is exactly the work that determines whether an incident is a managed problem or a full-scale emergency.

The boring stuff is what keeps the lights on.
