Being Software Craftsman (DFTBA)

Timeouts, Retries, and Circuit Breakers — Lessons From Building Real Systems

If you’ve spent enough time building and running distributed systems, you eventually learn this the hard way:
Systems don’t fail politely.

They slow down. They half-work. They drag everything else down with them.

Over the years, while designing and operating production systems, three ideas have repeatedly proven to be non-negotiable:

Timeouts protect your resources.
Retries help with transient failures.
Circuit breakers prevent cascading failures.

This isn’t theory. This is operational reality.


Timeouts: Respect Your Own Limits

Every external call is a bet.
You’re betting that a dependency will respond soon enough.

Without a timeout, that bet has unlimited downside.

I’ve seen services that were “up” but completely unavailable because all their threads were waiting on a slow dependency. No crashes. No alerts. Just quiet failure.

Timeouts put an upper bound on damage.

They force you to answer a simple but uncomfortable question:
How long is this call allowed to block my system?

A few practical rules I follow:

A slow service is survivable.
An unbounded wait is not.
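
To make the idea of an upper bound concrete, here is a minimal sketch in Python using asyncio. The names call_with_timeout, DependencyTimeout and slow_dependency are made up for illustration, and the timeout value is an assumption, not a recommendation. In a real service you would usually also set the timeout on the client library or socket itself.

```python
import asyncio


class DependencyTimeout(Exception):
    """Raised when a dependency does not answer within its time budget."""


async def call_with_timeout(make_call, timeout_s):
    """Await make_call(), but give up after timeout_s seconds.

    asyncio.wait_for cancels the pending call on timeout, so the caller
    is released instead of waiting indefinitely.
    """
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise DependencyTimeout(f"no response within {timeout_s}s") from None


async def slow_dependency():
    # Hypothetical stand-in for a dependency that has stopped responding.
    await asyncio.sleep(3600)
    return "too late"
```

Calling asyncio.run(call_with_timeout(slow_dependency, 0.05)) raises DependencyTimeout instead of waiting out the hour. The caller gets its resources back and can decide what to do next.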


Retries: Use Them With Care

Some failures are temporary. A network hiccup. A short spike in load. A rolling deployment somewhere else.

Retries can smooth over these bumps — when used deliberately.

But retries are also one of the easiest ways to take a bad situation and make it worse.

I’ve seen retry storms bring down perfectly capable systems simply because everyone retried at the same time.

What works in practice:

Retries should be a tool for recovery, not a denial of reality.
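
As one possible shape for a deliberate retry, the sketch below caps the number of attempts and waits a randomised, exponentially growing delay between them, so clients that failed together do not all retry at the same instant. Everything here is an assumption for illustration: the TransientError marker, the function name, and the default values. It also assumes the operation is safe to repeat (idempotent).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def retry_with_backoff(call, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep, rng=random.random):
    """Run call() up to max_attempts times, sleeping between failed attempts.

    After failed attempt k (counting from 0) the wait is drawn uniformly
    from [0, min(max_delay_s, base_delay_s * 2**k)), often called
    "full jitter". Errors other than TransientError are not retried.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * cap)
```

The sleep and rng parameters exist only so the behaviour can be tested without real waiting. The important properties are the hard cap on attempts and the jitter.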


Circuit Breakers: Know When to Stop

There’s a point where continuing to call a failing dependency is irresponsible.

At that point, the most valuable thing your system can do is fail fast.

This is where circuit breakers earn their place.

A circuit breaker observes failures over time and makes a decision:

This simple mechanism prevents the kind of chain reaction where one struggling service takes out half the system.

In production, circuit breakers are less about protecting the dependency and more about protecting everything else.
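
For readers who have not built one, here is a minimal sketch of the commonly described three-state breaker (closed, open, half-open). It is an illustration under stated assumptions, not a production implementation: it counts consecutive failures rather than a failure rate over a window, it is not thread-safe, and the class name, thresholds and cool-down are invented for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half-open"  # cool-down elapsed: allow a trial call
        return "open"

    def call(self, fn):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open and restart the cool-down
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, callers get CircuitOpenError immediately and the struggling dependency receives no traffic from this client. After the cool-down, one successful trial call closes it again; a failed trial reopens it.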


These Patterns Work as a Set

Timeouts, retries, and circuit breakers are not optional add-ons.
They are layers of defence.

Pattern         | Primary role
Timeout         | Protects your service
Retry           | Absorbs transient faults
Circuit Breaker | Prevents system-wide failure

The order matters:

Call → Timeout → Retry (bounded) → Circuit Breaker

Skip one, and you create a weak point.
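
One way to read that ordering in code: each attempt carries its own timeout, a bounded retry loop wraps the attempts, and the breaker sits on the outside, counting a failure only when the whole retry budget is spent. This is a sketch of one possible composition, not the only valid one. The names Breaker and resilient_call are hypothetical, the dependency is assumed to accept a timeout and raise TimeoutError, and jitter is left out to keep the example short.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to try the dependency at all."""


class Breaker:
    """Tiny consecutive-failure breaker, just enough for this example."""

    def __init__(self, threshold, cooldown_s, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures, self.opened_at = 0, None

    def allow(self):
        # Closed, or open long enough that a trial call is permitted.
        return self.opened_at is None or self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


def resilient_call(call, breaker, timeout_s=0.5, max_attempts=3,
                   backoff_s=0.1, sleep=time.sleep):
    """Breaker (outermost) -> bounded retry -> per-attempt timeout -> call."""
    if not breaker.allow():
        raise CircuitOpenError("failing fast; dependency is marked unhealthy")
    for attempt in range(max_attempts):
        try:
            result = call(timeout_s)  # the call itself must honour the timeout
        except TimeoutError:
            if attempt == max_attempts - 1:
                breaker.record(ok=False)  # one failure per exhausted retry budget
                raise
            sleep(backoff_s * 2 ** attempt)
        else:
            breaker.record(ok=True)
            return result
```

With this layering, a hung dependency costs at most max_attempts timeouts plus backoff per request, and once the breaker opens, further requests cost almost nothing.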


A Mistake I Still See Too Often

“We have retries, so we’re covered.”

Usually what that means is:

This setup works… until it doesn’t. And when it fails, it fails loudly and expensively.
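
A little arithmetic shows why retries on their own are not cover. The two helper functions below are hypothetical and compute simple upper bounds: how many calls can reach the deepest dependency when every layer of a call chain retries independently, and how long a single caller can be held up when every attempt times out.

```python
def worst_case_attempts(attempts_per_layer, layers):
    """Upper bound on calls hitting the deepest dependency for one user
    request when each layer in the chain retries independently."""
    return attempts_per_layer ** layers


def worst_case_wait_s(timeout_s, max_attempts, base_delay_s):
    """Upper bound on time one caller spends on a dependency when every
    attempt times out, with exponential backoff (no jitter) in between."""
    backoff = sum(base_delay_s * 2 ** n for n in range(max_attempts - 1))
    return max_attempts * timeout_s + backoff


print(worst_case_attempts(3, 3))        # 27
print(worst_case_wait_s(2.0, 3, 0.5))   # 7.5
```

Three layers that each try three times can turn one request into 27 calls against a dependency that is already struggling, and a 2-second timeout with three attempts can hold a caller for 7.5 seconds. Without any timeout, the second number has no bound at all.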


Closing Thoughts

Resilience isn’t about adding more code or more libraries.
It’s about engineering discipline.

If your system is going to fail (and it will), make sure it fails fast, stays contained, and behaves predictably.

That’s not just good architecture.
That’s craftsmanship.
