The spectrum of error handling

Posted on 2026-04-10 :: systems-programming learning

I've been reading about how TigerBeetle approaches correctness, and one thing that stuck with me is their aggressive use of assertions.

Most codebases I've worked on treat assertions as debug or testing tools. You sprinkle a few in during development, and it's even rarer to leave them in prod. TigerBeetle does the opposite: assertions stay in prod, and they're a core part of the correctness strategy.

This got me thinking about error handling more broadly. There's a spectrum here that most code blurs together, and I think that's a mistake.

Three categories of "something went wrong"

1. Expected failures (recoverable)

Network is down, file doesn't exist, user gave you garbage input, whatever, right? These aren't bugs; they're normal operation. The system should handle them gracefully.

function read_config(path):
    contents = read_file(path)    // might fail - file doesn't exist
    return parse(contents)         // might fail - malformed

The caller decides what to do.
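
A minimal Python sketch of this category, assuming the config is a JSON file. `ConfigError` and `load_or_default` are hypothetical names made up for illustration, not part of any library:

```python
import json
from pathlib import Path


class ConfigError(Exception):
    """Expected failure: the config is missing or malformed."""


def read_config(path):
    # Both failure modes are normal operation, so they surface as one
    # well-defined error type that the caller can catch on purpose.
    try:
        contents = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError as exc:
        raise ConfigError(f"config file not found: {path}") from exc
    try:
        return json.loads(contents)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"config file is malformed: {path}") from exc


def load_or_default(path, default):
    # This particular caller chooses to fall back to a default.
    # Another caller might show the message to the user instead.
    try:
        return read_config(path)
    except ConfigError:
        return default
```

Only the two anticipated failures are translated into `ConfigError`. Anything else (a permissions problem, say) is left to propagate untouched in this sketch.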

2. Bugs (should never happen)

This is different. An index out of bounds, or an NPE where a null should never exist. A state that your code assumed was impossible.

These aren't "errors" to handle gracefully. They're bugs. Something is already wrong with your program. If you're here, your assumptions about reality are broken.

3. Invariant violations (corruption is worse than crashing)

This is TigerBeetle's territory. Things that should always be true about your system. Not "probably true" or "usually true". To take an example from their domain: a bank balance should never be negative (unless you allow overdrafts), that kind of thing.

If these invariants break, something is deeply wrong. Maybe a cosmic ray flipped a bit (I've heard this happened during an election). Maybe (probably) there's a bug you haven't found. Maybe memory got corrupted. Whatever the case, continuing means potentially propagating corruption through your system.

function transfer(from, to, amount):
    total_before = from.balance + to.balance
    from.balance -= amount
    to.balance += amount

    // Invariant: a transfer moves money, it never creates or destroys it
    assert(from.balance + to.balance == total_before)

TigerBeetle's philosophy is to crash and restart from a known-good state. A crash is recoverable but corrupted data propagating silently is not.
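
As a rough Python sketch of the same idea: the built-in `assert` statement is removed when Python runs with `-O`, so a check that is meant to stay in production has to raise explicitly. The `invariant` helper, `InvariantViolation`, and `Account` below are hypothetical names for illustration, and this is not how TigerBeetle itself is implemented:

```python
from dataclasses import dataclass


class InvariantViolation(AssertionError):
    """Raised when something that must always be true is not."""


def invariant(condition, message):
    # Unlike the `assert` statement, this check is not stripped by
    # `python -O`, so it keeps running in production.
    if not condition:
        raise InvariantViolation(message)


@dataclass
class Account:
    balance: int  # integer minor units (e.g. cents), an assumption here


def transfer(source, dest, amount):
    total_before = source.balance + dest.balance
    source.balance -= amount
    dest.balance += amount
    # Invariant: a transfer moves money, it never creates or destroys it.
    invariant(
        source.balance + dest.balance == total_before,
        "transfer changed the total amount of money",
    )
```

If nothing upstream catches `InvariantViolation`, the process exits with a traceback. Restarting it from a known-good state would be the job of a supervisor outside this sketch.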

Why most code blurs these together

I've seen (and written) code that treats all three the same way:

function process(data):
    if data.items is empty:
        return Error("empty data")  // Is this a bug or expected input?
    // ...

The problem is that the caller doesn't know whether "empty data" means "user gave bad input, show them a message" or "something is fundamentally broken, we should crash."
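
One way to remove that ambiguity is to let each function's contract say which category an empty input falls into. A hedged Python sketch, where `EmptyUpload`, `process_user_upload`, and `process_validated_batch` are all invented for this example:

```python
class EmptyUpload(Exception):
    """Expected failure: the user submitted nothing. Show them a message."""


def process_user_upload(items):
    # Input comes straight from a user, so "empty" is a normal case
    # and gets a recoverable, purpose-built error type.
    if not items:
        raise EmptyUpload("no items were uploaded")
    return sum(items)


def process_validated_batch(items):
    # Contract: callers have already validated the batch, so an empty
    # one means a bug upstream. Raised explicitly because the `assert`
    # statement would be stripped under `python -O`.
    if not items:
        raise AssertionError("bug: validated batch is empty")
    return sum(items)
```

The same condition now produces two different signals, and a caller that catches `EmptyUpload` will not accidentally swallow the bug case.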

Defensive programming makes this worse. "Handle every error gracefully" sounds safe, but it can mean silently continuing when you shouldn't. You end up with systems that limp along in corrupted states instead of failing loudly at the first sign of trouble.

The TigerBeetle stance

TigerBeetle is building a financial database. Correctness matters more than uptime. Their position, based on my reading of their docs, is:

If an invariant is important enough to check, it's important enough to crash for
Assertions stay in production
A crash + restart from clean state is better than silent corruption
"Impossible" states should be asserted, not handled <- big one here
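
To make the last point concrete, here is a small Python sketch of asserting an "impossible" state instead of quietly returning a default. The `TransferState` enum and `next_states` function are hypothetical, chosen only to illustrate the pattern:

```python
from enum import Enum


class TransferState(Enum):
    PENDING = "pending"
    POSTED = "posted"
    VOIDED = "voided"


def next_states(state):
    if state is TransferState.PENDING:
        return {TransferState.POSTED, TransferState.VOIDED}
    if state is TransferState.POSTED or state is TransferState.VOIDED:
        return set()  # terminal states: nothing may follow
    # No friendly default here. Reaching this line means a new state was
    # added without updating this function, or the value is corrupted.
    raise AssertionError(f"impossible transfer state: {state!r}")
```

The tempting "defensive" version would end with `return set()`, which makes an unknown state look like a valid terminal one and hides the problem.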

This is extreme for most applications. But the underlying principle is sound: be clear about what category each failure falls into, and handle each appropriately.

What I'm taking from this

I haven't tried this approach in my own code yet. But I want to on my next project.

The mental shift is to stop treating all errors the same. And while I'm hitting the keyboard with my head (how I mostly produce code) I'll ask myself:

Is this expected? Let the caller decide.
Is this a bug or an unexpected state? Panic. Fix the bug.
Is this an invariant that must always hold? Assert it in production, crash if it doesn't.

The third category is the one I've been mostly ignoring. My instinct has been to always try to "handle" impossible states. But that's not really handling, it's hiding. If some state or invariant is broken, I want to know immediately, not three hours after deploy during an incident.

We'll see how it goes. Maybe I'll write a follow-up after I've actually tried being more aggressive with assertions. For now, it's just an idea that sounds great in theory but may be overkill for projects that aren't high-stakes.

But the core insight already feels true: not all errors are the same. And error handling is apparently something I have a lot to learn about.