I have a side project called trueseal-sync (coming soon). It's an E2EE sync primitive built on top of the Noise protocol. The relay is written in Go, the client in Rust, and they speak Noise XX over plain TCP. Boring on purpose.
Everything worked perfectly when I ran the relay natively with go run. Then I dockerized it. And about 90% of the time, the handshake failed with Decrypt (AEAD MAC failure on msg1). Same client. Same relay code. Same machine. Just one ran inside a container and the other didn't.
This post is the autopsy. Four wrong theories, one pcap, and a punchline I should've seen at hour one.
Theory 1: it's a partial TCP read
This was my first instinct and I was very confident about it.
The reasoning: native works, Docker fails → the difference must be in network shape.
On loopback, small writes usually coalesce into a single read. Docker NAT can split them. Noise XX msg1 is [len:2][body:96], written by the relay as two Write calls. On loopback they arrive as one read, but through a bridge they might not. If my framing code only read 2 bytes and assumed the body was there, state would diverge and AEAD would fail.
This is also the bug I would expect myself to write. I hand-rolled the framing.
I checked my framing and session code: read_exact everywhere, and it had been there since v0.1.0.
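For reference, here is a minimal sketch of what split-safe framing for a [len:2][body] message can look like. It is not the trueseal code: the function name read_frame, the big-endian length prefix, and the OneByteReader helper are all assumptions made for illustration.

```rust
use std::io::{self, Read};

/// Reads one `[len:2][body:len]` frame.
/// Assumption: the 2-byte length prefix is big-endian.
pub fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 2];
    // read_exact keeps calling read() until the buffer is full, so a prefix
    // or body split across several TCP segments is still assembled correctly.
    reader.read_exact(&mut len_buf)?;
    let len = u16::from_be_bytes(len_buf) as usize;

    let mut body = vec![0u8; len];
    reader.read_exact(&mut body)?;
    Ok(body)
}

/// Hypothetical test helper: hands out at most one byte per read() call,
/// imitating the worst-case segmentation of a TCP stream.
pub struct OneByteReader<'a>(pub &'a [u8]);

impl<'a> Read for OneByteReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let data: &'a [u8] = self.0;
        if data.is_empty() || buf.is_empty() {
            return Ok(0);
        }
        buf[0] = data[0];
        self.0 = &data[1..];
        Ok(1)
    }
}
```

Feeding read_frame through OneByteReader is a cheap way to show that the framing does not depend on how the bytes are chunked on the way in. Note that this holds for a blocking reader; on a non-blocking socket, read_exact gives up with an error on WouldBlock, so a retry layer is needed on top.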
Theory dead, as well as my hopes for an easy fix.
Theory 2: it's the MTU
I have a weird docker context due to some company security measures, and this wouldn't be the first time that it makes my hair fall out. ifconfig on the host showed bridge100 at MTU 1500. From inside a container: eth0 at MTU 1280. My docker context creates a bridged network and apparently picks 1280 for the VM's eth0.
Smaller MTU + bigger sender + broken PMTUD = silent black-holing of larger segments. New networking stacks ship with weird defaults all the time.
So I bumped the MTU, re-ran the test. Same failure rate.
It also occurred to me, slightly too late, that Noise handshake messages are under 100 bytes. MTU 1280 vs 1500 cannot possibly matter for a 98-byte payload (please don't ask me how I know this now, but I think it is a clear indication of how much I suffer).
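The arithmetic is quick to check. The sketch below assumes IPv4 and deliberately uses the largest headers the two protocols allow (60 bytes each), so it overestimates the packet size; the function names are made up for this example.

```rust
/// Worst-case header sizes in bytes: IPv4 and TCP each cap their header at 60.
const MAX_IPV4_HEADER: usize = 60;
const MAX_TCP_HEADER: usize = 60;

/// Size of the IP packet carrying `payload` bytes of TCP data, assuming the
/// largest possible IPv4 and TCP headers.
pub fn worst_case_packet_size(payload: usize) -> usize {
    MAX_IPV4_HEADER + MAX_TCP_HEADER + payload
}

/// True if a single segment with `payload` bytes fits in `mtu` without
/// needing fragmentation, even with worst-case headers.
pub fn fits_in_mtu(payload: usize, mtu: usize) -> bool {
    worst_case_packet_size(payload) <= mtu
}
```

With a 98-byte payload the worst case is 218 bytes, far below either 1280 or 1500.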
Theory dead, as well as my hopes for a complicated fix.
Theory 3: my read path mishandles real network latency
I captured a pcap on bridge100 and looked at it. Sure enough: relay's msg1 arrives as two TCP segments, length 2 then length 96. Separated by anywhere from 1µs (when they coalesce) to ~70µs (when they don't). On the failing attempts, the gap was always wider.
The new theory was that my client uses set_nonblocking(true) on the TcpStream before the handshake, and the trueseal-noise crate has a RetryConn wrapper that loops on WouldBlock. Maybe RetryConn has a subtle bug that only manifests when reads actually split and the second read returns WouldBlock first. On loopback, the data is always already in the kernel buffer, so the retry path never executes. On Docker, it does.
This was my favourite theory, quite "elegant" if I may. It explained the timing, the intermittency, and the "only on Docker" pattern.
I vibe-coded a debug wrapper called SlowStream that:
- caps every read() to 1 byte,
- sleeps 5ms before each call,
- returns WouldBlock on every other call to force the retry loop.
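A wrapper with those three behaviours can be sketched roughly as follows. This is a reconstruction from the description, not the original SlowStream: the configurable delay, the choice to start with a WouldBlock, and the read_exact_retrying helper (a hypothetical stand-in for a retry loop such as RetryConn) are assumptions.

```rust
use std::io::{self, ErrorKind, Read};
use std::thread;
use std::time::Duration;

/// Debug wrapper that makes any reader behave like a slow, choppy,
/// non-blocking socket.
pub struct SlowStream<R> {
    inner: R,
    delay: Duration,
    block_next: bool,
}

impl<R: Read> SlowStream<R> {
    pub fn new(inner: R, delay: Duration) -> Self {
        // Start with a WouldBlock so the retry path is exercised immediately.
        SlowStream { inner, delay, block_next: true }
    }
}

impl<R: Read> Read for SlowStream<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        thread::sleep(self.delay); // sleep before each call
        let block = self.block_next;
        self.block_next = !block;
        if block {
            // Every other call pretends no data is available yet.
            return Err(io::Error::from(ErrorKind::WouldBlock));
        }
        if buf.is_empty() {
            return Ok(0);
        }
        // Cap the read at a single byte.
        self.inner.read(&mut buf[..1])
    }
}

/// Hypothetical retry loop: fills `buf`, retrying on WouldBlock/Interrupted.
pub fn read_exact_retrying<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<()> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => return Err(io::Error::from(ErrorKind::UnexpectedEof)),
            Ok(n) => filled += n,
            Err(e) if e.kind() == ErrorKind::WouldBlock
                || e.kind() == ErrorKind::Interrupted =>
            {
                // A real client would wait for readiness here; spinning
                // keeps the sketch short.
                continue;
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

Wrapping the socket as SlowStream::new(stream, Duration::from_millis(5)) reproduces the conditions described above: 1-byte reads, a 5ms pause per call, and a WouldBlock on every other call.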
Then I pointed the client at the native relay (loopback, where it normally works 100%).
Handshake succeeded. Every time. Slowly, but successfully.
Theory dead, as well as my desire to live.
Theory 4: the bridge is mangling bytes
By this point I was running out of ideas. Let's at least look at the bytes.
I added byte-level logging to two places: the client (every read() and write() on the TcpStream, hex-dumped) and the relay (the exact msg0 it received and the exact msg1 it sent, hex-dumped). Then I ran tcpdump in parallel and grabbed the wire bytes.
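Comparing captures like these by eye gets old quickly, so a pair of small helpers is handy. The sketch below is illustrative only; to_hex and first_mismatch are hypothetical names, not part of trueseal.

```rust
/// Lowercase hex encoding of a byte slice, for logging reads and writes.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Index of the first position where two captures differ, or None if they
/// are identical. A length difference counts as a mismatch at the end of
/// the shorter capture.
pub fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common).find(|&i| a[i] != b[i]).or_else(|| {
        if a.len() != b.len() {
            Some(common)
        } else {
            None
        }
    })
}
```

Running first_mismatch over each pair (client log vs. wire, wire vs. relay log) answers the "did anything change in transit" question with a single Option per pair.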
Three views of the same handshake:
client wrote msg0 → e6b22…de71
wire stream msg0 → e6b22…de71
relay read msg0 → e6b22…de71
relay wrote msg1 → 5890800f…7e9ece
wire stream msg1 → 5890800f…7e9ece
client read msg1 → 5890800f…7e9ece
Bit-identical across all three. For every failing attempt. The bytes were not mangled at all.
And yet chacha20poly1305: message authentication failed.
This is the moment in any debugging session where you start to question your career choices. The bytes are right on both sides, the protocol is right, the framing is right. AEAD still fails. Either snow is broken (it's not, this code has worked for years, properly battle tested), or the math itself is wrong somewhere (also not quite possible).
If the bytes are identical but the math disagrees, then one side computed those "correct" bytes from the wrong inputs. Both ends are honest about what they sent and received. They're just not computing the same things from them. Math was indeed not mathing.
That's a CPU bug.
CPU was lying
I ran one command I should've run at hour one:
$ docker run --rm --net host alpine uname -m
aarch64
OK, the VM is arm64. That's expected, I'm on Apple Silicon.
Then I dared to check the Dockerfile:
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -ldflags="-s -w" -trimpath \
-o /trueseal-relay ./cmd/trueseal-relay
GOARCH=amd64.
I'm building an x86-64 binary and running it on an arm64 VM. The Linux kernel sees a foreign ELF, hands it off to qemu-user (the usual mechanism for this is binfmt_misc), and qemu dynamically translates every instruction at runtime.
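The mismatch is visible in the binary itself: the ELF header stores the target machine as a 16-bit e_machine field at offset 18. Here is a minimal sketch of a check that reads it, handling only the two architectures in this story. The function names are made up, and the comparison uses std::env::consts::ARCH, which is the architecture the checker was compiled for, so the checker has to be a native build for the answer to mean anything.

```rust
/// Maps an ELF header to an architecture name in the style of
/// `std::env::consts::ARCH`. Returns None if `header` is not an ELF header
/// or the machine is not one of the two handled here.
pub fn elf_arch(header: &[u8]) -> Option<&'static str> {
    // 4-byte magic, then e_machine as a 16-bit value at offset 18.
    if header.len() < 20 || !header.starts_with(b"\x7fELF") {
        return None;
    }
    let raw = [header[18], header[19]];
    // EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    let machine = match header[5] {
        1 => u16::from_le_bytes(raw),
        2 => u16::from_be_bytes(raw),
        _ => return None,
    };
    match machine {
        62 => Some("x86_64"),   // EM_X86_64
        183 => Some("aarch64"), // EM_AARCH64
        _ => None,
    }
}

/// True when the binary targets a different architecture than the one this
/// checker was compiled for. Unknown or non-ELF input is reported as false.
pub fn is_foreign(header: &[u8]) -> bool {
    match elf_arch(header) {
        Some(arch) => arch != std::env::consts::ARCH,
        None => false,
    }
}
```

Reading the first 20 bytes of /trueseal-relay inside the container and passing them to elf_arch would report x86_64 for the build above, next to the aarch64 that uname -m prints.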
Most of the time, this works: web servers run, databases run. Business as usual.
But flynn/noise does Curve25519, ChaCha20-Poly1305, and BLAKE2s. These are full of 64-bit multiplies with carry chains, bit rotations, constant-time arithmetic (and god knows what more). Qemu's x86 emulator on arm64 has a long history of being subtly wrong on a fraction of crypto-shaped operations (I had a vague memory of something like this happening to me when trying to build a simple OS MMU). Not always wrong, but wrong sometimes.
So the relay was computing X25519 shared secrets that were wrong by a few bits on most handshakes. Both ends wrote bytes that exactly reflected what they computed. They just weren't computing the same things.
The bytes were correct, but the math ain't mathing.
Why was that line even there
This is the part where I want to be kind to past me. And at the same time murder him.
I vibe-coded the Dockerfile for a Hostinger VPS on the cheapest tier (hehe), and the AI pinned it to amd64, which was fine for what I told it to do. I wanted to be able to docker push from my laptop and know the binary on the server would match the server's CPU. That was all fine until I dared to test what the AI did before pushing the changes, which kicked off a 2-day session of banging my head against the keyboard.
AI optimized for one deployment target and forgot that the same Dockerfile was also building images I run locally, on the opposite architecture, every time I run docker-compose up. The pin guaranteed correctness on the VPS at the cost of correctness on my dev machine. But I cannot blame the pig, it just did what I asked it to.
In a way, it did better than I would have done myself: I wouldn't have pinned the architecture at all, it didn't even cross my mind. I'm sure there is an important takeaway here... but I am too tired to think about it today, and this post is running on fumes already. Maybe something like:
The relay is zero-trust on the network. Turns out you also can't trust your emulated CPU, or me for that matter.