
Notes on NFS — or why you might need Proxmox Backup Server

The previous post ended with "let's go" — first Synology backup running tonight. In the morning I opened the management panel. Interface doesn't load. :)


SSH

Server is reachable, ping goes through, SSH works. But uptime says:

load average: 26.29, 24.46, 23.54

Load of 26 on a Xeon E-2388G is not normal operating mode — that's every core pinned and the major services lying down for a rest. Three pveproxy workers sitting in D-state since 02:59. D-state is uninterruptible sleep: the kernel is blocked waiting on I/O and won't give the thread back until that I/O completes. SIGKILL doesn't help — uninterruptible means uninterruptible.
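A quick way to spot these stuck processes, assuming standard procps `ps`:

```shell
# List processes in uninterruptible sleep (state starts with D):
# they are waiting on kernel I/O and ignore SIGKILL.
# wchan shows which kernel function they are blocked in.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```

On a healthy box this prints only the header; here it would have shown the pveproxy workers parked in an NFS wait.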

In the logs:

100.64.0.4:/volume1/pve-backups ... hard,fatal_neterrors=none,timeo=600,retrans=2

In tailscale logs from 03:36:

netcheck: UDP is blocked, trying HTTPS
timeout opening TCP 100.64.0.5 => 100.64.0.4:111
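The mount options in that first log line decide everything that follows. To see what options the kernel is actually using for an NFS mount — defaults like hard apply even when nothing in the config says so — util-linux's findmnt works (assuming it's installed, which it is on any Proxmox host):

```shell
# Show every NFS mount and the negotiated options,
# including defaults the config file never mentions.
findmnt -t nfs,nfs4 -o TARGET,SOURCE,OPTIONS
```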

What happened

→ Backup job started at 03:00
→ At 03:36 — tailscaled missed a WireGuard keepalive under the load of active backup I/O
→ UDP connection went dark: the NAT mapping on the home router expired
→ Tailscale switched to DERP relay over HTTPS
→ DERP runs on the same physical machine (I know)
→ hard + fatal_neterrors=none mount: processes wait forever, no errors returned
→ Load climbs → headscale VM starves → DERP degrades → Tailscale can't reconnect
→ No exit. Deadlock.

NFS over Tailscale with DERP relay on the same machine isn't just a bad config. It's an architecture that will deadlock on any UDP blip. Not "might" — will.

Result: 16 D-state processes, load 29, pveproxy unresponsive, headscale VM unreachable over SSH. VMID 9000 (465 MB) finished writing to .vma.zst at 03:01. VMID 1013001 — 15 GB of an unfinished .vma.dat out of 500 GB.


How I fixed it

reboot -f didn't execute — SSH couldn't deliver the command at load 29. Had to use SysRq:

echo 1 > /proc/sys/kernel/sysrq    # enable all SysRq functions
echo b > /proc/sysrq-trigger       # immediate reboot: no sync, no clean unmount

After reboot — rebuilt the connection scheme.

Tailscale is out of the backup path entirely. MikroTik port-forwards TCP 2049 to the Synology, source-restricted to the Moscow server's public IP only. Two firewall rules: accept from that IP, drop everything else on that port. RouterOS is genuinely great for this.
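The RouterOS side is a few lines. This is a sketch, not a drop-in config — 198.51.100.7 stands in for the Moscow server's public IP, 192.168.88.20 for the Synology's LAN address, ether1 for the WAN interface:

```
/ip firewall nat add chain=dstnat in-interface=ether1 protocol=tcp dst-port=2049 action=dst-nat to-addresses=192.168.88.20 comment="NFS -> Synology"
/ip firewall filter add chain=forward protocol=tcp dst-port=2049 src-address=198.51.100.7 action=accept comment="NFS: backup server only"
/ip firewall filter add chain=forward protocol=tcp dst-port=2049 action=drop comment="NFS: everyone else"
```

Filter rules are evaluated top to bottom, so the accept has to sit above the drop — adding them in this order does that, since new rules append to the end of the chain.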

It works.

In storage.cfg:

options vers=4,soft,timeo=30,retrans=3

vers=4 — no portmapper needed, just port 2049. soft — timeo=30 is a 3-second timeout per attempt (timeo counts in deciseconds), retrans=3 gives three retries; after that the process gets an I/O error instead of hanging forever. Backup aborts, host keeps running.
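For context, the full storage entry in /etc/pve/storage.cfg looks roughly like this — storage name, server address, and mount path here are placeholders; the export path is the one from the logs:

```
nfs: synology-backups
        server 203.0.113.10
        export /volume1/pve-backups
        path /mnt/pve/synology-backups
        content backup
        options vers=4,soft,timeo=30,retrans=3
```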


Ran manually. VMID 9000 — 55 seconds, 473 MB. VMID 1013001 went from scratch: 225 MB/s off ZFS, ZSTD compressing on the fly before sending.


→ hard NFS over any tunnel isn't fault tolerance — it's "I'll hang forever instead of failing"
→ DERP relay on the same machine as the NFS client is a backup exit through the same wall
→ A WireGuard keepalive missed under I/O load — that's UDP behavior, not a bug
→ NFSv4 doesn't need portmapper. One port, 2049. That's enough.

The previous post in this series ended on "let's go." This one ends on "it works."
