I moved csylabs to Selectel overnight — but the story is not speed
Yesterday evening I started a migration that, a week ago, I expected to stretch over five days. I finished it at 7:50 in the morning. Along the way, I did almost everything I had planned not to do.
The important part is not the speed. The important part is that the migration stayed production-grade. Mail, DKIM, and PTR work. The headscale mesh coordinator moved to the new IP. Backups now live in another city over a private 9 ms link. The ansible inventory was renamed and brought into canonical shape. What was supposed to be a staged migration with 24-hour bake windows between waves collapsed into one night, but without losing a single production property.
This is a story about how two years of substrate make migrations boring, and make production-grade behavior almost a side effect.
Where it came from, where it went
csylabs and my personal projects have lived at Servers.ru for the last couple of years. The provider gave me good hardware in Moscow at a reasonable price, and everything worked.
In late 2024, Selectel acquired Servers.ru.
Technically, nothing suddenly changed: Servers.ru continues to work, and Selectel continues to work. But the Servers.ru platform is gradually moving into Selectel's billing panel while the infrastructure remains in place. One owner, two brands, two contracts.
In April 2026, two things converged for me. First, my Servers.ru host hit its ceiling: 64 GB of RAM at about 70% utilization, no NVMe, only SATA SSDs, and backups that lived on the same host before being copied to my home Synology. That worked as a production setup, but it was not real, geographically separated disaster recovery.
Second, I entered a direct contract with Selectel and saw that for the same monthly budget I could get:
→ The same CPU, Xeon E-2388G, but 128 GB RAM instead of 64 GB
→ NVMe instead of SATA SSD, with about 4x more hot space for production VMs
→ A dedicated PBS host in Saint Petersburg with 4 TB of archive storage
→ A Moscow ↔ Saint Petersburg private network at 9 ms — geographic DR out of the box
→ The same monthly spend
What I lose: the 2x10 Gbit uplinks (Selectel gives me 1 Gbit here) and the Dell iDRAC remote console (this Selectel host does not expose iDRAC to me, so everything goes through the PVE console). For my current load and work pattern this is enough, but honestly: it is a deliberate downgrade in one area in exchange for an upgrade in another. I am not pretending it is a pure win.
In simple terms: the same class of machine, twice the resources, plus a real disaster recovery setup that I used to imitate with manual copies between a host and a Synology.
The plan was five days. Reality was one night.
The original plan was a clean blue/green migration: restore all 12 services on Selectel in parallel, keep them dormant, then switch DNS one service at a time with a 24-hour bake period between waves.
By the evening of April 28, Phase 2 was ready: all 9 VMs and 3 LXCs had been restored on Selectel, IPs were changed, and onboot=0 was set. The plan was to sleep and migrate one service at a time in the morning.
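For reference, keeping the restored guests dormant is a one-flag affair on the PVE side. A minimal sketch, with placeholder VMIDs and CTIDs rather than the real ones:

```bash
# Keep restored guests dormant on the new host: do not autostart on boot.
# 101 / 201 are placeholder IDs, not the real ones.
qm set 101 --onboot 0     # QEMU VM
pct set 201 --onboot 0    # LXC container

# Sanity check: nothing on this host should be set to autostart yet
grep -l 'onboot: 1' /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf 2>/dev/null \
  || echo "no autostart guests"
```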
Instead, at one in the morning, I decided the window was not tomorrow morning. It was now. The logic was simple: running two hosts in parallel creates more operational noise than it saves. One cutover instead of two, and one rollback path if something breaks, not a game of guessing which half is live.
It was a debatable decision. In the end it worked, but that is an outcome statement, not a risk argument.
50 minutes of active work
The sequence looked like this (a condensed shell sketch follows the list):
→ Rotated the Cloudflare API token. One token had been sitting in an old Caddyfile, and another had been exposed through ansible-vault view a few days earlier. The second case is now a personal rule: inspect vault key names with grep, never view secrets directly.
→ Audited Caddy across three hosts. Found 17 csylabs vhosts on the platform node, 2 on the Caddy VM, 2 on the video node.
→ Prepared the Cloudflare playbook with new IPs and TTL=60. Collected retired records separately.
→ Took fresh delta backups of 5 active services through PBS. Deduplication made restore times feel absurd: about a minute per service, reproducible every time.
→ Destroyed the stale Selectel copies and restored the fresh backups over them. The 500 GB platform VM took 19 minutes. The others took seconds.
→ Reconfigured IPs through qm set / pct set --net0.
→ Stopped the source and started the destination. All 5 services moved inside a 30-second window.
→ Ran the DNS playbook.
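Condensed into commands, one service's cutover looked roughly like this. Everything below is a sketch: the IDs, storage names, bridge, and addresses are placeholders, and the real playbook and IPs are not reproduced here.

```bash
# Per-service cutover, sketched with placeholder IDs, storage names, and addresses.

# 1. Fresh delta backup of the live guest on the old host
vzdump 101 --storage pbs-selectel --mode snapshot

# 2. On the new host: drop the stale copy and restore the fresh backup over it
qm destroy 101
qmrestore <latest-backup-volid> 101 --storage local-nvme

# 3. Point the guest at the new network (container example; VMs went through qm set --net0)
pct set 201 --net0 name=eth0,bridge=vmbr0,ip=<new-ip>/24,gw=<new-gw>

# 4. Flip: stop the source, start the destination
ssh old-host 'qm stop 101' && qm start 101

# 5. Re-point DNS (TTL had already been lowered to 60 in the playbook)
ansible-playbook -i inventories/selectel/hosts.yml configure_cloudflare_dns.yml
```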
By 7:50, mail was flowing through the new mail node, the video stream was on the new IP, headscale was coordinating the mesh from Selectel, and the coffee was gone.
What actually broke
Vault path. In configure_cloudflare_dns.yml, variables were loaded from ../group_vars/all/vault.yml — a playbook-level vault nobody maintained. The canonical vault was in the inventory, and that is where I had rotated the token. The first run failed with 403 on 30+ records. Cloudflare then rate-limited the follow-up attempts into 429. The symptom was simple: DNS did not update. The cause was three layers of vault resolution, two of which worked correctly. It took one line to fix and about an hour to understand.
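To make the failure mode concrete, a sketch of how to check which vault a playbook actually pulls from; the key name in the second command is hypothetical, and the exact one-line fix in my playbook is not reproduced here.

```bash
# Which vault file does this playbook load? The stale answer was ../group_vars/all/vault.yml.
grep -n -A3 "vars_files" configure_cloudflare_dns.yml

# Confirm the rotated token lives in the inventory vault, by key name only,
# without printing the secret itself (key name is hypothetical):
ansible-vault view inventories/selectel/group_vars/all/vault.yml \
  | grep -o '^vault_cloudflare_api_token'
```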
cloudflare_dns created duplicates. When the module saw a record with the same name but a different value, state=present did not update it. It added a second record. Now each name had two A records: old 88.212.x and new 155.212.x, and Cloudflare round-robin'd between them unpredictably. This did not break everywhere at once, so I noticed it after "everything worked." The fix was a small direct Cloudflare API script: list zone records, filter the old range, delete by ID. 38 records. Delete, quiet down, move on.
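The cleanup script was nothing clever. A minimal sketch of the same idea with curl and jq, assuming the zone ID and the rotated token sit in environment variables (the variable names are my own placeholders):

```bash
#!/usr/bin/env bash
# Delete stale A records that still point at the old 88.212.x range.
# CF_ZONE_ID and CF_API_TOKEN are placeholders for the real zone id / rotated token.
set -euo pipefail

api="https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records"

# List A records, keep only those whose content is in the old range, extract their IDs
curl -s -H "Authorization: Bearer ${CF_API_TOKEN}" \
     "${api}?type=A&per_page=100" \
  | jq -r '.result[] | select(.content | startswith("88.212.")) | .id' \
  | while read -r id; do
      curl -s -X DELETE -H "Authorization: Bearer ${CF_API_TOKEN}" "${api}/${id}" > /dev/null
      echo "deleted ${id}"
    done
```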
Dual IP on first LXC boot. After pct restore, a container first booted with both IPs on eth0, old and new. systemd-networkd showed only the new config, but the kernel had both addresses. The fix was trivial: pct stop && pct start clears the kernel state. But you need to know that, otherwise the first moment looks worse than it is.
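For the record, a sketch of the diagnosis and the fix, with a placeholder CTID:

```bash
# Kernel view inside the container shows both addresses after the first boot
pct exec 201 -- ip -4 addr show dev eth0

# A clean stop/start clears the stale address the kernel was still holding
pct stop 201 && pct start 201
pct exec 201 -- ip -4 addr show dev eth0   # now only the new IP
```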
What exists now
The Selectel cluster now looks like this:
→ pve-selectel-mow-01 — 12 services
→ pbs-selectel-spb-01 — backup in another city
→ Private link through Selectel Global Router
→ ansible inventory under inventories/selectel/, with hosts.yml in one canonical shape
→ Cloudflare DNS managed by one playbook with inventories/selectel/group_vars/all/vault.yml as the source of truth
→ Stalwart on the new machine, DKIM/SPF/DMARC in place, PTR configured
→ headscale moved; mesh.csylabs.com answers from the new IP
→ MistServer for the video vertical is alive
The Servers.ru host remains as a candidate for shutdown. The bake period is a couple of days to catch delayed DNS caches and any scripts that still point at old IPs. After that, it is a support ticket for cancellation and contract termination.
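The check itself is small enough to script. A sketch; hostnames other than mesh.csylabs.com and the repo paths are placeholders:

```bash
# Flag anything that still resolves to the old 88.212.x range.
for host in mesh.csylabs.com; do   # add the other migrated hostnames here
  ip=$(dig +short A "$host" @1.1.1.1 | head -n1)
  case "$ip" in
    88.212.*) echo "STALE  $host -> $ip" ;;
    *)        echo "ok     $host -> $ip" ;;
  esac
done

# And grep the config repos for hard-coded old addresses (paths are placeholders)
grep -rn "88\.212\." ~/src/ansible ~/src/scripts 2>/dev/null \
  || echo "no hard-coded old IPs found"
```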
What I did not do
I did not upgrade the kernel. Kernel 7.0 is already available as an opt-in beta for PVE 9, and it has interesting work in it: Zen 6 and Nova Lake support, faster EXT4 parallel direct I/O, faster page-cache reclaim. But my Selectel hosts are Xeon E-class machines, and none of that justifies the risk during a bake period. Kernel 7.0 is expected to become the default around PVE 9.2 / PBS 4.2 near the end of Q2. I can wait.
I did not update ansible roles along the way. The goal of the migration was the migration. Caddyfile cleanup, removal of inline tokens, and a couple of dusty vhosts are a separate commit for next week.
I did not start the highlights node or the OpenVidu node on the new site. They are in the spec, but the current load does not need them. They are restored and dormant; if needed, they come up in a minute.
Discipline
What I like is not that I "moved everything overnight."
What I like is that the tools — PBS deduplication, ansible inventories, Cloudflare playbooks — combined into a setup where 50 minutes of real cutover time became possible. Most of the work was not that night. It was two years of substrate: Proxmox habits, Ansible roles, MikroTik and sing-box routes, and the habit of keeping the system in a vault.
When these things exist, an overnight migration becomes boring work, not a heroic act.
I want boring migrations. Heroics are usually a symptom of missing substrate.