Splitting up a Proxmox cluster to remove overhead

Published

# background

I am undergoing an effort in my homelab journey, which has turned out to be about simplifying, and removing as much maintenance overhead and complexity as I can. I have always been keen on working with new technologies and trying things out, but when I am shorter on time, I need to prioritize better.

As a part of that, a major issue with my current Proxmox cluster has triggered me to split it up into separate nodes. I am not sure what actually caused the issue, or what happened, but none of the nodes were that happy about starting up, and I also consistently got thrown out of the web UI. I was able to resolve it, but I am still not sure exactly which of the steps I took fixed it. The only thing that is clear is that it was related to the networking between the nodes, and corosync.

I currently have a bit of a crazy network configuration. The nodes are on different sites, 30-40 km apart, connected over the public internet with a WireGuard tunnel between the routers at the sites. I cannot really do anything about this two-site configuration now, so that is just something I have to deal with. The craziest part of it is that on one of the sites, the lab is connected over a 2.4 GHz Wi-Fi link which needs to reach across 3 floors down to my basement. This introduces a lot of latency and packet drops. I get a good 10 Mbit/s between the sites, with 5-20 ms of ping on clear days. This clearly means that unnecessary network traffic, and systems which rely on low latency, will eventually become a problem.

The reason for the 2.4 GHz link is that the lab consists of enterprise hardware that I cannot run in my living room due to noise and space. And since I am renting, I cannot really start drilling holes in the walls to pull a network cable. I am somewhat planning on building a smaller, less noisy computer that can live in the living room. But there is of course a cost associated with that, and since I currently do not need the extra bandwidth for my "actual" usage, that is, the traffic that me and the people using my homelab generate directly, it seems a bit of a waste if I can instead reduce my issues significantly just by making the systems less chatty by themselves, and less dependent on each other.

# deserting-the-cluster

Basically, I followed the guide on the Proxmox wiki[0] to desert the cluster from the node I needed to leave. This entails restarting all Proxmox services in local mode, resetting all corosync-related configuration, and restarting the node.

# Stop the cluster services
systemctl stop pve-cluster
systemctl stop corosync

# Start the cluster filesystem in local mode and remove the corosync configuration
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*

# Restart the cluster filesystem as a normal service
killall pmxcfs
systemctl start pve-cluster

# Delete all other nodes manually from /etc/pve/nodes
shutdown -r now

# After the reboot, clean up the remaining corosync state
rm /var/lib/corosync/*

Then, from a node still in the cluster, remove the old node:

# If necessary, change the expected votes to an appropriate amount
# pvecm expected 1
pvecm delnode <oldnode>
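
To verify that the separation worked, `pvecm status` can be run on both sides; on the remaining node the old node should be gone from the member list, and on the separated node it should complain that no cluster is configured.

pvecm status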

# backups

At this point, you need to figure out a way other than replication for keeping two live copies of data. I am planning to do something with ZFS at some point. But for now, I mount a dataset on the other node over NFS and do "normal" backups. And once a month I do backups onto cold storage which is stored in an offsite closet.
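
For reference, the NFS mount can be added as backup storage with `pvesm`; the storage name, server address, and export path below are placeholders, not my actual setup:

# Hypothetical names, adjust to the actual node and export
pvesm add nfs node2-backups --server 10.0.0.2 --export /tank/backups --content backup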

Basically, I have some disks with ZFS and encrypted datasets, e.g.

# Single-disk pool on the USB drive
zpool create COLD_BACKUP_01 /dev/<disk>
# Encrypted dataset, passphrase entered interactively when the key is loaded
zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase COLD_BACKUP_01/backups
zfs set compression=on COLD_BACKUP_01/backups
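
To double-check that encryption actually ended up enabled on the dataset:

# Should report the cipher (aes-256-gcm by default) and keystatus=available
zfs get encryption,keystatus COLD_BACKUP_01/backups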

And before removing the USB disk:

zpool export COLD_BACKUP_01
# or, if the pool is busy
zpool export -f COLD_BACKUP_01

When I am making a new backup, I do:

zpool import COLD_BACKUP_01
zfs load-key COLD_BACKUP_01/backups
zfs mount COLD_BACKUP_01/backups


I am somewhat planning to make a simple udev rule which does all the required things: import the pool, load the key, mount, start a Proxmox backup, and export again.
I don't really care that the password would be stored in plain text on the server;
the point is just that if someone steals the disk, the data is not immediately accessible.
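
A rough sketch of what that could look like; untested, and the rule file, script path, and key file are hypothetical names:

# /etc/udev/rules.d/99-cold-backup.rules (hypothetical)
# Match the USB disk by its ZFS member label; udev kills long-running
# RUN+= processes, so the actual work is detached with systemd-run
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="zfs_member", ENV{ID_FS_LABEL}=="COLD_BACKUP_01", RUN+="/usr/bin/systemd-run /usr/local/bin/cold-backup.sh"

And the script it launches:

#!/bin/sh
# /usr/local/bin/cold-backup.sh (hypothetical): import, back up, export
set -e
zpool import COLD_BACKUP_01
# Passphrase stored in plain text on the server, as discussed above
zfs load-key COLD_BACKUP_01/backups < /root/cold-backup.key
zfs mount COLD_BACKUP_01/backups
vzdump --all --dumpdir /COLD_BACKUP_01/backups
zfs unmount COLD_BACKUP_01/backups
zpool export COLD_BACKUP_01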

# replication

I have previously used replication with the intention of being able to manually recover, without doing a backup restore, if one node goes down.
The main reason for doing something similar to replication is that it minimizes the needed network traffic.
I started with `pve-zsync`, but for some reason that does not work natively with LXCs, which I am using a lot of.
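
If I revisit replication, plain incremental ZFS send/receive over SSH would probably be enough for my use. A minimal sketch, with made-up dataset, snapshot, and host names:

# Send only the blocks that changed since the previous snapshot
zfs snapshot rpool/data/subvol-100-disk-0@repl-new
zfs send -i @repl-old rpool/data/subvol-100-disk-0@repl-new | ssh other-node zfs receive -F tank/replica/subvol-100-disk-0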

Currently, I am just using the offsite backup strategy on cold disks.
I have two disks, which I swap every few weeks and bring somewhere else.
So at most I would lose a few weeks of data in case of a fire or something like that.


# footnotes

[0]
https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_separate_node_without_reinstall

If you have a comment, please e-mail [email protected] and I will add it to a comment section below :)
Please indicate how you want me to identify you (or not) in the e-mail.