For HA reasons, I setup a cluster of two nodes that are using “bond0” for the external networks and “bond1” for the communication between the two hosts.
As I setup the machines in the standard way, the bonding module options for the bond1 interface was a simple “miimon=100”, which monitors the link activity – so a broken or removed network cable should bring the interface down and trigger a failover.
Now, as I wanted to test the setup, I did what I always do on such bonding devices to test the failover:
- find out the available slaves by
- find out the active slave by
- trigger a failover by echo
eth0 > /sys/class/net/bond1/bonding/active_slave
- test, if ping still works from outside and inside
But for the direct connection, this did not work. So in the first run I assumed a broken network cable. But standing at the back of the machines and seeing all interfaces blinking, I got another idea: if you monitor your bonding devices via miimon, the bonding driver has no need to move to another device if the interface reports an active link. …and exactly that happened to the other machine that had not been triggered by myself to switch the active interface. So the monitored interface on both hosts was still ok – but as the active slave switched on one host and the bonding was setup as active-backup bonding, the other node still ran the IP addresses on the other interface, so the network traffic did not work any more.
As result, I configured this bonding interface now via the “mode=802.3ad” (or easier: “mode=4”), meaning “Dynamic link aggregation”. So now I do not even have a fail-over setting, as wanted, but also an increased bandwidth, allowing twice the amount of packages going over the line. To speed up the boot time, I also set the additional option “primary=eth0” on both hosts, which will bring up those devices as primary slaves on each boot.
…and indeed: the test above now succeeds.