Trap: failover bonding and miimon on directly connected machines

For HA reasons, I set up a cluster of two nodes that use “bond0” for the external networks and “bond1” for the communication between the two hosts.

As I set up the machines in the standard way, the only bonding module option for the bond1 interface was a simple “miimon=100”, which checks the link state of each slave every 100 ms, so a broken or removed network cable should take that slave down and trigger a failover.

Now, as I wanted to test the setup, I did what I always do on such bonding devices to test the failover:

  1. find out the available slaves by cat /sys/class/net/bond1/bonding/slaves
  2. find out the active slave by cat /sys/class/net/bond1/bonding/active_slave
  3. trigger a failover by echo eth0 > /sys/class/net/bond1/bonding/active_slave
  4. test whether ping still works from outside and inside
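
The first three steps can be wrapped in a small script. This is only a sketch of how I would automate them: the function names and the optional directory argument are my own additions (the argument makes it possible to point the functions at another bond), and the default path is the bond1 sysfs directory used above.

```shell
#!/bin/sh
# Sketch: force a failover on a bonding device via sysfs.
# The function names and the optional directory argument are hypothetical
# helpers; the default path is the bond1 device from this post.

# Print the first slave that is not currently the active one.
pick_backup_slave() {
    bond_dir="${1:-/sys/class/net/bond1/bonding}"
    active="$(cat "$bond_dir/active_slave")"
    for slave in $(cat "$bond_dir/slaves"); do
        if [ "$slave" != "$active" ]; then
            echo "$slave"
            return 0
        fi
    done
    return 1   # no backup slave available
}

# Make the backup slave the active one (needs root on a real system).
trigger_failover() {
    bond_dir="${1:-/sys/class/net/bond1/bonding}"
    new="$(pick_backup_slave "$bond_dir")" || return 1
    echo "$new" > "$bond_dir/active_slave"
    echo "switched active slave to $new"
}
```

Calling trigger_failover without arguments acts on bond1. The ping test from step 4 still has to be done separately, from both sides.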

But for the direct connection, this did not work. At first I assumed a broken network cable. But standing at the back of the machines and seeing all interfaces blinking, I got another idea: if you monitor your bonding devices via miimon, the bonding driver has no reason to move to another device as long as the interface reports an active link. …and exactly that happened on the other machine, where I had not triggered a switch of the active interface. The monitored links on both hosts were still fine. But with direct cabling, each slave is wired to exactly one slave on the peer, so both hosts have to use the same cable. As the active slave had switched on one host only and the bond was set up in active-backup mode, the other node still ran its IP addresses on the other interface, so the network traffic between the two did not work any more.
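
This state is easy to spot if you compare /proc/net/bonding/bond1 on both hosts: every slave reports “MII Status: up”, but the “Currently Active Slave” differs. The following is a minimal sketch for pulling out those two pieces of information; the function names are hypothetical, and the optional argument only exists so that a saved copy of the status file can be inspected as well.

```shell
#!/bin/sh
# Sketch: summarize the bonding status file of a bond device.
# Function names are hypothetical; the default file is bond1 from this post.

# Print the currently active slave of the bond.
bond_active_slave() {
    awk -F': ' '/^Currently Active Slave:/ { print $2 }' \
        "${1:-/proc/net/bonding/bond1}"
}

# Print "<slave> <link state>" for every slave. The first "MII Status"
# line belongs to the bond itself and is skipped, because no
# "Slave Interface" line has been seen at that point.
bond_link_states() {
    awk -F': ' '
        /^Slave Interface:/ { slave = $2 }
        /^MII Status:/ && slave != "" { print slave, $2 }
    ' "${1:-/proc/net/bonding/bond1}"
}
```

If bond_active_slave prints a different interface on each host while bond_link_states shows all slaves as up, you are in exactly the situation described above.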

As a result, I now configure this bonding interface with “mode=802.3ad” (or shorter: “mode=4”), meaning “Dynamic link aggregation”. So now I not only have the failover I wanted, but also increased bandwidth, allowing up to twice the amount of packets going over the line. To speed up the boot time, I also set the additional option “primary=eth0” on both hosts, which is meant to bring up those devices as primary slaves on each boot. (Note that the kernel bonding documentation lists “primary” as valid only for the active-backup, balance-tlb and balance-alb modes, so it may have no effect with 802.3ad.)
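
To make sure the new mode is really active after a restart of the network, the bond reports it via sysfs: /sys/class/net/bond1/bonding/mode contains the mode name followed by its number, for example “802.3ad 4”. A small sketch of such a check (the function name and the optional directory argument are hypothetical):

```shell
#!/bin/sh
# Sketch: check which mode a bonding device is running in.
# bond_mode_is is a hypothetical helper; it accepts either the mode
# name ("802.3ad") or the mode number ("4").
bond_mode_is() {
    expected="$1"
    bond_dir="${2:-/sys/class/net/bond1/bonding}"
    # The sysfs file holds "<name> <number>", e.g. "802.3ad 4".
    read -r name number < "$bond_dir/mode"
    [ "$name" = "$expected" ] || [ "$number" = "$expected" ]
}

# Usage:
#   bond_mode_is 802.3ad && echo "bond1 uses dynamic link aggregation"
```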

…and indeed: the test above now succeeds. :-)

About lvogdt

This is the private blog space of Lars Vogdt, the topics will be in first place work related.
This entry was posted in network, openSUSE, SUSE Linux Enterprise.