Technical Details
Leader Selection Algorithm
The leader selection algorithm is a critical part of MooseFS. Strict safeguards ensure that no more than one leader can ever be elected simultaneously.
The Original Algorithm
In previous versions of MooseFS, chunkservers cast votes for a candidate master (Elect) to be promoted to Leader. The promotion condition was simple: more than half of all chunkservers must connect to the Elect.
Why the Original Algorithm Does Not Scale to Multilocations
The original algorithm creates two problems in multi-location deployments:
Uneven chunkserver distribution: Consider a setup with one main location (12 chunkservers) and two backup locations (5 chunkservers each). If the main location fails entirely, only 10 out of 22 chunkservers remain – less than half. The leader cannot be elected, even though two fully operational locations still hold a complete copy of all data.
Equal split between two locations: With two locations of equal size, a failure of one leaves exactly half of all chunkservers. Exactly half is not more than half, so the leader cannot be elected in the surviving location.
The New Algorithm
The leader selection algorithm was redesigned for Multilocations. The new rule is:
A location casts a vote only if more than half of its chunkservers are able to connect to the Elect. A leader is elected when either:
- more than half of the participating locations vote, or
- exactly half of the participating locations vote and that set includes the default location.
Results:
- 3 locations: if one fails, the remaining two vote – more than half (2 out of 3) → leader is elected.
- 2 locations: if the non-default location fails, the default location votes alone – exactly half (1 out of 2) including the default → leader is elected. If the default location fails, the non-default location cannot reach a majority or include the default → no leader.
The count of participating locations is based on locations in ON or IO state. Locations set to OFF are excluded from the count entirely.
Example: a cluster with 3 locations all in ON state requires 2 locations able to vote. If one location is switched to OFF, the instance behaves as a 2-location cluster: both remaining locations must vote, or the default location must be among the voters when only one location remains active.