Skip to main content
  1. Papers/

Paper: Raft Consensus Algorithm

·1523 words·8 mins

Original: “In Search of an Understandable Consensus Algorithm (Extended Version)”
Authors: Diego Ongaro and John Ousterhout (Stanford University)

introduce
#

Raft is a consensus algorithm widely used in industry. Another widely used available algorithm is called paxos, but the original paper ‘paxos made simple’ doesn’t provide enough details about how to handle stream replication problem(replicate state machine), it talks about the single decree problem which is mainly about how to decide a certain value among a couple of nodes. The real applied algorithm is called multi-paxos, however, the nature of multi paxos in my opinion is almost the same with raft, the main difference is in leader election.

background
#

We want to copy a stream to 3~5 nodes(which we called RSM, replicate state machine problem).
The critical point is all the nodes should keep the same sequence, that means same data unit, same order. For example, think about a stream of number 51348, every node should persist the same sequence 51348, 53148 and 51148 is wrong.

According to some experience from building practical real world paxos application, we find multi paxos need to elect a leader to improve performance(limit the Round Trip of replication to 1). Diego catch up a great idea: why not split the leader election, the real replicate progress as different parts to make it much clear? That’s raft, a practical version of multi paxos, which the author declare more understandable.

The result is fault tolerant, which can tolerate less than half of the nodes fails. So every time we accept a request, we don’t need to copy to all the nodes, we just replicate to majority of nodes and response immediately to improve performance.

how to reach consensus
#

First of all, which we called leader, primary, master is exactly the same thing.

When we want to replicate a stream, it’s always difficult to figure out how to reach a consensus for all the nodes in a specific issue, the problem is a stream includes a sequence of issues, that means a sequence of consensus need to be deal with, which sounds like a mess.

However, if we have a leader, life becomes easier: all the nodes just follow the leader, replicate the stream element one by one according to leader, no coordination at all, if majority of follower nodes get the replica successfully, this replication turn has finished. An potential problem is the slow leader could become the bottleneck, which is a problem of raft.

but where is our leader
#

Let’s elect, everyone nominate himself as a leader, to make it more understandable, we called it president, everyone wants to be president, they all know the current term, for example, the current president is Baiden, the term is 47, in brief is Baiden is the 47th US president. If I want to be the next president, I will be the 48th president, so I nominate myself as 48th president, and someone like me is called the candidate. There is something to decide who can be the next president, in reality, it’s the speech, in raft, it’s who has more valid stream elements. Since it’s a stream replicate system, we compare the length of streams in all the nodes, the longest one should be the leader, so we compared last log index, if the log index is 6, it should longer than log with index 5. In the other hand, for every term, we have only 1 president, a president could be 47th, 48th, but 47th president could not be Trump and Baiden. So we compare the president term at the same time. if a node has term 48 logs, it should has newer logs than a node has only term 20 logs. In conclusion, every node try to be the president, but only one of them could be the next president. they will tell the others his expect term and his last log information, the log info includes log index and the term. The node who has the latest log, will become the next president with his expected term. Since the leader replicate the data to majority of nodes, the candidate only need to win the majority of the nodes instead all the nodes to be chosen.

how leader works
#

The president broadcast all request to the other nodes with its own term and the global sequential number, every log element has a term and an index number. In the real world, a president can propose a new act with his term and the act number, for every term there is only one president, which ensure if the term and index is the same, the log element could be the same, because a president could not say his 38th act is A, and his 38th act is also B. Notice that we could have more than president at the same time, but we have exact 1 president in the same term. If half of the nodes persist a specific element and response president, the president will say, ok, the act 38 starts work now, it’s a consensus.

what if the leader fails
#

If the president dies, maybe a terrorist assault, a well-known scandal or a traffic accident, the other nodes would not get acts from president any more, they could not even hear the president, which we called ’lose the heartbeat’, they would join the election, nominate themselves. However, how about the log sequence? The next president will find some nodes(like original leader) has longer log than him, some nodes has shorter log than him, but since he can become the leader, he must has all log which has become a consensus(accept by majority), so nothing lost if the president just handle new request and ignore the longer nodes. The president try to cover all the newer logs in the nodes which has longer logs, and fill the the nodes which has shorter logs. No matter how the original log sequence it looks like, every node would have the same log sequence with leader. That’s what we called strong leadership. But wait a minute, if the president cover the longer node, will something be lost? No, since the president could become a president, it must hold the logs commit before, in reality, the longer logs includes the elements haven’t reach a consensus. But if the president contains some log elements which haven’t commit, it will still replicate this elements left.

Let’s think about a 1 node application, if we issue a write request, the node could shutdown before persist the write or after the persist, but the client don’t know, the client don’t get the response, so it feel like the write operation fails. Since the president in a 3 nodes system hasn’t response the client yet, if could have persisted the write or not, thus the president can cover the other longer nodes, and submit the log elements which even haven’t commit.

the whole progress
#

At the very beginning, we have some nodes, they all want to be the president to handle the out world request, if they have a president, they act like single one node, act as a whole. the system will keep running, if minority of the nodes fail, nothing lost, nothing broken in the perspective of out world request.

In order to be a president, the node need to propose his next term, if current president is 48th, and the node propose 42th, that means the node fall behind, the other node will refuse him. The node need to compare his latest log with the others too, because it’s a stream replicate system, we want the log to be highly reliable, nothing lost. If the node has newer log against the majority of the nodes, it could be voted as the next president. The reason of majority is we want to tolerate less than half of the nodes fail, and that also need the president replicate the data to a majority of nodes to promise the data will always survive.

The president broadcast log elements with their president term and log index, the log index is just like the index of an array to provide a sequence number. The president force all the other nodes keep the same log with him, and that’s how they reach consensus.

If a new president occurs, it would find himself has longer logs than some nodes, or has shorter los than some nodes too. However, it always try to make the other nodes persist the same logs with him, if the other nodes have different logs, cover them. The method to compare a specific log element is the term and sequential number, if term and N.O. is the same, the log must be the same, because a president will never say act 123 is A and act 123 is B at the same time, and there is only 1 president in every certain term.

try to read the paper before read the article, if you feel blocked, get back here instead of reading the article without reading the paper.
a visible progress of raft could be found here: raftThe Raft Consensus Algorithm