在被Paxos的The Part-Time Parliament伤了之后,我自然而然就准备来看一下大家都说容易理解的Raft协议,也就是这篇In Search of an Understandable Consensus Algorithm,来给自己的智商重新冲值。
这篇论文开宗明义就说了:"Raft is a consensus algorithmfor managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election and log replication, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety. Results from a user study demonstrate that Raft is easier for students to learn than Paxos."
里面除了说Raft功能和Paxos一样,可是更容易理解,人见人爱之外。关键是这句: In order to enhance understandability, Raft separates the key elements of consensus, such as leader election and log replication, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. 意思是:为了提升可理解性,Raft 将一致性算法分解成了几个关键模块,例如领导人选举、日志复制和安全性。同时它通过实施一个更强的一致性来减少需要考虑的状态的数量。
后面这篇论文的内容大致就是讲一段自己的理论就来一段Raft协议是多么容易理解或者Paxos是多么不好理解的论述。最后还搞了一个针对Paxos及Raft 算法的可理解能力的测试。整个方案是这样的:“为了和 Paxos 比较 Raft 算法的可理解能力,我们针对高层次的本科生和研究生,在斯坦福大学的高级操作系统课程和加州大学伯克利分校的分布式计算课程上,进行了一次学习的实验。我们分别拍了针对 Raft 和 Paxos 的视频课程,并准备了相应的小测验。Raft 的视频讲课覆盖了这篇论文的所有内容除了日志压缩;Paxos 讲课包含了足够的资料来创建一个等价的复制状态机,包括单决策 Paxos,多决策 Paxos,重新配置和一些实际系统需要的性能优化(例如领导人选举)。小测验测试一些对算法的基本理解和解释一些边角的示例。每个学生都是看完第一个视频,回答相应的测试,再看第二个视频,回答相应的测试。大约有一半的学生先进行 Paxos 部分,然后另一半先进行 Raft 部分,这是为了说明两者独立的区别从第一个算法处学来的经验。我们计算参加人员的每一个小测验的得分来看参与者是否在 Raft 算法上更加容易理解。”为了增加说服了作者还用线性回归模型对学生成绩数据进行了统计,最后结果当然是在可理解性上Raft胜出。
Raft协议之所以容易被理解是因为在Raft协议进行了算法分解(Raft 主要被分成了领导人选举,日志复制和安全三个模块)和减少状态机的状态(相对于 Paxos,Raft 减少了非确定性和服务器互相处于非一致性的方式)。具体实现上有以下几个独特的特性:
Design for understandability(易于理解的设计): understandability was our most important criterion in evaluating design alternatives. We applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety so that they can be understood relatively independently) and state space reduction (Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other, in order to make it easier to reason about the system).
Strong leader(强领导者): Raft differs from other consensus algorithms in that it employs a strong form of leadership where only leaders (or would-be leaders) issue requests; other servers are completely passive. This makes Raft easier to understand and also simplifies the implementation.
Membership changes(关系调整): Raft’s mechanism for changing the set of servers in the cluster uses a simple joint consensus approach where the majorities of two different configurations overlap during transitions.
因为强领导者,Raft可以将一致性问题分解成了三个相对独立的子问题:
Leader election(领导选举): a newleadermust be chosenwhen an existing leader fails, and Raft must guarantee that exactly one leader is chosen (Section 5.2).
Log replication(日志复制): the leader must accept log entries from clients and replicate them faithfully across the cluster, forcing all other logs to agree with its own (Section 5.3).
Safety(安全性): Section5.4describeshowRaft decideswhen a log entry has been committed, and how it ensures that leaders always hold all committed entries in their logs.
后续的章节对这几个问题的具体实现进行了详细的描述,甚至连实现的函数的定义都给了出来(RequestVote RPC和AppendEntries RPC)。整个论文可以说通俗易懂,大家可以从头橹到尾。
论文下载地址:http://fi.ee.tsinghua.edu.cn/media/files/2016/11/168263a78850135205a978e7230056d2/raft.pdf