Tech talk: "How the Raft Distributed Consensus Algorithm Is Implemented"

rfyiamcool · 2017-04-04 19:32:07

This article was created on 2017-04-04 19:32:07; the information in it may have evolved or changed since then.

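Lately I have been talking with colleagues about the algorithms and protocols commonly used in distributed systems, and distributed consistency came up. Admittedly, our understanding of consistency was at the "read an introduction once" level. Afterwards I put real effort into studying the Raft consensus protocol, and now that I have some takeaways, I am sharing them with my colleagues.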

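First, briefly: what is the Raft protocol, and what is it for? A distributed storage system usually maintains multiple replicas, which improves availability and, because there are several replicas, performance as well. The price of multiple replicas is one of the core problems of distributed storage: the data on all replicas has to be kept consistent. The Raft consensus protocol does exactly this job. Even when some replicas are down, the system can keep serving requests as long as Raft's rules are still satisfied.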
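In practice, "Raft's rules" here come down to a majority: the slide digest further down lists N/2 + 1 as the quorum size. The following is a minimal Go sketch of that arithmetic. The function names `quorum`, `faultTolerance`, and `canServe` are hypothetical names chosen for illustration, not part of any Raft library.

```go
package main

// quorum returns the smallest number of nodes that forms a majority of a
// cluster with n voting members (the N/2 + 1 rule, using integer division).
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many members may be down while a majority of
// the n-member cluster is still reachable.
func faultTolerance(n int) int {
	return n - quorum(n)
}

// canServe reports whether a cluster of n members, of which `alive` are
// reachable, still has the majority needed to elect a leader and commit.
func canServe(n, alive int) bool {
	return alive >= quorum(n)
}
```

For a 5-node cluster, `quorum(5)` is 3 and `faultTolerance(5)` is 2. Growing the cluster to an even size does not add fault tolerance over the next smaller odd size: `faultTolerance(4)` is 1, the same as for 3 nodes.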

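Raft is a comparatively easy-to-understand consensus protocol. I once put effort into studying Paxos as well, and the result... I think you can guess: I did not get it. Learning Paxos was somewhat painful. The Chinese-language material is murky, and the foreign material goes too deep. For now I only have a shallow grasp of its election, log replication, and partition tolerance under normal conditions; there is a lot about how Paxos handles failures that I still do not really understand.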

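Raft is a good thing. InfluxDB, which I used before, and etcd and Consul, which I use now, all rely on Raft to keep their data consistent. To prepare this talk I gritted my teeth and read the English-language Raft documentation, and ended up rather absorbed in it. I usually do not record video of my talks, so I try to make the slides as detailed as I can; anyone with a little Raft background should be able to read through them smoothly.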

PDF:

http://static.xiaorui.cc/raft_design.pdf

slideshare.net:

Raft from rfyiamcool

SlideShare can only be reached through a proxy from behind the firewall, so the text digest of the Raft slides is pasted below as well.

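1. How the Raft distributed consensus algorithm is implemented - 峰云就她了 - xiaorui.cc
2. Introduction: What is a consensus protocol? What are Raft's characteristics? Raft vs. Paxos? Raft's components and how they work? Assorted odd Raft scenarios? How to implement Raft?
3. Single-node environment: client, server. Is there a data consistency problem?
4. Multi-node environment: node 1, node 2, node 3. How, then, is data consistency guaranteed?
5. Roles: Follower, Candidate, Leader
6. Keywords: timer; term (time slice); term ID; N/2 + 1; heartbeats
7. Keywords: to be elected leader, a candidate must present its term ID and log index; a leader never deletes its own log entries; clients attach their own IDs to help Raft stay idempotent; once an entry is committed, every entry before it is committed as well.
8. Keywords: if the term and index match across nodes, we consider the data consistent; a term has at most one leader; each follower can vote for only one leader.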
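9. Keywords: currentTerm: the latest term the server has seen (initialized to 0, increases monotonically); votedFor: ID of the candidate that received this server's vote in the current term; log[]: the log entries (state machine command plus term ID); commitIndex: the highest log index known to be committed; nextIndex[]: for each follower, the index of the next log entry to send
10. Vote RPC. Arguments: term: the candidate's term; candidateId: the candidate's ID; lastLogIndex: index of the candidate's last log entry; lastLogTerm: term of the candidate's last log entry. Results: term: the current term, so the candidate can update itself; voteGranted: true or false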
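11. Simplest election: "vote for me", "vote for me"; "OK!", "OK!"
12. Simple election with C-1, F-1, F-2: "vote for me", "vote for me", "NO". Timer 155, term 2; timer 170, term 3; timer 183, term 3. The candidate's term ID is lower than the followers', so it has no effect on them; the followers' timers keep running. Having learned this, C deliberately lets its vote time out and waits for someone else to start an election.
13. Simple election with C-1 and C-2: RequestVote(term=2), voteGranted=true, term=2. C-2 has the same term ID: RequestVote(term=2), "NO! Term not match"; wait for the timeout.
14. Hard election (1): "vote for me", "OK!"; "vote for me", term does not match (term conflict); no N/2 + 1 majority; "OK!"; all nodes end up on the same term ID.
15. Election summary. Process: the timer fires; the follower increments current_term_id by 1, switches to the candidate state, and sends RequestVote RPCs. Outcomes: it wins the election; someone else is elected; or the election is rerun.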
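16. Client: works with the leader. The leader returns a response once it commits an entry. Assign a unique ID to every command; the leader stores the latest ID with the response.
17. Client process: only log entry! 1 Hello 2 Raft; 1 Hello 2 Raft; 1 Hello 2 Raft
18. Log replication: the default heartbeat interval is 50 ms; the default heartbeat timeout is 300 ms; log entries are committed on each heartbeat; an entry counts as successful once it reaches N/2 + 1 nodes.
19. Log RPC. Arguments: term: the leader's term; leaderId: the leader's ID, so that followers can redirect requests; prevLogIndex: index of the log entry immediately preceding the new ones; entries[]: the log entries to store (empty for a heartbeat; several may be sent at once for efficiency); leaderCommit: index of the entries the leader has already committed. Results: term: the current term, so the leader can update itself; success: true if the follower holds an entry matching prevLogIndex and prevLogTerm
20. Log replication (1): heartbeat & append entries, heartbeat & append entries; 1 Hello; 1 Hello; 1 Hello; only log entry!
21. Log replication (2): "OK!", "OK!"; 1 Hello; 1 Hello; 1 Hello; leader commit!
22. Log replication (3), Le_1, F_1, F_2: heartbeat & commit, heartbeat & commit; 1 Hello; 1 Hello; 1 Hello; follower commit!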
23. Common tricky cases
24. What if a node's reply times out? Le_1, F_1, F_2: heartbeat & commit; 1 Hello; 1 Hello; 1 Hello; timeout!!! How does F_2 keep its data consistent? The leader retries!
25. Leader crash. Le_1, F_1, F_2, F_3: log entry, ack; 1 Hello; 1 Hello; 1 Hello; 1 Hello. The leader crashes after committing locally but before sending the commit to the followers. Is Hello still there?
26. Follower crash. Le_1, F_1, F_2, F_3: prevLogIndex; 1 Hello 2 Raft; 1 Hello 2 Raft; 1 Hello 2 Raft; 1 Hello 2. How does F_3 catch up on data after it crashes and restarts?
27. Network partition
28. Normal case. Le_1, F_1, F_2, F_3, F_4: heartbeat & commit; 1 Hello; 1 Hello; 1 Hello; 1 Hello; 1 Hello
29. Network partition. Le_1, F_1, F_2, F_3, F_4: Request Vote, Vote Granted; 1 Hello; 1 Hello; 1 Hello; 1 Hello; 1 Hello. How can two nodes be enough for a quorum?!
30. New cluster running normally. Le_1, F_1, F_2, F_3, F_4: heartbeat & log entry & commit; 1 Hello 2 Tim; 1 Hello 2 Ying; 1 Hello 2 Ying; 1 Hello 2 Tim; 1 Hello 2 Ying. How can two nodes be enough for a quorum?!
31. Network recovers. Le_1, F_1, F_2, Le_2, F_4: heartbeat & append log entries; 1 Hello; 1 Hello 2 Ying; 1 Hello 2 Ying; 1 Hello; 1 Hello 2 Ying. Once the network is back, the contest for leadership begins; Le_1's term is lower than Le_2's!
32. Consistency. F_1, F_2, Le_2, F_4, F_5: heartbeat & log entry & commit; 1 Hello 2 Ying; 1 Hello 2 Ying; 1 Hello 2 Ying; 1 Hello 2 Ying; 1 Hello 2 Ying
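33. Split-brain conflicts: if one side had a quorum and produced N entries, how is its data reconciled with the new cluster? Overwrite vs. merge? What if some nodes had not received the commit before being partitioned? Timer check.
34. Preventing split brain: unicast to designated nodes; specify the quorum size explicitly, which has to be changed on every node addition or removal; increase timeouts and retry; use a single client entry point, but …; monitor for split brain by checking whether every node reports the same leader.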
35. Complex consistency. Each cell is a log entry; the values are the term IDs at log indexes 1 to 10, per host. S1: 44, 44, 55, 66, 77, 80, 89, 90; S2: 44, 44, 55, 66, 77, 80, 89; S3: 44, 44, 55, 66, 77; S4: 44, 44, 55, 70, 70, 85, 85; S5: 44, 44, 55, 70, 70, 85
36. Log compaction. S1, term IDs at log indexes 1 to 10: 44, 44, 55, 66, 77, 80, 89, 90. Snapshot: last included index: 6; last included term: 80; state machine state: x <- 0, y <- 9; all committed!
37. Study. Animated demo: https://ongardie.github.io/raft-talk-archive/2015/buildstuff/raftscope-replay/ Document: http://en.youscribe.com/catalogue/tous/professional-resources/it-systems/raft-in-search-of-an-understandable-consensus-algorithm-2088704 Google …
38. Q & A
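To make the two RPC slides a bit more concrete, here is a minimal Go sketch of how a node might answer the Vote RPC and the Log RPC (RequestVote and AppendEntries in the Raft paper). It only covers the term check, the one-vote-per-term rule, the "candidate's log must be at least as up to date" rule, and the prevLogIndex/prevLogTerm consistency check. The names `Node`, `Entry`, `HandleRequestVote`, and `HandleAppendEntries` are my own for illustration and do not come from any library; persistence, timers, and commitIndex handling are deliberately left out.

```go
package main

// Entry is one log record: the term it was created in plus a command.
type Entry struct {
	Term    int
	Command string
}

// Node holds only the state this sketch needs. Log indexes are 1-based,
// so Log[i] stores the entry at index i+1.
type Node struct {
	CurrentTerm int
	VotedFor    string // "" means no vote cast in CurrentTerm
	Log         []Entry
}

// lastLog returns the index and term of the last entry, or (0, 0) if empty.
func (n *Node) lastLog() (index, term int) {
	if len(n.Log) == 0 {
		return 0, 0
	}
	return len(n.Log), n.Log[len(n.Log)-1].Term
}

// HandleRequestVote returns this node's term and whether the vote is granted.
func (n *Node) HandleRequestVote(term int, candidateID string, lastLogIndex, lastLogTerm int) (int, bool) {
	if term < n.CurrentTerm {
		return n.CurrentTerm, false // stale candidate
	}
	if term > n.CurrentTerm {
		n.CurrentTerm = term // newer term: adopt it and forget the old vote
		n.VotedFor = ""
	}
	myIndex, myTerm := n.lastLog()
	upToDate := lastLogTerm > myTerm ||
		(lastLogTerm == myTerm && lastLogIndex >= myIndex)
	if upToDate && (n.VotedFor == "" || n.VotedFor == candidateID) {
		n.VotedFor = candidateID
		return n.CurrentTerm, true
	}
	return n.CurrentTerm, false
}

// HandleAppendEntries returns this node's term and whether the entries
// were accepted. An empty entries slice acts as a heartbeat.
func (n *Node) HandleAppendEntries(term, prevLogIndex, prevLogTerm int, entries []Entry) (int, bool) {
	if term < n.CurrentTerm {
		return n.CurrentTerm, false // stale leader
	}
	if term > n.CurrentTerm {
		n.CurrentTerm = term
		n.VotedFor = ""
	}
	// Consistency check: the entry just before the new ones must match.
	if prevLogIndex > len(n.Log) {
		return n.CurrentTerm, false
	}
	if prevLogIndex > 0 && n.Log[prevLogIndex-1].Term != prevLogTerm {
		return n.CurrentTerm, false
	}
	for i, e := range entries {
		slot := prevLogIndex + i // zero-based slot of this entry
		if slot < len(n.Log) {
			if n.Log[slot].Term == e.Term {
				continue // already have this entry
			}
			n.Log = n.Log[:slot] // conflict: drop it and everything after
		}
		n.Log = append(n.Log, e)
	}
	return n.CurrentTerm, true
}
```

With this sketch, a follower holding `1 Hello, 2 Tim` that receives entry 2 `Ying` from a leader with a higher term ends up with `1 Hello, 2 Ying`, which is the overwrite shown in the network-partition slides. A follower whose log is too short rejects the call, and a real leader would then retry with a smaller prevLogIndex.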

Source: 峰云就她了

Thanks to the author: rfyiamcool
