MongoDB Replica Set Synchronization: Principles and Examples

The sync process

After choosing which member to sync from, the secondary pulls ops from that member's oplog and, for each op:
1. Applies the op.
2. Writes the op to its own oplog (also local.oplog.rs).
3. Requests the next op.
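To make the loop concrete, here is a minimal mongo-shell sketch of what the pull amounts to. It assumes a hypothetical connection sourceConn (e.g. from new Mongo(...)) and a hypothetical applyOp() helper standing in for the server's internal op application; the real fetcher is C++ inside mongod:

var localOplog = db.getSiblingDB("local").oplog.rs;
// start from the last op this node has applied
var lastTs = localOplog.find().sort({$natural: -1}).limit(1).next().ts;
// tail the sync source's oplog for everything newer than that
var cursor = sourceConn.getDB("local").oplog.rs
    .find({ts: {$gt: lastTs}})
    .addOption(DBQuery.Option.tailable | DBQuery.Option.awaitData);
while (cursor.hasNext()) {
    var op = cursor.next();
    applyOp(op);           // 1. apply the op
    localOplog.insert(op); // 2. write it to our own oplog
    // 3. the next hasNext()/next() call is the request for the next op
}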
How far has a secondary synced?

The primary can tell how far a secondary has synced from the timestamps of the oplog entries the secondary requests.

How does primary know where secondary is synced to? Well, secondary is querying primary's oplog for more results. So, if secondary requests an op written at 3pm, primary knows secondary has replicated all ops written before 3pm.
So, it goes like:
1. Do a write on primary.
2. Write is written to the oplog on primary, with a field "ts" saying the write occurred at time t.
3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
4. secondary queries the oplog on primary and gets the op.
5. secondary applies the op from time t.
6. secondary requests ops with {ts:{$gt:t}} from primary's oplog.
7. primary updates that secondary has applied up to t because it is requesting ops > t.
8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.
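In shell terms, with the getLastError API of that era (the wtimeout below is my addition, to avoid blocking forever if no secondary catches up), the walkthrough corresponds to:

// on the primary
db.test.insert({_id: 1, msg: "hello"});                  // steps 1-2: the write lands in the oplog with a ts
db.runCommand({getLastError: 1, w: 2, wtimeout: 5000});  // step 3: block until one secondary also has it (steps 4-8)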
How sync works
If A syncs from B, and B syncs from C, how does C know how far A has synced? Look at the oplog read protocol:
When A syncs from B, A tells B: I am going to sync from your oplog; if you get any writes, let me know.
B replies: I am not the primary, let me forward that. B then tells the primary C: treat me as A; I am syncing from you on A's behalf. At this point B holds two connections to C: one of its own, and one on behalf of A.
When A requests ops (writes) from B, B turns to C to fulfill A's request.
A            B            C
   <====>
                <====>  <---->
<====> is the "real" sync connection. <----> is the "ghost" connection, B's connection to C on behalf of A.
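You can see such a chain from the shell: run the status command on A and then on B, and each reports its own sync source (B for A, C for B):

// run on each member; syncingTo names that member's sync source
db.adminCommand({replSetGetStatus: 1}).syncingTo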
Initial sync
A member performs initial sync when it is newly added or when it must resync from scratch.
There are seven steps:
1. Check the oplog. If it is not empty, this node does not initial sync, it just starts syncing normally. If the oplog is empty, then initial sync is necessary; continue to step 2.
2. Get the latest oplog time from the source member; call this time start.
3. Clone all of the data from the source member to the destination member.
4. Build indexes on destination. (In 2.0 this was part of the clone step; in 2.2 indexes are built after the data copy.)
5. Get the latest oplog time from the sync target, which is called minValid.
6. Apply the sync target's oplog from start to minValid.
7. Become a "normal" member (transition into secondary state).
My understanding: the oplog entries between start and minValid cover writes that happened during the copy but have not yet been applied, i.e. the part that has not reached eventual consistency; so this step is essentially an oplog replay.
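That replay can be sketched as a bounded oplog query, again assuming a hypothetical sourceConn and applyOp() as above:

// replay everything the source logged between start and minValid
sourceConn.getDB("local").oplog.rs
    .find({ts: {$gte: start, $lte: minValid}})
    .sort({$natural: 1})
    .forEach(function (op) { applyOp(op); });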
Looking at the source, rs_initialsync.cpp, the initial sync steps are:
/**
 * Do the initial sync for this member. There are several steps to this process:
 *
 *     1. Record start time.
 *     2. Clone.
 *     3. Set minValid1 to sync target's latest op time.
 *     4. Apply ops from start to minValid1, fetching missing docs as needed.
 *     5. Set minValid2 to sync target's latest op time.
 *     6. Apply ops from minValid1 to minValid2.
 *     7. Build indexes.
 *     8. Set minValid3 to sync target's latest op time.
 *     9. Apply ops from minValid2 to minValid3.
 *
 * At that point, initial sync is finished. Note that the oplog from the sync target is applied
 * three times: steps 4, 6, and 9. 4 may involve refetching, 6 should not. By the end of 6,
 * this member should have consistent data. 9 is "cosmetic," it is only to get this member
 * closer to the latest op time before it can transition to secondary state.
 */

The clone-data step works like this:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
            destinationServer.getDB(db).getCollection(collection).insert(doc)
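Spelled out as runnable mongo-shell code (a sketch only; the real cloner lives inside mongod, and the local database and system collections are skipped here for simplicity):

var sourceConn = new Mongo("source-host:27017"); // hypothetical source member
sourceConn.getDBNames().forEach(function (dbName) {
    if (dbName == "local") return; // the local database is never cloned
    var srcDb = sourceConn.getDB(dbName);
    srcDb.getCollectionNames().forEach(function (collName) {
        if (collName.indexOf("system.") == 0) return;
        srcDb.getCollection(collName).find().forEach(function (doc) {
            db.getSiblingDB(dbName).getCollection(collName).insert(doc);
        });
    });
});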
Characteristics of initial sync
Upside: the data ends up more compact and saves disk space, because every operation is an insert. Note that the padding factor is set to 1.
Downside: it is slow. Copying the data files under fsync+lock (taking a write lock) is faster.
Also, a restore via mongodump/mongorestore does not carry an oplog, so it is not really suitable as a "restore from backup" strategy.
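For reference, the fsync+lock approach looks like this with the shell helpers (copy the dbpath files at the OS level between the two calls):

db.fsyncLock();   // flush to disk and block writes
// ... copy the data files in dbpath to the new member ...
db.fsyncUnlock(); // resume writes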
Who to sync from
When MongoDB does initial sync, the source may be the primary or a secondary; it picks the nearest member to sync from.

By default, the member syncs from the closest member of the set that is either the primary or another secondary with more recent oplog entries. This prevents two secondaries from syncing from each other.
http://docs.mongodb.org/manual/core/replication-internals/

For example, the log line mentioned in the previous post: [rsSync] replSet syncing to: 10.0.0.106:20011
Here "syncing to" actually means "syncing from"; the wording survives for backward-compatibility reasons, as Kristina Chodorow put it, "Backwards compatibility sucks".
A replica set picks the nearest member (by ping time) using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets
    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets
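The same selection, modeled in shell JavaScript (illustrative only; members is a hypothetical array whose entries carry healthy, state, lastOpTimeWritten and ping fields):

function chooseSyncTarget(members, myLastOpTimeWritten) {
    var candidates = members.filter(function (m) {
        return m.healthy &&
               (m.state == "PRIMARY" ||
                m.lastOpTimeWritten > myLastOpTimeWritten);
    });
    if (candidates.length == 0) return null;
    // nearest = lowest ping among the candidates
    return candidates.sort(function (a, b) { return a.ping - b.ping; })[0];
}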
What counts as healthy differs between versions, but the intent is always to find members that are functioning normally. In 2.0 the check also took slave delay into account.
Run db.adminCommand({replSetGetStatus:1}) or rs.status() on a secondary to inspect its state; the syncingTo field shows that secondary's sync source.
2.2 adds the replSetSyncFrom command, which lets you specify the member to sync from:

db.adminCommand( { replSetSyncFrom: "[hostname]:[port]" } )

or

rs.syncFrom("[hostname]:[port]")

How is the nearest member chosen? Look at the source, taking the latest 2.2.2 as an example: mongodb-src-r2.2.2/src/mongo/db/repl/rs_initialsync.cpp

Member* ReplSetImpl::getMemberToSyncTo() {
    lock lk(this);

    bool buildIndexes = true;

    // if we have a target we've requested to sync from, use it
    if (_forceSyncTarget) {
        Member* target = _forceSyncTarget;
        _forceSyncTarget = 0;
        sethbmsg( str::stream() << "syncing to: " << target->fullName() << " by request", 0);
        return target;
    }

    Member* primary = const_cast<Member*>(box.getPrimary());

    // wait for 2N pings before choosing a sync target
    if (_cfg) {
        int needMorePings = config().members.size()*2 - HeartbeatInfo::numPings;

        if (needMorePings > 0) {
            OCCASIONALLY log() << "waiting for " << needMorePings << " pings from other members before syncing" << endl;
            return NULL;
        }

        buildIndexes = myConfig().buildIndexes;

        // If we are only allowed to sync from the primary, return that
        if (!_cfg->chainingAllowed()) {
            // Returns NULL if we cannot reach the primary
            return primary;
        }
    }

    // find the member with the lowest ping time that has more data than me

    // Find primary's oplog time. Reject sync candidates that are more than
    // MAX_SLACK_TIME seconds behind.
    OpTime primaryOpTime;
    static const unsigned maxSlackDurationSeconds = 10 * 60; // 10 minutes
    if (primary)
        primaryOpTime = primary->hbinfo().opTime;
    else
        // choose a time that will exclude no candidates, since we don't see a primary
        primaryOpTime = OpTime(maxSlackDurationSeconds, 0);

    if ( primaryOpTime.getSecs() < maxSlackDurationSeconds ) {
        // erh - I think this means there was just a new election
        // and we don't yet know the new primary's optime
        primaryOpTime = OpTime(maxSlackDurationSeconds, 0);
    }

    OpTime oldestSyncOpTime(primaryOpTime.getSecs() - maxSlackDurationSeconds, 0);

    Member *closest = 0;
    time_t now = 0;

    // Make two attempts. The first attempt, we ignore those nodes with
    // slave delay higher than our own. The second attempt includes such
    // nodes, in case those are the only ones we can reach.
    // This loop attempts to set 'closest'.
    for (int attempts = 0; attempts < 2; ++attempts) {
        for (Member *m = _members.head(); m; m = m->next()) {
            if (!m->hbinfo().up())
                continue;
            // make sure members with buildIndexes sync from other members w/indexes
            if (buildIndexes && !m->config().buildIndexes)
                continue;

            if (!m->state().readable())
                continue;

            if (m->state() == MemberState::RS_SECONDARY) {
                // only consider secondaries that are ahead of where we are
                if (m->hbinfo().opTime <= lastOpTimeWritten)
                    continue;
                // omit secondaries that are excessively behind, on the first attempt at least.
                if (attempts == 0 &&
                    m->hbinfo().opTime < oldestSyncOpTime)
                    continue;
            }

            // omit nodes that are more latent than anything we've already considered
            if (closest &&
                (m->hbinfo().ping > closest->hbinfo().ping))
                continue;

            if (attempts == 0 &&
                (myConfig().slaveDelay < m->config().slaveDelay || m->config().hidden)) {
                continue; // skip this one in the first attempt
            }

            map<string,time_t>::iterator vetoed = _veto.find(m->fullName());
            if (vetoed != _veto.end()) {
                // Do some veto housekeeping
                if (now == 0) {
                    now = time(0);
                }

                // if this was on the veto list, check if it was vetoed in the last "while".
                // if it was, skip.
                if (vetoed->second >= now) {
                    if (time(0) % 5 == 0) {
                        log() << "replSet not trying to sync from " << (*vetoed).first
                              << ", it is vetoed for " << ((*vetoed).second - now) << " more seconds" << rsLog;
                    }
                    continue;
                }
                _veto.erase(vetoed);
                // fall through, this is a valid candidate now
            }

            // This candidate has passed all tests; set 'closest'
            closest = m;
        }
        if (closest) break; // no need for second attempt
    }

    if (!closest) {
        return NULL;
    }

    sethbmsg( str::stream() << "syncing to: " << closest->fullName(), 0);

    return closest;
}

 
