[Redis Design and Implementation][12] Sentinel: Failure Detection

Checking for subjective down

Sentinel periodically sends a PING command to every monitored master and slave to detect whether the instance is down.

int sentinelSendPing(sentinelRedisInstance *ri) {
    int retval = redisAsyncCommand(ri->cc,
        sentinelPingReplyCallback, NULL, "PING");
    if (retval == REDIS_OK) {
        ri->pending_commands++;
        /* We update the ping time only if we received the pong for
         * the previous ping, otherwise we are technically waiting
         * since the first ping that did not received a reply. */
        // This field is reset to 0 only when an acceptable reply (e.g. PONG) arrives
        if (ri->last_ping_time == 0) ri->last_ping_time = mstime();
        return 1;
    } else {
        return 0;
    }
}
void sentinelPingReplyCallback(redisAsyncContext *c, void *reply, void *privdata) {
    ...
    if (strncmp(r->str,"PONG",4) == 0 ||
        strncmp(r->str,"LOADING",7) == 0 ||
        strncmp(r->str,"MASTERDOWN",10) == 0)
    {
        ri->last_avail_time = mstime();
        ri->last_ping_time = 0; /* Flag the pong as received. */
    }
    ...
}

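Only a PONG, LOADING, or MASTERDOWN reply counts as proof that the instance is alive. When one arrives, Sentinel records the time of the reply and clears the pending-ping timestamp.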
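The bookkeeping above can be sketched in isolation. This is a minimal sketch, not the Redis implementation: `ping_state`, `on_ping_sent`, `on_ping_reply`, and `pending_ping_age` are hypothetical names, and timestamps are passed in as plain millisecond values instead of calling mstime().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for sentinelRedisInstance. */
typedef struct {
    long long last_ping_time;   /* 0 means "no PING is awaiting a reply" */
    long long last_avail_time;  /* time of the last acceptable reply */
} ping_state;

/* Record that a PING was sent at `now` (ms). The timestamp is only set when
 * no earlier PING is pending, so it always marks the oldest unanswered PING. */
void on_ping_sent(ping_state *s, long long now) {
    assert(now > 0);
    if (s->last_ping_time == 0) s->last_ping_time = now;
}

/* Handle a reply. Returns 1 if it counts as proof of life, 0 otherwise. */
int on_ping_reply(ping_state *s, const char *reply, long long now) {
    if (strncmp(reply, "PONG", 4) == 0 ||
        strncmp(reply, "LOADING", 7) == 0 ||
        strncmp(reply, "MASTERDOWN", 10) == 0) {
        s->last_avail_time = now;
        s->last_ping_time = 0;  /* flag the reply as received */
        return 1;
    }
    return 0;
}

/* Milliseconds the oldest pending PING has been waiting, or 0 if none. */
long long pending_ping_age(const ping_state *s, long long now) {
    return s->last_ping_time ? now - s->last_ping_time : 0;
}
```

Because a second PING does not overwrite `last_ping_time`, the measured age keeps growing until an acceptable reply arrives, which is what the subjective-down check relies on.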

In addition, every call to sentinelHandleRedisInstance checks whether a monitored master or slave has exceeded its subjective-down timeout:

void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    mstime_t elapsed = 0;

    if (ri->last_ping_time)
        elapsed = mstime() - ri->last_ping_time;

    /* Check if we are in need for a reconnection of one of the
     * links, because we are detecting low activity.
     *
     * 1) Check if the command link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
     *    pending ping for more than half the timeout. */
    // Check whether the command connection has timed out
    if (ri->cc &&
        (mstime() - ri->cc_conn_time) > SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        ri->last_ping_time != 0 && /* There is a pending ping... */
        /* The pending ping is delayed, and we did not received
         * error replies as well. */
        (mstime() - ri->last_ping_time) > (ri->down_after_period/2) &&
        (mstime() - ri->last_pong_time) > (ri->down_after_period/2))
    {
        sentinelKillLink(ri,ri->cc);
    }

    /* 2) Check if the pubsub link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
     *    activity in the Pub/Sub channel for more than
     *    SENTINEL_PUBLISH_PERIOD * 3.
     */
    // Check whether the Pub/Sub connection has timed out
    if (ri->pc &&
        (mstime() - ri->pc_conn_time) > SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        (mstime() - ri->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
    {
        sentinelKillLink(ri,ri->pc);
    }

    /* Update the SDOWN flag. We believe the instance is SDOWN if:
     *
     * 1) It is not replying.
     * 2) We believe it is a master, it reports to be a slave for enough time
     *    to meet the down_after_period, plus enough time to get two times
     *    INFO report from the instance. */
    if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&
         mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
    {
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(REDIS_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } else {
        /* Is subjectively up */
        if (ri->flags & SRI_S_DOWN) {
            sentinelEvent(REDIS_WARNING,"-sdown",ri,"%@");
            ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
        }
    }
}

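Note: all slaves of the same master share that master's subjective-down timeout (down-after-milliseconds in the configuration file), while each master can be configured independently.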
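The SDOWN decision at the end of sentinelCheckSubjectivelyDown can be isolated as a pure predicate. The following is a sketch under assumptions: `sdown_input` and `is_subjectively_down` are hypothetical names, and `INFO_PERIOD_MS` is an assumed stand-in for SENTINEL_INFO_PERIOD, whose value is not shown in the listing above.

```c
#include <assert.h>

#define INFO_PERIOD_MS 10000 /* assumed stand-in for SENTINEL_INFO_PERIOD */

/* Hypothetical subset of the fields the SDOWN check reads. */
typedef struct {
    long long last_ping_time;     /* 0 if no PING is pending */
    long long down_after_period;  /* down-after-milliseconds */
    int is_master;                /* we monitor this instance as a master */
    int reports_slave;            /* but it reports itself as a slave */
    long long role_reported_time; /* when that role was first reported */
} sdown_input;

/* Returns 1 if the instance should be flagged subjectively down at `now`. */
int is_subjectively_down(const sdown_input *in, long long now) {
    assert(in->down_after_period > 0);
    long long elapsed = in->last_ping_time ? now - in->last_ping_time : 0;

    /* Condition 1: the oldest pending PING has waited too long. */
    if (elapsed > in->down_after_period) return 1;

    /* Condition 2: a supposed master has claimed to be a slave for the
     * timeout plus the time needed for two INFO reports. */
    if (in->is_master && in->reports_slave &&
        now - in->role_reported_time >
            in->down_after_period + INFO_PERIOD_MS * 2) return 1;

    return 0;
}
```

Both comparisons are strict, so an instance whose pending PING is exactly `down_after_period` old is still considered up.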

Checking for objective down

void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {
    ...
    /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }
}

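At the end of sentinelHandleRedisInstance, the following extra steps run for masters only:
* Check for objective down
* Ask the other sentinels for their view of the master
* Run the failover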

Let's first look at how the other sentinels are asked for their view:

#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    // Iterate over every known sentinel stored in the sentinels dictionary
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not received the info within SENTINEL_ASK_PERIOD ms. */
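        // Only ask when this sentinel has itself marked the master as subjectively down
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        // Make sure the other sentinel is still connected
        if (ri->flags & SRI_DISCONNECTED) continue;
        // Unless forced, skip if a reply arrived less than SENTINEL_ASK_PERIOD (1000 ms) ago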
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->cc,
                    sentinelReceiveIsMasterDownReply, NULL,
                    "SENTINEL is-master-down-by-addr %s %s %llu %s",
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    server.runid : "*");
        if (retval == REDIS_OK) ri->pending_commands++;
    }
    dictReleaseIterator(di);
}

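The request carries:
* The IP of the master judged subjectively down (master->addr->ip)
* The port of the master judged subjectively down (port)
* The sentinel's current epoch
* A run ID: * means the request only asks whether the master is down, while the current sentinel's own run ID means the request also asks for a vote in the leader election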
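The four arguments listed above can be shown as a flat command line. This is only an illustrative sketch: `format_is_master_down` is a hypothetical helper, and the real code hands the format string to redisAsyncCommand, which encodes the arguments in the Redis protocol rather than building one string.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write the logical command text into `buf`.
 * Pass runid == NULL for a plain state query ("*"), or this sentinel's
 * run ID to also request a vote.
 * Returns the length written, or -1 if `buf` is too small. */
int format_is_master_down(char *buf, size_t size, const char *ip, int port,
                          unsigned long long epoch, const char *runid) {
    assert(buf != NULL && ip != NULL);
    int n = snprintf(buf, size,
                     "SENTINEL is-master-down-by-addr %s %d %llu %s",
                     ip, port, epoch, runid ? runid : "*");
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling it with ip "127.0.0.1", port 6379, epoch 5, and a NULL run ID produces the state-query form that ends in `*`.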

Receiving the reply:

void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = c->data;
    redisReply *r;
    REDIS_NOTUSED(privdata);

    if (ri) ri->pending_commands--;
    if (!reply || !ri) return;
    r = reply;

    /* Ignore every error or unexpected reply.
     * Note that if the command returns an error for any reason we'll
     * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
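    // Reply format: element 0 says whether the peer agrees the master is down (1 = yes, 0 = no)
    // Element 1 is "*" for a plain state query, otherwise the run ID the peer voted for
    // Element 2 is the leader epoch, meaningful only when element 1 is not "*"; otherwise it is 0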
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        if (r->element[0]->integer == 1) {
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        if (strcmp(r->element[1]->str,"*")) {
            /* If the runid in the reply is not "*" the Sentinel actually
             * replied with a vote. */
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                redisLog(REDIS_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            ri->leader = sdsnew(r->element[1]->str);
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}
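How a validated three-element reply updates the view of a peer sentinel can be sketched without hiredis. `peer_view` and `apply_reply` are hypothetical names, and plain malloc/free stand in for the sds string functions used by the real code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical view of what one peer sentinel last told us. */
typedef struct {
    int master_down;                 /* mirrors the SRI_MASTER_DOWN flag */
    char *leader;                    /* run ID the peer voted for, or NULL */
    unsigned long long leader_epoch; /* epoch of that vote */
} peer_view;

/* Apply the three reply elements to the peer's record.
 * Returns 1 if the reply carried a vote, 0 for a plain state reply,
 * and -1 if memory for the run ID could not be allocated. */
int apply_reply(peer_view *p, long long down, const char *runid,
                long long epoch) {
    assert(p != NULL && runid != NULL);
    p->master_down = (down == 1);
    if (strcmp(runid, "*") == 0) return 0; /* state query only, no vote */

    size_t len = strlen(runid) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return -1;
    memcpy(copy, runid, len);

    free(p->leader); /* drop any previous vote */
    p->leader = copy;
    p->leader_epoch = (unsigned long long)epoch;
    return 1;
}
```

The down flag is updated on every valid reply, while the vote fields change only when the peer answers with an actual run ID.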

Checking whether the master is objectively down:

void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0, odown = 0;

    if (master->flags & SRI_S_DOWN) {
        /* Is down for enough sentinels? */
        quorum = 1; /* the current sentinel. */
        /* Count all the other sentinels. */
        // Walk the sentinels dictionary and count the sentinels that report the master as down
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);

            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);
        if (quorum >= master->quorum) odown = 1;
    }

    /* Set the flag accordingly to the outcome. */
    if (odown) {
        if ((master->flags & SRI_O_DOWN) == 0) {
            sentinelEvent(REDIS_WARNING,"+odown",master,"%@ #quorum %d/%d",
                quorum, master->quorum);
            master->flags |= SRI_O_DOWN;
            master->o_down_since_time = mstime();
        }
    } else {
        if (master->flags & SRI_O_DOWN) {
            sentinelEvent(REDIS_WARNING,"-odown",master,"%@");
            master->flags &= ~SRI_O_DOWN;
        }
    }
}

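With the is-master-down-by-addr queries and the SRI_MASTER_DOWN flags they set, the sentinel knows how many sentinels consider the master down. Once that count, including the current sentinel itself, reaches the quorum set in the configuration file, the master is marked objectively down.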
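The quorum rule reduces to a small counting function. This is a sketch with hypothetical names (`is_objectively_down`, `peer_down`); an int array stands in for the SRI_MASTER_DOWN flag of each entry in the sentinels dictionary.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the master should be flagged objectively down.
 * self_sdown:    this sentinel has the master flagged subjectively down
 * peer_down[i]:  nonzero if peer i reported the master as down
 * quorum_needed: the configured quorum for this master */
int is_objectively_down(int self_sdown, const int *peer_down, size_t n_peers,
                        unsigned int quorum_needed) {
    assert(quorum_needed > 0);
    if (!self_sdown) return 0; /* never ODOWN without local SDOWN */

    unsigned int votes = 1;    /* the current sentinel counts itself */
    for (size_t i = 0; i < n_peers; i++)
        if (peer_down[i]) votes++;

    return votes >= quorum_needed;
}
```

With a quorum of 1, a single sentinel that sees the master as subjectively down marks it objectively down without any agreement from peers.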

Reposted from: https://coolex.info/blog/473.html

Time: 2024-09-20 06:14:11
