PostgreSQL Reliability and Consistency: Code Analysis

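PostgreSQL's data reliability rests on its XLOG (WAL) implementation. Before any change to a data block is written to disk, the REDO record generated by that change is guaranteed to have been written to XLOG, and that XLOG is guaranteed to have reached disk.
In other words, the flow is: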
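1. Read the block to be changed from its file into shared buffers.

2. Change the contents of the block in shared buffers.

3. Write the change made to the block into XLOG. If this is the first change to the block since the last checkpoint, write a full-page image (whether full pages are written is controlled by the full_page_writes parameter).

4. Before bgwriter writes a dirty block from shared buffers out to the OS page cache, it makes sure the corresponding XLOG has already reached disk. The dirty block's LSN is what makes this check possible.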
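
The four steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not PostgreSQL code: the names ToyPage, toy_xlog_insert, toy_xlog_flush, toy_modify_page and toy_flush_buffer are hypothetical, the WAL is reduced to two counters, and the fixed record length of 32 is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, NOT PostgreSQL code: every name below is hypothetical. */
typedef uint64_t ToyLSN;

static ToyLSN wal_insert_lsn = 0;   /* end of WAL generated so far (in memory) */
static ToyLSN wal_flushed_lsn = 0;  /* end of WAL known to be on disk */

typedef struct ToyPage
{
    int     value;       /* page contents in the shared buffer */
    ToyLSN  lsn;         /* end LSN of the last WAL record that changed the page */
    bool    dirty;       /* buffer differs from the on-disk copy */
    int     disk_value;  /* contents of the on-disk copy */
} ToyPage;

/* Step 3: append a WAL record of record_len bytes and return its end LSN. */
static ToyLSN
toy_xlog_insert(uint64_t record_len)
{
    wal_insert_lsn += record_len;
    return wal_insert_lsn;
}

/* Make WAL durable up to 'upto'; does nothing if already flushed that far. */
static void
toy_xlog_flush(ToyLSN upto)
{
    if (upto > wal_flushed_lsn)
        wal_flushed_lsn = upto;
}

/* Steps 2 and 3: change the buffered page, log the change, stamp the page LSN. */
static void
toy_modify_page(ToyPage *page, int new_value)
{
    page->value = new_value;
    page->lsn = toy_xlog_insert(32);    /* arbitrary record length */
    page->dirty = true;
}

/* Step 4: WAL up to the page LSN is flushed before the page itself is written. */
static void
toy_flush_buffer(ToyPage *page)
{
    if (!page->dirty)
        return;

    toy_xlog_flush(page->lsn);
    assert(page->lsn <= wal_flushed_lsn);   /* the WAL-before-data rule */

    page->disk_value = page->value;
    page->dirty = false;
}
```

In this model a page can be modified any number of times without flushing WAL. The flush is forced only at the moment the page is about to leave the buffer, which is the property the rest of the article traces through the real source.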

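So here is the question: what happens if the user enables asynchronous commit, i.e. synchronous_commit=off?
Nothing breaks, because step 4 always guarantees that the XLOG that dirtied a page reaches disk before the page does.
So setting synchronous_commit=off can only lose XLOG, meaning the most recent commits whose XLOG had not yet been flushed when a crash occurred. It never causes data inconsistency.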
The code that ensures reliability and consistency is as follows:


/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
BackgroundWriterMain(void)
{
...
        /*
         * Do one cycle of dirty-buffer writing.
         */
        can_hibernate = BgBufferSync();

...

/*
 * BgBufferSync -- Write out some dirty buffers in the pool.
 *
 * This is called periodically by the background writer process.
 *
 * Returns true if it's appropriate for the bgwriter process to go into
 * low-power hibernation mode.  (This happens if the strategy clock sweep
 * has been "lapped" and no buffer allocations have occurred recently,
 * or if the bgwriter has been effectively disabled by setting
 * bgwriter_lru_maxpages to 0.)
 */
bool
BgBufferSync(void)
{
...

    /* Execute the LRU scan */
    while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
    {
        int         buffer_state = SyncOneBuffer(next_to_clean, true);

...

/*
 * SyncOneBuffer -- process a single buffer during syncing.
 *
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor
 * buffers marked recently used, as these are not replacement candidates.
 *
 * Returns a bitmask containing the following flag bits:
 *  BUF_WRITTEN: we wrote the buffer.
 *  BUF_REUSABLE: buffer is available for replacement, ie, it has
 *      pin count 0 and usage count 0.
 *
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
 * after locking it, but we don't care all that much.)
 *
 * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used)
{

...

    FlushBuffer(bufHdr, NULL);
...

/*
 * FlushBuffer
 *      Physically write out a shared buffer.
 *
 * NOTE: this actually just passes the buffer contents to the kernel; the
 * real write to disk won't happen until the kernel feels like it.  This
 * is okay from our point of view since we can redo the changes from WAL.
 * However, we will need to force the changes to disk via fsync before
 * we can checkpoint WAL.
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *
 * If the caller has an smgr reference for the buffer's relation, pass it
 * as the second parameter.  If not, pass NULL.
 */
static void
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{

...

    /*
     * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
     * rule that log updates must hit disk before any of the data-file changes
     * they describe do.
     *
     * However, this rule does not apply to unlogged relations, which will be
     * lost after a crash anyway.  Most unlogged relation pages do not bear
     * LSNs since we never emit WAL records for them, and therefore flushing
     * up through the buffer LSN would be useless, but harmless.  However,
     * GiST indexes use LSNs internally to track page-splits, and therefore
     * unlogged GiST pages bear "fake" LSNs generated by
     * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
     * LSN counter could advance past the WAL insertion point; and if it did
     * happen, attempting to flush WAL through that location would fail, with
     * disastrous system-wide consequences.  To make sure that can't happen,
     * skip the flush if the buffer isn't permanent.
     */
    if (buf->flags & BM_PERMANENT)
        XLogFlush(recptr);

...

/*
 * Ensure that all XLOG data through the given position is flushed to disk.
 *
 * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
 * already held, and we try to avoid acquiring it if possible.
 */
void
XLogFlush(XLogRecPtr record)
{
    XLogRecPtr  WriteRqstPtr;
    XLogwrtRqst WriteRqst;

...
        XLogWrite(WriteRqst, false);

...

/*
 * Write and/or fsync the log at least as far as WriteRqst indicates.
 *
 * If flexible == TRUE, we don't have to write as far as WriteRqst, but
 * may stop at any convenient boundary (such as a cache or logfile boundary).
 * This option allows us to avoid uselessly issuing multiple writes when a
 * single one would do.
 *
 * Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
 * must be called before grabbing the lock, to make sure the data is ready to
 * write.
 */
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{

...
    /*
     * If asked to flush, do so
     */
    if (LogwrtResult.Flush < WriteRqst.Flush &&
        LogwrtResult.Flush < LogwrtResult.Write)

    {
        /*
         * Could get here without iterating above loop, in which case we might
         * have no open file or the wrong one.  However, we do not need to
         * fsync more than one file.
         */
        if (sync_method != SYNC_METHOD_OPEN &&
            sync_method != SYNC_METHOD_OPEN_DSYNC)
        {
            if (openLogFile >= 0 &&
                !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
                XLogFileClose();
            if (openLogFile < 0)
            {
                XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
                openLogFile = XLogFileOpen(openLogSegNo);
                openLogOff = 0;
            }

            issue_xlog_fsync(openLogFile, openLogSegNo);
        }

        /* signal that we need to wakeup walsenders later */
        WalSndWakeupRequest();

        LogwrtResult.Flush = LogwrtResult.Write;
    }
...
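
The comments in FlushBuffer and XLogWrite rest on the difference between handing data to the kernel and forcing it onto stable storage. Below is a minimal POSIX sketch of that difference. It is not how PostgreSQL manages its files; the helper name toy_write_durably is hypothetical and error handling is deliberately simplified.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT PostgreSQL code: append 'len' bytes to 'path'
 * and optionally force them to stable storage.
 * Returns 0 on success, -1 on any error.
 */
static int
toy_write_durably(const char *path, const void *data, size_t len, bool do_fsync)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);

    if (fd < 0)
        return -1;

    const char *p = data;
    size_t      left = len;

    while (left > 0)
    {
        /* write() only copies the bytes into the kernel's page cache. */
        ssize_t n = write(fd, p, left);

        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }

    /* fsync() is what asks the kernel to push the data to the device. */
    if (do_fsync && fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

In these terms, a data page is handed over in the do_fsync = false style and made durable later at checkpoint time, while XLOG takes the do_fsync = true path. The sync_method test in XLogWrite skips the explicit fsync for the open_sync and open_datasync methods because the file is then opened with O_SYNC or O_DSYNC, so each write is already synchronous.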

The asynchronous commit code, in RecordTransactionCommit, is as follows:

    /*
     * Check if we want to commit asynchronously.  We can allow the XLOG flush
     * to happen asynchronously if synchronous_commit=off, or if the current
     * transaction has not performed any WAL-logged operation or didn't assign
     * a xid.  The transaction can end up not writing any WAL, even if it has
     * a xid, if it only wrote to temporary and/or unlogged tables.  It can
     * end up having written WAL without an xid if it did HOT pruning.  In
     * case of a crash, the loss of such a transaction will be irrelevant;
     * temp tables will be lost anyway, unlogged tables will be truncated and
     * HOT pruning will be done again later. (Given the foregoing, you might
     * think that it would be unnecessary to emit the XLOG record at all in
     * this case, but we don't currently try to do that.  It would certainly
     * cause problems at least in Hot Standby mode, where the
     * KnownAssignedXids machinery requires tracking every XID assignment.  It
     * might be OK to skip it only when wal_level < hot_standby, but for now
     * we don't.)
     *
     * However, if we're doing cleanup of any non-temp rels or committing any
     * command that wanted to force sync commit, then we must flush XLOG
     * immediately.  (We must not allow asynchronous commit if there are any
     * non-temp tables to be deleted, because we might delete the files before
     * the COMMIT record is flushed to disk.  We do allow asynchronous commit
     * if all to-be-deleted tables are temporary though, since they are lost
     * anyway if we crash.)
     */
    if ((wrote_xlog && markXidCommitted &&
         synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
        forceSyncCommit || nrels > 0)
    {
        XLogFlush(XactLastRecEnd);

        /*
         * Now we may update the CLOG, if we wrote a COMMIT record above
         */
        if (markXidCommitted)
            TransactionIdCommitTree(xid, nchildren, children);
    }
    else
    {
        /*
         * Asynchronous commit case:
         *
         * This enables possible committed transaction loss in the case of a
         * postmaster crash because WAL buffers are left unwritten. Ideally we
         * could issue the WAL write without the fsync, but some
         * wal_sync_methods do not allow separate write/fsync.
         *
         * Report the latest async commit LSN, so that the WAL writer knows to
         * flush this commit.
         */
        XLogSetAsyncXactLSN(XactLastRecEnd);

        /*
         * We must not immediately update the CLOG, since we didn't flush the
         * XLOG. Instead, we store the LSN up to which the XLOG must be
         * flushed before the CLOG may be updated.
         */
        if (markXidCommitted)
            TransactionIdAsyncCommitTree(xid, nchildren, children, XactLastRecEnd);
    }
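
The branch condition above can be isolated as a small predicate to make the cases easier to see. This is a sketch, not PostgreSQL code: the enum ToySyncCommitLevel and the function toy_must_flush_at_commit are hypothetical, and the enum models only two levels, with off as the lowest value as the comparison in the excerpt implies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-level stand-in for the synchronous_commit setting. */
typedef enum ToySyncCommitLevel
{
    TOY_SYNC_COMMIT_OFF = 0,
    TOY_SYNC_COMMIT_ON
} ToySyncCommitLevel;

/*
 * Returns true when the commit must flush XLOG before returning,
 * following the same boolean structure as the condition in the excerpt.
 */
static bool
toy_must_flush_at_commit(bool wrote_xlog,
                         bool mark_xid_committed,
                         ToySyncCommitLevel level,
                         bool force_sync_commit,
                         int nrels)
{
    return (wrote_xlog && mark_xid_committed && level > TOY_SYNC_COMMIT_OFF) ||
           force_sync_commit ||
           nrels > 0;
}
```

With synchronous_commit off, an ordinary transaction takes the asynchronous branch, but one that has files to delete (nrels > 0) or that forced a sync commit still flushes immediately. A transaction that wrote no XLOG skips the flush even when synchronous commit is on.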
Time: 2024-10-27 14:36:20
