An Overview of NoSQL Data Consistency Techniques

Main reference: http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/, Distributed Algorithms in NoSQL Databases

 

Data Consistency

It is well known and fairly obvious that in geographically distributed systems or other environments with probable network partitions or delays it is not generally possible to maintain high availability without sacrificing consistency, because isolated parts of the database have to operate independently in case of network partition. This fact is often referred to as the CAP theorem. However, consistency is a very expensive thing in distributed systems, so it can be traded not only for availability; it is often involved in multiple tradeoffs. To study these tradeoffs, we first note that consistency issues in distributed systems are induced by replication and the spatial separation of coupled data, so we have to start with the goals and desired properties of replication:

According to the CAP theorem, systems today often sacrifice consistency in exchange for availability, because guaranteeing data consistency in a distributed system is very hard. The author argues that besides availability there are other factors for which consistency may be traded away, listed below.

  • Availability. Isolated parts of the database can serve read/write requests in case of network partition.
  • Read/Write latency. Read/Write requests are processed with minimal latency.
  • Read/Write scalability. Read/Write load can be balanced across multiple nodes.
  • Fault-tolerance. Ability to serve read/write requests does not depend on availability of any particular node.
  • Data persistence. Node failures within certain limits do not cause data loss.

 

Consistency is a much more complicated property than the previous ones, so we have to discuss different options in detail. It is beyond the scope of this article to go deeply into theoretical consistency and concurrency models, so we use a very lean framework of simple properties.



  • Read-Write consistency


    From the read-write perspective, the basic goal of a database is to minimize a replica convergence time (how long does it take to propagate an update to all replicas) and guarantee eventual consistency. Besides these weak guarantees, one can be interested in stronger consistency properties:

    • Read-after-write consistency. The effect of a write operation on a data item X will always be seen by a successive read operation on X.
    • Read-after-read consistency. If some client reads the value of a data item X, any successive read operation on X will always return the same or a more recent value.

Read consistency is easy to understand: after one replica is updated, propagating that update to all replicas inevitably takes time, so we want to minimize that time and guarantee eventual consistency.

During this window, reads against different replicas may return inconsistent results; read-after-write consistency is the guarantee that a read following a write always observes that write.

 

  • Write-Write consistency

    Write-write conflicts appear in case of database partition, so a database should either handle these conflicts somehow or guarantee that concurrent writes will not be processed by different partitions. From this perspective, a database can offer different consistency models:

    • Atomic Writes. If a database provides an API where a write request can only be an independent atomic assignment of a value, one possible way to avoid write-write conflicts is to pick the “most recent” version of each entity. This guarantees that all nodes will end up with the same version of data irrespective of the order of updates, which can be affected by network failures and delays. The data version can be specified by a timestamp or an application-specific metric. This approach is used, for example, in Cassandra.
    • Atomic Read-modify-write. Applications often do a read-modify-write sequence instead of independent atomic writes. If two clients read the same version of data, modify it and write back concurrently, the latest update will silently override the first one in the atomic writes model. This behavior can be semantically inappropriate (for example, if both clients add a value to a list). A database can offer at least two solutions:
      • Conflict prevention. Read-modify-write can be thought of as a particular case of a transaction, so distributed locking or consensus protocols like PAXOS [20, 21] are both solutions. This is a generic technique that can support both atomic read-modify-write semantics and arbitrary isolated transactions. An alternative approach is to prevent distributed concurrent writes entirely and route all writes of a particular data item to a single node (global master or shard master). To prevent conflicts, a database must sacrifice availability in case of network partitioning and stop all but one partition. This approach is used in many systems with strong consistency guarantees (e.g. most RDBMSs, HBase, MongoDB).
      • Conflict detection. A database tracks concurrent conflicting updates and either rolls back one of the conflicting updates or preserves both versions for resolution on the client side. Concurrent updates are typically tracked by using vector clocks [19] (which can be thought of as a generalization of optimistic locking) or by preserving an entire version history. This approach is used in systems like Riak, Voldemort, and CouchDB.
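The last-write-wins reconciliation behind Atomic Writes can be sketched in a few lines of Python; the timestamp-based versioning here is an assumption standing in for whatever version metric the database actually uses:

```python
# Last-write-wins reconciliation: every replica keeps (timestamp, value)
# pairs and converges on the version with the highest timestamp,
# regardless of the order in which updates arrive.

def lww_merge(versions):
    """Pick the most recent version from a list of (timestamp, value) pairs.

    Ties are broken by comparing the values themselves, so that all
    replicas make the same deterministic choice.
    """
    return max(versions)  # tuples compare by timestamp first, then value

# Updates may arrive at different replicas in different orders...
replica_a = [(3, "z"), (1, "x"), (2, "y")]
replica_b = [(2, "y"), (3, "z"), (1, "x")]

# ...but both converge on the same winner.
assert lww_merge(replica_a) == lww_merge(replica_b) == (3, "z")
```

Note that this silently discards all but the newest version, which is exactly why it is only safe for independent atomic assignments, not for read-modify-write sequences.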

Write consistency is more complicated: under concurrent writes, write operations can easily conflict with and overwrite one another.

The first, simpler case is independent writes: the write does not care about the previous value, it is a pure state assignment. The only issue is ordering: the “most recent” version must win, yet because of network failures and delays the “most recent” version may actually arrive later. The usual approach is to establish order via timestamps or an application-specific metric.

The second case, read-modify-write, is more complex. If left uncontrolled, another concurrent write may update the value mid-sequence, or two clients may read the same value and each execute read-modify-write, producing a write conflict.

There are two families of solutions:

  • Prevention up front (pessimistic locking), favoring strong consistency: use distributed locks or a consensus protocol such as Paxos. An alternative is to route all writes through a master that serializes concurrent writes, as in HBase and MongoDB; this introduces a single point of failure, but its advantage is simplicity.
  • Detection after the fact (optimistic concurrency), favoring high availability: each side stores its own version, and when versions conflict, the client side resolves them at read time; Dynamo and CouchDB both use this approach.
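The conflict-detection path can be illustrated with a minimal vector-clock comparison. This is a sketch of the general technique, not any particular database's implementation:

```python
# Minimal vector-clock comparison for conflict detection: each version
# carries a {node: counter} map, and two versions conflict when neither
# clock dominates the other.

def dominates(vc_a, vc_b):
    """True if vc_a descends from vc_b (componentwise vc_a >= vc_b)."""
    nodes = set(vc_a) | set(vc_b)
    return all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)

def compare(vc_a, vc_b):
    """Classify two versions: equal, one newer, or concurrent (conflict)."""
    a_ge_b, b_ge_a = dominates(vc_a, vc_b), dominates(vc_b, vc_a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a is newer"
    if b_ge_a:
        return "b is newer"
    return "concurrent"  # keep both versions, let the client resolve

# A sequential update: node1 wrote again on top of its earlier version.
assert compare({"node1": 2}, {"node1": 1}) == "a is newer"

# Two clients updated the same base version independently: a conflict.
assert compare({"node1": 1, "node2": 1},
               {"node1": 1, "node3": 1}) == "concurrent"
```

The "concurrent" outcome is precisely the case the atomic-writes model would silently overwrite; here both versions survive for client-side resolution.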

 

 

Now let’s take a closer look at commonly used replication techniques and classify them in accordance with the described properties. The first figure below depicts logical relationships between different techniques and their coordinates in the system of the consistency-scalability-availability-latency tradeoffs. The second figure illustrates each technique in detail.

The above are the properties a distributed system design must weigh: consistency, scalability, availability, and latency. Each design is a particular tradeoff among them. The first figure below shows where each technique sits in this tradeoff space, and the second figure depicts each design in detail.

Replication factor 4. It is assumed that the read/write coordinator can be either an external client or a proxy node within the database.

Let’s go through all these techniques moving from weak to strong consistency guarantees:

  • (A, Anti-Entropy) The weakest consistency guarantees are provided by the following strategy. The writer updates any arbitrarily selected replica. The reader reads any replica and sees the old data until a new version is propagated via a background anti-entropy protocol (more on anti-entropy protocols in the next section). The main properties of this approach are:

    • High propagation latency makes it quite impractical for data synchronization, so it is typically used only as an auxiliary background process that detects and repairs unplanned inconsistencies. However, databases like Cassandra use anti-entropy as a primary way to propagate information about database topology and other metadata.
    • Consistency guarantees are poor: write-write conflicts and read-write discrepancies are very probable even in the absence of failures.
    • Superior availability and robustness against network partitions. This schema provides good performance because individual updates are replaced by asynchronous batch processing.
    • Persistence guarantees are weak because new data are initially stored on a single replica.
  • (B) An obvious improvement of the previous schema is to send an update to all (available) replicas asynchronously as soon as the update request hits any replica. It can be considered as a kind of targeted anti-entropy.
    • In comparison with pure anti-entropy, this greatly improves consistency with a relatively small performance penalty. However, formal consistency and persistence guarantees remain the same.
    • If some replica is temporary unavailable due to network failures or node failure/replacement, updates should be eventually delivered to it by the anti-entropy process.
  • (C) In the previous schema, failures can be handled better using the hinted handoff technique [8]. Updates that are intended for unavailable nodes are recorded on the coordinator or any other node with a hint that they should be delivered to a certain node as soon as it will become available. This improves persistence guarantees and replica convergence time.
  • (D, Read One Write One) Since the carrier of hinted handoffs can fail before deferred updates were propagated, it makes sense to enforce consistency by so-called read repairs. Each read (or randomly selected reads) triggers an asynchronous process that requests a digest (a kind of signature/hash) of the requested data from all replicas and reconciles inconsistencies if detected.
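A minimal sketch of the read-repair idea from technique D, assuming timestamped values; the digest function is illustrative, standing in for whatever signature/hash a real system would use:

```python
# Read repair: request a digest of the item from every replica; if the
# digests disagree, push the freshest full version back to stale replicas.

import hashlib

def digest(version):
    """An illustrative content digest (a stand-in for a real signature)."""
    return hashlib.sha256(repr(version).encode()).hexdigest()

def read_repair(replicas):
    """replicas: {node: (timestamp, value)}. Reconcile in place.

    Returns the list of nodes that were repaired.
    """
    digests = {node: digest(v) for node, v in replicas.items()}
    if len(set(digests.values())) == 1:
        return []                      # all replicas already agree
    freshest = max(replicas.values())  # highest timestamp wins
    stale = [n for n, v in replicas.items() if v != freshest]
    for node in stale:
        replicas[node] = freshest      # push the fresh version to stale nodes
    return stale

reps = {"n1": (2, "new"), "n2": (1, "old"), "n3": (2, "new")}
assert read_repair(reps) == ["n2"]
assert reps["n2"] == (2, "new")
assert read_repair(reps) == []         # second pass: nothing left to repair
```

In practice this runs asynchronously after the read returns, so it narrows the inconsistency window without adding read latency.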

We use the term ReadOne-WriteOne for the combination of techniques A, B, C and D – none of them provide strict consistency guarantees, but they are efficient enough to be used in practice as a self-contained approach.

Techniques A, B, C, and D together can be called ReadOne-WriteOne: they trade consistency for availability, read/write latency, and scalability.

A is the most extreme: it guarantees only the weakest consistency in exchange for the highest availability, updating just one arbitrary replica and relying entirely on anti-entropy to propagate the update.

B reduces propagation latency by asynchronously sending the update to all replicas. This improves propagation efficiency, at a slight cost to read/write latency and scalability, since the location of every replica must be known and contacted.

In B the update is merely sent asynchronously to all replicas, with no guarantee of success; if a replica fails, it can only be synchronized later by anti-entropy, so availability is not sacrificed.

C adds the hinted handoff technique to synchronize failed nodes more efficiently, improving persistence guarantees and replica convergence time.

Hinted handoff, plainly put, parks the update on the coordinator or any other node, watches the failed node, and automatically delivers the update once it recovers. “Hinted” means the handoff is implicit, transparent to the client. Whoever named it certainly has a knack for odd names.
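The hinted-handoff flow just described can be sketched as follows; the class and method names are illustrative, not taken from any real system:

```python
# Hinted handoff: writes aimed at a down replica are parked on a live
# node together with a "hint" naming the intended recipient, then
# replayed once that replica comes back.

class HintedHandoff:
    def __init__(self):
        self.hints = []  # parked writes: [(target_node, key, value)]

    def write(self, target, key, value, alive):
        """Deliver to target if it is alive, otherwise park a hint."""
        if target in alive:
            return ("delivered", target)
        self.hints.append((target, key, value))
        return ("hinted", target)

    def on_node_recovered(self, node):
        """Replay (and drop) every hint destined for the recovered node."""
        replayed = [h for h in self.hints if h[0] == node]
        self.hints = [h for h in self.hints if h[0] != node]
        return replayed

h = HintedHandoff()
# n2 is down, so the write is parked with a hint...
assert h.write("n2", "k", "v", alive={"n1", "n3"}) == ("hinted", "n2")
# ...and replayed automatically when n2 recovers.
assert h.on_node_recovered("n2") == [("n2", "k", "v")]
assert h.hints == []
```

The weakness noted in technique D follows directly from this sketch: if the node holding `self.hints` dies before replay, the parked updates are lost, which is why read repair is still needed.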

D: when reading any replica, request digests from the other replicas and reconcile any detected inconsistencies. This is a substantial improvement to read consistency, but at a noticeable cost to availability and read/write latency.

 

  • (E, Read Quorum Write Quorum) The strategies above are heuristic enhancements that decrease replicas convergence time. To provide guarantees beyond eventual consistency, one has to sacrifice availability and guarantee an overlap between read and write sets. A common generalization is to write synchronously W replicas instead of one and touch R replicas during reading.

    • First, this allows one to manage persistence guarantees setting W>1.
    • Second, this improves consistency for R+W>N because the synchronously written set will overlap with the set that is contacted during reading (in the figure above W=2, R=3, N=4), so the reader will touch at least one fresh replica and select it as a result. This guarantees consistency if read and write requests are issued sequentially (e.g. by one client, read-your-writes consistency), but does not guarantee global read-after-read consistency. Consider the example in the figure below to see why reads can be inconsistent. In this example R=2, W=2, N=3. However, writing of two replicas is not transactional, so clients can fetch both old and new values while the write is still in progress:

    • Different values of R and W allow one to trade write latency and persistence for read latency, and vice versa.
    • Concurrent writers can write to disjoint quorums if W<=N/2. Setting W>N/2 guarantees immediate conflict detection in Atomic Read-modify-write with rollbacks model.
    • Strictly speaking, this schema is not tolerant to network partitions, although it tolerates failures of separate nodes. In practice, heuristics like sloppy quorum [8] can be used to sacrifice consistency provided by a standard quorum schema in favor of availability in certain scenarios. 
      A “sloppy quorum” counts writes that land on temporary nodes via hinted handoff toward the successful write count, handling transient failures of a subset of nodes.
  • (F, Read All Write Quorum) The problem with read-after-read consistency can be alleviated by contacting all replicas during reading (reader can fetch data or check digests). This ensures that a new version of data becomes visible to the readers as soon as it appears on at least one node. Network partitions of course can lead to violation of this guarantee.

E and F can be called Read Quorum Write Quorum. This design strikes a good balance in the consistency/availability tradeoff: it remains highly available while still guaranteeing eventual consistency. Amazon Dynamo uses this scheme.

As long as R+W>N, a read is guaranteed to touch at least one up-to-date replica. Meanwhile, tuning the values of R and W trades write latency against read latency.
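The R+W>N overlap argument can be made concrete with a small sketch; the timestamped versions and the simple "contact the first R replicas" model are assumptions for illustration:

```python
# Quorum read sketch: with R + W > N, every read set of R replicas
# overlaps every write set of W replicas, so at least one contacted
# replica holds the latest write; the reader picks the freshest version.

def quorums_overlap(n, r, w):
    """True when R + W > N, i.e. read and write quorums must intersect."""
    return r + w > n

def quorum_read(contacted):
    """Return the freshest (timestamp, value) among contacted replicas."""
    return max(contacted)

# N=3, W=2: the write of (2, "new") reached two of three replicas.
replicas = [(2, "new"), (2, "new"), (1, "old")]

# R=2 satisfies R + W > N, so any two replicas include a fresh copy.
assert quorums_overlap(n=3, r=2, w=2)
assert quorum_read(replicas[:2]) == (2, "new")   # fresh, fresh
assert quorum_read(replicas[1:]) == (2, "new")   # fresh, stale -> fresh wins
```

With R=1, W=2 instead (R+W=N), a read could land entirely on the stale replica, which is exactly the eventual-consistency gap the quorum overlap closes.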

Setting W>N/2 allows write conflicts to be detected immediately and rolled back. Whether to resolve conflicts at write time is, however, a policy choice: Dynamo, to remain always writable, does not, and instead leaves conflicts for the client to resolve at read time.

For partition tolerance, the sloppy quorum technique can be applied.

Of course, under this strategy, a read issued while a write is still in progress may fail to see the new data (see the figure above). Read All Write Quorum addresses this problem.

 

  • (G, Master-Slave) The techniques above are often used to provide either the Atomic Writes or the Read-modify-write with Conflict Detection consistency level. To achieve a Conflict Prevention level, one has to use some kind of centralization or locking. The simplest strategy is master-slave asynchronous replication. All writes for a particular data item are routed to a central node that executes write operations sequentially. This makes the master a bottleneck, so it becomes crucial to partition data into independent shards to be scalable.
  • (H, Transactional Read Quorum Write Quorum and Read One Write All) The quorum approach can also be reinforced by transactional techniques to prevent write-write conflicts. A well-known approach is the two-phase commit protocol. However, two-phase commit is not perfectly reliable because coordinator failures can cause resource blocking. The PAXOS commit protocol [20, 21] is a more reliable alternative, but at the price of a performance penalty. A small step further and we end up with the Read One Write All approach, where writes update all replicas in a transactional fashion. This approach provides strong fault-tolerant consistency, but at the price of performance and availability.
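The two-phase-commit idea behind technique H can be sketched as follows; the participant behavior is a stand-in for illustration, not a real protocol implementation:

```python
# Two-phase commit sketch for transactional replica updates: the
# coordinator commits only if every replica votes yes during the
# prepare phase; a single "no" (or unreachable replica) aborts.

def two_phase_commit(participants, txn):
    """participants: callables that vote True/False on prepare(txn)."""
    # Phase 1 (prepare): collect votes from all replicas.
    votes = [p(txn) for p in participants]
    if all(votes):
        # Phase 2 (commit): every replica applies the write.
        return "commit"
    # Any dissenting vote aborts the whole transaction.
    return "abort"

always_yes = lambda txn: True
always_no = lambda txn: False

assert two_phase_commit([always_yes, always_yes], "set x=1") == "commit"
assert two_phase_commit([always_yes, always_no], "set x=1") == "abort"
```

The reliability weakness mentioned above is visible even in this sketch: if the coordinator running `two_phase_commit` crashes between the phases, prepared participants are left blocked, which is what PAXOS commit is designed to avoid.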

A through F prioritize high availability and achieve at most eventual consistency, while G and H are strong-consistency schemes.

The simple strong-consistency approach is master-slave: the master serializes everything and avoids conflicts, as in HBase and MongoDB, though the master remains a single point of failure.

For masterless, decentralized concurrent writes, strong consistency basically requires a two-phase commit protocol; to handle coordinator failures, the PAXOS commit protocol (which supports leader election) can be used instead.

The Read One Write All approach or the quorum approach can also be applied here.

 

Summary:

The author's treatment of consistency is quite thorough.

For read-write consistency, the solution is basically a quorum approach, such as Read Quorum Write Quorum. Quorum rather than All is chosen as a tradeoff between read/write latency and availability.

For write-write consistency, the core problem is read-modify-write: either take the availability-first Conflict Detection route or the consistency-first Conflict Prevention route.


This article was excerpted from 博客园 (cnblogs.com); original publication date: 2012-11-24.

