这章主要描述怎样设计HBase schema. 关于这个主题, 强烈推荐下面这个presentation, 写的非常清晰.
首先再次强调的是, Nosql无法替代SQL, 对于非bigdata, 毫无疑问SQL更加好用. 对于系统或场景, 我们不应该执着的想着用Nosql去替代SQL, 而是仅仅将SQL无法handle那部分big data(往往关系性不强)放到Nosql上.
比如在设计嵌套关系(多层一对多关系), 用sql就非常的麻烦, 查询的时候需要多表join, 而用HBase或者CouchDB就非常方便的设计和查询.
而对于复杂关系, 如多对多, 用Hbase设计就比较头疼, 因为HBase中要表示关系只有通过nested entities
由于HBase的设计核心为lars说的DDI, Denormalization, Duplication and Intelligent Keys
反范式化, 通过重复数据来避免join等复杂的关系操作, 所以如果数据会有频繁的update时, 使用这样的设计也会非常的麻烦.
所以在Nosql设计中的金科玉律是, Design for the questions, not the answers
首先要明白Nosql是一种妥协, 任意Nosql的设计只是正对当前场景的优化, 只是为了回答这个question.
忘掉完美, 你不可能提出一种放之四海而皆准的Nosql设计. 所以Nosql一定是一种过渡技术, 谈不上革命性, 谈不上美...
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012
内容很多, 也比较清楚, 就把最后总结的几点, Hbase设计pattern列出
0: The row key design is the single most important decision you will make.
row key设计是最为重要的, 要考虑哪些方面后面专门谈
1: Design for the questions, not the answers.
if you aren't 100% sure what the questions are going to be. Use a relational DB for that! Or a document database like CouchDB
当问题不清楚, 场景需求不清楚的时候, 需要一种比较通用的DB schema, 这样HBase不适合, 你还是选SQL或者CouchDB这样的document数据库
2: There are only two sizes of data: too big, and not too big.
如果数据量没有大到一定要使用HBase的时候, 请还是用sql
3: Be compact. You can squeeze a lot into a little space.
这个典型的例子看openTSDB, 怎样优化rowkey, 和column的设计的. 参考Lessons learned from OpenTSDB
4: Use row atomicity as a design tool.
HBase提供了row原子性保证, 这个很重要, 所以想要transaction操作, 就放到同一row里面
如果把数据拆分到两个table, 作为两行row存储, 就无法保证原子性.
5: Attributes can move into the row key.
Even if it's not "identifying" (part of the uniqueness of an entity), adding an attribute into the row key can make access more efficient in some cases.
这样可以有效降低存储粒度, Tall-Narrow的设计
6: If you nest entities, you can transactionally pre-aggregate data.
You can recalculate aggregates on write, or periodically with a map/reduce job.
Advanced HBase Schema Design
Berlin Buzzwords, June 2012
Lars George
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/hbase-lgeorge-bbuzz12.pdf
基本上就是将definite guide中的内容做个presentation
DDI
• Stands for Denormalization, Duplication and Intelligent Keys
• Needed to overcome shortcomings of architecture
• Denormalization -> replacement for JOINs
• Duplication -> Design for reads
• Intelligent Keys -> Implement indexing and sorting, optimize reads
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
http://www.slideshare.net/cloudera/4-opentsdb-hbasecon
Key Design
Concepts
Figure 9.1, “The cells are stored self-contained and moving pertinent information is nearly free”
可以参考Advanced HBase Schema Design, 主要说明HBase怎样从逻辑的数据模型转换到物理实现的
图中描述从逻辑模型按ColumnFamliy进行fold, 然后不同的ColumnFamliy存储到不同的storage file里面
最后一步shift很关键, 设计到key的design, 可以对rowkey和columnkey经行mashing(部分value), 以达到更好的读写效率.
Figure 9.2. From left to right the performance of the retrieval decreases
Rowkey决定Region, 而ColumnFamliy决定HFile
并且由于Hbase的多版本性, 不同的HFile也有不同的timestamp区间
所以在查询时, 当然rowkey是必须的, 决定所读取的region
和关系型数据库不同的, 查询时指定ColumnFamliy将大大提高查询效率, 因为决定了读取的HFile个数
如果可以指定timestamp, 也可以进一步对HFile经行过滤, 以提高查询效率.
对于column qualifier, value就需要把数据读出来进行逐个过滤, 相对比较低效
Tall-Narrow vs. Flat-Wide Tables
At this time you may ask yourself where and how you should store your data. The two choices are tall-narrow, and flat-wide.
The former is a table with few columns but many rows, while the latter has fewer rows, but many columns.
对于email inbox的例子,
Flat-wide的设计就是rowkey:userid, columnkey:emailid
12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi Lars, ..."
Tall-narrow的设计就是rowkey: userid-emailid
12345-5fc38314-e290-ae5da5fc375d : data : : 1307097848 : "Hi Lars, ..."
采用哪种设计, 各有利弊, Flat-wide可以保证原子性, 但是会导致row过大, Tall-narrow的设计更为常见一些.
Partial Key Scans
对于Tall-narrow设计, 很多内容mashing的key, 需要支持partial key scan, 不用给出具体userid-emailid, 我只给出userid来scan所有该user的message.
Pagination
Using the above approach of partial key scans it is possible to iterate over subsets of rows.
常见应用, 比如给出userid来scan所有该user的message, 需要分页, 我不可能一下显示所有的message
scan的时候, 除了start and stop key, 还需要给出offset and limit parameter
This approach works well for a low number of pages. If you were to page through thousands of them, then a different approach would be required.
个人认为, 象mysql一样, 你虽然设置offset, 但还是要从头开始读的, 所以page number大的时候就比较低效
解决方法很简单, 把rowkey改成user id-sequential id, 这样直接设start and stop key, 就可以解决分页, 比如userid-500, userid-550
Time Series
When dealing with stream processing of events, the most common use-case is time series data.
These could be coming from a sensor in a power grid, a stock exchange, or a monitoring system for computer systems. Their salient feature is that their row key represents the event time.
关键问题就是热点问题, 在一定rowkey range范围内都会被分配到一个region server, 怎么解决?
简单, 把row进行随机化, 随机化的方法如下, 程度不同
Salting
You can use a salting prefix to the key that guarantees a spread of all rows across all region servers.
0myrowkey-1, 1myrowkey-2, 2myrowkey-3, 0myrowkey-4, 1myrowkey-5,
keys 0myrowkey-1 and 0myrowkey-4 would be sent to one region, 1myrowkey-2 and 1myrowkey-5 are sent to another
Use-Case: Mozilla Socorro
The Mozilla organization has built a crash reporter – named Socorro[91] - for Firefox and Thunderbird, which stores all the pertinent details when a client asks its user to report a programanomaly.
Field Swap/Promotion
In case you already have a row key with more than one field, then you can swap those. If you have only the timestamp as the current row key then you need to promote another field from
the column keys, or even the value, into the row key.
Use-Case: OpenTSDB
The OpenTSDB[92] project provides a time series database used to store metrics about servers and services, gathered by external collection agents.<metric-id><base-timestamp>...
Randomization
Using a hash function like MD5 is giving you a random distribution of the key across all available region servers.
Using the salted or promoted-field keys can strike a good balance of distribution for write performance, and sequential subsets of keys for read performance. If you are only doing random reads, then it makes most sense to use random keys: this will avoid creating region hot spots.
总结, 随着rowkey随机性不断增强, 写效率不断提高, 但是sequential read的效率不断降低, 所以需要balance
Time Ordered Relations
Since all of the columns are sorted per column family you can treat this sorting as a replacement for a secondary index, as available in RDBMSs.
Multiple secondary indexes can be emulated by using multiple column families - although that is not the recommended way of designing a schema.
But for a small number of indexes this might be what you need.
因为column family里面的column是排序的, 所以采用Flat-wide设计时, 我们可以利用columnkey来做secondary index.
Consider the earlier example of the user inbox, which stores all of the emails of a user in a single row.
你如果要按收到的时间进行排序, 就把收到时间作为columnkey, 那么email会自动按时间排序存储.
如果你还要按主题排序, 那么就增加一个column family, 把subject作为column key.
例子, 一共有2个column family
Data family,
12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and ..."
12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It ..."
12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are ..."
From index family,
12345 : index : idx-from-asc-paul@foobar.com : 1307103848 :dcbee495-6d5e...
12345 : index : idx-from-asc-pete@foobar.com : 1307097848 :5fc38314-e290...
12345 : index : idx-from-asc-sales@ignore.me : 1307101848 :cc6775b3-f249...
Subject index family,
12345 : index : idx-subject-desc-\xa8\x90\x8d\x93\x9b\xde :1307103848 : dcbee495-6d5e-6ed48124632c
12345 : index : idx-subject-desc-\xb7\x9a\x93\x93\x90\xd3 :1307099848 : 725aae5f-d72e-f90f3f070419
这儿他把所有index放到一个family里面, 用前缀区分, 我觉得不同的index用不同的family name更清晰一些.
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce, page 157, 有图
Secondary Indexes
Although HBase has no native support for secondary indexes there are use-cases that need them. The requirements are usually that you can lookup a cell with not just the primary coordinates - the row key, column family name, and qualifier - but an alternative one. In addition, you can scan a range of rows from the main table but ordered by the secondary index.
Similar to an index in RDBMSs, secondary indexes store a mapping between the new coordinates and the existing ones. Here is a list of possible solutions:
HBase本身并不支持secondary indexes, 但是有很多场景确实需要, 光靠row key, famliy key不够, 所以就需要实现secondary indexes.
其实上面讲的方法也可以用于代替index, 但是问题是会导致row过大, 所以需要更加通用的方案.
Client Managed
Moving the responsibility completely into the application layer, this approach is typically a combination of a data table and one (or more) lookup/mapping table.
Whenever the code writes into the data table it also updates the lookup tables. Reading data is then either a direct lookup in the main table, or, in case the key is from a secondary index, a lookup of the main row key and then retrieve the data in a second operation.
There are advantages and disadvantages to this approach.
First, since the entire logic is handled in the client code, you have all the freedom to map the keys exactly the way needed.
The list of shortcomings is longer though: since you have no cross-row atomicity, for example, in the form of transactions, you cannot guarantee consistency of the main and dependent tables.
这个方案将data和index放到两个table, 自然无法保证原子性, 会导致data和index的不一致.
Indexed-Transactional HBase
A different solution is offered by the open-source Indexed-Transactional HBase (in short ITHBase) project.
The core extension is the addition of transactions, which are used to guarantee that all secondary index updates are consistent.
The drawback is that it may not support the latest version of HBase available as it is not tied to its release cycle. It also adds a considerable amount of synchronization overhead that results
in a decreased performance, so you need to benchmark carefully.
Indexed HBase
Another solution to add secondary indexes to HBase is Indexed HBase (in short IHBase).
It forfeits the use of separate tables for each index but maintains them purely in memory.
These are generated when a region is opened for the first time, or when a memstore is flushed to disk - involving an entire regions scan to build the index.
Only the on-disk information is indexed, the in-memory data is searched as-is
这个的缺点很明显, 空间换时间, 所以比较耗费内存
Search Integration
A very common use-case is to combine the arbitrary nature of keys with a search based lookup, often backed by a full search engine integration.
Client Managed
A prominent implementation of a client managed solution is the Facebook inbox search.
The schema is built roughly like this:
• Every row is a single inbox, i.e., every user has a single row in the search table,
• the columns are the terms indexed from the messages, the versions are the message IDs,
• the values contain additional information, such as the position of the term in the document.
column key就是term, column value是message id list, positions, 这样就可以实现单个inbox内的term search
Lucene
Using Lucene - or a derived solution - separately from HBase involves building the index using a MapReduce job.
An externally hosted project[99] provides the BuildTableIndex class, which was formerly part of the contrib modules shipping with HBase.
It scans an entire table and builds the Lucene indexes, which ultimately end up as directories on HDFS – their count depends on the number of reducers used. These indexes
can be downloaded to a Lucene based server, and accessed locally using, for example, a MultiSearcher class, provided by Lucene.
这个单纯就是把数据用mapreduce转化为lucene index文件, 然后可以通过lucene查到rowkey, 然后再去HBase里面获取data
考虑到效率, 也可以导index的时候, 把需要的数据也导到index文件里面, 这样就不用查HBase.
HBasene
The approach chosen by HBasene is to build an entire search index directly inside HBase, while supporting the well-established Lucene API to its users. The schema used
stores each document field, aka term, in a separate row, with the documents containing the term stored as columns inside that row.
这个就是类似client managed solution, 完全基于HBase实现一套全文检索, 但是好的是, 他提供well-established Lucene API, 用户好像是在用lucene.
Transactions
Transactions, offering ACID compliance across more than one row, and more than one table. This is necessary in lieu of a matching schema pattern in HBase. For example updating the main data table, and the secondary index table requires transactions to be reliably consistent.
Often transactions are not needed as normalized data schemas can be folded into a single table, and row, design that does not need the overhead of a distributed transaction support. If you cannot do without this extra control, here are a few possible solutions:
对于HBase, 最好的transaction方案就是, 把他们放到一行row里面
如果不行, 你还想要transaction, 就用下面的方案
Transactional HBase
The Indexed Transactional HBase project comes with a set of extended classes that replace the default client- and server-side ones, while adding support for transactions across row and table boundaries.
ZooKeeper
HBase requires a ZooKeeper ensemble to be present, acting as the seed, or bootstrap mechanism, for cluster setup.
There are templates, or recipes, available that show how ZooKeeper can also be used as a transaction control back-end.
For example, the Cages project offers an abstraction to implement locks across multiple resources, and is scheduled to add a specialized transactions class - using ZooKeeper as the distributed
coordination system.
Bloom Filters
简单的说, HBase文件HFile是按columnfamily组织的, 而且一共columnfamily可能需要存储成很多个HFile, 如果我要取某row的这个columnfamily的数据, 怎么办, 简单的办法, 就是把所有HFile都打开查一遍, 这个效率就很低. BloomFilter, 可以用于对每个HFile建一个rowkey的filter, 在打开HFile前我们可以先filter一下, 该文件是否可能包含该row. 这样大大减少打开文件的个数, 效率会大大的提高.
文中还提到一种row+column Bloom filter, 这种只有在row的columnfamily很大, 经常要多个HFile来存储一个row的columnfamily数据时, 才有用.
Versioning
Implicit Versioning
It was pointed out before that you should ensure that the clock on your servers is synchronized. One of the issue is that when you store data in multiple rows across different servers, using the implicit timestamps, you end up with completely different times set.
HBase默认会已server time作为版本时间戳, 但问题在于, 各个server的时间可能不同步, 这样会导致问题.
This can be avoided by setting an agreed, or shared, timestamp when storing these values.
The put operation allows you to set a client-side timestamp that is used instead, therefore overriding the server time.
Obviously, the better approach is to rely on the servers doing this work for you, but you might be required to use this approach in some circumstances.
当然最好是, server之间可以自动同步时间, 如果不行, 我们可以在Put的时候指定一个client-side timestamp , 这样就可以避免不同步的问题.
Another issue with servers not being aligned by time is exposed by region splits.
Assume you have saved a value on a server that is one hour ahead all other servers in the cluster, using the implicit timestamp of the server. Ten minutes later the region is split and the half with your update is moved to another server. Five minutes later you are then inserting a new value for the same column, again using the automatic server time. The new value is now considered older
then the initial one, because the first version has a timestamp one hour ahead of the current server's time. If you do a standard get call to retrieve the newest version of the value, you would get the one that was stored first.
服务器之间时间不同步的另一个issue, 由region splits导致
时间快的server上的region, 在split后, 被分配到较慢的server上, 当新数据更新时, 会使用较慢server的时间戳, 这样会导致更新次序混乱.
Custom Versioning
Since you can specify your own timestamp values - and therefore create your own versioning scheme - while overriding the server-side timestamp generation based on the synchronized server time, you are free to not use epoch based versions at all.
For example, you could use the timestamp with a global number generator[104] that supplies you with ever increasing, sequential numbers starting at "1". Every time you insert a new value you
retrieve a new number and use that when calling the put function.
You must do this for every put operation, or the server will insert an epoch based timestamp instead. There is flag in the table or column descriptors that indicate your use of custom timestamp values, or in other words your own versioning. If you fail to set the value it is silently replaced with the server timestamp.
本文章摘自博客园,原文发布日期:2012-10-24