HBase-TDG ClientAPI The Basics

table.delete(delete);
table.close();

Atomic Compare-and-Delete

You have seen in the section called “Atomic Compare-and-Set” how to use an atomic, conditional operation to insert data into a table. There is an equivalent call for deletes that gives you access to server-side, read-and-modify functionality:

boolean checkAndDelete(byte[] row, byte[] family, byte[] qualifier, byte[] value, Delete delete) throws IOException

See checkAndPut; the principle is the same.
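A minimal sketch of how the call might be used, assuming an existing HTable instance named table and the book-era client API; the row, family, qualifier, and expected value below are made-up examples:

Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumns(Bytes.toBytes("colfam1"), Bytes.toBytes("qual3"));

// The delete is only applied if colfam1:qual3 currently holds "val3".
boolean res1 = table.checkAndDelete(Bytes.toBytes("row1"),
    Bytes.toBytes("colfam1"), Bytes.toBytes("qual3"),
    Bytes.toBytes("val3"), delete);
System.out.println("Delete applied: " + res1);

// As with checkAndPut, a null expected value checks that the cell does not exist.
boolean res2 = table.checkAndDelete(Bytes.toBytes("row1"),
    Bytes.toBytes("colfam1"), Bytes.toBytes("qual3"),
    null, delete);
System.out.println("Delete applied: " + res2);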

Batch Operations

You have seen how you can add, retrieve, and remove data from a table using single or list-based operations. In this section we will look at API calls that batch different operations across multiple rows.
All of the list-based put, get, and delete operations shown earlier are, in fact, implemented on top of batch.

private final static byte[] ROW1 = Bytes.toBytes("row1");
private final static byte[] ROW2 = Bytes.toBytes("row2");
private final static byte[] COLFAM1 = Bytes.toBytes("colfam1");
private final static byte[] COLFAM2 = Bytes.toBytes("colfam2");
private final static byte[] QUAL1 = Bytes.toBytes("qual1");
private final static byte[] QUAL2 = Bytes.toBytes("qual2");

List<Row> batch = new ArrayList<Row>();   // Put, Get, and Delete all implement Row

Put put = new Put(ROW2);
put.add(COLFAM2, QUAL1, Bytes.toBytes("val5"));
batch.add(put);

Get get1 = new Get(ROW1);
get1.addColumn(COLFAM1, QUAL1);
batch.add(get1);

Delete delete = new Delete(ROW1);
delete.deleteColumns(COLFAM1, QUAL2);
batch.add(delete);

Get get2 = new Get(ROW2);
get2.addFamily(Bytes.toBytes("BOGUS"));
batch.add(get2); // fails: column family BOGUS does not exist

Object[] results = new Object[batch.size()];
try {
    table.batch(batch, results);
} catch (Exception e) {
    System.err.println("Error: " + e);
}
for (int i = 0; i < results.length; i++) {
    System.out.println("Result[" + i + "]: " + results[i]);
}

A simple example: Put, Get, and Delete operations on different rows can all be handled in one batch call.
Note that a Put and a Delete for the same row must not be placed in the same batch; also, batch operations do not use the client-side write buffer but are sent directly to the servers.

Be aware that you should not mix a Delete and Put operation for the same row in one batch call. The operations are applied in an order chosen for best performance, which may differ from the order in which you added them, and this can cause unpredictable results. In some cases you may see fluctuating results due to race conditions.

 

Write Buffer and Batching 
When you use the batch() functionality, the included Put instances will not be buffered using the client-side write buffer. The batch() calls are synchronous and send the operations directly to the servers; there is no delay or other intermediate processing. This is obviously different from the put() calls, so choose carefully which one you want to use.
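A short sketch contrasting the two paths, assuming an HTable instance named table and the constants from the batch example above (setAutoFlush() and flushCommits() are the book-era method names):

// Buffered path: with auto-flush disabled, put() only fills the client-side write buffer.
table.setAutoFlush(false);
Put p1 = new Put(ROW1);
p1.add(COLFAM1, QUAL1, Bytes.toBytes("val1"));
table.put(p1);              // buffered locally, nothing sent yet
table.flushCommits();       // now the buffered puts are shipped to the servers

// batch() path: the operations are sent synchronously as part of this single call.
List<Row> ops = new ArrayList<Row>();
Put p2 = new Put(ROW2);
p2.add(COLFAM1, QUAL1, Bytes.toBytes("val2"));
ops.add(p2);
Object[] res = new Object[ops.size()];
table.batch(ops, res);      // bypasses the write buffer entirely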

 

Row Locks

Mutating operations - like put(), delete(), checkAndPut(), and so on - are executed exclusively, which means in a serial fashion, for each row, to guarantee row-level atomicity.
The region servers provide a row lock feature ensuring that only a client holding the matching lock can modify a row. In practice, though, most client applications do not acquire an explicit lock but rely on the mechanism that guards each operation separately.

Row-level atomicity in HBase is guaranteed by row locks. Although the system acquires locks automatically for each operation, the client API also exposes explicit lock calls.
When should you use them? As stated above:
You should avoid using row locks whenever possible.
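For completeness, a hedged sketch of the explicit lock calls as they appear in the book-era API (explicit row locks were removed from the client API in later HBase releases, which is another reason to avoid them):

RowLock lock = table.lockRow(ROW1);   // acquire an explicit lock for the row
try {
    Put put = new Put(ROW1, lock);    // the operation carries the lock explicitly
    put.add(COLFAM1, QUAL1, Bytes.toBytes("val1"));
    table.put(put);
} finally {
    table.unlockRow(lock);            // always release the lock
}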

 

Scans

After the basic CRUD-type operations, you will now be introduced to scans, a technique akin to cursors[55] in database systems, making use of the underlying sequential, sorted storage layout that HBase provides.

Introduction

Using the scan operations is very similar to using the get() methods.
And again, in symmetry to all the other functions, there is a supporting class named Scan. But since scans are similar to iterators, there is no scan() call; instead you use getScanner(), which returns the actual scanner instance you need to iterate over. The available methods are:

ResultScanner getScanner(Scan scan) throws IOException
ResultScanner getScanner(byte[] family) throws IOException
ResultScanner getScanner(byte[] family, byte[] qualifier) throws IOException

The Scan class defines the scan criteria. getScanner() always operates on a Scan instance (for the latter two overloads above, the system creates the Scan object for you) and returns a ResultScanner, which acts as an iterator: you call next() to retrieve the data.

The Scan class offers the following constructors:
Scan()
Scan(byte[] startRow, Filter filter)
Scan(byte[] startRow)
Scan(byte[] startRow, byte[] stopRow)

The start row is always inclusive, while the stop row is exclusive. This is often expressed as [startRow, stopRow) in interval notation.

Like Get, Scan supports the following additional narrowing criteria:

Scan addFamily(byte [] family)
Scan addColumn(byte[] family, byte[] qualifier)
Scan setTimeRange(long minStamp, long maxStamp) throws IOException
Scan setTimeStamp(long timestamp)
Scan setMaxVersions()
Scan setMaxVersions(int maxVersions)
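Putting the constructors and narrowing methods together, a brief sketch; the row keys, family, and qualifier are made-up examples, and table is assumed to be an existing HTable:

Scan scan = new Scan(Bytes.toBytes("row-10"), Bytes.toBytes("row-100"));  // [startRow, stopRow)
scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));         // only this column
scan.setMaxVersions(1);                                                   // latest version only
ResultScanner scanner = table.getScanner(scan);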

 

The ResultScanner Class

Scans do not ship all the matching rows in one RPC to the client but instead do this on a row basis. This obviously makes sense as rows could be very large and sending thousands, and most likely more, of them in one call would use up too many resources, and take a long time.

The ResultScanner converts the scan into a get-like operation, wrapping the Result instance for each row into an iterator functionality. It has a few methods of its own:

Result next() throws IOException
Result[] next(int nbRows) throws IOException
void close() // release scanner

Make sure you release a scanner instance as timely as possible. An open scanner holds quite a few resources on the server side, which could accumulate and take up a large amount of heap space.
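A minimal sketch of the usual iteration pattern, reusing the scan built above; ResultScanner is also Iterable, so the for-each loop below is just a wrapper around next():

ResultScanner scanner = table.getScanner(scan);
try {
    for (Result res : scanner) {      // each iteration corresponds to one next() call
        System.out.println(res);
    }
} finally {
    scanner.close();                  // release the server-side resources promptly
}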

 

Caching vs. Batching

So far each call to next() will be a separate RPC for each row - even when you use the next(int nbRows) method, because it is nothing else but a client-side loop over next() calls. Obviously this is not very good for performance when dealing with small cells, so it would make sense to fetch more than one row per RPC if possible. This is called scanner caching, and it is disabled by default.

If every next() call costs one RPC, efficiency is poor, especially when rows are small. Scanner caching addresses this by fetching multiple rows per RPC. The value is configurable and should be chosen with moderation: setting it too high leads to long call times and memory pressure on the client.
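A sketch of the book-era ways to enable it (the value 50 is an arbitrary example):

Scan scan = new Scan();
scan.setCaching(50);            // per-scan: fetch up to 50 rows per RPC
// table.setScannerCaching(50); // per-table default on the HTable instance
// hbase.client.scanner.caching in hbase-site.xml sets the client-wide default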

 

So far you have learned to use the client-side scanner caching to make better use of bulk transfers between your client application and the remote region servers.
There is an issue, though, that was mentioned in passing earlier: very large rows. Those potentially do not fit into the memory of the client process. HBase and its client API have an answer for that: batching. As opposed to caching, which operates on the row level, batching works on the column level instead. It controls how many columns are retrieved for every call to any of the next() functions provided by the ResultScanner instance. For example, setting the scan to use setBatch(5) would return five columns per Result instance.

Batching is the opposite: it is meant for very large rows, where a single row is read in several chunks, column by column.
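A sketch combining the two settings (the numbers are arbitrary; for a row with, say, 20 matching columns, setBatch(5) would split it across four Result instances):

Scan scan = new Scan();
scan.setCaching(10);                 // row-oriented: how many rows per RPC
scan.setBatch(5);                    // column-oriented: at most 5 columns per Result
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    System.out.println("Columns in this Result: " + result.size());
}
scanner.close();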


