hbase scan中匪夷所思的SingleColumnValueFilter和COLUMNS

假设关系型数据库中存在如下表(test_row_key)

------------------------------------------

TIME_ID AREA_ID INDEX1

..........................................

201206 730 201206730

201206 731 201206731

201207 730 201207730

201207 731 201207731

201207 732 201207732

让我们将数据导入hbase(测试环境Version 0.94.3, r1408904)

问题重现

// select TIME_ID, AREA_ID, INDEX1 from test_row_key where TIME_ID = 201206 and AREA_ID = 730

// 应该返回1行数据

scan 'test_row_key', {COLUMNS => ['cf:TIME_ID','cf:AREA_ID','cf:INDEX1'], FILTER => "(SingleColumnValueFilter('cf', 'TIME_ID', =, 'binary:201206')) AND (SingleColumnValueFilter('cf', 'AREA_ID', =, 'binary:730'))" }

// select AREA_ID, INDEX1 from test_row_key where TIME_ID = 201206 and AREA_ID = 730

// 返回2行数据??

scan 'test_row_key', {COLUMNS => ['cf:AREA_ID','cf:INDEX1'], FILTER => "(SingleColumnValueFilter('cf', 'TIME_ID', =, 'binary:201206')) AND (SingleColumnValueFilter('cf', 'AREA_ID', =, 'binary:730'))" }

// select INDEX1 from test_row_key where TIME_ID = 201206 and AREA_ID = 730

// 返回2行数据??

scan 'test_row_key', {COLUMNS => ['cf:INDEX1'], FILTER => "(SingleColumnValueFilter('cf', 'TIME_ID', =, 'binary:201206')) AND (SingleColumnValueFilter('cf', 'AREA_ID', =, 'binary:730'))" }

结论

只有当COLUMNS中包含SingleColumnValueFilter提到的字段时, 该SingleColumnValueFilter才有效的

改为SingleColumnValueExcludeFilter, 这时只会返回除开AREA_ID和TIME_ID的字段, 但记录数是对的

HBase JIRA中有人对这种匪夷所思的行为产生的质疑

只有当COLUMNS中包含SingleColumnValueFilter提到的字段时, 该SingleColumnValueFilter才有效的

SingleColumnValueFiltershould be able to find the column value even when it's not specifically addedas input on the scan.

Wheneverapplying a SingleColumnValueFilter to a Scan that has specific columns as it'sinput (but not the column to be checked in the Filter), the Filter won't beable to find the value that it should be checking.

Forexample, let's say we want to do a scan, but we only need COLUMN_2 columns.Furthermore, we only want rows that have a specific value for COLUMN_1. Usingthe following code won't do the trick:
Scan scan = new Scan();
scan.addColumn(FAMILY,COLUMN_2);
SingleColumnValueFilter filter = new SingleColumnValueFilter(FAMILY, COLUMN_1,CompareOp.EQUAL, TEST_VALUE);
filter.setFilterIfMissing(true);
scan.setFilter(filter);

However,we can make it work when specifically also adding the tested column as an inputcolumn:
scan.addColumn(FAMILY,COLUMN_1);

Is this by design?Personally I think that adding a filter with columns tests should not botherthe user to check that it's also on the input. It is prone to bugs.

有人回复说:

Itsby design. High-level in the Scan object you say what you are interested in andthen the filter works against the Scan specification.

作者修改了SingleColumnValueFilter.java的doc注释

Whenusing this filter on a Scan with specified inputs, the column to be testedshould also be added as input (otherwise the filter will regard the column asmissing).

于是作者写了一个新的SingleColumnValueExcludeFilter,用于做单列的值过滤, 但是不查询出这个列的值

Add a new Filter thatchecks a single column value but does not emit it.

A Filter that checks a single column value,but does not emit the tested column. This will enable a performance boost overSingleColumnValueFilter, if the tested column value is not actually needed asinput (besides for the filtering itself).

时间： 2024-09-28 09:42:13

hbase scan中匪夷所思的SingleColumnValueFilter和COLUMNS

hbase scan中匪夷所思的SingleColumnValueFilter和COLUMNS的相关文章

hive数据批量导入hbase过程中遇到分区文件不存在异常

Java在HBase数据库中创建一个数据表的实例

HBase数据库中HTable和HTablePool使用注意事项

Hbase在小米中的应用与实践

HBase在数据统计应用中的使用总结

HBase在数据统计应用中的使用心得

HBase客户端实现并行扫描

HBase shell基础和常用命令详解

基于C#+Thrift操作HBase实践