Greenplum 空间(GIS)数据检索 B-Tree & GiST 索引实践 - 阿里云HybridDB for PostgreSQL最佳实践

背景

气象数据、地震数据、室内定位、室外定位、手机、车联网、还有我们最喜欢的“左划不喜欢、右划喜欢”，越来越多的位置属性的数据。将来会越来越多。

基于GIS的数据分析、OLTP业务也越来越受到决策者的青睐，例如商场的选址决策，O2O的广告营销等。有很多基于多边形、时间、用户对象属性过滤的需求。

阿里云HybridDB for PostgreSQL是一个支持GIS数据类型处理的MPP分布式数据库，支持海量的GIS数据的存储和分析处理。

支持三种索引接口：

bitmap

btree

gist

三种索引的原理请参考

《PostgreSQL 9种索引的原理和应用场景》

在Greenplum中，GiST索引比较重（是R-Tree结构的空间索引），但是它支持几乎所有的空间搜索，但是overhead相比btree也更大一些：

1、平面、三维、多维对象几何相交、不相交、相邻。

2、平面、三维、多维对象的方位判断（相交或严格在左边、右边、上边、下边），类似数值的大于、小于、大于等于、小于等于。

3、平面、三维、多维对象包含另一个对象

4、平面、三维、多维对象等于另一个对象

5、平面、三维、多维对象与另一个对象的（边、最近、中心点的）距离，按距离排序输出满足条件的行，输出距离在XX以内的行。

PostgreSQL 比Greenplum支持的索引接口更多，比如BRIN是一个很好的块级索引，对于基于多边形的群体分析非常有效。但是Greenplum中没有BRIN。

《PostGIS空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践》

那么Greenplum中能否有类似BRIN块级索引的功能的，在阿里云HybridDB for PostgreSQL中是有这个功能的：

《解密上帝之手 - 阿里云HDB for PostgreSQL数据库metascan特性(存储级、块级、batch级过滤与数据编排)》

但是我们先来说说社区版本的Greenplum，能否用b-tree替代overhead更大的GiST？

答案必须是肯定的。

精度不是那么高的GEOHASH

《geohash vs PostGIS》

geohash的精度随着它的位数变化，例如4位时，它是一个20公里大的BOX。（也就是说一个点，被模糊化为1个方圆20公里的BOX。够粗糙的，但是我们就要用这个粗糙来过滤数据。）

GEOHASH B-Tree一重过滤

使用geohash一重过滤，得到一个大范围的BOX内的数据。

由于GEOHASH是text类型，支持b-tree索引。

当POINT在某个范围(box)内时，它的prefix会和这个box的prefix重叠。

查询某些纵横20公里内的点。  

select * from table where st_geohash(pos,15) ~ '^abcd';

st_distancespheroid二重过滤

一重过滤会得到一个较大范围，例如我们要查询5公里内的人群，实际上返回的是20公里内的人群。

二重过滤，使用距离函数，计算真实的距离，过滤不符合条件的记录。

Name  

ST_DistanceSpheroid — Returns the minimum distance between two lon/lat geometries given a particular spheroid. PostGIS versions prior to 1.5 only support points.  

Synopsis  

float ST_DistanceSpheroid(geometry geomlonlatA, geometry geomlonlatB, spheroid measurement_spheroid);

性能评测

1、建表，存储geometry类型的POINT

postgres=# create unlogged table test(id int, pos geometry);
CREATE TABLE

2、写入5000万测试数据

postgres=# insert into test select generate_series(1,50000000), st_setsrid(st_makepoint(random()*360-180, random()*180-90), 4326) ;

3、创建geohash表达式索引

create index idx_test_pos on test using btree(st_geohash(pos, 11));

4、查询ST_Point(100,90)方圆5公里内的点，我们需要采用4位精度（一重过滤收缩到20公里）。

postgresql 语法：

postgres=# explain (analyze,verbose,timing,costs,buffers)
select *,
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_test_pos on public.test  (cost=0.56..2532.74 rows=5000 width=44) (actual time=0.224..7.279 rows=46 loops=1)
   Output: id, pos, st_distancespheroid(pos, '0101000020E610000000000000000059400000000000805640'::geometry, 'SPHEROID("WGS84",6378137,298.257223562997)'::spheroid)
   Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))
   Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)
   Buffers: shared hit=50
 Planning time: 0.194 ms
 Execution time: 7.311 ms
(7 rows)  

postgres=# explain (analyze,verbose,timing,costs,buffers)
select *
from test
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_test_pos on public.test  (cost=0.56..32.74 rows=5000 width=36) (actual time=0.031..0.144 rows=46 loops=1)
   Output: id, pos
   Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))
   Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)
   Buffers: shared hit=50
 Planning time: 0.173 ms
 Execution time: 0.176 ms
(7 rows)  

postgres=#  select id,st_astext(pos),
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test where
st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
    id    |                st_astext                 | st_distancespheroid
----------+------------------------------------------+---------------------
 25173094 | POINT(99.8623717390001 89.8564164061099) |    16037.4226616061
 15256453 | POINT(99.886408187449 89.8609424661845)  |    15531.8890312462
 42820155 | POINT(99.900087621063 89.8426755331457)  |    17572.1953385406
 11941582 | POINT(99.9024446122348 89.8641818203032) |    15170.0726995738
 27302558 | POINT(99.8545220866799 89.9049358069897) |    10618.0979324084
 12748345 | POINT(99.8604090325534 89.9055569060147) |    10548.7249125223
 23691474 | POINT(99.9338417127728 89.8551008664072) |    16184.3605168626
  2789094 | POINT(99.9521030299366 89.8572400957346) |    15945.4214953106
 13014164 | POINT(99.9811399541795 89.8462957609445) |    17167.8377188967
 30480729 | POINT(99.9596883170307 89.899117089808)  |    11268.0135911417
 23922308 | POINT(99.9793708696961 89.8840416502208) |    12951.8493711937
 37865268 | POINT(99.9939758330584 89.8977239336818) |    11423.6207380738
 44349686 | POINT(99.8703192919493 89.9472489114851) |    5891.97898548146
 24627681 | POINT(99.8947213590145 89.9413335509598) |    6552.68913675189
 29549671 | POINT(99.8913882113993 89.9539441242814) |    5144.16402989278
 19377493 | POINT(99.9157874286175 89.935091631487)  |    7249.87395455637
 15823642 | POINT(99.8874770477414 89.9763751029968) |    2638.75876144951
 35541120 | POINT(99.9283183738589 89.9659094586968) |    3807.70821893377
  5136659 | POINT(99.9655172601342 89.9136384855956) |    9646.06115073887
 35330398 | POINT(99.9899900704622 89.9316252116114) |     7637.0521799929
 18139347 | POINT(99.9744066037238 89.9915046896785) |    948.875017323096
 39258118 | POINT(99.9836320616305 89.9618403799832) |    4262.19981177446
 30534377 | POINT(99.9944747239351 89.9620758276433) |    4235.90172575841
  5400987 | POINT(99.9796806648374 89.9704348482192) |     3302.2494557095
 27994107 | POINT(100.035087894648 89.835905469954)  |    18328.3705781338
 11267788 | POINT(100.044458862394 89.8578487057239) |    15877.4434278569
 26289436 | POINT(100.046943258494 89.8645990714431) |    15123.4682619251
 44089622 | POINT(100.072270520031 89.8646929487586) |    15112.9827315541
  3894728 | POINT(100.086387153715 89.8635132797062) |    15244.7446550829
 35587898 | POINT(100.044751726091 89.8876157775521) |    12552.6408821271
  2297611 | POINT(100.070957243443 89.8893462214619) |    12359.3607228444
 31932031 | POINT(100.064905174077 89.8981667496264) |    11374.1608646003
 22491539 | POINT(100.117400866002 89.8279081285)    |    19221.6253932898
 22237545 | POINT(100.126622300595 89.8635710310191) |    15238.2941814952
 36179231 | POINT(100.174575503916 89.8459538631141) |    17206.0256453898
  7990607 | POINT(100.183915123343 89.8694989643991) |     14576.179748045
 42321853 | POINT(100.080158896744 89.9232382792979) |    8573.82201116562
 11478472 | POINT(100.07952991873 89.929187502712)   |    7909.32958390792
 11105634 | POINT(100.022092089057 89.9607635568827) |    4382.47446858912
  2532812 | POINT(100.037098880857 89.9638177547604) |    4041.33895476991
 29379668 | POINT(100.023544505239 89.9718571733683) |     3143.3842999806
 29884935 | POINT(100.043103173375 89.9965709634125) |    383.002742505678
  1076756 | POINT(100.098143275827 89.9763003364205) |    2647.10973787711
 22469249 | POINT(100.096434876323 89.9840644095093) |    1779.90951806841
 14234913 | POINT(100.1863790676 89.9419186916202)   |    6487.33244847654
 33483358 | POINT(100.13643046841 89.9627213180065)  |    4163.80433863277
(46 rows)

Greenplum 语法（因为Greenplum版本较老，不支持正则表达式的索引检索，我们需要人为转换一下）:

select id,st_astext(pos),
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test where
st_geohash(pos,11) >= st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)
and
st_geohash(pos,11) <
substring(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4),1, length(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4))-1 )
||
chr(ascii(substring(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4), '(.)$'))+1);  

    id    |                st_astext                 | st_distancespheroid
----------+------------------------------------------+---------------------
 25173094 | POINT(99.8623717390001 89.8564164061099) |    16037.4226616061
 15256453 | POINT(99.886408187449 89.8609424661845)  |    15531.8890312462
 42820155 | POINT(99.900087621063 89.8426755331457)  |    17572.1953385406
 11941582 | POINT(99.9024446122348 89.8641818203032) |    15170.0726995738
 27302558 | POINT(99.8545220866799 89.9049358069897) |    10618.0979324084
 12748345 | POINT(99.8604090325534 89.9055569060147) |    10548.7249125223
 23691474 | POINT(99.9338417127728 89.8551008664072) |    16184.3605168626
  2789094 | POINT(99.9521030299366 89.8572400957346) |    15945.4214953106
 13014164 | POINT(99.9811399541795 89.8462957609445) |    17167.8377188967
 30480729 | POINT(99.9596883170307 89.899117089808)  |    11268.0135911417
 23922308 | POINT(99.9793708696961 89.8840416502208) |    12951.8493711937
 37865268 | POINT(99.9939758330584 89.8977239336818) |    11423.6207380738
 44349686 | POINT(99.8703192919493 89.9472489114851) |    5891.97898548146
 24627681 | POINT(99.8947213590145 89.9413335509598) |    6552.68913675189
 29549671 | POINT(99.8913882113993 89.9539441242814) |    5144.16402989278
 19377493 | POINT(99.9157874286175 89.935091631487)  |    7249.87395455637
 15823642 | POINT(99.8874770477414 89.9763751029968) |    2638.75876144951
 35541120 | POINT(99.9283183738589 89.9659094586968) |    3807.70821893377
  5136659 | POINT(99.9655172601342 89.9136384855956) |    9646.06115073887
 35330398 | POINT(99.9899900704622 89.9316252116114) |     7637.0521799929
 18139347 | POINT(99.9744066037238 89.9915046896785) |    948.875017323096
 39258118 | POINT(99.9836320616305 89.9618403799832) |    4262.19981177446
 30534377 | POINT(99.9944747239351 89.9620758276433) |    4235.90172575841
  5400987 | POINT(99.9796806648374 89.9704348482192) |     3302.2494557095
 27994107 | POINT(100.035087894648 89.835905469954)  |    18328.3705781338
 11267788 | POINT(100.044458862394 89.8578487057239) |    15877.4434278569
 26289436 | POINT(100.046943258494 89.8645990714431) |    15123.4682619251
 44089622 | POINT(100.072270520031 89.8646929487586) |    15112.9827315541
  3894728 | POINT(100.086387153715 89.8635132797062) |    15244.7446550829
 35587898 | POINT(100.044751726091 89.8876157775521) |    12552.6408821271
  2297611 | POINT(100.070957243443 89.8893462214619) |    12359.3607228444
 31932031 | POINT(100.064905174077 89.8981667496264) |    11374.1608646003
 22491539 | POINT(100.117400866002 89.8279081285)    |    19221.6253932898
 22237545 | POINT(100.126622300595 89.8635710310191) |    15238.2941814952
 36179231 | POINT(100.174575503916 89.8459538631141) |    17206.0256453898
  7990607 | POINT(100.183915123343 89.8694989643991) |     14576.179748045
 42321853 | POINT(100.080158896744 89.9232382792979) |    8573.82201116562
 11478472 | POINT(100.07952991873 89.929187502712)   |    7909.32958390792
 11105634 | POINT(100.022092089057 89.9607635568827) |    4382.47446858912
  2532812 | POINT(100.037098880857 89.9638177547604) |    4041.33895476991
 29379668 | POINT(100.023544505239 89.9718571733683) |     3143.3842999806
 29884935 | POINT(100.043103173375 89.9965709634125) |    383.002742505678
  1076756 | POINT(100.098143275827 89.9763003364205) |    2647.10973787711
 22469249 | POINT(100.096434876323 89.9840644095093) |    1779.90951806841
 14234913 | POINT(100.1863790676 89.9419186916202)   |    6487.33244847654
 33483358 | POINT(100.13643046841 89.9627213180065)  |    4163.80433863277
(46 rows)

5、CLUSTER优化，按geohash重排后，扫描的块降低，速度大幅提升。

postgres=# cluster test using idx_test_pos ;

优化后效率如下，从扫描50个块，降低到扫描5个块。扫描成本直线下降10倍，对SHARED BUFFER的需求也下降10倍。

explain (analyze,verbose,timing,costs,buffers)
select *,
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;  

                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_test_pos on public.test  (cost=0.56..2532.74 rows=5000 width=44) (actual time=0.214..7.172 rows=46 loops=1)
   Output: id, pos, st_distancespheroid(pos, '0101000020E610000000000000000059400000000000805640'::geometry, 'SPHEROID("WGS84",6378137,298.257223562997)'::spheroid)
   Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))
   Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)
   Buffers: shared hit=5
 Planning time: 0.174 ms
 Execution time: 7.205 ms
(7 rows)

云端相关产品

阿里云 RDS PostgreSQL

阿里云 HybridDB for PostgreSQL

小结

Greenplum中能否有类似BRIN块级索引的功能？在阿里云HybridDB for PostgreSQL中是有这个功能的：

《解密上帝之手 - 阿里云HDB for PostgreSQL数据库metascan特性(存储级、块级、batch级过滤与数据编排)》

对于Greenplum来说，如果你觉得GiST索引引入的数据导入OVERHEAD不值得，那么建议改用B-Tree索引，使用geohash的btree来代替GiST索引，通过一重索引过滤，二重CPU过滤来进行空间数据的筛选。

在不太影响导入性能的同时，又能获得不错的空间数据筛选性能。何乐而不为呢?

最后，我们可能还有时间、空间、对象属性多维度查询的需求(可以使用多级分区表、多字段索引bitmap合并扫描等数据库黑科技)，方法请参考：

《PostgreSQL\GPDB 毫秒级海量多维数据透视案例分享》

《时间、空间、对象多维属性海量数据任意多维高效检索 - 阿里云RDS PostgreSQL最佳实践》

《(新零售)商户网格化运营 - 阿里云RDS PostgreSQL、HybridDB for PostgreSQL最佳实践》

《PostGIS空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践》

《PostgreSQL 单列组合查询优化 - 多个多边形查询优化》

二重过滤的CPU开销较大，如果这部分操作放到程序端实现，那么数据库的查询耗时将降低到0.17毫秒

-- 以下运算，建议放到程序过滤。（在一重过滤数据量可控的情况下，例如查询2.4公里内的点，放大到5公里操作，5公里内比如有100万个点，加上其他属性的过滤，可能过滤到50万个点，实际上这样返回给客户端，客户端再次过滤是可以接受的。）
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')

参考

《PostGIS 空间数据学习建议》

http://postgis.net/docs/ST_Distance_Spheroid.html

时间： 2024-11-02 07:47:52

Greenplum 空间(GIS)数据检索 B-Tree & GiST 索引实践 - 阿里云HybridDB for PostgreSQL最佳实践的相关文章

如何检测、清理Greenplum垃圾 - 阿里云HybridDB for PG最佳实践

标签 PostgreSQL , Greenplum , HDB for PG 背景 Greenplum通过多版本支持数据的删除和更新的并发和回滚,在删除数据时(使用DELETE删除),对记录的头部xmax值进行标记.在删除记录时,对记录的头部进行标记,同时插入新的版本. 这一就会导致一个问题,如果用户经常删除和插入或更新数据,表和索引都会膨胀. PostgreSQL是通过HOT技术以及autovacuum来避免或减少垃圾的.但是Greenplum没有自动回收的worker进程,所以需要人为的触发

空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践

标签 PostgreSQL , Greenplum , PostGIS , GiST , R-Tree , BRIN , 相关性 , 网格 , BOX , K-Mean 背景空间数据的搜索需求通常包括: 1.平面.三维.多维对象几何相交.不相交.相邻. 2.平面.三维.多维对象的方位判断(相交或严格在左边.右边.上边.下边),类似数值的大于.小于.大于等于.小于等于. 3.平面.三维.多维对象包含另一个对象 4.平面.三维.多维对象等于另一个对象 5.平面.三维.多维对象与另一个对

数据寻龙点穴（空间聚集分析） - 阿里云RDS PostgreSQL最佳实践

标签 PostgreSQL , Greenplum , PostGIS , K-Mean , 热力图背景最近鬼吹灯热播,胡八一的<十六字阴阳风水秘术>到底是什么武功秘籍?寻龙点穴又是什么?别问我,不知道. PS:截取自互联网.- 寻龙点穴是风水学术语.古人说:三年寻龙,十年点穴.意思就是说,学会寻龙脉要很长的时间,但要懂得点穴,并且点得准则难上加难,甚至须要用"十年"时间. 但是,若没正确方法,就是用百年时间,也不能够点中风水穴心聚气的真点,这样一来,寻龙的功夫也白费了

时间、空间、对象海量极速多维检索 - 阿里云RDS PostgreSQL最佳实践

标签 PostgreSQL , 时间 , 空间 , 对象属性 , 多维度检索 , 海量 , 空间索引 , 数据分区 , 块级索引BRIN , 多级索引 , GIN倒排索引 , JSON索引 , 多列索引 , 多索引扫描合并 , bitmapAnd , bitmapOr , 物理扫描 , ctid扫描 , intersect , partial index , partition index 背景人类或者其他对象的活动产生了海量的时间.空间数据,如果有科技能实现回到过去,过去的世界状态会是什么样

分区索引的应用和实践 - 阿里云RDS PostgreSQL最佳实践

标签 PostgreSQL , partial index , partition index 背景当表很大时,大家可能会想到分区表的概念,例如用户表,按用户ID哈希或者范围分区,拆成很多表. 又比如行为数据表,可以按时间分区,拆成很多表. 拆表的好处: 1.可以将表放到不同的表空间,表空间和块设备挂钩,例如历史数据访问量低,数据量大,可以放到机械盘所在的表空间.而活跃数据则可以放到SSD对应的表空间. 2.拆表后,方便维护,例如删除历史数据,直接DROP TABLE就可以了,不会产生REDO

全文检索 (不包含、不等于) 索引优化 - 阿里云RDS PostgreSQL最佳实践

背景 PostgreSQL内置了GIN索引,支持全文检索,支持数组检索等多值数据类型的检索. 在全文检索中,不包含某个关键字能用到索引吗? 实际上GIN是倒排索引,不包含某个关键字的查询,实际上是跳过主tree上面的TOKEN的扫描. 只要被跳过的TOKEN包含了大量数据,那么就是划算的.PostgreSQL是基于CBO的执行计划优化器,所以会自动选择最优的索引. 例子1,全文检索不包含查询 1.创建测试表 postgres=# create table notcontain (id int,

透过空间角度理解GiST索引的构造

标签 PostgreSQL , GIS , PostGIS , Greenplum , 空间检索 , GiST , B-Tree , geohash 背景可以支持空间检索的GiST索引的数据结果到底是什么样的呢? 本文为以下两篇文档的增补: <Greenplum 空间(GIS)数据检索 B-Tree & GiST 索引实践 - 阿里云HybridDB for PostgreSQL最佳实践> <PostGIS空间索引(GiST.BRIN.R-Tree)选择.优化 - 阿里云RDS

HybridDB for PostgreSQL 列存表(AO表)的膨胀、垃圾检查与空间收缩

标签 PostgreSQL , Greenplum , 垃圾检测 , 膨胀 , 列存表 , gp_appendonly_compaction_threshold 背景 Greenplum支持行存储(堆存储)与AO存储,堆存储的垃圾回收和膨胀检测方法请参考: <如何检测.清理Greenplum膨胀.垃圾 - 阿里云HybridDB for PG最佳实践> 对于AO存储,虽然是appendonly,但实际上GP是支持DELETE和UPDATE的,被删除或更新的行,通过BITMAP来标记. AO存储

Greenplum insert的性能(单步\批量\copy) - 暨推荐使用gpfdist、阿里云oss外部表并行导入

标签 PostgreSQL , Greenplum , HybridDB for PostgreSQL , insert , copy , 外部表 , oss , gpfdist 背景 Greenplum是一款MPP数据库产品,优势是优良的OLAP性能,支持多节点并行计算,实现PB级数据量的实时分析. 除了分析能力,数据写入吞吐也是Greenplum的重要指标,Greenplum支持从master节点写入,也支持从segment节点并行写入. 从segment并行加载的话,性能是线性提升的,因为