PostgreSQL 全表全字段模糊查询的毫秒级高效实现

postgres=# create table t(phonenum text, info text, c1 int, c2 text, c3 text, c4 timestamp);
CREATE TABLE
postgres=# insert into t values ('13888888888','i am digoal, a postgresqler',123,'china','中华人民共和国，阿里巴巴，阿',now());
INSERT 0 1
postgres=# select * from t;
  phonenum   |            info             | c1  |  c2   |              c3              |             c4
-------------+-----------------------------+-----+-------+------------------------------+----------------------------
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2016-04-19 11:15:55.208658
(1 row)

首先，被搜索的字段中如果有中文或者其他mutli-bytes字符，那么不能使用collate, ctype=c的数据库。

还好阿里云RDS PostgreSQL默认就不是C的，很棒。

如果不是，你可以这么指定collate 和 ctype  

postgres=# create database test with template template0 lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8';

那么接下来，我们要做的是，建立支持模糊查询的函数索引，

create extension pg_trgm;  

create or replace function record_to_text(anyelement) returns text as $$
  select $1::text;
$$ language sql strict immutable;  

test=# create index idx_t_1 on t using gin (record_to_text(t) gin_trgm_ops) ;
CREATE INDEX  

当需要使用分页，或者结果集很大时，建议使用gist
test=# create index idx_t_2 on t using gist (record_to_text(t) gist_trgm_ops) ;
CREATE INDEX

查询测试

test=# explain select * from t where record_to_text(t) ~ 'digoal';
                            QUERY PLAN
-------------------------------------------------------------------
 Index Scan using idx_t_2 on t  (cost=0.38..8.39 rows=1 width=140)
   Index Cond: (record_to_text(t.*) ~ 'digoal'::text)
(2 rows)

查询性能测试

先插一堆数据进去
postgres=# insert into t select * from t;
INSERT 0 4194304
test=# select count(*) from t;
  count
---------
 4194304
(1 row)  

然后插几条不一样的  

insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌 德哥 刘德华 彭德怀',now());
insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌 德哥 刘德华 彭德怀',now());    

vacuum analyze t;

查询速度杠杠的

test=# explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ 'dege';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_t_2 on public.t  (cost=0.41..2.43 rows=1 width=101) (actual time=0.236..0.254 rows=2 loops=1)
   Output: phonenum, info, c1, c2, c3, c4
   Index Cond: (record_to_text(t.*) ~ 'dege'::text)
   Buffers: shared hit=5
 Planning time: 0.349 ms
 Execution time: 0.301 ms
(6 rows)

测试查询包含刘德华的行(因为行很少，所以建议使用GIN索引)

test=# /*+ BitmapScan(t idx_t_1) */ explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ '刘德华' limit 10;
LOG:  available indexes for BitmapScan(t): idx_t_1
LOG:  pg_hint_plan:
used hint:
BitmapScan(t idx_t_1)
not used hint:
duplication hint:
error hint:  

LOG:  pg_hint_plan:
used hint:
not used hint:
BitmapScan(t idx_t_1)
duplication hint:
error hint:  

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=441.00..442.26 rows=1 width=101) (actual time=0.239..0.255 rows=2 loops=1)
   Output: phonenum, info, c1, c2, c3, c4
   Buffers: shared hit=4
   ->  Bitmap Heap Scan on public.t  (cost=441.00..442.26 rows=1 width=101) (actual time=0.238..0.252 rows=2 loops=1)
         Output: phonenum, info, c1, c2, c3, c4
         Recheck Cond: (record_to_text(t.*) ~ '刘德华'::text)
         Heap Blocks: exact=1
         Buffers: shared hit=4
         ->  Bitmap Index Scan on idx_t_1  (cost=0.00..441.00 rows=1 width=0) (actual time=0.086..0.086 rows=2 loops=1)
               Index Cond: (record_to_text(t.*) ~ '刘德华'::text)
               Buffers: shared hit=3
 Planning time: 0.494 ms
 Execution time: 0.313 ms
(13 rows)  

test=# /*+ BitmapScan(t idx_t_1) */  select * from t where record_to_text(t) ~ '刘德华' limit 10;
LOG:  available indexes for BitmapScan(t): idx_t_1
LOG:  pg_hint_plan:
used hint:
BitmapScan(t idx_t_1)
not used hint:
duplication hint:
error hint:  

LOG:  pg_hint_plan:
used hint:
not used hint:
BitmapScan(t idx_t_1)
duplication hint:
error hint:  

  phonenum   |           info            | c1  |  c2   |           c3            |             c4
-------------+---------------------------+-----+-------+-------------------------+----------------------------
 13888889999 | i am dege, a postgresqler | 123 | china | 德歌 德哥 刘德华 彭德怀 | 2017-01-06 17:04:42.19215
 13888889999 | i am dege, a postgresqler | 123 | china | 德歌 德哥 刘德华 彭德怀 | 2017-01-06 17:04:42.514895
(2 rows)  

Time: 1.225 ms

语句超时

通常这种索引命中，根据返回的结果集数量，响应时间可能是 0.几毫秒到几十毫秒不等。

不过有些时候，可能因为用户输入的信息量太少，比如输入了2个字符，那么被匹配到的token信息会很多，导致变慢。

使用GiST可以缓解。

那么应用层可以做一个保护，比如超过1秒，则报语句超时。

postgres=# set statement_timeout = '1s';
SET  

or  

test=# /*+ Set(statement_timeout 1s) */ select * from t where record_to_text(t) ~ 'd' limit 10;
LOG:  pg_hint_plan:
used hint:
Set(statement_timeout 1s)
not used hint:
duplication hint:
error hint:  

LOG:  pg_hint_plan:
used hint:
Set(statement_timeout 1s)
not used hint:
duplication hint:
error hint:  

  phonenum   |            info             | c1  |  c2   |              c3              |             c4
-------------+-----------------------------+-----+-------+------------------------------+----------------------------
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941
(10 rows)

hint的使用

使用规则很简单

当使用游标返回时，使用gist

当输入的字符少于3个时，使用GIST

当评估行很少时，使用GIN

其他情况都是要GIN

有了以上规则，你就可以通过HINT，强制使用哪个索引了。

hint使用方法参考：

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

其他优化

业务层面也可以做出一些优化，比如可以先用全文检索，如果没有匹配到，再用模糊查询。

又比如gist, gin, rum索引应该如何选择，可以参考这篇文档

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

只读实例

按照前面的测试，通常来说一个查询的响应应该在1毫秒以内，

对于一个32核的机器，这种模糊查询能达到的QPS估计在8万左右。

如果你发现单节点，在已优化的情况下，已经不能满足查询的并发，那么可以构建只读实例。

构建只读实例的方法也很简单，请参考

https://github.com/aliyun/rds_dbsync

参考

《PostgreSQL 行级全文检索》

《PostgreSQL 百亿数据秒级响应正则及模糊查询》

《PostgreSQL 1000亿数据量正则匹配速度与激情》

《中文模糊查询性能优化 by PostgreSQL trgm》

搜索引擎引以为豪的rank排序, phrase已经不是什么秘密，在PostgreSQL里面已经攻破。如下文章有介绍。

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

《PostgreSQL 全文检索加速快到没有朋友 - RUM索引接口(潘多拉魔盒)》

《PostgreSQL 9.3 pg_trgm imporve support multi-bytes char and gist,gin index for reg-exp search》

时间： 2024-09-14 15:27:21

PostgreSQL 全表全字段模糊查询的毫秒级高效实现 - 搜索引擎也颤抖了的相关文章

MySQL单表多关键字模糊查询的实现方法_Mysql

在最近的一个项目需要实现在MySQL单表多关键字模糊查询,但这数个关键字并不一定都存在于某个字段.例如现有table表,其中有title,tag,description三个字段,分别记录一条资料的标题,标签和介绍.然后根据用户输入的查询请求,将输入的字串通过空格分割为多个关键字,再在这三个字段中查询包含这些关键字的记录. 可目前遇到的问题是,这些关键字是可能存在于三个字段中的任意一个或者多个,但又要求三个字段必须包含所有的关键词.如果分别对每个字段进行模糊匹配,是没法实现所需的要求,由此想到两种

关于mongodb按照字段模糊查询方法

关于mongodb按照字段模糊查询方法模糊查询:tname包含某个关键字测试' cd /opt/soft/mongodb/bin ./mongo --host 192.168.0.1 --port 17017 test db.test_info.find({"tname": {$regex: '测试', $options:'i'}}) db.test_info.find({"tname": {$regex:/测试.*/i}})

asp.net 多字段模糊查询代码_实用技巧

string strField = "id|className|classAdd"; string strKeyWords = this.tbxKeyWords.Text.Trim(); string strSql = dbexe.searchText("select * from class", strField, strKeyWords); 经常用到多字段的模糊查询,上面的函数可以实现,例如strKeyWords值为""时,可以输出: sel

sql语句实现表的字段名查询

下面为您介绍的是查询表的字段名的sql语句写法,sql语句可以实现许多的功能,希望可以您在学习sql语句使用方面获得启示. select name from syscolumns where id = (select id from sysobjects where type = 'u' and name = '相应表名') 或者 select name from syscolumns where id = object_id('相应表名') 用以上sql语句输入相应表名就可以查到表的

数据库表及字段的查询

//得到数据库中所有表的名字: string sql = " select name from sysobjects where type='U'"; //通过表名获取所有字段名: string sql = "Select name from syscolumns Where ID=OBJECT_ID('"+tableName+"')"; //通过表名获取所有字段名和类型:

entity framework 如何指定字段查询（比如一个表10字段只查询2个）

问题描述达到这种目的selectid,nameformtb 解决方案解决方案二:varresult=EF.Table.Select(s=>new{s.id,s.name}); 解决方案三:应该做不到,因为EF的原理是把数据全部拿到内存中,然后再内存中进行帅选.但是在内存中的数据你就可以随便帅选了用linq

PostgreSQL 实时高效搜索 - 全文检索、模糊查询、正则查询、相似查询、ADHOC查询

标签 PostgreSQL , 搜索引擎 , GIN , ranking , high light , 全文检索 , 模糊查询 , 正则查询 , 相似查询 , ADHOC查询背景字符串搜索是非常常见的业务需求,它包括: 1.前缀+模糊查询.(可以使用b-tree索引) select * from tbl where col like 'ab%'; 或 select * from tbl where col ~ '^ab'; 2.后缀+模糊查询.(可以使用reverse(col)表达式b-tr

Greenplum 模糊查询实践

标签 PostgreSQL , Greenplum , orafunc , 阿里云HybridDB for PostgreSQL , reverse , like , 模糊查询背景文本搜索的需求分为: 1.全匹配,如: select * from table where column = 'xxxx'; 2.后模糊,如: select * from table where column like 'xxxx%'; 3.前模糊,如: select * from table where colu

sql模糊查询怎样查询列名在字段中的查询

问题描述 sql模糊查询怎样查询列名在字段中的查询 select album, aritst as a from CD where album like '%a%' album artist 都是列名 CD 是表名, 我题目的意思是如何让 like '%a%' 中的a 代表的是artist的别名而不是字母a 解决方案直接将a定义为变量 set n='a' select album, aritst as @n from CD where album like '%' + @n + '%'

PostgreSQL 全表 全字段 模糊查询的毫秒级高效实现 - 搜索引擎也颤抖了

标签

背景

全表全字段模糊查询的实现例子

语句超时

hint的使用

其他优化

只读实例

对于一个32核的机器，这种模糊查询能达到的QPS估计在8万左右。

参考

PostgreSQL 全表 全字段 模糊查询的毫秒级高效实现 - 搜索引擎也颤抖了的相关文章

PostgreSQL 全表全字段模糊查询的毫秒级高效实现 - 搜索引擎也颤抖了

PostgreSQL 全表全字段模糊查询的毫秒级高效实现 - 搜索引擎也颤抖了的相关文章