PostgreSQL: How to Efficiently Solve Word-Segmented Search Across Arbitrary Columns - case 1

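Background

Some applications need to match a search term against several columns at once.

For example, a table may contain columns such as singer, track, album title, composer, lyrics, and so on.

A user may want to search all of these columns for 刘德华 (Andy Lau) using word-segmented (full-text) matching, with a row qualifying as soon as any one column matches.

The traditional approach is to build a full-text index on each column and then match the columns one by one.

This makes the SQL verbose and requires a large number of OR conditions. Is there a better way?

Yes: output the whole row as a single string, then build a full-text index on that string.

That raises a question, though: what format does the row output use, and does it affect the segmentation result?

PostgreSQL row output format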

create table t1(id int, c1 text, c2 text, c3 text);
insert into t1 values (1 , '子远e5a1cbb8' , '子远e5a1cbb8' , 'abc');  

postgres=# select t1::text from t1;
                t1
-----------------------------------
 (1,子远e5a1cbb8,子远e5a1cbb8,abc)
(1 row)

postgres=# \df+ record_out
                                                                  List of functions
   Schema   |    Name    | Result data type | Argument data types |  Type  | Security | Volatility |  Owner   | Language | Source code | Description
------------+------------+------------------+---------------------+--------+----------+------------+----------+----------+-------------+-------------
 pg_catalog | record_out | cstring          | record              | normal | invoker  | stable     | postgres | internal | record_out  | I/O
(1 row)

Source code of the output function for the record type:
src/backend/utils/adt/rowtypes.c

/*
 * record_out           - output routine for any composite type.
 */
Datum
record_out(PG_FUNCTION_ARGS)
{
...
        /* And build the result string */
        initStringInfo(&buf);

        appendStringInfoChar(&buf, '(');  // the row is wrapped in parentheses

        for (i = 0; i < ncolumns; i++)
        {
...
                if (needComma)
                        appendStringInfoChar(&buf, ',');  // fields are separated by commas
                needComma = true;
...
                /* Detect whether we need double quotes for this value */
                nq = (value[0] == '\0');        /* force quotes for empty string */
                for (tmp = value; *tmp; tmp++)
                {
                        char            ch = *tmp;

                        if (ch == '"' || ch == '\\' ||
                                ch == '(' || ch == ')' || ch == ',' ||
                                isspace((unsigned char) ch))
                        {
                                nq = true;
                                break;
                        }
                }

                /* And emit the string */
                if (nq)
                        appendStringInfoCharMacro(&buf, '"');  // values containing special characters are double-quoted
                for (tmp = value; *tmp; tmp++)
                {
                        char            ch = *tmp;

                        if (ch == '"' || ch == '\\')
                                appendStringInfoCharMacro(&buf, ch);
                        appendStringInfoCharMacro(&buf, ch);
                }
                if (nq)
                        appendStringInfoCharMacro(&buf, '"');
        }

        appendStringInfoChar(&buf, ')');
...
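
The quoting rules in the excerpt above can be mirrored in a few lines of Python for experimentation. This is only a minimal sketch, not PostgreSQL code: the name `record_out_sketch` is hypothetical, values are assumed to be already converted to text, whitespace detection assumes the C locale's `isspace` set, and the NULL handling sits in a part of the function that the excerpt elides.

```python
C_SPACE = " \t\n\r\f\v"  # characters isspace() accepts in the C locale (assumption)

def record_out_sketch(values):
    """Format a row the way the record_out excerpt does (illustrative only)."""
    parts = []
    for value in values:
        if value is None:
            # Assumption: a NULL column is emitted as nothing between the commas.
            parts.append("")
            continue
        text = str(value)
        # Quote empty strings and values containing ", \, (, ), comma or whitespace.
        needs_quotes = text == "" or any(ch in '"\\(),' or ch in C_SPACE for ch in text)
        # Double every " and \ inside the value.
        body = "".join(ch * 2 if ch in '"\\' else ch for ch in text)
        parts.append('"' + body + '"' if needs_quotes else body)
    return "(" + ",".join(parts) + ")"

print(record_out_sketch([1, "子远e5a1cbb8", "子远e5a1cbb8", "abc"]))
print(record_out_sketch(["a b", "x,y", ""]))
```

The first call reproduces the row text shown earlier, (1,子远e5a1cbb8,子远e5a1cbb8,abc). The second shows that values containing a space, a comma, or nothing at all come back wrapped in double quotes.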

The problem with scws segmentation

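At first glance there should be no problem: the row output only adds some commas and double quotes, which are ordinary characters that scws should be able to handle.
In practice there is a problem. Here is an example.
The two strings below differ only at the end, yet a single trailing comma changes how the text is segmented: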

postgres=# select * from ts_debug('scwscfg', '子远e5a1cbb8,');
 alias | description | token | dictionaries | dictionary | lexemes
-------+-------------+-------+--------------+------------+---------
 k     | head        | 子    | {}           |            |
 a     | adjective   | 远    | {simple}     | simple     | {远}
 e     | exclamation | e5a   | {simple}     | simple     | {e5a}
 e     | exclamation | 1cbb  | {simple}     | simple     | {1cbb}
 e     | exclamation | 8     | {simple}     | simple     | {8}
 u     | auxiliary   | ,     | {}           |            |
(6 rows)

postgres=# select * from ts_debug('scwscfg', '子远e5a1cbb8');
 alias | description |  token   | dictionaries | dictionary |  lexemes
-------+-------------+----------+--------------+------------+------------
 k     | head        | 子       | {}           |            |
 a     | adjective   | 远       | {simple}     | simple     | {远}
 e     | exclamation | e5a1cbb8 | {simple}     | simple     | {e5a1cbb8}
(3 rows)
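
Why the difference matters for search: a row is found only if every lexeme of the query is also among the lexemes stored for the row. The Python sketch below models this with plain sets. It is a simplification that ignores positions and weights, and `matches` is a hypothetical helper, not a PostgreSQL function.

```python
# Lexemes taken from the two ts_debug outputs above.
indexed_with_comma = {"远", "e5a", "1cbb", "8"}   # document text: '子远e5a1cbb8,'
indexed_without_comma = {"远", "e5a1cbb8"}        # document text: '子远e5a1cbb8'

# Assumption: the search term is segmented without a trailing comma,
# so the query requires both of these lexemes.
query_lexemes = {"远", "e5a1cbb8"}

def matches(document_lexemes, required_lexemes):
    """AND semantics: every query lexeme must be present in the document."""
    return required_lexemes <= document_lexemes

print(matches(indexed_with_comma, query_lexemes))     # False
print(matches(indexed_without_comma, query_lexemes))  # True
```

With the trailing comma, e5a1cbb8 is split into e5a, 1cbb and 8, so a query for e5a1cbb8 no longer finds the row.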

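How to analyze the problem

A brief overview of PostgreSQL's text search steps

1. The parser splits the string into tokens and assigns each token a token type.
This is why a parser must be specified when creating a text search configuration; the parser is the core of segmentation.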

Command:     CREATE TEXT SEARCH CONFIGURATION
Description: define a new text search configuration
Syntax:
CREATE TEXT SEARCH CONFIGURATION name (
    PARSER = parser_name |
    COPY = source_config
)

The token types a parser supports must also be declared when the parser is created (through its LEXTYPES function):

Command:     CREATE TEXT SEARCH PARSER
Description: define a new text search parser
Syntax:
CREATE TEXT SEARCH PARSER name (
    START = start_function ,
    GETTOKEN = gettoken_function ,
    END = end_function ,
    LEXTYPES = lextypes_function
    [, HEADLINE = headline_function ]
)

List the parsers that have been created:

postgres=# select * from pg_ts_parser ;
 prsname | prsnamespace |   prsstart   |     prstoken     |   prsend   |  prsheadline  |   prslextype
---------+--------------+--------------+------------------+------------+---------------+----------------
 default |           11 | prsd_start   | prsd_nexttoken   | prsd_end   | prsd_headline | prsd_lextype
 scws    |         2200 | pgscws_start | pgscws_getlexeme | pgscws_end | prsd_headline | pgscws_lextype
 jieba   |         2200 | jieba_start  | jieba_gettoken   | jieba_end  | prsd_headline | jieba_lextype
(3 rows)

The token types supported by a parser can be listed as shown below.
For the meaning of each type in scws, see:
http://www.xunsearch.com/scws/docs.php#attr

postgres=# select * from ts_token_type('scws');
 tokid | alias |  description
-------+-------+---------------
    97 | a     | adjective
    98 | b     | difference
    99 | c     | conjunction
   100 | d     | adverb
   101 | e     | exclamation
   102 | f     | position
   103 | g     | word root
   104 | h     | head
   105 | i     | idiom
   106 | j     | abbreviation
   107 | k     | head
   108 | l     | temp
   109 | m     | numeral
   110 | n     | noun
   111 | o     | onomatopoeia
   112 | p     | prepositional
   113 | q     | quantity
   114 | r     | pronoun
   115 | s     | space
   116 | t     | time
   117 | u     | auxiliary
   118 | v     | verb
   119 | w     | punctuation
   120 | x     | unknown
   121 | y     | modal
   122 | z     | status
(26 rows)

2. Each token type is mapped to one or more dictionaries, which are tried in order for tokens of that type

ALTER TEXT SEARCH CONFIGURATION name
    ADD MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]

View the configured mappings between token types and dictionaries:

postgres=# select * from pg_ts_config_map ;

3. The first dictionary that recognizes the token converts it into a lexeme
(removing stop words, stripping plural forms, and so on).
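
The three steps above can be sketched as a toy model in Python. This is not PostgreSQL's implementation: `simple_dict`, `MAPPING` and `to_lexemes` are hypothetical names, the parser step is replaced by a hand-written token list, and the mapping is only shaped like the ts_debug output shown earlier (types a and e map to the simple dictionary, types k and u have no dictionary).

```python
def simple_dict(token):
    """Stand-in for the 'simple' dictionary: lowercases and accepts every token."""
    return [token.lower()]

# Step 2: token type alias -> ordered list of dictionaries (hypothetical mapping).
MAPPING = {"a": [simple_dict], "e": [simple_dict]}

def to_lexemes(tokens):
    """tokens: (token_type_alias, token) pairs, i.e. the output of step 1."""
    lexemes = []
    for alias, token in tokens:
        # Token types without any mapped dictionary are dropped entirely.
        for dictionary in MAPPING.get(alias, []):
            result = dictionary(token)
            if result is not None:
                # Step 3: the first dictionary that recognizes the token wins.
                lexemes.extend(result)
                break
    return lexemes

print(to_lexemes([("k", "子"), ("a", "远"), ("e", "e5a1cbb8")]))
print(to_lexemes([("k", "子"), ("a", "远"), ("e", "e5a"), ("e", "1cbb"), ("e", "8"), ("u", ",")]))
```

In this model 子 (type k) and the comma (type u) vanish because no dictionary is mapped to their token types, which agrees with the empty lexemes column for those rows in ts_debug.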

The following functions are useful for debugging segmentation problems:

ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text)
Returns the token types supported by the given parser.
ts_parse(parser_name text, txt text, OUT tokid integer, OUT token text)
Splits a string into tokens using the given parser.
ts_debug(config regconfig, document text, OUT alias text, OUT description text, OUT token text, OUT dictionaries regdictionary[], OUT dictionary regdictionary, OUT lexemes text[])
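Using the given text search configuration, outputs the tokens of a string together with additional information (token type, dictionaries tried, resulting lexemes).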

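In the example above, the difference already appears at the parser level: the scws parser itself emits different tokens when the comma is present.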

postgres=# select * from ts_parse('scws', '子远e5a1cbb8,');
 tokid | token
-------+-------
   107 | 子
    97 | 远
   101 | e5a
   101 | 1cbb
   101 | 8
   117 | ,
(6 rows)

How to solve it

Without modifying the scws code, we can replace the commas with spaces before segmentation, because scws ignores spaces.

postgres=# select replace(t1::text, ',', ' ') from t1;
              replace
-----------------------------------
 (1 子远e5a1cbb8 子远e5a1cbb8 abc)
(1 row)

postgres=# select to_tsvector('scwscfg', replace(t1::text, ',', ' ')) from t1;
              to_tsvector
---------------------------------------
 '1':1 'abc':6 'e5a1cbb8':3,5 '远':2,4
(1 row)

Using a full-text index over the whole row

postgres=# create or replace function rec_to_text(anyelement) returns text as $$
select $1::text;
$$ language sql strict immutable;
CREATE FUNCTION

postgres=# create index idx on t1 using gin (to_tsvector('scwscfg', replace(rec_to_text(t1), ',', ' ')));
CREATE INDEX

Query syntax (the WHERE expression must match the index expression exactly):
postgres=# explain verbose select * from t1 where to_tsvector('scwscfg', replace(rec_to_text(t1), ',', ' ')) @@ to_tsquery('scwscfg', '子远e5a1cbb8');
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.t1  (cost=4.50..6.52 rows=1 width=100)
   Output: c1, c2, c3, c4
   Recheck Cond: (to_tsvector('scwscfg'::regconfig, replace(rec_to_text(t1.*), ','::text, ' '::text)) @@ '''远'' & ''e5a1cbb8'''::tsquery)
   ->  Bitmap Index Scan on idx  (cost=0.00..4.50 rows=1 width=0)
         Index Cond: (to_tsvector('scwscfg'::regconfig, replace(rec_to_text(t1.*), ','::text, ' '::text)) @@ '''远'' & ''e5a1cbb8'''::tsquery)
(5 rows)

References

http://www.xunsearch.com/scws/docs.php#attr
https://github.com/jaiminpan/pg_jieba
https://github.com/jaiminpan/pg_scws

Segmentation speed: about 44,400 lexemes per second per CPU core (8 lexemes per call × 100,000 calls in about 18 s).

postgres=# create extension pg_scws;
CREATE EXTENSION
Time: 6.544 ms
postgres=# alter function to_tsvector(regconfig,text) volatile;
ALTER FUNCTION
postgres=# select to_tsvector('scwscfg','中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度');
                                   to_tsvector
-----------------------------------------------------------------------------------------
'postgresql':4 '万岁':2 '中华人民共和国':1 '分词':6 '加快':3 '加载':7 '结巴':5 '速度':8
(1 row)
Time: 0.855 ms
postgres=# set zhparser.dict_in_memory = t;
SET
Time: 0.339 ms
postgres=# explain (buffers,timing,costs,verbose,analyze) select to_tsvector('scwscfg','中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度') from generate_series(1,100000);
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=11.431..17971.197 rows=100000 loops=1)
Output: to_tsvector('scwscfg'::regconfig, '中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度'::text)
Function Call: generate_series(1, 100000)
Buffers: temp read=172 written=171
Planning time: 0.042 ms
Execution time: 18000.344 ms
(6 rows)
Time: 18000.917 ms
postgres=# select 8*100000/18.000344;
  ?column?
--------------------
44443.595077960732
(1 row)
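
The throughput figure can be re-derived outside psql. The sketch below repeats the same arithmetic in Python and also, as an extra not in the original measurement, expresses the rate in input characters per second. The variable names are illustrative.

```python
lexemes_per_call = 8          # lexemes in the to_tsvector result shown above
calls = 100_000               # rows produced by generate_series(1,100000)
elapsed_seconds = 18.000344   # "Execution time: 18000.344 ms"

rate = lexemes_per_call * calls / elapsed_seconds
print(f"{rate:.1f} lexemes per second")

# Same measurement expressed per input character instead of per lexeme.
text = "中华人民共和国万岁，如何加快PostgreSQL结巴分词加载速度"
print(f"{len(text) * calls / elapsed_seconds:.0f} input characters per second")
```

The first figure matches the psql result above. The second depends only on the length of the sample sentence, so it is a rough indication rather than a benchmark.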

CPU used for the test

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2494.224
BogoMIPS:              4988.44
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-31


Time: 2024-12-22 05:49:43