PostgreSQL 10.0 preview performance enhancement - OLAP speedup framework, Faster Expression Evaluation Framework (with JIT)

Tags

PostgreSQL , 10.0 , llvm , jit , Faster Expression Evaluation Framework


Background

PostgreSQL 10.0 may integrate techniques such as JIT compilation and vectorized execution, providing a common development framework that makes efficient collaboration easier and improves OLAP performance.

Although community members have already contributed LLVM and vectorized-execution plugins, the community clearly intends to integrate these computation techniques directly into the core. Keep it up, PostgreSQL!

《分析加速引擎黑科技 - LLVM、列存、多核并行、算子复用 大联姻 - 一起来开启PostgreSQL的百宝箱》

《PostgreSQL 向量化执行插件(瓦片式实现) 10x提速OLAP》

Hi Everyone,  

TL;DR: Making things faster. Architectural evaluation.  

As some of you might be aware, I've been working on making execution of
larger queries in postgresql faster. While working on "batched execution"
I came to the conclusion that it, while necessary, isn't currently showing
a large benefit, because expression evaluation and tuple deforming are
massive bottlenecks.  

I'm posting a quite massive series of WIP patches here, to get some
feedback.  

Tuple deforming is slow because of two reasons:  

1) It's the first thing that accesses tuples, i.e. it'll often incur
   cache misses. That's partially fundamental, but also partially can be
   addressed, e.g. through changing the access order in heap as in [1].
2) Tuple deforming has a lot of unpredictable branches, because it has
   to cope with various types of fields. We e.g. perform alignment in a
   lot of unneeded cases, and do null checks for NOT NULL columns, and so on.  
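The branchiness described in 2) can be illustrated with a toy deforming loop. This is a deliberately simplified sketch, not PostgreSQL's actual slot_deform_tuple(); the struct and function names here are invented:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Toy attribute descriptor: a real deforming loop could skip the null
 * check for attnotnull columns and skip alignment for fixed layouts,
 * but a generic loop pays for all of these branches on every column. */
typedef struct { int16_t attlen; bool attnotnull; char attalign; } ToyAttr;

static int toy_deform(const uint8_t *data, const uint8_t *nullbits,
                      const ToyAttr *atts, int natts,
                      int64_t *values, bool *isnull)
{
    size_t off = 0;
    for (int i = 0; i < natts; i++) {
        /* branch 1: null-bitmap check, even for NOT NULL columns
         * (bit set means the column is present, as in heap tuples) */
        if (nullbits && (nullbits[i >> 3] & (1 << (i & 7))) == 0) {
            isnull[i] = true;
            values[i] = 0;
            continue;
        }
        isnull[i] = false;
        /* branch 2: alignment, often unnecessary for fixed layouts */
        if (atts[i].attalign == 'i')
            off = (off + 3) & ~(size_t) 3;
        /* branch 3: per-type length dispatch */
        switch (atts[i].attlen) {
            case 4: { int32_t v; memcpy(&v, data + off, 4); values[i] = v; off += 4; break; }
            case 8: { int64_t v; memcpy(&v, data + off, 8); values[i] = v; off += 8; break; }
            default: return -1;
        }
    }
    return 0;
}
```

Three data-dependent branches per column is exactly the pattern that a JITed, per-table specialization can collapse into straight-line loads.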

I tried to address 2) by changing the C implementation. That brings some
measurable speedups, but they're not huge. A bigger speedup comes from making
slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
but it's still not huge.  Finally I turned to just-in-time (JIT)
compiling the code for tuple deforming. That doesn't save the cost of
1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
first part is done in 0008, the JITing in 0012.  

Expression evaluation and projection is another major bottleneck.
1) Our recursive expression evaluation puts a *lot* of pressure on the
   stack.
2) There's a lot of indirect function calls when recursing to other
   expression nodes. These are hard to predict, because the same node
   type (say ExecEvalAnd()) is used in different parts of an expression
   tree, and invokes different sub-nodes.
3) The function calls to operators and other functions are hard to
   predict, leading to a significant number of pipeline stalls.
4) There's a fair amount of pg_list.h style list iteration going on,
   which is cache- and pipeline-inefficient.  
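For illustration, a recursive evaluator along these lines shows where the indirect calls in 1) and 2) come from. This is a minimal sketch with invented names, not PostgreSQL's actual executor structs:

```c
#include <stdbool.h>
#include <stddef.h>

/* Each node is evaluated through an indirect call; the same node type
 * (e.g. eval_and) dispatches to different children depending on where
 * it sits in the tree, so the branch predictor has little to go on. */
typedef struct ExprNode ExprNode;
struct ExprNode {
    bool (*evalfunc)(ExprNode *);   /* one indirect call per node */
    ExprNode *left, *right;
    bool constval;
};

static bool eval_const(ExprNode *n) { return n->constval; }

static bool eval_and(ExprNode *n)
{
    /* recursion costs stack space, and the callees are unpredictable */
    if (!n->left->evalfunc(n->left))
        return false;
    return n->right->evalfunc(n->right);
}

static bool eval_or(ExprNode *n)
{
    if (n->left->evalfunc(n->left))
        return true;
    return n->right->evalfunc(n->right);
}
```

Evaluating even a small tree like (true AND (false OR true)) walks it through several indirect calls, each a potential misprediction and each a new stack frame.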

After some experimenting I came to the conclusion that the recursive
processing is a fundamental impediment to making this faster.  I've
converted (0006) expression processing and projection into an opcode
dispatch based interpreter. That yields, especially for complex
expressions and larger projections, a significant speedup in itself.  But
similarly to the deforming, expression evaluation remains a bottleneck
after that, primarily because there are still a lot of unpredictable jumps
and calls, and because loads/stores have to be complex
(e.g. ExprContext->ecxt_innertuple->tts_values[i]/tts_isnull[i] for a
single scalar var evaluation).  Using the opcode based representation
of expression evaluation (as it's nearly linear, and has done a lot of
the lookups ahead of time), it's actually quite easy to JIT compile it.  

*After JITing, expression evaluation itself is more than ten times faster
than before*.  

But unfortunately that doesn't mean that queries are ten times faster -
usually we'll hit bottlenecks elsewhere relatively soon.  WRT
expression evaluation, the biggest cost afterwards is the relatively
high overhead of V1 function calls - register based parameter passing is a
lot faster.  

After experimenting a bit with doing JITing manually (a lot of
eye-stabbing kind of fun), I chose to use LLVM.  

An overview of the patch-queue so far:
0001  Make get_last_attnums more generic.  

Boring prerequisite.  

0002  More efficient AggState->pertrans iteration.  

Relatively boring minor optimization, but it turns out to be an easily
hit bottleneck. Will commit independently.  

0003  Avoid materializing SRFs in the FROM list.
0004  Allow ROWS FROM to return functions as single record column.
0005  Basic implementation of targetlist SRFs via ROWS FROM.
0006  Remove unused code related to targetlist SRFs.  

These are basically just prerequisites for the faster expression
evaluation, and discussed elsewhere [2].  This implementation is *NOT*
going to survive, because we ended up coming to the conclusion that using a
separate executor node to expand SRFs is a better plan. But the new
expression evaluation code won't be able to handle SRFs...  

0007  WIP: Optimize slot_deform_tuple() significantly.  

This a) turns tuple deforming into an opcode based dispatch loop (using
computed goto on gcc/clang). b) moves a lot of the logic from
slot_deform_tuple() callsites into itself - that turns out to be more
efficient.  I'm not entirely sure it's worth doing the opcode based
dispatch part, if we're going to also do the JIT bit - it's a fair
amount of code, and the speed difference only matters on large amounts
of rows.  

0008  WIP: Faster expression processing and targetlist projection.  

This, functionally nearly complete, patch turns expression evaluation
(and tuple deforming as a special case of that) into a "mini language"
which is interpreted using either a while(true) switch(opcode) or
computed goto to jump from opcode to opcode.  It does so by moving a lot
more of the code for expression evaluation to initialization time and
building a linear series of steps to evaluate expressions, thereby
removing all recursion from expression processing.  
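The "mini language" idea can be sketched roughly as follows. This is a toy with invented opcode names, far simpler than the patch's actual step types: expression initialization compiles the tree into a flat array of steps, and evaluation is a single dispatch loop with no recursion:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy opcodes; the real patch has many more step kinds. */
typedef enum { OP_CONST, OP_VAR, OP_ADD, OP_DONE } OpCode;

typedef struct {
    OpCode  op;
    int64_t constval;   /* used by OP_CONST */
    int     varno;      /* used by OP_VAR: index into the input row */
} ExprStep;

/* Evaluate a linear program over a small value stack: one predictable
 * loop, all lookups resolved at "compile" time, no tree recursion. */
static int64_t exec_steps(const ExprStep *steps, const int64_t *row)
{
    int64_t stack[16];
    int sp = 0;

    for (const ExprStep *s = steps;; s++) {
        switch (s->op) {
            case OP_CONST: stack[sp++] = s->constval; break;
            case OP_VAR:   stack[sp++] = row[s->varno]; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_DONE:  return stack[sp - 1];
        }
    }
}
```

A computed-goto variant (on gcc/clang) replaces the switch with an array of label addresses and a `goto *dispatch[s->op]` at the end of each opcode, which gives every opcode its own indirect-jump site and therefore its own branch-predictor entry.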

This nearly entirely gets rid of the stack usage cost of expression
evaluation (we pretty much never recurse except for subplans). Being
able to remove the now-redundant calls to check_stack_depth() is a
noticeable benefit; it turns out that that check has a noticeable
performance impact (as it apparently forces the stack to actually be used,
instead of just renaming registers inside the CPU).  

The new representation and evaluation is functionally nearly complete
(there's a single regression test failure, and I know why that is), but
the code needs a fair amount of polishing.  

I do absolutely think that the fundamentals of this are the right way to
go, and I'm going to work hard on polishing the patch up.  But this
isn't something that we can easily do in parts, and it's a huge ass
patch. So I'd like to have at least some more buyin before wasting even
more time on this.  

0009  WIP: Add minimal keytest implementation.  

More or less experimental patch that tries to collapse a simple
expression of the form OpExpr(ScalarVar, Const) into a single expression
evaluation step.  The benefits probably aren't big enough if we do end
up doing JITing of expressions.  

0010  WIP: Add configure infrastructure to enable LLVM.
0011  WIP: Beginning of a LLVM JIT infrastructure.  

Very boring preliminary patches to add --with-llvm and some minimal
infrastructure to handle LLVM. If we go this way, JITed stuff needs to
be tied to resource owners, and we need some other centralized
infrastructure.  

0012  Heavily-WIP: JITing of tuple deforming.  

This, in a not-yet-that-nice manner, implements a JITed version of the
per-column stuff that slot_deform_tuple() does.  It currently always
deforms all columns, which obviously would have to change. There's also
considerable additional performance improvements possible.  

With this patch the per-column overhead (minus bitmap handling, which
0007 moved into a separate loop), drops from 10%+ into low single digits
for a number of queries.  Afterwards the biggest cost is VARSIZE_ANY()
for varlena columns (which atm isn't inlined).  That is, besides the
initial cache-miss when accessing tuple->t_hoff, which JITing can do
nothing about :(  

This can be enabled/disabled using the new jit_tuple_deforming GUC.  To
make this production ready in some form, we'd have to come up with a way
to determine when it's worth doing JITing. The easiest way would be to
do so after N slot_deform_tuple() calls or such, another way would be to
do it based on cost estimates.  
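One possible shape for the call-count heuristic (purely illustrative; the names and the threshold are invented, not from the patch): keep an interpreted path behind a function pointer, count invocations, and swap in the compiled variant once the count crosses a threshold:

```c
#include <stdint.h>
#include <stdbool.h>

#define JIT_THRESHOLD 100   /* invented threshold, would need tuning */

typedef struct DeformState DeformState;
typedef void (*deform_fn)(DeformState *);

struct DeformState {
    deform_fn deform;       /* current implementation */
    uint64_t  ncalls;
    bool      jitted;
};

static void deform_jitted(DeformState *state)
{
    /* stand-in for code emitted by the JIT */
    (void) state;
}

static void deform_interpreted(DeformState *state)
{
    if (++state->ncalls >= JIT_THRESHOLD) {
        /* hot enough: pay the compilation cost once, swap the pointer */
        state->deform = deform_jitted;
        state->jitted = true;
    }
    /* ... the interpreted per-column loop would run here ... */
}
```

Callers always go through state->deform, so the interpreted-to-JITed switch is transparent to them; a cost-based variant would instead make the decision once at executor startup from the planner's estimates.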

0013  WIP: ExprEval: Make threaded dispatch use a separate field.  

Boring preliminary patch. Increases memory usage a bit, needs to be
thought through more.  

0014  Heavily-WIP: JITed expression evaluation.  

This is the most interesting bit performance-wise. A few common types of
expressions are JITed: scalar value accesses, function calls, boolean
expressions, aggregate references.  

This can be enabled using the new jit_expressions GUC.  

Even for the supported expression types I've taken some shortcuts
(e.g. strict functions aren't actually strict).  

The performance benefits are quite noticeable. For TPCH ExecEvalExpr()
(which is where 0008 moved all of expression evaluation/projection) goes
from being the top profile entry, to barely noticeable, with the JITed
function usually not showing up in the top five entries anymore.  

After the patch it becomes very clear that our function call
infrastructure is a serious bottleneck. Passing all the arguments via
memory, and, even worse, forcing isnull/values to be on separate
cachelines, has significant performance implications.  It also becomes
quite noticeable that nodeAgg's transition function invocation doesn't
go through ExecEvalExpr() but does that itself - which leads to constant
mispredictions if several transition values exist.  
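The cacheline point can be made concrete with two toy layouts (these are not PostgreSQL's actual FunctionCallInfoData, just an illustration of the concern): with separate value and null arrays, an argument and its null flag sit hundreds of bytes apart, while an interleaved layout keeps them on the same cache line.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_ARGS 100   /* FUNC_MAX_ARGS-style bound, illustrative */

/* Layout A: values and null flags in separate arrays. arg[i] and
 * argnull[i] are 800 bytes apart - guaranteed different cache lines,
 * so a strict function touches two lines per argument it inspects. */
typedef struct {
    int64_t arg[MAX_ARGS];
    bool    argnull[MAX_ARGS];
} CallInfoSeparate;

/* Layout B: interleave value and flag; each argument's value and
 * nullness share one cache line. */
typedef struct {
    int64_t value;
    bool    isnull;
} NullableArg;

typedef struct {
    NullableArg args[MAX_ARGS];
} CallInfoInterleaved;
```

As it happens, PostgreSQL 12 later reworked its function-call interface along these lines, replacing the separate arg/argnull arrays with an array of NullableDatum.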

While the JIT code is relatively verbose, it turns out to not actually
be that hard to write after some startup pains. All the JITing of
expressions that exists so far was basically written in ~10 hours.  

This also needs some heuristics about when JITing is
appropriate. Compiling an expression that's only executed once is never
going to be faster than doing the interpretation (it at least needs a
writable allocation for the code, and then a remap to make that code
read-only and executable).  A trace based approach (everything executed
at least a thousand times) or cost based (all queries costing more than
100000 should be JITed) could make sense.  

It's worthwhile to note that at the moment this is a per-query-execution
JIT, not something that can trivially be cached for prepared
statements. That'll need further infrastructure.  

0015  Super-Heavily-WIP: LLVM perf integration.  

This very very very preliminary patch (including some copy-pasted GPL
code!) creates /tmp/perf-<pid>.map files, which allows perf to show
useful symbols for profile hits in JITed expressions.  I plan to push this
towards LLVM, so this isn't something PG will have to do, but it's
helpful for evaluation.  
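The perf map convention itself is simple enough to sketch (the helper and symbol names here are made up): perf looks for /tmp/perf-&lt;pid&gt;.map containing one "start-address size name" line per JITed function, all in hex:

```c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Append one symbol to /tmp/perf-<pid>.map so `perf report` can
 * attribute samples landing inside JIT-emitted code to a readable
 * name.  Line format: <hex start> <hex size> <symbol name> */
static int perf_map_register(const void *code, size_t size, const char *name)
{
    char path[64];
    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int) getpid());

    FILE *f = fopen(path, "a");
    if (f == NULL)
        return -1;
    fprintf(f, "%lx %lx %s\n",
            (unsigned long) (uintptr_t) code,
            (unsigned long) size, name);
    fclose(f);
    return 0;
}
```

Calling e.g. perf_map_register(jit_code, jit_size, "deform_mytable") after emitting code would make samples in that address range show up under "deform_mytable" instead of an anonymous hex address.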

I eventually plan to start separate threads about some of the parts in
here, but I think the overall picture needs some discussion first.  

Q: Why LLVM and not a hand-rolled JIT?
A: Because hand-rolling a JIT is probably hard to scale to multiple
   maintainers, and multiple platforms. I started down the path of doing
   a hand-rolled x86 JIT, and that'd also be doable (faster compilation,
   slower execution basically); but I doubt we'd end up having that
   across different architectures and platforms. Not to speak of things like
   proper debugger and profiler integration.  I'm not entirely convinced
   that that's the right path. It might also be a transitional step,
   towards doing our completely own JIT. But I think it's a sensible
   step.  

Q: Why LLVM and not $jit-toolkit
A: Because all the other JIT stuff I looked at was either really
   unportable (mostly x86 linux only), inconveniently licensed (like
   e.g. gcc's jit library) or nearly unmaintained (luajit's stuff for
   example).  I might have missed something, but ISTM that atm the
   choice is between hand-rolling and using LLVM.  

Q: Does this actually inline functions from the backend?
A: No. That probably is something desirable in the future, but to me
   that seems like it should be a separate step. The current one's big
   enough. It also further increases compilation times, so quite
   possibly we only want to do so based on another set of heuristics.  

Q: ?  

Comments? Questions?  

Regards,  

Andres  

[1] https://archives.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20160523005327.v2tr7obytitxcnna@alap3.anarazel.de

The discussion of this patch can be found on the mailing list; see the URLs at the end of this article.

The PostgreSQL community works very rigorously: a patch may be discussed on the mailing list for months or even years and revised repeatedly based on feedback, so by the time it is merged into master it is already very mature. This is why PostgreSQL is so widely known for its stability.

References

https://commitfest.postgresql.org/13/1061/

https://www.postgresql.org/message-id/flat/20161206034955.bh33paeralxbtluv@alap3.anarazel.de#20161206034955.bh33paeralxbtluv@alap3.anarazel.de

《分析加速引擎黑科技 - LLVM、列存、多核并行、算子复用 大联姻 - 一起来开启PostgreSQL的百宝箱》

《PostgreSQL 向量化执行插件(瓦片式实现) 10x提速OLAP》

Time: 2024-10-20 14:10:05
