A PostgreSQL Multidimensional Analysis Case

Yesterday I talked with some folks at Xiaomi and heard about a pain point.
It will be familiar to many developers who build BI reports for enterprises: the operations team asks you for a report on one set of dimensions today, then comes back tomorrow to "torture" you with a different combination.
For developers this is genuinely painful. So is there a good strategy to cope with it, ideally one that also looks impressive to the operations side?

This is where multidimensional analysis comes in handy: if your table has 10 fields, you let the operations staff combine any subset of them to produce a report.
Many commercial databases ship this feature; among open-source databases it is rare. PostgreSQL really delivers here.

Example
Suppose there are 4 business fields and one time field.

postgres=# create table tab5(c1 int, c2 int, c3 int, c4 int, crt_time timestamp);
CREATE TABLE

Generate a batch of test data

postgres=# insert into tab5 select
trunc(100*random()),
trunc(1000*random()),
trunc(10000*random()),
trunc(100000*random()),
clock_timestamp() + (trunc(10000*random())||' hour')::interval
from generate_series(1,1000000);
INSERT 0 1000000

postgres=# select * from tab5 limit 10;
 c1 | c2  |  c3  |  c4   |          crt_time
----+-----+------+-------+----------------------------
 72 |  46 | 3479 | 20075 | 2017-02-02 14:56:36.854218
 98 | 979 | 4491 | 83012 | 2017-06-13 08:56:36.854416
 54 | 758 | 5838 | 45956 | 2016-09-18 02:56:36.854427
  3 |  67 | 5148 | 74754 | 2017-01-01 01:56:36.854431
 42 | 650 | 7681 | 36495 | 2017-06-20 15:56:36.854435
  4 | 472 | 6454 | 19554 | 2016-06-18 19:56:36.854438
 82 | 922 |  902 | 17435 | 2016-07-21 14:56:36.854441
 68 | 156 | 1028 | 13275 | 2017-07-16 10:56:36.854444
  0 | 674 | 7446 | 59386 | 2016-07-26 09:56:36.854447
  0 | 629 | 2022 | 52285 | 2016-11-04 13:56:36.85445
(10 rows)

Create a table to hold the aggregated results. The bitmap column records which field combination a row was aggregated on: each position is 0 or 1 depending on whether that dimension was included. The positions stand for c1, c2, c3, c4, time1, time2, time3, time4; for example, '10110010' means the row is grouped by c1, c3, c4 and time3.

create table stat_tab5 (c1 int, c2 int, c3 int, c4 int, time1 text, time2 text, time3 text, time4 text, cnt int8, bitmap text);

Generate aggregates over every combination of the business fields, crossed with exactly one of the four time granularities.
Note: if the business fields can contain NULLs, wrap them in coalesce when aggregating, so that genuine NULLs are not mistaken for the NULLs that CUBE emits for rolled-up dimensions (see the sketch after the results below).

insert into stat_tab5
select c1,c2,c3,c4,t1,t2,t3,t4,cnt,
'' ||
case when c1 is null then 0 else 1 end ||
case when c2 is null then 0 else 1 end ||
case when c3 is null then 0 else 1 end ||
case when c4 is null then 0 else 1 end ||
case when t1 is null then 0 else 1 end ||
case when t2 is null then 0 else 1 end ||
case when t3 is null then 0 else 1 end ||
case when t4 is null then 0 else 1 end
from
(
select c1,c2,c3,c4,
to_char(crt_time, 'yyyy') t1,
to_char(crt_time, 'yyyy-mm') t2,
to_char(crt_time, 'yyyy-mm-dd') t3,
to_char(crt_time, 'yyyy-mm-dd hh24') t4,
count(*) cnt
from tab5
group by
cube(c1,c2,c3,c4),
grouping sets(to_char(crt_time, 'yyyy'), to_char(crt_time, 'yyyy-mm'), to_char(crt_time, 'yyyy-mm-dd'), to_char(crt_time, 'yyyy-mm-dd hh24'))
)
t;

INSERT 0 49570486
Time: 172373.714 ms
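
As noted above, a genuine NULL in a business field would be indistinguishable from the NULL that CUBE produces when it rolls a dimension up, which would corrupt the bitmap. A minimal sketch of the coalesce workaround, assuming -1 never occurs as a real value in c1..c4 (only one time granularity is shown to keep it short):

-- Hedged sketch: map real NULLs to a sentinel before CUBE sees them.
-- Assumes -1 never occurs as a genuine value in c1..c4.
select coalesce(c1,-1) c1, coalesce(c2,-1) c2,
       coalesce(c3,-1) c3, coalesce(c4,-1) c4,
       to_char(crt_time,'yyyy') t1, count(*) cnt
from tab5
group by cube(coalesce(c1,-1), coalesce(c2,-1), coalesce(c3,-1), coalesce(c4,-1)),
         grouping sets(to_char(crt_time,'yyyy'));

Alternatively, PostgreSQL's grouping() function returns, for each argument, a bit saying whether that expression was rolled up in the current result row; it could replace the CASE WHEN ... IS NULL chain used to build the bitmap and is immune to genuine NULLs.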

Create an index on bitmap to make data retrieval fast

create index idx_stat_tab5_bitmap on stat_tab5 (bitmap);

The user ticks a few dimensions, and the data is fetched by bitmap.

c1,c3,c4,t3 = bitmap(10110010)

postgres=# select c1,c3,c4,time3,cnt from stat_tab5 where bitmap='10110010' limit 10;
 c1 | c3 |  c4   |   time3    | cnt
----+----+-------+------------+-----
 41 |  0 | 30748 | 2016-06-04 |   1
 69 |  0 | 87786 | 2016-06-04 |   1
 70 |  0 | 38805 | 2016-06-04 |   1
 79 |  0 | 65892 | 2016-06-08 |   1
 51 |  0 | 13615 | 2016-06-11 |   1
 47 |  0 | 42196 | 2016-06-28 |   1
 45 |  0 | 54736 | 2016-07-01 |   1
 50 |  0 | 21605 | 2016-07-02 |   1
 46 |  0 | 40888 | 2016-07-16 |   1
 41 |  0 | 90258 | 2016-07-17 |   1
(10 rows)
Time: 0.528 ms

postgres=# select * from stat_tab5 where bitmap='00001000' limit 10;
 c1 | c2 | c3 | c4 | time1 | time2 | time3 | time4 |  cnt   |  bitmap
----+----+----+----+-------+-------+-------+-------+--------+----------
    |    |    |    | 2016  |       |       |       | 514580 | 00001000
    |    |    |    | 2017  |       |       |       | 485420 | 00001000
(2 rows)
Time: 0.542 ms
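
Because the bitmap pins down the combination, ordinary predicates can be layered on top of it to narrow the report further. A small illustrative query (the date value is hypothetical):

select c1, c3, c4, time3, cnt
from stat_tab5
where bitmap = '10110010'
and time3 = '2016-07-01'
order by cnt desc
limit 10;

A composite index such as (bitmap, time3) could serve this query shape even better; whether it pays off depends on the actual access pattern.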

The execution plan is elegant: each sort is shared by several rollup Group Keys, rather than running a separate scan and UNION ALL for every combination.

                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Insert on stat_tab5  (cost=208059.84..142986926.23 rows=1536000000 width=184)
   ->  Subquery Scan on t  (cost=208059.84..142986926.23 rows=1536000000 width=184)
         ->  GroupAggregate  (cost=208059.84..35466926.23 rows=1536000000 width=152)
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2, tab5.c1
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3, tab5.c4
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3, tab5.c4
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               ->  Sort  (cost=208059.84..210559.84 rows=1000000 width=144)
                     Sort Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2, tab5.c1
                     ->  Seq Scan on tab5  (cost=0.00..26370.00 rows=1000000 width=144)
(93 rows)

Follow-up optimizations
.1. Partitioned tables
Because there are so many dimension combinations, the aggregated result set is enormous.
Partitioning the data helps users query it efficiently.
For example, partition by bitmap, then sub-partition each partition by the time dimension; a minimal sketch follows.
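
A sketch of that layout, assuming PostgreSQL 10 or later for declarative partitioning (on 9.6 the same effect needs inheritance plus CHECK constraints); partition names and boundary values are illustrative:

-- Hedged sketch: list-partition the result table by bitmap,
-- then range-sub-partition by day. Values are illustrative.
create table stat_tab5_p (
  c1 int, c2 int, c3 int, c4 int,
  time1 text, time2 text, time3 text, time4 text,
  cnt int8, bitmap text
) partition by list (bitmap);

create table stat_tab5_p_10110010
  partition of stat_tab5_p for values in ('10110010')
  partition by range (time3);

create table stat_tab5_p_10110010_201607
  partition of stat_tab5_p_10110010
  for values from ('2016-07-01') to ('2016-08-01');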

.2. Streaming computation
Using pipelinedb together with cube and grouping sets, the aggregation above can be turned into streaming aggregation, so reports are available almost immediately and the user experience improves; a sketch follows.
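
A minimal sketch in PipelineDB's 0.9.x-era syntax; both statements are PipelineDB-specific, and whether cube/grouping sets are accepted inside a continuous view depends on the version, so treat this single-granularity example as an assumption to verify:

-- Hedged sketch: a stream plus a continuous view that keeps
-- one day-level aggregate up to date as rows arrive.
create stream tab5_stream (c1 int, c2 int, c3 int, c4 int, crt_time timestamp);

create continuous view stat_tab5_cv as
select c1, c2, to_char(crt_time, 'yyyy-mm-dd') t3, count(*) cnt
from tab5_stream
group by c1, c2, to_char(crt_time, 'yyyy-mm-dd');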

.3. Use an MPP product such as Greenplum to scale both storage and compute; with larger data volumes this takes the same multidimensional analysis to the PB level.

.4. Use PostgreSQL 9.6's parallel query (CPU parallelism); performance can improve noticeably. A sketch of the relevant settings follows.
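
The settings below exist in 9.6; the values are illustrative, and whether the planner actually parallelizes this particular grouping-sets plan varies by version, so verify with EXPLAIN:

-- Hedged sketch: allow parallel workers for the scan phase.
set max_parallel_workers_per_gather = 4;
alter table tab5 set (parallel_workers = 4);
-- Then re-run the aggregation and check EXPLAIN for a Gather /
-- Parallel Seq Scan; later releases parallelize more of the plan.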

.5. GPU acceleration can also bring a very large improvement.

Reference
https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-GROUPING-SETS
