PostgreSQL Oracle 兼容性之 - Partition By Outer Join实现稠化报表

标签

PostgreSQL , Oracle , 稠化报表 , partition by outer join


背景

背景介绍:借用Oracle的一篇例子:

http://blog.sina.com.cn/s/blog_4cef5c7b01016lm5.html

在数据库表中,存储的数据经常是稀疏数据(sparse data),而不是稠密数据(dense data)。

先来了解一下什么是稀疏数据,比如一个产品销售情况表(比如有产品名、销售时间(精确到年月)、销售量3个列),假设某个时间某些产品它没有销售,一般也不会将这些产品的销售量存储为0,而是不存储,这样在产品销售情况表中就会产生很多缺失的行(gap rows),导致的结果就是特定产品销售数据按时间维度进行排序,是不连续的,或者说此产品销售在时间序列上是有缺失的。顾名思义,稠密数据是相对于稀疏数据来说的,还是用上面的假设说明,也就是说产品在某个时间没有销售,也必须存储此产品销售情况,销售量置0存储,这样对某个特定产品来说它在时间序列就是连续的,但是事实经常不是如此,所以才有将稀疏数据稠密化的过程,数据稠密化在数据仓库应用中很常见。

例如以下报表,由于某些商品在某些月份没有销售,所以不会有记录。

postgres=# create table t_sales(year text, month text, prod name, sales int);
CREATE TABLE
postgres=# insert into t_sales values ('2008', '1', 'a', 1000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '1', 'b', 1500);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '2', 'a', 2000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '2', 'b', 3000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '2', 'c', 1000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '3', 'a', 3000);
INSERT 0 1  

postgres=# select * from t_sales;
 year | month | prod | sales
------+-------+------+-------
 2008 | 1     | a    |  1000
 2008 | 1     | b    |  1500
 2008 | 2     | a    |  2000
 2008 | 2     | b    |  3000
 2008 | 2     | c    |  1000
 2008 | 3     | a    |  3000
(6 rows)

Oracle 通过partition by outer join语法可以方便的实现稠化。

SELECT .....  

FROM table_reference  

PARTITION BY (expr [, expr ]... )  

RIGHT OUTER JOIN table_reference  

SELECT .....  

FROM table_reference  

LEFT OUTER JOIN table_reference

达到类似这样的效果

postgres=# select * from continue_out();
 year | month | prod | sales
------+-------+------+-------
 2008 | 2     | a    |  2000
 2008 | 2     | b    |  3000
 2008 | 2     | c    |  1000
 2008 | 3     | a    |  3000
 2008 | 3     | b    |     0
 2008 | 3     | c    |     0
 2008 | 1     | a    |  1000
 2008 | 1     | b    |  1500
 2008 | 1     | c    |     0
(9 rows)

PostgreSQL Partition By Outer Join实现

通过UDF,可以达到类似的效果:

create or replace function continue_out () returns setof t_sales as $$
declare
  v_year text;
  v_month text;
begin
  -- 按年、月为分区 补齐prod  (select distinct year, month from t_sales)
  -- 如果已有元表,则直接取元表
  -- 如无元表,对于连续型数据,亦可以使用PG的generate_series直接生成。
  for v_year, v_month in
    select distinct year, month from t_sales
  loop
    -- 全量 prod  (select distinct prod from t_sales)
    -- 如已有元表,则直接取元表
    -- 如无元表,对于连续型数据,亦可以使用PG的generate_series直接生成。
    return query
      select
        coalesce(t1.year,v_year),
        coalesce(t1.month,v_month),
        t2.prod,
        coalesce(t1.sales, 0)
      from
      t_sales t1
        right outer join
      (select distinct prod from t_sales) t2
        on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month);
  end loop;
end;
$$ language plpgsql strict;

如下

postgres=# select * from continue_out();
 year | month | prod | sales
------+-------+------+-------
 2008 | 2     | a    |  2000
 2008 | 2     | b    |  3000
 2008 | 2     | c    |  1000
 2008 | 3     | a    |  3000
 2008 | 3     | b    |     0
 2008 | 3     | c    |     0
 2008 | 1     | a    |  1000
 2008 | 1     | b    |  1500
 2008 | 1     | c    |     0
(9 rows)

通过窗口,可以实现同比,从而实现快速的搜索到销量逆增长的商品。

postgres=# select *, sales - coalesce(lag(sales) over w1, sales)  from continue_out() window w1 as (partition by prod order by year,month);
 year | month | prod | sales | ?column?
------+-------+------+-------+----------
 2008 | 1     | a    |  1000 |        0
 2008 | 2     | a    |  2000 |     1000
 2008 | 3     | a    |  3000 |     1000
 2008 | 1     | b    |  1500 |        0
 2008 | 2     | b    |  3000 |     1500
 2008 | 3     | b    |     0 |    -3000
 2008 | 1     | c    |     0 |        0
 2008 | 2     | c    |  1000 |     1000
 2008 | 3     | c    |     0 |    -1000
(9 rows)

详细执行计划如下:

postgres=# load 'auto_explain';
LOAD
Time: 22.171 ms
postgres=# set auto_explain.
auto_explain.log_analyze            auto_explain.log_format             auto_explain.log_nested_statements  auto_explain.log_triggers           auto_explain.sample_rate
auto_explain.log_buffers            auto_explain.log_min_duration       auto_explain.log_timing             auto_explain.log_verbose
postgres=# set auto_explain.log_analyze =on;
SET
Time: 0.272 ms
postgres=# set auto_explain.log_buffers =on;
SET
Time: 0.257 ms
postgres=# set auto_explain.log_min_duration =0;
SET
Time: 0.279 ms
postgres=# set auto_explain.log_nested_statements =on;
SET
Time: 0.262 ms
postgres=# set auto_explain.log_timing =on;
SET
Time: 0.269 ms
postgres=# set auto_explain.log_verbose =on;
SET
Time: 0.245 ms
postgres=# set client_min_messages ='log';
SET
Time: 0.265 ms

postgres=# select *, sales - coalesce(lag(sales) over w1, sales)  from continue_out() window w1 as (partition by prod order by year,month);
LOG:  duration: 0.030 ms  plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1
    right outer join
  (select distinct prod from t_sales) t2
  on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join  (cost=34.04..38.80 rows=200 width=132) (actual time=0.024..0.027 rows=3 loops=1)
  Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
  Hash Cond: (t_sales.prod = t1.prod)
  Buffers: shared hit=2
  ->  HashAggregate  (cost=16.38..18.38 rows=200 width=64) (actual time=0.009..0.009 rows=3 loops=1)
        Output: t_sales.prod
        Group Key: t_sales.prod
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.002..0.003 rows=6 loops=1)
              Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
              Buffers: shared hit=1
  ->  Hash  (cost=17.65..17.65 rows=1 width=132) (actual time=0.007..0.007 rows=3 loops=1)
        Output: t1.year, t1.month, t1.sales, t1.prod
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales t1  (cost=0.00..17.65 rows=1 width=132) (actual time=0.003..0.004 rows=3 loops=1)
              Output: t1.year, t1.month, t1.sales, t1.prod
              Filter: ((t1.year = $2) AND (t1.month = $3))
              Rows Removed by Filter: 3
              Buffers: shared hit=1
LOG:  duration: 0.020 ms  plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1
    right outer join
  (select distinct prod from t_sales) t2
  on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join  (cost=34.04..38.80 rows=200 width=132) (actual time=0.016..0.018 rows=3 loops=1)
  Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
  Hash Cond: (t_sales.prod = t1.prod)
  Buffers: shared hit=2
  ->  HashAggregate  (cost=16.38..18.38 rows=200 width=64) (actual time=0.008..0.009 rows=3 loops=1)
        Output: t_sales.prod
        Group Key: t_sales.prod
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.003..0.004 rows=6 loops=1)
              Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
              Buffers: shared hit=1
  ->  Hash  (cost=17.65..17.65 rows=1 width=132) (actual time=0.004..0.004 rows=1 loops=1)
        Output: t1.year, t1.month, t1.sales, t1.prod
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales t1  (cost=0.00..17.65 rows=1 width=132) (actual time=0.002..0.003 rows=1 loops=1)
              Output: t1.year, t1.month, t1.sales, t1.prod
              Filter: ((t1.year = $2) AND (t1.month = $3))
              Rows Removed by Filter: 5
              Buffers: shared hit=1
LOG:  duration: 0.019 ms  plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1
    right outer join
  (select distinct prod from t_sales) t2
  on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join  (cost=34.04..38.80 rows=200 width=132) (actual time=0.014..0.017 rows=3 loops=1)
  Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
  Hash Cond: (t_sales.prod = t1.prod)
  Buffers: shared hit=2
  ->  HashAggregate  (cost=16.38..18.38 rows=200 width=64) (actual time=0.007..0.008 rows=3 loops=1)
        Output: t_sales.prod
        Group Key: t_sales.prod
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.002..0.003 rows=6 loops=1)
              Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
              Buffers: shared hit=1
  ->  Hash  (cost=17.65..17.65 rows=1 width=132) (actual time=0.004..0.004 rows=2 loops=1)
        Output: t1.year, t1.month, t1.sales, t1.prod
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales t1  (cost=0.00..17.65 rows=1 width=132) (actual time=0.001..0.003 rows=2 loops=1)
              Output: t1.year, t1.month, t1.sales, t1.prod
              Filter: ((t1.year = $2) AND (t1.month = $3))
              Rows Removed by Filter: 4
              Buffers: shared hit=1
LOG:  duration: 0.045 ms  plan:
Query Text: select distinct year, month from t_sales
HashAggregate  (cost=17.65..19.65 rows=200 width=64) (actual time=0.040..0.041 rows=3 loops=1)
  Output: year, month
  Group Key: t_sales.year, t_sales.month
  Buffers: shared hit=1
  ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.028..0.029 rows=6 loops=1)
        Output: year, month, prod, sales
        Buffers: shared hit=1
LOG:  duration: 0.832 ms  plan:
Query Text: select *, sales - coalesce(lag(sales) over w1, sales)  from continue_out() window w1 as (partition by prod order by year,month);
WindowAgg  (cost=60.08..85.08 rows=1000 width=136) (actual time=0.808..0.818 rows=9 loops=1)
  Output: year, month, prod, sales, (sales - COALESCE(lag(sales) OVER (?), sales))
  Buffers: shared hit=9
  ->  Sort  (cost=60.08..62.58 rows=1000 width=132) (actual time=0.803..0.804 rows=9 loops=1)
        Output: year, month, prod, sales
        Sort Key: continue_out.prod, continue_out.year, continue_out.month
        Sort Method: quicksort  Memory: 26kB
        Buffers: shared hit=9
        ->  Function Scan on public.continue_out  (cost=0.25..10.25 rows=1000 width=132) (actual time=0.782..0.783 rows=9 loops=1)
              Output: year, month, prod, sales
              Function Call: continue_out()
              Buffers: shared hit=9
 year | month | prod | sales | ?column?
------+-------+------+-------+----------
 2008 | 1     | a    |  1000 |        0
 2008 | 2     | a    |  2000 |     1000
 2008 | 3     | a    |  3000 |     1000
 2008 | 1     | b    |  1500 |        0
 2008 | 2     | b    |  3000 |     1500
 2008 | 3     | b    |     0 |    -3000
 2008 | 1     | c    |     0 |        0
 2008 | 2     | c    |  1000 |     1000
 2008 | 3     | c    |     0 |    -1000
(9 rows)

Time: 1.362 ms

PostgreSQL的UDF非常强大,可以实现很多功能。

当然也期待PostgreSQL直接将partition by语法引入,那就更加方便了。

参考

https://www.postgresql.org/docs/10/static/functions-window.html

https://www.postgresql.org/docs/10/static/plpgsql.html

http://blog.sina.com.cn/s/blog_4cef5c7b01016lm5.html

时间: 2024-10-23 18:21:36

PostgreSQL Oracle 兼容性之 - Partition By Outer Join实现稠化报表的相关文章

PostgreSQL Oracle兼容性之 - plpgsql 自治事务(autonomous_transaction)补丁

PostgreSQL Oracle兼容性之 - plpgsql 自治事务(autonomous_transaction)补丁 作者 digoal 日期 2016-11-04 标签 PostgreSQL , autonomous_transaction , 自治事务 , Oracle兼容性 , plpgsql 背景 PostgreSQL的plpgsql服务端编程语言与Oracle数据库的pl/sql编程语言非常类似,但是对于自治事务一直没有语法层面的支持. 以往如果要支持自治事务,可以使用exce

PostgreSQL Oracle 兼容性之 - rowid (CREATE TABLE WITH OIDS)

标签 PostgreSQL , Oracle , 兼容性 , 行号 , rowid , oid , ctid 背景 Oracle的数据中,通过ROWID可以定位到一条记录,当记录没有发生行迁移时,ROWID是不变的,因此即使不使用PK,也能很好的定位到一条记录. PostgreSQL中,也有行号,CTID,由BLOCK_ID和ITEM_ID组成,即哪个数据块的哪条记录. 但是PostgreSQL的引擎为多版本引擎,因此一条记录在被更新后CTID会发生变化(代表了新的版本). 不管是Oracle还

PostgreSQL Oracle兼容性 之 - parser SQL保留|关键字(keywrods)大全

标签 PostgreSQL , keywords , 关键字 , Oracle 兼容性 背景 PostgreSQL数据库中有哪些关键字,这些关键字的使用限制如何? https://www.postgresql.org/docs/10/static/sql-keywords-appendix.html 文档中的说明并不是特别清晰,(并且KEYWORDS与版本强相关),所以使用pg_get_keywords这个系统函数得到的,更加准确可读. https://www.postgresql.org/do

PostgreSQL Oracle 兼容性之 - timestamp 与 numeric 的运算

标签 PostgreSQL , Oracle , 兼容性 , timestamp , numeric 背景 Oracle里面支持时间戳与数字的加减,数字默认单位为天. PostgreSQL 支持时间戳与interval类型进行加减.日期支持与整型做加减. 为了兼容Oracle(时间戳与数字加减),我们可以复写操作符来实现时间戳与数字的加减. 复写操作符 1.自定义几个函数,用于时间戳与数字的加减. postgres=# create or replace function timestamp_a

PostgreSQL Oracle 兼容性 之 TABLE、PIPELINED函数

标签 PostgreSQL , 返回表 , 返回复合类型 , 返回游标 背景 Oracle 通过table, pipelined函数,用于格式化返回类型为table的函数的结果. Table function concepts There a couple of steps to take when you are working with table functions. Like when you are working with normal tables you have to desc

PostgreSQL Oracle 兼容性之 - 系统列(ctid, oid, cmin, cmax, xmin, xmax)

标签 PostgreSQL , Oracle , 兼容性 , ctid , cmin , cmax , xmin , xmax , oid 背景 PostgreSQL中有一些系统列(即行的头部信息的列),例如物理行号,COMMAND ID,事务号,以及OID. 当我们建表时,不能使用冲突的列名,否则会报错: postgres=# create table a(ctid int); 错误: 42701: 字段名 "ctid" 与系统字段名冲突 LOCATION: CheckAttribu

PostgreSQL Oracle 兼容性 之 - PL/SQL record, table类型定义

背景 Oracle PL/SQL是非常强大的一门SQL编程语言,许多Oracle用户也使用它来处理一些要求延迟低且数据一致性或可靠性要求很高的业务逻辑. PostgreSQL也有一门非常高级的内置SQL编程语言,plpgsql.与Oracle PL/SQL语法极其类似,但是还是有一些不一样的地方.(PS:除了plpgsql,PostgreSQL还支持C,java,python,perl等流行的语言作为数据库的函数编程语言) 本文是针对有Oracle用户遇到的一些函数语法与PostgreSQL不兼

PostgreSQL Oracle 兼容性之 - RATIO_TO_REPORT 分析函数

标签 PostgreSQL , Oracle , RATIO_TO_REPORT , 分析函数 背景 Oracle的分析函数RATIO_TO_REPORT()是用于计算当前值在分组内的占比的 RATIO_TO_REPORT is an analytic function. It computes the ratio of a value to the sum of a set of values. If expr evaluates to null, then the ratio-to-repo

PostgreSQL Oracle 兼容性 之 USERENV

标签 PostgreSQL , Oracle , USERENV , 会话环境变量 背景 USERENV 是Oracle 用来获取当前会话变量的函数.官方是这么介绍的: https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions184.htm#SQLRF06117 Describes the current session. The predefined parameters of namespace USERENV are