MonetDB 1.6 billion(384GB) JOIN 2.4 billion(576GB) 60 columns-random-data wide-table

前面简单的对比了一下PostgreSQL和MonetDB宽表的导入, 查询, 关联方面的性能.

有兴趣的朋友可以参考 : 

MonetDB vs PostgreSQL 2, width table with random data

http://blog.163.com/digoal@126/blog/static/1638770402014714104326879/

MonetDB bulk load performance & fixed length type performance & JOIN performance

http://blog.163.com/digoal@126/blog/static/1638770402014715113449394/

PostgreSQL fixed length wide-table VS MonetDB

http://blog.163.com/digoal@126/blog/static/163877040201471593653444/

PostgreSQL cstore_fdw column-ORI table VS MonetDB (with fixed-length width table)

http://blog.163.com/digoal@126/blog/static/163877040201471691955155/

以上测试基于60个字符串以及INT类型宽表的测试, 单表5000万记录(单表约147GB), 最多关联17个表(约2.5TB)的查询. 

我们已经了解到MonetDB适合定长存储, 效率比变长的字符串要高很多. 

接下来将要测试一下更大的表(指单列超过内存大小)的查询和关联查询. 

查询主要测试and和or的查询, 这种查询一般用在标签场景, 例如按条件筛选商品等.

测试表 : 

宽表单表16亿(384GB)和24亿(576GB)的表关联的查询. 

以及窄表单表364亿(单列存储超过内存大小)的查询性能. 

以下为宽表查询的几个测试: 

表的数据量和容量如下 :

sql>select count(*) from bt1;
+------------+
| L1         |
+============+
| 2400000000 |
+------------+
1 tuple (8.203ms)
sql>select count(*) from bt2;
+------------+
| L1         |
+============+
| 1600000000 |
+------------+
1 tuple (9.380ms)

sql>select "schema","table",sum(columnsize)/1024/1024/1024.0||'GB' from storage group by "schema","table" order by sum(columnsize) desc;
+--------+-------------------+-----------+
| schema | table             | L2        |
+========+===================+===========+
| sys    | bt1               | 536.441GB |
| sys    | bt2               | 357.626GB |
| sys    | bt5               | 135.599GB |
| sys    | bt4               | 84.750GB  |
| sys    | bt3               | 50.849GB  |

and和or的效率差别比较大. or的列越多, 查询越慢.

and查询, 不够多少个条件, 性能都比较平稳, 因为2个条件就已经可以毙掉所有的结果了.

and和or的查询一般用在标签场景, 例如按条件搜索商品等.

sql>select count(*) from bt1 where c1=1;
+------+
| L1   |
+======+
|   48 |
+------+
1 tuple (12.9s)
sql>select count(*) from bt1 where c1=1;
+------+
| L1   |
+======+
|   48 |
+------+
1 tuple (5.5s) -- 第二次不需要从磁盘读取.
sql>select count(*) from bt1 where c1=1 and c2=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (12.7s)  -- c2列需要从磁盘读取
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (13.3s)  -- 不需要把所有的数据载入内存, 因为只需要C1+C2个列的数据就可以得出所有都是false的结论了, 多余的列都不需要判断.
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (12.8s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (13.7s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (15.3s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (5.4s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (16.8s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (17.7s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (14.2s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (15.1s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1 and c12=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (17.4s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1 and c12=1 and c13=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (18.0s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1 and c12=1 and c13=1 and c14=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (23.1s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1 and c12=1 and c13=1 and c14=1 and c15=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (20.4s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1 and c12=1 and c13=1 and c14=1 and c15=1 and c16=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (20.2s)
sql>select count(*) from bt1 where c1=1 and c2=1 and c3=1 and c4=1 and c5=1 and c6=1 and c7=1 and c8=1 and c9=1 and c10=1 and c11=1 and c12=1 and c13=1 and c14=1 and c15=1 and c16=1 and c17=1 and c18=1 and c19=1 and c20=1 and c21=1 and c22=1 and c23=1 and c24=1 and c25=1 and c26=1 and c27=1 and c28=1 and c29=1 and c30=1 and c31=1 and c32=1 and c33=1 and c34=1 and c35=1 and c36=1 and c37=1 and c38=1 and c39=1 and c40=1;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (15.2s)  

or的查询因为每列都要判断, 所以在本例and得出为0的情况下, 判断的字段越多, and和or的性能差别就越大.

sql>select count(*) from bt1 where c1=1 or c2=1;
+------+
| L1   |
+======+
|  240 |
+------+
1 tuple (49.6s)
sql>select count(*) from bt1 where c1=1 or c2=1;
+------+
| L1   |
+======+
|  240 |
+------+
1 tuple (17.3s)
sql>select count(*) from bt1 where c1=1 or c2=1 or c3=1 or c4=1 or c5=1 or c6=1 or c7=1 or c8=1 or c9=1 or c10=1 or c11=1 or c12=1 or c13=1 or c14=1 or c15=1 or c16=1 or c17=1 or c18=1 or c19=1 or c20=1 or c21=1 or c22=1 or c23=1 or c24=1 or c25=1 or c26=1 or c27=1 or c28=1 or c29=1 or c30=1 or c31=1 or c32=1 or c33=1 or c34=1 or c35=1 or c36=1 or c37=1 or c38=1 or c39=1 or c40=1;
+------+
| L1   |
+======+
| 2160 |
+------+
1 tuple (33m 5s)

大表JOIN (536GB JOIN 358GB)

sql>select count(bt1.c1) from bt1 join bt2 on (bt1.c1=bt2.c2 and bt1.c2=bt2.c3 and bt1.c3=bt2.c4 and bt1.c5<10000);
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (15m 27s)

接下来是窄表的测试, 用到3个表, 分别存放过百亿的数据.

sql>\d bt3
CREATE TABLE "sys"."bt3" (
        "id" INTEGER
);
sql>
sql>\d bt4
CREATE TABLE "sys"."bt4" (
        "id" INTEGER
);
sql>
sql>\d bt5
CREATE TABLE "sys"."bt5" (
        "id" INTEGER
);

sql>select * from storage where "table"='bt5';   -- 147GB
+--------+-------+--------+------+----------+-------------+-----------+--------------+----------+---------+--------+
| schema | table | column | type | location | count       | typewidth | columnsize   | heapsize | indices | sorted |
+========+=======+========+======+==========+=============+===========+==============+==========+=========+========+
| sys    | bt5   | id     | int  | 22/2242  | 36400000000 |         4 | 145600000000 |        0 |       0 | false  |
+--------+-------+--------+------+----------+-------------+-----------+--------------+----------+---------+--------+
1 tuple (12.695ms)
sql>select * from storage where "table"='bt4';   -- 91GB
+--------+-------+--------+------+----------+-------------+-----------+-------------+----------+---------+--------+
| schema | table | column | type | location | count       | typewidth | columnsize  | heapsize | indices | sorted |
+========+=======+========+======+==========+=============+===========+=============+==========+=========+========+
| sys    | bt4   | id     | int  | 41/4115  | 22750000000 |         4 | 91000000000 |        0 |       0 | false  |
+--------+-------+--------+------+----------+-------------+-----------+-------------+----------+---------+--------+
1 tuple (5.738ms)
sql>select * from storage where "table"='bt3';   -- 56GB
+--------+-------+--------+------+-------------+-------------+-----------+-------------+----------+---------+--------+
| schema | table | column | type | location    | count       | typewidth | columnsize  | heapsize | indices | sorted |
+========+=======+========+======+=============+=============+===========+=============+==========+=========+========+
| sys    | bt3   | id     | int  | 02/25/22500 | 13650000000 |         4 | 54600000000 |        0 |       0 | false  |
+--------+-------+--------+------+-------------+-------------+-----------+-------------+----------+---------+--------+
1 tuple (6.381ms)

sql>select count(*) from bt3;
+-------------+
| L1          |
+=============+
| 13650000000 |
+-------------+
1 tuple (56.279ms)
sql>select count(*) from bt4;
+-------------+
| L1          |
+=============+
| 22750000000 |
+-------------+
1 tuple (52.546ms)
sql>select count(*) from bt5;
+-------------+
| L1          |
+=============+
| 36400000000 |
+-------------+
1 tuple (55.480ms)

364亿插入性能 :

sql>insert into bt5 select * from bt4 union all select * from bt3;
36400000000 affected row (17m 54s)

查询性能

sql>select count(*) from bt3 where id=1;
+------+
| L1   |
+======+
|  273 |
+------+
1 tuple (50.3s)
sql>select count(*) from bt5 where id=1;
+------+
| L1   |
+======+
|  728 |
+------+
1 tuple (5m 47s)

关联性能 (147GB JOIN 537GB)

sql>select count(bt5.id) from bt5 join bt1 on (bt5.id=bt1.c2 and bt5.id=bt1.c3 and bt5.id<100) ;
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (43m 25s)

[参考]

1. MonetDB vs PostgreSQL 2, width table with random data

http://blog.163.com/digoal@126/blog/static/1638770402014714104326879/

2. MonetDB bulk load performance & fixed length type performance & JOIN performance

http://blog.163.com/digoal@126/blog/static/1638770402014715113449394/

3. PostgreSQL fixed length wide-table VS MonetDB

http://blog.163.com/digoal@126/blog/static/163877040201471593653444/

4. PostgreSQL cstore_fdw column-ORI table VS MonetDB (with fixed-length width table)

http://blog.163.com/digoal@126/blog/static/163877040201471691955155/

时间: 2024-12-28 09:28:19

MonetDB 1.6 billion(384GB) JOIN 2.4 billion(576GB) 60 columns-random-data wide-table的相关文章

MonetDB vs PostgreSQL 2, width table with random data

前面一篇简单的对比了一下PostgreSQL和MonetDB在bulk load和简单的统计查询的性能. 因为测试数据比较单一, 可能没有什么说服力.  本文测试环境与之前的一致, 但是使用宽表, 随机离散字符串进行测试. 5000万测试数据, 60个字段, 单表, 容量149GB, (PG96GB) http://blog.163.com/digoal@126/blog/static/16387704020147139412871/ [注意]  PostgreSQL注重的是高并发, 而Mone

PostgreSQL cstore_fdw column-ORI table VS MonetDB (with fixed-length width table)

前面测试了PostgreSQL 行存储引擎和MonetDB列存储的性能差别. 包括导入, 查询, 关联查询等. 本文将测试一下PostgreSQL使用cstore_fdw插件, 对比MonetDB的性能. 测试方法和性能数据见 :  http://blog.163.com/digoal@126/blog/static/1638770402014715113449394/ http://blog.163.com/digoal@126/blog/static/16387704020147159365

在C#中把两个DataTable连接起来,相当于Sql的Inner Join方法

在下面的例子中实现了3个Join方法,其目的是把两个DataTable连接起来,相当于Sql的Inner Join方法,返回DataTable的所有列.如果两个DataTable中的DataColumn有重复的话,把第二个设置为ColumnName+"_Second",下面是代码,希望对大家有所帮助.using System;using System.Data; namespace WindowsApplication1{    public class SQLOps    {    

C#中把两个DataTable连接起来,相当于Sql的Inner Join方法

在下面的例子中实现了3个Join方法,其目的是把两个DataTable连接起来,相当于Sql的Inner Join方法,返回DataTable的所有列.  如果两个DataTable中的DataColumn有重复的话,把第二个设置为ColumnName+"_Second",下面是代码,希望对大家有所帮助.  using System;  using System.Data;  namespace WindowsApplication1  {      public class SQLO

SQL Server Join方式

原文:SQL Server Join方式 0.参考文献 Microsoft SQL Server企业级平台管理实践  看懂SqlServer查询计划 1.测试数据准备 参考:Sql Server中的表访问方式Table Scan, Index Scan, Index Seek 这篇博客中的实验数据准备.这两篇博客使用了相同的实验数据. 2.SQL Server中的三种Join方式 在Sql Server中,每一个join命令,在内部执行时,都会采用三种更具体的join方式来运行.这三种join的

两种连接的表达 :left(right) join 和 (+)

稍微研究了一下 oracle 自己的join 和标准的join.主要表现在on ,where 关键字所起的作用不同,和连接本身的特性. yang@ORACL> set autotrace on yang@ORACL> select  *   2  from a,b   3  where a.id=b.id(+) and a.name like 'x%';         ID NAME          ID NAME ---------- ----- ---------- -----    

【SQL 学习】表连接--natural join 的一个bug

自然连接(NATURAL JOIN)是一种特殊的等价连接,它将表中具有相同名称的列自动进行记录匹配.自然连接不必指定任何同等连接条件.这篇文章讲的一个关于natural join 的bug!(由 dingjun123 提示!) SQL> conn store/yang 已连接. SQL> create table a as select * from all_objects; 表已创建. SQL> set timing on SQL> create table b as selec

sql server Right Outer Join用法

Right Outer Join 运算符返回满足第二个(底端)输入与第一个(顶端)输入的每个匹配行的联接的每行.此外,它还返回第二个输入中在第一个输入中没有匹配行的任何行,即与 NULL 联接.如果 Argument 列内不存在任何联接谓词,则每行都是一个匹配行. 一.联接( join) 通过联接,可以根据各个表之间的逻辑关系从两个或多个表中检索数据.联接表示应如何使用一个表中的数据来选择另一个表中的行. 联接条件通过以下方法定义两个表在查询中的关联方式: 指定每个表中要用于联接的列.典型的联接

[20150423]left right join.txt

[20150423]left right join.txt --oracle sql 语法里面存在left join,right join连接,而且这种写法是sql ansi标准. --我个人工作习惯特别不喜欢这种写法,使用(+)方式更多一些. -- 实际上我以前开始学习oracle的时候,对于使用(+)那边输出NULL非常混乱.每次都拿scott schema的dept,emp来测试一次. -- 后来我简单的记忆是+表示多的意思,也就这个(+)对应的表不存在时输出NULL值. --为了加强记忆