Columnar Storage

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.

 

Disk is still the biggest bottleneck for big data queries.
So the biggest problem disk-based query engines need to solve is still how to read data from disk efficiently.
One line of attack is to improve disk seek times and raw throughput, but that is hard and progress is slow.
The more promising approach is to read only the data a query actually needs, i.e. reduce the total volume of disk reads per query.

Disks are still the major bottleneck in query execution over large datasets. 
Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMSs and Impala) is usually to improve the efficiency of reading bytes from disk.

1. The traditional way to improve disk bandwidth has been to wait and allow disks to get faster.
However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.

 

2. The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem.

The rest of this post explores that second idea: how do we maximise the proportion of useful data in the bytes read from disk?

Columns vs. Rows

The traditional row-based storage layout

Traditional database file formats store data in rows, where each row comprises a contiguous collection of column values.

On disk, that looks roughly like the following:

This row-major layout usually has a header for each row that describes, for example, which columns in the row are NULL. Each column value is then stored contiguously after the header, followed by another row with its own header, and so on.
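As an illustrative sketch (a hypothetical format for this post, not the layout of any real engine), a row-major file can be modelled as a sequence of records, each with a one-byte NULL-bitmap header followed by its column values stored contiguously:

```python
import struct

# Hypothetical row-major writer: each row gets a 1-byte NULL bitmap header,
# followed by its column values packed contiguously as 8-byte integers.
def write_rows(rows):
    buf = bytearray()
    for row in rows:
        null_bitmap = 0
        for i, v in enumerate(row):
            if v is None:
                null_bitmap |= 1 << i
        buf += struct.pack("<B", null_bitmap)                # per-row header
        for v in row:
            buf += struct.pack("<q", 0 if v is None else v)  # column values
    return bytes(buf)

data = write_rows([(1, 10, None), (2, None, 30)])
# Each 25-byte record (1-byte header + 3 * 8-byte values) sits next to the
# previous one, so scanning a single column still drags in every other column.
```

The key property is that all of a row's bytes are adjacent: reading any one column of a row means the disk has already paid for the rest of that row.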

 

A query like

SELECT * FROM table

requires returning every single column of every single row in the table, and a row-major layout serves it well.

A more selective query like

SELECT <col_set> FROM table WHERE <predicate_set>

is still efficient over a row-major layout when a) the set of rows returned by evaluating the predicates over the table is a large proportion of the total set of rows (i.e. the selectivity is high), and b) the predicate_set requires reading a large subset of the set of columns, or col_set is a large subset of the set of columns (i.e. the projectivity is high).

 

More simply, a query is going to be efficient if it requires reading most of the columns of most of the rows. 
In these cases, row-major formats allow the query execution engine to achieve good IO efficiency. 
To summarise: with a row-based layout, the entire row is always read regardless of how many columns are actually needed.
So when a query really does touch most of the columns (whether in the SELECT list or the WHERE clause), row-major is the efficient layout.

 

The column-based storage layout

However, there is a general consensus that these SELECT * kinds of queries are not representative of typical analytical workloads; instead either a large number of columns are not projected, or they are projected only for a small subset of rows where only a few columns are required to decide which rows to return.

Coupled with a general trend towards very wide tables with high column counts, the total number of bytes required to satisfy a query is often a relatively small fraction of the size on disk of the target table. In these cases, row-major formats are often quite wasteful in the amount of IO they require to execute a query.

In practice, however, many workloads need only a small fraction of the columns, and for those a row-major layout is very inefficient.
The natural idea follows: if only certain columns are needed, storing and reading each column separately makes much more sense.

Instead of a format that makes it efficient to read entire rows, it’s advantageous for analytical workloads to make it efficient to read entire columns at once. 
Based on our understanding of what makes disks efficient, we can see that the obvious approach is to store column values densely and contiguously on disk. 
This is the basic idea behind columnar file formats. The following diagram shows what this looks like on disk:
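Continuing the earlier sketch under the same illustrative assumptions, a column-major file stores all the values of one column, then all the values of the next, and so on:

```python
import struct

# Hypothetical column-major writer for the same rows: all values of column 0,
# then all of column 1, and so on, each packed as an 8-byte integer.
def write_columns(rows, num_cols):
    buf = bytearray()
    for c in range(num_cols):
        for row in rows:
            v = row[c]
            buf += struct.pack("<q", -1 if v is None else v)  # -1 as a NULL sentinel
    return bytes(buf)

data = write_columns([(1, 10, None), (2, None, 30)], 3)
# Column 1 now occupies one contiguous range (bytes 16..32), so a query that
# needs only column 1 reads exactly those bytes and skips the rest of the file.
```

With two rows of three 8-byte columns, each column is a contiguous 16-byte run: exactly the dense, sequential reads that disks are good at.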

 

QUERY EXECUTION

The diagram below shows what a simple query plan for SELECT col_b FROM table WHERE col_a > 5 looks like.

Reading from a traditional row-major file format

With a row-major layout, every row must be read in full to answer this query, which is clearly inefficient when rows contain a large number of columns.

Reading from columnar storage

 

This, then, is the basic idea of columnar storage: we recognise that analytical workloads rarely require full scans of all table data, but do often require full scans of a small subset of the columns, and so we arrange to make column scans cheap at the expense of extra cost reading individual rows.
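The IO difference for SELECT col_b FROM table WHERE col_a > 5 can be sketched in a few lines, with plain Python lists standing in for on-disk column files (the column names and values are illustrative):

```python
# Columnar execution: only col_a and col_b are ever read.
col_a = [3, 7, 5, 9]           # scanned to evaluate the predicate
col_b = ["x", "y", "z", "w"]   # scanned to produce the projection
# ...any other columns (col_c, col_d, ...) stay untouched on disk.

result = [b for a, b in zip(col_a, col_b) if a > 5]
# A row-major scan would have read every column of every row just to
# evaluate the predicate on col_a.
```

If the table had fifty columns, the columnar plan would still read only two of them, while the row-major plan pays for all fifty.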

 

THE COST OF COLUMNAR

Is this a free lunch? Should every analytical database go out and change every file format to be column-major? 
Obviously the story is more complicated than that. There are some query archetypes that suffer when data is stored in a columnar format.

There is no free lunch; in big data, nothing is one-size-fits-all.
The main costs of a columnar layout are as follows:

1. It is expensive to reassemble a row, since the separate values that comprise it are spread far across the disk. Every column included in a projection implies an extra disk seek, and this can add up when the projectivity of a query is high. Columnar storage therefore only suits queries that read a small number of columns; otherwise the cost of reassembly becomes very high.
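A hypothetical sketch of why reassembly is costly: materializing a full row from a columnar layout requires one lookup per projected column, and on disk each of those lookups is a seek into a separate, distant region of the file:

```python
# Each projected column lives in its own store (on disk: its own far-apart
# byte range), so reassembling a row costs one lookup -- or seek -- per column.
columns = {
    "id":   [1, 2, 3],
    "name": ["a", "b", "c"],
    "age":  [30, 40, 50],
}

def reassemble(row_idx, projected):
    return tuple(columns[col][row_idx] for col in projected)

row = reassemble(1, ["id", "name", "age"])   # touches all three column stores
```

The same row in a row-major file is a single contiguous read; here it is three separate accesses, and the gap widens as the projection grows.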

2. Row-major formats have good in-memory spatial locality, and there are common operations that benefit enormously from this. For example, adding two columns of the same row together is trivial in a row-major layout, because the two values sit next to each other in memory, while the equivalent columnar logic is considerably more involved.

For cases like this, systems such as HBase adopt a column-family strategy: columns that are frequently used together are stored together, which works very well as an optimisation.


This post was excerpted from a blog on 博客园 (Cnblogs); original publication date: 2013-03-26.

