MaxCompute 2.0 Index and Optimization Practices

Summary: At the Alibaba Cloud MaxCompute session during the 2017 Computing Conference, Dai Xiening, Senior Technical Expert at Alibaba Cloud, shared the index and optimization practices of MaxCompute. Starting from the data models of MaxCompute, he introduced Hash Clustering and Range Clustering, analyzed index-based query optimization and join optimization with application examples, and closed with a brief summary.

The highlights of the speech are as follows:

MaxCompute is both a computing engine and a storage engine. With 99% of Alibaba's data stored on this platform, optimizing storage efficiency, and thereby computing efficiency, has always been our goal.

Data models of MaxCompute

The data models of MaxCompute include the project, table, and partition. Data under a partition is stored in no particular order; no data organization is defined at that level.
So, can we improve efficiency below the partition level by defining data clustering, sorting, and indexes? The answer is yes.

MaxCompute 2.0 provides two clustering methods: Hash Clustering and Range Clustering.

Hash Clustering

In Hash Clustering, users specify several columns as the cluster key when creating a table. MaxCompute runs a hash function on these columns and stores records with the same hash value in the same bucket (in the figure, records with the same hash value are shown in the same color). The syntax also defines whether the data inside each bucket is stored in order: it is sorted if a "sorted by" clause is specified. This produces two levels of index: a file-level index inside each file, and a top-level index above the buckets that records the number of buckets in the table, the hash function used, and the columns it is applied to. All of these help later queries.
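The mechanics can be sketched in a few lines of Python. This is a conceptual illustration only, not MaxCompute's actual implementation; the bucket count, the `hash_bucket` helper, and the sample records are all assumptions made for the sketch.

```python
# Conceptual sketch of hash clustering: records are assigned to buckets
# by a hash of the cluster key, then sorted inside each bucket.
# (Illustrative only; MaxCompute's real hash function and storage differ.)

NUM_BUCKETS = 4  # assumed bucket count

def hash_bucket(key, num_buckets=NUM_BUCKETS):
    """Map a cluster-key value to a bucket id."""
    return hash(key) % num_buckets

def cluster_and_sort(records, key_col):
    """Group records into hash buckets, sorting each bucket on the key."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for rec in records:
        buckets[hash_bucket(rec[key_col])].append(rec)
    for b in buckets:
        # the "sorted by" clause: order the data inside every bucket
        buckets[b].sort(key=lambda r: r[key_col])
    return buckets

records = [{"id": i, "val": f"v{i}"} for i in [5, 3, 8, 1, 7, 2, 9, 4, 6]]
buckets = cluster_and_sort(records, "id")
```

Because records with equal keys always hash to the same bucket, a later query or join on the cluster key only ever has to look at one bucket per key.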

Range Clustering

Range Clustering is more flexible and advanced than Hash Clustering. With the clustering columns specified in a "range clustered by" clause, MaxCompute globally sorts all records by the values of those columns. It then divides the data into buckets in a balanced way, taking factors such as bucket size and data distribution into account to reduce skew in parallel processing. For example, the nine records in the preceding figure are sorted and divided into four buckets. If a "sorted by" clause additionally specifies the storage order within each bucket, two levels of index are created after sorting: a file-level index and a top-level index. The top-level index records each bucket and the range (interval) it covers.
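The steps above can be sketched as follows. This is a conceptual illustration under simplifying assumptions (buckets are cut purely by record count; MaxCompute balances them using real size statistics), and the sample records are invented for the sketch.

```python
# Conceptual sketch of range clustering: records are globally sorted on the
# clustering column and cut into buckets of similar size; the top-level
# index stores the (min, max) range each bucket covers.
# (Illustrative only; real bucket balancing uses storage-size statistics.)

def range_cluster(records, key_col, num_buckets):
    ordered = sorted(records, key=lambda r: r[key_col])
    n = len(ordered)
    buckets, start = [], 0
    for b in range(num_buckets):
        # spread any remainder over the first n % num_buckets buckets
        end = start + n // num_buckets + (1 if b < n % num_buckets else 0)
        buckets.append(ordered[start:end])
        start = end
    # top-level index: the key range covered by each bucket
    top_index = [(b[0][key_col], b[-1][key_col]) for b in buckets]
    return buckets, top_index

# nine records divided into four range buckets, as in the figure
records = [{"id": i} for i in [5, 3, 8, 1, 7, 2, 9, 4, 6]]
buckets, top_index = range_cluster(records, "id", 4)
```

The resulting `top_index` is exactly what a later range query consults: each entry tells the engine which interval of key values a bucket can possibly contain.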

Index-based query optimization

So, how does the index optimize a query? Suppose data has been clustered and sorted on the id column. A predicate such as id < 3 is then pushed down to the storage level. Using the predicate as a filtering condition, a lookup is first performed in the top-level index, which holds the range information of every bucket; for the condition id < 3 this prunes Bucket 2 and Bucket 3. The predicate can also be pushed further down into the files: because Bucket 1 contains values both less than and equal to 3, the data is filtered again inside the file to further reduce the amount read. Before the cluster index existed, a query of id < 3 had to scan all the data in the table. Now a huge amount of data is filtered out by the index, so efficiency is greatly improved.
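The two-stage filtering just described, prune whole buckets first, then filter inside the boundary bucket, can be sketched like this. The bucket contents and ranges are assumptions invented for the illustration, not real data.

```python
# Conceptual sketch of predicate push-down with a top-level range index:
# the query id < 3 first prunes whole buckets by their (min, max) ranges,
# then filters the surviving boundary bucket row by row.
# (Illustrative only; bucket contents below are assumed sample data.)

top_index = [(1, 3), (4, 6), (7, 9)]            # (min, max) per bucket
bucket_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # Buckets 1, 2, 3

def query_less_than(value):
    result = []
    for (lo, hi), rows in zip(top_index, bucket_data):
        if lo >= value:          # whole bucket out of range: skip, no I/O
            continue
        if hi < value:           # whole bucket qualifies: take it as is
            result.extend(rows)
        else:                    # boundary bucket: filter inside the file
            result.extend(r for r in rows if r < value)
    return result

print(query_less_than(3))  # → [1, 2]; Buckets 2 and 3 are never read
```

Only the boundary bucket pays the cost of row-level filtering; every other bucket is either skipped entirely or taken wholesale, which is where the I/O savings come from.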

The preceding figure shows the TPC-H Q6 query. TPC-H is a standard test suite in the database and big data field. In results obtained on a 100 GB test dataset, the left bar shows the time with the index and the right bar the time without it. Efficiency clearly improves by about ten times: with the index, query execution time, CPU time, and I/O time all decrease, and a large amount of I/O and data loading is avoided.

Join optimization

Besides applying the index to filters, we also offer join optimization. To perform a sort merge join, the records of two data sources that share a join key must be brought together on one machine. Normally, a hash function is run on the join key to divide each data source into buckets, so that records with the same join key land in the same bucket; the data in each bucket is then sorted, and a merge join matches the records with equal key values. This process, called data shuffle, is complicated and time-consuming: the hash function must be run and the data written to disk on one machine, then transmitted to another machine to be read, which requires two rounds of disk I/O.

As the figure shows, the two table scans load data from disk, and the data shuffle takes place between the streaming write and streaming read operations. If the data has already been clustered and sorted, and is stored on disk in that organization, neither shuffle nor sort is needed when performing the join. This is the so-called join optimization; the process is shown on the right of the preceding figure. If M1 and M2 have already been sorted by Hash Clustering, the simplified plan can be executed directly.
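The merge step that remains after the shuffle and sort are eliminated can be sketched as follows. This is a conceptual single-bucket illustration with invented sample tables; it assumes unique keys on each side, which a real merge join does not require.

```python
# Conceptual sketch of the join after hash clustering: both sides are
# already bucketed and sorted on the join key, so corresponding buckets
# can be merge-joined directly, with no shuffle and no extra sort.
# (Illustrative only; one bucket, unique keys per side assumed.)

def merge_join(left, right, key):
    """Merge join two lists of dicts that are pre-sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk == rk:             # keys match: emit the combined record
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
        elif lk < rk:            # advance whichever side is behind
            i += 1
        else:
            j += 1
    return out

# both inputs already sorted on "id", as hash-clustered tables would be
orders = [{"id": 1, "amt": 10}, {"id": 3, "amt": 30}, {"id": 4, "amt": 40}]
users  = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
joined = merge_join(orders, users, "id")
```

Each input is read exactly once, in order, which is why a join over pre-clustered, pre-sorted tables needs no intermediate disk I/O at all.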

TPC-H Q4

Before Hash Clustering, the execution plan shown in the preceding figure on the right has seven stages, with multiple join and shuffle operations. If the table is converted to a hash-clustered table and the Hash Clustering key is set to the join key, only three stages are necessary, which simplifies the execution plan and roughly doubles the efficiency.

Use cases

Query of transaction records on Taobao

Given that there are tens of billions or even hundreds of billions of transaction records on Taobao, querying a user's shopping records of the last week by user ID is like finding a needle in a haystack. The execution result before optimization is shown in the following figure: over 1,000 workers scanned a table of more than 40 billion records, and 26 matching records were found in 1 minute and 48 seconds.

However, if we define the user ID as the clustering key and sort the data by Hash Clustering, the query is accomplished in 6 seconds, with only 4 mappers scanning 10,000 records.

Incremental upgrades of the transaction table of Taobao

As a data source consumed by many business units of Alibaba, the core transaction table of Taobao must be kept as accurate as possible. It is updated incrementally and frequently, with increment data periodically inserted into or merged with the base table. But shuffling both the full table and the increment table on every update is time-consuming: the full table may hold tens of billions or even hundreds of billions of records, while the increment table is only a tenth or a hundredth of that size. In the shuffle of the full and increment tables performed by M1 and M2 in the following figure, shuffling the increment table took 1 minute and 49 seconds, while shuffling the full table took 33 minutes with 2,000 workers.

If the full table is sorted by Hash Clustering, only the increment table needs to be shuffled on each update, instead of repeatedly shuffling the full table. This reduces the join running time from 60 minutes to 22 minutes.
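The idea of reshuffling only the increment can be sketched as follows. This is a conceptual illustration; the bucket count, the upsert-by-key merge semantics, and the sample records are all assumptions made for the sketch, not details of the Taobao pipeline.

```python
# Conceptual sketch of the incremental update: the full table is already
# hash clustered, so on each update only the (much smaller) increment is
# hashed into buckets and merged in; the full table is never reshuffled.
# (Illustrative only; "merge" is assumed to mean upsert by key.)

NUM_BUCKETS = 4

def bucket_of(key):
    return hash(key) % NUM_BUCKETS

def apply_increment(full_buckets, increment, key_col):
    # shuffle only the increment records into their target buckets...
    delta = {b: {} for b in range(NUM_BUCKETS)}
    for rec in increment:
        delta[bucket_of(rec[key_col])][rec[key_col]] = rec
    # ...then upsert bucket by bucket, keeping each bucket sorted
    for b, changes in delta.items():
        merged = {r[key_col]: r for r in full_buckets[b]}
        merged.update(changes)
        full_buckets[b] = sorted(merged.values(), key=lambda r: r[key_col])
    return full_buckets

full = {b: [] for b in range(NUM_BUCKETS)}
full = apply_increment(full, [{"id": i, "v": 0} for i in range(8)], "id")
# a later small increment: one update (id 3) and one new record (id 8)
full = apply_increment(full, [{"id": 3, "v": 1}, {"id": 8, "v": 1}], "id")
```

The cost of each update is proportional to the increment, not to the full table, which matches the reduction from 60 to 22 minutes reported above.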

Conclusion

● With data clustering, sorting, and indexes, MaxCompute can process data more efficiently.
● Predicate push-down reduces the I/O of table scans and enables runtime filtering.
● Data clustering and sorting spare repeated shuffle operations, simplifying the execution plan and saving both time and resources.
