Off-heap Memory in Apache Flink and the curious JIT compiler

Running data-intensive code in the JVM and making it well-behaved is tricky. Systems that put billions of data objects naively onto the JVM heap face unpredictable OutOfMemoryErrors and Garbage Collection stalls. Of course, you still want to to keep your data in memory as much as possible, for speed and responsiveness of the processing applications. In that context, “off-heap” has become almost something like a magic word to solve these problems.

 

In this blog post, we will look at how Flink exploits off-heap memory. 
The feature is part of the upcoming release, but you can try it out with the latest nightly builds. We will also give a few interesting insights into the behavior for Java’s JIT compiler for highly optimized methods and loops.

 

Why actually bother with off-heap memory?

Given that Flink has a sophisticated level of managing on-heap memory, why do we even bother with off-heap memory? It is true that “out of memory” has been much less of a problem for Flink because of its heap memory management techniques. Nonetheless, there are a few good reasons to offer the possibility to move Flink’s managed memory out of the JVM heap:

  • Very large JVMs (100s of GBytes heap memory) tend to be tricky. It takes long to start them (allocate and initialize heap) and garbage collection stalls can be huge (minutes). While newer incremental garbage collectors (like G1) mitigate this problem to some extend, an even better solution is to just make the heap much smaller and allocate Flink’s managed memory chunks outside the heap.
  • I/O and network efficiency: In many cases, we write MemorySegments to disk (spilling) or to the network (data transfer). Off-heap memory can be written/transferred with zero copies, while heap memory always incurs an additional memory copy.
  • Off-heap memory can actually be owned by other processes. That way, cached data survives process crashes (due to user code exceptions) and can be used for recovery. Flink does not exploit that, yet, but it is interesting future work.

Flink传统的基于‘on-heap’ 内存管理机制,已经可以解决很多的java关于‘out of memory’或gc的问题,那我们为何还要用 ‘off-heap’的技术,

1. very large的JVM会要很长的启动时间,并且gc的代价也会很大 
2. heap在写磁盘或network时,至少要一次copy,而off-heap可以实现zero copy 
3. off-heap内存是进程共享的,JVM进程crash不会丢失数据

 

The opposite question is also valid. Why should Flink ever not use off-heap memory?

  • On-heap is easier and interplays better with tools. Some container environments and monitoring tools get confused when the monitored heap size does not remotely reflect the amount of memory used by the process.
  • Short lived memory segments are cheaper on the heap. Flink sometimes needs to allocate some short lived buffers, which works cheaper on the heap than off-heap.
  • Some operations are actually a bit faster on heap memory (or the JIT compiler understands them better).

为何Flink不直接用off-heap memory?

越强大的东西,一般都越麻烦,

所以一般case下,用on-heap就够了

 

The off-heap Memory Implementation

Given that all memory intensive internal algorithms are already implemented against the MemorySegment, our implementation to switch to off-heap memory is actually trivial. 
You can compare it to replacing allByteBuffer.allocate(numBytes) calls with ByteBuffer.allocateDirect(numBytes)
In Flink’s case it meant that we made the MemorySegment abstract and added the HeapMemorySegment andOffHeapMemorySegment subclasses. 
TheOffHeapMemorySegment takes the off-heap memory pointer from a java.nio.DirectByteBuffer and implements its specialized access methods using sun.misc.Unsafe
We also made a few adjustments to the startup scripts and the deployment code to make sure that the JVM is permitted enough off-heap memory (direct memory, -XX:MaxDirectMemorySize).

使用off-heap在内存管理机制上和使用on-heap并没有太大的区别,

相比于NIO,使用ByteBuffer.allocate(numBytes)来分配heap内存,而用ByteBuffer.allocateDirect(numBytes)来分配off-heap内存

Flink,对MemorySegment,生成两个子类,HeapMemorySegment and OffHeapMemorySegment

其中OffHeapMemorySegment,以java.nio.DirectByteBuffer的形式使用off-heap memory, 通过sun.misc.Unsafe接口来操作这些memory

 

Understanding the JIT and tuning the implementation

The MemorySegment was (before our change) a standalone class, it was final (had no subclasses). Via Class Hierarchy Analysis (CHA), the JIT compiler was able to determine that all of the accessor method calls go to one specific implementation. That way, all method calls can be perfectly de-virtualized and inlined, which is essential to performance, and the basis for all further optimizations (like vectorization of the calling loop).

With two different memory segments loaded at the same time, the JIT compiler cannot perform the same level of optimization any more, which results in a noticeable difference in performance: A slowdown of about 2.7 x in the following example:

 

这里是考虑性能优化问题,

这里提出的一个问题就是,如果MemorySegment是standalone class类,没有之类,那么会比较高效,因为编译的时候,他所调研的method都是确定的,可以提前做优化; 
如果具有两个子类,那么只有到真正运行到时候才知道是哪个子类,这样就不能提前做优化;

实际测,性能的差距在2.7倍左右

解决方法:

Approach 1: Make sure that only one memory segment implementation is ever loaded.

We re-structured the code a bit to make sure that all places that produce long-lived and short-lived memory segments instantiate the same MemorySegment subclass (Heap- or Off-Heap segment). Using factories rather than directly instantiating the memory segment classes, this was straightforward.

如果在代码里面只可能实例化其中的一个子类,另一个子类根本就没有被实例化过,那么JIT会意识到,并做优化;我们可以用factories来实例化对象,这样更方便;

Approach 2: Write one segment that handles both heap and off-heap memory

We created a class HybridMemorySegment which handles transparently both heap- and off-heap memory. It can be initialized either with a byte array (heap memory), or with a pointer to a memory region outside the heap (off-heap memory).

第二种方法就是用HybridMemorySegment,同时处理heap和off-heap,这样就不需要子类 
并且有tricky的方式,可以做到透明的处理两种memory


细节看原文

时间: 2024-10-26 00:56:42

Off-heap Memory in Apache Flink and the curious JIT compiler的相关文章

深入理解Apache Flink核心技术

Apache Flink(下简称Flink)项目是大数据处理领域最近冉冉升起的一颗新星,其不同于其他大数据项目的诸多特性吸引了越来越多人的关注.本文将深入分析Flink的一些关键技术与特性,希望能够帮助读者对Flink有更加深入的了解,对其他大数据系统开发者也能有所裨益.本文假设读者已对MapReduce.Spark及Storm等大数据处理框架有所了解,同时熟悉流处理与批处理的基本概念. Flink简介 Flink核心是一个流式的数据流执行引擎,其针对数据流的分布式计算提供了数据分布.数据通信以

Peeking into Apache Flink's Engine Room

Join Processing in Apache Flink In this blog post, we cut through Apache Flink's layered architecture and take a look at its internals with a focus on how it handles joins. Specifically, I will show how easy it is to join data sets using Flink's flue

Apache Storm 衍生项目 & Apache Flink初接触

storm-yarn 概要 storm是一个近似于实时的计算框架,甩开hadoop上的原生mapreduce计算框架不只一条街.如果能将storm引入到hadoop中,对存储于hdfs的数据进行分析必然极大的提高处理性能.storm-yarn就是这样一个项目,由yahoo实现,目前已经开源. 除了storm-yarn试图将storm整合进hadoop,以提升hadoop的分析处理能力的尝试之外,Hortonworks也高调宣布在2014年推出整合了storm的hadoop发行版.当然Horton

Apache Flink源码解析之stream-window

window(窗口)是Flink流处理中非常重要的概念,本篇我们来对窗口相关的概念以及关联的实现进行解析.本篇的内容主要集中在package org.apache.flink.streaming.api.windowing下. Window 一个Window代表有限对象的集合.一个窗口有一个最大的时间戳,该时间戳意味着在其代表的某时间点--所有应该进入这个窗口的元素都已经到达. Flink的根窗口对象是一个抽象类,只提供了一个抽象方法: public abstract long maxTimes

Apache Flink fault tolerance源码剖析(五)

上一篇文章我们谈论了保存点的相关内容,其中就谈到了保存点状态的存储.这篇文章我们来探讨用户程序状态的存储,也是在之前的文章中多次提及的state backend(中文暂译为状态终端). 基于数据流API而编写的程序经常以各种各样的形式保存着状态: 窗口收集/聚合元素(这里的元素可以看作是窗口的状态)直到它们被触发 转换函数可能会使用key/value状态接口来存储数据 转换函数可能实现Checkpointed接口来让它们的本地变量受益于fault tolerant机制 当检查点机制工作时,上面谈

Apache Flink改进及其在阿里巴巴搜索中的应用

本文整理自阿里搜索基础设施团队研究员蒋晓伟在Flink Forward 2016大会上的演讲,原始演讲视频可以在这里查看. 以下为演讲整理 阿里是世界上最大的电子商务零售商,其2015年的年销售额就超过了eBay和Amazon的总和,达3940亿.Alibaba Search,个性化搜索和推荐平台,既是顾客的关键入口,也承担了大部分的在线收益.因此,阿里搜索基础设施团队一直在努力改进产品. 对于电子商务网站上的搜索引擎,到底什么最重要?必然是实时地为每一位用户提供最相关和准确的搜索结果.以阿里巴

【活动】Apache Flink文档翻译志愿者招募!

Apache Flink是一款分布式.高性能的开源流式处理框架,在2015年1月12日,Apache Flink正式成为Apache顶级项目.目前Flink在阿里巴巴.Bouygues Teleccom.Capital One等公司得到应用,如阿里巴巴对Apache Flink的应用案例. 为了更好地让大家了解和使用Apache Flink,我们特地发起Apache Flink官方文档中文翻译计划,欢迎兴趣爱好者加入. 本次翻译来源,主要为Apache Flink官网文档内容,共7部分.100多

Apache Flink fault tolerance源码剖析(二)

继续Flink Fault Tolerance机制剖析.上一篇文章我们结合代码讲解了Flink中检查点是如何应用的(如何根据快照做失败恢复,以及检查点被应用的场景),这篇我们来谈谈检查点的触发机制以及基于Actor的消息驱动的协同机制.这篇涉及到一个非常关键的类--CheckpointCoordinator. org.apache.flink.runtime.checkpoint.CheckpointCoordinator 该类可以理解为检查点的协调器,用来协调operator和state的分布

Apache Flink流作业提交流程分析

用户编写的程序逻辑需要提交给Flink才能得到执行.本文来探讨一下客户程序如何提交给Flink.鉴于用户将自己利用Flink的API编写的逻辑打成相应的应用程序包(比如Jar)然后提交到一个目标Flink集群上去运行是比较主流的使用场景,因此我们的分析也基于这一场景进行. Flink的API针对不同的执行环境有不同的Environment对象,这里我们主要基于常用的RemoteStreamEnvironment和RemoteEnvironment进行分析 在前面我们谈到了Flink中实现了"惰性