Dissecting the Disruptor: Why it’s so fast (part two) – Magic cache line padding

Original article: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html (mirrored here because the original is blocked)

We mention the phrase Mechanical Sympathy quite a lot; in fact, it's even the title of Martin's blog.  It's about understanding how the underlying hardware operates, and programming in a way that works with that, not against it.

We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post.  Since this lends itself to pretty pictures, it's the next thing I thought I would tackle.

Comp Sci 101
One of the things I love about working at LMAX is all that stuff I learnt at university and in my A Level Computing actually means something.  So often as a developer you can get away with not understanding the CPU, data structures or Big O notation – I spent 10 years of my career forgetting all that.  But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.

So, a refresher for those of us who studied this at school, and an intro for those who didn’t.  Beware – this post contains massive over-simplifications.

The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program.  Main memory (RAM) is where your data (including the lines of your program) lives.  We’re going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.

The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow.  If you’re doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it’s performing the operation (think a loop counter – you don’t want to be going off to main memory to fetch this to increment it every time you loop around).

The closer the cache is to the CPU, the faster it is and the smaller it is.  L1 cache is small and very fast, and right next to the core that uses it.  L2 is bigger and slower, and still only used by a single core.  L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket.  Finally you have main memory, which is shared across all cores and all sockets.

When the CPU is performing an operation, it’s first going to look in L1 for the data it needs, then L2, then L3, and finally if it’s not in any of the caches the data needs to be fetched all the way from main memory.  The further it has to go, the longer the operation will take.  So if you’re doing something very frequently, you want to make sure that data is in L1 cache.

Martin and Mike’s QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...                      Approx. CPU cycles    Approx. time
Main memory                                                       ~60-80 ns
QPI transit (between sockets, not drawn)                          ~20 ns
L3 cache                                    ~40-45 cycles         ~15 ns
L2 cache                                    ~10 cycles            ~3 ns
L1 cache                                    ~3-4 cycles           ~1 ns
Register                                    1 cycle

If you’re aiming for an end-to-end latency of something like 10 milliseconds, an 80 nanosecond trip to main memory to get some missing data is going to take a serious chunk of that.

Cache lines

Now the interesting thing to note is that it's not individual items that get stored in the cache – it's not a single variable or a single pointer.  The cache is made up of cache lines, typically 64 bytes each, and each line effectively maps to a block of locations in main memory.  A Java long is 8 bytes, so a single cache line can hold 8 long variables.

(I’m going to ignore the multiple cache-levels for simplicity)

This is brilliant if you're accessing an array of longs – when one value from the array gets loaded into the cache, you get up to 7 more for free.  So you can walk that array very quickly.  In fact, you can iterate very quickly over any data structure that is allocated in contiguous blocks of memory.  I made a passing reference to this in the very first post about the ring buffer, and it explains why we use an array for it.

So if items in your data structure aren’t sat next to each other in memory (linked lists, I’m looking at you) you don’t get the advantage of freebie cache loading.  You could be getting a cache miss for every item in that data structure.
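
To make the difference concrete, here's a minimal sketch (my own illustration, not code from the Disruptor) that sums the same values held in a long[] and in a LinkedList<Long>.  The array walk touches memory contiguously, so every 64-byte line fetched brings the next few longs with it; the list walk chases node pointers that can be scattered anywhere on the heap.  It's a naive timing – a real measurement would need JIT warm-up and a harness like JMH – but it shows the shape of the problem:

import java.util.LinkedList;
import java.util.List;

public class ContiguousWalk {
    private static final int SIZE = 10_000_000;

    public static void main(String[] args) {
        long[] array = new long[SIZE];
        List<Long> linked = new LinkedList<>();
        for (int i = 0; i < SIZE; i++) {
            array[i] = i;
            linked.add((long) i);
        }

        // Contiguous walk: each cache line loaded carries up to 8 longs.
        long start = System.nanoTime();
        long sum = 0;
        for (long value : array) {
            sum += value;
        }
        System.out.printf("array:  sum=%d in %d ms%n", sum, (System.nanoTime() - start) / 1_000_000);

        // Pointer-chasing walk: each node (and its boxed Long) can live anywhere
        // on the heap, so far fewer values arrive in the cache "for free".
        start = System.nanoTime();
        sum = 0;
        for (long value : linked) {
            sum += value;
        }
        System.out.printf("linked: sum=%d in %d ms%n", sum, (System.nanoTime() - start) / 1_000_000);
    }
}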

However, there is a drawback to all this free loading.  Imagine your long isn’t part of an array.  Imagine it’s just a single variable.  Let’s call it head, for no real reason.  Then imagine you have another variable in your class right next to it.  Let’s arbitrarily call it tail.  Now, when you load head into your cache, you get tail for free.

Which sounds fine.  Until you realise that tail is being written to by your producer, and head is being written to by your consumer.  These two variables aren’t actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.

Imagine your consumer updates the value of head.  The cache value is updated, the value in memory is updated, and any copies of that cache line held by other cores are invalidated, because their caches won't have the shiny new value.  And remember that we deal at the level of the whole cache line – we can't just mark head as being invalid.

Now if some process running on the other core just wants to read the value of tail, the whole cache line needs to be re-read from main memory.  So a thread which has nothing to do with your consumer is reading a value which has nothing to do with head, and it's slowed down by a cache miss.

Of course this is even worse if two separate threads are writing to the two different values. Both cores are going to be invalidating the cache line on the other core and having to re-read it every time the other thread has written to it. You’ve basically got write-contention between the two threads even though they’re writing to two different variables.

This is called false sharing, because every time you access head you get tail too, and every time you access tail, you get head as well.  All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that’s going to be very inefficient for concurrent access.
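
As a hedged illustration of what that contention looks like in code (again my own sketch, not the Disruptor's), imagine a class with head and tail as adjacent fields and two threads each writing to one of them.  The fields are logically independent, but they will almost certainly share a 64-byte cache line, so each write forces the other core's copy of that line to be invalidated:

public class FalseSharingDemo {
    // head and tail are logically independent, but as adjacent fields they
    // will almost certainly sit on the same 64-byte cache line.
    static class Counters {
        volatile long head;
        volatile long tail;
    }

    public static void main(String[] args) throws InterruptedException {
        final Counters counters = new Counters();
        final long iterations = 100_000_000L;

        // One thread plays producer (writes tail), the other plays consumer (writes head).
        Thread producer = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.tail = i;
            }
        });
        Thread consumer = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.head = i;
            }
        });

        long start = System.nanoTime();
        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.printf("both writers finished in %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}

Run it as-is, then separate the two fields with padding as in the next section and compare; on most machines the padded version is noticeably faster, although with a micro-benchmark like this the exact numbers should be treated as indicative only.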

Our solution – magic cache line padding

You'll see that the Disruptor eliminates this problem, at least for architectures with a cache line size of 64 bytes or less, by adding padding to ensure the ring buffer's sequence number is never in a cache line with anything else.

public long p1, p2, p3, p4, p5, p6, p7; // cache line padding
private volatile long cursor = INITIAL_CURSOR_VALUE;
public long p8, p9, p10, p11, p12, p13, p14; // cache line padding

With 8-byte longs and 64-byte cache lines, seven longs of padding on each side mean the cursor can never end up on the same line as any other field, wherever the line boundaries happen to fall.  So there's no false sharing, no unintended contention with any other variables, and no needless cache misses.

It’s worth doing this on your Entry classes too – if you have different consumers writing to different fields, you’re going to need to make sure there’s no false sharing between each of the fields.
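
For instance, a padded entry might look something like the sketch below (the field names and layout are made up for illustration – this isn't the actual Disruptor Entry code): each field written by a different consumer is separated by seven longs of padding so that no two of them can share a line.

// Illustrative only: names and layout are hypothetical, not taken from the Disruptor.
public class PaddedEntry {
    public long p1, p2, p3, p4, p5, p6, p7;        // cache line padding
    private volatile long fieldWrittenByConsumerA;
    public long p8, p9, p10, p11, p12, p13, p14;   // cache line padding
    private volatile long fieldWrittenByConsumerB;
    public long p15, p16, p17, p18, p19, p20, p21; // cache line padding
}

One caveat worth noting: the JVM is free to reorder fields when it lays out an object, so padding like this is a best-effort technique rather than a guarantee.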

EDIT: Martin wrote a more technically correct and detailed post about false sharing, and posted performance results too.
