False Sharing

Memory is stored within the cache system in units known as cache
lines.  A cache line is a power-of-2 number of contiguous bytes,
typically 32-256 in size, with 64 bytes being the most common.  False
sharing is the term applied when threads unwittingly impact each
other's performance while modifying independent variables that share
the same cache line.  Write contention on cache lines is the single
most limiting factor on achieving scalability for parallel threads of
execution in an SMP system.  I’ve heard false sharing described as the
silent performance killer because it is far from obvious when looking
at code.
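
As a quick sanity check, on Linux the cache line size can be read from
sysfs.  A minimal sketch, assuming a Linux system that exposes this
entry (index0 is usually the L1 data cache):

import java.nio.file.Files;
import java.nio.file.Paths;

public final class CacheLineSize
{
    public static void main(final String[] args) throws Exception
    {
        // coherency_line_size reports the cache line size in bytes
        final String path =
            "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size";
        final String size = Files.readAllLines(Paths.get(path)).get(0).trim();
        System.out.println("cache line size = " + size + " bytes");
    }
}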

To achieve linear scalability with the number of threads, we must
ensure no two threads write to the same variable or cache line.  Two
threads writing to the same variable can be tracked down at a code
level.  To know whether independent variables share the same cache
line, we need to know the memory layout, or we can get a tool to tell
us.  Intel VTune is such a profiling tool.  In this article I’ll
explain how memory is laid out for Java objects and how we can pad out
our cache lines to avoid false sharing.
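
One such tool on the Java side is the OpenJDK JOL (Java Object Layout)
library, which prints the actual field offsets the JVM chose.  A
minimal sketch, assuming the jol-core library is on the classpath:

import org.openjdk.jol.info.ClassLayout;

public final class LayoutDump
{
    public static void main(final String[] args)
    {
        // Prints the header size, each field's offset and any alignment gaps,
        // from which we can see which fields fall in the same cache line.
        System.out.println(ClassLayout.parseClass(Long.class).toPrintable());
    }
}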

Figure 1. Threads on two cores updating independent variables that share a cache line.

Figure 1 above illustrates the issue of false sharing.  A thread
running on core 1 wants to update variable X while a thread on core 2
wants to update variable Y.  Unfortunately these two hot variables
reside in the same cache line.  Each thread will race for ownership of
the cache line so it can perform its update.  If core 1 gets ownership
then the cache sub-system will need to invalidate the corresponding
cache line for core 2.  When core 2 gets ownership and performs its
update, core 1 will be told to invalidate its copy of the cache line.
This will ping-pong back and forth via the L3 cache, greatly impacting
performance.  The issue would be further exacerbated if the competing
cores are on different sockets and additionally have to cross the
socket interconnect.

Java Memory Layout

For the Hotspot JVM, all objects have a 2-word header.  First is the
“mark” word, which is made up of 24 bits for the hash code and 8 bits
for flags such as the lock state, or it can be swapped out for a lock
object.  The second is a reference to the class of the object.  Arrays
have an additional word for the size of the array.  Every object is
aligned to an 8-byte granularity boundary for performance.  Therefore,
to pack efficiently, object fields are re-ordered from declaration
order into the following order, based on size in bytes (a layout
sketch follows the list):

  1. doubles (8) and longs (8)
  2. ints (4) and floats (4)
  3. shorts (2) and chars (2)
  4. booleans (1) and bytes (1)
  5. references (4/8)
  6. <repeat for sub-class fields>
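
For example, given the declaration order below, the fields end up laid
out roughly as the comments sketch.  The offsets are illustrative
only, assuming a 64-bit Hotspot JVM with the 2-word (16-byte) header
described above and uncompressed references:

public class Example
{
    boolean flag;  // declared first, but 1-byte fields are packed late
    long    count; // 8-byte fields are laid out first
    int     id;
    Object  ref;   // references come last
}

// Illustrative layout following the ordering rules above:
//   0-15  object header (mark word + class reference)
//  16-23  long    count
//  24-27  int     id
//  28     boolean flag
//  32-39  Object  ref    (3 bytes of padding before it for alignment)
// 40 bytes total, already a multiple of the 8-byte alignment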

With this knowledge we can pad a cache line between any fields with 7
longs.  Within the Disruptor we pad cache lines around the RingBuffer
cursor and BatchEventProcessor sequences.

To show the performance impact, let’s take a few threads, each
updating its own independent counter.  These counters will be volatile
longs so the world can see their progress.


public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change to match the number of cores being tested
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new VolatileLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // padding: comment out to reintroduce false sharing
    }
}

Results

Running the above code while ramping the number of threads and
adding/removing the cache line padding, I get the results depicted in
Figure 2 below.  This measures the duration of the test runs on my
4-core Nehalem.

Figure 2. Duration of the test runs as the number of threads increases, with and without cache-line padding.

The impact of false sharing can clearly be seen in the increased
execution time required to complete the test.  Without the cache-line
contention we achieve near-linear scale-up with the number of threads.

This is not a perfect test, because we cannot be sure where the
VolatileLongs will be laid out in memory.  They are independent
objects.  However, experience shows that objects allocated at the same
time tend to be co-located.
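
A further caveat, explored in the follow-on post noted below: some JVM
versions can eliminate padding fields they consider unused.  One
mitigation, in the spirit of what later Disruptor versions do, is to
split the padding across a small class hierarchy so the hot field is
fenced on both sides by superclass and subclass fields.  A hedged
sketch:

class LhsPadding
{
    protected long p1, p2, p3, p4, p5, p6, p7; // padding before the hot field
}

class Value extends LhsPadding
{
    protected volatile long value; // superclass fields are laid out first
}

public class PaddedLong extends Value
{
    protected long p9, p10, p11, p12, p13, p14, p15; // padding after it

    public long get()
    {
        return value;
    }

    public void set(final long v)
    {
        value = v;
    }
}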

So there you have it.  False sharing can be a silent performance killer.

Note: Please read my further adventures with false sharing in the follow-on post, “False Sharing && Java 7” (http://mechanical-sympathy.blogspot.hk/2011/08/false-sharing-java-7.html).
