Inter Thread Latency

Original article: http://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html (blocked in mainland China)

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores.   The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects.  In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads.  These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java.  For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions.  In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we’ll use two counters each updated by their own thread.  A simple ping-pong algorithm will be used to signal from one to the other and back again.  The exchange will be repeated millions of times to measure the average latency between cores.  This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters; for each update the JVM kindly inserts a lock instruction, giving us an effective memory barrier.

public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the <a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html">GNU Atomic Builtins</a>, which give us a similar lock instruction insertion to the one the JVM uses.

#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %llu\n", duration);
    printf("ns per op = %llu\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %llu\n",
        ((ITERATIONS * 2LL * 1000LL * 1000LL * 1000LL) / duration));
    printf("s1 = %llu, s2 = %llu\n", s1, s2);

    return 0;
}

Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandy Bridge laptop.  So what does this tell us?  That the latency between 2 cores on a 2.2 GHz machine is ~45ns, and that you can exchange 22m messages per second in a serial fashion.  On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores.  Note that each core has a 96 GB/s port onto the L3 cache ring bus, yet 22m * 64 bytes is only 1.4 GB/s.  This is because we have measured latency, not throughput.  We could easily fit some nice fat messages between those memory barriers as part of the exchange, provided the data is written before the lock instruction executes.
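That last point can be sketched in Java: under the Java Memory Model, plain writes made before a volatile store become visible to any thread that subsequently reads the volatile, so an arbitrarily fat payload rides on the single memory barrier. This is a minimal single-writer sketch, and the field names are hypothetical:

```java
public final class PayloadExchange
{
    // Plain payload fields: no fence of their own.
    public static long payloadA, payloadB, payloadC;
    // One volatile sequence carries the only memory barrier.
    public static volatile long sequence;

    public static void publish(final long a, final long b, final long c)
    {
        payloadA = a; // plain stores, reordered freely amongst themselves
        payloadB = b;
        payloadC = c;
        sequence++;   // the volatile store publishes everything written above
    }

    public static long consume(final long expected)
    {
        while (sequence < expected)
        {
            // busy spin
        }
        // The volatile read above guarantees these plain reads see the stores.
        return payloadA + payloadB + payloadC;
    }

    public static void main(final String[] args)
    {
        publish(1, 2, 3);
        System.out.println(consume(1)); // prints 6
    }
}
```

Note this relies on there being a single writer; with multiple producers the plain increments and stores would race.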

So what does this all mean for the Disruptor?  Basically, the latency of the Disruptor is about as low as we can get from Java.  It would be possible to get a ~10% latency improvement by moving to C++, and I’d expect a similar improvement in throughput.  The main win with C++ would be the control, and therefore the predictability, that comes with it if used correctly.  The JVM gives us nice safety features like garbage collection in complex applications, but we pay a little for that with the extra instructions it inserts, which can be seen by getting Hotspot to dump the assembler it is generating (-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly).

How does the Disruptor achieve more than 25m messages per second, I hear you ask?  Well, that is one of the neat parts of its design.  The “waitFor” semantics on the SequenceBarrier enable a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier per event.  For real world applications this batching effect is really significant.  For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do between 20 and 30 million exchanges per second at a latency of less than 50ns.  The Disruptor design achieves greater throughput than this without explicit batching on the publisher side.  In addition, the Disruptor has an explicit batching API on the publisher side that can deliver over 100 million messages per second.
