CPU Cache Flushing Fallacy

Original article: http://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html (mirrored here because the original site is blocked)

Author: Mechanical Sympathy

Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to “flush”.  This seems to be illustrating a very common fallacy about how CPU caches work, and how the cache sub-system interacts with the execution cores.  In this article I will attempt to explain the function CPU caches fulfil, and how the cores, which execute our programs of instructions, interact with them.  For a concrete example I will dive into one of the latest Intel x86 server CPUs.  Other CPUs use similar techniques to achieve the same ends.

Most modern systems that execute our programs are shared-memory multi-processor systems in design.  A shared-memory system has a single memory resource that is accessed by 2 or more independent CPU cores.  Latency to main memory is highly variable from 10s to 100s of nanoseconds.  Within 100ns it is possible for a 3.0GHz CPU to process up to 1200 instructions.  Each Sandy Bridge core is capable of retiring up to 4 instructions-per-cycle (IPC) in parallel.  CPUs employ cache sub-systems to hide this latency and allow them to exercise their huge capacity to process instructions.  Some of these caches are small, very fast, and local to each core; others are slower, larger, and shared across cores.  Together with registers and main-memory, these caches make up our non-persistent memory hierarchy.

Next time you are developing an important algorithm, try pondering that a cache-miss is a lost opportunity to have executed ~500 CPU instructions!  This is for a single-socket system, on a multi-socket system you can effectively double the lost opportunity as memory requests cross socket interconnects.
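
To make that loss tangible, here is a minimal sketch, not a rigorous benchmark (for that reach for a harness such as JMH), that contrasts a prefetch-friendly sequential walk over a large array with a random walk over the same data.  The array size and class name are illustrative assumptions rather than anything from the hardware manuals.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only: contrasts prefetch-friendly sequential access
// with cache-hostile random access over an array larger than the L3 cache.
public class CacheMissCost {
    private static final int SIZE = 16 * 1024 * 1024; // 16M ints = 64MB, bigger than a 20MB L3

    public static void main(String[] args) {
        int[] data = new int[SIZE];
        int[] randomIndex = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            data[i] = i;
            randomIndex[i] = ThreadLocalRandom.current().nextInt(SIZE);
        }

        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += data[i];                    // sequential: the hardware prefetcher hides most misses
        }
        long sequentialNs = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < SIZE; i++) {
            sum += data[randomIndex[i]];       // random: mostly L3/DRAM misses
        }
        long randomNs = System.nanoTime() - start;

        System.out.printf("sequential: %d ms, random: %d ms (checksum %d)%n",
                sequentialNs / 1_000_000, randomNs / 1_000_000, sum);
    }
}
```

On typical hardware the random walk is several times slower, and nearly all of the difference is cache-miss latency rather than extra arithmetic.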

Memory Hierarchy

Figure 1. Memory hierarchy of a circa 2012 Sandy Bridge E class server.

For the circa 2012 Sandy Bridge E class servers our memory hierarchy can be decomposed as follows:

  1. Registers: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers.  These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores.  Compilers will allocate our local variables and function arguments to these registers.  When hyperthreading is enabled these registers are shared between the co-located hyperthreads.
  2. Memory Ordering Buffers (MOB): The MOB is comprised of a 64-entry load and 36-entry store buffer.  These buffers are used to track in-flight operations while waiting on the cache sub-system.  The store buffer is a fully associative queue that can be searched for existing store operations, which have been queued when waiting on the L1 cache.  These buffers enable our fast processors to run asynchronously while data is transferred to and from the cache sub-system.  When the processor issues asynchronous reads and writes then the results can come back out-of-order.  The MOB is used to disambiguate the load and store ordering for compliance to the published memory model.
  3. Level 1 Cache: The L1 is a core-local cache split into separate 32K data and 32K instruction caches.  Access time is 3 cycles and can be hidden as instructions are pipelined by the core for data already in the L1 cache.
  4. Level 2 Cache: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache.  The L2 cache is 256K in size and acts as an effective queue of memory accesses between the L1 and L3.  L2 contains both data and instructions.  L2 access latency is 12 cycles.
  5. Level 3 Cache: The L3 cache is shared across all cores within a socket.  The L3 is split into 2MB segments each connected to a ring-bus network on the socket.  Each core is also connected to this ring-bus.  Addresses are hashed to segments for greater throughput.  Latency can be up to 38 cycles depending on cache size, with each additional hop around the ring taking an additional cycle.  Cache size can be up to 20MB depending on the number of segments.  The L3 cache is inclusive of all data in the L1 and L2 for each core on the same socket.  This inclusiveness, at the cost of space, allows the L3 cache to intercept requests thus removing the burden from private core-local L1 & L2 caches.
  6. Main Memory: DRAM channels are connected to each socket with an average latency of ~65ns for socket local access on a full cache-miss.  This is however extremely variable, being much less for subsequent accesses to columns in the same row buffer, through to significantly more when queuing effects and memory refresh cycles conflict.  4 memory channels are aggregated together on each socket for throughput, and to hide latency via  pipelining on the independent memory channels.
  7. NUMA: In a multi-socket server we have non-uniform memory access.  It is non-uniform because the required memory may be on a remote socket having an additional 40ns hop across the QPI bus.  Sandy Bridge is a major step forward for 2-socket systems over Westmere and Nehalem.  With Sandy Bridge the QPI limit has been raised from 6.4GT/s to 8.0GT/s, and two lanes can be aggregated thus eliminating the bottleneck of the previous systems.  For Nehalem and Westmere the QPI link is only capable of ~40% the bandwidth that could be delivered by the memory controller for an individual socket.  This limitation made accessing remote memory a choke point.  In addition, the QPI link can now forward pre-fetch requests which previous generations could not.
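
On Linux the parameters above, such as level, size, line size, and associativity, are exported under /sys/devices/system/cpu/cpu0/cache/.  The sketch below simply dumps those sysfs files; exact fields and availability vary by kernel and CPU, so treat it as a starting point to verify against your own machine rather than a definitive tool.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch: print the cache hierarchy reported by Linux sysfs for CPU 0.
// Fields vary by kernel/CPU; missing files are reported as "?".
public class CacheTopology {
    public static void main(String[] args) throws IOException {
        Path base = Paths.get("/sys/devices/system/cpu/cpu0/cache");
        try (DirectoryStream<Path> indexes = Files.newDirectoryStream(base, "index*")) {
            for (Path index : indexes) {
                System.out.printf("L%s %s: size=%s, line=%s bytes, ways=%s%n",
                        read(index, "level"),
                        read(index, "type"),
                        read(index, "size"),
                        read(index, "coherency_line_size"),
                        read(index, "ways_of_associativity"));
            }
        }
    }

    private static String read(Path dir, String file) {
        try {
            return Files.readString(dir.resolve(file)).trim();
        } catch (IOException e) {
            return "?";
        }
    }
}
```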

Associativity Levels

Caches are effectively hardware based hash tables.  The hash function is usually a simple masking of some low-order bits for cache indexing.  Hash tables need some means to handle a collision for the same slot.  The associativity level is the number of slots, also known as ways, within each set that can be used to hold a hashed version of an address.  Having more levels of associativity is a trade-off between storing more data vs. power requirements and time to search each of the ways.

For Sandy Bridge the L1D and L2 are 8-way associative, and the L3 is 12-way associative.
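
To make the "masking of some low-order bits" concrete, the following sketch computes which L1D set an address maps to using the Sandy Bridge figures above: 32KB, 8-way, 64-byte lines gives 32768 / (8 × 64) = 64 sets.  The example address is made up.

```java
// Sketch: which L1D set does an address map to?
// Sandy Bridge L1D: 32KB, 8-way, 64-byte lines => 64 sets.
public class CacheSetIndex {
    static final int LINE_SIZE = 64;
    static final int WAYS = 8;
    static final int CACHE_SIZE = 32 * 1024;
    static final int SETS = CACHE_SIZE / (WAYS * LINE_SIZE); // 64

    static int setIndex(long address) {
        long lineNumber = address / LINE_SIZE;   // drop the 6 offset bits
        return (int) (lineNumber % SETS);        // mask the next 6 bits as the set index
    }

    public static void main(String[] args) {
        long address = 0x7ffd_1234_5678L;        // made-up example address
        System.out.printf("address 0x%x -> set %d of %d%n", address, setIndex(address), SETS);
    }
}
```

It also follows that any two addresses 4KB apart (64 sets × 64 bytes) compete for the same 8 ways.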

Cache Coherence

With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory.  The cache sub-system is considered the “source of truth” for mainstream systems.  If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory.  This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place.  An x86 cache works on blocks of data that are 64-bytes in size, known as a cache-line.  Other processors can use a different size for the cache-line.  A larger cache-line size reduces effective latency at the expense of increased bandwidth requirements.
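
The 64-byte line granularity can be observed from user space: touching one int per line costs roughly the same as touching every int, because each miss pulls in the whole line anyway.  The sketch below, with an illustrative array size, times updates at increasing strides over an array much larger than the L3.

```java
// Sketch: updates at different strides over an array much larger than L3.
// Strides up to 16 ints (64 bytes) still touch every cache line, so they
// take similar time; larger strides skip whole lines and get cheaper.
public class CacheLineStride {
    public static void main(String[] args) {
        int[] data = new int[64 * 1024 * 1024]; // 256MB of ints

        for (int stride = 1; stride <= 256; stride *= 2) {
            long start = System.nanoTime();
            for (int i = 0; i < data.length; i += stride) {
                data[i]++;                       // one write per visited element
            }
            long ns = System.nanoTime() - start;
            System.out.printf("stride %3d ints (%5d bytes): %d ms%n",
                    stride, stride * 4, ns / 1_000_000);
        }
    }
}
```

Typically the times stay roughly flat up to a stride of 16 ints (one 64-byte line) and then start falling as whole lines are skipped.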

To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states.  The protocol Intel employs for this is MESIF; AMD employs a variant known as MOESI.  Under the MESIF protocol each cache-line can be in 1 of the 5 following states:

  1. Modified: Indicates the cache-line is dirty and must be written back to memory at a later stage.  When written back to main-memory the state transitions to Exclusive.
  2. Exclusive: Indicates the cache-line is held exclusively and that it matches main-memory.  When written to, the state then transitions to Modified.  To achieve this state a Request-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies.
  3. Shared: Indicates a clean copy of a cache-line that matches main-memory.
  4. Invalid: Indicates an unused cache-line.
  5. Forward: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system.

To transition from one state to another, a series of messages are sent between the caches to effect state changes.  Previous to Nehalem for Intel, and Opteron for AMD, this cache coherence traffic between sockets had to share the memory bus which greatly limited scalability.  These days the memory controller traffic is on a separate bus.  The Intel QPI, and AMD HyperTransport, buses are used for cache coherence between sockets.
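
As a purely illustrative aid, here is a toy model of the transitions described above for a single cache-line tracked across several caches.  It is a drastic simplification of MESIF as summarised in this article: there is no Forward-state election and no modelling of the ring-bus or memory controller, just the state changes named in the list.

```java
import java.util.Arrays;

// Toy MESIF sketch: one cache-line, N caches, only the transitions named above.
public class MesifToy {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID, FORWARD }

    private final State[] line;

    MesifToy(int caches) {
        line = new State[caches];
        Arrays.fill(line, State.INVALID);
    }

    // A write requires ownership: broadcast an RFO, invalidating all other copies.
    void write(int cache) {
        for (int i = 0; i < line.length; i++) {
            if (i != cache) line[i] = State.INVALID;    // invalidate broadcast
        }
        line[cache] = State.MODIFIED;                    // dirty, must be written back later
    }

    // Writing the dirty line back to main-memory: Modified -> Exclusive.
    void writeBack(int cache) {
        if (line[cache] == State.MODIFIED) line[cache] = State.EXCLUSIVE;
    }

    // A read from another cache: snoop any dirty copy out, share the clean copies,
    // or take the line Exclusive if nobody else holds it.
    void read(int cache) {
        if (line[cache] != State.INVALID) return;        // already valid, no bus traffic
        boolean foundCopy = false;
        for (int i = 0; i < line.length; i++) {
            if (line[i] == State.MODIFIED) writeBack(i); // snoop forces the dirty copy back
            if (line[i] != State.INVALID) {
                line[i] = State.SHARED;
                foundCopy = true;
            }
        }
        line[cache] = foundCopy ? State.SHARED : State.EXCLUSIVE;
    }

    public static void main(String[] args) {
        MesifToy socket = new MesifToy(4);
        socket.write(0);                 // core 0: RFO, line becomes Modified
        socket.read(2);                  // core 2: snoop, both copies end up Shared
        System.out.println(Arrays.toString(socket.line));
    }
}
```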

The cache controller exists as a module within each L3 cache segment that is connected to the on-socket ring-bus network.  Each core, L3 cache segment, QPI controller, memory controller, and integrated graphics sub-system are connected to this ring-bus.  The ring is made up of 4 independent lanes for: request, snoop, acknowledge, and 32-bytes data per cycle.  The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3.  This provides for rapid identification of the core containing a modified line when snooping for changes.  The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns.

If a core wants to read some memory, and it does not have it in a Shared, Exclusive, or Modified state; then it must make a read on the ring bus.  It will then either be read from main-memory if not in the cache sub-systems, or read from L3 if clean, or snooped from another core if Modified.  In any case the read will never return a stale copy from the cache sub-system, it is guaranteed to be coherent.

Concurrent Programming

If our caches are always coherent then why do we worry about visibility when writing concurrent programs?  This is because within our cores, in their quest for ever greater performance, data modifications can appear out-of-order to other threads.  There are 2 major reasons for this.

Firstly, our compilers can generate programs that store variables in registers for relatively long periods of time for performance reasons, e.g. variables used repeatedly within a loop.  If we need these variables to be visible across cores then the updates must not be register allocated.  This is achieved in C by qualifying a variable as “volatile”.  Beware that C/C++ volatile is inadequate for telling the compiler not to reorder other instructions.  For this you need memory fences/barriers.
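
In Java the guard against register allocation is also the volatile keyword, which additionally carries the ordering guarantees discussed below.  A minimal sketch of the classic symptom: without volatile the JIT is free to hoist the flag read out of the loop, and the spinning thread may never observe the update.

```java
// Sketch: a stop flag read in a tight loop. Without 'volatile' the JIT may
// hoist the read into a register and the loop may never see the update.
public class VisibilitySketch {
    private static volatile boolean running = true;  // try removing 'volatile' and see if it ever stops

    public static void main(String[] args) throws InterruptedException {
        Thread spinner = new Thread(() -> {
            long iterations = 0;
            while (running) {
                iterations++;                  // busy loop reading the flag
            }
            System.out.println("stopped after " + iterations + " iterations");
        });
        spinner.start();

        Thread.sleep(1000);                    // let the loop get hot
        running = false;                       // volatile write: visible to the spinner
        spinner.join();
    }
}
```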

The second major issue with ordering we have to be aware of is a thread could write a variable and then, if it reads it shortly after, could see the value in its store buffer which may be older than the latest value in the cache sub-system.  This is never an issue for algorithms following the Single Writer Principle but is an issue for the likes of the Dekker and Peterson lock algorithms.  To overcome this issue, and ensure the latest value is observed, the thread must not load the value from its local store buffer.  This can be achieved by issuing a fence instruction which prevents the subsequent load getting ahead of a store from another thread.  The write of a volatile variable in Java, in addition to never being register allocated, is accompanied by a full fence instruction.  This fence instruction on x86 has a significant performance impact by preventing progress on the issuing thread until the store buffer is drained.  Fences on other processors can have more efficient implementations that simply put a marker in the store buffer for the search boundary, e.g. the Azul Vega does this.

If you want to ensure memory ordering across Java threads when following the Single Writer Principle, and avoid the store fence, it is possible by using the j.u.c.Atomic(Int|Long|Reference).lazySet() method, as opposed to setting a volatile variable.
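
Here is a minimal sketch of that pattern, assuming a single dedicated writer thread: lazySet performs an ordered store without the full fence a volatile write would issue, so the writer is not stalled waiting for its store buffer to drain, while readers still observe the published values in order.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: Single Writer Principle with lazySet. Only one thread ever writes
// the sequence, so the cheaper ordered store can replace the volatile write.
public class SingleWriterSequence {
    private final AtomicLong sequence = new AtomicLong(0);

    // Called by the single writer thread only.
    void publish(long value) {
        // ... write the data belonging to 'value' to plain fields here ...
        sequence.lazySet(value);   // ordered store: no full fence, no wait for the store buffer to drain
    }

    // Readers may briefly see a slightly stale value, but never an out-of-order one.
    long lastPublished() {
        return sequence.get();
    }

    public static void main(String[] args) throws InterruptedException {
        SingleWriterSequence seq = new SingleWriterSequence();
        Thread writer = new Thread(() -> {
            for (long i = 1; i <= 1_000_000; i++) {
                seq.publish(i);
            }
        });
        writer.start();
        writer.join();
        System.out.println("last published: " + seq.lastPublished());
    }
}
```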

The Fallacy

Returning to the fallacy of “flushing the cache” as part of a concurrent algorithm.  I think we can safely say that we never “flush” the CPU cache within our user space programs.  I believe the source of this fallacy is the need to flush, mark, or drain the store buffer to a point for some classes of concurrent algorithms, so that the latest value can be observed on a subsequent load operation.  For this we require a memory ordering fence and not a cache flush.

Another possible source of this fallacy is that L1 caches, or the TLB, may need to be flushed based on address indexing policy on a context switch.  ARM, previous to ARMv6, did not use address space tags on TLB entries thus requiring the whole L1 cache to be flushed on a context switch.  Many processors require the L1 instruction cache to be flushed for similar reasons; in many cases this is simply because instruction caches are not required to be kept coherent.  The bottom line is, context switching is expensive and a bit off topic, so in addition to the cache pollution of the L2, a context switch can also cause the TLB and/or L1 caches to require a flush.  Intel x86 processors require only a TLB flush on context switch.
