Memory Access Patterns Are Important

Original article: http://mechanical-sympathy.blogspot.com/2012/08/memory-access-patterns-are-important.html (reposted here because the original is blocked)

In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm.  For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory.  Bandwidth to main-memory has greatly increased via wider, multi-channel buses, however the latency has not significantly reduced.  To hide this latency our processors employ ever more complex cache sub-systems that have many layers.

The 1994 paper “Hitting the memory wall: implications of the obvious” describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses.  I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.

Let’s start putting the problem in context with some examples.  Our hardware tries to hide the main-memory latency via a number of techniques.  Basically three major bets are taken on memory access patterns:

  1. Temporal: Memory accessed recently will likely be required again soon.
  2. Spatial: Adjacent memory is likely to be required soon.
  3. Striding: Memory access is likely to follow a predictable pattern.

To illustrate these three bets in action let’s write some code and measure the results.

  1. Walk through memory in a linear fashion, being completely predictable.
  2. Pseudo-randomly walk around memory within a restricted area, then move on.  This restricted area is what is commonly known as an operating system page of memory.
  3. Pseudo-randomly walk around a large area of the heap.

Code

The following code should be run with the -Xmx4g JVM option.

public class TestMemoryAccessPatterns
{
    private static final int LONG_SIZE = 8;
    private static final int PAGE_SIZE = 2 * 1024 * 1024;
    private static final int ONE_GIG = 1024 * 1024 * 1024;
    private static final long TWO_GIG = 2L * ONE_GIG;

    private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE);
    private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE;

    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_MASK = WORDS_PER_PAGE - 1;

    private static final int PRIME_INC = 514229;

    private static final long[] memory = new long[ARRAY_SIZE];

    static
    {
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            memory[i] = 777;
        }
    }

    public enum StrideType
    {
        LINEAR_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + 1) & ARRAY_MASK;
            }
        },

        RANDOM_PAGE_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
            }
        },

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + PRIME_INC) & ARRAY_MASK;
            }
        };

        public abstract int next(int pageOffset, int wordOffset, int pos);
    }

    public static void main(final String[] args)
    {
        final StrideType strideType;
        switch (Integer.parseInt(args[0]))
        {
            case 1:
                strideType = StrideType.LINEAR_WALK;
                break;

            case 2:
                strideType = StrideType.RANDOM_PAGE_WALK;
                break;

            case 3:
                strideType = StrideType.RANDOM_HEAP_WALK;
                break;

            default:
                throw new IllegalArgumentException("Unknown StrideType");
        }

        for (int i = 0; i < 5; i++)
        {
            perfTest(i, strideType);
        }
    }

    private static void perfTest(final int runNumber, final StrideType strideType)
    {
        final long start = System.nanoTime();

        int pos = -1;
        long result = 0;
        for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
        {
            for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE;
                 wordOffset < limit;
                 wordOffset++)
            {
                pos = strideType.next(pageOffset, wordOffset, pos);
                result += memory[pos];
            }
        }

        final long duration = System.nanoTime() - start;
        final double nsOp = duration / (double)ARRAY_SIZE;

        if (208574349312L != result)
        {
            throw new IllegalStateException();
        }

        System.out.format("%d - %.2fns %s\n",
                          Integer.valueOf(runNumber),
                          Double.valueOf(nsOp),
                          strideType);
    }
}
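
The stride type is selected by the first command-line argument: 1 = LINEAR_WALK, 2 = RANDOM_PAGE_WALK, 3 = RANDOM_HEAP_WALK.  A run of the linear walk case therefore looks like, for example:

javac TestMemoryAccessPatterns.java
java -Xmx4g TestMemoryAccessPatterns 1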

Results


Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz,
Windows 7 64-bit, Java 1.7.0_05
===========================================
0 - 2.38ns LINEAR_WALK
1 - 2.41ns LINEAR_WALK
2 - 2.35ns LINEAR_WALK
3 - 2.36ns LINEAR_WALK
4 - 2.39ns LINEAR_WALK

0 - 12.45ns RANDOM_PAGE_WALK
1 - 12.27ns RANDOM_PAGE_WALK
2 - 12.17ns RANDOM_PAGE_WALK
3 - 12.22ns RANDOM_PAGE_WALK
4 - 12.18ns RANDOM_PAGE_WALK

0 - 152.86ns RANDOM_HEAP_WALK
1 - 151.80ns RANDOM_HEAP_WALK
2 - 151.72ns RANDOM_HEAP_WALK
3 - 151.91ns RANDOM_HEAP_WALK
4 - 151.36ns RANDOM_HEAP_WALK

Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz,
Windows 7 64-bit, Java 1.7.0_05
=============================================
0 - 1.06ns LINEAR_WALK
1 - 1.05ns LINEAR_WALK
2 - 0.98ns LINEAR_WALK
3 - 1.00ns LINEAR_WALK
4 - 1.00ns LINEAR_WALK

0 - 3.80ns RANDOM_PAGE_WALK
1 - 3.85ns RANDOM_PAGE_WALK
2 - 3.79ns RANDOM_PAGE_WALK
3 - 3.65ns RANDOM_PAGE_WALK
4 - 3.64ns RANDOM_PAGE_WALK

0 - 30.04ns RANDOM_HEAP_WALK
1 - 29.05ns RANDOM_HEAP_WALK
2 - 29.14ns RANDOM_HEAP_WALK
3 - 28.88ns RANDOM_HEAP_WALK
4 - 29.57ns RANDOM_HEAP_WALK

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz,
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 0.91ns LINEAR_WALK
1 - 0.92ns LINEAR_WALK
2 - 0.88ns LINEAR_WALK
3 - 0.89ns LINEAR_WALK
4 - 0.89ns LINEAR_WALK

0 - 3.29ns RANDOM_PAGE_WALK
1 - 3.35ns RANDOM_PAGE_WALK
2 - 3.33ns RANDOM_PAGE_WALK
3 - 3.31ns RANDOM_PAGE_WALK
4 - 3.30ns RANDOM_PAGE_WALK

0 - 9.58ns RANDOM_HEAP_WALK
1 - 9.20ns RANDOM_HEAP_WALK
2 - 9.44ns RANDOM_HEAP_WALK
3 - 9.46ns RANDOM_HEAP_WALK
4 - 9.47ns RANDOM_HEAP_WALK

Analysis

I ran the code on 3 different CPU architectures illustrating generational steps forward for Intel.  It is clear from the results that each generation has become progressively better at hiding the latency to main-memory, based on the 3 bets described above, for a relatively small heap.  This is because the size and sophistication of the various caches keep improving.  However, as memory size increases they become less effective.  For example, if the array is doubled to 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.

It seems that for the linear walk case, memory latency does not exist.  However, as we walk around memory in an ever more random pattern, the latency starts to become very apparent.

The random heap walk produced an interesting result.  This is our worst-case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively, based on memory controller and memory module latencies.  For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array, resulting in ~55ns on average per iteration.  The i7-2760QM has larger load buffers and TLB caches, and Linux is running with transparent huge pages, which are all working to further hide the latency.  By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem.  I’d like to test this on a much larger heap with Sandy Bridge.

The main takeaway is that the more predictable the pattern of access to memory, the better the cache sub-systems are at hiding main-memory latency.  Let’s look at these cache sub-systems in a little detail to try and understand the observed results.

Hardware Components

We have many layers of cache, plus the pre-fetchers, to consider for how latency gets hidden.  In this section I’ll try and cover the major components used to hide latency that our hardware and systems software friends have put in place.  We will investigate these latency-hiding components and use the Linux perf and Lightweight Performance Counters utilities to retrieve the performance counters from our CPUs, which tell us how effective these components are when we execute our programs.  Performance counters are CPU specific and what I’ve used here are specific to Sandy Bridge.

Data Caches
Processors typically have 2 or 3 layers of data cache.  Each layer as we move out is progressively larger with increasing latency.  The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.

Data caches are effectively hardware hash tables with a fixed number of slots for each hash value.  These slots are known as “ways”.  An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location.  Within these slots the data caches do not store words, they store cache-lines of multiple words.  For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine.  This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.
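
To make the hardware hash table analogy concrete, the following sketch shows how an address is split into a line offset and a set index, assuming the 32KB, 8-way, 64-byte-line L1D quoted above; the constants and the example address are purely illustrative.

public class CacheMapping
{
    private static final int LINE_SIZE = 64;                          // bytes per cache-line
    private static final int CACHE_SIZE = 32 * 1024;                  // 32KB L1D
    private static final int WAYS = 8;                                // slots per set
    private static final int SETS = CACHE_SIZE / (LINE_SIZE * WAYS);  // 64 sets

    public static void main(final String[] args)
    {
        final long address = 0x7f3a12345678L;            // arbitrary example address
        final long lineNumber = address / LINE_SIZE;     // which cache-line the address falls in
        final long setIndex = lineNumber % SETS;         // which set ("hash bucket") that line maps to
        final long offsetInLine = address % LINE_SIZE;   // byte offset within the 64-byte line

        System.out.format("line=%d set=%d offset=%d%n", lineNumber, setIndex, offsetInLine);
    }
}

Every address whose cache-line maps to the same set competes for one of the 8 ways in that set, which is why the cache behaves like a hash table with fixed-size buckets.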

Data cache entries are typically evicted in an LRU manner.  Caches work by using a write-back algorithm whereby stores need only be propagated to main-memory when a modified cache-line is evicted.  This gives rise to the interesting phenomenon that a load can cause a write-back to the outer cache layers and eventually to main-memory.


perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,626,053 L1-dcache-loads
       274,255,164 L1-dcache-misses
         #   18.32% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,537,057,965 L1-dcache-loads
     1,570,105,933 L1-dcache-misses
         #  102.15% of all L1-dcache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,321,888,497 L1-dcache-loads
     1,780,223,433 L1-dcache-misses
         #   41.19% of all L1-dcache hits  

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
|     L2_RQSTS_MISS     | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
|     Metric      |  core 2   |
+-----------------+-----------+
|   Runtime [s]   |  2.15481  |
|       CPI       | 0.867293  |
| L2 request rate |  0.18028  |
|  L2 miss rate   | 0.0546988 |
|  L2 miss ratio  | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
|   L3_LAT_CACHE_MISS    | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
|     L2_RQSTS_MISS     | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 6.87876  |
|       CPI       | 1.10714  |
| L2 request rate | 0.22925  |
|  L2 miss rate   | 0.104502 |
|  L2 miss ratio  | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
|   L3_LAT_CACHE_MISS    | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
|     L2_RQSTS_MISS     | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   |  17.474  |
|       CPI       |  6.4418  |
| L2 request rate | 0.71973  |
|  L2 miss rate   | 0.220838 |
|  L2 miss ratio  | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
|   L3_LAT_CACHE_MISS    | 1.34832e+09 |
+------------------------+-------------+

Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.

Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses.  Virtual memory systems do this by mapping pages.  We need to know the offset for a given page and its size for any memory operation.  Typically page sizes are 4KB and are gradually moving to 2MB and greater.  Linux introduced Transparent Huge Pages in the 2.6.38 kernel giving us 2MB pages.  The translation of virtual memory pages to physical pages is maintained by the page table.  This translation can require multiple accesses to the page table which is a huge performance penalty.  To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache.  A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache.  By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.
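
As a rough illustration of that last point, the sketch below computes the address range (the “reach”) a TLB can cover for 4KB and 2MB pages, assuming a hypothetical 64-entry data TLB; the entry count is illustrative only.

public class TlbReach
{
    public static void main(final String[] args)
    {
        final int entries = 64;                     // hypothetical data TLB entry count
        final long smallPage = 4L * 1024;           // 4KB pages
        final long hugePage = 2L * 1024 * 1024;     // 2MB huge pages

        System.out.format("4KB pages: %dKB reach%n", (entries * smallPage) / 1024);          // 256KB
        System.out.format("2MB pages: %dMB reach%n", (entries * hugePage) / (1024 * 1024));  // 128MB
    }
}

With 4KB pages the same 64 entries cover only 256KB of address space, whereas with 2MB pages they cover 128MB, so a random walk over a 2GB heap will still miss the TLB frequently while a walk confined to a single 2MB page will not.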


perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,128,634 dTLB-loads
           310,901 dTLB-misses
              #    0.02% of all dTLB cache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,551,585,263 dTLB-loads
           340,230 dTLB-misses
              #    0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,031,344,537 dTLB-loads
     1,345,807,418 dTLB-misses
              #   33.38% of all dTLB cache hits

Note: We only incur significant TLB misses when randomly walking the whole heap, and that is with huge pages employed.

Hardware Pre-Fetchers
Hardware will try and predict the next memory access our programs will make and speculatively load that memory into fill buffers.  This is done at its simplest level by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride-based access patterns, typically less than 2KB in stride length.  In the tests below we measure the number of loads that hit a fill buffer from a hardware pre-fetch.


likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+

Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.
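
The striding bet shows up in everyday code too.  As a simple illustration (not part of the original benchmark), consider traversing a two-dimensional array: each row of a Java long[][] is a contiguous array, so walking row by row gives the pre-fetcher a fixed 8-byte stride to follow, whereas walking column by column jumps to a different row object on every access.  A minimal sketch, with illustrative sizes:

public class TraversalOrder
{
    private static final int ROWS = 2048;
    private static final int COLS = 2048;
    private static final long[][] grid = new long[ROWS][COLS];

    public static void main(final String[] args)
    {
        long sum = 0;

        // Row-major: consecutive accesses are adjacent in memory, pre-fetcher friendly
        for (int r = 0; r < ROWS; r++)
        {
            for (int c = 0; c < COLS; c++)
            {
                sum += grid[r][c];
            }
        }

        // Column-major: consecutive accesses land in different row arrays,
        // producing a stride far larger than the pre-fetcher will track
        for (int c = 0; c < COLS; c++)
        {
            for (int r = 0; r < ROWS; r++)
            {
                sum += grid[r][c];
            }
        }

        System.out.println(sum);
    }
}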

Memory Controllers and Row Buffers
Beyond our last level cache (LLC) sit the memory controllers that manage access to the SDRAM banks.  Memory is organised into rows and columns.  To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word.  The row is typically a page in size and is loaded into a row buffer.  Even at this stage the hardware is still helping hide the latency.  A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible.

Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket.  This move to on-socket memory controllers gave an ~50ns latency reduction over the older front side bus (FSB) and external Northbridge memory controllers.  Systems with multiple sockets employ memory interconnects, e.g. QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket.  The presence of these interconnects gives rise to the non-uniform nature of server memory access.  In a 2-socket system memory may be local or 1 hop away.  On an 8-socket system memory can be up to 3 hops away, where each hop adds ~20ns latency in each direction.

What does this mean for algorithms?

The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns.  If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.

Is there anything we can do about this when designing algorithms and data-structures?  Yes, there is a lot we can do.  If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster.  For example, rather than using bucket-and-chain hash tables, like those in the JDK, we can employ hash tables using open addressing with linear probing.  Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.
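
A minimal sketch of what open addressing with linear probing looks like, assuming a power-of-two capacity and non-negative long keys; there is no resizing or deletion, it is purely illustrative.

public class LinearProbeMap
{
    private static final long EMPTY = -1L;

    private final long[] keys;
    private final long[] values;
    private final int mask;

    public LinearProbeMap(final int capacity)          // capacity must be a power of two
    {
        keys = new long[capacity];
        values = new long[capacity];
        mask = capacity - 1;
        java.util.Arrays.fill(keys, EMPTY);
    }

    public void put(final long key, final long value)
    {
        int index = (int)(key & mask);
        while (keys[index] != EMPTY && keys[index] != key)
        {
            index = (index + 1) & mask;                // linear probe: step to the adjacent slot
        }
        keys[index] = key;
        values[index] = value;
    }

    public long get(final long key, final long missingValue)
    {
        int index = (int)(key & mask);
        while (keys[index] != EMPTY)
        {
            if (keys[index] == key)
            {
                return values[index];
            }
            index = (index + 1) & mask;
        }
        return missingValue;
    }
}

Because the keys and values live in flat arrays and a probe sequence visits adjacent slots, a lookup that misses its first slot usually stays within the same cache-line, whereas a bucket-and-chain table follows a reference to a node that may sit anywhere on the heap.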

Research is advancing on algorithmic approaches that work in harmony with cache sub-systems.  One area I find fascinating is Cache Oblivious Algorithms.  The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel.  This article is a great illustration of the performance benefits that can be gained.

Conclusion

To achieve great performance it is important to have sympathy for the cache sub-systems.  We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches.  When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm.  This is not what we were taught in algorithm theory when studying computer science.  The last decade has seen some fundamental changes in technology.  For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.

Update: 06-August-2012
Trying to design a random walk algorithm for all processors and memory sizes is tricky.  If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster.  The point is that performance will be very unpredictable when you walk around memory in a random fashion.  I’ve also included the L3 cache counters for more detail in all the tests.


private static final long LARGE_PRIME_INC = 70368760954879L;

    RANDOM_HEAP_WALK
    {
        public int next(final int pageOffset, final int wordOffset, final int pos)
        {
            return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
        }
    };


Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz,
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     9,444,928,682 dTLB-loads
     4,371,982,327 dTLB-misses
         #   46.29% of all dTLB cache hits 

     9,390,675,639 L1-dcache-loads
     1,471,647,016 L1-dcache-misses
         #   15.67% of all L1-dcache hits  

+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
|     L2_RQSTS_MISS     | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 55.0094  |
|       CPI       | 17.0801  |
| L2 request rate | 1.10108  |
|  L2 miss rate   | 0.362611 |
|  L2 miss ratio  | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
|        Event           |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+