What optimization techniques can I use on an AMD machine?

I am writing code in C and compiling with gcc 4.4.0. Also, is there any way I can put specific parts of my code in the L1 cache? On a TI DSP, for example, we use a .constant section to place frequently used tables in cache.

2 Replies

Yes, here's how:

- You have to write SSE code, either in asm or with compiler intrinsics. The L1 cache is built to keep up with the speed of those instructions.

- The working set of your algorithm (the memory it touches at any one time) must fit into the L1 cache.

- You have to do 16-byte-aligned reads and writes (movaps, movdqa).

- If you're not accessing memory in a linear fashion, use prefetchnta to hint to the CPU which memory you're going to use soon.

- There are also write instructions that bypass the caches (non-temporal stores such as movntps and movntdq), for values you're not going to use again soon.

It's not easy, but this is the only way to write, for example, a memcpy for a small amount of data that runs at L1 speed.
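
To make the points above concrete, here is a minimal sketch of an aligned copy written with SSE2 intrinsics instead of raw asm. The function name copy_aligned16 and its bypass_cache flag are made up for illustration, and the code assumes both buffers are 16-byte aligned and the size is a multiple of 16. It is not a tuned memcpy, only the shape of one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#endif

/* Hypothetical helper (not from the thread): copy `bytes` bytes between two
 * 16-byte-aligned buffers; `bytes` must be a multiple of 16.
 * If `bypass_cache` is nonzero, non-temporal stores are used so that the
 * destination does not push other data out of the caches. */
void copy_aligned16(void *dst, const void *src, size_t bytes, int bypass_cache)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(((uintptr_t)src & 15u) == 0);
    assert((bytes & 15u) == 0);
#if defined(__SSE2__)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i, n = bytes / 16;
        for (i = 0; i < n; i++) {
            __m128i v = _mm_load_si128(s + i);   /* aligned load (movdqa) */
            if (bypass_cache)
                _mm_stream_si128(d + i, v);      /* non-temporal store (movntdq) */
            else
                _mm_store_si128(d + i, v);       /* aligned store (movdqa) */
        }
        if (bypass_cache)
            _mm_sfence();  /* order the streaming stores before later stores */
    }
#else
    (void)bypass_cache;
    memcpy(dst, src, bytes);  /* portable fallback when SSE2 is unavailable */
#endif
}
```

With gcc, a buffer can be given the required alignment with __attribute__((aligned(16))). Whether the streaming variant is faster depends on the data size and on what you do with the destination afterwards, so measure both.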

You can also use OpenCL on the CPU, but it gives you a bit less control over the code.


Any data you read automatically goes into the L1 cache.

After all, caches are built on the principle of locality.

Apart from that, x86 CPUs offer instructions for handling the data cache, such as flushing or invalidating cache lines.

You need to drop down to assembly to use them.

SSE (as realhet pointed out) also has cache-related intrinsics.
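
As a rough illustration of a cache hint from C, here is a sketch that uses gcc's __builtin_prefetch for a non-linear access pattern. The helper name sum_indexed and the look-ahead distance PREFETCH_AHEAD are assumptions chosen for the example, not recommended values. The SSE headers also provide _mm_prefetch and _mm_clflush if you prefer the intrinsics.

```c
#include <stddef.h>

/* Assumed look-ahead distance, for illustration only; tune by measuring. */
#define PREFETCH_AHEAD 8

/* Hypothetical helper: sum table[idx[0]], table[idx[1]], ... where idx[]
 * gives an arbitrary (non-linear) access order. */
long sum_indexed(const int *table, const size_t *idx, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = read access, 0 = no temporal locality.
             * On x86 gcc typically emits prefetchnta for this. */
            __builtin_prefetch(&table[idx[i + PREFETCH_AHEAD]], 0, 0);
        }
#endif
        total += table[idx[i]];
    }
    return total;
}
```

The prefetch is only a hint: the result is the same with or without it, and for small tables that already fit in L1 it will not help.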

If you are worried about the cache, you first need to look at your memory access pattern.

If you have a linear loop such as for (int i = 0; i < BIG_N - 1; i++) { ARRAY[i] += ARRAY[i+1]; }

then you are already cache-friendly. You probably need to look at vectorization to improve your code (which again is the SSE route).
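
A minimal sketch of what the SSE route could look like for that loop, assuming the array holds floats (the thread does not say what type ARRAY is) and using a made-up function name add_next:

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Computes a[i] += a[i+1] for i = 0 .. n-2, like the scalar loop. */
void add_next(float *a, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four elements per iteration. Both loads happen before the store, so
     * a[i+1] .. a[i+4] still hold their original values, exactly as in the
     * scalar loop. Unaligned loads are needed because a+i+1 cannot be
     * 16-byte aligned when a+i is. */
    for (; i + 4 < n; i += 4) {
        __m128 cur  = _mm_loadu_ps(a + i);
        __m128 next = _mm_loadu_ps(a + i + 1);
        _mm_storeu_ps(a + i, _mm_add_ps(cur, next));
    }
#endif
    /* Scalar tail (and the whole loop when SSE is unavailable). */
    for (; i + 1 < n; i++)
        a[i] += a[i + 1];
}
```

Before hand-writing this, it is worth checking whether gcc already vectorizes the plain loop at -O3.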

However, if you have strided access, you are in for trouble.

The classic case is matrix multiplication. While A's row access is linear, B's column access is completely strided, i.e. successive fetches are many bytes apart. It is possible that such fetches map to the same cache set, in which case the associativity of your cache matters.

In such cases, tiling helps: the C matrix is computed in smaller blocks, which make better use of the cache.
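
Here is a minimal sketch of tiling, assuming square row-major matrices of doubles. The function name matmul_tiled and the tile size of 32 are placeholders for illustration; a real tile size should be chosen so that the blocks in use fit in your L1 data cache.

```c
#include <stddef.h>

/* Assumed tile edge, for illustration only; tune it for the target cache. */
#define TILE 32

/* C += A * B for n x n row-major matrices, computed one tile at a time.
 * The caller must zero C first to get C = A * B. */
void matmul_tiled(const double *A, const double *B, double *C, size_t n)
{
    size_t ii, kk, jj, i, k, j;
    for (ii = 0; ii < n; ii += TILE) {
        for (kk = 0; kk < n; kk += TILE) {
            for (jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = (ii + TILE < n) ? ii + TILE : n;
                size_t k_end = (kk + TILE < n) ? kk + TILE : n;
                size_t j_end = (jj + TILE < n) ? jj + TILE : n;
                for (i = ii; i < i_end; i++) {
                    for (k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Innermost loop walks B and C along a row, so
                         * both are accessed linearly inside the tile. */
                        for (j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The i-k-j order inside a tile is a deliberate choice here: it avoids the strided column walk over B that the naive i-j-k order has.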

Hope this helped...