manfel
Journeyman III

Urgent: Cache optimization for spring-mass system in OpenCL

Hi there,

for a master's thesis that is close to submission, we urgently need help with the cache utilization of the graphics card.

We're developing a rather complex spring-mass system using OpenCL. It is based on a tetrahedral topology and uses springs on edges, triangles and tetrahedra. All these springs act on the adjacent vertices and apply forces, which are accumulated in a reduction operation.

Here's a brief description of the general algorithm:

 



  1. Compute forces on vertices per edge and store in temp buffer
  2. Compute forces on vertices per triangle and store in temp buffer
  3. Compute forces on vertices per tetrahedron and store in temp buffer
  4. Accumulate forces per vertex from temp buffers


The problem is that in the first three kernels, each edge/triangle/tetrahedron must look up the positions of its adjacent vertices. These are unordered, so the cache isn't used. The last kernel is even worse: each vertex needs to look up the forces from all adjacent edges/triangles/tetrahedra. For this, we use three index arrays pointing to the elements and then fetch the force vectors from the temp buffers of the previous kernels. These lookups are also essentially random and don't use the cache.
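
For readers following along, the lookup pattern described above can be sketched on the CPU in plain C. This is only an illustration under assumptions, not the poster's actual kernels: it covers edges only, assumes a zero-rest-length spring to keep the arithmetic short, and the names `Vec3`, `edge_forces`, `accumulate_forces`, `offset` and `slot` are hypothetical. Each loop iteration stands in for one work-item.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Steps 1-3, shown for edges only: each edge reads its two end vertices
   (an indexed, scattered read) and writes one force per endpoint into a
   temp buffer. Assumption: zero-rest-length spring, F = k * (b - a). */
void edge_forces(const Vec3 *pos, const int *edge_vtx, float k,
                 size_t n_edges, Vec3 *edge_force)
{
    for (size_t e = 0; e < n_edges; ++e) {      /* one work-item per edge */
        Vec3 a = pos[edge_vtx[2 * e]];          /* scattered read */
        Vec3 b = pos[edge_vtx[2 * e + 1]];      /* scattered read */
        Vec3 f = { k * (b.x - a.x), k * (b.y - a.y), k * (b.z - a.z) };
        edge_force[2 * e] = f;                  /* force on a, toward b */
        edge_force[2 * e + 1].x = -f.x;         /* opposite force on b */
        edge_force[2 * e + 1].y = -f.y;
        edge_force[2 * e + 1].z = -f.z;
    }
}

/* Step 4: each vertex gathers its contributions from the temp buffer.
   The temp-buffer slots belonging to vertex v are
   slot[offset[v]] .. slot[offset[v + 1] - 1]. */
void accumulate_forces(const int *offset, const int *slot,
                       const Vec3 *edge_force, size_t n_vtx, Vec3 *force)
{
    for (size_t v = 0; v < n_vtx; ++v) {        /* one work-item per vertex */
        Vec3 sum = { 0.0f, 0.0f, 0.0f };
        for (int i = offset[v]; i < offset[v + 1]; ++i) {
            Vec3 g = edge_force[slot[i]];       /* scattered read */
            sum.x += g.x;
            sum.y += g.y;
            sum.z += g.z;
        }
        force[v] = sum;
    }
}
```

Every read marked as scattered goes through an index array, so how close together neighbouring work-items read depends entirely on how vertices and elements are numbered.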

The AMD profiler tells us that the cache hit rate is close to 0% and that ALU utilization is only about 10%, which isn't surprising, as the ALUs get bored while waiting for global memory reads.

So, does anybody have suggestions on how to optimize the memory access? We believe this is a well-known problem (probably in applications other than spring-mass simulations, too).

Any help is really appreciated!!!

0 Likes
6 Replies

Caching is only supported in SDK 2.3 via the constant address space. It will be supported on global pointers in future SDKs via a combination of const + restrict or a compile-time option.
0 Likes

Huh. So, if I understand correctly, there is no point in worrying about memory access patterns at all for our global memory data at the moment??

0 Likes

manfel,

Although I am not familiar with the problem, here are a few general suggestions that might be helpful:

1. Try reducing your channel conflicts by fetching from the same channel per workgroup. This might seem contradictory, but this way you can access memory for many wavefronts simultaneously.

2. Try to reduce your GPR count or the LDS memory used by the kernel. This should allow more workgroups to fit in, which can help hide the global access latencies.

3. Using images or __constant space, if possible, is also helpful due to their high bandwidth and lower latency.

Please refer to the section "Global Memory Optimizations" of the OpenCL Programming Guide for more details.

0 Likes

manfel,
memory access patterns are very important; however, because caching on global memory is not enabled yet, there is a hard limit on the amount of bandwidth you can achieve. This makes it all the more important to make sure you are not having any bank or channel conflicts. Himanshu gives some good starting points in his post.
0 Likes

The previous advice is great.

 

Let me just add that you should consider using __local to store heavily used data, as this can significantly decrease the number of fetches. (I hope you're already doing so.)

 

Additionally, __constant memory is extremely fast, and you have 64 KB available there to be used wisely =]

0 Likes
neilbin
Journeyman III

1. Try to sort your input buffer to get coalesced access.

2. Make good use of shared (local) memory.
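
One possible reading of the sorting suggestion is to renumber the vertices on the host so that elements processed by neighbouring work-items touch neighbouring positions. The sketch below, in plain C, uses a simple "first touch" order over the element list. That choice is an assumption made for illustration (bandwidth-reducing or space-filling-curve orderings are alternatives), and the names `Vec3`, `first_touch_order` and `apply_order` are hypothetical. Whether it helps on a given card would have to be measured.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Assign new vertex indices in the order in which the element list first
   references each vertex. elem_vtx holds n_refs vertex indices (2 per
   edge, 3 per triangle, 4 per tetrahedron). new_id[old] receives the new
   index. */
void first_touch_order(const int *elem_vtx, size_t n_refs,
                       size_t n_vtx, int *new_id)
{
    int next = 0;
    for (size_t v = 0; v < n_vtx; ++v)
        new_id[v] = -1;                         /* not yet numbered */
    for (size_t i = 0; i < n_refs; ++i) {
        int v = elem_vtx[i];
        if (new_id[v] < 0)
            new_id[v] = next++;
    }
    for (size_t v = 0; v < n_vtx; ++v)          /* unreferenced vertices last */
        if (new_id[v] < 0)
            new_id[v] = next++;
}

/* Permute the position buffer and rewrite the element indices in place so
   both agree with the new numbering. pos_out must not alias pos_in. */
void apply_order(const int *new_id, int *elem_vtx, size_t n_refs,
                 const Vec3 *pos_in, Vec3 *pos_out, size_t n_vtx)
{
    for (size_t v = 0; v < n_vtx; ++v)
        pos_out[new_id[v]] = pos_in[v];
    for (size_t i = 0; i < n_refs; ++i)
        elem_vtx[i] = new_id[elem_vtx[i]];
}
```

This would run once before the simulation starts, since the topology is fixed. The per-vertex adjacency arrays used in the accumulation step would need the same remapping.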

0 Likes