For a master's thesis, we urgently need some close-to-submission help with the cache utilization of the graphics card.
We're developing a rather complex spring-mass system using OpenCL. It is based on a tetrahedral topology and uses springs on edges, triangles and tetrahedra. All these springs act on the adjacent vertices and apply forces, which are accumulated in a reduction operation.
Here's a brief description of the general algorithm:
The problem is that in the first 3 kernels, each edge/triangle/tetra must look up the positions of its adjacent vertices. These are unordered and, hence, the cache isn't used effectively. In the last kernel it's even worse: each vertex needs to look up the forces from all adjacent edges/triangles/tetras. For this, we use three arrays of indices pointing to the elements and then fetch the force vectors from the temp buffers written by the previous kernels. These lookups are also very random and don't benefit from the cache.
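To make the two access patterns concrete, here is a minimal sketch in plain C that stands in for the kernels (edges only, so it compiles without an OpenCL runtime). Everything in it is an assumption for illustration, not our actual code: the names `edge_forces`, `gather_forces`, `adj_start`, `adj_edge` and `adj_sign` are hypothetical, and a zero-rest-length spring is used only to keep the arithmetic short.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Stand-in for the per-edge kernel: one loop iteration = one work-item.
 * edge_verts holds two vertex indices per edge. The reads pos[ia] and
 * pos[ib] are the scattered lookups: neighbouring edges may reference
 * vertices that are far apart in memory.
 * Assumption: zero-rest-length spring, F = k * (b - a), acting on vertex a. */
void edge_forces(const Vec3 *pos, const int *edge_verts, size_t n_edges,
                 float stiffness, Vec3 *edge_force)
{
    for (size_t e = 0; e < n_edges; ++e) {
        int ia = edge_verts[2 * e];
        int ib = edge_verts[2 * e + 1];
        Vec3 a = pos[ia];                 /* indirect read 1 */
        Vec3 b = pos[ib];                 /* indirect read 2 */
        edge_force[e].x = stiffness * (b.x - a.x);
        edge_force[e].y = stiffness * (b.y - a.y);
        edge_force[e].z = stiffness * (b.z - a.z);
    }
}

/* Stand-in for the per-vertex reduction kernel.
 * adj_start has n_verts + 1 entries; the edges incident to vertex v are
 * adj_edge[adj_start[v] .. adj_start[v + 1] - 1]. adj_sign is +1 when v is
 * the first endpoint of that edge and -1 when it is the second.
 * The read edge_force[adj_edge[k]] is again an indirect, unordered fetch. */
void gather_forces(const Vec3 *edge_force, const int *adj_start,
                   const int *adj_edge, const float *adj_sign,
                   size_t n_verts, Vec3 *vert_force)
{
    for (size_t v = 0; v < n_verts; ++v) {
        Vec3 sum = { 0.0f, 0.0f, 0.0f };
        for (int k = adj_start[v]; k < adj_start[v + 1]; ++k) {
            Vec3 f = edge_force[adj_edge[k]];   /* indirect read */
            sum.x += adj_sign[k] * f.x;
            sum.y += adj_sign[k] * f.y;
            sum.z += adj_sign[k] * f.z;
        }
        vert_force[v] = sum;
    }
}
```

In this sketch, how far apart consecutive values in `edge_verts` and `adj_edge` are decides how scattered the reads become, which is the part we would like to improve.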
The AMD profiler tells us that the cache hit rate is close to 0% and that ALU utilization is only about 10%, which isn't surprising, as the ALUs sit idle while waiting for the global memory reads.
So, does anybody have suggestions on how to optimize the memory access? We believe this is a well-known problem (probably also in applications other than spring-mass simulations).
Any help is really appreciated!
Huh. So, if I understand correctly, there is no point in worrying about memory access patterns at all for our global memory data at the moment?
I am not familiar with the problem, but here are a few general suggestions which might be helpful:
1. Try to reduce channel conflicts by having each workgroup fetch from the same channel. This might seem counterintuitive, but this way memory can be accessed for many wavefronts simultaneously.
2. Try to reduce the GPR count or the amount of LDS memory used by the kernel. This should allow more workgroups to fit in, which can help hide the global access latencies.
3. Using images or the __constant address space, where possible, also helps because of their large bandwidth and lower latency.
Please refer to the section "Global Memory Optimizations" of the OpenCL Programming Guide for more details.
The previous advice is great.
Let me just add that you should consider using __local to store heavily used data, as this can significantly decrease the number of fetches. (I hope you're already doing so.)
Additionally, __constant memory is extremely fast, and you have 64 KB available there to be used wisely =]