Here's the situation. I'm trying to write an MD simulation in OpenCL that handles both collisions and non-bonded force interactions (van der Waals forces, modeled with the Lennard-Jones potential). Essentially, that means an O(n^2) algorithm that checks pairs of particles for closeness, then either evaluates new velocities or sums up the forces. For precision, I need 64-bit floating point numbers (the double type), which I get through the cl_amd_fp64 extension. In terms of speed, my simulation is actually not doing too badly compared to other algorithms of this type. However, I need it to be a lot faster.
Here's a high-level overview of the algorithm I'm using:
1.) Iterate through the particles in a spatially-contiguous order, grouping them into "tiles" of 32 particles each.
2.) Figure out which pairs of tiles are interacting, using a simple O(n^2) algorithm, and store those pairs in an array.
3.) Calculate pairwise interactions between specific particles using the results from the previous kernel.
4.) Advance the system by one timestep using an Euler integrator (I know, it's terribly imprecise, but I'm just trying to eke out as much speed as I can).
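As a rough CPU-side illustration of steps 3 and 4 (not the actual OpenCL kernel), here is a minimal sketch in plain C. It assumes the standard Lennard-Jones form V(r) = 4*epsilon*((sigma/r)^12 - (sigma/r)^6) with a cutoff, and it leaves out collisions entirely. The names vec3, lj_force, tile_pair_forces, and euler_step, as well as the epsilon, sigma, cutoff, and mass parameters, are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE 32 /* particles per tile, as in step 1 */

typedef struct { double x, y, z; } vec3;

/* Lennard-Jones force on particle i due to particle j.
   epsilon, sigma, and cutoff are illustrative parameters. */
static vec3 lj_force(vec3 ri, vec3 rj, double epsilon, double sigma,
                     double cutoff)
{
    vec3 f = {0.0, 0.0, 0.0};
    double dx = ri.x - rj.x, dy = ri.y - rj.y, dz = ri.z - rj.z;
    double r2 = dx * dx + dy * dy + dz * dz;

    /* Skip coincident particles and pairs beyond the cutoff. */
    if (r2 == 0.0 || r2 > cutoff * cutoff)
        return f;

    double s2 = sigma * sigma / r2; /* (sigma/r)^2 */
    double s6 = s2 * s2 * s2;       /* (sigma/r)^6 */

    /* F_i = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * (ri - rj) */
    double scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2;
    f.x = scale * dx;
    f.y = scale * dy;
    f.z = scale * dz;
    return f;
}

/* Step 3: add the forces that tile_b exerts on the particles of tile_a.
   n is the total particle count; the last tile may be partially filled. */
static void tile_pair_forces(const vec3 *pos, vec3 *force, size_t n,
                             size_t tile_a, size_t tile_b,
                             double epsilon, double sigma, double cutoff)
{
    size_t a0 = tile_a * TILE_SIZE, b0 = tile_b * TILE_SIZE;
    for (size_t i = a0; i < a0 + TILE_SIZE && i < n; ++i) {
        for (size_t j = b0; j < b0 + TILE_SIZE && j < n; ++j) {
            if (i == j)
                continue; /* no self-interaction */
            vec3 f = lj_force(pos[i], pos[j], epsilon, sigma, cutoff);
            force[i].x += f.x;
            force[i].y += f.y;
            force[i].z += f.z;
        }
    }
}

/* Step 4: explicit Euler update. Positions use the old velocities,
   then velocities are advanced with the current forces. */
static void euler_step(vec3 *pos, vec3 *vel, const vec3 *force, size_t n,
                       double mass, double dt)
{
    assert(mass > 0.0 && dt > 0.0);
    for (size_t i = 0; i < n; ++i) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
        vel[i].x += force[i].x / mass * dt;
        vel[i].y += force[i].y / mass * dt;
        vel[i].z += force[i].z / mass * dt;
    }
}
```

In this sketch, calling tile_pair_forces once per entry of the interacting-tile array from step 2 (in both directions for each pair, plus each tile against itself) accumulates the full force on every particle before euler_step is applied.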
Step 3 is the bottleneck, taking up nearly 100% of the total time. Step 2 does become expensive when there are a lot of particles in the system, and I'll deal with that later, but for the number of particles we need to simulate, steps 1, 2, and 4 take negligible time compared to step 3. Using the AMD APP Profiler, I can see that the step 3 kernel only runs 4 wavefronts out of a maximum of 24, and that the limiting factor is the number of VGPRs it uses. I can also see that its ALU and fetch instruction counts, as well as its fetch size, are well above those of any other kernel I'm running.
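To make the VGPR limit concrete, here is a small sketch in C of the usual back-of-the-envelope occupancy model: all resident wavefronts share a fixed register budget, so the number that fit is the budget divided by the registers each work-item needs, capped at the hardware maximum. The function name waves_by_registers and the budget of 256 registers used in the note below are assumptions for illustration only; the real budget depends on the GPU and should be taken from the vendor documentation or the profiler output.

```c
#include <assert.h>

/* Wavefronts that fit when registers are the only limiting resource.
   reg_budget:    registers available per work-item slot (device-specific,
                  assumed here rather than taken from any datasheet)
   regs_per_item: VGPRs the kernel uses per work-item
   max_waves:     hardware cap on resident wavefronts */
static unsigned waves_by_registers(unsigned reg_budget,
                                   unsigned regs_per_item,
                                   unsigned max_waves)
{
    assert(regs_per_item > 0);
    unsigned waves = reg_budget / regs_per_item; /* integer division */
    return waves < max_waves ? waves : max_waves;
}
```

With a hypothetical budget of 256 and a cap of 24, a kernel using 60 registers per work-item would fit 4 wavefronts, and it would have to drop to 32 registers to fit 8. Under this model, the only ways to raise the wavefront count are to use fewer registers per work-item or to split the kernel into smaller ones.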
I'm going to attach the source code for step 3, as well as the output from the profiler after running 100 steps with 64000 particles. Note that I'm an amateur OpenCL/GPU programmer, so don't hesitate to point out obvious optimizations or other improvements I could be making. Also, I know this is more of a personal issue than one that affects a lot of other people, so it's not a great fit for a discussion forum. I emailed someone from AMD, and they said to post the question here for now while they try to find someone to help me out. I've been beating my head against this problem for nearly two weeks, and nearly all of the optimizations I try just make it slower, so I'll be very happy if anyone can give some input on this.