
Adept II

Optimizing Kernel

Here's the situation. I'm trying to write an MD simulation in OpenCL that processes both collisions and non-bonded force interactions (van der Waals forces through the Lennard-Jones potential). Essentially, that means an O(n^2) algorithm that checks particles for closeness, evaluates new velocities, and sums up a bunch of forces. For precision, I need to use 64-bit floating point numbers - doubles, via the cl_amd_fp64 extension. My simulation is actually not running too badly, in terms of speed, compared to other algorithms of this type. However, I need it a lot faster.

Here's a high-level overview of the algorithm I'm using:

1.) Iterate through the particles in a spatially-contiguous order, grouping them into "tiles" of 32 particles each.

2.) Figure out which of these tiles are interacting, using a simple O(n^2) algorithm, and store them in an array.

3.) Calculate pairwise interactions between specific particles using the results from the previous kernel.

4.) Update timestep using an Euler integrator (I know, it's terribly imprecise, but I'm just trying to eke out as much speed as I can).

Number 3 is where the bottleneck occurs, taking up nearly 100% of the total time. Step 2 takes a lot of time when there are a lot of particles in the system, and I'll deal with it later, but for the number of particles we need to simulate, steps 1, 2, and 4 take just about negligible time compared to 3. Using the AMD APP Profiler, I can see that the kernel is only executing 4 out of a maximum of 24 wavefronts, limited by the number of VGPRs. I can also see that its ALU and fetch instruction counts, as well as its fetch size, are well above those of any other kernel I'm running.

I'm going to attach the source code for step 3, as well as the output from the profiler after running 100 steps for 64000 particles. Note that I'm an amateur OpenCL/GPU programmer, so don't hesitate to point out obvious optimizations or other mistakes I might be making. Also, I know this is a bit of a personal issue, rather than one that affects more people than just me - not a great fit for a discussion forum. I emailed someone from AMD and they said to post the question here for now, while they try to find someone to help me out. I've been beating my head against this problem for nearly two weeks, and nearly all of the optimizations I try just make it slower, so I'll be very happy if anyone can give some input on this.
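To give a rough idea of the structure before you dig into the attachment, here's a heavily simplified sketch of step 3 - standard Lennard-Jones forces only, with the collision handling and most details stripped out (the constant values and parameter names here are just placeholders, not the exact attached code):

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

#define TILE_SIZE 32
#define NO_INTERACTION (-1)

// Simplified sketch of step 3: one work-item per particle i.
// The real kernel also handles collisions, which is omitted here.
__kernel void pairwiseForces(__global const double4 *pos,
                             __global double4 *force,
                             __global const int *interactions,
                             const int numParticles,
                             const int numBlocks,
                             const double epsilon,
                             const double sigma)
{
    int i = get_global_id(0);
    if (i >= numParticles)
        return;

    int b1 = i / TILE_SIZE;               // tile that particle i belongs to
    double4 pi = pos[i];
    double4 forceSum = (double4)(0.0);

    for (int b2 = 0; b2 < numBlocks; b2++) {
        // skip tiles that the previous kernel marked as too far away
        if (interactions[b2 * numBlocks + b1] > NO_INTERACTION) {
            int jBegin = b2 * TILE_SIZE;
            int jEnd = min(jBegin + TILE_SIZE, numParticles);
            for (int j = jBegin; j < jEnd; j++) {
                if (j == i)
                    continue;
                double4 d = pi - pos[j];
                double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
                // standard Lennard-Jones force: 24*eps*(2*(s/r)^12 - (s/r)^6)/r^2 * d
                double sr2 = (sigma * sigma) / r2;
                double sr6 = sr2 * sr2 * sr2;
                double scale = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / r2;
                forceSum.xyz += scale * d.xyz;
            }
        }
    }
    force[i] = forceSum;
}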

28 Replies
Adept II

Re: Optimizing Kernel

OK, as far as I can see at a quick glance, you don't use local memory. I would first try to use this per-work-group shared memory to "cache" the data from global memory.

Is it also possible that you aren't reading from global memory in aligned patterns?
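For example, a pattern like this (only a sketch - I don't know your buffer names, so pos, jBegin and the tile size of 32 are guesses):

// sketch: stage one 32-particle tile into local memory once per work-group,
// then let every work-item read it from LDS instead of global memory
__local double4 tile[32];

int lx = get_local_id(0);

// cooperative, coalesced load: each work-item fetches at most one particle
if (lx < 32 && jBegin + lx < numParticles)
    tile[lx] = pos[jBegin + lx];
barrier(CLK_LOCAL_MEM_FENCE);

// ... all work-items now loop over tile[0..31] for the j data ...

barrier(CLK_LOCAL_MEM_FENCE);   // before the next tile is loaded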

Adept II

Re: Optimizing Kernel

You're right; I could not find a good way to fit local memory into this. Since the tile size is often smaller than the work-group size, I wasn't sure how to get the data and cache it concurrently without introducing yet more nested branches. Not only that, but many threads don't need the information; the "if(NO_INTERACTION)" statement is there to prune out a lot of the calculations.

The same went for coalesced memory reads; the first set (getting the "i" data) is coalesced, I believe, but I could not figure out a good way to get the particle "j" data in a coalesced fashion. However, either of these methods would probably help the time immensely - from the profiling information, it looks like the fetch unit is going crazy. I just couldn't find a way to implement them.

Adept II

Re: Optimizing Kernel

It may sound silly at first, but just try to get rid of all these "skip calculations" optimizations.

I don't know how big a percentage of the calculations you're eliminating, but perhaps you can then use aligned reads from global memory together with local memory. It is often the case that simply calculating something is faster than fancy elimination of calculations.

Just try it. It is hard to say in advance whether something will be faster or not.

Adept II

Re: Optimizing Kernel

I've actually tried this already; getting rid of the "tiling" algorithm and just resorting to the brute-force O(n^2) algorithm (I guess it would technically be O(n) per work-item, run in parallel) resulted in a pretty massive slowdown with this many particles. It was only faster below some smaller particle count (I don't remember the exact number anymore).

I've pretty much been trying random things like this for the last two weeks, to no avail. I've also been trying to look for a different algorithm/trying to develop one myself, with not much luck.

Adept II

Re: Optimizing Kernel

I think people who solve these MD problems really fast only solve the complete N-Body problem for particles that are close to each other, then use Particle-Mesh Ewald summation and dipoles in the frequency domain for particles that are far away.  You could also look at using the AMD FFT library with that part of the modeling.

http://en.wikipedia.org/wiki/Ewald_summation
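The basic idea is to split the slowly decaying 1/r term into a short-ranged piece that you sum directly with a cutoff (like the Lennard-Jones part) and a smooth piece that you sum in reciprocal space with an FFT:

1/r = erfc(a*r)/r + erf(a*r)/r

where the splitting parameter a controls how much of the work goes into the real-space sum versus the reciprocal-space sum.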

Challenger

Re: Optimizing Kernel

I presume you mean this is part 3 ... looking at the code, it looks pretty slow. I'm surprised you're getting OK performance, TBH, and I wouldn't be surprised if you could get a good order of magnitude or more out of it.

The inner loop can be trivially parallelised (based on a cursory look, I can't see any dependencies within the loop). For this type of problem you should use a loop that calculates partial sums in batches based on the work size (and: use 64 for the local work size), which are then summed outside of the loop to get a result for a given b1, b2.

It's complicated a bit by the fact that you're using double4, which means bank conflicts in LDS, but you can get around that by splitting the data when you access LDS.

local double fsum[64 * 4];
int lx = get_local_id(0);

... outer loop stuff

// inner loop
int j = lx;
double4 forceSum = 0;
while (j < numParticles) {
  // do innermost calculation of ( i, j ) result,
  // update forceSum with result
  j += 64;
}
// save partial sum
fsum[lx + 64 * 0] = forceSum.s0;
fsum[lx + 64 * 1] = forceSum.s1;
.. etc.
barrier(CLK_LOCAL_MEM_FENCE);

Then do a parallel reduction sum on the 4 rows of fsum.  e.g.

for (int x = 64/2; x > 0; x >>= 1) {
  if (x + lx < 64) {
    forceSum.s0 += fsum[lx + 64 * 0 + x];
    ... 1, 2, 3

    fsum[lx + 64 * 0] = forceSum.s0;
    ... 1, 2, 3
  }
  barrier(CLK_LOCAL_MEM_FENCE);
}

Now fsum[64*n] = forceSum.sn. Actually, since you need to iterate over all b2 as well, you'd do that before doing the parallel sum, and then you'd end up with finalForce directly. So the outer loop structure will be something like this:

int i = get_group_id(0);
while (i < numParticles) {
   for(int b2 = 0; b2 < numBlocks; b2++) {
     if(interactions[b2*numBlocks+b1] > NO_INTERACTION) {
           // inner loop from above
        }
   }
 
  i += get_num_groups(0);
}
// parallel sum from above

The interactions[] test shouldn't be too expensive here since every work-item is doing it in lock-step, and they're all taking the same branch.

Actually ... since this means each work-group might do a different amount of work, it may be beneficial to use a 'queue' for working out 'i', rather than hard-coding it. Here you just use an atomic counter (on AMD, or a global atomic otherwise) and atomic_inc() to get/update the i value, but otherwise the loop is the same.
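A rough sketch of that (assuming 'nextI' is a __global int kernel argument the host zeroes before each launch; with a 64-item work-group the barriers compile away anyway):

// sketch: work-groups pull the next 'i' from a shared counter instead of
// deriving it from get_group_id(0)
__local int sharedI;

int lx = get_local_id(0);

for (;;) {
  if (lx == 0)
    sharedI = atomic_inc(nextI);      // one work-item grabs the next index
  barrier(CLK_LOCAL_MEM_FENCE);
  int i = sharedI;
  if (i >= numParticles)
    break;                            // uniform across the group, so safe

  // ... the b2 loop and inner loop from above, for this i ...

  barrier(CLK_LOCAL_MEM_FENCE);       // everyone is done before sharedI is re-used
}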

Finally, the way this is executed: you set the local work size to 64,1,1 (and you can use the reqd_work_group_size hint to specify this on the kernel), and then for the global work size you just give it something that fits the hardware. The kernels themselves consume work until it's complete. As a hint for something that fits the hardware, start with compute units (as reported by the OpenCL API) * 64 (the work-group local size) * 16 (this last factor is the parallelism per compute unit - just fiddle around till you get the best result on the device).
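On the host side that works out to something like this (error checking omitted; device, queue and kernel are the usual handles, and 16 is just the starting guess for per-CU parallelism):

/* sketch: derive the global work size from the device's compute-unit count */
cl_uint cus;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);

size_t local  = 64;                     /* matches the reqd_work_group_size hint */
size_t global = (size_t)cus * 64 * 16;  /* 16 = per-CU parallelism factor, tune it */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);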

Adept II

Re: Optimizing Kernel

Thanks for the detailed answer, notzed!

Actually, I tried parallelizing the outer loop with a 2D NDRange, but it only made the code slower - I would guess because the extra memory accesses that it incurred were more expensive than the removal of the outer for loop.

Anyway, I will try to apply these changes on Monday and get back to you. In the meantime, though, can you please explain how to choose local workgroup size? I got it into my head that the local workgroup size is always best at the maximum size, but is this not true?

Challenger

Re: Optimizing Kernel

Optimal local work group size depends on quite a few factors, so optimising it isn't trivial.

e.g. some of the factors:

  • If you're using LDS, then the amount of LDS required depends on the LWS (you will always need at least one memory cell per work item, plus whatever extra the algorithm needs, as there's no point if every work item doesn't have its own slot).
  • How much LDS you use is one factor on how many wavefronts can execute concurrently, and whether it can even run at all.
  • If you're using barriers, then once you go above 64 work items (i.e. wavefront) then I can only imagine it adds complication to the barrier scheduling - e.g. now the hardware needs to synchronise across multiple wavefronts (I don't know for sure whether this makes a real difference though)
  • Also with barriers, the compiler can no-op them if the LWS <= 64 items on a GPU, since all 64 work items execute in lock-step (on AMD at least, and here you need to specify the reqd_work_group_size hint - see the snippet after this list for the syntax).
  • If you're accessing global memory more than once (per address per work item) bigger work groups mean more cache pressure.
  • Registers are also a limit on concurrency.
  • On AMD, using a non-multiple of 64 work items means you have guaranteed wastage of ALU slots (I don't know what NVIDIA's is now, I think it's a multiple of 32).
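The reqd_work_group_size hint mentioned above is just an attribute on the kernel, e.g. (the kernel name and arguments here are only placeholders):

// promise the compiler the kernel only ever runs with a 64x1x1 work-group,
// so barriers within a single wavefront can be elided as described above
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void sumForces(__global const double4 *pos, __global double4 *force)
{
    /* ... */
}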

So after a lot of experimentation, now I just use a couple of rules of thumb as a starting point:

  • Just use 64 work items as soon as you need to use LDS for any parallel stuff.
  • Always try to use at least 64 for any problem (unless it just won't fit or requires too much address logic to make it fit, but even a serial task that needs to access an array will benefit from parallel memory loads).
  • Use 16x16 for pixel-by-pixel oriented problems on images.

If I have the time/interest/inclination I might fiddle further, but after writing a lot of kernels you run out of time (or interest!) to experiment every time trying to get the absolute optimal performance, and in practice the rules above normally get a good result anyway. Trying to do too much can sometimes over-specify the problem and over-tune it to a specific bit of hardware and a specific test case. One thing you get from experience is just what sort of performance to expect: if you know you're an order of magnitude below par, you know it's worth the effort to keep working on the problem, but eventually you hit a point of diminishing returns and the small gains possible just aren't worth it - and the thing with GPU code is that it's easy to leave 10x the performance on the floor without knowing it. I'm not satisfied unless I get at least 10x the contemporary CPU performance for a given algorithm, and not happy unless I get at least 100x for parallel ones.

If you're not using LDS, then to some extent it just doesn't matter a lot: the GPU executes wavefronts, and whether a wavefront is part of a wider work-group, another work-group, or another iteration of the same work-group, there isn't much (without knowing internal hardware details) to tell them apart (e.g. try a simple array addition problem and adjust the work size from 64 to 512 in steps of 64; the differences will be measurable but not huge). As soon as you use LDS (and it's an awesome addition that really makes OpenCL worth worrying about) the landscape changes totally. Here LDS becomes a precious commodity, the overheads of barrier() are usually worth removing, and smaller work-groups can be a big gain.

In the example above, using 64 makes the internal loop more efficient:

  • The partial sum is just kept in a register the whole time, so each work-item only needs one register for its N/64 results, rather than storage for all N.
  • Likewise fewer complex address calculations are required.
  • The parallel sum only needs to sum up 64 items and only needs 64 items worth of LDS to do it.
  • The barriers are compiled to nothing (with the right hint)
  • The parallel sum has fixed loop indices so can be optimised/unrolled better by the compiler (although it's only a small part of the execution time in such a bit of code, and the design makes it a smaller part of the execution time the larger the problem gets).
  • You're using a tiny bit of LDS, so the only impediment to concurrent work-groups on the same CU is the register load.

This internal, for-free problem-size reduction is a very useful optimisation trick.

Adept II

Re: Optimizing Kernel

After going over the code more thoroughly, I'm a bit confused. When you say to put the inner loop into the outer loop, I assume you mean like this:

local double fsum[64 * 4];
int lx = get_local_id(0);
int i = get_group_id(0);

while (i < numParticles) {
  for (int b2 = 0; b2 < numBlocks; b2++) {
    if (interactions[b2*numBlocks+b1] > NO_INTERACTION) {
      // inner loop
      int j = lx;
      double4 forceSum = 0;
      while (j < numParticles) {
        // do innermost calculation of ( i, j ) result,
        // update forceSum with result
        j += 64;
      }
      // save partial sum, n = {0, 1, 2}
      fsum[lx + 64 * n] = forceSum.sn;
    }
  }
  i += get_num_groups(0);
}

However, I'm not really sure how to interpret this. The interactions[] if statement should cull out some computations between blocks that are too far apart, but it seems like the inner while loop just does the same thing every time, no matter what b2 is. If this is what you meant by "the inner for loop has no dependencies", then I'm not sure it's correct; if you look at the original code, jBegin = b2*tileSize and jEnd = jBegin + tileSize, so the inner for loop becomes:

for(int j = b2*tileSize; j < b2*tileSize+tileSize && j < numParticles; j++)

{ /*Force and collision calculations*/ }

The two variables are not explicitly declared this way, but they are meant to serve this purpose - I might have  obfuscated that in an attempt to not repeat the multiplication.
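In other words, I would have expected the strided inner loop to be restricted to the tile, something like this (and with tileSize = 32, only half of the 64 work-items would even get a j per tile):

// what I would expect the inner loop to look like when limited to tile b2
// (lx and forceSum declared further up, as in your sketch)
int jBegin = b2 * tileSize;
int jEnd = min(jBegin + tileSize, numParticles);
for (int j = jBegin + lx; j < jEnd; j += 64) {
  // innermost (i, j) force/collision calculation, accumulated into forceSum
}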

The other thing I'm confused about is the calculation of the global work size. I always thought of a work-item as a single particle, or perhaps a single block, which is obviously not the case here.

Thanks for your help so far, notzed - sorry for not giving you anything in return except for more pleas for help, but I really appreciate it.

EDIT: One more remark: It seems like this partial sum method would reduce the number of writes, but what about the reads? Since the "innermost (i, j) calculation" remains unchanged, that means I'm still reading the particle j info many times.

Message was edited by: Ed Lu
