
registerme
Journeyman III

Re: Optimizing Kernel

Could you please elaborate why "Use 16x16 for pixel-by-pixel oriented problems on images"?

notzed
Challenger

Re: Optimizing Kernel

If you're processing images in a fairly simple way, a 2-D local work size of 16x16 often works pretty well for me, i.e. each work-group processes a 16x16 tile.
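
e.g. on the host side it's just something like this (an untested sketch - queue, kernel, imgW and imgH are made-up names, and the image is assumed padded to a multiple of 16 in each direction):

size_t lws[2] = { 16, 16 };       /* one 16x16 tile per work-group */
size_t gws[2] = { imgW, imgH };   /* one work-item per pixel */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, lws, 0, NULL, NULL);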

notzed
Challenger

Re: Optimizing Kernel

Ahhh right i'm sorry, i didn't notice that detail, i just assumed it was processing all other particles (which in hindsight doesn't make sense).  I didn't really have time to write your whole routine for you, just give you some ideas 😉

The only real difference is that j is initialised to lx + b2 * tilesize rather than lx, and the loop ends sooner.  Except that if your tile size is only 32, then you don't need to loop at all - but then it won't work very efficiently and you can't use all 64 threads (you could do two 32-item lots at once, but then the summation is a bit more complex - not by much, though).  Changing the tile size to 64 and removing the loop would be the easiest ...

What i meant by no dependencies is that each loop iteration stands alone - its only output is the summand, and there is no feed-back between iterations.  i.e. no value from step j-1 is used by step j; mathematically it's just a sum, not a recurrence.
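
e.g. the difference in sketch form (a made-up kernel, not yours):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

kernel void indep_vs_recurrence(global double *out, int n)
{
    // independent: each summand depends only on j, so iterations stand
    // alone and the hardware/compiler can reorder or pipeline them freely
    double sum = 0;
    for (int j = 0; j < n; j++)
        sum += (double)j * 0.5;

    // recurrence: step j consumes step j-1's output, so iterations
    // must run in order
    double x = 1.0;
    for (int j = 0; j < n; j++)
        x = x * 0.5 + 1.0;

    out[get_global_id(0)] = sum + x;
}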

BTW you don't really need to worry about factoring out repeated multiplications - the compiler will do that for you.  I usually do it just to save typing, particularly if any of the values are going to change.  But often it will compile into exactly the same code whether you do or not.

The global work size is counted in work-items, but if you're setting the local work size, the global size has to be a multiple of it.  So you have gws/lws work-groups, and each work-group executes on the same compute unit (even on a cpu driver), which is what allows LDS to work, etc.  If you use a 'persistent kernel' approach as i listed, you move the range calculation into the kernel, and all you're doing is trying to saturate every part of the device with the task, on the assumption that there's a good amount of work and it won't be wasteful.
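
e.g. a minimal sketch of the persistent-kernel idea (a made-up kernel, not your code):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

kernel void persistent(global const double4 *in, global double4 *out, int n)
{
    // enqueue just enough work-items to saturate the device; each one
    // strides through the whole range itself, so the range calculation
    // lives in the kernel rather than in the enqueue
    for (int i = get_global_id(0); i < n; i += (int)get_global_size(0))
        out[i] = in[i] * 2.0;   // stand-in for the real per-item work
}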

As to your last question: well, without changing the algorithm you can't reduce the calculations - at the lowest level it's still O(N^2).  But doing it this way takes advantage of the hardware characteristics.  The main difference is that with the inner loop you had, memory was read in pretty much the worst possible way, whereas this one reads memory coalesced (i.e. each adjacent work item reads an adjacent memory cell).  It's just rotated the calculation order by 90 degrees.
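
e.g. the two access patterns side by side (an illustrative made-up kernel):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

kernel void sum_rows(global const double *data, global double *sums, int n)
{
    int i = get_global_id(0);
    double s = 0;
    for (int j = 0; j < n; j++) {
        // worst case:  s += data[i * n + j];  adjacent work items read
        // addresses n elements apart, so the wavefront's reads scatter.
        // coalesced: adjacent work items read adjacent cells, so the
        // wavefront's reads combine into a few wide fetches.
        s += data[j * n + i];
    }
    sums[i] = s;
}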

chaosed0
Adept II

Re: Optimizing Kernel

Alright, I think after reading through everything again, and with this new post, I understand how it is supposed to work. However, I implemented the kernel and it's running slower than before! I'm not sure where the problem lies - perhaps I still don't quite understand.

Interestingly, the kernel runs fastest when the outer loop is completely eliminated, i.e. the "parallelization factor" is equal to the number of particles (the global work size, then, is 64*numParticles). Maybe the slowdown comes, then, because the reads of particle 'i' are no longer coalesced?

I attached the new kernel and profiler info. I'll continue trying to see what's going on in the meanwhile.

By the way, I modified the parallel reduction sum because I tested it by itself and it didn't seem to be working. Other than that, though, I haven't tested anything for correctness yet - just speed.

EDIT: I accidentally multiplied the number of threads by 20 (it should be 4096000, not 80 million) in the profiling result, but I just ran it again and it's not too much different than the one I attached.

notzed
Challenger

Re: Optimizing Kernel

chaosed0 wrote:

Alright, I think after reading through everything again, and with this new post, I understand how it is supposed to work. However, I implemented the kernel and it's running slower than before! I'm not sure where the problem lies - perhaps I still don't quite understand.

Interestingly, the kernel runs fastest when the outer loop is completely eliminated, i.e. the "parallelization factor" is equal to the number of particles (the global work size, then, is 64*numParticles). Maybe the slowdown comes, then, because the reads of particle 'i' are no longer coalesced?

I attached the new kernel and profiler info. I'll continue trying to see what's going on in the meanwhile.

By the way, I modified the parallel reduction sum because I tested it by itself and it didn't seem to be working. Other than that, though, I haven't tested anything for correctness yet - just speed.

EDIT: I accidentally multiplied the number of threads by 20 (it should be 4096000, not 80 million) in the profiling result, but I just ran it again and it's not too much different than the one I attached.

Ahh slower - blast.

Maybe the non-coalesced read of i doesn't help, but it doesn't happen very often so shouldn't matter much.  I guess the difference is only minor without removing the outer loop?  It was just an idea anyway; it could be removed simply enough - it's necessary when you don't know the work-size at enqueue time (e.g. a queue filled by a kernel), but probably not needed here, now i think about it.

Is this the same problem as the initial profiler output?  20x slower?  Ouch.  I find it strange that it's slower than the original code, because the memory access in the inner loop should be much better and the parallelism is higher.  Particularly that it's so much slower - but i've been surprised before.  The cache hit ratio and some other profiling points look better ...  I presume you made tilesize 64 here, otherwise you're doing twice as much work.

I just spotted a bug: forceSum is defined twice, when it should only be defined just inside the first loop.  i'm surprised it isn't just optimising everything away - actually, if you haven't fixed that in the code you ran, it probably makes all the profiling results useless, as every result of forceSum will be 0.
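
e.g. roughly what the bug looks like (a made-up kernel, since i'm going from memory of your attachment):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

kernel void shadow_bug(global const double *f, global double *out, int n)
{
    double forceSum = 0;        // the accumulator you actually want
    for (int j = 0; j < n; j++) {
        // double forceSum = 0; // BUG: re-declaring here shadows the
        //                      // outer variable, which stays 0
        forceSum += f[j];
    }
    out[get_global_id(0)] = forceSum;
}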

Is the range of b1 and b2 the same?  Could you perhaps invert the access, so that the interactions test accesses adjacent items?  It won't be coalesced, but it should be cached.  Can't imagine that would make much difference though.

This:


if(collide)
{
    vel = (iVel * massDiff + 2 * jMass * jVel) / massSum;

    //The particles may be too close to each other - move them apart
    pos = iPos + normalVec;
    dist = radSum;
}

val is the same for all work items, and this could cause a write conflict serialisation (i think?).  Not sure how you'd fix this, as your original code just overwrote it and ended up with the value from the highest j that had a collision - but somehow it could be moved outside of the b2 loop.  (I would try the thing below first though).
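
e.g. one way to hoist it (made-up names and a stand-in collision test, not your kernel):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

kernel void hoist_write(global const double *dist, global double *out,
                        double radSum, int count)
{
    int gid = get_global_id(0);
    int lastCollide = -1;
    for (int j = 0; j < count; j++) {
        if (dist[gid * count + j] < radSum)  // stand-in collision test
            lastCollide = j;                 // highest colliding j wins,
                                             // as the overwrite did
    }
    if (lastCollide >= 0)
        out[gid] = (double)lastCollide;      // single write, outside the loop
}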

Hmm, maybe the common read of oldPos and so on is also causing channel conflicts (see section 6.1.2.2 'reads of the same address' of the AMD APP programming guide).  Actually, the more I think about it the more this seems likely given you're on cypress - and this would be a huge performance hit (i'm not sure how one can tell if it is from the sprofile output).  Try only reading xx in local work item 0, and sharing it using LDS.  Either go through the array you have or just declare the variables local - you have plenty of LDS to spare.

e.g. change the start of the first loop to something like this:

local double4 iPos;
local double4 iVel;
// ... etc.

if (lx == 0) {
    iPos = oldPos;
    iVel = ...;
    // etc.
}
barrier(CLK_LOCAL_MEM_FENCE);

Dunno what else to suggest - make sure that forceSum thing is correct, and if with the read of xx changed it is still 20x slower, this all looks like a big dead-end and i'd be pretty somethinged if it didn't help.

chaosed0
Adept II

Re: Optimizing Kernel

"Blast," eh? Says something different in the email I got.

You're right, I totally overlooked that double declaration of forceSum. However, after removing the local one and switching the tile size to 64 (another thing I overlooked), the time was cut to 250ms per kernel run, but no further. I also changed the accesses of the private vars so that only one thread reads them, again to no avail. Oh, and I also switched b1 and b2.

But get this - when I comment the whole if(forceInter) block out, the kernel speeds up drastically, from 250ms to 4ms per run. As soon as I put the statement "forceAdd += forceSum" back in the inner for loop, with the if(forceInter) block still commented out, the kernel time went right back up to 250ms.

Now, I suppose this could indicate a myriad of things. Maybe it recognizes that the variables involved in the calculation of forceSum are in no way relevant, and stops calculating them. Seems unlikely, though, because I assigned a constant to forceAdd and commented out the calculation of all the other variables involved in it, and it only got a little faster - and I think that speedup is just because the kernel stopped using a lot of its registers, so many more wavefronts can run.

The investigation continues. I will edit if I figure anything out.

I really wish OpenCL had better profiling tools, so you could see exactly where in the code the bottlenecks are occurring. Well, we work with what we have.

EDIT: I switched over to the Windows partition and ran KernelAnalyzer; for what it's worth, I'm attaching the GPU ISA of the kernel. I'm pretty terrible at understanding assembly - more so when it's not from a CPU - but there's a pretty huge chunk of ALU clauses full of MOV, ADD_64 and MUL_64 that's missing when the if(forceInter) block is in the code, as opposed to when it's not. Maybe it'll be helpful, no idea.

EDIT2: Commenting out the assignment of forceSum to the fsum array also has the same effect. Posting the source I'm using now, as well.

As an aside, everything runs slower on windows - host code and OpenCL code both.

Message was edited by: Ed Lu

notzed
Challenger

Re: Optimizing Kernel

re blast: yeah i was trying to work out what word this silly software didn't like ... as if the terrible editor wasn't frustrating enough!

First i'll just say that if you remove any calculation, the optimiser WILL remove all the redundant code that isn't required.  i.e. it will trace the data flow, and if a dependency chain doesn't end at local or global memory, that whole sub-graph will be eliminated.
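
e.g. a made-up illustration of the sort of thing that gets thrown away:

kernel void dce_demo(global float *out, int n)
{
    int gid = get_global_id(0);
    float expensive = 0;
    for (int j = 0; j < n; j++)
        expensive += sqrt((float)(j + gid));   // heavy ALU work

    out[gid] = 1.0f;          // 'expensive' never reaches memory, so the
                              // whole loop above is dead and gets removed
    // out[gid] = expensive;  // storing it would keep the loop alive
}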

So when comparing timings, make sure you're getting the right result in each iteration - bugs can also cause redundant code to be thrown away and throw the timings right out, not to mention change the memory access patterns and so on.  I've been caught out on this a few times, just testing minor changes without verifying the results, and you end up just wasting your time.

I'm not super familiar with that ISA, but it looks reasonable to me.  All those FMAs and MULs just implement your equation (and sqrt() needs a few), so if they're not there you're not getting much done ...

I'm still trying to work out why this code would run so much slower than your first iteration, because assuming you're getting the same results, as far as i can tell this ticks all the boxes for 'good gpu code'.  i.e. memory access coalesced, no divergent branching, simple loop that all work items execute, etc.  Unless i'm missing something big ...

There is so much arithmetic that even bad memory access patterns can probably be hidden to a great extent.  I've not done any double-precision stuff, so i'm not sure how much of a hit that is either.

The interactions[] access also hits the same address from every work item, and maybe that causes bank conflicts.  i'd be surprised if changing it would make much difference even though it's in the inner loop, but i can't see anything else to try ...

e.g.

local int work;   // local variables must be declared at kernel scope

for (int b2 = 0; ...) {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lx == 0) {
        work = interactions[b1*numBlocks + b2] > NO_INTERACTION;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (work) {
        ...
    }
}

chaosed0
Adept II

Re: Optimizing Kernel

On the optimizing-away of the calculation... Yeah, I wasn't sure if I was really onto anything or not, but honestly, right now I'm just trying to chase random threads in the hope that they get me somewhere. I don't really have any idea of what to try next, either. I just tried placing the interactions test outside the loop and got a small boost, but it's still slower than my old code.

If there are no other ideas, please don't feel you have an obligation to keep on trying. You've already proved yourself an awesome person.

binying
Challenger

Re: Optimizing Kernel

Is processInteractions2.cl the latest code you want to optimize? If you haven't found a solution yet, I'd like to take a look at this.

By the way, there is an OpenCL example called n-body in AMD APP SDK, which should be useful for you.

chaosed0
Adept II

Re: Optimizing Kernel

Sorry for the long delay - I was away for the weekend.

Yes, the processInteractions2.cl is the latest code. However, if you want to start from the beginning, you can look at the very first .cl file I posted.

As to the SDK sample, I'm not sure how it computes the time it reports in its statistics. I run the program with "NBody -x 32768 -i 10 -q -t", but when I increase the iterations (-i) to 100, or even 1000, the program takes a long time yet reports back approximately the same kernel time/total time. Anyway, if the total running time of the program (wall-clock) is the actual time it spends solving the problem, it's probably as slow as or slower than my original code.

EDIT: After looking through this thread (http://devgurus.amd.com/message/1141259#1141259), it looks like it's possible to solve this problem reasonably well. However, there's a lot of theory in that thread, but not many results; moreover, it was written two years ago, and everyone's talking about CAL.
