28 Replies Latest reply on Jul 16, 2012 6:46 PM by notzed

    Optimizing Kernel

    chaosed0

      Here's the situation. I'm trying to write an MD simulation in OpenCL that processes both collisions and non-bonded force interactions (van der Waals forces via the Lennard-Jones potential). Essentially, that means an O(n^2) algorithm that checks particles for closeness, evaluates new velocities, and sums up a bunch of forces. For precision, I need to use 64-bit floating point numbers - double types, via the cl_amd_fp64 extension. My simulation is actually not running too badly, in terms of speed, compared to other algorithms of this type. However, I need it a lot faster.

       

      Here's a high-level overview of the algorithm I'm using:

      1.) Iterate through the particles in a spatially-contiguous order, grouping them into "tiles" of 32 particles each.

      2.) Figure out which of these tiles are interacting, using a simple O(n^2) algorithm, and store them in an array.

      3.) Calculate pairwise interactions between specific particles using the results from the previous kernel.

      4.) Update timestep using an Euler integrator (I know, it's terribly imprecise, but I'm just trying to eke out as much speed as I can).
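      Since everything hinges on the pairwise force evaluation in step 3, here is a minimal serial C sketch of what that innermost Lennard-Jones loop computes. This is an illustration only - the eps/sigma/cutoff parameters, the vec3 struct, and the function name are my own assumptions, not the poster's actual kernel code:

```c
#include <math.h>

typedef struct { double x, y, z; } vec3;

/* Hypothetical serial version of step 3: for each pair (i, j) within the
 * cutoff, accumulate the Lennard-Jones force
 *   F(r) = 24*eps*( 2*(sigma/r)^12 - (sigma/r)^6 ) / r^2 * (ri - rj). */
static void accumulate_lj(const vec3 *pos, vec3 *force, int n,
                          double eps, double sigma, double cutoff)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            vec3 d = { pos[i].x - pos[j].x,
                       pos[i].y - pos[j].y,
                       pos[i].z - pos[j].z };
            double r2 = d.x*d.x + d.y*d.y + d.z*d.z;
            if (r2 > cutoff * cutoff) continue;   /* pairwise pruning */
            double s2 = sigma * sigma / r2;
            double s6 = s2 * s2 * s2;             /* (sigma/r)^6 */
            double f  = 24.0 * eps * s6 * (2.0 * s6 - 1.0) / r2;
            force[i].x += f * d.x;                /* repulsive for r < 2^(1/6)*sigma */
            force[i].y += f * d.y;
            force[i].z += f * d.z;
        }
    }
}
```

      The tiling in steps 1 and 2 only prunes which (i, j) pairs reach this innermost body; the per-pair arithmetic is unchanged.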

       

      Number 3 is where the bottleneck occurs, taking up nearly 100% of the total time. Step number 2 takes a lot of time when there are a lot of particles in the system, and I'll deal with it later, but for the number of particles we need to simulate, 1, 2, and 4 take just about negligible time compared to 3. Using the AMD APP Profiler, I can see that it's only executing 4 out of 24 maximum wavefronts, limited by the number of VGPRs. I can also see that the ALU and fetch instructions, as well as the fetch size, are well above any other kernel I'm running.

       

      I'm going to attach the source code for step 3, as well as the output from the profiler after running 100 steps for 64000 particles. Note that I'm an amateur OpenCL/GPU programmer, so don't hesitate to point out obvious optimizations or other improvements I could be making. Also, I know this is a bit of a personal issue rather than one of general interest - not a great fit for a discussion forum. I emailed someone from AMD and they said to post the question here for now, while they try to find someone to help me out. I've been beating my head against this problem for nearly two weeks, and nearly all of the optimizations I try just make it slower, so I'll be very happy if anyone can give some input on this.

        • Re: Optimizing Kernel
          Skysnake

          OK, as far as I can see at a glance, you don't use local memory. I would first try to use this per-work-group shared memory to "cache" the data from global memory.

           

          Is it also possible that you aren't reading from global memory in aligned patterns?

            • Re: Optimizing Kernel
              chaosed0

              You're right, I could not find a good way to fit local memory into this. Since the tile size is often smaller than the work group size, I wasn't sure how to get the data and cache it concurrently without introducing yet more nested branches. Not only that, but many threads don't need the information; the "if(NO_INTERACTION)" statement is there to prune out a lot of the calculations.

              The same went for coalesced memory reads; the first set (getting the "i" data) is coalesced, I believe, but I could not figure out a good way to get the particle "j" data in a coalesced fashion. However, either of these methods would probably help the time immensely - from the profiling information, it looks like the fetch unit is going crazy. I just couldn't find a way to implement them.

                • Re: Optimizing Kernel
                  Skysnake

                  It sounds silly at first look, but just try to get rid of all these "skip calculations" optimizations.

                   

                  I don't know how big the percentage of eliminated calculations is, but perhaps you can then use aligned reads from global memory and local memory. Often, simply calculating everything is faster than fancy elimination of calculations.

                   

                  Just try it. It is hard to say whether something is faster or not.

                    • Re: Optimizing Kernel
                      chaosed0

                      I've actually tried this already; getting rid of the "tiling" algorithm and instead just resorting to the O(n^2) algorithm (I guess it would technically be O(n) run in parallel) resulted in a pretty massive slowdown with this many particles. It was only faster below some particle count (I don't remember exactly what anymore).

                      I've pretty much been trying random things like this for the last two weeks, to no avail. I've also been trying to look for a different algorithm/trying to develop one myself, with not much luck.

                • Re: Optimizing Kernel
                  kbrafford

                  I think people who solve these MD problems really fast only solve the complete N-Body problem for particles that are close to each other, then use Particle-Mesh Ewald summation and dipoles in the frequency domain for particles that are far away.  You could also look at using the AMD FFT library with that part of the modeling.

                   

                  http://en.wikipedia.org/wiki/Ewald_summation

                  • Re: Optimizing Kernel
                    notzed

                    I presume you mean this is part 3 ... looking at the code, it looks pretty slow. I'm surprised you're getting OK performance, TBH, and I wouldn't be surprised if you can get a good order of magnitude or more out of it.

                    The inner loop can be trivially parallelised (based on a cursory look, I can't see any dependencies within the loop). For this type of problem you should use a loop that calculates partial sums in batches based on the work size (use 64 for the local work size), which are then summed outside of the loop to get a result for a given b1, b2.

                    It's complicated a bit as you're using double4, which means conflicts in LDS, but you can get around that by splitting the data when you access LDS.

                    local double fsum[64 * 4];
                    int lx = get_local_id(0);
                    
                    ... outer loop stuff
                    
                    // inner loop
                    int j = lx;
                    double4 forceSum = 0;
                    while (j < numParticles) {
                      // do innermost calculation of ( i, j ) result,
                      // update forceSum with result
                      j += 64;
                    }
                    // save partial sum
                    fsum[lx + 64 * 0] = forceSum.s0;
                    fsum[lx + 64 * 1] = forceSum.s1;
                     .. etc.
                    barrier(CLK_LOCAL_MEM_FENCE);
                    

                    Then do a parallel reduction sum on the 4 rows of fsum.  e.g.

                    for (int x = 64/2; x > 0; x >>= 1) {
                     if (lx < x) {
                      forceSum.s0 += fsum[lx + 64 * 0 + x];
                      ... 1, 2, 3
                    
                      fsum[lx + 64 * 0] = forceSum.s0;
                      ... 1, 2, 3
                     }
                     barrier(CLK_LOCAL_MEM_FENCE);
                    }
                    

                    Now, fsum[64*n] = forceSum.sn.  Actually, since you need to iterate over all b2 as well, you'd do that before doing the parallel sum, and then you'd end up with finalForce directly.  So the outer loop structure will be something like this:

                    int i = get_group_id(0);
                    while (i < numParticles) {
                      for (int b2 = 0; b2 < numBlocks; b2++) {
                        if (interactions[b2*numBlocks+b1] > NO_INTERACTION) {
                          // inner loop from above
                        }
                      }
                    
                      i += get_num_groups(0);
                    }
                    // parallel sum from above
                    

                    The interactions[] test shouldn't be too expensive here since every work-item is doing it in lock-step, and they're all taking the same branch.

                    Actually ... since this means each work-group might do a different amount of work, it may be beneficial to use a 'queue' for working out 'i', rather than hard-coding it.  Here you just use an atomic counter (on AMD, or a global atomic otherwise) and atomic_inc() to get/update the i value, but otherwise the loop is the same.

                    Finally, the way this is executed is that you set the local work size to 64,1,1 (and you can use the reqd_work_group_size hint to specify this on the kernel), and then for the global work size you just give it something that fits the hardware.  The kernels themselves consume work until it's complete.  As a starting point for something that fits the hardware, try compute units (as reported by the OpenCL API) * 64 (the work-group local size) * 16 (this last factor is the parallelism per compute unit; just fiddle around until you get the best result on the device).
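                    To sanity-check the indexing of the scheme above, here's a host-side C sketch that emulates the 64 work items serially - the strided partial-sum phase followed by the tree reduction. The 64-wide group and the plain-double data are simplifying assumptions; the real kernel would do this per component of a double4:

```c
#define WG 64   /* local work size, one wavefront */

/* Each "work item" lx sums data[lx], data[lx+WG], data[lx+2*WG], ... into its
 * own slot, then a tree reduction halves the number of active lanes each step. */
static double strided_reduce(const double *data, int n)
{
    double fsum[WG];
    for (int lx = 0; lx < WG; lx++) {          /* partial-sum phase */
        double s = 0.0;
        for (int j = lx; j < n; j += WG)
            s += data[j];
        fsum[lx] = s;
    }
    for (int x = WG / 2; x > 0; x >>= 1)       /* reduction phase */
        for (int lx = 0; lx < x; lx++)         /* only lanes lx < x are active */
            fsum[lx] += fsum[lx + x];
    return fsum[0];                            /* total lands in slot 0 */
}
```

                    On the device, the inner `for (lx ...)` loops run as the 64 lanes of the wavefront, with a barrier between reduction steps.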

                      • Re: Optimizing Kernel
                        chaosed0

                        Thanks for the detailed answer, notzed!

                        Actually, I tried parallelizing the outer loop with a 2D NDRange, but it only made the code slower - I would guess because the extra memory accesses that it incurred were more expensive than the removal of the outer for loop.

                        Anyway, I will try to apply these changes on Monday and get back to you. In the meantime, though, can you please explain how to choose local workgroup size? I got it into my head that the local workgroup size is always best at the maximum size, but is this not true?

                          • Re: Optimizing Kernel
                            notzed

                            Optimal local work group size depends on quite a few factors, so optimising it isn't trivial.

                             

                            e.g. some of the factors:

                             

                            • If you're using LDS, then the amount of LDS required depends on the LWS (you will always need Nl+x memory cells, as there's no point if every work item doesn't have its own slots).
                            • How much LDS you use is one factor on how many wavefronts can execute concurrently, and whether it can even run at all.
                            • If you're using barriers, then once you go above 64 work items (i.e. one wavefront) I can only imagine it adds complication to the barrier scheduling - e.g. now the hardware needs to synchronise across multiple wavefronts (I don't know for sure whether this makes a real difference, though).
                            • Also with barriers, the compiler can no-op them if the LWS <= 64 items on a GPU since all 64 work items execute in lock-step (on AMD at least, and here you need to specify the reqd_work_group_size hint).
                            • If you're accessing global memory more than once (per address per work item) bigger work groups mean more cache pressure.
                            • Registers are also a limit on concurrency.
                            • On AMD, using a non-multiple of 64 work items means you have guaranteed wastage of ALU slots (I don't know what NVIDIA is now; I think it's a multiple of 32).

                             

                            So after a lot of experimentation, now I just use a couple of rules of thumb as a starting point:

                             

                            • Just use 64 work items as soon as you need to use LDS for any parallel stuff.
                            • Always try to use at least 64 for any problem (unless it just won't fit or requires too much address logic to make it fit - but even a serial task that needs to access an array will benefit from parallel memory loads).
                            • Use 16x16 for pixel-by-pixel oriented problems on images.

                             

                            If I have the time/interest/inclination I might fiddle further, but after writing a lot of kernels you run out of time (or interest!) to experiment every time trying to get the absolute optimal performance, and in practice the rules above normally get a good result anyway.  Trying to do too much can sometimes over-specify the problem and over-tune it to a specific bit of hardware and a specific test case.  One thing you get from experience is a sense of what performance to expect: if you know you're an order of magnitude below par, it's worth the effort to keep working on the problem, but eventually you hit a point of diminishing returns where the small gains possible just aren't worth it - and the thing with GPU code is that it's easy to leave 10x the performance on the floor without knowing it.  I'm not satisfied unless I get at least 10x the contemporary CPU performance for a given algorithm, and not happy unless I get at least 100x for parallel ones.

                             

                            If you're not using LDS, then to some extent it just doesn't matter a lot: the GPUs execute wavefronts, and whether a wavefront is part of a wider work-group, another work-group, or another iteration of the same work-group, there isn't much (without knowing internal hardware details) to tell them apart (e.g. try a simple array-addition problem and adjust the work size from 64-512 in steps of 64; the differences will be measurable but not huge).  As soon as you use LDS (and it's an awesome addition that really makes OpenCL worth worrying about) the landscape changes totally.  Here LDS becomes a precious commodity, the overheads of barrier() are usually worth removing, and smaller work-groups can be a big gain.

                             

                            In the example above, using 64 makes the internal loop more efficient:

                             

                            • The partial sum is just kept in a register the whole time, so you only need 1 register per N/64 results, rather than N.
                            • Likewise fewer complex address calculations are required.
                            • The parallel sum only needs to sum up 64 items and only needs 64 items worth of LDS to do it.
                            • The barriers are compiled to nothing (with the right hint)
                            • The parallel sum has fixed loop indices so can be optimised/unrolled better by the compiler (although it's only a small part of the execution time in such a bit of code, and the design makes it a smaller part of the execution time the larger the problem gets).
                            • You're using a tiny bit of LDS, so the only impediment to concurrent work-groups on the same CU is the register load.

                             

                            This internal/for free problem-size-reduction aspect is a very useful optimisation trick.

                          • Re: Optimizing Kernel
                            chaosed0

                            After going over the code more thoroughly, I'm a bit confused. When you say to put the inner loop into the outer loop, I assume you mean like this:

                             

                             local double fsum[64 * 4];
                             int lx = get_local_id(0);
                             int i = get_group_id(0);
                             while (i < numParticles) {
                               for (int b2 = 0; b2 < numBlocks; b2++) {
                                 if (interactions[b2*numBlocks+b1] > NO_INTERACTION) {
                                   // inner loop
                                   int j = lx;
                                   double4 forceSum = 0;
                                   while (j < numParticles) {
                                     // do innermost calculation of (i, j) result,
                                     // update forceSum with result
                                     j += 64;
                                   }
                             
                                   // save partial sum, n = {0, 1, 2}
                                   fsum[lx + 64 * n] = forceSum.sn;
                                 }
                               }
                             
                               i += get_num_groups(0);
                             }

                             

                             However, I'm not really sure how to interpret this. The interactions[] if statement should cull out some computations between blocks that are too far apart, but it seems like the inner while loop just does the same thing every time, no matter what b2 is. If this is what you meant by "the inner for loop has no dependencies," then I'm not sure it's correct; if you look at the original code, jBegin = b2*tileSize and jEnd = jBegin + tileSize, so the inner for loop becomes:

                            for(int j = b2*tileSize; j < b2*tileSize+tileSize && j < numParticles; j++)

                            { /*Force and collision calculations*/ }

                             The two variables are not explicitly declared this way, but they are meant to serve this purpose - I might have obfuscated that in an attempt to not repeat the multiplication.
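                             For what it's worth, the tile-pruned loop structure being discussed can be sketched serially in plain C like this. The function name, the unit "force" contribution, and the interactions layout are illustrative assumptions, just to show which j values each b2 contributes:

```c
/* Serial sketch of the pruned inner loop: for tile b1, only tiles b2 marked
 * as interacting contribute their particles j = b2*tileSize .. +tileSize-1. */
static double sum_over_tiles(const double *val, int numParticles,
                             const int *interactions, int numBlocks,
                             int b1, int tileSize)
{
    double s = 0.0;
    for (int b2 = 0; b2 < numBlocks; b2++) {
        if (!interactions[b2 * numBlocks + b1])   /* tile-level pruning */
            continue;
        for (int j = b2 * tileSize;
             j < b2 * tileSize + tileSize && j < numParticles; j++)
            s += val[j];                          /* stands in for the force calc */
    }
    return s;
}
```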

                            The other thing I'm confused about is the calculation of the global work size. I always thought of a work-item as a single particle, or perhaps a single block, which is obviously not the case here.

                            Thanks for your help so far, notzed - sorry for not giving you anything in return except for more pleas for help, but I really appreciate it.

                             

                            EDIT: One more remark: It seems like this partial sum method would reduce the number of writes, but what about the reads? Since the "innermost (i, j) calculation" remains unchanged, that means I'm still reading the particle j info many times.

                             


                              • Re: Optimizing Kernel
                                notzed

                                Ahhh right, I'm sorry - I didn't notice that detail; I just assumed it was processing all other particles (which in hindsight doesn't make sense).  I didn't really have time to write your whole routine for you, just give you some ideas ;-)

                                 

                                The only real difference is that j is initialised to lx + b2 * tileSize rather than lx, and ends sooner.  Except that if your tile size is only 32, you don't need to loop at all - but then it won't work very efficiently and you can't use all 64 threads (you could do two 32-lots at once, but then the summation is a bit more complex - though not much).  Changing the tile size to 64 and removing the loop would be the easiest ...

                                 

                                What I meant by no dependencies is that each loop iteration stands alone - its only output is the summand, and there is no feedback between iterations.  i.e. no value from step j-1 is used by step j; mathematically it's just a sum, not a recurrence.

                                 

                                BTW you don't really need to worry about repeating the multiplication or not - the compiler will handle this for you.  I usually factor it out just to save typing, particularly if any of the values are going to change.  But often it will compile into exactly the same code either way.

                                 

                                The global work size is per work-item, but if you're setting the local work size, it has to be a multiple of it.  So you have gws/lws work-groups, and each work-group executes on the same compute unit (or the same core on a CPU driver), which is what allows LDS to work, etc.  If you use a 'persistent kernel' approach as I listed, you move the range calculation into the kernel, and all you're doing is trying to saturate the whole device with the task, on the assumption that you're doing a good amount of work and it won't be wasteful.
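                                The persistent-kernel striding can be sanity-checked on the host. This little C sketch (the sizes are arbitrary) emulates G work-groups each starting at their group id and striding by G, and confirms every particle index gets processed exactly once:

```c
#include <stdlib.h>

/* Emulates: i = get_group_id(0); while (i < n) { work(i); i += get_num_groups(0); }
 * run by G groups, counting how often each index is visited. */
static int covered_exactly_once(int n, int G)
{
    int *count = calloc((size_t)n, sizeof *count);
    if (!count) return 0;
    for (int g = 0; g < G; g++)
        for (int i = g; i < n; i += G)
            count[i]++;            /* this index handled by group g */
    int ok = 1;
    for (int i = 0; i < n; i++)
        if (count[i] != 1) ok = 0;
    free(count);
    return ok;
}
```

                                Note it also holds when G exceeds n - the surplus groups simply fall straight through their while loop.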

                                 

                                As to your last question: without changing the algorithm you can't reduce the calculations - at the lowest level it's still O(N^2).  But doing it this way takes advantage of the hardware characteristics.  The main difference is that the inner loop you had read memory in pretty much the worst possible way, whereas this one reads memory coalesced (i.e. each adjacent work item reads an adjacent memory cell).  It's just rotated the calculation order by 90 degrees.

                                  • Re: Optimizing Kernel
                                    chaosed0

                                    Alright, I think after reading through everything again, and with this new post, I understand how it is supposed to work. However, I implemented the kernel and it's running slower than before! I'm not sure where the problem lies - perhaps I still don't quite understand.

                                    Interestingly, the kernel runs fastest when the outer loop is completely eliminated, i.e. the "parallelization factor" is equal to the number of particles (the global work size, then, is 64*numParticles). Maybe the slowdown comes, then, because reading the information of particle 'i' is not coalesced anymore?

                                    I attached the new kernel and profiler info. I'll continue trying to see what's going on in the meantime.

                                     

                                    By the way I modified the parallel reduction sum because I tested it by itself and it didn't seem like it was working. Other than that, though, I haven't tested anything for correctness yet - just speed.

                                     

                                    EDIT: I accidentally multiplied the number of threads by 20 (it should be 4096000, not 80 million) in the profiling result, but I just ran it again and it's not too much different than the one I attached.

                                      • Re: Optimizing Kernel
                                        notzed


                                        Ahh slower - blast.

                                         

                                        Maybe reading i non-coalesced doesn't help, but it doesn't happen very often so shouldn't matter much.  I guess with the outer loop kept, the difference is only minor?  It was just an idea anyway - it could be removed simply enough; it's necessary when you don't know the work size at enqueue time (e.g. a queue filled by a kernel), but probably not needed here, now I think about it.

                                         

                                        Is this the same problem as the initial profiler output?  20x slower?  Ouch.  I find it strange that it's slower than the original code, because the memory access in the internal loop should be much better and the parallelism is higher.  Particularly that it's so much slower - but I've been surprised before.  The cache hit ratio and some other profiling points look better ...  I presume you made tileSize 64 here, otherwise you're doing twice as much work.

                                         

                                        I just spotted a bug: forceSum is defined twice; it should only be defined just inside the first loop.  I'm surprised it isn't just optimising everything away - actually, if you haven't fixed that in the code you ran, it probably makes all the profiling results useless, as every result of forceSum will be 0.

                                         

                                        Is the range of b1 and b2 the same?  Could you perhaps invert the access, so that the interactions test is accessing adjacent items?  It won't be coalesced, but it should be cached.  Can't imagine that would make much difference, though.

                                         

                                        This:

                                         


                                        if(collide)
                                        {
                                            vel[i] = (iVel * massDiff + 2 * jMass * jVel) / massSum;
                                        
                                            //The particles may be too close to each other - move them apart
                                            pos[j] = iPos + normalVec;
                                            dist = radSum;
                                        }

                                         

                                        vel[i] is the same for all work items, and this could cause write-conflict serialisation (I think?).  Not sure how you'd fix this, as your original code just overwrote it, keeping the value from the highest j that had a collision - but somehow it could be moved outside of the b2 loop.  (I would try the thing below first, though.)

                                         

                                        Hmm, maybe the common read of oldPos[i] and so on is also causing channel conflicts (see section 6.1.2.2, 'reads of the same address', of the AMD APP programming guide).  Actually, the more I think about it, the more likely this seems, given you're on Cypress - and this would be a huge performance hit (I'm not sure how one can tell from the sprofile output).  Try only reading xx[i] in local work item 0 and sharing it using LDS.  Either go through the array you have or just declare the variables local - you have plenty of LDS to spare.

                                         

                                        e.g. change the start of the first loop to something like this:

                                        local double4 iPos;
                                        local double4 iVel;
                                        ... etc.
                                        
                                        if (lx == 0) {
                                            iPos = oldPos[i];
                                            iVel = ...;
                                            // etc.
                                        }
                                        barrier(CLK_LOCAL_MEM_FENCE);

                                         

                                        Dunno what else to suggest - make sure that forceSum thing is correct, and if with the read of xx[i] changed it is still 20x slower, this all looks like a big dead-end and i'd be pretty somethinged if it didn't help.

                                          • Re: Optimizing Kernel
                                            chaosed0

                                            "Blast," eh? Says something different in the email I got.

                                            You're right, I totally overlooked that double declaration of forceSum. However, after removing the local one and switching the tile size to 64 (another thing I overlooked), the time was cut to 250ms per kernel run, but no further. I also changed the accesses of the private vars so that only one thread reads them, also to no avail. Oh, and I switched b1 and b2.

                                            But get this - when I comment the whole if(forceInter) block out, the kernel speeds up drastically, from 250ms to 4ms per run. As soon as I put the statement "forceAdd += forceSum" back in the inner for loop, with the if(forceInter) block still commented out, the kernel time went right back up to 250ms.

                                            Now, I suppose this could indicate a myriad of things. Maybe the compiler recognizes that the variables involved in the calculation of forceSum are in no way relevant, and stops calculating them. That seems unlikely, though, because I assigned a constant to forceAdd and commented out the calculation of all the other variables involved in it, and it only got a little faster - and I think that speedup is just because the kernel stopped using a lot of its registers, so many more wavefronts can run.

                                            The investigation continues. I will edit if I figure anything out.

                                            I really wish OpenCL had better profiling tools, so you could see exactly where in the code bottlenecks are occurring. Well, we work with what we have.

                                             

                                            EDIT: I switched over to the Windows partition and ran KernelAnalyzer; for what it's worth, I'm attaching the GPU ISA of the kernel. I'm pretty terrible at understanding assembly - more so when it's not from a CPU - but there's a pretty huge chunk of ALU clauses full of MOV, ADD_64 and MUL_64 that goes missing when the if(forceInter) block is commented out, as opposed to when it's in the code. Maybe it'll be helpful, no idea.

                                             

                                            EDIT2: Commenting out the assignment of forceSum to the fsum array also has the same effect. Posting the source I'm using now as well.

                                            As an aside, everything runs slower on windows - host code and OpenCL code both.

                                             


                                              • Re: Optimizing Kernel
                                                notzed

                                                Re blast: yeah, I was trying to work out what word this silly software didn't like ... as if the terrible editor wasn't frustrating enough!

                                                 

First i'll just say that if you remove any calculation, the optimiser WILL remove all the redundant code that isn't required. i.e. it will trace the data flow, and if a dependency chain never reaches local or global memory, that whole sub-graph will be eliminated.
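A minimal illustration of the elimination notzed describes (a hypothetical kernel, not from the thread - the names and loop are invented for the example):

```c
// Hypothetical OpenCL kernel showing dead-code elimination.
// 'acc' is computed but never written to local or global memory,
// so the compiler can trace the data flow, see the result is
// unused, and delete the entire loop - making the kernel look
// "faster" while doing no work at all.
__kernel void dead_code_demo(__global const double *in,
                             __global double *out)
{
    int gid = get_global_id(0);
    double acc = 0.0;
    for (int i = 0; i < 1024; i++)
        acc += in[gid] * (double)i;  // eliminated: acc is dead

    out[gid] = 1.0;       // swap this for 'out[gid] = acc;'
                          // and the loop must stay
}
```

This is why timing a kernel after commenting out a store can be wildly misleading: you may be timing an empty kernel.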

                                                 

So when comparing timings, make sure you're getting the right result in each iteration: bugs can also cause redundant code to be thrown away, which throws the timings right out, not to mention changes the memory access patterns and so on.  i've been caught out on this a few times - testing minor changes without verifying the results - and you end up just wasting your time.

                                                 

                                                I'm not super familiar with that isa, but it looks reasonable to me.  All those FMA's and MUL's just implement your equation (and sqrt() needs a few), so if they're not there you're not getting much done ...

                                                 

                                                I'm still trying to work out why this code would run so much slower than your first iteration, because assuming you're getting the same results, as far as i can tell this ticks all the boxes for 'good gpu code'.  i.e. memory access coalesced, no divergent branching, simple loop that all work items execute, etc.  Unless i'm missing something big ...

                                                 

                                                There is so much arithmetic even bad memory access patterns can probably be hidden to a great extent.  I've not done any double stuff, so i'm not sure how much of a hit that is either.

                                                 

The interactions[] access also has every work item reading the same address, and maybe that causes bank conflicts. i'd be surprised if changing that made much difference even if it is in the inner loop, but i can't see anything else to try ...

                                                 

                                                e.g.

local int work;

for (int b2 = 0; b2 < numBlocks; b2++) {

  barrier(CLK_LOCAL_MEM_FENCE);

  if (lx == 0) {

    work = interactions[b1*numBlocks + b2] > NO_INTERACTION;

  }

  barrier(CLK_LOCAL_MEM_FENCE);

  if (work) {

  ...

  }

}

                                                  • Re: Optimizing Kernel
                                                    chaosed0

On the optimizing-away of the calculation... Yeah, I wasn't sure whether I was really onto anything, but honestly right now I'm just chasing random threads in the hope that they get me somewhere. I don't really have any idea of what to try next either. I just tried placing the interactions test outside the inner loop and got a small boost, but the time still remains slower than my old code.

If there are no other ideas, please don't feel obligated to keep trying. You've already proved yourself an awesome person.

                                                      • Re: Optimizing Kernel
                                                        binying

                                                        Is processInteractions2.cl the latest code you want to optimize? If you haven't found a solution yet, I'd like to take a look at this.

                                                         

                                                        By the way, there is an OpenCL example called n-body in AMD APP SDK, which should be useful for you.

                                                          • Re: Optimizing Kernel
                                                            chaosed0

                                                            Sorry for the long delay - I was away for the weekend.

                                                             

                                                            Yes, the processInteractions2.cl is the latest code. However, if you want to start from the beginning, you can look at the very first .cl file I posted.

As to the SDK sample, I'm not sure how it's computing time when it reports statistics. I run the program with "NBody -x 32768 -i 10 -q -t", but when I increase the iterations (-i) to 100, or even 1000, the program takes much longer yet reports back approximately the same kernel time/total time. Anyway, if the total running time of the program (wall-clock) is the actual time it spends solving the problem, it's probably as slow as or slower than my original code.

                                                             

                                                            EDIT: After looking through this thread (http://devgurus.amd.com/message/1141259#1141259), it looks like it's possible to solve this problem reasonably well. However, there's a lot of theory in that thread, but not many results; moreover, it was written two years ago, and everyone's talking about CAL.

                                                              • Re: Optimizing Kernel
                                                                binying

                                                                when you run the n-body sample with argument "-t", where "-t" means "print timing", you get

                                                                 

                                                                -----------------------------

                                                                Particles                Iterations               Time(sec)                kernelTime(sec)

                                                                1024                     500                      0.207416                 0.000729055

                                                                ------------------------------

                                                                 

                                                                kernelTime (sec) is defined as,

                                                                kernelTime = (double)(sampleCommon->readTimer(timer)) / iterations; --line 771, nBody.cpp

                                                                 

                                                                while "Time (sec)" is actually,

                                                                totalTime = setupTime + kernelTime; -- line 832, nBody.cpp

                                                                 

Since kernelTime is divided by "iterations", the program always reports approximately the same kernel time/total time.

                                                                I am still working on your code. On the other hand, have you calculated the maximum/ideal speed-up you can achieve and compared with what you've got?

                                                                  • Re: Optimizing Kernel
                                                                    chaosed0

                                                                    Alright, thanks for clearing that up. That means that their code is quite a bit slower than mine.

I haven't calculated the ideal time because, to be honest, I'm not sure how to. For one, I don't know how many threads are actually in flight, because of the wavefront swapping that happens when one of them stalls. Secondly, I'm not sure how many cycles an instruction takes - and I don't even know how to count the instructions in my program. Perhaps there's another approach for measuring ideal time that I'm missing?

                                                                      • Re: Optimizing Kernel
                                                                        binying

                                                                        At this point, would you mind telling me your email address? I am pretty sure that I can help you.

                                                                         

                                                                         

(1) The method to compute the theoretical kernel execution time:

(Br + Bw) / (theoretical memory bandwidth)

where Br is the number of bytes you need to read from global memory, and

Bw is the number of bytes you need to write to global memory.

 

(2) The method to get the theoretical bandwidth of global memory (which also bounds reads from global into local memory):

(a) Open the Catalyst Control Center.

(b) Find the "Information" option and open it.

(c) Select the "Hardware" option; a table will appear.

(d) Note the last item: total bandwidth of video memory, e.g. 28.8 GB/sec.

(e) If that item is not listed, you can compute the value with the following formula:

MemoryClock * (bit width of the memory interface / 8) * constant / 10^9 (GB/sec), where the constant is usually 2 or 4 and depends on which GPU you are using.

 

For example, with a 900 MHz memory clock and a 512-bit wide memory interface, the theoretical memory bandwidth is:

900*10^6 * (512/8) * 2 / 10^9 = 115.2 GB/s.


                                                                         


                                                • Re: Optimizing Kernel
                                                  chaosed0

                                                  In unrelated questions, how the heck do I stop this thread from becoming "Assumed answered" after some amount of time? To change it back to unanswered, I have to mark an answer correct and then unmark it, which is spectacularly obtuse for forum software.

                                                    • Re: Optimizing Kernel

                                                      I'm sorry. After reading your original post and all the responses, I thought your original questions were answered by the replies. So I marked the thread Assumed Answered. 

                                                       

                                                      However, you are correct in that the way to change a thread from Assumed Answered to Not Answered is to "to mark an answer correct and then unmark it."

                                                    • Re: Optimizing Kernel
                                                      chaosed0

                                                      Ok, here is the final result. Sadly, it still does not use local memory because every time I tried, it simply slowed the program down more. Notzed had some great input, and Binying suggested some helpful ideas, but in the end, we could not find anything that sped the kernel up by much.

Essentially, all I did was change the kernel so that, in each iteration of the inner for loop, the wavefront reads particles b2*tileSize through b2*tileSize + tileSize, with each work item offsetting the particle index 'j' by its local index. That's the only change I found that actually sped the thing up by any amount.
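A sketch of that access pattern (hypothetical names - tileSize, numBlocks, and the pos array are stand-ins, since the actual kernel is in the attachment):

```c
// Each work item offsets the particle index by its local id, so the
// tileSize work items of a wavefront touch tileSize consecutive
// particles per iteration - one coalesced global read instead of
// tileSize scattered ones.
int lx = get_local_id(0);               // 0 .. tileSize-1
for (int b2 = 0; b2 < numBlocks; b2++) {
    int j = b2 * tileSize + lx;         // consecutive addresses across the wavefront
    double4 pj = pos[j];                // coalesced global read
    // ... accumulate this work item's force contributions against pj ...
}
```

The speedup comes purely from coalescing: the data fetched and the arithmetic performed are the same as before, only the ordering of the global reads changes.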

                                                      I'm going to conclude by saying that this type of algorithm is probably not a good fit for the GPU. The "pruning" branches necessary for the algorithm to go fast do not work well in the kernel, and memory accesses jump around too much. I'm going to stick with this algorithm for now and try to get it working on a cluster, see if it helps at all (somehow I doubt it, but hey, just doing my job) and work from there.

                                                      No one really got the correct answer, so I'm just going to mark the question assumed answered.