When you run the n-body sample with the argument "-t" (print timing), you get:

Particles  Iterations  Time(sec)  kernelTime(sec)
1024       500         0.207416   0.000729055
kernelTime(sec) is defined as:
kernelTime = (double)(sampleCommon->readTimer(timer)) / iterations; // line 771, nBody.cpp
while Time(sec) is actually:
totalTime = setupTime + kernelTime; // line 832, nBody.cpp
Since kernelTime is divided by "iterations", the program reports approximately the same per-iteration kernel time (and total time) no matter how many iterations you run.
I am still working on your code. In the meantime, have you calculated the maximum/ideal speed-up you can achieve and compared it with what you've got?
Alright, thanks for clearing that up. That means that their code is quite a bit slower than mine.
I haven't calculated the ideal time because, to be honest, I am not very sure how to. For one, I don't know how many threads are started because of the wavefront-swapping when one of them stalls. Secondly, I'm not sure how many cycles/time an instruction takes - and I don't even know how to count instructions in my program. Perhaps there's another approach for measuring ideal time that I'm missing?
In unrelated questions, how the heck do I stop this thread from becoming "Assumed answered" after some amount of time? To change it back to unanswered, I have to mark an answer correct and then unmark it, which is spectacularly obtuse for forum software.
At this point, would you mind telling me your email address? I am pretty sure that I can help you.
(1) The method to compute the theoretical kernel execution time:
(Br + Bw)/(theoretical memory bandwidth)
where Br is the number of bytes the kernel needs to read from global memory, and
Bw is the number of bytes the kernel needs to write to global memory.
(2) The method to get the theoretical bandwidth of global memory (or the bandwidth of reading from global memory into local memory):
(a) Open the Catalyst Control Center.
(b) Find out the “Information” option, then open this option.
(c) Select the “hardware” option, and there will be a table.
(d) Note the last item, "Total bandwidth of video memory", e.g., 28.8 GB/sec.
(e) If this item is not listed, you can compute the value with the following formula:
MemoryClock * (bit width of the memory interface / 8) * constant / 10^9 (GB/sec), where the constant is usually 2 or 4 and depends on what GPU memory you are using.
For example, with a 900 MHz memory clock, a 512-bit memory interface, and a constant of 2, the theoretical memory bandwidth is 900 * 10^6 * (512/8) * 2 / 10^9 = 115.2 GB/sec.
I'm sorry. After reading your original post and all the responses, I thought your original questions were answered by the replies. So I marked the thread Assumed Answered.
However, you are correct that the way to change a thread from Assumed Answered to Not Answered is to "mark an answer correct and then unmark it."
Alright, thanks for your input, Kristen. The issue isn't quite resolved yet - Binying and I are trading emails, but I think I will post my final solution before I mark the thread answered.
Ok, here is the final result. Sadly, it still does not use local memory because every time I tried, it simply slowed the program down more. Notzed had some great input, and Binying suggested some helpful ideas, but in the end, we could not find anything that sped the kernel up by much.
Essentially, all I did was change the kernel so that, on each iteration of the inner for loop, the wavefront reads particles b2*tileSize through b2*tileSize + tileSize - 1, offsetting the particle 'j' index by the local index. That's the only change I found that actually sped the thing up by any amount.
I'm going to conclude by saying that this type of algorithm is probably not a good fit for the GPU. The "pruning" branches necessary for the algorithm to go fast do not work well in the kernel, and memory accesses jump around too much. I'm going to stick with this algorithm for now and try to get it working on a cluster, see if it helps at all (somehow I doubt it, but hey, just doing my job) and work from there.
No one really got the correct answer, so I'm just going to mark the question assumed answered.