I am running the NBody sample from the AMD APP SDK.
To give a brief introduction: the sample simulates a large number of particles. Each work-item is assigned the calculation for a single particle.
According to the algorithm, each work-item needs to read the complete buffer that stores the particle positions, so all work-items access the same buffer elements as soon as they start. This should result in channel conflicts (right?), since all work-groups want to access the same data elements, which map to the same memory channel.
But when I profile the application with (-x 10240) on Cypress/Cayman, I get a FetchUnitStalled value of zero. Does that mean the data is being broadcast to all compute units, or am I missing something?
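To make the concern above concrete, here is a minimal sketch of how a byte address might map to a memory channel. The numbers are assumptions and are not taken from this thread: 8 channels, a 256-byte channel interleave, and 16-byte (float4) positions. `channel_of` and `channels_touched` are hypothetical helper names, and the model idealizes all work-items as running in lockstep.

```python
# Hypothetical model of address -> memory channel mapping.
# Assumed parameters (not from the thread): 8 channels, 256-byte interleave,
# one float4 (16 bytes) per particle position.
NUM_CHANNELS = 8
CHANNEL_INTERLEAVE_BYTES = 256
ELEMENT_BYTES = 16


def channel_of(element_index):
    """Return the channel that would serve the given position element."""
    byte_address = element_index * ELEMENT_BYTES
    return (byte_address // CHANNEL_INTERLEAVE_BYTES) % NUM_CHANNELS


def channels_touched(num_work_items, step, shared_read):
    """Set of channels hit by all work-items at one loop iteration.

    shared_read=True  -> every work-item reads element `step`
                         (the N-body inner-loop pattern)
    shared_read=False -> work-item i reads element i + step
                         (a streaming pattern, for comparison)
    """
    channels = set()
    for work_item in range(num_work_items):
        index = step if shared_read else work_item + step
        channels.add(channel_of(index))
    return channels


same = channels_touched(num_work_items=256, step=0, shared_read=True)
spread = channels_touched(num_work_items=256, step=0, shared_read=False)
print(len(same), len(spread))
```

Under these assumptions the shared-read pattern touches a single channel per iteration, while the streaming pattern spreads over all of them. Whether that actually shows up as stalls depends on caching and broadcast behaviour in the hardware, which is exactly the open question here.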
The NBody example is really badly implemented from an optimization point of view.
There is an old post about it somewhere in the forum. You can find an optimal implementation (95% of the card's peak performance) in the examples of the CAL++ library.
Thanks for replying. I know it has not been improved for a very long time.
But my question is: does the data get broadcast if all work-groups try to access it simultaneously, or should there be channel conflicts?
I can confirm that broadcast works on 5xxx with LDS (local memory) and with the TU (Texture Unit, i.e. images). It doesn't work when reading from global memory using UAVs (standard memory reads).
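One common way to take advantage of LDS broadcast (not described in this thread, so treat it as an assumption about how an optimized kernel could be organized) is tiling: each work-group copies a tile of positions into local memory once, and all its work-items then read that tile from LDS. A rough count of global memory reads for both approaches, with hypothetical function names:

```python
def global_reads_naive(num_particles):
    """Every work-item reads every position directly from global memory."""
    return num_particles * num_particles


def global_reads_tiled(num_particles, group_size):
    """Each work-item loads one position per tile into local memory;
    all other reads of that tile are served from LDS."""
    num_tiles = -(-num_particles // group_size)  # ceiling division
    return num_particles * num_tiles


n = 10240            # particle count used in the question (-x 10240)
group_size = 256     # assumed work-group size, not stated in the thread
naive = global_reads_naive(n)
tiled = global_reads_tiled(n, group_size)
print(naive, tiled, naive // tiled)
```

With these numbers the tiled variant issues fewer global reads by a factor equal to the work-group size. This only counts reads; it says nothing about how caches or the texture path would treat the naive version.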
Thanks, hazeman, for sharing your experience. I was also writing some tests, and the results seem to match what you said.
I am still working on finding the impact of channel conflicts on global memory access.