In the appendices of the AMD APP Programming Guide there is a list of hardware specs for various graphics cards. The list includes the number of channels and the number of LDS banks; however, there is no information about the number of global memory banks. Where/how can I find this information?
Although the programming guide says that Cypress has 16 banks in global memory, no corresponding numbers are available for other devices.
1. I am also interested in knowing which conflict is more severe: a bank conflict or a channel conflict?
2. Can there be a situation when bank conflict occurs, but channel conflict does not?
3. It would also be very helpful if someone could tell me how much memory (accessed in a coalesced fashion) a wavefront should access to get maximum throughput.
While we're waiting for someone at AMD to address our questions, maybe this will help answer your first question. In the AMD APP SDK 2.6 you can run some tests on linear reads (uncached) by changing the offset to be increasing powers of two: AMD APP > samples > opencl > benchmarks > GlobalMemoryOptimizations. The first time the bandwidth decreases significantly is probably the channel conflict. If it then levels off before significantly decreasing a second time, I would say that second drop is the bank conflict. I tried this method, but my results were not as conclusive as I anticipated because of the latency-hiding nature of the GPU scheduler.
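To make the "look for the first and second significant drop" idea concrete, here is a minimal sketch in Python of how the benchmark output could be post-processed. The function names `power_of_two_offsets` and `find_bandwidth_drops` and the 20% threshold are assumptions for illustration only; they are not part of the AMD APP SDK, and a different threshold may suit a particular card better.

```python
def power_of_two_offsets(min_exp, max_exp):
    """Return byte offsets 2**min_exp .. 2**max_exp (inclusive) to feed the benchmark."""
    return [1 << k for k in range(min_exp, max_exp + 1)]


def find_bandwidth_drops(samples, rel_drop=0.2):
    """Return the offsets at which bandwidth fell sharply versus the previous sample.

    samples  -- list of (offset_bytes, bandwidth_gb_s) pairs, sorted by offset
    rel_drop -- hypothetical threshold: fraction of the previous bandwidth
                that must be lost for the step to count as a drop
    """
    drops = []
    for (_, prev_bw), (off, bw) in zip(samples, samples[1:]):
        # Compare each measurement with the one taken at the previous offset.
        if prev_bw > 0 and (prev_bw - bw) / prev_bw >= rel_drop:
            drops.append(off)
    return drops
```

Following the reasoning above, the first offset returned would be the candidate for the channel conflict and a later one, after a plateau, the candidate for the bank conflict. Scheduler latency hiding can blur the steps, so the result is only a hint.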
I can take a stab at your second question. A bank or channel conflict serializes the conflicting memory accesses. If the banks and channels run at the same frequency, then serializing at the top (bank conflict) causes the next lower step (channel) to be serialized as well, but the converse is not necessarily true.
Could you provide a link to the AMD APP programming guide that you are referring to?
I'm hoping I can find the person at AMD who worked on that guide, and that person can answer your questions.
Thanks Settle, for your response.
I am not convinced that GlobalMemoryBandwidth can answer my questions. The graph approach you mentioned looks good.
Please share any results if you have them. I will also try to study the global memory behaviour on my end.
Here's a chart and the raw data from the tests I ran on my AMD Radeon HD 6750M (6 compute units, 1 GB GDDR RAM, 4 channels, 128 bit memory bus width).
| Offset (Bytes) | Bandwidth (GB/s) |
It seems that a first drop starts at 8 KB and continues until 32 KB, and then a second drop starts and finishes at 256 KB. Now I've got to find the link between these numbers and the hardware.
Here's what I heard back:
"Yes, that’s correct for NI and SI.
The Global Data Share (GDS) is 64 kbyte in size and is constructed from 32 - 512deep x 32 bits wide register files with one write and one read port via a pseudo two port memories. The memory block will be constructed to have an interleaved bank address such that the lower 5 bits of the dword address will be the bank select and the upper 9 bits will be the address within the bank. The memory will be constructed such that all 32 banks can be read, written or both in one clock"
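The address split described in that reply can be sketched in a few lines of Python. The arithmetic comes straight from the quote (32 banks x 512 entries x 4 bytes = 64 KB, so a 14-bit dword address with 5 bank bits and 9 row bits). The helper names `gds_bank_and_row` and `max_accesses_per_bank` are made up for illustration, and counting accesses per bank is a simplification that assumes every access to the same bank is serialized, ignoring any broadcast of identical addresses the hardware may do.

```python
from collections import Counter

BANK_BITS = 5     # lower 5 bits of the dword address select one of 32 banks
ROW_BITS = 9      # upper 9 bits select one of 512 entries within the bank
DWORD_BYTES = 4   # each entry is 32 bits wide


def gds_bank_and_row(byte_offset):
    """Split a GDS byte offset into (bank, row) as described in the quote."""
    dword = byte_offset // DWORD_BYTES
    if not 0 <= dword < (1 << (BANK_BITS + ROW_BITS)):
        raise ValueError("offset is outside the 64 KB GDS")
    bank = dword & ((1 << BANK_BITS) - 1)
    row = dword >> BANK_BITS
    return bank, row


def max_accesses_per_bank(byte_offsets):
    """Largest number of accesses that land in a single bank.

    Assumption: accesses to the same bank are serialized, so this value is a
    rough estimate of how many clocks the whole set of accesses would need.
    """
    counts = Counter(gds_bank_and_row(off)[0] for off in byte_offsets)
    return max(counts.values())
```

For example, 32 work-items reading consecutive dwords (`range(0, 128, 4)`) touch each bank once, while a stride of 128 bytes (`range(0, 32 * 128, 128)`) sends all 32 accesses to bank 0.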