I want to exchange data between the threads of a wavefront via LDS, but I'm not sure I'm getting the maximum performance. Unfortunately Stream Profiler is not working on my platform (Windows 7 64-bit).
I have read that the LDS is composed of 32 banks of width 32 bits. A bank cannot process more than one access per clock. So I'm trying to provide a DWORD offset based on thread ID.
But what exactly happens when my IL kernel executes an lds_store instruction? Will there be four sets of 16 write accesses, with thread IDs:
First set: IDs 15..0
Second set: IDs 31..16, and so forth? Or is it completely different?
Any help greatly appreciated.
Since you are responsible for computing the addresses in LDS that are used for storing data, you can create any access pattern. The only constraint is that the address you specify falls within the allowable range of addresses for the LDS allocation. e.g. an allocation of 256 bytes allows each of 64 work items to write 4 bytes, i.e. a float or an int.
So, for example, you could just use the work item ID linearly to decide on the address. So address 0 for work item 0, address 4 for work item 1, etc.
16 work items will store simultaneously. As long as the addressing pattern you've chosen is simple, like the one I've suggested, there will be no bank collision and it will run at full speed, i.e. all 64 work items will complete in 4 clocks.
If the workgroup size is larger than 64 work items, then it will take longer.
Store can actually write 8 bytes per clock - i.e. each work item can write two 4-byte values to LDS. This requires the "vec" version of the instruction (which can write 1, 2, 3 or 4 values to LDS). This also requires that the two addresses being stored in LDS are consecutive (e.g. 16 and 20 are consecutive 4-byte addresses). If you write 3 or 4 values to LDS then the operation is performed over two successive instruction clocks (i.e. 8 physical clocks).
Note you don't specify each address in this situation, you specify the base address for the vector store. Then you use a mask, e.g. .xy or .xyzw with the lds_store_vec instruction to specify which of the 16 bytes are written.
Loads work much the same. Two loads can be performed in one cycle without the "vec" version of the instruction. In this case the compiler is simply scheduling two separate loads as one instruction. You can use lds_load_vec too.
Bank conflicts only arise when, within a single physical clock cycle, two or more of the 16 work items access different addresses in the same bank. So, for example, if work items 0 to 7 access addresses 0, 4 ... 28 and work items 8 to 15 access addresses 128, 132 ... 156, you'll get a bank conflict for all of 8 to 15: with 32 banks of 4 bytes each, the banks repeat every 128 bytes, so address 128 lands in the same bank as address 0, and so on. That will make the instruction take twice as long.
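The bank-conflict rule above can be checked with a small model. This is only a sketch in Python, not anything the hardware or the tools provide: it assumes the bank is simply (byte address / 4) mod 32, takes the 32 banks of 32 bits and the 16 work items per physical clock from this thread, and uses made-up helper names (bank_of, conflict_factor, wavefront_factors). Repeating the conflicting pattern for every group of 16 work items is also an assumption, since the example only describes work items 0 to 15.

```python
from collections import defaultdict

NUM_BANKS = 32         # from the thread: 32 banks
BANK_WIDTH = 4         # from the thread: each bank is 32 bits wide
ITEMS_PER_CLOCK = 16   # from the thread: 16 work items per physical clock


def bank_of(byte_address):
    # Assumption: consecutive dwords map to consecutive banks.
    return (byte_address // BANK_WIDTH) % NUM_BANKS


def conflict_factor(addresses):
    """Largest number of distinct dwords that hit one bank in one clock."""
    per_bank = defaultdict(set)
    for address in addresses:
        per_bank[bank_of(address)].add(address // BANK_WIDTH)
    return max(len(dwords) for dwords in per_bank.values())


def wavefront_factors(address_of, wavefront_size=64):
    """Conflict factor for each group of 16 work items in a wavefront."""
    factors = []
    for start in range(0, wavefront_size, ITEMS_PER_CLOCK):
        group = [address_of(i) for i in range(start, start + ITEMS_PER_CLOCK)]
        factors.append(conflict_factor(group))
    return factors


def split_pattern(i):
    # Work items 0..7 -> 0, 4 ... 28; work items 8..15 -> 128, 132 ... 156.
    # Assumption: every later group of 16 repeats the same pattern.
    lane = i % ITEMS_PER_CLOCK
    return 4 * lane if lane < 8 else 128 + 4 * (lane - 8)


linear = wavefront_factors(lambda i: 4 * i)    # address = work item ID * 4
conflicting = wavefront_factors(split_pattern)
print("linear:     ", linear)
print("conflicting:", conflicting)
```

In this model a factor of 1 means the group of 16 completes in one clock, and a factor of 2 corresponds to the "twice as long" case described above. It says nothing about several work items hitting the same address, which is a separate question.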
Thanks a lot for your answer. It is very helpful.
"16 work items will store simultaneously."
That was clear to me, what I wanted to know was how these 16 work items are identified. But from your answer I conclude that they have thread IDs 0 to 15, then 16 to 31 etc.
(Given 64 threads per wavefront one could also assign thread IDs differently, first per processor, in this case the sequence would be 0, 4, 8 etc. I was wondering about these issues since I used the thread numbering as above but I still get low performance.)
Considering lds_store_vec: let's assume an LDS address of ID*8 and a mask of .xyzw. What would be the sequence of accesses? Using clock #: ID# -> (bank#, bank#):
clock 0: 0 -> (0,1); 1 -> (2,3); ... 15 -> (30,31)
clock 1: 0 -> (2,3); 1 -> (4,5); ... 15 -> (0,1)
clock 2: 16 -> (0,1); 17 -> (2,3); ... 31 -> (30,31)
Do you think the above scheme is accurate? Considering only bank conflicts, this would work well.
You should use GPUSA or SKA to see how your code compiles to instructions. Are you familiar with these?
The compilation will take the form of the attached code, where R1.z is the address for the first 8 bytes and R1.w is the address for the second 8 bytes. So R0.xyzw is written in two instruction cycles.
So the first 4 physical cycles are for instruction 10 and for work items 0...15, 16...31, 32...47 and 48...63. Then instruction 11 does the same.
To test the throughput of your kernel you might want to set the store and/or the load address to a constant, e.g. 0. This way all reads and writes will run without bank conflict. This will tell you the maximum performance of your kernel.
10  x: LDS_WRITE_REL  ____, R1.z, R0.x, R0.y
    y: [other stuff might go here]
11  x: LDS_WRITE_REL  ____, R1.w, R0.z, R0.w
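To make the ordering described above concrete, here is a rough Python sketch of the schedule. It assumes the base address ID*8 from the question, assumes that R1.w is simply R1.z + 8, and reuses the assumed bank mapping (byte address / 4) mod 32. The schedule helper and the row layout are made up for illustration.

```python
NUM_BANKS = 32
BANK_WIDTH = 4
ITEMS_PER_CLOCK = 16
WAVEFRONT = 64


def schedule(base_address_of):
    """List (clock, instruction, first work item, bank pairs) per clock."""
    rows = []
    clock = 0
    # Instruction 10 writes the first 8 bytes, instruction 11 the next 8.
    # Assumption: the second address is the first address plus 8.
    for instruction, byte_offset in ((10, 0), (11, 8)):
        for start in range(0, WAVEFRONT, ITEMS_PER_CLOCK):
            banks = []
            for i in range(start, start + ITEMS_PER_CLOCK):
                dword = (base_address_of(i) + byte_offset) // BANK_WIDTH
                banks.append((dword % NUM_BANKS, (dword + 1) % NUM_BANKS))
            rows.append((clock, instruction, start, banks))
            clock += 1
    return rows


rows = schedule(lambda i: 8 * i)  # LDS address = ID * 8, as in the question
for clock, instruction, start, banks in rows:
    distinct = len({b for pair in banks for b in pair})
    print(f"clock {clock}: instr {instruction}, IDs {start}..{start + 15}, "
          f"first {banks[0]}, last {banks[-1]}, distinct banks {distinct}")
```

Under these assumptions each of the 8 physical clocks touches all 32 banks exactly once, so the pattern is free of bank conflicts. The bank pairs match the ones listed in the question, but the order differs: all four groups of 16 finish instruction 10 before any of them starts instruction 11.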
Whoops, setting the store address to 0 might make it slow down by a factor of 16, as it will try to write the values one after the other. Not sure, to be honest.
A constant read address for all work items should be fine though as that will be a "broadcast".
> You should use GPUSA or SKA to see how your code compiles to instructions. Are you familiar with these?
Honestly, no. People keep telling me that working in IL is already insane...
On the other hand, without intimate knowledge of these minute details your performance can easily drop to 1/10 or worse. How do people write efficient kernels at a higher level, such as Brook or OpenCL? Are the compilers that smart?
Well, didn't want to start a discussion, at any rate, thanks!
I highly recommend using SKA for coding. Sometimes it's easier to spot bugs in your code by looking at the ISA.
Do you have the AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf? Since it is incomplete you should also refer to R700-Family_Instruction_Set_Architecture.pdf which gives more background information.