
maxdz8
Elite

[GCN] What is a "fetch" instruction?

Preface: I haven't read the GCN instruction manual yet. I'm starting to believe it's a must, but the sheer size of the document scares me a bit, and I don't plan to use IL anyway; I intend to stay on CL only.

I have a kernel which cannot saturate memory bandwidth but produces a very high ALUBusy%, in the order of 90%, with the remaining 10% likely unused due to occupancy issues. I noticed many operations in this kernel are really independent 4D operations, so I reformulated each work item as a set of 4 work items collaborating through LDS in a 4x16 work group, which I need to be wavefront-sized to exploit free barriers. From now on, I'll refer to the 4 "x" work items making up an "original" WI as a team.

It is worth noting that each WI in the team works on a "channel" of data, but the data comprises eight successive elements, so each WI needs to iterate with a stride of 4 uints to load each successive value it will mangle.
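To make that concrete, here is a rough sketch of the team layout (made-up names and sizes, not my actual kernel):

    // Sketch only: a 4x16 work group where local id x picks the channel and each
    // WI walks its 8 elements with a stride of 4 uints (skipping the other channels).
    __kernel __attribute__((reqd_work_group_size(4, 16, 1)))
    void team_kernel(__global const uint *src, __global uint *dst)
    {
        const uint chan = get_local_id(0);        // 0..3: channel owned by this WI
        const uint team = get_global_id(1);       // which "original" WI this team replaces
        const uint base = team * 32u + chan;      // 8 values x 4 channels per team

        uint acc = 0u;
        for (uint i = 0u; i < 8u; ++i)
            acc += src[base + i * 4u];            // stride-4 loads inside each WI

        dst[team * 4u + chan] = acc;              // placeholder for the real processing
    }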

As expected, the instruction count for each WI decreased; even considering the whole team there are often savings >15%, with a few notable exceptions.

For a start, VFetchInst (as reported by CodeXL) remained the same on function SW, which means I'm now doing 4x the amount of fetches. On a companion function IR, the number of fetches per WI decreased slightly, which means it increased a lot considering the whole team (about 330%).

I am now thinking about using an LDS-assisted load, for example loading a float4 per WI and then dispatching the components to the correct WIs in the team. However, I was pretty surprised the compiler couldn't extract the information I was trying to express.
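Something along these lines (again just a sketch with made-up names, the real kernel is more involved):

    // One WI of the team fetches a whole uint4, stages it in LDS, and each team
    // member then picks the component for its own channel.
    __kernel __attribute__((reqd_work_group_size(4, 16, 1)))
    void lds_dispatch(__global const uint4 *src, __global uint *dst)
    {
        __local uint4 stage[16];                  // one slot per team in the work group

        const uint chan = get_local_id(0);
        const uint team = get_local_id(1);

        if (chan == 0u)                           // one 128-bit fetch per team instead of four
            stage[team] = src[get_global_id(1)];
        barrier(CLK_LOCAL_MEM_FENCE);             // "free" on a wavefront-sized group, still needed for correctness

        const uint4 v = stage[team];
        const uint mine = (chan == 0u) ? v.x :
                          (chan == 1u) ? v.y :
                          (chan == 2u) ? v.z : v.w;
        dst[get_global_id(1) * 4u + chan] = mine; // placeholder use of the dispatched value
    }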

Most importantly, performance turned out to be worse than the original, yet bandwidth is not saturated.

So, now the questions:

  1. What is a fetch instruction? From my - admittedly incomplete - understanding, I speculate fetches are really always 4D. This somehow makes sense in my head: as I'm running a 128-bit card, it would make sense to me for a fetch to be a whole memory subsystem reservation.
  2. Is this expected to happen? I really tried to be very clear in using those pointers; I tried to make sure the compiler would understand that the WIs in the team were loading consecutive 32-bit elements - "stride one", as noted as optimal in the AMD APP documentation.
  3. What I am doing is the opposite of vectorization. I'm really trying to extract more parallelism so all the operations can go to a different ALU in the same SIMD lane and happen in a single clock instead of 4. Is this approach correct? Is it a good practice?
  4. Considering the performance seems to track the fetch count, why are those instructions so expensive? Thinking twice, I admit they aren't, as they seem to get dispatched as standard instructions, yet the number of instructions generated somehow makes them "expensive" from a 10,000 ft view.
  5. How often should I consider LDS data sharing to circumvent that issue? It seems odd to me that LDS setup and data copy could still be faster than just coalescing the loads in the first place.
  6. Do you have suggestions to help the compiler understand what I'm doing?

Other suggestions - even when not directly dealing with the problem - are welcome.

sudarshan
Staff

Hi,

I tried to understand what you are trying to do but could not understand much from your write-up. It would be better if you could describe it with some kind of pseudo-code / code snippets to make your point.

Typically, at the application layer, you should not need the GCN instruction set; by following the memory optimization guidelines in the AMD OpenCL programming guide, you should be able to extract the performance.


Hello sudarshan, thank you for your reply.

I apologize for the delay but I am really confused and I wanted to get some more data to at least consolidate what I have in mind.

The AMD APP docs (I'm referring to the version packaged with the APP SDK itself) seem to suggest that performance is best when WIs fetch data with "stride one" (I assume it means one 32-bit DWORD). Of course every workload is different and mine was perhaps not the best suited; however, I've iterated the algorithm a few times and so far it seems the cost of multiple fetches offsets the benefit of having stride one very quickly.

So far, it has seemed to me that small bursts of sequential reads - not sequential across WIs but sequential inside each WI - give the best performance. This implies the stride cannot be one and appears to contradict what the SDK says. In particular, at a certain point I changed my packing code to allow each WI to load a uint4 instead of 4 uints (sequential memory reads across WIs), and this resulted in a ~10% performance increase. I found that very surprising because the memory area accessed across the whole wavefront is the same.

So I concluded fetch instructions must have been optimized for 128-bit reads (which makes sense considering the bus width is a multiple of 128).

So basically, the initial packing was

x0x1..x62x63 y0y1..y62y63 z0z1..z62z63 w0w1..w62w63 ....

And what I assumed was that this fetching would have been super fast, with all WIs incrementing their fetch address by 64 uints each fetch, but this turned out to be faster:

x0y0z0w0 x1y1z1w1 ...

even though the fetch stride was not one (it actually has far better performance for random reads, but that's another issue).
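To illustrate the two access patterns I compared (simplified toy kernels, not the real code):

    // Planar packing: each WI gathers x, y, z, w from four separate 64-element
    // channel blocks, i.e. four 32-bit fetches per WI (assumes a single block).
    __kernel void load_planar(__global const uint *src, __global uint *dst)
    {
        const uint gid = get_global_id(0);        // element index, 0..63 in this toy case
        uint x = src[  0u + gid];
        uint y = src[ 64u + gid];
        uint z = src[128u + gid];
        uint w = src[192u + gid];
        dst[gid] = x + y + z + w;                 // placeholder combine
    }

    // Interleaved packing: one 128-bit fetch per WI (the faster variant for me).
    __kernel void load_interleaved(__global const uint4 *src, __global uint *dst)
    {
        const uint4 v = src[get_global_id(0)];
        dst[get_global_id(0)] = v.x + v.y + v.z + v.w;
    }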

So I'm having a hard time drawing solid conclusions from my experience, and yes, I am still quite confused.

realhet
Miniboss

Hi,

I'll just share some general thoughts on GCN:

- This is a 32-bit architecture which can handle 64-bit data types as well and can address 64-bit memory. So there's absolutely no need for the 4x vectorisation that was a must on HD6xxx and below.

- The best way to read memory and LDS is when the entire wavefront accesses a whole aligned block of 2048 bits of data. If there are gaps between the work-items, that is not as good, although there are dedicated standalone instructions that can batch-process 128-bit reads too.

This single instruction can read the most data from LDS: ds_read2_b64. It reads two 64-bit values from two different addresses, while needing only 64 bits of instruction encoding. The question is whether the OpenCL (actually the AMD_IL) compiler notices that you are going to read 4 adjacent dwords and compiles them to this compact instruction, or uses 4 separate dword reads instead (including the additional address calculations). In my old work I used this instruction to maintain a fast work queue in LDS memory, and it was very useful.
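For example something like this (just an illustration; whether the compiler really emits ds_read2_b64 has to be checked in the ISA disassembly, e.g. with CodeXL):

    // Each WI writes 4 adjacent dwords into LDS and reads them back; the 4 adjacent
    // reads are the pattern the compiler could fuse into ds_read2_b64-style accesses.
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void lds_adjacent(__global const uint4 *src, __global uint *dst)
    {
        __local uint lds[256];                    // 64 WIs x 4 dwords
        const uint lid = get_local_id(0);

        vstore4(src[get_global_id(0)], lid, lds); // fill 4 adjacent dwords per WI
        barrier(CLK_LOCAL_MEM_FENCE);

        const uint base = lid * 4u;
        const uint a = lds[base + 0u];            // four adjacent dword reads from LDS
        const uint b = lds[base + 1u];
        const uint c = lds[base + 2u];
        const uint d = lds[base + 3u];
        dst[get_global_id(0)] = a + b + c + d;
    }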

Also, I suggest that you read the docs about the 32 LDS memory banks. With the wrong access patterns it is easy to cause bank conflicts there.
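A classic illustration of the padding trick (generic example, not tied to your kernel):

    // With 32 banks of 4 bytes, a [32][32] tile puts a WI's whole column walk on
    // the same bank (stride of 32 dwords); padding each row to 33 dwords avoids it.
    __kernel __attribute__((reqd_work_group_size(32, 1, 1)))
    void transpose_tile(__global const uint *in, __global uint *out)
    {
        __local uint tile[32][33];                // +1 dword of padding per row
        const uint lid = get_local_id(0);

        for (uint k = 0u; k < 32u; ++k)
            tile[k][lid] = in[k * 32u + lid];     // coalesced fill, one row per iteration
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint k = 0u; k < 32u; ++k)
            out[lid * 32u + k] = tile[lid][k];    // reads stay conflict-free thanks to the padding
    }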

Hello realhet, thanks for the insight.

I am not vectorizing at all. In fact, I tried the opposite approach of "scalarizing" in an attempt to produce code better suited to GCN SIMD lanes. It seems I managed to get LDS right, since I haven't had a single bank conflict so far... to do that I set up my layout in columns instead of rows so I could use the whole bandwidth.

It is very useful to know this instruction exists. Unfortunately, I'm out of time now for further experimentation as my priorities call me elsewhere, but I suspect this could have been of great use to me as well; it seems, though, that my column-major layout would prevent it from being used.

Is this instruction full-speed on read? The APP docs seem to suggest that reading 64-bit blocks only uses half of the LDS bandwidth.


ds_read2_b64 is a batch instruction: it still sends the commands to read 4 dwords to the dword-based LDS, while not making the program code too redundant.

There is an even bigger batch instruction: s_buffer_load_dwordx16 that reads big data in the backround and the instruction decoder can fetch vector or other instruction types immediately. In the gcn program code that is important to interleave evenly all the different instruction types so that instruction arbitrator thing can provide work to do for all types of units in the compute unit.When you put 16x s_buffer_load_dword in the program, then that specific wavefront will overuse the S ALU, not letting other wavefronts to use it and because of this stall the V ALU can be under utilized.