So, we have a few 7970s, and naturally I wanted to see how well they perform. The good news is that I saw about a 33% improvement over a 6970.
The bad news is that it doesn't work, is quirky, etc., so here's a rundown...
First, obviously, I'm *still* using CAL/IL. Calling calInit sometimes just crashes the program. Nothing special, just a plain call to calInit. Sometimes I have to restart it 5 or 6 times just to get past the library init phase.
Second, my program fails. I'm still working out the details, which led me to go looking for the 7900 series equivalent of the HD 6900 Series Instruction Set Architecture doc, but my Google-fu seems to be weak. Has this been released? Could you point me to it? It looks to me like the trouble is happening at the place I mentioned having trouble with before, with ret_logicalz and the like, only this time the behavior is different, and I'm not exactly sure how. I was hoping to find some answers in the ISA doc, or at least something to help me go through the disassembly.
I suppose any logic error where code works on a 6900 series card but not on a 7900 series card is a bug... and since OpenCL is (still?) based on CAL, I would think others should have run into this, but I saw nothing similar. Perhaps there is more undocumented stuff going on in OpenCL to work around 7970 differences?
Well, I decided to test the ret/continue thing and take out the early termination. No good. It's still not working as it should, yet when I write out intermediate values they're correct... I'm really scratching my head on this one...
Global buffer (g) support is not present for 7xxx series cards, so if your IL is using that, it won't work; use UAVs instead. Also, we did find a bug with pixel shaders via IL that will be addressed in a later driver.
I got the feeling, from all the UAV documentation and the fact that OpenCL used UAVs in favor of global buffers, that UAVs were the better choice and global buffers were on the way out, so everything already uses UAVs.
I spent the entire day yesterday trying to figure out where things are going wrong, but couldn't pin it down anywhere. Perhaps it only happens when the card is loaded up on the ALU with minimal RAM hits, as I've designed the code to use an absolutely minimal amount of RAM (we're talking less than 128 bytes/thread). To get any sort of debugging output, though, I obviously have to write a lot more. The one thing I couldn't seem to figure out, and this was only at the end of the day, was that it looked like anything from thread 4 up was wrong. However, I may have just miscalculated the size I needed and overwritten the beginning of 4 with the end of 3. Still, I wouldn't have expected 3 to be correct in that case, as 2 should have overwritten 3, and 1 should have overwritten 2... again, everything I tried just left me with more head scratching.
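The overwrite reasoning above can be checked on the host with a small sketch. It assumes each thread owns a fixed-size debug slot starting at `thread_id * slot_bytes` and writes `written_bytes` from the start of its slot; the function name and layout are hypothetical, not taken from the actual kernel.

```python
def clobbered_threads(num_threads, slot_bytes, written_bytes):
    """Return the sorted ids of threads whose slot is touched by another thread's write.

    Assumed layout (hypothetical): thread t owns bytes
    [t * slot_bytes, (t + 1) * slot_bytes) and writes
    [t * slot_bytes, t * slot_bytes + written_bytes).
    """
    clobbered = set()
    for writer in range(num_threads):
        w_start = writer * slot_bytes
        w_end = w_start + written_bytes  # exclusive end of the write
        for victim in range(num_threads):
            if victim == writer:
                continue
            v_start = victim * slot_bytes
            v_end = v_start + slot_bytes  # exclusive end of the slot
            # Standard half-open interval overlap test.
            if w_start < v_end and v_start < w_end:
                clobbered.add(victim)
    return sorted(clobbered)


if __name__ == "__main__":
    # Writes fit the slot: nobody is clobbered.
    print(clobbered_threads(6, 128, 128))
    # Every thread writes 16 bytes too many: every thread except 0 is hit.
    print(clobbered_threads(6, 128, 144))
```

Under that assumption, a uniform size miscalculation would damage every slot after the first, which matches the suspicion that threads 1 through 3 being correct points at something other than a simple slot-size mistake.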
We also tried a separate kernel we had developed here, and it too produced incorrect results.
I thought about trying the 12.1a drivers, but it was only at the end of the day that I saw they supported the 7970 as well, so I didn't bother. Further, given that the date on the driver I was using was 1/25, I wasn't really sure 12.1a would have been newer/better. Perhaps I'll have some time to look into this again next week, but one way or another, it's gotta work eventually! Is there a 12.2 preview up yet that may address the pixel shader issue you mentioned? Would it affect a compute program (il_cs_2_0)?
Since I don't *need* it working on the 7970s at the moment, I just shelved that box and switched back to the 6900 series.
Out of curiosity, is AMD going to release an ISA doc like they have for other architectures in the past? It was surprisingly helpful in working with the Cayman architecture.
IL compute shaders are fine, as far as I know. That doesn't mean there can't be bugs, but our internal tests are working.
I suspect AMD will release the ISA docs for Tahiti, I just don't know when.
Well, that just leaves me with the question: what are some differences that would make IL that works on Cayman and Turks GPUs useless on Tahiti GPUs?
I guess the one other thing is, if I were to ask about specific instructions, or parts of instructions, could you give me a quick breakdown of what I'm looking at? Or is it simpler than it looks? A quick glance of 5 minutes total made things look ambiguous, but perhaps I'm overthinking it?
The major changes have to do with UAVs. Here are the basics of the changes.
Wow, did I miss a document where those were listed?
The constant buffer may be the problem... That, and all my UAVs are raw... seems like that is also not preferred?
All my reads have been aligned on 16-byte boundaries, and always read a minimum of 16 bytes. That seemed like the best way to get it bursting before. Any hints with the raw UAVs?
I'll try using read-only UAVs instead of the constant buffer next week. Not having the size limitation there could help with some problems I had already ruled out as candidates for running faster on the GPU!
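For reference, the 16-byte rule described above comes down to a little address arithmetic. This is only a host-side sketch; the helper names are hypothetical, and it says nothing about what the hardware actually bursts.

```python
ALIGN = 16  # bytes, the alignment and minimum read size described above


def is_aligned(byte_offset, align=ALIGN):
    """True if byte_offset sits on an align-byte boundary."""
    return byte_offset % align == 0


def aligned_read_span(byte_offset, length, align=ALIGN):
    """Return (start, size) of the smallest aligned span covering the request.

    start is rounded down and the end is rounded up to a multiple of align,
    so size is always a multiple of align and at least align for length > 0.
    """
    start = byte_offset - (byte_offset % align)
    end = -(-(byte_offset + length) // align) * align  # ceiling to a multiple of align
    return start, end - start


if __name__ == "__main__":
    print(is_aligned(32))            # already on a boundary
    print(aligned_read_span(20, 8))  # unaligned request widened to one 16-byte read
    print(aligned_read_span(30, 4))  # request straddling a boundary needs two
```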
I don't think our IL document has been updated for SI yet.
Your best bet is to see how OpenCL generates AMDIL and follow that. Even on EG/NI this was the case (i.e. caching on UAV 11 only, arena on UAV 8 only, etc.).
Raw UAVs are equivalent to a read-write typeless UAV with a stride of 4. The difference, however, is that there is a max of 12 raw UAVs but 256 typeless UAVs.
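To make the numbers above concrete, here is a small sketch of the stride and count arithmetic. The constants come from the post above; the helper names are hypothetical, and the assumption that the stride of 4 means 4 bytes (one dword per element) is mine.

```python
RAW_STRIDE_BYTES = 4     # assumed: stride of 4 means one 4-byte dword per element
MAX_RAW_UAVS = 12        # limit quoted above
MAX_TYPELESS_UAVS = 256  # limit quoted above


def byte_offset_to_element(byte_offset, stride=RAW_STRIDE_BYTES):
    """Map a byte offset to an element index for a given stride."""
    if byte_offset % stride != 0:
        raise ValueError("offset is not a multiple of the stride")
    return byte_offset // stride


def fits_uav_budget(count, raw):
    """Check a planned number of UAVs against the quoted limits."""
    limit = MAX_RAW_UAVS if raw else MAX_TYPELESS_UAVS
    return count <= limit


if __name__ == "__main__":
    print(byte_offset_to_element(64))       # byte 64 is dword 16
    print(fits_uav_budget(13, raw=True))    # one more than the raw limit
    print(fits_uav_budget(13, raw=False))   # fine as typeless UAVs
```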
So, riddle me this...it seems like as soon as I add
I get "Unsupported program construct detected in back-end!"
I tried appending a _length(16) to it, and that made no difference.
I originally had loads, and searching through the DLL, I didn't see any specific typeless load, so I tried uav_raw_load since it claims to be typeless. I thought the loads may have been the problem, so I commented them all out. Same error. The same code using cb0 compiles just fine (obviously there was a little more work in maintaining registers for the locations of cb values, as I haven't tried any of the more CISC-like instructions available yet, but it's nearly 1-1 line-wise).
Any ideas? I know it's difficult without a test case... I'll try tomorrow to see what happens with some generic not-optimized-out test cases and see if I can get you any more info. Just hoping there's again something obvious I may have overlooked!
Here are some hints I've found out recently while moving from Evergreen to GCN architecture:
- Use Catalyst 12.1; the things below will generate an access violation when using the new driver.
- Write a compute_shader, not a ps.
- Use a UAV instead of the global buffer.
- When allocating a UAV resource, the format should be CAL_FORMAT_UNORM_INT32_1 (not 4)! (Evergreen will copy only 1/4 of the memory when you specify 4 components instead of 1 component.)
- For cb0, the CAL_FORMAT_UNORM_INT32_4 format will do, just like a year ago.
- Use the res_alloc_global_buffer flag to allocate UAV and cb resources (linear 'dimensionless' memory).
- While using small amounts of memory bandwidth, I didn't notice any performance difference between pinned/local/remote memory.
- Use vAbsTIdFlat instead of vWindowCoord. (wincoord is a bit slow in a compute_shader, but with vAbsTIdFlat you can get the same performance as with the good old pixel_shader+vW.)
The above stuff works great on the Evergreen architecture too with the latest drivers.
I think it's really worth replacing PS/g with CS/UAV. No performance loss at all, and better compatibility with the new CAL.
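Two of the hints in the list above come down to simple arithmetic, sketched here on the host side. The assumptions are mine and not confirmed by the thread: that a linear resource's width is counted in elements of the chosen format (4 bytes for CAL_FORMAT_UNORM_INT32_1, 16 bytes for CAL_FORMAT_UNORM_INT32_4), and that a flattened absolute thread id is laid out x-fastest. The helper names are hypothetical and no CAL calls are made.

```python
# Assumed bytes per element: one or four 32-bit components.
BYTES_PER_ELEMENT = {
    "CAL_FORMAT_UNORM_INT32_1": 4,
    "CAL_FORMAT_UNORM_INT32_4": 16,
}


def width_in_elements(total_bytes, fmt):
    """Number of elements needed to cover total_bytes in the given format."""
    size = BYTES_PER_ELEMENT[fmt]
    if total_bytes % size != 0:
        raise ValueError("buffer size is not a multiple of the element size")
    return total_bytes // size


def abs_tid_flat(tid, domain):
    """Flatten an absolute (x, y, z) thread id over an (dx, dy, dz) domain, x-fastest."""
    x, y, z = tid
    dx, dy, dz = domain
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread id is outside the domain")
    return x + y * dx + z * dx * dy


if __name__ == "__main__":
    # The same 1024-byte buffer needs 4x as many 1-component elements.
    print(width_in_elements(1024, "CAL_FORMAT_UNORM_INT32_1"))
    print(width_in_elements(1024, "CAL_FORMAT_UNORM_INT32_4"))
    # Thread (3, 2, 0) in a 64 x 4 x 1 domain.
    print(abs_tid_flat((3, 2, 0), (64, 4, 1)))
```

The factor of four between the two widths is the kind of mismatch that would leave only a quarter of a buffer covered if the element count and the format disagree.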