Thank you. It's actually quite nice to have somebody who understands my code.
My guess is that if you unroll it, the main loop no longer fits in the ICache.
That is exactly right. With Bitslice DES, it is crucial that the main loop fits in the instruction cache, no matter which architecture you are dealing with.
So is dealing with s_xxxxPC_b64 worth the effort?
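To illustrate why bitsliced code is so sensitive to code size, here is a minimal Python sketch of the bitslicing idea itself. It is a toy and not the kernel discussed here: the circuit is a full adder rather than a DES S-box, and the helper names `pack`, `unpack` and `full_adder` are made up for this example. The point is only that every gate becomes one straight-line bitwise instruction, so a circuit with many gates turns into a lot of code.

```python
# Toy bitslicing demo (assumption: 64 lanes per word; NOT a DES S-box).
MASK = (1 << 64) - 1  # one independent instance per bit position


def pack(bits):
    """Pack a list of 0/1 values into one word, lane i stored in bit i."""
    word = 0
    for lane, bit in enumerate(bits):
        word |= (bit & 1) << lane
    return word


def unpack(word, lanes):
    """Extract the first `lanes` bits of a word as a list of 0/1 values."""
    return [(word >> lane) & 1 for lane in range(lanes)]


def full_adder(a, b, c):
    """Five gates, each evaluated for all lanes at once with one bitwise op."""
    t = a ^ b
    s = t ^ c
    carry = (a & b) | (t & c)
    return s & MASK, carry & MASK


a = pack([0, 1, 0, 1])
b = pack([0, 0, 1, 1])
c = pack([1, 1, 1, 1])
s, carry = full_adder(a, b, c)
print(unpack(s, 4))      # [1, 0, 0, 1]
print(unpack(carry, 4))  # [0, 1, 1, 1]
```

A real bitsliced S-box follows the same pattern with far more gates, and no table lookups, which is why instruction fetch matters so much.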
Absolutely. With these lightweight function calls, the performance gain was around 100% compared to the unrolled version.
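A rough way to see the code-size effect is to count instructions for both layouts. The sketch below is only an estimate under stated assumptions: the loop is assumed to cover all 16 DES rounds with 8 S-boxes each, `BODY_LEN` is a made-up figure rather than a measurement of the actual kernel, and a call is modelled as a single s_swappc_b64-style instruction with the target address already in a register pair, plus one s_setpc_b64-style return per routine.

```python
def unrolled_size(rounds, sboxes, body_len):
    """Instruction count when every S-box body is inlined at each use."""
    return rounds * sboxes * body_len


def called_size(rounds, sboxes, body_len, call_len=1, return_len=1):
    """Instruction count when each S-box body exists once and is called."""
    bodies = sboxes * (body_len + return_len)  # one copy of each routine
    calls = rounds * sboxes * call_len         # one call per use
    return bodies + calls


ROUNDS = 16    # DES rounds
SBOXES = 8     # DES S-boxes per round
BODY_LEN = 60  # hypothetical instructions per S-box, for illustration only

print(unrolled_size(ROUNDS, SBOXES, BODY_LEN))  # 7680
print(called_size(ROUNDS, SBOXES, BODY_LEN))    # 616
```

With these assumed numbers the called layout is more than ten times smaller, which is the kind of difference that decides whether the loop stays in the instruction cache. Any extra instructions needed to set up the call target would raise `call_len` accordingly.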
What tool did you use to generate the asm sources?
It's a combination of a custom code generator and CLRadeonExtender.
I started with an OpenCL kernel. I compiled it to GCN bytecode with CodeXL for GCN 1.0/1.1/1.2, disassembled the results with CLRadeonExtender, and used diff to see how the OpenCL compiler handled the three different GCN architectures. Once I had a functional kernel written in GCN assembly, I analysed its register usage, generated an optimized version of the main loop with a custom code generator, and merged it with the disassembled code.
Sorry for asking so many questions. I'm excited because I rarely see cool GCN ASM nowadays.
Oh, no problem. Your work was my original inspiration, so I am honored to answer your questions.