
realhet
Miniboss

Re: GCN ISA Assembler

Hi Bdot,

You're welcome!

I checked what's up with 14.6 beta, and it turned out that the driver developers changed the way parameters (buffers) are passed to the kernel. It's improved: it uses fewer instructions and fewer vregs for my small test case. Updating will take some time, but unfortunately my current job doesn't involve GPU programming, and that's why it is stuck at 13.4. Btw, I wonder if 13.4 supports the new R290 cards. Maybe not, and then this is indeed a problem... But sooner or later I'll have some time off, and then I want to do some hobby programming on GCN, so I'll probably have time to understand how the new ELF works.

"I noticed that Cat14.4 does write -save-temps ... is that what was needed for the disasm to work?"

The problem is with the -save-temps -fno-opencl -fno-il -fno-llvmir combination. It produces an ELF that contains only the binary executable, and you are unable to load this type of ELF and disassemble it unless you are using an older Catalyst (below 13.4; 12.10, for example, works great). So you can see the disasm for the OpenCL test, but no disasm for the mandelbrot example, which is written in asm and has no higher-level sources included in the ELF file.

realhet
Miniboss

Re: GCN ISA Assembler

Hi All,

If anyone is interested, there is a new version of HetPas available on my site.

The most important improvement is that it is now compatible with Catalyst 14.6, 14.7, 14.9 and 14.12. You can generate binaries with any of these versions and execute them on any of these versions. It also works with the recent R9 cards now.

I've also made a case study on implementing an alternative crypto-currency (Groestl) in GCN ASM.

You can read the series of blog-posts here: Implementing Groestl hash function in GCN ASM | HetPas

Although in the end it turned out that, using the appropriate OpenCL compiler (14.7) and some black magic, the OCL version became so fast that I estimate the final performance improvement over OpenCL at only 10%. But it was a fun project for my holidays and I learned a lot from it. I hope someone else can learn from it too. In the series of blog posts I go through the most obvious optimizations needed to reach a noticeable speedup, and it turned out that those techniques can also be used to optimize OpenCL kernels.

My next hobby project will be a rope simulation with a little twist. I hope I get to make it this year, haha.

(Sorry about the lack of 'hello world' examples, they are still broken on the latest drivers. They need Catalyst 13.4.)

maxdz8
Elite

Re: GCN ISA Assembler

Hello realhet, awesome work you're doing here!

I'd love to look at it in detail and pick up some GCN ISA as well... the CL compiler is borderline random sometimes.

I would like to ask if you can take a look at my GRS-MYR implementation. Users have reported various interesting things with it... the thing is, I really wrote it for clarity over performance, so I was expecting only a very minor loss of speed...

Instead, most users (7800 and up, 280 and up) reported a huge drop in rate, while users with low-end cards happen to churn along great at something like 5x the speed! It seems to me GCN cores are not always the same, but AFAIK Cape Verde and Tahiti are the same design. Do you have any theory on why this could be happening?

As a last note: I'm currently inclined to believe the T-table approach might not be optimal for GPUs. The NVIDIA folks seem to have a bitsliced Groestl implementation which increased performance by 3x.

realhet
Miniboss

Re: GCN ISA Assembler

Hi,

I've checked your code: it's the same 8-table-lookup approach, so I guess the same things should apply:

- VGPRS<=128

- CodeSize<=32KB (for the main loop)

- Table lookups can be balanced between LDS and L1; 3x LDS and 1x L1 is the sweet spot.
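For reference, the "8 table lookup" round that both implementations use can be sketched on the host like this. The tables here are filled with a dummy pattern just to illustrate the access pattern; a real Groestl round derives them from the S-box and the MDS matrix:

```c
#include <stdint.h>

/* Eight hypothetical 256-entry lookup tables (dummy contents). */
static uint64_t T[8][256];

static void init_tables(void) {
    for (int t = 0; t < 8; ++t)
        for (int i = 0; i < 256; ++i)
            T[t][i] = (uint64_t)(t + 1) * 0x0101010101010101ULL * (uint64_t)i;
}

/* One column of a T-table round: eight lookups, one per input byte,
   XORed together. On GCN each lookup hits either LDS or L1 depending
   on where that table was placed. */
static uint64_t ttable_column(uint64_t x) {
    uint64_t r = 0;
    for (int t = 0; t < 8; ++t)
        r ^= T[t][(x >> (8 * t)) & 0xFF];
    return r;
}
```

The point of the 3x-LDS/1x-L1 split is that the eight loads per column can be served by two different memory paths in parallel instead of all contending for one.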

Cape Verde vs. Tahiti scalability issue: I don't know... This type of workload doesn't use any shared resources; the code can run on the CUs alone, so it should scale without a problem. Give it a lot of workitems...

Bitslice: I checked it a bit and, if I see it right, you just can't avoid the lookups. Those are the bottleneck now, and that's why I can only slightly outperform a well-optimized OCL code with asm. Maybe the lookups can be faster since they don't need to be 64-bit... Can you tell me exact speeds for that NVIDIA bitslice approach? The current T-table OCL is 33 MH/s on the R9 290X, and the asm will be around 36 MH/s once somebody implements the first/last-round optimizations.

Since you mentioned NVIDIA: I've heard about that awesome 3-operand bitwise logic instruction where all 256 (16*16) logic combinations can be selected by an immediate parameter. Maybe the NVIDIA version uses that too. Currently the only 3-operand bitwise operation we have is BFI (a&b | ~a&c).
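The idea behind that instruction (NVIDIA calls it LOP3) is that the 8-bit immediate is a truth table: for each bit position, the three operand bits form a 3-bit index into the immediate. A host-side model, plus GCN's BFI expressed as one particular truth table (0xCA), might look like this:

```c
#include <stdint.h>

/* Software model of a 3-operand bitwise op selected by an 8-bit
   immediate truth table. For each bit position i, the bits of
   (a, b, c) form a 3-bit index into the lut. */
static uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        int idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1);
        r |= (uint32_t)((lut >> idx) & 1) << i;
    }
    return r;
}

/* GCN's BFI computes (a & b) | (~a & c); written as a truth table
   over (a, b, c) that is lut = 0xCA. */
static uint32_t bfi(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (~a & c);
}
```

So with a truth-table instruction, BFI is just one of 256 selectable operations (XOR3, for instance, is lut = 0x96), which is why it helps bitsliced code so much.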

maxdz8
Elite

Re: GCN ISA Assembler

Yes, I know it is the same. It was written to be the same in theory, but it has this strange behavior (it results in the 7750 beating the 7850 and the NV 750 Ti).

The program I distribute adjusts the worksize to both the CU count and the nominal driver clock, resulting in a number of hashes that is always 64n.

I also have several other variants (including mixed LDS/L1 which produced the very same results as yours); none seem to be considerably better than this on my hardware.

Unfortunately, I haven't got much data from my users. Some of them don't even write proper English; I cannot really blame them. Most of this community seems to be very jealous of their data, and the little data they do give is usually incomplete.

I cannot tell if their speedup stems from the operation you mention, but I've heard NV has a swizzle instruction which is awesome for this kind of thing. I hope the next GCN will have it as well (if not Tonga already), because for these simple algorithms it seems to be a far cry from LDS sharing.

Thank you very much for your time!

realhet
Miniboss

Re: GCN ISA Assembler

GCN has swizzle too. (If you mean between the workitems of a wavefront.)

I had the idea to try that with LiteCoin, so that the LDS would be the bottleneck instead of the MEM, and the math could be parallelized somewhat using the ds_swizzle instruction. As it is ASIC territory, it isn't worth it except for learning/experimenting purposes.
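As I read the GCN3 ISA manual, ds_swizzle_b32's 32-lane bit-mask mode computes each lane's source lane as ((lane & and_mask) | or_mask) ^ xor_mask. A host-side model of that permutation (illustrative only, not validated against hardware) would be:

```c
#include <stdint.h>

/* Host-side model of ds_swizzle_b32's 32-lane bit-mask mode:
   lane i reads from lane ((i & and_mask) | or_mask) ^ xor_mask,
   within each group of 32 lanes. */
static void ds_swizzle32(const uint32_t src[32], uint32_t dst[32],
                         unsigned and_mask, unsigned or_mask,
                         unsigned xor_mask) {
    for (unsigned i = 0; i < 32; ++i) {
        unsigned j = ((i & and_mask) | or_mask) ^ xor_mask;
        dst[i] = src[j & 31];
    }
}
```

For example, and=31/or=0/xor=1 swaps adjacent lanes, and and=0/or=k/xor=0 broadcasts lane k, which is how data can be exchanged between workitems without touching LDS addresses.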

Small card/big card problem: are your kernel launches taking at least 50 ms? If you're around the minimum number of workitems it is risky, because the LDS is 'randomizing' execution times. So I think it is better to have a few million workitems instead of 64n. Long-running kernels always produce the best performance. This is only problematic when you have to process realtime data, but in mining that is not an issue.

And start with the biggest issue first: VGPR optimization, since going down from 150 to 128 VGPRs can run 2x faster. Then instruction-cache hits can add, let's say, 50%. Only after that does the LDS/L1 balance become important.
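The 150-to-128 jump comes from wave occupancy: each GCN SIMD has 256 VGPRs, so the number of waves that fit is roughly 256 divided by the kernel's VGPR allocation. A sketch of that arithmetic (the granule size of 4 is an assumption; it varies by generation):

```c
/* Waves per SIMD from VGPR usage on first-gen GCN: 256 VGPRs per
   SIMD, allocated in granules (assumed 4 here), capped at 10 waves. */
static int waves_per_simd(int vgprs) {
    const int total = 256, granule = 4, cap = 10;
    int alloc = ((vgprs + granule - 1) / granule) * granule;
    int waves = total / alloc;
    return waves > cap ? cap : waves;
}
```

At 150 VGPRs only one wave fits per SIMD, so the SIMD stalls on every memory access; at 128 two waves fit and can hide each other's latency, hence the roughly 2x.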

maxdz8
Elite

Re: GCN ISA Assembler


realhet wrote:


GCN has swizzle too. (If you mean between the workitems of a wavefront.)


I used the wrong term. They refer to it as "shuffle", I think. But yes, I mean permuting private registers transversally across work items in the same wavefront. Is it exposed in OpenCL? I think I've missed it completely. Perhaps it is in CL2?


realhet wrote:


I had the idea to try that with LiteCoin, so that the LDS would be the bottleneck instead of the MEM, and the math could be parallelized somewhat using the ds_swizzle instruction. As it is ASIC territory, it isn't worth it except for learning/experimenting purposes.


I also did the same. My data does not support the idea of scrypt being memory-bound (in terms of either bandwidth or latency). My readings in CodeXL gave me ~95% VALUBusy, if memory serves, which seems to go well with its reputation of being "too hot". If memory serves, my bandwidth measurements were ~30%. Everything I attempted to lower latency usage (at the expense of ALU) resulted in lower performance, which I consider typical of ALU-bound scenarios. Most of those measurements were with GAP 2, if memory serves.


realhet wrote:


Small card/big card problem: are your kernel launches taking at least 50 ms? If you're around the minimum number of workitems it is risky, because the LDS is 'randomizing' execution times. So I think it is better to have a few million workitems instead of 64n. Long-running kernels always produce the best performance. This is only problematic when you have to process realtime data, but in mining that is not an issue.


This is not under my control. My software does not target hardcore miners but rather occasional users. It is focused on keeping the system responsive, so I target ~30 ms per dispatch; most are much lower. Testing is usually performed at ~100 ms.

The GRS-MYR kernel (under 14.9, I think) consumed 104 VGPRs / 20 SGPRs and came to 16.08 KiB of code. VALUBusy was around 30%. I'm afraid I don't understand your terminology... shouldn't it be small enough?

realhet
Miniboss

Re: GCN ISA Assembler

Hi, and I wish you a happy new year!

Please check out my GCN Quick Reference Guide here -> GCN Quick Reference Card | HetPas

(Almost) every instruction is listed. GCN3 features are highlighted in red.

Big thanks to AMD for the ISA manuals, and to matszpk for the amazing ISA encoding documentation here -> ClrxToc – CLRadeonExtender

matszpk
Adept III

Re: GCN ISA Assembler

Thank you realhet. I wish you happy new year too.

I have a small matter: I found yet another error in the AMD GCN ISA manual. Just one opcode is incorrect (I verified it): DS_WRITE_SRC2_B64 is not 204 but 205 (as in GCN 1.2).

Opcode 204 just hangs up the GPU. That's all. Thank you for your attention.

EDIT: I forgot to mention: the opcode for DS_WRITE_SRC2_B32 is incorrect too. It should be 141. Thank you.

realhet
Miniboss

Re: GCN ISA Assembler

Have you heard the news?

http://653fb62b3a129d296422-3019ba142970aa3e5db9c4ca20cb2da4.r64.cf1.rackcdn.com/images/W1Z-tj7wQiCV...

Finally there will be an officially supported way to inject asm into the GPU.

I really hope they'll also implement the DD instruction to be able to compile any machine code.

So no more ELF patching/hacking will be needed when a new driver comes out.