Archives Discussions

boxerab · ‎09-08-2015

I've spent a lot of time optimizing register usage for my GCN 1.0 HD 7770 card. Will I have to do this again when I inevitably upgrade to Hawaii or Fury? Or will my optimizations still apply to these cards? Thanks.

realhet · ‎09-09-2015

I think not.

Comparing with the HD7770:

On Fury there are some extra instructions mostly for cryptography and data exchange across neighboring threads, so it doesn't matter, unless you used byte-permute stuff.

On Fury and on GCN 1.2 (?) there are some new/modified memory instructions/addressing modes.

Basically the optimizer could be almost the same for all the GCN models. (The assembler must be different for Fury)

I think the most important thing that affects optimization (and register usage) is the 'version number of the Catalyst driver'.

What's the algorithm you're working on? (If it's no secret)

If it is memory bound (e.g. constantly and randomly accessing an area bigger than 2 megabytes), you can't do much.

I've heard you gained a 2x speedup already.

boxerab · ‎09-09-2015

Thanks for the info. I don't think I am memory bound, although the app is pretty memory intensive. I try to use images wherever I can, and I am accessing them in raster order. These are image compression kernels for JPEG 2000 compression. The speed I am getting on my old mid-range HD 7770 is mind blowing.

realhet · ‎09-09-2015

>jpeg 2000 compression

So it starts with wavelet transform. Now that's a memory hungry one. At least it's kinda sequential.

boxerab · ‎09-09-2015

Yes, wavelet to start. Actually, the signal processing part is the easiest to parallelize - although the kernel is quite involved.

The most memory intensive part is actually the later stages, where each bit plane must be processed in three different passes.

And the hardest part to parallelize is the last part, a serial MQ arithmetic encoder. This was designed around 20 years ago, to avoid multiplication, but it is full of branches, so not great for GPGPU.

realhet · ‎09-10-2015

Back then mul was slow.

But not today -> mad24

There are other goodies here, maybe they can help -> Integer Built-In Functions

boxerab · ‎09-10-2015

Thanks. I am making extensive use of mad24, but I didn't notice mul24. Should save me a few cycles.
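For anyone following along, here is a rough host-side model in plain C of what the signed mul24 and mad24 built-ins compute, as I read the OpenCL spec: only operands in the range [-2^23, 2^23 - 1] give a defined result, and the result is the low 32 bits of the product (plus z for mad24). The names fits_int24, mul24_ref and mad24_ref are made up for this sketch; they are not part of any API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x lies in the signed 24-bit range [-2^23, 2^23 - 1].
   Outside this range the OpenCL result of mul24/mad24 is implementation-defined. */
static bool fits_int24(int32_t x)
{
    return x >= -(1 << 23) && x <= (1 << 23) - 1;
}

/* Hypothetical reference model of signed mul24 for in-range operands. */
static int32_t mul24_ref(int32_t x, int32_t y)
{
    assert(fits_int24(x) && fits_int24(y));
    /* The full product can need up to 47 bits, so compute it in 64 bits
       and keep only the low 32. The final unsigned-to-signed conversion
       wraps on common compilers (implementation-defined in ISO C). */
    int64_t wide = (int64_t)x * (int64_t)y;
    return (int32_t)(uint32_t)(uint64_t)wide;
}

/* Hypothetical reference model of signed mad24: mul24(x, y) + z,
   with the addition done in unsigned arithmetic so it wraps at 32 bits. */
static int32_t mad24_ref(int32_t x, int32_t y, int32_t z)
{
    uint32_t product = (uint32_t)mul24_ref(x, y);
    return (int32_t)(product + (uint32_t)z);
}
```

Typical use in a kernel is index arithmetic such as mad24(y, stride, x), which stays in range as long as the image dimensions are below 2^23.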

boxerab · ‎09-10-2015

Strange: mul24 and mad24 increase register usage in a few of my kernels. So, I think I will avoid them for now: I have become a bit obsessive about VGPR usage.
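To show why the VGPR count matters so much, here is a small sketch in C of the usual rule of thumb for how VGPR usage caps the number of wavefronts per SIMD. It assumes 256 VGPRs per lane per SIMD, at most 10 wavefronts per SIMD, and VGPRs allocated in blocks of 4; as far as I know those limits are the same from GCN 1.0 through GCN 1.2, but treat them as assumptions and check the ISA docs for your card. It ignores SGPR and LDS limits, which can lower occupancy further. The function name waves_per_simd is made up for this example.

```c
#include <assert.h>

/* Assumed GCN limits (not taken from this thread; verify for your hardware). */
enum {
    VGPRS_PER_SIMD     = 256, /* VGPRs available per lane on one SIMD */
    MAX_WAVES_PER_SIMD = 10,  /* hardware cap on resident wavefronts  */
    VGPR_GRANULARITY   = 4    /* VGPRs are allocated in blocks of 4   */
};

/* Upper bound on wavefronts per SIMD imposed by VGPR usage alone.
   Returns 0 if the kernel needs more VGPRs than a SIMD provides. */
static int waves_per_simd(int vgprs_per_work_item)
{
    assert(vgprs_per_work_item > 0);

    /* Round the request up to the allocation granularity. */
    int allocated = (vgprs_per_work_item + VGPR_GRANULARITY - 1)
                    / VGPR_GRANULARITY * VGPR_GRANULARITY;
    if (allocated > VGPRS_PER_SIMD)
        return 0;

    int waves = VGPRS_PER_SIMD / allocated;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under these assumptions a kernel using 24 VGPRs can reach the full 10 waves, while going from 64 to 65 VGPRs drops the ceiling from 4 waves to 3, which is why shaving a few registers near a threshold can pay off.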

maxdz8 · ‎09-10-2015

But... CodeXL sometimes reports different simulations ('colorful bars' page) for post-Tonga, and sometimes I've observed this in post-Hawaii WRT GCN 1.0.

No idea however if it happens in reality.

Do register usage and kernel occupancy vary with different GCN cards ?