I've spent a lot of time optimizing register usage for my GCN 1.0 HD 7770 card. Will I have to do this again when I inevitably
upgrade to Hawaii or Fury, or will my optimizations still apply to those cards? Thanks.
I think not.
Comparing with the HD7770:
On Fury there are some extra instructions, mostly for cryptography and for data exchange across neighboring threads, so it doesn't matter unless you used byte-permute stuff.
On Fury and on GCN 1.2 (?) there are some new/modified memory instructions/addressing modes.
Basically, the optimizer could be almost the same for all the GCN models. (The assembler must be different for Fury.)
I think the most important thing that affects optimization (and register usage) is the 'version number of the Catalyst driver'.
What's the algorithm you're working on? (If it's no secret)
If it is memory bound (e.g. constantly and randomly accessing an area bigger than 2 megabytes), you can't do much.
I've heard you gained a 2x speedup already.
Thanks for the info. I don't think I am memory bound, although the app is pretty memory intensive. I try to use images wherever I can,
and I am accessing them in raster order. These are image compression kernels for JPEG 2000 compression. The speed I am getting on
my old mid-range HD 7770 is mind-blowing.
>jpeg 2000 compression
So it starts with wavelet transform. Now that's a memory hungry one. At least it's kinda sequential.
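For anyone following along, here is a rough host-side C sketch of one level of a 1-D lifting wavelet step. It assumes the reversible 5/3 filter with symmetric extension at the edges; the thread doesn't say which JPEG 2000 filter the kernels actually use, and the names `floor_div` and `dwt53_forward` are made up for illustration. It does show why the access pattern is mostly sequential: each output only needs its immediate neighbors.

```c
#include <stdio.h>

/* Floor division for a positive divisor d, also correct for negative v
   (plain C division truncates toward zero). */
static int floor_div(int v, int d) {
    int q = v / d;
    if ((v % d != 0) && (v < 0)) q--;
    return q;
}

/* One level of a forward 5/3 lifting transform on n samples.
   Assumes n is even and n >= 2; low[] and high[] each get n/2 values. */
static void dwt53_forward(const int *x, int n, int *low, int *high) {
    int half = n / 2;

    /* Predict: each odd sample minus the average of its even neighbors. */
    for (int i = 0; i < half; i++) {
        int left  = x[2 * i];
        /* Symmetric extension at the right edge: x[n] mirrors to x[n-2]. */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        high[i] = x[2 * i + 1] - floor_div(left + right, 2);
    }

    /* Update: each even sample plus a rounded quarter of neighboring details. */
    for (int i = 0; i < half; i++) {
        /* Symmetric extension at the left edge: d[-1] mirrors to d[0]. */
        int dprev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + floor_div(dprev + high[i] + 2, 4);
    }
}
```

Calling `dwt53_forward` on the ramp 0..7 gives high-pass values that are all zero except at the right edge, where the mirrored boundary breaks the ramp.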
Yes, wavelet to start. Actually, the signal processing part is the easiest to parallelize - although the kernel is quite involved.
The most memory intensive part is actually the later stages, where each bit plane must be processed in three different passes.
And the hardest part to parallelize is the last part, a serial MQ arithmetic encoder. This was designed around 20 years ago, to avoid multiplication, but it is full of branches, so not great for GPGPU.
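To illustrate the point about branches, here is a stripped-down C sketch of the interval update in an MQ-style coder. It is only a sketch under loud assumptions: it tracks just the interval register and a shift count, leaves out the code register, byte output and the probability state table, and takes the LPS estimate `qe` as a plain argument. The names `mq_sketch`, `mq_renorm` and `mq_code_symbol` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t a;      /* interval width; in [0x8000, 0x10000) after renormalization */
    uint32_t shifts; /* total renormalization shifts so far */
} mq_sketch;

/* Double the interval until its top bit (0x8000) is set again. */
static void mq_renorm(mq_sketch *s) {
    while ((s->a & 0x8000u) == 0) {
        s->a <<= 1;
        s->shifts++;
    }
}

/* Code one symbol. qe is the LPS probability estimate (assumed <= 0x5601);
   is_mps is nonzero when the symbol is the more probable one.
   No multiply: since a stays close to 1.0 in fixed point, a*qe is
   approximated by qe, so the MPS sub-interval is simply a - qe. */
static void mq_code_symbol(mq_sketch *s, uint32_t qe, int is_mps) {
    s->a -= qe;
    if (is_mps) {
        if (s->a & 0x8000u) {
            return;            /* common fast path: no renormalization */
        }
        if (s->a < qe) {
            s->a = qe;         /* conditional exchange: MPS takes the larger part */
        }
    } else {
        if (s->a >= qe) {
            s->a = qe;         /* LPS takes qe unless the exchange applies */
        }
    }
    mq_renorm(s);
}
```

Even in this reduced form every symbol goes through two or three data-dependent branches plus a variable-length loop, which is what makes it awkward to spread across a wavefront.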
Back then mul was slow.
But not today -> mad24
There are other goodies here, maybe they can help -> Integer Built-In Functions
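As a rough illustration of what these built-ins compute, here is a host-side C stand-in for the unsigned forms, used for a row-major pixel index. `mul24_u`, `mad24_u` and `pixel_index` are hypothetical helper names, not OpenCL functions. Masking the operands to 24 bits is just one possible behavior: in OpenCL the result for out-of-range operands is implementation-defined, so keep inputs within 24 bits.

```c
#include <stdint.h>

/* Stand-in for unsigned mul24: only the low 24 bits of each operand
   take part, and the low 32 bits of the product are returned. */
static uint32_t mul24_u(uint32_t x, uint32_t y) {
    return (x & 0xFFFFFFu) * (y & 0xFFFFFFu);
}

/* Stand-in for unsigned mad24: 24-bit multiply followed by a 32-bit add. */
static uint32_t mad24_u(uint32_t x, uint32_t y, uint32_t z) {
    return mul24_u(x, y) + z;
}

/* Row-major index of (row, col) in an image that is `width` pixels wide. */
static uint32_t pixel_index(uint32_t row, uint32_t width, uint32_t col) {
    return mad24_u(row, width, col);
}
```

In a kernel the same index would be written as `mad24(row, width, col)`, which is safe as long as `row` and `width` each fit in 24 bits.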
Thanks. I am making extensive use of mad24, but I didn't notice mul24. Should save me a few cycles.
Strange: mul24 and mad24 increase register usage in a few of my kernels. So I think I will avoid them for now: I have become
a bit obsessive about VGPR usage.
But... CodeXL sometimes reports different simulations ('colorful bars' page) for post-Tonga parts, and sometimes I've observed the same for post-Hawaii parts compared with GCN 1.0.
No idea, however, whether it happens in reality.