Have you tested hard coding your loop paramaters and then using a "#pragma unroll X" where X is the loop execution time. I have found even for relatively small loops this can give me performance increases of 5-15x in some cases. Note:That If the loop is too big it might cause the registers to spill". The difficult part is you need different versions in your case to support local work sizes. I have found that I simply have to deal with it in my code and have lots of versions if I am going to support multiple architectures but the performance gains are too large to ignore. I have a preprocessor that duplicates my code with multiple times for this so that I don't get a typo in one version.
I have not seen a performance increase in int16. Sometimes I get a better result with int8 but sometimes slightly worse depending on what I am doing so you might just need to try.
Also you might check your memory access pattern to make sure it is optimized (check the APP parallel programming guide for more info on how this works best for AMD).
Greyscale eh? Well you're doing way more work than necessary. I presume your image format is CL_UNSIGNED_INT8 and not INT32.
You're keeping 4 independent partial sums until the end when you sum them up, but you could sum them before storing them into local memory and thus require 1/4 the local memory and 1/4 of the alu ops for the reduction. It should also remove the local bank conflicts (but see the end).
If you need to get it working cross-platform, I would start working with that and hardcoding the local work-group size (e.g. use #define TILE_WIDTH 128, etc) as this will allow the compiler to pre-calculate most of the addressing which could be significant here. I would also just try a simpler kernel that only does one difference at a time; although by changing the internal loop to track 3 rather than 12 values you will relieve register pressure significantly so it might not be a help as it would require more memory bandwidth. I'd probably try floats too - using UNORM_INT8 as the data format and read_imagef(), gpu's are good at floats: I don't think rounding errors will matter here.
(on your other question regarding wider int vectors: usually just choose the native size for the devices, which is normally 128 bits wide, i.e. int4, float4, char16, etc - the driver can be queried for this but all current(?) devices are like this, any wider and it's just the same as unrolling the loop once which may be worse since you're reducing the parallelism by half).
On AMD only, I would definitely consider using byte arrays rather than images, and then use amd_sad for the summation, it handles the overflow for you. If you try amd_sad, use uchar16 as the interface size (obviously this requires the data be 16-byte-multiple-wide, but should give you the peak performance). If using arrays you have to do all the range checking yourself. I would probably at least try a work-size of 64x1 for this, as it lets the compiler remove barriers (use the reqd_work_group_size annotation)
I would definitely hard-code the local workgroup size, which lets you simplify the interface and coding as well (e.g. you can just define the reduction array inline). The compiler will automatically unroll the loop if you give it a constant loop specification - i've had a great deal of problems using #pragma unroll with amd's compiler in the past, but i don't know if it's been fixed now - but generally it isn't required for small hard-coded compile-time-decipherable loops which you have here.
Finally, for your reduction, just do it with simple flat addressing. It doesn't matter what order you read the data when you sum it so long as each item is read once. e.g. use len=width*height; then just iterate over that.
And I noticed you have 2 separate data sizes here, intra-frames are smaller. I would definitely consider moving that to a separate routine since it would simplify the execution paths.
Oh ... a really big thing I didn't notice (and needed the full source to see!!) NEVER EVER use column-major addressing unless you absolutely have to and know what you're doing, if you get the numbers wrong you're going to totally kill performance - this alone could be the source of your abysmal performance. You could easily get 1/32(!!!) of the LDS performance if your data-width is a power of 2>=32 (i.e. as you have) Always do row-major order in the order of the dimension number (i.e. 0, 1, 2).
e.g.these (and all the rest) must be swapped ( 1 <> 0)
int l_i= get_local_id(1);
int l_j = get_local_id(0);
So your local (2d) addressing should always look something like:
unsigned int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
(where get_local_size(0) should be hard-coded to a define if you can)
This would also adversely affect your opencl/CPU results as the data wont be accessed serially but with a (very) cache-unfriendly stride.
This might also be why 128x2 works out better than 16x16 (which is counter-intuitive, but still might just be a coincidence). I'd also try 8x8 with a reqd_work_group_size set as removing the barriers might be a win (I'm only presuming your hardware supports this).
Thank you very much for your help.
- I use CL_UNSIGNED_INT8 data with RGBA to store my grayscale pixels indeed. To my mind, the most important was to compute on vector data type. But as you said, the reduction is faster doing a "pre reduction" on the vector. I get -0.5ms from this tip (on 2.25ms).
- Hardcoding the local workgroup size helps to gain some micro-second also, as I do not need that much flefixiblity, it is welcome.
-The index I was using was the bad way indeed, thanks a lot for that indication. However it still looks more intuitive to use the other way x). But I get -0.6ms using the right way, with 8x8 workgroup size, which is the best size.
-I am not sure to understand why you mean by "flat adrdessing"
Finally, for your reduction, just do it with simple flat addressing.
I am working on re-writing the reduction loop from your indication, and I will also try float data.
Anyway, I reduced my execution time by 55% which is very good =), it's now just below 1ms per frame.