While trying to speed up processing of a few large arrays, I used shared/local memory to split the arrays into smaller blocks and to increase the kernel's execution domain.
It works on my dev host (C-60 Loveland) and also gives correct results on an HD6950 GPU, but some testers report wrong computations on certain GPUs.
So far tested:
C-60 Loveland with OpenCL 1.2 AMD-APP (1268.1) driver (Windows) - correct results
HD6950 with OpenCL 1.2 AMD-APP (1348.5) driver (Windows) - correct results
HD7970/Tahiti with Catalyst 14.9 (Windows) - invalid results
Tahiti LE with Catalyst 14.12 / OpenCL 1.2 AMD-APP (1642.5) driver (Linux) - correct results
Hawaii Pro with Catalyst 14.9 / OpenCL 1.2 AMD-APP (1526.3) driver (Linux) - invalid results
It is not clear whether this is a driver-version issue, a card-architecture issue, or a problem in the kernel code itself.
Here is the kernel under question: http://pastebin.com/c9sX8Xwj
It has debug output enabled and different cards provide quite different outputs.
What is wrong here?
P.S. The kernel's local domain is always {x,1,z}, hence get_local_id(1) is not used inside the kernel. Also, the kernel produced correct results on the HD7970 with workgroups/local domains of (1,1,64) and (4,1,1) (the latter means no array splitting at all) but generated wrong results with (1,1,128).
I have not found any allowed WG configs that fail on the C-60 so far...
Additional tests were made on Tahiti, Tahiti LE and Hawaii devices under Windows and Linux.
While Tahiti LE worked with all possible workgroup geometries, both Tahiti and Hawaii work correctly only when the workgroup size is less than or equal to the wave size (that is, WGsize <= 64), and then for all possible kernel geometries. That is, 2x1x32 works, as does 4x1x16, but 1x1x128 does not.
All this points to some issue with synchronization between waves. Are some required barriers missing? Or is the problem at a level other than the source code?...
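To illustrate why a missing barrier could hide exactly like this (a toy sketch, not the actual kernel logic): on AMD hardware all work-items of one 64-wide wavefront execute in lockstep, so a missing barrier(CLK_LOCAL_MEM_FENCE) between a local-memory write phase and a read phase stays masked as long as the whole workgroup fits in one wave. The Python emulation below (hypothetical write/read pattern) models waves executing one after another:

```python
WAVE = 64  # AMD wavefront size

def run(wg_size, use_barrier):
    """Emulate a workgroup where each work-item writes local[i] = i,
    then reads its neighbour's slot. Returns True if all reads were correct."""
    local = [None] * wg_size                      # emulated __local array
    waves = [range(w, min(w + WAVE, wg_size)) for w in range(0, wg_size, WAVE)]
    if use_barrier:
        for wave in waves:                        # phase 1: ALL waves write...
            for i in wave:
                local[i] = i
        # ...barrier(CLK_LOCAL_MEM_FENCE) would sit here, then all waves read
        return all(local[(i + 1) % wg_size] == (i + 1) % wg_size
                   for wave in waves for i in wave)
    # without a barrier, each wave runs both phases before the next wave starts
    ok = True
    for wave in waves:
        for i in wave:
            local[i] = i
        for i in wave:
            ok &= local[(i + 1) % wg_size] == (i + 1) % wg_size
    return ok

print(run(64, False))    # True  - one wave, lockstep masks the missing barrier
print(run(128, False))   # False - second wave's slots not yet written
print(run(128, True))    # True  - barrier restores correctness
```

In this toy model any single-wave workgroup appears correct even without the barrier, mirroring the observed WGsize <= 64 behaviour.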
Hi,
I would like to check the source code, but the shared link is not accessible from here. Could you check that?
Regards,
Ravi
Actually, there is a very high probability that this issue has the same roots as the one described in this thread: possible OpenCL compiler bug, from a few months ago. Since we tried the latest available drivers, the issue is apparently still not fixed.
Please fix the already known issue, CONFIRMED by your staff, first. That would save a lot of time for both users and support staff, who otherwise re-check and re-report already detected bugs over and over.
And here is the full kernel code, in case I'm mistaken and this is another issue:
Thanks for the code. Could you please also tell me the size of local memory allocated in the host code?
Also, what are the global_work_sizes for all the different local_work_sizes you mentioned in your earlier posts?
Regards,
Ravi
Thanks for looking into this issue.
Requested data:
1) List of global kernel sizes for a run with failures (captured on the Tahiti LE host, which is able to run this kernel properly):
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (4,15,64); local (4,1,64)
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)
host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)
[... the same launches repeat throughout the run: the global x-dimension cycles through 4, 8, 16, 32, 64, 128 and 256, always with global (x,15,64) and local (4,1,64)]
2) Allocated local memory:
err |= clSetKernelArg(PC_find_triplets_avg_kernel_HD5_cl, 8, sizeof(cl_float4)*64*4, NULL);
That is, enough local memory is allocated to store one cl_float4 per work-item at the maximum possible workgroup size (256).
It may be worth making this tunable to the actual WG size used, but for now the maximum possible amount is always allocated. When the WG size is smaller than 256, part of the allocation simply goes unused.
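For reference, the arithmetic behind that clSetKernelArg call (a sketch; local_bytes is a hypothetical helper illustrating the tunable variant, not code from the host app):

```python
CL_FLOAT4_BYTES = 4 * 4   # four 32-bit floats per cl_float4

# fixed allocation from the host code: sizeof(cl_float4) * 64 * 4
fixed_alloc = CL_FLOAT4_BYTES * 64 * 4
print(fixed_alloc)        # 4096 bytes: one cl_float4 per work-item at WG size 256

# what a WG-size-tunable allocation would request instead
def local_bytes(wg_x, wg_z):
    """Bytes of __local memory for an (wg_x, 1, wg_z) workgroup."""
    return CL_FLOAT4_BYTES * wg_x * wg_z

print(local_bytes(4, 64))   # 4096 -> the (4,1,64) case uses the whole buffer
print(local_bytes(1, 64))   # 1024 -> 3/4 of the fixed buffer would sit unused
```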
3) Global domain vs local domain sizes.
The global domain depends on the data being processed. I listed the global sizes for the very task that fails on some GPUs but is processed OK on others.
The local domain is currently tunable. The listed one, (4,1,64), works OK on some GPUs but fails on Tahiti and Hawaii.
If one chooses something like (2,1,32) or (1,1,64) (with the very same first two dimensions of the global domains), the task finishes OK on ALL tested devices. As one can see, the WG can have different geometries, but it always has to be no bigger than a single wavefront to work everywhere.
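A quick check of which of the mentioned geometries fit in one 64-wide wavefront (plain arithmetic, nothing kernel-specific):

```python
WAVE = 64  # AMD wavefront size

def fits_one_wave(x, y, z):
    """True if an (x, y, z) workgroup is at most one wavefront."""
    return x * y * z <= WAVE

print(fits_one_wave(2, 1, 32))   # True  -> works on all tested devices
print(fits_one_wave(1, 1, 64))   # True  -> works on all tested devices
print(fits_one_wave(4, 1, 64))   # False -> fails on Tahiti and Hawaii
```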
EDIT:
4) There is another, very similar kernel that processes some of the sizes, and it can fail too. The difference from the listed one: there is no write of the averaged value into global memory. Everything else is just the same.
I'll list the global domains used there soon too.
EDIT2:
And here are the missing sizes from the secondary kernel:
host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (4,15,64); local (4,1,64)
host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)
host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)
host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)
host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)
[... each of these launches repeats many times throughout the run, mostly in pairs, always with local (4,1,64)]
This sounds like you aren't using a barrier somewhere. 64 is the magic wavefront size; at <= 64 all work-items execute in lockstep on AMD GPUs.
However, after glancing over the code I wasn't sure your offset calculation for tmp_local made sense... things like that can also produce this kind of problem.
Yes, missing synchronization was the first thing I thought about.
But so far I can't find where barriers would be missing. Also, keep in mind that this kernel works perfectly on the HD6950, for example, and on the C-60 APU too. Both devices would experience the same issues as some of the GCN ones if a barrier were missing... but they don't.
And what exactly don't you like in tmp_local? There are get_local_size(2) threads/work-items that work cooperatively on a single array. The kernel processes a few such arrays, hence a single workgroup handles a few independent teams of threads (governed by the get_local_id(0) index). Also, get_local_id(1) is always zero because the WG dimensions are always x*1*z, so get_local_id(1) doesn't participate in the calculations.
Please be more specific about what exactly you consider wrong there.
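For an (x,1,z) workgroup, the intended tmp_local layout as I describe it above is one contiguous slice of get_local_size(2) elements per team. A sketch of that indexing (hypothetical names, not the actual kernel source), showing the teams never alias each other:

```python
def tmp_local_offset(lid0, lid2, lsize2):
    """Flat index into tmp_local for work-item (lid0, 0, lid2):
    team lid0 owns the contiguous slice [lid0*lsize2, (lid0+1)*lsize2)."""
    return lid0 * lsize2 + lid2

# a (4,1,64) workgroup: 4 teams of 64 work-items, 256 slots total
lsize0, lsize2 = 4, 64
slots = {tmp_local_offset(i, k, lsize2)
         for i in range(lsize0) for k in range(lsize2)}
print(len(slots))  # 256 distinct slots -> no two work-items share a slot
```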
Reposting - my first post got rejected because of a sentence containing AMD and the word fail. Lighten up, mods. AMD has caused people headaches repeatedly, so it is a fact of life to consider in debugging.
To remove thread/race conditions/AMD-originating failures from the problem, try validating your code in Python with something like numpy, emulating what the threads should be doing and all the id/index computations. I've found you can get a pretty close mapping in most places, but you still must do it carefully.
Btw, I don't know if it makes a bit of difference in performance these days, but generally you would want to swap the semantics of your x and z dimensions, because x is the fastest-moving and z is the slowest.
I was also not able to figure out why you would want an x dimension larger than 1; not that that necessarily undermines your issue. The hardware scheduler should pretty much do the job of x for you.
You might also try declaring the workgroup shared memory locally in the kernel and see if that changes anything (I wouldn't be surprised if the compiler emits different instructions due to that alone).
Except for the semantic mapping of x and z and what you are currently using x for, I have a ton of code just like this function. We all know the compiler sucks a lot, but generally I don't have issues with it on simple stuff like this, which leads me to believe it's one of those "everything looks correct but there's just one problem hiding in plain sight" cases. Indexing bugs can also alias like this, which is part of why I suggest writing the code in Python. You'll at least have more proof/peace of mind and an easier environment to inspect what's going on for most issues (not all).
Another thing you could do is write out intermediate data to GDS so you can double-check and bisect the range where the calculations go bad...
Also, you remember that float parallel reductions are not equivalent to serial reductions, right? This is due to the non-associativity of floating-point math. You could maybe s/float/int/ and test whether you get the same result across different workgroup sizes, as long as there's enough numerical stability.
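A minimal demonstration of that point (Python doubles here; the effect is the same in kind for OpenCL floats, just at different magnitudes):

```python
vals = [1e16, 1.0, -1e16, 1.0]

# serial reduction: left-to-right, as a single work-item would do it
serial = 0.0
for v in vals:
    serial += v

# tree/parallel reduction: pair up neighbours, as a workgroup reduction does
pairwise = (vals[0] + vals[1]) + (vals[2] + vals[3])

print(serial)    # 1.0  - one 1.0 survives after the big terms cancel
print(pairwise)  # 0.0  - each 1.0 is absorbed by its 1e16 partner first
```

So differing results across workgroup sizes can mean nothing more than a different summation order - which is why run-to-run variation at a FIXED geometry is the stronger signal of a race.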
I'm too big a fan of Occam's razor to go to the trouble of re-implementing the kernel in another language, knowing it already works OK on some hardware, until I get an explanation of why it works only on that hardware. Currently the simplest explanation is a compiler bug that emits broken synchronization machine code for a subset of GCN devices. In a few days I will be able to test the same kernel on an iGPU and on nVidia. That should be quite enough to rule out algorithmic issues IMO (even if having a working kernel on VLIW ATi GPUs isn't enough already).
Regarding the x and z dimensions: mostly historical reasons - there is a similar kernel that doesn't use local memory. Taking into account the strided access to global memory and the low computational density of the kernel, I don't expect a big speed difference from reordering local memory accesses. The biggest issue this kernel solves is loading all CUs with work, which is sometimes not the case with the older one.
Why an x-dim in the workgroup: because I want flexibility in WG geometry (and this flexibility would be gone if local memory were allocated inside the kernel, BTW). The app processes different numbers of arrays at once, with different sizes, on devices with different numbers of CUs, hence I need different numbers of workgroups and waves. Having a workgroup of a single thread team, especially when that team is equal to or smaller than the wavefront size, would limit the waves in flight per CU, which would reduce occupancy and performance.
A rounding issue can be ruled out because, on devices where the kernel gives incorrect results, it gives _different_ results from run to run - not merely invalid but stable ones (with identical WG geometries across runs, of course).
right, well...
Also, did you compare results against a regular old CPU target with your problematic WG sizes? Does commenting out the printfs you have change outcomes at all? I've seen their presence/absence influence some strange things, not just race conditions. Storing intermediate results to GDS would also let you bisect down the range where the problem occurs - it can help you figure out where to look in the IR/ISA output.
Also, I noticed the cards you tested with problems used 14.9 - I know you noted this, but did you bother retesting against 14.12?
As for the local memory declaration, it's just a debugging suggestion, not a permanent-change suggestion. Again: tweak -> run -> analyze & infer.
The printfs were added because of this issue; it exists on some GPUs both with and without those printfs.
By now I have also successfully run that kernel on an HD2500 iGPU - no issues. So VLIW AMD GPUs, lesser (Tahiti LE) GCN GPUs, and iGPUs are all free from this issue.
I haven't been able to check on nVidia yet, but the variance between hardware is quite big already.
AFAIK the tester who first reported this issue tested under a few different driver versions. It would indeed be good to find a working driver, but that can only be a workaround if the latest driver has this bug...
Raistmer,
I'm not expecting this to work, but a problem like this also exists right now on multi-GPU systems. I re-encountered the problem for several hours yesterday, so I wanted to see if the kinda-fix changes anything for you. This might help AMD find and fix 2 problems. The test-fix is setting the environment variable GPU_NUM_COMPUTE_RINGS=1
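For anyone wanting to try the test-fix, here is a minimal Linux shell sketch; the application name is a placeholder, only the variable itself comes from the suggestion above:

```shell
# Test-fix from the post above: force a single compute ring
# before launching the OpenCL application.
export GPU_NUM_COMPUTE_RINGS=1
echo "GPU_NUM_COMPUTE_RINGS=$GPU_NUM_COMPUTE_RINGS"
# ./your_opencl_app   # placeholder for the actual application
```

On Windows the equivalent would be `set GPU_NUM_COMPUTE_RINGS=1` in the same console before starting the app.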
Thanks for the suggestion, but both Windows and Linux testers reported negative results. That environment variable has no influence on this particular issue.
Hi Raistmer,
My apologies for this delay.
From your posts, it seems that your issue is a platform-specific one (particularly with a few GCN cards). You asked whether your problem has anything to do with this one: possible OpenCl compiler bug. If so, then I'm sorry, because that issue has not been resolved yet.
However, at this moment, I'm not sure whether both are the same or not. That's why I would like to forward this issue to the concerned team by filing a bug report against it. To do so, I need a complete reproducible test case. Could you please provide one?
Regards,
Hello. Yes, you summarized the issue right. Only some GCN-family cards are affected, but those that are affected seem to be affected under both Windows and Linux. Other platforms (nVidia, Intel GPU), just like older AMD cards, seem not to be affected.
Since the initial guess about wave size involvement, we did more comprehensive testing of all possible workgroup sizes.
The issue seems more complex than just the WG size being smaller or bigger than the wave size.
Here is full table:
x/z | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
1 | + | + | + | + | + | + | + | + | + |
2 | + | + | + | + | + | + | + | + | 0 |
4 | + | + | + | + | + | + | + | 0 | 0 |
8 | + | + | + | + | + | - | 0 | 0 | 0 |
16 | + | + | + | - | - | 0 | 0 | 0 | 0 |
32 | + | + | - | - | 0 | 0 | 0 | 0 | 0 |
64 | + | - | - | 0 | 0 | 0 | 0 | 0 | 0 |
128 | - | - | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
256 | + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
"+" means kernels work OK with sich workgroup. "-" - false detections. 0 - such WG size not supported on AMD hardware.
As one can see, some configs that exceed the wave size still work OK. And indeed, everything smaller than or equal to the wave size works OK.
Also, of all the domain sizes listed earlier for those 2 kernels, only sizes with x-dim equal to 256 and 512 give false detections.
All other sizes are silent besides 6 small ones that give true detections at any config, just as on all other hardware (hence we did not miss valid detections; we just get false ones, and at 2 specific domain sizes only). The number of false detections differs between runs, but in all runs I saw only (256,y,z) (and only in a single case) and (512,y,z) domain (global) sizes lead to failures.
I'll construct a test case illustrating this issue and upload it in a separate post with a description of how to use it.
Here is the test case
To run just launch executable with desired kernel workgroup configuration.
Example for (4,1,64) workgroup:
start MB7_win_x86_SSE_OpenCL_ATi_HD5_r2889.exe -tune 1 4 1 64
The y-component should always be 1, and this particular kernel's number is 1 too (the first number).
The app will produce a few different files, but you only need to look at stderr.txt.
A reference one is zipped inside the archive.
The relevant part of it (check that the first listed line is present to ensure the app got the desired option):
TUNE: kernel 1 now has workgroup size of (4,1,64)
Autocorr: peak=19.20864, time=20.13, delay=6.6902, d_freq=1419769860.95, chirp=-1.8134, fft_len=128k
TripletFind miss: domain(32,15,64), (local)_(with_average) kernel
TripletFind miss: domain(8,15,64), (local)_(wo_average) kernel
TripletFind miss: domain(32,15,64), (local)_(with_average) kernel
Gaussian: peak=3.140635, mean=0.5500718, ChiSq=1.353511, time=76.34, d_freq=1419770570.67,
score=1.136018, null_hyp=2.090153, chirp=-4.5252, fft_len=16k
TripletFind miss: domain(8,15,64), (local)_(with_average) kernel
TripletFind miss: domain(16,15,64), (local)_(wo_average) kernel
TripletFind miss: domain(32,15,64), (local)_(with_average) kernel
and:
class PC_triplet_find_miss: | total=6, | N=6, | <>=1, | min=1 | max=1 |
class PoT_transfer_needed: | total=11, | N=11, | <>=1, | min=1 | max=1 |
If you see a different number of reported misses and an increased number of needed transfers, that means you are seeing the bug under consideration.
Example of bad behaving config:
TUNE: kernel 1 now has workgroup size of (2,1,128)
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
......
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
Autocorr: peak=19.20864, time=20.13, delay=6.6902, d_freq=1419769860.95, chirp=-1.8134, fft_len=128k
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
....
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
Gaussian: peak=3.140636, mean=0.5500715, ChiSq=1.353511, time=76.34, d_freq=1419770570.67,
score=1.13603, null_hyp=2.090154, chirp=-4.5252, fft_len=16k
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
....
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
TripletFind miss: domain(16,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(32,15,128), (local)_(with_average) kernel
TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(256,15,128), (local)_(wo_average) kernel
TripletFind miss: domain(512,15,128), (local)_(with_average) kernel
....
class PC_triplet_find_miss: | total=181, | N=181, | <>=1, | min=1 | max=1 |
class PoT_transfer_needed: | total=186, | N=186, | <>=1, | min=1 | max=1 |
BTW, I have a report that on Linux the bug is fixed for Hawaii (at least) in the 15.3 Beta driver, though that driver has other issues (power-save low frequency not raised for the first GPU).
Looking forward to getting the fix in the Windows driver too.
Thanks for providing the reproducible test case. I'll try and get back to you shortly.
Regards,
As you've mentioned, though the Catalyst 15.4 Beta driver for Windows (1642.5 (VM) OCL part) has considerably increased CPU usage, your application is working fine using the latest driver. It's good to hear that. Thanks for the confirmation.
I really appreciate your creating a new thread regarding the unexpected performance issue you are facing with the latest driver. Making a new thread for a new/unrelated problem is always a good idea, and we always encourage that. Thanks once again.
Regards,