I think it is just because when you run on 256x1 you have to specify it as a one-dimensional group {256}, not as a two-dimensional group {256,1}.
Yeah, 256x1 with a 2D kernel is fine. This is how most of the samples are coded.
Originally posted by: Raistmer Hello. I have a very long kernel that should run over a big execution domain. If I enqueue it over the whole domain it causes a driver restart (due to too-long execution), so I split the execution domain into blocks and call the kernel over smaller parts. With a (128x2) grid size it runs OK, but with (256x1) it causes a driver restart. What can be the reason for such different behavior with the same number of threads? I use an HD4870. It has (AFAIK) 10 compute units; that is, for better GPU usage, one of the dimensions should be divisible by the number of compute units (in my case, by 10). But also, for better load, each wavefront should have 64 threads. Do I understand correctly that, to meet both requirements, the first (X-axis) dimension should be divisible by 64 while the second (Y-axis) should be divisible by 10 (in my case)? In other words, is it true that, say, 128x10 will work better on my GPU than 10x128?
Do you have barriers in your kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase to reproduce your issue.
You can have at most 256 threads per group, so it is not possible to use 128x10 or 10x128.
Originally posted by: genaganna Do you have barriers in your kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase to reproduce your issue.
Is this documented anywhere? I have been having trouble with a memory-testing kernel that produces sporadic errors in about 0.7% of runs, which led me to believe my 5870 was defective. It turned out that clGetKernelWorkGroupInfo said a work-group size of 256 was OK. Reducing the work-group size to 128 got rid of all the errors, but according to your post even that is too large. Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases? This was a really difficult bug to find, and I tried the smaller work-group size only out of desperation, because initially I trusted the result returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
Edit: is 64 the limit for all ASICs, or is the size of a wavefront the limit?
Originally posted by: HarryH Originally posted by: genaganna Do you have barriers in your kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase to reproduce your issue.
Is this documented anywhere? I have been having trouble with a memory-testing kernel that produces sporadic errors in about 0.7% of runs, which led me to believe my 5870 was defective. It turned out that clGetKernelWorkGroupInfo said a work-group size of 256 was OK. Reducing the work-group size to 128 got rid of all the errors, but according to your post even that is too large. Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases? This was a really difficult bug to find, and I tried the smaller work-group size only out of desperation, because initially I trusted the result returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
This is true only for 4xxx cards. clGetKernelWorkGroupInfo returns a trusted value. It is recommended to use the value returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a work-group size of 256 led to errors and reducing the work-group size got rid of them. The errors are more or less random and occur in only 0.7% of cases, but the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by a kernel, leading clGetKernelWorkGroupInfo to produce a suggested work-group size that is too large.
Originally posted by: HarryH In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a work-group size of 256 led to errors and reducing the work-group size got rid of them. The errors are more or less random and occur in only 0.7% of cases, but the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by a kernel, leading clGetKernelWorkGroupInfo to produce a suggested work-group size that is too large.
Please give us a testcase to reproduce this issue. A testcase will help us fix it.
genaganna
Sorry, I forgot to mention that. I am already making a simplified version that reproduces the error and will send it to streamdeveloper@amd.com when it's finished (should be this week). Is that the right address? It's a bit too much to post on the forum.
Originally posted by: HarryH genaganna
Sorry, I forgot to mention that. I am already making a simplified version that reproduces the error and will send it to streamdeveloper@amd.com when it's finished (should be this week). Is that the right address? It's a bit too much to post on the forum.
Yes, the email address is correct. Thank you for helping us fix issues.
You're welcome
Originally posted by: HarryH In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a work-group size of 256 led to errors and reducing the work-group size got rid of them. The errors are more or less random and occur in only 0.7% of cases, but the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by a kernel, leading clGetKernelWorkGroupInfo to produce a suggested work-group size that is too large.
It is really hard to get the kernel work-group size equal to the GPU's maximum supported work-group size. Early versions of the SDK let you use the maximum size possible, but recent versions force you to use a size <= an optimized maximum given by the compiler after it compiles the kernel source code. According to the ATI documentation, since resources (i.e. registers, ...) are limited, the maximum number of work-items per work-group depends on the kernel's resource utilization. So, to determine the maximum work-group size for your kernel, you have to query CL_KERNEL_WORK_GROUP_SIZE rather than CL_DEVICE_MAX_WORK_GROUP_SIZE.
Originally posted by: genaganna Originally posted by: HarryH Originally posted by: genaganna Do you have barriers in your kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase to reproduce your issue.
Is this documented anywhere? I have been having trouble with a memory-testing kernel that produces sporadic errors in about 0.7% of runs, which led me to believe my 5870 was defective. It turned out that clGetKernelWorkGroupInfo said a work-group size of 256 was OK. Reducing the work-group size to 128 got rid of all the errors, but according to your post even that is too large. Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases? This was a really difficult bug to find, and I tried the smaller work-group size only out of desperation, because initially I trusted the result returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
This is true only for 4xxx cards. clGetKernelWorkGroupInfo returns a trusted value. It is recommended to use the value returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
Is this based on any variables? If not, I don't think this is a good idea, since kernels vary dramatically depending on the block size chosen.
ryta1203,
It is really hard to determine a dimension preference by a fixed rule. When using global buffers, the more linear the memory access pattern, the faster the kernel runs.
But when using images, the data can be effectively cached in the L1 cache (which is inherently 2D), so 2D cache lines can be utilised more effectively for 2D memory access.
Moreover, your performance depends on the way the kernel is written and executed, depending on the GPU you are using.
Originally posted by: genaganna Originally posted by: Raistmer Hello. I have a very long kernel that should run over a big execution domain. If I enqueue it over the whole domain it causes a driver restart (due to too-long execution), so I split the execution domain into blocks and call the kernel over smaller parts. With a (128x2) grid size it runs OK, but with (256x1) it causes a driver restart. What can be the reason for such different behavior with the same number of threads? I use an HD4870. It has (AFAIK) 10 compute units; that is, for better GPU usage, one of the dimensions should be divisible by the number of compute units (in my case, by 10). But also, for better load, each wavefront should have 64 threads. Do I understand correctly that, to meet both requirements, the first (X-axis) dimension should be divisible by 64 while the second (Y-axis) should be divisible by 10 (in my case)? In other words, is it true that, say, 128x10 will work better on my GPU than 10x128?
Do you have barriers in your kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase to reproduce your issue.
You can have at most 256 threads per group, so it is not possible to use 128x10 or 10x128.
raistmer,
Indeed you are right; it is the memory access patterns that determine performance in such cases. But these patterns depend on many factors:
how are you fetching data inside your kernel?
how are your wavefronts organised?
how is your cache being used?
Some cases favour a 1D approach while others favour a 2D approach.
Therefore it is very hard to determine which way is more efficient.