
Raistmer
Adept II

About execution domain dimensions

How to choose them properly?

Hello.
I have a very long kernel that should run over a big execution domain.
If I enqueue it over the whole domain it causes a driver restart (because the execution takes too long).
So I split the execution domain into blocks and call that kernel over the smaller parts.

When I use a (128x2) grid size it runs OK, but with (256x1) it causes a driver restart.
What can be the reasons for such different behavior with the same number of threads?

I use an HD4870. It has (AFAIK) 10 compute units, so for better GPU usage some of the dimensions should be divisible by the number of compute units (in my case, by 10). But also, for better load, each wavefront should have 64 threads.
Do I understand correctly that, to meet both requirements, the first (X-axis) dimension should be divisible by 64 while the second (Y-axis) one should be divisible by 10 (in my case)?

In other words, is it true that, let's say, 128x10 will work better on my GPU than 10x128?
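A minimal host-side sketch of the blocked enqueue described above, assuming hypothetical names (the helper, `queue`, `kernel`, and the kernel arguments carrying the block origin are not from the original post):

    #include <CL/cl.h>

    /* Enqueue a large 2D domain in small blocks so that no single launch runs
     * long enough to trip the display-driver watchdog. Assumes the domain size
     * is a multiple of the block size. */
    void enqueue_in_blocks(cl_command_queue queue, cl_kernel kernel,
                           size_t domain_w, size_t domain_h,
                           size_t block_w, size_t block_h)
    {
        for (size_t oy = 0; oy < domain_h; oy += block_h) {
            for (size_t ox = 0; ox < domain_w; ox += block_w) {
                /* OpenCL 1.0 requires global_work_offset to be NULL, so the
                 * block origin is passed as kernel arguments instead (argument
                 * indices 1 and 2 are an assumption of this sketch). */
                cl_uint off_x = (cl_uint)ox, off_y = (cl_uint)oy;
                clSetKernelArg(kernel, 1, sizeof(cl_uint), &off_x);
                clSetKernelArg(kernel, 2, sizeof(cl_uint), &off_y);

                size_t global[2] = { block_w, block_h };   /* e.g. 128 x 2 */
                clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global,
                                       NULL, 0, NULL, NULL);
                clFinish(queue);   /* wait per block to keep each GPU batch short */
            }
        }
    }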
0 Likes
16 Replies
rotor
Journeyman III

I think it's just because when you run 256x1 you have to specify it as a one-dimensional group {256}, not as a two-dimensional group {256,1}.

0 Likes

No, it goes fine with 128x1, for example. There is no such requirement in the spec.
0 Likes

Yeah, 256x1 with a 2D kernel is fine. This is how most of the samples are coded.

0 Likes
genaganna
Journeyman III

Originally posted by: Raistmer ... In other words, is it true that, let's say, 128x10 will work better on my GPU than 10x128?


Do you have barriers in the kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase that reproduces your issue.

You can have at most 256 threads per group, so it is not possible to use 128x10 or 10x128.
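A minimal sketch of forcing the work-group size down to 64 for a kernel that uses barrier(); `queue` and `kernel` are placeholder names:

    size_t global[2] = { 128, 2 };   /* work-items in this launch */
    size_t local[2]  = { 64, 1 };    /* 64 work-items (one wavefront) per group */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                        global, local, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        /* e.g. CL_INVALID_WORK_GROUP_SIZE if the limit is still exceeded */
    }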

 

0 Likes

Originally posted by: genaganna Do you have barriers in the kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase that reproduces your issue.


Is this documented anywhere? I have been having trouble with a memory-testing kernel that produces sporadic errors in about 0.7% of runs, which led me to believe my 5870 was defective. It turned out that clGetKernelWorkGroupInfo said a work-group size of 256 was OK. Reducing the work-group size to 128 got rid of all the errors. But according to your post even that is too large. Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases? This was a really difficult bug to find, and I only tried the smaller work-group size out of desperation, because initially I trusted the result returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).

Edit: is 64 the limit for all ASICs, or is the size of a wavefront the limit?

0 Likes

Originally posted by: HarryH ... Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases?

This is true only for the 4xxx cards. clGetKernelWorkGroupInfo returns a trusted value. It is recommended to use the value returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
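A minimal sketch of that query, with `kernel` and `device` as placeholder names:

    /* Ask the runtime for this kernel's limit after the program is built,
     * and clamp the local size to it before enqueueing. */
    size_t kernel_wg = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg), &kernel_wg, NULL);

    size_t local[2] = { 64, 1 };
    if (local[0] * local[1] > kernel_wg) {
        local[0] = kernel_wg;   /* fall back to what this kernel supports */
        local[1] = 1;
    }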

0 Likes

In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a work-group size of 256 led to errors and reducing the work-group size got rid of them. The errors are more or less random and occur in only 0.7% of cases. But the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester, I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by a kernel, leading to clGetKernelWorkGroupInfo suggesting a work-group size that is too large.

0 Likes

Originally posted by: HarryH ... I suspect the problem lies with the compiler's estimate of the resources needed by a kernel, leading to clGetKernelWorkGroupInfo suggesting a work-group size that is too large.

Please give us a testcase to reproduce this issue. A testcase will help us fix it.

0 Likes

genaganna,

Sorry, I forgot to mention that. I am already making a simplified version that reproduces the error and will send it to streamdeveloper@amd.com when it's finished (should be this week). Is that the right address? It's a bit too much to post on the forum.

0 Likes

Originally posted by: HarryH ... Is that the right address? It's a bit too much to post on the forum.

Yes, that email address is correct. Thank you for helping us fix these issues.

0 Likes

You're welcome

0 Likes

Originally posted by: HarryH ... I suspect the problem lies with the compiler's estimate of the resources needed by a kernel, leading to clGetKernelWorkGroupInfo suggesting a work-group size that is too large.

It is really hard to get the kernel work-group size equal to the GPU's maximum supported work-group size. Early versions of the SDK allowed you to use the maximum possible size, but recent versions force you to use a size <= an optimized maximum computed by the compiler after it compiles the kernel source code. According to the ATI documentation, because resources (e.g. registers) are limited, the maximum number of work-items per work-group depends on what the kernel uses. So, to determine the maximum work-group size for your kernel, you have to query CL_KERNEL_WORK_GROUP_SIZE rather than CL_DEVICE_MAX_WORK_GROUP_SIZE.
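A minimal sketch of the two queries being contrasted above, with `device` and `kernel` as placeholder names:

    size_t device_max = 0, kernel_max = 0;

    /* Device-wide ceiling, e.g. 256 on this class of hardware... */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);

    /* ...versus the per-kernel limit the compiler derives from the kernel's
     * register/LDS usage; this is the value to size work-groups from. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);

    /* kernel_max <= device_max always holds. */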

0 Likes

Originally posted by: genaganna ... This is true only for the 4xxx cards. clGetKernelWorkGroupInfo returns a trusted value. It is recommended to use the value returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).

Is this based on any variables? If not, then I don't think this is a good idea, since kernels vary dramatically depending on the block size chosen.

0 Likes

ryta1203,

It is really hard to determine a dimension preference from a fixed rule. When we are using global buffers, the more linear the memory access pattern, the faster the kernel runs.

But when we are using images, the data can be cached effectively in the L1 cache (which is inherently 2D), so the 2D cache lines can be utilised more effectively with a 2D memory access pattern.

Moreover, your performance depends on the way the kernel is written and executed on the particular GPU you are using.
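For illustration, a pair of toy kernels showing the two access styles described above (all kernel and argument names are made up):

    /* Linear buffer access: consecutive work-items along X read consecutive
     * addresses, the friendly pattern for global buffers. */
    __kernel void copy_buffer(__global const float *src, __global float *dst,
                              uint width)
    {
        size_t x = get_global_id(0);
        size_t y = get_global_id(1);
        dst[y * width + x] = src[y * width + x];
    }

    /* Image access: the 2D L1 texture cache can also serve neighbours in Y,
     * so a 2D tile of work-items reuses cached data in both directions. */
    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

    __kernel void copy_image(__read_only image2d_t src, __global float *dst,
                             uint width)
    {
        int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));
        dst[pos.y * width + pos.x] = read_imagef(src, smp, pos).x;
    }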

0 Likes

Originally posted by: genaganna ... Do you have barriers in the kernel? If you have barriers in the kernel, the work-group size can be at most 64. Please send a testcase that reproduces your issue. You can have at most 256 threads per group, so it is not possible to use 128x10 or 10x128.

Maybe I was not clear. I am not talking about the work-group size, I am talking about the grid size; the work-group size is not set (left at the default).
There is no real issue here, just a question. Big grid sizes cause a driver restart for pretty obvious reasons.
The question is: why can 128x2 execute faster than 256x1 on my hardware?
Can it be related to different memory access patterns, for example?
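For illustration only, a hypothetical kernel (not the one from this thread) that shows how the block shape changes which addresses a single launch touches:

    __kernel void process(__global const float *src, __global float *dst,
                          uint width, uint off_x, uint off_y)
    {
        uint x = off_x + (uint)get_global_id(0);
        uint y = off_y + (uint)get_global_id(1);
        uint idx = y * width + x;

        /* A 256x1 launch reads one 256-element run of a single row, while a
         * 128x2 launch reads two 128-element runs on adjacent rows, i.e. two
         * regions `width` elements apart. Depending on how rows map onto
         * memory channels and how much data the kernel reuses, one shape can
         * run noticeably faster than the other even though the number of
         * work-items is identical. */
        dst[idx] = src[idx] * 2.0f;
    }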







0 Likes

raistmer,

Indeed you are right, it is the memory access patterns that determine performance in such cases. But these patterns depend on many factors:

How are you fetching data inside your kernel?

How are your wavefronts organised?

How is your cache being used?

Some cases favour a 1D approach while others favour a 2D approach. Therefore it is very hard to determine which way is more efficient.

0 Likes