6 Replies Latest reply on Oct 5, 2011 9:34 AM by Atmapuri

    Unusual work size execution speed dependencies...

    Atmapuri

      Hi!

      If I run my kernel with a global work size of 524289 instead of 524288 (an exact power of two), performance drops 5x. The size of the allocated memory buffers does not change, so this can't be an alignment issue.

      Why such big penalties? Note that sizes 524290 and 524291 work fine.

      Thanks!
      Atmapuri

        • Unusual work size execution speed dependencies...
          genaganna

          Originally posted by: Atmapuri Hi! If I run my kernel with a global work size of 524289 instead of 524288 (an exact power of two), performance drops 5x. The size of the allocated memory buffers does not change, so this can't be an alignment issue. Why such big penalties? Note that sizes 524290 and 524291 work fine. Thanks! Atmapuri

          Could you please paste the AMD APP Profiler output for the two sizes?

          Could you please also give us your system information (OS, CPU, GPU, driver version, and SDK version)?

          Are you running on the CPU or the GPU?

            • Unusual work size execution speed dependencies...
              nou

              The factorization of 524289 is 3 × 174763, so the runtime can only launch uniform work-groups of 1 or 3 work-items for that size.

              524290 is divisible by 109, and 524291 by 179. To fully utilize an AMD GPU you must use a work-group size that is a multiple of 64.
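              nou's arithmetic can be checked with a short script (a Python sketch, not part of the thread): list the divisors of each global size up to 256, since the runtime can only pick a uniform local size that divides the global size exactly.

```python
# Sketch (assumption: candidate local sizes are capped at 256, a typical
# device maximum). A near-prime global size admits only tiny uniform
# work-group sizes, which is what destroys performance.

def candidate_local_sizes(global_size, max_local=256):
    """All uniform work-group sizes the runtime could use for this launch."""
    return [d for d in range(1, max_local + 1) if global_size % d == 0]

for g in (524288, 524289, 524290, 524291):
    sizes = candidate_local_sizes(g)
    print(g, "largest feasible local size:", sizes[-1])
```

              For 524289 the only options are 1 and 3, so no 64-wide wavefront can ever be filled.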

                • Unusual work size execution speed dependencies...
                  LeeHowes

                  As nou implied, the problem here is that on current generations of chips, all work-groups in a launch have to be the same size. That is why the numbers nou quotes matter: if you don't launch a multiple of 64, your performance will fall off according to the factorisation of the launch size. If you launch a prime number of work-items you'll get wavefronts containing just a single work-item, because the runtime launches groups of size 1.

                  This is unfortunate, but the solution would be to split the launch in the runtime, and it seems that most users are happy launching multiples of 64 (indeed, most users launch in groups and hence they enforce that repeated size anyway) so that would be a low-priority fix in the runtime.

                  The right answer usually with this kind of parallel programming is to pad your data to an efficient launch or cache multiple of the machine (or pad out of a cache multiple, depending on the situation).
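                  The padding approach can be sketched like this (a minimal Python illustration, not from the thread; the helper name is invented): round the global size up to the next multiple of 64 and have the kernel ignore the extra work-items.

```python
# Sketch: pad the launch to a multiple of the preferred work-group size
# (64 on the AMD GPUs discussed here), and pass the real element count
# to the kernel so the padding work-items can exit early.

WAVEFRONT = 64  # assumption: wavefront width on these devices

def padded_global_size(n, multiple=WAVEFRONT):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

# Host side: enqueue padded_global_size(n) work-items with local size 64.
# Kernel side (OpenCL C), guard against the padding work-items:
#     if (get_global_id(0) >= n) return;
print(padded_global_size(524289))  # rounds up to a multiple of 64
```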

                    • Unusual work size execution speed dependencies...
                      Atmapuri

                      This is unfortunate, but the solution would be to split the launch in the runtime, and it seems that most users are happy launching multiples of 64 (indeed, most users launch in groups and hence they enforce that repeated size anyway) so that would be a low-priority fix in the runtime.

                      The right answer usually with this kind of parallel programming is to pad your data to an efficient launch or cache multiple of the machine (or pad out of a cache multiple, depending on the situation).

                      If most users were happy with this limitation, it would be present in C++ compilers as well. If, however, there is a desire to expand the number of applications for which OpenCL can be used, then "most are happy" no longer applies. As I understand it, there is a growing desire to remove as many constraints as possible so that more algorithms can be accelerated with OpenCL.

                      In this context, I don't quite understand your reply. As you say, the work would need to be split inside the driver, so the solution is possible.

                      Thanks!
                      Atmapuri

                        • Unusual work size execution speed dependencies...
                          LeeHowes

                          I mean that it's not a request that comes up often. Most people are reasonably happy programming the hardware with its constraints in mind, as long as those constraints are as predictable as this one. So when it comes to prioritising development time against the other performance issues you'll see around the forum, most of which are far less predictable for the developer and hence more frustrating for more people, this is a low-priority request.

                            • Unusual work size execution speed dependencies...
                              Atmapuri

                              Thank you for the explanation. There is a difference, though, that I would like to note between bugs and feature limitations. Some feature limitations prevent a group of users from ever running into the other bugs, because they kill the project upfront. Consequently, you can't say how many people would have complained about this, because it is part of the spec. I think this is one such feature limitation.