Archives Discussions

thomasp · ‎01-31-2012

Hi

I plan to code something in OpenCL, using the überkernel pattern.

It means that a given kernel would have this structure:

__kernel void my_uber_kernel(void)
{
    while (...)
    {
        if (stage == ...)
        {
            device_function_0();
        }
        else if (stage == ...)
        {
            device_function_1();
        }
        // etc...

        stage = stage + 1;
    }
}

Each device_function_X() potentially contains a substantial amount of code.

I'm wondering whether there are known limits on the number of instructions supported (per thread?) before performance is affected.

Does splitting the processing into small device function calls help with optimization?

Or do I have to split the processing into several kernel calls (so that the above-mentioned device_function_X functions become kernels)?

MicahVillmow · ‎01-31-2012

There is a limit on code size, but it is program-dependent: on the GPU everything gets inlined, which can massively inflate what looks like a relatively small program with lots of function calls.

thomasp · ‎01-31-2012

ok

Any idea of the order of magnitude?

I mean something like: "1000 lines of inlined code is the limit before performance degrades exponentially"

MicahVillmow · ‎01-31-2012

We are talking multiple MB of source code after inlining everything.

jeff_golds · ‎01-31-2012

Keep in mind that registers are reserved globally, as we can't know which part of the code you will use. So if the worst case uses 200 registers, that can limit the performance of other code paths that would use far fewer registers.

I have seen programs with 70,000 ISA instructions that still run pretty fast. Compilation takes around 30 s on a fast machine.
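
The register remark can be made concrete with a little arithmetic. The sketch below is plain C and only models the idea; it is not AMD's actual allocation logic. The function name and all three parameters are hypothetical, and the figures in the note after it (256 registers available, a cap of 10 wavefronts) are assumed for illustration rather than taken from any specific GPU.

```c
#include <assert.h>

/* Number of wavefronts that can be resident at once when every work-item
   reserves regs_per_item registers for the whole kernel.
   All parameters are hypothetical figures, not values for a real device. */
unsigned resident_waves(unsigned regs_available, unsigned regs_per_item,
                        unsigned max_waves)
{
    assert(regs_per_item > 0);
    /* Registers are reserved for the worst-case path, so the budget is
       divided by that worst-case count regardless of the path taken. */
    unsigned by_registers = regs_available / regs_per_item;
    return by_registers < max_waves ? by_registers : max_waves;
}
```

With those assumed figures, resident_waves(256, 200, 10) gives 1 while resident_waves(256, 32, 10) gives 8, so one 200-register path inside a kernel would leave a 32-register path with an eighth of the parallelism it could otherwise have.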

thomasp · ‎01-31-2012

Thank you all for these enlightening answers

hazeman · ‎01-31-2012

The GPU has a program cache. If I'm not mistaken, it's 48 KB in size, so any code that is bigger than this limit slows down.

There was an old post with data from benchmarking kernels of different sizes.

You can check the exact code size by looking at the ISA for your kernel (it's at the end of the ISA listing).
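
To relate an instruction count to the cache size, a rough estimate is enough. The sketch below is plain C with hypothetical helper names; the average bytes per instruction is an assumption (real ISA encodings vary by instruction and by GPU generation), and the 48 KB cache size is the unconfirmed figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated code size in bytes. bytes_per_instr is an assumed average. */
size_t estimated_code_bytes(size_t isa_instructions, size_t bytes_per_instr)
{
    return isa_instructions * bytes_per_instr;
}

/* How many times the code would fill an instruction cache of cache_bytes.
   A value above 1.0 means the whole program cannot stay cached at once. */
double cache_fill_factor(size_t code_bytes, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return (double)code_bytes / (double)cache_bytes;
}
```

For example, the 70,000-instruction program mentioned earlier, at an assumed 8 bytes per instruction, comes to 560,000 bytes, which is roughly 11 times a 48 KB cache. The exact number from the end of the ISA listing should be preferred over this estimate whenever it is available.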

thomasp · ‎02-01-2012

MicahVillmow, although stamped as "AMD", is talking about MB (megabytes?) of source code, whereas you and jeff_golds mention 48 KB / 70,000 ISA instructions.

The answers are of quite different orders of magnitude!

How does performance degrade when the total number of instructions overflows the program cache (by a factor of 2, 4, ...)? Are there known tests on this point?

hazeman · ‎02-01-2012

I think your question was ambiguous. You asked for the program size limit (the maximum possible), and Micah answered that: you can have really huge kernels, and in practice I doubt it's possible to hit this limit. But I thought you might want to know when kernel size starts to carry a performance penalty, and that's why I posted my answer.

When the GPU doesn't find kernel code in the cache, it has to load it from global memory, and global memory is orders of magnitude slower than the cache. You also pay the penalty for the cache miss itself.

You can find a post with a benchmark here

thomasp · ‎02-01-2012

I agree the question was ambiguous; that is obviously due to a lack of knowledge on my part.

Anyway, thank you for the clarifications and the pointer. I think I get it now.

MicahVillmow · ‎02-01-2012

Yeah, if the question was about the code cache, then JeffG/Hazeman are correct. My answer was about what size of code has been known to make it through the compiler. Larger code bases do not make it through unless you have machines with a lot of RAM (think 8 GB+).

notzed · ‎02-02-2012

Although it depends a bit on the specifics, in general I don't think you want to do this.

The overhead of invoking kernels is quite high, but the penalty for a poorly executing kernel will likely be much worse. If you're invoking a sequence of kernels that only work with on-device memory, do a non-trivial amount of work, and require no host synchronisation, the overhead is negligible. Avoiding that overhead is the only reason I can see to try an überkernel, assuming that, as in your example, the sequence is static, or at least not dependent on the result of a preceding stage.

Putting everything in one kernel limits optimisation opportunities for the coder and the hardware as well as the compiler:

a) all tasks must share the same workgroup size/topology which is often a critical performance decision (as soon as you use shared memory you can't avoid it).

b) hardware parallelism is limited by the requirements of the most demanding task (i.e. most registers, shared memory)

c) the compiler seems to over-use registers when given a complex bit of code (as of a few months ago at least).

Even trying to tie relatively simple tasks together that directly share some memory accesses can be slower than two separate simpler kernels that communicate via global memory, if the combined kernel means less parallelism for the task that most needs it.
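
The two structures being compared might look like this in OpenCL C. This is a minimal sketch, not code from the thread: stage0, stage1, the float buffer, and the num_stages argument are hypothetical stand-ins for the device_function_X stages in the original question, and the shim at the top exists only so the same file can also be compiled as plain C for a quick host-side check.

```c
/* Shims so the sketch also builds as plain C. An OpenCL compiler defines
   __OPENCL_VERSION__ and skips this part. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
size_t fake_gid = 0; /* hypothetical: index of the simulated work-item */
size_t get_global_id(unsigned dim) { assert(dim == 0); return fake_gid; }
#endif

/* Hypothetical stage bodies standing in for device_function_0/1. */
float stage0(float x) { return x + 1.0f; }
float stage1(float x) { return x * 2.0f; }

/* Variant 1: uberkernel. One launch, every stage compiled into one kernel,
   so registers and work-group size are set by the most demanding stage. */
__kernel void my_uber_kernel(__global float *data, int num_stages)
{
    size_t i = get_global_id(0);
    for (int stage = 0; stage < num_stages; ++stage) {
        if (stage == 0)
            data[i] = stage0(data[i]);
        else if (stage == 1)
            data[i] = stage1(data[i]);
    }
}

/* Variant 2: one kernel per stage. The host enqueues them in order on the
   same buffer; intermediate results stay in global memory on the device. */
__kernel void stage0_kernel(__global float *data)
{
    size_t i = get_global_id(0);
    data[i] = stage0(data[i]);
}

__kernel void stage1_kernel(__global float *data)
{
    size_t i = get_global_id(0);
    data[i] = stage1(data[i]);
}
```

For a static two-stage sequence like this one, both variants leave the buffer in the same state. The difference is that the second lets each stage be compiled and launched with its own work-group size and register budget, at the cost of one extra kernel invocation.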

Long-running kernels also impact system interactivity (not sure about ng devices), which is often undesirable.

Sounds like a nightmare to debug too ...

thomasp · ‎02-03-2012

So your point is mainly: "better to use multiple kernels than an überkernel when each stage is independent"

Anyway, it does not seem hard to try each of the two solutions. As you said, the different stages are expected to be independent of each other, and no intermediate host-to-device memory transfer is needed.

Thank you for this input, I'll keep that in mind.


OpenCL, is there instruction limitations ?
