12 Replies Latest reply on Feb 3, 2012 2:08 AM by thomasp

    OpenCL, is there instruction limitations ?




      I plan to write some OpenCL code using the überkernel pattern.

      That is, a given kernel would have this structure:


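      __kernel void my_uber_kernel(void)
      {
           int stage = 0 ;

           while (stage < STAGE_COUNT)   // STAGE_COUNT: placeholder for the number of stages
           {
                if (stage == 0)
                {
                     device_function_0() ;
                } else
                if (stage == 1)
                {
                     device_function_1() ;
                }
                // etc...

                stage = stage + 1 ;
           }
      }
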
      Each device_function_X() potentially contains a substantial amount of code.

      I'm wondering whether there are known limits on the number of instructions supported (per thread?) before performance is affected.


      Does splitting the processing into small device function calls help optimization?

      Or do I have to split the processing into several kernel calls (so that the device_function_X functions above become kernels)?
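
To make the two alternatives in the question concrete, here is a minimal sketch in plain C that emulates both structures on the CPU. It is not OpenCL host or device code: the loop over `gid` stands in for the NDRange, and the bodies of `device_function_0` and `device_function_1` are hypothetical placeholders, since the thread does not show them. The helper names `run_uber_kernel`, `run_split_kernels`, `stage_0_kernel` and `stage_1_kernel` are also made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device functions; the real bodies are not shown. */
static float device_function_0(float x) { return x + 1.0f; }
static float device_function_1(float x) { return x * 2.0f; }

/* Option 1, uber-kernel: a single launch in which each work-item walks every stage.
   The outer loop over gid emulates the NDRange. */
void run_uber_kernel(float *data, size_t n)
{
    assert(data != NULL);
    for (size_t gid = 0; gid < n; ++gid) {
        int stage = 0;
        while (stage < 2) {
            if (stage == 0) {
                data[gid] = device_function_0(data[gid]);
            } else if (stage == 1) {
                data[gid] = device_function_1(data[gid]);
            }
            stage = stage + 1;
        }
    }
}

/* Option 2, split kernels: each former device function becomes its own kernel body. */
static void stage_0_kernel(float *data, size_t gid) { data[gid] = device_function_0(data[gid]); }
static void stage_1_kernel(float *data, size_t gid) { data[gid] = device_function_1(data[gid]); }

/* One emulated launch per stage; stage 1 starts only after stage 0 has
   finished for all work-items, with results passed through the data buffer. */
void run_split_kernels(float *data, size_t n)
{
    assert(data != NULL);
    for (size_t gid = 0; gid < n; ++gid) stage_0_kernel(data, gid);
    for (size_t gid = 0; gid < n; ++gid) stage_1_kernel(data, gid);
}
```

Both functions produce the same results when each work-item only touches its own element. The difference that matters on a GPU is that the split version lets each stage be compiled and launched separately, while the fused version is compiled as one body.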

        • Re: OpenCL, is there instruction limitations ?

          There is a limit on code size, but it is program-dependent: on the GPU everything gets inlined, which can massively inflate what would otherwise be a relatively small program with lots of function calls.

          • Re: OpenCL, is there instruction limitations ?

            The GPU has a program cache. If I'm not mistaken, it is 48 KB in size, so any code that is bigger than this limit slows down.

            There was an older post with data from benchmarking kernels of different sizes.

            You can check the exact code size by looking at the ISA for your kernel (it is at the end of the ISA listing).

              • Re: OpenCL, is there instruction limitations ?
                MicahVillmow, although tagged as "AMD", is talking about MB (megabytes?) of source code, whereas you and jeff_golds mention 48 KB / 70,000 ISA instructions.


                The answers differ by orders of magnitude!


                How much does performance degrade when the total number of instructions overflows the program cache (by a factor of 2, 4, ...)? Are there known tests on this point?

                  • Re: OpenCL, is there instruction limitations ?

                    I think your question was ambiguous. You asked for the program size limit (the maximum possible), and Micah answered that: you can have really huge kernels, and in practice I doubt it's possible to hit this limit. But I thought you might actually want to know when kernel size starts to carry a performance penalty, and that's why I posted my answer.

                    When the GPU doesn't find kernel code in the cache, it has to load it from global memory, and global memory is orders of magnitude slower than the cache. You also pay the penalty for the cache miss itself.

                    You can find the post with the benchmark here

                • Re: OpenCL, is there instruction limitations ?

                  Although it depends a bit on the specifics, in general I don't think you want to do this.


                  The overhead of invoking kernels is quite high, but the penalty for a poorly executing kernel will likely be much worse.  If you're invoking a sequence of kernels which only work with on-device memory, do a non-trivial amount of work, and require no host synchronisation, the overhead is negligible.  Avoiding that overhead is the only possible reason I can see to try an überkernel, assuming that, as in your example, the sequence is static, or at least not dependent on the result from a preceding stage.
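
The "negligible overhead" argument above can be put as simple arithmetic. The following is a rough sketch in plain C, not a measurement: `launch_overhead_us` and the per-kernel work times are hypothetical inputs that would have to be measured on the target system, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of total run time spent on launch overhead for a chain of kernels.
   launch_overhead_us: assumed fixed cost per kernel launch, in microseconds.
   work_us[i]:         assumed execution time of kernel i, in microseconds.
   Both are hypothetical inputs, not figures taken from the thread. */
double launch_overhead_fraction(double launch_overhead_us,
                                const double *work_us,
                                size_t kernel_count)
{
    assert(work_us != NULL && kernel_count > 0);

    double overhead = launch_overhead_us * (double)kernel_count;
    double work = 0.0;
    for (size_t i = 0; i < kernel_count; ++i) {
        work += work_us[i];
    }

    assert(overhead + work > 0.0);
    return overhead / (overhead + work);
}
```

For example, with an assumed 10 µs per launch and two kernels doing 490 µs and 1490 µs of work, the overhead is 20 µs out of 2000 µs, i.e. 1% of the total. The fraction only becomes significant when each kernel does very little work.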


                  Putting everything in one kernel limits optimisation opportunities for the coder and the hardware as well as the compiler:


                  a) all tasks must share the same workgroup size/topology, which is often a critical performance decision (as soon as you use shared memory you can't avoid it).

                  b) hardware parallelism is limited by the requirements of the most demanding task (i.e. most registers, shared memory)

                  c) the compiler seems to over-use registers when given a complex bit of code (as of a few months ago at least).
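
Point b) can be illustrated with a simplified resource calculation. This is a sketch in plain C under loud assumptions: it models only registers, treats the register file as a flat per-compute-unit budget, and ignores shared memory and the wavefront granularity that real hardware applies. The function names and all numbers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Registers a fused kernel must reserve per work-item: the maximum over its
   stages, because the allocation is made once for the whole kernel. */
unsigned fused_registers(const unsigned *stage_registers, size_t stage_count)
{
    assert(stage_registers != NULL && stage_count > 0);

    unsigned worst = 0;
    for (size_t i = 0; i < stage_count; ++i) {
        if (stage_registers[i] > worst) {
            worst = stage_registers[i];
        }
    }
    return worst;
}

/* Work-items that can be resident at once under a given register budget
   (simplified model: registers are the only limiting resource). */
unsigned resident_items(unsigned register_budget, unsigned registers_per_item)
{
    assert(registers_per_item > 0);
    return register_budget / registers_per_item;
}
```

With a hypothetical budget of 16384 registers and two stages needing 16 and 64 registers per work-item, the fused kernel reserves 64 registers and fits 256 work-items at a time, whereas the lighter stage launched as its own kernel would fit 1024.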


                  Even trying to tie relatively simple tasks together that directly share some memory accesses can be slower than two separate simpler kernels that communicate via global memory, if the combined kernel means less parallelism for the task that most needs it.


                  Long-running kernels also hurt system interactivity (not sure about ng devices), which is often undesirable.


                  Sounds like a nightmare to debug too ...

                    • Re: OpenCL, is there instruction limitations ?

                      So your point is mainly: "better to use multiple kernels than an überkernel when each stage is independent".


                      Anyway, it does not seem hard to try both solutions. As you said, the different stages are expected to be independent of each other, and no intermediate host-to-device memory transfer is needed.

                      Thank you for this input, I'll keep that in mind.