
akhal935
Journeyman III

Calling one kernel from another

Hello

I have code that calls functions from within functions. If I want to keep the same structure, is it possible to call another kernel (like a function) from within an OpenCL kernel, without going back to the host program? That way I wouldn't have to make new buffers and copy data back and forth between kernel buffers and host program memory.

0 Likes
15 Replies
antzrhere
Adept III

Why would you need to create new buffers and copy from host to device for a second kernel function? All you do is write to device global memory in your first kernel, then pass that same global memory to your second kernel as a pointer and read from it. Am I missing something?

But, answering your question: you can call other (non-kernel) functions from within an OpenCL kernel, as long as you don't call a function from within itself (recursion), which isn't permitted in OpenCL.
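A minimal sketch of that buffer-sharing pattern (kernel and buffer names are made up; the host creates one buffer and passes it to both kernels, so the intermediate results never leave the device):

    // OpenCL C: two kernels chained through the same global buffer.
    // "intermediate" is a hypothetical device buffer created once on the host.
    __kernel void stage1(__global const float *input,
                         __global float *intermediate)
    {
        size_t i = get_global_id(0);
        intermediate[i] = input[i] * 2.0f;   // write results to global memory
    }

    __kernel void stage2(__global const float *intermediate,
                         __global float *output)
    {
        size_t i = get_global_id(0);
        output[i] = intermediate[i] + 1.0f;  // read what stage1 wrote
    }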

0 Likes

Originally posted by: antzrhere Why would you need to create new buffers and copy from host to device for a second kernel function? All you do is write to device global memory in your first kernel, then pass that same global memory to your second kernel as a pointer and read from it. Am I missing something?

But, answering your question: you can call other (non-kernel) functions from within an OpenCL kernel, as long as you don't call a function from within itself (recursion), which isn't permitted in OpenCL.

Well, let's be realistic here: unless you're writing a lot of data to memory, you might as well pass the results as arguments to another function.  I have full faith that Micah has tried to pass as many parameters in registers as possible, rather than taking the performance hit of writing to memory only to read from it again.  Who knows, maybe the compiler even detects that and keeps the stuff in registers when it inlines everything (which I believe is the default behavior, to inline as much as possible)...

Of course, you could just take control, and ensure things are done the right way by going lower level and using CAL/IL ... (hears gasps, sighs, and sees people fainting....sorry for uttering such dirty words!)

0 Likes

Yes, quite right: if it's in register memory and there's enough to go round, it should stay there (although I've never looked). I just don't get where the host-to-device memory transfers he's talking about fit in?

0 Likes

Originally posted by: antzrhere Yes, quite right: if it's in register memory and there's enough to go round, it should stay there (although I've never looked). I just don't get where the host-to-device memory transfers he's talking about fit in?

I guess I should have addressed more carefully what I think the OP's real problem was: the OP didn't realize you can make function calls in OpenCL.  So if you need to "chain kernels", you can make a master kernel function that just calls two or more other functions.  The host-to-device transfers come into play if you call one kernel, retrieve the data, use it to populate the next kernel, call the next kernel, and then retrieve the data again.
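A sketch of that "master kernel" structure (function names are hypothetical; the compiler will typically inline these calls):

    // OpenCL C: one "master" kernel chaining plain device functions
    // instead of separate kernel launches.
    float phase1(float v) { return v * v; }
    float phase2(float v) { return v + 1.0f; }

    __kernel void master(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        float t = phase1(in[i]);   // first "chained" step
        out[i] = phase2(t);        // second step, no host round trip
    }

Note this only works if both phases can use the same work size and don't need synchronization across workgroups in between.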

Nowhere in the documentation does it say that memory will be maintained when a kernel exits.  In fact, if I read the power-management documentation correctly, it says exactly the opposite in a roundabout way.  It says they have the ability to shut down individual banks of RAM, at least on the 6990, so if a kernel exits and the RAM is powered down, its data will decay before the next kernel activates and fires that RAM up again.

I will also say, I know you can chain pixel shaders, but again, that is at the IL level (or perhaps HLSL?), where outputs from one shader become inputs to the next, and it's done with input and output registers.  I doubt that would work in OpenCL, maybe not even HLSL, but I got out of graphics programming before HLSL became popular... it was all assembly for me!


0 Likes

corry,

The OpenCL global memory buffers are persistent between kernel invocations.  In other words, listing the same OpenCL global buffer as an argument to a series of kernels is how you pass data from one kernel to another without having to move the data back to host memory.  My kernels use this feature extensively.

The LDS memory does go away between kernel invocations, and thus in effect only exists while a particular kernel is executing.
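On the host side, the chaining looks roughly like this (a sketch; it assumes the usual boilerplate — context ctx, command queue, built kernels kernelA/kernelB, cl_mem input/output, element count n — already exists, and error checking is omitted):

    cl_int err;
    size_t gsize = n;

    /* One device buffer shared by both kernels; its contents stay in
       GPU global memory between the two launches. */
    cl_mem scratch = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(cl_float), NULL, &err);

    /* Kernel A writes its results into the scratch buffer... */
    clSetKernelArg(kernelA, 0, sizeof(cl_mem), &input);
    clSetKernelArg(kernelA, 1, sizeof(cl_mem), &scratch);
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gsize, NULL, 0, NULL, NULL);

    /* ...and kernel B reads them from the very same buffer. */
    clSetKernelArg(kernelB, 0, sizeof(cl_mem), &scratch);
    clSetKernelArg(kernelB, 1, sizeof(cl_mem), &output);
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gsize, NULL, 0, NULL, NULL);

    /* Only the final result ever crosses back to the host. */
    clEnqueueReadBuffer(queue, output, CL_TRUE, 0, n * sizeof(cl_float),
                        host_out, 0, NULL, NULL);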

 

0 Likes

Originally posted by: timattox corry,

The OpenCL global memory buffers are persistent between kernel invocations.  In other words, listing the same OpenCL global buffer as an argument to a series of kernels is how you pass data from one kernel to another without having to move the data back to host memory.  My kernels use this feature extensively.

The LDS memory does go away between kernel invocations, and thus in effect only exists while a particular kernel is executing.

Interesting. I suppose the power management isn't that aggressive then, probably based on the context... Should be the same in IL, then...

Out of curiosity, why have multiple kernels?  I'll explain my reasoning below, but feel free to ignore it.  I won't be offended 🙂

I had thought about this option myself but, in the end, decided on a "cob" rather than a bunch of separate kernel invocations.  Even in the case where so much RAM is used that you could never hope to pass all parameters in registers, the "cob" option generally wins, since you don't incur the startup and shutdown penalties of launching individual kernels.  Unless AMD, NVIDIA, Intel, and whoever else have gotten together to allow multiple kernels on a compute device at once, with switching at will from the host system, you'll see better performance.

There are a few mantras at work here.  First off, when working with an external, non-native hardware/software interface (usually adopted for performance reasons), it's almost universally better to get into that interface and stay in it as long as possible.  DSPs at a low level and JNI at a high level follow this, and everything I've read leads me to believe GPUs are the same way.  Each has its own unique reasons for it, but it's still there.  In general, when working with a new external interface, I just follow the rule... haven't been disappointed yet 🙂

Second would be: never trust the optimizer (which probably makes some compiler authors' ears burn).  Spend some time with MSVC and/or GCC with all optimization options cranked, IDA, and some basic patterns you think would be optimized.  You'll be shocked, horrified, and possibly psychologically scarred for life 🙂

Last: at whatever level you choose to work (ASM, C, C++, Java, OpenCL, etc.), don't do things that make your code any less portable than that level already enforces, unless you're going to write multiple versions for other platforms.  OpenCL is quite portable, but has vendor-specific extensions.  More to the point here: if one vendor happens to accelerate multiple separate sequential kernel invocations but no one else does, you've just written a non-portable bottleneck into your code by relying on it.

0 Likes

Originally posted by: corry Out of curiosity, why have multiple kernels?  I'll explain my reasoning below, but feel free to ignore it.  I won't be offended 🙂

I use multiple kernels for a variety of reasons:

  1. Changes in parallelism width and shape between algorithm phases... thus I can't just continue with the same number of work item threads, or I can't continue with the same workgroup size/shape (e.g. 1-D, 2-D, 3-D).
  2. Synchronization of read/write data structures across work groups
  3. To avoid user-interface jumpiness if my kernel computes in one shot for too long, causing the X window display to stop updating.
  4. Some of my code seems to overwhelm the JIT compiler(s) if it's all in one kernel, causing it to generate less efficient code (spilling registers?) than if it gets the code in smaller bites.  I've tried adding -O3 to the kernel build parameters (see the sketch after this list), but it didn't seem to change things much, if at all.  This is where I really wish there were an option to tell the Catalyst JIT compiler: "Compile for as long as you like (really: minutes if you want, but not days); I'll be reusing this kernel a gazillion times on this specific GPU card."
  5. Possibly due to #4, I found that some GPU models (RV730, Turks) run my code faster when it's broken into "slices", and other GPU models (Cayman, Tesla cards) tend to prefer my code in fewer, bigger kernels.
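Re: item 4, passing options to the OpenCL JIT is just the following call (variable names are hypothetical; whether -O3 actually changes the generated code is up to the implementation):

    /* Hand build options to the runtime compiler. */
    clBuildProgram(program, 1, &device, "-O3", NULL, NULL);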

But I think we have digressed from the original thread topic...

0 Likes

Originally posted by: timattox I use multiple kernels for a variety of reasons:

But I think we have digressed from the original thread topic...

I have a bad habit of digressing 🙂 It's a problem, but in this case I think it's still somewhat relevant, since each coder has to decide for him/herself how to organize his/her code. This information gives the OP what he/she needs to make that decision; perhaps in his/her case it would be better not to switch back to the CPU. I know in my case I'll be looking at how the IL compiler handles chunks at a later date, when this thing becomes a little more complicated/nebulous. For now, it's far better to stay on the GPU for multiple seconds, but yeah, I still have to figure out our use-case in terms of GPU responsiveness. I do feel your optimizer pain, though: when you generate everything in chunks of 4 for the VLIW4 and the compiler decides it knows better and slices things up, or when you put things in a good ordering to use the PV register and it reorders and uses a crapload more registers... but that's really going off topic 🙂

 

0 Likes

My actual problem is that my serial code calls a function from main(), and that primary function then calls two other functions many times in loops. My primary concern is not to parallelize that primary (first-level) function but to parallelize the second-level functions that I call from it.

I could keep that primary function as it is and make kernels out of the second-level functions; but the problem is that those second-level functions are called with different index positions into the arrays each time, i.e.

calling side: <secondary-level-function>(a[], &b[2*n], &c[2*m], ...);

secondary-level-function prototype: function(int x[], int y[], int z[], ...);

So when I make OpenCL buffers for a[], b[], c[], ... it isn't possible to pass those buffers starting from somewhere in the middle on each successive kernel launch...

That's why I decided to make even the primary-level function a kernel, and have that kernel call the other kernels (the second-level functions), and so on... But then I wonder: how should I limit my primary kernel to only one main thread, and spawn threads only inside the second-level kernels?

0 Likes

Sounds like you've spent time in the multithreaded CPU world...

With no understanding of your problem, this sounds to me like a square-peg-in-a-round-hole problem.

Either you have to figure out how to massively parallelize that first task, or you need to have the host do the serial code and dispatch the results to the parallel system.

In GPGPU there is no concept of spawning threads.  The processor runs one set of code on all processors, so in general, your thread dispatcher is where you insert the GPGPU code.

As I said, though, that's all "in general"; there can be very good reasons not to, but I suspect it will be easier that way, and perhaps even faster, since the CPU is usually 2-3 GHz and the GPU is usually 400 MHz to 1.5 GHz or so.  If what you have is truly serial, so that there is only one thread on the GPU, the performance penalty should be obvious.

0 Likes

An alternative understanding of your problem is that you have this:

for (some m and n):

  call something(a, b[2n], c[2m])

  call somethingelse(x, y, z)

 

and you want to pass b[2n] as a pointer but can't, because it's simply a buffer? If that's the case, then just pass 2n as a parameter to the kernel. Set a, b, and c once outside the loop, update 2n and 2m on each iteration of the outer loop, and enqueue the kernels. The overhead of the launches shouldn't be too large, because you should be able to do all those enqueues without ever waiting for the GPU to do anything; you can do one wait on events after the big loop.
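A sketch of that structure (kernel, buffer, and loop-bound names are hypothetical):

    /* Buffers are set once, outside the loop. */
    clSetKernelArg(kern, 0, sizeof(cl_mem), &a_buf);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &b_buf);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &c_buf);

    for (int j = 0; j < iterations; j++) {
        /* Only the scalar offsets change per iteration... */
        cl_int off_b = 2 * n, off_c = 2 * m;
        clSetKernelArg(kern, 3, sizeof(cl_int), &off_b);
        clSetKernelArg(kern, 4, sizeof(cl_int), &off_c);
        /* ...and the enqueue returns without waiting for the GPU. */
        clEnqueueNDRangeKernel(queue, kern, 1, NULL, &gsize, NULL, 0, NULL, NULL);
        /* update n and m for the next pass here */
    }
    clFinish(queue);   /* one wait, after the whole loop */

Inside the kernel you would then read b_buf[off_b + i] wherever the serial code read (&b[2n])[i].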

As corry says, it's hard to tell precisely what you're trying to do. Can you write the entire loop and two called functions in simple pseudo code so we can get a better idea of what you're trying to achieve? At the moment you can't launch kernels from other kernels because of the way the GPU driver and pipeline work. If you can make your outer loop the parallel one and the inner loops dependent directly on that you could just use two function calls, but it depends entirely on the data access patterns and dependencies of the two called functions.

0 Likes

Thanks LeeHowes; I am considering your earlier comments, but to make it clearer to you, my serial code structure is like this:

  tgle = 1;
  steps(n, mj, x, &x[(n/2)*2], y, &y[mj*2]);      /* first pass: x -> y */
  for(j = 0; j < m; j++)
    {
      mj = mj*2;
      if(tgle)
        {
          steps(n, mj, y, &y[(n/2)*2], x, &x[mj*2]);   /* ping: y -> x */
          tgle = 0;
        }
      else
        {
          steps(n, mj, x, &x[(n/2)*2], y, &y[mj*2]);   /* pong: x -> y */
          tgle = 1;
        }
    }

  mj = n/2;
  steps(n, mj, x, &x[(n/2)*2], y, &y[mj*2]);      /* final pass */

 

and the steps() function is the heavy one that I hope to parallelize. So should I keep this looping structure on the host side, and keep enqueuing the steps() kernel while sending "n" and "mj" along, so that I can index the buffers correctly inside the kernel?
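That is essentially the structure LeeHowes described above. A hedged host-side sketch of the same loop (the enqueue_steps helper and the kernel's argument order are hypothetical; it assumes steps() is rewritten as a kernel that takes integer offsets where the serial code passed &x[...] pointers, so the kernel reads src[src_off + i] wherever the serial code read (&x[src_off])[i]):

    /* Hypothetical helper: binds the arguments and launches one pass. */
    static void enqueue_steps(cl_command_queue q, cl_kernel k, cl_int n, cl_int mj,
                              cl_mem src, cl_int src_off, cl_mem dst, cl_int dst_off)
    {
        size_t gsize = (size_t)(n / 2);  /* one plausible choice; depends on
                                            how steps() is parallelized */
        clSetKernelArg(k, 0, sizeof(cl_int), &n);
        clSetKernelArg(k, 1, sizeof(cl_int), &mj);
        clSetKernelArg(k, 2, sizeof(cl_mem), &src);
        clSetKernelArg(k, 3, sizeof(cl_int), &src_off);
        clSetKernelArg(k, 4, sizeof(cl_mem), &dst);
        clSetKernelArg(k, 5, sizeof(cl_int), &dst_off);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    }

    /* The toggle loop stays on the host; x_buf and y_buf stay on the GPU. */
    tgle = 1;
    enqueue_steps(queue, kern, n, mj, x_buf, (n/2)*2, y_buf, mj*2);
    for (j = 0; j < m; j++) {
        mj = mj*2;
        if (tgle) { enqueue_steps(queue, kern, n, mj, y_buf, (n/2)*2, x_buf, mj*2); tgle = 0; }
        else      { enqueue_steps(queue, kern, n, mj, x_buf, (n/2)*2, y_buf, mj*2); tgle = 1; }
    }
    mj = n/2;
    enqueue_steps(queue, kern, n, mj, x_buf, (n/2)*2, y_buf, mj*2);
    clFinish(queue);   /* one wait at the end, as LeeHowes described */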

0 Likes

Interesting. I suppose the power management isn't that aggressive then, probably based on the context... Should be the same in IL, then...


Remember OpenCL is an abstraction.  How the hardware implements it is not necessarily related to how the programmer sees it.  For all you know as a programmer, it could copy the data back to/from CPU memory and turn off the GPU when not in use...

The abstraction is just that the global memory is persistent between kernel invocations and ready for them when they run.

 

 

0 Likes

Originally posted by: notzed The abstraction is just that the global memory is persistent between kernel invocations and ready for them when they run.


Funny thing is that I have found GPU global memory buffers can be too persistent.  One day I was surprised: I thought I had been making a series of good improvements to my kernel, when in fact sometime earlier in the day I had broken it so badly that it wasn't writing anything into the results buffer... but the buffer still had the good data from an earlier run sitting in DRAM.  For debugging/development sessions, I've now had to add a code phase that clobbers the contents of my results buffers at the beginning of the run, to make sure my "correct results" check at the end is looking at freshly created data, and not leftover "correct" data from the previous version of a kernel that ran moments before.

On a side note, this means that if you are doing cryptography or other sensitive work on a GPU, you need to make sure you write over any data buffers that you don't want any other code to see, once you are done with them, of course.  There is no zeroing of newly allocated memory by an OS on the GPU like we are used to in POSIX CPU-land.
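If your runtime is OpenCL 1.2 or newer, clEnqueueFillBuffer does both jobs — the debug clobbering and the security scrub — in one call (on older runtimes, a trivial fill kernel or a clEnqueueWriteBuffer of zeros works too). A sketch, with hypothetical buffer and size names:

    /* Overwrite a device buffer with a known pattern (OpenCL >= 1.2).
       Use recognizable junk for debugging, or zeros for scrubbing. */
    cl_uint pattern = 0xDEADBEEF;
    clEnqueueFillBuffer(queue, results, &pattern, sizeof(pattern),
                        0, buf_size_bytes, 0, NULL, NULL);
    clFinish(queue);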

0 Likes

Originally posted by: notzed
Interesting. I suppose the power management isn't that aggressive then, probably based on the context... Should be the same in IL, then...


Remember OpenCL is an abstraction.  How the hardware implements it is not necessarily related to how the programmer sees it.  For all you know as a programmer, it could copy the data back to/from CPU memory and turn off the GPU when not in use...

The abstraction is just that the global memory is persistent between kernel invocations and ready for them when they run.

Don't get me started on high-level developers assuming everything's a black box, abstracted away, fixed for you, impossible to mess up, etc. The fact is, on a CPU for example, you can write ***** poor code that thrashes the cache, the TLB, and the branch-prediction history; that doesn't let out-of-order execution help much; that makes no use of coprocessors, multiprocessors, hardware thread processors, or instruction fusing; and that is generally a steaming pile of crap, because you assumed the hardware and software abstractions and optimizers fixed everything for you... even hardware can't polish a software turd...

That said, sure, in this case there are workarounds that follow from understanding the mechanisms at work, such as the one timattox posted, but you have to understand there *is* a problem/feature before you can work around/with it...

 

Like I said....don't get me started...

0 Likes