akhal935
Journeyman III

Calling one kernel from another

Hello

I have code that calls functions from within functions. I would like to keep the same structure, so is it possible for an OpenCL kernel to call another kernel, the way one function calls another, without going back to the host program? That way I wouldn't have to create new buffers and copy data back and forth between kernel buffers and host memory.

0 Likes
15 Replies
antzrhere
Adept III

Why would you need to create new buffers and copy from Host to device for a second kernel function? All you do is, in your first kernel, write to device global memory and in your second kernel pass that same global memory as a pointer and read from it. Am I missing something?

But, answering your question, you can call other (non-kernel) functions from within an OpenCL kernel, as long as no function ends up calling itself (recursion), which isn't permitted in OpenCL.
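A minimal sketch of that idea, assuming hypothetical names (`stage_one`, `stage_two`, `chained`): a kernel calls two ordinary, non-recursive functions and hands the intermediate result over as an argument, so nothing goes back to the host in between. The include, the shim lines at the top, and `emu_launch_chained` are not OpenCL; they are assumptions added only so the sketch also builds as plain C without an OpenCL runtime.

```c
#include <stddef.h>

/* Host-only shims (assumption): drop the include and these lines in a real .cl file. */
#define __kernel
#define __global
static size_t emu_gid;  /* emulated work-item index */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }

/* Ordinary helper functions: callable from kernels, must not recurse. */
float stage_one(float x) { return x * x; }
float stage_two(float y) { return y + 1.0f; }

/* Kernel entry point: one work-item per element. */
__kernel void chained(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    float tmp = stage_one(in[i]);  /* intermediate stays in a private variable */
    out[i] = stage_two(tmp);
}

/* Host-only emulation of a 1-D launch over n work-items (hypothetical helper). */
static void emu_launch_chained(const float *in, float *out, size_t n)
{
    for (emu_gid = 0; emu_gid < n; ++emu_gid)
        chained(in, out);
}
```

On a real device the host would enqueue `chained` once; the two stages run back to back inside each work-item.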

0 Likes
corry
Adept III

Originally posted by: antzrhere Why would you need to create new buffers and copy from Host to device for a second kernel function? All you do is, in your first kernel, write to device global memory and in your second kernel pass that same global memory as a pointer and read from it. Am I missing something?

But, answering your question, you can call other (non-kernel) functions from within an OpenCL kernel, as long as no function ends up calling itself (recursion), which isn't permitted in OpenCL.

Well, let's be realistic here: unless you're writing a lot of data to memory, you might as well pass the results as arguments to another function. I have full faith in Micah to have tried to pass as many parameters in registers as possible, rather than taking the performance hit of writing to memory only to read from it again. Who knows, maybe the compiler even detects that and keeps the stuff in registers when it inlines everything (which I believe is the default behavior, to inline as much as possible)...

Of course, you could just take control, and ensure things are done the right way by going lower level and using CAL/IL ... (hears gasps, sighs, and sees people fainting....sorry for uttering such dirty words!)

0 Likes
antzrhere
Adept III

Yes, quite right, if it's in register memory and there's enough to go round, it should stay there (although I've never looked). I just don't get where the host-to-device memory transfers he's talking about fit in?

0 Likes
corry
Adept III

Originally posted by: antzrhere Yes, quite right, if it's in register memory and there's enough to go round, it should stay there (although I've never looked). I just don't get where the host-to-device memory transfers he's talking about fit in?

I guess I should have addressed more carefully what I think the OP's real problem was.  That is that the OP didn't realize you can make function calls in OpenCL.  So if you need to "chain kernels" you can make a master kernel function that just calls 2+ other functions.  The host to device transfers come into play if you simply call one kernel, retrieve the data, use it to populate the next kernel, call the next kernel, then retrieve the data again.

Nowhere in the documentation does it say that memory will be maintained when the kernel exits. In fact, if I read the power management documentation correctly, it says exactly the opposite in a roundabout way. It said they have the ability to shut down individual banks of RAM, at least on the 6990, so if a kernel exits and the RAM is powered down, its data will decay before the next kernel activates and fires up that RAM again.

I will also say, I know with pixel shaders you can chain shaders, but again, that is at an IL level (or perhaps HLSL?) where outputs from one shader become inputs to the next, and it's done with input and output registers. I doubt that would work in OpenCL, and maybe not in HLSL either, but I got out of graphics programming before HLSL became popular... It was all assembly for me!

0 Likes
timattox
Adept I

corry,

The OpenCL global memory buffers are persistent between kernel invocations.  In other words, listing the same OpenCL global buffer as an argument to a series of kernels is how you pass data from one kernel to another without having to move the data back to the host memory.  My kernels use this feature extensively.

The LDS memory does go away between kernel invocations, and thus in effect only exists while a particular kernel is executing.
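As a rough illustration of passing one global buffer through a series of kernels, here is a sketch with two hypothetical kernels, `fill_squares` and `add_neighbor`. The second kernel reads values written by other work-items of the first, which is why they are separate launches. The shims and `emu_run_two_kernels` are assumptions that emulate the launches in plain C; in real host code the same `cl_mem` handle would be set with `clSetKernelArg` for both kernels before each `clEnqueueNDRangeKernel`.

```c
#include <stddef.h>

/* Host-only shims (assumption): drop the include and these lines in a real .cl file. */
#define __kernel
#define __global
static size_t emu_gid;  /* emulated work-item index */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }

/* Kernel 1: each work-item writes one element of the shared global buffer. */
__kernel void fill_squares(__global int *buf)
{
    size_t i = get_global_id(0);
    buf[i] = (int)(i * i);
}

/* Kernel 2: reads its own element and a neighbour's, both written by kernel 1. */
__kernel void add_neighbor(__global const int *buf, __global int *out, unsigned n)
{
    size_t i = get_global_id(0);
    out[i] = buf[i] + buf[(i + 1) % n];
}

/* Host-only emulation of two back-to-back launches on the same buffer. */
static void emu_run_two_kernels(int *buf, int *out, unsigned n)
{
    for (emu_gid = 0; emu_gid < n; ++emu_gid)  /* launch 1 */
        fill_squares(buf);

    /* No read-back or re-upload of buf between the launches. */
    for (emu_gid = 0; emu_gid < n; ++emu_gid)  /* launch 2 */
        add_neighbor(buf, out, n);
}
```

Only `out` would need to be read back to the host at the end, for example with `clEnqueueReadBuffer`.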

 

0 Likes
corry
Adept III

Originally posted by: timattox corry,

The OpenCL global memory buffers are persistent between kernel invocations.  In other words, listing the same OpenCL global buffer as an argument to a series of kernels is how you pass data from one kernel to another without having to move the data back to the host memory.  My kernels use this feature extensively.

The LDS memory does go away between kernel invocations, and thus in effect only exists while a particular kernel is executing.

Interesting. I suppose the power management isn't that aggressive then, probably based on the context... It should be the same in IL then...

Out of curiosity, why have multiple kernels? I'll explain my reasoning below, but feel free to ignore it. I won't be offended 🙂

I had thought about this option myself, but in the end decided on a "cob" rather than a bunch of separate kernel invocations. Even in the case where so much RAM is used that you could never hope to pass all parameters in registers, the "cob" option generally wins, since you don't incur the startup and shutdown penalties of launching individual kernels. Unless AMD, nVidia, Intel, and whoever else have gotten together to allow multiple kernels on a compute device at once, and to switch between them at will from the host system, you'll see better performance.

There are a few mantras at work here... First off, when working with an external, non-native hardware/software interface (usually for performance reasons), it's almost universally better to get into that interface and stay in it as long as possible. DSPs at a low level and JNI at a high level follow this, and everything I've read leads me to believe GPUs are the same way. Each has its own unique reasons for it, but it's still there. In general, when working with a new external interface, I just follow the rule... haven't been disappointed yet 🙂

Second would be to never trust the optimizer (which probably makes some compiler authors' ears burn). Spend some time with MSVC and/or GCC with all optimization options cranked, IDA, and some basic patterns you think would be optimized. You'll be shocked, horrified, and possibly psychologically scarred for life 🙂

Last is that, at whatever level you choose to work (ASM, C, C++, Java, OpenCL, etc.), don't do things that make your code any less portable than that level already enforces, unless you're going to write multiple versions for other platforms. OpenCL is quite portable, but has vendor-specific extensions. More to the point, if one vendor happens to accelerate multiple separate sequential kernel invocations but no one else does, you've just written a non-portable bottleneck into your code by using it.

0 Likes
timattox
Adept I

Originally posted by: corry Out of curiosity, why have multiple kernels? I'll explain my reasoning below, but feel free to ignore it. I won't be offended 🙂

I use multiple kernels for a variety of reasons:

  1. Changes in parallelism width and shape between algorithm phases... thus I can't just continue with the same number of work item threads, or I can't continue with the same workgroup size/shape (e.g. 1-D, 2-D, 3-D).
  2. Synchronization of read/write data structures across work groups
  3. To avoid user interface jumpiness if my kernel computes in one shot for too long, causing the X window display to stop updating.
  4. Some of my code seems to overwhelm the JIT compiler(s) if it's all in one kernel, causing it to generate less efficient code (spilling registers?) than if it gets it in smaller bites. I've tried adding -O3 to the kernel build parameters, but it didn't seem to change things much, if at all. This is where I really wish there was an option to say to the Catalyst JIT compiler: "Compile for as long as you like (really: minutes if you want, but not days), I'll be reusing this kernel a gazillion times on this specific GPU card."
  5. Possibly due to #4, I found some GPU models (RV730, Turks) run my code faster when it's broken into "slices", and other GPU models (Cayman, Tesla cards) tend to prefer my code to be in fewer, bigger kernels.
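Reasons 1 and 2 can be sketched with a two-phase sum, assuming hypothetical kernels `partial_sums` and `final_sum`: the first launch uses one work-item per chunk, the second uses a single work-item, and the launch boundary stands in for the cross-work-group synchronization. The shims and `emu_reduce` are assumptions so the sketch builds as plain C; they are not part of OpenCL.

```c
#include <stddef.h>

/* Host-only shims (assumption): drop the include and these lines in a real .cl file. */
#define __kernel
#define __global
static size_t emu_gid;  /* emulated work-item index */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }

/* Phase 1: one work-item per chunk writes that chunk's partial sum. */
__kernel void partial_sums(__global const int *in, __global int *partial,
                           unsigned chunk)
{
    size_t g = get_global_id(0);
    int acc = 0;
    for (unsigned k = 0; k < chunk; ++k)
        acc += in[g * chunk + k];
    partial[g] = acc;
}

/* Phase 2: much narrower; only work-item 0 combines the partial sums. */
__kernel void final_sum(__global const int *partial, __global int *result,
                        unsigned count)
{
    if (get_global_id(0) != 0)
        return;  /* guard in case the launch has more than one work-item */
    int acc = 0;
    for (unsigned k = 0; k < count; ++k)
        acc += partial[k];
    result[0] = acc;
}

/* Host-only emulation: two launches with different widths. */
static int emu_reduce(const int *in, int *partial, unsigned groups, unsigned chunk)
{
    int result = 0;
    for (emu_gid = 0; emu_gid < groups; ++emu_gid)  /* launch 1: `groups` work-items */
        partial_sums(in, partial, chunk);
    emu_gid = 0;                                    /* launch 2: a single work-item */
    final_sum(partial, &result, groups);
    return result;
}
```

The second phase only starts after every partial sum has been written, which a single kernel could not guarantee across work-groups.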

But I think we have digressed from the original thread topic...

0 Likes
corry
Adept III

Originally posted by: timattox I use multiple kernels for a variety of reasons:

But I think we have digressed from the original thread topic...

I have a bad habit of digressing 🙂 It's a problem, but in this case I think it's still somewhat relevant, since each coder has to decide for him/herself how to organize his/her code. This gives the OP the information to make that decision; perhaps in his/her case it would be better not to switch back to the CPU. In my case, I'll be looking at how the IL compiler handles chunks at a later date, when this thing becomes a little more complicated/nebulous. For now, it's far better to stay on the GPU for multiple seconds, but yeah, I have to figure out our use case for GPU responsiveness... I do feel your optimizer pain though: you generate everything in chunks of 4 for the VLIW4 and the compiler decides it knows better and slices things up, or you put things in a good order to use PVs and it reorders them and uses a crapload more registers... but that's really going off topic 🙂

0 Likes
akhal935
Journeyman III

My actual problem is that my serial code calls a function from main(), and that primary function then calls two other functions many times in loops. My main concern is not to parallelize the primary (first-level) function but to parallelize the second-level functions that I call from it.

Normally I could keep the primary function as it is and make kernels out of the second-level functions. The problem is that those second-level functions are called with a different index position into the arrays each time, i.e.

calling side : <secondary-level-function>(a[], &b[2*n], &c[2*m]..);

secondary-level-function prototype: function(int x[], int y[], int z[]...);

So when I make OpenCL buffers of a[], b[], c[]..., it's not possible to pass those buffers starting from somewhere in the middle on every next round of the kernel...

That's why I decided to make even the primary-level function a kernel, which should then call other kernels (the second-level functions), and so on... But I wonder how I should limit my primary kernel to only one main thread and spawn threads only inside the second-level kernels?
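One possible way to express the `&b[2*n]` style of call, sketched under assumptions: the kernel receives whole buffers plus explicit element offsets (`b_off`, `c_off`), and the primary loop stays on the host. The names `secondary`, `primary_on_host` and `emu_launch_secondary` are hypothetical, both offsets use the same `2*n` only for brevity, and the shims exist only so the sketch builds as plain C.

```c
#include <stddef.h>

/* Host-only shims (assumption): drop the include and these lines in a real .cl file. */
#define __kernel
#define __global
static size_t emu_gid;  /* emulated work-item index */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }

/* Second-level function as a kernel: whole buffers plus element offsets. */
__kernel void secondary(__global const int *a, __global const int *b,
                        __global int *c, unsigned b_off, unsigned c_off)
{
    size_t i = get_global_id(0);
    __global const int *y = b + b_off;  /* plays the role of &b[2*n] */
    __global int *z = c + c_off;        /* plays the role of &c[2*m] */
    z[i] = a[i] + y[i];
}

/* Host-only emulation of one launch over `width` work-items. */
static void emu_launch_secondary(const int *a, const int *b, int *c,
                                 unsigned b_off, unsigned c_off, size_t width)
{
    for (emu_gid = 0; emu_gid < width; ++emu_gid)
        secondary(a, b, c, b_off, c_off);
}

/* Primary function kept on the host: it loops and launches with new offsets.
   Real host code would update the two offset arguments with clSetKernelArg
   and then call clEnqueueNDRangeKernel; the buffers are not copied again. */
static void primary_on_host(const int *a, const int *b, int *c,
                            unsigned rounds, size_t width)
{
    for (unsigned n = 0; n < rounds; ++n)
        emu_launch_secondary(a, b, c, 2 * n, 2 * n, width);
}
```

With this layout the primary function never becomes a kernel, so there is no single-thread kernel to manage. OpenCL 1.1 also provides `clCreateSubBuffer`, which may serve a similar purpose, subject to the device's alignment rules.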

0 Likes