
corry
Adept III

Calling one kernel from another

Sounds like you've spent time in the multithreaded CPU world...

With no understanding of your problem, this sounds to me like a square-peg-in-a-round-hole problem.

Either you have to figure out how to massively parallelize that first task, or you need to have the host run the serial code and dispatch the results to the parallel system.

In GPGPU there is no concept of spawning threads. The processor runs one set of code on all processors, so in general your thread dispatcher is where you insert the GPGPU code.

As I said though, that's all "in general" -- there can be very good reasons not to, but I suspect it will be easier that way, and perhaps even faster, since the CPU is usually 2-3 GHz and the GPU is usually 400 MHz to 1.5 GHz or so. If what you have is truly serial, so that there is only one thread on the GPU, the performance penalty should be obvious.

LeeHowes
Staff

Calling one kernel from another

An alternative understanding of your problem is that you have this:

for( some m and n ) {

  call something(a, b[2n], c[2m])

  call somethingelse(x, y, z)

}

and you want to pass b[2n] as a pointer but can't because it's simply a buffer? If that's the case then just pass 2n as a parameter to the kernel. Set a, b and c once outside the loop, update 2n and 2m on each iteration of the outer loop and enqueue the kernels. The overhead of the launches shouldn't be too large because you should be able to do all those enqueues without ever waiting for the GPU to do anything. You can do one wait on events after the big loop.

As corry says, it's hard to tell precisely what you're trying to do. Can you write the entire loop and two called functions in simple pseudo code so we can get a better idea of what you're trying to achieve? At the moment you can't launch kernels from other kernels because of the way the GPU driver and pipeline work. If you can make your outer loop the parallel one and the inner loops dependent directly on that you could just use two function calls, but it depends entirely on the data access patterns and dependencies of the two called functions.

notzed
Challenger

Calling one kernel from another

Interesting, I suppose then the power management isn't that aggressive, probably based on the context then... Should be the same in IL then...


Remember OpenCL is an abstraction. How the hardware implements it is not necessarily related to how the programmer sees it. It could copy the data back to/from CPU memory and turn off the GPU when not being used, for all you know as a programmer...

The abstraction is just that the global memory is persistent between kernel invocations and ready for them when they run.

timattox
Adept I

Calling one kernel from another

Originally posted by: notzed
The abstraction is just that the global memory is persistent between kernel invocations and ready for them when they run.


Funny thing is that I have found that GPU global memory buffers can be too persistent. One day I was surprised: I thought I had been making a series of good improvements to my kernel, when in fact sometime earlier in the day I had broken it so badly that it wasn't writing anything into the results buffer... but the buffer still had the good data from an earlier run sitting in the DRAM. For debugging/development sessions, I've now had to add a code phase that clobbers the contents of my results buffers at the beginning of the run, to make sure my "correct results" check at the end is looking at freshly created data, and not leftover "correct" data from the previous version of a kernel that ran moments before.

On a side note, this means that if you are doing cryptography or other sensitive work on a GPU, you need to make sure you overwrite any data buffers that you don't want any other code to see, once you are done with them of course. There is no zeroing of newly allocated memory by an OS on the GPU like we are used to in POSIX CPU-land.

corry
Adept III

Calling one kernel from another

Originally posted by: notzed
Interesting, I suppose then the power management isn't that aggressive probably based on the context then...Should be the same in IL then...


Remember opencl is an abstraction. How the hardware implements it is not necessarily related to how the programmer sees it. It could copy the data back to/from cpu memory and turn off the gpu when not being used for all you know as a programmer ...

The abstraction is just that the global memory is persistent between kernel invocations and ready for them when they run.

Don't get me started on high-level developers assuming everything's a black box, that the abstraction fixes everything, that you can't mess it up, etc. The fact is, for example on a CPU, you can write ***** poor code that thrashes the cache, the TLB, and the branch prediction history; that doesn't let out-of-order execution help much; that makes no use of coprocessors, multiprocessors, hardware thread processors, or instruction fusing; and that is generally a steaming pile of crap, because you assumed the hardware and software abstractions and optimizers fixed everything for you... even hardware can't polish a software turd...

That said, sure, in this case there are ways to work with the mechanisms at play, such as the one timattox posted, but you have to understand that there *is* a problem/feature before you can work around it or with it...

Like I said... don't get me started...

akhal
Journeyman III

Calling one kernel from another

Thanks LeeHowes; I am considering your earlier comments, but to make it clearer to you, my serial code structure is like below:

  tgle = 1;
  steps(n, mj, x, &x[(n/2)*2], y, &y[mj*2]);
  for(j = 0; j < m; j++)
    {
      mj = mj*2;
      if(tgle)
        {
          steps(n, mj, y, &y[(n/2)*2], x, &x[mj*2]);
          tgle = 0;
        }
      else
        {
          steps(n, mj, x, &x[(n/2)*2], y, &y[mj*2]);
          tgle = 1;
        }
    }

  mj = n/2;
  steps(n, mj, x, &x[(n/2)*2], y, &y[mj*2]);


and the steps() function is the heavy one that I hope to parallelize. So should I keep this looping structure on the host side, and keep enqueuing the steps() kernel, passing "n" and "mj" along, so that I can index the buffers correctly inside the kernel?
