Gruzilkin
Journeyman III

running reduced kernels in parallel?

I'm trying to use Brook+ for my calculations, and only recently I found out that kernel calls are asynchronous, so I've reordered my kernel calls. They now look like this, in three cycles:

1) generation of multiple data streams, each of which also has an output error stream (most of the time they don't depend on each other, so on large amounts of data they execute in parallel)

2) calculation of the squared error for all error streams from (1) (again, these execute in parallel)

3) a reduction kernel for each stream to sum up the squared errors...

 

This third operation turns out to be the slowest one (especially on small amounts of data), and it doesn't run in parallel. Even with amounts of data that make cycles (1) and (2) run in parallel, the third one is still synchronized. So basically, it takes about 5-10% of the time to produce the final error streams and 90-95% to aggregate them, so I guess there has to be something that can make it work in parallel.

 

 

here's the aggregation kernel:

reduce kernel void
CombineError(float4 e<>, reduce float4 error<>)
{
 error = error + e;
}

and I use it like this:

 for(int index=0 ; index<num ; index++) {
  CombineError(*(squaredErrorStreams[index]), *(combinedErrorStreams[index]));

  printf("CombineError isSync: %s\n", combinedErrorStreams[index]->isSync()?"true":"false");
 }

No matter how long it actually takes to complete, it's always synchronized...



4 Replies
gaurav_garg
Adept I

Regular kernels are asynchronous, but reduction kernels are not.

Reduction kernels are implemented in multiple passes: one kernel is called in each pass, and each pass waits for the previous pass to finish.

So a reduction can't run in parallel with other kernels, but you can still run it in parallel with streamRead.
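To see why the passes force synchronization, here is a plain C++ illustration (not Brook+ and not the actual runtime) of the multi-pass scheme described above, under the simplifying assumption that each pass halves the stream by pairwise addition:

```cpp
#include <cstddef>
#include <vector>

// Plain C++ sketch of a multi-pass reduction: each loop iteration stands in
// for one kernel launch, and the next pass cannot begin until the previous
// one has produced its output -- the reason the runtime must synchronize.
float multiPassReduce(std::vector<float> data) {
    while (data.size() > 1) {
        std::size_t half = data.size() / 2;
        std::vector<float> next(half + data.size() % 2);
        for (std::size_t i = 0; i < half; ++i)   // one "pass"
            next[i] = data[i] + data[half + i];
        if (data.size() % 2)                     // carry the odd element over
            next[half] = data[data.size() - 1];
        data.swap(next);                         // barrier between passes
    }
    return data.empty() ? 0.0f : data[0];
}
```

The real reduction factor per pass is an implementation detail of the runtime; the point is only the serial dependency between passes.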


OK, then I guess in my case it's easier to get all the data from the GPU with an async read and aggregate it on the CPU.
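The CPU-side fallback could look like the following sketch (plain C++; the transfer call itself depends on the Brook+ stream API and is omitted, and the `float4` struct here merely stands in for Brook+'s built-in type):

```cpp
#include <cstddef>

// Stand-in for Brook+'s float4; only used so the host loop mirrors the
// CombineError reduce kernel component-by-component.
struct float4 { float x, y, z, w; };

// Once a squared-error stream has been copied back to host memory
// (via whatever async read the stream API provides), the final sum is a
// trivial serial loop -- the same operation CombineError performs on the GPU.
float4 combineErrorOnCPU(const float4* e, std::size_t n) {
    float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; ++i) {
        sum.x += e[i].x;
        sum.y += e[i].y;
        sum.z += e[i].z;
        sum.w += e[i].w;
    }
    return sum;
}
```

For small streams this serial loop is cheap, and the transfer can overlap with the next batch of kernel launches.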

 

Although I don't really get your explanation...

A reduction kernel runs in several passes, that's right, and each pass depends on the previous ones. But if I call a reduction kernel 10 times, for example, with different streams, those calls are actually independent of each other. So yes, I understand that a reduction kernel takes more time to compute than a regular one, but it would still be nice to have several reduction kernels running asynchronously.


Two kernels (regular or reduction) can never run in parallel, and this limitation is not imposed by the software: current GPUs are simply not able to run multiple kernels at the same time, so they run them in sequence.

The only advantage you can get from the async nature is overlapping kernel execution with async data transfer (streamRead) or with CPU work.


I see... thanks

 

This makes things clearer. I'll just try to make use of async writes then, so that I write the output data from the previous iteration back to the CPU (instead of aggregating it on the GPU) in parallel with the computation of the current iteration.

 

This looks like the best use of resources.
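The overlap scheme described above can be sketched in plain C++ (with `std::async` standing in for the GPU's asynchronous transfer, and a dummy `compute` standing in for the kernel launches; all names here are illustrative, not Brook+ API):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Dummy per-iteration computation: stands in for the kernel launches that
// produce one iteration's squared-error output.
std::vector<float> compute(int iter) {
    return std::vector<float>(4, static_cast<float>(iter));
}

// Double-buffered pipeline: while iteration i computes, the output of
// iteration i-1 is transferred/aggregated concurrently on another thread.
float runPipeline(int iterations) {
    float total = 0.0f;
    std::vector<float> prev = compute(0);
    for (int i = 1; i < iterations; ++i) {
        // aggregate the previous output in parallel with the next computation
        auto agg = std::async(std::launch::async, [&prev] {
            return std::accumulate(prev.begin(), prev.end(), 0.0f);
        });
        std::vector<float> cur = compute(i);
        total += agg.get();     // join before reusing the prev buffer
        prev = std::move(cur);
    }
    total += std::accumulate(prev.begin(), prev.end(), 0.0f);
    return total;
}
```

The key constraint the sketch preserves is that the previous iteration's buffer is not overwritten until its aggregation has completed.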
