Running reduce kernels in parallel?

Discussion created by Gruzilkin on Apr 25, 2009
Latest reply on Apr 25, 2009 by Gruzilkin

I'm trying to use Brook+ for my calculations, and only recently found out that kernel calls are asynchronous. So I've reordered my kernel calls, and they now run as three loops:

1) generation of multiple data streams, each with an output stream holding the error (most of the time they don't depend on each other, so on large amounts of data they are executed in parallel)

2) calculation of the squared error for all error streams from (1) (same here, they are executed in parallel)

3) a reduce kernel for each stream to sum up the squared errors...


This third operation turns out to be the slowest one (especially on small amounts of data), and it doesn't run in parallel. Even with an amount of data that makes loops (1) and (2) run in parallel, the third one still stays synchronized. So basically it takes about 5-10% of the time to produce the final error streams and 90-95% of the time to aggregate them. I guess there has to be something that can make the aggregation run in parallel too.



here's the aggregation kernel:

reduce kernel void
CombineError (float4 e<>, reduce float4 error<>)
{
 error = error + e;
}
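
For checking results on the host, here is a minimal CPU sketch of what this reduction should compute: the component-wise sum of all elements of one squared-error stream. The Float4 struct and the function name combineErrorReference are hypothetical stand-ins for illustration and are not part of Brook+.

```cpp
#include <vector>

// Hypothetical host-side stand-in for Brook+'s float4 (illustration only).
struct Float4 {
    float x, y, z, w;
};

// CPU reference for the CombineError reduction: adds up every element
// of one squared-error stream, component by component.
Float4 combineErrorReference(const std::vector<Float4>& squaredErrors) {
    Float4 total = {0.0f, 0.0f, 0.0f, 0.0f};
    for (const Float4& e : squaredErrors) {
        total.x += e.x;
        total.y += e.y;
        total.z += e.z;
        total.w += e.w;
    }
    return total;
}
```

The GPU reduction may add the elements in a different order, so a comparison against this reference should allow for a small floating-point tolerance.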

and I use it like this:

 for(int index=0 ; index<num ; index++) {
  CombineError(*(squaredErrorStreams[index]), *(combinedErrorStreams[index]));

  printf("CombineError isSync: %s\n", combinedErrorStreams[index]->isSync()?"true":"false");
 }

No matter how long the reduction actually takes to complete, isSync() always reports true right after the call, so it's always synchronized...