Archives Discussions

rrr
Journeyman III

computing different kernels in parallel (Brook+)

Is it possible to submit a batch of different kernels to the GPU so that they run on different SIMD engines?

Hello!

My problems involve quite small matrix and vector sizes (10 to 100), so one kernel invocation would often keep only one SIMD engine busy. Does the Brook+ runtime detect "independent" kernel invocations and submit further kernel invocations to the GPU without waiting for the completion of the already started kernels?


Example:

kernel void sum1(double v1<>, double v2<>, double result<>)
{
    result = v1+v2;
}

kernel void sum2(double v1<>, double v2<>, double result<>)
{
    result = v1+v2;
}


int main(int argc, char** argv)
{
    double a1<10>;
    double a2<10>;
    double a3<10>;
    double ret1<10>;
    double ret2<10>;
    double ret3<10>;
    ...
    sum1(a1,a2,ret1);
    sum2(a1,a3,ret2); // started in parallel to the sum1 invocation?
    sum1(a2,a3,ret3); // started in parallel to both previous invocations?
    ...
}

If Brook+ can parallelize such simple kernel invocations (in my example it would be feasible, because all output streams are disjoint), how does Brook+ detect the independence of output streams? E.g., could Brook+ parallelize if ret1..ret3 were replaced by domain operators on a matrix that select disjoint domains?

If Brook+ cannot parallelize the kernel invocations, could I alternatively use CPU threads in the Brook+ host program to feed the kernel invocations to the GPU in parallel?

best regards
Robert

 

3 Replies
the729
Journeyman III

As far as I know, it cannot be parallelized. A kernel won't start running until the previous kernel has finished.

However, I have no idea whether multiple kernels run concurrently if they belong to different contexts.


I can run OpenGL side by side with Brook+ with no problem. I do seem to be experiencing an error the second time I try to map my kernel, but that's a whole 'nother story.

I'm assuming the functionality is that kernels from any context will be put into the command queue, which is then flushed when it is full or on demand. Therefore I'd assume you should be able to send as much into it, from wherever, as it is capable of holding; then it will flush and be begging for more!

As for parallelization, I don't believe any part of the SDK can guarantee absolute parallelism; it is up to the programmer to ensure things happen in the necessary order. It is entirely possible to take the Brook+-generated CAL code and add a line or two to request that a result is finished before you access it. And it seems the functionality of Brook+ ensures that the data is fully available when you write the values out to a buffer.

As for a multi-threaded solution, I don't think there are any guarantees about the order in which things occur unless you make it so. If you have a thread that calls a kernel and then maps the result to a buffer, no doubt that thread will not return or proceed past that point until the data is wholly in the buffer. In that sense, we could launch two threads side by side and await both of them finishing before toying with the results. However, according to the documentation it seems CAL would simply run the first kernel, and not run the second kernel until there is room available for it to be mapped to.

If my understanding is correct, it may be safe to assume that if your two kernels do not branch at all (call other kernels) and each would theoretically occupy only 50% of the hardware, they may in fact execute in parallel. Once again, it is up to you to ensure the results are synchronized properly.

Also note that a double-precision calculation must be mapped to 4 thread processors and therefore takes up a lot more room in the hardware.

I don't see a reason why there would be any problems if they do happen to execute in parallel, as the input buffers are not altered in the process and the output buffers are in completely separate locations in memory.

Have you tried the code at all yet?  A simple test may be to time each consecutively, and then to time them together and see if there's a noticeable difference.

Go back to the documentation and check out the section about how values are mapped from input streams to output streams; the details are in there, I'm just a little fuzzy on my interpretation. (I believe there is a line that explicitly says the kernel will be mapped to the hardware, and then, if there's room left, the next kernel will be mapped to the remaining space; however, I could be imagining it.)

No matter how it works, though, one must be very meticulous about the manner in which the operations are performed, whether they branch at all, and so on. (Also note that the memory reads operate asynchronously, which may prove to be a serious issue without proper hand-written synchronization in such circumstances.)

Let me know how it works, I'm using the brute force approach myself to learn the system.


I'd also like to note that the only true parallelism I know of in the SDK happens inside a mapped kernel: once all the data is loaded into memory, an operation such as:

kernel void addTen(float4 foursies<>, out float4 output<>)
{
    output = foursies + 10.0f;
}

will add 10 to each of the four components in parallel.

So if you want true parallelism, no questions asked, I'd try to map out your kernels to take advantage of this wherever it is required. I know it personally took me a while to get used to these concepts, but they are very powerful.

Thus, if doubles are not absolutely necessary, one could simply have one kernel with a single operation:

kernel void sum(float3 a<>, float3 b<>, out float3 ret<>)
{
  ret = a + b;
}

void func(float a1, float a2, float a3)
{
  float3 a, b, ret;
  float3 aStream<1>;
  float3 bStream<1>;
  float3 retStream<1>;

  a.x = a.y = a1;
  a.z = a2;

  b.x = a2;
  b.y = b.z = a3;

  streamRead(aStream, &a);
  streamRead(bStream, &b);
  sum(aStream, bStream, retStream);
  streamWrite(retStream, &ret); // ret = (a1+a2, a1+a3, a2+a3)
}

Also, there are many different ways, depending on the circumstances, to avoid using double precision, through various methods of wizardry. It may be worth investigating these unless double precision is absolutely necessary.
