Archives Discussions

drstrip · ‎12-11-2009

The following (simplified) code causes brcc to hang:

kernel void computeEnergy(uint4 old_spin[][], uint4 new_spin<>, out uint4 energy<>, int ROWS, int COLS)
{
int4 index = instance();

<mumble>
}

kernel void updateSpin(uint4 spin_in<>, uint4 seeds_in<>, out uint4 spin_out<>, out uint4 seeds_out<>, int num_steps, uint num_spin_states, int ROWS, int COLS)
{
uint4 proposed_energy;
uint4 proposed_spin;
computeEnergy(spin_in, proposed_spin, proposed_energy, ROWS, COLS);
}

Is the hang being caused by the subkernel containing a stream (gather) operation? The manual is a little unclear on this: it says that kernels cannot call stream operators. Perhaps this was supposed to say subkernels cannot call stream operators?

If the gather stream in the subkernel is the source of my problem, how does one deal with gathers that should be in subkernels? Just inline the code in your kernel? Can you perform a gather on stream data that is local to the kernel (i.e., not an input)? Since you can't write to the input stream, how do you do a gather operation inside a loop, where each gather is performed on the result of the previous iteration?

gaurav_garg · ‎12-14-2009

As proposed_spin is uninitialized, shouldn't you modify the sub-kernel signature like this-

kernel void computeEnergy(uint4 old_spin[][], out uint4 new_spin<>, out uint4 energy<>, int ROWS, int COLS)

drstrip · ‎12-14-2009

The code I posted is just to give the flavor.

In the real code, some other ops initialize proposed_spin. But the question is really about why brcc hangs. The docs say you can't do stream ops, but is that really intended to mean subkernels can't do stream ops? If so, how do you do stream ops in a loop if you want to update the value of a stream var on each iteration of the loop, since the input var is read-only?

gaurav_garg · ‎12-15-2009

I think the problem is in using the instance() method inside a sub-kernel.

drstrip · ‎12-16-2009

With that latest hint, I'm making a little progress. The following example shows that indeed, you cannot call instance() in a subkernel.

This compiles

kernel void subKernel(uint4 in_stream[], out uint4 out_stream<>)
{
}

kernel void mainKernel(uint4 in_stream[], out uint4 out_stream<>)
{
subKernel(in_stream, out_stream);
}

Add an instance() call and it fails -

kernel void subKernel(uint4 in_stream[], out uint4 out_stream<>)
{
int4 indx = instance();
}

kernel void mainKernel(uint4 in_stream[], out uint4 out_stream<>)
{
subKernel(in_stream, out_stream);
}

OK, so we (at least I) have now learned that instance() is forbidden in a subkernel. We can work around this limitation by doing the instance() call in the main kernel and passing the value:

kernel void subKernel(uint4 in_stream[], int pos, out uint4 out_stream<>)
{
out_stream = in_stream[pos];
}

kernel void testKernel(uint4 in_stream[], out uint4 out_stream<>)
{
int4 indx = instance();
out_stream = in_stream[indx.x];
subKernel(in_stream, indx.x, out_stream);
}

Recall, however, that my goal is to perform an operation on the input stream data iteratively, updating the values each time through the loop. in_stream is read-only, so we can't operate on that. The obvious thought is to copy in_stream to some local var that we can write to. (I'll omit the subkernel code from here on out - it doesn't change)

kernel void testKernel(uint4 in_stream[], out uint4 out_stream<>)
{
int4 indx = instance();
uint4 local_stream = in_stream[indx.x];
subKernel(in_stream, indx.x, out_stream);
}

This compiles, so I can copy the in_stream to my local var. Now let's just replace in_stream in the subKernel call with local_stream -

kernel void testKernel(uint4 in_stream[], out uint4 out_stream<>)
{
int4 indx = instance();
uint4 local_stream = in_stream[indx.x];
subKernel(local_stream, indx.x, out_stream);

}

Now we get a compiler error complaining of an invalid cast. Apparently local_stream cannot be cast to the type of in_stream[] in the subkernel prototype. Its "type" is uint4, just like in the signature. It's enough of the same type as in_stream in the testKernel that I could assign to it with no cast problems. So, what's going on?

gaurav_garg · ‎12-16-2009

Now we get a compiler error complaining of an invalid cast. Apparently local_stream cannot be cast to the type of in_stream[] in the subkernel prototype. Its "type" is uint4, just like in the signature. It's enough of the same type as in_stream in the testKernel that I could assign to it with no cast problems. So, what's going on?

in_stream is a uint4 gather stream (and it is passed as uint4 var_name[] in sub-kernel parameters), but local_stream is a uint4 variable (and it must be passed as uint4 var_name or uint4 var_name<> in sub-kernel parameters)

drstrip · ‎12-16-2009

how do I create a local var in the main kernel that can be passed as a uint4 gather stream to the subkernel?

The manual says that I can use an env var to specify read-write input streams, but suggests this is dangerous. Does that mean it doesn't work, may not work, is unreliable?

gaurav_garg · ‎12-16-2009

how do I create a local var in the main kernel that can be passed as a uint4 gather stream to the subkernel?

You cannot pass a local var as a gather stream to a sub-kernel. The main kernel's gather stream must be passed directly to the sub-kernel. There is no way to write a value to a gather stream.

The manual says that I can use an env var to specify read-write input streams, but suggests this is dangerous. Does that mean it doesn't work, may not work, is unreliable?

The manual is talking about a situation like this-

kernel void test(float a<>, out float b<> )

and then calling this kernel with the same stream as input and output.

test(a, a);

drstrip · ‎12-16-2009

So this brings us full circle to my underlying question:

Suppose you're trying to write a piece of code in which each element of an array updates its state based on the values of its neighbors. It's relatively straightforward to write this kernel using a gather stream. But now you want to loop over that update operation. Since the updated values cannot be written to the input gather stream, you can't just loop over the code you wrote in the once-through case. You can't create a local gather stream variable to pass to a subkernel. So how do you do it? Calling the kernel inside a loop running on the CPU means you have to pass data back and forth across the bus on every operation. If the computation has additional state along with the array itself, this can become extremely costly, killing any advantage of using the GPU.

riza_guntur · ‎12-16-2009

yes that's the point

CaptainN · ‎12-18-2009

drstrip,

Actually you can. You do need to call kernels in a loop, but between kernel invocations, just re-assign the output stream to the input and the input to the output. This will not cause any data movement, just a handle swap. In the "second" kernel invocation you will receive the previous output stream as the input.

The only problem here is that within one pass, while processing element i+1, you cannot know whether element i has already been updated, which matters if element i+1 depends on element i. But this is the general situation in parallel computing.

Once you finish your iterations, read the results out from whichever stream was last used as the output stream.
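
For readers following along, the swap described above can be illustrated on the CPU. The following is only a sketch in plain C++ and does not use the Brook+ runtime: updatePass, iterate, and the neighbor-sum rule are hypothetical stand-ins, and std::vector plays the role of a stream. It shows each pass reading one buffer and writing the other, with only pointers exchanged between passes.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// CPU stand-in for one kernel pass (hypothetical update rule): each cell
// becomes the sum of its two neighbors, with periodic boundaries.
// The pass reads only `in` and writes only `out`.
void updatePass(const std::vector<int>& in, std::vector<int>& out)
{
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[(i + n - 1) % n] + in[(i + 1) % n];
    }
}

// Runs `steps` passes, swapping the roles of the two buffers between passes.
std::vector<int> iterate(std::vector<int> state, int steps)
{
    std::vector<int> scratch(state.size());
    std::vector<int>* src = &state;
    std::vector<int>* dst = &scratch;
    for (int s = 0; s < steps; ++s) {
        updatePass(*src, *dst);
        std::swap(src, dst);  // pointer swap only; no element is copied
    }
    return *src;  // after the final swap, src points at the newest data
}
```

Because every pass reads only the previous buffer, each element sees the old values of its neighbors, which is the within-pass limitation mentioned above. Calling iterate({1, 0, 0, 0}, 2) runs two passes and returns the contents of whichever buffer was written last.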

drstrip · ‎12-19-2009

Captain N writes:
You indeed need to call kernels in a loop, but between kernel invocations, just re-assign output stream to input, and input to output. It will not cause any data movement around, just handle swap. In a "second" kernel invocation you will receive output stream as an input.

If I understand, you are suggesting something like this -

int main(int argc, char** argv)
{
const int BUF_SIZE = 8192;
int (*in_data)[BUF_SIZE]= new int [BUF_SIZE][BUF_SIZE];
int (*out_data)[BUF_SIZE]= new int [BUF_SIZE][BUF_SIZE];

CPerfCounter timer;

unsigned int dims[2] = {BUF_SIZE, BUF_SIZE};

brook::Stream< int> in_stream(2, dims);
brook::Stream< int> out_stream(2, dims);

timer.Reset();
timer.Start();
for (int i = 0; i < 10; ++i)
in_stream.read(in_data);
timer.Stop();

std::cout << "Time to read stream = "<< timer.GetElapsedTime()/10 << std::endl;

timer.Reset();
timer.Start();

for (int i = 0; i < 10; ++i)
{
    testKernel(in_stream, out_stream);
    testKernel(out_stream, in_stream);
}

timer.Stop();
std::cout << "Time to execute kernel = " << timer.GetElapsedTime()/10. << std::endl;
}

Let's use a trivial kernel

kernel void testKernel(int in_stream<>, out int out_stream<>)
{
return;
}

If no data is moved during the kernel call, then the second loop should take roughly the same amount of time regardless of the buffer size. However, that's not the case.

BUF_SIZE     Time to read stream   Time to execute kernel
1024            .0050                     .0023
2048            .016                      .0045
4096            .065                      .016
8192            .43                       1.246

(Times are in seconds).

This strongly suggests to me that each call to the kernel involves a data transfer, making it very costly for large arrays passed to the kernel.

gaurav_garg · ‎12-19-2009

First of all, your performance measurement is wrong. Both streamRead and kernel calls are asynchronous. Also, a kernel call waits for streamRead to finish before kernel execution begins.

So, your time measurement should be something like this-

stream.finish();

//timer_start();

// operation on stream - stream.read() or stream.write()

stream.finish();

// timer_stop();
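
The same measurement pitfall exists with any asynchronous API. As an analogy only (this is standard C++, not the Brook+ runtime), the sketch below uses std::async as a stand-in for an asynchronous stream operation; launchAsync and timeLaunch are hypothetical names. Stopping the clock before waiting measures only the cost of issuing the work, not the work itself.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for asynchronous device work: sleeps for `ms` milliseconds
// on another thread and returns immediately to the caller.
std::future<void> launchAsync(int ms)
{
    return std::async(std::launch::async, [ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

// Returns the seconds elapsed between the launch and the point where the clock stops.
// If waitForCompletion is false, the clock stops as soon as the launch call returns.
double timeLaunch(int ms, bool waitForCompletion)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::future<void> pending = launchAsync(ms);
    if (waitForCompletion) {
        pending.wait();  // plays the role of stream.finish() in the pattern above
    }
    const auto stop = clock::now();
    pending.wait();      // make sure the work is done before returning
    return std::chrono::duration<double>(stop - start).count();
}
```

With waitForCompletion set to true the result covers the full duration of the work; with false it will usually be far smaller, even though the same work was issued.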

If no data is moved during the kernel call, then the second loop should take roughly the same amount of time regardless of the buffer size. However, that's not the case.

Consider the case of the 2048 buffer size. If data transfer were taking place between two kernel calls, the kernel call time would have to include 4 * .016 sec (a streamWrite and a streamRead for each of in_stream and out_stream) = .064 sec. The measured time is .0045 sec, so that is definitely not the case.

drstrip · ‎12-19-2009

Some new experiments:

trivial kernel as above

relevant parts of caller look like

in_stream.finish();

timer.start();

in_stream.read();

for (i = 0; i < n; ++i) // test with n = 10, 100

{

testKernel(in_stream, out_stream);

testKernel(out_stream, in_stream);

}

timer.stop();

For n = 10, 100, the difference in execution time will represent the extra iterations of the loop, since each has the same stream.read().

Compute the kernel call time per loop iteration (hence two kernel calls) as (t_100 - t_10)/90. You get the following time per iteration:

1024 - .001576 secs

2048 - .003874 secs

4096 - .015275 secs

8192 - 1.3053 secs

Once again, the times suggest that data is being transferred as part of the call, unless there is some other language feature I don't understand.

These times also allow you to compute the elapsed time for the stream.read() operation. The computed values are consistent with the values I get from direct timing using the following snippet:

in_stream.finish();

timer.start();

for (int i = 0; i < 10 ; ++i)

in_stream.read();

in_stream.finish();

timer.stop();

gaurav_garg · ‎12-20-2009

If no data is moved during the kernel call, then the second loop should take roughly the same amount of time regardless of the buffer size. However, that's not the case.

Once again, the times suggest that data is being transferred as part of the call, unless there is some other language feature I don't understand.

I don't understand how you reach this conclusion.

drstrip · ‎12-20-2009

In my experiment I make 1 stream read call, then 10 pairs of kernel calls, and collect the total time. I repeat the experiment, this time making 1 stream read call and 100 pairs of kernel calls. The difference in time is equal to the time of making 90 pairs of kernel calls (except for some very small loop overhead). I performed this experiment for different stream sizes. That is the timing I reported in my previous post. Regardless of the cause, it is clear that calling an empty kernel with a larger stream takes more time than calling the same empty kernel with a smaller stream. What would cause this? I conjectured that some data transfer must be taking place. Perhaps this is wrong, but then what is causing the kernel with larger streams to take longer?

I have been avoiding looking at the compiler output, but maybe that's what it will come down to if we hope to understand what's going on.

gaurav_garg · ‎12-21-2009

A larger stream will take more time to execute the kernel because the kernel gets executed for a larger number of GPU threads, even if it is an empty kernel.

If data transfer time were included in the kernel call, the kernel execution time would be larger than the streamRead + streamWrite time, and that is definitely not the case.

brcc hangs - subkernel with stream operator?