question about multiple-kernel execution in CAL
Below is the Brook+ sample code for bitonic sorting, and I want to implement it in CAL (bitonic is the kernel function). I am not sure if I am doing it right, and I would appreciate your comments.
float2 maxvalue = float2((float)Height, (float)Width);
float sorted1Strm<ArraySize>;
float sorted2Strm<ArraySize>;

// Write data to streams
streamRead(sorted1Strm, array[0]);

// Run Brook program
// lg(Length) stages
int stage, flip = 0;
for (stage = 1; stage <= lgArraySize; ++stage)
{
    int step = 0;
    // Width of each sorted segment to be sorted in parallel (2, 4, 8, ...)
    float segWidth = (float)pow(2.0f, stage);
    for (step = 1; step <= stage; ++step)
    {
        // offset = (segWidth/2, segWidth/4, ..., 2, 1)
        float offset = (float)pow(2.0f, stage - step);
        // two buffers are required since a non-sequential gather is performed
        // from the scratch buffer each step;
        // flip source and target streams each iteration
        if (!flip)
            bitonic(sorted1Strm, sorted2Strm, segWidth, offset, offset * 2.0f);
        else
            bitonic(sorted2Strm, sorted1Strm, segWidth, offset, offset * 2.0f);
        flip ^= 0x01; // XOR with 0b1 flips the flip variable between 0 and 1
    }
}

// Write data back from streams
streamWrite((flip) ? sorted2Strm : sorted1Strm, array[1]);
In CAL, I assume I need to allocate three memory resources: one local memory for input, one local memory for output, and a constant buffer for the kernel parameters.
Then I just write the IL code of the kernel function and run the same code multiple times (because of the loop). I assume I can update the parameter values in the constant buffer before each IL execution. I can also switch the input and output before each execution by changing the memory binding (because of the flip).
For example:
"calModuleGetName(&progName, ctx, module, "i0");
calCtxSetMem(ctx, progName, memoryHandler[0]);
calModuleGetName(&progName, ctx, module, "o0");
calCtxSetMem(ctx, progName, memoryHandler[2]);"
The switch is simply:
"calModuleGetName(&progName, ctx, module, "o0");
calCtxSetMem(ctx, progName, memoryHandler[0]);
calModuleGetName(&progName, ctx, module, "i0");
calCtxSetMem(ctx, progName, memoryHandler[2]);"
Am I right in doing so?
BTW: although NLM_Denoise also has two kernel executions, it is different in that NLM_Denoise uses a fixed I/O binding mapping for each of its kernels, whereas here we need to change the memory I/O binding of the same kernel before every execution.