question about multiple-kernel execution in CAL
Below is the Brook+ sample code for bitonic sorting, and I want to implement it in CAL (bitonic is the kernel function). I am not sure if I am doing it right, and I would appreciate your comments.
float2 maxvalue = float2((float)Height, (float)Width);
float sorted1Strm<ArraySize>;
float sorted2Strm<ArraySize>;

// Write data to streams
streamRead(sorted1Strm, array[0]);

// Run Brook program
// lg(Length) stages
int stage, flip = 0;
for (stage = 1; stage <= lgArraySize; ++stage)
{
    int step = 0;
    // Width of each sorted segment to be sorted in parallel (2, 4, 8, ...)
    float segWidth = (float)pow(2.0f, stage);
    for (step = 1; step <= stage; ++step)
    {
        // offset = (segWidth/2, segWidth/4, ..., 2, 1)
        float offset = (float)pow(2.0f, stage - step);
        // two buffers are required since a non-sequential gather is performed
        // from the scratch buffer each step;
        // flip source and target streams each iteration
        if (!flip)
            bitonic(sorted1Strm, sorted2Strm, segWidth, offset, offset * 2.0f);
        else
            bitonic(sorted2Strm, sorted1Strm, segWidth, offset, offset * 2.0f);
        flip ^= 0x01; // XOR with 0b1 flips the flip variable between 0 and 1
    }
}

// Write data back from streams
streamWrite((flip) ? sorted2Strm : sorted1Strm, array[1]);
In CAL, I assume I need to allocate three memory resources: one local memory for input, one local memory for output, and a constant buffer for the kernel parameters.
Then I just write the IL code of the kernel function and run the same code multiple times (because of the loop). I assume I can update the parameter values in the constant buffer before each IL execution. I can also switch the input and output before each execution by changing the memory binding (because of the flip).
For example:
"calModuleGetName(&progName, ctx, module, "i0");
calCtxSetMem(ctx, progName, memoryHandler[0]);
calModuleGetName(&progName, ctx, module, "o0");
calCtxSetMem(ctx, progName, memoryHandler[2]);"
The switch is simply:
"calModuleGetName(&progName, ctx, module, "o0");
calCtxSetMem(ctx, progName, memoryHandler[0]);
calModuleGetName(&progName, ctx, module, "i0");
calCtxSetMem(ctx, progName, memoryHandler[2]);"
Am I right in doing so?
BTW: although NLM_Denoise also has two kernel executions, it is different in that NLM_Denoise uses a fixed I/O binding mapping for each of its kernels, whereas here we need to change the memory I/O binding of the same kernel before every execution.