
    implement brook+ bitonic sorting code in CAL

    jfkong
      question about multiple-kernel execution in CAL

      Below is the Brook+ sample code for bitonic sorting, and I want to implement it in CAL (bitonic is the kernel function). I am not sure I am doing it right, and I would appreciate your comments.

      float2 maxvalue = float2((float)Height, (float)Width);
      float  sorted1Strm<ArraySize>;
      float  sorted2Strm<ArraySize>;
      int    stage;
      int    flip = 0;
             
      // Write data to streams
      streamRead(sorted1Strm, array[0]);
                             
      // Run Brook program
      // lg(ArraySize) stages
      for (stage = 1; stage <= lgArraySize; ++stage)
      {
           int     step        = 0;

           // Width of each sorted segment to be sorted in parallel (2, 4, 8, ...)
           float   segWidth    = (float)pow(2.0f, stage);

           for (step = 1; step <= stage; ++step)
           {
                // offset = (stageWidth/2, stageWidth/4, ... , 2, 1)
                float offset = (float)pow(2.0f, stage - step);

                 // two buffers are required since a non-sequential gather is performed
                 // from the scratch buffer each step.

                // flip source and target streams each iteration
                 if (!flip)
                        bitonic(sorted1Strm, sorted2Strm, segWidth, offset, offset * 2.0f);
                 else
                        bitonic(sorted2Strm, sorted1Strm, segWidth, offset, offset * 2.0f);

                 flip ^= 0x01; // toggle flip between 0 and 1
             }
      }

      // Write data back from streams
      streamWrite((flip) ? sorted2Strm : sorted1Strm, array[1]);

       

      In CAL, I assume I need to allocate three memory resources: one local memory buffer for the input, one local memory buffer for the output, and a constant buffer for the kernel parameters.
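      As a minimal sketch (assuming device and ctx are already initialized, ArraySize is the stream length, and all variable names here are illustrative; error checking omitted), the allocation could look like this:

      CALresource inputRes, outputRes, constRes, stageRes;
      CALmem      inputMem, outputMem, constMem, stageMem;
      CALevent    copyEvent;

      // two local 1D float buffers for the ping-pong streams
      calResAllocLocal1D(&inputRes,  device, ArraySize, CAL_FORMAT_FLOAT_1, 0);
      calResAllocLocal1D(&outputRes, device, ArraySize, CAL_FORMAT_FLOAT_1, 0);

      // one host-visible constant buffer: a single float4 holding
      // (segWidth, offset, offset * 2, unused)
      calResAllocRemote1D(&constRes, &device, 1, 1, CAL_FORMAT_FLOAT_4, 0);

      calCtxGetMem(&inputMem,  ctx, inputRes);
      calCtxGetMem(&outputMem, ctx, outputRes);
      calCtxGetMem(&constMem,  ctx, constRes);

      // streamRead equivalent: fill a remote (host-visible) staging
      // buffer on the CPU, then DMA it into the local input buffer
      calResAllocRemote1D(&stageRes, &device, 1, ArraySize, CAL_FORMAT_FLOAT_1, 0);
      // ... calResMap(stageRes), memcpy array[0] into it, calResUnmap ...
      calCtxGetMem(&stageMem, ctx, stageRes);
      calMemCopy(&copyEvent, ctx, stageMem, inputMem, 0);
      while (calCtxIsEventDone(ctx, copyEvent) == CAL_RESULT_PENDING);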

      Then I would just write the IL code for the kernel function and run the same code multiple times (because of the loop). I assume I can change the parameter values in the constant buffer before each IL execution, and I can also switch the input and output before each execution by changing the memory binding (because of the flip).
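      Roughly, each inner-loop iteration could then look like this (again a sketch: it assumes the IL kernel reads its parameters from a constant buffer declared as "cb0" and bound once up front with calModuleGetName/calCtxSetMem, and that func was fetched with calModuleGetEntry):

      float*    constPtr;
      CALuint   pitch;
      CALevent  runEvent;
      CALdomain domain = { 0, 0, ArraySize, 1 };

      // update the kernel parameters for this step
      calResMap((CALvoid**)&constPtr, &pitch, constRes, 0);
      constPtr[0] = segWidth;
      constPtr[1] = offset;
      constPtr[2] = offset * 2.0f;
      calResUnmap(constRes);

      // launch the kernel over the whole stream and wait for completion
      calCtxRunProgram(&runEvent, ctx, func, &domain);
      while (calCtxIsEventDone(ctx, runEvent) == CAL_RESULT_PENDING);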

      For example, the initial binding is:

      "calModuleGetName(&progName, ctx, module, "i0");

      calCtxSetMem(ctx, progName, memoryHandler[0]);

      calModuleGetName(&progName, ctx, module, "o0");

      calCtxSetMem(ctx, progName, memoryHandler[2]);"

      The switch is then simply:

      "calModuleGetName(&progName, ctx, module, "o0");

      calCtxSetMem(ctx, progName, memoryHandler[0]);

      calModuleGetName(&progName, ctx, module, "i0");

      calCtxSetMem(ctx, progName, memoryHandler[2]);"

       

      Am I right in doing so?

      BTW: although NLM_Denoise also has two kernel executions, it is different in that NLM_Denoise uses a fixed I/O binding for each of its kernels, whereas here the memory I/O bindings for the same kernel have to change on every execution.

        • implement brook+ bitonic sorting code in CAL
          gaurav.garg

          I think you do not need to get the name handles again and again; doing it once should suffice. So your code should be something like this:

          First, get all the name handles:

          calModuleGetName(&inName, ctx, module, "i0");

          calModuleGetName(&outName, ctx, module, "o0");

           

          "calCtxSetMem(ctx, inName, memoryHandler[0]);

          calCtxSetMem(ctx, outName, memoryHandler[2]);"

          The switch is then simply:

          "calCtxSetMem(ctx, outName, memoryHandler[0]);

          calCtxSetMem(ctx, inName, memoryHandler[2]);"

           

          This optimization is quite important, as the calModuleGetName API has significant overhead. The Brook+ runtime already performs this kind of optimization and state management.
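          Putting it together, the whole sort loop with cached name handles might look like this (a sketch reusing the illustrative names from the allocation example above; error checking omitted):

          CALname inName, outName;
          calModuleGetName(&inName,  ctx, module, "i0");
          calModuleGetName(&outName, ctx, module, "o0");

          int flip = 0;
          for (int stage = 1; stage <= lgArraySize; ++stage) {
              for (int step = 1; step <= stage; ++step) {
                  // ... update the constant buffer as shown earlier ...

                  // ping-pong: swap which buffer is input and which is output
                  calCtxSetMem(ctx, inName,  flip ? outputMem : inputMem);
                  calCtxSetMem(ctx, outName, flip ? inputMem  : outputMem);

                  calCtxRunProgram(&runEvent, ctx, func, &domain);
                  while (calCtxIsEventDone(ctx, runEvent) == CAL_RESULT_PENDING);

                  flip ^= 1;
              }
          }
          // the final result is in outputMem if flip is set, else in inputMem

          Reading the result back (the streamWrite equivalent) would then be the reverse of the staging copy: calMemCopy from whichever local buffer holds the result into a remote buffer, followed by calResMap on the CPU side.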