7 Replies Latest reply on Aug 14, 2009 6:34 AM by rexiaoyu

    Problem with simple matrix addition


      I declare two 16x16 matrices, stored as 1D arrays, and add them. In the kernel, each thread processes N elements (N = 1, 2, 4, ...). When N is less than 16 it works fine, but when N reaches 16 some kind of runtime error occurs. I cannot figure it out.

      The code is below (main.cpp and locate.br):



      #include <stdio.h>

      #include <stdlib.h>

      #include "brookgenfiles/locate.h"


      using namespace brook;

      #define SIZE 16

      #define SIZE2 256


      void printMatrix(int len, float m[])
      {
          int i, j;
          for (i = 0; i < len; i++)
          {
              for (j = 0; j < len; j++)
              {
                  printf("%f, ", m[i * len + j]);
              }
          }
      }

      int main()
      {
          //array a, b and c
          float a[SIZE2];
          float b[SIZE2];
          float c[SIZE2];

          int i;
          for (i = 0; i < SIZE2; i++)
          {
              a[i] = 1.0;
              b[i] = 2.0;
          }

          unsigned int msize = SIZE2;
          Stream<float> sa(1, &msize);
          Stream<float> sb(1, &msize);
          Stream<float> sc(1, &msize);

          uint4 domainSize = uint4(SIZE2, 1, 1, 1);

          blockAdd(sa, sb, sc);

          if (sc.error())
          {
              printf("Error occurred! %s\n", sc.errorLog());
              return 1;
          }

          printMatrix(SIZE, c);

          return 0;
      }



      Attribute[GroupSize(64, 1, 1)]
      kernel void
      blockAdd(float a[], float b[], out float c[])
      {
          int tid = instance().x;
          //every thread processes len elements, len = 1, 2, 4, 6, 8, 16, ....
          //when len = 16, the error occurs
          int len = 16;

          int start = tid * len;
          int i;
          int index;
          for (i = 0; i < len; i++)
          {
              c[start + i] = a[start + i] + b[start + i];
          }
      }


        • Problem with simple matrix addition

          What runtime error do you see? Is it a crash? If so, where does it crash?

            • Problem with simple matrix addition

              The holiday is over and I am back.

              The results seem odd. Sometimes I get the correct answer, with matrix C full of 3s (only when len doesn't exceed 16); at other times it reports a memory error, and now the output is an array of random numbers.

              Is there something wrong with my algorithm in the kernel?

                • Problem with simple matrix addition

                  It seems the indices you are using are out of range. You are running 256 threads, while a, b, and c each contain only 256 elements.

                  Also, I would suggest using domainOffset and domainSize together. The Brook+ runtime can ignore the domain-of-execution hint if domainOffset is not specified. In addition, check your results without the Attribute qualifier on the kernel.
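To illustrate the out-of-range point without a GPU, here is a minimal CPU-only sketch in plain C++. It is not Brook+ code: `blockAddThread` and `emulateBlockAdd` are hypothetical names, and the bounds check is an addition that the posted kernel does not have. It assumes each emulated thread handles `len` consecutive elements starting at `tid * len`, as in the posted kernel.

```cpp
#include <cstddef>
#include <vector>

// CPU stand-in for the blockAdd kernel: emulated thread `tid` handles `len`
// consecutive elements. The bounds check is added for illustration only;
// the kernel in the post has no such check.
void blockAddThread(std::size_t tid, std::size_t len,
                    const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c)
{
    const std::size_t start = tid * len;
    for (std::size_t i = 0; i < len; ++i) {
        const std::size_t idx = start + i;
        if (idx < c.size()) {  // skip any index past the end of the arrays
            c[idx] = a[idx] + b[idx];
        }
    }
}

// Hypothetical helper: runs `threads` emulated threads one after another
// over arrays of `total` elements filled with 1.0 and 2.0.
std::vector<float> emulateBlockAdd(std::size_t threads, std::size_t len,
                                   std::size_t total)
{
    std::vector<float> a(total, 1.0f), b(total, 2.0f), c(total, 0.0f);
    for (std::size_t tid = 0; tid < threads; ++tid) {
        blockAddThread(tid, len, a, b, c);
    }
    return c;
}
```

With 256 elements and len = 16, calling `emulateBlockAdd(16, 16, 256)` covers every element exactly once. Calling it with 256 threads only stays in bounds because of the added check; threads 16 and above then have nothing left to do.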

                    • Problem with simple matrix addition

                      Thanks for your suggestion.

                      Does that mean that, to avoid the out-of-range problem, I cannot write more than one element in the kernel?

                      But I know that the cal_idct sample provided with the SDK writes more than one element in its IL kernel. Here is part of the code:


                      // save 8x8 DCT coefficient block location

                       "ishl r16.x, vaTid.x, l8.w\n"


                       // load packed 8x8 DCT coefficients using texture cache

                       "mov  r0, g[r16.x+0]\n" 

                       "mov  r2, g[r16.x+1]\n" 

                       "mov  r4, g[r16.x+2]\n" 

                       "mov  r6, g[r16.x+3]\n" 

                       "mov  r8, g[r16.x+4]\n" 

                       "mov r10, g[r16.x+5]\n" 

                       "mov r12, g[r16.x+6]\n" 

                       "mov r14, g[r16.x+7]\n" 

                      //DO IDCT



                      // save DCT values

                       "mov g[r16.x+0], r0\n" 

                       "mov g[r16.x+1], r2\n" 

                       "mov g[r16.x+2], r4\n" 

                       "mov g[r16.x+3], r6\n" 

                       "mov g[r16.x+4], r8\n" 

                       "mov g[r16.x+5], r10\n" 

                       "mov g[r16.x+6], r12\n" 

                       "mov g[r16.x+7], r14\n"


                      The code above first gets the absolute thread ID and maps it to an 8x8 block, which is processed later. Finally, it writes these elements back. It works fine, so I wonder whether I can do the same thing in Brook+.

                • Problem with simple matrix addition

                  In your kernel, instance().x returns values from 0 to 255, so writing 16 elements in each thread means accessing memory elements 0 to 4095. But only 256 elements are allocated.
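The arithmetic behind those numbers can be checked with a small sketch in plain C++. The helper names `highestIndex` and `threadsNeeded` are made up for this example and are not part of Brook+; the sketch assumes each thread starts at `tid * len` and handles `len` consecutive elements.

```cpp
#include <cstddef>

// Highest element index touched when `threads` threads each handle `len`
// consecutive elements starting at tid * len.
// Assumes threads >= 1 and len >= 1.
std::size_t highestIndex(std::size_t threads, std::size_t len)
{
    return (threads - 1) * len + (len - 1);
}

// Number of threads needed to cover `total` elements when each thread
// handles `len` of them (rounds up if len does not divide total).
std::size_t threadsNeeded(std::size_t total, std::size_t len)
{
    return (total + len - 1) / len;
}
```

For 256 threads and len = 16, `highestIndex` gives 255 * 16 + 15 = 4095, far beyond the last valid index of 255. `threadsNeeded(256, 16)` gives 16, which is the thread count at which the highest index is exactly 255.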