43 Replies Latest reply on Jan 30, 2009 3:14 PM by genaganna

    Could use some help with arrays

    yarr
      weird index behavior

      Hello everyone!

      While getting into Brook+ GPU programming, I ran into some weird behavior with array indices. Here's my test kernel:

      kernel void testkernel(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>)
      {
       uint ker_n1, ker_n2, t, z, mhm;
       uint2 test;
       
       ker_n1 = (uint)ker_in.x;
       ker_n2 = (uint)ker_in.y;

       t = (ker_n1+ker_key[0]) & (uint)0xFFFFFFFF;
       mhm=t>>(uint)24 &(uint)255;
       z = ker_s87[mhm];

       test.x = z;
       test.y = mhm;
       ker_out = (uint2)test;
      }

      Everything works okay except for the highlighted line, z = ker_s87[mhm].
      Somehow "z" just returns zero.

      "mhm" itself returns proper values, from zero to 255. Also, if I manually substitute a constant index in the "z" expression, for example:

      z = ker_s87[158];

      I get the proper result, with z returning the needed element.

      I can't really understand what's wrong here and why it doesn't work properly. :(

      Thanks in advance for any ideas or suggestions!
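For comparing GPU output against known-good values, a CPU reference of the test kernel can help. This is only a minimal sketch in plain C, not Brook+ code; the `uint2_ref` struct and the function name `testkernel_ref` are hypothetical stand-ins introduced for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the Brook+ uint2 type. */
typedef struct { uint32_t x, y; } uint2_ref;

/* CPU reference for the test kernel: add the first key word, take the
   top byte of the sum, and use it as an index into the 256-entry table. */
static uint2_ref testkernel_ref(uint2_ref in, const uint32_t key[2],
                                const uint32_t s87[256])
{
    uint32_t t   = in.x + key[0];      /* uint32_t addition already wraps mod 2^32 */
    uint32_t mhm = (t >> 24) & 255u;   /* top byte, always in 0..255 */
    uint2_ref out;
    out.x = s87[mhm];                  /* value the kernel is expected to write to .x */
    out.y = mhm;
    return out;
}
```

If the kernel's `.y` output matches this reference but `.x` does not, the index computation is fine and the table lookup is where things diverge.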

        • Could use some help with arrays
          rick.weber

          When indexing into a stream, I usually use floating point indices. I'm not sure you can index using uint, though I could be wrong. Thus, try changing z = ker_s87[mhm] to z = ker_s87[(float)mhm].

            • Could use some help with arrays
              yarr

              Na-ah, still a no-go.

              However, stating "z = ker_s87[(uint)169]" does return a proper value. So, I guess, indexing by (u)ints seems to work.

                • Could use some help with arrays
                  jean-claude

                  Sometimes the compiler doesn't generate what you expect... I spent a couple of hours wondering what was going on with a piece of code, only to discover that the compiled IL was not correct.

                  Well, I don't know if this is the case here, but you could try the following code, which is equivalent to yours, and check the behaviour...

                  kernel void test2(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>)
                  {
                   uint t, z, mhm;
                   
                   t = ker_in.x + ker_key[0];

                   mhm = (t>>(uint)24) & (uint)255;

                   z = ker_s87[mhm];

                   ker_out.x = z;
                   ker_out.y = mhm;
                  }

                    • Could use some help with arrays
                      yarr

                      That's pretty much how the initial code looked. ^^ It still behaves the same way though: the "mhm" operation returns proper results and "z" just won't work.

                      Actually it looked like this at first: "z = ker_s87[(t>>(uint)24) & (uint)255]", and "mhm" itself was put in there to check whether there's maybe a problem with "t" or the "(t>>(uint)24) & (uint)255" operation. But they seem to work just fine, and it indeed looks like the compiler won't generate proper code for "z = ker_s87['something other than a plain number']". :(

                        • Could use some help with arrays
                          genaganna

                          Use the -c flag and run again. It will most likely work.

                            • Could use some help with arrays
                              yarr

                              Using "-c" makes the build of this kernel fail altogether, with the following error:

                              error C2664: 'void __test2::operator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream'
                              with [T=uint2] and [T=uint]
                              No constructor could take the source type, or constructor overload resolution was ambiguous

                              Also, I did some more testing; if I just set the index to a fixed value inside the kernel and use that variable as the index for "z":

                              mhm = 139;
                              z = ker_s87[mhm];

                              I get the proper result returned. So I guess it's not about passing the index as a variable. The problem is that I need it to be computed inside the kernel, so I'm stuck here again... anyone got some ideas?

                                • Could use some help with arrays
                                  yarr

                                  Did some more testing, wrote this simple kernel:

                                  kernel void test3(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>, uint index_test)
                                  {
                                  ker_out.x=index_test;
                                  ker_out.y=ker_s87[index_test];
                                  }

                                  and behavior is very similar to the previous kernel: "ker_out.y" just returns zero.

                                  However, with:

                                  kernel void test4(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>)
                                  {
                                  uint index_test = (uint)255;
                                  ker_out.x=index_test;
                                  ker_out.y=ker_s87[index_test];
                                  }

                                  I get the last array element returned.
                                  So it looks like those arrays only accept constants as indices?

                                  Bug?



                                    • Could use some help with arrays
                                      genaganna

                                      Yarr,

                                      I have written the following test case based on your discussion.

                                      Could you please confirm whether this test case reproduces the problem you are facing?

                                      kernel void testkernel(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>,uint index_test)
                                      {
                                          ker_out.x=index_test;
                                          ker_out.y=ker_s87[index_test];
                                      }

                                      int main(int argc, char** argv)
                                      {
                                          uint *i0 = NULL;
                                          uint  *i1 = NULL;
                                          uint *o0 = NULL;
                                          uint2 streami0<4>;
                                          uint2 streamo0<4>;
                                          uint c = 200;
                                          int i = 0, j = 0;
                                          int mismatched = 0;

                                          i0 = (uint*)malloc(4 * 2 * sizeof(uint));
                                          i1 = (uint*)malloc(256 * sizeof(uint));
                                          o0 = (uint*)malloc(4 * 2 * sizeof(uint));

                                          for(j = 0; j < 256; j++)
                                          {
                                              i1[j] = (uint)j;
                                          }

                                          //! Brook code
                                          streamRead(streami0, i0);
                                          testkernel(streami0, i1, streamo0, c);
                                          streamWrite(streamo0, o0);

                                          for(i = 0; i < 4; i++)
                                          {
                                              if(o0[i * 2 + 1] != i1[c])
                                              {
                                                  mismatched = 1;
                                                  break;
                                              }
                                          }

                                          if(mismatched)
                                              printf("Failed!!\n");
                                          else
                                              printf("Passed!!\n");

                                          free(i0);
                                          free(i1);
                                          free(o0);
                                      }

                                    • Could use some help with arrays
                                      gaurav.garg

                                      error C2664: 'void __test2::operator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream' with [T=uint2] and [T=uint] No constructor could take the source type, or constructor overload resolution was ambiguous

                                      Compiling a Brook kernel with the -c flag changes a constant array into a stream. So it requires you to pass this parameter as a Stream and not as uint [2] (as mentioned in the error).

                                        • Could use some help with arrays
                                          yarr

                                          Thanks a lot for responding, that indeed made the program run properly.

                                          But, according to the manual (if I understood it correctly :)), not using the constant buffer should be way slower? In other words, do I drastically reduce performance by switching from a constant buffer to passing the data as another stream?
                                          And does this also mean that constant buffer behavior is somewhat broken, since I couldn't get elements from there?

                                          Also, let me explain why performance is so important to me:
                                          I'm a student, and some of my buddies and I picked the task of implementing the Russian GOST 28147-89 (http://en.wikipedia.org/wiki/GOST_28147-89) block cipher algorithm on various "unusual" devices as our graduation work. I picked an AMD GPU, while two other guys picked an NVIDIA GPU and the Cell BE processor in the Sony PS3 respectively. So basically it turned into a kind of "race" to see who gets the cipher to run fastest. :)

                                          Anyway, I already have a running program, but it's embarrassingly slow compared to the other "competitors". Running tests with 128 MB of input data produces speeds of around 50-60 MB/s (which is around the performance I get on my CPU), while the Cell guy has ~150 MB/s per core and the NVIDIA guy (he's using a G92 card) has around 2 GB/s.

                                          Judging by the theoretical speed of my card (I'm using a 4870 512 MB running at 775/4000, btw), it should easily outperform both of those, but it simply doesn't. :)

                                          This makes me think something's REALLY wrong with my program (although it looks a lot like the CUDA version, for example). So could you please look through it and point out any obvious bottlenecks that could be fixed to improve the speed?

                                          I'll just paste kernels I'm using and shortly explain what they do:
                                          kernel uint sbox_gpu(uint t, uint ker_s[4][256]) //this is subkernel for applying the substitution boxes
                                          //works byte-by-byte
                                          {
                                          uint x;
                                          x = ker_s[0][t>>(uint)24 & (uint)255] << (uint)24
                                            | ker_s[1][t>>(uint)16 & (uint)255] << (uint)16
                                            | ker_s[2][t>>(uint)8 & (uint)255] << (uint)8
                                            | ker_s[3][t & (uint)255];
                                          return x<<(uint)11 | x>>(uint)21;
                                          }

                                          //here's the main kernel, it accepts the input of: 1) 64 bit blocks, streamed
                                          //2) array of 256 bit cipher key sliced into 8 pieces
                                          //3) array of s-boxes to pass 'em to subkernel
                                          kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[], uint ker_s[][], out uint2 ker_out<>)
                                          {
                                          uint n1, n2;

                                          n1 = (uint)ker_in.x;
                                          n2 = (uint)ker_in.y;

                                          //swapping names each round instead of swapping blocks
                                          //-------------First block of 8 iterations-------------
                                          n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
                                          n2 = n2^sbox_gpu((n1+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);
                                          n2 = n2^sbox_gpu((n1+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);
                                          n2 = n2^sbox_gpu((n1+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);

                                          //-------------Second block of 8 iterations-------------
                                          // repeat the above part, I skipped it to conserve space.
                                          //-------------Third block of 8 iterations-------------
                                          // same deal, repeat the first part.
                                          //-------------Last (reversed) block of 8 iterations-------------
                                          n2 = n2^sbox_gpu((n1+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);
                                          n2 = n2^sbox_gpu((n1+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);
                                          n2 = n2^sbox_gpu((n1+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);
                                          n2 = n2^sbox_gpu((n1+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
                                          n1 = n1^sbox_gpu((n2+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);

                                          // There is no swap after the last round
                                          ker_out.x = n2;
                                          ker_out.y = n1;
                                          }

                                          You can see the whole sourcecode here: http://annihilator.comtv.ru/source.txt

                                          I'll be really grateful for any ideas about the ways of optimizing this code, huge thanks in advance.
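A CPU reference that mirrors the two kernels can be useful for checking GPU output while the kernels are being optimized. The following is a plain C sketch that follows the structure of the posted code (same table layout, same key order, same name-swapping scheme). The function names `sbox_ref` and `gostcrypt_ref` are hypothetical, and the sketch has not been validated against official GOST test vectors.

```c
#include <stdint.h>

/* Round function as written in sbox_gpu: substitute each byte through its
   own 256-entry table, then rotate the 32-bit result left by 11 bits. */
static uint32_t sbox_ref(uint32_t t, uint32_t s[4][256])
{
    uint32_t x = (s[0][(t >> 24) & 255u] << 24)
               | (s[1][(t >> 16) & 255u] << 16)
               | (s[2][(t >>  8) & 255u] <<  8)
               |  s[3][ t        & 255u];
    return (x << 11) | (x >> 21);
}

/* One 64-bit block: 24 rounds with key words 0..7 repeated three times,
   then 8 rounds with key words 7..0. Names are swapped instead of blocks. */
static void gostcrypt_ref(const uint32_t in[2], const uint32_t key[8],
                          uint32_t s[4][256], uint32_t out[2])
{
    uint32_t n1 = in[0], n2 = in[1];
    for (int round = 0; round < 32; round += 2) {
        int k0 = round < 24 ? round % 8       : 31 - round;        /* even round */
        int k1 = round < 24 ? (round + 1) % 8 : 31 - (round + 1);  /* odd round  */
        n2 ^= sbox_ref(n1 + key[k0], s);   /* uint32_t addition wraps mod 2^32 */
        n1 ^= sbox_ref(n2 + key[k1], s);
    }
    out[0] = n2;   /* no swap after the last round, as in the kernel */
    out[1] = n1;
}
```

Because `uint32_t` arithmetic already wraps modulo 2^32, the `& 0xFFFFFFFF` masks from the kernel are not needed in this reference.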



                                            • Could use some help with arrays
                                              gaurav.garg

                                              Hi yarr,

                                              Some easy changes that I think can help you:

                                              1. Change ker_in and ker_out into 2D streams of 4096 x 4096 (it should not affect your algorithm).

                                              2. Change ker_key into a constant buffer. You can keep both a constant buffer and a gather array (remove the specified sizes from the kernel declaration) in your program.

                                              kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[8], uint ker_s[][], out uint2 ker_out<>)

                                              Compile it without -c. It should work.

                                              3. Convert the uint2 ker_in and ker_out streams into uint4 streams. This halves your number of threads and increases your ALU usage while keeping only one fetch instruction.

                                              4. The last step might be to increase the number of outputs in a single kernel. Writing to multiple outputs helps you decrease the total number of threads and increase the ALU intensity of the kernel. But it looks like this step won't be very helpful in your case.

                                              Please follow each step one by one and see performance after each step. Let me know about your results.
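The 4096 x 4096 shape in step 1 and the halved thread count in step 3 follow from simple arithmetic on the input size. Here is a small C sketch of that arithmetic, assuming the "128mb" of input mentioned earlier means 128 * 2^20 bytes; the helper names are hypothetical.

```c
#include <stddef.h>

/* Number of stream elements needed for `bytes` of input when each element
   carries `elem_bytes` bytes (8 for uint2, 16 for uint4). */
static size_t element_count(size_t bytes, size_t elem_bytes)
{
    return bytes / elem_bytes;
}

/* Height of a 2D stream of the given width that holds `count` elements,
   rounded up so that a partial last row still fits. */
static size_t stream_height(size_t count, size_t width)
{
    return (count + width - 1) / width;
}
```

Under that assumption, 128 MB of uint2 elements is 16,777,216 elements, which is exactly 4096 x 4096, while the same data as uint4 elements needs only 4096 x 2048.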

                                                • Could use some help with arrays
                                                  yarr

                                                  Thanks gaurav, your advice really did speed it up somewhat. :)

                                                  Posting execution times for those kernels:
                                                  First (original) one with a 1D array of 128 MB input data: ~2.32 secs.
                                                  1) After changing those to 2D of 4096x4096: down to ~1.93.
                                                  2) After changing the keys array to be in the constant buffer: down to ~1.46.
                                                  That should be around 87 MB/s, which is still really, really slow, but way better than what I had previously. :)
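The throughput figures follow directly from the timings above. As a quick check, here is a tiny C helper (the name `throughput_mb_s` is hypothetical) that divides the data size by the elapsed time:

```c
/* Throughput in MB/s for `megabytes` of data processed in `seconds`. */
static double throughput_mb_s(double megabytes, double seconds)
{
    return megabytes / seconds;
}
```

For 128 MB this gives roughly 55 MB/s at 2.32 s, 66 MB/s at 1.93 s, and a bit under 88 MB/s at 1.46 s.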

                                                  The third step is not that trivial (it requires some tinkering with the s-boxes) and might even result in the kernel being slowed down instead, but I'll try to rewrite it anyway.

                                                  Btw, tried the modified "fastest" kernel with sbox subkernel just returning the input value (thus, the sbox routine being pretty much disabled, I guess) and got results of around 0.4 secs, which still looks quite slow...

                                                  Getting to the fourth step I get only one idea of maybe writing n1 and n2 to 2 separate outputs, but that'll require some extra time afterwards to rearrange them. Still worth a shot, I guess, so I'm on it. :)

                                                  EDIT: 4) Okay, tried 2 separate outputs: it got slower by approx. 0.2 to 0.3 secs, so I guess that won't really work.

                                                  EDIT2: 3) Tried some pretty straightforward, "brute" conversion to uint4 in/out streams by pretty much doubling kernel instructions:

                                                  n1 = (uint)ker_in.x;
                                                  n2 = (uint)ker_in.y;
                                                  n3 = (uint)ker_in.z;
                                                  n4 = (uint)ker_in.w;

                                                  n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
                                                  n4 = n4^sbox_gpu((n3+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
                                                  n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
                                                  n3 = n3^sbox_gpu((n4+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
                                                  [...]
                                                  and so on.

                                                  And well I get around 4.2 secs for executing this.
                                                  Removing s-boxes I get around 2.2 secs.

                                                  Even considering I doubled the input data (256 MB now), it's still a slowdown.
                                                  Either I need a more "graceful" vector conversion, or it simply won't work.
                                                  The main problem with proper vectorization is those substitutions.

                                                  P.S. I'm running Vista ultimate x64 SP1 if that matters much.
                                                  SDK I'm using is of latest (1.3) version.

                                                  Thanks again.

                                                  • Could use some help with arrays
                                                    ryta1203
                                                    Originally posted by: gaurav.garg

                                                    1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).

                                                    I have seen this mentioned before, is this true for all 1D streams? That they would get better performance if put into 2D streams?

                                                    And by how much (aka: is it worth the coding time)?

                                                      • Could use some help with arrays
                                                        gaurav.garg

                                                        First of all, wavefronts are created from quads (2x2 threads). So, if you have a 1D stream, you are actually using only half of the SIMD cores.

                                                        The second reason is some overhead from Brook+ virtualization. With CAL you can allocate a 1D resource of max size 8192 (2^13), but Brook+ allows 1D streams of size 2^26. It has to generate some extra code for mapping from the 1D stream to 2D texture space and vice versa. Keep in mind that this overhead only appears if your stream is larger than 8192.
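To give a feel for the extra per-access work such a mapping implies, here is a minimal C sketch of translating a linear element index to a 2D coordinate and back. It assumes a surface 8192 elements wide (the 1D limit quoted above); the code Brook+ actually generates is not shown in this thread, and the type and function names are hypothetical.

```c
#include <stdint.h>

#define MAX_1D_WIDTH 8192u   /* the 1D limit (2^13) quoted above */

/* Hypothetical helper type for a 2D texture coordinate. */
typedef struct { uint32_t x, y; } coord2d;

/* Map a linear element index onto a 2D surface MAX_1D_WIDTH elements wide. */
static coord2d linear_to_2d(uint32_t idx)
{
    coord2d c;
    c.x = idx % MAX_1D_WIDTH;   /* column within the row */
    c.y = idx / MAX_1D_WIDTH;   /* row number */
    return c;
}

/* Inverse mapping: 2D coordinate back to the linear element index. */
static uint32_t linear_from_2d(coord2d c)
{
    return c.y * MAX_1D_WIDTH + c.x;
}
```

A stream declared as 2D from the start addresses elements by (x, y) directly and needs neither the division nor the modulo.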

                                    • Could use some help with arrays
                                      MicahVillmow
                                      This is true, and in most cases it is a fairly large performance increase (as long as you are not using the global buffer). There are two reasons for this. One is that 1D streams have a limit on the size that is a magnitude smaller than 2D streams, and if the 1D stream is over that hardware limit, it requires address translation via a generic algorithm. If you can program your algorithm with 2D streams, then you lose this overhead, increasing performance.

                                      Also, even though our hardware is becoming more generalized, it is still a piece of hardware optimized for graphics which is inherently 2D. So by going the 2D stream route and using vector data types then you hit the fast paths that are highly optimized, thereby improving performance.
                                      • Could use some help with arrays
                                        MicahVillmow
                                        ryta,
                                        That is correct on vector types.

                                        The CAL compiler should generate equivalent code for both of those examples if there is no complicated code in between each of the assignments.
                                          • Could use some help with arrays
                                            ryta1203
                                            Have a few questions about 2D arrays:

                                            1. streamRead/streamWrite won't take a double pointer? How do you copy over a dynamic 2D array into a 2D stream of the same size?

                                            2. Looking at the following kernel I am having some problems porting it from 1D to 2D.

                                            Here is the 1D code which works fine:

                                            int x, y, idx;
                                            idx = instance().x;
                                            x = idx%gx;
                                            y = (int)floor((float)idx/(float)gx);
                                            Fs1to4 = Fin1to4[idx];
                                            Fs5to8 = Fin5to8[idx];
                                            Fs9 = Fin9[idx];
                                            if ((y > bk-1) && (y <= my-bk+1))
                                            {
                                            if (idx=gx*y) //x==0
                                            {
                                            Fs1to4 = Fin1to4[(mx-1)+gx*y];
                                            Fs5to8 = Fin5to8[(mx-1)+gx*y];
                                            Fs9.w = Fin9[(mx-1)+gx*y].w;
                                            }
                                            if (idx == ((mx+1)+(gx*y))) //x==mx+1
                                            {
                                            Fs1to4 = Fin1to4[2+gx*y];
                                            Fs5to8 = Fin5to8[2+gx*y];
                                            Fs9.w = Fin9[2+gx*y].w;
                                            }
                                            }

                                            And here is the 2D code which does not work:

                                            int x, y, idx;
                                            x = instance().x;
                                            y = instance().y;
                                            idx = x+gx*y;
                                            Fs1to4 = Fin1to4[x][y];
                                            Fs5to8 = Fin5to8[x][y];
                                            Fs9 = Fin9[x][y];
                                            if ((y > bk-1) && (y <= my-bk+1))
                                            {
                                            if (x==0)
                                            {
                                            Fs1to4 = Fin1to4[(mx-1)][y];
                                            Fs5to8 = Fin5to8[(mx-1)][y];
                                            Fs9.w = Fin9[(mx-1)][y].w;
                                            }
                                            if (x==(mx+1))
                                            {
                                            Fs1to4 = Fin1to4[2][y];
                                            Fs5to8 = Fin5to8[2][y];
                                            Fs9.w = Fin9[2][y].w;
                                            }
                                            }

                                            Now, I know the x==0 part works because I have tried it in the first code with no problems; however, the second part seems to not work properly even if I substitute it in the 1D code. I think I am missing something as far as how the 2D domain and instance() work.
                                              • Could use some help with arrays
                                                gaurav.garg

                                                1. There is no way you can pass a double pointer to streamRead. It has the same behavior as a CPU memcpy.

                                                2. When you index into a 2D gather array, it has to be C-style indexing.

                                                a[k][j] where k - row number and j - column number.
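Since streamRead copies from one contiguous block of memory, a dynamically allocated array of row pointers would first have to be packed into a single row-major buffer. A minimal C sketch of that step follows; the function name `flatten_rows` is hypothetical and not part of Brook+.

```c
#include <stdlib.h>
#include <string.h>

/* Copy a rows x cols array stored as an array of row pointers into one
   contiguous row-major buffer. Returns NULL on allocation failure;
   the caller frees the result. */
static float *flatten_rows(float *const *row_ptrs, size_t rows, size_t cols)
{
    float *flat = (float *)malloc(rows * cols * sizeof *flat);
    if (flat == NULL)
        return NULL;
    for (size_t r = 0; r < rows; r++)
        memcpy(flat + r * cols, row_ptrs[r], cols * sizeof *flat);
    return flat;
}
```

The resulting pointer can then be handed to streamRead in the same way as the malloc'ed buffers in the test case earlier in the thread.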

                                                  • Could use some help with arrays
                                                    ryta1203
                                                    Originally posted by: gaurav.garg

                                                    1. There is no way you can pass a double pointer to streamRead. It has the same behavior as a CPU memcpy.

                                                    2. When you index into a 2D gather array, it has to be C-style indexing.

                                                    a[k][j] where k - row number and j - column number.


                                                    1. So there is no way to use dynamically allocated multidimensional arrays in Brook+? That's a pretty big limitation no?

                                                    2. Yes, I understand this. I'm not sure I understand how this helps my 2nd question at all, could you elaborate?

                                                      • Could use some help with arrays
                                                        gaurav.garg

                                                        As in your 1D code you have written -

                                                        x = idx%gx;
                                                        y = (int)floor((float)idx/(float)gx);

                                                        That made me think x is the column index and y is the row index. Is that right?

                                                        If it's right, you need to change your code to - Fin1to4[y][x];
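The relationship between the 1D index and the [row][column] form can be written out explicitly. Below is a small C sketch of row-major indexing for a grid that is `gx` elements wide, matching the `idx % gx` and `idx / gx` expressions in the 1D code; the function names are hypothetical.

```c
#include <stddef.h>

/* Row-major layout: element (row y, column x) of a grid that is gx
   elements wide lives at linear index x + gx * y. */
static size_t linear_index(size_t x, size_t y, size_t gx)
{
    return x + gx * y;
}

/* Recover the column (x) and row (y) from a linear index. */
static size_t column_of(size_t idx, size_t gx) { return idx % gx; }
static size_t row_of(size_t idx, size_t gx)    { return idx / gx; }
```

With this layout, the 1D access `Fin1to4[x + gx*y]` and the 2D access `Fin1to4[y][x]` refer to the same element, which is why the row index comes first.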

                                                          • Could use some help with arrays
                                                            ryta1203
                                                            Originally posted by: gaurav.garg

                                                            As in your 1D code you have written -

                                                            x = idx%gx;
                                                            y = (int)floor((float)idx/(float)gx);

                                                            That made me think x is the column index and y is the row index. Is that right?

                                                            If it's right, you need to change your code to - Fin1to4[y][x];

                                                            So from this I could just swap the instance of:

                                                            x = instance().y;
                                                            y = instance().x;

                                                            Does that make sense? Just wondering because that doesn't seem to work. So are arrays in kernels in [column][row]? Are they put that way in streamRead?

                                                            ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???

                                                              • Could use some help with arrays
                                                                rick.weber

                                                                 

                                                                Originally posted by: ryta1203 ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???

                                                                Stream* xGPU[32/8];
                                                                float* x = (float*)malloc(sizeof(x[0])*npt);
                                                                int coordDim[] = {8192, 8};

                                                                // Upload the host array in blocks; each stream holds 8*8192 elements
                                                                for(int curBlock = 0; curBlock < 32; curBlock+=8)
                                                                {
                                                                xGPU[curBlock/8] = new Stream(2,coordDim);
                                                                // block b = curBlock/8 starts at offset 8*8192*b in the host array
                                                                xGPU[curBlock/8]->read(&x[8*8192*(curBlock/8)]);
                                                                }
                                                                This is something I cooked up a few months ago. Alternatively, you can call streamRead(*xGPU, x);
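
For the double-pointer question, one common approach is to copy the rows into a single contiguous buffer first and hand that buffer to the stream read. The sketch below is host-side C++ only; `flattenRows` is a hypothetical helper (not part of Brook+), and it assumes every row has the same width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a row-pointer ("double pointer") array into one contiguous,
// row-major buffer. flattenRows is an illustrative name, not a Brook+ call.
std::vector<float> flattenRows(float* const* rows, std::size_t height, std::size_t width)
{
    assert(rows != nullptr && height > 0 && width > 0);
    std::vector<float> flat(height * width);
    for (std::size_t r = 0; r < height; ++r)
        for (std::size_t c = 0; c < width; ++c)
            flat[r * width + c] = rows[r][c];   // element [r][c] lands at r*width + c
    return flat;
}
```

The resulting `flat.data()` pointer could then be passed wherever a plain `float*` is expected, for example to `streamRead` on a stream declared as `<height, width>` (assuming the layout discussed later in this thread).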

                                                                  • Could use some help with arrays
                                                                    gaurav.garg

                                                                    Did you try changing the other gather array accesses? Try this:

                                                                    int x, y, idx;
                                                                    x = instance().x;
                                                                    y = instance().y;
                                                                    idx = x+gx*y;
                                                                    Fs1to4 = Fin1to4[y][x];
                                                                    Fs5to8 = Fin5to8[y][x];
                                                                    Fs9 = Fin9[y][x];
                                                                    if ((y > bk-1) && (y <= my-bk+1))
                                                                    {
                                                                    if (x==0)
                                                                    {
                                                                    Fs1to4 = Fin1to4[y][(mx-1)];
                                                                    Fs5to8 = Fin5to8[y][(mx-1)];
                                                                    Fs9.w = Fin9[y][(mx-1)].w;
                                                                    }
                                                                    if (x==(mx+1))
                                                                    {
                                                                    Fs1to4 = Fin1to4[y][2];
                                                                    Fs5to8 = Fin5to8[y][2];
                                                                    Fs9.w = Fin9[y][2].w;
                                                                    }
                                                                    }

                                                                      • Could use some help with arrays
                                                                        ryta1203
                                                                        Gaurav,

                                                                        This doesn't work (I tried) and I wouldn't expect it to either. The conditions are based on the indexes of the columns and rows but have not been switched. I tried to switch them also and I get the same results.

                                                                        The results I get are as if the kernel is not called at all, meaning that it's either not making it into the first conditional or that it's not making it into the other two.

                                                                        This was one of the easiest kernels to implement in my 1D solution but for 2D it seems much trickier, although it really shouldn't be since the original code (CPU) is in 2D also.

                                                                        What I want to know is this:

                                                                        1. If you take a 2D array in C [row][column] and call streamRead, is it read in as [column][row]? If so, then do the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.

                                                                        2. If I understand you correctly, instance().x returns the column index and instance().y returns the row index, correct?

                                                                        In my 1D case, if I use idx==(mx+1)+gx*y then I get the correct results; however if I use x == mx+1 OR y == mx+1 I get incorrect results, so it might be that this problem is not specific to the 2D case just that I can't access the 2D using the 1D solution.


                                                                        EDIT: I thought it might be useful to post the original C code I am converting:

                                                                        for(y=BK;y<=MY-BK+1;y++)
                                                                        for(k=0; k<Q; k++)
                                                                        {
                                                                        F[0][y][k]=F[MX-1][y][k];
                                                                        F[MX+1][y][k]=F[2][y][k];
                                                                        }


                                                                        EDIT2: As a side note, I have seen significant speedup using 2D just through 2 (out of the 6) kernels and that solution is using a lot more streamRead/streamWrites than the full solution (as in the 1D version), so this is great if I can get it working!!
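
A small CPU reference for the boundary copy in the loop above can help when comparing kernel output against expected values. This is only a sketch: `Grid` and `applyPeriodicBoundary` are illustrative names, and the flat `F[x][y][k]` layout with the last index fastest is an assumption about the original array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat storage for an array indexed as F[x][y][k]; names are illustrative.
struct Grid {
    int nx, ny, q;              // extents of the three dimensions
    std::vector<float> data;    // nx * ny * q values, last index fastest
    Grid(int nx_, int ny_, int q_)
        : nx(nx_), ny(ny_), q(q_),
          data(static_cast<std::size_t>(nx_) * ny_ * q_, 0.0f) {}
    float& at(int x, int y, int k)
    {
        return data[(static_cast<std::size_t>(x) * ny + y) * q + k];
    }
};

// Copy slab MX-1 into slab 0 and slab 2 into slab MX+1
// for every y in [BK, MY-BK+1], mirroring the original loop.
void applyPeriodicBoundary(Grid& F, int MX, int MY, int BK)
{
    assert(MX + 1 < F.nx && MY - BK + 1 < F.ny);
    for (int y = BK; y <= MY - BK + 1; ++y)
        for (int k = 0; k < F.q; ++k) {
            F.at(0, y, k)      = F.at(MX - 1, y, k);
            F.at(MX + 1, y, k) = F.at(2, y, k);
        }
}
```

Running this on the host data and diffing against the stream output shows directly which elements the kernel failed to overwrite.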
                                                                          • Could use some help with arrays
                                                                            gaurav.garg

                                                                            Can you post your kernel signature and runtime code for 2D case?

                                                                              • Could use some help with arrays
                                                                                ryta1203
                                                                                kernel signature? runtime code?

                                                                                I'm not familiar with the workings of the code created by the brcc. Do you mean the generated IL?
                                                                                  • Could use some help with arrays
                                                                                    gaurav.garg

                                                                                    Kernel signature - the function signature of your Brook+ kernel, giving information about the kernel parameters

                                                                                    runtime code - Part of code where you declare streams and call kernel.

                                                                                      • Could use some help with arrays
                                                                                        ryta1203
                                                                                        void mcollid()
                                                                                        {
                                                                                        int step, x=0, y=0; float Norm1, Norm2, error1, error2;
                                                                                        float4 Fs1to4_1< gx, gy>;
                                                                                        float4 Fs1to4_2< gx, gy>;
                                                                                        float4 Fs5to8_1< gx, gy>;
                                                                                        float4 Fs5to8_2< gx, gy>;
                                                                                        float4 Fs9_1< gx, gy>;
                                                                                        float4 Fs9_2< gx, gy>;

                                                                                        float4 fs1to4_1< gx, gy>;
                                                                                        float4 fs5to8_1< gx, gy>;
                                                                                        float4 fs9_1< gx, gy>;
                                                                                        float4 fs1to4_2< gx, gy>;
                                                                                        float4 fs5to8_2< gx, gy>;
                                                                                        float4 fs9_2< gx, gy>;
                                                                                        float GEOs< gx, gy>;
                                                                                        float ss< 9>;
                                                                                        float rs< 9>;
                                                                                        int es< 18>;

                                                                                        streamRead(fs1to4_1, f1to4);
                                                                                        streamRead(fs5to8_1, f5to8);
                                                                                        streamRead(fs9_1, f9);
                                                                                        streamRead(Fs1to4_1, F1to4);
                                                                                        streamRead(Fs5to8_1, F5to8);
                                                                                        streamRead(Fs9_1, F9);
                                                                                        streamRead(ss, s);
                                                                                        streamRead(GEOs, GEO);
                                                                                        streamRead(rs, r);
                                                                                        streamRead(es, e);

                                                                                        mcollid_s(Fs1to4_1, Fs5to8_1, Fs9_1, fs1to4_1, fs5to8_1, fs9_1, GEOs, ss, G, Fs9_2, Fs5to8_2, Fs1to4_2);
                                                                                        Fs9_2.error();

                                                                                        advection1_s(Fs1to4_2, Fs5to8_2, Fs9_2, GEOs, es, gx, mx, my, rs, Fs9_1, Fs5to8_1, Fs1to4_1);
                                                                                        Fs9_1.error();

                                                                                        advection2_s(Fs1to4_1, Fs5to8_1, Fs9_1, gx, mx, my, bk, Fs9_2, Fs5to8_2, Fs1to4_2);
                                                                                        Fs9_2.error();

                                                                                        streamWrite(Fs1to4_2, F1to4);
                                                                                        streamWrite(Fs5to8_2, F5to8);
                                                                                        streamWrite(Fs9_2, F9);
                                                                                        }




                                                                                        kernel void advection2_s(float4 Fin1to4[][], float4 Fin5to8[][], float4 Fin9[][], int gx,
                                                                                        int mx, int my, int bk, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)
                                                                                        {
                                                                                        int x, y;
                                                                                        x = instance().x;
                                                                                        y = instance().y;
                                                                                        Fs1to4 = Fin1to4[x][y];
                                                                                        Fs5to8 = Fin5to8[x][y];
                                                                                        Fs9 = Fin9[x][y];
                                                                                        if ((y > bk-1) && (y <= my-bk+1))
                                                                                        {
                                                                                        if (x==0)
                                                                                        {
                                                                                        Fs1to4 = Fin1to4[(mx-1)][y];
                                                                                        Fs5to8 = Fin5to8[(mx-1)][y];
                                                                                        Fs9.w = Fin9[(mx-1)][y].w;
                                                                                        }
                                                                                        if (x==(mx+1))
                                                                                        {
                                                                                        Fs1to4 = Fin1to4[2][y];
                                                                                        Fs5to8 = Fin5to8[2][y];
                                                                                        Fs9.w = Fin9[2][y].w;
                                                                                        }
                                                                                        }
                                                                                        }
                                                                                  • Could use some help with arrays
                                                                                    gaurav.garg

                                                                                    I assume in float4 Fs5to8_2< gx, gy>;  gx is height and gy is width.

                                                                                    The only other part I have doubts about is your gather array accessing. All of your indexing has y (the current row index) as the second component. Not sure if you wanted to do this.

                                                                                     

                                                                                    1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.


                                                                                    Here is an example that shows how to do this:

                                                                                    float a<height, width>;
                                                                                    float aPtr[height][width]; // or float* aPtr = new float[height*width];
                                                                                    streamRead(a, aPtr);
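
To make the layout behind this example concrete: a C array declared as `[height][width]` is stored row by row, so a raw copy of it is the same as a flat `height*width` buffer indexed with `row*width + col`. The sketch below checks that on the host; `rowMajorOffset` and `flatCopyMatches` are hypothetical helper names, and the `memcpy` merely stands in for a bulk upload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Row-major offset of element [row][col] in an array that is `width` wide.
inline std::size_t rowMajorOffset(std::size_t row, std::size_t col, std::size_t width)
{
    assert(col < width);
    return row * width + col;
}

// Byte-copies a static 2D array into a flat buffer and checks that
// flat[row*W + col] equals a[row][col] for every element.
template <std::size_t H, std::size_t W>
bool flatCopyMatches(const float (&a)[H][W])
{
    std::vector<float> flat(H * W);
    std::memcpy(flat.data(), a, sizeof(a));   // raw copy, no reordering
    for (std::size_t r = 0; r < H; ++r)
        for (std::size_t c = 0; c < W; ++c)
            if (flat[rowMajorOffset(r, c, W)] != a[r][c])
                return false;
    return true;
}
```

So the two host declarations in the example (`float aPtr[height][width]` and `new float[height*width]`) describe the same memory layout; no transposition happens on the C side.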

                                                                                      • Could use some help with arrays
                                                                                        ryta1203
                                                                                        yes, gx is row and gy is column but this isn't relevant since gx=gy in my example.

                                                                                        Here is how I have it:

                                                                                        float4 F5to8[gx][gy];
                                                                                        float4 Fs5to8< gx,gy >;
                                                                                        streamRead(Fs5to8, F5to8);

                                                                                        This should be right since this works for the first two kernels, no?

                                                                                        If you look at the original code, the domain is y=bk to y<=my-bk+1 and inside this domain only the rows with index 0 and mx+1 are set; everything else stays the same.

                                                                                        I'm just not sure why instance().x == 0 and instance().x == mx+1 OR instance().y == 0 and instance().y == mx+1 don't work unless 1) instance() doesn't work as advertised OR (and the greater possibility) 2) I'm using instance() incorrectly.
                                                                                        • Could use some help with arrays
                                                                                          Gipsel

                                                                                           

                                                                                          Originally posted by: gaurav.garg
                                                                                          1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.


                                                                                          I give you an example that shows how to do this-

                                                                                          float a<height, width>;
                                                                                          float aPtr[height][width]; // or float* aPtr = new float[height*width];
                                                                                          streamRead(a, aPtr);



                                                                                          In my experience, that does not work consistently between the GPU and CPU backends. As an example, when defining a stream

                                                                                            unsigned int dimc[] = {height,width};
                                                                                            ::brook::Stream<double2> s(2,dimc);

                                                                                          and then invoking a kernel

                                                                                            kernel void test (*some optional input here*, out double2 s<>){
                                                                                                s.x = instance().x;
                                                                                                s.y = instance().y;
                                                                                            }

                                                                                          one sees at least with the CPU backend that instance().x is actually running from 0 to height-1, and instance().y from 0 to width-1. With the GPU-backend it is different (didn't get it working yet), but the indexing appears to be borked. To sum it up, with the CPU backend an array defined as

                                                                                          double t_a[height][width];

                                                                                          has to be fitted with a stream



                                                                                          unsigned int dimt[] = {width, height};
                                                                                          ::brook::Stream<double> t_s(2, dimt);

                                                                                          otherwise it is not working over here.

                                                                                          PS: Why is the forum software appending empty lines to my post? I can't edit them away; it only adds another two lines every time I edit my post :confused:

                                                                                            • Could use some help with arrays
                                                                                              gaurav.garg

                                                                                               

                                                                                              unsigned int dimc[] = {height,width};


                                                                                              When using the C++-style constructor, it has to be:

                                                                                              unsigned int dimc[] = {width, height};

                                                                                                • Could use some help with arrays
                                                                                                  ryta1203
                                                                                                  Is this a bug? The aforementioned code I posted works if BRT_RUNTIME=cpu, for example:

                                                                                                  kernel void advection2_s(float4 Fin1to4[], float4 Fin5to8[], float4 Fin9[], int gx,
                                                                                                  int mx, int my, int bk, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)
                                                                                                  {
                                                                                                  int x, y, idx;
                                                                                                  idx = instance().x;
                                                                                                  x = idx%gx;
                                                                                                  y = (int)floor((float)idx/(float)gx);
                                                                                                  Fs1to4 = Fin1to4[idx];
                                                                                                  Fs5to8 = Fin5to8[idx];
                                                                                                  Fs9 = Fin9[idx];
                                                                                                  if ((y > bk-1) && (y <= my-bk+1))
                                                                                                  {
                                                                                                  if (x==0)//(idx == gx*y)
                                                                                                  {
                                                                                                  Fs1to4 = Fin1to4[(mx-1)+gx*y];
                                                                                                  Fs5to8 = Fin5to8[(mx-1)+gx*y];
                                                                                                  Fs9.w = Fin9[(mx-1)+gx*y].w;
                                                                                                  }
                                                                                                  if (x==mx+1)//(idx == (mx+1)+gx*y)
                                                                                                  {
                                                                                                  Fs1to4 = Fin1to4[2+gx*y];
                                                                                                  Fs5to8 = Fin5to8[2+gx*y];
                                                                                                  Fs9.w = Fin9[2+gx*y].w;
                                                                                                  }
                                                                                                  }
                                                                                                  }

                                                                                                  This works when BRT_RUNTIME=CPU just fine but fails (gives incorrect results) when BRT_RUNTIME=CAL.

                                                                                                  I'm not familiar with the large differences between these two runtimes but I would think they would be built to act the same, no?
                                                                                                    • Could use some help with arrays
                                                                                                      gaurav.garg

                                                                                                      Did you check errorLog on your output stream? Something like-

                                                                                                      if(Fs9.error())
                                                                                                      {
                                                                                                          std::cout << Fs9.errorLog();
                                                                                                      }

                                                                                                       

                                                                                                        • Could use some help with arrays
                                                                                                          ryta1203
                                                                                                          Absolutely! There is no error on the output stream, no log is being printed hence the conditional is not being taken.

                                                                                                          I think it's more of a logical error, but I found it odd that in my 1D example above if(x==mx+1) works for CPU but not CAL.
                                                                                                            • Could use some help with arrays
                                                                                                              gaurav.garg

                                                                                                              Assign the x, y and idx values to one of the outputs and see if you can find the problem.

                                                                                                              My guess is that brcc is having an issue with the x and y calculation. For example, calculate x like this:

                                                                                                              x = idx-y*gx;

                                                                                                              Try different combinations for x & y calculation.
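To compare against what the kernel produces, the decomposition can be checked on the CPU first. This is only a minimal sketch: `split_index` and `Index2D` are hypothetical names, and it assumes the row is obtained as y = idx / gx (integer division) with gx being the domain width, which is not stated above.

```cpp
#include <cassert>

// Hypothetical helper (not part of Brook+): holds the 2D position
// recovered from a linear index.
struct Index2D {
    unsigned int x;  // column within the row
    unsigned int y;  // row
};

// Split a linear index into (x, y) for a domain that is gx elements wide.
// Assumption: elements are laid out row by row, so y = idx / gx.
Index2D split_index(unsigned int idx, unsigned int gx) {
    assert(gx > 0);           // a zero-width domain has no valid index
    Index2D r;
    r.y = idx / gx;           // integer division gives the row
    r.x = idx - r.y * gx;     // same value as idx % gx, written without %
    return r;
}
```

If the values written to the debug output differ from what this gives for the same idx and gx, the x and y calculation in the kernel is the likely culprit.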

                                                                                                                • Could use some help with arrays
                                                                                                                  ryta1203
                                                                                                                  Gaurav,

                                                                                                                  So, I finally got this working (I think, lol).

                                                                                                                  All I had to do was switch the instance() calls, for example:

                                                                                                                  FROM:
                                                                                                                  x = instance().x;
                                                                                                                  y = instance().y;

                                                                                                                  TO:
                                                                                                                  x = instance().y;
                                                                                                                  y = instance().x;

                                                                                                                  Everything else (all the other code, the gather array accesses, etc.) was left the same way, like F[x][y]. I didn't need to change them to F[y][x].

                                                                                                                  I honestly hope this saves some people time in the future.

                                                                                                                  Thanks for all the help gaurav, I really appreciate it!

                                                                                                                  I also understand that this can probably be worked out by combing through the SDK examples, but it would be nice to have it documented somewhere too (if it isn't already). Thanks.
                                                                                                          • Could use some help with arrays
                                                                                                            Gipsel

                                                                                                             

                                                                                                            Originally posted by: gaurav.garg
                                                                                                            unsigned int dimc[] = {height,width};

                                                                                                            When using C++ style constructor it has to be -

                                                                                                            unsigned int dimc[] = {width, height};





                                                                                                            Which is actually quite confusing (and not really documented?), as it is indexed and accessed (for instance as a gather input) like a C array, which is declared just the other way around (height first, then width).
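To make the mismatch concrete, here is a small host-side sketch. `stream_dims` and `flat_index` are hypothetical helpers, not Brook+ API, and the row-major layout is an assumption for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: for data declared C-style as a[rows][cols], build
// the dimension array in the {width, height} order quoted above.
std::array<unsigned int, 2> stream_dims(unsigned int rows, unsigned int cols) {
    return {cols, rows};  // width first, then height
}

// Hypothetical helper: offset of element [row][col], assuming row-major
// layout, given dims = {width, height}.
std::size_t flat_index(const std::array<unsigned int, 2>& dims,
                       unsigned int row, unsigned int col) {
    assert(col < dims[0]);  // column must be below the width
    assert(row < dims[1]);  // row must be below the height
    return static_cast<std::size_t>(row) * dims[0] + col;
}
```

So a 480x640 C array (480 rows, 640 columns) would get dimensions {640, 480}, while an access still reads as a[row][col].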

                                                                                                            Btw., I got my code running on the GPU now as well. It is really fast: it sustains about 120 GFlop/s (double precision) on an HD4870. That is quite close to the theoretical maximum for the instruction mix (there is not much MAD_64 in there, and Brook does not generate any, presumably for precision reasons?). But it would have been far easier if some standard functions were also supported for doubles. I had to build my own exp and sqrt versions for doubles in IL. The rest of it was written in Brook.

                                                                                            • Could use some help with arrays
                                                                                              gaurav.garg

                                                                                              Sorry, I didn't mention that instance() returns a short vector index, similar to indexof().

                                                                                              So, instance().x is column index, instance().y is row index and so on.
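A CPU-side emulation may help keep the two conventions apart. This is a sketch only: `Instance2D` and `gather` are hypothetical stand-ins, not Brook+ types, and the row-major storage is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the value instance() yields in a 2D kernel:
// x is the column index, y is the row index.
struct Instance2D {
    int x;
    int y;
};

// Row-major read that mirrors a gather access written as img[row][col].
float gather(const std::vector<float>& img, int width, Instance2D pos) {
    int col = pos.x;  // .x selects the column
    int row = pos.y;  // .y selects the row
    assert(col >= 0 && col < width);
    assert(row >= 0);
    std::size_t offset = static_cast<std::size_t>(row) * width + col;
    assert(offset < img.size());
    return img[offset];
}
```

With this convention the row index comes from .y and goes first in the brackets, which is why swapping the two assignments made the F[x][y] accesses line up.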

                                                                                               

                                                                                      • Could use some help with arrays
                                                                                        dukeleto
                                                                                        I can confirm that in the examples you can find lines like the following (this one extracted from ImageFilter.br):


                                                                                        int jj = instance().x;
                                                                                        int ii = instance().y;

                                                                                        // These are the offsets so no looping is needed

                                                                                        o_img = img[ii][jj] * mask[0][0];



                                                                                        Personally I find this unnerving! It works, but seeing reference code with something like i=instance().y; j=instance().x
                                                                                        is really disconcerting.

                                                                                        Regards