10 Replies Latest reply on Jun 15, 2009 9:52 PM by Ceq

    Brook 1.4 gather bug?

    Ceq

      I think there is a bug that affects gather operations when using big 1D streams. The following example takes two streams and merges their data in an interleaved way, for example:

      SIZE = 4

      vIn1 = { 0, 1, 2, 3 }

      vIn2 = { -0, -1, -2, -3}

      vOut = { 0 , -0, 1, -1, 2, -2, 3, -3}

       

      However, if you use a big stream, such as 2^20 elements:

      - Using Catalyst 9.2, Brook+ 1.4, MSVC 2005, Radeon 3870x2 or Radeon 4850, WinXP 64 or WinXP 32

      SIZE = 1 << 20;

      vIn1 = { 0, 1, 2, 3, ... }

      vIn2 = { -0, -1, -2, -3, ...}

      vOut = { 0 , 0, 1, 0, 2, 0, 3, 0, ...}

      - With Catalyst 9.5, a Radeon 4850 and WinXP 64, it returns undefined values instead of 0, usually stale memory contents.

      - With Catalyst 9.5, a Radeon 3870x2 and WinXP 32, the program aborts.

       

      Test case:

      #include <stdio.h>
      #include <stdlib.h>

      kernel void
      ker(float in1[], float in2[], out float out1<>)
      {
          int pos_x = instance().x;
          int i = pos_x >> 1;
          float t;
          if(pos_x & 0x01) {
              t = in2[i];
          } else {
              t = in1[i];
          }
          // Some operations with t
          out1 = t;
      }


      int main(int argc, char** argv) {
          const int SIZE = 1 << 20; // *** Wrong result using big sizes ***
          // const int SIZE = 1 << 10; // *** Right using small streams ***
          const int DSIZE = 2 * SIZE;
          unsigned int i;

          // Memory arrays (about 16 MB of stack at SIZE = 1 << 20;
          // raise the stack reserve or allocate them on the heap)
          float vIn1[SIZE],  vIn2[SIZE];
          float vOut[DSIZE];

          // Init
          for(i = 0; i < SIZE; ++i) {
              vIn1[i] =  (float)i;
              vIn2[i] = -(float)i;
          }

          {
              // Stream arrays
              float sIn1<SIZE>,  sIn2<SIZE>;
              float sOut<DSIZE>;
              // Load
              streamRead(sIn1, vIn1);
              streamRead(sIn2, vIn2);
              // Kernel
              ker(sIn1, sIn2, sOut);
              // Save
              streamWrite(sOut, vOut);
          }

          // Print
          for(i = 0; i < 8; i++)
              printf("vOut[%u] = (%7.3f);\n", i, vOut[i]);
      }

        • Brook 1.4 gather bug?
          emuller

          Isn't the max 1D resource size 2^13 = 8192?  For the 1<<20 case, did you check the streams allocate without error?

           

           

            • Brook 1.4 gather bug?
              Ceq

              Thanks for trying to help, Emuller. According to the user guide (2.2.1.2), the maximum size for a stream is 2^26 elements. If you use a large 1D stream, Brook+ should automatically enable address virtualization.

              In fact, there is a test in "samples/legacy/tests/address_translation" that exercises this feature, but it fails because address virtualization isn't working. Try running it with the command line "address_translation.exe -e -p -t -q -x 128 -y 128". This runs the test with 128 x 128 = 16384 elements.

                • Brook 1.4 gather bug?
                  gaurav.garg

                  It is a regression in Catalyst 9.4 and 9.5: CAL reports max1DWidth to be more than 8192, but resource allocation fails with width > 8192. It works fine with Catalyst 9.2 and 9.3.

                    • Brook 1.4 gather bug?
                      Ceq

                      Thanks, Gaurav. Using Catalyst 9.2, the test code I wrote in the first post returns a bad result. If I run it in GPU mode I get:

                      vOut = { 0, 0, 1, 0, 2, 0, 3, 0, ... }

                      However if I try using CPU backend I get the right result:

                      vOut = { 0 , -0, 1, -1, 2, -2, 3, -3, ...}

                      Any hint on this? Do you get the same results?

                      Note: In my installations, setting the environment variable "BRT_RUNTIME = CPU" to select the CPU backend no longer works. I had to add the following code before the "// Init" comment to force CPU mode. Could it be that Brook+ 1.4 no longer reads the BRT_RUNTIME variable at runtime?

                      ...
                      unsigned int count;
                      Device* device;
                      device = getDevices("cpu", &count);
                      useDevices(device, 1, NULL);
                      // Init
                      ...

                      WinXP 32, MSVC 2005, Radeon 3870x2, Brook+ 1.4, Catalyst 9.2

                        • Brook 1.4 gather bug?
                          gaurav.garg

                          Check error and errorLog() on your streams and see if it gives any information.

                          BRT_RUNTIME still works, but if you have called useDevices(), that call takes precedence over BRT_RUNTIME.

                            • Brook 1.4 gather bug?
                              Ceq

                              Using Catalyst 9.2 there are no error messages, but the result is wrong.

                              With Catalyst 9.5, it reports a memory allocation failure if width > 8192, as you said.

                               

                              Here is an updated test case. It now runs first on the CPU backend, prints the results, and then runs on the GPU backend. It also checks and prints Brook+ stream errors.

                               

                              File "ker.br"

                              ---------------------------------

                              kernel void
                              ker(float in1[], float in2[], out float out1<>)
                              {
                                  int pos_x = instance().x;
                                  int i = pos_x >> 1;
                                  float t;
                                  if(pos_x & 0x01) {
                                      t = in2[i];
                                  } else {
                                      t = in1[i];
                                  }
                                  // Some operations with t
                                  out1 = t;
                              }

                               

                              File "main.cpp"

                              ---------------------------------

                              #include <stdio.h>   // printf, puts
                              #include <stdlib.h>  // malloc, free, exit
                              #include "brook/Stream.h"
                              #include "brook/Device.h"
                              #include "built/ker.h"

                              using namespace std;
                              using namespace brook;

                              float *vIn1, *vIn2, *vOut;

                              void test(const char *backend, unsigned int size) {
                                  unsigned int i, count, dsize = 2 * size;
                                  Device* device = getDevices(backend, &count);
                                  useDevices(device, 1, NULL);
                                  printf("\nUsing %s backend\n", backend);

                                  // Stream arrays
                                  Stream<float> sIn1(1, &size);
                                  Stream<float> sIn2(1, &size);
                                  Stream<float> sOut(1, &dsize);
                                  
                                  // Load
                                  sIn1.read(vIn1);
                                  sIn2.read(vIn2);
                                  if(sIn1.error()) puts(sIn1.errorLog());
                                  if(sIn2.error()) puts(sIn2.errorLog());

                                  // Kernel
                                  ker(sIn1, sIn2, sOut);
                                  if(sIn1.error()) puts(sIn1.errorLog());
                                  if(sIn2.error()) puts(sIn2.errorLog());
                                  if(sOut.error()) puts(sOut.errorLog());

                                  // Save
                                  sOut.write(vOut);
                                  if(sOut.error()) puts(sOut.errorLog());

                                  // Print
                                  for(i = 0; i < 8; i++)
                                      printf("vOut[%u] = (%7.3f);\n", i, vOut[i]);
                              }

                              int main(int argc, char** argv) {

                                  unsigned int i, SIZE = 1 << 20; // *** Wrong result using big sizes ***
                                  // const int SIZE = 1 << 10; // *** Right using small streams ***

                                  // Memory arrays
                                  vIn1 = (float*)malloc(    SIZE * sizeof(float) );
                                  vIn2 = (float*)malloc(    SIZE * sizeof(float) );
                                  vOut = (float*)malloc(2 * SIZE * sizeof(float) );
                                  if(!vIn1 || !vIn2 || !vOut) {
                                      printf("Enlarge project heap memory first\n");
                                      exit(1);  // non-zero status: allocation failed
                                  }

                                  // Init
                                  for(i = 0; i < SIZE; ++i) {
                                      vIn1[i] =  (float)i;
                                      vIn2[i] = -(float)i;
                                  }

                                  test("cpu", SIZE);
                                  test("gpu", SIZE);

                                  // Free memory
                                  free(vIn1); free(vIn2); free(vOut);
                                  return 0;
                              }

                               

                              // WinXP 32, MSVC 2005, Radeon 3870x2, Brook+ 1.4, Catalyst 9.2

                                • Brook 1.4 gather bug?
                                  Ceq

                                  It looks like the test "/samples/legacy/test/domain" also fails, at least with Catalyst 9.5 (it works with 9.2).

                                  Could anybody please confirm the output of the test case in my previous post? I just want to make sure that the problem isn't related to my installation or to a mistake on my side.

                                  Thanks.

                                    • Brook 1.4 gather bug?
                                      gaurav.garg

                                      Domain test works fine for me with Catalyst 9.5. Are you using the default dimensions?

                                        • Brook 1.4 gather bug?
                                          Ceq

                                          That's strange. Yes, I was using the default dimensions and building in x64 mode. Note that it is the legacy test; the CPP version works fine. Well, I reverted to Catalyst 9.2 (I needed address translation) and this test is OK now.

                                          Does the code I wrote return the right result for you? If so, maybe there is something wrong with my installation because of going back and forth between Catalyst drivers to test programs. I'm currently using WinXP 32 and Catalyst 9.2, and the CPU and CAL outputs for that code differ. This is the output:


                                          Using cpu backend
                                          vOut[0] = (  0.000);
                                          vOut[1] = ( -0.000);
                                          vOut[2] = (  1.000);
                                          vOut[3] = ( -1.000);
                                          vOut[4] = (  2.000);
                                          vOut[5] = ( -2.000);
                                          vOut[6] = (  3.000);
                                          vOut[7] = ( -3.000);

                                          Using gpu backend
                                          vOut[0] = (  0.000);
                                          vOut[1] = (  0.000);
                                          vOut[2] = (  1.000);
                                          vOut[3] = (  0.000);
                                          vOut[4] = (  2.000);
                                          vOut[5] = (  0.000);
                                          vOut[6] = (  3.000);
                                          vOut[7] = (  0.000);

                                           

                                          WinXP 32 SP3, MSVC 2005, Radeon 3870x2, Brook+ 1.4, Catalyst 9.2

                                • Brook 1.4 gather bug?
                                  Ceq

                                  Hi, this is just to report that I've tried Catalyst 9.6, but address virtualization is still not working.

                                   

                                  Originally posted by: gaurav.garg
                                  It is a regression with Catalyst 9.5 or 9.4 in which CAL reports max1DWidth to be more than 8192, but resource allocation fails with width > 8192. It works fine with Catalyst 9.2 and 9.3.