3 Replies Latest reply on Feb 7, 2012 5:17 PM by jeff_golds

    L1CacheHit on Radeon 5870 with global memory

    chevydevil

      Hello, I thought caching of global memory fetches was not possible at this time. How is it possible that gDEBugger and the AMD APP Profiler both report a cache hit rate of nearly fifty percent for a simple 7-point stencil kernel?

        • Re: L1CacheHit on Radeon 5870 with global memory
          MicahVillmow

          Caching has been supported in various forms since SDK 2.3. It started out as explicitly specified, then moved to auto-detection for caching, and then in SDK 2.6 moved to caching by default with auto-detection for uncached accesses.

            • Re: L1CacheHit on Radeon 5870 with global memory
              chevydevil

              Thanks for the answer, I didn't know that. The following kernel has an execution time of approx. 250,000 ns for a 128x128x128 problem size on my Radeon 5870.

               

              #define IX33(i,j,k) ((i) + ((X)*(j)) + ((X)*(Y)*(k)))
              __kernel void jacobi(__global float * x,
                                   __global float * x0,
                                   __global float * aux,
                                   const float a)
              
              {
                  int i = get_global_id(0);
                  int j = get_global_id(1);
                  int k = get_global_id(2);
              
              
                  int X = get_global_size(0);
                  int Y = get_global_size(1);
                  int Z = get_global_size(2);
              
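                  // Clamp neighbour indices to the grid edge. maxCoord is derived
                  // from X only, so this assumes a cubic grid (X == Y == Z).
                  int minCoord = 0;
                  int maxCoord = X-1;
              
                  // NOTE: `ac` is not declared in the kernel as posted.
                  float out;
                  out = ((x[IX33(max(i-1,minCoord),j,k)] + x[IX33(min(i+1,maxCoord),j,k)] +
                          x[IX33(i,max(j-1,minCoord),k)] + x[IX33(i,min(j+1,maxCoord),k)] +
                          x[IX33(i,j,max(k-1,minCoord))] + x[IX33(i,j,min(k+1,maxCoord))]) +
                         ac.x * x0[IX33(i,j,k)]) * a;
              
                  aux[IX33(i,j,k)] = out;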
              }
              

               

              So when I calculate the memory throughput as 32 bytes * 128 * 128 * 128 / 250,000 ns = 268 GB/s (32 bytes per work-item: 7 float reads plus 1 float write), there has to be something wrong with my calculation, right?
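
              The arithmetic above can be written out as a small host-side helper. This is only a sketch under the same assumptions as the post: every access is counted as a 4-byte float, and every access is counted whether it is served from cache or from memory. The function names `bytes_per_work_item` and `effective_bandwidth_gbs` are hypothetical and not part of any SDK.

              ```c
              #include <assert.h>
              #include <stddef.h>

              /* Bytes touched by one work-item of the jacobi kernel, counting
                 every access: 6 neighbour reads of x, 1 read of x0 and
                 1 write to aux, each a 4-byte float. */
              size_t bytes_per_work_item(void)
              {
                  const size_t reads = 6 + 1;
                  const size_t writes = 1;
                  const size_t float_size = 4; /* an OpenCL float is 32 bits */
                  return (reads + writes) * float_size;
              }

              /* Effective throughput in GB/s (1 GB = 1e9 bytes).
                 One byte per nanosecond equals one GB/s, so bytes / ns
                 is already in GB/s. */
              double effective_bandwidth_gbs(size_t nx, size_t ny, size_t nz,
                                             double elapsed_ns)
              {
                  assert(elapsed_ns > 0.0);
                  double total_bytes = (double)bytes_per_work_item()
                                     * (double)nx * (double)ny * (double)nz;
                  return total_bytes / elapsed_ns;
              }
              ```

              Calling `effective_bandwidth_gbs(128, 128, 128, 250000.0)` gives about 268.4 GB/s, matching the figure in the post. Because the count includes accesses that hit in cache, the result is an effective rate rather than a measurement of traffic to device memory.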