14 Replies Latest reply on Jan 7, 2010 2:17 PM by LeeHowes

    Really Really Bad OpenCL performance on HD5850

    BarnacleJunior
      Is my initialization wrong?

      I've been dealing with DirectCompute on 5800 for a while, but I've read that OpenCL gets much better performance.. So I did a line-for-line port of the first pass of a prefix sum/scan routine from cs_5_0 HLSL to OpenCL.  When run on a 1<<20 element array 1000 times, the velocity of the algorithm on DirectCompute is about 6550 million values per second.  Implementing a full prefix scan I've gotten a radix sort on DX11 that does 65 million pairs per second (on a 2M element array).  It's less than half what a GTX280 does with CUDPP, but it's in the ballpark.


      My OpenCL code, however, is only doing 25 million elements per second on the very basic first-pass scan.  I'm running the test ten times in a row, so the card should have plenty of time to get to full speed.  Windowing graphics are also very sluggish and CPU utilization is 0%, so I'm sure it's not emulation.  I'm using the 9.12 hot fix driver on Win7 x64 RTM with a DDR2 PhenomII quadcore.

      My prefix sum kernel has 128 threads with 8 values per thread - this is the configuration that produces the best performance on DirectCompute.  It's almost as if the LDS isn't working or something.  Or am I making an incredibly stupid mistake using the CL library?

      Thanks,

      .sean

       

      The kernel code - scan.cl:

      #define NUM_THREADS (1<< NUM_LEVELS)

      // 32 channels means shift five and add
      #define BANK_ADDRESS(i) (i + (i>> 5))

      #define LOCAL_SIZE (BANK_ADDRESS(NUM_THREADS))

      void ThreadSum(uint tid, __local uint sharedSum[LOCAL_SIZE]) {
          uint tid2 = BANK_ADDRESS(tid);
         
          for(uint d = 0; d < NUM_LEVELS - 1; ++d) {
              barrier(CLK_LOCAL_MEM_FENCE);
              uint mask = (2<< d) - 1;
              uint offset = 1<< d;
              if(mask == (mask & tid))
                  sharedSum[tid2] += sharedSum[BANK_ADDRESS(tid - offset)];   
          }
          barrier(CLK_LOCAL_MEM_FENCE);
         
          if(0 == tid) {
              uint ai = BANK_ADDRESS(NUM_THREADS / 2 - 1);
              uint bi = BANK_ADDRESS(NUM_THREADS - 1);
             
              uint at = sharedSum[ai];
             
              sharedSum[ai] += sharedSum[bi];
              sharedSum[bi] += at + at;
          }

          for(uint d = NUM_LEVELS - 1; d; --d) {
              barrier(CLK_LOCAL_MEM_FENCE);
              uint mask = (1<< d) - 1;
              uint offset = 1<< (d - 1);
              if(mask == (mask & tid)) {
                  uint t = sharedSum[tid2];
                  uint r = BANK_ADDRESS(tid - offset);
                  sharedSum[tid2] += sharedSum[r];
                  sharedSum[r] = t;
              }
          }
          barrier(CLK_LOCAL_MEM_FENCE);
      }


      uint4 Inclusive4Sum(uint4 vec) {
          vec.y += vec.x;
          vec.z += vec.y;
          vec.w += vec.z;
          return vec;
      }

      __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1)))
      void PrefixSumBlock_Pass1(
          __global uint* pass1_values,
          __global uint* pass1_partialSums) {
         
          __local uint sharedSum[LOCAL_SIZE];
         
          uint tid = get_local_id(0);
          uint gid = get_group_id(0);
         
          uint index = 8 * NUM_THREADS * gid;
          uint aTarget = index + 8 * tid;
         
          uint4 a[2];
         
          for(uint i = 0; i < 2; ++i) {
              a.x = pass1_values[aTarget + 4 * i + 0];
              a
      .y = pass1_values[aTarget + 4 * i + 1];
              a.z = pass1_values[aTarget + 4 * i + 2];
              a
      .w = pass1_values[aTarget + 4 * i + 3];   
          }
         
          uint4 aInc[2];
          aInc[0] = Inclusive4Sum(a[0]);
          aInc[1] = Inclusive4Sum(a[1]) + aInc[0].w;
         
          uint tid2 = BANK_ADDRESS(tid);
          sharedSum[tid2] = aInc[1].w;
         
          ThreadSum(tid, sharedSum);
         
          uint total = sharedSum[BANK_ADDRESS(0)];
          uint aExc = sharedSum[tid2] - total;
         
          uint4 aSum[2];
         
          for(uint i = 0; i < 2; ++i)
              aSum = aInc - a + aExc;
             
          for(uint i = 0; i < 2; ++i) {
              pass1_values[aTarget + 4 * i + 0] = aSum
      .x;
              pass1_values[aTarget + 4 * i + 1] = aSum.y;
              pass1_values[aTarget + 4 * i + 2] = aSum
      .z;
              pass1_values[aTarget + 4 * i + 3] = aSum.w;
          }
         
          if(0 == tid)
              pass1_partialSums[gid] = total;   
      }

       

      My cpp:

      #pragma comment(lib, "opencl")
      #include <windows.h>
      #include <vector>
      #include <string>
      #include <fstream>
      #include <iostream>
      #include <cmath>

      #define __CL_ENABLE_EXCEPTIONS
      #include <cl/cl.h>

      typedef unsigned int uint;

      void CPUScan(const uint* source, uint* target, size_t numElements) {
          target[0] = 0;
          for(size_t i(1); i < numElements; ++i)
              target = target[i - 1] + source;
      }

      int main(int argc, char** argv) {
          std::ifstream f("scan.cl");
          std::string s(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>(0));

          cl_uint numPlatforms;
          cl_platform_id platform = 0;
          cl_int status = clGetPlatformIDs(0, 0, &numPlatforms);

          std::vector<cl_platform_id> platforms(numPlatforms);
          status = clGetPlatformIDs(numPlatforms, &platforms[0], 0);

          cl_int err;
          cl_uint numDevices;
          err = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, 0, &numDevices);
          std::vector<cl_device_id> devices(numDevices);
          clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, &devices[0], 0);

          cl_context_properties props[3] = {
              CL_CONTEXT_PLATFORM,
              (cl_context_properties)platforms[0],
              0
          };
          cl_context context = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, 0, 0, &err);

          const char* source = s.c_str();
          cl_program program = clCreateProgramWithSource(context, 1, &source, 0, &err);

          const int NumLevels = 7;
          err = clBuildProgram(program, 1, &devices[0], "-D NUM_LEVELS=7", 0, 0);

          if(err) {
              char buildLog[2048];
              err = clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                 sizeof(buildLog), buildLog, 0);

              std::cout<<buildLog;
              return 0;
          }

          // get the kernel
          cl_kernel kernel = clCreateKernel(program, "PrefixSumBlock_Pass1",  &err);
          
          cl_uint numArgs;
          clGetKernelInfo(kernel, CL_KERNEL_NUM_ARGS, sizeof(cl_uint), &numArgs, 0);


          // create the buffers
          const size_t NumElements = 1<< 20;
          const int NumLoops = 1000;
          
          const uint ThreadsPerGroup = 1<< NumLevels;
          const uint ValuesPerThread = 8;
          const uint ValuesPerGroup = ValuesPerThread * ThreadsPerGroup;
          const uint NumGroups = NumElements / ValuesPerGroup;


          std::vector<uint> values(NumElements);
          for(size_t i(0); i < NumElements; ++i)
              values = 1;
          
          cl_mem pass1Values = clCreateBuffer(context, CL_MEM_READ_WRITE |
              CL_MEM_COPY_HOST_PTR, 4 * NumElements, &values[0], &err);
          cl_mem pass1PartialSums = clCreateBuffer(context, CL_MEM_READ_WRITE |
              0, 4 * NumGroups, 0, &err);

          cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], 0, &err);

          err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &pass1Values);
          err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &pass1PartialSums);

          cl_uint workSize = NumElements / ValuesPerThread;
          cl_uint workGroupSize = ThreadsPerGroup;

          LARGE_INTEGER freq;
              LARGE_INTEGER begin, end;
          QueryPerformanceFrequency(&freq);
          double period = 1.0 / freq.QuadPart;

          for(int j(0); j < 10; ++j) {
              QueryPerformanceCounter(&begin);

              for(int i(0); i < NumLoops; ++i) {
                  err = clEnqueueNDRangeKernel(commandQueue, kernel, 1, 0, &workSize, &workGroupSize,
                      0, 0, 0);
                  if(err) {
                      printf("Err = %d on iteration %d\n", err, i);
                      return 0;
                  }
              }
              clEnqueueReadBuffer(commandQueue, pass1Values, CL_TRUE, 0, 4 * NumElements,
                  &values[0], 0, 0, 0);

              QueryPerformanceCounter(&end);

              double elapsed = period * (end.QuadPart - begin.QuadPart);

              double velocity = NumElements * (NumLoops / elapsed);
              printf("GPU velocity: %1.3fM\n", velocity / 1.0e6);
          }

          std::vector<uint> partialSums(NumGroups);
          clEnqueueReadBuffer(commandQueue, pass1PartialSums, CL_TRUE, 0, 4 * NumGroups,
              &partialSums[0], 0, 0, 0);

          std::vector<uint> values2;
          values2.resize(NumElements);
          QueryPerformanceCounter(&begin);
          for(int i(0); i < NumLoops; ++i) {
              CPUScan(&values[0], &values2[0], NumElements);
              values.swap(values2);
          }
          QueryPerformanceCounter(&end);
          double elapsed = period * (end.QuadPart - begin.QuadPart);
          double velocity = NumElements * (NumLoops / elapsed);
          printf("CPU velocity: %1.3fM\n", velocity / 1.0e6);
      }

       

        • Really Really Bad OpenCL performance on HD5850
          MicahVillmow
          Some recommendations,
          Don't use this: uint4 aInc[2];
          Hardware private is not enabled yet, so using arrays like this goes out to main memory because stack/private memory is emulated in global.

          Vectorize your LDS access:
          for(uint i = 0; i < 2; ++i) {
          pass1_values[aTarget + 4 * i + 0] = aSum.x;
          pass1_values[aTarget + 4 * i + 1] = aSum.y;
          pass1_values[aTarget + 4 * i + 2] = aSum.z;
          pass1_values[aTarget + 4 * i + 3] = aSum.w;
          }

          This can be done with two writes of a uint4 instead of 8 writes of a scalar uint. 4x faster.

          Unroll your loops, the above can be written as:
          pass1_values[aTarget] = aSum[0];
          pass1_values[aTarget + 4] = aSum[1];


          The second/third recommendation are minor, the first one is your performance bottleneck.
            • Really Really Bad OpenCL performance on HD5850
              BarnacleJunior

              Ok, I've manually unrolled everything.  Will the compiler unroll static loops in the future?  Also, I don't know what you mean by

              pass1_values[aTarget] = aSum[0];
              pass1_values[aTarget + 4] = aSum[1];

              pass1_values is only type uint*.  Are you allowed to cast to get the assignment to work somehow?  Also, DirectCompute instruction stream generates separate load and store instructions for structured buffers for every DWORD in the struct.  Is that suboptimal too?

              It still only does 520 million uints per second.  D3D11 does 6800 uints per second.  Summing up arrays of 1 million 5000 times.

              .sean


              __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1)))
              void PrefixSumBlock_Pass1(
                  __global uint* pass1_values,
                  __global uint* pass1_partialSums) {
                 
                  __local uint sharedSum[LOCAL_SIZE];
                 
                  uint tid = get_local_id(0);
                  uint gid = get_group_id(0);
                 
                  uint index = 8 * NUM_THREADS * gid;
                  uint aTarget = index + 8 * tid;
                 
                  uint4 aLow, aHigh;
                  aLow.x = pass1_values[aTarget + 0];
                  aLow.y = pass1_values[aTarget + 1];
                  aLow.z = pass1_values[aTarget + 2];
                  aLow.w = pass1_values[aTarget + 3];   
                  aHigh.x = pass1_values[aTarget + 4];
                  aHigh.y = pass1_values[aTarget + 5];
                  aHigh.z = pass1_values[aTarget + 6];
                  aHigh.w = pass1_values[aTarget + 7];   
                 
                  uint4 aIncLow, aIncHigh;
                  aIncLow = Inclusive4Sum(aLow);
                  aIncHigh = Inclusive4Sum(aHigh) + aIncLow.w;
                 
                  uint tid2 = BANK_ADDRESS(tid);
                  sharedSum[tid2] = aIncHigh.w;
                 
                  ThreadSum(tid, sharedSum);
                 
                  uint total = sharedSum[BANK_ADDRESS(0)];
                  uint aExc = sharedSum[tid2] - total;
                 
                  uint4 aSumLow, aSumHigh;
                  aSumLow = aIncLow - aLow + aExc;
                  aSumHigh = aIncHigh - aHigh + aExc;

                  pass1_values[aTarget + 0] = aSumLow.x;
                  pass1_values[aTarget + 1] = aSumLow.y;
                  pass1_values[aTarget + 2] = aSumLow.z;
                  pass1_values[aTarget + 3] = aSumLow.w;
                  pass1_values[aTarget + 4] = aSumHigh.x;
                  pass1_values[aTarget + 5] = aSumHigh.y;
                  pass1_values[aTarget + 6] = aSumHigh.z;
                  pass1_values[aTarget + 7] = aSumHigh.w;

                  if(0 == tid)
                      pass1_partialSums[gid] = total;   
              }

               

               

            • Really Really Bad OpenCL performance on HD5850
              MicahVillmow
              BarnacleJunior,
              DirectCompute uses structured or typed UAV's, OpenCL uses Raw UAV's. In order to get the best performance, you want to use vector load/stores instead of scalar. This can be fixed by adding this line at the top of the kernel:
              global uint4* pass1_values_vec4 = (__global uint4*) pass1_values;
              Then:

              aLow.x = pass1_values[aTarget + 0];
              aLow.y = pass1_values[aTarget + 1];
              aLow.z = pass1_values[aTarget + 2];
              aLow.w = pass1_values[aTarget + 3];
              aHigh.x = pass1_values[aTarget + 4];
              aHigh.y = pass1_values[aTarget + 5];
              aHigh.z = pass1_values[aTarget + 6];
              aHigh.w = pass1_values[aTarget + 7];
              Becomes:
              aLow = pass1_values_vec4[aTarget/4];
              aHigh = pass1_values_vec4[aTarget/4 + 1];

              Same with the writes of aSumLow and aSumHigh.

              Also, another thing you might want to try is to use uint8 instead of uint4. This will greatly simplify a lot of your code.
              • Really Really Bad OpenCL performance on HD5850
                eduardoschardong

                 

                Originally posted by: BarnacleJunior but I've read that OpenCL gets much better performance..


                Not exactly, and the difference shouldn't be big anyway.

                // 32 channels means shift five and add

                #define BANK_ADDRESS(i) (i + (i>> 5))

                Well... No...

                Imagine the local data share as an array of floats (or uints, in this case), position 0, 32, 64, 96... are in bank 0, position 1, 33, 65, 97... are in bank 1 and so up to bank 31 so you want reads to be from different banks and there are 16 threads reading up to two banks each each cycle, if thread 0 reads from position 0 (bank 0) then thread 1 reads from position 1 and so there will be no bank conflict, if thread 0 reads from position 0 and, thread 1 from position 2 and 3 and so no bank conflict again, but if thread 0 reads from position 0, thread 1 from position 4 when we reach thread 8 there will be a bank conflict and the kernel will slow down, the code above will eventually cause some bank conflits, reading directly from LDS wouldn't, I'm not sure how the compiler handle uint4 and uint8 but since it's easy to workaround the bank conflict I believe it does it.

                 

                sharedSum[tid2] += sharedSum[BANK_ADDRESS(tid - offset)];


                This was what I was talking to when I wrote about too many loads and stores, this single line generates two loads, one add and one store, the lines above generates many more instructions but this line serve to point the problem, the add it generates is the only useful instruction here, all other are just control, the hardware can perform a lot of adds but no so many of others like, loads and stores from LDS, instead of adding two numbers on each loop interation try to add 8 or 16, maybe adding 64 goes faster since the hardware are capable of executing a lot of adds replicating them a bit won't drop the perfromance.

                I'm not sure on the behavior of atomic operations, but coudln't and AtomicAdd solve your problem?

                And, please, when posting source code please attach instead of posting directly, the forum is interpreteing [ i ] as italic.

                 

                  • Really Really Bad OpenCL performance on HD5850
                    BarnacleJunior

                    Thanks for the help eduardo.  I know the code will cause some bank conflicts, but I'm just trying to minimize things.  I don't know how to increase the density of integer operations in this routine.  It's just the garden variety scan ( http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html ).  I have no idea how it can be written to reduce the load-modify-store patterns.  Algorithmically I don't think I can improve it much by doing more work in each thread.

                    And whatever the bank conflicts or LDS load-modify-store issues are, they aren't causing the terrible performance, as I'm doing the same thing on DirectCompute, and that's running 26x faster.  It's just the first pass of a prefix sum that I started this thread with.  I tried reading and writing to global memory by pass1_values to __global uint8*, but that didn't change the throughput at all on ATI (although it did help on my GT120M).  The complete prefix sum (multiple passes on a 1<<20 element array) does 6700M uints on D3D11 and 252M on CL.  You know where I can get an implementation of this that runs well on ATI cards, so I can learn?  It's such an incredibly necessary algorithm to have, as it's the basis for enumeration and sorts and all of that.

                    Another thing I noticed is that under CL you can corrupt the global memory by writing past the end of your buffer.  To the best of my knowledge, D3D11 simply discards writes past the end of a UAV.  Will this behavior in CL change in the future, or is it some kind of choice made for performance?

                      • Really Really Bad OpenCL performance on HD5850
                        BarnacleJunior

                        In case I'm doing something stupid that I'm not seeing, I'm pasting the prefix sum code I'm benchmarking with.  It uses cl.hpp.  Again, the D3D11 equivalent gets 6700M/s and CL only gets 252M/s roughly.

                         

                         

                        --- scan.cl #define NUM_THREADS (1<< NUM_LEVELS) // 32 channels means shift five and add #define BANK_ADDRESS(i) (i + (i>> 5)) #define LOCAL_SIZE (BANK_ADDRESS(NUM_THREADS)) void ThreadSum(uint tid, __local uint sharedSum[LOCAL_SIZE]) { uint tid2 = BANK_ADDRESS(tid); for(uint d = 0; d < NUM_LEVELS - 1; ++d) { barrier(CLK_LOCAL_MEM_FENCE); uint mask = (2<< d) - 1; uint offset = 1<< d; if(mask == (mask & tid)) sharedSum[tid2] += sharedSum[BANK_ADDRESS(tid - offset)]; } barrier(CLK_LOCAL_MEM_FENCE); if(0 == tid) { uint ai = BANK_ADDRESS(NUM_THREADS / 2 - 1); uint bi = BANK_ADDRESS(NUM_THREADS - 1); uint at = sharedSum[ai]; sharedSum[ai] += sharedSum[bi]; sharedSum[bi] += at + at; } for(uint d = NUM_LEVELS - 1; d; --d) { barrier(CLK_LOCAL_MEM_FENCE); uint mask = (1<< d) - 1; uint offset = 1<< (d - 1); if(mask == (mask & tid)) { uint t = sharedSum[tid2]; uint r = BANK_ADDRESS(tid - offset); sharedSum[tid2] += sharedSum[r]; sharedSum[r] = t; } } barrier(CLK_LOCAL_MEM_FENCE); } uint4 Inclusive4Sum(uint4 vec) { vec.y += vec.x; vec.z += vec.y; vec.w += vec.z; return vec; } uint8 Inclusive8Sum(uint8 vec) { uint8 result; result.lo = Inclusive4Sum(vec.lo); result.hi = Inclusive4Sum(vec.hi) + result.lo.w; return result; } __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1))) void PrefixSumBlock_Pass1( __global uint* pass1_values, __global uint* pass1_partialSums) { __local uint sharedSum[LOCAL_SIZE]; global uint8* pass1_values_vec8 = (__global uint8*)pass1_values; uint tid = get_local_id(0); uint gid = get_group_id(0); uint index = NUM_THREADS * gid + tid; uint8 a = pass1_values_vec8[index]; uint8 aInc = Inclusive8Sum(a); uint tid2 = BANK_ADDRESS(tid); sharedSum[tid2] = aInc.s7; ThreadSum(tid, sharedSum); uint total = sharedSum[BANK_ADDRESS(0)]; uint aExc = sharedSum[tid2] - total; uint8 sum = aInc - a + aExc; pass1_values_vec8[index] = sum; if(0 == tid) pass1_partialSums[gid] = total; } /////////////////////////////////////////////////////////////////////////////////////////////////// // Finalize __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1))) void PrefixSumBlock_Finalize( __global uint* finalize_values) { __local uint sharedSum[LOCAL_SIZE]; __global uint8* finalize_values_vec8 = (__global uint8*)finalize_values; uint tid = get_local_id(0); uint gid = get_group_id(0); uint index = NUM_THREADS * gid + tid; uint8 a = finalize_values_vec8[index]; uint8 aInc = Inclusive8Sum(a); uint tid2 = BANK_ADDRESS(tid); sharedSum[tid2] = aInc.s7; ThreadSum(tid, sharedSum); uint total = sharedSum[BANK_ADDRESS(0)]; uint exc = sharedSum[tid2] - total; finalize_values_vec8[index] = aInc - a + (uint8)exc; } /////////////////////////////////////////////////////////////////////////////////////////////////// // Pass 2 __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1))) void PrefixSumBlock_Pass2( __global uint* pass2_values, __global const uint* pass2_offsets) { __global uint8* pass2_values_vec8 = (__global uint8*)pass2_values; uint tid = get_local_id(0); uint gid = get_group_id(0); uint index = NUM_THREADS * gid + tid; uint8 a = pass2_values_vec8[index]; uint partialSum = pass2_offsets[gid]; pass2_values_vec8[index] = a + (uint8)partialSum; } -- opencl.cpp #pragma comment(lib, "opencl") #include <windows.h> #include <vector> #include <string> #include <fstream> #include <iostream> #include <cmath> #include <cassert> #define __CL_ENABLE_EXCEPTIONS #include <cl/cl.hpp> typedef unsigned int uint; const size_t NumLoops = 5000; const size_t NumElements = 1<< 20; template<typename T> bool DumpBuffer(cl::CommandQueue& queue, cl::Buffer& buffer, std::vector<T>& data) { size_t size; buffer.getInfo(CL_MEM_SIZE, &size); if(size % sizeof(T)) return false; data.resize(size / sizeof(T)); queue.enqueueReadBuffer(buffer, true, 0, size, &data[0]); return true; } class PrefixSumCache { public: size_t Capacity() const { return _capacity; } cl::Buffer& BaseLevel() { return _levels[0]; } private: friend class CLPrefixSum; std::vector<cl::Buffer> _levels; size_t _capacity, _valuesPerGroup; }; class CLPrefixSum { public: void CreatePrefixSumCache(size_t capacity, std::auto_ptr<PrefixSumCache>* ppCache); void Execute(cl::CommandQueue& queue, size_t numElements, PrefixSumCache* cache); private: friend void CreateCLPrefixSum(cl::Context& context, size_t valuesPerGroup, size_t threadsPerGroup, cl::Kernel& pass1, cl::Kernel& finalize, cl::Kernel& pass2, std::auto_ptr<CLPrefixSum>* ppPrefixSum); static void LevelCounts(size_t numElements, size_t valuesPerGroup, std::vector<size_t>& levelCounts); cl::Context _context; cl::Kernel _pass1, _finalize, _pass2; size_t _valuesPerGroup, _threadsPerGroup; }; void CreateCLPrefixSum(cl::Context& context, size_t valuesPerGroup, size_t threadsPerGroup, cl::Kernel& pass1, cl::Kernel& finalize, cl::Kernel& pass2, std::auto_ptr<CLPrefixSum>* ppPrefixSum) { std::auto_ptr<CLPrefixSum> prefixSum(new CLPrefixSum); prefixSum->_context = context; prefixSum->_pass1 = pass1; prefixSum->_finalize = finalize; prefixSum->_pass2 = pass2; prefixSum->_valuesPerGroup = valuesPerGroup; prefixSum->_threadsPerGroup = threadsPerGroup; *ppPrefixSum = prefixSum; } void CLPrefixSum::LevelCounts(size_t numElements, size_t valuesPerGroup, std::vector<size_t>& levelCounts) { // write the dispatch count for each pass1/pass2 level. while(numElements > valuesPerGroup) { numElements = (numElements + valuesPerGroup - 1) / valuesPerGroup; levelCounts.push_back(numElements); } } void CLPrefixSum::CreatePrefixSumCache(size_t capacity, std::auto_ptr<PrefixSumCache>* ppCache) { std::auto_ptr<PrefixSumCache> cache(new PrefixSumCache); cache->_capacity = capacity; cache->_valuesPerGroup = _valuesPerGroup; std::vector<size_t> levelCounts; LevelCounts(capacity, _valuesPerGroup, levelCounts); capacity = ~(_valuesPerGroup - 1) & (capacity + _valuesPerGroup - 1); cache->_levels.push_back(cl::Buffer(_context, CL_MEM_READ_WRITE, 4 * capacity)); for(size_t i(0); i < levelCounts.size(); ++i) { size_t count = levelCounts[i]; capacity = ~(_valuesPerGroup - 1) & (count + _valuesPerGroup - 1); cache->_levels.push_back(cl::Buffer(_context, CL_MEM_READ_WRITE, 4 * capacity)); } *ppCache = cache; } void CLPrefixSum::Execute(cl::CommandQueue& queue, size_t numElements, PrefixSumCache* cache) { assert(cache->_valuesPerGroup == _valuesPerGroup); std::vector<size_t> levelCounts; LevelCounts(numElements, _valuesPerGroup, levelCounts); for(size_t level(0); level < levelCounts.size(); ++level) { cl::KernelFunctor f(_pass1, queue, 0, levelCounts[level] * _threadsPerGroup, _threadsPerGroup); f(cache->_levels[level], cache->_levels[level + 1]); } cl::KernelFunctor f(_finalize, queue, 0, _threadsPerGroup, _threadsPerGroup); f(cache->_levels.back()); for(int level(levelCounts.size() - 1); level >= 0; --level) { cl::KernelFunctor f(_pass2, queue, 0, levelCounts[level] * _threadsPerGroup, _threadsPerGroup); f(cache->_levels[level], cache->_levels[level + 1]); } } int main(int argc, char** argv) { // read in the file std::ifstream f("scan.cl"); std::string s(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>(0)); std::vector<cl::Platform> platforms; cl::Platform::get(&platforms); std::vector<cl::Device> devices; platforms.front().getDevices(CL_DEVICE_TYPE_GPU, &devices); cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 }; cl::Context context(CL_DEVICE_TYPE_GPU, props); cl::CommandQueue commandQueue(context, devices[0]); cl::Program program(context, cl::Program::Sources(1, std::make_pair(s.c_str(), s.size()))); int err = program.build(devices, "-D NUM_LEVELS=7"); if(err) { std::string buildError; program.getBuildInfo(devices[0], CL_PROGRAM_BUILD_LOG, &buildError); std::cout<< buildError; return 0; } cl::Kernel pass1(program, "PrefixSumBlock_Pass1", &err); cl::Kernel finalize(program, "PrefixSumBlock_Finalize", &err); cl::Kernel pass2(program, "PrefixSumBlock_Pass2", &err); const int NumLevels = 7; const int NumThreads = 1<< NumLevels; const int ValuesPerThread = 8; const int ValuesPerGroup = 8 * NumThreads; std::auto_ptr<CLPrefixSum> prefixSum; CreateCLPrefixSum(context, ValuesPerGroup, NumThreads, pass1, finalize, pass2, &prefixSum); std::auto_ptr<PrefixSumCache> prefixSumCache; prefixSum->CreatePrefixSumCache(NumElements, &prefixSumCache); std::vector<uint> values(NumElements, 1); commandQueue.enqueueWriteBuffer(prefixSumCache->BaseLevel(), true, 0, NumElements * 4, &values[0]); commandQueue.finish(); LARGE_INTEGER freq; LARGE_INTEGER begin, end; QueryPerformanceFrequency(&freq); double period = 1.0 / freq.QuadPart; for(int j(0); j < 10; ++j) { QueryPerformanceCounter(&begin); for(int i(0); i < NumLoops; ++i) prefixSum->Execute(commandQueue, NumElements, prefixSumCache.get()); std::vector<uint> values2; DumpBuffer(commandQueue, prefixSumCache->BaseLevel(), values2); QueryPerformanceCounter(&end); double elapsed = period * (end.QuadPart - begin.QuadPart); double velocity = NumElements * (NumLoops / elapsed); printf("GPU velocity: %1.3fM\n", velocity / 1.0e6); } }

                    • Really Really Bad OpenCL performance on HD5850
                      MicahVillmow
                      "Another thing I noticed is that under CL you can corrupt the global memory by writing past the end of your buffer. To the best of my knowledge, D3D11 simply discards writes past the end of a UAV."

                      The global address space is uniform in OpenCL, so buffer overruns are possible. I haven't looked into how D3D11 produces code that deeply, but it seems that they use a single UAV per pointer, whereas in OpenCL this is not possible because of language design/hardware constraints.
                      • Really Really Bad OpenCL performance on HD5850
                        MicahVillmow
                        Also, try to remove as much flow control as possible to improve performance. Unroll loops, do redundant calculations instead of conditional blocks, use atomic ops.
                        For example:
                        uint at = sharedSum[ai];
                        uint bi = sharedSum[ bi];
                        uint atpbi = at + bi;
                        uint at2bi = at + at + bi;
                        bool write = 0 == tid;
                        uint writeai = select(ai, atpbi, write);
                        uint writebi = select(bi, at2bi, write);
                        sharedSum[ai] = writeai;
                        sharedSum[bi] = writebi;

                        In this case, if ai/bi is unique per thread in a group if the condition is false, the thread writes the old value back to LDS, if the condition is true, the thread updates the value.

                        This should be faster than the if statement because there is no flow control and updating LDS occurs in a ALU slot, so it is very fast. You can also use the atomic cmpxchg if the thread values are not unique, so only thread with the new value will update, everyone else will just update with the old value and not update once the new value has been added.
                        • Really Really Bad OpenCL performance on HD5850
                          MicahVillmow
                          marco,
                          We are investigating this issue, if you have any specific kernels that you want us to test can you post them here or send them to streamdeveloper@amd.com Re: Micah Villmow.
                            • Really Really Bad OpenCL performance on HD5850
                              LeeHowes

                              I'd remove the bank conflict reduction stuff and just unroll

                              a = a + a[i-n] where n is 1, 2, 4 etc. As it's a SIMD architecture that will be more efficient.

                              My own radix sort is hitting just under 200Mn keys/sec (a little under for key/value pairs, I forget the figures) in CS5 and about 130Mn for OpenCL. At the moment you should probably expect that the OpenCL code runs a little slower than DX11 code for various reasons, some of which are being worked on as Micah said. That code ended up using a pretty simple local prefix sum. I tried implementing something similar to yours and then heavily unrolling and trying variations of thread->data mappings only to find that a simpler approach outperformed all that substantially.

                               

                              ETA: I'll try to get the code out once I have some time.

                                • Really Really Bad OpenCL performance on HD5850
                                  BarnacleJunior

                                   

                                  Originally posted by: LeeHowes I'd remove the bank conflict reduction stuff and just unroll

                                   

                                  a = a + a[i-n] where n is 1, 2, 4 etc. As it's a SIMD architecture that will be more efficient.

                                   

                                  My own radix sort is hitting just under 200Mn keys/sec (a little under for key/value pairs, I forget the figures) in CS5 and about 130Mn for OpenCL. At the moment you should probably expect that the OpenCL code runs a little slower than DX11 code for various reasons, some of which are being worked on as Micah said. That code ended up using a pretty simple local prefix sum. I tried implementing something similar to yours and then heavily unrolling and trying variations of thread->data mappings only to find that a simpler approach outperformed all that substantially.

                                   

                                   

                                   

                                  ETA: I'll try to get the code out once I have some time.

                                   

                                  Ok, seeing your code would be great.  I've finally understood how to program for wavefronts without using barriers everywhere.  I finally understand the CUDPP CTA code well enough for it to be a reference to me, but those strategies don't seem to carry over to ATI cards well.  I should be able to break 100M pairs/sec on CS5 now but you must be using magic to get near 200M.

                                  Also, what numbers are you getting on a straight-up prefix sum?  I'm only able to hit 5444M uints/sec on CS5 with 256 threads/group.  My code for the sum is at the end of this thread:

                                  http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125201&enterthread=y

                                  Thanks!