6 Replies Latest reply on Jan 7, 2010 12:29 AM by BarnacleJunior

    Multiple wavefronts, bank conflicts, and scan question

    BarnacleJunior

      I'm still working on getting good prefix scan performance, with an algorithm like this one http://mgarland.org/files/papers/nvr-2008-003.pdf

      With 64 threads processing 8 values each, I get 5171M uints/sec, and with 256 threads processing 8 each I get 5444M uints/sec.  All other combinations are lower.  I am on a 5850, but even the 5444 term is lower than the 6300M reported in the NV paper for a GTX280.

      I'm using a scan like this:

      void ThreadSumWavefront(uint tid, uint scansize) {
          uint lane = (WAVEFRONT - 1) & tid;
          uint laneMask = ~(WAVEFRONT - 1) & tid;
             
          [unroll]
          for(uint offset = 1; offset < scansize; offset<<= 1) {
              uint tid2 = ((tid - offset) & (scansize - 1)) | laneMask;
              uint target = sharedSum[tid];
              uint source = sharedSum[tid2];
              bool cond = lane >= offset;
              target += cond ? source : 0;
              sharedSum[tid] = target;
          }
      }

      For a single wavefront, that is the only scan I need.  For multiple wavefronts, I have to take each 63==lane thread and gather those inclusive scans to the LDS and run the scan again, but with scansize=NUM_WAVEFRONTS rather than scansize=64.

      Is this a good strategy on ATI hardware?  the four wavefront version with 8 values per thread barely beats the best single wavefront version.  In most the stuff I've experimented with, single wavefront versions are most efficient, especially as the shaders get more complicated.  Also, do bank conflicts get in the way of widening the scan LDS to type uint4 rather than uint?  With only 32 LDS channels and 16 active threads, if each thread accesses its own slot in the LDS each cycle does that imply a 2-way bank conflict on each channel?  I've tried with uint2 LDS arrays, which I think avoid bank conflicts, but the added complexity seems to negate a performance benefit from the better ALU packing in the scan.

      Is there a high performance scan and sort that ATI can share, similar to CUDPP?

      thanks,

      .sean

       

        • Multiple wavefronts, bank conflicts, and scan question
          MicahVillmow
          BarnacleJunior,
          Dump the IL/ISA and analyze that to see what is holding you back. This can be done with GPU_DUMP_DEVICE_KERNEL=3
            • Multiple wavefronts, bank conflicts, and scan question
              BarnacleJunior

               

              Originally posted by: MicahVillmow BarnacleJunior, Dump the IL/ISA and analyze that to see what is holding you back. This can be done with GPU_DUMP_DEVICE_KERNEL=3


              I prefer using D3D11.  Is there a way to get that to dump the IL?

              • Multiple wavefronts, bank conflicts, and scan question
                BarnacleJunior

                 

                Originally posted by: MicahVillmow BarnacleJunior, Dump the IL/ISA and analyze that to see what is holding you back. This can be done with GPU_DUMP_DEVICE_KERNEL=3


                I rewrote the scan in OpenCL.  It is almost line-for-line identical to the D3D11 code.  However I'm again running into the OpenCL performance problem - it's going about 1/20th as fast as it should.  I had this problem before but it went away when I made a 64bit build.  This time, no such luck.  It is giving the correct results, it's just very very slow.  It's a problem getting the kernel running, not a problem with the kernel itself, because if I comment everything out except the global data reads, it doesn't pick up any.

                You can change NumThreads to 64 (one wavefront, only one scan operation - D3D11 does 5170M uints/sec) or to 256 (four wavefronts, two scans, like the garland paper - D3D11 does 5444M uints/sec).

                Here's a snapshot of the profiler for the kernel:

                http://www.earthrse.com/screenie/20100106-192341.png

                Nothing obviously bad to my eyes.  It is interesting to note that the Pass2 method takes 95% as long to execute as Pass1, even though it does almost zero work (indicating that this could already be bandwidth limited - but I'm skeptical of this profiling since there is a CL driver bug making it so slow).

                [edit: Had one bug with a macro in the scan.cl that I just fixed this morning.  Should work now]

                scan.cl: // Prefix sum processing 8 values per thread. // The caller must define the NUM_THREADS macro #define VALUES_PER_THREAD 8 #define NUM_VALUES (VALUES_PER_THREAD * NUM_THREADS) #define WAVEFRONT 64 // 64 threads per wavefront #define NUM_WAVEFRONTS (NUM_THREADS / WAVEFRONT) void ThreadSumWavefront(uint tid, uint scansize, __local uint sharedSum[NUM_THREADS]) { uint lane = (WAVEFRONT - 1) & tid; uint laneMask = ~(WAVEFRONT - 1) & tid; for(uint offset = 1; offset < scansize; offset<<= 1) { uint tid2 = ((tid - offset) & (scansize - 1)) | laneMask; uint target = sharedSum[tid]; uint source = sharedSum[tid2]; bool cond = lane >= offset; target += cond ? source : 0; sharedSum[tid] = target; } } uint4 Inclusive4Sum(uint4 vec) { vec.yw += vec.xz; vec.zw += vec.y; return vec; } uint8 Inclusive8Sum(uint8 vec) { uint8 result; result.lo = Inclusive4Sum(vec.lo); result.hi = Inclusive4Sum(vec.hi) + result.lo.w; return result; } __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1))) void PrefixSumBlock_Pass1( __global uint* pass1_values, __global uint* pass1_partialSums) { __local uint sharedSum[NUM_THREADS]; uint tid = get_local_id(0); uint gid = get_group_id(0); uint lane = (WAVEFRONT - 1) & tid; uint wavefront = tid / WAVEFRONT; global uint8* pass1_values_vec8 = (__global uint8*)pass1_values; uint index = NUM_THREADS * gid + tid; uint8 a = pass1_values_vec8[index]; uint8 aInc = Inclusive8Sum(a); sharedSum[tid] = aInc.s7; ThreadSumWavefront(tid, WAVEFRONT, sharedSum); barrier(CLK_LOCAL_MEM_FENCE); uint inc = sharedSum[tid]; #if NUM_WAVEFRONTS == 1 uint exc = inc - aInc.s7; #else // If this is the last thread in the wavefront, write the total // of the wavefront sum and run an inclusive scan if(lane == (WAVEFRONT - 1)) sharedSum[wavefront] = inc; barrier(CLK_LOCAL_MEM_FENCE); if(!wavefront) { uint x = sharedSum[tid]; ThreadSumWavefront(tid, NUM_WAVEFRONTS, sharedSum); sharedSum[tid] -= x; } barrier(CLK_LOCAL_MEM_FENCE); uint exc = inc - aInc.s7 + sharedSum[wavefront]; #endif uint8 aSum = aInc - a + (uint8)exc; pass1_values_vec8[index] = aSum; if((NUM_THREADS - 1) == tid) pass1_partialSums[gid] = a.s7 + aSum.s7; } /////////////////////////////////////////////////////////////////////////////////////////////////// // Pass 2 __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1))) void PrefixSumBlock_Pass2( __global uint* pass2_values, __global const uint* pass2_offsets) { __global uint8* pass2_values_vec8 = (__global uint8*)pass2_values; uint tid = get_local_id(0); uint gid = get_group_id(0); uint index = NUM_THREADS * gid + tid; uint8 a = pass2_values_vec8[index]; uint partialSum = pass2_offsets[gid]; pass2_values_vec8[index] = a + (uint8)partialSum; } main.cpp: #pragma comment(lib, "opencl") #include <windows.h> #include <vector> #include <string> #include <fstream> #include <iostream> #include <cmath> #include <cassert> // #define __CL_ENABLE_EXCEPTIONS #include <cl/cl.hpp> typedef unsigned int uint; const size_t NumLoops = 10000; const size_t NumElements = 1<< 22; template<typename T> bool DumpBuffer(cl::CommandQueue& queue, cl::Buffer& buffer, std::vector<T>& data) { size_t size; buffer.getInfo(CL_MEM_SIZE, &size); if(size % sizeof(T)) return false; data.resize(size / sizeof(T)); queue.enqueueReadBuffer(buffer, true, 0, size, &data[0]); return true; } class PrefixSumCache { public: size_t Capacity() const { return _capacity; } cl::Buffer& BaseLevel() { return _levels[0]; } private: friend class CLPrefixSum; std::vector<cl::Buffer> _levels; cl::Buffer _finalBuffer; size_t _capacity, _valuesPerGroup; }; class CLPrefixSum { public: void CreatePrefixSumCache(size_t capacity, std::auto_ptr<PrefixSumCache>* ppCache); void Execute(cl::CommandQueue& queue, size_t numElements, PrefixSumCache* cache); private: friend void CreateCLPrefixSum(cl::Context& context, size_t valuesPerGroup, size_t threadsPerGroup, cl::Kernel& pass1, cl::Kernel& pass2, std::auto_ptr<CLPrefixSum>* ppPrefixSum); static void LevelCounts(size_t numElements, size_t valuesPerGroup, std::vector<size_t>& levelCounts); cl::Context _context; cl::Kernel _pass1, _pass2; size_t _valuesPerGroup, _threadsPerGroup; }; void CreateCLPrefixSum(cl::Context& context, size_t valuesPerGroup, size_t threadsPerGroup, cl::Kernel& pass1, cl::Kernel& pass2, std::auto_ptr<CLPrefixSum>* ppPrefixSum) { std::auto_ptr<CLPrefixSum> prefixSum(new CLPrefixSum); prefixSum->_context = context; prefixSum->_pass1 = pass1; prefixSum->_pass2 = pass2; prefixSum->_valuesPerGroup = valuesPerGroup; prefixSum->_threadsPerGroup = threadsPerGroup; *ppPrefixSum = prefixSum; } void CLPrefixSum::LevelCounts(size_t numElements, size_t valuesPerGroup, std::vector<size_t>& levelCounts) { // write the dispatch count for each pass1/pass2 level. while(numElements > valuesPerGroup) { numElements = (numElements + valuesPerGroup - 1) / valuesPerGroup; levelCounts.push_back(numElements); } } void CLPrefixSum::CreatePrefixSumCache(size_t capacity, std::auto_ptr<PrefixSumCache>* ppCache) { std::auto_ptr<PrefixSumCache> cache(new PrefixSumCache); cache->_capacity = capacity; cache->_valuesPerGroup = _valuesPerGroup; std::vector<size_t> levelCounts; LevelCounts(capacity, _valuesPerGroup, levelCounts); capacity = ~(_valuesPerGroup - 1) & (capacity + _valuesPerGroup - 1); cache->_levels.push_back(cl::Buffer(_context, CL_MEM_READ_WRITE, 4 * capacity)); for(size_t i(0); i < levelCounts.size(); ++i) { size_t count = levelCounts[i]; capacity = ~(_valuesPerGroup - 1) & (count + _valuesPerGroup - 1); cache->_levels.push_back(cl::Buffer(_context, CL_MEM_READ_WRITE, 4 * capacity)); } cache->_finalBuffer = cl::Buffer(_context, CL_MEM_READ_WRITE, 4); *ppCache = cache; } void CLPrefixSum::Execute(cl::CommandQueue& queue, size_t numElements, PrefixSumCache* cache) { assert(cache->_valuesPerGroup == _valuesPerGroup); std::vector<size_t> levelCounts; LevelCounts(numElements, _valuesPerGroup, levelCounts); for(size_t level(0); level < levelCounts.size(); ++level) { cl::KernelFunctor f(_pass1, queue, cl::NDRange(), levelCounts[level] * _threadsPerGroup, _threadsPerGroup); f(cache->_levels[level], cache->_levels[level + 1]); } cl::KernelFunctor f(_pass1, queue, cl::NDRange(), _threadsPerGroup, _threadsPerGroup); f(cache->_levels.back(), cache->_finalBuffer); for(int level(levelCounts.size() - 1); level >= 0; --level) { cl::KernelFunctor f(_pass2, queue, cl::NDRange(), levelCounts[level] * _threadsPerGroup, _threadsPerGroup); f(cache->_levels[level], cache->_levels[level + 1]); } } int main(int argc, char** argv) { // read in the file std::ifstream f("scan.cl"); std::string s(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>(0)); std::vector<cl::Platform> platforms; cl::Platform::get(&platforms); std::vector<cl::Device> devices; platforms.front().getDevices(CL_DEVICE_TYPE_GPU, &devices); cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 }; cl::Context context(CL_DEVICE_TYPE_GPU, props); cl::CommandQueue commandQueue(context, devices[0]); cl::Program program(context, cl::Program::Sources(1, std::make_pair(s.c_str(), s.size()))); const int NumThreads = 64; const int ValuesPerThread = 8; const int ValuesPerGroup = ValuesPerThread * NumThreads; char buildString[128]; sprintf(buildString, "-D NUM_THREADS=%d", NumThreads); int err = program.build(devices, buildString); if(err) { std::string buildError; program.getBuildInfo(devices[0], CL_PROGRAM_BUILD_LOG, &buildError); std::cout<< buildError; return 0; } cl::Kernel pass1(program, "PrefixSumBlock_Pass1", &err); cl::Kernel pass2(program, "PrefixSumBlock_Pass2", &err); std::auto_ptr<CLPrefixSum> prefixSum; CreateCLPrefixSum(context, ValuesPerGroup, NumThreads, pass1, pass2, &prefixSum); std::auto_ptr<PrefixSumCache> prefixSumCache; prefixSum->CreatePrefixSumCache(NumElements, &prefixSumCache); std::vector<uint> values(NumElements, 1); commandQueue.enqueueWriteBuffer(prefixSumCache->BaseLevel(), true, 0, NumElements * 4, &values[0]); commandQueue.finish(); LARGE_INTEGER freq, begin, end; QueryPerformanceFrequency(&freq); double period = 1.0 / freq.QuadPart; for(int j(0); j < 10; ++j) { QueryPerformanceCounter(&begin); for(int i(0); i < NumLoops; ++i) prefixSum->Execute(commandQueue, NumElements, prefixSumCache.get()); std::vector<uint> values2; DumpBuffer(commandQueue, prefixSumCache->BaseLevel(), values2); QueryPerformanceCounter(&end); double elapsed = period * (end.QuadPart - begin.QuadPart); double velocity = NumElements * (NumLoops / elapsed); printf("GPU velocity: %1.3fM\n", velocity / 1.0e6); } } ISA for pass1: ShaderType = 3 TargetChip = c ;SC Dep components NumClauseTemps = 4 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(41) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: ADD_INT R9.x, R1.x, KC0[7].x y: MOV R1.y, 0.0f z: MOV R1.z, 0.0f w: MOV R1.w, 0.0f t: MOV R2.y, 0.0f 1 x: LSHL ____, PV0.x, (0x00000006, 8.407790786e-45f).x y: MOV R3.y, 0.0f z: MOV R2.z, 0.0f w: MOV R2.w, 0.0f t: MOV R1.x, 0.0f 2 x: MOV R5.x, 0.0f y: MOV R4.y, 0.0f z: MOV R3.z, 0.0f w: ADD_INT ____, R0.x, PV1.x t: MOV R7.y, 0.0f 3 x: MOV R2.x, 0.0f y: MOV R8.y, 0.0f z: LSHL ____, PV2.w, (0x00000005, 7.006492322e-45f).x w: MOV R3.w, 0.0f t: MOV R5.y, 0.0f 4 x: MOV R3.x, 0.0f y: ADD_INT ____, KC1[0].x, PV3.z z: MOV R4.z, 0.0f w: MOV R4.w, 0.0f t: MOV R9.y, 0.0f 5 x: LSHR R10.x, PV4.y, (0x00000002, 2.802596929e-45f).x y: ADD_INT ____, PV4.y, (0x00000010, 2.242077543e-44f).y z: MOV R5.z, 0.0f w: ADD_INT ____, R0.x, (0x0000003F, 8.828180325e-44f).z t: AND_INT R4.x, (0xFFFFFFC0, -1.#QNANf).w, R0.x 6 x: LSHR R11.x, PV5.y, (0x00000002, 2.802596929e-45f).x y: AND_INT R6.y, PV5.w, (0x0000003F, 8.828180325e-44f).y z: AND_INT R6.z, R0.x, (0x0000003F, 8.828180325e-44f).y w: ADD_INT R5.w, R0.x, (0x0000003E, 8.688050479e-44f).z t: LSHL R14.x, R0.x, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(304) CNT(2) 7 VFETCH R13, R10.x, fc156 FORMAT(32_32_32_32_FLOAT) MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 8 VFETCH R12, R11.x, fc156 FORMAT(32_32_32_32_FLOAT) MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(73) CNT(122) 9 x: ADD_INT T0.x, R1.y, R13.x y: MOV R0.y, 0.0f z: MOV ____, R13.w w: MOV ____, R13.y VEC_120 t: MOV ____, R13.z 10 x: ADD_INT ____, R1.z, PV9.w y: ADD_INT ____, PV9.z, R1.w z: ADD_INT ____, PS9, PV9.y w: MOV ____, R12.y t: MOV R1.y, 0.0f 11 x: ADD_INT ____, R2.y, R12.x y: ADD_INT ____, R3.y, PV10.w VEC_120 z: ADD_INT ____, PV10.y, PV10.z w: ADD_INT T2.w, PV10.x, T0.x VEC_021 t: MOV ____, R12.w 12 x: ADD_INT T0.x, PV11.z, R2.z y: ADD_INT ____, PV11.y, PV11.x z: MOV ____, R12.z w: MOV T3.w, PV11.w t: ADD_INT T0.w, PS11, R1.x 13 x: ADD_INT ____, PV12.w, R2.w y: MOV ____, R12.z z: MOV T0.z, PV12.y w: ADD_INT ____, PV12.z, R1.y t: ADD_INT T1.x, R7.y, PV12.y 14 x: ADD_INT ____, R2.x, PV13.y y: ADD_INT ____, T0.w, PV13.w z: ADD_INT R2.z, PV13.x, T0.x VEC_021 w: ADD_INT T0.w, PV13.z, R4.y t: ADD_INT R1.x, R5.y, R12.x 15 x: ADD_INT ____, PV14.y, R3.z y: ADD_INT ____, T1.x, PV14.x z: ADD_INT T0.z, T0.z, R9.y w: ADD_INT ____, PV14.z, R0.y VEC_021 t: ADD_INT R1.x, R3.x, PS14 16 x: ADD_INT ____, T0.w, PV15.x y: MOV ____, PV15.y z: ADD_INT ____, R5.x, PV15.w w: ADD_INT T0.w, R7.y, PS15 t: OR_INT ____, R6.y, R4.x 17 x: ADD_INT ____, R8.y, PV16.z y: MOV ____, PV16.x z: ADD_INT R0.z, PV16.y, R9.y w: LSHL T1.w, PS16, (0x00000002, 2.802596929e-45f).x t: SETNE_INT T1.y, R6.z, 0.0f 18 x: ADD_INT R1.x, R3.w, PV17.x y: MOV ____, PV17.x z: MOV ____, PV17.x w: ADD_INT R0.w, PV17.y, R1.y t: MOV ____, PV17.x 19 x: ADD_INT R1.x, R4.z, PV18.x y: ADD_INT T0.y, PS18, R9.y z: ADD_INT R1.z, PV18.z, R9.y w: ADD_INT R0.w, PV18.y, R4.w t: ADD_INT R1.w, R3.x, PV18.w 20 x: ADD_INT ____, R7.y, R0.z y: ADD_INT ____, R5.z, PV19.x VEC_201 z: ADD_INT T1.z, R7.y, PS19 w: ADD_INT R0.w, R4.z, PV19.w t: ADD_INT ____, R5.z, PV19.z 21 x: ADD_INT R2.x, PV20.y, T0.w y: ADD_INT R0.y, T0.y, T0.z z: ADD_INT ____, R5.z, PV20.w w: AND_INT ____, R5.w, (0x0000003F, 8.828180325e-44f).x t: ADD_INT R0.z, PS20, PV20.x 22 x: ADD_INT ____, R0.x, (0x0000003C, 8.407790786e-44f).x y: SETGT_UINT R6.y, R6.z, 1 z: OR_INT ____, R4.x, PV21.w VEC_120 w: ADD_INT R0.w, PV21.z, T1.z t: SETGT_UINT R4.y, R6.z, (0x00000003, 4.203895393e-45f).y 23 x: LDS_WRITE ____, R14.x, PV22.w y: LSHL R7.y, PV22.z, (0x00000002, 2.802596929e-45f).x z: AND_INT ____, PV22.x, (0x0000003F, 8.828180325e-44f).y w: ADD_INT ____, R0.x, (0x00000038, 7.847271400e-44f).z VEC_120 t: SETGT_UINT R5.y, R6.z, (0x00000007, 9.809089250e-45f).w 24 x: LDS_READ_RET QA, T1.w y: AND_INT ____, PV23.w, (0x0000003F, 8.828180325e-44f).x z: OR_INT ____, R4.x, PV23.z w: ADD_INT ____, R0.x, (0x00000030, 6.726232629e-44f).y VEC_120 t: SETGT_UINT R2.y, R6.z, (0x0000000F, 2.101947696e-44f).z 25 x: LSHL R1.x, PV24.z, (0x00000002, 2.802596929e-45f).x y: AND_INT ____, PV24.w, (0x0000003F, 8.828180325e-44f).y z: OR_INT ____, R4.x, PV24.y w: ADD_INT ____, R0.x, (0x00000020, 4.484155086e-44f).z VEC_120 t: SETGT_UINT R3.y, R6.z, (0x0000001F, 4.344025239e-44f).w 26 x: LSHL R6.x, PV25.z, (0x00000002, 2.802596929e-45f).x y: AND_INT ____, PV25.w, (0x0000003F, 8.828180325e-44f).y z: OR_INT ____, R4.x, PV25.y w: MOV ____, R13.z t: MOV ____, 0.0f 27 x: LSHL R7.x, PV26.z, (0x00000002, 2.802596929e-45f).x y: ADD_INT ____, PS26, PV26.w z: OR_INT ____, R4.x, PV26.y w: ADD_INT ____, R8.y, T2.w t: MOV R3.x, 0.0f 28 x: MOV T1.x, 0.0f y: MOV R1.y, 0.0f z: ADD_INT ____, 0.0f, R2.x w: LSHL R3.w, PV27.z, (0x00000002, 2.802596929e-45f).x t: ADD_INT T2.w, PV27.w, PV27.y 29 x: MOV R8.x, 0.0f y: MOV ____, QA.pop z: ADD_INT R5.z, R3.x, PV28.z w: ADD_INT R4.w, R8.y, R13.x t: ADD_INT R2.x, T3.w, PV28.y 30 x: MOV R4.x, 0.0f y: ADD_INT R8.y, R0.y, T1.x z: MOV R4.z, 0.0f w: CNDE_INT R5.w, T1.y, 0.0f, PV29.y VEC_120 t: MOV R3.x, T2.w 03 ALU_PUSH_BEFORE: ADDR(195) CNT(52) 31 x: ADD_INT T1.x, R0.w, R5.w y: ADD_INT T1.y, R5.x, R4.w VEC_021 z: ADD_INT T1.z, R8.x, R5.z VEC_120 w: ADD_INT T2.w, R8.x, R8.y VEC_120 t: MOV R8.x, 0.0f 32 x: LDS_WRITE ____, R14.x, PV31.x y: MOV T0.y, 0.0f z: MOV T0.z, R2.z w: ADD_INT T3.w, R0.z, R4.x VEC_210 t: ADD_INT T1.w, R3.x, R4.z 33 x: LDS_READ_RET QA, R7.y y: ADD_INT ____, R5.x, R2.x z: MOV ____, 0.0f w: MOV T0.w, 0.0f t: ADD_INT R4.x, R8.x, T1.z 34 x: ADD_INT R2.x, T0.y, T1.y y: ADD_INT R0.y, R8.x, T2.w z: ADD_INT R0.z, R8.x, T3.w VEC_021 w: ADD_INT R1.w, R0.w, PV33.z t: ADD_INT R1.y, T0.y, PV33.y 35 x: LSHR R3.x, R0.x, (0x00000006, 8.407790786e-45f).x z: ADD_INT R1.z, T0.y, T1.w w: ADD_INT R2.w, T0.z, T0.w VEC_021 t: SETE_INT R5.x, R6.z, (0x0000003F, 8.828180325e-44f).y 36 y: MOV ____, QA.pop 37 w: CNDE_INT R123.w, R6.y, 0.0f, PV36.y 38 x: ADD_INT T1.x, T1.x, PV37.w 39 x: LDS_WRITE ____, R14.x, PV38.x 40 x: LDS_READ_RET QA, R1.x 41 y: MOV ____, QA.pop 42 w: CNDE_INT R123.w, R4.y, 0.0f, PV41.y 43 x: ADD_INT T1.x, T1.x, PV42.w 44 x: LDS_WRITE ____, R14.x, PV43.x 45 x: LDS_READ_RET QA, R6.x 46 y: MOV ____, QA.pop 47 w: CNDE_INT R123.w, R5.y, 0.0f, PV46.y 48 x: ADD_INT T1.x, T1.x, PV47.w 49 x: LDS_WRITE ____, R14.x, PV48.x 50 x: LDS_READ_RET QA, R7.x 51 y: MOV ____, QA.pop 52 w: CNDE_INT R123.w, R2.y, 0.0f, PV51.y 53 x: ADD_INT T1.x, T1.x, PV52.w 54 x: LDS_WRITE ____, R14.x, PV53.x 55 x: LDS_READ_RET QA, R3.w 56 z: MOV ____, QA.pop 57 x: CNDE_INT R123.x, R3.y, 0.0f, PV56.z 58 w: ADD_INT ____, T1.x, PV57.x 59 x: LDS_WRITE ____, R14.x, PV58.w 60 x: LDS_READ_RET QA, R14.x 61 x: MOV R1.x, QA.pop 62 x: PREDNE_INT ____, R5.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 04 JUMP POP_CNT(1) ADDR(6) 05 ALU_POP_AFTER: ADDR(247) CNT(3) 63 x: LSHL ____, R3.x, (0x00000002, 2.802596929e-45f).x 64 x: LDS_WRITE ____, PV63.x, R1.x 06 ALU_PUSH_BEFORE: ADDR(250) CNT(3) 65 x: SETGT_UINT R5.x, R0.x, (0x0000003F, 8.828180325e-44f).x 66 x: PREDE_INT ____, R5.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 07 JUMP POP_CNT(1) ADDR(9) 08 ALU_POP_AFTER: ADDR(253) CNT(1) 67 x: LDS_WRITE ____, R14.x, 0.0f 09 ALU: ADDR(254) CNT(42) 68 x: SUB_INT T0.x, R1.x, R0.w y: MOV T0.y, 0.0f z: LSHL ____, R3.x, (0x00000002, 2.802596929e-45f).x VEC_120 w: MOV T0.w, 0.0f t: MOV T1.y, 0.0f 69 x: LDS_READ_RET QA, PV68.z y: MOV T2.y, 0.0f z: MOV T0.z, 0.0f w: MOV T1.w, 0.0f t: SUB_INT T1.x, R2.x, R13.x 70 x: SUB_INT T2.x, R4.x, R12.x y: SUB_INT T3.y, R2.w, R13.w z: SUB_INT T1.z, R1.z, R13.z w: SUB_INT T2.w, R1.y, R13.y t: SETE_INT R0.x, R0.x, (0x0000003F, 8.828180325e-44f).x 71 y: SUB_INT R0.y, R1.w, R12.w z: SUB_INT T3.z, R0.z, R12.z w: SUB_INT T3.w, R0.y, R12.y 72 y: MOV ____, QA.pop 73 w: ADD_INT ____, PV72.y, T0.x 74 x: MOV ____, PV73.w y: MOV ____, PV73.w z: MOV ____, PV73.w w: ADD_INT ____, T0.w, PV73.w 75 x: ADD_INT ____, T1.w, PV74.w y: ADD_INT ____, PV74.x, T1.y z: ADD_INT ____, PV74.y, T0.y VEC_021 w: ADD_INT T0.w, PV74.z, T2.y VEC_102 76 x: ADD_INT ____, T1.w, PV75.z y: ADD_INT ____, T0.z, PV75.x z: ADD_INT T2.z, T0.z, PV75.y t: ADD_INT R1.y, PV75.w, T2.w 77 x: ADD_INT R1.x, PV76.y, T1.x VEC_021 y: ADD_INT ____, T0.z, PV76.x z: ADD_INT R1.z, PV76.z, T1.z t: ADD_INT R2.x, PV76.y, T2.x 78 y: ADD_INT R2.y, T0.w, T3.w z: ADD_INT R2.z, T2.z, T3.z w: ADD_INT R1.w, PV77.y, T3.y VEC_021 t: ADD_INT R2.w, PV77.y, R0.y 10 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R10], R1, ARRAY_SIZE(4) MARK VPM 11 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R11], R2, ARRAY_SIZE(4) MARK VPM 12 ALU_PUSH_BEFORE: ADDR(296) CNT(1) 79 x: PREDNE_INT ____, R0.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 13 JUMP POP_CNT(1) ADDR(17) 14 ALU: ADDR(297) CNT(6) KCACHE0(CB1:0-15) 80 x: ADD_INT R0.x, R12.w, R2.w z: LSHL ____, R9.x, (0x00000002, 2.802596929e-45f).x 81 y: ADD_INT ____, KC0[1].x, PV80.z 82 x: LSHR R1.x, PV81.y, (0x00000002, 2.802596929e-45f).x 15 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, MARK VPM 16 POP (1) ADDR(17) 17 NOP NO_BARRIER END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 2464;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x3000010F SQ_PGM_RESOURCES:NUM_GPRS = 15 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 ; RatOpIsUsed = 0x2 ; NumThreadPerGroupFlattened = 64 ; SetBufferForNumGroup = true

              • Multiple wavefronts, bank conflicts, and scan question
                eduardoschardong

                I was going to suggest you an algorithm but then I looked at your numbers and did some math, yours is pretty fast, good job, I'm not sure I could write something faster.

                Depending on how it was implemented what will ultimatly limit the perfix sum is memory bandwidth, not math power, HD5850 is pretty good at math, but has 10% lower memory bandwidth than GTX280 and LDS bandwidth isn't much better either (if at all).

                I would like to try something... I'm not sure if this will be as fast as i imagine on HD5850, but let's give a try, the attached code consists of two kernels, they should be invoked with 4096 threads, group size of 64, the width should be numElements / 16384, values buffer should have numElements * 4 bytes, temp buffer 16384 bytes, parameters are the same for both and they should execute in a sequence, once, I'm expecting this to be bandwidth limited to 10G uint/s on HD5850 since it's mostly a memory read/copy, if HD5850 do it then there is a way to improve prefix sum over GTX280, if not your code was too good for me

                 

                __kernel void PrefixSumDu1(__global uint4* values, __global uint* temp, int width) { uint index = get_global_id(0); __global uint4* source = values + index * width; uint4 ret = (uint4)(0, 0, 0, 0); for(uint i = 0; i < width; i ++) { ret += source[i]; } temp[index] = ret.x + ret.y + ret.z + ret.w; } __kernel void PrefixSumDu3(__global uint4* values, __global uint* temp, int width) { uint index = get_global_id(0); __global uint4* source = values + index * width; uint4 ret = temp[index]; for(uint i = 0; i < width; i++) { ret += source[i]; source[i] = ret; } }