cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

nano23
Journeyman III

Strange behaviour with const and __constant

Hi,

i've compared 2 versions of a simple 2D convolution. First the "unoptimized" without const and __global isntead of __constant and the second one with const.

The "optimized" has nearly 2x ALUInsts, but with small filter sizes 3-31 its faster and with a filtersize of 101 its slower. I don't understand this behaviour

The output data size is 1024x1024.

The device is a HD 5850.

Optimized (Size 29)

Method  ExecutionOrder  GlobalWorkSize  GroupWorkSize  Time  LDSSize  DataTransferSize  GPRs  ScratchRegs  FCStacks  Wavefronts  ALUInsts  FetchInsts  WriteInsts  LDSFetchInsts  LDSWriteInsts  ALUBusy  ALUFetchRatio  ALUPacking  FetchSize  CacheHit  FetchUnitBusy  FetchUnitStalled  WriteUnitStalled  FastPath  CompletePath  PathUtilization  ALUStalledByLDS  LDSBankConflict
WriteBuffer 105  2,8396 4323,06                                         
WriteBuffer 106  0,19504 3,29                                         
convolve2D_0558A308 107 {   1024    1024       1}  {  256     1     1} 101,067670 1403163846918168210056,084,1137,212123265,560,3130,36004095010000
convolve2D_0558A308 108 {   1024    1024       1}  {  256     1     1} 101,046060 1403163846918168210056,064,1137,212123265,560,330,35004095010000
convolve2D_0558A308 109 {   1024    1024       1}  {  256     1     1} 101,091120 1403163846918168210056,064,1137,212123265,560,3130,35004095010000
convolve2D_0558A308 110 {   1024    1024       1}  {  256     1     1} 101,092420 1403163846918168210056,074,1137,212123265,560,3130,35004095010000
convolve2D_0558A308 111 {   1024    1024       1}  {  256     1     1} 101,064910 1403163846918168210056,064,1137,212123265,560,330,35004095010000
ReadBuffer 112  2,50243 4096                                         

Unoptimized (Size 29)

Method  ExecutionOrder  GlobalWorkSize  GroupWorkSize  Time  LDSSize  DataTransferSize  GPRs  ScratchRegs  FCStacks  Wavefronts  ALUInsts  FetchInsts  WriteInsts  LDSFetchInsts  LDSWriteInsts  ALUBusy  ALUFetchRatio  ALUPacking  FetchSize  CacheHit  FetchUnitBusy  FetchUnitStalled  WriteUnitStalled  FastPath  CompletePath  PathUtilization  ALUStalledByLDS  LDSBankConflict
WriteBuffer 105  2,82946 4323,06                                         
WriteBuffer 106  0,17178 3,29                                         
convolve2D_0549A690 107 {   1024    1024       1}  {  256     1     1} 120,43130 1303163843555168210023,052,1134,572525696032,237,9604095010000
convolve2D_0549A690 108 {   1024    1024       1}  {  256     1     1} 127,289080 1303163843555168210023,352,1134,572525696032,227,6204095010000
convolve2D_0549A690 109 {   1024    1024       1}  {  256     1     1} 127,618110 1303163843555168210023,452,1134,572525696032,147,4404095010000
convolve2D_0549A690 110 {   1024    1024       1}  {  256     1     1} 125,395150 1303163843555168210022,682,1134,572525696032,058,1504095010000
convolve2D_0549A690 111 {   1024    1024       1}  {  256     1     1} 125,414670 1303163843555168210023,292,1134,572525696032,177,6304095010000
ReadBuffer 112  3,31465 4096                                         

Optimized (Size 101)

Method  ExecutionOrder  GlobalWorkSize  GroupWorkSize  Time  LDSSize  DataTransferSize  GPRs  ScratchRegs  FCStacks  Wavefronts  ALUInsts  FetchInsts  WriteInsts  LDSFetchInsts  LDSWriteInsts  ALUBusy  ALUFetchRatio  ALUPacking  FetchSize  CacheHit  FetchUnitBusy  FetchUnitStalled  WriteUnitStalled  FastPath  CompletePath  PathUtilization  ALUStalledByLDS  LDSBankConflict
WriteBuffer 1  4,46239 4935,06                                         
WriteBuffer 2  0,7072 39,85                                         
convolve2D_00F1A808 3 {   1024    1024       1}  {  256     1     1} 1221,33760 140316384822302040210055,134,0337,4225778451,90,7930,45004095010000
convolve2D_00F1A808 4 {   1024    1024       1}  {  256     1     1} 1221,225360 140316384822302040210055,194,0337,4225778451,90,8430,49004095010000
convolve2D_00F1A808 5 {   1024    1024       1}  {  256     1     1} 1221,746390 140316384822302040210055,174,0337,4225778451,90,8330,48004095010000
convolve2D_00F1A808 6 {   1024    1024       1}  {  256     1     1} 1220,528340 140316384822302040210055,184,0337,4225778451,90,8530,48004095010000
convolve2D_00F1A808 7 {   1024    1024       1}  {  256     1     1} 1220,829290 140316384822302040210055,174,0337,4225778451,90,8530,48004095010000
ReadBuffer 8  1,75211 4096                                         

Unoptimized (Size 101)

 

Method  ExecutionOrder  GlobalWorkSize  GroupWorkSize  Time  LDSSize  DataTransferSize  GPRs  ScratchRegs  FCStacks  Wavefronts  ALUInsts  FetchInsts  WriteInsts  LDSFetchInsts  LDSWriteInsts  ALUBusy  ALUFetchRatio  ALUPacking  FetchSize  CacheHit  FetchUnitBusy  FetchUnitStalled  WriteUnitStalled  FastPath  CompletePath  PathUtilization  ALUStalledByLDS  LDSBankConflict
WriteBuffer 1  4,28947 4935,06                                         
WriteBuffer 2  2,48555 39,85                                         
convolve2D_00F4A810 3 {   1024    1024       1}  {  256     1     1} 1020,591450 130316384414272040210032,172,0334,8830981376037,662,3904095010000
convolve2D_00F4A810 4 {   1024    1024       1}  {  256     1     1} 993,801990 130316384414272040210033,412,0334,8830981376038,471,8304095010000
convolve2D_00F4A810 5 {   1024    1024       1}  {  256     1     1} 976,173330 130316384414272040210034,272,0334,8830981376039,081,5104095010000
convolve2D_00F4A810 6 {   1024    1024       1}  {  256     1     1} 998,364190 130316384414272040210032,242,0334,8830981376037,832,4804095010000
convolve2D_00F4A810 7 {   1024    1024       1}  {  256     1     1} 1038,830550 130316384414272040210033,572,0334,8830981376038,451,6404095010000
ReadBuffer 8  1,99185 4096                                         

 

thanks,

rgds

 

__kernel void convolve2D( __global float *src, __global float *filter, int filterWidth, int imageWidth, __global float *dest){ int destWidth = get_global_size(0), xOut = get_global_id(0), yOut = get_global_id(1); float sum = 0; for (int y = 0; y < filterWidth; ++y){ int idxFY = y*filterWidth; int idxInY = (yOut + y) * imageWidth + xOut; for (int x = 0; x < filterWidth; ++x){ int idxF = idxFY + x; int idxIn = idxInY + x; sum += filter[idxF]*src[idxIn]; } } int idxOut = yOut * destWidth + xOut; dest[idxOut] = sum; } // "optimized" __kernel void convolve2D(const __global float *src, __constant float *filter, const int filterWidth, const int imageWidth, __global float *dest){ const int destWidth = get_global_size(0), xOut = get_global_id(0), yOut = get_global_id(1); float sum = 0; for (int y = 0; y < filterWidth; ++y){ const int idxFY = y*filterWidth; const int idxInY = (yOut + y) * imageWidth + xOut; for (int x = 0; x < filterWidth; ++x){ const int idxF = idxFY + x; const int idxIn = idxInY + x; sum += filter[idxF]*src[idxIn]; } } const int idxOut = yOut * destWidth + xOut; dest[idxOut] = sum; }

0 Likes
3 Replies

const global and constant are different. const global is memory that exists on the device but is read only and uncached between clauses. constant uses hardware constant caches to load memory efficiently from either the kcache when the index is known at compile time, or via the texture cache.

At the larger filter sizes, you are probably trashing the cache causing memory fetches to go out to RAM.
0 Likes

ahh ok, but why are there more alu-instructions, if i make the index variables const?

0 Likes

constant memory accesses are ALU instructions and not memory instructions.
0 Likes