Double Precision Image

Discussion created by on Apr 2, 2011
Latest reply on Apr 11, 2011 by Jawed
Workaround causes intense slowdowns

I'm working on a program that I am trying to optimize for the GPU. I have often seen people use the GPU texture cache, via the OpenCL image memory object, to reduce execution times.

By default, images only support single precision. However, I have come up with a trick for storing double-precision numbers in an image object; it relies on pointer casting. I've attached the casting functions below. The base code, with no optimizations, runs in about 36.4 milliseconds, while the image code takes 976 milliseconds.

For legal reasons, I cannot attach all my code, so I can only show the lines of code in question.

My original code accessed an array/1D-buffer stored in constant memory. The elements needed for the calculations were retrieved via pointers/array indexing. Via the attached conversion methods, I am able to successfully store and retrieve a double2 inside a single pixel of a float4 image with no data corruption.
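To make the packing idea concrete, here is a minimal host-side sketch in plain C (not the kernel code itself). The struct names `float4_host` and `double2_host` and the helper `round_trips` are hypothetical stand-ins introduced only for illustration. The sketch assumes a 4-byte `float` and an 8-byte `double`, so both structs occupy 16 bytes, and it uses `memcpy` rather than a pointer cast to copy the bits unchanged.

```c
#include <string.h>

/* Hypothetical host-side stand-ins for the OpenCL float4 and double2 types.
   Assumes sizeof(float) == 4 and sizeof(double) == 8, so both are 16 bytes. */
typedef struct { float s[4]; } float4_host;
typedef struct { double s[2]; } double2_host;

/* Reinterpret the 16 bytes of a double2 as a float4 without altering any bit. */
static float4_host to_float4(double2_host d2)
{
    float4_host f4;
    memcpy(&f4, &d2, sizeof f4);
    return f4;
}

/* Inverse: reinterpret the 16 bytes of a float4 as a double2. */
static double2_host to_double2(float4_host f4)
{
    double2_host d2;
    memcpy(&d2, &f4, sizeof d2);
    return d2;
}

/* Returns 1 if both doubles come back bit-for-bit identical after
   passing through the float4 representation, 0 otherwise. */
static int round_trips(double a, double b)
{
    double2_host in = { { a, b } };
    double2_host out = to_double2(to_float4(in));
    return memcmp(&in, &out, sizeof in) == 0;
}
```

Calling `round_trips(1.0, -2.5)` should return 1 on such a platform. Inside a kernel, OpenCL C also provides the built-in `as_float4()` and `as_double2()` reinterpretation operators for same-sized types, which may be worth trying in place of the pointer casts.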

My problem is that the image code runs about 27 times slower than the unoptimized code (976 ms versus 36.4 ms). As far as I can tell, the image code provides the same functionality as the buffer code with little additional overhead. The Stream Kernel Analyzer gives the following stats for the two versions of the code on the Radeon HD 5870 (the card I'm running this on).

Buffer Code:

ALU -- 26
Fetch -- 2
Write -- 1
Est Cycles -- 53.42
ALU:Fetch -- 20.17
BottleNeck -- Global Fetch
Thread/Clock -- 0.60
Throughput -- 509 M Threads\Sec

Image Code:

ALU -- 33
Fetch -- 2
Write -- 1
Est Cycles -- 31.03
ALU:Fetch -- 1.45
BottleNeck -- ALU Ops
Thread/Clock -- 1.03
Throughput -- 876 M Threads\Sec

What I don't understand is that the throughput went up by 72% and the estimated cycles went down, yet the execution time is about 27 times longer. I have also tried using the LDS, but my results were inconclusive at best. I am probably doing something wrong, but I just don't see it. :-)


------------------------
Conversion Functions
------------------------

//Complete - Tested
double2 toDouble2(float4 f4)
{
    return *((double2 *)((void*)&f4));
}

//Complete - Tested
float4 toFloat4(double2 d2)
{
    return *((float4 *)((void*)&d2));
}

//Complete - Tested
int getVec2Index(int index)
{
    return index/2;
}

//Complete - Tested
double getVec2Element(double2 dv, int index)
{
    switch(index%2)
    {
        case 0: return dv.s0;
        case 1: return dv.s1;
    }
    return -999999;
}

----------------------------------------------------------------------------------
Original Unoptimized code (Buffer) [Processes 1 element per iteration]
----------------------------------------------------------------------------------

kernel void <kernel_name> (... , constant double * inputMatrix, ... )
{
    ...
    double total = 0;
    <for loop>
    {
        total += (inputMatrix[<row_offset>+x] * inputMatrix[<row_offset>+y]);
    }
    ...
}

---------------------------------------------------------------------
Image Optimized Code [Processes 2 elements per iteration]
---------------------------------------------------------------------

kernel void <kernel_name> (... , read_only image2d_t inputMatrix, ... )
{
    ...
    double total = 0;
    <for loop>
    {
        total += toDouble2(read_imagef(inputMatrix, sampler, (int2)(x,<row_offset>)))
               * getVec2Element(toDouble2(read_imagef(inputMatrix, sampler, (int2)(getVec2Index(y), <row_offset>))),y);
    }
    ...
}
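
For readers following the index helpers, here is a small host-side C sketch of how an element index maps to a pixel and a lane when two doubles share one pixel. The `pixel_pair` struct and the `fetch_element` helper are hypothetical names used only for illustration; the other two functions mirror `getVec2Index` and `getVec2Element` from the post. It assumes non-negative indices.

```c
/* Hypothetical stand-in for one image pixel holding two doubles (a double2). */
typedef struct { double s0, s1; } pixel_pair;

/* Pixel that holds element `index` when two doubles share one pixel. */
static int get_vec2_index(int index)
{
    return index / 2;
}

/* Select the lane inside the pixel: even indices read s0, odd indices read s1.
   Assumes index >= 0 (in C, a negative index % 2 can be -1). */
static double get_vec2_element(pixel_pair p, int index)
{
    return (index % 2 == 0) ? p.s0 : p.s1;
}

/* Fetch element `index` from a row stored as packed pairs:
   first find the pixel, then pick the lane within it. */
static double fetch_element(const pixel_pair *row, int index)
{
    return get_vec2_element(row[get_vec2_index(index)], index);
}
```

With a row holding the pairs (10, 11) and (12, 13), `fetch_element(row, 3)` reads pixel 1 and returns its second lane, 13. Every scalar element fetched this way costs one whole-pixel read followed by a lane select.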