
Archives Discussions

sir_um
Journeyman III

Double Precision Image

Workaround causes intense slowdowns

I'm working on a program that I'm trying to optimize for the GPU. I've often seen people use the GPU texture cache, via the OpenCL image memory object, to decrease execution times.

By default, images only support single precision; however, I have come up with a trick to store double-precision numbers in the image object. It involves pointer casting; I've attached the casting functions below. The base code, with no optimizations, runs in about 36.4 milliseconds, while the image code takes 976 milliseconds.

For legal reasons, I cannot attach all my code, so I will be limited to showing you the lines of code in question.

My original code accessed an array/1D-buffer stored in constant memory. The elements needed for the calculations were retrieved via pointers/array indexing. Via the attached conversion methods, I am able to successfully store and retrieve a double2 inside a single pixel of a float4 image with no data corruption.

My problem is that the image code runs 27.5 times slower than the unoptimized code. As far as I can tell, the image code provides the same functionality as the buffer code with little additional overhead. The Stream Kernel Analyzer gives the following stats for the two versions of the code on the Radeon HD 5870 (the card I'm running this on).

Buffer Code:

ALU -- 26
Fetch -- 2
Write -- 1
Est Cycles -- 53.42
ALU:Fetch -- 20.17
BottleNeck -- Global Fetch
Thread/Clock -- 0.60
Throughput -- 509 M Threads/Sec

Image Code:

ALU -- 33
Fetch -- 2
Write -- 1
Est Cycles -- 31.03
ALU:Fetch -- 1.45
BottleNeck -- ALU Ops
Thread/Clock -- 1.03
Throughput -- 876 M Threads/Sec

What I don’t understand is that the throughput went up by 72% and the estimated cycles went down, but the execution time is 27.5 times longer. I have also tried using the LDS but my results were inconclusive at best. I am probably doing something wrong, but I just don’t see it. 🙂

Thanks,
-Chris

------------------------ Conversion Functions ------------------------

//Complete - Tested
double2 toDouble2(float4 f4) {
    return *((double2 *)((void*)&f4));
}

//Complete - Tested
float4 toFloat4(double2 d2) {
    return *((float4 *)((void*)&d2));
}

//Complete - Tested
int getVec2Index(int index) {
    return index/2;
}

//Complete - Tested
double getVec2Element(double2 dv, int index) {
    switch(index%2) {
        case 0: return dv.s0;
        case 1: return dv.s1;
    }
    return -999999;
}

----------------------------------------------------------------------------------
Original Unoptimized code (Buffer) [Processes 1 element per iteration]
----------------------------------------------------------------------------------

kernel void <kernel_name> (..., constant double * inputMatrix, ...) {
    ...
    double total = 0;
    <for loop> {
        total += (inputMatrix[<row_offset>+x] * inputMatrix[<row_offset>+y]);
    }
    ...
}

---------------------------------------------------------------------
Image Optimized Code [Processes 2 elements per iteration]
---------------------------------------------------------------------

kernel void <kernel_name> (..., read_only image2d_t inputMatrix, ...) {
    ...
    double2 total = 0;
    <for loop> {
        total += toDouble2(read_imagef(inputMatrix, sampler, (int2)(x,<row_offset>)))
               * getVec2Element(toDouble2(read_imagef(inputMatrix, sampler,
                     (int2)(getVec2Index(y), <row_offset>))), y);
    }
    ...
}
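For readers following along on the host side, the byte-level layout the conversion functions above rely on can be sketched in plain C. This is illustrative only (the function names `pack_pixel` and `unpack_element` are made up for this sketch, not part of the posted kernels), and it uses memcpy rather than pointer casts so it stays within C's strict-aliasing rules:

```c
#include <assert.h>
#include <string.h>

/* Host-side sketch: two doubles occupy the same 16 bytes as one float4
 * "pixel", which is exactly what toDouble2()/toFloat4() exploit above. */

/* Pack a pair of doubles into a 4-float pixel. */
static void pack_pixel(const double d[2], float f[4]) {
    memcpy(f, d, 2 * sizeof(double));   /* 16 bytes either way */
}

/* Recover element `index` (0 or 1) of the original double pair from
 * the pixel, mirroring getVec2Element(). */
static double unpack_element(const float f[4], int index) {
    double d[2];
    memcpy(d, f, 2 * sizeof(double));
    return d[index % 2];
}
```

Since this is a pure bit copy, the round trip is exact: no rounding or conversion happens at any point.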

0 Likes
14 Replies
Jawed
Adept II

I played with the attached code and the ISA for both looks sane. I suggest you work ground-up to see what's exploding in your face.

As an aside: when SKA shows statistics, the performance numbers it shows for code with one or more loops is based upon the Flow Control options. In SKA open the options dialog by doing Edit|Options and you will see three items: Branch Coherence, Avg Loop Count and Max Loop Count.

In my experience this feature is extremely woolly. Try setting Branch Coherence to 100% if all work items run the loop the same number of times. If not, well, you're in woolly land. Also set the loop count. You can see the effect on the reported statistics as you increase the loop count from a low value (e.g. 10) to a higher value (e.g. 100).

In the statistics you've posted I see ALU:fetch numbers. They shouldn't differ by an order of magnitude. I'm afraid to say I suspect you've been bitten by the compiler going berserk.

// Original Unoptimized code (Buffer) [Processes 1 element per iteration]
kernel void f2d1 (global double *inputMatrix, int x, int y, int z,
                  global double *output)
{
    int pos = get_global_id(0);
    double total = 0;
    for (; pos < z; ++pos)
        total += (inputMatrix[pos+x] * inputMatrix[pos+y]);
    output[pos] = total;
}

// Image Optimized Code [Processes 2 elements per iteration]
kernel void f2d2 (read_only image2d_t inputMatrix, int x, int y, int z,
                  global double2 * output)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP |
                              CLK_FILTER_NEAREST;
    int pos = get_global_id(0);
    double2 total = 0;
    for (; pos < z; ++pos)
        total += toDouble2(read_imagef(inputMatrix, sampler, (int2)(x,pos)))
               * getVec2Element(toDouble2(read_imagef(inputMatrix, sampler,
                     (int2)(getVec2Index(y), pos))), y);
    output[pos] = total;
}

0 Likes

Hi sir.um,

It was interesting to see your trick. Can you please post a small test case which shows the issue, so we can get this fixed?

0 Likes

Originally posted by: himanshu.gautam

Can you please post a small testcase which shows the issue, so we can get this fixed.

Perhaps this is not as small as you intended, but it has full error detection and is in a standardized format. Sorry, once I start coding I can't help but add features. 🙂

Let me know if there is anything else you need, and thank you for looking into this.

-Chris

------------------ --- Kernels.cl --- ------------------ #pragma OPENCL EXTENSION cl_amd_fp64 : enable #pragma OPENCL EXTENSION cl_amd_printf : enable double getVec2Element(double2 dv, int index); double toDouble (float2 f2); float2 toFloat2 (double d); double2 toDouble2 (float4 f4); float4 toFloat4 (double2 d2); kernel void code_buf ( int numRows, int rowWidth, constant double * inputMatrix, global double * resultMatrix) { int x = get_global_id(0); int y = get_global_id(1); double total = 0; int rowIndex = 0; for (int row=0; row<numRows; row++) { total += (inputMatrix[rowIndex+x] * inputMatrix[rowIndex+y]); rowIndex += rowWidth; } resultMatrix[(y*rowWidth)+x] = total; } kernel void code_img ( int numRows, int rowWidth, read_only image2d_t inputMatrix, global double2 * resultMatrix) { sampler_t sampler = CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE; int x = get_global_id(0); int y = get_global_id(1); double2 total = 0; for (int row=0; row<numRows; row++) { total += toDouble2(read_imagef(inputMatrix, sampler, (int2)(x,row))) * getVec2Element( toDouble2(read_imagef(inputMatrix, sampler, (int2)((int)(y/2), row))),y); } resultMatrix[(y*rowWidth)+x] = total; } //Complete - Tested kernel void conversionTest() { //Test faking a double from 2 floats. 
//init double double d = 5; printf("d:\t%f\n", d); //convert double to float2 float2 f2 = toFloat2(d); //copy float2 value to new float2 float2 f2_2 = f2; //convert new float2 back into double double d2 = toDouble(f2_2); printf("d2:\t%f\n", d2); } //Complete - Tested double getVec2Element(double2 dv, int index) { switch(index%2) { case 0: return dv.s0; case 1: return dv.s1; } return -999999; } //Complete - Tested double toDouble(float2 f2) { return *((double *)((void*)&f2)); } //Complete - Tested float2 toFloat2(double d) { return *((float2 *)((void*)&d)); } //Complete - Tested double2 toDouble2(float4 f4) { return *((double2 *)((void*)&f4)); } //Complete - Tested float4 toFloat4(double2 d2) { return *((float4 *)((void*)&d2)); } ---------------- --- main.cpp --- ---------------- /* * main.cpp * * Created on: Apr 5, 2011 * Author: Chris Alexander */ //OpenCL Includes #define __NO_STD_VECTOR #include <CL/cl.hpp> //C++ API Includes #include <iostream> #include <fstream> using namespace std; using namespace cl; double * inputMatrix; double * resultMatrix; #define RUN_LOOPS //#define DEBUG //#define DISABLE_CL_OPTIMIZATIONS void checkErr(cl_int err, const char * name); char * print_cl_errstring(cl_int err); //Variables ImageFormat IMAGE_FORAMAT; cl::vector<Device> * GPU_DEVICES; Device * GPU_DEVICE; Context * GPU_CONTEXT; CommandQueue * GPU_QUEUE; Program * GPU_PROGRAM; int inputWidth; int inputHeight; int inputArea; int resultWidth; int resultHeight; int resultArea; int main() { cout<<"Running!"<<endl; int workGroupWidth = 250; int workGroupHeight = 1; int dim = 4000; inputWidth = dim; inputHeight = dim; resultWidth = dim; resultHeight = dim; inputArea = inputWidth*inputHeight; resultArea = resultWidth*resultHeight; //--OpenCL Init cl_int err = CL_SUCCESS; //Request Supported Platforms from OpenCL cl::vector<Platform> platformList; err = Platform::get(&platformList); if (err == CL_INVALID_VALUE) { cout<<"Invalid Value!"<<endl; } if (err == CL_OUT_OF_HOST_MEMORY) { 
cout<<"Out of host memory!"<<endl; } checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "Get Platform List"); Platform platform = platformList[0]; //init environment constants IMAGE_FORAMAT.image_channel_order = CL_RGBA; IMAGE_FORAMAT.image_channel_data_type = CL_FLOAT; int IMG_DEGREE = 2; //2 doubles per pixel = 4 floats per pixel cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platform)(), 0}; //Query OpenCL for present devices GPU_DEVICES = new cl::vector<Device>(); err = platform.getDevices(CL_DEVICE_TYPE_GPU, GPU_DEVICES); checkErr(err, "Get GPU Devices"); GPU_DEVICE = &((*GPU_DEVICES)[0]); //Create Contexts GPU_CONTEXT = new Context(*GPU_DEVICES, cprops, NULL, NULL, &err); checkErr(err, "Create GPU Context"); GPU_QUEUE = new CommandQueue(*GPU_CONTEXT, *GPU_DEVICE, 0, &err); checkErr(err, "Create GPU CommandQueue"); //--GPU //Load Kernel from file into String Object ifstream gpuFile("src/kernels.cl"); std::string gpuProg(istreambuf_iterator<char>(gpuFile),(istreambuf_iterator<char>())); //Create ProgramSource object from String Source Program::Sources gpuSource(1, make_pair(gpuProg.c_str(), gpuProg.length())); //Create Program object from ProgramSource object GPU_PROGRAM = new Program(*GPU_CONTEXT, gpuSource, &err); checkErr(err, "Create GPU Program"); //Compile GPU Program #ifdef DISABLE_CL_OPTIMIZATIONS err = (*GPU_PROGRAM).build(*GPU_DEVICES, "-g -cl-opt-disable"); #else err = (*GPU_PROGRAM).build(*GPU_DEVICES, ""); #endif if (err) { if (err == CL_BUILD_PROGRAM_FAILURE) { cerr<<"Build Failed!"<<endl; } std::string log = (*GPU_PROGRAM).getBuildInfo<CL_PROGRAM_BUILD_LOG>(*GPU_DEVICE); std::cerr<<"GPU - Build Log:\n"<<log<<endl; exit(EXIT_FAILURE); } cout<<"OpenCL Initialized!"<<endl; //--Populate cout<<"Populate..."; inputMatrix = new double[inputArea]; resultMatrix = new double[resultArea]; //Input Matrix for (int x=0; x<inputArea; x++) { inputMatrix = x; } //Result Matrix for (int x=0; x<resultArea; x++) { resultMatrix = 0; } 
cout<<"Complete!"<<endl; //--Execute Image2D * inputMtrx_img; Buffer * inputMtrx_buf; Buffer * resultMtrx; //Events Event kernelEvent; Event copyInputMatrixData; Event copyResultMatrixData; /************************************************** ********** Start Conversion Test Kernel ********** **************************************************/ Kernel kernel = Kernel(*GPU_PROGRAM, "conversionTest", &err); checkErr(err, "Create GPU Kernel (conversionTest)"); //Enqueue Kernel err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(1,1), NDRange(1,1), NULL, &kernelEvent); checkErr(err, "enqueue GPU Kernel (conversionTest)"); //Wait for kernel to finish kernelEvent.wait(); checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish"); /************************************************** ********** Start Conversion Test Kernel ********** **************************************************/ /***************************************** ********** Start Buffer Kernel ********** *****************************************/ kernel = Kernel(*GPU_PROGRAM, "code_buf", &err); checkErr(err, "Create GPU Kernel (code_buf)"); cout<<"Running Buffer Code..."; cout.flush(); #ifdef RUN_LOOPS int before = GetTickCount(); int numIterations = 10; for (int x=0; x<numIterations; x++) #endif { //Input Matrix inputMtrx_buf = new Buffer(*GPU_CONTEXT, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(double)*inputWidth*inputHeight, inputMatrix, &err); checkErr(err, "Create InputMatrix Buffer"); err = (*GPU_QUEUE).enqueueWriteBuffer(*inputMtrx_buf, CL_FALSE, 0, sizeof(double)*inputArea, inputMatrix, NULL, &copyInputMatrixData); checkErr(err, "enqueueWriteBuffer() - Copy data to vectrA Buffer obj"); //Result Matrix resultMtrx = new Buffer(*GPU_CONTEXT, CL_MEM_READ_WRITE, sizeof(double)*resultArea, NULL, &err); checkErr(err, "Create ResultMatrix Buffer"); //Wait for data copy to finish err = copyInputMatrixData.wait(); checkErr(err, "enqueueWriteBufferEvent.wait() - Wait for inputMtrx data copy 
to finish"); //Set Kernel Args int argIndex = 0; err = kernel.setArg(argIndex++, inputHeight); checkErr(err, "Kernel(GPU) - setArg(0, inputHeight)"); err = kernel.setArg(argIndex++, inputWidth); checkErr(err, "Kernel(GPU) - setArg(1, inputWidth)"); err = kernel.setArg(argIndex++, *inputMtrx_buf); checkErr(err, "Kernel(GPU) - setArg(2, inputMtrx_buf)"); err = kernel.setArg(argIndex++, *resultMtrx); checkErr(err, "Kernel(GPU) - setArg(3, resultMtrx)"); //Enqueue Kernel err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(resultWidth,resultHeight), NDRange(workGroupWidth, workGroupHeight), NULL, &kernelEvent); checkErr(err, "enqueue GPU Kernel (code_buf)"); //Wait for kernel to finish kernelEvent.wait(); checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish"); //Read Result data to result vector err = (*GPU_QUEUE).enqueueReadBuffer(*resultMtrx, CL_FALSE, 0, sizeof(double)*resultArea, resultMatrix, NULL, &copyResultMatrixData); checkErr(err, "enqueueReadBuffer() - Copy data from resultMtrx Buffer obj"); //Wait for data copy to finish err = copyResultMatrixData.wait(); checkErr(err, "enqueueReadEvent.wait() - Wait for resultMtrx data copy to finish"); //clean up after this iteration. 
checkErr(CL_SUCCESS, "Begin Pointer Cleanup"); delete inputMtrx_buf; delete resultMtrx; checkErr(CL_SUCCESS, "End Pointer Cleanup"); } #ifdef RUN_LOOPS int after = GetTickCount(); long totalTime = after-before; double iterationTime = ((double)(totalTime))/numIterations; cout<<"Done!"<<endl <<" Buffer Total Time: "<<totalTime<<endl <<"Buffer Iteration Time: "<<iterationTime<<endl; #else cout<<"Done!"<<endl; #endif /*************************************** ********** End Buffer Kernel ********** ***************************************/ /************************************** ********** Start IMG Kernel ********** **************************************/ kernel = Kernel(*GPU_PROGRAM, "code_img", &err); checkErr(err, "Create GPU Kernel (code_img)"); #ifdef RUN_LOOPS cout<<"Running Image Code..."; cout.flush(); before = GetTickCount(); numIterations = 1; for (int x=0; x<numIterations; x++) #endif { //Input Matrix inputMtrx_img = new Image2D(*GPU_CONTEXT, CL_MEM_READ_ONLY, IMAGE_FORAMAT, inputWidth/IMG_DEGREE, inputHeight, 0, NULL, &err); checkErr(err, "Create InputMatrix Buffer"); //Origin cl::size_t<3> origin; origin[0] = 0;//x origin[1] = 0;//y origin[2] = 0;//z - On a 2d Image, the Z param MUST be zero(0). //Region cl::size_t<3> region; region[0] = inputWidth/IMG_DEGREE; //width region[1] = inputHeight; //height region[2] = 1;//z - On a 2d Image, the Z param MUST be one(1). 
err = (*GPU_QUEUE).enqueueWriteImage(*inputMtrx_img, CL_FALSE, origin, region, 0, 0, inputMatrix, NULL, &copyInputMatrixData); checkErr(err, "enqueueWriteImage() - Copy data to inputMtrx_img Image2D obj"); //Result Matrix resultMtrx = new Buffer(*GPU_CONTEXT, CL_MEM_READ_WRITE, sizeof(double)*resultArea, NULL, &err); checkErr(err, "Create ResultMatrix Buffer"); //Wait for data copy to finish err = copyInputMatrixData.wait(); checkErr(err, "enqueueWriteImageEvent.wait() - Wait for inputMtrx data copy to finish"); //Set Kernel Args int argIndex = 0; err = kernel.setArg(argIndex++, inputHeight); checkErr(err, "Kernel(GPU) - setArg(0, inputHeight)"); err = kernel.setArg(argIndex++, inputWidth/IMG_DEGREE); checkErr(err, "Kernel(GPU) - setArg(1, inputWidth/IMG_DEGREE)"); err = kernel.setArg(argIndex++, *inputMtrx_img); checkErr(err, "Kernel(GPU) - setArg(2, inputMtrx_img)"); err = kernel.setArg(argIndex++, *resultMtrx); checkErr(err, "Kernel(GPU) - setArg(3, resultMtrx)"); //Enqueue Kernel err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(resultWidth,resultHeight), NDRange(workGroupWidth, workGroupHeight), NULL, &kernelEvent); checkErr(err, "enqueue GPU Kernel (code_IMG)"); //Wait for kernel to finish kernelEvent.wait(); checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish"); //Read Result data to result vector err = (*GPU_QUEUE).enqueueReadBuffer(*resultMtrx, CL_FALSE, 0, sizeof(double)*resultArea, resultMatrix, NULL, &copyResultMatrixData); checkErr(err, "enqueueReadBuffer() - Copy data from resultMtrx Buffer obj"); //Wait for data copy to finish err = copyResultMatrixData.wait(); checkErr(err, "enqueueReadEvent.wait() - Wait for resultMtrx data copy to finish"); //clean up after this iteration. 
checkErr(CL_SUCCESS, "Begin Pointer Cleanup"); delete inputMtrx_img; delete resultMtrx; checkErr(CL_SUCCESS, "End Pointer Cleanup"); } #ifdef RUN_LOOPS after = GetTickCount(); totalTime = after-before; iterationTime = ((double)(totalTime))/numIterations; cout<<"Done!"<<endl <<" Image Total Time: "<<totalTime<<endl <<"Image Iteration Time: "<<iterationTime<<endl; #else cout<<"Done!"<<endl; #endif /************************************ ********** End IMG Kernel ********** ************************************/ cout<<"\n\nProgram Complete!"<<endl; } void checkErr(cl_int err, const char * name) { #ifdef DEBUG cerr << name << endl; cerr.flush(); #endif if (err != CL_SUCCESS) { cerr << "ERROR: " << name << " (" << err << ": " << print_cl_errstring(err) << ")" << endl; exit(EXIT_FAILURE); } } char *print_cl_errstring(cl_int err) { switch (err) { case CL_SUCCESS: return strdup("Success!"); case CL_DEVICE_NOT_FOUND: return strdup("Device not found."); case CL_DEVICE_NOT_AVAILABLE: return strdup("Device not available"); case CL_COMPILER_NOT_AVAILABLE: return strdup("Compiler not available"); case CL_MEM_OBJECT_ALLOCATION_FAILURE: return strdup("Memory object allocation failure"); case CL_OUT_OF_RESOURCES: return strdup("Out of resources"); case CL_OUT_OF_HOST_MEMORY: return strdup("Out of host memory"); case CL_PROFILING_INFO_NOT_AVAILABLE: return strdup("Profiling information not available"); case CL_MEM_COPY_OVERLAP: return strdup("Memory copy overlap"); case CL_IMAGE_FORMAT_MISMATCH: return strdup("Image format mismatch"); case CL_IMAGE_FORMAT_NOT_SUPPORTED: return strdup("Image format not supported"); case CL_BUILD_PROGRAM_FAILURE: return strdup("Program build failure"); case CL_MAP_FAILURE: return strdup("Map failure"); case CL_INVALID_VALUE: return strdup("Invalid value"); case CL_INVALID_DEVICE_TYPE: return strdup("Invalid device type"); case CL_INVALID_PLATFORM: return strdup("Invalid platform"); case CL_INVALID_DEVICE: return strdup("Invalid device"); case 
CL_INVALID_CONTEXT: return strdup("Invalid context"); case CL_INVALID_QUEUE_PROPERTIES: return strdup("Invalid queue properties"); case CL_INVALID_COMMAND_QUEUE: return strdup("Invalid command queue"); case CL_INVALID_HOST_PTR: return strdup("Invalid host pointer"); case CL_INVALID_MEM_OBJECT: return strdup("Invalid memory object"); case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR: return strdup("Invalid image format descriptor"); case CL_INVALID_IMAGE_SIZE: return strdup("Invalid image size"); case CL_INVALID_SAMPLER: return strdup("Invalid sampler"); case CL_INVALID_BINARY: return strdup("Invalid binary"); case CL_INVALID_BUILD_OPTIONS: return strdup("Invalid build options"); case CL_INVALID_PROGRAM: return strdup("Invalid program"); case CL_INVALID_PROGRAM_EXECUTABLE: return strdup("Invalid program executable"); case CL_INVALID_KERNEL_NAME: return strdup("Invalid kernel name"); case CL_INVALID_KERNEL_DEFINITION: return strdup("Invalid kernel definition"); case CL_INVALID_KERNEL: return strdup("Invalid kernel"); case CL_INVALID_ARG_INDEX: return strdup("Invalid argument index"); case CL_INVALID_ARG_VALUE: return strdup("Invalid argument value"); case CL_INVALID_ARG_SIZE: return strdup("Invalid argument size"); case CL_INVALID_KERNEL_ARGS: return strdup("Invalid kernel arguments"); case CL_INVALID_WORK_DIMENSION: return strdup("Invalid work dimension"); case CL_INVALID_WORK_GROUP_SIZE: return strdup("Invalid work group size"); case CL_INVALID_WORK_ITEM_SIZE: return strdup("Invalid work item size"); case CL_INVALID_GLOBAL_OFFSET: return strdup("Invalid global offset"); case CL_INVALID_EVENT_WAIT_LIST: return strdup("Invalid event wait list"); case CL_INVALID_EVENT: return strdup("Invalid event"); case CL_INVALID_OPERATION: return strdup("Invalid operation"); case CL_INVALID_GL_OBJECT: return strdup("Invalid OpenGL object"); case CL_INVALID_BUFFER_SIZE: return strdup("Invalid buffer size"); case CL_INVALID_MIP_LEVEL: return strdup("Invalid mip-map level"); default: return 
strdup("Unknown"); } }

0 Likes
Melkhior
Journeyman III

Originally posted by: sir.um

 

By default this only supports Single Precision, however, I have come up with a trick to store double precision numbers in the image object. It involves pointer casting. I've attached the casting functions below. The base code, with no optimizations runs in about 36.4 milliseconds, while the image code takes 976 Milliseconds.

That's a common trick, but you really don't want to use pointers. They will force the compiler to do a round-trip through memory that is very difficult to optimize away. The 'best' way is to use a union, as you get to share the bytes without tricking the compiler. You also probably want to put the auxiliary functions where they can be inlined. Something like:

union ff2d {
        float2 f;
        double d;
};
union ffff2dd {
        float4 f;
        double2 d;
};
double toDouble(float2 f2) {
        union ff2d a;
        a.f = f2;
        return a.d;
}
float2 toFloat2(double d) {
        union ff2d a;
        a.d = d;
        return a.f;
}
double2 toDouble2(float4 f4) {
        union ffff2dd a;
        a.f = f4;
        return a.d;
}
float4 toFloat4(double2 d2) {
        union ffff2dd a;
        a.d = d2;
        return a.f;
}

(warning, untested!)

0 Likes

You should use int32 textures: with float textures, subnormal float values can be flushed to zero.
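The hazard nou is pointing at is worth spelling out: half of a perfectly ordinary double's bit pattern can look like a subnormal float, so a texture unit that flushes subnormals to zero would silently corrupt the double. A host-side C sketch (the function name is made up for illustration; the check on the low 32 bits is endianness-independent because it works on the integer value, not memory order):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Does the low 32-bit half of this double's bit pattern classify as a
 * subnormal float? If yes, a flush-to-zero float texture read would
 * corrupt the double packed into it. */
static int low_half_is_subnormal(double d) {
    uint64_t bits;
    uint32_t lo_bits;
    float lo;
    memcpy(&bits, &d, sizeof bits);
    lo_bits = (uint32_t)(bits & 0xFFFFFFFFu);  /* low 32 bits of the double */
    memcpy(&lo, &lo_bits, sizeof lo);
    return fpclassify(lo) == FP_SUBNORMAL;
}
```

For example, 1.0 plus one ulp has bit pattern 0x3FF0000000000001: its low word is 1, which as a float is the smallest subnormal, so that entirely normal double would round-trip through a flushing float texture as exactly 1.0.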

0 Likes

Originally posted by: Melkhior

That's a common trick, but you really don't want to use pointers. They will force the compiler to do a round-trip through memory that is very difficult to optimize away. The 'best' way is to use a union, as you get to share the bytes without tricking the compiler. [...]

OpenCL has a function called as_double() that reinterprets 64 bits without a cast. So, you can make a two channel image, sample into a float2 and call as_double to get your number. There's as_float2 to go the other way...
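The as_typen() built-ins do a bit-for-bit reinterpretation with no conversion. The same effect can be sketched on the host in C (the names here are local stand-ins for illustration, not the OpenCL built-ins, and the sketch assumes the two-float struct is packed into 8 bytes, as it is on common ABIs):

```c
#include <assert.h>
#include <string.h>

typedef struct { float x, y; } float2_t;   /* stand-in for OpenCL float2 */

/* What as_double(float2) does: reinterpret the same 64 bits. */
static double as_double_sketch(float2_t f) {
    double d;
    memcpy(&d, &f, sizeof d);   /* pure bit copy, 8 bytes */
    return d;
}

/* What as_float2(double) does: the reverse reinterpretation. */
static float2_t as_float2_sketch(double d) {
    float2_t f;
    memcpy(&f, &d, sizeof f);
    return f;
}
```

Because no conversion takes place, a double survives the round trip through the float2 representation exactly.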

0 Likes

Originally posted by: Jawed

As an aside: when SKA shows statistics, the performance numbers it shows for code with one or more loops is based upon the Flow Control options. In SKA open the options dialog by doing Edit|Options and you will see three items: Branch Coherence, Avg Loop Count and Max Loop Count.

In my experience this feature is extremely woolly. Try setting Branch Coherence to 100% if all work items run the loop the same number of times. If not, well, you're in woolly land. Also set the loop count. You can see the effect on the reported statistics as you increase the loop count from a low value (e.g. 10) to a higher value (e.g. 100).



Thanks Jawed, this really helps. I changed those settings and the stats changed drastically; I assume they are more accurate now. 🙂

 

Originally posted by: nou

you should use a int32 textures as with float textures it can flush subnormal float number to zero.



Thanks, I went with CL_UNSIGNED_INT32

 

Originally posted by: rick.weber

OpenCL has a function called as_double() that reinterprets 64 bits without a cast. So, you can make a two channel image, sample into a float2 and call as_double to get your number. There's as_float2 to go the other way...



Thanks, I didn't know about these. However, the OpenCL spec indicates that the as_typen() functions exist only to support legacy code based on C extensions designed around specific architectures, and, furthermore, "...this sort of conversion is not likely to be portable except to other OpenCL implementations for the same hardware architecture." (See below.)

 

Quoted From: OpenCL-1.1.pdf   (Footnote #27 on page 171; Section 6.2.4.1 Reinterpreting Types Using Unions)

...some other extensions to the C language designed to support particular vector ISA (e.g. AltiVec™, CELL Broadband Engine™ Architecture) use such conversions in conjunction with swizzle operators to achieve type unconversion. So as to support legacy code of this type, as_typen() allows conversions between vectors of the same size but different numbers of elements, even though the behavior of this sort of conversion is not likely to be portable except to other OpenCL implementations for the same hardware architecture.



 

Originally posted by: melkhior

That's a common trick, but you really don't want to use pointers. They will force the compiler to do a round-trip to memory, that is very difficult to optimize away.



Aww. I thought I had come up with something cool. 🙂

 

Originally posted by: melkhior

The 'best' way is to use a union, as you get to share bytes w/o tricking the compiler.



That is awesome! I totally didn't even think about unions! You're right, that is much better, although I didn't notice much change in the execution times.

 

Originally posted by: melkhior

You also probably want to put the auxiliary functions where they can be inlined.



I don't understand what you mean.

 

Originally posted by: melkhior

.
.
.
code
.
.

(warning, untested!)



I tested them and they work great!

 

Originally posted by: himanshu.gautam

Can you please post a small testcase which shows the issue, so we can get this fixed.



If this gets fixed, I assume the fix won't reach me until the next version of the SDK? When is SDK v2.4 scheduled for release?

I've attached the latest code, implementing the improvements given on this thread.

thanks again,
-Chris

------------------ --- Kernels.cl --- ------------------ #pragma OPENCL EXTENSION cl_amd_fp64 : enable #pragma OPENCL EXTENSION cl_amd_printf : enable double getVec2Element(double2 dv, int index); double toDouble (float2 f2); float2 toFloat2 (double d); double2 toDouble2 (float4 f4); float4 toFloat4 (double2 d2); kernel void code_buf ( int numRows, int rowWidth, constant double * inputMatrix, global double * resultMatrix) { int x = get_global_id(0); int y = get_global_id(1); double total = 0; int rowIndex = 0; for (int row=0; row<numRows; row++) { total += (inputMatrix[rowIndex+x] * inputMatrix[rowIndex+y]); rowIndex += rowWidth; } resultMatrix[(y*rowWidth)+x] = total; } kernel void code_img ( int numRows, int rowWidth, read_only image2d_t inputMatrix, global double2 * resultMatrix) { sampler_t sampler = CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE; int x = get_global_id(0); int y = get_global_id(1); double2 total = 0; for (int row=0; row<numRows; row++) { total += toDouble2(read_imagef(inputMatrix, sampler, (int2)(x,row))) * getVec2Element( toDouble2(read_imagef(inputMatrix, sampler, (int2)((int)(y/2), row))),y); } resultMatrix[(y*rowWidth)+x] = total; } //Complete - Tested kernel void conversionTest() { //--double / float2 //init double double dBefore = 5.4321; printf("d-before:\t%f\n", dBefore); //convert double to float2 float2 f2 = toFloat2(dBefore); //copy float2 value to new float2 float2 f2_2 = f2; //convert new float2 back into double double dAfter = toDouble(f2_2); printf("d-after:\t%f\n", dAfter); //--double2 / float4 //init double double2 d2Before = (double2) (6.54321, 7.654321); printf("d2[1]-before:\t%f\nd2[2]-before:\t%f\n", d2Before.s0, d2Before.s1); //convert double to float2 float4 f4 = toFloat4(d2Before); //copy float2 value to new float2 float4 f4_4 = f4; //convert new float2 back into double double2 d2After = toDouble2(f4_4); printf("d2[1]-after:\t%f\nd2[2]-after:\t%f\n", d2After.s0, d2After.s1); } //Complete - Tested 
union F2toD1 { float2 f; double d; }; //Complete - Tested union F4toD2 { float4 f; double2 d; }; //Complete - Tested double toDouble(float2 f2) { union F2toD1 a; a.f = f2; return a.d; } //Complete - Tested float2 toFloat2(double d) { union F2toD1 a; a.d = d; return a.f; } //Complete - Tested double2 toDouble2(float4 f4) { union F4toD2 a; a.f = f4; return a.d; } //Complete - Tested float4 toFloat4(double2 d2) { union F4toD2 a; a.d = d2; return a.f; } //Complete - Tested double getVec2Element(double2 dv, int index) { switch(index%2) { case 0: return dv.s0; case 1: return dv.s1; } return -999999; } ---------------- --- main.cpp --- ---------------- /* * main.cpp * * Created on: Apr 5, 2011 * Author: Chris Alexander */ //OpenCL Includes #define __NO_STD_VECTOR #include <CL/cl.hpp> //C++ API Includes #include <iostream> #include <fstream> using namespace std; using namespace cl; double * inputMatrix; double * resultMatrix; #define RUN_LOOPS //#define DEBUG //#define DISABLE_CL_OPTIMIZATIONS void checkErr(cl_int err, const char * name); char * print_cl_errstring(cl_int err); //Variables ImageFormat IMAGE_FORAMAT; cl::vector<Device> * GPU_DEVICES; Device * GPU_DEVICE; Context * GPU_CONTEXT; CommandQueue * GPU_QUEUE; Program * GPU_PROGRAM; int inputWidth; int inputHeight; int inputArea; int resultWidth; int resultHeight; int resultArea; int main() { cout<<"Running!"<<endl; int workGroupWidth = 250; int workGroupHeight = 1; int dim = 4000; inputWidth = dim; inputHeight = dim; resultWidth = dim; resultHeight = dim; inputArea = inputWidth*inputHeight; resultArea = resultWidth*resultHeight; //--OpenCL Init cl_int err = CL_SUCCESS; //Request Supported Platforms from OpenCL cl::vector<Platform> platformList; err = Platform::get(&platformList); if (err == CL_INVALID_VALUE) { cout<<"Invalid Value!"<<endl; } if (err == CL_OUT_OF_HOST_MEMORY) { cout<<"Out of host memory!"<<endl; } checkErr(platformList.size()!=0 ? 
                     CL_SUCCESS : -1, "Get Platform List");
    Platform platform = platformList[0];

    //init environment constants
    IMAGE_FORMAT.image_channel_order = CL_RGBA;
    IMAGE_FORMAT.image_channel_data_type = CL_UNSIGNED_INT32;
    int IMG_DEGREE = 2; //2 doubles per pixel = 4 floats per pixel
    cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platform)(), 0};

    //Query OpenCL for present devices
    GPU_DEVICES = new cl::vector<Device>();
    err = platform.getDevices(CL_DEVICE_TYPE_GPU, GPU_DEVICES);
    checkErr(err, "Get GPU Devices");
    GPU_DEVICE = &((*GPU_DEVICES)[0]);

    //Create Contexts
    GPU_CONTEXT = new Context(*GPU_DEVICES, cprops, NULL, NULL, &err);
    checkErr(err, "Create GPU Context");
    GPU_QUEUE = new CommandQueue(*GPU_CONTEXT, *GPU_DEVICE, 0, &err);
    checkErr(err, "Create GPU CommandQueue");

    //--GPU
    //Load Kernel from file into String Object
    ifstream gpuFile("src/kernels.cl");
    std::string gpuProg(istreambuf_iterator<char>(gpuFile), (istreambuf_iterator<char>()));
    //Create ProgramSource object from String Source
    Program::Sources gpuSource(1, make_pair(gpuProg.c_str(), gpuProg.length()));
    //Create Program object from ProgramSource object
    GPU_PROGRAM = new Program(*GPU_CONTEXT, gpuSource, &err);
    checkErr(err, "Create GPU Program");

    //Compile GPU Program
#ifdef DISABLE_CL_OPTIMIZATIONS
    err = (*GPU_PROGRAM).build(*GPU_DEVICES, "-g -cl-opt-disable");
#else
    err = (*GPU_PROGRAM).build(*GPU_DEVICES, "");
#endif
    if (err) {
        if (err == CL_BUILD_PROGRAM_FAILURE) {
            cerr<<"Build Failed!"<<endl;
        }
        std::string log = (*GPU_PROGRAM).getBuildInfo<CL_PROGRAM_BUILD_LOG>(*GPU_DEVICE);
        std::cerr<<"GPU - Build Log:\n"<<log<<endl;
        exit(EXIT_FAILURE);
    }
    cout<<"OpenCL Initialized!"<<endl;

    //--Populate
    cout<<"Populate...";
    inputMatrix = new double[inputArea];
    resultMatrix = new double[resultArea];
    //Input Matrix
    for (int x=0; x<inputArea; x++) {
        inputMatrix[x] = x;
    }
    //Result Matrix
    for (int x=0; x<resultArea; x++) {
        resultMatrix[x] = 0;
    }
    cout<<"Complete!"<<endl;

    //--Execute
    Image2D * inputMtrx_img;
    Buffer * inputMtrx_buf;
    Buffer * resultMtrx;

    //Events
    Event kernelEvent;
    Event copyInputMatrixData;
    Event copyResultMatrixData;

    /**************************************************
    ********** Start Conversion Test Kernel **********
    **************************************************/
    Kernel kernel = Kernel(*GPU_PROGRAM, "conversionTest", &err);
    checkErr(err, "Create GPU Kernel (conversionTest)");
    //Enqueue Kernel
    err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(1), NDRange(1), NULL, &kernelEvent);
    checkErr(err, "enqueue GPU Kernel (conversionTest)");
    //Wait for kernel to finish
    kernelEvent.wait();
    checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish");
    /**************************************************
    *********** End Conversion Test Kernel ***********
    **************************************************/

    /*****************************************
    ********** Start Buffer Kernel **********
    *****************************************/
    kernel = Kernel(*GPU_PROGRAM, "code_buf", &err);
    checkErr(err, "Create GPU Kernel (code_buf)");
    cout<<"Running Buffer Code [ETA. ~2.5 seconds on Radeon HD 5870]...";
    cout.flush();
#ifdef RUN_LOOPS
    int before = GetTickCount();
    int numIterations = 20;
    for (int x=0; x<numIterations; x++)
#endif
    {
        //Input Matrix
        inputMtrx_buf = new Buffer(*GPU_CONTEXT, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(double)*inputWidth*inputHeight, inputMatrix, &err);
        checkErr(err, "Create InputMatrix Buffer");
        err = (*GPU_QUEUE).enqueueWriteBuffer(*inputMtrx_buf, CL_FALSE, 0, sizeof(double)*inputArea, inputMatrix, NULL, &copyInputMatrixData);
        checkErr(err, "enqueueWriteBuffer() - Copy data to vectrA Buffer obj");
        //Result Matrix
        resultMtrx = new Buffer(*GPU_CONTEXT, CL_MEM_READ_WRITE, sizeof(double)*resultArea, NULL, &err);
        checkErr(err, "Create ResultMatrix Buffer");
        //Wait for data copy to finish
        err = copyInputMatrixData.wait();
        checkErr(err, "enqueueWriteBufferEvent.wait() - Wait for inputMtrx data copy to finish");
        //Set Kernel Args
        int argIndex = 0;
        err = kernel.setArg(argIndex++, inputHeight);
        checkErr(err, "Kernel(GPU) - setArg(0, inputHeight)");
        err = kernel.setArg(argIndex++, inputWidth);
        checkErr(err, "Kernel(GPU) - setArg(1, inputWidth)");
        err = kernel.setArg(argIndex++, *inputMtrx_buf);
        checkErr(err, "Kernel(GPU) - setArg(2, inputMtrx_buf)");
        err = kernel.setArg(argIndex++, *resultMtrx);
        checkErr(err, "Kernel(GPU) - setArg(3, resultMtrx)");
        //Enqueue Kernel
        err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(resultWidth,resultHeight), NDRange(workGroupWidth, workGroupHeight), NULL, &kernelEvent);
        checkErr(err, "enqueue GPU Kernel (code_buf)");
        //Wait for kernel to finish
        kernelEvent.wait();
        checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish");
        //Read Result data to result vector
        err = (*GPU_QUEUE).enqueueReadBuffer(*resultMtrx, CL_FALSE, 0, sizeof(double)*resultArea, resultMatrix, NULL, &copyResultMatrixData);
        checkErr(err, "enqueueReadBuffer() - Copy data from resultMtrx Buffer obj");
        //Wait for data copy to finish
        err = copyResultMatrixData.wait();
        checkErr(err, "enqueueReadEvent.wait() - Wait for resultMtrx data copy to finish");
        //clean up after this iteration.
        checkErr(CL_SUCCESS, "Begin Pointer Cleanup");
        delete inputMtrx_buf;
        delete resultMtrx;
        checkErr(CL_SUCCESS, "End Pointer Cleanup");
    }
#ifdef RUN_LOOPS
    int after = GetTickCount();
    long totalTime = after-before;
    double iterationTime = ((double)(totalTime))/numIterations;
    cout<<"Done!"<<endl
        <<"   Buffer Total Time (in ms): "<<totalTime<<endl
        <<"Buffer Iteration Time (in ms): "<<iterationTime<<endl;
#else
    cout<<"Done!"<<endl;
#endif
    /***************************************
    ********** End Buffer Kernel **********
    ***************************************/

    /**************************************
    ********** Start IMG Kernel **********
    **************************************/
    kernel = Kernel(*GPU_PROGRAM, "code_img", &err);
    checkErr(err, "Create GPU Kernel (code_img)");
#ifdef RUN_LOOPS
    cout<<"Running Image Code [ETA. ~10 seconds on Radeon HD 5870]...";
    cout.flush();
    before = GetTickCount();
    numIterations = 2;
    for (int x=0; x<numIterations; x++)
#endif
    {
        //Input Matrix
        inputMtrx_img = new Image2D(*GPU_CONTEXT, CL_MEM_READ_ONLY, IMAGE_FORMAT, inputWidth/IMG_DEGREE, inputHeight, 0, NULL, &err);
        checkErr(err, "Create InputMatrix Buffer");
        //Origin
        cl::size_t<3> origin;
        origin[0] = 0; //x
        origin[1] = 0; //y
        origin[2] = 0; //z - On a 2d Image, the Z param MUST be zero(0).
        //Region
        cl::size_t<3> region;
        region[0] = inputWidth/IMG_DEGREE; //width
        region[1] = inputHeight;           //height
        region[2] = 1; //z - On a 2d Image, the Z param MUST be one(1).
        err = (*GPU_QUEUE).enqueueWriteImage(*inputMtrx_img, CL_FALSE, origin, region, 0, 0, inputMatrix, NULL, &copyInputMatrixData);
        checkErr(err, "enqueueWriteImage() - Copy data to inputMtrx_img Image2D obj");
        //Result Matrix
        resultMtrx = new Buffer(*GPU_CONTEXT, CL_MEM_READ_WRITE, sizeof(double)*resultArea, NULL, &err);
        checkErr(err, "Create ResultMatrix Buffer");
        //Wait for data copy to finish
        err = copyInputMatrixData.wait();
        checkErr(err, "enqueueWriteImageEvent.wait() - Wait for inputMtrx data copy to finish");
        //Set Kernel Args
        int argIndex = 0;
        err = kernel.setArg(argIndex++, inputHeight);
        checkErr(err, "Kernel(GPU) - setArg(0, inputHeight)");
        err = kernel.setArg(argIndex++, inputWidth/IMG_DEGREE);
        checkErr(err, "Kernel(GPU) - setArg(1, inputWidth/IMG_DEGREE)");
        err = kernel.setArg(argIndex++, *inputMtrx_img);
        checkErr(err, "Kernel(GPU) - setArg(2, inputMtrx_img)");
        err = kernel.setArg(argIndex++, *resultMtrx);
        checkErr(err, "Kernel(GPU) - setArg(3, resultMtrx)");
        //Enqueue Kernel
        err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(resultWidth,resultHeight), NDRange(workGroupWidth, workGroupHeight), NULL, &kernelEvent);
        checkErr(err, "enqueue GPU Kernel (code_IMG)");
        //Wait for kernel to finish
        kernelEvent.wait();
        checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish");
        //Read Result data to result vector
        err = (*GPU_QUEUE).enqueueReadBuffer(*resultMtrx, CL_FALSE, 0, sizeof(double)*resultArea, resultMatrix, NULL, &copyResultMatrixData);
        checkErr(err, "enqueueReadBuffer() - Copy data from resultMtrx Buffer obj");
        //Wait for data copy to finish
        err = copyResultMatrixData.wait();
        checkErr(err, "enqueueReadEvent.wait() - Wait for resultMtrx data copy to finish");
        //clean up after this iteration.
        checkErr(CL_SUCCESS, "Begin Pointer Cleanup");
        delete inputMtrx_img;
        delete resultMtrx;
        checkErr(CL_SUCCESS, "End Pointer Cleanup");
    }
#ifdef RUN_LOOPS
    after = GetTickCount();
    totalTime = after-before;
    iterationTime = ((double)(totalTime))/numIterations;
    cout<<"Done!"<<endl
        <<"   Image Total Time (in ms): "<<totalTime<<endl
        <<"Image Iteration Time (in ms): "<<iterationTime<<endl;
#else
    cout<<"Done!"<<endl;
#endif
    /************************************
    ********** End IMG Kernel **********
    ************************************/

    cout<<"\n\nProgram Complete!"<<endl;
}

void checkErr(cl_int err, const char * name)
{
#ifdef DEBUG
    cerr << name << endl;
    cerr.flush();
#endif
    if (err != CL_SUCCESS) {
        cerr << "ERROR: " << name << " (" << err << ": " << print_cl_errstring(err) << ")" << endl;
        exit(EXIT_FAILURE);
    }
}

char *print_cl_errstring(cl_int err)
{
    switch (err) {
        case CL_SUCCESS:                         return strdup("Success!");
        case CL_DEVICE_NOT_FOUND:                return strdup("Device not found.");
        case CL_DEVICE_NOT_AVAILABLE:            return strdup("Device not available");
        case CL_COMPILER_NOT_AVAILABLE:          return strdup("Compiler not available");
        case CL_MEM_OBJECT_ALLOCATION_FAILURE:   return strdup("Memory object allocation failure");
        case CL_OUT_OF_RESOURCES:                return strdup("Out of resources");
        case CL_OUT_OF_HOST_MEMORY:              return strdup("Out of host memory");
        case CL_PROFILING_INFO_NOT_AVAILABLE:    return strdup("Profiling information not available");
        case CL_MEM_COPY_OVERLAP:                return strdup("Memory copy overlap");
        case CL_IMAGE_FORMAT_MISMATCH:           return strdup("Image format mismatch");
        case CL_IMAGE_FORMAT_NOT_SUPPORTED:      return strdup("Image format not supported");
        case CL_BUILD_PROGRAM_FAILURE:           return strdup("Program build failure");
        case CL_MAP_FAILURE:                     return strdup("Map failure");
        case CL_INVALID_VALUE:                   return strdup("Invalid value");
        case CL_INVALID_DEVICE_TYPE:             return strdup("Invalid device type");
        case CL_INVALID_PLATFORM:                return strdup("Invalid platform");
        case CL_INVALID_DEVICE:                  return strdup("Invalid device");
        case CL_INVALID_CONTEXT:                 return strdup("Invalid context");
        case CL_INVALID_QUEUE_PROPERTIES:        return strdup("Invalid queue properties");
        case CL_INVALID_COMMAND_QUEUE:           return strdup("Invalid command queue");
        case CL_INVALID_HOST_PTR:                return strdup("Invalid host pointer");
        case CL_INVALID_MEM_OBJECT:              return strdup("Invalid memory object");
        case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR: return strdup("Invalid image format descriptor");
        case CL_INVALID_IMAGE_SIZE:              return strdup("Invalid image size");
        case CL_INVALID_SAMPLER:                 return strdup("Invalid sampler");
        case CL_INVALID_BINARY:                  return strdup("Invalid binary");
        case CL_INVALID_BUILD_OPTIONS:           return strdup("Invalid build options");
        case CL_INVALID_PROGRAM:                 return strdup("Invalid program");
        case CL_INVALID_PROGRAM_EXECUTABLE:      return strdup("Invalid program executable");
        case CL_INVALID_KERNEL_NAME:             return strdup("Invalid kernel name");
        case CL_INVALID_KERNEL_DEFINITION:       return strdup("Invalid kernel definition");
        case CL_INVALID_KERNEL:                  return strdup("Invalid kernel");
        case CL_INVALID_ARG_INDEX:               return strdup("Invalid argument index");
        case CL_INVALID_ARG_VALUE:               return strdup("Invalid argument value");
        case CL_INVALID_ARG_SIZE:                return strdup("Invalid argument size");
        case CL_INVALID_KERNEL_ARGS:             return strdup("Invalid kernel arguments");
        case CL_INVALID_WORK_DIMENSION:          return strdup("Invalid work dimension");
        case CL_INVALID_WORK_GROUP_SIZE:         return strdup("Invalid work group size");
        case CL_INVALID_WORK_ITEM_SIZE:          return strdup("Invalid work item size");
        case CL_INVALID_GLOBAL_OFFSET:           return strdup("Invalid global offset");
        case CL_INVALID_EVENT_WAIT_LIST:         return strdup("Invalid event wait list");
        case CL_INVALID_EVENT:                   return strdup("Invalid event");
        case CL_INVALID_OPERATION:               return strdup("Invalid operation");
        case CL_INVALID_GL_OBJECT:               return strdup("Invalid OpenGL object");
        case CL_INVALID_BUFFER_SIZE:             return strdup("Invalid buffer size");
        case CL_INVALID_MIP_LEVEL:               return strdup("Invalid mip-map level");
        default:                                 return strdup("Unknown");
    }
}

0 Likes

The two variants of the float4-to-double2 conversion produce identical machine code, so it's no wonder the performance is the same.

It's worth noting that the caution in the OpenCL specification is all about bitwise organisation of these data formats. The trick here relies upon the bits being in particular places for the double precision storage.

Some other hardware might have a different mapping of the 128-bits that make up the storage of double2 (or, if you prefer, the 64 bits that make up a double in comparison with a float2).

So the correct way to implement this is with as_double() and friends. These functions are documented, and since their behaviour is documented to be platform-specific, anyone working with the code will be able to unravel the reason for it failing if their platform maps the bits differently. They won't have to muck about with the code (whether it uses pointer casting or a union); they can read the specification.

So, in summary, all 3 techniques are the same.

---

On AMD GPU hardware, everything is currently inlined. There's nothing in the OpenCL spec relating to inlining.

---

As to the underlying performance problem you are encountering, you might find this thread interesting:

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=136119

Here the performance problem is due to the way a buffer is spread across the memory channels of the GPU chip (banks) and the interaction of this banking with the sequence of work items that fetches from that buffer and the sizes of the fetches.

Your case isn't exactly the same (because you have two different types of buffer) but this scenario is indicative of very low-level gotchas.

 

0 Likes

Hi sir.um,

We are looking into the issue. Meanwhile can you try your code with the new SDK. Also Please post the system configuration: CPU,GPU,SDK,DRIVER,OS.

0 Likes

Originally posted by: Jawed

So the correct way to implement this is using as_double etc. This is documented and since it is documented to be platform-specific, anyone working with the code will be able to unravel the reason for it failing if their platform works differently. They won't have to muck about with code (whether it's pointer casting or union), they can read the specification.



Good to know. Given that they are the same, I agree, the as_typen() functions are the best choice.

 

Originally posted by: Jawed

As to the underlying performance problem you are encountering, you might find this thread interesting:

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=136119

Here the performance problem is due to the way a buffer is spread across the memory channels of the GPU chip (banks) and the interaction of this banking with the sequence of work items that fetches from that buffer and the sizes of the fetches.

Your case isn't exactly the same (because you have two different types of buffer) but this scenario is indicative of very low-level gotchas.



Thanks. I read that thread and it does sound somewhat like my problem, via the use of vector arithmetic vs. scalar. However, I can't tell whether that thread arrived at a solution or just a detailed explanation of the problem.

Wow, v2.4 released today! 🙂 I downloaded the AMD APP SDK v2.4, and the new Driver for my GPU. The timings improved some, I assume, due to the performance improvements in the new SDK, but the Performance hit associated with my Double Precision Image Optimization still remains.

-Chris

------------------------------
--- Original Configuration ---
------------------------------
Windows 7 (64-bit) Home Premium
Intel Core 2 Quad Q6600 @ 2.4 GHz (overclocked to 2.94 GHz)
ATI Radeon HD 5870 Eyefinity 6 Edition [2GB GDDR5] (made by Sapphire)
AMD APP SDK v2.3
(Taken from Catalyst Control Center)
---
Driver Packaging Version: 8.801-101125a-109686E
Catalyst Version: 10.12
Provider: ATI Technologies Inc.
2D Driver Version: 8.01.01.1105
2D Driver File Path: /REGISTRY/MACHINE/SYSTEM/ControlSet001/Control/CLASS/{4D36E968-E325-11CE-BFC1-08002BE10318}/0001
Direct3D Version: 8.14.10.0798
OpenGL Version: 6.14.10.10362
Catalyst Control Center Version: 2010.1125.2148.39102

-------------------------
--- New Configuration ---
-------------------------
All hardware and OS the same.
AMD APP SDK v2.4
(Taken from Catalyst Control Center)
---
Driver Packaging Version: 8.831.2-110308a-115928C-ATI
Catalyst Version: 11.3
Provider: ATI Technologies Inc.
2D Driver Version: 8.01.01.1134
2D Driver File Path: /REGISTRY/MACHINE/SYSTEM/ControlSet001/Control/CLASS/{4D36E968-E325-11CE-BFC1-08002BE10318}/0001
Direct3D Version: 7.14.10.0817
OpenGL Version: 6.14.10.10600
Catalyst Control Center Version: 2011.0308.2325.42017

0 Likes

The thread does provide a partial solution, which is a "jig" to de-correlate the banks being used by the buffers in question.

Additionally the profiling tool is capable of highlighting the problem - Jeff mentions that the float4 case has about 25% stall rate on reads. If you are using Visual Studio then you can run the profiler on your application.

Another way to test this is to remove memory operations as a potential bottleneck. You can do this by making all work-items fetch the same element from your input buffer. Clearly the result will be wrong but now all fetches are cached so if there is a gotcha in global memory accesses, you'll see a huge increase in performance. Use the kernel analyser to verify that the kernel you are using is doing the double-precision math. The risk with this simplification is that one of the optimisers will remove the loop entirely, so you need to check that the instructions run on the GPU come out to approximately the same cycle count as for the original kernel.

The code you've posted in this thread won't work on my system. Visual Studio's compiler is crashing (not erroring-out), complaining about cl.hpp or something, so I can't play. I don't normally use the C++ bindings so I'll have to spend some time getting to the bottom of that. I'll have a go later next week.

From your point of view it might be worth testing a variant of your kernel that processes multiple double2s instead of a single double2 per work item. Try with 2 or 4 double2s. An increase in these will change the memory access pattern so you might get lucky.

There's a chance this is slower rather than faster, if your execution domain is not very large. How many work items are there? I presume you have on the order of 10 million work items for the scalar case, in which case you should see a performance increase.

In general, when a work item is used to produce multiple independent scalar results, you have to experiment with the different options for data read and write patterns. e.g. if the execution domain is a 2D grid, does each work item process a 2x2 block or a 1x4 strip or other more exotic patterns (striped, diagonalised, dotted ...).

0 Likes

Originally posted by: Jawed

Additionally the profiling tool is capable of highlighting the problem - Jeff mentions that the float4 case has about 25% stall rate on reads. If you are using Visual Studio then you can run the profiler on your application.

I do not have Visual Studio, I use Eclipse CDT, with the MinGW 64-bit compiler.

 

Originally posted by: Jawed

Another way to test this is to remove memory operations as a potential bottleneck. You can do this by making all work-items fetch the same element from your input buffer. Clearly the result will be wrong but now all fetches are cached so if there is a gotcha in global memory accesses, you'll see a huge increase in performance.


I modified the kernel so that every work item fetched element (0,0) from the image. The code did complete about 2.7 times faster than the original code, though it is still an order of magnitude slower than the buffer code.

 

Originally posted by: Jawed

The code you've posted in this thread won't work on my system. Visual Studio's compiler is crashing (not erroring-out), complaining about cl.hpp or something, so I can't play. I don't normally use the C++ bindings so I'll have to spend some time getting to the bottom of that. I'll have a go later next week.


I believe the problem is that you do not have the cl.hpp include file. It is the file that contains the C++ bindings for the OpenCL C headers. You need cl.hpp and all the OpenCL C headers. You can download it here:
http://www.khronos.org/registry/cl/api/1.1/cl.hpp

 

Originally posted by: Jawed

From your point of view it might be worth testing a variant of your kernel that processes multiple double2s instead of a single double2 per work item. Try with 2 or 4 double2s. An increase in these will change the memory access pattern so you might get lucky.


I modified the code to only calculate a single double, and changed the image format's image_channel_order to CL_RG, so that only one double is stored per pixel. (It makes the kernel's indexing easier to understand.) This code runs about 500-600 ms faster than the double2 code. The reason I originally designed it to store 2 doubles per pixel is that my initial tests seemed to show that if you did not fully populate a pixel with 128 bits (4 floats), the pixels would not be packed down to the actual bit size of your ImageFormat; read_image would still make a 128-bit read and throw out the unneeded data, killing performance. Based upon the results of these benchmarks, I assume my initial tests were wrong due to faulty code or a lack of understanding of the texture cache, and that the image object does in fact pack the data and only reads what a read_imageui() call requests. Is that correct?

I've attached the updated code. I fixed a few bugs/typos: when I changed the image format to int32 I had forgotten to change the read_imagef() call, so I fixed that. I also modified it to produce a single double per pixel to make the indexing easier to follow, and changed it to use the built-in as_typen() functions.

-Chris

------------------ --- kernels.cl --- ------------------ #pragma OPENCL EXTENSION cl_amd_fp64 : enable #pragma OPENCL EXTENSION cl_amd_printf : enable double getVec2Element(double2 dv, int index); kernel void code_buf ( int numRows, int rowWidth, constant double * inputMatrix, global double * resultMatrix) { int x = get_global_id(0); int y = get_global_id(1); double total = 0; int rowIndex = 0; for (int row=0; row<numRows; row++) { total += (inputMatrix[rowIndex+x] * inputMatrix[rowIndex+y]); rowIndex += rowWidth; } resultMatrix[(y*rowWidth)+x] = total; } kernel void code_img ( int numRows, int rowWidth, read_only image2d_t inputMatrix, global double * resultMatrix) { sampler_t sampler = CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE; int x = get_global_id(0); int y = get_global_id(1); /*int x = 0;//Used to test Cached read int y = 0;*/ double total = 0; for (int row=0; row<numRows; row++) { total +=as_double(read_imageui(inputMatrix, sampler, (int2)(x,row)).lo) * as_double(read_imageui(inputMatrix, sampler, (int2)(y,row)).lo); } resultMatrix[(y*rowWidth)+x] = total; } //Complete - Tested kernel void conversionTest() { //--double / float2 //init double double dBefore = 5.4321; printf("d-before:\t%f\n", dBefore); //convert double to float2 float2 f2 = as_float2(dBefore); //copy float2 value to new float2 float2 f2_2 = f2; //convert new float2 back into double double dAfter = as_double(f2_2); printf("d-after:\t%f\n", dAfter); //--double2 / float4 //init double double2 d2Before = (double2) (6.54321, 7.654321); printf("d2[1]-before:\t%f\nd2[2]-before:\t%f\n", d2Before.s0, d2Before.s1); //convert double to float2 float4 f4 = as_float4(d2Before); //copy float2 value to new float2 float4 f4_4 = f4; //convert new float2 back into double double2 d2After = as_double2(f4_4); printf("d2[1]-after:\t%f\nd2[2]-after:\t%f\n", d2After.s0, d2After.s1); } //Complete - Tested double getVec2Element(double2 dv, int index) { switch(index%2) { case 0: return dv.s0; 
case 1: return dv.s1; } return -999999; } ---------------- --- main.cpp ---- ---------------- /* * main.cpp * * Created on: Apr 5, 2011 * Author: Chris Alexander */ //OpenCL Includes #define __NO_STD_VECTOR #include <CL/cl.hpp> //C++ API Includes #include <iostream> #include <fstream> using namespace std; using namespace cl; double * inputMatrix; double * resultMatrix; #define RUN_LOOPS //#define DEBUG //#define DISABLE_CL_OPTIMIZATIONS void checkErr(cl_int err, const char * name); char * print_cl_errstring(cl_int err); //Variables ImageFormat IMAGE_FORMAT; cl::vector<Device> * GPU_DEVICES; Device * GPU_DEVICE; Context * GPU_CONTEXT; CommandQueue * GPU_QUEUE; Program * GPU_PROGRAM; int inputWidth; int inputHeight; int inputArea; int resultWidth; int resultHeight; int resultArea; int main() { cout<<"Running!"<<endl; int workGroupWidth = 250; int workGroupHeight = 1; int dim = 4000; inputWidth = dim; inputHeight = dim; resultWidth = dim; resultHeight = dim; inputArea = inputWidth*inputHeight; resultArea = resultWidth*resultHeight; //--OpenCL Init cl_int err = CL_SUCCESS; //Request Supported Platforms from OpenCL cl::vector<Platform> platformList; err = Platform::get(&platformList); if (err == CL_INVALID_VALUE) { cout<<"Invalid Value!"<<endl; } if (err == CL_OUT_OF_HOST_MEMORY) { cout<<"Out of host memory!"<<endl; } checkErr(platformList.size()!=0 ? 
CL_SUCCESS : -1, "Get Platform List"); Platform platform = platformList[0]; //init environment constants IMAGE_FORMAT.image_channel_order = CL_RG; IMAGE_FORMAT.image_channel_data_type = CL_UNSIGNED_INT32;//1 double per pixel = (2x) 32-bit ints per pixels cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platform)(), 0}; //Query OpenCL for present devices GPU_DEVICES = new cl::vector<Device>(); err = platform.getDevices(CL_DEVICE_TYPE_GPU, GPU_DEVICES); checkErr(err, "Get GPU Devices"); GPU_DEVICE = &((*GPU_DEVICES)[0]); //Create Contexts GPU_CONTEXT = new Context(*GPU_DEVICES, cprops, NULL, NULL, &err); checkErr(err, "Create GPU Context"); GPU_QUEUE = new CommandQueue(*GPU_CONTEXT, *GPU_DEVICE, 0, &err); checkErr(err, "Create GPU CommandQueue"); //--GPU //Load Kernel from file into String Object ifstream gpuFile("src/kernels.cl"); std::string gpuProg(istreambuf_iterator<char>(gpuFile),(istreambuf_iterator<char>())); //Create ProgramSource object from String Source Program::Sources gpuSource(1, make_pair(gpuProg.c_str(), gpuProg.length())); //Create Program object from ProgramSource object GPU_PROGRAM = new Program(*GPU_CONTEXT, gpuSource, &err); checkErr(err, "Create GPU Program"); //Compile GPU Program #ifdef DISABLE_CL_OPTIMIZATIONS err = (*GPU_PROGRAM).build(*GPU_DEVICES, "-g -cl-opt-disable"); #else err = (*GPU_PROGRAM).build(*GPU_DEVICES, ""); #endif if (err) { if (err == CL_BUILD_PROGRAM_FAILURE) { cerr<<"Build Failed!"<<endl; } std::string log = (*GPU_PROGRAM).getBuildInfo<CL_PROGRAM_BUILD_LOG>(*GPU_DEVICE); std::cerr<<"GPU - Build Log:\n"<<log<<endl; exit(EXIT_FAILURE); } cout<<"OpenCL Initialized!"<<endl; //--Populate cout<<"Populate..."; inputMatrix = new double[inputArea]; resultMatrix = new double[resultArea]; //Input Matrix for (int x=0; x<inputArea; x++) { inputMatrix = x; } //Result Matrix for (int x=0; x<resultArea; x++) { resultMatrix = 0; } cout<<"Complete!"<<endl; //--Execute Image2D * inputMtrx_img; Buffer * 
inputMtrx_buf; Buffer * resultMtrx; //Events Event kernelEvent; Event copyInputMatrixData; Event copyResultMatrixData; /************************************************** ********** Start Conversion Test Kernel ********** **************************************************/ Kernel kernel = Kernel(*GPU_PROGRAM, "conversionTest", &err); checkErr(err, "Create GPU Kernel (conversionTest)"); //Enqueue Kernel err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(1), NDRange(1), NULL, &kernelEvent); checkErr(err, "enqueue GPU Kernel (conversionTest)"); //Wait for kernel to finish kernelEvent.wait(); checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish"); /************************************************** ********** Start Conversion Test Kernel ********** **************************************************/ /***************************************** ********** Start Buffer Kernel ********** *****************************************/ kernel = Kernel(*GPU_PROGRAM, "code_buf", &err); checkErr(err, "Create GPU Kernel (code_buf)"); #ifdef RUN_LOOPS int before = GetTickCount(); int numIterations = 20; cout<<"Running Buffer Code ["<<numIterations<<" iterations. 
ETA: ~2.5 seconds on Radeon HD 5870]..."; cout.flush(); for (int x=0; x<numIterations; x++) #endif { //Input Matrix inputMtrx_buf = new Buffer(*GPU_CONTEXT, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(double)*inputWidth*inputHeight, inputMatrix, &err); checkErr(err, "Create InputMatrix Buffer"); err = (*GPU_QUEUE).enqueueWriteBuffer(*inputMtrx_buf, CL_FALSE, 0, sizeof(double)*inputArea, inputMatrix, NULL, &copyInputMatrixData); checkErr(err, "enqueueWriteBuffer() - Copy data to vectrA Buffer obj"); //Result Matrix resultMtrx = new Buffer(*GPU_CONTEXT, CL_MEM_READ_WRITE, sizeof(double)*resultArea, NULL, &err); checkErr(err, "Create ResultMatrix Buffer"); //Wait for data copy to finish err = copyInputMatrixData.wait(); checkErr(err, "enqueueWriteBufferEvent.wait() - Wait for inputMtrx data copy to finish"); //Set Kernel Args int argIndex = 0; err = kernel.setArg(argIndex++, inputHeight); checkErr(err, "Kernel(GPU) - setArg(0, inputHeight)"); err = kernel.setArg(argIndex++, inputWidth); checkErr(err, "Kernel(GPU) - setArg(1, inputWidth)"); err = kernel.setArg(argIndex++, *inputMtrx_buf); checkErr(err, "Kernel(GPU) - setArg(2, inputMtrx_buf)"); err = kernel.setArg(argIndex++, *resultMtrx); checkErr(err, "Kernel(GPU) - setArg(3, resultMtrx)"); //Enqueue Kernel err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(resultWidth,resultHeight), NDRange(workGroupWidth, workGroupHeight), NULL, &kernelEvent); checkErr(err, "enqueue GPU Kernel (code_buf)"); //Wait for kernel to finish kernelEvent.wait(); checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish"); //Read Result data to result vector err = (*GPU_QUEUE).enqueueReadBuffer(*resultMtrx, CL_FALSE, 0, sizeof(double)*resultArea, resultMatrix, NULL, &copyResultMatrixData); checkErr(err, "enqueueReadBuffer() - Copy data from resultMtrx Buffer obj"); //Wait for data copy to finish err = copyResultMatrixData.wait(); checkErr(err, "enqueueReadEvent.wait() - Wait for resultMtrx data copy to finish"); 
//clean up after this iteration. checkErr(CL_SUCCESS, "Begin Pointer Cleanup"); delete inputMtrx_buf; delete resultMtrx; checkErr(CL_SUCCESS, "End Pointer Cleanup"); } #ifdef RUN_LOOPS int after = GetTickCount(); long totalTime = after-before; double iterationTime = ((double)(totalTime))/numIterations; cout<<"Done!"<<endl <<" Buffer Total Time (in ms): "<<totalTime<<endl <<"Buffer Iteration Time (in ms): "<<iterationTime<<endl; #else cout<<"Done!"<<endl; #endif /*************************************** ********** End Buffer Kernel ********** ***************************************/ /************************************** ********** Start IMG Kernel ********** **************************************/ kernel = Kernel(*GPU_PROGRAM, "code_img", &err); checkErr(err, "Create GPU Kernel (code_img)"); #ifdef RUN_LOOPS before = GetTickCount(); numIterations = 2; cout<<"Running Image Code ["<<numIterations<<" iterations. ETA: ~9.5 seconds on Radeon HD 5870]..."; cout.flush(); for (int x=0; x<numIterations; x++) #endif { //Input Matrix inputMtrx_img = new Image2D(*GPU_CONTEXT, CL_MEM_READ_ONLY, IMAGE_FORMAT, inputWidth, inputHeight, 0, NULL, &err); checkErr(err, "Create InputMatrix Buffer"); //Origin cl::size_t<3> origin; origin[0] = 0;//x origin[1] = 0;//y origin[2] = 0;//z - On a 2d Image, the Z param MUST be zero(0). //Region cl::size_t<3> region; region[0] = inputWidth; //width region[1] = inputHeight;//height region[2] = 1;//z - On a 2d Image, the Z param MUST be one(1). 
err = (*GPU_QUEUE).enqueueWriteImage(*inputMtrx_img, CL_FALSE, origin, region, 0, 0, inputMatrix, NULL, &copyInputMatrixData); checkErr(err, "enqueueWriteImage() - Copy data to inputMtrx_img Image2D obj"); //Result Matrix resultMtrx = new Buffer(*GPU_CONTEXT, CL_MEM_READ_WRITE, sizeof(double)*resultArea, NULL, &err); checkErr(err, "Create ResultMatrix Buffer"); //Wait for data copy to finish err = copyInputMatrixData.wait(); checkErr(err, "enqueueWriteImageEvent.wait() - Wait for inputMtrx data copy to finish"); //Set Kernel Args int argIndex = 0; err = kernel.setArg(argIndex++, inputHeight); checkErr(err, "Kernel(GPU) - setArg(0, inputHeight)"); err = kernel.setArg(argIndex++, inputWidth); checkErr(err, "Kernel(GPU) - setArg(1, inputWidth)"); err = kernel.setArg(argIndex++, *inputMtrx_img); checkErr(err, "Kernel(GPU) - setArg(2, inputMtrx_img)"); err = kernel.setArg(argIndex++, *resultMtrx); checkErr(err, "Kernel(GPU) - setArg(3, resultMtrx)"); //Enqueue Kernel err = (*GPU_QUEUE).enqueueNDRangeKernel(kernel, NullRange, NDRange(resultWidth,resultHeight), NDRange(workGroupWidth, workGroupHeight), NULL, &kernelEvent); checkErr(err, "enqueue GPU Kernel (code_IMG)"); //Wait for kernel to finish kernelEvent.wait(); checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish"); //Read Result data to result vector err = (*GPU_QUEUE).enqueueReadBuffer(*resultMtrx, CL_FALSE, 0, sizeof(double)*resultArea, resultMatrix, NULL, &copyResultMatrixData); checkErr(err, "enqueueReadBuffer() - Copy data from resultMtrx Buffer obj"); //Wait for data copy to finish err = copyResultMatrixData.wait(); checkErr(err, "enqueueReadEvent.wait() - Wait for resultMtrx data copy to finish"); //clean up after this iteration. 
    checkErr(CL_SUCCESS, "Begin Pointer Cleanup");
    delete inputMtrx_img;
    delete resultMtrx;
    checkErr(CL_SUCCESS, "End Pointer Cleanup");
}
#ifdef RUN_LOOPS
after = GetTickCount();
totalTime = after - before;
iterationTime = ((double)totalTime) / numIterations;
cout << "Done!" << endl
     << "   Image Total Time (in ms): " << totalTime << endl
     << "Image Iteration Time (in ms): " << iterationTime << endl;
#else
cout << "Done!" << endl;
#endif
/************************************
********** End IMG Kernel **********
************************************/

cout << "\n\nProgram Complete!" << endl;
}

void checkErr(cl_int err, const char * name)
{
#ifdef DEBUG
    cerr << name << endl;
    cerr.flush();
#endif
    if (err != CL_SUCCESS) {
        cerr << "ERROR: " << name << " (" << err << ": "
             << print_cl_errstring(err) << ")" << endl;
        exit(EXIT_FAILURE);
    }
}

char *print_cl_errstring(cl_int err)
{
    switch (err) {
        case CL_SUCCESS:                         return strdup("Success!");
        case CL_DEVICE_NOT_FOUND:                return strdup("Device not found.");
        case CL_DEVICE_NOT_AVAILABLE:            return strdup("Device not available");
        case CL_COMPILER_NOT_AVAILABLE:          return strdup("Compiler not available");
        case CL_MEM_OBJECT_ALLOCATION_FAILURE:   return strdup("Memory object allocation failure");
        case CL_OUT_OF_RESOURCES:                return strdup("Out of resources");
        case CL_OUT_OF_HOST_MEMORY:              return strdup("Out of host memory");
        case CL_PROFILING_INFO_NOT_AVAILABLE:    return strdup("Profiling information not available");
        case CL_MEM_COPY_OVERLAP:                return strdup("Memory copy overlap");
        case CL_IMAGE_FORMAT_MISMATCH:           return strdup("Image format mismatch");
        case CL_IMAGE_FORMAT_NOT_SUPPORTED:      return strdup("Image format not supported");
        case CL_BUILD_PROGRAM_FAILURE:           return strdup("Program build failure");
        case CL_MAP_FAILURE:                     return strdup("Map failure");
        case CL_INVALID_VALUE:                   return strdup("Invalid value");
        case CL_INVALID_DEVICE_TYPE:             return strdup("Invalid device type");
        case CL_INVALID_PLATFORM:                return strdup("Invalid platform");
        case CL_INVALID_DEVICE:                  return strdup("Invalid device");
        case CL_INVALID_CONTEXT:                 return strdup("Invalid context");
        case CL_INVALID_QUEUE_PROPERTIES:        return strdup("Invalid queue properties");
        case CL_INVALID_COMMAND_QUEUE:           return strdup("Invalid command queue");
        case CL_INVALID_HOST_PTR:                return strdup("Invalid host pointer");
        case CL_INVALID_MEM_OBJECT:              return strdup("Invalid memory object");
        case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR: return strdup("Invalid image format descriptor");
        case CL_INVALID_IMAGE_SIZE:              return strdup("Invalid image size");
        case CL_INVALID_SAMPLER:                 return strdup("Invalid sampler");
        case CL_INVALID_BINARY:                  return strdup("Invalid binary");
        case CL_INVALID_BUILD_OPTIONS:           return strdup("Invalid build options");
        case CL_INVALID_PROGRAM:                 return strdup("Invalid program");
        case CL_INVALID_PROGRAM_EXECUTABLE:      return strdup("Invalid program executable");
        case CL_INVALID_KERNEL_NAME:             return strdup("Invalid kernel name");
        case CL_INVALID_KERNEL_DEFINITION:       return strdup("Invalid kernel definition");
        case CL_INVALID_KERNEL:                  return strdup("Invalid kernel");
        case CL_INVALID_ARG_INDEX:               return strdup("Invalid argument index");
        case CL_INVALID_ARG_VALUE:               return strdup("Invalid argument value");
        case CL_INVALID_ARG_SIZE:                return strdup("Invalid argument size");
        case CL_INVALID_KERNEL_ARGS:             return strdup("Invalid kernel arguments");
        case CL_INVALID_WORK_DIMENSION:          return strdup("Invalid work dimension");
        case CL_INVALID_WORK_GROUP_SIZE:         return strdup("Invalid work group size");
        case CL_INVALID_WORK_ITEM_SIZE:          return strdup("Invalid work item size");
        case CL_INVALID_GLOBAL_OFFSET:           return strdup("Invalid global offset");
        case CL_INVALID_EVENT_WAIT_LIST:         return strdup("Invalid event wait list");
        case CL_INVALID_EVENT:                   return strdup("Invalid event");
        case CL_INVALID_OPERATION:               return strdup("Invalid operation");
        case CL_INVALID_GL_OBJECT:               return strdup("Invalid OpenGL object");
        case CL_INVALID_BUFFER_SIZE:             return strdup("Invalid buffer size");
        case CL_INVALID_MIP_LEVEL:               return strdup("Invalid mip-map level");
        default:                                 return strdup("Unknown");
    }
}


So I fiddled with project properties and got to playing with your code from posting 4.

I used the Stream Profiler to collect execution statistics for the two kernels. The curious thing was finding that the buffer kernel was coming back with a very strange (very very long) execution time, or even a negative execution time and no stats.

Rummaging in your code I discovered that you are not checking execution of the kernels for errors:

      //Wait for kernel to finish
      kernelEvent.wait();
      checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish");

which should read:

      //Wait for kernel to finish
      err = kernelEvent.wait();
      checkErr(err, "kernelEvent.wait() - Wait for Kernel to finish");

So it turns out that the buffer kernel is failing with error code -14 (CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST in cl.h, which has a rather cryptic message attached to it). But the image kernel is executing correctly. So this explains the "random number" durations for the first kernel - it wasn't executing.

It turns out this -14 code is caused by using:

constant   double * inputMatrix

instead of:

global double * inputMatrix

which is because a constant buffer in OpenCL is limited in size - the spec only guarantees 64KB, and the actual limit is the device property CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, which can be queried. Using "global" allows the kernel to execute.

Your code is including the time to create buffers and to copy data to and from buffers in the elapsed time of the kernels. This can be misleading. To explore optimisations on kernel code you should use multiple iterations of kernel execution only.

Also, if you rummage in AMD's SDK utils you will find a timer class that you can use instead of GetTickCount().

So, after all that, it turns out the image kernel is a lot faster than the buffer kernel! With a workgroup size of 128 (not 250 - and with dimensions of 2048x2048) performance is about 10x faster using image rather than buffer code: 5400ms for buffer versus 548ms for image (on HD5870).

The image code is still struggling though, the ALUs are only busy 60% of the time and cache hit rate is around 48%. To improve those numbers you need to process multiple double2s per work item.

You can then get into more involved algorithms where you do your own caching of the input buffer in local memory or you use multiple work items working together to compute a single result.

All of these improvements would be founded on the fact that neighbourhoods of work items are operating on neighbourhoods within the input data.


Originally posted by: sir.um

I modified the kernel so that every work item fetched element (0,0) from the image. The code did complete about 2.7 times faster than the original code, though it is still an order of magnitude slower than the buffer code.
OK, that tells you roughly what your target is then! 

I modified the code to only calculate a single double, and changed the image format's image_channel_order to CL_RG, so that only 1 double is stored per pixel. (It makes the kernel's indexing easier to understand.) This code does run about 500-600 ms faster than the double2 code. The reason I originally designed it to store 2 doubles per pixel is that my initial tests seemed to show that if you did not fully populate a pixel with 128 bits (4 floats), the pixels would not be compressed to the actual bit size of your ImageFormat; read_image would still make a 128-bit read and throw out the unneeded data, killing performance. Based upon the results of these benchmarks, I would assume that my initial tests were wrong, due to faulty code or a lack of understanding of the texture cache, and that the image object does in fact compress data and only reads the data requested by a read_imageui() call. Is that correct?


No the "compression" effect is illusory I'm afraid. If you look at the ISA for your kernel the SAMPLE instruction should always be in the form of r0 not r0.x or r0.xy or r0.xyz or r0.w, etc. Any time you see 3 or less components explicitly listed for a SAMPLE instruction, bandwidth is being wasted.

On HD5870 and later GPUs a clause of up to 16 SAMPLE instructions is possible. The more of these in a clause, the greater the chance of an increased cache hit rate. The way to get that is to process multiple double2s per work item. Another way is to partially unroll the loop - e.g. by a factor of 4 or 8.

The performance increase you observed is more of a side-effect. It's a bit like the way that performance varies with work-group size. It can be an interaction with memory banks and with cache hit rates. Fiddly stuff...
