Archives Discussions

rollyng · ‎03-27-2011

Computational electromagnetism

Hi, I am aware that NV's has a 3D FDTD OpenCL example but not AMD/ATI? Can anyone post a link for any of these OpenCL FDTD examples?

For those who are interested, I find this paper on IEEE database

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05653857

Thanks!

rollyng · ‎05-04-2011

Hi, so no body interested?

I am trying to run NV example oclFDTD3d on my AMD hardware, but i cannot use the shrUtils.h which belongs to NV hardware only?

Are there any replacement / alternatives to functions such as shrlog() and sharlogEx() from AMD SDK headers?

Thanks!

himanshu_gautam · ‎05-04-2011

hi rollyng,

I don't have any idea what those functions contain. And having same functionality functions in SDKUtils is not very commonplace as AMD have their own utility structure.

So the way here is to download the NV SDK and use their functions directly if possible.

rollyng · ‎05-04-2011

Hi Himanshu,

Thank you! I think these functions are there to keep a log as the OpenCL program executes. Please forgive me to post the NV examples here (3 headers + 3 cpps + 1 kernel). I am not sure if the shrUtils from NV can be compile under AMD? If it cannot, is it possible to have alternatives from SDKUtil?

oclFDTD3d.h + oclFDTD3d.cpp

/* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ #ifndef _OCLFDTD3D_H_ #define _OCLFDTD3D_H_ // The values are set to give reasonable runtimes, they can // be changed but note that running very large dimensions can // take a very long time and you should avoid running on your // primary display in this case. #define k_dim_min 120 #define k_dim_max 376 #define k_dim_qa 248 // Note that the maximum radius is defined here as 4 since the // minimum work group height is 4, if you have a larger work // group then you can increase the radius accordingly. #define k_radius_min 2 #define k_radius_max 4 #define k_radius_default 4 // The values are set to give reasonable runtimes, they can // be changed but note that running a very large number of // timesteps can take a very long time and you should avoid // running on your primary display in this case. #define k_timesteps_min 1 #define k_timesteps_max 10 #define k_timesteps_default 5 #endif /* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ #include "oclFDTD3d.h" #include <oclUtils.h> #include <iostream> #include <iomanip> #include "FDTD3dReference.h" #include "FDTD3dGPU.h" // Name of the file with the source code for the computation kernel const char* clSourceFile = "FDTD3d.cl"; // Name of the log file const char *shrLogFile = "oclFDTD3d.txt"; // Forward declarations bool runTest(int argc, const char **argv); void showHelp(const int argc, const char **argv); int main(int argc, const char **argv) { // Start the log shrSetLogFileName(shrLogFile); shrLog("%s Starting...\n\n", argv[0]); // Check help flag if (shrCheckCmdLineFlag(argc, argv, "help")) { shrLog("Displaying help on console\n"); showHelp(argc, argv); } else { // Execute bool result = runTest(argc, argv); oclCheckErrorEX(result, true, NULL); } // Finish shrEXIT(argc, argv); } void showHelp(const int argc, const char **argv) { if (argc > 0) std::cout << std::endl << argv[0] << std::endl; std::cout << std::endl << "Syntax:" << std::endl; std::cout << std::left; std::cout << " " << std::setw(20) << "--device=<device>" << "Specify device to use for execution" << std::endl; std::cout << " " << std::setw(20) << "--dimx=<N>" << "Specify number of elements in x direction (excluding halo)" << std::endl; std::cout << " " << std::setw(20) << "--dimy=<N>" << "Specify number of elements in y direction (excluding halo)" << std::endl; std::cout << " " << std::setw(20) << "--dimz=<N>" << "Specify number of elements in z direction (excluding halo)" << std::endl; std::cout << " " << std::setw(20) << "--radius=<N>" << "Specify radius of stencil" << std::endl; std::cout << " " << std::setw(20) << "--timesteps=<N>" << "Specify number of timesteps" << std::endl; std::cout << " " << std::setw(20) << "--work-group-size=<N>" << "Specify work group size" << std::endl; std::cout << std::endl; std::cout << " " << std::setw(20) << "--noprompt" << "Skip prompt before exit" << std::endl; std::cout << std::endl; } bool runTest(int argc, const char **argv) { bool ok = true; float *host_output; float *device_output; float *input; float *coeff; int defaultDim; int dimx; int dimy; int dimz; int outerDimx; int outerDimy; int outerDimz; int radius; int timesteps; size_t volumeSize; memsize_t memsize; const float lowerBound = 0.0f; const float upperBound = 1.0f; // Determine default dimensions shrLog("Set-up, based upon target device GMEM size...\n"); if (ok) { // Get the memory size of the target device shrLog(" getTargetDeviceGlobalMemSize\n"); ok = getTargetDeviceGlobalMemSize(&memsize, argc, argv); } if (ok) { // We can never use all the memory so to keep things simple we aim to // use around half the total memory memsize /= 2; // Most of our memory use is taken up by the input and output buffers - // two buffers of equal size - and for simplicity the volume is a cube: // dim = floor( (N/2)^(1/3) ) defaultDim = floor(pow((memsize / (2.0 * sizeof(float))), 1.0/3.0)); // By default, make the volume edge size an integer multiple of 128B to // improve performance by coalescing memory accesses, in a real // application it would make sense to pad the lines accordingly int roundTarget = 128 / sizeof(float); defaultDim = defaultDim / roundTarget * roundTarget; defaultDim -= k_radius_default * 2; // Check dimension is valid if (defaultDim < k_dim_min) { shrLogEx(LOGBOTH | ERRORMSG, -1000, STDERROR); shrLog("\tinsufficient device memory (maximum volume on device is %d, must be between %d and %d).\n", defaultDim, k_dim_min, k_dim_max); ok = false; } else if (defaultDim > k_dim_max) { defaultDim = k_dim_max; } } // For QA testing, override default volume size if (ok) { if (shrCheckCmdLineFlag(argc, argv, "qatest")) { defaultDim = k_dim_qa; } } // Parse command line arguments if (ok) { char *dim = 0; if (shrGetCmdLineArgumentstr(argc, argv, "dimx", &dim)) { dimx = (int)atoi(dim); if (dimx < k_dim_min || dimx > k_dim_max) { shrLogEx(LOGBOTH | ERRORMSG, -1001, STDERROR); shrLog("\tdimx out of range (%d requested, must be between %d and %d), see header files for details.\n", dimx, k_dim_min, k_dim_max); ok = false; } } else { dimx = defaultDim; } if (shrGetCmdLineArgumentstr(argc, argv, "dimy", &dim)) { dimy = (int)atoi(dim); if (dimy < k_dim_min || dimy > k_dim_max) { shrLogEx(LOGBOTH | ERRORMSG, -1002, STDERROR); shrLog("\tdimy out of range (%d requested, must be between %d and %d), see header files for details.\n", dimy, k_dim_min, k_dim_max); ok = false; } } else { dimy = defaultDim; } if (shrGetCmdLineArgumentstr(argc, argv, "dimz", &dim)) { dimz = (int)atoi(dim); if (dimz < k_dim_min || dimz > k_dim_max) { shrLogEx(LOGBOTH | ERRORMSG, -1003, STDERROR); shrLog("\tdimz out of range (%d requested, must be between %d and %d), see header files for details.\n", dimz, k_dim_min, k_dim_max); ok = false; } } else { dimz = defaultDim; } if (shrGetCmdLineArgumentstr(argc, argv, "radius", &dim)) { radius = (int)atoi(dim); if (radius < k_radius_min || radius >= k_radius_max) { shrLogEx(LOGBOTH | ERRORMSG, -1004, STDERROR); shrLog("\tradius out of range (%d requested, must be between %d and %d), see header files for details.\n", radius, k_radius_min, k_radius_max); ok = false; } } else { radius = k_radius_default; } if (shrGetCmdLineArgumentstr(argc, argv, "timesteps", &dim)) { timesteps = (int)atoi(dim); if (timesteps < k_timesteps_min || radius >= k_timesteps_max) { shrLogEx(LOGBOTH | ERRORMSG, -1005, STDERROR); shrLog("\ttimesteps out of range (%d requested, must be between %d and %d), see header files for details.\n", timesteps, k_timesteps_min, k_timesteps_max); ok = false; } } else { timesteps = k_timesteps_default; } if (dim) free(dim); } // Determine volume size if (ok) { outerDimx = dimx + 2 * radius; outerDimy = dimy + 2 * radius; outerDimz = dimz + 2 * radius; volumeSize = outerDimx * outerDimy * outerDimz; } // Allocate memory if (ok) { shrLog(" calloc host_output\n"); if ((host_output = (float *)calloc(volumeSize, sizeof(float))) == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -1006, STDERROR); shrLog("\tInsufficient memory for host_output calloc, please try a smaller volume (use --help for syntax).\n"); ok = false; } } if (ok) { shrLog(" malloc input\n"); if ((input = (float *)malloc(volumeSize * sizeof(float))) == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -1007, STDERROR); shrLog("\tInsufficient memory for input malloc, please try a smaller volume (use --help for syntax).\n"); ok = false; } } if (ok) { shrLog(" malloc coeff\n"); if ((coeff = (float *)malloc((radius + 1) * sizeof(float))) == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -1008, STDERROR); shrLog("\tInsufficient memory for coeff malloc, please try a smaller volume (use --help for syntax).\n"); ok = false; } } // Create coefficients if (ok) { for (int i = 0 ; i <= radius ; i++) { coeff = 0.1f; } } // Generate data if (ok) { shrLog(" generateRandomData\n\n"); generateRandomData(input, outerDimx, outerDimy, outerDimz, lowerBound, upperBound); } if (ok) { shrLog("FDTD on %d x %d x %d volume with symmetric filter radius %d for %d timesteps...\n\n", dimx, dimy, dimz, radius, timesteps); } // Execute on the host if (ok) { shrLog("fdtdReference...\n"); ok = fdtdReference(host_output, input, coeff, dimx, dimy, dimz, radius, timesteps); shrLog("fdtdReference complete\n"); } // Allocate memory if (ok) { shrLog(" calloc device_output\n"); if ((device_output = (float *)calloc(volumeSize, sizeof(float))) == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -1009, STDERROR); shrLog("\tInsufficient memory for device output calloc, please try a smaller volume (use --help for syntax).\n"); ok = false; } } // Execute on the device if (ok) { shrLog("fdtdGPU...\n"); ok = fdtdGPU(device_output, input, coeff, dimx, dimy, dimz, radius, timesteps, argc, argv); shrLog("fdtdGPU complete\n"); } // Compare the results if (ok) { float tolerance = 0.0001f; shrLog("\nCompareData (tolerance %f)...\n", tolerance); ok = compareData(device_output, host_output, dimx, dimy, dimz, radius, tolerance); } shrLog("\n%s\n\n", (ok) ? "PASSED" : "FAILED"); return ok; }

rollyng · ‎05-04-2011

FDTD3dGPU.h + FDTD3dGPU.cpp

/* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ #ifndef _FDTD3DGPU_H_ #define _FDTD3DGPU_H_ #include <cstddef> #if defined(_WIN32) && defined(_MSC_VER) typedef unsigned __int64 memsize_t; #else #include <stdint.h> typedef uint64_t memsize_t; #endif #define k_localWorkX 32 #define k_localWorkY 8 #define k_localWorkMin 128 // Name of the file with the source code for the computation kernel extern const char* clSourceFile; bool getTargetDeviceGlobalMemSize(memsize_t *result, const int argc, const char **argv); bool fdtdGPU(float *output, const float *input, const float *coeff, const int dimx, const int dimy, const int dimz, const int radius, const int timesteps, const int argc, const char **argv); #endif /* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ #include "FDTD3dGPU.h" #include <oclUtils.h> #include <iostream> #include <algorithm> bool getTargetDeviceGlobalMemSize(memsize_t *result, const int argc, const char **argv) { bool ok = true; cl_platform_id platform = 0; cl_context context = 0; cl_device_id *devices = 0; cl_uint deviceCount = 0; cl_uint targetDevice = 0; cl_ulong memsize = 0; cl_int errnum = 0; // Get the NVIDIA platform if (ok) { shrLog(" oclGetPlatformID\n"); errnum = oclGetPlatformID(&platform); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("oclGetPlatformID (no platforms found).\n"); ok = false; } } // Get the list of GPU devices associated with the platform if (ok) { shrLog(" clGetDeviceIDs\n"); errnum = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount); devices = (cl_device_id *)malloc(deviceCount * sizeof(cl_device_id) ); errnum = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL); if ((deviceCount == 0) || (errnum != CL_SUCCESS)) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clGetDeviceIDs (returned error or no devices found).\n"); ok = false; } } // Create the OpenCL context if (ok) { shrLog(" clCreateContext\n"); context = clCreateContext(0, deviceCount, devices, NULL, NULL, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateContext (returned %d).\n", errnum); ok = false; } } // Select target device (device 0 by default) if (ok) { char *device = 0; if (shrGetCmdLineArgumentstr(argc, argv, "device", &device)) { targetDevice = (cl_uint)atoi(device); if (targetDevice >= deviceCount) { shrLogEx(LOGBOTH | ERRORMSG, -2000, STDERROR); shrLog("invalid target device specified on command line (device %d does not exist).\n", targetDevice); ok = false; } } else { targetDevice = 0; } if (device) { free(device); } } // Query target device for maximum memory allocation if (ok) { shrLog(" clGetDeviceInfo\n"); errnum = clGetDeviceInfo(devices[targetDevice], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &memsize, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clGetDeviceInfo (returned %d).\n", errnum); ok = false; } } // Save the result if (ok) { *result = (memsize_t)memsize; } // Cleanup if (devices) free(devices); if (context) clReleaseContext(context); return ok; } bool fdtdGPU(float *output, const float *input, const float *coeff, const int dimx, const int dimy, const int dimz, const int radius, const int timesteps, const int argc, const char **argv) { bool ok = true; const int outerDimx = dimx + 2 * radius; const int outerDimy = dimy + 2 * radius; const int outerDimz = dimz + 2 * radius; const size_t volumeSize = outerDimx * outerDimy * outerDimz; cl_context context = 0; cl_platform_id platform = 0; cl_device_id *devices = 0; cl_command_queue commandQueue = 0; cl_mem bufferOut = 0; cl_mem bufferIn = 0; cl_mem bufferCoeff = 0; cl_program program = 0; cl_kernel kernel = 0; cl_event *kernelEvents = 0; #ifdef GPU_PROFILING cl_ulong kernelEventStart; cl_ulong kernelEventEnd; #endif double hostElapsedTimeS; char *cPathAndName = 0; char *cSourceCL = 0; size_t szKernelLength; size_t globalWorkSize[2]; size_t localWorkSize[2]; cl_uint deviceCount = 0; cl_uint targetDevice = 0; cl_int errnum = 0; char buildOptions[128]; // Ensure that the inner data starts on a 128B boundary const int padding = (128 / sizeof(float)) - radius; const size_t paddedVolumeSize = volumeSize + padding; #ifdef GPU_PROFILING const int profileTimesteps = timesteps - 1; if (ok) { if (profileTimesteps < 1) { shrLog(" cannot profile with fewer than two timesteps (timesteps=%d), profiling is disabled.\n", timesteps); } } #endif // Get the NVIDIA platform if (ok) { shrLog(" oclGetPlatformID...\n"); errnum = oclGetPlatformID(&platform); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("oclGetPlatformID (returned %d).\n", errnum); ok = false; } } // Get the list of GPU devices associated with the platform if (ok) { shrLog(" clGetDeviceIDs"); errnum = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount); devices = (cl_device_id *)malloc(deviceCount * sizeof(cl_device_id) ); errnum = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clGetDeviceIDs (returned %d).\n", errnum); ok = false; } } // Create the OpenCL context if (ok) { shrLog(" clCreateContext...\n"); context = clCreateContext(0, deviceCount, devices, NULL, NULL, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateContext (returned %d).\n", errnum); ok = false; } } // Select target device (device 0 by default) if (ok) { char *device = 0; if (shrGetCmdLineArgumentstr(argc, argv, "device", &device)) { targetDevice = (cl_uint)atoi(device); if (targetDevice >= deviceCount) { shrLogEx(LOGBOTH | ERRORMSG, -2001, STDERROR); shrLog("invalid target device specified on command line (device %d does not exist).\n", targetDevice); ok = false; } } else { targetDevice = 0; } if (device) { free(device); } } // Create a command-queue if (ok) { shrLog(" clCreateCommandQueue\n"); commandQueue = clCreateCommandQueue(context, devices[targetDevice], CL_QUEUE_PROFILING_ENABLE, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateCommandQueue (returned %d).\n", errnum); ok = false; } } // Create memory buffer objects if (ok) { shrLog(" clCreateBuffer bufferOut\n"); bufferOut = clCreateBuffer(context, CL_MEM_READ_WRITE, paddedVolumeSize * sizeof(float), NULL, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateBuffer (returned %d).\n", errnum); ok = false; } } if (ok) { shrLog(" clCreateBuffer bufferIn\n"); bufferIn = clCreateBuffer(context, CL_MEM_READ_WRITE, paddedVolumeSize * sizeof(float), NULL, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateBuffer (returned %d).\n", errnum); ok = false; } } if (ok) { shrLog(" clCreateBuffer bufferCoeff\n"); bufferCoeff = clCreateBuffer(context, CL_MEM_READ_ONLY, (radius + 1) * sizeof(float), NULL, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateBuffer (returned %d).\n", errnum); ok = false; } } // Load the kernel from file if (ok) { shrLog(" shrFindFilePath\n"); cPathAndName = shrFindFilePath(clSourceFile, argv[0]); if (cPathAndName == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -2002, STDERROR); shrLog("shrFindFilePath returned null.\n"); ok = false; } } if (ok) { shrLog(" oclLoadProgSource\n"); cSourceCL = oclLoadProgSource(cPathAndName, "// Preamble\n", &szKernelLength); if (cSourceCL == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -2003, STDERROR); shrLog("oclLoadProgSource returned null.\n"); ok = false; } } // Create the program if (ok) { shrLog(" clCreateProgramWithSource\n"); program = clCreateProgramWithSource(context, 1, (const char **)&cSourceCL, &szKernelLength, &errnum); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateProgramWithSource (returned %d).\n", errnum); ok = false; } } // Check for a command-line specified work group size int localWorkMaxY; if (ok) { int userWorkSizeInt; if (shrGetCmdLineArgumenti(argc, argv, "work-group-size", &userWorkSizeInt)) { // Divide by k_localWorkX (integer division to clamp) localWorkMaxY = (userWorkSizeInt / k_localWorkX); } else { localWorkMaxY = k_localWorkY; } } // Build the program if (ok) { #ifdef WIN32 if (sprintf_s(buildOptions, sizeof(buildOptions), "-DRADIUS=%d -DMAXWORKX=%d -DMAXWORKY=%d -cl-fast-relaxed-math", radius, k_localWorkX, localWorkMaxY) < 0) { shrLogEx(LOGBOTH | ERRORMSG, -2004, STDERROR); shrLog("sprintf_s (failed).\n"); ok = false; } #else if (snprintf(buildOptions, sizeof(buildOptions), "-DRADIUS=%d -DMAXWORKX=%d -DMAXWORKY=%d -cl-fast-relaxed-math", radius, k_localWorkX, localWorkMaxY) < 0) { shrLogEx(LOGBOTH | ERRORMSG, -2004, STDERROR); shrLog("snprintf (failed).\n"); ok = false; } #endif } if (ok) { shrLog(" clBuildProgram (%s)\n", buildOptions); errnum = clBuildProgram(program, 0, NULL, buildOptions, NULL, NULL); if (errnum != CL_SUCCESS) { char buildLog[10240]; clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, sizeof(buildLog), buildLog, NULL); shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clBuildProgram (returned %d).\n", errnum); shrLog("Log:\n%s\n", buildLog); ok = false; } } // Create the kernel if (ok) { shrLog(" clCreateKernel\n"); kernel = clCreateKernel(program, "FiniteDifferences", &errnum); if (kernel == (cl_kernel)NULL || errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clCreateKernel (returned %d).\n", errnum); ok = false; } } // Get the maximum work group size size_t maxWorkSize; if (ok) { shrLog(" clGetKernelWorkGroupInfo\n"); errnum = clGetKernelWorkGroupInfo(kernel, devices[targetDevice], CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &maxWorkSize, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clGetKernelWorkGroupInfo (returned %d).\n", errnum); ok = false; } } // Check for a command-line specified work group size size_t userWorkSize; if (ok) { int userWorkSizeInt; if (shrGetCmdLineArgumenti(argc, argv, "work-group-size", &userWorkSizeInt)) { // Constrain to a multiple of k_localWorkX userWorkSize = (userWorkSizeInt / k_localWorkX * k_localWorkX); } else { userWorkSize = k_localWorkY; } // Constrain within allowed bounds userWorkSize = CLAMP(userWorkSize, k_localWorkMin, maxWorkSize); } // Set the work group size if (ok) { localWorkSize[0] = k_localWorkX; localWorkSize[1] = userWorkSize / k_localWorkX; globalWorkSize[0] = localWorkSize[0] * ceil((float)dimx / localWorkSize[0]); globalWorkSize[1] = localWorkSize[1] * ceil((float)dimy / localWorkSize[1]); shrLog(" set local work group size to %dx%d\n", localWorkSize[0], localWorkSize[1]); shrLog(" set total work size to %dx%d\n", globalWorkSize[0], globalWorkSize[1]); } // Copy the input to the device input buffer if (ok) { shrLog(" clEnqueueWriteBuffer bufferIn\n"); errnum = clEnqueueWriteBuffer(commandQueue, bufferIn, CL_TRUE, padding * sizeof(float), volumeSize * sizeof(float), input, 0, NULL, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clEnqueueWriteBuffer bufferIn (returned %d).\n", errnum); ok = false; } } // Copy the input to the device output buffer (actually only need the halo) if (ok) { shrLog(" clEnqueueWriteBuffer bufferOut\n"); errnum = clEnqueueWriteBuffer(commandQueue, bufferOut, CL_TRUE, padding * sizeof(float), volumeSize * sizeof(float), input, 0, NULL, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clEnqueueWriteBuffer bufferOut (returned %d).\n", errnum); ok = false; } } // Copy the coefficients to the device coefficient buffer if (ok) { shrLog(" clEnqueueWriteBuffer bufferCoeff\n"); errnum = clEnqueueWriteBuffer(commandQueue, bufferCoeff, CL_TRUE, 0, (radius + 1) * sizeof(float), coeff, 0, NULL, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clEnqueueWriteBuffer bufferCoeff (returned %d).\n", errnum); ok = false; } } // Allocate the events if (ok) { shrLog(" calloc events\n"); if ((kernelEvents = (cl_event *)calloc(timesteps, sizeof(cl_event))) == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -2005, STDERROR); shrLog("Insufficient memory for events calloc, please try a smaller volume (use --help for syntax).\n"); ok = false; } } // Start the clock shrDeltaT(0); // Set the constant arguments if (ok) { shrLog(" clSetKernelArg 2-6\n"); errnum = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&bufferCoeff); errnum |= clSetKernelArg(kernel, 3, sizeof(int), &dimx); errnum |= clSetKernelArg(kernel, 4, sizeof(int), &dimy); errnum |= clSetKernelArg(kernel, 5, sizeof(int), &dimz); errnum |= clSetKernelArg(kernel, 6, sizeof(int), &padding); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clSetKernelArg 2-6 (returned %d).\n", errnum); ok = false; } } // Execute the FDTD cl_mem bufferSrc = bufferIn; cl_mem bufferDst = bufferOut; if (ok) { shrLog(" GPU FDTD loop\n"); } for (int it = 0 ; ok && it < timesteps ; it++) { shrLog("\tt = %d ", it); // Set the dynamic arguments if (ok) { shrLog(" clSetKernelArg 0-1,"); errnum = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&bufferDst); errnum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&bufferSrc); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clSetKernelArg 0-1 (returned %d).\n", errnum); ok = false; } } // Launch the kernel if (ok) { shrLog(" clEnqueueNDRangeKernel\n"); errnum = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &kernelEvents[it]); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clEnqueueNDRangeKernel (returned %d).\n", errnum); ok = false; } } // Toggle the buffers cl_mem tmp = bufferSrc; bufferSrc = bufferDst; bufferDst = tmp; } if (ok) shrLog("\n"); // Wait for the kernel to complete if (ok) { shrLog(" clWaitForEvents\n"); errnum = clWaitForEvents(1, &kernelEvents[timesteps-1]); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clWaitForEvents (returned %d).\n", errnum); ok = false; } } // Stop the clock hostElapsedTimeS = shrDeltaT(0); // Read the result back, result is in bufferSrc (after final toggle) if (ok) { shrLog(" clEnqueueReadBuffer\n"); errnum = clEnqueueReadBuffer(commandQueue, bufferSrc, CL_TRUE, padding * sizeof(float), volumeSize * sizeof(float), output, 0, NULL, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clEnqueueReadBuffer bufferSrc (returned %d).\n", errnum); ok = false; } } // Report time #ifdef GPU_PROFILING double elapsedTime = 0.0; if (ok && profileTimesteps > 0) shrLog(" Collect profile information\n"); for (int it = 1 ; ok && it <= profileTimesteps ; it++) { shrLog("\tt = %d ", it); shrLog(" clGetEventProfilingInfo,", it); errnum = clGetEventProfilingInfo(kernelEvents[it], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &kernelEventStart, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clGetEventProfilingInfo (returned %d).\n", errnum); ok = false; } shrLog(" clGetEventProfilingInfo\n", it); errnum = clGetEventProfilingInfo(kernelEvents[it], CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &kernelEventEnd, NULL); if (errnum != CL_SUCCESS) { shrLogEx(LOGBOTH | ERRORMSG, errnum, STDERROR); shrLog("clGetEventProfilingInfo (returned %d).\n", errnum); ok = false; } elapsedTime += (double)kernelEventEnd - (double)kernelEventStart; } if (ok && profileTimesteps > 0) { shrLog("\n"); // Convert nanoseconds to seconds elapsedTime *= 1.0e-9; double avgElapsedTime = elapsedTime / (double)profileTimesteps; // Determine number of computations per timestep size_t pointsComputed = dimx * dimy * dimz; // Determine throughput double throughputM = 1.0e-6 * (double)pointsComputed / avgElapsedTime; shrLogEx(LOGBOTH | MASTER, 0, "oclFDTD3d, Throughput = %.4f MPoints/s, Time = %.5f s, Size = %u Points, NumDevsUsed = %i, Workgroup = %u\n", throughputM, avgElapsedTime, pointsComputed, 1, localWorkSize[0] * localWorkSize[1]); } #endif // Cleanup if (kernelEvents) { for (int it = 0 ; it < timesteps ; it++) { if (kernelEvents[it]) clReleaseEvent(kernelEvents[it]); } free(kernelEvents); } if (kernel) clReleaseKernel(kernel); if (program) clReleaseProgram(program); if (cSourceCL) free(cSourceCL); if (cPathAndName) free(cPathAndName); if (bufferCoeff) clReleaseMemObject(bufferCoeff); if (bufferIn) clReleaseMemObject(bufferIn); if (bufferOut) clReleaseMemObject(bufferOut); if (commandQueue) clReleaseCommandQueue(commandQueue); if (devices) free(devices); if (context) clReleaseContext(context); return ok; }

rollyng · ‎05-04-2011

FDTD3dReference.h + FDTD3dReference.cpp

/* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ #ifndef _FDTD3DREFERENCE_H_ #define _FDTD3DREFERENCE_H_ void generateRandomData(float *data, const int dimx, const int dimy, const int dimz, const float lowerBound, const float upperBound); void generatePatternData(float *data, const int dimx, const int dimy, const int dimz, const float lowerBound, const float upperBound); bool fdtdReference(float *output, const float *input, const float *coeff, const int dimx, const int dimy, const int dimz, const int radius, const int timesteps); bool compareData(const float *output, const float *reference, const int dimx, const int dimy, const int dimz, const int radius, const float tolerance=0.0001f); #endif /* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ #include "FDTD3dReference.h" #include <cstdlib> #include <cmath> #include <iostream> #include <iomanip> #include <shrUtils.h> void generateRandomData(float *data, const int dimx, const int dimy, const int dimz, const float lowerBound, const float upperBound) { srand(0); for (int iz = 0 ; iz < dimz ; iz++) { for (int iy = 0 ; iy < dimy ; iy++) { for (int ix = 0 ; ix < dimx ; ix++) { *data = (float)(lowerBound + ((float)rand() / (float)RAND_MAX) * (upperBound - lowerBound)); ++data; } } } } void generatePatternData(float *data, const int dimx, const int dimy, const int dimz, const float lowerBound, const float upperBound) { for (int iz = 0 ; iz < dimz ; iz++) { for (int iy = 0 ; iy < dimy ; iy++) { for (int ix = 0 ; ix < dimx ; ix++) { *data = (float)(lowerBound + ((float)iz / (float)dimz) * (upperBound - lowerBound)); ++data; } } } } bool fdtdReference(float *output, const float *input, const float *coeff, const int dimx, const int dimy, const int dimz, const int radius, const int timesteps) { bool ok = true; const int outerDimx = dimx + 2 * radius; const int outerDimy = dimy + 2 * radius; const int outerDimz = dimz + 2 * radius; const size_t volumeSize = outerDimx * outerDimy * outerDimz; const int stride_y = outerDimx; const int stride_z = stride_y * outerDimy; float *intermediate = 0; const float *bufsrc = 0; float *bufdst = 0; float *bufdstnext = 0; // Allocate temporary buffer if (ok) { shrLog(" calloc intermediate\n"); if ((intermediate = (float *)calloc(volumeSize, sizeof(float))) == NULL) { shrLogEx(LOGBOTH | ERRORMSG, -3000, STDERROR); shrLog("Insufficient memory for calloc intermediate, please try a smaller volume (use --help for syntax).\n"); ok = false; } } // Decide which buffer to use first (result should end up in output) if (ok) { if ((timesteps % 2) == 0) { bufsrc = input; bufdst = intermediate; bufdstnext = output; } else { bufsrc = input; bufdst = output; bufdstnext = intermediate; } } // Run the FDTD (naive method) if (ok) { shrLog(" Host FDTD loop\n"); for (int it = 0 ; it < timesteps ; it++) { shrLog("\tt = %d\n", it); const float *src = bufsrc; float *dst = bufdst; for (int iz = -radius ; iz < dimz + radius ; iz++) { for (int iy = -radius ; iy < dimy + radius ; iy++) { for (int ix = -radius ; ix < dimx + radius ; ix++) { if (ix >= 0 && ix < dimx && iy >= 0 && iy < dimy && iz >= 0 && iz < dimz) { float value = (*src) * coeff[0]; for(int ir = 1 ; ir <= radius ; ir++) { value += coeff[ir] * (*(src + ir) + *(src - ir)); // horizontal value += coeff[ir] * (*(src + ir * stride_y) + *(src - ir * stride_y)); // vertical value += coeff[ir] * (*(src + ir * stride_z) + *(src - ir * stride_z)); // in front & behind } *dst = value; } else { *dst = *src; } ++dst; ++src; } } } // Rotate buffers float *tmp = bufdst; bufdst = bufdstnext; bufdstnext = tmp; bufsrc = (const float *)tmp; } shrLog("\n"); } if (intermediate) free(intermediate); return ok; } bool compareData(const float *output, const float *reference, const int dimx, const int dimy, const int dimz, const int radius, const float tolerance) { bool ok = true; for (int iz = -radius ; iz < dimz + radius ; iz++) { for (int iy = -radius ; iy < dimy + radius ; iy++) { for (int ix = -radius ; ix < dimx + radius ; ix++) { if (ix >= 0 && ix < dimx && iy >= 0 && iy < dimy && iz >= 0 && iz < dimz) { // Determine the absolute difference float difference = abs(*reference - *output); float error; // Determine the relative error if (*reference != 0) error = difference / *reference; else error = difference; // Check the error is within the tolerance if (error > tolerance) { ok = false; shrLog("Data error at point (%d,%d,%d)\t%f instead of %f\n", ix, iy, iz, *output, *reference); return ok; } } ++output; ++reference; } } } return ok; }

rollyng · ‎05-04-2011

OpenCL kernel, FDTD3d.cl

/* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ __kernel void FiniteDifferences(__global float * const output, __global const float * const input, __constant float * const coeff, const int dimx, const int dimy, const int dimz, const int padding) { bool valid = true; const int gtidx = get_global_id(0); const int gtidy = get_global_id(1); const int ltidx = get_local_id(0); const int ltidy = get_local_id(1); const int workx = get_local_size(0); const int worky = get_local_size(1); __local float tile[MAXWORKY + 2 * RADIUS][MAXWORKX + 2 * RADIUS]; const int stride_y = dimx + 2 * RADIUS; const int stride_z = stride_y * (dimy + 2 * RADIUS); int inputIndex = 0; int outputIndex = 0; // Advance inputIndex to start of inner volume inputIndex += RADIUS * stride_y + RADIUS + padding; // Advance inputIndex to target element inputIndex += gtidy * stride_y + gtidx; float infront[RADIUS]; float behind[RADIUS]; float current; const int tx = ltidx + RADIUS; const int ty = ltidy + RADIUS; if (gtidx >= dimx) valid = false; if (gtidy >= dimy) valid = false; // For simplicity we assume that the global size is equal to the actual // problem size; since the global size must be a multiple of the local size // this means the problem size must be a multiple of the local size (or // padded to meet this constraint). // Preload the "infront" and "behind" data for (int i = RADIUS - 2 ; i >= 0 ; i--) { behind = input[inputIndex]; inputIndex += stride_z; } current = input[inputIndex]; outputIndex = inputIndex; inputIndex += stride_z; for (int i = 0 ; i < RADIUS ; i++) { infront = input[inputIndex]; inputIndex += stride_z; } // Step through the xy-planes for (int iz = 0 ; iz < dimz ; iz++) { // Advance the slice (move the thread-front) for (int i = RADIUS - 1 ; i > 0 ; i--) behind = behind[i - 1]; behind[0] = current; current = infront[0]; for (int i = 0 ; i < RADIUS - 1 ; i++) infront = infront[i + 1]; infront[RADIUS - 1] = input[inputIndex]; inputIndex += stride_z; outputIndex += stride_z; barrier(CLK_LOCAL_MEM_FENCE); // Note that for the work items on the boundary of the problem, the // supplied index when reading the halo (below) may wrap to the // previous/next row or even the previous/next xy-plane. This is // acceptable since a) we disable the output write for these work // items and b) there is at least one xy-plane before/after the // current plane, so the access will be within bounds. // Update the data slice in the local tile // Halo above & below if (ltidy < RADIUS) { tile[ltidy][tx] = input[outputIndex - RADIUS * stride_y]; tile[ltidy + worky + RADIUS][tx] = input[outputIndex + worky * stride_y]; } // Halo left & right if (ltidx < RADIUS) { tile[ty][ltidx] = input[outputIndex - RADIUS]; tile[ty][ltidx + workx + RADIUS] = input[outputIndex + workx]; } tile[ty][tx] = current; barrier(CLK_LOCAL_MEM_FENCE); // Compute the output value float value = coeff[0] * current; #pragma unroll RADIUS for (int i = 1 ; i <= RADIUS ; i++) { value += coeff * (infront[i-1] + behind[i-1] + tile[ty - i][tx] + tile[ty + i][tx] + tile[ty][tx - i] + tile[ty][tx + i]); } // Store the output value if (valid) output[outputIndex] = value; } }

nou · ‎05-04-2011

it shoudl be possible if shrUtils dont use some CUDA specific things.

i recomed download nVidia SDK and extract shrUtils from there. or replace all function with your own.

rollyng · ‎05-04-2011

Hi nou,

Thanks! I'll include shrUtils.h and oclUtils.h into my project and report back what happens

Do you think these outputs are correct? Thank you!

Here is the output from AMD:

/home/rolly/workspace/oclFDTD/Debug/oclFDTD Starting...

Set-up, based upon target device GMEM size...
getTargetDeviceGlobalMemSize
oclGetPlatformID
WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

clGetDeviceIDs
clCreateContext
clGetDeviceInfo
calloc host_output
malloc input
malloc coeff
generateRandomData

FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...

fdtdReference...
fdtdReference complete
calloc device_output
fdtdGPU...
oclGetPlatformID...
WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

clGetDeviceIDs clCreateContext...
clCreateCommandQueue
clCreateBuffer bufferOut
clCreateBuffer bufferIn
clCreateBuffer bufferCoeff
shrFindFilePath
oclLoadProgSource
clCreateProgramWithSource
clBuildProgram (-DRADIUS=4 -DMAXWORKX=32 -DMAXWORKY=8 -cl-fast-relaxed-math)
clCreateKernel
clGetKernelWorkGroupInfo
set local work group size to 32x4
set total work size to 384x376
clEnqueueWriteBuffer bufferIn
clEnqueueWriteBuffer bufferOut
clEnqueueWriteBuffer bufferCoeff
calloc events
clSetKernelArg 2-6
GPU FDTD loop
    t = 0 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 1 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 2 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 3 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 4 clSetKernelArg 0-1, clEnqueueNDRangeKernel

clWaitForEvents
clEnqueueReadBuffer
fdtdGPU complete

CompareData (tolerance 0.000100)...

PASSED

Press <Enter> to Quit...
-----------------------------------------------------------

Those from NV:

/home/rolly/workspace/myFDTD/Debug/myFDTD Starting...

Set-up, based upon target device GMEM size...
getTargetDeviceGlobalMemSize
oclGetPlatformID
clGetDeviceIDs
clCreateContext
clGetDeviceInfo
calloc host_output
malloc input
malloc coeff
generateRandomData

FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...

fdtdReference...
calloc intermediate
Host FDTD loop
    t = 0
    t = 1
    t = 2
    t = 3
    t = 4

fdtdReference complete
calloc device_output
fdtdGPU...
oclGetPlatformID...
clGetDeviceIDs clCreateContext...
clCreateCommandQueue
clCreateBuffer bufferOut
clCreateBuffer bufferIn
clCreateBuffer bufferCoeff
shrFindFilePath
oclLoadProgSource
clCreateProgramWithSource
clBuildProgram (-DRADIUS=4 -DMAXWORKX=32 -DMAXWORKY=8 -cl-fast-relaxed-math)
clCreateKernel
clGetKernelWorkGroupInfo
set local work group size to 32x4
set total work size to 384x376
clEnqueueWriteBuffer bufferIn
clEnqueueWriteBuffer bufferOut
clEnqueueWriteBuffer bufferCoeff
calloc events
clSetKernelArg 2-6
GPU FDTD loop
    t = 0 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 1 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 2 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 3 clSetKernelArg 0-1, clEnqueueNDRangeKernel
    t = 4 clSetKernelArg 0-1, clEnqueueNDRangeKernel

clWaitForEvents
clEnqueueReadBuffer
fdtdGPU complete

CompareData (tolerance 0.000100)...

PASSED

Press <Enter> to Quit...
-----------------------------------------------------------

himanshu_gautam · ‎05-05-2011

rollyng,

AFAIK the outputs appear good. On AMD it is expected that the app will not find NV platform, so it executes on AMD platform. Every NV sample is expected to give that warning. you will be able to observe something similar when you try to run AMD samples on NV device.

rollyng · ‎05-05-2011

Hi Himanshu,

It appears running correctly and I am using Eclipse 3.5 with CDT 6.0 as IDE. I am also trying to debug the same program and see the memszie detected but it fails to finish, on the console window of Eclipse, it said

Thread[1] (Suspended: Signal 'SIGSEGV' received. Description: Segmentation fault.)

I have run the same code with debug on NV hardware, Eclipse returns without error, can anyone advice what I missed in my debug config?

Thanks!

himanshu_gautam · ‎05-05-2011

It difficult to answer right away. Please post youe system configuration: CPU,GPU,SDK,Driver,OS.

You might find someone who can try running that sample at your configuration and help you out.

rollyng · ‎05-05-2011

Hi himanshu,

Thanks, here is my config:

(1) CPU: Intel E5620

(2) GPU: AMD HD6990

(3) SDK: 2.4 x86_64

(4) Driver: 11.3 x86_64

(5) OS: Ubuntu 10.10 with Eclipse 3.5.2, CDT 6.0.0

Has anyone managed to run standard debug mode of Eclipse+CDT with AMD SDK examples? and are there any way to upload my screenshots to the OpenCL forum? Thanks!

himanshu_gautam · ‎05-09-2011

It would be really nice if you can answer a few more questions before I report it to AMD developers:

Does this sample run fine on CPU while using Eclipse + CDT with AMD APP SDK.

Also do you face similar issues with any other NVIDIA or AMD samples(for both CPU & GPU) using the same configuration.

I would also suggest you to search for some eclipse help as the samples are otherwise working.

rollyng · ‎05-10-2011

Hi himanshu,

I double checked with SDK2.4 examples as stated in my other thread, I think there is a problem with SDK2.4 while debugging with gdb, I can run all examples fine but if I go debug, it causes segmentation fault.

With NV hardware and their SDK, I can run gdb without such problem.

Please report to AMD developer, thank you!

himanshu_gautam · ‎06-14-2011

sorry for posting late here, but this has been reported to developers a while ago.

Thanks for reporting this.

Archives Discussions

3D FDTD example with OpenCL?