cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

richeek_arya
Journeyman III

Host memory is getting value just after kernel finishes

Hi all, 

I have a strange problem. I am calling the openCL kernel and waiting on it to finish  execution. This is how I am doing it:

status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]);
 
    /* wait for the kernel call to finish execution */
    status = clWaitForEvents(1, &events[0]);
    if(status != CL_SUCCESS)
    {
        std::cout<<
            "Error: Waiting for kernel run to finish. \
            (clWaitForEvents)\n";
        return 1;
    }

status = clReleaseEvent(events[0]);

  And after this I am reading back the memory buffers on the host side:

/* Enqueue cand_dist Buffer*/
    status = clEnqueueReadBuffer(commandQueue, cand_dist_d, CL_TRUE, 0, LIST_SIZE*no_mimo_sym * sizeof(float), cand_dist, 0, NULL, &events[1]);

 

But when I run Visual Studio debugger I see cand_dist  get the values as soon as clWaitForEvents() returns. I do not understand how this could happen? At the end of the execution I am freeing memory so it is not stray values from old run. Does anyone know why this may happen?

I allocated device side memory like this:

cand_dist_d = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(float)*LIST_SIZE*no_mimo_sym , cand_dist, &status);

So is it device writing back the data on the host side too?

Thanks for the help!

Richeek

0 Likes
8 Replies
himanshu_gautam
Grandmaster

Hi recheek,

I was just discussing the same with somebody a few days ago. Actually as you use CL_MEM_USE_HOST_PTR the clBuffer cand_dist_d is created from the same host memory chunk used to store cand_dist. When you pass the buffer to GPU a new copy has to be created on GPU which is then modified by your kernel. But for buffer consistancy the two buffers(at CPU & at GPU) should have same data.

Although I am not 100% sure on this theory and would like to check that myself.Can you please post your code.

0 Likes

page 56 of the opencl 1.1 spec says:

CL_MEM_USE_HOST_PTR: This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.


so, there's nothing strange about what you are observing, it's how it's meant to be.

0 Likes

Originally posted by: himanshu.gautam Hi recheek,

 

I was just discussing the same with somebody a few days ago. Actually as you use CL_MEM_USE_HOST_PTR the clBuffer cand_dist_d is created from the same host memory chunk used to store cand_dist. When you pass the buffer to GPU a new copy has to be created on GPU which is then modified by your kernel. But for buffer consistancy the two buffers(at CPU & at GPU) should have same data.



So it does mean that GPU is first copying over the data to its global memory. Later it is OpenCL which to maintain consistency copying over the data back to host.

 

Although I am not 100% sure on this theory and would like to check that myself.Can you please post your code.

 

I am posting the template code the came with AMD SDK. If you see the "output" buffer just after the kernel releases, it has the values.

 

Thanks,

Richeek

#include "Template.hpp" /* * \brief Host Initialization * Allocate and initialize memory * on the host. Print input array. */ int initializeHost(void) { width = 256; input = NULL; output = NULL; multiplier = 2; ///////////////////////////////////////////////////////////////// // Allocate and initialize memory used by host ///////////////////////////////////////////////////////////////// cl_uint sizeInBytes = width * sizeof(cl_uint); input = (cl_uint *) malloc(sizeInBytes); if(input == NULL) { std::cout<<"Error: Failed to allocate input memory on host\n"; return 1; } output = (cl_uint *) malloc(sizeInBytes); if(output == NULL) { std::cout<<"Error: Failed to allocate output memory on host\n"; return 1; } for(cl_uint i = 0; i < width; i++) input = i; // print input array print1DArray(std::string("Input").c_str(), input, width); return 0; } /* * Converts the contents of a file into a string */ std::string convertToString(const char *filename) { size_t size; char* str; std::string s; std::fstream f(filename, (std::fstream::in | std::fstream::binary)); if(f.is_open()) { size_t fileSize; f.seekg(0, std::fstream::end); size = fileSize = f.tellg(); f.seekg(0, std::fstream::beg); str = new char[size+1]; if(!str) { f.close(); return NULL; } f.read(str, fileSize); f.close(); str[size] = '\0'; s = str; delete[] str; return s; } else { std::cout << "\nFile containg the kernel code(\".cl\") not found. Please copy the required file in the folder containg the executable.\n"; exit(1); } return NULL; } /* * \brief OpenCL related initialization * Create Context, Device list, Command Queue * Create OpenCL memory buffer objects * Load CL file, compile, link CL source * Build program and kernel objects */ int initializeCL(void) { cl_int status = 0; size_t deviceListSize; /* * Have a look at the available platforms and pick either * the AMD one if available or a reasonable default. */ cl_uint numPlatforms; cl_platform_id platform = NULL; status = clGetPlatformIDs(0, NULL, &numPlatforms); if(status != CL_SUCCESS) { std::cout << "Error: Getting Platforms. (clGetPlatformsIDs)\n"; return 1; } if(numPlatforms > 0) { cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); if(status != CL_SUCCESS) { std::cout << "Error: Getting Platform Ids. (clGetPlatformsIDs)\n"; return 1; } for(unsigned int i=0; i < numPlatforms; ++i) { char pbuff[100]; status = clGetPlatformInfo( platforms, CL_PLATFORM_VENDOR, sizeof(pbuff), pbuff, NULL); if(status != CL_SUCCESS) { std::cout << "Error: Getting Platform Info.(clGetPlatformInfo)\n"; return 1; } platform = platforms; std::cout<< "string is:"<<pbuff<<"\n"; getchar(); if(!strcmp(pbuff, "Advanced Micro Devices, Inc.")) { break; } } delete platforms; } if(NULL == platform) { std::cout << "NULL platform found so Exiting Application." << std::endl; return 1; } /* * If we could find our platform, use it. Otherwise use just available platform. */ cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 }; ///////////////////////////////////////////////////////////////// // Create an OpenCL context ///////////////////////////////////////////////////////////////// context = clCreateContextFromType(cps, CL_DEVICE_TYPE_CPU, NULL, NULL, &status); if(status != CL_SUCCESS) { std::cout<<"Error: Creating Context. (clCreateContextFromType)\n"; return 1; } /* First, get the size of device list data */ status = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize); if(status != CL_SUCCESS) { std::cout<< "Error: Getting Context Info \ (device list size, clGetContextInfo)\n"; return 1; } ///////////////////////////////////////////////////////////////// // Detect OpenCL devices ///////////////////////////////////////////////////////////////// devices = (cl_device_id *)malloc(deviceListSize); if(devices == 0) { std::cout<<"Error: No devices found.\n"; return 1; } /* Now, get the device list data */ status = clGetContextInfo( context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL); if(status != CL_SUCCESS) { std::cout<< "Error: Getting Context Info \ (device list, clGetContextInfo)\n"; return 1; } ///////////////////////////////////////////////////////////////// // Create an OpenCL command queue ///////////////////////////////////////////////////////////////// commandQueue = clCreateCommandQueue( context, devices[0], 0, &status); if(status != CL_SUCCESS) { std::cout<<"Creating Command Queue. (clCreateCommandQueue)\n"; return 1; } ///////////////////////////////////////////////////////////////// // Create OpenCL memory buffers ///////////////////////////////////////////////////////////////// inputBuffer = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * width, input, &status); if(status != CL_SUCCESS) { std::cout<<"Error: clCreateBuffer (inputBuffer)\n"; return 1; } outputBuffer = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * width, output, &status); if(status != CL_SUCCESS) { std::cout<<"Error: clCreateBuffer (outputBuffer)\n"; return 1; } ///////////////////////////////////////////////////////////////// // Load CL file, build CL program object, create CL kernel object ///////////////////////////////////////////////////////////////// const char * filename = "Template_Kernels.cl"; std::string sourceStr = convertToString(filename); const char * source = sourceStr.c_str(); size_t sourceSize[] = { strlen(source) }; program = clCreateProgramWithSource( context, 1, &source, sourceSize, &status); if(status != CL_SUCCESS) { std::cout<< "Error: Loading Binary into cl_program \ (clCreateProgramWithBinary)\n"; return 1; } /* create a cl program executable for all the devices specified */ status = clBuildProgram(program, 1, devices, NULL, NULL, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Building Program (clBuildProgram)\n"; return 1; } /* get a kernel object handle for a kernel with the given name */ kernel = clCreateKernel(program, "templateKernel", &status); if(status != CL_SUCCESS) { std::cout<<"Error: Creating Kernel from program. (clCreateKernel)\n"; return 1; } return 0; } /* * \brief Run OpenCL program * * Bind host variables to kernel arguments * Run the CL kernel */ int runCLKernels(void) { cl_int status; cl_uint maxDims; cl_event events[2]; size_t globalThreads[1]; size_t localThreads[1]; size_t maxWorkGroupSize; size_t maxWorkItemSizes[3]; /** * Query device capabilities. Maximum * work item dimensions and the maximmum * work item sizes */ status = clGetDeviceInfo( devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), (void*)&maxWorkGroupSize, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n"; return 1; } status = clGetDeviceInfo( devices[0], CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(cl_uint), (void*)&maxDims, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n"; return 1; } status = clGetDeviceInfo( devices[0], CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(size_t)*maxDims, (void*)maxWorkItemSizes, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n"; return 1; } globalThreads[0] = width; localThreads[0] = 1; if(globalThreads[0] > maxWorkItemSizes[0] || localThreads[0] > maxWorkGroupSize) { std::cout<<"Unsupported: Device does not support requested number of work items."; return 1; } /*** Set appropriate arguments to the kernel ***/ /* the output array to the kernel */ status = clSetKernelArg( kernel, 0, sizeof(cl_mem), (void *)&outputBuffer); if(status != CL_SUCCESS) { std::cout<<"Error: Setting kernel argument. (output)\n"; return 1; } /* the input array to the kernel */ status = clSetKernelArg( kernel, 1, sizeof(cl_mem), (void *)&inputBuffer); if(status != CL_SUCCESS) { std::cout<<"Error: Setting kernel argument. (input)\n"; return 1; } /*multiplier*/ status = clSetKernelArg( kernel, 2, sizeof(cl_uint), (void *)&multiplier); if(status != CL_SUCCESS) { std::cout<< "Error: Setting kernel argument. (multiplier)\n"; return 1; } /* * Enqueue a kernel run call. */ status = clEnqueueNDRangeKernel( commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]); if(status != CL_SUCCESS) { std::cout<< "Error: Enqueueing kernel onto command queue. \ (clEnqueueNDRangeKernel)\n"; return 1; } /* wait for the kernel call to finish execution */ status = clWaitForEvents(1, &events[0]); /* OUTPUT BUFFER HAS VALUES AFTER THIS CALL */ if(status != CL_SUCCESS) { std::cout<< "Error: Waiting for kernel run to finish. \ (clWaitForEvents)\n"; return 1; } status = clReleaseEvent(events[0]); if(status != CL_SUCCESS) { std::cout<< "Error: Release event object. \ (clReleaseEvent)\n"; return 1; } /* Enqueue readBuffer*/ status = clEnqueueReadBuffer( commandQueue, outputBuffer, CL_TRUE, 0, width * sizeof(cl_uint), output, 0, NULL, &events[1]); if(status != CL_SUCCESS) { std::cout << "Error: clEnqueueReadBuffer failed. \ (clEnqueueReadBuffer)\n"; return 1; } /* Wait for the read buffer to finish execution */ status = clWaitForEvents(1, &events[1]); if(status != CL_SUCCESS) { std::cout<< "Error: Waiting for read buffer call to finish. \ (clWaitForEvents)\n"; return 1; } status = clReleaseEvent(events[1]); if(status != CL_SUCCESS) { std::cout<< "Error: Release event object. \ (clReleaseEvent)\n"; return 1; } return 0; } /* kernel */ /*! * Sample kernel which multiplies every element of the input array with * a constant and stores it at the corresponding output array */ __kernel void templateKernel(__global unsigned int * output, __global unsigned int * input, const unsigned int multiplier) { uint tid = get_global_id(0); output[tid] = input[tid] * multiplier; }

0 Likes

This is the hpp file:

 

/* ============================================================ Copyright (c) 2009 Advanced Micro Devices, Inc. All rights reserved. Redistribution and use of this material is permitted under the following conditions: Redistributions must retain the above copyright notice and all terms of this license. In no event shall anyone redistributing or accessing or using this material commence or participate in any arbitration or legal action relating to this material against Advanced Micro Devices, Inc. or any copyright holders or contributors. The foregoing shall survive any expiration or termination of this license or any agreement or access or use related to this material. ANY BREACH OF ANY TERM OF THIS LICENSE SHALL RESULT IN THE IMMEDIATE REVOCATION OF ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL. THIS MATERIAL IS PROVIDED BY ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" IN ITS CURRENT CONDITION AND WITHOUT ANY REPRESENTATIONS, GUARANTEE, OR WARRANTY OF ANY KIND OR IN ANY WAY RELATED TO SUPPORT, INDEMNITY, ERROR FREE OR UNINTERRUPTED OPERA TION, OR THAT IT IS FREE FROM DEFECTS OR VIRUSES. ALL OBLIGATIONS ARE HEREBY DISCLAIMED - WHETHER EXPRESS, IMPLIED, OR STATUTORY - INCLUDING, BUT NOT LIMITED TO, ANY IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ACCURACY, COMPLETENESS, OPERABILITY, QUALITY OF SERVICE, OR NON-INFRINGEMENT. IN NO EVENT SHALL ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, PUNITIVE, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, REVENUE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED OR BASED ON ANY THEORY OF LIABILITY ARISING IN ANY WAY RELATED TO THIS MATERIAL, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THE ENTIRE AND AGGREGATE LIABILITY OF ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS SHALL NOT EXCEED TEN DOLLARS (US $10.00). ANYONE REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL ACCEPTS THIS ALLOCATION OF RISK AND AGREES TO RELEASE ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS FROM ANY AND ALL LIABILITIES, OBLIGATIONS, CLAIMS, OR DEMANDS IN EXCESS OF TEN DOLLARS (US $10.00). THE FOREGOING ARE ESSENTIAL TERMS OF THIS LICENSE AND, IF ANY OF THESE TERMS ARE CONSTRUED AS UNENFORCEABLE, FAIL IN ESSENTIAL PURPOSE, OR BECOME VOID OR DETRIMENTAL TO ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR CONTRIBUTORS FOR ANY REASON, THEN ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL SHALL TERMINATE IMMEDIATELY. MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF THIS LICENSE OR ANY AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL. NOTICE IS HEREBY PROVIDED, AND BY REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL SUCH NOTICE IS ACKNOWLEDGED, THAT THIS MATERIAL MAY BE SUBJECT TO RESTRICTIONS UNDER THE LAWS AND REGULATIONS OF THE UNITED STATES OR OTHER COUNTRIES, WHICH INCLUDE BUT ARE NOT LIMITED TO, U.S. EXPORT CONTROL LAWS SUCH AS THE EXPORT ADMINISTRATION REGULATIONS AND NATIONAL SECURITY CONTROLS AS DEFINED THEREUNDER, AS WELL AS STATE DEPARTMENT CONTROLS UNDER THE U.S. MUNITIONS LIST. THIS MATERIAL MAY NOT BE USED, RELEASED, TRANSFERRED, IMPORTED, EXPORTED AND/OR RE-EXPORTED IN ANY MANNER PROHIBITED UNDER ANY APPLICABLE LAWS, INCLUDING U.S. EXPORT CONTROL LAWS REGARDING SPECIFICALLY DESIGNATED PERSONS, COUNTRIES AND NATIONALS OF COUNTRIES SUBJECT TO NATIONAL SECURITY CONTROLS. MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF ANY LICENSE OR AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL. NOTICE REGARDING THE U.S. GOVERNMENT AND DOD AGENCIES: This material is provided with "RESTRICTED RIGHTS" and/or "LIMITED RIGHTS" as applicable to computer software and technical data, respectively. Use, duplication, distribution or disclosure by the U.S. Government and/or DOD agencies is subject to the full extent of restrictions in all applicable regulations, including those found at FAR52.227 and DFARS252.227 et seq. and any successor regulations thereof. Use of this material by the U.S. Government and/or DOD agencies is acknowledgment of the proprietary rights of any copyright holders and contributors, including those of Advanced Micro Devices, Inc., as well as the provisions of FAR52.227-14 through 23 regarding privately developed and/or commercial computer software. This license forms the entire agreement regarding the subject matter hereof and supersedes all proposals and prior discussions and writings between the parties with respect thereto. This license does not affect any ownership, rights, title, or interest in, or relating to, this material. No terms of this license can be modified or waived, and no breach of this license can be excused, unless done so in a writing signed by all affected parties. Each term of this license is separately enforceable. If any term of this license is determined to be or becomes unenforceable or illegal, such term shall be reformed to the minimum extent necessary in order for this license to remain in effect in accordance with its terms as modified by such reformation. This license shall be governed by and construed in accordance with the laws of the State of Texas without regard to rules on conflicts of law of any state or jurisdiction or the United Nations Convention on the International Sale of Goods. All disputes arising out of this license shall be subject to the jurisdiction of the federal and state courts in Austin, Texas, and all defenses are hereby waived concerning personal jurisdiction and venue of these courts. ============================================================ */ #ifndef TEMPLATE_H_ #define TEMPLATE_H_ #include <CL/cl.h> #include <string.h> #include <cstdlib> #include <iostream> #include <string> #include <fstream> /*** GLOBALS ***/ /* * Input data is stored here. */ cl_uint *input; /* * Output data is stored here. */ cl_uint *output; /* * Multiplier is stored in this variable */ cl_uint multiplier; /* problem size for 1D algorithm and width of problem size for 2D algorithm */ cl_uint width; /* The memory buffer that is used as input/output for OpenCL kernel */ cl_mem inputBuffer; cl_mem outputBuffer; cl_context context; cl_device_id *devices; cl_command_queue commandQueue; cl_program program; /* This program uses only one kernel and this serves as a handle to it */ cl_kernel kernel; /*** FUNCTION DECLARATIONS ***/ /* * OpenCL related initialisations are done here. * Context, Device list, Command Queue are set up. * Calls are made to set up OpenCL memory buffers that this program uses * and to load the programs into memory and get kernel handles. */ int initializeCL(void); /* * */ std::string convertToString(const char * filename); /* * This is called once the OpenCL context, memory etc. are set up, * the program is loaded into memory and the kernel handles are ready. * * It sets the values for kernels' arguments and enqueues calls to the kernels * on to the command queue and waits till the calls have finished execution. * * It also gets kernel start and end time if profiling is enabled. */ int runCLKernels(void); /* Releases OpenCL resources (Context, Memory etc.) */ int cleanupCL(void); /* Releases program's resources */ void cleanupHost(void); /* * Prints no more than 256 elements of the given array. * Prints full array if length is less than 256. * * Prints Array name followed by elements. */ void print1DArray( const std::string arrayName, const unsigned int * arrayData, const unsigned int length); #endif /* #ifndef TEMPLATE_H_ */

0 Likes

Hi,

I think I found out the problem. The template code is creating a CPU context instead of GPU one. Hence my best guess is that everything was running on CPU and never went to GPU. Please correct me if I am wrong. Here is how it was creating the context:

 context = clCreateContextFromType(cps,
                                      CL_DEVICE_TYPE_CPU,
                                      NULL,
                                      NULL,
                                      &status);

I change CL_DEVICE_TYPE_CPU to CL_DEVICE_TYPE_GPU and then I had to read buffer to get the values.

Could you tell me when to use CL_DEVICE_TYPE_CPU??

Thanks,

Richeek

0 Likes

in fact you want use CPU device when there is none GPU device. or your algorithm is not suitable for GPU.

0 Likes

Originally posted by: nou in fact you want use CPU device when there is none GPU device. or your algorithm is not suitable for GPU.

 

I did not quite nderstand you comment. Why do you say that algorithm is not suitable for GPU? Also is it true when I write CL_DEVICE_TYPE_CPU the algorithm is running on CPU?

0 Likes

it may be not suitable for GPU. so it may be worth use a CPU for that task. you aked when to use CPU.

yes when you create context with CL_DEVICE_TYPE_CPU it is runing on CPU and not on GPU.

0 Likes