I have an image2D that was allocated with the CL_MEM_USE_PERSISTENT_MEM_AMD flag. I map this image and memcpy data to the returned pointer. However, after I unmap the image, the kernel fetches wrong data from it.
I do not see any API errors in the APP Profiler trace. Also, the row pitch returned by the map command exactly matches the size of a single image row in bytes.
The problem disappears if I do ANY of the following:
1. Delete CL_MEM_USE_PERSISTENT_MEM_AMD flag.
2. Replace CL_MEM_USE_PERSISTENT_MEM_AMD with CL_MEM_ALLOC_HOST_PTR.
3. Use buffer instead of image (the code is very different in this case).
The first two options lose the zero-copy feature, and the last option loses texture caching in the kernel.
The device is an A8-3850 APU on a remote server. Unfortunately, I don't have any AMD GPU at home.
I suppose that mapping device-resident host-visible images is unsupported currently?
Actually I need to overlap copying data from malloced host memory to device image2D with kernel execution.
//...............................
//create image in host-visible device memory
inputImage = cl::Image2D(context, CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD, cl::ImageFormat(CL_A, CL_FLOAT), WIDTH, HEIGHT, 0, 0, &err);
//...............................
//map the image
cl::Event evMap;
cl::size_t<3> origin, size;
size.push_back(WIDTH); size.push_back(HEIGHT); size.push_back(1);
origin.push_back(0); origin.push_back(0); origin.push_back(0);
*dstPtr = (float*)pQueue->enqueueMapImage(inputImage, CL_FALSE, CL_MAP_WRITE, origin, size, &inputImagePitch, 0, 0, &evMap);
evMap.wait();
//...............................
//copy data to mapped image (k lines, sz bytes each)
for (int i = 0; i<k; i++) {
    memcpy(dstPtr, srcPtr, sz);
    dstPtr += sz;
    srcPtr += inputImagePitch;
}
//...............................
//unmap the image
cl::Event evUnmap;
pQueue->enqueueUnmapMemObject(buffer, ptr, 0, &evUnmap);
evUnmap.wait();
//...............................
//run the kernel
devProcessBlockKernel.setArg(0, inputImage);
devProcessBlockKernel.setArg(1, outBuffer);
cl::Event evKernel;
pQueue->enqueueNDRangeKernel(devProcessBlockKernel, cl::NullRange, OverallThreads, ThreadsInBlock, 0, &evKernel);
//...............................
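A side note on the copy loop above: as posted, it advances the destination by sz and the source by inputImagePitch, and the pointer types are not shown, so the strides may not be applied in bytes. Since the pitch reportedly equals the row size, this is probably not what causes the wrong data here, but a pitch-aware copy is easy to get subtly wrong. Below is a minimal sketch of such a copy. The helper name copyRowsPitched and its parameters are hypothetical and not part of OpenCL or the original code; it assumes that all strides are given in bytes and that the mapped pointer is the destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the original post, not an OpenCL API):
// copies `rows` rows of `rowBytes` bytes each between two 2D regions
// whose row strides (pitches) may differ. All strides are in BYTES,
// so the pointers are advanced as unsigned char*, not as float*.
inline void copyRowsPitched(void* dst, std::size_t dstPitch,
                            const void* src, std::size_t srcPitch,
                            std::size_t rowBytes, std::size_t rows)
{
    // A pitch smaller than the row payload would make rows overlap.
    assert(dstPitch >= rowBytes && srcPitch >= rowBytes);

    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < rows; ++i) {
        std::memcpy(d, s, rowBytes);
        d += dstPitch;  // destination: e.g. the row pitch returned by the map call
        s += srcPitch;  // source: the stride of the host-side array
    }
}
```

With the names from the snippet, the call would look roughly like copyRowsPitched(mappedPtr, inputImagePitch, srcPtr, sz, sz, k), assuming the source rows are tightly packed.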
Eventually I decided to use the third option: switch to buffers completely without even rewriting anything. Surprisingly enough I don't see any performance difference=). Perhaps both images and buffers use cache equally well.
Anyway, it is still interesting whether mapping a device host-visible 2D image is supported...
By the way, AMD APP 2.5, Windows 7 64 bit.
Stgatilov,
There is a known issue in SDK 2.5 related to this. Are WIDTH and HEIGHT powers of 2? Try power-of-2 sizes for both WIDTH and HEIGHT.
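For anyone who wants to try this suggestion, here is a minimal sketch of how the dimensions could be checked and rounded up before creating the image. The helper names isPowerOfTwo and nextPowerOfTwo are hypothetical, and whether padding the image actually avoids the SDK 2.5 issue is an assumption that this thread does not confirm.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of OpenCL or the AMD APP SDK).

// True if n is an exact power of two (1, 2, 4, 8, ...).
inline bool isPowerOfTwo(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Smallest power of two that is >= n. Requires n > 0 and assumes
// the result fits in std::size_t.
inline std::size_t nextPowerOfTwo(std::size_t n)
{
    assert(n != 0);
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}
```

The kernel would still have to read only the original WIDTH x HEIGHT region of the padded image, and the padding costs extra memory.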
Thank you for the reply!
The size of image was 1920x1080.
I ran into this bug during the AMD APP performance challenge. The competition is over now, and I no longer have access to an AMD GPU.
Since this issue is already known, the topic can be closed =)