boxerab
Challenger

Strange clEnqueueMapImage/clEnqueueUnmapMemObject behaviour

Windows 10

Latest Crimson Driver

RX 470

I created a 1024x1024 image and mapped it with the CL_MAP_WRITE flag, but I only mapped a 64x64 region of the image. Using the returned pointer, I filled the entire 1024x1024 image, then called clEnqueueUnmapMemObject.

I expected only the 64x64 region to be copied from host to device, but in fact the entire 1024x1024 region gets copied.

Is this expected behaviour? It doesn't seem very efficient to me: if I only map a 64x64 region, then only that region should get copied to the card.

Thanks,

Aaron

dipak
Big Boss

Hi Aaron,

Thanks for reporting it.

May I know how the image was created? Was it in pinned host memory? It would be really helpful if you could share a code snippet that manifests the above behaviour.

Regards,

Thanks, Dipak. Here is the creation and mapping code.

So, it looks like the region dimensions are ignored by clEnqueueMapImage, and the entire image is always mapped.

// create image
cl_int error_code = CL_SUCCESS;

cl_mem_flags flags = CL_MEM_ALLOC_HOST_PTR;
flags |= CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY;

cl_image_desc desc;
desc.image_type        = CL_MEM_OBJECT_IMAGE2D;
desc.image_width       = 540;
desc.image_height      = 1080;
desc.image_depth       = 0;
desc.image_array_size  = 0;
desc.image_row_pitch   = 0;
desc.image_slice_pitch = 0;
desc.num_mip_levels    = 0;
desc.num_samples       = 0;
desc.buffer            = NULL;

cl_image_format format;
format.image_channel_order     = CL_RGBA;
format.image_channel_data_type = CL_UNSIGNED_INT32;

cl_mem image = clCreateImage(ocl->context, flags, &format, &desc, hostBuffer, &error_code);
if (CL_SUCCESS != error_code) {
    Util::LogError("Error: clCreateImage returned %s.\n", Util::TranslateOpenCLError(error_code));
}

// map region of image
cl_int error_code = CL_SUCCESS;
size_t image_dimensions[3] = { 64, 64, 1 };
size_t image_origin[3]     = { 0, 0, 0 };
size_t image_row_pitch;

*mappedPtr = clEnqueueMapImage(queue,
                               img,
                               CL_TRUE,            // blocking map
                               CL_MAP_WRITE,
                               image_origin,
                               image_dimensions,   // region: only 64x64
                               &image_row_pitch,
                               NULL,
                               0,
                               NULL,
                               NULL,
                               &error_code);
if (CL_SUCCESS != error_code) {
    Util::LogError("Error: clEnqueueMapImage returned %s.\n", Util::TranslateOpenCLError(error_code));
}
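For completeness, the fill-and-unmap step would look roughly like the following (a sketch, not the exact code from the thread; it fills only the mapped 64x64 region row by row using the returned row pitch, then unmaps, which is what triggers the host-to-device update):

    // requires <cstring> for memset
    // fill the mapped region row by row using the returned row pitch
    unsigned char* dst = (unsigned char*)(*mappedPtr);
    const size_t pixelSize = 4 * sizeof(cl_uint);   // CL_RGBA / CL_UNSIGNED_INT32
    for (size_t y = 0; y < image_dimensions[1]; ++y) {
        memset(dst + y * image_row_pitch, 0, image_dimensions[0] * pixelSize);
    }

    // unmapping enqueues the host-to-device update of the mapped region
    error_code = clEnqueueUnmapMemObject(queue, img, *mappedPtr, 0, NULL, NULL);
    if (CL_SUCCESS != error_code) {
        Util::LogError("Error: clEnqueueUnmapMemObject returned %s.\n", Util::TranslateOpenCLError(error_code));
    }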

Dipak,

I can confirm that the problem is CL_MEM_ALLOC_HOST_PTR.

I took the ImageBandwidth sample from the APP SDK and made some small modifications to the program: created an 8192x8192 image for writing, set the map region to 64x64, and looked at the timing for the unmap call.

Without CL_MEM_ALLOC_HOST_PTR, unmap is very fast, as expected for such a small region.

With CL_MEM_ALLOC_HOST_PTR, unmap takes the same time as it does with the map region set to 8192x8192.

So, there is a bug with region mapping when CL_MEM_ALLOC_HOST_PTR is set at image creation.
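Roughly, the timing being measured is the following (a minimal sketch rather than the actual ImageBandwidth code; it just wall-clocks the unmap plus a clFinish):

    // requires <chrono> and <cstdio>
    // Sketch: measure how long the unmap (host-to-device update) takes.
    // 'queue', 'image' and 'mappedPtr' come from the earlier creation/mapping code.
    auto start = std::chrono::high_resolution_clock::now();

    cl_int err = clEnqueueUnmapMemObject(queue, image, mappedPtr, 0, NULL, NULL);
    clFinish(queue);   // make sure the transfer has actually completed

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    printf("unmap took %.3f ms (err = %d)\n", ms, err);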

Cheers,

Aaron

Hi Aaron,

Sorry for the delayed reply.

I've checked, and the above behaviour appears to be the expected one. CL_MEM_ALLOC_HOST_PTR means a system (host) memory allocation: it is the same memory as seen by the CPU or the GPU, so there is no such thing as a "region" transfer in this case.

Regards,

Thanks, Dipak. But it still sounds like a bug to me, or perhaps a missed opportunity.

For a discrete GPU, there is host memory and there is device memory. When I enqueue an unmap, even with the CL_MEM_ALLOC_HOST_PTR flag, memory is transferred from host to device. So it must be possible to specify a region in this case. If the driver always transfers the entire image, that is very inefficient.

For an APU the situation is different, but for a dGPU it still makes sense to transfer only a region, even with CL_MEM_ALLOC_HOST_PTR.
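One possible workaround (a sketch, not from the test code discussed above) is to skip map/unmap for the sub-region case and call clEnqueueWriteImage directly with an origin and region; 'hostData' and 'hostRowPitch' are placeholders for the source pixels in system memory, and the image here would be created without CL_MEM_ALLOC_HOST_PTR:

    // Sketch: explicitly transfer only a 64x64 region of a larger image.
    size_t origin[3] = { 0, 0, 0 };
    size_t region[3] = { 64, 64, 1 };
    cl_int err = clEnqueueWriteImage(queue,
                                     image,         // created without CL_MEM_ALLOC_HOST_PTR
                                     CL_TRUE,       // blocking write
                                     origin,
                                     region,
                                     hostRowPitch,  // source row pitch in bytes (0 = tightly packed)
                                     0,             // source slice pitch (unused for 2D)
                                     hostData,
                                     0, NULL, NULL);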

Hi Aaron,

In the case of CL_MEM_ALLOC_HOST_PTR, the memory object is allocated in pinned host memory and shared by all the devices in the context. It behaves as a zero-copy object, and the same memory location is used for each map/unmap. This behaviour holds even for a dGPU.

Because this pinned host memory is accessed directly by the kernel, kernel performance may suffer, and the effect may be even greater on a dGPU. In general, the application has to assume system memory with direct access for allocations made with CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR.

Please refer to the following sections of the AMD Optimization Guide:

Table 1.1 OpenCL Memory Object Properties

1.3.1.2 Pinned Host Memory

Regards,

Hi Dipak,

Thanks a lot for the detailed explanation. I am still puzzled, though.

I see different behaviour between CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR. If both of these use pinned host memory, then why am I able to transfer a region with CL_MEM_USE_HOST_PTR, but not with CL_MEM_ALLOC_HOST_PTR?

When I ran my test with a small region inside a large OpenCL image, CL_MEM_USE_HOST_PTR was very fast, but CL_MEM_ALLOC_HOST_PTR was not.
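For reference, the CL_MEM_USE_HOST_PTR path in a test like this would look roughly as follows (a sketch; the 4096-byte alignment and the variables 'width', 'height', 'format' and 'desc' are placeholders/assumptions rather than the exact test code):

    // Sketch: back the image with an existing, aligned host allocation so the
    // runtime can pin it and use it directly (CL_MEM_USE_HOST_PTR).
    // The 4096-byte alignment is an assumption, not a documented requirement.
    size_t imageBytes = width * height * 4 * sizeof(cl_uint);   // CL_RGBA / CL_UNSIGNED_INT32
    void* hostBuffer = _aligned_malloc(imageBytes, 4096);       // Windows aligned allocation

    cl_mem_flags flags = CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY;
    cl_mem image = clCreateImage(ocl->context, flags, &format, &desc, hostBuffer, &error_code);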

Thanks,

Aaron

Your observation looks interesting. I'll check with the concerned team for some insights.

Regards,

Hi Aaron,

The difference in behaviour looks to be an expected one. As I've come to know, though both types use pinned host memory, the current implementation of CL_MEM_USE_HOST_PTR is a little different from CL_MEM_ALLOC_HOST_PTR because of some alignment requirements. This may change in future.

Regards,

Thanks for tracking this information down. It might be a good idea to add this to the optimization guide.

Fortunately, in my case I always transfer the entire image from host to device, but others might be expecting higher performance for sub-region transfers with CL_MEM_ALLOC_HOST_PTR.
