timchist
Elite

clEnqueueWriteBufferRect does not work when region width is not equal to src pitch

I'm trying to copy a rectangular area of an 8-bit image channel in host memory to a GPU buffer that has the same size as the area.

The area starts at offset (0, 0) and is 1696 x 2048 (width x height).

The whole image is 5585 x 3723.

This is the code I execute:

size_t bufferOrigin[3], hostOrigin[3], region[3];
bufferOrigin[0] = 0;
bufferOrigin[1] = 0;
bufferOrigin[2] = 0;
hostOrigin[0] = 0;
hostOrigin[1] = 0;
hostOrigin[2] = 0;
region[0] = 1696;
region[1] = 2048;
region[2] = 1;
clEnqueueWriteBufferRect(queue, ptrD, CL_TRUE, bufferOrigin, hostOrigin, region, 1696, 0, 5585, 0, ptrH, 0, NULL, NULL);

The first line of my resulting image is correct, but the second line and all subsequent lines are not. In fact, the output pixel (y=1, x=0) is taken from source pixel (y=0, x=1696), not from (y=1, x=0). So it seems that this function somehow misinterprets the host_row_pitch parameter.

Is this really a bug, or am I doing something wrong? Unfortunately, the APP SDK does not have any samples for clEnqueueWriteBufferRect.
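
To make the expected behaviour concrete, here is a minimal host-only sketch of what a 2D rectangular write should do with the two row pitches, next to a model of the symptom described above. It does not call OpenCL at all. The function names (copyRectReference, copyRectIgnoringHostPitch, makeTestImage) are made up for this illustration, and the "faulty" variant is only an assumption about what the observed output pattern implies, not a description of the actual driver code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side model of a 2D rectangular write with 1-byte pixels and both
// origins at (0, 0): row y is read at y * hostRowPitch and written at
// y * bufferRowPitch.
void copyRectReference(const unsigned char* host, std::size_t hostRowPitch,
                       unsigned char* buffer, std::size_t bufferRowPitch,
                       std::size_t width, std::size_t height)
{
    assert(hostRowPitch >= width && bufferRowPitch >= width);
    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(buffer + y * bufferRowPitch, host + y * hostRowPitch, width);
}

// Hypothetical model of the reported symptom: the host row pitch is ignored,
// so the source is read as if its rows were only `width` bytes apart.
void copyRectIgnoringHostPitch(const unsigned char* host,
                               unsigned char* buffer, std::size_t bufferRowPitch,
                               std::size_t width, std::size_t height)
{
    assert(bufferRowPitch >= width);
    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(buffer + y * bufferRowPitch, host + y * width, width);
}

// Builds a width x height image whose pixel (x, y) holds 10 * y + x, which
// makes the row a value came from easy to read off.
std::vector<unsigned char> makeTestImage(std::size_t width, std::size_t height)
{
    std::vector<unsigned char> image(width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            image[y * width + x] = static_cast<unsigned char>(10 * y + x);
    return image;
}
```

For a 6 x 4 source image and a 3 x 2 region, the reference copy produces the rows "0 1 2" and "10 11 12", while the pitch-ignoring variant produces "0 1 2" and "3 4 5": the second output row starts at source pixel (y=0, x=3), which is the same pattern as the (y=0, x=1696) symptom in the question.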

himanshu_gautam
Grandmaster

Hi timchist,

Your problem looks the same as the one described in this thread: "row_pitch error with CL_MEM_USE_HOST_PTR".

That was a valid issue and was reported to the AMD engineering team some time back. A test case is also present in that thread. Could you confirm whether the attached test case is enough to reproduce your problem?

Otherwise, you can send a separate test case.

Hope it helps.

Thanks Himanshu. I'm not sure whether this is the same problem or a different one. I'm calling different OpenCL functions and not using images, but just copying data from host to device. Both problems can be caused by the same error on a lower level, but it's hard to tell without reverse-engineering the runtime modules.

The test case attached to the other thread, however, fails on a computer with an HD 7970 and passes on a computer with an HD 5850, and so does my test case. It looks like the problem is indeed Tahiti-specific (or could be GCN-specific).

I will be posting my test case soon.

german
Staff

The following line sends 1696 as buffer_row_pitch and 5585 as host_row_pitch.

       clEnqueueWriteBufferRect(queue, ptrD, CL_TRUE, bufferOrigin, hostOrigin, region, 1696, 0, 5585, 0, ptrH, 0, NULL, NULL);

Passing 1696 as buffer_row_pitch means the image in GPU memory can be at most 1696 texels wide, but you said that both allocations are the same size. In that case the call should look like this:

       clEnqueueWriteBufferRect(queue, ptrD, CL_TRUE, bufferOrigin, hostOrigin, region, 5585, 0, 5585, 0, ptrH, 0, NULL, NULL);

[Attached image: Transfer.png]

> but you said that the size of the both allocations is the same

I did not say that. The CPU memory block is 5585x3723, GPU block is 1696x2048 (same size as the area that is being copied).
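
As a rough check of why a buffer_row_pitch of 5585 cannot be used with a 1696 x 2048 byte buffer, the sketch below computes the smallest buffer that a given origin, region and pitch pair can address. It follows the offset rule from the OpenCL specification (origin[2] * slice_pitch + origin[1] * row_pitch + origin[0], with a pitch of 0 falling back to the tightly packed default). The helper name requiredBufferBytes is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Smallest buffer size, in bytes, that can hold `region` placed at `origin`
// with the given pitches. A pitch of 0 selects the tightly packed default,
// as it does for clEnqueueWriteBufferRect.
std::size_t requiredBufferBytes(const std::size_t origin[3],
                                const std::size_t region[3],
                                std::size_t rowPitch, std::size_t slicePitch)
{
    assert(region[0] > 0 && region[1] > 0 && region[2] > 0);
    if (rowPitch == 0)
        rowPitch = region[0];
    if (slicePitch == 0)
        slicePitch = region[1] * rowPitch;

    // Byte offset of the first element of the rectangle.
    const std::size_t offset =
        origin[2] * slicePitch + origin[1] * rowPitch + origin[0];

    // The last row only needs region[0] bytes, not a full pitch.
    return offset + (region[2] - 1) * slicePitch
                  + (region[1] - 1) * rowPitch
                  + region[0];
}
```

With origin (0, 0, 0) and region (1696, 2048, 1), a row pitch of 1696 needs 3,473,408 bytes, which is exactly the size of the GPU buffer. A row pitch of 5585 would need 11,434,191 bytes, so the rectangle would be out of bounds for that buffer and the call should be rejected with CL_INVALID_VALUE rather than perform the copy.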

timchist
Elite

I'm attaching a simple test case that shows how to reproduce the problem. The test passes on HD 5850, but fails on HD 7970 (both machines are running Windows 7 x64 and the driver included in Catalyst 13.1).

---

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

//------------------------------------------------------------------------------
// Prints the name of the failing call and its error code, then exits.
void checkErr(const char *func, cl_int err)
{
    if(err != CL_SUCCESS)
    {
        fprintf( stderr, "%s(): ", func );
        switch( err )
        {
            case CL_BUILD_PROGRAM_FAILURE:  fprintf (stderr, "CL_BUILD_PROGRAM_FAILURE"); break;
            case CL_COMPILER_NOT_AVAILABLE: fprintf (stderr, "CL_COMPILER_NOT_AVAILABLE"); break;
            case CL_DEVICE_NOT_AVAILABLE:   fprintf (stderr, "CL_DEVICE_NOT_AVAILABLE"); break;
            case CL_DEVICE_NOT_FOUND:       fprintf (stderr, "CL_DEVICE_NOT_FOUND"); break;
            case CL_INVALID_BINARY:         fprintf (stderr, "CL_INVALID_BINARY"); break;
            case CL_INVALID_BUILD_OPTIONS:  fprintf (stderr, "CL_INVALID_BUILD_OPTIONS"); break;
            case CL_INVALID_CONTEXT:        fprintf (stderr, "CL_INVALID_CONTEXT"); break;
            case CL_INVALID_DEVICE:         fprintf (stderr, "CL_INVALID_DEVICE"); break;
            case CL_INVALID_DEVICE_TYPE:    fprintf (stderr, "CL_INVALID_DEVICE_TYPE"); break;
            case CL_INVALID_OPERATION:      fprintf (stderr, "CL_INVALID_OPERATION"); break;
            case CL_INVALID_PLATFORM:       fprintf (stderr, "CL_INVALID_PLATFORM"); break;
            case CL_INVALID_PROGRAM:        fprintf (stderr, "CL_INVALID_PROGRAM"); break;
            case CL_INVALID_VALUE:          fprintf (stderr, "CL_INVALID_VALUE"); break;
            case CL_OUT_OF_HOST_MEMORY:     fprintf (stderr, "CL_OUT_OF_HOST_MEMORY"); break;
            default:                        fprintf (stderr, "Unknown error code: %d", (int)err); break;
        }
        fprintf (stderr, "\n");
        getchar();
        exit( err );
    }
}

int main(void)
{
    ///////////////////////////////////////////////////////////////////////////
    // Initialization
    ///////////////////////////////////////////////////////////////////////////
    cl_int err = CL_SUCCESS;
    cl_uint nPlatforms = 0;
    cl_platform_id *platforms = NULL;
    cl_platform_id platform = (cl_platform_id)NULL;
    cl_context_properties cprops[3];
    size_t nDevices = 0;
    cl_device_id *devices = NULL;
    cl_device_id device_id = 0;
    cl_context context;
    cl_command_queue queue;

    /* figure out the number of platforms on this system. */
    err = clGetPlatformIDs(0, NULL, &nPlatforms);
    checkErr( "clGetPlatformIDs", err );
    printf( "Number of platforms found: %u\n", (unsigned)nPlatforms );
    if( nPlatforms == 0 )
    {
        fprintf( stderr, "Cannot continue without any platforms. Exiting.\n" );
        return( -1 );
    }

    platforms = (cl_platform_id *)malloc( sizeof(cl_platform_id) * nPlatforms );
    err = clGetPlatformIDs( nPlatforms, platforms, NULL );
    checkErr( "clGetPlatformIDs", err );

    puts("Platforms:");
    for(cl_uint i = 0; i < nPlatforms; i++ )
    {
        char pbuf[100];
        err = clGetPlatformInfo( platforms[i], CL_PLATFORM_VENDOR,
                                 sizeof(pbuf), pbuf, NULL );
        checkErr( "clGetPlatformInfo", err );
        printf("#%u: %s\n", (unsigned)i, pbuf);
    }

    /* find the AMD platform. */
    for(cl_uint i = 0; i < nPlatforms; i++ )
    {
        char pbuf[100];
        err = clGetPlatformInfo( platforms[i], CL_PLATFORM_VENDOR,
                                 sizeof(pbuf), pbuf, NULL );
        checkErr( "clGetPlatformInfo", err );
        if( strcmp(pbuf, "Advanced Micro Devices, Inc.") == 0 )
        {
            printf( "Found AMD platform\n\n" );
            platform = platforms[i];
            break;
        }
    }
    free( platforms ); /* the chosen platform id stays valid after the list is freed. */

    if( platform == (cl_platform_id)NULL )
    {
        fprintf( stderr, "Could not find an AMD platform. Exiting.\n" );
        return( -1 );
    }

    cprops[0] = CL_CONTEXT_PLATFORM;
    cprops[1] = (cl_context_properties)platform;
    cprops[2] = (cl_context_properties)NULL; /* end of options list marker */

    /* create a context with all of the available GPU devices. */
    context = clCreateContextFromType( cprops, CL_DEVICE_TYPE_GPU, NULL, NULL, &err );
    checkErr( "clCreateContextFromType", err );

    /* get a device count for this context. */
    err = clGetContextInfo( context, CL_CONTEXT_DEVICES, 0, NULL, &nDevices );
    checkErr( "clGetContextInfo", err );
    nDevices = nDevices / sizeof(cl_device_id); /* need to generate actual device count from size of required buffer. */
    printf( "Number of devices found: %u\n", (unsigned)nDevices );
    if (nDevices == 0) {
        fprintf( stderr, "Could not find GPU devices. Exiting.\n" );
        return( -1 );
    }

    /* grab the handles to all of the devices in the context. */
    devices = (cl_device_id *)malloc( sizeof(cl_device_id) * nDevices );
    err = clGetContextInfo( context, CL_CONTEXT_DEVICES, sizeof(cl_device_id)*nDevices, devices, NULL );
    checkErr( "clGetContextInfo", err );
    device_id = devices[0];
    free( devices );

    queue = clCreateCommandQueue(context, device_id, 0, &err);
    checkErr("clCreateCommandQueue", err);

    ///////////////////////////////////////////////////////////////////////////
    // The actual test
    ///////////////////////////////////////////////////////////////////////////
    const int FullImageWidth = 256;
    const int FullImageHeight = 256;
    const int PartialImageWidth = 16;
    const int PartialImageHeight = 16;

    // Fill the host image with the low 8 bits of each pixel's linear index.
    unsigned char* hostFullImage = new unsigned char[FullImageWidth * FullImageHeight];
    for(int y = 0; y < FullImageHeight; ++y)
        for(int x = 0; x < FullImageWidth; ++x)
            hostFullImage[y * FullImageWidth + x] = (unsigned char)(y * FullImageWidth + x);

    // The device buffer is exactly as large as the copied area.
    cl_mem deviceBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE, PartialImageWidth * PartialImageHeight, NULL, &err);
    checkErr("clCreateBuffer", err);

    // Zero the buffer first (clEnqueueFillBuffer requires OpenCL 1.2).
    unsigned char pattern = 0;
    err = clEnqueueFillBuffer(queue, deviceBuffer, &pattern, 1, 0, PartialImageWidth * PartialImageHeight, 0, NULL, NULL);
    checkErr("clEnqueueFillBuffer", err);

    size_t bufferOrigin[3], hostOrigin[3], region[3];
    bufferOrigin[0] = 0;
    bufferOrigin[1] = 0;
    bufferOrigin[2] = 0;
    hostOrigin[0] = 0;
    hostOrigin[1] = 0;
    hostOrigin[2] = 0;
    region[0] = PartialImageWidth;
    region[1] = PartialImageHeight;
    region[2] = 1;

    // buffer_row_pitch is the partial width, host_row_pitch is the full width.
    err = clEnqueueWriteBufferRect(queue, deviceBuffer, CL_TRUE, bufferOrigin, hostOrigin, region,
                                   PartialImageWidth, 0, FullImageWidth, 0, hostFullImage, 0, NULL, NULL);
    checkErr("clEnqueueWriteBufferRect", err);

    // Read the whole device buffer back and compare it with the source area.
    unsigned char* hostPartialImage = new unsigned char[PartialImageWidth * PartialImageHeight];
    err = clEnqueueReadBuffer(queue, deviceBuffer, CL_TRUE, 0, PartialImageWidth * PartialImageHeight, hostPartialImage, 0, NULL, NULL);
    checkErr("clEnqueueReadBuffer", err);

    bool testPassed = true;
    for(int y = 0; y < PartialImageHeight; ++y)
    {
        for(int x = 0; x < PartialImageWidth; ++x)
            if(hostFullImage[y * FullImageWidth + x] != hostPartialImage[y * PartialImageWidth + x])
            {
                testPassed = false;
                break;
            }
        if(!testPassed)
            break;
    }

    if(testPassed)
        puts("Test passed, all OK");
    else
    {
        puts("Test failed.\n");

        puts("Expected:");
        for(int y = 0; y < PartialImageHeight; ++y)
        {
            for(int x = 0; x < PartialImageWidth; ++x)
                printf("%3d ", (int)hostFullImage[y * FullImageWidth + x]);
            puts("");
        }

        puts("\nActual:");
        for(int y = 0; y < PartialImageHeight; ++y)
        {
            for(int x = 0; x < PartialImageWidth; ++x)
                printf("%3d ", (int)hostPartialImage[y * PartialImageWidth + x]);
            puts("");
        }
    }

    ///////////////////////////////////////////////////////////////////////////
    // Clean-up
    ///////////////////////////////////////////////////////////////////////////
    err = clReleaseMemObject(deviceBuffer);
    checkErr("clReleaseMemObject", err);
    err = clReleaseCommandQueue(queue);
    checkErr("clReleaseCommandQueue", err);
    err = clReleaseContext(context);
    checkErr("clReleaseContext", err);

    // Arrays allocated with new[] must be released with delete[].
    delete[] hostFullImage;
    delete[] hostPartialImage;
    return 0;
}

timchist wrote:

I'm attaching a simple test case that shows how to reproduce the problem. The test passes on HD 5850, but fails on HD 7970 (both machines are running Windows 7 x64 and the driver included in Catalyst 13.1).

It's a real problem in the OpenCL runtime. The 5850 uses the generic code path, which performs the transfer with a kernel. The 7970 is able to use the SDMA engine for that type of transfer. The SDMA implementation didn't account for different pitches and slices. It will be fixed in the upcoming driver releases.

Regards,

German

Thanks German. Are you working at AMD?

timchist wrote:

Thanks German. Are you working at AMD?

Yes, I am. The issue with CL_MEM_USE_HOST_PTR on image creation isn't the same.

Hi German,

Thanks.

Meanwhile, is there a workaround that would let us disable the DMA path?

Hi Himanshu,

It is possible to disable the accelerated DMA transfers that use the DRMDMA/SDMA engines. However, that code path may have other issues and in general has lower performance. You can try it with "set CAL_ENABLE_ASYNC_DMA=0".

Hi German,

I noticed that the issue is fixed in Catalyst 13.4.

What is the correct way to check the driver version in our code so that we only enable clEnqueueWriteBufferRect in Catalyst 13.4 and above? Device Manager reports the driver version as 12.104.0.0, but the driver version obtained via OpenCL is "1124.2 (VM)". Should I extract the first number from the driver version string and write something similar to:

if(driverVersionFloat >= 1124.2f)
  // bug is fixed, enable clEnqueueWriteBufferRect
else
  // bug is present, avoid using clEnqueueWriteBufferRect
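
One possible shape for that check is sketched below. It parses the leading "major.minor" pair as two integers instead of a float, so that a hypothetical "1124.10" would not compare as lower than "1124.2". It assumes the string comes from clGetDeviceInfo with CL_DRIVER_VERSION and always starts with that pair. The names RuntimeVersion, parseRuntimeVersion and isAtLeast are made up for this illustration, and the 1124.2 threshold is only the value observed with Catalyst 13.4 in this thread, not a documented cut-off.

```cpp
#include <cassert>
#include <cstdio>

// Leading "major.minor" pair of a driver version string.
struct RuntimeVersion
{
    int majorPart;
    int minorPart;
};

// Parses a string such as "1124.2 (VM)". Returns false, leaving `out`
// untouched, if the string does not start with two dot-separated integers.
bool parseRuntimeVersion(const char* text, RuntimeVersion& out)
{
    assert(text != nullptr);
    int majorPart = 0;
    int minorPart = 0;
    if (std::sscanf(text, "%d.%d", &majorPart, &minorPart) != 2)
        return false;
    out.majorPart = majorPart;
    out.minorPart = minorPart;
    return true;
}

// True if `v` is the given version or a later one.
bool isAtLeast(const RuntimeVersion& v, int majorPart, int minorPart)
{
    return v.majorPart > majorPart ||
           (v.majorPart == majorPart && v.minorPart >= minorPart);
}
```

A caller would enable the rectangular write only when parsing succeeds and isAtLeast(version, 1124, 2) is true, and fall back to the slower path otherwise, including when the string cannot be parsed.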

Thanks

I do not see an issue with this approach. Anyway, I have asked more people.

Is there any relationship between the Catalyst version (13.4), the driver version shown in Windows Device Manager (12.104.0.0), and the version returned via OpenCL ("1124.2 (VM)")?

One difference is that 13.4 is the version of the Catalyst graphics driver (OpenCL is just a part of it), whereas 1124.2 is the OpenCL runtime version.

I also tried to find out whether there is a better way to check the driver version than the one you mentioned above. So far I have not had any success. You can probably go with your approach if it is essential (or maybe just a note in your application's release notes will suffice).