cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

doqtor
Adept I

Enqueuing to device side queue in a loop issue

In my code I have kernelA and kernelB. kernelB depends on kernelA results. I am iterating over this kernels thousand of times and each iteration depends on the results from the previous iteration.

Below is the minimal version reproducing the issue - enqueue_kernel is either returning -1 or hanging when size parameter from kernelLauncher is set to >= 513. Everything seems to be OK when size is set to <= 512. Is this code OK? Am I hitting some hardware limit here? (1025th kernel enqueue is failing).

__kernel  __attribute__((reqd_work_group_size(256, 1, 1)))

void kernelA(int index)

{}

__kernel  __attribute__((reqd_work_group_size(256, 1, 1)))

void kernelB(int index)

{}

__kernel  __attribute__((reqd_work_group_size(1, 1, 1)))

__kernel void kernelLauncher(int size,  __global int *err)

{

    queue_t default_queue = get_default_queue();

    clk_event_t ev1, ev2;

    int ret;

    for (int index = 0; index < size; ++index)

    {

        void(^fnKernelA)(void) = ^{ kernelA(index); };

        if (index == 0)

        {

            ret = enqueue_kernel(default_queue,

                CLK_ENQUEUE_FLAGS_NO_WAIT,

                ndrange_1D(3*256, 256),

                0, NULL, &ev1,

                fnKernelA);

        }

        else

        {

            ret = enqueue_kernel(default_queue,

                CLK_ENQUEUE_FLAGS_NO_WAIT,

                ndrange_1D(3 * 256, 256),

                1, &ev2, &ev1,

                fnKernelA);

        }

        if (ret != CLK_SUCCESS)

        {

            *err = ret;

            return;

        }

        void(^fnKernelB)(void) = ^{ kernelB(index); };

        ret = enqueue_kernel(default_queue,

            CLK_ENQUEUE_FLAGS_NO_WAIT,

            ndrange_1D(256, 256),

            1, &ev1, &ev2,

            fnKernelB);

        if (ret != CLK_SUCCESS)

        {

            *err = ret;

            return;

        }

    }

}

My set up:

Ubuntu 14.04, R9 295, fglrx 15.20.3, AMD APP SDK 3.0

0 Likes
8 Replies
dipak
Big Boss

Am I hitting some hardware limit here?

May be. There is a size limit for the device queue which can be queried by clGetDeviceInfo  with param CL_DEVICE_QUEUE_ON_ DEVICE_MAX_SIZE.

[Same can be found from clinfo marked by "Queue on device max size"]

Usually, the queue size is set to a lower value (or preferred value)  than the max. limit ( CL_DEVICE_QUEUE_ON_ DEVICE_PREFERRED_SIZE or see "Queue on device preferred size" in clinfo). During the device queue creation, one can modify the size value using clCreateCommandQueueWithProperties  with param CL_QUEUE_SIZE.

For example,

        cl_queue_properties prop[] = {

  CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE|CL_QUEUE_ON_DEVICE_DEFAULT,

  CL_QUEUE_SIZE, maxQueueSize, 0 };

You may try this above.

Regards,

0 Likes

For R9 295 that is:

CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE: 262,144

CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE: 524,288

I set the max size but that didn't change anything, it's still failing on 1025th enqueue. I tried calling release_event - no difference.

Note that this code is working fine on Intel CPU.

0 Likes

Thanks for sharing your observation.

From your observation, it seems that the queue size has no effect on this number. I'll try the code at my end and let you know my findings.

Regards

0 Likes

it's still failing on 1025th enqueue. I tried calling release_event - no difference.
Note that this code is working fine on Intel CPU.

I guess, in case of R9 295, the value of CL_DEVICE_MAX_ON_DEVICE_EVENTS or "Max on device events" (clinfo) is  1024. So, falling after that value seems logical. However as you said, you tried to release the events but still not working. Could you please share that code?

Also, please check the value of CL_DEVICE_MAX_ON_DEVICE_EVENTS for the Intel CPU and then try to exceed that value.

0 Likes

On Intel CPU the limit is

CL_DEVICE_MAX_ON_DEVICE_EVENTS: 4,294,967,295

and I don't really want to try to go over that limit

On R9 295 that is:

CL_DEVICE_MAX_ON_DEVICE_EVENTS:        1,024

It's hard to try to figure out this because frequently it hangs the kernel and I have to go for reboot.

Code with release_event:

__kernel  __attribute__((reqd_work_group_size(1, 1, 1)))

__kernel void kernelLauncher(

    int size,

    __global int *err

    )

{

    queue_t default_queue = get_default_queue();

    clk_event_t ev1, ev2;

    int ret;

    for (int index = 0; index < size; ++index)

    {

        void(^fnKernelA)(void) = ^{ kernelA(

            index

            ); };

        if (index == 0)

        {

            ret = enqueue_kernel(default_queue,

                CLK_ENQUEUE_FLAGS_NO_WAIT,

                ndrange_1D(3*256, 256),

                0, NULL, &ev1,

                fnKernelA);

        }

        else

        {

            ret = enqueue_kernel(default_queue,

                CLK_ENQUEUE_FLAGS_NO_WAIT,

                ndrange_1D(3 * 256, 256),

                1, &ev2, &ev1,

                fnKernelA);

        }

        if (ret != CLK_SUCCESS)

        {

            *err = ret;

            return;

        }

        if(is_valid_event(ev2))

            release_event(ev2);

        void(^fnKernelB)(void) = ^{ kernelB(

            index

            ); };

        ret = enqueue_kernel(default_queue,

            CLK_ENQUEUE_FLAGS_NO_WAIT,

            ndrange_1D(256, 256),

            1, &ev1, &ev2,

            fnKernelB);

        if (ret != CLK_SUCCESS)

        {

            *err = ret;

            return;

        }

        if(is_valid_event(ev1))

            release_event(ev1);

    }

}

Another different attempt:

__kernel  __attribute__((reqd_work_group_size(1, 1, 1)))

__kernel void kernelLauncher(

    int size,

    __global int *err

    )

{

    queue_t default_queue = get_default_queue();

    clk_event_t ev1, ev2;

    int ret;

    for (int index = 0; index < size; ++index)

    {

        void(^fnKernelA)(void) = ^{ kernelA(

            index

            ); };

        if (index == 0)

        {

            ret = enqueue_kernel(default_queue,

                CLK_ENQUEUE_FLAGS_NO_WAIT,

                ndrange_1D(3*256, 256),

                0, NULL, &ev1,

                fnKernelA);

        }

        else

        {

            ret = enqueue_kernel(default_queue,

                CLK_ENQUEUE_FLAGS_NO_WAIT,

                ndrange_1D(3 * 256, 256),

                1, &ev2, &ev1,

                fnKernelA);

            if(ret == CLK_SUCCESS)

                release_event(ev2);

        }

        if (ret != CLK_SUCCESS)

        {

            *err = index*2;

            return;

        }

        void(^fnKernelB)(void) = ^{ kernelB(

            index

            ); };

        ret = enqueue_kernel(default_queue,

            CLK_ENQUEUE_FLAGS_NO_WAIT,

            ndrange_1D(256, 256),

            1, &ev1, &ev2,

            fnKernelB);

        if (ret != CLK_SUCCESS)

        {

            *err = index*2+1;

            return;

        }

        else

            release_event(ev1);

    }

}

0 Likes

Hi,

I was doing some experiments with the above code and got few interesting observations.

Firstly, it seems that the limit is not linked with events, instead number of kernels that can be enqueued at most. These limits depends on the size of the queue. On a Hawaii, the max. value seems 1024 when queue size is set to maximum.

In your case, as I guess, it reaches the limit when value of "size" greater than 512. After that the device stop responding. This can be realize if enqueue_marker  is used at a certain point. For example, the modified the code shown below was working fine for a large value of "size" (and even without the calling of release event):

  1. ret = enqueue_kernel(default_queue, 
  2.             CLK_ENQUEUE_FLAGS_NO_WAIT, 
  3.             ndrange_1D(256, 256), 
  4.             1, &ev1, &ev2, 
  5.             fnKernelB); 
  6.  
  7.         if (ret != CLK_SUCCESS) 
  8.         { 
  9.             *err = ret; 
  10.             return
  11.         } 
  12. if(index == SYNPOINT) {  // SYNPOINT = any suitable value within limit e.g. 128 or 256
  13.         enqueue_marker(default_queue, 1, &ev2, 0);
  14. }

Could you please check and share your findings?

Regards,

0 Likes

Hi dipak,

I tried your suggestion and when I set synpoint to index % 128 == 0 then there were first 1020 enqueue_kernel and 4 enqueue_marker successful and after that the 1021th enqueue_kernel failed which was the 1025th submission to the queue so yes it looks like that is a problem with the queue being full but marker is not releasing it. Also kernel doesn't hang anymore when the marker was added. I also tried releasing events but nothing changed.

Does that mean that there is a bug in the driver I'm using (fglrx 15.20.3)? Any possible workarounds?

Regards,

0 Likes

Actually, in my case, I was able to set the "size" value much greater than 512 using enqueue_marker. I even tried it  per call basis. Though my setup was not exactly same as yours. It was: Ubuntu 14.04 64bit, Hawaii XT, internal catalyst driver.

0 Likes