13 Replies Latest reply on Jul 15, 2017 7:32 AM by tugrul_512bit

    Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.

    tugrul_512bit

      Here is a simple algorithm that makes a sphere's surface wavy:

       

      [code]
      __kernel void test(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
      {
          int threadId = get_global_id(0);
          float dx  = xyz[threadId*3]   - arguments[2];
          float dy  = xyz[threadId*3+1] - arguments[3];
          float t   = arguments[1];
          float ctr = arguments[0];
          float wave = 0.02f*ctr*sin(40.0f*t + 100.0f*sqrt(dx*dx + dy*dy));
          xyzo[threadId*3]   = xyz[threadId*3]   + xyzn[threadId*3]  *wave; // wave equation for all surface vertices
          xyzo[threadId*3+1] = xyz[threadId*3+1] + xyzn[threadId*3+1]*wave;
          xyzo[threadId*3+2] = xyz[threadId*3+2] + xyzn[threadId*3+2]*wave;
      }

      __kernel void waveEquation(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
      {
          int threadId = get_global_id(0);
          if(threadId < arguments[4])
          {
              queue_t q = get_default_queue();
              ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
              void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
              enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A);
          }
      }
      [/code]

       

      As you can see, each parent work-item spawns just one child kernel with a single work-item and a local size of 1.

       

      The sphere has 224*256 points (the same as the number of parent-kernel work-items), and the wave equation should be computed for each of them, but only about half of the points actually get updated.

       

      If I instead spawn 2 work-items (in one child kernel) per parent work-item, but with only 50% of the total parent work-items this time, then about 90% of the sphere is rendered correctly, and the edge points of that 90% region look like garbage.

       

      If I spawn 4 child work-items per parent work-item (again one child kernel each, but with only 25% of the parent work-items, making 224*256 points in total again), then the sphere is rendered fully and correctly.

       

      Isn't the device-side queue (OpenCL 2.0 dynamic parallelism) supposed to be a dynamic thing? Why does it randomly drop child kernels when the queue is full?

       

      This is on an RX 550 (driver 17.6.2, the latest as of now), OpenCL 2.0, Windows 10, 64-bit build. The clBuildProgram compile options include the OpenCL 2.0 flag, and the -g option is also included for more detailed error codes; no build error is returned. enqueue_kernel does not return any error code either (I didn't include the long error-checking here, for readability).
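
      For reference, the build step looks roughly like this (variable names are placeholders; the options string is the part that matters):

      [code]
      // Build with the OpenCL 2.0 flag plus -g for the more fine-grained enqueue_kernel error codes
      cl::Program program(context, kernelSource);
      cl_int buildErr = program.build("-cl-std=CL2.0 -g");
      // buildErr comes back as CL_SUCCESS, so the kernels themselves compile cleanly
      [/code]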

       

      The device side queue is created with:

       

      [code]
      cl_uint qsMax  = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE>();
      cl_uint qsPref = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();

      cl_queue_properties qprop[] = {
          CL_QUEUE_SIZE, (qsMax + qsPref) / 3,
          CL_QUEUE_PROPERTIES, (cl_command_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                                                             CL_QUEUE_ON_DEVICE |
                                                             CL_QUEUE_ON_DEVICE_DEFAULT |
                                                             CL_QUEUE_PROFILING_ENABLE),
          0 };
      device_queue = clCreateCommandQueueWithProperties(context.get(), device.get(), qprop, &err);
      [/code]

       

      As you can see, I'm using the C++ bindings from the Khronos site, which points to a GitHub project from which I downloaded cl2.hpp.
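
      The bindings are pulled in with something like this (the version macros are, as far as I understand, what exposes the OpenCL 2.0 entry points):

      [code]
      // cl2.hpp from the OpenCL-CLHPP GitHub project
      #define CL_HPP_MINIMUM_OPENCL_VERSION 200
      #define CL_HPP_TARGET_OPENCL_VERSION 200   // make the OpenCL 2.0 API visible
      #include <CL/cl2.hpp>
      [/code]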

       

      Just to "fix" my sphere rendering issue, I pick the queue size as

      (qsMax + qsPref) / 2

      or

      (2 * (qsMax + qsPref)) / 3

      but if I choose qsMax on its own, it returns an invalid-queue error and sometimes a resource error, so I pick something between the preferred and the max value. Choosing a value very close to the max makes it very slow, roughly 1/50 of the speed without dynamic parallelism.

       

      Could you try to reproduce this, please? If not, could you post a working piece of code instead?

       

      Are 60k child kernels too many for dynamic parallelism, or for the RX 550? Should I expect higher limits on bigger GPUs? If no error is returned, could this be a synchronization issue? How can I wait, from the host side, for the device queue to finish its work? (Maybe the device-side queue always synchronizes child kernels with their parents, but the documentation doesn't say.) A clinfo query via the AMD APP SDK reports a queue size of 8.3 MB. I also don't know how many bytes are allocated in the queue per child kernel with N parameters, M work-groups and O bytes of kernel code.
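
      For what it's worth, on the host side I currently just call finish() on the queue that launched the parent kernel, roughly like this (kernel, queue and buffer names are placeholders); my assumption is that the parent kernel only reports completion once its child kernels have finished:

      [code]
      cl::Kernel waveKernel(program, "waveEquation");
      waveKernel.setArg(0, bufXyz);
      waveKernel.setArg(1, bufXyzn);
      waveKernel.setArg(2, bufXyzo);
      waveKernel.setArg(3, bufArguments);
      queue.enqueueNDRangeKernel(waveKernel, cl::NullRange,
                                 cl::NDRange(224 * 256),   // one parent work-item per sphere point
                                 cl::NDRange(256));        // parent local size
      queue.finish();   // waits for the parent kernel; I assume the children are covered by this too
      [/code]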

       

      More importantly, what happens if I add ray tracing to the program? I can't know in advance how many times a ray will be refracted (spawning a child kernel) or reflected (spawning another child kernel) per work-item. Another example would be a Mandelbrot generator: I can't know beforehand which regions need extra parallelism/detail (without first scanning borders as in the examples, which increases complexity too much). It would be great if the GPU could block enqueue_kernel requests while the number of queued commands equals the queue size. If there were a deadlock, the kernel wouldn't be able to finish, right? It always finishes, except when I add events in the kernel (release_event(evt) and the evented version of enqueue_kernel). But waiting for the child kernel inside the parent work-item would make it slow, and with only 1024 events maximum (from a clinfo query) that would defeat the point of dynamic parallelism, wouldn't it?
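
      One workaround I'm considering (just a sketch of the idea, not what the posted code does) is to check the return value of enqueue_kernel inside waveEquation's if(threadId < arguments[4]) block and, if the enqueue is rejected, do the child's work inline in the parent work-item:

      [code]
      // Sketch only: fall back to inline computation when the device queue rejects the enqueue
      int err = enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                               ndrange_1D(threadId, 1, 1),
                               ^{ test(xyz, xyzn, xyzo, arguments); });
      if(err != CLK_SUCCESS)   // e.g. CLK_DEVICE_QUEUE_FULL when built with -g
      {
          // same wave equation as in test(), computed by the parent work-item itself
          float dx = xyz[threadId*3]   - arguments[2];
          float dy = xyz[threadId*3+1] - arguments[3];
          float wave = 0.02f*arguments[0]*sin(40.0f*arguments[1] + 100.0f*sqrt(dx*dx + dy*dy));
          xyzo[threadId*3]   = xyz[threadId*3]   + xyzn[threadId*3]  *wave;
          xyzo[threadId*3+1] = xyz[threadId*3+1] + xyzn[threadId*3+1]*wave;
          xyzo[threadId*3+2] = xyz[threadId*3+2] + xyzn[threadId*3+2]*wave;
      }
      [/code]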

       

       

      Could there be another issue, something like: the queue becomes full, but to drain it the device needs to launch a kernel, which needs resources, and there are no resources left on the GPU, so the problem is silently swallowed with no error while everything waits in a (live?) lock? How can I find out the exact maximum queue size so I can stay away from such resource-related locks? What happens if I simply set the queue size to 100? Would it fail unpredictably on another type of GPU? All I can query from the GPU are the preferred and maximum queue sizes.

       

      Regards.

        • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
          dipak

          Thanks for reporting this. I'm not sure whether there is any such limitation on the maximum number of device kernels that can be enqueued. I'll check with the related team and get back to you.

           

          Regards,

            • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
              tugrul_512bit

              Thank you very much. If there is any way to query the queue state (or its remaining bytes) inside the kernel, that would help a lot.

               

              Regards.

                • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
                  dipak

                  In the above scenario, where the device queue is full, enqueue_kernel() should return an error. The program behaviour you mentioned above is not expected. It would be helpful for the team if you could share a repro case.

                   

                  Regards,

                    • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
                      tugrul_512bit

                      This is the exact scenario:

                       

                      Local size of parent = 256

                      Global size of parent = 256*224

                       

                      arguments[4] is 224*256, just to make sure nothing overflows.

                       

                      [code]
                      __kernel void test(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
                      {
                        int threadId = get_global_id(0);
                        float dx  = xyz[threadId*3]   - arguments[2];
                        float dy  = xyz[threadId*3+1] - arguments[3];
                        float t   = arguments[1];
                        float ctr = arguments[0];
                        float wave = 0.02f*ctr*sin(40.0f*t + 100.0f*sqrt(dx*dx + dy*dy));
                        xyzo[threadId*3]   = xyz[threadId*3]   + xyzn[threadId*3]  *wave; // wave equation for all surface vertices
                        xyzo[threadId*3+1] = xyz[threadId*3+1] + xyzn[threadId*3+1]*wave;
                        xyzo[threadId*3+2] = xyz[threadId*3+2] + xyzn[threadId*3+2]*wave;
                      }

                      __kernel void waveEquation(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
                      {
                        int threadId = get_global_id(0);
                        if(threadId < arguments[4])
                        {
                          queue_t q = get_default_queue();
                          ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
                          void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
                          enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A);
                        }
                      }
                      [/code]

                       

                      but I tested against errors (with the -g build option, of course):

                       

                      [code]
                      queue_t q = get_default_queue();
                      ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
                      void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
                      int ctr = 0;
                      while((enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A) &
                             ( CLK_DEVICE_QUEUE_FULL |
                               CLK_EVENT_ALLOCATION_FAILURE |
                               CLK_OUT_OF_RESOURCES |
                               CLK_INVALID_NDRANGE |
                               CLK_INVALID_QUEUE |
                               CLK_INVALID_EVENT_WAIT_LIST |
                               CLK_INVALID_ARG_SIZE )) > 0)
                      {
                      }
                      [/code]

                       

                      and I also tested for the return value simply not being CL_SUCCESS, but it doesn't return an error and it doesn't get stuck in an infinite loop either. It silently swallows the enqueued kernels, or the enqueued kernels overflow something that I don't understand. If it returned an error, it should get stuck in the while loop forever, but it doesn't.

                       

                      I also tried calling get_default_queue() inside the loop, in case that was the part failing, but it doesn't return an error either.

                       

                      If it helps, I'm constructing the default command queue as:

                       

                      [code]
                      cl_uint qs = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();
                      cl_queue_properties qprop[] = {
                        CL_QUEUE_SIZE, qs,
                        CL_QUEUE_PROPERTIES, (cl_command_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                                                                           CL_QUEUE_ON_DEVICE |
                                                                           CL_QUEUE_ON_DEVICE_DEFAULT |
                                                                           CL_QUEUE_PROFILING_ENABLE),
                        0 };
                      device_queue = clCreateCommandQueueWithProperties(context.get(), device.get(), qprop, &err);
                      [/code]

                       

                      Lastly, it runs perfectly whenever the number of child kernels is less than the queue size; that's why I choose a value between the preferred and max sizes. Whether the child kernels have a different local range than the parent's local range doesn't matter: it runs erroneously whenever the queue size is not big enough. (I guess the queue is not dynamic, unlike a producer-consumer setup. Or should I drop the while loop, cancel that work-item altogether, write "failed" into some shared array, and check that array from another work-item to re-try?)
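
                      For debugging, one thing I could do (sketch only; statusBuf would be an extra __global int buffer added to the kernel arguments) is record each parent work-item's enqueue_kernel return code so the host can see which enqueues actually fail:

                      [code]
                      // Sketch: record the return code per parent work-item so the host can inspect it afterwards
                      __kernel void waveEquation(__global float *xyz, __global float *xyzn, __global float *xyzo,
                                                 __global float *arguments, __global int *statusBuf)
                      {
                        int threadId = get_global_id(0);
                        if(threadId < arguments[4])
                        {
                          ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
                          statusBuf[threadId] = enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                                                               ndrange, ^{ test(xyz, xyzn, xyzo, arguments); });
                          // host side: read statusBuf after the parent finishes and count entries != CLK_SUCCESS (0)
                        }
                      }
                      [/code]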

                       

                      Regards,