Thanks for reporting this.
For device-side enqueue, could you please try the following?
1) Host-side: Launch the kernel with N work-items [instead of only one, as in your case]
2) Device-side: Each work-item enqueues k device-side kernels, each with an ND-range of size n
[N, k and n can be any values; please experiment with different combinations]
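In case it helps, here is a rough sketch of what I mean. It is only an illustration, not your code: the names parent, child_body, out, parent_kernel_src and total_child_work_items are made up, and the child body is a placeholder since I haven't seen your kernel. The kernel is kept as a source string, the way host code usually hands it to clCreateProgramWithSource, and the helper just computes how many child work-items (and so how many output slots) a given N, k, n produces.

```c
#include <stddef.h>

/* OpenCL C 2.0 source for the experiment (hypothetical names throughout).
   Each parent work-item enqueues k child launches of n work-items each. */
const char *const parent_kernel_src =
    "void child_body(global int *out, int base) {\n"
    "    /* Placeholder work: each child work-item marks its own slot. */\n"
    "    out[base + (int)get_global_id(0)] = 1;\n"
    "}\n"
    "\n"
    "kernel void parent(global int *out, int k, int n) {\n"
    "    queue_t q = get_default_queue();\n"
    "    ndrange_t range = ndrange_1D((size_t)n);\n"
    "    int gid = (int)get_global_id(0);\n"
    "    for (int i = 0; i < k; ++i) {\n"
    "        /* Disjoint output region for this (parent, launch) pair. */\n"
    "        int base = (gid * k + i) * n;\n"
    "        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, range,\n"
    "                       ^{ child_body(out, base); });\n"
    "    }\n"
    "}\n";

/* Total child work-items for N parents, k enqueues each, ND-range size n.
   Also the number of int slots the 'out' buffer needs in this sketch. */
size_t total_child_work_items(size_t N, size_t k, size_t n)
{
    return N * k * n;
}
```

This assumes the program is built with -cl-std=CL2.0 and that the host has created a default on-device queue (clCreateCommandQueueWithProperties with CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE); otherwise get_default_queue() has nothing to return. The return value of enqueue_kernel is ignored here for brevity, but it is worth checking against CLK_SUCCESS while experimenting.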
Please check and share your observations. If you still see the same behavior, please provide the complete project.
As a side note: performance-oriented constructs often don't show their benefit in simple examples. I'd still like to see the performance results (i.e., the milliseconds output).