Hi guys,
Am I doing something wrong here? Device-side enqueue is 5 times slower than host-side enqueue in the simple example below:
Kernels:
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void sum(int elem, __global int *a, __global int *b, __global int *c)
{
    int local_id = get_local_id(0);
    int local_size = get_local_size(0);
    c[elem*local_size + local_id] = a[elem*local_size + local_id] + b[elem*local_size + local_id];
}
__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void sum2(int total, __global int *a, __global int *b, __global int *c)
{
    for (int elem = 0; elem < total; ++elem)
    {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(256, 256),
                       ^{ sum(elem, a, b, c); });
    }
}
Host-side enqueue:
const size_t iterations = 2000;
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
for (size_t iter = 0; iter < iterations; ++iter)
{
    kernel.setArg(0, static_cast<int>(iter));
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));
}
queue.finish();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "kernel took " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms.\n";
std::cout << std::endl;
Device-side enqueue:
const size_t iterations = 2000;
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
kernel.setArg(0, static_cast<int>(iterations));
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));
queue.finish();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "kernel took " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms.\n";
std::cout << std::endl;
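As an aside, both timings above include queue submission overhead because they use host-side wall-clock time. If someone wants to isolate pure device execution time, OpenCL event profiling is one option. The sketch below is illustrative only and assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that the same C++ wrapper as in the snippets above is in use:

```cpp
// Sketch: timing a single kernel launch on the device via event profiling.
// Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE.
cl::Event ev;
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(256), cl::NDRange(256),
                           nullptr, &ev);
ev.wait();
cl_ulong t_start = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong t_end   = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
// Profiling timestamps are in nanoseconds.
std::cout << "device time: " << (t_end - t_start) / 1e6 << " ms\n";
```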
My setup:
Ubuntu 14.04, R9 295, fglrx 15.20.3, AMD APP SDK 3.0
Thanks for reporting this.
For the device-side enqueue case, could you please try the following?
1) Host side: launch the kernel with N work-items (instead of only one, as in your case).
2) Device side: have each work-item enqueue k device-side kernels with ND-range size n.
(N, k and n can be any numbers; you may want to experiment with those values.)
Please check and share your observations. If you still see the same behaviour, please provide the complete project.
Regards,
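The suggested restructuring might look like the sketch below. The kernel name `sum2_parallel`, the parameter `k`, and the indexing scheme are illustrative assumptions, not code from this thread; the idea is a parent kernel launched from the host with N work-items (e.g. global size N, local size 1, with N*k equal to the original `total`), where each work-item enqueues k child grids instead of one work-item enqueuing all of them:

```c
// Sketch only: parent kernel launched with N work-items. Each work-item
// enqueues k child kernels of 256 work-items each, spreading the enqueue
// overhead across the parent ND-range instead of serializing it in a
// single work-item, as in the original sum2.
__kernel void sum2_parallel(int k, __global int *a, __global int *b, __global int *c)
{
    int gid = get_global_id(0);
    for (int i = 0; i < k; ++i)
    {
        int elem = gid * k + i;   // each work-item owns its own slice of elements
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(256, 256),
                       ^{ sum(elem, a, b, c); });
    }
}
```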
Hi dipak,
I followed your suggestion and it made the device-side enqueue example run 2.5x faster.
Thanks!
As a side note: performance features often don't show their benefit in simple examples like this one. I'd still like to see the performance numbers (i.e. the millisecond timings).