Archives Discussions

doqtor · ‎09-01-2015

Hi guys,

Am I doing something wrong here? Device side enqueue is 5 times slower than host side enqueue in the below simple example:

Kernels:

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))

void sum(int elem, __global int *a, __global int *b, __global int *c)

{

int local_id = get_local_id(0);

int local_size = get_local_size(0);

c[elem*local_size + local_id] = a[elem*local_size + local_id] + b[elem*local_size + local_id];

}

__kernel __attribute__((reqd_work_group_size(1, 1, 1)))

void sum2(int total, __global int *a, __global int *b, __global int *c)

{

for (int elem = 0; elem < total; ++elem)

{

enqueue_kernel(get_default_queue(),

CLK_ENQUEUE_FLAGS_NO_WAIT,

ndrange_1D(256, 256),

^{ sum(elem, a, b, c);

});

}

Host side enqueue:

const size_t iterations = 2000;

std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

for (size_t iter = 0; iter < iterations; ++iter)

{

kernel.setArg(0, static_cast<int>(iter));

queue.enqueueNDRangeKernel(*kernel, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));

}

queue.finish();

std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

std::cout << "kernel took " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms.\n";

std::cout << std::endl;

Device side enqueue:

const size_t iterations = 2000;

std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

kernel.setArg(0, static_cast<int>(iterations));

queue.enqueueNDRangeKernel(*kernel, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));

queue.finish();

std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

std::cout << "kernel took " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms.\n";

std::cout << std::endl;

My set up:

Ubuntu 14.04, R9 295, fglrx 15.20.3, AMD APP SDK 3.0

dipak · ‎09-03-2015

Thanks for reporting this.

In case of device side enqueue, could you please try something as below?

1) Host-side: Launch kernel with N-number of work-items [instead of only one in your case]

2) Device-side: Each work-item enqueues k-number of device-side kernels with nd-range size 'n'

[N, k and n may be any number and you may do some experiments with those values]

Please check and share your observation. If you still observe the same, please provide the complete project.

Regards,

View solution in original post

dipak · ‎09-03-2015

Thanks for reporting this.

In case of device side enqueue, could you please try something as below?

1) Host-side: Launch kernel with N-number of work-items [instead of only one in your case]

2) Device-side: Each work-item enqueues k-number of device-side kernels with nd-range size 'n'

[N, k and n may be any number and you may do some experiments with those values]

Please check and share your observation. If you still observe the same, please provide the complete project.

Regards,

doqtor · ‎10-12-2015

Hi dipak,

I followed your suggestion and it made the device side enqueue example run 2.5x faster.

Thanks!

maxdz8 · ‎09-05-2015

As a side note: performance constructs often don't scale to simple examples. I'd like to see the performance results anyway (i.e: the milliseconds out).

Archives Discussions

Poor kernel device enqueue performance