Hi,
I'm using an R9 290 under Ubuntu 12.04 with OpenCL (Catalyst 13.12).
This is what I find strange:
I wrote a program that does a two-stage reduction (as described here http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-c...)
I run it on a vector of 100000000 elements and loop the kernel enqueue 10000 times. I'm using a local work-group size of 64.
While the program is running I typically get:
~$ aticonfig --adapter=0 --od-getclocks
Adapter 0 - AMD Radeon R9 290 Series
Core (MHz) Memory (MHz)
Current Clocks : 600 400
Current Peak : 600 400
Configurable Peak Range : [300-800] [400-700]
GPU load : 0%
~$ aticonfig --adapter=0 --od-gettemperature
Adapter 0 - AMD Radeon R9 290 Series
Sensor 0: Temperature - 53.50 C
The question is: why doesn't the clock go up? The temperature doesn't rise either, and the GPU load is 0%...
The resulting performance is 8.3e+09 red/s, which according to the article is not very exciting.
Thanks, any help is welcome.
I'd like to add some information: the temperature and GPU load issue was a false problem.
After setting DISPLAY=:0.0 I can now read the correct temperature and load.
The question about the performance for the two stage reduction is still open.
@andreac Can you verify that OpenCL lists it as an active device with `clinfo`? I have a Sapphire r290x that works with aticonfig but is not recognized by OpenCL
My card is listed in clinfo and works with OpenCL.
I followed the info found here: http://devgurus.amd.com/thread/167775
Hi,
What happens when you keep the vector size at 10000000? Do you get performance similar to that mentioned in the article?
What if you change the work-group size from 64 to 256? You may need to take care of any local memory in this case (e.g. a local scratch buffer passed via clSetKernelArg must be resized to match the new group size).
Regards,
Ravi
Changing the work-group size from 64 to 256, the results become wrong as you predicted, and the performance in terms of red/s is exactly the same.
With WS 64 and a vector length of 10000000 I measure 8.32e+08 red/s (10 times slower).
Thanks for the tests andreac. This is strange. Is it possible for you to send me your code?
Yes, of course, I'll send you the code in the next few days.
Thanks
Andrea
Hi Andrea,
It seems that the way you are measuring the time taken by the kernel knl_reduce (using get_timestamp) is not very accurate. It includes all the time taken by kernel queuing, launching, and execution. In your case it also counts host-code time, for example the "if" condition below and the printf.
A better method is to use the native OpenCL profiling APIs to measure kernel execution time on the device. I am posting the relevant code.
cl_event timing_event;
cl_ulong start_time;
cl_ulong end_time;
unsigned long elapsed1 = 0;
unsigned long total_elapsed1 = 0;

for (int i = 0; i < ntrips; i++) {
    CALL_CL_GUARDED(clEnqueueNDRangeKernel,
        (queue, knl_Reduce,
         /*dimensions*/ 1, NULL, gdim, ldim,
         0, NULL, &timing_event));
    CALL_CL_GUARDED(clFinish, (queue));

    /* Device-side timestamps for the command, in nanoseconds. */
    CALL_CL_GUARDED(clGetEventProfilingInfo,
        (timing_event,
         CL_PROFILING_COMMAND_START,
         sizeof(cl_ulong),
         &start_time, NULL));
    CALL_CL_GUARDED(clGetEventProfilingInfo,
        (timing_event,
         CL_PROFILING_COMMAND_END,
         sizeof(cl_ulong),
         &end_time, NULL));

    elapsed1 = (unsigned long) (end_time - start_time);
    total_elapsed1 = total_elapsed1 + elapsed1;
}
Also, you will need to create the command queue with profiling enabled, i.e. pass CL_QUEUE_PROFILING_ENABLE in the properties argument of clCreateCommandQueue.
Using this and running your code on a Radeon HD 6570 with an array size of 100000 and 100 iterations, the results we are getting are:
1. Using your method: 1.36e+08 red/s
2. With native OpenCL profiling enabled: 4.61e+08 red/s
Please let us know what you get using OpenCL profiling. We are meanwhile looking at your code for further optimization opportunities.
Thanks,
AMD Support