Thanks bsp2020.
By reorganizing the algorithms (some computations are faster on the CPU), I was able to get a 25% improvement over the Catalyst/OpenCL 1.2 code. Not bad at all!
This is a bigger test than the vector_copy example, since there are about 2000 lines of OpenCL code in those algorithms.
I found some problems:
- 2D ranges do not seem to work: get_global_id(1) was always returning 0. It wasn't a big issue to convert the kernels to 1D, since the maximum size of a single dimension is very large.
- hsa_signal_wait_acquire does not always work. I used HSA_ENABLE_INTERRUPT=no to prevent a freeze, but the program would frequently sit waiting, doing nothing. Sometimes I was lucky and it ran for 15 minutes; other times it stalled within a few seconds. So I went with this workaround:
while ((ret = hsa_signal_load_acquire(signal)) == 1) { nanosleep(&t, NULL); }
with a 100 µs delay.
That workaround always works. I could keep the program running for hours without a glitch.
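For anyone who wants to try the same idea, here is a minimal, self-contained sketch of the polling loop. It does not use the HSA runtime: a plain C11 atomic (the hypothetical fake_signal_t) stands in for the HSA signal, and atomic_load_explicit with memory_order_acquire stands in for hsa_signal_load_acquire. The max_polls limit is my own addition for the sketch, so a signal that never changes cannot hang the caller; the one-liner above has no such limit.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() under strict -std=c11 */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Assumption: stand-in for an HSA signal, NOT the real hsa_signal_t. */
typedef _Atomic(int64_t) fake_signal_t;

/* Poll the signal while it still reads 1, sleeping delay_ns between polls.
 * Gives up after max_polls sleeps. Returns the last value observed, so a
 * return value of 1 means the wait timed out. */
int64_t wait_by_polling(fake_signal_t *signal, long delay_ns, long max_polls)
{
    assert(signal != NULL);
    struct timespec t = { .tv_sec = 0, .tv_nsec = delay_ns };
    int64_t value;
    long polls = 0;

    /* Parentheses matter: assign the loaded value first, then compare. */
    while ((value = atomic_load_explicit(signal, memory_order_acquire)) == 1) {
        if (polls++ >= max_polls) {
            break;              /* hypothetical timeout, not in the original */
        }
        nanosleep(&t, NULL);    /* 100000 ns gives the 100 µs delay */
    }
    return value;
}
```

With the real runtime, the load inside the loop would be the hsa_signal_load_acquire call, and delay_ns would be 100000 for the 100 µs delay mentioned above.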
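On the first problem (2D ranges), the conversion to 1D comes down to recovering the two coordinates from a single flat id. A rough sketch of that arithmetic in plain C, assuming a row-major layout; unflatten_index, index2d and width are names made up for this example, and inside a kernel gid would come from get_global_id(0):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the 2D coordinates a 2D range would have given. */
typedef struct {
    size_t x;   /* what get_global_id(0) would have returned */
    size_t y;   /* what get_global_id(1) would have returned */
} index2d;

/* Split a flat 1D work-item id into (x, y), assuming rows of `width` items. */
index2d unflatten_index(size_t gid, size_t width)
{
    assert(width > 0);
    index2d idx = { .x = gid % width, .y = gid / width };
    return idx;
}
```

The 1D global size then has to be width times height, so that every (x, y) pair is covered exactly once.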
What I found very nice in HSA is that if you make a mistake in your kernels and overflow a buffer, it is trapped with signal 11 (SIGSEGV) instead of corrupting the graphics memory. This saves hours of debugging when you can't find the bug in CPU emulation.
Now just waiting for the libraries to be released for the stock kernel, and then let's go crunching...