Using CAL instead of Brook+ speeds this up quite a lot... Calling the same kernel with CAL takes about 30 us.
However, I still think this is very slow... Is there a way to speed this up? I know CAL is built on top of CTM, so perhaps we could save some latency by going to a lower level, but CTM is nowhere to be found on AMD's website. Does anyone know if it's still possible to use CTM?
There are a few issues with your code.
1) The main issue is that you are using a 1D stream on a device that is optimized for 2D memory accesses, which brings up a few minor issues.
2) The size of your stream is so small that you are actually testing the runtime more than the kernel performance. The runtime has to do things like set up state on the graphics card and copy the kernel over, which isn't free.
3) The architecture is best utilized when using a 2D stream whose width and height are both multiples of 2. This is because of how the streaming cores are set up and how they process pixels in quads. If you only use a 1D stream that is not address translated, then you are basically throwing away half of your streaming cores, as they are not active when you run the kernel.
4) If a 1D stream IS address translated, then there is ALU overhead involved in translating between 1D and 2D coordinates in a generic way, which is not the most efficient.
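To make point 4 concrete, here is a rough CPU-side sketch of the kind of index arithmetic that address translation implies. This is not the code Brook+ or CAL actually emits; `Coord2D`, `to2D` and `to1D` are hypothetical names used only for illustration. The point is that every element access pays for an extra divide and modulo (or multiply and add).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type, not part of Brook+ or CAL.
struct Coord2D {
    std::size_t x;  // column within the 2D stream
    std::size_t y;  // row within the 2D stream
};

// Map a linear (1D) element index onto a 2D stream of the given width.
// Costs one modulo and one divide per access.
inline Coord2D to2D(std::size_t index, std::size_t width) {
    assert(width > 0);
    return Coord2D{ index % width, index / width };
}

// Inverse mapping: 2D coordinate back to the linear index.
// Costs one multiply and one add per access.
inline std::size_t to1D(Coord2D c, std::size_t width) {
    assert(width > 0 && c.x < width);
    return c.y * width + c.x;
}
```

With a width of 4, for example, linear element 10 lands at column 2, row 2. Declaring the stream as 2D in the first place avoids doing this work in the kernel.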
If you run the throughput example from the CAL SDK with a width of 10, a height of 1 and 1000000 iterations, you can see that you get a measly 0.02 GB/s, which is nowhere close to the peak performance of the card.
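A simple way to see why tiny streams measure the runtime rather than the card is to model each call as a fixed overhead plus the time to move the data. The sketch below assumes that model; `effectiveBytesPerSecond` is a hypothetical helper, and the overhead and peak bandwidth you plug in are your own measurements or guesses, not figures from the SDK.

```cpp
#include <cassert>

// Effective throughput (bytes/second) when every kernel call pays a fixed
// overhead on top of the time needed to move the data at peak bandwidth.
// Assumed model: time per call = overhead + bytes / peak.
inline double effectiveBytesPerSecond(double bytesPerCall,
                                      double overheadSeconds,
                                      double peakBytesPerSecond) {
    assert(bytesPerCall >= 0.0);
    assert(overheadSeconds >= 0.0);
    assert(peakBytesPerSecond > 0.0);
    const double transferSeconds = bytesPerCall / peakBytesPerSecond;
    const double totalSeconds = overheadSeconds + transferSeconds;
    if (totalSeconds == 0.0) {
        return 0.0;  // nothing moved and no time spent
    }
    return bytesPerCall / totalSeconds;
}
```

Under this model, 10 floats (40 bytes) per call with a 30 us overhead comes out at roughly 1.3 MB/s no matter how fast the card is, while a stream of tens of megabytes per call gets close to whatever peak you assumed. The exact numbers depend on what the benchmark counts as transferred data, so treat this as the shape of the effect only.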
Hope this information helps.
Well, using 2D streams increases the speed by a factor of two. It now takes 32 us for every kernel call. But this is still way too slow for the purpose I had in mind.
Isn't there any way to make kernel calls faster? Does the kernel have to be copied over in every loop cycle?
Try looking at the C++ code that Brook+ generates.
If I understand this correctly, then on every iteration (100000 times) the CPU calls the copy() function, which does a lot of work before it even begins copying just 10 values.
To test performance, you should probably modify the code to something like this:
for (t=0; t<16; ++t)
What exactly do you want to do? Maybe it is easier and faster to do this on the CPU and let the GPU do other work.
Parallelism is not the only requirement for good GPGPU. Arithmetic intensity should also be high. This application is probably better suited for the CPU.
arithmetic intensity = operations / words transferred.
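As a quick illustration of the formula above, here is a minimal sketch. `arithmeticIntensity` is a hypothetical helper, and the operation and word counts are ones you would tally by hand for your own kernel.

```cpp
#include <cassert>

// arithmetic intensity = operations / words transferred
// Both counts are per kernel invocation (or per element, as long as
// the same basis is used for both).
inline double arithmeticIntensity(double operations, double wordsTransferred) {
    assert(wordsTransferred > 0.0);
    return operations / wordsTransferred;
}
```

A kernel that reads two words, adds them and writes one word back does 1 operation for 3 words transferred, an intensity of about 0.33. A kernel that does 300 operations on the same 3 words has an intensity of 100 and is a much better fit for the GPU.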