I couldn't find any information about the level of task parallelism in AMD's OpenCL implementation.
OpenCL provides two ways to achieve task level parallelism:
- multiple in-order queues,
- one out-of-order queue.
I am curious which of these are fully implemented. I've read somewhere that the APP SDK ignores events and executes kernel invocations in order. I've also read that the APP SDK supports multiple queues, and that with multiple queues one can execute different kernels concurrently. But maybe when I create multiple queues, each one is bound to a different set of compute units?
I'm designing a sorting algorithm (BWT transform) plus compression, and I plan to have many kernels: one heavy on LDS (initial sorting of small blocks), some heavy on global memory bandwidth (global sorting), and some heavy on ALU and registers (compression). I would like to keep my GPU fully utilized. BWT is a block transformation, so one block could be sorted while another block is compressed in parallel.
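To make the overlap I have in mind concrete, here is a minimal sketch in plain C (no OpenCL calls) of the schedule I'd like the GPU to follow. It assumes a simplified two-stage pipeline (stage 0 = sort, stage 1 = compress); the function names `block_for_stage`, `pipeline_steps` and `print_schedule` are only my own illustration, and whether the runtime can actually run the two stages concurrently is exactly what I'm asking.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical pipeline model: at time step `step`, stage `stage`
 * works on block (step - stage). Returns that block index, or -1
 * if the stage is idle at this step. */
int block_for_stage(int step, int stage, int num_blocks)
{
    int block = step - stage;
    return (block >= 0 && block < num_blocks) ? block : -1;
}

/* Number of time steps needed to push num_blocks blocks through
 * num_stages stages when the stages overlap. */
int pipeline_steps(int num_blocks, int num_stages)
{
    assert(num_stages > 0);
    return num_blocks > 0 ? num_blocks + num_stages - 1 : 0;
}

/* Print which block each stage handles at every step
 * (stage 0 = sort, stage 1 = compress). */
void print_schedule(int num_blocks)
{
    int steps = pipeline_steps(num_blocks, 2);
    for (int t = 0; t < steps; ++t) {
        printf("step %d: sort=%d compress=%d\n", t,
               block_for_stage(t, 0, num_blocks),
               block_for_stage(t, 1, num_blocks));
    }
}
```

Calling `print_schedule(3)` lists four steps; in the middle steps both stages are busy, which is the situation where I'd want the LDS-heavy and the ALU-heavy kernels to share the GPU.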
Other questions:
Why does OpenCL have halved LDS bandwidth? I've read that in AMD's documentation.
Does the APP SDK compiler accept any parameters to choose the level of optimization, similar to -O2 or -O3 in GCC?