Sure, those 78K kernel launches are the first and biggest bottleneck.
That problem must be solved first.
After that you can choose proper workitem counts:
WorkItems/WorkGroup -> must be a multiple of 64: 64, 128, 192 or 256.
- A workgroup size of 1 takes the same time as a workgroup size of 64 (if you are at peak ALU utilization).
- Unless you have special needs I strongly suggest 64.
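To make the workgroup-size point concrete, here is a minimal sketch of the rounding involved. It assumes the hardware schedules work in 64-wide wavefronts (the usual width on this GPU generation); the function names are made up for illustration.

```python
import math

WAVEFRONT_SIZE = 64  # assumption: the hardware executes 64 workitems in lockstep


def wavefronts_per_group(work_group_size):
    """Number of wavefronts a workgroup occupies (always rounded up)."""
    return math.ceil(work_group_size / WAVEFRONT_SIZE)


def lane_utilization(work_group_size):
    """Fraction of the occupied lanes that do useful work."""
    lanes = wavefronts_per_group(work_group_size) * WAVEFRONT_SIZE
    return work_group_size / lanes


for size in (1, 64, 65, 128):
    print(size, wavefronts_per_group(size), round(lane_utilization(size), 3))
```

A workgroup of 1 and a workgroup of 64 both occupy one wavefront, which is why they cost the same; a workgroup of 65 already needs two.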
R9 270X has 1280 stream units. So the minimum number of workitems needed to reach maximum ALU performance is 1280*4 = 5120.
Scalability works in a similar way here: 5121 workitems will take twice as long as 5120 workitems (if you are at peak ALU utilization).
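The same step behaviour at the whole-device level can be written as a tiny model. This is only a sketch of the claim above, assuming a purely ALU-bound kernel and the 1280*4 figure from this post; real timings will be smoothed by memory access and scheduling.

```python
import math

STREAM_UNITS = 1280        # R9 270X, from the post
ITEMS_PER_UNIT = 4         # factor used in the post
FULL_LOAD = STREAM_UNITS * ITEMS_PER_UNIT  # 5120 workitems


def relative_time(work_items):
    """Run time in units of 'one full pass over the device' (ALU-bound model)."""
    return math.ceil(work_items / FULL_LOAD)


for n in (1, 5120, 5121, 10240):
    print(n, relative_time(n))
```

In this model anything from 1 to 5120 workitems costs one pass, and item 5121 starts a second pass.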
But first those 78K kernel launches must be dealt with: they take 20 secs out of 23 secs. So I'd try to collect the input data from those test files you have and process it in a batch with one or a few long kernel launches. That way it can probably get down to 3 sec (unless some other bottleneck arises).
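As a rough back-of-the-envelope check of those numbers, the sketch below assumes "78K" means 78,000 launches and that the launch overhead is the same for every call; `estimated_total` is a hypothetical helper, not a measurement.

```python
TOTAL_SECONDS = 23.0     # measured total, from the thread
LAUNCH_SECONDS = 20.0    # part attributed to kernel launches
LAUNCHES = 78_000        # assumption: "78K" taken as 78,000

per_launch_us = LAUNCH_SECONDS / LAUNCHES * 1e6
compute_seconds = TOTAL_SECONDS - LAUNCH_SECONDS


def estimated_total(launches):
    """Estimated run time if the same work is done in `launches` kernel calls."""
    return compute_seconds + launches * per_launch_us / 1e6


print(f"overhead per launch: {per_launch_us:.0f} us")
print(f"with 10 launches:    {estimated_total(10):.3f} s")
```

That works out to roughly a quarter of a millisecond per launch, so with a handful of launches the overhead all but disappears and only the 3 secs of actual work remain.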
Got the new video card working finally. Results are not materially different.
Agreed the 78k kernel call overhead is the biggest problem. Unfortunately I cannot batch process them without porting 1000s of lines of code into the kernel.
The result of one call changes the data (my GPUCALCX structure) for the next call.
I don't know whether you are aware, but this is from the HSA development beta:
"Platform Atomics provides memory consistency for loads and stores in the host program and the compute kernel. The host and device can atomically operate on the same memory locations, and also have control over memory visibility. Platform atomics provide new instruction-level synchronization between the CPU and the HSA CUs– in addition to the coarse-grained “command-level” synchronization that OpenCL™ has traditionally provided.
For example, platform atomics enable the device and host to participate in “producer-consumer” style algorithms, without requiring the launch of a new kernel. The kernel running on the device can produce a piece of data, use the platform atomic operations to set a flag or enqueue the data to a platform scope, and the host can see the produced data – all while the compute kernel continues to run. Likewise, the CPU can act as the producer, sending work to a running kernel which is waiting for that data to arrive. Essentially platform atomics allow communication between host and device without having to relaunch new kernels from the host – this can be significantly faster, and also can result in a more natural coding style. Platform Atomics also helps in writing lock/wait-free data structures that can scale across the platform."
This sounds ideal for this problem: essentially, enqueue the kernel only once and use atomics to synchronise, both for supplying new GPUCALCX data to the kernel and for collecting the results.
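The handshake I have in mind looks roughly like the sketch below. It is only a CPU-side analogy: a Python thread stands in for the kernel that is launched once, and a condition variable stands in for the platform atomics on shared memory. The `Mailbox` class, the state names and the placeholder computation are all invented for illustration; nothing here is OpenCL or HSA API.

```python
import threading

IDLE, WORK_READY, RESULT_READY, STOP = range(4)


class Mailbox:
    """Shared slot playing the role of the GPUCALCX data plus a status flag."""

    def __init__(self):
        self.state = IDLE
        self.value = 0
        self.cond = threading.Condition()


def persistent_worker(box):
    """Stands in for the kernel: started once, loops until told to stop."""
    while True:
        with box.cond:
            box.cond.wait_for(lambda: box.state in (WORK_READY, STOP))
            if box.state == STOP:
                return
            box.value = box.value * 2 + 1  # placeholder for the real computation
            box.state = RESULT_READY
            box.cond.notify_all()


def run_host(steps):
    """Host loop: each step depends on the result of the previous one."""
    box = Mailbox()
    worker = threading.Thread(target=persistent_worker, args=(box,))
    worker.start()  # the single "kernel launch"
    value = 0
    for _ in range(steps):
        with box.cond:
            box.value = value            # hand the next input to the worker
            box.state = WORK_READY
            box.cond.notify_all()
            box.cond.wait_for(lambda: box.state == RESULT_READY)
            value = box.value            # collect the result
            box.state = IDLE
    with box.cond:
        box.state = STOP
        box.cond.notify_all()
    worker.join()
    return value


print(run_host(3))
```

The point is that the dependency between calls stays on the host side, as in my current code, while the per-call launch cost is replaced by one flag exchange per step.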
Now that I understand I am not making stupid errors with the kernel (other than it being too small), I can progress to trying out the new Kaveri features and find out whether they can give me the required performance. This will take a while though. If it's of interest to anyone I will report back when I have some Kaveri results.
Thanks to all who have offered suggestions. I am a little wiser.