What is the canonical way to get multi-GPU processing of batched FFTs working? I tried clFFT 2.4 from GitHub and passed it multiple queues and buffers, but I got error code -4097 (CL_FFT_FEATURE_NOT_IMPLEMENTED, or something similar).
What was the implementors' intention? Why does the API expose a feature that is not yet implemented, and what is the correct way to do this? It is not clear to me how I should set up the plan. What should the batch size be: the total number of 2D images, which I have manually split across multiple buffers, or that total divided by the number of queues?
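To make the two options concrete, here is a minimal sketch of the split I have in mind. The helper `split_batch` and its parameter names are hypothetical and not part of clFFT; it only shows the arithmetic of dividing the images across queues. The open question is whether the plan's batch size should be the total or one of these per-queue shares.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of clFFT: split `total` 2D images across
   `num_queues` command queues as evenly as possible. The first
   (total % num_queues) queues each receive one extra image. */
static void split_batch(size_t total, size_t num_queues, size_t *per_queue)
{
    assert(num_queues > 0);
    size_t base = total / num_queues;   /* images every queue gets */
    size_t extra = total % num_queues;  /* leftover images to hand out */
    for (size_t i = 0; i < num_queues; ++i)
        per_queue[i] = base + (i < extra ? 1 : 0);
}
```

For example, with 10 images and 3 queues this yields shares of 4, 3 and 3, so "total" would mean a batch size of 10 and "total divided by the number of queues" would mean 3 or 4 per buffer.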
Any help is appreciated.