Greetings,
I'm writing variants of kernels in OpenCL which perform simple linear algebra operations using complex arithmetic. These are being converted from CUDA kernels, and I'm basically taking a two-step approach here:
1. Convert the CUDA to OpenCL without any optimizations
2. Vectorize the computation using float4s to better match the Cypress architecture
My question relates to step 2 here. Can the 4-way vector unit on Cypress handle complex multiplication natively, e.g., multiply two pairs of complex numbers together in the same way that SSE and the double-hummer FPU on BG/P can, or are extra operations required (inserting the sign flips as appropriate)?
Thanks.
Thanks for the update Micah. Some more questions:
Since Cypress doesn't support complex multiplication natively, do you have any suggestions on how best to implement complex arithmetic? My first thought was to write all operations in terms of 4 complex numbers, i.e., assign a float4 for the real parts and a float4 for the imaginary parts; this would enable maximum throughput. Unfortunately, this doesn't always map well to my problem. The alternatives would seem to be to write everything as floats and let the compiler do its best, or to use a float4 for two complex numbers and perform the requisite twiddling on the components.
Is there a way for the AMD CPU OpenCL implementation to use the complex SSE instructions? Is the compiler able to detect such sequences and issue the correct SSE instruction, or will this require true complex data type support in OpenCL 1.x?
I'm sure you know the complex types are reserved in the current spec. Does anyone know if their implementation is on the OpenCL roadmap?
I realise you can't answer this, but any chance of native complex support in Southern Islands? 🙂
Thanks for your reply. Another question.
Does Cypress support fused-multiply-subtract on the four-wide vector unit? This would allow me to write vectorized complex arithmetic as two float4s (for real and imaginary) without penalty.
I note that fused-multiply-subtract is not supported on Fermi; in fact, the lack of an fms instruction is what prevents my kernel (a complex-valued outer-product sum) from exceeding 1 Tflop/s (it falls short at 950 Gflop/s). Support for fms on Cypress would be a big plus in my book.
FMS is just FMA with a negated addend: result = a*b - c = a*b + (-c).
The HW has some input modifiers that can be used to optimize certain common operations. You can read more in the ISA spec.
-Jeff