I'm writing OpenCL variants of kernels that perform simple linear algebra operations using complex arithmetic. These are being converted from CUDA kernels, and I'm taking a two-step approach:
1. Convert the CUDA to OpenCL without any optimizations
2. Vectorize the computation using float4s to better match the Cypress architecture
My question relates to step 2. Can the 4-way vector unit on Cypress handle complex multiplication natively, i.e., multiply two pairs of complex numbers together the way SSE and the Double Hummer FPU on BG/P can, or are extra operations required (inserting the minus 1 signs as appropriate)?
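
To make the "extra operations" concrete, here is a minimal plain C sketch of what I would expect to write if there is no native support. It assumes each float4 packs two complex numbers as (re0, im0, re1, im1); the `float4_t` struct and the helper names are hypothetical stand-ins for OpenCL's `float4` and its swizzles (`a.xxzz`, `b.yxwz`, and so on), not real API.

```c
/* Hypothetical stand-in for OpenCL's float4: two complex numbers
   packed as (re0, im0, re1, im1). */
typedef struct { float x, y, z, w; } float4_t;

/* Component-wise multiply, like a * b on float4 in OpenCL C. */
static float4_t mul4(float4_t a, float4_t b)
{
    float4_t r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    return r;
}

/* Component-wise add, like a + b on float4 in OpenCL C. */
static float4_t add4(float4_t a, float4_t b)
{
    float4_t r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Multiply two pairs of complex numbers:
   (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br).
   In OpenCL C this would read roughly
   a.xxzz * b.xyzw + sign * (a.yyww * b.yxwz). */
static float4_t cmul2(float4_t a, float4_t b)
{
    const float4_t sign = { -1.0f, 1.0f, -1.0f, 1.0f }; /* the "minus 1 signs" */
    float4_t a_re  = { a.x, a.x, a.z, a.z };            /* a.xxzz */
    float4_t a_im  = { a.y, a.y, a.w, a.w };            /* a.yyww */
    float4_t b_swp = { b.y, b.x, b.w, b.z };            /* b.yxwz */
    return add4(mul4(a_re, b), mul4(sign, mul4(a_im, b_swp)));
}
```

Written this way, each pair of complex products costs three vector multiplies, one add, and three swizzles. What I'd like to know is whether the hardware can do better than that.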