I am running independent IIR filters in parallel and am packing my samples into float4s to make better use of the memory system.
Is there a better way to take a float element of a float4 and expand it into its own float4?
float4 packed_samples;
float4 *samples_cl;
float4 samplevec;

for (loop = 0; loop < MAX; loop++) {
    // get our next 4 samples as a 128 bit vector
    packed_samples = samples_cl[loop];

    samplevec = (float4)(packed_samples.s0, packed_samples.s0, packed_samples.s0, packed_samples.s0);
    // ...
    // do stuff with the samplevec
    // ...

    samplevec = (float4)(packed_samples.s1, packed_samples.s1, packed_samples.s1, packed_samples.s1);
    // ...
    // do stuff with the samplevec
    // ...

    samplevec = (float4)(packed_samples.s2, packed_samples.s2, packed_samples.s2, packed_samples.s2);
    // ...
    // do stuff with the samplevec
    // ...

    samplevec = (float4)(packed_samples.s3, packed_samples.s3, packed_samples.s3, packed_samples.s3);
    // ...
    // do stuff with the samplevec
    // ...
}
Awesome, thanks! That really makes the code more compact and readable.
You do know that
    float4 val = (float4)(1.0f);
replicates the scalar across all four elements, so that
    val = (1.0f, 1.0f, 1.0f, 1.0f)
, don't you? There's no need to spell out
    val = (float4)(1.0f, 1.0f, 1.0f, 1.0f);
since val = (float4)(1.0f) is equivalent according to the OpenCL spec.
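For what it's worth, here is a minimal sketch of how that scalar replication could be applied to the loop from the original question. The kernel name `expand_samples`, the `out` buffer, and the `max_loops` argument are hypothetical and only there to make the fragment self-contained; the "do stuff" part is stood in for by a plain store. The `.s1111` swizzle in the second step is another spelling that, as far as I know, OpenCL C accepts for the same broadcast.

```c
// Sketch only: kernel name, out buffer and max_loops are assumptions,
// not part of the original post. Written as a single-work-item loop to
// mirror the structure of the code in the question.
__kernel void expand_samples(__global const float4 *samples_cl,
                             __global float4 *out,
                             const int max_loops)
{
    for (int loop = 0; loop < max_loops; loop++) {
        // Get our next 4 samples as one 128-bit vector.
        float4 packed_samples = samples_cl[loop];

        // Casting a scalar to float4 replicates it into all four lanes.
        float4 samplevec = (float4)(packed_samples.s0);
        out[4 * loop + 0] = samplevec;   // placeholder for "do stuff"

        // A swizzle that repeats one component gives the same result.
        samplevec = packed_samples.s1111;
        out[4 * loop + 1] = samplevec;

        samplevec = (float4)(packed_samples.s2);
        out[4 * loop + 2] = samplevec;

        samplevec = (float4)(packed_samples.s3);
        out[4 * loop + 3] = samplevec;
    }
}
```

Whether the cast or the swizzle compiles to anything different from the four-argument form depends on the device compiler; I would expect them to be identical, but I haven't measured it.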