I am running several independent IIR filters in parallel and packing my samples into float4s to make better use of the memory system.
Is there a better way to take a float element of a float4 and expand it into its own float4?
    float4 packed_samples;
    float4 *samples_cl;
    float4 samplevec;

    for (loop = 0; loop < MAX; loop++) {
        // get our next 4 samples as a 128 bit vector
        packed_samples = samples_cl[loop];

        samplevec = (float4)(packed_samples.s0, packed_samples.s0, packed_samples.s0, packed_samples.s0);
        // ...
        // do stuff with the samplevec
        // ...

        samplevec = (float4)(packed_samples.s1, packed_samples.s1, packed_samples.s1, packed_samples.s1);
        // ...
        // do stuff with the samplevec
        // ...

        samplevec = (float4)(packed_samples.s2, packed_samples.s2, packed_samples.s2, packed_samples.s2);
        // ...
        // do stuff with the samplevec
        // ...

        samplevec = (float4)(packed_samples.s3, packed_samples.s3, packed_samples.s3, packed_samples.s3);
        // ...
        // do stuff with the samplevec
        // ...
    }
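For comparison, here is a minimal sketch of two shorter spellings that OpenCL C accepts for the same broadcast: a repeated-component swizzle (`.s0000`, equivalently `.xxxx`) and a vector literal built from a single scalar, which replicates that scalar across all lanes. The kernel name `broadcast_sketch`, the arguments `out` and `max_count`, and the `acc` accumulation are hypothetical placeholders standing in for the real filter work. Whether either form produces different machine code from the four-argument constructor depends on the compiler and device, and I have not measured it.

```c
// Sketch only: kernel name, arguments, and the accumulation are hypothetical
// placeholders, not the actual IIR code.
__kernel void broadcast_sketch(__global const float4 *samples_cl,
                               __global float4 *out,
                               const int max_count)
{
    float4 acc = (float4)(0.0f);

    for (int loop = 0; loop < max_count; loop++) {
        // get our next 4 samples as a 128 bit vector
        float4 packed_samples = samples_cl[loop];

        // Swizzle form: repeat component 0 in all four lanes
        // (packed_samples.xxxx is the same thing).
        float4 samplevec = packed_samples.s0000;
        acc += samplevec;   // placeholder for "do stuff with the samplevec"

        // Scalar form: a vector literal with one scalar replicates it.
        samplevec = (float4)(packed_samples.s1);
        acc += samplevec;

        samplevec = packed_samples.s2222;
        acc += samplevec;

        samplevec = packed_samples.s3333;
        acc += samplevec;
    }

    // Write the placeholder result so the loop is not optimized away.
    out[0] = acc;
}
```

Both forms are purely notational shortcuts for the original `(float4)(packed_samples.s0, packed_samples.s0, packed_samples.s0, packed_samples.s0)` construction, so the rest of the loop body would stay as it is.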