I'm looking for a way to populate a vector with values so I can do one big vector store, instead of several little stores. I'm trying to avoid doing byte stores, which are not always supported (right?).
for (int32_t v = 0; v < 8; ++v)
{
    uchar8 r = 0;
    for (int32_t u = 0; u < 8; ++u)
    {
        float8 t = (float8)(0.0f);
        for (int32_t y = 0; y < 8; ++y)
        {
            t += foo(v, u, y);
        }
        // Horizontal sum of the eight lanes into t.s0
        t.s0123 += t.s4567;
        t.s01 += t.s23;
        t.s0 += t.s1;
        uchar i = convert_uchar(clamp(rint(t.s0), 0.0f, 255.0f));
        // ***this is the part I'm asking about***
        // Here I want to "insert" t.s0 into r in vector member u
    }
    // Store whole vector
}
Is there a way to do it that is better than using a long and doing this:
    r |= convert_long(clamp(rint(t.s0), 0.0f, 255.0f)) << (u * 8);
or, for the opposite byte order:
    r |= convert_long(clamp(rint(t.s0), 0.0f, 255.0f)) << (56 - (u * 8));
The AMD media ops let you do some things like this, but of course they are not portable. See the programming guide, Appendix A, section A.8.4, e.g. amd_pack() or perhaps amd_bytealign().
Otherwise ... well, I'd just stick to using longs or ints; doing things in sets of 4 seems fairly optimal ALU-wise on current hardware.
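For reference, here is a minimal sketch of the shift-and-OR packing, written as plain host C rather than OpenCL C so it can be compiled and checked on its own. The names to_byte and pack8 are hypothetical, not from any library. It also assumes round-half-up instead of the round-to-even that rint() gives you, so exact .5 inputs can differ by one from the kernel version.

```c
#include <stdint.h>

/* Hypothetical helper: clamp to 0..255 and round to nearest.
   Rounds halves up, unlike rint() which rounds halves to even.
   NaN inputs are mapped to 0. */
static uint8_t to_byte(float x)
{
    double d = x;                 /* float + 0.5 is exact in double */
    if (!(d >= 0.0)) d = 0.0;     /* also catches NaN */
    if (d > 255.0) d = 255.0;
    return (uint8_t)(d + 0.5);
}

/* Hypothetical helper: pack eight values into one 64-bit word.
   Element u lands in bits [8u, 8u+7]. */
static uint64_t pack8(const float vals[8])
{
    uint64_t r = 0;
    for (int u = 0; u < 8; ++u)
        r |= (uint64_t)to_byte(vals[u]) << (u * 8);
    return r;
}
```

On a little-endian device, writing the result with a single 64-bit store puts element u at byte offset u. Shifting by 56 - (u * 8) instead gives the reverse order. Using an unsigned type avoids any sign-bit surprises when the top byte is 128 or more.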
Unless memory is an issue, I tend to just use floats for storage if multiple passes are involved and only convert to byte at the end for display/output, or use images and let the compiler/hardware do the packing to suit the data.
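The float-storage approach could look roughly like this, again as a plain C sketch and not kernel code. The function name floats_to_bytes is hypothetical, and the rounding is half-up rather than rint()'s half-to-even. The intermediate passes work on the float buffer, and bytes are only produced once in a final pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical final pass: convert a float buffer to bytes once,
   after all intermediate passes have finished working in float. */
static void floats_to_bytes(const float *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double d = src[i];
        if (!(d >= 0.0)) d = 0.0;     /* clamp low, also catches NaN */
        if (d > 255.0) d = 255.0;     /* clamp high */
        dst[i] = (uint8_t)(d + 0.5);  /* round half up */
    }
}
```

The trade-off is four times the memory traffic for the intermediate data, in exchange for never having to pack or unpack bytes between passes.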