I've been working on a binomial lattice (option pricing) problem lately, and my original implementation assigned one lattice value per thread. That worked fine, but single-precision accuracy degraded once the number of timesteps grew past 32768.
Afterwards, in an attempt to improve performance, I tried vectorizing the kernel. I got the vectorization working (though it's actually no faster, because of the branching it requires), but oddly it gives me better precision: results stay accurate up to 114688 timesteps, which is exactly the point where a single-precision CPU implementation also loses too much precision relative to a double-precision CPU reference.
Does anyone have any idea why that is? Could vectorization be forcing the arithmetic to full 32-bit precision instead of being cut off at 24 bits (as MAD and MUL are, IIRC)? I took a brief look at the IL for both versions of the kernel in the Kernel Analyzer and didn't see anything out of place that would explain why the scalar version was being truncated.
Thanks in advance.