Archives Discussions

notyou · ‎12-08-2011

I have been working with the binomial lattice problem lately and my original implementation assigned one value per thread. This method worked fine, but resulted in precision being lost as the number of timesteps increased past 32768.

Afterwards, in an attempt at increasing performance I tried to vectorize the kernel. While I was successful in getting the vectorization to work (although it doesn't run faster because of the branching it requires), oddly, I found it gave me greater precision, up to 114688 timesteps (which is the same point where the CPU single precision implementation loses too much precision compared to a CPU-based double precision implementation).

Does anyone have any idea why this is the case? Is it an effect of vectorization forcing the math not to be cut off at 24 bits (as MAD and MUL are IIRC) and instead forcing it to 32 bits? I took a brief look through both versions of the IL code in the kernel analyzer and didn't see anything that seemed out of place as to why it was originally being truncated.

Thanks in advance.

-Matt

eugenek · ‎12-08-2011

That would seem quite improbable to me. Vectorization is essentially a hint to the compiler, it should have no effect on your results if your vectorized code is algorithmically equivalent to your original code. If you're seeing different results with and without vectorization, you're missing something in your own implementation.

notyou · ‎12-08-2011

Originally posted by: eugenek That would seem quite improbable to me. Vectorization is essentially a hint to the compiler, it should have no effect on your results if your vectorized code is algorithmically equivalent to your original code. If you're seeing different results with and without vectorization, you're missing something in your own implementation.

That's what I thought. Just to give a bit of background information, I received the original CUDA code which the developer used which worked up to 32768 time steps, after which he also lost too much precision. I converted his code to OpenCL and found that my implementation also lost too much precision after 32768 time steps (so our solutions were equal).

Then, I vectorized the code in an attempt to make it run faster at which point I noticed that I was accurate to 5-6 decimal places instead of only 3.

Originally posted by: N3KO

Could it be a floating point number issue? Floating point numbers (float and double data types) are not perfect representations of numbers and the order in which operations are executed can produce different results. For instance, consider the case where A is an array and the 'i' element of the array is A = 1.0/2.0^i

If you want to add the elements in A, a program would return different results depending of how you traverse the array, i.e. adding elements from i = 0 to N would give a different answer than if you do from i=N to 0. The reason is that accuracy is lost when you add small and big numbers.

That is just an example of how order of operations can affects results in floating point arithmetic but there are a few other potential issues that could have been mitigated in your vectorized version of your program.

I hadn't thought of that, but that shouldn't be an issue if all threads are independent and working on independent data, right?

Attached are both the original and then vectorized versions which do the same thing, but the vectorized version works for a larger number of time steps. One thing I also noticed when I was running some more tests is that, on my 5870M I was able to get to 114688 time steps and keeping precision to 3 decimal places. But when I ran the same code on an A6-3400 (laptop), it only seemed to be precise to 3 decimal places for up to 65536 time steps. In this case I would assume it has become a compiler "issue" since the target devices are different.

EDIT: re-ran my algorithm and the precision was now the same for both the dedicated and integrated GPU (but results still differ between the vectorized and unvectorized code).

// original unvectorized code __kernel void LookbackOpenCL(__global float *optionValues, __global float *tempOptionValues, int N, float u, float d, float pu, float pd, float disc, int i) { int globalID = get_global_id(0); int localID = get_local_id(0); float myYValue, tempOptionValue; myYValue = pow(u, globalID); if ( globalID < i ) { if ( globalID == 0 ) tempOptionValue = fmax(myYValue - 1.0f, ((pu * optionValues[1] * d) + (pd * optionValues[0] * u)) * disc); else tempOptionValue = fmax(myYValue - 1.0f, ((pu * optionValues[localID + 1] * d) + (pd * optionValues[globalID - 1] * u)) * disc); tempOptionValues[globalID] = tempOptionValue; } } // vectorized kernel __kernel void LookbackOpenCL(__global float2 *optionValues, __global float2 *tempOptionValues, int N, float u, float d, float pu, float pd, float disc, int i) { int globalID = get_global_id(0); int localID = get_local_id(0); float2 myYValue; float2 tempOptionValue; int highIndex = globalID * size + size; myYValue.s0 = pow(u, lowIndex); myYValue.s1 = pow(u, lowIndex + 1); if ( highIndex < i ) { if ( globalID == 0 ) { tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f, ((pu * optionValues[0].s1 * d) + (pd * optionValues[0].s0 * u)) * disc); tempOptionValue.s1 = fmax(myYValue.s1 - 1.0f, ((pu * optionValues[1].s0 * d) + (pd * optionValues[0].s0 * u)) * disc); } else { tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f, ((pu * optionValues[localID].s1 * d) + (pd * optionValues[globalID - 1].s1 * u)) * disc); tempOptionValue.s1 = fmax(myYValue.s1 - 1.0f, ((pu * optionValues[localID + 1].s0 * d) + (pd * optionValues[globalID ].s0 * u)) * disc); } tempOptionValues[globalID].s0 = tempOptionValue.s0; tempOptionValues[globalID].s1 = tempOptionValue.s1; } barrier(CLK_GLOBAL_MEM_FENCE); if ( highIndex == i ) { if ( globalID == 0 ) { tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f, ((pu * optionValues[0].s1 * d) + (pd * optionValues[0].s0 * u)) * disc); } else { tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f, ((pu * optionValues[localID].s1 * d) + (pd * optionValues[globalID - 1].s1 * u)) * disc); } tempOptionValues[globalID].s0 = tempOptionValue.s0; } }

N3KO · ‎12-08-2011

The 5870M should have a fused-multiply add (FMA) instruction that would do: a = a + (b*c) in a single step. The A6-3400 (cpu?) is most probably doing the multiplication and addition in two separate operations (first b*c then add a).

If you consider that each floating point operation will finally round the results into a float (32 bit) or a double (64 bit), the A6-3400 performs two rounding operations while the 5870's FMA performs a single rounding operation. The FMA enhances the precision of the computations while using single precision data types.

Moreover, floating point arithmetic is not associative or distributive. To check this, you can force the compiler into the order in which the operations are performed by using parenthesis.

notyou · ‎12-08-2011

I was using the GPU of the A6-3400. Where did you get that the 5870M has FMA? If you have a source, I'd like to add it to my collection since information like that would be useful.

N3KO · ‎12-08-2011

Could it be a floating point number issue? Floating point numbers (float and double data types) are not perfect representations of numbers and the order in which operations are executed can produce different results. For instance, consider the case where A is an array and the 'i' element of the array is A = 1.0/2.0^i

If you want to add the elements in A, a program would return different results depending of how you traverse the array, i.e. adding elements from i = 0 to N would give a different answer than if you do from i=N to 0. The reason is that accuracy is lost when you add small and big numbers.

That is just an example of how order of operations can affects results in floating point arithmetic but there are a few other potential issues that could have been mitigated in your vectorized version of your program.

MicahVillmow · ‎12-08-2011

The 5870M does not have FMA since it does not support double precision and that is a requirement of FMA.

A relatively complete list can be found here:
http://en.wikipedia.org/wiki/C...hics_processing_units

There is no mobile chip that has double precision since R7XX series, and FMA was introduced with EG chips, so no mobile with FMA.

notyou · ‎12-08-2011

Originally posted by: MicahVillmow The 5870M does not have FMA since it does not support double precision and that is a requirement of FMA. A relatively complete list can be found here: http://en.wikipedia.org/wiki/C...hics_processing_units There is no mobile chip that has double precision since R7XX series, and FMA was introduced with EG chips, so no mobile with FMA.

EDIT: re-ran my algorithm and it looks like the dedicated vs. integrated GPU precision problem has been solved.

3 questions:

1. Is there any documentation available publicly as to which chips support which features?

2. Do you have any ideas as to why this is occurring? I mean, it shouldn't be different for both codes since it should compile to nearly the same base code.

3. I understand you can't comment on unreleased products so this is more of an idea/theoretical question. Have you considered doing a limited production run of APUs that would support double precision? It would be useful for research where 64-bit precision is required but the algorithm shouldn't be run on a dedicated GPU because the memory access is better suited to an APU.

himanshu_gautam · ‎12-13-2011

1. You can refer to the ISA documents to know if some particular instruction is supported or not. Analyzing the ISA generated would also be a good idea.

2. Not really sure about this. Should mostly be some algorithmic alteration due to which better precision is observed.Do you see the same precision difference on every other card or soes it happen for 5870M only.

3. I guess that would be surely under considerations.

notyou · ‎12-13-2011

I do see the same difference on other hardware. So far I've tested it on an E-350, an A6-3400M and a 5870M and all of them had the "extra" precision.

I also updated (first) to the lastest driver and (then) SDK which also had the same "problem".

At this point I'm out of ideas other than it somehow compiling down to forcing 32-bit precision multiplication, but why it would do that I have no idea.

sbandyk · ‎02-08-2012

I've been looking for the definitive document from AMD that discussed double-precision support in the A8 APUs and in the process I've come across this thread. We've purchased several systems but I've not yet gotten a chance to write any code to query the hardware. All I've seen mirror's your comments though, none of the APUs support doubles in the GPU. 😞

I want to 2nd your suggestion about the utility of doubles on the APUs.

I understand that AMD is dealing with limited die space and that these chips are designed for consumer markets but there is a huge potential here for Research.

I'm an IT Manager in a research University. I oversee quite a few compute resources from single multi-socket compute nodes, to a 96 node dual-quad HPC cluster.

I've already purchased a number of A8-3850 nodes to offer up as 'free cycles' to unfunded researchers (mostly grad-students) but that was before I realized they didn't support doubles in the GPU [yes, my fault]. Going forward I'm debating going Intel and Nvidia and paying more per node for more performance, or sticking with AMD for price given the APUs are less appealing now.

There's huge potential for people in my position to deploy low-cost compute resources based on APUs and OpenCL code but I need to be able to compile research code requiring high precision. There's a demand for extremely inexpensive nodes [commodity APU parts] that can chug along on computations at a relatively leisurely pace, but an Opteron APU would also be hugely valuable to my groups that buy multiple quad-socket 8-core Opteron boxes to those currently building out CUDA Clusters. Given the extremely impressive single-precision GP-GPU performance we've seen with the A8s, a proper GP-GPU integrated onto an Opteron core would seriously compete with a much more expensive Intel/Tesla solution, like we're deploying now.

Please, give me an 8-core Opteron with a proper GPU integrated. Give me at least a 4-way memory bus to feed the GPU. I'd like the ability to interleave more DIMMs for more bandwith.

My people will do amazing things with that hardware.

diepchess · ‎02-14-2012

"Please, give me an 8-core Opteron with a proper GPU integrated. "

Of course such integrated gpu will have a budget of maybe a 18 watts or so and that'll be a hard limit.

Whereas the stand alone gpu's get fed by psu's that deliver a lot more power and quite efficient nowadays,

than the 375 watt it will be rated at.

Even if it is that 375 that AMD and Nvidia give it, still it's factor 20+ faster for single precision, and of course double precision it probably will be a 100+ times faster as i see no reason why an integrated gpu ever will be having double precision, as that will reduce yields.

sbandyk · ‎02-14-2012

Not exactly clear on your reply but I assume you're saying AMD would have an 18W budget for the GPU portion.

I realized after I posted that maybe I should have explicitly stated my thoughts were on a future product on a smaller process. I don't expect AMD to modify today's Magny Cours core to add a large GPU.

I'm also not asking for a 2000 core, top-of-the line GPU core. My current A8s have a small GPU core, by stand alone GPU standards, but it seems to be extremely fast compared to larger, more powerful GPUs on the PCIe bus when it comes to OpenCL calculations.

Maybe I should suggest everyone think of it from this angle..

I've no interest in having integrated graphics on an Opteron core so I can drive a console. What I want is a streaming co-processor. More and more code is being ported, mostly to CUDA now, but there's an opportunity for OpenCL to become more prominent with AMD taking the single GPU performance crown fairly decisively (though not sure for how long) and with the AMD [and coming Intel] APUs.. and with Intel prepping MIC for release. CUDA is amazing, at least when you've got the expertise on staff to write the code, but I see OpenCL as being a much better overall option given the way things have been going.

What I dream of is a processor with a lot of cores, and enough GPU cores on-die that we can accelerate some of our parallel algorithms. Even a paltry 80-core GPU is icing on the cake for computation when the latency is minimized as much as it appears to be with my A8s. And remember, I'm talking about HPC/HTC. I don't care about thermals as long as they're manageable. I want performance per dollar and per U, not performance per W (though that should also be excellent from what we've seen in A8).

If I'm building a massive cloud.. yes, power consumption becomes a huge issue even if I take a hit on per-node performance.

In my case though, I'd much rather have a relatively power-hungry CPU that gave me much more 'horsepower'. That's especially true if the alternative is a moderately power-hungry CPU and a 100-300W GPU card.

If I could get an 8-Core CPU with a useful GPU core integrated and it came in at 135W.. that's a deal from my perspective. It'd certainly be no harder to cool than my Dell c410s with 16 Teslas in them or my dual-Nehalem SuperMicro with a Tesla and a GTX260 in 1U (yes.. it's already old, it's only a 260 and a Tesla 1060 😞

Thanks,

Steven.

diepchess · ‎02-14-2012

hi!

Very interesting observation, especially for specific FFT's that simply risk using more bits than on paper is possible...

Not sure whether this is the case with you, but usually if you keep things within the registers you have more bits available than when you try to store it elsewhere.

Elsewhere stored it's just single precision and internal it has more bits.

This is very common problem with itanium2+ processors, maybe also is case here that some action loses you the bits that on paper shouldn't be there...

I remember long time ago, several generations it was a bit or 40 in register (so not 32). How much of that got used for mantissa i do not know.

In this case the reason might be simpler: that if you multiply that you have as output 32 bits rather than 23 for mantissa.

So when you reuse that result it should has clear advantages. Maybe the vectorized code is keeping more in registers and therefore less rounding off happens?

Archives Discussions

Vectorization results in more precision?