
Raistmer
Adept II

GPU double arithmetic gives incorrect results relative to the CPU

On the CPU:

period = (double)sub_buffer_size / (double)freq;

On the GPU:

kernel void GPU_fetch_array_kernel74t(int sub_buffer_size, float src[], float freq[], out float4 dest<>) {
    float f = freq[threadID];
    double period = (double)sub_buffer_size / (double)f;

where freq[threadID] is the same frequency as in the CPU version.

These periods differ enough to produce an incorrect stride in the folding.

Any comments about double precision calculations on the GPU?
riza_guntur
Journeyman III

Originally posted by: Raistmer
On CPU: period=(double)sub_buffer_size/(double)freq;
On GPU: kernel void GPU_fetch_array_kernel74t(int sub_buffer_size, float src[], float freq[], out float4 dest<>){ float f=freq[threadID]; double period=(double)sub_buffer_size/(double)f;
where freq[threadID] is the same frequency as in the CPU version. These periods differ enough to produce an incorrect stride in the folding. Any comments about double precision calculations on the GPU?


Is it that bad?


Raistmer

Well, this period value is used later in this code:

The differences are enough to fold the data with a wrong offset, making all further computations meaningless. The results don't pass validation.

What precision is guaranteed for the division of two numbers in double arithmetic in GPU code?

int n_per = (int)f;
for (k = 0; k < n_per; k++) {
    l = (int)(period * (double)k + 0.5);
    l += (4*j); //R: index into the data array computed
    acc.x += src[l];
    acc.y += src[l+1];
    acc.z += src[l+2];
    acc.w += src[l+3];
}
dest = acc;


scollange

I confirm that there is something wrong with the implementation of double-precision division. It assumes the presence of a fused multiply-add in hardware, which is not the case as far as I know.

However I didn't find examples of catastrophic losses of accuracy.

Can you give specific values for which the problem occurs?

What are the typical ranges of freq[], sub_buffer_size and k?


Raistmer

Originally posted by: scollange
I confirm that there is something wrong with the implementation of double-precision division. It assumes the presence of a fused multiply-add in hardware, which is not the case as far as I know.
However I didn't find examples of catastrophic losses of accuracy.
Can you give specific values for which the problem occurs?
What are the typical ranges of freq[], sub_buffer_size and k?

freq is ~273, k from zero to ~273, sub_buffer_size maybe ~65k or less.

I found that double precision on the GPU is completely unusable for my app. It gives wrong offsets into the array, which in turn results in completely wrong folding. Loading the periods from the CPU only delayed the problem, it didn't solve it.

This line:

l=(int)(period*(double)k + 0.5);

uses the double precision period to compute an offset into the array. Unfortunately, this algorithm seems pretty unstable, because it depends heavily on the rounding mode. The GPU has to fold the same elements of the array as the CPU does; if it adds different elements, the resulting array is completely different and the subsequent results don't pass validation against the CPU results. Unfortunately, the ATI GPU is unable to produce the same offsets as the CPU in this code.
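
To make the sensitivity concrete, here is a small host-side experiment (a sketch with hypothetical values in the ranges quoted above: sub_buffer_size ~65k, freq ~273). It counts how many of the 273 offsets change when the period is perturbed by a single ulp, and how many change when the period is rounded to float precision:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double period   = 65536.0 / 273.0;        /* ~240.0586..., hypothetical */
    double ulp_off  = nextafter(period, 0.0); /* period minus 1 ulp */
    double as_float = (double)(float)period;  /* period rounded to float */
    int flips_ulp = 0, flips_float = 0;
    for (int k = 0; k < 273; k++) {
        int l = (int)(period * (double)k + 0.5);
        if ((int)(ulp_off  * (double)k + 0.5) != l) flips_ulp++;
        if ((int)(as_float * (double)k + 0.5) != l) flips_float++;
    }
    printf("offsets flipped by 1 ulp: %d, by float rounding: %d\n",
           flips_ulp, flips_float);
    return 0;
}

For these magnitudes a 1-ulp error rarely lands close enough to a rounding threshold to flip an offset, but a float-precision period can flip a few, which is exactly the kind of mismatch described above.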

I solved this problem by precomputing the needed offsets on the CPU and then just loading them into the GPU. That is, no double precision calculations on the GPU at all (code attached). Of course this is a lower-performance solution; it costs ~2% of running time on a test load. It would be much better if these calculations could be done on the GPU itself.

kernel void GPU_fetch_array_kernel94t(float src[], int offsets[][], float freq[], out float4 dest<>)
{
    //R: here we should form the dest buffer element by element, no arrays inside a kernel allowed
    //So we return to the initial fetch order that was so unoptimal for the CPU version
    //Hopefully the strided memory access will not degrade performance as much as it did on the CPU
    int j = instance().y;
    int threadID = instance().x;
    int k = 0;
    int l = 0;
    float4 acc = float4(0.f, 0.f, 0.f, 0.f);
    float f = freq[threadID];
    //double period = periods[threadID]; //(double)sub_buffer_size/(double)f;
    int n_per = (int)f;
    for (k = 0; k < n_per; k++) {
        l = offsets[threadID][k];
        l += (4*j); //R: index into the data array computed
        acc.x += src[l];
        acc.y += src[l+1];
        acc.z += src[l+2];
        acc.w += src[l+3];
    }
    dest = acc;
}


Gipsel

Originally posted by: Raistmer
Originally posted by: scollange Can you give specific values for which the problem occurs? What are the typical ranges of freq[], sub_buffer_size and k?
freq is ~273, k from zero to ~273, sub_buffer_size maybe ~65k or less.


For such values even floats should be enough. You have 24 bits of mantissa to play with, and your example values need just 16 bits at most.

All the texture lookups in CAL/Brook are done with floats indexing into the arrays, and it works there too. I would be surprised if there were not another bug causing this problem.
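
As an illustration of the mantissa argument (a sketch, not Gipsel's code): every integer up to 2^24 round-trips exactly through a float, so the index values themselves are safely representable:

#include <assert.h>

int main(void)
{
    /* The largest index in the kernels is about 4*65536 = 2^18,
       well below the 2^24 limit for exact integers in a float. */
    for (int l = 0; l <= 65536 * 4; l++)
        assert((int)(float)l == l);
    return 0;
}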


Raistmer

Originally posted by: Gipsel
For such values even floats should be enough. You have 24 bits of mantissa to play with, and your example values need just 16 bits at most.
All the texture lookups in CAL/Brook are done with floats indexing into the arrays, and it works there too. I would be surprised if there were not another bug causing this problem.

I didn't dig too deep into that. The facts are:
The kernel marked 84t gives an invalid folded array.
The kernel marked 94t gives correct folding on the same input data.
I won't argue that the error is bigger than some accepted tolerance; unfortunately, it just differs from the CPU value enough to give a wrong offset.
Maybe some bug exists; it would be nice to find it then.

Failed kernel:

kernel void GPU_fetch_array_kernel84t(int sub_buffer_size, float src[], double periods[], float freq[], out float4 dest<>)
{
    //R: here we should form the dest buffer element by element, no arrays inside a kernel allowed
    //So we return to the initial fetch order that was so unoptimal for the CPU version
    //Hopefully the strided memory access will not degrade performance as much as it did on the CPU
    int j = instance().y;
    int threadID = instance().x;
    int k = 0;
    int l = 0;
    float4 acc = float4(0.f, 0.f, 0.f, 0.f);
    float f = freq[threadID];
    double period = periods[threadID]; //(double)sub_buffer_size/(double)f;
    int n_per = (int)f;
    for (k = 0; k < n_per; k++) {
        l = (int)(period * (double)k + 0.5);
        l += (4*j); //R: index into the data array computed
        acc.x += src[l];
        acc.y += src[l+1];
        acc.z += src[l+2];
        acc.w += src[l+3];
    }
    dest = acc;
}

Working kernel:

kernel void GPU_fetch_array_kernel94t(float src[], int offsets[][], float freq[], out float4 dest<>)
{
    //R: same structure, but the offsets are precomputed on the CPU
    int j = instance().y;
    int threadID = instance().x;
    int k = 0;
    int l = 0;
    float4 acc = float4(0.f, 0.f, 0.f, 0.f);
    float f = freq[threadID];
    //double period = periods[threadID]; //(double)sub_buffer_size/(double)f;
    int n_per = (int)f;
    for (k = 0; k < n_per; k++) {
        l = offsets[threadID][k];
        l += (4*j); //R: index into the data array computed
        acc.x += src[l];
        acc.y += src[l+1];
        acc.z += src[l+2];
        acc.w += src[l+3];
    }
    dest = acc;
}

Preparation of the offsets and the kernel call:

unsigned int sb_size = __min(max_threads_per_block_fetch, thread_num - t_offset);
int* offsets = new int[sb_size*512]; //R: 512 should be the max value of n_per possible
if (offsets == NULL) fprintf(stderr, "ERROR: offsets==NULL\n");
for (int tid = 0; tid < sb_size; tid++) {
    for (int j = 0; j < (int)freqs[tid]; j++)
        offsets[tid*512 + j] = (int)(periods[tid]*j + 0.5);
}
unsigned offsets_buf[] = {512, sb_size};
brook::Stream<int> gpu_offsets(2, offsets_buf);
gpu_offsets.read(offsets);
GPU_fetch_array_kernel94t(gpu_data, gpu_offsets, /*gpu_per_int,*/ gpu_freqs, *gpu_temp);
gpu_temp->finish();
if (gpu_temp->error()) {
    fprintf(stderr, "\nERROR: GPU_fetch_array_kernel74t: %s", gpu_temp->errorLog());
    fprintf(stderr, "gpu_temp size:%u x %u = %u elements of type float4\n",
            buf_size[0], buf_size[1], buf_size[0]*buf_size[1]);
};
delete[] offsets;


Gipsel

Originally posted by: scollange
I confirm that there is something wrong with the implementation of double-precision division. It assumes the presence of a fused multiply-add in hardware, which is not the case as far as I know.
However I didn't find examples of catastrophic losses of accuracy.



Have you found any example with a loss of accuracy? According to the documentation, the double precision division should yield the correctly rounded (to nearest) result.

I don't think your assertion that the implementation needs a fused multiply-add in hardware is correct. I had a look at the disassembly: it generally uses an iterative algorithm for the reciprocal, i.e. the division c = a/b is done as c = a * (1/b), with the iterations approximating the reciprocal 1/b. One property of that algorithm is the self-healing of small errors introduced by the truncated intermediate results of the (not fused) MADDs.

That alone would still produce a 1 ulp error (probably more in corner cases), but there is an added correction step applied directly to the a/b result at the end. This should reduce the error to 1 ulp in all cases. I don't have much experience proving such things, but since the last two MADDs (from the final correction step) are arranged so that their truncations work in opposite directions, I wouldn't exclude the possibility of a correctly rounded result, as the documentation claims.
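
In C-like terms, that correction step would look roughly like this (a sketch of the idea, not the actual ISA; y stands for the reciprocal approximation after the iterations):

/* Sketch: refine the approximate quotient q = a*y using the residual of a/b.
   On the GPU these are plain (truncating) MULADD_64 ops, not fused FMAs. */
static double correct_quotient(double a, double b, double y)
{
    double q = a * y;       /* approximate quotient, y ~ 1/b */
    double r = -b * q + a;  /* residual a - b*q */
    return r * y + q;       /* corrected a/b */
}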

kernel void ddiv(double a<>, double b<>, out double c<>) { c = a/b; }

; -------- Disassembly --------------------
00 TEX: ADDR(96) CNT(2) VALID_PIX
      0  SAMPLE R1.xy__, R0.xyxx, t1, s0  UNNORM(XYZW)
      1  SAMPLE R0.xy__, R0.xyxx, t0, s0  UNNORM(XYZW)
01 ALU: ADDR(32) CNT(49)
      2  x: FREXP_64    ____, R1.y
         y: FREXP_64    ____, R1.x
         z: FREXP_64    ____, 0.0f
         w: FREXP_64    ____, 0.0f
         t: MOV         R2.z, 0.0f
      3  x: F64_TO_F32  ____, PV2.w
         y: F64_TO_F32  ____, PV2.z
         w: SUB_INT     T1.w, 0.0f, PV2.y
      4  x: F32_TO_F64  T0.x, -1.0f
         y: F32_TO_F64  T0.y, 0.0f
         t: RCP_e       ____, PV3.x
      5  x: F32_TO_F64  T1.x, 1.0f
         y: F32_TO_F64  T1.y, 0.0f
         z: F32_TO_F64  T0.z, PS4
         w: F32_TO_F64  T0.w, 0.0f
      6  x: MUL_64      T2.x, T0.y, R1.y
         y: MUL_64      T2.y, T0.y, R1.y
         z: MUL_64      ____, T0.y, R1.y
         w: MUL_64      ____, T0.x, R1.x
      7  x: LDEXP_64    T0.x, T0.w, T1.w
         y: LDEXP_64    T0.y, T0.z, T1.w
      8  x: MULADD_64   ____, T2.y, PV7.y, T1.y
         y: MULADD_64   ____, T2.y, PV7.y, T1.y
         z: MULADD_64   ____, T2.y, PV7.y, T1.y
         w: MULADD_64   ____, T2.x, PV7.x, T1.x
      9  x: MULADD_64   T0.x, PV8.y, T0.y, T0.y
         y: MULADD_64   T0.y, PV8.y, T0.y, T0.y
         z: MULADD_64   ____, PV8.y, T0.y, T0.y
         w: MULADD_64   ____, PV8.x, T0.x, T0.x
     10  x: MULADD_64   ____, T2.y, PV9.y, T1.y
         y: MULADD_64   ____, T2.y, PV9.y, T1.y
         z: MULADD_64   ____, T2.y, PV9.y, T1.y
         w: MULADD_64   ____, T2.x, PV9.x, T1.x
     11  x: MULADD_64   T0.x, PV10.y, T0.y, T0.y
         y: MULADD_64   T0.y, PV10.y, T0.y, T0.y
         z: MULADD_64   ____, PV10.y, T0.y, T0.y
         w: MULADD_64   ____, PV10.x, T0.x, T0.x
     12  x: MUL_64      T1.x, R0.y, PV11.y
         y: MUL_64      T1.y, R0.y, PV11.y
         z: MUL_64      ____, R0.y, PV11.y
         w: MUL_64      ____, R0.x, PV11.x
     13  x: MULADD_64   ____, T2.y, PV12.y, R0.y
         y: MULADD_64   ____, T2.y, PV12.y, R0.y
         z: MULADD_64   ____, T2.y, PV12.y, R0.y
         w: MULADD_64   ____, T2.x, PV12.x, R0.x
     14  x: MULADD_64   R2.x, PV13.y, T0.y, T1.y
         y: MULADD_64   R2.y, PV13.y, T0.y, T1.y
         z: MULADD_64   ____, PV13.y, T0.y, T1.y
         w: MULADD_64   ____, PV13.x, T0.x, T1.x
02 EXP_DONE: PIX0, R2.xyzz
END_OF_PROGRAM


scollange

Originally posted by: Gipsel
Have you found any example with a loss of accuracy? According to the documentation, the double precision division should yield the correctly rounded (to nearest) result.


Yes, I ran some quick tests with random values some time ago and found errors of up to 1 ulp. I didn't try hard-to-round cases specifically, however.

Originally posted by: Gipsel
I had a look at the disassembly: it generally uses an iterative algorithm for the reciprocal, i.e. the division c = a/b is done as c = a * (1/b), with the iterations approximating the reciprocal 1/b. One property of that algorithm is the self-healing of small errors introduced by the truncated intermediate results of the (not fused) MADDs.
That alone would still produce a 1 ulp error (probably more in corner cases), but there is an added correction step applied directly to the a/b result at the end.



So, below is my disassembled version of the code, with some minor changes for readability.

It is definitely a variation of a typical algorithm from Peter Markstein, as used on the Itanium and PowerPC. It would not have made sense to write the iteration this way unless the author assumed the presence of an FMA.

I agree that the error is probably not more than 1 or 2 ulps, but then there are faster ways to compute an answer accurate to 1 ulp...

double amd_div(double a, double b)
{
    int b_exp;
    double b_mant = frexp(b, &b_exp);            // Split mantissa/exponent
    double y0 = (double)(1.0f / (float)b_mant);  // Reciprocal approx, ~22 bits
    y0 = ldexp(y0, -b_exp);                      // Give back exponent
    double e = -b * y0 + 1.0;                    // 1st Newton-Markstein iteration
    double y1 = e * y0 + y0;
    double e2 = -b * y1 + 1.0;                   // 2nd Newton-Markstein iteration
    double y2 = e2 * y1 + y1;
    double q = a * y2;
    double r = -b * q + a;                       // Rounding step, Markstein theorem
    return r * y2 + q;
}
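
For anyone who wants to reproduce the kind of random test mentioned above, here is a quick harness (a sketch, not from the thread; it assumes amd_div() from above is compiled into the same file, and that the compiler keeps the multiply-adds unfused, e.g. gcc -ffp-contract=off, so the host arithmetic mimics the GPU's non-fused MADDs):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double random_double(void)
{
    double m = 1.0 + (double)rand() / ((double)RAND_MAX + 1.0); /* mantissa in [1,2) */
    return ldexp(m, rand() % 64 - 32);  /* modest exponent range */
}

int main(void)
{
    int worst = 0;
    for (int i = 0; i < 1000000; i++) {
        double a = random_double(), b = random_double();
        double ref = a / b;          /* correctly rounded by the host FPU */
        double got = amd_div(a, b);
        int ulps = 0;                /* ulp distance, capped at 8 */
        for (double x = got; x != ref && ulps < 8; ulps++)
            x = nextafter(x, ref);
        if (ulps > worst) worst = ulps;
    }
    printf("worst error seen: %d ulp\n", worst);
    return 0;
}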
