Originally posted by: Raistmer
On CPU:

    period=(double)sub_buffer_size/(double)freq;

On GPU:

    kernel void GPU_fetch_array_kernel74t(int sub_buffer_size,float src[],float freq[],out float4 dest<>){
        float f=freq[threadID];
        double period=(double)sub_buffer_size/(double)f;

where freq[threadID] is the same frequency as in the CPU version. These periods differ enough to produce an incorrect stride in the folding. Any comments about GPU double precision and calculations?
Is it that bad?
    int n_per=(int)f;
    for(k=0;k<n_per;k++){
        l=(int)(period*(double)k + 0.5);
        l+=(4*j);//R: index to data array computed
        acc.x+=src[l];
        acc.y+=src[l+1];
        acc.z+=src[l+2];
        acc.w+=src[l+3];
    }
    dest=acc;
I confirm that there is something wrong with the implementation of double-precision division. It assumes the presence of a fused multiply-add in hardware, which is not the case as far as I know.
However, I didn't find any examples of catastrophic loss of accuracy.
Can you give specific values for which the problem occurs?
What are the typical ranges of freq[], sub_buffer_size and k?
Originally posted by: scollange
Can you give specific values for which the problem occurs? What are the typical ranges of freq[], sub_buffer_size and k?
kernel void GPU_fetch_array_kernel94t(float src[],int offsets[][],float freq[],out float4 dest<>){
    //R: here we should form dest buffer elements one by one, no arrays inside kernel allowed
    //So we should return back to the initial fetch order that was so unoptimal for the CPU version
    //will hope access to memory with stride will not degrade performance as much as it did for the CPU version
    int j=instance().y;
    int threadID=instance().x;
    int k=0;
    int l=0;
    float4 acc=float4(0.f,0.f,0.f,0.f);
    float f=freq[threadID];
    //double period=periods[threadID];//(double)sub_buffer_size/(double)f;
    int n_per=(int)f;
    for(k=0;k<n_per;k++){
        l=offsets[threadID][k];
        l+=(4*j);//R: index to data array computed
        acc.x+=src[l];
        acc.y+=src[l+1];
        acc.z+=src[l+2];
        acc.w+=src[l+3];
    }
    dest=acc;
}
Originally posted by: Raistmer
Originally posted by: scollange: Can you give specific values for which the problem occurs? What are the typical ranges of freq[], sub_buffer_size and k?
freq is ~273, k from zero to ~273, sub_buffer_size maybe ~65k or less.
For such values even floats should be enough. You have 24 bits of mantissa to play with, and your example values need just 16 bits at most.
All the texture lookups in CAL/Brook are done with floats indexing into the arrays, and it works there too. I would be surprised if there were no other bug causing this problem.
Originally posted by: Gipsel
For such values even floats should be enough. You have 24 bits of mantissa to play with, and your example values need just 16 bits at most. All the texture lookups in CAL/Brook are done with floats indexing into the arrays, and it works there too. I would be surprised if there were no other bug causing this problem.
failed kernel:

kernel void GPU_fetch_array_kernel84t(int sub_buffer_size,float src[],double periods[],float freq[],out float4 dest<>){
    //R: here we should form dest buffer elements one by one, no arrays inside kernel allowed
    //So we should return back to the initial fetch order that was so unoptimal for the CPU version
    //will hope access to memory with stride will not degrade performance as much as it did for the CPU version
    int j=instance().y;
    int threadID=instance().x;
    int k=0;
    int l=0;
    float4 acc=float4(0.f,0.f,0.f,0.f);
    float f=freq[threadID];
    double period=periods[threadID];//(double)sub_buffer_size/(double)f;
    int n_per=(int)f;
    for(k=0;k<n_per;k++){
        l=(int)(period*(double)k + 0.5);
        l+=(4*j);//R: index to data array computed
        acc.x+=src[l];
        acc.y+=src[l+1];
        acc.z+=src[l+2];
        acc.w+=src[l+3];
    }
    dest=acc;
}

working kernel:

kernel void GPU_fetch_array_kernel94t(float src[],int offsets[][],float freq[],out float4 dest<>){
    int j=instance().y;
    int threadID=instance().x;
    int k=0;
    int l=0;
    float4 acc=float4(0.f,0.f,0.f,0.f);
    float f=freq[threadID];
    //double period=periods[threadID];//(double)sub_buffer_size/(double)f;
    int n_per=(int)f;
    for(k=0;k<n_per;k++){
        l=offsets[threadID][k];
        l+=(4*j);//R: index to data array computed
        acc.x+=src[l];
        acc.y+=src[l+1];
        acc.z+=src[l+2];
        acc.w+=src[l+3];
    }
    dest=acc;
}

preparation of offsets and kernel call:

unsigned int sb_size=__min(max_threads_per_block_fetch,thread_num-t_offset);
int* offsets=new int[sb_size*512];//R: should be max possible value of n_per
if(offsets==NULL) fprintf(stderr,"ERROR: offsets==NULL\n");
for(int tid=0;tid<sb_size;tid++){
    for(int j=0;j<(int)freqs[tid];j++)
        offsets[tid*512+j]=(int)(periods[tid]*j + 0.5);
}
unsigned offsets_buf[]={512,sb_size};
brook::Stream<int> gpu_offsets(2,offsets_buf);
gpu_offsets.read(offsets);
GPU_fetch_array_kernel94t(gpu_data,gpu_offsets,gpu_freqs,*gpu_temp);
gpu_temp->finish();
if(gpu_temp->error()){
    fprintf(stderr,"\nERROR: GPU_fetch_array_kernel74t: %s",gpu_temp->errorLog());
    fprintf(stderr,"gpu_temp size:%u x %u = %u elements of type float4\n",buf_size[0],buf_size[1],buf_size[0]*buf_size[1]);
}
delete[] offsets;
Originally posted by: scollange I confirm that there is something wrong with the implementation of double-precision division. It assumes the presence of a fused multiply-add in hardware, which is not the case as far as I know.
However I didn't find examples of catastrophic loss of accuracy.
Have you found any example with a loss of accuracy? According to the documentation, the double-precision division should yield the correctly rounded (to nearest) result.
I don't think your assertion that the implementation needs a fused multiply-add in hardware is correct. I had a look at the disassembly: it uses an iterative algorithm for the reciprocal, i.e. the division c = a/b is done as c = a * (1/b), with the iterations refining the approximation of the reciprocal 1/b. One property of that algorithm is the self-healing of small errors introduced by the truncated intermediate result of the (non-fused) MADD.
That alone would still produce a 1 ulp error (probably more in corner cases), but there is an added correction step applied directly to the a/b result at the end. This should reduce the error to at most 1 ulp in all cases. I don't have much experience proving such things, but because the last two MADDs (from the final correction step) are arranged so that the truncations work in opposite directions, I wouldn't exclude the possibility of a correctly rounded result, as the documentation claims.
kernel void ddiv(double a<>, double b<>, out double c<>) {
    c=a/b;
}

; -------- Disassembly --------------------
00 TEX: ADDR(96) CNT(2) VALID_PIX
      0  SAMPLE R1.xy__, R0.xyxx, t1, s0  UNNORM(XYZW)
      1  SAMPLE R0.xy__, R0.xyxx, t0, s0  UNNORM(XYZW)
01 ALU: ADDR(32) CNT(49)
      2  x: FREXP_64    ____, R1.y
         y: FREXP_64    ____, R1.x
         z: FREXP_64    ____, 0.0f
         w: FREXP_64    ____, 0.0f
         t: MOV         R2.z, 0.0f
      3  x: F64_TO_F32  ____, PV2.w
         y: F64_TO_F32  ____, PV2.z
         w: SUB_INT     T1.w, 0.0f, PV2.y
      4  x: F32_TO_F64  T0.x, -1.0f
         y: F32_TO_F64  T0.y, 0.0f
         t: RCP_e       ____, PV3.x
      5  x: F32_TO_F64  T1.x, 1.0f
         y: F32_TO_F64  T1.y, 0.0f
         z: F32_TO_F64  T0.z, PS4
         w: F32_TO_F64  T0.w, 0.0f
      6  x: MUL_64      T2.x, T0.y, R1.y
         y: MUL_64      T2.y, T0.y, R1.y
         z: MUL_64      ____, T0.y, R1.y
         w: MUL_64      ____, T0.x, R1.x
      7  x: LDEXP_64    T0.x, T0.w, T1.w
         y: LDEXP_64    T0.y, T0.z, T1.w
      8  x: MULADD_64   ____, T2.y, PV7.y, T1.y
         y: MULADD_64   ____, T2.y, PV7.y, T1.y
         z: MULADD_64   ____, T2.y, PV7.y, T1.y
         w: MULADD_64   ____, T2.x, PV7.x, T1.x
      9  x: MULADD_64   T0.x, PV8.y, T0.y, T0.y
         y: MULADD_64   T0.y, PV8.y, T0.y, T0.y
         z: MULADD_64   ____, PV8.y, T0.y, T0.y
         w: MULADD_64   ____, PV8.x, T0.x, T0.x
     10  x: MULADD_64   ____, T2.y, PV9.y, T1.y
         y: MULADD_64   ____, T2.y, PV9.y, T1.y
         z: MULADD_64   ____, T2.y, PV9.y, T1.y
         w: MULADD_64   ____, T2.x, PV9.x, T1.x
     11  x: MULADD_64   T0.x, PV10.y, T0.y, T0.y
         y: MULADD_64   T0.y, PV10.y, T0.y, T0.y
         z: MULADD_64   ____, PV10.y, T0.y, T0.y
         w: MULADD_64   ____, PV10.x, T0.x, T0.x
     12  x: MUL_64      T1.x, R0.y, PV11.y
         y: MUL_64      T1.y, R0.y, PV11.y
         z: MUL_64      ____, R0.y, PV11.y
         w: MUL_64      ____, R0.x, PV11.x
     13  x: MULADD_64   ____, T2.y, PV12.y, R0.y
         y: MULADD_64   ____, T2.y, PV12.y, R0.y
         z: MULADD_64   ____, T2.y, PV12.y, R0.y
         w: MULADD_64   ____, T2.x, PV12.x, R0.x
     14  x: MULADD_64   R2.x, PV13.y, T0.y, T1.y
         y: MULADD_64   R2.y, PV13.y, T0.y, T1.y
         z: MULADD_64   ____, PV13.y, T0.y, T1.y
         w: MULADD_64   ____, PV13.x, T0.x, T1.x
02 EXP_DONE: PIX0, R2.xyzz
END_OF_PROGRAM
Originally posted by: Gipsel: Have you found any example with a loss of accuracy? According to the documentation the double precision division should yield the correctly rounded (to nearest) result.
Yes, I ran some quick tests with random values some time ago and found errors of up to 1 ulp. I didn't try hard-to-round cases specifically, however.
I had a look at the disassembly: it uses an iterative algorithm for the reciprocal, i.e. the division c = a/b is done as c = a * (1/b), with the iterations refining the approximation of the reciprocal 1/b. One property of that algorithm is the self-healing of small errors introduced by the truncated intermediate result of the (non-fused) MADD.
That alone would still produce a 1 ulp error (probably more in corner cases), but there is an added correction step applied directly to the a/b result at the end.
So, below is my disassembled version of the code, with some minor changes for readability.
It is definitely a variation of a typical algorithm from Peter Markstein, as used on the Itanium and PowerPC. It would not have made sense to write the iteration this way unless the author assumed the presence of an FMA.
I agree that the error is probably not more than 1 or 2 ulps, but then there are faster ways to compute an answer accurate to 1 ulp...
double amd_div(double a, double b)
{
    int b_exp;
    double b_mant = frexp(b, &b_exp);            // Split mantissa/exponent
    double y0 = (double)(1.0f / (float)b_mant);  // Reciprocal approx, ~22 bits
    y0 = ldexp(y0, -b_exp);                      // Give back exponent
    double e = -b * y0 + 1.0;                    // 1st Newton-Markstein iteration
    double y1 = e * y0 + y0;
    double e2 = -b * y1 + 1.0;                   // 2nd Newton-Markstein iteration
    double y2 = e2 * y1 + y1;
    double q = a * y2;
    double r = -b * q + a;                       // Rounding step, Markstein theorem
    return r * y2 + q;
}