cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Bdot
Adept III

Kernel compiler problems in 11.10 - and cannot get rid of it

Windows 7-64, Phenom x4-955, HD 5770, APP SDK 2.5

 

After upgrading from 11.9 to 11.10 my kernels are not compiled anymore:

BUILD OUTPUT C:\Users\root\AppData\Local\Temp\OCLCEF5.tmp.cl(2192): error: more than one instance of overloaded function "mad24" matches the argument list: function "mad24(int, int, int) C++" function "mad24(uint, uint, uint) C++" argument types are: (uint, int, uint) *res_hi = mad24(mul_hi(a,b), 256, (*res_lo >> 24)); ^ ...

 

While this is rather easy to solve (change 256 to 256u), most of the compiled kernels now do not receive one of their parameters, a struct of 6 uints. Consequently, they produce bad results.

 

Now the really weird part: I cannot get rid of the 11.10 bugs. I uninstalled Catalyst 11.10 (along with APP SDK) and installed 11.9 (with and without prior reboot) - both the typed constant as well as the missing parameter problems remain.

I tried uninstalling catalyst, uninstalling the Graphics card from device manager (including deletion of driver files), deleting all files starting with ati from \windows\system32 and the wow6432... subdir, removing ATI* and AMD directories from users and program files and removing all registry keys mentioning AMD or ATI. Reboot, install 11.9 and both problems are back.

And finally, I installed a new Windows 7-64 copy in another partition, installed 11.9 from the same download, and voila, now my program runs fine, even without the 256 ->256u change.

 

Question 1: On my original installation, where is 11.10 hiding to make Catalyst 11.9 fail as well? APP SDK does not seem to be required anymore ... would installing 2.5 help?

Question 2: Any suggestion what I could try to get all my parameters delivered to the kernel?

 

Thanks for any help. I already spent so much time on this ... and am starting to feel a little frustrated 😛

 

0 Likes
12 Replies

the dll you are interested in is amdocl[32|64].dll
0 Likes

Thanks a lot, Micah!

 

Deinstalling 11.10, removing system32\amdocl64.dll, system32\amdoclcl64.dll, syswow64\amdocl.dll and syswow64\amdoclcl.dll, and then installing 11.9 did the trick, now my program runs again.

0 Likes

The same issues exist in 11.11. While the necessity to define a type for constants could be called a feature, I wonder how I could troubleshoot the missing parameter issue.

 

It's a kernel like the attached. b_in is initialized on the host with one component non-zero, the rest zero. In the kernel, all 6 components are received as zero. Setting the kernel parameter does not return an error code.

Will clSetKernelArg still copy the argument with 11.10 and 11.11? Or is it now required to keep the variable intact in the host program until the kernel is actually enqueued?

 

/* 72bit (3x 24bit) integer D=d0 + d1*(2^24) + d2*(2^48) */ typedef struct _int72_t { uint d0,d1,d2; }int72_t; /* 144bit (6x 24bit) integer D=d0 + d1*(2^24) + d2*(2^48) + ... */ typedef struct _int144_t { uint d0,d1,d2,d3,d4,d5; }int144_t; __kernel void mfakto_cl_71_4(uint exp, __private int72_t k_base, __global uint * restrict k_tab, __private int shiftcount, __private int144_t b_in, __global uint * restrict RES)

0 Likes

How are you passing a private data type (int72_t & int144_t) to a kernel from the host?

Surely you can only pass a constant or __global memory data type to a kernel from the host?

0 Likes

Bdot,
make sure that your host side structure uses cl_uint and not unsigned int.
0 Likes

Hi Micah,

 

thanks for the hint, but that did not help. No wonder as both are typedef'd to unsigned __int32, or 32-bit unsigned int.

So it remains the fact that some non-builtin data types are not correctly copied to the kernels since 11.10.

Would it be of some help if the community provided programs that you could add to the driver test suite? So that bugs in the drivers could be detected before releasing the drivers? For my program I (fortunately) have a test option that tests all of the program kernels with various inputs and compares them to well-known results. The malfunction of 11.10 and 11.11 immediately shows up.

 

( Just thinking how to improve the situation with the driver quality ...)


@antzrhere: The clSetKernelArg description mentions no limitation for the parameter size, nor any limitation to builtin data types. Therefore, passing a int144_t parameter to the kernel (by passing its address to clSetKernelArg) should work:

int144_t b = {0};

b.d4=16;

status = clSetKernelArg(l_kernel, 4, sizeof(int144_t), (void *) &b);

 

Interestingly, the int72_t parameter is passed on correctly ...

Best regards,

Bdot

0 Likes

Hi BDot,

It would be nice if you can paste your testcase, showing the issue.

[EDIT]I tried creating a small testcase, and it appears that I am able to pass user defined data types and get correct value.

Although as I am using a internal driver which might not be having the issue you describe, it would be still helpful if you can provide your code, so we can be sure that the problem is fixed.

0 Likes

Hi Himanshu,

 

The problem does not occur for each kernel, I mean, there is one kernel with the same set of parameters that receives them all correctly. But for the other kernels the error is clearly reproducible.

 

The binaries as well as the source code can be found at

http://www.mersenneforum.org/showthread.php?t=15646&page=6#143

A newer version of the kernel file, adjusted for the new requirement to specify a type for literals is attached.

If you just start the binary without parameters, it will perform a reduced selftest (full self test with -st). If the selftest fails anywhere, you'll see a line like:

ERROR: selftest failed for M53134687 (mfakto_cl_71_4)

The failed kernel name is in parenthesis, "mfakto_cl_71_4" in this case.

On the CPU, or when running Catalyst up to 11.9, there is no error.

To see that a parameter passed in is wrong, I enabled the kernels trace by using

#define TRACE_KERNEL 2

in mfakto_Kernels.cl [40].

Then, the kernel mfakto_cl_71_4 will print a line starting with "mfakto_cl_71_4: b_in=". This parameter must never be 0:0:0:0:0:0, but it is under 11.10 and 11.11 on a GPU.

 

Thanks a lot for any help!

/* This file is part of mfaktc (mfakto). Copyright (C) 2009, 2010, 2011 Oliver Weihe Bertram Franz mfaktc (mfakto) is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. mfaktc (mfakto) is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with mfaktc (mfakto). If not, see <http://www.gnu.org/licenses/>. */ /* All OpenCL kernels for mfakto Trial-Factoring is 2^p-1 divisible by q (q=2kp+1)? Remove Optional Square top bit mul by 2 mod 47 ------------ ------- ------------- ------ 1*1 = 1 1 0111 1*2 = 2 2 2*2 = 4 0 111 no 4 4*4 = 16 1 11 16*2 = 32 32 32*32 = 1024 1 1 1024*2 = 2048 27 27*27 = 729 1 729*2 = 1458 1 Thus, 2^23 = 1 mod 47. Subtract 1 from both sides. 2^23-1 = 0 mod 47. Since we've shown that 47 is a factor, 2^23-1 is not prime. */ // TRACE_KERNEL: higher is more trace, 0-5 currently used //#define TRACE_KERNEL 2 // If above tracing is on, only the thread with the ID below will trace #define TRACE_TID 0 // defines how many factor candidates the barrett kernels will process in parallel per thread // this is now defined via commandline to the OpenCL compiler //#define BARRETT_VECTOR_SIZE 4 /*********************************** * DONT CHANGE ANYTHING BELOW THIS * ***********************************/ #if (TRACE_KERNEL > 0) || defined (CHECKS_MODBASECASE) // available on all platforms so far ... #pragma OPENCL EXTENSION cl_amd_printf : enable //#pragma OPENCL EXTENSION cl_khr_fp64 : enable #endif // HD4xxx does not have atomics, but mfakto will work on these GPUs as well. // Without atomics, the factors found may be scrambled when more than one // factor is found per grid => if the reported factor(s) are not accepted // by primenet, then run the bitlevel again with the smallest possible grid size, // or run it on at least HD5... #ifdef cl_khr_global_int32_base_atomics #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable #define ATOMIC_INC(x) atom_inc(&x) #else //#pragma message "No atomic operations available - using simple ++" #define ATOMIC_INC(x) ((x)++) #endif /* 96bit (3x 32bit) integer D= d0 + d1*(2^32) + d2*(2^64) */ typedef struct _int96_1t { uint d0,d1,d2; }int96_1t; /* 192bit (6x 32bit) integer D=d0 + d1*(2^32) + d2*(2^64) + ... */ typedef struct _int192_1t { uint d0,d1,d2,d3,d4,d5; }int192_1t; #ifdef BARRETT_VECTOR_SIZE #include "barrett.cl" // one kernel file for different vector sizes (1, 2, 4, 8, 16) #else /**************************************** **************************************** * 32-bit-stuff for the 92/96-bit-kernel * Fallback to old no-vector-implementation ... **************************************** ****************************************/ /* 96bit (3x 32bit) integer D= d0 + d1*(2^32) + d2*(2^64) */ typedef struct _int96_t { uint d0,d1,d2; }int96_t; /* 192bit (6x 32bit) integer D=d0 + d1*(2^32) + d2*(2^64) + ... */ typedef struct _int192_t { uint d0,d1,d2,d3,d4,d5; }int192_t; void div_192_96(int96_1t *res, int192_1t q, int96_1t n, float nf); void div_160_96(int96_1t *res, int192_1t q, int96_1t n, float nf); void mul_96(int96_1t *res, int96_1t a, int96_1t b); void mul_96_192_no_low2(int192_1t *res, int96_1t a, int96_1t b); void mul_96_192_no_low3(int192_1t *res, int96_1t a, int96_1t b); #endif // Barrett #ifdef CL_GPU_SIEVE #include "sieve.cl" #endif /**************************************** **************************************** * 24-bit-stuff for the 71-bit-kernel * **************************************** ****************************************/ #define EVAL_RES(comp) \ if((a.d2.comp|a.d1.comp)==0 && a.d0.comp==1) \ { \ if ((f.d2.comp|f.d1.comp)!=0 || f.d0.comp != 1) \ { \ tid=ATOMIC_INC(RES[0]); \ if(tid<10) \ { \ RES[tid*3 + 1]=f.d2.comp; \ RES[tid*3 + 2]=f.d1.comp; \ RES[tid*3 + 3]=f.d0.comp; \ } \ } \ } /* 72bit (3x 24bit) integer D=d0 + d1*(2^24) + d2*(2^48) */ typedef struct _int72_t { uint d0,d1,d2; }int72_t; /* 144bit (6x 24bit) integer D=d0 + d1*(2^24) + d2*(2^48) + ... */ typedef struct _int144_t { uint d0,d1,d2,d3,d4,d5; }int144_t; /**************************************** **************************************** * 64-bit-stuff for the 64-bit-kernel * **************************************** ****************************************/ void square_64_128(ulong *res_hi, ulong *res_lo, const ulong in #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { *res_hi = mul_hi(in, in); *res_lo = in * in; #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf ("square_64_128: %llx ^ 2 = %llx : %llx\n", in, *res_hi, *res_lo); #endif } int gte_128(ulong v1_hi, ulong v1_lo, ulong v2_hi, ulong v2_lo) { if (v1_hi == v2_hi) return (v1_lo >= v2_lo); return (v1_hi >= v2_hi); } void sub_128(ulong *v1_hi, ulong *v1_lo, ulong v2_hi, ulong v2_lo #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf ("sub_128: %llx:%llx - %llx:%llx = ", *v1_hi, *v1_lo, v2_hi, v2_lo); #endif *v1_hi = *v1_hi - v2_hi - ((*v1_lo < v2_lo) ? 1 : 0); *v1_lo = *v1_lo - v2_lo; #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf ("%llx:%llx\n", *v1_hi, *v1_lo); #endif } void sub_if_gte_128(ulong *v1_hi, ulong *v1_lo, ulong v2_hi, ulong v2_lo #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { /* if (v1 >= v2) v1=v1-v2 */ ulong tmp_hi, tmp_lo; #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf ("sub_if_gte_128: %llx:%llx - %llx:%llx = ", *v1_hi, *v1_lo, v2_hi, v2_lo); #endif tmp_lo = *v1_lo - v2_lo; tmp_hi = *v1_hi - v2_hi - ((*v1_lo < v2_lo) ? 1 : 0); *v1_hi = (tmp_hi > *v1_hi) ? *v1_hi : tmp_hi; *v1_lo = (tmp_hi > *v1_hi) ? *v1_lo : tmp_lo; #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf ("%llx:%llx\n", *v1_hi, *v1_lo); #endif } void square_96_192_64(ulong *res_hi, ulong *res_mid, ulong *res_lo, const ulong in_hi, const ulong in_lo #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { __private ulong tmp1, tmp2; // (in_hi + in_lo) ^2 = in_hi^2 + 2*in_hi*in_lo + in_lo^2 // PERF: better when using private copies for *res*? // PERF: better using 32-bit parts? Or 24-bit? *res_lo = in_lo * in_lo; tmp1 = in_lo * in_hi; tmp2 = mul_hi(in_lo, in_lo); *res_mid = tmp1 << 1; *res_hi = in_hi * in_hi + ((*res_mid < tmp1) ? 1 : 0) + // "carry" from previous left-shift (mul_hi(in_lo, in_hi) << 1); // shift cannot overflow as in_hi uses only 32 of 64 bit. *res_mid = *res_mid + tmp2; *res_hi = *res_hi + ((*res_mid < tmp2) ? 1 : 0); // "carry" from above #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf ("square_96_192: %llx : %llx ^ 2 = %llx : %llx : %llx\n", in_hi, in_lo, *res_hi, *res_mid, *res_lo); #endif } // modulo by division ulong mod_128_64_d(__private ulong hi, __private ulong lo, const ulong q, const uint lshift #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { // some day I'll implement a fast 64-bit and a fast 96-bit kernel return 0; } // modulo by shift - cmp - sub ulong mod_128_64_s(__private ulong hi, __private ulong lo, const ulong q, const uint lshift #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { __private int i = clz(q) - clz(hi); // hi is i bitpositions larger than q; __const ulong mask= 0x8000000000000000 ; // first bit of ulong #if (TRACE_KERNEL > 2) if (q&mask) { if (tid==TRACE_TID) printf("ERROR: q >= 2^63: %llx (mask=%llx)\n", q, mask); } if (tid==TRACE_TID) printf("mod_128_64_s: i=%d: hi=%llx, lo=%llx, q=%llx, mask=%llx, shift=%u\n", i, hi, lo, q, mask, lshift); #endif #ifdef BETTER_BE_SAFE_THAN_SORRY __private ulong a = q << ( (i>0) ? i : 0); // a = q shifted to ~same magnitude as hi for ( ; i>0 ; i--) { hi = hi - ( (hi>a) ? a : 0 ); // subtract multiples of q a = a >> 1; // slowly shift back until we have q again #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_128_64_s: i=%d: hi= %llx, a=%llx\n", i, hi, a); #endif } #endif #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_128_64_s: hi= %llx, lo=%llx, q=%llx\n", hi, lo, q); #endif for (i=0; i<64; i++) //process the 64 bits of lo. PERF: unroll loop later { hi = (hi << 1) + ( (lo & mask) ? 1 : 0); // PERF: mad(2,hi,(lo & mask) ? 1 : 0) faster? lo = lo << 1; hi = hi - ( (hi>q) ? q : 0 ); // subtract q #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_128_64_s: i=%d: hi= %llx, lo=%llx, a=%llx\n", i, hi, lo, q); #endif } hi = hi << lshift; hi = hi - ( (hi>q) ? q : 0 ); // subtract q #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf("mod_128_64_s: return %llx\n", hi); #endif return hi; } // modulo by shift - cmp - sub void mod_192_96_s(__private ulong hi, __private ulong mid, __private ulong lo, __private ulong q_hi, __private ulong q_lo, __private uint lshift, __private ulong *r_hi, __private ulong *r_lo #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { __private long i = clz(q_hi) - clz(hi); // hi is i bitpositions larger than q (q at least 2^63) const ulong mask= 0x8000000000000000 ; // first bit of ulong __private ulong a_hi, a_lo; #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_192_96_s: i=%d: %llx:%llx:%llx mod %llx:%llx, shift=%u\n", i, hi, mid, lo, q_hi, q_lo, lshift); #endif if (i>0) { a_hi = (q_hi << i) | (q_lo >> (64-i)); // a = q shifted to ~same magnitude as hi a_lo = q_lo << i; #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_192_96_s: i=%lld: hi= %llx:%llx, a=%llx:%llx\n", i, hi, mid, a_hi, a_lo); #endif } for ( ; i>0 ; i--) { sub_if_gte_128(&hi, &mid, a_hi, a_lo #if (TRACE_KERNEL > 1) , tid #endif ); // subtract multiples of q a_lo = (a_lo >> 1) | (a_hi << 63); // slowly shift back until we have q again a_hi = a_hi >> 1; #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_192_96_s: i=%d: hi= %llx:%llx, a=%llx:%llx\n", i, hi, mid, a_hi, a_lo); #endif } sub_if_gte_128(&hi, &mid, q_hi, q_lo #if (TRACE_KERNEL > 1) , tid #endif ); // subtract q #if (TRACE_KERNEL > 2) if (tid==TRACE_TID) printf("mod_192_96_s: i=%d: hi= %llx:%llx, a=%llx:%llx\n", i, hi, mid, q_hi, q_lo); #endif //#pragma unroll 2 for (i=0; i<64; i++) //process the 64 bits of lo. { hi = (hi << 1) + ( (mid & mask) ? 1 : 0); // PERF: mad(2,hi,(lo & mask) ? 1 : 0) faster? or lo >> 63? mid = (mid << 1) + ( (lo & mask) ? 1 : 0); // PERF: mad(2,hi,(lo & mask) ? 1 : 0) faster? or lo >> 63? lo = lo << 1; sub_if_gte_128(&hi, &mid, q_hi, q_lo #if (TRACE_KERNEL > 1) , tid #endif ); // subtract q #if (TRACE_KERNEL > 3) if (tid==TRACE_TID) printf("mod_192_96_s: i=%d: hi= %llx, mid=%llx, lo=%llx, q=%llx:%llx\n", i, hi, mid, lo, q_hi, q_lo); #endif } if (lshift) { hi = (hi << 1) + ( (mid & mask) ? 1 : 0); // PERF: mad(2,hi,(lo & mask) ? 1 : 0) faster? or lo >> 63? mid = (mid << 1); sub_if_gte_128(&hi, &mid, q_hi, q_lo #if (TRACE_KERNEL > 1) , tid #endif ); // subtract q } *r_hi = hi; *r_lo = mid; #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf("mod_192_96_s: return %llx:%llx\n", hi, mid); #endif } __kernel void mfakto_cl_64(__private uint exp, __private ulong k_base, __global uint *k_tab, __private ulong4 b_pre_shift, __private int bit_max64, __global uint *RES) /* shiftcount is used for precomputing without mod a is precomputed on host ONCE. */ { // __private long shiftcount = b_pre_shift.w; // how many bits of exp are left to be processed __private ulong pp, hi, lo, k, q, r; __private uint tid, mask; // __private float qr; /* a little less than 1/q */ tid = get_global_id(0)+get_global_size(0)*get_global_id(1); pp = exp; // 32 -> 64 bit k = k_tab[tid]; // 32 -> 64 bit k = k*4620 + k_base; // NUM_CLASSES q = (pp<<1) * k + 1; // q = 2*k*exp+1 // the first bits of exp are processed on the host w/o modulo, // as the result of the squaring was less than the FC anyway. /* now only to bit_max^2 of the kernel // preprocessing is now done as long as it fits into 192 bits w/o modulo mod_192_96_s(b_pre_shift.z, b_pre_shift.y, b_pre_shift.x, q, 0, 0, &hi, &lo // 192 bit with q < 2^64 is more than the mod can handle #if (TRACE_KERNEL > 1) // Do a mod (q<<64) first to bring b down to 128 bits , tid #endif ); // initial modulo of the precomputed residue */ r = mod_128_64_s(b_pre_shift.y, b_pre_shift.x, q, 0 #if (TRACE_KERNEL > 1) , tid #endif ); // r = hi:lo % q // and now again the real modulo mask = 1<<(b_pre_shift.w); /* the 1 in mask now points to the bit-pos after the first modulo was necessary */ #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf("mfakto_cl_64: tid=%ld: p=%llx, k=%llx, q=%llx, mask=%llx, r=%llx\n", tid, pp, k, q, mask, r); #endif while (mask) { square_64_128(&hi, &lo, r #if (TRACE_KERNEL > 1) , tid #endif ); /*hi:lo = (r * r); */ r = mod_128_64_s(hi, lo, q, ( pp&mask ) ? 1 : 0 #if (TRACE_KERNEL > 1) , tid #endif ); // r = hi:lo << 0 or 1 % q mask = mask >> 1; // next bit of p #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf("mfakto_cl_64: q=%llx, mask=%llx, r=%llx\n", q, mask, r); #endif } #if (TRACE_KERNEL > 0) if (tid==TRACE_TID) printf("mfakto_cl_64: tid=%ld: q=%llx, k=%llx, r=%llx\n", tid, q, k, r); #endif /* finally check if we found a factor and write the factor to RES[] */ if(r==1) { #if (TRACE_KERNEL > 0) // will trace for any thread printf("mfakto_cl_64: tid=%ld found factor: q=%llx, k=%llx, r=%llx\n", tid, q, k, r); #endif tid=ATOMIC_INC(RES[0]); if(tid<10) /* limit to 10 factors per class */ { RES[tid*3 + 1]= 0; RES[tid*3 + 2]= (uint) (q >> 32); RES[tid*3 + 3]= (uint) q & 0xFFFFFFFF; } } } // this kernel is only used for a quick test at startup - no need to be correct 😉 // currently this kernel is used for testing what happens without atomics when multiple factors are found __kernel void mod_128_64_k(const ulong hi, const ulong lo, const ulong q, const float qr, __global uint *res #if (TRACE_KERNEL > 1) , __private uint tid #endif ) { __private uint i,f; f = get_global_id(0); f++; // let the reported results start with 1 // barrier(CLK_GLOBAL_MEM_FENCE); if (1 == 1) { i=ATOMIC_INC(res[0]); //#pragma OPENCL EXTENSION cl_amd_printf : enable // printf("thread %d: i=%d, res[0]=%d\n", get_global_id(0), i, res[0]); if(i<10) /* limit to 10 results */ { res[i*3 + 1]=f; res[i*3 + 2]=f; res[i*3 + 3]=f; } } } __kernel void mfakto_cl_95(__private uint exp, __private ulong k_base, __global uint *k_tab, __private ulong4 b_pre_shift, __private int bit_max64, __global uint *RES) /* shiftcount is used for precomputing without mod a is precomputed on host ONCE. */ { // __private long shiftcount = b_pre_shift.w; // how many bits of exp are left to be processed __private ulong q_lo, q_hi, r_lo, r_hi, pp, hi, mid, lo, k; __private uint tid, mask; tid = get_global_id(0)+get_global_size(0)*get_global_id(1); pp = exp; // 32 -> 64 bit k = k_tab[tid]; // 32 -> 64 bit k = k*4620 + k_base; // NUM_CLASSES q_lo = pp<<1; q_hi = mul_hi(q_lo, k); q_lo = q_lo * k + 1; // q = 2*k*exp+1 lo = b_pre_shift.x; // the first bits of exp are processed on the host w/o modulo, mid = b_pre_shift.y; // as the result of the squaring was less than the FC anyway. hi = b_pre_shift.z; // preprocessing is done as long as it fits into 192 bits w/o modulo mod_192_96_s(hi, mid, lo, q_hi, q_lo, 0, &r_hi, &r_lo #if (TRACE_KERNEL > 1) , tid #endif ); // initial modulo of the precomputed residue mask = 1<<(b_pre_shift.w); /* the 1 on mask now points to the bit-pos after the first modulo was necessary */ #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf("mfakto_cl_95: tid=%ld: p=%llx, k=%llx, q=%llx:%llx, mask=%llx, r=%llx:%llx\n", tid, pp, k, q_hi, q_lo, mask, r_hi, r_lo); #endif while (mask) { square_96_192_64(&hi, &mid, &lo, r_hi, r_lo #if (TRACE_KERNEL > 1) , tid #endif ); /*hi:mid:lo = (r * r); */ mod_192_96_s(hi, mid, lo, q_hi, q_lo, ( exp&mask ) ? 1 : 0, &r_hi, &r_lo #if (TRACE_KERNEL > 1) , tid #endif ); mask = mask >> 1; #if (TRACE_KERNEL > 1) if (tid==TRACE_TID) printf("mfakto_cl_95: q=%llx:%llx, mask=%llx, r=%llx:%llx\n", q_hi, q_lo, mask, r_hi, r_lo); #endif } #if (TRACE_KERNEL > 0) if (tid==TRACE_TID) printf("mfakto_cl_95: tid=%ld: q=%llx:%llx, k=%llx, r=%llx:%llx\n", tid, q_hi, q_lo, k, r_hi, r_lo); #endif /* finally check if we found a factor and write the factor to RES[] */ if((r_hi==0) && (r_lo==1)) { #if (TRACE_KERNEL > 0) // trace this for any thread printf("mfakto_cl_95: tid=%ld found factor: q=%llx:%llx, k=%llx, r=%llx:%llx\n", tid, q_hi, q_lo, k, r_hi, r_lo); #endif tid=ATOMIC_INC(RES[0]); if(tid<10) /* limit to 10 factors per class */ { RES[tid*3 + 1]= (uint) q_hi & 0xFFFFFFFF; RES[tid*3 + 2]= (uint) (q_lo >> 32); RES[tid*3 + 3]= (uint) q_lo & 0xFFFFFFFF; } } } /*===========uint exp, ulong k_base, __global uint *k_tab, ulong4 b_pre_shift, __global uint *RES============================ __kernel void mfakto_cl_barrett92_64(__private uint exp, __private ulong k_base, __global uint *k_tab, __private ulong4 b_pre_shift, __private int bit_max64, __global uint *RES)*/ /* shiftcount is used for precomputing without mod a is precomputed on host ONCE. bit_max64 is bit_max - 64! */ #ifndef BARRETT_VECTOR_SIZE /**************************************** **************************************** * 32-bit based 79- and 92-bit barrett-kernels * **************************************** ****************************************/ int cmp_ge_96(int96_1t a, int96_1t b) /* checks if a is greater or equal than b */ { if(a.d2 == b.d2) { if(a.d1 == b.d1)return(a.d0 >= b.d0); else return(a.d1 > b.d1); } else return(a.d2 > b.d2); } void sub_96(int96_1t *res, int96_1t a, int96_1t b) /* a must be greater or equal b! res = a - b */ { /* res->d0 = __sub_cc (a.d0, b.d0); res->d1 = __subc_cc(a.d1, b.d1); res->d2 = __subc (a.d2, b.d2); */ uint carry= b.d0 > a.d0; res->d0 = a.d0 - b.d0; res->d1 = a.d1 - b.d1 - (carry ? 1 : 0); carry = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry); res->d2 = a.d2 - b.d2 - (carry ? 1 : 0); } int96_1t sub_if_gte_96(int96_1t a, int96_1t b) /* return (a>b)?a-b:a */ { int96_1t tmp; /* do the subtraction and use tmp.d2 to decide if the result is valid (if a was > b) */ uint carry= b.d0 > a.d0; tmp.d0 = a.d0 - b.d0; tmp.d1 = a.d1 - b.d1 - (carry ? 1 : 0); carry = (tmp.d1 > a.d1) || ((tmp.d1 == a.d1) && carry); tmp.d2 = a.d2 - b.d2 - (carry ? 1 : 0); return (tmp.d2 > a.d2) ? a : tmp; } void mul_96(int96_1t *res, int96_1t a, int96_1t b) /* res = a * b */ { /* res->d0 = __umul32 (a.d0, b.d0); res->d1 = __add_cc(__umul32hi(a.d0, b.d0), __umul32 (a.d1, b.d0)); res->d2 = __addc (__umul32 (a.d2, b.d0), __umul32hi(a.d1, b.d0)); res->d1 = __add_cc(res->d1, __umul32 (a.d0, b.d1)); res->d2 = __addc (res->d2, __umul32hi(a.d0, b.d1)); res->d2+= __umul32 (a.d0, b.d2); res->d2+= __umul32 (a.d1, b.d1); */ uint tmp; res->d0 = a.d0 * b.d0; res->d1 = mul_hi(a.d0, b.d0); res->d2 = mul_hi(a.d1, b.d0); tmp = a.d1 * b.d0; res->d1 += tmp; res->d2 += (tmp > res->d1)? 1 : 0; res->d2 += mul_hi(a.d0, b.d1); tmp = a.d0 * b.d1; res->d1 += tmp; res->d2 += (tmp > res->d1)? 1 : 0; res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0; } void mul_96_192_no_low2(int192_1t *res, int96_1t a, int96_1t b) /* res ~= a * b res.d0 and res.d1 are NOT computed. Carry from res.d1 to res.d2 is ignored, too. So the digits res.d{2-5} might differ from mul_96_192(). In mul_96_192() are two carries from res.d1 to res.d2. So ignoring the digits res.d0 and res.d1 the result of mul_96_192_no_low() is 0 to 2 lower than of mul_96_192(). */ { /* res->d2 = __umul32 (a.d2, b.d0); res->d3 = __umul32hi(a.d2, b.d0); res->d2 = __add_cc (res->d2, __umul32hi(a.d1, b.d0)); res->d3 = __addc_cc(res->d3, __umul32 (a.d2, b.d1)); res->d4 = __addc ( 0, 0); res->d2 = __add_cc (res->d2, __umul32hi(a.d0, b.d1)); res->d3 = __addc_cc(res->d3, __umul32 (a.d1, b.d2)); res->d4 = __addc_cc(res->d4, __umul32hi(a.d1, b.d2)); res->d5 = __addc ( 0, 0); res->d2 = __add_cc (res->d2, __umul32 (a.d0, b.d2)); res->d3 = __addc_cc(res->d3, __umul32hi(a.d0, b.d2)); res->d4 = __addc_cc(res->d4, __umul32 (a.d2, b.d2)); res->d5 = __addc (res->d5, __umul32hi(a.d2, b.d2)); res->d2 = __add_cc (res->d2, __umul32 (a.d1, b.d1)); res->d3 = __addc_cc(res->d3, __umul32hi(a.d1, b.d1)); res->d4 = __addc_cc(res->d4, __umul32hi(a.d2, b.d1)); res->d5 = __addc (res->d5, 0); */ uint tmp; res->d2 = mul_hi(a.d1, b.d0); tmp = mul_hi(a.d0, b.d1); res->d2 += tmp; res->d3 = (tmp > res->d2)? 1 : 0; tmp = a.d2 * b.d0; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = a.d1 * b.d1; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = a.d0 * b.d2; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = mul_hi(a.d2, b.d0); res->d3 += tmp; res->d4 = (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d1, b.d1); res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d0, b.d2); res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = a.d2 * b.d1; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = a.d1 * b.d2; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d2, b.d1); res->d4 += tmp; res->d5 = (tmp > res->d4)? 1 : 0; tmp = mul_hi(a.d1, b.d2); res->d4 += tmp; res->d5 += (tmp > res->d4)? 1 : 0; tmp = a.d2 * b.d2; res->d4 += tmp; res->d5 += (tmp > res->d4)? 1 : 0; res->d5 += mul_hi(a.d2, b.d2); } void mul_96_192_no_low3(int192_1t *res, int96_1t a, int96_1t b) /* res ~= a * b res.d0, res.d1 and res.d2 are NOT computed. Carry to res.d3 is ignored, too. So the digits res.d{3-5} might differ from mul_96_192(). In mul_96_192() are four carries from res.d2 to res.d3. So ignoring the digits res.d0, res.d1 and res.d2 the result of mul_96_192_no_low() is 0 to 4 lower than of mul_96_192(). */ { /* res->d3 = __umul32hi(a.d2, b.d0); res->d3 = __add_cc (res->d3, __umul32 (a.d2, b.d1)); res->d4 = __addc ( 0, 0); res->d3 = __add_cc (res->d3, __umul32 (a.d1, b.d2)); res->d4 = __addc (res->d4, __umul32hi(a.d1, b.d2)); // no carry propagation to d5 needed: 0xFFFF.FFFF * 0xFFFF.FFFF + 0xFFFF.FFFF + 0xFFFF.FFFE = 0xFFFF.FFFF.FFFF.FFFE // res->d4 = __addc_cc(res->d4, __umul32hi(a.d1, b.d2)); // res->d5 = __addc ( 0, 0); res->d3 = __add_cc (res->d3, __umul32hi(a.d0, b.d2)); res->d4 = __addc_cc(res->d4, __umul32 (a.d2, b.d2)); // res->d5 = __addc (res->d5, __umul32hi(a.d2, b.d2)); res->d5 = __addc ( 0, __umul32hi(a.d2, b.d2)); res->d3 = __add_cc (res->d3, __umul32hi(a.d1, b.d1)); res->d4 = __addc_cc(res->d4, __umul32hi(a.d2, b.d1)); res->d5 = __addc (res->d5, 0); */ uint tmp; res->d3 = mul_hi(a.d2, b.d0); tmp = mul_hi(a.d1, b.d1); res->d3 += tmp; res->d4 = (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d0, b.d2); res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = a.d2 * b.d1; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = a.d1 * b.d2; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d2, b.d1); res->d4 += tmp; res->d5 = (tmp > res->d4)? 1 : 0; tmp = mul_hi(a.d1, b.d2); res->d4 += tmp; res->d5 += (tmp > res->d4)? 1 : 0; tmp = a.d2 * b.d2; res->d4 += tmp; res->d5 += (tmp > res->d4)? 1 : 0; res->d5 += mul_hi(a.d2, b.d2); } void square_96_192(int192_1t *res, int96_1t a) /* res = a^2 = a.d0^2 + a.d1^2 + a.d2^2 + 2(a.d0*a.d1 + a.d0*a.d2 + a.d1*a.d2) */ { /* highest possible value for x * x is 0xFFFFFFF9 this occurs for x = {479772853, 1667710795, 2627256501, 3815194443} Adding x*x to a few carries will not cascade the carry */ uint tmp; res->d0 = a.d0 * a.d0; res->d1 = mul_hi(a.d0, a.d0); tmp = a.d0 * a.d1; res->d1 += tmp; res->d2 = (tmp > res->d1)? 1 : 0; res->d1 += tmp; res->d2 += (tmp > res->d1)? 1 : 0; res->d2 += a.d1 * a.d1; // no carry possible tmp = mul_hi(a.d0, a.d1); res->d2 += tmp; res->d3 = (tmp > res->d2)? 1 : 0; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = a.d0 * a.d2; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = mul_hi(a.d1, a.d1); res->d3 += tmp; res->d4 = (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d0, a.d2); res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = a.d1 * a.d2; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; res->d4 += a.d2 * a.d2; // no carry possible tmp = mul_hi(a.d1, a.d2); res->d4 += tmp; res->d5 = (tmp > res->d4)? 1 : 0; res->d4 += tmp; res->d5 += (tmp > res->d4)? 1 : 0; res->d5 += mul_hi(a.d2, a.d2); } void square_96_160(int192_1t *res, int96_1t a) /* res = a^2 */ /* this is a stripped down version of square_96_192, it doesn't compute res.d5 and is a little bit faster. For correct results a must be less than 2^80 (a.d2 less than 2^16) */ { /* highest possible value for x * x is 0xFFFFFFF9 this occurs for x = {479772853, 1667710795, 2627256501, 3815194443} Adding x*x to a few carries will not cascade the carry */ uint tmp, TWOad2 = a.d2 << 1; // a.d2 < 2^16 so this always fits res->d0 = a.d0 * a.d0; res->d1 = mul_hi(a.d0, a.d0); tmp = a.d0 * a.d1; res->d1 += tmp; res->d2 = (tmp > res->d1)? 1 : 0; res->d1 += tmp; res->d2 += (tmp > res->d1)? 1 : 0; res->d2 += a.d1 * a.d1; // no carry possible tmp = mul_hi(a.d0, a.d1); res->d2 += tmp; res->d3 = (tmp > res->d2)? 1 : 0; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = a.d0 * TWOad2; res->d2 += tmp; res->d3 += (tmp > res->d2)? 1 : 0; tmp = mul_hi(a.d1, a.d1); res->d3 += tmp; res->d4 = (tmp > res->d3)? 1 : 0; tmp = mul_hi(a.d0, TWOad2); res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; tmp = a.d1 * TWOad2; res->d3 += tmp; res->d4 += (tmp > res->d3)? 1 : 0; res->d4 += a.d2 * a.d2; // no carry possible res->d4 += mul_hi(a.d1, TWOad2); } void shl_96(int96_1t *a) /* shiftleft a one bit */ { a->d2 = (a->d2 << 1) + (a->d1 >> 31); a->d1 = (a->d1 << 1) + (a->d0 >> 31); a->d0 = a->d0 << 1; } #undef DIV_160_96 #ifndef CHECKS_MODBASECASE void div_192_96(int96_1t *res, int192_1t q, int96_1t n, float nf) #else void div_192_96(int96_1t *res, int192_1t q, int96_1t n, float nf, uint *modbasecase_debug) #endif /* res = q / n (integer division) */ { float qf; uint qi, tmp, carry; int192_1t nn; int96_1t tmp96; /********** Step 1, Offset 2^75 (2*32 + 11) **********/ #ifndef DIV_160_96 qf= convert_float_rtz(q.d5); qf= qf * 4294967296.0f + convert_float_rtz(q.d4); #else #ifdef CHECKS_MODBASECASE q.d5 = 0; // later checks in debug code will test if q.d5 is 0 or not but 160bit variant ignores q.d5 #endif qf= convert_float_rtz(q.d4); #endif qf*= 2097152.0f; qi=convert_uint(qf*nf); MODBASECASE_QI_ERROR(1<<22, 1, qi, 0); res->d2 = qi << 11; // nn = n * qi nn.d2 = n.d0 * qi; nn.d3 = mul_hi(n.d0, qi); tmp = n.d1 * qi; nn.d3 += tmp; nn.d4 = (tmp > nn.d3)? 1 : 0; tmp = mul_hi(n.d1, qi); nn.d4 += tmp; #ifndef DIV_160_96 nn.d5 = (tmp > nn.d4)? 1 : 0; tmp = n.d2 * qi; nn.d4 += tmp; nn.d5 += (tmp > nn.d4)? 1 : 0; nn.d5 += mul_hi(n.d2, qi); #else nn.d4 += n.d2 * qi; #endif // shiftleft nn 11 bits #ifndef DIV_160_96 nn.d5 = (nn.d5 << 11) + (nn.d4 >> 21); #endif nn.d4 = (nn.d4 << 11) + (nn.d3 >> 21); nn.d3 = (nn.d3 << 11) + (nn.d2 >> 21); nn.d2 = nn.d2 << 11; // q = q - nn carry= (nn.d2 > q.d2); q.d2 = q.d2 - nn.d2; tmp = q.d3 - nn.d3 - (carry ? 1 : 0); carry= (tmp > q.d3) || (carry && (tmp == q.d3)); q.d3 = tmp; #ifndef DIV_160_96 tmp = q.d4 - nn.d4 - (carry ? 1 : 0); carry= (tmp > q.d4) || (carry && (tmp == q.d4)); q.d4 = tmp; q.d5 = q.d5 - nn.d5 - (carry ? 1 : 0); #else

0 Likes

An update: As a workaround I changed the transferred data type in question to cl_uint8, and copy the contents to the right data type inside the kernel.

This works in 11.11, so the problem really seems to be with custom data types.

0 Likes

Hi,

If you can give a simplified testcase, it can speed up the bug isolation process.

0 Likes

I tried stripping down my kernels, but then it is no longer reproducible (at least when I remove the calls to other functions from the kernel).

So the error seems to depend a certain complexity of the kernels.

Catalyst 11.12 does not change the picture (it behaves the same way as 11.10 and 11.11).

 

Sorry, no easy test case 😞

0 Likes

okay Bdot,

I will try to look into the testcase, in a few days.

I would suggest you to try debugging the kernel using gDebugger(window only) or GDB(for CPU only).

0 Likes