I have some quite lengthy printf calls in my code, for example:
printf("exp=%d, x2=%x:%x, b=%x:%x:%x:%x:%x:%x, k_base=%x:%x:%x, bit_max=%d\n",
exp, exp96.d1, exp96.d0, bb.d5, bb.d4, bb.d3, bb.d2, bb.d1, bb.d0, k_base.d2, k_base.d1, k_base.d0, bit_max64+64);
They are displayed correctly, but only for the first two printf calls. The third printf hangs the program when running on the GPU. Running on the CPU works fine and prints everything, so I guess there is something odd with printf on the GPU.
For debugging purposes, try CodeXL. It allows you to set breakpoints in the kernel and examine the values of all variables.
I was confused by some other printout I had outside the kernel: in fact, the problem is not solved by splitting the printf. It seems that printf only works once, with a single argument...
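If printf stays unreliable on the GPU, one possible workaround (only a sketch, not tested on this setup) is to have the kernel store the values in a global debug buffer instead of printing them, read that buffer back on the host (for example with clEnqueueReadBuffer), and do the formatting there. The names debug_record and format_debug_record below are hypothetical, made up for illustration; the field layout simply mirrors the printf from the first post, and plain unsigned int stands in for cl_uint.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical debug record: one entry the kernel would fill in
   instead of calling printf. Fields mirror the original printf. */
typedef struct {
    int          exp;
    unsigned int exp96[2];   /* d1, d0 */
    unsigned int bb[6];      /* d5, d4, d3, d2, d1, d0 */
    unsigned int k_base[3];  /* d2, d1, d0 */
    int          bit_max;    /* already bit_max64 + 64 */
} debug_record;

/* Format one record on the host using the same layout as the kernel
   printf. Returns whatever snprintf returns (the number of characters
   that would have been written, excluding the terminating null). */
int format_debug_record(char *out, size_t out_size, const debug_record *r)
{
    return snprintf(out, out_size,
        "exp=%d, x2=%x:%x, b=%x:%x:%x:%x:%x:%x, k_base=%x:%x:%x, bit_max=%d\n",
        r->exp,
        r->exp96[0], r->exp96[1],
        r->bb[0], r->bb[1], r->bb[2], r->bb[3], r->bb[4], r->bb[5],
        r->k_base[0], r->k_base[1], r->k_base[2],
        r->bit_max);
}
```

On the host you would call format_debug_record once per record copied back from the device and pass the result to fputs or similar. This avoids the device-side printf entirely, at the cost of an extra buffer and a read-back after the kernel finishes.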