Could you share your implementation of dot_prod and cross_prod?
Sorry for the delay in replying. The implementations are given below:
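float dot_prod(float4 u, float4 v)
{
    return u.x * v.x + u.y * v.y + u.z * v.z;  // 3D dot product; w is ignored
}
float4 cross_prod(float4 u, float4 v)
{
    float4 prod = (float4)(0.0f);  // w is unused and set to 0
    prod.x = u.y*v.z - u.z*v.y;
    prod.y = u.z*v.x - u.x*v.z;
    prod.z = u.x*v.y - u.y*v.x;
    return prod;
}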
Thanks Shunyo. Have you tried copying pl[i], pl[j] and pl[k] into private memory first, instead of passing them directly to the dot and cross functions? If not, please try that as an optimization.
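A rough sketch of what I mean, written as plain C so it can be compiled and checked on the host. The float4 struct here is only a stand-in for OpenCL's built-in type, and triple_from_copies is a hypothetical name. Since the kernel itself has not been posted, the scalar triple product is just an assumed example of a computation that uses all three elements. In an actual kernel, pl would be a __global pointer and the three local copies would live in private memory.

```c
#include <stddef.h>

/* Host-side stand-in for OpenCL's built-in float4 (assumption for this sketch). */
typedef struct { float x, y, z, w; } float4;

static float dot_prod(float4 u, float4 v)
{
    return u.x * v.x + u.y * v.y + u.z * v.z;
}

static float4 cross_prod(float4 u, float4 v)
{
    float4 prod = {0.0f, 0.0f, 0.0f, 0.0f};
    prod.x = u.y * v.z - u.z * v.y;
    prod.y = u.z * v.x - u.x * v.z;
    prod.z = u.x * v.y - u.y * v.x;
    return prod;
}

/* Hypothetical computation: pl[i] . (pl[j] x pl[k]).
   The thread does not show what the real kernel computes. */
static float triple_from_copies(const float4 *pl, size_t i, size_t j, size_t k)
{
    /* Read each element from the buffer exactly once into a local copy. */
    const float4 a = pl[i];
    const float4 b = pl[j];
    const float4 c = pl[k];

    /* All further arithmetic uses the local copies, not the buffer. */
    return dot_prod(a, cross_prod(b, c));
}
```

The point is only that each buffer element is loaded once and then reused from the local copy. Whether this helps will depend on the device and on what the compiler already does with the direct accesses.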
Also, how are the pl and res buffers created? Do they reside in host or device memory?
Thanks for the suggestions. They worked out.