What runtime error do you see? Is it a crash? If so, where does it crash?
The holiday is over and I am back.
The results seem weird. Sometimes I get the correct answer, with matrix C full of 3s (only when len doesn't exceed 16); other times it reports a memory error, and now the answer is an array of random numbers.
Is there something wrong with my algorithm in the kernel? I am wondering.
It seems the indices you are using are out of range. You are running 256 threads, and each of a, b, and c contains only 256 elements.
Also, I would suggest using domainOffset and domainSize together. The Brook+ runtime can ignore the domain-of-execution hint if domainOffset is not specified. Also, check your results without the Attribute qualifier in the kernel.
Thanks for your suggestion.
Does that mean that, to avoid the out-of-range problem, I cannot write more than one element per thread in the kernel?
But I know the cal_idct sample provided with the SDK writes more than one element in its IL kernel. Here is part of the code:
// save 8x8 DCT coefficient block location
"ishl r16.x, vaTid.x, l8.w\n"
// load packed 8x8 DCT coefficients using texture cache
"mov r0, g[r16.x+0]\n"
"mov r2, g[r16.x+1]\n"
"mov r4, g[r16.x+2]\n"
"mov r6, g[r16.x+3]\n"
"mov r8, g[r16.x+4]\n"
"mov r10, g[r16.x+5]\n"
"mov r12, g[r16.x+6]\n"
"mov r14, g[r16.x+7]\n"
// save DCT values
"mov g[r16.x+0], r0\n"
"mov g[r16.x+1], r2\n"
"mov g[r16.x+2], r4\n"
"mov g[r16.x+3], r6\n"
"mov g[r16.x+4], r8\n"
"mov g[r16.x+5], r10\n"
"mov g[r16.x+6], r12\n"
"mov g[r16.x+7], r14\n"
In the code above, the kernel first gets the absolute thread id and maps it to an 8x8 block, which is processed later. Finally, it writes these elements back. It works fine, so I wonder whether I can do the same thing in Brook+.
In your kernel, instance().x returns values from 0 to 255, so writing 16 elements in each thread means accessing memory elements 0 to 4095. But the amount of memory allocated is only 256 elements.
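To make the arithmetic concrete, here is a small host-side sketch in plain C (not Brook+ code) that computes the highest index a launch would touch. The helper names highest_index and launch_fits are made up for illustration, and the layout where thread t owns elements t*N through t*N+N-1 is an assumption about how the kernel indexes its output.

```c
#include <stddef.h>
#include <stdbool.h>

/* Highest linear index touched when thread t writes elements
   t*elems_per_thread .. t*elems_per_thread + elems_per_thread - 1.
   Hypothetical helper; assumes both arguments are nonzero. */
static size_t highest_index(size_t num_threads, size_t elems_per_thread)
{
    return num_threads * elems_per_thread - 1;
}

/* True if every access stays inside a buffer of `allocated` elements. */
static bool launch_fits(size_t num_threads, size_t elems_per_thread,
                        size_t allocated)
{
    return highest_index(num_threads, elems_per_thread) < allocated;
}
```

With 256 threads and 16 elements per thread, highest_index gives 4095, so launch_fits(256, 16, 256) is false. Under this layout, either the buffers need 4096 elements or the launch needs only 16 threads.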
Can you explain why the cal_idct kernel works?
I want to do matrix addition in a compute shader, with each thread adding multiple elements. But the code above doesn't work as I expected. How should I write it?
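One way to think about the indexing, sketched on the CPU in plain C rather than in Brook+: launch only as many threads as there are chunks, and let each thread add one chunk of consecutive elements. The function names add_chunk and add_matrices are hypothetical, and the loop over tid merely stands in for the GPU launching threads; this is not the Brook+ API, only an illustration of which indices each thread should touch.

```c
#include <stddef.h>

/* Work done by one simulated thread: add elems_per_thread consecutive
   elements starting at tid * elems_per_thread. The guard skips any
   index past the end of the buffers. */
static void add_chunk(const float *a, const float *b, float *c,
                      size_t total, size_t elems_per_thread, size_t tid)
{
    size_t base = tid * elems_per_thread;
    for (size_t k = 0; k < elems_per_thread; ++k) {
        size_t i = base + k;
        if (i < total)
            c[i] = a[i] + b[i];
    }
}

/* Run just enough simulated threads to cover `total` elements and
   return how many were used. Assumes elems_per_thread is nonzero. */
static size_t add_matrices(const float *a, const float *b, float *c,
                           size_t total, size_t elems_per_thread)
{
    size_t num_threads = (total + elems_per_thread - 1) / elems_per_thread;
    for (size_t tid = 0; tid < num_threads; ++tid)
        add_chunk(a, b, c, total, elems_per_thread, tid);
    return num_threads;
}
```

For 256-element matrices with 16 elements per thread, this uses 16 threads instead of 256, so no thread computes an index beyond 255. The same idea would apply in a kernel: the execution domain should be the element count divided by the elements per thread, not the element count itself.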