I declare two 16x16 matrices stored as 1D arrays and add them. In the kernel, each thread processes N elements (N = 1, 2, 4, ...). When N is less than 16 it works fine, but when N reaches 16 some kind of runtime error happens. I cannot figure it out.
The code is below (main.cpp and locate.br):
main.cpp:
#include <stdio.h>
#include <stdlib.h>
#include "brookgenfiles/locate.h"
using namespace brook;
#define SIZE 16
#define SIZE2 256
// Print a len x len matrix stored row-major in a 1D array
void printMatrix(int len, float m[])
{
    int i, j;
    for (i = 0; i < len; i++)
    {
        for (j = 0; j < len; j++)
        {
            printf("%f, ", m[i * len + j]);
        }
        printf("\n");
    }
}
int main()
{
    //array a, b and c
    float a[SIZE2];
    float b[SIZE2];
    float c[SIZE2];
    int i;
    for (i = 0; i < SIZE2; i++)
    {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    unsigned int msize = SIZE2;
    Stream<float> sa(1, &msize);
    Stream<float> sb(1, &msize);
    Stream<float> sc(1, &msize);
    sa.read(a);
    sb.read(b);
    uint4 domainSize = uint4(SIZE2, 1, 1, 1);
    blockAdd.domainSize(domainSize);
    blockAdd(sa, sb, sc);
    sc.write(c);
    if (sc.error())
    {
        printf("Error occurred! %s\n", sc.errorLog());
        return 1;
    }
    printMatrix(SIZE, c);
    getchar();
    return 0;
}
locate.br:
Attribute[GroupSize(64, 1, 1)]
kernel void
blockAdd(float a[], float b[], out float c[])
{
    int tid = instance().x;
    //every thread processes len elements, len = 1, 2, 4, 6, 8, 16, ....
    //when len = 16, the error occurs
    int len = 16;
    int start = tid * len;
    int i;
    int index;
    for (i = 0; i < len; i++)
    {
        c[start + i] = a[start + i] + b[start + i];
    }
}
What runtime error do you see? Is it a crash? If so, where does it crash?
The holiday is over and I am back.
The result seems weird. Sometimes I get the correct answer, with matrix C full of 3s (only when len doesn't exceed 16); sometimes it reports a memory error, and then the answer becomes an array of random numbers.
Is there something wrong with my algorithm in the kernel? I am wondering.
It seems the indices you are using are out of range. You are running 256 threads, and a, b and c each contain only 256 elements.
Also, I would suggest using domainOffset and domainSize together. The Brook+ runtime can ignore the domain-of-execution hint if domainOffset is not specified. Also, check your results without the Attribute qualifier on the kernel.
Thanks for your suggestion.
Does that mean that, to avoid the out-of-range problem, I cannot write more than one element per thread in the kernel?
But I know the cal_idct sample provided with the SDK writes more than one element in its IL kernel. Here is part of the code:
// save 8x8 DCT coefficient block location
"ishl r16.x, vaTid.x, l8.w\n"
// load packed 8x8 DCT coefficients using texture cache
"mov r0, g[r16.x+0]\n"
"mov r2, g[r16.x+1]\n"
"mov r4, g[r16.x+2]\n"
"mov r6, g[r16.x+3]\n"
"mov r8, g[r16.x+4]\n"
"mov r10, g[r16.x+5]\n"
"mov r12, g[r16.x+6]\n"
"mov r14, g[r16.x+7]\n"
//DO IDCT
......
// save DCT values
"mov g[r16.x+0], r0\n"
"mov g[r16.x+1], r2\n"
"mov g[r16.x+2], r4\n"
"mov g[r16.x+3], r6\n"
"mov g[r16.x+4], r8\n"
"mov g[r16.x+5], r10\n"
"mov g[r16.x+6], r12\n"
"mov g[r16.x+7], r14\n"
In the code above, it first gets the absolute thread id and then maps it to an 8x8 block, which is processed later. Finally it writes these elements back. It works fine. So I wonder whether I can do the same thing in Brook+.
In your kernel, instance().x returns values from 0 to 255, so writing 16 elements in each thread means accessing memory elements 0 to 4095. But only 256 elements are allocated.
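The arithmetic behind that range can be checked on the host. The following is only a minimal C++ sketch, not Brook+ code; the helper names maxIndexTouched and staysInBounds are made up for illustration. It assumes each thread starts at tid * len and touches len consecutive elements, as blockAdd does.

```cpp
#include <cassert>
#include <cstddef>

// Highest element index touched when `threads` threads each process `len`
// consecutive elements starting at tid * len (the indexing used in blockAdd).
// Hypothetical helper, for illustration only.
std::size_t maxIndexTouched(std::size_t threads, std::size_t len)
{
    assert(threads > 0 && len > 0);
    return (threads - 1) * len + (len - 1);
}

// True when every access stays inside a buffer of `elements` values.
bool staysInBounds(std::size_t threads, std::size_t len, std::size_t elements)
{
    return maxIndexTouched(threads, len) < elements;
}
```

With 256 threads and len = 16 the highest index is 255 * 16 + 15 = 4095, which is outside a 256-element buffer. By the same arithmetic any len greater than 1 already goes past index 255 with 256 threads, so the smaller values may only have appeared to work.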
Can you explain why the cal_idct kernel works?
I want to do the matrix addition in a compute shader, adding multiple elements per thread. But my code above doesn't work as I expect. How should I write it?
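One possible approach, sketched here as a plain C++ emulation on the host rather than as tested Brook+ code, is to launch only as many threads as there are blocks of len elements and to guard each access against the buffer size. The function names blockAddThread and blockAddEmulated are hypothetical. In the Brook+ program this would presumably correspond to passing SIZE2 / len instead of SIZE2 as the first component of domainSize, assuming the runtime honours that hint.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one kernel instance: thread `tid` adds `len` consecutive
// elements, skipping any index past the end of the buffers.
void blockAddThread(std::size_t tid, std::size_t len,
                    const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c)
{
    const std::size_t start = tid * len;
    for (std::size_t i = 0; i < len; ++i) {
        const std::size_t index = start + i;
        if (index < c.size()) {  // bounds guard for a partial last block
            c[index] = a[index] + b[index];
        }
    }
}

// Runs ceil(n / len) emulated threads so that every element is written
// exactly once and nothing outside the buffers is touched.
std::vector<float> blockAddEmulated(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t len)
{
    assert(len > 0 && a.size() == b.size());
    std::vector<float> c(a.size(), 0.0f);
    const std::size_t threads = (a.size() + len - 1) / len;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        blockAddThread(tid, len, a, b, c);
    }
    return c;
}
```

For the 256-element arrays in main.cpp and len = 16 this runs 16 emulated threads, each covering one row of the 16x16 matrix.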