Archives Discussions

rexiaoyu · ‎08-05-2009

I declare two 16x16 matrices stored as 1D arrays and add them. In the kernel, each thread processes N elements (N = 1, 2, 4, ...). When N is less than 16 it works fine, but when N reaches 16 some kind of runtime error occurs. I cannot figure it out.

The code is below (main.cpp and locate.br):

main.cpp:

#include <stdio.h>
#include <stdlib.h>
#include "brookgenfiles/locate.h"

using namespace brook;

#define SIZE 16
#define SIZE2 256

void printMatrix(int len, float m[])
{
    int i, j;
    for (i = 0; i < len; i++)
    {
        for (j = 0; j < len; j++)
        {
            printf("%f, ", m[i * len + j]);
        }
        printf("\n");
    }
}

int main()
{
    //array a, b and c
    float a[SIZE2];
    float b[SIZE2];
    float c[SIZE2];
    int i;

    for (i = 0; i < SIZE2; i++)
    {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    unsigned int msize = SIZE2;
    Stream<float> sa(1, &msize);
    Stream<float> sb(1, &msize);
    Stream<float> sc(1, &msize);

    sa.read(a);
    sb.read(b);

    uint4 domainSize = uint4(SIZE2, 1, 1, 1);
    blockAdd.domainSize(domainSize);
    blockAdd(sa, sb, sc);

    sc.write(c);
    if (sc.error())
    {
        printf("Error occurred! %s\n", sc.errorLog());
        return 1;
    }

    printMatrix(SIZE, c);
    getchar();
    return 0;
}

locate.br:

Attribute[GroupSize(64, 1, 1)]
kernel void
blockAdd(float a[], float b[], out float c[])
{
    int tid = instance().x;

    //every thread processes len elements, len = 1, 2, 4, 6, 8, 16, ....
    //when len = 16, the error occurs
    int len = 16;
    int start = tid * len;
    int i;
    int index;

    for (i = 0; i < len; i++)
    {
        c[start + i] = a[start + i] + b[start + i];
    }
}

gaurav_garg · ‎08-07-2009

What runtime error do you see? Is it a crash? If so, where does it crash?

rexiaoyu · ‎08-11-2009

The holiday is over and I am back.

The result seems weird. Sometimes I get the correct answer, with matrix C full of 3s (only when len doesn't exceed 16); sometimes it reports a memory error, and now the answer is an array of random numbers.

Is there something wrong with my algorithm in the kernel? I am wondering.

gaurav_garg · ‎08-11-2009

It seems the indices you are using are out of range. You are running 256 threads, and a, b and c each contain only 256 elements.

Also, I would suggest using domainOffset and domainSize together. The Brook+ runtime can ignore the domain of execution hint if domainOffset is not specified. Also, check your results without the Attribute qualifier on the kernel.

rexiaoyu · ‎08-11-2009

Thanks for your suggestion.

Does that mean that, to avoid the out-of-range problem, I cannot write more than one element in the kernel?

But I know the cal_idct sample provided with the SDK writes more than one element in its IL kernel. Here is part of the code:

// save 8x8 DCT coefficient block location
"ishl r16.x, vaTid.x, l8.w\n"

// load packed 8x8 DCT coefficients using texture cache
"mov r0, g[r16.x+0]\n"
"mov r2, g[r16.x+1]\n"
"mov r4, g[r16.x+2]\n"
"mov r6, g[r16.x+3]\n"
"mov r8, g[r16.x+4]\n"
"mov r10, g[r16.x+5]\n"
"mov r12, g[r16.x+6]\n"
"mov r14, g[r16.x+7]\n"

//DO IDCT
......

// save DCT values
"mov g[r16.x+0], r0\n"
"mov g[r16.x+1], r2\n"
"mov g[r16.x+2], r4\n"
"mov g[r16.x+3], r6\n"
"mov g[r16.x+4], r8\n"
"mov g[r16.x+5], r10\n"
"mov g[r16.x+6], r12\n"
"mov g[r16.x+7], r14\n"

In the code above, it first gets the absolute thread ID and then maps it to an 8x8 block, which is processed later. Finally it writes these elements back. It works fine, so I wonder whether I can do the same thing in Brook+.

gaurav_garg · ‎08-11-2009

In your kernel, instance().x returns values from 0...255, so writing 16 elements in each thread means accessing memory elements 0...4095. But only 256 elements are allocated.
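
The index arithmetic in this reply can be checked on the CPU with a small sketch. The helper names maxIndexTouched and accessesInRange are hypothetical, made up for this illustration; they are not part of Brook+ and only reproduce the start = tid * len addressing from the posted kernel.

```cpp
#include <cassert>
#include <cstddef>

// Largest element index touched when `threads` kernel instances each
// process `len` consecutive elements starting at tid * len.
// Hypothetical helper for illustration only; not a Brook+ API.
inline std::size_t maxIndexTouched(std::size_t threads, std::size_t len)
{
    assert(threads > 0 && len > 0);
    // The last thread (tid = threads - 1) writes up to start + len - 1.
    return (threads - 1) * len + (len - 1);
}

// True when every access stays inside a buffer of `elements` entries.
inline bool accessesInRange(std::size_t threads, std::size_t len,
                            std::size_t elements)
{
    return maxIndexTouched(threads, len) < elements;
}
```

For the posted configuration, maxIndexTouched(256, 16) evaluates to 4095, so accessesInRange(256, 16, 256) is false. By the same arithmetic, any len greater than 1 with 256 threads reads and writes past element 255, so the smaller values that appeared to work were also out of range.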

rexiaoyu · ‎08-11-2009

Can you explain why the cal_idct kernel works?

rexiaoyu · ‎08-14-2009

I want to do matrix addition in a compute shader, adding multiple elements per thread. But my code above didn't work as I expected. How should I write the code?
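
One possible answer, following from the out-of-range explanation earlier in the thread, is to launch only as many threads as there are chunks of len elements. The sketch below is a plain C++ simulation on the CPU, not Brook+ code: threadsNeeded, blockAddInstance and blockAddSimulated are hypothetical names, and the loop over tid stands in for the domain of execution. The bounds guard is an added assumption for the case where len does not divide the element count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of kernel instances needed so that each handles `len` elements,
// rounded up so a partial last chunk still gets a thread.
inline std::size_t threadsNeeded(std::size_t elements, std::size_t len)
{
    assert(len > 0);
    return (elements + len - 1) / len;
}

// Body of one kernel instance, mirroring blockAdd from the thread, plus a
// bounds guard so the last thread never runs past the end of the buffers.
inline void blockAddInstance(std::size_t tid, std::size_t len,
                             const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::vector<float>& c)
{
    const std::size_t start = tid * len;
    for (std::size_t i = 0; i < len && start + i < c.size(); ++i)
    {
        c[start + i] = a[start + i] + b[start + i];
    }
}

// Hypothetical CPU stand-in for the kernel launch: a sequential loop plays
// the role of the threads in the domain of execution.
inline std::vector<float> blockAddSimulated(const std::vector<float>& a,
                                            const std::vector<float>& b,
                                            std::size_t len)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size(), 0.0f);
    const std::size_t threads = threadsNeeded(a.size(), len);
    for (std::size_t tid = 0; tid < threads; ++tid)
    {
        blockAddInstance(tid, len, a, b, c);
    }
    return c;
}
```

With 256 elements and len = 16, threadsNeeded returns 16, so the highest index touched is 255. In the posted host code this would presumably correspond to passing SIZE2 / 16 instead of SIZE2 as the first component of domainSize; that mapping is an assumption and has not been verified against the Brook+ SDK.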

Problem with simple matrix addition