I posted this question yesterday, but today I could not find it anymore, so here it is again:
I just want to test the performance of the FireStream 9170 with a simple matrix-matrix multiplication in single precision. So I wrote a really simple kernel like this:
kernel
void matmul_kernel(float a<>, float b<>, out float c<>)
{
    /* each element of c receives the product of the corresponding elements of a and b */
    c = a * b;
}
If I run the executable with n x n matrices and n < 832, everything works well. But if I start it with n = 832 or above, I get a segmentation fault.
I read that the size of a 2D stream is limited to 8192x8192, so what is the problem? I compiled it both with and without address translation (-r flag), and the behavior is exactly the same in both cases, except that the address-translated code is slower.
I would be thankful for any kind of help.
(1) I just allocate 3 matrices with float A
(2) I am working on Linux 64 and compile the files with a GNU Makefile using brcc
(3) The problem remains the same if I comment out the kernel call
(4) The Brook+ examples work fine with n < 4096, which is why I think my program should work with the same matrix sizes
@ Ceq: You were right. The problem was the memory allocation. Now I allocate the memory with A = allocate_mat_f(); and everything works fine.
But I am still limited to 3072x3072. Why?
3072 * 3072 * sizeof(float) = 36 MB per matrix, and 3 matrices make 108 MB in total. Where is the problem?
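The footprint arithmetic above can be checked for other values of n with a small helper. This is only a sketch: the function names matrices_bytes and matrices_mib are made up for illustration, and the figures cover just the host-side matrices, not any staging buffers the runtime may allocate on top.

```c
#include <stdint.h>

/* Bytes needed for `count` square n-by-n single-precision matrices. */
static uint64_t matrices_bytes(uint64_t n, uint64_t count)
{
    return n * n * (uint64_t)sizeof(float) * count;
}

/* The same figure in whole MiB (1 MiB = 1024 * 1024 bytes), rounded down. */
static uint64_t matrices_mib(uint64_t n, uint64_t count)
{
    return matrices_bytes(n, count) / (1024u * 1024u);
}
```

For three matrices, matrices_mib(3072, 3) gives 108, matrices_mib(4096, 3) gives 192, and matrices_mib(8192, 3) gives 768.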
I don't think malloc is limited to 3072x3072, because I have used matrices up to 10000x10000 in other programs running on the CPU.
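One way to rule out the host allocation for certain is to check the result of every allocation before handing the pointer to the runtime. The sketch below uses a hypothetical helper alloc_matrix_f; it is not the allocate_mat_f mentioned above, whose implementation is not shown in this thread.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not the allocate_mat_f from the post):
 * allocates an n-by-n float matrix as one contiguous, zero-filled heap block.
 * Returns NULL if n is 0, if the byte count would overflow size_t,
 * or if the allocation itself fails. The caller releases it with free(). */
float *alloc_matrix_f(size_t n)
{
    if (n == 0 || n > SIZE_MAX / n || n * n > SIZE_MAX / sizeof(float))
        return NULL;
    return calloc(n * n, sizeof(float));
}
```

If a call such as alloc_matrix_f(3584) returns a non-NULL pointer for all three matrices, the host allocation succeeded and the failure has to come from somewhere later in the program.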
Thanks for your answers.
OK, I tested where the program aborts. At n = 3584 it aborts between the first and the second streamRead. That's crazy.
So I put a streamWrite after the first streamRead, to wait until the first streamRead has finished. This way the program works with n = 4096, too.
At n ~ 5000 the program aborts before the first streamRead.
n = 8192 would be great, but 4096 is ok. In this way I can get a realistic measurement.
In order to perform a streamRead, BRT will allocate some GPU remote (i.e. CPU local) memory, copy the data there, and then run a kernel that asynchronously copies it to the actual surface.
The problem is likely when the BRT allocates this GPU remote memory. I believe there may be unseen (unreported?) limits. Try using the environment variable USE_NONCACHEABLE and set it to 1.
I've run into this problem myself when using CAL directly...
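The usual way to try this is to set USE_NONCACHEABLE to 1 in the shell before starting the program. As a sketch, the same can be done from inside the program with POSIX setenv. The helper name enable_noncacheable is made up, and it is an unverified assumption that the runtime reads the variable when it initializes, so the call would have to happen before any stream is created.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Sets USE_NONCACHEABLE=1 in the environment of the current process.
 * Returns 0 on success and -1 on failure (the return value of setenv).
 * Assumption: this has to run before the Brook+/CAL runtime initializes. */
int enable_noncacheable(void)
{
    return setenv("USE_NONCACHEABLE", "1", 1);
}
```

Setting the variable in the shell remains the safer option, since it does not depend on when the runtime looks at its environment.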