I have ported my 3D solver to Brook+ and it seems to work now, but unfortunately it is not really fast.
Here you can see a really simple test on a 64x64x64 grid using a Radeon HD4870:
http://de.youtube.com/watch?v=x3oQcopUvYM
The main bottleneck seems to be the streamRead() and streamWrite() functions for 3D streams, which seem really slow.
I have done simple benchmarks without any kernel calculation, and a single streamRead() on a 128*128*128 stream needs around 150 ms to complete.
streamWrite() needs approximately another 150 ms. Why is it so slow?
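To put the measured times in perspective, they can be converted into an effective transfer rate. This is only a rough sketch: the element size is an assumption (the benchmark above does not say whether float or float3 was used), and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Effective transfer rate in MiB/s for copying `elements` items of
// `bytesPerElement` bytes each in `milliseconds` ms.
double effectiveMiBPerSecond(std::size_t elements,
                             std::size_t bytesPerElement,
                             double milliseconds)
{
    assert(milliseconds > 0.0);
    const double mib =
        static_cast<double>(elements * bytesPerElement) / (1024.0 * 1024.0);
    return mib / (milliseconds / 1000.0);
}

// Upper bound on frames per second when every frame pays for one
// streamRead() and one streamWrite(), ignoring kernel time.
double maxFramesPerSecond(double readMs, double writeMs)
{
    assert(readMs + writeMs > 0.0);
    return 1000.0 / (readMs + writeMs);
}
```

For 128*128*128 elements in 150 ms this gives about 53 MiB/s with 4-byte floats and about 160 MiB/s with 12-byte float3 elements, far below the nominal 8 GB/s per direction of a PCI Express 2.0 x16 slot. The two copies alone limit the simulation to about 3.3 frames per second.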
If I use CodeAnalyst to measure the performance of my test application, the most time-consuming function is memcpy(), followed by brook::CALMem::setDataAT() and brook::CALMem::getDataAT().
A 1D stream seems to be 3 times faster than a 3D one, but this is still not really fast.
Of course it would be really useful if streamRead() and streamWrite() could be called as rarely as possible, but for this purpose a permanent stream will be needed.
Is this a limitation of the hardware or of the software?
Could I expect a significant speedup if I use CAL only?
Thanks in advance for your answer,
Remotion
Edit:
Well, as far as I can see, the problem is that getDataAT() and setDataAT() call memcpy() once for each of the 2097152 elements, and of course this is really, really slow.
Is this a bug?
By the way, CALMem::getDataAT() seems to copy twice if (streamRank == 1), because there is no else as in setDataAT(). This must be a bug.
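To illustrate why one memcpy() call per element hurts, here is a minimal sketch that contrasts it with a single bulk copy. This is not the Brook+ runtime source; copyPerElement and copyBulk are hypothetical names, and the code only shows the difference in call count for the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `count` floats with one memcpy() call per element (the slow pattern).
// Returns the number of memcpy() calls that were made.
std::size_t copyPerElement(float* dst, const float* src, std::size_t count)
{
    assert(dst != nullptr && src != nullptr);
    std::size_t calls = 0;
    for (std::size_t i = 0; i < count; ++i) {
        std::memcpy(dst + i, src + i, sizeof(float));
        ++calls;
    }
    return calls;
}

// Copies the same contiguous data with a single memcpy() call.
std::size_t copyBulk(float* dst, const float* src, std::size_t count)
{
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, count * sizeof(float));
    return 1;
}
```

For a 128*128*128 stream the first variant makes 2097152 calls where the second makes one. A bulk copy is only possible where source and destination are both contiguous, so any padding or address translation between the host layout and the GPU layout would limit it to one call per contiguous run.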
The simulation looks nice, Remotion.
What do you mean when you say "...but for this purpose permanent stream will be needed." ? Are you creating and destroying streams in every loop?
Yes, I call one big Brook+ function from my C++ code, which first allocates some streams, then copies data to them using streamRead(), then does all the calculation on the GPU, and then copies the data back using streamWrite().
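void FluidSolverGPU(float3 *v)
{
    float3 sv<x,y,z>;        // stream is allocated anew on every call
    streamRead(sv, v);       // copy host data into the stream
    solverFluidsKernel(sv);  // run the solver kernel on the GPU
    streamWrite(sv, v);      // copy the result back to host memory
}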
Something like this pseudo function is called every frame of the simulation.
The calculation on the GPU seems to be really fast, probably 100x faster than on a quad-core CPU, but the copying is really slow.
So far I have not found any way to create a container holding a stream so that it is allocated once at the beginning of the simulation.
struct FluidStreams
{
float3 sv<,,>;
float sd<,,>;
};
Something like this...
Hi,
In my tests, 1D stream transfers seem to be faster than 3D stream transfers.
So I could try to use a 1D stream, which is of course a bit tricky.
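The tricky part is mostly the index arithmetic. A minimal sketch of mapping a 3D grid position to a 1D index and back could look like this. The layout with x varying fastest is an assumption (it is not confirmed that Brook+ stores 3D streams this way), and Index3, flatten and unflatten are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct Index3 {
    std::size_t x, y, z;
};

// Maps a 3D grid position to a 1D index, with x varying fastest.
// nx and ny are the grid extents in x and y.
std::size_t flatten(Index3 p, std::size_t nx, std::size_t ny)
{
    assert(p.x < nx && p.y < ny);
    return p.x + nx * (p.y + ny * p.z);
}

// Inverse of flatten(): recovers the 3D position from a 1D index.
Index3 unflatten(std::size_t i, std::size_t nx, std::size_t ny)
{
    assert(nx > 0 && ny > 0);
    Index3 p;
    p.x = i % nx;
    p.y = (i / nx) % ny;
    p.z = i / (nx * ny);
    return p;
}
```

Neighbour lookups in the solver then become offsets of 1, nx and nx*ny in the 1D index, with the grid borders handled explicitly.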
I copy every frame because I do not know a way to leave the stream on the GPU for the whole simulation time.
Getting data from the GPU is of course still necessary for rendering.
And writing data to the GPU is still necessary, for external forces for example.
regards,
Remotion
Hi, maybe you can create static data, e.g.
static float3 sv<128,128,128>
and review your code to reduce the number of streamWrite() and streamRead() calls.
I had the same problem: the kernel on the GPU was very fast, but read and write were slow (like in CUDA). With static data I succeeded in reducing the reads and writes.
Hi,
A static stream is a nice idea, but it has some problems.
First, I am using Brook+ from a virtual call, which can have multiple instances that would overwrite each other's results; second, I need a variable-sized stream, not always 128*128*128.
But for temporary storage this could be interesting.
Thanks,
Remotion
Originally posted by: ryta1203
This actually brings up a question I had: When you read the data to a kernel will it stay on the GPU over multiple kernel calls?
Yes, it will stay on the GPU until your stream container is deleted!
{
float a<100>;
streamRead(a,..);
while (...)
{
kernelCall(a,..);
}
streamWrite(a,..);
}//here the stream a will be deleted.
By creating your own stream instance from C++ code, the stream can also be used inside classes and is destroyed when the class is destroyed.
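The idea can be sketched as a class that owns its stream for the whole simulation. FakeStream below is a hypothetical stand-in and NOT the Brook+ stream API; it keeps its data in host memory and only counts transfers, so the example shows the ownership pattern and nothing more. Its read()/write() follow the direction of streamRead()/streamWrite().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GPU stream (not the Brook+ API).
class FakeStream {
public:
    explicit FakeStream(std::size_t n) : data_(n), uploads_(0), downloads_(0) {}
    // Host -> stream, like streamRead().
    void read(const float* host) {
        data_.assign(host, host + data_.size());
        ++uploads_;
    }
    // Stream -> host, like streamWrite().
    void write(float* host) {
        std::copy(data_.begin(), data_.end(), host);
        ++downloads_;
    }
    std::vector<float>& data() { return data_; }
    int uploads() const { return uploads_; }
    int downloads() const { return downloads_; }
private:
    std::vector<float> data_;
    int uploads_;
    int downloads_;
};

// Owns the stream: allocated once in the constructor, reused every
// frame, and released when the solver object is destroyed.
class FluidSolver {
public:
    explicit FluidSolver(std::size_t n) : density_(n) {}
    void upload(const float* host) { density_.read(host); }
    // Stand-in for a kernel call: halves every value in place.
    void step() {
        for (float& d : density_.data()) d *= 0.5f;
    }
    void download(float* host) { density_.write(host); }
    const FakeStream& stream() const { return density_; }
private:
    FakeStream density_;
};
```

Because each solver object owns its own stream, several instances do not overwrite each other's results, and the size can be chosen at construction time.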
Remotion
Just want to say thank you for such wonderful information, it was really helpful!