Remotion
Journeyman III

3D fluid simulation using brook+

streamRead streamWrite bottleneck.

I have ported my 3D solver to Brook+ and it seems to work now, but unfortunately it is not really fast.

Here you can see a really simple test on a 64x64x64 grid using a Radeon HD4870:

http://de.youtube.com/watch?v=x3oQcopUvYM


The main bottleneck seems to be the streamRead() and streamWrite() functions, which seem really slow for 3D streams.

I have done simple benchmarks without any kernel calculation, and a single streamRead() on a 128*128*128 stream needs around 150 ms to complete.

streamWrite() needs approximately another 150 ms. Why is it so slow?

If I use CodeAnalyst to measure the performance of my test application, the most time-consuming function is memcpy(), followed by brook::CALMem::setDataAT() and brook::CALMem::getDataAT().

A 1D stream seems to be 3 times faster than a 3D one, but this is still not really fast.

Of course it would be really useful if streamRead() and streamWrite() could be used as rarely as possible, but for this purpose permanent stream will be needed.


Is this a hardware limitation or a software one?

Could I expect a significant speedup if I use CAL only?

Thanks in advance for your answer,

Remotion


Edit:

Well, as far as I can see, the problem is that getDataAT() and setDataAT() call memcpy separately for each of the 2097152 elements, and of course this is really, really slow.

Is this a BUG?

By the way, CALMem::getDataAT() seems to copy twice if (streamRank == 1), because there is no else branch like the one in setDataAT(). This must be a bug.
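To illustrate why one memcpy call per element is suspicious, here is a minimal standard C++ sketch that contrasts the two copy patterns. It is not the Brook+ source: `copyPerElement`, `copyBulk` and `copiesAgree` are hypothetical names, and plain host memory stands in for the stream buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies n floats one element at a time, calling memcpy once per element.
// This only mimics the pattern described in the post above.
void copyPerElement(float* dst, const float* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        std::memcpy(dst + i, src + i, sizeof(float));
    }
}

// Copies the same n floats with a single memcpy call.
void copyBulk(float* dst, const float* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, n * sizeof(float));
}

// Hypothetical helper: runs both strategies on n > 0 elements and reports
// whether both destinations match the source.
bool copiesAgree(std::size_t n) {
    std::vector<float> src(n), a(n, 0.0f), b(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        src[i] = static_cast<float>(i);
    }
    copyPerElement(a.data(), src.data(), n);
    copyBulk(b.data(), src.data(), n);
    return a == src && b == src;
}
```

Both functions produce the same result; the difference is only the number of calls (2097152 versus 1 for a 128*128*128 grid). How large the timing gap is depends on the compiler and on whether the per-element call gets inlined, so it is worth measuring rather than assuming.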

0 Likes
9 Replies
udeepta
Staff

The simulation looks nice, Remotion.

What do you mean when you say "...but for this purpose permanent stream will be needed." ? Are you creating and destroying streams in every loop?

0 Likes

Yes, I call one big Brook+ function from my C++ code, which first allocates some streams, then copies data to them using streamRead(), then does all the calculation on the GPU, and then copies the data back using streamWrite().

FluidSolverGPU(float3 v)
{
   float3 sv<x,y,z>;
   streamRead(sv,v);
   solverFluidsKernel(sv);
   streamWrite(sv,v);
}

Something like this pseudo function will be called every frame of simulation.

The calculation on the GPU seems to be really fast, probably 100x faster than on a quad-core CPU, but the copying is really slow.

So far I have not found any way to create a container holding a stream, so that it can be allocated once at the beginning of the simulation.

struct FluidStreams
{
   float3  sv<,,>;
   float   sd<,,>;
};
Something like this...

 

0 Likes


Hi,

Regarding the read/write performance, if I run the numbers, I'm getting around 160 MB/sec for a read or write of (128*128*128 * float3 * 4 bytes = ~25 MB). That does seem a little slow. There are a couple of things we could look at, picking different data formats, trying async copies and the like. If you like, we could take this offline; just send email to streamdeveloper@amd.com.
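The arithmetic behind that estimate can be checked with a few lines of C++. This is only a worked calculation using the figures quoted in the thread (a 128*128*128 grid of float3 and about 150 ms per transfer); the function names are hypothetical, and 1 MB is taken as 10^6 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for an nx*ny*nz grid with `components` floats per cell.
std::size_t gridBytes(std::size_t nx, std::size_t ny, std::size_t nz,
                      std::size_t components) {
    return nx * ny * nz * components * sizeof(float);
}

// Throughput in MB/s (1 MB = 1e6 bytes) for a transfer that takes `ms` milliseconds.
double throughputMBps(std::size_t bytes, double ms) {
    assert(ms > 0.0);
    return (static_cast<double>(bytes) / 1.0e6) / (ms / 1000.0);
}
```

With 4-byte floats, `gridBytes(128, 128, 128, 3)` gives 25165824 bytes (about 25 MB), and `throughputMBps` of that size over 150 ms comes out at roughly 168 MB/s, which is in line with the "around 160 MB/sec" quoted above.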

When you say you're copying in and out of the GPU every frame, is that inherent in your algorithm? I can see how you'd need to get the data back from the GPU, but shouldn't it be possible to leave every time step on the card for the next frame?

Thanks -- marcr


0 Likes

Hi,

In my tests, 1D stream transfers seem to be faster than 3D stream transfers.

So I could try to use a 1D stream, which is of course a bit tricky.
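The tricky part of using a 1D stream for a 3D grid is mostly the index mapping. Here is a minimal C++ sketch of one common layout, assuming x varies fastest in memory; `GridDims`, `Index3`, `flatten` and `unflatten` are hypothetical names, and this says nothing about how Brook+ lays out 3D streams internally.

```cpp
#include <cassert>
#include <cstddef>

// Grid dimensions; x is assumed to vary fastest in memory.
struct GridDims {
    std::size_t nx, ny, nz;
};

// A 3D cell coordinate.
struct Index3 {
    std::size_t x, y, z;
};

// Maps a 3D coordinate to its position in the flat 1D array.
std::size_t flatten(const GridDims& d, std::size_t x, std::size_t y, std::size_t z) {
    assert(x < d.nx && y < d.ny && z < d.nz);
    return x + d.nx * (y + d.ny * z);
}

// Inverse mapping: recovers the 3D coordinate from a flat index.
Index3 unflatten(const GridDims& d, std::size_t i) {
    assert(i < d.nx * d.ny * d.nz);
    return Index3{i % d.nx, (i / d.nx) % d.ny, i / (d.nx * d.ny)};
}
```

Neighbour lookups in a solver kernel then become offsets of 1, nx and nx*ny in the flat index, with extra care needed at the grid boundaries.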

I copy every frame because I do not know a way to leave the stream on the GPU for the whole simulation time.

Getting data from the GPU is of course still necessary for rendering.

And writing data to the GPU is still necessary, for example for external forces.

regards,

Remotion

0 Likes

Hi, maybe you can create static data:

static float3 sv<128,128,128>

and then go through your code to reduce the number of streamWrite and streamRead calls.

I had the same problem: the kernel on the GPU was very fast, but read and write were slow (like in CUDA). With static data I managed to reduce the reads and writes.

0 Likes

Hi,


A static stream is a nice idea, but it has some problems.

First, I am calling Brook+ from a virtual function that can have multiple instances, which could overwrite each other's results; second, I need variable-sized streams, not always 128*128*128.

But for temporary storage this could be interesting.


Thanks,

Remotion

0 Likes

This actually brings up a question I had:

When you read the data to a kernel will it stay on the GPU over multiple kernel calls?

For instance:

...

streamRead(...);

while (...)
{
kernelCall(...);
}

streamWrite(..);

0 Likes

Originally posted by: ryta1203 This actually brings up a question I had: When you read the data to a kernel will it stay on the GPU over multiple kernel calls?



Yes, it will stay on the GPU until your stream container is deleted!

{
float a<100>;
streamRead(a,..);

while (...)
{
kernelCall(a,..);
}

streamWrite(a,..);
} // here the stream a is deleted


By creating your own stream instance from C++ code, the stream can also be used inside classes, and it is destroyed when the class instance is destroyed.
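The ownership pattern described here can be sketched in plain C++. This is only a model under stated assumptions: `DeviceBuffer` is a hypothetical stand-in built on std::vector, not the Brook+ stream type, and `upload`, `download` and `step` merely play the roles of streamRead, streamWrite and a kernel call. What it shows is allocating once per solver object and transferring only at the start and end.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU stream: a host buffer plus transfer counters.
class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t n) : data_(n, 0.0f) {}
    void upload(const std::vector<float>& host) {  // role of streamRead
        assert(host.size() == data_.size());
        data_ = host;
        ++transfers_;
    }
    std::vector<float> download() {                // role of streamWrite
        ++transfers_;
        return data_;
    }
    void step() {                                  // role of a kernel call
        for (float& v : data_) v += 1.0f;
    }
    int transfers() const { return transfers_; }
private:
    std::vector<float> data_;
    int transfers_ = 0;
};

// Owns its buffer for the whole simulation; the buffer is released
// automatically when the solver object is destroyed.
class FluidSolver {
public:
    explicit FluidSolver(std::size_t cells) : density_(cells) {}
    void setState(const std::vector<float>& host) { density_.upload(host); }
    void advance(int frames) {
        for (int i = 0; i < frames; ++i) density_.step();
    }
    std::vector<float> getState() { return density_.download(); }
    int transfers() const { return density_.transfers(); }
private:
    DeviceBuffer density_;
};

// Hypothetical driver: one upload, `frames` steps, one download.
// Returns the first cell value and the total number of transfers.
std::pair<float, int> runFrames(int frames) {
    FluidSolver solver(8);
    solver.setState(std::vector<float>(8, 0.0f));
    solver.advance(frames);
    std::vector<float> result = solver.getState();
    return {result[0], solver.transfers()};
}
```

Because each solver instance owns its own buffer, this also avoids the problem with static streams mentioned earlier in the thread, where multiple instances would share and overwrite the same data, and the buffer size can be chosen per instance at construction time.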

Remotion

0 Likes

Just want to say thank you for such a wonderful information, it was really helpful!

0 Likes