
choks
Journeyman III

AmdFFT consuming lots of memory when called multiple times

Good day,

I was planning to test AmdFFT for some OpenCL BOINC projects, but I discovered that after calling clAmdFftEnqueueTransform many times, my machine ran out of memory.

I can reproduce the problem with the x86_64 stock client:

./clAmdFft.Client -g -x 1048576 -p 20 -> runs fine

./clAmdFft.Client -g -x 1048576 -p 20000 -> eats more than 3 GB of RAM and keeps growing until my machine runs out of swap space.

It seems to affect only the GPU, because the same command with the -c flag does not have this problem.

I'm using the latest 11.12 drivers, AmdFFT (clAmdFft-1.6.236.tar.gz), Debian 64-bit, and an HD 6950.

Thanks

20 Replies
choks
Journeyman III

I found a workaround to get those FFTs working reliably:

- I played with clAmdFft.Client to find the best configuration. It was planar data for me (4M-sample FFT).

- I then used the -d flag to output the OpenCL C code,

- then I ran the profiler (sprofile) on clAmdFft.Client to see how the kernels were called,

- then just created an OpenCL program calling the generated OpenCL code, with the parameters I got from the profiler.

Of course, it only works for one given FFT size. I got 6662 4M FFTs performed in 38 seconds (93 GFlops).

Not bad at all.


Hello Choks,

Thanks for reporting your experience using our library. I have several concerns about the way you are using the library.

1. The clAmdFft.Client code is just a sample program. The '-p' option is intended for measuring the speed of the FFT, and the normal range for this parameter is 1-200. It is just an iteration count to do the same FFT n times; it is important to keep in mind that it does not do FFTs of different data/input. Theoretically, the parameter should accept any number and run without problems, but passing a high count of 20000 is of no real value. Testing this high count on one system on our side did not cause any issues, but we will check using different graphics card types and amounts of system memory.

2. I hope you went through our reference manual to understand how we would like users to use our library. If you have not, please go through the PDF manual that came with the package, particularly section 1.4.

3. The library is designed to compute FFTs in batches; in other words, you can compute more than one transform with a single clAmdFftEnqueueTransform call. But the library can operate on only 2^24 (16777216) elements at a time. If you need to compute a 4M-size FFT (which I assume is 2^22 points), then you can compute 4 transforms of size 4M simultaneously. You can compute more than 4 transforms of course, but you would have to copy the results back and repopulate the input data before computing the next batch of transforms.

Please let us know if you have trouble understanding any of our API functions. We value feedback and would be happy to improve our documentation to give the best experience for users.

kknox
Staff

Hi Choks,

I see your bug report, and will make an entry in our tracker.  I will let you know if we can reproduce this internally, and have more information to share.

kknox
Staff

Hi Choks,

Upon investigation, we found that all the memory is consumed inside the clFinish() call; the FFT library itself does not leak memory that we could detect.

It should be noted that OpenCL's API is lazy and asynchronous. 

Since our example program times GPU FFT performance with CPU timers, it first enqueues all kernels in a loop before calling clFinish(), which allows us to amortize the cost of the data transfer for CPU timers. However, relatively little work is done inside clAmdFftEnqueueTransform; computation does not start until clFinish() is executed. At that point, for 20000 enqueued kernels, all available resources in the host system are used up.

The relevant code is pasted below:

sTimer.Start(clFFTID);
for( cl_uint i = 0; i < profileCount; ++i )
{
    OPENCL_V_THROW( clAmdFftEnqueueTransform( plHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, &outEvent,
                        &clMemBuffersIn[ 0 ], BuffersOut, clMedBuffer ), "clAmdFftEnqueueTransform failed" );
}
OPENCL_V_THROW( clFinish( queue ), "clFinish failed" );     // All computational work is done here
sTimer.Stop(clFFTID);

20000 executing kernels is simply consuming all of your system resources. We find internally that a value of several hundred, such as 200, is a good predictor of decent FFT performance.


Hello kknox,

I'm implementing a kind of diffusion-reaction simulation using clAmdFft and, although I'm calling clFinish() after every Fourier transform due to the facts you mentioned above, I run out of resources after ~6000 Fourier transform calls.

My array is about 512x64 in size and I'm doing a 2D FFT. Output of my data to a file is done after 100 steps, each step including two Fourier transforms.

I hope you (or someone else) can give me a hint as to why this fails.

edit (13:10 PM):

I should also mention that I'm using a HD 7970 with the latest Catalyst 12.3.

example code:

...


//do Fourier transform
FftStat = clAmdFftEnqueueTransform( FftPlanOnlyY, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL, &c_real, &c_fft, NULL );
if( FftStat != 0 ) {
    printf( "error starting FFT: %i", FftStat );
    return 1;
}
clFinish( queue );

//apply diffusion in Fourier space
status = clEnqueueNDRangeKernel( queue, diffusion_kernel, 2, NULL, ThreadsTotal, ThreadsGroup, 0, NULL, NULL );
if( status != 0 ) {
    printf( "diffusion was not enqueued: %i", status );
    return 1;
}
clFinish( queue );

//backward Fourier transform
FftStat = clAmdFftEnqueueTransform( FftPlanOnlyY, CLFFT_BACKWARD, 1, &queue, 0, NULL, NULL, &c_fft, &c_real, NULL );
if( FftStat != 0 ) {
    printf( "error starting FFT: %i", FftStat );
    return 1;
}
clFinish( queue );

Hi djohn,

It is great to see that you are using our FFT library in your application. I need some more details to understand where exactly you are running into problems. Are you using the batch size function (clAmdFftSetPlanBatchSize) to compute many FFT transforms at once? Or is your code in a loop where you compute just one or two transforms at a time, but you set up your memory buffers with different data for every loop iteration? Can you provide your OS and bitness (for example, Windows 7 64-bit)?


Hi,

thanks for your reply. My OS is a 64-bit Linux system. My FFTs are called within a loop and I only compute one FFT at a time. For the time being I do not specify a batch size, as I don't know exactly how much I should allocate; the library's specification is not clear enough (to me) to determine the batch size a priori. If you need further information, feel free to ask me. I really need to get around this to make progress in my work.


Hi,

I guess I just found a workaround. I did not set a specific buffer memory object for my transforms. It seems that clAmdFftEnqueueTransform() creates a new buffer object every time it is called if no specific buffer is provided, but does not free the memory after usage or reuse already created objects.

Now that I explicitly provide a buffer object, everything works fine again.

If the reason is the one I suggested, then I hope this can be corrected in a new version of the FFT library.

Thanks nonetheless.


Hi Djohn~

If you are referring to the tmpBuffer argument in our clAmdFftEnqueueTransform() API, I believe it is working as expected. Some FFT algorithms require a ping-pong back and forth between an input and an output buffer, and the library might not be able to overwrite the input buffer, as it may have been declared READ_ONLY. Even if we assume that the output buffer is always OK to write to, there are instances where the library requires a tmpBuffer.

If the user does not specify a tmpBuffer, the library allocates one internally, and the lifetime of that buffer is the lifetime of the plan that created it. I suppose our other option would be to fail the call, but we thought that harsh. However, we recognize the importance of letting our users control all the memory on the GPU, so we added an API, clAmdFftGetTmpBufSize(), to query the plan for how much tmpBuffer storage it needs; the user can then allocate that temp buffer himself and control its lifetime explicitly. That sounds like what you are doing, and it sounds like it is working as designed.

I hope that I have answered your questions.

Hi,

yes, that sounds like my question is answered now. Thanks again for your help. I still don't really see why the library should allocate a new temporary buffer for every transform, especially when the queue is flushed between the calls. But I am no specialist in computer science, and as long as there's a way to cope with it, I'm happy. Thanks.


Hey,

I'm still using clAmdFft, but now I am confronted with another problem. After calling clFinish(), my program consumes more and more memory on the CPU side. The effect vanishes when I switch off the enqueuing of FFTs. As I need thousands of transforms during the simulations I run, even a small amount of memory consumed in every step adds up to several GB, eventually getting my process killed.

Does clFinish() in combination with clAmdFft somehow allocate resources every time it is called but not deallocate them? If yes, is there any possible workaround?

Thanks in advance.


Have you transitioned to our v1.8 libraries?  I think this thread helped us find a memory leak in v1.6 that is fixed in v1.8.  Are you specifying your own temp buffer in the transform calls, or letting the library allocate one?


Hey kknox,

I updated to v1.8 earlier and checked the error under different settings. I always specify my own temp buffer, after I already had some trouble with the issue you mentioned. What really strikes me is that I don't have the problem described above if I use the library on my CPU, but only when I use the GPU to do the FFTs. But if I do my calculations on the GPU, the FFT library should not allocate any memory on the host processor, should it?

I hope this can give you further ideas what's wrong with my program.


Hi,

It is indeed puzzling that you are not seeing this problem on the CPU. By CPU, do you mean that you are using CL_DEVICE_TYPE_CPU when setting up the command queue and using it in the library? If there were a memory leak in the library, we should see it on both CPU and GPU targets; the fact that it is not happening on the CPU is interesting. Temporary buffer objects are created (or used from a buffer specified by the user) on whatever device we are computing on. So if you are running on the GPU, temporary memory resides on the GPU, and if you are using the CPU, temporary memory resides in system memory.

Would it be possible for you to attach a sample of your code, so we can debug on our side?


Hi,

yes, I meant CL_DEVICE_TYPE_CPU. I hope the following code sample is sufficient to reproduce the error and gives you the possibility to guess what's going wrong. Thanks very much again.

Message was edited by: Dens Johann

Hey, just as a question: have you already been able to reproduce my error? I'm very curious whether this is a problem specific to my system setup, because in that case I think I would have to change it.


Hi djohn,

Thanks for your sample code. We were able to identify the problem: it is a very tricky memory leak happening inside the library. We will be working on fixing it and releasing an updated version of the library, but it may be a while before that happens. In the meantime, would you be able to change how you use the library in your code? The problem is specific to cases where our 'clAmdFftEnqueueTransform' API is used in a loop and the loop count is very large (as in your case, around 10^7). Instead, is it possible for you to compute many FFT transforms at once using our 'clAmdFftSetPlanBatchSize' API and make the loop count smaller or eliminate it completely?


Hey Bragadeesh,

thanks a lot for taking the time to look at the sample code. For physical reasons (as I'm doing a physical simulation), I have to do this transform in every time step, so the only thing that can help me is decreasing the loop count. In the long term, I will not get around checking my system's behaviour on long time scales, corresponding to loop counts as high as 10^7, but while you fix the memory leak I will try to get along with low loop counts. At least it would be nice if the problem could be fixed within the next, let's say, two months. If you can already tell me that this won't be the case, I will have to do something like storing simulation results after low loop counts, killing the program, then restarting my simulation using the last results as initial conditions. As you might imagine, this is really uncomfortable, but it could nonetheless work. Thanks again for your effort. I'm really happy with AMD's response to problems that occur with its libraries.

By the way, it would be nice if someone notified me when there is a bugfix release (for example by making a post in here). Thanks.


Hi djohn,

Thanks for your support and we are glad that you are using our libraries in your application. For this memory leak issue, we plan to do a maintenance release soon (approximately in 2 weeks). It will be made publicly available and we'll notify you when it is out.


Hi djohn,

We have made a maintenance release to fix this issue. Please download the latest 1.8 version and verify that it fixes the problem you were facing. Thanks.


Hi bragadeesh,

I've just installed the latest version and tried it. After running the code for several minutes I saw no growth in memory consumption, so I guess it's finally working now. Thanks again for your help.

Best

djohn
