ifrah
Adept II

Re: Re: Max memory allocation Restriction

Thanks again

I have verified that my Visual Studio project is configured to target a 64-bit application. The example you posted does work with buffer sizes smaller than 4 GB. The difference between that example and mine is that I fill the input buffer on the host side with random values, rather than in the kernel.

The exact scenario in which I get the error: I allocate a 2 GB array on the host side, fill it with random numbers, then create a 2 GB buffer with the read_only and alloc_host_ptr flags, map it, memcpy from the host array, and unmap it. During the memcpy, I get an error about accessing location 0x000....0, and I am not sure why.
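For reference, the sequence described above would look roughly like the sketch below (the context, queue, and sizes are my assumptions, not code from the actual project). One thing worth noting: if clEnqueueMapBuffer fails, it returns NULL, and a memcpy into that NULL pointer would produce exactly an access violation at address 0, so it is worth checking the mapped pointer before copying.

```c
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>

/* Sketch: fill an ALLOC_HOST_PTR buffer via map/memcpy/unmap.
   ctx and queue are assumed to already exist; bytes may be ~2 GB. */
static int upload(cl_context ctx, cl_command_queue queue,
                  const float *host_src, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);
    if (err != CL_SUCCESS) { fprintf(stderr, "create: %d\n", err); return -1; }

    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, &err);
    /* If this check is missing and the map fails, the following memcpy
       faults at address 0 -- the symptom described above. */
    if (p == NULL || err != CL_SUCCESS) {
        fprintf(stderr, "map: %d\n", err);
        clReleaseMemObject(buf);
        return -1;
    }
    memcpy(p, host_src, bytes);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
    return 0;
}
```
This is only a sketch and needs an OpenCL platform and device to actually run.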

drallan
Challenger

Re: Max memory allocation Restriction

This thread on Big Buffers http://devgurus.amd.com/thread/159516 may also be of some help. I often use large contiguous buffers on multiple 7970s that use 2.5 GB+ of each card's 3.0 GB physical memory without overflowing to the PC's memory, which can cause significant slowdown. The easiest way is to set the environment variable GPU_MAX_ALLOC_PERCENT=90, although I use 95. This allows very large single-buffer allocations (>2 GB).

There are some caveats to make it work correctly, though these are not hard rules AFAIK.

1. Use the basic read/write flags; this results in physical GPU buffers.

2. Make sure the PC has a lot of free (roughly equal?) matching memory, after the OS and your own app have allocated theirs.

3. It's sometimes useful to do a dummy kernel run once (i.e., with an early return statement) before other buffer operations; this results in a clean buffer layout.
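Besides setting the variable system-wide, it can also be set from inside the program, as long as that happens before the OpenCL runtime initializes. A minimal sketch (the value 95 and the POSIX setenv call are my choices here; on Windows, _putenv_s would be the equivalent):

```c
#include <stdlib.h>
#include <string.h>

/* Raise the driver's per-buffer allocation cap for this process.
   Must run before the first OpenCL call (e.g. clGetPlatformIDs),
   since the runtime reads the variable when it initializes. */
static int raise_alloc_limit(void)
{
    return setenv("GPU_MAX_ALLOC_PERCENT", "95", 1 /* overwrite */);
}
```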

wayne_static
Adept II
Adept II

Re: Max memory allocation Restriction

Have you tried a normal read/write instead of the map/unmap, to see if the behaviour is the same? In the sample, I fill one vector on the host side and the GPU fills the other one, and then I compare their values; these are just dummy operations to exercise the huge buffers on both ends. Try read/write rather than map/unmap to see if it helps. Otherwise, I'm afraid I won't be able to offer much help, unless it is possible and convenient for you to provide a test case so that I can try to re-create the problem at my end.
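For what it's worth, the read/write path being suggested looks roughly like this (a sketch with an assumed ctx, queue, and host arrays, not code from the attached sample):

```c
#include <CL/cl.h>

/* Sketch: explicit write/read instead of map/unmap. ctx and queue are
   assumed to exist; host_src and host_dst each hold `bytes` bytes. */
static cl_int roundtrip(cl_context ctx, cl_command_queue queue,
                        const void *host_src, void *host_dst, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS)
        return err;  /* e.g. CL_MEM_OBJECT_ALLOCATION_FAILURE */

    /* Blocking write: the host data has been copied when the call returns. */
    err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_src,
                               0, NULL, NULL);
    if (err == CL_SUCCESS)
        err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host_dst,
                                  0, NULL, NULL);
    clReleaseMemObject(buf);
    return err;
}
```
Again, only a sketch; it needs a real OpenCL device to run.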

ifrah
Adept II

Re: Re: Re: Max memory allocation Restriction

nou

Yes, you are right that with alloc_host_ptr, performance degrades even on an APU.

@wayne

Yes, I did a normal read/write, and on writing to my input buffer I got a CL_MEM_OBJECT_ALLOCATION_FAILURE error. I will provide a test case soon.

drallan

I will try contiguous memory the way you have used it (can you share some sample code, for convenience?). Unfortunately, the GPU devices I have (A10-5800K and HD 5870) do not have a lot of memory. Without setting the GPU_MAX_ALLOC_PERCENT variable, clinfo gives the global memory size of the HD 7660 in the A10-5800K as 802160640 and the max memory allocation size as 200540160 (for the 5870, these values are 1073741824 and 536870912 respectively). After setting GPU_MAX_ALLOC_PERCENT=95, I get 1016070144 as the global memory size and 254017536 as the max memory allocation, which is not a significant improvement (for the HD 5870, 1073741824 and 763048755 respectively).

GPU-Z gives the memory of the 7660 as 512 MB and of the HD 5870 as 1 GB. The global memory value clinfo reports is fine for the 5870, but I don't understand how it can be 802 MB, and then increase to approximately 1 GB, for the GPU in the APU. Will any memory beyond 512 MB be allocated in host memory?

drallan
Challenger

Re: Re: Re: Max memory allocation Restriction


Clinfo shows, for the 7970s, before setting GPU_MAX_ALLOC_PERCENT=95:

global memory = 2,147,483,648

max memory allocation = 536,870,912

After setting it to 95%:

global memory = 3,221,225,472

max memory allocation = 2,804,154,368

Not so different (relatively) from the HD 5870. Each GPU requires a certain amount of memory for the OpenCL system and for video processing; that amount does not scale with the GPU's memory size.
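As a sanity check on the figures quoted in this thread: before the variable is set, the reported max allocation is exactly 25% of the reported global memory on every device mentioned, and after setting it to 95, the 7970 jumps to roughly 87%. A small helper to verify the arithmetic:

```c
#include <stdint.h>

/* Percentage of reported global memory that a single buffer may occupy,
   computed from the clinfo figures quoted in this thread. */
static unsigned alloc_percent(uint64_t max_alloc, uint64_t global_mem)
{
    return (unsigned)(max_alloc * 100 / global_mem);
}
```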

I'm not so familiar with the APUs, but the specs for the A10-5800K (7660) just state shared memory, and the sharing may be adjustable. Have a look at this thread discussing the A10-5800K and GPU_MAX_HEAP_THREAD: http://devgurus.amd.com/thread/160288  Good luck.

ifrah
Adept II

Re: Re: Max memory allocation Restriction

Hi,

I am attaching a very simple test case. I have added alternative approaches (such as mapping/unmapping) in comments. Please suggest any improvements in execution time, compute, and memory bandwidth.

wayne_static
Adept II

Re: Re: Max memory allocation Restriction

Hi ifrah,

That's great. I'll take a look and see if there's any way I can help.

wayne_static
Adept II

Re: Re: Re: Max memory allocation Restriction

Hi ifrah,

The sample you provided works fine on my system; however, I was able to re-create your problem by increasing the input size. After toiling around, I observed that the CL_MEM_OBJECT_ALLOCATION_FAILURE occurs when multiple buffers of considerable size exist. Even when you keep creating just one buffer inside a loop, the problem persists. The moment I created just one big buffer (~3.2 GB), the program worked fine.

Since I'm not an expert in OpenCL yet, I really don't know the reason for this, and I will certainly try to find out more (this has been a good learning exercise), unless some expert here chips in.

Based on your test case, I believe you are just adding 1 to every value in the input array and comparing the GPU and CPU results. I have re-created your code in C++, but kept it straight to the point, without the profiling and extra loops; I hope you don't mind me using C++. I also added a kernel similar to yours ("addition"). So, the code achieves a similar goal to yours but uses only one huge buffer instead. I have attached the full Visual Studio project I have been using for your problem, so take a look; it contains some useful comments, though it is very trivial code. The code does not emphasize quality, e.g. I borrowed your method for generating random numbers, used naive floating-point verification, and hard-coded the loop unrolling, so please ignore these for now.
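The kernel being described, where each work-item adds 1 to its own element, would look something like the following. The name "addition" comes from the post above; the body is a hypothetical reconstruction, since the actual attachment is not shown here.

```c
// OpenCL C kernel: one work-item per element, increments it in place.
__kernel void addition(__global float *data)
{
    size_t gid = get_global_id(0);
    data[gid] += 1.0f;
}
```
The host then compares the buffer read back from the GPU against a CPU loop doing the same increment.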

Sorry that I am not able to solve your problem, but I hope the logic used in this trivial program gives you some ideas for tackling your original one.