There is a max buffer allocation limit (maximum memory allocation in clinfo) and it's different for different devices. Can anyone explain why we have this constraint?
Moreover, if my maximum memory allocation limit is 200540160 bytes, then I can allocate a 128 MB buffer of unsigned ints (33554432 * 4 bytes). Now if I initialize another buffer of the same size, will I have to wait until processing of the 1st buffer is complete before transferring it to GPU memory?
Is there a better alternative? I have an input array of terabytes, and breaking that into 128 MB chunks and processing them one after the other is very slow.
I may not have the solution but I can share my experience with you.
Since you are working with terabytes of data, I'm assuming you're running a 64-bit application? I have an algorithm I'm working on that produces a very large solution space depending on the input, and by large I mean gigabytes of data. Initially, the max memory allocation limit on my HD 7750 was similar to your figure above, but now it is 536,870,912 bytes. I'm working with the latest Catalyst drivers and AMD APP SDK 2.9 on Windows 7.
Anyway, the interesting thing is that in the 64-bit version of my application, I am able to create a buffer of about 1.6 GB (512*256*3072*4-byte integers)! Somehow the runtime intelligently allows this buffer to reside in host memory rather than moving it to the GPU frame buffer, which would have caused the application to crash since I have definitely exceeded the limit. And besides, the HD 7750 has only a 1 GB frame buffer. I did not specify any special flags for the buffer, just the simple cl::Buffer constructor and the CL_MEM_READ_WRITE flag, and that was all. Maybe you could try this and see how far you can push it; I'm guessing it will depend on how much host RAM you have. I shared my findings in a post here http://devgurus.amd.com/message/1300871#1300871 before, but this behaviour seems new and is not documented.
Hi Wayne, thanks for your reply.
I installed the latest Catalyst driver and the 2.9 SDK on Windows 7, but when I tried to allocate a 2^26 * 4 bytes (i.e. 256 MB) buffer (more than CL_DEVICE_MAX_MEM_ALLOC_SIZE), I got the same CL_INVALID_BUFFER_SIZE error (-61) during buffer creation. I am using a 64-bit platform and Visual Studio 2010.
So I guess I have to initialize multiple buffers. Any idea how I can use multiple buffers? Let's say I have 2 input and 2 output buffers of 128 MB each. Should I send both simultaneously to my kernel, or should I run 2 separate kernels? Neither is a good solution if my input grows to gigabytes, let alone terabytes, because one buffer has a max size of 128 MB; to cater for 1 GB alone, I need 4 input and 4 output buffers. Looking forward to suggestions. Thanks.
Just to clarify, apart from using a 64-bit OS, is your Visual Studio project also 64-bit? You can check and change this from the project properties to ensure that it is being compiled for a 64-bit platform.
Which GPU are you using and what does clinfo say about the max memory allocation limit? You know as I mentioned earlier, I don't know why this thing works for me so let's try to re-create as similar setup as mine as possible.
Okay, I see. I have the exact same APU on a couple of my machines, but I have not tried the code I was referring to on it. I will do that tomorrow when I get to the office in the morning and get back to you to see if the same behaviour applies to APUs. I will test with a very simple and straightforward code, and if it succeeds I will share it with you to try as well.
I have managed to test a simple code on the APU. I simply made 256 * 512 * 3072 work-items return their global IDs, which gives us around 1.6 GB of data for our buffer to store. Below is a screen shot of the result from the machine with the A10-5800K APU. I have also attached the exact code and kernel that produced this result. If you want the full Visual Studio project I can send that to you as well.
Now the interesting thing is that this code will only run if we include the CL_MEM_ALLOC_HOST_PTR flag (line 103) when creating the buffer. In my original project, when I initially encountered this behaviour, I didn't have to include this flag; the runtime automatically handled the buffer allocation. Anyway, as you might already know, the CL_MEM_ALLOC_HOST_PTR flag tells OpenCL to allocate the buffer from host memory, so this might be your solution.
Please let me know how everything goes. Hope this helps in some way.
Hey Wayne, thanks a lot for all the help you provided; it helped a lot!
So by using alloc_host_ptr, the max buffer size is restricted to *less than* the max_mem_allocation size of the CPU, which is 1/4 of the total RAM available (~4 GB = ~16 GB / 4 on my system). That is indeed more than 128 MB (1/4 of the GPU's total memory).
The max_mem_alloc size on my CPU is 4146315264 bytes. I got an invalid_buffer_size error when I tried to allocate 2 buffers of 4 GB each (using the alloc_host_ptr flag) on the CPU device, and it worked fine when allocating buffers of smaller sizes (1 GB, 512 MB etc.), which is the correct behavior. Now, when I was allocating these 2 buffers (using the alloc_host_ptr flag) on the integrated GPU, I got an "accessing null position" error when the size of the buffers was 2 GB each; it occurred when I was mapping the host input array to the buffer. Any ideas/thoughts on why that is?
Now the question of how to use MULTIPLE BUFFERS for input and output still stands.
Let's say I have 2 input and 2 output buffers of 128 MB each. Should I send both simultaneously to my kernel, or should I run 2 separate kernels? Neither is a good solution if my input grows to gigabytes, let alone terabytes, because one buffer has a max size of 128 MB; to cater for 1 GB alone, I need 4 input and 4 output buffers. Looking forward to suggestions. Thanks.
I'm glad I was able to offer what little help I could.
Firstly, I think you are running out of address space (not entirely sure). That is why you need to make sure your Visual Studio project targets a 64-bit platform; that way you can allocate more memory. Below is a screenshot of the same sample program as before, but this time working with a buffer size of around 3.2 GB. Including the verification array on the host side, we are looking at over 6 GB, as you can see from Task Manager. I suggest you check your Visual Studio project settings under Properties -> Configuration Manager and make sure it really is a 64-bit project, because by default your projects are built as 32-bit.
Secondly, on the question of multiple buffers: you don't need to create so many. Buffers are kind of like vehicles; just create one that is big enough to hold whatever data you need to pass to the kernel, and then keep re-using it by reading/copying data back and forth between host and device. Say, for instance, you need to move 1024 MB of data to the kernel in 4 chunks. Create a 256 MB buffer, copy the first 256 MB of data into it, run your kernel, and then repeat for the next 256 MB until you are done. It's similar to caching data from global memory into local memory, but this time you are working between host memory and GPU global memory.
Hope these make some sense and offer some help. Please try the Visual Studio project settings first and let's see how that goes, because with a 64-bit application you should be able to allocate beyond 2 GB.