
simon1
Journeyman III

More pinned host-memory than device-memory capacity

Context: I'm working on a matching algorithm: basically, an unknown pattern is compared to a gallery in order to find the best match. The gallery contains up to a billion examples, which amounts to about 30GB (and fits into host memory in my case).

In the CUDA version of my implementation, I allocate two buffers as big as possible on the GPU and split the gallery into chunks of pinned host memory (using cudaMallocHost). This allows me to upload the chunks to the device without any extra copy and at the highest bandwidth, and to process one of the buffers while the other one is filling.
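For clarity, here is roughly what that CUDA pattern looks like (a simplified sketch with made-up names such as upload_and_match and num_chunks; the kernel launch and error checks are omitted):

// Rough sketch of the CUDA double-buffering pattern described above.
// num_chunks, chunk_size and the kernel launch are placeholders, not my actual code.
#include <cuda_runtime.h>
#include <stdlib.h>

void upload_and_match(size_t num_chunks, size_t chunk_size)
{
    int **h_chunk = (int **)malloc(num_chunks * sizeof(int *)); // pinned gallery chunks
    int *d_buf[2];                                              // two device buffers
    cudaStream_t stream[2];

    for (size_t i = 0; i < num_chunks; ++i)
        cudaMallocHost((void **)&h_chunk[i], chunk_size * sizeof(int)); // pinned host memory

    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void **)&d_buf[i], chunk_size * sizeof(int));
        cudaStreamCreate(&stream[i]);
    }

    // ... fill the h_chunk[] buffers with the gallery ...

    for (size_t i = 0; i < num_chunks; ++i) {
        int cur = (int)(i & 1);
        // async upload straight from pinned memory, at full PCIe bandwidth
        cudaMemcpyAsync(d_buf[cur], h_chunk[i], chunk_size * sizeof(int),
                        cudaMemcpyHostToDevice, stream[cur]);
        // the matching kernel would be launched on d_buf[cur] in stream[cur] here,
        // so it overlaps with the upload of the next chunk in the other stream
    }
    cudaDeviceSynchronize();
}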

In section 3.1.1 of their OpenCL best practices guide, NVIDIA explains how to do the same in OpenCL.

Here's how I tried it with my AMD GPU:

// create host buffers
cl_mem host_buffers[num_host_buffers];

for (uint i = 0; i < num_host_buffers; i++) {
    host_buffers[i] = clCreateBuffer(context,
                                     CL_MEM_ALLOC_HOST_PTR,
                                     chunk_size * sizeof(int),
                                     ...);
}

// init host buffers
for (uint i = 0; i < num_host_buffers; i++) {
    int* m = (int*)clEnqueueMapBuffer(queue, host_buffers[i], CL_TRUE,
                                      CL_MAP_WRITE_INVALIDATE_REGION,
                                      0, chunk_size * sizeof(int),
                                      ...);
    // ...
    clEnqueueUnmapMemObject(queue, host_buffers[i], (void*)m, ...);
}

// alloc device buffers
cl_mem device_buffers[2];

for (uint i = 0; i < 2; i++) {
    device_buffers[i] = clCreateBuffer(context,
                                       CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                       chunk_size * sizeof(int), ...);
}

To upload the required chunk of data, I use a clEnqueueCopyBuffer from a host_buffer to a device_buffer. But the clEnqueueMapBuffer calls start failing with CL_MAP_FAILURE once the VRAM capacity is reached.
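In other words, the upload loop looks roughly like this (just a sketch: enqueue_match_kernel is a made-up helper standing in for the kernel launch, and with a single in-order queue the copy and the kernel would serialize rather than overlap):

// Sketch of the upload loop: copy chunk i into one of the two device buffers,
// then run the kernel on it while the other buffer is (ideally) being filled.
cl_event copy_done;
for (cl_uint i = 0; i < num_host_buffers; ++i) {
    cl_uint cur = i % 2;
    clEnqueueCopyBuffer(queue, host_buffers[i], device_buffers[cur],
                        0, 0, chunk_size * sizeof(int),
                        0, NULL, &copy_done);
    // made-up helper: enqueues the matching kernel on device_buffers[cur],
    // waiting on the copy event
    enqueue_match_kernel(queue, device_buffers[cur], 1, &copy_done);
    clReleaseEvent(copy_done);
}
clFinish(queue);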

Regarding the table in section 4.5.2 of the APP programming guide, it seems that there's no way to allocate "upload-ready" memory chunks on the host only (at least not without that "VM" thing enabled).

Manually aligning, page-locking and marking a memory chunk non-cacheable is not an option either. From the APP guide, section 4.5.1.2:

Currently, the runtime recognizes only data that is in pinned host memory for operation arguments that are memory objects it has allocated in pinned host memory.

To make things short: what is the best way to manage a (splittable) set of data that fits in host memory but not in device memory? And is it possible to avoid copies and take advantage of the highest available bandwidth at the same time?

gbilotta
Adept III

Re: More pinned host-memory than device-memory capacity

I do not know the answer to your problem, but one thing you could try is the OpenCL memobject migration feature, available since 1.2 (so this will not work on NVIDIA until they decide to upgrade their OpenCL support). The approach I would try is the following: allocate all the buffers the way you are currently allocating your "host" buffers. Do not create any "device" buffers. Instead, use clEnqueueMigrateMemObjects to juggle the buffers. Use a null migration flag when you want to move a buffer to the device, and the CL_MIGRATE_MEM_OBJECT_HOST flag to "unload" it from the device.
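In code, what I have in mind is roughly the following (untested sketch; NUM_BUFFERS and the kernel call are placeholders):

// Untested sketch: one pool of CL_MEM_ALLOC_HOST_PTR buffers, moved on and off
// the device with clEnqueueMigrateMemObjects (OpenCL 1.2).
cl_mem bufs[NUM_BUFFERS];
for (cl_uint i = 0; i < NUM_BUFFERS; ++i)
    bufs[i] = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                             chunk_size * sizeof(int), NULL, NULL);

for (cl_uint i = 0; i < NUM_BUFFERS; ++i) {
    // flags = 0: migrate towards the device associated with the queue
    clEnqueueMigrateMemObjects(queue, 1, &bufs[i], 0, 0, NULL, NULL);

    // ... enqueue the kernel that uses bufs[i] here ...

    // "unload" the buffer from the device once it is no longer needed
    clEnqueueMigrateMemObjects(queue, 1, &bufs[i], CL_MIGRATE_MEM_OBJECT_HOST,
                               0, NULL, NULL);
}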

I cannot guarantee that this will work, but my understanding is that the memobject migration feature was added specifically for this purpose. If the platform implements it correctly, it should give the expected results. If not, we should probably ask AMD to make it work this way.

himanshu_gautam
Grandmaster

Re: More pinned host-memory than device-memory capacity

I agree with the method suggested by gbilotta but, like him, I have not implemented it so far.

I will raise a request to have this behaviour demonstrated somehow in an APP sample, but do not expect it to come any time soon.

Also check the BufferBandwidth sample to see how to achieve maximum buffer transfer speed.

See http://devgurus.amd.com/message/1296694#1296694 for the out-of-core behaviour explained above.

Do you mean the above code works fine with NVIDIA cards but fails with AMD? Can you check with smaller host-side data? It may be the case that the map allocates separate memory and clCreateBuffer allocates separate memory for the same buffer.

simon1
Journeyman III

Re: More pinned host-memory than device-memory capacity


gbilotta wrote:

The approach I would try is the following: allocate all the buffers the way you are currently allocating your "host" buffers. Do not create any "device" buffers. Instead, use clEnqueueMigrateMemObjects to juggle the buffers.

I like your idea, but my code fails before the creation of any "device" buffer. Maybe this solution would work with a CPU device, but I don't want to rely on the availability of a CPU implementation.


himanshu.gautam wrote:

Do you mean the above code works fine with NVIDIA cards but fails with AMD? Can you check with smaller host-side data? It may be the case that the map allocates separate memory and clCreateBuffer allocates separate memory for the same buffer.

The above code comes directly from an NVIDIA sample, so it should be working. I'll test it if I can get my hands on an NVIDIA GPU.

With fewer buffers, or buffers of a smaller size, it works. As I said, the mappings start failing when the total amount of requested data is greater than the device memory capacity.

The workaround I found is to create the "host" buffers using ALLOC_HOST_PTR | COPY_HOST_PTR to initialize them, instead of doing a clEnqueueMap. Still, when the total size of the requested "host" buffers goes beyond the GPU capacity, the whole process takes about 6 minutes instead of a few seconds...
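Concretely, the workaround amounts to something like this (a sketch; gallery_chunk[i] stands for my already-initialized host data):

// Workaround sketch: initialize the "host" buffers at creation time
// instead of mapping them afterwards.
cl_int err;
for (cl_uint i = 0; i < num_host_buffers; ++i) {
    host_buffers[i] = clCreateBuffer(context,
                                     CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                                     chunk_size * sizeof(int),
                                     gallery_chunk[i],  // source data, copied at creation
                                     &err);
}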


Is there a "host-side" copy involved in the case of a mapping of a buffer created with ALLOC_HOST_PTR ? That would be a strange behaviour since the allocated host-memory should have all the requirements to be transferred at full bandwidth to the GPU.

himanshu_gautam
Grandmaster

Re: More pinned host-memory than device-memory capacity





simon1 wrote:

The workaround I found is to create the "host" buffers using ALLOC_HOST_PTR | COPY_HOST_PTR to initialize them, instead of doing a clEnqueueMap. Still, when the total size of the requested "host" buffers goes beyond the GPU capacity, the whole process takes about 6 minutes instead of a few seconds...

It might be that the GPU is not evicting the buffers after use, and that is creating issues. Can you try using the clReleaseMemObject() API more aggressively? I mean, try to release a buffer as soon as it is not needed. If this does not help, I would request you to attach (as ZIP) a testcase to showcase the issue.

simon1
Journeyman III

Re: More pinned host-memory than device-memory capacity


himanshu.gautam wrote:

It might be that the GPU is not evicting the buffers after use, and that is creating issues. Can you try using the clReleaseMemObject() API more aggressively? I mean, try to release a buffer as soon as it is not needed. If this does not help, I would request you to attach (as ZIP) a testcase to showcase the issue.

I don't want to release the buffers (those I created with ALLOC_HOST_PTR), since I need them during the whole program execution (which can last up to a few months).

This issue is resolved: the strange timings were probably due to a lack of free CPU memory (which couldn't be swapped out, since the example I gave with clEnqueueMap crashed without freeing the page-locked memory chunks).

However, the CL_MAP_FAILURE error still appears if I try to use clEnqueueMapBuffer on the host buffers.

Here is a complete example that shows the error: https://gist.github.com/notSimon/5812728

On my dev computer (an HD7750 with 1GB, and 2GB of host memory), it starts failing with 10 buffers, which is 640MB of host memory. That should fit in the remaining free memory, even if the mapping (for obscure reasons) allocates another copy of each buffer.

Edit: clEnqueueWriteBuffer fails with an OUT_OF_RESOURCES error whenever clEnqueueMap does.

gbilotta
Adept III

Re: Re: More pinned host-memory than device-memory capacity

I took the liberty of implementing the approach I proposed based on the memory migration option. It's available on GitHub with a public domain license. The program does a rather simple thing: it allocates a large number of buffers (enough to overcommit device memory) with CL_MEM_ALLOC_HOST_PTR, then it tries calling a simple kernel with two buffers at a time: buffer #0 and a newly selected buffer. Whenever a new buffer is being used, the previous one is migrated to the host. Note that I'm using object migration, and not object release. In theory, the platform should always be smart enough to swap out unused buffers to make room for used ones, and the object migration feature should only be a hint about when to do it. In other words, at object migration time the platform should free up the device resources used for the buffer, and thus it should always be possible to run the kernel. However, after using the 4th buffer, subsequent calls fail with an out of resources error.
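In outline, the test loop does something like this (a simplified sketch, not the actual code from the repository; bufs, kernel and global_size are placeholders):

// Simplified sketch of the test: run a kernel on (buffer 0, buffer i),
// migrating the previously used buffer back to the host each time.
for (cl_uint i = 1; i < num_buffers; ++i) {
    if (i > 1)  // hint the runtime to evict the buffer used in the previous iteration
        clEnqueueMigrateMemObjects(queue, 1, &bufs[i - 1],
                                   CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufs[i]);
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        break;  // on my setup this starts failing with an out-of-resources error
                // after the 4th buffer
}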

I'm open to suggestions about how to improve this example, but I think it should be enough to track down what is going on (and what is NOT going on) during object migration. I'm particularly interested in the opinion of the original poster, whether the example I created reflects, in some way, what he was planning on doing.

Note that the failure of this example highlights two issues with the current AMD platform:

  1. Objects are not swapped out of the device even when they could be swapped out to make room for objects that are needed (i.e. when a kernel does not use an object that was previously used, that object could be swapped out to make room for the objects the kernel does need). Of course the OpenCL specification does not say anything about when this should happen, so I'm not sure this counts as an actual spec violation (I would have to reread the specification to get a clearer idea about this), but it's quite obvious that the concept of an OpenCL buffer is abstract enough (compared to straightforward device memory pointers) to allow this kind of buffer juggling by a smart-enough platform. Essentially, there is no need for the object to physically reside on the device except when it is needed, so keeping it there or not is a matter of efficiency; but sacrificing functionality for efficiency is not a good thing.
  2. Most importantly, objects are not being migrated out of the device even when the programmer explicitly asks the platform to do so. I can understand the platform not trying to second-guess the developer and not swapping buffers out unless explicitly requested, but when object migration is used to migrate objects to the host, the device resources should be freed (and then reallocated, potentially at a different address, if/when the object is migrated back to the device).
simon1
Journeyman III

Re: Re: Re: More pinned host-memory than device-memory capacity

Thanks for your interest in that topic!

I tried the solution you suggested in your first post, and I hit the same limitations.


gbilotta wrote:

I'm particularly interested in the opinion of the original poster, whether the example I created reflects, in some way, what he was planning on doing.

Here is what seems to be limiting in your implementation (assuming it had worked as you expected):

In my case, the input buffers are read-only for the device; thus, in order to be as efficient as possible, the implementation should take into account the permission flags used at buffer creation. When a buffer is read-only for the device, there shouldn't be any device-to-host transfer of its data (and I really don't want such transfers to overload the PCIe bus).

But if you're asking for a "migration", it implicitly means you don't want a copy to remain in host memory, and the way-back transfer is then unavoidable.

In my (not so) specific case, a simple copy is enough. What I really need is a clCreateHostBuffer function that just... allocates a chunk of pinned host memory. Maybe a clEnqueueCacheBuffer would be more elegant than performing explicit copies and device allocations, but I really don't think the word "migrate" reflects what is missing here.

gbilotta
Adept III

Re: More pinned host-memory than device-memory capacity

simon wrote:

Thanks for your interest in that topic!

I tried the solution you suggested in your first post, and I hit the same limitations.

This is kind of expected: the whole point of the test code is to show that migration does not work as expected. I hope that the code is simple enough to allow AMD engineers to look into it, in the hope of having it fixed for the next release.

simon wrote:

In my case, the input buffers are read-only for the device; thus, in order to be as efficient as possible, the implementation should take into account the permission flags used at buffer creation. When a buffer is read-only for the device, there shouldn't be any device-to-host transfer of its data (and I really don't want such transfers to overload the PCIe bus).

But if you're asking for a "migration", it implicitly means you don't want a copy to remain in host memory, and the way-back transfer is then unavoidable.

This is actually the reason why I'm using the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag: it means that there is no need to migrate the content of the buffer, so no memory copy should be necessary (assuming, of course, a smart enough implementation).
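That is, the move to the host is issued along these lines (sketch; buf is a placeholder):

// Migrate the buffer's "location" to the host without transferring its content:
// CONTENT_UNDEFINED tells the runtime the data does not need to be preserved.
clEnqueueMigrateMemObjects(queue, 1, &buf,
                           CL_MIGRATE_MEM_OBJECT_HOST |
                           CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED,
                           0, NULL, NULL);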

With your approach, I think that what happens is that the AMD platform attempts to copy the data to the device as soon as you unmap it from the host. I'm starting to think that maybe a combination of your approach and mine might yield the expected results, provided the platform does the right thing. I'm going to do some tests now and then report back.

EDIT: Ok, I think we're getting closer. I've added a second file to my github repository, which tries a different approach: it allocates gmem/alloc_max + 1 ‘host’ buffers and migrates them to the host right after allocation (you can probably try migrating them all at once; object migration can handle more than one buffer at a time), then it allocates 2 ‘device’ buffers and uses map/unmap on the ‘host’ buffers and then copies to the second device buffer. I have 5 256MB ‘host’ buffers + 2 256MB ‘device’ buffers on a 1GB card.

Performance is not particularly thrilling, but I'm not doing anything to overlap computations and transfers, so that is to be expected. Also, the API trace claims that the ‘host’ buffers are device-resident, which is not particularly convincing, and zero-copy is allegedly not being used.

I think AMD engineers should look more closely into this object migration stuff. Maybe my test programs will help them in this regard.

Also note that you cannot have too many host buffers anyway, because you cannot pin arbitrary amounts of memory. But at least you can overcommit this way. (In practice, I think that you might want to have three device buffers so you can double-buffer and overlap computations on one and transfers to the other.)

Also, at this point you probably don't need more than two or three buffers on the host either, since you can write to one while the other is being transferred (provided the AMD platform is smart enough to do actual asynchronous data transfers this way).

Hope this helps.

simon1
Journeyman III

Re: Re: More pinned host-memory than device-memory capacity


gbilotta wrote:

This is actually the reason why I'm using the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag: it means that there is no need to migrate the content of the buffer, so no memory copy should be necessary (assuming, of course, a smart enough implementation).

This is exactly why I said the migrate feature doesn't fit my problem: I do want to keep the content of the buffers consistent. When migrating the buffers back to the host with the CONTENT_UNDEFINED flag, the only guarantee the spec gives you is that the chunk is in the host memory space. But I do want the guarantee that my reference patterns (what I called the gallery) are still in the buffers, in order to be able to transfer them back to the device for further processing!

Thus I can't use the CONTENT_UNDEFINED flag, because the content of my buffers does matter, but I don't want a device-to-host transfer either. The only workaround I see to this is, as I said before, a smart enough implementation that takes into account the RW permissions given at buffer creation (this is maybe already the case, but we need more clarification from AMD on this).


gbilotta wrote:

I've added a second file to my github repository, which tries a different approach: it allocates gmem/alloc_max + 1 ‘host’ buffers and migrates them to the host right after allocation (you can probably try migrating them all at once; object migration can handle more than one buffer at a time), then it allocates 2 ‘device’ buffers and uses map/unmap on the ‘host’ buffers and then copies to the second device buffer. I have 5 256MB ‘host’ buffers + 2 256MB ‘device’ buffers on a 1GB card.

I tried that too (I did a lot of tests when I saw your first post about clEnqueueMigrate). On my computer it crashes beyond 50% of "host" memory.


gbilotta wrote:

Also note that you cannot have too many host buffers anyway, because you cannot pin arbitrary amounts of memory.

This is the point of my topic: I do want an arbitrary amount of host buffers. I know that is not possible, but I would like to know exactly why such a limit (other than the physical amount of memory) has been set. After all, I can (m)allocate almost 2GB of page-aligned, page-locked and non-cacheable memory (which is what I think a "pinned" memory chunk is).


gbilotta wrote:

(In practice, I think that you might want to have three device buffers so you can double-buffer and overlap computations on one and transfers to the other.) Also, at this point you probably don't need more than two or three buffers on the host either, since you can write to one while the other is being transferred (provided the AMD platform is smart enough to do actual asynchronous data transfers this way).

Two regular "device" buffers are enough here, since I can map and transfer one buffer while the other is being used in computation. I think you're missing my initial problem: the whole thing works, and the transfers are correctly overlapped with computation in my current implementation, but I want to save as much RAM space as possible, as well as CPU time (which is very critical in my case).

What I wanted to avoid is the extra host-side copy needed to "prepare" the data for asynchronous upload at full memory bandwidth (a clEnqueueMapBuffer, memcpy, clEnqueueUnmapMemObject sequence, for instance). This clearly doesn't seem to be possible with the current AMD implementation.
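To be explicit, the staging sequence I'd like to avoid is roughly this (a sketch; pinned_buf, gallery_chunk and device_buffers[cur] are placeholders):

// The extra host-side copy I'd like to avoid: staging each chunk through a
// mapped pinned buffer before uploading it.
cl_int err;
int *staging = (int *)clEnqueueMapBuffer(queue, pinned_buf, CL_TRUE,
                                         CL_MAP_WRITE_INVALIDATE_REGION,
                                         0, chunk_size * sizeof(int),
                                         0, NULL, NULL, &err);
memcpy(staging, gallery_chunk[i], chunk_size * sizeof(int)); // the copy in question
clEnqueueUnmapMemObject(queue, pinned_buf, staging, 0, NULL, NULL);
clEnqueueCopyBuffer(queue, pinned_buf, device_buffers[cur],
                    0, 0, chunk_size * sizeof(int), 0, NULL, NULL);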

Maybe I'm overengineering all this, but I think this is the kind of problem we should be able to solve easily in the near future.
