
simon1
Journeyman III

More pinned host-memory than device-memory capacity

Context: I'm working on a matching algorithm: basically, an unknown pattern is compared to a gallery in order to find the best match. The gallery contains up to a billion examples, which is about 30GB (and fits into host memory in my case).

In the CUDA version of my implementation, I allocate two buffers as big as possible on the GPU and split the gallery into chunks of pinned host memory (using cudaMallocHost). This allows me to upload the chunks to the device without any extra copy and at the highest bandwidth, and to process one of the buffers while the other one is filling.

In section 3.1.1 of their OpenCL Best Practices Guide, NVIDIA explains how to do the same in OpenCL.

Here's how I tried it with my AMD GPU:

// create host buffers
cl_mem host_buffers[num_host_buffers];
for (uint i = 0; i < num_host_buffers; i++) {
    host_buffers[i] = clCreateBuffer(context,
                                     CL_MEM_ALLOC_HOST_PTR,
                                     chunk_size * sizeof(int),
                                     ...);
}

// init host buffers
for (uint i = 0; i < num_host_buffers; i++) {
    int* m = (int*)clEnqueueMapBuffer(queue, host_buffers[i], CL_TRUE,
                                      CL_MAP_WRITE_INVALIDATE_REGION,
                                      0, chunk_size * sizeof(int),
                                      ...);
    // ... fill the buffer ...
    clEnqueueUnmapMemObject(queue, host_buffers[i], (void*)m, ...);
}

// alloc device buffers
for (uint i = 0; i < 2; i++) {
    device_buffers[i] = clCreateBuffer(context,
                                       CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                       chunk_size * sizeof(int), ...);
}

To upload the required chunk of data, I use a clEnqueueCopyBuffer from a host_buffer to a device_buffer. But the clEnqueueMapBuffer calls start failing with a CL_MAP_FAILURE once the VRAM capacity is reached.
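Concretely, the upload step is just this (a sketch; which of the two device buffers receives the chunk is up to the double-buffering logic):

// copy chunk i from pinned host memory into the device buffer that is
// currently being filled (wait list left empty, no event requested)
clEnqueueCopyBuffer(queue,
                    host_buffers[i],          // source: pinned host buffer
                    device_buffers[i % 2],    // destination: one of the two device buffers
                    0, 0,                     // src / dst offsets
                    chunk_size * sizeof(int),
                    0, NULL, NULL);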

Looking at the table in section 4.5.2 of the APP programming guide, it seems that there's no way to allocate "upload-ready" memory chunks on the host only (at least without that "VM" thing enabled).

Manually aligning, page-locking and making a memory chunk non-cacheable is not an option either. From the APP guide, section 4.5.1.2:

Currently, the runtime recognizes only data that is in pinned host memory for operation arguments that are memory objects it has allocated in pinned host memory.

To make things short: what is the best way to manage a (splittable) set of data that fits in host memory but not in device memory? Is it possible to avoid copies and take advantage of the highest available bandwidth at the same time?

0 Likes
22 Replies
gbilotta
Adept III

I do not know the answer to your problem, but one thing you could try is the OpenCL memobject migration feature available since 1.2 (so this will not work on NVIDIA until they decide to upgrade their OpenCL support). The approach I would try is the following: allocate all the buffers the way you are currently allocating your "host" buffers. Do not create any "device" buffers. Instead, use clEnqueueMigrateMemObjects to juggle the buffers: use a null migration flag when you want to move a buffer to the device, and the CL_MIGRATE_MEM_OBJECT_HOST flag to "unload" it from the device.

I cannot guarantee that this will work, but my understanding is that the memobject migration feature was added specifically for this purpose. If the platform implements it correctly, it should give the expected results. If not, we should probably ask AMD to make it work this way.
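Something along these lines (an untested sketch; queue is an in-order command queue on the GPU device and bufs[i] is one of the CL_MEM_ALLOC_HOST_PTR buffers):

// bring buffer i onto the device before the kernel that needs it
clEnqueueMigrateMemObjects(queue, 1, &bufs[i],
                           0,                 // no flag: migrate to the queue's device
                           0, NULL, NULL);

// ... enqueue the kernel(s) that read bufs[i] ...

// afterwards, ask the runtime to "unload" it back to the host
clEnqueueMigrateMemObjects(queue, 1, &bufs[i],
                           CL_MIGRATE_MEM_OBJECT_HOST,
                           0, NULL, NULL);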

0 Likes

I agree with the method suggested by gbilotta but, like him, I have not implemented it so far.

I will raise a request to have this behaviour demonstrated somehow in an APP sample, but do not expect it to come any time soon.

Also check the BufferBandwidth sample to see how to achieve maximum buffer transfer speed.

See http://devgurus.amd.com/message/1296694#1296694 for the out-of-core behaviour explained above.

Do you mean the above code works fine with NVIDIA cards, but fails with AMD? Can you check with smaller host-side data, as it may be the case that the map is allocating separate memory and clCreateBuffer is allocating separate memory for the same buffer.

0 Likes


gbilotta wrote:

The approach I would try is the following: allocate all the buffers the way you are currently allocating your "host" buffers. Do not create any "device" buffers. Instead, use clEnqueueMigrateMemObjects to juggle the buffers.

I like your idea, but my code fails before the creation of any "device" buffer. Maybe this solution would work when using a CPU device, but I don't want to rely on the availability of a CPU implementation.


himanshu.gautam wrote:

Do you mean the above code works fine with NVIDIA cards, but fails with AMD? Can you check with smaller host-side data, as it may be the case that the map is allocating separate memory and clCreateBuffer is allocating separate memory for the same buffer.

The above code comes directly from an NVIDIA sample, so it should be working. I'll test it if I can put my hands on an NVIDIA GPU.

With fewer buffers, or buffers of smaller size, it works. As I said, the mappings start failing when the total amount of requested data is greater than the device memory capacity.

The workaround I found is to create the "host" buffers with ALLOC_HOST_PTR | COPY_HOST_PTR to initialize them, instead of doing a clEnqueueMap. Still, when the total amount of requested "host" buffers goes beyond the GPU capacity, the whole process takes about 6 minutes instead of a few seconds...
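i.e. something like this (a sketch; gallery_chunk is a hypothetical pointer to the i-th chunk of the gallery in regular host memory):

// let the runtime copy the data in at creation time instead of mapping
cl_int err;
host_buffers[i] = clCreateBuffer(context,
                                 CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                                 chunk_size * sizeof(int),
                                 gallery_chunk,   // hypothetical source pointer
                                 &err);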


Is there a "host-side" copy involved when mapping a buffer created with ALLOC_HOST_PTR? That would be strange behaviour, since the allocated host memory should already meet all the requirements to be transferred at full bandwidth to the GPU.

0 Likes





simon wrote:

The workaround I found is to create the "host" buffers with ALLOC_HOST_PTR | COPY_HOST_PTR to initialize them, instead of doing a clEnqueueMap. Still, when the total amount of requested "host" buffers goes beyond the GPU capacity, the whole process takes about 6 minutes instead of a few seconds...

It might be that the GPU is not evicting buffers after use, and that is creating issues. Can you try using the clReleaseMemObject() API more aggressively? I mean, try to release a buffer as soon as it is not needed. If this does not help, I would request you to attach (as a ZIP) a testcase to showcase the issue.

0 Likes


himanshu.gautam wrote:

It might be that the GPU is not evicting buffers after use, and that is creating issues. Can you try using the clReleaseMemObject() API more aggressively? I mean, try to release a buffer as soon as it is not needed. If this does not help, I would request you to attach (as a ZIP) a testcase to showcase the issue.

I don't want to release the buffers (those I created with ALLOC_HOST_PTR), since I need them during the whole program execution (which can go on for up to a few months).

That issue is resolved: the strange timings were probably due to a lack of free CPU memory (which couldn't be swapped out, since the example I gave with clEnqueueMap crashed without freeing the page-locked memory chunks).

However, the CL_MAP_FAILURE error still appears if I try to use a clEnqueueMap on the host buffers.

Here is a complete example that shows the error: https://gist.github.com/notSimon/5812728

On my dev computer (HD7750 1GB, 2GB of host memory), it starts failing with 10 buffers, which is 640MB of host memory. That should fit in the remaining free memory, even if the mapping (for obscure reasons) allocates another copy of each buffer.

Edit: clEnqueueWriteBuffer fails with an OUT_OF_RESOURCES error whenever clEnqueueMap does.

0 Likes

I took the liberty of implementing the approach I proposed based on the memory migration option. It's available on GitHub with a public domain license. The program does a rather simple thing: it allocates a large number of buffers (enough to overcommit device memory) with CL_MEM_ALLOC_HOST_PTR, then it tries calling a simple kernel with two buffers at a time: buffer #0 and a newly selected buffer. Whenever a new buffer is being used, the previous one is migrated to the host. Note that I'm using object migration, and not object release. In theory, the platform should always be smart enough to swap out unused buffers to make room for used ones, and the object migration feature should only be a hint about when to do it. In other words, at object migration time the platform should free up the device resources used for the buffer, and thus it should always be possible to run the kernel. However, after using the 4th buffer, subsequent calls fail with an out of resources error.
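Stripped down, the core loop is essentially this (a simplified sketch, not the literal code from the repository; bufs, num_bufs, kernel and gws stand in for the actual names, and error checking is omitted):

// buffer #0 stays bound as the first argument; the second argument
// cycles through the remaining buffers, migrating the previously used
// one back to the host (content discarded) before each launch
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufs[0]);
for (cl_uint i = 1; i < num_bufs; ++i) {
    if (i > 1)
        clEnqueueMigrateMemObjects(queue, 1, &bufs[i - 1],
                                   CL_MIGRATE_MEM_OBJECT_HOST |
                                   CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED,
                                   0, NULL, NULL);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufs[i]);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
    clFinish(queue);
}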

I'm open to suggestions about how to improve this example, but I think it should be enough to track down what is going on (and what is NOT going on) during object migration. I'm particularly interested in the opinion of the original poster, whether the example I created reflects, in some way, what he was planning to do.

Note that the failure of this example highlights two issues with the current AMD platform:

  1. Objects are not swapped out of the device when they could be, to make room for objects that are needed (i.e. when a kernel does not use an object that was previously used, that object could be swapped out to make room for the objects needed by the kernel). Of course the OpenCL specification does not say anything about when this should happen, so I'm not sure this counts as an actual spec violation (I would have to reread the specification to get a clearer idea about this), but it's quite obvious that the concept of an OpenCL buffer is abstract enough (compared to straightforward device memory pointers) to allow this kind of buffer juggling by a smart enough platform. Essentially, there is no need for the objects to physically reside on the device except when needed, so keeping them there or not is a matter of efficiency; but sacrificing functionality for efficiency is not a good thing.
  2. Most importantly, objects are not being migrated out of the device even when the programmer explicitly asks the platform to do this. Now, I can understand the platform not trying to second-guess the developer and not swapping buffers out unless explicitly requested to do so, but when object migration is used to migrate objects to the host, the device resources should be freed (and then be reallocated, potentially at a different address, if/when the object is migrated back to the device).
0 Likes

Thanks for your interest in that topic!

I tried the solution you suggested in your first post, and I hit the same limitations.


gbilotta wrote:

I'm particularly interested in the opinion of the original poster, whether the example I created reflects, in some way, what he was planning to do.

Here is what seems to be limiting in your implementation (assuming it had worked as you expected):

In my case, the input buffers are read-only for the device; thus, in order to be as efficient as possible, the implementation should take into account the permission flags used at buffer creation. When a buffer is read-only for the device, there shouldn't be any device-to-host transfers for that data (and I really don't want them to overload the PCIe bus).

But if you're asking for a "migration", it implicitly means you don't want a copy to remain in host memory, and the way-back transfer is then unavoidable.

In my (not so) specific case, a simple copy is enough. What I really need is a clCreateHostBuffer function that just... allocates a chunk of pinned host memory. Maybe a clEnqueueCacheBuffer could be more elegant than performing explicit copies and device allocations, but I really don't think the word "migrate" reflects what is missing here.

0 Likes

simon wrote:

Thanks for your interest in that topic!

I tried the solution you suggested in your first post, and I hit the same limitations.

This is kind of expected: the whole point of the test code is to show that migration does not work as expected. I hope the code is simple enough for AMD engineers to look into it, in the hope of having it fixed in the next release.

In my case, the input buffers are read-only for the device; thus, in order to be as efficient as possible, the implementation should take into account the permission flags used at buffer creation. When a buffer is read-only for the device, there shouldn't be any device-to-host transfers for that data (and I really don't want them to overload the PCIe bus).

But if you're asking for a "migration", it implicitly means you don't want a copy to remain in host memory, and the way-back transfer is then unavoidable.

This is actually the reason why I'm using the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag: it means that there is no need to migrate the content of the buffer, so no memory copy should be necessary (assuming, of course, a smart enough implementation).

With your approach, I think that what happens is that the AMD platform attempts to copy the data to the device as soon as you unmap it from the host. I'm starting to think that maybe a combination of your approach and mine might yield the expected results, provided the platform does the right thing. I'm going to do some tests now and then report back.

EDIT: OK, I think we're getting closer. I've added a second file to my GitHub repository that tries a different approach: it allocates gmem/alloc_max + 1 ‘host’ buffers and migrates them to the host right after allocation (you can probably try migrating them all at once, object migration can handle more than one buffer at a time), then it allocates 2 ‘device’ buffers, uses map/unmap on the ‘host’ buffers and then copies to the second device buffer. I have 5 256MB ‘host’ buffers + 2 256MB ‘device’ buffers on a 1GB card.
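In code, that step is roughly (a sketch; host_bufs, dev_bufs, num_host_bufs and buf_size stand in for the actual names):

// park all the "host" buffers on the host right after creation; their
// content is undefined at this point anyway, so no copy is needed
clEnqueueMigrateMemObjects(queue, num_host_bufs, host_bufs,
                           CL_MIGRATE_MEM_OBJECT_HOST |
                           CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED,
                           0, NULL, NULL);

// later, for each chunk: fill host_bufs[i] via map/unmap, then copy it
// into one of the resident "device" buffers for the kernel to use
clEnqueueCopyBuffer(queue, host_bufs[i], dev_bufs[1],
                    0, 0, buf_size, 0, NULL, NULL);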

Performance is not particularly thrilling, but I'm not doing anything to overlap computations and transfers, so that is to be expected. Also, the API trace claims that the ‘host’ buffers are device-resident, which is not particularly convincing, and, allegedly, zero-copy is not being used.

I think AMD engineers should look better into this object migration stuff. Maybe my test programs will help them in this regard.

Also note that you cannot have too many host buffers anyway, because you cannot pin arbitrary amounts of memory. But at least you can overcommit this way. (In practice, I think that you might want to have three device buffers so you can double-buffer and overlap computations on one and transfers to the other.)

Also, at this point you probably don't need more than two or three buffers on the host either, since you can write to one while the other is being transferred (provided the AMD platform is smart enough to do actual asynchronous data transfers this way).

Hope this helps.

0 Likes


gbilotta wrote:

This is actually the reason why I'm using the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag: it means that there is no need to migrate the content of the buffer, so no memory copy should be necessary (assuming, of course, a smart enough implementation).

This is exactly why I said the migrate feature doesn't fit my problem. I do want to keep the content of the buffers consistent. When migrating the buffers back to the host with the CONTENT_UNDEFINED flag, the only guarantee the spec gives you is that the chunk is in the host memory space. But I do want the guarantee that my reference patterns (what I called the gallery) are still in the buffers, in order to be able to transfer them back to the device for further processing!

Thus I can't use the CONTENT_UNDEFINED flag, because the content of my buffers does matter, but I don't want a device-to-host transfer either. The only workaround I see is, as I said before, a smart enough implementation that takes into account the RW permissions given at buffer creation (maybe this is already the case, but we need more clarification from AMD on this).


gbilotta wrote:

I've added a second file to my GitHub repository that tries a different approach: it allocates gmem/alloc_max + 1 ‘host’ buffers and migrates them to the host right after allocation (you can probably try migrating them all at once, object migration can handle more than one buffer at a time), then it allocates 2 ‘device’ buffers, uses map/unmap on the ‘host’ buffers and then copies to the second device buffer. I have 5 256MB ‘host’ buffers + 2 256MB ‘device’ buffers on a 1GB card.

I tried that too (I did a lot of tests when I saw your first post about clEnqueueMigrate). On my computer it crashes beyond 50% of "host" memory.


gbilotta wrote:

Also note that you cannot have too many host buffers anyway, because you cannot pin arbitrary amounts of memory.

This is the point of my topic: I do want an arbitrary amount of host buffers. I know that is not possible, but I would like to know exactly why such a limit (other than the physical amount of memory) has been set. After all, I can (m)allocate almost 2GB of page-aligned, page-locked and non-cacheable memory (which is what I think a "pinned" memory chunk is).
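Roughly this kind of allocation, I mean (a sketch covering the page-aligned, page-locked part only; as far as I know, making the memory non-cacheable is not something an ordinary user-space allocation can do):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* sketch: page-aligned, page-locked allocation; the amount that can be
   locked is subject to the RLIMIT_MEMLOCK limit */
static void *alloc_pinned(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, 4096, size) != 0)
        return NULL;
    if (mlock(p, size) != 0) {   /* lock the pages into physical RAM */
        free(p);
        return NULL;
    }
    memset(p, 0, size);          /* touch the pages */
    return p;
}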


gbilotta wrote:

(In practice, I think that you might want to have three device buffers so you can double-buffer and overlap computations on one and transfers to the other.) Also, at this point you probably don't need more than two or three buffers on the host either, since you can write to one while the other is being transferred (provided the AMD platform is smart enough to do actual asynchronous data transfers this way).

Two regular "device" buffers are enough here, since I can map and transfer a buffer while the other one is being used for computation. I think you're missing my initial problem: the whole thing works, and the transfers are correctly overlapped with computation in my current implementation, but I want to save as much RAM space as possible, as well as CPU time (which is very critical in my case).
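For reference, the kind of double-buffered loop I mean looks roughly like this (a sketch only: transfer_queue, compute_queue, kernel and gws are placeholder names, the kernel argument setup is simplified, and event releases are omitted):

cl_event copy_done[2];
cl_event kernel_done[2] = { NULL, NULL };
int cur = 0;

// prime the pipeline: upload chunk 0 into the first device buffer
clEnqueueCopyBuffer(transfer_queue, host_buffers[0], device_buffers[0],
                    0, 0, chunk_size * sizeof(int), 0, NULL, &copy_done[0]);

for (cl_uint chunk = 0; chunk < num_chunks; ++chunk) {
    int next = 1 - cur;

    // upload the following chunk into the other device buffer, waiting
    // for any kernel that previously used that buffer to finish
    if (chunk + 1 < num_chunks)
        clEnqueueCopyBuffer(transfer_queue,
                            host_buffers[(chunk + 1) % num_host_buffers],
                            device_buffers[next],
                            0, 0, chunk_size * sizeof(int),
                            kernel_done[next] ? 1 : 0,
                            kernel_done[next] ? &kernel_done[next] : NULL,
                            &copy_done[next]);

    // process the current chunk once its upload has completed
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_buffers[cur]);
    clEnqueueNDRangeKernel(compute_queue, kernel, 1, NULL, &gws, NULL,
                           1, &copy_done[cur], &kernel_done[cur]);

    cur = next;
}
clFinish(transfer_queue);
clFinish(compute_queue);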

What I wanted to avoid is the extra host-side copy needed to "prepare" the data so it can be uploaded asynchronously at full memory bandwidth (by doing a clMap, memcpy, clUnmap, for instance). This clearly doesn't seem to be possible with the current AMD implementation.

Maybe I'm overengineering all this, but I think this is the kind of problem we should be able to solve easily in the near future.

0 Likes

Check sections 4.5 and 4.6 of the AMD OpenCL Programming Guide to optimize the data transfers.

The device_buffer concept may be useful on NVIDIA, and it certainly works with AMD too. Can you try not using any device_buffer? In AMD's case, you can just allocate a buffer, map/unmap it and then directly pass it as a kernel argument.
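i.e. something like this (a sketch; gws and the kernel argument index are placeholders):

// no separate device buffer: fill the pre-pinned buffer via map/unmap
// and hand it straight to the kernel
cl_int err;
int *p = (int *)clEnqueueMapBuffer(queue, host_buffers[i], CL_TRUE,
                                   CL_MAP_WRITE_INVALIDATE_REGION,
                                   0, chunk_size * sizeof(int),
                                   0, NULL, NULL, &err);
// ... fill p with the i-th chunk of the gallery ...
clEnqueueUnmapMemObject(queue, host_buffers[i], p, 0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &host_buffers[i]);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);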

0 Likes


himanshu.gautam wrote:

Check sections 4.5 and 4.6 of the AMD OpenCL Programming Guide to optimize the data transfers.

The device_buffer concept may be useful on NVIDIA, and it certainly works with AMD too. Can you try not using any device_buffer? In AMD's case, you can just allocate a buffer, map/unmap it and then directly pass it as a kernel argument.


The problem with using map/unmap is that when a buffer is no longer used, the AMD platform does not migrate it off the device to make room for the buffers that are needed. Say that you have 5 buffers buf0 ... buf4, each 1/4th of the total device memory, and that you use in a kernel buf0 and buf1, then buf0 and buf2, then buf0 and buf3, then buf0 and buf4. When you get to buf0 and buf4, all the other buffers (1, 2, 3) are still resident in device memory, so mapping buffer 4 will fail due to insufficient device memory. What we want is a way of unloading buffers from device memory without releasing the OpenCL buffers (because you still want to use them host-side, for example, or for any other reason). This is what is failing: there seems to be no way to tell the AMD platform “keep this OpenCL buffer around, but take it off the device (for the time being)”. This is exactly what mem_object migration was invented for, but in AMD's platform it does not act as expected.

0 Likes

I do understand your point. And probably you would also agree that this is a very good feature to have in the runtime, but the current behavior is not a spec violation. Let me take this issue up with some senior people and decide about it. But as of now, the only workaround I know of is to release the buffer.

Also, clEnqueueMigrateMemObjects is meant to make a copy of a buffer available on a specific device at some specific time. IMHO, its description does not say that the buffer will be removed from the original location. Anyway, it certainly makes sense to keep the buffer in the original location if the buffer is READ_ONLY. For WRITE_ONLY buffers it would be sensible to delete the buffer from the original location, but I'm not sure if that happens.

0 Likes


himanshu.gautam wrote:

I do understand your point. And probably you would also agree that this is a very good feature to have in the runtime, but the current behavior is not a spec violation. Let me take this issue up with some senior people and decide about it. But as of now, the only workaround I know of is to release the buffer.

Also, clEnqueueMigrateMemObjects is meant to make a copy of a buffer available on a specific device at some specific time. IMHO, its description does not say that the buffer will be removed from the original location. Anyway, it certainly makes sense to keep the buffer in the original location if the buffer is READ_ONLY. For WRITE_ONLY buffers it would be sensible to delete the buffer from the original location, but I'm not sure if that happens.

I see. The problem is that the specification does not say anything about allocation and deallocation of the supporting memory for buffers. However, according to the spec, the purpose of object migration is to allow the user to choose "which device an OpenCL memory object resides [in]". This is not a copy but a migration, which, at least to me, sounds like it should rather be a move.

I assume that at the platform level what happens is that a buffer object has a list of allocated memory regions, one per device (plus potentially the host). Only one of these allocated memory regions is the 'active' one, in the sense of being the one holding the current buffer contents. What migration does for sure is copy the current buffer contents from the former 'active' device to the new 'active' device, but for efficiency reasons the platform is free to keep the memory allocated on the former 'active' device. This does sound like a legitimate interpretation of the specification.

So the issue is that, presently, the AMD platform's lazy allocation is not lazy enough, and it does not have lazy/smart deallocation. (The reason why I say that AMD's platform is not lazy enough in its allocation is that I suspect the moment a buffer is unmapped its contents get transferred to the device(s), after which it is just not possible to swap them out, meaning that there is no way to do what simon is trying to do.)

0 Likes


himanshu.gautam wrote:

Also, clEnqueueMigrateMemObjects is meant to make a copy of a buffer available on a specific device at some specific time. IMHO, its description does not say that the buffer will be removed from the original location. Anyway, it certainly makes sense to keep the buffer in the original location if the buffer is READ_ONLY. For WRITE_ONLY buffers it would be sensible to delete the buffer from the original location, but I'm not sure if that happens.

Could you get back to us with this information? It is not that easy to monitor what is happening under the hood.

0 Likes

simon wrote:

gbilotta wrote:

This is actually the reason why I'm using the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag: it means that there is no need to migrate the content of the buffer, so no memory copy should be necessary (assuming, of course, a smart enough implementation).

This is exactly why I said the migrate feature doesn't fit my problem. I do want to keep the content of the buffers consistent. When migrating the buffers back to the host with the CONTENT_UNDEFINED flag, the only guarantee the spec gives you is that the chunk is in the host memory space. But I do want the guarantee that my reference patterns (what I called the gallery) are still in the buffers, in order to be able to transfer them back to the device for further processing!

Thus I can't use the CONTENT_UNDEFINED flag, because the content of my buffers does matter, but I don't want a device-to-host transfer either. The only workaround I see is, as I said before, a smart enough implementation that takes into account the RW permissions given at buffer creation (maybe this is already the case, but we need more clarification from AMD on this).

Ah, I see. So the migration should be done without the CONTENT_UNDEFINED flag, and the platform should be smart enough to understand that when migrating a mem_object which is READ_ONLY for the device there should be no need to transfer data to the host. I agree with this.

gbilotta wrote:

I've added a second file to my GitHub repository that tries a different approach: it allocates gmem/alloc_max + 1 ‘host’ buffers and migrates them to the host right after allocation (you can probably try migrating them all at once, object migration can handle more than one buffer at a time), then it allocates 2 ‘device’ buffers, uses map/unmap on the ‘host’ buffers and then copies to the second device buffer. I have 5 256MB ‘host’ buffers + 2 256MB ‘device’ buffers on a 1GB card.

I tried that too (I did a lot of tests when I saw your first post about clEnqueueMigrate). On my computer it crashes beyond 50% of "host" memory.

This is strange. An actual crash? The only reason I can see for it to crash (aside from bugs in the implementation) would be an unhandled out-of-memory condition. Do you have enough physical free memory on the host? For me it works until I exhaust device memory, but then again I have a 1GB card and 4GB of RAM on the host.

gbilotta wrote:

Also note that you cannot have too many host buffers anyway, because you cannot pin arbitrary amounts of memory.

This is the point of my topic: I do want an arbitrary amount of host buffers. I know that is not possible, but I would like to know exactly why such a limit (other than the physical amount of memory) has been set. After all, I can (m)allocate almost 2GB of page-aligned, page-locked and non-cacheable memory (which is what I think a "pinned" memory chunk is).

Hm. I think there are operating system limits also on the number of page-locked buffers, not only on the actual amount of memory, but I'm not an operating system expert so don't take my word for it. It could also be that the AMD platform has some hard-coded (or computed from system resources) limit which is lower than the operating system limit, and that's the limit you're coming across. Not knowing much about AMD's internals, however, I cannot say.

gbilotta wrote:

(In practice, I think that you might want to have three device buffers so you can double-buffer and overlap computations on one and transfers to the other.) Also, at this point you probably don't need more than two or three buffers on the host either, since you can write to one while the other is being transferred (provided the AMD platform is smart enough to do actual asynchronous data transfers this way).

Two regular "device" buffers are enough here, since I can map and transfer a buffer while the other one is being used for computation. I think you're missing my initial problem: the whole thing works, and the transfers are correctly overlapped with computation in my current implementation, but I want to save as much RAM space as possible, as well as CPU time (which is very critical in my case).

What I wanted to avoid is the extra host-side copy needed to "prepare" the data so it can be uploaded asynchronously at full memory bandwidth (by doing a clMap, memcpy, clUnmap, for instance). This clearly doesn't seem to be possible with the current AMD implementation.

Maybe I'm overengineering all this, but I think this is the kind of problem we should be able to solve easily in the near future.

Honestly, I don't think you're overengineering this. It's quite obvious that there are some limitations in AMD's platform that could be solved at the driver level to make this kind of behavior work (and it would be far from rare in any application that needs to process a stream of data in chunks). One thing that got me thinking is that the use of map and unmap might be part of the problem, since it is possible that at unmap time AMD's platform decides “oh, so this buffer is not needed on the host anymore, let's migrate it to the device”. This is the kind of behavior that, while sensible in most ‘standard’ applications, should be preventable, e.g. by explicit object migration (which, as you said and I agreed, should be smart enough to take the host-side and device-side RW flags into account to determine when an actual data transfer is supposed to happen and when not).

0 Likes


gbilotta wrote:

This is strange. An actual crash? The only reason I can see for it to crash (aside from bugs in the implementation) would be an unhandled out-of-memory condition. Do you have enough physical free memory on the host? For me it works until I exhaust device memory, but then again I have a 1GB card and 4GB of RAM on the host.

Sorry, by "crash" I meant that my error checking of the OpenCL calls fails, with either a MAP_FAILURE or an OUT_OF_MEMORY error, as explained in a previous post. I just got some additional gigabytes of RAM; I'll run my tests again and see if there is any difference.


gbilotta wrote:

Hm. I think there are operating system limits also on the number of page-locked buffers, not only on the actual amount of memory, but I'm not an operating system expert so don't take my word for it. It could also be that the AMD platform has some hard-coded (or computed from system resources) limit which is lower than the operating system limit, and that's the limit you're coming across. Not knowing much about AMD's internals, however, I cannot say.

Thanks for that pointer, I'll gather some information about this.

EDIT: so there is a limit on the maximum amount of page-locked memory, which can be retrieved with getrlimit(RLIMIT_MEMLOCK, ...). In my case it is set to 64KiB. The limit can be increased if the process is run as superuser or with the CAP_SYS_RESOURCE capability. I'm not sure whether the OpenCL calls will inherit the increased resource limits; I'm going to try this as soon as I can.
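For reference, a quick way to check the limit and, given sufficient privileges, raise it from within the process itself (sketch):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    getrlimit(RLIMIT_MEMLOCK, &rl);
    printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    /* try to raise the soft limit up to the hard limit; going beyond
       the hard limit requires superuser or CAP_SYS_RESOURCE */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        perror("setrlimit");
    return 0;
}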

0 Likes

Hi,

I recently wrote a test to overmap GPU memory, and it seems more buffers can be allocated than the GPU memory can hold, probably by swapping out buffers already inside GPU memory. Maybe you can try the code on your setup.
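Something along these lines (a sketch of what such a test could look like, not the actual sample; it assumes the usual context/queue setup and needs <stdio.h> and <stdlib.h>):

// sketch: allocate more 16 MB buffers than the device has memory for,
// then touch each one with a blocking write so the runtime has to
// place it on the device (or fail)
#define BUF_SIZE (16u << 20)

cl_int err;
cl_mem bufs[256];
int *scratch = calloc(1, BUF_SIZE);

for (cl_uint i = 0; i < num_bufs && i < 256; ++i) {
    bufs[i] = clCreateBuffer(context, CL_MEM_READ_WRITE, BUF_SIZE, NULL, &err);
    if (err != CL_SUCCESS) { printf("create %u: %d\n", i, err); break; }
    err = clEnqueueWriteBuffer(queue, bufs[i], CL_TRUE, 0, BUF_SIZE,
                               scratch, 0, NULL, NULL);
    printf("write %u: %d\n", i, err);
}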

0 Likes

So I tried this OverMappingBuffers on both NVIDIA and AMD. When the number of buffers is such that the device memory is overcommitted, NVIDIA fails with -4 (CL_MEM_OBJECT_ALLOCATION_FAILURE) relatively early (I suspect they don't do delayed allocation, so they realize they're out of GPU memory quite soon). On AMD, it starts chugging along, but then it fails with -5 (CL_OUT_OF_RESOURCES). This is running on a Cayman (HD6970) with 13.4 drivers. I really think buffers are never evicted from device memory in AMD's platform until they are released.

0 Likes


gbilotta wrote:

This is running on a Cayman (HD6970) with 13.4 drivers. I really think buffers are never evicted from device memory in AMD's platform until they are released.

I am not claiming that either, as of now. Although I am getting a -4 error on an HD 7870 with the 13.6 beta driver after over-committing happens.

gbilotta wrote:

On AMD, it starts chugging along, but then it fails with -5.

What does this mean? Can you tell how many 16MB buffers you were able to allocate on the AMD and NVIDIA cards?

0 Likes

I changed the code to print the buffer that is being processed, and these are the findings:

  • on the HD6970 I can get up to 64 buffers (1GB), which is the total amount of gmem reported by clinfo for the card. I also notice that when going beyond 32 buffers things start to slow down (especially if using read/write instead of map/unmap);
  • on the Tesla C2070 (6GB of memory, but under OpenCL it only works in 32-bit mode, so 4GB is the maximum we can use) I get to 250 buffers (4016MB), so it would seem the processing does happen, but it's much faster (no perceptible slowdown for any buffer);
  • strange results are coming from our HD7970: it should have 3GB of RAM according to the box, and OpenCL only shows 2GB, but the out-of-resources error only happens at 270 buffers (4336MB), while the host only has 4GB of RAM; does this mean that the driver (13.4 again, OpenCL 1.2 AMD-APP 1124.2) does evict buffers on this card, and is therefore limited by host memory instead? (In this case, by the way, out of resources is the correct error, I think, since it's running out of host resources to manage the buffers, not of device memory.)
0 Likes

With an HD 7750 (1GB), on Linux with 5GB of RAM and Catalyst 12.6: the clWrite and clMap versions both fail at 120 buffers (with -5 and -12, i.e. CL_OUT_OF_RESOURCES and CL_MAP_FAILURE), which is about 2GB of memory, but there is no obvious slowdown.

0 Likes

Nice to hear you have added some performance metrics to that test. Can you please share it in your GitHub repo? (Link above somewhere.)

It is interesting to know what happens on the HD 7750 + 5GB RAM machine. But Catalyst 12.6 is way old; can you check with the 13.6 beta there? It would also be useful if you could test on Windows as well. Performance is expected to be better on Windows.

0 Likes