11 Replies Latest reply on Dec 28, 2009 4:27 PM by omion

    Device vs host memory with buffers

    omion

      I just started looking into OpenCL with my brand new HD5850, but I ran into a bit of a problem:

      I have a context which includes both the CPU and GPU, but any buffers I create seem to be limited to the available GPU memory. My computer has 8GB RAM. OpenCL sees 1GB usable on the CPU and 256MB usable on the GPU. Once I try to allocate more than about 256MB of buffers, I get a message on stderr:


      C:\openclmainline\runtime\src\device\gpu\gpudevice.cpp:754: guarantee(!"We don't have enough video memory!")

      Then the whole program crashes. (That file doesn't exist, by the way - I assume the message is a placeholder in the source for the OpenCL driver.)

      Shouldn't the buffers be allocated on the host and only need to be on the GPU when it is using them? I kind of figured the driver would do all the memory swapping dynamically and automatically. Is this a function of the OpenCL spec or just of AMD's current implementation? If the latter, is there a fix planned?



        • Device vs host memory with buffers
          genaganna


          Originally posted by: omion


          Shouldn't the buffers be allocated on the host and only need to be on the GPU when it is using them? I kind of figured the driver would do all the memory swapping dynamically and automatically. Is this a function of the OpenCL spec or just of AMD's current implementation? If the latter, is there a fix planned?


          Omion,

                    Could you please clarify what you are looking for?

            • Device vs host memory with buffers
              omion

              There were two things wrong with the situation, as far as I can tell:

              1. The memory looks like it's allocated on all devices in a context, which doesn't seem right. I would have expected the host memory to be used instead, with each device caching it when a kernel needs it.

              2. If there really is a problem on the device, the function should return an error code rather than simply crashing.

              The second problem is definitely a bug. The first may just be me misunderstanding how the memory allocation works.

              I suppose I want to know if issue 2 is being worked on, and if issue 1 is actually a problem or just me not knowing how things work.

            • Device vs host memory with buffers
              n0thing


              Originally posted by: omion   Shouldn't the buffers be allocated on the host and only need to be on the GPU when it is using them?


              The buffers are allocated on the device and your buffer size is limited by CL_DEVICE_MAX_MEM_ALLOC_SIZE which is 256MB in your case.

              If your data set is larger, then you need to divide it into multiple buffers.

                • Device vs host memory with buffers
                  omion

                  @n0thing:

                  That's actually what I'm doing. The problem is that it doesn't help for some reason (which is why I think there is something wrong).

                  For example, I should be able to allocate three 100MB buffers, but the third one always makes the program crash. I found out that I can actually make 266 buffers, each 1MB in size (1048576 bytes) before it dies. With 265 buffers neither the stderr message nor the crash occurs.

                  Also, attempting to allocate a buffer larger than CL_DEVICE_MAX_MEM_ALLOC_SIZE will simply result in the function failing with CL_INVALID_BUFFER_SIZE, which is much easier to deal with than a full program crash.


                  Some background on what I'm doing: I need to work on data sets that may be up to 1TB in size. So I split up the set into slices that are around the size of the available system memory. However, as n0thing noted, the devices can't handle that much data. So the working set is actually represented by a number of buffers, each smaller than the smallest CL_DEVICE_MAX_MEM_ALLOC_SIZE for all devices.

                  So I have a data set that is 40GB in size. The program will do 20 passes with 2GB of memory at a time, with the 2GB represented by ten 200MB buffers.

                  I suppose I have a question to the other users: has anybody else run into this problem? It happens EVERY time I use a context that includes the GPU and the total buffer memory usage exceeds about 260MB.

                    • Device vs host memory with buffers
                      empty_knapsack


                      Originally posted by: omion I suppose I have a question to the other users: has anybody else run into this problem? It happens EVERY time I use a context that includes the GPU and the total buffer memory usage exceeds about 260MB.



                      Taking TemplateC example from SDK and increasing width from default 256 to 32*1024*1024 (thus, 128MB for input + 128MB for output) ends in the same way -- "C:\openclmainline\runtime\src\device\gpu\gpudevice.cpp:754: guarantee(!"We don't have enough video memory!")".

                      Obviously it's an SDK problem; it shouldn't crash at all, it should report some OUT_OF_MEMORY error, especially when we have 1GB of video RAM and are allocating only 1/4 of it in small chunks.

                  • Device vs host memory with buffers
                    MicahVillmow
                    Omion/Nou,
                    I think the key part of the spec that covers this is the return value of clCreateBuffer, section 5.2.1.
                    "CL_INVALID_BUFFER_SIZE if size is 0 or is greater than
                    CL_DEVICE_MAX_MEM_ALLOC_SIZE value specified in table 4.3 for all devices in
                    context."

                    The key wording is 'all devices'. So, the max amount of memory you can allocate for a context is the minimum reported by CL_DEVICE_MAX_MEM_ALLOC_SIZE of all devices associated with that context. In this case, the GPU is the bottleneck and limits the max allocation to 256MB.
                      • Device vs host memory with buffers
                        omion


                        Originally posted by: MicahVillmow Omion/Nou, I think the key part of the spec that covers this is the return value of clCreateBuffer, section 5.2.1. "CL_INVALID_BUFFER_SIZE if size is 0 or is greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE value specified in table 4.3 for all devices in context." The key wording is 'all devices'. So, the max amount of memory you can allocate for a context is the minimum reported by CL_DEVICE_MAX_MEM_ALLOC_SIZE of all devices associated with that context. In this case, the GPU is the bottleneck and limits the max allocation to 256MB.


                        Well, that's not quite what's happening (which is the problem). The spec says that each buffer needs to be no larger than the minimum of all the CL_DEVICE_MAX_MEM_ALLOC_SIZE values, but the crash occurs when the total number of bytes across all buffers exceeds that amount.

                        So, if I allocate a single 200MB buffer, the program is fine. If I then allocate another 200MB buffer (which should still be fine according to the spec, since the size requested is less than CL_DEVICE_MAX_MEM_ALLOC_SIZE), the program completely crashes. No error, just a crash.

                      • Device vs host memory with buffers
                        MicahVillmow
                        Yes, I agree that the crashing is a problem and this has been reported. Also, can you point me to the part of the spec that you believe states that this should be valid? I can't seem to find anything.
                          • Device vs host memory with buffers
                            omion


                            Originally posted by: MicahVillmow Also, can you point me to the part of the spec that you believe states that this should be valid? I can't seem to find anything.


                            It was actually in the part of the spec that you quoted. It says that CL_INVALID_BUFFER_SIZE is returned if size is greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE for all devices. The size here refers to the third argument to clCreateBuffer, not the total size of all buffers.

                            I'll give a pseudo-code example:

                            cl_int err;

                            buf1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 200*1024*1024, NULL, &err);

                            buf2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 200*1024*1024, NULL, &err);

                            For both clCreateBuffer calls, size is less than 256MB, therefore neither allocation should fail with CL_INVALID_BUFFER_SIZE.

                            Of course, after enough allocations the host's memory will run out, in which case CL_OUT_OF_HOST_MEMORY or CL_MEM_OBJECT_ALLOCATION_FAILURE would be returned (I'm not really sure which one, though).