12 Replies Latest reply on Nov 14, 2013 11:54 PM by himanshu.gautam

    How to handle ClCreateBuffer Constraints?

    ifrah

      Hey all,

       

      I have a very simple problem in hand but the solution does not seem to be trivial.

      I have more than 1 gigabyte of data (uints in an array). I am applying a filter to it and writing the result to another array.

      I am doing this on an AMD Devastator using OpenCL. clinfo reports a max memory allocation of 200540160 bytes on the Devastator, although its global memory size is 536870912 bytes. The CPU device shows a max memory allocation of 4146315264 bytes and a global memory size of 16585261056 bytes. This means that I have a lot of memory for my data.

      Now the problem is that I cannot allocate a buffer (clCreateBuffer) with more than 32 million elements, which follows from the max memory allocation on the Devastator. How can I handle an array of more than a gigabyte, or even a terabyte, of elements? Where and how should I store these elements (in CPU memory or in GPU memory), and how can the GPU access them?

       

      This problem becomes worse when my output array needs 10 times (or more) as many elements as the input array (another scenario), because then the number of input elements I can process at a time is reduced by a factor of 4. The number of input elements I can send is then limited to 2 million.

       

      Any suggestions to solve these problems?

        • Re: How to handle ClCreateBuffer Constraints?
          himanshu.gautam

          Allocate them as multiple buffers. I don't know much about your setup.

          But, I believe devastator is an APU part.

           

          Few tips:

           

          For input buffers that the kernel only reads, use the CL_MEM_READ_ONLY | CL_MEM_PERSISTENT_MEM_AMD flags.

          This creates the buffer inside the memory reserved for the GPU, which gives the GPU high read/write bandwidth.

          The CPU gets good write bandwidth to it but very low (uncached) read bandwidth.

           

          If you use CL_MEM_ALLOC_HOST_PTR, there is added latency because the GPU VM tables have to be populated and the memory has to be pinned.

           

          Please check the useful PDF below to figure out how to allocate memory on APUs (that are not HSA based)

          http://amddevcentral.com/afds/assets/presentations/1004_final.pdf
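
          The "multiple buffers" advice can be sketched as a host-side chunking loop. This is only a rough illustration under stated assumptions, not code from the thread: `chunk_input_elems`, `process_in_chunks`, and `example_filter` are hypothetical names, and the callback stands in for the real "write chunk to device, run kernel, read result back" sequence, which would use the OpenCL API. The sizing assumes that the input and output buffers must each fit within the device's max memory allocation and that both must fit together in global memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: stands in for "upload n inputs, run the kernel,
   download n * out_per_in outputs". In real code this would wrap OpenCL calls. */
typedef void (*chunk_fn)(const uint32_t *in, uint32_t *out, size_t n);

/* Largest number of input elements per chunk such that
   (a) the output buffer (out_per_in outputs per input) fits in max_alloc_bytes, and
   (b) input + output together fit in global_mem_bytes. */
static size_t chunk_input_elems(uint64_t max_alloc_bytes, uint64_t global_mem_bytes,
                                size_t elem_size, size_t out_per_in)
{
    if (elem_size == 0 || out_per_in == 0) return 0;
    uint64_t by_alloc  = max_alloc_bytes / elem_size / out_per_in;
    uint64_t by_global = global_mem_bytes / ((uint64_t)elem_size * (out_per_in + 1));
    return (size_t)(by_alloc < by_global ? by_alloc : by_global);
}

/* Walk the whole host array chunk by chunk; returns the number of chunks processed. */
static size_t process_in_chunks(const uint32_t *in, uint32_t *out, size_t total,
                                size_t per_chunk, size_t out_per_in, chunk_fn fn)
{
    size_t chunks = 0;
    if (per_chunk == 0) return 0;
    for (size_t off = 0; off < total; off += per_chunk) {
        size_t remaining = total - off;
        size_t n = remaining < per_chunk ? remaining : per_chunk;
        fn(in + off, out + off * out_per_in, n);
        ++chunks;
    }
    return chunks;
}

/* Hypothetical one-output-per-input filter used only to exercise the loop. */
static void example_filter(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = in[i] * 2u;
}
```

          With the clinfo numbers quoted above (200540160 and 536870912 bytes) and 4-byte uints, this gives 50135040 input elements per chunk for a one-to-one filter and 5013504 when the output is 10 times larger. The full data set stays in host memory and only one chunk at a time lives on the device.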

            • Re: How to handle ClCreateBuffer Constraints?
              ifrah

              Thanks, I am using the persistent flag for my input buffer and alloc host ptr for my output buffer. Even with these, the bandwidth I am getting is a fraction of a GB/s (less than 0.5), not the 6 or 8 GB/s specified in the doc. I am calculating this as number of elements / time taken in the kernel. Why is that?

                • Re: How to handle ClCreateBuffer Constraints?
                  himanshu.gautam

                  Number of elements / time taken is fine, but are you multiplying by the sizeof() of the datatype?

                  For example, if you are processing 1 million floats per second, that is 1 million * sizeof(float) == 4 MB per second.

                   

                  Also, your kernel is obviously doing computation and not just memory transfers.

                  So you cannot hit those numbers unless you are doing a simple memcpy.
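
                  The calculation above can be sketched in a few lines of C. The function names are made up for illustration, and 1 GB is taken as 1e9 bytes. Note that this counts bytes in one direction only; a kernel that reads and writes each element moves more data than this reports.

```c
#include <stddef.h>

/* Bytes moved per second: element count times element size, divided by elapsed time. */
static double bytes_per_second(size_t elems, size_t elem_size, double seconds)
{
    if (seconds <= 0.0) return 0.0;   /* avoid dividing by a zero or negative interval */
    return ((double)elems * (double)elem_size) / seconds;
}

/* Same figure expressed in GB/s, with 1 GB = 1e9 bytes. */
static double gb_per_second(size_t elems, size_t elem_size, double seconds)
{
    return bytes_per_second(elems, elem_size, seconds) / 1e9;
}
```

                  For 1 million floats in one second, `bytes_per_second(1000000, sizeof(float), 1.0)` gives 4000000 bytes per second, i.e. 4 MB/s.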

                   

                  HTH,

                  Best Regards,

                  Bruhaspati

                  • Re: How to handle ClCreateBuffer Constraints?
                    himanshu.gautam


                    Also, I recommend you run the "BufferBandwidth" program that comes with the APP SDK on the APU.

                    That's the benchmark you should strive to reach.

                     

                    The sample can be run with different memory flags, so you can see the effect of each flag on performance under different scenarios.

                     

                    - Bruhaspati

                      • Re: How to handle ClCreateBuffer Constraints?
                        ifrah

                        Thanks, I have gone through BufferBandwidth a couple of times and all 4 tests are pretty confusing. For example, why do we need the copy buffer in the 4th (pre-pinned) test? In the 1st test, why can't we change the flags for resultBuffer? Even the document that comes with the sample does not explain each test. Can you explain each test in detail so that everything is clear?

                         

                        My code seems closest to the 1st test: I have 1 input buffer (read only and persistent mem AMD) and 1 output buffer (write only and alloc host ptr). For the input buffer, I map it to a pointer, memcpy an input array to that pointer, and unmap it. Then I pass this buffer to the kernel, apply the filter, and write the results to the output buffer. Then I map the output buffer to a pointer, memcpy it to an array, and unmap it.

                        To my surprise, the combined map and memcpy input bandwidth is around 4.5 GB/s for both the discrete GPU and the APU, which shouldn't be the case. (It is even a bit higher for the discrete GPU, although the discrete GPU has to go over PCIe, so it should be lower than on the APU.)

                         

                        Moreover, Can I get email addresses of some relevant AMD people so that I may show my code directly to them and discuss?

                          • Re: Re: How to handle ClCreateBuffer Constraints?
                            himanshu.gautam

                            >> why do we need copybuffer in 4th (pre-pinned) test


                            The fourth test (which is -type 3) is "clEnqueue[Read,Write], prepinned"

                            It looks like the test uses "ALLOC_HOST_PTR" for the copy buffer, which makes it a good candidate for DMA.

                            clEnqueueRead/Write using AHP buffers can get you pretty good bandwidth.

                            However, from what I know of pre-pinned buffers, I don't think the AMD Programming Guide refers to these as pre-pinned.

                            AHP buffers are pinned by default.

                            Pre-pinned refers to UHP (use host ptr) contents that are pinned ahead of the transfer so that DMA can be used to move them to the device.

                            I will check internally and Thanks for the alert on Documentation. I will report that as well.

                             

                            [UPDATE]

                            I was wrong. AHP is also considered as pre-pinned....So, the sample is fine.

                             

                             

                            >> why cant we change flags for resultbuffer etc

                             

                            It is inputBuffer and outputBuffer that matter, because transfers in and out of them are what is measured and reported.

                            So, the program provides flexibility on those buffer placements (using various flags)

                            "resultBuffer" is a device-resident intermediate buffer.

                            It is used by both "read_kernel" and "write_kernel".

                            In "read_kernel", it is the small output buffer that collects the final result. It has to be small because "read_kernel" is all about memory reads. Note that this is a per-workgroup buffer.

                            In "write_kernel", it is just a dummy thing and is not used....

                            Hence..

                            • Re: Re: How to handle ClCreateBuffer Constraints?
                              himanshu.gautam

                              >> My code seems to be more relevant to 1st test where I have 1 input buffer (read only and persistent mem amd) and 1 output buffer (write only and alloc host ptr).


                              So, I believe you are invoking it as "./BufferBandwidth -type 0 -if 6 -of 5"

                              Can you post your full output here?

                              The output shows map and memcpy separately, so I don't understand what you mean by "mapping and memcpy bandwidth".

                              Please post your output including your device names (full output)

                              I can check and let you know...

                               

                              Also, post info on your device, OS, Catalyst version, etc.

                              Best ,

                              Bruhaspati