I have a very simple problem in hand but the solution does not seem to be trivial.
I have more than 1 gigabyte of data (uints in an array). I apply a filter to it and write the result to another array.
I am doing this on an AMD Devastator using OpenCL. clinfo reports a max memory allocation of 200540160 on Devastator, although the global memory size is 536870912. The CPU device shows a max memory allocation of 4146315264 and a global memory size of 16585261056, so I have plenty of memory for my data.
The problem is that I cannot allocate a buffer (clCreateBuffer) with more than 32 million elements, which follows from the max memory allocation on Devastator. How can I handle arrays of more than a gigabyte, or even a terabyte, of elements? Where and how should I store these elements (in CPU memory or on the GPU), and how can the GPU access them?
The problem gets worse when my output array needs 10 times (or more) as many elements as the input array (another scenario), because then the number of input elements I can process at a time is reduced by a factor of 4. The number of input elements I can send is now limited to 2 million.
Any suggestions to solve these problems?
Allocate them as multiple buffers. I don't know much about your setup,
but I believe Devastator is an APU part.
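To make the multiple-buffer suggestion concrete, here is a minimal host-side sketch of how the chunk size could be worked out. It is not taken from the poster's code: `chunk_plan`, `plan_chunks`, `chunk_len` and the `out_per_in` parameter are hypothetical names, and the sketch assumes the only constraint is that each buffer (input and output) must fit within the device's max single allocation, with the output needing `out_per_in` elements per input element.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plan: how many input elements go into each chunk and how
   many chunks are needed to cover the whole array. */
typedef struct {
    size_t chunk_elems;  /* input elements per full chunk */
    size_t num_chunks;   /* chunks needed for the whole array */
} chunk_plan;

/* Assumption: each chunk uses one input buffer and one output buffer, and
   the output holds out_per_in elements per input element, so the output
   buffer is the one that hits max_alloc_bytes first. */
chunk_plan plan_chunks(uint64_t total_elems, uint64_t max_alloc_bytes,
                       size_t elem_size, unsigned out_per_in)
{
    chunk_plan p = {0, 0};
    if (elem_size == 0 || out_per_in == 0)
        return p;

    uint64_t per_chunk = max_alloc_bytes / ((uint64_t)elem_size * out_per_in);
    if (per_chunk > total_elems)
        per_chunk = total_elems;
    if (per_chunk == 0)
        return p;

    p.chunk_elems = (size_t)per_chunk;
    p.num_chunks = (size_t)((total_elems + per_chunk - 1) / per_chunk);
    return p;
}

/* Number of input elements in chunk i (the last chunk may be shorter). */
size_t chunk_len(const chunk_plan *p, uint64_t total_elems, size_t i)
{
    if (i >= p->num_chunks)
        return 0;
    uint64_t offset = (uint64_t)i * p->chunk_elems;
    uint64_t left = total_elems - offset;
    return left < p->chunk_elems ? (size_t)left : p->chunk_elems;
}
```

The host would then loop over the chunks, filling the input buffer for chunk i, running the kernel, and reading the output back before moving on. Whether the buffers are reused or double-buffered is a separate design choice not covered here.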
For input buffers that the kernel only reads, use the CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD flags.
This creates the buffer inside the memory reserved for the GPU, which gives the GPU high read/write bandwidth.
The CPU gets good write bandwidth to it but very low (uncached) read bandwidth.
If you use ALLOC_HOST_PTR, there is added latency because the GPU VM tables have to be populated and the memory has to be pinned.
Please check the useful PDF below to figure out how to allocate memory on APUs (that are not HSA based)
Thanks. I am using the persistent flag for my input buffer and ALLOC_HOST_PTR for my output buffer. Even so, the bandwidth I get is a fraction of a GB/s (less than 0.5), not the 6 or 8 specified in the doc. I calculate it as number of elements / time taken in the kernel. Why is that?
Number of elements / time taken is fine, but are you multiplying by the sizeof() of the datatype?
For example, if you are processing 1 million floats per second, that is 1 million * sizeof(float) = 4 MB per second.
Also, your kernel is obviously doing computation and not just memory transfers,
so you cannot hit those numbers unless you are doing a simple memcpy.
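The sizeof() point can be written out as a small helper. This is only a sketch: `bandwidth_gb_per_s` is a hypothetical name, 1 GB is taken as 1e9 bytes, and counting the written output elements together with the input elements is my assumption about what "bytes moved" should mean for a filter kernel.

```c
#include <stddef.h>

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes, an assumption) for a
   kernel that reads in_elems and writes out_elems elements of elem_size
   bytes each in the given number of seconds. */
double bandwidth_gb_per_s(double in_elems, double out_elems,
                          size_t elem_size, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;  /* avoid dividing by zero on a bad timer reading */

    /* Elements alone are not bytes: scale by the element size. */
    double bytes = (in_elems + out_elems) * (double)elem_size;
    return bytes / seconds / 1e9;
}
```

For example, 1 million 4-byte elements read in one second with nothing written comes out as 0.004 GB/s, which is the 4 MB per second mentioned above.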
Also, I recommend running the "BufferBandwidth" sample that comes with the APP SDK on the APU.
That's the benchmark you should strive to reach.
The sample can be run with different memory flags, so you can see the effect of each flag on performance in different scenarios.
Thanks. I have gone through BufferBandwidth a couple of times, and all 4 tests are pretty confusing. For example, why do we need copybuffer in the 4th (pre-pinned) test? In the 1st test, why can't we change the flags for resultBuffer? Even the document that comes with the sample does not explain each test. Can you explain each test in detail so that everything is clear?
My code seems closest to the 1st test: I have 1 input buffer (read only, persistent mem AMD) and 1 output buffer (write only, alloc host ptr). For the input buffer, I map it to a pointer, memcpy an input array to that pointer, and unmap it. Then I pass the buffer to the kernel, apply the filter, and write the results to the output buffer. Finally, I map the output buffer to a pointer, memcpy it to an array, and unmap it.
To my surprise, the map-and-memcpy input bandwidth is around 4.5 GB/s on both the discrete GPU and the APU, which shouldn't be the case. (It is even a bit higher on the discrete GPU, although that has to go over PCIe and so should be slower than the APU.)
Also, can I get email addresses of some relevant AMD people so that I can show them my code directly and discuss it?
>> why do we need copybuffer in 4th (pre-pinned) test
The fourth test (which is -type 3) is "clEnqueue[Read,Write], prepinned"
It looks like the test uses ALLOC_HOST_PTR for the copy buffer, which makes it a good candidate for DMA.
clEnqueueRead/Write using AHP buffers can get you pretty good bandwidth.
However, from what I know, I don't think this is what the AMD Programming Guide refers to as a pre-pinned buffer.
AHP buffers are pinned by default.
Pre-pinned refers to UHP (use host ptr) contents that are pinned ahead of the transfer so that DMA can be used to move them to the device.
I will check internally, and thanks for the alert on the documentation. I will report that as well.
I was wrong. AHP is also considered pre-pinned, so the sample is fine.
>> why cant we change flags for resultbuffer etc
It is inputBuffer and outputBuffer that matter, because the transfers in and out of them are what is measured and reported.
So the program gives you flexibility in where those buffers are placed (using various flags).
"resultBuffer" is a device-resident intermediate buffer.
It is used by both "read_kernel" and "write_kernel".
In "read_kernel", it is the tiny output buffer that collects the final result. It has to be small because "read_kernel" is all about memory reads. It is a per-workgroup buffer, if you notice.
In "write_kernel", it is just a dummy and is not used.
>> My code seems to be more relevant to 1st test where I have 1 input buffer (read only and persistent mem amd) and 1 output buffer (write only and alloc host ptr).
So, I believe you are invoking it as "./BufferBandwidth -type 0 -if 6 -of 5"
Can you post your full output here?
The output shows map and memcpy separately, so I don't understand what you mean by mapping and memcpy bandwidth.
Please post your full output, including your device names.
I can check and let you know.
Also, post info on your device, OS, Catalyst version, etc.
Thank you. I think I should attach my code here to get useful feedback from you, i.e. how and where to measure "bandwidth", "execution time" and "data transfer time", and which flags are best for this scenario. Also, if I want to run the same code on an AMD CPU, what changes do I have to make? (I used the same code on the CPU, and mapping the output buffer takes a lot of time (msec).)
[Additional Question] Both USE_HOST_PTR and ALLOC_HOST_PTR refer to pre-pinned memory on the host. How do they differ, both in how the code is written and in how they are implemented?
I am using a Windows environment (Visual Studio), and there are two GPUs on the AMD machine: the integrated HD7660D in the APU (A10-5800K) and a discrete HD5800 connected by a PCIe 2x bus.
Looking forward to the suggestions.
A side question: if I have intermediate buffers that are written in one kernel and read in a 2nd kernel, which mem flags should be used for them, and how? (Simply write to GPU memory by using no flags at all?)
Looking forward to some feedback..
Update: The performance of the attached code is quite poor on the AMD CPU (A10-5800K), whereas it's quite good on AMD GPUs and even on the Intel CPU and GPU (i5-3470). Any explanation for that?