Archives Discussions

Fuxianjun
Journeyman III

How to create buffer on device memory?

I create several arrays on the CPU and want to operate on them on the GPU. I understand that since the operation runs on the GPU, accessing GPU memory is faster than accessing host memory. So, how do I create buffers for these arrays in GPU memory? Can I just use the clCreateBuffer() function? If so, how do I choose the flags? Thanks!

0 Likes
11 Replies
Raistmer
Adept II

AFAIK there are a few ways to do this:
1) Create a buffer and map it to the CPU, fill it with data, then unmap it so the GPU can use it (for previous SDK versions this was slower than the others; I didn't test with SDK 2.2).
2) Create a buffer and use clEnqueueWriteBuffer to fill it with data. (On Windows it will also take memory from the host, but the data transfer was faster than in case 1, for previous SDKs at least.)
3) Create the buffer with the copy-host-memory flag. But this way is only good if you need to fill the GPU buffer once. If you need to update the GPU buffer from the host in a loop, you are back to case 2) again.
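A minimal sketch of case 2) might look like the following. This assumes a working OpenCL installation with at least one GPU device; the buffer size, data, and variable names are illustrative, and error checking is abbreviated.

```c
/* Case 2 sketch: allocate a device-side buffer with clCreateBuffer,
 * then upload host data with clEnqueueWriteBuffer. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    enum { N = 1024 };
    float host_data[N];
    for (int i = 0; i < N; ++i) host_data[i] = (float)i;

    /* No host pointer supplied: the backing store is device memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                sizeof(host_data), NULL, &err);

    /* Blocking write: returns once host_data has been transferred. */
    err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                               sizeof(host_data), host_data,
                               0, NULL, NULL);
    printf("upload %s\n", err == CL_SUCCESS ? "ok" : "failed");

    clReleaseMemObject(buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```

Case 3) would be a single call instead: pass CL_MEM_COPY_HOST_PTR together with the host pointer to clCreateBuffer, so the runtime copies the data at creation time.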
0 Likes

Originally posted by: Raistmer AFAIK there are a few ways to do this: 1) Create a buffer and map it to the CPU, fill it with data, then unmap it so the GPU can use it (for previous SDK versions this was slower than the others; I didn't test with SDK 2.2). 2) Create a buffer and use clEnqueueWriteBuffer to fill it with data. (On Windows it will also take memory from the host, but the data transfer was faster than in case 1, for previous SDKs at least.) 3) Create the buffer with the copy-host-memory flag. But this way is only good if you need to fill the GPU buffer once. If you need to update the GPU buffer from the host in a loop, you are back to case 2) again.


Thank you very much for the reply, but I still cannot understand. Could you please explain again in more detail?

For case 1, does "map" mean using the clEnqueueMapBuffer function?

For case 2, I think it is the best way for my problem. Does "create buffer" mean using the clCreateBuffer function? If so, which flag is proper? Is this buffer created in host memory or GPU memory? And when using clEnqueueWriteBuffer, is the ptr parameter a pointer to host memory or GPU memory?

0 Likes

Fuxianjun,

Use clEnqueueWriteBuffer instead, but you need to create the buffer before using it.

In most cases, clCreateBuffer creates the buffer on the host side, which is quite inefficient for the GPU to access.

0 Likes

Originally posted by: himanshu.gautam Use clEnqueueWriteBuffer instead, but you need to create the buffer before using it. In most cases, clCreateBuffer creates the buffer on the host side, which is quite inefficient for the GPU to access.

This is not correct. clCreateBuffer() always defaults to device memory. You can add flags like CL_MEM_ALLOC_HOST_PTR if you prefer that the memory reside on the host, or at least be host accessible, rather than on the device. Host memory is generally slower for the device.
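The distinction Jeff describes might be sketched like this. The helper function and its name are hypothetical, and the context and size are assumed to be supplied by the caller; error handling is abbreviated.

```c
#include <CL/cl.h>

/* Illustrative helper: create one buffer of each kind. */
static void create_both(cl_context ctx, size_t bytes,
                        cl_mem *dev_buf, cl_mem *host_buf)
{
    cl_int err;

    /* Default: backing store in device memory, fastest for kernels. */
    *dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* CL_MEM_ALLOC_HOST_PTR: the runtime allocates host-accessible
     * memory, cheap for the CPU to touch but generally slower for
     * the device to read over the bus. */
    *host_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               bytes, NULL, &err);
}
```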

Jeff

0 Likes

Originally posted by: jeff_golds This is not correct. clCreateBuffer() always defaults to device memory. You can add flags like CL_MEM_ALLOC_HOST_PTR if you prefer that the memory reside on the host, or at least be host accessible, rather than on the device. Host memory is generally slower for the device.

God, which of you two is correct? Can anyone tell me the truth?

 

0 Likes

CL_MEM_USE_HOST_PTR will use the given pointer into host memory as the buffer's storage, and when you call clEnqueueMapBuffer() it will map to that pointer.

The implementation can still cache the buffer in device memory.
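As a sketch of what nou describes, assuming a caller-supplied context, queue, and existing host array (the helper name is hypothetical, error checks abbreviated):

```c
#include <CL/cl.h>

/* Wrap an existing host array in a buffer and map it for writing. */
static float *wrap_and_map(cl_context ctx, cl_command_queue queue,
                           float *host_array, size_t n, cl_mem *out_buf)
{
    cl_int err;

    /* CL_MEM_USE_HOST_PTR: the buffer uses host_array as its storage;
     * the implementation may still keep a cached copy on the device. */
    *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              n * sizeof(float), host_array, &err);

    /* Blocking map: for a USE_HOST_PTR buffer this returns a pointer
     * into host_array itself, with any device cache synchronized. */
    float *p = (float *)clEnqueueMapBuffer(queue, *out_buf, CL_TRUE,
                                           CL_MAP_WRITE, 0,
                                           n * sizeof(float),
                                           0, NULL, NULL, &err);
    return p;  /* caller must clEnqueueUnmapMemObject() before kernels run */
}
```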

0 Likes

It depends on your algorithm:

1. If your algorithm is an iterative one like an NBody or LBM simulation, where your output is your next input, then it is best to create the buffer with clCreateBuffer using the CL_MEM_READ_WRITE flag and update it with clEnqueueWriteBuffer as needed.

2. If your algorithm reads data from the host only once, then use clCreateBuffer with the CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR flags, specifying your host pointer in the call.
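For case 1, one common variant keeps the data on the device entirely and just ping-pongs two CL_MEM_READ_WRITE buffers between iterations, so no host round-trip is needed inside the loop. A sketch, assuming a step kernel whose first argument is the input buffer and second is the output (the kernel, argument order, and helper name are all hypothetical):

```c
#include <CL/cl.h>

/* Run `iters` steps, swapping input and output buffers each time. */
static void iterate(cl_command_queue queue, cl_kernel step,
                    cl_mem a, cl_mem b, size_t n, int iters)
{
    for (int i = 0; i < iters; ++i) {
        /* The output of one step becomes the input of the next:
         * just swap the kernel arguments, no host transfer needed. */
        clSetKernelArg(step, 0, sizeof(cl_mem), &a);
        clSetKernelArg(step, 1, sizeof(cl_mem), &b);
        clEnqueueNDRangeKernel(queue, step, 1, NULL, &n, NULL,
                               0, NULL, NULL);
        cl_mem t = a; a = b; b = t;
    }
    clFinish(queue);
}
```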

0 Likes

My apologies.

nou explains it best.

0 Likes

Originally posted by: jeff_golds This is not correct. clCreateBuffer() always defaults to device memory. You can add flags like CL_MEM_ALLOC_HOST_PTR if you prefer that the memory reside on the host, or at least be host accessible, rather than on the device. Host memory is generally slower for the device.

Hehe, you are going by the OpenCL specs, while Himanshu could be revealing some details of the current SDK implementation. From my own observations I see host memory increase by more than enough to hold all the buffers allocated "on the GPU". Hence the buffer is allocated in both GPU and host memory, at least. But I can only hope that is the case; quite possibly it is allocated only on the host indeed.
0 Likes

Originally posted by: Raistmer Hehe, you are going by the OpenCL specs, while Himanshu could be revealing some details of the current SDK implementation. From my own observations I see host memory increase by more than enough to hold all the buffers allocated "on the GPU". Hence the buffer is allocated in both GPU and host memory, at least. But I can only hope that is the case; quite possibly it is allocated only on the host indeed.


Actually, I work on the OpenCL runtime at AMD.

Host memory increases due to the way we currently allocate transfer buffers, but the actual backing store is on the device.  Thus, for best device access performance, you should use clCreateBuffer().

If you want to create a buffer that can be quickly updated by the CPU, use CL_MEM_ALLOC_HOST_PTR. You can use that data directly with the device, or, if accessing it over PCIe is a bottleneck, you can use clEnqueueCopyBuffer to copy the data to a buffer on the device. This path will be more optimal soon.
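The two-stage path described above might be sketched as follows, assuming a caller-supplied context, queue, and source array (the helper name is hypothetical, error checks abbreviated):

```c
#include <string.h>
#include <CL/cl.h>

/* Fill a host-accessible staging buffer on the CPU, then copy it
 * into a device-resident buffer with clEnqueueCopyBuffer. */
static void staged_upload(cl_context ctx, cl_command_queue queue,
                          const float *src, size_t n, cl_mem *dev_out)
{
    cl_int err;
    size_t bytes = n * sizeof(float);

    cl_mem staging = clCreateBuffer(ctx, CL_MEM_READ_ONLY |
                                    CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, &err);
    *dev_out = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* CPU writes land in the host-side allocation... */
    float *p = (float *)clEnqueueMapBuffer(queue, staging, CL_TRUE,
                                           CL_MAP_WRITE, 0, bytes,
                                           0, NULL, NULL, &err);
    memcpy(p, src, bytes);
    clEnqueueUnmapMemObject(queue, staging, p, 0, NULL, NULL);

    /* ...then one buffer-to-buffer copy moves them to the device. */
    clEnqueueCopyBuffer(queue, staging, *dev_out, 0, 0, bytes,
                        0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(staging);
}
```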

Jeff

0 Likes
Raistmer
Adept II

LoL.
And what about pinned memory? Still not implemented?
And mapping a buffer: will it copy data into a temporary GPU memory buffer when unmapped by the host? For example, my data path is to prepare some data in host memory, then move it to the GPU, then do various transformations in kernels (each kernel takes the buffer from the previous one and sometimes modifies the same buffer, sometimes writes into a new one), and then (and only if some flag is set) transfer the data back to host memory.
That is, duplicating buffers in host memory is a waste of resources in my case (some of them are never used on the host side at all), especially if not only is the memory allocated twice but data is also transferred between the two copies of the buffer.
What is the best way to implement such buffer usage in the current implementation?
0 Likes