I'm sure someone knows the answer to this very simple question, but I have not found the solution, no matter how much I googled.
What is the way of allocating memory on the device without using buffers? I tried the same approach one uses to allocate local memory: when setting kernel arguments, specify a size and pass a NULL pointer. Inside the kernel, the corresponding __local-qualified argument pointer then refers to that much allocated local memory. The same thing does not work if the kernel argument is qualified __global.
What is the correct way of doing this?
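For reference, the local-memory pattern I mean looks roughly like this (the argument index, size, and kernel signature are just made-up examples):

```c
#include <CL/cl.h>

/* Host side: reserve 1024 bytes of __local memory for argument 1
   by passing a size together with a NULL value. This is the pattern
   that works for __local but not for __global arguments. */
cl_int set_local_scratch(cl_kernel kernel)
{
    return clSetKernelArg(kernel, 1, 1024, NULL);
}

/* Matching kernel signature on the device side:
   __kernel void foo(__global float *out, __local float *scratch) */
```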
What's wrong with clCreateBuffer? Unless you call clEnqueueWriteBuffer, there won't be any memory transfer between host and device, and you will still have an array (or buffer) to store data in inside the kernel.
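A minimal sketch of what I mean (function name is made up, error handling abbreviated): create the buffer with no host-pointer flags and a NULL host_ptr, and no host-device transfer happens at creation.

```c
#include <CL/cl.h>

/* Allocate nbytes of device-side __global memory with no host-side
   backing array: the flags contain no *_HOST_PTR bit, so host_ptr
   must be NULL and no data is transferred at creation time. */
cl_mem alloc_device_buffer(cl_context ctx, size_t nbytes, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, err);
}
```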
What I understand from the OpenCL spec (PDF 1.1 rev 33) is that you can't allocate __global memory inside a kernel. See table 3.1 on page 27, section 3.3 "Memory Model".
I do not want to allocate global memory inside a kernel. It's just that the code doesn't separate well if buffer creation is mixed in with plain memory allocation on the device. Usually one creates all the buffers used for transfers in one place, so one might expect to be able to do all the allocations at the clSetKernelArg point. I don't see why there is a distinction between the way local and global memory are allocated.
I know it works with buffers, but I wanted to know if this is the only way.
Edit: The thing that bugs me most is that if you create a buffer, you have to specify a pointer into host memory. It's not nice to pass a pointer to host data the buffer has nothing to do with, since that buffer will never be used for transfers.
I doubt there is a way of allocating global memory through clSetKernelArg.
Sorry, I'm not the right person to answer why there is a difference between allocating local and global memory 😉 Maybe someone cleverer than me can answer your question.
In addition to nou's post, it might be worth checking whether your code checks the return value of clCreateBuffer. From the way you describe your code as "having to pass a pointer", it sounds like you have one of the flags CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR set. Both of those require a host pointer; when neither flag is set, the host_ptr argument must be NULL, and clCreateBuffer should fail (according to the spec) if you pass anything else.
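To illustrate the two cases (function names and sizes are made up for the sketch):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Case 1: a *_HOST_PTR flag is set, so a valid host pointer is
   required. CL_MEM_COPY_HOST_PTR copies host_data into the buffer
   when it is created. */
cl_mem with_host_ptr(cl_context ctx, float *host_data, size_t n, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          n * sizeof(float), host_data, err);
}

/* Case 2: no *_HOST_PTR flag, so host_ptr MUST be NULL; passing a
   non-NULL pointer here makes clCreateBuffer fail with
   CL_INVALID_HOST_PTR per the spec. */
cl_mem without_host_ptr(cl_context ctx, size_t n, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                          n * sizeof(float), NULL, err);
}
```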
They are different pieces of hardware. Local memory is not directly accessible by the host, so its allocation is just a reservation of a given size. Global memory is directly allocatable by the host, so its allocation is a reservation of a given size plus an initialization of data.
Thank you for the replies. So I conclude there is no way to allocate global memory without creating a buffer, because that is the only way. If one does not wish to pass an unwanted host pointer, because no transfers will be made or the memory does not need to be initialized, then leaving out the HOST_PTR flags does the trick, while keeping the proper usage of the READ/WRITE type flags.
Although it might be a different topic, let me ask it here since the flags were already mentioned: could someone summarize what optimizations are done by setting the READ_ONLY, WRITE_ONLY, and READ_WRITE flags properly? A long, long time ago, in a galaxy far, far away, I read that READ_ONLY tells the compiler that the data can be cached for reading. I never found a clear statement about telling the compiler to put something into the constant cache, for example.
So if someone knows the tricks to the flags and telling the compiler to use specific memory caches, I'd be most glad. (and other people too, I'm sure)
So am I correct that the compiler (at the moment) does not read ahead in the code far enough to cache reads from __global memory between mem_fence(GLOBAL) calls? If I am not mistaken, __global writes inside kernels are cached as long as they do not overflow the write cache available to the Compute Units. My question concerns details such as: does it matter whether something is specified WRITE_ONLY or READ_ONLY, or is that only a language capability reserved for future compiler optimizations?
I can imagine WRITE_ONLY buffers being written through the write cache, but if the buffer can also be read, caching might require synchronization whenever a work item wants to access data sitting in another Compute Unit's write cache that has not yet been written to __global memory.
So my question remains: it would be good to know exactly what the point is of setting R/W_ONLY flags, what optimizations will be done by setting these buffer flags?
Micah's answer seemed to show that this is highly SDK-version (compiler-version, to be more exact) dependent. Even if the answer only applies to SDK v2.2, it would still be nice. And if someone could give a sneak peek at what improvements can be expected in future releases, that would rock.
My point of view on the READ/WRITE flags is that when you specify read-only, the implementation does not need to synchronize the buffer across multiple devices in a context. If you only read, the implementation can assume that your kernels do not change the contents of the buffer, so it does not need to propagate changes to the other devices' memory.