I have heard that OpenCL buffers can only be allocated in appropriately sized chunks:
I am using the C++ bindings from the Khronos Group with the compiler option "-DCL_USE_DEPRECATED_OPENCL_1_1_APIS" to get correct linking. The buffer is allocated as follows:
This buffer contains the 3D positions of points, 3 integers per point, and is about 120 MB (10 million points). The command returns CL_SUCCESS. Can I assume that this behaviour indicates the buffer is small enough to avoid the problem described in the two links above?
My kernel reads the data as follows:
__kernel void testKernel(__global int* VBOpos)
{
    int gid = get_global_id(0);
    __private int4 WorldSpacePointPosInt;
    WorldSpacePointPosInt = vload4(0, &(VBOpos[3*gid]));
}
The values of roughly the first 8 million points are correct, but after that I get only nonsense. It seems that the allocation is done correctly, but the data is not set correctly. I also tried it beforehand with a shared GL VBO: the GL rendering of the VBO is correct, but when I read the shared GL VBO from CL, again only about the first 8 million points are correct and the rest is garbage.
My Nvidia notebook does not have this problem.
I would like to split the data into several buffers, but I do not know which exact buffer size is guaranteed to be initialized correctly.
AMD 7970, Win7 x64, Ubuntu 12.04 x64, Catalyst 12.8, AMD APP SDK 2.7
Sorry for my bad English, and please reply; this would help me a lot.
Can I assume that this behaviour indicates the buffer is small enough to avoid the problem described in the two links above?
---Well, I think you can use the printf extension to confirm/double-check.
Hope this answer is helpful to you.
In the past, when I exceeded the buffer limits, I got an error from the allocation. You should be able to see the maximum memory object size in the clinfo output. Why don't you try it with vload3 and see whether you get the same values or not? You could also try it on the CPU.
vload3 gives the same situation. I also tried to access the data with the [] operator, with the same result. I tried to query CL_DEVICE_MAX_MEM_ALLOC_SIZE with the following code, but it returns only two binary zeros...
DumpDetailedCLError(clGetDeviceInfo(deviceId,CL_DEVICE_MAX_MEM_ALLOC_SIZE, 0,0,&size), false);
DumpDetailedCLError(clGetDeviceInfo(deviceId,CL_DEVICE_MAX_MEM_ALLOC_SIZE, 500, &deviceInfo,0), false);
One of the pros says about the AMD 7970:
Currently OpenCL users are limited to 25% of device memory,
I don't know where you get this from, perhaps it's a rumor, but it's certainly not correct.
(there is a 512MB limit per allocation call but you can allocate as much as you like)
I do predominantly scientific computing and often need very large and fast memory, so I mostly use the 7970. On the 7970 I often allocate a single contiguous buffer that uses just shy of 3 GB, the device limit. It's very simple: allocate in chunks of 512 MB or less and make sure the chunks are rounded to multiples of about 0x4000 bytes; then they will be placed contiguously.
So I don't think it is an allocation problem, because I use only a fraction of those buffer sizes on the same hardware. It seems rather to be a problem when reading in this address space, and it appears with an OpenCL buffer AND with a shared GL VBO.
So I am now trying to divide the data into chunks of 8 million points, but I don't know at which position exactly the reads become faulty, and why it is so much earlier than the 512 MB limit.
Thanks for your responses so far.
The correct code for querying is this:
cl_ulong max_alloc_size;
clGetDeviceInfo(deviceId, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), &max_alloc_size, 0);
printf also shows the expected faulty values. But by checking with enqueueWriteBuffer followed by enqueueReadBuffer, I determined that the values are written and read back correctly compared to the host source.
So I concluded that the problem is not faulty values inside the memory, but rather the method I used to read them. I determined that the global IDs above 8 million, which I used to access the memory, are faulty.
I have now shrunk my global work size and loop inside my kernel so that the same total work gets done. This solves my problem.
Does anyone know how big the maximum global ID can be?
Thanks for correcting the code snippet. It prints "536870912 bytes", which confirms that this is not the problem. It helped me determine that the memory access was the problem.
CL_DEVICE_MAX_WORK_ITEM_SIZES prints [256;256;256], which would lead to a maximum global ID of 16777216. This is the work size I can use without faults.
The global work size takes a size_t, and it should be at least as large as an unsigned 32-bit value, meaning it should be able to take at least 4,294,967,295, as far as I understand? The work item sizes show the maximum local size you can use in each dimension and are not related to the global size, I think?
size_t: The unsigned integer type of the result of the sizeof operator. This is a 32-bit unsigned integer if CL_DEVICE_ADDRESS_BITS defined in clGetDeviceInfo is 32 bits, and is a 64-bit unsigned integer if CL_DEVICE_ADDRESS_BITS is 64 bits.
It seems that a hardware-dependent maximum also exists. You should also be able to see it in the CodeXL profiler.
But now I run into another strange problem: my memory access only seems to be valid when I use half of the reported global work size. Are there dependencies that must be considered when raising the global work size?