
cguenther
Adept II

Only a small fraction of allocated buffer correctly initialized?

Hi there.

I have read that OpenCL buffers can only be allocated in chunks up to a certain size:

http://devgurus.amd.com/thread/158397

http://devgurus.amd.com/message/1282922#1282922

I am using the C++ bindings from the Khronos Group with the compiler define "-DCL_USE_DEPRECATED_OPENCL_1_1_APIS" to get correct linking. The buffer is allocated as follows:

cl::Buffer tempBuffer(m_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, VBOpos, &errret);


This buffer contains the 3D positions of points, 3 integers per point, and is about 120 MB (10 million points). The call returns CL_SUCCESS. Can I assume this indicates that the buffer is small enough to avoid the problem from the two links above?
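For reference, the numbers work out: 10,000,000 points × 3 ints × 4 bytes = 120,000,000 bytes (~114 MiB). A minimal error check on the call above might look like this (errret as declared in the post):

    // 10,000,000 points * 3 ints * 4 bytes = 120,000,000 bytes (~114 MiB)
    if (errret != CL_SUCCESS)
        std::cerr << "clCreateBuffer failed with error " << errret << std::endl;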

My kernel reads the data as follows:

__kernel void testKernel(__global int* VBOpos)
{
    size_t gid = get_global_id(0);          // one work-item per point
    __private int4 WorldSpacePointPosInt;
    // vload4 loads 4 consecutive ints starting at the point's x component
    // (note: that is one int past the point's own 3 components)
    WorldSpacePointPosInt = vload4(0, &(VBOpos[3 * gid]));
}

The values of roughly the first 8 million points are correct, but after that I get only nonsense. It seems that the allocation succeeds, but the data is not set correctly. I also tried it beforehand with a shared GL VBO: the GL rendering of the VBO is correct, but when I read the shared GL VBO from CL, again only about the first 8 million points are correct and the rest is garbage.

My NVIDIA notebook does not have this problem.

I would like to split the data into several buffers, but I do not know up to which exact size a buffer is guaranteed to be initialized correctly.
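For what it's worth, splitting the host array across several buffers might look roughly like this (a sketch only; m_context and VBOpos are the names from my code above, assuming VBOpos is the host int array, and chunkPoints is just an example value):

    // m_context and VBOpos as above; chunkPoints is just an example value
    const size_t chunkPoints = 2000000;          // 2 million points per buffer
    const size_t totalPoints = 10000000;
    cl::Buffer chunks[5];
    cl_int err = CL_SUCCESS;
    for (int c = 0; c < 5; ++c) {
        size_t first = (size_t)c * chunkPoints;
        size_t n = (first + chunkPoints <= totalPoints) ? chunkPoints
                                                        : totalPoints - first;
        chunks[c] = cl::Buffer(m_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * 3 * sizeof(cl_int),
                               VBOpos + first * 3,      // host pointer offset in ints
                               &err);
        if (err != CL_SUCCESS) break;                   // stop on the first failure
    }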

Test system:

AMD Radeon HD 7970, Windows 7 x64 and Ubuntu 12.04 x64, Catalyst 12.8, AMD APP SDK 2.7

Sorry for my bad English, and please reply; this would help me a lot.

binying
Challenger

Can I assume that this behaviour indicates that the buffer is small enough to avoid the problem from the two links above?

Well, I think you can use the printf extension to confirm / double-check.
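For example, a debug kernel along these lines could print the values around the point where they go bad (a sketch only; on OpenCL 1.1 AMD drivers this needs the cl_amd_printf extension, as printf is core only from OpenCL 1.2):

#pragma OPENCL EXTENSION cl_amd_printf : enable   // assumes AMD's printf extension

__kernel void debugKernel(__global const int* VBOpos)
{
    size_t gid = get_global_id(0);
    if (gid >= 8000000 && gid < 8000008)   // region where the values go bad
        printf("gid %u: %d %d %d\n", (uint)gid,
               VBOpos[3*gid], VBOpos[3*gid+1], VBOpos[3*gid+2]);
}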

Hope this answer will be helpful to you.


In the past, if I exceeded the buffer limits, I got an error from the allocation. You should be able to see the maximum object size in the clinfo output. Why don't you try it with vload3 and see whether you get the same values or not? You could also perhaps try it on the CPU.

The vload3 gives the same result. I also tried to access the data with the [] operator, with the same outcome. I tried to query CL_DEVICE_MAX_MEM_ALLOC_SIZE with the following code, but it returns only two binary zeros...

    cl_platform_id platformId;
    cl_device_id deviceId;
    size_t size;
    clGetPlatformIDs(1, &platformId, 0);
    clGetDeviceIDs(platformId, CL_DEVICE_TYPE_GPU, 1, &deviceId, 0);
    DumpDetailedCLError(clGetDeviceInfo(deviceId, CL_DEVICE_MAX_MEM_ALLOC_SIZE, 0, 0, &size), false);
    char deviceInfo[size];
    DumpDetailedCLError(clGetDeviceInfo(deviceId, CL_DEVICE_MAX_MEM_ALLOC_SIZE, 500, &deviceInfo, 0), false);
    printf("%s\n", deviceInfo);
    fflush(stdout);

One of the pros says about the AMD 7970:

Currently OpenCL users are limited to 25% of device memory,

I don't know where you got this from; perhaps it's a rumor, but it's certainly not correct.

(There is a 512 MB limit per allocation call, but you can allocate as much as you like.)

I do predominantly scientific computing and often need very large and fast memory, so I mostly use the 7970. On the 7970, I often allocate a single contiguous buffer that uses just shy of 3 GB, the device limit. It's very simple: allocate in chunks of 512 MB or less and make sure the chunks are rounded to about 0x4000 bytes; then they will be placed contiguously.
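A rough sketch of that scheme (my own illustration, not the poster's actual code; context and total_bytes are assumed to exist, and whether the chunks end up contiguous is driver behaviour as described above):

    #define ALIGN       0x4000                    /* round chunk sizes to 16 KB */
    #define CHUNK_BYTES (512u * 1024u * 1024u)    /* per-allocation limit       */

    /* context and total_bytes (e.g. just shy of 3 GB) are assumptions */
    cl_mem chunks[8];
    size_t remaining = total_bytes;
    int n = 0;
    while (remaining > 0 && n < 8) {
        size_t want = remaining > CHUNK_BYTES ? CHUNK_BYTES : remaining;
        want = (want / ALIGN) * ALIGN;            /* keep each chunk 0x4000-aligned */
        if (want == 0) break;
        cl_int err;
        chunks[n] = clCreateBuffer(context, CL_MEM_READ_WRITE, want, NULL, &err);
        if (err != CL_SUCCESS) break;
        remaining -= want;
        n++;
    }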

So I don't think it is an allocation problem, because I use only a fraction of these buffer sizes on the same hardware. It seems rather to be a problem when reading in this address space, and it appears with an OpenCL buffer AND with a shared GL VBO.

So I am now trying to divide the data into chunks of 8 million points, but I don't know at which position exactly the reads become faulty, and why it is so much earlier than the 512 MB limit.

Thanks for your responses so far.

0 Likes

The correct code for querying is this:

cl_ulong max_alloc_size;
clGetDeviceInfo(deviceId, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), &max_alloc_size, 0);
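Printing it then just needs a matching format (reusing deviceId from the earlier snippet):

printf("%llu bytes\n", (unsigned long long)max_alloc_size);   /* cl_ulong is 64-bit */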

Thanks for correcting the code snippet. It prints "536870912 bytes", which confirms that the allocation size is not the problem. That helped me see that the memory access was the problem.


The printf also shows the expected faulty values. But by checking with enqueueWriteBuffer followed by enqueueReadBuffer, I determined that the values are written and read back correctly compared to the host source.

So I concluded that the problem is not faulty values in memory, but rather the method I used to read them: global IDs above about 8 million, which I used to index the memory, are faulty.

I have now shrunk my global work size and loop inside my kernel so that it does the same total work. This is the solution to my problem.
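A minimal sketch of that looping pattern (my illustration, assuming a numPoints argument is passed to the kernel):

__kernel void testKernelLooped(__global const int* VBOpos, const uint numPoints)
{
    // Launch a smaller global range and let each work-item walk the data
    // with a stride of the total global size (a "grid-stride" loop).
    for (size_t i = get_global_id(0); i < numPoints; i += get_global_size(0)) {
        int x = VBOpos[3*i];
        int y = VBOpos[3*i + 1];
        int z = VBOpos[3*i + 2];
        // ... process one point per iteration ...
    }
}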

Does anyone know how big the maximum global ID can be?


CL_DEVICE_MAX_WORK_ITEM_SIZES prints [256; 256; 256], which would give a maximum global ID of 16777216 (256^3). This is the work size I can use without faults.
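For reference, that limit can be queried like this (a sketch, deviceId as in the snippets above; strictly one should query CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS first, but GPUs report 3):

    size_t item_sizes[3];   // assumes a 3-dimensional device, as reported here
    clGetDeviceInfo(deviceId, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(item_sizes), item_sizes, 0);
    printf("%zu %zu %zu\n", item_sizes[0], item_sizes[1], item_sizes[2]);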


http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html

The global work size takes a size_t; it should be at least as large as an unsigned 32-bit value, which means it should be able to take at least 4,294,967,295, as far as I understand. The work-item sizes show the maximum local size you can use in each dimension and are not related to the global size, I think.

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/scalarDataTypes.html

size_t  The unsigned integer type of the result of the sizeof operator. This is a 32-bit unsigned integer if CL_DEVICE_ADDRESS_BITS defined in clGetDeviceInfo is 32-bits and is a 64-bit unsigned integer if CL_DEVICE_ADDRESS_BITS is 64-bits.
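So whether size_t is 32 or 64 bits on the device can be checked directly (a small sketch, deviceId as before):

    cl_uint address_bits;   /* deviceId from the earlier snippets */
    clGetDeviceInfo(deviceId, CL_DEVICE_ADDRESS_BITS,
                    sizeof(address_bits), &address_bits, 0);
    printf("device address bits: %u\n", address_bits);   /* 32 or 64 */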

It seems that a hardware-dependent maximum also exists. You should be able to see it in the CodeXL profiler.

But now I run into another strange problem: my memory access only seems to be valid when I use half of the reported global work size. Are there dependencies that must be considered when raising the global work size?


Hi cguenther,

yurtsen is probably right. AFAIK there is no theoretical limit on the global size, and the range of a size_t appears to be the practical limit.

IMHO, it would be helpful if you could share a small cut-down code snippet with which we can reproduce the issue.

There are many samples, like MatrixMultiplication, where you can specify large matrix sizes, which results in a large global work-item range.

But your case also involves CL-GL interop, so it may be interesting to see.

