Global Work Group Size should not pose any problems for out of resource usage.
As I understand, the LDS, Register Usage are the key factors.
If you can run 1 workgroup (with global work group size = 256), it means that LDS and Register Usage are just fine so that compute units can run a workgroup.
I dont understand, why OUT_OF_RESOURCE should occur only when you run more than 4 workgroups. This is not LDS or Register related then.
This looks more like a lazy buffer allocation related error. What is the size of global-memory buffers (cl_mem) that you use for your kernel?
Are you sure, you set the global_work_size parameter and not the local_work_size parameter of the clEnqueueNDRangeKernel() function?
Because on Tahiti (and on any GCN chip) the maximum for local_work_size is exactly 1024. clEnqueueNDRangeKernel() has many parameters, easy to make a mistake.
Although on the sceenshot and on the .isa disasm footer it seems ok :/
If everything ok, then you should be able to launch millions of this kernel. (In code size you can go up to 32KB to say it's a big kernel for HD7700+ , because that's the size of the instruction cache)
Sorry, it was finally my bad ... I realized indeed that in order to make memory access optimization during execution, I put my input buffer in constant memory (so in so called GDS) but at the end I am using too much of it when increasing global work items number ..... using __global keyword instead of __constant in the kernel parameter solved my ressource problem....
Indeed 1 kernel is using in constant memory a vector of 16x32bits = 16*4 bytes = 64 bytes, so when using 1024 kernels I am at limit of 64kbytes for GDS.... that explains why I got the OUT OF RESSOURCES error code.
However for the moment my kernel is finally using registers and private memory and only 1 load and 1 store in global memory, so for now usage of constant memory is not justified, so I guess even if I run million of it in "parallel" the use of global memory at start and end of kernels will have no impact on bandwidth rather than using constant/cache memory ?
For info,my kernel number #k is reading in same big big vector memory at position k with a granularity of uint16 size .... so each kernel will extract a 16x4 bytes vector at start of its execution, put it in a local register/private memory and then process it .... at the end of processing, 2 vectors of 64bits are written in global memory.
Just a question: I would like to read the Tahiti HW register called "HW_ID" to see for each kernel work item #k the usage of wavefront, computing unit etc... Do we have intrinsics in AMD APP SDK so that OpenCL kernel compiler could generate appropriate assembly code to access the HW_ID register and store it in a desired variable ? (please refer to Table 5.6 and 5.7 in http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf)
Obviously my code would not be compatible with other GPU architecture using such kind of code, but it's for benchmark purpose and to understand how my kernel execution is running over time and ressources .....
The purpose of constant memory is to provide constants that are the same for all the work_items (or at least all the work_items in a specific wavefront). If you try to access them in a way that you'd access from global memory, then the compiler will go nuts while indexing every single constant for each work_items in a sequential way. So you did it good to place work_item specific data into global memory. Accessing const with an index that can be different on each work_item is slow.
Constant memory had a limit in the older VLIW architecture (32KB dedicated constant memory If I remember well). On GCN constants are read from global memory but through the scalar alu and a dedicated constant cache. So basically there is no limit, but I guess the old VLIW limit still exists. On GCN there are 105 S registers to hold constants before using them and the compiler cannot load 100KBytes of them anyways.
GDS memory is the same thing as LDS memory except that it belongs to all worki_tems, not just to a single work_group. And AFAIK you can't access GDS from OpenCL language.
HW_ID: No OpenCL support for that as well. And I bet no AMD_IL support too. You gotta go down to ISA level to reach it.
Indeed realhet, I used constant memory in the wrong way, thanks for clarifications.
About the HW_ID and other registers related to Southern Island ISA, I thought OpenCL compiler shall permit to use asm("....."); inlining conventions with clobbering indication ?
I dont want to break the compiler optimization/efficiency, but I at least need to see if I can ask the compiler to store this register in a given variable