cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

rahulgarg
Adept II

UAV allocaton in CAL v2

On HD 4000 and 5000

I was reading through the new CAL IL and the CAL IL now exposes UAVs.  I am interested in doing 32-bit aligned writes which UAVs expose.

From what I understood, the relevant instructions compatible across both HD 4000 and HD 5000 are dcl_raw_uav, uav_raw_load and uav_raw_store. However,  how do I allocate a UAV using CAL C API and what resource id do I pass to dcl_raw_uav?

 

 

0 Likes
6 Replies

rahulgarg,
On the 4XXX series, there can only be a single raw_uav, ID 0 which is equivalent to global but dword aligned. On 5XXX series there are 8 buffers available.
Allocating a UAV should be equivalent to allocating Global buffer, and mapping with uav# instead of g[]. See, http://forums.amd.com/forum/me...eadid=119351&forumid=9 for other posts on this.
0 Likes

I guess it's more suitable topic to continue discussion about UAV started here: http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=119351&enterthread=y&STARTPAGE=3.

 

As far I realized that:

1. There no easy way to use UAV with CAL/IL now. I wasn't successful with allocating/binding UAV buffers. The code looks like:

 

if (calResAllocLocal2D(&resUAV1, device, 1024, 1024, CAL_FORMAT_UINT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK) {
if (calCtxGetMem(&uav1Mem, ctx, resUAV1) != CAL_RESULT_OK) printf("getmem failed for uav1\n");
if (calModuleGetName(&uav1Name, ctx, module, "uav1") != CAL_RESULT_OK) printf("no uav1\n");
if (calCtxSetMem(ctx, uav1Name, uav1Mem) != CAL_RESULT_OK ) { printf("error in setmem for uav1 [%s]\n", calGetErrorString());}

always results in error at calCtxSetMem for HD 5XXX (but no errors for HD 4XXX if there only uav0). However, uav buffers sometimes binds even with this error status.

2. Resource samplers on HD 5XXX much slower than ones on HD 4XXX. I have no explanations for this, probably hardware just changed too much and now only UAVs can provide normal performance.

3. uav_raw_load_id() translates into ~ VFETCH R1.x___, R0.x, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET) ISA.

 sample_resource(0)_sampler(0) r1, r2 into ~ SAMPLE R1, R2.xy0x, t0, s0 UNNORM(XYZW)

I guess VFETCH must be faster on HD 5XXX but without correct bindings it's impossible to test it right now.

4. uav_raw_store_id() translates into ~MEM_RAT_CACHELESS_STORE_RAW: RAT(0)[R0].x___, R3, MARK VPM ISA for HD 5XXX and into ~MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x].x___, R1, ELEM_SIZE(3) VPM for HD 4XXX.

HD 5XXX STORE_RAW works much faster than HD 4XXX MEM_EXPORT.

5. Commenting out dcl_raw_uav_id(0) means nothing. I've already reported this several months ago,

dcl_raw_uav_id(0)

or

;dcl_raw_uav_id(0)

either way results in image (ELF) binary contains reference to uav0 and you'll got error when performing Run function that uav0 isn't defined.

 

 

Anyone got better luck with UAV & CAL/IL here?

 

0 Likes

I think the problem might be that raw UAV's are 1D buffers, not 2D. If you want a 2d buffer, you need to use dcl_uav_id.
0 Likes

I'm just watching at OpenCL's IL outputs and there are only raw UAV's presents. I've also tried calResAllocLocal1D but still without success.

As OpenCL's examples works I guess OpenCL layer declares UAVs in some different way. Also OpenCL's call to calConfig() with "CAL_OPENCL_MODE" = "1" a bit suspicious, does it changes something globally?..

Anyway, a lot of options to test... interesting...

0 Likes

I'm finally figured out how to deal with UAV. Resource allocation must be 1D and data format must be 1 component int. Now it looks obvious but takes some time to get there . Also, occasionally I'd problems with Catalyst installation, looks like CAL runtime DLLs weren't updated (no idea why), so complete reinstalling of 9.12 with manual ATI's DLLs removing solve problem with incorrect uad_load_id() behavior.

 

For one small test I've finally got faster results for memory reads with uav_load_id = vfetch vs resource_sample (by the way, I can't find any description of vfetch in recent documents). But for real application I'm interesting in, results are kinda the same, +- 5%. And performance of 5XXX still slower than it should be comparing with 4XXX. I'm kinda tired of all these tests now so I'll leave it as is.

 

Also, I've took TemplateC OpenCL's source code, put some hooks in it to intercept all calls to resource allocations done by ATI's OpenCL layer. Results are kinda interesting (at least for me ). I've also change width to 4*1024*1024.

Although Micah wrote earlier that calResAllocView is experimental function actually it's heavily used by OpenCL layer. In fact all UAV resources used where prepared with calResAllocView.

Hook installed. -- Looks like first buffer allocation was done before hook to resource 30000000 -- clCreateContextFromType() clGetContextInfo() clCreateCommandQueue() calResAllocRemote1D: 0: res:30000001 dev:10000000 devcount:1 width:00080000 format:10 flags:03 calResAllocLocal1D: 0: res:30000002 dev:10000000 width:00080000 format:10 flags:80000001 calResAllocRemote1D: 0: res:30000003 dev:10000000 devcount:1 width:00000100 format:0d flags:03 calCtxGetMem: 0: mem:40000000 ctx:20000000 res:30000003 calResAllocRemote1D: 0: res:30000004 dev:10000000 devcount:1 width:00000100 format:0d flags:03 calCtxGetMem: 0: mem:40000001 ctx:20000000 res:30000004 -- Repeats for resources from 30000005 to 30000102 -- calResAllocRemote1D: 0: res:30000103 dev:10000000 devcount:1 width:00000100 format:0d flags:03 calCtxGetMem: 0: mem:40000100 ctx:20000000 res:30000103 calResAllocRemote1D: 0: res:30000104 dev:10000000 devcount:1 width:00000100 format:0d flags:03 calCtxGetMem: 0: mem:40000101 ctx:20000000 res:30000104 clCreateBuffer() #1 calResAllocLocal1D: 0: res:30000105 dev:10000000 width:00600000 format:10 flags:01 calResFree: 0: res:30000000 CalResAllocView: 0: res:30000000 30000105 10000000 10 8 1 1, { 00400000 00000001 00000001} { 000c0000 00000000 00000000 00000000 } calResMap: 0: ptr:08470000 pitch:00400000 res:30000000 flags:00000002 calResUnmap: 0: res:30000000 clCreateBuffer() #2 calResAllocLocal1D: 0: res:30000106 dev:10000000 width:00a00000 format:10 flags:01 CalResAllocView: 0: res:30000107 30000106 10000000 10 8 1 1, { 00400000 00000001 00000001} { 004c0000 00000000 00000000 00000000 } calResFree: 0: res:30000000 calResFree: 0: res:30000105 CalResAllocView: 0: res:30000105 30000106 10000000 10 8 1 1, { 00400000 00000001 00000001} { 000c0000 00000000 00000000 00000000 } calResMap: 0: ptr:08470000 pitch:00400000 res:30000105 flags:00000002 calResUnmap: 0: res:30000105 clCreateProgramWithSource() clBuildProgram() clCreateKernel() clSetKernelArg() #1 clSetKernelArg() #2 clSetKernelArg() #3 clEnqueueNDRangeKernel clWaitForEvents() calModuleGetName: 0: name:60000001 ctx:20000000 module:80000000 uav0 calModuleGetName: 0: name:60000000 ctx:20000000 module:80000000 uav1 calModuleGetName: 0: name:60000000 ctx:20000000 module:80000000 uav1 calCtxGetMem: 0: mem:40000102 ctx:20000000 res:30000106 calCtxSetMem: ctx:20000000 name:60000000 mem:40000102 calCtxSetMem: ctx:20000000 name:60000001 mem:40000102 calResMap: 0: ptr:09cb0000 pitch:00400000 res:30000105 flags:00000002 calResUnmap: 0: res:30000105 calResMap: 0: ptr:09cb0000 pitch:00400000 res:30000107 flags:00000002 calResUnmap: 0: res:30000107 calResMap: 0: ptr:06cf0000 pitch:00000100 res:30000083 flags:00000002 calResUnmap: 0: res:30000083 calModuleGetName: 0: name:60000003 ctx:20000000 module:80000000 cb0 calCtxSetMem: ctx:20000000 name:60000003 mem:40000080 calResMap: 0: ptr:06d71000 pitch:00000100 res:30000104 flags:00000002 calResUnmap: 0: res:30000104 calModuleGetName: 0: name:60000002 ctx:20000000 module:80000000 cb1 calCtxSetMem: ctx:20000000 name:60000002 mem:40000101 calCtxGetMem: 0: mem:40000103 ctx:20000000 res:30000105 calCtxGetMem: 0: mem:40000104 ctx:20000000 res:30000107 calCtxSetMem: ctx:20000000 name:60000000 mem:00000000 calCtxSetMem: ctx:20000000 name:60000001 mem:00000000 clReleaseEvent() clEnqueueReadBuffer() calCtxGetMem: 0: mem:40000105 ctx:20000000 res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 calResMap: 0: ptr:06870000 pitch:00080000 res:30000001 flags:00000001 calResUnmap: 0: res:30000001 clWaitForEvents() clReleaseEvent() calResFree: 0: res:30000107 calResFree: 0: res:30000105 calResFree: 0: res:30000001 calResFree: 0: res:30000002 -- Repeats for resources from 30000003 to 30000103 -- calResFree: 0: res:30000103 calResFree: 0: res:30000104

0 Likes

And append buffers?  Are they prepared the same as a UAV from the CAL side, or are there special flags required for calResAlloc?

0 Likes