
sgratton
Adept I

burst writing no longer working?

Seems to be a problem from Catalyst 10.7 onwards

 

Hi there,

 

After trying out AMD Stream some time ago, with the release of the 6900 cards I thought I'd give it another go. One issue in getting good memory performance with CAL back then was the absence of burst reading (see link here). Having bought a new card and installed the latest SDK (2.3) and drivers (10.12), I was surprised to see that now not even burst writing seems to occur, on either Linux or Vista (64-bit). If one runs the export_burst_perf sample and prints out the IL (export_burst_perf -p) and then the ISA (export_burst_perf -a), the IL appears to be written to give burst writes, but the ISA doesn't do this. For example...

 

il_cs_2_0
dcl_cb cb0[1]
dcl_num_thread_per_group 64
itof r0.z, vaTid0.x
div r0.y, r0.z, cb0[0].x
mod r0.x, r0.z, cb0[0].x
flr r0, r0
mul r0.x, r0.x, cb0[0].z
dcl_resource_id(0)_type(2d,unnorm)_fmtx(unknown)_fmty(unknown)_fmtz(unknown)_fmtw(unknown)
imul r0.w, vaTid0.x, cb0[0].w
sample_resource(0)_sampler(0) r1, r0.xy
add r0.x, r0.x, r0.1
sample_resource(0)_sampler(0) r2, r0.xy
add r0.x, r0.x, r0.1
sample_resource(0)_sampler(0) r3, r0.xy
add r0.x, r0.x, r0.1
sample_resource(0)_sampler(0) r4, r0.xy
add r0.x, r0.x, r0.1
mov g[r0.w + 0], r1
mov g[r0.w + 1], r2
mov g[r0.w + 2], r3
mov g[r0.w + 3], r4
end

compiles to give

...

04 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R0, ELEM_SIZE(3)  VPM
05 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R5, ELEM_SIZE(3)  VPM
06 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R6, ELEM_SIZE(3)  VPM
07 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R7, ELEM_SIZE(3)  VPM

...

 

Investigating further, I played with the SKA (1.7) on Vista, set to compile code for a 4870.

 

The above kernel gives

03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R5, ELEM_SIZE(3)   BRSTCNT(3)

 

with the Catalyst version set to 10.6 or earlier in the options, but

 

02 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R0, ELEM_SIZE(3)
03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R1, ELEM_SIZE(3)
04 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R2, ELEM_SIZE(3)
05 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R3, ELEM_SIZE(3)

 

for more recent Catalyst versions, including the latest one.
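
A check like this can be scripted rather than done by eye. Below is a minimal Python sketch, not an AMD tool: the function name `summarize_exports` and the regular expressions are hypothetical and based only on the ISA lines quoted in this post. It also assumes that BRSTCNT(n) means the instruction covers n + 1 exports, which is consistent with the four `mov g[...]` writes in the IL collapsing into a single BRSTCNT(3) instruction.

```python
import re

# Patterns derived only from the ISA lines quoted in this thread (assumption:
# other dumps may format these instructions differently).
EXPORT_RE = re.compile(r"MEM_EXPORT_WRITE(?:_IND)?:")
BURST_RE = re.compile(r"BRSTCNT\((\d+)\)")


def summarize_exports(isa_text):
    """Count export-write instructions and the elements they cover.

    Assumption: BRSTCNT(n) means the instruction writes n + 1 elements;
    an instruction without BRSTCNT writes exactly one element.
    """
    instructions = 0
    bursts = 0
    elements = 0
    for line in isa_text.splitlines():
        if not EXPORT_RE.search(line):
            continue  # not an export write, ignore
        instructions += 1
        match = BURST_RE.search(line)
        if match:
            bursts += 1
            elements += int(match.group(1)) + 1
        else:
            elements += 1
    return {"instructions": instructions, "bursts": bursts, "elements": elements}


# SKA output with Catalyst 10.6 or earlier selected.
OLD_ISA = """\
03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R5, ELEM_SIZE(3)   BRSTCNT(3)
"""

# SKA output with more recent Catalyst versions selected.
NEW_ISA = """\
02 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R0, ELEM_SIZE(3)
03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R1, ELEM_SIZE(3)
04 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R2, ELEM_SIZE(3)
05 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R3, ELEM_SIZE(3)
"""

if __name__ == "__main__":
    print(summarize_exports(OLD_ISA))  # 1 instruction, 1 burst, 4 elements
    print(summarize_exports(NEW_ISA))  # 4 instructions, 0 bursts, 4 elements
```

Both dumps export the same four elements; the difference is one burst instruction versus four separate ones, which is the regression described here.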

 

So, I would like to know:

 

1.  Is this a bug, or is there a reason for this change?

 

2.  What are the performance implications? 

 

3.  Or do hardware improvements for the 6900s at least render bursting irrelevant?

 

4.  Is burst reading now supported in hardware in the 6900s?

 

5.  If this is a bug, will burst writing be supported by the compiler again shortly?  (So it can be used by the 6900s in particular.)

 

6.  Is burst reading supported by the compiler, or will it be shortly?

 

Thanks for any advice,

Steven.

 

4 Replies

The global buffer is not the most efficient method of writing to memory on 8XX and 9XX devices. The most efficient method is to use UAVs, which were introduced with the 8XX GPUs; a single UAV was also backported to 7XX devices. I've filed a regression report with the compiler team, but can you check whether using a UAV fixes your problem?

sgratton,
The cause has been found and the fix will be in a future driver release (probably March or April).

Dear Micah,

 

Thanks for taking a look at this. I've begun to experiment with UAVs and will post questions in a new thread. Playing with the SKA, I did notice that, as you implied, UAV code does still work on r7xx (or even on the 3870 if you use a pixel shader!) and so is affected by this issue: if you write, say, 8 consecutive float4s, you get two burst writes with the older Catalysts

01 MEM_EXPORT_WRITE: DWORD_PTR[0], R8, ELEM_SIZE(3)   BRSTCNT(3)
02 MEM_EXPORT_WRITE: DWORD_PTR[16], R0, ELEM_SIZE(3)   BRSTCNT(3)

but 8 individual ones with the more recent ones:

 

01 MEM_EXPORT_WRITE: DWORD_PTR[0], R0, ELEM_SIZE(3)
02 MEM_EXPORT_WRITE: DWORD_PTR[4], R1, ELEM_SIZE(3)
03 MEM_EXPORT_WRITE: DWORD_PTR[8], R2, ELEM_SIZE(3)
04 MEM_EXPORT_WRITE: DWORD_PTR[12], R3, ELEM_SIZE(3)
05 MEM_EXPORT_WRITE: DWORD_PTR[16], R4, ELEM_SIZE(3)
06 MEM_EXPORT_WRITE: DWORD_PTR[20], R5, ELEM_SIZE(3)
07 MEM_EXPORT_WRITE: DWORD_PTR[24], R6, ELEM_SIZE(3)
08 MEM_EXPORT_WRITE: DWORD_PTR[28], R7, ELEM_SIZE(3)

 

So UAV code on r7xx should get better with this fix.
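
To make the grouping in the two listings above concrete, here is a minimal Python sketch of how consecutive float4 exports could be merged into bursts. It is an illustration only, not the compiler's actual algorithm: the function name `coalesce_exports` is hypothetical, the stride of 4 DWORDs per float4 is read off the DWORD_PTR offsets above, and the limit of 4 exports per burst is an assumption inferred from 8 writes becoming two BRSTCNT(3) instructions rather than one.

```python
def coalesce_exports(addresses, stride=4, max_burst=4):
    """Group DWORD addresses of float4 exports into bursts.

    Assumptions (inferred from the ISA listings in this thread only):
    each float4 occupies `stride` DWORDs, a burst holds at most
    `max_burst` exports, and BRSTCNT is the export count minus one.

    Returns a list of (start_address, brstcnt) tuples.
    """
    bursts = []  # each entry is (start_address, export_count)
    for addr in addresses:
        if bursts:
            start, count = bursts[-1]
            # Extend the current burst only if this export is contiguous
            # with it and the burst still has room.
            if count < max_burst and addr == start + count * stride:
                bursts[-1] = (start, count + 1)
                continue
        bursts.append((addr, 1))
    return [(start, count - 1) for start, count in bursts]


if __name__ == "__main__":
    # The 8 consecutive float4 writes at DWORD_PTR[0], [4], ..., [28].
    print(coalesce_exports(range(0, 32, 4)))  # [(0, 3), (16, 3)]
```

Under these assumptions the result matches the older Catalyst output: one burst starting at DWORD_PTR[0] and one at DWORD_PTR[16], each with BRSTCNT(3).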

 

Best,

Steven.


sgratton,
A single UAV should work on the 3870 but is not officially supported, as it is mapped to the global buffer. You might run into issues, however, because we don't test UAVs on that device.