Archives Discussions

sgratton · ‎07-05-2008

Hi there,

Does anybody know if the hardware is able to "burst" global memory reads as well as writes (if this is a meaningful idea) and if so how to write IL to do this? My

mov r20,g[r1.x]
mov r21,g[r1.x+1]
mov r22,g[r1.x+2]
mov r23,g[r1.x+3]

seems to generate 4 MEM_GLOBAL_READ_IND gpuisa instructions, whereas the code with the src/dst's interchanged generates 1 MEM_GLOBAL_WRITE_IND with a BRSTCNT(3). I am concerned about memory bandwidth.

Relatedly, can I check that the theoretical memory bandwidth of a 3870, say, is about 70GB/s? Is "all" of this accessible for any of global buffer reads only, writes only, or reads and writes together? If not, I am worried that any code I write using mainly a global buffer will be doomed to be slow from the start, especially as some of the SDK examples seem to give numbers of order only 9GB/s (e.g. bursting_IL). Or will this change for the new cards?

Are there any other tips one can give for achieving maximum global buffer bandwidth? (One thing I have mooted for example is having "tall and thin" domains, e.g. (2,512), so that if a buffer is basically accessed by vObjIndex0.x each quad should be accessing sequential memory. I haven't had chance to test this in any way - does it make sense though and might it help?)

Thanks a lot,
Steven.

MicahVillmow · ‎07-07-2008

Steven,
The current implementation does not burst memory reads; however, the way you have written it is the easiest way to get the compiler to generate burst instructions once bursting is eventually implemented. On Rv670, memory reads are very slow and there is no way to get peak bandwidth with them. Memory writes are also slow, but not nearly as slow as reads.

In order to get maximum global buffer bandwidth, you need to burst four float4's per thread.

sgratton · ‎07-07-2008

Hi Micah,

It is disappointing that global buffer performance is so slow, but it does sound hopeful for future hardware/software improvements at least.

Do you know if the new cards (4870 etc) will be able to hit peak bandwidth with global buffers?

Also, I've seen on various reports that these contain local and global shared memory, which could obviously help a lot in reducing bandwidth requirements through data reuse. Will IL access to these be available in the next SDK? Presumably this will also bring/require synchronization capabilities?

Going back to current hardware, are there any favourable inter-thread alignments for accessing the global buffer that might help a bit (analogous say to Nvidia global memory coalescing if you're familiar with that)?

Finally, what happens if one forgets about using global buffers and goes back to using a regular input and output (i0 and o0 etc.) buffer, but makes them both link to the same resource? If each thread exclusively reads a unique patch of memory (as i0), and then writes back to the same patch (as o0), will the result be the same as if i0 and o0 were really different? (I'm trying to avoid needing 2 copies of a matrix in local memory...)

Best,
Steven.

MicahVillmow · ‎10-09-2008

Steven,
Sorry for missing this post, it must have just slipped past me. On the 4870, the peak bandwidth that I am able to get is via using texture input and bursting global output. This can be verified via the export_burst_perf.exe in the runtime directory of the 1.2 SDK. The peak rate is usually hit at processing 4 elements at a time.

This is only with compute shader, however; using pixel shader you won't be able to get as high performance.

If each thread exclusively reads/writes to its own location, then there will be no data thrashing.

sgratton · ‎10-09-2008

Hi Micah,

Thanks for taking another look at this. The export_burst_perf numbers do seem very impressive! I'm finding myself in a bit of a "catch 22" situation, in that to read fast I want a resource bound as a texture whereas to write fast I want a resource bound as a global buffer! I need both because I have to make many passes over a matrix, altering it somewhat each time, so have to read it in and write it out each time.

If I try a compute shader, or a global buffer in a pixel shader, I have to use slow global reads, whereas with a pixel shader I can only operate on small sub-matrices at a time, pushing up the number of passes and so the total number of memory transactions, and also possibly have to double-buffer.

Other than bursting reads, one way out would be if it was possible for a resource to be both a global buffer and a regular local 2D one at the same time, assuming the latter could be accessed linearly, but of course none of this is possible at the moment. (I think it would help matrix-type codes generally if AMD exposed a linear or block-linear 2D texture option by the way. Is this in the works at all?)

The idea I had to avoid the double buffering of the PS solution was what I mentioned in the last paragraph of my previous post. I'm not sure if it was clear, but what I was trying was to "calResAlloc" a single resource res0 say, use calCtxGetMem to get res0 into the context as mem0 say, then apply "calModuleGetName" for both the "i0" input buffer and the "o0" output buffer of a kernel, getting in1 and out1 respectively say, but then calCtxSetMem both in1 and out1 to the same mem0. Your comment about data thrashing suggests that "dual connecting" of a resource to a kernel is indeed legal then? It did seem to work!

By the way, I was having some problems with the timing code on linux, getting wild variations that were in fact caused by rounding error. I changed _freq in Timer.cpp to 1000000 then multiplied the right hand sides of the _start=... and n=... lines by 1000 to avoid the integer divide, which has helped, but even so it seems better to run with "-t -r 10 -w 1024 -h 1024" or something to get good results. Also, all tests count total bandwidth, which can skew the interpretation; e.g. import tests also count a fast destination write which explains why the tests with lower numbers of reads -- and hence a higher write:read ratio -- on the surface do better.
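
To illustrate the kind of rounding problem I mean (this is only a sketch of the general effect, not the SDK's Timer.cpp, and the 1.4 ms duration and start offsets are made-up numbers): if a microsecond clock is truncated to whole milliseconds before subtracting, the same short kernel can appear to take 1 ms or 2 ms depending on when it happened to start.

```python
# Illustrative only: not the SDK's Timer.cpp, just the general effect of
# truncating a microsecond clock to whole milliseconds too early.

def elapsed_ms_truncated(start_us, stop_us):
    # Convert each reading to whole milliseconds first, then subtract.
    return stop_us // 1000 - start_us // 1000

def elapsed_ms_precise(start_us, stop_us):
    # Subtract in microseconds and convert once, in floating point.
    return (stop_us - start_us) / 1000.0

# A hypothetical 1.4 ms kernel launched at two different clock offsets.
for start_us in (0, 700):
    stop_us = start_us + 1400
    print(elapsed_ms_truncated(start_us, stop_us),
          elapsed_ms_precise(start_us, stop_us))
```

Averaging over many repeats (the "-r 10" above) hides the effect but does not remove the truncation itself.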

Best,
Steven.

PS If you get a chance to look at another old topic I'd be grateful to hear your comments; I should have known better than to post just around the release of the 1.2 beta! It is this one (and the result still persists with a 4870 and 1.2).

Also there was the query about cache flushing at the bottom of here...

MicahVillmow · ‎10-09-2008

Steven,
I had the same problem when I was developing the NLM_denoise algorithm. The solution to this issue of input of texture and output global is to allocate with the global buffer flag, but write to it as a 2D texture.
For example, assume pixel shader mode and use vWinCoord0 for addressing (or decompose vaTid.x into x/y in compute mode). You can read the global memory via the texture unit with X/Y coordinates and then write via the global buffer at (y * pitch + x) instead of (y * width + x). This works great in compute shader mode, but not as well in pixel shader mode. Look at how NLM_denoise handles this in the samples/app directory. The output of the first pass, which is via global buffer, is read in via texture in the second pass. The addressing is the issue.
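
As a rough sketch of that addressing in plain Python rather than IL: the helper names and the width/pitch values below are hypothetical, and on real hardware the pitch comes from the allocation call, not from a constant.

```python
def tid_to_xy(tid, width):
    # Decompose a linear thread ID (vaTid.x in compute mode) into x/y.
    return tid % width, tid // width

def pitched_index(x, y, pitch):
    # Global buffer element index that matches the texture layout.
    return y * pitch + x

width, pitch = 100, 128          # hypothetical values, in elements
x, y = tid_to_xy(203, width)     # thread 203 handles element (3, 2)
print(x, y, pitched_index(x, y, pitch), y * width + x)
```

The last two printed numbers differ (259 against 203), which is the gap between the pitch-based and width-based addresses for the same element.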

I'm guessing I missed most of these posts when I was out of the country training people for most of August.

As for the linux timing issue, I'll make a note of this to have someone look at.

sgratton · ‎10-10-2008

Hi Micah,

Thanks for all your replies! I've had a look at NLM_denoise and I think there may be a difference: perhaps I'm missing something, but in your execution phase aren't you copying data back and forth to the gpu each time? You're not somehow tricking the system that a single resource can be both a global buffer and a regular 2d input at the same time? (I didn't even try this because the former requires the CAL_RESALLOC_GLOBAL_BUFFER flag...)

Such gpu<->cpu transfer would be very bad for my program. Think of a 12288^2 matrix (=576 MB), then operating on it say (12288/4)= 3072 times. If this matrix went back and forth every time this'd need of the order of 2 TB of gpu<->cpu data transfer, limiting the runtime to say 200s just from the PCIe connection. If the HD4870 can hit 100GB/s then by keeping in device memory 10s is possible, then by using a global buffer rather than the streaming outputs to process more data at each pass (changing the 4 to 16) suggests 2.5 s is possible. (My analogous cuda implementation hits 7s on Nvidia's new cards, and would take about twice this in double precision; I had hoped that the DP performance of the 4870 would really have been able to have been exploited and it could have done even better -- 2.6 s is the theoretical floating point target!)
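
For anyone wanting to check the PCIe part of that estimate, here is the arithmetic as a small script. The assumptions are mine: 4-byte single-precision elements, the whole matrix counted once per pass, and an effective PCIe rate of about 10 GB/s, which is what "2 TB in 200s" implies rather than a measured figure.

```python
N = 12288
BYTES_PER_ELEMENT = 4            # assumption: single-precision floats
ROWS_PER_PASS = 4
PCIE_BYTES_PER_S = 10e9          # assumption: ~10 GB/s effective rate

matrix_bytes = N * N * BYTES_PER_ELEMENT
passes = N // ROWS_PER_PASS
# Upper bound: the full matrix is counted once for every pass.
total_bytes = matrix_bytes * passes

print(matrix_bytes / 2**20, "MiB per matrix")
print(passes, "passes")
print(total_bytes / 1e12, "TB moved")
print(total_bytes / PCIE_BYTES_PER_S, "s over PCIe")
```

This only reproduces the 576 MB, 3072-pass, roughly 2 TB and roughly 200 s figures; the on-device estimates depend on how much of the matrix each pass actually touches, so they are not modelled here.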

Talking of peak performance, I am still confused about how fast texture reads (or the local data shares that also seem to use the texture units) relative to registers really are, and what access patterns really work well, both for a single thread and across threads, particularly if each thread uses a lot of registers (50 say). Is the hardware too complicated to give any general rules?

Best,
Steven.

josopait · ‎10-10-2008

Steven,

you should be able to sample data from a texture and write to the same texture using the global buffer. That's what I am doing. I allocate the texture by calling calResAllocLocal2D with the CAL_RESALLOC_GLOBAL_BUFFER flag. Unless I am mistaken the texture stays in local gpu memory, so PCIe is not a bottleneck. In the first pass I am writing to the texture using global buffer writes, after that I am calling a different kernel that reads the same texture with sample instructions. If you want to read from and write to the same texture from within the same kernel, you would have to keep in mind that the memory is not immediately updated. But if you are careful about this I don't see why this shouldn't work.

Ingo

sgratton · ‎10-10-2008

Hi Ingo,

Thanks for telling me that this might work; from the documentation I had assumed that the CAL_RESALLOC_GLOBAL_BUFFER flag meant a resource could only be accessed as a global buffer. I'll try it tonight hopefully!

Best,
Steven.

MicahVillmow · ‎10-10-2008

Steven, actually with each iteration of the loop I copy back the data, but between kernels I do not. The code that does this is around line 502 of NLM_Denoise.cpp. As you can see, I am running two different kernels back to back. The first outputs via Global and the second inputs via texture.

As for performance numbers, I'll see if we are allowed to put this kind of information in a performance doc.

sgratton · ‎10-10-2008

Hi Ingo and Micah,

Thanks both for your help with this. I've just tried a test case and indeed it seems you can "connect" a resource allocated with the CAL_RESALLOC_GLOBAL_BUFFER flag to a kernel either as a global buffer g[], a texture input i#, or even as an output o#. In fact, you can make g[],i0 and o0 say in a single kernel all refer to the same resource!

And sorry Micah, I was looking at NLM_Denoise_Compute.cpp by mistake!

I look forward to trying a new version of my program sometime soon...

Best,
Steven.

MicahVillmow · ‎10-10-2008

Steven,
Ingo is correct in what he is doing. Although in NLM_Denoise I transfer data back for each iteration of the kernels, I don't transfer data between kernels. The main issue is that with global buffer, you can access a lot more space than with a texture. An N x M texture is packed into a K x M memory location, where in each row of K elements only the first N data points have valid memory locations. If you want to access, via texture, data that was written on a previous pass via global buffer, the global buffer must write the data out in the correct format. This format is to generate an index via y * pitch + x, where pitch is the value returned by the calResAlloc function call. The required pitch alignment is also specified in CALdeviceattributes struct. If you write via the global buffer using this method, instead of using y * width + x, then there should be no issues with data accesses.
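
A toy model of that layout may help, written in Python and not tied to any CAL call. The sizes N, M and K are made-up small numbers, and `sample` simply stands in for a texture read that finds row y at offset y * K.

```python
N, M, K = 5, 3, 8   # hypothetical width, height and pitch, in elements

def write_via_global(stride):
    # Each "thread" (x, y) stores its own coordinates at y * stride + x.
    buf = [None] * (K * M)
    for y in range(M):
        for x in range(N):
            buf[y * stride + x] = (x, y)
    return buf

def sample(buf, x, y):
    # Texture-style read: row y always starts at offset y * K.
    return buf[y * K + x]

good = write_via_global(K)   # indexed with y * pitch + x
bad = write_via_global(N)    # indexed with y * width + x

print(sample(good, 0, 1))    # the element written by thread (0, 1)
print(sample(bad, 0, 1))     # some other thread's element
```

With the pitch-based index every texel read returns what the matching thread wrote; with the width-based index the rows drift, so a read of (0, 1) picks up data from the middle of a row.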

bursting global reads and global memory bandwidth?