Hi Micah,
Thanks for taking another look at this. The export_burst_perf numbers do seem very impressive! I'm finding myself in a bit of a Catch-22 situation: to read fast I want the resource bound as a texture, whereas to write fast I want it bound as a global buffer! I need both because I have to make many passes over a matrix, altering it somewhat each time, so I have to read it in and write it out on every pass.
If I use a compute shader, or a global buffer in a pixel shader, I'm stuck with slow global reads; with a plain pixel shader, on the other hand, I can only operate on small sub-matrices at a time, which pushes up the number of passes (and so the total number of memory transactions) and may also force double-buffering.
Other than bursting reads, one way out would be if it were possible for a resource to be both a global buffer and a regular local 2D one at the same time, assuming the latter could be accessed linearly, but of course none of this is possible at the moment. (I think it would help matrix-type codes generally if AMD exposed a linear or block-linear 2D texture option, by the way. Is this in the works at all?)
The idea I had to avoid the double buffering of the PS solution was what I mentioned in the last paragraph of my previous post. I'm not sure if it was clear, but what I was trying to do was calResAlloc a single resource, res0 say, use calCtxGetMem to bring res0 into the context as mem0 say, then call calModuleGetName for both the "o0" output buffer and the "i0" input of the kernel, getting out1 and in1 respectively say, but then calCtxSetMem both in1 and out1 to the same mem0. Your comment about data thrashing suggests that this "dual connecting" of a resource to a kernel is indeed legal then? It did seem to work!
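For reference, here's a minimal sketch of the dual-binding sequence I mean, using the CAL calls named above. It assumes the device, context, and compiled module already exist, uses FLOAT_4 as an illustrative format, and omits error checking, so treat it as an outline rather than working code:

```c
#include "cal.h"  /* AMD CAL SDK header */

/* Hypothetical helper: allocate one local 2D resource and bind the
   SAME CALmem to both the kernel's "i0" input and "o0" output, so a
   pass reads from and writes to the same surface in place.
   Error checking on each CAL call is omitted for brevity. */
void bind_in_place(CALcontext ctx, CALdevice device, CALmodule module,
                   CALuint width, CALuint height,
                   CALresource *res0, CALmem *mem0)
{
    CALname in1, out1;

    calResAllocLocal2D(res0, device, width, height,
                       CAL_FORMAT_FLOAT_4, 0);
    calCtxGetMem(mem0, ctx, *res0);            /* res0 -> mem0 */
    calModuleGetName(&in1,  ctx, module, "i0");
    calModuleGetName(&out1, ctx, module, "o0");
    calCtxSetMem(ctx, in1,  *mem0);            /* same mem0 ... */
    calCtxSetMem(ctx, out1, *mem0);            /* ... for both names */
}
```

This avoids the second buffer entirely, at the cost of whatever read/write hazards the hardware allows within a pass.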
By the way, I was having some problems with the timing code on Linux: wild variations that turned out to be caused by rounding error. I changed _freq in Timer.cpp to 1000000 and multiplied the right-hand sides of the _start=... and n=... lines by 1000 to avoid the integer divide, which has helped; even so, it seems better to run with "-t -r 10 -w 1024 -h 1024" or similar to get good results. Also, all tests count total bandwidth, which can skew the interpretation; e.g. the import tests also count a fast destination write, which explains why the tests with fewer reads -- and hence a higher write:read ratio -- appear on the surface to do better.
Best,
Steven.
PS If you get a chance to look at another old topic I'd be grateful to hear your comments; I should have known better than to post just around the release of the 1.2 beta! It is
this one (and the result still persists with a 4870 and 1.2).
Also there was the query about cache flushing at the bottom of
here...