Ok, attached is a simple example which misbehaves and does not do the right thing. Can anybody *pretty please* tell me what I'm doing wrong?
il_cs_2_0
dcl_num_thread_per_group 64
dcl_literal l0, 0x00000004, 0x0000007f, 0x00000000, 0x00000001
dcl_literal l1, 64, 0x0, 0x0, 0x0
dcl_literal l2, 0x0000ffff, 0x0000000f, 0x00000002, 0x00000002
dcl_literal l3, 1,2,3,4
dcl_literal l4, 10.0f,11.0f,12.0f,13.0f
dclarray r1,r5
mova a0.x, l3.x
; r1 should contain l2
; but actually results in
; r0 containing l2
mov r[a0.x], l2
;src0 relative does not compile!
;mov g[vaTid.x], r[a0.x]
; check r0 or r1 contains l2
mov g[vaTid.x], r0
;mov g[vaTid.x], r1
endmain
end
I see this same issue and I'm looking into the correct way to use relative addressing.
Relative addressing does not work on registers in this manner. From the IL spec that will be in the next release:
"Base and loop relative addressing cannot be used on registers not declared within the range of src1 and src2."
"Only shader inputs or outputs can be indexed in this way. The indexed (x) register type can be used for
So, you need to use a temp array for this if you want relative indexing into registers.
Your kernel will look something like:
dcl_literal l0, 0x00000004, 0x0000007f, 0x00000000, 0x00000001
dcl_literal l1, 64, 0x0, 0x0, 0x0
dcl_literal l2, 0x0000ffff, 0x0000000f, 0x00000002, 0x00000002
dcl_literal l3, 1,2,3,4
dcl_literal l4, 10.0f,11.0f,12.0f,13.0f
mov r10.x, l3.x
; r1 should contain l2
; but actually results in
; r0 containing l2
mov x0[r10.x], l2
;src0 relative does not compile!
mov g[vaTid.x], x0[r10.x]
Thanks for the reply and example for indexed temp.
Actually, in the end I was hoping to apply the lessons learned for indexed rX to indexed srX.
In sec 10.2 of the r700 ISA spec, page 331, there is the INDEX_GLOBAL bit to "Treat GPR address as absolute, not thread-relative", which I assume is set when I use an srX register. In addition there is the INDEX_GLOBAL_AR_X bit to "Treat GPR address as absolute, and add GPR-index (AR.X)". So in the ISA it seems to be possible to use indexed sr (SIMD-global) GPRs. My understanding was that "dclarray" just tells the IL->ISA compiler not to rearrange that range, and that using an sr triggers the INDEX_GLOBAL bit in the ALU opcode. Indexing an sr would then trigger the INDEX_GLOBAL_AR_X bit. At least it seems the hardware should support this. Do you not agree? For my application indexed sr is potentially a very powerful feature that would save me a lot of memory fetches ... assuming I can get more than a handful of sr registers per thread, and they don't conflict with normal GPRs.
Section 4.7 (Relative Addressing) of the "R600 Assembly Language Format" document (in the 1.4.0 Stream SDK) says this (regarding valid indices for relative addressing):
"Valid indexes are AL, A0.x, A0.y, A0.z, A0.w, and Ga0.x (the last is used for R700-family shared register indexing)"
(Somehow that last piece of info found its way into an R600 doc.) Just thought that I would mention it in case you hadn't seen it - that seems to support your hypothesis that the hardware is capable of this.
It is true that the hardware supports this; however, the IL language does not. Please refer to the IL spec when programming in IL, as not everything in the ISA is exposed directly in IL.
For shared registers conflicting with normal GPRs, for every shared register you use, you reduce the global register pool size by 2.
If you can give more information on how you want to use this, I can find out if there is an equivalent way to do it in IL.
A very nice feature of sr registers is their persistence across kernel calls. If there were a few more of them, this would allow me to greatly reduce memory fetching and dramatically increase my compute-to-fetch ratio, as my application is iterative and only about 10*64*4 threads. I have found I have 124 sr registers reliably preserving state across kernel calls with 640 threads. However, if I run another kernel in between (with the same dcl_shared_temp sr124 call) that uses only 9 GPRs according to the ISA disassembly, the majority of those 124 SRs are corrupted upon later checking, and only the first 8 maintain the value I stored.
A killer feature for compute shaders would be if the thread id to SIMD proc mapping could be controlled, and then each thread could be given its own persistent register pool, say with 128 elements if 640 threads are spawned, and split up from there, i.e. 64 elements for 1280 threads, etc.
Presently I'm using case statements on vTGroupid.x to assign threads their own persistent register pool.
Or is there another way to get persistence without going to memory?
You can control scheduling using the wavefrontAbs keyword for lds addressing mode. This forces only a single wavefront to execute at a time. Also, if you want SRs to be persistent across calls, ALL kernels must reserve the same number of SR registers.
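Since the kernels in this thread are built from Python strings, the "same number of SR registers in ALL kernels" rule can be checked on the host before anything is compiled. This is only a minimal sketch: it assumes each kernel reserves its SRs with a single declaration of the form quoted earlier in the thread (dcl_shared_temp sr124), and the helper names are made up for illustration.

```python
import re

# Matches the declaration form quoted in the thread: "dcl_shared_temp sr124".
# Lines commented out with ';' do not match because of the line anchor.
_SHARED_TEMP = re.compile(r"^\s*dcl_shared_temp\s+sr(\d+)", re.MULTILINE)


def shared_temp_count(il_source):
    """Return N from 'dcl_shared_temp srN', or 0 if the kernel declares none."""
    match = _SHARED_TEMP.search(il_source)
    return int(match.group(1)) if match else 0


def check_same_shared_temps(kernels):
    """kernels: dict of name -> IL source text.

    Returns the per-kernel SR counts, or raises ValueError if they differ.
    """
    counts = {name: shared_temp_count(src) for name, src in kernels.items()}
    if len(set(counts.values())) > 1:
        raise ValueError("kernels reserve different SR counts: %r" % counts)
    return counts
```

Calling check_same_shared_temps({"set": il_set, "get": il_get}) before building the modules would flag an in-between kernel that forgot the declaration, which is one possible cause of the corruption described above.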
LDS is also persistent across calls on the R7XX series, but the only way you are guaranteed this is if you use the calCtxRunProgramGridArray API call.
Is LDS also persistent on the R8XX series? I noticed there were some big changes for LDS.
Still, I suppose sr registers are much faster than LDS on R7XX and maybe on the R8XX series. If your engineers ask if any ideas have come from the forum on how to improve on the R8XX for compute, please mention improvements to persistent registers or something to that effect. Around 32 elements per thread would be sufficient to dramatically improve compute intensity of my kernels.
My tests show LDS is not persistent on R8XX ... which is one of my target platforms, so I cannot use LDS for persistence. Is there no other option to save a few state variables between kernel calls besides double buffering on a global buffer? The 8 preserved sr's are not enough. I would need about twice that. Or maybe I just do everything in sr's and keep my thread count to 640, 1280 for 4870, 5870 resp. ... I assume there is no performance difference between GPRs and global GPRs?
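For reference, the double-buffering fallback mentioned above amounts to the following host-side pattern. This is a plain-Python sketch of the idea only, not CAL code: the lists stand in for two global buffers, and step is a hypothetical per-element update that may read any element of the previous state.

```python
def iterate_double_buffered(state, step, iterations):
    """Apply `step` repeatedly, ping-ponging between two buffers.

    step(src, i) returns the new value of element i and may read any
    element of src, which is why the update cannot be done in place.
    """
    src = list(state)          # buffer read by the current pass
    dst = [None] * len(src)    # buffer written by the current pass
    for _ in range(iterations):
        for i in range(len(src)):
            dst[i] = step(src, i)
        src, dst = dst, src    # swap roles instead of copying
    return src
```

The cost compared with persistent registers is one fetch and one store of the state per kernel call, which is exactly the traffic the sr approach tries to avoid.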
It appears that
load_resource(0) sr2, vaTid.x000
does not work as expected, whereas
load_resource(0) r0, vaTid.x000
mov sr2, r0
does. Arggg. IL should generate more errors, and not let such things pass silently. So now I start to mix GPRs and global GPRs and it seems I can never be sure if writing to a GPR is not corrupting my global GPRs.
A GPR and an SR register are physically equivalent; the difference is whether the register offset in the hardware is absolute to the register file or relative to the wavefront ID.
Shared registers can only be used in ALU instructions and not texture/memory instructions because the texture unit does not have access to shared registers.
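A toy model of that addressing difference, for illustration only. It assumes each wavefront owns a contiguous block of gprs_per_wavefront slots starting at wavefront_id * gprs_per_wavefront; the real hardware layout, and how it keeps the shared and per-wavefront regions apart, is not described in this thread.

```python
def physical_register(index, wavefront_id, gprs_per_wavefront, absolute):
    """Map a register index to a slot in a toy register file.

    absolute=True models an SR: the index is used as-is, so every
    wavefront on the SIMD sees the same slot.
    absolute=False models a normal GPR: the index is offset by the
    wavefront's block (assumed layout, see the note above).
    """
    if absolute:
        return index
    return wavefront_id * gprs_per_wavefront + index
```

With 128 slots per wavefront, sr5 resolves to slot 5 for every wavefront, while r5 in wavefront 3 resolves to slot 389.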
When you are testing LDS persistency, are you using calCtxRunProgramGridArray? That would be the only API call that would guarantee it, if it exists.
Yep, I added a python wrapper for calCtxRunProgramGridArray today. LDS on my 4870 is persistent. On my 5770 it is not. Is that unexpected? I am running 640 threads with the wavefrontAbs keyword.
Here's the code if you're interested ...
import unittest
import sys
import numpy
import amdcal as cal
import pdb


class GridArrayTestCase(unittest.TestCase):
    def test_lds_persist(self):
        from time import time
        print("\n")

        # First kernel: each thread loads a value and stores it in LDS
        il_set = """
il_cs_2_0
dcl_num_thread_per_group 64
dcl_literal l0, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
dcl_literal l1, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff
dcl_literal l2, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l3, 0x00000000, 0x00000004, 0x00000008, 0x0000000C
dcl_lds_size_per_thread 8 ; size in bytes
dcl_lds_sharing_mode _wavefrontAbs
dcl_resource_id(0)_type(1d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
load_resource(0) r0, vaTid.x000
lds_write_vec_lOffset(0) mem.xyzw, r0
end
"""

        # Second kernel: each thread reads LDS back and writes it to g[]
        il_get = """
il_cs_2_0
dcl_num_thread_per_group 64
dcl_literal l0, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
dcl_literal l1, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff
dcl_literal l2, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l3, 0x00000000, 0x00000004, 0x00000008, 0x0000000C
dcl_lds_size_per_thread 8 ; size in bytes
dcl_lds_sharing_mode _wavefrontAbs
lds_read_vec r0, vTid.x000
mov g[vaTid.x], r0
endmain
end
"""

        grid_size = 64
        hw = 640
        devnum = 0

        # Initialize CAL, open device and create context
        cal.Init()
        dev = cal.Device(devnum)
        ctx = cal.Context(dev)

        print("Compile and link")
        sys.stdout.flush()

        # Compile, link and load kernel program
        target = dev.GetInfo()['target']
        obj_set = cal.Code(cal.CAL_LANG_IL, target, il_set)
        img_set = cal.Image(obj_set)
        obj_get = cal.Code(cal.CAL_LANG_IL, target, il_get)
        img_get = cal.Image(obj_get)

        print("Create and populate memory")
        sys.stdout.flush()
        res_io = cal.ResourceLocal1D(dev, hw, cal.CAL_FORMAT_UINT_4)

        print("Mapping memory")
        sys.stdout.flush()
        arr_in = numpy.random.randint(0, 100, size=(hw*4,)).astype(numpy.uint32)
        res_io.LoadArray(arr_in)
        mem_io = cal.MemObject(ctx, res_io)

        # Load and set up module
        mod_set = cal.Module(ctx, img_set)
        mod_set.Bind("i0", mem_io)

        # Load and set up module
        mod_get = cal.Module(ctx, img_get)
        mod_get.Bind("g", mem_io)

        # Run it
        evt = ctx.RunGridArray([(mod_set, "main", hw, 1, grid_size),
                                (mod_get, "main", hw, 1, grid_size)])
        while not evt.IsDone():
            pass

        # Extract results and display them
        arr_out = res_io.ToArray()
        #pdb.set_trace()
        print(arr_out[:10])
        print(arr_in[:10])
        assert numpy.all(arr_out == arr_in)


def suite():
    suite = unittest.makeSuite(GridArrayTestCase, 'test')
    return suite


if __name__ == "__main__":
    # unittest.main()
    runner = unittest.TextTestRunner(verbosity=2)
    runner.run(suite())
That looks correct, but it is the old method of accessing LDS/memory. However, I have verified with our hardware folks that LDS should be persistent across kernel calls. I think the problem is that the method for controlling scheduling on 7XX is different from 8XX. Instead you should be using dcl_lds_id(N) and dcl_lds_size(). On the HD5XXX series of cards, dcl_lds_sharing_mode has no meaning. You need to allocate the LDS so that only a certain number of threads will fit in a SIMD and that it is the same between kernel calls. A test you can do is this:
Run 11 groups of 1024 threads and an LDS size of 32k, with each thread reading the value that is in LDS and writing it out to a unique location and then writing the group ID to LDS.
Then run the kernel again outputting to a second memory buffer.
If data is persistent, then one or more of your groups in the second run will read data that was written by the first kernel.
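The logic of that test can be dry-run on the host before writing any IL. The sketch below is a simulation under stated assumptions, not a model of the real scheduler: groups are placed round-robin on 10 SIMDs, one LDS word per SIMD stands in for the whole 32k allocation, and None stands in for whatever uninitialized LDS would contain. It also adds one thing the post does not ask for: each write carries a launch number in its upper bits. With 11 groups on 10 SIMDs, the last group of a launch can read a value left by an earlier group of the same launch, and the tag keeps that from being mistaken for persistence.

```python
def run_launch(lds, launch, num_groups):
    """One kernel launch: each group reads its SIMD's LDS word, then
    overwrites it with (launch << 16) | group_id. Returns what each
    group read, i.e. what it would write to its unique output slot."""
    seen = []
    for group_id in range(num_groups):
        simd = group_id % len(lds)      # assumed round-robin placement
        seen.append(lds[simd])
        lds[simd] = (launch << 16) | group_id
    return seen


def read_from_launch(output, launch):
    """Values in an output buffer that were written by the given launch."""
    return [v for v in output if v is not None and v >> 16 == launch]


def simulate(persistent, num_simds=10, num_groups=11):
    """Return True if the second launch observes data from the first."""
    lds = [None] * num_simds
    run_launch(lds, 1, num_groups)
    if not persistent:
        lds = [None] * num_simds        # LDS contents lost between launches
    second = run_launch(lds, 2, num_groups)
    return len(read_from_launch(second, 1)) > 0
```

On real hardware the equivalent check would be read_from_launch(second_buffer, 1) on the buffer written by the second run.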
Thanks for the quick and informative reply. I'll try it, but I don't get what is the function of the dcl_lds_id(N) call, and what the value of N should be. This is presumably documented as a part of the next CAL/IL release? Is it even supported with the present public release?
In 7XX IL, you can only have a single LDS buffer; in 8XX you can have multiple LDS buffers, each given a specific ID. The compiler will then handle the relative offsets for you so that you can access each LDS buffer with the same relative address, but the actual hardware location will be different.
lds_load_id(0) r0, r0.0
lds_load_id(1) r1, r0.0
would read values 0 and 1024 from LDS memory even though the offset to the load instruction is the same. These LDS instructions are byte addressed to dword aligned memory. This information will be in the next IL document, but should already work with OpenCL/CAL as this is what OpenCL generates. If you use one of the methods on this forum to dump the IL from a kernel that uses local memory, you will see more examples of the instructions.
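One way to picture the offset handling described here is the sketch below. It assumes the compiler packs the LDS buffers back to back in ID order and that buffer 0 is 1024 bytes, which is what the 0 and 1024 in the example imply; the actual placement policy is up to the compiler and is not stated in this thread.

```python
def lds_bases(sizes_by_id):
    """Assign each LDS buffer a base byte address, packed in ID order
    (assumed policy, see the note above)."""
    bases, next_free = {}, 0
    for lds_id in sorted(sizes_by_id):
        bases[lds_id] = next_free
        next_free += sizes_by_id[lds_id]
    return bases


def lds_physical_address(bases, lds_id, byte_offset):
    """Resolve a buffer-relative byte offset to a physical LDS address.
    Offsets are byte addressed but must be dword (4-byte) aligned."""
    if byte_offset % 4:
        raise ValueError("LDS offset must be dword aligned")
    return bases[lds_id] + byte_offset
```

With two 1024-byte buffers, offset 0 in buffer 0 resolves to physical byte 0 and offset 0 in buffer 1 resolves to physical byte 1024, matching the two loads above.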
Very cool. I'll look deeper into this.