I need to do relative addressing on r0-r10, for example, as hinted at in section 3.3 of the IL language spec. Does anyone have a kernel example, please? Everything I try either generates no opcodes in the ISA or does not compile.
Thanks a lot.
Ok, attached is a simple example which misbehaves. Can anybody *pretty please* tell me what I'm doing wrong?
il_cs_2_0
dcl_num_thread_per_group 64
dcl_literal l0, 0x00000004, 0x0000007f, 0x00000000, 0x00000001
dcl_literal l1, 64, 0x0, 0x0, 0x0
dcl_literal l2, 0x0000ffff, 0x0000000f, 0x00000002, 0x00000002
dcl_literal l3, 1, 2, 3, 4
dcl_literal l4, 10.0f, 11.0f, 12.0f, 13.0f
dclarray r1, r5
mova a0.x, l3.x
; r1 should contain l2,
; but actually r0 ends up
; containing l2
mov r[a0.x], l2
; src0-relative does not compile!
;mov g[vaTid.x], r[a0.x]
; check whether r0 or r1 contains l2
mov g[vaTid.x], r0
;mov g[vaTid.x], r1
endmain
end
Hi Micah,
Thanks for the reply and the indexed-temp example.
Actually, I was hoping in the end to apply the lessons learned for indexed rX to indexed srX.
In sec 10.2 of the R700 ISA spec, page 331, there is the INDEX_GLOBAL bit to "Treat GPR address as absolute, not thread-relative", which I assume is set when I use an srX register. In addition, there is the INDEX_GLOBAL_AR_X bit to "Treat GPR address as absolute, and add GPR-index (AR.X)". So at the ISA level it seems possible to use indexed sr (SIMD-global) GPRs. My understanding was that "dclarray" just tells the IL->ISA compiler not to rearrange that register range, and that using an sr triggers the INDEX_GLOBAL bit in the ALU opcode; indexing an sr would then trigger the INDEX_GLOBAL_AR_X bit. At least it seems the hardware should support this. Do you not agree? For my application, indexed sr is potentially a very powerful feature that would save me a lot of memory fetches ... assuming I can get more than a handful of sr registers per thread and they don't conflict with normal GPRs.
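For concreteness, this is the kind of fragment I am hoping to write. The sr[a0.x] syntax is purely my guess at how an indexed shared temp might be spelled in IL, and it may be exactly the thing that fails to compile:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr16            ; guessing this declares sr0-sr15
dcl_literal l0, 3, 0, 0, 0
mov r0, l0
mova a0.x, l0.x
mov sr[a0.x], r0                ; indexed sr write -> would need INDEX_GLOBAL_AR_X
mov r1, sr[a0.x]                ; indexed sr read
mov g[vaTid.x], r1
endmain
end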
Section 4.7 (Relative Addressing) of the "R600 Assembly Language Format" document (in the 1.4.0 Stream SDK) says this (regarding valid indices for relative addressing):
"Valid indexes are AL, A0.x, A0.y, A0.z, A0.w, and Ga0.x (the last is used for R700-family shared register indexing)"
(Somehow that last piece of info found its way into an R600 doc.) Just thought that I would mention it in case you hadn't seen it - that seems to support your hypothesis that the hardware is capable of this.
Jeremy Furtek
Hi Micah,
A very nice feature of sr registers is their persistence across kernel calls. If there were a few more of them, I could greatly reduce memory fetching and dramatically increase my compute-to-fetch ratio, since my application is iterative and uses only about 10*64*4 threads. I have found that 124 sr registers reliably preserve state across kernel calls with 640 threads. However, if I run another kernel in between (with the same dcl_shared_temp sr124 declaration) that uses only 9 GPRs according to the ISA disassembly, the majority of those 124 SRs are corrupted when I check later; only the first 8 keep the value I stored.
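For reference, here is a minimal sketch of the store/reload pattern I am relying on (assuming dcl_shared_temp sr124 declares sr0-sr123; resource ids and register numbers are illustrative):

; kernel A: stash per-thread state in a shared temp
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr124
dcl_resource_id(0)_type(1d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
load_resource(0) r0, vaTid.x000
mov sr0, r0                   ; route through a GPR, then into the shared temp
endmain
end

; kernel B, dispatched later with the same declaration:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr124
mov r0, sr0                   ; read back the value stored by kernel A
mov g[vaTid.x], r0
endmain
end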
A killer feature for compute shaders would be control over the thread-id-to-SIMD-processor mapping. Each thread could then be given its own persistent register pool: say 128 elements each if 640 threads are spawned, 64 elements for 1280 threads, and so on.
Presently I'm using case statements on vTGroupid.x to assign threads their own persistent register pools.
Or is there another way to get persistence without going to memory?
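Roughly like this, written here with ieq/if_logicalnz rather than an actual switch (a sketch; group ids and the pool split are illustrative, and the real kernel has one case per group):

dcl_literal l0, 0, 1, 2, 3         ; group ids
ieq r10.x, vTGroupid.x, l0.x       ; this group == group 0?
if_logicalnz r10.x
    mov sr0, r0                    ; group 0 owns sr0-sr11 (split illustrative)
endif
ieq r10.x, vTGroupid.x, l0.y       ; this group == group 1?
if_logicalnz r10.x
    mov sr12, r0                   ; group 1 owns sr12-sr23
endif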
Is LDS also persistent on the R8XX series? I noticed there were some big changes for LDS.
Still, I suppose sr registers are much faster than LDS on R7XX, and maybe on the R8XX series too. If your engineers ask whether any ideas for improving R8XX compute have come from the forum, please mention improvements to persistent registers or something to that effect. Around 32 elements per thread would be sufficient to dramatically improve the compute intensity of my kernels.
My tests show LDS is not persistent on R8XX, which is one of my target platforms, so I cannot use LDS for persistence. Is there any option for saving a few state variables between kernel calls other than double buffering on a global buffer? The 8 preserved sr's are not enough; I would need about twice that. Or maybe I should just do everything in sr's and keep my thread count to 640 or 1280 for the 4870 and 5870 respectively ... I assume there is no performance difference between GPRs and global GPRs?
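To be clear about the fallback I'm trying to avoid, it would look something like this at the seams of each kernel, binding the same memory both as the global buffer and as resource i0 (a sketch; in practice two such buffers would be ping-ponged between iterations to avoid read/write hazards):

; end of kernel N: spill per-thread state to the global buffer
mov g[vaTid.x], r7

; start of kernel N+1, with that memory bound as resource i0
dcl_resource_id(0)_type(1d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
load_resource(0) r7, vaTid.x000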
It appears that
load_resource(0) sr2, vaTid.x000
does not work as expected, whereas
load_resource(0) r0, vaTid.x000
mov sr2, r0
does. Arggg. IL should generate more errors instead of letting such things pass silently. Now that I'm mixing GPRs and global GPRs, it seems I can never be sure that writing to a GPR isn't corrupting my global GPRs.
Originally posted by: MicahVillmow
When you are testing LDS persistency, are you using calCtxRunProgramGridArray? That would be the only API call that would guarantee it, if it exists.
Yep, I added a Python wrapper for calCtxRunProgramGridArray today. LDS on my 4870 is persistent; on my 5770 it is not. Is that unexpected? I am running 640 threads with the wavefrontAbs keyword.
Here's the code if you're interested ...
import unittest
import sys
import numpy
import amdcal as cal
import pdb

class GridArrayTestCase(unittest.TestCase):
    def test_lds_persist(self):
        from time import time
        print "\n"
        il_set = """
il_cs_2_0
dcl_num_thread_per_group 64
dcl_literal l0, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
dcl_literal l1, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff
dcl_literal l2, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l3, 0x00000000, 0x00000004, 0x00000008, 0x0000000C
dcl_lds_size_per_thread 8 ; size in bytes
dcl_lds_sharing_mode _wavefrontAbs
dcl_resource_id(0)_type(1d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
load_resource(0) r0, vaTid.x000
lds_write_vec_lOffset(0) mem.xyzw, r0
end
"""
        il_get = """
il_cs_2_0
dcl_num_thread_per_group 64
dcl_literal l0, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
dcl_literal l1, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff
dcl_literal l2, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l3, 0x00000000, 0x00000004, 0x00000008, 0x0000000C
dcl_lds_size_per_thread 8 ; size in bytes
dcl_lds_sharing_mode _wavefrontAbs
lds_read_vec r0, vTid.x000
mov g[vaTid.x], r0
endmain
end
"""
        grid_size = 64
        hw = 640
        devnum = 0

        # Initialize CAL, open device and create context
        cal.Init()
        dev = cal.Device(devnum)
        ctx = cal.Context(dev)

        print "Compile and link"
        sys.stdout.flush()

        # Compile, link and load the kernel programs
        target = dev.GetInfo()['target']
        obj_set = cal.Code(cal.CAL_LANG_IL, target, il_set)
        img_set = cal.Image(obj_set)
        obj_get = cal.Code(cal.CAL_LANG_IL, target, il_get)
        img_get = cal.Image(obj_get)

        print "Create and populate memory"
        sys.stdout.flush()
        res_io = cal.ResourceLocal1D(dev, hw, cal.CAL_FORMAT_UINT_4)

        print "Mapping memory"
        sys.stdout.flush()
        arr_in = numpy.random.randint(0, 100, size=(hw*4,)).astype(numpy.uint32)
        res_io.LoadArray(arr_in)
        mem_io = cal.MemObject(ctx, res_io)

        # Load and set up the store ("set") module
        mod_set = cal.Module(ctx, img_set)
        mod_set.Bind("i0", mem_io)

        # Load and set up the readback ("get") module
        mod_get = cal.Module(ctx, img_get)
        mod_get.Bind("g[]", mem_io)

        # Run both kernels back to back as a grid array
        evt = ctx.RunGridArray([(mod_set, "main", hw, 1, grid_size),
                                (mod_get, "main", hw, 1, grid_size)])
        while not evt.IsDone():
            pass

        # Extract results and display them
        arr_out = res_io.ToArray()
        #pdb.set_trace()
        print arr_out[:10]
        print arr_in[:10]
        assert numpy.all(arr_out == arr_in)

def suite():
    suite = unittest.makeSuite(GridArrayTestCase, 'test')
    return suite

if __name__ == "__main__":
    # unittest.main()
    runner = unittest.TextTestRunner(verbosity=2)
    runner.run(suite())
Thanks for the quick and informative reply. I'll try it, but I don't understand the function of the dcl_lds_id(N) call or what the value of N should be. Presumably this is documented as part of the next CAL/IL release? Is it even supported in the present public release?
Very cool. I'll look deeper into this.