It seems the 69xx compute features have been leaked as slides.
But before getting to that, I want to know whether GDS will be exposed in OpenCL. It is even mentioned in the OCL webinars, so is AMD still interested in implementing:
*global wave sync?
The new features are:
*concurrent kernel support ("asynch kernel dispatch" in the slide): asynchronous kernel launch is already supported on the 58xx series, right? Even concurrent kernels seem to be, but it seems 69xx adds a private address space for every kernel, so this should be easier to expose in OpenCL. Can we expect concurrent kernel support to be exposed in the launch OCL drivers? A sample using concurrent kernels would be good.
Also, will every SIMD core be limited to running threadblocks from only one kernel (like Fermi), or will it be able to run arbitrary kernels (within the limits of local memory usage and register pressure)?
*dual DMA engines: good job, but will they be supported at the same time as single DMA, or later? Can we expect support shortly after launch (1-2 months after)? Also, to be answered later: the Antilles dual-chip 69xx will have quad DMA engines, right? Since every GPU has its own memory address space, that will be necessary, right?
An OCL dual DMA sample would be good too.
*The slides seem to mention full APP support for Antilles. I hope multi-GPU cards can be used without serialization points in OpenCL, so that 100% usage of all GPUs can be extracted using two independent command queues. Some simple scalable samples would add trust in the multi-GPU hardware and driver.
In the ROPs slide I see "coalescing writes" support, and in the compute slide "coalescing of shader read". So I ask: aren't coalesced reads and writes already supported on 58xx? I hope that after the 69xx release someone can explain what specific improvements the 69xx series adds, or what coalescing limitations the 58xx hardware has.
I also see "fetch direct to LDS". I hope this means data can go from host memory to LDS without passing through global memory. It's not clear what an OpenCL extension for this would look like: either a host-side API for sending data to LDS, or a kernel function similar to the prefetch-to-local-memory functions. The latter would imply that device-accessible host memory gets exposed in OpenCL, which would be the right thing to do. Even the 5xxx series already lets the device access host memory (called mem import and export), but that is not exposed in OpenCL.
Even if you are able to answer for yourself, many people would be interested in your findings.
Also, could someone expand the abbreviations GDS/GWS? I tried looking in many places, and they are always referred to only by their abbreviations.
Plus, I would be interested in how global wave sync could be used from inside the code. There is no standard function call for it, so I suspect it would be a vendor-specific extension. Correct me if I'm wrong.
Nice article, though I don't have a clue whether it is correct.
"Cayman uses a tremendous amount of on-chip storage, which has a critical impact on performance, power and area. Each of the 24 SIMDs has a 256KB register file, 32KB LDS and 8KB L1 texture cache. Shared across the chip are the 512KB L2 texture cache, 64KB GDS, 32KB write combining cache and 128KB for the read/write (or color) cache. Altogether that is a total of 7840KB of data storage arrays – and this isn’t even counting the arrays used for instruction caching. The storage arrays are all designed using a custom memory compiler targeted at TSMC’s 40nm process and do not implement ECC. The cost of ECC is relatively high in terms of power and area, and more importantly, for graphics workloads, errors are determined by visual acuity, rather than bit for bit accuracy. Eschewing ECC for SRAM is an example of how AMD has balanced the competing needs of the graphics and compute markets – and focused on low hanging fruit."
I posted that article a while ago.
The second architecture memory structure is the Global Data Share (GDS), which is 64KB and shared by the entire GPU. The GDS plays a similar role to the LDS, but for sharing and communication across an entire kernel, rather than just a work-group. It is also 32-banked, with 25 cycle access latency, and includes atomic execution units and counters for append instructions and reductions. While not technically a part of the SIMD (since it is a globally shared structure), the GDS is explicitly available to each SIMD. The GDS is a structure that does not correspond to anything in the OpenCL or DirectCompute specification (unlike the LDS), and must be accessed and exposed through a vendor specific extension. However, it is used by the drivers for certain DirectX features such as append/consume buffers and UAV counters.
And in the cl_ext.h that AMD ships with the SDK you can find these function prototypes:
clCreateCounterAMD, clGetCounterInfoAMD, clRetainCounterAMD, clReleaseCounterAMD, clEnqueueReadCounterAMD, clEnqueueWriteCounterAMD.
They are under cl_amd_atomic_counter, so this will most likely be a new extension which, IMHO, exposes the GDS atomic counters. Normal global atomics are slow, and GDS atomics should be much faster (as fast as local atomics).
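To make the "counters for append instructions" idea from the article concrete, here is a rough CPU-side analogy in plain C11. It is only a sketch of the pattern (one shared counter hands out output slots, so writers never collide), not GDS code and not the cl_amd_atomic_counter API. The names append_buffer, append_init, append_push and append_size are made up for illustration, and nothing here says anything about the real signatures of the cl*CounterAMD functions.

```c
#include <stdatomic.h>

/* Hypothetical append buffer: a shared counter plus an output array.
   The counter stands in for a hardware append counter. */
typedef struct {
    atomic_uint count;    /* number of slots handed out so far */
    unsigned    capacity; /* number of slots in data */
    int        *data;     /* caller-provided storage */
} append_buffer;

void append_init(append_buffer *b, int *storage, unsigned capacity) {
    atomic_init(&b->count, 0u);
    b->capacity = capacity;
    b->data = storage;
}

/* Reserve a slot with a single atomic increment, then write to it.
   Returns the slot index, or -1 if the buffer is already full. */
int append_push(append_buffer *b, int value) {
    unsigned slot = atomic_fetch_add(&b->count, 1u);
    if (slot >= b->capacity) {
        return -1; /* the counter keeps growing, but nothing is written */
    }
    b->data[slot] = value;
    return (int)slot;
}

/* Number of valid elements, clamped to capacity. */
unsigned append_size(append_buffer *b) {
    unsigned n = atomic_load(&b->count);
    return n < b->capacity ? n : b->capacity;
}
```

Every writer pays exactly one atomic operation per appended element, which is why the latency of that one atomic (global memory vs. something on-chip like GDS) matters so much for this pattern.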
@nou: sorry for crossposting then.
Hmm, can someone confirm that Cayman/69xx has 256KB per SIMD?
I can't remember what Cypress and Juniper had... 32KB?
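This doesn't confirm the 256KB figure itself, but the numbers in the quoted article are at least internally consistent. Here is a small C sketch that redoes the sum; the sizes are taken straight from the quote, and the constant and function names are just ones picked for this example.

```c
#include <stdio.h>

/* Array sizes in KB, as given in the quoted article. */
enum {
    NUM_SIMDS        = 24,
    REG_FILE_KB      = 256, /* per SIMD */
    LDS_KB           = 32,  /* per SIMD */
    L1_TEX_KB        = 8,   /* per SIMD */
    L2_TEX_KB        = 512, /* shared */
    GDS_KB           = 64,  /* shared */
    WRITE_COMBINE_KB = 32,  /* shared */
    RW_CACHE_KB      = 128  /* shared */
};

/* Storage that is replicated in every SIMD, summed over the chip. */
int per_simd_total_kb(void) {
    return NUM_SIMDS * (REG_FILE_KB + LDS_KB + L1_TEX_KB);
}

/* Storage that exists once per chip. */
int shared_total_kb(void) {
    return L2_TEX_KB + GDS_KB + WRITE_COMBINE_KB + RW_CACHE_KB;
}

int chip_total_kb(void) {
    return per_simd_total_kb() + shared_total_kb();
}

/* Print the breakdown; call this from main() to see the numbers. */
void print_storage_breakdown(void) {
    printf("per-SIMD arrays: %d KB\n", per_simd_total_kb());
    printf("shared arrays:   %d KB\n", shared_total_kb());
    printf("total:           %d KB\n", chip_total_kb());
}
```

So 24 x (256 + 32 + 8) = 7104KB, plus 512 + 64 + 32 + 128 = 736KB shared, gives the 7840KB the article states. The register files alone account for 6144KB of that.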