oscarbarenys1
Adept II

Info and questions about OCL support for 69xx compute features..

 

Hi,

It seems the 69xx compute features have been leaked as slides.

But before getting to that, I want to know whether GDS will be exposed in OpenCL, since it was even mentioned in the OCL webinars. Is AMD still interested in implementing:

*GDS?

*global wave sync?

The new features are:

*Concurrent kernel support ("async kernel dispatch" in the slide): asynchronous kernel launch is already supported on the 58xx series, right? Even concurrent kernels seem to be, but the 69xx apparently adds a private address space for every kernel, so this should be easier to expose in OpenCL. Can we expect concurrent kernel support to be exposed in the launch OCL drivers? A sample using concurrent kernels would be good.

Also, will every SIMD core be allowed to run only thread blocks from one kernel (like Fermi), or arbitrary kernels (within the limits of local memory usage and register pressure)?

*Dual DMA engines: good job, but will this be implemented at the same time as single DMA, or later? Can we expect support shortly after launch (1-2 months)? Also, to be answered later about Antilles: the dual-chip 69xx will implement quad DMA engines, right? Since every GPU has its own memory address space, that will be necessary, right?

An OCL dual DMA sample would be good too.

*The slides seem to mention full APP support for Antilles. I hope multi-GPU cards are exploitable without serialization points in OpenCL, so that 100% usage of all GPUs can be extracted using two independent command queues. Some simple scalable samples would add trust in the multi-GPU hardware and driver.

I see "coalescing writes" support in the ROPs slide and "coalescing of shader read" in the compute slide. So I ask: aren't coalesced reads and writes already supported on the 58xx? I hope that after the 69xx release someone can answer what specific improvements the 69xx series adds, or what coalescing limitations the 58xx hardware has.

I also see a "fetch direct to LDS" feature. I hope this allows loading from host memory into LDS without going through global memory. It's not clear what an OpenCL extension for it would look like: either a host-side API that allows sending data to LDS, or a kernel function similar to the prefetch-to-local-memory functions. The latter would imply exposing device-accessible host memory in OpenCL, which would be the right thing to do. Right now even the 5xxx series allows accessing host memory from the device (called mem import and export), but that is not exposed in OpenCL.

 

 

0 Likes
13 Replies

GDS/GWS will come in a future version of the SDK; how it should work within OpenCL is still being defined. I can't speak about products that aren't released yet, so I can't comment on your other items.
0 Likes

Thanks for the update on GDS. I will bump the thread once the 69xx cards are available, if I'm not able to answer the questions myself.

0 Likes

Even if you are able to answer for yourself, many people would be interested in your findings.

Also, could someone expand the abbreviations GDS/GWS? I tried looking in many places, and they are always referred to only as abbreviations.

Plus, I would be interested in how it would be possible to implement global wave sync inside the code. There is no standard function call for it, so I suspect it would be a vendor-specific extension. Correct me if I'm wrong.

0 Likes

IMHO GDS is Global Data Share and GWS is Global Wave Sync.

0 Likes

Hi Meteorhead,

Just trying to be humble. Knowing that it can interest others, I will try to post my findings here.

 

 

0 Likes

Nice article, though I don't have a clue whether it is correct.

http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=9

"Cayman uses a tremendous amount of on-chip storage, which has a critical impact on performance, power and area. Each of the 24 SIMDs has a 256KB register file, 32KB LDS and 8KB L1 texture cache. Shared across the chip are the 512KB L2 texture cache, 64KB GDS, 32KB write combining cache and 128KB for the read/write (or color) cache. Altogether that is a total of 7840KB of data storage arrays – and this isn’t even counting the arrays used for instruction caching. The storage arrays are all designed using a custom memory compiler targeted at TSMC’s 40nm process and do not implement ECC. The cost of ECC is relatively high in terms of power and area, and more importantly, for graphics workloads, errors are determined by visual acuity, rather than bit for bit accuracy. Eschewing ECC for SRAM is an example of how AMD has balanced the competing needs of the graphics and compute markets – and focused on low hanging fruit."
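The 7840KB total in the quote can be checked from the per-SIMD and chip-wide figures it lists. A minimal sketch of that arithmetic, using only the numbers from the quoted article (the variable names are just for illustration):

```python
# Figures taken from the quoted article; names are illustrative only.
NUM_SIMDS = 24

# Storage replicated in every SIMD, in KB.
PER_SIMD_KB = {
    "register file": 256,
    "LDS": 32,
    "L1 texture cache": 8,
}

# Storage shared across the whole chip, in KB.
SHARED_KB = {
    "L2 texture cache": 512,
    "GDS": 64,
    "write combining cache": 32,
    "read/write (color) cache": 128,
}

per_simd_total = sum(PER_SIMD_KB.values())   # 296 KB per SIMD
shared_total = sum(SHARED_KB.values())       # 736 KB chip-wide
total_kb = NUM_SIMDS * per_simd_total + shared_total

print(total_kb)  # 7840
```

So the article's total is consistent with 24 SIMDs at 296KB each plus 736KB of shared arrays.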

0 Likes

I posted that article a while ago.

The second architecture memory structure is the Global Data Share (GDS), which is 64KB and shared by the entire GPU. The GDS plays a similar role to the LDS, but for sharing and communication across an entire kernel, rather than just a work-group. It is also 32-banked, with 25 cycle access latency, and includes atomic execution units and counters for append instructions and reductions. While not technically a part of the SIMD (since it is a globally shared structure), the GDS is explicitly available to each SIMD. The GDS is a structure that does not correspond to anything in the OpenCL or DirectCompute specification (unlike the LDS), and must be accessed and exposed through a vendor specific extension. However, it is used by the drivers for certain DirectX features such as append/consume buffers and UAV counters.


You can also find these function prototypes in cl_ext.h, which AMD ships with the SDK:

clCreateCounterAMD, clGetCounterInfoAMD, clRetainCounterAMD, clReleaseCounterAMD, clEnqueueReadCounterAMD, clEnqueueWriteCounterAMD.

They are listed under cl_amd_atomic_counter, so this will most likely be a new extension which, IMHO, exposes the GDS atomic counters. Normal global atomics are slow, and GDS atomics should be much faster (as fast as local atomics).
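To make the append-buffer use of such a counter concrete, here is a CPU-only sketch in Python. It is not the AMD extension and does not touch OpenCL or GDS at all; `AtomicCounter` and `append_matching` are hypothetical helpers that only imitate the pattern, where each worker atomically grabs the next free slot in a shared output buffer.

```python
import threading


class AtomicCounter:
    """CPU stand-in for a hardware atomic counter (hypothetical helper)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        """Atomically add 1 and return the previous value."""
        with self._lock:
            previous = self._value
            self._value += 1
            return previous

    def read(self):
        """Return the current counter value."""
        with self._lock:
            return self._value


def append_matching(values, predicate, num_workers=4):
    """Append every value that satisfies predicate to a shared buffer.

    Each worker reserves its output slot through the counter, so no two
    workers ever write to the same index. Output order is not defined.
    """
    output = [None] * len(values)
    counter = AtomicCounter()

    def worker(chunk):
        for v in chunk:
            if predicate(v):
                slot = counter.increment()  # reserve the next free slot
                output[slot] = v

    threads = [
        threading.Thread(target=worker, args=(values[i::num_workers],))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # The final counter value is the number of appended elements.
    return output[:counter.read()]


evens = append_matching(list(range(20)), lambda v: v % 2 == 0)
print(sorted(evens))
```

The speed of the whole pattern depends on how fast that one increment is, which is why a counter living in GDS rather than in global memory would matter.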

0 Likes

@nou: sorry for crossposting then.

Hmm, can someone confirm that Cayman/69xx has a 256KB register file per SIMD?

I can't remember what Cypress and Juniper had... 32KB?

 

0 Likes

The numbers between Cypress and Cayman/Juniper did not change.
0 Likes

Am I right that GWS will be implemented using GDS? Also, is there any info on whether the next OpenCL spec will include a standard way of reaching either GDS or global syncing?

These features will most likely only be included if NV cards have corresponding hardware to implement them.

0 Likes

Meteorhead,
That is correct, but I don't know of any public info on the next OpenCL spec. We are working internally on a way to expose GDS/GWS in OpenCL, but because of the nature of the feature, we want to get it correct.
0 Likes

I would be interested in just CAL/IL support and documentation for GDS. IL can be a lot of fun, and very fast. It's not such a chore either, if done in Python. I know there's lots to be done for OpenCL ... but then OpenCL 69xx features will be built on IL 69xx support, won't they? And the documentation of 69xx IL features is useful internally too, no? How about sending some of that IL goodness our way too every once in a while ... it's been a while since CAL/IL docs got a juicy update ... And AMD controls CAL/IL, so adding IL support for 69xx features like GDS wouldn't need to wait for OpenCL committee ratification ...

Thanks a million!


0 Likes

Originally posted by: emuller
How about sending some of that IL goodness our way too every once in a while ... it's been a while since CAL/IL docs got a juicy update ...


 

+1

0 Likes