
Meteorhead
Challenger

Future HW and SDK

Questions about upcoming tech

Hi, I have opened this topic so everyone has a place to post questions about the currently upcoming HW and SDK capabilities and properties.

0 Likes
46 Replies
Meteorhead
Challenger

Future HW and SDK

My first questions would be:

- What CL_DEVICE_TYPE will the GPU inside the upcoming Llano APUs be?

I ask because I fancy the thought of being able to write applications (and to see games) that run their regular code on the CPU, calculate physics, AI and other highly parallel parts on the IGP inside the CPU, and use the discrete GPU solely for graphics. Since APU stands for Accelerated Processing Unit, will the GPU inside Llano be a CL_DEVICE_TYPE_ACCELERATOR? It would be wise to make a distinction for devices that share their __global memory physically with the host (as Llano will do).
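
For what it is worth, OpenCL 1.1 already lets a host program tell such devices apart without any new device type, via CL_DEVICE_HOST_UNIFIED_MEMORY. A minimal sketch (the function name is mine, and dev is assumed to be a cl_device_id obtained earlier from clGetDeviceIDs):

/* Minimal sketch: does this device share physical memory with the host?
   CL_DEVICE_HOST_UNIFIED_MEMORY is a standard OpenCL 1.1 query. */
#include <stdio.h>
#include <CL/cl.h>

void print_memory_model(cl_device_id dev)
{
    cl_device_type type;
    cl_bool unified;
    clGetDeviceInfo(dev, CL_DEVICE_TYPE, sizeof(type), &type, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    printf("type: %s, host-unified memory: %s\n",
           (type & CL_DEVICE_TYPE_ACCELERATOR) ? "ACCELERATOR" :
           (type & CL_DEVICE_TYPE_GPU)         ? "GPU"         :
           (type & CL_DEVICE_TYPE_CPU)         ? "CPU"         : "other",
           unified ? "yes" : "no");
}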

- Will either Radeon 6xxx cards or the new APUs support out-of-order execution?

Out-of-order execution on GPUs is useful, although hard to harness, but inside an APU it would be most useful: by using OpenCL events smartly, one could create highly optimized engines for games, where memory handling, window management, AI, physics, etc. could run wickedly fast.
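
Just to make concrete what I mean by using events smartly, here is a rough sketch of the event-driven style on an out-of-order queue (ctx, dev, the kernels and gws are assumed to exist, and the kernel names are made up; whether kernels really overlap is up to the runtime):

/* Rough sketch: dependencies expressed with events instead of submission order. */
cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

cl_event physics_done, ai_done;
clEnqueueNDRangeKernel(q, physics_kernel, 1, NULL, &gws, NULL,
                       0, NULL, &physics_done);   /* independent            */
clEnqueueNDRangeKernel(q, ai_kernel, 1, NULL, &gws, NULL,
                       0, NULL, &ai_done);        /* may run concurrently   */

cl_event deps[2] = { physics_done, ai_done };
clEnqueueNDRangeKernel(q, resolve_kernel, 1, NULL, &gws, NULL,
                       2, deps, NULL);            /* waits for both above   */
clFinish(q);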

- How much effort would it take to have higher DP capacity and/or support for QP?

I read somewhere how Radeon cards deal with DP operations, namely that 2 stream cores are linked inside a vector processor for the duration of the operation while the remaining 3 are non-operational for the time being. This is the reason DP capacity is 1/5 of SP. I do not know how NVIDIA implements DP, but since each CUDA core has a single INT and FP unit, I suspect there are 2 ways: some CUDA cores are native 64-bit while others are not; OR the 32-bit INT and FP units do 64-bit operations at the cost of hidden register use.

Since OpenCL is inherently able to query preferred vector widths at certain precisions, and Radeon SIMD engines are inherently capable of doing 64 (or even 128) bit operations with 32-bit shader processors via this linking, the question is the following: I know linking stream cores to do 64-bit operations takes up space inside the die, but how much more would it take to have 4*1 for SP, 2*2 for DP, and 1*4 for QP operations? Quadruple precision might be a lot harder to implement on NVIDIA cards with their single execution units, and AMD could win quite a few customers in the GPGPU segment by being first to support a healthy QP capacity on GPUs; the same goes for merely the ability to link 2*2 stream cores to double the DP capacity. The Radeon 6xxx series might not, but future 28nm GPUs might have the space on the SIMD engines to do the extra linking.

0 Likes
nou
Exemplar

Future HW and SDK

Maybe they will introduce a new type, CL_DEVICE_TYPE_APU.

IMHO out-of-order is just a SW implementation detail of the queue. Concurrent running of multiple kernels is another story.

Each 5D unit can do a MADD instruction, which is counted as two FLOPs. With DP, two and two units are linked together to perform two DP +/-/* operations. So one 5D unit can do 10 SP ops/clock and 2 DP ops/clock.
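
To put those per-clock figures in perspective, a back-of-the-envelope peak for a Cypress-class part (320 VLIW5 stream cores at 850 MHz, i.e. HD 5870 numbers, purely illustrative):

/* Back-of-the-envelope peak throughput using the per-clock figures above:
   10 SP FLOPs and 2 DP FLOPs per VLIW5 stream core per clock. */
#include <stdio.h>

int main(void)
{
    const double cores = 320.0, clock_ghz = 0.85;
    double sp = cores * 10.0 * clock_ghz;  /* 2720 GFLOPS single precision */
    double dp = cores *  2.0 * clock_ghz;  /*  544 GFLOPS double precision */
    printf("SP: %.0f GFLOPS, DP: %.0f GFLOPS (DP/SP = 1/5)\n", sp, dp);
    return 0;
}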

 

0 Likes
bubu
Adept II

Future HW and SDK

Originally posted by: Meteorhead

- What CL_DEVICE_TYPE will the GPU inside the upcoming Llano APUs be?

I bet Llano will expose 2 OpenCL devices, one typed as a CPU and the other typed as a DX11 GPU.
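
Easy enough to check once the hardware is out; something like this would list what the platform exposes (error handling omitted, only the first platform queried):

/* Sketch: list every device the first platform exposes, to see whether an
   APU shows up as one device or as a separate CPU and GPU device. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devs[16];
    cl_uint ndev, i;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devs, &ndev);
    if (ndev > 16) ndev = 16;

    for (i = 0; i < ndev; ++i) {
        char name[256];
        cl_device_type type;
        clGetDeviceInfo(devs[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devs[i], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
        printf("%s (%s)\n", name,
               (type & CL_DEVICE_TYPE_GPU) ? "GPU" :
               (type & CL_DEVICE_TYPE_CPU) ? "CPU" : "other");
    }
    return 0;
}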

- Will either Radeon 6xxx cards or the new APUs support out-of-order execution?

I hope so, as well as DMA transfers...

 

0 Likes
nou
Exemplar

Future HW and SDK

IMHO, again, DMA transfer is just a limitation of the current implementation. Even the 4xxx series can do DMA transfers under CAL. IIRC someone from AMD stated that they are working on it.
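
For reference, this is how such an asynchronous (DMA-style) transfer would be expressed in OpenCL today; whether the copy actually overlaps with compute is up to the runtime and hardware (q, buf_in, host_ptr, bytes, kernel and gws are assumed to exist):

/* Sketch: non-blocking write plus an event dependency. The kernel starts
   only once the upload has completed, and the host is free to prepare the
   next batch while the copy is (ideally) DMA'd in the background. */
cl_event upload_done;
clEnqueueWriteBuffer(q, buf_in, CL_FALSE /* non-blocking */, 0,
                     bytes, host_ptr, 0, NULL, &upload_done);
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL,
                       1, &upload_done, NULL);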

0 Likes
Meteorhead
Challenger

Future HW and SDK

This is the part in the ATI OpenCL Computing Guide I have mentioned (quoted below). So do I have it right that when linking is done, no MADD operations are available, so one operation cannot be counted as 2 FLOPs? The quote is misleading in some way: it says "two or four are linked... to perform a SINGLE DP operation". Shouldn't it be 1 DP FLOP when linking two, and 2 DP FLOPs when linking four?

But if this last is true, then DP capacity could only be increased by adding MADD capability in linked processing element mode. QP needs a little more linking, and perhaps also the ability to deal with MADD operations.

If it is true, though, that 2 DP operations can be dealt with at once, why does OpenCL report the preferred DP vector width to be 1 on a 5970?

A stream core is arranged as a five-way very long instruction word (VLIW) processor. Up to five scalar operations can be coissued in a VLIW instruction, each of which are executed on one of the corresponding five processing elements. Processing elements can execute single-precision floating point or integer operations. One of the five processing elements also can perform transcendental operations (sine, cosine, logarithm, etc.) Double-precision floating point operations are processed by connecting two or four of the processing elements (excluding the transcendental core) to perform a single double-precision operation. The stream core also contains one branch execution unit to handle branch instructions.
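
For reference, the query I am referring to (the returned width is what the runtime would like kernels to use, which apparently need not match what the hardware can issue per clock; dev is any cl_device_id):

/* Sketch: the preferred-vector-width query mentioned above. */
cl_uint w_float, w_double;
clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                sizeof(w_float), &w_float, NULL);
clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                sizeof(w_double), &w_double, NULL);
printf("preferred vector width: float = %u, double = %u\n", w_float, w_double);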

0 Likes
himanshu_gautam
Grandmaster

Future HW and SDK

hi all,

Nice to hear your thoughts.

meteorhead,

I confirm the bug in the document. But I hope the issue has been clarified very well by nou.

0 Likes
malcolm3141
Journeyman III

Future HW and SDK

I believe this is referred to in the Optimisation Guide - a DP add or sub requires two pipes (in other words two can be scheduled in one bundle), but a DP mul or fma takes all four pipes (and hence only one can be scheduled in each bundle).

Talking of future hardware, I would love to see AMD include 32-bit multipliers in each of the xyzw pipes, and I could also see them providing enough hardware between two pipes to perform at least a DP mad, or even better a full-precision DP fma. To be able to claim >1 TFLOPS DP performance from a single GPU would be amazing!

 

Malcolm

0 Likes
Meteorhead
Challenger

Future HW and SDK

If I'm not mistaken, I recall AMD stating that it wishes to follow the APU approach on the Opteron front line besides desktop solutions. It would be nice to hear some bits (or even more) of information about these products. Is the plan only to integrate the IGP into the CPU to reduce energy consumption, or will there be processors with higher SIMD capacity?

I am very much interested in every way parallel computing hardware can be neatly integrated into HPC clusters. I think all supercomputer owners (as well as those looking for HPC solutions) would welcome a way to have upgradeable HW, meaning an Opteron would include a maximum of 4 cores, and the rest of the die would be SIMD engines (and some cache). This way existing 1U racks could be reused for a major upgrade in computing power.

Right now the neatest and most compact way of creating a GPU cluster would be the solutions offered by *beep*, where a 1U rackmount can hold 2 double-width GPUs. The only problem is that the half-width motherboard offered holds 2 processor sockets. GPU clusters (in my opinion) don't need very powerful processors, only fast RAM access and mediocre computing power. Having 1 quad-/hexa-/octa-core processor per GPU is a waste of money and computing power.

If anyone has anything to add, or correct me at points, please do.

0 Likes
Meteorhead
Challenger

Future HW and SDK

Instead of opening a new topic, let me post to a previous one:

I know AMD employees will not speak about unreleased HW, so let me ask a theoretical question purely based on news and publicly available information:

Some future AMD GPU (most likely the top Southern Islands part) will feature a brand new architecture designed from scratch, with the needs of APU integration kept in mind.

http://wccftech.com/2011/06/15/amd-slides-detail-upcoming-radeon-hd-79-series-gpu-architecture/

There is one thing I do not understand. How come they advertise this architecture as another step toward GPGPU applications, when I really cannot see how SIMD-vector processing is "general"? The VLIW architecture excelled at being the sweet spot between graphics and GPGPU: graphics used it as a vector processor, and GPGPU applications leveraged the compiler to vectorize scalar code. Having a 16-wide SIMD which 4 threads may share seems to me to mean that one thread gets a minimum of 4-wide SIMD. One thread simply cannot utilize 4-wide SIMD unless its code is vectorized.

As it seems to me:

1) Say goodbye to cross-vendor OCL code. Scalar OCL code will utilize 25% of the card (35% max); see the scalar vs. vectorized kernel sketch after this list. Hail to HPC and scientific use, where we'll have to develop two separate host- and kernel-side code paths.

2) Applications where vectorization cannot be done efficiently will simply fall far short of expectations on AMD HW.
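
To make point 1 concrete, here is the same toy kernel in scalar and explicitly vectorized form; whether the float4 version actually fills the VLIW lanes (or the 4-wide slices of the new SIMDs) is of course up to the compiler:

/* Toy OpenCL C kernels contrasting the two styles point 1 refers to.
   The scalar version leaves lane filling entirely to the compiler; the
   float4 version hands it four independent operations per work-item. */
__kernel void saxpy_scalar(__global const float *x,
                           __global float *y,
                           const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

__kernel void saxpy_vec4(__global const float4 *x,
                         __global float4 *y,
                         const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];   /* four MADs per work-item */
}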

The new architecture seems awesome, I really like all the new stuff packed into it, and big gratz to AMD for that. However, VLIW seemed like AMD's strength to me, and I thought that as soon as the superscalar/VLIW architecture is left behind, all that will remain is an architecturally inferior Tesla. The architecture has greatly developed and the superscalar design remains, but plain SIMD is far inferior to VLIW.

Please, someone tell me that I am wrong at some point. How will this be GPGPU?

0 Likes