Archives Discussions

savage309 · ‎04-25-2015

Hey,

I know that this topic has been raised a lot, but I want to raise it one more time.

I am working on some GPGPU apps and I need to make them run fast on all kinds of hardware, so for quite some time now I have been doing in-depth research into the hardware and what exactly it does.

I have more experience with CUDA and nVidia GPUs (since the nVidia OpenCL implementation is really bad), but I am more and more interested in AMD GPUs, since I can see so much potential in them that we are not using properly today.

I am trying to find out whether there is any difference between the SIMD model in GCN 1.2 and the SIMT model (in, say, Maxwell), or whether SIMT is just a marketing buzzword used by nVidia (honestly, I don't see much of a difference; if there is one, it has to be in the way branching is handled). If there is a difference, how does all this compare to the Intel GPUs?

Furthermore, we lack good video lectures on GCN (or at least I can't find any; on the other hand, we have the Stanford nVidia lectures, which are quite good). The GCN white paper could also use a bit of refining (I am not a hardware expert, but I have read quite a few white papers and have some understanding of hardware, and at some point it still lost me).

Thanks !

Raistmer · ‎04-26-2015

AFAICT AMD GPUs use SIMD registers and corresponding SIMD asm instructions.

That is, it's a true SIMD architecture, like SSE in the CPU world for example. Each thread (workitem) can issue a hardware SIMD instruction that operates on 4 float numbers.

On the other hand, nVidia's devices were always scalar. They can run many threads/workitems simultaneously (in this respect I see no big difference between AMD and nVidia, both are SIMT) but their ISA is scalar. That is, float4 is "emulated" by using 4 threads (or 4 serial operations in one thread, depending on the actual implementation), while the AMD chip has a single vector instruction.

It would indeed be interesting to get more info from experienced people, especially about iGPUs, which are relatively new and poorly covered.

savage309 · ‎04-26-2015

If I have ...

float4 a = global_data_0[thread_id];

float4 b = global_data_1[thread_id];

float4 res = a + b;

And I run N threads, each one of them will execute (a.x + b.x), then (a.y + b.y) and so on, right? It is SIMD because the single instruction (the sum) is executed on multiple data fields (from global_data), not because it does a SIMD sum on the float4, right? (The execution is spread across the compute units, whose lanes run in lock step, and so on.) And this is exactly what nVidia SIMT does, so they should be (roughly) the same.
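
To make the question above concrete, here is a minimal sketch in plain C that emulates the execution model on the CPU; it is not real GPU code. The names `WAVE_SIZE`, `float4_t` and `wave_add_float4` are hypothetical, and the width of 64 is the GCN wavefront size mentioned later in the thread (an nVidia warp would be 32). Each component add is treated as one instruction, and each inner loop stands in for the lanes that run that instruction in lock step.

```c
#include <assert.h>
#include <stddef.h>

#define WAVE_SIZE 64  /* GCN wavefront width; an nVidia warp would be 32 */

/* Hypothetical stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } float4_t;

/*
 * Emulates one wavefront executing "res = a + b" on float4 operands.
 * Every component add is one instruction issued once for the wavefront;
 * each loop models the lanes that the hardware runs in lock step.
 * Returns the number of instructions issued (4, one per component).
 */
int wave_add_float4(const float4_t *a, const float4_t *b,
                    float4_t *res, size_t lanes)
{
    int issued = 0;
    assert(lanes <= WAVE_SIZE);

    for (size_t l = 0; l < lanes; ++l) res[l].x = a[l].x + b[l].x;
    ++issued;  /* add on .x, all lanes */
    for (size_t l = 0; l < lanes; ++l) res[l].y = a[l].y + b[l].y;
    ++issued;  /* add on .y, all lanes */
    for (size_t l = 0; l < lanes; ++l) res[l].z = a[l].z + b[l].z;
    ++issued;  /* add on .z, all lanes */
    for (size_t l = 0; l < lanes; ++l) res[l].w = a[l].w + b[l].w;
    ++issued;  /* add on .w, all lanes */

    return issued;
}
```

In this model the float4 add costs four instructions per wavefront no matter how many lanes are active, which is the "scalar per thread, wide across threads" picture the question describes.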

maxdz8 · ‎04-27-2015

Sounds right.

It's very simple: what you write in a kernel definition is what the ALU assigned to the WI will do.

If you want to do SIMD in the sense of having multiple ALUs "collaborating" on adding vectors, you will most likely go through LDS and write a kernel definition in which each WI is explicitly aware of the other WIs running "in tandem" with it.

maxdz8 · ‎04-27-2015

Most definitely not. As far as I can tell, all instructions in AMD GCN are scalar in nature. Note: I'm not an expert at the ISA level. I've just quickly skimmed the ISA manual a few times.

AMD GCN classifies instructions as "vector" or "scalar" based on whether they run on the vector ALUs or on the scalar unit. They are not "vector" in the sense that "they operate on vectors"; rather, they are "executed across a vector of ALUs".

Your interpretation was sort of correct for VLIW. AFAIK.

tzachi_cohen · ‎04-27-2015

From a high-level perspective, AMD's GCN architecture is a scalar SIMT design, much like our green competitor's.

There are, of course, several implementation differences. For example, our SIMT execution unit is called a wavefront and is 64 threads wide, while theirs is called a warp and is 32 threads wide.

AMD has a unique scalar engine. While observing kernel code we noticed that some instructions have identical data sets across all threads of a wavefront. These cases can be detected by the compiler, which will then issue the instruction to the scalar engine instead of the vector engine. A scalar instruction is executed once for all threads instead of being executed identically 64 times. This helps to reduce power consumption. Moreover, since the scalar and vector engines are independent of each other, they can process instructions from different wavefronts in parallel.

AMD GPUs can execute different OCL kernels in parallel. When several OCL queues are initialized, they can be bound to different GPU entry points and submit kernels for execution concurrently. This helps to saturate the GPU when launching small kernels that do not fully occupy the machine individually.
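
A rough way to picture the scalar engine is the following plain C emulation. It is only a sketch under stated assumptions: `op_count_t` and `wave_increment_row` are hypothetical names, the operation counts are an illustrative model rather than real ISA output, and the uniform work is integer index math, chosen because it is identical for every thread in the wavefront.

```c
#include <assert.h>
#include <stddef.h>

#define WAVE_SIZE 64  /* threads per GCN wavefront */

/* Illustrative operation counters for the two engines. */
typedef struct {
    unsigned scalar_ops;       /* executed once per wavefront */
    unsigned vector_lane_ops;  /* executed once per lane */
} op_count_t;

/*
 * Emulates out[base + lane] = in[base + lane] + 1 for one wavefront,
 * where base = row * row_stride is the same for every lane.
 * The uniform base is computed once (the scalar-engine case);
 * the increment depends on per-lane data, so it costs one op per lane.
 */
op_count_t wave_increment_row(const int *in, int *out,
                              size_t row, size_t row_stride)
{
    op_count_t c = {0, 0};
    assert(in != NULL && out != NULL);

    size_t base = row * row_stride;  /* uniform across the wavefront */
    c.scalar_ops += 1;

    for (size_t lane = 0; lane < WAVE_SIZE; ++lane) {
        out[base + lane] = in[base + lane] + 1;  /* per-lane work */
        c.vector_lane_ops += 1;
    }
    return c;
}
```

Without a scalar engine, the `base` computation in this model would be repeated identically in all 64 lanes, so the count would be 64 + 64 lane operations instead of 1 + 64.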

Raistmer · ‎04-27-2015

Thanks for the correction. Yes, I was talking about older architectures.

BTW, from which GPU family onward is this

tzachi.cohen wrote:

AMD GPUs can execute different OCL kernels in parallel. When several OCL queues are initialized, they can be bound to different GPU entry points and submit kernels for execution concurrently. This helps to saturate the GPU when launching small kernels that do not fully occupy the machine individually.

feature supported (and is it exposed in the OCL runtime)?

AFAIK nVidia has done the same since the Fermi family, no?

tzachi_cohen · ‎04-27-2015

Async compute is supported on all GCN GPUs, with a varying number of ACEs (Asynchronous Compute Engines). SI devices have two ACEs; Hawaii has 8.

You can read more about it in:

AMD Dives Deep On Asynchronous Shading

According to AnandTech, this feature is not supported on Fermi/Kepler, at least not while the graphics queue is active.

Raistmer · ‎04-28-2015

Thanks for the link.

Fermi was described as being able to run 2 kernels simultaneously. And for Kepler they announced this:

Hyper-Q – Hyper‐Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper‐Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware‐managed connections (compared to the single connection available with Fermi). Hyper‐Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see up to dramatic performance increase without changing any existing code.

http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

And for Fermi too:

Concurrent kernel execution: at the GPU level, GPU functions can be executed simultaneously. Architecture 2.x: 16 kernels [1]. Architecture 3.x: 32 kernels [1].

https://rcc.its.psu.edu/education/seminars/pages/advanced_cuda/AdvancedCUDA5.pdf

(slide 11/77)

Could you comment on how this differs from GCN's ACEs?

As I understand it, the main benefit highlighted in that AnandTech article is GCN's ability to run several shaders (specifically, graphics shaders) simultaneously. That is definitely a great advantage, but it has little to do with the GPGPU area. Here we mostly use compute shaders. And even that AnandTech article lists 32 compute queues for Kepler.

Also, IMHO this conclusion

So pre-Maxwell 2 GPUs have to either execute in serial or pre-empt to move tasks ahead of each other, which would indeed give AMD an advantage

directly contradicts both the table in the article (32 compute queues for Kepler) and nVidia's claims about their hardware and its ability (starting from Fermi) to execute warps from 2 different kernels simultaneously.

BTW, that sentence makes much more sense when read in context: it becomes obvious they are again talking about mixing compute and graphics tasks (shaders), not compute with compute. But again, while being able to compute and render simultaneously is a good thing, it's irrelevant for many pure GPGPU, computational applications.

savage309 · ‎04-28-2015

Thanks, that is some helpful information, especially about the purpose of the scalar unit. I've read about it in the whitepaper, but your answer made it much clearer than the explanation there.

So, is it true that instruction scheduling can be a bit more flexible in GCN? For example, that it is not always SIMT (every thread has one lane of the vector unit), but that instructions can be scheduled over the lanes in a SIMD manner?

In other words, do you support the optional __attribute__((vec_type_hint(<type>)))?

tzachi_cohen · ‎04-29-2015

Hi ,

GCN GPUs are strict SIMT.

Decorating a kernel with '__attribute__((vec_type_hint(<type>)))' will not influence GPU compilation artifacts.

Since GCN is a scalar architecture, i.e. each thread can execute at most a single-component ALU operation per cycle, there is no point in trying to horizontally vectorize the code.

Tzachi

savage309 · ‎04-30-2015

Thank you so much.

Just one more question: if I have to evaluate a lot of transcendental functions over different data fields, should I write my own (using polynomials that are close to the real functions), given that I don't care much about the error?

In other words, is it true that only the scalar unit can do transcendental functions?

tzachi_cohen · ‎04-30-2015

It is the opposite: transcendental functions are only supported on the vector engine, not on the scalar engine.

They operate at quarter rate, hence if you want a faster approximation you need to do it in fewer than 4 single-precision operations.

Tzachi
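
The "fewer than 4 operations" budget can be sketched in plain C as follows. This is only an illustration under stated assumptions: `fast_sin_small`, `approximation_pays_off` and the cost constants are hypothetical names, the cost model simply treats a quarter-rate instruction as 4 full-rate issue slots, and the polynomial is the Taylor series of sin truncated after the cubic term, so it is only usable for small |x| (its error stays below about 1e-2 on [-1, 1]) and includes no range reduction.

```c
#include <assert.h>

/* Illustrative cost model: a full-rate op takes 1 issue slot,
 * a quarter-rate transcendental takes 4. */
enum { FULL_RATE_COST = 1, QUARTER_RATE_COST = 4 };

/*
 * Hypothetical 3-operation approximation of sin(x) for small |x|:
 * sin(x) ~= x * (1 - x*x/6), the Taylor series truncated after x^3.
 */
float fast_sin_small(float x)
{
    assert(x >= -1.0f && x <= 1.0f);       /* no range reduction here */
    float x2 = x * x;                      /* op 1: multiply */
    float t  = 1.0f - x2 * (1.0f / 6.0f);  /* op 2: multiply-add */
    return x * t;                          /* op 3: multiply */
}

/* Returns nonzero when a polynomial costing `ops` full-rate operations
 * is cheaper than one quarter-rate transcendental under this model. */
int approximation_pays_off(int ops)
{
    return ops * FULL_RATE_COST < QUARTER_RATE_COST;
}
```

Under this model a 3-operation polynomial wins and a 4-operation one only breaks even, which leaves very little room for accuracy or range reduction, so measuring on the target hardware is still the deciding step.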

nou · ‎04-30-2015

I would benchmark it. Nothing is better than hard data.

GCN SIMD vs SIMT