
debdatta_basu
Journeyman III

Raytracing on AMD hardware

Divergent wavefronts issue, and related discussions

I have two questions regarding the ATI implementation of OpenCL and its impact on raytracing.

Question 1:

I was going through the OpenCL programming guide and spotted one issue where ATI hardware is significantly different from Nvidia. Each stream core is actually a 4-wide SIMD unit, and there are overall fewer cores.

This would mean that ATI hardware is more SIMD-like than Nvidia's, where the stream cores are not SIMD and there is a larger number of cores.

This would mean a bigger penalty for divergent warps and non-vector instructions on ATI. This is obvious, as the warp (wavefront) size is 64 on ATI and 32 on Nvidia.

In raytracing, where calculations are highly divergent, this would mean a significant slowdown on ATI.

Is there some way to optimize for this on ATI hardware?

 

Question 2:

Nvidia has a limitation that each work group is scheduled on the same streaming multiprocessor until all of its work is done. This calls for some sort of producer/consumer queue, implemented using atomic operations, to ensure that all warps (wavefronts) within a work group stay active for the entire life of the work group. This is detailed in the paper:

"understanding the efficiency of ray traversal on Gpus"

Does this apply to ATI cards?

 

Looking forward to your responses,

Debdatta Basu

0 Likes
13 Replies
n0thing
Journeyman III

Shouldn't the divergence of rays depend on the particular scene? For example, rays will be far more divergent in a scene with rough objects than in one with smoothly varying surfaces, since the normals change rapidly within a small neighbourhood. In that case a smaller SIMD width would be better.

As on Nvidia, all wavefronts of a work-group execute on a single SIMD unit on ATI cards.

0 Likes
malcolm3141
Journeyman III

Hi,

 

AMD stream processors have 16 (or 8, or 4) stream cores with 5 units each. This is not SIMD; it is in fact best described as VLIW. The compiler schedules (mostly) independent instructions down any of the five pipes in parallel; in other words, it extracts instruction-level parallelism from the code.
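For example, in a trivial kernel like this (just an illustration I made up), the four multiplies do not depend on each other, so the compiler is free to pack them into a single VLIW bundle:

__kernel void dot4(__global const float4 *a,
                   __global const float4 *b,
                   __global float *out)
{
    size_t i = get_global_id(0);
    float4 p = a[i];
    float4 q = b[i];
    float x = p.x * q.x;          // these four multiplies are independent...
    float y = p.y * q.y;
    float z = p.z * q.z;
    float w = p.w * q.w;
    out[i] = (x + y) + (z + w);   // ...the additions then form a short dependent chain
}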

I do not know a huge amount of detail about nVidia's branching behaviour, but AMD branches hurt quite a bit. A CF (control flow) clause may consume an extra 40-50 clocks, and it acts as a barrier for ILP, so instructions may be scheduled less efficiently. However, there is ample support in the hardware for predication, and the AMD OpenCL compiler does an adequate job of converting if statements to predicated sequences.

Bear in mind that the ATI hardware has a peak performance of over 2x that of the nVidia hardware, so with careful design and programming you can end up significantly faster on ATI...

Regarding question 2, workgroups never move around the GPU (ATI or nVidia). An algorithm can be implemented that keeps a minimum number of workgroups (matching the number of hardware threads) alive for the entirety of the calculation, but usually it is best to write workgroups as intended and let the hardware determine the thread dispatching. The hardware can then do its best to hide memory latency, which is a huge speed issue.

Global atomics are very slow (>1000 clocks I think, can anyone confirm?). I have not seen any published work describing the use of ATI hardware in that way, but I expect it would work just as it does on nVidia to solve the same problems.

 

Malcolm

0 Likes

Thanks for all the responses, and thanks for clearing up the work-group issue. And thanks for tolerating the Nvidia references.

@priyadrshi: Divergence of rays is the whole point of raytracing; in most cases with GI, the number of coherent rays is really small compared to the number of divergent ones.

@malcolm......

>>in other words extracts instruction level parallelism from the code.

I thought so too, given that the manual says VLIW. But then, a few pages later, the manual also says:

>> "....All stream cores within a compute unit execute the same instruction for each cycle........  To hide latencies due to memory accesses and processing element operations, up to four workitems from the same wavefront are pipelined on the same stream core....."

Seems very much like SIMD to me.

 

Further, the manual says that branching is done using masking: the instruction executes on all threads in a wavefront, after which the disabled threads are masked out. From the manual:

>> "The wavefront mask is set true for lanes (elements/items) in which x is true, then execute A. The mask then is inverted, and B is executed."

Another classic SIMD characteristic. Correct me if I'm missing something.

These "features" exist for Nv as well, but at a much lower granularity. (32 instead of 64)

And about the producer/consumer queue: it gives a significant speedup on Nvidia in cases where the warps within a work group run for very different lengths of time, as is the case with raytracing.

Debdatta Basu.

 

 

0 Likes

Ray tracing does tend to diverge, but the same problem exists on both nvidia and AMD hardware. AMD hardware will suffer more from identically divergent rays, but once you play the same packetisation tricks that stop nvidia's hardware from performing abysmally, you will see similar improvements on AMD hardware.

I don't see any reason to think that the same techniques wouldn't apply to both architectures, possibly with slightly lower efficiency (from a much higher throughput starting point) on AMD hardware.

 

>> "....All stream cores within a compute unit execute the same instruction for each cycle........  To hide latencies due to memory accesses and processing element operations, up to four workitems from the same wavefront are pipelined on the same stream core....."

Seems very much like SIMD to me.



Yes, it is SIMD, in the same way that nvidia's hardware is SIMD. It is a 64-wide SIMD vector (wavefront); NV has a 32-wide SIMD vector (warp). NV can in some cases dual-issue (or possibly triple-issue on the GF104; their "superscalar" scheduling is slightly unclear to me at the moment), while AMD hardware uses fixed 5-wide VLIW packets to multi-issue instructions, with the packets generated by the compiler to reduce hardware scheduling overhead. It is a 64-wide SIMD issue of 5-wide VLIW instructions.

 

0 Likes

AMD's documentation is often a little confusing, but they do seem to be working on it.

I think some people have referred to the GPU execution model as SPMD - Single Program Multiple Data. This refers to the fact that the same program is executed over a small block of data - in this case the thread group (wavefront, hardware thread, ... unfortunately there is no standard name for this). This model is used (at a high level) by both AMD and nVidia.

Regarding the quote from the manual, the stream cores being referred to are the 5-way VLIW cores, I believe. There are 16 of these (in the 5870) in each SP (stream processor), and each of the 16 cores executes the exact same instruction bundle (xyzt) at the same time. ATI GPUs schedule a 'quad' (2x2 threads) in a pipelined fashion on each stream core, so the minimum hardware thread size (workgroup, wavefront...) on the 5870 is 16 x 4 = 64 threads.

Branching is handled by a separate processor per SP. This processor decides (I believe) whether the entire hardware thread (64 threads) follows the same path. If it doesn't, then both paths are executed sequentially, and predication (masking) is used to resolve the branch.

Some more detail... ATI GPUs group instructions into clauses (take a look at the disassembled output from the Stream Kernel Analyser), with instructions executing at full speed within a clause, and a cost for switching between clauses (perhaps 40-50 clocks is what I have heard). Branching is performed at the clause level, and as such has an 'extra' cost on top of the obvious cost of serially scheduling both execution paths. For example, consider a branch whose paths are 20 clocks long each; with divergence this would take 20 + 20 + ~40 = 80 clocks, which is 4x longer than might be expected.

You can make use of conditional moves within an ALU clause to perform your own predication, and this avoids the clause overhead. AMD's OpenCL compiler does this automatically for small if/else statements, and often for ?: operators as well.
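For instance, a made-up kernel like the one below normally compiles to a conditional move inside a single ALU clause, with no CF clause and no divergence penalty:

__kernel void cmov_demo(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    float x = in[i];
    // Both operands are evaluated, then the result is chosen per lane with a
    // conditional move rather than a branch.
    out[i] = (x > 0.5f) ? x * x : 1.0f - x;
}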

 

Hope this helps you understand the architecture...

Malcolm

0 Likes

Yeah.... got it... Thanks a lot.. I will keep you posted. 

 

0 Likes

Hi,

I'm new to this thread but let me try to revive it:

I read through it, but one thing (most likely the main point) eluded me: the controversy of SIMD vs. VLIW, if I'm not mistaken. Let me make a few quotations:

Debdatta Basu -

>> "....All stream cores within a compute unit execute the same instruction for each cycle........  To hide latencies due to memory accesses and processing element operations, up to four workitems from the same wavefront are pipelined on the same stream core....."

Seems very much like SIMD to me.

 

Malcolm -

AMD stream processors have 16 (or 8, or 4) stream cores with 5 units each. This is not SIMD; it is in fact best described as VLIW. The compiler schedules (mostly) independent instructions down any of the five pipes in parallel; in other words, it extracts instruction-level parallelism from the code.

 

These two statements seem somewhat contradictory to me. Four instances of a kernel (naturally with different IDs) are co-issued to the same Stream Core to hide read/write latencies. This is a minimum number and has nothing to do with the fact that the vectors are 4 wide. (Starting from the 69XX series, even the Stream Core will be 4 wide.) Malcolm states that instruction-level parallelism is used. This is on a per-kernel-instance basis, I presume; instructions from different instances of the same kernel are not paralleled on a Stream Core in one cycle. This instruction-level parallelism might lead someone to think that, as long as 4 instances of a kernel are in lock-step and have only scalar operations, they could be executed in one cycle on the same Stream Core (since they are co-issued to the same physical processor, one might think this is not impossible).

So can anyone clear this up for me: what is the difference between the definitions of VLIW and SIMD? And how far should one go in vectorizing code? Does vectorization via instruction-level parallelism happen only inside a kernel instance, or across a wavefront?

0 Likes

>>  4 instances of a kernel (naturally with different IDs) are co-issued to the same Stream Core to hide read/write latencies....

Afaik, one stream core services just one kernel instance. Each stream core has 4 FP units and a special-function unit.

 

>>The compiler schedules totally independant (mostly) instructions down any of the five pipes in parallel, in other words extracts instruction level parallelism from the code....

This refers to independent instructions within a single instance of a kernel; read: superscalar execution of each kernel instance. The difference between superscalar execution and SIMD is that superscalar units are capable of executing different instructions on each unit, as opposed to SIMD, which proceeds in lock-step.

Though in theory this sounds great, in practice it is hardly the case: in most kernels, simply vectorizing your code will improve performance substantially, in some cases around 3 to 4 times.

This is pretty much a SIMD characteristic, and mostly happens because the compilers are new and do not do a very good job of reordering instructions, unrolling loops, and so on. This may change in the near future, but until it does, you will get better mileage from just assuming the hardware is SIMD and packing your code into float4s or similar.
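Just to illustrate what I mean by packing into float4 (a made-up example): each work-item does four multiply-adds instead of one, which hands the compiler full VLIW bundles to issue.

__kernel void saxpy4(__global const float4 *x,
                     __global const float4 *y,
                     __global float4 *out,
                     float alpha)
{
    size_t i = get_global_id(0);
    out[i] = alpha * x[i] + y[i];   // four multiply-adds per work-item
}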

 

On the bright side, the 5870 has ~1.5x-2x the peak floating-point throughput, which means that if you pay careful attention to what you are coding, you will get better performance than on the 480s.

The only problem is that not all code is vectorizable; in those cases, we should see an improvement once the compilers have had a little more time to mature.

0 Likes

Ok, thank you for the explanation, now I understand.

Let me ask the following, though (mostly targeted at someone from AMD, but anyone should feel free to answer): how hard would it be to allow the compiler (and the hardware) to pack independent instructions into a VLIW bundle from different instances of the same kernel?

In most cases there are a lot more kernel instances running than there are physical processors in a Compute Unit, and the greatest difference between programming NV and AMD cards is that (at the moment) it takes great effort to vectorize code to make it efficient on AMD cards. It is a lot easier (and more CPU-like) to program with scalar operations. I would have no problem if, as debdatta has said, the compilers were a little more advanced.

I do not want to dismiss the work of those writing the compilers here, because they most likely do their utmost, and their work has come a long way considering how young the AMD OpenCL compiler is and how efficiently it compiles for both CPU and GPU. My question is aimed at the following: wouldn't it give an extra boost to compiler efficiency to allow VLIW words to be built from different instances of a kernel? Most kernels are written in such a manner that they spend 50-80% of their time in lock-step. That time could be vectorized perfectly using my idea.

I realize this might be impossible due to the "wiring" of the shaders in the hardware, but it could be worth considering if the idea is not prematurely killed by a hardware limitation. It seems mostly to be a compiler issue, although it does stretch into hardware architecture.

0 Likes

Someone suggested that you can place the whole kernel body inside a static for loop:

for(int i=0;i<4;i++)
{
//code of kernel
}

and you can get better utilization, as the compiler will unroll simple loops like this.
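Roughly like this (a made-up example; each work-item then covers four consecutive elements, so the indexing and the launch size have to change accordingly):

__kernel void scale4(__global const float *in, __global float *out, float k)
{
    size_t base = get_global_id(0) * 4;     // each work-item now owns 4 elements
    for (int i = 0; i < 4; i++)             // static trip count, so the compiler can unroll it
    {
        out[base + i] = k * in[base + i];   // the unrolled iterations are independent, so they can pack into VLIW
    }
}

(launched with a global size of N/4 instead of N)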

0 Likes

This is a real backstab from in front

Anyhow, this is a very messy way of reaching the goal, although it might work. I do not know what the compiler would do with a static loop that has dynamic loops inside, or loops whose bounds are runtime constants. The question also arises: if I have dynamic loops inside (which are clearly not VLIW friendly) and I place barriers before and after the loop, will the compiler understand what my aim is and behave accordingly?

Plus, this method implies that if one writes indexers to access data, smart enough to handle global and local IDs, group sizes, periodic boundaries and all of that, then adding another level of depth to the indexing (namely, one kernel instance is really 4 threads, and the outermost static loop index is the new unique identifier inside a Stream Core) makes programming even more complex.

It is difficult, and I do not feel assured that it will do what it's supposed to. I will try it though, as soon as I have some spare time.

0 Likes

@Meteor....

Regarding your earlier post, VLIW packing is a hardware-level question. In CPUs, the hardware dynamically packs instructions onto the execution units; on GPUs, however, this is done statically by the compiler. (Can someone confirm this?)

Issuing instructions from independent threads on the VLIW units would mean that the thread schedulers would have to handle a lot more units. This is what Nvidia does, and it's not called VLIW anymore.

Debdatta Basu.

0 Likes

I would be very surprised if that many shaders could fit onto a die alongside complex VLIW schedulers working at runtime. Most likely this is done at compile time, but some reassurance would be nice indeed.

The thread schedulers would not have significantly more work with my idea. Stream Cores would not run independent threads at once; rather, the compiler would recognise that if a kernel is written in such a way that it can run in lock-step with other instances (no dynamic-length work, for example) for a given stretch, it could recompile it into a totally different kernel, in somewhat the same way as nou has suggested. It would not be 4 different threads, but 1 thread containing the code of 4 different ones. It could even break kernels down into smaller ones; what do I care? Much of this parallelization could be done offline, so the hardware would not have to be aware of it.

Something similar is done when compiling for CPUs. You launch 1k threads on a dual-core processor in workgroups of 128, say; you look at it in taskmgr, and it's a 24-thread program. That has very little to do with what I said, but it does the same kind of thing. I wouldn't care how small the kernels my program gets split into, if the compiler could do a decent job of packing VLIW instructions, even from different kernel instances.

I will test nou's suggestion, but it's a very inelegant way of doing it, and clearly a pure AMD optimization; if one were writing cross-vendor code, NV people would laugh at seeing a kernel inside a static for loop, with indexers that become even more complex because of it. It is a temporary solution until the compilers do a better job at VLIW packing.

It would be nice if an AMD employee would comment on this topic. Some constructive criticism or some insight into the inner workings of the compiler would be welcome.

0 Likes