Archives Discussions

krrishnarraj · ‎07-17-2011

I am new to OpenCL; I am used to CUDA and NVIDIA GPUs.

(Excuse me for using CUDA terms here.)

I thought a warp (32 threads) goes to the 8 SPs in an SM, 4 threads to each SP.

I was going through online examples given by AMD: http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study_7.aspx

It says that using vectors in OpenCL increases throughput on the GPU. Does that mean 1 thread goes to 1 SP instead of 4 threads?

Can someone explain how this improves performance at the hardware level?

Thanks

nou · ‎07-17-2011

AMD GPUs use a VLIW4/VLIW5 architecture, where one processing unit can execute up to 4 or 5 operations at once for a single work-item. So when you add two float4 vectors, the four component adds go out in one VLIW instruction. NVIDIA must execute four instructions.

Also, reading a float4 vector from memory is more efficient than reading four separate float values.
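To make the difference concrete, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled on the host. The `vec4` struct is a hypothetical stand-in for OpenCL's `float4`, and `add_scalar` / `add_vec4` are illustrative names, not taken from the AMD article. Each loop iteration plays the role of one work-item.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4; plain C has no built-in vector type. */
typedef struct { float x, y, z, w; } vec4;

/* Scalar version: each iteration (think: one work-item) performs a single add,
 * so there is nothing independent to fill the other VLIW slots with. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* 4-wide version: each iteration performs four adds that do not depend on
 * each other, which is what lets a VLIW compiler pack them into one bundle.
 * Each operand is also one contiguous 128-bit value. */
void add_vec4(const vec4 *a, const vec4 *b, vec4 *out, size_t n4) {
    for (size_t i = 0; i < n4; ++i) {
        out[i].x = a[i].x + b[i].x;
        out[i].y = a[i].y + b[i].y;
        out[i].z = a[i].z + b[i].z;
        out[i].w = a[i].w + b[i].w;
    }
}
```

In an actual OpenCL kernel the body of `add_vec4` would collapse to a single `float4` addition; whether the compiler really packs the four adds into one bundle depends on the device and compiler.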

krrishnarraj · ‎07-17-2011

Thanks for the info. That means using vectors is a must for complete utilization.

Sadly, you can't use them everywhere. That's why they are dropping VLIW4 in the next architecture, GCN.

LeeHowes · ‎07-17-2011

Not a must. Loop unrolling achieves the same thing. At that level it's a VLIW architecture, not a vector one (at the larger scale it is a vector architecture, like NVIDIA's, of course), so the aim is to increase ILP. Vectors do that by creating four instructions at a time instead of one, and loop unrolling does too. What vectors also add is the ability to define 128-bit memory reads in each lane of the SIMD unit. This helps the memory system reach peak throughput, because 16 lanes issuing 128-bit reads is the unit of data the memory system is designed to stream from DRAM.
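As a rough illustration of the loop-unrolling point, here is a sketch in plain C (not OpenCL C) with hypothetical function names. The operation `out[i] = a[i] * s + b[i]` is just an example workload chosen for this sketch. The unrolled version exposes four independent operations per iteration, the same ILP a `float4` expression would create, without using a vector type.

```c
#include <stddef.h>

/* Baseline: one multiply-add per iteration, so little ILP for a VLIW
 * compiler to exploit within the loop body. */
void scale_add(const float *a, const float *b, float s, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * s + b[i];
}

/* Unrolled by four: the four statements in the main loop do not depend on
 * each other, so they can be scheduled into the same VLIW bundle. */
void scale_add_unrolled4(const float *a, const float *b, float s,
                         float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     * s + b[i];
        out[i + 1] = a[i + 1] * s + b[i + 1];
        out[i + 2] = a[i + 2] * s + b[i + 2];
        out[i + 3] = a[i + 3] * s + b[i + 3];
    }
    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; ++i)
        out[i] = a[i] * s + b[i];
}
```

Both functions compute the same result. Unrolling alone does not give the 128-bit per-lane memory reads described above; that part comes from the vector types.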

How does vector type increase throughput in gpu?