08-30-2011
10:57 AM

Vector types, VLIW, and wavefront size

Suppose a standard saxpy kernel using float returns 64 from a call to clGetKernelWorkGroupInfo(..., CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, ...). If one takes the same kernel and just changes float to float4, would the same query return 16 or 64?

Also, does a float4 get striped across the 4 or 5 lanes (processing elements) of each of the 16 SIMD units in a compute unit, so that f.x, f.y, f.z, and f.w execute concurrently? Or does a float4 get stacked onto a single lane, so that f.x, f.y, f.z, and f.w execute sequentially, with different float4s executing similarly on the other lanes?

And if the wavefront size is 64, how can VLIW5 ever be fully utilized? VLIW4 times four cycles yields 64, but VLIW5 times four cycles yields 80. I'm really confused about how this works. It makes me think VLIW5 should have a wavefront size of 80, not 64.

6 Replies


08-30-2011
11:05 AM

The width of the VLIW unit is the number of ALU instructions that can be executed in parallel by a single work-item (i.e., software thread). The wavefront size is the number of work-items that execute in parallel in a single hardware thread.


08-30-2011
08:38 PM


08-31-2011
07:22 PM


08-31-2011
07:43 PM

Originally posted by: fpaboim

I'm no expert, but I believe the 5 and 4 after "VLIW" stand for the number of "math units": Cypress had 4 AUs (arithmetic units) + 1 SFU (special function unit), while Cayman has 4 AUs which can be rearranged, when an SFU is needed, to work as 1 AU + 1 SFU (3 AUs are grouped to serve as an SFU). I believe that SFUs in general just weren't being used that much (in relation to AUs, that is; technically you still have one SFU per stream processor), so they preferred to do this and get a little more space to pack in more compute units. Since branching is handled separately in the stream processor, you could say the stream processor itself is the "ALU", with 4 AUs + 1 SFU (in the Cypress case) + 1 LU (logical unit). Since we have 16 stream processors per compute unit, the VLIW width is 16. 4 work-items can be pipelined, so that's why AMD recommends a wavefront size of 64.

The recommended wavefront size is 64 because there are 16 stream processors per compute unit and each stream processor interleaves 4 threads over 4 clock cycles.

Each stream processor has 4 ALUs. This implies that code that has little instruction parallelism will use little of the available performance. In the worst case, you only use 1/4 of the available ALUs every clock cycle even with 64 threads. This is why float4s are highly recommended if you can get by with them, because they trivially expose independent calculations (in addition to higher bandwidth from global memory and textures).
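The arithmetic in this reply can be written out as a small C sketch. The figures 16 and 4 come from the post above; the function names and the `packed_slots` parameter (how many VLIW slots the compiler manages to fill in each instruction bundle) are illustrative assumptions, not part of any AMD or OpenCL API.

```c
#include <assert.h>

/* Work-items per wavefront: stream processors per compute unit times the
   number of clock cycles over which each one interleaves work-items.
   The VLIW width does not appear here. */
static int wavefront_size(int stream_processors, int interleave_cycles)
{
    return stream_processors * interleave_cycles;
}

/* Fraction of ALU slots doing useful work when `packed_slots` of the
   `vliw_width` slots in each instruction bundle are filled (hypothetical
   model; real packing is decided by the compiler). */
static double alu_utilization(int packed_slots, int vliw_width)
{
    assert(vliw_width > 0);
    assert(packed_slots >= 0 && packed_slots <= vliw_width);
    return (double)packed_slots / (double)vliw_width;
}
```

Under these assumptions, `wavefront_size(16, 4)` gives 64, `alu_utilization(1, 4)` gives 0.25 for code with no instruction parallelism, and `alu_utilization(4, 4)` gives 1.0 when four independent operations (for example the components of a float4) fill every slot.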


09-01-2011
09:39 AM

Originally posted by: rick.weber

Originally posted by: fpaboim

Cypress had 4 AUs (arithmetic units) + 1 SFU (special function unit), while Cayman has 4 AUs which can be rearranged, when an SFU is needed, to work as 1 AU + 1 SFU (3 AUs are grouped to serve as an SFU). I believe that SFUs in general just weren't being used that much (in relation to AUs, that is; technically you still have one SFU per stream processor), so they preferred to do this and get a little more space to pack in more compute units.

The recommended wavefront size is 64 because there are 16 stream processors per compute unit and each stream processor interleaves 4 threads over 4 clock cycles.

Each stream processor has 4 ALUs. This implies that code that has little instruction parallelism will use little of the available performance. In the worst case, you only use 1/4 of the available ALUs every clock cycle even with 64 threads. This is why float4s are highly recommended if you can get by with them, because they trivially expose independent calculations (in addition to higher bandwidth from global memory and textures).

I think my misunderstanding originates from thinking the SPU in Cypress (VLIW5) could also execute add, sub, mul, etc., which would seem to work best on a float5. However, if the SPU can only execute functions like sin, cos, exp, etc., then I understand why it's float4. Can you confirm that the SPU in Cypress (VLIW5) cannot execute add, sub, mul, etc.?

Are there any cases where VLIW5 (4 ALUs + 1 SPU) could be fully used back to back over 4 cycles given that you said the wavefront size is 64? Also, when you say interleave 4 threads over 4 cycles, what exactly is a thread (work-item?) and is that like a waterfall or in lock step?


09-01-2011
10:02 AM

Proper usage of the SPU in Cypress does not depend on the wavefront size.

To fully utilize the VLIW5 of Cypress, you need enough instruction-level parallelism inside your kernel code. To be clear, only a single kernel instance runs on a VLIW5 unit.

I am not sure about the capabilities of the 5th processing unit in VLIW5, though.
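To illustrate what instruction-level parallelism inside a kernel means for the saxpy case discussed in this thread, here is a minimal CPU-side C sketch. It is not OpenCL code: `float4_t` is a hypothetical stand-in for OpenCL's float4, and the function names are made up for illustration. Whether the four operations actually land in one VLIW bundle is up to the compiler.

```c
/* Hypothetical stand-in for OpenCL's float4, for CPU-side illustration. */
typedef struct { float x, y, z, w; } float4_t;

/* Scalar saxpy body: one multiply-add per work-item, so a single
   work-item offers only one operation to fill the VLIW slots. */
static float saxpy_scalar(float a, float x, float y)
{
    return a * x + y;
}

/* float4 saxpy body: four multiply-adds with no dependencies between
   them, so one work-item offers four independent operations. */
static float4_t saxpy_float4(float a, float4_t x, float4_t y)
{
    float4_t r;
    r.x = a * x.x + y.x;
    r.y = a * x.y + y.y;
    r.z = a * x.z + y.z;
    r.w = a * x.w + y.w;
    return r;
}
```

Both versions compute the same values; the float4 version simply hands a single kernel instance four operations that do not depend on each other.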