adrianr
Journeyman III

Questions about inconsistencies in OpenCL Programming Guide

I have written a fairly complex algorithm in OpenCL that runs very well on NVIDIA cards, and I am starting to look at making it run more efficiently on AMD cards, specifically a Radeon HD 6320 (Zacate E-450). I have a few questions about the hardware design as described in the programming guide, which seems to use its terminology very inconsistently:

1. In Appendix D, devices are specified in terms of Processing Elements and Stream Cores (in my case presumably 80 PEs and 16 SCs). However, Section 1.3 describes devices in terms of Processing Elements and ALUs, and says that each work-item maps onto a processing element, with no mention of Stream Cores. Figure 1.6, however, says "Scheduler maps work item onto Stream Core k", with an arrow pointing to Processing Element k. It reads as if Stream Cores and Processing Elements are being used to describe different parts of the hardware in different sections of the guide.

2. The HD 6320 is missing from Appendix D. Several Zacate GPUs are listed, but not the E-450. Also, the programming guide talks about Southern Islands, Evergreen and Northern Islands devices, but I can't find anything in that document or elsewhere that specifies which family my card belongs to (Evergreen?).

3. Some sentences simply don't make sense. In particular:

"Processing elements, in turn, contain numerous processing elements" in Section 1.3

"Each of these arrays executes a single instruction across each lane for each of a block of 16 work-items" also in Section 1.3

I would like to confirm:

1. What does a single work-item run on: a stream core, a processing element or an ALU? Related to that, which of these corresponds to a CUDA core?

2. What is the relationship between an ALU and a Processing Element?

3. How many threads are there in the wavefront of the HD 6320?

Thanks!

1 Solution
nou
Exemplar

IMHO the E-450 has the same GPU as the E-350, just at higher clocks, at least according to the wiki.

Evergreen GPUs have a VLIW5 architecture. Your GPU contains two compute units, each consisting of 16 stream cores. Each stream core is made up of x, y, z, w and t ALUs. The x, y, z and w units can perform simple instructions such as addition, subtraction and multiplication; the t unit can perform complex instructions such as divide, cos and sin. On VLIW4 devices (Cayman, in the Northern Islands family) there is no t unit; complex instructions are executed by combining two or four of the x, y, z, w units.

A work-item from OpenCL is mapped to a 5- or 4-wide stream core. To use a stream core efficiently your code must be vectorized, since it executes 5 or 4 instructions in parallel. The compiler performs this vectorization by looking for independent scalar instructions in the program flow and scheduling them into one VLIW instruction. The new GCN-based 7xxx GPUs don't need vectorized code, and they execute 4 wavefronts at the same time on one compute unit.

A wavefront consists of 64 work-items and is executed over four cycles, 16 work-items per cycle, to hide memory latency, so it takes 4 cycles to execute one instruction on a wavefront. Low-end GPUs like Zacate have a half-size wavefront of 32. The wavefront size is the preferred work-group size multiple reported by OpenCL.
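To make the packing idea concrete, here is a minimal Python sketch of a greedy packer that groups independent scalar operations into VLIW bundles. It is only an illustration of the principle, not how AMD's compiler actually schedules code: the function name `pack_vliw` and the operation names are hypothetical, and the sketch ignores which slot (x, y, z, w or t) an operation is allowed to use.

```python
def pack_vliw(ops, width=5):
    """Greedily pack scalar ops into VLIW bundles of at most `width` ops.

    ops: list of (name, set_of_names_it_depends_on), in program order.
    An op may join a bundle only if everything it depends on was issued
    in an EARLIER bundle (ops in the same bundle run in parallel).
    """
    done = set()              # ops issued in previous bundles
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for name, deps in remaining:
            if len(bundle) < width and deps <= done:
                bundle.append(name)
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        done.update(bundle)
        remaining = [(n, d) for n, d in remaining if n not in done]
        bundles.append(bundle)
    return bundles


# Four independent component-wise operations (e.g. one float4 add)
# fit into a single bundle.
vector_add = [("r.x", set()), ("r.y", set()), ("r.z", set()), ("r.w", set())]
print(pack_vliw(vector_add))   # [['r.x', 'r.y', 'r.z', 'r.w']]

# A chain of dependent scalar operations cannot be packed at all:
# each bundle carries one op and the other slots stay empty.
chain = [("t1", set()), ("t2", {"t1"}), ("t3", {"t2"})]
print(pack_vliw(chain))        # [['t1'], ['t2'], ['t3']]
```

In this toy model the dependent chain needs three bundles for three operations, which is why code with plenty of independent work per work-item tends to fill the stream core better.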

1. It runs on a stream core on VLIW architectures, and on a processing element or ALU on GCN.

2. They are the same thing in terms of the Appendix D table.

3. The wavefront size is 32, but most AMD GPUs have a 64-wide wavefront, so target that work-group size. In terms of independent control flow and instruction stream, though, a wavefront is one thread: a single instruction is executed across the whole wavefront, and work-items can't take different execution paths. To get around this, both branches of the control flow are executed serially, and the computation is masked out for the work-items in the wavefront that did not take that branch.
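The masking described in point 3 can be sketched as a small Python model. This is a simplified emulation, not actual hardware behaviour: the function name `run_divergent_branch` is made up for the example, and the assumption that a branch is skipped entirely when no work-item in the wavefront takes it is part of the model.

```python
def run_divergent_branch(values, cond, then_fn, else_fn):
    """Emulate one wavefront executing `if cond: then_fn else: else_fn`.

    Every work-item sees the same instruction stream. Each branch is
    run as one pass over the whole wavefront, and a per-work-item mask
    decides whose results are kept.
    Returns (results, number_of_passes_executed).
    """
    mask = [cond(v) for v in values]
    out = list(values)
    passes = 0
    if any(mask):                      # at least one work-item takes "then"
        passes += 1
        out = [then_fn(v) if m else o for v, m, o in zip(values, mask, out)]
    if not all(mask):                  # at least one work-item takes "else"
        passes += 1
        out = [else_fn(v) if not m else o for v, m, o in zip(values, mask, out)]
    return out, passes


wavefront = list(range(64))

# Divergent: even and odd work-items take different paths, so both
# branches are executed one after the other.
_, passes = run_divergent_branch(wavefront, lambda v: v % 2 == 0,
                                 lambda v: v * 10, lambda v: -v)
print(passes)   # 2

# Uniform: every work-item takes the same path, so only one pass runs.
_, passes = run_divergent_branch(wavefront, lambda v: v >= 0,
                                 lambda v: v * 10, lambda v: -v)
print(passes)   # 1
```

The practical takeaway of the model is that a branch costs extra only when work-items within the same wavefront disagree on the condition.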

nou, thanks for the helpful explanation.

After reading your reply and the AMD programming guide, I have the following picture of a Northern Islands GPU in my mind (I will look into Southern Islands once the Northern Islands architecture is clear to me). Please correct me if it's wrong.

- One Northern Islands GPU has multiple compute units. Each compute unit comprises multiple (16? or does it differ between Northern Islands GPUs?) processing elements. Each processing element has 4 ALUs (4-way VLIW) with a branch execution unit and GPRs. One work-item runs on all 4 ALUs of a single processing element for 4 cycles. Right? Or does one work-item run on 1 ALU of a single processing element for 4 cycles?

In the programming guide (Appendix D), it seems that processing elements are called stream cores and ALUs are called processing elements. Is that so?

- Are the 4 ALUs the same, i.e. can they execute the same types of operations? (I'm not talking about complex instructions that take 2 or 3 ALUs.)

- Can the 4 ALUs in a PE execute different instructions at the same time?

- Does one particular (type of?) ALU in each of the 16 PEs work on different data for the same instruction (16 ALUs), while a second (type of?) ALU in the 16 PEs works on different data for another instruction, and likewise for the other 2 ALUs? Or do all 4 ALUs in all 16 PEs in one CU work on different data for the same instruction?

- Relating the wavefront concept to this is a bit tricky for me. It's mentioned everywhere that a wavefront consists of 64 work-items on the new Northern Islands GPUs. How do 16 work-items running a single instruction for 4 cycles on 16 PEs become 64 work-items, especially when they are working on 16 different data items for the same instruction (not 64) for 4 cycles? 'That instruction is repeated over four cycles to make the 64-element vector called a wavefront'. Does "64-element vector" mean 16 PEs * 4 ALUs = 64? If that's the case, then the 64 (single instruction, multiple data) operations would finish after 4 cycles and not in a single cycle, so 64 still doesn't make sense to me.

Please clarify. Thanks!

PLEASE SEE REPLY IN BOLD

Thanks for the helpful explanation.

After reading your reply and the AMD programming guide, I have the following picture of a Northern Islands GPU in my mind (I will look into Southern Islands once the Northern Islands architecture is clear to me). Please correct me if it's wrong.

- One Northern Islands GPU has multiple compute units. Each compute unit comprises multiple (16? (IT IS 16 EXCEPT FOR CEDAR WHERE IT IS 8) or does it differ between Northern Islands GPUs?) processing elements. Each processing element has 4 ALUs (4-way VLIW) with a branch execution unit and GPRs. One work-item runs on all 4 ALUs of a single processing element for 4 cycles. Right? (YES, RIGHT) Or does one work-item run on 1 ALU of a single processing element for 4 cycles?

In the programming guide (Appendix D), it seems that processing elements are called stream cores and ALUs are called processing elements. Is that so? (THANKS FOR POINTING THIS OUT. WE WILL FIX THIS)

- Are the 4 ALUs the same, i.e. can they execute the same types of operations? (I'm not talking about complex instructions that take 2 or 3 ALUs.) (YES)

- Can the 4 ALUs in a PE execute different instructions at the same time? (NO, ALL MUST EXECUTE THE SAME INSTRUCTION. THAT IS WHY VECTOR TYPES LIKE FLOAT4 ARE RECOMMENDED)

- Does one particular (type of?) ALU in each of the 16 PEs work on different data for the same instruction (16 ALUs), while a second (type of?) ALU in the 16 PEs works on different data for another instruction, and likewise for the other 2 ALUs? Or do all 4 ALUs in all 16 PEs in one CU work on different data for the same instruction? (ALL 4 ALUS IN EVERY PROCESSING ELEMENT EXECUTE THE SAME INSTRUCTION AT A TIME. THAT IS TO SAY, 64 THREADS, OR A WAVEFRONT, EXECUTE THE SAME INSTRUCTION TOGETHER)

- Relating the wavefront concept to this is a bit tricky for me. It's mentioned everywhere that a wavefront consists of 64 work-items on the new Northern Islands GPUs. How do 16 work-items running a single instruction for 4 cycles on 16 PEs become 64 work-items, especially when they are working on 16 different data items for the same instruction (not 64) for 4 cycles? 'That instruction is repeated over four cycles to make the 64-element vector called a wavefront'. Does "64-element vector" mean 16 PEs * 4 ALUs = 64? If that's the case, then the 64 (single instruction, multiple data) operations would finish after 4 cycles and not in a single cycle, so 64 still doesn't make sense to me.

(ON A FINER LEVEL, YOU SHOULD LEARN ABOUT THE QUARTER-WAVEFRONT. PHYSICALLY, ONLY A QUARTER-WAVEFRONT EXECUTES AT A TIME, AND 4 SUCH QUARTER-WAVEFRONTS ARE GROUPED TOGETHER AND EXECUTED IN LOCKSTEP TO FORM A WAVEFRONT.)
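The quarter-wavefront arithmetic can be written out as a short Python sketch. It assumes, purely for illustration, that work-items are assigned to quarters in contiguous ID order; the function name `quarter_wavefront_schedule` is hypothetical. The lane counts (16, or 8 on Cedar) are the ones given in this thread.

```python
def quarter_wavefront_schedule(lanes=16, cycles=4):
    """Return, for each cycle, the work-item IDs issued in that cycle.

    One instruction is issued to `lanes` work-items per cycle and
    repeated for `cycles` cycles, so the wavefront size is
    lanes * cycles work-items, not lanes * ALUs-per-lane.
    """
    return [list(range(c * lanes, (c + 1) * lanes)) for c in range(cycles)]


schedule = quarter_wavefront_schedule(lanes=16)
print(sum(len(q) for q in schedule))   # 64
print(schedule[1][:4])                 # [16, 17, 18, 19]

# With 8 lanes per compute unit (Cedar, per the reply above) the same
# four cycles cover a 32-wide wavefront.
print(sum(len(q) for q in quarter_wavefront_schedule(lanes=8)))   # 32
```

Under this model the 64 counts work-items (16 lanes times 4 cycles); the 4 or 5 ALUs inside each processing element all serve the same work-item and do not add to the count.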

As you confirmed, on Northern Islands or Evergreen GPUs a single work-item uses all 4 or 5 ALUs of a single PE in a compute unit, so that comes to 16 work-items per compute unit and a maximum of 64 instructions (executing 1 instruction on different data).

In Southern Islands, we have 4 16-lane SIMDs, so does that mean we now have 64 work-items (rather than 16 as in NI or Evergreen)?

Ifrah Saeed wrote:

In Southern Islands, we have 4 16-lane SIMDs, so does that mean we now have 64 work-items (rather than 16 as in NI or Evergreen)?

Yes, 4 wavefronts execute simultaneously on a compute unit in Southern Islands. Remember that at any point in time only a quarter-wavefront of each one runs, so a total of 64 (4 x 16) work-items from 4 different wavefronts run together.
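As a rough Python sketch of that count: the model below assumes each of the 4 SIMDs works on its own wavefront and that all SIMDs step through quarters together, which is a simplification made for illustration only. The function name `gcn_cu_cycle` is hypothetical.

```python
def gcn_cu_cycle(cycle, simds=4, lanes=16, wavefront_size=64):
    """Return the (wavefront index, work-item IDs) pairs active in one cycle.

    Simplified model: SIMD i executes wavefront i, one quarter-wavefront
    (`lanes` work-items) per cycle.
    """
    quarters = wavefront_size // lanes
    q = cycle % quarters
    return [(simd, list(range(q * lanes, (q + 1) * lanes)))
            for simd in range(simds)]


active = gcn_cu_cycle(cycle=0)
print(len(active))                          # 4 wavefronts in flight
print(sum(len(ids) for _, ids in active))   # 64 work-items in this cycle
```

Each wavefront still needs 4 cycles per instruction in this model; the compute unit simply has four of them in progress at once.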
