I have written a fairly complex algorithm in OpenCL that runs very well on NVIDIA cards, and I am now looking at making it run more efficiently on AMD cards, specifically a Radeon HD 6320 (Zacate E-450). I have a few questions about the hardware design as described in the programming guide, because its terminology seems very inconsistent:
1. In Appendix D, devices are specified in terms of Processing Elements and Stream Cores (in my case presumably 80 PEs and 16 SCs). However, Section 1.3 describes devices in terms of Processing Elements and ALUs, and says that each work-item maps onto a Processing Element, with no mention of Stream Cores. Figure 1.6, however, says "Scheduler maps work item onto Stream Core k", with an arrow pointing to Processing Element k. It reads as if "Stream Core" and "Processing Element" refer to different parts of the hardware in different sections of the guide.
2. The HD 6320 is missing from Appendix D: several Zacate GPUs are listed, but not the E-450. Also, the programming guide talks about Southern Islands, Evergreen and Northern Islands devices, but I can't find anything in that document or elsewhere that says which family my card belongs to (Evergreen?).
3. Some sentences simply don't make sense. In particular:
"Processing elements, in turn, contain numerous processing elements" in Section 1.3
"Each of these arrays executes a single instruction across each lane for each of a block of 16 work-items" also in Section 1.3
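To make the numbers in point 1 concrete, here is the arithmetic I am working from, as a small sketch. The 80 and 16 figures are my assumption for the HD 6320 (it is not listed in Appendix D), and the function names are just ones I made up for illustration:

```c
#include <stdio.h>

/* Processing elements per stream core, derived from the two counts that
   Appendix D gives per device. Returns 0 if the counts do not divide
   evenly or if stream_cores is 0. */
unsigned pes_per_stream_core(unsigned processing_elements, unsigned stream_cores)
{
    if (stream_cores == 0 || processing_elements % stream_cores != 0)
        return 0;
    return processing_elements / stream_cores;
}

/* Print the ratio for the figures I am ASSUMING for the HD 6320:
   80 processing elements and 16 stream cores. */
void print_hd6320_guess(void)
{
    printf("PEs per stream core: %u\n", pes_per_stream_core(80, 16));
}
```

With my assumed figures this gives 5 processing elements per stream core. If that ratio is right, a work-item mapping onto a stream core and a work-item mapping onto a processing element cannot both be true, which is why the wording in Section 1.3 and Figure 1.6 confuses me.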
I would like to confirm:
1. What does a single work-item run on: a Stream Core, a Processing Element or an ALU? Relatedly, which of these corresponds to a CUDA core?
2. What is the relationship between an ALU and a Processing Element?
3. How many threads (work-items) are there in a wavefront on the HD 6320?