
settle
Challenger

Future HW and SDK

Originally posted by: LeeHowes It might; however, you have to remember that if you take a VLIW-5 packet and flatten it, you get 5 issue slots in time instead of space. That's 5 (instruction) cycles' worth of latency hiding 🙂

In current AMD GPUs each SIMD unit has 4 ALUs (plus possibly 1 SFU, depending on the model). I still can't understand how work-items, vector types, etc. get mapped to the ALUs in AMD GPUs (and CPUs).

  1. Does AMD APP SDK perform implicit vectorization for the GPU? How about the CPU? If not, are there any plans to provide it in the near future?
  2. How are VLIW-4 (or VLIW-5) packets formed: from 4 independent operations within a single work-item, or from 4 independent operations among 4 contiguous work-items? What happens in a kernel that doesn't have 4 independent operations but only a single operation, like the fma or mad in a scalar-float saxpy (one float saxpy per work-item; see the sketch just after this list)? Will only 1/4 of the ALUs be utilized?
  3. How are current VLIW-4 packets executed within a SIMD: using all 4 ALUs at once (issue slots in space), or using 1 of the 4 ALUs over several cycles (issue slots in time)? Or do I have that reversed?
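
For concreteness, here is the scalar-saxpy case from question 2 as a minimal OpenCL C sketch (the kernel name and signature are my own illustration, not from any SDK sample). Each work-item performs a single multiply-add, so there is only one ALU operation available per work-item:

    __kernel void saxpy_scalar(const float a,
                               __global const float *x,
                               __global float *y)
    {
        size_t i = get_global_id(0);
        /* One mad per work-item: nothing else independent to pack
           alongside it into a VLIW bundle. */
        y[i] = mad(a, x[i], y[i]);
    }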

 

I guess I'm looking for a simple and clear statement (I do scientific computing but don't have a formal CS background) from AMD similar to the following from "Writing Optimal OpenCL Code with Intel OpenCL SDK" in section 2.5 Benefiting from Implicit Vectorization:

"Vectorization module transforms scalar operations on adjacent work-items into an
equivalent vector operation. When vector operations already exist in the kernel source
code, they are scalarized (broken down into component operations) and re-vectored."

 

Thanks for your help clarifying these issues for me.

himanshu_gautam
Grandmaster

Future HW and SDK

Question: Does AMD APP SDK perform implicit vectorization for the GPU? How about the CPU? If not, are there any plans to provide it in the near future?

Answer: AMD APP SDK packs 4/5 independent instructions into a VLIW4/VLIW5 bundle if it is able to find them. On the CPU, vectorization is done in a similar way.

 

Question: How are VLIW-4 (or VLIW-5) packets formed, from 4 independent operations within a single work-item or 4 independent operations among 4 contiguous work-items?

Answer: 4 independent instructions within a work-item.

Question: What happens in a kernel that doesn't have 4 independent operations but only a single operation, like the fma or mad in a scalar-float saxpy (one float saxpy per work-item)? Will only 1/4 of the ALUs be utilized?

Answer: Yes.
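
Given that answer, the usual workaround is to hand-vectorize so that each work-item carries several independent operations. A minimal sketch of a float4 saxpy (my own illustration, assuming the element count is a multiple of 4 so the buffers can be treated as float4 arrays):

    __kernel void saxpy_float4(const float a,
                               __global const float4 *x,
                               __global float4 *y)
    {
        size_t i = get_global_id(0);
        /* The four component multiply-adds are independent, so the
           compiler can pack them into the slots of one VLIW bundle. */
        y[i] = mad((float4)(a), x[i], y[i]);
    }

This is launched over N/4 work-items instead of N; how reliably the compiler fills the remaining slots still depends on the SDK version.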

Question: How are current VLIW-4 packets executed within a SIMD, using all 4 ALUs at once (issue slots in space), or using 1 of the 4 ALUs over several cycles (issue slots in time)?  Or do I have that reversed?

Answer: Instructions inside VLIW4/5 packets are executed simultaneously on a SIMD. VLIW packets cannot be created from multiple work-items; all instructions in a VLIW packet must come from the same work-item.

 

settle
Challenger

Future HW and SDK

Himanshu,

Nice answers, thank you!

Meteorhead
Challenger

Future HW and SDK

Does anybody know anything official about NGC? There was supposed to be a press release on Dec. 5th in London, but there's absolutely no news from it.

It would be nice to know whether all those neat numbers on the web are actually true, or whether some troll just published them as "leaked" and then everybody copied them so as not to "fall behind".

Meteorhead
Challenger

Re: Future HW and SDK

I cannot seem to find the wish-list topic for new SDK features from the old forum, so let me post one here (if the old topic still exists, feel free to move this post there).

I have a feature request that I think would be most useful to people. NV's UVA is a really awesome feature; although it will most likely never make it into OpenCL (or only years from now) and I am not that interested in it as such, it is a compelling feature of CUDA, and I was thinking about how something like it could be possible in OpenCL.

First, a quick question: how exactly is clEnqueueCopyBuffer implemented? Does it use pinned RAM, or does it copy straight from device to device without CPU intervention? Because if it's the latter, that is really decent and on par with NV's GPUDirect.
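
For reference, this is the call in question, as a minimal host-side sketch (queue, src, dst and bytes are placeholder names for illustration). Whether the runtime stages the transfer through pinned host memory or routes it directly between devices is implementation-defined, which is exactly what's being asked:

    /* Copy `bytes` from src to dst; both buffers belong to the
       same context, and the routing is up to the runtime. */
    cl_int err = clEnqueueCopyBuffer(queue, src, dst,
                                     0,     /* src_offset */
                                     0,     /* dst_offset */
                                     bytes, /* size */
                                     0, NULL, NULL);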

Secondly, since AMD is really moving towards Fusion (which, sadly, is ultimately a sign of discrete graphics disappearing, because even a quad-socket Fusion rack server will never match the computing power of 4 dedicated graphics cards, for cooling reasons), I was thinking of an alternate solution. AMD has recently released the Partially Resident Textures demo, Leo, and that gave me the idea:

Could partially resident buffers be implemented in OpenCL? Although the technology mainly aims at read-only textures, it would really rock if this approach could be used for GPGPU. One server with 256 GB of RAM surpasses any dedicated VRAM available in a system and would allow really neat simulations to run, if the data streaming could be implemented efficiently.

Since this technique is already used in OpenGL, which also builds on IL and ISA (AFAIK), not much underneath would have to change (I would expect); only a different interface would have to be implemented. This would be something similar to UVA, but it would allow much larger buffers. I know things get complicated when the GPUs not only read but also write such a buffer, but cache coherency with the CPU is already solved, so why couldn't this be done? Or is it really not possible to get better performance than using host pointers on devices?

nou
Exemplar

Re: Future HW and SDK

Well, I think partially resident buffers can be implemented with the current API just fine. Imagine creating a HUGE buffer; from this buffer you create sub-buffers and use them in kernel calls. It is just a matter of runtime implementation. Just add something like clCreateSubImage().
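
For buffers this pattern exists already via clCreateSubBuffer; here is a minimal sketch (huge_buf, offset and region_bytes are placeholder names), with the proposed clCreateSubImage() being the image-side analogue:

    /* Carve a region out of a large parent buffer; a runtime could,
       in principle, keep only the regions that kernels actually use
       resident in VRAM. */
    cl_buffer_region region = { offset, region_bytes };
    cl_int err;
    cl_mem sub = clCreateSubBuffer(huge_buf, CL_MEM_READ_WRITE,
                                   CL_BUFFER_CREATE_TYPE_REGION,
                                   &region, &err);

Note that the region origin has to respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment.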

Meteorhead
Challenger

Re: Future HW and SDK

It is not completely identical, because in that scheme the programmer has to keep track of which part of the image is modified by which device. That is not possible when the location of the read is decided at runtime inside the kernel. Partially resident textures reduce VRAM to a cache: the implementation streams into VRAM the part of the texture that is needed, and hopes that in the next frame (iteration) the same side of the model (system) will be loaded on the given device, so that next time the data is already present in the cache. But if the orientation of the model were decided inside the shader (kernel), I really wouldn't want to pass this back to the host just to be able to load the appropriate sub-image into VRAM.

This is where a vendor extension would come in handy. I hope I was clear. Aside from that, I will consider your idea, because it might be sufficient in my case, but it might turn out that it's just not flexible enough. (In my simulation I have moving borders inside the system, which would correspond to moving the borders of the sub-images around. If I can do this efficiently, not by recreating the sub-images but by doing clEnqueueCopyImageRect (or something like that) and copying just the updated parts of the image, then it will be OK. But I'll have to think about whether that works or not.)
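
For what it's worth, the existing entry point for rectangular image-to-image copies is clEnqueueCopyImage, which takes per-image origins and a region; a minimal sketch (origin and extent values x, y, w, h are placeholders) of copying just an updated sub-rectangle between two 2D images:

    size_t src_origin[3] = { x, y, 0 };  /* where the updated block starts */
    size_t dst_origin[3] = { x, y, 0 };
    size_t region[3]     = { w, h, 1 };  /* width, height, depth = 1 for 2D */
    /* Copies only the w-by-h rectangle; the two images must have a
       compatible channel order and data type. */
    cl_int err = clEnqueueCopyImage(queue, src_img, dst_img,
                                    src_origin, dst_origin, region,
                                    0, NULL, NULL);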
