
Meteorhead
Challenger

Future HW and SDK

Questions about upcoming tech

Hi, I have opened this topic as a place for everyone to post questions about the latest upcoming HW and SDK capabilities and properties.

0 Likes
46 Replies

I hate the whole "DP is not important for consumer use" argument.

What if Intel/AMD took this to heart and chopped the 80-bit x87 FPU down to 32 bits to make a consumer-level CPU...

The fact is that most software uses the x87 instructions to perform all calculations and only casts back to double for storage.

Excel uses doubles; would people be happy if Microsoft released a consumer version that only used single precision?

All numbers in JavaScript are doubles. Why so, if single precision is all consumers need?

The fact is that, other than multimedia, games and video compression, every other computation the average PC user does is in double precision.

If OpenCL wants to move out of these areas and into general computing, then double precision is a requirement, not an optional add-on.

 

0 Likes

moozoo,
We have had double precision on our high-end consumer cards for the last 4 generations. The problem is not whether it is important for consumer use or not, but the hardware size/cost trade-off. A very small chip gets less double precision (or none at all) compared to the larger chips. A low-end chip with the same single-precision performance plus double precision would cost more, use more power and produce more heat, for double performance that isn't much better than the CPU's.

Most of your examples are software, where doubles can be done in software whether the chip supports them natively or not. The same can be said of OpenCL: someone can write a double precision library. A better comparison is to extra features like Hyper-Threading or trusted platform modules, which only exist on certain high-end/enthusiast/server parts but not on the 'consumer' parts.
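(As an illustration of the software route: a double can be emulated with a pair of floats, the classic "double-single" trick. A minimal sketch of the addition in OpenCL C; dsfloat and ds_add are made-up names, and a real library would also have to stop the compiler from contracting the error terms away:)

    // Represent one "double" as a pair of floats: value == hi + lo.
    typedef float2 dsfloat;

    // Add two double-single values (Knuth/Dekker two-sum, no FMA assumed).
    dsfloat ds_add(dsfloat a, dsfloat b)
    {
        float s = a.x + b.x;                    // approximate sum of high parts
        float v = s - a.x;
        float e = (a.x - (s - v)) + (b.x - v);  // exact rounding error of s
        e += a.y + b.y;                         // fold in the low parts
        float hi = s + e;                       // renormalize into (hi, lo)
        float lo = e - (hi - s);
        return (dsfloat)(hi, lo);
    }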

I'm not saying we shouldn't go down that path, as it would make my life easier to be able to develop double code on a laptop, but we do not view DP as professional/HPC-only.


0 Likes

Originally posted by: MicahVillmow  ...but we do not view DP as professional/HPC-only.


Thanks Micah.

I guess I'm concerned that high DP performance will be reserved for HPC products, as Nvidia has done. There is a huge price gap between the highest-end Nvidia consumer graphics card and the cheapest Tesla.

I fully accept that you (AMD) should try to differentiate your HPC parts. But I feel this should be on the basis of reliability, ECC, thermal design, suitability for packed blade use, fast and detailed support, and additional driver features (InfiniBand performance), etc.

 

0 Likes

I agree as well. Making a dual-GPU, double-size ECC VRAM, strictly front-to-back cooled HPC card, with a proper driver (Xorg-independent), fit for close packing (meaning the cooler is 4 mm thinner than a double-width cooling solution), would indeed be welcome and worth the extra money.

0 Likes
Meteorhead
Challenger

I cannot seem to find the old forum's wishlist topic for new SDK features, so let me post one here (and if the old topic still exists, feel free to move this post there).

I have a feature request that I think would be most useful to people. NV's UVA is a really awesome feature of CUDA; since it will most likely never make it into OpenCL (or only years from now), I am not that interested in it as such, but I was thinking about how something similar could be made possible in OpenCL.

First, a quick question: how exactly is clEnqueueCopyBuffer implemented? Does it stage through pinned RAM, or does it copy straight from device to device without CPU intervention? If it is the latter, that is really decent and on par with NV's GPUDirect.
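(For concreteness, the call in question is host-side only; nothing in its signature says how the data travels, so staging through pinned RAM versus a direct device-to-device transfer is the runtime's choice. A minimal sketch, with ctx and queue assumed to be created elsewhere:)

    #include <CL/cl.h>

    /* Copy between two buffers in one context; the queue picks the device
       that executes the copy. All names here are placeholders. */
    void copy_buffers(cl_context ctx, cl_command_queue queue, size_t size)
    {
        cl_int err;
        cl_mem src = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  size, NULL, &err);
        cl_mem dst = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, size, NULL, &err);

        /* No host pointer appears at the API level; whether the runtime
           stages through (pinned) host RAM or goes peer-to-peer is an
           implementation detail the spec leaves open. */
        err = clEnqueueCopyBuffer(queue, src, dst,
                                  0, 0,      /* src_offset, dst_offset */
                                  size,      /* bytes to copy */
                                  0, NULL, NULL);
        clFinish(queue);

        clReleaseMemObject(src);
        clReleaseMemObject(dst);
    }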

Secondly, since AMD is really moving towards Fusion (which, sadly, is ultimately a sign of discrete graphics disappearing, even though a quad-socket Fusion rack server will never deliver the computing power of 4 dedicated graphics cards, because of the cooling), I was thinking of an alternate solution. AMD has recently released Leo, its partially resident texture demo, and that gave me the idea:

Could partially resident buffers be implemented in OpenCL? Although the technology mainly aims at read-only textures, it would really rock if this approach could be used for GPGPU. One server with 256 GB of RAM surpasses any dedicated VRAM available in a system and would allow really neat simulations to run, if the data streaming could be implemented efficiently. Since this technique is already used in OpenGL, which (AFAIK) also uses IL and ISA, not much would have to change under the hood (I would expect); only a different interface would have to be implemented. This would be something similar to UVA, but would allow much larger buffers. I know that things get complicated when the GPUs not only read but also write this buffer, but cache coherency with the CPU already exists, so why couldn't this be done? Or is it really not possible to get better performance than using host pointers on devices?
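(For comparison, the host-pointer baseline mentioned in that last sentence is just this; a minimal sketch assuming standard OpenCL 1.1, with ctx and big_size as placeholder names:)

    #include <CL/cl.h>
    #include <stdlib.h>

    /* Back a buffer with host RAM so it can exceed VRAM capacity. */
    void make_host_backed_buffer(cl_context ctx, size_t big_size)
    {
        void  *host_mem = malloc(big_size);   /* ideally page-aligned */
        cl_int err;
        cl_mem big = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    big_size, host_mem, &err);
        /* Kernels now reach the data over PCIe: capacity is no longer
           the limit, bandwidth is. Release before freeing host_mem. */
        clReleaseMemObject(big);
        free(host_mem);
    }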

0 Likes

Well, I think partially resident buffers can be implemented with the current API just fine. Imagine creating a HUGE buffer. From this buffer you create sub-buffers and use them in kernel calls. It is just a matter of runtime implementation. Just add something like clCreateSubImage().
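(For buffers this much already exists: clCreateSubBuffer carves a region out of a parent buffer. A minimal sketch, with huge, kernel, chunk and chunk_bytes as placeholder names:)

    #include <CL/cl.h>

    /* Carve one chunk out of a huge buffer and bind it to a kernel argument. */
    void bind_chunk(cl_mem huge, cl_kernel kernel,
                    size_t chunk, size_t chunk_bytes)
    {
        cl_buffer_region region;
        region.origin = chunk * chunk_bytes;  /* must honour
                                                 CL_DEVICE_MEM_BASE_ADDR_ALIGN */
        region.size   = chunk_bytes;

        cl_int err;
        cl_mem sub = clCreateSubBuffer(huge, CL_MEM_READ_WRITE,
                                       CL_BUFFER_CREATE_TYPE_REGION,
                                       &region, &err);
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &sub);
        /* release sub only after the kernel has finished using it */
    }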

0 Likes

It is not completely identical, because in that situation the programmer has to keep track of which part of the image is modified by which device. That is not possible when the location of the read is decided at runtime inside the kernel. Partially resident textures reduce VRAM to a cache: the implementation streams into VRAM the part of the texture that is needed, and hopes that in the next frame (iteration) the same side of the model (system) will be loaded on the given device, so that next time the data is already present in the cache. But if the orientation of the model is decided inside the shader (kernel), I really wouldn't want to pass this back to the host just to be able to load the appropriate sub-image into VRAM.

This is where a vendor extension would come in handy. Hope I was clear. Aside from that, I will consider your idea, because it might be sufficient in my case, but it might turn out that it's just not flexible enough. (In my simulation I have moving borders inside the system, which would correspond to moving the borders of the sub-images around. If I can do this efficiently by not recreating the sub-images, but by doing clEnqueueCopyImage (or something like it) and copying just the updated parts of the image, then it will be OK. But I'll have to think about whether that works or not.)
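(For the copy itself, the standard entry point is clEnqueueCopyImage, which already takes per-copy origins and a region; a minimal sketch, all names placeholders, copying only the rows that moved between two 2D images:)

    #include <CL/cl.h>

    /* Copy only the strip of rows that changed, not the whole image. */
    void copy_dirty_rows(cl_command_queue queue,
                         cl_mem src_img, cl_mem dst_img,
                         size_t width, size_t moved_row, size_t moved_rows)
    {
        size_t src_origin[3] = {0, moved_row, 0};  /* x, y, z in source */
        size_t dst_origin[3] = {0, moved_row, 0};  /* same spot in dest */
        size_t region[3]     = {width, moved_rows, 1}; /* depth 1 for 2D */

        clEnqueueCopyImage(queue, src_img, dst_img,
                           src_origin, dst_origin, region,
                           0, NULL, NULL);
    }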

0 Likes