
[REPOST FROM GENERAL DISCUSSIONS] Help with using double-precision on a 3870 board.

I am reposting this for HScottH from General Discussions:

I'd like to do some GPGPU work with my Radeon 3870 board, but I'd prefer to use DirectX rather than the AMD SDK.

Is there a way to format input data (texture?) and output data (render target?) as FP64? How can that be done?

Another related issue... I have Vista 64 Ultimate, the latest Catalyst drivers (downloaded yesterday), DirectX 10.1 (also yesterday), yet the DirectX Caps Viewer shows my Pixel and Vertex shader version support as 3.0. Shouldn't it say 4.1?

Thanks much for any and all help 🙂
0 Likes
16 Replies

I suspect that since AMD-ATI is the first to come out with double-precision floating point support on GPUs, DirectX may not yet support double-precision floating operations. Of course, I do not have direct experience programming in DirectX myself, but a similar question was asked outside this forum about double-precision support in OpenGL and the answer I got from our engineers was that it required a rewrite/addition to the spec and compiler.

Brook+ and CAL in the AMD Stream SDK are going to be the fastest ways to get access to the double-precision ops on the 3870 and the FireStream 9170. The recently released version (and I apologize... it has not yet been emailed out to existing downloaders) supports double types in Brook+. Go to the download page on the AMD Stream website to download v1.0beta.

I, unfortunately, don't know the details of what version is supposed to be reported in the drivers for the various components.

Michael.
0 Likes

Hi Michael,

Thank you for both the re-post and the answer.

I read that in HLSL 3.0, the keyword "double" was removed, which seems an antithetical move given the trend in hardware.

Nonetheless, I further understand that DirectX 10.1 was chiefly motivated by AMD-ATI for this specific line of cards--hence, these are the only 10.1 cards on the market today.

But, after just a little research, I see that I can supply "untyped" buffers to the card and execute pixel shaders against them. All I need to do now is see if I can receive the data as streams of doubles. A little testing should give me the answers I need.
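
The "untyped buffer" idea can be illustrated on the host side. This is only a minimal sketch in Python, not DirectX code: it shows that one IEEE 754 double is bit-for-bit the same as two 32-bit words, which is why FP64 data can be written into a buffer of 32-bit elements. The function names are made up for illustration, and little-endian word order is an assumption. Whether the shader can then reinterpret those two words as a double is exactly the open question in this thread.

```python
import struct

def double_to_words(value):
    """Split one IEEE 754 double into two 32-bit words (low, high).

    Little-endian layout is assumed here for illustration.
    """
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return low, high

def words_to_double(low, high):
    """Reassemble a double from the two 32-bit words produced above."""
    return struct.unpack("<d", struct.pack("<II", low, high))[0]

if __name__ == "__main__":
    low, high = double_to_words(1.0)
    print(hex(low), hex(high))         # 0x0 0x3ff00000
    print(words_to_double(low, high))  # 1.0
```

The round trip is lossless because no numeric conversion happens, only a reinterpretation of the same 8 bytes.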

I downloaded your SDK yesterday and experimented with it, but it crashed my video card after running just one sample. I am using a single card for both Video output (in Vista 64) as well as GPGPU processing. If the card works well, I will buy a second and use it as a dedicated GPGPU board.

If you hear any more on this topic, please post back to this thread, as will I, so we can all keep up to date 🙂

Thanks again,
Scott
0 Likes

Keep us (the forum) informed on your experiment on using "untyped" buffers and receiving the stream as doubles.

As for the SDK, what were the symptoms of the crash?

One thing to try is updating to the latest Catalyst drivers and seeing if that fixes your problem.

Michael.
0 Likes

Hi Michael,

I'm not so sure it was the SDK responsible for the crash anymore. But, here's what I did:

* Vista 64 Ultimate box (built 2 weeks ago)
* SP1 installed, and all updates
* Downloaded and installed DirectX 10.1
* Downloaded and installed latest Catalyst driver (v8-something)
* Downloaded and installed your SDK
* Ran a couple samples

** Display went black. Windows said "Video driver stopped responding"

This happened several times, before finally getting a screen full of garbage.

But, I haven't touched it since and last night I found my computer in a similar state (garbage screen, locked up). I suspect the new driver has an issue now. Either that, or there is a heat problem in the GPU that just began two days ago? I doubt that, given how hard I've worked it since I bought it.

Oh, and by the way: No RAM or GPU overclocking. Stock parts. Stable until I downloaded and installed the above mentioned things.

I'll keep you posted on my findings 🙂

Scott
0 Likes

Actually, I thought of this when I read your original post and totally forgot to mention this.

The AMD Stream SDK is currently only supported on Win XP 32/64 not Vista. That may also be causing some amount of grief.

I'll note the fact that you are on Win Vista and see what we can do to get Vista support released.

By the way, keep me posted on your DirectX effort and GPGPU results. We are always interested in what our customers are doing with their cards and, for me in particular, what interesting GPGPU apps they are developing.

Michael.
0 Likes
bayoumi
Journeyman III

Hi Michael,
I read in this article:
http://www.bit-tech.net/hardwa..._ati_radeon_hd_3870/4
that the HD 3870 does full (double) precision at quarter speed. Is this true?
By the way, I have the feeling that some of the documentation in the SDK talks more about the older-generation (R600) architecture. I also could not find anything in the documentation that summarizes the RV670 the way the article linked above does.
Thanks
Amr
0 Likes

Hi Amr,

That's roughly correct.

At the lowest level, you can think of the stream processor as a collection of VLIW processing units. On the FireStream 9170, each of the processing units has 5 scalar processing units, each capable of doing integer and single precision floating point operations. One of the 5 is also capable of doing double precision floating point operations as well as transcendentals. When you issue a double-precision instruction, then it will be using the scalar processor that is capable of doing double precision and not the other scalar processors in the VLIW processing unit.

I believe the current R600 ISA documentation is all that we have available for public distribution. The main addition in the RV670, I believe, is double precision.

Michael.
0 Likes
HScottH
Journeyman III

Hi,

I researched the card extensively. I recall that double-precision should be between 1/4 and 1/2 the speed (125-250 GFlops maximum). I also read that internally, double-precision is emulated on 32-bit hardware, but also that the implementation has been verified to have the same accuracy as native FP64. This is a slightly different answer than Michael's, but more data points are good 🙂

As for my experimentation, I've not delved into FP64 yet, but managed to get single-precision working well with DirectX 9, using both C++ and C#. I'm thus far able to get about 2GB/s write-up AND read-back (using 64 MB textures), and maybe 100 GFlops (although that low number is the result of my poorly written shaders and texture dependencies--I am just experimenting, after all :-).

I'm off to play with DX 10.1. I am convinced that I can write the FP64 data to the card; my only question is whether I can read and use it as FP64. I should know soon.

Oh, and I'm about to order a 2nd 3870. Would I do better to connect the two cards, or leave them disconnected, one for video and one for GPGPU?

Cheers!

Scott
0 Likes

Hi Scott,

Expected peak DPFP performance on the FireStream 9170 is around 100+ GFLOPS (I don't have my laptop on at the moment so I can't give you exact numbers but it is between 100-110 GFLOPS I believe).

If by connecting the two cards you mean CrossFire, it is best to leave CrossFire off when using multiple cards for compute. We can address the cards separately. If you have enough going on on the graphics display side for one card, you'd be better off dedicating one for GPGPU computations to avoid contending with the bandwidth requirements of the graphics display. It is going to be somewhat app dependent.

Michael.
0 Likes
bayoumi
Journeyman III

Hi Michael, Scott
I hope I am not changing the topic of the thread, but I thought we were elaborating on double precision (let me know if you would prefer that I start a different thread).
If what Michael is saying about using only one out of five ALUs is true, does that mean we have only 320/5 = 64 ppe operating simultaneously in real-time parallelism?
Regards
Amr
0 Likes
bayoumi
Journeyman III

I would like to add to my previous question: does this mean we still run at full speed (~775MHz) for the ALU clock if we use only the 64 FP ALUs?
Thanks
Amr
0 Likes

Hi Amr (and Scott since I just realized I forgot to respond to something in your post),

The double-precision isn't emulated on 32-bit hardware. Rather, the DPFP calculations reuse the 32-bit SPFP circuitry to generate true DPFP pathways.

320 stream cores refer to the total number of individual scalar processors available in all of the VLIW thread processors in all of the SIMD arrays on the stream processor. All 320 stream cores are capable of doing single precision and integer ops. 64 stream cores are capable of double precision and transcendentals. Due to the nature of the VLIW stream processors and the SIMD arrays, how many of those stream cores will be active at the same time depends on the instruction mix.

The core clock for the stream processor will be the same regardless of what instruction type is being run.
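
The numbers in this thread can be put together as a back-of-the-envelope calculation. This is a minimal sketch, assuming each unit retires one multiply-add per clock (counted as 2 flops) and using the ~775 MHz clock mentioned above for the 3870. The function name `peak_gflops` is made up for illustration, and real throughput depends on the instruction mix, as described above.

```python
def peak_gflops(units, clock_mhz, flops_per_unit_per_clock=2):
    """Theoretical peak in GFLOPS.

    Assumes every unit issues one multiply-add (2 flops) per clock,
    which is an idealized upper bound, not a measured figure.
    """
    return units * flops_per_unit_per_clock * clock_mhz / 1000.0

STREAM_CORES = 320               # total scalar units, from the thread
CORES_PER_THREAD_PROCESSOR = 5   # X, Y, Z, W and TRANS
CLOCK_MHZ = 775                  # figure quoted in the thread for the 3870

dp_units = STREAM_CORES // CORES_PER_THREAD_PROCESSOR  # 64 DP ops per clock

sp_peak = peak_gflops(STREAM_CORES, CLOCK_MHZ)
dp_peak = peak_gflops(dp_units, CLOCK_MHZ)

if __name__ == "__main__":
    print(dp_units)                     # 64
    print(round(sp_peak, 1))            # 496.0
    print(round(dp_peak, 1))            # 99.2
    print(round(dp_peak / sp_peak, 2))  # 0.2
```

Under these assumptions the double-precision peak comes out at about one fifth of the single-precision peak, close to the "around 100+ GFLOPS" recalled earlier for the FireStream 9170.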

Michael.
0 Likes
HScottH
Journeyman III

Hi Michael,

Thank you for your answers. Your description of the DPFP is different than what I read. Interesting. I will probably do some DPFP work next weekend.

Thanks for the comments about adding another 3870 (or 9170). I suspected as much, and will try that configuration. My forte is physics simulations, and one card for FPU and one for GPU should give me plenty of bandwidth for both.

For what it's worth, I have a 3.0 GHz quad core Intel CPU, and have implemented a brute force Mandelbrot on both the CPU and the GPU (by 'brute force', I mean calculate every pixel, without any of the well known optimizations). I get about 1-2 frames per second on the CPU (using all four cores, without SSE), and about 25 on the GPU. What a joy to "hack" a solution yet still get such stunning performance 🙂
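
For readers unfamiliar with the term, a brute-force Mandelbrot looks roughly like the following. This is a minimal single-threaded sketch in Python, not the C++/C# or shader code described in this post. The function names, the iteration limit and the viewing window are all assumptions chosen for illustration.

```python
def mandelbrot_iterations(cx, cy, max_iter=256):
    """Iterate z = z*z + c from z = 0 and return the escape iteration.

    Returns max_iter if the point has not escaped (|z| > 2) by then.
    """
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter

def render(width, height, max_iter=256,
           x_min=-2.0, x_max=1.0, y_min=-1.0, y_max=1.0):
    """Compute every pixel independently, with no symmetry or
    periodicity shortcuts (the 'brute force' approach)."""
    rows = []
    for py in range(height):
        cy = y_min + (y_max - y_min) * py / max(height - 1, 1)
        row = []
        for px in range(width):
            cx = x_min + (x_max - x_min) * px / max(width - 1, 1)
            row.append(mandelbrot_iterations(cx, cy, max_iter))
        rows.append(row)
    return rows

if __name__ == "__main__":
    image = render(60, 20, max_iter=50)
    for row in image:
        print("".join("#" if n == 50 else "." for n in row))
```

Because each pixel depends only on its own coordinates, the inner function maps naturally onto one pixel shader invocation per pixel, which is why this workload parallelizes so well on a GPU.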

Another comment: your Shader Analyzer is awesome. I can crash it easily, cause it to lock up, and get it to compile shaders that contain illegal instructions... but all that notwithstanding, its predictions about throughput are always nearly spot-on. I'm impressed; this is a tool I can't live without now 🙂
0 Likes

Hi Scott,

🙂 One of our engineers corrected me on my post in a meeting today. 🙂

Okay... so, here is how the DPFP pathways work...

Each VLIW thread processor is composed of 5 stream cores (scalar processors). Let's label them X, Y, Z, W and TRANS (per the R600 ISA doc).

Each of these can actually be a separate single precision op or integer op.

The TRANS unit can also execute some transcendental functions.

The way a double-precision op is calculated on the thread processor is that the circuitry of the X, Y, Z and W units is connected together as a native hardware DPFP core, which as a whole performs the double-precision op. You could, if you chose, execute another integer operation or a transcendental on the TRANS unit.
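
The slot usage described here can be pictured with a toy model. This is only an illustrative sketch in Python: the function `pack_bundle`, the op labels and the greedy packing order are all assumptions, and it ignores everything a real shader compiler must consider (data dependencies, register ports, which ops each slot actually supports).

```python
SLOTS = ("X", "Y", "Z", "W", "TRANS")
DP_SLOTS = ("X", "Y", "Z", "W")

def pack_bundle(ops):
    """Greedily pack ops into one VLIW bundle (toy model).

    'dp' needs X, Y, Z and W together; 'trans' needs TRANS;
    any other op ('sp', 'int') takes the first free slot.
    Returns (bundle, leftover) where leftover did not fit.
    """
    bundle = {slot: None for slot in SLOTS}
    leftover = []
    for op in ops:
        if op == "dp":
            if all(bundle[s] is None for s in DP_SLOTS):
                for s in DP_SLOTS:
                    bundle[s] = "dp"
            else:
                leftover.append(op)
        elif op == "trans":
            if bundle["TRANS"] is None:
                bundle["TRANS"] = "trans"
            else:
                leftover.append(op)
        else:
            free = [s for s in SLOTS if bundle[s] is None]
            if free:
                bundle[free[0]] = op
            else:
                leftover.append(op)
    return bundle, leftover

if __name__ == "__main__":
    print(pack_bundle(["dp", "trans"]))
    print(pack_bundle(["sp"] * 5))
    print(pack_bundle(["dp", "dp"]))
```

In this model one bundle holds either five single-precision ops or one double-precision op plus one op on TRANS, which matches the one-DP-op-per-thread-processor-per-clock picture above.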

This yields the same DPFP performance as I wrote before but is now more faithful to what is actually in hardware. 🙂

My apologies for the mixup!

Michael.
0 Likes

BTW, I'm glad you were able to get a 25x speedup doing a "hack". 🙂 I look forward to seeing how you do with the final solution. 😉

And, I will share your feedback with the ShaderAnalyzer team. And I'll put you in touch with them and, if you don't mind, I'd like you to share with them your lock up and illegal cases so they can get those fixed up for you.

Michael.
0 Likes

Originally posted by: HScottH
Another comment: your Shader Analyzer is awesome. I can crash it easily, cause it to lock up, and get it to compile shaders that contain illegal instructions... but all that notwithstanding, its predictions about throughput are always nearly spot-on. I'm impressed; this is a tool I can't live without now 🙂

Hi HScottH,
Thanks. Glad you like GPU ShaderAnalyzer. I'm disappointed to hear that you are seeing crashes with GSA. Would you be able to forward the shaders that are causing the crashes so that I can investigate? You can email me or send me a private message if you prefer.

Cheers,
Seth.
0 Likes