
pwvdendr
Adept II

Building a desktop for scientific GPU computations in OpenCL -- some questions.

I'm planning to build a new desktop (total budget max. ±4000 euros) for my research as a PhD student. I'm leaning towards an AMD build, but I still have some questions:

  • Is there anything I should pay attention to when it comes to the motherboard and cooling? These are two things I don't have much experience with. I'm currently thinking of 3-4 HD7970s (rather than 2 HD7990s, due to the longer waiting time and expected cooling/noise problems), but I'm not sure whether that will fit everywhere, and since I can afford water cooling, I have no idea whether my noise/heat fears about fans blowing inside the case are justified at all. I plan to put the case next to my table and chair, so minimizing noise is important.
  • Currently, when doing OpenCL computations on the GPU, my screen freezes and I can't do anything else (not even browsing/chatting/typing). What would be the best remedy for this? Should I buy a low-end nVidia card, attach my screen to it and run OpenCL on the AMD platform? And would that give performance problems when running applications that use the GPU, such as games? Or should I disable Windows hardware acceleration altogether? Or what is the best solution for this?
  • When it comes to the CPU, is it relevant whether the CPU is also from AMD? I'm not sure if there are any interactions that I should keep in mind. Same question for the motherboard: is there any benefit in an all-AMD setup, or can I just consider all these parts as independent?

Thanks in advance!

0 Likes
28 Replies
notyou
Adept III

1) I wouldn't worry about it provided you have a few case fans blowing air in/out to help regulate the temperature (just in case). But from what I've read in reviews, the 7970 already runs very quiet and cool compared to most other solutions.

2) I'm not 100% sure, but I believe this is a limitation in how the driver works. From what I know, the only solution is to make sure you don't let your kernel run more than a second or two (to minimize the freeze time) and instead use more enqueued kernels which gives control back to the CPU between kernel enqueues. You could also disable the watchdog timer, but this is more of a workaround to prevent the driver from timing out instead of making the machine usable while the OpenCL program is running. Others running Linux might be able to help out more with this one (but IIRC, running an Nvidia GPU as main + AMD GPUs for compute has its own problems).
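
To make that splitting concrete, here is a minimal host-side sketch (not from this thread; the kernel name, argument index and chunk size are made up) of breaking one long job into many short enqueues so that no single launch comes near the ~2 second watchdog limit:

#include <CL/cl.h>

/* Run "process_chunk" over total_items work-items in slices of chunk_items.
 * Waiting for each slice before enqueueing the next mirrors the "give control
 * back between enqueues" advice above, and keeps every launch short. */
cl_int run_in_chunks(cl_command_queue queue, cl_kernel process_chunk,
                     size_t total_items, size_t chunk_items)
{
    cl_int err = CL_SUCCESS;
    for (size_t offset = 0; offset < total_items && err == CL_SUCCESS;
         offset += chunk_items) {
        size_t remaining = total_items - offset;
        size_t global = remaining < chunk_items ? remaining : chunk_items;

        /* Tell the kernel which slice this launch covers (argument index 1
         * is purely illustrative). */
        cl_ulong first = offset;
        err = clSetKernelArg(process_chunk, 1, sizeof(first), &first);
        if (err != CL_SUCCESS) break;

        err = clEnqueueNDRangeKernel(queue, process_chunk, 1, NULL,
                                     &global, NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS) break;

        /* Block until this short slice is done before enqueueing the next. */
        err = clFinish(queue);
    }
    return err;
}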

3) Nope. I'm currently running an Intel + AMD system without any problems. The only possible benefit to an all AMD setup would be in the APU area (but then again, it's technically not possible to have an Intel+AMD APU) because of the on-chip features for faster communication. But this doesn't matter if you're using discrete GPUs.

0 Likes

"Others running Linux might be able to help out more with this one"

--> I should have mentioned that I plan to run Windows, sorry (Win7, 64-bit).

Unless there is a serious reason to switch to linux for this?

In any case, I intend to use the computer as my normal work computer, which means I should be able to continue working while the computations are running.

0 Likes
ED1980
Adept II

For running 3-4 HD7970s properly, an LGA 2011 platform is desirable, because on AM3+ you only get 8 PCI-e 2.0 lanes per GPU, which is not enough... sometimes even 16x PCI-e 2.0 limits the performance.

0 Likes

ED1980 wrote:

For running 3-4 HD7970s properly, an LGA 2011 platform is desirable, because on AM3+ you only get 8 PCI-e 2.0 lanes per GPU, which is not enough... sometimes even 16x PCI-e 2.0 limits the performance.

Would the LGA2011 platform matter much? I'd consider getting a Core i7-2700K since that has HD3000 graphics (which might be useful since I should attach my monitors to the motherboard, to prevent the freezing issue I mentioned). I can't find any LGA 2011-compatible CPUs with HD3000 graphics.

So, would it matter a lot in speed, LGA 2011 vs LGA 1155? Or should I wait until Intel ships LGA 2011-compatible CPUs with better graphics?

0 Likes

I would recommend connecting the display to a mainstream GPU and keeping the compute GPUs headless.

If you do not connect a display to a GPU, the Windows watchdog will not initiate the TDR (Timeout Detection and Recovery) protocol when a kernel executes for more than two seconds.

0 Likes

That is very interesting. Do you have a source to confirm this? Also, would it matter if the mainstream GPU is also AMD or not (i.e. same platform or not)? And if I use apps that use GPU (such as games), will they utilize the mainstream GPU only or also the compute GPUs?

0 Likes

I would highly recommend not mixing vendors, not because it wouldn't work, but because many useful features (like CL-GL interop) are more likely to work if everything is from the same vendor. Apps that use the GPU (such as games) will use the one that has the monitor hooked up to it (really, that is the answer to your question). Best would be to have the IGP render the desktop (and not waste PCI slots or lanes), plus some games if you wish to play, and have the compute GPUs standalone with no graphics task assigned to them.

If you wish to play games while calculating... well, then I think it would be more of a user desktop than a workstation. You should know, however, that it is virtually impossible to play games and calculate on the same GPU (because compute kernels have higher priority than display kernels).

I think that simple applications that are limited by PCI-E bandwidth are just garbage, especially stuff that is not highly interactive (and by highly I mean < 5 ms latency). There must be really serious stuff going on under the hood if 4 GB/s is not enough.

0 Likes

Apps that use the GPU (such as games) will use the one that has the monitor hooked up to it (really, that is the answer to your question). Best would be to have the IGP render the desktop (and not waste PCI slots or lanes), plus some games if you wish to play, and have the compute GPUs standalone with no graphics task assigned to them.

So, "the one" that has the monitor hooked up to it, even if there are multiple monitors? And does it matter for this answer whether or not CrossFireX is enabled? I have no idea what the precise impact is on OpenCL and regular graphical apps, nor do I know whether CrossFireX only works for games or if there are benefits for OpenCL as well.

If you wish to play games while calculating... well, then I think it would be more of a user desktop than a workstation. You should know, however, that it is virtually impossible to play games and calculate on the same GPU (because compute kernels have higher priority than display kernels).

No no, not playing games while calculating. 🙂

I was just wondering if it would be impossible to play games at all on such a machine, e.g. because the games can only find the cheap card I link the monitor to, rather than the big cards which would be compute-only. By the way, when I say games I mean any app using GPU, like video transcoding, photoshop, ...

But if I understand correctly: using a cheap card A and high-performance cards B, C, D (all 4 in a CrossFireX setup), I can tell OpenCL to compute only on B, C, D so that A is free for rendering my desktop; and when not using OpenCL I can use them all for gaming/photoshop/other GPU apps. Right?

0 Likes

For OpenCL, you should forget about CrossFireX; it's best if you don't even connect the cards with a CF bridge (simply because crossfired cards will not be visible to OpenCL). Windows should detect all GPUs hooked up to it, even if no monitor is attached.

The part about games... in many games you can choose the display adapter. I do not know what happens when you have multiple ones and the monitor is hooked up to one or the other; I don't know if the image can be calculated on one and then displayed on the other (I have never had a machine with both an IGP and a dGPU), so for this part I cannot help you. If you connect the monitor to one of your dGPUs and use that for the desktop (or game rendering), it will fall into the '2 seconds' category, which you will most likely wish to avoid. That's why the IGP is a better solution, plus the desktop rendering will not get choppy in that case.

Good, so I should be using the IGP for my screen. I tried it on my current desktop (moving the screen connection to my motherboard's screen port) and I get a black screen on boot. So I'll need to put in some more effort. Thanks for the explanation anyway!

0 Likes

Check your BIOS. I too have an integrated video card that drives a display and had to make sure the BIOS was using it.

0 Likes

In OpenCL you enumerate platforms and then enumerate the devices on those platforms; then you create a context from those devices. Most likely you will get the IGP as the first OpenCL device and then the other GPUs, so you must program your application to skip this slow device.
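
For illustration, here is a rough host-side sketch of that enumeration (not from this thread); the "skip small devices" filter based on compute-unit count is just one arbitrary way of leaving the IGP out:

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_uint num_platforms = 0;
    cl_platform_id platforms[8];
    clGetPlatformIDs(0, NULL, &num_platforms);
    clGetPlatformIDs(num_platforms > 8 ? 8 : num_platforms, platforms, NULL);

    for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
        cl_uint num_devices = 0;
        cl_device_id devices[16];
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 16,
                           devices, &num_devices) != CL_SUCCESS)
            continue;                         /* no GPU devices on this platform */

        cl_device_id compute_devices[16];
        cl_uint num_compute = 0;
        for (cl_uint d = 0; d < num_devices; ++d) {
            cl_uint cu = 0;
            char name[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(cu), &cu, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("platform %u device %u: %s (%u CUs)\n", p, d, name, cu);

            if (cu >= 8)                      /* arbitrary cut-off: skip the IGP */
                compute_devices[num_compute++] = devices[d];
        }

        if (num_compute > 0) {
            cl_int err;
            cl_context ctx = clCreateContext(NULL, num_compute, compute_devices,
                                             NULL, NULL, &err);
            if (err == CL_SUCCESS)
                clReleaseContext(ctx);        /* real code would keep and use it */
        }
    }
    return 0;
}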

In OpenGL on AMD cards, rendering is done by the card whose output the display is hooked up to. There is also a WGL extension to set OpenGL GPU affinity.

So if you don't want to freeze your display, the IGP solution seems best.

0 Likes

I suggest you look into a library I wrote to do exactly this: https://github.com/nbigaouette/oclutils/

It will enumerate all platforms available on the machine (on Linux, you can have AMD, Nvidia and Intel at the same time without issues). Then, for each platform, it will sort the devices according to their core counts. You then decide which platform to use and just ask for "the best" device there. It is also possible to lock a device, so you can run two different simulations at the same time and they will run on two different GPUs.

0 Likes
diepchess
Adept I

Hi, if you intend to build something cluster-like, so not just 1 GPU, keep in mind that the software you want to run your calculations with determines what hardware is most efficient.

A standalone machine you connect to remotely for computation is preferred. Usually Linux is the best OS for that, as it allows easy SSH access, is pretty stable, and is not expensive, unlike Windows.

4000 euro is quite some budget.

So the obvious questions are:

a) how important is double precision?

b) how much programming effort are you willing to put in?

c) do you want to run software that already exists?

d) how big a problem is using a lot of power?

Above 1 kilowatt, most rooms start to get BIG problems with heating. Heat is a major issue in HPC.

If you want to build just 1 machine with 1 GPU, the choice is easy: get an i7-3930K on socket 2011.

It has 4 memory channels.

When you put in more GPUs, realize you have to share the bandwidth over all of them.

For example, what I'm building here is a small cluster. I started with 8 mainboards for $60 a piece from eBay, Supermicro ones with PCI-e 2.0. Inside you can put cheap $30 CPUs like the L5420 from eBay.

So then you get the entire 8 GB/s towards 1 GPU, and each machine can run a different instance of your GPGPU application.

For a small cluster, the software side is relatively straightforward, for example with the pdsh shell. Google for it. It gets used on major supercomputers, and it's free and very good.

No mainboard currently can deliver huge bandwidth to several GPUs effectively. Ignore the theoretical benchmarks; that's theory.

You're interested in how it performs. Clusters are totally unbeatable in price and performance compared to 1 expensive machine.

Is double precision not your main focus?

Then might I interest you in just picking up some HD 5870s from eBay?

They draw relatively little power compared to the 6000 series, they basically just cannot prefetch memory, and you can put a few in each node.

The 6000 series of course eats a lot of power compared to the 7000 series, so you can probably skip the 7000 series.

However, there are a lot of offers online for 6950s that are unlocked and basically have 1536 PEs, like the 6970.

Huge price difference. Most are offered for around 140 euros each.

So if power is not the biggest problem, realize that cheap GPUs in a clustered setup are unbeatable in performance and price.

So pick up the cheapest mainboard that has PCI-e 2.0, put in a $50 processor or maybe even 2 CPUs depending on how you need to 'feed' your GPU, and throw in unlocked 6950s. For 1 TFLOP of double precision, that's unbeatable in price again.

For 2000 euros, I'd say, build 4 machines with 2 of those cards each. So 8 cards in total. It will eat a lot of power, yet you save 2000 euros for your power bill then.

The fastest GPUs and fastest CPUs are always the most expensive, and 1 month after you buy one, the price has already dropped by hundreds of dollars for each component.

Note that I'm building a custom box here to put the 8 machines in, with big airflow and, most importantly, much less noise.

Throw air into the box from underneath and blow it out on top. I intend to blow the air in and out using Phobya 18 cm fans here, quite a bunch of them at 700 RPM. The air that gets out 'on top' you can, so to speak, blow directly outside, and you blow in a little air from outside the building underneath (besides a lot of air that's inside the room; by controlling the airflow that comes in at the ground and goes out at the ceiling, you control the temperature inside the room).

Where normal clusters cannot tolerate big temperature differences, with a GPGPU cluster like the one I'm building here you can, as those cards are already used to running pretty hot inside.

So that means you can remove way more heat.

Please also note that there are PSUs nowadays that are pretty efficient. The best review site I found for that is Hardware Secrets, as they also test the PSUs at somewhat hotter temperatures, whereas the usual 'awards' are based on a 23C lab airflow, which is not realistic.

A gold-certified PSU can do miracles as opposed to still very good PSUs that are just 80% efficient. Huge difference in what you have to cool, especially as you are also in that risk zone of just above 1.5 - 2.0 kilowatts considering the plan you wrote down.

A cluster of course is gonna eat a lot more power than that.

Please note that the fastest CPUs also consume a really large amount of power.

Do you need a fast CPU? The bandwidth a CPU with an integrated memory controller delivers depends on how high it is clocked. Are you prepared to pay that big price?

The i7-3960X is a genius CPU and most people use it to feed GPGPU setups, as it has unrivalled speed and memory performance for single-socket setups.

Yet it's expensive.

Another cheapskate solution, still using 1 machine, is to use a 3930K processor. It's expensive, yet not like the i7-3960X; with a big watercooling setup you can probably get it to 4.5 GHz. Don't try it without serious watercooling, and don't use some sort of default setup for that. Really big pipes.

With overclocked RAM and a CPU clocked to a speed like that, you'll have unrivalled bandwidth to a GPU, but sure, it's going to cost. A cluster will probably eat a bit more power, will deliver at least double the TFLOPS, is less than half the price, and you can still sell it later for nearly the same price. But you will probably need to build your own box for it, plus an exhaust to the outside and an intake from outside for cooling.

Wow, that's an elaborate answer. Thanks!

keep in mind that the software you want to run your calculations with determines what hardware is most efficient.

Actually, I don't know this in advance. It's for my research, which is permanently under development and moving into new areas. What I do know, however, is that the computations will be purely combinatorial, so no floats or doubles will be involved at all. What will be involved:

  • massive computations over finite fields (always with lookup tables)
  • binary matrix reduction, binary rank computations, ...
  • coding theory -- requires proper support of popcount()/clz()
  • massive parallel randomized searches (Monte Carlo style)

I hope this gives a better idea of what exactly my purpose will be. It's certainly not a mainstream use of a GPU, but it parallelizes like any other algorithm (a small kernel sketch using these built-ins follows below).
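
For concreteness, here is an illustrative OpenCL kernel (not from this thread; all names are made up) of the bit-level style of work described above: computing the Hamming distance between a query word and many stored codewords using popcount(). Note that popcount() is an OpenCL C 1.2 built-in (on older runtimes you would supply your own bit count), and clz() is available as a standard integer built-in as well.

__kernel void hamming_distances(__global const ulong *codewords, /* packed 64-bit blocks */
                                const ulong query,
                                __global uint *distances)
{
    size_t gid = get_global_id(0);
    ulong diff = codewords[gid] ^ query;    /* positions where the words differ */
    distances[gid] = (uint)popcount(diff);  /* Hamming distance */
}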

4000 euro is quite some budget.

So the obvious questions are:

a) how important is double precision?

b) how much programming effort are you willing to put in?

c) do you want to run software that already exists?

d) how big a problem is using a lot of power?

No mainboard currently can deliver huge bandwidth to several GPUs effectively. Ignore the theoretical benchmarks; that's theory.

You're interested in how it performs. Clusters are totally unbeatable in price and performance compared to 1 expensive machine.

It's a soft upper bound; there's no need to spend all the money, of course, and if 5k is needed to suit my needs, then so be it. But with 3-4 HD7970s in mind, it can't really end up cheap anymore.

a) Not at all, even single precision doesn't matter as I mentioned above.

b) That's a tricky one: I'm willing to put in the effort to write good code, but I don't have much expertise yet in GPU programming, so I'm not sure if I could actually put much effort into it in a useful way. Being a little bit fool-proof is certainly a plus.

c) Not really, almost everything will be hand-written (only some standard complicated things like a random number generator will not). But that doesn't mean compatibility is not important, e.g. I don't want to rewrite everything when I buy a new computer in 5 years.

d) Depends on what you call a lot. My current intention is to use it as my main desktop computer, where the GPUs would be idle 90% of the time, which is why I'm considering HD7970 (very low idle power usage).

(Actually I hadn't considered a remote cluster at all yet; I was thinking of just making it my main work machine, with Windows and a CPU with integrated graphics for my screen, so that the GPUs are 100% free for computing.)

If you want to build just 1 machine with 1 GPU, the choice is easy: get an i7-3930K on socket 2011.

It has 4 memory channels.

When you put in more GPUs, realize you have to share the bandwidth over all of them.
[...]

Do you need a fast CPU? The bandwidth a CPU with an integrated memory controller delivers depends on how high it is clocked. Are you prepared to pay that big price?

Would bandwidth matter so much? I don't know how to test it easily, but I intuitively presume that my applications are not really bandwidth-intensive.

I was actually considering an LGA1155 board with an i7-2700K, since those feature integrated graphics, which would save me the hassle of a command-line remote machine (and instead give me one strong all-purpose computer). No socket 2011 boards/CPUs currently feature integrated graphics, which means my screen would freeze during execution and the computations would be slowed down on the card driving the screen.

The other reason for 3x HD7970 over 4x or any other config is that then I can just use the default air cooling with a Z68 Extreme7 Gen3 motherboard. This saves me the cost and hassle of a water cooler, plus it is a lot cheaper than going with LGA2011 + a hyperexpensive CPU.

Then might I interest you in just picking up some HD 5870s from eBay?

[...]

However, there are a lot of offers online for 6950s that are unlocked and basically have 1536 PEs, like the 6970.

Huge price difference. Most are offered for around 140 euros each.

[...]

So if power is not the biggest problem, realize that cheap GPUs in a clustered setup are unbeatable in performance and price.

[...]

For 2000 euros, I'd say, build 4 machines with 2 of those cards each. So 8 cards in total. It will eat a lot of power, yet you save 2000 euros for your power bill then.

I'd say that's reasonable from a financial point of view, but I'm pretty sure the university administration would not agree. I can use the money to buy hardware or complete machines from shops (since they provide official documents), but they don't like me using it to buy from unofficial sources (such as eBay users), and I certainly cannot use it to cover my power bills (as those are mixed with personal power usage).

A gold-certified PSU can do miracles as opposed to still very good PSUs that are just 80% efficient. Huge difference in what you have to cool, especially as you are also in that risk zone of just above 1.5 - 2.0 kilowatts considering the plan you wrote down.

A cluster of course is gonna eat a lot more power than that.

Out of curiosity: a lot more? 1.5 kW already costs 5 euros/day... doesn't that ruin the cheapness fairly soon?

0 Likes

Given that you want to get 3-4 HD7970 cards and use them for several years, I would still advise the LGA2011 platform, despite the fact that it runs hot. With four HD7970s standing close to each other, the noise of the cooling systems will be unbearable for you, so it is best to use water cooling... water-cooled graphics cards are single slot and take up less space on the motherboard, which allows you to add a separate low-power video card for your desktop.

So you get a system that has no restrictions in the form of PCI-e bandwidth (PCI-e v3.0 > PCI-e v2.0) and a normal, responsive screen...

0 Likes
vanja_z
Adept II

As mentioned, I would suggest taking another look at GNU/Linux as a platform, because it has advantages in terms of:

  1. Remote administration and access, especially in a presumably university environment.
  2. Networking: depending on your faculty's setup you may need *nix to access resources such as NFS drives, and of course multi-user support.
  3. *nix is also more commonly used, better documented and better supported for cluster use, should you wish to add another computer or two a few years down the line. This could be a good option, particularly as second-hand GPUs matching your original purchase start coming up.
  4. The GNU/Linux command-line way of doing things, including developing, is just generally superior to the Windows way and more common in research circles.

If you want to keep GNU/Linux at least as an option, I would think twice about choosing AMD hardware, due to poor driver support and no clear policy on improving this.

Having used Nvidia for a number of years for my own research code and having recently experimented with AMD, I am extremely disappointed with AMD and would strongly advise against going with them, for the following reasons:

  1. Poor performance: despite mild optimisations I am getting the same speeds on an HD6950 as on a GTX275.
  2. Poor Linux driver; based on my use for gaming, the Windows driver is no better. Performance regressions are common and new bugs are regularly introduced.
  3. AMD only supports OpenCL (it used to support CAL, but that is now deprecated). Without going into an extended rant, I strongly prefer CUDA to OpenCL in terms of the compilation workflow, diagnostic tools, code structure and especially the horrible OpenCL memory model.
  4. Even within OpenCL, the AMD implementation has disadvantages. With AMD, you cannot allocate a memory object larger than 25% of the memory size (not a problem with Nvidia OpenCL). Also, AMD still doesn't support the cl_khr_fp64 extension for double precision support (you can use the cl_amd_fp64 extension of course). All you have to do is briefly scan this forum for a long list of non-compliance with the OpenCL standards and undocumented behaviour.
  5. More difficult to administer multiple cards on Linux. There is no good way to run headless, and you need to run dummy X displays on each GPU in order to access them. This may or may not be an issue for you on Windows, although I have heard of similar complaints.

Honestly, I like to support the underdog and I don't like Nvidia's company ethics at all, so it pains me to say this, but at this stage AMD is not a good idea for scientific computing. I purchased 2 HD6950s to use at home for gaming and to test my code on, with a view to setting up a machine similar to what you are suggesting, and I am very happy that I tested the waters before jumping in.

My Suggestion for you:

Make your own decision. If you are doing a PhD, I dare say you have enough time to set up a cheap experimental machine, maybe using old uni hardware. Get your development environment up and running, write some basic code implementing some very basic features of what you want to do, and repeat with both vendors. This should give you an idea of the pros and cons of each one, and if you bought second-hand GPUs you can probably resell them for nearly the same price after a few weeks.

EDIT: need to run a dummy X display on each GPU in order to access it (not dummy xserver, typo)

Hmm... interesting points. Yes, I'd need to keep Linux at least as an option. Actually, it has recently become more likely that I will be running Linux as my main platform, purely remote-accessed.

The main problem for me with going nVidia is that one either has

  • Tesla, which is extremely expensive: 3x HD7970 costs only as much as 1 Tesla card... or
  • GTX, which doesn't seem to be fit for GPGPU at all: if I look at existing GPGPU usage such as bitcoin mining, I see on https://en.bitcoin.it/wiki/Mining_hardware_comparison that nVidia cards barely reach 25% of the performance of their AMD counterparts... I don't know how representative Bitcoin mining is of general GPGPU, but it should at least be an indication, and the gap is really huge...

Moreover, cramming many HD7970s into one motherboard should be nearly trivial when this thing is ready for sale. For Tesla, I see no similar single-slot versions. So even with sufficient money, I could only put in half as many cards, or fewer. Given this, would you still consider 2x Tesla or GTX a gain over 4x HD7970?

vanja_z wrote:

  1. More difficult to administer multiple cards on Linux. No good way to run headless and you need to run dummy xservers on each GPU in order to access them. This may or may not be an issue for you on Windows although I have heard of similar complaints.

Care to explain this a little further? This is the main thing that frightens me in your post. I have a friend running 2x HD6990 on Linux without many problems, but this is for bitcoin mining, not for general OpenCL programming. So perhaps my situation will be different. But I fail to find more details on the web, so perhaps you can explain it to me?

0 Likes

There is only a need for one X server. Also, don't run a distro unsupported by AMD, as you will run into horrible issues like crashes on boot and so on. I am not entirely sure, but with the 79xx you should be able to allocate the whole VRAM on Linux; hopefully this will be supported on older cards too, most likely divided into four chunks. I'm not sure what Nvidia reports as the max allocation size on their cards, but if you allocate more than that you violate the OpenCL spec and you should get an error. It is the same as in OpenGL, where AMD is stricter than Nvidia.

0 Likes

nou,

Yes, you only need one X server; that was a typo, I meant you need dummy X displays. My apologies.

Regarding Nvidia memory allocation, I have tested GTX275 and GTX295 using various drivers and Linux distributions and although they report CL_DEVICE_MAX_MEM_ALLOC_SIZE to be 0.25 of CL_DEVICE_GLOBAL_MEM_SIZE as the AMD implementation does, the allocations work perfectly well up to nearly 100% just as they do in CUDA. This is also reported on their forums.

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html

CL_DEVICE_MAX_MEM_ALLOC_SIZE

Return type: cl_ulong

Max size of memory object allocation in bytes. The minimum value is max (1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)

The way I read this is that the minimum value of CL_DEVICE_MAX_MEM_ALLOC_SIZE should be max(0.25*CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024), meaning that it would still comply with the specification if CL_DEVICE_MAX_MEM_ALLOC_SIZE were greater than its "minimum value". At any rate it's a pointless and unreasonable limitation.
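
For anyone who wants to see what their own device actually reports, here is a small host-side query sketch (not from the thread; it assumes you already have a cl_device_id from clGetDeviceIDs()):

#include <CL/cl.h>
#include <stdio.h>

/* Print the global memory size and the largest single buffer the device
 * claims to allow, plus the ratio between the two. */
void print_mem_limits(cl_device_id device)
{
    cl_ulong global_mem = 0, max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    printf("global memory    : %llu MiB\n",
           (unsigned long long)(global_mem >> 20));
    printf("max single buffer: %llu MiB (%.0f%% of global)\n",
           (unsigned long long)(max_alloc >> 20),
           global_mem ? 100.0 * (double)max_alloc / (double)global_mem : 0.0);
}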

0 Likes

pwvdendr,

Bitcoin mining capacity is not a general indicator of GPGPU processing power. As I said, my code runs at the same speed on an HD6950 and a GTX275. It really depends on whether your code is bound by memory bandwidth, integer operations or floating point operations (or a combination). More importantly, I believe AMD has a single instruction (BIT_ALIGN_INT) that Nvidia doesn't have, which is used extensively in bitcoin mining and is a large contributor to the speed discrepancy.
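
For context, the instruction vanja_z mentions is exposed to OpenCL C on AMD through the cl_amd_media_ops extension as amd_bitalign(). The sketch below (not from the thread, and hedged accordingly) shows the classic trick of mapping a 32-bit right rotation, which SHA-256-based mining uses heavily, onto that built-in, with a plain-shift fallback for other devices:

/* Right-rotate helper: uses amd_bitalign() where the extension is available,
 * otherwise falls back to ordinary shifts (the masking keeps the fallback
 * defined even for n == 0). */
#ifdef cl_amd_media_ops
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
uint rotr32(uint x, uint n) { return amd_bitalign(x, x, n); }
#else
uint rotr32(uint x, uint n) { return (x >> n) | (x << ((32u - n) & 31u)); }
#endif

__kernel void rotate_words(__global const uint *in, __global uint *out, uint n)
{
    size_t gid = get_global_id(0);
    out[gid] = rotr32(in[gid], n);
}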

With GTX and Tesla, you are correct regarding pricing. Tesla is hideously overpriced, and this is one of the reasons I dislike Nvidia. On the other hand, I think you are mistaken about the performance differences between the two. They are based on essentially the same chips, with the highest-spec Tesla C2070 being equivalent to a GTX470 core. The main differences are:

  1. GTX cards based on GF100 and GF110 (470, 480, 570, 580, 590) have their double precision floating point performance artificially limited to 1/8 of single precision performance, while Tesla enjoys DP performance at 1/2 of SP. This market segmentation tactic is another reason I dislike Nvidia.
  2. Tesla cards have more memory, 3 GB or 6 GB. This is less of an issue with 3 GB GTX580s being available now.
  3. ECC memory on Tesla may be a non-issue or a deal breaker depending on your application. AMD doesn't have this available at all.
  4. GTX are clocked higher.

pwvdendr wrote:

Care to explain this a little further? This is the main thing that frightens me in your post. I have a friend running 2x HD6990 on Linux without many problems, but this is for bitcoin mining, not for general OpenCL programming.

I too briefly used 2x HD6950 for Linux mining! I got over it though. Perhaps I exaggerated the troubles a bit. You need to run an X server (not necessary when using Nvidia headless) and you need to set up dummy X screens on each GPU to be used. If you are using the cards in a working desktop with a monitor attached to each card then none of this is a problem; however, if you want to run headless, it is an inconvenience and undocumented. Also, I have to run,

export DISPLAY=:0

before all cards are visible to OpenCL, even with monitors connected. I am not sure how this works in Windows.

Regarding 4x one card vs 2x another card: PCIe bandwidth may or may not become an issue depending on your problem, so it's difficult to comment.

Good luck with your choices!

vanja_z wrote:

Bitcoin mining capacity is not a general indicator of GPGPU processing power. As I said, my code runs at the same speed on an HD6950 and a GTX275. It really depends on whether your code is bound by memory bandwidth, integer operations or floating point operations (or a combination). More importantly, I believe AMD has a single instruction (BIT_ALIGN_INT) that Nvidia doesn't have, which is used extensively in bitcoin mining and is a large contributor to the speed discrepancy.

Oh really? That explains a lot, I didn't know that. Thanks for the info!

I had concluded that the main difference was that nVidia had better DP performance (for the FLOPS number), but far worse general usage performance. And since I'm not working with floats at all (only integers, memory and logic) I set nVidia aside fairly quickly. But OK, I'll try to get an nVidia card then and allow them a fair competition. 🙂

vanja_z wrote:

If you are using the cards in a working desktop with a monitor attached to each card then none of this is a problem; however, if you want to run headless, it is an inconvenience and undocumented.

Wait what? I need a physical monitor for each card?? So I'd need to buy 4 monitors?? 😕

0 Likes

pwvdendr wrote:

Wait what? I need a physical monitor for each card?? So I'd need to buy 4 monitors?? 😕

That was true for Windows, but it is not anymore.

0 Likes

pwvdendr wrote:

vanja_z wrote:

If you are using the cards in a working desktop with a monitor attached to each card then none of this is a problem; however, if you want to run headless, it is an inconvenience and undocumented.

Wait what? I need a physical monitor for each card?? So I'd need to buy 4 monitors?? 😕

It is not true ... just one monitor...

pwvdendr wrote:

I had concluded that the main difference was that nVidia had better DP performance (for the FLOPS number), but far worse general usage performance. And since I'm not working with floats at all (only integers, memory and logic) I set nVidia aside fairly quickly. But OK, I'll try to get an nVidia card then and allow them a fair competition. 🙂

Compare the two best candidates (but different in price):

Tesla M2090

"NVIDIA unveiled the Tesla M2090 GPU this week. Equipped with 512 CUDA parallel processing cores, it delivers 665 GigaFLOPS  of peak double-precision performance and 178 GB/sec memory bandwidth"

AMD Radeon™ HD 7970

"Up to 925MHz Engine Clock

3GB GDDR5 Memory

1375MHz Memory Clock (5.5Gbps GDDR5)

264GB/s memory bandwidth (maximum)

3.79 TFLOPs Single Precision compute power

947 GFLOPs Double Precision compute power"

not to mention the cost of...

0 Likes

vanja_z wrote:

  1. Even within OpenCL, the AMD implementation has disadvantages. With AMD, you cannot allocate a memory object larger than 25% of the memory size (not a problem with Nvidia OpenCL).

- AMD is compliant with the OpenCL specification here...

vanja_z wrote:

  1. Also, AMD still doesn't support the cl_khr_fp64 extension for double precision support (you can use the cl_amd_fp64 extension of course). All you have to do is briefly scan this forum for a long list of non-compliance with the OpenCL standards and undocumented behaviour.

- The 5800 and 7900 series already support cl_khr_fp64. For the 6900 series, support has also appeared in the driver "12-2_pre-certified_vista_win7_64_dd_ccc" (I hope driver 12.2 for Linux will support cl_khr_fp64 as well)... So cl_khr_fp64 is no longer a problem.
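
For completeness, here is a small host-side sketch (not from the thread; it assumes a cl_device_id obtained from clGetDeviceIDs()) of how you could check for either fp64 extension before building double-precision kernels; inside the kernel source you would then add "#pragma OPENCL EXTENSION cl_khr_fp64 : enable" (or the cl_amd_fp64 variant):

#include <CL/cl.h>
#include <string.h>

/* Returns 1 if the device advertises either the Khronos or the AMD
 * double-precision extension in its extension string. */
int device_has_fp64(cl_device_id device)
{
    char extensions[4096] = {0};
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                    sizeof(extensions) - 1, extensions, NULL);
    return strstr(extensions, "cl_khr_fp64") != NULL ||
           strstr(extensions, "cl_amd_fp64") != NULL;
}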

0 Likes

ED1980 wrote:

vanja_z wrote:

  1. Even within OpenCL, the AMD implementation has disadvantages. With AMD, you cannot allocate a memory object larger than 25% of the memory size (not a problem with Nvidia OpenCL).

- AMD is compliant with the OpenCL specification here...

Can you please point me to where in the specification it says that the maximum buffer size must be 25% of global memory? I've heard this referred to, but I haven't come across it myself. I am by no means an expert and have not read the spec cover to cover. All I have noticed is the online man page for clGetDeviceInfo:

"Max size of memory object allocation in bytes. The minimum value is max (1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)"

which to me doesn't support this limitation.

Thank you for the information regarding cl_khr_fp64, this is some good news I was unaware of. I am still dubious, though, since there are plenty of features in the Windows driver that are missing from the *nix driver.


0 Likes

An implementation can report that you can allocate the full memory in one buffer. But if it reports 25% and you allocate 100% and it doesn't report an error, that is a bug.

0 Likes

Does the 7970 really support cl_khr_fp64 at this time?

See http://devgurus.amd.com/message/1274897

0 Likes