So, I found the Vega FE for a reasonable price (about 30% discount, I believe), and decided to buy it (after checking the general requirements and stuff). To my surprise, after installing it, any OpenCL application I try to launch will complain about not finding an OpenCL device. I found this other discussion here in this forum:
This person seems to have nailed it: if it is a *requirement* to have PCIe 3.0 + a supported CPU in order to use this *workstation* card (this is NOT a gaming card), why in the world is this not part of its listed requirements? Now that I got the FE card + a new power supply, I no longer have money to get a new CPU/motherboard/RAM! I currently have an AMD A10-6800K on a Gigabyte F2A85X-UP4 motherboard.
Yes, I am getting the same message as "pdxtabs" in kernel log:
Jan 3 21:35:03 rexy kernel: [ 3.507279] kfd kfd: skipped device 1002:6863, PCI rejects atomics
Funny enough, my RX 580 produces the same message:
Jan 3 21:58:42 rexy kernel: [ 6.555563] kfd kfd: skipped device 1002:67df, PCI rejects atomics
And yet it works perfectly (I am NOT using ROCm).
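For anyone who wants to check their own kernel log for the same message, here is a minimal sketch in Python. It only parses lines in the format quoted above; the function name `find_skipped_devices` and the regular expression are illustrative, not part of any AMD tool, and other kernel versions may word the message differently.

```python
import re

# Matches the kfd "skipped device" message in the format quoted in this thread.
# Assumption: vendor and device IDs are four lowercase hex digits each.
KFD_SKIP = re.compile(
    r"kfd kfd: skipped device (?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4}), (?P<reason>.+)$"
)

def find_skipped_devices(log_lines):
    """Return (vendor_id, device_id, reason) for each kfd 'skipped device' line."""
    skipped = []
    for line in log_lines:
        match = KFD_SKIP.search(line)
        if match:
            skipped.append((match["vendor"], match["device"], match["reason"].strip()))
    return skipped

# The two log lines from this post.
sample = [
    "Jan 3 21:35:03 rexy kernel: [ 3.507279] kfd kfd: skipped device 1002:6863, PCI rejects atomics",
    "Jan 3 21:58:42 rexy kernel: [ 6.555563] kfd kfd: skipped device 1002:67df, PCI rejects atomics",
]

for vendor, device, reason in find_skipped_devices(sample):
    print(f"{vendor}:{device} -> {reason}")
```

To run it against a real log, pass the lines of a saved kernel log (for example an open file object) instead of `sample`.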
After about 3 hours of research and troubleshooting (tried different driver versions, ...), I removed the Vega FE and re-installed my RX 580 card. I am able to work again (on the old card), but I feel honestly bad after putting well over $700 on a card (and this is *discounted* price) because of the 16GB memory and the Vega architecture that I could see myself using.
So, is there really no way of using OpenCL (at least 1.2) on Linux (Ubuntu 16.04) with the Vega FE? (I couldn't find a way in the 3+ hours I invested today on this, but I am not an expert on this field).
Thanks in advance,
So... add 10 more hours of work to this (all of my Saturday, sigh...). Today I booted Windows 10 64-bit on this machine and tried 2 different drivers for the Vega FE: it works, OpenCL applications start and work as intended (albeit with some throttling, probably thermal). I also tried it with Unigine Superposition... I was expecting better performance tbh (it was roughly 1000 points above the RX 580 for the 1080P extreme setting: a score of 3508 vs the 2556 the RX 580 got), but well, I didn't buy a 16GB video card to play games, this is for work and the 16GB should be useful, so... I still need to be able to use this thing with OpenCL on Linux!!!
What the previous paragraph proves is: this is not a hardware compatibility problem, this is 100% AMD drivers on Linux.
Back to Ubuntu 16.04: I tried both available AMD drivers for this card, plus other drivers (the one I use for the RX 580). One of the drivers for the FE would only compile on kernels 4.4.x and 4.10.x, and I tried *both* of them; the other driver would compile correctly on 4.4.x, 4.10.x and 4.13.x, and I tried all three. I also tried stopping the X server and working from the terminal (text-only console), and I tried using another video card (the RX 580) as the main video card, leaving the Vega FE on its own... no dice: the AMD Vega Frontier Edition is *not* detected as an OpenCL device under Ubuntu 16.04 LTS no matter what driver I tried.
These were the drivers I tried on Linux:
amdgpu-pro-17.40-501128 (this would be the "17.Q4.1"). This one worked on kernels 4.4.x and 4.10.x, tried them both.
amdgpu-pro-17.50-511655 (this would be the "17.12.1"). This one worked on kernels 4.4.x, 4.10.x and 4.13.x: tried all three of them.
Any ideas? maybe contact AMD support?
I'm in a similar situation - I'm running SandyBridge-EP / Xeon E5 v1 and IvyBridge-EP / Xeon E5 v2 CPUs, which support PCI-E v3.0 but not necessarily PCI-E atomics according to ROCm.
I'd like to use Vega 64 with Linux for OpenCL workloads, but at the moment this does not seem possible. Is there a way to use legacy OpenCL or OpenCL without ROCm?
Sigh... to think I could've gone for an NVIDIA card for just a few extra bucks (TensorFlow, Blender and other products have better support for CUDA than for OpenCL)... but no, I wanted to use AMD. I just like it, I already have 2x RX 580, and I assumed I could probably use the three of them on the same system, or at least the Vega + one RX 580 (even if I cannot run the same experiment across both cards, I could run a bigger experiment on the Vega and a smaller one on the RX 580, still saving me some time).
I was going to create an AMD ticket, but their contact form was giving an error, guess I will try tomorrow or something.
Didn't even know AMD had 'official' support. Sure enough, I found the form and it doesn't seem to be implemented.
Although I'm trying to support AMD, the NVIDIA card I have 'just works' on Ubuntu with the latest CUDA. People talk about NVIDIA sh**ting on their customers, but at least the products work and they can get their work done.
AMD come on! Clear communication on a WORKING setup and the requirements would be a great start.
Or keep your customers in the dark with unusable products and you won't have any suckers (née customers) left who are willing to give you a shot.
Side note: the crypto miners are paying > 2x MSRP for AMD cards (Vega) so you might even make some money switching to NVIDIA.
Yes, I created an AMD ticket, I hope they have some good news...
Is vega really good for mining? I thought it consumed too much power for the performance it provided (almost 2x as much as a Polaris or so).
It's good for certain algorithms.
If you hear back from AMD, please let me know what they say. I'm unable to get the contact form to work; nothing appears here after selecting language:
*update* ah, they fixed it! I'll submit a ticket myself if you don't get any news.
I need to build about ~20 AI/ML machines, and if I can't use Vega with HIP / ROCm then OpenCL is the last option on this platform. Otherwise I'll have to switch to team green :/
Nope, nothing yet. I'll update this if they ever come back.
In the meantime I have been studying more about systems and stuff. ROCm does look interesting. I even designed a system, it includes:
+ AMD Ryzen Threadripper 1920X (or 1950X). Why? Because of the number of PCIe lanes: you can run 4x GPUs at 16x, 8x, 16x, 8x, which is cool. Consider that the Threadrippers are in fact like a dual-CPU system, so much so that you can have NUMA mode and all (it has a UMA mode, but that will have higher latency, so it depends on your needs).
+ ASRock Taichi X399 (or the ASRock Fatal1ty X399 Professional Gaming, this one has a 10Gbps network interface built-in).
+ 64GB RAM, hard to decide: Crucial has a 4x 16GB ECC, 2666MHz RAM kit that is compatible with the Fatal1ty, whereas there are more high-speed kits for the Taichi, including some 4x16GB kits from G.SKILL running at 2933 (these are overclocked). I read that RAM is extremely important because the "Infinity Fabric" (CPU internal communication) runs at the memory clock AND the performance of the external PCIe lanes also seems to depend on this.
+ EVGA power supply, 1300W or more (ideally 1600W, but these are in low supply as well, probably because of cryptominers). Also, at 1500W or more I would have to review the electrical installation and possibly hire an electrician to upgrade that circuit (I could do it myself, but I do not have an electrician license here, and I am not sure whether one is required in Florida if you do the work for yourself).
+ Thermaltake Core X9 case. This thing seems to be *huge*.
+ Add some custom-loop liquid cooling from EKWB, because 4x GPUs air-cooled won't work (read about it at a couple of blogs, apparently they will quickly throttle because of heat after the third card or so).
All of that would blow my budget *for the year*, that means not even replacement parts! (ouch! and also I would end up with some increased debt, which I was trying to avoid), and I wonder if it is going to be worth it... OpenCL is good enough for me, but ROCm could be better (still studying). Switching to nVidia without a system that allows for fast PCIe communication would have limitations as well.
I have *never* in my life spent more than $1500 on a work computer (this was *once* over 10 years ago), and my builds are usually around $700 or so every 5 years... this one would blow the $2k mark! (ouch!) and I would be unable to add the 4 GPUs, I would have to stay with one or two rx 580 + the Vega... Maybe I could put in my rx 480, after all, it is basically the same as the 580 as far as I can tell.
I *did* consider the Intel options, but to get a similarly wide (in PCIe lanes) system I would have to spend even more, and it would not run the cards as fast.
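As a rough sanity check on the power supply sizing for the 4-GPU build listed above, here is a back-of-the-envelope sketch in Python. Almost every number in it is an assumption: 300W per GPU is the figure quoted in this thread for the Vega FE, 180W is the published TDP of the Threadripper 1920X/1950X, and the allowance for everything else, the PSU efficiency, and the 120V / 15A circuit with an 80% continuous-load derating are guesses for illustration only. Real draw depends on the workload and the actual wiring.

```python
# All figures are illustrative assumptions, not measurements.
GPU_WATTS = 300        # per-card figure quoted in this thread for the Vega FE
GPU_COUNT = 4
CPU_WATTS = 180        # published TDP of the Threadripper 1920X/1950X
OTHER_WATTS = 100      # assumed allowance for board, RAM, drives, pumps and fans
PSU_EFFICIENCY = 0.90  # assumed PSU efficiency at this load

def dc_load_watts(gpu_count=GPU_COUNT):
    """Estimated DC load the power supply has to deliver."""
    return gpu_count * GPU_WATTS + CPU_WATTS + OTHER_WATTS

def wall_draw_watts(dc_load, efficiency=PSU_EFFICIENCY):
    """Estimated draw at the wall socket for a given DC load."""
    return dc_load / efficiency

def circuit_continuous_limit(volts=120, amps=15, derate=0.8):
    """Assumed continuous-load limit of a household circuit, in watts."""
    return volts * amps * derate

load = dc_load_watts()
print("DC load (W):", load)
print("Wall draw (W):", round(wall_draw_watts(load)))
print("Circuit limit (W):", round(circuit_continuous_limit()))
```

Under these assumptions the estimated load lands above a 1300W supply, and the wall draw lands above what a single 15A circuit is usually rated to carry continuously, which is consistent with the worry about the electrical installation. Swapping some Vega cards for RX 580s would lower the total.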
This is how they're run in a data center:
If they are throttling, something to look into would be high CFM fans and case air flow.
For AI/ML deep learning 10G isn't enough, you want at least 40G or QDR Infiniband with RDMA.
I did consider the Threadripper platform but our current platform gives us x8/x8/x8/x8 and we can build 3~4 systems for the cost of 1 TR system.
Back to this thread - I was able to get in touch with AMD support last night and found the kernel and Xorg hard requirements:
kernel (4.10.0-33) and Xorg (1.19.3) versions
which can be installed on Ubuntu 16.04.3 with:
sudo apt install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04
I'll try amdgpu-pro with ROCm and legacy OpenCL after a clean install.
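A quick way to check a machine against the versions quoted above is to compare version strings numerically. The Python sketch below does that. It treats 4.10.0 and 1.19.3 as *minimums*, which is an assumption, since AMD support only named those specific versions; the helper names `version_tuple` and `meets_minimum` are made up for this example.

```python
import re

# Versions quoted by AMD support in this thread (treated here as minimums,
# which is an assumption).
REQUIRED_KERNEL = "4.10.0"
REQUIRED_XORG = "1.19.3"

def version_tuple(text):
    """Extract the leading dotted version, e.g. '4.10.0-33-generic' -> (4, 10, 0)."""
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))

def meets_minimum(found, required):
    """True if version string `found` is at least `required`."""
    a, b = version_tuple(found), version_tuple(required)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '4.10' compares equal to '4.10.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

# Example kernel release strings (hypothetical, in the usual Ubuntu format).
for release in ["4.4.0-104-generic", "4.10.0-33-generic", "4.13.0-21-generic"]:
    print(release, meets_minimum(release, REQUIRED_KERNEL))
```

On a live system, `platform.release()` from the standard library returns the running kernel's release string, which can be passed in as `found`. Note that passing this check says nothing about PCIe atomics support, which is a separate hardware requirement.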
Yep, I had to use Xorg 1.19.3, and I did try Kernel 4.10 (and 4.13 and 4.4): no dice, I just lack the CPU/motherboard support for ROCm, and AMD drivers do not offer OpenCL (legacy) for this card, as far as I can tell it is ROCm only.
Let me know how your test goes!
Well, I am just a guy building a system for his personal experiments (I work as a sysadmin/dba mostly), so... not that I plan to build a big AI/ML cluster anytime soon... it would seriously complicate my choices and probably increase my costs (and, for now, this is just a hobby).
Seriously? 3~4 systems for the price of 1 TR??? I have considered the Intel path, but it would save me like 30% or so while providing me like 50% less performance overall (CPU side of things)... What did you use to get PCIe x8/x8/x8/x8 for $500~$700??? (including CPU(s), motherboard, RAM, case/chassis, fans, power supply(ies)).
Now, back to cooling: this card doesn't have an opening on the short side to allow air coming from the case fans to reach the card's fins... I am no expert, but... how does it get enough air? The air path would go between one card and the other (both hot), into the turbine, and from the turbine along the rest of the card... Unless those fans created a high-pressure system within the case that forced air through the small gap between the cards at high speed (or maybe through the 4 small threaded holes?), I have my doubts the cards will run cool. I mean, those are 300W beasts. I took a look at other cards, including nVidia's Quadro and Titan Xp, and they do have fins exposed on the short side, where those fans would blow directly; the same goes for the Tesla P100 (and similar), except these, being server-only cards, don't have a fan, just the fins that the chassis fans blow at. I couldn't find an example of a server-only card from AMD: the WX5100/7100 and 9100 all have fans, which makes me believe these are more workstation cards.
Oh, found them from AMD (and the MI25 looks nice):
https://instinct.radeon.com/en/product/mi/radeon-instinct-mi25/ (well, this is yet to be released, it seems)
These also do not have a fan and the power connector is on the short side for the card.
In the picture posted previously those are Vega Frontier Editions.
I'm not an expert on airflow but there is a lid on this server and it is a wind tunnel with air moving from front to back. Blower cards can pull air from that small gap and use it for cooling. It may not be ideal, but with proper case fans it is enough.
A smoke test would be the best way to visualize this. You can also visualize this by putting a string by the intake and seeing how it is pulled in. By putting my hand near the fan intake, there is no breeze directly over the center but there is a force on the edges, almost parallel to the card.
The Radeon Instinct MI25 cards are likely better cooled due to the open back and specific design to take advantage of the server "wind tunnel" like you mentioned.
I haven't had a chance to retest bare Linux, as someone came up with a neat solution: running the cards in a virtualized instance with GPU passthrough and using them that way. This may be the best option until verified Linux drivers are released, unless you have time to spend debugging.