
vv111y
Adept I

Hello, just starting in OpenCL GPGPU for Neural Nets

Hi,

I would like to make sure I'm starting on the right foot as I build my workstation and get developing. First the hardware:

- Supermicro 7046GT-TRF X8DTG-QF workstation. Last-gen PCIe 2.0 GPU workstation intended to house Nvidia's Fermi cards. Four slots for double-wide GPU cards.

- 2X Xeon L5630.

- planned 4X Sapphire HD 7970 6GB 'Vapor-X' or 'Toxic'. I believe the extra RAM will really help with performance.

Anything I should know?

- modify for water-cooling.

Please let me know if there's anything I should know.

- Windows laptop for frontend & development. Workstation will be 'headless' - no display/keyboard/mouse. Remote desktop for managing/monitoring.

- Likely Ubuntu OS

Other Q's:

- Are there Linux tools to monitor and manage the workstation?

- Am I correct that the ideal dev environment is in Visual Studio and not Linux? I will download the Visual Studio packages.

- Also, any development supported for Mac OSX?

- Drivers !?

Thanks for the help!

void_ptr
Adept I

Hi vv111y,

> - planned 4X Sapphire HD 7970 6GB 'Vapor-X' or 'Toxic'. I believe the extra RAM will really help with performance.

> Anything I should know?

Sounds like a good choice to me. I don't see the bigger RAM helping performance so much as allowing you to have bigger neural nets with more and bigger weight matrices. I'm guessing you must have some strategy for dividing the workload across the GPUs, and are aware that moving cl buffers between discrete GPUs has to be done "manually" via the OpenCL host API, staging the data through the host.
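To make "manually" concrete, the host-side shuffle looks roughly like the sketch below: read the buffer from one GPU's queue into host memory, then write it into a buffer on the other GPU. This is only a sketch; it assumes the context, both command queues and both cl_mem buffers already exist, and the function name is made up.

#include <CL/cl.h>
#include <stdlib.h>

/* Copy n_bytes from a buffer on one discrete GPU to a buffer on another,
 * staging through host memory via the OpenCL host API. */
cl_int copy_between_gpus(cl_command_queue q_src, cl_mem src,
                         cl_command_queue q_dst, cl_mem dst,
                         size_t n_bytes)
{
    void *staging = malloc(n_bytes);
    if (!staging)
        return CL_OUT_OF_HOST_MEMORY;

    /* blocking read: source GPU -> host */
    cl_int err = clEnqueueReadBuffer(q_src, src, CL_TRUE, 0, n_bytes,
                                     staging, 0, NULL, NULL);
    if (err == CL_SUCCESS) {
        /* blocking write: host -> destination GPU */
        err = clEnqueueWriteBuffer(q_dst, dst, CL_TRUE, 0, n_bytes,
                                   staging, 0, NULL, NULL);
    }
    free(staging);
    return err;
}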

> - modify for water-cooling.

> Please let me know if there's anything I should know.

I don't know anything about water cooling. I have a 6990 running continually, fully loaded, just air-cooled. One of the GPUs does run kind of hot with the fan at 100%, so I should probably look into that.

> - Windows laptop for frontend & development. Workstation will be 'headless' - no display/keyboard/mouse. Remote desktop for managing/monitoring.

> - Likely Ubuntu OS

I'm running a 7950 on Ubuntu. Developing on Windows and deploying on Linux sounds a bit awkward to me. Not to say it can't be done; there are probably a thousand ways to do it. If you're using Visual Studio's solutions and projects, I imagine you'll either have to maintain a separate set of build files for Linux, or use something like CMake or QMake so you can build for Windows and Linux the same way. If you're a bash shell/command-line guy you could use Cygwin on Windows and bash on Linux. If you want a GUI IDE that can run on both, QtCreator might be a good choice. I'm using that on Ubuntu.
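For example, a bare-bones CMakeLists.txt along these lines builds the same code on Windows and Linux. The project and file names are placeholders, and older CMake releases may need their own FindOpenCL module, so treat it as a sketch:

cmake_minimum_required(VERSION 3.1)
project(nn_opencl)

# FindOpenCL ships with newer CMake; older versions need a custom module
find_package(OpenCL REQUIRED)

add_executable(nn_opencl src/main.cpp)
target_include_directories(nn_opencl PRIVATE ${OpenCL_INCLUDE_DIRS})
target_link_libraries(nn_opencl ${OpenCL_LIBRARIES})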

> Other Q's:

> - Are there Linux tools to monitor and manage the workstation?

Don't know much, except most guys I know just ssh into the remote Linux machine and do stuff on the command line.

> - Am I correct that the ideal dev environment is in Visual Studio and not Linux? I will download the Visual Studio packages.

Are there some tools (AMD or others) only available for Windows? I'm not aware of any except BOLT, but I haven't been paying close attention and could be wrong. I don't think BOLT helps much for neural nets, because things we need like matrix multiplies are not within its scope just yet, so you're back to using the OpenCL host API to get that done. Regarding Visual Studio being ideal: I would say it depends more on you. If you are proficient with Visual Studio and don't know much about the Linux world, you'll have a learning curve trying to do things the Linux way. And vice versa. CodeXL for kernel tuning is advertised as available for both Windows and Linux. If you are going to run the GPUs on a headless Ubuntu box, you'll have to learn to configure the machine, build and deploy on Linux anyway. I'm doing all my development on Ubuntu using QtCreator and the command line, using CMake for builds.
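As an example of what "doing it yourself" means, a naive matrix multiply in OpenCL C looks something like the sketch below: one work-item per output element, row-major C = A*B. A serious version would tile into local memory; the kernel name and layout here are just illustrative.

/* Naive OpenCL C kernel: C = A * B, with A MxK, B KxN, C MxN, row-major.
 * Launch with a 2D global size of (N, M) via clEnqueueNDRangeKernel. */
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float       *C,
                           const int M, const int K, const int N)
{
    const int col = get_global_id(0);
    const int row = get_global_id(1);
    if (row >= M || col >= N)
        return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}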

> - Also, any development supported for Mac OSX?

Don't know anything, sorry.

> - Drivers !?

You'll need the catalyst drivers and AMD APP SDK, available for win and linux.

Linux comes with open-source drivers that work with AMD cards, but they do not support OpenCL. For that, you need to remove the open-source drivers and download and install AMD's proprietary Catalyst drivers. Ubuntu has a PPA for the drivers, but they usually seem to be out of date. I like to run the latest Catalyst and APP SDK, so I do a manual install. AMD links to a community wiki with instructions. It seems I usually need to futz with things a bit to get it working every time I upgrade. Understanding ldconfig and how to make the system find the driver .so is something I had to learn.

Out of interest, what kind of neural nets and applications are you working on?

AMD doesn't support Mac. You must ask Apple for support.

Well, there is emerging OpenCL support in the open-source drivers. You can now run bitcoin mining on AMD cards with the open-source drivers ( http://www.phoronix.com/scan.php?page=news_item&px=MTM3ODM ), but it isn't ready for anything more complex.


Thanks nou. I'll keep that in mind


A couple of other things occur to me. On the hardware topic, you might consider the FirePro W9000/S9000, depending on how price sensitive you are. Basically the same GPU as the 7970, but with error-correcting memory. And I want to say they also offer the ability to transfer or synchronize buffers directly between cards via PCIe, without having to bring them to the host first. Since you are sharing the workload across multiple discrete GPUs, it might be important for you. I might be wrong, and I don't know how to exploit that ability. Maybe someone who knows can speak up.

I was wondering if you've given thought to whether to use double/single/half precision floats for the neural net weights. I was thinking I will try half_float, since that effectively doubles the number of weights you can fit in memory and cuts the required memory bandwidth in half, relative to 32-bit floats. I'd be interested to hear your thoughts.
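Roughly what I have in mind, for what it's worth: keep the weights in global memory as 16-bit halfs and convert to float on load, so the arithmetic still happens in 32-bit. vload_half is a core OpenCL built-in, so this part doesn't even need the cl_khr_fp16 extension. The kernel below is just an illustrative sketch (one weighted sum per work-item); the names are made up.

/* Weights stored as half (16-bit) in global memory, accumulation in float.
 * Each work-item computes one neuron's weighted sum over n_inputs inputs. */
__kernel void weighted_sum_half(__global const half  *weights,
                                __global const float *inputs,
                                __global float       *outputs,
                                const int n_inputs)
{
    const int neuron = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < n_inputs; ++i)
        acc += vload_half(neuron * n_inputs + i, weights) * inputs[i];
    outputs[neuron] = acc;
}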



void_ptr wrote:



A couple of other things occur to me. On the hardware topic, you might consider the FirePro W9000/S9000, depending on how price sensitive you are. Basically the same GPU as the 7970, but with error-correcting memory. And I want to say they also offer the ability to transfer or synchronize buffers directly between cards via PCIe, without having to bring them to the host first. Since you are sharing the workload across multiple discrete GPUs, it might be important for you. I might be wrong, and I don't know how to exploit that ability. Maybe someone who knows can speak up.



I was wondering if you've given thought to whether to use double/single/half precision floats for the neural net weights. I was thinking I will try half_float, since that effectively doubles the number of weights you can fit in memory and cuts the required memory bandwidth in half, relative to 32-bit floats. I'd be interested to hear your thoughts.



I'll reply for the other stuff later today, but I wanted to quickly answer this one -

For sure half_float; if we could go lower, even better. Here's a paper where they used 4-bit weights:

[1201.6255] Is a 4-bit synaptic weight resolution enough? - Constraints on enabling spike-timing dep...

That's with custom hardware though.

The direct buffer sync would be nice, but way too costly. Error-correcting memory also isn't worth it.


Hi Void_ptr, thanks for all the good info...


Sounds like a good choice to me. I don't see the bigger RAM helping performance so much as allowing you to have bigger neural nets with more and bigger weight matrices.


I'm trying for the biggest nets possible, so I'm looking for anything that helps. Size is my measure of performance.


I'm guessing you must have some strategy for dividing the workload across the GPUs,


Argh, no strategy yet, I'm figuring it out. Decide on partitioning the neural net, signals between cards & host, etc...

Thinking about it, the GPUs will be miles ahead of the pcie bus all the time. I wonder about doing compression on the cards to try to lessen the severity of the bottleneck. There's a company that does that to help with HPC interconnects.


and are aware that moving cl buffers between discrete GPUs has to be done "manually" via the OpenCL host API, staging the data through the host.


I'm a noob at all of this, so OpenCL and moving buffers will all be part of the learning curve.


I don't know anything about water cooling. I have a 6990 running continually, fully loaded, just air-cooled. One of the GPUs does run kind of hot with the fan at 100%, so I should probably look into that.


For me it's because of the noise. I'll be in the same room at least some of the time, and that's in a residence. Yours doesn't get too loud?


...I'm running a 7950 on Ubuntu. Developing on Windows and deploying on Linux sounds a bit awkward to me. Not to say it can't be done; there are probably a thousand ways to do it. If you're using Visual Studio's solutions and projects, I imagine you'll either have to maintain a separate set of build files for Linux, or use something like CMake or QMake so you can build for Windows and Linux the same way. If you're a bash shell/command-line guy you could use Cygwin on Windows and bash on Linux. If you want a GUI IDE that can run on both, QtCreator might be a good choice. I'm using that on Ubuntu.


Knowing this, I'll stick with Linux then. I just got the impression from AMD's site that there was a lot intended for Windows and VS.


Don't know much, except most guys I know just ssh into the remote Linux machine and do stuff on the command line.


hmmm, I'll be on the lookout for some GUI display so I can see what the machine is doing at a glance.


...I'm not aware of any except BOLT, but I haven't been paying close attention and could be wrong. I don't think BOLT helps much for neural nets, because things we need like matrix multiplies are not within its scope just yet, so you're back to using the OpenCL host API to get that done. Regarding Visual Studio being ideal: I would say it depends more on you. If you are proficient with Visual Studio and don't know much about the Linux world, you'll have a learning curve trying to do things the Linux way. And vice versa. CodeXL for kernel tuning is advertised as available for both Windows and Linux. If you are going to run the GPUs on a headless Ubuntu box, you'll have to learn to configure the machine, build and deploy on Linux anyway. I'm doing all my development on Ubuntu using QtCreator and the command line, using CMake for builds.


Skipping BOLT. Prob skipping VS. I'll have a learning curve no matter what.


You'll need the catalyst drivers and AMD APP SDK, available for win and linux.



... AMD links to a community wiki with instructions. It seems I usually need to futz with things a bit to get it working every time I upgrade. Understanding ldconfig and how to make the system find the driver .so is something I had to learn.


Got it, thanks. You've saved me time already.


Out of interest, what kind of neural nets and applications are you working on?


Ultimately I'm going for the holy grail - general intelligence. What I'll try to do here is some stripped-down proof of concept. Likely an agent that will autonomously obtain mastery of a body+environment. Also rudimentary communication to execute commands.


>> Ultimately I'm going for the holy grail - general intelligence.

I have similar aspirations.

>> I'm trying for the biggest nets possible, so I'm looking for anything that helps. Size is my measure of performance.

Me too.

I said:

I'm guessing you must have some strategy for dividing the workload across the GPUs,

You said:

>> Argh, no strategy yet,

Might I suggest you want to work that out before committing to a big heap of hardware?

>>I'm figuring it out. Decide on partitioning the neural net, signals between cards & host, etc...

My strategery is: I'd like a nice framework that abstracts the OpenCL host API details for all those NN kernels you/we are going to need. Get that working first. Worry about 4X GPUs and all the hardware later, after all that's done. In my case, I'm sure it's going to take more than a year to get through all that development with proper unit tests in place (to make sure everything's working), and get it debugged with "toy" problems, before I'm ready to scale up to big neural nets for real problems. I expect that if you were to try to hand-code all the neural nets directly in C++ with OpenCL kernels, you'd go mad with all the tricky coding details. I figure, by the time my framework is done and working, Volcanic Islands or some nice new GPU will be out, and I would regret having spent all my money on what is by then older technology. My opinion, for what it's worth.
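To give a feel for what I mean by "abstracts the OpenCL host API details", the kind of interface I'm sketching looks something like the header below. Every name here is hypothetical; the point is that the NN code only ever sees tensors and operations, never cl_mem handles or command queues directly.

#include <CL/cl.h>
#include <stddef.h>

/* A matrix that lives on one particular GPU. */
typedef struct nn_tensor {
    cl_mem           buf;    /* device allocation                */
    size_t           rows;
    size_t           cols;
    cl_command_queue queue;  /* queue of the device that owns it */
} nn_tensor;

/* These calls hide the clCreateBuffer, clSetKernelArg and clEnqueue...
 * plumbing behind a small, testable surface. */
nn_tensor *nn_tensor_create(cl_context ctx, cl_command_queue q,
                            size_t rows, size_t cols);
void       nn_tensor_upload(nn_tensor *t, const float *host_data);
void       nn_tensor_download(const nn_tensor *t, float *host_data);
void       nn_gemm(const nn_tensor *a, const nn_tensor *b, nn_tensor *c);
void       nn_sigmoid(nn_tensor *t);
void       nn_tensor_destroy(nn_tensor *t);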

>> Thinking about it, the GPUs will be miles ahead of the pcie bus all the time.

Yes, people make a big deal about the overhead of bringing buffers back and forth across the PCIe bus. It's a real concern, and I guess that's a lot of what motivates HSA and shared CPU/GPU memory: no need to move the data at all. My idea is to leave the buffers in place on the (discrete) GPU and then run the iterative NN training (e.g. gradient descent via backprop, or contrastive divergence) on the GPU without bringing the data back and forth to host memory on every iteration. Just leave it there on the GPU; backprop requires many iterations anyway. The thing I don't like about HSA is that the integrated GPUs (at least thus far) are always less powerful than the top-of-the-line discrete GPUs. Kaveri's rumored specs are still short of what you get with a proper discrete GPU. So I'd rather architect my system to run everything in place on the GPU with no PCIe traversals required, or more accurately, with PCIe traversals only at the beginning, when initializing the neural net(s), and at the end of a mini-batch, after many training iterations, to bring the NN weights back from the GPU.
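In host-code terms, the loop I'm picturing is roughly the sketch below: one write at the start of the mini-batch, many kernel launches with no host traffic in between, one read at the end. It assumes the kernel arguments and buffers were already set up with clSetKernelArg and clCreateBuffer; the function and variable names are made up.

#include <CL/cl.h>

/* Run 'iterations' training steps on-device, touching PCIe only at the
 * start (upload weights) and the end (download updated weights). */
void train_minibatch(cl_command_queue q, cl_kernel train_step,
                     cl_mem weights_buf, float *host_weights,
                     size_t weight_bytes, size_t global_size, int iterations)
{
    clEnqueueWriteBuffer(q, weights_buf, CL_TRUE, 0, weight_bytes,
                         host_weights, 0, NULL, NULL);

    for (int it = 0; it < iterations; ++it)
        clEnqueueNDRangeKernel(q, train_step, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);

    clFinish(q);

    clEnqueueReadBuffer(q, weights_buf, CL_TRUE, 0, weight_bytes,
                        host_weights, 0, NULL, NULL);
}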

>> For me it's because of the noise. I'll be in the same room at least some of the time, and that's in a residence. Yours doesn't get too loud?

Funny you should mention that. Yeah. It's kinda loud.


LOL, I wasn't going to buy any hardware now, but then those guys at servethehome found an eBay deal on last-gen Supermicro GPU workstations. Here's the new sale -- Supermicro 7046GT TRF X8DTG QF Tower Rack Server 2XPS No CPU RAM HD 672042054367 | eBay. So:

$370 barebones

$400 2X Xeon L5630 ($200 each)

$274 2X 16GB RAM = 32GB

No GPUs yet.

$1044 total.

I just couldn't say no. I know, it's only pcie 2.0, but the price!

I figure having something physical in front of me will put fire under my feet to get me going.

FYI I have found deal sites are hazardous for your financial health. More later...

himanshu_gautam
Grandmaster

I guess most of the questions have already been answered. Anyway, as you are trying to run the setup remotely, check http://devgurus.amd.com/message/1286878#1286878

All the best.

Thanks himanshu,

Yes that's info I need. Much appreciated.
