
riza_guntur
Journeyman III

Is there anybody here who has created a neural network application using the GPU?

Anybody?

I need suggestions on how to justify the use of GPGPU for neural networks, since the number of kernel calls in a neural network is huge.

If you have overcome this, let us know.

8 Replies
geigerh
Journeyman III

Actually, I have been working with ANNs for about 25 years. We are at the moment porting our brand of ANN (biologically motivated, simulating real neurons) to the GPU environment, and this is certainly the most cost-effective way to get a HUGE speed boost, since this kind of simulation is easy to parallelize (only local information is necessary). This may not be true for MLPs, because the backpropagation algorithm is not local. What type of ANN are you thinking about?
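To illustrate the "only local information" point, here is a toy Brook+-style kernel (just a sketch, not our actual code): each output element depends only on its own inputs, so the runtime can update every neuron in parallel.

// Toy sketch of a purely local update rule (illustrative, not our production code).
// Each element of new_act is computed from the matching elements of net and act only,
// so no thread needs data from any other neuron.
kernel void local_update(float net<>, float act<>, float decay, out float new_act<>)
{
    new_act = (1.0f - decay) * act + decay * net;
}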


Originally posted by: geigerh Actually, I have been working with ANNs for about 25 years. We are at the moment porting our brand of ANN (biologically motivated, simulating real neurons) to the GPU environment, and this is certainly the most cost-effective way to get a HUGE speed boost, since this kind of simulation is easy to parallelize (only local information is necessary). This may not be true for MLPs, because the backpropagation algorithm is not local. What type of ANN are you thinking about?

How many dimensions do you usually work with on the GPGPU side? Before this, at my university we went up to 900 at most, for image recognition.

I haven't seen any speedup so far with 16 dimensions and 96 training samples per epoch. What I get now is a HUGE slowdown. Normal CPU processing for a small dimension and fewer than 100 training samples with 1000 epochs usually takes less than 2 seconds, while with ATI Stream and Brook+ it takes about 10 seconds on my human timer (the slowdown feels so big that I leave the run to get a glass of water).

eduardoschardong
Journeyman III

Because it's easy to see ANNs as a sequence of vector operations, they are easy to parallelize.
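For example (a minimal sketch with made-up names): a neuron's net input is just an elementwise multiply of the weight vector with the input vector followed by a sum reduction, and both map directly onto stream kernels.

// Sketch only: a single dot product expressed as two stream operations.
kernel void mul(float w<>, float x<>, out float p<>)
{
    p = w * x;                 // elementwise product of weights and inputs
}

reduce void accum(float p<>, reduce float net<>)
{
    net += p;                  // sum reduction of the products
}

On the host side these would be called as mul(w, x, p); followed by accum(p, net); with w, x and p declared as streams of the layer's input length and net as a single-element stream.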

 

Some years ago I did some work on ANNs, though not very much and nothing important or big... geigerh probably has more experience with this.

Now I have a particular interest in a GPGPU implementation of this; if you happen to decide to make it open source, please let me know...

 

And if you need any help feel free to contact me.

 


Originally posted by: geigerh Actually, I have been working with ANNs for about 25 years. We are at the moment porting our brand of ANN (biologically motivated, simulating real neurons) to the GPU environment, and this is certainly the most cost-effective way to get a HUGE speed boost, since this kind of simulation is easy to parallelize (only local information is necessary). This may not be true for MLPs, because the backpropagation algorithm is not local. What type of ANN are you thinking about?

I'm working with fuzzy-neuro learning vector quantization (FNLVQ), which is based on fuzzy logic.

My work is right here: http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=117003&enterthread=y . I am currently at the training step, but I get a huge slowdown compared to the CPU implementation.

Originally posted by: eduardoschardong Because it's easy to see ANNs as a sequence of vector operations, they are easy to parallelize.

Some years ago I did some work on ANNs, though not very much and nothing important or big... geigerh probably has more experience with this.

Now I have a particular interest in a GPGPU implementation of this; if you happen to decide to make it open source, please let me know...

And if you need any help, feel free to contact me.



I'd be happy to do that. I will release it at the end of February next year, or as soon as I can get my code to run faster, since for now its performance is very sluggish.


Originally posted by: eduardoschardong Because it's easy to see ANNs as a sequence of vector operations, they are easy to parallelize.

Some years ago I did some work on ANNs, though not very much and nothing important or big... geigerh probably has more experience with this.

Now I have a particular interest in a GPGPU implementation of this; if you happen to decide to make it open source, please let me know...

And if you need any help, feel free to contact me.

 

How can I contact you?


By PM


We have approx. 800,000 neurons with something like 10 million connections. Our training cycles are much shorter than in your case (approx. 50 iterations for 10 classes of 20 samples each), since we do not use backpropagation (it is not necessary, because nonlinear biological neurons can learn nonlinear classifications in single-layer networks).

As to your efficiency problem: I am not a GPU expert (yet...), but you may have a problem if you call a single kernel in a CPU-based loop, as shown in your code example. What we do is write a simple kernel similar to your approach, but we then define the output domain for the complete network and call the run routine without a loop on the CPU side. The kernel processor then distributes the kernel code across all the available thread processors, running several threads on each processor, and takes care of the loop over the predefined data domain.
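Roughly like this, in a toy Brook+ sketch (invented names and sizes, not our real kernels): the streams cover the whole network, the kernel is written per element, and it is called exactly once; the runtime loops over the output domain.

// Toy sketch of the "one call over the whole domain" pattern (names invented).
kernel void update(float a<>, float b<>, out float c<>)
{
    c = a + b;                          // stands in for the real per-neuron computation
}

int main(void)
{
    float a<1024>;                      // one element per neuron (1024 just as an example size)
    float b<1024>;
    float c<1024>;
    float ha[1024], hb[1024], hc[1024];

    /* ... fill ha and hb on the CPU ... */

    streamRead(a, ha);                  // transfer host data to the GPU once
    streamRead(b, hb);
    update(a, b, c);                    // ONE kernel call; the runtime covers all 1024 elements
    streamWrite(c, hc);                 // read the result back once
    return 0;
}

The point is that the CPU issues one call and one pair of transfers per pass, instead of one call per neuron.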

I think you lose most of your computing time for setup (data transfer and call), while the kernel processing time itself is probably negligible in comparison.

For your type of training you might have to redefine your data setup to make it fit into the stream concept of the GPU.

Hope this is helpful


Originally posted by: geigerh We have approx. 800,000 neurons with something like 10 million connections. Our training cycles are much shorter than in your case (approx. 50 iterations for 10 classes of 20 samples each), since we do not use backpropagation (it is not necessary, because nonlinear biological neurons can learn nonlinear classifications in single-layer networks).

As to your efficiency problem: I am not a GPU expert (yet...), but you may have a problem if you call a single kernel in a CPU-based loop, as shown in your code example. What we do is write a simple kernel similar to your approach, but we then define the output domain for the complete network and call the run routine without a loop on the CPU side. The kernel processor then distributes the kernel code across all the available thread processors, running several threads on each processor, and takes care of the loop over the predefined data domain.

I think you lose most of your computing time for setup (data transfer and call), while the kernel processing time itself is probably negligible in comparison.

For your type of training you might have to redefine your data setup to make it fit into the stream concept of the GPU.

Hope this is helpful

Reading your post, my code is really similar to yours. It is not backpropagation; it is fuzzy-neuro learning vector quantization. Maybe not nonlinear, but nonlinear always comes down to a linear solution too, right? XD (Just kidding)

Perhaps what you mean by "define the output domain for the complete network" is using something like the LDS (local data share), which I haven't understood yet.

I have to loop from one sample to the next, calculating what I write as myu from another LDS that I call vec_ref.

My code is like this:

start-of-loop -> read-another-fuzzy-input -> calculate-myu-by-comparing-input-with-vec-ref -> find-winner-myu -> check-if-it-the-same-as-expected-output -> update-vec-ref -> back-to-start-of-loop
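In Brook+ host terms that flow looks roughly like this (only a sketch; compute_myu, find_winner, update_vec_ref and copy_ref are placeholders for my real steps, which I still have to write):

// Rough host-side sketch of the training flow above; all kernel names are
// placeholders, not working code.
#define NDIM     16        // input dimension
#define NSAMPLE  96        // training samples per epoch
#define NEPOCH   1000
#define NCLUSTER 4         // number of reference vectors (made-up value)

int main(void)
{
    int s, epoch;
    float host_inputs[NSAMPLE][NDIM];              // fuzzy training inputs, filled elsewhere

    float input<NDIM>;                             // current sample on the GPU
    float vec_ref<NCLUSTER, NDIM>;                 // reference vectors, kept resident on the GPU
    float vec_ref_new<NCLUSTER, NDIM>;             // updated copy (streams can't be written in place)
    float myu<NCLUSTER>;                           // one membership value per cluster
    float winner<1>;                               // winning myu value

    for (epoch = 0; epoch < NEPOCH; epoch++) {
        for (s = 0; s < NSAMPLE; s++) {
            streamRead(input, host_inputs[s]);                  // read-another-fuzzy-input
            compute_myu(input, vec_ref, myu);                   // calculate-myu for all clusters at once
            find_winner(myu, winner);                           // find-winner-myu (a reduction)
            // check-if-it-the-same-as-expected-output would go here, or be folded into the update
            update_vec_ref(input, winner, vec_ref, vec_ref_new); // update-vec-ref for the winner row
            copy_ref(vec_ref_new, vec_ref);                     // copy the update back for the next sample
        }
    }
    return 0;
}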

If only I could:

1. put myu and vec_ref into the LDS

2. read vec_ref at the associated dimension

3. do a double reduction inside the kernel (see the sketch after this list):

- first get the myu with the minimum value in a cluster (in a row)

- then, from those results, get the myu with the maximum value

(those two steps together are what I call finding the winner; but how? by reducing the LDS?)

4. update vec_ref in the winner row only, based on the myu properties
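For step 3 I imagine something like the following might already work with ordinary Brook+ reduce kernels, without touching the LDS at all (just a guess on my side; stream shapes and names are illustrative):

// Guess at the double reduction with plain Brook+ reduce kernels (illustrative only).
// Stage 1: minimum within each row of a <NCLUSTER, NDIM> myu stream -> <NCLUSTER, 1>.
// Stage 2: maximum over those per-cluster values -> <1, 1>.
reduce void row_min(float m<>, reduce float rmin<>)
{
    if (m < rmin) rmin = m;
}

reduce void overall_max(float m<>, reduce float rmax<>)
{
    if (m > rmax) rmax = m;
}

That would give the winning myu value but not its row, so finding the winner row for step 4 would still need a small compare kernel (or reading the NCLUSTER per-cluster values back and scanning them on the CPU).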

One big question:

Is there anybody who could teach me about the LDS, InstanceInGroup(), and things like that?
