Archives Discussions

remco · ‎01-04-2010

Dear all,

I am new to this forum but not to GPGPU. I have moderate experience with cuda and the G80 architecture. But I have no experience with ati cards. I understood that the learning curve for ati stream is somewhat steaper compared to cuda. Now I turned to ati cards because I have a fast cpu implementation (the best single threaded c language implementation I could come up with) but it is still very slow because my dataset is big. Now there are two things I can do.

The first possibility is to do some bit bashing in sse2/sse3 and use multiple threads but this takes a lot of development time. I am familiar with sse2/sse3.

Or I could turn to cuda or ati stream and use a parallel architecture. The current generation nvidia boards are only marginally faster with double precision code compared to a fast quad core CPU. The new generation ati cards are really fast with double precision code up to 600 GFlops dp performance which is 10 times faster compared to a fast cpu.

Is it easy to implement a hidden markov model in ati stream?

What kind of speed ups can one expect?

Thanks in advance!

frankas · ‎01-06-2010

Originally posted by: remco

Is it easy to implement a hidden markov model in ati stream?

What kind of speed ups can one expect?

Thanks in advance!

In general, using Brook+ or OpenCL is no harder that writing C code, once you have your developer enviornment up and running.

I haven't done double calculations, but a general observation is that you can often get more than 50% of theoretical speeds, without using exotic optimizations, if your task is highly paralellizable and with uniform code paths.

if you posted a link to C / C++ code similar to what you want to achieve, we may be able to tell you which performance issues you need to take into concideration.

remco · ‎01-07-2010

Dear frankas,

Thank you for your reply. Wow 50% of theoretical speed without using exotic optimizations is really good. Hidden markov models are paralellizable over the states and over the number of independent observations. (At least for the forward, backward and viterbi algorithms which are dynamic programming examples) The computation of the emission probabilities is completely paralel. And the reestimation of the matrices are examples of matrix reduction. Sum over one of the dimensions of a given matrix but not always the ideal direction.

I already found two articles of people who implemented hidden markov models on the GPU. Both report dramatic speed ups over CPU implementations. (One of them unrealistic there must have been something wrong with the CPU implementation)

http://www.graphics.stanford.edu/papers/clawhmmer/hmmer.pdf

http://www.liuchuan.org/pub/cuHMM.pdf

Oke the code that I would like to speed up is divided into 5 functions namely:

hmmemissionprobabilitiesgmm //Computation of emission probabilities using gaussian mixture models

hmmstatetransitionprobabilities //Computation of state transition probabilities gamma and epsilon

hmmstatepriormatrix //Computation of state prior matrix

hmmstatetransitionmatrix //computation of state transition matrix

hmmstateemissionmatrixgmm //computation of new gaussian mixture model

The code:

http://blackboard.tudelft.nl/webapps/cmsmain/webui/_xy-2061251_1-tid_Ir5s4JuD //hmmemissionprobabilitiesgmm

http://blackboard.tudelft.nl/webapps/cmsmain/webui/_xy-2061252_1-tid_4jW7cjat //hmmstateemissionmatrixgmm

http://blackboard.tudelft.nl/webapps/cmsmain/webui/_xy-2061256_1-tid_JwuoXS1P //hmmstatepriormatrix

http://blackboard.tudelft.nl/webapps/cmsmain/webui/_xy-2061253_1-tid_W1kscrDw //hmmstatetransitionmatrix

http://blackboard.tudelft.nl/webapps/cmsmain/webui/_xy-2061254_1-tid_QwyALa8M //hmmstatetransitionprobabilities

frankas · ‎01-09-2010

Hi, sorry for the delay, I didn't have time to read all the code you posted, but I noticed at some points there were inversion of matrices, but the size was not specified. Inversion of large matrices may be a challenge to do efficiently.

You may do well to google for GPU + matrices + invert

Here is a sample thread on this topic:

http://ubuntuforums.org/showthread.php?t=1217446

(With a link to a paper on the subject at the very bottom)

remco · ‎01-10-2010

Hi Frankas,

Like you already mentioned matrix inversions can be time consuming if the matrices are large and an inversion is often required. However in my implementation the matrix inversion is only done once per covariance matrix. And these covariance matrices never become larger than say 32 by 32 elements. But this is indeed something one could not see from the code. I do not use that many states but the number of training samples is really big and the length of each training samples is in the order of 200. The dimensionality of the observations is in the order of 12 but can be smaller or larger. The bits that take a lot of time are the per element evaluation of a multi dimensional normal distribution (inverse and determinant of the covariance matrices are only computed once but evaluating the exponential function milions of times takes a lot of time) If the number of states grow the complexity of the forward and backward algorithms grows with a power of two. These are the things that take up a lot of time.

frankas · ‎01-10-2010

You may have figures that I haven't done this kind of calculations before, but I will still try to help. If it all boils down to lots and lots of multiplications and additions, then GPU will be the right tool for the job. But you may have to cut and dice your problem a bit differently.

200 data sets may be a large job, but it isn't a big dataset as such for say a HD 5850 or 5850 type GPU. To keep the 5850 occupied, you need at least 288 'threads' minimum, but it is easier with 4 times that amount. Also if you want to keep your computer responsive during the calculations, it is best to split the work load up into smaller batches that can be executed in a fraction of a second, in fact single computations taking over 10 seconds will often be aborted before completion.

So the challenge might be to split the job into 1000 parallel threads, each running for a short time, and producing a intermediate output (a stream), and then breaking the task down to several such steps, each operating on part of the input and/or the intermediate streams, the finally producing the output (result) you desire. If your dataset doesn't fit in memory, you also need to consider IO requirements.

Also keep in mind that tasks that can't easily be done in parallel may just as well be done in the CPU allowing you to keep your existing code. Moving data between GPU / CPU is generally very fast.

remco · ‎01-14-2010

Hi Frankas,

Sorry for the delayed response but I was away for a couple of days. But thank you for your help up until so far. I thought of how I would implement this on a CUDA enabled card because I am familiar with these cards. Hidden Markov models deal with sequences of observations. Now the elements of a sequence often can not be processed in parallel because the computation of a given element depends on the previous element. But for hidden Markov model training one often has several several sequences of observations. And these sequences can always be processed in parallel. For my problem I have about 700 independent sequences that can be processed in parallel. I thought for a CUDA enabled card the best option is to process 16 independent sequences per thread block. This always given coalesced memory transfers. And a very nice granularity for the number of threads per plock. N*16 were N is the number of states. (Already full occupance when the number of states equals eight)

I also did an experiment with the sse2 instruction set but I was really disapointed with the speedup I got over my c implementation. The algorithm that I implemented in sse2 ran in 119 ms over 700 observations. and the c implementation did it in 124 ms. I concluded that the implementation is heavily memory bound. Otherwise one would expect at least a two times speed up. (two operands are processed in parallel and I did some serious loop unrolling with far more efficient adressing)

Were does dis 288 'threads' minimum come from? 64 threads per wokgroup for each simd processor. But 288 is not a multiple of 64?

frankas · ‎01-18-2010

Originally posted by: remco Hi Frankas,

Were does dis 288 'threads' minimum come from? 64 threads per wokgroup for each simd processor. But 288 is not a multiple of 64?

I use the 5850 as an example, it has 1440 "stream processing units" in marketing speak. They are organised in 18x16x5 - 18 SIMD arrays of 16 SPUs (each having 5 processing units). Usually 64 threads will be running on a SIMD array at a time, but this can be adjusted by changing the threading group size (haven't tried this myself). In pratice it means that if a few threads (ex 64) takes k time, then you can schedule 1152 such threads, and they will also complete in k time. But if you schedule 1152 + 64 threads - running time will be 2k.

If your aplpication is memory bound, you probably need to have the group size at 64, as this will hide memory latency. And you should then try to keep your thread count as close as you can to 1152 to ensure full utilization.

Archives Discussions

Hidden Markov Models ATI Stream