Dear all,

I am new to this forum but not to GPGPU. I have moderate experience with CUDA and the G80 architecture, but none with ATI cards. I understand that the learning curve for ATI Stream is somewhat steeper than for CUDA. I am turning to ATI cards because I have a fast CPU implementation (the best single-threaded C implementation I could come up with), but it is still very slow because my dataset is big. There are two things I can do.

The first possibility is to do some bit bashing in SSE2/SSE3 and use multiple threads, but this takes a lot of development time. I am familiar with SSE2/SSE3.

Or I could turn to CUDA or ATI Stream and use a parallel architecture. The current generation of NVIDIA boards is only marginally faster at double-precision code than a fast quad-core CPU. The new generation of ATI cards is really fast at double-precision code, with up to 600 GFLOPS of double-precision performance, which is 10 times faster than a fast CPU.

Is it easy to implement a hidden Markov model in ATI Stream?

What kind of speedups can one expect?

Thanks in advance!