Hello! For the past few years I have been writing CUDA code (crunching discrete logarithms to sieve prime candidates of the form k * 2^n +- c). That runs very fast on Nvidia, because it has an L1 data cache that I can write data into, which speeds up the algorithm quadratically. Now on eBay I see cheap S9150 cards with the GCN2 architecture, which deliver several Tflops in double precision. At first sight it seems impossible to use the L1 data cache for writing there, so we can forget about sieving on AMD. For FFT, however, GCN2 is interesting to study.
By FFT I actually mean a DWT (discrete weighted transform) used to search for large prime numbers. The Mersenne search is the well-known example, yet that is just one specific case, 2^n - 1, whereas I code for k * 2^n +- c, which is more generic and used more. At the moment there is no fast code for an FFT or a DWT on either Nvidia or AMD GPUs. The existing codes basically waste resources.
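To make the number form concrete, here is a tiny Python sketch. The function name `candidate` and the `sign` parameter are only for illustration; this is not part of my sieve code. It just shows that the Mersenne case is what you get with k = 1, c = 1 and the minus sign.

```python
def candidate(k: int, n: int, c: int, sign: int) -> int:
    """Return k * 2**n + sign * c, where sign is +1 or -1."""
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    return k * (1 << n) + sign * c


# Mersenne numbers are the special case k = 1, c = 1, sign = -1.
mersenne_7 = candidate(1, 7, 1, -1)   # 2^7 - 1
# A more general member of the family: 3 * 2^4 + 1.
general = candidate(3, 4, 1, +1)
```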
I find very little data on the GCN2 architecture. Am I correct in concluding that it doesn't have an L1 data cache that I can write to?
For FFT another great solution is the register file, provided it is large enough. However, I do not know how many different 'threads' have to share the register file for each of the 2816 OpenCL cores.
What is a typical number there, and how many registers can I use?
Of course I store doubles, so please keep that in mind in the calculations.
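This is the kind of calculation I have in mind, as a rough Python sketch. All hardware figures in it are my assumptions from what I could piece together about GCN, not confirmed numbers for the S9150: 64 work-items per wavefront, 4 SIMDs per compute unit, a 64 KB vector register file per SIMD, 32-bit registers (so a double takes two), at most 256 registers per work-item, at most 10 wavefronts per SIMD, and registers allocated in blocks of 4. Please correct whichever of these is wrong.

```python
# All hardware figures below are assumptions to be confirmed, not measured values.
VGPR_FILE_BYTES_PER_SIMD = 64 * 1024  # assumed: 64 KB vector register file per SIMD
SIMDS_PER_CU = 4                      # assumed: 4 SIMDs per compute unit
COMPUTE_UNITS = 44                    # 2816 cores / 64 lanes per compute unit
WAVEFRONT_SIZE = 64                   # assumed: work-items per wavefront
BYTES_PER_VGPR = 4                    # assumed: registers are 32 bits wide
VGPRS_PER_DOUBLE = 2                  # a 64-bit double occupies two 32-bit registers
MAX_VGPRS_PER_WORK_ITEM = 256         # assumed hard limit per work-item
MAX_WAVES_PER_SIMD = 10               # assumed scheduler limit
VGPR_ALLOC_GRANULARITY = 4            # assumed allocation block size


def vgprs_per_lane() -> int:
    """Registers available to one lane if a single wavefront owns the whole file."""
    return VGPR_FILE_BYTES_PER_SIMD // (WAVEFRONT_SIZE * BYTES_PER_VGPR)


def waves_per_simd(doubles_per_work_item: int, other_vgprs: int = 0) -> int:
    """Wavefronts that fit on one SIMD for a given register demand per work-item."""
    needed = doubles_per_work_item * VGPRS_PER_DOUBLE + other_vgprs
    g = VGPR_ALLOC_GRANULARITY
    allocated = max(g, -(-needed // g) * g)  # round up to the allocation block
    if allocated > MAX_VGPRS_PER_WORK_ITEM:
        raise ValueError("register demand exceeds the per-work-item limit")
    return min(MAX_WAVES_PER_SIMD, vgprs_per_lane() // allocated)


def doubles_in_registers_on_chip() -> int:
    """Total doubles the vector register files could hold across the whole GPU."""
    total_bytes = VGPR_FILE_BYTES_PER_SIMD * SIMDS_PER_CU * COMPUTE_UNITS
    return total_bytes // 8
```

Under these assumptions, `waves_per_simd(64)` would tell me how many wavefronts still fit when every work-item keeps 64 doubles in registers, and `doubles_in_registers_on_chip()` gives an upper bound on the size of a transform that could live entirely in registers.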