I tried to search through the documentation for the latest GPUs, as OpenCL defines quite a lot. Basically everything here:
I have indexed and read the majority of it. I am very impressed by how much the documentation has improved compared with a few years ago (my last indexation dates from the start of 2007).
Yet a lot of questions remain, especially because almost everything that has to do with prime numbers is pretty much sequential. Making a parallel framework for that is not exactly easy.
At Nvidia, for example, there is now a program that can do the exponent squaring for the trial factorisation, but the hard work of sieving gets done on CPUs. For a 3200-core AMD card that would in practice mean the entire 8 GB/s link gets eaten, as it can process more than a small Nvidia card.
In short, the huge progress in GPU processing power simply no longer allows such simple forms of parallelism: an 8 GB/s link to a card that can handle teraflops is not really useful.
Time, therefore, to first design a model on paper and then, if it seems to work on paper, implement it. In my case we search for Wagstaff primes (Jeff Gilchrist, Paul Underwood, Tony Reix, Vincent Diepeveen; read in reverse order, the DRUG team).
A big problem with all this prime-number software is always that you combine a number of different things which could basically be different programs. But the bandwidth needed between those programs is so huge that there is of course a big need to be able to run different program counters. So I have a number of architecture questions.
a) Is it correct that the 5970 card has 2 GPUs (so for all 5970 models), and that each GPU's architecture has 16 SIMDs, with each SIMD consisting of 20 cores of 5 stream cores each?
16 * 20 * 5 = 1600
b) In the manual I see that the 5 stream cores forming 1 compute unit execute exactly the same instruction at the same time. Is that correct?
c) Now the most important question: is it possible for 2 or more programs to run on the GPU at the same time, executing different instructions at the same time? I understand that workgroups are a core concept of OpenCL. How many workgroups can execute DIFFERENT code at the same time, and how many stream cores ideally form a workgroup, if I want to keep the full 3200 cores busy?
I see the number 256 mentioned; for me that is not needed at all. As it looks now I would go for 4, but they HAVE to execute different code at the same time, as 1 workgroup generates the small primes that get fed into the factoring workgroup.
The speed at which small primes get generated (say up to 96 bits or so) is so high, and needs just a few cores, that it is impossible to decouple the problem in the manner the CUDA project does it: 3000+ factoring cores eat more primes than any PCI-E link can deliver, even if you could get them across in a prefetched manner.
So, in short: can workgroups that execute at the same time execute different code, so that the instruction stream reaching each workgroup is a different program?
I know that historically ATI could not do this, but I do not know the status now that there are so many cores that it is becoming hard to keep the same code executing on all cores at the same time, even though OpenCL basically demands it.
It would of course make no sense to execute a program on 32 or so cores while 3100+ cores sit idle, waiting until the next batch has been generated, and only then have all 3200 cores process that batch.
Meanwhile I cannot really buffer the batch in RAM very well, as the full RAM already gets used as a cache for other purposes and there is only a gigabyte or 2 per GPU.
It is good that the manual says you can access the entire DEVICE RAM. Very great! I really like that.
Thanks for answering,