Suppose I have one big 1-gigabyte array that all cores must regularly read from. I can lay it out so that no two cores ever read the same cache line at the same time. What is the ideal layout for the array to be read at full bandwidth? How many bytes should I keep 'in between' each core's reads from that 1 gigabyte of RAM?
What bandwidth can I achieve this way?
What cache-line size does the L2 use?
Can cores align themselves to the start of a cache line?
How many bytes must I add 'in between' to get the full read bandwidth from the RAM?
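To make the layout I have in mind concrete, here is a minimal C sketch of the access pattern, assuming a 64-byte cache line and 3200 cores (both are assumptions on my part; the real L2 line size is exactly one of my questions above). Each core starts on its own cache-line boundary and strides forward by the total number of cores, so no two cores ever touch the same line:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed values -- NOT confirmed for this chip. */
#define CACHE_LINE 64
#define NUM_CORES  3200

/* Each core reads the cache line at index core_id, then jumps
   forward by NUM_CORES lines, so all cores together sweep the
   whole array without ever sharing a line. */
static uint64_t read_my_slice(const uint8_t *array, size_t array_bytes,
                              int core_id)
{
    uint64_t sum = 0;
    size_t stride = (size_t)NUM_CORES * CACHE_LINE;

    for (size_t off = (size_t)core_id * CACHE_LINE;
         off + CACHE_LINE <= array_bytes;
         off += stride) {
        /* Touch the whole line: the first access pays the miss,
           the remaining words come from the already-fetched line. */
        for (size_t i = 0; i < CACHE_LINE; i += sizeof(uint64_t))
            sum += *(const uint64_t *)(array + off + i);
    }
    return sum;
}
```

So the 'bytes in between' consecutive reads of one core would be `(NUM_CORES - 1) * CACHE_LINE` -- if that is indeed the right way to think about it.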
How many cycles does it take to get something out of the RAM with a blocked read if all cores read at the same time, in the above manner (assuming a full L2 miss in all cases, of course)?
Secondly, caching. Regular reads will sometimes be lucky and hit in L2. How many cycles is the full read latency out of L2?
An "L1 cache" question, if there is one — I do not assume so for now, though I didn't know the chip had an L2 read cache either until I saw it in the diagram. If I do a memory read, again with all cores, is it possible that the L1 has the data by accident, or that some register file (I saw there are plenty of registers) hides the L2 latency? My current assumption is that on a read from RAM, only the L2 protects me from suffering the full latency. Is that correct?
Now an important question:
Let's suppose 3199 cores are lucky enough to get their data out of L2, because L2 still holds the cache line, while 1 core needs to fetch it from RAM. Do all those 3199 cores have to wait for core 3200, for half a century, or can there be small per-core differences in the execution speed of the instruction stream?
Summarizing: do all cores need to wait for core 3200 when they get their own data from L2, or can the cores continue their job (running the SAME thread, of course)?
Oh, by the way, I'm making big progress on paper — the model of how to get things done is already there — but the details above matter a lot for what type of code gets written.
Thanks in advance for answering any of the above questions,