Archives Discussions

h_l · ‎01-22-2016

Hi,

my questions are related to the caches of A10-78[5|7]0K APUs.

1) Can the iGPU directly access data in the L2 cache of a CPU module?
2) Are the two L2 caches shared between the two CPU modules?

Thanks!

----

Regarding 1)

The slides from HOT CHIPS [1] are not very clear, and the discussion in [2] says that the iGPU's L2 is 512KB in size, whereas the CPU has 2x2MB. This would imply that either the GPU and CPU don't share the L2 cache, or the CPU L2 acts as the "L3" of the GPU.

In the work of He et al. [3], the CPU is used to prefetch data into L2, which results in ~30% fewer memory stalls on the GPU.

[1] http://www.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-11-day1-epub/HC26.11-2-Mobile-Proce...

[2] What is the Size of GPU L2 cache for A10-7850k APU?

[3] http://www.vldb.org/pvldb/vol8/p329-he.pdf

bridgman · ‎01-23-2016

My understanding was that the CPU and GPU L2 caches were generally independent, although the GPU can be configured so that accesses to system memory will snoop the CPU caches to ensure consistency, so Figure 1 in the pdf does not match my understanding of how the hardware is configured.
Will check when I get back in the office.

bsp2020 · ‎01-24-2016

That's my understanding too. I was gonna write that the whole idea of using the CPU to prefetch data was not possible on the current AMD architecture and that they must have used a simulation for their paper... Then I skimmed through it, and they claim that they built and tested their stuff on existing AMD APUs...

@h_l,

If you are interested in a scalar processor prefetching data for the GPU, I find this paper (http://people.engr.ncsu.edu/hzhou/ipdps14.pdf ) very interesting. I'm hoping that the next GCN architecture will have the feature described in the paper.

h_l · ‎01-26-2016

Many thanks for your responses.

So, if it is practically impossible for the CPU to prefetch data* for the iGPU, an interesting question arises: Where does the 30% improvement in [3] come from? Maybe from fewer TLB misses? ...

* which we don't yet know with certainty

@bsp2020: Interesting read. Separation of SIMT and scalar instructions would be a really powerful extension!

jlgreathouse · ‎02-19-2016

I have a hypothesis that may help answer your question about performance gains in light of the fact that the iGPU will not pull its data from the CPUs' caches.

Their prefetching scheme (where one of the CPUs spends its time prefetching data into a work-ahead set, and others spend time executing queries and/or decompressing data) is essentially making a small scratchpad within memory.

This yields performance benefits, which is the whole point of their paper. First off, the prefetching done on one of the CPUs increases the performance of the work done on the other 3 CPUs because that data now lies in the CPUs' caches (one execution/decompression kernel shares its L2 cache with the prefetcher core; the other core pair's L2 cache is coherent, and accesses to it will pull data into the local cache as well). This performance gain (shown in e.g. the first bar of Fig. 5) is the classic use for work-ahead sets described by Zhou et al.

The reason that this "prefetching" helps the GPU (the other bars in Fig. 5) is likely due to memory coalescing and GPU cache benefits.

Their prefetching process (as described in 3.2.1) brings in data from a number of disparate memory locations and puts them into a work-ahead-set (WAS) memory region (scratchpad) that is between 128KB and 1MB in size. What this means is that the addresses that the GPU work items are accessing are much closer together. This significantly increases the likelihood of multiple accesses from a wavefront being coalesced, thus reducing memory bandwidth requirements and decreasing latency.
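To make the coalescing argument concrete, here is a minimal sketch that counts how many distinct cache lines one wavefront touches when its addresses are scattered versus packed into a contiguous region. The 64-byte line size, the 4-byte element size, the 1 GiB scatter range, and all function names are assumptions chosen for illustration; none of them come from the paper.

```python
import random

CACHE_LINE_BYTES = 64   # assumed cache line size
WAVEFRONT_SIZE = 64     # work-items per GCN wavefront
ELEMENT_BYTES = 4       # assumed size of each value read by a work-item


def lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines covered by one wavefront's accesses."""
    return len({addr // line_bytes for addr in addresses})


def scattered_addresses(n, span_bytes, seed=0):
    """n element-aligned addresses drawn at random from a large span."""
    rng = random.Random(seed)
    slots = span_bytes // ELEMENT_BYTES
    return [rng.randrange(slots) * ELEMENT_BYTES for _ in range(n)]


def packed_addresses(n, base=0):
    """n consecutive elements, as in a contiguous scratchpad region."""
    return [base + i * ELEMENT_BYTES for i in range(n)]


scattered = scattered_addresses(WAVEFRONT_SIZE, span_bytes=1 << 30)
packed = packed_addresses(WAVEFRONT_SIZE)

print("lines touched, scattered:", lines_touched(scattered))
print("lines touched, packed:   ", lines_touched(packed))
```

With these assumptions the packed wavefront covers 64 x 4 = 256 bytes, i.e. 4 cache lines, while the scattered one needs close to one line per work-item. Fewer lines per wavefront means fewer memory transactions for the same amount of useful data.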

Note that Section 5.3 mentions that the GPU's cache miss percentage goes down -- this is also perfectly understandable if they're moving all (or a majority) of their queries into a small region of memory. This will increase the hit rate in the GPU's cache because the region it is accessing is much easier to cache in the 512KB of GPU L2 cache.
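The hit-rate effect can be sketched the same way with a toy cache model. The code below simulates a fully associative LRU cache of 512KB with 64-byte lines (both the associativity and the line size are simplifying assumptions, not the real GPU L2 organization) and compares random accesses confined to a 256KB region, a hypothetical size within the 128KB to 1MB range mentioned above, against random accesses spread over 256MB.

```python
from collections import OrderedDict
import random

LINE_BYTES = 64               # assumed cache line size
GPU_L2_BYTES = 512 * 1024     # GPU L2 size quoted in the thread


def hit_rate(addresses, cache_bytes=GPU_L2_BYTES, line_bytes=LINE_BYTES):
    """Hit rate of a fully associative LRU cache (a deliberate simplification)."""
    capacity = cache_bytes // line_bytes
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used line
    return hits / len(addresses)


def random_accesses(n, region_bytes, seed=0):
    """n byte addresses drawn uniformly from [0, region_bytes)."""
    rng = random.Random(seed)
    return [rng.randrange(region_bytes) for _ in range(n)]


N = 200_000
small_region = hit_rate(random_accesses(N, 256 * 1024))          # scratchpad-sized
large_region = hit_rate(random_accesses(N, 256 * 1024 * 1024))   # scattered data

print(f"hit rate, 256KB region: {small_region:.3f}")
print(f"hit rate, 256MB region: {large_region:.3f}")
```

Once the small region has been touched, every later access hits, so its hit rate approaches 1. The large region barely reuses any cached line. A region larger than 512KB would sit somewhere in between.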

These benefits thus do not come from data being in the CPUs' caches, but rather from making a scratchpad that helps the GPU memory accesses be more efficient. Thoughts?

h_l · ‎04-15-2016

This sounds absolutely reasonable.

In Figure 15, we can see that the memory stalls mostly happen on the CPU core that does the prefetching. However, a single core is not enough to fully saturate the memory bandwidth (it takes 3 CPU cores on my Kaveri APU). Maybe they "substituted" memory stall time with idle time. This might explain the shockingly high query response times. With our database system, Q9 runs in less than 150ms and Q14 in 40ms (SF1, single-threaded!). This implies that memory stalls are not the main issue here.

Nevertheless, this is off-topic. The main question about the cache hierarchy has been answered.

h_l · ‎02-18-2016

Hi,

are you back in the office? - Could you please confirm that the iGPU cannot read data from the CPU's L2 cache?

Many thanks in advance.

cgrant78 · ‎02-18-2016

Could you give some context to the question, if you don't mind? These are really good questions, btw. Please be aware that even though the memory on the system is shared by both the CPU and the iGPU, the specific (high-level) API used to access the GPU will most likely partition the memory. Even with its own memory controller, the iGPU would still take advantage of the main L2 cache, as any access generated for main memory will be cached in L2 in order to maintain coherency and simplify the APU design. However, it is confusing, as one of the diagrams shows that the iGPU memory controller can access memory directly without going through the NB. So I think the L2 is accessible; it just depends on which route the iGPU decides to take for the request.

bridgman · ‎02-18-2016

It's a bit more complicated, since CPU and GPU each have their own independent L1/L2 caches. AFAICS the only way that GPU could take advantage of CPU's L2 cache is if

(a) the GPU was configured to snoop the CPU cache (this should happen automatically when accessing shared memory via ATC/IOMMUv2, but is optional and off by default when accessing shared memory via GPUVM), and

(b) the GPU-to-CPU cache snoop logic runs before a physical memory access is initiated and, on a hit, no physical memory access happens (as opposed to having CPU L2 flush the recently written value back to memory before letting the GPU access proceed). This seems "not impossible" but I haven't yet found the right person to confirm it with.

That said, the cache snoop logic is for dealing with dirty data in the CPU cache, i.e. where the CPU has written to memory but the written data has not yet been written back to system memory. Don't have time to go through that paper again right now, but my recollection was that the idea was to have the CPU *read* data and get it into the cache that way... but in that case the data would not be marked as dirty in the CPU L2, and I don't *think* the snoop logic would pick it up.
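The reasoning above can be written down as a toy model. This is only a sketch of the hypothesis in this post, not a description of confirmed hardware behavior: it assumes that a GPU probe acts only on dirty CPU cache lines and that clean lines are ignored. All names are hypothetical, and whether a probe hit is served from the CPU cache or after a write-back (point (b)) is left open.

```python
from enum import Enum


class LineState(Enum):
    CLEAN = "clean"   # line was only read (e.g. prefetched) by the CPU
    DIRTY = "dirty"   # line was written by the CPU, not yet written back


def gpu_read_source(cpu_cache, line, snoop_enabled):
    """Where a GPU read is resolved in this toy model.

    Assumption (unconfirmed): the probe only reacts to dirty lines;
    clean lines in the CPU cache are ignored.
    """
    state = cpu_cache.get(line)
    if snoop_enabled and state is LineState.DIRTY:
        # Either supplied by the CPU cache or written back first.
        return "cpu-cache-probe"
    return "system-memory"


cpu_cache = {
    0x100: LineState.CLEAN,   # brought in by a CPU prefetch (read)
    0x200: LineState.DIRTY,   # written by the CPU
}

print(gpu_read_source(cpu_cache, 0x100, snoop_enabled=True))
print(gpu_read_source(cpu_cache, 0x200, snoop_enabled=True))
```

Under this assumption a line that the CPU merely prefetched stays clean, so the GPU read still goes to system memory and the prefetch gives the GPU no direct benefit.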

pblinzer · ‎02-19-2016

To reconfirm John's point, while the CPU and GPU L2 are separate, be aware that due to probes/snooping used in the coherency protocol these two caches are kind of coupled that way and provide some of the functional benefits of a shared cache without necessarily sharing the same cache structure or same block.
One of the main performance benefits of cache coherency is actually avoiding the unnecessary preemptive flushes that SW would otherwise induce to synchronize between different agents. Reducing flushes to only the data lines that need them increases cache utilization in a number of common use scenarios and is a factor in good performance on common workloads.

By the way, variants of the idea outlined in the paper have been published for quite a while now (decades actually) in research, first for multi-core and today for heterogeneous systems. While a cache-direct update has a lot of appeal for performance, it also requires a lot of SW awareness to be effective though.


Can the iGPU of an APU access the CPUs' L2-cache?