If you stop and think about it, you'll realize nobody cares about the total amount of cache, only about the amount of cache per core. That is the size your working set has to fit into for best performance. You can always reconstruct the total by multiplying by the core count.
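To make the per-core view concrete, here is a minimal Python sketch. The function names and the example figures (16 KiB per core, 8 cores) are made up for illustration and don't come from any particular device. In OpenCL the inputs would come from `clGetDeviceInfo` with `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE` and `CL_DEVICE_MAX_COMPUTE_UNITS`, assuming the reported cache size really is a per-core figure.

```python
def total_cache_bytes(per_core_cache_bytes: int, core_count: int) -> int:
    """Reconstruct the aggregate cache size from the per-core figure."""
    return per_core_cache_bytes * core_count


def fits_in_core_cache(working_set_bytes: int, per_core_cache_bytes: int) -> bool:
    """True when one core's working set fits entirely in that core's own cache."""
    return working_set_bytes <= per_core_cache_bytes


# Hypothetical device: 16 KiB of cache per core, 8 cores (assumed values).
PER_CORE = 16 * 1024
CORES = 8

# The aggregate is easy to rebuild, but it is rarely what you tune against.
print(total_cache_bytes(PER_CORE, CORES))

# A 12 KiB working set fits in one core's cache.
print(fits_in_core_cache(12 * 1024, PER_CORE))

# A 64 KiB working set does not, even though the aggregate is larger than 64 KiB.
print(fits_in_core_cache(64 * 1024, PER_CORE))
```

The last check is the point: the aggregate size would suggest 64 KiB fits, but a single core never sees more than its own share.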
Besides, since there's no fast L1-to-L1 sharing, you have to think of each cache as an independent pool of memory.
We could argue at length about whether that should be the L1 size or the L2 size. Sure, the wording is a bit loose, but perhaps this is better left to implementors. Is L2 per core? Is it shared, and across how many cores? On GPUs, it's a per-memory-channel buffer.
For a GPU it's very likely the GCN L1 size. Again, that's 16 KiB per core (except that one GCN core is a fairly different thing from an x86 core). That's standard on all currently selling Radeons, 7000 series and up, as well as APUs such as yours, except for some low-end or mobile products.
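As a rough idea of what a 16 KiB budget means in practice, the sketch below computes the largest square tile that fits. The helper name `max_square_tile` is hypothetical, and it assumes the whole cache is usable for your data, which real caches (associativity, other traffic) won't quite give you, so treat the result as an upper bound.

```python
import math


def max_square_tile(cache_bytes: int, element_bytes: int, buffers: int = 1) -> int:
    """Largest N such that `buffers` tiles of N x N elements fit in cache_bytes.

    Assumes the entire cache is available to these tiles (optimistic).
    """
    elements_per_buffer = cache_bytes // (element_bytes * buffers)
    return math.isqrt(elements_per_buffer)


L1_BYTES = 16 * 1024   # per-core L1 size discussed above
FLOAT_BYTES = 4        # 32-bit float

# One float tile: 4096 elements fit.
print(max_square_tile(L1_BYTES, FLOAT_BYTES))

# Three float tiles at once, e.g. two inputs and one output.
print(max_square_tile(L1_BYTES, FLOAT_BYTES, buffers=3))
```

With one buffer the tile is 64 x 64 floats; with three buffers resident at once it drops to 36 x 36.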
While we're at it, you might want to read about local memory. You can think of it as a faster, more efficient cache that you manage manually (as opposed to a regular cache, which is automatic).
Thanks for your kind reply. I understand that fitting into the L1 cache should improve performance, and that keeping all the caches coherent inevitably introduces overhead. If your answer is right, then I guess it would be better for the OpenCL spec to clarify that the size should be that of the "dedicated-to-that-core" cache or the "performance-optimal" cache. Thanks!