The upcoming Fusion processors raise some questions from the OpenCL point of view. Maybe it's too early to start worrying about them, but perhaps this is the chance developers have to express expectations at an early stage.
1 - Will the GPU part have direct access to system RAM? Currently, data transfer to an OpenCL device is known to be very time-consuming.
2 - Will OpenCL see Fusion's GPU and CPU as two independent devices? I would think so, because it may be hard to tell which code works best on the CPU and which on the GPU.
3 - Will the GPU and CPU share the L1/L2 caches? Again, we may want to manipulate buffers with the GPU and then read them back directly in serial code running on the CPU.
4 - If a Fusion processor has, say, 4 CPU cores and 5 GPU work-groups, will it be possible to use one core as a scheduler and use device fission to drive each remaining core/work-group separately? I imagine that in the future multiple applications will use the GPU and CPU at the same time, and none of them should consume all resources.
I asked some similar questions some time back, but not many (zero) relevant answers arrived. Let me explain my opinion:
1 - Most likely it will do just that. Upon creating buffers, the runtime will just pass a pointer to a RAM location, because there will be no side-port memory for APUs. That is why I suggested, about half a year ago, that the GPU inside an APU should be exposed as some other device type, not a GPU (OpenCL defines the device type CL_DEVICE_TYPE_ACCELERATOR; maybe this would be the place to use it). These devices differ from regular GPUs: there is no PCIe bus in between, and that allows major optimizations and lets the programmer do much more magic.
2 - Most likely. Only Intel is trying to create C++ compilers that recognize parallelizable code inside a program and recompile it so that the parallel part is automatically executed on the integrated GPU. To OpenCL, however (even the Intel OpenCL SDK), they will most likely be two devices, because they have to be programmed very differently.
3 - Most likely not. The L1 cache is definitely CPU-only, and my guess is that the GPU will not be able to allocate L2 cache for itself. The GPU will have its own cache of the usual kilobyte order of size. Sharing caches the way you would like to is most likely impossible. With OpenCL you access RAM through buffers, and even if you create a buffer with the CL_MEM_USE_HOST_PTR flag, I don't believe the cache logic will be able to recognize that the memory it is reading is the same as the contents of the buffer. It is not impossible, it might even work (but only with CL_MEM_USE_HOST_PTR), but there is such a thing as too good to be true.
4 - I do not quite understand what you mean here. APUs won't have so many multiprocessors that a single application couldn't make proper use of all the processing elements. A program works best if each use of the GPU occupies all of its processors and multitasking is left to the thread scheduler. Most likely it can do that job better than I could.