I asked some similar questions a while back, but no relevant answers arrived. Let me explain my opinion anyway:
1 - Most likely it will do just that. When you create a buffer, it will just pass a pointer to a RAM location, because APUs will have no side-port memory. That is why I suggested about half a year ago that the GPU inside an APU should be exposed as some other kind of device, not as a GPU (OpenCL already defines the device type CL_DEVICE_TYPE_ACCELERATOR, maybe this would be the place to use it). These devices differ from regular GPUs because there is no PCIe bus between them and host memory, and this allows major optimizations, or lets the programmer do a lot of magic.
2 - Most likely. Only Intel is trying to create C++ compilers that recognize parallelizable code inside a program and recompile it so that the parallel part is automatically executed on the internal GPU. To OpenCL, however (even with the Intel OpenCL SDK), it will most likely be 2 devices, because they have to be programmed very differently.
3 - Most likely not. L1 cache is definitely for the CPU only, and my guess is that the GPU will not be able to allocate L2 cache for itself. The GPU will have its own cache of the usual size, on the order of kB. Sharing caches in the way you would like to use them is almost certainly not possible. With OpenCL you access RAM through buffers, and even if you create the buffer with the CL_MEM_USE_HOST_PTR flag, I believe the cache controller won't be able to recognize that the memory it is trying to read is the same as the contents of the buffer. It is not strictly impossible, and it might even turn out to be true (but only when using CL_MEM_USE_HOST_PTR), but there is such a thing as too good to be true.
4 - I do not quite understand what you want to say here. APUs won't have so many multiprocessors that an application couldn't make proper use of all of the processing elements. A program is written best if a single use of the GPU occupies all of its processors and the handling of multitasking is left to the thread scheduler. Most likely it can do the job better than I would be able to.
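
To make points 1 to 3 a bit more concrete, here is a rough sketch of how you could check what the runtime actually reports. It assumes pyopencl and an OpenCL runtime are installed; the helper names describe_device, list_devices and make_host_ptr_buffer are made up for this post. The device-type bits are the values from the OpenCL headers, and host_unified_memory is the CL_DEVICE_HOST_UNIFIED_MEMORY query from OpenCL 1.1. Whether a CL_MEM_USE_HOST_PTR buffer really ends up being zero-copy is up to the implementation, so treat this as something to experiment with, not as a guarantee.

```python
# Device-type bits as defined in the OpenCL headers (CL/cl.h).
CL_DEVICE_TYPE_CPU = 1 << 1
CL_DEVICE_TYPE_GPU = 1 << 2
CL_DEVICE_TYPE_ACCELERATOR = 1 << 3


def describe_device(type_bits, host_unified_memory):
    """Build a short label from a device's type bitfield and memory flag."""
    names = []
    if type_bits & CL_DEVICE_TYPE_CPU:
        names.append("CPU")
    if type_bits & CL_DEVICE_TYPE_GPU:
        names.append("GPU")
    if type_bits & CL_DEVICE_TYPE_ACCELERATOR:
        names.append("ACCELERATOR")
    kind = "|".join(names) or "OTHER"
    memory = "shares host RAM" if host_unified_memory else "separate device memory"
    return f"{kind} ({memory})"


def list_devices():
    """Return (name, label) for every OpenCL device, or [] if none can be queried."""
    try:
        import pyopencl as cl  # assumption: pyopencl is installed
    except ImportError:
        return []
    found = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                label = describe_device(device.type, device.host_unified_memory)
                found.append((device.name, label))
    except cl.Error:
        # No platform or driver available: report nothing instead of crashing.
        return []
    return found


def make_host_ptr_buffer(ctx, host_array):
    """Wrap existing host memory (e.g. a NumPy array) in a USE_HOST_PTR buffer.

    The runtime is allowed, but not required, to avoid copying the data.
    """
    import pyopencl as cl
    mf = cl.mem_flags
    return cl.Buffer(ctx, mf.READ_WRITE | mf.USE_HOST_PTR, hostbuf=host_array)
```

Calling list_devices() on an APU system should show whether the CPU and the integrated GPU come back as two separate entries, and whether the GPU entry reports that it shares host RAM. If it does, make_host_ptr_buffer is the obvious thing to try next, with timing measurements to see if a copy still happens.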