Well, you can use multiple devices. For example, CPUs support device fission, which can divide one CPU into multiple sub-devices.
I heard that Evergreen-family GPUs can run multiple kernels concurrently. Is there any truth to this? Is it in the hardware, or is it only a rumor?
Yes, nou, the hardware can do it. If you run DirectX code you can measure it happening. OpenCL does not do it currently; there are technical reasons for that, to do with the way OpenCL buffers map to DX views.
You can't really do task parallelism on the hardware at the moment. It is unfortunate, but given the way queues are managed and the mechanisms of the PCI bus, you'd be lucky to produce efficient task-parallel code for the hardware currently anyway. If task parallelism is important, I would suggest you take an uberkernel approach and branch into a particular task inside the kernel.
Remember, of course, that a hardware thread is a wavefront, not a work-item. If you want to write task-parallel code that runs efficiently, make your tasks 64 work-items wide.
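To make the uberkernel idea a bit more concrete, here is a rough sketch in plain C that emulates it on the host. It is not real OpenCL code and not the only way to do it: `uberkernel_item`, `run_uberkernel`, `task_of_wavefront` and the two tasks are made-up names for illustration. In an actual kernel, `global_id` would come from `get_global_id(0)`. The sketch also assumes a 1-D launch where every block of 64 consecutive work-items lands in one wavefront.

```c
#include <stddef.h>

#define WAVEFRONT_SIZE 64  /* task width suggested above */

/* Hypothetical tasks, only for illustration. */
enum task { TASK_SCALE = 0, TASK_OFFSET = 1 };

/* Body of the uberkernel for a single work-item.
   All 64 work-items of a wavefront read the same task id,
   so the branch does not diverge inside a wavefront. */
void uberkernel_item(size_t global_id, const int *task_of_wavefront, float *data)
{
    size_t wavefront = global_id / WAVEFRONT_SIZE;

    switch (task_of_wavefront[wavefront]) {
    case TASK_SCALE:
        data[global_id] *= 2.0f;
        break;
    case TASK_OFFSET:
        data[global_id] += 1.0f;
        break;
    default:
        break;  /* unknown task id: leave the element untouched */
    }
}

/* Host-side stand-in for the kernel launch: one call per work-item. */
void run_uberkernel(size_t global_size, const int *task_of_wavefront, float *data)
{
    for (size_t id = 0; id < global_size; ++id)
        uberkernel_item(id, task_of_wavefront, data);
}
```

With a global size of 128 and a task table of `{ TASK_SCALE, TASK_OFFSET }`, work-items 0 to 63 would run the scale task and work-items 64 to 127 the offset task, all from a single launch.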
I ran different kernels on the same device (HD 4870) by starting each kernel in a separate thread. It was faster than running the kernels in order, and it gave correct results (tested with Dijkstra's algorithm and some sorting algorithms).
But I don't know if it works correctly with other algorithms...
Possibly it would work for you to start the kernels in separate threads with OpenMP...