Yes, nou, the hardware can do it. If you run DirectX code you can measure it happening. OpenCL does not do it currently; there are technical reasons for that, to do with the way OpenCL buffers map to DX views.
You can't really do task parallelism on the hardware at the moment. That is unfortunate, but given the way queues are managed and the mechanisms of the PCI bus, you'd be lucky to produce efficient task-parallel code for the hardware right now anyway. If task parallelism is important to you, I would suggest an uberkernel approach: launch one kernel and branch into a particular task inside it.
Remember, of course, that a hardware thread is a wavefront, not a work-item. If you want to write task-parallel code that runs efficiently, make your tasks 64 work-items wide, so that each task fills a whole wavefront and the task branch does not diverge inside it.
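
To make the uberkernel idea a bit more concrete, here is a rough sketch in plain C that emulates it on the CPU. It is only an illustration, not tested GPU code: the task types (TASK_SCALE, TASK_OFFSET), the task struct and the function names are all made up for the example. In real OpenCL C, uberkernel_item would be the body of a __kernel function and gid would come from get_global_id(0). The sketch also assumes a 1D launch with a work-group size that is a multiple of 64, so that every group of 64 consecutive ids lands in one wavefront.

```c
#include <assert.h>
#include <stddef.h>

/* Work-items per wavefront on the hardware discussed in this thread. */
#define WAVEFRONT_SIZE 64

/* Hypothetical task types, for illustration only. */
enum task_kind { TASK_SCALE = 0, TASK_OFFSET = 1 };

/* One descriptor per task; each task owns a 64-element slice of data. */
struct task {
    enum task_kind kind;
    float param;
};

/* Body of the uberkernel for a single work-item.
 * Assumption: in OpenCL C this would be a __kernel function and gid
 * would be obtained from get_global_id(0). */
void uberkernel_item(size_t gid, const struct task *tasks, float *data)
{
    /* All 64 work-items of a wavefront compute the same task_id
     * (assuming a 1D launch and a work-group size that is a multiple of 64). */
    size_t task_id = gid / WAVEFRONT_SIZE;
    const struct task *t = &tasks[task_id];

    /* Branch into the task. The condition is uniform across the
     * wavefront, so the whole wavefront takes the same path. */
    switch (t->kind) {
    case TASK_SCALE:
        data[gid] *= t->param;
        break;
    case TASK_OFFSET:
        data[gid] += t->param;
        break;
    }
}

/* CPU stand-in for the kernel launch: calls the body once per work-item,
 * with num_tasks * 64 work-items in total. */
void run_uberkernel(const struct task *tasks, size_t num_tasks, float *data)
{
    assert(tasks != NULL && data != NULL);
    for (size_t gid = 0; gid < num_tasks * WAVEFRONT_SIZE; ++gid)
        uberkernel_item(gid, tasks, data);
}
```

You would fill an array of task descriptors, allocate 64 elements of data per task, and call run_uberkernel(tasks, num_tasks, data). The point of the design is that the branch is decided per wavefront rather than per work-item, so different wavefronts can run different tasks without any wavefront executing both sides of the switch.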