Can you post some code here (as a ZIP) that demonstrates the slow data transfer? I will try to reproduce it and suggest changes that might improve the transfer speed.
The problem seems to have been fixed by reducing the amount of private memory requested in a kernel. One specific kernel that uses large private memory arrays is compiled before any memory transfers are made from the host to the device, and reducing the size of these arrays fixed the slow host-to-device transfers, though I am not sure why. Does the OpenCL context allocate device private memory for a kernel when it is compiled, or at runtime? If allocation happens at compile time, this could have been the cause of my problems.
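For a rough sense of scale, here is a minimal sketch in plain C (no OpenCL calls) of the arithmetic behind a kernel's private memory use. It assumes that large private arrays do not fit in registers and are spilled to device memory, which is common on GPUs but implementation-specific. The function names and the example sizes are hypothetical, and the sketch says nothing about whether a driver reserves that memory at compile time or at launch.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of private memory one work-item needs for its private arrays.
 * elem_counts[i] is the length of private array i; elem_size is the size
 * in bytes of one element (assumed equal for all arrays in this sketch). */
uint64_t private_bytes_per_work_item(const size_t *elem_counts,
                                     size_t n_arrays,
                                     size_t elem_size)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n_arrays; ++i)
        total += (uint64_t)elem_counts[i] * elem_size;
    return total;
}

/* Backing storage needed if every work-item in flight spills its
 * private arrays to device memory (an assumption, not a guarantee). */
uint64_t spill_bytes_total(uint64_t per_item_bytes,
                           uint64_t work_items_in_flight)
{
    return per_item_bytes * work_items_in_flight;
}
```

For example, two private float arrays of 256 elements each come to 2048 bytes per work-item, and with 16384 work-items in flight that is 32 MiB of backing storage. The actual per-work-item figure for a built kernel can be queried with clGetKernelWorkGroupInfo and CL_KERNEL_PRIVATE_MEM_SIZE rather than counted by hand.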
Anyway, all works well now. To sum up:
1) Update to the newest drivers to avoid the compiler segmentation fault.
2) Beware of how much private memory your kernels are using!
Many thanks for your help.