The best answer, in my opinion, is to measure it yourself. AFAIK there is no considerable overhead for managing a large number of threads. But you might get some performance gain by using coalesced global memory access, i.e. having neighbouring threads read neighbouring addresses. You should also make sure that your compute units are not starved of work.
Try it and you will see; that is the best way.