No, each work item in the workgroup must hit the SAME barrier. Even if they all hit the same number of barriers, that is not enough.
Further to nou's comment, just one workgroup of 256 work items is not likely to occupy the device efficiently. Any cache miss or barrier is likely to leave you with gaps in the instruction stream.
There isn't any significant difference between four groups of 64 and one of 256 unless you get benefits from reuse between the wavefronts, or (in the negative direction) the larger groups stop you from fully occupying the CU, which reduces the total number of wavefronts and the device's ability to cover for stalls.
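To put rough numbers on that, here is a minimal sketch of the occupancy arithmetic in plain C. The helper names are made up for illustration, and the limits used in the example further down (32 KB of LDS per CU, 24 wavefront slots per CU) are assumptions rather than figures for any particular device. Only the wavefront size of 64 comes from the discussion above.

```c
/* Hypothetical helpers for estimating how many wavefronts stay resident
 * on one compute unit (CU). All limits are passed in by the caller. */

/* Wavefronts needed for one workgroup; a partial wavefront still counts. */
unsigned wavefronts_per_group(unsigned group_size, unsigned wavefront_size)
{
    return (group_size + wavefront_size - 1) / wavefront_size;
}

/* Workgroups that fit on one CU at the same time, limited by the
 * wavefront slots and by the local memory (LDS) each group allocates. */
unsigned groups_per_cu(unsigned group_size, unsigned wavefront_size,
                       unsigned lds_per_group, unsigned lds_per_cu,
                       unsigned max_waves_per_cu)
{
    unsigned waves = wavefronts_per_group(group_size, wavefront_size);
    if (waves == 0)
        return 0; /* empty group: nothing to schedule */

    unsigned by_waves = max_waves_per_cu / waves;
    /* A group that uses no LDS is limited by wavefront slots only. */
    unsigned by_lds = lds_per_group ? lds_per_cu / lds_per_group : by_waves;

    return by_lds < by_waves ? by_lds : by_waves;
}

/* Total wavefronts resident on the CU for a given workgroup shape. */
unsigned resident_wavefronts(unsigned group_size, unsigned wavefront_size,
                             unsigned lds_per_group, unsigned lds_per_cu,
                             unsigned max_waves_per_cu)
{
    return groups_per_cu(group_size, wavefront_size, lds_per_group,
                         lds_per_cu, max_waves_per_cu)
           * wavefronts_per_group(group_size, wavefront_size);
}
```

With those assumed limits and no LDS use, `resident_wavefronts(64, 64, 0, 32768, 24)` and `resident_wavefronts(256, 64, 0, 32768, 24)` both come out at 24, so the group shape makes no difference. If each group of 64 allocates 6144 bytes of LDS and the group of 256 allocates four times that (24576 bytes), the results are 5 and 4 wavefronts: the larger group leaves part of the CU unused, which is the negative case described above.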