An AMD compute unit (CU) supports four simultaneous wave-fronts of 64 work-items each. Each of these wave-fronts is executed on a processing element (PE) containing 16 ALUs. There are four PEs per CU, so at any one time a CU processes four wave-fronts, or 256 work-items.
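The arithmetic behind those figures can be sketched as follows. This is only an illustration: the constants are taken from the paragraph above, the function names are made up for the example, and the "passes" figure assumes each ALU handles one work-item per pass.

```python
# Figures quoted in the text above.
WAVEFRONT_SIZE = 64   # work-items per wave-front
ALUS_PER_PE = 16      # ALUs in one processing element
PES_PER_CU = 4        # processing elements in one compute unit


def wavefronts_per_work_group(work_group_size: int) -> int:
    """Wave-fronts needed to cover a work-group, rounded up."""
    if work_group_size <= 0:
        raise ValueError("work_group_size must be positive")
    return -(-work_group_size // WAVEFRONT_SIZE)  # ceiling division


def passes_per_wavefront() -> int:
    """Passes a 16-ALU PE needs for one 64-item wave-front
    (assumption: one work-item per ALU per pass)."""
    return WAVEFRONT_SIZE // ALUS_PER_PE


def work_items_in_flight_per_cu() -> int:
    """Work-items a CU processes at once: one wave-front per PE."""
    return PES_PER_CU * WAVEFRONT_SIZE


print(wavefronts_per_work_group(256))   # wave-fronts in a 256-item work-group
print(passes_per_wavefront())           # passes per wave-front on one PE
print(work_items_in_flight_per_cu())    # work-items in flight on one CU
```

A work-group of 256 work-items therefore splits into four wave-fronts, exactly one per PE.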
The limit of 256 work-items per work-group is imposed to give the scheduler a chance to schedule wave-fronts using SMT (simultaneous multithreading). In SMT, two independent threads (in the AMD GPU context, two independent wave-fronts) share the same computing resources, and the scheduler, knowing that they are independent, can schedule them independently of each other. The same idea is used in Intel processors that present eight virtual cores, where each physical core uses SMT (Hyper-Threading) to process two independent threads simultaneously.
Coming back to AMD GPUs: since work-groups execute independently of each other, a limit of 256 lets the scheduler schedule work-groups in SMT fashion. If a wave-front from one work-group is waiting, the scheduler can readily schedule a wave-front from another work-group, exploiting their independence. With a larger work-group size, it may be that only one work-group can run on a CU, and since independence between work-items of the same work-group is not guaranteed (they may, for example, synchronize at a barrier), SMT cannot be used, which hurts performance.
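A toy model makes the trade-off concrete. Everything here is an assumption for illustration: `HYPOTHETICAL_SLOTS` is an invented number of wave-fronts a CU can keep resident, not a hardware figure, and `resident_work_groups` is a made-up helper. Only the wave-front size of 64 comes from the text.

```python
WAVEFRONT_SIZE = 64  # work-items per wave-front (from the text)


def wavefronts_per_work_group(work_group_size: int) -> int:
    """Wave-fronts needed to cover a work-group, rounded up."""
    if work_group_size <= 0:
        raise ValueError("work_group_size must be positive")
    return -(-work_group_size // WAVEFRONT_SIZE)  # ceiling division


def resident_work_groups(work_group_size: int, cu_wavefront_slots: int) -> int:
    """Whole work-groups that fit on one CU, given a (hypothetical)
    number of wave-front slots the CU can keep resident."""
    return cu_wavefront_slots // wavefronts_per_work_group(work_group_size)


# Hypothetical capacity, chosen only to illustrate the argument.
HYPOTHETICAL_SLOTS = 8

for size in (64, 128, 256, 512):
    groups = resident_work_groups(size, HYPOTHETICAL_SLOTS)
    print(f"work-group size {size:4d}: {groups} independent work-group(s) per CU")
```

With eight hypothetical slots, a work-group size of 256 leaves room for two independent work-groups on the CU, so the scheduler has another group to switch to when one stalls. At 512 only one work-group fits, and the scheduler has no independent work left to interleave.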