The kernel consists mostly of blocks like this one (see listing):

Power values are loaded into 16 float4 registers and compared against a threshold; then the array size is halved and the process is repeated.

ALU packing for this kernel is very low, only 45.5. Other kernels have >80.

What prevents these instructions from being packed into VLIW bundles?

    d0.xy=d0.xz+d0.yw;   d0.zw=d1.xz+d1.yw;
    d1.xy=d2.xz+d2.yw;   d1.zw=d3.xz+d3.yw;
    d2.xy=d4.xz+d4.yw;   d2.zw=d5.xz+d5.yw;
    d3.xy=d6.xz+d6.yw;   d3.zw=d7.xz+d7.yw;
    d4.xy=d8.xz+d8.yw;   d4.zw=d9.xz+d9.yw;
    d5.xy=d10.xz+d10.yw; d5.zw=d11.xz+d11.yw;
    d6.xy=d12.xz+d12.yw; d6.zw=d13.xz+d13.yw;
    d7.xy=d14.xz+d14.yw; d7.zw=d15.xz+d15.yw;
    if ( (d0.x>t.y)||(d0.y>t.y)||(d0.z>t.y)||(d0.w>t.y)||
         (d1.x>t.y)||(d1.y>t.y)||(d1.z>t.y)||(d1.w>t.y)||
         (d2.x>t.y)||(d2.y>t.y)||(d2.z>t.y)||(d2.w>t.y)||
         (d3.x>t.y)||(d3.y>t.y)||(d3.z>t.y)||(d3.w>t.y)||
         (d4.x>t.y)||(d4.y>t.y)||(d4.z>t.y)||(d4.w>t.y)||
         (d5.x>t.y)||(d5.y>t.y)||(d5.z>t.y)||(d5.w>t.y)||
         (d6.x>t.y)||(d6.y>t.y)||(d6.z>t.y)||(d6.w>t.y)||
         (d7.x>t.y)||(d7.y>t.y)||(d7.z>t.y)||(d7.w>t.y) ){
        was_pulse.y=1;
    }

It's difficult to say without the full code (I can't look at the ISA directly with only the code you've given)...

...however, generally speaking: reduce the number of CF clauses (the ISA doesn't allow packing across clauses) and reduce data dependencies (I have no idea how large the compiler's window is for this, though I imagine it's fairly large). Is it packing d2.xywz > t.yyyy into a vector operation, for example?
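
One way to act on that advice, offered only as an untested sketch: replace the chain of 32 short-circuit comparisons with a component-wise max reduction followed by a single comparison. The code below models this in plain C so the logic can be checked on a CPU. The `f4` struct and the names `max4`, `over_threshold_chain` and `over_threshold_tree` are hypothetical and not from the original kernel. It assumes the power values are never NaN. Whether this actually raises ALU packing depends on the compiler and has to be confirmed in the profiler.

```c
/* Hypothetical stand-in for OpenCL's float4; not part of the original kernel. */
typedef struct { float x, y, z, w; } f4;

/* Scalar max without branches on most targets (assumes no NaN inputs). */
static float maxf(float a, float b) { return a > b ? a : b; }

/* Component-wise max of two vectors: four independent operations. */
static f4 max4(f4 a, f4 b) {
    f4 r = { maxf(a.x, b.x), maxf(a.y, b.y), maxf(a.z, b.z), maxf(a.w, b.w) };
    return r;
}

/* Model of the original test: a long chain of short-circuit comparisons. */
static int over_threshold_chain(const f4 d[8], float t) {
    for (int i = 0; i < 8; ++i) {
        if (d[i].x > t || d[i].y > t || d[i].z > t || d[i].w > t)
            return 1;
    }
    return 0;
}

/* Alternative: reduce the 8 vectors with a max tree, then compare once.
 * Each tree level consists of operations that do not depend on each other. */
static int over_threshold_tree(const f4 d[8], float t) {
    f4 m01 = max4(d[0], d[1]);
    f4 m23 = max4(d[2], d[3]);
    f4 m45 = max4(d[4], d[5]);
    f4 m67 = max4(d[6], d[7]);
    f4 m0123 = max4(m01, m23);
    f4 m4567 = max4(m45, m67);
    f4 m = max4(m0123, m4567);
    /* Horizontal max of the final vector, then the single comparison. */
    float s = maxf(maxf(m.x, m.y), maxf(m.z, m.w));
    return s > t;
}
```

In OpenCL C the same idea could be written with the built-in `fmax` on `float4` values and `any()` on the result of a vector comparison, for example `if (any(m > t.yyyy)) was_pulse.y = 1;`, where `m` is the reduced vector from the sketch.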