I want to specify a different CU mask (cumask) for each kernel, in order to explore fine-grained concurrency scheduling. I tried modifying the ROCm source code (mostly the AQL packet), but that failed because I found that I couldn't extend the Command Processor. Are there other ways to implement a per-kernel CU mask?
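
For concreteness, here is a minimal sketch of the kind of mask I would like to attach to each kernel. It assumes one bit per compute unit, packed into 32-bit words, which is the same layout as the `uint32_t` array that HIP's `hipExtStreamCreateWithCUMask` takes (that call sets the mask per stream, not per kernel). The helper name `make_cu_mask` and its parameters are hypothetical, made up for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ROCm): build a CU mask that enables
// CUs [first, first + count) on a device with total_cus compute units.
// Bit i of word w corresponds to CU number 32 * w + i.
std::vector<uint32_t> make_cu_mask(unsigned total_cus, unsigned first, unsigned count) {
    assert(total_cus > 0);
    std::vector<uint32_t> mask((total_cus + 31) / 32, 0u);
    // Requests that run past the last CU are clamped to total_cus.
    for (unsigned cu = first; cu < first + count && cu < total_cus; ++cu) {
        mask[cu / 32] |= (1u << (cu % 32));
    }
    return mask;
}
```

Ideally, kernel A would be dispatched with something like `make_cu_mask(60, 0, 8)` and kernel B with `make_cu_mask(60, 8, 52)`, with the mask chosen at each dispatch rather than fixed when the queue or stream is created.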