ROCm Discussions

chrirocca · 03-18-2024

Hello,

I am trying to run some benchmarks targeting the L2 cache on an MI100. I am converting my code and my Makefile from CUDA and nvcc to ROCm and hipcc, but I can't find the hipcc equivalent of one flag in particular.
With nvcc I was using -Xptxas -dlcm=cg to bypass the L1 cache. Is there anything similar for hipcc?
I searched online but didn't find any information about this.

infohills · 03-18-2024

In ROCm and HIP, there isn't a direct equivalent of nvcc's -Xptxas -dlcm=cg flag for bypassing the L1 cache. However, you may be able to get similar behavior by restructuring your code around shared memory or other caching strategies. HIP's shared memory (similar to CUDA's shared memory) lets you manage caching manually: by staging data in shared memory, you control data movement between the threads of a thread block, potentially reducing reliance on the L1 cache. ROCm and HIP also offer their own optimizations and compiler flags, such as -O3 and -finline-functions, which you can experiment with to see how they affect memory access patterns and caching behavior.

chrirocca · 03-27-2024

Thank you for the reply.
I also know of the __ldcg intrinsic in CUDA and have used it to bypass the L1 cache and load specifically from the L2 cache. According to the documentation (the "CUDA Device API supported by HIP" page of the HIPIFY documentation on amd.com), it should be available in ROCm, but when I try to compile code that uses it I get an error saying that it doesn't exist.
Could someone confirm this?

engininja · 04-05-2024

From a search of the ROCm GitHub organization, __ldcg is only implemented for half and half2. Even then, it doesn't do anything special with the pointer. It is probably there just so that a particular piece of hipified code compiles.

I don't see any clang builtins for AMDGPU that look obviously related to loading data or cache policy, except for gfx12, though there are some for cache invalidation.

If you check the code for composable_kernel, they use LLVM buffer load and store intrinsics, which have access to the GLC/DLC/SLC bits that affect caching, as described in your GPU's ISA docs. The intrinsics themselves are defined on LLVM's GitHub in IntrinsicsAMDGPU.td.
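
As a possible starting point for experiments, here is a minimal sketch built on Clang's generic __builtin_nontemporal_load builtin, which is not AMDGPU-specific and is not one of the intrinsics mentioned above. The helper names load_nt and sum_nt are hypothetical. The assumption, which has to be checked against the generated ISA for your target, is that the AMDGPU backend sets cache-policy bits on loads marked non-temporal; it is not confirmed here that this matches what -dlcm=cg or __ldcg do on NVIDIA hardware, or that it bypasses L1 on MI100. The qualifier macro and the fallback branch are there only so the same file also builds with a plain host compiler.

```cpp
#include <cstddef>

// Under hipcc/nvcc the helpers are usable from host and device code.
// With a plain host compiler the qualifiers are simply omitted.
#if defined(__HIPCC__) || defined(__CUDACC__)
#define LOAD_NT_HD __host__ __device__
#else
#define LOAD_NT_HD
#endif

// Detect Clang's generic non-temporal load builtin. The nested #if keeps
// compilers without __has_builtin from parsing the inner expression.
#if defined(__has_builtin)
#if __has_builtin(__builtin_nontemporal_load)
#define LOAD_NT_AVAILABLE 1
#endif
#endif
#ifndef LOAD_NT_AVAILABLE
#define LOAD_NT_AVAILABLE 0
#endif

// Hypothetical helper: load one scalar (int, float, pointer, ...) with a
// non-temporal hint where the compiler supports it. ASSUMPTION: on AMDGPU
// this changes the cache-policy bits of the emitted load; verify in the ISA.
template <typename T>
LOAD_NT_HD inline T load_nt(const T* p) {
#if LOAD_NT_AVAILABLE
    return __builtin_nontemporal_load(p);
#else
    return *p;  // fallback: ordinary load, no cache hint at all
#endif
}

// Hypothetical helper: sum n floats, reading each element through load_nt.
LOAD_NT_HD inline float sum_nt(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += load_nt(data + i);
    }
    return acc;
}
```

Inside a kernel you would call load_nt(ptr + idx) in place of a plain dereference, then compare the disassembly (for example from building with -save-temps) and the measured timings against the plain-load version before drawing any conclusions about L1 behavior.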

Equivalent flag from nvcc to hipcc