OpenCL

liwoog
Adept II

Very strange LDS access behavior, help needed.

Hi,

I have reduced my problem to the test below.

The first version of this kernel executes 100x faster than the second.

For testing's sake, and to execute only the minimum number of wavefronts, loadCount is sized to use the full LDS (32 KB).

Any help would be greatly appreciated.

Lionel

PS: Stuffing the if statement with a register-based computation makes no difference.

kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void testKernel(int toiCount,
                local int *loadCount)
{
  int lid = get_local_id(0);

  loadCount[lid] = 0;

  // Loop over all beam numbers
  for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
    if (0 == toiIndex) {
      // loadCount[lid] = 0;
    }
  }
}

kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void testKernel(int toiCount,
                local int *loadCount)
{
  int lid = get_local_id(0);

  // loadCount[lid] = 0;

  // Loop over all beam numbers
  for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
    if (0 == toiIndex) {
      loadCount[lid] = 0;
    }
  }
}

dipak
Big Boss

When I tried to reproduce this behavior, I didn't observe such a performance difference (100x, as you said) between the above kernels. Please share a complete repro (with host code) and mention your setup details.

Thank you Dipak, someone else at AMD already answered. What happens is that in the first case the compiler removes the entire loop. The loop is large (~5000000 iterations), hence the speed differential.

Thanks for sharing this information.

Yes, that loop looks like dead code, and the compiler can remove it as an optimization. Since I tried a much lower iteration count, I didn't observe much difference.
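
To make the dead-code point concrete, here is a minimal host-side sketch in plain C (not OpenCL, and not the code anyone in this thread ran). It emulates one 64-item work-group for both kernel versions. The helper names (version1_item, version2_item, versions_agree) are made up for illustration, and a plain int array stands in for LDS. It only shows that the loop in the first version has no observable effect, which is what allows a compiler to delete it, while the loop in the second version contains the only store and therefore cannot be dropped as a whole.

```c
#include <string.h>

#define WORK_GROUP_SIZE 64 /* mirrors reqd_work_group_size(64, 1, 1) */

/* Version 1: the store happens before the loop and the loop body is empty,
 * so the loop changes nothing a caller could observe. */
static void version1_item(int lid, int toiCount, int *loadCount)
{
    loadCount[lid] = 0;
    for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
        if (0 == toiIndex) {
            /* intentionally empty, as in the first kernel */
        }
    }
}

/* Version 2: the only store sits inside the loop, so the loop (or at least
 * its first iteration) has a visible side effect. */
static void version2_item(int lid, int toiCount, int *loadCount)
{
    for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
        if (0 == toiIndex) {
            loadCount[lid] = 0;
        }
    }
}

/* Emulate one work-group per version on separate buffers (hypothetical
 * stand-ins for LDS) and return 1 if the final contents match, else 0. */
int versions_agree(int toiCount)
{
    int a[WORK_GROUP_SIZE];
    int b[WORK_GROUP_SIZE];

    /* Fill with a non-zero pattern so a missing store is detectable. */
    memset(a, 0xFF, sizeof a);
    memset(b, 0xFF, sizeof b);

    for (int lid = 0; lid < WORK_GROUP_SIZE; ++lid) {
        version1_item(lid, toiCount, a);
        version2_item(lid, toiCount, b);
    }
    return memcmp(a, b, sizeof a) == 0;
}
```

For any toiCount of at least 1 the two versions leave the buffer in the same state, so the runtime gap comes only from how much of the loop survives optimization. With toiCount equal to 0 they differ, because the second version never executes its store. Whether a given compiler actually removes or shortens either loop depends on the compiler and its settings.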
