Archives Discussions

Wibowit
Journeyman III

My code uses uncached reads despite const * restrict pointers

sprofile (on Linux) reports a Cache Hit rate of 0.00 %, so the cache is not being used at all

Hi,

My program is here: http://www63.zippyshare.com/v/93264665/file.html (note that it is a Java program). It is an ST5 implementation and accepts two parameters: the first is the input file, the second is the output file.

The first problem is that sprofile hogs my RAM. I have 8 GiB of RAM, but after about 30 kernel invocations memory fills up, the system becomes unresponsive for a moment (or sometimes permanently), and then profiling speeds up again. I don't know what is happening there, but such high memory usage suggests a memory leak.

My main problem is that although I've used const * restrict qualifiers, the disassembled code shows only uncached fetches:

8  VFETCH R2.x___, R0.w, fc173  MEGA(4)

I'm concerned about the two kernels attached below. They do not use the cache, even though caching should greatly reduce the number of memory fetches. How can I enable caching?

BTW: my code doesn't compile if I add __read_only or similar qualifiers. Why? The compiler says that __read_only is an unrecognized identifier.

__kernel void generateTuples(__global uchar const * restrict data,
                             __private int length,
                             __global ulong * tuples) {
    size_t gid = get_global_id(0);
    uint pointer = gid;
    uint low = 0;
    low += data[pointer]; pointer = (pointer + 1) < length ? pointer + 1 : 0;
    low <<= 8;
    low += data[pointer]; pointer = (pointer + 1) < length ? pointer + 1 : 0;
    low <<= 8;
    low += data[pointer]; pointer = (pointer + 1) < length ? pointer + 1 : 0;
    low <<= 8;
    low += data[pointer]; pointer = (pointer + 1) < length ? pointer + 1 : 0;
    ulong tuple = convert_ulong(low) << 32;
    tuple |= (convert_uint(data[pointer]) << 24) + gid;
    tuples[gid] = tuple;
}

__kernel void retrieveST5(__global ulong * tuples,
                          __global uchar const * restrict data,
                          __private int length,
                          __global uchar16 * output) {
    size_t gid = get_global_id(0);
    uchar16 chunk = 0;
    uint bufferPosition;
    bufferPosition = tuples[gid * 16 + 0] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s0 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 1] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s1 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 2] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s2 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 3] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s3 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 4] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s4 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 5] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s5 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 6] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s6 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 7] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s7 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 8] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s8 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 9] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.s9 = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 10] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.sa = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 11] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.sb = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 12] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.sc = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 13] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.sd = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 14] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.se = data[bufferPosition];
    bufferPosition = tuples[gid * 16 + 15] & 0xFFFFFF;
    bufferPosition = bufferPosition == 0 ? length - 1 : bufferPosition - 1;
    chunk.sf = data[bufferPosition];
    output[gid] = chunk;
}

7 Replies
rick_weber
Adept II

Use __constant and __attribute__((max_constant_size(N))) if the data is small enough (i.e., it fits in constant memory). If not, you might want to consider using the texture cache, which, while not quite as fast, is still considerably faster than constant memory.
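For example (a sketch, not code from the thread; the 65536-byte cap and kernel shape are assumptions), the first kernel's argument could be declared like this so the compiler places it in the hardware constant buffer:

```c
/* Sketch: __constant argument with AMD's max_constant_size hint.
   65536 is an assumed limit; query CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
   on the host for the real per-device value. */
__kernel void generateTuples(
    __constant uchar * data __attribute__((max_constant_size(65536))),
    __private int length,
    __global ulong * tuples)
{
    /* ... same body as before, now reading data[] through
       the cached constant path ... */
}
```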


No, it won't fit in constant memory; it is several megabytes in size.

How can I use the texture cache? Should I convert the buffer to a texture? I want linear access, but textures use a tiled memory layout.

And I think there is a slip in your last words: you meant "considerably faster than global memory".

Is there no way to use the texture cache on read-only buffers? When I compile the code with -fno-alias I do see cache utilization, but the results are completely broken.


Wibowit,

This is a known issue. I suggest using images instead.
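A minimal sketch of that approach (not from the thread): the byte buffer is uploaded as a CL_R / CL_UNSIGNED_INT8 image2d_t on the host, and the kernel reads it through the texture cache, mapping the linear index onto 2D texel coordinates. The kernel name and the width parameter are made up for illustration.

```c
/* Integer coordinates, no filtering, clamp at the edges. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void retrieveViaImage(__read_only image2d_t data,
                               __private int width,
                               __global uchar * out)
{
    size_t gid = get_global_id(0);
    /* Fold the linear index into 2D texel coordinates. */
    int2 pos = (int2)((int)(gid % width), (int)(gid / width));
    /* For CL_UNSIGNED_INT8 data the byte arrives in the .x channel. */
    uint4 texel = read_imageui(data, smp, pos);
    out[gid] = convert_uchar(texel.x);
}
```

The tiled layout is handled by the hardware; the kernel only ever sees (x, y) coordinates, so linear access patterns still work, at the cost of the index arithmetic above.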


"This is a known issue."

Is that a hardware issue or a software issue?


I was referring to the issue where results are corrupted in some cases when the -fno-alias flag is used. I cannot say anything more than that it is being looked into and will hopefully be fixed in the next release.

rick_weber
Adept II

The __read_only qualifier applies only to images.
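That is, per the OpenCL 1.x C language spec, access qualifiers are accepted on image arguments only (a sketch):

```c
/* Valid: qualifier on an image argument. */
__kernel void ok(__read_only image2d_t img) { /* ... */ }

/* Rejected by the compiler: qualifier on a plain buffer.
__kernel void bad(__read_only __global uchar * p) { }
*/
```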


Jawed,
It is a software issue that should be fixed in an upcoming Catalyst release. The problem was with the addressing of cached loads in the shader compiler.