
cocular
Journeyman III

The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

Hello!

  Recently I have been trying to implement a priority queue in OpenCL, and I have some doubts about barrier and mem_fence.  Here is my understanding.

  1. barrier(CLK_GLOBAL_MEM_FENCE):
    1. It makes sure that all the work-items in the same work-group reach this barrier.
    2. It makes sure that all writes to global memory by the current work-item can be read correctly by the other work-items in the same work-group after the barrier (see the sketch below).
  2. mem_fence:
    1. It makes sure that all writes by the current work-item can be correctly read by this same work-item after the fence.

Am I missing something?  Is this right?
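For example, here is how I picture 1.2 (just a sketch of my understanding, not code from the guide):

__kernel void publish(__global int *flags)
{
    // One work-item per work-group writes to global memory.
    if (get_local_id(0) == 0)
        flags[get_group_id(0)] = 42;

    // Rendezvous plus flush: after this barrier, every work-item in the
    // SAME work-group must see the write.  Work-items in other
    // work-groups get no such guarantee.
    barrier(CLK_GLOBAL_MEM_FENCE);

    int v = flags[get_group_id(0)];   // guaranteed to read 42 here
}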

Now for the performance question.  I read the AMD Accelerated Parallel Processing OpenCL Programming Guide.  On page 136 there is a cache-hierarchy figure, and I see that there is an L1 cache per compute unit.  Since one work-group can only run on one compute unit, does the GPU, at a global barrier, not need to wait 400+ cycles for the write to reach global memory, but only until the write to L1 completes?

If that is the case, what is the latency when L1 hits?  Or when L2 hits?

realhet
Miniboss

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

Hi,

I think you understand it well, but on 2: I think you meant to write "It makes sure that (all) the writes in the current work-item can be correctly read by this work-group(!!!) after the fence".

Not sure, but if you're looking for global synchronization, you can use memory atomic operations.
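For example (a rough sketch, the kernel name is mine): per-work-item contributions can be combined through a global atomic with no cross-work-group barrier at all, because the order of the atomic adds does not matter. This needs 32-bit global atomics (core since OpenCL 1.1):

__kernel void reduce_sum(__global const int *input, __global int *result)
{
    // *result must be zero-initialized by the host beforehand.
    atomic_add(result, input[get_global_id(0)]);
}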


For memory performance: in the APP OpenCL guide there is an Optimization section (chapters 5 and 6).

At "5.4 OpenCL Memory Resources: Capacity and Performance" you can find the various bandwidths for the HD7970 GCN chip.

For the VLIW HD5870, see "6.4 OpenCL Memory Resources: Capacity and Performance".

himanshu_gautam
Grandmaster

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

You are right about barriers and mem_fence.

From a performance perspective, your answers also lie in the programming guide itself:

The L1 cache can service up to four address requests per cycle, each delivering up to 16 bytes. The bandwidth shown assumes an access size of 16 bytes; smaller access sizes/requests result in a lower peak bandwidth for the L1 cache. Using float4 with images increases the request size and can deliver higher L1 cache bandwidth.

Each memory channel on the GPU contains an L2 cache that can deliver up to 64 bytes/cycle. The AMD Radeon™ HD 7970 GPU has 12 memory channels; thus, it can deliver up to 768 bytes/cycle; divided among 2048 stream cores, this provides up to ~0.4 bytes/cycle for each stream core.
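For example, a float4 image read issues one 16-byte request per work-item, which is exactly the access size the peak L1 figure above assumes. A rough sketch (the kernel and names are mine, not from the guide):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_img(__read_only image2d_t src, __global float4 *dst, int width)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    // read_imagef returns a float4, i.e. a single 16-byte L1 request.
    dst[pos.y * width + pos.x] = read_imagef(src, smp, pos);
}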

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

For 2, I really did mean to write

... by this work-item after the fence.

What I am not clear about is whether the same guarantee holds for the current work-group.

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

I think I did not state my question clearly.  Consider the following code:


int data;

global_mem[global_id] = data;
barrier(CLK_GLOBAL_MEM_FENCE);


So the write will go into L1.  Since the barrier only has to synchronize within the current work-group, it should cost very little (the latency of L1), not 400+ cycles, shouldn't it?

Also, what does "per cycle" above mean: latency or bandwidth?  Each device has a table of bandwidths for its memory.  Is there a table of latencies for memory operations?  Or of bandwidth and latency for arithmetic operations?

himanshu_gautam
Grandmaster

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

I did get your point. I am not sure whether latencies are mentioned anywhere in the guide; I will ask about them.

I also cannot answer the question about the latency of the above code snippet with confidence.
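Meanwhile, you can always measure kernel time yourself with OpenCL profiling events. A minimal host-side sketch (error checking omitted; ctx, dev, kernel, gsize and lsize are assumed to be set up already):

#include <stdio.h>
#include <CL/cl.h>

/* Create the command queue with profiling enabled. */
cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event evt;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gsize, &lsize, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t0, t1;   /* device timestamps in nanoseconds */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
printf("kernel took %lu ns\n", (unsigned long)(t1 - t0));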

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

I have done some benchmarking and found some unexpected results.

My benchmark code copies 256*repeat_time values to GPU memory.  In my test, repeat_time = 4096.  I add __kernel __attribute__((reqd_work_group_size(4, 1, 1))) to tell OpenCL that I will run it with less than one wavefront, in order to test latency.

Code 1: uses a local barrier.

Code 2: uses a global barrier.

Code 3: no barrier.

Results on an HD6770 (mobile):

Code 1: 64 ms

Code 2: 500 ms

Code 3: less than 1 ms

Now there are several problems:

1.  I do not use local memory, and there is only ONE wavefront (I set reqd_work_group_size).  Why does the local barrier significantly decrease performance?  It seems to be pure instruction overhead: just 256*4096 local mem_fences still need that much time.

2.  The global barrier is too slow!  The latency is about 500 cycles!  Obviously barrier() is naive here; the L1 cache and the single wavefront are not exploited.  Or does L1 have a 500-cycle latency?  OK, I was wrong: global operations are not cached, so this is normal.

How can I optimize the global writes while still synchronizing the group?  One way is to manually maintain a cache in local memory and only use a local barrier (a rough sketch of what I mean is after Code 3), but I think that is too hard to implement.  Any advice?

Code1:


__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest(__global int *global_mem, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        global_mem[data + 1  ] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        global_mem[data + 2  ] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        global_mem[data + 3  ] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        ...

        global_mem[data + 255] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        global_mem[data + 256] = data;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Code2:


__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest(__global int *global_mem, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        global_mem[data + 1  ] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        global_mem[data + 2  ] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        global_mem[data + 3  ] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        ...

        global_mem[data + 255] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        global_mem[data + 256] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Code3:


__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest(__global int *global_mem, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        global_mem[data + 1  ] = data;


        global_mem[data + 2  ] = data;


        global_mem[data + 3  ] = data;


        ...

        global_mem[data + 255] = data;


        global_mem[data + 256] = data;

    }
}
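For reference, this is roughly what I mean by manually maintaining a cache in local memory (only a sketch: the host must allocate the __local buffer with clSetKernelArg(kernel, 1, 260 * sizeof(cl_int), NULL), and the data reaches global memory only once at the end):

__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest_staged(__global int *global_mem, __local int *staging, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        for (int k = 1; k <= 256; ++k) {
            staging[data + k] = data;
            barrier(CLK_LOCAL_MEM_FENCE);   // cheap: only local memory is involved
        }
    }

    // The last barrier above makes all staged writes visible; now flush
    // the staged values (indices 1..259) to global memory once.
    for (int j = 1 + get_local_id(0); j < 260; j += get_local_size(0))
        global_mem[j] = staging[j];
}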

himanshu_gautam
Grandmaster

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

You are setting the required work-group size to just 4 work-items, and different work-groups clearly write to the same memory locations, hence I would assume the global writes must be carried all the way to global memory. For example, global_id=0 modifies elements 1-256, and global_id=4 (from the 2nd work-group) modifies elements 5-260. Probably create a small section of global memory dedicated to each work-group; it may give good performance.
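For example (a sketch; REGION_INTS is a size I made up, it just has to be large enough that the slices do not overlap):

#define REGION_INTS 512   /* assumed per-work-group slice size */

__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest_regions(__global int *global_mem, int repeat_time)
{
    // Each work-group writes only inside its own disjoint slice.
    __global int *region = global_mem + get_group_id(0) * REGION_INTS;
    int data = get_local_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        for (int k = 1; k <= 256; ++k) {
            region[data + k] = data;
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}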

I would expect Code 3 to get optimized away to a large degree. There are large write conflicts, so it is hard to say what it will get compiled to.

EDIT: Also, latencies are not an important measure to know, as they are always hidden. It should just be understood that caches have much lower latency than global memory. Do you have any idea about cache latencies from NVIDIA?

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

For the work-group problem: there is only one work-group in this benchmark, precisely to avoid that problem.

The key point is that the global memory buffer does NOT have a cache; L1 and L2 are only for texture memory.  So the large latency is somewhat reasonable here.

0 Likes
himanshu_gautam
Grandmaster

Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance



"The key point is that the global memory buffer does NOT have a cache; L1 and L2 are only for texture memory.  So the large latency is somewhat reasonable here."


This is not true. Caching is used for global memory too. Check the globalMemoryBandwidth sample, which shows close to 1000 GBps when caching is present. It is quite tricky there to avoid caching in order to demonstrate the actual global memory bandwidth of a device.
