16 Replies Latest reply on Jul 1, 2013 10:30 PM by cocular

    The definition about clk_global_mem_fence and mem_fence and their effect upon performance

    cocular

      Hello!

        Recently I have been trying to implement a priority queue in OpenCL, and I have some doubts about barrier and mem_fence.  Here is my understanding.

      1. barrier(CLK_GLOBAL_MEM_FENCE):
        1. It makes sure that all the work-items in the same work-group reach this barrier before any of them continues.
        2. It makes sure that all writes to global memory by the current work-item can be read correctly by other work-items in the same work-group after the barrier.
      2. mem_fence:
        1. It makes sure that all writes by the current work-item before the fence can be read correctly by this same work-item after the fence.
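
      To make this concrete, here is a minimal sketch (a hypothetical kernel, not my real code) of what I mean:

      __kernel void sketch(__global int *g, __local int *l)
      {
          int lid = get_local_id(0);

          g[lid] = lid;                    // write to global memory
          barrier(CLK_GLOBAL_MEM_FENCE);   // 1: every work-item in the group arrives here;
                                           // 2: the writes above are now visible inside the group
          int seen = g[(lid + 1) % (int)get_local_size(0)];  // read a neighbour's pre-barrier write

          l[lid] = seen;
          mem_fence(CLK_LOCAL_MEM_FENCE);  // only orders THIS work-item's own accesses;
                                           // other work-items are not made to wait
          g[lid] ^= l[lid];                // guaranteed to see my own write
      }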

      Am I missing something?  Is my understanding right?

       

      Now for the performance issue.  I read the AMD Accelerated Parallel Processing OpenCL Programming Guide.  On page 136 there is a figure of the cache hierarchy, and I see that there is an L1 cache per compute unit.  Since one work-group fits on only one compute unit, does a global barrier mean the GPU does not need to wait 400+ cycles for the write to reach global memory, but only until the write to L1 completes?

       

      If that is the case, what is the latency on an L1 hit?  Or an L2 hit?

        • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
          realhet

          Hi,

           

          I think you understand it well, but on 2.: I think you wanted to write: "It makes sure that all writes by the current work-item can be correctly read by the whole work-group(!!!) after the fence"

          Not sure, but if you're looking for global synchronization, you can use memory atomic operations.
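
          Something like this rough sketch (untested; it assumes the OpenCL 1.1 atomic built-ins on a __global int):

          __kernel void publish(__global int *data, volatile __global int *flag)
          {
              if (get_global_id(0) == 0) {
                  data[0] = 42;                     // produce something
                  mem_fence(CLK_GLOBAL_MEM_FENCE);  // order the data write before the flag update
                  atomic_inc(flag);                 // publish it atomically
              }
              // Other work-items can check the flag with atomic_add(flag, 0), but beware:
              // spinning on it across work-groups can deadlock if not all groups are
              // resident on the device at the same time.
          }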


          For the memory performance: in the APP OpenCL guide there is an Optimization section (chapters 5 and 6).

          At "5.4 OpenCL Memory Resources: Capacity and Performance" there are the various bandwidths for the HD7970 GCN chip.

          Also for VLIW HD5870 -> "6.4 OpenCL Memory Resources: Capacity and Performance"

          • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
            himanshu.gautam

            You are right about barriers and mem_fence.

             

            From a performance perspective, your answers also lie in the programming guide itself:

            The L1 cache can service up to four address requests per cycle, each delivering up to 16 bytes. The bandwidth shown assumes an access size of 16 bytes; smaller access sizes/requests result in a lower peak bandwidth for the L1 cache. Using float4 with images increases the request size and can deliver higher L1 cache bandwidth.

            Each memory channel on the GPU contains an L2 cache that can deliver up to 64 bytes/cycle. The AMD Radeon™ HD 7970 GPU has 12 memory channels; thus, it can deliver up to 768 bytes/cycle; divided among 2048 stream cores, this provides up to ~0.4 bytes/cycle for each stream core.

              • Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                cocular

                I think I did not state my question clearly.  Consider the following code:

                 

                int data;

                global_mem[get_global_id(0)] = data;
                barrier(CLK_GLOBAL_MEM_FENCE);

                 

                So the write will go into L1.  Since the barrier only needs to synchronize within the current work-group, it should have very little cost (the latency of L1), not 400+ cycles, shouldn't it?

                 

                Also, what does "per cycle" above mean: latency or bandwidth?  Each device has a table of bandwidths for its memory.  Is there a table of latencies for memory operations?  Or bandwidth and latency for arithmetic operations?

                  • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                    himanshu.gautam

                    I did get your point. I am not sure if latencies are mentioned anywhere in the guide; I will ask about them.

                    I also cannot answer the question about the latency of the above code snippet with confidence.

                      • Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                        cocular

                        I have done some benchmarking and found some unexpected results.

                         

                        My benchmark code copies 256*repeat_time values to GPU memory.  In my test, repeat_time = 4096.  I add __kernel __attribute__((reqd_work_group_size(4, 1, 1))) to tell OpenCL that I will run it with less than one wavefront, in order to test latency.

                         

                        Code 1: use local barrier.

                        Code 2: use global barrier.

                        Code 3: no barrier.

                        Result, on a mobile HD6770:

                        Code 1: 64 ms
                        Code 2: 500 ms
                        Code 3: less than 1 ms

                         

                        Now there are several problems:

                        1.  I do not use local memory, and there is only ONE wavefront (I set reqd_work_group_size).  Why does the local barrier still decrease performance so much?  It seems to be pure instruction overhead: 256*4096 local barriers in 64 ms is roughly 61 ns per barrier.

                        2.  The global barrier is far too slow!  The latency is about 500 cycles!  Obviously barrier() is naive here; L1 and the single wavefront are not exploited.  Or does L1 have a 500-cycle latency?  OK, I was wrong: global operations are not cached, so this is normal.

                        How can I optimize global writes that need group synchronization?  One way is to manually maintain a cache in local memory and only do local barriers (see the sketch after Code 3 below), which I think is too hard to implement.  Any advice?

                         

                         

                        Code1:

                        __kernel __attribute__((reqd_work_group_size(4, 1, 1)))
                        void memtest(__global int *global_mem, int repeat_time)
                        {
                            int data = get_global_id(0);
                        
                            for (int i = 0; i < repeat_time; ++i) {
                                global_mem[data + 1  ] = data;
                                barrier(CLK_LOCAL_MEM_FENCE);
                        
                                global_mem[data + 2  ] = data;
                                barrier(CLK_LOCAL_MEM_FENCE);
                        
                                global_mem[data + 3  ] = data;
                                barrier(CLK_LOCAL_MEM_FENCE);
                        
                                ...
                        
                                global_mem[data + 255] = data;
                                barrier(CLK_LOCAL_MEM_FENCE);
                        
                                global_mem[data + 256] = data;
                                barrier(CLK_LOCAL_MEM_FENCE);
                            }
                        }

                         

                        Code2:

                        __kernel __attribute__((reqd_work_group_size(4, 1, 1)))
                        void memtest(__global int *global_mem, int repeat_time)
                        {
                            int data = get_global_id(0);
                        
                            for (int i = 0; i < repeat_time; ++i) {
                                global_mem[data + 1  ] = data;
                                barrier(CLK_GLOBAL_MEM_FENCE);
                        
                                global_mem[data + 2  ] = data;
                                barrier(CLK_GLOBAL_MEM_FENCE);
                        
                                global_mem[data + 3  ] = data;
                                barrier(CLK_GLOBAL_MEM_FENCE);
                        
                                ...
                        
                                global_mem[data + 255] = data;
                                barrier(CLK_GLOBAL_MEM_FENCE);
                        
                                global_mem[data + 256] = data;
                                barrier(CLK_GLOBAL_MEM_FENCE);
                            }
                        }

                         

                        Code3:

                         

                        __kernel __attribute__((reqd_work_group_size(4, 1, 1)))
                        void memtest(__global int *global_mem, int repeat_time)
                        {
                            int data = get_global_id(0);
                        
                            for (int i = 0; i < repeat_time; ++i) {
                                global_mem[data + 1  ] = data;
                        
                        
                                global_mem[data + 2  ] = data;
                        
                        
                                global_mem[data + 3  ] = data;
                        
                        
                                ...
                        
                                global_mem[data + 255] = data;
                        
                        
                                global_mem[data + 256] = data;
                        
                            }
                        }
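
                        Here is a rough sketch (untested, hypothetical) of the "manual cache in local memory" idea from my question above: do the conflicting writes in __local memory with cheap local barriers, and flush to global memory only once at the end.

                        __kernel __attribute__((reqd_work_group_size(4, 1, 1)))
                        void memtest_staged(__global int *global_mem, int repeat_time)
                        {
                            __local int stage[260];        // covers indices data+1 .. data+256 for data = 0..3
                            int data = get_global_id(0);   // single work-group, so this equals the local id

                            for (int i = 0; i < repeat_time; ++i) {
                                for (int j = 1; j <= 256; ++j) {
                                    stage[data + j] = data;        // same access pattern, but into LDS
                                    barrier(CLK_LOCAL_MEM_FENCE);  // cheap intra-group synchronization
                                }
                            }

                            // flush the staged region to global memory once
                            for (int j = get_local_id(0); j < 260; j += 4)
                                global_mem[j] = stage[j];
                        }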
                          • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                            himanshu.gautam

                            You are setting the required work-group size to just 4 work-items, and it is clear that different work-groups write to the same memory locations, hence I would assume the global writes must go all the way to global memory.  For example, global_id=0 modifies elements 1-256, and global_id=4 (from the 2nd work-group) modifies elements 5-260.  Perhaps create a small section of global memory dedicated to each work-group; it may give good performance.
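
                            A hypothetical sketch of that suggestion (just to illustrate the layout; slice_size is an assumed parameter):

                            __kernel void memtest_sliced(__global int *global_mem, int slice_size, int repeat_time)
                            {
                                // each work-group writes only inside its own slice of the buffer,
                                // so no two groups ever touch the same addresses
                                __global int *slice = global_mem + get_group_id(0) * slice_size;
                                int lid = get_local_id(0);

                                for (int i = 0; i < repeat_time; ++i) {
                                    slice[lid] = lid;
                                    barrier(CLK_GLOBAL_MEM_FENCE);
                                }
                            }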

                            I would believe Code 3 can get optimized away to a large degree.  There are many write conflicts, so it is hard to say what it will get compiled to.

                             

                            EDIT: Also, latencies are not an important measure to know, as they are usually hidden.  It should just be understood that caches have much lower latency than global memory.  Do you have any idea about cache latencies from NVIDIA?

                              • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                cocular

                                On the work-group problem: there is only one work-group in this benchmark, precisely to avoid that issue.

                                 

                                The key point is that the global memory buffer does NOT have a cache; L1 and L2 are only for texture memory.  So the large latency is somewhat reasonable here.

                                  • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                    himanshu.gautam

                                     

                                    The key point is that the global memory buffer does NOT have a cache; L1 and L2 are only for texture memory.  So the large latency is somewhat reasonable here.

                                    This is not true. Caching is used for global memory too. Check the globalMemoryBandwidth sample, which shows close to 1000 GB/s when caching is present. It is quite tricky there to avoid caching in order to demonstrate the actual global memory bandwidth of a device.

                                      • Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                        cocular

                                        Do you mean this?

                                        Global Memory Read
                                        AccessType      : single
                                        VectorElements  : 4
                                        Bandwidth       : 269.674 GB/s

                                         

                                        Maybe my hardware does not have high enough performance to achieve 1000 GB/s.  But I think I can figure out why the sample can exceed the GPU's hardware limit.  The CL kernel has the following pattern:

                                        uint gid = get_global_id(0);
                                        uint index = gid;
                                        val = val + input[index + 0];
                                        val = val + input[index + 1];
                                        val = val + input[index + 2];
                                        val = val + input[index + 3];
                                        val = val + input[index + 4];
                                        val = val + input[index + 5];
                                        val = val + input[index + 6];
                                        val = val + input[index + 7];
                                        val = val + input[index + 8];
                                        val = val + input[index + 9];

                                        You can see that there is a lot of overlap between threads (thread gid reads input[gid..gid+9], thread gid+1 reads input[gid+1..gid+10]).  There is no global mem fence between the memory accesses, so the compiler/GPU is able to reorder them and eliminate the overlapping accesses to optimize performance.  I don't think we can call that a cache in this situation.

                                         

                                        The APP guide, page 32, says:

                                        Only read-only buffers, such as constants, are cached.
                                          • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                            himanshu.gautam

                                            As I understand it, caching comes into play when reading/writing global memory. IIRC, non-GCN chips have only read-only caches, but GCN has read/write caches. Why don't you write a small test yourself? You could probably modify GlobalMemoryBandwidth as per your understanding.

                                              • Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                                cocular

                                                If global memory is cached, then what is the main difference between global memory and texture memory?

                                                 

                                                Also, the test above already shows that global memory is not cached!  Otherwise I don't think we would need to wait 500 cycles for a sync.

                                                 

                                                Also, I provided a reference that says only read-only buffers are cached.  Could you provide a reference that says global memory is cached?

                                                AMD APP Guide P32:

                                                 

                                                When using a global buffer, each work-item can write to an arbitrary location
                                                within the global buffer. Global buffers use a linear memory layout. If consecutive
                                                addresses are written, the compute unit issues a burst write for more efficient
                                                memory access. Only read-only buffers, such as constants, are cached.

                                                 

                                                 

                                                I think that is clear enough.

                                                  • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                                    himanshu.gautam

                                                    cocular wrote:

                                                     

                                                    If global memory is cached, then what is the main difference between global memory and texture memory?

                                                     

                                                    Also, the test above already shows that global memory is not cached!  Otherwise I don't think we would need to wait 500 cycles for a sync.

                                                     

                                                    Also, I provided a reference that says only read-only buffers are cached.  Could you provide a reference that says global memory is cached?

                                                     

                                                    Can you explain the difference between read-only buffers and global buffers? Aren't they both stored in the GPU's global memory?

                                                    The difference between a buffer and an image is the way the data is stored: buffers store data in a linear manner, and images store it in a tiled manner. But in the end, both of them are cached. Check section 5.1 of the OpenCL Programming Guide, which explains the read and write paths to/from global memory.

                                                      • Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                                        cocular

                                                        The paragraph above it explains the difference (p. 31):

                                                        Image reads are cached through the texture system (corresponding to the L2 and
                                                        L1 caches).

                                                         

                                                        If everything were cached, then subsection 1.5 would be very strange.

                                                         

                                                        But if only texture memory is cached, subsections 5.1/6.1 could still be understood, because I think they are talking about texture memory.

                                                         

                                                        So maybe you should update both the document and the bandwidth sample to state clearly that global buffers are not cached?

                                                          • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                                            himanshu.gautam

                                                            The above statement about images is correct, and so is section 5.1. Doesn't this suggest caches are used for both of them? Why don't you write a bandwidth test yourself and share it here? The discussion is becoming unnecessarily lengthy.

                                                            Can you post the output of the globalMemoryBandwidth and ImageBandwidth samples?

                                                              • Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance
                                                                cocular

                                                                Here is the benchmark:

                                                                 

                                                                tarball: https://dl.dropboxusercontent.com/u/8878837/bandwidth.zip

                                                                 

                                                                Kernel:

                                                                __kernel void image_read(int n,
                                                                                         __read_only image2d_t img,
                                                                                         __global int *ret)
                                                                {
                                                                    const sampler_t smp = (CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                                                                                           CLK_ADDRESS_CLAMP | //Clamp to zeros
                                                                                           CLK_FILTER_NEAREST); //Don't interpolate
                                                                    *ret = 0;
                                                                    barrier(CLK_GLOBAL_MEM_FENCE);
                                                                
                                                                    int result = 0;
                                                                    int gid = get_global_id(0);
                                                                    for (int i = 0; i < n; i += 2) {
                                                                        int2 cord;
                                                                        int4 val;
                                                                
                                                                        cord = (int2)(gid, i);
                                                                        val = read_imagei(img, smp, cord);
                                                                        result ^= val.x;
                                                                
                                                                        cord = (int2)(gid, i+1);
                                                                        val = read_imagei(img, smp, cord);
                                                                        result ^= val.x;
                                                                    }
                                                                    atomic_xor(ret, result);
                                                                }
                                                                
                                                                __kernel void image_read2(int n,
                                                                                         __read_only image2d_t img,
                                                                                         __global int *ret)
                                                                {
                                                                    const sampler_t smp = (CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                                                                                           CLK_ADDRESS_CLAMP | //Clamp to zeros
                                                                                           CLK_FILTER_NEAREST); //Don't interpolate
                                                                    *ret = 0;
                                                                    barrier(CLK_GLOBAL_MEM_FENCE);
                                                                
                                                                    int result = 0;
                                                                    int gid = get_global_id(0);
                                                                    for (int i = 0; i < n; i += 2) {
                                                                        int2 cord;
                                                                        int4 val;
                                                                
                                                                        cord = (int2)(i, gid);
                                                                        val = read_imagei(img, smp, cord);
                                                                        result ^= val.x;
                                                                
                                                                        cord = (int2)(i+1, gid);
                                                                        val = read_imagei(img, smp, cord);
                                                                        result ^= val.x;
                                                                    }
                                                                
                                                                    atomic_xor(ret, result);
                                                                }
                                                                
                                                                __kernel void buffer_read(int n, __global int *buf, __global int *ret)
                                                                {
                                                                    *ret = 0;
                                                                    barrier(CLK_GLOBAL_MEM_FENCE);
                                                                
                                                                    int result = 0;
                                                                    int gid = get_global_id(0);
                                                                
                                                                    int pos = gid*n;
                                                                    for (int i = 0; i < n; i+=4, pos+=4) {
                                                                        result ^= buf[pos+0];
                                                                        result ^= buf[pos+1];
                                                                        result ^= buf[pos+2];
                                                                        result ^= buf[pos+3];
                                                                    }
                                                                    atomic_xor(ret, result);
                                                                }

                                                                 

                                                                C code:

                                                                #include <CL/cl.h>
                                                                #include <stdio.h>
                                                                #include <assert.h>
                                                                #include <stdbool.h>
                                                                #include <time.h>
                                                                #include "time-cl.h"
                                                                #include "clwrapper.h"
                                                                
                                                                #define M 4000
                                                                #define N 8000
                                                                int buffer[M][N], buffer2[N][M];
                                                                
                                                                int main()
                                                                {
                                                                    srand((unsigned)time(NULL));
                                                                    cl_device_id device = cl_get_first_gpu_device();
                                                                    cl_context context = cl_create_context(device);
                                                                    cl_command_queue cmd_queue = cl_create_cmd_queue(context, device, CL_QUEUE_PROFILING_ENABLE);
                                                                    cl_program program = cl_create_program_from_src(context, device, "", TIME_KERNEL_SOURCE);
                                                                
                                                                    cl_kernel kernel_img = cl_create_kernel(program, "image_read");
                                                                    cl_kernel kernel_img2 = cl_create_kernel(program, "image_read2");
                                                                    cl_kernel kernel_buf = cl_create_kernel(program, "buffer_read");
                                                                
                                                                    int verifier = 0;
                                                                    for (int i = 0; i < M; ++i)
                                                                        for (int j = 0; j < N; ++j) {
                                                                            buffer[i][j] = (i ^ j) + (i | j) + i - j + i*j + rand();
                                                                            buffer2[j][i] = buffer[i][j];
                                                                            verifier ^= buffer[i][j];
                                                                        }
                                                                
                                                                    printf("verifier: %d\n", verifier);
                                                                    
                                                                    cl_image_format image_format = { CL_R, CL_UNSIGNED_INT32 };
                                                                    cl_image_desc   image_desc   = { CL_MEM_OBJECT_IMAGE2D, M, N, 0, 0, 0, 0, 0, 0, 0 };
                                                                    cl_mem img = cl_create_image(context,
                                                                                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                                                                                 &image_format, &image_desc, &buffer2[0][0]);
                                                                    cl_image_desc   image_desc2  = { CL_MEM_OBJECT_IMAGE2D, N, M, 0, 0, 0, 0, 0, 0, 0 };
                                                                    cl_mem img2 = cl_create_image(context,
                                                                                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                                                                                 &image_format, &image_desc2, &buffer[0][0]);
                                                                    cl_mem buf = cl_create_buffer(context,
                                                                                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                                                                                  sizeof(buffer), &buffer[0][0]);
                                                                
                                                                    /* the kernels write *ret, so these cannot be CL_MEM_READ_ONLY */
                                                                    cl_mem ret_img = cl_create_buffer(context, CL_MEM_READ_WRITE, sizeof(int), NULL);
                                                                    cl_mem ret_img2 = cl_create_buffer(context, CL_MEM_READ_WRITE, sizeof(int), NULL);
                                                                    cl_mem ret_buf = cl_create_buffer(context, CL_MEM_READ_WRITE, sizeof(int), NULL);
                                                                
                                                                    int t_size = N;
                                                                    cl_set_args(kernel_img, 0, sizeof(int), &t_size);
                                                                    cl_set_args(kernel_img, 1, sizeof(img), &img);
                                                                    cl_set_args(kernel_img, 2, sizeof(ret_img), &ret_img);
                                                                
                                                                    cl_set_args(kernel_buf, 0, sizeof(int), &t_size);
                                                                    cl_set_args(kernel_buf, 1, sizeof(buf), &buf);
                                                                    cl_set_args(kernel_buf, 2, sizeof(ret_buf), &ret_buf);
                                                                
                                                                    cl_set_args(kernel_img2, 0, sizeof(int), &t_size);
                                                                    cl_set_args(kernel_img2, 1, sizeof(img2), &img2);
                                                                    cl_set_args(kernel_img2, 2, sizeof(ret_img2), &ret_img2);
                                                                
                                                                    cl_finish(cmd_queue);
                                                                
                                                                    cl_event event_buf, event_img, event_img2;
                                                                    size_t global_work_size = M;
                                                                
                                                                
                                                                    cl_launch_kernel(cmd_queue, kernel_img, 1, NULL,
                                                                                     &global_work_size, NULL, 0, NULL, &event_img);
                                                                    cl_finish(cmd_queue);
                                                                    cl_wait_for_events(1, &event_img);
                                                                    int *v_img = cl_map_buffer(cmd_queue, ret_img, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL);
                                                                    printf("image bandwidth: %.2fGB/s in %.2fms\n",
                                                                           cl_get_kernel_bandwidth(event_img, M*N*sizeof(int)),  /* the kernel reads M*N ints */
                                                                           cl_get_kernel_time(event_img));
                                                                    printf("verify: %d\n", *v_img);
                                                                    cl_unmap_buffer(cmd_queue, ret_img, v_img, 0, NULL, NULL);
                                                                
                                                                
                                                                    cl_launch_kernel(cmd_queue, kernel_img2, 1, NULL,
                                                                                     &global_work_size, NULL, 0, NULL, &event_img2);
                                                                    cl_finish(cmd_queue);
                                                                    cl_wait_for_events(1, &event_img2);
                                                                    int *v_img2 = cl_map_buffer(cmd_queue, ret_img2, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL);
                                                                    printf("image2 bandwidth: %.2fGB/s in %.2fms\n",
                                                                           cl_get_kernel_bandwidth(event_img2, M*N*sizeof(int)),
                                                                           cl_get_kernel_time(event_img2));
                                                                    printf("verify: %d\n", *v_img2);
                                                                    cl_unmap_buffer(cmd_queue, ret_img2, v_img2, 0, NULL, NULL);
                                                                
                                                                
                                                                
                                                                    cl_launch_kernel(cmd_queue, kernel_buf, 1, NULL,
                                                                                     &global_work_size, NULL, 0, NULL, &event_buf);
                                                                    cl_finish(cmd_queue);
                                                                    cl_wait_for_events(1, &event_buf);
                                                                    int *v_buf = cl_map_buffer(cmd_queue, ret_buf, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL);
                                                                    printf("buffer bandwidth: %.2fGB/s in %.2fms\n",
                                                                           cl_get_kernel_bandwidth(event_buf, M*N*sizeof(int)),
                                                                           cl_get_kernel_time(event_buf));
                                                                    printf("verify: %d\n", *v_buf);
                                                                    cl_unmap_buffer(cmd_queue, ret_buf, v_buf, 0, NULL, NULL);
                                                                
                                                                    return 0;
                                                                }
                                                                

                                                                 

                                                                Result (Turk 6770):

                                                                verifier: 16011609
                                                                image bandwidth: 39.57GB/s in 6.02ms
                                                                verify: 16011609
                                                                image2 bandwidth: 33.08GB/s in 7.21ms
                                                                verify: 16011609
                                                                buffer bandwidth: 10.53GB/s in 22.65ms
                                                                verify: 16011609

                                                                 

                                                                Poor results for the global buffer.  Notice that if I unroll more, the buffer reads much, much faster, while the image does not.
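
                                                                For example, replacing the loop in buffer_read with an 8-way unroll (sketch of what I tried):

                                                                    int pos = gid*n;
                                                                    for (int i = 0; i < n; i += 8, pos += 8) {
                                                                        result ^= buf[pos+0] ^ buf[pos+1] ^ buf[pos+2] ^ buf[pos+3]
                                                                                ^ buf[pos+4] ^ buf[pos+5] ^ buf[pos+6] ^ buf[pos+7];
                                                                    }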

                                                                 

                                                                On the other hand, I saw IL such as uav_raw_load_id(11)_cached.  I do not really know what it means.

                                                                 

                                                                For the official benchmark:

                                                                Global Memory Read
                                                                AccessType      : single
                                                                VectorElements  : 4
                                                                Bandwidth       : 148.804 GB/s
                                                                
                                                                Global Memory Read
                                                                AccessType      : linear
                                                                VectorElements  : 4
                                                                Bandwidth       : 86.6981 GB/s
                                                                
                                                                Global Memory Read
                                                                AccessType      : linear(uncached)
                                                                VectorElements  : 4
                                                                Bandwidth       : 36.884 GB/s
                                                                
                                                                Global Memory Write
                                                                AccessType      : linear
                                                                VectorElements  : 4
                                                                Bandwidth       : 24.8876 GB/s
                                                                
                                                                Global Memory Read
                                                                AccessType      : random
                                                                VectorElements  : 4
                                                                Bandwidth       : 21.3308 GB/s
                                                                
                                                                Global Memory Read
                                                                AccessType      : unCombine_unCache
                                                                VectorElements  : 4
                                                                Bandwidth       : 36.2945 GB/s

                                                                                        memcpy()   5.62 GB/s
                                                                                     memset(,1,)  13.85 GB/s
                                                                                     memset(,0,)  14.99 GB/s
                                                                
                                                                
                                                                AVERAGES (over loops 2 - 9, use -l for complete log)
                                                                --------
                                                                
                                                                
                                                                1. Host mapped write to inputImage
                                                                
                                                                       clEnqueueMapImage(WRITE):  0.007440 s [     2.25 GB/s ]
                                                                                       memset():  0.001863 s       9.01 GB/s
                                                                      clEnqueueUnmapMemObject():  0.005611 s [     2.99 GB/s ]
                                                                
                                                                2. GPU kernel read of inputImage
                                                                
                                                                       clEnqueueNDRangeKernel():  0.044637 s      37.59 GB/s
                                                                                 verification ok
                                                                
                                                                3. GPU kernel write to outputImage
                                                                
                                                                       clEnqueueNDRangeKernel():  0.073920 s      22.70 GB/s
                                                                
                                                                4. Host mapped read of outputImage
                                                                
                                                                        clEnqueueMapImage(READ):  0.006750 s [     2.49 GB/s ]
                                                                                       CPU read:  0.001797 s       9.34 GB/s
                                                                                 verification ok
                                                                      clEnqueueUnmapMemObject():  0.000043 s [   394.12 GB/s ]

                                                                 

                                                                The image bandwidth is even lower than in my benchmark, even though I use int instead of int4.  I do not know why.
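
                                                                If I wanted to match the official sample more closely, I suppose an int4 variant would look like this (hypothetical, untested; it assumes the image is created as CL_RGBA / CL_UNSIGNED_INT32 with a quarter of the width):

                                                                __kernel void image_read_int4(int n,
                                                                                              __read_only image2d_t img,
                                                                                              __global int *ret)
                                                                {
                                                                    const sampler_t smp = (CLK_NORMALIZED_COORDS_FALSE |
                                                                                           CLK_ADDRESS_CLAMP |
                                                                                           CLK_FILTER_NEAREST);
                                                                    int result = 0;
                                                                    int gid = get_global_id(0);
                                                                    for (int i = 0; i < n / 4; ++i) {             // each texel now carries four ints
                                                                        int4 val = read_imagei(img, smp, (int2)(i, gid));
                                                                        result ^= val.x ^ val.y ^ val.z ^ val.w;  // consume all four channels
                                                                    }
                                                                    atomic_xor(ret, result);
                                                                }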