
cocular
Journeyman III

Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance

Do you mean this?


Global Memory Read
AccessType      : single
VectorElements  : 4
Bandwidth       : 269.674 GB/s




Maybe my hardware does not have high enough performance to reach 1000 GB/s. But I think I can explain why the result is able to exceed the GPU's hardware limit. The CL kernel has the following pattern:


/* excerpt: val and input come from the surrounding sample kernel */
uint gid = get_global_id(0);
uint index = gid;
val = val + input[index + 0];
val = val + input[index + 1];
val = val + input[index + 2];
val = val + input[index + 3];
val = val + input[index + 4];
val = val + input[index + 5];
val = val + input[index + 6];
val = val + input[index + 7];
val = val + input[index + 8];
val = val + input[index + 9];




You can see that there is a lot of overlap between the threads, and there is no global memory fence between the memory accesses. So the compiler/GPU is able to reorder them and eliminate the overlapping accesses to optimize performance. I don't think we can call that a cache in this situation.
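To make the overlap concrete, here is a rough back-of-the-envelope calculation (my own arithmetic, with an assumed launch size, not numbers from the sample): each work-item counts 10 loads toward the reported bandwidth, but neighbouring work-items share 9 of those 10 addresses, so the counted bytes can approach 10x the distinct bytes actually fetched.

/* Back-of-the-envelope: how overlapping reads can inflate reported bandwidth.
 * The launch size is an assumed, illustrative value. */
#include <stdio.h>

int main(void)
{
    const long work_items   = 1L << 24;   /* assumed launch size */
    const long loads_per_wi = 10;         /* input[index + 0..9] in the pattern above */

    long counted = work_items * loads_per_wi * 4;        /* bytes the sample counts */
    long unique  = (work_items + loads_per_wi - 1) * 4;  /* distinct bytes touched  */

    /* If the compiler/GPU services each distinct address only once, the
     * reported figure can exceed the hardware limit by roughly this factor. */
    printf("inflation factor: %.1fx\n", (double)counted / unique);
    return 0;
}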

The APP programming guide, page 32, says:


Only read-only buffers, such as constants, are cached.




himanshu_gautam
Grandmaster

Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance

As I understand it, caching comes into play when reading/writing global memory. IIRC, non-GCN GPUs have only read-only caches, but GCN has read/write caches. Why don't you write a small test yourself? You could probably modify GlobalMemoryBandwidth as per your understanding.

cocular
Journeyman III

Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance

If global memory is cached, then what is the main difference between global memory and texture memory?

Also, the test above already shows that global memory is not cached! Otherwise I think we would not need to wait 500 cycles for synchronization.

Also, I provided a reference that says only read-only buffers are cached. Could you provide a reference that says global memory is cached?

AMD APP Guide P32:


When using a global buffer, each work-item can write to an arbitrary location
within the global buffer. Global buffers use a linear memory layout. If consecutive
addresses are written, the compute unit issues a burst write for more efficient
memory access. Only read-only buffers, such as constants, are cached.



I think that is clear enough.
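As an aside, here is what the quoted burst-write condition looks like in kernel code. This is my own illustration, not code from the guide:

// Illustration only: consecutive vs. strided writes to a global buffer.
__kernel void write_linear(__global int *out)
{
    int gid = get_global_id(0);
    out[gid] = gid;            // adjacent work-items write adjacent addresses:
                               // eligible for a burst write
}

__kernel void write_strided(__global int *out, int stride)
{
    int gid = get_global_id(0);
    out[gid * stride] = gid;   // scattered addresses: no burst combining
}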

himanshu_gautam
Grandmaster

Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance


cocular wrote:

If global memory is cached, then what is the main difference between global memory and texture memory?

Also, the test above already shows that global memory is not cached! Otherwise I think we would not need to wait 500 cycles for synchronization.

Also, I provided a reference that says only read-only buffers are cached. Could you provide a reference that says global memory is cached?



Can you explain the difference between read-only buffers and global buffers? Aren't they both stored in the GPU's global memory?

The difference between a buffer and an image is the way the data is stored: buffers store data in a linear manner, and images store it in a tiled manner. But in the end, both of them are cached. Check section 5.1 of the OpenCL Programming Guide, which explains the read and write paths to/from global memory.
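To illustrate the layout difference, here is a sketch of the two addressings. The 8x8 tile size is only an illustrative assumption; real hardware tiling is more involved:

#include <stddef.h>   /* size_t */

enum { WIDTH = 1024, TILE = 8 };

/* Buffer (linear) layout: plain row-major. */
size_t linear_index(size_t x, size_t y)
{
    return y * WIDTH + x;
}

/* Image (tiled) layout: elements are grouped into TILE x TILE blocks,
 * so 2D-local accesses stay close together in memory. */
size_t tiled_index(size_t x, size_t y)
{
    size_t tiles_per_row = WIDTH / TILE;
    size_t tile_id = (y / TILE) * tiles_per_row + (x / TILE);
    size_t in_tile = (y % TILE) * TILE + (x % TILE);
    return tile_id * (TILE * TILE) + in_tile;
}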

cocular
Journeyman III

Re: Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance

The paragraph above explains the difference (P31):


Image reads are cached through the texture system (corresponding to the L2 and
L1 caches).




If everything is cached, then subsection 1.5 would be very strange.

But if only texture memory is cached, subsections 5.1/6.1 can still be understood, because I think they are talking about texture memory.

So maybe you should update both the document and the bandwidth example to state clearly that the global buffer is not cached?

himanshu_gautam
Grandmaster

Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance

The above statement about images is correct, and so is section 5.1. Doesn't this suggest caches are used for both of them? Why don't you write a bandwidth test yourself and share it here? The discussion is becoming unnecessarily lengthy.

Can you post the output of the GlobalMemoryBandwidth and ImageBandwidth samples?

cocular
Journeyman III

Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance

Here is the benchmark:

tarball: https://dl.dropboxusercontent.com/u/8878837/bandwidth.zip

Kernel:


__kernel void image_read(int n,
                         __read_only image2d_t img,
                         __global int *ret)
{
    const sampler_t smp = (CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                           CLK_ADDRESS_CLAMP | //Clamp to zeros
                           CLK_FILTER_NEAREST); //Don't interpolate
    *ret = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    int result = 0;
    int gid = get_global_id(0);
    for (int i = 0; i < n; i += 2) {   /* x fixed at gid, y varies: walk down one column */
        int2 cord;
        int4 val;

        cord = (int2)(gid, i);
        val = read_imagei(img, smp, cord);
        result ^= val.x;

        cord = (int2)(gid, i+1);
        val = read_imagei(img, smp, cord);
        result ^= val.x;
    }

    atomic_xor(ret, result);
}

__kernel void image_read2(int n,
                         __read_only image2d_t img,
                         __global int *ret)
{
    const sampler_t smp = (CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                           CLK_ADDRESS_CLAMP | //Clamp to zeros
                           CLK_FILTER_NEAREST); //Don't interpolate
    *ret = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    int result = 0;
    int gid = get_global_id(0);
    for (int i = 0; i < n; i += 2) {   /* y fixed at gid, x varies: walk along one row */
        int2 cord;
        int4 val;

        cord = (int2)(i, gid);
        val = read_imagei(img, smp, cord);
        result ^= val.x;

        cord = (int2)(i+1, gid);
        val = read_imagei(img, smp, cord);
        result ^= val.x;
    }

    atomic_xor(ret, result);
}

__kernel void buffer_read(int n, __global int *buf, __global int *ret)
{
    *ret = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    int result = 0;
    int gid = get_global_id(0);

    int pos = gid*n;   /* each work-item reads its own contiguous block of n ints */
    for (int i = 0; i < n; i+=4, pos+=4) {
        result ^= buf[pos+0];
        result ^= buf[pos+1];
        result ^= buf[pos+2];
        result ^= buf[pos+3];
    }
    atomic_xor(ret, result);
}




C code:


#include <CL/cl.h>
#include <stdio.h>
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>   /* srand(), rand() */
#include <time.h>
#include "time-cl.h"
#include "clwrapper.h"

#define M 4000
#define N 8000
int buffer[M][N], buffer2[N][M];

int main()
{
    srand((unsigned)time(NULL));
    cl_device_id device = cl_get_first_gpu_device();
    cl_context context = cl_create_context(device);
    cl_command_queue cmd_queue = cl_create_cmd_queue(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl_program program = cl_create_program_from_src(context, device, "", TIME_KERNEL_SOURCE);

    cl_kernel kernel_img = cl_create_kernel(program, "image_read");
    cl_kernel kernel_img2 = cl_create_kernel(program, "image_read2");
    cl_kernel kernel_buf = cl_create_kernel(program, "buffer_read");

    int verifier = 0;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            buffer[i][j] = (i ^ j) + (i | j) + i - j + i*j + rand();
            buffer2[j][i] = buffer[i][j];
            verifier ^= buffer[i][j];
        }

    printf("verifier: %d\n", verifier);
   
    cl_image_format image_format = { CL_R, CL_UNSIGNED_INT32 };
    cl_image_desc   image_desc   = { CL_MEM_OBJECT_IMAGE2D, M, N, 0, 0, 0, 0, 0, 0, 0 };
    cl_mem img = cl_create_image(context,
                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 &image_format, &image_desc, &buffer2[0][0]);
    cl_image_desc   image_desc2  = { CL_MEM_OBJECT_IMAGE2D, N, M, 0, 0, 0, 0, 0, 0, 0 };
    cl_mem img2 = cl_create_image(context,
                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 &image_format, &image_desc2, &buffer[0][0]);
    cl_mem buf = cl_create_buffer(context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  sizeof(buffer), &buffer[0][0]);

    /* the kernels write these result buffers, so they cannot be read-only */
    cl_mem ret_img = cl_create_buffer(context, CL_MEM_READ_WRITE, sizeof(int), NULL);
    cl_mem ret_img2 = cl_create_buffer(context, CL_MEM_READ_WRITE, sizeof(int), NULL);
    cl_mem ret_buf = cl_create_buffer(context, CL_MEM_READ_WRITE, sizeof(int), NULL);

    int t_size = N;
    cl_set_args(kernel_img, 0, sizeof(int), &t_size);
    cl_set_args(kernel_img, 1, sizeof(img), &img);
    cl_set_args(kernel_img, 2, sizeof(ret_img), &ret_img);

    cl_set_args(kernel_buf, 0, sizeof(int), &t_size);
    cl_set_args(kernel_buf, 1, sizeof(buf), &buf);
    cl_set_args(kernel_buf, 2, sizeof(ret_buf), &ret_buf);

    cl_set_args(kernel_img2, 0, sizeof(int), &t_size);
    cl_set_args(kernel_img2, 1, sizeof(img2), &img2);
    cl_set_args(kernel_img2, 2, sizeof(ret_img2), &ret_img2);

    cl_finish(cmd_queue);

    cl_event event_buf, event_img, event_img2;
    size_t global_work_size = M;


    cl_launch_kernel(cmd_queue, kernel_img, 1, NULL,
                     &global_work_size, NULL, 0, NULL, &event_img);
    cl_finish(cmd_queue);
    cl_wait_for_events(1, &event_img);
    int *v_img = cl_map_buffer(cmd_queue, ret_img, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL);
    printf("image bandwidth: %.2fGB/s in %.2fms\n",
           cl_get_kernel_bandwidth(event_img, M*N*sizeof(int)), /* M work-items read N ints each */
           cl_get_kernel_time(event_img));
    printf("verify: %d\n", *v_img);
    cl_unmap_buffer(cmd_queue, ret_img, v_img, 0, NULL, NULL);


    cl_launch_kernel(cmd_queue, kernel_img2, 1, NULL,
                     &global_work_size, NULL, 0, NULL, &event_img2);
    cl_finish(cmd_queue);
    cl_wait_for_events(1, &event_img2);
    int *v_img2 = cl_map_buffer(cmd_queue, ret_img2, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL);
    printf("image2 bandwidth: %.2fGB/s in %.2fms\n",
           cl_get_kernel_bandwidth(event_img2, M*N*sizeof(int)),
           cl_get_kernel_time(event_img2));
    printf("verify: %d\n", *v_img2);
    cl_unmap_buffer(cmd_queue, ret_img2, v_img2, 0, NULL, NULL);



    cl_launch_kernel(cmd_queue, kernel_buf, 1, NULL,
                     &global_work_size, NULL, 0, NULL, &event_buf);
    cl_finish(cmd_queue);
    cl_wait_for_events(1, &event_buf);
    int *v_buf = cl_map_buffer(cmd_queue, ret_buf, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL);
    printf("buffer bandwidth: %.2fGB/s in %.2fms\n",
           cl_get_kernel_bandwidth(event_buf, M*N*sizeof(int)),
           cl_get_kernel_time(event_buf));
    printf("verify: %d\n", *v_buf);
    cl_unmap_buffer(cmd_queue, ret_buf, v_buf, 0, NULL, NULL);

    return 0;
}




Result (Turk 6770):


verifier: 16011609
image bandwidth: 39.57GB/s in 6.02ms
verify: 16011609
image2 bandwidth: 33.08GB/s in 7.21ms
verify: 16011609
buffer bandwidth: 10.53GB/s in 22.65ms
verify: 16011609




A poor result for the global buffer. Notice that if I unroll more times (see the sketch below), the buffer reads much, much faster, while the image does not.
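By "unroll more" I mean something like the following variant of buffer_read (a sketch with unroll factor 8 instead of 4; the kernel name is mine):

__kernel void buffer_read8(int n, __global int *buf, __global int *ret)
{
    *ret = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    int result = 0;
    int pos = get_global_id(0) * n;
    for (int i = 0; i < n; i += 8, pos += 8) {
        result ^= buf[pos+0] ^ buf[pos+1] ^ buf[pos+2] ^ buf[pos+3]
                ^ buf[pos+4] ^ buf[pos+5] ^ buf[pos+6] ^ buf[pos+7];
    }
    atomic_xor(ret, result);
}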

On the other hand, I saw IL such as uav_raw_load_id(11)_cached. I do not really know what it means.

For the official benchmark:


Global Memory Read
AccessType      : single
VectorElements  : 4
Bandwidth       : 148.804 GB/s

Global Memory Read
AccessType      : linear
VectorElements  : 4
Bandwidth       : 86.6981 GB/s

Global Memory Read
AccessType      : linear(uncached)
VectorElements  : 4
Bandwidth       : 36.884 GB/s

Global Memory Write
AccessType      : linear
VectorElements  : 4
Bandwidth       : 24.8876 GB/s

Global Memory Read
AccessType      : random
VectorElements  : 4
Bandwidth       : 21.3308 GB/s

Global Memory Read
AccessType      : unCombine_unCache
VectorElements  : 4
Bandwidth       : 36.2945 GB/s





                        memcpy()   5.62 GB/s
                     memset(,1,)  13.85 GB/s
                     memset(,0,)  14.99 GB/s


AVERAGES (over loops 2 - 9, use -l for complete log)
--------


1. Host mapped write to inputImage

       clEnqueueMapImage(WRITE):  0.007440 s [     2.25 GB/s ]
                       memset():  0.001863 s       9.01 GB/s
      clEnqueueUnmapMemObject():  0.005611 s [     2.99 GB/s ]

2. GPU kernel read of inputImage

       clEnqueueNDRangeKernel():  0.044637 s      37.59 GB/s
                 verification ok

3. GPU kernel write to outputImage

       clEnqueueNDRangeKernel():  0.073920 s      22.70 GB/s

4. Host mapped read of outputImage

        clEnqueueMapImage(READ):  0.006750 s [     2.49 GB/s ]
                       CPU read:  0.001797 s       9.34 GB/s
                 verification ok
      clEnqueueUnmapMemObject():  0.000043 s [   394.12 GB/s ]





The image bandwidth is even lower than in my benchmark, even though I use int instead of int4. I do not know why.
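If anyone wants to compare, an int4 variant of buffer_read would look roughly like this (untested sketch; the buffer is reinterpreted as int4 so each load fetches 16 bytes):

__kernel void buffer_read_v4(int n, __global int4 *buf, __global int *ret)
{
    *ret = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    int result = 0;
    int pos = get_global_id(0) * (n / 4);   // n ints == n/4 int4 elements per work-item
    for (int i = 0; i < n / 4; ++i, ++pos) {
        int4 v = buf[pos];
        result ^= v.x ^ v.y ^ v.z ^ v.w;
    }
    atomic_xor(ret, result);
}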
