8 Replies Latest reply on May 12, 2010 3:36 PM by LeeHowes

    global memory optimization - don't get it

    diapolo

      I'm currently reading through the AMD PDF ATI_Stream_SDK_OpenCL_Programming_Guide and have some questions about the global memory optimization section in there.

      The guide says: "Note that the memory segments indexed through base addresses A0 to An are not required to line up sequentially; for optimal performance, they must be aligned to 128 bytes and must not overlap."

      My kernel currently uses a 256 MB array which holds uint2 values. I made sure that the host memory is aligned and reserved it via: cl_uint *searchStrings = (cl_uint*)_aligned_malloc(sizeof(cl_uint2) * numCombinations, 16);

      That array is passed to my kernel via a buffer (mem object) written from the host. In the kernel it is accessed read-only eight times per work-item (the value is used in an addition).

      But I'm really unsure how to align to 128 bytes and what that really means.

      Thanks,

      Dia

        • global memory optimization - don't get it
          marwen

          I'm a beginner and I have the same problem.

          The only thing I can say is that using CL_MEM_ALLOC_HOST_PTR in clCreateBuffer lets the implementation allocate the memory for you, and I think it aligns it to 128 bytes.
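          If you want to try that route, a minimal sketch might look like the following (error handling omitted; `context`, `queue` and `numCombinations` are assumed to already exist in your host code, and whether the mapped pointer comes back 128-byte aligned is implementation-dependent, so check it yourself):

```c
#include <CL/cl.h>

/* Sketch: let the OpenCL runtime allocate the backing store itself
 * (CL_MEM_ALLOC_HOST_PTR), then map the buffer to get a host pointer
 * the runtime chose, and fill the data through that pointer.
 * Assumes existing cl_context context, cl_command_queue queue,
 * and size_t numCombinations. */
cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            sizeof(cl_uint2) * numCombinations,
                            NULL, &err);

/* Blocking map for writing; the runtime hands back its own pointer. */
cl_uint2 *ptr = (cl_uint2 *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                               CL_MAP_WRITE, 0,
                                               sizeof(cl_uint2) * numCombinations,
                                               0, NULL, NULL, &err);
/* ... fill ptr[i] with the search strings ... */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
```

          The map/unmap pair avoids an extra host-side copy compared to allocating yourself and calling clEnqueueWriteBuffer.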

          I'm not sure, so wait for an expert's answer.

            • global memory optimization - don't get it
              diapolo

              Hey marwen,

              Perhaps we should first clarify what we think is meant by host memory. I understand it as being the system RAM of the computer. Now, if you use CL_MEM_USE_HOST_PTR, the OpenCL implementation uses host memory (system RAM, not VRAM), as long as that buffer doesn't get cached into VRAM, right?

              CL_MEM_ALLOC_HOST_PTR would reserve space in system RAM, too, so no buffer in VRAM.

              You could be right: if you make sure your buffer in host memory is aligned, it should be aligned for the OpenCL kernel as well. But my main question is how I can align these buffers to a 128 byte boundary ... any ideas?

              Thanks,

              Dia

                • global memory optimization - don't get it
                  genaganna

                   

                  Originally posted by: diapolo ... my main question is, how I can align these buffers to a 128 byte boundary ... any ideas?
                  Please paste your kernel code and host code. 128-bit alignment means using a 4-component vector type (like float4, int4, uint4) in the kernel.

                    • global memory optimization - don't get it
                      diapolo

                      Perhaps you can have a look here; it's a kernel that computes and compares MD5 values for hash cracking. It works, but perhaps you have some hints that could help me understand what can be optimized.

                      I know it's not good to have control flow (there are 2 if clauses in there), but for now I couldn't optimize them out (it's not faster to use the ?: operator and factors of 0 and 1 to multiply the value that gets written to global memory).

                      Thanks,

                      Dia

                      Edit: The host code is way too long to post here, but I guess you can see the most important things in the kernel.

__constant uint4 md5_init_values = (uint4)(0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476);

inline uint F(uint b, uint c, uint d) { return (d ^ (b & (c ^ d))); }
inline uint G(uint b, uint c, uint d) { return ((b & d) | (c & ~d)); }
inline uint H(uint b, uint c, uint d) { return (b ^ c ^ d); }
inline uint I(uint b, uint c, uint d) { return (c ^ (b | ~d)); }

inline void FF_opt_const(uint *a, uint b, uint x) { *a = b + rotate(((uint)3614090359 + x), 7); }
inline void FF(uint *a, uint b, uint c, uint d, uint x, uint s, uint ac) { *a += F(b, c, d) + x + ac; *a = rotate(*a, s); *a += b; }
inline void FF_opt(uint *a, uint b, uint c, uint d, uint s, uint ac) { *a += F(b, c, d) + ac; *a = rotate(*a, s); *a += b; }
inline void GG(uint *a, uint b, uint c, uint d, uint x, uint s, uint ac) { *a += G(b, c, d) + x + ac; *a = rotate(*a, s); *a += b; }
inline void GG_opt(uint *a, uint b, uint c, uint d, uint s, uint ac) { *a += G(b, c, d) + ac; *a = rotate(*a, s); *a += b; }
inline void HH(uint *a, uint b, uint c, uint d, uint x, uint s, uint ac) { *a += H(b, c, d) + x + ac; *a = rotate(*a, s); *a += b; }
inline void HH_opt(uint *a, uint b, uint c, uint d, uint s, uint ac) { *a += H(b, c, d) + ac; *a = rotate(*a, s); *a += b; }
inline void II(uint *a, uint b, uint c, uint d, uint x, uint s, uint ac) { *a += I(b, c, d) + x + ac; *a = rotate(*a, s); *a += b; }
inline void II_opt(uint *a, uint b, uint c, uint d, uint s, uint ac) { *a += I(b, c, d) + ac; *a = rotate(*a, s); *a += b; }

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void md5_main(const __global uint4 *searchedHash, const __global uint4 *searchStrings, __global uint4 *outputString)
{
    const size_t gID = get_global_id(0);
    const uint4 searchString = searchStrings[gID];
    uint a, b = md5_init_values.y, c = md5_init_values.z, d = md5_init_values.w;

    FF_opt_const(&a, b, searchString.x);
    FF(&d, a, b, c, searchString.y, 12, 0xe8c7b756);
    FF(&c, d, a, b, searchString.z, 17, 0x242070db);
    FF_opt(&b, c, d, a, 22, 0xc1bdceee);
    FF_opt(&a, b, c, d, 7, 0xf57c0faf);
    FF_opt(&d, a, b, c, 12, 0x4787c62a);
    FF_opt(&c, d, a, b, 17, 0xa8304613);
    FF_opt(&b, c, d, a, 22, 0xfd469501);
    FF_opt(&a, b, c, d, 7, 0x698098d8);
    FF_opt(&d, a, b, c, 12, 0x8b44f7af);
    FF_opt(&c, d, a, b, 17, 0xffff5bb1);
    FF_opt(&b, c, d, a, 22, 0x895cd7be);
    FF_opt(&a, b, c, d, 7, 0x6b901122);
    FF_opt(&d, a, b, c, 12, 0xfd987193);
    FF(&c, d, a, b, searchString.w, 17, 0xa679438e);
    FF_opt(&b, c, d, a, 22, 0x49b40821);
    GG(&a, b, c, d, searchString.y, 5, 0xf61e2562);
    GG_opt(&d, a, b, c, 9, 0xc040b340);
    GG_opt(&c, d, a, b, 14, 0x265e5a51);
    GG(&b, c, d, a, searchString.x, 20, 0xe9b6c7aa);
    GG_opt(&a, b, c, d, 5, 0xd62f105d);
    GG_opt(&d, a, b, c, 9, 0x02441453);
    GG_opt(&c, d, a, b, 14, 0xd8a1e681);
    GG_opt(&b, c, d, a, 20, 0xe7d3fbc8);
    GG_opt(&a, b, c, d, 5, 0x21e1cde6);
    GG(&d, a, b, c, searchString.w, 9, 0xc33707d6);
    GG_opt(&c, d, a, b, 14, 0xf4d50d87);
    GG_opt(&b, c, d, a, 20, 0x455a14ed);
    GG_opt(&a, b, c, d, 5, 0xa9e3e905);
    GG(&d, a, b, c, searchString.z, 9, 0xfcefa3f8);
    GG_opt(&c, d, a, b, 14, 0x676f02d9);
    GG_opt(&b, c, d, a, 20, 0x8d2a4c8a);
    HH_opt(&a, b, c, d, 4, 0xfffa3942);
    HH_opt(&d, a, b, c, 11, 0x8771f681);
    HH_opt(&c, d, a, b, 16, 0x6d9d6122);
    HH(&b, c, d, a, searchString.w, 23, 0xfde5380c);
    HH(&a, b, c, d, searchString.y, 4, 0xa4beea44);
    HH_opt(&d, a, b, c, 11, 0x4bdecfa9);
    HH_opt(&c, d, a, b, 16, 0xf6bb4b60);
    HH_opt(&b, c, d, a, 23, 0xbebfbc70);
    HH_opt(&a, b, c, d, 4, 0x289b7ec6);
    HH(&d, a, b, c, searchString.x, 11, 0xeaa127fa);
    HH_opt(&c, d, a, b, 16, 0xd4ef3085);
    HH_opt(&b, c, d, a, 23, 0x04881d05);
    HH_opt(&a, b, c, d, 4, 0xd9d4d039);
    HH_opt(&d, a, b, c, 11, 0xe6db99e5);
    HH_opt(&c, d, a, b, 16, 0x1fa27cf8);
    HH(&b, c, d, a, searchString.z, 23, 0xc4ac5665);
    II(&a, b, c, d, searchString.x, 6, 0xf4292244);
    II_opt(&d, a, b, c, 10, 0x432aff97);
    II(&c, d, a, b, searchString.w, 15, 0xab9423a7);
    II_opt(&b, c, d, a, 21, 0xfc93a039);
    II_opt(&a, b, c, d, 6, 0x655b59c3);
    II_opt(&d, a, b, c, 10, 0x8f0ccc92);
    II_opt(&c, d, a, b, 15, 0xffeff47d);
    II(&b, c, d, a, searchString.y, 21, 0x85845dd1);
    II_opt(&a, b, c, d, 6, 0x6fa87e4f);
    II_opt(&d, a, b, c, 10, 0xfe2ce6e0);
    II_opt(&c, d, a, b, 15, 0xa3014314);
    II_opt(&b, c, d, a, 21, 0x4e0811a1);
    II_opt(&a, b, c, d, 6, 0xf7537e82);

    if(searchedHash[0].x == (a + md5_init_values.x))
    {
        II_opt(&d, a, b, c, 10, 0xbd3af235);
        II_opt(&c, d, a, b, 15, 0x2ad7d2bb);
        II(&b, c, d, a, searchString.z, 21, 0xeb86d391);

        if(all(searchedHash[0] == (uint4)(a + md5_init_values.x, b + md5_init_values.y, c + md5_init_values.z, d + md5_init_values.w)))
        {
            outputString[0] = searchString;
        }
    }
}

                    • global memory optimization - don't get it
                      afo

                      Just my understanding, please correct me if it is wrong...

                      I think that with CL_MEM_USE_HOST_PTR you provide the pointer, so you did the memory allocation yourself and it is in system memory, but the implementation could choose to cache it in device memory, and the kernel would then use the cached version (is there a way to enforce this?).

                      With CL_MEM_ALLOC_HOST_PTR, the OpenCL implementation chooses where to allocate (I am not sure whether it could be system memory or some other kind of memory, e.g. PCIe-accessible memory), but your application can access it through the host_ptr, so I guess it is system memory; the OpenCL implementation is responsible for allocating and deallocating it.

                      With CL_MEM_COPY_HOST_PTR, the OpenCL implementation chooses where to allocate and copies the contents pointed to by host_ptr (but the spec says nothing about where the buffer is allocated, so I think it is implementation-dependent).

                      (It would be great if AMD explained where the memory is allocated in these cases, or under what conditions each memory arena is chosen.)

                      About the byte alignment: you can use _aligned_malloc() on Windows and posix_memalign() on Linux.

                      best regards,

                      Alfonso

                       

                        • global memory optimization - don't get it
                          nou

                          If something hasn't changed, all buffers are allocated in VRAM.

                            • global memory optimization - don't get it
                              diapolo

                               

                              Originally posted by: nou if something didn't change all buffers are in VRAM.


                              What was the question for that answer?

                              Dia

                              • global memory optimization - don't get it
                                LeeHowes

                                That explanation is discussing the access from the kernel. So what it means is the following:

                                If you have 64 work-items in a wavefront and each reads a float, then, for peak performance in a streaming app, your first wavefront would read bytes 0 to 255, the next 256 to 511, and so on. That's 64 reads * 4 bytes per read. In reality you can achieve peak performance with any start address that is a multiple of 128, not just 256. So if wave 0 read addresses 128 to 383, and so on, that would also achieve peak performance.

                                Of course, knowing that your read address starts at a multiple of 128 as an offset from your base pointer (which you can easily arrange by choosing an array index that is a multiple of 128 bytes from your passed global memory pointer) is not enough if the global memory pointer itself is not aligned. I think the GPU memory allocator should create aligned allocations for exactly this reason, but you might want to test it to be sure.

                                Also remember that this applies to 2D data sets. If you have a matrix of 290-by-something elements, where each row is 290 elements long, then while row 1 might be aligned, row 2 would not be. So you should consider padding your matrix in memory such that the beginning of each row is aligned. That is completely under your control in a way that the base pointer allocation isn't.

                                Was that any clearer? I don't think that aligning on the host makes any difference to alignment on the device; they are two completely separate allocations, and even aligned host memory will be slow to access from the device over PCIe, so it probably makes little difference. Having said that, DMA engines may prefer 128-byte aligned data, and the CPU is often happier that way because of how the data sits in cache lines.