7 Replies Latest reply on Nov 14, 2014 5:20 AM by dipak

    r7 270x memory allocation issue

    nan

      Hi,

       

      I'm having serious issues with some R7 270X cards. The application runs two threads, and each thread allocates a 512MB, a 256MB, a 128MB and a 4KB buffer in global device memory (~1.8GB in total). On some 270X cards I observed that only ~1.3GB is allocated in device memory and ~500MB in host memory (Catalyst 14.6 and Catalyst 14.7 on Windows 7 64-bit). Performance decreases by a factor of ~10 when some of the memory is allocated in host memory. On an R9 280X I observed similar behaviour from time to time, but a reboot always helped (my R9 290 never showed this). Some 270X cards show this behaviour permanently on some computers.

      Here are GPU-Z screenshots: http://savepic.su/4323086.png and http://savepic.su/4321038.png

       

      Does anyone know why not all of the memory is allocated in device memory?

       

      -- NaN

        • Re: r7 270x memory allocation issue
          dipak

          Hi,

          As you know, memory allocation depends on the memory flags passed to the clCreateBuffer/clCreateImage APIs. I have a few questions: 1) Which memory flags are you using during allocation? 2) When did you take that memory snapshot, i.e. just after the allocations, or while the app was running and each buffer was in use? 3) Does the memory usage remain constant, or does it vary over time?

          It would be a great help if you could share a sample project that manifests this problem, so that we can reproduce it on our side.

           

          Regards,

            • Re: r7 270x memory allocation issue
              nan

              Hi,

               All buffers are allocated with clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &error_code) because only parts of the 4KB buffers occasionally need to be shared with the host. The buffers are allocated directly after initializing the OpenCL platform and spawning a thread for every independent data set (two data sets run on each R7 270X). Each data set needs 896MB + 4KB of device memory. The buffers are kept until the application is closed because maximum efficiency is required. The issue seems to show up when most of the device memory is allocated. GPU-Z displays constant memory usage during runtime.

               My R9 280X (about 2.7GB is allocated there) works correctly after boot-up, but it sometimes shows the same behaviour as the R7 270X after the application has been started and closed several times. Only logging out or rebooting the PC helps. Some (but not all) R7 270X cards have this issue permanently.

               I could try to extract the relevant code from my application, but I don't think it would be very useful because the code is quite simple. Here is a link to the binary: Dropbox - OpenCL PTS miner v1.3

               

              -- NaN

                • Re: r7 270x memory allocation issue
                  dipak

                   Thanks for your reply and for sharing the link. BTW, which file should I use? I tried to run clpts-v1.3_win_x86-64_Catalyst14 and found that it depends on some command-line settings. What command-line arguments should I use to test this behavior? [A short explanation of the arguments would help me experiment with the app.]

                   Meanwhile, if possible, please try to provide a sample project that captures this issue so that, if needed, I can forward it directly to the concerned team.

                   

                  Regards,

                    • Re: r7 270x memory allocation issue
                      nan

                       Thanks for your reply. The app does work with Catalyst 13.12 (download a *_Catalyst13 file), but I don't recommend it because that driver shows serious bugs in combination with the R9 290(X). If Catalyst 14.x is used, you should download a *_Catalyst14 file. I would not recommend Catalyst 14.4 because it is slow. I've obtained the best performance with Catalyst 14.7rc1 on Windows. The Windows and Linux drivers have different performance characteristics. The performance of the R9 290(X) still does not match my expectations based on the performance of the R9 280X, but it has improved with the latest Windows drivers and when using more data sets per GPU. I guess the performance delta is caused by the driver, but it might also be caused by undocumented differences between the Tahiti and Hawaii micro-architectures (e.g. Hawaii has 8 ACEs while Tahiti has only 2).

                       The performance of the app is inversely proportional to the ART (average round time) value, i.e. lower is better. The app consists of three OpenCL kernels, and their execution times are displayed after the ART value, i.e. ART: xx ms (xx/xx/xx). The first kernel has high register usage, high LDS usage and also high computational density. The other kernels are bottlenecked by LDS. One round consumes 1.125GB of write bandwidth and produces 1GB-1.125GB of effective read bandwidth. The app won't do any work without an internet connection or if no valid login data for a mining pool is specified. You can use the following command-line arguments for testing (make sure that the GPU does not run into thermal throttling):

                      • GPUs with 1GB RAM: clpts_x86-64 -o 6 -u nanpic.test:x -g 1 -a 2
                      • GPUs with 2GB RAM: clpts_x86-64 -o 6 -u nanpic.test:x -g 2 -a 2
                      • GPUs with 3GB RAM: clpts_x86-64 -o 6 -u nanpic.test:x -g 2 -a 0
                      • R9 290(X): clpts_x86-64 -o 6 -u nanpic.test:x -g 4 -a 2 with Catalyst 14.7x or clpts_x86-64 -o 6 -u nanpic.test:x -g 2 -a 1 with older Catalyst drivers

                       The -g parameter controls the number of independent data sets per GPU, and the -d parameter may be used if you want to select specific GPUs in a multi-GPU environment. The app is discussed in the thread "fast AMD OpenCL PTS miner released" (Mining Support & Pools, BitShares Forum); discussion of the current version starts at page 60.

                       

                       When I have some time, I will try to create a simple test case so that I can share the code with you.

                        • Re: r7 270x memory allocation issue
                          dipak

                          Hi,

                          My apologies for this delayed reply.

                           Please find attached screenshots of GPU-Z and the Task Manager taken while I ran clpts_x86-64.exe with the options "-o 6 -u nanpic.test:x -d 0 -g 1 -a 2".

                           In your first post you mentioned the following:

                          On some 270x I observed that only ~1.3GB are allocated in device memory and ~500MB in host memory

                           I guess you were referring to the "Memory Usage (Dynamic)" value shown by GPU-Z as "host memory". But is it really host memory, or a portion of GPU memory? The Task Manager doesn't show any such memory usage.

                           

                          Regards,

                            • Re: r7 270x memory allocation issue
                              nan

                              Hi,

                               

                               Host memory that is allocated by the driver won't show up in the Task Manager as allocated by a user process. The driver allocates pinned memory in kernel space so that the GPU can swap out its memory buffers or access them via DMA (I'm not a Windows expert and don't know the details of AMD's implementation). Normally, "Memory Usage (Dynamic)" only grows when the amount of used GPU memory exceeds the size of the GPU RAM. This is a strong hint that "Memory Usage (Dynamic)" is GPU memory swapped out to host memory. Most likely the amount of physically free memory shown in the Task Manager decreases when "dynamic" memory is used.

                               Your screenshot indicates that some GPU RAM is reserved by other applications (most likely the browser), because the memory usage is ~170MB greater than the size of the allocated buffers. But you have reproduced my issue: most of the memory is allocated as "dynamic" memory, which reduces performance by nearly an order of magnitude. The allocated buffers are 512MB, 256MB and 128MB (accessed only by the GPU) and 4KB (accessed by both the GPU and the CPU).

                               

                              -- NaN