nibal
Challenger

Memory corruption in latest crimson driver 15.302?

Using Ubuntu 14.04 and valgrind:

==00:00:01:30.014 4949== Invalid write of size 8
==00:00:01:30.014 4949== at 0x4C2F5F3: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:01:30.014 4949== by 0xB3B6154: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.014 4949== by 0xB3B899A: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.014 4949== by 0xB3BB911: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB3C5F98: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB3C6667: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB3C6838: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB329CFB: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB35182C: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB351BD6: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB2F2DAC: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB2F312C: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB29115E: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB30115B: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0x60BA181: start_thread (pthread_create.c:312)
==00:00:01:30.015 4949== by 0x63CA47C: clone (clone.S:111)
==00:00:01:30.015 4949== Address 0x7f126ed63000 is not stack'd, malloc'd or (recently) free'd

This could be a false positive, but I'm getting some unexplained crashes :(

0 Likes
1 Solution

1. The freqs array reallocation in the code looks broken. The code below writes the new entry before checking capacity, and passes an element count to realloc() where a byte count is required:

    freqs[fidx].hz = sig.hz;
    freqs[fidx++].ts = ts;
    if (fidx >= maxfreqs) {
        maxfreqs += 16;
        freqs = realloc(freqs, maxfreqs);
    }

It should be something like:

    if (fidx == (maxfreqs-1)) {
        maxfreqs += 16;
        freqs = realloc(freqs, maxfreqs * sizeof(freq_t));
    }
    freqs[fidx].hz = sig.hz;
    freqs[fidx++].ts = ts;

2. You call run_fft() with pass=8, and that accesses the cl_event ndr that was already destroyed on pass=7.

On pass=7 you destroy ndr:

    if (pass == MAXPASS - 1) {
        if ((err = waitForEventAndRelease(&ndr)) != SUCCESS)

On pass=8 you then access the destroyed object and corrupt memory:

    if (pass && (err = waitForEventAndRelease(&ndr)) != SUCCESS)
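As a minimal sketch of the first point, the growth step can be wrapped in a small helper that checks capacity before writing, sizes the block in bytes, and keeps the old pointer if realloc() fails. The freq_list struct, the freq_list_push() name, and the double field types are assumptions made for illustration; the thread only shows the fields hz and ts and the variables fidx and maxfreqs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type: the thread only shows the fields hz and ts,
 * so the double types here are an assumption. */
typedef struct {
    double hz;
    double ts;
} freq_t;

/* Growable array state (illustrative; not the original program's layout). */
typedef struct {
    freq_t *items;
    size_t  count;    /* plays the role of fidx */
    size_t  capacity; /* plays the role of maxfreqs */
} freq_list;

/* Append one entry, growing by 16 elements when the array is full.
 * Returns 0 on success, -1 if realloc fails (the list stays valid). */
int freq_list_push(freq_list *list, double hz, double ts)
{
    assert(list != NULL);

    /* Check capacity BEFORE writing, so the write is never out of bounds. */
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity + 16;
        /* realloc takes a size in bytes: element count times element size. */
        freq_t *tmp = realloc(list->items, new_cap * sizeof(freq_t));
        if (tmp == NULL)
            return -1; /* the old block is still owned by list->items */
        list->items = tmp;
        list->capacity = new_cap;
    }

    list->items[list->count].hz = hz;
    list->items[list->count].ts = ts;
    list->count++;
    return 0;
}
```

A caller would start from a zeroed list (items = NULL, count = 0, capacity = 0), call freq_list_push() for each detected frequency, and free(list.items) at the end. Assigning the result of realloc() to a temporary first avoids leaking the old block when the allocation fails.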

14 Replies
nibal
Challenger

Actually this is much worse than I thought. This is real. The corruption existed in Catalyst 15.201, 15.101 and everything in between. Not only did it make the ocl part of the program unstable, it also destabilized anything else it came in contact with in the same program. Please fix urgently. Is there a place to download older Catalyst versions?

I will have to comment out all ocl parts and stop linking to the libraries until it is fixed.

0 Likes

Hi Nibal

I am having the team look into this. I will get back to you by the end of the week.

Greg

0 Likes

Hi Greg,

And thanks for helping out.

This is a tough corruption to track. Since it is very reproducible on my system, I will try to narrow it down to specific ocl calls and update the ticket.

BR

Nikos

0 Likes

What I need is: which motherboard, processor, system BIOS version and GPU you have (if possible, the vBIOS number for the GPU), and which OS and version you are running (if Linux, the kernel version). Also, if you have a test app that causes the issue, please get it to us.

greg

0 Likes

My info so far:

Motherboard: Gigabyte Technology Co., Ltd. 970A-UD3P

BIOS: UEFI DualBIOS, American Megatrends Inc. version: F1

CPU: AMD FX(tm)-8320 Eight-Core Processor, @1.4 GHz

GPU: AMD Radeon (TM) R9 270, Pitcairn, Curacao Pro, Platform ID: 0x7f7227b45a18 (as reported by clinfo)

OS: Ubuntu 14.04 x64, 3.13.0-49 generic

ocl SDK: 3.0, working ocl 1.2

Working on test app (Need to reboot).

BR

Nikos

0 Likes

Using printfs and the valgrind output I was able to bracket the Invalid write between NDRangeKernel and completing the kernel.

But here is the catch: It happens only on the first time the kernel is executed.

My kernel is a slightly modified version of the kernel in your FFT sample.

Unfortunately, validating your FFT sample will take more time. Each time I run it through valgrind it crashes my PC. Running FFT alone does not crash my PC, but I do not run it for long, so the corruption may not show. I will have to compile the latest valgrind from sources and retest.

The pattern suggests that this is not specific to the kernel itself (otherwise it would appear on every kernel pass), but general to the kernel mechanism. I hope it can be reproduced with any kernel. I'm compiling with the defaults (ocl 1.2).

BR,

Nikos

0 Likes

It doesn't show in your FFT sample. Will have to create a test app with my kernel

0 Likes

Hi Greg,

Please use the attached fft.tgz to recreate the problem. val.out includes 2 more Invalid reads, which were not in the original valgrind report. You might want to check on them, too. These contain the full stack trace. Instructions for recreating the bug:

-> tar -xzvf fft.tgz      // This creates a directory fft/ with the sources
-> cd fft
-> make db
-> fft                    // Optional. This terminates with a core dump on my system. Be careful: on yours it could crash your PC
-> make clean
-> make db
-> script
-> valgrind fft           // Best to use the latest valgrind, 3.11.0, built from sources; otherwise it might crash your PC. Can be interrupted with <ctrl-C>, but in my case it core dumps before I get the chance and generates vgcore.<pid>
-> exit                   // Ends the script session

Let me know if you can recreate the problem.

TIA

Nikos

0 Likes

Hi,

Oops, sorry about that. These 2 were artifacts of the test file and account for the additional invalid reads. I should have checked it more carefully before shipping it out, but it was quite complex. However, fixing them still leaves the original memory corruption. I imagine you must have recreated it by now.

Thank you for your feedback,

Nikos

0 Likes

When the team fixed these two issues, the corruption was no longer there.

It only crashed with the two issues present.

Greg

0 Likes

These were artifacts of the test file as noted initially. I'm still getting the original problem:

==00:00:00:49.169 4191== Invalid write of size 8
==00:00:00:49.170 4191== at 0x4C2F0F3: memcpy@GLIBC_2.2.5 (vg_replace_strmem.c:1017)
==00:00:00:49.170 4191== by 0x68E8154: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68EA99A: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68ED911: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68F7F98: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68F8667: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68F8838: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x685BCFB: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x688382C: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x6883BD6: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x6824DAC: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x682512C: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x67C315E: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x683315B: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x4E3F181: start_thread (pthread_create.c:312)
==00:00:00:49.170 4191== by 0x565C47C: clone (clone.S:111)
==00:00:00:49.170 4191== Address 0x7fbf95aaf000 is not stack'd, malloc'd or (recently) free'd

Is this a false positive? Do you not see it in your valgrind?

BR,

Nikos

0 Likes

Hi,

After a week of extensive testing of the test program, without any problems, I can confirm that the valgrind report is a false positive.

Whatever other memory problems I have in my original program are due to my code.

Thank you,

Nikos

0 Likes