Using Ubuntu 14.04 and valgrind:
==00:00:01:30.014 4949== Invalid write of size 8
==00:00:01:30.014 4949== at 0x4C2F5F3: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:01:30.014 4949== by 0xB3B6154: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.014 4949== by 0xB3B899A: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.014 4949== by 0xB3BB911: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB3C5F98: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB3C6667: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB3C6838: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB329CFB: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB35182C: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB351BD6: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB2F2DAC: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB2F312C: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB29115E: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0xB30115B: ??? (in /usr/lib/libamdocl64.so)
==00:00:01:30.015 4949== by 0x60BA181: start_thread (pthread_create.c:312)
==00:00:01:30.015 4949== by 0x63CA47C: clone (clone.S:111)
==00:00:01:30.015 4949== Address 0x7f126ed63000 is not stack'd, malloc'd or (recently) free'd
Could be a false positive, but I'm getting some unexplained crashes:(
Actually, this is much worse than I thought. This is real. That corruption existed in Catalyst 15.201, 15.101 and anything in between. Not only did it destabilize the ocl part of the program, but also anything else it came in contact with in the same program. Please fix this urgently. Is there a place to download older Catalyst versions?
I will have to comment out all ocl parts and stop linking to the libraries until it is fixed.
Hi Nibal
I am having the team look into this. I will get back to you by the end of the week.
Greg
Hi Greg,
And thanks for helping out.
This is a tough corruption to track. Since it is very reproducible on my system, I will try to narrow it down to specific ocl calls and update the ticket.
BR
Nikos
What I need to know is which motherboard, processor, system BIOS version and GPU you are running, the VBIOS number for the GPU if possible, and which OS and version (for Linux, the kernel version). Also, if you have a test app that causes the issue, please get it to us.
greg
My info so far:
Motherboard: Gigabyte Technology Co., Ltd. 970A-UD3P
BIOS: UEFI DualBIOS, American Megatrends Inc. version: F1
CPU: AMD FX(tm)-8320 Eight-Core Processor, @1.4 GHz
GPU: AMD Radeon (TM) R9 270, Pitcairn, Curacao Pro, Platform ID: 0x7f7227b45a18 (as reported by clinfo)
OS: Ubuntu 14.04 x64, 3.13.0-49 generic
ocl SDK: 3.0, working ocl 1.2
Working on test app (Need to reboot).
BR
Nikos
Using printfs and the valgrind output, I was able to bracket the invalid write between NDRangeKernel and the kernel completing.
But here is the catch: it happens only the first time the kernel is executed.
My kernel is a slightly modified version of the kernel in your FFT sample.
Unfortunately, validating your FFT sample will take more time.
Each time I run it through valgrind it crashes my PC. Running FFT alone does not crash my PC,
but I do not run it for long, so the corruption may not show. I will have to compile the latest valgrind
from sources and retest.
The pattern suggests that this is not specific to the kernel itself (otherwise it would appear on every kernel pass),
but general to the kernel mechanism. I hope it can be reproduced with any kernel. I'm compiling with the defaults (ocl 1.2).
BR,
Nikos
It doesn't show in your FFT sample. I will have to create a test app with my kernel.
Hi Greg,
Please use the attached fft.tgz to recreate the problem. val.out includes 2 more invalid reads that were not in the original valgrind report. You might want to check those, too. They contain the full stack trace. Instructions for recreating the bug:
-> tar -xzvf fft.tgz // This will create a directory fft/ with the sources
-> cd fft
-> make db
-> fft // Optional. This terminates with a core dump on my system. Be careful: on yours it could crash your PC
-> make clean
-> make db
-> script
-> valgrind fft // Best to use the latest valgrind, 3.11.0, built from sources. Otherwise it might crash your PC. Can be interrupted with <ctrl-C>, but in my case it core dumps before I get the chance to and generates vgcore.<pid>
-> exit // Ends the script session
Let me know if you can recreate the problem.
TIA
Nikos
1. The freqs array reallocation in the code looks broken: realloc() takes a size in bytes, but it is passed an element count. The code below:
    freqs[fidx].hz = sig.hz;
    freqs[fidx++].ts = ts;
    if (fidx >= maxfreqs) {
        maxfreqs += 16;
        freqs = realloc(freqs, maxfreqs);
    }
Should be something like:
    if (fidx == (maxfreqs-1)) {
        maxfreqs += 16;
        freqs = realloc(freqs, maxfreqs * sizeof(freq_t));
    }
    freqs[fidx].hz = sig.hz;
    freqs[fidx++].ts = ts;
2. You call run_fft() with pass=8, which accesses the cl_event ndr that was already destroyed on pass=7.
On pass=7 you destroy ndr:
    if (pass == MAXPASS - 1) {
        if ((err = waitForEventAndRelease(&ndr)) != SUCCESS)
On pass=8 you then access the destroyed object and corrupt memory:
    if (pass && (err = waitForEventAndRelease(&ndr)) != SUCCESS)
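For reference, here is a minimal sketch of the growth pattern from point 1, written so that a failed realloc() does not lose the old buffer. It is not the code from fft.tgz. The freq_t layout (two doubles), the freq_list wrapper and freq_list_push() are assumptions made up for illustration; only the names hz, ts and the step of 16 come from the snippet above.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed layout: the thread only shows that freq_t has .hz and .ts members. */
typedef struct {
    double hz;
    double ts;
} freq_t;

/* Hypothetical wrapper around the freqs / fidx / maxfreqs trio. */
typedef struct {
    freq_t *items;    /* freqs in the thread */
    size_t  count;    /* entries in use (fidx) */
    size_t  capacity; /* entries allocated (maxfreqs) */
} freq_list;

/* Append one entry, growing the buffer by 16 entries when it is full.
 * Returns 0 on success, -1 if realloc fails (the list is left unchanged). */
static int freq_list_push(freq_list *list, double hz, double ts)
{
    assert(list != NULL);

    if (list->count == list->capacity) {
        size_t new_cap = list->capacity + 16;
        /* realloc takes a byte count, so scale by the element size. */
        freq_t *grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1; /* list->items is still valid and still owned by list */
        list->items = grown;
        list->capacity = new_cap;
    }

    list->items[list->count].hz = hz;
    list->items[list->count].ts = ts;
    list->count++;
    return 0;
}
```

Starting from a zero-initialized freq_list works because realloc(NULL, n) behaves like malloc(n). Assigning the result to a temporary first is the design choice here: writing freqs = realloc(freqs, ...) directly leaks the old block if the call returns NULL.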
Hi,
Oops, sorry about that. These 2, the additional invalid reads, were artifacts of the test file. I should have checked it more carefully before shipping it out, but it was quite complex. However, fixing them still leaves the original memory corruption. I imagine you must have recreated it by now.
Thank you for your feedback,
Nikos
When the team fixed these two issues, the corruption was no longer there.
It only crashed with the two issues present.
Greg
These were artifacts of the test file as noted initially. I'm still getting the original problem:
==00:00:00:49.169 4191== Invalid write of size 8
==00:00:00:49.170 4191== at 0x4C2F0F3: memcpy@GLIBC_2.2.5 (vg_replace_strmem.c:1017)
==00:00:00:49.170 4191== by 0x68E8154: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68EA99A: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68ED911: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68F7F98: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68F8667: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x68F8838: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x685BCFB: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x688382C: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x6883BD6: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x6824DAC: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x682512C: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x67C315E: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x683315B: ??? (in /usr/lib/libamdocl64.so)
==00:00:00:49.170 4191== by 0x4E3F181: start_thread (pthread_create.c:312)
==00:00:00:49.170 4191== by 0x565C47C: clone (clone.S:111)
==00:00:00:49.170 4191== Address 0x7fbf95aaf000 is not stack'd, malloc'd or (recently) free'd
==00:00:00:49.170 4191==
Is this a false positive? Do you not see it in your valgrind?
BR,
Nikos
Hi,
After a week of extensive testing of the test program without any problems, I can confirm that the valgrind report is a false positive.
Whatever other memory problems I have in my original program are due to my code.
Thank you,
Nikos