APP CRASH post-detach after re-attach
My #2157 re-attach tree tries to add a re-attach loop to the burst_threads test, but it sometimes fails for 32-bit:
Output:
----------------------------------------------------------
Running |/work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads|
CMake Error at /work/dr/git/src/suite/tests/runmulti.cmake:98 (message):
*** cmd failed (1): pre-DR init
pre-DR start
pre-DR detach
pre-DR start
pre-DR detach
<ERROR: master_signal_handler with no siginfo (i#26?): tid=12885, sig=11>
***
Call Stack (most recent call first):
/work/dr/git/src/suite/tests/runmulti.cmake:107 (process_cmdline)
<end of output>
Test time = 3.39 sec
I can repro this manually every couple of attempts:
> clients/bin32/tool.drcacheoff.burst_threads
pre-DR init
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<unable to determine lib path for cross-arch execve>
<Initial options = -client_lib ';;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Detaching from process, entering final cleanup>
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Initial options = -client_lib ';;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Detaching from process, entering final cleanup>
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Initial options = -client_lib ';;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<ERROR: master_signal_handler with no siginfo (i#26?): tid=21659, sig=11>
<Detaching from process, entering final cleanup>
There are only 2 threads (there's no synch stopping the other app threads from exiting while I attach the debugger, so it's expected):
(gdb) thread apply all bt
Thread 2 (Thread 0xf4b56b40 (LWP 21659)):
#0 syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
#1 0xf4b56b40 in ?? ()
#2 0x0812dd7b in notify (priority=SYSLOG_ERROR, internal=true, synch=true, substitution_num=0, prefix=0x8370d2d "SYSLOG_ERROR",
fmt=0x8370ce8 "ERROR: master_signal_handler with no siginfo (i#26?): tid=%d, sig=%d") at /work/dr/git/src/core/utils.c:1983
#3 0x082b53ed in master_signal_handler_C (xsp=0xf4b55d3c "\320\vv\367\v") at /work/dr/git/src/core/unix/signal.c:4431
#4 0x08283c67 in master_signal_handler () at /work/dr/git/src/core/arch/x86/x86.asm:1451
#5 0xf4b55d3c in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Thread 1 (Thread 0xf735d700 (LWP 21653)):
#0 0xf7760be9 in __kernel_vsyscall ()
#1 0xf771e20d in pthread_join () from /lib/libpthread.so.0
#2 0x08057061 in main ()
(gdb) p /x *sc
$1 = {
gs = 0x63,
__gsh = 0x0,
fs = 0x73,
__fsh = 0x0,
es = 0x2b,
__esh = 0x0,
ds = 0x2b,
__dsh = 0x0,
edi = 0xf4b56b40,
esi = 0x0,
ebp = 0x84370e1b,
esp = 0xf4b56238,
ebx = 0xf7731000,
edx = 0x0,
ecx = 0xf3325930,
eax = 0x0,
trapno = 0xe,
err = 0x14,
eip = 0x3d3424d0,
cs = 0x23,
__csh = 0x0,
eflags = 0x10203,
esp_at_signal = 0xf4b56238,
ss = 0x2b,
__ssh = 0x0,
fpstate = 0xf4b55e50,
oldmask = 0x0,
cr2 = 0x3d3424d0
}
(gdb) x/4i sc->eip
0x3d3424d0: Cannot access memory at address 0x3d3424d0
(gdb) p heapmgmt
No symbol "heapmgmt" in current context.
(gdb) p heapmgt
$2 = (heap_management_t *) 0x83ea700 <temp_heapmgt>
(gdb) dps sc->esp sc->esp+180
0xf4b56238 0x00000000 No symbol matches (void *)$retaddr.
0xf4b5623c 0x3ff00000 No symbol matches (void *)$retaddr.
0xf4b56240 0x77feab6d No symbol matches (void *)$retaddr.
0xf4b56244 0x39db8cf4 No symbol matches (void *)$retaddr.
0xf4b56248 0x00000000 No symbol matches (void *)$retaddr.
0xf4b5624c 0x00000000 No symbol matches (void *)$retaddr.
0xf4b56250 0x88000000 No symbol matches (void *)$retaddr.
0xf4b56254 0x3d3424d0 No symbol matches (void *)$retaddr.
0xf4b56258 0x00000000 No symbol matches (void *)$retaddr.
0xf4b5625c 0x00000000 No symbol matches (void *)$retaddr.
0xf4b56260 0x02000000 No symbol matches (void *)$retaddr.
0xf4b56264 0x41a00000 No symbol matches (void *)$retaddr.
0xf4b56268 0xf7597000 _GLOBAL_OFFSET_TABLE_ in section .got.plt of /lib/libm.so.6
0xf4b5626c 0xf755675c sin + 6156 in section .text of /lib/libm.so.6
(gdb) x/4i 0xf755675c-5
0xf7556757 <sin+6151>: call 0xf757b930 <__dubsin>
0xf755675c <sin+6156>: add $0x20,%esp
0xf755675f <sin+6159>: fldl 0x10(%esp)
So the app thread is in do_some_work() calling sin() presumably in the code cache.
Yet here's another one -- it's at a native PC but eax seems to be messed up:
(gdb) up 3
#3 0x082b5501 in master_signal_handler_C (xsp=0xf732cd3c "\320+s\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431 SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/4i sc->eip
0xf754dc3c <__dubsin+780>: fldl -0xe5e0(%eax)
0xf754dc42 <__dubsin+786>: fldl -0x13ab8(%eax)
0xf754dc48 <__dubsin+792>: fmul %st(1),%st
0xf754dc4a <__dubsin+794>: fld %st(1)
(gdb) x/4wx sc->eax
0x2000c006: Cannot access memory at address 0x2000c006
(gdb) x/4wx sc->eax-0xe5e0
0x1fffda26: Cannot access memory at address 0x1fffda26
(gdb) p /x *sc
$2 = {
edi = 0xf732db40,
esi = 0xf732d2d0,
ebp = 0xf732d338,
esp = 0xf732d208,
ebx = 0xf732d2d0,
edx = 0x0,
ecx = 0xe31de8e8,
eax = 0x2000c006,
eip = 0xf754dc3c,
}
translating thread 29600 to 0xf754dc08
thread 29600 will resume as native
thread 29600 detaching to PC 0xf754dc08
<...>
waiting for thread 29601
<ERROR: master_signal_handler with no siginfo (i#26?): tid=29600, sig=11>
c3c is not far from c08: one cbr in between, reasonable that it got that
far and then crashed.
All floats: no eax ref in between.
Another one has a bad ebp:
(gdb) x/1i sc->eip
0x8056e3e <do_some_work(int)+51>: fstpl -0x10(%ebp)
(gdb) x/4wx sc->ebp-0x10
0xf75f2a60 <s7>: 0x5816c78d 0xbf2a019f 0xda488820 0xbc21ac06
f73c9000-f758f000 r-xp 00000000 09:00 1355064 /usr/lib/libc-2.21.so
OK I got the bad PC again with more info now:
translating thread 31961 to 0xf7542993 eax=0xf755e000 ecx=0xf330d270 ebx=0xf4b1d2d0 edx=0x00000000 edi=0xf4b1db40 esi=0xf4b1d2d0 ebp=0xf4b1d240 esp=0xf4b1d208
thread 31961 will resume as native
thread 31961 detaching to PC 0xf7542993 eax=0xf755e000 ecx=0xf330d270 ebx=0xf4b1d2d0 edx=0x00000000 edi=0xf4b1db40 esi=0xf4b1d2d0 ebp=0xf4b1d240 esp=0xf4b1d208
<ERROR: master_signal_handler with no siginfo (i#26?): tid=31961, sig=11>
(gdb) x/2i sc->eip
0xb9c97f20: Cannot access memory at address 0xb9c97f20
(gdb) x/10i 0xf7542993
0xf7542993 <__dubsin+99>: fmul %st(2),%st
0xf7542995 <__dubsin+101>: fxch %st(2)
0xf7542997 <__dubsin+103>: fmul %st(1),%st
(gdb) p /x *sc
edi = 0xf4b1db40,
esi = 0x0,
ebp = 0x60004b1d,
esp = 0xf4b1d248,
ebx = 0xf76f8000,
edx = 0x0,
ecx = 0xf330d270,
eax = 0x0,
eip = 0xb9c97f20,
eflags = 0x10242,
(gdb) dps 0xf4b1d208 0xf4b1d248+16
0xf4b1d208 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d20c 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d210 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d214 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d218 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d21c 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d220 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d224 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d228 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d22c 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d230 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d234 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d238 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d23c 0x00000000 No symbol matches (void *)$retaddr.
0xf4b1d240 0x46505845 No symbol matches (void *)$retaddr.
0xf4b1d244 0xb9c97f20 No symbol matches (void *)$retaddr.
0xf4b1d248 0x00000000 No symbol matches (void *)$retaddr.
So it did a ret to get to the bad PC. Except the stack frame is bigger than 0x3c! It's 0x64, as shown below. ?!?!
(gdb) disas 0xf7542993
Dump of assembler code for function __dubsin:
0xf7542930 <+0>: push %ebx
0xf7542931 <+1>: call 0xf7517162 <__x86.get_pc_thunk.ax>
0xf7542936 <+6>: add $0x1b6ca,%eax
0xf754293b <+11>: sub $0x60,%esp
<...>
0xf7542fd6 <+1702>: add $0x60,%esp
0xf7542fd9 <+1705>: pop %ebx
0xf7542fda <+1706>: ret
Running in DR -loglevel 3 in a single thread: the next OP_ret after entering __subsin is that 0x60-frame ret.
Another one, this one with an even smaller frame shift:
translating thread 2376 to 0xf7512a4e eax=0xf752e000 ecx=0xf32db220 ebx=0xf72f22d0 edx=0x00000000 edi=0xf72f2b40 esi=0xf72f22d0 ebp=0xf72f2218 esp=0xf72f2208
thread 2376 will resume as native
thread 2376 detaching to PC 0xf7512a4e eax=0xf752e000 ecx=0xf32db220 ebx=0xf72f22d0 edx=0x00000000 edi=0xf72f2b40 esi=0xf72f22d0 ebp=0xf72f2218 esp=0xf72f2208
<ERROR: master_signal_handler with no siginfo (i#26?): tid=2376, sig=11>
(gdb) x/2i sc->eip
0x36d18f7a: Cannot access memory at address 0x36d18f7a
(gdb) x/4i 0xf7512a4e
0xf7512a4e <__dubsin+286>: faddp %st,%st(2)
0xf7512a50 <__dubsin+288>: fld %st(0)
0xf7512a52 <__dubsin+290>: fadd %st(2),%st
0xf7512a54 <__dubsin+292>: fsubr %st,%st(1)
(gdb) p /x *sc
edi = 0xf72f2b40,
esi = 0x0,
ebp = 0x2bd4f354,
esp = 0xf72f2220,
ebx = 0xf76c8000,
edx = 0x0,
ecx = 0xf32db220,
eax = 0x1,
eip = 0x36d18f7a,
eflags = 0x10202,
(gdb) dps 0xf72f2208 0xf72f2220+8
0xf72f2208 0x40cfb0ee No symbol matches (void *)$retaddr.
0xf72f220c 0x3d14f217 No symbol matches (void *)$retaddr.
0xf72f2210 0x7fffd024 No symbol matches (void *)$retaddr.
0xf72f2214 0x39a8d3eb No symbol matches (void *)$retaddr.
0xf72f2218 0x2bd4f354 No symbol matches (void *)$retaddr.
0xf72f221c 0x36d18f7a No symbol matches (void *)$retaddr.
0xf72f2220 0x00000000 No symbol matches (void *)$retaddr.
Look at ebp as well, for both: it's very different from esp, when __dubsin does not write or even read it.
Here's another bad-ebp one, once again pointing into a library code section, but this one has the bad ebp value in the xl8:
thread 4857 detaching to PC 0xf7556a6a eax=0xf7572000 ecx=0xf771fee0 ebx=0xf5b332d0 edx=0x00000000 edi=0xf5b33b40 esi=0xf5b332d0 ebp=0xf7563a70 esp=0xf5b33208
(gdb) x/2i sc->eip
0x8056e3e <do_some_work(int)+51>: fstpl -0x10(%ebp)
0x8056e41 <do_some_work(int)+54>: addl $0x1,-0x14(%ebp)
(gdb) p /x *sc
edi = 0xf5b33b40,
esi = 0x0,
ebp = 0xf7563a70,
esp = 0xf5b33320,
ebx = 0xf770c000,
edx = 0x0,
ecx = 0xf7564180,
eax = 0x1,
eip = 0x8056e3e,
Removing the re-attach loop: I cannot repro any crashes in 200+ runs. With re-attach: I can get a crash just about every other time. Yet the detach context, etc. should all be per-thread: there's no global state here, right?
Is it corruption from some other thread? Or is it some sigreturn thing? 1 thread instead of 8: 200 iters w/ no crashes.
Check out this one: I added an extra re-attach, and on the final detach, we only have 3 other threads, and 2 of them both end up crashing, one with a bad PC and one with a bad eax which it sets itself:
<Attached to 8 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (6365)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (6365)>
synched with 4 threads
translating thread 6365 to 0xf7787be9 eax=0xfffffffc ecx=0x00000000 ebx=0xf6b81ba8 edx=0x000018df edi=0xf6b81b40 esi=0x00000000 ebp=0x000000f0 esp=0xffed2380
thread 6365 will resume as native
thread 6365 detaching sc=0xe32feba0 to PC 0xf7787be9 eax=0xfffffffc ecx=0x00000000 ebx=0xf6b81ba8 edx=0x000018df edi=0xf6b81b40 esi=0x00000000 ebp=0x000000f0 esp=0xffed2380
translating thread 6367 to 0xf75a2dff eax=0x2000c006 ecx=0xf335bad0 ebx=0xf6b812d0 edx=0x00000000 edi=0xf6b81b40 esi=0xf6b812d0 ebp=0xf6b81338 esp=0xf6b81208
thread 6367 will resume as native
thread 6367 detaching sc=0xe3396ba0 to PC 0xf75a2dff eax=0x2000c006 ecx=0xf335bad0 ebx=0xf6b812d0 edx=0x00000000 edi=0xf6b81b40 esi=0xf6b812d0 ebp=0xf6b81338 esp=0xf6b81208
translating thread 6373 to 0x082dc670 eax=0x00000000 ecx=0x00000000 ebx=0x083a43ec edx=0x00000049 edi=0xf3b7bb40 esi=0x00000000 ebp=0xf3b7a490 esp=0xf3b7a470
thread 6373 will resume as native
waiting for thread 6365
waiting for thread 6367
thread 6373 detaching sc=0xe34c6ba0 to PC 0x082dc670 eax=0x00000000 ecx=0x00000000 ebx=0x083a43ec edx=0x00000049 edi=0xf3b7bb40 esi=0x00000000 ebp=0xf3b7a490 esp=0xf3b7a470
waiting for thread 6373
<ERROR: master_signal_handler with no siginfo (i#26?): tid=6367, sig=11>
cleaning up thread 6365
cleaning up thread 6367
cleaning up thread 6373
<Detaching from process, entering final cleanup>
(gdb) thread 2
[Switching to thread 2 (Thread 0xf3b7bb40 (LWP 6373))]
#0 syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
176 pop REG_XDI
(gdb) up 3
#3 0x082b559e in master_signal_handler_C (xsp=0xf3b7ad3c "\320{x\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431 SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/2i sc->eip
0x3d2466dc: Cannot access memory at address 0x3d2466dc
(gdb) thread 3
[Switching to thread 3 (Thread 0xf6b81b40 (LWP 6367))]
#0 syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
176 pop REG_XDI
(gdb) up 3
#3 0x082b559e in master_signal_handler_C (xsp=0xf6b80d3c "\320{x\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431 SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/2i sc->eip
0xf75a3024 <__dubsin+1780>: faddl -0xe5b8(%eax)
0xf75a302a <__dubsin+1786>: jmp 0xf75a2e23 <__dubsin+1267>
(gdb) p /x $eax
$1 = 0x0
(gdb) x/2i 0xf75a2dff
0xf75a2dff <__dubsin+1231>: fucomi %st(4),%st
0xf75a2e01 <__dubsin+1233>: fstp %st(4)
(gdb) x/2i 0xf7787be9
0xf7787be9 <__kernel_vsyscall+9>: pop %ebp
0xf7787bea <__kernel_vsyscall+10>: pop %edx
(gdb) x/2i 0x082dc670
0x82dc670 <syscall_0args+6>: pop %edi
0x82dc671 <syscall_0args+7>: pop %esi
The bad PC one was sent to a DR address?!?
(gdb) x/10i 0x082dc670 -2
0x82dc66e <syscall_0args+4>: int $0x80
0x82dc670 <syscall_0args+6>: pop %edi
0x82dc671 <syscall_0args+7>: pop %esi
0x82dc672 <syscall_0args+8>: pop %ebp
0x82dc673 <syscall_0args+9>: pop %ebx
0x82dc674 <syscall_0args+10>: ret
(gdb) dps 0xf3b7a470 0xf3b7a470+20
0xf3b7a470 0xf3b7bb40 No symbol matches (void *)$retaddr.
0xf3b7a474 0x00000000 No symbol matches (void *)$retaddr.
0xf3b7a478 0xf3b7a490 No symbol matches (void *)$retaddr.
0xf3b7a47c 0x083a43ec _GLOBAL_OFFSET_TABLE_ in section .got.plt of /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads
0xf3b7a480 0x0829b311 os_thread_yield + 28 in section .text of /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads
Xref self-interp #2161 (closed). Doesn't that require an app timer signal, though? Would SIGCHLD trigger it? Isn't SIGCHLD only going to the parent?
At this point I'm adding the re-attach best-effort code I have and disabling this test, to make progress, and filing this as a bug to fix separately.