APP CRASH post-detach after re-attach

My #2157 re-attach tree tries to add a re-attach loop to the burst_threads test, but it sometimes fails for 32-bit:

Output:
----------------------------------------------------------
Running |/work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads|
CMake Error at /work/dr/git/src/suite/tests/runmulti.cmake:98 (message):
  *** cmd failed (1): pre-DR init
  pre-DR start
  pre-DR detach
  pre-DR start
  pre-DR detach
  <ERROR: master_signal_handler with no siginfo (i#26?): tid=12885, sig=11>
  ***
Call Stack (most recent call first):
  /work/dr/git/src/suite/tests/runmulti.cmake:107 (process_cmdline)
<end of output>
Test time =   3.39 sec

I can repro this manually every couple of attempts:

> clients/bin32/tool.drcacheoff.burst_threads
pre-DR init
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<unable to determine lib path for cross-arch execve>
<Initial options = -client_lib ';;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Detaching from process, entering final cleanup>
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Initial options = -client_lib ';;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Detaching from process, entering final cleanup>
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Initial options = -client_lib ';;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<ERROR: master_signal_handler with no siginfo (i#26?): tid=21659, sig=11>
<Detaching from process, entering final cleanup>

There are only 2 threads (there's no synch stopping the other app threads from exiting while I attach the debugger, so it's expected):

(gdb) thread apply all bt

Thread 2 (Thread 0xf4b56b40 (LWP 21659)):
#0  syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
#1  0xf4b56b40 in ?? ()
#2  0x0812dd7b in notify (priority=SYSLOG_ERROR, internal=true, synch=true, substitution_num=0, prefix=0x8370d2d "SYSLOG_ERROR", 
    fmt=0x8370ce8 "ERROR: master_signal_handler with no siginfo (i#26?): tid=%d, sig=%d") at /work/dr/git/src/core/utils.c:1983
#3  0x082b53ed in master_signal_handler_C (xsp=0xf4b55d3c "\320\vv\367\v") at /work/dr/git/src/core/unix/signal.c:4431
#4  0x08283c67 in master_signal_handler () at /work/dr/git/src/core/arch/x86/x86.asm:1451
#5  0xf4b55d3c in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 0xf735d700 (LWP 21653)):
#0  0xf7760be9 in __kernel_vsyscall ()
#1  0xf771e20d in pthread_join () from /lib/libpthread.so.0
#2  0x08057061 in main ()

(gdb) p /x *sc
$1 = {
  gs = 0x63, 
  __gsh = 0x0, 
  fs = 0x73, 
  __fsh = 0x0, 
  es = 0x2b, 
  __esh = 0x0, 
  ds = 0x2b, 
  __dsh = 0x0, 
  edi = 0xf4b56b40, 
  esi = 0x0, 
  ebp = 0x84370e1b, 
  esp = 0xf4b56238, 
  ebx = 0xf7731000, 
  edx = 0x0, 
  ecx = 0xf3325930, 
  eax = 0x0, 
  trapno = 0xe, 
  err = 0x14, 
  eip = 0x3d3424d0, 
  cs = 0x23, 
  __csh = 0x0, 
  eflags = 0x10203, 
  esp_at_signal = 0xf4b56238, 
  ss = 0x2b, 
  __ssh = 0x0, 
  fpstate = 0xf4b55e50, 
  oldmask = 0x0, 
  cr2 = 0x3d3424d0
}
(gdb) x/4i sc->eip
   0x3d3424d0:  Cannot access memory at address 0x3d3424d0
(gdb) p heapmgmt
No symbol "heapmgmt" in current context.
(gdb) p heapmgt
$2 = (heap_management_t *) 0x83ea700 <temp_heapmgt>

(gdb) dps sc->esp sc->esp+180
0xf4b56238  0x00000000  No symbol matches (void *)$retaddr.
0xf4b5623c  0x3ff00000  No symbol matches (void *)$retaddr.
0xf4b56240  0x77feab6d  No symbol matches (void *)$retaddr.
0xf4b56244  0x39db8cf4  No symbol matches (void *)$retaddr.
0xf4b56248  0x00000000  No symbol matches (void *)$retaddr.
0xf4b5624c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b56250  0x88000000  No symbol matches (void *)$retaddr.
0xf4b56254  0x3d3424d0  No symbol matches (void *)$retaddr.
0xf4b56258  0x00000000  No symbol matches (void *)$retaddr.
0xf4b5625c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b56260  0x02000000  No symbol matches (void *)$retaddr.
0xf4b56264  0x41a00000  No symbol matches (void *)$retaddr.
0xf4b56268  0xf7597000  _GLOBAL_OFFSET_TABLE_ in section .got.plt of /lib/libm.so.6
0xf4b5626c  0xf755675c  sin + 6156 in section .text of /lib/libm.so.6

(gdb) x/4i 0xf755675c-5
   0xf7556757 <sin+6151>:       call   0xf757b930 <__dubsin>
   0xf755675c <sin+6156>:       add    $0x20,%esp
   0xf755675f <sin+6159>:       fldl   0x10(%esp)

So the app thread is in do_some_work() calling sin() presumably in the code cache.

Yet here's another one -- it's at a native PC but eax seems to be messed up:

(gdb) up 3
#3  0x082b5501 in master_signal_handler_C (xsp=0xf732cd3c "\320+s\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431                SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/4i sc->eip
   0xf754dc3c <__dubsin+780>:   fldl   -0xe5e0(%eax)
   0xf754dc42 <__dubsin+786>:   fldl   -0x13ab8(%eax)
   0xf754dc48 <__dubsin+792>:   fmul   %st(1),%st
   0xf754dc4a <__dubsin+794>:   fld    %st(1)
(gdb) x/4wx sc->eax
0x2000c006:     Cannot access memory at address 0x2000c006
(gdb) x/4wx sc->eax-0xe5e0
0x1fffda26:     Cannot access memory at address 0x1fffda26
(gdb) p /x *sc
$2 = {
  edi = 0xf732db40, 
  esi = 0xf732d2d0, 
  ebp = 0xf732d338, 
  esp = 0xf732d208, 
  ebx = 0xf732d2d0, 
  edx = 0x0, 
  ecx = 0xe31de8e8, 
  eax = 0x2000c006, 
  eip = 0xf754dc3c, 
}

translating thread 29600 to 0xf754dc08
thread 29600 will resume as native
thread 29600 detaching to PC 0xf754dc08
<...>
waiting for thread 29601
<ERROR: master_signal_handler with no siginfo (i#26?): tid=29600, sig=11>

c3c is not far from c08: one cbr in between, reasonable that it got that
far and then crashed.
All floats: no eax ref in between.

Another one has a bad ebp:

(gdb) x/1i sc->eip
   0x8056e3e <do_some_work(int)+51>:    fstpl  -0x10(%ebp)
(gdb) x/4wx sc->ebp-0x10
0xf75f2a60 <s7>:        0x5816c78d      0xbf2a019f      0xda488820      0xbc21ac06

f73c9000-f758f000 r-xp 00000000 09:00 1355064                            /usr/lib/libc-2.21.so

OK I got the bad PC again with more info now:

translating thread 31961 to 0xf7542993 eax=0xf755e000 ecx=0xf330d270 ebx=0xf4b1d2d0 edx=0x00000000 edi=0xf4b1db40 esi=0xf4b1d2d0 ebp=0xf4b1d240 esp=0xf4b1d208
thread 31961 will resume as native
thread 31961 detaching to PC 0xf7542993 eax=0xf755e000 ecx=0xf330d270 ebx=0xf4b1d2d0 edx=0x00000000 edi=0xf4b1db40 esi=0xf4b1d2d0 ebp=0xf4b1d240 esp=0xf4b1d208

<ERROR: master_signal_handler with no siginfo (i#26?): tid=31961, sig=11>

(gdb) x/2i sc->eip
   0xb9c97f20:  Cannot access memory at address 0xb9c97f20
(gdb) x/10i 0xf7542993
   0xf7542993 <__dubsin+99>:    fmul   %st(2),%st
   0xf7542995 <__dubsin+101>:   fxch   %st(2)
   0xf7542997 <__dubsin+103>:   fmul   %st(1),%st
(gdb) p /x *sc
  edi = 0xf4b1db40, 
  esi = 0x0, 
  ebp = 0x60004b1d, 
  esp = 0xf4b1d248, 
  ebx = 0xf76f8000, 
  edx = 0x0, 
  ecx = 0xf330d270, 
  eax = 0x0, 
  eip = 0xb9c97f20, 
  eflags = 0x10242, 
(gdb) dps 0xf4b1d208 0xf4b1d248+16
0xf4b1d208  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d20c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d210  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d214  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d218  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d21c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d220  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d224  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d228  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d22c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d230  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d234  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d238  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d23c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d240  0x46505845  No symbol matches (void *)$retaddr.
0xf4b1d244  0xb9c97f20  No symbol matches (void *)$retaddr.
0xf4b1d248  0x00000000  No symbol matches (void *)$retaddr.

So it did a ret to get to the bad PC. Except the stack frame is bigger than 0x3c! It's 0x64, as shown below. ?!?!

(gdb) disas 0xf7542993
Dump of assembler code for function __dubsin:
   0xf7542930 <+0>:     push   %ebx
   0xf7542931 <+1>:     call   0xf7517162 <__x86.get_pc_thunk.ax>
   0xf7542936 <+6>:     add    $0x1b6ca,%eax
   0xf754293b <+11>:    sub    $0x60,%esp
   <...>
   0xf7542fd6 <+1702>:  add    $0x60,%esp
   0xf7542fd9 <+1705>:  pop    %ebx
   0xf7542fda <+1706>:  ret

Running in DR -loglevel 3 in a single thread: the next OP_ret after entering __subsin is that 0x60-frame ret.

Another one, this one with an even smaller frame shift:

translating thread 2376 to 0xf7512a4e eax=0xf752e000 ecx=0xf32db220 ebx=0xf72f22d0 edx=0x00000000 edi=0xf72f2b40 esi=0xf72f22d0 ebp=0xf72f2218 esp=0xf72f2208
thread 2376 will resume as native
thread 2376 detaching to PC 0xf7512a4e eax=0xf752e000 ecx=0xf32db220 ebx=0xf72f22d0 edx=0x00000000 edi=0xf72f2b40 esi=0xf72f22d0 ebp=0xf72f2218 esp=0xf72f2208
<ERROR: master_signal_handler with no siginfo (i#26?): tid=2376, sig=11>
(gdb) x/2i sc->eip
   0x36d18f7a:  Cannot access memory at address 0x36d18f7a
(gdb) x/4i 0xf7512a4e
   0xf7512a4e <__dubsin+286>:   faddp  %st,%st(2)
   0xf7512a50 <__dubsin+288>:   fld    %st(0)
   0xf7512a52 <__dubsin+290>:   fadd   %st(2),%st
   0xf7512a54 <__dubsin+292>:   fsubr  %st,%st(1)
(gdb) p /x *sc
  edi = 0xf72f2b40, 
  esi = 0x0, 
  ebp = 0x2bd4f354, 
  esp = 0xf72f2220, 
  ebx = 0xf76c8000, 
  edx = 0x0, 
  ecx = 0xf32db220, 
  eax = 0x1, 
  eip = 0x36d18f7a, 
  eflags = 0x10202, 
(gdb) dps 0xf72f2208 0xf72f2220+8
0xf72f2208  0x40cfb0ee  No symbol matches (void *)$retaddr.
0xf72f220c  0x3d14f217  No symbol matches (void *)$retaddr.
0xf72f2210  0x7fffd024  No symbol matches (void *)$retaddr.
0xf72f2214  0x39a8d3eb  No symbol matches (void *)$retaddr.
0xf72f2218  0x2bd4f354  No symbol matches (void *)$retaddr.
0xf72f221c  0x36d18f7a  No symbol matches (void *)$retaddr.
0xf72f2220  0x00000000  No symbol matches (void *)$retaddr.

Look at ebp as well, for both: it's very different from esp, when __dubsin does not write or even read it.

Here's another bad-ebp one, once again pointing into a library code section, but this one has the bad ebp value in the xl8:

thread 4857 detaching to PC 0xf7556a6a eax=0xf7572000 ecx=0xf771fee0 ebx=0xf5b332d0 edx=0x00000000 edi=0xf5b33b40 esi=0xf5b332d0 ebp=0xf7563a70 esp=0xf5b33208

(gdb) x/2i sc->eip
   0x8056e3e <do_some_work(int)+51>:    fstpl  -0x10(%ebp)
   0x8056e41 <do_some_work(int)+54>:    addl   $0x1,-0x14(%ebp)
(gdb) p /x *sc
  edi = 0xf5b33b40, 
  esi = 0x0, 
  ebp = 0xf7563a70, 
  esp = 0xf5b33320, 
  ebx = 0xf770c000, 
  edx = 0x0, 
  ecx = 0xf7564180, 
  eax = 0x1, 
  eip = 0x8056e3e,

Removing the re-attach loop: I cannot repro any crashes in 200+ runs. With re-attach: I can get a crash just about every other time. Yet the detach context, etc. should all be per-thread: there's no global state here, right?

Is it corruption from some other thread? Or is it some sigreturn thing? 1 thread instead of 8: 200 iters w/ no crashes.

Check out this one: I added an extra re-attach, and on the final detach, we only have 3 other threads, and 2 of them both end up crashing, one with a bad PC and one with a bad eax which it sets itself:

<Attached to 8 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (6365)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (6365)>
synched with 4 threads
translating thread 6365 to 0xf7787be9 eax=0xfffffffc ecx=0x00000000 ebx=0xf6b81ba8 edx=0x000018df edi=0xf6b81b40 esi=0x00000000 ebp=0x000000f0 esp=0xffed2380
thread 6365 will resume as native
thread 6365 detaching sc=0xe32feba0 to PC 0xf7787be9 eax=0xfffffffc ecx=0x00000000 ebx=0xf6b81ba8 edx=0x000018df edi=0xf6b81b40 esi=0x00000000 ebp=0x000000f0 esp=0xffed2380
translating thread 6367 to 0xf75a2dff eax=0x2000c006 ecx=0xf335bad0 ebx=0xf6b812d0 edx=0x00000000 edi=0xf6b81b40 esi=0xf6b812d0 ebp=0xf6b81338 esp=0xf6b81208
thread 6367 will resume as native
thread 6367 detaching sc=0xe3396ba0 to PC 0xf75a2dff eax=0x2000c006 ecx=0xf335bad0 ebx=0xf6b812d0 edx=0x00000000 edi=0xf6b81b40 esi=0xf6b812d0 ebp=0xf6b81338 esp=0xf6b81208
translating thread 6373 to 0x082dc670 eax=0x00000000 ecx=0x00000000 ebx=0x083a43ec edx=0x00000049 edi=0xf3b7bb40 esi=0x00000000 ebp=0xf3b7a490 esp=0xf3b7a470
thread 6373 will resume as native
waiting for thread 6365
waiting for thread 6367
thread 6373 detaching sc=0xe34c6ba0 to PC 0x082dc670 eax=0x00000000 ecx=0x00000000 ebx=0x083a43ec edx=0x00000049 edi=0xf3b7bb40 esi=0x00000000 ebp=0xf3b7a490 esp=0xf3b7a470
waiting for thread 6373
<ERROR: master_signal_handler with no siginfo (i#26?): tid=6367, sig=11>
cleaning up thread 6365
cleaning up thread 6367
cleaning up thread 6373
<Detaching from process, entering final cleanup>

(gdb) thread 2
[Switching to thread 2 (Thread 0xf3b7bb40 (LWP 6373))]
#0  syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
176             pop      REG_XDI
(gdb) up 3
#3  0x082b559e in master_signal_handler_C (xsp=0xf3b7ad3c "\320{x\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431                SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/2i sc->eip
   0x3d2466dc:  Cannot access memory at address 0x3d2466dc
(gdb) thread 3
[Switching to thread 3 (Thread 0xf6b81b40 (LWP 6367))]
#0  syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
176             pop      REG_XDI
(gdb) up 3
#3  0x082b559e in master_signal_handler_C (xsp=0xf6b80d3c "\320{x\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431                SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/2i sc->eip
   0xf75a3024 <__dubsin+1780>:  faddl  -0xe5b8(%eax)
   0xf75a302a <__dubsin+1786>:  jmp    0xf75a2e23 <__dubsin+1267>
(gdb) p /x $eax
$1 = 0x0
(gdb) x/2i 0xf75a2dff
   0xf75a2dff <__dubsin+1231>:  fucomi %st(4),%st
   0xf75a2e01 <__dubsin+1233>:  fstp   %st(4)
(gdb) x/2i 0xf7787be9 
   0xf7787be9 <__kernel_vsyscall+9>:    pop    %ebp
   0xf7787bea <__kernel_vsyscall+10>:   pop    %edx
(gdb) x/2i 0x082dc670 
   0x82dc670 <syscall_0args+6>: pop    %edi
   0x82dc671 <syscall_0args+7>: pop    %esi

The bad PC one was sent to a DR address?!?

(gdb) x/10i 0x082dc670 -2
   0x82dc66e <syscall_0args+4>: int    $0x80
   0x82dc670 <syscall_0args+6>: pop    %edi
   0x82dc671 <syscall_0args+7>: pop    %esi
   0x82dc672 <syscall_0args+8>: pop    %ebp
   0x82dc673 <syscall_0args+9>: pop    %ebx
   0x82dc674 <syscall_0args+10>:        ret    
(gdb) dps 0xf3b7a470 0xf3b7a470+20
0xf3b7a470  0xf3b7bb40  No symbol matches (void *)$retaddr.
0xf3b7a474  0x00000000  No symbol matches (void *)$retaddr.
0xf3b7a478  0xf3b7a490  No symbol matches (void *)$retaddr.
0xf3b7a47c  0x083a43ec  _GLOBAL_OFFSET_TABLE_ in section .got.plt of /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads
0xf3b7a480  0x0829b311  os_thread_yield + 28 in section .text of /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads

Xref self-interp #2161 (closed). Doesn't that require an app timer signal, though? Would SIGCHLD trigger it? Isn't SIGCHLD only going to the parent?

At this point I'm adding the re-attach best-effort code I have and disabling this test, to make progress, and filing this as a bug to fix separately.