What if debuggers couldn’t help?

So it seems that I passed GSoC’s first evaluation, which means I will still get to work on porting relibc to Gentoo. The whole task feels like a game. But don’t jump to the wrong conclusion: it resembles a game in the way things get much harder as you progress.

When I first started, I was bragging (at least to myself) about how cool I was, debugging gdb with gdb and fixing complex issues. But back then I was also telling myself: “How much harder could it possibly get? This is probably the peak.” If I learned something, it is to never ask that question, because it can get harder and, in fact, it does.

So while I was trying to get gcc to compile, I ran into an issue with collect2 using /usr/bin/ld instead of the ld built on top of relibc. Easy fix, no problem with that: I just replaced /usr/bin/ld with a symlink to my ld.

Then I noticed there is a library that is loaded at runtime as a plugin. Plugins are bad news. They need working dlopen, dlclose and dlsym, all of which are barely implemented in relibc. To be precise, they are implemented, but the way they work is that dlopen loads your library into the linker’s global symbol resolution space, dlclose does nothing, and dlsym just resolves the symbol from the global symbol space regardless of the handle you pass.

The problem with that approach is that you may have plugin1.so and plugin2.so, both of which use the same callback function name to point to completely different functionality. If both symbols share the same symbol space, they conflict.
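
To make the conflict concrete, here is a toy model in Rust. It is not relibc’s code: `GlobalScope`, `ScopedLinker`, `plugin` and the addresses are hypothetical names invented for illustration, and the "first definition wins" rule in the global scope is an assumption of this sketch. It only contrasts one shared symbol space with one symbol table per handle.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a loaded shared object: symbol name -> address.
type SymbolTable = BTreeMap<String, usize>;

/// One shared namespace: every library's symbols are merged together.
#[derive(Default)]
struct GlobalScope {
    symbols: SymbolTable,
}

impl GlobalScope {
    /// Merge a library into the shared namespace.
    /// Assumption of this sketch: the first definition wins,
    /// so a later library with the same symbol name is shadowed.
    fn load(&mut self, lib: &SymbolTable) {
        for (name, addr) in lib {
            self.symbols.entry(name.clone()).or_insert(*addr);
        }
    }

    /// Lookup ignores which library the caller meant.
    fn sym(&self, name: &str) -> Option<usize> {
        self.symbols.get(name).copied()
    }
}

/// One namespace per handle: lookups are scoped to the library behind the handle.
#[derive(Default)]
struct ScopedLinker {
    next_handle: usize,
    libs: BTreeMap<usize, SymbolTable>,
}

impl ScopedLinker {
    /// Register a library and hand back an opaque handle for it.
    fn dlopen(&mut self, lib: SymbolTable) -> usize {
        let handle = self.next_handle;
        self.next_handle += 1;
        self.libs.insert(handle, lib);
        handle
    }

    /// Resolve a symbol only inside the library that the handle refers to.
    fn dlsym(&self, handle: usize, name: &str) -> Option<usize> {
        self.libs.get(&handle)?.get(name).copied()
    }

    /// Drop the library; returns false if the handle was unknown.
    fn dlclose(&mut self, handle: usize) -> bool {
        self.libs.remove(&handle).is_some()
    }
}

/// Build a fake plugin that exports a single `callback` symbol at `addr`.
fn plugin(addr: usize) -> SymbolTable {
    let mut table = SymbolTable::new();
    table.insert("callback".to_string(), addr);
    table
}
```

With two plugins built as `plugin(0x1000)` and `plugin(0x2000)`, the global scope can only ever return one `callback`, while the scoped linker returns a different address for each handle.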

So I started implementing them (with no reference implementation to check how it is done) multiple times, and each time the result was so broken that the easier solution was git reset --hard. Eventually I realized that what I should do is make small modifications to the current ld.so implementation and make sure nothing breaks after each one.

Slowly but steadily I started refactoring ld.so, and it wasn’t as hard as what I was trying to do before. Before I knew it, ld.so was ready to be part of dlopen (and friends). I implemented dlopen and the rest of the functions, and tried a test.

I got a weird segfault when trying to test. That was the stack trace:

#0  0x7ffff7e69f86 in core::ptr::write (dst=0x6b1758, src=0x5) at libcore/ptr/mod.rs:800
#1  0x7ffff7c8b7e1 in alloc::collections::btree::node::slice_insert (slice=..., idx=0x0, val=0x5)
    at liballoc/collections/btree/node.rs:1681
#2  0x7ffff7ca6310 in alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert_fit (self=0x7fffffffcd50, key=0x5, val=...)
    at liballoc/collections/btree/node.rs:1035
#3  0x00007ffff7ca6e12 in alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert (self=..., key=0x5, val=...)
    at liballoc/collections/btree/node.rs:1052
#4  0x7ffff7d8b9e9 in alloc::collections::btree::map::VacantEntry<K,V>::insert (self=..., value=...)
    at liballoc/collections/btree/map.rs:2423
#5  0x7ffff7d843b5 in alloc::collections::btree::map::BTreeMap<K,V>::insert (self=0x555556467ad8, key=0x5, value=...)
    at liballoc/collections/btree/map.rs:803
#6  0x7ffff7d66e82 in relibc::ld_so::linker::Linker::load_library (self=0x555556467a10, name=...) at src/ld_so/linker.rs:140
#7  0x7ffff7dbd403 in dlopen (cfilename=0x555555556000 "./lib1.so", flags=0x0) at src/header/dlfcn/mod.rs:66
#8  0x55555555523e in main () at libcaller.c:3
#9  0x7ffff7d39b62 in relibc_start (sp=0x7fffffffe2d0) at src/start.rs:145
#10 0x55555555508c in _start () at src/crt0/src/lib.rs:12

There was also an error message stored on the stack: “Tried to shrink to a larger capacitycapacity overflowa formatting trait implementation returned an error”, which looks like several Rust panic messages run together. A segfault in BTreeMap::insert() is a very bad bug no matter how you look at it. I kept trying different things to figure out what to do, and nothing worked. Then I noticed something strange: the segfault happens only when that tree is accessed from dlopen. During the early stage, where ld.so runs, it never happens.

I tried running both tests side by side, trying to figure out what was causing the segfault. The first thing I noticed is that they run at different address ranges. The thing is that we have 2 copies of libc: one of them is statically linked into ld.so and the other one is dynamically loaded. And for some reason, they can’t be used interchangeably even though they are ABI compatible, so I learned the hard way that neither relibc nor the Rust stdlib provides referential transparency.

So I had to maintain a list of callbacks that is initialized by ld.so and becomes part of TLS, so that dlopen and friends can call into the ld.so copy of libc. And in fact, it worked.
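
The shape of that workaround can be sketched roughly like this. Again, this is a simplified illustration and not the actual relibc code: `LinkerCallbacks`, `ld_so_init`, the `ld_so_*` functions and the `*_sketch` wrappers are all hypothetical names, the "ld.so side" is faked with trivial functions, and a Rust `thread_local!` stands in for the real TLS area.

```rust
use std::cell::Cell;

/// Hypothetical table of entry points that belong to the ld.so copy of the code.
#[derive(Clone, Copy)]
struct LinkerCallbacks {
    load_library: fn(&str) -> Result<usize, String>,
    get_sym: fn(usize, &str) -> Option<usize>,
    unload: fn(usize),
}

thread_local! {
    // Stand-in for the TLS slot: empty until the "ld.so side" fills it in.
    static CALLBACKS: Cell<Option<LinkerCallbacks>> = Cell::new(None);
}

// ---- "ld.so side": owns the linker state (faked here) ----

fn ld_so_load_library(name: &str) -> Result<usize, String> {
    if name.is_empty() {
        Err("empty library name".to_string())
    } else {
        Ok(1) // pretend every library gets handle 1
    }
}

fn ld_so_get_sym(handle: usize, name: &str) -> Option<usize> {
    if handle == 1 && name == "callback" {
        Some(0x1000) // made-up address
    } else {
        None
    }
}

fn ld_so_unload(_handle: usize) {
    // Nothing to release in this fake implementation.
}

/// Runs once at startup and publishes the callbacks.
fn ld_so_init() {
    CALLBACKS.with(|slot| {
        slot.set(Some(LinkerCallbacks {
            load_library: ld_so_load_library,
            get_sym: ld_so_get_sym,
            unload: ld_so_unload,
        }))
    });
}

// ---- "libc side": never touches linker state, only calls through the table ----

fn dlopen_sketch(name: &str) -> Result<usize, String> {
    let callbacks = CALLBACKS
        .with(|slot| slot.get())
        .ok_or_else(|| "linker callbacks not initialized".to_string())?;
    (callbacks.load_library)(name)
}

fn dlsym_sketch(handle: usize, name: &str) -> Option<usize> {
    let callbacks = CALLBACKS.with(|slot| slot.get())?;
    (callbacks.get_sym)(handle, name)
}

fn dlclose_sketch(handle: usize) -> bool {
    match CALLBACKS.with(|slot| slot.get()) {
        Some(callbacks) => {
            (callbacks.unload)(handle);
            true
        }
        None => false,
    }
}
```

The point of the indirection is that every operation on the linker’s data structures runs in code from one copy only; the other copy just forwards the call.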

Like this one, there were a few more bugs where even a debugger cannot directly help you figure out what is going wrong. For example, any program that used dlopen didn’t exit cleanly and segfaulted once main was done. That bug is very interesting because the segfault was during pthread deinitialization, code that I hadn’t touched at all, yet it was crashing. In these cases, I keep commenting out code until the crash disappears, so I get to know which line of code is somehow causing it. Of course, that technique breaks functionality, but in exchange you get to pinpoint the root cause of the segfault.
