So it seems that I passed GSoC’s first evaluation, which means I will still get to work on porting relibc to Gentoo. The whole task feels like a game. But no, don’t jump to the wrong conclusion. It resembles a game in the way things get much harder as you progress.
When I first started, I was bragging (at least to myself) about how cool I was, debugging gdb with gdb and fixing complex issues. But back then I was also telling myself: “How much harder could it possibly get? This is probably the tip of the iceberg.” If I learned something, it is to never ask that question, because it can get harder and in fact it does.
So while I was working on trying to get gcc to compile, I faced some issues regarding
instead of the one built on top of
relibc. Easy fix! No problem with that. I just replaced /usr/bin/ld with a symlink to
Then I noticed that there is a library that is loaded at runtime as a plugin. Plugins are bad news: they need working
dlsym, all of which are barely implemented in relibc. To be precise, they are implemented,
but the way they work is that
dlopen will load your library into the global linker symbol resolution space,
will do nothing, and
dlsym will just resolve the symbol from the global symbol space regardless of your handle.
The problem with that approach is that you may have
plugin2.so, both of which use the same
callback function name to point to completely different functionality. So if both symbols share the same symbol space,
they would conflict.
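To make the conflict concrete, here is a minimal Rust sketch of the two lookup strategies. It is a toy model, not relibc's code: the Library type, the plugin_a.so and plugin_b.so names, the numeric addresses and both resolve functions are hypothetical and exist only for illustration.

```rust
use std::collections::BTreeMap;

/// Toy model of a loaded library: a name plus the symbols it exports.
/// Addresses are plain numbers standing in for real function pointers.
struct Library {
    name: &'static str,
    symbols: BTreeMap<&'static str, usize>,
}

/// Two hypothetical plugins that both export a symbol named `callback`.
fn demo_libs() -> Vec<Library> {
    vec![
        Library {
            name: "plugin_a.so",
            symbols: BTreeMap::from([("callback", 0x1000)]),
        },
        Library {
            name: "plugin_b.so",
            symbols: BTreeMap::from([("callback", 0x2000)]),
        },
    ]
}

/// Global lookup: the handle is ignored and the first definition found wins,
/// so both plugins end up with the same `callback`.
fn resolve_global(libs: &[Library], _handle: usize, symbol: &str) -> Option<usize> {
    libs.iter().find_map(|lib| lib.symbols.get(symbol).copied())
}

/// Handle-scoped lookup: only the library identified by `handle` is searched.
fn resolve_scoped(libs: &[Library], handle: usize, symbol: &str) -> Option<usize> {
    libs.get(handle)?.symbols.get(symbol).copied()
}

fn main() {
    let libs = demo_libs();
    for (handle, lib) in libs.iter().enumerate() {
        println!(
            "{}: global = {:?}, scoped = {:?}",
            lib.name,
            resolve_global(&libs, handle, "callback"),
            resolve_scoped(&libs, handle, "callback"),
        );
    }
}
```

In this model the global lookup returns the first plugin's address for both handles, while the scoped lookup gives each plugin its own callback, which is the behavior a plugin host expects from dlsym.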
So I started implementing them (with no reference implementation to check how it is done) several times, and each time the result was so broken that the easiest solution was git reset --hard. Eventually I realized that what I should do is make small modifications to the current ld.so implementation and make sure nothing breaks after each one.
Slowly but steadily I started refactoring ld.so, and it wasn’t as hard as what I was trying to do before. Before I knew it, ld.so was ready to be part of dlopen (and friends). I implemented dlopen and the rest of the functions, and tried a test.
I got a weird segfault when I ran it. This was the stack trace:
#0 0x7ffff7e69f86 in core::ptr::write (dst=0x6b1758, src=0x5) at libcore/ptr/mod.rs:800
#1 0x7ffff7c8b7e1 in alloc::collections::btree::node::slice_insert (slice=..., idx=0x0, val=0x5) at liballoc/collections/btree/node.rs:1681
#2 0x7ffff7ca6310 in alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert_fit (self=0x7fffffffcd50, key=0x5, val=...) at liballoc/collections/btree/node.rs:1035
#3 0x00007ffff7ca6e12 in alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert (self=..., key=0x5, val=...) at liballoc/collections/btree/node.rs:1052
#4 0x7ffff7d8b9e9 in alloc::collections::btree::map::VacantEntry<K,V>::insert (self=..., value=...) at liballoc/collections/btree/map.rs:2423
#5 0x7ffff7d843b5 in alloc::collections::btree::map::BTreeMap<K,V>::insert (self=0x555556467ad8, key=0x5, value=...) at liballoc/collections/btree/map.rs:803
#6 0x7ffff7d66e82 in relibc::ld_so::linker::Linker::load_library (self=0x555556467a10, name=...) at src/ld_so/linker.rs:140
#7 0x7ffff7dbd403 in dlopen (cfilename=0x555555556000 "./lib1.so", flags=0x0) at src/header/dlfcn/mod.rs:66
#8 0x55555555523e in main () at libcaller.c:3
#9 0x7ffff7d39b62 in relibc_start (sp=0x7fffffffe2d0) at src/start.rs:145
#10 0x55555555508c in _start () at src/crt0/src/lib.rs:12
And there was an error message stored on the stack:
Tried to shrink to a larger capacitycapacity overflowa formatting trait implementation returned an error.
A segfault in BTreeMap::insert() is a very bad bug no matter how you look at it. I kept trying multiple things, hoping to find out what to do, and nothing worked. Then I noticed something strange: the segfault happens only when that tree is
dlopen, but during the early stage where ld.so runs, such a segfault never happens.
I tried running both tests side by side, trying to figure out what was causing the segfault. The first thing I noticed is that they run in different address spaces. The thing is that we have two copies of libc: one of them is statically linked into ld.so and the other one is dynamically loaded. And for some reason, they can’t be used interchangeably even when they are ABI compatible, so I learned the hard way that neither relibc nor the Rust stdlib provide referential
So I had to maintain a list of callbacks that is initialized by ld.so and becomes part of the TLS, so that dlopen and friends can call into ld.so’s version of libc. And in fact, it worked.
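The callback idea can be sketched roughly as follows. This is only an illustration under assumptions, not the actual relibc code: the LinkerCallbacks struct, its two fields, the ld_so_* functions and the *_shim wrappers are all hypothetical names, and a thread_local slot stands in for wherever the table really lives in the thread's TLS data.

```rust
use std::cell::Cell;

/// Hypothetical table of entry points owned by the ld.so copy of the linker.
struct LinkerCallbacks {
    load_library: fn(&str) -> Option<usize>,
    get_sym: fn(usize, &str) -> Option<usize>,
}

thread_local! {
    // Per-thread slot, standing in for a field in the thread's TLS block.
    static CALLBACKS: Cell<Option<&'static LinkerCallbacks>> = Cell::new(None);
}

/// Stand-in for the loader inside ld.so: returns a fake handle.
fn ld_so_load(name: &str) -> Option<usize> {
    if name.is_empty() { None } else { Some(1) }
}

/// Stand-in for symbol lookup inside ld.so: knows one symbol of one library.
fn ld_so_get_sym(handle: usize, symbol: &str) -> Option<usize> {
    if handle == 1 && symbol == "callback" { Some(0x1000) } else { None }
}

/// The table that ld.so would fill in once at startup.
static LD_SO_CALLBACKS: LinkerCallbacks = LinkerCallbacks {
    load_library: ld_so_load,
    get_sym: ld_so_get_sym,
};

/// Called by the ld.so side during startup to publish its entry points.
fn register_callbacks(table: &'static LinkerCallbacks) {
    CALLBACKS.with(|slot| slot.set(Some(table)));
}

/// dlopen-like wrapper: owns no linker state, it only forwards to the table.
fn dlopen_shim(name: &str) -> Option<usize> {
    let table = CALLBACKS.with(|slot| slot.get())?;
    (table.load_library)(name)
}

/// dlsym-like wrapper, forwarding in the same way.
fn dlsym_shim(handle: usize, symbol: &str) -> Option<usize> {
    let table = CALLBACKS.with(|slot| slot.get())?;
    (table.get_sym)(handle, symbol)
}

fn main() {
    register_callbacks(&LD_SO_CALLBACKS);
    let handle = dlopen_shim("./lib1.so").expect("load failed");
    println!("callback resolved to {:?}", dlsym_shim(handle, "callback"));
}
```

The point of the indirection is that the dynamically loaded copy of libc never touches the linker's data structures itself. Every operation runs inside the code that created them.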
Like this bug, there were a few more where even a debugger cannot directly help you figure out what is going wrong. For example, any program that used dlopen didn’t exit cleanly and segfaulted when main was done. That bug is very interesting because the segfault was during pthread deinitialization, code that I didn’t have to deal with at all. Yet it was crashing. In these cases, I keep commenting out code until the crash disappears, so I get to know which line of code is somehow causing the crash. Of course, that technique breaks functionality, but in exchange, you get to pinpoint the root cause of the segfault.