A tale of a Thread Local Storage (TLS) bug

While porting relibc to Gentoo, I reached the point of building a relibc-based gcc that would in turn emit relibc-based binaries. There I hit what seemed to be a strange and extremely nasty bug in relibc-based binaries. The makefile was running commands like genhooks, and they were segfaulting. That sounded like an ordinary bug, except that when I ran the same program with the same arguments by hand, it didn't crash. So that was the bug: make consistently produced crashes, but running the commands manually did not. I needed some way to debug the process while it was running under make.

What was special about this bug was that every relibc-linked binary crashed this way. So I thought: OK, I need to intercept the process at a very early stage, but not every process, only the relibc-based ones. I went to the start function of relibc's ld.so and injected an infinite loop, here: jmp here. This would stall any ld.so invocation until I intervened manually.
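The stalling trick can be sketched in Rust as a spin loop that waits to be released from outside. This is a variation on the idea, not the code that went into ld.so: the original was a bare `jmp` to itself, escaped by editing the instruction pointer, while this sketch spins on a flag that a debugger (or, in the demo, another thread) can flip. The name `stall_until_released` is hypothetical.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical helper: block the caller until `released` becomes true.
/// In a real loader this would sit at the very top of the start function,
/// so every process using that loader parks here until someone attaches.
pub fn stall_until_released(released: &AtomicBool) {
    while !released.load(Ordering::Acquire) {
        // Tell the CPU we are busy-waiting; this does not yield to the OS.
        hint::spin_loop();
    }
}
```

Spinning on a flag trades the one-instruction simplicity of `here: jmp here` for an exit that does not require patching registers by hand.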

Then I attached gdb to the stalled relibc-based binary and changed the instruction pointer by hand so that it would escape the infinite loop. Then everything was revealed: the bug was in ld.so, and the makefile was setting LD_LIBRARY_PATH manually. That was why I got different results (a crash under make and normal execution otherwise). At this point I realized there was no need for the hackish breakpoint mechanism; I only needed to run relibc binaries with the same environment variables. Things went smoothly from there.
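Reproducing the makefile's environment outside of make can be sketched as below. This is only an illustration: the helper names are hypothetical, the program, arguments, and library path are whatever the makefile actually used, and signal number 11 for SIGSEGV is a Linux assumption.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, ExitStatus};

/// Hypothetical helper: run `program` with LD_LIBRARY_PATH set the way the
/// makefile sets it, inheriting the rest of the current environment.
pub fn run_with_library_path(
    program: &str,
    args: &[&str],
    library_path: &str,
) -> io::Result<ExitStatus> {
    Command::new(program)
        .args(args)
        .env("LD_LIBRARY_PATH", library_path)
        .status()
}

/// True if the child was killed by SIGSEGV (signal 11 on Linux).
pub fn died_from_segfault(status: &ExitStatus) -> bool {
    status.signal() == Some(11)
}
```

Calling `run_with_library_path` on the crashing command with and without the makefile's value, and checking `died_from_segfault` each time, is enough to confirm that the variable is what makes the difference.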

Long story short, the bug came from the fact that the access function sets the errno variable when it doesn't find the library it is looking for. The problem is that errno lives in thread-local storage, which is initialized very late, long after relibc gets loaded by the loader. Yet that same access function was being used to figure out where exactly relibc is located.

So this was a typical chicken-and-egg problem. To break the cycle, I simply wrote my own thin wrapper around the access system call that doesn't touch errno; the wrapper was only meant to be used from within ld.so.
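An errno-free wrapper of this kind might look roughly like the sketch below. It is not the patch that went into relibc: it assumes Linux on x86_64 (where access is syscall number 21), and the names `raw_access` and `first_existing` are hypothetical. The key point is that the failure comes back as a negative return value instead of being written to the thread-local errno.

```rust
use core::arch::asm;
use core::ffi::CStr;

/// Syscall number of access(2) on Linux x86_64 (assumption: this target only).
const SYS_ACCESS: isize = 21;
/// F_OK: only check that the path exists.
pub const F_OK: usize = 0;

/// Hypothetical errno-free wrapper: returns 0 on success or a negative
/// errno value (for example -2 for ENOENT). It never touches errno, so it
/// is usable before thread-local storage has been set up.
pub fn raw_access(path: &CStr, mode: usize) -> isize {
    let ret: isize;
    // SAFETY: `path` is NUL-terminated and the kernel only reads from it.
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") SYS_ACCESS => ret,
            in("rdi") path.as_ptr(),
            in("rsi") mode,
            // The syscall instruction clobbers rcx and r11.
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack),
        );
    }
    ret
}

/// Return the first candidate path that exists, the way a loader might
/// probe several directories for a library.
pub fn first_existing<'a>(candidates: &[&'a CStr]) -> Option<&'a CStr> {
    candidates.iter().copied().find(|p| raw_access(*p, F_OK) == 0)
}
```

Because the sketch issues a Linux system call directly, it says nothing about how the same check would be done on Redox.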

That sounded like a perfectly fine solution, but there was a huge problem: Redox and Linux do not have the same system calls. Shortly after my patch landed in mainline relibc, it was replaced with another patch (which in fact did access errno). So it fell to me to implement a proper fix for the bug. To do that, I needed a Redox environment, at the very least to make sure that relibc compiles there.

One quick solution was to use redoxer, a Rust tool that spawns a quick VM for running commands on Redox. The only problem (at least on my machine) was that it needed root, which I didn't feel safe with.

So I set up a chroot environment and installed Rust, qemu, and fuse there. Then I installed redoxer using cargo and made sure it was working.

Redoxer integrates very nicely with Rust. For example, if you run redoxer test in a Rust repo, it will run cargo test for that repo inside Redox, which was all I needed to test my code inside Redox.