One of the reasons people use Ruby is that it takes care of memory management. Create an object, use it, and then forget about it, and the system will clean up after you. That’s a lot more convenient than having to keep track of every allocation yourself, as in C. It also eliminates a whole class of potential bugs, leaving you more mental energy to focus on writing code that does useful things.
One symptom of a memory management error in C code is a “segmentation fault”: the program tries to access memory that it is not allowed to access. A segmentation fault is a serious bug, and it may even lead to security problems if it can be exploited.
So, when we were running some rspec tests against our Rails code, it was pretty surprising to see the Ruby interpreter itself segfault, as Ruby is widely used, well-tested code. This isn’t an everyday occurrence, and it piqued our curiosity.
Since the problem only occurred at the very end of rspec runs, it didn’t seem to be a bug that would impact our production systems, but it was still annoying to see Ruby crash like this.
How do you figure out what’s causing a bug like this?
Rails uses a lot of different gems (libraries), so the first step was to whittle down what was loaded and eliminate anything that was not involved.
Having done that, a few things stood out: the MongoDB gem, the fact that the crash happened at the very end of an rspec run, and code to clean up the database after the tests.
With those clues, we were able to create a standalone Rails project that consistently demonstrated the segfault. This was an important step in asking for help, because we’d be able to show what was going on without sharing all of our codebase.
Ruby is under constant development, so another thing to check was whether a more recent version had already fixed the problem. We fetched the Ruby interpreter source from GitHub, compiled it, and ran the project against it. Still crashing.
Having the source code was helpful though, because it’s possible to recompile it with thread debugging enabled, which outputs a lot of data about what’s going on.
Between that output, a look at the Ruby source code, and a look at how some of the gems (specifically the concurrent-ruby gem) use threads, it became evident that something untoward was happening when threads were used in finalizers.
Finalizers, in Ruby, are a somewhat obscure feature that lets you run some code when an object gets cleaned up. They are defined with ObjectSpace.define_finalizer, which lets you specify a Proc to run.
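As a rough illustration of the API (a minimal sketch; the TempResource class and its finalizer_for helper are hypothetical names, not taken from our project), a finalizer can be attached like this:

```ruby
# Hypothetical example class; not from the project discussed in this post.
class TempResource
  # Build the finalizer in a class method so the Proc does not capture
  # the instance itself. `out` only needs to respond to `<<`.
  def self.finalizer_for(name, out = $stdout)
    proc { |object_id| out << "finalizing #{name} (object id #{object_id})\n" }
  end

  def initialize(name)
    @name = name
    # Register the Proc to run when this object is cleaned up.
    ObjectSpace.define_finalizer(self, self.class.finalizer_for(name))
  end
end

TempResource.new("scratch")
# Ruby does not promise when the garbage collector will run the finalizer.
# Any finalizers still pending are run as the interpreter shuts down.
```

The Proc is built in a class method on purpose: a Proc created inside initialize would capture self, and the object could then never be garbage collected while the program is running.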
What seemed to be happening was that as rspec finished and shut down, some of these finalizers were being called.
Normally, as Ruby shuts down, it marks the main thread as KILLED, and you can’t launch any more threads. But one of the finalizers created a thread and then ran a join to ensure it had stopped. It turns out that this join puts the main Ruby interpreter in a state other than KILLED, which then frees up other finalizers to launch threads.
At the same time these threads are being fired off, the main Ruby interpreter continues its process of shutting down, until the two things end up colliding: the Ruby interpreter shuts down and frees various bits of itself, whereas the threads are still spinning up and at a certain point try to access bits and pieces that would normally be available, but have been “yanked out from underneath their feet”. The result: Ruby crashes.
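The shape of the problematic finalizer can be sketched as follows. This is only an illustration of the pattern described above, with hypothetical names (Worker, threaded_finalizer); it is not the code from any gem, and it is not our reproduction script. The Proc is called directly here, while the interpreter is fully alive, so the snippet is safe to run.

```ruby
# Hypothetical stand-in for an object whose cleanup uses a helper thread.
class Worker
  # Built in a class method so the Proc does not capture an instance.
  def self.threaded_finalizer(events)
    proc do |_object_id|
      helper = Thread.new { events << :helper_ran } # start a new thread...
      helper.join                                   # ...and wait for it to finish
      events << :joined
    end
  end
end

events = Queue.new # thread-safe container for recording what happened
finalizer = Worker.threaded_finalizer(events)

# Safe: invoked by hand, not by the interpreter during shutdown.
finalizer.call(0)
puts "events recorded: #{events.size}"

# The risky variant would register this Proc on many objects and leave
# them to be finalized while the interpreter is shutting down:
#   ObjectSpace.define_finalizer(Worker.new, Worker.threaded_finalizer(events))
```

Whether the commented-out variant actually crashes depends on the Ruby version and on timing, so it should not be read as a guaranteed reproduction.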
With a clearer model of what was happening, it was possible to write a short script, with no Rails code or external gems, that triggered the problem.
As of this writing, the Ruby core team has not provided a fix for this issue. One proposed idea is to strictly forbid creating threads within finalizers; however, it’s likely that this would break existing code.
Fortunately, the maintainer of the concurrent-ruby gem has found a way to ensure the issue is not triggered from that code, which means we no longer see the bug in our project, even though it’s still present in Ruby core.
References:
https://github.com/ruby-concurrency/concurrent-ruby/issues/808