Fibers from out of (user) space – Deeper view

Jordan Sheinfeld

Principal Engineer at Taboola's Video group. Very enthusiastic about new technologies. Today he spends most of his time learning and deploying new technologies and improving the performance and scale of systems.

Jordan Sheinfeld | 26 Dec 2018 | Java

Tags: concurrency, CPU, fibers, performance, threads

A couple of months ago my team had its first experience working with Java fibers, when we needed to make our main application work asynchronously.
In this three-part series, I share my team's experience and how we implemented and deployed Java fibers in production.

In Part 1 we talked about what fibers are at a high level, how they compare to threads and why we started to explore them.

In Part 2 we went further in depth about how fibers differ from threads, how to create fibers, how to work with them and the basic concepts of how they work.

In this part, we’ll discuss what’s going on under the hood in fibers and deep dive into the implementation of how fibers work and what lessons we learnt during our journey working with them. We will also see how this magic happens…

 

Under the hood

Fibers are implemented by instrumenting the JVM bytecode of our code and patching it so that state can be saved and resumed. To instrument our code, we have to run our application with the Quasar Java agent on the command line: -javaagent:quasar-core-0.7.10.jar

The framework uses a ForkJoinPool executor to run the fibers on a limited set of threads (by default the number of cores, but this is configurable).

ForkJoinPool is a form of thread pool that follows the divide-and-conquer concept and uses a work-stealing algorithm to make better use of the CPU and to reduce the thread contention costs of the normal ThreadPoolExecutor.

It mostly shines in use cases where tasks can be divided into subtasks, such as running blocks of fiber code. Internally, each worker thread has its own task queue, which lowers the contention on the main executor task queue.
Each of these queues is a deque (double-ended queue), which lowers the synchronization needed when adding items: a worker adds tasks to the tail of its own queue and fetches them from the tail in stack order. The stealing part comes in when a worker has finished its own work and immediately goes and fetches work from other workers' queues.

The main benefit of this is that the worker threads are fully utilized throughout the job cycle. ForkJoinPool is not necessarily better than ThreadPoolExecutor; it depends on the use case, since it has more overhead. If there is a known amount of work that can be evenly distributed, ThreadPoolExecutor is the better choice. If work can be divided, re-submitted and then re-assembled, ForkJoinPool may be a better choice.
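To make the "divide, re-submit, re-assemble" case concrete, here is a minimal sketch that uses only the JDK's ForkJoinPool, not Quasar. The class name RangeSum, the threshold and the pool size are illustrative assumptions and are not taken from our application.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a slice of an array by splitting it in halves until the pieces
// are small enough to add up directly (divide and conquer).
final class RangeSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // illustrative cut-off, tune by measuring

    private final long[] values;
    private final int from; // inclusive
    private final int to;   // exclusive

    RangeSum(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += values[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        RangeSum left = new RangeSum(values, from, mid);
        RangeSum right = new RangeSum(values, mid, to);
        left.fork();                     // queued on this worker's deque; an idle worker may steal it
        long rightSum = right.compute(); // the other half runs on the current thread
        return left.join() + rightSum;   // re-assemble the partial results
    }

    // Runs the whole computation on a dedicated pool with the given parallelism.
    static long sum(long[] values, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new RangeSum(values, 0, values.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] values = new long[100_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i + 1;
        }
        System.out.println(sum(values, 4)); // prints 5000050000
    }
}
```

If the work were a fixed list of equally sized, independent jobs, a plain ThreadPoolExecutor would do the same with less machinery; the fork/join shape only pays off when tasks spawn subtasks.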

What does it look like behind the scenes?

Let’s take this simple fiber as an example:

Time to dive in and inspect some bytecode.

The above snippet translates into the following bytecode:

 

Now, let's have a look at what the Quasar agent does to that bytecode after instrumentation:

After instrumentation, we take the resulting bytecode and decompile it, in order to make it more readable:

We can see that the decompilation gives us hints of what's going on, even though it is a best-effort decompilation and not accurate. Analyzing the bytecode shows that at the beginning of the method there is a decision that jumps to one of the injected labels in the code, according to the state we are in.

TABLESWITCH
  1: L3
  2: L4
  default: L5
L5

We can see that L3 is defined near the first sleep, and L4 is defined near the second sleep.

The injected push calls in the decompiled code suggest that, as the method runs, the Quasar framework is told where we currently are in the method before an exception is thrown to leave it. Quasar uses a special internal SuspensionException in order to escape code blocks.

Indeed, each time the code runs, it is interrupted at calls to methods that are marked as @Suspendable. Quasar stores the index, the run count and all the stack variables and frames at the current position, so that it can resume at the exact same place once the interruption is over.
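The mechanism can be imitated by hand in plain Java. The following is only a sketch of the idea described above, under loud assumptions: every name in it (SuspendResumeSketch, SuspendSignal, suspendAt) is hypothetical, Quasar does this on bytecode rather than in source, and its real bookkeeping is richer than one integer and one saved local. The switch plays the role of the injected TABLESWITCH, and the exception plays the role of the one Quasar throws to leave the method.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written analogue of an instrumented method with two suspension points.
final class SuspendResumeSketch {

    /** Control-flow exception used to unwind the stack at a suspension point. */
    static final class SuspendSignal extends RuntimeException {
        SuspendSignal() {
            super(null, null, false, false); // no message and no stack trace: control flow only
        }
    }

    // Saved state: which label to jump to on re-entry, plus the live local variable.
    private int resumePoint = 0;
    private int savedCounter = 0;
    private final List<String> log = new ArrayList<>();

    /** Records where to resume and the local to restore, then hands back the exception to throw. */
    private SuspendSignal suspendAt(int nextResumePoint, int counter) {
        resumePoint = nextResumePoint;
        savedCounter = counter;
        return new SuspendSignal();
    }

    /** The "fiber body": each re-entry jumps straight to the code after the last suspension. */
    void body() {
        int counter;
        switch (resumePoint) {
            case 1: // resume after the first suspension point
                counter = savedCounter + 1;
                log.add("after first sleep");
                throw suspendAt(2, counter);
            case 2: // resume after the second suspension point
                counter = savedCounter + 1;
                log.add("after second sleep, counter=" + counter);
                return;
            default: // fresh start
                counter = 1;
                log.add("start");
                throw suspendAt(1, counter);
        }
    }

    /** Minimal "scheduler": re-enters the body until it finishes without suspending. */
    static List<String> runToCompletion() {
        SuspendResumeSketch fiber = new SuspendResumeSketch();
        int suspensions = 0;
        while (true) {
            try {
                fiber.body();
                break;
            } catch (SuspendSignal signal) {
                suspensions++; // a real scheduler would park the fiber here and run something else
            }
        }
        fiber.log.add("suspensions=" + suspensions);
        return fiber.log;
    }

    public static void main(String[] args) {
        System.out.println(runToCompletion());
    }
}
```

The point to notice is that the local variable counter survives both suspensions only because it was copied out before the exception was thrown and copied back in on re-entry, which is the job the injected push calls appear to do in the instrumented code.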

 

Lessons learned

1. Adding fibers took some time but was worth it. The work consisted mostly of adding @Suspendable annotations to the relevant methods in the code and stabilizing the application. Along the way there were a few challenges: external libraries or dependencies can cause problems because they may not be 100% compatible with fibers. Most of the time, however, it is possible to overcome this. The transition was quite smooth in terms of code refactoring; no major changes to the code were made apart from adding the @Suspendable annotations.

2. During the development phase several issues were revealed. After we tried to launch a server, strange NullPointerExceptions suddenly started to appear in our log files, in places where they really shouldn't and for no good reason. Our investigation showed that some methods in key places were not marked as @Suspendable, which caused this odd behavior.

There are two key things to remember. First, run your application with the following VM parameter in order to verify that your code is correctly instrumented:

-Dco.paralleluniverse.fibers.verifyInstrumentation=true. This will try to find all the places that are missing @Suspendable annotations. Be aware that this flag does not always detect 100% of the missing pieces.

Second, it is important to annotate interfaces too. If you have a class that implements an interface and you mark one of its methods as @Suspendable, also mark the corresponding method declaration in the interface.

3. If you are using ReentrantLocks and conditions, even though they have corresponding fiber versions, it is highly advised to define your condition as:

Otherwise, you may get some seriously strange errors, because the Condition interface is not annotated with @Suspendable.

4. Memory leak pitfalls. Even though ThreadLocals are managed by fibers and are transparent to them, there can be glitches, mostly in external libraries. For example, this happened to us with Netty. Our application uses gRPC heavily, and gRPC uses Netty IO underneath. The library keeps a thread-bound cache of pooled memory (byte buffers) in order to provide fast memory and avoid GC overhead. A ThreadDeathWatcher class expects running threads to die and then removes them from an internal watch list. This does not happen under fibers, where the main ForkJoinPool engine threads are always running, and it eventually causes a memory leak from a huge ArrayList that contains many ThreadDeathWatcher instances.

5. Define the correct level of parallelism for the ForkJoinPool by specifying the VM parameter:

-Dco.paralleluniverse.fibers.DefaultFiberPool.parallelism=XXX. By default the number of cores in the system is used, but sometimes that is not enough. You need to measure it and play with the numbers.

6. Monitor. Fibers expose JMX metrics, so you can monitor the ForkJoinPool queues, the number of fibers running on the system and much more. You can also access the default ForkJoinPool and monitor its waiting queues, for example:
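As a rough sketch of the kind of numbers worth watching, the snippet below reads the counters that the JDK's own java.util.concurrent.ForkJoinPool exposes. It is an assumption-laden illustration: it builds its own pool instead of reaching into Quasar's, it does not touch Quasar's JMX beans, and the class name PoolStats is hypothetical.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Formats the standard ForkJoinPool counters into one log-friendly line.
final class PoolStats {

    static String snapshot(ForkJoinPool pool) {
        return "parallelism=" + pool.getParallelism()
                + " poolSize=" + pool.getPoolSize()
                + " active=" + pool.getActiveThreadCount()
                + " running=" + pool.getRunningThreadCount()
                + " queuedTasks=" + pool.getQueuedTaskCount()              // tasks in worker queues (estimate)
                + " queuedSubmissions=" + pool.getQueuedSubmissionCount()  // submitted, not yet started (estimate)
                + " steals=" + pool.getStealCount();                       // tasks taken from other workers (estimate)
    }

    public static void main(String[] args) throws InterruptedException {
        ForkJoinPool pool = new ForkJoinPool(2);
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> {
                long x = 0;
                for (int j = 0; j < 1_000_000; j++) {
                    x += j; // some CPU work so the queues have a chance to fill up
                }
            });
        }
        System.out.println(snapshot(pool)); // while busy: queues are usually non-empty
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(snapshot(pool)); // after draining: queues are empty
    }
}
```

A steadily growing queuedSubmissions or queuedTasks value is a hint that the parallelism from lesson 5 is too low for the load, or that something is blocking the worker threads.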

7. Spurious wakeups. These are a known behavior of thread synchronization mechanisms: a thread in a wait state may wake up from its blocking state because of a spurious wakeup, without any notify event having fired. This happens mostly because of the way POSIX/Win32 blocking system calls are implemented and their sensitivity to signal processing. Because of this, LockSupport.park() must be called in a loop that checks a condition, to verify that a real change in the condition happened and not a spurious wakeup. The same goes for fibers, but the story with them is a little different: since they run in a ForkJoinPool, there may be a lag between the call to park() and the moment the fiber actually parks. This must be checked, because an unpark() that is called before the fiber has actually parked can lead the system to a deadlock.
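The loop from lesson 7 can be sketched with plain threads and the JDK's LockSupport, as follows. This is not Quasar code and the class name ParkLoop is hypothetical. With threads, an unpark() that arrives before park() leaves a permit behind, so the early-unpark case is harmless here; the sketch still shows the two habits that matter for fibers as well: publish the state change before unparking, and re-check the state after every return from park().

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// One waiter, one signaller, and a condition that is always re-checked.
final class ParkLoop {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private volatile Thread waiter;

    /** Blocks until signalReady() has been called, tolerating spurious wakeups. */
    void awaitReady() {
        waiter = Thread.currentThread(); // register before the first check
        while (!ready.get()) {
            // park() may return for no reason at all, so its return proves nothing;
            // only the condition in the loop header decides whether we continue.
            // (It also returns immediately when the thread is interrupted; a
            // production version would handle that case explicitly.)
            LockSupport.park(this);
        }
    }

    /** Publishes the state change first, then wakes the waiter if one is registered. */
    void signalReady() {
        ready.set(true);
        Thread t = waiter;
        if (t != null) {
            LockSupport.unpark(t);
        }
    }

    /** Runs one wait/signal round trip; returns true if the waiter finished in time. */
    static boolean demo() {
        ParkLoop gate = new ParkLoop();
        Thread waiterThread = new Thread(gate::awaitReady);
        waiterThread.setDaemon(true); // do not keep the JVM alive if something goes wrong
        waiterThread.start();
        gate.signalReady();
        try {
            waiterThread.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !waiterThread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the flag is set before unpark() and read after the waiter registers itself, the waiter cannot miss the signal whichever side runs first: it either sees the flag and never parks, or it parks and is woken.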

 

Summary and conclusions

I hope you got a valuable glimpse into fibers and their power.
There are more technical challenges and things to improve. The library is maintained on GitHub and has active contributors. The latest version, 0.8.0, requires Java 11 and only runs on it; the latest version for Java 8 is 0.7.10. Fibers are planned to become part of the JVM itself (see Project Loom).

Moving our main bidding infrastructure to fibers made our code more readable and easier to maintain, and let us add new asynchronous capabilities with relatively low complexity.

The costs and effort were not huge, and the modifications to the code did not require crazy refactoring, thanks to the imperative way fibers are designed.
However, you should measure everything and keep track of abnormal behavior that fibers might yield when using less compatible libraries – that will be known only after trial and error.

In terms of performance compared to threads, to be honest, I had hoped for more. We really liked the technology and wished it would also bring performance improvements out of the box. Sadly, the picture was not that clear to us, and the fight against threads could be declared a tie.

 

Go to the previous part …

Part II – Hands on