Fibers from out of (user) space – Hands on

Jordan Sheinfeld

Principal Engineer in Taboola's Video group. Very enthusiastic about new technologies. These days he spends most of his time learning and deploying new technologies and improving the performance and scale of systems.

Jordan Sheinfeld | 25 Dec 2018 | Java

Tags: concurrency, CPU, fibers, performance, threads

A couple of months ago my team had its first experience working with Java fibers: we needed to make our main application work asynchronously.
In this three-part series, I will share my team's experience and how we deployed and implemented Java fibers in production.

In the previous part (Part 1), we talked about what fibers are at a high level, how they compare to threads, and why we started to explore them.

In this part we'll go further in depth on fibers and how they differ from threads: we'll see how to create fibers, how to work with them, and the basic concepts behind how they work.

Threads vs. Fibers

First, we looked for a reason not to stay with threads, researching the costs and performance penalties of working with threads vs. fibers.

We wanted to find proof that fibers can work better than threads, or at least shine in some areas. So we did several tests and experiments to try and prove that, mostly on performance and scale.

I must say, the research did not yield conclusive results as I wished it would, but it taught us a lot and enabled us to control the behavior of our service via a simple configuration flag.

Eventually it really depends on your specific use case. In our case the performance differences between threads and fibers were minor, but we gained cleaner, more imperative code.

Performance

Performance measurement is always tricky, because results depend on the conditions of the system they run on.
But let's take a look at a standard benchmark that gives us a sense of the performance we could get by working with fibers:

The thread ring problem.
In the thread ring problem, we create 500 threads (any other amount works as well) and connect them in a ring (circle) structure, so that the last thread points back to the first one.
Then we serially pass a message from one thread to the next, around the circle, 10,000 times. There are many ways to implement this; here is one example available on GitHub: https://github.com/vy/fiber-test. It uses JMH, the de-facto framework for measuring nano-scale performance on the JVM. It requires a little tweaking to run, but eventually, here are the results:

Environment and plan:

  • Testing environment, laptop: ThinkPad X1, Core i7-7500U 2.70GHz (4 cores), Ubuntu 16.04
  • 5 warm-up iterations, 5 measurement iterations.

Results:

  • Java Threads: 10.646 ops/s
  • Fibers: 103.241 ops/s

Fibers show an almost 10x improvement in this case. Nice, we have potential here!
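For a feel of what the benchmark actually does, here is a simplified, thread-based sketch of the ring (not the JMH harness from the linked repo; the class and method names are mine):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThreadRing {

    // Passes a token around a ring of `workers` threads `hops` times
    // and returns the index of the worker that ends up holding it.
    static int ring(int workers, int hops) throws InterruptedException {
        BlockingQueue<Integer>[] queues = new BlockingQueue[workers];
        for (int i = 0; i < workers; i++) {
            queues[i] = new ArrayBlockingQueue<>(1);
        }
        final int[] last = {-1};
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        int remaining = queues[id].take();      // wait for the token
                        if (remaining == 0) {                   // token exhausted: record who holds it
                            synchronized (last) {
                                last[0] = id;
                                last.notifyAll();
                            }
                            return;
                        }
                        queues[(id + 1) % workers].put(remaining - 1); // pass it on
                    }
                } catch (InterruptedException ignored) {
                }
            });
            threads[i].setDaemon(true);
            threads[i].start();
        }
        queues[0].put(hops);                                    // inject the token at worker 0
        synchronized (last) {
            while (last[0] < 0) {
                last.wait();
            }
        }
        for (Thread t : threads) {
            t.interrupt();                                      // release the still-parked workers
        }
        return last[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Token stopped at worker " + ring(500, 10_000));
    }
}
```

Every hop forces a hand-off between two threads, which is exactly the context-switch cost the benchmark stresses; the fiber version replaces the threads with fibers and the blocking queues with fiber-aware channels.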

perf

The next step was to put our newly refactored application under perf to figure out whether we gained any improvements in metrics such as branch prediction, CPU utilization, and page faults. If you are unfamiliar with perf, it is a Swiss-army-knife Linux profiler for almost everything. Our refactored application had a flag that sets whether to run with fibers or threads. The following shows the differences between the two runs.

We issued the following command to run perf:
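The exact command wasn't preserved; a reconstruction that collects the counters shown in the results table would be along these lines (the pid placeholder stands for our application's JVM process):

```shell
# Attach to the running JVM for 5 minutes and collect the hardware/software counters.
perf stat -e cpu-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses \
    -p <jvm-pid> -- sleep 300
```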

Environment and plan:

  • Testing environment, server: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, CentOS 7.2.1511
  • 40 cores
  • 128GB RAM
  • Production traffic, +/- 300 QPS
  • 5 minutes sampling.

Results:

Metric             Fibers               Threads
cpu-clock (msec)   2,221,610.273407     2,076,849.036113
context-switches   19,480,439           19,361,672
cpu-migrations     2,957,727            3,122,642
page-faults        485,777              468,374
cycles             5,183,396,393,765    4,826,607,131,964
instructions       5,867,959,049,330    5,518,275,491,102
branches           1,211,078,813,490    1,141,887,811,701
branch-misses      20,758,136,733       20,030,907,473

The results are not conclusive. One reason for this is the behavior of our application: it suffers from long business-logic decisions and methods that consume a lot of CPU and do not block often enough for fiber switching to pay off. In the parts where they do block, threads do a similar job, so the outcome stays similar. If we had reduced the number of threads in the ForkJoinPool (see Part 3) we might have seen a clearer win for fibers, but we couldn't, due to the large amount of CPU-bound work in our code.

Scale

Next, we tried to differentiate threads from fibers by the number of them an application can create.

So, to demonstrate the burden on the OS when creating thousands of threads vs. the same number of fibers, we did the following:

Let's see what happens when we try to run a small program that creates 100k threads:
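The original listing wasn't preserved; here is a minimal sketch that does the same thing (class and method names are mine):

```java
public class ThreadBomb {

    // Starts n daemon threads that just park forever, and returns
    // how many were successfully started.
    static int startThreads(int n) {
        int started = 0;
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE); // keep the thread alive
                } catch (InterruptedException ignored) {
                }
            });
            t.setDaemon(true);
            t.start();
            started++;
        }
        return started;
    }

    public static void main(String[] args) {
        // On a typical machine this dies long before reaching 100k:
        // each native thread reserves its own stack.
        System.out.println("Started " + startThreads(100_000) + " threads");
    }
}
```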

This is the output you will see when you run it:
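The exact console dump wasn't preserved, but the failure is the JVM's well-known native-thread error, something like:

```
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java)
```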

Oops, it died.

It was unable to allocate and create native threads due to memory limitations.

Now, let’s write a similar fiber version of this:
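Again a sketch rather than the original listing, assuming the Quasar library (and its instrumentation agent on the command line):

```java
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.SuspendableRunnable;

public class FiberBomb {
    public static void main(String[] args) throws Exception {
        final int count = 100_000;
        Fiber[] fibers = new Fiber[count];
        for (int i = 0; i < count; i++) {
            // Each fiber parks for a while, just like the threads did,
            // but it costs only a small heap object, not a native stack.
            fibers[i] = new Fiber<Void>((SuspendableRunnable) () -> Fiber.sleep(10_000)).start();
        }
        for (Fiber<?> f : fibers) {
            f.join();
        }
        System.out.println("All " + count + " fibers finished");
    }
}
```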

Running the above works like a charm 🙂

Now that we have a sense of what fibers are in terms of performance and scale, let's see how to create and work with them…

Hello world – how do we create a simple fiber?

To simplify our lives, fibers are implemented in a very similar fashion to Java threads: they have a functional interface, are Runnable-like, and are launched and destroyed in the same way.

In fact, their implementation in Java is such that both are represented by a shared abstraction named Strand, which generalizes both a thread and a fiber.

This means you can design the system based on Strands, and decide by configuration whether to run on fibers or normal Java threads.
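For example (a sketch, assuming the Quasar library; useFibers is a hypothetical configuration flag):

```java
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.Strand;
import co.paralleluniverse.strands.SuspendableRunnable;

public class StrandExample {
    public static void main(String[] args) throws Exception {
        boolean useFibers = Boolean.getBoolean("useFibers"); // hypothetical flag

        SuspendableRunnable task =
                () -> System.out.println("running on: " + Strand.currentStrand());

        // The same task can run on a fiber or on a plain thread,
        // both exposed through the Strand abstraction.
        Strand strand = useFibers
                ? new Fiber<Void>(task)
                : Strand.of(new Thread(Strand.toRunnable(task)));
        strand.start();
        strand.join();
    }
}
```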

At the moment, fibers are used as an external library with intentions to make them part of the JVM (see: Project Loom).

Fibers require the code to be instrumented. Instrumentation is a technique used to inject (patch) bytecode instructions into the existing classes that Java produces when it compiles sources to bytecode.

There are 2 ways to do it:

  1. Via dynamic instrumentation, by adding a -javaagent parameter to the VM parameters
  2. Using static instrumentation, by building the bytecode with instrumentation using build tools such as Ant/Maven

So first, to add support for fibers in your project, add the following to your Maven pom.xml:
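Something along these lines (the version and classifier are assumptions; check the latest Quasar release for your JDK):

```xml
<dependency>
    <groupId>co.paralleluniverse</groupId>
    <artifactId>quasar-core</artifactId>
    <version>0.7.10</version>
    <classifier>jdk8</classifier>
</dependency>
```

If you go with dynamic instrumentation, you will also pass -javaagent:path/to/quasar-core.jar to the JVM that runs your application.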

In order to make methods in our code “fiber friendly”, we need to annotate them with the @Suspendable annotation or declare them to throw SuspendExecution.

This will tell Quasar what our interruption points are, so that instrumentation will be active.
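The two styles look like this (a sketch; fetchData is a hypothetical method):

```java
import co.paralleluniverse.fibers.SuspendExecution;
import co.paralleluniverse.fibers.Suspendable;

class Styles {
    // Style 1: declare the checked marker exception
    void fetchData() throws SuspendExecution {
        // ... suspendable calls here
    }

    // Style 2: annotate instead (useful when you cannot change the signature)
    @Suspendable
    void fetchDataAnnotated() {
        // ... suspendable calls here
    }
}
```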

 

Let’s write a simple fiber that runs a single fiber and calls 2 methods that print some output and sleep:
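The original listing wasn't preserved either; here is a reconstruction consistent with the step-by-step trace further down (the names runCount, method1 and method2 follow the text):

```java
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.fibers.SuspendExecution;
import co.paralleluniverse.strands.SuspendableRunnable;

public class HelloFiber {
    static int runCount = 0;

    static void method1() throws SuspendExecution, InterruptedException {
        runCount++;
        System.out.println("method1, runCount = " + runCount);
        Fiber.sleep(1_000); // suspension point: the fiber parks here
    }

    static void method2() throws SuspendExecution, InterruptedException {
        runCount++;
        System.out.println("method2, runCount = " + runCount);
        Fiber.sleep(1_000); // parks again
    }

    public static void main(String[] args) throws Exception {
        new Fiber<Void>((SuspendableRunnable) () -> {
            method1();
            method2();
        }).start().join();
    }
}
```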

To run it we use the following command:
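The exact command wasn't preserved; with dynamic instrumentation it is along these lines (the jar path, version and class name are assumptions):

```shell
java -javaagent:quasar-core-0.7.10-jdk8.jar \
     -cp .:quasar-core-0.7.10-jdk8.jar HelloFiber
```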

Output:

We can see that a suspension occurred between method1 and method2: the runCount increased by one on the second print, which implies that the fiber was parked and resumed in between.

What went on? Why was the run count increased? Here is a step by step tracing:

  1. The fiber is launched and started.
  2. The run method is running.
  3. method1 is called (run count = 1).
  4. Fiber.sleep is called and the fiber parks.
  5. The fiber scheduler runs and re-schedules this fiber.
  6. run is running again.
  7. Instrumentation now jumps to method2.
  8. method2 is called (run count = 2).
  9. Fiber.sleep is called and the fiber parks.
  10. The fiber scheduler runs and re-schedules this fiber.
  11. Instrumentation now jumps to after method2.
  12. The fiber ends.

What’s next

In this part we got a feel for how to create simple fibers, how they shine in performance and scale (in some scenarios), and how they can help us.

In the next part, we'll take a deep dive into the structure of fibers and how they work behind the scenes, to better understand them and what we learnt when implementing them in production.

Continue reading the next part ….

Part III – Deeper view

Or go to the previous part …

Part I – Overview