Scaling Out Jenkins Based CI with Docker and Nomad

Anton Weiss

Anton Weiss is a Principal Software Delivery Consultant and Trainer at Otomato software Ltd. He has been enabling effective software delivery since 2001, accompanying companies through DevOps transformation processes, and enjoys writing and speaking on technology, collaboration and innovation.

Anton Weiss | 18 Oct 2017 | System

Tags: devops, Jenkins, web applications

The existing CI processes at Taboola are quite demanding: a full product build includes 150 Maven modules accounting for about 20,000 unit tests. Besides builds from the master branch, there are also builds for all feature branches and patch releases, adding up to more than a hundred builds per day. Some of the tests are heavy on CPU and memory, performing resource-intensive data crunching. As you can imagine, all this requires substantial CI infrastructure horsepower, which Taboola definitely has:

Taboola’s Jenkins cluster currently has 35+ Jenkins slaves, each with 20 to 40 CPU cores and at least 100 GB of memory. Each slave runs 5 to 10 executors. All in all, a powerful CI/CD factory.

But even with all this power there are limitations. On the day of the weekly release the volume of builds peaks and we sometimes find ourselves starved for resources. Builds line up in queues, wasting valuable developer time and causing unnecessary stress.

Compounding the problem, some of the build slaves have a specific system configuration and can’t be used for all kinds of processes. We love and nurture our pet servers, but treating them as cattle is so much more productive.

So there’s room for improvement! Add to this the fact that Taboola has been growing fast and we expect the build volume to keep growing, and finding that improvement becomes a necessity.

Taboola’s development infra (should we say DevOps?) team is constantly looking for ways to enhance overall productivity and eliminate bottlenecks in the delivery process. Smart test execution algorithms have been developed that break the tests down into variable-size chunks to optimize the load distribution.
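To give a feel for what splitting tests into variable-size chunks can look like, here is a minimal Python sketch. It is not Taboola's actual algorithm, which this post does not describe; it assumes a simple greedy "longest test first" heuristic, and the function name and test durations are made up for illustration.

```python
import heapq
from typing import Dict, List


def split_into_chunks(durations: Dict[str, float], n_chunks: int) -> List[List[str]]:
    """Greedy longest-first partitioning (an assumed heuristic, not Taboola's).

    Each test goes to the chunk with the smallest total duration so far,
    so chunks hold different numbers of tests but take similar time to run.
    """
    # Min-heap of (total_duration_so_far, chunk_index).
    heap = [(0.0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]

    # Place the slowest tests first; the fast ones then fill the gaps.
    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    return chunks


# Hypothetical test durations in seconds.
example = {"a": 10, "b": 7, "c": 5, "d": 4, "e": 2}
chunks = split_into_chunks(example, 2)
```

With these made-up durations the two chunks contain two and three tests respectively, yet both add up to 14 seconds of work.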

But we didn’t want to stop there. On a quest to maximize effective resource utilization we started looking at containers. It’s no secret that in the last couple of years they’ve become the single hottest piece of technology everyone is raving about. They bring the benefits of workload portability, resource isolation and deterministic deployments. And yes, they are a perfect match for CI/CD, allowing us to quickly spin up and tear down pristine, versioned build and test environments. Moreover, they can help turn our Jenkins slave cluster into a real herd of cattle, where all we need from the slave nodes is to have Docker installed; the rest of the configuration is taken care of by container images.

As with all improvement initiatives involving new tech, we started out small. We installed the Jenkins Docker plugin and set out to create a build slave image equipped with all the needed tools. The physical build slaves are managed by Foreman and Puppet, so the existing Puppet manifests were a great help in getting this configured fast.

Once the slave image was ready we uploaded it to Taboola’s Artifactory-integrated Docker registry and tried running builds from Jenkins using the Jenkins Docker plugin. Another day of tweaking was spent getting the SSL certificates to work, but in the end we achieved great success: the build and tests were passing! Unsurprisingly, it took hours to complete the full cycle, as at this stage we were only using one physical Docker host for our games. As depicted in the following diagram:

[Diagram: Jenkins running builds in containers on a single Docker host]

Now that we had proven the build could be executed in a container, the time came to decide on a container scheduling solution. Container schedulers (sometimes also referred to as orchestrators) are responsible for receiving container workload tasks and distributing them across a cluster of container hosts. A number of solutions exist, the most renowned being Docker Swarm (by Docker Inc.), Kubernetes (from Google) and Marathon (from Mesosphere). But we didn’t go with any of those. Instead we chose to try a tool from a company we’ve learnt to rely on: HashiCorp. Taboola has been using their service discovery tool, Consul, in production for quite some time, and we had heard good things about their scheduler/cluster manager, Nomad. Besides coming from HashiCorp, it also boasts the following attractive features (as stated on the official website):

  • Flexible Workloads: Nomad can schedule containers, but also standalone applications and batch tasks across a cluster of hosts.
  • Operational Simplicity: Nomad ships as a single binary, both for clients and servers, and requires no external services for coordination or storage. Nomad is distributed and highly available, and combines resource management and scheduling into a single system for simplicity.
  • Built for Scale: Nomad was designed from the ground up to support global scale infrastructure. Nomad is distributed and highly available, using both leader election and state replication to provide availability in the face of failures. Nomad is optimistically concurrent, enabling all servers to participate in scheduling decisions which increases the total throughput and reduces latency to support demanding workloads.

 

The decision to go with Nomad was made in collaboration with Taboola’s production operations team. They were also looking for a cluster management solution, and we decided we would all be better off joining our research efforts.

Additional motivation was provided by this blog post, which showed us that the integration between Nomad and Jenkins was already taken care of: http://www.ivoverberk.nl/scalable-ci-cd-with-nomad-and-jenkins/

All we had to do was connect the dots! We created a Nomad cluster, installed the Jenkins plugin and tried to schedule our first Nomad-based Docker slave. Now it looked like this:

[Diagram: Jenkins scheduling Docker-based slaves on the Nomad cluster]

But of course it’s never that easy.

One of the issues we had to deal with was the fact that current builds rely on 1) shared workspaces and 2) configuration written to an NFS mount.

Nomad in its current configuration doesn’t allow mounting specific host folders into containers. This makes some sense: if you want real dynamic cluster scheduling you don’t want to depend on data sitting on the host. But it also meant we had to mount the NFS share from inside the container at startup. While fully possible in Nomad, this wasn’t doable through the Jenkins plugin.
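To illustrate what "mounting NFS from inside the container at startup" can mean in Nomad terms, here is a rough Python sketch that builds a Nomad job definition in its JSON form. It is an assumption-laden illustration and not the job the plugin generates: the image name, NFS export, mount point, datacenter name, resource figures and the `start-agent.sh` script are all hypothetical, and it assumes the Nomad clients permit privileged Docker containers, which an in-container `mount` typically needs.

```python
import json


def nfs_agent_job(job_id: str, image: str, nfs_export: str,
                  mount_point: str, agent_cmd: str) -> dict:
    """Build a Nomad batch job whose container mounts NFS before starting the agent."""
    # Override the container command: mount the share first, then hand over
    # to the (hypothetical) script that starts the Jenkins agent.
    startup = (
        f"mkdir -p {mount_point} && "
        f"mount -t nfs {nfs_export} {mount_point} && "
        f"exec {agent_cmd}"
    )
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "batch",
            "Datacenters": ["dc1"],  # assumed datacenter name
            "TaskGroups": [{
                "Name": "agent",
                "Count": 1,
                "Tasks": [{
                    "Name": "agent",
                    "Driver": "docker",
                    "Config": {
                        "image": image,
                        "command": "/bin/sh",
                        "args": ["-c", startup],
                        "privileged": True,  # assumption: needed for mount
                    },
                    # Illustrative reservation: CPU in MHz, memory in MB.
                    "Resources": {"CPU": 2000, "MemoryMB": 4096},
                }],
            }],
        }
    }


job = nfs_agent_job(
    job_id="jenkins-agent-example",
    image="registry.example.com/ci/build-slave:latest",  # hypothetical image
    nfs_export="nfs.example.com:/export/ci",             # hypothetical export
    mount_point="/mnt/ci",
    agent_cmd="/usr/local/bin/start-agent.sh",           # hypothetical script
)
payload = json.dumps(job, indent=2)
```

The resulting `payload` is the kind of document that can be submitted to Nomad's jobs HTTP endpoint. The point is only that the container's command is replaced by "mount first, then start the agent".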

But that’s what’s so great about open source. You need functionality – you’re free to build it!

We extended the plugin to support alternative container startup commands, verified that it worked and submitted a pull request to the official GitHub repo. The maintainer, Ivo Verberk, was quick to react: he merged the change, added some improvements of his own and released a new version.

In the meantime we encountered a new issue: containers weren’t getting scheduled fast enough. Sometimes Jenkins would sit idle for up to a couple of minutes before deciding to schedule a new slave. Now multiply this by 50 test jobs and you find yourself waiting for hours… Investigating the Jenkins slave scheduling strategy wasn’t easy: it is not really documented, and examples online are few and far between. The real breakthrough came from reading the code of Yet-Another-Docker-Plugin. It revealed that there is a better way: extending Jenkins’ NodeProvisioner.Strategy. A couple of days of tweaking and voilà, we had slaves coming up in seconds, not minutes, just as one would expect from containers.
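The real strategy is Java code inside the plugin and is not reproduced here. As a conceptual sketch only, the idea of provisioning eagerly (launching slaves as soon as the queue exceeds the capacity that is available or already on its way) could be expressed in Python like this. The class, function and parameter names are invented for illustration and do not correspond to the Jenkins API.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    queued_builds: int      # builds waiting for an executor
    idle_executors: int     # executors that are online and free
    pending_executors: int  # executors on slaves that are already launching


def agents_to_launch(snap: ClusterSnapshot,
                     executors_per_agent: int = 1,
                     max_new_agents: int = 10) -> int:
    """Decide immediately how many new slaves to request.

    No waiting for load statistics to build up: any build that cannot be
    served by idle or already-launching executors triggers provisioning.
    """
    shortfall = snap.queued_builds - snap.idle_executors - snap.pending_executors
    if shortfall <= 0:
        return 0
    # Ceiling division: enough slaves to cover the whole shortfall.
    needed = -(-shortfall // executors_per_agent)
    # Cap the burst so one spike cannot flood the cluster.
    return min(needed, max_new_agents)


# 50 queued test jobs, 3 free executors, 7 more already starting.
burst = agents_to_launch(ClusterSnapshot(50, 3, 7))
```

The cap on new slaves per decision is an assumption of this sketch, added so that a single spike does not request more containers than the cluster can place.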

This has also been merged upstream and is now part of jenkins-nomad-plugin release 0.3.

So the end result is a cluster of 4 Nomad nodes capable of running 15-20 Jenkins slaves whenever needed.

We’re still sorting out some stability and performance issues with this setup. As mentioned, Taboola’s test suite is very CPU- and memory-intensive, and we occasionally see some jobs strangling others when scheduled on the same node. This probably comes down to careful resource quota tuning. We still need to find the right balance between stability and efficient resource utilization. Of course, any tips on this are most welcome.
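The trade-off behind quota tuning is simple arithmetic: the larger the reservation each slave gets, the fewer slaves fit on a node, but the less they step on each other. A small Python sketch with made-up numbers (the node and slave sizes below are hypothetical, not our actual configuration) shows the calculation. Nomad expresses CPU reservations in MHz and memory in MB.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_mhz: int
    memory_mb: int


def agents_per_node(node: Resources, agent: Resources) -> int:
    """How many slaves fit on one node if each gets a fixed reservation.

    The scarcer of the two resources (CPU or memory) sets the limit.
    """
    return min(node.cpu_mhz // agent.cpu_mhz, node.memory_mb // agent.memory_mb)


# Hypothetical node: 32 cores at 2500 MHz, 128 GB of memory.
node = Resources(cpu_mhz=32 * 2500, memory_mb=128 * 1024)
# Hypothetical reservation per slave: 16000 MHz and 24 GB.
agent = Resources(cpu_mhz=16000, memory_mb=24 * 1024)

per_node = agents_per_node(node, agent)
cluster_total = 4 * per_node  # four nodes, as in the cluster described above
```

Shrinking the reservation packs more slaves onto each node at the risk of contention; growing it does the opposite.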

The conclusion:

Nomad and Docker are still young technologies. Together they can provide great benefits for your software delivery tooling, but there’s tweaking involved. As always with open source – you can enjoy it the most if you’re prepared to give back. We’ll be happy if you decide to use Jenkins with Nomad and benefit from our contribution to the plugin. Technology grows faster with community support – so do let us know if we can help you, or if you can help us. Building software together is much more fun. After all that’s what DevOps is all about.

Happy delivering!