Developer Productivity Engineering Blog

How to Make Scala Compilation 5X Faster

This is part of a series of blog posts on Scala compilation in the cloud:

  1. Using Scala in the Cloud: A Guide to Amazon EC2 Instance Types
  2. How to Make Scala Compilation 5X Faster (this post)

In Part 1 we looked at which Amazon EC2 instance is best for running your Scala CI builds. We found that compilation speed is not affected much by instance size, while the price doubles with every step up in size. The most cost-effective instance is thus m5.large (2 vCPUs). But what happens when we throw in Hydra, our parallel Scala compiler?

Since creating Hydra, we have focused our efforts on making Scala compilation faster, without compromise. The Scala ecosystem is incredibly rich, with fantastic libraries that exercise the language to its fullest, and Hydra promises to keep that intact: it is a drop-in replacement for the Scala compiler and supports the full Scala language, macros and compiler plugins included.

But how fast can we get, without compromising our goals? It turns out it’s more than 5 times faster! Let’s find out how…

Benchmark project

(If you’ve read our previous entry, please skip ahead to Test #1 below)

For our benchmark we wanted to choose an open-source project that is representative of a real-world project (i.e. not a toy project) and that can build on several Scala versions. We’re particularly interested in seeing how the latest Scala version (2.12.8 at the time of writing) compares to Scala 2.12.1, an early release in the same series.

We settled on scala-debugger, a tool and library for debugging Scala programs on the JVM:

  • Code size: 88,382 lines of code, excluding blank lines and comments
  • Compilation time: above 3 minutes

Methodology

Our main question is “How fast can we compile Scala?”, so we will measure only the actual compilation time. In particular, we are not going to consider the other time-consuming tasks in a typical CI build, such as dependency resolution and download, or the package and publish steps (but keep an eye on our blog to see how to minimize those in our next installment!)

Benchmarking on modern hardware is harder than it seems, and running inside the JVM makes it even trickier. Multiple layers of caching (CPU, OS), just-in-time (JIT) compilation, and garbage collection (GC) are just a few of the variables that influence our measurements.

To get reliable results, we need to bring the system into a so-called steady state, or warm state. As the JVM starts up and executes a program, it first needs to load classes from disk and begins by interpreting the bytecode; as it discovers which methods are executed most often, it compiles them to native code. Measurements taken before that point mostly reflect the warm-up rather than the compiler itself.

Setup

  • To run the benchmark, we first compile all projects a few times: this is the warm-up run, and it should take about 4-5 minutes.
  • We then start measuring, and run the sbt compile task 8 times to measure the time it takes to do a full build.
  • In between compilation runs, we delete the output directory (we do not run the clean task, as that would also remove the dependency resolution results and cause additional work for sbt).
  • We report the minimum, maximum and median compilation time, and use the median as the headline value.
  • We used OpenJDK 1.8u191, 7GB of heap (6GB for the smallest instance), and -J-XX:MaxMetaspaceSize=512m -J-XX:ReservedCodeCacheSize=512M. We monitored GC times and they didn’t show up as significant.

All of this is automated in our Hydra sbt plugin.

Generally, we’re going to look at the median value, but keep an eye on the spread. If the values are spread over too large an interval, the noise may be too high to draw meaningful conclusions.
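The summary step is simple enough to sketch in a few lines of Python. This is only an illustration, not the code in the Hydra sbt plugin: the function name `summarize` is ours, the sample timings are made up, and "spread" is defined here (as an assumption) as the largest deviation from the median, relative to the median.

```python
from statistics import median

def summarize(timings_s):
    """Return min, max, median and relative spread for a list of timings in seconds."""
    if not timings_s:
        raise ValueError("need at least one timing")
    med = median(timings_s)
    lo, hi = min(timings_s), max(timings_s)
    # Largest deviation from the median, as a fraction of the median.
    spread = max(med - lo, hi - med) / med
    return {"min": lo, "max": hi, "median": med, "spread": spread}

# Hypothetical timings for 8 measured runs (NOT real benchmark data).
sample = [61.2, 60.8, 61.0, 60.5, 61.9, 60.9, 61.1, 60.7]
stats = summarize(sample)
print(stats)
```

If the spread comes out large, say well above a few percent, the median alone is probably not a trustworthy summary and the runs are worth repeating.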

Test #1: MacBook Pro 2016

As a warm-up, let’s first see how it all works on a developer laptop, then move on to some beefier hardware and finish up in style with a fleet of EC2 virtual machines.

As a first shot, we went for a regular, 2-year-old developer laptop and the default configuration: MacBook Pro 2016, Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz. By default, Hydra uses as many workers as there are physical cores, so in this case the benchmark used 4 workers.

The chart shows only the test submodule, since the other projects have compilation times below 5 seconds. We obtained a speedup of ~3x, using Hydra with 4 workers!

This isn’t too far from the ideal 4x speedup when parallelizing on 4 workers (there is always some overhead), so the next step is to see what we can expect if we run on a machine with more cores to spare.
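To put "not too far from the ideal" into numbers, one can compute the parallel efficiency (speedup divided by number of workers) and, using Amdahl's law as a rough model, the fraction of the work that would have to run in parallel to yield that speedup. Amdahl's law is our assumption here, not something the benchmark measures, the function names are illustrative, and the ~3x figure is itself approximate.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / workers

def amdahl_parallel_fraction(speedup, workers):
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for the parallel fraction p."""
    if workers < 2:
        raise ValueError("need at least 2 workers")
    return (1 - 1 / speedup) / (1 - 1 / workers)

# Test #1: roughly 3x speedup with 4 Hydra workers.
efficiency = parallel_efficiency(3.0, 4)
fraction = amdahl_parallel_fraction(3.0, 4)
print(f"efficiency: {efficiency:.0%}, implied parallel fraction: {fraction:.0%}")
```

Under this simple model, a 3x speedup on 4 workers corresponds to 75% efficiency, which is why adding more cores is the natural next experiment.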

Test #2: 5.6x Speedup on Intel Xeon

Our next stop is relatively old server-grade hardware: an Intel Xeon E5-2680 v2, a processor launched in Q3 2013, based on the Ivy Bridge architecture and boasting 10 physical cores. When we increase the number of Hydra workers to 8, we get a huge 5.6x speedup!

Notice that single-threaded compilation is slower here than on the more recent (consumer) hardware found in my laptop. This is to be expected, since consumer CPUs are optimized for single-threaded loads and may boost the frequency of a single core when needed.

Compiling Scala in the Cloud

Not everyone has an Intel Xeon lying around, so what can we expect when running Hydra in the cloud? Amazon EC2 makes it extremely easy (and appealing) to run CI/CD builds in the cloud. In our previous post, Using Scala in the Cloud: A Guide to Amazon EC2 Instance Types, we tested 16 different instances, so let’s see how they perform when using Hydra!

A quick reminder of Amazon nomenclature (skip ahead if this is old news to you): one dimension is the number of vCPUs on each instance, and we’ll be testing four different sizes: 2, 4, 8, and 16 vCPUs. The other dimension is the instance type:

  • General purpose machines (m5.*). These are Intel Xeon based, with a number of virtual cores that doubles with each size increment. We are going to test m5.large, m5.xlarge, m5.2xlarge and m5.4xlarge (with 2, 4, 8 and 16 virtual cores respectively). We’ll be using these as the baseline.
  • AMD general purpose machines (m5a.*). These are based on AMD EPYC 7000 series processors, and mirror the m5 nomenclature at a slightly lower price than the Intel line (~10%). We are going to test the same four sizes.
  • Compute optimized (c5.*). These are Intel Xeon based and are, unsurprisingly, optimized for compute-heavy tasks. They feature less memory than the general purpose line but at a slightly lower price point (~10%), roughly the same as the AMD line.
  • Memory optimized (r5.*). These instances feature twice the memory of the general purpose line (starting at 16GB instead of 8GB), but at a higher price (~25%).
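The pricing structure described above can be captured in a tiny model, which is handy when reasoning about cost per build. This is a sketch under stated assumptions: prices are expressed relative to m5.large rather than in dollars, the family factors are the approximate percentages quoted in the list, and the doubling per size step is the rule of thumb from Part 1. Actual AWS prices vary by region and over time.

```python
# Sizes tested in this post, smallest to largest.
SIZES = ["large", "xlarge", "2xlarge", "4xlarge"]

# vCPU count doubles with each size step, starting at 2.
VCPUS = {size: 2 * 2 ** i for i, size in enumerate(SIZES)}

# Approximate price of each family relative to m5 (assumed from the text, not exact).
FAMILY_FACTOR = {"m5": 1.00, "m5a": 0.90, "c5": 0.90, "r5": 1.25}

def relative_price(instance):
    """Hourly price relative to m5.large, e.g. relative_price('r5.xlarge')."""
    family, size = instance.split(".")
    return FAMILY_FACTOR[family] * 2 ** SIZES.index(size)

for name in ["m5.large", "m5a.xlarge", "c5.2xlarge", "r5.4xlarge"]:
    size = name.split(".")[1]
    print(f"{name}: {VCPUS[size]} vCPUs, {relative_price(name):.2f}x the m5.large price")
```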

There are other types, such as burstable instances, but we decided against testing them because their results are too noisy to be meaningful. We used dedicated instances for benchmarking, and the spread of benchmark values was only 1-2% around the median, so we decided not to show error bars, as they would be too small to see.

Test #3: 5x Speedups On Amazon EC2

We ran the same benchmark using all the available cores on each instance type, using both Scala 2.12.1 and Scala 2.12.8. One data point is missing: the smallest compute-optimized instance doesn’t have enough memory to compile the project, with either the regular Scala compiler or Hydra.

On small instances with only 2 virtual CPUs, Hydra gives only a small speedup, but compilation times drop drastically as we move to more powerful machines. When running on 4xlarge instances, both the general purpose and memory optimized families broke the 5x speedup barrier!

Results are roughly similar when switching to Scala 2.12.8, with a 5x speedup on the largest instance! Not too bad, and it supports the point that Hydra makes good use of all available cores.

While compilation speed is great in itself, it’s worth checking out the economics: what is the cost per build? Does Hydra make economic sense?

Since Scala 2.12.8 is clearly faster than 2.12.1, we skip the numbers for 2.12.1. Upgrades of minor Scala versions are usually easy and it doesn’t make much sense to stay on an old version.

The cost per build with Hydra stays very close to the cheapest option, to the point that the 5x speedup on a general purpose m5.4xlarge machine can be achieved at a cost that’s 25% lower than on an m5.xlarge instance (two sizes smaller)!
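The arithmetic behind such a comparison is worth spelling out. The sketch below assumes pro-rata billing (cost is hourly price times build duration) and the rule of thumb that price doubles with each size step, so an instance two sizes up costs 4x as much per hour. The function names are ours, and no actual prices or measured build times are used.

```python
def cost_per_build(hourly_price, build_seconds):
    """Cost of one build, assuming the instance is billed pro rata."""
    return hourly_price * build_seconds / 3600.0

def relative_cost(price_ratio, speedup):
    """Cost per build on the larger instance divided by cost on the smaller one.

    price_ratio: hourly price of the larger instance / hourly price of the smaller one
    speedup:     build time on the smaller instance / build time on the larger one
    """
    return price_ratio / speedup

def required_speedup(price_ratio, saving):
    """Speedup the larger instance needs in order to cut cost per build by `saving`."""
    return price_ratio / (1.0 - saving)

price_ratio = 2 ** 2  # two size steps, price assumed to double at each step
print(f"break-even speedup: {required_speedup(price_ratio, 0.0):.2f}x")
print(f"speedup for a 25% saving: {required_speedup(price_ratio, 0.25):.2f}x")
print(f"relative cost at a 5x speedup: {relative_cost(price_ratio, 5.0):.2f}")
```

In other words, a larger instance only pays for itself when the build gets faster by more than the price ratio, which is out of reach for a compiler that does not scale with the number of cores.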

Conclusion: 5X Speedups = Money Saved

Hydra delivers a sweet 5x speedup on a large, real-world project running on Amazon EC2 instances! And all this without compromising on Scala features, libraries or changes to the development environment. Moreover, the cost curve gets (almost) flat: faster feedback at a lower cost.