March | 2016 | rybczak.net

Recently there was a discussion on reddit revolving around overwhelming compilation times for non-trivial Haskell projects using GHC. The specific problem OP was having turned out to be a quirk in the optimizer (apparently large list literals are not GHC friendly), however there were also claims that the compiler is getting slower with each major release. While this may be an exaggeration, I personally experienced the decrease in performance after I upgraded GHC from 7.6 to 7.8 as one of the modules of hpqtypes started to take substantially longer to compile.

The thread also contained a couple of general tips for reducing compilation times. Based on the given advice and additional experimentation I was able to cut compilation times of various projects down by 30-50%. Let me demonstrate how to achieve this by using lens and idris as examples.

Setting up the environment

Let’s get the latest versions from git and set up a cabal sandbox for lens:
$ git clone git@github.com:ekmett/lens.git $ cd lens $ cabal sandbox init $ cabal install -j --dependencies-only

and idris:
$ git clone git@github.com:idris-lang/Idris-dev.git $ cd Idris-dev $ cabal sandbox init $ cabal install -j --dependencies-only

The default behavior in recent cabal versions is to install both static and shared versions of the libraries and this is exactly what we want.

We look at compilation times of lens library (built with cabal build) and idris library (built with cabal build lib:idris), other targets (test suites in particular) were not considered.

Disabling optimization

The default behavior for cabal is to compile packages with optimization enabled. This is a sane default for installing dependencies, but not so much for development sessions when most of the time we are not concerned about the performance as much as type-checking information or testing whether the newly developed piece of application does the right thing.

Luckily we can disable that rather easily by passing --disable-optimization to cabal during configure phase. Let’s see how disabling optimization (equivalent to passing -O0 instead of -O1 to GHC) affects compilation times for both lens and idris (GHC 7.10.3, Intel Core i7-3770):

Library	Total time, -O1 (s)	Total time, -O0 (s)	Time reduction (%)
lens	33.61	32.96	1.93
idris	230.98	57.16	75.25

Interestingly, disabling optimization cuts compilation time drastically for idris whereas it has almost no effect on lens. A possible explanation is that these projects contain a very different type of code (a lot of type-checker stressing, yet relatively simple functions for lens and complex logic and data transformations for idris), hence the difference.

Tweaking garbage collector settings

lens

If in the past you had to profile a Haskell program and track its memory usage, time spent doing GC etc., there is a good chance you are familiar with RTS options, specifically -s. Fortunately for us GHC is also a Haskell program, so we can apply the same here:
$ cabal configure --ghc-options="-j +RTS -s -RTS"

This will show us a number of useful statistics collected during compilation.
$ cabal build

The results are as follows:

  51,744,119,912 bytes allocated in the heap
   7,108,019,352 bytes copied during GC
     266,067,568 bytes maximum residency (25 sample(s))
      12,360,040 bytes maximum slop
             734 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4545 colls,  4262 par   75.959s   9.682s     0.0021s    0.0220s
  Gen  1        25 colls,    18 par    8.390s   1.131s     0.0452s    0.1030s

  Parallel GC work balance: 15.48% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   55.998s  ( 22.726s elapsed)
  GC      time   84.350s  ( 10.813s elapsed)
  EXIT    time    0.042s  (  0.041s elapsed)
  Total   time  140.400s  ( 33.580s elapsed)

  Alloc rate    924,035,605 bytes per MUT second

  Productivity  39.9% of total user, 166.9% of total elapsed

The full build took 33.58 seconds of wall clock time, where 22.73 seconds were spent doing actual work and 10.81 seconds on garbage collection, which is definitely not ideal. The important information is that the majority of time was spent dealing with short-lived data (generation 0). That’s a starting point for improvement.

If we can decrease the number of generation 0 data collections (there were 4545 of them), we can possibly save a substantial amount of time. To achieve this we are interested in the following RTS options (excerpt from the GHC manual), which look like they’re exactly what we need to customize:

-Asize

[Default: 512k] Set the allocation area size used by the garbage collector. The allocation area (actually generation 0 step 0) is fixed and is never resized (unless you use -H, below).

Increasing the allocation area size may or may not give better performance (a bigger allocation area means worse cache behaviour but fewer garbage collections and less promotion).

With only 1 generation (-G1) the -A option specifies the minimum allocation area, since the actual size of the allocation area will be resized according to the amount of data in the heap (see -F, below).

-nsize

[Default: 0, Example: -n4m] When set to a non-zero value, this option divides the allocation area (-A value) into chunks of the specified size. During execution, when a processor exhausts its current chunk, it is given another chunk from the pool until the pool is exhausted, at which point a collection is triggered.

This option is only useful when running in parallel (-N2 or greater). It allows the processor cores to make better use of the available allocation area, even when cores are allocating at different rates. Without -n, each core gets a fixed-size allocation area specified by the -A, and the first core to exhaust its allocation area triggers a GC across all the cores. This can result in a collection happening when the allocation areas of some cores are only partially full, so the purpose of the -n is to allow cores that are allocating faster to get more of the allocation area. This means less frequent GC, leading a lower GC overhead for the same heap size.

This is particularly useful in conjunction with larger -A values, for example -A64m -n4m is a useful combination on larger core counts (8+).

The default allocation area size is 512KB, so let’s increase it to 128MB, simultaneously dividing it into a set of 2MB chunks to adjust for the fact that each core may use a different amount of memory. Then it’s time to rebuild the project from scratch and see if this helps.
$ cabal clean $ cabal configure --ghc-options="-j +RTS -A128m -n2m -s -RTS" $ cabal build

The results are as follows:

  51,600,961,856 bytes allocated in the heap
   1,199,661,504 bytes copied during GC
     198,991,376 bytes maximum residency (5 sample(s))
       6,914,672 bytes maximum slop
            1441 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        59 colls,    57 par   17.486s   2.214s     0.0375s    0.0971s
  Gen  1         5 colls,     3 par    0.969s   0.160s     0.0320s    0.0698s

  Parallel GC work balance: 10.93% (serial 0%, perfect 100%)

  TASKS: 20 (1 bound, 19 peak workers (19 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   53.068s  ( 22.496s elapsed)
  GC      time   18.455s  (  2.374s elapsed)
  EXIT    time    0.030s  (  0.029s elapsed)
  Total   time   71.564s  ( 24.900s elapsed)

  Alloc rate    972,352,408 bytes per MUT second

  Productivity  74.2% of total user, 213.3% of total elapsed

The full build took 24.9 seconds of wall clock time, where 22.5 seconds were spent on actual work and 2.37 seconds on garbage collection. This is a huge performance win – GHC spent almost 80% less time on garbage collection! The downside of this is the memory usage – twice as much as the previous run. It is worth noting though that RAM is much cheaper than time, so it’s still a great deal.

If such high memory usage is unacceptable, one may simply reduce size of the allocation area – using 64MB or even 32MB also gives performance gains, just not as substantial. I included a few tables with more data to compare how various flags affect memory usage and compilation times.

Compilation statistics (default GC settings)
Jobs	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)
1	41.02	6.93	47.95	664
2	29.33	8.18	37.51	687
4	22.4	8.21	30.61	725
8	22.11	10.97	33.08	742

Compilation statistics (-A32m -n2m)
Jobs	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage increase (%)	Time reduction (%)
1	41.43	6.32	47.75	697	104.97	0.42
2	26.31	5.76	32.07	712	103.64	14.50
4	22.71	5.82	28.53	772	106.48	6.80
8	22.51	4.9	27.41	903	121.70	17.14

Compilation statistics (-A128m -n2m)
Jobs	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage increase (%)	Time reduction (%)
1	41.58	4.87	46.45	757	114.01	3.13
2	26.63	3.64	30.27	910	132.46	19.30
4	22.39	2.95	25.34	1123	154.90	17.22
8	22.44	2.43	24.87	1514	204.04	24.82

It’s worth noting that modifying garbage collector settings has the biggest impact when compiling modules in parallel – for sequential compilation the gains are small (roughly 3%).

Overall, fiddling with GC gives us almost 25% reduction of compilation time in the best case. But we are not done yet.

idris

Since the detailed analysis was done for lens, I just include results of compiling idris with default GC settings and -A128m -n2m for both optimized and non-optimized build.

Compilation statistics (-j8 -O1)
GC settings	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage increase (%)	Time reduction (%)
default	147.47	83.4	230.87	2384	n/a	n/a
-A128m -n2m	148.29	28.06	176.35	3250	136.33	23.62

Compilation statistics (-j8 -O0)
GC settings	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage increase (%)	Time reduction (%)
default	41.44	15.68	57.12	625	n/a	n/a
-A128m -n2m	40.64	3.31	43.95	1499	239.84	23.06

As we can see, tweaking GC settings gives similar reduction of compilation times for both optimized and unoptimized idris build as for lens.

To share or not to share?

As mentioned earlier, by default cabal builds both static and shared library version, yet during development we don’t need both of them. Let’s see how opting to build only one version affects compilation times (all builds were done with -j +RTS -A128m -n2m -RTS flags):

Project	Both versions (s)	Static only (s)	Shared only (s)	Time reduction (%)	Overall time reduction (%)
lens	25.03	20.83	20.7	17.30	37.42
idris (-O1)	179.32	178.97	140.23	21.80	39.26
idris (-O0)	44.13	44.19	26.32	40.36	53.92

In case of lens building only one version of the library is around 17% faster than building both. Also, whether static or shared version is built is irrelevant as building them takes approximately the same amount of time.

Idris, on the other hand, gives rather surprising results, where building shared version only is much faster than building static one! Unfortunately I have no good explanation for why that is the case. If anyone does, please let me know as I’m quite curious.

Overall we end up with 37% compilation time improvement for lens and 39% to 53% improvement for idris (depending on the optimization level), which is great.

Conclusions

GHC can be quite slow at times and surely there is a lot of low hanging fruit if it comes to specific performance improvements in the compiler itself (in case I described at the beginning locating a culprit may be as easy as doing git bisect between ghc-7.6.1-release and ghc-7.8.1-release tags). However, a substantial decrease in compilation times can be achieved simply by tweaking garbage collector settings as well as instructing the compiler to emit code suitable for dynamic linking.

To summarize, in order to reduce compilation times of a Haskell project it is best to pass to cabal configure the following options:

--disable-optimization to disable optimization if appropriate,
--ghc-options="-j +RTS -A128m -n2m -RTS" to use multiple CPU cores and make GHC spend substantially less time doing garbage collection,
--disable-library-vanilla to build only shared library version,
--enable-executable-dynamic to link any executable in the project dynamically.

rybczak.net

A mixture of programming, computer science and mathematics.

Monthly Archives: March 2016

How to reduce compilation times of Haskell projects