How to reduce compilation times of Haskell projects

March 26, 2016 | Haskell | arybczak

Recently there was a discussion on Reddit revolving around overwhelming compilation times for non-trivial Haskell projects using GHC. The specific problem OP was having turned out to be a quirk in the optimizer (apparently large list literals are not GHC friendly); however, there were also claims that the compiler is getting slower with each major release. While this may be an exaggeration, I personally experienced a decrease in performance after I upgraded GHC from 7.6 to 7.8, as one of the modules of hpqtypes started to take substantially longer to compile.

The thread also contained a couple of general tips for reducing compilation times. Based on the given advice and additional experimentation I was able to cut compilation times of various projects down by 30-50%. Let me demonstrate how to achieve this by using lens and idris as examples.

Setting up the environment

Let’s get the latest versions from git and set up a cabal sandbox for lens:
$ git clone git@github.com:ekmett/lens.git
$ cd lens
$ cabal sandbox init
$ cabal install -j --dependencies-only

and idris:
$ git clone git@github.com:idris-lang/Idris-dev.git
$ cd Idris-dev
$ cabal sandbox init
$ cabal install -j --dependencies-only

The default behavior in recent cabal versions is to install both static and shared versions of the libraries and this is exactly what we want.

We look at the compilation times of the lens library (built with cabal build) and the idris library (built with cabal build lib:idris). Other targets (test suites in particular) were not considered.

Disabling optimization

The default behavior for cabal is to compile packages with optimization enabled. This is a sane default for installing dependencies, but not so much for development sessions when most of the time we are not concerned about the performance as much as type-checking information or testing whether the newly developed piece of application does the right thing.

Luckily we can disable that rather easily by passing --disable-optimization to cabal during configure phase. Let’s see how disabling optimization (equivalent to passing -O0 instead of -O1 to GHC) affects compilation times for both lens and idris (GHC 7.10.3, Intel Core i7-3770):

Library	Total time, -O1 (s)	Total time, -O0 (s)	Time reduction (%)
lens	33.61	32.96	1.93
idris	230.98	57.16	75.25

Interestingly, disabling optimization cuts compilation time drastically for idris whereas it has almost no effect on lens. A possible explanation is that these projects contain a very different type of code (a lot of type-checker stressing, yet relatively simple functions for lens and complex logic and data transformations for idris), hence the difference.

Tweaking garbage collector settings

lens

If in the past you had to profile a Haskell program and track its memory usage, time spent doing GC etc., there is a good chance you are familiar with RTS options, specifically -s. Fortunately for us GHC is also a Haskell program, so we can apply the same here:
$ cabal configure --ghc-options="-j +RTS -s -RTS"

This will show us a number of useful statistics collected during compilation.
$ cabal build

The results are as follows:

  51,744,119,912 bytes allocated in the heap
   7,108,019,352 bytes copied during GC
     266,067,568 bytes maximum residency (25 sample(s))
      12,360,040 bytes maximum slop
             734 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4545 colls,  4262 par   75.959s   9.682s     0.0021s    0.0220s
  Gen  1        25 colls,    18 par    8.390s   1.131s     0.0452s    0.1030s

  Parallel GC work balance: 15.48% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   55.998s  ( 22.726s elapsed)
  GC      time   84.350s  ( 10.813s elapsed)
  EXIT    time    0.042s  (  0.041s elapsed)
  Total   time  140.400s  ( 33.580s elapsed)

  Alloc rate    924,035,605 bytes per MUT second

  Productivity  39.9% of total user, 166.9% of total elapsed

The full build took 33.58 seconds of wall clock time, where 22.73 seconds were spent doing actual work and 10.81 seconds on garbage collection, which is definitely not ideal. The important information is that the majority of time was spent dealing with short-lived data (generation 0). That’s a starting point for improvement.

If we can decrease the number of generation 0 data collections (there were 4545 of them), we can possibly save a substantial amount of time. To achieve this we are interested in the following RTS options (excerpt from the GHC manual), which look like they’re exactly what we need to customize:

-Asize

[Default: 512k] Set the allocation area size used by the garbage collector. The allocation area (actually generation 0 step 0) is fixed and is never resized (unless you use -H, below).

Increasing the allocation area size may or may not give better performance (a bigger allocation area means worse cache behaviour but fewer garbage collections and less promotion).

With only 1 generation (-G1) the -A option specifies the minimum allocation area, since the actual size of the allocation area will be resized according to the amount of data in the heap (see -F, below).

-nsize

[Default: 0, Example: -n4m] When set to a non-zero value, this option divides the allocation area (-A value) into chunks of the specified size. During execution, when a processor exhausts its current chunk, it is given another chunk from the pool until the pool is exhausted, at which point a collection is triggered.

This option is only useful when running in parallel (-N2 or greater). It allows the processor cores to make better use of the available allocation area, even when cores are allocating at different rates. Without -n, each core gets a fixed-size allocation area specified by the -A, and the first core to exhaust its allocation area triggers a GC across all the cores. This can result in a collection happening when the allocation areas of some cores are only partially full, so the purpose of the -n is to allow cores that are allocating faster to get more of the allocation area. This means less frequent GC, leading a lower GC overhead for the same heap size.

This is particularly useful in conjunction with larger -A values, for example -A64m -n4m is a useful combination on larger core counts (8+).

The default allocation area size is 512KB, so let’s increase it to 128MB, simultaneously dividing it into a set of 2MB chunks to adjust for the fact that each core may use a different amount of memory. Then it’s time to rebuild the project from scratch and see if this helps.
$ cabal clean
$ cabal configure --ghc-options="-j +RTS -A128m -n2m -s -RTS"
$ cabal build

The results are as follows:

  51,600,961,856 bytes allocated in the heap
   1,199,661,504 bytes copied during GC
     198,991,376 bytes maximum residency (5 sample(s))
       6,914,672 bytes maximum slop
            1441 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        59 colls,    57 par   17.486s   2.214s     0.0375s    0.0971s
  Gen  1         5 colls,     3 par    0.969s   0.160s     0.0320s    0.0698s

  Parallel GC work balance: 10.93% (serial 0%, perfect 100%)

  TASKS: 20 (1 bound, 19 peak workers (19 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   53.068s  ( 22.496s elapsed)
  GC      time   18.455s  (  2.374s elapsed)
  EXIT    time    0.030s  (  0.029s elapsed)
  Total   time   71.564s  ( 24.900s elapsed)

  Alloc rate    972,352,408 bytes per MUT second

  Productivity  74.2% of total user, 213.3% of total elapsed

The full build took 24.9 seconds of wall clock time, where 22.5 seconds were spent on actual work and 2.37 seconds on garbage collection. This is a huge performance win – GHC spent almost 80% less time on garbage collection! The downside of this is the memory usage – twice as much as the previous run. It is worth noting though that RAM is much cheaper than time, so it’s still a great deal.
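The percentages quoted here can be recomputed directly from the two `+RTS -s` reports. The following is a minimal sketch; the helper name `reduction` is made up for illustration, and the numbers are copied from the elapsed-time and memory lines of the reports above.

```haskell
-- Percentage by which `new` is smaller than `old`.
-- (`reduction` is a hypothetical helper, not part of any library.)
reduction :: Double -> Double -> Double
reduction old new = (old - new) / old * 100

main :: IO ()
main = do
  -- Elapsed GC time in seconds: default settings vs. -A128m -n2m.
  putStrLn $ "GC time reduction (%):    " ++ show (reduction 10.813 2.374)
  -- Elapsed total time in seconds for the same two runs.
  putStrLn $ "Wall clock reduction (%): " ++ show (reduction 33.580 24.900)
  -- "Total memory in use" in MB, new run relative to the old one.
  putStrLn $ "Memory ratio:             " ++ show (1441 / 734 :: Double)
```

This gives roughly 78% less elapsed GC time, about 26% less wall clock time, and a memory footprint of about 1.96 times the original.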

If such high memory usage is unacceptable, one may simply reduce the size of the allocation area: using 64MB or even 32MB also gives performance gains, just not as substantial. I included a few tables with more data to compare how various flags affect memory usage and compilation times.

Compilation statistics (default GC settings)
Jobs	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)
1	41.02	6.93	47.95	664
2	29.33	8.18	37.51	687
4	22.4	8.21	30.61	725
8	22.11	10.97	33.08	742

Compilation statistics (-A32m -n2m)
Jobs	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage vs. default (%)	Time reduction (%)
1	41.43	6.32	47.75	697	104.97	0.42
2	26.31	5.76	32.07	712	103.64	14.50
4	22.71	5.82	28.53	772	106.48	6.80
8	22.51	4.9	27.41	903	121.70	17.14

Compilation statistics (-A128m -n2m)
Jobs	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage vs. default (%)	Time reduction (%)
1	41.58	4.87	46.45	757	114.01	3.13
2	26.63	3.64	30.27	910	132.46	19.30
4	22.39	2.95	25.34	1123	154.90	17.22
8	22.44	2.43	24.87	1514	204.04	24.82

It’s worth noting that modifying garbage collector settings has the biggest impact when compiling modules in parallel – for sequential compilation the gains are small (roughly 3%).

Overall, fiddling with GC gives us almost 25% reduction of compilation time in the best case. But we are not done yet.

idris

Since the detailed analysis was done for lens, I just include results of compiling idris with default GC settings and -A128m -n2m for both optimized and non-optimized build.

Compilation statistics (-j8 -O1)
GC settings	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage vs. default (%)	Time reduction (%)
default	147.47	83.4	230.87	2384	n/a	n/a
-A128m -n2m	148.29	28.06	176.35	3250	136.33	23.62

Compilation statistics (-j8 -O0)
GC settings	MUT (s)	GC (s)	Total (s)	Mem. usage (MB)	Mem. usage vs. default (%)	Time reduction (%)
default	41.44	15.68	57.12	625	n/a	n/a
-A128m -n2m	40.64	3.31	43.95	1499	239.84	23.06

As we can see, tweaking GC settings gives a reduction of compilation times for both the optimized and the unoptimized idris build that is similar to the one observed for lens.

To share or not to share?

As mentioned earlier, by default cabal builds both the static and the shared version of a library, yet during development we don’t need both of them. Let’s see how opting to build only one version affects compilation times (all builds were done with -j +RTS -A128m -n2m -RTS flags):

Project	Both versions (s)	Static only (s)	Shared only (s)	Time reduction (%)	Overall time reduction (%)
lens	25.03	20.83	20.7	17.30	37.42
idris (-O1)	179.32	178.97	140.23	21.80	39.26
idris (-O0)	44.13	44.19	26.32	40.36	53.92

In the case of lens, building only one version of the library is around 17% faster than building both. Whether the static or the shared version is built is irrelevant, as building either takes approximately the same amount of time.

Idris, on the other hand, gives rather surprising results: building only the shared version is much faster than building only the static one! Unfortunately I have no good explanation for why that is the case. If anyone does, please let me know as I’m quite curious.

Overall we end up with 37% compilation time improvement for lens and 39% to 53% improvement for idris (depending on the optimization level), which is great.

Conclusions

GHC can be quite slow at times and surely there is a lot of low-hanging fruit when it comes to specific performance improvements in the compiler itself (in the case I described at the beginning, locating the culprit may be as easy as doing git bisect between the ghc-7.6.1-release and ghc-7.8.1-release tags). However, a substantial decrease in compilation times can be achieved simply by tweaking garbage collector settings as well as instructing the compiler to emit code suitable for dynamic linking.

To summarize, in order to reduce compilation times of a Haskell project it is best to pass to cabal configure the following options:

--disable-optimization to disable optimization if appropriate,
--ghc-options="-j +RTS -A128m -n2m -RTS" to use multiple CPU cores and make GHC spend substantially less time doing garbage collection,
--disable-library-vanilla to build only shared library version,
--enable-executable-dynamic to link any executable in the project dynamically.

ncmpcpp 0.7 and 0.6.8 released

November 22, 2015 | ncmpcpp | arybczak

The list of introduced changes is available here.

ncmpcpp 0.7_beta1 released

November 4, 2015 | ncmpcpp | arybczak

The list of introduced changes is available here.

Calculation of Fibonacci numbers in logarithmic number of steps

November 1, 2015 | Haskell, Mathematics | arybczak

Disclaimer: this post is pretty much a comprehensive solution to exercise 1.19 from Structure and Interpretation of Computer Programs by Harold Abelson and Gerald Jay Sussman (the solution is written in Haskell instead of Lisp though).

Let us quickly recall the definition of Fibonacci numbers as well as a few well-known ways to calculate them.

Fibonacci numbers are defined recursively, with $n$-th number denoted as $
F_n = \left\{
\begin{array}{l l}
0 & \quad \text{if $n = 0$} \\
1 & \quad \text{if $n = 1$} \\
F_{n-2} + F_{n-1} & \quad \text{otherwise.}
\end{array} \right.
$

This definition can be translated to a Haskell program in a straightforward manner:

fib1 :: Int -> Integer
fib1 0 = 0
fib1 1 = 1
fib1 n = fib1 (n-2) + fib1 (n-1)

However, computing the $n$-th Fibonacci number with the above function requires $O(\log_2(F_n)n)$ space and $O(\log_2(F_n)2^n)$ time: recursive calls create a binary tree of depth $n$ (hence the space requirement, as you need to keep track of $n$ intermediate values, each taking up to $O(\log_2(F_n))$ bits), which contains at most $2^n-1$ elements, where each node takes $O(\log_2(F_n))$ time to compute as we’re using arbitrary precision arithmetic. It is worth noting that $O(\log_2(F_n))$ is in fact equal to $O(n)$ as the growth of the Fibonacci sequence is exponential, but the former will be used as it’s clearer.

The other, significantly more performant way is to use the iterative approach, i.e. start with $F_0$ and $F_1$ and by appropriately adding them, compute your way up to $F_n$:

fib2 :: Int -> Integer
fib2 n = go n 0 1
  where
    go k a b
      | k == 0    = a
      | otherwise = go (k - 1) b (a + b)

The initial state is $(0, 1) = (F_0, F_1)$ and each iteration we replace a state $(F_{n-k}, F_{n-k+1})$ with $(F_{n-k+1}, F_{n-k+2})$, eventually ending up with $(F_n, F_{n+1})$ and taking the first element of a pair as a result.
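To make the state transitions concrete, here is a small sketch that prints the first few states. The name `step` is introduced here only for illustration; it performs the same update as one iteration of the loop in fib2.

```haskell
-- One iteration of the loop: (F_k, F_{k+1}) becomes (F_{k+1}, F_{k+2}).
-- (`step` is a hypothetical name used only in this example.)
step :: (Integer, Integer) -> (Integer, Integer)
step (a, b) = (b, a + b)

main :: IO ()
main = mapM_ print (take 6 (iterate step (0, 1)))
-- prints (0,1), (1,1), (1,2), (2,3), (3,5), (5,8), one pair per line
```

The first component of the state after $n$ steps is $F_n$, which is exactly what fib2 returns.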

The performance of this variation is much better than that of its predecessor, as its space complexity is $O(\log_2(F_n))$ and its time complexity is $O(\log_2(F_n)n)$.

Can we do better than that? As a matter of fact, we can. First of all, observe that in the last solution, each iteration we replaced a pair of two numbers with another pair of two numbers using elementary arithmetic operations, which is basically a transformation in two dimensional vector space (let us assume $\mathbb{R}^2$). In fact, this transformation is linear and its matrix representation is $T = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}$, because $T \cdot \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} = \begin{pmatrix} F_{n+1} \\ F_n + F_{n+1} \end{pmatrix} = \begin{pmatrix} F_{n+1} \\ F_{n+2} \end{pmatrix}$.

With this knowledge the formula for Fibonacci numbers can be written in a compact way:

$F_n = \pi_1\left( T^n \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix} \right)$, where $\pi_1 : \mathbb{R}^2 \rightarrow \mathbb{R}$, $\pi_1\left(\begin{pmatrix}v_1 \\ v_2\end{pmatrix}\right) = v_1$.

There are two routes to choose from at this point: we can either use an algorithm for fast exponentiation of matrices (such as square and multiply) to calculate $T^n$ or try to apply more optimizations. We will go with the latter, because matrix multiplication is still a pretty heavy operation and we can improve the situation by exploiting the structure of $T$.
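For comparison, the first route (the one not pursued in this post) might look roughly like the sketch below. The type `M` and the functions `mul`, `power` and `fibMat` are names made up for this example, and `power` assumes a non-negative exponent.

```haskell
-- A 2x2 matrix stored row by row: M a b c d stands for [[a, b], [c, d]].
-- (All names in this block are hypothetical, chosen for illustration.)
data M = M !Integer !Integer !Integer !Integer

-- Ordinary 2x2 matrix product: eight multiplications, four additions.
mul :: M -> M -> M
mul (M a b c d) (M e f g h) =
  M (a*e + b*g) (a*f + b*h) (c*e + d*g) (c*f + d*h)

-- Square-and-multiply exponentiation; only meant for k >= 0.
power :: M -> Int -> M
power m k
  | k == 0    = M 1 0 0 1                                 -- identity matrix
  | even k    = let h = power m (k `div` 2) in mul h h    -- m^k = (m^(k/2))^2
  | otherwise = mul m (power m (k - 1))                   -- m^k = m * m^(k-1)

-- T^n applied to (F_0, F_1) = (0, 1) is the second column of T^n,
-- so F_n is the top-right entry of T^n.
fibMat :: Int -> Integer
fibMat n = let M _ b _ _ = power (M 0 1 1 1) n in b

main :: IO ()
main = print (map fibMat [0 .. 10])
-- prints [0,1,1,2,3,5,8,13,21,34,55]
```

Each call to `mul` costs eight big-integer multiplications, which is the overhead the rest of this post tries to avoid.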

First, close observation of the powers of $T$ allows us to notice the following:

Fact. $T^n = \begin{pmatrix} F_{n-1} & F_n \\ F_n & F_{n+1} \end{pmatrix}$ for $n \in \mathbb{N_+}$.

Proof. For $n = 1$ it is easy to see that $T = \begin{pmatrix}F_0 & F_1 \\ F_1 & F_2\end{pmatrix}$. Now assume that the fact is true for some $n \in \mathbb{N_+}$. Then $T^{n+1} = T^n \cdot T = \begin{pmatrix} F_{n-1} & F_n \\ F_n & F_{n+1} \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} F_n & F_{n-1} + F_n \\ F_{n+1} & F_n + F_{n+1} \end{pmatrix} = \begin{pmatrix} F_n & F_{n+1} \\ F_{n+1} & F_{n+2} \end{pmatrix}$.

Corollary. For $n \in \mathbb{N_+}$ there exist $p$ and $q$ such that $T^n = \begin{pmatrix} p & q \\ q & p+q \end{pmatrix}$.

Equipped with this knowledge we can not only represent $T^k$ for some $k \in \mathbb{N_+}$ using only two numbers instead of four, but also derive a transformation of these numbers that corresponds to the computation of $T^{2k}$.

Fact. Let $p, q \in \mathbb{N}$. Then $\begin{pmatrix} p & q \\ q & p+q \end{pmatrix}^2 = \begin{pmatrix} p' & q' \\ q' & p'+q' \end{pmatrix}$, where $p' = p^2 + q^2$ and $q' = (2p + q)q$.
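The fact follows from multiplying the matrix by itself; the intermediate step, spelled out as a worked equation, is:

```latex
\begin{pmatrix} p & q \\ q & p+q \end{pmatrix}^2
= \begin{pmatrix}
    p \cdot p + q \cdot q & p \cdot q + q(p+q) \\
    q \cdot p + (p+q)q    & q \cdot q + (p+q)^2
  \end{pmatrix}
= \begin{pmatrix}
    p^2 + q^2 & (2p+q)q \\
    (2p+q)q   & (p^2 + q^2) + (2p+q)q
  \end{pmatrix}
```

The bottom-right entry uses $q^2 + (p+q)^2 = p^2 + 2q^2 + 2pq = (p^2 + q^2) + (2p + q)q$, so the squared matrix again has the shape given in the corollary.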

Now, we can put all of these together and construct the final solution:

fib3 :: Int -> Integer
fib3 n = go n 0 1 0 1
  where
    go k a b p q
      | k == 0    = a
      | odd k     = go (k - 1) (p*a + q*b) (q*a + (p+q)*b) p q
      | otherwise = go (k `div` 2) a b (p*p + q*q) ((2*p + q)*q)

Let us denote the $i$-th bit of $n$ by $n_i \in \{0,1\}$, where $i \in \{0, \dots, \lfloor \log_2(n) \rfloor \}$. We start with $\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}$ and $\begin{pmatrix} p \\ q \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \cong T$. Then we traverse the bits of $n$ and return $a$. Note that while iterating through $n_i$:

$\begin{pmatrix} p \\ q \end{pmatrix} \cong T^{2^i}$.
$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} F_m \\ F_{m+1} \end{pmatrix}$ for $m = \displaystyle\sum_{j = 0}^{i-1} n_j \cdot 2^j$.
If $n_i = 1$, $\begin{pmatrix} F_m \\ F_{m+1} \end{pmatrix}$ is replaced with $T^{2^i} \cdot \begin{pmatrix} F_m \\ F_{m+1} \end{pmatrix} = \begin{pmatrix} F_{m+2^i} \\ F_{m+2^i+1} \end{pmatrix}$.

Hence, in the end $\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix}$, so the function correctly computes $n$-th Fibonacci number.
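As a quick sanity check on top of the argument above (not a substitute for it), the logarithmic version can be compared against the iterative one on small inputs. This sketch repeats both definitions from the post so that it is self-contained; the range `[0 .. 1000]` is an arbitrary choice.

```haskell
-- Iterative version, as defined earlier in the post.
fib2 :: Int -> Integer
fib2 n = go n 0 1
  where
    go k a b
      | k == 0    = a
      | otherwise = go (k - 1) b (a + b)

-- Logarithmic version, as defined earlier in the post.
fib3 :: Int -> Integer
fib3 n = go n 0 1 0 1
  where
    go k a b p q
      | k == 0    = a
      | odd k     = go (k - 1) (p*a + q*b) (q*a + (p+q)*b) p q
      | otherwise = go (k `div` 2) a b (p*p + q*q) ((2*p + q)*q)

main :: IO ()
main = do
  -- First few values, for eyeballing.
  print (map fib3 [0 .. 10])
  -- Both implementations should agree on every tested input.
  print (all (\n -> fib2 n == fib3 n) [0 .. 1000])
```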

The space complexity of this solution is $O(\log_2(F_n))$, whereas its time complexity is $O(\log_2(F_n)(\log_2(n)+H(n)))$ with $H(n)$ being the Hamming weight of $n$.
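The $\log_2(n) + H(n)$ factor is the number of loop iterations: one halving per bit position above the lowest and one decrement per set bit. The sketch below counts iterations with the same control flow as the loop in fib3 and compares the count with $\lfloor \log_2(n) \rfloor + H(n)$. The names `steps` and `predicted` are made up for this example, and the check covers only $n \geq 1$.

```haskell
import Data.Bits (countLeadingZeros, finiteBitSize, popCount)

-- Number of iterations the fib3 loop performs for input k >= 0
-- (same branching as `go`, with the arithmetic stripped out).
steps :: Int -> Int
steps k
  | k == 0    = 0
  | odd k     = 1 + steps (k - 1)
  | otherwise = 1 + steps (k `div` 2)

-- floor(log_2 n) + H(n) for n >= 1, where H is the Hamming weight.
predicted :: Int -> Int
predicted n = floorLog2 + popCount n
  where
    -- Index of the highest set bit of n.
    floorLog2 = finiteBitSize n - 1 - countLeadingZeros n

main :: IO ()
main = print (all (\n -> steps n == predicted n) [1 .. 100000])
```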

Now, let us put all of the implementations together and measure their performance using criterion library.

{-# OPTIONS_GHC -Wall #-}
{-# LANGUAGE BangPatterns #-}
module Main where

import Criterion.Main
import Criterion.Types

fib1 :: Int -> Integer
fib1 0 = 0
fib1 1 = 1
fib1 n = fib1 (n - 2) + fib1 (n - 1)

fib2 :: Int -> Integer
fib2 n = go n 0 1
  where
    go !k !a b
      | k == 0    = a
      | otherwise = go (k - 1) b (a + b)

fib3 :: Int -> Integer
fib3 n = go n 0 1 0 1
  where
    go !k !a b !p !q
      | k == 0    = a
      | odd k     = go (k - 1) (p*a + q*b) (q*a + (p+q)*b) p q
      | otherwise = go (k `div` 2) a b (p*p + q*q) ((2*p + q)*q)

main :: IO ()
main = defaultMainWith (defaultConfig { timeLimit = 2 }) [
    bgroup "fib1" $ map (benchmark fib1) $ [10, 20] ++ [30..42]
  , bgroup "fib2" $ map (benchmark fib2) $ 10000 : map (100000*) [1..10]
  , bgroup "fib3" $ map (benchmark fib3) $ 1000000 : map (10000000*) ([1..10] ++ [20])
  ]
  where
    benchmark fib n = bench (show n) $ whnf fib n

The above program was compiled with GHC 7.10.2 and run on an Intel Core i7-3770. The HTML report generated by it is available here.

In particular, after substituting the main function with:

main :: IO ()
main = defaultMainWith (defaultConfig { timeLimit = 2 }) [
    bgroup "fib3" [benchmark fib3 1000000000]
  ]
  where
    benchmark fib n = bench (show n) $ whnf fib n

we can see that the final implementation is able to calculate the billionth Fibonacci number in a very reasonable time:
benchmarking fib3/1000000000
time                 30.82 s    (29.86 s .. 31.97 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 30.34 s    (29.96 s .. 30.56 s)
std dev              345.1 ms   (0.0 s .. 387.0 ms)
variance introduced by outliers: 19% (moderately inflated)

rybczak.net

A mixture of programming, computer science and mathematics.