The thread also contained a couple of general tips for reducing compilation times. Based on the given advice and additional experimentation **I was able to cut compilation times of various projects down by 30-50%**. Let me demonstrate how to achieve this by using *lens* and *idris* as examples.

Let’s get the latest versions from git and set up a cabal sandbox for lens:

    $ git clone git@github.com:ekmett/lens.git
    $ cd lens
    $ cabal sandbox init
    $ cabal install -j --dependencies-only

and idris:

    $ git clone git@github.com:idris-lang/Idris-dev.git
    $ cd Idris-dev
    $ cabal sandbox init
    $ cabal install -j --dependencies-only

The default behavior in recent cabal versions is to install both static and shared versions of the libraries and this is exactly what we want.

We look at the compilation times of the lens library (built with `cabal build`) and the idris library (built with `cabal build lib:idris`); other targets (test suites in particular) were not considered.

The default behavior for cabal is to compile packages with optimization enabled. This is a sane default for installing dependencies, but not so much for development sessions when most of the time we are not concerned about the performance as much as type-checking information or testing whether the newly developed piece of application does the right thing.

Luckily we can disable that rather easily by passing `--disable-optimization` to cabal during the configure phase. Let’s see how disabling optimization (equivalent to passing -O0 instead of -O1 to GHC) affects compilation times for both lens and idris (GHC 7.10.3, Intel Core i7-3770):

Library | Total time, -O1 (s) | Total time, -O0 (s) | Time reduction (%) |
---|---|---|---|
lens | 33.61 | 32.96 | 1.93 |
idris | 230.98 | 57.16 | 75.25 |

Interestingly, disabling optimization cuts compilation time drastically for idris whereas it has almost no effect on lens. A possible explanation is that these projects contain a very different type of code (a lot of type-checker stressing, yet relatively simple functions for lens and complex logic and data transformations for idris), hence the difference.
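As a worked check of the last column, using the timings from the table (the symbols \(t_{\text{-O1}}\) and \(t_{\text{-O0}}\) are just shorthand introduced here for the two total times):

\( \text{reduction} = \dfrac{t_{\text{-O1}} - t_{\text{-O0}}}{t_{\text{-O1}}} \cdot 100\% \)

\( \text{idris: } \dfrac{230.98 - 57.16}{230.98} \cdot 100\% \approx 75.25\% \qquad \text{lens: } \dfrac{33.61 - 32.96}{33.61} \cdot 100\% \approx 1.93\% \)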

If in the past you had to profile a Haskell program and track its memory usage, time spent doing GC etc., there is a good chance you are familiar with RTS options, specifically -s. Fortunately for us GHC is also a Haskell program, so we can apply the same here:

    $ cabal configure --ghc-options="-j +RTS -s -RTS"

This will show us a number of useful statistics collected during compilation.

    $ cabal build

The results are as follows:

      51,744,119,912 bytes allocated in the heap
       7,108,019,352 bytes copied during GC
         266,067,568 bytes maximum residency (25 sample(s))
          12,360,040 bytes maximum slop
                 734 MB total memory in use (0 MB lost due to fragmentation)

                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0      4545 colls,  4262 par   75.959s   9.682s     0.0021s    0.0220s
      Gen  1        25 colls,    18 par    8.390s   1.131s     0.0452s    0.1030s

      Parallel GC work balance: 15.48% (serial 0%, perfect 100%)

      TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.001s  (  0.001s elapsed)
      MUT     time   55.998s  ( 22.726s elapsed)
      GC      time   84.350s  ( 10.813s elapsed)
      EXIT    time    0.042s  (  0.041s elapsed)
      Total   time  140.400s  ( 33.580s elapsed)

      Alloc rate    924,035,605 bytes per MUT second

      Productivity  39.9% of total user, 166.9% of total elapsed

The full build took 33.58 seconds of wall clock time, where 22.73 seconds were spent doing actual work and 10.81 seconds on garbage collection, which is definitely not ideal. The important information is that the majority of time was spent dealing with short-lived data (generation 0). That’s a starting point for improvement.
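To put these figures in proportion, here is the arithmetic on the elapsed times reported above:

\( \dfrac{\text{GC}}{\text{Total}} = \dfrac{10.813}{33.580} \approx 32.2\% \qquad \dfrac{\text{Gen 0}}{\text{GC}} = \dfrac{9.682}{10.813} \approx 89.5\% \)

So roughly a third of the wall clock time went to garbage collection, and about nine tenths of that was spent on generation 0.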

If we can decrease the number of generation 0 data collections (there were 4545 of them), we can possibly save a substantial amount of time. To achieve this we are interested in the following RTS options (excerpt from the GHC manual), which look like they’re exactly what we need to customize:

`-Asize` [Default: 512k] Set the allocation area size used by the garbage collector. The allocation area (actually generation 0 step 0) is fixed and is never resized (unless you use -H, below). Increasing the allocation area size may or may not give better performance (a bigger allocation area means worse cache behaviour but fewer garbage collections and less promotion). With only 1 generation (-G1) the -A option specifies the minimum allocation area, since the actual size of the allocation area will be resized according to the amount of data in the heap (see -F, below).

`-nsize` [Default: 0, Example: -n4m] When set to a non-zero value, this option divides the allocation area (-A value) into chunks of the specified size. During execution, when a processor exhausts its current chunk, it is given another chunk from the pool until the pool is exhausted, at which point a collection is triggered. This option is only useful when running in parallel (-N2 or greater). It allows the processor cores to make better use of the available allocation area, even when cores are allocating at different rates. Without -n, each core gets a fixed-size allocation area specified by the -A, and the first core to exhaust its allocation area triggers a GC across all the cores. This can result in a collection happening when the allocation areas of some cores are only partially full, so the purpose of the -n is to allow cores that are allocating faster to get more of the allocation area. This means less frequent GC, leading a lower GC overhead for the same heap size. This is particularly useful in conjunction with larger -A values, for example -A64m -n4m is a useful combination on larger core counts (8+).

The default allocation area size is 512KB, so let’s increase it to 128MB, simultaneously dividing it into a set of 2MB chunks to adjust for the fact that each core may use a different amount of memory. Then it’s time to rebuild the project from scratch and see if this helps.

    $ cabal clean
    $ cabal configure --ghc-options="-j +RTS -A128m -n2m -s -RTS"
    $ cabal build

The results are as follows:

      51,600,961,856 bytes allocated in the heap
       1,199,661,504 bytes copied during GC
         198,991,376 bytes maximum residency (5 sample(s))
           6,914,672 bytes maximum slop
                1441 MB total memory in use (0 MB lost due to fragmentation)

                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0        59 colls,    57 par   17.486s   2.214s     0.0375s    0.0971s
      Gen  1         5 colls,     3 par    0.969s   0.160s     0.0320s    0.0698s

      Parallel GC work balance: 10.93% (serial 0%, perfect 100%)

      TASKS: 20 (1 bound, 19 peak workers (19 total), using -N8)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.001s  (  0.001s elapsed)
      MUT     time   53.068s  ( 22.496s elapsed)
      GC      time   18.455s  (  2.374s elapsed)
      EXIT    time    0.030s  (  0.029s elapsed)
      Total   time   71.564s  ( 24.900s elapsed)

      Alloc rate    972,352,408 bytes per MUT second

      Productivity  74.2% of total user, 213.3% of total elapsed

The full build took 24.9 seconds of wall clock time, where 22.5 seconds were spent on actual work and 2.37 seconds on garbage collection. This is a huge performance win – GHC spent almost 80% less time on garbage collection! The downside of this is the memory usage – twice as much as the previous run. It is worth noting though that RAM is much cheaper than time, so it’s still a great deal.
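For reference, the comparison between the two runs works out as follows (elapsed times and memory figures taken from the two RTS reports):

\( \text{GC time: } \dfrac{10.813 - 2.374}{10.813} \approx 78.0\% \text{ less} \qquad \text{total time: } \dfrac{33.580 - 24.900}{33.580} \approx 25.8\% \text{ less} \qquad \text{memory: } \dfrac{1441}{734} \approx 1.96\times \)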

If such high memory usage is unacceptable, one may simply reduce size of the allocation area – using 64MB or even 32MB also gives performance gains, just not as substantial. I included a few tables with more data to compare how various flags affect memory usage and compilation times.

Jobs | MUT (s) | GC (s) | Total (s) | Mem. usage (MB) |
---|---|---|---|---|
1 | 41.02 | 6.93 | 47.95 | 664 |
2 | 29.33 | 8.18 | 37.51 | 687 |
4 | 22.4 | 8.21 | 30.61 | 725 |
8 | 22.11 | 10.97 | 33.08 | 742 |

Jobs | MUT (s) | GC (s) | Total (s) | Mem. usage (MB) | Mem. usage vs. default (%) | Time reduction (%) |
---|---|---|---|---|---|---|
1 | 41.43 | 6.32 | 47.75 | 697 | 104.97 | 0.42 |
2 | 26.31 | 5.76 | 32.07 | 712 | 103.64 | 14.50 |
4 | 22.71 | 5.82 | 28.53 | 772 | 106.48 | 6.80 |
8 | 22.51 | 4.9 | 27.41 | 903 | 121.70 | 17.14 |

Jobs | MUT (s) | GC (s) | Total (s) | Mem. usage (MB) | Mem. usage vs. default (%) | Time reduction (%) |
---|---|---|---|---|---|---|
1 | 41.58 | 4.87 | 46.45 | 757 | 114.01 | 3.13 |
2 | 26.63 | 3.64 | 30.27 | 910 | 132.46 | 19.30 |
4 | 22.39 | 2.95 | 25.34 | 1123 | 154.90 | 17.22 |
8 | 22.44 | 2.43 | 24.87 | 1514 | 204.04 | 24.82 |

It’s worth noting that modifying garbage collector settings has the biggest impact when compiling modules in parallel – for sequential compilation the gains are small (roughly 3%).
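The two percentage columns in the tables above appear to be computed against the default-GC row with the same number of jobs: the memory column is the memory usage expressed as a percentage of the default, and the time column is the relative saving in total time. A minimal Haskell sketch of that arithmetic follows; the helper names are hypothetical and the formulas are inferred from the table values rather than stated in the text.

```haskell
-- Hypothetical helpers; formulas inferred from the numbers in the tables.

-- | A value as a percentage of its baseline (e.g. memory vs. default GC).
percentOfBaseline :: Double -> Double -> Double
percentOfBaseline baseline value = 100 * value / baseline

-- | Relative time saved compared to the baseline, in percent.
timeReduction :: Double -> Double -> Double
timeReduction baseline value = 100 * (baseline - value) / baseline

-- | Round to two decimal places, as the tables do.
round2 :: Double -> Double
round2 x = fromIntegral (round (x * 100) :: Integer) / 100
```

For the 8-job rows, `round2 (timeReduction 33.08 24.87)` gives 24.82 and `round2 (percentOfBaseline 742 1514)` gives 204.04, matching the last table.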

Overall, fiddling with GC gives us almost 25% reduction of compilation time in the best case. But we are not done yet.

Since the detailed analysis was done for lens, I just include the results of compiling idris with default GC settings and with `-A128m -n2m` for both the optimized and the non-optimized build.

GC settings | MUT (s) | GC (s) | Total (s) | Mem. usage (MB) | Mem. usage vs. default (%) | Time reduction (%) |
---|---|---|---|---|---|---|
default | 147.47 | 83.4 | 230.87 | 2384 | n/a | n/a |
-A128m -n2m | 148.29 | 28.06 | 176.35 | 3250 | 136.33 | 23.62 |

GC settings | MUT (s) | GC (s) | Total (s) | Mem. usage (MB) | Mem. usage vs. default (%) | Time reduction (%) |
---|---|---|---|---|---|---|
default | 41.44 | 15.68 | 57.12 | 625 | n/a | n/a |
-A128m -n2m | 40.64 | 3.31 | 43.95 | 1499 | 239.84 | 23.06 |

As we can see, tweaking GC settings gives a similar reduction in compilation time for both the optimized and the unoptimized idris build as it does for lens.

As mentioned earlier, by default cabal builds both the static and the shared version of a library, yet during development we don’t need both of them. Let’s see how opting to build only one version affects compilation times (all builds were done with the `-j +RTS -A128m -n2m -RTS` flags):

Project | Both versions (s) | Static only (s) | Shared only (s) | Time reduction (%) | Overall time reduction (%) |
---|---|---|---|---|---|
lens | 25.03 | 20.83 | 20.7 | 17.30 | 37.42 |
idris (-O1) | 179.32 | 178.97 | 140.23 | 21.80 | 39.26 |
idris (-O0) | 44.13 | 44.19 | 26.32 | 40.36 | 53.92 |

In the case of lens, building only one version of the library is around 17% faster than building both. Whether the static or the shared version is built is irrelevant, as building either takes approximately the same amount of time.

Idris, on the other hand, gives rather surprising results: building only the shared version is much faster than building only the static one! Unfortunately I have no good explanation for why that is the case. If anyone does, please let me know as I’m quite curious.

**Overall we end up with 37% compilation time improvement for lens and 39% to 53% improvement for idris** (depending on the optimization level), which is great.
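The overall figures appear to be measured against the original parallel builds with default GC settings (33.08 s for lens, 230.87 s and 57.12 s for idris), with the shared-only build as the end point. Under that assumption the arithmetic is:

\( \text{lens: } \dfrac{33.08 - 20.70}{33.08} \approx 37.42\% \qquad \text{idris (-O1): } \dfrac{230.87 - 140.23}{230.87} \approx 39.26\% \qquad \text{idris (-O0): } \dfrac{57.12 - 26.32}{57.12} \approx 53.92\% \)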

GHC can be quite slow at times and surely there is a lot of low hanging fruit when it comes to specific performance improvements in the compiler itself (in the case I described at the beginning, locating a culprit may be as easy as doing git bisect between the ghc-7.6.1-release and ghc-7.8.1-release tags). However, a substantial decrease in compilation times can be achieved simply by tweaking garbage collector settings as well as instructing the compiler to emit code suitable for dynamic linking.

To summarize, in order to reduce compilation times of a Haskell project it is best to pass the following options to `cabal configure`:

- `--disable-optimization` to disable optimization if appropriate,
- `--ghc-options="-j +RTS -A128m -n2m -RTS"` to use multiple CPU cores and make GHC spend substantially less time doing garbage collection,
- `--disable-library-vanilla` to build only the shared library version,
- `--enable-executable-dynamic` to link any executable in the project dynamically.

Let us quickly recall the definition of Fibonacci numbers as well as a few well-known ways to calculate them.

Fibonacci numbers are defined recursively, with the \(n\)-th number denoted as \(F_n = \left\{ \begin{array}{l l} 0 & \quad \text{if $n = 0$} \\ 1 & \quad \text{if $n = 1$} \\ F_{n-2} + F_{n-1} & \quad \text{otherwise.} \end{array} \right.\)

This definition can be translated to a Haskell program in a straightforward manner:

    fib1 :: Int -> Integer
    fib1 0 = 0
    fib1 1 = 1
    fib1 n = fib1 (n-2) + fib1 (n-1)

However, computing the \(n\)-th Fibonacci number with the above function requires \(O(\log_2(F_n)n)\) space and \(O(\log_2(F_n)2^n)\) time: the recursive calls form a binary tree of depth \(n\) (hence the space requirement, as you need to keep track of \(n\) intermediate values, each taking up to \(O(\log_2(F_n))\) bits), which contains at most \(2^n-1\) elements, where each node takes \(O(\log_2(F_n))\) time to compute as we’re using arbitrary precision arithmetic. It is worth noting that \(O(\log_2(F_n))\) is in fact equal to \(O(n)\), as the Fibonacci sequence grows exponentially, but the former will be used as it is clearer.
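To make the size of that call tree concrete, the number of calls the naive recursion performs can itself be counted with a simple recurrence: a call for input \(n \geq 2\) makes one call for itself plus all the calls of its two children. The sketch below is only an illustration; the name `calls` is made up here and it assumes \(n \geq 0\).

```haskell
-- | Number of calls the naive recursion makes for input n (assumes n >= 0).
--   calls 0 = calls 1 = 1, and calls n = 1 + calls (n-2) + calls (n-1).
--   Computed iteratively so that counting is cheap even for large n.
calls :: Int -> Integer
calls n = fst (iterate step (1, 1) !! n)
  where
    -- (calls k, calls (k+1)) -> (calls (k+1), calls (k+2))
    step (c0, c1) = (c1, 1 + c0 + c1)
```

For example, `calls 10` is 177 and `calls 30` is 2692537, so the work grows exponentially while staying within the \(2^n-1\) bound mentioned above.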

The other, significantly more performant way is to use the iterative approach, i.e. start with \(F_0\) and \(F_1\) and by appropriately adding them, compute your way up to \(F_n\):

    fib2 :: Int -> Integer
    fib2 n = go n 0 1
      where
        go k a b
          | k == 0    = a
          | otherwise = go (k - 1) b (a + b)

The initial state is \((0, 1) = (F_0, F_1)\) and each iteration we replace a state \((F_{n-k}, F_{n-k+1})\) with \((F_{n-k+1}, F_{n-k+2})\), eventually ending up with \((F_n, F_{n+1})\) and taking the first element of a pair as a result.

The performance of this variation is much better than its predecessor as its space complexity is \(O(\log_2(F_n))\) and time complexity is \(O(\log_2(F_n)n)\).

Can we do better than that? As a matter of fact, we can. First of all, observe that in the last solution, each iteration we replaced a pair of two numbers with another pair of two numbers using elementary arithmetic operations, which is basically a transformation in two dimensional vector space (let us assume \(\mathbb{R}^2\)). In fact, this transformation is linear and its matrix representation is \(T = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}\), because \(T \cdot \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} = \begin{pmatrix} F_{n+1} \\ F_n + F_{n+1} \end{pmatrix} = \begin{pmatrix} F_{n+1} \\ F_{n+2} \end{pmatrix}\).

With this knowledge the formula for Fibonacci numbers can be written in a compact way:

\(F_n = \pi_1\left( T^n \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix} \right)\), where \(\pi_1 : \mathbb{R}^2 \rightarrow \mathbb{R}\), \(\pi_1\left(\begin{pmatrix}v_1 \\ v_2\end{pmatrix}\right) = v_1\).

There are two routes to choose from at this point: we can either use an algorithm for fast exponentiation of matrices (such as square and multiply) to calculate \(T^n\) or try to apply more optimizations. We will go with the latter, because matrix multiplication is still a pretty heavy operation and we can improve the situation by exploiting the structure of \(T\).
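For comparison, the first route (square and multiply on full \(2 \times 2\) matrices) might look roughly like the sketch below. It is not the approach pursued in the rest of this post, and the names `M2`, `mul`, `power` and `fibMatrix` are made up for illustration.

```haskell
-- | A 2x2 integer matrix stored row by row: M2 a b c d = [[a, b], [c, d]].
data M2 = M2 !Integer !Integer !Integer !Integer
  deriving (Eq, Show)

-- | Ordinary matrix product (8 multiplications, 4 additions).
mul :: M2 -> M2 -> M2
mul (M2 a b c d) (M2 e f g h) =
  M2 (a*e + b*g) (a*f + b*h) (c*e + d*g) (c*f + d*h)

-- | Square-and-multiply exponentiation; non-positive exponents give identity.
power :: M2 -> Int -> M2
power m k
  | k <= 0    = M2 1 0 0 1
  | even k    = let h = power m (k `div` 2) in mul h h
  | otherwise = mul m (power m (k - 1))

-- | F_n is the first component of T^n applied to (F_0, F_1) = (0, 1),
--   which is the top-right entry of T^n.
fibMatrix :: Int -> Integer
fibMatrix n = let M2 _ q _ _ = power (M2 0 1 1 1) n in q
```

Each `mul` costs eight big-number multiplications, which is the overhead the optimizations below aim to reduce.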

First, close observation of the powers of \(T\) allows us to notice the following:

**Fact.** \(T^n = \begin{pmatrix} F_{n-1} & F_n \\ F_n & F_{n+1} \end{pmatrix}\) for \(n \in \mathbb{N_+}\).

*Proof.* For \(n = 1\) it is easy to see that \(T = \begin{pmatrix}F_0 & F_1 \\ F_1 & F_2\end{pmatrix}\). Now assume that the fact is true for some \(n \in \mathbb{N_+}\). Then \(T^{n+1} = T^n \cdot T = \begin{pmatrix} F_{n-1} & F_n \\ F_n & F_{n+1} \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} F_n & F_{n-1} + F_n \\ F_{n+1} & F_n + F_{n+1} \end{pmatrix} = \begin{pmatrix} F_n & F_{n+1} \\ F_{n+1} & F_{n+2} \end{pmatrix}\).

**Corollary.** For \(n \in \mathbb{N_+}\) there exist \(p\) and \(q\) such that \(T^n = \begin{pmatrix} p & q \\ q & p+q \end{pmatrix}\).

Equipped with this knowledge we can not only represent \(T^k\) for some \(k \in \mathbb{N_+}\) using only two numbers instead of four, but also derive a transformation of these numbers that corresponds to the computation of \(T^{2k}\).

**Fact.** Let \(p, q \in \mathbb{N}\). Then \(\begin{pmatrix} p & q \\ q & p+q \end{pmatrix}^2 = \begin{pmatrix} p' & q' \\ q' & p'+q' \end{pmatrix}\), where \(p' = p^2 + q^2\) and \(q' = (2p + q)q\).
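The fact can be checked by multiplying the matrix out directly:

\(\begin{pmatrix} p & q \\ q & p+q \end{pmatrix}^2 = \begin{pmatrix} p^2 + q^2 & pq + q(p+q) \\ qp + (p+q)q & q^2 + (p+q)^2 \end{pmatrix} = \begin{pmatrix} p^2 + q^2 & (2p+q)q \\ (2p+q)q & p^2 + 2pq + 2q^2 \end{pmatrix}\)

The bottom-right entry has the required form, since \(p' + q' = (p^2 + q^2) + (2pq + q^2) = p^2 + 2pq + 2q^2\).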

Now, we can put all of these together and construct the final solution:

    fib3 :: Int -> Integer
    fib3 n = go n 0 1 0 1
      where
        go k a b p q
          | k == 0    = a
          | odd k     = go (k - 1) (p*a + q*b) (q*a + (p+q)*b) p q
          | otherwise = go (k `div` 2) a b (p*p + q*q) ((2*p + q)*q)

Let us denote the \(i\)-th bit of \(n\) by \(n_i \in \{0,1\}\), where \(i \in \{0, \dots, \lfloor \log_2(n) \rfloor \}\). We start with \(\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}\) and \(\begin{pmatrix} p \\ q \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \cong T\). Then we traverse the bits of \(n\) and return \(a\). Note that while iterating through \(n_i\):

- \(\begin{pmatrix} p \\ q \end{pmatrix} \cong T^{2^i}\).
- \(\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} F_m \\ F_{m+1} \end{pmatrix}\) for \(m = \displaystyle\sum_{j = 0}^{i-1} n_j \cdot 2^j\).
- If \(n_i = 1\), \(\begin{pmatrix} F_m \\ F_{m+1} \end{pmatrix}\) is replaced with \(T^{2^i} \cdot \begin{pmatrix} F_m \\ F_{m+1} \end{pmatrix} = \begin{pmatrix} F_{m+2^i} \\ F_{m+2^i+1} \end{pmatrix}\).

Hence, in the end \(\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix}\), so the function correctly computes \(n\)-th Fibonacci number.
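To watch these invariants in action, one can record every intermediate state of the loop. The sketch below mirrors the recurrence of `fib3` but returns the whole list of \((k, a, b, p, q)\) tuples; `State` and `fib3Trace` are hypothetical names, and the guard `k <= 0` (instead of `k == 0`) is an added safety net so that negative inputs terminate.

```haskell
-- | One state of the loop: (k, a, b, p, q).
type State = (Int, Integer, Integer, Integer, Integer)

-- | Same recurrence as fib3, but returning every intermediate state.
fib3Trace :: Int -> [State]
fib3Trace n = go (n, 0, 1, 0, 1)
  where
    go s@(k, a, b, p, q)
      | k <= 0    = [s]  -- done: a holds the result
      | odd k     = s : go (k - 1, p*a + q*b, q*a + (p+q)*b, p, q)
      | otherwise = s : go (k `div` 2, a, b, p*p + q*q, (2*p + q)*q)
```

For \(n = 5\) the states are `(5,0,1,0,1)`, `(4,1,1,0,1)`, `(2,1,1,1,1)`, `(1,1,1,2,3)` and `(0,5,8,2,3)`: the final \(a = 5 = F_5\), \(b = 8 = F_6\), and \((p, q) = (2, 3)\) corresponds to \(T^4\).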

The space complexity of this solution is \(O(\log_2(F_n))\), whereas its time complexity is \(O(\log_2(F_n)(\log_2(n)+H(n)))\) with \(H(n)\) being the Hamming weight of \(n\).

Now, let us put all of the implementations together and measure their performance using the criterion library.

    {-# OPTIONS_GHC -Wall #-}
    {-# LANGUAGE BangPatterns #-}
    module Main where

    import Criterion.Main
    import Criterion.Types

    fib1 :: Int -> Integer
    fib1 0 = 0
    fib1 1 = 1
    fib1 n = fib1 (n - 2) + fib1 (n - 1)

    fib2 :: Int -> Integer
    fib2 n = go n 0 1
      where
        go !k !a b
          | k == 0    = a
          | otherwise = go (k - 1) b (a + b)

    fib3 :: Int -> Integer
    fib3 n = go n 0 1 0 1
      where
        go !k !a b !p !q
          | k == 0    = a
          | odd k     = go (k - 1) (p*a + q*b) (q*a + (p+q)*b) p q
          | otherwise = go (k `div` 2) a b (p*p + q*q) ((2*p + q)*q)

    main :: IO ()
    main = defaultMainWith (defaultConfig { timeLimit = 2 }) [
        bgroup "fib1" $ map (benchmark fib1) $ [10, 20] ++ [30..42]
      , bgroup "fib2" $ map (benchmark fib2) $ 10000 : map (100000*) [1..10]
      , bgroup "fib3" $ map (benchmark fib3) $ 1000000 : map (10000000*) ([1..10] ++ [20])
      ]
      where
        benchmark fib n = bench (show n) $ whnf fib n

The above program was compiled with GHC 7.10.2 and run on an Intel Core i7 3770. The HTML report generated by it is available here.

In particular, after substituting the main function with:

    main :: IO ()
    main = defaultMainWith (defaultConfig { timeLimit = 2 }) [
        bgroup "fib3" [benchmark fib3 1000000000]
      ]
      where
        benchmark fib n = bench (show n) $ whnf fib n

we can see that the final implementation is able to calculate the billionth Fibonacci number in a very reasonable time:

    benchmarking fib3/1000000000
    time                 30.82 s    (29.86 s .. 31.97 s)
                         1.000 R²   (0.999 R² .. 1.000 R²)
    mean                 30.34 s    (29.96 s .. 30.56 s)
    std dev              345.1 ms   (0.0 s .. 387.0 ms)
    variance introduced by outliers: 19% (moderately inflated)