# Introduction

About a year ago there was a Reddit post on the Ising model in Haskell. The discussion seems to have fizzled out, but Ising models looked like a perfect fit for Haskell using repa. In the end it turns out that they are not a good fit for repa, at least not using the original formulation. It may turn out that we can do better with Swendsen-Wang or Wolff. But that belongs to another blog post.

We can get some parallelism at a gross level using the Haskell parallel package via a one line change to the sequential code. However, this does not really show off Haskell’s strengths in this area. In any event, it makes a good example of “embarrassingly simple” parallelism in Haskell, of the vector package, and of random number generation using the random-fu package.

The Ising model was (by Stigler’s law) proposed by Lenz in 1920 as a model for ferromagnetism, that is, the magnetism exhibited by bar magnets. The phenomenon of ferromagnetism is so named because it was first observed in iron (Latin ferrum, chemical symbol Fe). It is also exhibited, for example, by rare earths such as gadolinium. Ferromagnetic materials lose their magnetism at a critical temperature: the Curie temperature (named after Pierre Curie, not his more famous wife). This is an example of a phase transition (ice melting into water is a more familiar example).

The Ising model (at least in 2 dimensions) predicts this phase transition and can also be used to describe phase transitions in alloys.

Abstracting the Ising model from its physical origins, one can think of it rather like Conway’s Game of Life: there is a grid and each cell on the grid is updated depending on the state of its neighbours. The difference from the Game of Life is that the updates are not deterministic but random, with the randomness selecting both which cell gets updated and whether it gets updated. Thus we cannot update all the cells in parallel as would happen if we used repa. The reader only interested in this abstraction can go straight to the implementation (after finishing this introduction).

The diagram below shows a 2 dimensional grid of cells. Each cell can either be in an (spin) up state or (spin) down state as indicated by the arrows and corresponding colours. The Ising model then applies a parameterized set of rules by which the grid is updated. For certain parameters the cells remain in a random configuration, that is the net spin (taking up = 1 and down = -1) remains near zero; for other parameters, the spins in the cells line up (not entirely as there is always some randomness). It is this lining up that gives rise to ferromagnetism.

On the other hand, the physics and the Monte Carlo method used to simulate the model are of considerable interest in their own right. Readers interested in the Monte Carlo method can skip the physics and go to Monte Carlo Estimation. Readers interested in the physics can start with the section on Magnetism.

Definitions, theorems etc. are in bold and terminated by $\blacksquare$.

# Magnetism

Following Ziman (Ziman 1964) and Reif (Reif 2009), we assume that each atom in the ferromagnetic material behaves like a small magnet. According to Hund’s rules, we would expect unpaired electrons in the $d$ and $f$ shells, for example in the transition elements and rare earths, and these would supply the magnetic moment. However, the magnetic interaction between atoms is far too small to account for ferromagnetism. For iron, for example, the Curie temperature is 1043K, and the magnetic interaction between atoms falls short of accounting for this by some margin. There is a quantum mechanical effect called the exchange interaction which is a consequence of the Pauli exclusion principle. Two electrons with the same spin on neighbouring atoms cannot get too close to each other. If the electrons have anti-parallel spins then the exclusion principle does not apply and there is no restriction on how close they can get to each other. Thus the electrostatic interaction between electrons on neighbouring atoms depends on whether spins are parallel or anti-parallel. We can thus write the Hamiltonian in the form:

$\displaystyle E = -J\sum_{\langle i, j\rangle} \sigma_i \sigma_j - B\sum_k \sigma_k$

Where

• The notation $\langle i, j\rangle$ means we sum over all the nearest neighbours in the lattice;

• $B$ is the applied magnetic field (note we use E for the Hamiltonian), for all of this article we will assume this to be $0$;

• The range of each index is $1 \ldots N$ where $N = M \times M$ is the total number of atoms;

• And $J$ is the coupling constant expressing the strength of the interaction between neighbouring spins; it depends on the balance between the Pauli exclusion principle and the electrostatic interaction energy of the electrons. It may be positive, corresponding to parallel spins (ferromagnetism, which is the case we consider in this article), or negative, corresponding to anti-parallel spins (antiferromagnetism or ferrimagnetism, which we consider no further).
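To make the Hamiltonian concrete, here is a minimal standalone sketch that evaluates $E$ with $B = 0$ for a periodic grid represented as a list of lists, counting each nearest-neighbour bond once; `naiveEnergy` is an illustrative name and this snippet is independent of the vector-based implementation developed below.

```haskell
-- A direct transcription of E = -J * sum over nearest-neighbour pairs of
-- sigma_i sigma_j, with B = 0 and periodic (toroidal) boundaries.
-- Each bond is counted once by summing only the right and down neighbours.
naiveEnergy :: Double -> [[Int]] -> Double
naiveEnergy j spins = -j * fromIntegral (sum bonds)
  where
    m = length spins
    at r c = (spins !! (r `mod` m)) !! (c `mod` m)
    bonds = [ at r c * at r' c'
            | r <- [0 .. m - 1]
            , c <- [0 .. m - 1]
            , (r', c') <- [(r + 1, c), (r, c + 1)] ]
```

For a $3 \times 3$ torus with all spins up there are $2 \times 9 = 18$ bonds, each contributing $-J$, so `naiveEnergy 1.0 (replicate 3 (replicate 3 1))` evaluates to $-18.0$.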

# Acknowledgments

This post uses the random-fu package for random number generation and has also benefitted from comments by the author of that package (James Cook).

All diagrams were drawn using the Haskell diagrams domain specific language; the inhabitants of #diagrams were extremely helpful in creating these.

Internet sources too numerous to mention were used for the physics and Monte Carlo. Some are listed in the bibliography. Apologies if you recognize something which does not get its just acknowledgement. The advantage of blog posts is that this can easily be remedied by leaving a comment.

Pragmas and imports to which only the over-enthusiastic reader need pay attention.

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}

> {-# LANGUAGE TypeFamilies                  #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> {-# LANGUAGE TypeOperators                 #-}
> {-# LANGUAGE ScopedTypeVariables           #-}

> module Ising (
>          energy
>        , magnetization
>        , McState(..)
>        , thinN
>        , gridSize
>        , nItt
>        , expDv
>        , tCrit
>        , singleUpdate
>        , initState
>        , multiUpdate
>        , temps
>        , initColdGrid
>        , exampleGrid
>        , notAperiodics
>        , main
>        ) where


We put all of our code for the diagrams in this blog post in a separate module to avoid clutter.

> import Diagrams

> import qualified Data.Vector.Unboxed as V
> import qualified Data.Vector.Unboxed.Mutable as M

> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad
> import Control.Monad.State

> import Control.Parallel.Strategies

> import Graphics.Rendering.Chart.Backend.Cairo

> import Data.Array.Repa hiding ( map, (++), zipWith )
> import Data.Array.Repa.Algorithms.Matrix

> import PrettyPrint ()

> gridSize :: Int
> gridSize = 20
>
> exampleGrid :: Int -> V.Vector Int
> exampleGrid gridSize = V.fromList $ f (gridSize * gridSize)
>   where
>     f m = evalState (replicateM m (sample (uniform (0 :: Int) (1 :: Int)) >>=
>                                    \n -> return $ 2*n - 1))
>                     (pureMT 1)

> couplingConstant :: Double
> couplingConstant = 1.0
>
> kB :: Double
> kB = 1.0
>
> mu :: Double
> mu = 1.0


# The Boltzmann Distribution

Statistical physics is an extremely large subject but there needs to be some justification for the use of the Boltzmann distribution. For an excellent introduction to the subject of Statistical Physics (in which the Boltzmann distribution plays a pivotal role) see David Tong’s lecture notes (Tong (2011)).

Suppose we have 3 boxes (we use the more modern nomenclature rather than the somewhat antiquated word urn) and 7 balls and we randomly assign the balls to boxes. Then it is far more likely that we will get an assignment of 2,2,3 rather than 0,0,7. When the numbers of boxes and balls become large (which is the case in statistical physics where we consider e.g. $10^{23}$ atoms) then it becomes very, very, very likely that the balls (or atoms) will spread themselves out evenly as in the example.
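This counting argument is easy to check by brute force. The snippet below is a standalone sketch (the names are mine): it enumerates all $3^7$ ways of assigning 7 distinguishable balls to 3 boxes and counts how many realize each occupancy pattern.

```haskell
import Data.List (sort)

-- All 3^7 = 2187 assignments of 7 distinguishable balls to boxes 0, 1, 2.
assignments :: [[Int]]
assignments = mapM (const [0, 1, 2]) [1 .. 7 :: Int]

-- How many assignments realize a given occupancy pattern?
-- (The order of the boxes is ignored, hence the sorts.)
countOccupancy :: [Int] -> Int
countOccupancy occ =
  length [ a | a <- assignments
             , sort [ length (filter (== b) a) | b <- [0, 1, 2] ] == sort occ ]
```

`countOccupancy [2,2,3]` gives 630 while `countOccupancy [0,0,7]` gives only 3, so the even spread is 210 times more likely.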

Now suppose we have $N$ balls and $M$ boxes and we associate an energy $k$ with the $k$-th box but restrict the total energy to be constant. Thus we have two constraints:

• The total number of balls $N = \sum_1^M n_k$ and

• The total energy $E = \sum_1^M k\, n_k$.

Let us assume the balls are allocated to boxes in the most likely way and let us denote the values in each box for this distribution as $\tilde{n}_k$.

Let us now move 2 balls from box $k$, one to box $k-1$ and one to box $k+1$. Note that both constraints are satisfied by this move. We must therefore have

\displaystyle \begin{aligned} \frac{1}{(\tilde{n}_{k-1} + 1)!}\frac{1}{(\tilde{n}_{k} - 2)!}\frac{1}{(\tilde{n}_{k+1} + 1)!} &\le \frac{1}{\tilde{n}_{k-1}!}\frac{1}{\tilde{n}_{k}!}\frac{1}{\tilde{n}_{k+1}!} \\ \frac{\tilde{n}_k}{\tilde{n}_{k-1} + 1}\frac{\tilde{n}_k - 1}{\tilde{n}_{k+1} + 1} &\le 1 \end{aligned}

And let us start from the most likely distribution and move 1 ball from box $k-1$ and 1 ball from box $k+1$ into box $k$. Again both constraints are satisfied by this move. Doing a similar calculation to the one above we get

\displaystyle \begin{aligned} \frac{1}{(\tilde{n}_{k-1} - 1)!}\frac{1}{(\tilde{n}_{k} + 2)!}\frac{1}{(\tilde{n}_{k+1} - 1)!} &\le \frac{1}{\tilde{n}_{k-1}!}\frac{1}{\tilde{n}_{k}!}\frac{1}{\tilde{n}_{k+1}!} \\ 1 &\le \frac{\tilde{n}_k + 1}{\tilde{n}_{k+1}}\frac{\tilde{n}_{k} + 2}{\tilde{n}_{k-1}} \end{aligned}

From this we deduce that as $\tilde{n}_k \rightarrow \infty$ we have $\tilde{n}_k^2 \approx \tilde{n}_{k-1}\tilde{n}_{k+1}$, or that $\tilde{n}_{k+1} / \tilde{n}_k = B$ for some constant $B$.

Telescoping, we can write $\tilde{n}_k = \tilde{n}_1 B^{k-1} \propto B^k$. Thus the probability of a ball being in box $k$ is

$\displaystyle \mathbb{P}(k) = \frac{B^k}{\sum_{i=1}^M B^i}$

If we now set $B = e^{-\beta}$ and $Z = \sum_{i=1}^M B^i$ then we can re-write our distribution as

$\displaystyle \mathbb{P}(k) = \frac{e^{-\beta k}}{Z}$

It should therefore be plausible that the probability of finding a system with a given energy $E(\sigma)$ in a given state $\sigma$ is given by the Boltzmann distribution

$\displaystyle \mathbb{P}(\sigma) = \frac{\exp(-E(\sigma) / k_B T)}{Z(T)}$

where we have defined the temperature to be $T = 1 / k_B\beta$ with $k_B$ being Boltzmann’s constant and $Z(T)$ is another normalizing constant.
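As a quick numerical sanity check on this derivation, we can compute the box distribution directly; `boltzmannBoxes` is an illustrative name and the values of $\beta$ and $M$ below are arbitrary.

```haskell
-- P(k) = exp (-beta * k) / Z for boxes k = 1..m, where Z normalizes the
-- weights so that the probabilities sum to one.
boltzmannBoxes :: Double -> Int -> [Double]
boltzmannBoxes beta m = [ w k / z | k <- [1 .. m] ]
  where
    w k = exp (negate beta * fromIntegral k)
    z   = sum [ w k | k <- [1 .. m] ]
```

The probabilities sum to one, and the ratio of successive probabilities is the constant $B = e^{-\beta}$, as the telescoping argument requires.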

## Specific Heat

Using the Boltzmann distribution we can calculate the average energy of the system

\displaystyle \begin{aligned} \langle E \rangle &\triangleq \sum_\sigma E(\sigma) \mathbb{P}(\sigma) \\ &= \sum_\sigma E(\sigma) \frac{\exp(-E(\sigma) / k_B T)}{Z} \\ &= -\frac{1}{Z} \frac{\partial Z}{\partial \beta} \\ &= -\frac{\partial}{\partial \beta} \log Z \end{aligned}

where it is understood that $Z$ depends on $T$.

If we let $C_V$ be the specific heat per particle (at constant volume) then

\displaystyle \begin{aligned} NC_V &\triangleq \frac{\partial \langle E\rangle}{\partial T} \\ &= \frac{\partial \langle E\rangle}{\partial \beta} \frac{\partial \beta}{\partial T} \\ &= -\frac{1}{k_B T^2} \frac{\partial \langle E\rangle}{\partial \beta} \\ &= \frac{1}{k_B T^2} \frac{\partial^2 \log Z}{\partial \beta^2} \\ &= -\frac{1}{k_B T^2} \frac{\partial}{\partial \beta} \sum_\sigma E(\sigma) \frac{\exp(-E(\sigma) / k_B T)}{Z} \\ &= -\frac{1}{k_B T^2} \Bigg[\frac{\big(\sum_\sigma E(\sigma) \exp(-E(\sigma) / k_B T)\big)^2}{Z^2} - \frac{\sum_\sigma E^2(\sigma) \exp(-E(\sigma) / k_B T)}{Z}\Bigg] \end{aligned}

We know that

$\displaystyle \Delta E^2 \triangleq \langle (E - \langle E \rangle)^2 \rangle = \langle E^2 \rangle - \langle E \rangle^2$

So we can write

$\displaystyle \Delta E^2 = k_BT^2 N C_V$

so if we can estimate the energy and the square of the energy then we can estimate the specific heat.
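This fluctuation formula translates directly into a sample-based estimator. Here is a minimal standalone sketch (with $k_B = 1$ as elsewhere in this post; the helper name and the sample list in the example are mine), assuming we already have a list of sampled energies:

```haskell
-- Estimate N * C_V from sampled energies via Delta E^2 = k_B T^2 (N C_V):
-- with k_B = 1 this is just the sample variance of E divided by T^2.
heatCapacity :: Double -> [Double] -> Double
heatCapacity t es = (meanE2 - meanE * meanE) / (t * t)
  where
    r      = fromIntegral (length es)
    meanE  = sum es / r
    meanE2 = sum (map (^ 2) es) / r
```

For example `heatCapacity 1.0 [1, -1]` is $1.0$ (variance 1, $T = 1$). In the simulation below the same quantity is assembled from the accumulated energy and squared-energy averages.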

# Monte Carlo Estimation

Although Ising himself developed an analytic solution in 1 dimension and Onsager later developed one in 2 dimensions, no one has (yet) found an analytic solution for 3 dimensions.

One way to determine an approximate answer is to use a Monte Carlo method.

## Uniform Sampling

We could try to sample configurations at random according to the Boltzmann distribution

$\displaystyle \mathbb{P}(\sigma) = \frac{\exp(-E(\sigma) / k_B T)}{Z(T)}$

where $T$ is the temperature, $k_B$ is Boltzmann’s constant, $E$ is the energy of a given state

$\displaystyle E(\sigma) = -J\sum_{\langle i, j\rangle} \sigma_i \sigma_j - B \sum_k \sigma_k$

and $Z(T)$ is a normalizing constant (Z for the German word Zustandssumme, “sum over states”)

$\displaystyle Z(T) = \sum_\sigma \exp(-E(\sigma) / k_B T)$
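For a grid small enough, this sum over states can actually be carried out. The standalone sketch below (a $3 \times 3$ torus with $k_B = J = 1$; the names are mine and independent of the main code) computes $Z(T)$ by exhaustive enumeration:

```haskell
-- Brute-force partition function for a tiny 3x3 periodic grid:
-- Z(T) = sum over all 2^9 = 512 configurations of exp (-E(sigma) / T),
-- counting each nearest-neighbour bond once (k_B = J = 1).
bruteForceZ :: Double -> Double
bruteForceZ t = sum [ exp (negate (gridEnergy s) / t) | s <- states ]
  where
    m = 3 :: Int
    states = mapM (const [-1, 1]) [1 .. m * m]
    at s r c = s !! ((r `mod` m) * m + (c `mod` m))
    gridEnergy s = negate . fromIntegral . sum $
      [ at s r c * at s r' c'
      | r <- [0 .. m - 1], c <- [0 .. m - 1]
      , (r', c') <- [(r + 1, c), (r, c + 1)] ]
```

Even here the sum has 512 terms; for the $20 \times 20$ grid used below it would have $2^{400}$, which is why we need a cleverer method.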

We can evaluate the energy for one state easily enough as the Haskell below demonstrates. Note that we use so-called periodic boundary conditions which means our grid is actually a torus with no boundaries. In other words, we wrap the top of the grid on to the bottom and the left of the grid on to the right.

[As an aside, we represent each state by a Vector of Int. No doubt more efficient representations can be implemented. We also have to calculate offsets into the vector given a point’s grid co-ordinates.]

> energy :: (Fractional a, Integral b, M.Unbox b) => a -> V.Vector b -> a
> energy j v = -0.5 * j * (fromIntegral $ V.sum energyAux)
>   where
>
>     energyAux = V.generate l f
>
>     l = V.length v
>
>     f m = c * d
>       where
>         i = m `mod` gridSize
>         j = (m `mod` (gridSize * gridSize)) `div` gridSize
>
>         c = v V.! jc
>         jc = gridSize * i + j
>
>         d = n + e + s + w
>
>         n = v V.! jn
>         e = v V.! je
>         s = v V.! js
>         w = v V.! jw
>
>         jn = gridSize * ((i + 1) `mod` gridSize) + j
>         js = gridSize * ((i - 1) `mod` gridSize) + j
>         je = gridSize * i + ((j + 1) `mod` gridSize)
>         jw = gridSize * i + ((j - 1) `mod` gridSize)

But what about the normalizing constant $Z$? Even for a modest grid size, say $10 \times 10$, the number of states that needs to be summed over is extremely large: $2^{10 \times 10}$.

Instead of summing the entire state space, we could draw $R$ random samples $(\sigma^{(i)})_{0 \le i < R}$ uniformly from the state space. We could then use

$\displaystyle Z_R \triangleq \sum_{i=0}^{R-1} \exp\big(-\beta E(\sigma^{(i)})\big)$

to estimate e.g. the magnetization

$\displaystyle \langle M \rangle = \sum_\sigma M(\sigma) \frac{\exp(-\beta E(\sigma))}{Z(T)}$

by

$\displaystyle \widehat{\langle M \rangle} = \sum_{i=0}^{R-1} M(\sigma^{(i)}) \frac{\exp(-\beta E(\sigma^{(i)}))}{Z_R}$

However, we know from statistical physics that systems with large numbers of particles will occupy only a small portion of the state space with any significant probability. And according to (MacKay 2003, chap. 29), a high dimensional distribution is often concentrated on a small region of the state space known as its typical set $T$, whose volume is given by $|T| \approx 2^H$, where $H$ is the (Shannon) entropy of the (Boltzmann) distribution, which for ease of exposition we temporarily denote by $P$.

$\displaystyle H = -\sum_\sigma P(\sigma)\log_2(P(\sigma))$

If almost all the probability mass is located in $T$ then the actual value of the (mean) magnetization will be determined by the values that $M$ takes on that set.
So uniform sampling will only give a good estimate if we make $R$ large enough that we hit $T$ at least a small number of times. The total size of the state space is $2^N$ and $|T| \approx 2^H$, so there is a probability of $2^H / 2^N$ of hitting $T$. Thus we need roughly $2^{N - H}$ samples to hit $T$.

At high temperatures, the Boltzmann distribution flattens out, so roughly all of the states have an equal likelihood of being occupied. We can calculate the (Shannon) entropy for this:

$\displaystyle H \approx \sum_\sigma \frac{1}{2^N}\log_2 2^N = N$

We can do a bit better than this. At high temperatures, $\beta J \ll 1$ and, taking $B = 0$ (as we have been assuming all along),

\displaystyle \begin{aligned} Z &= \sum_\sigma \exp\big(-\beta E(\sigma)\big) \\ &= \sum_\sigma \exp\bigg(\beta J \sum_{\langle i, j\rangle} \sigma_i \sigma_j\bigg) \\ &= \sum_\sigma \prod_{\langle i, j\rangle} \exp(\beta J \sigma_i \sigma_j) \end{aligned}

By definition

\displaystyle \begin{aligned} \cosh x &= \frac{e^x + e^{-x}}{2} \\ \sinh x &= \frac{e^x - e^{-x}}{2} \end{aligned}

Thus, since $\sigma_i\sigma_j = \pm 1$,

\displaystyle \begin{aligned} \exp(\beta J \sigma_i \sigma_j) &= \cosh\beta J + \sigma_i\sigma_j\sinh \beta J \\ &= \cosh\beta J (1 + \sigma_i\sigma_j\tanh \beta J) \end{aligned}

We can therefore re-write the partition function as

\displaystyle \begin{aligned} Z &= \sum_\sigma \prod_{\langle i, j\rangle} \cosh\beta J (1 + \sigma_i\sigma_j\tanh \beta J) \\ &= (\cosh \beta J)^{4N/2} \sum_\sigma \prod_{\langle i, j\rangle} (1 + \sigma_i\sigma_j\tanh \beta J) \end{aligned}

where the factor 2 is because we count the exchange interaction twice for each pair of atoms and the factor 4 is because each atom is assumed to interact with only 4 neighbours. Since at high temperatures, by assumption, we have $\beta J \ll 1$, then $\tanh \beta J \ll 1$ also.
Thus

$\displaystyle Z \approx (\cosh \beta J)^{2N}\sum_\sigma 1 = 2^N(\cosh \beta J)^{2N}$

Calculating the free energy (using $\ln \cosh x \approx x^2 / 2$ for small $x$)

\displaystyle \begin{aligned} F &= -k_B T \ln Z \\ &= -k_B T N \ln 2 - k_B T 2 N \ln (\cosh \beta J) \\ &\approx -k_B T N \ln 2 - k_B T N (\beta J)^2 \\ &= -k_B T N \ln 2 - N\frac{J^2}{k_B T} \end{aligned}

From this we can determine the (Boltzmann) entropy

$\displaystyle S = -\frac{\partial F}{\partial T} = k_B N (\ln 2 - (\beta J)^2) \approx k_B N \ln 2$

which agrees with our rather hand-wavy derivation of the (Shannon) entropy at high temperatures.

At low temperatures the story is quite different. We can calculate the ground state energy, where all the spins are in the same direction:

$\displaystyle E_0 = -J\sum_{\langle i,j\rangle}\sigma_i\sigma_j = -4JN / 2 = -2JN$

And we can assume that at low temperatures flipped spins are isolated from other flipped spins. The energy of an atom in the ground state is $-4J$ and thus the energy change if it flips is $8J$. Thus the energy at low temperatures is

$\displaystyle E_1 \approx E_0 + 8J\sum_i (\sigma_i + 1) / 2$

The partition function is (the factor 2 arising because we can expand around either of the two ground states, all spins up or all spins down)

\displaystyle \begin{aligned} Z &= 2\sum_\sigma \exp\bigg(-\beta \bigg(E_0 + 8J\sum_i (\sigma_i + 1) /2\bigg)\bigg) \\ &= 2\exp(-\beta E_0) \prod_i \sum_{\sigma_i = \pm 1} \exp(-8\beta J(\sigma_i + 1) / 2) \\ &= 2\exp(-\beta E_0) (1 + \exp(-8\beta J))^N \end{aligned}

Again we can calculate the free energy, and since we are doing this calculation to get a rough estimate of the entropy we can approximate further (using $\ln(1 + x) \approx x$):

\displaystyle \begin{aligned} F &= -k_B T \ln Z \\ &= -k_B T \ln 2 + E_0 - k_B T N \ln (1 + \exp(-8\beta J)) \\ &\approx -k_B T \ln 2 + E_0 - k_B T N \exp(-8\beta J) \end{aligned}

From this we can determine the (Boltzmann) entropy.
Using the fact that $\beta = 1 / k_B T$ and thus that

$\displaystyle \frac{\partial \beta}{\partial T} = - \frac{1}{k_B T^2} = -k_B\beta^2$

we have

\displaystyle \begin{aligned} S &= - \frac{\partial F}{\partial T} \\ &= k_B \ln 2 + k_B N \frac{\partial}{\partial T}\big(T \exp(-8\beta J)\big) \\ &= k_B \ln 2 + k_B N \exp(-8\beta J) + k_B N T \frac{\partial \beta}{\partial T}\frac{\partial}{\partial \beta}\exp(-8\beta J) \\ &= k_B \ln 2 + k_B N \exp(-8\beta J) + k_B N T (-k_B\beta^2)(-8J)\exp(-8\beta J) \\ &= k_B\big(\ln 2 + N\exp(-8\beta J) + 8JN\beta\exp(-8\beta J)\big) \end{aligned}

The critical temperature (which we shall obtain by simulation) given by Onsager’s exact result in 2d is

$\displaystyle \frac{1}{\beta} = k_BT_{\text{Crit}} = \frac{2J}{\ln(\sqrt{2}+1)}$

Taking $J = 1$

> tCrit :: Double
> tCrit = 2.0 * j / log (1.0 + sqrt 2.0)
>   where
>     j = 1

    ghci> tCrit
    2.269185314213022

Plugging this in we get

    ghci> exp (-8.0/tCrit) * (1.0 + 8.0 / tCrit)
    0.1332181153896559

Thus the (Shannon) entropy is about $0.13N$ at the interesting temperature and about $N$ at high temperatures. So uniform sampling would require $\sim 2^{(N - N)} = 1$ samples at high temperatures but $\sim 2^{(N - 0.13N)} \approx 2^{0.87N}$ at temperatures of interest. Even for our modest $10 \times 10$ grid this is around $10^{26}$ samples!

Fortunately, Metropolis and his team (Metropolis et al. 1953) discovered a way of constructing a Markov chain whose limiting distribution is the distribution required, which does not require the evaluation of the partition function and which converges in a reasonable time (although theoretical results substantiating this latter point seem to be hard to come by).
# Markov Chains

Markov first studied the stochastic processes that came to be named after him in 1906. We follow (Norris 1998), (Beichl and Sullivan 2000), (Diaconis and Saloff-Coste 1995), (Chib and Greenberg 1995), (Levin, Peres, and Wilmer 2008) and (Gravner 2011).

As usual we work on a probability measure space $(\Omega, {\mathbb F}, \mathbb{P})$ (that is, $\mathbb{P}(\Omega) = 1$). Although this may not be much in evidence, it is there lurking behind the scenes.

Let $S$ be a finite set. In the case of an Ising model with $N$ cells, this set will contain $2^N$ elements (all possible configurations).

A **Markov chain** is a discrete time stochastic process $X_0, X_1, \ldots$ such that

$\displaystyle \mathbb{P} (X_{n+1} = j \,|\, X_0 = i_0, X_1 = i_1, \dots, X_n = i) = \mathbb{P} (X_{n+1} = j \,|\, X_n = i)$

$\blacksquare$

That is, where a Markov chain goes next depends only on where it is, not on its history.

A **stochastic transition matrix** is a matrix $P = \{ p_{ij} : i, j \in S \}$ such that

$\displaystyle \sum_{j \in S} p_{ij} = 1 \, \forall i \in S$

$\blacksquare$

We can describe a Markov chain by its transition matrix $P$ and initial distribution $\pi_0(i) = \mathbb{P} (X_0 = i)$. In this case we say the stochastic process $(X_n)_{n \ge 0}$ is Markov $(\pi_0, P)$.

We need to be able to discuss properties of Markov chains such as stationarity, irreducibility, recurrence and ergodicity.

## Stationarity

A Markov chain has a **stationary distribution** $\pi(i)$ if

$\displaystyle \sum_{i \in S} \pi(i) p_{ij} = \pi(j)$

$\blacksquare$

One question one might ask is whether a given Markov chain has such a distribution. For example, any distribution is a stationary distribution for the unit transition matrix, that is, $\pi P = \pi$ for any $\pi$:

$\displaystyle \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

The $n$-th transition matrix of a Markov chain is $P^n$.
The corresponding matrix entries are

$\displaystyle p^{(n)}_{ij} = (P^n)_{ij}$

Another key question is, if there is a unique stationary distribution, whether the $n$-step transition probabilities converge to that distribution, that is, when does $p^{(n)}_{ij} \rightarrow \pi(j)$ as $n \rightarrow \infty$?

## Irreducibility

Write ${\mathbb P}_i(A) = {\mathbb P}(A \, | \, X_0 = i)$. We say that $i$ **leads to** $j$ and write $i \rightarrow j$ if

$\displaystyle {\mathbb P}_i(X_n = j \, \text{eventually}) > 0$

**Theorem** For distinct states $i$ and $j$, the following are equivalent:

• $i \rightarrow j$

• $p_{i_0i_1} p_{i_1i_2} \ldots p_{i_{n-1}i_n} > 0$ for some states $i_0 = i, i_1, \ldots, i_n = j$

• $p_{ij}^{(n)} > 0$ for some $n \ge 0$

$\blacksquare$

This makes it clear that $\rightarrow$ is transitive and reflexive, so the symmetrised relation $i \leftrightarrow j$ (meaning $i \rightarrow j$ and $j \rightarrow i$) is an equivalence relation. We can therefore partition the states into (communicating) classes. If there is only one class then the chain is called **irreducible**. For example,

$\displaystyle \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 0 & \frac{1}{4} & \frac{3}{4} \\ 0 & 0 & 0 & 1 \end{bmatrix}$

has classes $\{1, 2\}$, $\{3\}$ and $\{4\}$ so is not irreducible. On the other hand

$\displaystyle \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix}$

is irreducible.

## Recurrence

Let $(X_n)_{n \ge 0}$ be a Markov chain. A state $i$ is **recurrent** if

$\displaystyle {\mathbb P}_i (X_n = i \, \text{infinitely often}) = 1$

The **first passage time** is defined as

$\displaystyle T_j = \inf_{n \ge 1} \{X_n = j\}$

Note that the $\inf$ is taken over $n \ge 1$, not $n \ge 0$. Incidentally the first passage time is a stopping time, but any discussion of that would make this already long article even longer. The expectation of the $i$-th first passage time starting from $i$ is denoted $m_i = {\mathbb E}_i(T_i)$.
**Theorem** Let $P$ be irreducible. Then the following are equivalent:

• Every state is positive recurrent.

• Some state is positive recurrent.

• $P$ has an invariant distribution $\pi$, and in this case $\pi(i) = 1 / m_i$.

$\blacksquare$

A state $i$ is **aperiodic** if $p^{(n)}_{ii} > 0$ for all sufficiently large $n$. For example, the chain with this transition matrix is not aperiodic (it is periodic with period 2):

$\displaystyle \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$

as running the following program segment shows, with the chain flip-flopping between the two states.

> notAperiodic0 :: Array U DIM2 Double
> notAperiodic0 = fromListUnboxed (Z :. (2 :: Int) :. (2 :: Int))
>                                 ([0,1,1,0] :: [Double])
>
> notAperiodics :: [Array U DIM2 Double]
> notAperiodics = scanl mmultS notAperiodic0 (replicate 4 notAperiodic0)

    ghci> import Ising
    ghci> import PrettyPrint
    ghci> import Text.PrettyPrint.HughesPJClass
    ghci> pPrint notAperiodics
    [[0.0, 1.0]
     [1.0, 0.0],
     [1.0, 0.0]
     [0.0, 1.0],
     [0.0, 1.0]
     [1.0, 0.0],
     [1.0, 0.0]
     [0.0, 1.0],
     [0.0, 1.0]
     [1.0, 0.0]]

**Theorem** Let $P$ be irreducible and aperiodic and suppose that $P$ has an invariant distribution $\pi$. Let $\pi_0$ be any distribution on the state space. Suppose that $(X_n)_{n \ge 0}$ is Markov $(\pi_0, P)$. Then

• $\mathbb{P}(X_n = j) \rightarrow \pi(j)$ as $n \rightarrow \infty$ for all $j$

$\blacksquare$

A proof of this theorem uses coupling, developed by Doeblin; see (Gravner 2011) for more details.

**Corollary** With the conditions of the preceding Theorem,

• $p_{ij}^{(n)} \rightarrow \pi(j)$ as $n \rightarrow \infty$ for all $i, j$

$\blacksquare$

If the state space is infinite, the existence of a stationary distribution is not guaranteed even if the Markov chain is irreducible; see (Srikant 2009) for more details.
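The convergence in the corollary can be watched numerically. The standalone sketch below (plain lists rather than repa; the names are mine) iterates $\pi_{n+1} = \pi_n P$ for the irreducible $3 \times 3$ chain given earlier; whatever the starting distribution, it settles to the unique stationary distribution.

```haskell
import Data.List (transpose)

-- One step of pi |-> pi P for the irreducible 3x3 chain from the
-- Irreducibility section.
step :: [Double] -> [Double]
step piN = [ sum (zipWith (*) piN col) | col <- transpose p ]
  where
    p = [ [1/2, 1/2, 0  ]
        , [1/2, 1/4, 1/4]
        , [0  , 1/3, 2/3] ]

-- Iterating from a point mass; 100 steps is ample for convergence here.
stationary :: [Double]
stationary = iterate step [1, 0, 0] !! 100
```

The result is approximately $[4/11, 4/11, 3/11]$, and one can check that applying `step` once more leaves it (numerically) unchanged.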
## Detailed Balance

A stochastic matrix $P$ and a distribution $\pi$ are said to be in **detailed balance** if

$\displaystyle \pi(i) p_{ij} = \pi(j) p_{ji}$

$\blacksquare$

**Theorem** If a stochastic matrix $P$ and a distribution $\pi$ are in detailed balance then $\pi$ is a stationary distribution.

$\blacksquare$

## The Ergodic Theorem

Define the number of visits to $i$ strictly before time $n$ as

$\displaystyle V_i(n) \triangleq \sum_{k = 0}^{n - 1}{\mathcal I}_{\{X_k = i\}}$

so that $V_i(n) / n$ is the proportion of time before $n$ spent in state $i$.

**Theorem** Let $(X_n)_{n \ge 0}$ be an irreducible Markov chain. Then

$\displaystyle {\mathbb P} \Bigg(\frac{V_i(n)}{n} \rightarrow \frac{1}{m_i} \, \text{as} \, n \rightarrow \infty \Bigg) = 1$

If further the chain is positive recurrent then for any bounded function $f : S \rightarrow {\mathbb R}$

$\displaystyle {\mathbb P} \Bigg(\frac{1}{n} \sum_{k = 0}^{n - 1}f(X_k)\rightarrow \hat{f} \, \text{as} \, n \rightarrow \infty \Bigg) = 1$

where $\hat{f} = \sum_{i \in S} \pi(i)f(i)$. If further still the chain is aperiodic then it has a unique stationary distribution, which is also the limiting distribution.

$\blacksquare$

A Markov chain satisfying all three conditions is called **ergodic**.

## The Metropolis Algorithm

Thus if we can find a Markov chain with the required stationary distribution and we sample a function of this chain, we will get an estimate for the average value of the function. What Metropolis and his colleagues did was to provide a method of producing such a chain.

**Algorithm** Let $\pi$ be a probability distribution on the state space $\Omega$ with $\pi(i) > 0$ for all $i$ and let $(Q, \pi_0)$ be an ergodic Markov chain on $\Omega$ with transition probabilities $q(i,j) > 0$ (the latter condition is slightly stronger than it need be but we will not need fully general conditions).
Create a new (ergodic) Markov chain with transition probabilities

$\displaystyle p_{ij} = \begin{cases} q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\ 1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(k) q(k,i)}{\pi(i) q(i,k)} \land 1 \bigg] & \text{if } j = i \end{cases}$

where $\land$ takes the minimum of its arguments.

Calculate the value of interest on the state space, e.g. the total magnetization, for each step produced by this new chain. Repeat a sufficiently large number of times and take the average. This gives the estimate of the value of interest.

$\blacksquare$

Let us first note that the Markov chain produced by this algorithm almost trivially satisfies the detailed balance condition, for example,

\displaystyle \begin{aligned} \pi(i) q(i,j)\bigg[\frac{\pi(j) q(j, i)}{\pi(i)q(i,j)} \land 1\bigg] &= \pi(i)q(i,j) \land \pi(j)q(j,i) \\ &= \pi(j)q(j,i)\bigg[\frac{\pi(i) q(i, j)}{\pi(j)q(j,i)} \land 1\bigg] \end{aligned}

Secondly, since we have specified that $(Q, \pi_0)$ is ergodic, clearly $(P, \pi_0)$ is also ergodic (all the transition probabilities are $> 0$). So we know the algorithm will converge to the unique distribution we specified and provide estimates of the values of interest.

Two techniques that seem to be widespread in practical applications are burn in and thinning. Although neither has strong theoretical justification (“a thousand lemmings can’t be wrong”), we follow these practices in our implementation.

• “Burn in” means running the chain for a certain number of iterations before sampling, to allow it to forget the initial distribution.

• “Thinning” means sampling the chain every $n$ iterations rather than every iteration, to reduce autocorrelation.

## Haskell Implementation

Calculating the total magnetization is trivial; we just add up all the spins and multiply by the magnetic moment of the electron.

> magnetization :: (Num a, Integral b, M.Unbox b) => a -> V.Vector b -> a
> magnetization mu = (mu *) .
>   fromIntegral . V.sum

We keep the state of the Monte Carlo simulation in a record.

> data McState = McState { mcMagnetization :: !Double
>                        , mcMAvg          :: !Double
>                        , mcEnergy        :: !Double
>                        , mcEAvg          :: !Double
>                        , mcEAvg2         :: !Double
>                        , mcCount         :: !Int
>                        , mcNumSamples    :: !Int
>                        , mcGrid          :: !(V.Vector Int)
>                        }
>   deriving Show

As discussed above we sample every `thinN` iterations, a technique known as “thinning”.

> thinN :: Int
> thinN = 100

The total number of iterations per Monte Carlo run.

> nItt :: Int
> nItt = 1000000

There are only a very limited number of energy changes that can occur on each spin flip. Rather than recalculate the value of the Boltzmann distribution for every spin flip, we can store these in a Vector. For example, if a spin is up and all its surrounding spins are up and it flips to down, then the energy change is $8J$ and the corresponding value of the Boltzmann distribution is $\exp(-8J\beta)$. As another example, if the energy before is $-2J$ and the energy after is $2J$, then the energy change is $4J$.

> expDv :: Double -> Double -> Double -> V.Vector Double
> expDv kB j t = V.generate 9 f
>   where
>     f n | odd n = 0.0
>     f n = exp (j * ((fromIntegral (8 - n)) - 4.0) * 2.0 / (kB * t))

The most important function is the single step update of the Markov chain. We take an Int representing the amount of thinning we wish to perform, the vector of pre-calculated values of the Boltzmann distribution, the current state, a value representing the randomly chosen co-ordinates of the grid element that will be updated, and a value sampled from the uniform distribution which will decide whether the spin at those co-ordinates gets updated.
> singleUpdate :: Int -> V.Vector Double -> McState -> (Int, Int, Double) -> McState
> singleUpdate thinN expDvT u (i, j, r) =
>   McState { mcMagnetization = magNew
>           , mcMAvg          = mcMAvgNew
>           , mcEnergy        = enNew
>           , mcEAvg          = mcEAvgNew
>           , mcEAvg2         = mcEAvg2New
>           , mcCount         = mcCount u + 1
>           , mcNumSamples    = mcNumSamplesNew
>           , mcGrid          = gridNew
>           }
>   where
>
>     (gridNew, magNew, enNew) =
>       if p > r
>       then ( V.modify (\v -> M.write v jc (-c)) v
>            , magOld - fromIntegral (2 * c)
>            , enOld + couplingConstant * fromIntegral (2 * c * d)
>            )
>       else (v, magOld, enOld)
>
>     magOld = mcMagnetization u
>     enOld  = mcEnergy u
>
>     (mcMAvgNew, mcEAvgNew, mcEAvg2New, mcNumSamplesNew) =
>       if (mcCount u) `mod` thinN == 0
>       then ( mcMAvgOld + magNew
>            , mcEAvgOld + enNew
>            , mcEAvg2Old + enNew2
>            , mcNumSamplesOld + 1
>            )
>       else (mcMAvgOld, mcEAvgOld, mcEAvg2Old, mcNumSamplesOld)
>
>     enNew2 = enNew * enNew
>
>     mcMAvgOld       = mcMAvg u
>     mcEAvgOld       = mcEAvg u
>     mcEAvg2Old      = mcEAvg2 u
>     mcNumSamplesOld = mcNumSamples u
>
>     v = mcGrid u
>
>     p = expDvT V.! (4 + c * d)
>
>     c = v V.! jc
>     jc = gridSize * i + j
>
>     d = n + e + s + w
>
>     n = v V.! jn
>     e = v V.! je
>     s = v V.! js
>     w = v V.! jw
>
>     jn = gridSize * ((i + 1) `mod` gridSize) + j
>     js = gridSize * ((i - 1) `mod` gridSize) + j
>     je = gridSize * i + ((j + 1) `mod` gridSize)
>     jw = gridSize * i + ((j - 1) `mod` gridSize)

In order to drive our Markov chain we need a supply of random positions and samples from the uniform distribution.

> randomUpdates :: Int -> V.Vector (Int, Int, Double)
> randomUpdates m =
>   V.fromList $
>   evalState (replicateM m x)
>   (pureMT 1)
>   where
>     x = do r <- sample (uniform (0 :: Int)    (gridSize - 1))
>            c <- sample (uniform (0 :: Int)    (gridSize - 1))
>            v <- sample (uniform (0 :: Double)            1.0)
>            return (r, c, v)
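The update rule is perhaps easiest to see end to end in a few lines of ordinary Python (a standalone sketch, not the post's code: the grid representation, coupling constant and temperature are chosen arbitrarily, and $k_B$ is taken to be 1):

```python
import math
import random

def metropolis_sweep(grid, n, j, t, steps, rng):
    """Flip randomly chosen spins on an n x n periodic grid of +/-1 values.

    The Boltzmann factors exp(-2*j*c*d / t) are precomputed for the nine
    possible values of the index 4 + c*d, as expDv does above (kB = 1)."""
    exp_dv = [math.exp(j * ((8 - k) - 4.0) * 2.0 / t) for k in range(9)]
    for _ in range(steps):
        i, jj = rng.randrange(n), rng.randrange(n)
        c = grid[i][jj]
        # Sum of the four neighbouring spins with periodic boundaries
        d = (grid[(i + 1) % n][jj] + grid[(i - 1) % n][jj]
             + grid[i][(jj + 1) % n] + grid[i][(jj - 1) % n])
        # Accept the flip when the Boltzmann factor beats a uniform sample;
        # factors > 1 (energy-lowering flips) are always accepted
        if exp_dv[4 + c * d] > rng.random():
            grid[i][jj] = -c
    return grid
```

Starting from a cold grid at a low temperature, most spins stay down, exactly as in the Haskell version.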


To get things going, we need an initial state. We start with a cold grid, that is, one with all spins pointing down.

> initColdGrid :: V.Vector Int
> initColdGrid = V.fromList $ replicate (gridSize * gridSize) (-1)
>
> initState :: McState
> initState = McState { mcMagnetization = magnetization mu initColdGrid
>                     , mcMAvg          = 0.0
>                     , mcEnergy        = energy couplingConstant initColdGrid
>                     , mcEAvg          = 0.0
>                     , mcEAvg2         = 0.0
>                     , mcCount         = 0
>                     , mcNumSamples    = 0
>                     , mcGrid          = initColdGrid
>                     }


We will want to run the simulation over a range of temperatures in order to determine where the phase transition occurs.

> temps :: [Double]
> temps = getTemps 4.0 0.5 100
>   where
>     getTemps :: Double -> Double -> Int -> [Double]
>     getTemps h l n = [ m * x + c |
>                        w <- [1..n],
>                        let x = fromIntegral w ]
>       where
>         m = (h - l) / (fromIntegral n - 1)
>         c = l - m


Now we can run a chain at a given temperature.

> multiUpdate :: McState -> Double -> V.Vector (Int, Int, Double) -> McState
> multiUpdate s t = V.foldl (singleUpdate thinN (expDv kB couplingConstant t)) s


For example, running the model at a temperature of $3.0$ for 100, 1000 and 10,000 steps respectively shows disorder growing. On the other hand, running the model at a temperature of $2.0$ shows very limited disorder. For any given state we can extract the magnetization, the energy and the square of the energy.

> magFn :: McState -> Double
> magFn s = abs (mcMAvg s / (fromIntegral $ mcNumSamples s))
>
> magFnNorm :: McState -> Double
> magFnNorm s = magFn s / (fromIntegral (gridSize * gridSize))
>
> enFn :: McState -> Double
> enFn s  = mcEAvg s / (fromIntegral $ mcNumSamples s)
>
> enFnNorm :: McState -> Double
> enFnNorm s = enFn s / (fromIntegral (gridSize * gridSize))


> en2Fn :: McState -> Double
> en2Fn s = mcEAvg2 s / (fromIntegral $ mcNumSamples s)


And we can also calculate the mean square error.

> meanSqErr :: McState -> Double
> meanSqErr s = e2 - e*e
>   where
>     e = enFn s
>     e2 = en2Fn s
>
> meanSqErrNorm :: McState -> Double
> meanSqErrNorm s = meanSqErr s / (fromIntegral (gridSize * gridSize))
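The quantity being accumulated is just the variance of the sampled energies. A small Python check with made-up samples (not the simulation's data):

```python
def mean_sq_err(samples):
    """<E^2> - <E>^2, mirroring how meanSqErr divides the accumulated
    sums mcEAvg and mcEAvg2 by mcNumSamples."""
    n = len(samples)
    e = sum(samples) / n
    e2 = sum(s * s for s in samples) / n
    return e2 - e * e
```

For samples 1 and 3 the mean is 2, the mean square is 5, and the variance is 1.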


Finally we can run our simulation in parallel using parMap rather than map (this is the one line change required to get parallelism).

> main :: IO ()
> main = do print "Start"
>
>           let rs = parMap rpar f temps
>                     where
>                       f t = multiUpdate initState t (randomUpdates nItt)
>
>           renderableToFile (FileOptions (500, 500) PNG)
>                            (errChart temps rs magFnNorm enFnNorm meanSqErrNorm)
>                           "diagrams/MagnetismAndEnergy.png"
>
>           print "Done"


## Performance and Parallelism

ghc -O2 Ising.lhs -threaded -o Ising -package-db=.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d/ -main-is Ising
time ./Ising +RTS -N1
"Start"
"Done"

real	0m14.879s
user	0m14.508s
sys	0m0.369s

time ./Ising +RTS -N2
"Start"
"Done"

real	0m8.269s
user	0m15.521s
sys	0m0.389s

time ./Ising +RTS -N4
"Start"
"Done"

real	0m5.444s
user	0m19.386s
sys	0m0.414s

## Bibliography and Resources

Beichl, Isabel, and Francis Sullivan. 2000. “The Metropolis Algorithm.” Computing in Science and Engineering 2 (1): 65–69. Piscataway, NJ, USA: IEEE Educational Activities Department. doi:10.1109/5992.814660.

Chib, Siddhartha, and Edward Greenberg. 1995. “Understanding the Metropolis-Hastings Algorithm.” The American Statistician 49 (4) (November): 327–335. http://www.jstor.org/stable/2684568.

Diaconis, Persi, and Laurent Saloff-Coste. 1995. “What Do We Know About the Metropolis Algorithm?” In Proceedings of the Twenty-seventh Annual ACM Symposium on Theory of Computing, 112–129. STOC ’95. New York, NY, USA: ACM. doi:10.1145/225058.225095. http://doi.acm.org/10.1145/225058.225095.

Gravner, Janko. 2011. “MAT135A Probability.” https://www.math.ucdavis.edu/~gravner/MAT135A/resources/lecturenotes.pdf.

Levin, David A., Yuval Peres, and Elizabeth L. Wilmer. 2008. Markov Chains and Mixing Times. American Mathematical Society. http://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf.

MacKay, David J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press. http://www.cambridge.org/0521642981.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21: 1087–1092.

Norris, James R. 1998. Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Reif, F. 2009. Fundamentals of Statistical and Thermal Physics. McGraw-Hill Series in Fundamentals of Physics. Waveland Press, Incorporated. http://books.google.co.uk/books?id=gYpBPgAACAAJ.

Srikant, R. 2009. “ECE534 Random Processes.” http://www.ifp.illinois.edu/~srikant/ECE534/Spring09/DTMC.pdf.

Tong, David. 2011. “Lectures on Statistical Physics.” http://www.damtp.cam.ac.uk/user/tong/statphys.html.

Ziman, J.M. 1964. Principles of the Theory of Solids. London, England: Cambridge University Press.

## Parking in Westminster: An Analysis in Haskell

I had a fun weekend analysing car parking data in Westminster at the Future Cities Hackathon along with

• Amit Nandi
• Jackie Steinitz
• Ian Ozsvald
• Mateusz Łapsa-Malawski

Apparently, in the world of car parking, where Westminster leads the rest of the UK follows. For example, Westminster is rolling out individual parking bay monitors.

Our analysis gained an honourable mention. Ian has produced a great write-up of our analysis with fine watercolour maps and Bart’s time-lapse video of parking behaviour.

We mainly used Python, Pandas and Excel for the actual analysis and QGIS for the maps.

I thought it would be an interesting exercise to recreate some of the analysis in Haskell.

First some pragmas and imports.

> {-# OPTIONS_GHC -Wall                    #-}
> {-# OPTIONS_GHC -fno-warn-orphans        #-}
>
> {-# LANGUAGE ScopedTypeVariables         #-}
> {-# LANGUAGE ViewPatterns                #-}
> {-# LANGUAGE DeriveTraversable           #-}
> {-# LANGUAGE DeriveFoldable              #-}
> {-# LANGUAGE DeriveFunctor               #-}
>
> module WardsOfLondon ( parkingDia ) where
>
> import Database.Shapefile
>
> import Data.Binary.Get
> import qualified Data.ByteString.Lazy as BL
> import qualified Data.ByteString as B
> import Data.Binary.IEEE754
> import Data.Csv hiding ( decode, lookup )
> import Data.Csv.Streaming
> import qualified Data.Vector as V
> import Data.Time
> import qualified Data.Text as T
> import Data.Char
> import qualified Data.Map.Strict as Map
> import Data.Int( Int64 )
> import Data.Maybe ( fromJust, isJust )
> import Data.List ( unfoldr )
>
> import Control.Applicative
>
> import Diagrams.Prelude
> import Diagrams.Backend.Cairo.CmdLine
>
> import System.FilePath
> import System.Directory
> import System.Locale
> import System.IO.Unsafe ( unsafePerformIO )
>
> import Data.Traversable ( Traversable )
> import qualified Data.Traversable as Tr
> import Data.Foldable ( Foldable )


A type synonym to make some of our type signatures a bit more readable (and easier to modify, e.g. if we want to change the rendering backend).

> type Diag = Diagram Cairo R2


The paths to all our data.

• SHP files are shape files, a fairly old but widespread map data format that was originally produced by a company called ESRI.

• The polygons for the outline of the wards in Westminster. Surely there is a better place to get this rather than using tree canopy data.

• The polyline data for all the roads (and other stuff) in the UK. We selected out all the roads in a bounding box for London. Even so plotting these takes about a minute.

• The parking data were provided by Westminster Council. The set we consider below was about 4 million lines of cashless parking meter payments (about 1.3G).

> prefix :: FilePath
> prefix = "/Users/dom"
>
>
> borough :: FilePath
> borough = "WestMinster"
>
> parkingBorough :: FilePath
> parkingBorough = "ParkingWestminster"
>
> flGL :: FilePath
>
> flParkingCashless :: FilePath
> flParkingCashless = "ParkingCashlessDenorm.csv"


The data for payments are contained in a CSV file so we create a record in which to keep the various fields contained therein.

> data Payment = Payment
>                { _amountPaid       :: LaxDouble
>                ,  paidDurationMins :: Int
>                ,  startDate        :: UTCTime
>                , _startDay         :: DayOfTheWeek
>                , _endDate          :: UTCTime
>                , _endDay           :: DayOfTheWeek
>                , _startTime        :: TimeOfDay
>                , _endTime          :: TimeOfDay
>                , _designationType  :: T.Text
>                ,  hoursOfControl   :: T.Text
>                , _tariff           :: T.Text
>                , _maxStay          :: T.Text
>                , spaces            :: Maybe Int
>                , _street           :: T.Text
>                , _xCoordinate      :: Maybe Double
>                , _yCoordinate      :: Maybe Double
>                , latitude          :: Maybe Double
>                , longitude         :: Maybe LaxDouble
>                }
>   deriving Show
>
> data DayOfTheWeek = Monday
>                   | Tuesday
>                   | Wednesday
>                   | Thursday
>                   | Friday
>                   | Saturday
>                   | Sunday
>   deriving (Read, Enum)


We need to be able to parse the day of the week.

> instance FromField DayOfTheWeek where
>   parseField s = read <$> parseField s  The field containing the longitude has values of the form -.1. The CSV parser for Double will reject this so we create our own datatype with a more relaxed parser. > newtype LaxDouble = LaxDouble { laxDouble :: Double } > deriving Show > > instance FromField LaxDouble where > parseField = fmap LaxDouble . parseField . addLeading > > where > > addLeading :: B.ByteString -> B.ByteString > addLeading bytes = > case B.uncons bytes of > Just (c -> '.', _) -> B.cons (o '0') bytes > Just (c -> '-', rest) -> B.cons (o '-') (addLeading rest) > _ -> bytes > > c = chr . fromIntegral > o = fromIntegral . ord  We need to be able to parse dates and times. > instance FromField UTCTime where > parseField s = do > f <- parseField s > case parseTime defaultTimeLocale "%F %X" f of > Nothing -> fail "Unable to parse UTC time" > Just g -> return g > > instance FromField TimeOfDay where > parseField s = do > f <- parseField s > case parseTime defaultTimeLocale "%R" f of > Nothing -> fail "Unable to parse time of day" > Just g -> return g  Finally we can write a parser for our record. > instance FromRecord Payment where > parseRecord v > | V.length v == 18 > = Payment <$>
>            v .!  0 <*>
>            v .!  1 <*>
>            v .!  2 <*>
>            v .!  3 <*>
>            v .!  4 <*>
>            v .!  5 <*>
>            v .!  6 <*>
>            v .!  7 <*>
>            v .!  8 <*>
>            v .!  9 <*>
>            v .! 10 <*>
>            v .! 11 <*>
>            v .! 12 <*>
>            v .! 13 <*>
>            v .! 14 <*>
>            v .! 15 <*>
>            v .! 16 <*>
>            v .! 17
>          | otherwise = mzero
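The addLeading fix-up for the longitude field is perhaps easier to see away from the ByteString plumbing; a Python sketch of the same normalisation:

```python
def add_leading(s):
    """Give numbers like '-.1' or '.5' the leading zero that a strict
    float parser expects; well-formed numbers pass through unchanged."""
    if s.startswith('.'):
        return '0' + s
    if s.startswith('-'):
        # Skip the sign and fix up whatever follows it
        return '-' + add_leading(s[1:])
    return s
```

So '-.1' becomes '-0.1', which any float parser accepts.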


To make the analysis simpler, we only look at what might be a typical day, a Thursday in February.

> selectedDay :: UTCTime
> selectedDay = case parseTime defaultTimeLocale "%F %X" "2013-02-28 00:00:00" of
>   Nothing -> error "Unable to parse UTC time"
>   Just t  -> t


It turns out that there are only a very limited number of different sorts of hours of control, so rather than parse these and calculate the number of control minutes per week, we can just create a simple look-up table by hand.

> hoursOfControlTable :: [(T.Text, [Int])]
> hoursOfControlTable = [
>     ("Mon - Fri 8.30am - 6.30pm"                      , [600, 600, 600, 600, 600,   0,   0])
>   , ("Mon-Fri 10am - 4pm"                             , [360, 360, 360, 360, 360,   0,   0])
>   , ("Mon - Fri 8.30-6.30 Sat 8.30 - 1.30"            , [600, 600, 600, 600, 600, 300,   0])
>   , ("Mon - Sat 8.30am - 6.30pm"                      , [600, 600, 600, 600, 600, 600,   0])
>   , ("Mon-Sat 11am-6.30pm "                           , [450, 450, 450, 450, 450, 450,   0])
>   , ("Mon - Fri 8.00pm - 8.00am"                      , [720, 720, 720, 720, 720,   0,   0])
>   , ("Mon - Fri 8.30am - 6.30pm "                     , [600, 600, 600, 600, 600,   0,   0])
>   , ("Mon - Fri 10.00am - 6.30pm\nSat 8.30am - 6.30pm", [510, 510, 510, 510, 510, 600,   0])
>   , ("Mon-Sun 10.00am-4.00pm & 7.00pm - Midnight"     , [660, 660, 660, 660, 660, 660, 660])
>   ]
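availableMinsThu further down will simply index into this table; the lookup amounts to the following (a Python sketch using just two of the rows, with Monday as day 0):

```python
# Two rows of the control-minutes table, keyed by the raw text
hours_of_control = {
    "Mon - Fri 8.30am - 6.30pm": [600, 600, 600, 600, 600, 0, 0],
    "Mon-Fri 10am - 4pm":        [360, 360, 360, 360, 360, 0, 0],
}

THURSDAY = 3  # fromEnum Thursday, counting Monday as 0

def available_mins_thu(control_txt):
    """Control minutes on a Thursday, or None for an unknown pattern."""
    mins = hours_of_control.get(control_txt)
    return None if mins is None else mins[THURSDAY]
```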


Now we create a record in which to record the statistics in which we are interested:

• Number of times a lot is used.

• Number of usage minutes. In reality this is the amount of minutes purchased and people often leave a bay before their ticket expires so this is just a proxy.

• The hours of control for the lot.

• The number of bays in the lot.

N.B. The !’s are really important; without them we get a space leak. In more detail, these are strictness annotations which force the fields to be evaluated rather than carried around unevaluated (taking up unnecessary space) until needed.

> data LotStats = LotStats { usageCount       :: !Int
>                          , usageMins        :: !Int64
>                          , usageControlTxt  :: !T.Text
>                          , usageSpaces      :: !(Maybe Int)
>                          }
>   deriving Show


As we work our way through the data we need to update our statistics.

> updateStats :: LotStats -> LotStats -> LotStats
> updateStats s1 s2 = LotStats { usageCount = (usageCount s1) + (usageCount s2)
>                              , usageMins  = (usageMins s1) +  (usageMins s2)
>                              , usageControlTxt  = usageControlTxt s2
>                              , usageSpaces = usageSpaces s2
>                              }
>
> initBayCountMap :: Map.Map (Pair Double) LotStats
> initBayCountMap = Map.empty


We are going to be working with co-ordinates which are pairs of numbers so we need a data type in which to keep them. Possibly overkill.

> data Pair a = Pair { xPair :: !a, yPair :: !a }
>   deriving (Show, Eq, Ord, Functor, Foldable, Traversable)


Functions to get bounding boxes.

> getPair :: Get a -> Get (a,a)
> getPair getPart = do
>     x <- getPart
>     y <- getPart
>     return (x,y)
>
> getBBox :: Get a -> Get (BBox a)
> getBBox getPoint = do
>     bbMin <- getPoint
>     bbMax <- getPoint
>     return (BBox bbMin bbMax)
>
> bbox :: Get (BBox (Double, Double))
> bbox = do
>   shpFileBBox <- getBBox (getPair getFloat64le)
>   return shpFileBBox
>
> getBBs :: BL.ByteString -> BBox (Double, Double)
> getBBs = runGet $ do
>   _ <- getShapeType32le
>   bbox
>
> isInBB :: (Ord a, Ord b) => BBox (a, b) -> BBox (a, b) -> Bool
> isInBB bbx bby = ea >= eb && wa <= wb &&
>                  sa >= sb && na <= nb
>   where
>     (ea, sa) = bbMin bbx
>     (wa, na) = bbMax bbx
>     (eb, sb) = bbMin bby
>     (wb, nb) = bbMax bby
>
> combineBBs :: (Ord a, Ord b) => BBox (a, b) -> BBox (a, b) -> BBox (a, b)
> combineBBs bbx bby = BBox { bbMin = (min ea eb, min sa sb)
>                           , bbMax = (max wa wb, max na nb)
>                           }
>   where
>     (ea, sa) = bbMin bbx
>     (wa, na) = bbMax bbx
>     (eb, sb) = bbMin bby
>     (wb, nb) = bbMax bby


A function to get plotting information from the shape file.

> getRecs :: BL.ByteString ->
>            [[(Double, Double)]]
> getRecs = runGet $ do
>   _ <- getShapeType32le
>   _ <- bbox
>   nParts  <- getWord32le
>   nPoints <- getWord32le
>   parts   <- replicateM (fromIntegral nParts) getWord32le
>   points  <- replicateM (fromIntegral nPoints) (getPair getFloat64le)
>   return (getParts (map fromIntegral parts) points)
>
> getParts :: [Int] -> [a] -> [[a]]
> getParts offsets ps = unfoldr g (gaps, ps)
>   where
>     gaps = zipWith (-) (tail offsets) offsets
>     g (  [],   []) = Nothing
>     g (  [],   xs) = Just (xs, ([], []))
>     g (n:ns,   xs) = Just (take n xs, (ns, drop n xs))
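getParts is doing the standard shapefile trick of slicing one flat point list into its parts using the part-offset array; the same logic in Python (a sketch, assuming offsets start at 0):

```python
def get_parts(offsets, points):
    """Split a flat point list into parts; offsets holds the starting
    index of each part, and the final part runs to the end of the list."""
    # Gaps between consecutive offsets give the sizes of all but the last part
    sizes = [b - a for a, b in zip(offsets, offsets[1:])]
    parts, rest = [], points
    for n in sizes:
        parts.append(rest[:n])
        rest = rest[n:]
    if rest:
        parts.append(rest)  # whatever remains is the final part
    return parts
```

For offsets [0, 2, 5] and seven points this yields parts of sizes 2, 3 and 2, matching the Haskell unfoldr.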


We need to be able to filter out e.g. roads that are not in a given bounding box.

> recsOfInterest :: BBox (Double, Double) -> [ShpRec] -> [ShpRec]
> recsOfInterest bb = filter (flip isInBB bb . getBBs . shpRecData)


A function to process each ward in Westminster.

> processWard :: [ShpRec] -> FilePath ->
>                IO ([ShpRec], ([[(Double, Double)]], BBox (Double, Double)))
> processWard recDB fileName = do
>   input <- BL.readFile $ prefix </> dataDir </> borough </> fileName
>   let (hdr, recs) = runGet getShpFile input
>       bb          = shpFileBBox hdr
>   let ps = head $ map getRecs (map shpRecData recs)
>   return $ (recsOfInterest bb recDB, (ps, bb))


We want to draw roads and ward boundaries.

> colouredLine :: Double -> Colour Double -> [(Double, Double)] -> Diag
> colouredLine thickness lineColour xs = (fromVertices $ map p2 xs) #
>                                        lw thickness #
>                                        lc lineColour


And we want to draw parking lots with the hue varying according to how heavily they are utilised.

> bayDots :: [Pair Double] -> [Double] -> Diag
> bayDots xs bs = position (zip (map p2 $ map toPair xs) dots)
>   where dots = map (\b -> circle 0.0005 # fcA (blend b c1 c2) # lw 0.0) bs
>         toPair p = (xPair p, yPair p)
>         c1 = darkgreen  `withOpacity` 0.7
>         c2 = lightgreen `withOpacity` 0.7


Update the statistics until we run out of data.

> processCsv :: Map.Map (Pair Double) LotStats ->
>               Records Payment ->
>               Map.Map (Pair Double) LotStats
> processCsv m rs = case rs of
>   Cons u rest -> case u of
>     Left err  -> error err
>     Right val -> case Tr.sequence $ Pair (laxDouble <$> longitude val) (latitude val) of
>       Nothing -> processCsv m rest
>       Just v  -> if startDate val == selectedDay
>                  then processCsv (Map.insertWith updateStats v delta m) rest
>                  else processCsv m rest
>         where
>           delta = LotStats { usageCount = 1
>                            , usageMins  = fromIntegral $ paidDurationMins val
>                            , usageControlTxt = hoursOfControl val
>                            , usageSpaces     = spaces val
>                            }
>   Nil mErr x  -> if BL.null x
>                  then m
>                  else error $ "Nil: " ++ show mErr ++ " " ++ show x


> availableMinsThu :: LotStats -> Maybe Double
> availableMinsThu val =
>   fmap fromIntegral $
>   fmap (!! (fromEnum Thursday)) $
>   flip lookup hoursOfControlTable $
>   usageControlTxt val


Now for the main function.

> parkingDiaM :: IO Diag
> parkingDiaM = do


Read in the 4 million records lazily.

>   parkingCashlessCsv <- BL.readFile $
>     prefix </>
>     dataDir </>
>     parkingBorough </>
>     flParkingCashless


Create our statistics.

>   let bayCountMap = processCsv initBayCountMap (decode False parkingCashlessCsv)
>
>       vals = Map.elems bayCountMap


Calculate the available minutes for each bay.

>       availableMinsThus :: [Maybe Double]
>       availableMinsThus = zipWith f (map availableMinsThu vals)
>                                     (map (fmap fromIntegral . usageSpaces) vals)
>         where
>           f x y = (*) <$> x <*> y


Calculate the actual minutes used for each lot and the usage, which determines the hue of the dot representing the lot on the map.

>       actualMinsThu :: [Double]
>       actualMinsThu =
>         map fromIntegral $
>         map usageMins vals
>
>       usage :: [Maybe Double]
>       usage = zipWith f actualMinsThu availableMinsThus
>         where
>           f x y = (/) <$> pure x <*> y


We will need the co-ordinates of each lot in order to be able to plot it.

>   let parkBayCoords :: [Pair Double]
>       parkBayCoords = Map.keys bayCountMap


Get the ward shape files.

>   fs <- getDirectoryContents $ prefix </> dataDir </> borough
>   let wardShpFiles = map (uncurry addExtension) $
>                      filter ((== ".shp") . snd) $
>                      map splitExtension fs


Get the London roads shape file.

>   inputGL <- BL.readFile flGL
>   let recsGL = snd $ runGet getShpFile inputGL


Get the data we wish to plot from each ward shape file.

>   rps <- mapM (processWard recsGL) wardShpFiles


Get the roads inside the wards.

>   let zs = map (getRecs . shpRecData) $ concat $ map fst rps


And create blue diagram elements for each road.

>       ps :: [[Diag]]
>       ps = map (map (colouredLine 0.0001 blue)) zs


Create diagram elements for each ward boundary.

>       qs :: [[Diag]]
>       qs = map (map (colouredLine 0.0003 navajowhite)) (map (fst . snd) rps)


Westminster is located at about 51 degrees North. We want to put a background colour on the map so either we need to move Westminster to be at the origin or create a background rectangle centred on Westminster. We do the former. We create a rectangle which is slightly bigger than the bounding box of Westminster. And we translate everything so that the South West corner of the bounding box of Westminster is the origin.

>   let inf    = read "Infinity"
>       negInf = read "-Infinity"
>       bbWestminster = foldr combineBBs (BBox (inf, inf) (negInf, negInf)) $
>                       map (snd . snd) rps
>
>   let (ea, sa) = bbMin bbWestminster
>       (wa, na) = bbMax bbWestminster
>       wmHeight = na - sa
>       wmWidth  = wa - ea


Create the background.

>       wmBackground = translateX (wmWidth / 2.0) $
>                      translateY (wmHeight / 2.0) $
>                      scaleX 1.1 $
>                      scaleY 1.1 $
>                      rect wmWidth wmHeight # fcA (yellow `withOpacity` 0.1) # lw 0.0


Plot the streets.

>       wmStreets = translateX (negate ea) $
>                   translateY (negate sa) $
>                   mconcat (mconcat ps)


Plot the parking lots.

>       wmParking = translateX (negate ea) $
>                   translateY (negate sa) $
>                   uncurry bayDots $
>                   unzip $
>                   map (\(x, y) -> (x, fromJust y)) $
>                   filter (isJust . snd) $
>                   zip parkBayCoords usage


Plot the ward boundaries.

>       wmWards = translateX (negate ea) $
>                 translateY (negate sa) $
>                 mconcat (mconcat qs)
>
>   return $ wmBackground <>
>            wmWards <>
>            wmStreets <>
>            wmParking


Sadly we have to use unsafePerformIO in order to be able to create the post using BlogLiteratelyD.

> parkingDia :: Diag
> parkingDia = unsafePerformIO parkingDiaM


And now we can see all the parking lots in Westminster as green dots. The darkness represents how heavily utilised they are. The thick gold lines delineate the wards in Westminster. In case it isn’t obvious the blue lines are the roads. The Thames, Hyde Park and Regent’s Park are fairly easy to spot. Less easy to spot but still fairly visible are Buckingham Palace and Green Park.

# Observations

• We appear to need to use ghc -O2 otherwise we get a space leak.

• We didn’t explicitly need the equivalent of pandas. It would be interesting to go through the Haskell and Python code and see where we used pandas and what the equivalent was in Haskell.

• Python and R seem more forgiving about data formats e.g. they handle -.1 where Haskell doesn’t. Perhaps this should be in the Haskell equivalent of pandas.


# Preface

The intended audience of this article is someone who knows something about Machine Learning and Artificial Neural Networks (ANNs) in particular and who recalls that fitting an ANN required a technique called backpropagation. The goal of this post is to refresh the reader’s knowledge of ANNs and backpropagation and to show that the latter is merely a specialised version of automatic differentiation, a tool that all Machine Learning practitioners should know about and have in their toolkit.

# Introduction

The problem is simple to state: we have a (highly) non-linear function, the cost function of an Artificial Neural Network (ANN), and we wish to minimize this so as to estimate the parameters / weights of the function.

In order to minimise the function, one obvious approach is to use steepest descent: start with random values for the parameters to be estimated, find the direction in which the function decreases most quickly, step a small amount in that direction and repeat until close enough.

But we have two problems:

• We have an algorithm or a computer program that calculates the non-linear function rather than the function itself.

• The function has a very large number of parameters, hundreds if not thousands.

One thing we could try is bumping each parameter by a small amount to get partial derivatives numerically

$\displaystyle \frac{\partial E(\ldots, w, \ldots)}{\partial w} \approx \frac{E(\ldots, w + \epsilon, \ldots) - E(\ldots, w, \ldots)}{\epsilon}$

But this would mean evaluating our function many times and moreover we could easily get numerical errors as a result of the vagaries of floating point arithmetic.

As an alternative we could turn our algorithm or computer program into a function more recognisable as a mathematical function and then compute the differential itself as a function either by hand or by using a symbolic differentiation package. For the complicated expression which is our mathematical function, the former would be error prone and the latter could easily generate something which would be even more complex and costly to evaluate than the original expression.

The standard approach is to use a technique called backpropagation and the understanding and application of this technique forms a large part of many machine learning lecture courses.

Since at least the 1960s, techniques for automatically differentiating computer programs have been discovered and re-discovered. Anyone who knows about these techniques and reads about backpropagation quickly realises that backpropagation is just automatic differentiation and steepest descent. The remainder of this post covers:

• Refresher on neural networks and backpropagation;

• Methods for differentiation;

• Backward and forward automatic differentiation and

• Concluding thoughts.

The only thing important to remember throughout is the chain rule

$\displaystyle (g \circ f)'(a) = g'(f(a))\cdot f'(a)$

in alternative notation

$\displaystyle \frac{\mathrm{d} (g \circ f)}{\mathrm{d} x}(a) = \frac{\mathrm{d} g}{\mathrm{d} y}(f(a)) \frac{\mathrm{d} f}{\mathrm{d} x}(a)$

where $y = f(x)$. More suggestively we can write

$\displaystyle \frac{\mathrm{d} g}{\mathrm{d} x} = \frac{\mathrm{d} g}{\mathrm{d} y} \frac{\mathrm{d} y}{\mathrm{d} x}$

where it is understood that $\mathrm{d} g / \mathrm{d} x$ and $\mathrm{d} y / \mathrm{d} x$ are evaluated at $a$ and $\mathrm{d} g / \mathrm{d} y$ is evaluated at $f(a)$.

For example,

$\displaystyle \frac{\mathrm{d}}{\mathrm{d} x} \sqrt{3 \sin(x)} = \frac{\mathrm{d}}{\mathrm{d} x} (3 \sin(x)) \cdot \frac{\mathrm{d}}{\mathrm{d} y} \sqrt{y} = 3 \cos(x) \cdot \frac{1}{2\sqrt{y}} = \frac{3\cos(x)}{2\sqrt{3\sin(x)}}$
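The worked example is easy to check numerically (Python, using a central difference at an arbitrarily chosen point):

```python
import math

def f(x):
    return math.sqrt(3.0 * math.sin(x))

def f_prime(x):
    # The result of the chain-rule calculation above
    return 3.0 * math.cos(x) / (2.0 * math.sqrt(3.0 * math.sin(x)))

x, h = 1.0, 1e-6
# Central difference approximation to f'(x)
central = (f(x + h) - f(x - h)) / (2.0 * h)
```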

# Acknowledgements

Sadly I cannot recall all the sources I looked at in order to produce this article but I have made heavy use of the following.

# Neural Network Refresher

Here is our model, with $\boldsymbol{x}$ the input, $\hat{\boldsymbol{y}}$ the predicted output and $\boldsymbol{y}$ the actual output and $w^{(k)}$ the weights in the $k$-th layer. We have concretised the transfer function as $\tanh$ but it is quite popular to use the $\text{logit}$ function.

\displaystyle \begin{aligned} a_i^{(1)} &= \sum_{j=0}^{N^{(1)}} w_{ij}^{(1)} x_j \\ z_i^{(1)} &= \tanh(a_i^{(1)}) \\ a_i^{(2)} &= \sum_{j=0}^{N^{(2)}} w_{ij}^{(2)} z_j^{(1)} \\ \dots &= \ldots \\ a_i^{(L-1)} &= \sum_{j=0}^{N^{(L-1)}} w_{ij}^{(L-1)} z_j^{(L-2)} \\ z_j^{(L-1)} &= \tanh(a_j^{(L-1)}) \\ \hat{y}_i &= \sum_{j=0}^{N^{(L)}} w_{ij}^{(L)} z_j^{(L-1)} \\ \end{aligned}

with the loss or cost function

$\displaystyle E(\boldsymbol{w}; \boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|(\hat{\boldsymbol{y}} - \boldsymbol{y})\|^2$

The diagram below depicts a neural network with a single hidden layer.

In order to apply the steepest descent algorithm we need to calculate the differentials of this latter function with respect to the weights, that is, we need to calculate

$\displaystyle \Delta w_{ij} = \frac{\partial E}{\partial w_{ij}}$

Applying the chain rule

$\displaystyle \Delta w_{ij} = \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}}$

Since

$\displaystyle a_j^{(l)} = \sum_{i=0}^N w_{ij}^{(l)}z_i^{(l-1)}$

we have

$\displaystyle \frac{\partial a_j^{(l)}}{\partial w_{ij}^{(l)}} = \frac{\partial}{\partial w_{ij}^{(l)}}\sum_{k=0}^M w_{kj}^{(l)}z_k^{(l-1)} = z_i^{(l-1)}$

Defining

$\displaystyle \delta_j^{(l)} \equiv \frac{\partial E}{\partial a_j^{(l)}}$

we obtain

$\displaystyle \Delta w_{ij}^{(l)} = \frac{\partial E}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)}$

Finding the $z_i$ for each layer is straightforward: we start with the inputs and propagate forward. In order to find the $\delta_j$ we need to start with the outputs and propagate backwards:

For the output layer we have (since $\hat{y}_j = a_j$)

$\displaystyle \delta_j = \frac{\partial E}{\partial a_j} = \frac{\partial E}{\partial y_j} = \frac{\partial}{\partial y_j}\bigg(\frac{1}{2}\sum_{i=0}^M (\hat{y}_i - y_i)^2\bigg) = \hat{y}_j - y_j$

For a hidden layer using the chain rule

$\displaystyle \delta_j^{(l-1)} = \frac{\partial E}{\partial a_j^{(l-1)}} = \sum_k \frac{\partial E}{\partial a_k^{(l)}}\frac{\partial a_k^{(l)}}{\partial a_j^{(l-1)}}$

Now

$\displaystyle a_k^{(l)} = \sum_i w_{ki}^{(l)}z_i^{(l-1)} = \sum_i w_{ki}^{(l)} f(a_i^{(l-1)})$

so that

$\displaystyle \frac{\partial a_k^{(l)}}{\partial a_j^{(l-1)}} = \frac{\partial}{\partial a_j^{(l-1)}}\sum_i w_{ki}^{(l)} f(a_i^{(l-1)}) = w_{kj}^{(l)}\,f'(a_j^{(l-1)})$

and thus

$\displaystyle \delta_j^{(l-1)} = \sum_k \frac{\partial E}{\partial a_k^{(l)}}\frac{\partial a_k^{(l)}}{\partial a_j^{(l-1)}} = \sum_k \delta_k^{(l)} w_{kj}^{(l)}\, f'(a_j^{(l-1)}) = f'(a_j^{(l-1)}) \sum_k \delta_k^{(l)} w_{kj}^{(l)}$

Summarising

1. We calculate all $a_j$ and $z_j$ for each layer starting with the input layer and propagating forward.

2. We evaluate $\delta_j^{(L)}$ in the output layer using $\delta_j = \hat{y}_j - y_j$.

3. We evaluate $\delta_j$ in each layer using $\delta_j^{(l-1)} = f'(a_j^{(l-1)})\sum_k \delta_k^{(l)} w_{kj}^{(l)}$ starting with the output layer and propagating backwards.

4. Use $\partial E / \partial w_{ij}^{(l)} = \delta_j^{(l)} z_i^{(l-1)}$ to obtain the required derivatives in each layer.

For the particular activation function $\tanh$ we have $f'(a) = \tanh' (a) = 1 - \tanh^2(a)$. And finally we can use the partial derivatives to step in the right direction using steepest descent

$\displaystyle w' = w - \gamma\nabla E(w)$

where $\gamma$ is the step size aka the learning rate.
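Scaled down to one input, one hidden unit and one output, the four numbered steps and the final update rule fit in a few lines (a pure-Python sketch with made-up weights and data, not a full implementation):

```python
import math

def backprop_step(w1, w2, x, y, gamma):
    """One steepest-descent step for the scalar network
    y_hat = w2 * tanh(w1 * x), following steps 1-4 above."""
    # 1. Forward pass: the a's and z's
    a1 = w1 * x
    z1 = math.tanh(a1)
    y_hat = w2 * z1
    # 2. Output-layer delta (the output is linear, so delta2 = y_hat - y)
    delta2 = y_hat - y
    # 3. Hidden-layer delta, using tanh'(a) = 1 - tanh(a)^2
    delta1 = (1.0 - z1 * z1) * w2 * delta2
    # 4. dE/dw = delta * z, then step against the gradient
    return w1 - gamma * delta1 * x, w2 - gamma * delta2 * z1

def cost(w1, w2, x, y):
    return 0.5 * (w2 * math.tanh(w1 * x) - y) ** 2

w1, w2 = 0.3, -0.2
for _ in range(100):
    w1, w2 = backprop_step(w1, w2, 0.5, 0.7, 0.1)
```

After the loop the cost at the training point has dropped from its initial value, as steepest descent promises for a small enough learning rate.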

# Differentiation

So now we have an efficient algorithm for differentiating the cost function for an ANN and thus estimating its parameters but it seems quite complex. In the introduction we alluded to other methods of differentiation. Let us examine those in a bit more detail before moving on to a general technique for differentiating programs of which backpropagation turns out to be a specialisation.

## Numerical Differentiation

Consider the function $f(x) = e^x$, whose derivative is $f'(x) = e^x$; we can easily compare a numerical approximation of this with the exact result. The numerical approximation is given by

$\displaystyle f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$

In theory we should get a closer and closer approximation as $\epsilon$ decreases, but as the chart below shows, at some point (with $\epsilon \approx 2^{-26}$) the approximation worsens as a result of the fact that we are using floating point arithmetic. For a complex function such as one which calculates the cost function of an ANN, there is a risk that we may end up with a poor approximation for the derivative and thus a poor estimate for the parameters of the model.
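The effect is easy to reproduce (a Python sketch; the exact crossover depends on the machine, but the worsening for very small $\epsilon$ is robust):

```python
import math

def forward_diff(f, x, eps):
    """One-sided finite-difference approximation to f'(x)."""
    return (f(x + eps) - f(x)) / eps

# Error in approximating d/dx e^x at x = 1 for a large and a near-optimal step.
# Truncation error dominates at 2^-10; round-off starts to matter near 2^-26.
err_large = abs(forward_diff(math.exp, 1.0, 2.0 ** -10) - math.exp(1.0))
err_sweet = abs(forward_diff(math.exp, 1.0, 2.0 ** -26) - math.exp(1.0))
```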

## Symbolic Differentiation

Suppose we have the following program (written in Python)

import numpy as np

def many_sines(x):
    y = x
    for i in range(1,7):
        y = np.sin(x+y)
    return y

When we unroll the loop we are actually evaluating

$\displaystyle f(x) = \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + x))))))$

Now suppose we want to get the differential of this function. Symbolically this would be

\displaystyle \begin{aligned} f'(x) &= (((((2\cdot \cos(2x)+1)\cdot \\ &\phantom{=} \cos(\sin(2x)+x)+1)\cdot \\ &\phantom{=} \cos(\sin(\sin(2x)+x)+x)+1)\cdot \\ &\phantom{=} \cos(\sin(\sin(\sin(2x)+x)+x)+x)+1)\cdot \\ &\phantom{=} \cos(\sin(\sin(\sin(\sin(2x)+x)+x)+x)+x)+1)\cdot \\ &\phantom{=} \cos(\sin(\sin(\sin(\sin(\sin(2x)+x)+x)+x)+x)+x) \end{aligned}

Typically the non-linear function that an ANN gives is much more complex than the simple function given above. Thus its derivative will be correspondingly more complex and therefore expensive to compute. Moreover calculating this derivative by hand could easily introduce errors. And in order to have a computer perform the symbolic calculation we would have to encode our cost function somehow so that it is amenable to this form of manipulation.
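We can mimic what a symbolic differentiator does with a few lines of Python that build the derivative of the unrolled loop as an expression string via the chain rule. This is a toy stand-in for a computer algebra system, written for this illustration only, but it shows the expression swell clearly:

```python
import math

# Build the unrolled many_sines expression and its symbolic derivative as
# strings, repeatedly applying the chain rule d/dx sin(x + g) = cos(x + g)(1 + g').
def unroll(n):
    expr, dexpr = "x", "1"
    for _ in range(n):
        expr, dexpr = f"sin(x + {expr})", f"cos(x + {expr}) * (1 + {dexpr})"
    return expr, dexpr

expr, dexpr = unroll(6)
env = {"sin": math.sin, "cos": math.cos}

def f(x):
    return eval(expr, {**env, "x": x})

def fprime(x):
    return eval(dexpr, {**env, "x": x})

# The symbolic derivative agrees with a central difference ...
x0 = 0.3
h = 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)

# ... but the derivative expression is several times the size of the function.
swell = len(dexpr) / len(expr)
```

Every nested `sin` contributes a `cos` factor with the whole inner expression repeated inside it, which is exactly why the hand-written derivative above looks so unwieldy.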

# Automatic Differentiation

## Reverse Mode

Traditionally, forward mode is introduced first as this is considered easier to understand. We introduce reverse mode first as it can be seen to be a generalization of backpropagation.

Consider the function

$\displaystyle f(x) = \exp(\exp(x) + (\exp(x))^2) + \sin(\exp(x) + (\exp(x))^2)$

Let us write this as a data flow graph.

We can thus re-write our function as a sequence of simpler functions in which each function only depends on variables earlier in the sequence.

\displaystyle \begin{aligned} u_7 &= f_7(u_6, u_5, u_4, u_3, u_2, u_1) \\ u_6 &= f_6(u_5, u_4, u_3, u_2, u_1) \\ \ldots &= \ldots \\ u_1 &= f_1(u_1) \end{aligned}

\displaystyle \begin{aligned} \mathrm{d}u_7 &= \frac{\partial f_7}{\partial u_6} \mathrm{d} u_6 + \frac{\partial f_7}{\partial u_5} \mathrm{d} u_5 + \frac{\partial f_7}{\partial u_4} \mathrm{d} u_4 + \frac{\partial f_7}{\partial u_3} \mathrm{d} u_3 + \frac{\partial f_7}{\partial u_2} \mathrm{d} u_2 + \frac{\partial f_7}{\partial u_1} \mathrm{d} u_1 \\ \mathrm{d}u_6 &= \frac{\partial f_6}{\partial u_5} \mathrm{d} u_5 + \frac{\partial f_6}{\partial u_4} \mathrm{d} u_4 + \frac{\partial f_6}{\partial u_3} \mathrm{d} u_3 + \frac{\partial f_6}{\partial u_2} \mathrm{d} u_2 + \frac{\partial f_6}{\partial u_1} \mathrm{d} u_1 \\ \ldots &= \ldots \\ \mathrm{d}u_1 &= \frac{\partial f_1}{\partial u_1} \mathrm{d} u_1 \end{aligned}

In our particular example, since $u_1, \dots, u_5$ do not depend on $u_6$

\displaystyle \begin{aligned} \frac{\mathrm{d}u_7}{\mathrm{d}u_6} &= 1 \end{aligned}

Further $u_6$ does not depend on $u_5$ so we also have

\displaystyle \begin{aligned} \frac{\mathrm{d}u_7}{\mathrm{d}u_5} &= 1 \\ \end{aligned}

Now things become more interesting as $u_6$ and $u_5$ both depend on $u_4$ and so the chain rule makes an explicit appearance

\displaystyle \begin{aligned} \frac{\mathrm{d}u_7}{\mathrm{d}u_4} &= \frac{\mathrm{d}u_7}{\mathrm{d}u_6}\frac{\mathrm{d}u_6}{\mathrm{d}u_4} + \frac{\mathrm{d}u_7}{\mathrm{d}u_5}\frac{\mathrm{d}u_5}{\mathrm{d}u_4} \\ &= \frac{\mathrm{d}u_7}{\mathrm{d}u_6}\exp{u_4} + \frac{\mathrm{d}u_7}{\mathrm{d}u_5}\cos{u_4} \end{aligned}

Carrying on

\displaystyle \begin{aligned} \frac{\mathrm{d}u_7}{\mathrm{d}u_3} &= \frac{\mathrm{d}u_7}{\mathrm{d}u_4}\frac{\mathrm{d}u_4}{\mathrm{d}u_3} \\ &= \frac{\mathrm{d}u_7}{\mathrm{d}u_4} \\ \frac{\mathrm{d}u_7}{\mathrm{d}u_2} &= \frac{\mathrm{d}u_7}{\mathrm{d}u_4}\frac{\mathrm{d}u_4}{\mathrm{d}u_2} + \frac{\mathrm{d}u_7}{\mathrm{d}u_3}\frac{\mathrm{d}u_3}{\mathrm{d}u_2} \\ &= \frac{\mathrm{d}u_7}{\mathrm{d}u_4} + 2u_2\frac{\mathrm{d}u_7}{\mathrm{d}u_4} \\ \frac{\mathrm{d}u_7}{\mathrm{d}u_1} &= \frac{\mathrm{d}u_7}{\mathrm{d}u_2}\frac{\mathrm{d}u_2}{\mathrm{d}u_1} \\ &=\frac{\mathrm{d}u_7}{\mathrm{d}u_2}\exp{u_1} \end{aligned}

Note that having worked from top to bottom (the forward sweep) in the graph to calculate the function itself, we have to work backwards from bottom to top (the backward sweep) to calculate the derivative.

So provided we can translate our program into a call graph, we can apply this procedure to calculate the differential with the same complexity as the original program.

The pictorial representation of an ANN is effectively the data flow graph of the cost function (without the final cost calculation itself) and calculating its differential in this way is identical to backpropagation.
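The procedure can be sketched in a few lines of Python with a tape: each node records which nodes it was built from together with the local partial derivatives, and the backward sweep accumulates adjoints in reverse order of construction. This is an illustration written for this post's example function only, not code from any AD library.

```python
import math

# A minimal tape-based reverse mode sketch. Var.tape records nodes in the
# order they are created; a reversed pass over it is the backward sweep.
class Var:
    tape = []

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # pairs of (parent, d self / d parent)
        self.adjoint = 0.0
        Var.tape.append(self)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def v_exp(a):
    e = math.exp(a.value)
    return Var(e, ((a, e),))

def v_sin(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def value_and_grad(f, x0):
    Var.tape = []
    x = Var(x0)
    y = f(x)
    y.adjoint = 1.0
    for node in reversed(Var.tape):   # the backward sweep
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial
    return y.value, x.adjoint

def f(x):                             # the example function above
    u2 = v_exp(x)                     # u_2 = exp(u_1)
    u4 = u2 + u2 * u2                 # u_4 = u_2 + u_3
    return v_exp(u4) + v_sin(u4)      # u_7 = u_6 + u_5

val, deriv = value_and_grad(f, 0.5)

# Hand-calculated derivative for comparison: with u = exp(x), v = u + u^2,
# f'(x) = (exp(v) + cos(v)) * (u + 2u^2).
u = math.exp(0.5)
v = u + u * u
expected = (math.exp(v) + math.cos(v)) * (u + 2 * u * u)
```

Note that the forward sweep runs the program once and the backward sweep visits each node once, which is why the derivative costs only a constant factor more than the function itself.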

## Forward Mode

An alternative method for automatic differentiation is called forward mode and has a simple implementation. Let us illustrate this using Haskell 98. The actual implementation is about 20 lines of code.

First some boilerplate declarations that need not concern us further.

> {-# LANGUAGE NoMonomorphismRestriction #-}
>
> module AD (  -- the module name is not preserved in the source; AD is assumed
>     Dual(..)
>   , f
>   , idD
>   , cost
>   , zs
>   ) where
>
> default ()


Let us define dual numbers

> data Dual = Dual Double Double
>   deriving (Eq, Show)


We can think of these pairs as first order polynomials in the indeterminate $\epsilon$, $x + \epsilon x'$, such that $\epsilon^2 = 0$.

Thus, for example, we have

\displaystyle \begin{aligned} (x + \epsilon x') + (y + \epsilon y') &= ((x + y) + \epsilon (x' + y')) \\ (x + \epsilon x')(y + \epsilon y') &= xy + \epsilon (xy' + x'y) \\ \log (x + \epsilon x') &= \log x (1 + \epsilon \frac {x'}{x}) = \log x + \epsilon\frac{x'}{x} \\ \sqrt{(x + \epsilon x')} &= \sqrt{x(1 + \epsilon\frac{x'}{x})} = \sqrt{x}(1 + \epsilon\frac{1}{2}\frac{x'}{x}) = \sqrt{x} + \epsilon\frac{1}{2}\frac{x'}{\sqrt{x}} \\ \ldots &= \ldots \end{aligned}

Notice that these equations implicitly encode the chain rule. For example, we know, using the chain rule, that

$\displaystyle \frac{\mathrm{d}}{\mathrm{d} x}\log(\sqrt x) = \frac{1}{\sqrt x}\frac{1}{2}x^{-1/2} = \frac{1}{2x}$

And using the example equations above we have

\displaystyle \begin{aligned} \log(\sqrt {x + \epsilon x'}) &= \log (\sqrt{x} + \epsilon\frac{1}{2}\frac{x'}{\sqrt{x}}) \\ &= \log (\sqrt{x}) + \epsilon\frac{\frac{1}{2}\frac{x'}{\sqrt{x}}}{\sqrt{x}} \\ &= \log (\sqrt{x}) + \epsilon x'\frac{1}{2x} \end{aligned}

Notice that dual numbers carry around the calculation and the derivative of the calculation. To actually evaluate $\log(\sqrt{x})$ at a particular value, say 2, we plug in 2 for $x$ and 1 for $x'$

$\displaystyle \log (\sqrt{2 + \epsilon 1}) = \log(\sqrt{2}) + \epsilon\frac{1}{4}$

Thus the derivative of $\log(\sqrt{x})$ at 2 is $1/4$.
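Before turning to the Haskell, the same calculation can be transliterated into a few lines of Python (an illustration for this example only, with just the two rules we need):

```python
import math

# Dual numbers as pairs x + eps x' with eps^2 = 0, carrying the calculation
# and its derivative together.
class Dual:
    def __init__(self, x, dx):
        self.x, self.dx = x, dx

def d_log(a):
    return Dual(math.log(a.x), a.dx / a.x)

def d_sqrt(a):
    return Dual(math.sqrt(a.x), a.dx / (2 * math.sqrt(a.x)))

# Plug in 2 for x and 1 for x'.
r = d_log(d_sqrt(Dual(2.0, 1.0)))
```

The second component of the result is $1/4$, the derivative of $\log(\sqrt{x})$ at $2$, exactly as in the worked equations above.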

With a couple of helper functions we can implement this rule ($\epsilon^2 = 0$) by making Dual an instance of Num, Fractional and Floating.

> constD :: Double -> Dual
> constD x = Dual x 0
>
> idD :: Double -> Dual
> idD x = Dual x 1.0


Let us implement the rules above by declaring Dual to be an instance of Num. A Haskell class such as Num simply states that it is possible to perform a (usually) related collection of operations on any type which is declared as an instance of that class. For example, Integer and Double are both types which are instances of Num and thus one can add, multiply, etc. values of these types (but note one cannot add an Integer to a Double without first converting a value of the former to a value of the latter).

As an aside, we will never need the functions signum and abs and declare them as undefined; in a robust implementation we would specify an error if they were ever accidentally used.

> instance Num Dual where
>   fromInteger n             = constD $ fromInteger n
>   (Dual x x') + (Dual y y') = Dual (x + y) (x' + y')
>   (Dual x x') * (Dual y y') = Dual (x * y) (x * y' + y * x')
>   negate (Dual x x')        = Dual (negate x) (negate x')
>   signum _                  = undefined
>   abs _                     = undefined


We need to be able to perform division on Dual so we further declare it to be an instance of Fractional.

> instance Fractional Dual where
>   fromRational p    = constD $ fromRational p
>   recip (Dual x x') = Dual (1.0 / x) (-x' / (x * x))


We want to be able to perform the same operations on Dual as we can on Float and Double. Thus we make Dual an instance of Floating which means we can now operate on values of this type as though, in some sense, they are the same as values of Float or Double (in Haskell 98 only instances for Float and Double are defined for the class Floating).

> instance Floating Dual where
>   pi = constD pi
>   exp   (Dual x x') = Dual (exp x)   (x' * exp x)
>   log   (Dual x x') = Dual (log x)   (x' / x)
>   sqrt  (Dual x x') = Dual (sqrt x)  (x' / (2 * sqrt x))
>   sin   (Dual x x') = Dual (sin x)   (x' * cos x)
>   cos   (Dual x x') = Dual (cos x)   (x' * (- sin x))
>   sinh  (Dual x x') = Dual (sinh x)  (x' * cosh x)
>   cosh  (Dual x x') = Dual (cosh x)  (x' * sinh x)
>   asin  (Dual x x') = Dual (asin x)  (x' / sqrt (1 - x*x))
>   acos  (Dual x x') = Dual (acos x)  (x' / (-sqrt (1 - x*x)))
>   atan  (Dual x x') = Dual (atan x)  (x' / (1 + x*x))
>   asinh (Dual x x') = Dual (asinh x) (x' / sqrt (1 + x*x))
>   acosh (Dual x x') = Dual (acosh x) (x' / (sqrt (x*x - 1)))
>   atanh (Dual x x') = Dual (atanh x) (x' / (1 - x*x))


That’s all we need to do. Let us implement the function we considered earlier.

> f =  sqrt . (* 3) . sin


The compiler can infer its type

ghci> :t f
f :: Floating c => c -> c


We know the derivative of the function and can also implement it directly in Haskell.

> f' x = 3 * cos x / (2 * sqrt (3 * sin x))


Now we can evaluate the function along with its automatically calculated derivative and compare that with the derivative we calculated symbolically by hand.

ghci> f $ idD 2
Dual 1.6516332160855343 (-0.3779412091869595)

ghci> f' 2
-0.3779412091869595


To see that we are not doing symbolic differentiation (it’s easy to see we are not doing numerical differentiation) let us step through the actual evaluation.

\displaystyle \begin{aligned} f\,(\mathrm{idD}\,2) &\longrightarrow (\mathrm{sqrt} \circ (\times 3) \circ \sin)\,(\mathrm{idD}\,2) \\ &\longrightarrow (\mathrm{sqrt} \circ (\times 3) \circ \sin)\,(\mathrm{Dual}\,2\,1) \\ &\longrightarrow (\mathrm{sqrt} \circ (\times 3))\,(\mathrm{Dual}\,(\sin 2)\,(\cos 2)) \\ &\longrightarrow \mathrm{sqrt}\,(\mathrm{Dual}\,3\,0 \times \mathrm{Dual}\,(\sin 2)\,(\cos 2)) \\ &\longrightarrow \mathrm{sqrt}\,(\mathrm{Dual}\,(3\sin 2)\,(3\cos 2 + 0\sin 2)) \\ &\longrightarrow \mathrm{sqrt}\,(\mathrm{Dual}\,(3\sin 2)\,(3\cos 2)) \\ &\longrightarrow \mathrm{Dual}\,\big(\sqrt{3\sin 2}\big)\,\big(3\cos 2 \,/\, (2\sqrt{3\sin 2})\big) \\ &\longrightarrow \mathrm{Dual}\,1.6516332160855343\,(-0.3779412091869595) \end{aligned}

# A Simple Application

In order not to make this blog post too long let us apply AD to finding parameters for a simple regression. The application to ANNs is described in a previous blog post. Note that in a real application we would use the Haskell ad package and furthermore use reverse mode AD as in this case it would be more efficient.

First our cost function

$\displaystyle L(\boldsymbol{x}, \boldsymbol{y}, m, c) = \frac{1}{2n}\sum_{i=1}^n (y_i - (mx_i + c))^2$

> cost m c xs ys = (/ (2 * fromIntegral (length xs))) $
>                  sum $
>                  zipWith errSq xs ys
>   where
>     errSq x y = z * z
>       where
>         z = y - (m * x + c)

ghci> :t cost
cost :: Fractional a => a -> a -> [a] -> [a] -> a


Some test data

> xs = [1,2,3,4,5,6,7,8,9,10]
> ys = [3,5,7,9,11,13,15,17,19,21]


and a learning rate

> gamma = 0.04


Now we create a function of the two parameters in our model by applying the cost function to the data. We need the (partial) derivatives of both the slope and the offset.

> g m c = cost m c xs ys


Now we can use our Dual numbers to calculate the required partial derivatives and update our estimates of the parameters. We create a stream of estimates.

> zs = (0.1, 0.1) : map f zs
>   where
>
>     deriv (Dual _ x') = x'
>
>     f (c, m) = (c - gamma * cDeriv, m - gamma * mDeriv)
>       where
>         cDeriv = deriv $g (constD m)$ idD c
>         mDeriv = deriv $flip g (constD c)$ idD m


And we can calculate the cost of each estimate to check our algorithm converges and then take the estimated parameters when the change in cost per iteration has reached an acceptable level.

ghci> take 2 $drop 1000$ map (\(c, m) -> cost m c xs ys) zs
[1.9088215184565296e-9,1.876891490619424e-9]

ghci> take 2 $ drop 1000 zs
[(0.9998665320141327,2.0000191714150106),(0.999867653022265,2.0000190103927853)]


# Concluding Thoughts

## Efficiency

Perhaps AD is underused because of efficiency? It seems that the Financial Services industry is aware that AD is more efficient than current practice, albeit the technique is only slowly permeating. Order of magnitude improvements have been reported. Perhaps AD is slowly permeating into Machine Learning as well but there seem to be no easy-to-find benchmarks.

## Automatic Differentiation Tools

If it were only possible to implement automatic differentiation in Haskell then its applicability would be somewhat limited. Fortunately this is not the case and it can be used in many languages. In general, there are three different approaches:

• Operator overloading: available for Haskell and C++. See the Haskell ad package and the C++ FADBAD approach using templates.

• Source to source translators: available for Fortran, C and other languages e.g. ADIFOR and TAPENADE; see the Wikipedia entry for a more comprehensive list.

• New languages with built-in AD primitives. I have not listed any as it seems unlikely that anyone practicing Machine Learning would want to transfer their existing code to a research language. Maybe AD researchers could invest time in understanding what language feature improvements are needed to support AD natively in existing languages.

# The Precession of the Perihelion of Mercury via Legendre Polynomials

## Introduction

The planet Mercury has a highly elliptical orbit with a perihelion of about 0.31 AU and an aphelion of about 0.47 AU. This ellipse is not stationary but itself rotates about the Sun, a phenomenon known as the precession of the perihelion. A calculation carried out using Newtonian mechanics gives a value at variance with observation. The deficit is explained using General Relativity although we do not apply the relativistic correction in this article.
Just to give a flavour of the Haskell, we will have to calculate values of the infinite series of Legendre Polynomials evaluated at 0. We have

\displaystyle \begin{aligned} P_{2n}(0) &= \frac{(-1)^n(2n)!}{2^{2n}(n!)^2} \\ P_{2n+1}(0) &= 0 \end{aligned}

Since we are dealing with infinite series we will want to define this co-recursively. We could use the Stream package but let us stay with lists.

> {-# OPTIONS_GHC -Wall #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
>
> {-# LANGUAGE NoMonomorphismRestriction #-}
>
> module Legendre (
>     legendre0s
>   , main) where
>
> import Data.List
> import Text.Printf
> import Initial
>
> legendre0s :: [Rational]
> legendre0s = interleave legendre0Evens legendre0Odds
>   where
>     legendre0Evens = 1 : zipWith f [1..] legendre0Evens
>       where f n p = negate p * (2 * n * (2 * n - 1)) / (2^2 * n^2)
>     legendre0Odds = 0 : legendre0Odds
>
> interleave :: [a] -> [a] -> [a]
> interleave = curry $ unfoldr g
>   where
>     g ([], _)    = Nothing
>     g (x:xs, ys) = Just (x, (ys, xs))


And now we can calculate any number of terms we need

ghci> take 10 $ legendre0s
[1 % 1,0 % 1,(-1) % 2,0 % 1,3 % 8,0 % 1,(-5) % 16,0 % 1,35 % 128,0 % 1]
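The same stream can be written in Python with a generator and exact rationals. This is a transliteration made for this post, not the post's own code; it uses the recurrence $P_{2n}(0) = -P_{2(n-1)}(0)\, (2n)(2n-1) / (2^2 n^2)$ exactly as the Haskell does.

```python
from fractions import Fraction
from itertools import islice

# Interleave the even terms P_{2n}(0), computed co-recursively, with the
# zero odd terms P_{2n+1}(0).
def legendre0s():
    p = Fraction(1)
    n = 0
    while True:
        yield p
        yield Fraction(0)
        n += 1
        p = -p * Fraction(2 * n * (2 * n - 1), 4 * n * n)

first10 = list(islice(legendre0s(), 10))
```

The first ten terms reproduce the ghci output above: $1, 0, -\frac{1}{2}, 0, \frac{3}{8}, 0, -\frac{5}{16}, 0, \frac{35}{128}, 0$.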


The reader wishing to skip the physics and the mathematical derivation can go straight to the section on implementation.

This article calculates the precession in Haskell using Newtonian methods. Over a long enough period, the gravitational effect of each outer planet on Mercury can be considered to be the same as a ring with the same mass as the planet; in other words we assume that the mass of each planet has been smeared out over its orbit. Probably one can model Saturn’s rings using this technique but that is certainly the subject of a different blog post.

More specifically, we model the mass of the ring as being totally concentrated on one particular value of $\phi = \pi / 2$ and one particular value of $r = a$ with total mass $M$.

\displaystyle \begin{aligned} M &= \int_0^{2\pi} \int_0^\pi \int_0^\infty K\, \delta(\phi - \pi / 2)\, \delta(r - a)\, r^2\sin\phi\, \mathrm{d} r\, \mathrm{d} \phi\, \mathrm{d} \theta \\ &= 2\pi \int_0^\pi \int_0^\infty K\, \delta(\phi - \pi / 2)\, \delta(r - a)\, r^2\sin\phi\, \mathrm{d} \phi\, \mathrm{d} r \\ &= 2\pi \int_0^\infty K\, \delta(r - a)\, r^2\, \mathrm{d} r \\ &= 2\pi K a^2 \end{aligned}

where $\delta$ is the Dirac delta function. Thus the density of our ring is

$\displaystyle \rho(r, \phi) = M \frac{\delta(\phi - \pi / 2) \delta(r - a)}{2\pi a^2}$

## Acknowledgement

This blog follows the exposition given in [@Fitz:Newtonian:Dynamics] and [@brown:SpaceTime] concretized for the precession of the perihelion of Mercury with some of the elisions expanded. More details on Legendre Polynomials can be found in [@Bowles:Legendre:Polynomials].

## Axially Symmetric Mass Distributions

We consider axially symmetric mass distributions in spherical polar co-ordinates $(r, \phi, \theta)$ where $r$ runs from $0$ to $\infty$, $\phi$ (the polar angle) runs from $0$ to $\pi$ and $\theta$ (the azimuthal angle) runs from $0$ to $2\pi$.

For clarity we give their conversion to cartesian co-ordinates.

\displaystyle \begin{aligned} x &= r\sin\phi\cos\theta \\ y &= r\sin\phi\sin\theta \\ z &= r\cos\phi \end{aligned}

The volume element in spherical polar co-ordinates is given by $r^2\sin\phi\,\mathrm{d} r\,\mathrm{d} \phi\,\mathrm{d} \theta$.

The gravitational potential given by $N$ masses each of mass $m_i$ and at position $\boldsymbol{r}_i$ is:

$\displaystyle \Phi(\boldsymbol{r}) = -G\sum_{i=1}^N\frac{m_i}{\|\boldsymbol{r}_i - \boldsymbol{r}\|}$

If instead of point masses, we have a mass distribution $\rho(\boldsymbol{r})$ then

$\displaystyle \Phi(\boldsymbol{r}) = -G\int_{\mathbb{R}^3}\frac{\rho(\boldsymbol{r}')}{\|\boldsymbol{r}' - \boldsymbol{r}\|}\, \mathrm{d} V$

where $\mathrm{d} V$ is the volume element.

If the mass distribution is axially symmetric then so will the potential. In spherical polar co-ordinates:

\displaystyle \begin{aligned} \Phi(r, \phi) &= -G\int_0^{2\pi} \int_0^\pi \int_0^\infty \frac{\rho(r', \phi')}{\|\boldsymbol{r}' - \boldsymbol{r}\|}\, r'^2\sin\phi'\, \mathrm{d} r'\, \mathrm{d} \phi'\, \mathrm{d} \theta' \\ &= -2\pi G\int_0^\pi \int_0^\infty \rho(r', \phi') \langle\|\boldsymbol{r}' - \boldsymbol{r}\|^{-1}\rangle\, r'^2\sin\phi'\, \mathrm{d} r'\, \mathrm{d} \phi' \\ \end{aligned}

where $\langle\ldots\rangle$ denotes the average over the azimuthal angle.

$\displaystyle \|\boldsymbol{r} - \boldsymbol{r}'\|^{-1} = (r^2 - 2\boldsymbol{r}\cdot\boldsymbol{r}' + r'^2)^{-1/2}$

Expanding the middle term on the right hand side and noting that $\theta = 0$:

\displaystyle \begin{aligned} \boldsymbol{r}\cdot\boldsymbol{r}' &= r\sin\phi\cos\theta r'\sin\phi'\cos\theta' + r\sin\phi\sin\theta r'\sin\phi'\sin\theta' + r\cos\phi r'\cos\phi' \\ &= r\sin\phi r'\sin\phi'\cos\theta' + r\cos\phi r'\cos\phi' \\ &= rr'(\sin\phi\sin\phi'\cos\theta' + \cos\phi\cos\phi') \end{aligned}

Writing $F = \sin\phi\sin\phi'\cos\theta' + \cos\phi\cos\phi'$ and noting that

$\displaystyle \frac{1}{\sqrt{1 - 2xt + t^2}} = \sum_{n=0}^\infty t^n P_n(x)$

where $P_n$ are the Legendre Polynomials we see that when $r' < r$

$\displaystyle \|\boldsymbol{r} - \boldsymbol{r}'\|^{-1} = \frac{1}{r}\sum_{n=0}^\infty{\bigg(\frac{r'}{r}}\bigg)^n P_n(F)$
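As a sanity check, the generating function identity can be verified numerically. The sketch below computes the $P_n$ with Bonnet's recursion formula $(n+1)P_{n+1}(x) = (2n+1)\,x\,P_n(x) - n P_{n-1}(x)$, which is standard but not stated in this post, and compares the partial sum against the closed form.

```python
import math

# Legendre polynomials P_0(x) .. P_{n_max}(x) via Bonnet's recurrence.
def legendre_P(n_max, x):
    ps = [1.0, x]
    for n in range(1, n_max):
        ps.append(((2 * n + 1) * x * ps[n] - n * ps[n - 1]) / (n + 1))
    return ps[:n_max + 1]

# With |t| < 1 the series converges geometrically, so 60 terms are ample.
x, t = 0.3, 0.4
lhs = 1.0 / math.sqrt(1 - 2 * x * t + t * t)
rhs = sum(t ** n * p for n, p in enumerate(legendre_P(60, x)))
```

Here $t$ plays the role of $r'/r$ (or $r/r'$), which is why the expansion above is valid only on one side of $r' = r$ at a time.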

Applying the Spherical Harmonic Addition Theorem (or see [@arfken]) we obtain

$\displaystyle \langle\|\boldsymbol{r} - \boldsymbol{r}'\|^{-1}\rangle = \frac{1}{r}\sum_{n=0}^\infty{\bigg(\frac{r'}{r}}\bigg)^n P_n(\cos\phi) P_n(\cos\phi')$

Similarly when $r < r'$ we obtain

$\displaystyle \langle\|\boldsymbol{r} - \boldsymbol{r}'\|^{-1}\rangle = \frac{1}{r'}\sum_{n=0}^\infty{\bigg(\frac{r}{r'}}\bigg)^n P_n(\cos\phi) P_n(\cos\phi')$

Substituting into the equation for the potential for axially symmetric mass distributions gives us

\displaystyle \begin{aligned} \Phi(r, \phi) &= -2\pi G\int_0^\pi \int_0^\infty \rho(r', \phi') \langle\|\boldsymbol{r}' - \boldsymbol{r}\|^{-1}\rangle\, r'^2\sin\phi'\, \mathrm{d} r'\, \mathrm{d} \phi' \\ &= -2\pi G\int_0^\pi \int_0^r \rho(r', \phi')\frac{1}{r}\sum_{n=0}^\infty{\bigg(\frac{r'}{r}}\bigg)^n P_n(\cos\phi) P_n(\cos\phi')\, r'^2\sin\phi'\, \mathrm{d} r'\, \mathrm{d} \phi' \\ &\phantom{=} -2\pi G\int_0^\pi \int_r^\infty \rho(r', \phi')\frac{1}{r'}\sum_{n=0}^\infty{\bigg(\frac{r}{r'}}\bigg)^n P_n(\cos\phi) P_n(\cos\phi')\, r'^2\sin\phi'\, \mathrm{d} r'\, \mathrm{d} \phi' \\ &= \sum_{n=0}^\infty \Phi_n(r) P_n(\cos\phi) \end{aligned}

where

\displaystyle \begin{aligned} \Phi_n(r) &= -\frac{2\pi G}{r^{n+1}}\int_0^r\int_0^\pi r'^{n+2}\rho(r', \phi')P_n(\cos\phi')\sin\phi'\,\mathrm{d}r'\,\mathrm{d}\phi' \\ &\phantom{=} -2\pi G r^n\int_r^\infty\int_0^\pi r'^{1-n}\rho(r', \phi')P_n(\cos\phi')\sin\phi'\,\mathrm{d}r'\,\mathrm{d}\phi' \end{aligned}

Note that the first integral has limits $0$ to $r$ and the second has limits $r$ to $\infty$.

It is well known that the Legendre Polynomials form an orthogonal and complete set for continuous functions. Indeed

$\displaystyle \int_{-1}^1 P_n(x)P_m(x)\,\mathrm{d}x = \frac{2\delta_{nm}}{2n + 1}$

Thus we can write

$\displaystyle \rho(r, \phi) = \sum_{n=0}^\infty \rho_n(r)P_n(\cos\phi)$

Using the orthogonality condition we have

$\displaystyle \rho_n(r) = (n + 1/2)\int_0^\pi \rho(r, \phi) P_n(\cos\phi) \sin\phi\,\mathrm{d}\phi$

Hence

\displaystyle \begin{aligned} \Phi_n(r) &= -\frac{2\pi G}{(n + 1/2)r^{n+1}}\int_0^r r'^{n+2}\rho_n(r')\,\mathrm{d}r' \\ &\phantom{=} -\frac{2\pi G r^n}{n + 1/2}\int_r^\infty r'^{1-n}\rho_n(r')\,\mathrm{d}r' \end{aligned}

## Gravitational Potential of a Ring

We now substitute in the axially symmetric density of a ring

\displaystyle \begin{aligned} \rho_n(r) &= (n + 1/2)\int_0^\pi \rho(r, \phi) P_n(\cos\phi) \sin\phi\,\mathrm{d}\phi \\ &=(n + 1/2)\int_0^\pi M \frac{\delta(\phi - \pi / 2) \delta(r - a)}{2\pi a^2} P_n(\cos\phi) \sin\phi\,\mathrm{d}\phi \\ &= (n + 1/2) M \frac{\delta(r - a)}{2\pi a^2} P_n(0) \end{aligned}

Substituting again

\displaystyle \begin{aligned} \Phi_n(r) &= -\frac{2\pi G}{(n + 1/2)r^{n+1}}\int_0^r r'^{n+2}\rho_n(r')\,\mathrm{d}r' \\ &\phantom{=} -\frac{2\pi G r^n}{n + 1/2}\int_r^\infty r'^{1-n}\rho_n(r')\,\mathrm{d}r' \\ &= -\frac{2\pi G}{(n + 1/2)r^{n+1}}\int_0^r r'^{n+2} (n + 1/2) M \frac{\delta(r' - a)}{2\pi a^2} P_n(0) \,\mathrm{d}r' \\ &\phantom{=} -\frac{2\pi G r^n}{n + 1/2}\int_r^\infty r'^{1-n} (n + 1/2) M \frac{\delta(r' - a)}{2\pi a^2} P_n(0) \,\mathrm{d}r' \end{aligned}

Thus for $a < r$

\displaystyle \begin{aligned} \Phi_n(r) &= -\frac{2\pi G}{r^{n+1}} a^{n+2} M \frac{1}{2\pi a^2} P_n(0) \\ &= -\frac{G M P_n(0)}{a}\bigg(\frac{a}{r}\bigg)^{n+1} \end{aligned}

And for $r < a$

\displaystyle \begin{aligned} \Phi_n(r) &= -2\pi G r^n a^{1-n} M \frac{1}{2\pi a^2} P_n(0) \\ &= -\frac{G M P_n(0)}{a} \bigg(\frac{r}{a}\bigg)^n \end{aligned}

Thus at $\phi = \pi / 2$ and $r < a$ we have

$\displaystyle \Phi(r) \equiv \Phi(r, \pi / 2) = -\frac{G M}{a} \sum_{n=0}^\infty P_n^2(0) \bigg(\frac{r}{a}\bigg)^n$

and for $r > a$

$\displaystyle \Phi(r) = -\frac{G M}{a} \sum_{n=0}^\infty P_n^2(0) \bigg(\frac{a}{r}\bigg)^{n+1}$

Let $M$ be the mass of the Sun; then the potential due to the Sun and all the planets at a distance $r$ (excluding the planet positioned at $r$) is

$\displaystyle \Phi(r) = -\frac{GM}{r} - \sum_{n=0}^\infty P_n^2(0) \Bigg[\sum_{a_i < r}\frac{G m_i}{a_i}\bigg(\frac{a_i}{r}\bigg)^{n+1} + \sum_{a_i > r}\frac{G m_i}{a_i}\bigg(\frac{r}{a_i}\bigg)^n\Bigg]$

## Apsidal Angles

An apsis is a point of closest or furthest approach of a planet, i.e. the perihelion or the aphelion. Without the perturbing influence of the outer planets the angle between these points, the apsidal angle, would be $\pi$. In the presence of the outer planets this is no longer the case.

Writing down the Lagrangian for a single planet we have

$\displaystyle \mathbb{L} = \frac{1}{2}m(\dot{r}^2 + r^2\dot{\theta}^2) - m\Phi(\boldsymbol{r})$

where $\Phi$ is the total potential due to the Sun and the other planets (as calculated above). $\theta$ is ignorable so we have a conserved quantity $mr^2\dot{\theta}$. We write $h$ for $r^2\dot{\theta}$ which is also conserved.

Applying Lagrange’s equation for $r$ we have

$\displaystyle \frac{\mathrm{d}}{\mathrm{d} t}\bigg(\frac{\partial \mathbb{L}}{\partial \dot{r}}\bigg) - \frac{\partial \mathbb{L}}{\partial r} = m\ddot{r} - mr\dot{\theta}^2 + m\frac{\partial \Phi}{\partial r} = 0$

Thus the radial equation of motion is

$\displaystyle \ddot{r} - \frac{h^2}{r^3} = -\frac{GM}{r^2} - \sum_{n=0}^\infty P_n^2(0) \Bigg[ \sum_{a_i < r}\frac{G m_i}{a_i^2}(n+1)\bigg(\frac{a_i}{r}\bigg)^{n+2} - \sum_{a_i > r}\frac{G m_i}{a_i^2}n\bigg(\frac{r}{a_i}\bigg)^{n-1}\Bigg]$

To make further progress let us take just one term for a ring outside the planet of consideration and use the trick given in [@brown:SpaceTime]. Writing $r_\mathrm{a}$ for the aphelion, $r_\mathrm{p}$ for the perihelion and $r_\mathrm{m}$ for the major radius we have

\displaystyle \begin{aligned} \frac{A}{r_{\mathrm{a}}^2} + \frac{B}{r_{\mathrm{a}}^3} &= \frac{M}{r_{\mathrm{a}}^2} - \sum_{n=0}^\infty P_n^2(0) \Bigg[\frac{G m}{a^2}n\bigg(\frac{r_\mathrm{a}}{a}\bigg)^{n-1}\Bigg] \\ \frac{A}{r_{\mathrm{p}}^2} + \frac{B}{r_{\mathrm{p}}^3} &= \frac{M}{r_{\mathrm{p}}^2} - \sum_{n=0}^\infty P_n^2(0) \Bigg[\frac{G m}{a^2}n\bigg(\frac{r_\mathrm{p}}{a}\bigg)^{n-1}\Bigg] \end{aligned}

Defining $g$ by writing

$\displaystyle \frac{A}{r^2} + \frac{B}{r^3} = \frac{g(r)}{r^3}$

we have

\displaystyle \begin{aligned} Ar_\mathrm{p} + B &= g(r_\mathrm{p}) \\ Ar_\mathrm{a} + B &= g(r_\mathrm{a}) \end{aligned}

giving

$\displaystyle A = \frac{g(r_\mathrm{p}) - g(r_\mathrm{a})}{r_\mathrm{p} - r_\mathrm{a}}$

Using the Taylor approximation

\displaystyle \begin{aligned} g(r_{\mathrm{p}}) &\approx g(r_\mathrm{m}) + \frac{r_{\mathrm{p}} - r_{\mathrm{a}}}{2} g'(r_\mathrm{m}) \\ g(r_{\mathrm{a}}) &\approx g(r_\mathrm{m}) - \frac{r_{\mathrm{p}} - r_{\mathrm{a}}}{2} g'(r_\mathrm{m}) \end{aligned}

Thus

$\displaystyle A \approx g'(r_\mathrm{m})$

Then since

$\displaystyle g(r) = Mr - \sum_{n=0}^\infty P_n^2(0) \Bigg[G m a n\bigg(\frac{r}{a}\bigg)^{n+2}\Bigg]$

We have

$\displaystyle A = g'(r_\mathrm{m}) = M - \sum_{n=0}^\infty P_n^2(0) \Bigg[G m n(n+2)\bigg(\frac{r_\mathrm{m}}{a}\bigg)^{n+1}\Bigg]$

It is a nuisance to be continually writing $r_\mathrm{m}$. From now on this is denoted by $r$. Using

$\displaystyle B = g(r) - rA$

We obtain

\displaystyle \begin{aligned} B &= Mr - \sum_{n=0}^\infty P_n^2(0) \Bigg[G m a n\bigg(\frac{r}{a}\bigg)^{n+2}\Bigg] \\ &\phantom{=} -r\Bigg(M - \sum_{n=0}^\infty P_n^2(0) \Bigg[G m n(n+2)\bigg(\frac{r}{a}\bigg)^{n+1}\Bigg]\Bigg) \\ &= \sum_{n=0}^\infty P_n^2(0) \Bigg[G m a n(n+1)\bigg(\frac{r}{a}\bigg)^{n+2}\Bigg] \\ \end{aligned}

We can therefore re-write the radial equation of motion approximately as

\displaystyle \begin{aligned} \ddot{r} - \frac{h^2}{r^3} &= -\frac{A}{r^2} - \frac{B}{r^3} \\ \ddot{r} - \frac{h^2 - B}{r^3} &= -\frac{A}{r^2} \end{aligned}

Now let us re-write the equation of motion as a relation between $r$ and $\theta$.

$\displaystyle \dot{r} = \frac{\mathrm{d} \theta}{\mathrm{d} t}\frac{\mathrm{d} t}{\mathrm{d} \theta}\frac{\mathrm{d} r}{\mathrm{d} t} = \frac{h}{r^2}\frac{\mathrm{d} r}{\mathrm{d} \theta}$
\displaystyle \begin{aligned} \ddot{r} &= h\frac{\mathrm{d}}{\mathrm{d} t}\bigg(r^{-2}\frac{\mathrm{d} r}{\mathrm{d} \theta}\bigg) \\ &= h\frac{\mathrm{d} \theta}{\mathrm{d} t}\frac{\mathrm{d}}{\mathrm{d} \theta}\bigg(r^{-2}\frac{\mathrm{d} r}{\mathrm{d} \theta}\bigg) \\ &= \frac{h^2}{r^2}\frac{\mathrm{d}}{\mathrm{d} \theta}\bigg(r^{-2}\frac{\mathrm{d} r}{\mathrm{d} \theta}\bigg) \end{aligned}

Thus we have

$\displaystyle \frac{\mathrm{d}}{\mathrm{d} \theta}\bigg(r^{-2}\frac{\mathrm{d} r}{\mathrm{d} \theta}\bigg) - \frac{(1 - B / h^2)}{r} = -\frac{A}{h^2}$

Letting $u = 1 /r$ we can re-write this as

$\displaystyle \frac{1}{(1 - B / h^2)} \frac{\mathrm{d}^2 u}{\mathrm{d} \theta^2} + u = \frac{A}{h^2(1 - B / h^2)}$

This is the equation for simple harmonic motion with $\omega^2 = 1 - B / h^2$ and since for a circular orbit $h^2 = GMr$ we can write

\displaystyle \begin{aligned} \omega &= \sqrt{1 - B / h^2} \approx 1 - \frac{1}{2}\frac{B}{h^2} \\ &= 1 - \frac{1}{2}\frac{m}{M}\sum_{n=0}^\infty P_n^2(0) n(n+1)\bigg(\frac{r}{a}\bigg)^{n+1} \\ \end{aligned}

and therefore the change in radians per revolution is

$\displaystyle \Delta \theta = |2\pi (\omega - 1)| = \pi\frac{m}{M}\sum_{n=0}^\infty P_n^2(0) n(n+1)\bigg(\frac{r}{a}\bigg)^{n+1}$

To convert this to arc-seconds per century we apply a conversion factor

$\displaystyle 414.9 \frac{360}{2\pi} 3600$

where 414.9 is the number of orbits of Mercury per century.

## Implementation

The implementation is almost trivial given that we have previously calculated the Legendre Polynomials (evaluated at 0). First let us make the code a bit easier to read by defining arithmetic pointwise (note that for polynomials we would not want to do this).

> instance Num a => Num [a] where
>   (*) = zipWith (*)
>   (+) = zipWith (+)
>   abs         = error "abs makes no sense for infinite series"
>   signum      = error "signum makes no sense for infinite series"
>   fromInteger = error "fromInteger makes no sense for infinite series"


Next we define our conversion function so that we can compare our results against those obtained by Le Verrier.

> conv :: Floating a => a -> a
> conv x = x * 414.9 * (360 / (2 * pi)) * 3600


The main calculation for which we can take any number of terms.

> perturbations :: Double -> Double -> Double -> Double -> [Double]
> perturbations mRing mSun planetR ringR =
>   map ((pi * (mRing / mSun)) *) xs
>     where
>       xs = (map (^2) $ map fromRational legendre0s) *
>            (map fromIntegral [0..]) *
>            (map fromIntegral [1..]) *
>            (map ((planetR / ringR)^) [1..])


Arbitrarily, let us take 20 terms.

> predict :: Double -> Double -> Double -> Double
> predict x y z = sum $
>                 map conv $> take 20$
>                 perturbations x sunMass y z


And now let us compare our calculations with Le Verrier’s.

> main :: IO ()
> main = do
>   -- In each row the first figure is Le Verrier's value and the second is
>   -- our prediction. The masses and orbital radii (venusMass, venusR,
>   -- mercuryR and so on) are assumed to come from the Initial module,
>   -- which is not shown here.
>   printf "Venus   %3.1f %3.1f\n"
>          (280.6 :: Double)
>          (predict venusMass mercuryR venusR)
>   printf "Earth    %3.1f  %3.1f\n"
>          (83.6 :: Double)
>          (predict earthMass mercuryR earthR)
>   printf "Mars      %3.1f   %3.1f\n"
>          (2.6 :: Double)
>          (predict marsMass mercuryR marsR)
>   printf "Jupiter %3.1f %3.1f\n"
>          (152.6 :: Double)
>          (predict jupiterMass mercuryR jupiterR)

ghci> main
Venus   280.6 286.0
Earth    83.6  95.3
Mars      2.6   2.4
Jupiter 152.6 160.1
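The Venus figure can be reproduced independently of the Haskell in a few lines of Python. The planetary data below are standard values assumed here, since the post's Initial module is not shown: Venus mass $4.8675 \times 10^{24}\,\mathrm{kg}$, solar mass $1.9889 \times 10^{30}\,\mathrm{kg}$, and semi-major axes of 0.38710 AU and 0.72333 AU for Mercury and Venus respectively.

```python
import math

# Reproduce the Venus row of the table above, using the closed form
# P_{2n}(0) = (-1)^n (2n)! / (2^{2n} (n!)^2). The constants are assumed
# standard values, not taken from the post's Initial module.
sun_mass = 1.9889e30      # kg
venus_mass = 4.8675e24    # kg
mercury_r = 0.38710       # AU, semi-major axis of Mercury's orbit
venus_r = 0.72333         # AU, radius of the perturbing ring

def legendre0_sq(n):
    if n % 2 == 1:
        return 0.0
    k = n // 2
    p = math.factorial(2 * k) / (4 ** k * math.factorial(k) ** 2)
    return p * p

# Delta theta = pi (m/M) sum P_n(0)^2 n(n+1) (r/a)^(n+1), radians per orbit.
ratio = mercury_r / venus_r
delta_theta = math.pi * (venus_mass / sun_mass) * sum(
    legendre0_sq(n) * n * (n + 1) * ratio ** (n + 1) for n in range(40))

# Convert radians per revolution into arc-seconds per century.
arcsec_per_century = delta_theta * 414.9 * (360 / (2 * math.pi)) * 3600
```

The result lands on the 286.0 arc-seconds per century shown in the table.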


## Appendix

Note the lectures by Fitzpatrick [@Fitz:Newtonian:Dynamics] use a different approximation for the apsidal angle

$\displaystyle \psi = \pi\bigg(3 + \frac{r \mathrm{d} F / \mathrm{d} r}{F}\bigg)^{-1/2}$

We do not derive this here but note that the expansion and approximation are not entirely straightforward and are given here for completeness. Note that the result derived this way is identical to the result obtained in the main body of the article.

The radial force is given by $F(r) = -\mathrm{d}\Phi(r) / \mathrm{d} r$

$\displaystyle F(r) = -\frac{GM}{r^2} - \sum_{n=0}^\infty P_n^2(0) \Bigg[ \sum_{a_i < r}\frac{G m_i}{a_i^2}(n+1)\bigg(\frac{a_i}{r}\bigg)^{n+2} - \sum_{a_i > r}\frac{G m_i}{a_i^2}n\bigg(\frac{r}{a_i}\bigg)^{n-1}\Bigg]$

We also have

$\displaystyle r\frac{\mathrm{d} F}{\mathrm{d} r} = 2\frac{GM}{r^2} + \sum_{n=0}^\infty P_n^2(0) \Bigg[ \sum_{a_i < r}\frac{G m_i}{a_i^2}(n+1)(n+2)\bigg(\frac{a_i}{r}\bigg)^{n+2} + \sum_{a_i > r}\frac{G m_i}{a_i^2}n(n-1)\bigg(\frac{r}{a_i}\bigg)^{n-1}\Bigg]$

Thus

$\displaystyle 2F(r) + r\frac{\mathrm{d} F}{\mathrm{d} r} = \sum_{n=0}^\infty P_n^2(0) \Bigg[ \sum_{a_i < r}\frac{G m_i}{a_i^2}n(n+1)\bigg(\frac{a_i}{r}\bigg)^{n+2} + \sum_{a_i > r}\frac{G m_i}{a_i^2}n(n+1)\bigg(\frac{r}{a_i}\bigg)^{n-1}\Bigg]$

Re-arranging

$\displaystyle \bigg(3 + \frac{r \mathrm{d} F / \mathrm{d} r}{F}\bigg)^{-1/2} = \bigg(1 + 2 + \frac{r \mathrm{d} F / \mathrm{d} r}{F}\bigg)^{-1/2}$

we note that the last two terms can be re-written with a numerator of

$\displaystyle \sum_{n=0}^\infty P_n^2(0) \Bigg[ \sum_{a_i < r}\frac{G m_i}{a_i^2}n(n+1)\bigg(\frac{a_i}{r}\bigg)^{n+2} + \sum_{a_i > r}\frac{G m_i}{a_i^2}n(n+1)\bigg(\frac{r}{a_i}\bigg)^{n-1}\Bigg]$

and a denominator which is dominated by the $-GM / r^2$ term. Thus

\displaystyle \begin{aligned} 2 + \frac{r \mathrm{d} F / \mathrm{d} r}{F} &\approx -\sum_{n=0}^\infty P_n^2(0) \Bigg[ \sum_{a_i < r}\frac{m_i r^2}{M a_i^2}n(n+1)\bigg(\frac{a_i}{r}\bigg)^{n+2} + \sum_{a_i > r}\frac{m_i r^2}{M a_i^2}n(n+1)\bigg(\frac{r}{a_i}\bigg)^{n-1}\Bigg] \\ &= -\sum_{n=0}^\infty P_n^2(0) n(n+1)\Bigg[ \sum_{a_i < r}\frac{m_i}{M}\bigg(\frac{a_i}{r}\bigg)^n + \sum_{a_i > r}\frac{m_i}{M}\bigg(\frac{r}{a_i}\bigg)^{n+1}\Bigg] \end{aligned}

Since this term is $\ll 1$ we can expand the term of interest further

\displaystyle \begin{aligned} \bigg(1 + 2 + \frac{r \mathrm{d} F / \mathrm{d} r}{F}\bigg)^{-1/2} &\approx \Bigg(1 -\sum_{n=0}^\infty P_n^2(0) n(n+1)\Bigg[ \sum_{a_i < r}\frac{m_i}{M}\bigg(\frac{a_i}{r}\bigg)^n + \sum_{a_i > r}\frac{m_i}{M}\bigg(\frac{r}{a_i}\bigg)^{n+1}\Bigg] \Bigg)^{-1/2} \\ &= 1 + \frac{1}{2} \sum_{n=0}^\infty P_n^2(0) n(n+1)\Bigg[ \sum_{a_i < r}\frac{m_i}{M}\bigg(\frac{a_i}{r}\bigg)^n + \sum_{a_i > r}\frac{m_i}{M}\bigg(\frac{r}{a_i}\bigg)^{n+1}\Bigg] \end{aligned}

## Bibliography

Arfken, George. 1985. Mathematical Methods for Physicists. Third edition. Orlando: Academic Press.

Bowles, Robert. “Properties of Legendre Polynomials.” http://www.ucl.ac.uk/~ucahdrb/MATHM242/OutlineCD2.pdf.

Brown, Kevin. 2013. Physics in space and time. lulu.com.

Fitzpatrick, Richard. 1996. “Newtonian Dynamics.” http://farside.ph.utexas.edu/teaching/336k/lectures.

## Planetary Simulation with Excursions in Symplectic Manifolds

When I started to do this, it seemed straightforward enough: pick a problem which admitted a numerical solution, find an algorithm and code it up. I chose the problem of orbital dynamics as I had always been fascinated by the precession of the perihelion of Mercury (which is mainly caused by the pull of the other planets in the Solar System) and because this admits of at least two different methods of numerical solution both of which I hope will show the power of Haskell in this area. This led to the selection of a suitable algorithm and I read that one should prefer a symplectic method such as the Leapfrog which conserves the energy of a system (a highly desirable requirement when modelling orbital dynamics). My conscience would not let me write about such a method without being able to explain it. This led into the Hamiltonian formulation of classical mechanics, symplectic manifolds and symplectic (numerical) methods.

The reader interested in the Haskell implementations and performance comparisons (currently not with other programming languages) can read the introduction and skip to the section on performance. I apologise in advance to experts in classical mechanics, symplectic geometry and numerical analysis and can only hope I have not traduced their subjects too much.

Note that we do not make it as far as the perihelion of Mercury in this article but we do simulate the planets in the outer solar system.

## Introduction

Forget about Newton and suppose you are told that the way to do mechanics is to write down the total energy of the system in which you are interested and then apply Hamilton’s equations.

Consider a mass of $m$ attached to a light rod of length $l$ which is attached to a point from which it can swing freely in a plane. Then the kinetic energy is:

$\displaystyle \frac{1}{2}mv^2 = \frac{1}{2}ml^2\dot{\theta}^2$

and the potential energy (taking this to be 0 at $\theta = 0$) is:

$\displaystyle mgl(1 - \cos\theta)$

Thus the Hamiltonian is:

$\displaystyle \mathbb{H} = \frac{1}{2}ml^2\dot{\theta}^2 + mgl(1 - \cos\theta)$

Using the Lagrangian ${\mathbb{L}} = T - V$ where $T$ and $V$ are the kinetic and potential energies respectively, let us set the generalized momentum

$\displaystyle p = \frac{\partial\mathbb{L}}{\partial\dot{\theta}} = ml^2\dot{\theta}$

Then we can re-write the Hamiltonian as:

$\displaystyle \mathbb{H} = \frac{p^2}{2ml^2} + mgl(1 - \cos\theta)$

Applying Hamilton’s equations we obtain

\displaystyle \begin{aligned} \dot{\theta} &= \frac{\partial\mathbb{H}}{\partial p} = \frac{p}{ml^2} \\ \dot{p} &= -\frac{\partial\mathbb{H}}{\partial \theta} = -mgl\sin\theta \end{aligned}

Differentiating the first equation with respect to time we then obtain the familiar equation describing the motion of a simple pendulum.

$\displaystyle \ddot{\theta} = \frac{\dot{p}}{ml^2} = \frac{-mgl\sin\theta}{ml^2} = -\frac{g}{l}\sin\theta$

Now we would like to calculate the pendulum’s position and velocity at a given time. The obvious starting place is to use the explicit Euler method.

\displaystyle \begin{aligned} \theta_{n+1} &= \theta_n + h\frac{p_n}{ml^2} \\ p_{n+1} &= p_n - hmgl\sin\theta_n \end{aligned}

First we need some pragmas, exports (required to create the diagrams) and imports.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}

> {-# LANGUAGE NoMonomorphismRestriction    #-}
> {-# LANGUAGE FlexibleContexts             #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE GeneralizedNewtypeDeriving   #-}
> {-# LANGUAGE TypeOperators                #-}

> module Symplectic (
>     pendulumEE
>   , pendulumSE
>   , jupiterEarth
>   , outerPlanets
>   , main
>   ) where

> import           Data.Array.Repa hiding ((++), zipWith)
> import qualified Data.Array.Repa as Repa

> import           Control.Monad
> import           Data.Functor.Identity (runIdentity)
> import           System.Environment
> import           System.Console.GetOpt

> import           Foreign.Storable

> import qualified Data.Yarr as Y
> import           Data.Yarr (loadS, dzip2, dzip3, F, L)
> import           Data.Yarr.Repr.Delayed (UArray)
> import qualified Data.Yarr.Shape as S
> import qualified Data.Yarr.Utils.FixedVector as V
> import           Data.Yarr.Utils.FixedVector (VecList, N3)
> import qualified Data.Yarr.IO.List as YIO
> import qualified Data.Yarr.Walk as W
>
> import qualified Initial as I


Some type synonyms, although it is doubtful how useful these are since the generalized co-ordinates and momenta that one uses with Hamiltonian methods can have different units depending on how the physical problem is formulated.

> type Distance = Double
> type Mass     = Double
> type Speed    = Double


The functions to update the position and momentum.

> stepMomentumEE :: Double -> Double -> Double -> Double -> Double
> stepMomentumEE m l p q = p -  h * m * g * l * sin q

> stepPositionEE :: Double -> Double -> Double -> Double -> Double
> stepPositionEE m l p q = q + h * p / (m * l^2)


The explicit Euler method itself. Notice that both update functions use the previous position and momentum.

> stepOnceEE :: Double -> Double -> Double -> Double -> (Double, Double)
> stepOnceEE m l p q = (newP, newQ)
>   where
>     newP = stepMomentumEE m l p q
>     newQ = stepPositionEE m l p q


The physical data for our problem and also the step length for the numerical method.

> h, m, l, g :: Double
> h = 0.01 -- Seconds
> l = 1.0  -- Metres
> m = 1.0  -- Kilograms
> g = 9.81 -- Metres * Seconds^-2


Let’s start our pendulum at the bottom with an angular velocity that ensures we don’t go over the top.

> initTheta, initThetaDot, initP :: Double
> initTheta    = 0.0
> initThetaDot = 1.7 -- arbitrary; any value below 2 * sqrt (g / l) keeps the pendulum below the top
> initP        = m * l^2 * initThetaDot

> runEE :: Double -> Double -> [(Double, Double)]
> runEE initP initTheta = iterate (uncurry (stepOnceEE m l))
>                                 (initP, initTheta)

> pendulumEE :: [(Double, Double)]
> pendulumEE = runEE initP initTheta


The diagram below plots the position of the pendulum (the angle it makes with the vertical) against momentum, both axes normalised so that the maximum position and momentum are 1.0. We would expect the trajectory to form a closed path that is traversed indefinitely as the pendulum swings back and forth. Instead we see that the trajectory gradually spirals outwards, showing that energy is not conserved but steadily increases over time, an undesirable state of affairs.
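
The drift can also be confirmed numerically. Below is a minimal, self-contained sketch (independent of the literate code in this module; the starting momentum of 1.0 is an arbitrary choice) that evaluates the Hamiltonian along the explicit Euler trajectory:

```haskell
-- Explicit Euler for the pendulum, with H = p^2/(2 m l^2) + m g l (1 - cos q).
-- The constants mirror those used in the main text.
h', m', l', g' :: Double
h' = 0.01; m' = 1.0; l' = 1.0; g' = 9.81

stepEE :: (Double, Double) -> (Double, Double)
stepEE (p, q) = (p - h' * m' * g' * l' * sin q, q + h' * p / (m' * l' ^ 2))

energy :: (Double, Double) -> Double
energy (p, q) = p ^ 2 / (2 * m' * l' ^ 2) + m' * g' * l' * (1 - cos q)

main :: IO ()
main = do
  let traj = iterate stepEE (1.0, 0.0)  -- arbitrary start at the bottom
  -- The sampled energies grow as the method steps along the trajectory.
  mapM_ (print . energy . (traj !!)) [0, 1000, 10000]
```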

Instead let us apply the symplectic Euler method:

\displaystyle \begin{aligned} p_{n+1} &= p_n - hmgl\sin\theta_n \\ \theta_{n+1} &= \theta_n + \frac{hp_{n+1}}{ml^2} \end{aligned}

The functions to update the position and momentum.

> stepMomentum :: Double -> Double -> Double -> Double -> Double
> stepMomentum m l p q = p -  h * m * g * l * sin q

> stepPosition :: Double -> Double -> Double -> Double -> Double
> stepPosition m l p q = q + h * p / (m * l^2)


The symplectic Euler method itself. Notice that only the update function for momentum uses both the previous position and momentum; the update function for position uses the previous position but the new momentum.

> stepOnce :: Double -> Double -> Double -> Double -> (Double, Double)
> stepOnce m l p q = (newP, newQ)
>   where
>     newP = stepMomentum m l p q
>     newQ = stepPosition m l newP q

> runSE :: Double -> Double -> [(Double, Double)]
> runSE initP initTheta = iterate (uncurry (stepOnce m l))
>                                 (initP, initTheta)

> pendulumSE :: [(Double, Double)]
> pendulumSE = runSE initP initTheta


The diagram below plots the position of the pendulum (the angle it makes with the vertical) against momentum, both axes normalised so that the maximum position and momentum are 1.0. We would expect the trajectory to form a closed path that is traversed indefinitely as the pendulum swings back and forth. And indeed this is the case.

In this case the energy is conserved, so this looks like a good candidate for simulating orbital dynamics. But why does this work? It really looks very similar to the explicit Euler method.

## Theory

Warning: some handwaving as a full and formal exposition of the theory would take much more space than can be contained in this blog article.

We can think of the evolution of the pendulum as taking place on a 2-dimensional manifold $\mathbb{S}^1 \times \mathbb{R}$ where $\mathbb{S}^1$ is the 1-dimensional sphere (a circle) since the pendulum’s space co-ordinate can only take on values $0 \le q < 2\pi$.

We can define a (symplectic) 2-form on this manifold:

$\displaystyle \omega = dq \wedge dp$

Using this we can now produce a vector field from the Hamiltonian: $\mathbb{H} : \mathbb{S}^1 \times \mathbb{R} \longrightarrow \mathbb{R}$

In order to do this, and without proof, let us record the following fact.

Theorem

Let $(M, \omega)$ be a symplectic manifold. Then there exists a bundle isomorphism $\tilde{\omega} : TM \longrightarrow T^*M$ defined by $\tilde{\omega}(X_p)(Y_p) = \omega_p(X_p, Y_p)$.

$\blacksquare$

This is analogous to the isomorphism one can derive in a (semi) Riemannian manifold with the metric in some sense playing the role of the 2-form (see [@o1983semi] for example).

We assume the Hamiltonian to be a smooth function. We can form the 1-form $dH$ and we can define the Hamiltonian vector field $X_H = \tilde{\omega}^{-1}(dH)$.

We have

$\displaystyle d\mathbb{H} = \frac{\partial{\mathbb{H}}}{\partial q}dq + \frac{\partial{\mathbb{H}}}{\partial p}dp$

Thus the corresponding vector field is given by

$\displaystyle X_\mathbb{H} = \frac{\partial{\mathbb{H}}}{\partial p}\frac{\partial}{\partial q} - \frac{\partial{\mathbb{H}}}{\partial q}\frac{\partial}{\partial p}$

The flow of this vector field is the solution to

\displaystyle \begin{aligned} \dot{q} &= \frac{\partial \mathbb{H}}{\partial p} \\ \dot{p} &= -\frac{\partial \mathbb{H}}{\partial q} \\ \end{aligned}

In other words by using the symplectic 2-form and the Hamiltonian we have regained Hamilton’s equations.

Theorem

$\mathbb{H}$ is constant on flows of $X_\mathbb{H}$.

Proof

$\displaystyle X_{\mathbb{H}}{\mathbb{H}} = \omega(X_{\mathbb{H}}, X_{\mathbb{H}}) = 0$

since $\omega$ is alternating.

$\blacksquare$

When the Hamiltonian function represents the energy of the system being studied then this says that energy remains constant on flows. That is, as the system evolves according to the flow $\phi_t$ given by the vector field $X_{\mathbb{H}}$ then $\mathbb{H}(q_t, p_t) = \mathbb{H}(\phi_t(q_0, p_0)) = \mathbb{H}(q_0, p_0)$.

Thus it makes sense to look for numeric methods which maintain this invariant, that is methods which preserve the symplectic 2-form.

Definition

A diffeomorphism between two symplectic manifolds $f : (M, \mu) \longrightarrow (N, \nu)$ is a symplectomorphism if

$\displaystyle f^*\nu = \mu$

$\blacksquare$

In co-ordinates, we have

\displaystyle \begin{aligned} f^*(dq \wedge dp) &= (\frac{\partial f_q}{\partial q} dq + \frac{\partial f_q}{\partial p} dp) \wedge (\frac{\partial f_p}{\partial q} dq + \frac{\partial f_p}{\partial p} dp) \\ &= (\frac{\partial f_q}{\partial q}\frac{\partial f_p}{\partial p} - \frac{\partial f_p}{\partial q}\frac{\partial f_q}{\partial p})dq \wedge dp \\ &= dq \wedge dp \end{aligned}

Or in matrix form

$\displaystyle {\begin{bmatrix} \frac{\partial f_q}{\partial q} & \frac{\partial f_q}{\partial p} \\ \frac{\partial f_p}{\partial q} & \frac{\partial f_p}{\partial p} \end{bmatrix}}^\top \, \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \, \begin{bmatrix} \frac{\partial f_q}{\partial q} & \frac{\partial f_q}{\partial p} \\ \frac{\partial f_p}{\partial q} & \frac{\partial f_p}{\partial p} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$

We now check that the symplectic Euler method satisfies this.

### Symplectic Euler

\displaystyle \begin{aligned} p_{n+1} &= p_n - h\nabla_q H(p_{n+1}, q_n) \\ q_{n+1} &= q_n + h\nabla_p H(p_{n+1}, q_n) \end{aligned}

We check that this really is symplectic. First suppose we have two functions:

\displaystyle \begin{aligned} x &= u - f(x,v) \\ y &= v + g(x,v) \\ \end{aligned}

Then we can find partial derivatives:

\displaystyle \begin{aligned} dx &= du - \frac{\partial f}{\partial x}dx - \frac{\partial f}{\partial v}dv \\ dy &= dv + \frac{\partial g}{\partial x}dx + \frac{\partial g}{\partial v} dv \\ \end{aligned}
\displaystyle \begin{aligned} \frac{\partial x}{\partial u} &= 1 - \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} \\ \frac{\partial x}{\partial v} &= -\frac{\partial f}{\partial x}\frac{\partial x}{\partial v} -\frac{\partial f}{\partial v} \\ \frac{\partial y}{\partial u} &= \frac{\partial g}{\partial x}\frac{\partial x}{\partial u} \\ \frac{\partial y}{\partial v} &= 1 + \frac{\partial g}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial g}{\partial v} \end{aligned}

Re-arranging:

\displaystyle \begin{aligned} \frac{\partial x}{\partial u}(1 + \frac{\partial f}{\partial x}) &= 1 \\ \frac{\partial x}{\partial v}(1 + \frac{\partial f}{\partial x}) &= -\frac{\partial f}{\partial v} \\ \frac{\partial y}{\partial u} -\frac{\partial g}{\partial x}\frac{\partial x}{\partial u} &= 0 \\ \frac{\partial y}{\partial v} - \frac{\partial g}{\partial x}\frac{\partial x}{\partial v} &= 1 + \frac{\partial g}{\partial v} \end{aligned}

Pulling everything together in matrix form:

$\displaystyle \begin{bmatrix} 1 + \frac{\partial f}{\partial x} & 0 \\ -\frac{\partial g}{\partial x} & 1 \end{bmatrix} \, \begin{bmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{bmatrix} = \begin{bmatrix} 1 & -\frac{\partial f}{\partial v} \\ 0 & 1 + \frac{\partial g}{\partial v} \end{bmatrix}$

Now substitute in the functions for the Euler symplectic method and we obtain

$\displaystyle \begin{bmatrix} 1 + hH_{qp} & 0 \\ -hH_{pp} & 1 \end{bmatrix} \, \frac{\partial \big(p_{n+1},q_{n+1}\big)}{\partial \big(p_{n},q_{n}\big)} = \begin{bmatrix} 1 & -hH_{qq} \\ 0 & 1 + hH_{qp} \end{bmatrix}$

The reader can then check by a straightforward but tedious calculation that

$\displaystyle \frac{\partial \big(p_{n+1},q_{n+1}\big)}{\partial \big(p_{n},q_{n}\big)}^\top J \frac{\partial \big(p_{n+1},q_{n+1}\big)}{\partial \big(p_{n},q_{n}\big)} = J$

where

$\displaystyle J = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$

Thus the symplectic Euler method really is symplectic.
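
The tedious calculation can also be spot-checked numerically. For a $2 \times 2$ Jacobian $M$ we have $M^\top J M = (\det M)\, J$, so symplecticity reduces to $\det M = 1$. The sketch below (using the pendulum constants from earlier; the sample angle is arbitrary) evaluates the Jacobian of one symplectic Euler pendulum step:

```haskell
-- Pendulum constants as in the main text.
hh, mm, ll, gg :: Double
hh = 0.01; mm = 1.0; ll = 1.0; gg = 9.81

-- Jacobian of (p, q) -> (p', q') where
--   p' = p - h m g l sin q,  q' = q + h p' / (m l^2).
jacobian :: Double -> ((Double, Double), (Double, Double))
jacobian q = ((1, -c), (b, 1 - b * c))
  where b = hh / (mm * ll ^ 2)
        c = hh * mm * gg * ll * cos q

det2 :: ((Double, Double), (Double, Double)) -> Double
det2 ((a, b), (c, d)) = a * d - b * c

main :: IO ()
main = print (det2 (jacobian 0.3))  -- 1.0 up to rounding
```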

On the other hand for the explicit Euler for this particular example we have

$\displaystyle \frac{\partial \big(p_{n+1},q_{n+1}\big)}{\partial \big(p_{n},q_{n}\big)} = \begin{bmatrix} 1 & -hmgl\cos\theta_n \\ h / ml^2 & 1 \end{bmatrix}$

And a simple calculation shows that

$\displaystyle \frac{\partial \big(p_{n+1},q_{n+1}\big)}{\partial \big(p_{n},q_{n}\big)}^\top J \frac{\partial \big(p_{n+1},q_{n+1}\big)}{\partial \big(p_{n},q_{n}\big)} = \begin{bmatrix} 0 & 1 + h^2(g / l)\cos\theta_n \\ -1 - h^2(g / l)\cos\theta_n & 0 \end{bmatrix}$

Thus the explicit Euler method is not symplectic i.e., does not preserve areas. Thus the path traversed is not an integral curve of the Hamiltonian vector field. We can see this in the diagram: the path spirals outwards. More details and examples can be found in [@IAUS:152:407Y; @Cross:2005:SIOC].

## Planetary Motion

Normally we would express the gravitational constant in SI units but to be consistent with [@hairer2010geometric] we use units in which distances are expressed in astronomical units, masses are measured relative to the sun and time is measured in earth days.

$\displaystyle {\mathbb H} = \frac{1}{2}\sum_{i=0}^n \frac{p_i^\top p_i}{m_i} - \frac{G}{2}\sum_{i=0}^n\sum_{j \neq i} \frac{m_i m_j}{\|q_i - q_j\|}$

Applying Hamilton’s equations we obtain

\displaystyle \begin{aligned} \dot{q}_k^a &= \frac{\partial\mathbb{H}}{\partial p_k^a} = \frac{p_k^a}{m_k} \\ \dot{p}_k^a &= -\frac{\partial\mathbb{H}}{\partial q_k^a} = -G\sum_{j \neq k}m_k m_j \frac{q_k^a - q_j^a}{\|q_k - q_j\|^3} \end{aligned}

In this case it is easy to see that these are the same as Newton’s laws of motion.

Applying the Euler symplectic method we obtain:

\displaystyle \begin{aligned} q_k^{n+1} &= q_k^n + h \frac{p_k^n}{m_k} \\ p_k^{n+1} &= p_k^n - hG\sum_{j \neq k}m_k m_j \frac{q_k^{n+1} - q_j^{n+1}}{\|q_k^{n+1} - q_j^{n+1}\|^3} \end{aligned}
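
Before the array-based versions, a naive list-of-tuples sketch of one such step may help fix ideas. This is illustrative only: the names `V3` and `stepList` appear nowhere else, and the quadratic `!!` indexing is exactly what the repa and yarr implementations avoid.

```haskell
type V3 = (Double, Double, Double)

addV, subV :: V3 -> V3 -> V3
addV (a, b, c) (d, e, f) = (a + d, b + e, c + f)
subV (a, b, c) (d, e, f) = (a - d, b - e, c - f)

scaleV :: Double -> V3 -> V3
scaleV k (a, b, c) = (k * a, k * b, k * c)

norm :: V3 -> Double
norm (a, b, c) = sqrt (a * a + b * b + c * c)

-- One symplectic Euler step: positions first (from the old momenta),
-- then momenta from the forces evaluated at the *new* positions.
stepList :: Double -> Double -> [Double] -> ([V3], [V3]) -> ([V3], [V3])
stepList gC h ms (qs, ps) = (qs', ps')
  where
    qs' = zipWith3 (\q p m -> addV q (scaleV (h / m) p)) qs ps ms
    ps' = zipWith (\p i -> addV p (scaleV h (force i))) ps [0 .. length ms - 1]
    force i = foldr addV (0, 0, 0)
      [ scaleV (negate gC * ms !! i * ms !! j / norm d ^ 3) d
      | (j, qj) <- zip [0 ..] qs', j /= i, let d = subV (qs' !! i) qj ]

main :: IO ()
main =
  -- Two unit masses at rest, one unit apart: after one step they pick up
  -- equal and opposite momenta along the line joining them.
  print (stepList 1.0 0.01 [1.0, 1.0] ([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (0, 0, 0)]))
```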

### Repa Implementation

We use repa to represent our positions and momenta as 2-dimensional arrays: each planet is given a 3-dimensional position vector and a 3-dimensional momentum vector.

> newtype PositionP a = QP { positionP :: Array a DIM2 Double }
> newtype MomentaP  a = PP { momentaP :: Array a DIM2 Double }
> newtype MassP     a = MP { massP :: Array a DIM1 Double }

> stepPositionP :: forall a b c m .
>                  ( Monad m
>                  , Source a Double
>                  , Source b Double
>                  , Source c Double
>                  ) =>
>                  Double ->
>                  PositionP a ->
>                  MassP b ->
>                  MomentaP c ->
>                  m (PositionP U)
> stepPositionP h qs ms ps = do
>   newQs <- computeP $ (positionP qs) +^ ((momentaP ps) *^ h3 /^ ms2)
>   return $ QP newQs
>     where
>       (Z :. i :. j) = extent $ momentaP ps
>
>       h3  = extend (Any :. i :. j) $ fromListUnboxed Z [h]
>       ms2 = extend (Any :. j) $ massP ms


Each planet produces forces on every other planet so we work with 3-dimensional arrays and explicitly set the force of a planet on itself to zero. Once the forces on each planet have been calculated, we sum them to produce a resultant force which we then use to step the momentum forward.

> stepMomentumP :: forall a b c m .
>                  ( Monad m
>                  , Source a Double
>                  , Source b Double
>                  , Source c Double
>                  ) =>
>                  Double ->
>                  Double ->
>                  PositionP a ->
>                  MassP b ->
>                  MomentaP c ->
>                  m (MomentaP U)
> stepMomentumP gConst h qs ms ps =
>   do fs <- sumP $ transpose $ zeroDiags fss
>      newPs <- computeP $ (momentaP ps) +^ (fs *^ dt2)
>      return $ PP newPs
>   where
>     is     = repDim2to3Outer $ prodPairsMasses $ massP ms
>     qDiffs = pointDiffs $ positionP qs
>     preDs = Repa.map (^3) $
>             Repa.map sqrt $
>             sumS $
>             Repa.map (^2) $
>             qDiffs
>     ds    = repDim2to3Outer preDs
>     preFs = Repa.map (* (negate gConst)) $
>             qDiffs /^ ds
>     fss = is *^ preFs
>     Z :. i :. _j :. k = extent fss
>     dt2 = extend (Any :. i :. k) $ fromListUnboxed Z [h]
>
>     repDim2to3Outer a = extend (Any :. I.spaceDim) a
>
>     zeroDiags x = traverse x id f
>       where
>         f _ (Z :. i :. j :. k) | i == j    = 0.0
>                                | otherwise = x!(Z :. i :. j :. k)

> stepOnceP :: ( Monad m
>              , Source a Double
>              , Source b Double
>              , Source c Double
>              ) =>
>              Double ->
>              Double ->
>              MassP b ->
>              PositionP a ->
>              MomentaP c ->
>              m (PositionP U, MomentaP U)
> stepOnceP gConst h ms qs ps = do
>   newPs <- stepMomentumP gConst h qs ms ps
>   newQs <- stepPositionP h qs ms newPs
>   return (newQs, newPs)

> kineticEnergyP :: MassP U -> MomentaP U -> IO (Array D DIM0 Double)
> kineticEnergyP ms ps = do
>   preKes <- sumP $ (momentaP ps) *^ (momentaP ps)
>   ke <- sumP $ preKes /^ (massP ms)
>   return $ Repa.map (* 0.5) ke

> potentialEnergyP :: Double ->
>                     MassP U ->
>                     PositionP U ->
>                     IO (Array U DIM1 Double)
> potentialEnergyP gConst ms qs = do
>   ds2 <- sumP $ Repa.map (^2) $ pointDiffs $ positionP qs
>   let ds   = Repa.map sqrt ds2
>       is   = prodPairsMasses $ massP ms
>       pess = zeroDiags $ Repa.map (* (0.5 * negate gConst)) $ is /^ ds
>   pes <- sumP pess
>   return pes
>
>   where
>
>     zeroDiags x = traverse x id f
>       where
>         f _ (Z :. i :. j) | i == j    = 0.0
>                           | otherwise = x!(Z :. i :. j)

> hamiltonianP :: Double ->
>                 MassP U ->
>                 PositionP U ->
>                 MomentaP U ->
>                 IO Double
> hamiltonianP gConst ms qs ps = do
>   ke  <- kineticEnergyP ms ps
>   pes <- potentialEnergyP gConst ms qs
>   pe  <- sumP pes
>   te :: Array U DIM0 Double <- computeP $ ke +^ pe
>   return $ head $ toList te

> prodPairsMasses :: Source a Double =>
>                    Array a DIM1 Double ->
>                    Array D DIM2 Double
> prodPairsMasses ms = ns *^ (transpose ns)
>
>   where
>     (Z :. i) = extent ms
>     ns = extend (Any :. i :. All) ms

>
> pointDiffs :: Source a Double =>
>               Array a DIM2 Double ->
>               Array D DIM3 Double
> pointDiffs qs = qss -^ (transposeOuter qss)
>   where
>
>     qss = replicateRows qs
>
>     transposeOuter qs = backpermute (f e) f qs
>       where
>         e = extent qs
>         f (Z :. i :. i' :. j) = Z :. i' :. i :. j
>
>     replicateRows :: Source a Double =>
>                      Array a DIM2 Double ->
>                      Array D DIM3 Double
>     replicateRows a = extend (Any :. i :. All) a
>       where (Z :. i :. _j) = extent a


Using the single-step update, we can step as many times as we wish.

> stepN :: forall m . Monad m =>
>          Int ->
>          Double ->
>          Double ->
>          MassP U ->
>          PositionP U ->
>          MomentaP U ->
>          m (PositionP U, MomentaP U)
> stepN n gConst dt masses = curry updaterMulti
>   where
>     updaterMulti = foldr (>=>) return updaters
>     updaters = replicate n (uncurry (stepOnceP gConst dt masses))
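
The `foldr (>=>) return` idiom composes $n$ copies of a monadic step into one. A toy version over `Maybe` (the names here are illustrative, not part of the simulation) shows the shape:

```haskell
import Control.Monad ((>=>))

-- Compose n copies of a Kleisli arrow, exactly as stepN does with stepOnceP.
nTimesM :: Monad m => Int -> (a -> m a) -> a -> m a
nTimesM n f = foldr (>=>) return (replicate n f)

main :: IO ()
main = do
  -- A step that fails once the state reaches a bound.
  let bounded x = if x < 10 then Just (x + 1) else Nothing
  print (nTimesM 3 bounded (0 :: Int))   -- Just 3
  print (nTimesM 20 bounded (0 :: Int))  -- Nothing: the 11th step fails
```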


Sometimes we need all the intermediate steps e.g. for plotting.

> stepNs :: Monad m =>
>           Int ->
>           Double ->
>           Double ->
>           MassP U ->
>           PositionP U ->
>           MomentaP U ->
>           m [(PositionP U, MomentaP U)]
> stepNs n gConst dt ms rs vs = do
>   rsVs <- stepAux n rs vs
>   return $ (rs, vs) : rsVs
>   where
>     stepAux 0 _ _ = return []
>     stepAux n rs vs = do
>       (newRs, newVs) <- stepOnceP gConst dt ms rs vs
>       rsVs <- stepAux (n-1) newRs newVs
>       return $ (newRs, newVs) : rsVs


### Yarr Implementation

We use yarr to represent our positions and momenta as 1-dimensional arrays: each planet is given a 3-dimensional position vector and a 3-dimensional momentum vector.

> vZero :: VecList N3 Double
> vZero = V.replicate 0

> type ArrayY = UArray F L S.Dim1

> newtype PositionY  = QY { positionY :: VecList N3 Double }
>   deriving (Show, Storable)
> newtype MomentumY = PY { momentumY :: VecList N3 Double }
>   deriving (Show, Storable)
> type MomentaY     = ArrayY MomentumY
> type PositionsY   = ArrayY PositionY
> type MassesY      = ArrayY Mass
> type ForceY       = VecList N3 Double
> type ForcesY      = ArrayY ForceY

> stepPositionY :: Double -> PositionsY -> MassesY -> MomentaY -> IO ()
> stepPositionY h qs ms ps = do
>   loadS S.fill (dzip3 upd qs ms ps) qs
>   where
>     upd :: PositionY -> Mass -> MomentumY -> PositionY
>     upd q m p = QY $ V.zipWith (+)
>                      (positionY q)
>                      (V.map (* (h / m)) (momentumY p))


Note the requirement to fill the forces array with zeros before using it.

> stepMomentumY :: Double ->
>                  Double ->
>                  PositionsY ->
>                  MassesY ->
>                  MomentaY ->
>                  IO ()
> stepMomentumY gConst h qs ms ps = do
>   let nBodies = Y.extent ms
>   fs :: ForcesY <- Y.new nBodies
>   S.fill (\_ -> return vZero) (Y.write fs) 0 nBodies
>   let forceBetween i pos1 mass1 j
>         | i == j = return vZero
>         | otherwise = do
>             pos2  <- qs `Y.index` j
>             mass2 <- ms `Y.index` j
>             let deltas = V.zipWith (-) (positionY pos1) (positionY pos2)
>                 dist2  = V.sum $ V.map (^ 2) deltas
>                 a      = 1.0 / dist2
>                 b      = (negate gConst) * mass1 * mass2 * a * (sqrt a)
>             return $ V.map (* b) deltas
>       forceAdd :: Int -> Int -> ForceY -> IO ()
>       forceAdd i _ f = do
>         f0 <- fs `Y.index` i
>         Y.write fs i (V.zipWith (+) f0 f)
>       force i pos = do
>         mass <- ms `Y.index` i
>         S.fill (forceBetween i pos mass) (forceAdd i) 0 nBodies
>       upd momentum force =
>         PY $ V.zipWith (+)
>              (momentumY momentum)
>              (V.map (\f -> f * h) force)
>   S.fill (Y.index qs) force 0 nBodies
>   loadS S.fill (dzip2 upd ps fs) ps

> stepOnceY :: Double ->
>              Double ->
>              MassesY ->
>              PositionsY ->
>              MomentaY ->
>              IO ()
> stepOnceY gConst h ms qs ps = do
>   stepMomentumY gConst h qs ms ps
>   stepPositionY h qs ms ps

> potentialEnergyY :: Double ->
>                     MassesY ->
>                     PositionsY ->
>                     IO (ArrayY Double)
> potentialEnergyY gConst ms qs = do
>   let nBodies = Y.extent ms
>   pes :: ArrayY Double <- Y.new nBodies
>   S.fill (\_ -> return 0.0) (Y.write pes) 0 nBodies
>   let peOnePairParticles :: Int ->
>                             Int ->
>                             IO Double
>       peOnePairParticles i j
>         | i == j = return 0.0
>         | otherwise = do
>           q1 <- qs `Y.index` i
>           m1 <- ms `Y.index` i
>           q2 <- qs `Y.index` j
>           m2 <- ms `Y.index` j
>           let qDiffs = V.zipWith (-) (positionY q1) (positionY q2)
>               dist2  = V.sum $ V.map (^2) qDiffs
>               a      = 1.0 / dist2
>               b      = 0.5 * (negate gConst) * m1 * m2 * (sqrt a)
>           return b
>       peAdd i _ pe = do
>         peDelta <- pes `Y.index` i
>         Y.write pes i (pe + peDelta)
>       peFn i _ = do
>         S.fill (peOnePairParticles i) (peAdd i) 0 nBodies
>   S.fill (Y.index qs) peFn 0 nBodies
>   return pes

> kineticEnergyY :: MassesY -> MomentaY -> IO Double
> kineticEnergyY ms ps = do
>   let nakedPs = Y.delay $ Y.dmap momentumY ps
>   let preKes = Y.dmap V.sum $ dzip2 (V.zipWith (*)) nakedPs nakedPs
>       kes    = dzip2 (/) preKes (Y.delay ms)
>   ke <- W.walk (W.reduceL S.foldl (+)) (return 0) kes
>   return $ 0.5 * ke

> hamiltonianY :: Double -> MassesY -> PositionsY -> MomentaY-> IO Double
> hamiltonianY gConst ms qs ps = do
>   ke  <- kineticEnergyY ms ps
>   pes <- potentialEnergyY gConst ms qs
>   pe  <- W.walk (W.reduceL S.foldl (+)) (return 0) pes
>   return $ pe + ke


## The Outer Solar System

We convert the starting positions and momenta for the planets into repa- and yarr-friendly representations.

> mosssP :: MassP U
> mosssP = MP $ fromListUnboxed (Z :. n) I.massesOuter
>   where
>     n = length I.massesOuter
>
> mosssY :: IO MassesY
> mosssY = YIO.fromList n I.massesOuter
>   where
>     n = length I.massesOuter
>
> qosss :: PositionP U
> qosss = QP $ fromListUnboxed (Z :. n :. I.spaceDim) xs
>   where
>     xs = concat I.initQsOuter
>     n  = length xs `div` I.spaceDim
>
> qosssY :: IO PositionsY
> qosssY = YIO.fromList nBodies $ Prelude.map f [0 .. nBodies - 1]
>   where
>     nBodies = length I.initQsOuter
>     f :: Int -> PositionY
>     f i = QY $
>           V.vl_3 ((I.initQsOuter!!i)!!0)
>                  ((I.initQsOuter!!i)!!1)
>                  ((I.initQsOuter!!i)!!2)
>
> posss :: MomentaP U
> posss = PP $ fromListUnboxed (Z :. n :. I.spaceDim) xs
>   where
>     xs = concat I.initPsOuter
>     n  = length xs `div` I.spaceDim
>
> posssY :: IO MomentaY
> posssY = YIO.fromList nBodies $ Prelude.map f [0 .. nBodies - 1]
>   where
>     nBodies = length I.initPsOuter
>     f :: Int -> MomentumY
>     f i = PY $
>           V.vl_3 ((I.initPsOuter!!i)!!0)
>                  ((I.initPsOuter!!i)!!1)
>                  ((I.initPsOuter!!i)!!2)


Rather arbitrarily we run the outer planets for 2000 steps with a step length of 100 days.

> outerPlanets :: [[(Double, Double)]]
> outerPlanets = runIdentity $ do
>   rsVs <- stepNs 2000 I.gConstAu 100 mosssP qosss posss
>   let qs  = Prelude.map fst rsVs
>       xxs = Prelude.map
>             (\i -> Prelude.map ((!(Z :. (i :: Int) :. (0 :: Int))) .
>                                 positionP) qs)
>             [5,0,1,2,3,4]
>       xys = Prelude.map
>             (\i -> Prelude.map ((!(Z :. (i :: Int) :. (1 :: Int))) .
>                                 positionP) qs)
>             [5,0,1,2,3,4]
>   return $ zipWith zip xxs xys


Plotting the results, we can see that we have a simulation which, as we expect, conserves energy.

## Performance

Let’s see how repa and yarr perform against each other.

> data YarrOrRepa = Repa | Yarr
>   deriving Show
>
> data Options = Options  { optYarr :: YarrOrRepa
>                         }
>
> startOptions :: Options
> startOptions = Options  { optYarr = Repa
>                         }
>
> options :: [OptDescr (Options -> IO Options)]
> options = [
>   Option ['Y'] ["yarr"] (NoArg (\opt -> return opt { optYarr = Yarr }))
>          "Use yarr"
>   ]

> main :: IO ()
> main = do
>   args <- getArgs
>   let (actions, _nonOpts, _msgs) = getOpt RequireOrder options args
>   opts <- foldl (>>=) (return startOptions) actions
>   case optYarr opts of
>     Repa -> do
>       hPre <- hamiltonianP I.gConstAu mosssP qosss posss
>       putStrLn $ show hPre
>       (qsPost, psPost) <- stepN I.nStepsOuter I.gConstAu I.stepOuter
>                                 mosssP qosss posss
>       hPost <- hamiltonianP I.gConstAu mosssP qsPost psPost
>       putStrLn $ show hPost
>     Yarr -> do
>       ms :: MassesY <- mosssY
>       ps <- posssY
>       qs <- qosssY
>       hPre <- hamiltonianY I.gConstAu ms qs ps
>       putStrLn $ show hPre
>       S.fill (\_ -> return ())
>              (\_ _ -> stepOnceY I.gConstAu I.stepOuter ms qs ps)
>              (0 :: Int) I.nStepsOuter
>       hPost <- hamiltonianY I.gConstAu ms qs ps
>       putStrLn $ show hPost


With 200,000 steps we get the following for repa.

$ time ./Symplectic
-3.215453183208164e-8
-3.139737384661333e-8

real	0m18.400s
user	0m18.245s
sys	0m0.154s

And a significant speed up with yarr.

$ time ./Symplectic -Y
-3.215453183208164e-8
-3.13973738466838e-8

real	0m0.553s
user	0m0.539s
sys	0m0.013s

With 2,000,000 steps (about 548,000 years) we get the following with yarr.

$ time ./Symplectic -Y
-3.215453183208164e-8
-3.2144315777817145e-8

real	0m5.477s
user	0m5.369s
sys	0m0.107s

It would be interesting to compare this against a C implementation.

## Appendices

### Appendix A: Jupiter, Earth and Sun

We need some initial conditions to start our simulation. Instead of taking real data, let’s make up something which is realistic but within our control.

Following [@Fitz:Newtonian:Dynamics] we can write Kepler’s laws as

\displaystyle \begin{aligned} r &= \frac{a(1 - e^2)}{1 - e\cos\theta} \\ r^2\dot{\theta} &= \sqrt{(1 - e^2)}na^2 \\ GM_{\rm Sun} &= n^2a^3 \end{aligned}

where $T$ is the period of the orbit, $n = 2\pi / T$ is the mean angular orbital velocity, $a$ is the major radius of the elliptical orbit (Kepler’s first law: “The orbit of every planet is an ellipse with the Sun at one of the two foci”), $G$ is the gravitational constant, $M_{\rm Sun}$ is the mass of the sun and $e$ is the eccentricity of a planet’s orbit.

We can calculate the mean angular orbital velocity of Jupiter:

> nJupiter :: Double
> nJupiter = sqrt $ I.gConst * I.sunMass / I.jupiterMajRad^3
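
Kepler's third law $GM_{\rm Sun} = n^2 a^3$ also gives a quick sanity check on the units (AU, solar masses, days). The numbers below are rounded textbook values for Jupiter, assumed for illustration only and not taken from the `Initial` module:

```haskell
-- G * M_Sun in AU^3/day^2: the square of the Gaussian gravitational
-- constant k = 0.01720209895.
gmSun :: Double
gmSun = 2.9591e-4

jupiterA, jupiterT :: Double
jupiterA = 5.2044    -- semi-major axis in AU (rounded)
jupiterT = 4332.59   -- orbital period in days (rounded)

main :: IO ()
main = do
  let n = 2 * pi / jupiterT          -- mean angular velocity, rad/day
  -- The ratio is ~1; the small excess reflects Jupiter's own mass and rounding.
  print (n ^ 2 * jupiterA ^ 3 / gmSun)
```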


Let us calculate the initial conditions assuming that Jupiter starts at its perihelion. The angular velocity at that point is entirely in the negative $y$ direction.

With the Leapfrog Method we need the velocity to be half a time step before the perihelion.

If we take $(0.0, -v_p, 0.0)$ to be the velocity of Jupiter at the perihelion and $\delta\theta$ to be the angle, with respect to the negative $y$-axis, half a time step before Jupiter reaches the perihelion, then simple trigonometry gives the velocity of Jupiter at this point:
$\displaystyle (-v_p\sin(\delta\theta), -v_p\cos(\delta\theta), 0.0) \approx (-v_p\delta\theta, -v_p(1 - \delta\theta^2 / 2), 0.0)$
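The adequacy of this expansion is easy to check numerically; the following standalone sketch uses made-up values for $v_p$ and $\delta\theta$:

```haskell
-- Made-up values for illustration: a perihelion speed and a small
-- half-step angle.
vp, dTheta :: Double
vp     = 1.3e4   -- m/s
dTheta = 1.0e-3  -- radians

-- The exact velocity components and their second order approximation.
exactV, approxV :: (Double, Double)
exactV  = (negate $ vp * sin dTheta, negate $ vp * cos dTheta)
approxV = (negate $ vp * dTheta,     negate $ vp * (1 - dTheta^2 / 2))
```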

In Haskell, we get the following initial conditions:

> jupiterThetaDotP :: Double
> jupiterThetaDotP =
>   nJupiter *
>   I.jupiterMajRad^2 *
>   sqrt (1 - I.jupiterEccentrity^2) / I.jupiterPerihelion^2
>
> jupiterDeltaThetaP :: Double
> jupiterDeltaThetaP = jupiterThetaDotP * I.stepTwoPlanets / 2
>
> jupiterVPeri :: Speed
> jupiterVPeri = jupiterThetaDotP * I.jupiterPerihelion
>
> jupiterInitX :: Speed
> jupiterInitX = negate $ jupiterVPeri * jupiterDeltaThetaP
>
> jupiterInitY :: Speed
> jupiterInitY = negate $ jupiterVPeri * (1 - jupiterDeltaThetaP^2 / 2)
>
> jupiterV :: (Speed, Speed, Speed)
> jupiterV = (jupiterInitX, jupiterInitY, 0.0)
>
> jupiterR :: (Distance, Distance, Distance)
> jupiterR = (negate I.jupiterPerihelion, 0.0, 0.0)


We can do the same for the Earth, but we assume that the Earth is at its perihelion on the opposite side of the Sun to Jupiter.

> nEarth :: Double
> nEarth = sqrt $ I.gConst * I.sunMass / I.earthMajRad^3
>
> earthThetaDotP :: Double
> earthThetaDotP = nEarth *
>                  I.earthMajRad^2 *
>                  sqrt (1 - I.earthEccentrity^2) / I.earthPerihelion^2
>
> earthDeltaThetaP :: Double
> earthDeltaThetaP = earthThetaDotP * I.stepTwoPlanets / 2
>
> earthVPeri :: Speed
> earthVPeri = earthThetaDotP * I.earthPerihelion
>
> earthInitX :: Speed
> earthInitX = earthVPeri * earthDeltaThetaP
>
> earthInitY :: Speed
> earthInitY = earthVPeri * (1 - earthDeltaThetaP^2 / 2)
>
> earthV :: (Speed, Speed, Speed)
> earthV = (earthInitX, earthInitY, 0.0)
>
> earthR :: (Distance, Distance, Distance)
> earthR = (I.earthPerihelion, 0.0, 0.0)


For completeness we give the Sun’s starting conditions.

> sunV :: (Speed, Speed, Speed)
> sunV = (0.0, 0.0, 0.0)
>
> sunR :: (Distance, Distance, Distance)
> sunR = (0.0, 0.0, 0.0)

> initVs :: Array U DIM2 Speed
> initVs = fromListUnboxed (Z :. nBodies :. I.spaceDim) $ concat xs
>   where
>     nBodies = length xs
>     xs = [ [earthX,   earthY,   earthZ]
>          , [jupiterX, jupiterY, jupiterZ]
>          , [sunX,     sunY,     sunZ]
>          ]
>     (earthX,   earthY,   earthZ)   = earthV
>     (jupiterX, jupiterY, jupiterZ) = jupiterV
>     (sunX,     sunY,     sunZ)     = sunV

> initPs :: MomentaP U
> initPs = PP $ runIdentity $ computeP $ ms2 *^ initVs
>   where
>     (Z :. _i :. j) = extent initVs
>     ms2 = extend (Any :. j) (massP masses)

> initQs :: PositionP U
> initQs = QP $ fromListUnboxed (Z :. nBodies :. I.spaceDim) $ concat xs
>   where
>     nBodies = length xs
>     xs = [ [earthX,   earthY,   earthZ]
>          , [jupiterX, jupiterY, jupiterZ]
>          , [sunX,     sunY,     sunZ]
>          ]
>     (earthX,   earthY,   earthZ)   = earthR
>     (jupiterX, jupiterY, jupiterZ) = jupiterR
>     (sunX,     sunY,     sunZ)     = sunR

> masses :: MassP U
> masses = MP $ fromListUnboxed (Z :. nBodies) I.massesTwoPlanets
>   where
>     nBodies = length I.massesTwoPlanets

> jupiterEarth :: [((Double, Double),
>                   (Double, Double),
>                   (Double, Double))]
> jupiterEarth = runIdentity $ do
>   rsVs <- stepNs I.nStepsTwoPlanets I.gConst I.stepTwoPlanets
>                  masses initQs initPs
>   let qs = Prelude.map fst rsVs
>       exs = Prelude.map ((!(Z :. (0 :: Int) :. (0 :: Int))) .
>                          positionP) qs
>       eys = Prelude.map ((!(Z :. (0 :: Int) :. (1 :: Int))) .
>                          positionP) qs
>       jxs = Prelude.map ((!(Z :. (1 :: Int) :. (0 :: Int))) .
>                          positionP) qs
>       jys = Prelude.map ((!(Z :. (1 :: Int) :. (1 :: Int))) .
>                          positionP) qs
>       sxs = Prelude.map ((!(Z :. (2 :: Int) :. (0 :: Int))) .
>                          positionP) qs
>       sys = Prelude.map ((!(Z :. (2 :: Int) :. (1 :: Int))) .
>                          positionP) qs
>   return $ zip3 (zip exs eys) (zip jxs jys) (zip sxs sys)


Plotting the results we can see a reasonable picture for Jupiter’s and Earth’s orbits.

### Appendix B: The Canonical Symplectic Form for the Cotangent Bundle

There are plenty of symplectic manifolds besides $\mathbb{R}^{2n}$. The cotangent bundle has a canonical symplectic 2-form and hence is a symplectic manifold.

Proof

Let $\pi : T^* M \longrightarrow M$ be the projection function from the cotangent bundle to the base manifold, that is, $\pi(x,\xi) = x$. Then $\pi_* : T(T^*M) \longrightarrow TM$ and we can define a 1-form (the canonical or tautological 1-form) on $v \in T_{(x,\xi)}(T^* M)$ as

$\displaystyle \theta_{(x,\xi)} (v) = \xi(\pi_* v)$

By definition:

$\displaystyle \pi_* \bigg(\frac{\partial}{\partial x_i}\bigg)(f) = \frac{\partial}{\partial x_i}(f \circ \pi) = \frac{\partial f}{\partial x_i}$

and

$\displaystyle \pi_* \bigg(\frac{\partial}{\partial \xi_i}\bigg)(f) = \frac{\partial}{\partial \xi_i}(f \circ \pi) = 0$

If we then write $v \in T_{(x,\xi)}(T^*M)$ in co-ordinate form:

$\displaystyle v = a^i\frac{\partial}{\partial x_i} + \alpha^i\frac{\partial}{\partial \xi_i}$

we have

$\displaystyle \pi_*v = a^i\pi_*\frac{\partial}{\partial x_i} + \alpha^i\pi_*\frac{\partial}{\partial \xi_i} = a^i\frac{\partial}{\partial x_i}$

Taking $\xi = \xi_i dx^i$ we have that

$\displaystyle \xi(\pi_*v) = \xi_i a^i$

Thus we have:

$\displaystyle \theta_{(x,\xi)} = \xi_i dx^i$

We then have a closed 2-form

$\displaystyle \omega = d\theta$

In co-ordinate terms:

$\displaystyle \omega = d\theta = d (\xi_i dx^i) = d\xi_i \wedge dx^i$
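Strictly speaking, for $\omega$ to make $T^*M$ a symplectic manifold it must also be non-degenerate. This is immediate in these co-ordinates: for a second vector $w = b^j\partial/\partial x_j + \beta^j\partial/\partial \xi_j$ we have

```latex
\omega(v, w) = (d\xi_i \wedge dx^i)(v, w) = \alpha^i b^i - a^i \beta^i
```

so for any $v \neq 0$ we can choose $w$ making this pairing non-zero.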

### Bibliography

Cross, Matthew I. 2006. Symplectic integrators and optimal control. UK.

Fitzpatrick, Richard. 1996. “Newtonian Dynamics.” http://farside.ph.utexas.edu/teaching/336k/lectures.

Hairer, E., C. Lubich, and G. Wanner. 2010. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Springer Series in Computational Mathematics. Springer. http://books.google.co.uk/books?id=ssrFQQAACAAJ.

Hudak, Paul, John Hughes, Simon Peyton Jones, and Philip Wadler. 2007. “A history of Haskell: being lazy with class.” In Proceedings of the third ACM SIGPLAN conference on History of programming languages, 12–1. New York, NY, USA: ACM. http://doi.acm.org/10.1145/1238844.1238856.

O’Neill, B. 1983. Semi-Riemannian Geometry With Applications to Relativity, 103. Pure and Applied Mathematics. Elsevier Science. http://books.google.co.uk/books?id=CGk1eRSjFIIC.

Yoshida, H. 1992. “Symplectic Integrators for Hamiltonian Systems: Basic Theory.” In Chaos, Resonance, and Collective Dynamical Phenomena in the Solar System, ed. S. Ferraz-Mello, 152:407.

## Introduction

Neural networks are a method for classifying data based on a theory of how biological systems operate. They can also be viewed as a generalization of logistic regression. A method for determining the coefficients of a given model, backpropagation, was developed in the 1970s and rediscovered in the 1980s.

The article “A Functional Approach to Neural Networks” in the Monad Reader shows how to use a neural network to classify handwritten digits in the MNIST database using backpropagation.

The reader is struck by how similar backpropagation is to automatic differentiation, and may not be surprised to find that this observation has been made before (Domke 2009a). Indeed, as Dan Piponi observes: “the grandaddy machine-learning algorithm of them all, back-propagation, is nothing but steepest descent with reverse mode automatic differentiation”.

## Neural Networks

We can view neural nets, or at least multi-layer perceptrons, as a generalisation of (multivariate) linear logistic regression.

We follow (Rojas 1996; Bishop 2006). We are given a training set:

$\displaystyle \{(\boldsymbol{x}_0, \boldsymbol{y}_0), (\boldsymbol{x}_1, \boldsymbol{y}_1), \ldots, (\boldsymbol{x}_p, \boldsymbol{y}_p)\}$

of pairs of $n$-dimensional and $m$-dimensional vectors called the input and output patterns in Machine Learning parlance. We wish to build a neural network model using this training set.

A neural network model (or at least the specific model we discuss: the multi-layer perceptron) consists of a sequence of transformations. The first transformation creates weighted sums of the inputs.

$\displaystyle a_j^{(1)} = \sum_{i=1}^{K_0} w^{(1)}_{ij}x_i + w_{0j}^{(1)}$

where $K_0 \equiv n$ is the size of the input vector and there are $j = 1,\ldots,K_1$ neurons in the so-called first hidden layer of the network. The weights are unknown.

The second transformation then applies a non-linear activation function $f$ to each $a_j$ to give the output from the $j$-th neuron in the first hidden layer.

$\displaystyle z_j^{(1)} = f(a_j^{(1)})$

Typically, $f$ is chosen to be $\tanh$ or the logistic function. Note that if it were chosen to be the identity then our neural network would be the same as a multivariate linear logistic regression.

We now repeat these steps for the second hidden layer:

$\displaystyle a_j^{(2)} = \sum_{i=1}^{K_1} w^{(2)}_{ij}z_i^{(1)} + w_{0j}^{(2)}$

$\displaystyle z_j^{(2)} = f(a_j^{(2)})$

Ultimately, after we have applied $L-1$ transformations (through $L-1$ hidden layers), we produce some output $\hat{\boldsymbol{y}}$.
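The chain of transformations can be sketched as a fold over a list of weight matrices. This is a toy formulation for illustration only; the names and list-of-lists representation here are assumptions, not the data structures used in the implementation below.

```haskell
import Data.List (foldl')

-- Toy representation: a layer is a matrix of weights, one row per
-- neuron, with the bias stored as the first weight in each row.
type ToyMatrix = [[Double]]

logistic :: Double -> Double
logistic x = 1 / (1 + exp (negate x))

-- One layer: prepend 1 to the inputs (for the bias), form the weighted
-- sums a_j, and apply the activation function element-wise.
layerOutput :: ToyMatrix -> [Double] -> [Double]
layerOutput ws zs = map (logistic . sum . zipWith (*) (1 : zs)) ws

-- The whole network is just a fold of layerOutput over the layers.
forwardPass :: [ToyMatrix] -> [Double] -> [Double]
forwardPass layers xs = foldl' (flip layerOutput) xs layers
```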

We show an example neural network in the diagram below.

The input layer has 7 nodes. There are 2 hidden layers, the first has 3 nodes and the second has 5. The output layer has 3 nodes.

We are also given a cost function:

$\displaystyle E(\boldsymbol{w}; \boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|(\hat{\boldsymbol{y}} - \boldsymbol{y})\|^2$

where $\hat{\boldsymbol{y}}$ is the predicted output of the neural net and $\boldsymbol{y}$ is the observed output.

As with logistic regression, our goal is to find weights for the neural network which minimise this cost function. We initialise the weights to some small non-zero amount and then use the method of steepest descent (aka gradient descent). The idea is that if $f$ is a function of several variables then to find its minimum value, one ought to take a small step in the direction in which it is decreasing most quickly and repeat until no step in any direction results in a decrease. The analogy is that if one is walking in the mountains then the quickest way down is to walk in the direction which goes down most steeply. Of course one can get stuck at a local minimum rather than the global minimum, but from a machine learning point of view this may be acceptable; alternatively one may start at random points in the search space and check they all give the same minimum.

We therefore need to calculate the gradient of the cost function with respect to the weights (since we need to minimise the cost function). In other words we need to find:

$\displaystyle \nabla E(\boldsymbol{x}) \equiv (\frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n})$

Once we have this we can take our random starting position and move down the steepest gradient:

$\displaystyle w'_i = w_i - \gamma\frac{\partial E}{\partial w_i}$

where $\gamma$ is the step length known in machine learning parlance as the learning rate.
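To make the update rule concrete, here is a sketch minimising the one-dimensional cost $E(w) = (w - 3)^2$ by hand; the target value 3 and the step length are arbitrary choices for illustration.

```haskell
-- Learning rate (step length), chosen arbitrarily.
gammaToy :: Double
gammaToy = 0.1

-- E(w) = (w - 3)^2 has gradient dE/dw = 2 * (w - 3), so one step of
-- steepest descent is w' = w - gamma * 2 * (w - 3).
stepToy :: Double -> Double
stepToy w = w - gammaToy * 2 * (w - 3)

-- Iterating the update from a starting point converges towards the
-- minimum at w = 3.
descend :: Int -> Double -> Double
descend n w0 = iterate stepToy w0 !! n
```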

Some pragmas and imports required for the example code.

> {-# LANGUAGE RankNTypes                #-}
> {-# LANGUAGE DeriveFunctor             #-}
> {-# LANGUAGE DeriveFoldable            #-}
> {-# LANGUAGE DeriveTraversable         #-}
> {-# LANGUAGE ScopedTypeVariables       #-}
> {-# LANGUAGE TupleSections             #-}
> {-# LANGUAGE NoMonomorphismRestriction #-}

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}

> module NeuralNet
>        ( test1
>        , test2
>        , test3
>        ) where

> import Numeric.AD

> import Data.Traversable (Traversable)
> import Data.Foldable (Foldable)
> import Data.List
> import Data.List.Split
> import System.Random
> import qualified Data.Vector as V

> import Control.Monad

> import Data.Random ()
> import Data.Random.Distribution.Beta
> import Data.Random.Distribution.Uniform
> import Data.RVar

> import Text.Printf


## Logistic Regression Redux

Let us first implement logistic regression. This will give us a reference against which to compare the equivalent solution expressed as a neural network.

Instead of maximizing the log likelihood, we will minimize a cost function.

> cost :: Floating a => V.Vector a -> a -> V.Vector a -> a
> cost theta y x = 0.5 * (y - yhat)^2
>   where
>     yhat = logit $ V.sum $ V.zipWith (*) theta x

> logit :: Floating a =>
>          a -> a
> logit x = 1 / (1 + exp (negate x))
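Two properties of logit are worth noting: $\mathrm{logit}(0) = 1/2$ and $\mathrm{logit}(x) + \mathrm{logit}(-x) = 1$. A standalone copy for checking:

```haskell
-- A standalone copy of logit so its properties can be checked directly.
logit :: Floating a => a -> a
logit x = 1 / (1 + exp (negate x))

-- logit 0 == 0.5, and logit x + logit (negate x) == 1 for all x,
-- since exp x * exp (negate x) == 1.
```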


We add a regularization term into the total cost so that the parameters do not grow too large. Note that we do not regularize over the bias.

> delta :: Floating a => a
> delta = 0.01

> totalCost :: Floating a =>
>              V.Vector a ->
>              V.Vector a ->
>              V.Vector (V.Vector a) ->
>              a
> totalCost theta y x = (a + delta * b) / l
>   where
>     l = fromIntegral $ V.length y
>     a = V.sum $ V.zipWith (cost theta) y x
>     b = (/2) $ V.sum $ V.map (^2) $ V.drop 1 theta


We determine the gradient of the regularized cost function.

> delTotalCost :: Floating a =>
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> delTotalCost y x = grad f
>   where
>     f theta = totalCost theta (V.map auto y) (V.map (V.map auto) x)


And finally we can apply gradient descent.

> gamma :: Double
> gamma = 0.4

> stepOnceCost :: Floating a =>
>                 a ->
>                 V.Vector a ->
>                 V.Vector (V.Vector a) ->
>                 V.Vector a ->
>                 V.Vector a
> stepOnceCost gamma y x theta =
>   V.zipWith (-) theta (V.map (* gamma) $ del theta)
>     where
>       del = delTotalCost y x


## Neural Network Representation

Let us borrow, generalize and prune the data structures used in “A Functional Approach to Neural Networks”. Some of the fields in the borrowed data structures are probably no longer necessary given that we are going to use automated differentiation rather than backpropagation. Caveat lector!

The activation function itself is a function which takes any type in the Floating class to the same type in the Floating class e.g. Double.

> newtype ActivationFunction =
>   ActivationFunction
>   {
>     activationFunction :: Floating a => a -> a
>   }


A neural network is a collection of layers.

> data Layer a =
>   Layer
>   {
>     layerWeights  :: [[a]],
>     layerFunction :: ActivationFunction
>   } deriving (Functor, Foldable, Traversable)

> data BackpropNet a = BackpropNet
>     {
>       layers       :: [Layer a],
>       learningRate :: Double
>     } deriving (Functor, Foldable, Traversable)


We need some helper functions to build our neural network and to extract information from it.

> buildBackpropNet ::
>   Double ->
>   [[[a]]] ->
>   ActivationFunction ->
>   BackpropNet a
> buildBackpropNet learningRate ws f =
>   BackpropNet {
>       layers       = map buildLayer checkedWeights
>     , learningRate = learningRate
>     }
>   where checkedWeights = scanl1 checkDimensions ws
>         buildLayer w   = Layer { layerWeights  = w
>                                 , layerFunction = f
>                                 }
>         checkDimensions :: [[a]] -> [[a]] -> [[a]]
>         checkDimensions w1 w2 =
>           if 1 + length w1 == length (head w2)
>           then w2
>           else error $ "Inconsistent dimensions in weight matrix\n" ++
>                        show (length w1) ++ "\n" ++
>                        show (length w2) ++ "\n" ++
>                        show (length $ head w1) ++ "\n" ++
>                        show (length $ head w2)

> extractWeights :: BackpropNet a -> [[[a]]]
> extractWeights x = map layerWeights $ layers x


In order to undertake gradient descent on the data structure in which we store a neural network, BackpropNet, it will be convenient to be able to add such structures together point-wise.

> instance Num a => Num (Layer a) where
>   (+) = addLayer

> addLayer :: Num a => Layer a -> Layer a -> Layer a
> addLayer x y =
>   Layer { layerWeights  = zipWith (zipWith (+))
>                                   (layerWeights x)
>                                   (layerWeights y)
>         , layerFunction = layerFunction x
>         }

> instance Num a => Num (BackpropNet a) where
>   (+) = addBPN

> addBPN :: Num a => BackpropNet a -> BackpropNet a -> BackpropNet a
> addBPN x y = BackpropNet { layers = zipWith (+) (layers x) (layers y)
>                           , learningRate = learningRate x
>                           }


We store information about the output values produced at each layer of the neural network as we move forward through it (aka forward propagation).

> data PropagatedLayer a
>     = PropagatedLayer
>         {
>           propLayerIn         :: [a],
>           propLayerOut        :: [a],
>           propLayerWeights    :: [[a]],
>           propLayerActFun     :: ActivationFunction
>         }
>     | PropagatedSensorLayer
>         {
>           propLayerOut :: [a]
>         } deriving (Functor, Foldable, Traversable)


Sadly we have to use an inefficient calculation to multiply matrices; see this email for further details.

> matMult :: Num a => [[a]] -> [a] -> [a]
> matMult m v = result
>   where
>     lrs = map length m
>     l   = length v
>     result = if all (== l) lrs
>              then map (\r -> sum $zipWith (*) r v) m > else error$ "Matrix has rows of length " ++ show lrs ++
>                           " but vector is of length " ++ show l
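A small worked example of the multiplication itself (a standalone copy with the dimension check stripped, so the snippet stands alone):

```haskell
-- Each row of the matrix is dotted with the vector.
matMultToy :: Num a => [[a]] -> [a] -> [a]
matMultToy m v = map (\r -> sum $ zipWith (*) r v) m

-- matMultToy [[1, 2], [3, 4]] [5, 6]
--   == [1*5 + 2*6, 3*5 + 4*6]
--   == [17, 39]
```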


Now we can propagate forwards. Note that the code from which this is borrowed assumes that the inputs are images which are $m \times m$ pixels each encoded using a grayscale, hence the references to bits and the check that values lie in the range $0 \leq x \leq 1$.

> propagateNet :: (Floating a, Ord a, Show a) =>
>                 [a] ->
>                 BackpropNet a ->
>                 [PropagatedLayer a]
> propagateNet input net = tail calcs
>   where calcs = scanl propagate layer0 (layers net)
>         layer0 = PropagatedSensorLayer $ validateInput net input
>
>         validateInput net = validateInputValues .
>                             validateInputDimensions net
>
>         validateInputDimensions net input =
>           if got == expected
>           then input
>           else error ("Input pattern has " ++ show got ++
>                       " bits, but " ++
>                       show expected ++ " were expected")
>           where got      = length input
>                 expected = (+(negate 1)) $
>                            length $
>                            head $
>                            layerWeights $
>                            head $
>                            layers net
>
>         validateInputValues input =
>           if (minimum input >= 0) && (maximum input <= 1)
>           then input
>           else error "Input bits outside of range [0,1]"


Note that we add a 1 to the inputs to each layer to give the bias.

> propagate :: (Floating a, Show a) =>
>              PropagatedLayer a ->
>              Layer a ->
>              PropagatedLayer a
> propagate layerJ layerK = result
>   where
>     result =
>       PropagatedLayer
>         {
>           propLayerIn         = layerJOut,
>           propLayerOut        = map f a,
>           propLayerWeights    = weights,
>           propLayerActFun     = layerFunction layerK
>         }
>     layerJOut = propLayerOut layerJ
>     weights   = layerWeights layerK
>     a = matMult weights (1:layerJOut)
>     f :: Floating a => a -> a
>     f = activationFunction $ layerFunction layerK

> evalNeuralNet :: (Floating a, Ord a, Show a) =>
>                  BackpropNet a -> [a] -> [a]
> evalNeuralNet net input = propLayerOut $ last calcs
>   where calcs = propagateNet input net


We define a cost function.

> costFn :: (Floating a, Ord a, Show a) =>
>           Int ->
>           Int ->
>           [a] ->
>           BackpropNet a ->
>           a
> costFn nDigits expectedDigit input net = 0.5 * sum (map (^2) diffs)
>   where
>     predicted = evalNeuralNet net input
>     diffs = zipWith (-) ((targets nDigits)!!expectedDigit) predicted

> targets :: Floating a => Int -> [[a]]
> targets nDigits = map row [0 .. nDigits - 1]
>   where
>     row m = concat [x, 1.0 : y]
>       where
>         (x, y) = splitAt m (take (nDigits - 1) $ repeat 0.0)


If instead we would rather perform gradient descent over the whole training set (rather than stochastically) then we can do so. Note that we do not regularize the weights for the biases.

> totalCostNN :: (Floating a, Ord a, Show a) =>
>                Int ->
>                V.Vector Int ->
>                V.Vector [a] ->
>                BackpropNet a ->
>                a
> totalCostNN nDigits expectedDigits inputs net = cost
>   where
>     cost = (a + delta * b) / l
>
>     l = fromIntegral $ V.length expectedDigits
>
>     a = V.sum $ V.zipWith (\expectedDigit input ->
>                             costFn nDigits expectedDigit input net)
>                           expectedDigits inputs
>
>     b = (/(2 * m)) $ sum $ map (^2) ws
>
>     m = fromIntegral $ length ws
>
>     ws = concat $ concat $
>          map stripBias $
>          extractWeights net
>
>     stripBias xss = map (drop 1) xss

> delTotalCostNN :: (Floating a, Ord a, Show a) =>
>                   Int ->
>                   V.Vector Int ->
>                   V.Vector [a] ->
>                   BackpropNet a ->
>                   BackpropNet a
> delTotalCostNN nDigits expectedDigits inputs = grad f
>   where
>     f net = totalCostNN nDigits expectedDigits
>                         (V.map (map auto) inputs) net

> stepOnceTotal :: Int ->
>                  Double ->
>                  V.Vector Int ->
>                  V.Vector [Double] ->
>                  BackpropNet Double ->
>                  BackpropNet Double
> stepOnceTotal nDigits gamma y x net =
>   net + fmap (* (negate gamma)) (delTotalCostNN nDigits y x net)


## Example I

Let’s try it out. First we need to generate some data. Rather arbitrarily let us create some populations from the beta distribution.

> betas :: Int -> Double -> Double -> [Double]
> betas n a b =
>   fst $ runState (replicateM n (sampleRVar (beta a b))) (mkStdGen seed)
>     where
>       seed = 0


We can plot the populations we wish to distinguish by sampling.

> a, b :: Double
> a          = 15
> b          = 6
> nSamples :: Int
> nSamples   = 100000
>
> sample0, sample1 :: [Double]
> sample0 = betas nSamples a b
> sample1 = betas nSamples b a

> mixSamples :: [Double] -> [Double] -> [(Double, Double)]
> mixSamples xs ys = unfoldr g ((map (0,) xs), (map (1,) ys))
>   where
>     g ([], [])         = Nothing
>     g ([],  _)         = Nothing
>     g ( _, [])         = Nothing
>     g ((x:xs), (y:ys)) = Just $ (x, (y:ys, xs))

> createSample :: V.Vector (Double, Double)
> createSample = V.fromList $ take 100 $ mixSamples sample1 sample0

> lRate :: Double
> lRate = 0.01
> actualTheta :: V.Vector Double
> actualTheta = V.fromList [0.0, 1.0]
> initTheta :: V.Vector Double
> initTheta = V.replicate (V.length actualTheta) 0.1

> logitAF :: ActivationFunction
> logitAF = ActivationFunction logit

> test1 :: IO ()
> test1 = do
>
>   let testNet = buildBackpropNet lRate [[[0.1, 0.1], [0.1, 0.1]]] logitAF
>
>   let vals :: V.Vector (Double, V.Vector Double)
>       vals = V.map (\(y, x) -> (y, V.fromList [1.0, x])) $ createSample
>
>   let gs = iterate (stepOnceCost gamma (V.map fst vals) (V.map snd vals))
>                    initTheta
>       theta = head $ drop 1000 gs
>   printf "Logistic regression: theta_0 = %5.3f, theta_1 = %5.3f\n"
>          (theta V.! 0) (theta V.! 1)
>
>   let us = V.map (round . fst) createSample
>   let vs = V.map snd createSample
>   let fs = iterate (stepOnceTotal 2 gamma us (V.map return vs)) testNet
>       phi = extractWeights $ head $ drop 1000 fs
>   printf "Neural network: theta_00 = %5.3f, theta_01 = %5.3f\n"
>          (((phi!!0)!!0)!!0) (((phi!!0)!!0)!!1)
>   printf "Neural network: theta_10 = %5.3f, theta_11 = %5.3f\n"
>          (((phi!!0)!!1)!!0) (((phi!!0)!!1)!!1)

ghci> test1
Logistic regression: theta_0 = -2.383, theta_1 = 4.852
Neural network: theta_00 = 2.386, theta_01 = -4.861
Neural network: theta_10 = -2.398, theta_11 = 4.886

## Example II

Now let’s try a neural net with 1 hidden layer using the data we prepared earlier. We seed the weights in the neural net with small random values; if we set all the weights to 0 then the gradient descent algorithm might get stuck.

> uniforms :: Int -> [Double]
> uniforms n =
>   fst $ runState (replicateM n (sampleRVar stdUniform)) (mkStdGen seed)
>     where
>       seed = 0

> randomWeightMatrix :: Int -> Int -> [[Double]]
> randomWeightMatrix numInputs numOutputs = y
>   where
>     y = chunksOf numInputs weights
>     weights = map (/ 100.0) $ uniforms (numOutputs * numInputs)

> w1, w2 :: [[Double]]
> w1 = randomWeightMatrix 2 2
> w2 = randomWeightMatrix 3 2

> initNet2 :: BackpropNet Double
> initNet2 = buildBackpropNet lRate [w1, w2] logitAF
>
> labels :: V.Vector Int
> labels = V.map (round . fst) createSample

> inputs :: V.Vector [Double]
> inputs = V.map (return . snd) createSample


Instead of hand-crafting gradient descent, let us use the library function: it performs better and saves us implementing the descent ourselves.

> estimates :: (Floating a, Ord a, Show a) =>
>              V.Vector Int ->
>              V.Vector [a] ->
>              BackpropNet a ->
>              [BackpropNet a]
> estimates y x = gradientDescent $
>                 \theta -> totalCostNN 2 y (V.map (map auto) x) theta


Now we can examine the weights of our fitted neural net and apply it to some test data.

> test2 :: IO ()
> test2 = do
>
>   let fs = estimates labels inputs initNet2
>   mapM_ putStrLn $ map (take 60) $
>                    map show $ extractWeights $
>                    head $ drop 1000 fs
>   putStrLn $ show $ evalNeuralNet (head $ drop 1000 fs) [0.1]
>   putStrLn $ show $ evalNeuralNet (head $ drop 1000 fs) [0.9]

ghci> test2
[[3.3809809537916933,-6.778365921046131],[-5.157492699008754
[[1.2771246165025043,5.294090869351353,-8.264801192310706],[
[0.997782987249909,2.216698813392053e-3]
[1.4853346509852003e-3,0.9985148392767443]

## Example III

Let’s try a more sophisticated example and create a population of 4 groups which we measure with 2 variables.

> c, d :: Double
> c = 15
> d = 8
> sample2, sample3 :: [Double]
> sample2 = betas nSamples c d
> sample3 = betas nSamples d c

> mixSamples3 :: Num t => [[a]] -> [(t, a)]
> mixSamples3 xss = concat $ transpose $
>                   zipWith (\n xs -> map (n,) xs)
>                           (map fromIntegral [0..])
>                           xss

> sample02, sample03, sample12, sample13 :: [(Double, Double)]
> sample02 = [(x, y) | x <- sample0, y <- sample2]
> sample03 = [(x, y) | x <- sample0, y <- sample3]
> sample12 = [(x, y) | x <- sample1, y <- sample2]
> sample13 = [(x, y) | x <- sample1, y <- sample3]

> createSample3 :: forall t. Num t => V.Vector (t, (Double, Double))
> createSample3 = V.fromList $ take 512 $
>                 mixSamples3 [ sample02
>                             , sample03
>                             , sample12
>                             , sample13
>                             ]


Rather annoyingly, picking random weights seemed to give a local but not global minimum. This may be a feature of having more nodes in the hidden layer than in the input layer. By fitting a neural net with no hidden layers to the data and using the outputs as inputs to fit another neural net with no hidden layers, we can get a starting point from which we can converge to the global minimum.
> w31, w32 :: [[Double]]
> w31 = [[-1.795626449637491,1.0687662199549477,0.6780994566671094],
>        [-0.8953174631646047,1.536931540024011,-1.7631220370122578],
>        [-0.4762453998497917,-2.005243268058972,1.2945899127545906],
>        [0.43019763097582875,-1.5711869072989957,-1.187180183656747]]
> w32 = [[-0.65116209142284,0.4837310591797774,-0.17870333721054968,
>         -0.6692619856605464,-1.062292154441557],
>        [-0.7521274440366631,-1.2071835415415136e-2,1.0078929981538551,
>         -1.3144243587577473,-0.5102027925579049],
>        [-0.7545728756863981,-0.4830112128458844,-1.2901624541811962,
>         1.0487049495446408,9.746209726152217e-3],
>        [-0.8576212271328413,-0.9035219951783956,-0.4034500456652809,
>         0.10091187689838758,0.781835908789879]]
>
> testNet3 :: BackpropNet Double
> testNet3 = buildBackpropNet lRate [w31, w32] logitAF

> labels3 :: V.Vector Int
> labels3 = V.map (round . fst) createSample3
> inputs3 :: V.Vector [Double]
> inputs3 = V.map ((\(x, y) -> [x, y]) . snd) createSample3


Now we use the library gradientDescent function to generate neural nets which ever better fit the data.

> estimates3 :: (Floating a, Ord a, Show a) =>
>               V.Vector Int ->
>               V.Vector [a] ->
>               BackpropNet a ->
>               [BackpropNet a]
> estimates3 y x = gradientDescent $
>                  \theta -> totalCostNN 4 y (V.map (map auto) x) theta


Finally we can fit a neural net and check that it correctly classifies some data.

> test3 :: IO ()
> test3 = do
>   let fs = drop 100 $ estimates3 labels3 inputs3 testNet3
>   mapM_ putStrLn $ map (take 60) $ map show $ extractWeights $ head fs
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.1, 0.1]
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.1, 0.9]
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.9, 0.1]
>   putStrLn $ take 60 $ show $ evalNeuralNet (head fs) [0.9, 0.9]

ghci> test3
[[-2.295896239599931,2.705409060274802,2.1377566388724047],[
[[-0.6169787627963551,2.5369568963968256,-0.3515306366626614
[2.638026636449198e-2,9.091308688841797e-2,0.373349222824566
[0.13674565454319784,1.128123133092104e-2,0.8525700090804755
[0.30731134024095474,0.8197492648500939,1.3704140162804749e-
[0.6773814649389487,0.22533958204471505,0.1957913744022863,4

## References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Rojas, R. 1996. Neural networks: a systematic introduction. Springer-Verlag New York Incorporated. http://books.google.co.uk/books?id=txsjjYzFJS4C.

## Cabal Hacking at OdHac

This is a very informal blog of the Cabal hacking team at OdHac: Sasha Manzyuk, Bram Schuur and me (Dominic Steinitz). I really enjoyed pair programming with both of them and I certainly wouldn’t have tracked down the bug I investigated without their help. Also a big thank you to Roman Cheplyaka for organising a great event.

## Cabal Bug Fixing I

### Ticket 823

The problem is indeed in the function Distribution.Simple.Utils.wrapText, which is used to wrap messages at 79 characters. No attempt is made to detect whether a space at which a line is wrapped is part of a file name. One can reproduce the problem more easily as follows:

$ cabal install http://hackage.haskell.org/packages/archive/groups/0.2.0.1/groups-0.2.0.1.tar.gz --prefix "/home/manzyuk/foo/bar/baz quux"

Resolving dependencies...
In order, the following will be installed:
groups-0.2.0.1 (reinstall)
Warning: Note that reinstalls are always dangerous. Continuing anyway...
Configuring groups-0.2.0.1...
Building groups-0.2.0.1...
Preprocessing library groups-0.2.0.1...
[1 of 1] Compiling Data.Group ( src/Data/Group.hs, dist/build/Data/Group.o )
In-place registering groups-0.2.0.1...
Installing library in /home/manzyuk/foo/bar/baz
quux/lib/groups-0.2.0.1/ghc-7.4.1
Registering groups-0.2.0.1...
Installed groups-0.2.0.1

Note that the line starting with “Installing library in” is wrapped at the space that is part of a directory name.

I suppose a proper way to fix it would be to use an algebraic data type rather than String to represent messages, something like Text.PrettyPrint.Doc, so that file names can be treated as unbreakable entities that shouldn’t be wrapped. This, however, seems to require quite some re-engineering of error handling in Cabal. Should I go ahead and do it?
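To see why a plain String representation makes this hard, here is a sketch of a naive greedy wrapper in the same spirit (my own simplification, not Cabal’s actual code, and the path below is made up): every space is a candidate break point, so a file name containing a space gets split.

```haskell
-- Greedy word-wrap: pack words onto a line until the width is
-- exceeded, then start a new line. Spaces inside file names are
-- indistinguishable from ordinary word breaks.
wrapNaive :: Int -> String -> [String]
wrapNaive width = go . words
  where
    go []       = []
    go (w : ws) = fill w ws

    fill line (w : ws)
      | length line + 1 + length w <= width = fill (line ++ " " ++ w) ws
    fill line ws = line : go ws

-- wrapNaive 16 "Installing library in /foo/baz quux/lib" splits the
-- (made-up) directory name "/foo/baz quux" across two lines.
```

A Doc-like representation could instead carry the path as a single unbreakable atom.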

2013-05-04 Sat: Mikhail Glushenkov commented: “I think it’s a minor issue and your efforts are better spent elsewhere.”

### Ticket 760

Lines beginning with ‘–’ are comments. To enable library profiling one has to uncomment the line

-- library-profiling: False

and change the value of the option to True:

library-profiling: True

This is somewhat confusing: an option can be in three different states: True, False, and commented out.

### Ticket 774

We can augment the tokenizer so that lines beginning with ‘>’ are left intact:

tokeniseLine :: (LineNo, Indent, HasTabs, String) -> [Token]
tokeniseLine (n0, i, t, l) = case split n0 l of
    (Span _ l':ss) -> Line n0 i t l' : ss
    cs             -> cs
  where split _ "" = []
        split n s@('>' : _) = [Line n i t s]
        split n s = case span (\c -> c /= '}' && c /= '{') s of
          ("", '{' : s') ->             OpenBracket  n : split n s'
          (w , '{' : s') -> mkspan n w (OpenBracket  n : split n s')
          ("", '}' : s') ->             CloseBracket n : split n s'
          (w , '}' : s') -> mkspan n w (CloseBracket n : split n s')
          (w ,        _) -> mkspan n w []

        mkspan n s ss | null s'   =             ss
                      | otherwise = Span n s' : ss
          where s' = trimTrailing (trimLeading s)

Is it a good / right solution? Am I overlooking anything?

2013-05-04 Sat: Roman has pointed out that Duncan Coutts announced he was rewriting the parser for .cabal files, so it probably doesn’t make sense to fix the old parser if it’s going to be replaced by the new one.

### Ticket 757

Can’t reproduce with Firefox 20.0 (Linux/x86_64).

### Ticket 735

I tried to reproduce this issue and I can’t. I tried to build NaturalSort-0.2.1 with GHC 7.4.1 and Cabal 1.16.03, and I would expect the build to fail due to this line in NaturalSort.cabal: Cabal-Version: >= 1.6 && < 1.9, but it does build, and the log file does contain some log info. No summary file has been produced though.

I also tried to build cal3d, which on my machine fails due to a missing C library, but again the log file contains the error message. The summary file is generated when I try to install the library from Hackage but is not generated when I manually download the source and try to build the library locally, which confirms the behavior observed in Issue #1189.

### Ticket 676

The culprit was the function simplifyCondTree, which traversed the CondTree in post-order, putting global options after the options defined in the branches of conditionals. I’ve changed that to a pre-order traversal.

Sent a pull request. Also, added a test demonstrating what’s changed.
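The difference is easiest to see on a toy tree (an illustration only, not Cabal’s actual CondTree type): a post-order flatten emits the root’s global options last, after everything contributed by the branches, while a pre-order flatten emits them first.

```haskell
-- Toy illustration (not Cabal's CondTree): the root carries the
-- global options, the children carry options from conditionals.
data Tree a = Node a [Tree a]

-- Post-order flattening: branch options come out before the global ones.
postOrder :: Tree [a] -> [a]
postOrder (Node xs ts) = concatMap postOrder ts ++ xs

-- Pre-order flattening: global (root) options come out first.
preOrder :: Tree [a] -> [a]
preOrder (Node xs ts) = xs ++ concatMap preOrder ts

-- Example: global -Wall with two conditional branches.
opts :: Tree [String]
opts = Node ["-Wall"] [Node ["-threaded"] [], Node ["-O2"] []]
```

Here postOrder opts yields the global option last, preOrder opts yields it first; since the order of options generally determines which ones take effect, the traversal order matters.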

### Ticket 885

Duplicate of Issue 774.

2013-05-04 Sat: Closed by Johan Tibell.

### Ticket 883

I don’t think this is a cabal bug.

I tried to reproduce this behaviour: I downloaded monad-par-0.3.4.1 and mangled the file Control/Monad/Par/Scheds/Direct.hs so that its top looked as follows:

{-# LANGUAGE RankNTypes, NamedFieldPuns, BangPatterns,
ExistentialQuantification, CPP, ScopedTypeVariables,
TypeSynonymInstances, MultiParamTypeClasses,
GeneralizedNewtypeDeriving, PackageImports,
ParallelListComp
#-} # OPTIONS_GHC
{- -Wall -fno-warn-name-shadowing -fno-warn-unused-do-bind #-}

Running GHC directly on this file gives the following error message:

Control/Monad/Par/Scheds/Direct.hs:6:18: parse error on input `#'

Building through cabal build produces the following error message:

Control/Monad/Par/Scheds/Direct.hs:1:1:
File name does not match module name:
Saw: `Main'
Expected: `Control.Monad.Par.Scheds.Direct'

However, running GHC on Control/Monad/Par.hs produces an identical error message.

I’m guessing that something like this is happening:

• when GHC is run directly on Control/Monad/Par/Scheds/Direct.hs, it tries to parse a module declaration, encounters # OPTIONS_GHC, decides that the file doesn’t have a module header (in particular, defaults the module name to Main), tries to parse # OPTIONS_GHC and fails;
• when GHC is run on Control/Monad/Par.hs, it expects to find a module named Control.Monad.Par.Scheds.Direct in the file Control/Monad/Par/Scheds/Direct.hs, but when it tries to parse it, it encounters # OPTIONS_GHC, and decides that the file doesn’t have a module header, defaults the module name to Main, and fails with a different error.

This theory is supported by the following experiment: replace the pragmas and # OPTIONS_GHC above with some code, e.g., foo = 3. Then running GHC on Control/Monad/Par/Scheds/Direct.hs produces

Control/Monad/Par/Scheds/Direct.hs:12:1:
parse error on input `module'

and running GHC on Control/Monad/Par.hs produces

Control/Monad/Par/Scheds/Direct.hs:1:1:
File name does not match module name:
Saw: `Main'
Expected: `Control.Monad.Par.Scheds.Direct'

The summary of the above is that this is not a cabal bug (and probably not a bug at all), and this ticket should be closed.

2013-05-04 Sat: Closed by Johan Tibell.

### Ticket 417

Is this still an issue? I see a comment in Cabal/Distribution/PackageDescription/Parse.hs:

-- "build-depends" is a local field now.  To be backwards
-- compatible, we still allow it as a global field in old-style
-- package description files and translate it to a local field by
-- adding it to every non-empty section

which makes me think that what Duncan is suggesting (“or it should automatically propagate such global options into each buildable component”) is already implemented.

2013-05-05 Sun: It seems I was wrong, investigating.

## Cabal Bug Fixing II

• Work on adding a --lowest-dependencies flag for cabal, which builds a package against its lowest dependencies. This is useful for package maintainers to see whether the lower bounds of a package are still valid. An initial version has been submitted as a pull request but the actual workings are still being tweaked.
• A small bugfix making the --package-db flag accept relative paths.

## Cabal Bug Fixing III

I took a look at Ticket 1297.

My first step was to grab the latest sources and run the tests.

bash-3.2$ cabal test
Running 2 test suites...
Test suite package-tests: RUNNING...
Cabal test suite - testing cabal version 1.17.0
BuildTestSuiteDetailedV09: : [Failed]
build failed!
expected: False
 but got: True
BuildDeps/InternalLibrary2: : [Failed]
executable should have linked with the internal library
expected: "myLibFunc internal"
 but got: ""
BuildDeps/InternalLibrary3: : [Failed]
executable should have linked with the internal library
expected: "myLibFunc internal"
 but got: ""
BuildDeps/InternalLibrary4: : [Failed]
executable should have linked with the installed library
expected: "myLibFunc installed"
 but got: ""
PackageTests/CMain: : [OK]

         Test Cases  Total
 Passed  20          20
 Failed   4           4
 Total   24          24

Test suite package-tests: FAIL
Test suite logged to: dist/test/Cabal-1.17.0-package-tests.log
Test suite unit-tests: RUNNING...
Test suite unit-tests: PASS
Test suite logged to: dist/test/Cabal-1.17.0-unit-tests.log
1 of 2 test suites (1 of 2 test cases) passed.

BuildTestSuiteDetailedV09 is a test which checks that Cabal testing is working correctly: it takes a small Cabal project with a test-suite stanza, builds it and runs the tests in the test suite. This test passes on Linux, so it looked like the problem was in the way cabal was running the tests. This turned out to be a red herring: I pulled the test out of the Cabal test suite and ran it on its own (obvious in hindsight). So the problem was in Cabal, not in the way Cabal was being tested.

bash-3.2$ ../cabal/cabal-install/dist/build/cabal/cabal configure --enable-tests --package-db=../cabal/Cabal/dist/package.conf.inplace
Warning: The package list for 'hackage.haskell.org' is 38 days old.
Run 'cabal update' to get the latest list of available packages.
Warning: my.cabal: A package using section syntax must specify at least
'cabal-version: >= 1.2'.
Resolving dependencies...
Warning: my.cabal: A package using section syntax must specify at least
'cabal-version: >= 1.2'.
Configuring BuildTestSuiteDetailedV09-0.1...
bash-3.2$ ../cabal/cabal-install/dist/build/cabal/cabal build
Warning: my.cabal: A package using section syntax must specify at least
'cabal-version: >= 1.2'.
Building BuildTestSuiteDetailedV09-0.1...
Preprocessing library BuildTestSuiteDetailedV09-0.1...
[1 of 1] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering BuildTestSuiteDetailedV09-0.1...
Preprocessing test suite 'dummy' for BuildTestSuiteDetailedV09-0.1...
[1 of 1] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
ar: dist/build/Dummy.o: No such file or directory

But why was Cabal complaining about a missing Dummy.o when it had clearly just created it? Sasha suggested using strace (http://en.wikipedia.org/wiki/Strace) but this didn’t exist on my Mac. However, I did have dtrace (http://hints.macworld.com/article.php?story=20071031121823710 and http://chihungchan.blogspot.com/2009/05/which-process-deleted-my-file.html).

Dominics-MacBook-Pro:Dummy dom$ dtrace -qn 'syscall::unlink:entry {printf("PID=%d, CMD=%s, FILE=%s\n", pid, curpsinfo->pr_psargs, copyinstr(arg0));}'
dtrace: failed to initialize dtrace: DTrace requires additional privileges
Dominics-MacBook-Pro:Dummy dom$ sudo dtrace -qn 'syscall::unlink:entry {printf("PID=%d, CMD=%s, FILE=%s\n", pid, curpsinfo->pr_psargs, copyinstr(arg0));}'
Password:
PID=16137, CMD=Emacs, FILE=/Users/dom/Dropbox/Private/.#Everything.org
PID=16137, CMD=Emacs, FILE=/Users/dom/Dropbox/Private/#Everything.org#
PID=82566, CMD=ar, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T//ar.ugfvvt
PID=82567, CMD=libtool, FILE=dist/build/libHSBuildTestSuiteDetailedV09-0.1.a
PID=82557, CMD=cabal, FILE=dist/build/libHSBuildTestSuiteDetailedV09-0.1.a
PID=82557, CMD=cabal, FILE=dist/build/libHSBuildTestSuiteDetailedV09-0.1_p.a
PID=82557, CMD=cabal, FILE=dist/build/libHSBuildTestSuiteDetailedV09-0.1-ghc7.6.2.dylib
PID=82557, CMD=cabal, FILE=dist/build/HSBuildTestSuiteDetailedV09-0.1.o
PID=82560, CMD=ghc, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T/ghc82560_0/ghc82560_0.c
PID=82560, CMD=ghc, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T/ghc82560_0/ghc82560_0.s
PID=82573, CMD=i686-apple-darwi, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T//ccnsroxB.s
PID=82562, CMD=i686-apple-darwi, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T//ccjXqsIc.s
PID=82571, CMD=ghc, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T/ghc82571_0/ghc82571_0.c
PID=82571, CMD=ghc, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T/ghc82571_0/ghc82571_0.s
PID=82576, CMD=as, FILE=dist/build/Dummy.o
PID=82577, CMD=ar, FILE=/var/folders/3p/m593dprn5snbjz6sc45c77f80000gn/T//ar.mlwmuT
PID=82557, CMD=cabal, FILE=dist/build/libdummy.a
PID=82557, CMD=cabal, FILE=dist/build/libdummy_p.a
PID=82557, CMD=cabal, FILE=dist/build/libdummy-ghc7.6.2.dylib
PID=82557, CMD=cabal, FILE=dist/build/dummy.o

I noticed I had a Dummy.o and a dummy.o. Also, the name of the test-suite was dummy and the name of the module in the Cabal package was Dummy. I tried changing the name of the test-suite stanza to Dummy.
No luck, but when I tried changing the name of the test-suite stanza to something completely different, everything worked.

bash-3.2$ ../cabal/cabal-install/dist/build/cabal/cabal build
Warning: my.cabal: A package using section syntax must specify at least
'cabal-version: >= 1.2'.
Building BuildTestSuiteDetailedV09-0.1...
Preprocessing library BuildTestSuiteDetailedV09-0.1...
[1 of 1] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering BuildTestSuiteDetailedV09-0.1...
Preprocessing test suite 'Urk' for BuildTestSuiteDetailedV09-0.1...
[1 of 1] Compiling Dummy            ( Dummy.hs, dist/build/Dummy.o )
In-place registering Urk-0.1...
[1 of 1] Compiling Main             ( dist/build/UrkStub/UrkStub-tmp/UrkStub.hs, dist/build/UrkStub/UrkStub-tmp/Main.o )
Linking dist/build/UrkStub/UrkStub ...

### Case Insensitive but Case Preserving

It turns out that most Macs are configured like this. You can check as shown below: the absence of “Case-sensitive” in the “Name (User Visible)” field means the file system is case insensitive.

bash-3.2\$ diskutil info disk0s2
Device Identifier:        disk0s2
Device Node:              /dev/disk0s2
Part of Whole:            disk0
Device / Media Name:      Customer

Volume Name:              Macintosh HD
Escaped with Unicode:     Macintosh%FF%FE%20%00HD

Mounted:                  Yes
Mount Point:              /
Escaped with Unicode:     /

File System Personality:  Journaled HFS+
Type (Bundle):            hfs
Name (User Visible):      Mac OS Extended (Journaled)
Journal:                  Journal size 40960 KB at offset 0x1238b000
Owners:                   Enabled

Partition Type:           Apple_HFS
OS Can Be Installed:      Yes
Media Type:               Generic
Protocol:                 SATA
SMART Status:             Verified

Total Size:               499.4 GB (499418034176 Bytes) (exactly 975425848 512-Byte-Blocks)
Volume Free Space:        427.8 GB (427780079616 Bytes) (exactly 835507968 512-Byte-Blocks)
Device Block Size:        512 Bytes

Solid State:              Yes
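This explains the clash: on a case-insensitive but case-preserving file system, dist/build/Dummy.o and dist/build/dummy.o name the same file, so building the test suite clobbered the library’s object file. A toy approximation of the comparison (simple case folding; the file system’s real comparison is Unicode-aware, this is only a sketch):

```haskell
import Data.Char (toLower)

-- Approximate a case-insensitive file system's name comparison by
-- simple case folding (a sketch; HFS+'s actual comparison is
-- Unicode-aware and more involved).
sameNameOnHFS :: FilePath -> FilePath -> Bool
sameNameOnHFS a b = map toLower a == map toLower b
```

Under this comparison the library object and the test-suite object collide, while an unrelated name such as the renamed test-suite’s object does not.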