Fun with (Kalman) Filters Part II

Introduction

Suppose we have particle moving in at constant velocity in 1 dimension, where the velocity is sampled from a distribution. We can observe the position of the particle at fixed intervals and we wish to estimate its initial velocity. For generality, let us assume that the positions and the velocities can be perturbed at each interval and that our measurements are noisy.

A point of Haskell interest: using type level literals caught a bug in the mathematical description (one of the dimensions of a matrix was incorrect). Of course, this would have become apparent at run-time but proof checking of this nature is surely the future for mathematicians. One could conceive of writing an implementation of an algorithm or proof, compiling it but never actually running it purely to check that some aspects of the algorithm or proof are correct.

The Mathematical Model

We take the position as x_i and the velocity v_i:

\displaystyle  \begin{aligned}  x_i &= x_{i-1} + \Delta T v_{i-1} + \psi^{(x)}_i \\  v_i &= v_{i-1} + \psi^{(v)}_i \\  y_i &= a_i x_i + \upsilon_i  \end{aligned}

where \psi^{(x)}_i, \psi^{(v)}_i and \upsilon_i are all IID normal with means of 0 and variances of \sigma^2_x, \sigma^2_v and \sigma^2_y

We can re-write this as

\displaystyle  \begin{aligned}  \boldsymbol{x}_i &= \boldsymbol{A}_{i-1}\boldsymbol{x}_{i-1} + \boldsymbol{\psi}_{i-1} \\  \boldsymbol{y}_i &= \boldsymbol{H}_i\boldsymbol{x}_i + \boldsymbol{\upsilon}_i  \end{aligned}

where

\displaystyle  \boldsymbol{A}_i =  \begin{bmatrix}  1 & \Delta T\\  0 & 1\\  \end{bmatrix}  ,\quad  \boldsymbol{H}_i =  \begin{bmatrix}  a_i & 0 \\  \end{bmatrix}  ,\quad  \boldsymbol{\psi}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(x)}_i\big)  ,\quad  \boldsymbol{\Sigma}^{(x)}_i =  \begin{bmatrix}  \sigma^2_{x} & 0\\  0 & \sigma^2_{v} \\  \end{bmatrix}  ,\quad  \boldsymbol{\upsilon}_i \sim {\cal{N}}\big(0,\boldsymbol{\Sigma}^{(y)}_i\big)  ,\quad  \boldsymbol{\Sigma}^{(y)}_i =  \begin{bmatrix}  \sigma^2_{z} \\  \end{bmatrix}

Let us denote the mean and variance of \boldsymbol{X}_i\,\vert\,\boldsymbol{Y}_{i-1} as \hat{\boldsymbol{x}}^\flat_i and \hat{\boldsymbol{\Sigma}}^\flat_i respectively and note that

\displaystyle  \begin{aligned}  {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{H}_i\boldsymbol{X}_i\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Upsilon}_i}  \end{aligned}

Since {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} and {\boldsymbol{Y}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} are jointly Gaussian and recalling that ({\hat{\boldsymbol{\Sigma}}^\flat_i})^\top = \hat{\boldsymbol{\Sigma}}^\flat_i as covariance matrices are symmetric, we can calculate their mean and covariance matrix as

\displaystyle  \begin{bmatrix}  \hat{\boldsymbol{x}}^\flat_i \\  \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i  \end{bmatrix}  ,\quad  \begin{bmatrix}  \hat{\boldsymbol{\Sigma}}^\flat_i & \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top \\  \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i & \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \end{bmatrix}

We can now use standard formulæ which say if

\displaystyle  \begin{bmatrix}  \boldsymbol{X} \\  \boldsymbol{Y}  \end{bmatrix}  \sim  {\cal{N}}  \begin{bmatrix}  \begin{bmatrix}  \boldsymbol{\mu}_x \\  \boldsymbol{\mu}_y  \end{bmatrix}  &  ,  &  \begin{bmatrix}  \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\  \boldsymbol{\Sigma}^\top_{xy} & \boldsymbol{\Sigma}_y  \end{bmatrix}  \end{bmatrix}

then

\displaystyle  \boldsymbol{X}\,\vert\,\boldsymbol{Y}=\boldsymbol{y} \sim {{\cal{N}}\big( \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y(\boldsymbol{y} - \boldsymbol{\mu}_y) , \boldsymbol{\Sigma}_x - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1}_y\boldsymbol{\Sigma}^\top_{xy}\big)}

and apply this to

\displaystyle  (\boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i-1})\,\vert\,(\boldsymbol{Y}_i\,\vert\, \boldsymbol{Y}_{i-1})

to give

\displaystyle  \boldsymbol{X}_i\,\vert\, \boldsymbol{Y}_{i} = \boldsymbol{y}_i  \sim  {{\cal{N}}\big( \hat{\boldsymbol{x}}^\flat_i + \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top  \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1}  (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) , \hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i\big)}

This is called the measurement update; more explicitly

\displaystyle  \begin{aligned}  \hat{\boldsymbol{x}}^i &\triangleq  \hat{\boldsymbol{x}}^\flat_i +  \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top  \big(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i\big)^{-1}  (\boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i) \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq  {\hat{\boldsymbol{\Sigma}}^\flat_i - \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top(\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i)^{-1}\boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i}  \end{aligned}

Sometimes the measurement residual \boldsymbol{v}_i, the measurement prediction covariance \boldsymbol{S}_i and the filter gain \boldsymbol{K}_i are defined and the measurement update is written as

\displaystyle  \begin{aligned}  \boldsymbol{v}_i & \triangleq  \boldsymbol{y}_i - \boldsymbol{H}_i\hat{\boldsymbol{x}}^\flat_i \\  \boldsymbol{S}_i & \triangleq  \boldsymbol{H}_i \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top + \boldsymbol{\Sigma}^{(y)}_i \\  \boldsymbol{K}_i & \triangleq \hat{\boldsymbol{\Sigma}}^\flat_i  \boldsymbol{H}_i^\top\boldsymbol{S}^{-1}_i \\  \hat{\boldsymbol{x}}^i &\triangleq \hat{\boldsymbol{x}}^\flat_i + \boldsymbol{K}_i\boldsymbol{v}_i \\  \hat{\boldsymbol{\Sigma}}_i &\triangleq \hat{\boldsymbol{\Sigma}}^\flat_i - \boldsymbol{K}_i\boldsymbol{S}_i\boldsymbol{K}^\top_i  \end{aligned}

We further have that

\displaystyle  \begin{aligned}  {\boldsymbol{X}_i}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_{i-1}}\,\vert\,{\boldsymbol{Y}_{i-1}} =  {\boldsymbol{A}_i\boldsymbol{X}_{i-1}\,\vert\,{\boldsymbol{Y}_{i-1}} + \boldsymbol{\Psi}_i}  \end{aligned}

We thus obtain the Kalman filter prediction step:

\displaystyle  \begin{aligned}  \hat{\boldsymbol{x}}^\flat_i &=  \boldsymbol{A}_{i-1}\hat{\boldsymbol{x}}_{i-1} \\  \hat{\boldsymbol{\Sigma}}^\flat_i &= \boldsymbol{A}_{i-1}  \hat{\boldsymbol{\Sigma}}_{i-1}  \boldsymbol{A}_{i-1}^\top  + \boldsymbol{\Sigma}^{(x)}_{i-1}  \end{aligned}

Further information can be found in (Boyd 2008), (Kleeman 1996) and (Särkkä 2013).

A Haskell Implementation

The hmatrix now uses type level literals via the DataKind extension in ghc to enforce compatibility of matrix and vector operations at the type level. See here for more details. Sadly a bug in the hmatrix implementation means we can’t currently use this excellent feature and we content ourselves with comments describing what the types would be were it possible to use it.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE DataKinds                    #-}
> {-# LANGUAGE ScopedTypeVariables          #-}
> {-# LANGUAGE RankNTypes                   #-}
> module FunWithKalmanPart1a where
> import Numeric.LinearAlgebra.HMatrix hiding ( outer )
> import Data.Random.Source.PureMT
> import Data.Random hiding ( gamma )
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import Control.Monad.Loops

Let us make our model almost deterministic but with noisy observations.

> stateVariance :: Double
> stateVariance = 1e-6
> obsVariance :: Double
> obsVariance = 1.0

And let us start with a prior normal distribution with a mean position and velocity of 0 with moderate variances and no correlation.

> -- muPrior :: R 2
> muPrior :: Vector Double
> muPrior = vector [0.0, 0.0]
> -- sigmaPrior :: Sq 2
> sigmaPrior :: Matrix Double
> sigmaPrior = (2 >< 2) [ 1e1,   0.0
>                       , 0.0,   1e1
>                       ]

We now set up the parameters for our model as outlined in the preceeding section.

> deltaT :: Double
> deltaT = 0.001
> -- bigA :: Sq 2
> bigA :: Matrix Double
> bigA = (2 >< 2) [ 1, deltaT
>                 , 0,      1
>                 ]
> a :: Double
> a = 1.0
> -- bigH :: L 1 2
> bigH :: Matrix Double
> bigH = (1 >< 2) [ a, 0
>                 ]
> -- bigSigmaY :: Sq 1
> bigSigmaY :: Matrix Double
> bigSigmaY = (1 >< 1) [ obsVariance ]
> -- bigSigmaX :: Sq 2
> bigSigmaX :: Matrix Double
> bigSigmaX = (2 >< 2) [ stateVariance, 0.0
>                      , 0.0,           stateVariance
>                      ]

The implementation of the Kalman filter using the hmatrix package is straightforward.

> -- outer ::  forall m n . (KnownNat m, KnownNat n) =>
> --           R n -> Sq n -> L m n -> Sq m -> Sq n -> Sq n -> [R m] -> [(R n, Sq n)]
> outer :: Vector Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> Matrix Double
>          -> [Vector Double]
>          -> [(Vector Double, Matrix Double)]
> outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX ys = result
>   where
>     result = scanl update (muPrior, sigmaPrior) ys
> 
>     -- update :: (R n, Sq n) -> R m -> (R n, Sq n)
>     update (xHatFlat, bigSigmaHatFlat) y =
>       (xHatFlatNew, bigSigmaHatFlatNew)
>       where
>         -- v :: R m
>         v = y - bigH #> xHatFlat
>         -- bigS :: Sq m
>         bigS = bigH <> bigSigmaHatFlat <> (tr bigH) + bigSigmaY
>         -- bigK :: L n m
>         bigK = bigSigmaHatFlat <> (tr bigH) <> (inv bigS)
>         -- xHat :: R n
>         xHat = xHatFlat + bigK #> v
>         -- bigSigmaHat :: Sq n
>         bigSigmaHat = bigSigmaHatFlat - bigK <> bigS <> (tr bigK)
>         -- xHatFlatNew :: R n
>         xHatFlatNew = bigA #> xHat
>         -- bigSigmaHatFlatNew :: Sq n
>         bigSigmaHatFlatNew = bigA <> bigSigmaHat <> (tr bigA) + bigSigmaX

We create some ranodm data using our model parameters.

> singleSample ::(Double, Double) ->
>                RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> singleSample (xPrev, vPrev) = do
>   psiX <- rvarT (Normal 0.0 stateVariance)
>   let xNew = xPrev + deltaT * vPrev + psiX
>   psiV <- rvarT (Normal 0.0 stateVariance)
>   let vNew = vPrev + psiV
>   upsilon <- rvarT (Normal 0.0 obsVariance)
>   let y = a * xNew + upsilon
>   lift $ W.tell [(y, (xNew, vNew))]
>   return (xNew, vNew)
> streamSample :: RVarT (W.Writer [(Double, (Double, Double))]) (Double, Double)
> streamSample = iterateM_ singleSample (1.0, 1.0)
> samples :: ((Double, Double), [(Double, (Double, Double))])
> samples = W.runWriter (evalStateT (sample streamSample) (pureMT 2))

Here are the actual values of the randomly generated positions.

> actualXs :: [Double]
> actualXs = map (fst . snd) $ take nObs $ snd samples
> test :: [(Vector Double, Matrix Double)]
> test = outer muPrior sigmaPrior bigH bigSigmaY bigA bigSigmaX
>        (map (\x -> vector [x]) $ map fst $ snd samples)

And using the Kalman filter we can estimate the positions.

> estXs :: [Double]
> estXs = map (!!0) $ map toList $ map fst $ take nObs test
> nObs :: Int
> nObs = 1000

And we can see that the estimates track the actual positions quite nicely.

Of course we really wanted to estimate the velocity.

> actualVs :: [Double]
> actualVs = map (snd . snd) $ take nObs $ snd samples
> estVs :: [Double]
> estVs = map (!!1) $ map toList $ map fst $ take nObs test

Bibliography

Boyd, Stephen. 2008. “EE363 Linear Dynamical Systems.” http://stanford.edu/class/ee363.

Kleeman, Lindsay. 1996. “Understanding and Applying Kalman Filtering.” In Proceedings of the Second Workshop on Perceptive Systems, Curtin University of Technology, Perth Western Australia (25-26 January 1996).

Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. Vol. 3. Cambridge University Press.

Fun with (Kalman) Filters Part I

Suppose we wish to estimate the mean of a sample drawn from a normal distribution. In the Bayesian approach, we know the prior distribution for the mean (it could be a non-informative prior) and then we update this with our observations to create the posterior, the latter giving us improved information about the distribution of the mean. In symbols

\displaystyle p(\theta \,\vert\, x) \propto p(x \,\vert\, \theta)p(\theta)

Typically, the samples are chosen to be independent, and all of the data is used to perform the update but, given independence, there is no particular reason to do that, updates can performed one at a time and the result is the same; nor is the order of update important. Being a bit imprecise, we have

\displaystyle p(z \,\vert\, x, y) = p(z, x, y)p(x, y) = p(z, x, y)p(x)p(y) = p((z \,\vert\, x) \,\vert\, y) = p((z \,\vert\, y) \,\vert\, x)

The standard notation in Bayesian statistics is to denote the parameters of interest as \theta \in \mathbb{R}^p and the observations as x \in \mathbb{R}^n. For reasons that will become apparent in later blog posts, let us change notation and label the parameters as x and the observations as y.

Let us take a very simple example of a prior X \sim {\cal{N}}(0, \sigma^2) where \sigma^2 is known and then sample from a normal distribution with mean x and variance for the i-th sample c_i^2 where c_i is known (normally we would not know the variance but adding this generality would only clutter the exposition unnecessarily).

\displaystyle p(y_i \,\vert\, x) = \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)

The likelihood is then

\displaystyle p(\boldsymbol{y} \,\vert\, x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi c_i^2}}\exp\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg)

As we have already noted, instead of using this with the prior to calculate the posterior, we can update the prior with each observation separately. Suppose that we have obtained the posterior given i - 1 samples (we do not know this is normally distributed yet but we soon will):

\displaystyle p(x \,\vert\, y_1,\ldots,y_{i-1}) = {\cal{N}}(\hat{x}_{i-1}, \hat{\sigma}^2_{i-1})

Then we have

\displaystyle \begin{aligned} p(x \,\vert\, y_1,\ldots,y_{i}) &\propto p(y_i \,\vert\, x)p(x \,\vert\, y_1,\ldots,y_{i-1}) \\ &\propto \exp-\bigg(\frac{(y_i - x)^2}{2c_i^2}\bigg) \exp-\bigg(\frac{(x - \hat{x}_{i-1})^2}{2\hat{\sigma}_{i-1}^2}\bigg) \\ &\propto \exp-\Bigg(\frac{x^2}{c_i^2} - \frac{2xy_i}{c_i^2} + \frac{x^2}{\hat{\sigma}_{i-1}^2} - \frac{2x\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg) \\ &\propto \exp-\Bigg( x^2\Bigg(\frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}\Bigg) - 2x\Bigg(\frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}\Bigg)\Bigg) \end{aligned}

Writing

\displaystyle \frac{1}{\hat{\sigma}_{i}^2} \triangleq \frac{1}{c_i^2} + \frac{1}{\hat{\sigma}_{i-1}^2}

and then completing the square we also obtain

\displaystyle \frac{\hat{x}_{i}}{\hat{\sigma}_{i}^2} \triangleq \frac{y_i}{c_i^2} + \frac{\hat{x}_{i-1}}{\hat{\sigma}_{i-1}^2}

More Formally

Now let’s be a bit more formal about conditional probability and use the notation of \sigma-algebras to define {\cal{F}}_i = \sigma\{Y_1,\ldots, Y_i\} and M_i \triangleq \mathbb{E}(X \,\vert\, {\cal{F}}_i) where Y_i = X + \epsilon_i, X is as before and \epsilon_i \sim {\cal{N}}(0, c_k^2). We have previously calculated that M_i = \hat{x}_i and that {\cal{E}}((X - M_i)^2 \,\vert\, Y_1, \ldots Y_i) = \hat{\sigma}_{i}^2 and the tower law for conditional probabilities then allows us to conclude {\cal{E}}((X - M_i)^2) = \hat{\sigma}_{i}^2. By Jensen’s inequality, we have

\displaystyle {\cal{E}}(M_i^2) = {\cal{E}}({\cal{E}}(X \,\vert\, {\cal{F}}_i)^2)) \leq {\cal{E}}({\cal{E}}(X^2 \,\vert\, {\cal{F}}_i))) = {\cal{E}}(X^2) = \sigma^2

Hence M is bounded in L^2 and therefore converges in L^2 and almost surely to M_\infty \triangleq {\cal{E}}(X \,\vert\, {\cal{F}}_\infty). The noteworthy point is that if M_\infty = X if and only if \hat{\sigma}_i converges to 0. Explicitly we have

\displaystyle \frac{1}{\hat{\sigma}_i^2} = \frac{1}{\sigma^2} + \sum_{k=1}^i\frac{1}{c_k^2}

which explains why we took the observations to have varying and known variances. You can read more in Williams’ book (Williams 1991).

A Quick Check

We have reformulated our estimation problem as a very simple version of the celebrated Kalman filter. Of course, there are much more interesting applications of this but for now let us try “tracking” the sample from the random variable.

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> module FunWithKalmanPart1 (
>     obs
>   , nObs
>   , estimates
>   , uppers
>   , lowers
>   ) where
> 
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> var, cSquared :: Double
> var       = 1.0
> cSquared  = 1.0
> 
> nObs :: Int
> nObs = 100
> createObs :: RVar (Double, [Double])
> createObs = do
>   x <- rvar (Normal 0.0 var)
>   ys <- replicateM nObs $ rvar (Normal x cSquared)
>   return (x, ys)
> 
> obs :: (Double, [Double])
> obs = evalState (sample createObs) (pureMT 2)
> 
> updateEstimate :: (Double, Double) -> (Double, Double) -> (Double, Double)
> updateEstimate (xHatPrev, varPrev) (y, cSquared) = (xHatNew, varNew)
>   where
>     varNew  = recip (recip varPrev + recip cSquared)
>     xHatNew = varNew * (y / cSquared + xHatPrev / varPrev)
> 
> estimates :: [(Double, Double)]
> estimates = scanl updateEstimate (y, cSquared) (zip ys (repeat cSquared))
>   where
>     y  = head $ snd obs
>     ys = tail $ snd obs
> 
> uppers :: [Double]
> uppers = map (\(x, y) -> x + 3 * (sqrt y)) estimates
> 
> lowers :: [Double]
> lowers = map (\(x, y) -> x - 3 * (sqrt y)) estimates

Bibliography

Williams, David. 1991. Probability with Martingales. Cambridge University Press.

Gibbs Sampling in Haskell

This is really intended as a draft chapter for our book. Given the diverse natures of the intended intended audiences, it is probably a bit light on explanation of the Haskell (use of monad transformers) for those with a background in numerical methods. It is hoped that the explanation of the mathematics is adequate for those with a background in Haskell but not necessarily in numerical methods. As always, any feedback is gratefully accepted.

Introduction

Imagine an insect, a grasshopper, trapped on the face of a clock which wants to visit each hour an equal number of times. However, there is a snag: it can only see the value of the hour it is on and the value of the hours immediately anti-clockwise and immediately clockwise. For example, if it is standing on 5 then it can see the 5, the 4, and the 6 but no others.

It can adopt the following strategy: toss a fair coin and move anti-clockwise for a head and move clockwise for a tail. Intuition tells us that over a large set of moves the grasshopper will visit each hour (approximately) the same number of times.

Can we confirm our intuition somehow? Suppose that the strategy has worked and the grasshopper is now to be found with equal probability on any hour. Then at the last jump, the grasshopper must either have been at the hour before the one it is now on or it must have been at the hour after the one it is now on. Let us denote the probability that the grasshopper is on hour n by \pi(n) and the (conditional) probability that the grasshopper jumps to state n given it was in state m by p(n \, |\, m). Then we have

\displaystyle   \pi'(n) = p(n \, |\, n - 1)\pi(n - 1) + p(n \, |\, n + 1)\pi(n + 1)

Substituting in where N is a normalising constant (12 in this case) we obtain

\displaystyle   \pi'(n) = \frac{1}{2}\frac{1}{N} + \frac{1}{2}\frac{1}{N} = \frac{1}{N}

This tells us that the required distribution is a fixed point of the grasshopper’s strategy. But does the strategy actually converge to the fixed point? Let us perform an experiment.

First we import some modules from hmatrix.

> {-# LANGUAGE FlexibleContexts #-}
> module Chapter1 where
> import Data.Packed.Matrix
> import Numeric.LinearAlgebra.Algorithms
> import Numeric.Container
> import Data.Random
> import Control.Monad.State
> import qualified Control.Monad.Writer as W
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT

Let us use a clock with 5 hours to make the matrices sufficiently small to fit on one page.

Here is the strategy encoded as a matrix. For example the first row says jump to position 1 with probablity 0.5 or jump to position 5 with probability 0.5.

> eqProbsMat :: Matrix Double
> eqProbsMat = (5 >< 5)
>         [ 0.0, 0.5, 0.0, 0.0, 0.5
>         , 0.5, 0.0, 0.5, 0.0, 0.0
>         , 0.0, 0.5, 0.0, 0.5, 0.0
>         , 0.0, 0.0, 0.5, 0.0, 0.5
>         , 0.5, 0.0, 0.0, 0.5, 0.0
>         ]

We suppose the grasshopper starts at 1 o’clock.

> startOnOne :: Matrix Double
> startOnOne = ((1 >< 5) [1.0, 0.0, 0.0, 0.0, 0.0])

If we allow the grasshopper to hop 1000 times then we see that it is equally likely to be found on any hour hand with a 20% probability.

ghci> eqProbsMat
  (5><5)
   [ 0.0, 0.5, 0.0, 0.0, 0.5
   , 0.5, 0.0, 0.5, 0.0, 0.0
   , 0.0, 0.5, 0.0, 0.5, 0.0
   , 0.0, 0.0, 0.5, 0.0, 0.5
   , 0.5, 0.0, 0.0, 0.5, 0.0 ]

ghci> take 1 $ drop 1000  $ iterate (<> eqProbsMat) startOnOne
  [(1><5)
   [ 0.20000000000000007, 0.2, 0.20000000000000004, 0.20000000000000004, 0.2 ]]

In this particular case, the strategy does indeed converge.

Now suppose the grasshopper wants to visit each hour in proportion the value of the number on the hour. Lacking pen and paper (and indeed opposable thumbs), it decides to adopt the following strategy: toss a fair coin as in the previous strategy but only move if the number is larger than the one it is standing on; if, on the other hand, the number is smaller then choose a number at random from between 0 and 1 and move if this value is smaller than the ratio of the proposed hour and the hour on which it is standing otherwise stay put. For example, if the grasshopper is standing on 5 and gets a tail then it will move to 6 but if it gets a head then four fifths of the time it will move to 4 but one fifth of the time it will stay where it is.

Suppose that the strategy has worked (it is not clear that is has) and the grasshopper is now to be found at 12 o’clock 12 times as often as at 1 o’clock, at 11 o’clock 11 times as often as at 1 o’clock, etc. Then at the last jump, the grasshopper must either have been at the hour before the one it is now on, the hour after the one it is now on or the same hour it is now on. Let us denote the probability that the grasshopper is on hour n by \pi(n).

\displaystyle   \pi'(n) = p(n \, |\, n - 1)\pi(n - 1) + p(n \, |\, n)\pi(n) + p(n \, |\, n + 1)\pi(n + 1)

Substituting in at 4 say

\displaystyle   \begin{aligned}  \pi'(4) &= \frac{1}{2}\pi(3) +             \frac{1}{2}\frac{1}{4}\pi(4) +             \frac{1}{2}\frac{4}{5}\pi(5) \\          &= \frac{1}{2}\bigg(\frac{3}{N} + \frac{1}{4}\frac{4}{N} + \frac{4}{5}\frac{5}{N}\bigg) \\          &= \frac{1}{N}\frac{8}{2} \\          &= \frac{4}{N} \\          &= \pi(4)  \end{aligned}

The reader can check that this relationship holds for all other hours. This tells us that the required distribution is a fixed point of the grasshopper’s strategy. But does this strategy actually converge to the fixed point?

Again, let us use a clock with 5 hours to make the matrices sufficiently small to fit on one page.

Here is the strategy encoded as a matrix. For example the first row says jump to position 1 with probablity 0.5 or jump to position 5 with probability 0.5.

> incProbsMat :: Matrix Double
> incProbsMat = scale 0.5 $
>   (5 >< 5)
>     [ 0.0,         1.0,     0.0,        0.0, 1.0
>     , 1.0/2.0, 1.0/2.0,     1.0,        0.0, 0.0
>     , 0.0,     2.0/3.0, 1.0/3.0,        1.0, 0.0
>     , 0.0,         0.0, 3.0/4.0,    1.0/4.0, 1.0
>     , 1.0/5.0,     0.0,     0.0,    4.0/5.0, 1.0/5.0 + 4.0/5.0
>     ]

We suppose the grasshopper starts at 1 o’clock.

If we allow the grasshopper to hop 1000 times then we see that it is equally likely to be found on any hour hand n with a probability of n times the probability of being found on 1.

ghci> incProbsMat
  (5><5)
   [  0.0,                0.5,                 0.0,   0.0, 0.5
   , 0.25,               0.25,                 0.5,   0.0, 0.0
   ,  0.0, 0.3333333333333333, 0.16666666666666666,   0.5, 0.0
   ,  0.0,                0.0,               0.375, 0.125, 0.5
   ,  0.1,                0.0,                 0.0,   0.4, 0.5 ]

ghci> take 1 $ drop 1000  $ iterate (<> incProbsMat) startOnOne
  [(1><5)
   [ 6.666666666666665e-2, 0.1333333333333333, 0.19999999999999996, 0.2666666666666666, 0.33333333333333326 ]]

In this particular case, the strategy does indeed converge.

Surprisingly, this strategy produces the desired result and is known as the Metropolis Algorithm. What the grasshopper has done is to construct a (discrete) Markov Process which has a limiting distribution (the stationary distribution) with the desired feature: sampling from this process will result in each hour being sampled in proportion to its value.

Markov Chain Theory

Let us examine what is happening in a bit more detail.

The grasshopper has started with a very simple Markov Chain: one which jumps clockwise or anti-clockwise with equal probability and then modified it. But what is a Markov Chain?

A time homogeneous Markov chain is a countable sequence of random variables
X_0, X_1, \ldots such that

\displaystyle   \mathbb{P} (X_{n+1} = j \,|\, X_0 = i_0, X_1 = i_1, \dots X_n = i) = \mathbb{P} (X_{n+1} = j \,|\, X_n = i)

We sometimes say that a Markov Chain is discrete time stochastic process with the above property.

So the very simple Markov Chain can be described by

\displaystyle   q(i, j) =  \begin{cases}  \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = \frac{1}{2} & \text{if } j = i + 1 \mod N \\  \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = \frac{1}{2} & \text{if } j = i - 1 \mod N \\  \mathbb{P} (X_{n+1} = j \,|\, X_n = i) = 0 & \text{otherwise }  \end{cases}

The grasshopper knows that \pi(i) = i/N so it can calculate \pi(j)/\pi(i) = j/i without knowing N. This is important because now, without knowing N, the grasshopper can evaluate

\displaystyle   p(i, j) =  \begin{cases}  q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\  1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(k) q(k,i)}{\pi(i) q(i,k)} \land 1 \bigg] & \text{if } j = i  \end{cases}

where \land takes the maximum of its arguments. Simplifying the above by substituing in the grasshopper’s probabilities and noting that j = i \pm 1 \mod N is somewhat obscure way of saying jump clockwise or anti-clockwise we obtain

\displaystyle   q(i, j) =  \begin{cases}  \frac{1}{2} (\frac{j}{i} \land 1) & \text{if } j \text{ is 1 step clockwise} \\  \frac{1}{2} (\frac{j}{i} \land 1) & \text{if } j \text{ is 1 step anti-clockwise} \\  1 - \frac{1}{2}(\frac{j^c}{i} \land 1) - \frac{1}{2}(\frac{j^a}{i} \land 1) & \text{if } j = i \text{ and } j^c \text{ is one step clockwise and } j^a \text{ is one step anti-clockwise} \\  0 & \text{otherwise}  \end{cases}

The Ergodic Theorem

In most studies of Markov chains, one is interested in whether a chain has a stationary distribution. What we wish to do is take a distribution and create a chain with this distribution as its stationary distribution. We will still need to show that our chain does indeed have the correct stationary distribution and we state the relevant theorem somewhat informally and with no proof.

Theorem

An irreducible, aperiodic and positive recurrent Markov chain has a unique stationary distribution.

Roughly speaking

  • Irreducible means it is possible to get from any state to any other state.

  • Aperiodic means that returning to a state having started at that state occurs at irregular times.

  • Positive recurrent means that the first time to hit a state is finite (for every state and more pedantically except on sets of null measure).

Note that the last condition is required when the state space is infinite – see Skrikant‘s lecture notes for an example and also for a more formal definition of the theorem and its proof.

Algorithm

Let \pi be a probability distribution on the state space \Omega with \pi(i) > 0 for all i and let (Q, \pi_0) be an ergodic Markov chain on \Omega with transition probabilities q(i,j) > 0 (the latter condition is slightly stronger than it need be but we will not need fully general conditions).

Create a new (ergodic) Markov chain with transition probabilities

\displaystyle   p_{ij} =  \begin{cases}  q(i,j)\bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j \ne i \\  1 - \sum_{k : k \ne i} q(i,k) \bigg[\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)} \land 1 \bigg] & \text{if } j = i  \end{cases}

where \land takes the maximum of its arguments.

Calculate the value of interest on the state space e.g. the total magnetization for each step produced by this new chain.

Repeat a sufficiently large number of times and take the average. This gives the estimate of the value of interest.

Convergence

Let us first note that the Markov chain produced by this algorithm almost trivially satisfies the detailed balance condition, for example,

\displaystyle   \begin{aligned}  \pi(i) q(i,j)\bigg[\frac{\pi(j) q(j, i)}{\pi(i)q(i,j)} \land 1\bigg]  &= \pi(i)q(i,j) \land \pi(j)q(j,i) \\  &= \pi(j)q(j,i)\bigg[\frac{\pi(i) q(i, j)}{\pi(j)q(j,i)} \land 1\bigg]  \end{aligned}

Secondly since we have specified that (Q, \pi_0) is ergodic then clearly (P, \pi_0) is also ergodic (all the transition probabilities are > 0).

So we know the algorithm will converge to the unique distribution we specified to provide estimates of values of interest.

Gibbs Sampling

Random Scan

For simplicity let us consider a model with two parameters and that we sample from either parameter with equal probability. In this sampler, We update the parameters in a single step.

\displaystyle   \begin{cases}  \text{Sample } \theta_1^{(i+1)} \sim \pi(\theta_1 \,\big|\, \theta_2^{(i)}) & \text{with probability } \frac{1}{2} \\  \text{Sample } \theta_2^{(i+1)} \sim \pi(\theta_2 \,\big|\, \theta_1^{(i)}) & \text{with probability } \frac{1}{2}  \end{cases}

The transition density kernel is then given by

\displaystyle   q\big(\boldsymbol{\theta}^{(i+1)}, \boldsymbol{\theta}^{(i)}\big) =  \frac{1}{2}\pi(\theta_1^{(i+1)} \,\big|\, \theta_2^{(i)})\delta({\theta_2^{(i)},\theta_2^{(i+1)}}) +  \frac{1}{2}\pi(\theta_2^{(i+1)} \,\big|\, \theta_1^{(i)})\delta({\theta_1^{(i)},\theta_1^{(i+1)}})

where \delta is the Dirac delta function.

Detailed balance

This sampling scheme satisifies the detailed balance condition. We have

\displaystyle   \begin{aligned}  \pi(\theta_1, \theta_2)  \bigg[  \frac{1}{2}\pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \frac{1}{2}\pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\  \frac{1}{2}\bigg[\pi(\theta_1, \theta_2)  \pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \pi(\theta_1, \theta_2)  \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\  \frac{1}{2}\bigg[\pi(\theta_1, \theta_2')  \pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \pi(\theta_1', \theta_2)  \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})\bigg] &= \\  \frac{1}{2}\bigg[  \pi(\theta_2')\pi(\theta_1 \,\big|\, \theta_2')  \frac{1}{2}\pi(\theta_1' \,\big|\, \theta_2)\delta({\theta_2,\theta_2'}) +  \pi(\theta_1')\pi(\theta_2 \,\big|\, \theta_1')  \pi(\theta_2' \,\big|\, \theta_1)\delta({\theta_1,\theta_1'})  \bigg] &= \\  \frac{1}{2}\bigg[  \pi(\theta_1', \theta_2')\pi(\theta_1 \,\big|\, \theta_2')  \delta({\theta_2',\theta_2}) +  \pi(\theta_1', \theta_2')\pi(\theta_2 \,\big|\, \theta_1')  \delta({\theta_1',\theta_1})  \bigg] &= \\  \pi(\theta_1', \theta_2')\bigg[  \frac{1}{2}\pi(\theta_1 \,\big|\, \theta_2')  \delta({\theta_2',\theta_2}) +  \frac{1}{2}\pi(\theta_2 \,\big|\, \theta_1')  \delta({\theta_1',\theta_1})  \bigg] &  \end{aligned}

In other words

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big) =  \pi\big({\boldsymbol{\theta'}}\big)q\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big)

Hand waving slightly, we can see that this scheme satisfies the premises of the ergodic theorem and so we can conclude that there is a unique stationary distribution and \pi must be that distribution.

Systematic Scan

Most references on Gibbs sampling do not describe the random scan but instead something called a systematic scan.

Again for simplicity let us consider a model with two parameters. In this sampler, we update the parameters in two steps.

\displaystyle   \begin{aligned}  \text{Sample } \theta_1^{(i+1)} & \sim & \pi(\theta_1 \,\big|\, \theta_2^{(i)}) \\  \text{Sample } \theta_2^{(i+1)} & \sim & \pi(\theta_2 \,\big|\, \theta_1^{(i+1)})  \end{aligned}

We observe that this is not time-homegeneous; at each step the transition matrix flips between the two transition matrices given by the individual steps. Thus although, as we show below, each individual transtion satisifies the detailed balance condition, we cannot apply the ergodic theorem as it only applies to time-homogeneous processes.

The transition density kernel is then given by

\displaystyle   q\big(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(i+1)}\big) =  q_1\big(\boldsymbol{\theta}^{(i)}, \tilde{\boldsymbol{\theta}}\big)  q_2\big(\tilde{\boldsymbol{\theta}}, \boldsymbol{\theta}^{(i+1)}\big)

where \tilde{\boldsymbol{\theta}} = (\theta_1^{(i+1)}, \theta_2^{(i)})^\top.

Thus

\displaystyle   q\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi(\theta_1' \,\big|\, \theta_2)  \pi(\theta_2' \,\big|\, \theta_1')

Detailed balance

Suppose that we have two states \boldsymbol{\theta} = (\theta_1, \theta_2)^\top and \boldsymbol{\theta}' = (\theta_1', \theta_2')^\top and that \theta_2 \neq \theta_2'. Then q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) = 0. Trivially we have

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi\big({\boldsymbol{\theta'}}\big)q_1\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)

Now suppose that \theta_2 = \theta_2'

\displaystyle   \begin{aligned}  \pi(\theta_1, \theta_2)q_1((\theta_1, \theta_2), (\theta_1', \theta_2)) & =  \pi(\theta_1, \theta_2)\pi(\theta_1' \,\big|\, \theta_2) \\  & = \pi(\theta_1 \,\big|\, \theta_2)\pi(\theta_1', \theta_2) \\  & = \pi(\theta_1 \,\big|\, \theta_2')\pi(\theta_1', \theta_2') \\  & = \pi(\theta_1', \theta_2')q_1((\theta_1', \theta_2), (\theta_1, \theta_2))  \end{aligned}

So again we have

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q_1\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi\big({\boldsymbol{\theta'}}\big)q_1\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)

Similarly we can show

\displaystyle   \pi\big({\boldsymbol{\theta}}\big)q_2\big(\boldsymbol{\theta}, \boldsymbol{\theta}'\big) =  \pi\big({\boldsymbol{\theta'}}\big)q_2\big(\boldsymbol{\theta}', \boldsymbol{\theta}\big)

But note that

\displaystyle   \begin{aligned}  \pi(\theta_1, \theta_2)  q_1((\theta_1, \theta_2), (\theta_1', \theta_2))  q_2((\theta_1', \theta_2), (\theta_1', \theta_2')) & =  \pi(\theta_1, \theta_2)  \pi(\theta_1' \,\big|\, \theta_2)  \pi(\theta_2' \,\big|\, \theta_1') \\  & = \pi(\theta_1', \theta_2) \pi(\theta_1 \,\big|\, \theta_2) \pi(\theta_2' \,\big|\, \theta_1') \\  & = \pi(\theta_1' \,\big|\, \theta_2) \pi(\theta_1 \,\big|\, \theta_2) \pi(\theta_2', \theta_1')  \end{aligned}

whereas

\displaystyle   \begin{aligned}  \pi(\theta_1', \theta_2')  q_1((\theta_1', \theta_2'), (\theta_1, \theta_2'))  q_2((\theta_1, \theta_2'), (\theta_1, \theta_2)) & =  \pi(\theta_1', \theta_2')  \pi(\theta_1 \,\big|\, \theta_2')  \pi(\theta_2 \,\big|\, \theta_1) \\  & = \pi(\theta_1, \theta_2') \pi(\theta_1' \,\big|\, \theta_2') \pi(\theta_2 \,\big|\, \theta_1) \\  & = \pi(\theta_2' \,\big|\, \theta_1) \pi(\theta_1' \,\big|\, \theta_2') \pi(\theta_2, \theta_1)  \end{aligned}

and these are not necessarily equal.

So the detailed balance equation is not satisfied, another sign that we cannot appeal to the ergodic theorem.

An Example: The Bivariate Normal

Let us demonstrate the Gibbs sampler with a distribution which we actually know: the bivariate normal.

\displaystyle   \begin{bmatrix}  \theta_1 \\  \theta_2  \end{bmatrix}  \bigg| y \sim  N \begin{bmatrix}  \begin{bmatrix}  \theta_1 \\  \theta_2  \end{bmatrix} &  \begin{bmatrix}  1 & \rho \\  \rho & 1  \end{bmatrix}  \end{bmatrix}

The conditional distributions are easily calculated to be

\displaystyle   \begin{aligned}  \theta_1 \,\vert\, \theta_2, y &\sim {\cal{N}}(y_1 + \rho(\theta_2 - y_2), 1 - \rho^2) \\  \theta_2 \,\vert\, \theta_1, y &\sim {\cal{N}}(y_2 + \rho(\theta_1 - y_1), 1 - \rho^2)  \end{aligned}

Let’s take a correlation of 0.8, a data point of (0.0, 0.0) and start the chain at (2.5, 2.5).

> rho :: Double
> rho = 0.8
> 
> y :: (Double, Double)
> y = (0.0, 0.0)
> 
> y1, y2 :: Double
> y1 = fst y
> y2 = snd y
> 
> initTheta :: (Double, Double)
> initTheta = (2.5, 2.5)

We pre-calculate the variance needed for the sampler.

> var :: Double
> var = 1.0 - rho^2

In Haskell and in the random-fu package, sampling from probability distributions is implemented as a monad. We sample from the relevant normal distributions and keep the trajectory using a writer monad.

> gibbsSampler :: Double -> RVarT (W.Writer [(Double,Double)]) Double
> gibbsSampler oldTheta2 = do
>   newTheta1 <- rvarT (Normal (y1 + rho * (oldTheta2 - y2)) var)
>   lift $ W.tell [(newTheta1, oldTheta2)]
>   newTheta2 <- rvarT (Normal (y2 + rho * (newTheta1 - y1)) var)
>   lift $ W.tell [(newTheta1, newTheta2)]
>   return $ newTheta2

It is common to allow the chain to “burn in” so as to “forget” its starting position. We arbitrarily burn in for 10,000 steps.

> burnIn :: Int
> burnIn = 10000

We sample repeatedly from the sampler using the monadic form of iterate. Running the monadic stack is slightly noisy but nonetheless straightforward. We use mersenne-random-pure64 (albeit indirectly via random-source) as our source of entropy.

> runMCMC :: Int -> [(Double, Double)]
> runMCMC n =
>   take n $
>   drop burnIn $
>   snd $
>   W.runWriter (evalStateT (sample (ML.iterateM_ gibbsSampler (snd initTheta))) (pureMT 2))

We can look at the trajectory of our sampler for various run lengths.

For bigger sample sizes, plotting the distribution sampled re-assures us that we are indeed sampling from a bivariate normal distribution as the theory predicted.

Applications to Bayesian Statistics

Some of what is here and here excluding JAGS and STAN (after all this is a book about Haskell).

Applications to Physics

Applications to Physics

Most of what is here.

Gibbs Sampling in R, Haskell, Jags and Stan

Introduction

It’s possible to Gibbs sampling in most languages and since I am doing some work in R and some work in Haskell, I thought I’d present a simple example in both languages: estimating the mean from a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or STAN to do the actual sampling and do pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and STAN as well as Haskell.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module Gibbs (
>     main
>   , m
>   , Moments(..)
>   ) where
> 
> import qualified Data.Vector.Unboxed as V
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import qualified Control.Foldl as L
> 
> import Diagrams.Backend.Cairo.CmdLine
> 
> import LinRegAux
> 
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )

The length of our chain and the burn-in.

> nrep, nb :: Int
> nb   = 5000
> nrep = 105000

Data generated from {\cal{N}}(10.0, 5.0).

> xs :: [Double]
> xs = [
>     11.0765808082301
>   , 10.918739177542
>   , 15.4302462747137
>   , 10.1435649220266
>   , 15.2112705014697
>   , 10.441327659703
>   , 2.95784054883142
>   , 10.2761068139607
>   , 9.64347295100318
>   , 11.8043359297675
>   , 10.9419989262713
>   , 7.21905367667346
>   , 10.4339807638017
>   , 6.79485294803006
>   , 11.817248658832
>   , 6.6126710570584
>   , 12.6640920214508
>   , 8.36604701073303
>   , 12.6048485320333
>   , 8.43143879537592
>   ]

A Bit of Theory

Gibbs Sampling

For a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions.

Referring back to the explanation of the metropolis algorithm, let us describe the state by its parameters i \triangleq \boldsymbol{\theta}^{(i)} \triangleq (\theta^{(i)}_1,\ldots, \theta^{(i)}_n) and the conditional posteriors by \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big) where {\boldsymbol{\theta}}^{(i)}_{-k} = \big(\theta_1^{(i)},\ldots,\theta_{k-1}^{(i)},\theta_{k+1}^{(i)}\ldots\theta_n^{(i)}\big) then

\displaystyle   \begin{aligned}  \frac{\pi\big(\boldsymbol{\theta}^{(j)}\big)q\big(\boldsymbol{\theta}^{(j)}, \boldsymbol{\theta}^{(i)}\big)}  {\pi(\boldsymbol{\theta}^{(i)})q(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(j)})}  &=  \frac{  \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  }  {  \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)  } \\  &=  \frac{  \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  }  {  \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  } \\  &= 1  \end{aligned}

where we have used the rules of conditional probability and the fact that \boldsymbol{\theta}_i^{(-k)} = \boldsymbol{\theta}_j^{(-k)}

Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters.

Normal Distribution with Unknown Mean and Variance

It is fairly standard to use an improper prior

\displaystyle   \begin{aligned}  \pi(\mu, \tau) \propto \frac{1}{\tau} & & -\infty < \mu < \infty\, \textrm{and}\, 0 < \tau < \infty  \end{aligned}

The likelihood is

\displaystyle   p(\boldsymbol{x}\,|\,\mu, \sigma) = \prod_{i=1}^n \bigg(\frac{1}{\sigma\sqrt{2\pi}}\bigg)\exp{\bigg( -\frac{(x_i - \mu)^2}{2\sigma^2}\bigg)}

re-writing in terms of precision

\displaystyle   p(\boldsymbol{x}\,|\,\mu, \tau) \propto \prod_{i=1}^n \sqrt{\tau}\exp{\bigg( -\frac{\tau}{2}{(x_i - \mu)^2}\bigg)} = \tau^{n/2}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}

Thus the posterior is

\displaystyle   p(\mu, \tau \,|\, \boldsymbol{x}) \propto \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}

We can re-write the sum in terms of the sample mean \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i and variance s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 using

\displaystyle   \begin{aligned}  \sum_{i=1}^n (x_i - \mu)^2 &= \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 \\  &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2\sum_{i=1}^n (x_i - \bar{x})(\bar{x} - \mu) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\  &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x}) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\  &= (n - 1)s^2 + n(\bar{x} - \mu)^2  \end{aligned}

Thus the conditional posterior for \mu is

\displaystyle   \begin{aligned}  p(\mu \,|\, \tau, \boldsymbol{x}) &\propto \exp{\bigg( -\frac{\tau}{2}\bigg(\nu s^2 + \sum_{i=1}^n{(\mu - \bar{x})^2}\bigg)\bigg)} \\  &\propto \exp{\bigg( -\frac{n\tau}{2}{(\mu - \bar{x})^2}\bigg)} \\  \end{aligned}

which we recognise as a normal distribution with mean of \bar{x} and a variance of (n\tau)^{-1}.

The conditional posterior for \tau is

\displaystyle   \begin{aligned}  p(\tau \,|\, , \mu, \boldsymbol{x}) &\propto \tau^{n/2 -1}\exp\bigg(-\tau\frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)  \end{aligned}

which we recognise as a gamma distribution with a shape of n/2 and a scale of \frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}

In this particular case, we can calculate the marginal posterior of \mu analytically. Writing z = \frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2} we have

\displaystyle   \begin{aligned}  p(\mu \,|\, \boldsymbol{x}) &= \int_0^\infty p(\mu, \tau \,|\, \boldsymbol{x}) \textrm{d}\tau \\  &\propto \int_0^\infty \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)} \textrm{d}\tau \\  &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \int_0^\infty z^{n/2 - 1}\exp{-z}\textrm{d}\tau \\  &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \\  \end{aligned}

Finally we can calculate

\displaystyle   \begin{aligned}  p(\mu \,|\, \boldsymbol{x}) &\propto \bigg( (n - 1)s^2 + n(\bar{x} - \mu)^2 \bigg)^{-n/2} \\  &\propto \bigg( 1 + \frac{n(\mu - \bar{x})^2}{(n - 1)s^2} \bigg)^{-n/2} \\  \end{aligned}

This is the non-standardized Student’s t-distribution t_{n-1}(\bar{x}, s^2/n).

Alternatively the marginal posterior of \mu is

\displaystyle   \frac{\mu - \bar{x}}{s/\sqrt{n}}\bigg|\, x \sim t_{n-1}

where t_{n-1} is the standard t distribution with n - 1 degrees of freedom.

The Model in Haskell

Following up on a comment from a previous blog post, let us try using the foldl package to calculate the length, the sum and the sum of squares traversing the list only once. An improvement on creating your own strict record and using foldl’ but maybe it is not suitable for some methods e.g. calculating the skewness and kurtosis incrementally, see below.

> x2Sum, xSum, n :: Double
> (x2Sum, xSum, n) = L.fold stats xs
>   where
>     stats = (,,) <$>
>             (L.premap (\x -> x * x) L.sum) <*>
>             L.sum <*>
>             L.genericLength

And re-writing the sample variance s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 using

\displaystyle   \begin{aligned}  \sum_{i=1}^n (x_i - \bar{x})^2 &= \sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2) \\  &= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + \sum_{i=1}^n \bar{x}^2 \\  &= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\  &= \sum_{i=1}^n x_i^2 - n\bar{x}^2 \\  \end{aligned}

we can then calculate the sample mean and variance using the sums we have just calculated.

> xBar, varX :: Double
> xBar = xSum / n
> varX = n * (m2Xs - xBar * xBar) / (n - 1)
>   where m2Xs = x2Sum / n

In random-fu, the Gamma distribution is specified by the rate paratmeter, \beta.

> beta, initTau :: Double
> beta = 0.5 * n * varX
> initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)

Our sampler takes an old value of \tau and creates new values of \mu and \tau.

> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double))
> gibbsSampler oldTau = do
>   newMu <- sample (Normal xBar (recip (sqrt (n * oldTau))))
>   let shape = 0.5 * n
>       scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar)
>   newTau <- sample (Gamma shape (recip scale))
>   return $ Just ((newMu, newTau), newTau)

From which we can create an infinite stream of samples.

> gibbsSamples :: [(Double, Double)]
> gibbsSamples = evalState (ML.unfoldrM gibbsSampler initTau) (pureMT 1)

As our chains might be very long, we calculate the mean, variance, skewness and kurtosis using an incremental method.

> data Moments = Moments { mN :: !Double
>                        , m1 :: !Double
>                        , m2 :: !Double
>                        , m3 :: !Double
>                        , m4 :: !Double
>                        }
>   deriving Show
> moments :: [Double] -> Moments
> moments xs = foldl' f (Moments 0.0 0.0 0.0 0.0 0.0) xs
>   where
>     f :: Moments -> Double -> Moments
>     f m x = Moments n' m1' m2' m3' m4'
>       where
>         n = mN m
>         n'  = n + 1
>         delta = x - (m1 m)
>         delta_n = delta / n'
>         delta_n2 = delta_n * delta_n
>         term1 = delta * delta_n * n
>         m1' = m1 m + delta_n
>         m4' = m4 m +
>               term1 * delta_n2 * (n'*n' - 3*n' + 3) +
>               6 * delta_n2 * m2 m - 4 * delta_n * m3 m
>         m3' = m3 m + term1 * delta_n * (n' - 2) - 3 * delta_n * m2 m
>         m2' = m2 m + term1

In order to examine the posterior, we create a histogram.

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = xBar - 2.0 * sqrt varX
>     upper = xBar + 2.0 * sqrt varX

And fill it with the specified number of samples preceeded by a burn-in.

> hist :: Histogram V.Vector BinD Double
> hist = fillBuilder hb (take (nrep - nb) $ drop nb $ map fst gibbsSamples)

Now we can plot this.

And calculate the skewness and kurtosis.

> m :: Moments
> m = moments (take (nrep - nb) $ drop nb $ map fst gibbsSamples)
ghci> import Gibbs
ghci> putStrLn $ show $ (sqrt (mN m)) * (m3 m) / (m2 m)**1.5
  8.733959917065126e-4

ghci> putStrLn $ show $ (mN m) * (m4 m) / (m2 m)**2
  3.451374739494607

We expect a skewness of 0 and a kurtosis of 3 + 6 / \nu - 4 = 3.4 for \nu = 19. Not too bad.

The Model in JAGS

JAGS is a mature, declarative, domain specific language for building Bayesian statistical models using Gibbs sampling.

Here is our model as expressed in JAGS. Somewhat terse.

model {
	for (i in 1:N) {
		x[i] ~ dnorm(mu, tau)
	}
	mu ~ dnorm(0, 1.0E-6)
	tau <- pow(sigma, -2)
	sigma ~ dunif(0, 1000)
}

To run it and examine its results, we wrap it up in some R

## Import the library that allows R to inter-work with jags.
library(rjags)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

jags <- jags.model('example1.bug',
                   data = list('x' = fn[,1], 'N' = 20),
                   n.chains = 4,
                   n.adapt = 100)

## Burnin for 10000 samples
update(jags, 10000);

mcmc_samples <- coda.samples(jags, variable.names=c("mu", "sigma"), n.iter=20000)

png(file="diagrams/jags.png",width=400,height=350)
plot(mcmc_samples)
dev.off()

And now we can look at the posterior for \mu.

The Model in STAN

STAN is a domain specific language for building Bayesian statistical models similar to JAGS but newer and which allows variables to be re-assigned and so cannot really be described as declarative.

Here is our model as expressed in STAN. Again, somewhat terse.

data {
  int<lower=0> N;
  real x[N];
}

parameters {
  real mu;
  real<lower=0,upper=1000> sigma;
}
model{
  x     ~ normal(mu, sigma);
  mu    ~ normal(0, 1000);
}

Just as with JAGS, to run it and examine its results, we wrap it up in some R.

library(rstan)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

## Running the model
fit1 <- stan(file = 'Stan.stan',
             data = list('x' = fn[,1], 'N' = 20),
             pars=c("mu", "sigma"),
             chains=3,
             iter=30000,
             warmup=10000)

png(file="diagrams/stan.png",width=400,height=350)
plot(fit1)
dev.off()

Again we can look at the posterior although we only seem to get medians and 80% intervals.

PostAmble

Write the histogram produced by the Haskell code to a file.

> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
>   mainRender ( DiagramOpts (Just 900) (Just 700) fn
>              , DiagramLoopOpts False Nothing 0
>              )
> main :: IO ()
> main = do
>   displayHeader "diagrams/DataScienceHaskPost.png"
>     (barDiag
>      (zip (map fst $ asList hist) (map snd $ asList hist)))

The code can be downloaded from github.

Student’s T and Space Leaks

Introduction

The other speaker at the Machine Learning Meetup at which I gave my talk on automatic differentiation gave a very interesting talk on A/B testing. Apparently this is big business these days as attested by the fact I got 3 ads above the wikipedia entry when I googled for it.

It seems that people tend to test with small sample sizes and to do so very often, resulting in spurious results. Of course readers of XKCD will be well aware of some of the pitfalls.

I thought a Bayesian approach might circumvent some of the problems and set out to write a blog article only to discover that there was no Haskell library for sampling from Student’s t. Actually there was one but is currently an unreleased part of random-fu. So I set about fixing this shortfall.

I thought I had better run a few tests so I calculated the sampled mean, variance, skewness and kurtosis.

I wasn’t really giving this my full attention and as a result ran into a few problems with space. I thought these were worth sharing and that is what this blog post is about. Hopefully, I will have time soon to actually blog about the Bayesian equivalent of A/B testing.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> 
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> 
> module StudentTest (
>     main
>   ) where
> 
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Random.Distribution.T
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List

Space Analysis

Let’s create a reasonable number of samples as the higher moments converge quite slowly.

> nSamples :: Int
> nSamples = 1000000

An arbitrary seed for creating the samples.

> arbSeed :: Int
> arbSeed = 8

Student’s t only has one parameter, the number of degrees of freedom.

> nu :: Integer
> nu = 6

Now we can do our tests by calculating the sampled values.

> ts :: [Double]
> ts =
>   evalState (replicateM nSamples (sample (T nu)))
>             (pureMT $ fromIntegral arbSeed)
> mean, variance, skewness, kurtosis :: Double
> mean = (sum ts) / fromIntegral nSamples
> variance = (sum (map (**2) ts)) / fromIntegral nSamples
> skewness = (sum (map (**3) ts) / fromIntegral nSamples) / variance**1.5
> kurtosis = (sum (map (**4) ts) / fromIntegral nSamples) / variance**2

This works fine for small sample sizes but not for the number we have chosen.

./StudentTest +RTS -hc
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

It seems a shame that the function in the Prelude has this behaviour but never mind let us ensure that we consume values strictly (they are being produced lazily).

> mean' = (foldl' (+) 0 ts) / fromIntegral nSamples
> variance' = (foldl' (+) 0 (map (**2) ts)) / fromIntegral nSamples
> skewness' = (foldl' (+) 0 (map (**3) ts) / fromIntegral nSamples) / variance'**1.5
> kurtosis' = (foldl' (+) 0 (map (**4) ts) / fromIntegral nSamples) / variance'**2

We now have a space leak on the heap as using the ghc profiler below shows. What went wrong?

If we only calculate the mean using foldl then all is well. Instead of 35M we only use 45K.

Well that gives us a clue. The garbage collector cannot reclaim the samples as they are needed for other calculations. What we need to do is calculate the moments strictly altogether.

Let’s create a strict record to do this.

> data Moments = Moments { m1 :: !Double
>                        , m2 :: !Double
>                        , m3 :: !Double
>                        , m4 :: !Double
>                        }
>   deriving Show

And calculate the results strictly.

> 
> m  = foldl' (\m x -> Moments { m1 = m1 m + x
>                              , m2 = m2 m + x**2
>                              , m3 = m3 m + x**3
>                              , m4 = m4 m + x**4
>                              }) (Moments 0.0 0.0 0.0 0.0) ts
> 
> mean''       = m1 m / fromIntegral nSamples
> variance''   = m2 m / fromIntegral nSamples
> skewness''   = (m3 m / fromIntegral nSamples) / variance''**1.5
> kurtosis'' = (m4 m / fromIntegral nSamples) / variance''**2

Now we have what we want; the program runs in small constant space.

> main :: IO ()
> main = do
>   putStrLn $ show mean''
>   putStrLn $ show variance''
>   putStrLn $ show skewness''
>   putStrLn $ show kurtosis''

Oh and the moments give the expected answers.

ghci> mean''
  3.9298418844289093e-4

ghci> variance''
  1.4962681916693004

ghci> skewness''
  1.0113188204317015e-2

ghci> kurtosis''
  5.661776268997382

Running the Code

To run this you will need my version of random-fu. The code for this article is here. You will need to compile everything with profiling, something like

ghc -O2 -main-is StudentTest StudentTest.lhs -prof
    -package-db=.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d

Since you need all the packages to be built with profiling, you will probably want to build using a sandbox as above. The only slightly tricky aspect is building random-fu so it is in your sandbox.

runghc Setup.lhs configure --enable-library-profiling
--package-db=/HasBayes/.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d
--libdir=/HasBayes/.cabal-sandbox/lib

Laplace’s Equation in Haskell: Using a DSL for Stencils

Introduction

Suppose we have a square thin plate of metal and we hold each of edges at a temperature which may vary along the edge but is fixed for all time. After some period depending on the conductivity of the metal, the temperature at every point on the plate will have stabilised. What is the temperature at any point?

We can calculate this using by solving Laplace’s equation \nabla^2 \phi = 0 in 2 dimensions. Apart from the preceeding motivation, a more compelling reason for doing so is that it is a moderately simple equation, in so far as partial differential equations are simple, that has been well studied for centuries.

In Haskell terms this gives us the opportunity to use the repa library and use hmatrix which is based on Lapack (as well as other libraries) albeit hmatrix only for illustratative purposes.

I had originally intended this blog to contain a comparison repa’s performance against an equivalent C program even though this has already been undertaken by the repa team in their various publications. And indeed it is still my intention to produce such a comparision. However, as I investigated further, it turned out a fair amount of comparison work has already been done by a team from Intel which suggests there is currently a performance gap but one which is not so large that it outweighs the other benefits of Haskell.

To be more specific, one way in which using repa stands out from the equivalent C implementation is that it gives a language in which we can specify the stencil being used to solve the equation. As an illustration we substitute the nine point method for the five point method merely by changing the stencil.

A Motivating Example: The Steady State Heat Equation

Fourier’s law states that the rate of heat transfer or the flux \boldsymbol{\sigma} is proportional to the negative temperature gradient, as heat flows from hot to cold, and further that it flows in the direction of greatest temperature change. We can write this as

\displaystyle  \boldsymbol{\sigma} = -k\nabla \phi

where \phi : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R} is the temperature at any given point on the plate and k is the conductivity of the metal.

Moreover, we know that for any region on the plate, the total amount of heat flowing in must be balanced by the amount of heat flowing out. We can write this as

\displaystyle  \nabla \cdot \boldsymbol{\sigma} = 0

Substituting the first equation into the second we obtain Laplace’s equation

\displaystyle  \nabla^2 \phi = 0

For example, suppose we hold the temperature of the edges of the plate as follows

\displaystyle  \begin{matrix}  \phi(x, 0) = 1 & \phi(x, 1) = 2 & \phi(0, y) = 1 & \phi(1, y) = 2  \end{matrix}

then after some time the temperature of the plate will be as shown in the heatmap below.

Notes:

  1. Red is hot.

  2. Blue is cold.

  3. The heatmap is created by a finite difference method described below.

  4. The y-axis points down (not up) i.e. \phi(x,1) is at the bottom, reflecting the fact that we are using an array in the finite difference method and rows go down not up.

  5. The corners are grey because in the five point finite difference method these play no part in determining temperatures in the interior of the plate.

Colophon

Since the book I am writing contains C code (for performance comparisons), I need a way of being able to compile and run this code and include it “as is” in the book. Up until now, all my blog posts have contained Haskell and so I have been able to use BlogLiteratelyD which allows me to include really nice diagrams. But clearly this tool wasn’t really designed to handle other languages (although I am sure it could be made to do so).

Using pandoc’s scripting capability with the small script provided

#!/usr/bin/env runhaskell
import Text.Pandoc.JSON

doInclude :: Block -> IO Block
doInclude cb@(CodeBlock ("verbatim", classes, namevals) contents) =
  case lookup "include" namevals of
       Just f     -> return . (\x -> Para [Math DisplayMath x]) =<< readFile f
       Nothing    -> return cb
doInclude cb@(CodeBlock (id, classes, namevals) contents) =
  case lookup "include" namevals of
       Just f     -> return . (CodeBlock (id, classes, namevals)) =<< readFile f
       Nothing    -> return cb
doInclude x = return x

main :: IO ()
main = toJSONFilter doInclude

I can then include C code blocks like this

~~~~ {.c include="Chap1a.c"}
~~~~

And format the whole document like this

pandoc -s Chap1.lhs --filter=./Include -t markdown+lhs > Chap1Expanded.lhs
BlogLiteratelyD Chap1Expanded.lhs > Chap1.html

Sadly, the C doesn’t get syntax highlighting but this will do for now.

PS Sadly, WordPress doesn’t seem to be able to handle \color{red} and \color{blue} in LaTeX so there are some references to blue and red which do not render.

Acknowledgements

A lot of the code for this post is taken from the repa package itself. Many thanks to the repa team for providing the package and the example code.

Haskell Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE BangPatterns                  #-}
> {-# LANGUAGE TemplateHaskell               #-}
> {-# LANGUAGE QuasiQuotes                   #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module Chap1 (
>     module Control.Applicative
>   , solveLaplaceStencil
>   , useBool
>   , boundMask
>   , boundValue
>   , bndFnEg1
>   , fivePoint
>   , ninePoint
>   , testStencil5
>   , testStencil9
>   , analyticValue
>   , slnHMat4
>   , slnHMat5
>   , testJacobi4
>   , testJacobi6
>   , bndFnEg3
>   , runSolver
>   , s5
>   , s9
>   ) where
>
> import Data.Array.Repa                   as R
> import Data.Array.Repa.Unsafe            as R
> import Data.Array.Repa.Stencil           as A
> import Data.Array.Repa.Stencil.Dim2      as A
> import Prelude                           as P
> import Data.Packed.Matrix
> import Numeric.LinearAlgebra.Algorithms
> import Chap1Aux
> import Control.Applicative

Laplace’s Equation: The Five Point Formula

We show how to apply finite difference methods to Laplace’s equation:

\displaystyle  \nabla^2 u = 0

where

\displaystyle  \nabla^2 = \frac{\partial^2}{\partial x^2} +\frac{\partial^2}{\partial y^2}

For a sufficiently smooth function (see (Iserles 2009, chap. 8)) we have

\displaystyle  \begin{aligned}  \frac{\partial^2 u}{\partial x^2}\mathop{\Bigg|_{x = x_0 + k\Delta x}}_{y = y_0 + l\Delta x} &= \frac{1}{(\Delta x)^2}\Delta_{0,x}^2 u_{k,l} + \mathcal{O}((\Delta x)^2) \\  \frac{\partial^2 u}{\partial y^2}\mathop{\Bigg|_{x = x_0 + k\Delta x}}_{y = y_0 + l\Delta x} &= \frac{1}{(\Delta x)^2}\Delta_{0,y}^2 u_{k,l} + \mathcal{O}((\Delta x)^2)  \end{aligned}

where the central difference operator \Delta_0 is defined as

\displaystyle  (\Delta_0 z)_k \triangleq z_{k + \frac{1}{2}} - z_{k - \frac{1}{2}}

We are therefore led to consider the five point difference scheme.

\displaystyle  \frac{1}{(\Delta x)^2}(\Delta_{0,x}^2 + \Delta_{0,y}^2) u_{k,l} = 0

We can re-write this explicitly as

\displaystyle  u_{k-1,l} + u_{k+1,l} + u_{k,l-1} + u_{k,l+1} - 4u_{k,l} = 0

Specifically for the grid point (2,1) in a 4 \times 4 grid we have

\displaystyle  {{u_{1,1}}} + {{u_{3,1}}} + {{u_{2,0}}} + {{u_{2,2}}} - 4{{u_{2,1}}} = 0

where blue indicates that the point is an interior point and red indicates that the point is a boundary point. For Dirichlet boundary conditions (which is all we consider in this post), the values at the boundary points are known.

We can write the entire set of equations for this grid as

\displaystyle  \begin{bmatrix}  -4.0 & 1.0 & 1.0 & 0.0 \\  1.0 & -4.0 & 0.0 & 1.0 \\  1.0 & 0.0 & -4.0 & 1.0 \\  0.0 & 1.0 & 1.0 & -4.0  \end{bmatrix}  \begin{bmatrix}  {{u_{11}}} \\  {{u_{21}}} \\  {{u_{12}}} \\  {{u_{22}}}  \end{bmatrix}  =  \begin{bmatrix}  -{{u_{10}}} + -{{u_{01}}} \\  -{{u_{20}}} + -{{u_{31}}} \\  -{{u_{02}}} + -{{u_{13}}} \\  -{{u_{23}}} + -{{u_{32}}}  \end{bmatrix}

A Very Simple Example

Let us take the boundary conditions to be

\displaystyle  \begin{matrix}  u(x, 0) = 1 & u(x, 1) = 2 & u(0, y) = 1 & u(1, y) = 2  \end{matrix}

With our 4 \times 4 grid we can solve this exactly using the hmatrix package which has a binding to LAPACK.

First we create a 4 \times 4 matrix in hmatrix form

> simpleEgN :: Int
> simpleEgN = 4 - 1
>
> matHMat4 :: IO (Matrix Double)
> matHMat4 = do
>   matRepa <- computeP $ mkJacobiMat simpleEgN :: IO (Array U DIM2 Double)
>   return $ (simpleEgN - 1) >< (simpleEgN - 1) $ toList matRepa
ghci> matHMat4
  (2><2)
   [ -4.0, 1.0
   ,  1.0, 0.0 ]

Next we create the column vector as presribed by the boundary conditions

> bndFnEg1 :: Int -> Int -> (Int, Int) -> Double
> bndFnEg1 _ m (0, j) |           j > 0 && j < m = 1.0
> bndFnEg1 n m (i, j) | i == n && j > 0 && j < m = 2.0
> bndFnEg1 n _ (i, 0) |           i > 0 && i < n = 1.0
> bndFnEg1 n m (i, j) | j == m && i > 0 && i < n = 2.0
> bndFnEg1 _ _ _                                 = 0.0
> bnd1 :: Int -> [(Int, Int)] -> Double
> bnd1 n = negate .
>          sum .
>          P.map (bndFnEg1 n n)
> bndHMat4 :: Matrix Double
> bndHMat4 = ((simpleEgN - 1) * (simpleEgN - 1)) >< 1 $
>            mkJacobiBnd fromIntegral bnd1 3
ghci>  bndHMat4
  (4><1)
   [ -2.0
   , -3.0
   , -3.0
   , -4.0 ]
> slnHMat4 :: IO (Matrix Double)
> slnHMat4 = matHMat4 >>= return . flip linearSolve bndHMat4
ghci> slnHMat4
  (4><1)
   [               1.25
   ,                1.5
   , 1.4999999999999998
   , 1.7499999999999998 ]

The Jacobi Method

Inverting a matrix is expensive so instead we use the (possibly most) classical of all iterative methods, Jacobi iteration. Given A\boldsymbol{x} = \boldsymbol{b} and an estimated solution \boldsymbol{x}_i^{[k]}, we can generate an improved estimate \boldsymbol{x}_i^{[k+1]}. See (Iserles 2009, chap. 12) for the details on convergence and convergence rates.

\displaystyle  \boldsymbol{x}_i^{[k+1]} = \frac{1}{A_{i,i}}\Bigg[\boldsymbol{b}_i - \sum_{j \neq i} A_{i,j}\boldsymbol{x}_j^{[k]}\Bigg]

The simple example above does not really give a clear picture of what happens in general during the update of the estimate. Here is a larger example

Sadly, WordPress does not seem to be able to render 16 \times 16 matrices written in LaTeX so you will have to look at the output from hmatrix in the larger example below. You can see that this matrix is sparse and has a very clear pattern.

Expanding the matrix equation for a {\text{point}} not in the {\text{boundary}} we get

\displaystyle  x_{i,j}^{[k+1]} = \frac{1}{4}(x^{[k]}_{i-1,j} + x^{[k]}_{i,j-1} + x^{[k]}_{i+1,j} + x^{[k]}_{i,j+1})

Cleary the values of the points in the boundary are fixed and must remain at those values for every iteration.

Here is the method using repa. To produce an improved estimate, we define a function relaxLaplace and we pass in a repa matrix representing our original estimate \boldsymbol{x}_i^{[k]} and receive the one step update \boldsymbol{x}_i^{[k+1]} also as a repa matrix.

We pass in a boundary condition mask which specifies which points are boundary points; a point is a boundary point if its value is 1.0 and not if its value is 0.0.

> boundMask :: Monad m => Int -> Int -> m (Array U DIM2 Double)
> boundMask gridSizeX gridSizeY = computeP $
>                                 fromFunction (Z :. gridSizeX + 1 :. gridSizeY + 1) f
>   where
>     f (Z :. _ix :.  iy) | iy == 0         = 0
>     f (Z :. _ix :.  iy) | iy == gridSizeY = 0
>     f (Z :.  ix :. _iy) | ix == 0         = 0
>     f (Z :.  ix :. _iy) | ix == gridSizeX = 0
>     f _                                   = 1

Better would be to use at least a Bool as the example below shows but we wish to modify the code from the repa git repo as little as possible.

> useBool :: IO (Array U DIM1 Double)
> useBool = computeP $
>           R.map (fromIntegral . fromEnum) $
>           fromFunction (Z :. (3 :: Int)) (const True)
ghci> useBool
  AUnboxed (Z :. 3) (fromList [1.0,1.0,1.0])

We further pass in the boundary conditions. We construct these by using a function which takes the grid size in the x direction, the grid size in the y direction and a given pair of co-ordinates in the grid and returns a value at this position.

> boundValue :: Monad m =>
>               Int ->
>               Int ->
>               (Int -> Int -> (Int, Int) -> Double) ->
>               m (Array U DIM2 Double)
> boundValue gridSizeX gridSizeY bndFn =
>   computeP $
>   fromFunction (Z :. gridSizeX + 1 :. gridSizeY + 1) g
>   where
>     g (Z :. ix :. iy) = bndFn gridSizeX gridSizeY (ix, iy)

Note that we only update an element in the repa matrix representation of the vector if it is not on the boundary.

> relaxLaplace
>   :: Monad m
>      => Array U DIM2 Double
>      -> Array U DIM2 Double
>      -> Array U DIM2 Double
>      -> m (Array U DIM2 Double)
>
> relaxLaplace arrBoundMask arrBoundValue arr
>   = computeP
>     $ R.zipWith (+) arrBoundValue
>     $ R.zipWith (*) arrBoundMask
>     $ unsafeTraverse arr id elemFn
>   where
>     _ :. height :. width
>       = extent arr
>
>     elemFn !get !d@(sh :. i :. j)
>       = if isBorder i j
>         then  get d
>         else (get (sh :. (i-1) :. j)
>               +   get (sh :. i     :. (j-1))
>               +   get (sh :. (i+1) :. j)
>               +   get (sh :. i     :. (j+1))) / 4
>     isBorder !i !j
>       =  (i == 0) || (i >= width  - 1)
>          || (j == 0) || (j >= height - 1)

We can use this to iterate as many times as we like.

> solveLaplace
>   :: Monad m
>         => Int
>         -> Array U DIM2 Double
>         -> Array U DIM2 Double
>         -> Array U DIM2 Double
>         -> m (Array U DIM2 Double)
>
> solveLaplace steps arrBoundMask arrBoundValue arrInit
>  = go steps arrInit
>   where
>     go !i !arr
>       | i == 0
>       = return     arr
>
>       | otherwise
>       = do arr' <- relaxLaplace arrBoundMask arrBoundValue arr
>            go (i - 1) arr'

For our small example, we set the initial array to 0 at every point. Note that the function which updates the grid, relaxLaplace will immediately over-write the points on the boundary with values given by the boundary condition.

> mkInitArrM :: Monad m => Int -> m (Array U DIM2 Double)
> mkInitArrM n = computeP $ fromFunction (Z :. (n + 1) :. (n + 1)) (const 0.0)

We can now test the Jacobi method

> testJacobi4 :: Int -> IO (Array U DIM2 Double)
> testJacobi4 nIter = do
>   mask    <- boundMask simpleEgN simpleEgN
>   val     <- boundValue simpleEgN simpleEgN bndFnEg1
>   initArr <- mkInitArrM simpleEgN
>   solveLaplace nIter mask val initArr

After 55 iterations, we obtain convergence up to the limit of accuracy of double precision floating point numbers. Note this only provides a solution of the matrix equation which is an approximation to Laplace’s equation. To obtain a more accurate result for the latter we need to use a smaller grid size.

ghci> testJacobi4 55 >>= return . pPrint
  [0.0, 1.0, 1.0, 0.0]
  [1.0, 1.25, 1.5, 2.0]
  [1.0, 1.5, 1.75, 2.0]
  [0.0, 2.0, 2.0, 0.0]

A Larger Example

Armed with Jacobi, let us now solve a large example.

> largerEgN, largerEgN2 :: Int
> largerEgN = 6 - 1
> largerEgN2 = (largerEgN - 1) * (largerEgN - 1)

First let us use hmatrix.

> matHMat5 :: IO (Matrix Double)
> matHMat5 = do
>   matRepa <- computeP $ mkJacobiMat largerEgN :: IO (Array U DIM2 Double)
>   return $ largerEgN2 >< largerEgN2 $ toList matRepa
ghci> matHMat5
  (16><16)
   [ -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  1.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  0.0,  1.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  0.0,  0.0,  1.0, -4.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  1.0,  0.0,  0.0,  0.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0
   ,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0,  0.0
   ,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0,  0.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  1.0,  0.0,  0.0,  1.0,  0.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  0.0,  0.0,  0.0,  1.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  0.0, -4.0,  1.0,  0.0,  0.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  1.0,  0.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0,  1.0
   ,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0,  0.0,  0.0,  1.0, -4.0 ]
> bndHMat5 :: Matrix Double
> bndHMat5 = largerEgN2>< 1 $ mkJacobiBnd fromIntegral bnd1 5
ghci>  bndHMat5
  (16><1)
   [ -2.0
   , -1.0
   , -1.0
   , -3.0
   , -1.0
   ,  0.0
   ,  0.0
   , -2.0
   , -1.0
   ,  0.0
   ,  0.0
   , -2.0
   , -3.0
   , -2.0
   , -2.0
   , -4.0 ]
> slnHMat5 :: IO (Matrix Double)
> slnHMat5 = matHMat5 >>= return . flip linearSolve bndHMat5
ghci> slnHMat5
  (16><1)
   [ 1.0909090909090908
   , 1.1818181818181817
   , 1.2954545454545454
   ,                1.5
   , 1.1818181818181817
   , 1.3409090909090906
   , 1.4999999999999996
   , 1.7045454545454544
   , 1.2954545454545459
   ,                1.5
   , 1.6590909090909092
   ,  1.818181818181818
   , 1.5000000000000004
   , 1.7045454545454548
   , 1.8181818181818186
   , 1.9090909090909092 ]

And for comparison, let us use the Jacobi method.

> testJacobi6 :: Int -> IO (Array U DIM2 Double)
> testJacobi6 nIter = do
>   mask    <- boundMask largerEgN largerEgN
>   val     <- boundValue largerEgN largerEgN bndFnEg1
>   initArr <- mkInitArrM largerEgN
>   solveLaplace nIter mask val initArr
ghci> testJacobi6 178 >>= return . pPrint
  [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
  [1.0, 1.0909090909090908, 1.1818181818181817, 1.2954545454545454, 1.5, 2.0]
  [1.0, 1.1818181818181817, 1.3409090909090908, 1.5, 1.7045454545454546, 2.0]
  [1.0, 1.2954545454545454, 1.5, 1.6590909090909092, 1.8181818181818183, 2.0]
  [1.0, 1.5, 1.7045454545454546, 1.8181818181818181, 1.9090909090909092, 2.0]
  [0.0, 2.0, 2.0, 2.0, 2.0, 0.0]

Note that with a larger grid we need more points (178) before the Jacobi method converges.

Stencils

Since we are functional programmers, our natural inclination is to see if we can find an abstraction for (at least some) numerical methods. We notice that we are updating each grid element (except the boundary elements) by taking the North, East, South and West surrounding squares and calculating a linear combination of these.

Repa provides this abstraction and we can describe the update calculation as a stencil. (Lippmeier and Keller 2011) gives full details of stencils in repa.

> fivePoint :: Stencil DIM2 Double
> fivePoint = [stencil2|  0 1 0
>                         1 0 1
>                         0 1 0 |]

Using stencils allows us to modify our numerical method with a very simple change. For example, suppose we wish to use the nine point method (which is \mathcal{O}((\Delta x)^4)!) then we only need write down the stencil for it which is additionally a linear combination of North West, North East, South East and South West.

> ninePoint :: Stencil DIM2 Double
> ninePoint = [stencil2| 1 4 1
>                        4 0 4
>                        1 4 1 |]

We modify our solver above to take a stencil and also an Int which is used to normalise the factors in the stencil. For example, in the five point method this is 4.

> solveLaplaceStencil :: Monad m
>                        => Int
>                        -> Stencil DIM2 Double
>                        -> Int
>                        -> Array U DIM2 Double
>                        -> Array U DIM2 Double
>                        -> Array U DIM2 Double
>                        -> m (Array U DIM2 Double)
> solveLaplaceStencil !steps !st !nF !arrBoundMask !arrBoundValue !arrInit
>  = go steps arrInit
>  where
>    go 0 !arr = return arr
>    go n !arr
>      = do arr' <- relaxLaplace arr
>           go (n - 1) arr'
>
>    relaxLaplace arr
>      = computeP
>      $ R.szipWith (+) arrBoundValue
>      $ R.szipWith (*) arrBoundMask
>      $ R.smap (/ (fromIntegral nF))
>      $ mapStencil2 (BoundConst 0)
>      st arr

We can then test both methods.

> testStencil5 :: Int -> Int -> IO (Array U DIM2 Double)
> testStencil5 gridSize nIter = do
>   mask    <- boundMask gridSize gridSize
>   val     <- boundValue gridSize gridSize bndFnEg1
>   initArr <- mkInitArrM gridSize
>   solveLaplaceStencil nIter fivePoint 4 mask val initArr
ghci> testStencil5 5 178 >>= return . pPrint
  [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
  [1.0, 1.0909090909090908, 1.1818181818181817, 1.2954545454545454, 1.5, 2.0]
  [1.0, 1.1818181818181817, 1.3409090909090908, 1.5, 1.7045454545454546, 2.0]
  [1.0, 1.2954545454545454, 1.5, 1.6590909090909092, 1.8181818181818183, 2.0]
  [1.0, 1.5, 1.7045454545454546, 1.8181818181818181, 1.9090909090909092, 2.0]
  [0.0, 2.0, 2.0, 2.0, 2.0, 0.0]
> testStencil9 :: Int -> Int -> IO (Array U DIM2 Double)
> testStencil9 gridSize nIter = do
>   mask    <- boundMask gridSize gridSize
>   val     <- boundValue gridSize gridSize bndFnEg1
>   initArr <- mkInitArrM gridSize
>   solveLaplaceStencil nIter ninePoint 20 mask val initArr
ghci> testStencil9 5 178 >>= return . pPrint
  [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
  [1.0, 1.0222650172207302, 1.1436086139049304, 1.2495750646811328, 1.4069077172153264, 2.0]
  [1.0, 1.1436086139049304, 1.2964314331751594, 1.4554776038855908, 1.6710941204241017, 2.0]
  [1.0, 1.2495750646811328, 1.455477603885591, 1.614523774596022, 1.777060571200304, 2.0]
  [1.0, 1.4069077172153264, 1.671094120424102, 1.777060571200304, 1.7915504172099226, 2.0]
  [0.0, 2.0, 2.0, 2.0, 2.0, 0.0]

We note that the methods give different answers. Before explaining this, let us examine one more example where the exact solution is known.

We take the example from (Iserles 2009, chap. 8) where the boundary conditions are:

\displaystyle  \begin{aligned}  \phi(x, 0) &= 0 \\  \phi(x, 1) &= \frac{1}{(1 + x)^2 + 1} \\  \phi(0, y) &= \frac{y}{1 + y^2} \\  \phi(1, y) &= \frac{y}{4 + y^2}  \end{aligned}

This has the exact solution

\displaystyle  u(x, y) = \frac{y}{(1 + x)^2 + y^2}

And we can calculate the values of this function on a grid.

> analyticValue :: Monad m => Int -> m (Array U DIM2 Double)
> analyticValue gridSize = computeP $ fromFunction (Z :. gridSize + 1 :. gridSize + 1) f
>   where
>     f (Z :. ix :. iy) = y / ((1 + x)^2 + y^2)
>       where
>         y = fromIntegral iy / fromIntegral gridSize
>         x = fromIntegral ix / fromIntegral gridSize

Let us also solve it using the Jacobi method with a five point stencil and a nine point stencil. Here is the encoding of the boundary values.

> bndFnEg3 :: Int -> Int -> (Int, Int) -> Double
> bndFnEg3 _ m (0, j) |           j >= 0 && j <  m = y / (1 + y^2)
>   where y = (fromIntegral j) / (fromIntegral m)
> bndFnEg3 n m (i, j) | i == n && j >  0 && j <= m = y / (4 + y^2)
>   where y = fromIntegral j / fromIntegral m
> bndFnEg3 n _ (i, 0) |           i >  0 && i <= n = 0.0
> bndFnEg3 n m (i, j) | j == m && i >= 0 && i <  n = 1 / ((1 + x)^2 + 1)
>   where x = fromIntegral i / fromIntegral n
> bndFnEg3 _ _ _                                   = 0.0

We create a function to run a solver.

> runSolver ::
>   Monad m =>
>   Int ->
>   Int ->
>   (Int -> Int -> (Int, Int) -> Double) ->
>   (Int ->
>    Array U DIM2 Double ->
>    Array U DIM2 Double ->
>    Array U DIM2 Double ->
>    m (Array U DIM2 Double)) ->
>   m (Array U DIM2 Double)
> runSolver nGrid nIter boundaryFn solver = do
>   mask    <- boundMask nGrid nGrid
>   val     <- boundValue nGrid nGrid boundaryFn
>   initArr <- mkInitArrM nGrid
>   solver nIter mask val initArr

And put the five point and nine point solvers in the appropriate form.

> s5, s9 :: Monad m =>
>           Int ->
>           Array U DIM2 Double ->
>           Array U DIM2 Double ->
>           Array U DIM2 Double ->
>           m (Array U DIM2 Double)
> s5 n = solveLaplaceStencil n fivePoint 4
> s9 n = solveLaplaceStencil n ninePoint 20

And now we can see that the errors between the analytic solution and the five point method with a grid size of 8 are \cal{O}(10^{-4}).

ghci> liftA2 (-^) (analyticValue 7) (runSolver 7 200 bndFnEg3 s5) >>= return . pPrint
  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  [0.0, -3.659746856576884e-4, -5.792613003869074e-4, -5.919333582729558e-4, -4.617020226472812e-4, -2.7983716661839075e-4, -1.1394184484148084e-4, 0.0]
  [0.0, -4.0566163490589335e-4, -6.681826442424543e-4, -7.270498771604073e-4, -6.163531890425178e-4, -4.157604876017795e-4, -1.9717865146007263e-4, 0.0]
  [0.0, -3.4678314565880775e-4, -5.873627029994999e-4, -6.676042377350699e-4, -5.987527967581119e-4, -4.318102416048242e-4, -2.2116263241278578e-4, 0.0]
  [0.0, -2.635436147627873e-4, -4.55055831294085e-4, -5.329636937312088e-4, -4.965786933938399e-4, -3.7401874422060555e-4, -2.0043638973538114e-4, 0.0]
  [0.0, -1.7773949138776696e-4, -3.1086347862371855e-4, -3.714478154303591e-4, -3.5502855035249303e-4, -2.7528200465845587e-4, -1.5207424182367424e-4, 0.0]
  [0.0, -9.188482657347674e-5, -1.6196970595228066e-4, -1.9595925291693295e-4, -1.903987061394885e-4, -1.5064155667735002e-4, -8.533752030373543e-5, 0.0]
  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

But using the nine point method significantly improves this.

ghci> liftA2 (-^) (analyticValue 7) (runSolver 7 200 bndFnEg3 s9) >>= return . pPrint
  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  [0.0, -2.7700522166329566e-7, -2.536751151638317e-7, -5.5431452705700934e-8, 7.393573120406671e-8, 8.403487600228132e-8, 4.188249685954659e-8, 0.0]
  [0.0, -2.0141002235463112e-7, -2.214645128950643e-7, -9.753369634157849e-8, 2.1887763435035623e-8, 6.305346988977334e-8, 4.3482495659663556e-8, 0.0]
  [0.0, -1.207601019737048e-7, -1.502713803391842e-7, -9.16850228516175e-8, -1.4654435886995998e-8, 2.732932558036083e-8, 2.6830928867571657e-8, 0.0]
  [0.0, -6.883445567013036e-8, -9.337114890983766e-8, -6.911451747027009e-8, -2.6104150896433254e-8, 4.667329939200826e-9, 1.1717137371469732e-8, 0.0]
  [0.0, -3.737430460254432e-8, -5.374955715231611e-8, -4.483740087546373e-8, -2.299792309368165e-8, -4.122571728437663e-9, 3.330287268177301e-9, 0.0]
  [0.0, -1.6802381437586167e-8, -2.5009212159532446e-8, -2.229028683853329e-8, -1.3101905282919546e-8, -4.1197137368165215e-9, 3.909041701444238e-10, 0.0]
  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Bibliography

Iserles, A. 2009. A First Course in the Numerical Analysis of Differential Equations. A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press. http://books.google.co.uk/books?id=M0tkw4oUucoC.

Lippmeier, Ben, and Gabriele Keller. 2011. “Efficient Parallel Stencil Convolution in Haskell.” In Proceedings of the 4th ACM Symposium on Haskell, 59–70. Haskell ’11. New York, NY, USA: ACM. doi:10.1145/2034675.2034684. http://doi.acm.org/10.1145/2034675.2034684.